Overview:
The Senior Site Reliability Engineer plays a critical role in ensuring the reliability, scalability, and performance of our systems and services. They are responsible for designing and implementing tools and automated solutions to improve system reliability, monitoring, and incident response.
Key Responsibilities:
- Develop and maintain infrastructure as code using tools like Terraform and Ansible
- Implement and maintain monitoring, alerting, and reporting systems
- Collaborate with cross-functional teams to improve system reliability and performance
- Perform system capacity planning and demand forecasting
- Automate routine operational tasks and processes
- Participate in incident response and the on-call rotation
- Optimize the performance and efficiency of various systems and platforms
- Conduct system failure analysis and provide root cause analysis
- Implement and manage CI/CD pipelines
- Conduct periodic performance and security audits
- Lead efforts to improve overall system architecture
- Troubleshoot and resolve complex technical issues
- Collaborate with development teams to improve application deployment processes
- Ensure compliance with security and data protection best practices
Required Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field
- 6 years of experience in a site reliability engineering or related role
- Strong experience with Linux system administration and troubleshooting
- Proficiency in scripting and programming languages such as Python, Shell, or Go
- Experience with automation and configuration management tools like Puppet, Chef, or Ansible
- Solid understanding of networking concepts and protocols
- Expertise in cloud computing platforms such as AWS, Azure, or GCP
- Proven track record of designing and implementing scalable, reliable, and maintainable systems
- Experience with containerization and orchestration tools like Docker and Kubernetes
- Knowledge of continuous integration and continuous deployment (CI/CD) practices and tools
- Excellent problem-solving and troubleshooting skills
- Strong communication and collaboration abilities
- Relevant certifications such as AWS Certified DevOps Engineer, Certified Kubernetes Administrator, or similar
- Ability to work effectively in a fast-paced, dynamic environment
- Experience with incident management and on-call support
python,networking,ci/cd pipelines,automation tools,orchestration tools,monitoring,reliability,continuous integration,scripting languages,certified kubernetes administrator,problem-solving,terraform,on-call support,communication,capacity planning,alerting,linux system administration,ansible,cloud computing platforms,aws certified devops engineer,networking concepts,troubleshooting,continuous deployment,performance audits,containerization,cloud computing,security audits,automation,linux,incident management