Overview
The Site Reliability Engineer ensures the reliability, performance, and scalability of the organization's infrastructure and applications. The role keeps technology systems operating smoothly and efficiently, and ensures they meet the availability and performance standards required by internal and external users.
Key responsibilities
- Design and implement automation for various processes to improve efficiency and reliability
- Develop monitoring solutions to ensure the health and performance of systems
- Participate in on-call rotations and handle incident response, troubleshooting, and resolution
- Create and maintain scripts for operational tasks and automation
- Conduct capacity planning and manage the scalability of the systems
- Collaborate with development teams to improve system reliability and performance
- Deploy and maintain cloud services and infrastructure
- Define and implement service level objectives and indicators
- Ensure security best practices are followed in all aspects of infrastructure and services
- Perform system and application performance tuning and capacity forecasting
- Conduct post-incident reviews and implement preventive measures
- Participate in the design and implementation of disaster recovery plans
- Document procedures, configurations, and processes
- Contribute to the continuous improvement of processes and tools
- Stay current with industry trends and best practices
Required qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field
- Proven experience as a Site Reliability Engineer or in a similar role
- Strong understanding of software development, system administration, and networking
- Proficiency in scripting (e.g., Python, Shell, Perl)
- Experience with monitoring and alerting tools (e.g., Nagios, Datadog, Prometheus)
- Expertise in cloud services and infrastructure (e.g., AWS, GCP, Azure)
- Knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes)
- Experience with CI/CD pipelines and configuration management tools (e.g., Jenkins, Ansible)
- Solid understanding of TCP/IP, HTTP, DNS, and other network protocols
- Ability to analyze and troubleshoot complex systems and applications
- Experience with incident management and on-call responsibilities
- Familiarity with security best practices and tools
- Excellent communication and collaboration skills
- Certifications such as AWS Certified SysOps Administrator or Google Professional Cloud DevOps Engineer are a plus
- Continuous learning and self-improvement mindset