JOB OVERVIEW
We are looking for a detail-oriented and experienced Site Reliability Engineer to join our team. The Site Reliability Engineer will be responsible for creating and implementing scalable software solutions to meet system and application performance goals. You will also be responsible for troubleshooting system errors and resolving any related issues.
ROLES AND RESPONSIBILITIES
System Monitoring and Incident Response: SREs are responsible for implementing monitoring solutions to track
system health, performance, and availability. They proactively monitor systems, identify issues, and respond
to incidents promptly, working to minimize downtime and mitigate impact.
Post-Incident Analysis: SREs lead incident response efforts, coordinate with cross-functional teams, and
conduct post-incident analysis to identify root causes and implement preventive measures.
Continuous Improvement and Reliability Engineering: SREs drive continuous improvement efforts by
identifying areas for enhancement, implementing best practices, and fostering a culture of reliability
engineering. They participate in postmortems, conduct blameless retrospectives, and drive initiatives to
improve system reliability, stability, and maintainability.
Collaboration and Knowledge Sharing: SREs collaborate closely with software engineers, operations teams,
and other stakeholders to ensure smooth coordination and effective communication. They share knowledge,
provide technical guidance, and contribute to the development of a strong engineering culture.
Support and maintain configuration management for various applications and systems
Implement comprehensive service monitoring, including dashboards, metrics, and alerts
Define, measure, and meet key service level objectives such as uptime, performance, incidents, and chronic
problems
Partner with application and business stakeholders to ensure high-quality product development and release
Collaborate with the development team to enhance system reliability and performance.
QUALIFICATIONS:
Bachelor's degree in Information Technology, Computer Science, or a related field.
Strong knowledge of software development processes and procedures.
Strong problem-solving abilities.
Excellent understanding of computer systems, servers, and network systems.
Ability to work under pressure and manage multiple tasks simultaneously.
Strong communication and interpersonal skills.
Strong knowledge of coding languages such as Python, Java, Go, etc.
Ability to program (structured and OOP) using one or more high-level languages such as Python, Java, C/C++,
Ruby, and JavaScript
Experience with distributed storage technologies such as NFS, HDFS, Ceph, and Amazon S3, as well as dynamic
resource management frameworks (Apache Mesos, Kubernetes, YARN)
JOB DESCRIPTION
Experience with cloud computing platforms such as AWS, Azure, or Google Cloud
Experience with DevOps tools such as Git, Jenkins, Ansible, Terraform, Docker, etc.
Experience with monitoring tools such as Splunk and Prometheus
devops practices,reliability,reliability engineering,continuous improvement,system monitoring,cloud infrastructure,incident response,post-incident analysis,configuration management,kubernetes,key service level objectives,programming (python, java, go, c/c++, ruby, javascript),devops tools (git, jenkins, ansible, terraform, docker),devops,monitoring services,monitoring tools (splunk, prometheus),aws,service monitoring,splunk,prometheus,ansible,cloud computing (aws, azure, google cloud)