As a Site Reliability Engineer, you will:
- Ensure the reliability, availability, and performance of services through effective monitoring, observability, and incident response.
- Utilize your expertise in the ELK Stack (Elasticsearch, Logstash, Kibana) for troubleshooting, logging, and optimizing system performance.
- Collaborate closely with development teams to enhance operational efficiency and streamline incident management processes.
- Drive service reliability improvements, including proactive monitoring, root cause analysis, and capacity planning.
What You Bring to the Table:
- Over 10 years of experience in Site Reliability Engineering (SRE), specifically in service reliability and observability.
- Extensive experience working with the ELK Stack, including configuring, maintaining, and optimizing ELK services.
- Strong understanding of the principles of SRE, service uptime, incident management, and system performance tuning.
- Expertise in managing large-scale, distributed systems and ensuring high availability in production environments.
You should possess the ability to:
- Develop and implement robust monitoring and alerting systems using the ELK Stack.
- Perform thorough service reliability assessments and take proactive actions to improve service availability.
- Troubleshoot complex systems and identify bottlenecks or performance issues quickly.
- Collaborate with cross-functional teams to design and deploy resilient and scalable infrastructure.
What we bring to the table:
- An opportunity to work in a dynamic, cutting-edge environment where your contributions directly impact system reliability.
- A collaborative team environment, fostering learning, growth, and technical innovation.
- The chance to work with advanced tools and technologies to enhance the operational excellence of large-scale services.