Lead Site Reliability Engineer
Remote
1. SRE Implementations: Look for candidates who have experience implementing SRE principles, including the establishment of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets to ensure system reliability and availability.
2. Observability: Search for keywords related to observability, including familiarity with concepts such as full-stack observability and distributed tracing.
3. Tool Proficiency: Datadog, CloudWatch, and synthetic monitoring tools.
4. Building SRE Culture: Evaluate candidates based on their ability to develop SRE frameworks within organizations, such as creating SRE charters and fostering a culture of reliability and accountability across teams.
5. Automation: Look for candidates with extensive experience in automation, including the automation of repetitive tasks, infrastructure provisioning, and deployment processes to streamline operations and enhance efficiency.
6. Chaos Engineering: Consider candidates who have experience in Chaos Engineering practices and related tools, demonstrating their ability to proactively identify system weaknesses and improve resilience through controlled experiments.
Job Details:
Lead and mentor a team of SREs to ensure operational excellence and maximize the reliability and availability of client systems.
Minimum 10 years of work experience in DevOps/SRE, including leadership roles.
Architect and design highly scalable and available infrastructure solutions, integrating best practices in reliability engineering and automation.
Collaborate with cross-functional teams (DevOps, Development, IT) to implement SRE principles throughout the software development life cycle.
Establish and manage Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical services, monitoring and maintaining performance against defined targets.
Implement and enhance observability, alerting, and incident response processes to proactively address issues and minimize downtime.
Drive continuous improvement initiatives, identifying bottlenecks and optimizing the infrastructure and application stack.
Develop and maintain documentation related to system architecture, configuration, and procedures.
Stay current with industry trends, recommending and adopting new tools and practices to enhance system reliability.
Qualifications:
Strong background in designing and implementing highly available and scalable infrastructure.
Proficiency in scripting and automation using Python or Shell.
Experience with container orchestration platforms, serverless architectures, CI/CD pipelines, and IaC implementations (Ansible and Terraform).
Experience with observability tools (preferred: Datadog, CloudWatch).
In-depth knowledge of cloud computing platforms (preferred: AWS).
Solid understanding of SRE/DevOps principles and practices.
Excellent problem-solving skills, with the ability to troubleshoot complex issues in production environments.
Strong communication and leadership skills, fostering effective collaboration with cross-functional teams.
Relevant certifications in SRE, DevOps, Cloud, etc. are a plus.
Full Time