As an SRE Manager, you will focus on ensuring the reliability, performance, and scalability of services and infrastructure.
Reporting to the Head of Engineering, you will be part of the Product & Technology team and will actively participate in all aspects of Site Reliability Engineering, including technical vision, telemetry and observability decisions, automation strategy, solution delivery, and platform incident and problem management. This is a leadership role with both technical and people leadership responsibilities. As such, this role participates in short- and long-term systems, team, and organizational planning. This position reports directly to the Director of Engineering.
What you will do
- Provide technical and people leadership to the SRE teams by facilitating one-on-one, team, and performance review meetings.
- Fulfil the role of Escalation Manager/Critical Incident Manager on critical/major incidents by facilitating quick and effective incident resolution to minimize player and business impact.
- Conduct root cause analysis (RCA) and Post-Incident Reviews (PIRs) in a blameless manner to identify root causes and prevent recurrence.
- Build advanced Incident Management and Problem Management support (SOPs and runbooks) to effectively identify, remediate, and resolve issues related to platform reliability, stability, and performance through careful analysis of telemetry data and system logs.
- Continuously improve problem identification and service restoration of platforms by leading and overseeing efforts to define, enhance, and deliver automated alerting and response systems with intelligent self-healing capabilities.
- Collaborate with platform engineers on implementation decisions to achieve highly reliable infrastructure, systems, and integrations (develop synthetic monitoring, health dashboards, reliable alerts, and system performance).
- Promote automation (CI/CD) and infrastructure-as-code (IaC) practices; develop tools and processes for seamless deployments, rollbacks, monitoring, and troubleshooting.
- Define and ensure proper reviews are built to minimise the Mean Time to Recover/Discover (MTTR/MTTD) and maximise the Mean Time to Failure (MTTF).
- Work with development teams to set error budgets, SLIs/SLOs, and policies. Work with SREs to implement alerts and policies that minimize the impact of failures and outages on players.
Qualifications:
- Graduate or postgraduate degree with a strong engineering background.
- 10 years of experience working in global organizations, with the ability to communicate effectively with executives, leaders, and individual contributors across the organization.
- 5 years of SRE experience working with telemetry, observability, self-healing solutions, and platform automation.
- Proficient in analysing complex technical issues, identifying root causes, and implementing effective solutions under pressure.
- Experience with monitoring, logging, and telemetry tools such as New Relic, Splunk, ELK, Nagios, Prometheus, AWS CloudWatch, Datadog, etc.
- Experience in Disaster Recovery and Chaos Engineering with tools like Chaos Mesh and Chaos Monkey, and in periodically testing resiliency and failovers.
- Hands-on experience in the monitoring of Exposure with automation and tools such as (but not limited to) GitLab CI, Jenkins, Terraform, Ansible, etc.
- Expert in designing, creating, and supporting automation (PowerShell, Python, Ruby, AWK, sed, etc.) to run health checks and self-healing capabilities for the platforms.
- Experience with networking, Content Delivery Networks (CDNs, e.g. Akamai, Cloudflare), streaming platform technologies like Apache Kafka, and databases (Oracle, MS SQL, etc.).
- Experience with cloud platforms, especially Amazon Web Services (AWS).
- Application Security: the practice of safeguarding applications through access control, authentication and authorization (AuthN and AuthZ), data encryption, and secure communication using TLS/SSL and mTLS.
- Collaboration & Change Management tools: Jira, ServiceNow, SharePoint, etc.
- Experience in managing relationships with third-party vendors and service providers contributing to the business.
Remote Work:
No
Employment Type:
Full-time