Tasks
End-to-End Engineering Leadership: Oversee the design and implementation of resilient engineering across the technology domains.
Cloud and On-Premises Infrastructure Expertise: Design and review resilient solutions in both cloud-based and on-premises environments.
Chaos Engineering Infrastructure Initiatives: Lead chaos engineering efforts to proactively identify and mitigate potential system weaknesses.
Standards for Monitoring and Alerting: Collaborate with teams to evolve existing standards for system monitoring and alerting to ensure rapid detection and response.
Resiliency Architecture Reviews: Represent the IT Resiliency Office during the Architectural Review Board.
Enterprise-wide Collaboration and Stakeholder Management: Collaborate with various teams across the organization to align and prioritize resiliency and recovery efforts.
Automation: Expertise with IaC and tools such as Ansible.
Incident Response and Recovery: Integrate with the post-mortem process following major incidents to identify opportunities for enhancing resiliency.
Development: Evangelize standards and practices across the Technology organization to strengthen our resiliency posture.
Reporting and Documentation: Develop standardized, regular reporting on resilience activities, risks, and improvements for the Leadership team.
Requirements
Qualifications:
- Bachelor's degree or equivalent experience.
- 5-10 years' experience in platform engineering, with a focus on IaC, DevOps practices, and orchestration tools.
- Preferred but not required: experience as a team lead or in a hands-on technical manager role, engaging with and delivering projects to completion.
- A track record of successfully architecting and deploying enterprise-level solutions that prioritize system uptime and data integrity across various operational scenarios.
- Demonstrated ability to design and implement systems that ensure high availability, support massive transaction volumes, and facilitate seamless disaster recovery processes.
- Infrastructure and service architecture and engineering experience, including functional and technical requirements gathering and solution development.
- Strong dedication to customer needs, with excellent communication and the ability to build lasting relationships, alongside the capability to articulate complex resilience strategies in a clear and impactful manner.
- Deep insight into the complexities of multi-AZ and multi-Region cloud platforms, with a keen understanding of how these impact system resilience and disaster recovery planning.
- Proven experience in the ongoing management of mission-critical systems that require constant uptime, including out-of-hours support and rapid response to incidents.
- Knowledgeable in evaluating and deciding on trade-offs between consistency, availability, and partition tolerance, especially in the context of system failures and recovery strategies.
- Well-versed in various cloud service models such as SaaS, PaaS, and IaaS, with hands-on experience in designing resilient services on leading public cloud platforms.
- Proficient in Chaos Engineering principles and practices, with experience in designing and conducting experiments to validate the system's capability to withstand turbulent conditions.
- Skilled in implementing observability solutions that provide real-time insights into the performance and health of systems, aiding in proactive issue detection and resolution.
- Practical experience operating in an Agile development environment.