Role: Resilience Expert
Relevant Experience: 10 Years
Job Description:
Closely related to Google's definition of an SRE; however, the focus here is almost exclusively on resilience itself: specifically, how an application/product/service can be made more reliable. This work can happen before, during, or after the code for that product has been written.
- Define/create/implement standards and drive implementation of resilient design
- Understand what happens if a downstream service fails: how is our upstream response handled, and what is the customer experience (impact)?
- Define/create/implement fallback mechanisms and circuit breakers, and understand whether it is appropriate to create one at all. Define/create the logic for these circuit breakers (experience shows today's implementations may have a negative impact)
- Determine how to tackle end-to-end (E2E) resilience across a customer journey
- Define/create/implement timeout settings E2E (these have caused negative outcomes in the past)
- Participate in resolving complex operational issues E2E, identifying root causes and architectural solutions (or other improvements) to prevent recurrence
- Work closely with the architecture team and Tech Leads in the early stages of the SDLC