Role: Site Reliability Engineer
Location: Boston, MA
Duration: Long-term contract
As a Site Reliability Engineer, you will be responsible for conducting Root Cause Analysis meetings and fostering a
blame-free environment so that comprehensive information about events and their resolutions is gathered
effectively. This role requires the ability to navigate complex technical issues while promoting open and transparent
discussions among team members. You will use trends and metrics to identify improvement opportunities within
existing frameworks, tools, and processes in order to improve systems continuously.
Responsibilities:
- You will be part of the SRE team, which focuses on Root Cause Analysis of critical production outages to improve resiliency.
- Lead problem tickets and improvements to major software components, systems, and features to improve the availability, scalability, latency, and efficiency of client systems.
- Engage in and improve the service lifecycle, from inception and design to deployment, operation, and refinement, based on lessons learned through deep dives.
- Hands-on troubleshooting of VMware, Kubernetes, and system software functionality, performance, and configuration issues.
- Be a trusted technical advisor who leads complex root cause analysis investigations from start to finish, through to the implementation of improvements.
- Demonstrate sound knowledge of gathering logs and facilitating root cause analysis with cross-functional teams.
- Assist internal teams with corrective actions and improvement tickets, and influence their completion goals.
- Flexibility to work occasional out-of-hours shifts, including weekends, may be required depending on criticality and workload demands.
Qualifications:
- Bachelor's degree in software engineering, information systems, computer science, or a related field.
- 10 years of experience working with ITSM tools such as Jira and ServiceNow.
- 8 years of infrastructure engineering experience, with a record demonstrating hands-on troubleshooting in large-scale solutions, on-prem distributed systems, and custom-developed software applications.
- 8 years of experience operating production systems, including troubleshooting, testing, and automation.
- 5 years of experience leading technical Root Cause Analysis (a software focus is a plus).
- Team player with excellent communication skills and the ability to prioritize multiple tasks.
- Experience with executive communication, report writing, and presenting to non-technical audiences.
- Strong technical background in container technologies such as Kubernetes; detail-driven, with excellent problem-solving abilities.
- Experience in the advanced use of tools like Prometheus, Grafana, LogicMonitor, Elastic, and Power BI is a plus.