Only GC / USC / GC EAD
Site Reliability Engineer (SRE)
Job Description
Key Responsibilities:
At least 12 years of experience defining and implementing monitoring solutions, alerts, telemetry, and instrumentation for on-premises and cloud platforms for large enterprises.
The Site Reliability Engineer will play a key role in building observability and resilience capabilities on the cloud platform (Azure). Responsibilities of the SRE:
Build and configure the alerts, tracing, telemetry, and instrumentation required for infrastructure monitoring and Application Performance Management.
Implement dashboards to monitor and share observability at various levels (engineering teams, portfolio, senior management).
Support resilience engineering (application and infrastructure resilience) to meet availability requirements.
Work with development engineers, cloud engineers, product teams, and support engineers to gather requirements and to implement and evolve observability and resilience solutions.
Key Skillsets:
Good knowledge of observability and Application Performance Monitoring best practices and KPIs/metrics on cloud platforms
Experience with monitoring tools such as Splunk, Dynatrace, Prometheus, CloudWatch, Azure Monitor, New Relic, and other open-source tools
Experience building monitoring solutions for a variety of workloads such as microservices (Java / Spring Boot desirable), databases, Kafka, and Kubernetes
Experience in resilience engineering and implementing high availability solutions
Experience creating monitoring dashboards using tools such as Grafana (preferred), Splunk, Kibana, and Power BI
Ability to work in a fast-paced, agile environment
SRE Maturity Level 3 (Expectation)
DevOps Observability
o DORA Metrics are visible
Deployment frequency, Mean Time To Restore (MTTR), cycle time, change failure rate
IaC (Infrastructure as Code)
o Platforms leverage IaC
Test / Release automation
o Unit tests
Test in a vacuum
o Integration tests
o Load test results validated against SLOs
o Test run as part of CI/CD pipeline
o Automated rollback
o Business Continuity Plan for Recovering Service(s)
Capacity planning review
o Show saturation of service as compared to load test and production peak load
Product Management (Security)
o Security scanning
o Documented procedures for Vulnerability Management
o Integrated into CI/CD pipeline (partner with security)
SRE Maturity Level 4 (Advanced)
Modernized application
o Deployment to Kubernetes, Azure, or SaaS via CI/CD pipeline
Synthetic Monitoring
Canary / Blue Green Deployment
Self-Healing
Auto scaling
Identify KPIs for business performance
Chaos Engineering
Enterprise Process Tie-Ins
Problem management, as part of RCA, will review the maturity level of the incident owner
Full Time