Position: Site Reliability Engineer
Location: 100% Remote
Duration: 12-Month Contract
Interview: Video
Key Areas of Responsibility:
- This is a strategic and hands-on position where you will work closely with cross-functional teams to identify potential issues and provide innovative insights to optimize system performance, stability, and availability.
- Guide cross-functional teams to manage and support their PagerDuty alerts, teams, schedules, escalation policies, and automations.
- Automate alerting and remediation processes to reduce mean time to resolution (MTTR) and improve system uptime.
- Monitor server, network, infrastructure, and application performance metrics, and identify patterns and trends to improve system performance and reliability.
- Troubleshoot issues and outages, working closely with development and operations teams to identify root causes and develop solutions.
- Collaborate with cross-functional teams to support incident management, change management, and problem management processes.
- Proactively detect and prevent future problems/incidents and initiate the Problem Management process to allow quicker diagnosis and resolution.
- Develop trend analysis and prepare service improvement plans to address identified gaps.
- Build strong relationships with key stakeholders, including senior management, department heads, and external partners, to ensure their support and engagement in incident management initiatives.
- Foster a culture of continuous improvement, staying abreast of industry trends, emerging technologies, and best practices to enhance incident management capabilities.
- Create dashboards and reports to provide insights into operational performance and health.
- Build automation to optimize processes and workflows within our on-call systems and monitoring platforms.
- Complete any assigned project work or tasks with a view to improving existing processes and capabilities, and seek out automation opportunities.
- Support the on-call rotation and provide off-hours support as required.
Minimum Qualifications:
- Bachelor's degree in IT, Business Management, or a related discipline preferred.
- 5 years of direct experience working in the observability, operations, or DevOps domains.
- Proficient in observability, monitoring, PagerDuty, and logging tools such as Datadog, Dynatrace, Power BI, etc.
- 3 years of technical experience: systems engineering, SRE, DevOps, software engineering
Other Required Qualifications:
- Excellent written and verbal communication skills, with the ability to communicate effectively with all stakeholders, including senior leadership.
- Strong ability to understand, accurately translate, and produce technical information for a general and business audience.
- Strong experience with change, incident, and problem management principles, methodologies, and tools.
- Experience using configuration and change tools such as ServiceNow Change and CMDB, or related tools.
- Experience with project delivery methodologies (Agile, Scrum).
- Hands-on experience with monitoring and performance monitoring tools: Datadog, Dynatrace, Splunk, etc.
Preferred Qualifications:
- ITIL v3 Foundation certification
- Certification in Project Management
- Experience implementing continuous process improvements within a configuration, change, release, or asset management program
- Cloud certifications (Azure, AWS, GCP)
- Direct experience scripting in two of the following languages: Python, PowerShell, Bash.
- Proficient in technical and business writing