Salary Not Disclosed
1 Vacancy
A reputable IT provider on the Czech and international market, the company specializes in custom application development and consultancy services. Its work focuses on developing distribution portals in a cloud environment (including the implementation of modern technologies such as GraphQL APIs), creating specific software solutions such as simulators, and developing frontend platforms for various purposes. Known for its autonomy, precise work, and deep technological knowledge, the company provides expert consultancy services and organizes regular training for its employees and contractors. The team of experts has extensive experience in various industries, allowing it to effectively support and develop client IT systems.
Contract: Full-time, hybrid (3 days onsite is a must), freelancer contract
Job Overview: The SRE will monitor, troubleshoot, and maintain our infrastructure with an emphasis on reliability, scalability, and automation. This role is ideal for candidates with foundational NOC experience who are interested in expanding their skills to include SRE practices and modern infrastructure management.
Responsibilities:
Monitoring and Incident Response:
Continuously monitor application performance, system health, and network status.
Respond swiftly to incidents, performing root cause analysis and implementing resolution strategies.
Escalate issues when necessary and lead collaboration and communication to resolve incidents swiftly.
Automated Monitoring and Alerting:
Use and configure monitoring tools (e.g. Prometheus, Grafana, Coralogix, Splunk) to improve visibility into system performance.
Develop and refine alerting rules to reduce noise and improve incident detection.
Troubleshooting and System Maintenance:
Perform initial troubleshooting and diagnostics across the application, infrastructure, and network layers.
Work with Developers and DevOps to implement fixes, validate configurations, and ensure systems are resilient to future incidents.
Operational Automation:
Automate repetitive tasks such as alert handling, system checks, and routine maintenance.
Use scripting (e.g. Python, Bash) and Infrastructure as Code (IaC) tools (e.g. Terraform, Crossplane) to improve operational efficiency.
Documentation and Knowledge Sharing:
Document processes, incidents, and troubleshooting steps, maintaining a knowledge base for common issues.
Contribute to runbooks for automated troubleshooting, and escalate complex issues to SREs or other technical teams.
Continuous Improvement:
Analyze incidents and recurring issues to identify areas for improvement in system reliability and automation.
Lead post-incident reviews and contribute insights for future preventive actions.
Benefits:
Opportunities for further education
Flexibility in employment status: possibility to work as a permanent employee or as a contractor
Flexible working hours
Option for home office
Office located in the center of Prague with excellent public transport accessibility
Company benefits and events
Great team atmosphere