Approach operations challenges from a software engineering perspective, leveraging:
- Coding, automation, and engineering principles.
- Monitor and appropriately address system issues.
- Create strategies to detect issues.
- Design systems to troubleshoot automatically.
- Write and review postmortems.
- Collaborate with development teams and other stakeholders to identify potential risks.
- Once risks are identified, analyze and evaluate their potential impact and likelihood of occurrence.
- Based on the risk assessment, implement risk mitigation strategies to reduce operational risk.
- Continuously monitor and review the effectiveness of these risk strategies.
- Study historical performance trends using metrics and visualizations such as charts and graphs.
- Trace problems with system monitoring tools.
- Monitor log files to manage infrastructure at scale.
- Minimize MTTR to reduce downtime and keep systems reliable.
- As an SRE, you can improve this metric by resolving incidents quickly.
- Maintain internal tooling.
- Monitoring system performance, identifying bottlenecks, and executing pipeline optimization.
- Implementing comprehensive service metrics to track and report on system reliability, performance, and efficiency.
- Developing and maintaining CI/CD pipelines, enhancing the consistency and speed of software deployment.
- Automating routine tasks and creating tools to improve team efficiency and system robustness.
- Collaborating with development teams to integrate operational considerations into the software development life cycle.
- Managing incident response protocols, including on-call rotations for junior engineers and strategic planning for senior personnel.
- Conducting post-incident reviews to prevent recurrence and refine the system reliability framework.
- Contributing to disaster recovery plans and ensuring robust backup systems are in place.
- Partner with development teams to improve services through rigorous testing and release procedures.
- Participate in system design consulting, platform management, and capacity planning.
- Create sustainable systems and services through automation and uplifts.
- Balance feature development speed and reliability with well-defined service-level objectives.
- Working on-call shifts to prevent incidents from ever happening.
- Running our infrastructure with Ansible, Terraform, GitLab CI/CD, and Kubernetes.
Qualifications :
- Experience using Linux, UNIX, and Windows
- DB administration & maintenance: Oracle, Cassandra, PostgreSQL, AWS DB setups, caching DBs.
- Familiar with: Git, Jira, Jenkins, Ansible
- Strong knowledge of DevOps and CI/CD pipelines (GitHub, Terraform)
- Knowledge of monitoring solutions: Grafana, Prometheus, Dynatrace
- Hands-on AWS implementation experience across a broad range of AWS services.
- Must have AWS development experience (containerization, Docker, Amazon EKS, Lambda, EC2, S3, Amazon DocumentDB, PostgreSQL)
- Experience with core AWS platform architecture, including areas such as Organizations, account design, VPC, and subnet segmentation strategies.
- Comfortable working with cloud-native infrastructure such as AWS Lambda, Google App Engine, and Azure Cloud Services.
- Backup and Disaster Recovery approach and design
- Environment and application automation
- Proficiency in programming languages such as Python, Go, or Java
- Familiar with encryption, logging, and privacy/security protocols (e.g. TLS 1.2, ELK stack)
- Good knowledge of REST/SOAP/JSON web service API implementation.
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- Relevant industry certifications, such as the Site Reliability Engineering (SRE) Foundation certification.
- Strong understanding of cloud-based applications and infrastructure, including AWS, Azure, or Google Cloud.
- Experience with IT operations best practices such as ITIL, COBIT, or DevOps.
- Experience with IT service management tools such as ServiceNow or Remedy.
- Familiarity with banking customer acquisition applications is preferred.
Additional Information :
Benefits:
- Full access to a foreign language learning platform
- Personalized access to tech learning platforms
- Tailored workshops and trainings to sustain your growth
- Medical subscription
- Meal tickets
- Monthly budget to allocate on a flexible benefits platform
- Access to 7 Card services
- Wellbeing activities and gatherings
Hybrid: 1-2 days/week from the office (Bucharest)
Remote Work :
No
Employment Type :
Full-time