We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to join our team. As a Sr. SRE, you will play a pivotal role in ensuring the reliability, scalability, and performance of our cloud-based infrastructure.
You will collaborate closely with development, operations, and other teams to implement and maintain efficient and resilient systems.
Responsibilities:
- Infrastructure Automation: Develop, deploy, and oversee Infrastructure as Code (IaC) solutions using tools such as Terraform and Ansible to automate provisioning, configuration, and deployment processes.
- Cloud Platform Expertise: Deep understanding of AWS cloud services, including EC2, S3, VPC, RDS, EKS, ECS, CF, and more. Experience with serverless architecture and AWS Lambda functions is a plus.
- Containerization and Orchestration: Proficiency in containerization technologies (Docker) and orchestration platforms (Kubernetes/K8s), with experience deploying applications using tools like Helm.
- CI/CD Pipelines: Build and maintain robust CI/CD pipelines using tools like Jenkins.
- Monitoring and Alerting: Implement comprehensive monitoring and alerting solutions using tools like ELK, Datadog, CloudWatch, and Grafana to proactively identify and resolve issues.
- Incident Management: Drive incident response processes, troubleshoot complex issues, and perform Root Cause Analysis (RCA) with Corrective and Preventive Actions (CAPA) to prevent future occurrences.
- Performance Tuning: Continuously optimize system performance, identify bottlenecks, and implement strategies to improve scalability and efficiency.
- Cost Optimization: Identify and implement strategies to reduce cloud costs while maintaining performance and reliability.
- Security Best Practices: Adhere to security best practices and implement measures to protect infrastructure and data from vulnerabilities and threats.
- Collaboration and Communication: Work effectively with cross-functional teams to understand business requirements and provide technical guidance.
- SOP Documentation: Create and maintain documentation for infrastructure processes and incident management protocols.
Remote Work:
No
Employment Type:
Full-time