Job Summary:
We are seeking a highly skilled Site Reliability Engineer (SRE) with experience in building and managing EKS (Elastic Kubernetes Service) environments. The ideal candidate will be responsible for designing, deploying, and maintaining reliable systems while supporting our DevOps practices. A background in observability tools such as ELK (Elasticsearch, Logstash, Kibana) and Grafana is highly preferred.
Key Responsibilities:
- EKS Build and Run:
  - Design, implement, and manage EKS clusters to ensure high availability and scalability.
  - Automate provisioning, deployment, and scaling of EKS environments.
  - Monitor and maintain the health and performance of Kubernetes workloads in EKS.
- Site Reliability Engineering:
  - Enhance system reliability through the development of monitoring, automation, and fault-tolerant solutions.
  - Build tools and automation to streamline infrastructure management and operational tasks.
  - Respond to incidents, troubleshoot performance issues, and conduct root cause analysis.
- DevOps Collaboration:
  - Support CI/CD pipelines, including integrating EKS into the DevOps lifecycle.
  - Collaborate with development teams to deliver infrastructure as code (IaC) and automate deployments.
- Observability & Monitoring:
  - Implement and optimize observability solutions using tools like ELK Stack and Grafana.
  - Establish robust logging, monitoring, and alerting frameworks to improve system transparency and uptime.
Required Skills & Experience:
- Kubernetes/EKS Expertise: Strong experience in deploying and managing Kubernetes clusters, specifically on AWS EKS.
- Cloud Platforms: Advanced knowledge of AWS services and infrastructure.
- DevOps Tools: Familiarity with DevOps practices and tools like Terraform, Ansible, Jenkins, or GitLab CI/CD.
- Observability: Hands-on experience with ELK Stack (Elasticsearch, Logstash, Kibana) and Grafana.
- Automation & Scripting: Proficiency in scripting languages (e.g., Python, Bash) and automation frameworks.
- System Administration: Solid understanding of Linux/Unix systems and networking.
Preferred Qualifications:
- Background in building observability pipelines and frameworks.
- Experience with Prometheus, Loki, or other observability tools is a plus.
- Certification in AWS (e.g., AWS Certified Solutions Architect or DevOps Engineer) is an advantage.
Soft Skills:
- Excellent problem-solving and troubleshooting skills.
- Strong communication and teamwork abilities.
- A proactive approach to learning and adopting new technologies.