Job Title: Site Reliability Engineer
Location: Bellevue, WA / Frisco, TX / Atlanta, GA / Overland Park, KS (Hybrid)
Term: Contract
Job Description:
Key Responsibilities:
- Kubernetes Management: Deploy, manage, and optimize Kubernetes clusters in production and staging environments, ensuring high availability and efficient resource utilization.
- AWS Infrastructure: Leverage AWS cloud services (EC2, S3, RDS, EKS, Lambda, etc.) to build, manage, and scale cloud-native infrastructure.
- Automation & Infrastructure as Code: Develop and maintain automated workflows using Infrastructure as Code (IaC) tools like Terraform, CloudFormation, or Ansible to provision, configure, and manage cloud infrastructure.
- CI/CD Pipeline Support: Build, optimize, and maintain CI/CD pipelines to enable seamless code delivery and deployments using tools like Jenkins, GitLab CI, or CircleCI.
- Monitoring & Observability: Implement and maintain monitoring, alerting, and logging solutions using tools such as Prometheus, Grafana, CloudWatch, or the ELK stack to ensure system health and availability.
- Incident Response: Lead and support incident response efforts, conduct root cause analysis, and implement post-incident reviews to improve system resilience.
- Performance Optimization: Identify and resolve performance bottlenecks, improve system efficiency, and ensure applications and infrastructure are optimized for both cost and performance.
- Security & Compliance: Work with security teams to implement best practices for securing Kubernetes clusters, AWS resources, and platform infrastructure, including access controls, network policies, and encryption.
- Collaboration & Documentation: Work closely with development, DevOps, and infrastructure teams to align on best practices, improve automation, and document procedures for infrastructure management and troubleshooting.
Required Qualifications:
- Kubernetes Expertise: Strong expertise in managing and scaling Kubernetes clusters, including experience with Kubernetes networking, storage, and multi-cluster architectures.
- AWS Cloud Expertise: Proficiency with AWS services such as EC2, S3, EKS, RDS, VPC, Lambda, IAM, CloudWatch, and others. Experience with AWS best practices for scalability, security, and cost management.
- Infrastructure as Code (IaC): Hands-on experience with IaC tools such as Terraform, AWS CloudFormation, or Ansible for provisioning and managing cloud infrastructure.
- CI/CD Pipelines: Experience building and maintaining continuous integration and continuous deployment (CI/CD) pipelines using Jenkins, GitLab CI, or similar tools.
- Scripting & Automation: Proficiency in scripting languages such as Python, Bash, or Go to automate operational tasks and improve workflows.
- Monitoring & Logging: Experience with monitoring, logging, and alerting tools like Prometheus, Grafana, CloudWatch, the ELK stack, or similar tools.
- Troubleshooting & Incident Management: Ability to troubleshoot complex issues in distributed systems, conduct root cause analysis, and implement solutions to prevent recurrence.
- Collaboration Skills: Strong communication skills with the ability to work collaboratively with developers, operations, and product teams.
Key Skills:
Kubernetes, AWS Cloud, Reliability Engineer, CI/CD, IaC