Role Overview:
As a Cloud Site Reliability Engineer, you will collaborate with a dynamic team of operations engineers and software developers to maintain and enhance the reliability, scalability, and performance of our cloud-based products and solutions. You will play a pivotal role in ensuring seamless operations, implementing robust solutions, and proactively addressing challenges in a fast-paced, innovative environment.
Key Responsibilities:
- Analyze existing infrastructure and propose scalable, efficient solutions, driving their implementation with the team.
- Lead incident management efforts, including root cause analysis, issue resolution, and implementation of preventative measures.
- Develop strategies to improve Mean Time Between Failures (MTBF) and reduce Mean Time to Recovery (MTTR).
- Evaluate the entire application stack, identifying and resolving failures, defects, and performance bottlenecks.
- Optimize and automate operational procedures to enhance system efficiency.
- Maintain and enhance monitoring systems for comprehensive visibility into deployed environments.
- Investigate and address performance degradations, collaborating with teams to deliver sustainable solutions.
- Research emerging technologies and trends, advocating for the adoption of cutting-edge tools and practices.
- Drive improvements in system reliability, scalability, and performance by influencing architecture and design decisions.
- Design and execute automated tests to validate infrastructure and software reliability.
- Troubleshoot and resolve issues spanning infrastructure, software, application, and networking layers.
Required Qualifications:
- 3 years of experience as a Site Reliability Engineer or in a similar role.
- 2 years of hands-on experience with AWS services; an AWS certification (Solutions Architect or DevOps Engineer) is mandatory.
- Proven experience as a senior member of an infrastructure or software engineering team.
- Proficiency with AWS services such as EC2, RDS, Lambda, CloudFront, ELB, and API Gateway.
- Solid experience in IT infrastructure, including Linux/Unix and Windows systems, networking, and firewall concepts.
- Expertise with CI/CD tools like Jenkins and TeamCity, and source control systems like Bitbucket.
- Advanced scripting skills, with Python as the preferred language.
- Strong understanding of system reliability, performance tuning, and scalability.
- Familiarity with big data technologies like Spark, Hadoop, and Scala is advantageous.
- Sound knowledge of cloud-native services, network technologies, and fault-tolerant system design.
- Proficiency with RDBMS and cloud database engines (e.g., PostgreSQL, MySQL).
- Experience with clusters, load balancers, and CDN technologies.
- Familiarity with tools like Splunk, Datadog, or equivalent monitoring platforms.
Preferred Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related technical field (Master's degree preferred).
- Exposure to big data ecosystems and frameworks is a plus.
- Strong analytical and problem-solving skills, with a proactive and adaptable mindset.
- Excellent communication skills, with the ability to work effectively in a global team.
- Proven ability to quickly learn and adopt new platforms, tools, and technologies.
Why Join Us:
- Work on cutting-edge cloud solutions for a diverse, global clientele.
- Collaborate with a passionate team in a fast-paced, innovative environment.
- Drive impactful projects that influence the stability and scalability of mission-critical systems.
- Stay at the forefront of cloud technology and grow professionally with continuous learning opportunities.