Lead Site Reliability Engineer

1 Vacancy
Job Location

Atlanta, GA - USA

Salary

Not Disclosed


Job Description

Job Title: Lead Site Reliability Engineer

Location: Atlanta, GA (Onsite)

Duration: Long Term Contract





Desired Skillset:


Strong expertise in the AWS cloud platform.
Proficiency in automation, scripting, and monitoring tools, including OpenShift, CloudFormation, Terraform, Ansible, Shell, and Python.
In-depth knowledge of infrastructure layers such as Linux OS, virtualization platforms, software-defined networking, load balancers, firewalls, API tools, monitoring tools, and storage/backup strategies.
Extensive experience with enterprise systems and mission-critical application operations, including issue resolution.
Experience automating and operationalizing development/QA workflows using CI/CD tools such as GitLab, GitHub, Jenkins, Maven, Gradle, and Nexus.
Hands-on experience with software release management.


Minimum Experience:


3 years in DevOps or SysOps engineering, focusing on major cloud platforms (preferably AWS).
2 years of application development, including data streaming and deployment of high-availability, critical application components.
1 year in a Site Reliability Engineering (SRE) role (preferred).
7 years of overall professional experience.


Responsibilities:
As a Site Reliability Engineer (SRE) with expertise in AWS cloud infrastructure and application monitoring, you will ensure the reliability, scalability, and performance of our cloud-based systems and applications. Key responsibilities include:


Manage and optimize data streaming and API components in OpenShift (on-premises) and AWS.
Review and optimize application APIs and processes to improve response times across components.
Automate testing processes, including data quality checks, production delivery, and deployment to production environments.
Develop integrations between on-premises applications, AWS, and third-party tools (ServiceNow, VersionOne, Sumo Logic).
Collaborate with teams to define Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
Lead performance monitoring and troubleshooting of platform applications, identifying root causes and documenting solutions.
Evolve the cloud infrastructure for the application suite by experimenting with new technologies and building prototypes to assess their benefits.
Design and develop CI/CD pipelines to deploy application artifacts, including APIs and data processing jobs.
Configure and implement monitoring and alerting metrics so that support teams can detect issues proactively.
Maintain data integrity and access control using AWS security tools and services such as HSM and IAM.
Develop and monitor AWS billing tools, generate cost reports, and implement cost optimization strategies.
Work with security architects to design and implement data security tools, encryption, and key management.
Address security vulnerabilities identified by audits and the wider security community, and develop solutions that let support teams regularly scan for and resolve issues.
Monitor and analyze platform capacity and performance, collaborating with architecture teams to design elastic infrastructure for irregular traffic bursts.
Contribute to the design and implementation of backup strategies for service restoration and disaster recovery.
Provide continuous input to architecture, infrastructure, and application teams to improve design, performance, and security.


Qualifications:


Proven experience as an SRE or in a similar role focusing on AWS cloud infrastructure.
Deep understanding of AWS services (Lambda, S3, SQS, IAM, Route 53, etc.) and proficiency in Infrastructure as Code (Terraform, CloudFormation).
Hands-on experience with monitoring tools (CloudWatch, Sumo Logic, Dynatrace, Grafana) for performance monitoring and alerting.
Proficiency in scripting and automation (Python, Bash) for building and maintaining deployment pipelines and infrastructure.
Strong analytical and troubleshooting skills to diagnose and resolve complex infrastructure, application, and data issues.
Experience with containerization technologies (Docker, Kubernetes) and serverless architectures (AWS Lambda).
Familiarity with CI/CD pipelines and version control systems (Git) for continuous integration and deployment.
BS in Computer Science or a related technical field, or equivalent practical experience.


Ansible, Gradle, Docker, GitLab, AWS Services, Maven, Kubernetes, AWS, Python, Terraform, GitHub

Employment Type

Full Time

Disclaimer: Drjobpro.com is only a platform that connects job seekers and employers. Applicants are advised to conduct their own independent research into the credentials of the prospective employer. We always make certain that our clients do not endorse any request for money payments, so we advise against sharing any personal or bank-related information with any third party. If you suspect fraud or malpractice, please contact us via the Contact Us page.