Site Reliability Engineer 1

CloudifyOps Pvt. Ltd.

Posted on: 22-01-2025

1 Vacancy

Experience

5 years

Job Location

Bengaluru - India

Monthly Salary

Not Disclosed

Job Description

We are looking for a skilled Site Reliability Engineer (SRE) to join our dynamic team. The ideal candidate will ensure the reliability, scalability, and performance of our systems while implementing DevOps best practices. You will work on infrastructure automation, incident management, and observability to maintain a stable production environment.

Key Responsibilities:

Infrastructure & Operations: Manage, administer, and optimize our Linux-based infrastructure, ensuring high availability, scalability, and performance for mission-critical applications.

Incident Management: Lead and manage incident response, troubleshoot complex issues across production systems, and ensure timely resolution in line with SLAs, SLOs, and SLIs.

Linux Administration: Perform in-depth system administration tasks, including user management, system tuning, performance optimization, patching, and log analysis for Linux servers.

Automation & Configuration Management: Automate repetitive tasks, manage Docker containers, and implement configuration management and provisioning using Terraform and other automation tools.

Monitoring & Performance Tuning: Use Grafana, Prometheus, and other monitoring tools to track system health and performance, taking proactive measures to maintain reliability and reduce downtime.

Runbook & Documentation: Develop and maintain runbooks, detailed troubleshooting guides, and operational documentation for incident response and system recovery.

Debugging & Troubleshooting: Perform advanced debugging and root-cause analysis to diagnose complex system issues, with a focus on minimizing downtime and improving operational stability.

Collaboration: Work closely with cross-functional teams, including development, QA, and operations, to ensure that new features and releases meet reliability and performance standards.

Capacity Planning & Optimization: Plan for future infrastructure needs, scale systems as required, and optimize resource utilization to meet growing demand.

Skills and Qualifications:

5 years of experience in Linux administration and networking, including deep expertise in system configuration, user management, kernel tuning, log analysis, and performance troubleshooting.

Hands-on experience with Docker and container orchestration (e.g., Kubernetes).

Solid knowledge of Terraform for infrastructure automation, provisioning, and management.

Proficiency with monitoring and alerting tools such as Grafana, Prometheus, and Opsgenie to track performance and uptime.

Experience in incident management, resolving issues promptly while meeting SLA, SLO, and SLI targets.

Expertise in debugging and resolving complex technical issues in distributed systems, with a focus on minimizing downtime.

Proven ability to write and maintain runbooks and operational procedures for troubleshooting and system recovery.

Experience in data center management and ensuring 24/7 availability of production infrastructure.

Strong understanding of automation tools (e.g., Terraform, Ansible) and continuous improvement practices.

Excellent communication and teamwork skills, with the ability to collaborate effectively across departments.

Linux networking is the primary focus, on Ubuntu-based infrastructure. Strong Linux skills are required, for example, system hardening.

Ansible for configuration management; Jenkins and GitLab for deploying infrastructure.

Task automation using shell scripting and Python.

ELK for logging; Prometheus and Grafana for monitoring.

Databricks and Tableau are also in use, so SQL skills are required.

An understanding of Docker containers is a must.

Preferred Qualifications:

Experience with cloud platforms (AWS, Azure, GCP).

Familiarity with CI/CD pipelines and tools such as Jenkins, Git, or similar.

Exposure to Kubernetes and container orchestration technologies.

Certifications in Linux administration, SRE, DevOps, or cloud technologies.

Employment Type

Full Time

Disclaimer: Drjobpro.com is only a platform that connects job seekers and employers. Applicants are advised to conduct their own independent research into the credentials of the prospective employer. We always make certain that our clients do not endorse any request for money payments; we therefore advise against sharing any personal or bank-related information with any third party. If you suspect fraud or malpractice, please reach us via the Contact Us page.

Similar Jobs

Engineer - Test

Network Engineer

Test Engineer

Senior Staff Engineer Big Data Engineer

AIML Engineer

QA Software Engineer Sr QA Engineer

Software Test Engineer

Fullstack Software Engineer