Role: SRE Engineer
Location: Austin, TX
Type: Full-time
Who are we looking for?
An Application SRE with 8+ years of overall experience developing and supporting complex, critical, large-scale distributed systems, and extensive hands-on experience handling production failures and driving root cause analysis and remediation.
Primary Responsibilities:
Handle production outages and performance issues effectively, with quality analysis and quick resolution.
Manage incidents and effectively communicate with users, application owners and senior stakeholders across all areas.
Work with development teams to improve applications' operational features for faster MTTD and MTTR and auto recovery
Identify and analyze patterns of incidents and problems, conduct flawless post-mortems, develop permanent remediation plans, and implement automation to prevent incidents from recurring
Identify alerts / processes that can be automated and then work with the Engineering team to automate them
Build and improve run books for generalists to minimize operational errors and gain fungibility/efficiency
Build E2E monitoring (hardware, availability, logging, distributed tracing, business transactions) of the system, as well as end-user experience monitoring, using APM tools like Splunk, AppDynamics, 1000Eyes, etc., working as a developer/configurator for performance diagnostics, monitoring, alerting, and dashboarding.
Strong understanding of deployment methodologies and hands on experience in production deployments.
Develop self-healing solutions for recurring infrastructure and service failures.
Minimize manual involvement by driving solutions and automation and by implementing continuous improvements to the operating environment, including development and configuration for dynamic monitoring, alerting, and recovery
Technical Skills:
Minimum 8 years of IT experience, including at least 3 years of web application production support
Comfortable with large scale production systems and technologies, for example load balancing, monitoring, distributed systems, microservices, and configuration management.
Solid hands-on experience in troubleshooting and fixing application failures, application performance degradation, code issues, cloud platform issues, batch failures, MongoDB failures, and network failures.
ITIL working knowledge: Event, Incident, Release, Problem and Knowledge Management.
Experience with instrumentation, monitoring, alerting, and incident response for application performance and availability, using tools such as AppDynamics, Splunk, 1000Eyes, ITRS, etc.
Experience in administration of Linux servers, networking, and load balancing.
Clear understanding of one or more Cloud systems (PCF, GCP, AWS, Azure Cloud or others)
Hands-on experience in performing production deployments using tools like cf-cli and Bamboo
Hands-on experience in CI/CD implementation
Qualification:
Educational qualification: B.Tech, BE, BCA, MCA, M.Tech, or an equivalent technical degree from a reputed college.