Senior Engineer - Incident Triage and Monitoring

Employer Active

The job posting is outdated and position may be filled
Job Location

Y - USA

Monthly Salary

Not Disclosed

Job Description

Role: Senior Engineer - Incident Triage and Monitoring
Location: Reston, VA (Hybrid)
Working hours: EST
Duration: Long Term
Description
In this incident management function, you will manage incidents to resolution in a 24/7/365 environment using the client's incident management processes; effectively guide incident and triage calls from a technical perspective; share technical details obtained from monitoring tools and dashboards to aid troubleshooting; outline details of resolution activities; recommend and implement improved processes; provide timely status updates to stakeholders; assist with postmortem-related activities; and support various efforts related to operational improvements. You will also manage efforts to maintain applications in production, including troubleshooting stoppages, repairing bugs, documenting application performance, and coordinating with technology infrastructure management.
Highlights:
  • Seeking Senior AWS Cloud Engineer or Architect
  • Extensive hands-on experience with AWS
  • Strong knowledge of AWS infrastructure
  • Proficient with monitoring tools for identifying errors and root causes within AWS infrastructure
  • Experience across multiple components of AWS to handle various incidents
  • Soft skills and communication are important for incident triage
  • AWS certification mandatory, preferably at AWS Solutions Architect Associate level
  • AWS Solutions Architect Professional or AWS SysOps certification preferred
  • Interviews focus on scenario-based questions to assess technical triage abilities and problem-solving skills
KEY JOB FUNCTIONS
  • Manage IT production incidents to resolution in a 24/7/365 environment using the incident management processes, and communicate incident status, impact, and resolution actions.
  • Effectively lead and guide incident triage calls from a technical perspective, analyzing different components of the infrastructure and application environment using a variety of monitoring tools and processes.
  • Troubleshoot incidents and identify root cause quickly using operations wire data analytics, application performance management, and event correlation monitoring tools.
  • Analyze data, evaluating multiple application protocols, including web, database, storage, and supporting infrastructure such as AWS, UNIX, DNS, LDAP, SSL, SMTP, and FTP.
  • Troubleshoot and resolve incidents on the AWS cloud infrastructure.
  • Hands-on experience managing and monitoring applications deployed on Amazon Web Services (AWS) using services like EC2, ELB, RDS, Redshift, DynamoDB, Aurora, Route 53, ECS, Lambda, S3, Batch, CloudWatch, CloudTrail, WAF, etc.
  • Experience building tools for monitoring and troubleshooting system resources in an AWS environment. Ability to triage AWS-related incidents using monitoring tools on AWS Cloud.
  • Experience with performance engineering of AWS Cloud applications.
  • Hands-on experience with transaction-level monitoring using Dynatrace, OpenTelemetry, and Splunk.
  • Ability to perform transaction-level monitoring and troubleshooting on the AWS cloud platform.
  • Eyes-on-glass monitoring of the health of applications as well as the underlying infrastructure.
  • Monitoring experience with tools like ExtraHop, SolarWinds, Netcool suite, Catchpoint, and Moogsoft.
  • Ability to analyze dashboards and reporting/monitoring tools to look at trends and patterns in application health and performance.
  • Proactively look for hardware, software, and environmental alerts or malfunctions.
  • Influence other technical teams on the calls and articulate troubleshooting steps effectively.
  • Lead required technical follow-up calls for critical incidents.
  • Assist with documentation of Root Cause Analysis (RCA) or Correction of Errors (COE) and data quality for all ECC communicated incidents.
  • Ensure appropriate functional and management escalation takes place as per the standards and procedures.
  • Follow up on items that could negatively impact production operations, assist with postmortem-related activities, and support various efforts related to operational improvements.
  • Based on recommendations from management, implement new and improved processes, change processes, perform new tasks, create reports, and address ad hoc requests.
  • Participate in on-call rotation. Ability to work any shift as needed, including weekends and night shifts.
  • Ability to report incident details and metrics to senior leadership.
EDUCATION
Bachelor's degree or equivalent required.
MINIMUM EXPERIENCE
6 years of related experience
SPECIALIZED KNOWLEDGE & SKILLS
  • 6 years of working experience with different IT infrastructure components such as AWS, Unix/Linux servers, Wintel servers, AWS networks, firewalls, routers, load balancers, VPN, Apache, WebLogic, LDAP, Active Directory, Exchange, Oracle/MS SQL databases, SAN, virtualization, email systems, enterprise monitoring, and access management solutions for single sign-on. Subject matter expertise is not required, and experience with at least eight of the above is preferred.
  • Senior-level hands-on working experience with Amazon Web Services (AWS).
  • Proven methodical approach to problem identification, monitoring, problem solving, and resolution.
  • Ability to analyze different components of the infrastructure and application environments during incident triage calls.
  • Aptitude to influence other technical teams on incident calls and articulate troubleshooting steps effectively.
  • Experience and confidence working with all levels of management; excellent written and verbal skills.
  • Able to communicate quickly and concisely with senior management on technical issues in non-technical terms, and to run large conference calls during incidents with a wide range of personnel and management levels.
  • Strong relationship management skills and aptitude to multitask and work well in a high-stress environment, both within teams and independently.
  • AWS Solutions Architect Associate or higher certification.
Preferred Qualifications:
  • Understanding of tools like CloudFormation or Terraform.
  • Management and troubleshooting of middleware products in UNIX and Linux environments. Knowledge of Service-Oriented Architecture (SOA), Java, etc.
  • Understanding of Azure or Google Cloud.
  • Prior financial industry experience.

Employment Type

Full Time
