Role: Senior Hadoop SRE (Hadoop Site Reliability Engineer)
Locations: Plano, TX / Jersey City, NJ / Atlanta, GA / New York, NY / Charlotte, NC / Newark, DE
Duration: Long Term
The Hadoop Site Reliability Engineer will be responsible for building and enhancing the tooling needed to deploy and operate Hadoop clusters at scale. The role covers monitoring, troubleshooting, automating, and continuously developing tools to improve the availability and resiliency of the data ecosystem.
- We are seeking an experienced Site Reliability Engineer with expertise in managing the reliability of large Hadoop clusters.
- Experience in building, managing, and tuning the performance of Hadoop platforms.
- Excellent Shell and Python programming skills for automating repetitive DevOps tasks.
- Tune alerting and set up observability to proactively identify issues and performance problems.
- Deploy and scale Hadoop infrastructure; handle capacity planning, data cluster monitoring, and troubleshooting; and drive operational enhancements.
- DevOps, deployment, and production ops.
- Production ops processes in a Hadoop environment on the application side; ability to handle and work with multiple self-service teams.
- Expert in reliability engineering, incident management, observability, and monitoring.
- Subject matter expert on the above.
- Not looking for an infrastructure-focused person, but for someone who works at the Hadoop platform level.
- The ideal candidate will be a bridge between L1/L2 and L3 support across the largest instance in the bank (the 50 TB Strategic Data Platform, SDP), able to communicate technically and demonstrate leadership in resolving problems for production ops.
- Hands-on experience with, and a strong understanding of, Hadoop architecture.
- Experience with Hadoop ecosystem components (HDFS, YARN, MapReduce), cluster management tools such as Ambari or Cloudera Manager, and related technologies.
- Proficiency in scripting, Linux system administration, networking, and troubleshooting.