Lead complex technology initiatives focused on Hadoop platform stability and automation.
Platform Incident and Problem Management:
Be part of the 24/7 platform support team that performs platform incident management and triage.
Assess the impact of platform issues and prioritize triage by partnering with platform tenants, platform development, and Hadoop vendor teams.
Facilitate and perform root cause analysis of platform incidents to determine a permanent resolution, involving tenants and the Hadoop vendor.
Communicate triage progress and status to stakeholders and the leadership team.
Troubleshoot tenant-reported failures by deep diving into client logs to identify the root cause.
Use Autosys as the scheduler to schedule and execute routine platform activities.
Hadoop Cluster Monitoring:
Monitor Hadoop cluster health and performance.
Automate new monitoring opportunities by leveraging available monitoring and alerting tools such as Prometheus, Thousand Eyes, Splunk, etc.
Troubleshoot actionable alerts and tune the cluster to ensure optimal performance.
Hadoop Cluster Configuration and Tuning:
Periodically tune and configure Hadoop ecosystem components such as HDFS, YARN, Hive, HBase, Spark, HPOS, etc.
Schedule change requests and track the approval process for implementation.
Platform Backup and Recovery:
Ensure the Hadoop platform is resilient, with data backup and recovery strategies in place.
Perform emergency or scheduled failover of the platform to the disaster recovery site, and the subsequent fallback.
Upgrades and Patch Management:
Plan and execute upgrades of Hadoop ecosystem components and patches.
Ensure compatibility and smooth integration of new features and enhancements.
Work with operating system administrators and the patching automation build team to schedule and execute patching.
Perform cluster health validation after patching and communicate the results to stakeholders.
Identify and implement automation and process improvement opportunities related to patching.
Documentation and Reporting:
Maintain comprehensive documentation of Hadoop cluster configurations, troubleshooting processes, and procedures.
Develop and schedule reports on cluster usage, performance metrics, and capacity utilization.
Collaboration and Support:
Work both independently and in collaboration with platform development and platform tenant teams to troubleshoot and fix Hadoop platform issues and maintain production stability for enterprise-wide applications.
Collaborate and consult with key technical experts, the senior technology team, and the Hadoop vendor to resolve complex technical issues and achieve goals.
Mandatory Skills and Qualifications:
Proven experience in a technology incident and problem management role.
Proven experience using monitoring tools such as Prometheus, Splunk, SiteScope, and Thousand Eyes.
Proven experience in Unix scripting.
Proven experience in Hadoop administration or in a similar role.
Strong knowledge of the Hadoop ecosystem and its components (Hive, HBase, Spark, Hue).
Proven experience using Autosys or a similar job scheduler.
Excellent problem-solving and troubleshooting skills.
Excellent communication and people skills for collaborating with cross-functional teams.