Role: Site Reliability Engineer with Machine Learning
Location: Austin, TX (onsite from day one)
Solid SRE experience with MLOps and ML workflows, plus strong scripting skills, is required.
Job description:
The ideal candidate has experience with Kubernetes, machine learning workflows (preferably Amazon SageMaker), Python scripting, and Rubix. The candidate should also have experience with Jupyter Notebooks as an SRE.
- The successful candidate will have several years of experience supporting large enterprise systems with at least 10 different upstream and downstream systems, including identifying issues from Splunk logs.
- Technically sound in AWS, Kubernetes, Python, and basic SQL; MLOps knowledge such as MLflow is a plus.
- Answer and fix support issues for DatalaLab.
- Implement and maintain Infrastructure as Code and build pipelines.
- Take measures to minimize on-call incidents.
- Conduct post-incident reviews.
- Work with dev teams to ensure that new features meet reliability and performance goals.
- Ability to work with geographically distributed teams in India and SCV.
Regards
Bharath