This is a remote position.
Job Title: Cloud Infra Control Plane Service Engineering Architect
Location: Remote. Candidates in the Bay Area or Seattle will be prioritized. All candidates should expect to work 3 PM to 9 PM PST at least 2 days a week.
Duration: 6 months
The project centers on implementing an NVIDIA SuperPOD. Major bonus points for candidates who have that experience.
Key Responsibilities:
Infrastructure Management:
- Manage and monitor compute clusters, ensuring high availability and performance.
- Implement and maintain automation scripts for infrastructure provisioning and management.
Design and Implementation:
- Design, implement, and maintain compute services for both GPU and non-GPU environments.
- Develop and optimize algorithms for high-performance computing tasks, especially in the AI/ML training and inference domain.
Performance Optimization:
- Analyze and optimize the performance of compute workloads.
- Implement best practices for resource utilization and efficiency.
Collaboration:
- Work closely with data scientists, researchers, and other engineering teams to understand and meet their compute requirements.
- Collaborate with hardware vendors to evaluate and integrate new technologies.
Security and Compliance:
- Ensure that compute services comply with security policies and industry standards.
- Implement and maintain security measures to protect data and compute resources.
Troubleshooting and Support:
- Provide support for compute-related issues, including debugging and resolving hardware and software problems.
- Develop and maintain documentation for troubleshooting procedures and best practices.
Continuous Improvement:
- Stay updated with the latest advancements in compute technologies and integrate them into the infrastructure.
- Continuously improve the reliability, scalability, and performance of compute services.
Qualifications:
Education:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- NVIDIA and AI certification.
Experience:
- Years of experience managing on-premises GPU or non-GPU systems.
- Proven experience in managing and optimizing GPU and non-GPU compute environments.
- Skills in building and operating AI infrastructure.
- Experience with high-performance computing (HPC) and parallel processing, including bare-metal and large-scale virtual environments.
- Experience implementing virtualization architectures, leveraging expertise with Kubernetes distributions like OpenShift or Rancher and cloud technologies in bare-metal environments.
- Proficiency in hardware technologies such as SR-IOV, DPU, and GPU, with proven experience in implementing these technologies in virtualized and containerized environments.
Technical Skills:
- Proficiency in programming languages such as Python, C++, or similar.
- Experience with infrastructure as code (IaC) tools like Terraform, Ansible, or similar.
- Familiarity with containerization and orchestration tools like Docker and Kubernetes.
- Familiarity with the technologies underlying Kubernetes, including CRI, CSI, CNI, Operators, the GPU device plugin, and RDMA/InfiniBand integration.
- Knowledge of cloud platforms (AWS, Azure, GCP) and their compute services.
Soft Skills:
- Strong problem-solving skills and attention to detail.
- Excellent communication and collaboration skills.
- Ability to work in a fast-paced, dynamic environment.