Cloud Infra Control Plane Service Engineering Architect

Employer Active

1 Vacancy
Jobs by Experience

5 years

Job Location

USA

Monthly Salary

Not Disclosed

Job Description

This is a remote position.

Job Title: Cloud Infra Control Plane Service Engineering Architect

Location: Remote. Candidates in the Bay Area or Seattle will be prioritized. All candidates should expect to work 3 PM to 9 PM PST at least 2 days a week.

Duration: 6 months

The project centers on implementing an NVIDIA SuperPOD. Major bonus points for candidates who have that experience.

Key Responsibilities:

Infrastructure Management:

  • Manage and monitor compute clusters, ensuring high availability and performance.
  • Implement and maintain automation scripts for infrastructure provisioning and management.

Design and Implementation:

  • Design, implement, and maintain compute services for both GPU and non-GPU environments.
  • Develop and optimize algorithms for high-performance computing tasks, especially in the AI/ML training and inference domain.

Performance Optimization:

  • Analyze and optimize the performance of compute workloads.
  • Implement best practices for resource utilization and efficiency.

Collaboration:

  • Work closely with data scientists, researchers, and other engineering teams to understand and meet their compute requirements.
  • Collaborate with hardware vendors to evaluate and integrate new technologies.

Security and Compliance:

  • Ensure that compute services comply with security policies and industry standards.
  • Implement and maintain security measures to protect data and compute resources.

Troubleshooting and Support:

  • Provide support for compute-related issues, including debugging and resolving hardware and software problems.
  • Develop and maintain documentation for troubleshooting procedures and best practices.

Continuous Improvement:

  • Stay updated with the latest advancements in compute technologies and integrate them into the infrastructure.
  • Continuously improve the reliability, scalability, and performance of compute services.

Qualifications:

Education:

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
  • NVIDIA and AI certification.

Experience:

  • Years of experience managing on-premises GPU or non-GPU systems.
  • Proven experience in managing and optimizing GPU and non-GPU compute environments.
  • AI infrastructure engineering skills, both building and operating.
  • Experience with high-performance computing (HPC) and parallel processing, including bare-metal and large-scale virtual environments.
  • Experience implementing virtualization architectures, leveraging expertise with Kubernetes distributions like OpenShift or Rancher and cloud technologies on bare-metal environments.
  • Proficiency in hardware technologies such as SR-IOV, DPU, and GPU, with proven experience implementing these technologies in virtualized and containerized environments.

Technical Skills:

  • Proficiency in programming languages such as Python, C++, or similar.
  • Experience with infrastructure as code (IaC) tools like Terraform, Ansible, or similar.
  • Familiarity with containerization and orchestration tools like Docker and Kubernetes.
  • Familiarity with the technologies underlying Kubernetes: CRI, CSI, CNI, Operators, the GPU device plugin, and RDMA/InfiniBand integration.
  • Knowledge of cloud platforms (AWS, Azure, GCP) and their compute services.

Soft Skills:

  • Strong problem-solving skills and attention to detail.
  • Excellent communication and collaboration skills.
  • Ability to work in a fast-paced, dynamic environment.


Employment Type

Remote

Key Skills

  • React Native
  • AI
  • Enterprise Software
  • React
  • Node.js
  • Redis
  • AWS
  • Software Development
  • IOS
  • Team Management
  • Product Development
  • Mobile Applications
