Network HPC Engineer

Employer Active

1 Vacancy

Job Location

Austin, TX - USA

Monthly Salary

Not Disclosed

Job Description

Job Title: Network HPC Engineer

Duration: 6-month contract

Location: Austin / Pflugerville, TX (onsite)

Responsibilities:

  • Designing and deploying HPC clusters consisting of high-performance servers interconnected by high-speed networks such as InfiniBand (IB) or Ethernet/RoCE with RDMA capabilities.

InfiniBand Responsibilities:

  • Fabric Design and Configuration: Designing InfiniBand fabrics, including switches, host channel adapters (HCAs), and cables, to ensure optimal performance, scalability, and fault tolerance. Configuring switch ports, virtual lanes (VLs), and routing tables to facilitate efficient data communication within the InfiniBand fabric.
  • Topology Optimization: Analyzing workload characteristics and traffic patterns to design InfiniBand topologies (e.g. fat-tree, hypercube) that minimize latency and maximize bandwidth utilization. Implementing routing policies and congestion control mechanisms to optimize traffic flow and prevent network congestion.
  • Fabric Monitoring and Management: Monitoring InfiniBand fabric health and performance using management tools such as Subnet Manager (SM) and Performance Monitoring Counters (PMCs). Performing regular maintenance tasks, including firmware updates, port diagnostics, and error detection and correction.
  • Quality of Service (QoS): Implementing QoS policies to prioritize traffic based on application requirements and service levels. Configuring traffic classes, service levels, and virtual lanes (VLs) to ensure predictable performance for latency-sensitive applications.
  • Security and Access Control: Securing the InfiniBand fabric with features such as subnet partitioning (subnet manager security) and encryption to protect data integrity and confidentiality. Enforcing access controls and authentication mechanisms to prevent unauthorized access to the InfiniBand network.


RoCE Responsibilities:

  • Network Design and Configuration: Designing and configuring RoCE networks, including switches, network adapters, and Ethernet fabrics, to provide low-latency, high-bandwidth communication for RDMA traffic. Optimizing network settings such as MTU (Maximum Transmission Unit), buffer sizes, and flow control parameters to maximize RoCE performance.
  • Congestion Management: Implementing congestion management mechanisms such as Priority Flow Control (PFC) and Data Center Bridging (DCB) to prevent congestion and ensure fair allocation of network resources. Monitoring network traffic and congestion levels to dynamically adjust congestion control settings and avoid performance degradation.
  • Routing and Switching Optimization: Configuring RoCE-aware switches and routers to support RDMA traffic and enable efficient routing of packets between endpoints. Tuning switch port settings, forwarding tables, and routing protocols to minimize packet loss and maximize throughput for RoCE traffic.
  • Performance Monitoring and Tuning: Monitoring RoCE network performance metrics such as latency, throughput, and packet loss using tools like Ethernet Performance Monitoring (EPM) and InfiniBand Performance Monitoring (IPM). Analyzing performance data to identify bottlenecks, optimize network configurations, and fine-tune RoCE parameters for optimal performance.
  • Security and Authentication: Implementing security measures such as MACsec (Media Access Control Security) and IPsec (Internet Protocol Security) to encrypt and authenticate RDMA traffic over RoCE networks. Enforcing access controls and certificate-based authentication to ensure secure communication between RoCE endpoints.
  • Vendor Management: Coordinating with hardware and software vendors to ensure compatibility and support for products in multi-vendor environments. Developing Bills of Materials (BoMs). Clearly defining technical requirements, including performance, scalability, compatibility, and specific features needed for RoCE. Assessing the technical specifications and performance benchmarks of candidate switches and their compatibility with existing infrastructure. Implementing a proof of concept (PoC) to test the switches in a controlled environment and ensure they meet performance and reliability expectations. Evaluating the vendor's technical support capabilities, including responsiveness, expertise, and available resources. Maintaining regular communication with the vendor to stay informed about product updates, potential issues, and upcoming changes. Scheduling periodic meetings to review performance and bugs, discuss any concerns, and plan for future needs.

Employment Type

Full Time
