HPC Validation and Performance Engineer

Job Location

London - UK

Monthly Salary

Not Disclosed

Vacancy

1 Vacancy
Job Description

Do you want to tackle the biggest questions in finance with near-infinite compute power at your fingertips?

G-Research is a leading quantitative research and technology firm with offices in London and Dallas.

We are proud to employ some of the best people in their field and to nurture their talent in a dynamic, flexible and highly stimulating culture where world-beating ideas are cultivated and rewarded.

This role is based in our new Soho Place office, opened in 2023, in the heart of Central London and home to our Research Lab.

The role

As an HPC Validation and Performance Engineer at G-Research, you will take ownership of the validation and optimisation of our HPC CPU and GPU calc farms.

This critical role will involve developing a validation and performance baselining framework which ensures system readiness for AI/ML and HPC workloads across multiple architectures. Your role will be essential in providing continuous performance benchmarking, real-time observability and long-term strategic readiness.

You will drive the implementation of advanced tooling and frameworks, maintaining an infrastructure that is crucial to our cutting-edge research efforts. You will be accountable for providing data-driven performance metrics to support architectural design choices as we continue to scale our datacentre footprint globally.

We are looking for someone with deep technical expertise in compute, storage or networking optimisations and performance engineering who can develop solutions that scale with our growing infrastructure.

This role demands a forward-thinking engineer who can anticipate industry trends and adopt emerging architectures and strategies to keep G-Research at the forefront of innovation.

Key responsibilities of the role include:

  • Architecting and implementing a validation framework to certify the readiness of GPU nodes across a large distributed environment
  • Defining methodologies to continually assess performance and optimising infrastructure across AI/ML workloads
  • Developing and executing comprehensive performance testing using industry benchmarks, ensuring optimal performance across HPC compute, storage and networking
  • Leading efforts to identify and resolve bottlenecks in system performance
  • Building robust, scalable tools for automated validation and testing, utilising Python, Go, Kubernetes and CI/CD pipelines to streamline continuous validation and benchmarking processes
  • Implementing monitoring solutions using Prometheus, Grafana and other modern monitoring technologies to track performance metrics and real-time health of the cluster
  • Defining and implementing best practice for continuous performance validation, ensuring that the infrastructure remains reliable and efficient as new technologies emerge
  • Staying informed on industry trends and advancements to ensure long-term strategic alignment
  • Working cross-functionally with engineering, infrastructure and research teams to align validation efforts with the broader business objectives, ensuring that the platform meets evolving research demands

Who are we looking for?

The ideal candidate will have the following skills and experience:

  • Accelerator performance experience, including profiling and tuning with large-scale GPU clusters
  • In-depth understanding of NVIDIA ClusterKit, Nsight and Validation Suite, MLPerf and DCGM tools for GPUs and DPUs
  • Networking & Storage performance experience, including profiling and optimisation with NVIDIA ClusterKit, iPerf or equivalent across InfiniBand/RoCE network implementations
  • System benchmarking experience across Linux and familiarity with the Phoronix suite or equivalent
  • Experience with HPC workloads across distributed global locations, bringing data-driven performance data to complement key architectural decisions
  • Strong proficiency in developing automation tools and micro-benchmarking frameworks for validation using Python, Go and Kubernetes in an Ubuntu Linux environment
  • Expertise with key monitoring platforms including OTEL, Prometheus, ELK and Grafana, and in defining and implementing the overall observability strategy for HPC validation and performance monitoring
  • A deep understanding of emerging technologies, architectures and strategies, with the ability to assess their potential impact on infrastructure and adopt them as part of a long-term plan
  • Proven ability to lead complex technical projects, influence decisions and engage with stakeholders across technical and research teams

Why should you apply?

  • Highly competitive compensation plus annual discretionary bonus
  • Lunch provided (via Just Eat for Business) and dedicated barista bar
  • 30 days annual leave
  • 9% company pension contributions
  • Informal dress code and excellent work/life balance
  • Comprehensive healthcare and life assurance
  • Cycle-to-work scheme
  • Monthly company events

G-Research is committed to cultivating and preserving an inclusive work environment. We are an ideas-driven business and we place great value on diversity of experience and opinions.

We want to ensure that applicants receive a recruitment experience that enables them to perform at their best. If you have a disability or special need that requires accommodation, please let us know in the relevant section.

Employment Type

Full-Time
