Location: Philadelphia, PA. MUST be onsite; candidates should be within 40 miles and a 60-minute drive.
Duration: 1 year
Job Summary
We are looking for a highly skilled and hands-on Distributed Computing Engineer to help build a platform for executing large-scale AI workloads across a fleet of distributed processors. This role involves designing and implementing systems for distributing tasks, managing state, and aggregating results efficiently. The ideal candidate will have strong engineering expertise in building distributed systems and a deep understanding of the design principles behind orchestration frameworks like Kubernetes, without being limited to specific tools. You'll work at the intersection of cutting-edge AI and distributed computing, helping shape the platform that powers next-generation AI workloads.
Key Responsibilities
- Build and optimize a distributed computing platform for executing AI workloads across many nodes, ensuring scalability, reliability, and performance.
- Implement systems for job scheduling, task distribution, state tracking, and fault-tolerant execution.
- Apply distributed systems concepts such as partitioning, replication, consensus, and eventual consistency to ensure robust system behavior.
- Design solutions inspired by modern orchestration frameworks (e.g., Kubernetes) while tailoring them to meet the unique requirements of AI workload distribution.
- Write clean, efficient code in languages like Python, Go, or C to build high-performance distributed components.
- Collaborate with other engineers to integrate distributed computing capabilities into larger AI pipelines.
- Contribute to system monitoring and debugging tools to ensure real-time visibility into system health and performance.
- Stay current with advancements in distributed systems, orchestration techniques, and AI model execution to bring innovative ideas to the team.
Qualifications
- Bachelor's or Master's degree in Computer Science, Software Engineering, or a related field (or equivalent experience).
- Strong experience in building distributed systems, with a focus on scalability, fault tolerance, and performance optimization.
- Hands-on experience with concepts like task scheduling, state management, and distributed coordination protocols (e.g., leader election or consensus).
- Familiarity with the design principles behind container orchestration frameworks (e.g., Kubernetes), including declarative configuration, automated scaling, and service discovery.
- Proficiency in one or more programming languages such as Python, Go, or C for building high-performance systems.
- Experience working with networking concepts (e.g., RPCs, gRPC) and designing communication between distributed components.
- Experience with AI/ML workflows or large-scale data processing is required.
Preferred Skills
- Experience deploying distributed systems on cloud platforms (AWS, GCP, Azure).
- Knowledge of message queues (e.g., Kafka) or event-driven architectures for task distribution.
- Familiarity with debugging distributed systems using logging/observability tools like Prometheus or Grafana.