Software Engineer - Distributed Systems

GW435
  • $250,000-$270,000
  • San Francisco, CA
  • Permanent

About the job



We’re working with a well-funded Series A company building a new class of cloud infrastructure for AI. They’re tackling a fundamental problem: today’s AI systems are tightly coupled to specific hardware, creating limits in cost, scale, and efficiency.


Their approach decouples workloads from hardware, dynamically partitioning and scheduling them across heterogeneous compute (GPUs, accelerators, multi-generation systems). This is deep, production-grade distributed systems work operating at real scale.

What you’ll do

  • Own core distributed systems end to end: design, build, deployment, and operation
  • Design scheduling, routing, and resource management systems across thousands of nodes
  • Build production-grade control planes and APIs for workload orchestration
  • Make explicit tradeoffs around performance, reliability, and efficiency at scale
  • Debug complex distributed failures and continuously improve system behaviour


What makes this interesting

  • High ownership: you’re building foundational infrastructure, not abstracted layers
  • Real scale: systems designed for large, multi-cluster / datacenter environments
  • Hard problems: concurrency, scheduling, failure modes, and resource allocation
  • Heterogeneous compute: working beyond standard cloud abstractions
  • Early-stage: opportunity to shape architecture with real production constraints


We’re looking for

  • Engineers who have built or operated distributed systems in production
  • Strong fundamentals in concurrency, systems design, and failure handling
  • Evidence of ownership over meaningful systems (not just contributions)
  • Comfort reasoning about tradeoffs in large-scale environments
  • Ability to clearly explain design decisions and system behaviour


Not required, but great if you have

  • Experience with Kubernetes or similar systems beyond basic usage
  • Background in scheduling, queues, or resource management systems
  • Experience designing service-oriented architectures (RPC, async systems)
  • Systems-level programming experience (e.g. Go, C++, Python)


Anna Heneghan, Senior ML Research & Engineering Recruiter
