Distributed Systems Engineer

GW401
  • $200,000-$300,000
  • San Francisco, CA
  • Permanent

About the job


I’m partnered with an AI infrastructure startup building a new foundation for how large-scale AI systems run.

They’re addressing fundamental limits in power, cost, and hardware by decoupling workloads from infrastructure and enabling heterogeneous compute across CPUs, GPUs, and emerging accelerators.


Strong early traction:

🚀 $80M Series A

🚀 Deployments with Fortune 500 + AI-native companies

🚀 Working directly with foundation labs and hyperscalers



The Role

This is a core distributed systems role focused on building the platform that runs AI workloads at scale.

You will build systems that schedule, route, and operate workloads across thousands of nodes in production.

Typical problems:

• Distributed scheduling and orchestration

• Resource allocation across large-scale systems

• Reliability, fault tolerance, and failure handling


You’ll work across the stack with compilers, runtimes, and hardware to ensure performance and correctness.



What They’re Looking For

• Proven ownership of distributed systems in production

• Strong Kubernetes experience

• Deep understanding of concurrency, failure modes, and system tradeoffs

• Strong programming in Go, C++, or Python


Ideal Additional Experience

• Experience with ML inference systems or performance-critical workloads

• Familiarity with scheduling, queues, or resource management systems



💡 Does this sound interesting? Let's chat:


Anna Button, Researcher
