Software Engineer, ML Inference

GW478
  • $250,000–$320,000
  • San Francisco, CA
  • Permanent

About the job


Software Engineer, ML Inference

San Francisco (On-Site)

$250,000–$320,000 base + equity


Why this role

An early-stage infrastructure company building a next-generation AI cloud, rethinking how frontier models run across heterogeneous compute environments.


This team is focused on the hardest part of the stack: making large-scale model inference fast, reliable, and production-ready.


You’ll own the systems that actually execute models in production — working across runtime, serving infrastructure, memory management, and hardware optimisation.


What you’ll do

  • Build and scale end-to-end inference systems from request → runtime → response
  • Optimise latency, throughput, concurrency, and reliability under real production workloads
  • Design batching, scheduling, and queuing systems for high-performance serving
  • Improve KV cache management and memory efficiency at scale (a sizing sketch follows this list)
  • Debug performance bottlenecks across model, runtime, and hardware layers
  • Work closely with systems, infrastructure, and ML teams to push inference performance forward
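
A note on why KV cache work dominates this list: in transformer serving, the cached per-token key/value tensors are usually the largest per-request allocation, which is what makes paging, eviction, and batching policy so consequential. A back-of-envelope sizing sketch in Python; every model dimension below is an illustrative assumption for a Llama-7B-style dense model, not a detail of this role:

    # Back-of-envelope KV cache sizing. All dimensions are assumed
    # (Llama-7B-style, no grouped-query attention), purely for illustration.
    def kv_cache_bytes(
        batch_size: int,
        seq_len: int,
        num_layers: int = 32,     # assumed model depth
        num_kv_heads: int = 32,   # assumed: full multi-head attention
        head_dim: int = 128,      # assumed head dimension
        bytes_per_elem: int = 2,  # fp16/bf16 elements
    ) -> int:
        # Factor of 2 covers the K tensor plus the V tensor at each layer.
        per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
        return batch_size * seq_len * per_token

    # 32 concurrent 4k-token requests -> 64 GiB of cache alone, before
    # weights or activations: hence paged/managed KV memory at scale.
    print(kv_cache_bytes(batch_size=32, seq_len=4096) / 2**30, "GiB")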

What makes this interesting

  • Deep work on LLM inference internals including prefill, decode, and attention optimisation
  • Solving real-world trade-offs between tail latency and throughput
  • Optimising workloads across GPUs and next-generation accelerators
  • Hands-on work with vLLM, TensorRT-LLM, and custom inference runtimes (a minimal vLLM sketch follows this list)
  • Opportunity to shape core infrastructure at an early-stage company
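
For a sense of the vLLM surface area mentioned above, the offline entry point is only a few lines; the model identifier below is a placeholder, and continuous batching plus paged KV cache management happen inside the engine:

    # Minimal vLLM offline-inference sketch; the model id is a placeholder.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
    params = SamplingParams(temperature=0.7, max_tokens=128)

    # generate() returns one RequestOutput per prompt.
    for out in llm.generate(["Explain paged attention briefly."], params):
        print(out.outputs[0].text)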

What they’re looking for

  • Experience building ML inference or model serving systems
  • Strong systems engineering or backend infrastructure fundamentals
  • Experience working on performance, scaling, memory, or distributed systems challenges
  • Strong Python and/or C++ skills
  • Familiarity with modern inference frameworks and runtimes is a plus


APPLY NOW!


Anna Heneghan, Senior ML Research & Engineering Recruiter
