Inference Engineer

GW497
  • $300,000-$325,000
  • San Francisco, CA
  • Permanent

About the job



We’re partnered with an AI infrastructure company building next-generation systems for large-scale AI workloads.


Their platform is rethinking how inference runs at scale, intelligently orchestrating workloads across heterogeneous hardware to unlock major gains in performance, efficiency, and cost. The team is solving some of the hardest problems in modern AI infrastructure: inference scheduling, KV cache management, runtime optimization, memory efficiency, and low-latency serving across distributed systems.


They’re looking for engineers who care deeply about how models execute in production: not just training models, but making them fast, scalable, and reliable under real-world load.


What You’ll Work On

  • Designing and optimizing large-scale inference pipelines
  • Improving latency, throughput, and concurrency under production workloads
  • Building inference runtimes and serving infrastructure
  • Optimizing batching, scheduling, and request orchestration
  • Managing KV cache allocation, reuse, placement, and eviction strategies
  • Improving prefill/decode performance and memory efficiency
  • Profiling bottlenecks across model, runtime, and distributed system layers
  • Collaborating closely with compiler, kernel, and systems engineers


What They’re Looking For

  • Strong systems engineering fundamentals
  • Experience building or scaling ML inference / model serving systems
  • Deep understanding of performance optimization and memory behavior
  • Experience with runtimes such as vLLM, TensorRT-LLM, or custom serving infrastructure
  • Strong understanding of transformer architectures and attention mechanisms
  • Familiarity with batching, scheduling, concurrency, and cache management
  • Strong Python and/or C++ engineering skills


Why Join

  • Work on cutting-edge inference infrastructure and AI systems problems
  • Build systems designed for next-generation AI scale
  • Small, highly technical engineering team
  • Significant ownership and technical impact
  • Opportunity to shape foundational infrastructure for future AI workloads


Anna Heneghan, Senior ML Research & Engineering Recruiter
