Inference Engineer
- $300,000-$325,000
- San Francisco, CA
- Permanent
About the job
We’re partnered with an AI infrastructure company building next-generation systems for large-scale AI workloads.
Their platform is rethinking how inference runs at scale, intelligently orchestrating workloads across heterogeneous hardware to unlock major gains in performance, efficiency, and cost. The team is solving some of the hardest problems in modern AI infrastructure: inference scheduling, KV cache management, runtime optimization, memory efficiency, and low-latency serving across distributed systems.
They’re looking for engineers who care deeply about how models execute in production: not just how they are trained, but how to make them fast, scalable, and reliable under real-world load.
What You’ll Work On
- Designing and optimizing large-scale inference pipelines
- Improving latency, throughput, and concurrency under production workloads
- Building inference runtimes and serving infrastructure
- Optimizing batching, scheduling, and request orchestration
- Managing KV cache allocation, reuse, placement, and eviction strategies
- Improving prefill/decode performance and memory efficiency
- Profiling bottlenecks across model, runtime, and distributed system layers
- Collaborating closely with compiler, kernel, and systems engineers
What They’re Looking For
- Strong systems engineering fundamentals
- Experience building or scaling ML inference / model serving systems
- Deep understanding of performance optimization and memory behavior
- Experience with runtimes such as vLLM, TensorRT-LLM, or custom serving infrastructure
- Strong understanding of transformer architectures and attention mechanisms
- Familiarity with batching, scheduling, concurrency, and cache management
- Strong Python and/or C++ engineering skills
Why Join
- Work on cutting-edge inference infrastructure and AI systems problems
- Build systems designed for next-generation AI scale
- Small, highly technical engineering team
- Significant ownership and technical impact
- Opportunity to shape foundational infrastructure for future AI workloads