Inference Engineer
- $200,000–$350,000
- Santa Clara, CA
- Permanent
About the job
Senior / Principal Machine Learning Engineer – Inference Serving Frameworks
Full-time | On-site | Bay Area
About the Company
We are a VC-backed, stealth-mode startup building rack-level AI inference systems. Our differentiated system-on-chip architecture enables system-level innovations designed to maximize efficiency for data-center-scale inference serving.
The team is building hardware and extending open-source software to serve leading-edge models with extreme efficiency. We are looking for highly skilled engineers who can help architect and optimize large-scale inference systems across software, hardware, networking, and scheduling.
Leveling is determined by scope, ownership, and leadership, not by years of experience alone.
About the Role
As a Senior or Principal Machine Learning Engineer focused on inference serving frameworks, you will lead, or serve as a core member of, a team building state-of-the-art inference serving and cluster scheduling capabilities.
You will work alongside hardware and software experts to architect high-performance inference stacks and design resource scheduling strategies that push the frontier of efficiency for large-scale open-source models on custom AI infrastructure.
Key Responsibilities
- Design, develop, and tune multi-node inference techniques to optimize throughput and latency.
- Apply strategies such as tensor parallelism, pipeline parallelism, expert parallelism, continuous batching, and KV cache management.
- Optimize at the intersection of compute, networking, and storage for large-scale model serving.
- Drive performance improvements in inference serving frameworks such as vLLM and SGLang, and in underlying systems such as PyTorch.
- Develop advanced cluster scheduling algorithms to improve throughput, latency, and resource utilization.
- Engage with the open-source community to upstream optimizations, influence roadmaps, and support long-term maintainability.
- Apply best practices in benchmarking, testing, profiling, and debugging to maintain a robust, production-grade stack.
Experience and Qualifications
- Strong proficiency in Python, C++, and PyTorch.
- Demonstrated history of shipping high-quality software in a startup or fast-paced technical environment.
- Hands-on development experience with one or more LLM inference serving frameworks, such as vLLM, SGLang, or comparable systems.
- Deep understanding of LLM inference internals, including KV cache management, batching, attention mechanisms, and serving-time performance tradeoffs.
- Experience running and optimizing large-scale workloads on heterogeneous clusters.
- Familiarity with networking, storage management, distributed scheduling, or related systems.
- Proficiency in performance analysis and systems-level debugging.
- GPU kernel development experience using CUDA, Triton, ROCm, or similar technologies is a plus.
- Master’s or PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field, or equivalent practical experience.
Bonus Experience
- Experience contributing to or maintaining open-source inference-serving frameworks.
- Familiarity with advanced scheduling or memory systems for LLM serving.
- Experience optimizing inference workloads on custom or heterogeneous AI hardware.
- Understanding of cluster-scale bottlenecks across compute, memory, networking, and storage.
- Prior experience at stealth-mode, early-stage, or fast-moving infrastructure startups.
What We Offer
- Opportunity to work on next-generation AI inference infrastructure.
- Direct collaboration with hardware and software experts.
- High ownership over core serving and scheduling systems.
- Fast-paced startup environment with significant technical scope and impact.