Inference Engineer

GW487
  • $200,000–$350,000
  • Santa Clara, CA
  • Permanent

About the job


Senior / Principal Machine Learning Engineer – Inference Serving Frameworks

Full-time | On-site | Bay Area


About the Company


We are a VC-backed, stealth-mode startup building rack-level AI inference systems. Our differentiated system-on-chip architecture enables system-level innovations that maximize efficiency for data-center-scale inference serving.


The team is building hardware and extending open-source software to serve leading-edge models with extreme efficiency. We are looking for highly skilled engineers who can help architect and optimize large-scale inference systems across software, hardware, networking, and scheduling.

Leveling is determined by scope, ownership, and leadership, not years of experience alone.


About the Role


As a Senior or Principal Machine Learning Engineer focused on inference serving frameworks, you will lead, or serve as a core member of, a team building state-of-the-art inference serving and cluster scheduling capabilities.


You will work alongside hardware and software experts to architect high-performance inference stacks and design resource scheduling strategies that push the frontier of efficiency for large-scale open-source models on custom AI infrastructure.


Key Responsibilities


  • Design, develop, and tune multi-node inference techniques to maximize throughput and minimize latency.
  • Apply strategies such as tensor parallelism, pipeline parallelism, expert parallelism, continuous batching, and KV cache management.
  • Optimize at the intersection of compute, networking, and storage for large-scale model serving.
  • Drive performance improvements in inference frameworks such as vLLM, SGLang, PyTorch, or similar systems.
  • Develop advanced cluster scheduling algorithms to improve throughput, latency, and resource utilization.
  • Engage with the open-source community to upstream optimizations, influence roadmaps, and support long-term maintainability.
  • Apply best practices in benchmarking, testing, profiling, and debugging to maintain a robust production-grade stack.


Experience and Qualifications


  • Strong proficiency in Python, C++, and PyTorch.
  • Demonstrated history of shipping high-quality software in a startup or fast-paced technical environment.
  • Experience as a developer of one or more LLM inference serving frameworks, such as vLLM, SGLang, or comparable systems.
  • Deep understanding of LLM inference internals, including KV cache management, batching, attention mechanisms, and serving-time performance tradeoffs.
  • Experience running and optimizing large-scale workloads on heterogeneous clusters.
  • Familiarity with networking, storage management, distributed scheduling, or related systems.
  • Proficiency in performance analysis and systems-level debugging.
  • GPU kernel development experience using CUDA, Triton, ROCm, or similar technologies is a plus.
  • Master’s or PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field, or equivalent practical experience.


Bonus Experience


  • Experience contributing to or maintaining open-source inference-serving frameworks.
  • Familiarity with advanced scheduling or memory systems for LLM serving.
  • Experience optimizing inference workloads on custom or heterogeneous AI hardware.
  • Understanding of cluster-scale bottlenecks across compute, memory, networking, and storage.
  • Prior experience in stealth, early-stage, or fast-moving infrastructure startups.


What We Offer


  • Opportunity to work on next-generation AI inference infrastructure.
  • Direct collaboration with hardware and software experts.
  • High ownership over core serving and scheduling systems.
  • Fast-paced startup environment with significant technical scope and impact.


Kelly Dougherty, Researcher
