Lead LLM Evals Engineer

GW267
  • $250,000-$350,000
  • San Francisco, CA
  • Permanent

About the job


Lead LLM Evals Engineer | SF or Redwood City


I'm hiring a Lead LLM Evals Engineer to join an early-stage physical AI startup building systems with the general physical ability to experiment, engineer, and manufacture anything. They’re a small, deeply technical team pushing agentic LLMs into real autonomous workflows tied to physical systems, factories, and end-to-end execution.


This role owns the evaluation and verification layer for agentic LLM systems operating in complex, long-horizon environments. You’ll build eval harnesses, automated verifiers, and regression gates that determine whether agents can actually plan, execute, recover, and ship real outcomes across simulated and real-world workflows. The work directly shapes how fast these systems improve, how safely they operate, and whether progress is real or illusory.


→ Build eval harnesses for agentic LLM systems in complex workflows

→ Design verifiers for planning, execution, recovery, and constraint adherence

→ Turn eval failures into training signals with research and systems teams


Both Senior & Lead levels considered.


Interested? Apply now!


Nick Bell, ML Research & Engineering Recruiter
