How top AI teams build LLM judges for reliable products

Executive overview

As AI products scale, manual evaluation becomes a bottleneck. LLM judges (AI models that critique another model's output) automate evaluation at scale, but they only work when built on a solid eval foundation.

The process: set up binary evals, hire a domain expert to judge real outputs, use those judgments to build an LLM judge, then keep re-aligning the judge with the expert's verdicts.

The core insight: an LLM judge is only as good as the human judgments it is built from, so capturing the expert's reasoning is the most important step.

Why evals come before judges

  • AI products involve multi-step chains; failures are invisible without instrumentation.
  • Binary evals (pass/fail, yes/no) outperform graded scales such as 1 to 5: the difference between a 3 and a 4 is subjective and hard to act on.
  • Focus on the 20% of failure points causing 80% of errors before automating anything.
  • Automate a broken process and you scale the brokenness.
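
A minimal sketch of the 80/20 step, assuming each failed binary eval has already been tagged with the chain step that caused it. The step names and the `top_failure_points` helper are hypothetical, not taken from any particular tool.

```python
from collections import Counter

def top_failure_points(failed_steps, coverage=0.8):
    """Return the most frequent failure points that together reach `coverage` of all failures."""
    counts = Counter(failed_steps)
    total = sum(counts.values())
    selected, covered = [], 0
    for step, n in counts.most_common():
        # Stop once the steps picked so far explain enough of the failures.
        if covered / total >= coverage:
            break
        selected.append(step)
        covered += n
    return selected

# Hypothetical labels: one entry per failed eval, naming the step that failed.
failures = ["retrieval"] * 6 + ["sql_generation"] * 2 + ["formatting", "tone"]
print(top_failure_points(failures))  # ['retrieval', 'sql_generation']
```

Here two of the four failure points account for 8 of the 10 failures, so they are the ones worth fixing before any judge is built.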

Getting the right expert

  • Developers rarely know what "good" looks like in a specialised domain — hire a lawyer, doctor, or accountant who does.
  • Experts surface subconscious nuance when forced to give reasons, not just verdicts.
  • Early expert involvement creates buy-in for the final product.
  • Experts from low-context communication cultures (remote-first, async-heavy) tend to make better judges because they default to over-explaining.

Generating evaluation data

  • Data needs to cover the full "shape" of user interactions: personas (new user, expert, elderly), features (email summary, meeting scheduling), and scenarios (no match found, multiple matches, invalid input).
  • Fill the shape with real user data first; use synthetic AI-generated data to cover gaps.
  • For synthetic generation, prompt a separate LLM with the persona + scenario + assumption (e.g. "frustrated customer, order number does not exist") to create realistic inputs.
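
The bullets above can be sketched as a grid over personas, features, and scenarios: find the cells that real data does not cover, then build one generation prompt per gap. This is an illustration under assumptions. The axis values, the prompt wording, and the helper names are all hypothetical, and the call to the separate LLM is left out.

```python
from itertools import product

# Hypothetical axes of the data "shape"; replace with your product's own.
PERSONAS = ["new user", "expert", "elderly user"]
FEATURES = ["email summary", "meeting scheduling"]
SCENARIOS = ["no match found", "multiple matches", "invalid input"]

def missing_cells(real_cells):
    """Return every persona/feature/scenario combination with no real user data."""
    return [cell for cell in product(PERSONAS, FEATURES, SCENARIOS)
            if cell not in real_cells]

def generation_prompt(persona, feature, scenario):
    """Build the instruction for a separate LLM that writes the synthetic input."""
    return (
        f"Write one realistic message from a {persona} using the {feature} feature. "
        f"Assume this situation: {scenario}. Return only the user's message."
    )

# Cells already covered by real traffic (illustrative value).
real_cells = {("new user", "email summary", "no match found")}
prompts = [generation_prompt(*cell) for cell in missing_cells(real_cells)]
print(len(prompts))  # 17 of the 18 cells still need synthetic data
```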

Running the human judgment phase

  • Feed each generated input through the AI product and capture the output as traces, using tools like Phoenix, Braintrust, or Humanloop.
  • Present outputs to the expert in a zero-friction format — Excel works; a simple UI works.
  • Collect both a binary verdict and a written reason for every item, including passes.
  • Instruct the expert to write reasons as if explaining to a new employee — enough context for the AI to learn from.
  • Even passing responses should get critique: noting what could have been improved raises the quality ceiling.
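
One way to hold the expert to "verdict plus reason on every item" is to reject any record without a critique before it reaches the spreadsheet. The sketch below assumes a flat CSV that opens in Excel. The `HumanJudgment` fields and the `to_csv` helper are hypothetical names chosen for illustration.

```python
import csv
import io
from dataclasses import asdict, dataclass

@dataclass
class HumanJudgment:
    user_input: str
    ai_output: str
    passed: bool    # binary verdict
    critique: str   # written reason, required for passes too

    def __post_init__(self):
        if not self.critique.strip():
            raise ValueError("every item needs a written reason, including passes")

def to_csv(judgments):
    """Serialize judgments to CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["user_input", "ai_output", "passed", "critique"]
    )
    writer.writeheader()
    for judgment in judgments:
        writer.writerow(asdict(judgment))
    return buffer.getvalue()

rows = [
    HumanJudgment(
        user_input="Where is order 1234?",
        ai_output="I could not find that order.",
        passed=True,
        critique="Correct, but it should also ask the customer to re-check the number.",
    )
]
print(to_csv(rows))
```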

Building the LLM judge system prompt

Structure the system prompt in four parts:

  1. Role and task — what the judge is evaluating and what "good" means in context.
  2. Additional context — domain knowledge the judge needs (e.g. query language syntax, business rules).
  3. Guidelines — analyze carefully, apply domain lens, return a binary verdict plus a written reason.
  4. Few-shot examples — include 5–15 human judgment examples (question + AI response + critique + verdict), wrapped in XML delimiters so the model can segment them.
  • XML tags act as delimiters that help the model tell the sections of the prompt apart.
  • A larger context window leaves room for more few-shot examples, which generally improves calibration.
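
The four-part structure can be sketched as a small prompt builder. The tag names (`<role>`, `<context>`, `<guidelines>`, `<examples>`) and the sample content are assumptions made for illustration. The source only says that examples are wrapped in XML delimiters, not which tags to use.

```python
from xml.sax.saxutils import escape

def render_example(example):
    """Wrap one human judgment in XML tags so the model can segment it."""
    verdict = "pass" if example["passed"] else "fail"
    return (
        "<example>\n"
        f"<question>{escape(example['question'])}</question>\n"
        f"<response>{escape(example['response'])}</response>\n"
        f"<critique>{escape(example['critique'])}</critique>\n"
        f"<verdict>{verdict}</verdict>\n"
        "</example>"
    )

def build_judge_prompt(role, context, guidelines, examples):
    """Assemble the four parts: role, context, guidelines, few-shot examples."""
    rendered = "\n".join(render_example(e) for e in examples)
    return (
        f"<role>\n{role}\n</role>\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"<guidelines>\n{guidelines}\n</guidelines>\n\n"
        f"<examples>\n{rendered}\n</examples>"
    )

# Illustrative content only; real examples come from the expert's judgments.
prompt = build_judge_prompt(
    role="You evaluate whether a generated query answers the user's question.",
    context="Orders live in the `orders` table; `total` is in US dollars.",
    guidelines="Analyze carefully. Return a critique, then a verdict of pass or fail.",
    examples=[{
        "question": "Show orders over $100",
        "response": "SELECT * FROM orders WHERE total > 100",
        "critique": "Filters on the right column with the right threshold.",
        "passed": True,
    }],
)
print(prompt)
```

Escaping the example text keeps characters such as `>` in a response from being read as part of the delimiters.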

Advanced: adaptive few-shot retrieval

  • If the context window is limited, swap the examples section dynamically at inference time.
  • Route incoming questions to a topic-specific example bank, pull the most relevant human judgments, and inject them into the prompt in near-real time.
  • This calibrates the judge per question type rather than only globally.
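
A rough sketch of the routing and retrieval steps follows. It assumes keyword-based topic routing and word-overlap (Jaccard) similarity as a stand-in for whatever retrieval method a real system would use, such as embeddings. The example bank, keyword sets, and function names are hypothetical.

```python
def tokenize(text):
    return set(text.lower().split())

def jaccard(a, b):
    """Word-overlap similarity between two token sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def route_topic(question, topic_keywords):
    """Pick the topic whose keywords overlap the question most, or None."""
    words = tokenize(question)
    best_topic, best_hits = None, 0
    for topic, keywords in topic_keywords.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_topic, best_hits = topic, hits
    return best_topic

def select_examples(question, example_bank, topic, k=2):
    """Return the k human judgments most similar to the question within its topic."""
    # Fall back to the whole bank when no example matches the topic.
    candidates = [ex for ex in example_bank if ex["topic"] == topic] or example_bank
    q = tokenize(question)
    ranked = sorted(candidates,
                    key=lambda ex: jaccard(q, tokenize(ex["question"])),
                    reverse=True)
    return ranked[:k]

# Illustrative bank; real entries would also carry response, critique, and verdict.
bank = [
    {"topic": "billing", "question": "why was I charged twice this month"},
    {"topic": "billing", "question": "how do I update my credit card"},
    {"topic": "scheduling", "question": "move my meeting to friday"},
]
keywords = {"billing": {"charged", "invoice", "refund"},
            "scheduling": {"meeting", "calendar"}}

question = "I was charged twice for my subscription"
topic = route_topic(question, keywords)
print(topic, select_examples(question, bank, topic, k=1))
```

The selected examples would then replace the few-shot section of the system prompt for that one request.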

Evaluating and improving the judge

  • Sample ~50 diverse conversations from the judge's output, covering the full data shape.
  • Have the human expert review the judge's verdicts and critiques — the human is now judging the judge.
  • Track an agreement column: did the LLM judge match the human's verdict (true/false)?
  • Periodically (weekly, monthly, or after major product changes) update the judge's system prompt with new insights from the review.
  • Goal: push judge agreement toward human-level accuracy at production scale.
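
The agreement column reduces to a simple comparison per reviewed item. A minimal sketch, assuming each row records the expert's verdict and the judge's verdict as booleans. The field names and sample rows are hypothetical.

```python
def agreement_report(rows):
    """Compare judge verdicts with the expert's and list the items to review."""
    if not rows:
        return {"agreement_rate": None, "disagreements": []}
    disagreements = [r["id"] for r in rows if r["human_pass"] != r["judge_pass"]]
    rate = (len(rows) - len(disagreements)) / len(rows)
    return {"agreement_rate": rate, "disagreements": disagreements}

# Illustrative review sample; a real one would hold about 50 conversations.
review = [
    {"id": "c1", "human_pass": True, "judge_pass": True},
    {"id": "c2", "human_pass": False, "judge_pass": False},
    {"id": "c3", "human_pass": False, "judge_pass": True},
    {"id": "c4", "human_pass": True, "judge_pass": True},
]
print(agreement_report(review))
# {'agreement_rate': 0.75, 'disagreements': ['c3']}
```

The disagreement list is the useful part: those are the items whose expert critiques should feed the next revision of the judge's system prompt.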
