How top AI teams build LLM judges for reliable products
Executive overview
As AI products scale, manual evaluation becomes a bottleneck. LLM judges — AI models that critique other AI output — automate this at scale, but only work when built on a solid eval foundation first.
The process: set up binary evals, hire a domain expert to judge real outputs, use those judgments to build an LLM judge, then continuously align the judge back to the human over time.
The core insight: an LLM judge is only as good as the human judgments it is built from, so capturing the expert's reasoning is the most important step.
Why evals come before judges
- AI products involve multi-step chains; failures are invisible without instrumentation.
- Binary evals (pass/fail, yes/no) outperform graded scales such as 1–5: the difference between a 3 and a 4 is subjective and hard to act on.
- Focus on the 20% of failure points causing 80% of errors before automating anything.
- Automate a broken process and you scale the brokenness.
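As a rough illustration of the 80/20 point above, the sketch below tallies binary eval results by failure point and returns the few that account for most failures. The record layout, the `failure_point` labels, and the `top_failure_points` helper are assumptions made for this example, not part of any particular eval tool.

```python
from collections import Counter

# Hypothetical binary eval records. "failure_point" labels are illustrative,
# not taken from any real product.
results = (
    [{"passed": True, "failure_point": None}] * 10
    + [{"passed": False, "failure_point": "retrieval"}] * 5
    + [{"passed": False, "failure_point": "formatting"}] * 3
    + [{"passed": False, "failure_point": "tool_call"}] * 1
    + [{"passed": False, "failure_point": "tone"}] * 1
)


def top_failure_points(records, coverage=0.8):
    """Return the most frequent failure points that together reach `coverage` of all failures."""
    counts = Counter(r["failure_point"] for r in records if not r["passed"])
    total = sum(counts.values())
    selected, covered = [], 0
    for point, n in counts.most_common():
        if covered / total >= coverage:
            break  # the points selected so far already explain enough failures
        selected.append(point)
        covered += n
    return selected


print(top_failure_points(results))
```

With the sample data, two failure points (retrieval and formatting) cover 8 of the 10 failures, so those would be the ones to fix before automating anything.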
Getting the right expert
- Developers rarely know what "good" looks like in a specialised domain — hire a lawyer, doctor, or accountant who does.
- Experts surface subconscious nuance when forced to give reasons, not just verdicts.
- Early expert involvement creates buy-in for the final product.
- Low-context communication cultures (remote-first, async-heavy) tend to produce better judges because they default to over-explaining.
Generating evaluation data
- Data needs to cover the full "shape" of user interactions: personas (new user, expert, elderly), features (email summary, meeting scheduling), and scenarios (no match found, multiple matches, invalid input).
- Fill the shape with real user data first; use synthetic AI-generated data to cover gaps.
- For synthetic generation, prompt a separate LLM with the persona + scenario + assumption (e.g. "frustrated customer, order number does not exist") to create realistic inputs.
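One way to picture the data "shape" is as a grid of persona × feature × scenario cells. The sketch below enumerates that grid and builds a generation prompt for every cell not already covered by real user data. The `synthetic_prompts` helper, the prompt wording, and the `covered` set are hypothetical. Sending each prompt to a separate LLM is left out.

```python
from itertools import product

personas = ["new user", "expert", "elderly"]
features = ["email summary", "meeting scheduling"]
scenarios = ["no match found", "multiple matches", "invalid input"]


def synthetic_prompts(personas, features, scenarios, covered=frozenset()):
    """Build one generation prompt per (persona, feature, scenario) cell lacking real data."""
    prompts = []
    for cell in product(personas, features, scenarios):
        if cell in covered:
            continue  # real user data already fills this cell
        persona, feature, scenario = cell
        prompts.append({
            "cell": cell,
            "prompt": (
                f"Write one realistic user message for the '{feature}' feature. "
                f"Persona: {persona}. Scenario: {scenario}. "
                "Return only the message text."
            ),
        })
    return prompts


# Cells already covered by real traffic (hypothetical).
covered = {("expert", "email summary", "invalid input")}
prompts = synthetic_prompts(personas, features, scenarios, covered)
print(len(prompts), prompts[0]["prompt"])
```

Keeping the cell alongside each prompt makes it easy to check later that every part of the shape ended up with at least one input.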
Running the human judgment phase
- Feed each generated input through the AI product and capture the output as a trace, using a tool like Phoenix, BrainTrust, or Humanloop.
- Present outputs to the expert in a zero-friction format — Excel works; a simple UI works.
- Collect both a binary verdict and a written reason for every item, including passes.
- Instruct the expert to write reasons as if explaining to a new employee — enough context for the AI to learn from.
- Even passing responses should get critique: noting what could have been improved raises the quality ceiling.
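A low-friction review format can be as plain as a CSV the expert opens in Excel. The sketch below writes captured input/output pairs to a sheet with empty `verdict` and `reason` columns, then flags returned rows that lack a binary verdict or a written reason. The column names and both helpers are assumptions for illustration.

```python
import csv
import io

FIELDS = ["input", "output", "verdict", "reason"]


def write_review_sheet(traces, handle):
    """Write one row per trace, leaving verdict and reason blank for the expert."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for trace in traces:
        writer.writerow({
            "input": trace["input"],
            "output": trace["output"],
            "verdict": "",
            "reason": "",
        })


def validate_judgments(rows):
    """Return indexes of rows missing a pass/fail verdict or a written reason."""
    problems = []
    for i, row in enumerate(rows):
        verdict_ok = row["verdict"].strip().lower() in {"pass", "fail"}
        reason_ok = bool(row["reason"].strip())
        if not (verdict_ok and reason_ok):
            problems.append(i)
    return problems


# Usage with an in-memory file; a real run would open a .csv path instead.
buffer = io.StringIO()
write_review_sheet([{"input": "Where is my order?", "output": "It ships today."}], buffer)
sheet_rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(sheet_rows)
```

The validation step enforces the rule that passes need a reason too, since a bare "pass" gives the judge nothing to learn from.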
Building the LLM judge system prompt
Structure the system prompt in four parts:
- Role and task — what the judge is evaluating and what "good" means in context.
- Additional context — domain knowledge the judge needs (e.g. query language syntax, business rules).
- Guidelines — analyze carefully, apply domain lens, return a binary verdict plus a written reason.
- Few-shot examples: include 5–15 human judgment examples (question + AI response + critique + verdict), each wrapped in XML tags. The tags act as delimiters that help the model tell the examples apart from the rest of the prompt.
A larger context window leaves room for more few-shot examples, which tends to improve calibration.
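The four-part structure can be assembled mechanically. The following is a minimal sketch, assuming tag names (`<role>`, `<context>`, `<guidelines>`, `<examples>`, and so on) and sample texts chosen for illustration; the source does not prescribe specific tags. Example text is escaped so stray angle brackets cannot be mistaken for delimiters.

```python
from xml.sax.saxutils import escape


def render_example(ex):
    """Wrap one human judgment in XML tags (tag names are an assumption)."""
    return (
        "<example>\n"
        f"<question>{escape(ex['question'])}</question>\n"
        f"<response>{escape(ex['response'])}</response>\n"
        f"<critique>{escape(ex['critique'])}</critique>\n"
        f"<verdict>{escape(ex['verdict'])}</verdict>\n"
        "</example>"
    )


def build_judge_prompt(role, context, guidelines, examples):
    """Join the four sections: role and task, context, guidelines, few-shot examples."""
    rendered = "\n".join(render_example(ex) for ex in examples)
    parts = [
        f"<role>\n{role}\n</role>",
        f"<context>\n{context}\n</context>",
        f"<guidelines>\n{guidelines}\n</guidelines>",
        f"<examples>\n{rendered}\n</examples>",
    ]
    return "\n\n".join(parts)


# Hypothetical content for each section.
examples = [
    {"question": "Where is order 123?", "response": "It arrives Friday.",
     "critique": "Order 123 does not exist; the date is invented.", "verdict": "fail"},
    {"question": "Can I change my address?", "response": "Yes, under Account > Profile.",
     "critique": "Correct path; could mention the cutoff before shipping.", "verdict": "pass"},
]
prompt = build_judge_prompt(
    role="You evaluate customer-support replies for factual accuracy.",
    context="Orders can only be looked up by a valid order number.",
    guidelines="Analyze carefully, then return a pass/fail verdict and a written reason.",
    examples=examples,
)
print(prompt)
```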
Advanced: adaptive few-shot retrieval
- If the context window is limited, swap the examples section dynamically at inference time.
- Route incoming questions to a topic-specific example bank, pull the most relevant human judgments, and inject them into the prompt in near-real time.
- This calibrates the judge per question type, not just globally.
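A minimal sketch of the retrieval step follows. It assumes the topic has already been determined by some router, and it ranks examples by word overlap (Jaccard similarity) as a stand-in for whatever relevance measure a production system would use, such as embedding similarity. The bank contents, the `id` field, and `select_examples` are hypothetical.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of tokens two sets have in common (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_examples(question, topic, banks, k=2):
    """Pick the k human judgments from the topic's bank most similar to the question."""
    query = tokens(question)
    bank = banks.get(topic, [])
    ranked = sorted(bank, key=lambda ex: jaccard(query, tokens(ex["question"])), reverse=True)
    return ranked[:k]


# Hypothetical topic-specific example banks; critique and verdict fields omitted for brevity.
banks = {
    "billing": [
        {"id": "double_charge", "question": "Why was I charged twice this month?"},
        {"id": "update_card", "question": "How do I update my credit card?"},
        {"id": "refund", "question": "Can I get a refund for my last invoice?"},
    ],
    "scheduling": [
        {"id": "reschedule", "question": "Can I move my meeting to Friday?"},
    ],
}

chosen = select_examples("I was charged twice, can I get a refund?", "billing", banks)
print([ex["id"] for ex in chosen])
```

The selected examples would then replace the few-shot section of the judge prompt for that one request.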
Evaluating and improving the judge
- Sample ~50 diverse conversations from the judge's output, covering the full data shape.
- Have the human expert review the judge's verdicts and critiques — the human is now judging the judge.
- Track an agreement column: did the LLM judge match the human's verdict (true/false)?
- Periodically (weekly, monthly, or after major product changes) update the judge's system prompt with new insights from the review.
- Goal: push judge agreement toward human-level accuracy at production scale.
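The agreement column reduces to a simple rate. The sketch below computes it from a reviewed sample and lists the disagreements, which are the items worth reading when updating the system prompt. The field names and sample rows are assumptions for illustration.

```python
def agreement_report(rows):
    """Compute the share of items where the judge matched the human, plus the mismatches."""
    disagreements = [r["id"] for r in rows if r["human_verdict"] != r["judge_verdict"]]
    agreed = len(rows) - len(disagreements)
    rate = agreed / len(rows) if rows else 0.0
    return {"agreement_rate": rate, "disagreements": disagreements}


# Hypothetical review sample; a real one would hold around 50 items.
reviewed = [
    {"id": 1, "human_verdict": "pass", "judge_verdict": "pass"},
    {"id": 2, "human_verdict": "fail", "judge_verdict": "fail"},
    {"id": 3, "human_verdict": "fail", "judge_verdict": "pass"},
    {"id": 4, "human_verdict": "pass", "judge_verdict": "pass"},
    {"id": 5, "human_verdict": "fail", "judge_verdict": "fail"},
]

report = agreement_report(reviewed)
print(report)
```

Tracking this number across review cycles shows whether each prompt update actually moved the judge closer to the expert.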