How top AI teams build LLM judges for reliable products

Executive overview

As AI products scale, manual evaluation becomes a bottleneck. LLM judges — AI models that critique other AI output — automate this at scale, but only work when built on a solid eval foundation first.

The process: set up binary evals, hire a domain expert to judge real outputs, use those judgments to build an LLM judge, then continuously align the judge back to the human over time.

The core insight: an LLM judge is only as good as the human judgments it is built from, so capturing the expert's reasoning well is the most important step.

Why evals come before judges

AI products involve multi-step chains; failures are invisible without instrumentation.
Binary evals (pass/fail, yes/no) are more actionable than graded scales such as 1 to 5: the difference between a 3 and a 4 is subjective and hard to act on.
Focus on the 20% of failure points causing 80% of errors before automating anything.
Automate a broken process and you scale the brokenness.
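One way to find the 20% of failure points worth fixing first is to tally the failure labels from binary evals and keep the most frequent ones until they cover roughly 80% of all failures. The sketch below assumes each failed eval has already been tagged with a failure mode; the label names and the `top_failure_modes` helper are hypothetical, not part of any tool named in this summary.

```python
from collections import Counter

def top_failure_modes(failure_labels, coverage=0.8):
    """Return the most frequent failure modes that together account for
    at least `coverage` of all labeled failures."""
    counts = Counter(failure_labels)
    total = sum(counts.values())
    selected, covered = [], 0
    for mode, n in counts.most_common():
        # Stop as soon as the modes picked so far reach the coverage target.
        if covered / total >= coverage:
            break
        selected.append((mode, n))
        covered += n
    return selected

# Hypothetical labels collected from failed binary evals.
failures = (["wrong_tool"] * 5 + ["hallucinated_field"] * 3
            + ["bad_format"] + ["timeout"])
for mode, n in top_failure_modes(failures):
    print(f"{mode}: {n}")
```

Here two failure modes out of four account for 8 of the 10 failures, so those two are the ones to address before any automation.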

Getting the right expert

Developers rarely know what "good" looks like in a specialised domain — hire a lawyer, doctor, or accountant who does.
Experts surface subconscious nuance when forced to give reasons, not just verdicts.
Early expert involvement creates buy-in for the final product.
Experts from low-context communication cultures (remote-first, async-heavy) tend to make better judges because they default to over-explaining.

Generating evaluation data

Data needs to cover the full "shape" of user interactions: personas (new user, expert, elderly), features (email summary, meeting scheduling), and scenarios (no match found, multiple matches, invalid input).
Fill the shape with real user data first; use synthetic AI-generated data to cover gaps.
For synthetic generation, prompt a separate LLM with the persona + scenario + assumption (e.g. "frustrated customer, order number does not exist") to create realistic inputs.
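The data "shape" described above can be treated as a grid of persona x feature x scenario cells. A minimal sketch, assuming the example values from this section: it enumerates every cell, finds cells that real user data does not cover, and drafts a generation prompt for each gap. The prompt wording and the helper names are illustrative assumptions; sending the prompt to a separate LLM is left out.

```python
from itertools import product

PERSONAS = ["new user", "expert", "elderly"]
FEATURES = ["email summary", "meeting scheduling"]
SCENARIOS = ["no match found", "multiple matches", "invalid input"]

def build_grid(personas, features, scenarios):
    """Enumerate every (persona, feature, scenario) cell of the data shape."""
    return list(product(personas, features, scenarios))

def uncovered_cells(grid, real_examples):
    """Return the cells that no real user example covers yet."""
    covered = {(e["persona"], e["feature"], e["scenario"]) for e in real_examples}
    return [cell for cell in grid if cell not in covered]

def generation_prompt(persona, feature, scenario):
    """Draft a prompt for a separate LLM to write one synthetic input."""
    return (
        f"Write one realistic message from a user matching this persona: {persona}. "
        f"They are using the '{feature}' feature. "
        f"Assume this situation: {scenario}. Return only the user's message."
    )

grid = build_grid(PERSONAS, FEATURES, SCENARIOS)

# Hypothetical real data that already covers one cell.
real = [{"persona": "expert", "feature": "email summary", "scenario": "invalid input"}]
gaps = uncovered_cells(grid, real)
print(len(grid), "cells,", len(gaps), "need synthetic data")
print(generation_prompt(*gaps[0]))
```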

Running the human judgment phase

Feed each generated input through the AI product and capture the output (traces via tools like Phoenix, Braintrust, or Humanloop).
Present outputs to the expert in a zero-friction format — Excel works; a simple UI works.
Collect both a binary verdict and a written reason for every item, including passes.
Instruct the expert to write reasons as if explaining to a new employee — enough context for the AI to learn from.
Even passing responses should get critique: noting what could have been improved raises the quality ceiling.
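A spreadsheet-friendly review sheet for the expert can be sketched as a CSV with empty verdict and reason columns, plus a check that flags rows where either is missing (passes included). The column names and both helpers are assumptions for illustration; the traces would come from whatever tracing tool is in use.

```python
import csv
import io

REVIEW_COLUMNS = ["id", "input", "output", "verdict", "reason"]

def write_review_sheet(traces, fileobj):
    """Write one row per trace, leaving verdict and reason for the expert."""
    writer = csv.DictWriter(fileobj, fieldnames=REVIEW_COLUMNS)
    writer.writeheader()
    for i, trace in enumerate(traces, start=1):
        writer.writerow({"id": i, "input": trace["input"],
                         "output": trace["output"], "verdict": "", "reason": ""})

def incomplete_rows(fileobj):
    """Return ids of rows lacking a pass/fail verdict or a written reason."""
    missing = []
    for row in csv.DictReader(fileobj):
        verdict_ok = row["verdict"].strip().lower() in {"pass", "fail"}
        reason_ok = bool(row["reason"].strip())
        if not (verdict_ok and reason_ok):
            missing.append(int(row["id"]))
    return missing

# Hypothetical traces; an in-memory buffer stands in for a real file.
traces = [
    {"input": "Where is order 123?", "output": "I could not find order 123."},
    {"input": "Summarise my inbox", "output": "You have 4 unread emails."},
]
sheet = io.StringIO(newline="")
write_review_sheet(traces, sheet)
sheet.seek(0)
print(incomplete_rows(sheet))  # every row is incomplete until the expert fills it in
```

Opened in Excel, the same file gives the expert two columns to fill in per row; rerunning `incomplete_rows` on the returned file catches passes that were left without a reason.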

Building the LLM judge system prompt

Structure the system prompt in four parts:

Role and task — what the judge is evaluating and what "good" means in context.
Additional context — domain knowledge the judge needs (e.g. query language syntax, business rules).
Guidelines — analyze carefully, apply domain lens, return a binary verdict plus a written reason.
Few-shot examples — include 5–15 human judgment examples (question + AI response + critique + verdict), wrapped in XML delimiters so the model can segment them.

XML tags act as delimiters, helping the model distinguish sections of the prompt.
A larger context window leaves room for more few-shot examples, which generally improves the judge's calibration.
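The four-part structure above can be assembled programmatically. This is a minimal sketch, not a prescribed template: the tag names (`role`, `context`, `guidelines`, `examples`, `example`) and the sample text are assumptions. Example content is escaped so that characters such as `<` inside a question cannot be mistaken for a delimiter.

```python
from xml.sax.saxutils import escape

def format_example(ex):
    """Wrap one human judgment in XML-style tags (tag names are illustrative)."""
    fields = ["question", "response", "critique", "verdict"]
    inner = "\n".join(f"<{f}>{escape(ex[f])}</{f}>" for f in fields)
    return f"<example>\n{inner}\n</example>"

def build_judge_prompt(role, context, guidelines, examples):
    """Assemble the four sections, each inside its own delimiter tag."""
    sections = [
        ("role", role),
        ("context", context),
        ("guidelines", guidelines),
        ("examples", "\n".join(format_example(e) for e in examples)),
    ]
    return "\n\n".join(f"<{tag}>\n{body}\n</{tag}>" for tag, body in sections)

# Hypothetical expert judgments collected in the human judgment phase.
EXAMPLES = [
    {"question": "Show orders where total < 100",
     "response": "SELECT * FROM orders WHERE total < 100",
     "critique": "Correct filter, but selecting every column is wasteful.",
     "verdict": "pass"},
    {"question": "Count customers by country",
     "response": "SELECT country FROM customers",
     "critique": "No aggregation or GROUP BY, so nothing is counted.",
     "verdict": "fail"},
]

prompt = build_judge_prompt(
    role="You evaluate whether a generated query answers the user's question.",
    context="Queries use standard SQL against the orders and customers tables.",
    guidelines="Analyze carefully. Return a verdict (pass or fail) and a written reason.",
    examples=EXAMPLES,
)
print(prompt)
```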

Advanced: adaptive few-shot retrieval

If the context window is limited, swap the examples section dynamically at inference time.
Route incoming questions to a topic-specific example bank, pull the most relevant human judgments, and inject them into the prompt in near-real time.
This improves the judge's calibration for each question type, not just globally.
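The routing and retrieval steps can be sketched as follows. Loud assumptions: topics are routed with a hypothetical keyword map, and relevance is scored with simple word overlap (Jaccard similarity) as a stand-in for whatever embedding search a production system would more likely use. The selected examples would then replace the few-shot section of the judge prompt.

```python
import re

# Hypothetical keyword map used to route a question to an example bank.
TOPIC_KEYWORDS = {
    "billing": {"refund", "invoice", "charge"},
    "scheduling": {"meeting", "calendar", "reschedule"},
}

def tokens(text):
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def route_topic(question, default="general"):
    """Pick the topic whose keywords overlap the question most."""
    q = tokens(question)
    best = max(TOPIC_KEYWORDS, key=lambda t: len(q & TOPIC_KEYWORDS[t]))
    return best if q & TOPIC_KEYWORDS[best] else default

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def select_examples(question, example_bank, k=3):
    """Return the k human judgments most similar to the incoming question,
    preferring the routed topic and falling back to the whole bank."""
    topic = route_topic(question)
    candidates = [e for e in example_bank if e["topic"] == topic] or example_bank
    q = tokens(question)
    ranked = sorted(candidates,
                    key=lambda e: jaccard(q, tokens(e["question"])),
                    reverse=True)
    return ranked[:k]

# Hypothetical bank of expert-judged examples.
BANK = [
    {"topic": "billing", "question": "Why was I charged twice?", "verdict": "fail"},
    {"topic": "billing", "question": "How do I get a refund?", "verdict": "pass"},
    {"topic": "scheduling", "question": "Move my meeting to Friday", "verdict": "pass"},
]

picked = select_examples("Can I get a refund for this charge?", BANK, k=1)
print(picked[0]["question"])
```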

Evaluating and improving the judge

Sample ~50 diverse conversations from the judge's output, covering the full data shape.
Have the human expert review the judge's verdicts and critiques — the human is now judging the judge.
Track an agreement column: did the LLM judge match the human's verdict (true/false)?
Periodically (weekly, monthly, or after major product changes) update the judge's system prompt with new insights from the review.
Goal: push judge agreement toward human-level accuracy at production scale.
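The agreement column reduces to a simple rate. A minimal sketch, assuming each reviewed item records the human's and the judge's verdict as "pass" or "fail" (the field names are hypothetical); it also splits disagreements by direction, since a judge that passes what the expert fails is usually the riskier error.

```python
def agreement_report(rows):
    """Summarise how often the LLM judge matched the human expert."""
    n = len(rows)
    agree = sum(r["human_verdict"] == r["judge_verdict"] for r in rows)
    # Judge said pass where the expert said fail, and the reverse.
    false_pass = [r["id"] for r in rows
                  if r["judge_verdict"] == "pass" and r["human_verdict"] == "fail"]
    false_fail = [r["id"] for r in rows
                  if r["judge_verdict"] == "fail" and r["human_verdict"] == "pass"]
    return {
        "n": n,
        "agreement": agree / n if n else 0.0,
        "false_pass_ids": false_pass,
        "false_fail_ids": false_fail,
    }

# Hypothetical review sample (a real one would have ~50 rows).
reviewed = [
    {"id": 1, "human_verdict": "pass", "judge_verdict": "pass"},
    {"id": 2, "human_verdict": "fail", "judge_verdict": "pass"},
    {"id": 3, "human_verdict": "fail", "judge_verdict": "fail"},
    {"id": 4, "human_verdict": "pass", "judge_verdict": "pass"},
]
report = agreement_report(reviewed)
print(f"agreement: {report['agreement']:.0%}")
print("review these first:", report["false_pass_ids"])
```

The disagreeing rows, together with the expert's written reasons, are the natural candidates for new few-shot examples or guideline changes in the next prompt update.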
