How to build AI evals: a practical guide for product builders
Executive overview
Most teams shipping AI products have no systematic way to know when their product is failing. Evals fill that gap — they are structured data analysis on your LLM application, not just automated tests.
The core process moves from manual trace review to failure categorization to automated judges. The investment is front-loaded and keeps paying off: roughly three to four days up front, then roughly thirty minutes a week.
Evals are the highest-ROI activity an AI product team can engage in — and any PM can lead the process.
What evals actually are
- Evals cover a spectrum: manual data analysis, code-based checks, LLM-as-judge prompts, A/B tests, and product metrics.
- Unit tests are a small subset — jumping straight to tests without data analysis first is the most common mistake.
- The goal is actionable product improvement, not a beautiful eval suite.
- Online monitoring (running judges against production traces daily or hourly) is as important as pre-ship testing.
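To make "code-based checks" concrete, here is a minimal sketch of a deterministic check that needs no LLM. It assumes, purely for illustration, that the application returns a JSON object with a `reply` field. The field name, the `check_reply` function, and the 600-character limit are hypothetical and not taken from the original.

```python
import json


def check_reply(raw_output: str, max_chars: int = 600) -> dict:
    """Run deterministic pass/fail checks on one model output.

    Assumes (hypothetically) that the app emits a JSON object
    containing a string field named "reply".
    """
    results = {}
    try:
        payload = json.loads(raw_output)
        results["valid_json"] = isinstance(payload, dict)
    except json.JSONDecodeError:
        payload = {}
        results["valid_json"] = False

    # Only look up "reply" when the payload really is a JSON object.
    reply = payload.get("reply", "") if isinstance(payload, dict) else ""
    results["has_reply"] = isinstance(reply, str) and bool(reply.strip())
    results["within_length"] = isinstance(reply, str) and len(reply) <= max_chars
    return results
```

Checks like this are cheap enough to run on every trace, which leaves LLM judges for the failure modes that need interpretation.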
Step 1: error analysis — look at your traces
- Start by reviewing raw traces in an observability tool (Braintrust, Arize Phoenix, LangSmith) or in a plain spreadsheet.
- Write a brief note on the first thing you see that is wrong in each trace — the most upstream error only.
- Notes should be informal but specific; vague labels like "janky" cannot be categorized later.
- This step must be done by a human with domain expertise — an LLM will typically report that every trace looks fine.
- Appoint one benevolent dictator: a single domain expert whose judgment is trusted, to avoid committee paralysis and keep the process fast.
- Review until you reach theoretical saturation — the point where new traces stop revealing new failure types. For most products, 40–100 traces is enough.
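One rough way to notice saturation is to track how many traces have been reviewed since a new failure type last appeared. The sketch below assumes the reviewer adds a short tentative tag to each note (or `None` when nothing was wrong). The tagging scheme, both function names, and the 20-trace window are illustrative assumptions, not part of the original process.

```python
def traces_since_new_failure_type(tags):
    """Count how many of the most recent traces added no new failure tag.

    `tags` is in review order; each item is the reviewer's tentative tag
    for that trace, or None when the trace looked fine.
    """
    seen = set()
    last_new_index = -1
    for i, tag in enumerate(tags):
        if tag is not None and tag not in seen:
            seen.add(tag)
            last_new_index = i
    return len(tags) - 1 - last_new_index


def looks_saturated(tags, window=20):
    """Heuristic stop signal: no new failure type in the last `window` traces."""
    return len(tags) >= window and traces_since_new_failure_type(tags) >= window
```

This is only a prompt to stop and reflect. The domain expert still decides whether the sample covered enough of the product's surface.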
Step 2: open coding and axial coding
- Open codes are the raw notes from trace review.
- Feed those notes to an LLM (Claude, ChatGPT, Gemini) and ask it to produce axial codes — clustered failure categories.
- Review the generated categories: rename overly generic labels, merge redundant ones, split anything too broad to be actionable.
- Add a "none of the above" catch-all so the LLM flags gaps in your taxonomy.
- Use a spreadsheet formula or AI prompt to automatically label every trace with an axial code.
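The labeling step can be sketched as a prompt builder plus a strict parser for the answer. No LLM client is called here. `build_labeling_prompt`, `normalize_label`, and the prompt wording are hypothetical, and the model's text answer is assumed to come from whichever client you already use.

```python
CATCH_ALL = "none of the above"


def build_labeling_prompt(note: str, categories: list[str]) -> str:
    """Build a prompt asking for exactly one axial code for one open-code note."""
    options = "\n".join(f"- {c}" for c in [*categories, CATCH_ALL])
    return (
        "Assign the reviewer note below to exactly one failure category.\n"
        f"Categories:\n{options}\n\n"
        f"Note: {note}\n"
        "Answer with the category name only."
    )


def normalize_label(raw_answer: str, categories: list[str]) -> str:
    """Map the model's free-text answer onto a known category.

    Anything that does not match a category exactly (ignoring case and
    surrounding punctuation) falls into the catch-all bucket.
    """
    cleaned = raw_answer.strip().strip(".\"'").lower()
    for category in categories:
        if cleaned == category.lower():
            return category
    return CATCH_ALL
```

Routing unrecognized answers to the catch-all keeps taxonomy gaps visible instead of silently forcing a trace into the nearest category.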
Step 3: count and prioritize
- Run a pivot table on the labeled traces to see failure frequency by category.
- Frequency is not the only signal — some low-frequency failures are high business risk and should be prioritized regardless.
- Some failures need no eval: if the fix is obvious (a missing instruction in the prompt), just fix it and move on.
- Target four to seven failure modes for automated evaluation; not every problem warrants an LLM judge.
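Outside a spreadsheet, the pivot table amounts to a frequency count. The sketch below counts labeled traces and lists categories flagged as high business risk first, whatever their frequency. Putting high-risk items at the top is one simple policy chosen for illustration. The `rank_failure_modes` name and the category labels in the usage note are hypothetical.

```python
from collections import Counter


def rank_failure_modes(labels, high_risk=frozenset()):
    """Return (category, count) pairs ordered for prioritization.

    `labels` holds one axial code per trace, or None for traces with no
    failure. Categories in `high_risk` sort ahead of the rest; within each
    group, more frequent categories come first.
    """
    counts = Counter(label for label in labels if label is not None)
    return sorted(
        counts.items(),
        key=lambda item: (item[0] not in high_risk, -item[1], item[0]),
    )
```

For example, with five "formatting" traces, three "handoff" traces, and one "pii leak" trace, passing `high_risk={"pii leak"}` lists the single leak ahead of both more frequent categories.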
Step 4: build LLM-as-judge evaluators
- Write a separate judge prompt for each failure mode — tightly scoped, evaluating one thing only.
- Output must be binary (true/false, pass/fail). Likert scales (1–5 or 1–7) produce uninterpretable averages and erode trust.
- Before deploying a judge, validate it against your manually labeled traces using a confusion matrix.
- Raw agreement percentage is misleading if errors are rare — check each cell of the matrix (false positives and false negatives separately).
- Iterate on the judge prompt until the disagreement cells of the matrix (false positives and false negatives) are close to empty.
- Once validated, deploy the judge in unit tests (pre-ship CI) and as a production monitor (sampled traces on a schedule).
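The validation step can be sketched by comparing human labels with judge labels for a single failure mode. Both lists are assumed to hold booleans where `True` means "this trace has the failure". The function names and the metric names (`failure_recall`, `pass_recall`) are illustrative choices rather than terms from the original.

```python
def confusion_matrix(human, judge):
    """Count the four agreement cells for one binary judge."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must be the same length")
    cells = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for h, j in zip(human, judge):
        if h and j:
            cells["tp"] += 1  # both saw the failure
        elif j:
            cells["fp"] += 1  # judge flagged a trace the human passed
        elif h:
            cells["fn"] += 1  # judge missed a failure the human found
        else:
            cells["tn"] += 1  # both passed the trace
    return cells


def summarize(cells):
    """Report raw agreement alongside per-class rates."""
    total = sum(cells.values())
    failures = cells["tp"] + cells["fn"]
    passes = cells["tn"] + cells["fp"]
    return {
        "agreement": (cells["tp"] + cells["tn"]) / total if total else None,
        "failure_recall": cells["tp"] / failures if failures else None,
        "pass_recall": cells["tn"] / passes if passes else None,
    }
```

Consider 100 traces of which 10 are human-labeled failures. A judge that never flags anything reaches 0.9 agreement while its failure recall is 0.0, which is why each cell has to be inspected separately.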
The eval-as-PRD insight
- A well-written LLM judge prompt is effectively a living product requirements document.
- It captures exactly how the product should behave and runs continuously — not just at release.
- Expectations discovered through trace review will change the prompt in ways no upfront PRD could have anticipated (criteria drift is real and documented in research).
Common misconceptions
- "AI can do the error analysis for me": it cannot; it lacks product context and will typically report that flawed traces look fine.
- "We don't need evals, we just vibe": teams that claim this are typically relying on evals already embedded in the foundation model's training, and are doing informal error analysis without naming it.
- "Evals vs. A/B tests" is a false dichotomy: A/B tests are one tool within the broader eval process, and without prior error analysis, A/B test hypotheses are usually wrong.
- Coding agents (Claude Code, Codex) are a special case: the developer is the domain expert and is dogfooding constantly. Generalizing their process to other product categories is a mistake.
Getting started
- Reserve three to four days for the first round: trace review, open coding, axial coding, building one or two judges, integrating into CI.
- After setup, ongoing effort is roughly thirty minutes per week.
- Build a lightweight internal tool that makes trace review frictionless; the highest-ROI activity should also be the easiest one to do.
- Use LLMs freely for synthesis tasks (categorizing codes, drafting judge prompts, organizing notes) but keep humans in the loop for judgment calls.
- Share failures and successes openly; the field benefits from more people writing and teaching about application-specific evals.