How to build AI evals: a practical guide for product builders

Executive overview

Most teams shipping AI products have no systematic way to know when their product is failing. Evals fill that gap — they are structured data analysis on your LLM application, not just automated tests.

The core process moves from manual trace review to failure categorization to automated judges, and it compounds: the upfront cost is roughly three to four days, then roughly thirty minutes a week.

Evals are the highest-ROI activity an AI product team can engage in — and any PM can lead the process.

What evals actually are

  • Evals cover a spectrum: manual data analysis, code-based checks, LLM-as-judge prompts, A/B tests, and product metrics.
  • Unit tests are a small subset — jumping straight to tests without data analysis first is the most common mistake.
  • The goal is actionable product improvement, not a beautiful eval suite.
  • Online monitoring (running judges against production traces daily or hourly) is as important as pre-ship testing.

Step 1: error analysis — look at your traces

  • Start by reviewing raw traces in an observability tool (Braintrust, Phoenix/Arize, LangSmith, or a spreadsheet).
  • Write a brief note on the first thing you see that is wrong in each trace — the most upstream error only.
  • Notes should be informal but specific; vague labels like "janky" cannot be categorized later.
  • This step must be done by a human with domain expertise — an LLM will typically report that every trace looks fine.
  • Appoint one benevolent dictator: a single domain expert whose judgment is trusted, to avoid committee paralysis and keep the process fast.
  • Review until you reach theoretical saturation — the point where new traces stop revealing new failure types. For most products, 40–100 traces is enough.
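
One rough way to make the saturation check concrete is to count how many traces have gone by since a new failure type last appeared. The sketch below assumes the reviewer jots a short failure-type tag next to each note (None when the trace looks fine). The function names and the window of 20 traces are illustrative assumptions, not part of the process described above.

```python
def traces_since_new_failure_type(reviewed):
    """Return (traces reviewed since the last new failure type, set of types seen).

    `reviewed` is a list of failure-type tags in review order; None means
    the trace had no failure.
    """
    seen = set()
    since_new = 0
    for tag in reviewed:
        if tag is not None and tag not in seen:
            seen.add(tag)      # a new failure type resets the counter
            since_new = 0
        else:
            since_new += 1
    return since_new, seen


def is_saturated(reviewed, window=20):
    """Heuristic: the last `window` traces revealed no new failure type.

    The default window is an assumption; pick one that suits your product.
    """
    since_new, _ = traces_since_new_failure_type(reviewed)
    return len(reviewed) >= window and since_new >= window


# Made-up review log: two failure types, then five uneventful traces.
review_log = ["wrong_date", None, "wrong_date", "missed_handoff"] + [None] * 5
since_new, seen_types = traces_since_new_failure_type(review_log)
print(since_new, sorted(seen_types))
print(is_saturated(review_log, window=5), is_saturated(review_log))
```

A counter like this only signals when to stop. Deciding what counts as a new failure type still rests with the human reviewer.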

Step 2: open coding and axial coding

  • Open codes are the raw notes from trace review.
  • Feed those notes to an LLM (Claude, ChatGPT, Gemini) and ask it to produce axial codes — clustered failure categories.
  • Review the generated categories: rename overly generic labels, merge redundant ones, split anything too broad to be actionable.
  • Add a "none of the above" catch-all so the LLM flags gaps in your taxonomy.
  • Use a spreadsheet formula or AI prompt to automatically label every trace with an axial code.
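
The labeling step can be sketched as a prompt builder plus a strict parser that falls back to the catch-all. This is a minimal sketch: `call_llm` is a hypothetical callable you supply (any function that takes a prompt string and returns the model's text), and the prompt wording and category names are illustrative assumptions.

```python
NONE_OF_THE_ABOVE = "none of the above"


def build_labeling_prompt(note, categories):
    """Build a prompt asking for exactly one axial code for a reviewer note."""
    options = "\n".join(f"- {c}" for c in [*categories, NONE_OF_THE_ABOVE])
    return (
        "Assign the reviewer note to exactly one failure category.\n"
        f"Categories:\n{options}\n\n"
        f"Reviewer note: {note}\n"
        "Answer with the category name only."
    )


def label_note(note, categories, call_llm):
    """Label one note; anything outside the taxonomy becomes the catch-all.

    `call_llm` is a hypothetical stand-in for whichever model client you use.
    """
    answer = call_llm(build_labeling_prompt(note, categories)).strip().lower()
    allowed = {c.lower(): c for c in categories}
    return allowed.get(answer, NONE_OF_THE_ABOVE)


# Usage with a fake model so the sketch runs without any API access.
categories = ["Handoff failure", "Wrong date"]
fake_llm = lambda prompt: " Handoff failure\n"
label = label_note("User asked for a human, bot kept going", categories, fake_llm)
unknown = label_note("Something odd", categories, lambda prompt: "Formatting issue")
print(label, "|", unknown)
```

Routing unexpected answers to "none of the above" keeps taxonomy gaps visible instead of letting the model invent categories.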

Step 3: count and prioritize

  • Run a pivot table on the labeled traces to see failure frequency by category.
  • Frequency is not the only signal — some low-frequency failures are high business risk and should be prioritized regardless.
  • Some failures need no eval: if the fix is obvious (a missing instruction in the prompt), just fix it and move on.
  • Target four to seven failure modes for automated evaluation; not every problem warrants an LLM judge.
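
The counting step does not need a spreadsheet. As a rough sketch, the same pivot can be done with the standard library, with a manually chosen set of high-risk categories sorted ahead of pure frequency. The category names and the `high_risk` set below are made-up examples.

```python
from collections import Counter


def rank_failure_modes(labels, high_risk=()):
    """Count failures per category; high-risk categories sort first,
    then the rest by descending frequency."""
    counts = Counter(labels)
    return sorted(
        counts.items(),
        key=lambda item: (item[0] not in high_risk, -item[1]),
    )


# Made-up axial codes, one per labeled trace.
labels = ["tone mismatch"] * 5 + ["handoff failure"] * 3 + ["leaked PII"]
ranking = rank_failure_modes(labels, high_risk={"leaked PII"})
for category, count in ranking:
    print(f"{count:>3}  {category}")
```

Which categories count as high risk is a business judgment. The code only makes sure that judgment is applied consistently when the list is sorted.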

Step 4: build LLM-as-judge evaluators

  • Write a separate judge prompt for each failure mode — tightly scoped, evaluating one thing only.
  • Output must be binary (true/false, pass/fail). Likert scales (1–5 or 1–7) produce uninterpretable averages and erode trust.
  • Before deploying a judge, validate it against your manually labeled traces using a confusion matrix.
  • Raw agreement percentage is misleading if errors are rare — check each cell of the matrix (false positives and false negatives separately).
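  • Iterate on the judge prompt until the off-diagonal cells of the matrix (the false positives and false negatives, where judge and human disagree) are as close to empty as you can get them.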
  • Once validated, deploy the judge in unit tests (pre-ship CI) and as a production monitor (sampled traces on a schedule).
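
To illustrate why raw agreement can mislead when failures are rare, here is a small sketch that compares a judge's binary verdicts with human labels cell by cell. The data is invented for illustration, and True is assumed to mean "this failure mode is present".

```python
def confusion_cells(human, judge):
    """Count the four confusion-matrix cells for binary labels."""
    if len(human) != len(judge):
        raise ValueError("human and judge labels must be the same length")
    cells = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for h, j in zip(human, judge):
        if h and j:
            cells["tp"] += 1   # both flag the failure
        elif j:
            cells["fp"] += 1   # judge flags, human does not
        elif h:
            cells["fn"] += 1   # human flags, judge misses it
        else:
            cells["tn"] += 1   # both say the trace is fine
    return cells


def rates(cells):
    """Raw agreement plus the per-class rates that agreement can hide."""
    total = sum(cells.values())
    positives = cells["tp"] + cells["fn"]
    negatives = cells["tn"] + cells["fp"]
    return {
        "agreement": (cells["tp"] + cells["tn"]) / total if total else None,
        "true_positive_rate": cells["tp"] / positives if positives else None,
        "true_negative_rate": cells["tn"] / negatives if negatives else None,
    }


# Invented example: 100 traces, 10 real failures, judge catches only 4.
human = [True] * 10 + [False] * 90
judge = [True] * 4 + [False] * 6 + [True] * 2 + [False] * 88
cells = confusion_cells(human, judge)
summary = rates(cells)
print(cells)
print(summary)
```

In this invented case the judge agrees with the human on 92 of 100 traces yet misses 6 of the 10 real failures, which is the kind of gap that only shows up when each cell is inspected separately.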

The eval-as-PRD insight

  • A well-written LLM judge prompt is effectively a living product requirements document.
  • It captures exactly how the product should behave and runs continuously — not just at release.
  • Expectations discovered through trace review will change the prompt in ways no upfront PRD could have anticipated (criteria drift is real and documented in research).

Common misconceptions

  • "AI can do the error analysis for me" — it cannot; it lacks product context and will report clean traces as fine.
  • "We don't need evals, we just vibe" — teams that claim this are typically relying on evals embedded in the foundation model's training and doing informal error analysis without naming it.
  • "Evals vs. A/B tests" is a false dichotomy: A/B tests are one tool within the broader eval process, and without prior error analysis, A/B test hypotheses are usually wrong.
  • Coding agents (Claude Code, Codex) are a special case: the developer is the domain expert and is dogfooding the product constantly. Generalizing their process to other product categories is a mistake.

Getting started

  • Reserve three to four days for the first round: trace review, open coding, axial coding, building one or two judges, integrating into CI.
  • After setup, ongoing effort is roughly thirty minutes per week.
  • Build a lightweight internal tool that makes trace review frictionless; it is the highest-ROI activity, so it should also be the easiest one to do.
  • Use LLMs freely for synthesis tasks (categorizing codes, drafting judge prompts, organizing notes) but keep humans in the loop for judgment calls.
  • Share failures and successes openly; the field benefits from more people writing and teaching about application-specific evals.
