Why successful AI companies obsess over evaluations

Executive overview

AI models are stochastic, so without visibility into where a multi-step agent fails you cannot reliably improve it. Evals give you that visibility, letting you focus most of your improvement effort on the small share of steps (roughly 20%) that cause most of the errors.

Start with unit tests before evals. Then instrument your agent, look at real data, and use binary (yes/no) criteria rather than subjective rating scales.

The core insight: you can't fix what you can't see — evals are the mechanism for seeing.

Why evals matter more as complexity grows

  • Simple prompt-response AI rarely needs evals; multi-step agents absolutely do
  • With 10 pipeline steps, errors can hide anywhere — evals reveal exactly which steps fail
  • The 80-20 principle applies: in a 10-step pipeline, fixing the two steps that cause most errors yields outsized improvement
  • Importance of evals scales with product sophistication and user volume
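
To make the 80-20 point concrete, here is a minimal sketch that tallies which pipeline step each failed run broke on and prints the cumulative share of failures. The step names and the failed_steps list are made-up illustrations, not data from a real agent; in practice they would come from your own traces.

```python
from collections import Counter

# Hypothetical data: one entry per failed run, naming the step that failed.
failed_steps = [
    "retrieve", "retrieve", "rank", "retrieve", "generate",
    "rank", "retrieve", "parse_query", "rank", "retrieve",
]

def failure_share_by_step(failures):
    """Return (step, count, cumulative_share) tuples, most frequent first."""
    counts = Counter(failures)
    total = sum(counts.values())
    rows, running = [], 0
    for step, count in counts.most_common():
        running += count
        rows.append((step, count, running / total))
    return rows

for step, count, cumulative in failure_share_by_step(failed_steps):
    print(f"{step:12s} {count:3d}  cumulative {cumulative:.0%}")
```

In this invented sample, two of the four steps account for 80% of the failures, which is the kind of concentration that tells you where to spend effort first.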

Start with unit tests first

  • Unit tests are low-hanging fruit — write them before adding eval infrastructure
  • Each agent step usually has deterministic expectations that can be asserted
  • AI coding tools (Cursor, Windsurf, etc.) can write the unit tests and generate synthetic test data for you
  • Example checks for a RAG chatbot: response not empty, response within length limit
  • Pass these tests first, then move to evals
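
The two example checks for a RAG chatbot could look roughly like the sketch below. The answer_question function is a stand-in for your real chatbot entry point, and the 1200-character limit is an assumed value, so adjust both to your product.

```python
MAX_RESPONSE_CHARS = 1200  # assumed limit; set this to your product's real limit

def answer_question(question: str) -> str:
    """Stand-in for the real chatbot call; replace with your own entry point."""
    return f"Here is what I found about: {question.strip()}"

def check_not_empty(response: str) -> bool:
    """True when the response contains something other than whitespace."""
    return bool(response.strip())

def check_within_length(response: str, limit: int = MAX_RESPONSE_CHARS) -> bool:
    """True when the response fits within the character limit."""
    return len(response) <= limit

def test_response_not_empty():
    assert check_not_empty(answer_question("What is your refund policy?"))

def test_response_within_length_limit():
    assert check_within_length(answer_question("What is your refund policy?"))
```

A test runner such as pytest will pick up the two test_ functions automatically; they can also be called directly.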

Error analysis: just look at your data

  • An eval is fundamentally just structured inspection of your AI's input/output pairs
  • Review 50–100 real interaction examples to identify error categories and themes
  • Categorise interactions by: scenario type, user persona, feature used
  • Common scenario categories: multi-match, no match, vague request, new vs expert user
  • Increase sample size over time (50 → 100 → 200) until error themes stabilise
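
One way to keep the review structured is to record a scenario, persona and error theme for each interaction, then count the themes and check whether a larger sample surfaces any new ones. The records and field names below are invented for illustration only.

```python
from collections import Counter

# Hypothetical hand-labelled review notes; field names are illustrative.
reviews = [
    {"scenario": "multi-match", "persona": "new user", "error": "picked wrong match"},
    {"scenario": "no match", "persona": "expert user", "error": "hallucinated result"},
    {"scenario": "vague request", "persona": "new user", "error": None},
    {"scenario": "no match", "persona": "new user", "error": "hallucinated result"},
    {"scenario": "multi-match", "persona": "expert user", "error": None},
    {"scenario": "vague request", "persona": "new user", "error": "no clarifying question"},
]

def error_themes(rows):
    """Count each error theme, ignoring interactions with no error."""
    return Counter(r["error"] for r in rows if r["error"])

def new_themes(rows, previous_size):
    """Themes that appear only after the first `previous_size` reviews."""
    seen = set(error_themes(rows[:previous_size]))
    return set(error_themes(rows)) - seen

print(error_themes(reviews).most_common())
print(new_themes(reviews, previous_size=3))
```

When new_themes keeps returning an empty set as the sample grows, that is a reasonable signal that the error themes have stabilised.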

Tooling: avoid nerd-sniping

  • Many tools exist (LangSmith, Braintrust, Phoenix, etc.) — don't get paralysed choosing
  • Phoenix has a free open-source option; integrate by inserting tracers into each function
  • Export data to CSV and analyse in Google Sheets — this is sufficient to start
  • Custom dashboards (e.g. Shiny for Python) reduce friction when reviewing conversations
  • Pick a simple tool, get the data out, look at it — the tool is not the goal
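
Getting data into a spreadsheet needs very little code. The sketch below uses only Python's standard csv module; the trace records and column names are hypothetical, and a real export would start from whatever your tracing tool gives you.

```python
import csv
import io

# Hypothetical trace records; in practice these come from your tracing tool.
traces = [
    {"trace_id": "t1", "step": "retrieve", "input": "refund policy", "output": "3 documents"},
    {"trace_id": "t1", "step": "generate", "input": "3 documents", "output": "Refunds take 5 days."},
]

def traces_to_csv(rows):
    """Serialise trace dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["trace_id", "step", "input", "output"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = traces_to_csv(traces)
print(csv_text)
```

To produce a file for Google Sheets instead of a string, pass a file opened with open(path, "w", newline="") to csv.DictWriter in place of the StringIO buffer.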

Use binary evaluation criteria, not rating scales

  • Avoid spectrum frameworks (1–5 helpfulness, A–F accuracy) — the bands are too subjective
  • Force every evaluation question to a yes/no answer
  • Example binary checks for RAG: "Is the answer factually correct?" "Is the answer relevant to the question?"
  • Binary criteria make improvement measurable: you know definitively if a step got better
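
Because every judgment is yes or no, comparing two versions reduces to comparing pass rates. This is a small sketch with invented judgments for the same five test cases before and after a change; the criterion names are illustrative.

```python
# Hypothetical yes/no judgments for the same test set, before and after a change.
before = {
    "factually_correct": [True, False, True, False, True],
    "relevant": [True, True, True, False, True],
}
after = {
    "factually_correct": [True, True, True, False, True],
    "relevant": [True, True, True, False, True],
}

def pass_rate(judgments):
    """Fraction of yes answers; with binary labels this is a plain mean."""
    return sum(judgments) / len(judgments)

for criterion in before:
    old, new = pass_rate(before[criterion]), pass_rate(after[criterion])
    print(f"{criterion}: {old:.0%} -> {new:.0%} ({new - old:+.0%})")
```

With a 1–5 scale the same comparison would depend on how each rater interprets the bands; here the change in pass rate is unambiguous.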

LLM as a judge

  • An LLM judge is a second AI that automatically evaluates the outputs of your product AI
  • Setup process: recruit a domain expert (lawyer, accountant, support lead), generate synthetic data for each scenario and persona, then have the expert score and critique the responses
  • Store expert judgments in a simple spreadsheet — avoid complex tooling that blocks non-developers
  • Run error analysis on the expert scores to find the dominant error themes
  • Build the LLM judge based on the expert's criteria, then iterate its system prompt over time
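
The sketch below shows one possible shape for a binary LLM judge and for checking how often it agrees with the expert. It is not tied to any vendor: call_model is a hypothetical function you would supply to wrap your LLM client, the prompt wording is an assumption, and fake_model is a stub so the example runs offline.

```python
from typing import Callable

JUDGE_PROMPT = (
    "You are reviewing an assistant's answer.\n"
    "Criterion: {criterion}\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with exactly one word: yes or no."
)

def judge(call_model: Callable[[str], str], criterion: str, question: str, answer: str) -> bool:
    """Ask the judge model one binary question and parse its verdict."""
    reply = call_model(JUDGE_PROMPT.format(criterion=criterion, question=question, answer=answer))
    verdict = reply.strip().lower().rstrip(".")
    if verdict not in {"yes", "no"}:
        raise ValueError(f"Unparseable judge reply: {reply!r}")
    return verdict == "yes"

def agreement(judge_labels, expert_labels):
    """Share of examples where the judge matches the domain expert."""
    matches = sum(j == e for j, e in zip(judge_labels, expert_labels))
    return matches / len(expert_labels)

# Stub standing in for a real LLM client (hypothetical behaviour).
def fake_model(prompt: str) -> str:
    return "no" if "I don't know" in prompt else "Yes"

# Hypothetical expert-labelled examples: (question, answer, expert verdict).
examples = [
    ("What is the notice period?", "30 days.", True),
    ("What is the notice period?", "I don't know.", False),
    ("Is VAT included?", "Yes, prices include VAT.", False),
]
criterion = "Is the answer factually correct?"
judge_labels = [judge(fake_model, criterion, q, a) for q, a, _ in examples]
expert_labels = [label for _, _, label in examples]
print(f"Judge agrees with expert on {agreement(judge_labels, expert_labels):.0%} of examples")
```

Cases where the judge and the expert disagree are the ones worth reading closely when iterating on the judge's system prompt.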
