Why successful AI companies obsess over evaluations
Executive overview
AI models are stochastic, so without visibility into where a multi-step agent fails you cannot reliably improve it. Evals give you that visibility, letting you focus most of your improvement effort on the small share of steps (roughly 20%) that cause most of the errors.
Start with unit tests before evals. Then instrument your agent, look at real data, and use binary (yes/no) criteria rather than subjective rating scales.
The core insight: you can't fix what you can't see — evals are the mechanism for seeing.
Why evals matter more as complexity grows
- Simple prompt-response AI rarely needs evals; multi-step agents absolutely do
- With 10 pipeline steps, errors can hide anywhere — evals reveal exactly which steps fail
- The 80-20 principle applies: fix the two steps causing most errors and get outsized improvement
- Importance of evals scales with product sophistication and user volume
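As a rough illustration of the 80-20 idea, the sketch below tallies which pipeline step failed in each bad run and returns the smallest group of steps that accounts for about 80% of failures. The step names, the sample data and the `top_failing_steps` helper are all hypothetical, not part of any particular tool.

```python
from collections import Counter

# Hypothetical data: one entry per failed run, naming the step that failed.
failed_steps = [
    "retrieve", "retrieve", "rank", "retrieve", "generate",
    "rank", "retrieve", "parse_query", "rank", "retrieve",
]

def top_failing_steps(failures, share=0.8):
    """Return the most frequent failing steps that together cover `share` of failures."""
    counts = Counter(failures)
    total = sum(counts.values())
    selected, covered = [], 0
    for step, n in counts.most_common():
        if covered / total >= share:
            break  # the steps collected so far already explain enough failures
        selected.append((step, n))
        covered += n
    return selected

worst = top_failing_steps(failed_steps)
print(worst)
```

With this sample data, two of the four steps explain eight of the ten failures, so those two are where improvement effort would go first.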
Start with unit tests first
- Unit tests are low-hanging fruit — write them before adding eval infrastructure
- Each agent step usually has deterministic expectations that can be asserted
- AI coding assistants (Cursor, Windsurf, etc.) can write the unit tests and generate synthetic test data for you
- Example checks for a RAG chatbot: response not empty, response within length limit
- Pass these tests first, then move to evals
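The two example checks above could look something like the following, written as plain test functions that pytest would also pick up. `fake_chatbot` is a stand-in for the real RAG chatbot, and the 1200-character limit is an assumed value chosen only for illustration.

```python
MAX_RESPONSE_CHARS = 1200  # assumed limit; pick whatever your product requires

def fake_chatbot(question: str) -> str:
    """Hypothetical stand-in for the real RAG chatbot call."""
    return f"Based on the retrieved documents, here is an answer to: {question}"

def check_not_empty(response: str) -> bool:
    # Whitespace-only responses count as empty.
    return bool(response.strip())

def check_within_length(response: str, limit: int = MAX_RESPONSE_CHARS) -> bool:
    return len(response) <= limit

def test_response_not_empty():
    assert check_not_empty(fake_chatbot("What is your refund policy?"))

def test_response_within_length_limit():
    assert check_within_length(fake_chatbot("What is your refund policy?"))

if __name__ == "__main__":
    test_response_not_empty()
    test_response_within_length_limit()
    print("all unit checks passed")
```

Keeping each check as a small function that returns a boolean means the same checks can later be reused on logged production responses, not only in the test suite.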
Error analysis: just look at your data
- An eval is fundamentally just structured inspection of your AI's input/output pairs
- Review 50–100 real interaction examples to identify error categories and themes
- Categorise interactions by: scenario type, user persona, feature used
- Common scenario categories: multi-match, no match, vague request, new vs expert user
- Increase sample size over time (50 → 100 → 200) until error themes stabilise
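A minimal sketch of this kind of tally, assuming each reviewed interaction has been hand-labelled with a scenario, a persona and an error theme (or `None` when nothing went wrong). The field names and sample rows are invented for illustration; `new_themes` is one simple way to see whether a larger sample surfaced themes the smaller one missed.

```python
from collections import Counter

# Hypothetical hand-labelled review notes, one dict per interaction.
reviewed = [
    {"scenario": "multi-match", "persona": "new user", "error": "picked wrong match"},
    {"scenario": "no match", "persona": "expert user", "error": "invented a result"},
    {"scenario": "vague request", "persona": "new user", "error": None},
    {"scenario": "no match", "persona": "new user", "error": "invented a result"},
    {"scenario": "multi-match", "persona": "expert user", "error": None},
    {"scenario": "vague request", "persona": "new user", "error": "no clarifying question"},
]

def error_themes(rows):
    """Count how often each error theme appears."""
    return Counter(r["error"] for r in rows if r["error"])

def errors_by(rows, key):
    """Count erroneous interactions grouped by a category such as 'scenario'."""
    return Counter(r[key] for r in rows if r["error"])

def new_themes(earlier_rows, later_rows):
    """Themes present in the larger sample but absent from the smaller one."""
    return set(error_themes(later_rows)) - set(error_themes(earlier_rows))

print(error_themes(reviewed).most_common())
print(errors_by(reviewed, "scenario").most_common())
print(new_themes(reviewed[:3], reviewed))
```

When `new_themes` keeps coming back empty as the sample grows, the error themes have probably stabilised.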
Tooling: avoid nerd-sniping
- Many tools exist (LangSmith, Braintrust, Phoenix, etc.) — don't get paralysed choosing
- Phoenix has a free open-source option; integrate by inserting tracers into each function
- Export data to CSV and analyse in Google Sheets — this is sufficient to start
- Custom dashboards (e.g. Shiny for Python) reduce friction when reviewing conversations
- Pick a simple tool, get the data out, look at it — the tool is not the goal
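Getting the data out can be as simple as the sketch below, which turns a list of trace records into CSV text using only the Python standard library. The trace fields shown are assumptions; real tracing tools export their own schemas.

```python
import csv
import io

# Hypothetical trace records; a real tracing tool will have its own field names.
traces = [
    {"trace_id": "t1", "step": "retrieve", "input": "refund policy?", "output": "3 documents", "passed": True},
    {"trace_id": "t1", "step": "generate", "input": "3 documents", "output": "Refunds take 5 days.", "passed": True},
    {"trace_id": "t2", "step": "retrieve", "input": "cancel, plan", "output": "0 documents", "passed": False},
]

def traces_to_csv(rows):
    """Serialise trace dicts to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = traces_to_csv(traces)
print(csv_text)

# To open the result in a spreadsheet, write it to a file first, for example:
# with open("traces.csv", "w", newline="", encoding="utf-8") as f:
#     f.write(csv_text)
```

The `csv` module quotes fields that contain commas, so free-text inputs and outputs survive the round trip into a spreadsheet.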
Use binary evaluation criteria, not rating scales
- Avoid spectrum frameworks (1–5 helpfulness, A–F accuracy) — the bands are too subjective
- Force every evaluation question to a yes/no answer
- Example binary checks for RAG: "Is the answer factually correct?" "Is the answer relevant to the question?"
- Binary criteria make improvement measurable: you know definitively if a step got better
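To show why yes/no answers make improvement measurable, the sketch below computes a pass rate per criterion for two versions of a product and reports the change. The verdicts are made-up sample data, and the criterion names simply mirror the two example questions above.

```python
# Hypothetical yes/no verdicts for the same four test questions, before and after a change.
verdicts_v1 = [
    {"factually_correct": True, "relevant": True},
    {"factually_correct": False, "relevant": True},
    {"factually_correct": False, "relevant": False},
    {"factually_correct": True, "relevant": True},
]
verdicts_v2 = [
    {"factually_correct": True, "relevant": True},
    {"factually_correct": True, "relevant": True},
    {"factually_correct": False, "relevant": True},
    {"factually_correct": True, "relevant": True},
]

def pass_rates(verdicts):
    """Fraction of 'yes' verdicts for each criterion."""
    criteria = verdicts[0].keys()
    return {c: sum(v[c] for v in verdicts) / len(verdicts) for c in criteria}

before = pass_rates(verdicts_v1)
after = pass_rates(verdicts_v2)
change = {c: after[c] - before[c] for c in before}
print(before, after, change)
```

A 1 to 5 rating would need an agreed meaning for every band before two runs could be compared; a pass rate only needs agreement on what counts as a yes.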
LLM as a judge
- An LLM judge is a second AI that automatically evaluates the outputs of your product AI
- Setup process: (1) recruit a domain expert (lawyer, accountant, support lead); (2) generate synthetic data for each scenario/persona; (3) have the expert score and critique the responses
- Store expert judgments in a simple spreadsheet — avoid complex tooling that blocks non-developers
- Run error analysis on the expert scores to find the dominant error themes
- Build the LLM judge based on the expert's criteria, then iterate its system prompt over time
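One way to sketch this loop is shown below: a judge prompt built from a system prompt, a parser that forces the reply to pass/fail, and an agreement score against the expert's labels. Everything here is an assumption made for illustration. In particular, `stub_judge` is a placeholder so the example runs offline; in practice it would be replaced by a call to whichever model you use.

```python
from typing import Callable, List, Tuple

# Assumed wording; in practice this is distilled from the expert's critiques.
JUDGE_SYSTEM_PROMPT = (
    "You are reviewing answers from a support chatbot. "
    "Reply with exactly one word: PASS if the answer is factually correct "
    "and relevant to the question, otherwise FAIL."
)

def build_judge_prompt(question: str, answer: str) -> str:
    return f"{JUDGE_SYSTEM_PROMPT}\n\nQuestion: {question}\nAnswer: {answer}\nVerdict:"

def parse_verdict(raw: str) -> bool:
    """Map the judge's reply to a binary verdict; reject anything else."""
    token = raw.strip().upper()
    if token.startswith("PASS"):
        return True
    if token.startswith("FAIL"):
        return False
    raise ValueError(f"Unparseable judge reply: {raw!r}")

def agreement(judge: Callable[[str], str], labelled: List[Tuple[str, str, bool]]) -> float:
    """Fraction of examples where the judge's verdict matches the expert's."""
    matches = sum(
        parse_verdict(judge(build_judge_prompt(q, a))) == expert
        for q, a, expert in labelled
    )
    return matches / len(labelled)

def stub_judge(prompt: str) -> str:
    # Placeholder for a real model call: fails any answer that admits not knowing.
    return "FAIL" if "I don't know" in prompt else "PASS"

# Hypothetical expert-labelled rows: (question, answer, expert says pass?)
expert_labels = [
    ("How do I reset my password?", "Go to Settings > Security and choose Reset.", True),
    ("What is the refund window?", "I don't know.", False),
    ("Can I export invoices?", "Yes, all plans include unlimited exports.", False),
]

score = agreement(stub_judge, expert_labels)
print(f"judge/expert agreement: {score:.2f}")
```

Tracking this agreement score while editing the judge's system prompt gives a concrete signal for whether each iteration moves the judge closer to the expert.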