Why successful AI companies obsess over evaluations

Executive overview

AI models are stochastic — without visibility into where a multi-step agent fails, you cannot reliably improve it. Evals give you that visibility, letting you apply 80% of your improvement effort to the 20% of steps causing most errors.

Start with unit tests before evals. Then instrument your agent, look at real data, and use binary (yes/no) criteria rather than subjective rating scales.

The core insight: you can't fix what you can't see — evals are the mechanism for seeing.

Why evals matter more as complexity grows

Simple prompt-response AI rarely needs evals; multi-step agents absolutely do
With 10 pipeline steps, errors can hide anywhere — evals reveal exactly which steps fail
The 80-20 principle applies: fix the two steps causing most errors and get outsized improvement
Importance of evals scales with product sophistication and user volume
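
One way to find the few steps that cause most errors is a simple tally of failures per step. The sketch below assumes you already have a log naming the step that failed in each bad trace; the step names, the counts, and the `steps_covering` helper are all hypothetical, for illustration only.

```python
from collections import Counter


def steps_covering(failures, share=0.8):
    """Return the top-failing steps that together reach `share` of all failures."""
    counts = Counter(failures)
    total = sum(counts.values())
    covered, selected = 0, []
    # Walk steps from most to least failures until the target share is covered.
    for step, n in counts.most_common():
        if covered / total >= share:
            break
        selected.append(step)
        covered += n
    return selected


# Hypothetical failure log: one entry per failed trace, naming the failing step.
failed_steps = (
    ["retrieve"] * 45 + ["rank"] * 35 + ["parse_query"] * 8
    + ["format"] * 7 + ["greet"] * 5
)

print(steps_covering(failed_steps))  # ['retrieve', 'rank']
```

In this made-up log, two of five steps account for 80 of 100 failures, so those two are where improvement effort would go first.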

Start with unit tests

Unit tests are low-hanging fruit — write them before adding eval infrastructure
Each agent step usually has deterministic expectations that can be asserted
AI coding tools (Cursor, Windsurf, etc.) can write the unit tests and generate synthetic test data for you
Example checks for a RAG chatbot: response not empty, response within length limit
Pass these tests first, then move to evals
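
The two example checks above can be written as plain assertions. This is a minimal sketch, not a prescribed setup: `fake_chatbot` stands in for your real RAG chatbot, and the 1200-character limit is an assumed value.

```python
MAX_CHARS = 1200  # assumed length limit; pick whatever your product requires


def check_response(response, max_chars=MAX_CHARS):
    """Return the names of failed checks; an empty list means the response passes."""
    failures = []
    if not response.strip():
        failures.append("empty_response")
    if len(response) > max_chars:
        failures.append("too_long")
    return failures


def fake_chatbot(question):
    # Hypothetical stand-in for the real chatbot call.
    return f"Here is what I found about {question}."


def test_response_is_valid():
    # Deterministic expectations: not empty, within the length limit.
    assert check_response(fake_chatbot("refund policy")) == []


test_response_is_valid()
```

A test runner such as pytest would pick up `test_response_is_valid` automatically; calling it directly, as here, works without any extra tooling.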

Error analysis: just look at your data

An eval is fundamentally just structured inspection of your AI's input/output pairs
Review 50–100 real interaction examples to identify error categories and themes
Categorise interactions by: scenario type, user persona, feature used
Common categories: multi-match, no match, vague request (scenarios); new vs expert user (personas)
Increase sample size over time (50 → 100 → 200) until error themes stabilise
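
Once a batch of interactions has been reviewed by hand, the tallying step can be as small as the sketch below. The review records, field names, and error labels are invented for illustration; in practice they would come from your own notes.

```python
from collections import Counter

# Hypothetical hand-labelled reviews: "error" is None when the interaction was fine.
reviews = [
    {"scenario": "no match", "persona": "new user", "error": "hallucinated result"},
    {"scenario": "no match", "persona": "expert user", "error": "hallucinated result"},
    {"scenario": "multi-match", "persona": "new user", "error": None},
    {"scenario": "vague request", "persona": "new user", "error": "no clarifying question"},
    {"scenario": "multi-match", "persona": "expert user", "error": None},
    {"scenario": "vague request", "persona": "expert user", "error": None},
]


def error_themes(reviews):
    """Count how often each error label appears across the reviewed interactions."""
    return Counter(r["error"] for r in reviews if r["error"])


def failure_rate_by(reviews, key):
    """Fraction of interactions with an error, grouped by a category such as 'scenario'."""
    totals, failed = Counter(), Counter()
    for r in reviews:
        totals[r[key]] += 1
        if r["error"]:
            failed[r[key]] += 1
    return {k: failed[k] / totals[k] for k in totals}


print(error_themes(reviews).most_common())
print(failure_rate_by(reviews, "scenario"))
```

Re-running the same tally as the sample grows (50, then 100, then 200) shows whether the ranking of error themes has stopped changing.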

Tooling: avoid nerd-sniping

Many tools exist (LangSmith, Braintrust, Phoenix, etc.) — don't get paralysed choosing
Phoenix has a free open-source option; integrate by inserting tracers into each function
Export data to CSV and analyse in Google Sheets — this is sufficient to start
Custom dashboards (e.g. Shiny for Python) reduce friction when reviewing conversations
Pick a simple tool, get the data out, look at it — the tool is not the goal
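
To show how little is needed to get data out, here is a tool-agnostic sketch using only the Python standard library. It is not the API of Phoenix, LangSmith, or Braintrust: the `traced` decorator, `TRACE_LOG`, and the two example steps are hypothetical names made up for this illustration.

```python
import csv
import functools
import io

TRACE_LOG = []  # in-memory store for the sketch; a real tool would persist traces


def traced(step_name):
    """Decorator that records each call's positional inputs and output for a step."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            TRACE_LOG.append(
                {"step": step_name, "input": repr(args), "output": repr(result)}
            )
            return result
        return wrapper
    return decorator


@traced("rewrite_query")
def rewrite_query(query):
    return query.strip().lower()


@traced("answer")
def answer(query):
    return f"Results for: {query}"


def export_csv(rows, fileobj):
    """Write trace rows as CSV so they can be opened in a spreadsheet."""
    writer = csv.DictWriter(fileobj, fieldnames=["step", "input", "output"])
    writer.writeheader()
    writer.writerows(rows)


answer(rewrite_query("  Two-bed flats in Leeds "))
buffer = io.StringIO()  # swap for open("traces.csv", "w", newline="") to write a file
export_csv(TRACE_LOG, buffer)
print(buffer.getvalue())
```

The resulting CSV has one row per step call, which is enough to start reviewing traces in Google Sheets.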

Use binary evaluation criteria, not rating scales

Avoid spectrum frameworks (1–5 helpfulness, A–F accuracy) — the bands are too subjective
Force every evaluation question to a yes/no answer
Example binary checks for RAG: "Is the answer factually correct?" "Is the answer relevant to the question?"
Binary criteria make improvement measurable: you know definitively if a step got better
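
With yes/no answers, comparing two versions of a step reduces to comparing pass rates. The sketch below uses the two RAG checks named above; the `before` and `after` results are fabricated sample data, and `pass_rates` is a hypothetical helper.

```python
def pass_rates(results):
    """Map each binary check to the fraction of examples that passed it."""
    checks = results[0].keys() if results else []
    return {c: sum(r[c] for r in results) / len(results) for c in checks}


# Made-up yes/no judgments on the same four questions, before and after a prompt change.
before = [
    {"factually_correct": True, "relevant": True},
    {"factually_correct": False, "relevant": True},
    {"factually_correct": False, "relevant": False},
    {"factually_correct": True, "relevant": True},
]
after = [
    {"factually_correct": True, "relevant": True},
    {"factually_correct": True, "relevant": True},
    {"factually_correct": False, "relevant": False},
    {"factually_correct": True, "relevant": True},
]

print(pass_rates(before))  # {'factually_correct': 0.5, 'relevant': 0.75}
print(pass_rates(after))   # {'factually_correct': 0.75, 'relevant': 0.75}
```

In this sample, factual correctness improved while relevance did not move, which is a much clearer signal than a shift from an average of 3.2 to 3.4 on a 1-5 scale.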

LLM as a judge

An LLM judge is a second AI that automatically evaluates the outputs of your product AI
Setup process: (1) recruit a domain expert (lawyer, accountant, support lead); (2) generate synthetic data for each scenario/persona; (3) have the expert score and critique responses
Store expert judgments in a simple spreadsheet — avoid complex tooling that blocks non-developers
Run error analysis on the expert scores to find the dominant error themes
Build the LLM judge based on the expert's criteria, then iterate its system prompt over time
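
The last two steps can be sketched as follows: turn the expert's criteria and critiques into a judge prompt, then measure how often the judge agrees with the expert. Everything here is an assumption for illustration: the prompt wording, the sample data, and especially `stub_judge`, a trivial rule standing in for a real LLM call.

```python
def build_judge_prompt(criteria, examples):
    """Assemble a judge system prompt from expert criteria and labelled examples."""
    lines = [
        "You are reviewing answers from an assistant.",
        "Reply PASS or FAIL, then give a one-sentence critique.",
        "",
        "Criteria:",
    ]
    lines += [f"- {c}" for c in criteria]
    lines += ["", "Expert-labelled examples:"]
    for ex in examples:
        verdict = "PASS" if ex["expert_pass"] else "FAIL"
        lines.append(f"Response: {ex['response']}")
        lines.append(f"Verdict: {verdict}")
        lines.append(f"Critique: {ex['critique']}")
    return "\n".join(lines)


def agreement(judge, labelled):
    """Fraction of examples where the judge's pass/fail matches the expert's."""
    matches = sum(judge(ex["response"]) == ex["expert_pass"] for ex in labelled)
    return matches / len(labelled)


def stub_judge(response):
    # Placeholder for an LLM call: passes any response that cites a source.
    return "source:" in response.lower()


# Hypothetical rows from the expert's spreadsheet.
labelled = [
    {"response": "You can claim it. Source: policy section 4", "expert_pass": True,
     "critique": "Correct and cites the policy."},
    {"response": "Probably fine, don't worry.", "expert_pass": False,
     "critique": "No basis given."},
    {"response": "Yes, always deductible. Source: forum post", "expert_pass": False,
     "critique": "Overgeneralises and cites an unreliable source."},
    {"response": "I am not sure.", "expert_pass": False,
     "critique": "Does not answer the question."},
]

prompt = build_judge_prompt(["Is the answer factually correct?"], labelled[:2])
print(agreement(stub_judge, labelled))  # 0.75
```

The disagreements (here, the response citing a forum post) are the cases worth reading when revising the judge's system prompt.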
