How to evaluate AI products before spending money
Executive overview
Most businesses waste money on AI tools they could have ruled out in a week. The cause is judging AI by gut feel or vendor-provided metrics instead of custom, task-specific criteria.
A three-step framework fixes this: define what good looks like, build binary evaluations for your use case, then test during the free trial before committing.
You don't need AI to be perfect — you need a meaningful improvement over your current baseline.
The three-step evaluation framework
Step 1: define what good looks like.
- Define your primary metric: the specific ROI you expect (time saved, revenue generated, volume handled).
- Set secondary sub-goals that contribute to the primary metric.
Step 2: build binary evaluations for your use case.
- Create binary evals: yes/no questions that measure whether the AI achieved a specific outcome.
- Avoid off-the-shelf vendor evals; they are generic and rarely match your actual needs.
- Build your own evals; the process sharpens your understanding of what matters.
Step 3: test before committing.
- Run evaluations during the free trial period to avoid spending money on tools that don't fit.
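One way to keep the framework honest during a trial is to write the goals and their yes/no evals down as data. The sketch below is illustrative only: the class names, the `trial_verdict` function, and the decision rule (the primary goal is a hard requirement, secondary goals are advisory) are assumptions and are not part of the original framework.

```python
from dataclasses import dataclass, field


@dataclass
class BinaryEval:
    question: str  # a yes/no question about one specific outcome
    passed: bool   # the answer recorded during the free trial


@dataclass
class Goal:
    name: str
    evals: list[BinaryEval] = field(default_factory=list)

    def met(self) -> bool:
        # Assumption: a goal is met only if it has evals and all of them passed.
        return bool(self.evals) and all(e.passed for e in self.evals)


def trial_verdict(primary: Goal, secondary: list[Goal]) -> str:
    # Assumption: failing the primary goal rules the tool out;
    # unmet secondary goals are reported but do not block the decision.
    if not primary.met():
        return "rule out"
    unmet = [g.name for g in secondary if not g.met()]
    if unmet:
        return "proceed with caveats: " + ", ".join(unmet)
    return "proceed"
```

A team could fill in one `Goal` per primary or secondary goal at the start of the trial and record each answer as the evals are run, so the final decision is read off the data instead of argued from memory.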
Why binary evaluations matter
- Spectrum-based evals (1–10 scores) generate subjective debate and slow decisions.
- Binary evals are unambiguous: the AI either achieved the outcome or it didn't.
- Every use case can support binary evals if you define the criteria precisely enough.
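A criterion phrased as "k of n" reduces to a single comparison, which is what makes the answer unambiguous. The following is a minimal sketch; the function name `meets_threshold` and the sample outcomes are hypothetical.

```python
def meets_threshold(outcomes: list[bool], required: int) -> bool:
    """Return True if at least `required` of the recorded outcomes succeeded."""
    return sum(outcomes) >= required


# Hypothetical trial data: 20 test cases, 19 of which the AI handled correctly.
outcomes = [True] * 19 + [False]
print(meets_threshold(outcomes, required=19))  # True
print(meets_threshold(outcomes, required=20))  # False
```

The threshold is fixed before the trial starts, so there is nothing to debate afterwards: the count either reaches it or it does not.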
Example: contract review assistant
Primary goal — AI catches at least 95% of contract issues faster than lawyers.
- Does the AI identify the riskiest clauses in 19 of 20 contracts? Yes/no.
- Does it flag missing standard terms (termination, liability) consistently? Yes/no.
Secondary goal — reduce contract review time below 15 minutes (from 90 minutes).
- Does a lawyer reviewing the AI summary finish in under 15 minutes? Yes/no.
- Does the AI summary surface all key sections on the first page? Yes/no.
Example: customer service chatbot
Primary goal — reduce support ticket volume by 40%.
- Does the AI resolve at least 4 of 10 issues without human intervention? Yes/no.
- Does it answer at least 8 of 20 common questions autonomously? Yes/no.
Secondary goal — keep customer satisfaction above 80%.
- Do 8 of 10 customers rate the AI conversation as good or better? Yes/no.
- Do fewer than 2 of 10 customers immediately try to bypass the AI? Yes/no.
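Three of the chatbot evals above can be tallied from a sample of trial conversations. This is a sketch under stated assumptions: the `Conversation` record and its fields are hypothetical, and scaling the "x of 10" ratios to whatever sample size you collect is a choice made here, not something the examples prescribe. The "8 of 20 common questions" eval is left out because it needs a separate, fixed question set.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    resolved_without_human: bool  # issue closed with no agent involved
    rated_good_or_better: bool    # customer's rating after the chat
    tried_to_bypass: bool         # customer immediately asked for a human


def chatbot_evals(sample: list[Conversation]) -> dict[str, bool]:
    n = len(sample)
    if n == 0:
        raise ValueError("need at least one conversation")
    resolved = sum(c.resolved_without_human for c in sample)
    happy = sum(c.rated_good_or_better for c in sample)
    bypassed = sum(c.tried_to_bypass for c in sample)
    # Integer cross-multiplication avoids floating-point rounding at the boundary.
    return {
        "resolves at least 4 of 10": resolved * 10 >= 4 * n,
        "8 of 10 rate good or better": happy * 10 >= 8 * n,
        "fewer than 2 of 10 bypass": bypassed * 10 < 2 * n,
    }
```

Each entry in the returned dictionary is a plain yes or no, so a failing eval points directly at the outcome that fell short.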
Example: sales email personaliser
Primary goal — double cold email reply rate from 2% to 4%.
- Do 100 AI-personalised emails generate 4 or more replies? Yes/no.
- Does every email include a personalisation element from the AI's research? Yes/no.
Secondary goal — ensure opening lines are unique per prospect.
- Across 50 emails to similar personas, are all openers distinct? Yes/no.
- Does each opener reference something specific from the prospect's LinkedIn or website? Yes/no.
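The "are all openers distinct" check is easy to automate. The sketch below assumes that two openers count as the same if they match after lowercasing and stripping punctuation and extra whitespace; that normalisation rule and the function names are illustrative assumptions. It will not catch near-duplicates that differ by a word or two, which would still need a human read.

```python
import re


def normalise(opener: str) -> str:
    # Lowercase and collapse punctuation/whitespace so trivial
    # variations do not make two openers look distinct.
    return re.sub(r"[^a-z0-9]+", " ", opener.lower()).strip()


def all_openers_distinct(openers: list[str]) -> bool:
    normalised = [normalise(o) for o in openers]
    return len(set(normalised)) == len(normalised)
```

Running `all_openers_distinct` over the 50 opening lines from the trial batch gives the yes/no answer for the first eval; whether each opener references something specific about the prospect remains a manual check in this sketch.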