How to evaluate AI products before spending money

Executive overview

Most businesses waste money on AI tools they could have ruled out in a week. The problem is evaluating AI by gut feel or vendor-provided metrics rather than custom, task-specific criteria.

A three-step framework fixes this: define what good looks like, build binary evaluations for your use case, then test during the free trial before committing.

You don't need AI to be perfect — you need a meaningful improvement over your current baseline.

The three-step evaluation framework

  • Step 1: define what good looks like. Set a primary metric, the specific ROI you expect (time saved, revenue generated, volume handled), plus secondary sub-goals that contribute to it.
  • Step 2: build binary evals for your use case. These are yes/no questions that measure whether the AI achieved a specific outcome. Avoid off-the-shelf vendor evals, which are generic and rarely match your actual needs; building your own sharpens your understanding of what matters.
  • Step 3: run the evals during the free trial period, so you don't spend money on tools that don't fit.
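
The framework above can be sketched as a small data structure: goals that each own a list of yes/no checks, run against whatever data you collect during the trial. This is a minimal illustration, not a prescribed tool. The names (BinaryEval, Goal, trial_verdict), the sample numbers, and the "every eval must pass" decision rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BinaryEval:
    """A yes/no question about trial results."""
    question: str
    check: Callable[[dict], bool]  # receives trial data, returns True for "yes"


@dataclass
class Goal:
    """A primary metric or secondary sub-goal, with its binary evals."""
    name: str
    evals: List[BinaryEval]


def run_goal(goal: Goal, trial_data: dict) -> Dict[str, bool]:
    """Answer every yes/no question attached to one goal."""
    return {e.question: e.check(trial_data) for e in goal.evals}


def trial_verdict(goals: List[Goal], trial_data: dict) -> bool:
    """Assumed decision rule: commit only if every eval answers yes."""
    return all(all(run_goal(g, trial_data).values()) for g in goals)


# Hypothetical trial data and thresholds, for illustration only.
trial_data = {
    "minutes_saved_per_task": 42,
    "tasks_completed": 18,
    "tasks_attempted": 20,
}

goals = [
    Goal("Primary: time saved", [
        BinaryEval("Saves at least 30 minutes per task?",
                   lambda d: d["minutes_saved_per_task"] >= 30),
    ]),
    Goal("Secondary: reliability", [
        BinaryEval("Completes at least 18 of 20 tasks?",
                   lambda d: d["tasks_completed"] * 20 >= 18 * d["tasks_attempted"]),
    ]),
]

for g in goals:
    print(g.name, run_goal(g, trial_data))
print("Commit after trial:", trial_verdict(goals, trial_data))
```

Keeping each check as a plain function that returns True or False forces the criteria to be written down precisely before the trial starts.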

Why binary evaluations matter

  • Spectrum-based evals (1–10 scores) generate subjective debate and slow decisions.
  • Binary evals are unambiguous: the AI either achieved the outcome or it didn't.
  • Every use case can support binary evals if you define the criteria precisely enough.

Example: contract review assistant

Primary goal — AI catches at least 95% of contract issues faster than lawyers.

  • Does the AI identify the riskiest clauses in 19 of 20 contracts? Yes/no.
  • Does it flag missing standard terms (termination, liability) consistently? Yes/no.

Secondary goal — reduce contract review time below 15 minutes (from 90 minutes).

  • Does a lawyer reviewing the AI summary finish in under 15 minutes? Yes/no.
  • Does the AI summary surface all key sections on the first page? Yes/no.
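
One way to score the primary goal is to have lawyers label the issues in a sample of contracts, then count how many of those the AI also flagged. The sketch below assumes issues are recorded as short labels per contract; the contract IDs, labels, and function names are hypothetical, and the sample is far smaller than a real trial would use.

```python
from typing import Dict, Set, Tuple


def issues_caught(ai_flags: Dict[str, Set[str]],
                  lawyer_flags: Dict[str, Set[str]]) -> Tuple[int, int]:
    """Return (issues the AI caught, issues lawyers found) across all contracts."""
    total = sum(len(issues) for issues in lawyer_flags.values())
    caught = sum(len(issues & ai_flags.get(contract_id, set()))
                 for contract_id, issues in lawyer_flags.items())
    return caught, total


def catches_95_percent(caught: int, total: int) -> bool:
    """Binary eval: did the AI catch at least 95% of lawyer-identified issues?"""
    # Integer comparison avoids floating-point rounding at the threshold.
    return total > 0 and caught * 100 >= 95 * total


# Hypothetical labels for three contracts, for illustration only.
lawyer_flags = {
    "c1": {"termination", "liability"},
    "c2": {"indemnity"},
    "c3": {"liability", "auto-renewal"},
}
ai_flags = {
    "c1": {"termination", "liability", "governing-law"},  # extra flag is ignored here
    "c2": {"indemnity"},
    "c3": {"liability"},  # missed auto-renewal
}

caught, total = issues_caught(ai_flags, lawyer_flags)
print(f"Caught {caught} of {total} issues")
print("Meets 95% target:", catches_95_percent(caught, total))
```

This only measures missed issues. Extra flags raised by the AI are not penalised here, so a tool that flags everything would pass; the review-time evals under the secondary goal are what catch that.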

Example: customer service chatbot

Primary goal — reduce support ticket volume by 40%.

  • Does the AI resolve at least 4 of 10 issues without human intervention? Yes/no.
  • Does it answer at least 8 of 20 common questions autonomously? Yes/no.

Secondary goal — keep customer satisfaction above 80%.

  • Do at least 8 of 10 customers rate the AI conversation as good or better? Yes/no.
  • Do fewer than 2 of 10 customers immediately try to bypass the AI? Yes/no.
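
The chatbot evals above reduce to counting outcomes across a sample of trial conversations. A minimal sketch follows, assuming each conversation has been tagged with three yes/no facts; the Conversation record and its field names are hypothetical, and the sample data is invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Conversation:
    resolved_without_human: bool
    rated_good_or_better: bool
    tried_to_bypass: bool


def at_least(hits: int, total: int, num: int, den: int) -> bool:
    """True if hits/total >= num/den, using integer arithmetic."""
    return hits * den >= num * total


def chatbot_evals(conversations: List[Conversation]) -> Dict[str, bool]:
    """Answer the yes/no questions for a sample of trial conversations."""
    n = len(conversations)
    if n == 0:
        raise ValueError("need at least one conversation to evaluate")
    resolved = sum(c.resolved_without_human for c in conversations)
    satisfied = sum(c.rated_good_or_better for c in conversations)
    bypassed = sum(c.tried_to_bypass for c in conversations)
    return {
        "Resolves at least 4 of 10 without a human": at_least(resolved, n, 4, 10),
        "At least 8 of 10 rate it good or better": at_least(satisfied, n, 8, 10),
        # "Fewer than 2 of 10" is the negation of "at least 2 of 10".
        "Fewer than 2 of 10 try to bypass the AI": not at_least(bypassed, n, 2, 10),
    }


# Hypothetical sample: 5 resolved, 8 satisfied, 1 bypass attempt out of 10.
sample = [Conversation(i < 5, i < 8, i == 9) for i in range(10)]

for question, answer in chatbot_evals(sample).items():
    print(f"{question}: {'yes' if answer else 'no'}")
```

Expressing each threshold as a ratio lets the same checks run on any sample size, not only on exactly 10 conversations.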

Example: sales email personaliser

Primary goal — double cold email reply rate from 2% to 4%.

  • Do 100 AI-personalised emails generate 4 or more replies? Yes/no.
  • Does every email include a personalisation element from the AI's research? Yes/no.

Secondary goal — ensure opening lines are unique per prospect.

  • Across 50 emails to similar personas, are all openers distinct? Yes/no.
  • Does each opener reference something specific from the prospect's LinkedIn or website? Yes/no.
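
Two of these checks are easy to automate: the reply-rate threshold and the uniqueness of opening lines. The sketch below assumes "distinct" means different after lowercasing and stripping punctuation, which is one possible definition and not the only one; the function names and sample openers are hypothetical.

```python
import re
from typing import List


def normalise(opener: str) -> str:
    """Lowercase and collapse punctuation and spacing so trivial variants match."""
    return re.sub(r"[^a-z0-9]+", " ", opener.lower()).strip()


def all_openers_distinct(openers: List[str]) -> bool:
    """Binary eval: is every opening line unique after normalisation?"""
    normalised = [normalise(o) for o in openers]
    return len(set(normalised)) == len(normalised)


def reply_rate_at_least_4_percent(replies: int, sent: int) -> bool:
    """Binary eval: did the batch reach 4 or more replies per 100 emails sent?"""
    return sent > 0 and replies * 100 >= 4 * sent


# Hypothetical openers, for illustration only.
openers = [
    "Congrats on the Series A, Dana!",
    "Loved your post on onboarding churn.",
    "congrats on the series a  dana",  # same line as the first, differently formatted
]

print("All openers distinct:", all_openers_distinct(openers))
print("Reply rate target met:", reply_rate_at_least_4_percent(replies=5, sent=100))
```

Exact-match comparison will not catch openers that are reworded but substantively identical, so a manual read of a sample is still worthwhile.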
