How Casetext built a vertical AI agent and sold for $650M

Executive overview

Legal tech was stuck with incremental improvements lawyers could easily ignore. GPT-4 changed that — not just the technology, but the market's willingness to adopt it.

Casetext spent 10 years building in the legal space, then pivoted 120 people in 48 hours when they saw an early GPT-4 demo. The result was Co-Counsel: an AI legal assistant that does research, document review, and memo drafting at the level of a skilled associate.

The real moat in vertical AI is not the model — it's the layers of domain logic, data pipelines, and test-driven prompt engineering required to get from "impressive demo" to "works 100% of the time."

The 10-year slog before the breakthrough

Casetext started in 2012 as a user-generated legal annotation site, modelled on Wikipedia and Stack Overflow
Lawyers bill by the hour — they had no incentive to contribute free content; the UGC model failed entirely
Pivoted to NLP and machine learning: built citation-graph tools that surfaced cases lawyers had missed
Revenue was growing 70–80% YoY and approaching $15–20M ARR, but the product was still incremental
Lawyers earning $5M/year had no desire to change anything; resistance was structural, not irrational
ChatGPT's public release flipped this overnight — the same lawyers started calling, asking to get ahead of AI

The 48-hour pivot

Jake Heller, Casetext's co-founder and CEO, and his co-founder were under NDA with OpenAI, testing an early GPT-4 build months before public launch
Within 48 hours of first seeing it, the decision was made: redirect all 120 people to a new product
Because the NDA initially covered only the two founders, they built the first prototype themselves before telling the rest of the company
One week later, at an executive offsite, the team expected a sales planning session; instead they saw a laptop demo
Showing the product to customers live, and watching senior attorneys react with visible existential shock, converted internal skeptics faster than any argument

Why GPT-4 was the threshold, not GPT-3

GPT-3.5 scored in roughly the bottom 10% of test takers on the bar exam; GPT-4 scored in roughly the top 10%
Earlier models produced fluent but hallucinated output — plausible-sounding law that didn't exist
Legal work requires zero tolerance for fabrication: wrong case citations or misquoted statutes cause real harm
The jump from 3.5 to 4 wasn't incremental — it crossed the threshold where accurate, citation-grounded output became achievable

Building Co-Counsel: the skills architecture

Each capability ("skill") was built by working backwards from the end result a lawyer actually needs
Research workflow example: English query → search syntax → run queries → read results → build outline → write memo
Each step in that chain became a separate prompt; a full skill might involve a dozen to two dozen chained prompts
For each prompt, the team wrote gold-standard input/output pairs and built test batteries — starting at dozens, scaling to thousands
This is test-driven development applied to prompting: adding instructions to fix one failure must not break others
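
The chained-prompt structure described above can be sketched roughly as follows. This is a minimal illustration, not Casetext's implementation: the `Step` and `run_chain` names, the prompt templates, and the `fake_model` stub are all assumptions made for the example, and the non-model steps (running queries, reading results) are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function that takes a prompt string and returns text.
ModelFn = Callable[[str], str]


@dataclass
class Step:
    name: str
    template: str  # must contain the placeholder "{input}"


def run_chain(steps: list[Step], query: str, model: ModelFn) -> dict[str, str]:
    """Feed each step's output into the next step's prompt.

    Every intermediate output is kept so that each step can be
    inspected and tested on its own.
    """
    outputs: dict[str, str] = {}
    current = query
    for step in steps:
        prompt = step.template.format(input=current)
        current = model(prompt)
        outputs[step.name] = current
    return outputs


# Illustrative three-step skill; a real skill might chain a dozen or more.
RESEARCH_SKILL = [
    Step("search_syntax", "Rewrite this question as a search query:\n{input}"),
    Step("outline", "Build a memo outline from this material:\n{input}"),
    Step("memo", "Write a memo following this outline:\n{input}"),
]


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call: wraps the last line of the prompt
    # in angle brackets so the hand-off between steps is visible.
    return f"<{prompt.splitlines()[-1]}>"


result = run_chain(RESEARCH_SKILL, "Q", fake_model)
print(result["memo"])  # <<<Q>>>
```

Keeping each step as its own named prompt is what makes per-prompt test batteries possible: a failure can be traced to one step rather than to the skill as a whole.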

The "GPT wrapper" objection

Co-Counsel is not a wrapper: it includes proprietary legal datasets, automated annotations, integrations with legal-specific document management systems, and custom OCR pipelines
OCR alone required handling handwriting, tilted scans, and the legal practice of printing four pages onto one — each case requiring explicit handling
Everything that happens before text reaches the model's context window can represent dozens of engineering decisions
Prompt strategy, information formatting, and step decomposition are IP — hard to build, hard to replicate
Analogy: Salesforce was "just a SQL wrapper" but the business logic on top is what made it a billion-dollar company
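
As one small illustration of the explicit handling mentioned above, the four-pages-on-one-sheet case could start with computing crop regions before OCR runs. This is a sketch under stated assumptions: the function name `four_up_boxes` is hypothetical, and it assumes a clean 2x2 grid read left to right, top to bottom, which real scans will not always follow.

```python
def four_up_boxes(width: int, height: int) -> list[tuple[int, int, int, int]]:
    """Return crop boxes for a sheet that holds a 2x2 grid of pages.

    Each box is (left, top, right, bottom) in pixels, listed in assumed
    reading order: top-left, top-right, bottom-left, bottom-right.
    """
    mid_x, mid_y = width // 2, height // 2
    return [
        (0, 0, mid_x, mid_y),
        (mid_x, 0, width, mid_y),
        (0, mid_y, mid_x, height),
        (mid_x, mid_y, width, height),
    ]


for box in four_up_boxes(1000, 800):
    print(box)
```

The boxes use the same (left, top, right, bottom) ordering that common imaging libraries such as Pillow accept for cropping, so each region could be cut out and passed to OCR as its own page.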

Getting from 70% to 100%

70% accuracy is good enough for a $20/month consumer product; mission-critical enterprise use cases require 100%
Test-driven prompting surfaces failure patterns; root-causing why a prompt fails usually reveals ambiguous instructions or missing context
Once a prompt passes a large battery of tests, generalisation to unseen inputs becomes far more reliable
First impressions matter acutely: one bad early experience, especially for a non-technical professional, ends adoption permanently
The last mile (or hundred miles) is where the value is — and where most "GPT wrappers" stop
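
A test battery of the kind described above might look roughly like this. It is a sketch, not the team's actual harness: `run_battery`, `safe_to_ship`, and the three stub functions are hypothetical names, the stubs use regular expressions in place of a real prompt plus model, and exact-match grading is an assumption that would be too strict for long-form outputs such as memos.

```python
import re
from typing import Callable

GoldCase = tuple[str, str]  # (input, gold-standard output)

CASES: list[GoldCase] = [
    ("Marbury v. Madison, 5 U.S. 137 (1803)", "5 U.S. 137"),
    ("Brown v. Board of Education, 347 U.S. 483 (1954)", "347 U.S. 483"),
    ("No citation here", ""),
]


def run_battery(prompt_fn: Callable[[str], str], cases: list[GoldCase]) -> list[GoldCase]:
    """Return the cases whose output does not match the gold answer."""
    return [(inp, gold) for inp, gold in cases if prompt_fn(inp).strip() != gold]


def safe_to_ship(old_fn, new_fn, cases: list[GoldCase]) -> bool:
    """Accept a prompt revision only if it introduces no new failures."""
    return set(run_battery(new_fn, cases)) <= set(run_battery(old_fn, cases))


def prompt_v1(text: str) -> str:
    # Stub for "prompt v1 + model": misses single-digit volume numbers.
    m = re.search(r"\d{2,} U\.S\. \d+", text)
    return m.group(0) if m else ""


def prompt_v2(text: str) -> str:
    # Stub for a revision that fixes the volume bug but breaks the empty case.
    m = re.search(r"\d+ U\.S\. \d+", text)
    return m.group(0) if m else "none found"


def prompt_v3(text: str) -> str:
    # Stub for a revision that fixes the bug without a regression.
    m = re.search(r"\d+ U\.S\. \d+", text)
    return m.group(0) if m else ""


print(safe_to_ship(prompt_v1, prompt_v2, CASES))  # False
print(safe_to_ship(prompt_v1, prompt_v3, CASES))  # True
```

The second revision fixes the original failure but breaks a case that used to pass, which is exactly the regression the battery exists to catch.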

On o1 and the next generation of reasoning models

o1 passed tests that all prior models failed: detecting subtle misquotations in a 40-page legal brief when given the source case
The shift from pattern-matching (System 1) to deliberate reasoning (System 2) is what unlocks genuinely complex legal tasks
One emerging prompting technique with o1: injecting domain-specific thinking frameworks, not just examples of good answers
Teaching a model how to think about a problem — not just what a good answer looks like — may be the next frontier
Early; no conclusive results yet, but the direction is clear
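
To make the misquotation test above concrete, here is a deterministic baseline that flags quoted passages in a brief that do not appear verbatim in the source. It is only an illustration of the task, not of how a reasoning model approaches it: the function names and sample texts are invented for the example, and a string check like this cannot judge whether an altered quote changes the meaning.

```python
import re


def normalise(text: str) -> str:
    """Unify curly quotes and collapse whitespace so layout differences are not flagged."""
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip()


def find_misquotes(brief: str, source: str) -> list[str]:
    """Return quoted passages in the brief that are not found verbatim in the source."""
    src = normalise(source)
    quotes = re.findall(r'"([^"]+)"', normalise(brief))
    return [q for q in quotes if q not in src]


# Invented sample texts, for illustration only.
source = "The court held that the statute applies only to written contracts."
brief = (
    'The court held that "the statute applies to written contracts" '
    'and that it "applies only to written contracts".'
)

print(find_misquotes(brief, source))
# ['the statute applies to written contracts']
```

The first quotation silently drops the word "only", which is the kind of subtle change that reverses a holding and that a reviewer, human or model, needs to catch.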
