How Casetext built a vertical AI agent and sold for $650M
Executive overview
Legal tech was stuck with incremental improvements lawyers could easily ignore. GPT-4 changed that — not just the technology, but the market's willingness to adopt it.
Casetext spent 10 years building in the legal space, then redirected all 120 employees within 48 hours of seeing an early GPT-4 demo. The result was CoCounsel: an AI legal assistant that does research, document review, and memo drafting at the level of a skilled associate.
The real moat in vertical AI is not the model — it's the layers of domain logic, data pipelines, and test-driven prompt engineering required to get from "impressive demo" to "works 100% of the time."
The 10-year slog before the breakthrough
- Casetext started in 2012 as a user-generated legal annotation site, modelled on Wikipedia and Stack Overflow
- Lawyers bill by the hour — they had no incentive to contribute free content; the UGC model failed entirely
- Pivoted to NLP and machine learning: built citation-graph tools that surfaced cases lawyers had missed
- Revenue was growing 70–80% YoY and approaching $15–20M ARR, but the product was still incremental
- Lawyers earning $5M/year had no desire to change anything; resistance was structural, not irrational
- ChatGPT's public release flipped this overnight — the same lawyers started calling, asking to get ahead of AI
The 48-hour pivot
- CEO Jake Heller and his co-founder were under NDA with OpenAI, testing an early GPT-4 build months before public launch
- Within 48 hours of first seeing it, the decision was made: redirect all 120 people to a new product
- They built the first prototype themselves before telling the rest of the company — the NDA initially covered only the two founders
- One week later, at an executive offsite, the team expected a sales planning session; instead they saw a laptop demo
- Showing the product to customers live — watching senior attorneys have visible existential reactions — converted internal skeptics faster than any argument
Why GPT-4 was the threshold, not GPT-3
- GPT-3.5 scored around the 10th percentile on the bar exam; GPT-4 scored around the 90th percentile
- Earlier models produced fluent but hallucinated output — plausible-sounding law that didn't exist
- Legal work requires zero tolerance for fabrication: wrong case citations or misquoted statutes cause real harm
- The jump from 3.5 to 4 wasn't incremental — it crossed the threshold where accurate, citation-grounded output became achievable
Building CoCounsel: the skills architecture
- Each capability ("skill") was built by working backwards from the end result a lawyer actually needs
- Research workflow example: English query → search syntax → run queries → read results → build outline → write memo
- Each step in that chain became a separate prompt; a full skill might involve a dozen to two dozen chained prompts
- For each prompt, the team wrote gold-standard input/output pairs and built test batteries — starting at dozens, scaling to thousands
- This is test-driven development applied to prompting: adding instructions to fix one failure must not break others
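The chain described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Casetext's implementation: `Step`, `run_skill`, `RESEARCH_SKILL`, and the prompt wording are all hypothetical, and `echo_model` is a toy stand-in for a real model call. The "run queries" stage would be a retrieval call rather than a prompt, so it is omitted here.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a model call: takes a prompt, returns text.
ModelFn = Callable[[str], str]


@dataclass
class Step:
    name: str
    template: str  # must contain the placeholder "{input}"


def run_skill(steps: list[Step], query: str, model: ModelFn) -> dict[str, str]:
    """Feed each step the previous step's output and keep every intermediate result."""
    outputs: dict[str, str] = {}
    current = query
    for step in steps:
        current = model(step.template.format(input=current))
        outputs[step.name] = current
    return outputs


# Illustrative prompts only; the real prompts are not given in the source.
RESEARCH_SKILL = [
    Step("search_syntax", "Convert this question into search queries:\n{input}"),
    Step("read_results", "Summarise the relevant results for:\n{input}"),
    Step("outline", "Build a memo outline from these notes:\n{input}"),
    Step("memo", "Write a memo following this outline:\n{input}"),
]


def echo_model(prompt: str) -> str:
    # Toy model so the sketch runs offline: tags the last line of the prompt.
    return "[model] " + prompt.splitlines()[-1]


results = run_skill(RESEARCH_SKILL, "Is a verbal lease enforceable?", echo_model)
```

Keeping every intermediate output, rather than only the final memo, is what makes it possible to test each prompt in the chain on its own.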
The "GPT wrapper" objection
- CoCounsel is not a wrapper: it includes proprietary legal datasets, automated annotations, integrations with legal-specific document management systems, and custom OCR pipelines
- OCR alone required handling handwriting, tilted scans, and the legal practice of printing four pages onto one — each case requiring explicit handling
- Everything that happens before data reaches the model's context window can represent dozens of engineering decisions
- Prompt strategy, information formatting, and step decomposition are IP — hard to build, hard to replicate
- Analogy: Salesforce was "just a SQL wrapper" but the business logic on top is what made it a billion-dollar company
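To make the "four pages printed onto one" case concrete, here is a minimal sketch of one such pre-model decision: computing crop boxes for a 2x2 sheet so each original page can be passed to OCR separately. The names `Box` and `split_four_up` are hypothetical, and the page ordering is an assumption exposed as a parameter, since the source does not say which order these sheets use.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def split_four_up(width: int, height: int, column_major: bool = False) -> list[Box]:
    """Return four crop boxes for a 2x2 sheet, in the assumed reading order.

    column_major=False reads across then down; True reads down then across.
    Which one applies is an assumption that has to be checked per document type.
    """
    mid_x, mid_y = width // 2, height // 2
    top_left = Box(0, 0, mid_x, mid_y)
    top_right = Box(mid_x, 0, width, mid_y)
    bottom_left = Box(0, mid_y, mid_x, height)
    bottom_right = Box(mid_x, mid_y, width, height)
    if column_major:
        return [top_left, bottom_left, top_right, bottom_right]
    return [top_left, top_right, bottom_left, bottom_right]


boxes = split_four_up(1000, 800)
```

Even this toy version forces a choice about page order; real scans would add deskewing and margin detection before any cropping.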
Getting from 70% to 100%
- 70% accuracy is good enough for a $20/month consumer product; mission-critical enterprise use cases require 100%
- Test-driven prompting surfaces failure patterns; root-causing why a prompt fails usually reveals ambiguous instructions or missing context
- Once a prompt passes a large battery of tests, it tends to generalise reliably to unseen inputs
- First impressions matter acutely: one bad early experience, especially for a non-technical professional, ends adoption permanently
- The last mile (or hundred miles) is where the value is — and where most "GPT wrappers" stop
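The test-driven loop can be sketched as a small regression harness: run a battery of gold-standard cases against two versions of a prompt and report which cases the new version broke. Everything here is illustrative. `GoldCase`, `run_battery`, and `regressions` are hypothetical names, the two "prompt versions" are deterministic toy functions standing in for model calls, and exact string matching is a simplification of whatever grading a real battery would use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldCase:
    case_id: str
    input_text: str
    expected: str


def run_battery(prompt_fn: Callable[[str], str], cases: list[GoldCase]) -> set[str]:
    """Return the ids of cases whose output exactly matches the gold answer."""
    return {c.case_id for c in cases if prompt_fn(c.input_text).strip() == c.expected}


def regressions(before: set[str], after: set[str]) -> set[str]:
    """Cases that passed with the old prompt but fail with the new one."""
    return before - after


# Toy task: does the sentence cite a case? (fictional case names)
CASES = [
    GoldCase("c1", "See Smith v. Jones for the rule.", "yes"),
    GoldCase("c2", "The parties never met.", "no"),
    GoldCase("c3", "Doe vs. Roe was decided last year.", "yes"),
]


def prompt_v1(text: str) -> str:
    # Stand-in for prompt version 1: misses the "vs." spelling.
    return "yes" if " v. " in text else "no"


def prompt_v2(text: str) -> str:
    # Stand-in for a "fix" that is too broad: any letter "v" now counts.
    return "yes" if "v" in text.lower() else "no"


passed_v1 = run_battery(prompt_v1, CASES)
passed_v2 = run_battery(prompt_v2, CASES)
broken = regressions(passed_v1, passed_v2)
```

Here the second version fixes `c3` but breaks `c2`, which is the kind of trade a battery makes visible before the change ships.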
On o1 and the next generation of reasoning models
- o1 passed tests that all prior models failed: detecting subtle misquotations in a 40-page legal brief when given the source case
- The shift from pattern-matching (System 1) to deliberate reasoning (System 2) is what unlocks genuinely complex legal tasks
- One emerging prompting technique with o1: injecting domain-specific thinking frameworks, not just examples of good answers
- Teaching a model how to think about a problem — not just what a good answer looks like — may be the next frontier
- It is still early, with no conclusive results yet, but the direction is clear
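One way to picture the difference between supplying a thinking framework and supplying examples is a prompt builder that takes both as separate inputs. This is a sketch under stated assumptions: `build_prompt` is a hypothetical helper, and the framework steps are made up for illustration, loosely based on the misquotation task mentioned above; the source does not describe the actual frameworks used.

```python
from typing import Sequence


def build_prompt(
    task: str,
    framework: Sequence[str],
    examples: Sequence[tuple[str, str]] = (),
) -> str:
    """Assemble a prompt that states how to reason before stating the task."""
    parts = ["Work through the following steps before answering:"]
    parts += [f"{i}. {step}" for i, step in enumerate(framework, start=1)]
    # Examples are optional: the framework describes the method,
    # while examples only show what a finished answer looks like.
    for question, answer in examples:
        parts.append(f"Example input: {question}\nExample output: {answer}")
    parts.append(f"Task: {task}")
    return "\n".join(parts)


# Hypothetical framework for checking quotations against a source case.
QUOTE_CHECK_FRAMEWORK = [
    "List every quotation in the brief attributed to the source case.",
    "Locate the corresponding passage in the source case.",
    "Compare the two word by word and flag any difference.",
]

prompt = build_prompt("Check the attached brief for misquotations.", QUOTE_CHECK_FRAMEWORK)
```

Keeping the framework as data rather than hard-coded prose means each variant can be run through the same test battery as any other prompt change.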