How Scale AI trains frontier models and why expert data is the new moat
Executive overview
AI models have shifted from knowing things to doing things — and getting agents to act reliably inside real-world software systems is far harder than headlines suggest. Scale AI sits at the center of this work: supplying expert-labelled data and reinforcement learning environments to frontier labs, and building AI applications for enterprise and government customers.
The core bottleneck is no longer raw data volume but human judgment — specifically, the kind of deep domain expertise that tells a model what "good" looks like in a given context.
The real infrastructure of AI progress is expert humans digitising their judgment so models can act reliably in the real world.
What Scale actually does
- Meta invested $14B for a 49% non-voting stake; Scale remains fully independent with its own board and governance
- Alex Wang moved to Meta to lead a superintelligence team; Jason Droege now runs Scale
- Two major business units, each with hundreds of millions in revenue: data supply to model builders, and AI applications/services for enterprise and government
- About 1,100 employees; 250 open roles as of recording
- The company has grown every month since the Meta deal
How data labelling has evolved
- 18 months ago: annotators compared short stories and gave preference rankings — basic, generalist work
- Today: tasks involve world-class engineers building full websites, PhDs explaining nuanced cancer topics to models — hours of work per task
- 80% of Scale's contributor network holds a bachelor's degree or higher; ~15% hold PhDs
- Contributors are found primarily through peer referrals, campus programmes, and LinkedIn — the best come from grassroots networks
- Referrals dominate because contributors find the work meaningful: using expertise to fix a model's gaps in their own field is intrinsically motivating
Reinforcement learning environments
- RL environments are sandboxes where AI agents practice completing goals inside real software systems (e.g. navigating a Salesforce instance)
- The agent must learn: how to read configurations, how to execute a business process reliably, and when to escalate to a human
- The key research question: how generalisable is each task? More generalisable data is more valuable
- The number of software environments multiplied by the number of goals within each is effectively infinite, so the strategy is to collect data that transfers broadly rather than to cover every case exhaustively
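To make the sandbox idea concrete, here is a minimal sketch of an environment with one goal and an escalation rule. Everything in it is an assumption made for illustration: the `ToyCrmEnv` class, the refund scenario, the action names, and the reward values are hypothetical and do not describe Scale's environments or any real CRM API.

```python
from dataclasses import dataclass, field


# Hypothetical toy sandbox: names, actions, and rewards are illustrative only.
@dataclass
class ToyCrmEnv:
    """One support ticket per episode; refunds above the limit need a human."""
    refund_limit: int = 100
    ticket: dict = field(default_factory=dict)
    done: bool = False

    def reset(self, refund_amount: int) -> dict:
        # Start a new episode and return what the agent is allowed to read.
        self.ticket = {"refund_amount": refund_amount, "status": "open"}
        self.done = False
        return {"ticket": dict(self.ticket), "refund_limit": self.refund_limit}

    def step(self, action: str) -> tuple[dict, float, bool]:
        # Apply one action and return (observation, reward, done).
        if self.done:
            raise RuntimeError("episode finished; call reset()")
        needs_human = self.ticket["refund_amount"] > self.refund_limit
        if action == "approve_refund":
            reward = -1.0 if needs_human else 1.0
            self.ticket["status"] = "refunded"
        elif action == "escalate":
            reward = 1.0 if needs_human else -0.2
            self.ticket["status"] = "escalated"
        else:
            raise ValueError(f"unknown action: {action}")
        self.done = True
        return {"ticket": dict(self.ticket)}, reward, self.done


def rule_based_agent(observation: dict) -> str:
    # Stand-in policy: read the configuration, then act or escalate.
    if observation["ticket"]["refund_amount"] > observation["refund_limit"]:
        return "escalate"
    return "approve_refund"


env = ToyCrmEnv(refund_limit=100)
obs = env.reset(refund_amount=250)
_, reward, _ = env.step(rule_based_agent(obs))
print(reward)
```

The three skills listed above map onto the sketch loosely: the observation exposes the configuration, `approve_refund` stands in for executing the business process, and `escalate` is rewarded only when the case really needs a human.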
Evals: what "good" looks like
- Evals are the benchmark for model quality — a comprehensive set of examples showing what the correct or preferred output is
- For enterprise and government customers, evals are the primary work: the customer's own experts must define what good looks like in their specific context
- A document with identical wording can mean something different at two different companies — off-the-shelf models plus RAG plus fine-tuning can only get you so far
- The bottleneck is digitising human judgment at the company-specific level, not just general expertise
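A small sketch can show why company-specific evals matter. The example documents, labels, company names, and the `generic_model` stand-in below are all hypothetical; the point is only that the same output can be correct under one company's expert-defined eval set and wrong under another's.

```python
from typing import Callable

# Hypothetical expert-written eval sets: identical wording, different
# "correct" label depending on the company's own context.
EVALS = {
    "company_a": [
        ("Net 30 payment terms", "standard"),
        ("Auto-renewal clause", "standard"),
    ],
    "company_b": [
        ("Net 30 payment terms", "flag_for_finance"),
        ("Auto-renewal clause", "standard"),
    ],
}


def generic_model(document: str) -> str:
    # Stand-in for an off-the-shelf model with no company-specific context.
    return "standard"


def accuracy(model: Callable[[str], str], examples: list[tuple[str, str]]) -> float:
    # Fraction of examples where the model output matches the expert label.
    correct = sum(1 for doc, expected in examples if model(doc) == expected)
    return correct / len(examples)


scores = {name: accuracy(generic_model, examples) for name, examples in EVALS.items()}
print(scores)
```

In this toy setup the model scores perfectly for one company and gets half the cases wrong for the other, even though its behaviour never changed. Only the definition of "good" did.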
Enterprise AI: what's actually working
- Most POCs reach 60–70% accuracy and teams assume "the rest is easy". It isn't: each additional sigma of reliability is an order of magnitude harder
- Robust deployment takes 6–12 months when done properly: legal, policy, regulatory, change management, and accuracy thresholds all have to align
- The 95% POC failure rate is somewhat overstated — it reflects how easy it is to start a project, not how often serious efforts fail
- AI performs best where current human accuracy is low (10–20%); it struggles to close the last 2% in processes already running at 98% accuracy
- Healthcare example: an AI tool that reads 200–300 pages of patient records and surfaces the top 5–10 considerations — including non-obvious drug-allergy conflicts a human might miss
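The gap between a POC and a production system is easier to see as residual errors than as accuracy. The sketch below is plain arithmetic under stated assumptions: the 70% starting point comes from the POC range above, the target levels are chosen for illustration, and the result says how far errors must shrink, not how much effort that takes.

```python
def errors_per_thousand(accuracy_pct: float) -> int:
    # Number of wrong outcomes expected in 1,000 cases at a given accuracy.
    return round((100 - accuracy_pct) * 10)


def reduction_factor(from_pct: float, to_pct: float) -> float:
    # How many times fewer errors the target allows (undefined for a 100% target).
    return errors_per_thousand(from_pct) / errors_per_thousand(to_pct)


for target in (90, 98, 99):
    print(target, errors_per_thousand(target), reduction_factor(70, target))
```

A 70% POC makes 300 errors per 1,000 cases. Reaching 90% means cutting errors 3x, 98% means 15x, and 99% means 30x, which is why the last stretch dominates the project.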
Where models are heading (2–3 year view)
- The shift is from knowing to doing — knowledge benchmarks are near saturation; the frontier is reliable action
- Agent reliability inside real systems (calendar, CRM, healthcare) is just beginning; trajectory uncertainty is high
- Technology will likely reach a point in 2–3 years that forces policymakers and organisations to respond; the bottleneck then becomes change management, not capability
- The "white collar apocalypse" thesis is premature for the next 1–2 years; human adaptability is consistently underestimated in these predictions
Building and evaluating new businesses
- Two things make companies work: a founder who is a force of nature over a long duration, and a fundamentally good market/business model
- Quick filters: does the business have network effects, lock-in, and increasing value at scale? If not, why not?
- High gross margin is a coarse but fast instrument — if you can't defend a high margin, ask why; the answer usually reveals the real competitive problem
- The urgency of the buyer matters more than the value of the product — building something valuable that isn't the customer's top daily priority creates a very long road
Hiring and team composition
- For ~95% of roles: hire for curious problem-solving, ability to work across people, and leadership potential — not specific prior experience
- For the other ~5% (roles where speed to market is critical, e.g. frontier researchers): prior experience and relationships override the general framework
- A stable management team that knows each other's strengths and compensates for weaknesses outperforms a serially "upgraded" team as the company scales
- The world is changing fast enough that adaptability and growth trajectory matter more than one-to-one experience match
Lessons from Uber Eats
- Launched December 2015 in Toronto; $20K in sales within two hours
- Grew from zero to $20B GMV in four and a half years; now ~$80B
- Key insight: a restaurant's incremental gross margin on delivery orders is 70–80% (ingredient costs scale with orders, while labour and real estate are fixed), which justified a 25–30% take rate
- Turned McDonald's away for four to five months on principle; the delay led to better deal terms and an exclusive relationship that accelerated adoption by other chains
- Tried and abandoned: convenience vans, generalist point-to-point delivery, grocery — food delivery was the one signal that kept strengthening on every dimension
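The take-rate argument can be checked with simple arithmetic. This is a sketch under assumptions not stated in the source: the take rate and the incremental margin are both measured against the same order value, no other costs apply, and the $30 order is a made-up example.

```python
def restaurant_keep_pct(incremental_margin_pct: int, take_rate_pct: int) -> int:
    # Share of the order value left for the restaurant after the platform fee.
    return incremental_margin_pct - take_rate_pct


def restaurant_keep_dollars(order_value: float, margin_pct: int, take_pct: int) -> float:
    # Same quantity expressed in dollars for one order.
    return order_value * restaurant_keep_pct(margin_pct, take_pct) / 100


low = restaurant_keep_pct(70, 30)   # lowest margin, highest take rate
high = restaurant_keep_pct(80, 25)  # highest margin, lowest take rate
print(low, high, restaurant_keep_dollars(30, 70, 30))
```

Even at the unfavourable end of both ranges, the restaurant keeps 40% of each incremental order's value, or $12 on the hypothetical $30 order, and 55% at the favourable end. Under these assumptions a delivery order stays profitable for the restaurant after the platform's cut.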
On independent thinking and founder mindset
- The question to ask before starting anything: "Why do I have an insight that a million other smart, working entrepreneurs don't have?"
- Don't fall in love with ideas — the mission is solving the customer's problem, not validating your prior belief
- "Not losing is a precursor to winning" — survival enables the timing and insight corrections that eventually produce success; high-risk decisions that fail leave no path forward
- The end is never the end: the moments that feel impassable almost always have an imperfect but workable solution