How Scale AI trains frontier models and why expert data is the new moat

Executive overview

AI models have shifted from knowing things to doing things — and getting agents to act reliably inside real-world software systems is far harder than headlines suggest. Scale AI sits at the center of this work: supplying expert-labelled data and reinforcement learning environments to frontier labs, and building AI applications for enterprise and government customers.

The core bottleneck is no longer raw data volume but human judgment — specifically, the kind of deep domain expertise that tells a model what "good" looks like in a given context.

The real infrastructure of AI progress is expert humans digitising their judgment so models can act reliably in the real world.

What Scale actually does

  • Meta invested $14B for a 49% non-voting stake; Scale remains fully independent, with its own board and governance
  • Alex Wang moved to Meta to lead a superintelligence team; Jason Droege now runs Scale
  • Two major business units, each with hundreds of millions in revenue: data supply to model builders, and AI applications/services for enterprise and government
  • About 1,100 employees; 250 open roles as of recording
  • The company has grown every month since the Meta deal

How data labelling has evolved

  • 18 months ago: annotators compared short stories and gave preference rankings — basic, generalist work
  • Today: tasks involve world-class engineers building full websites and PhDs explaining nuanced cancer topics to models, with hours of work per task
  • 80% of Scale's contributor network holds a bachelor's degree or higher; ~15% hold PhDs
  • Contributors are found primarily through peer referrals, campus programmes, and LinkedIn — the best come from grassroots networks
  • Referrals dominate because contributors find the work meaningful: using expertise to fix a model's gaps in their own field is intrinsically motivating

Reinforcement learning environments

  • RL environments are sandboxes where AI agents practice completing goals inside real software systems (e.g. navigating a Salesforce instance)
  • The agent must learn: how to read configurations, how to execute a business process reliably, and when to escalate to a human
  • The key research question: how generalisable is each task? More generalisable data is more valuable
  • The number of software environments times the number of goals within each is effectively infinite — so the strategy is collecting data that transfers broadly, not exhaustively
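As a rough illustration of the sandbox idea above, here is a minimal sketch of a toy environment in Python. Everything in it (the `ToyCrmEnv` class, the refund scenario, the action names and the reward values) is a hypothetical stand-in invented for this example, not Scale's tooling or a real Salesforce API. It only shows the three things the agent has to learn: reading configuration, executing the process, and escalating when the case is outside its authority.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCrmEnv:
    """Hypothetical sandbox: the goal is to handle one refund ticket correctly."""
    refund_limit: int = 100    # assumed system configuration the agent must read
    ticket_amount: int = 250   # refund requested in this episode
    done: bool = field(default=False, init=False)

    def read_config(self) -> dict:
        """Expose the configuration, as a real system's settings page would."""
        return {"refund_limit": self.refund_limit}

    def step(self, action: str) -> float:
        """Apply one action and return a reward; the episode ends after it."""
        if self.done:
            raise RuntimeError("episode already finished")
        self.done = True
        needs_human = self.ticket_amount > self.refund_limit
        if action == "escalate":
            # Escalating is right only when the refund exceeds the limit.
            return 1.0 if needs_human else -0.5
        if action == "approve_refund":
            # Approving beyond the limit is the costly mistake.
            return -1.0 if needs_human else 1.0
        raise ValueError(f"unknown action: {action}")


def simple_policy(env: ToyCrmEnv) -> str:
    """Hand-written baseline: read the config, then act or escalate."""
    limit = env.read_config()["refund_limit"]
    return "escalate" if env.ticket_amount > limit else "approve_refund"


env = ToyCrmEnv()
reward = env.step(simple_policy(env))
```

A trained agent would replace `simple_policy`; the generalisation question is whether what it learns here carries over to other systems and goals it has not seen.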

Evals: what "good" looks like

  • Evals are the benchmark for model quality — a comprehensive set of examples showing what the correct or preferred output is
  • For enterprise and government customers, evals are the primary work: the customer's own experts must define what good looks like in their specific context
  • A document with identical wording can mean something different at two different companies — off-the-shelf models plus RAG plus fine-tuning can only get you so far
  • The bottleneck is digitising human judgment at the company-specific level, not just general expertise
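To make the point about company-specific evals concrete, here is a minimal sketch in Python. The case data, the `run_eval` helper and the `generic_model` stub are all hypothetical, and exact-match scoring is an assumption chosen for brevity; real evals typically use richer grading. The sketch shows how identical wording can have different preferred outputs at two companies, so the same model scores differently against each company's expert-defined cases.

```python
from typing import Callable

# Each eval case pairs an input with the output the customer's experts prefer.
EvalCase = tuple[str, str]


def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases where the model matches the expert answer."""
    if not cases:
        raise ValueError("need at least one eval case")
    hits = sum(
        1
        for prompt, expected in cases
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return hits / len(cases)


# Same wording, different correct answers at two hypothetical companies.
company_a_cases = [("Classify contract clause: 'net 30'", "standard terms")]
company_b_cases = [("Classify contract clause: 'net 30'", "needs approval")]


def generic_model(prompt: str) -> str:
    """Stand-in for an off-the-shelf model with no company context."""
    return "standard terms"


score_a = run_eval(generic_model, company_a_cases)
score_b = run_eval(generic_model, company_b_cases)
```

The gap between `score_a` and `score_b` comes entirely from the eval set, which is why the customer's own experts have to write it.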

Enterprise AI: what's actually working

  • Most POCs reach 60–70% accuracy and teams assume "the rest is easy". It isn't: each additional sigma of reliability is an order of magnitude harder
  • Robust deployment takes 6–12 months when done properly: legal, policy, regulatory, change management, and accuracy thresholds all have to align
  • The 95% POC failure rate is somewhat overstated — it reflects how easy it is to start a project, not how often serious efforts fail
  • AI performs best where current human accuracy is low (10–20%); it struggles to close the last 2% in processes already running at 98% accuracy
  • Healthcare example: an AI tool that reads 200–300 pages of patient records and surfaces the top 5–10 considerations — including non-obvious drug-allergy conflicts a human might miss

Where models are heading (2–3 year view)

  • The shift is from knowing to doing — knowledge benchmarks are near saturation; the frontier is reliable action
  • Agent reliability inside real systems (calendar, CRM, healthcare) is just beginning; trajectory uncertainty is high
  • Technology will likely reach a point in 2–3 years that forces policymakers and organisations to respond; the bottleneck then becomes change management, not capability
  • The "white collar apocalypse" thesis is premature for the next 1–2 years; human adaptability is consistently underestimated in these predictions

Building and evaluating new businesses

  • Two things make companies work: a founder who is a force of nature over a long duration, and a fundamentally good market/business model
  • Quick filters: does the business have network effects, lock-in, and increasing value at scale? If not, why not?
  • High gross margin is a coarse but fast instrument — if you can't defend a high margin, ask why; the answer usually reveals the real competitive problem
  • The urgency of the buyer matters more than the value of the product — building something valuable that isn't the customer's top daily priority creates a very long road

Hiring and team composition

  • For ~95% of roles: hire for curious problem-solving, ability to work across people, and leadership potential — not specific prior experience
  • For the other ~5% (roles where speed to market is critical, e.g. frontier researchers): prior experience and relationships override the general framework
  • A stable management team that knows each other's strengths and compensates for weaknesses outperforms a serially "upgraded" team as the company scales
  • The world is changing fast enough that adaptability and growth trajectory matter more than one-to-one experience match

Lessons from Uber Eats

  • Launched December 2015 in Toronto; $20K in sales within two hours
  • Grew from zero to $20B GMV in four and a half years; now ~$80B
  • Key insight: restaurant incremental gross margin on delivery orders is 70–80% (ingredient costs scale with volume, while labour and real estate are fixed), which justified a 25–30% take rate
  • Pushed McDonald's away for four to five months on principle; the delay led to better deal terms and an exclusive relationship that accelerated chain adoption
  • Tried and abandoned: convenience vans, generalist point-to-point delivery, grocery — food delivery was the one signal that kept strengthening on every dimension
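The take-rate reasoning in the list above can be checked with simple arithmetic. The sketch below is a simplification under stated assumptions: the take rate is applied to the full order value, and delivery fees, packaging and promotions are ignored. The $100 order and the `restaurant_keep` function are illustrative, not figures or methods from the source.

```python
def restaurant_keep(order_value: float, incremental_margin: float, take_rate: float) -> float:
    """Incremental gross profit the restaurant keeps after the platform's cut.

    Assumes the take rate applies to the full order value (a simplification).
    """
    return round(order_value * (incremental_margin - take_rate), 2)


# Least favourable end of the quoted ranges: 70% margin, 30% take rate.
worst_case = restaurant_keep(100.0, 0.70, 0.30)
# Most favourable end: 80% margin, 25% take rate.
best_case = restaurant_keep(100.0, 0.80, 0.25)
print(worst_case, best_case)
```

Under these assumptions, a restaurant still keeps roughly $40 to $55 of incremental gross profit on a $100 delivery order, which is why a take rate that looks high in isolation could still be a good deal for the restaurant.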

On independent thinking and founder mindset

  • The question to ask before starting anything: "Why do I have an insight that a million other smart, working entrepreneurs don't have?"
  • Don't fall in love with ideas — the mission is solving the customer's problem, not validating your prior belief
  • "Not losing is a precursor to winning" — survival enables the timing and insight corrections that eventually produce success; high-risk decisions that fail leave no path forward
  • The end is never the end: the moments that feel impassable almost always have an imperfect but workable solution
