How Surge AI hit $1B revenue with under 100 people and no VC funding

Executive overview

Surge AI reached over $1 billion in revenue in under four years with fewer than 100 employees, no VC funding, and profitability from day one. Edwin Chen built it by rejecting the standard Silicon Valley playbook: no fundraising, no PR, no pivoting, just relentless focus on high-quality AI training data.

The core argument is that most AI labs are optimizing for the wrong things: gaming benchmarks, chasing leaderboard rankings, and training models for engagement rather than genuine usefulness. Surge exists to help labs define and measure what actually matters.

The company that only you could build is the one worth building — everything else is noise.

Building a billion-dollar company without raising money

Surge hit $1B+ revenue in under four years with under 100 people — bootstrapped from day one
No VC funding meant no TechCrunch headlines, no viral tweets — growth came entirely from word of mouth among researchers who understood data quality
Early customers were mission-aligned: they cared deeply about data quality because they knew it would make their models better
Fewer employees means less capital is needed, which removes the pressure to raise and opens the door to founders who are great at technology rather than great at pitching
The playbook Chen rejects: pivoting every two weeks, blitzscaling headcount, and chasing growth with dark patterns
Advice for founders: build the one thing only you could build — the thing that wouldn't exist without your specific background and insight
If you fail because the market wasn't ready, that's better than pivoting into another LLM wrapper

Why data quality is misunderstood

Most people think quality means checking boxes: does the poem have eight lines, does it mention the moon?
Surge's standard is closer to Nobel Prize-level work: is the poem unique, surprising, emotionally resonant, full of subtle imagery?
Quality is subjective, complex, and hard to measure — Surge builds ML systems tracking thousands of signals per worker and per task
Two distinct problems: removing the worst of the worst (content moderation) and surfacing the best of the best (talent discovery)
Workers are matched to tasks by expertise, keystroke patterns, speed, output quality, and whether their work measurably improves model performance
Chen attributes Claude's long lead in coding and writing quality in part to Anthropic having better taste in what data to train on and which objective function to optimize
Post-training is an art, not a science — decisions about visual design preferences, minimalism, and stylistic trade-offs all compound into model character
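
The two problems above, filtering out the worst work and surfacing the best, can be illustrated with a minimal sketch. This is not Surge's system: the `Worker` fields, the `quality_floor` threshold, and the ranking by `model_lift` are all hypothetical stand-ins for the thousands of signals the summary mentions.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    expertise: set      # domains the worker is qualified for (assumed field)
    quality: float      # 0..1 reviewed output quality (assumed field)
    model_lift: float   # hypothetical measured gain when a model trains on their work


def match_workers(workers, task_domain, quality_floor=0.5, top_k=2):
    """Drop unqualified or low-quality workers, then rank the rest by model lift."""
    # Problem 1: remove the worst of the worst.
    eligible = [
        w for w in workers
        if task_domain in w.expertise and w.quality >= quality_floor
    ]
    # Problem 2: surface the best of the best.
    ranked = sorted(eligible, key=lambda w: (w.model_lift, w.quality), reverse=True)
    return [w.name for w in ranked[:top_k]]


workers = [
    Worker("ana", {"poetry", "fiction"}, quality=0.95, model_lift=0.30),
    Worker("ben", {"poetry"}, quality=0.40, model_lift=0.50),
    Worker("cy", {"python"}, quality=0.90, model_lift=0.40),
    Worker("dee", {"poetry"}, quality=0.80, model_lift=0.10),
]

print(match_workers(workers, "poetry"))
```

In this toy data, "ben" has the highest lift but falls below the quality floor, so filtering and ranking give different answers than either step would alone.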

The benchmark problem

Benchmarks are often just wrong: incorrect answers, messy data, flawed evaluation design
They reward hill-climbing on well-defined objective tasks, not real-world messiness
Models can win International Mathematical Olympiad (IMO) gold medals but still struggle to parse PDFs, because IMO problems have clean, objectively checkable answers and PDF parsing does not
Labs sometimes game benchmarks by tweaking system prompts or sampling parameters to inflate scores
Researchers at labs describe being pressured to climb leaderboards even when doing so degrades real-world model quality
LM Arena (a popular crowdsourced leaderboard) rewards superficial responses: more emojis, longer outputs, more bolding — not accuracy
Surge's alternative: expert human annotators who deeply verify model outputs, check equations, test code, and evaluate across multiple dimensions
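
The contrast between style-driven preference and expert, multi-dimensional grading can be sketched as follows. Everything here is illustrative: the scoring formulas, the rubric dimensions, and the weights in `RUBRIC_WEIGHTS` are assumptions made for the example, not how any real leaderboard or Surge's evaluation works.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    name: str
    word_count: int
    emoji_count: int
    bold_count: int
    # Expert grades per dimension, each in 0..1 (hypothetical rubric).
    grades: dict = field(default_factory=dict)


# Assumed weights: verified correctness dominates, presentation matters least.
RUBRIC_WEIGHTS = {"correctness": 0.6, "code_runs": 0.3, "clarity": 0.1}


def superficial_score(r: Response) -> float:
    """Toy proxy for a quick preference vote: rewards length and decoration only."""
    return 0.01 * r.word_count + 0.5 * r.emoji_count + 0.5 * r.bold_count


def rubric_score(r: Response) -> float:
    """Weighted sum of expert grades across the rubric dimensions."""
    return sum(weight * r.grades[dim] for dim, weight in RUBRIC_WEIGHTS.items())


flashy = Response("flashy", word_count=600, emoji_count=8, bold_count=10,
                  grades={"correctness": 0.0, "code_runs": 0.0, "clarity": 0.9})
plain = Response("plain", word_count=120, emoji_count=0, bold_count=0,
                 grades={"correctness": 1.0, "code_runs": 1.0, "clarity": 0.6})

for score in (superficial_score, rubric_score):
    winner = max((flashy, plain), key=score)
    print(f"{score.__name__}: {winner.name}")
```

The two scorers pick different winners on the same pair of responses, which is the failure mode the section describes: a metric that never checks the answer can be won with formatting.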

AI optimized for engagement, not truth

Optimizing social media for engagement always produced the same result: feeds filled with clickbait, sensationalism, bikinis, and Bigfoot
The same dynamic is emerging in AI: models that flatter users, feed delusions, pull people down rabbit holes
A model that keeps suggesting 20 more email improvements sucks up time and engagement — a better model tells you to stop and send
Different labs are beginning to diverge meaningfully in model personality and values — this differentiation will only grow
Chen cites Anthropic as unusually principled in what it does and doesn't optimize for
The right question for any product decision: are you building AI that advances humanity, or AI that makes people lazier?

Reinforcement learning environments

RL environments are simulations of the real world — like video games with fully fleshed-out businesses, tools, Slack threads, GitHub PRs, and databases
Models are given tasks, attempt them, and receive rewards based on success — teaching end-to-end behavior, not single-step responses
Current models perform well on isolated benchmarks but fail catastrophically in messy multi-step environments where step 1 affects step 50
Trajectories matter: a model that randomly stumbles to the correct answer after 50 failed attempts should not be rewarded the same as one that reasons cleanly to it
Post-training progression: SFT (mimicking a master) → RLHF (feedback on ranked outputs) → rubrics and verifiers (graded with detailed feedback) → RL environments (learning by doing in simulated worlds)
The analogy is becoming a great writer: you don't memorize grammar rules; you read great books, write constantly, get feedback, and develop taste
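
The point about trajectories can be made concrete with a small sketch comparing an outcome-only reward with one that also looks at how the model got there. The `Trajectory` record and the `step_penalty` value are hypothetical choices for illustration; real RL environments grade trajectories in far richer ways.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """One episode: a list of (action, succeeded) pairs and the final outcome."""
    actions: list = field(default_factory=list)
    solved: bool = False


def outcome_reward(traj: Trajectory) -> float:
    """Outcome-only reward: 1.0 if the task was solved, else 0.0."""
    return 1.0 if traj.solved else 0.0


def trajectory_reward(traj: Trajectory, step_penalty: float = 0.01) -> float:
    """Assumed trajectory-aware reward: success minus a cost per failed attempt."""
    if not traj.solved:
        return 0.0
    failed = sum(1 for _, ok in traj.actions if not ok)
    return max(0.0, 1.0 - step_penalty * failed)


# A run that reasons cleanly to the answer.
clean = Trajectory(actions=[("read_ticket", True), ("fix_bug", True)], solved=True)
# A run that stumbles onto the answer after 50 failed guesses.
lucky = Trajectory(actions=[("guess", False)] * 50 + [("guess", True)], solved=True)

for name, traj in (("clean", clean), ("lucky", lucky)):
    print(name, outcome_reward(traj), trajectory_reward(traj))
```

Under the outcome-only reward both runs look identical, so the lucky run's behavior is reinforced just as strongly as the clean one's. The trajectory-aware version separates them.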

The future of AI development

Models will become increasingly differentiated by the values of the companies building them — not just by benchmark scores
Chen expects 80% of an average L6 software engineer's job to be automated within one to two years and the step to 90% to take several more years; he puts AGI a decade or more away
LLMs will need to be supplemented by new learning paradigms before AGI is reached
Under-hyped: built-in mini apps and artifacts inside chatbots (Claude Artifacts cited as an example)
Over-hyped: vibe coding. Dumping AI-generated code into a codebase without understanding it creates long-term maintainability debt
The types of companies being built will change: smaller teams, less capital required, founders who are builders rather than pitchers

Surge's research team

Surge operates more like a research lab than a startup — internal researchers work on better benchmarks and leaderboards to fix the problems Chen identifies
Deployed researchers work alongside customers to diagnose model weaknesses and design data sets and evals to close gaps
Research also covers training Surge's own models internally to determine what types of data and annotators produce the best results
Chen's personal practice: every time a new model is released, he does a deep dive — running evals, comparing capabilities, writing analysis distributed to customers
