How Surge AI hit $1B revenue with under 100 people and no VC funding

Executive overview

Surge AI reached over $1 billion in revenue in under four years with fewer than 100 employees and no VC funding, and it was profitable from day one. Edwin Chen built it by rejecting the standard Silicon Valley playbook: no fundraising, no PR, no pivoting, just relentless focus on high-quality AI training data.

The core argument is that most AI labs are optimizing for the wrong things: gaming benchmarks, chasing leaderboard rankings, and training models for engagement rather than genuine usefulness. Surge exists to help labs define and measure what actually matters.

The company that only you could build is the one worth building — everything else is noise.

Building a billion-dollar company without raising money

  • Surge hit $1B+ revenue in under four years with under 100 people — bootstrapped from day one
  • No VC funding meant no TechCrunch headlines, no viral tweets — growth came entirely from word of mouth among researchers who understood data quality
  • Early customers were mission-aligned: they cared deeply about data quality because they knew it would make their models better
  • Fewer employees means less capital needed, which removes the need to raise money and opens the door to founders who are great at technology rather than great at pitching
  • The playbook Chen rejects: pivot every two weeks, blitz-scale hiring, chase growth with dark patterns
  • Advice for founders: build the one thing only you could build — the thing that wouldn't exist without your specific background and insight
  • If you fail because the market wasn't ready, that's better than pivoting into another LLM wrapper

Why data quality is misunderstood

  • Most people think quality means checking boxes: does the poem have eight lines, does it mention the moon?
  • Surge's standard is closer to Nobel Prize-level work: is the poem unique, surprising, emotionally resonant, full of subtle imagery?
  • Quality is subjective, complex, and hard to measure — Surge builds ML systems tracking thousands of signals per worker and per task
  • Two distinct problems: removing the worst of the worst (content moderation) and surfacing the best of the best (talent discovery)
  • Workers are matched to tasks by expertise, keystroke patterns, speed, output quality, and whether their work measurably improves model performance
  • Claude's long lead in coding and writing quality is attributed in part to Anthropic having better taste in what data to train on and what objective function to optimize toward
  • Post-training is an art, not a science — decisions about visual design preferences, minimalism, and stylistic trade-offs all compound into model character
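
The checkbox notion of quality from the first bullet is easy to sketch in code, which is part of why it is tempting and also why it falls short. The function name and the two checks below are hypothetical, chosen only to mirror the poem example; this is an illustration, not a description of Surge's systems.

```python
def checkbox_quality(poem: str) -> dict[str, bool]:
    """Hypothetical 'checkbox' quality check mirroring the poem example."""
    lines = [ln for ln in poem.splitlines() if ln.strip()]
    return {
        "has_eight_lines": len(lines) == 8,
        "mentions_moon": "moon" in poem.lower(),
    }

# A poem that repeats one flat sentence eight times still passes every check.
flat_poem = "\n".join(["The moon is in the sky tonight."] * 8)
print(checkbox_quality(flat_poem))
```

Nothing in these checks can tell whether the poem is unique, surprising, or emotionally resonant, which is the gap the rest of this section describes.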

The benchmark problem

  • Benchmarks are often just wrong: incorrect answers, messy data, flawed evaluation design
  • They reward hill-climbing on well-defined objective tasks, not real-world messiness
  • Models can win IMO gold medals but still struggle to parse PDFs, because IMO problems have clean, objectively checkable answers and PDF parsing does not
  • Labs sometimes game benchmarks by tweaking system prompts or sampling parameters to inflate scores
  • Researchers at labs describe being pressured to climb leaderboards even when doing so degrades real-world model quality
  • LM Arena (a popular crowdsourced leaderboard) rewards superficial responses: more emojis, longer outputs, more bolding — not accuracy
  • Surge's alternative: expert human annotators who deeply verify model outputs, check equations, test code, and evaluate across multiple dimensions
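
To make the point about superficial signals concrete, here is a minimal sketch that measures only the surface features named above (length, bolding, emojis). The function, the sample responses, and the emoji code-point range are assumptions made for illustration; this is not how any real leaderboard computes scores.

```python
import re

def surface_features(response: str) -> dict[str, int]:
    """Count surface-level style features; none of them reflect correctness."""
    return {
        "words": len(response.split()),
        "bold_spans": len(re.findall(r"\*\*[^*]+\*\*", response)),
        # Rough emoji check: a common emoji code-point range (an assumption,
        # not a complete emoji definition).
        "emojis": sum(1 for ch in response if 0x1F300 <= ord(ch) <= 0x1FAFF),
    }

correct = "2 + 2 = 4."
flashy = "**Great question!** 🚀 The answer is **definitely** 5, and here is why 🎉"

print(surface_features(correct))
print(surface_features(flashy))
```

The wrong answer wins on every surface feature, so a rater who skims will tend to prefer it. Checking the arithmetic requires the kind of verification described in the last bullet.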

AI optimized for engagement, not truth

  • Optimizing social media for engagement always produced the same result: clickbait, sensationalism, bikinis and Bigfoot filling feeds
  • The same dynamic is emerging in AI: models that flatter users, feed delusions, pull people down rabbit holes
  • A model that keeps suggesting 20 more email improvements eats up the user's time in exchange for engagement; a better model tells you to stop and send
  • Different labs are beginning to diverge meaningfully in model personality and values — this differentiation will only grow
  • Anthropic is cited as unusually principled in what it does and doesn't optimize for
  • The right question for any product decision: are you building AI that advances humanity, or AI that makes people lazier?

Reinforcement learning environments

  • RL environments are simulations of the real world — like video games with fully fleshed-out businesses, tools, Slack threads, GitHub PRs, and databases
  • Models are given tasks, attempt them, and receive rewards based on success — teaching end-to-end behavior, not single-step responses
  • Current models perform well on isolated benchmarks but fail catastrophically in messy multi-step environments where step 1 affects step 50
  • Trajectories matter: a model that randomly stumbles to the correct answer after 50 failed attempts should not be rewarded the same as one that reasons cleanly to it
  • Post-training progression: SFT (mimicking a master) → RLHF (feedback on ranked outputs) → rubrics and verifiers (graded with detailed feedback) → RL environments (learning by doing in simulated worlds)
  • The analogy for becoming a great writer applies: you don't memorize grammar rules — you read great books, write constantly, get feedback, develop taste
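
The point about trajectories can be sketched as two reward functions: one that looks only at the final outcome and one that also considers how the model got there. Everything here is a simplifying assumption for illustration (the `Step` type, the action names, and the per-failure penalty of 0.01); the source says only that the two trajectories should not be rewarded the same, not how the difference should be computed.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str       # hypothetical action name
    succeeded: bool   # whether this individual action worked

def outcome_reward(trajectory: list[Step], solved: bool) -> float:
    """Outcome-only reward: ignores the path taken."""
    return 1.0 if solved else 0.0

def trajectory_reward(trajectory: list[Step], solved: bool,
                      penalty: float = 0.01) -> float:
    """Outcome reward minus an assumed penalty per failed step, floored at 0."""
    if not solved:
        return 0.0
    failures = sum(1 for step in trajectory if not step.succeeded)
    return max(0.0, 1.0 - penalty * failures)

clean = [Step("read_ticket", True), Step("patch_bug", True), Step("run_tests", True)]
stumbling = [Step("random_edit", False)] * 50 + [Step("patch_bug", True)]

# Outcome-only scoring cannot tell the two apart; trajectory scoring can.
print(outcome_reward(clean, True), outcome_reward(stumbling, True))
print(trajectory_reward(clean, True), trajectory_reward(stumbling, True))
```

A flat per-failure penalty is the simplest possible choice. A rubric or verifier that grades the reasoning itself, as in the progression above, would be a richer signal.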

The future of AI development

  • Models will become increasingly differentiated by the values of the companies building them — not just by benchmark scores
  • Chen expects 80% of an average L6 software engineer's job to be automated within one to two years, with reaching 90% taking several more years; he puts AGI a decade or more away
  • LLMs will need to be supplemented by new learning paradigms before AGI is reached
  • Under-hyped: built-in mini apps and artifacts inside chatbots (Claude Artifacts cited as an example)
  • Over-hyped: vibe coding — dumping AI-generated code into codebases without understanding it will create long-term maintainability debt
  • The types of companies being built will change: smaller teams, less capital required, founders who are builders rather than pitchers

Surge's research team

  • Surge operates more like a research lab than a startup — internal researchers work on better benchmarks and leaderboards to fix the problems Chen identifies
  • Deployed researchers work alongside customers to diagnose model weaknesses and design data sets and evals to close gaps
  • Research also covers training Surge's own models internally to determine what types of data and annotators produce the best results
  • Chen's personal practice: every time a new model is released, he does a deep dive — running evals, comparing capabilities, writing analysis distributed to customers
