A/B testing at scale: lessons from Microsoft, Amazon, and Airbnb
Executive overview
Most product teams ship changes that hurt or do nothing — and never know it. Controlled experiments are the only reliable way to find out if a change actually works. Ronny Kohavi, who built experimentation platforms at Amazon, Microsoft (Bing), and Airbnb, argues that no company can experiment too much, but most experiment badly.
The core problem is trust. An experiment platform is only useful if its results are believable. Getting that right — statistically sound metrics, sample ratio mismatch detection, correct p-value interpretation — is harder than teams assume.
You can't predict which ideas will win; the experiment is the oracle.
Failure rates and expectations
- At Microsoft overall, ~66% of ideas fail
- At Bing (heavily optimised domain), ~85% fail
- At Airbnb search, 92% of experiments failed to move the key metric
- Booking, Google Ads, and others report 80–90% failure rates
- 10% of experiments are aborted on day one due to implementation bugs, not bad ideas
- Most value comes from many small wins accumulating, not home runs
The overall evaluation criterion (OEC)
- The OEC is the single metric (or composite) that captures what you're optimising for long-term
- Revenue alone is the wrong OEC — you can always make more money short-term by degrading user experience
- Define OEC so it is causally predictive of lifetime value
- Constraint framing helps: "increase revenue without exceeding X vertical pixels of ad space"
- Countervailing metrics catch hidden harms — battery life, churn, unsubscribe rates
- If your team can't agree on the direction of a metric (more time on site: good or bad?), the OEC is invalid
When to start running experiments
- With fewer than tens of thousands of users, experiments lack the statistical power to detect realistic effects on most metrics
- For a retail site trying to detect a 5% relative improvement in conversion, roughly 200,000 users are needed
- Below that threshold: build the culture, build the platform, integrate the tooling
- At 200,000+ users, you can test everything and reliably detect meaningful effects
- Focus on detecting 5–10% effects at early stages, not 1%
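The user thresholds above can be sanity-checked with a common rule of thumb for sample size: about 16σ²/δ² users per variant for roughly 80% power at a two-sided 5% significance level. The sketch below is only an approximation under stated assumptions: the 5% baseline conversion rate and the function name `users_per_variant` are illustrative and do not come from the source.

```python
import math


def users_per_variant(baseline_rate: float, relative_lift: float) -> int:
    """Approximate users needed per variant (~80% power, alpha = 0.05 two-sided).

    Rule of thumb: n = 16 * sigma^2 / delta^2, where sigma^2 = p * (1 - p) is
    the variance of a yes/no conversion metric and delta is the absolute
    difference in conversion rate that we want to detect.
    """
    variance = baseline_rate * (1.0 - baseline_rate)
    delta = baseline_rate * relative_lift  # relative lift -> absolute difference
    return math.ceil(16.0 * variance / delta ** 2)


# Assumption (hypothetical): 5% of visitors convert today.
for lift in (0.10, 0.05, 0.01):
    n = users_per_variant(baseline_rate=0.05, relative_lift=lift)
    print(f"{lift:.0%} relative lift: {n:,} per variant, {2 * n:,} in total")
```

Under that assumed baseline, a 5% relative lift needs on the order of 120,000 users per variant, which is in the same range as the ~200,000 total quoted above. Because the required sample grows with 1/δ², detecting a 1% lift needs about 25 times more users than a 5% lift, which is why early-stage teams are advised to aim for larger effects.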
Trust and statistical validity
- An experiment platform is a safety net (catch bad launches fast) and an oracle (tell you what actually happened)
- Trust is easy to lose and hard to rebuild — statistical errors destroy organisational confidence
- Sample ratio mismatch (SRM): if you designed for a 50/50 split and observed 50.2/49.8, test whether a deviation of that size could plausibly occur by chance; with large samples it often cannot
- At Microsoft, ~8% of experiments had SRM; third-party vendors report 6–10%
- Common SRM causes: bots hitting control and treatment at different rates, flawed data pipelines, skewed traffic sources
- Surface SRM visibly — blank out the scorecard, not just a dismissible banner
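One way to run the SRM check described above is a chi-squared goodness-of-fit test on the user counts in each variant. The sketch below uses only the standard library. The user counts and the 0.001 alert threshold are assumptions for illustration, and `srm_p_value` and `has_srm` are hypothetical names rather than any platform's API.

```python
import math


def srm_p_value(control_users: int, treatment_users: int,
                expected_control_share: float = 0.5) -> float:
    """P-value of a chi-squared goodness-of-fit test with 1 degree of freedom."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def has_srm(control_users: int, treatment_users: int,
            threshold: float = 0.001) -> bool:
    """Flag a sample ratio mismatch; the threshold is an assumed convention."""
    return srm_p_value(control_users, treatment_users) < threshold


# The same 50.2/49.8 split at two (hypothetical) experiment sizes.
print(srm_p_value(5_020, 4_980))      # small experiment: consistent with chance
print(srm_p_value(502_000, 498_000))  # large experiment: very unlikely by chance
print(has_srm(502_000, 498_000))
```

The comparison shows why the raw percentages are not enough: a 50.2/49.8 split is unremarkable with 10,000 users but is strong evidence of a broken experiment with 1,000,000.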
P-values and false positives
- The most common misreading: a p-value of 0.02 does not mean 98% probability the treatment is better
- A p-value is the probability of observing data at least as extreme as what you saw, assuming the null hypothesis is true; it is not the probability that the null hypothesis is true
- To get the probability you actually want, apply Bayes' rule using a prior (historical success rate)
- At Airbnb search (8% success rate), a statistically significant result at p < 0.05 has a 26% chance of being a false positive
- Mitigation: require p < 0.01, then replicate; combine experiments using Fisher's or Stouffer's method
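The Bayes' rule calculation above can be sketched in a few lines. The version below assumes 80% statistical power and a two-sided test in which only half of alpha falls on the "treatment wins" side. Both are assumptions made here, not figures from the source, although with them the 8% prior gives roughly the 26% quoted above. The second function illustrates Fisher's method for combining p-values from independent replications. All function names are hypothetical.

```python
import math


def false_positive_risk(prior_success: float, alpha: float = 0.05,
                        power: float = 0.8) -> float:
    """P(no real effect | statistically significant win), via Bayes' rule."""
    # Chance of a significant "win" when the idea does nothing (assumed: alpha / 2).
    false_positives = (alpha / 2.0) * (1.0 - prior_success)
    # Chance of a significant win when the idea really works (assumed power).
    true_positives = power * prior_success
    return false_positives / (false_positives + true_positives)


def fisher_combined_p(p_values: list[float]) -> float:
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(ln p) follows a chi-squared distribution with 2k
    degrees of freedom, whose survival function has a closed form for even
    degrees of freedom: exp(-h) * sum(h^i / i!) for i < k, with h = statistic / 2.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)
    return math.exp(-half_stat) * sum(
        half_stat ** i / math.factorial(i) for i in range(k)
    )


print(f"{false_positive_risk(0.08):.1%}")              # 8% prior, alpha = 0.05
print(f"{false_positive_risk(0.08, alpha=0.01):.1%}")  # stricter threshold
print(f"{fisher_combined_p([0.04, 0.03]):.4f}")        # original run plus a replication
```

Tightening alpha to 0.01 cuts the false positive risk to well under 10% under the same assumptions, and two individually marginal results combine into much stronger evidence, which is the reasoning behind "require p < 0.01, then replicate". Stouffer's method follows the same idea using z-scores and is omitted here for brevity.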
Twyman's law
- "Any figure that looks interesting or different is usually wrong"
- If a result is 10x larger than your typical experiment movement, investigate before celebrating
- Nine out of ten times Twyman's law is invoked, a flaw is found
- Exception: some genuine outliers exist, but they require multiple replications to confirm
Portfolio thinking and big bets
- Experimentation doesn't prevent innovation — but you need a portfolio
- Allocate most effort to incremental improvements; reserve some for high-risk, high-reward bets
- Be ready for big bets to fail ~80% of the time
- Examples of costly failures: Bing's social integration (100 person-years, all experiments flat or negative), Netflix social, Airbnb social, Airbnb online experiences
- OFAT (one factor at a time): decompose redesigns into smaller testable changes; of 17 changes, maybe 4 are positive — ship those
Large redesigns
- Full redesigns almost never win outright in experiments
- Teams launch them anyway because of sunk cost — months of work creates pressure to ship even on flat or negative results
- A data-driven org should not ship a redesign whose results are flat or negative unless legally required
- If legally forced to take a hit, run three variants and ship the one that hurts least
- Right approach: move incrementally in the intended direction, testing on the way
Building and scaling an experiment platform
- Goal: reduce marginal cost of running an experiment to near zero
- Self-service setup, templated metric scorecards (e.g. 2,000 metrics per UI experiment type)
- Insufficient platform investment forces reliance on data scientists for each analysis — doesn't scale
- Six maturity axes exist, each progressing through four stages (crawl → walk → run → fly); assess where you are before deciding what to build next
- Build vs. buy: for most teams starting out, third-party platforms are now good enough; the choice is usually how much to build on top
Institutional memory
- Run quarterly reviews of the most surprising experiments (not just the winners)
- Surprising = large gap between expected and actual result, in either direction
- Negative surprises (e.g. an improvement to the Windows indexer that killed battery life) are as valuable as wins
- Searchable experiment history by keyword prevents teams from repeating failures
- Document wins clearly — semi-forgotten successes get reinvented (e.g. opening links in new tabs, rediscovered at multiple companies)
Speeding up experiments
- A good platform delivers a scorecard within a day of experiment completion — no waiting for a data scientist
- Variance reduction techniques let you get significant results with fewer users:
  - Cap skewed metrics (e.g. nights booked capped at 30/month)
  - CUPED (Controlled-experiment Using Pre-Experiment Data): adjusts results using pre-experiment data, reducing variance without introducing bias
- Replicate important results rather than relying on a single run
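The two variance reduction techniques above, capping and CUPED, can be illustrated on simulated data. This is a minimal sketch. The data is synthetic, the helper names `cap` and `cuped_adjust` are hypothetical, and the covariate is assumed to be the same metric measured for each user before the experiment started.

```python
import random
from statistics import fmean, pvariance


def cap(values: list[float], limit: float) -> list[float]:
    """Cap a skewed metric so a few extreme users cannot dominate the variance."""
    return [min(v, limit) for v in values]


def cuped_adjust(metric: list[float], pre_metric: list[float]) -> list[float]:
    """CUPED adjustment: y_adj = y - theta * (x - mean(x)), theta = cov(x, y) / var(x).

    x is the pre-experiment covariate. Subtracting a mean-zero term leaves the
    average of y unchanged while removing variance that x already explains.
    """
    mean_x = fmean(pre_metric)
    mean_y = fmean(metric)
    cov_xy = fmean([(x - mean_x) * (y - mean_y) for x, y in zip(pre_metric, metric)])
    var_x = fmean([(x - mean_x) ** 2 for x in pre_metric])
    theta = cov_xy / var_x
    return [y - theta * (x - mean_x) for x, y in zip(pre_metric, metric)]


# Synthetic users whose in-experiment activity correlates with past activity.
random.seed(0)
pre = [random.gauss(10.0, 3.0) for _ in range(5000)]
during = [0.8 * x + random.gauss(0.0, 1.0) for x in pre]

adjusted = cuped_adjust(during, pre)
print(f"variance before: {pvariance(during):.2f}, after: {pvariance(adjusted):.2f}")
print(cap([2, 45, 31], limit=30))
```

In a real analysis, theta would typically be estimated once on the pooled data from control and treatment and then applied to both, so that the adjustment cannot itself create a difference between variants. The lower the variance of the adjusted metric, the fewer users are needed to reach the same statistical power.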
Getting started at a resistant organisation
- Find a team that launches frequently (weekly or daily, not quarterly)
- Make sure that team has a clear, agreed OEC
- Use that beachhead to demonstrate surprising results; cross-pollination spreads the culture
- Show historical data on redesign failures — that's often the most persuasive evidence
- Start building the platform early, even before you have enough users to run valid experiments