A/B testing at scale: lessons from Microsoft, Amazon, and Airbnb

Executive overview

Most product teams ship changes that hurt or do nothing — and never know it. Controlled experiments are the only reliable way to find out if a change actually works. Ronny Kohavi, who built experimentation platforms at Amazon, Microsoft (Bing), and Airbnb, argues that no company can experiment too much, but most experiment badly.

The core problem is trust. An experiment platform is only useful if its results are believable. Getting that right — statistically sound metrics, sample ratio mismatch detection, correct p-value interpretation — is harder than teams assume.

You can't predict which ideas will win; the experiment is the oracle.

Failure rates and expectations

At Microsoft overall, ~66% of ideas fail
At Bing (heavily optimised domain), ~85% fail
At Airbnb search, 92% of experiments failed to move the key metric
Booking, Google Ads, and others report 80–90% failure rates
10% of experiments are aborted on day one due to implementation bugs, not bad ideas
Most value comes from many small wins accumulating, not home runs

The overall evaluation criterion (OEC)

The OEC is the single metric (or composite) that captures what you're optimising for long-term
Revenue alone is the wrong OEC — you can always make more money short-term by degrading user experience
Define OEC so it is causally predictive of lifetime value
Constraint framing helps: "increase revenue without exceeding X vertical pixels of ad space"
Countervailing metrics catch hidden harms — battery life, churn, unsubscribe rates
If your team can't agree on the direction of a metric (more time on site: good or bad?), the OEC is invalid

When to start running experiments

Below tens of thousands of users, experiments lack the statistical power to detect realistic effects on most metrics
For a retail site targeting 5% improvements in conversion: ~200,000 users needed
Below that threshold: build the culture, build the platform, integrate the tooling
At 200,000+ users, you can test everything and reliably detect meaningful effects
Focus on detecting 5–10% effects at early stages, not 1%
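
The user thresholds above can be sanity-checked with a common power rule of thumb: users per variant ≈ 16 × variance / (absolute effect)², which corresponds to roughly 80% power at a 0.05 significance level. The sketch below is illustrative only. The 5% baseline conversion rate and the function name `users_per_variant` are assumptions, not figures from the talk.

```python
import math

def users_per_variant(baseline_rate: float, relative_lift: float) -> int:
    """Approximate users needed in EACH variant (~80% power, alpha = 0.05).

    Rule of thumb: n ~= 16 * variance / delta**2.
    """
    variance = baseline_rate * (1 - baseline_rate)  # variance of a yes/no conversion metric
    delta = baseline_rate * relative_lift           # absolute difference we want to detect
    return math.ceil(16 * variance / delta ** 2)

# Assumed baseline: 5% of visitors convert (hypothetical, for illustration).
n_5pct = users_per_variant(baseline_rate=0.05, relative_lift=0.05)
n_1pct = users_per_variant(baseline_rate=0.05, relative_lift=0.01)

print(f"Detect a 5% relative lift: {n_5pct:,} per variant, {2 * n_5pct:,} in total")
print(f"Detect a 1% relative lift: {n_1pct:,} per variant, {2 * n_1pct:,} in total")
```

Under these assumptions a 5% relative lift needs on the order of 120,000 users per variant, in line with the ~200,000 figure. Because the required sample grows with the inverse square of the effect, a 1% lift needs about 25 times as many users, which is why early-stage teams should aim for 5–10% effects.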

Trust and statistical validity

An experiment platform is a safety net (catch bad launches fast) and an oracle (tell you what actually happened)
Trust is easy to lose and hard to rebuild — statistical errors destroy organisational confidence
Sample ratio mismatch (SRM): if you designed for a 50/50 split and got 50.2/49.8, calculate whether a deviation that size could occur by chance at your sample size; with large samples it often can't
At Microsoft, ~8% of experiments had SRM; third-party vendors report 6–10%
Common SRM causes: bots hitting control and treatment at different rates, flawed data pipelines, skewed traffic sources
Surface SRM visibly: blank out the scorecard rather than showing only a dismissible banner
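
One way to run the SRM check described above is a chi-square goodness-of-fit test on the user counts. This is a minimal sketch using only the standard library. The function name `srm_p_value`, the example counts, and the 0.001 alert threshold are assumptions for illustration, not details from the talk.

```python
import math

def srm_p_value(control_users: int, treatment_users: int,
                expected_control_share: float = 0.5) -> float:
    """p-value for the observed split under the designed split (chi-square, 1 df)."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

SRM_THRESHOLD = 0.001  # assumed alert threshold; pick your own

# The same 50.2/49.8 split at two hypothetical sample sizes.
p_small = srm_p_value(5_020, 4_980)        # 10,000 users
p_large = srm_p_value(502_000, 498_000)    # 1,000,000 users

for label, p in [("10k users", p_small), ("1M users", p_large)]:
    verdict = "SRM: do not trust the scorecard" if p < SRM_THRESHOLD else "split looks fine"
    print(f"{label}: p = {p:.2g} ({verdict})")
```

The same 50.2/49.8 ratio is unremarkable with ten thousand users but very unlikely to be chance with a million, which is why the check has to account for sample size rather than eyeballing the percentages.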

P-values and false positives

The most common misreading: a p-value of 0.02 does not mean a 98% probability that the treatment is better
The p-value is the probability of observing data at least this extreme if the null hypothesis is true; it is not the probability that the null hypothesis is true
To get the probability you actually want, apply Bayes' rule using a prior (historical success rate)
At Airbnb search (8% success rate), a statistically significant result at p < 0.05 has a 26% chance of being a false positive
Mitigation: require p < 0.01, then replicate; combine experiments using Fisher's or Stouffer's method
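
The Bayes' rule step and the replication step above can be sketched as follows. The 8% success rate comes from the text. The 80% power, the use of a one-sided 0.025 error rate (half of a two-sided 0.05, counting only significant results in the winning direction), and the function names are assumptions made for this illustration. Stouffer's method is shown in its unweighted form.

```python
from math import sqrt
from statistics import NormalDist

def false_positive_risk(success_rate: float,
                        alpha_one_sided: float = 0.025,
                        power: float = 0.8) -> float:
    """P(no real effect | significant positive result), via Bayes' rule."""
    false_wins = alpha_one_sided * (1 - success_rate)  # ideas with no effect that look like wins
    true_wins = power * success_rate                   # good ideas that are detected
    return false_wins / (false_wins + true_wins)

def stouffer_combined_p(one_sided_p_values: list[float]) -> float:
    """Combine p-values from independent replications (unweighted Stouffer)."""
    std_normal = NormalDist()
    z_scores = [std_normal.inv_cdf(1 - p) for p in one_sided_p_values]
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return 1 - std_normal.cdf(combined_z)

fpr_at_05 = false_positive_risk(success_rate=0.08)                         # p < 0.05, two-sided
fpr_at_01 = false_positive_risk(success_rate=0.08, alpha_one_sided=0.005)  # p < 0.01, two-sided
combined = stouffer_combined_p([0.04, 0.03])  # hypothetical original run and replication

print(f"False positive risk at p < 0.05: {fpr_at_05:.1%}")
print(f"False positive risk at p < 0.01: {fpr_at_01:.1%}")
print(f"Combined p-value of two marginal runs: {combined:.4f}")
```

Under these assumptions the first figure lands at about 26%, matching the number quoted for Airbnb search. Tightening the threshold cuts the risk substantially, and two individually marginal runs combine into much stronger evidence.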

Twyman's law

"Any figure that looks interesting or different is usually wrong"
If a result is 10x larger than your typical experiment movement, investigate before celebrating
Nine out of ten times Twyman's law is invoked, a flaw is found
Exception: some genuine outliers exist, but they require multiple replications to confirm

Portfolio thinking and big bets

Experimentation doesn't prevent innovation — but you need a portfolio
Allocate most effort to incremental improvements; reserve some for high-risk, high-reward bets
Be ready for big bets to fail ~80% of the time
Examples of costly failures: Bing's social integration (100 person-years, all experiments flat or negative), Netflix social, Airbnb social, Airbnb online experiences
OFAT (one factor at a time): decompose redesigns into smaller testable changes; of 17 changes, maybe 4 are positive — ship those

Large redesigns

Full redesigns almost never win outright in experiments
Teams launch them anyway because of sunk cost — months of work creates pressure to ship even on flat or negative results
A data-driven org should not ship a flat result unless legally required
If legally forced to take a hit, run three variants and ship the one that hurts least
Right approach: move incrementally in the intended direction, testing on the way

Building and scaling an experiment platform

Goal: reduce marginal cost of running an experiment to near zero
Self-service setup, templated metric scorecards (e.g. 2,000 metrics per UI experiment type)
Insufficient platform investment forces reliance on data scientists for each analysis, which doesn't scale
Maturity can be assessed along six axes, each progressing crawl → walk → run → fly; assess where you are before deciding what to build next
Build vs. buy: for most teams starting out, third-party platforms are now good enough; the choice is usually how much to build on top

Institutional memory

Run quarterly reviews of the most surprising experiments (not just the winners)
Surprising = large gap between expected and actual result, in either direction
Negative surprises (e.g. an improvement to the Windows indexer that killed battery life) are as valuable as wins
Searchable experiment history by keyword prevents teams from repeating failures
Document wins clearly — semi-forgotten successes get reinvented (e.g. opening links in new tabs, rediscovered at multiple companies)

Speeding up experiments

A good platform delivers a scorecard within a day of experiment completion — no waiting for a data scientist
Variance reduction techniques let you get significant results with fewer users:
- Cap skewed metrics (e.g. nights booked capped at 30/month)
- CUPED (Controlled-experiment Using Pre-Experiment Data): adjusts results using pre-experiment data, reducing variance without introducing bias
Replicate important results rather than relying on a single run
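
The two variance reduction techniques above can be sketched in a few lines. This is an illustration on synthetic data, not the implementation used at any of the companies mentioned. The function names and the simulated metric are assumptions. In a real analysis the CUPED coefficient is typically estimated once on the pooled data from both variants and then applied to each.

```python
import random
from statistics import mean, variance

def cap(values: list[float], ceiling: float) -> list[float]:
    """Cap a skewed metric so a few extreme users don't dominate the variance."""
    return [min(v, ceiling) for v in values]

def cuped_adjust(metric: list[float], pre_metric: list[float]) -> list[float]:
    """CUPED: Y_adj = Y - theta * (X - mean(X)), with theta = cov(X, Y) / var(X).

    X is the same user's metric measured BEFORE the experiment started,
    so it cannot have been affected by the treatment.
    """
    x_bar, y_bar = mean(pre_metric), mean(metric)
    cov_xy = sum((x - x_bar) * (y - y_bar)
                 for x, y in zip(pre_metric, metric)) / (len(metric) - 1)
    theta = cov_xy / variance(pre_metric)
    return [y - theta * (x - x_bar) for x, y in zip(pre_metric, metric)]

# Synthetic users whose in-experiment behaviour tracks their past behaviour.
rng = random.Random(0)
pre = [rng.gauss(10, 3) for _ in range(5_000)]
during = [x + rng.gauss(0, 1) for x in pre]
adjusted = cuped_adjust(during, pre)

print(f"Variance before CUPED: {variance(during):.2f}")
print(f"Variance after CUPED:  {variance(adjusted):.2f}")
print(f"Mean shift from adjustment: {mean(adjusted) - mean(during):.2e}")
print(cap([1, 45, 30], ceiling=30))
```

The adjustment leaves the mean unchanged because the correction term averages to zero, while the variance drops in proportion to how well pre-experiment behaviour predicts the metric. Lower variance means the same effect reaches significance with fewer users.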

Getting started at a resistant organisation

Find a team that launches frequently (weekly or daily, not quarterly)
Make sure that team has a clear, agreed OEC
Use that beachhead to demonstrate surprising results; cross-pollination spreads the culture
Show historical data on redesign failures — that's often the most persuasive evidence
Start building the platform early, even before you have enough users to run valid experiments
