The experimentation machine: trustworthy A/B testing

Most product ideas fail. This lesson explains how to build a system of controlled experiments that filters out expensive illusions and ships only what actually works.

Written by Elena Vasquez, Growth & Conversion
Lesson 4/5 | Growth | ~45 min read

Read

Full lesson text

Why this matters

In Lesson 3, we explored the Hook Model — the psychological engine that drives user retention. We learned how to design triggers, actions, and rewards that solve user "itches." But there is a dangerous trap in product management: the belief that because a feature is "logical" or "well-designed," it will automatically succeed.

Ronny Kohavi, former VP at Airbnb and Microsoft, brings a humbling reality: in mature organizations like Google, Bing, or Amazon, 80% to 90% of ideas fail to move the intended metrics. Many of them actually hurt the product.

If you are not experimenting, most of the code you ship is likely flat or negative: somewhere between two-thirds and nine-tenths of it, going by the failure rates Kohavi reports.

This lesson is about building the "Experimentation Machine" — a system of controlled trials that separates winning strategies from expensive illusions.


The humbling reality of product development

Why do we need A/B testing? Why not trust a smart Product Manager or a visionary CEO?

Because user behavior is unpredictable. A small change in a UI, a tweak in an algorithm, or a new feature often produces unexpected results.

The Bing case study:

At Bing, an engineer suggested moving the second line of an ad up into the first line to make the title longer. The idea sat in the backlog for six months because it seemed trivial. When finally tested, it increased revenue by 12% (over $100 million annually) without hurting user satisfaction.

Conversely, many "massive redesigns" that teams work on for months result in drops in conversion. Without A/B testing, these losses are often hidden by seasonal trends or marketing noise.

Failure rates across tech giants:

| Company | Ideas that fail to move the target metric |
| --- | --- |
| Microsoft | 66% |
| Bing (optimized) | 85% |
| Airbnb Search | 92% |

You might ask: "If my success rate is only 8%, should I be worried?" No. The goal of the Experimentation Machine is to ensure you only ship that 8% and discard the rest. This "inch by inch" improvement is how Amazon and Spotify achieve massive compounding growth over a decade.

If most ideas fail, how do you know which metric to trust when evaluating a test? That is where the OEC comes in.

The OEC: the Overall Evaluation Criterion

A common mistake in A/B testing is optimizing for a single, short-term metric like "clicks" or "revenue." This leads to what Kohavi calls the "spam trap."

The Amazon email case study:

At Amazon, the email team was credited for every purchase made after a user clicked a recommendation email. To hit their targets, the team simply sent more emails. Revenue went up — but unsubscribes skyrocketed. They were destroying long-term user lifetime value for short-term gains.

The solution is an OEC (Overall Evaluation Criterion) — a composite metric that balances short-term wins with long-term health:

| OEC component | Purpose | Example (e-commerce) |
| --- | --- | --- |
| Success metric | What we want to increase | Completed purchases |
| Guardrail metric | What we must not hurt | Unsubscribe rate, page load time, latency |
| Countervailing metric | A check on the quality of success | Return rate, customer support tickets |

The OEC must be a leading indicator of lifetime value. If you can prove that "users who perform Action X are 3x more likely to be active in 1 year," then Action X belongs in your OEC.
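
The three component types can be combined into a simple ship/no-ship check. The sketch below is only an illustration, not a standard formula: `MetricResult`, `evaluate_oec`, the 1% tolerance, and the example numbers are all hypothetical, and it assumes every change passed in has already been confirmed as statistically significant.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MetricResult:
    name: str
    relative_change: float   # treatment vs. control, e.g. 0.04 means +4%
    higher_is_better: bool


def improvement(metric: MetricResult) -> float:
    """Signed change where positive always means 'better for the product'."""
    return metric.relative_change if metric.higher_is_better else -metric.relative_change


def evaluate_oec(
    success: MetricResult,
    guardrails: List[MetricResult],
    countervailing: List[MetricResult],
    tolerance: float = 0.01,  # hypothetical: how much degradation we accept
) -> Tuple[bool, List[str]]:
    """Return (ship?, list of problems found)."""
    problems = []
    if improvement(success) <= 0:
        problems.append(f"success metric '{success.name}' did not improve")
    for metric in guardrails + countervailing:
        if improvement(metric) < -tolerance:
            problems.append(f"'{metric.name}' got worse by more than {tolerance:.0%}")
    return (not problems, problems)


# Hypothetical numbers loosely modeled on the email example above:
# purchases go up, but unsubscribes go up much more.
ship, problems = evaluate_oec(
    success=MetricResult("completed_purchases", 0.04, higher_is_better=True),
    guardrails=[MetricResult("unsubscribe_rate", 0.30, higher_is_better=False)],
    countervailing=[MetricResult("return_rate", 0.00, higher_is_better=False)],
)
print(ship, problems)
```

Treating guardrails as hard vetoes rather than folding everything into one weighted score is a design choice in this sketch; a weighted composite is an equally plausible reading of "OEC".
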
With the right metric in hand, the next question is: how do you make sure the experiment itself is trustworthy?

Designing trustworthy experiments

Trust is the most important element of an experimentation platform. If the data is wrong, the machine is broken. Two laws govern the trustworthiness of your tests.

Twyman's Law:

"Any figure that looks interesting or different is usually wrong."

If your experiment shows a 20% lift in a major metric, your first reaction should not be to celebrate. It should be to investigate. Large jumps are almost always caused by:

  • Tracking bugs: Logging an event twice
  • Sample Ratio Mismatch (SRM): The observed split differs from the designed one by more than chance can explain. At large scale, even 50.2/49.8 on an intended 50/50 split is a red flag
  • Bot traffic: Automated scripts hitting one variation more than the other
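
An SRM check is commonly done with a chi-square goodness-of-fit test on the user counts. The following is a minimal sketch using only the standard library; the function name `srm_p_value` and the 0.001 alert threshold are assumptions for illustration, and a real platform may use a different test or cutoff.

```python
import math


def srm_p_value(control_users: int, treatment_users: int,
                expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for a two-arm split (1 degree of freedom)."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


SRM_THRESHOLD = 0.001  # assumed alert level; pick your own

# The same 50.2/49.8 split at two different scales.
for control, treatment in [(5_020, 4_980), (502_000, 498_000)]:
    p = srm_p_value(control, treatment)
    status = "SRM suspected" if p < SRM_THRESHOLD else "split looks fine"
    print(f"{control:>7} vs {treatment:>7}: p = {p:.5f} ({status})")
```

With ten thousand users the 50.2/49.8 split is well within chance; with a million users the same ratio is flagged, which is why the check has to account for sample size rather than look at the ratio alone.
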

The peeking problem:

Most PMs look at their A/B test dashboards every hour. If they see a "green" result (p < 0.05), they want to stop the test and claim victory. This is a statistical disaster.

Think of a basketball game between the Lakers and the Warriors. The Warriors might be the better team and will win 95% of the time after 4 quarters. But at some point during the game — perhaps in the first 5 minutes — the Lakers will be leading. If you "stop the game" the moment the Lakers are ahead, you draw a false conclusion.

The rule:

  • You must commit to a sample size before you start
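  • If you stop early because a result "looks significant," you inflate your false-positive rate well beyond the nominal 5%. The more often you peek, the worse it gets, and with frequent enough peeking it can exceed 50%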
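
The effect of peeking can be seen in a small A/A simulation, where both arms are identical and every "significant" result is by definition a false positive. This is a rough sketch: the function names, the 10% conversion rate, the 20 peeks, and the batch sizes are all arbitrary assumptions, and the exact inflation depends on how often you look.

```python
import math
import random


def is_significant(conv_a: int, conv_b: int, n_per_arm: int, z_crit: float = 1.96) -> bool:
    """Two-sided two-proportion z-test at roughly the 5% level."""
    pooled = (conv_a + conv_b) / (2 * n_per_arm)
    std_err = math.sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    if std_err == 0:
        return False
    return abs(conv_a - conv_b) / n_per_arm / std_err > z_crit


def simulate_aa_tests(n_tests=500, peeks=20, users_per_peek=100, rate=0.10, seed=42):
    """Return (share of tests 'won' at any peek, share significant at the planned end)."""
    rng = random.Random(seed)
    won_at_any_peek = 0
    won_at_end = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        any_peek_significant = False
        significant_now = False
        for _ in range(peeks):
            # Both arms draw from the same conversion rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(users_per_peek))
            conv_b += sum(rng.random() < rate for _ in range(users_per_peek))
            n += users_per_peek
            significant_now = is_significant(conv_a, conv_b, n)
            any_peek_significant = any_peek_significant or significant_now
        won_at_any_peek += any_peek_significant
        won_at_end += significant_now
    return won_at_any_peek / n_tests, won_at_end / n_tests


peek_rate, final_rate = simulate_aa_tests()
print(f"False positives if you stop at the first green peek: {peek_rate:.1%}")
print(f"False positives if you wait for the planned sample:  {final_rate:.1%}")
```

Raising `peeks` in this sketch pushes the first number higher while the second stays near the nominal level, which is the reason to fix the sample size up front.
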
Trustworthy data requires patience. But even before you run your first test, you need enough users for the statistics to work.

When and how to start testing

A/B testing is not for every company at every stage. You need enough "units" (users) for the statistics to work.

Scale heuristics:

| User base | What you can do |
| --- | --- |
| Under 10,000 | Do not A/B test. The noise will drown out the signal. Focus on qualitative feedback and "doing things that don't scale" |
| 10,000 – 100,000 | You can detect large effects (e.g., a 10% change) |
| 200,000+ | You can detect small, incremental 1% wins that compound into massive revenue |
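
To see why smaller effects need far more users, and to pick the sample size you commit to before launch, you can use the standard two-proportion approximation. The sketch below is hedged accordingly: `users_per_variant` is a hypothetical helper, and the 5% baseline conversion rate, 5% significance level, and 80% power are assumptions. The thresholds in the table are heuristics; the real number depends heavily on your baseline rate and metric variance.

```python
import math
from statistics import NormalDist


def users_per_variant(baseline_rate: float, relative_lift: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in EACH arm to detect the given relative lift."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    delta = baseline_rate * relative_lift           # absolute difference to detect
    variance = baseline_rate * (1 - baseline_rate)  # variance of a conversion flag
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * variance / delta ** 2)


# Assumed 5% baseline conversion rate.
for lift in (0.10, 0.05):
    n = users_per_variant(baseline_rate=0.05, relative_lift=lift)
    print(f"Detecting a {lift:.0%} relative lift needs about {n:,} users per variant")
```

Because the required sample grows with the inverse square of the effect, halving the lift you want to detect roughly quadruples the users you need.
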

Build vs. buy:

  • Early stage: Use third-party tools like Optimizely, Amplitude, or Eppo
  • At scale: Consider building an internal platform that integrates with your data pipeline to automate SRM checks and OEC calculations. The marginal cost of running one more test should approach zero
Once you have the infrastructure to test, a new question arises: should you iterate on the current design — or throw it away and start fresh?

Iteration vs. the "big bang" redesign

One of the most frequent conflicts in a product team is the desire for a total redesign. A designer might say: "We cannot iterate our way out of this local maximum. We need a fresh start."

Kohavi warns that big bang redesigns usually fail. When you change 20 variables at once, you have no idea which ones are helping and which are hurting.

The strategy of decomposition:

Instead of shipping a redesign all at once:

  1. Decompose the redesign into about 10 smaller experiments
  2. Test each piece on its own: the new navigation menu, the new color palette, the new search algorithm
  3. Keep the winners, discard the losers

If you must do a full redesign for branding reasons (like the Airbnb rebrand), run it as an experiment with a "long-term holdout." Keep 5% of users on the old design for 3 months to see if the "novelty effect" wears off and if long-term metrics actually improve.
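
A long-term holdout needs assignment that stays stable for the whole three months, so the same user always sees the same design. One common way to get that is deterministic hashing of the user ID. This is a minimal sketch under that assumption; `assign_variant`, the variant labels, and the experiment name are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, holdout_percent: int = 5) -> str:
    """Deterministically place a user in the holdout or the redesign."""
    # Salting with the experiment name keeps buckets independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100          # stable bucket in the range 0..99
    return "holdout_old_design" if bucket < holdout_percent else "new_design"


# The same user gets the same answer on every call, with no stored state.
print(assign_variant("user-12345", "redesign_holdout"))

# Roughly 5% of a large population should land in the holdout.
users = [f"user-{i}" for i in range(10_000)]
holdout_share = sum(
    assign_variant(u, "redesign_holdout") == "holdout_old_design" for u in users
) / len(users)
print(f"Holdout share: {holdout_share:.1%}")
```
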
The machine is built. The experiments are running. But none of this works unless the team trusts the data more than their own intuition.

Creating a culture of experimentation

Building the machine is 20% technical and 80% cultural. To succeed, you must move from a "HiPPO" culture to an experimentation culture.

| Attribute | HiPPO culture | Experimentation culture |
| --- | --- | --- |
| Decision maker | Highest Paid Person's Opinion | The controlled experiment (the user) |
| Attitude toward failure | Punished or hidden | Viewed as learning and institutional memory |
| Speed | Determined by meeting schedules | Determined by data velocity |
| Goal | Shipping features | Moving the North Star Metric |

Institutional memory:

Do not just run tests — document them. Create a library of "surprising results." Every quarter, hold a meeting to discuss the "surprising loser" — an idea everyone thought would work but did not. This humbles the team and reinforces the need for the Machine.


The path ahead

The Experimentation Machine is the filter that ensures only the best ideas reach your users. It validates the "hooks" we built in Lesson 3 and moves the "North Star" we defined in Lesson 2.

Once you have a product that is scientifically proven to work and users who habitually return, you have reached the rarest state in business: product-market fit at scale. Now you face the final challenge: a competitor sees your success and tries to clone you. You cannot afford to move slowly anymore.

In the next lesson, we will cover Blitzscaling — how to prioritize speed over efficiency to dominate a "winner-take-most" market before the competition catches up.

Think

What would you do in these scenarios?

The miracle algorithm

You are a PM at an e-commerce company. Your data science team developed a new recommendation algorithm. After 3 days of testing, the dashboard shows a 45% increase in Add to Cart actions. The engineering team wants to roll it out to 100% of users immediately. What is your first response?


Practice

Test yourself and review key terms

Knowledge check

According to Kohavi's research at companies like Microsoft, Bing, and Airbnb, what is the typical failure rate for new product ideas?

Concepts

Question: What is a Countervailing Metric?

Answer: A check on the quality of the success metric, for example tracking return rates alongside completed purchases, or support tickets alongside sign-ups.

Apply

Your action steps for today

  1. The OEC audit

    Take the last feature your team shipped without A/B testing. Write down what the OEC should have been: a success metric, a guardrail metric, and a countervailing metric. If you cannot define all three, the experiment was incomplete.

  2. The peeking test

    Check your current A/B tests. Did anyone commit to a sample size before launch? If the test was stopped early because it "looked significant," the result is likely a false positive.

  3. The surprising loser

    Ask your team to name one feature everyone was sure would work but did not. If nobody can name one, your team is not running enough experiments, or it is not being honest about the results.

Note

Some examples and details may be simplified to better convey the core idea. Every business is different — adapt these ideas to your specific context and situation.