10 things we've learned about A/B testing for startups
Every A/B test has four components:
- A goal
- A change
- A control group
- A test group
You’re testing a change (2) on your test group (4) to see if it has a statistically significant impact on the goal (1) compared to the control group (3).
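The comparison at the heart of every A/B test — did the goal metric move in the test group relative to the control group? — is typically answered with a two-proportion z-test. Here's a minimal sketch using only Python's standard library; the function name and conversion numbers are hypothetical:

```python
import math

def ab_test_p_value(control_conversions, control_size, test_conversions, test_size):
    """Two-proportion z-test: is the difference in the goal metric
    between the control group and the test group statistically significant?"""
    p_control = control_conversions / control_size
    p_test = test_conversions / test_size
    pooled = (control_conversions + test_conversions) / (control_size + test_size)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_size + 1 / test_size))
    z = (p_test - p_control) / se
    # Two-sided p-value from the standard normal distribution
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical numbers: the change lifts conversion from 10% to 15%
p = ab_test_p_value(100, 1000, 150, 1000)
print(f"p = {p:.4f}")  # conventionally "significant" if p < 0.05
```

In practice your experimentation tool runs this math for you, but seeing it spelled out makes clear why each of the four components is required: without a goal there's no metric, and without both groups there's nothing to compare.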
But those are just the basics. In this week’s issue, we explore the secrets of running truly successful A/B tests (and some pitfalls to avoid).
This week’s theme is: Becoming an A/B testing ninja
This post was first published in our Substack newsletter, Product for Engineers. It's all about helping engineers and founders build better products by learning product skills. We send it (roughly) every two weeks. Subscribe here.
1. You need to embrace failure 📉
It shouldn’t surprise you when your A/B tests fail. This is true across the industry:
- At Bing, only about 10% to 20% of experiments generate positive results, yet those wins improved revenue per search by 10-25% a year.
- Booking.com runs an estimated 25,000 tests per year. Only 10% of them generate positive results.
- Using their “Universe” system, Coinbase ran 44 experiments in 8 months. 35 “resolved to control” – i.e. they showed no improvement.
- Let’s Do This, a Y Combinator startup with $80M in funding, ran 10 or more A/B tests each week. 80% failed.
Thomas Owers, who ran growth at Let’s Do This, says engineers who run a lot of experiments "need to get comfortable knowing their code will be deleted."
He emphasizes two key points:
- An experiment isn't a failure if it doesn't produce the results you expected.
- An experiment is only a failure when the team doesn't learn anything from it.
Read about Thomas’ experience in How to start a growth team (as an engineer) on the PostHog blog.
2. Good A/B tests have 5 traits ✅
- A specific, measurable goal: An ambiguous goal leads to an unclear A/B testing process. It isn’t clear what A/B test to run to “increase sales.” “Increase demo bookings from the sales page” is actionable.
- A clear hypothesis about why your change will achieve your goal: A good hypothesis explains why you’re making the change and what you expect to happen. Without one, how will you know if your test was successful?
- Test as small a change as reasonably possible: Change too much and it’s unclear which (if any) change made a difference, and why. That said, a change is too small if it’s unlikely to impact user behavior, so choose carefully.
- A sufficiently large sample size: Your test and control groups need to be large enough to detect a statistically significant difference. We calculate this automatically in PostHog, btw.
- A long enough test duration: This depends on your sample size and confidence level, but a good rule of thumb is a minimum of one week and a maximum of one month.
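To get a feel for how sample size relates to the effect you're hoping to detect, here's the standard two-proportion sample-size formula as a minimal sketch — this is a textbook approximation, not how any particular tool implements it, and the function name and rates are hypothetical:

```python
import math

def sample_size_per_group(baseline_rate, expected_rate):
    """Approximate sample size needed per group (control and test) to
    detect the given lift with 95% confidence and 80% power."""
    z_alpha = 1.96    # two-sided z-score for alpha = 0.05
    z_beta = 0.8416   # z-score for 80% power
    variance = baseline_rate * (1 - baseline_rate) + expected_rate * (1 - expected_rate)
    effect = abs(expected_rate - baseline_rate)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical example: detecting a lift from a 10% to a 12% conversion rate
print(sample_size_per_group(0.10, 0.12))
```

Note how quickly the required sample grows as the expected lift shrinks — halving the effect roughly quadruples the users you need, which is why small changes on low-traffic pages rarely reach significance.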
Read more about these traits in A software engineer's guide to A/B testing.
3. Use the “right place, right time” rule 💡
Right place, right time is a simple maxim for running successful tests.
Right place = Group your changes in as few places as possible, use feature flags to control who sees your changes, ensure only those involved in the test see the changes, and don’t capture metrics from unaffected users.
Right time = Use feature flags to control when your changes show, run your test to reach significance, and avoid the peeking problem.
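A common way feature flags enforce "right place" is deterministic hash-based bucketing: the same user always lands in the same group, users outside the rollout never see the change, and their metrics are never captured. A minimal sketch, assuming SHA-256 bucketing — the function name is hypothetical, not any specific feature flag API:

```python
import hashlib
from typing import Optional

def assign_variant(user_id: str, experiment: str, rollout: float = 1.0) -> Optional[str]:
    """Deterministically bucket a user into 'control' or 'test'.

    Returns None for users outside the rollout: they never see the change,
    and their metrics should not be captured for this experiment.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform float in [0, 1]
    if bucket >= rollout:
        return None  # right place: unaffected users are excluded entirely
    return "test" if bucket < rollout / 2 else "control"

# The same user always gets the same answer, so their experience is stable
print(assign_variant("user-42", "new-checkout"))
```

Hashing on the experiment name as well as the user ID means a user's bucket in one experiment doesn't influence their bucket in another.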
4. Create a proposal system 🏦
As we noted earlier, good tests need a clear hypothesis. A simple way to achieve this is a consistent proposal process.
Monzo, the British online bank, asks four simple questions before running any test:
- What problem are you trying to solve?
- Why should we solve it?
- How should we solve it? (optional)
- What if this problem didn’t exist? (optional)
Answering these questions helps Monzo create consistent hypotheses containing a proposed solution to a problem, and the expected outcome. It also allows anyone to propose a test, including staff who don’t typically run experiments.