Statistical significance & other A/B pitfalls
Last week I tossed a coin a hundred times. 49 heads. Then I changed into a red t-shirt and tossed the same coin another hundred times. 51 heads. From this, I conclude that wearing a red shirt gives a 4.1% increase in conversion in throwing heads.
A ridiculous experiment (yes, I really did it) with a ridiculous conclusion, yet I sometimes see similarly unreliable analysis in A/B testing.
It’s logical and laudable that designers should seek data in our quest for verifiability and return on investment. But data must be handled with care, and mathematical rigour isn’t a common part of a designer’s repertoire.
Here’s an example from ABTests.com, a worthwhile project that I feel slightly bad to pick on.
The two versions are subtly different:
- Version A: Upload button bold, Convert button bold, Convert button has a right arrow
- Version B: All buttons regular weight, no right arrow on Convert button
Although minor changes can cause major surprises, I wouldn’t expect these small differences to improve the form’s usability. With the caveat that I don’t know the users or product, I’d even speculate that Version B could perform worse since it reduces the priority of the calls to action and removes the signifier of progression.
The designer claims that version B showed a 30.4% conversion improvement in an A/B test. Here’s why this isn’t quite accurate.
The role of chance
Any A/B test is a trial, so called because we’re observing evidence gained by trying something out. I can never truly know that there’s a 50% chance of a coin landing as a head or a tail – I can only run trials and observe the evidence. Similarly, we can never truly know that a design leads to higher conversion – we can only run trials and observe the evidence. If that empirical evidence is strong enough, we conclude that the design is an improvement. If not, we don’t.
To be valid, trials need to be sufficiently large. By tossing my coin 100 or 1000 times I reduce the influence of chance, but even then I’ll still get slightly different results with each trial. Similarly, a design may have 27.5% conversion on Monday, 31.3% on Tuesday and 26.0% on Wednesday. This random variation should always be the first cause considered of any change in observed results.
The null hypothesis
Statisticians use something called a null hypothesis to account for this possibility. The null hypothesis for the A/B test above might be something like this:
The difference in conversion between Version A and Version B is caused by random variation.
It’s then the job of the trial to disprove the null hypothesis. If it does, we can adopt the alternative explanation:
The difference in conversion between Version A and Version B is caused by the design differences between the two.
To determine whether we can reject the null hypothesis, we use certain mathematical equations to calculate the likelihood that the observed variation could be caused by chance. These equations are beyond the scope of this post but include Student’s t test, χ-squared and ANOVA (Wikipedia links given for the eager). Here’s a site that does the calculations for you, assuming a standard A/B conversion test with a clear Yes or No outcome.
If the arithmetic shows that the likelihood of the result being random is very small (usually below 5%), we reject the null hypothesis. In effect we’re saying “it’s very unlikely that this result is down to chance. Instead, it’s probably caused by the change we introduced” – in which case we say the results are statistically significant. Note that we still can’t guarantee that this is the right interpretation – significance is about proof only beyond reasonable doubt.
Running the calculations on the above data shows that the results aren’t statistically significant: the evidence isn’t strong enough to reject the null hypothesis that the difference in conversion is simply down to luck. The main problem is the small sample size (128 and 108 users respectively), so I would advise the designer, Johann, to repeat the test with more users. Assuming the observed conversions seen didn’t change (a big assumption) a sample size of approximately 200 users per variant should be sufficient for significance. He could then either reject the null hypothesis or the results would remain inconclusive, in which case there’s no evidence the design has made a difference. In Johann’s defence, he recently posted that he takes the point about significance, and I’m looking forward to seeing more conclusive data for this intriguing test.
Significance isn’t the only slippery problem A/B tests face. For starters, quoting conversion improvements is always fraught with difficulty. Since conversion is usually measured in percentages (in this example, 31.3% and 40.7%) there are two ways to quote improvements. We can say that conversions increased by:
- 9.4% – the difference between the two
- 30.4% – the amount that 40.7% is bigger than 31.3%*
Any percentage improvement quoted in isolation should be challenged: which of these two calculations has been used? It’s dangerously easy to assume the wrong figure without sufficient context.
The A/B death spiral
A/B tests also suffer from a common quantitative problem, in that they tell us what but not why. I’ve written about this previously in What if the design gods forsake us. It’s wise to back up numerical tests with qualitative evaluation (eg. a guerrilla usability test) so we can make informed decisions if data suggests we need to rethink a design.
Even with backup, sometimes A/B tests are simply the wrong tool for the job. They can provide powerful insight in some cases, but in the wrong place they can be a blind alley or, worse, a weapon of disempowerment. Logical positivism and design don’t mix – not everything we do can be empirically verified – yet some businesses fall back on A/B testing in lieu of genuine design thinking. I call this the “A/B death spiral”, and it plays out something like this:
Designer: Here’s a new design for this screen. You’ll see it has a new navigation style, tweaked colour palette and I’ve moved the main interactions to a tabbed area.
Product owner: Wow, those are pretty big changes for such a high-risk screen. I tell you what: let’s test them individually to see which of these changes works and which doesn’t…
As the proverb suggests, sometimes you can’t jump a twenty foot chasm in two ten foot leaps. Cherry-picking only those design elements that are “proven” by an A/B test can be a route to fragmented, incoherent design. It may earn marginally more money in the short term, but it becomes hard to avoid a descent into poor UX and the long-term harm this causes.
Being faithful to data
Given the potential hazards, I’m concerned about the naïveté with which some designers approach quantitative testing. The world of statistics rewards an honest search for the truth, not dilettantism, and I’d advise any designer moving in statistical circles to pick up some basic stats theory, or at least partner with someone knowledgeable.
A flawed A/B test, be it statistically insignificant, misapplied or misquoted, is nothing more than anecdotal evidence. It’s the same crime as making a website red on the feedback of one user. Yet an impatient designer, seeing the example I quoted above, could quickly jump to a false conclusion: “I should remove arrows from continue buttons: it’s 30.4% better.” Perhaps this designer deserves what he gets. It’s likely he’s only really interested in shortcuts to good UX, and linkbait lists of “Twelve ways to make your site more usable.” Since he understands neither the mathematics nor the context of this trial (timescales, userbase, surrounding task) he will inevitably grab the wrong end of the stick. Nonetheless, he is out there.
Don’t let yourself be that designer.
* subject to rounding