What is A/B Testing?
Almost everyone hated learning statistics (well, except maybe some statisticians). With all those distributions and critical values we needed to memorize, we just ended up with a headache. You might have sworn never to touch the subject again; that is, until you had to analyze an A/B test.
A/B testing is the “fun” name for Randomized Controlled Trials, where we randomly assign users to two (or more) groups and measure a metric (KPI) in each of them. For example, we can randomly present users a green “buy now!” button or a red one, and measure whether the click rate (or purchase rate) is higher in one of them.
Eventually, we need statistics when we analyze the differences between the two groups. When we see that one button is performing better, we can’t know whether the difference is “real” or just due to chance, because even if the color doesn’t affect the click rate at all, we don’t expect the groups to have exactly the same rate. Statistics helps us distinguish between these two options.
The question we ask the A/B test is “which version is better?”. If you’re using a frequentist framework, the answer you’ll get when looking at the P-value is “well, if there is no difference between the buttons, then the probability of seeing an uplift like yours (or a more extreme one) is x%”.
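For the curious, here is roughly what that frequentist answer looks like in code. It’s a minimal sketch: the click counts are made up, and I’m assuming a standard two-proportion z-test (via statsmodels) as the test of choice.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical data: clicks and visitors for the red and green buttons.
clicks = [120, 145]
visitors = [1000, 1000]

# Two-proportion z-test: "if the buttons are truly identical, how likely is
# a difference at least this large?"
stat, p_value = proportions_ztest(clicks, visitors)
print(f"p-value: {p_value:.3f}")
```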
It Doesn’t Have to be Complicated
Statistics doesn’t have to be that complicated. It turns out that an alternative framework can give us the following answer: “The probability that the green button is better is y%. This is what you’re risking by choosing it”. This framework is called the Bayesian framework, and it is gaining more and more popularity in the online industries.
The most important advantage of Bayesian statistics is that it is understandable
While there are many advantages (and some disadvantages) to the framework, I think the most important advantage of Bayesian A/B testing is that it is understandable. It answers exactly what we need to know in the face of uncertainty: what are the chances we are wrong, and what do we risk in that case?
What’s the Difference?
In short, in the frequentist framework we assume that there are two alternative worlds: one where there isn’t a difference between the red and green buttons (the null hypothesis), and one where there is. We assume we live in the first world (we assume the null hypothesis is true), and we try to disprove that assumption with a certain level of confidence (set by the significance level). What if we didn’t manage to disprove it? That doesn’t necessarily mean that we’re in the first world (a non-significant result doesn’t prove the null hypothesis).
The Bayesian framework has a completely different point of view. We start by saying “the click rate of the red and green buttons can be any rate between 0% and 100%, with equal chance” (this is what we call the prior). This means that each button initially has a 50% chance to be better than the other. As we start gathering data, we update our knowledge, and we can say things like “Given the data I have observed, I now think there is a 70% chance that the green button is better”. We call this the posterior.
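To make the updating concrete, here is a minimal sketch, assuming the usual Beta-Binomial model for click rates (the post doesn’t prescribe a specific model, and all the numbers are made up for illustration):

```python
import numpy as np

# Uniform prior over the click rate: Beta(1, 1), i.e. "any rate between
# 0% and 100% is equally likely".
prior_alpha, prior_beta = 1, 1

# Hypothetical observed data.
red_clicks, red_misses = 120, 880
green_clicks, green_misses = 145, 855

# Conjugate update: a Beta prior with Binomial data gives a Beta posterior,
# Beta(alpha + clicks, beta + misses).
red_post = (prior_alpha + red_clicks, prior_beta + red_misses)
green_post = (prior_alpha + green_clicks, prior_beta + green_misses)

# Sample from both posteriors to estimate "the chance the green button is better".
rng = np.random.default_rng(42)
red_samples = rng.beta(*red_post, size=100_000)
green_samples = rng.beta(*green_post, size=100_000)
print("P(green is better):", (green_samples > red_samples).mean())
```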
What I love about the Bayesian framework is that it embraces uncertainty. In my view, the common interpretation of frequentist inference is almost “deterministic”. It’s funny to say that about a statistical framework, but think about the way we interpret hypothesis testing — if the P-value is lower than .05, the result is significant (“real”); if it’s higher, then the result “isn’t real”. But real life is uncertain; we can’t just set a .05 threshold and say “the P-value told us so”. The Bayesian framework embraces uncertainty — and tells you “there’s a 95% chance the green button is better — but there’s a 5% chance it’s not. The choice is yours”.
Bayesian Metrics
The funny thing is that most metrics aren’t that numerically different between the two methods, at least in the A/B testing context; however, their interpretations are completely different. In the frequentist method we have a Confidence Interval; the matching Bayesian metric is the Credibility Interval. I personally view the “Probability B is Better” as the counterpart of the P-value: both are computed very similarly, but their meanings are totally different!
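As an illustration of how directly these metrics fall out of the posterior, here is a sketch of a credibility interval, reusing the made-up Beta posterior from the previous snippet:

```python
from scipy import stats

# Posterior for the green button from the earlier made-up numbers:
# Beta(1 + 145, 1 + 855).
post_alpha, post_beta = 1 + 145, 1 + 855

# 95% credibility interval: the central interval holding 95% of the posterior.
lo, hi = stats.beta.ppf([0.025, 0.975], post_alpha, post_beta)
print(f"95% credibility interval for the green click rate: [{lo:.3f}, {hi:.3f}]")
# Unlike a confidence interval, it reads directly: "given the data, there is
# a 95% probability the true click rate lies in this interval."
```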
However, Bayesian A/B testing does have a metric with no parallel in the frequentist framework: the Risk. We calculate the risk for both A and B, and its interpretation is: “If I choose B when B is actually worse than A, how much do I expect to lose?”. This metric is also used as the decision rule in the A/B test.
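Here is one common way to compute such a risk (the expected loss), again as a sketch over the made-up posteriors from before; the exact formulation used in any particular tool may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior samples for both buttons (same made-up numbers as above).
red_samples = rng.beta(1 + 120, 1 + 880, size=100_000)
green_samples = rng.beta(1 + 145, 1 + 855, size=100_000)

# Risk of choosing green: the expected loss in click rate, counting only the
# scenarios where green is actually worse than red.
risk_green = np.maximum(red_samples - green_samples, 0).mean()
print(f"Risk of choosing green: {risk_green:.5f}")
# A possible decision rule: ship green once this risk drops below a threshold
# you're willing to pay (e.g. 0.001, i.e. 0.1 percentage points of click rate).
```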
The Risk has two major advantages over the P-value. First of all, its threshold is in business jargon (“what is the cost I am willing to pay if I’m wrong?”), while the frequentist significance level is statistical gibberish (“what is the type-I error rate I am willing to accept?”). Second, it’s much more robust to sequential examination (“peeking”) than the P-value, and as a result the sample size doesn’t have to be pre-determined when using the Bayesian framework. This is another major advantage of Bayesian A/B testing.
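To see why peeking hurts the P-value, here is a small, illustrative simulation (all parameters are arbitrary; the point is the inflated false positive rate, not the exact numbers):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Two identical buttons (no real difference), with a p-value check after every
# batch of visitors and a stop at the first "significant" result.
rng = np.random.default_rng(1)
n_experiments, n_looks, batch, true_rate = 1_000, 10, 200, 0.12
false_positives = 0

for _ in range(n_experiments):
    a_clicks = b_clicks = a_visitors = b_visitors = 0
    for _ in range(n_looks):
        a_clicks += rng.binomial(batch, true_rate)
        b_clicks += rng.binomial(batch, true_rate)
        a_visitors += batch
        b_visitors += batch
        _, p = proportions_ztest([a_clicks, b_clicks], [a_visitors, b_visitors])
        if p < 0.05:  # declare a "winner" at this peek and stop
            false_positives += 1
            break

print(f"False positive rate with peeking: {false_positives / n_experiments:.1%}")
# Well above the nominal 5%, which is why the sample size must be fixed
# in advance when relying on p-values.
```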
Summary
A/B testing doesn’t have to be confusing. Using the Bayesian framework, we can answer exactly the question we ask ourselves when we need to make a decision: “Which version is better, what is the chance I’m wrong, and what is the price I will pay if that is the case?”. This answer is much more business oriented, and any colleague in our company can comprehend it easily, no matter their role or expertise.
People are Bayesian. We ask ourselves “Will the chicken cross the road?”, not “Am I in the null reality where the chicken doesn’t cross the road, or the alternative one where it does, with at least 95% confidence?”. Why should we treat online A/B testing differently?
References
I really tried to keep it simple in this post and not to get too deep into the maths. If you want to dig (a little bit) deeper, I really recommend these posts:
- The Power of Bayesian A/B Testing by Michael Frasco
- Bayesian A/B testing — a practical exploration with simulations by Blake Arnold
Check out my next post, where I describe how we implemented Bayesian A/B testing at Wix, where performance at a large scale is critical.
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.