If you’ve read my previous post, you already know why I think you should move to Bayesian A/B testing. In this post, I give a short overview of the statistical models behind Bayesian A/B tests and present the ways we implemented them at Wix.com, where we deal with a massive scale of A/B tests.
I've included some practical examples in Python throughout this post. You can easily replicate them by copy-pasting the code from here and here. While the code is rather simple, the formulas I derive can be a little advanced; however, I don't think it is critical to understand all the derivations in order to use the code.
If you don’t have a background in Statistics and Bayesian inference, I recommend you read the articles in my references section. Also, a lot of concepts I use have links to Wikipedia, where they are explained in more detail.
The Bayesian A/B Test Model
Bayesian statistics is built on two main concepts: the prior distribution — what we “know” about the KPI before the test, and the posterior distribution — what we know after gathering data. We find the posterior by “updating” the prior with our data. In an A/B test context, we try to keep it simple and use prior distributions that can be easily updated (conjugate priors), like the Beta-Binomial model.
The Beta-Binomial model is used to model binary data such as conversions or clicks (“did the user convert or not?”). I’ll also review the Normal-Normal model, which is used for continuous data (for example revenue per user).
In the Beta-Binomial model we assume that the conversion rate Pᴀ has a Beta distribution with parameters α and β. A common choice for α and β is 1, which results in a uniform distribution (sometimes referred to as an “uninformative prior”). I discuss the choice of the prior in more detail in the appendix, but for now let’s continue as if they have already been chosen.
We denote the count of converted users with Xᴀ and the count of all users (converted or not) with nᴀ, and we model Xᴀ | Pᴀ ~ Bin(nᴀ, Pᴀ). Since Pᴀ ~ Beta(α, β), by using Bayes’ theorem we get:
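Pᴀ | Xᴀ ~ Beta(α + Xᴀ, β + (nᴀ − Xᴀ))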
Meaning we “update” our prior by adding the number of successes to α and the number of failures to β. Pretty simple, don’t you think?
In the Normal-Normal model, we assume that the expected revenue per user (or any other continuous metric) μᴀ has a Gaussian prior distribution with mean μ₀ and variance σ₀²/n₀. Here we denote the sample average with X̄ᴀ and the sample standard deviation with sᴀ, and we assume that X̄ᴀ is approximately Normally distributed as well. This time, the update rule is a bit trickier:
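μᴀ | X̄ᴀ ~ N( (n₀/σ₀² · μ₀ + nᴀ/sᴀ² · X̄ᴀ) / (n₀/σ₀² + nᴀ/sᴀ²) ,  1 / (n₀/σ₀² + nᴀ/sᴀ²) )

where the second parameter is the posterior variance.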
The posterior’s expected value is a weighted average of the prior mean and the sample mean, where the weights are the inverses of their respective variances.
All these formulas may seem a little overwhelming, but the calculations in NumPy are pretty simple:
x_a, n_a = 254, 1283  # converted and total users in A
x_b, n_b = 289, 1321  # converted and total users in B
m_a, s_a = 52.3, 14.1  # average and standard deviation over all users in A
m_b, s_b = 52.8, 13.7  # average and standard deviation over all users in B
alpha_0, beta_0 = 1, 1  # Beta prior
mu0, s0, n0 = 0, 1, 0  # Gaussian prior
Snippet 1: Setting up some dummy data
import numpy as np

# updating Beta prior
alpha_a = alpha_0 + x_a
beta_a = beta_0 + n_a - x_a
alpha_b = alpha_0 + x_b
beta_b = beta_0 + n_b - x_b

# updating Gaussian prior is a bit more tricky
inv_vars_a = n0 / np.power(s0, 2), n_a / np.power(s_a, 2)
mu_a = np.average((mu0, m_a), weights=inv_vars_a)
sd_a = 1 / np.sqrt(np.sum(inv_vars_a))
inv_vars_b = n0 / np.power(s0, 2), n_b / np.power(s_b, 2)
mu_b = np.average((mu0, m_b), weights=inv_vars_b)
sd_b = 1 / np.sqrt(np.sum(inv_vars_b))
Snippet 2: Updating posteriors
How To Calculate (Most) Metrics
Now that we’ve found the posterior distributions of Pᴀ and Pʙ, we want to calculate inferential metrics, such as Credibility Intervals, the Probability B is Better and each version’s Risk. The most common and simple way to do this is by using Monte Carlo simulations. But at Wix.com we have thousands of KPIs in hundreds of A/B tests every day, and using simulations isn’t scalable. Instead, we use two methods: Gaussian Quadratures (more on that later) and… Central Limit Theorem (CLT) approximations.
You might be thinking: isn’t the CLT more of a Frequentist kind of thing? And can the CLT even be applied to the Beta distribution? You’d be right to question that. However, when you have a sample size of thousands in every A/B test, the CLT approximation for the Beta distribution is “good enough”. You can verify this by running your own simulations (or by looking at mine).
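For reference, here is a minimal sketch of what such a simulation could look like for the Beta posteriors from Snippet 2 (the variable names are the ones defined there); this isn’t our production code, just a handy way to sanity-check the approximations below:

import numpy as np

rng = np.random.default_rng(seed=42)
sims = 100_000  # number of posterior draws

# sample from the posteriors of P_A and P_B found in Snippet 2
p_a = rng.beta(alpha_a, beta_a, sims)
p_b = rng.beta(alpha_b, beta_b, sims)

prob_b_better = (p_b > p_a).mean()
risk_a = np.maximum(p_b - p_a, 0).mean()  # expected conversions lost per user if we keep A
risk_b = np.maximum(p_a - p_b, 0).mean()  # expected conversions lost per user if we choose B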
Here we’re going to approximate the Credibility Intervals and the Probability B is Better with the CLT. Again, while the formulas can be a bit exhausting, the code with SciPy is almost a one-liner. We denote the difference Pʙ – Pᴀ with D₁ and the log-ratio log Pʙ/Pᴀ with D₂. The latter is used when we report the relative uplift instead of the percentage-point difference. We approximate D₁’s and D₂‘s distribution with:
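D₁ ≈ N( E[Pʙ] − E[Pᴀ] ,  Var[Pᴀ] + Var[Pʙ] )
D₂ ≈ N( E[ln Pʙ] − E[ln Pᴀ] ,  Var[ln Pᴀ] + Var[ln Pʙ] )

where the means and variances are those of the posterior Beta distributions, and the second argument of N(·,·) is a variance.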
To complete the formulas above, we denote the digamma function and the first polygamma function with ψ and ψ₁ respectively. You don’t really need to know what they do, just where to find them in SciPy! The equations below are simply copied from Wikipedia:
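For P ~ Beta(a, b):

E[ln P] = ψ(a) − ψ(a + b)
Var[ln P] = ψ₁(a) − ψ₁(a + b)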
Using these formulas, we can easily calculate the probability B is better with P(D₁ > 0), and the Credibility Interval with the quantiles of D₁ or D₂:
from scipy.stats import norm, beta
from scipy.special import digamma, polygamma

log_beta_mean = lambda a, b: digamma(a) - digamma(a + b)
var_beta_mean = lambda a, b: polygamma(1, a) - polygamma(1, a + b)

d1_beta = norm(loc=beta.mean(alpha_b, beta_b) - beta.mean(alpha_a, beta_a),
               scale=np.sqrt(beta.var(alpha_a, beta_a) + beta.var(alpha_b, beta_b)))
d2_beta = norm(loc=log_beta_mean(alpha_b, beta_b) - log_beta_mean(alpha_a, beta_a),
               scale=np.sqrt(var_beta_mean(alpha_a, beta_a) + var_beta_mean(alpha_b, beta_b)))

print(f'The probability the conversion in B is higher is {d1_beta.sf(0)}')
# The probability the conversion in B is higher is 0.9040503042127781
print(f'The 95% credibility interval of (p_b/p_a-1) is {np.exp(d2_beta.ppf((.025, .975))) - 1}')
# The 95% credibility interval of (p_b/p_a-1) is [-0.0489547 0.28350359]
Snippet 3: Calculating (some) inferential metrics for the Beta-Binomial model
For the Normal-Normal case, D₁ is pretty straightforward. But what if we want to use the relative difference instead? We’ve decided to use the Delta method to find the approximate distribution of ln μᴀ. Now you might be thinking: how the heck can I take the logarithm of a Gaussian random variable? And (again) you’ll be completely right — the Gaussian distribution’s support includes negative numbers, so its logarithm isn’t properly defined. But (again) with a sample size of tens of thousands, it’s a “good enough” approximation, as μᴀ’s distribution is “pretty far away” from 0.
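Concretely, the Delta method says ln μᴀ is approximately Gaussian with mean ln(mu_a) and variance (sd_a/mu_a)², where mu_a and sd_a are the posterior mean and standard deviation from Snippet 2. So:

D₂ = ln μʙ − ln μᴀ ≈ N( ln(mu_b) − ln(mu_a) ,  (sd_a/mu_a)² + (sd_b/mu_b)² )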
d1_norm = norm(loc=mu_b - mu_a, scale=np.sqrt(sd_a ** 2 + sd_b ** 2))
d2_norm = norm(loc=np.log(mu_b) - np.log(mu_a),
               scale=np.sqrt((sd_a / mu_a)**2 + (sd_b / mu_b)**2))

print(f'The probability the average income in B is 2% lower (or worse) is {d2_norm.cdf(np.log(.98))}')
# The probability the average income in B is 2% lower (or worse) is 0.00011622384023304196
print(f'The 95% credibility interval of (m_b - m_a) is {d1_norm.ppf((.025, .975))}')
# The 95% credibility interval of (m_b - m_a) is [-0.04834271 1.94494332]
Snippet 4: Calculating (some) inferential metrics for the Normal-Normal model
Calculating the Risk
Now we have finally arrived at the important part: the Risk is the most important measure in Bayesian A/B testing. It replaces the P-value as a decision rule, but it also serves as a stopping rule, since the Bayesian A/B test has a dynamic sample size.
It is interpreted as “When B is worse than A, if I choose B, how many conversions am I expected to lose?”, and formally it is defined as:
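Risk(B) = E[ (Pᴀ − Pʙ) · 1{Pᴀ > Pʙ} ]
= E[ Pᴀ · 1{Pᴀ > Pʙ} ] − E[ Pʙ · 1{Pᴀ > Pʙ} ]
= ∫ p · fᴀ(p) · Fʙ(p) dp − E[ Pʙ · 1{Pᴀ > Pʙ} ]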
with 1 as an indicator function, fᴀ as Pᴀ’s posterior density and Fʙ as Pʙ’s posterior CDF (the Risk of choosing A is defined symmetrically). Notice the integral on the third line: I really hate that integral! It doesn’t have an analytical solution, and we can’t approximate it with the CLT. But as I previously mentioned, Monte-Carlo simulations are NOT an option for us. So what did we do?
Gaussian quadratures (GQ) are a cool way of approximating integrals with a weighted sum of a small number of nodes. The nodes and weights are calculated by the GQ algorithm. Formally, we find n nodes (x) and weights (w) that best approximate the integral of g with a summation:
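∫ g(x) · f(x) dx ≈ Σᵢ wᵢ · g(xᵢ)

where f is the weight function of the quadrature (in our case, a posterior density).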
As we saw earlier, the Risk metric is an integral, so we can try to approximate it with a GQ. First, let’s simplify its expression:
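Risk(B) = ξᴀ + ξʙ − E[Pʙ],   Risk(A) = ξᴀ + ξʙ − E[Pᴀ]

where ξᴀ = ∫ p · fᴀ(p) · Fʙ(p) dp = E[ Pᴀ · Fʙ(Pᴀ) ], and ξʙ is defined symmetrically.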
Now ξᴀ is an integral we can approximate with a Gaussian quadrature! We can accurately calculate the Risk with about 20 nodes, instead of thousands in a Monte-Carlo simulation:
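ξᴀ ≈ Σᵢ wᵢ · xᵢ · Fʙ(xᵢ)

with the nodes xᵢ and weights wᵢ computed for Pᴀ’s posterior density as the weight function (and similarly for ξʙ).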
We implemented this approximation using scipy.special.roots_hermitenorm and a work-around for scipy.special.roots_sh_jacobi (more on the work-around in my notes). This isn’t very fast, but it’s the fastest way we found.
from scipy.special import roots_hermitenorm  # , roots_sh_jacobi
from orthogonal import roots_sh_jacobi
# The following throws an integer overflow error
# if using scipy when a + b are too large
# Use the log trick instead (see my PR at scipy / work-around in repo)

def beta_gq(n, a, b):
    x, w, m = roots_sh_jacobi(n, a + b - 1, a, True)
    w /= m
    return x, w

nodes_a, weights_a = beta_gq(24, alpha_a, beta_a)
nodes_b, weights_b = beta_gq(24, alpha_b, beta_b)

gq = sum(nodes_a * beta.cdf(nodes_a, alpha_b, beta_b) * weights_a) + \
     sum(nodes_b * beta.cdf(nodes_b, alpha_a, beta_a) * weights_b)
risk_beta = gq - beta.mean((alpha_a, alpha_b), (beta_a, beta_b))

print(f'The risk of choosing A is losing {risk_beta[0]} conversions per user.\n'
      f'The risk of choosing B is losing {risk_beta[1]} conversions per user.')
# The risk of choosing A is losing 0.021472801833822552 conversions per user.
# The risk of choosing B is losing 0.0007175909729974506 conversions per user.

def norm_gq(n, loc, scale):
    x, w, m = roots_hermitenorm(n, True)
    x = scale * x + loc
    w /= m
    return x, w

nodes_a, weights_a = norm_gq(24, mu_a, sd_a)
nodes_b, weights_b = norm_gq(24, mu_b, sd_b)

gq = sum(nodes_a * norm.cdf(nodes_a, mu_b, sd_b) * weights_a) + \
     sum(nodes_b * norm.cdf(nodes_b, mu_a, sd_a) * weights_b)
risk_norm = gq - norm.mean((mu_a, mu_b))

print(f'The risk of choosing A is losing {risk_norm[0]}$ per user.\n'
      f'The risk of choosing B is losing {risk_norm[1]}$ per user.')
# The risk of choosing A is losing 0.9544550905343314$ per user.
# The risk of choosing B is losing 0.006154785995697409$ per user.
Snippet 5: Calculating the Risk metric with SciPy’s Gaussian quadrature
Summary
The mathematics and programming in Bayesian A/B testing are a bit more challenging than in the Frequentist framework. However, as I argued in my previous post, I think it is worth it. Although it is more computationally expensive, the added clarity and interpretability of the A/B test give enormous value to anyone who uses it.
In this post, I reviewed the basics of the Beta-Binomial and Normal-Normal models for Bayesian A/B testing, and presented some of the approximations we implemented at Wix.com. While these approximations may not be 100% accurate, they are “good enough” when experiments have thousands of users, and they allowed us to support Bayesian A/B testing on a large scale.
This post focused on how to calculate the Bayesian metrics of an A/B test, not on how to analyze it or when to stop it. If you want to read more about those issues, the posts in my references give an excellent overview.
References
Here are some of the posts that I read when I was first learning about Bayesian A/B testing. The last two posts are a bit more critical of the framework, and I especially recommend reading them.
- The Power of Bayesian A/B Testing by Michael Frasco
- Bayesian A/B testing — a practical exploration with simulations by Blake Arnold
- Bayesian A/B Testing at VWO by Chris Stucchio
- Bayesian vs. Frequentist A/B Testing: What’s the Difference? by Alex Birkett
- Is Bayesian A/B Testing Immune to Peeking? Not Exactly by David Robinson
The last post is probably my favorite; I’ve read it a couple of times. In particular, by recreating its simulations, I learned about the effect of choosing different priors on the error rate when stopping the Bayesian A/B test dynamically (more on that subject in the appendix).
I’ve also built some cool R Shiny apps related to Bayesian A/B testing, using only the formulas presented in this post:
- Beta-Binomial A/B Test Calculator
- Normal-Normal A/B Test Calculator
- A cool app comparing Runtime & Accuracy between Bayesian & Frequentist A/B testing
All the code I used in this post can be found in this directory.
Appendix: On The Choice of the Prior
There is a niche debate about the importance of choosing a prior in Bayesian A/B testing. In my opinion, using an informative and suitable prior is very important, and here’s why:
- I view the prior as a sort of “regularization” of the A/B test, and it plays a major role when dealing with the multiple comparisons issue. Yes, it is still an issue, even though we’re not using the Frequentist framework.
- Since we use Bayesian A/B testing sequentially (looking at the results every day and stopping once the Risk is below the threshold), using a “weak” (low) prior increases our error rate. See David Robinson’s post in my references.
- One cannot avoid the dilemma of choosing a prior by using an uninformative prior. Choosing an uninformative prior is still a choice of a prior, just a very lousy one.
I think it is important to find and use an appropriate prior in a Bayesian model. At Wix, we did so by automating a “prior builder” process, where we automatically query historical data from the past few weeks and build a prior from that data, as Frasco and Arnold suggest in their posts. Only when there isn’t enough data do we fall back to an “uninformative” prior (α=1, β=1 for the Beta distribution, μ₀=0, σ₀=1, n₀=1 for the Gaussian distribution).
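As a rough illustration of the idea (a minimal sketch with made-up numbers, not our actual prior builder), one simple way to turn historical conversion rates into a Beta prior is matching moments:

import numpy as np

# hypothetical daily conversion rates from the past few weeks
historical_rates = np.array([0.19, 0.21, 0.20, 0.22, 0.18, 0.20, 0.21])

m, v = historical_rates.mean(), historical_rates.var()
# method-of-moments fit of a Beta(alpha_0, beta_0) prior
common = m * (1 - m) / v - 1
alpha_0 = m * common
beta_0 = (1 - m) * common

In practice you may also want to scale down α₀ + β₀ (the prior’s effective sample size) so the prior doesn’t overwhelm the data from the test itself.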
Notes
Ordered by descending level of importance
- scipy.special.roots_sh_jacobi encounters an integer overflow error when α or β are too large. I’ve copied the source code and applied the “log trick” to make it work; you can see it here. I also have a PR (here) for that issue.
- If you’re programming in R, you can approximate the Risk with a Gaussian quadrature using the statmod::gauss.quad.prob function.
- The parameters μ₀=0, σ₀=1, n₀=1 aren’t the Gaussian distribution’s uninformative prior, but they result in a very “weak” prior. The reason we don’t use the actual uninformative prior is that it isn’t a conjugate prior, which would make the calculations much more difficult.
- We found that the Normal-Normal model suits most of our continuous KPIs. While there are more sophisticated models (for example a product of an Exponential and Beta distribution, here), we found that in a large sample the Gaussian approximation yields similar results, and it’s much simpler to calculate (Occam’s razor to the rescue). Once again, I invite you to verify this by running your own simulations, or to look at mine here.
- In my previous post I wrote that a lot of metrics aren’t that numerically different between the Bayesian and Frequentist models. Indeed, they closely resemble each other, and in fact, if they didn’t, we would have been worried!
Here’s a nice example: In a large sample the moments of the Beta distribution in Eq. 5 are almost identical to the CLT approximations of the rate’s estimate and its logarithm in a proportion test. Setting α, β = 0 for simplicity, one gets:
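E[Pᴀ] = Xᴀ/nᴀ,   Var[Pᴀ] = (Xᴀ/nᴀ)(1 − Xᴀ/nᴀ)/(nᴀ + 1) ≈ (Xᴀ/nᴀ)(1 − Xᴀ/nᴀ)/nᴀ
E[ln Pᴀ] = ψ(Xᴀ) − ψ(nᴀ) ≈ ln(Xᴀ/nᴀ),   Var[ln Pᴀ] = ψ₁(Xᴀ) − ψ₁(nᴀ) ≈ 1/Xᴀ − 1/nᴀ

which match the point estimate of the conversion rate and the CLT / Delta-method variances used in a Frequentist proportion test.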
- Notice that in the Normal-Normal model I presented, the standard deviation doesn’t have a posterior of its own. The reason is that at Wix we aren’t interested in inferring the standard deviation, so we treat it as a nuisance parameter. If you do need to infer it, I refer you to the Normal-Inverse-Gamma prior.