Simulating Split-Tests
In my previous blog post, I demonstrated how to visualize split test parameters by drawing the distributions for A and B. Doing so lets you see the outcomes in which we correctly conclude that B is greater than A. You can also see why false positives are possible by stacking the two distributions on top of each other and observing that some of B's outcomes fall past the cutoff point even when there's no real difference. If this still sounds confusing, I have one final way of demonstrating it: simulations. In this blog post, we will walk through writing a split test simulator in C++. You can find the full code for this demonstration here.
Although simulating split tests might sound complicated, it's actually quite straightforward. First, let's consider a simple class for modeling A/B tests.
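A minimal sketch of such a class might look like this (the member names are illustrative, and the methods are filled in over the rest of the post):

```cpp
#include <random>

// Models a single A/B test: the true conversion rates of each variant,
// the required sample size, and the running trial/success counts.
class SplitTest {
 public:
  SplitTest(double p_a, double p_b, int required_samples)
      : p_a_(p_a), p_b_(p_b), required_samples_(required_samples) {}

  // Defined in the sections below.
  double Random();
  int BernoulliTrial(double p);
  void TrialA();
  void TrialB();
  bool Step();
  double PValue() const;

 private:
  double p_a_;                          // true conversion rate of A
  double p_b_;                          // true conversion rate of B
  int required_samples_;                // required samples per variant
  int trials_a_ = 0, successes_a_ = 0;  // observed counts for A
  int trials_b_ = 0, successes_b_ = 0;  // observed counts for B
  std::mt19937 rng_{std::random_device{}()};  // Mersenne Twister PRNG
};
```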
We need to simulate N trials for both A and B, where N is the required sample size. To simulate each trial, we can use Bernoulli trials. This is just a fancy way of assigning a success or failure outcome to each trial, given a probability p. This is accomplished by picking a random number between 0 and 1: if the random number is less than or equal to p, the trial is assigned a success; otherwise, it's assigned a failure.
The following method is used to generate random numbers between 0 and 1 using the Mersenne Twister pseudo-random number generator (PRNG). This PRNG is used because of its availability in the standard library and its suitability for statistical applications. Don't worry if this seems confusing. All you need to know is that this generates random numbers between 0 and 1.
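A sketch of what this might look like, using the rng_ member from the class above:

```cpp
// Returns a uniformly distributed random number in [0, 1), drawn from
// the Mersenne Twister engine seeded in the class definition.
double SplitTest::Random() {
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  return dist(rng_);
}
```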
The following method returns 1 for success with probability p; otherwise, it returns 0 for failure. Using this convention will make it easy to count all of the successes and failures.
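A minimal version:

```cpp
// Performs one Bernoulli trial: returns 1 (success) with probability p
// and 0 (failure) otherwise.
int SplitTest::BernoulliTrial(double p) {
  return Random() <= p ? 1 : 0;
}
```

Because successes are 1s and failures are 0s, summing the return values gives a success count directly.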
Included below are some convenience methods for running a trial for A or B and updating the associated values.
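Assuming the counters from the class sketch, they might look like this:

```cpp
// Runs one trial for A and updates A's counts.
void SplitTest::TrialA() {
  ++trials_a_;
  successes_a_ += BernoulliTrial(p_a_);
}

// Runs one trial for B and updates B's counts.
void SplitTest::TrialB() {
  ++trials_b_;
  successes_b_ += BernoulliTrial(p_b_);
}
```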
Next, we can create a method to step through the simulation. Note that we randomly pick A or B with equal probability to simulate a user visiting the site and being randomly assigned to one of the two variants. This method also returns a boolean indicating whether the test should continue running.
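A sketch:

```cpp
// Simulates one visitor: assigns them to A or B with equal probability
// and runs the corresponding trial. Returns true while either variant
// still needs more samples.
bool SplitTest::Step() {
  if (Random() < 0.5) {
    TrialA();
  } else {
    TrialB();
  }
  return trials_a_ < required_samples_ || trials_b_ < required_samples_;
}
```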
Next, we need a way of running each simulation in its entirety. The function below runs multiple simulations and increments a global variable called "diffs," which we will use to keep track of the number of times a positive result was observed (i.e., the p-value was less than 0.05).
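Something like the following works. I'm assuming a two-sided, pooled two-proportion z-test for the significance check, and I've made diffs atomic so that the multithreaded version below stays correct:

```cpp
#include <atomic>
#include <cmath>

// Global counter of simulations that produced a significant result.
std::atomic<int> diffs{0};

// Two-sided p-value from a pooled two-proportion z-test, using the
// normal approximation: P(|Z| > z) = erfc(z / sqrt(2)).
double SplitTest::PValue() const {
  double rate_a = static_cast<double>(successes_a_) / trials_a_;
  double rate_b = static_cast<double>(successes_b_) / trials_b_;
  double pooled = static_cast<double>(successes_a_ + successes_b_) /
                  (trials_a_ + trials_b_);
  double se = std::sqrt(pooled * (1.0 - pooled) *
                        (1.0 / trials_a_ + 1.0 / trials_b_));
  double z = std::fabs(rate_a - rate_b) / se;
  return std::erfc(z / std::sqrt(2.0));
}

// Runs n simulations to completion and counts the significant ones.
void RunSimulations(int n, double p_a, double p_b, int required_samples) {
  for (int i = 0; i < n; ++i) {
    SplitTest test(p_a, p_b, required_samples);
    while (test.Step()) {
    }
    if (test.PValue() < 0.05) {
      ++diffs;
    }
  }
}
```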
We could use this function as-is to perform our simulations, but let's set up multithreading to take full advantage of all of the CPU cores on our machine. The following function does just that, in addition to initializing diffs to 0 and printing the result of the simulation with the specified label.
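A sketch using std::thread:

```cpp
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

// Splits the simulations evenly across the hardware threads, waits for
// them all to finish, then prints the fraction that were significant.
void RunSimulationsThreaded(const char* label, int total, double p_a,
                            double p_b, int required_samples) {
  diffs = 0;
  unsigned num_threads =
      std::max(1u, std::thread::hardware_concurrency());
  int per_thread = total / num_threads;
  std::vector<std::thread> threads;
  for (unsigned i = 0; i < num_threads; ++i) {
    threads.emplace_back(RunSimulations, per_thread, p_a, p_b,
                         required_samples);
  }
  for (auto& t : threads) t.join();
  std::printf("%s: %.2f%%\n", label,
              100.0 * diffs.load() / (per_thread * num_threads));
}
```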
Finally, in our main function, we can initialize and run the simulations for the "false positive rate" and "true positive rate," as shown below. Note that to simulate the false positive rate, we set B's conversion rate equal to A's, simulating the scenario in which there's no actual difference between the two.
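Something like this, where the simulation count is an arbitrary choice and B's 21% rate reflects the 5% relative MDE discussed next:

```cpp
int main() {
  const int kSimulations = 100000;  // number of simulated split tests
  const int kSampleSize = 25580;    // required sample size per variant

  // False positive rate: B's true rate equals A's (20%), so every
  // significant result is a false positive.
  RunSimulationsThreaded("false positive rate", kSimulations, 0.20, 0.20,
                         kSampleSize);

  // True positive rate: B genuinely converts at 21%, a 5% relative
  // lift over A's 20% baseline.
  RunSimulationsThreaded("true positive rate", kSimulations, 0.20, 0.21,
                         kSampleSize);
  return 0;
}
```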
Using a sample size calculator with a significance level of 5%, statistical power of 80%, a baseline conversion rate of 20%, and a minimum detectable effect (MDE) of 5%, we get a required sample size of 25,580 per variant.
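For reference, here is a sketch of the standard two-proportion formula such calculators are based on; depending on how the z quantiles are rounded, it lands within a few samples of 25,580:

```cpp
#include <cmath>

// Per-variant sample size for comparing two proportions:
//   n = (z_a * sqrt(2*pbar*(1-pbar)) +
//        z_b * sqrt(p1*(1-p1) + p2*(1-p2)))^2 / (p2 - p1)^2
// where pbar = (p1 + p2) / 2, z_a is the two-sided critical value for
// the significance level, and z_b is the critical value for the power.
int RequiredSampleSize(double p1, double p2, double z_a, double z_b) {
  double pbar = (p1 + p2) / 2.0;
  double num = z_a * std::sqrt(2.0 * pbar * (1.0 - pbar)) +
               z_b * std::sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2));
  return static_cast<int>(
      std::ceil((num * num) / ((p2 - p1) * (p2 - p1))));
}

// Example: RequiredSampleSize(0.20, 0.21, 1.96, 0.8416) is roughly 25,583.
```

If we run the program, we get the following output: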
```
$ ./bazel-bin/main
false positive rate: 4.91%
true positive rate: 80.40%
```
As you can see, the false positive rate is approximately 5%, and the true positive rate is about 80%, which corresponds to the significance level of 5% and the statistical power of 80%.