Simulating Split-Tests
In my previous blog post, I demonstrated how to visualize split test parameters by drawing the distributions for A and B. Doing so lets you see the outcomes in which we correctly conclude that B is greater than A. You can also see why false positives are possible by stacking the two distributions on top of each other and observing that some of B's outcomes fall past the cutoff point even when there's no real difference. If this still sounds confusing, I have one final way of demonstrating it: simulations. In this blog post, we will walk through writing a split test simulator in C++. You can find the full code for this demonstration here.
Although simulating split tests might sound complicated, it's actually quite straightforward. First, let's consider a simple class for modeling A/B tests.
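A minimal sketch of such a class might look like this (the member names are illustrative, and the methods are filled in over the rest of the post):

```cpp
#include <random>

// Models a single A/B test: the true conversion rates of each variant,
// the required sample size, and the running trial/success counts.
class SplitTest {
 public:
  SplitTest(double p_a, double p_b, int required_samples)
      : p_a_(p_a), p_b_(p_b), required_samples_(required_samples) {}

  // Defined in the sections below.
  double Random();
  int BernoulliTrial(double p);
  void TrialA();
  void TrialB();
  bool Step();
  double PValue() const;

 private:
  double p_a_;                          // true conversion rate of A
  double p_b_;                          // true conversion rate of B
  int required_samples_;                // required samples per variant
  int trials_a_ = 0, successes_a_ = 0;  // observed counts for A
  int trials_b_ = 0, successes_b_ = 0;  // observed counts for B
  std::mt19937 rng_{std::random_device{}()};  // Mersenne Twister PRNG
};
```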
We need to simulate N trials for both A and B, where N is the required sample size. To simulate each trial, we can use Bernoulli trials. This is just a fancy way of assigning a success or failure outcome to each trial, given a probability p. This is accomplished by picking a random number between 0 and 1: if the random number is less than or equal to p, the trial is assigned a success; otherwise, it's assigned a failure.
The following method is used to generate random numbers between 0 and 1 using the Mersenne Twister pseudo-random number generator (PRNG). This PRNG is used because of its availability in the standard library and its suitability for statistical applications. Don't worry if this seems confusing. All you need to know is that this generates random numbers between 0 and 1.
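A sketch of what this might look like, using the rng_ member from the class above:

```cpp
// Returns a uniformly distributed random number in [0, 1), drawn from
// the Mersenne Twister engine seeded in the class definition.
double SplitTest::Random() {
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  return dist(rng_);
}
```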
The following method returns 1 for success with probability p; otherwise, it returns 0 for failure. Using this convention will make it easy to count all of the successes and failures.
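A minimal version:

```cpp
// Performs one Bernoulli trial: returns 1 (success) with probability p
// and 0 (failure) otherwise.
int SplitTest::BernoulliTrial(double p) {
  return Random() <= p ? 1 : 0;
}
```

Because successes are 1s and failures are 0s, summing the return values gives a success count directly.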
Included below are some convenience methods for running a trial for A or B and updating the associated values.
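Assuming the counters from the class sketch, they might look like this:

```cpp
// Runs one trial for A and updates A's counts.
void SplitTest::TrialA() {
  ++trials_a_;
  successes_a_ += BernoulliTrial(p_a_);
}

// Runs one trial for B and updates B's counts.
void SplitTest::TrialB() {
  ++trials_b_;
  successes_b_ += BernoulliTrial(p_b_);
}
```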
Next, we can create a method to step through the simulation. Note that we randomly pick A or B with equal probability to simulate a user visiting the site and being randomly assigned to one of the two variants. This method also returns a boolean indicating whether the test should continue running.
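A sketch:

```cpp
// Simulates one visitor: assigns them to A or B with equal probability
// and runs the corresponding trial. Returns true while either variant
// still needs more samples.
bool SplitTest::Step() {
  if (Random() < 0.5) {
    TrialA();
  } else {
    TrialB();
  }
  return trials_a_ < required_samples_ || trials_b_ < required_samples_;
}
```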
Next, we need a way of running each simulation in its entirety. The function below runs multiple simulations and increments a global variable called "diffs," which we will use to keep track of the number of times a positive result was observed (i.e., the p-value was less than 0.05).
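Something like the following works. I'm assuming a two-sided, pooled two-proportion z-test for the significance check, and I've made diffs atomic so that the multithreaded version below stays correct:

```cpp
#include <atomic>
#include <cmath>

// Global counter of simulations that produced a significant result.
std::atomic<int> diffs{0};

// Two-sided p-value from a pooled two-proportion z-test, using the
// normal approximation: P(|Z| > z) = erfc(z / sqrt(2)).
double SplitTest::PValue() const {
  double rate_a = static_cast<double>(successes_a_) / trials_a_;
  double rate_b = static_cast<double>(successes_b_) / trials_b_;
  double pooled = static_cast<double>(successes_a_ + successes_b_) /
                  (trials_a_ + trials_b_);
  double se = std::sqrt(pooled * (1.0 - pooled) *
                        (1.0 / trials_a_ + 1.0 / trials_b_));
  double z = std::fabs(rate_a - rate_b) / se;
  return std::erfc(z / std::sqrt(2.0));
}

// Runs n simulations to completion and counts the significant ones.
void RunSimulations(int n, double p_a, double p_b, int required_samples) {
  for (int i = 0; i < n; ++i) {
    SplitTest test(p_a, p_b, required_samples);
    while (test.Step()) {
    }
    if (test.PValue() < 0.05) {
      ++diffs;
    }
  }
}
```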
We could use this function as-is to perform our simulations, but let's set up multithreading to take full advantage of all of the CPU cores on our machine. The following function does just that, in addition to initializing diffs to 0 and printing the result of the simulation with the specified label.
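A sketch using std::thread:

```cpp
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

// Splits the simulations evenly across the hardware threads, waits for
// them all to finish, then prints the fraction that were significant.
void RunSimulationsThreaded(const char* label, int total, double p_a,
                            double p_b, int required_samples) {
  diffs = 0;
  unsigned num_threads =
      std::max(1u, std::thread::hardware_concurrency());
  int per_thread = total / num_threads;
  std::vector<std::thread> threads;
  for (unsigned i = 0; i < num_threads; ++i) {
    threads.emplace_back(RunSimulations, per_thread, p_a, p_b,
                         required_samples);
  }
  for (auto& t : threads) t.join();
  std::printf("%s: %.2f%%\n", label,
              100.0 * diffs.load() / (per_thread * num_threads));
}
```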
Finally, in our main function, we can initialize and run the simulations for the "false positive rate" and "true positive rate," as shown below. Note that to simulate the false positive rate, we set B's conversion rate equal to A's, simulating the scenario in which there's no actual difference between the two.
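Something like this, where the simulation count is an arbitrary choice and B's 21% rate reflects the 5% relative MDE discussed next:

```cpp
int main() {
  const int kSimulations = 100000;  // number of simulated split tests
  const int kSampleSize = 25580;    // required sample size per variant

  // False positive rate: B's true rate equals A's (20%), so every
  // significant result is a false positive.
  RunSimulationsThreaded("false positive rate", kSimulations, 0.20, 0.20,
                         kSampleSize);

  // True positive rate: B genuinely converts at 21%, a 5% relative
  // lift over A's 20% baseline.
  RunSimulationsThreaded("true positive rate", kSimulations, 0.20, 0.21,
                         kSampleSize);
  return 0;
}
```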
Using a sample size calculator with a significance level of 5%, statistical power of 80%, a baseline conversion rate of 20%, and a minimum detectable effect (MDE) of 5%, we get a required sample size of 25,580 per variant.
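For reference, here is a sketch of the standard two-proportion formula such calculators are based on; depending on how the z quantiles are rounded, it lands within a few samples of 25,580:

```cpp
#include <cmath>

// Per-variant sample size for comparing two proportions:
//   n = (z_a * sqrt(2*pbar*(1-pbar)) +
//        z_b * sqrt(p1*(1-p1) + p2*(1-p2)))^2 / (p2 - p1)^2
// where pbar = (p1 + p2) / 2, z_a is the two-sided critical value for
// the significance level, and z_b is the critical value for the power.
int RequiredSampleSize(double p1, double p2, double z_a, double z_b) {
  double pbar = (p1 + p2) / 2.0;
  double num = z_a * std::sqrt(2.0 * pbar * (1.0 - pbar)) +
               z_b * std::sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2));
  return static_cast<int>(
      std::ceil((num * num) / ((p2 - p1) * (p2 - p1))));
}

// Example: RequiredSampleSize(0.20, 0.21, 1.96, 0.8416) is roughly 25,583.
```

If we run the program, we get the following output: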
```
$ ./bazel-bin/main
false positive rate: 4.91%
true positive rate: 80.40%
```
As you can see, the false positive rate is approximately 5%, and the true positive rate is about 80%, which corresponds to the significance level of 5% and the statistical power of 80%.