Acing a Perfect A/B Test - Mastering the Art of an Experiment
- Edoe Balint
- Aug 27, 2023
- 5 min read
Scope: Achieving an impeccable setup and steering clear of typical pitfalls during the execution of an online experiment.
If you have previously conducted an A/B test, you are likely familiar with the frustration of not seeing the anticipated outcomes, observing no discernible difference between the variants, or extending the test in the hope of achieving the desired results, to no avail.
Having conducted over 500 experiments, we are here to share best practices that rely on straightforward reasoning. With these practical demonstrations, you can reduce the reliance on complex scientific explanations; nevertheless, it is always possible to run statistical or scientific tests in the background. In general, the following factors determine how quickly a test reaches a conclusion:
User Base - A larger user base enables us to draw conclusions more quickly.
Expected outcome - A greater difference in the expected outcome will enable us to observe a significant impact sooner (assuming our expectations are correct). Further details on how to achieve this will be addressed in a separate blog post.
Duration - A longer experiment duration increases the likelihood of establishing differentiation between the variants.
Standard deviation of measured KPIs - If the daily fluctuations of the key performance indicators (KPIs) we wish to measure are minimal, even a small expected change can lead to conclusive results.
Best practice for determining the experiment's runtime:
Considering the aforementioned factors, it is crucial to establish duration expectations prior to commencing the experiment. Falling into the common trap of anticipating a particular change, not observing it, and subsequently extending the duration should be avoided. To estimate the length of the experiment, helpful free tools like the following can provide a rough approximation: https://www.evanmiller.org/ab-testing/sample-size.html.
This simple calculator covers a single overall metric (conversion rate). In this document we will try to explain why we probably need much less data than this calculator suggests.
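To see roughly what such a calculator computes, here is a minimal sketch of the standard two-proportion sample-size formula (the function name and defaults are our own; this is an illustration, not the calculator's exact code):

```python
# Sketch of the classic two-proportion sample-size formula, similar in spirit
# to what online sample-size calculators compute. Names here are illustrative.
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(p_base, min_detectable_effect,
                            alpha=0.05, power=0.8):
    """Users needed per variant to detect an absolute lift in conversion rate."""
    p_var = p_base + min_detectable_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_beta = NormalDist().inv_cdf(power)            # statistical power
    p_bar = (p_base + p_var) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_base * (1 - p_base) + p_var * (1 - p_var)))
    return ceil((numerator / min_detectable_effect) ** 2)


# e.g. baseline 5% conversion, hoping to detect an absolute +1% lift:
print(sample_size_per_variant(0.05, 0.01))
```

With a small baseline rate and a small expected lift, the required sample per variant quickly runs into the thousands, which is exactly why the techniques below for concluding earlier matter.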
Here is the most important point, the true question we need to ask: for each additional data point we collect, are we gaining confidence toward a conclusion, or are we just seeing more of the same mixed trend? Usually you can predict what is going to happen based on what you have collected so far.
Why is this so significant? Our primary attention should be directed at the business objective rather than getting caught up in the analytical aspect. In essence, if we ultimately determine that our experimental concepts were either detrimental or unhelpful to the business, and that realization takes a considerable amount of time, what have we gained? It would be far better to state early on that we won't uncover any advantage (with a 99% confidence level) than to keep gathering additional data.
For each experiment we should clarify the most important KPI to examine. In this example, we will demonstrate this on a cohort metric called ARPU (Average Revenue Per User), measured by days since the experiment started.
How can you save on data collection (reduce duration) by looking at Dx ARPU?
Instead of looking at the end goal, we look at all the components towards the goal.
For instance, suppose we are running a pricing experiment for 3 weeks. Who will win? The variant with the highest d21 (days since experiment started) ARPU.
But instead of looking at that single data point, we will look at the entire curve leading up to the d21 ARPU metric.
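As an illustration, a hypothetical helper for building that Dx ARPU curve from per-user cumulative revenue might look like this (the function name and data shape are assumptions, not our production code):

```python
# Hypothetical sketch: build the Dx ARPU curve for a variant so the whole
# trajectory up to d21 can be inspected, not just the final data point.
# Each user is a list of cumulative revenue values per day since joining.

def dx_arpu_curve(revenue_by_user, max_day):
    """Average cumulative revenue per user for each day-since-start x."""
    curve = []
    for day in range(max_day + 1):
        # only users who have been in the test for at least `day` days count
        eligible = [r[day] for r in revenue_by_user if len(r) > day]
        curve.append(sum(eligible) / len(eligible) if eligible else 0.0)
    return curve


# Toy data: three users per variant, up to d2
control = dx_arpu_curve([[0, 1, 1], [0, 0, 2], [1, 2]], max_day=2)
variant = dx_arpu_curve([[0, 2, 3], [1, 1, 4], [0, 3]], max_day=2)
```

Plotting both curves against days since start gives exactly the kind of picture discussed below.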
Figure a - variant types

In the example above we can learn that:
For Variant 1, time is neither helping nor hurting the spread from control, but the spread is solid (our internal score for concluding is high, though not improving dramatically as time evolves)
For Variant 2, time is not helping the test, so gaining more data points probably won't help in concluding (in our internal score, every time the lines cross each other the score drops)
For Variant 3, time is helping the spread and the data is already solid enough for concluding (our internal score will climb higher and higher)
Figure b - our confidence score (compared to the control)

Based on the above, we learn that:
1. Variant 1 is neither gaining nor losing confidence as we progress
2. Variant 2 is not gaining any confidence as we move forward (and therefore continuing might be a waste of time)
3. Variant 3 is gaining confidence as we move forward
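The internal score described above could be sketched roughly like this (the scoring weights are illustrative assumptions, not our production values):

```python
# Illustrative sketch of the internal score idea: the score rises each day the
# variant stays on the same side of control, and drops whenever the two
# curves cross (mixed data points). Weights here are made up for illustration.

def internal_score(control_curve, variant_curve):
    score = 0.0
    prev_sign = 0
    for c, v in zip(control_curve, variant_curve):
        sign = (v > c) - (v < c)      # +1 above control, -1 below, 0 tie
        if sign == 0:
            continue
        if prev_sign and sign != prev_sign:
            score -= 2.0              # curves crossed: confidence drops
        else:
            score += 1.0              # same side as before: confidence grows
        prev_sign = sign
    return score


# Variant-3 style: consistently above control -> score climbs
print(internal_score([1, 2, 3, 4], [1.5, 2.6, 3.8, 5.0]))   # 4.0
# Variant-2 style: crossing back and forth -> score keeps dropping
print(internal_score([1, 2, 3, 4], [1.2, 1.8, 3.3, 3.7]))   # -5.0
```

Tracking this score day by day shows whether each additional data point is buying confidence or just noise.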
If we run a significance test (confidence intervals or Bayesian intervals), we will see that variants 1 and 3 are significantly different from the control.
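One simple way to run such a check without closed-form statistics is a bootstrap interval on the ARPU difference; here is a hedged sketch (function name, defaults, and toy data are our own assumptions):

```python
# Illustrative sketch: a bootstrap 95% interval for the ARPU difference
# (variant - control), as one alternative to closed-form confidence intervals.
import random


def bootstrap_diff_ci(control, variant, n_boot=2000, seed=7):
    """95% bootstrap interval for the per-user revenue mean difference."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        c = [rng.choice(control) for _ in control]   # resample with replacement
        v = [rng.choice(variant) for _ in variant]
        diffs.append(sum(v) / len(v) - sum(c) / len(c))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]


# Toy per-user revenue: 10% payers in control, 30% payers in the variant.
low, high = bootstrap_diff_ci([0] * 90 + [10] * 10, [0] * 70 + [10] * 30)
# If the interval excludes 0, the ARPU difference is significant.
```

If the whole interval sits above (or below) zero, the variant is significantly different from control at roughly the 95% level.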
The second layer of the score (with lower impact on it) examines all the components of ARPU (Average Revenue Per User). For instance, for an online pricing web experiment:
Visitor to user conversion rate (cohort based)
User to payer conversion rate (cohort based)
Repeat conversion rates (cohort based)
Engagement and retention rates (cohort based)
The same scoring logic applies here: if the trends of these KPIs follow the ARPU curve, the score increases, and vice versa.
Again, the best way to look at this is with days since start on the x axis.
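A minimal sketch of how such a trend-agreement check could work (the function and its scoring are illustrative assumptions, not our exact formula):

```python
# Illustrative sketch: how closely a component KPI's spread vs control moves
# with the ARPU spread, day over day. 1.0 means the trends fully agree.

def trend_agreement(arpu_spread, kpi_spread):
    """Fraction of days where the KPI spread moves with the ARPU spread."""
    agree = 0
    steps = 0
    for i in range(1, min(len(arpu_spread), len(kpi_spread))):
        a = arpu_spread[i] - arpu_spread[i - 1]
        k = kpi_spread[i] - kpi_spread[i - 1]
        if a == 0 or k == 0:
            continue                      # flat day: no directional signal
        steps += 1
        agree += (a > 0) == (k > 0)       # same direction counts as agreement
    return agree / steps if steps else 0.0


# ARPU spread growing while the conversion-rate spread also grows: full agreement
print(trend_agreement([0, 1, 2, 3], [0, 0.1, 0.2, 0.3]))   # 1.0
```

A high agreement across the component KPIs supports the ARPU conclusion; a low one suggests the headline number may be fragile.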
Some other important points:
If we are limited in volume, it might be better to test only one variant at a time. Which one to start with? The one with the highest expected magnitude (no matter whether it is good or bad)
When the results are already solid at the very beginning, there is no point in keeping the experiment running.
During the experiment, we collect data based on install (registration) date cohorts. If that is not possible, we need to run a few checks to make sure that the test is valid
Each day we look at how robust the data is (our internal duration score). Each cohort or additional data point that does not follow the trend hurts the score, and vice versa (reminder: the added value here is that we see how the score looked over time, not only a single up-to-date score!)
We should conclude based on what we observe, not on a model. For instance, we should prefer Dx ARPU over expected LTV
Outliers: depending on the product being tested, it can be worthwhile to exclude the biggest customers from the test to make sure the trend looks the same (for instance, in the gaming world we exclude the whales to make sure a single whale is not skewing the results)
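A toy sketch of that whale check (the helper and the numbers are made up for illustration):

```python
# Illustrative sketch: recompute ARPU after dropping the top spenders from
# each variant, and compare the lift with and without them.

def arpu_without_whales(revenues, top_n=1):
    """ARPU after dropping the `top_n` highest-revenue users."""
    trimmed = sorted(revenues)[:-top_n] if top_n else list(revenues)
    return sum(trimmed) / len(trimmed) if trimmed else 0.0


control = [0, 0, 1, 2, 3]
variant = [0, 1, 2, 3, 500]          # one whale dominates the raw ARPU
raw_lift = sum(variant) / len(variant) - sum(control) / len(control)
trimmed_lift = arpu_without_whales(variant) - arpu_without_whales(control)
# raw_lift looks huge; trimmed_lift shows the single whale drove most of it
```

If the trimmed lift tells a very different story from the raw one, the result hinges on a single customer and should not be trusted.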
Contact us to learn more about it!