CUPED (Controlled-experiment Using Pre-Experiment Data) was introduced in Deng et al. (2013) (Reference 1), and is probably the most widely used variance reduction technique for A/B testing in the tech industry. At a high level, you can think of CUPED as applying the technique of control variates (which I’ve written about in this previous blog post) to the A/B testing set-up. However, CUPED is such an important special case that it is worth describing in full. The paper also contains practical tips on how to implement CUPED and is a definite must-read.
Before describing CUPED, let me cover some notation and background. Assume we are in the A/B testing setting and we want to evaluate the impact of some treatment on a response metric. For individual $i$, let:
- $Y_i(1)$ denote the value of the metric we would see if the individual was given the treatment,
- $Y_i(0)$ denote the value of the metric we would see if the individual was not given the treatment (i.e. was in control),
- $Y_i$ denote the observed value (i.e. $Y_i(1)$ or $Y_i(0)$, depending on whether $i$ was in treatment or control).
We want to estimate the average treatment effect (ATE) across individuals, $\mathbb{E}[Y_i(1) - Y_i(0)]$. The most commonly used estimator for this is the difference-in-means estimator
$$\hat{\tau}_{DM} = \bar{Y}_t - \bar{Y}_c,$$
where $\bar{Y}_t$ and $\bar{Y}_c$ are the sample means of the observed metric in the treatment and control groups respectively.
The difference-in-means estimator is unbiased for the ATE and has a certain variance. CUPED is another estimator for the ATE that is (approximately) unbiased and usually has smaller variance than the difference-in-means estimator.
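As a baseline to compare against, here is a minimal sketch of the difference-in-means estimator together with its usual standard error. The function name, the unpooled standard error formula, and the toy data (normal metric, true effect of 0.5) are my own illustrative assumptions, not something taken from the paper.

```python
import numpy as np

def difference_in_means(y_treatment, y_control):
    """Return the difference-in-means ATE estimate and its standard error."""
    y_t = np.asarray(y_treatment, dtype=float)
    y_c = np.asarray(y_control, dtype=float)
    estimate = y_t.mean() - y_c.mean()
    # Unpooled standard error: add the variances of the two sample means.
    std_error = np.sqrt(y_t.var(ddof=1) / y_t.size + y_c.var(ddof=1) / y_c.size)
    return estimate, std_error

# Hypothetical toy data: a true treatment effect of 0.5 on a noisy metric.
rng = np.random.default_rng(0)
y_control = rng.normal(loc=10.0, scale=2.0, size=5_000)
y_treatment = rng.normal(loc=10.5, scale=2.0, size=5_000)

estimate, std_error = difference_in_means(y_treatment, y_control)
print(f"difference-in-means estimate = {estimate:.3f}, standard error = {std_error:.3f}")
```

The standard error is the quantity that CUPED tries to shrink: the point estimate stays (approximately) centered on the same ATE, but with less noise around it.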
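Let’s focus on just estimating $\mathbb{E}[Y_i(1)]$. The difference-in-means estimator estimates this with $\bar{Y}_t$, the sample mean of the metric in the treatment group. Imagine that on top of collecting metric values $Y_i$ in the treatment group, we also collected pre-experiment values $X_i$ on another (real-valued) variable $X$. Let’s also assume that we know the mean of $X$ (which we denote by $\mathbb{E}[X]$). For any fixed parameter $\theta$, the quantity
$$\hat{Y}_{cv,t} = \bar{Y}_t - \theta (\bar{X}_t - \mathbb{E}[X])$$
is an unbiased estimator for $\mathbb{E}[Y_i(1)]$, where $\bar{X}_t$ is the sample mean of $X$ in the treatment group. Working through some variance computations, we can show that the variance of $\hat{Y}_{cv,t}$ is minimized when $\theta = \mathrm{Cov}(Y, X) / \mathrm{Var}(X)$, and at this value of $\theta$, we have
$$\mathrm{Var}(\hat{Y}_{cv,t}) = \mathrm{Var}(\bar{Y}_t)(1 - \rho^2),$$
where $\rho$ is the correlation between $Y$ and $X$.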
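For completeness, here is a sketch of the variance computation mentioned above. I write $\bar{Y}$ and $\bar{X}$ for sample means over $n$ i.i.d. pairs $(Y_i, X_i)$ and $\theta^*$ for the minimizing value; this notation is my own shorthand for this sketch.

```latex
\begin{aligned}
\mathrm{Var}\big(\bar{Y} - \theta(\bar{X} - \mathbb{E}[X])\big)
  &= \mathrm{Var}(\bar{Y}) - 2\theta\,\mathrm{Cov}(\bar{Y}, \bar{X}) + \theta^2\,\mathrm{Var}(\bar{X}) \\
  &= \frac{1}{n}\Big(\mathrm{Var}(Y) - 2\theta\,\mathrm{Cov}(Y, X) + \theta^2\,\mathrm{Var}(X)\Big). \\[4pt]
\frac{d}{d\theta} = 0
  &\;\Longrightarrow\; \theta^* = \frac{\mathrm{Cov}(Y, X)}{\mathrm{Var}(X)}. \\[4pt]
\mathrm{Var}\big(\bar{Y} - \theta^*(\bar{X} - \mathbb{E}[X])\big)
  &= \frac{1}{n}\left(\mathrm{Var}(Y) - \frac{\mathrm{Cov}(Y, X)^2}{\mathrm{Var}(X)}\right)
   = \mathrm{Var}(\bar{Y})\,(1 - \rho^2).
\end{aligned}
```

The expression is a convex quadratic in $\theta$, so the stationary point is indeed the minimum, and the last equality uses $\rho^2 = \mathrm{Cov}(Y, X)^2 / (\mathrm{Var}(Y)\,\mathrm{Var}(X))$.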
Dotting our i’s and crossing our t’s
Before we can use $\hat{Y}_{cv,t}$ as an estimator, we need to address 3 issues.
First, we don’t know the value of $\theta$. Notice that $\theta = \mathrm{Cov}(Y, X) / \mathrm{Var}(X)$ is simply the population regression coefficient for $X$ when we regress $Y$ on $X$. Hence, we can replace $\theta$ with its sample quantity $\hat{\theta}$, the regression coefficient for $Y$ on $X$ with the sample that we have: $\hat{Y}_{cv,t} \approx \bar{Y}_t - \hat{\theta} (\bar{X}_t - \mathbb{E}[X])$.
This approximation causes $\hat{Y}_{cv,t}$ to no longer be exactly unbiased, because $\mathbb{E}[\hat{\theta} (\bar{X}_t - \mathbb{E}[X])] \neq 0$ in general: both $\hat{\theta}$ and $\bar{X}_t$ depend on the sampled $X_i$, complicating the expectation computation. If we want exact unbiasedness, we can use a subsample to estimate $\theta$, then use the rest of the sample in the expression for $\hat{Y}_{cv,t}$. (It’s usually not worth the effort to do so.)
Second, we don’t know the value of $\mathbb{E}[X]$. We can’t simply use the sample mean $\bar{X}_t$ as an estimate for it, because plugging that in simply reduces $\hat{Y}_{cv,t}$ to the original sample mean $\bar{Y}_t$. We could use a subsample to estimate $\mathbb{E}[X]$, then use the rest of the sample in the expression for $\hat{Y}_{cv,t}$.
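In the A/B testing setting, we don’t have to do anything that fancy! Remember that the quantity we are really interested in is not $\mathbb{E}[Y_i(1)]$ but $\mathbb{E}[Y_i(1) - Y_i(0)]$. Using analogous reasoning for estimating $\mathbb{E}[Y_i(0)]$ with the control-group sample means $\bar{Y}_c$ and $\bar{X}_c$, we see that
$$\hat{\tau}_{CUPED} = (\bar{Y}_t - \theta \bar{X}_t) - (\bar{Y}_c - \theta \bar{X}_c)$$
is an unbiased estimator for $\mathbb{E}[Y_i(1) - Y_i(0)]$ as well. The $\theta\,\mathbb{E}[X]$ terms cancel out, so we don’t have to estimate $\mathbb{E}[X]$.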
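Here is a minimal sketch of the resulting estimator for the ATE, along with a small simulation comparing its variance to that of difference-in-means. Estimating the regression coefficient from the pooled treatment and control sample is an assumption of this sketch, as are the function names and the simulated data-generating process.

```python
import numpy as np

def cuped_ate(y_t, x_t, y_c, x_c):
    """CUPED-style ATE estimate with one pre-experiment covariate.

    Returns (estimate, theta_hat). theta_hat is the sample version of
    Cov(Y, X) / Var(X), computed on the pooled sample (sketch assumption).
    """
    y_t, x_t = np.asarray(y_t, dtype=float), np.asarray(x_t, dtype=float)
    y_c, x_c = np.asarray(y_c, dtype=float), np.asarray(x_c, dtype=float)
    y_all = np.concatenate([y_t, y_c])
    x_all = np.concatenate([x_t, x_c])
    theta_hat = np.cov(y_all, x_all, ddof=1)[0, 1] / np.var(x_all, ddof=1)
    # The E[X] terms cancel between arms, so only the group means of X appear.
    estimate = (y_t.mean() - theta_hat * x_t.mean()) - (y_c.mean() - theta_hat * x_c.mean())
    return estimate, theta_hat

def simulate_once(rng, n=2_000, effect=0.5):
    """One hypothetical experiment: returns (difference-in-means, CUPED) estimates."""
    x = rng.normal(10.0, 2.0, size=2 * n)             # pre-experiment covariate
    y = 0.8 * x + rng.normal(0.0, 1.0, size=2 * n)    # metric, correlated with x
    treated = np.arange(2 * n) < n                    # rows are i.i.d., so this split is random
    y = y + effect * treated                          # add the treatment effect
    dm = y[treated].mean() - y[~treated].mean()
    cuped, _ = cuped_ate(y[treated], x[treated], y[~treated], x[~treated])
    return dm, cuped

rng = np.random.default_rng(0)
results = np.array([simulate_once(rng) for _ in range(500)])
dm_var, cuped_var = results.var(axis=0, ddof=1)
print(f"mean estimates (DM, CUPED): {results.mean(axis=0)}")
print(f"variance ratio CUPED / DM: {cuped_var / dm_var:.3f}")
```

In this set-up both estimators should center on the true effect of 0.5, while the variance ratio should land near $1 - \rho^2$ for the simulated correlation.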
Third, we don’t know which $X$ to use. In theory, we can use any variable $X$. However, recall the variance computation
$$\mathrm{Var}(\hat{Y}_{cv,t}) = \mathrm{Var}(\bar{Y}_t)(1 - \rho^2),$$
where $\rho$ is the correlation between $Y$ and $X$. Thus, we want to pick variables that are most correlated with the metric that we are measuring. Deng et al. (2013) note that in the A/B testing setting, the same metric we want to estimate ($Y$) but evaluated on a pre-experiment time period often gives the most variance reduction. This often makes sense: e.g. for engagement metrics, users who are highly engaged before the experiment tend to be highly engaged during the experiment as well.
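One practical way to act on this is to compute the squared sample correlation between the metric and each candidate covariate, since that is the fraction of variance the optimal single-covariate adjustment would remove. The sketch below does this on simulated data; the covariate names (`pre_period_metric`, `account_age_days`) and the data-generating process are hypothetical.

```python
import numpy as np

def expected_variance_reduction(y, x):
    """Squared sample correlation: the share of variance removed at the optimal theta."""
    rho = np.corrcoef(y, x)[0, 1]
    return rho ** 2

# Hypothetical data: the metric depends strongly on its own pre-period value
# and only weakly on account age.
rng = np.random.default_rng(0)
n = 10_000
pre_period_metric = rng.gamma(shape=2.0, scale=5.0, size=n)
account_age_days = rng.uniform(0.0, 1_000.0, size=n)
y = 0.9 * pre_period_metric + 0.002 * account_age_days + rng.normal(0.0, 3.0, size=n)

candidates = {
    "pre_period_metric": pre_period_metric,
    "account_age_days": account_age_days,
}
reductions = {name: expected_variance_reduction(y, x) for name, x in candidates.items()}
best = max(reductions, key=reductions.get)
for name, value in reductions.items():
    print(f"{name}: expected variance reduction = {value:.1%}")
print(f"best candidate: {best}")
```

In a real experiment this comparison would be run on historical data, before looking at any treatment effect estimates.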
An important caution here is that $X$ must not be affected by the experiment’s treatment. (All pre-experiment variables meet this requirement; Reference 1 adds some other possibilities in Section 4.3.) This is because for CUPED to be unbiased, we assumed that $\mathbb{E}[X]$ has the same value for the treatment and control populations. If $X$ is affected by the treatment such that $\mathbb{E}[X]$ differs across the treatment arms, CUPED will be biased.
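To make this caution concrete, here is a small simulation sketch in which the same adjustment is applied twice: once with a clean pre-experiment covariate, and once with a covariate that the treatment has shifted. The data-generating process, the size of the shift, and the pooled estimate of the regression coefficient are all assumptions made for illustration.

```python
import numpy as np

def cuped_ate(y_t, x_t, y_c, x_c):
    """CUPED-style ATE estimate with a pooled sample regression coefficient."""
    y_all = np.concatenate([y_t, y_c])
    x_all = np.concatenate([x_t, x_c])
    theta_hat = np.cov(y_all, x_all)[0, 1] / np.var(x_all, ddof=1)
    return (y_t.mean() - theta_hat * x_t.mean()) - (y_c.mean() - theta_hat * x_c.mean())

rng = np.random.default_rng(1)
n, effect = 50_000, 0.5
x = rng.normal(10.0, 2.0, size=2 * n)              # clean pre-experiment covariate
y = 0.8 * x + rng.normal(0.0, 1.0, size=2 * n)     # metric, correlated with x
treated = np.arange(2 * n) < n                     # rows are i.i.d., so this split is random
y = y + effect * treated                           # true ATE is 0.5

# Hypothetical contaminated covariate: measured after exposure, so the
# treatment moves its mean in the treated arm.
x_contaminated = x + 0.5 * treated

clean = cuped_ate(y[treated], x[treated], y[~treated], x[~treated])
contaminated = cuped_ate(
    y[treated], x_contaminated[treated], y[~treated], x_contaminated[~treated]
)
print(f"true effect = {effect}, clean covariate = {clean:.3f}, "
      f"contaminated covariate = {contaminated:.3f}")
```

With the contaminated covariate, the adjustment subtracts part of the real treatment effect, so the estimate is pulled well away from 0.5 no matter how large the sample is.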
Some other notes
- There is an obvious generalization from one control variate to multiple control variates: see my previous blog post for some details.
- There is a strong connection between stratification and CUPED: see Section 3.3 and Appendix A of Reference 1.
- The discussion above applies to the estimation of the treatment effect for count metrics and not to ratio metrics (see this previous post for definitions of count and ratio metrics). See Appendix B of Reference 2 for how to apply CUPED to ratio metrics.
Let me end this post off with the 4 CUPED recommendations listed in the paper (emphasis mine):
- Variance reduction works best for metrics where the distribution varies significantly across the user population. One common class of such metrics [is] where the value is very different for light and heavy users. Queries-per-user is a paradigmatic example of such a metric.
- Using the metric measured in the pre-period as the covariate typically provides the best variance reduction.
- Using a pre-experiment period of 1-2 weeks works well for variance reduction. Too short a period will lead to poor matching, whereas too long a period will reduce correlation with the outcome metric during the experiment period.
- Never use covariates that could be affected by the treatment, as this could bias the results. We have shown an example where directionally opposite conclusions could result if this requirement is violated.
References:
1. Deng, A., et al. (2013). Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data.