What is CUPED (Controlled-experiment Using Pre-Experiment Data)?

Introduction

CUPED (Controlled-experiment Using Pre-Experiment Data) was introduced in Deng et al. (2013) (Reference 1), and is probably the most used variance reduction technique in A/B testing in the tech industry. At a high level, you can think of CUPED as applying the technique of control variates (which I’ve written about in this previous blog post) to the A/B testing set-up. However, CUPED is such an important special case that it is worth describing in full. The paper also contains practical tips on how to implement CUPED and is a definite must-read.

Before describing CUPED, let me cover some notation and background. Assume we are in the A/B testing setting and we want to evaluate the impact of some treatment on a response metric. For individual $i$, let:

• $Y_i(T)$ denote the value of the metric we would see if the individual was given the treatment,
• $Y_i(C)$ denote the value of the metric we would see if the individual was not given the treatment (i.e. was in control),
• $Y_i$ denote the observed value (i.e. $Y_i = Y_i(T)$ or $Y_i = Y_i(C)$, depending on whether $i$ was in treatment or control).

We want to estimate the average treatment effect (ATE) across individuals, $\Delta = \mathbb{E}[Y_i(T) - Y_i(C)]$. The most commonly used estimator for this is the difference-in-means estimator

\begin{aligned} \hat\Delta &= \left(\dfrac{ \sum_{i \text{ in treatment}} Y_i}{\# \{ i \text{ in treatment} \}} \right) - \left(\dfrac{ \sum_{i \text{ in control}} Y_i}{\# \{ i \text{ in control} \}} \right) \\ &=: \overline{Y}_T - \overline{Y}_C. \end{aligned}

The difference-in-means estimator is unbiased for the ATE and has a certain variance. CUPED is another estimator for the ATE that is (approximately) unbiased and usually has smaller variance than the difference-in-means estimator.
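As a point of reference for what follows, here is a minimal Python sketch of the difference-in-means estimator. The function name `difference_in_means` and the toy numbers are made up for illustration; they are not from the paper.

```python
from statistics import mean


def difference_in_means(y_treatment, y_control):
    """Difference-in-means estimate of the ATE: mean(Y_T) - mean(Y_C)."""
    return mean(y_treatment) - mean(y_control)


# Toy observed metric values (hypothetical, for illustration only).
y_t = [3.0, 5.0, 4.0, 6.0]
y_c = [2.0, 4.0, 3.0, 3.0]
delta_hat = difference_in_means(y_t, y_c)
```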

Key idea

Let’s focus on just estimating $\mathbb{E}[Y_i(T)]$. The difference-in-means estimator estimates this with $\overline{Y}_T$. Imagine that on top of collecting metric values $Y_1, Y_2, \dots, Y_{n_t}$ in the treatment group, we also collected pre-experiment values on another (real-valued) variable $X_1, X_2, \dots, X_{n_t}$. Let’s also assume that we know the mean of $X$ (which we denote by $\mathbb{E}[X]$). For any fixed parameter $\theta$, we have

\begin{aligned} \mathbb{E}[Y_i(T)] &= \mathbb{E}[\overline{Y}_T] \\ &= \mathbb{E}[\overline{Y}_T - \theta X] + \theta \mathbb{E}[X] \\ &= \mathbb{E}[\overline{Y}_T - \theta \overline{X}_T] + \theta \mathbb{E}[X]. \end{aligned}

Hence,

$\widetilde{Y}_T = \overline{Y}_T - \theta \overline{X}_T + \theta \mathbb{E}[X]$

is an unbiased estimator for $\mathbb{E}[Y_i(T)]$. Working through some variance computations, we can show that the variance of $\widetilde{Y}_T$ is minimized when $\theta = \text{Cov}(Y, X) / \text{Var}(X)$, and at this value of $\theta$, we have

\begin{aligned} \text{Var} (\widetilde{Y}_T) = (1-\rho^2) \text{Var}(\overline{Y}_T) \leq \text{Var}(\overline{Y}_T), \end{aligned}

where $\rho$ is the correlation between $Y$ and $X$.
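For readers who want the variance computation spelled out, here is a sketch. It assumes that the pairs $(X_i, Y_i)$ in the treatment group are i.i.d., and it uses the fact that $\theta \mathbb{E}[X]$ is a constant and so does not contribute to the variance:

\begin{aligned} \text{Var}(\widetilde{Y}_T) &= \text{Var}(\overline{Y}_T - \theta \overline{X}_T) \\ &= \text{Var}(\overline{Y}_T) - 2\theta \text{Cov}(\overline{Y}_T, \overline{X}_T) + \theta^2 \text{Var}(\overline{X}_T) \\ &= \dfrac{1}{n_t} \left[ \text{Var}(Y) - 2\theta \text{Cov}(Y, X) + \theta^2 \text{Var}(X) \right]. \end{aligned}

This is a quadratic in $\theta$ with positive leading coefficient, so setting its derivative to zero gives the minimizer $\theta = \text{Cov}(Y, X) / \text{Var}(X)$. Substituting this value back in,

\begin{aligned} \text{Var}(\widetilde{Y}_T) &= \dfrac{1}{n_t} \left[ \text{Var}(Y) - \dfrac{\text{Cov}(Y, X)^2}{\text{Var}(X)} \right] \\ &= \dfrac{\text{Var}(Y)}{n_t} (1 - \rho^2) \\ &= (1 - \rho^2) \text{Var}(\overline{Y}_T), \end{aligned}

where the second line uses $\rho^2 = \text{Cov}(Y, X)^2 / (\text{Var}(X) \text{Var}(Y))$.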

Dotting our i’s and crossing our t’s

Before we can use $\widetilde{Y}_T = \overline{Y}_T - \theta \overline{X}_T + \theta \mathbb{E}[X]$ as an estimator, we need to address 3 issues.

First, we don’t know the value of $\theta = \text{Cov}(Y, X) / \text{Var}(X)$. Notice that $\theta$ is simply the population regression coefficient for $X$ when we regress $Y$ on $X$. Hence, we can replace $\theta$ with its sample quantity $\hat\theta$, the regression coefficient for $Y$ on $X$ with the sample that we have: $(X_1, Y_1), \dots, (X_{n_t}, Y_{n_t})$.

This approximation causes $\widetilde{Y}_T$ to no longer be exactly unbiased, because $\mathbb{E}[\hat\theta \overline{X}_T] \neq \theta \mathbb{E}[X]$ in general: both $\hat\theta$ and $\overline{X}_T$ depend on $X_1, \dots, X_{n_t}$, complicating the expectation computation. If we want exact unbiasedness, we can use a subsample to estimate $\hat\theta$, then use the rest of the sample in the expression for $\widetilde{Y}_T$. (It’s usually not worth the effort to do so.)

Second, we don’t know the value of $\mathbb{E}[X]$. We can’t simply use the sample mean $\overline{X}_T$ as an estimate for it, because plugging that in gives $\overline{Y}_T - \theta \overline{X}_T + \theta \overline{X}_T = \overline{Y}_T$, i.e. it reduces $\widetilde{Y}_T$ to the original sample mean. We could use a subsample to estimate $\mathbb{E}[X]$, then use the rest of the sample in the expression for $\widetilde{Y}_T$.

In the A/B testing setting, we don’t have to do anything that fancy! Remember that the quantity we are really interested in is not $\mathbb{E}[Y_i(T)]$ but $\Delta = \mathbb{E}[Y_i(T)] - \mathbb{E}[Y_i(C)]$. Using analogous reasoning for estimating $\mathbb{E}[Y_i(C)]$, we see that

\begin{aligned} \widetilde{Y}_T - \widetilde{Y}_C &= (\overline{Y}_T - \theta \overline{X}_T + \theta \mathbb{E}[X]) - (\overline{Y}_C - \theta \overline{X}_C + \theta \mathbb{E}[X]) \\ &= (\overline{Y}_T - \theta \overline{X}_T) - (\overline{Y}_C - \theta \overline{X}_C) \end{aligned}

is an unbiased estimator for $\Delta$ as well. The $\theta \mathbb{E}[X]$ cancels out, so we don’t have to estimate it.
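To make the estimator concrete, here is a minimal Python sketch with a single covariate, using only the standard library. The names `ols_slope`, `cuped_ate` and `simulate` are hypothetical, the data-generating process is invented for illustration, and estimating one $\hat\theta$ from both arms pooled together is an assumption of this sketch rather than something prescribed above.

```python
import random
from statistics import mean, variance


def ols_slope(x, y):
    """Sample regression coefficient of y on x: Cov(y, x) / Var(x)."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    return cov_xy / var_x


def cuped_ate(y_t, x_t, y_c, x_c):
    """(Y_T - theta * X_T) - (Y_C - theta * X_C), using arm means and theta_hat."""
    # Assumption of this sketch: a single theta_hat estimated on the pooled sample.
    theta_hat = ols_slope(x_t + x_c, y_t + y_c)
    return (mean(y_t) - theta_hat * mean(x_t)) - (mean(y_c) - theta_hat * mean(x_c))


def simulate(n, effect, rng):
    """Hypothetical data: X is a pre-period metric, Y = X + noise (+ effect)."""
    x_t = [rng.gauss(10, 3) for _ in range(n)]
    x_c = [rng.gauss(10, 3) for _ in range(n)]
    y_t = [x + rng.gauss(0, 1) + effect for x in x_t]
    y_c = [x + rng.gauss(0, 1) for x in x_c]
    return y_t, x_t, y_c, x_c


rng = random.Random(0)
dim_estimates, cuped_estimates = [], []
for _ in range(500):
    y_t, x_t, y_c, x_c = simulate(n=200, effect=0.5, rng=rng)
    dim_estimates.append(mean(y_t) - mean(y_c))          # difference-in-means
    cuped_estimates.append(cuped_ate(y_t, x_t, y_c, x_c))  # CUPED

# Empirical variance of CUPED relative to difference-in-means across replications.
variance_ratio = variance(cuped_estimates) / variance(dim_estimates)
```

In this made-up setup, $\text{Var}(X) = 9$ and $\text{Var}(Y) = 10$, so $\rho^2 = 0.9$ and `variance_ratio` should land somewhere near $1 - \rho^2 = 0.1$, up to simulation noise.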

Third, we don’t know which $X$ to use. In theory, we can use any variable $X$. However, recall the variance computation

\begin{aligned} \text{Var} (\widetilde{Y}_T) = (1-\rho^2) \text{Var}(\overline{Y}_T) \leq \text{Var}(\overline{Y}_T), \end{aligned}

where $\rho$ is the correlation between $Y$ and $X$. Thus, we want to pick variables that are most correlated with the metric that we are measuring. Deng et al. (2013) note that in the A/B testing setting, the same metric we want to estimate ($Y$) but evaluated on a pre-experiment time period often gives the most variance reduction. This often makes sense: e.g. for engagement metrics, users who are highly engaged before the experiment tend to be highly engaged during the experiment as well.

An important caution here is that $X$ must not be affected by the experiment’s treatment. (All pre-experiment variables meet this requirement; Reference 1 adds some other possibilities in Section 4.3.) This is because for CUPED to be unbiased, we assumed that $\mathbb{E}[X]$ has the same value for the treatment and control populations. If $X$ is affected by the treatment such that $\mathbb{E}[X]$ differs across the treatment arms, CUPED will be biased.
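The following Python sketch illustrates this caution with a deliberately extreme, made-up scenario: the covariate is measured during the experiment, the treatment shifts it by the same amount as the metric, and $Y$ tracks $X$ one-for-one. The helper names and the data-generating process are hypothetical, and the pooled $\hat\theta$ is an assumption of the sketch.

```python
import random
from statistics import mean


def ols_slope(x, y):
    """Sample regression coefficient of y on x: Cov(y, x) / Var(x)."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    return cov_xy / var_x


def cuped_ate(y_t, x_t, y_c, x_c):
    """CUPED-style estimate with a single theta_hat from the pooled sample."""
    theta_hat = ols_slope(x_t + x_c, y_t + y_c)
    return (mean(y_t) - theta_hat * mean(x_t)) - (mean(y_c) - theta_hat * mean(x_c))


TRUE_EFFECT = 0.5
rng = random.Random(1)
estimates = []
for _ in range(200):
    n = 200
    # X is measured DURING the experiment here, so the treatment shifts it too.
    x_c = [rng.gauss(10, 3) for _ in range(n)]
    x_t = [rng.gauss(10, 3) + TRUE_EFFECT for _ in range(n)]
    # Y follows X plus noise, so the true ATE on Y equals TRUE_EFFECT.
    y_c = [x + rng.gauss(0, 1) for x in x_c]
    y_t = [x + rng.gauss(0, 1) for x in x_t]
    estimates.append(cuped_ate(y_t, x_t, y_c, x_c))

# Average estimate across replications; compare against TRUE_EFFECT.
avg_estimate = mean(estimates)
```

Here $\hat\theta$ is close to 1, so the adjustment $\hat\theta (\overline{X}_T - \overline{X}_C)$ subtracts nearly the whole treatment effect and `avg_estimate` ends up close to 0 instead of 0.5.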

Some other notes

• There is an obvious generalization to go from one control variate $X$ to multiple control variates $X_1, \dots, X_K$: see my previous blog post for some details.
• There is a strong connection between stratification and CUPED: see Section 3.3 and Appendix A of Reference 1.
• The discussion above applies to the estimation of the treatment effect for count metrics and not to ratio metrics (see this previous post for definitions of count and ratio metrics). See Appendix B of Reference 2 for how to apply CUPED to ratio metrics.

Recommendations

Let me end this post with the 4 CUPED recommendations listed in the paper (emphasis mine):

1. Variance reduction works best for metrics where the distribution varies significantly across the user population. One common class of such metrics [is] where the value is very different for light and heavy users. Queries-per-user is a paradigmatic example of such a metric.
2. Using the metric measured in the pre-period as the covariate typically provides the best variance reduction.
3. Using a pre-experiment period of 1-2 weeks works well for variance reduction. Too short a period will lead to poor matching, whereas too long a period will reduce correlation with the outcome metric during the experiment period.
4. Never use covariates that could be affected by the treatment, as this could bias the results. We have shown an example where directionally opposite conclusions could result if this requirement is violated.

References:

1. Deng, A., et al. (2013). Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data.