Fieller’s confidence sets for the ratio of two means

Assume you have a sample (X_1, Y_1), \dots, (X_n, Y_n) drawn i.i.d. from some underlying distribution. Let \mathbb{E}[X] = \mu_1 and \mathbb{E}[Y] = \mu_2. Our goal is to estimate the ratio of means \rho := \mu_2 / \mu_1 and provide a level-(1-\alpha) confidence set for it, i.e. a set R such that \mathbb{P}_\rho \{ \rho \in R \} \geq 1 - \alpha. Fieller’s confidence sets are one way to do this. (The earliest reference to Fieller’s work is listed as Reference 1. The exposition here follows that in von Luxburg & Franz 2009 (Reference 2).)

Define the standard estimators for the means and covariances:

\begin{aligned} \hat\mu_1 &:= \dfrac{1}{n}\sum_{i=1}^n X_i, \quad \hat\mu_2 := \dfrac{1}{n}\sum_{i=1}^n Y_i, \\  \hat{c}_{11} &:= \dfrac{1}{n(n-1)}\sum_{i=1}^n (X_i - \hat\mu_1)^2, \\  \hat{c}_{22} &:= \dfrac{1}{n(n-1)}\sum_{i=1}^n (Y_i - \hat\mu_2)^2, \\  \hat{c}_{12} &:= \hat{c}_{21} := \dfrac{1}{n(n-1)} \sum_{i=1}^n (X_i - \hat\mu_1)(Y_i - \hat\mu_2). \end{aligned}
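
To make these formulas concrete, here is a minimal NumPy sketch of the estimators (the function fieller_estimates and its interface are my own, not from the references):

```python
import numpy as np

def fieller_estimates(x, y):
    """Mean estimates and the (scaled) covariance estimates defined above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    mu1, mu2 = x.mean(), y.mean()
    # Note the 1 / (n(n-1)) scaling: these estimate the covariance matrix of
    # the mean estimators (mu1_hat, mu2_hat), not of (X, Y) itself.
    c11 = np.sum((x - mu1) ** 2) / (n * (n - 1))
    c22 = np.sum((y - mu2) ** 2) / (n * (n - 1))
    c12 = np.sum((x - mu1) * (y - mu2)) / (n * (n - 1))
    return mu1, mu2, c11, c22, c12
```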

Let q := q(t_{n-1}, 1 - \alpha/2) be the (1 - \alpha/2)-quantile of the t distribution with n-1 degrees of freedom. Compute the quantities

\begin{aligned} q_\text{exclusive}^2 &= \dfrac{\hat\mu_1^2}{\hat{c}_{11}}, \\  q_\text{complete}^2 &= \dfrac{\hat\mu_2^2 \hat{c}_{11} - 2\hat\mu_1 \hat\mu_2 \hat{c}_{12} + \hat\mu_1^2 \hat{c}_{22} }{\hat{c}_{11}\hat{c}_{22} - \hat{c}_{12}^2}, \end{aligned}

and

\begin{aligned} \ell_{1, 2} = \dfrac{1}{\hat\mu_1^2 - q^2 \hat{c}_{11}} \left[ (\hat\mu_1 \hat\mu_2 - q^2 \hat{c}_{12}) \pm \sqrt{(\hat\mu_1 \hat\mu_2 - q^2 \hat{c}_{12})^2 - (\hat\mu_1^2 - q^2 \hat{c}_{11})(\hat\mu_2^2 - q^2 \hat{c}_{22})} \right] .\end{aligned}

The confidence set R_\text{Fieller} is defined as follows:

\begin{aligned} R_\text{Fieller} = \begin{cases} (-\infty, \infty) &\text{if } q_\text{complete}^2 \leq q^2, \\  (-\infty, \min(\ell_1, \ell_2)] \cup [ \max(\ell_1, \ell_2), \infty) &\text{if } q_\text{exclusive}^2 < q^2 < q_\text{complete}^2, \\ [\min(\ell_1, \ell_2), \max(\ell_1, \ell_2) ] &\text{otherwise.} \end{cases} \end{aligned}
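
Continuing the sketch above, the whole construction can be put together as follows. The quantile comes from scipy.stats.t; the tagged-tuple return values for the three cases are an arbitrary convention of mine:

```python
from scipy.stats import t

def fieller_confidence_set(x, y, alpha=0.05):
    """Fieller confidence set for rho = E[Y] / E[X] at level 1 - alpha."""
    mu1, mu2, c11, c22, c12 = fieller_estimates(x, y)
    n = len(x)
    q = t.ppf(1 - alpha / 2, df=n - 1)  # (1 - alpha/2)-quantile of t_{n-1}

    q_excl_sq = mu1**2 / c11
    q_comp_sq = (mu2**2 * c11 - 2 * mu1 * mu2 * c12 + mu1**2 * c22) / (
        c11 * c22 - c12**2
    )

    if q_comp_sq <= q**2:
        return ("all_reals",)  # R_Fieller = (-inf, inf)

    # The discriminant is nonnegative whenever q_comp_sq > q^2; the max
    # guards against tiny negative values from floating-point error.
    disc = (mu1 * mu2 - q**2 * c12) ** 2 - (mu1**2 - q**2 * c11) * (
        mu2**2 - q**2 * c22
    )
    disc = max(disc, 0.0)
    mid = mu1 * mu2 - q**2 * c12
    denom = mu1**2 - q**2 * c11
    l1, l2 = (mid - disc**0.5) / denom, (mid + disc**0.5) / denom
    lo, hi = min(l1, l2), max(l1, l2)

    if q_excl_sq < q**2:
        return ("complement", lo, hi)  # (-inf, lo] U [hi, inf)
    return ("interval", lo, hi)        # [lo, hi]
```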

The following result is often known as Fieller’s theorem:

Theorem (Fieller). If (X, Y) is jointly normal, then R_\text{Fieller} is an exact confidence region of level 1 - \alpha for \rho, i.e. \mathbb{P}_\rho \{ \rho \in R_\text{Fieller} \} = 1 - \alpha.

This is Theorem 3 of Reference 2, and there is a short proof of the result there. Of course, if (X, Y) is not jointly normal (which is almost always the case), then Fieller’s confidence sets are no longer exact but approximate.
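
As a quick sanity check of the exactness claim (not a substitute for the proof in Reference 2), here is a small Monte Carlo sketch under joint normality, continuing from the code above; the true means and covariance matrix are arbitrary made-up values, and the empirical coverage should land near 0.95:

```python
rng = np.random.default_rng(0)
mu = np.array([2.0, 1.0])   # true (mu_1, mu_2), so rho = mu_2 / mu_1 = 0.5
cov = np.array([[1.0, 0.3],
                [0.3, 2.0]])
rho = mu[1] / mu[0]

def covers(cs, rho):
    """Check whether a confidence set returned above contains rho."""
    if cs[0] == "all_reals":
        return True
    if cs[0] == "complement":
        return rho <= cs[1] or rho >= cs[2]
    return cs[1] <= rho <= cs[2]

n_sims, n = 5000, 30
hits = 0
for _ in range(n_sims):
    x, y = rng.multivariate_normal(mu, cov, size=n).T
    hits += covers(fieller_confidence_set(x, y, alpha=0.05), rho)
print(f"Empirical coverage: {hits / n_sims:.3f}")  # should be close to 0.95
```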

Fieller’s theorem is also valid in the more general setting where we are given two independent samples X_1, \dots, X_n and Y_1, \dots, Y_m (rather than paired samples), and use unbiased estimators for the means and independent unbiased estimators for the covariances. Reference 2 notes that the degrees of freedom for the t distribution need to be chosen appropriately in this case.

I like the way Reference 2 explicitly lays out the three possibilities for the Fieller confidence set displayed above.

The first case corresponds to the setting where both \mathbb{E}[X] and \mathbb{E}[Y] are close to 0: here, we can’t really conclude anything about the ratio. The second case corresponds to the setting where the denominator \mathbb{E}[X] is close to zero while \mathbb{E}[Y] is not: here the ratio is something like C/\epsilon or -C/\epsilon, where C is a big constant while \epsilon is small, so the confidence set excludes only a bounded interval. The final case corresponds to the setting where both quantities are not close to zero. The Wikipedia article for Fieller’s theorem describes the confidence set using a single formula. Even though all three cases above are contained within this formula, I found the formula a little misleading because on the surface it looks like the confidence set is always of the form [a, b] for some real numbers a and b.

References:

  1. Fieller, E. C. (1932). The distribution of the index in a normal bivariate population.
  2. von Luxburg, U., and Franz, V. H. (2009). A geometric approach to confidence sets for ratios: Fieller’s theorem, generalizations and bootstrap.

Count metrics and ratio metrics

In A/B testing, most of the metrics we work with fall into one of two categories: count metrics or ratio metrics. I think the naming convention can be a bit confusing; this post aims to clear up the confusion.

TLDR: I think the easiest way to distinguish between the two types of metrics is to compare the unit of analysis with the unit of randomization for the relevant A/B test.

Count metrics

For count metrics, the unit of analysis is the same as the unit of randomization. For example, if the unit of randomization is a user, count metrics would include revenue per user, clicks per user, etc. In mathematical notation, for a given randomization unit, let Y denote the variable that we are interested in (revenue or clicks in the example above). The count metric we want to estimate (i.e. the estimand) is

M = \mathbb{E}[Y].

The most common way to estimate this is with the sample mean: take a random sample of randomization units (indexed by i = 1, \dots, n), get the values Y_1, \dots, Y_n, and estimate M with

\begin{aligned} \widehat{M} = \dfrac{\sum_i Y_i}{n}. \end{aligned}

Ratio metrics

For ratio metrics, the unit of analysis is at a more granular level than the unit of randomization. For example, when the unit of randomization is a user, ratio metrics would include revenue per session, clicks per page view, etc.

For a given randomization unit, let Y denote the variable we are interested in (revenue and clicks in the example above), and let Z denote the number of units of analysis for this randomization unit (sessions and page views in the example above). The ratio metric we want to estimate is

R = \dfrac{\mathbb{E}[Y]}{\mathbb{E}[Z]}.

The most common way to estimate this is to replace both the numerator and denominator with the respective sample means:

\begin{aligned} \widehat{R} = \dfrac{\sum_i Y_i / n}{\sum_i Z_i / n} = \dfrac{\sum_i Y_i}{\sum_i Z_i}. \end{aligned}
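
In code, both estimators are one-liners. Here is a minimal sketch with made-up per-user arrays (the names revenue and sessions are hypothetical):

```python
import numpy as np

# Hypothetical per-user data: one entry per randomization unit (user).
revenue = np.array([0.0, 12.5, 3.0, 0.0, 7.5])  # Y_i: revenue for user i
sessions = np.array([1, 4, 2, 1, 3])            # Z_i: sessions for user i

# Count metric estimate: revenue per user.
m_hat = revenue.mean()

# Ratio metric estimate: revenue per session. Note this is a ratio of sums
# (equivalently, a ratio of sample means), NOT the mean of per-user ratios.
r_hat = revenue.sum() / sessions.sum()

print(m_hat, r_hat)  # 4.6, ~2.09
```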

Some notes in closing

  1. Count metrics are a special case of ratio metrics with Z = 1 for all randomization units.
  2. Count metrics are easier to work with than ratio metrics because the denominators for count metrics are fixed while those for ratio metrics are random.
  3. Actually, it is not quite true that the denominators for count metrics are fixed. In practice, we often don’t specify the sample size n in advance. Rather, we run our experiment for a fixed length of time, and the sample size is simply the number of randomization units that entered the experiment during that time frame. Thus, n is really a random variable. However, in practice we just assume that n is fixed and analyze the experiment as such. (In other words, the analysis is conditional on the number of samples we got.)


Confusion over ratio metrics in causal inference

Set-up

Assume that we are in the potential outcomes set-up in causal inference. We have a sample of n individuals, and individual i has potential outcomes Y_i(1) and Y_i(0). Y_i(1) denotes the value of individual i’s response if the individual is in the treatment group, while Y_i(0) denotes the value if the individual is in the control group. The fundamental problem of causal inference is that as the experimenter, we only ever get to observe one of Y_i(1) and Y_i(0), but NEVER BOTH.

In causal inference, a common target that we want to estimate is the average treatment effect (ATE), defined as the expected difference in potential outcomes:

\begin{aligned} ATE = \mathbb{E}[Y(1) - Y(0)]. \end{aligned}

Ratio metrics: why we care

In some cases, we might be interested in a ratio of potential outcomes instead. This is not as unusual as one might think! We often hear claims such as “if you take this supplement, you will be x% stronger”: x can be expressed as

\begin{aligned} x = 100 \left( \frac{\text{after}}{\text{before}} - 1 \right). \end{aligned}

For example, if before = 100 and after = 110, then x = 100(110/100 - 1) = 10, i.e. “10% stronger.” The work involved in estimating x is essentially the same as that in estimating \frac{\text{after}}{\text{before}}.

The wrong way to define ratio metrics

What is the target we are trying to estimate for a ratio metric? Instinctively, one might define the target as

\begin{aligned} \mathbb{E}\left[\frac{Y(1)}{Y(0)} \right]. \end{aligned}

This is incorrect! In statistical parlance, the quantity above is unidentifiable: even if we have infinite data from this model (i.e. we see an infinite number of observations), we still cannot estimate this target!

Let’s see this through an example. In the first set-up, imagine that there is no treatment effect, i.e. Y(1) = Y(0) for all individuals. Imagine also that Y(0) = 1 for half of the population and Y(0) = 2 for the other half, so \mathbb{E}[Y(1) / Y(0)] = 1.

If I were to run a huge randomized experiment with half of the observations in control and half of the observations in treatment, what would I see? I would see 50% of the controls having value 1, 50% of the controls having value 2, and the same for the treatment group.

Now, imagine a second set-up, where again Y(0) = 1 for half of the population and Y(0) = 2 for half of the population. However, if Y(0) = 1, then Y(1) = 2, and if Y(0) = 2, then Y(1) = 1. In this set-up, \mathbb{E}[Y(1) / Y(0)] = \frac{1}{2} \cdot (2/1) + \frac{1}{2} \cdot (1/2) = 1.25.

What would I see if I were to run a huge randomized experiment? I would see EXACTLY the same data as that in the first set-up: 50% 1s and 50% 2s in the control group, and the same in the treatment group! We will not be able to differentiate between set-ups 1 and 2, even with infinite data, as the observed data will be the same.
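
Here is a tiny simulation sketch of the two set-ups (with the values hard-coded from above) that makes the indistinguishability concrete:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Each individual's potential outcomes (Y(0), Y(1)) under the two set-ups.
y0 = rng.choice([1, 2], size=n)         # Y(0): half 1s, half 2s (in expectation)
setup1 = np.column_stack([y0, y0])      # no effect: Y(1) = Y(0)
setup2 = np.column_stack([y0, 3 - y0])  # Y(0)=1 -> Y(1)=2 and Y(0)=2 -> Y(1)=1

for name, po in [("set-up 1", setup1), ("set-up 2", setup2)]:
    treated = rng.random(n) < 0.5       # randomize half into treatment
    observed = np.where(treated, po[:, 1], po[:, 0])
    # E[Y(1)/Y(0)] differs (1.0 vs 1.25), but the observed data look the same:
    # roughly 50% 1s and 50% 2s in each arm, in both set-ups.
    print(name,
          "E[Y(1)/Y(0)] =", round((po[:, 1] / po[:, 0]).mean(), 3),
          "| mean of observed Y (treated, control):",
          round(observed[treated].mean(), 3),
          round(observed[~treated].mean(), 3))
```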

(Notice that this problem does not arise for the ATE: both set-ups have the same ATE of 0.)

What we do instead

The target that practitioners use for ratio metrics is

\begin{aligned} \frac{\mathbb{E}[Y(1)]}{\mathbb{E}[Y(0)]}. \end{aligned}

If you run it through the two set-ups above, you will find that this target has the same value in both settings: in each case \mathbb{E}[Y(1)] = \mathbb{E}[Y(0)] = 1.5, so the ratio is 1.

Reflections

It took me a while to wrap my head around this. One takeaway I have is that the fundamental problem of causal inference forces us to think hard about what quantities we can even hope to estimate. This is why I think the issue of identification comes up a lot more in causal inference than in the rest of statistics.

I want to end this post by highlighting two other things to worry about when using ratio metrics:

  1. What happens if the denominator can be negative? How do you interpret the target in that case?
  2. What happens if the denominator is very close to zero, or worse, equal to zero? Having something close to zero in the denominator usually causes estimates to be very unstable.

All that to say: approach ratio metrics with care!