# Fieller’s confidence sets for ratio of two means

Assume you have a sample $(X_1, Y_1), \dots, (X_n, Y_n)$ drawn i.i.d. from some underlying distribution. Let $\mathbb{E}[X] = \mu_1$ and $\mathbb{E}[Y] = \mu_2$. Our goal is to estimate the ratio of means $\rho := \mu_2 / \mu_1$ and provide a level-$(1-\alpha)$ confidence set for it, i.e. a set $R$ such that $\mathbb{P}_\rho \{ \rho \in R \} \geq 1 - \alpha$. Fieller’s confidence sets are one way to do this. (The earliest reference to Fieller’s work is listed as Reference 1. The exposition here follows that in von Luxburg & Franz 2009 (Reference 2).)

Define the standard estimators for the means and covariances:

\begin{aligned} \hat\mu_1 &:= \dfrac{1}{n}\sum_{i=1}^n X_i, \quad \hat\mu_2 := \dfrac{1}{n}\sum_{i=1}^n Y_i, \\ \hat{c}_{11} &:= \dfrac{1}{n(n-1)}\sum_{i=1}^n (X_i - \hat\mu_1)^2, \\ \hat{c}_{22} &:= \dfrac{1}{n(n-1)}\sum_{i=1}^n (Y_i - \hat\mu_2)^2, \\ \hat{c}_{12} &:= \hat{c}_{21} := \dfrac{1}{n(n-1)} \sum_{i=1}^n (X_i - \hat\mu_1)(Y_i - \hat\mu_2). \end{aligned}

Let $q := q(t_{n-1}, 1 - \alpha/2)$ be the $(1 - \alpha/2)$-quantile of the $t$ distribution with $n-1$ degrees of freedom. Compute the quantities

\begin{aligned} q_\text{exclusive}^2 &= \dfrac{\hat\mu_1^2}{\hat{c}_{11}}, \\ q_\text{complete}^2 &= \dfrac{\hat\mu_2^2 \hat{c}_{11} - 2\hat\mu_1 \hat\mu_2 \hat{c}_{12} + \hat\mu_1^2 \hat{c}_{22} }{\hat{c}_{11}\hat{c}_{22} - \hat{c}_{12}^2}, \end{aligned}

and

\begin{aligned} \ell_{1, 2} = \dfrac{1}{\hat\mu_1^2 - q^2 \hat{c}_{11}} \left[ (\hat\mu_1 \hat\mu_2 - q^2 \hat{c}_{12}) \pm \sqrt{(\hat\mu_1 \hat\mu_2 - q^2 \hat{c}_{12})^2 - (\hat\mu_1^2 - q^2 \hat{c}_{11})(\hat\mu_2^2 - q^2 \hat{c}_{22})} \right] .\end{aligned}

The confidence set $R_\text{Fieller}$ is defined as follows:

\begin{aligned} R_\text{Fieller} = \begin{cases} (-\infty, \infty) &\text{if } q_\text{complete}^2 \leq q^2, \\ (-\infty, \min(\ell_1, \ell_2)] \cup [ \max(\ell_1, \ell_2), \infty) &\text{if } q_\text{exclusive}^2 < q^2 < q_\text{complete}^2, \\ [\min(\ell_1, \ell_2), \max(\ell_1, \ell_2) ] &\text{otherwise.} \end{cases} \end{aligned}

The following result is often known as Fieller’s theorem:

Theorem (Fieller). If $(X, Y)$ is jointly normal, then  $R_{Fieller}$ is an exact confidence region of level $1 - \alpha$ for $\rho$, i.e. $\mathbb{P}_\rho \{ \rho \in R \} = 1 - \alpha$.

This is Theorem 3 of Reference 2, and there is a short proof of the result there. Of course, if $(X, Y)$ is not jointly normal (which is almost always the case), then Fieller’s confidence sets are no longer exact but approximate.

Fieller’s theorem is also valid in the more general setting where we are given two independent samples $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$ (rather than paired samples), and use unbiased estimators for the means and independent unbiased estimators for the covariances. Reference 2 notes that the degrees of freedom for the $t$ distribution needs to be chosen appropriately in this case.

I like the way Reference 2 explicitly lays out the 3 possibilities for the Fieller confidence set:

\begin{aligned} R_\text{Fieller} = \begin{cases} (-\infty, \infty) &\text{if } q_\text{complete}^2 \leq q^2, \\ (-\infty, \min(\ell_1, \ell_2)] \cup [ \max(\ell_1, \ell_2), \infty) &\text{if } q_\text{exclusive}^2 < q^2 < q_\text{complete}^2, \\ [\min(\ell_1, \ell_2), \max(\ell_1, \ell_2) ] &\text{otherwise.} \end{cases} \end{aligned}

The first case corresponds to the setting where both $\mathbb{E}[X]$ and $\mathbb{E}[Y]$ are close to 0: here, we can’t really conclude anything about the ratio. The second case corresponds to the setting where $\mathbb{E}[Y]$ is close to zero while $\mathbb{E}[X]$ is not: here the ratio is something like $C/\epsilon$ or $-C/\epsilon$ where $C$ is a big constant while $\epsilon$ is small. The final case corresponds to the setting where both quantities are not close to zero. The Wikipedia article for Fieller’s theorem describes the confidence set using a single formula. Even though all 3 cases above are contained within this formula, I found the formula a little misleading because on the surface it looks like the confidence set is always of the form $[a, b]$ for some real numbers $a$ and $b$.

References:

1. Fieller, E. C. (1932). The distribution of the index in a normal bivariate population.
2. Von Luxburg, U., and Franz, V. H. (2009). A geometric approach to confidence sets for ratios: Fieller’s theorem, generalizations and bootstrap.

# What is the DeLong test for comparing AUCs?

Set-up

Let’s say that in our sample we have $m$ individuals who truly belong to class 1 (call this group $C_1$) and $n$ individuals who truly belong to class 2 (call this group $C_2$). For each of these individuals, the binary classifier gives a probability of the individual belonging to class 1: denote the probabilities for $C_1$ by $X_1, \dots, X_m$ and the probabilities for $C_2$ by $Y_1, \dots, Y_n$.

Define sensitivity (a.k.a. true positive rate) and specificity (a.k.a. true negative rate) as functions

\begin{aligned} \text{sens}(z) = \frac{1}{m} \sum_{i=1}^m 1\{ X_i \geq z\}, \qquad \text{spec}(z) = \frac{1}{n}\sum_{j=1}^n 1 \{ Y_j < z \}, \end{aligned}

where $z \in [0, 1]$. The receiver operating characteristic (ROC) curve is a common way to summarize the quality of a binary classifier: it simply plots sensitivity vs. 1 – specificity. An ROC curve will always pass through the points $(0, 0)$ and $(1, 1)$ and should be above the diagonal $y = x$. The closer the curve gets to the point $(0, 1)$, the better the binary classifier.

Area under the curve (AUC) is a way to summarize the entire ROC curve in a single number: it is simply the area under the ROC curve. An uninformative classifier will have an AUC of 0.5; the larger the AUC the better a classifier is thought to be. (One should really look at the full ROC as well to understand the classifier’s tradeoff between sensitivity and specificity.

(Note: Sometimes the ROC curve is drawn with specificity on the x-axis rather than 1-specificity. This has the effect of flipping the ROC curve along the vertical line $x = 0.5$ and does not change the AUC.)

A key insight is that the area under an empirical ROC curve, when calculated by the trapezoidal rule, is equal to the Mann-Whitney two-sample statistic applied to $C_1$ and $C_2$. We have a formula for computing this statistic:

\begin{aligned} \hat{\theta} = \frac{1}{mn}\sum_{i = 1}^m \sum_{j = 1}^n \psi(X_i, Y_j), \quad \text{where}\quad \psi(X, Y) = \begin{cases} 1 &Y < X, \\ 1/2 &Y = X, \\ 0 &Y < X. \end{cases} \end{aligned}

It is an unbiased estimate of $\theta$, the probability that a randomly selected observation from the population represented by $C_2$ will have a score less than or equal to that for a randomly selected observation from the population represented by $C_1$.

Deriving an asymptotic distribution for AUCs

Imagine we are in a situation where instead of just one binary classifier we have $K$ of them. For observation $i$ in $C_1$, let $X_i^k$ denote classifier $k$‘s estimated probability that it belongs to class 1. Define $Y_j^k$ similarly for observations in $C_2$. The $k$th empirical AUC is defined by

\begin{aligned} \hat{\theta}^k = \frac{1}{mn}\sum_{i = 1}^m \sum_{j = 1}^n \psi(X_i^k, Y_j^k). \end{aligned}

Let $\boldsymbol{\hat{\theta}} = \begin{pmatrix} \hat{\theta}_1 & \dots & \hat{\theta}_K \end{pmatrix}^T \in \mathbb{R}^K$ be the vector of the $K$ empirical AUCs, and let $\boldsymbol{\theta} = \begin{pmatrix} \theta_1 & \dots & \theta_K \end{pmatrix}$ be the vector of true AUCs. In order to do inference for the empirical AUCs we need to determine its probability distribution.

DeLong et al. (1988) note that the Mann-Whitney statistic is a generalized U-statistic, and hence the asymptotic theory developed for U-statistics applies. Let $\boldsymbol{L} \in \mathbb{R}^K$ be some fixed vector of coefficients. Then asymptotically

\begin{aligned} \dfrac{\mathbf{L}^T\boldsymbol{\hat{\theta}} - \mathbf{L}^T \boldsymbol{\theta} }{\sqrt{\mathbf{L}^T \left( \frac{1}{m}\mathbf{S}_{10} + \frac{1}{n}\mathbf{S}_{01} \right) \mathbf{L} }} \end{aligned}

has the standard normal distribution $\mathcal{N}(0, 1)$. $\mathbf{S}_{10}$ and $\mathbf{S}_{01}$ are $K \times K$ matrices, with the $(r, s)$th element defined as follows:

\begin{aligned} \left( \mathbf{S}_{10} \right)_{rs} &= \frac{1}{m-1} \sum_{i=1}^m \left[V_{10}^r (X_i) - \hat{\theta}^r \right]\left[V_{10}^s (X_i) - \hat{\theta}^s \right], \text{ and} \\ \left( \mathbf{S}_{01} \right)_{rs} &= \frac{1}{n-1} \sum_{j=1}^n \left[V_{01}^r (Y_j) - \hat{\theta}^r \right]\left[V_{01}^s (Y_j) - \hat{\theta}^s \right], \text{ where} \\ V_{10}^r (X_i) &= \frac{1}{n}\sum_{j=1}^n \psi (X_i^r, Y_j^r), \text{ and} \\ V_{01}^r (Y_j) &= \frac{1}{m}\sum_{i=1}^m \psi (X_i^r, Y_j^r). \end{aligned}

From this asymptotic distribution we can construct confidence intervals and hypothesis tests for the contrast $\mathbf{L}^T \boldsymbol{\theta}$.

Example: Comparing two AUCs

If we just want to compare two AUCs (to test if they are equal), we can set $\mathbf{L} = \begin{pmatrix} 1 & -1 \end{pmatrix}^T$ in the above. The null hypothesis is

\begin{aligned} H_0: \theta_1 = \theta_2, \qquad i.e. \: \mathbf{L}^T \boldsymbol{\theta} = 0. \end{aligned}

Under the null hypothesis, the asymptotic distribution of the quantity

\begin{aligned} \dfrac{\mathbf{L}^T\boldsymbol{\hat{\theta}} - \mathbf{L}^T \boldsymbol{\theta} }{\sqrt{\mathbf{L}^T \left( \frac{1}{m}\mathbf{S}_{10} + \frac{1}{n}\mathbf{S}_{01} \right) \mathbf{L} }} = \dfrac{\hat{\theta}_1 - \hat{\theta}_2}{\sqrt{\mathbf{L}^T \left( \frac{1}{m}\mathbf{S}_{10} + \frac{1}{n}\mathbf{S}_{01} \right) \mathbf{L} }} \end{aligned}

Is the standard normal distribution. If the quantity is too large (in absolute value for a two-sided test), then we can reject the null hypothesis and conclude that there is a statistically significant difference between the two AUCs.

References:

1. DeLong, E. R., et al. (1988). Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach.

# Wilson score and Agresti-Coull intervals for binomial proportions

Let’s say that we are trying to estimate the success probability of some event. We run the experiment $n$ times and record the number of successes, denoted by the random variable $X$. A natural estimate for the success probability would be $\hat{p} = X / n$, the empirical success probability.

For the estimate to be useful, we need a sense of how certain we are of our estimate: we need something like a confidence interval. Possibly the most commonly used confidence interval is based on the normal approximation due to the central limit theorem: the $(1-\alpha)$-level confidence interval is

\begin{aligned} \hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \end{aligned}

where $z$ is the $(1 - \alpha / 2)$-quantile of the standard normal distribution. The problem with this confidence interval is that it is unreliable when $n$ is small or when $\hat{p}$ is close to 0 or 1. For example, if we get 0 successes out of 3 tries, the confidence interval above consists of the single point $\{ 0 \}$, which is likely to be overly precise.

The Wilson score interval (Edwin B. Wilson, 1927), tries to fix this by replacing $\hat{p}$ with $p$ in the standard error expression. Since we do not know the true value of $p$, we instead set the endpoints of the confidence interval to be the solutions of the equations

\begin{aligned} p = \hat{p} \pm z \sqrt{\frac{p(1-p)}{n}}. \end{aligned}

Solving these equations is straightforward, leading to a fairly nasty expression for the confidence interval:

\begin{aligned} \frac{\hat{p} + \frac{z^2}{2n}}{1 + \frac{z^2}{n}} \; \pm \; \frac{z}{1 + \frac{z^2}{n}} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n} + \frac{z^2}{4n}}. \end{aligned}

Let $\tilde{p} = \dfrac{\hat{p} + \frac{z^2}{2n}}{1 + \frac{z^2}{n}}$ be the center point of the Wilson score interval. It is interesting to note that

\begin{aligned} \tilde{p} = \hat{p} \left(\frac{n}{n+z^2} \right) + \frac{1}{2}\left( \frac{z^2}{n+z^2} \right). \end{aligned}

$\tilde{p}$ shrinks $\hat{p}$ toward $\dfrac{1}{2}$: it is a weighted mean of the two quantities, with increasing weight given to $\hat{p}$ as $n$ grows to infinity.

The Agresti-Coull interval adds an additional approximation to the Wilson score interval. Instead of using its complicated expression for the standard error, use the normal approximation as if $\tilde{p}$ was $\hat{p}$:

\begin{aligned} \tilde{p} \pm z \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n}}. \end{aligned}

The Agresti-Coull interval has a nice interpretation for the 95% confidence interval. At that level, $z^2 \approx 4$, so we can view the Agresti-Coull interval as “add two success and two failures, then use the normal approximation interval”.

(Credits: I learnt of the Agresti-Coull interval from this post by Andrew Gelman.)

References:

1. Wikipedia. Binomial proportion confidence interval.
2. Agresti, A., and Coull, B. A. (1998). Approximate is better than “exact” for interval estimation of binomial proportions.