# What is the DeLong test for comparing AUCs?

Set-up

Let’s say that in our sample we have $m$ individuals who truly belong to class 1 (call this group $C_1$) and $n$ individuals who truly belong to class 2 (call this group $C_2$). For each of these individuals, the binary classifier gives a probability of the individual belonging to class 1: denote the probabilities for $C_1$ by $X_1, \dots, X_m$ and the probabilities for $C_2$ by $Y_1, \dots, Y_n$.

Define sensitivity (a.k.a. true positive rate) and specificity (a.k.a. true negative rate) as functions \begin{aligned} \text{sens}(z) = \frac{1}{m} \sum_{i=1}^m 1\{ X_i \geq z\}, \qquad \text{spec}(z) = \frac{1}{n}\sum_{j=1}^n 1 \{ Y_j < z \}, \end{aligned}

where $z \in [0, 1]$ is the classification threshold. The receiver operating characteristic (ROC) curve is a common way to summarize the quality of a binary classifier: it plots sensitivity (on the y-axis) against 1 – specificity (on the x-axis) as the threshold $z$ varies. An ROC curve always passes through the points $(0, 0)$ and $(1, 1)$, and for a classifier that does better than random guessing it lies above the diagonal $y = x$. The closer the curve gets to the point $(0, 1)$, the better the binary classifier.
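The definitions above translate directly into code. The following is a minimal sketch, assuming NumPy is available; the function names (`sens`, `spec`, `roc_points`) and the scores are made up for illustration. An extra threshold above every score is added so that the empirical curve starts at $(0, 0)$.

```python
import numpy as np

def sens(x, z):
    """Fraction of class-1 scores X_i with X_i >= z."""
    return float(np.mean(np.asarray(x, float) >= z))

def spec(y, z):
    """Fraction of class-2 scores Y_j with Y_j < z."""
    return float(np.mean(np.asarray(y, float) < z))

def roc_points(x, y):
    """Return (1 - spec, sens) evaluated at every distinct observed score,
    plus one threshold above all scores (which gives the point (0, 0))."""
    scores = np.unique(np.concatenate([np.asarray(x, float), np.asarray(y, float)]))
    thresholds = np.concatenate([scores, [np.inf]])
    fpr = np.array([1.0 - spec(y, z) for z in thresholds])
    tpr = np.array([sens(x, z) for z in thresholds])
    order = np.lexsort((tpr, fpr))  # sort by 1 - spec, then by sens
    return fpr[order], tpr[order]

# Hypothetical scores: x for class 1 (m = 4), y for class 2 (n = 5).
x = [0.9, 0.8, 0.6, 0.4]
y = [0.7, 0.5, 0.3, 0.2, 0.1]
fpr, tpr = roc_points(x, y)
for a, b in zip(fpr, tpr):
    print(f"1 - spec = {a:.2f}, sens = {b:.2f}")
```

Plotting `tpr` against `fpr` gives the empirical ROC curve for these scores.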

Area under the curve (AUC) is a way to summarize the entire ROC curve in a single number: it is simply the area under the ROC curve. An uninformative classifier will have an AUC of 0.5; the larger the AUC, the better a classifier is thought to be. (One should really look at the full ROC curve as well to understand the classifier’s tradeoff between sensitivity and specificity.)

(Note: Sometimes the ROC curve is drawn with specificity on the x-axis rather than 1-specificity. This has the effect of flipping the ROC curve along the vertical line $x = 0.5$ and does not change the AUC.)

A key insight is that the area under an empirical ROC curve, when calculated by the trapezoidal rule, is equal to the Mann-Whitney two-sample statistic applied to $C_1$ and $C_2$. We have a formula for computing this statistic: \begin{aligned} \hat{\theta} = \frac{1}{mn}\sum_{i = 1}^m \sum_{j = 1}^n \psi(X_i, Y_j), \quad \text{where}\quad \psi(X, Y) = \begin{cases} 1 &Y < X, \\ 1/2 &Y = X, \\ 0 &Y > X. \end{cases} \end{aligned}
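The equivalence can be checked numerically. Below is a minimal sketch, assuming NumPy; the function names and the scores (which deliberately include ties) are hypothetical. One function evaluates the Mann-Whitney formula with the kernel $\psi$, the other applies the trapezoidal rule to the empirical ROC curve.

```python
import numpy as np

def mann_whitney_auc(x, y):
    """Average of psi(X_i, Y_j) over all m * n pairs."""
    x = np.asarray(x, float)[:, None]   # shape (m, 1)
    y = np.asarray(y, float)[None, :]   # shape (1, n)
    psi = (y < x) + 0.5 * (y == x)      # 1 if Y < X, 1/2 if tied, 0 otherwise
    return float(psi.mean())

def trapezoid_auc(x, y):
    """Trapezoidal-rule area under the empirical ROC curve."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    scores = np.unique(np.concatenate([x, y]))
    # Thresholds from high to low, so the curve runs from (0, 0) to (1, 1).
    thresholds = np.concatenate([scores, [np.inf]])[::-1]
    tpr = np.array([np.mean(x >= z) for z in thresholds])  # sens(z)
    fpr = np.array([np.mean(y >= z) for z in thresholds])  # 1 - spec(z)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Hypothetical scores with ties within and across the two groups.
x = [0.9, 0.7, 0.7, 0.4]
y = [0.7, 0.5, 0.4, 0.2]
print(mann_whitney_auc(x, y), trapezoid_auc(x, y))  # both 0.78125
```

Tied scores across the two groups produce diagonal segments in the ROC curve, which is exactly where the weight of $1/2$ in $\psi$ comes from.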

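It is an unbiased estimate of $\theta = P(Y < X) + \frac{1}{2} P(Y = X)$, where $X$ and $Y$ are the scores of randomly selected observations from the populations represented by $C_1$ and $C_2$ respectively. When the scores are continuous, ties have probability zero, and $\theta$ is simply the probability that a randomly selected observation from the population represented by $C_2$ will have a lower score than a randomly selected observation from the population represented by $C_1$.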

Deriving an asymptotic distribution for AUCs

Imagine we are in a situation where instead of just one binary classifier we have $K$ of them, all evaluated on the same observations. For observation $i$ in $C_1$, let $X_i^k$ denote classifier $k$’s estimated probability that it belongs to class 1. Define $Y_j^k$ similarly for observations in $C_2$. The $k$th empirical AUC is defined by \begin{aligned} \hat{\theta}^k = \frac{1}{mn}\sum_{i = 1}^m \sum_{j = 1}^n \psi(X_i^k, Y_j^k). \end{aligned}

Let $\boldsymbol{\hat{\theta}} = \begin{pmatrix} \hat{\theta}^1 & \dots & \hat{\theta}^K \end{pmatrix}^T \in \mathbb{R}^K$ be the vector of the $K$ empirical AUCs, and let $\boldsymbol{\theta} = \begin{pmatrix} \theta^1 & \dots & \theta^K \end{pmatrix}^T$ be the vector of true AUCs. (Superscripts index classifiers; they are not powers.) In order to do inference on the AUCs, we need to determine the joint distribution of the empirical AUCs. Because the classifiers are scored on the same observations, the empirical AUCs are correlated.

DeLong et al. (1988) note that the Mann-Whitney statistic is a generalized U-statistic, and hence the asymptotic theory developed for U-statistics applies. Let $\mathbf{L} \in \mathbb{R}^K$ be some fixed vector of coefficients. Then asymptotically \begin{aligned} \dfrac{\mathbf{L}^T\boldsymbol{\hat{\theta}} - \mathbf{L}^T \boldsymbol{\theta} }{\sqrt{\mathbf{L}^T \left( \frac{1}{m}\mathbf{S}_{10} + \frac{1}{n}\mathbf{S}_{01} \right) \mathbf{L} }} \end{aligned}

has the standard normal distribution $\mathcal{N}(0, 1)$. $\mathbf{S}_{10}$ and $\mathbf{S}_{01}$ are $K \times K$ matrices, with the $(r, s)$th element defined as follows: \begin{aligned} \left( \mathbf{S}_{10} \right)_{rs} &= \frac{1}{m-1} \sum_{i=1}^m \left[V_{10}^r (X_i) - \hat{\theta}^r \right]\left[V_{10}^s (X_i) - \hat{\theta}^s \right], \text{ and} \\ \left( \mathbf{S}_{01} \right)_{rs} &= \frac{1}{n-1} \sum_{j=1}^n \left[V_{01}^r (Y_j) - \hat{\theta}^r \right]\left[V_{01}^s (Y_j) - \hat{\theta}^s \right], \text{ where} \\ V_{10}^r (X_i) &= \frac{1}{n}\sum_{j=1}^n \psi (X_i^r, Y_j^r), \text{ and} \\ V_{01}^r (Y_j) &= \frac{1}{m}\sum_{i=1}^m \psi (X_i^r, Y_j^r). \end{aligned}
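The matrices $\mathbf{S}_{10}$ and $\mathbf{S}_{01}$ are sample covariance matrices of the components $V_{10}$ and $V_{01}$, so they can be computed in a few lines. The following is a minimal sketch, assuming NumPy; the function name `delong_components`, the array layout (one row per classifier), and the scores are assumptions made for illustration. It relies on the fact that the average of $V_{10}^r(X_i)$ over $i$ (and of $V_{01}^r(Y_j)$ over $j$) equals $\hat{\theta}^r$, so `np.cov` with its default divisor matches the formulas above.

```python
import numpy as np

def delong_components(X, Y):
    """X: (K, m) array, X[k, i] = classifier k's score for class-1 observation i.
    Y: (K, n) array, Y[k, j] = classifier k's score for class-2 observation j.
    Returns (theta_hat, S10, S01)."""
    X = np.atleast_2d(np.asarray(X, float))
    Y = np.atleast_2d(np.asarray(Y, float))
    # psi[k, i, j] = psi(X_i^k, Y_j^k)
    psi = (Y[:, None, :] < X[:, :, None]) + 0.5 * (Y[:, None, :] == X[:, :, None])
    V10 = psi.mean(axis=2)            # (K, m): average over j
    V01 = psi.mean(axis=1)            # (K, n): average over i
    theta_hat = V10.mean(axis=1)      # (K,): empirical AUCs
    # np.cov treats rows as variables, centers each row at its mean
    # (which is theta_hat[k]) and divides by (number of columns - 1).
    S10 = np.atleast_2d(np.cov(V10))
    S01 = np.atleast_2d(np.cov(V01))
    return theta_hat, S10, S01

# Hypothetical scores from K = 2 classifiers on the same m = 4, n = 5 observations.
X = [[0.9, 0.8, 0.6, 0.4],
     [0.6, 0.7, 0.5, 0.3]]
Y = [[0.7, 0.5, 0.3, 0.2, 0.1],
     [0.8, 0.4, 0.45, 0.2, 0.1]]
theta_hat, S10, S01 = delong_components(X, Y)
print(theta_hat)  # empirical AUCs, approximately [0.85, 0.70]
```

Building the full $K \times m \times n$ array of $\psi$ values is simple but uses memory proportional to $Kmn$; for large samples a rank-based computation would be preferable.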

From this asymptotic distribution we can construct confidence intervals and hypothesis tests for the contrast $\mathbf{L}^T \boldsymbol{\theta}$.

Example: Comparing two AUCs

If we just want to compare two AUCs (to test if they are equal), we can set $K = 2$ and $\mathbf{L} = \begin{pmatrix} 1 & -1 \end{pmatrix}^T$ in the above. The null hypothesis is \begin{aligned} H_0: \theta^1 = \theta^2, \qquad \text{i.e.} \quad \mathbf{L}^T \boldsymbol{\theta} = 0. \end{aligned}

Under the null hypothesis, the asymptotic distribution of the quantity \begin{aligned} \dfrac{\mathbf{L}^T\boldsymbol{\hat{\theta}} - \mathbf{L}^T \boldsymbol{\theta} }{\sqrt{\mathbf{L}^T \left( \frac{1}{m}\mathbf{S}_{10} + \frac{1}{n}\mathbf{S}_{01} \right) \mathbf{L} }} = \dfrac{\hat{\theta}^1 - \hat{\theta}^2}{\sqrt{\mathbf{L}^T \left( \frac{1}{m}\mathbf{S}_{10} + \frac{1}{n}\mathbf{S}_{01} \right) \mathbf{L} }} \end{aligned}

is the standard normal distribution. If the quantity is too large (in absolute value for a two-sided test), then we can reject the null hypothesis and conclude that there is a statistically significant difference between the two AUCs.
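The two-AUC comparison can be sketched end to end as follows. This is an illustrative implementation assuming NumPy, not a reference one: the function name `delong_two_auc_test` and the scores are hypothetical, and the code assumes both classifiers are scored on the same observations in the same order. The two-sided p-value uses the identity $2(1 - \Phi(|z|)) = \text{erfc}(|z| / \sqrt{2})$.

```python
import math
import numpy as np

def delong_two_auc_test(x1, y1, x2, y2):
    """x1, y1: classifier 1's scores for the class-1 and class-2 observations.
    x2, y2: classifier 2's scores for the same observations, in the same order.
    Returns (theta_hat, z, two-sided p-value)."""
    X = np.array([x1, x2], float)     # (2, m)
    Y = np.array([y1, y2], float)     # (2, n)
    m, n = X.shape[1], Y.shape[1]
    # psi[k, i, j] = psi(X_i^k, Y_j^k)
    psi = (Y[:, None, :] < X[:, :, None]) + 0.5 * (Y[:, None, :] == X[:, :, None])
    V10 = psi.mean(axis=2)            # (2, m)
    V01 = psi.mean(axis=1)            # (2, n)
    theta_hat = V10.mean(axis=1)      # the two empirical AUCs
    S = np.cov(V10) / m + np.cov(V01) / n
    L = np.array([1.0, -1.0])
    var = float(L @ S @ L)            # estimated variance of the AUC difference
    # Note: var can be 0 in degenerate cases (e.g. identical classifiers),
    # in which case z is undefined; this sketch does not handle that.
    z = float(L @ theta_hat) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return theta_hat, z, p_value

# Hypothetical scores from two classifiers on the same m = 4, n = 5 observations.
x1, y1 = [0.9, 0.8, 0.6, 0.4], [0.7, 0.5, 0.3, 0.2, 0.1]
x2, y2 = [0.6, 0.7, 0.5, 0.3], [0.8, 0.4, 0.45, 0.2, 0.1]
theta_hat, z, p_value = delong_two_auc_test(x1, y1, x2, y2)
print(theta_hat, z, p_value)
```

With samples this small the normal approximation is not trustworthy; the example only illustrates the mechanics of the computation.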

References:

1. DeLong, E. R., DeLong, D. M., and Clarke-Pearson, D. L. (1988). Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach. Biometrics, 44(3), 837–845.