What is the DeLong test for comparing AUCs?


Let’s say that in our sample we have m individuals who truly belong to class 1 (call this group C_1) and n individuals who truly belong to class 2 (call this group C_2). For each of these individuals, the binary classifier gives a probability of the individual belonging to class 1: denote the probabilities for C_1 by X_1, \dots, X_m and the probabilities for C_2 by Y_1, \dots, Y_n.

Define sensitivity (a.k.a. true positive rate) and specificity (a.k.a. true negative rate) as functions

\begin{aligned} \text{sens}(z) = \frac{1}{m} \sum_{i=1}^m 1\{ X_i \geq z\}, \qquad \text{spec}(z) = \frac{1}{n}\sum_{j=1}^n 1 \{ Y_j < z \}, \end{aligned}

where z \in [0, 1]. The receiver operating characteristic (ROC) curve is a common way to summarize the quality of a binary classifier: it simply plots sensitivity vs. 1 – specificity. An ROC curve will always pass through the points (0, 0) and (1, 1) and should be above the diagonal y = x. The closer the curve gets to the point (0, 1), the better the binary classifier.
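The definitions above translate directly into code. Here is a minimal NumPy sketch; the score arrays `x` and `y` are invented for illustration, not taken from the text:

```python
import numpy as np

def sens(x_scores, z):
    # Fraction of class-1 scores at or above the threshold z (true positive rate).
    return np.mean(x_scores >= z)

def spec(y_scores, z):
    # Fraction of class-2 scores strictly below the threshold z (true negative rate).
    return np.mean(y_scores < z)

# Hypothetical classifier scores: x for C_1, y for C_2.
x = np.array([0.9, 0.8, 0.6, 0.55])
y = np.array([0.7, 0.4, 0.3, 0.2])
```

Sweeping z over [0, 1] and plotting sens(z) against 1 − spec(z) traces out the empirical ROC curve.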

Area under the curve (AUC) summarizes the entire ROC curve in a single number: it is simply the area under the ROC curve. An uninformative classifier has an AUC of 0.5; the larger the AUC, the better the classifier is thought to be. (One should still look at the full ROC curve as well to understand the classifier’s tradeoff between sensitivity and specificity.)

(Note: Sometimes the ROC curve is drawn with specificity on the x-axis rather than 1-specificity. This has the effect of flipping the ROC curve along the vertical line x = 0.5 and does not change the AUC.)

A key insight is that the area under an empirical ROC curve, when calculated by the trapezoidal rule, is equal to the Mann-Whitney two-sample statistic applied to C_1 and C_2. We have a formula for computing this statistic:

\begin{aligned} \hat{\theta} = \frac{1}{mn}\sum_{i = 1}^m \sum_{j = 1}^n \psi(X_i, Y_j), \quad \text{where}\quad  \psi(X, Y) = \begin{cases} 1 &Y < X, \\ 1/2 &Y = X, \\ 0 &Y > X. \end{cases} \end{aligned}

It is an unbiased estimate of \theta, the probability that a randomly selected observation from the population represented by C_2 will have a score less than or equal to that for a randomly selected observation from the population represented by C_1.
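A direct NumPy implementation of \hat{\theta} might look like the following (the example scores are invented for illustration):

```python
import numpy as np

def psi(x, y):
    # Kernel from the formula: 1 if Y < X, 1/2 on ties, 0 if Y > X.
    return np.where(y < x, 1.0, np.where(y == x, 0.5, 0.0))

def auc_hat(x_scores, y_scores):
    # Average the kernel over all m * n pairs (X_i, Y_j) via broadcasting.
    return psi(x_scores[:, None], y_scores[None, :]).mean()

x = np.array([0.9, 0.8, 0.6, 0.55])  # class-1 scores
y = np.array([0.7, 0.4, 0.3, 0.2])   # class-2 scores
```

For these scores, 14 of the 16 pairs satisfy Y_j < X_i and there are no ties, so \hat{\theta} = 14/16 = 0.875, matching the trapezoidal area under the empirical ROC curve.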

Deriving an asymptotic distribution for AUCs

Imagine we are in a situation where instead of just one binary classifier we have K of them. For observation i in C_1, let X_i^k denote classifier k’s estimated probability that it belongs to class 1. Define Y_j^k similarly for observations in C_2. The kth empirical AUC is defined by

\begin{aligned} \hat{\theta}^k = \frac{1}{mn}\sum_{i = 1}^m \sum_{j = 1}^n \psi(X_i^k, Y_j^k). \end{aligned}

Let \boldsymbol{\hat{\theta}} = \begin{pmatrix} \hat{\theta}^1 & \dots & \hat{\theta}^K \end{pmatrix}^T \in \mathbb{R}^K be the vector of the K empirical AUCs, and let \boldsymbol{\theta} = \begin{pmatrix} \theta^1 & \dots & \theta^K \end{pmatrix}^T be the vector of true AUCs. In order to do inference for the empirical AUCs we need to determine their joint distribution.

DeLong et al. (1988) note that the Mann-Whitney statistic is a generalized U-statistic, and hence the asymptotic theory developed for U-statistics applies. Let \boldsymbol{L} \in \mathbb{R}^K be some fixed vector of coefficients. Then asymptotically

\begin{aligned} \dfrac{\mathbf{L}^T\boldsymbol{\hat{\theta}} - \mathbf{L}^T \boldsymbol{\theta} }{\sqrt{\mathbf{L}^T \left( \frac{1}{m}\mathbf{S}_{10} + \frac{1}{n}\mathbf{S}_{01} \right) \mathbf{L} }} \end{aligned}

has the standard normal distribution \mathcal{N}(0, 1). \mathbf{S}_{10} and \mathbf{S}_{01} are K \times K matrices, with the (r, s)th element defined as follows:

\begin{aligned} \left( \mathbf{S}_{10} \right)_{rs} &= \frac{1}{m-1} \sum_{i=1}^m \left[V_{10}^r (X_i) - \hat{\theta}^r \right]\left[V_{10}^s (X_i) - \hat{\theta}^s \right], \text{ and} \\  \left( \mathbf{S}_{01} \right)_{rs} &= \frac{1}{n-1} \sum_{j=1}^n \left[V_{01}^r (Y_j) - \hat{\theta}^r \right]\left[V_{01}^s (Y_j) - \hat{\theta}^s \right], \text{ where} \\  V_{10}^r (X_i) &= \frac{1}{n}\sum_{j=1}^n \psi (X_i^r, Y_j^r), \text{ and} \\  V_{01}^r (Y_j) &= \frac{1}{m}\sum_{i=1}^m \psi (X_i^r, Y_j^r). \end{aligned}

From this asymptotic distribution we can construct confidence intervals and hypothesis tests for the contrast \mathbf{L}^T \boldsymbol{\theta}.
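The structural components V_{10}, V_{01} and the covariance matrices \mathbf{S}_{10}, \mathbf{S}_{01} can be computed with a few array operations. A sketch under the notation above, where `X` has shape (K, m) and `Y` has shape (K, n) (both hypothetical inputs):

```python
import numpy as np

def psi(x, y):
    # Kernel: 1 if Y < X, 1/2 on ties, 0 if Y > X.
    return np.where(y < x, 1.0, np.where(y == x, 0.5, 0.0))

def delong_components(X, Y):
    # X: (K, m) class-1 scores; Y: (K, n) class-2 scores.
    P = psi(X[:, :, None], Y[:, None, :])  # (K, m, n) kernel evaluations
    V10 = P.mean(axis=2)                   # V_10^r(X_i): average over j
    V01 = P.mean(axis=1)                   # V_01^r(Y_j): average over i
    theta = P.mean(axis=(1, 2))            # the K empirical AUCs
    # np.cov treats rows as variables and divides by (m - 1) by default,
    # and the row means of V10 / V01 equal theta, so this matches the
    # definitions of S_10 and S_01 exactly.
    return theta, np.cov(V10), np.cov(V01)
```

Note that the empirical AUC \hat{\theta}^r is both the grand mean of the kernel evaluations and the mean of either set of structural components, which is why `np.cov` centers at the right value.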

Example: Comparing two AUCs

If we just want to compare two AUCs (to test if they are equal), we can set \mathbf{L} = \begin{pmatrix} 1 & -1 \end{pmatrix}^T in the above. The null hypothesis is

\begin{aligned} H_0: \theta_1 = \theta_2, \qquad i.e. \: \mathbf{L}^T \boldsymbol{\theta} = 0. \end{aligned}

Under the null hypothesis, the asymptotic distribution of the quantity

\begin{aligned} \dfrac{\mathbf{L}^T\boldsymbol{\hat{\theta}} - \mathbf{L}^T \boldsymbol{\theta} }{\sqrt{\mathbf{L}^T \left( \frac{1}{m}\mathbf{S}_{10} + \frac{1}{n}\mathbf{S}_{01} \right) \mathbf{L} }} = \dfrac{\hat{\theta}_1 - \hat{\theta}_2}{\sqrt{\mathbf{L}^T \left( \frac{1}{m}\mathbf{S}_{10} + \frac{1}{n}\mathbf{S}_{01} \right) \mathbf{L} }} \end{aligned}

is the standard normal distribution \mathcal{N}(0, 1). If this quantity is too large (in absolute value for a two-sided test), we can reject the null hypothesis and conclude that there is a statistically significant difference between the two AUCs.
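Putting the pieces together, here is a sketch of the two-classifier DeLong test in NumPy/SciPy (the score arrays in the usage example are invented for illustration):

```python
import numpy as np
from scipy.stats import norm

def psi(x, y):
    # Kernel: 1 if Y < X, 1/2 on ties, 0 if Y > X.
    return np.where(y < x, 1.0, np.where(y == x, 0.5, 0.0))

def delong_test(X, Y):
    # X: (2, m) class-1 scores for the two classifiers; Y: (2, n) class-2 scores.
    m, n = X.shape[1], Y.shape[1]
    P = psi(X[:, :, None], Y[:, None, :])          # (2, m, n)
    theta = P.mean(axis=(1, 2))                    # the two empirical AUCs
    S10, S01 = np.cov(P.mean(axis=2)), np.cov(P.mean(axis=1))
    L = np.array([1.0, -1.0])                      # contrast theta_1 - theta_2
    var = L @ (S10 / m + S01 / n) @ L
    z = (theta[0] - theta[1]) / np.sqrt(var)       # asymptotically N(0, 1) under H_0
    p = 2 * norm.sf(abs(z))                        # two-sided p-value
    return theta, z, p
```

In practice one would use an established implementation (e.g. the `roc.test` function in the R package pROC implements this method), but the computation really is this short.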


  1. DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics, 44(3), 837–845.
