# What is the DeLong test for comparing AUCs?

## Set-up

Let’s say that in our sample we have $m$ individuals who truly belong to class 1 (call this group $C_1$) and $n$ individuals who truly belong to class 2 (call this group $C_2$). For each of these individuals, the binary classifier gives a probability of the individual belonging to class 1: denote the probabilities for $C_1$ by $X_1, \dots, X_m$ and the probabilities for $C_2$ by $Y_1, \dots, Y_n$.

Define sensitivity (a.k.a. true positive rate) and specificity (a.k.a. true negative rate) as functions

\begin{aligned} \text{sens}(z) = \frac{1}{m} \sum_{i=1}^m 1\{ X_i \geq z\}, \qquad \text{spec}(z) = \frac{1}{n}\sum_{j=1}^n 1 \{ Y_j < z \}, \end{aligned}

where $z \in [0, 1]$. The receiver operating characteristic (ROC) curve is a common way to summarize the quality of a binary classifier: it simply plots sensitivity vs. 1 – specificity. An ROC curve will always pass through the points $(0, 0)$ and $(1, 1)$ and should be above the diagonal $y = x$. The closer the curve gets to the point $(0, 1)$, the better the binary classifier.
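To make these definitions concrete, here is a minimal sketch in Python (with NumPy) that computes the empirical sensitivity and specificity at a given threshold. The simulated scores and the helper name `sens_spec` are made up for illustration:

```python
import numpy as np

def sens_spec(x_scores, y_scores, z):
    """Empirical sensitivity and specificity at threshold z.

    x_scores: scores for the m individuals truly in class 1 (C1).
    y_scores: scores for the n individuals truly in class 2 (C2).
    """
    sens = np.mean(x_scores >= z)  # fraction of C1 called class 1 at threshold z
    spec = np.mean(y_scores < z)   # fraction of C2 called class 2 at threshold z
    return sens, spec

# Simulated scores: C1 scores skew toward 1, C2 scores skew toward 0.
rng = np.random.default_rng(0)
x = rng.beta(4, 2, size=50)
y = rng.beta(2, 4, size=80)
print(sens_spec(x, y, 0.5))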

Area under the curve (AUC) is a way to summarize the entire ROC curve in a single number: it is simply the area under the ROC curve. An uninformative classifier will have an AUC of 0.5; the larger the AUC, the better the classifier is thought to be. (One should really look at the full ROC curve as well to understand the classifier’s tradeoff between sensitivity and specificity.)
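Continuing the sketch above, the following traces the empirical ROC curve by sweeping the threshold over all observed scores and integrates it with the trapezoidal rule; `empirical_roc` and `trapezoid_auc` are hypothetical names:

```python
import numpy as np

def empirical_roc(x_scores, y_scores):
    """Points (1 - specificity, sensitivity) of the empirical ROC curve,
    swept from threshold +inf (the point (0, 0)) down to -inf (the point (1, 1))."""
    zs = np.concatenate(([np.inf],
                         np.unique(np.concatenate([x_scores, y_scores]))[::-1],
                         [-np.inf]))
    sens = np.array([np.mean(x_scores >= z) for z in zs])
    fpr = np.array([np.mean(y_scores >= z) for z in zs])  # 1 - specificity
    return fpr, sens

def trapezoid_auc(fpr, sens):
    """Area under the ROC curve by the trapezoidal rule."""
    return np.sum(np.diff(fpr) * (sens[1:] + sens[:-1]) / 2)

fpr, sens = empirical_roc(x, y)  # x, y as simulated above
print(trapezoid_auc(fpr, sens))  # well above 0.5 for this simulated classifier
```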

(Note: Sometimes the ROC curve is drawn with specificity on the x-axis rather than 1 – specificity. This reflects the ROC curve about the vertical line $x = 0.5$ and does not change the AUC.)

A key insight is that the area under an empirical ROC curve, when calculated by the trapezoidal rule, is equal to the Mann-Whitney two-sample statistic applied to $C_1$ and $C_2$. We have a formula for computing this statistic:

\begin{aligned} \hat{\theta} = \frac{1}{mn}\sum_{i = 1}^m \sum_{j = 1}^n \psi(X_i, Y_j), \quad \text{where}\quad \psi(X, Y) = \begin{cases} 1 &Y < X, \\ 1/2 &Y = X, \\ 0 &Y > X. \end{cases} \end{aligned}

It is an unbiased estimator of $\theta = P(Y < X) + \frac{1}{2} P(Y = X)$: the probability that a randomly selected observation from the population represented by $C_2$ has a lower score than a randomly selected observation from the population represented by $C_1$, with ties counted as one half.
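As a quick check of this equivalence, here is a sketch that computes $\hat{\theta}$ by the double sum above and compares it with scikit-learn’s `roc_auc_score`, which integrates the empirical ROC curve; the helper name `mann_whitney_auc` is again made up:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mann_whitney_auc(x_scores, y_scores):
    """theta-hat: the average of psi(X_i, Y_j) over all m*n pairs."""
    diff = x_scores[:, None] - y_scores[None, :]            # m x n matrix of X_i - Y_j
    psi = np.where(diff > 0, 1.0, np.where(diff == 0, 0.5, 0.0))
    return psi.mean()

theta_hat = mann_whitney_auc(x, y)                          # x, y as simulated above
labels = np.concatenate([np.ones(len(x)), np.zeros(len(y))])
scores = np.concatenate([x, y])
print(theta_hat, roc_auc_score(labels, scores))             # the two values agree
```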

## Deriving an asymptotic distribution for AUCs

Imagine we are in a situation where instead of just one binary classifier we have $K$ of them. For observation $i$ in $C_1$, let $X_i^k$ denote classifier $k$’s estimated probability that the observation belongs to class 1. Define $Y_j^k$ similarly for observations in $C_2$. The $k$th empirical AUC is defined by

\begin{aligned} \hat{\theta}^k = \frac{1}{mn}\sum_{i = 1}^m \sum_{j = 1}^n \psi(X_i^k, Y_j^k). \end{aligned}

Let $\boldsymbol{\hat{\theta}} = \begin{pmatrix} \hat{\theta}^1 & \dots & \hat{\theta}^K \end{pmatrix}^T \in \mathbb{R}^K$ be the vector of the $K$ empirical AUCs, and let $\boldsymbol{\theta} = \begin{pmatrix} \theta^1 & \dots & \theta^K \end{pmatrix}^T$ be the vector of true AUCs. In order to do inference on the empirical AUCs we need to determine the joint distribution of $\boldsymbol{\hat{\theta}}$.

DeLong et al. (1988) note that the Mann-Whitney statistic is a generalized U-statistic, and hence the asymptotic theory developed for U-statistics applies. Let $\mathbf{L} \in \mathbb{R}^K$ be some fixed vector of coefficients. Then asymptotically

\begin{aligned} \dfrac{\mathbf{L}^T\boldsymbol{\hat{\theta}} - \mathbf{L}^T \boldsymbol{\theta} }{\sqrt{\mathbf{L}^T \left( \frac{1}{m}\mathbf{S}_{10} + \frac{1}{n}\mathbf{S}_{01} \right) \mathbf{L} }} \end{aligned}

has the standard normal distribution $\mathcal{N}(0, 1)$. $\mathbf{S}_{10}$ and $\mathbf{S}_{01}$ are $K \times K$ matrices, with the $(r, s)$th element defined as follows:

\begin{aligned} \left( \mathbf{S}_{10} \right)_{rs} &= \frac{1}{m-1} \sum_{i=1}^m \left[V_{10}^r (X_i) - \hat{\theta}^r \right]\left[V_{10}^s (X_i) - \hat{\theta}^s \right], \text{ and} \\ \left( \mathbf{S}_{01} \right)_{rs} &= \frac{1}{n-1} \sum_{j=1}^n \left[V_{01}^r (Y_j) - \hat{\theta}^r \right]\left[V_{01}^s (Y_j) - \hat{\theta}^s \right], \text{ where} \\ V_{10}^r (X_i) &= \frac{1}{n}\sum_{j=1}^n \psi (X_i^r, Y_j^r), \text{ and} \\ V_{01}^r (Y_j) &= \frac{1}{m}\sum_{i=1}^m \psi (X_i^r, Y_j^r). \end{aligned}

From this asymptotic distribution we can construct confidence intervals and hypothesis tests for the contrast $\mathbf{L}^T \boldsymbol{\theta}$.
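Here is one way the estimated covariance matrices might be computed in Python, using the structural components $V_{10}$ and $V_{01}$ above. This is a sketch under the assumption that all $K$ classifiers are scored on the same individuals; the function names are illustrative:

```python
import numpy as np

def structural_components(x_scores, y_scores):
    """V10 (one value per member of C1) and V01 (one per member of C2)
    for a single classifier."""
    diff = x_scores[:, None] - y_scores[None, :]
    psi = np.where(diff > 0, 1.0, np.where(diff == 0, 0.5, 0.0))
    return psi.mean(axis=1), psi.mean(axis=0)  # V10 (length m), V01 (length n)

def delong_covariances(x_list, y_list):
    """S10 and S01 for K classifiers; x_list and y_list hold one score
    vector per classifier for C1 and C2 respectively."""
    V10 = np.stack([structural_components(xk, yk)[0] for xk, yk in zip(x_list, y_list)])
    V01 = np.stack([structural_components(xk, yk)[1] for xk, yk in zip(x_list, y_list)])
    # np.cov centers each row at its mean, which is exactly theta-hat^k,
    # and normalizes by (m - 1) and (n - 1) as in the formulas above.
    return np.cov(V10), np.cov(V01)
```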

## Example: Comparing two AUCs

If we just want to compare two AUCs (to test if they are equal), we can set $\mathbf{L} = \begin{pmatrix} 1 & -1 \end{pmatrix}^T$ in the above. The null hypothesis is

\begin{aligned} H_0: \theta^1 = \theta^2, \qquad \text{i.e.} \quad \mathbf{L}^T \boldsymbol{\theta} = 0. \end{aligned}

Under the null hypothesis, the quantity

\begin{aligned} \dfrac{\mathbf{L}^T\boldsymbol{\hat{\theta}} - \mathbf{L}^T \boldsymbol{\theta} }{\sqrt{\mathbf{L}^T \left( \frac{1}{m}\mathbf{S}_{10} + \frac{1}{n}\mathbf{S}_{01} \right) \mathbf{L} }} = \dfrac{\hat{\theta}^1 - \hat{\theta}^2}{\sqrt{\mathbf{L}^T \left( \frac{1}{m}\mathbf{S}_{10} + \frac{1}{n}\mathbf{S}_{01} \right) \mathbf{L} }} \end{aligned}

is asymptotically standard normal. If this quantity is too large (in absolute value for a two-sided test), we can reject the null hypothesis and conclude that there is a statistically significant difference between the two AUCs.
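Putting the pieces together, here is a sketch of the two-sided test, reusing `mann_whitney_auc` and `delong_covariances` from the sketches above; `delong_test` is a hypothetical name:

```python
import numpy as np
from scipy.stats import norm

def delong_test(x1, y1, x2, y2):
    """Two-sided DeLong test of H0: theta^1 = theta^2.

    x1, y1: classifier 1's scores on C1 and C2; x2, y2: same for classifier 2.
    Both classifiers must be scored on the same individuals."""
    m, n = len(x1), len(y1)
    theta = np.array([mann_whitney_auc(x1, y1), mann_whitney_auc(x2, y2)])
    S10, S01 = delong_covariances([x1, x2], [y1, y2])
    L = np.array([1.0, -1.0])
    z = (theta[0] - theta[1]) / np.sqrt(L @ (S10 / m + S01 / n) @ L)
    p_value = 2 * norm.sf(abs(z))  # two-sided p-value
    return z, p_value

# Example: compare the simulated classifier with a noisier version of itself.
x2 = np.clip(x + rng.normal(0, 0.2, size=len(x)), 0, 1)
y2 = np.clip(y + rng.normal(0, 0.2, size=len(y)), 0, 1)
print(delong_test(x, y, x2, y2))
```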

## References

1. DeLong, E. R., DeLong, D. M., and Clarke-Pearson, D. L. (1988). Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach. *Biometrics*, 44(3), 837-845.