# What is calibration?

What is calibration?

In the context of binary classification, calibration refers to the process of transforming the output scores of a binary classifier into class probabilities. If we think of the classifier as a “black box” that transforms input data into a score, we can think of calibration as a post-processing step that converts the score into the probability of the observation belonging to class 1.

The scores from some classifiers can already be interpreted as probabilities (e.g. logistic regression), while the scores from some classifiers require an additional calibration step before they can be interpreted as such (e.g. support vector machines).

(Note: The idea of calibration can be extended naturally to multi-class classification; for simplicity I do not talk about it here.)

What does it mean for a classifier to be well-calibrated?

A classifier is said to be well-calibrated if the estimated probabilities it outputs are accurate. In other words, for any $p \in [0, 1]$, if I consider all the observations which the classifier assigns a probability $p$ of being in class 1, the long-run proportion of those which are truly in class 1 is $p$.
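In symbols: writing $\hat{p}(X)$ for the classifier's estimated probability that observation $X$ belongs to class 1, and $Y$ for its true label, perfect calibration means

$$\mathbb{P}\big(Y = 1 \,\big|\, \hat{p}(X) = p\big) = p \quad \text{for all } p \in [0, 1].$$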

Note that it is possible for a classifier to have high sensitivity/specificity/AUC while being poorly calibrated. This is because these metrics only quantify how well the classifier ranks observations relative to each other by their probability of being in class 1. For example, compare classifier A, which always outputs the true probability, with classifier B, which always outputs the true probability divided by 2. Since dividing by 2 does not change the ranking of the observations, both classifiers have the same sensitivity/specificity/AUC (at corresponding thresholds), but classifier A is perfectly calibrated while classifier B is not (its estimated probabilities of being in class 1 are too pessimistic).
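To make this concrete, here is a quick sketch in base R (the `auc` helper and the simulated data are my own, not from the references) that computes AUC via the Mann-Whitney statistic and checks that halving the scores leaves it unchanged:

```r
# AUC via the Mann-Whitney U statistic: the probability that a randomly
# chosen class-1 observation is scored higher than a randomly chosen
# class-0 observation.
auc <- function(y, scores) {
  r <- rank(scores)
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1)
p_true <- runif(1000)          # classifier A: outputs the true probabilities
y <- rbinom(1000, 1, p_true)   # labels drawn from those probabilities

auc(y, p_true)      # classifier A
auc(y, p_true / 2)  # classifier B: identical AUC, since the ranking is unchanged
```

Halving every score is a strictly monotone transformation, so it cannot change any ranking-based metric, yet classifier B's "probabilities" are badly miscalibrated.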

(Aside: Nate Silver's The Signal and the Noise has an excellent chapter on calibration in weather forecasting.)

How can I find out if my model is well-calibrated?

To assess how well-calibrated a classifier is, we can plot a calibration curve: a plot of actual probability vs. estimated probability for each observation. If the model is perfectly calibrated, then the points should line up on the $y = x$ line. The difficulty here is that we don’t get to see actual probabilities: we only get to see 0s and 1s (“did this observation fall in class 1 or not?”). In practice, this is what we do:

1. Sort the observations by the classifier's estimated probabilities.
2. Bin the observations into equally sized bins (it is common to pick 10 bins).
3. For each bin, plot the actual proportion of observations in class 1 against the mean estimated probability for the observations in the bin.

With this procedure, we will end up with a plot that looks something like this (taken from Reference 1):

Note that there is a tradeoff when selecting the number of bins: with too few bins we won't have enough points on the curve, while with too many bins each bin will contain too few observations, making the curve noisier.

In theory, you should be able to construct a calibration curve with just the predictions, the actual class memberships, and the number of bins. However, all the functions I've found in R that plot calibration curves produce more sophisticated output and require significantly more complex inputs… Does anyone know of a routine that plots calibration curves with these bare-bones inputs? Below is my homebrew version (I assume that the tidyverse package is loaded):

```r
GetCalibrationCurve <- function(y, y_pred, bins = 10) {
  data.frame(y = y, y_pred = y_pred) %>%
    arrange(y_pred) %>%
    mutate(pos = row_number() / n(),
           bin = ceiling(pos * bins)) %>%
    group_by(bin) %>%
    summarize(estimated_prob = mean(y_pred),
              actual_prob = mean(y))
}
```

The function returns a dataframe with one row for each bin, giving the estimated and actual probabilities for the observations in that bin. Here is an example of how this function can be used to make a calibration curve:

```r
# generate data
set.seed(1)
x <- matrix(rnorm(100 * 10), nrow = 100)
eta <- x[, 1] + x[, 2]^2 - x[, 3]^4
mu <- exp(eta) / (1 + exp(eta))
y <- sapply(mu, function(p) rbinom(1, size = 1, prob = p))
df <- data.frame(x, y)

# fit logistic regression model
fit <- glm(y ~ ., data = df, family = binomial())
y_pred <- predict(fit, df, type = "response")

# plot calibration curve
df <- GetCalibrationCurve(y, y_pred, bins = 10)
ggplot(df, aes(estimated_prob, actual_prob)) +
  geom_point() +
  geom_line() +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
  theme_bw() +
  labs(title = "Calibration curve", x = "Estimated probability",
       y = "Actual probability")
```

This model does not seem to be well-calibrated.
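If you want a single number summarizing the curve, one option is the expected calibration error: the average absolute gap between estimated and actual probabilities across bins. This is my own rough sketch building on `GetCalibrationCurve` above (the name `ece` is just a hypothetical helper, not a standard R function):

```r
# Expected calibration error (ECE): the average absolute gap between the
# estimated and actual probabilities over the bins of the calibration curve.
# Since GetCalibrationCurve uses equal-sized bins, a plain mean suffices.
ece <- function(y, y_pred, bins = 10) {
  curve <- GetCalibrationCurve(y, y_pred, bins)
  mean(abs(curve$estimated_prob - curve$actual_prob))
}
```

For the model above, `ece(y, y_pred)` returns a value in $[0, 1]$, with values near 0 indicating better calibration.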

(See Reference 1 for code for plotting calibration curves in Python.)

How can I calibrate my model?

The Wikipedia page for calibration lists a number of methods for calibration. Based on my googling it looks like Platt scaling and isotonic regression are the more commonly used methods (I might write a post on them in the future). Reference 1 gives Python code for running these two methods. R has several different functions that perform calibration but none of them seem very easy to use. Reference 2 has some R code for both Platt scaling and isotonic regression.
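For a sense of what these two methods involve, here is my own rough sketch in base R (not the code from References 1 or 2): Platt scaling fits a logistic regression of the labels on the (logit of the) scores, while isotonic regression fits a monotone step function via `isoreg`. Both helpers return a function mapping new scores to calibrated probabilities.

```r
# Platt scaling: logistic regression of the labels on the logit of the scores.
# (Assumes scores lie strictly in (0, 1); raw SVM margins could be used
# directly without the qlogis transform.)
platt_calibrate <- function(y, scores) {
  train <- data.frame(y = y, z = qlogis(scores))
  fit <- glm(y ~ z, data = train, family = binomial())
  function(new_scores) {
    predict(fit, data.frame(z = qlogis(new_scores)), type = "response")
  }
}

# Isotonic regression: a monotone non-decreasing step function fit to the
# labels, ordered by score.
isotonic_calibrate <- function(y, scores) {
  ord <- order(scores)
  fit <- isoreg(scores[ord], y[ord])
  # Interpolate the fitted step function; rule = 2 clamps at the endpoints.
  function(new_scores) approx(fit$x, fit$yf, xout = new_scores, rule = 2)$y
}
```

Platt scaling imposes a sigmoid shape on the calibration map, while isotonic regression only assumes monotonicity, so isotonic regression is more flexible but needs more data to avoid overfitting.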

Update (2020-10-27): I recently came across Reference 4 which is a really nice tutorial on calibration, and as a bonus links to a GitHub repo that has code for both Platt scaling and isotonic regression. I highly recommend it! (For some reason the GitHub link in the article has the correct text but sends you to the wrong URL. You can use the link earlier in this paragraph.)

Can I run a hypothesis test to check if my model is well-calibrated?

The most commonly used hypothesis test for checking model calibration is the Hosmer-Lemeshow goodness-of-fit test. It has its deficiencies (see the discussion section of Reference 3), and several alternatives have been proposed, but none of them seems to have become the new de facto standard.
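As a sketch of what the test computes (my own minimal implementation, written from the standard formula rather than taken from the references; packaged versions exist in R), the statistic compares observed and expected counts of class-1 observations across bins of the predicted probabilities and refers the result to a chi-squared distribution:

```r
# Hosmer-Lemeshow goodness-of-fit statistic with g bins of the predicted
# probabilities. Under the null hypothesis of good calibration it is
# approximately chi-squared with (number of bins) - 2 degrees of freedom.
hosmer_lemeshow <- function(y, y_pred, g = 10) {
  breaks <- unique(quantile(y_pred, probs = seq(0, 1, length.out = g + 1)))
  bin <- cut(y_pred, breaks, include.lowest = TRUE)
  obs <- tapply(y, bin, sum)        # observed class-1 counts per bin
  expd <- tapply(y_pred, bin, sum)  # expected class-1 counts per bin
  n <- tapply(y, bin, length)
  stat <- sum((obs - expd)^2 / (expd * (1 - expd / n)))
  list(statistic = stat,
       p.value = pchisq(stat, df = length(n) - 2, lower.tail = FALSE))
}
```

A small p-value suggests the model is poorly calibrated; for instance, feeding it labels drawn from probabilities $p$ but predictions $p/2$ yields a tiny p-value.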

References:

1. Poulopoulos, D. Classifier calibration.
2. NSS. (2016). Using Platt scaling and isotonic regression to minimize logloss error in R.
3. Dreiseitl, S., and Osl, M. (2012). Testing the calibration of classification models from first principles.
4. Huang, Y., Li, W., Macheret, F., Gabriel, R. A., and Ohno-Machado, L. (2020). A tutorial on calibration measurements and calibration models for clinical prediction models.