Is the EPL getting more unequal?

I recently heard that Manchester City were so far ahead in the English Premier League (EPL) that the race for first was basically over, even though there were still about 6-7 games to go (out of a total of 38 games). At the other end of the table, I heard that Sheffield United were so far behind that they were all but certain to be relegated. This got me wondering: has the EPL become more unequal over time? I guess there are several ways to interpret what “unequal” means, and this post is my attempt at quantifying it.

The data for this post is available here, while all the code is available in one place here. (I also have a previous blog post from 2018 on the histogram of points scored in the EPL over time.)

Let’s start by looking at a density plot of the points scored by the 20 teams and how that has changed over the last 12 years (note that the latest year is at the top of the chart):

library(tidyverse)
library(ggridges)  # for geom_density_ridges()

df <- read_csv("../data/epl_standings.csv")

# joy plot
ggplot(df, aes(x = Points, y = factor(Season))) +
    geom_density_ridges(scale = 3) +
    labs(y = "Season")

It looks like the spread for the more recent seasons is greater than that in earlier seasons, evidenced by the bulk of the distribution widening as we go from the bottom to the top of the chart.

Another way to view the data is to draw a boxplot for each season. In this view, it’s harder to see the spread that we saw in the joy plot.

# boxplot
ggplot(df, aes(x = Season, y = Points, group = Season)) +
    geom_boxplot() +
    labs(title = "Distribution of points by season")

One possible way to define “unequal” is to look at the difference between the number of points the top team earned vs. that for the bottom team: the larger the difference, the more unequal the EPL is becoming. The plot below shows the maximum and minimum number of points earned by a team across the years, as well as the difference between the two. We also show the linear regression lines for each of these values. With the upward slopes, it looks like the gap between the best and the worst is certainly increasing.

# plot of maximum and minimum
df %>% group_by(Season) %>%
    summarize(max = max(Points), min = min(Points)) %>%
    mutate(range = max - min) %>%
    pivot_longer(max:range) %>%
    ggplot(aes(x = Season, y = value, col = name)) +
    geom_line() +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_continuous(breaks = seq(min(df$Season), max(df$Season), by = 2)) +
    labs(title = "Max/min points by season", y = "Points") +
    theme(legend.title = element_blank(), legend.position = "bottom")

The problem with using range is that it only tracks the difference between the best and the worst teams while ignoring all the teams in the middle. A measure that takes the middle into account is variance. Note that the number of points that a team can score lies in the range 0 to 38 \times 3 = 114. For random variables bounded in the interval [0, 114], the smallest possible variance is 0 (all teams scoring the same number of points), while the maximum possible variance is (114 - 0)^2 / 4 = 3249 (half the teams scoring 114 points, half the teams scoring 0 points). Based on the configurations attaining the extremes, it seems reasonable to interpret the scores having a higher variance as the league being more unequal.
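As a quick sanity check of these two extremes (using the population variance, i.e. dividing by n; note that R's var() divides by n - 1 and would give slightly larger values):

```r
# Population variance of the two extreme point configurations on [0, 114].
pop_var <- function(x) mean((x - mean(x))^2)

equal_pts <- rep(50, 20)                  # every team on the same points
split_pts <- c(rep(114, 10), rep(0, 10))  # (hypothetical) half at max, half at min

pop_var(equal_pts)  # 0
pop_var(split_pts)  # (114 - 0)^2 / 4 = 3249
```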

Here is a plot of point variance over time along with the linear regression line:

df %>% group_by(Season) %>%
    summarize(var = var(Points)) %>%
    ggplot(aes(x = Season, y = var)) +
    geom_line() +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_continuous(breaks = seq(min(df$Season), max(df$Season), by = 2)) +
    labs(title = "Variance of point distribution by season", y = "Variance")

There’s quite a lot of fluctuation from year to year, but there seems to be an increasing trend. However, notice that the y-axis only goes from 250 to 450, while our min and max variances were 0 and 3249. Perhaps it’s better to have the y-axis go from 0 to 3249 to have a better sense of scale?

Before doing that, note that it’s actually not possible for half the teams to score maximum points. In the EPL, every team plays every other team exactly twice. Hence, only one team can have maximum points, since it means that everyone else loses to this team. I don’t have a proof for this, but I believe maximum variance happens when team 1 beats everyone, team 2 beats everyone except team 1, team 3 beats everyone except teams 1 and 2, and so on. With this configuration, the variance attained is just 1260.

Here’s the same line plot but with the y-axis going from 0 to 1260. With this scale, the change in variance over time looks pretty flat.

max_var_dist <- seq(0, 38 * 3, by = 6)
max_var <- var(max_var_dist)
df %>% group_by(Season) %>%
    summarize(var = var(Points)) %>%
    ggplot(aes(x = Season, y = var)) +
    geom_line() +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    geom_hline(yintercept = c(0, max_var), col = "red", linetype = "dashed") +
    scale_x_continuous(breaks = seq(min(df$Season), max(df$Season), by = 2)) +
    labs(title = "Variance of point distribution by season", y = "Variance")

Which is the correct interpretation? Is the change in variance over time important enough that we can say the EPL has become more unequal, or is it essentially the same over time? This is where I think domain expertise comes into play. 1260 is a theoretical maximum for the variance, but my guess is that the layperson looking at two tables, one with variance 300 and one with variance 900, would be able to tell them apart and say that the latter is more unequal. Can the layperson really tell the difference between variances of 250 and 450? I would generate several tables having these variances and test if people could tell them apart.

Finally, the Gini coefficient is one other measure of inequality, with 0 being the most equal and 1 being the most unequal.
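For intuition, the Gini coefficient of a set of values is the mean absolute difference over all pairs divided by twice the mean. Here is a base-R sketch of that uncorrected version (DescTools::Gini applies a small-sample correction by default, so its values may differ slightly):

```r
# Gini coefficient from first principles:
# (mean absolute difference over all pairs) / (2 * mean).
gini <- function(x) {
    mean(abs(outer(x, x, "-"))) / (2 * mean(x))
}

gini(rep(50, 20))        # perfect equality: 0
gini(c(rep(0, 19), 50))  # one team with all the points: 0.95
```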

# plot of Gini
df %>% group_by(Season) %>%
    summarize(gini = DescTools::Gini(Points)) %>%
    ggplot(aes(x = Season, y = gini)) +
    geom_line() +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    geom_hline(yintercept = c(0, 1), col = "red", linetype = "dashed") +
    scale_x_continuous(breaks = seq(min(df$Season), max(df$Season), by = 2)) +
    labs(title = "Gini coefficient of point dist by season", 
         y = "Gini coefficient")

Here are the plots with different scales for the y-axis:

As with variance, the different scales give very different interpretations. It will require some research to figure out if a change of Gini coefficient from 0.17 to 0.22 is perceptible to the viewer.

What is a proper scoring rule?

In the realm of forecasting, a scoring rule is a way to measure how good a probabilistic forecast is.

Mathematically, if \Omega is the set of possible outcomes, then a probabilistic forecast is a probability distribution on \Omega (i.e. how likely each possible outcome is). A scoring rule S (or a score) is a function that takes a probability distribution on \Omega (denoted by P) and an outcome \omega \in \Omega as input and returns a real number as output. If P is the probabilistic forecast and \omega is the actual outcome, then S(P, \omega) is interpreted as the forecaster's reward if the score is positively-oriented, or as the forecaster's loss if it is negatively-oriented.

Assume that the true probability distribution is denoted by Q. For each probabilistic forecast P, we can define the expected score as

S(P, Q) = \mathbb{E}_Q [S(P, \omega)] = \int S(P, \omega) dQ(\omega).

A positively oriented scoring rule is said to be proper if for all probability distributions P and Q, we have

S(Q, Q) \geq S(P, Q).

In other words, for a proper score, the forecaster maximizes the expected reward if he/she forecasts the true distribution. A strictly proper score is a score such that equality above is achieved uniquely at P = Q. To maximize a strictly proper score, a forecaster has every incentive to give an “honest” forecast and has no incentive to “hedge”.

Why would you ever use an improper scoring rule?

Since score propriety (i.e. a score being proper) seems like such a basic requirement, why would anyone use an improper scoring rule? It turns out that there are some scores that have nice properties but are not proper.

For example, imagine that we want to compare scores against some baseline forecaster. Given some score S, we can convert it to a skill score through normalization:

\begin{aligned} SS(P, \omega) = \dfrac{S(P_{baseline}, \omega) - S(P, \omega)}{S(P_{baseline}, \omega)} = 1 - \dfrac{S(P, \omega)}{S(P_{baseline}, \omega)}, \end{aligned}

where P_{baseline} is the probabilistic forecast which the baseline makes. If S is negatively-oriented (smaller is better) and if the scores it produces are always non-negative, then the associated skill score is positively-oriented and in the range (-\infty, 1].

Skill scores seem reasonable as they give us a fixed upper bound to aspire to. However, in general skill scores are improper. (See Reference 2 for a proof, and Section 2.3 of Reference 1 for another normalization scheme.)

Two other examples of improper scores that have been used are the naive linear score and mean squared error (MSE) (see Reference 3 for details). Note that mean squared error here is not the usual MSE we use for point forecasts: there is a more general definition for probabilistic forecasts.
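To see why the naive linear score S(P, i) = p_i is improper, note that its expected value under the truth Q is \sum_i q_i p_i, which is maximized not at P = Q but by putting all mass on the most likely outcome. A quick check with a made-up binary truth:

```r
q <- c(0.6, 0.4)                       # (made-up) true distribution
expected_linear <- function(p) sum(q * p)

expected_linear(q)        # honest forecast: 0.6^2 + 0.4^2 = 0.52
expected_linear(c(1, 0))  # point mass on the mode: 0.6, strictly better
```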

Some examples of proper scoring rules

Reference 1 contains a number of examples of proper scoring rules. First, assume that the sample space for the response is categorical (without loss of generality, let it be \{1, 2, \dots, J \}), and let the probabilistic forecast be represented by p_i = P(\omega = i). Here are 4 proper scoring rules for categorical variables (only 1-3 are strictly proper):

  1. The Brier score: S(P, i) = \displaystyle\sum_{j=1}^J (\delta_{ij} - p_j)^2, where \delta_{ij} = 1\{ i = j\}. (I wrote a previous post on the Brier score here.)
  2. The spherical score: For some parameter \alpha > 1, the (generalized) spherical score is defined as S(P, i) = \dfrac{p_i^{\alpha - 1}}{(\sum_{j=1}^J p_j^\alpha)^{(\alpha - 1)/\alpha}}. The traditional spherical score is the special case \alpha = 2.
  3. The logarithmic score: S(P, i) = \log p_i.
  4. The zero-one score: Let M = \text{argmax}_j p_j be the set of modes of the probabilistic forecast. Then the zero-one score is defined as S(P, i) = 1\{ i \in M \}.
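Scores 1-3 are easy to code up, and we can check propriety empirically: with a made-up truth q, the honest forecast should beat a perturbed forecast in expectation. (For the Brier score exactly as written above, smaller is better, so the honest forecast minimizes it.)

```r
# Scores for a categorical forecast p (a probability vector) and outcome i.
brier     <- function(p, i) sum((as.numeric(seq_along(p) == i) - p)^2)
log_score <- function(p, i) log(p[i])
spherical <- function(p, i) p[i] / sqrt(sum(p^2))

q <- c(0.5, 0.3, 0.2)      # (made-up) true distribution
p_alt <- c(0.7, 0.2, 0.1)  # a dishonest forecast

# Expected score of forecast p when outcomes are drawn from q.
expected_score <- function(score, p)
    sum(q * sapply(seq_along(q), function(i) score(p, i)))

expected_score(brier, q) < expected_score(brier, p_alt)          # TRUE
expected_score(log_score, q) > expected_score(log_score, p_alt)  # TRUE
expected_score(spherical, q) > expected_score(spherical, p_alt)  # TRUE
```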

There are similar examples for continuous responses. For simplicity, assume that the probabilistic forecast P has a density p w.r.t. Lebesgue measure. (See Reference 1 for a more mathematically rigorous description.) Define

\begin{aligned} \| p \|_\alpha = \left( \int p(\omega)^\alpha d\omega \right)^{1/\alpha}. \end{aligned}

  1. The quadratic score: S(P, \omega) = 2p(\omega) - \|p\|_2^2.
  2. The pseudospherical score: For \alpha > 1, S(P, \omega) = p(\omega)^{\alpha - 1} / \| p\|_\alpha^{\alpha - 1}.
  3. The logarithmic score: S(P, \omega) = \log p(\omega). This can be viewed as the limiting case of the pseudospherical score as \alpha \rightarrow 1. It is widely used, and is also known as the predictive deviance or ignorance score.

There are options for proper scoring rules if the probabilistic forecast P does not have a density. Every probabilistic forecast will have a cumulative distribution function F, and we can construct scores from that. The continuous ranked probability score (CRPS) is one such score, defined as

\begin{aligned} S(P, \omega) = -\int_{-\infty}^\infty [F(y) - 1\{ y \geq \omega \}]^2 dy. \end{aligned}

The CRPS corresponds to the integral of the Brier scores for the associated binary probability forecasts at all real-valued thresholds.
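A numerical sketch with a standard normal forecast (F = pnorm), truncating the integral to [-10, 10], which is plenty here. As defined, the score is nonpositive, with values closer to 0 being better:

```r
# CRPS for the forecast N(0, 1), following the definition above.
crps <- function(omega) {
    integrand <- function(y) (pnorm(y) - as.numeric(y >= omega))^2
    # Split the integral at omega, where the indicator jumps.
    -(integrate(integrand, -10, omega)$value +
          integrate(integrand, omega, 10)$value)
}

crps(0)  # outcome at the center of the forecast: about -0.234
crps(3)  # outcome in the tail: a much worse (more negative) score
```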


  1. Gneiting, T., and Raftery, A. E. (2007). Strictly proper scoring rules, prediction and estimation.
  2. Murphy, A. H. (1973). Hedging and skill scores for probability forecasts.
  3. Bröcker, J., and Smith, L. A. (2007). Scoring probabilistic forecasts: The importance of being proper.

Estimating pi with Monte Carlo simulation

Happy Pi Day! I don’t encounter \pi very much in my area of statistics, so this post might seem a little forced… In this post, I’m going to show one way to estimate \pi.

The starting point is the integral identity

\begin{aligned} 4 \int_0^1 \sqrt{1 - x^2} dx = \pi. \end{aligned}

There are two ways to see why this identity is true. The first is that the integral is simply computing the area of a quarter-circle with unit radius. The second is by explicitly evaluating the integral:

\begin{aligned} \int_0^1 \sqrt{1 - x^2}dx &= \left[ \frac{1}{2} \left( x \sqrt{1 - x^2} + \arcsin x \right) \right]_0^1 \\  &= \frac{1}{2} \left[ \frac{\pi}{2} - 0 \right] \\  &= \frac{\pi}{4}. \end{aligned}

If X \sim \text{Unif}[0,1], then the integral identity means that

\begin{aligned} \mathbb{E} \left[ 4\sqrt{1 - X^2} \right] = \pi. \end{aligned}

Hence, if we take i.i.d. draws X_1, \dots, X_n from the uniform distribution on [0,1], it is reasonable to expect that

\begin{aligned} \frac{1}{n}\sum_{i=1}^n 4\sqrt{1 - X_i^2} \approx \pi. \end{aligned}

The code below shows how well this estimation procedure does for one run as the sample size goes from 1 to 200:

N <- 200
x <- runif(N)
samples <- 4 * sqrt(1 - x^2)
estimates <- cumsum(samples) / 1:N
plot(1:N, estimates, type = "l",
     xlab = "Sample size", ylab = "Estimate of pi",
     main = "Estimates of pi vs. sample size")
abline(h = pi, col ="red", lty = 2)

The next plot shows the relative error on the y-axis instead (the red dotted line represents 1% relative error):

rel_error <- abs(estimates - pi) / pi * 100
plot(1:N, rel_error, type = "l",
     xlab = "Sample size", ylab = "Relative error (%)",
     main = "Relative error vs. sample size")
abline(h = 0)
abline(h = 1, col = "red", lty = 2)
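How fast should the relative error shrink? A short calculation: with g(X) = 4\sqrt{1 - X^2}, we have \mathbb{E}[g(X)^2] = 16 \mathbb{E}[1 - X^2] = 16(1 - 1/3) = 32/3, so \text{Var}(g(X)) = 32/3 - \pi^2 \approx 0.797, and the standard error of the estimate is \sqrt{0.797 / n}:

```r
# Theoretical variance of a single draw g(X) = 4 * sqrt(1 - X^2):
# E[g^2] = 16 * (1 - 1/3) = 32/3, and E[g] = pi.
g_var <- 32 / 3 - pi^2
g_var                          # about 0.797

# Standard error at n = 200, as a percentage of pi:
sqrt(g_var / 200) / pi * 100   # about 2%; bringing this below 1%
                               # would take roughly n = 800
```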

Log-likelihood of probit regression is globally concave

Assume we are in the supervised learning setting with n observations, where observation i consists of the response y_i and features (x_{i1},\dots, x_{ip}). A generalized linear model (GLM) consists of 3 components:

  1. A random component, or a family of distributions f indexed by \mu_i (usually an exponential family), such that y_i \sim f_{\mu_i},
  2. A systematic component \eta_i = \sum_{j=1}^p \beta_j x_{ij}, and
  3. A link function g such that \eta_i = g(\mu_i).

(See this previous post for more details of the components of a GLM.) The user gets to define the family of distributions f and the link function g, and \beta = (\beta_1, \dots, \beta_p)^T is the parameter to be determined by maximum likelihood estimation.

For one-dimensional exponential families with the canonical link function, it is known that the log-likelihood of the GLM is globally concave in \beta (see, for example, Reference 1). Hence, the MLE \hat\beta can be found using methods such as gradient descent or coordinate descent. When non-canonical links are used, the GLM’s log-likelihood is no longer guaranteed to be concave in \beta. However, in some situations we can still show that the log-likelihood is concave in \beta. In this post, we show that the log-likelihood for probit regression is concave in \beta.

In the probit regression model, \mu_i = \Phi (\eta_i) = \Phi \left( \sum_{j=1}^p \beta_j x_{ij} \right), where \Phi is the cumulative distribution function (CDF) of the standard normal distribution. The responses are binary with y_i \sim \text{Bern}(\mu_i). The likelihood function is

\begin{aligned} L(\beta) &= \prod_{i=1}^n \mu_i^{y_i} (1- \mu_i)^{1 - y_i} \\ &= \prod_{i=1}^n [\Phi(x_i^T \beta)]^{y_i} [1 - \Phi(x_i^T \beta)]^{1 - y_i}, \end{aligned}

and the log-likelihood function is

\begin{aligned} \ell(\beta) = \sum_{i=1}^n y_i \log [\Phi(x_i^T \beta)] + (1-y_i) \log [1 - \Phi(x_i^T \beta)].  \end{aligned}

To show that \ell is concave in \beta, we make two reductions:

  1. Since the sum of concave functions is concave, it is enough to show that \beta \mapsto y \log [\Phi(x^T \beta)] + (1-y) \log [1 - \Phi(x^T \beta)] is concave.
  2. Since composition with an affine function preserves concavity, it is enough to show that f(x) = y \log [\Phi(x)] + (1-y) \log [1 - \Phi(x)] is concave in x. (Here, x \in \mathbb{R}.)

From here, we can show that f is concave by showing that its second derivative is nonpositive: f''(x) \leq 0 for all x. Since y can only take on the values of 0 and 1, we can consider those cases separately.

Let \phi denote the probability density function of the standard normal distribution. Recall that \phi'(x) = -x \phi(x). When y = 1,

\begin{aligned} f'(x) &= \dfrac{\phi(x)}{\Phi(x)}, \\ f''(x)&= \dfrac{\Phi(x) [-x \phi(x)] - \phi(x)^2}{\Phi(x)^2} \\ &= \dfrac{\phi(x)}{\Phi(x)^2}[- x \Phi(x) - \phi(x)] \\ &\leq 0, \end{aligned}

since x \Phi(x) + \phi(x) \geq 0 for all x (see this previous post for a proof).

When y = 0,

\begin{aligned} f'(x) &= \dfrac{-\phi(x)}{1-\Phi(x)}, \\ f''(x) &= \dfrac{[1- \Phi(x)][x \phi(x)] + \phi(x) [-\phi(x)]}{[1 - \Phi(x)]^2} \\ &= -\dfrac{\phi(x)}{[1-\Phi(x)]^2} \left[ -x + x \Phi(x) + \phi(x) \right]. \end{aligned}

To show concavity of f, it remains to show that x\Phi(x) + \phi(x) \geq x for all x. But this is true: see this previous post for a proof.
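Both cases can also be checked numerically by taking second differences of f on a grid (using pnorm's log.p argument for numerical stability in the tails):

```r
# Second differences of f should be negative everywhere if f is strictly concave.
h <- 1e-4
x <- seq(-5, 5, by = 0.1)
f1 <- function(x) pnorm(x, log.p = TRUE)                      # y = 1: log Phi(x)
f0 <- function(x) pnorm(x, lower.tail = FALSE, log.p = TRUE)  # y = 0: log(1 - Phi(x))

second_diff <- function(f, x) (f(x + h) - 2 * f(x) + f(x - h)) / h^2

all(second_diff(f1, x) < 0)  # TRUE
all(second_diff(f0, x) < 0)  # TRUE
```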


  1. Chapter 9: Generalized Linear Models.

A lower bound on x Phi(x) + phi(x)

I recently came across this lower bound that I thought was interesting and not immediately obvious. Let \phi and \Phi be the probability density function (PDF) and the cumulative distribution function (CDF) of the standard normal distribution, i.e.

\begin{aligned} \phi (x) = \dfrac{1}{\sqrt{2\pi}} \exp \left( -\dfrac{x^2}{2}\right), \qquad \Phi(x) = \int_{-\infty}^x \phi (t)dt. \end{aligned}

Then the following inequality holds:

\begin{aligned} x \Phi(x) + \phi(x) > \max(0, x). \end{aligned}

The plot below illustrates this inequality with the function on the LHS and RHS being in red and black respectively:

x <- seq(-30, 30) / 10
plot(x, x * pnorm(x) + dnorm(x), type = 'l', col = "red",
     ylab = "f(x)")
lines(x, pmax(0, x), lty = 2)

The derivation of the inequality below is mine; there may be faster ways to do it. A key formula we will use in the proof is that for the derivative of \phi: \phi' (x) = -x\phi(x).

Let f(x) = x \Phi(x) + \phi(x). First, we show that f(x) > 0 for all x. Since, for all x,

\begin{aligned} f'(x) = x\phi(x) + \Phi(x) - x\phi(x) = \Phi(x) > 0, \end{aligned}

the function f is strictly increasing on the entire real line. Hence, using L’Hôpital’s rule,

\begin{aligned} f(x) &> \lim_{x \rightarrow -\infty} f(x) \\ &= \lim_{x \rightarrow -\infty} \frac{\Phi(x) + \phi(x) / x}{1/x} \\ &= \lim_{x \rightarrow -\infty} \frac{\phi(x) + \frac{x(-x\phi(x)) - \phi(x)}{x^2} }{-1/x^2} \\ &= \lim_{x \rightarrow -\infty} \frac{x^2 \phi(x) - x^2 \phi(x) - \phi(x) }{-1} \\ &= \lim_{x \rightarrow -\infty} \phi(x) \\ &= 0. \end{aligned}

Next, we show that f(x) > x for all x. Note that we only have to prove this for x > 0, since f(x) > 0 \geq x for x \leq 0. Letting g(x) = \Phi(x) + \dfrac{\phi(x)}{x}, we see that proving f(x) > x for x > 0 is equivalent to proving g(x) > 1 for x > 0. Since

\begin{aligned} g'(x) = \phi(x) + \frac{x(-x\phi(x)) - \phi(x)}{x^2} = - \phi(x) / x^2 < 0 \end{aligned}

for all x > 0, the function g is strictly decreasing for x > 0. Thus, for this range of x values,

\begin{aligned} g(x) &> \lim_{x \rightarrow \infty} g(x) \\ &= \lim_{x \rightarrow \infty} \frac{x \Phi(x) + \phi(x)}{x} \\ &= \lim_{x \rightarrow \infty} \frac{x\phi(x) + \Phi(x) - x\phi(x)}{1} \\ &= \lim_{x \rightarrow \infty} \Phi(x) \\ &= 1, \end{aligned}

as required. (We used L’Hôpital’s rule again in the sequence of equalities above.)

What is Gower’s distance?

Let’s say we have two observations x_i = (x_{i1}, \dots, x_{ip}) and x_j = (x_{j1}, \dots, x_{jp}), and we want a measure of how similar (or dissimilar) they are to each other. If each of the entries in these two vectors is quantitative (i.e. takes on real-number values), then we can use distance measures as a dissimilarity measure (e.g. Euclidean distance, Manhattan distance). But what if the p features that make up these vectors aren’t all numeric? What if some of them are categorical or binary?

Gower’s distance, introduced in Gower (1971) (Reference 1), is a general similarity measure that can be used in this setting. For each feature k = 1, \dots, p, we define a score s_{ijk} \in [0,1]. If x_i and x_j are close to each other along feature k, then the score s_{ijk} is close to 1. Conversely, if they are far apart along feature k the score s_{ijk} is close to 0.

How the score s_{ijk} is computed depends on the type of feature k (to be described shortly). A quantity \delta_{ijk} is also computed: if x_i and x_j can be compared along feature k, then \delta_{ijk} = 1. If x_i and x_j cannot be compared along feature k (because, for example, of missing values), \delta_{ijk} is set to zero. Gower’s similarity is then just the average of the scores over the comparable features:

\begin{aligned} S_{ij} = \dfrac{\sum_{k=1}^p s_{ijk} \delta_{ijk} }{\sum_{k=1}^p \delta_{ijk}}. \end{aligned}

Now, let’s talk about how to compute the scores for each type of feature. Gower describes 3 different types of features:

  1. Quantitative variables (numeric variables): s_{ijk} = 1 - |x_{ik} - x_{jk}| / R_k, where R_k is the range of the feature k, either in the population or in the sample.
  2. Qualitative variables (categorical variables): s_{ijk} = 1\{ x_{ik} = x_{jk} \}.
  3. Dichotomous variables: This is something we don’t talk about so much nowadays. A dichotomous variable is one where the feature is either present or absent, and “whose absence in both of a pair of individuals is not taken as a match” (emphasis mine). For such variables, Reference 1 sets s_{ijk} = \delta_{ijk} = 1 if the feature is present in both observations, s_{ijk} = 0 and \delta_{ijk} = 1 if it is present in exactly one, and \delta_{ijk} = 0 (so the feature is ignored) if it is absent in both.
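Here is a minimal sketch for a data frame with numeric and categorical columns (dichotomous variables omitted); missing values get \delta_{ijk} = 0 and drop out of the average. In practice, cluster::daisy(..., metric = "gower") is a standard implementation.

```r
# Gower's similarity between rows i and j of a data frame with
# numeric and categorical columns. NAs give delta = 0 for that feature.
gower_sim <- function(df, i, j) {
    scores <- deltas <- numeric(ncol(df))
    for (k in seq_len(ncol(df))) {
        xi <- df[[k]][i]
        xj <- df[[k]][j]
        if (is.na(xi) || is.na(xj)) next  # delta_ijk = 0: skip this feature
        deltas[k] <- 1
        scores[k] <- if (is.numeric(df[[k]])) {
            R_k <- diff(range(df[[k]], na.rm = TRUE))  # sample range of feature k
            1 - abs(xi - xj) / R_k
        } else {
            as.numeric(xi == xj)
        }
    }
    sum(scores) / sum(deltas)
}

toy <- data.frame(height = c(150, 180, 170),
                  color  = c("red", "red", "blue"),
                  stringsAsFactors = FALSE)
gower_sim(toy, 1, 2)  # height score 1 - 30/30 = 0, color score 1: average 0.5
```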

Finally, why is this called a distance? If x_i and x_j have similarity S_{ij}, then we can think of the distance between them as \sqrt{1 - S_{ij}}. For this to be a valid distance, it must satisfy the triangle inequality, i.e. for any 3 observations x_i, x_j and x_\ell, their similarities must satisfy

\begin{aligned} \sqrt{1 - S_{ij}} + \sqrt{1 - S_{j\ell}} \geq \sqrt{1 - S_{i\ell}}. \end{aligned}

Reference 1 proves that if there are no missing values (i.e. \delta_{ijk} = 1 for all i, j and k), then the distance defined above satisfies the triangle inequality.

(Credit: I learned of Gower’s distance from Max Kuhn‘s book Feature Engineering and Selection: A Practical Approach for Predictive Models (Section 8.5).)


  1. Gower, J. C. (1971). A general coefficient of similarity and some of its properties.

The Box-Cox and Yeo-Johnson transformations for continuous variables

The Box-Cox and Yeo-Johnson transformations are two different ways to transform a continuous (numeric) variable so that the resulting variable looks more normally distributed. They are often used in feature engineering to reduce skew in the raw variables.

Box-Cox transformation

George Box and David Cox proposed the Box-Cox transformation in Reference 1. The transformation is really a family of transformations indexed by a parameter \lambda:

\begin{aligned} \psi(y, \lambda) = \begin{cases} \dfrac{y^\lambda - 1}{\lambda} &\lambda \neq 0, \\ \log y &\lambda = 0. \end{cases} \end{aligned}

This is also known as a power transformation as the variable is raised to a particular power. (Note that under translation and scaling, the Box-Cox transformation is really the same as the transformation y \mapsto y^\lambda; the translation and scaling above is set so that the transformation is continuous at \lambda = 0, making it easier for theoretical analysis.)

The transformation above only works for y > 0. If y can be negative but we always have y > \lambda_2 for some \lambda_2, then we can use

\begin{aligned} \psi(y, \lambda) = \begin{cases} \dfrac{(y+\lambda_2)^{\lambda_1} - 1}{\lambda_1} &\lambda_1 \neq 0, \\ \log (y+\lambda_2) &\lambda_1 = 0. \end{cases} \end{aligned}

The key question is how to choose the best value of \lambda. Box and Cox propose using maximum likelihood estimation to do so.
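A direct implementation of the basic transformation; the \lambda = 0 branch is exactly the limit of (y^\lambda - 1)/\lambda as \lambda \rightarrow 0, which we can verify numerically. (In R, MASS::boxcox profiles the likelihood over \lambda for a fitted linear model.)

```r
# Box-Cox transformation for y > 0.
box_cox <- function(y, lambda) {
    if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

box_cox(5, 1)     # 5 - 1 = 4
box_cox(5, 0)     # log(5), about 1.609
box_cox(5, 1e-6)  # nearly log(5): the family is continuous at lambda = 0
```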

Yeo-Johnson transformation

Yeo and Johnson (2000) (Reference 2) note that the tweak above only works when y is bounded from below, and also that standard asymptotic results of maximum likelihood theory may not apply. They propose the following transformation:

\begin{aligned} \psi(y, \lambda) = \begin{cases}  \dfrac{(y+1)^\lambda - 1}{\lambda} &y \geq 0 \text{ and }\lambda \neq 0, \\  \log (y+1) &y \geq 0 \text{ and } \lambda = 0, \\  -\dfrac{(-y + 1)^{2 - \lambda} - 1}{2 - \lambda} &y < 0 \text{ and } \lambda \neq 2, \\  - \log(-y + 1) &y < 0, \lambda = 2.  \end{cases} \end{aligned}

The motivation for this transformation is rooted in the concept of relative skewness introduced by van Zwet (1964) that I won’t go into here. Some nice properties:

  • \psi is concave in y for \lambda < 1 and convex for \lambda > 1.
  • The constant shift of +1 ensures that the transformed value always has the same sign as the original value.
  • The constant shift of +1 also allows \psi to become the identity transformation when \lambda = 1.
  • The new transformations on the positive line are equivalent to the Box-Cox transformation for y > -1 (after accounting for the constant shift), so the Yeo-Johnson transformation can be viewed as a generalization of the Box-Cox transformation.

As with the Box-Cox transformation, the value of \lambda is chosen via maximum likelihood estimation.
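A direct implementation of the four cases, with checks of the properties listed above (identity at \lambda = 1, sign preservation):

```r
# Yeo-Johnson transformation, following the four cases above.
yeo_johnson <- function(y, lambda) {
    if (y >= 0) {
        if (lambda != 0) ((y + 1)^lambda - 1) / lambda
        else log(y + 1)
    } else {
        if (lambda != 2) -((-y + 1)^(2 - lambda) - 1) / (2 - lambda)
        else -log(-y + 1)
    }
}

yeo_johnson(3, 1)   # 3: lambda = 1 gives the identity
yeo_johnson(-3, 1)  # -3: the transformed value keeps the sign of y
yeo_johnson(3, 0)   # log(4): the log branch for nonnegative y
```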

(Credit: I first heard of the Box-Cox transformation in Art Owen‘s STATS 305A class, and I learned of the Yeo-Johnson transformation from Max Kuhn‘s book Feature Engineering and Selection: A Practical Approach for Predictive Models.)


  1. Box, G. E. P., and Cox, D. R. (1964). An analysis of transformations.
  2. Yeo, I. K., and Johnson, R. A. (2000). A new family of power transformations to improve normality or symmetry.
  3. van Zwet, W. R. (1964). Convex transformations of random variables.

What is the jackknife+?

This post describes the jackknife+, a new method developed by Barber et al. (2021) (Reference 1) for constructing prediction intervals with provable coverage guarantees.

The data and the goal

Assume we have i.i.d. training data (X_i, Y_i) \in \mathbb{R}^d \times \mathbb{R} and a new test point (X_{n+1}, Y_{n+1}) drawn from the same distribution. We want to fit a regression model \widehat{\mu}: \mathbb{R}^d \mapsto \mathbb{R} using the training data, then produce a prediction interval for the test point, i.e. an interval around the prediction \widehat{\mu}(X_{n+1}) that is likely to contain the true response Y_{n+1}. Concretely, we want a prediction interval \widehat{C}_{n, \alpha} (as a function of the n training points) such that

\begin{aligned} \mathbb{P} \{ Y_{n+1} \in \widehat{C}_{n, \alpha}(X_{n+1})  \} \geq 1 - \alpha, \end{aligned}

where \alpha \in [0,1] is pre-specified, so that 1 - \alpha is the desired coverage level. The probability is taken with respect to a new test point as well as with respect to the training data.

The jackknife and the jackknife+

Before introducing the methods, let me introduce some notation that will help us express the methods concisely.

  • For i = 1, \dots, n, let \widehat{\mu}_{-i} denote the regression model fit on the training data with the ith observation removed.
  • For i = 1, \dots, n, let R_i^{LOO} = |Y_i - \widehat{\mu}_{-i}(X_i)| denote the leave-one-out residual for observation i.
  • For any values v_1, \dots, v_n, let \widehat{q}_{n,\alpha}^+ \{ v_i \} and \widehat{q}_{n,\alpha}^- \{ v_i \} denote the 1 - \alpha and \alpha quantiles of the empirical distribution of these values respectively. (Note: The paper has a more precise, technical definition of these quantiles.)

The original jackknife prediction interval is defined as

\begin{aligned} \widehat{C}_{n, \alpha}^{jackknife}(X_{n+1}) &= \left[ \widehat{\mu}(X_{n+1}) - \widehat{q}_{n,\alpha}^+ \left\{ R_i^{LOO} \right\}, \widehat{\mu}(X_{n+1}) + \widehat{q}_{n,\alpha}^+ \left\{ R_i^{LOO} \right\} \right] \\  &= \left[ \widehat{q}_{n,\alpha}^- \left\{ \widehat{\mu}(X_{n+1}) - R_i^{LOO} \right\}, \widehat{q}_{n,\alpha}^+ \left\{ \widehat{\mu}(X_{n+1}) + R_i^{LOO} \right\} \right]. \end{aligned}

The newly proposed jackknife+ prediction interval is defined as

\begin{aligned} \widehat{C}_{n, \alpha}^{jackknife+}(X_{n+1}) = \left[ \widehat{q}_{n,\alpha}^- \left\{ \widehat{\mu}_{-i}(X_{n+1}) - R_i^{LOO} \right\}, \widehat{q}_{n,\alpha}^+ \left\{ \widehat{\mu}_{-i}(X_{n+1}) + R_i^{LOO} \right\} \right]. \end{aligned}

You might have to squint a little to see the difference between the two intervals. The jackknife centers the prediction interval on the predicted value \widehat{\mu}(X_{n+1}) from the model fitted on the full training data, while the jackknife+ uses the leave-one-out predictions \widehat{\mu}_{-i}(X_{n+1}) instead. The figure below is a visual example of how the construction of the two types of prediction intervals might differ.
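A minimal sketch of the jackknife+ construction with lm as the base regression method, on simulated data. (I use simple empirical quantiles here rather than the paper's exact \lceil (1-\alpha)(n+1) \rceil-th order statistics, so this is only an approximation of the definition above.)

```r
set.seed(1)
n <- 100
x <- runif(n)
y <- 2 * x + rnorm(n)  # simulated data; true regression function is 2 * x
x_new <- 0.5
alpha <- 0.1

lo <- hi <- numeric(n)
for (i in seq_len(n)) {
    fit_i <- lm(y ~ x, subset = -i)                           # leave observation i out
    r_i  <- abs(y[i] - predict(fit_i, data.frame(x = x[i])))  # LOO residual
    mu_i <- predict(fit_i, data.frame(x = x_new))             # LOO prediction at x_new
    lo[i] <- mu_i - r_i
    hi[i] <- mu_i + r_i
}
interval <- c(unname(quantile(lo, alpha)), unname(quantile(hi, 1 - alpha)))
interval  # should contain the true conditional mean 2 * 0.5 = 1
```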

The beauty of this little change is that it comes with provable coverage guarantees, without any assumptions! Well, OK, extremely minor assumptions: we assume that the method used to obtain the regression model \widehat{\mu} is invariant to the ordering of the data, and that the sample size is n \geq 2. No other assumptions are made on the data-generating distribution or on the regression method. Here’s the guarantee (Theorem 1):

Theorem: The jackknife+ prediction interval satisfies

\begin{aligned} \mathbb{P} \left\{ Y_{n+1} \in \widehat{C}_{n, \alpha}^{jackknife+}(X_{n+1})  \right\} \geq 1 - 2\alpha. \end{aligned}

The alert reader will notice that the RHS is 1 - 2\alpha and not 1 - \alpha as we would like. The authors address this in 3 different ways:

  1. 1 - 2\alpha is a lower bound on the coverage. Empirically it looks like coverage is often very close to the target level 1 - \alpha.
  2. They propose a version of the jackknife+, which they call jackknife-minimax, that has a 1 - \alpha coverage guarantee (see Section 2.2 of Reference 1 for details). However, the authors were of the opinion that this method is generally too conservative.
  3. While there are pathological examples that cause the jackknife+ to give coverage very close to 1 - 2\alpha (Theorem 2), this does not happen often. In Section 5 of the paper, the authors introduce the notion of “in-sample stability” and “out-of-sample stability”, and give coverage results that are much closer to 1 - \alpha under some stability conditions (Theorem 5).

Extension to cross-validation (CV+)

The computation of the jackknife and jackknife+ intervals can be inefficient because we need to compute \widehat\mu_{-i} for i = 1, \dots, n. This amounts to refitting the model n times. The authors develop an extension of the jackknife+ that works in the K-fold cross-validation (CV) setting. Here, the model refitting only happens K times.

Assume that the training data is split into K disjoint subsets S_1, \dots, S_K, each of size m = n / K. Define the following quantities:

  • For k = 1, \dots, K, let \widehat\mu_{-S_k} be the regression model fit on the training data with S_k removed.
  • For i = 1, \dots, n, let R_i^{CV} =|Y_i - \widehat{\mu}_{-S_{k(i)}}(X_i)|, where k(i) is the subset that contains observation i.

The CV+ prediction interval is defined as

\begin{aligned} \widehat{C}_{n, K, \alpha}^{CV+}(X_{n+1}) = \left[ \widehat{q}_{n,\alpha}^- \left\{ \widehat{\mu}_{-S_{k(i)}}(X_{n+1}) - R_i^{CV} \right\}, \widehat{q}_{n,\alpha}^+ \left\{ \widehat{\mu}_{-S_{k(i)}}(X_{n+1}) + R_i^{CV} \right\} \right]. \end{aligned}

(Note that the jackknife+ prediction interval is simply the CV+ prediction interval with K = n.) The coverage guarantee for CV+ is as follows (Theorem 4):

Theorem: The CV+ prediction interval satisfies

\begin{aligned} \mathbb{P} \left\{ Y_{n+1} \in \widehat{C}_{n, K, \alpha}^{CV+}(X_{n+1})  \right\} &\geq 1 - 2\alpha - \min \left\{ \frac{2(1 - 1/K)}{n/K + 1}, \frac{1-K/n}{K+1} \right\} \\  &\geq 1 - 2\alpha - \sqrt{2/n}. \end{aligned}

Comparison with other methods

Reference 1 compares the jackknife+ against other methods in terms of theoretical guarantees and computational efficiency. I’ve reproduced the tables here, with the newly proposed methods highlighted in green.

Note: All figures in this post are taken from Reference 1.


  1. Barber, R. F. et al. (2021). Predictive inference with the jackknife+.

What are spaghetti plots and lasagna plots?

I recently heard of lasagna plots in a talk and I hadn’t heard of them before, so I looked it up. It turns out that the lasagna plot and its close cousin, the spaghetti plot, are plots that you’ve probably come across before. They are just technical terms that are used in certain domains with certain connotations.

According to Reference 1,

[the spaghetti plot] plots a subject’s values for the repeated outcome measure (vertical axis) versus time (horizontal axis) and connects the dots chronologically, using lines of uniform color and intensity for each subject.

Sounds like a line plot to me! A line plot with one line per subject and with time as the x-axis. The figure below demonstrates what a spaghetti plot is and what could go wrong. The left panel shows the time series for a single individual (one strand of spaghetti). Unfortunately, the entire dataset consists of 118 subjects, so plotting all the strands of spaghetti results in a hot mess.

Swihart et al. (2010) (Reference 1) introduce the lasagna plot as a potential remedy for the spaghetti plot’s shortcomings. The lasagna plot is a heatmap with some special connotations. Instead of one line per subject in the spaghetti plot, each subject gets one horizontal slice. Instead of depicting values using the y-axis, they are presented through the color of each part of the slice. Here is an example of a lasagna plot with 4 subjects and 6 timepoints:
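In base R terms, such a plot is just a heatmap of a subjects-by-time matrix, while the spaghetti plot draws the rows of the same matrix as lines. A sketch with simulated data (4 subjects, 6 timepoints):

```r
set.seed(42)
# Simulated repeated measures: 4 subjects (rows), 6 timepoints (columns).
values <- matrix(rnorm(24), nrow = 4, ncol = 6,
                 dimnames = list(paste0("subject", 1:4), paste0("t", 1:6)))

# Lasagna plot: one horizontal slice per subject, color encodes the value.
image(x = 1:6, y = 1:4, z = t(values),
      xlab = "Time", ylab = "Subject", main = "Lasagna plot (simulated)")

# Spaghetti plot of the same data: one line per subject.
matplot(t(values), type = "l", lty = 1,
        xlab = "Time", ylab = "Value", main = "Spaghetti plot (simulated)")
```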

Here is the lasagna plot for the same data which the spaghetti plot made a mess of. Since there are 118 subjects, each horizontal slice is very thin.

Certainly more palatable than the spaghetti plots, with colors to make it look like a real lasagna too! The panel on the left shows the plot with subjects labeled according to their position in the dataset. The right panel shows the plot with subjects ordered according to disease status, and within each disease status, ordered by date of recording. Some trends are obvious now: the top half tend to have lower values, and there seems to be a systematic pattern of missing data for some subjects.

Reference 1 goes into more detail about the possible ways to sort the rows and columns of the lasagna plot, as well as pointers on how to choose a suitable color palette.

Note: All figures were taken from Reference 1.


  1. Swihart, B. J., et al. (2010). Lasagna plots: A saucy alternative to spaghetti plots.