General formula for the asymptotic covariance of the OLS estimator

This post derives the general formula for the covariance of the ordinary least squares (OLS) estimator.

Imagine we are in the regression setup with design matrix {\bf X} \in \mathbb{R}^{n \times p} and response {\bf y} \in \mathbb{R}^n. Let {\bf x}_i \in \mathbb{R}^p and y_i \in \mathbb{R} denote the ith row of \bf X and the ith element of \bf y respectively. We can always make the following decomposition:

y_i = \mathbb{E} [y_i \mid {\bf x_i}] + \varepsilon_i,

where \mathbb{E}[\varepsilon_i \mid {\bf x_i}] = 0 and \varepsilon is uncorrelated with any function of {\bf x}_i. (This is Theorem 3.1.1 of Reference 1.)

The population regression function approximates y_i as {\bf x}_i^T \beta, where \beta solves the minimization problem

\beta = \text{argmin}_b \: \mathbb{E} \left[ (y_i - {\bf x}_i^T b)^2 \right].

It can be shown that

\beta = \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E} [{\bf x}_i y_i].

The ordinary least squares (OLS) estimator is a sample version of this and is given by

\begin{aligned} \hat\beta = ({\bf X}^T {\bf X})^{-1} {\bf X}^T {\bf y} = \left( \sum_i {\bf x}_i {\bf x}_i^T \right)^{-1} \left( \sum_i {\bf x}_i y_i \right). \end{aligned}

We are often interested in estimating the covariance matrix of \hat\beta as it is needed to construct standard errors for \hat\beta. Defining \varepsilon_i = y_i - {\bf x}_i^T \beta as the ith residual, we can rewrite the above as

\begin{aligned} \hat{\beta} &= \beta + \left( \sum_i {\bf x}_i {\bf x}_i^T \right)^{-1} \left( \sum_i {\bf x}_i \varepsilon_i \right), \\  \sqrt{n}(\hat{\beta} - \beta) &= n \left( \sum_i {\bf x}_i {\bf x}_i^T \right)^{-1} \cdot \dfrac{1}{\sqrt{n}} \left( \sum_i {\bf x}_i \varepsilon_i \right). \end{aligned}

By Slutsky’s Theorem, the quantity above has the same asymptotic distribution as \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \cdot \frac{1}{\sqrt{n}} \left( \sum_i {\bf x}_i \varepsilon_i \right). Since \mathbb{E}[{\bf x}_i \varepsilon_i] = {\bf 0}*, the Central Limit Theorem tells us that \frac{1}{\sqrt{n}} \left( \sum_i {\bf x}_i \varepsilon_i \right) is asymptotically normally distributed with mean zero and covariance \mathbb{E}[{\bf x}_i {\bf x}_i^T \varepsilon_i^2] (the matrix of fourth moments). Thus,

\begin{aligned} \sqrt{n}(\hat{\beta} - \beta) \stackrel{d}{\rightarrow} \mathcal{N} \left( {\bf 0}, \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E}[{\bf x}_i {\bf x}_i^T \varepsilon_i^2] \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \right). \end{aligned}

We can use the diagonal elements of \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E}[{\bf x}_i {\bf x}_i^T \varepsilon_i^2] \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} to construct standard errors of \hat\beta. The standard errors computed in this way are called heteroskedasticity-consistent standard errors (or White standard errors, or Eicker-White standard errors). They are “robust” in the sense that they use few assumptions on the data and the model: only those needed to make the Central Limit Theorem go through.

*Note: We do NOT need to assume that \mathbb{E} [y_i \mid {\bf x_i}] is linear in order to conclude that \mathbb{E}[{\bf x}_i \varepsilon_i] = {\bf 0}. All we need are the relations \beta = \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E} [{\bf x}_i y_i] and \varepsilon_i = y_i - {\bf x}_i^T \beta. The derivation is as follows:

\begin{aligned} \mathbb{E}[{\bf x}_i \varepsilon_i] &= \mathbb{E}[{\bf x}_i (y_i - {\bf x}_i^T \beta)] \\  &= \mathbb{E}[{\bf x}_i y_i] - \mathbb{E}[ {\bf x}_i {\bf x}_i^T] \beta \\  &= \mathbb{E}[{\bf x}_i y_i] - \mathbb{E}[ {\bf x}_i {\bf x}_i^T] \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E} [{\bf x}_i y_i] \\  &= {\bf 0}. \end{aligned}

Special case of homoskedastic errors

If we assume that the errors are homoskedastic, i.e.

\mathbb{E}[\varepsilon_i \mid X_i] = \sigma^2 for all i

for some constant \sigma, then the asymptotic covariance simplifies a little:

\begin{aligned} \sqrt{n}(\hat{\beta} - \beta) \stackrel{d}{\rightarrow} \mathcal{N} \left( {\bf 0}, \sigma^2 \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \right). \end{aligned}


  1. Angrist, J. D., and Pischke, J.-S. (2009). Mostly harmless econometrics (Section 3.1.3).

Getting predictions from an isotonic regression model

TLDR: Pass the output of the isoreg function to as.stepfun to make an isotonic regression model into a black box object that takes in uncalibrated predictions and outputs calibrated ones.

Isotonic regression is a method for obtaining a monotonic fit for 1-dimensional data. Let’s say we have data (x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^2 such that x_1 < x_2 < \dots < x_n. (We assume no ties among the x_i‘s for simplicity.) Informally, isotonic regression looks for \beta_1, \dots, \beta_n \in \mathbb{R} such that the \beta_i‘s approximate the y_i‘s well while being monotonically non-decreasing. (See this previous post for more technical details.)

Isotonic regression can be performed easily in R with the stats package’s isoreg function. Note the slightly unusual syntax when pulling out the fitted values (see the function’s documentation with ?isoreg to understand why this is the case). The plot shows the original data values as black crosses and the fitted values as blue dots. As expected, the blue dots are monotonically increasing.

# training data
x <- sample(2 * 1:15)
y <- 0.2 * x + rnorm(length(x))

# isotonic reg fit and plot
fit <- isoreg(x, y)
plot(x, y, pch = 4)
points(fit$x[fit$ord], fit$yf, pch = 16, col = "blue")

Isotonic regression is one commonly used method for calibration: see this previous post for background on calibration and this link for more details with python code. In this setting, we want the isotonic regression model to be a black box: we hand it uncalibrated predictions as an input, and it returns us calibrated predictions.

If you inspect the return value of the isoreg function, you will find that it is unable to interact with any new test data. Imagine that we have some new test data that we want to calibrate:

test_x <- sample(2 * 1:15 - 1)
test_y <- 0.2 * test_x + rnorm(length(test_x))

A naive and WRONG way to calibrate test_y would be to run isotonic regression on just the test data. The plot shows the fits for the training data as blue dots and the fits for the testing data as red squares: the overall fit is not monotonic.

# WRONG isotonic reg fit and plot
fit2 <- isoreg(test_x, test_y)

plot(x, y, pch = 4)
points(test_x, test_y, pch = 4)
points(fit$x[fit$ord], fit$yf, pch = 16, col = "blue")
points(fit2$x[fit2$ord], fit2$yf, pch = 15, col = "red")

A second WRONG way to calibrate test_y</code> would be to run isotonic regression on the combined training/test data. The plot shows that while the overall fit is monotonic, the predictions for the training data have shifted, i.e. you have changed the black box.

# WRONG isotonic reg fit and plot (v2)
all_x <- c(x, test_x)
all_y <- c(y, test_y)
fit3 <- isoreg(all_x, all_y)

plot(all_x, all_y, pch = 4)
points(fit$x[fit$ord], fit$yf, pch = 16, col = "blue")
points(fit3$x[fit3$ord], fit3$yf, pch = 15, col = "red")

The CORRECT way to make the isotonic regression model into a black box is to pass the output of isoreg to the as.stepfun function, like so:

isofit <- as.stepfun(isoreg(x, y))

isofit is the black box we seek: a function that we give uncalibrated predictions to get calibrated predictions in return. As the plot below shows, the overall fit is still monotonic, and the calibrated predictions for the training data do not change.

plot(all_x, all_y, pch = 4)
points(all_x, isofit(all_x), pch = 15, col = "red")
points(fit$x[fit$ord], fit$yf, pch = 16, col = "blue")

Variance of coefficients for linear models and generalized linear models

Variance of coefficients in linear models

Assume we are in the supervised learning setting with design matrix {\bf X} \in \mathbb{R}^{n \times p} and response y \in \mathbb{R}^n. If the linear regression model is true, i.e.

y = {\bf X} \beta + \epsilon,

where \beta \in \mathbb{R}^p is the vector of coefficients and the entries of \epsilon \in \mathbb{R}^p are i.i.d mean 0 and variance \sigma^2, it is well-known that the maximum likelihood estimate (MLE) of \beta is

\hat\beta = ({\bf X}^T{\bf X})^{-1} {\bf X}^T y.

Assuming {\bf X} is fixed and that \epsilon is the only source of randomness, it follows easily that

Var(\hat\beta) = \sigma^2 ({\bf X}^T{\bf X})^{-1}.

It’s remarkable that (i) we have a closed form solution for the variance of \hat\beta, and (ii) this variance does not depend on the value of \beta.

Variance of coefficients in generalized linear models (GLMs)

We are not so lucky on both counts in the generalized linear model (GLM) setting. Assume that the true data generating model is given by a GLM, i.e.

\begin{aligned} Y_i &\stackrel{ind.}{\sim} f(y_i ; \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right], \\  \eta_i &= \sum_{j=1}^p \beta_j x_{ij}, \text{ and } \\  \eta_i &= g (\mu_i),  \end{aligned}

where i = 1, \dots, n indexes the observations, \mu_i = \mathbb{E}[Y_i], and g is the link function. (See this previous post for details.)

In this setting, we do not have a closed form solution for \hat\beta, the MLE of \beta. (See this previous post for the likelihood equations and ways to obtain the MLE numerically.) We are also not able to get a closed form solution for the variance of \hat\beta. We can however get a closed form solution for the asymptotic variance of \hat\beta:

\begin{aligned} Var( \hat\beta) &= ({\bf X}^T {\bf WX})^{-1}, \text{ where} \\  {\bf W}_{ij} &= \begin{cases} \left( \dfrac{\partial \mu_i}{\partial \eta_i} \right)^2 / Var(Y_i) &\text{if } i = j, \\ 0 &\text{otherwise}. \end{cases} \end{aligned}


  1. Agresti, A. (2013). Categorical data analysis (3rd ed): Section 4.4.8.

What is the Frisch–Waugh–Lovell theorem?

The Frisch–Waugh–Lovell (FWL) theorem (named after econometricians Ragnar Frisch, Frederick V. Waugh and Michael C. Lovell) relates the coefficients for two different regressions.

Assume we have a response y \in \mathbb{R}^n and two data matrices X_1 \in \mathbb{R}^{n \times p_1} and X_2 \in \mathbb{R}^{n \times p_2}. On one hand, we could perform ordinary least squares (OLS) of y on X_1 and X_2 (jointly) to get coefficient vectors \hat\beta_1 \in \mathbb{R}^{p_1} and \hat\beta_2 \in \mathbb{R}^{p_2}. (The implicit model here is y = X_1 \beta_1 + X_2 \beta_2 + \text{noise}.)

Alternatively, we could do the following:

  1. Perform OLS of y on X_1 to get residuals \tilde{y}.
  2. Perform OLS of X_2 on X_1 (in a column-wise fashion) to get residuals \tilde{X_2}.
  3. Perform OLS of \tilde{y} on \tilde{X_2} to get coefficient vector \tilde{\beta}_2.

The Frisch–Waugh–Lovell theorem states that \hat{\beta}_2 = \tilde{\beta}_2.

The proof of the theorem is not hard. Reference 2 proves the theorem using two well-known facts about OLS, while Reference 3 proves the theorem using just matrix algebra.


  1. Wikipedia. Frisch-Waugh-Lovell theorem.
  2. Lovell, M. C. (2008). A simple proof of the FWL theorem.
  3. Belzile, L. (2019). Frisch-Waugh-Lovell theorem.

Robit regression

Let’s say we n observations, and for each observation we have values for p features, x_i \in \mathbb{R}^p, and a binary response y \in \{0, 1\}. We want to build a predictive model for y using the features we have. The most commonly used model for this is logistic regression. If we define p_i = Pr \{ y_i = 1\}, logistic regression models the relationship as

\begin{aligned} \text{logit}(p_i) = \log \left( \frac{p_i}{1-p_i} \right) = x_i^T \beta,  \end{aligned}

where \beta \in \mathbb{R}^p are the regression coefficients to be determined by the data.

Probit regression is another commonly used model for this setting. Instead of the link function being the logit, the link function is the inverse of the standard normal cumulative distribution function (CDF):

\begin{aligned} \Phi^{-1}(p_i) = x_i^T \beta, \end{aligned}

more commonly written as

\begin{aligned} p_i= \Phi(x_i^T \beta). \end{aligned}

Latent variable interpretation

Probit regression has a latent variable interpretation as well. The latent variable z_i is modeled as

\begin{aligned} z_i = x_i^T \beta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1), \end{aligned}

and the response is simply the indicator variable y_i = 1\{ z_i > 0 \}. Logistic regression has a similar latent variable interpretation too: there, instead of the error being normally distributed, it is distributed according to the standard logistic distribution.

Robit regression

Robit regression is a robust version of logistic/probit regression. It posits the same latent variable model. However, the error is distributed as a scaled t-distribution:

\begin{aligned} z_i = x_i^T \beta + \epsilon, \quad \epsilon \sim t_\nu \left( 0, \frac{\nu - 2}{\nu}\right), \end{aligned}

where \nu > 2 is the degrees-of-freedom parameter (can be estimated from the data) and the t distribution is scaled so that its standard deviation is 1. Because the t-distribution has fatter tails than the normal distribution, the model allows for occasional large errors, and as a result the estimate for \beta is less affected by outliers.

Robit regression is equivalent to probit regression when \nu = \infty, and is close to logistic regression when \nu = 7.


  1. Gelman, A., Hill, J., and Vehtari, A. (2020). Regression and other stories (Chapter 15).

The standardized group lasso

Assume that we are in the standard supervised learning setting with n observations. The response of interest is denoted by y \in \mathbb{R}^n, and we have K groups of predictor variables, with the kth group denoted by X^{(k)} \in \mathbb{R}^{n \times p_k}. Ordinary linear regression solves the optimization problem

\begin{aligned} \underset{\beta^{(1)}, \dots, \beta^{(K)}}{\text{minimize}} \quad \frac{1}{2} \left\| y - \sum_{k=1}^K X^{(k)} \beta^{(k)} \right\|_2^2, \end{aligned}

where \beta^{(k)} \in \mathbb{R}^{p_k} is the vector of coefficients associated with the kth group of predictors.

The original group lasso

If we want a solution which is sparse in the number of groups (i.e. \beta^{(k)} = \mathbf{0} for all but a handful of k), the group lasso (Yuan & Lin, 2006) is the first solution that many turn to. The group lasso solves the optimization problem

\begin{aligned} \underset{\beta^{(1)}, \dots, \beta^{(K)}}{\text{minimize}} \quad \frac{1}{2} \left\| y - \sum_{k=1}^K X^{(k)} \beta^{(k)} \right\|_2^2 + \lambda \sum_{k=1}^K \sqrt{p_k} \| \beta^{(k)} \|_2, \end{aligned}

where \lambda \geq 0 is a tuning parameter. (Sometimes the \sqrt{p_k} term is dropped.) For reasons that will become clear later, we call this problem the “unstandardized group lasso”.

While the group lasso is widely known, what is less commonly known is that the original group lasso paper assumes that the data matrices are orthonormal, i.e. (X^{(k)})^T X^{(k)} = I for all k. In order to apply the group lasso to non-orthornomal data matrices, we first have to sphere them (e.g. with the Gram-Schmidt process). However, if we sphere the data matrices and express the group lasso optimization problem in terms of the new data, we will get a problem that is not equivalent to the one above (because the \ell_2 penalty is going to be on a different coefficient vector).

A bigger problem happens if any of the p_k are larger than n. In that case, it is impossible to sphere X^{(k)} into an orthonormal matrix, and so we cannot use the unstandardized group lasso.

The standardized group lasso

These issues were pointed out in Simon & Tibshirani (2012). To clarify matters, they introduced the standardized group lasso, which is the solution to

\begin{aligned} \underset{\beta^{(1)}, \dots, \beta^{(K)}}{\text{minimize}} \quad \frac{1}{2} \left\| y - \sum_{k=1}^K X^{(k)} \beta^{(k)} \right\|_2^2 + \lambda \sum_{k=1}^K \sqrt{p_k} \| X^{(k)} \beta^{(k)} \|_2. \end{aligned}

For each group, instead of putting an \ell_2 penalty on the coefficient vector \beta^{(k)}, we put it on the predictions X^{(k)} \beta^{(k)}.

There is a strong connection between the standardized group lasso and the original (unstandardized) group lasso. If p_k \leq n for all k, the standardized group lasso is equivalent to orthonormalizing within groups, running the unstandardized group lasso, then transforming the coefficients back to the original basis.

The paper (Reference 2) also presents connections to other methods, as well as an efficient algorithm for computing the solution (the authors claim that it is even more efficient than that for the unstandardized group lasso).


  1. Yuan, M., and Lin, Y. (2006). Model selection and estimation in regression with grouped variables.
  2. Simon, N., and Tibshirani, R. (2012). Standardization and the group lasso penalty.

Strong and weak hierarchy in interaction models

In supervised learning, we want to use features X_1, X_2, \dots, X_p to predict some response variable y. That is, we want to find some function f

f = f(X_1, \dots, X_p)

such that y \approx f(X_1, \dots, X_p). In additive models (or main effect models), we constrain f to have the form

\begin{aligned} f(X_1, \dots, X_p) = \sum_{i=1}^p f_i (X_i), \end{aligned}

where the f_i‘s are univariate functions. This model is very interpretable but makes a big assumption that any change in a variable has the same effect for all subjects, regardless of the values for all the other variables. This can be unrealistic: for example, in trying to predict students’ test scores, the effect of “number of hours of study” on the test score probably depends on the subject of the test. If that is the case, we say that there is an interaction effect between number of hours of study and test subject.

Interaction models try to predict y using both main effects and interaction effects. In its most general form, the regression function f has the form

\begin{aligned} f(X_1, \dots, X_p) = \sum_{i=1}^p f_i (X_i) + \sum_{i=1}^p\sum_{j = i+1}^p f_{ij}(X_i, X_j), \end{aligned}

where the f_{ij}‘s are bivariate functions. This is sometimes called a two-way interaction model since only second-order interactions are included.


In statistics there is a strong tradition for limiting when the interactions can be included in the model, i.e. when f_{ij} \neq 0 (see Section 1.2 of Reference 1 for some discussion). For strong hierarchy, we only allow an interaction effect to be in the model if both the corresponding main effects are included in the model. For weak hierarchy, we only allow an interaction effect to be in the model if either of the corresponding main effects are included in the model.

In mathematical notation:

\begin{aligned} \text{Strong hierarchy: } &f_{ij} \neq 0 &\Rightarrow& f_i \neq 0 \text{ and } f_j \neq 0, \\  \text{Weak hierarchy: } &f_{ij} \neq 0 &\Rightarrow& f_i \neq 0 \text{ or } f_j \neq 0. \end{aligned}

Statisticians tend to seek out regression models which guarantee that the model fit satisfies strong or weak hierarchy (with strong hierarchy being more common).


  1. Bien, J., et al. (2013). A lasso for hierarchical interactions.

What is nearly-isotonic regression?

Let’s say we have data (x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^2 such that x_1 < x_2 < \dots < x_n. (We assume no ties among the x_i‘s for simplicity.) Isotonic regression gives us a monotonic fit \beta_1 \leq \dots \leq \beta_n for the y_i‘s by solving the problem

\begin{aligned} \text{minimize}_{\beta_1, \dots, \beta_n} \quad& \sum_{i=1}^n (y_i - \beta_i)^2 \\ \text{subject to} \quad& \beta_1 \leq \dots \leq \beta_n. \end{aligned}

(See this previous post for more details.) Nearly-isotonic regression, introduced by Tibshirani et al. (2009) (Reference 1), generalizes isotonic regression by solving the problem

\begin{aligned} \text{minimize}_{\beta_1, \dots, \beta_n} \quad \frac{1}{2}\sum_{i=1}^n (y_i - \beta_i)^2 + \lambda \sum_{i=1}^{n-1} (\beta_i - \beta_{i+1})_+, \end{aligned}

where x_+  = \max (0, x) and \lambda \geq 0 is a user-specified hyperparameter.

It turns out that, due to properties of the optimization problem, the nearly-isotonic regression fit can be computed for all \lambda values in O(n \log n) time, making it a practical method for use. See Section 3 and Algorithm 1 of Reference 1 for details. (More accurately, we can determine the nearly-isotonic regression fit for a critical set of \lambda values: the fit for any other other \lambda value will be a linear interpolation of fits from this critical set.)

How is nearly-isotonic regression a generalization of isotonic regression? The term (\beta_i - \beta_{i+1})_+ is positive if and only if \beta_i > \beta_{i+1}, that is, if there is a monotonicity violation. The larger the violation, the larger the penalty. Instead of insisting on no violations at all, nearly-isotonic regression trades off the size of the violation with the improvement one gets from goodness of fit to the data. Nearly-isotonic regression gives us a series of fits that range from interpolation of the data (when \lambda = 0) to the isotonic regression fit (when \lambda = \infty). (Actually, you will get the isotonic regression fit once \lambda is big enough such that any change in the penalty cannot be mitigated by the goodness of fit improvement.)

Why might you want to use nearly-isotonic regression? One possible reason is to check if the assumption monotonicity is reasonable for your data. To do so, run nearly-isotonic regression with cross-validation on \lambda and compute the CV error for each \lambda value. If the CV error achieved by the isotonic regression fit (i.e. largest \lambda value) is close to or statistically indistinguishable from the minimum, that gives assurance that monotonicity is a reasonable assumption for your data.

You can perform nearly-isotonic regression in R with the neariso package. The neariso() function returns fits for an entire path of \lambda values. The animation below shows how the fit changes as \lambda gets larger and larger (code available here).

Note 1: The formulation for nearly-isotonic regression above assumes that the points x_1, \dots, x_n are equally spaced. If they are not, one should replace the penalty with

\begin{aligned} \lambda \sum_{i=1}^{n-1} \frac{(\beta_i - \beta_{i+1})_+}{x_{i+1} - x_i} \end{aligned}

to account for the different-sized gaps. The neariso package only seems to handle the case where the x_i‘s are equally spaced.

Note 2: The animation above was created by generating separate .png files for each value of \lambda, then stitching them together using the magick package. My initial hope was to create an animation that would transition smoothly between the different fits using the gganimate package but the transitions weren’t as smooth as I would have imagined them to be:

Does anyone know how this issue could be fixed? Code for the animation is below, full code available here.

p <- ggplot(df, aes(x = x, y = beta)) +
    geom_path(col = "blue") +
    geom_point(data = truth_df, aes(x = x, y = y), shape = 4) +
    labs(title = "Nearly isotonic regression fits",
         subtitle = paste("Lambda = ", "{lambda[as.integer(closest_state)]}")) +
    transition_states(iter, transition_length = 1, state_length = 2) +
    theme_bw() + 
    theme(plot.title = element_text(size = rel(1.5), face = "bold"))
animate(p, fps = 5)


  1. Tibshirani, R. J., Hoefling, H., and Tibshirani, R. (2011). Nearly-isotonic regression.

What is isotonic regression?

Isotonic regression is a method for obtaining a monotonic fit for 1-dimensional data. Let’s say we have data (x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^2 such that x_1 < x_2 < \dots < x_n. (We assume no ties among the x_i‘s for simplicity.) Informally, isotonic regression looks for \beta_1, \dots, \beta_n \in \mathbb{R} such that the \beta_i‘s approximate the y_i‘s well while being monotonically non-decreasing. Formally, the \beta_i‘s are the solution to the optimization problem

\begin{aligned} \text{minimize}_{\beta_1, \dots, \beta_n} \quad& \sum_{i=1}^n (y_i - \beta_i)^2 \\ \text{subject to} \quad& \beta_1 \leq \dots \leq \beta_n. \end{aligned}

(Note: There is a corresponding solution for a monotonically non-increasing fit. Sometimes, the above is referred to as linear ordering isotonic regression, with isotonic regression referring to a more general version of the problem above. For more general versions, see Reference 1.)

Isotonic regression is useful for enforcing a monotonic fit to the data. Sometimes you might know that your trend is monotonic but the output from your model is not monotonic. In that situation you can use isotonic regression as a smoother for the data or a post-processing step to force your model prediction to be monotonic.

A commonly used algorithm to obtain the isotonic regression solution is the pool-adjacent-violators algorithm (PAVA). It runs in linear time and linear memory. At a high level it works like this: go from left to right and set \beta_i = y_i. If setting \beta_i as such causes a violation of monotonicity (i.e. \beta_i = y_i < y_{i-1} = \beta_{i-1}), replace both \beta_i and \beta_{i-1} with the mean (y_{i-1} + y_i)/2. This move may result in earlier violations (i.e. the new \beta_{i-1} may be less than \beta_{i-2}): if that happens, we need to go back and fix it via averaging.

The animation below works through an example of how PAVA works. (Essentially I wrote a homebrew version of PAVA but kept a record of the intermediate fits.) Note that there are a number of ways to implement PAVA: what you see below may not be the most efficient. Click here for the R code; you can amend the dataset there to get your own version of the animation below.

If you want to do isotonic regression in R, DON’T use my homebrew version, use the gpava function in the isotone package instead.


  1. Mair, P., Hornik, M., and de Leeuw, J. (2009). Isotone optimization in R: Pool-adjacent-violators algorithm (PAVA) and active set methods.

Models for ordinal responses

A variable y is said to be ordinal if it takes on values or categories which have a natural ordering to them but the distances between the categories is not known. Ordinal variables are very common: some examples include T-shirt sizes, the Likert scale used in surveys (“for each of the following statements, choose along the scale of strongly disagree to strongly agree”), income brackets and ranked variables.

In this post, I introduce 3 commonly used methods for fitting regression models with an ordinal response. In what follows, I assume that the response variable y is ordinal, taking on category values \{ 1, 2, \dots, c \}. (As y is ordinal, we should think of these values as category numbers instead of actual quantitative values.) We want to fit a model where we can predict y based on a set of features or covariates x \in \mathbb{R}^p.

Cumulative logit model with proportional odds

The cumulative logit model with proportional odds assumes the following relationship between y and x:

\begin{aligned} \text{logit}[P(y \leq j)] = \log \dfrac{P(y \leq j)}{P(y > j)} = \alpha_j + \beta^T x, \qquad j = 1, \dots, c-1. \end{aligned}

This is a natural extension of logistic regression: there, we have c = 2. Note also that for fixed j, the above looks like logistic regression where we have made the response binary (above j vs. below or equal to j).

Note that in this model, each value of j gets its own intercept \alpha_j but the other coefficients \beta are fixed across j. Because of this, the model satisfies the proportional odds property: for all j,

\begin{aligned} \log \left[ \dfrac{P(y \leq j \mid x_1) / P(y > j \mid x_1)}{P(y \leq j \mid x_2) / P(y > j \mid x_2)} \right] = \beta^T (x_1 - x_2). \end{aligned}

With this model, the estimated probability of being less than or equal to j is

\begin{aligned} P(y \leq j) = \dfrac{\exp (\hat{\alpha}_j + \hat{\beta}^T x)}{1 + \exp (\hat{\alpha}_j + \hat{\beta}^T x)}. \end{aligned}

Hence, the estimated probability of being equal to j is

\begin{aligned} P(y = j)  &= P(y \leq j) - P(y \leq j - 1) \\  &= \dfrac{\exp (\hat{\alpha}_j + \hat{\beta}^T x)}{1 + \exp (\hat{\alpha}_j + \hat{\beta}^T x)} - \dfrac{\exp (\hat{\alpha}_{j-1} + \hat{\beta}^T x)}{1 + \exp (\hat{\alpha}_{j-1} + \hat{\beta}^T x)}. \end{aligned}

Adjacent-category logit model

The adjacent-category logit model assumes the relationship

\begin{aligned} \log \dfrac{P(y = j)}{P(y = j + 1)} = \alpha_j + \beta^T x, \qquad j = 1, \dots, c-1. \end{aligned}

Here the odds ratio uses only adjacent categories, whereas in the cumulative logit model it uses the entire response scale. Hence, the interpretation of the model is in terms of local odds ratios as opposed to cumulative odds ratios.

Having fit the model, the estimated probability of being equal to j is

\begin{aligned} P(y = j) = \dfrac{\exp (\hat{\alpha}_j + \hat{\beta}^T x)}{1 + \sum_{k=1}^{c-1} \exp(\hat{\alpha}_k + \hat{\beta}^T x) }. \end{aligned}

This is quite a different form from that for the cumulative logit model.

Continuation-ratio logit model

The continuation-ratio logit model assumes the relationship

\begin{aligned} \text{logit} [P(y = j \mid y \geq j)] = \log \dfrac{P(y = j \mid y \geq j)}{P(y > j \mid y \geq j)} = \alpha_j + \beta^T x, \qquad j = 1, \dots, c-1. \end{aligned}

Having fit the model, we have the estimated conditional probabilities

\begin{aligned} P(Y = j \mid Y \geq j) &= \dfrac{\exp (\hat{\alpha}_j + \hat{\beta}^T x)}{1 + \exp (\hat{\alpha}_j + \hat{\beta}^T x)}. \end{aligned}

With a bit of algebra we can convert them to unconditional probability estimates:

\begin{aligned} P(Y = j) = \dfrac{\exp (\hat{\alpha}_j + \hat{\beta}^T x)}{1 + \exp (\hat{\alpha}_j + \hat{\beta}^T x)} \prod_{k=1}^{j-1} \dfrac{1}{1 + \exp (\hat{\alpha}_k + \hat{\beta}^T x)}. \end{aligned}


  1. Agresti, A. (2010). Modeling Ordinal Categorical Data.