Generalized ridge regression and smoothing splines

Generalized ridge regression

Assume that we are in the standard supervised learning setting, where we have a response vector y \in \mathbb{R}^n and a design matrix {\bf{X}} \in \mathbb{R}^{n \times p}. Ridge regression is a commonly used regularization method which looks for the \beta that minimizes the sum of the residual sum of squares (RSS) and a penalty term:

\hat{\beta} = \underset{\beta}{\text{argmin}} \; (y- {\bf{X}}\beta)^T (y - {\bf{X}}\beta) + \lambda \beta^T  \beta,

where \lambda \geq 0 is a hyperparameter. In generalized ridge regression, we solve a more complicated problem:

\hat{\beta} = \underset{\beta}{\text{argmin}} \; (y- {\bf{X}}\beta)^T {\bf{W}} (y - {\bf{X}}\beta) + \beta^T {\bf{\Omega}} \beta,

where (i) {\bf{W}} \in \mathbb{R}^{n \times n} is a (usually diagonal) weight matrix that gives different observations different weights, and (ii) {\bf{\Omega}} \in \mathbb{R}^{p \times p} is a penalty matrix. (We’ve absorbed the hyperparameter \lambda into \bf{\Omega}.)

(Note: Sometimes when people refer to “generalized ridge regression”, they mean the problem above but with {\bf{W}} = {\bf{I}}_{n \times n} and \Omega = {\bf{V \Lambda V}^T}, where \bf{V} is from the singular value decomposition {\bf{X}} = {\bf{UDV}^T} and \bf{\Lambda} is some positive definite diagonal matrix.)

The generalized ridge regression problem can be solved in the same way as ordinary ridge regression. Taking the derivative of the objective with respect to \beta and setting it equal to zero, we obtain

\begin{aligned} 2 {\bf{X}}^T {\bf{W}}y - 2 {\bf{X}}^T {\bf{WX}}\hat{\beta} - 2 {\bf{\Omega}}\hat{\beta} = 0. \end{aligned}

If {\bf{X}}^T {\bf{WX}} + {\bf{\Omega}} is invertible, then we can solve for \hat{\beta}:

\begin{aligned} \hat{\beta} = ({\bf{X}}^T {\bf{WX}} + {\bf{\Omega}})^{-1} {\bf{X}}^T {\bf{W}}y. \end{aligned}
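
As a quick sanity check, here is a minimal NumPy sketch of this closed form (the data, weight matrix and penalty matrix below are made up for illustration, and a linear solve is used instead of an explicit inverse):

    import numpy as np

    def generalized_ridge(X, y, W, Omega):
        # Solve argmin_beta (y - X beta)^T W (y - X beta) + beta^T Omega beta
        A = X.T @ W @ X + Omega        # X^T W X + Omega
        b = X.T @ W @ y                # X^T W y
        return np.linalg.solve(A, b)   # (X^T W X + Omega)^{-1} X^T W y

    # Toy example: n = 50 observations, p = 5 features
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=50)
    W = np.diag(rng.uniform(0.5, 1.5, size=50))   # diagonal observation weights
    Omega = 0.1 * np.eye(5)                       # ordinary ridge penalty lambda * I as a special case
    beta_hat = generalized_ridge(X, y, W, Omega)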

Smoothing splines

Given a set of observations (x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^2, a smoothing spline is the function \hat{f}_\lambda which is the solution to

\hat{f}_\lambda = \underset{f}{\text{argmin}} \; \displaystyle\sum_{i=1}^n [y_i - f(x_i)]^2 + \lambda \displaystyle\int [f''(t)]^2 dt,

where \lambda \in [0, \infty] is a smoothing hyperparameter, and the argmin is taken over an appropriate Sobolev space for which the second term above is well-defined.

When \lambda = 0, \hat{f} can be any function that interpolates the data. When \lambda = \infty, we must have f''(t) = 0 for almost all t, meaning that \hat{f} must be the least squares linear fit. As \lambda increases from 0 to \infty, the fitted functions go from very rough to very smooth.

It turns out that for \lambda \in (0, \infty), we can show that the smoothing spline is a natural cubic spline with knots at the unique values of the x_i (see Section 5.2.1 of Elements of Statistical Learning).
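
To see the effect of \lambda in practice, we can fit smoothing splines at a few values of the hyperparameter. The sketch below assumes SciPy \geq 1.10, whose scipy.interpolate.make_smoothing_spline minimizes the penalized criterion above (its lam argument playing the role of \lambda):

    import numpy as np
    from scipy.interpolate import make_smoothing_spline

    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0, 10, size=80))
    y = np.sin(x) + rng.normal(scale=0.3, size=80)

    # Small lam: rough fit close to interpolation; large lam: close to the least squares line
    for lam in [1e-3, 1.0, 1e3]:
        spline = make_smoothing_spline(x, y, lam=lam)   # returns a BSpline object
        fitted = spline(x)                              # fitted values at the observed x's
        print(lam, np.mean((y - fitted) ** 2))          # training error grows as lam increases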

How does generalized ridge regression relate to smoothing splines?

Since a smoothing spline is a natural cubic spline with knots at the unique values of the x_i, we can write the solution as a linear combination of the natural spline basis functions:

\begin{aligned} f(x) = \sum_{j=1}^n N_j(x) \theta_j, \end{aligned}

where N_1, \dots, N_n are the natural spline basis functions:

\begin{aligned} N_1(x) &= 1, \\  N_2(x) &= x, \\  N_{k+2}(x) &= \frac{(x - x_k)_+^3 - (x - x_n)_+^3}{x_n - x_k} - \frac{(x - x_{n-1})_+^3 - (x - x_n)_+^3}{x_n - x_{n-1}}, \quad k = 1, 2, \dots, n-2. \end{aligned}

Writing {\bf{N}} \in \mathbb{R}^{n \times n} with ({\bf{N}})_{ij} = N_j(x_i), and {\bf{\Omega_N}} \in \mathbb{R}^{n \times n} with ({\bf{\Omega_N}})_{ij} = \int N_i''(t) N_j''(t) \, dt, the minimization problem over f now becomes a minimization problem over \theta = (\theta_1, \dots, \theta_n)^T \in \mathbb{R}^n:

\hat{\theta}_\lambda = \underset{\theta \in \mathbb{R}^n}{\text{argmin}} \;  (y - {\bf{N}}\theta)^T (y - {\bf{N}}\theta) + \lambda \theta^T {\bf{\Omega_N}} \theta,

which is a generalized ridge regression problem! Hence, we can solve for \theta:

\hat{\theta}_\lambda = ({\bf{N}}^T {\bf{N}} + \lambda {\bf{\Omega_N}})^{-1} {\bf{N}}^T y.
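
To make the connection concrete, here is a rough NumPy sketch that builds the basis matrix {\bf{N}} from the truncated power basis above, approximates {\bf{\Omega_N}} by numerical integration of the second derivatives on a fine grid, and then applies the generalized ridge formula. (The function names are my own and this is only an illustration, not an efficient smoothing spline implementation.)

    import numpy as np

    def natural_spline_basis(x, knots):
        # Evaluate N_1, ..., N_n at the points x, using the truncated power
        # basis from the post: N_1 = 1, N_2 = x, N_{k+2} = d_k - d_{n-1}.
        x = np.asarray(x, dtype=float)
        n = len(knots)
        def d(k):   # 0-based knot index
            return (np.maximum(x - knots[k], 0) ** 3
                    - np.maximum(x - knots[-1], 0) ** 3) / (knots[-1] - knots[k])
        cols = [np.ones_like(x), x]
        for k in range(n - 2):            # corresponds to k = 1, ..., n-2 in the post
            cols.append(d(k) - d(n - 2))
        return np.column_stack(cols)      # shape (len(x), n)

    def basis_second_derivatives(x, knots):
        # Second derivatives of the same basis functions (zero for N_1 and N_2).
        x = np.asarray(x, dtype=float)
        n = len(knots)
        def d2(k):
            return 6 * (np.maximum(x - knots[k], 0)
                        - np.maximum(x - knots[-1], 0)) / (knots[-1] - knots[k])
        cols = [np.zeros_like(x), np.zeros_like(x)]
        for k in range(n - 2):
            cols.append(d2(k) - d2(n - 2))
        return np.column_stack(cols)

    def fit_smoothing_spline(x, y, lam):
        knots = np.unique(x)                           # knots at the unique x values
        N = natural_spline_basis(x, knots)
        grid = np.linspace(knots[0], knots[-1], 2000)  # grid for numerical integration
        D2 = basis_second_derivatives(grid, knots)
        dt = grid[1] - grid[0]
        Omega = (D2.T @ D2) * dt                       # Omega_N[i, j] ~ integral of N_i'' N_j''
        theta = np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)
        return knots, theta

    # Toy usage
    rng = np.random.default_rng(2)
    x = np.sort(rng.uniform(0, 1, size=30))
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)
    knots, theta = fit_smoothing_spline(x, y, lam=1e-4)
    fitted = natural_spline_basis(x, knots) @ theta    # N theta_hat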

The one remaining loose end is to show that {\bf{N}}^T {\bf{N}} + \lambda {\bf{\Omega_N}} is invertible. Slide 2 of reference 2 gives a one-slide proof that this matrix is positive definite, and hence it must be invertible.

References:

  1. van Wieringen, W. Ridge regression.
  2. Hansen, N. R. (2009). Statistics Learning.

Bayesian interpretation of ridge regression

Assume that we are in the standard supervised learning setting, where we have a response vector y \in \mathbb{R}^n and a design matrix X \in \mathbb{R}^{n \times p}. Ordinary least squares seeks the coefficient vector \beta \in \mathbb{R}^p which minimizes the residual sum of squares (RSS), i.e.

\hat{\beta} = \underset{\beta}{\text{argmin}} \; (y- X\beta)^T (y - X\beta).

Ridge regression is a commonly used regularization method which looks for \beta that minimizes the sum of the RSS and a penalty term:

\hat{\beta} = \underset{\beta}{\text{argmin}} \; (y- X\beta)^T (y - X\beta) + \lambda \| \beta\|_2^2,

where \|\beta\|_2^2 = \beta_1^2 + \dots + \beta_p^2, and \lambda \geq 0 is a hyperparameter.

The ridge regression estimate has a Bayesian interpretation. Assume that the design matrix X is fixed. The ordinary least squares model posits that the conditional distribution of the response y is

y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I),

where \sigma > 0 is some constant. In frequentist statistics, we think of \beta as a fixed but unknown vector that we want to estimate. In Bayesian statistics, we can impose a prior distribution on \beta and perform any estimation we want using the posterior distribution of \beta.

Let’s say our prior distribution for \beta is that the \beta_j's are independent normals with the same variance, i.e. \beta \sim \mathcal{N}(0, \tau^2 I) for some constant \tau > 0. This allows us to compute the posterior distribution of \beta:

\begin{aligned} p(\beta \mid y, X) &\propto p(\beta) \cdot p(y \mid X, \beta) \\ &\propto \exp \left[ - \frac{1}{2} (\beta - 0)^T \frac{1}{\tau^2} I (\beta - 0) \right] \cdot \exp \left[ -\frac{1}{2}(y - X\beta)^T \frac{1}{\sigma^2} I (y - X\beta) \right] \\ &= \exp \left[ -\frac{1}{2\sigma^2}(y-X\beta)^T (y - X \beta) - \frac{1}{2\tau^2} \|\beta\|_2^2 \right]. \end{aligned}

From this expression, we can compute the mode of the posterior distribution, which is also known as the maximum a posteriori (MAP) estimate. It is

\begin{aligned} \hat{\beta} &= \underset{\beta}{\text{argmax}} \quad \exp \left[ -\frac{1}{2\sigma^2}(y-X\beta)^T (y - X \beta) - \frac{1}{2\tau^2} \|\beta\|_2^2 \right] \\  &= \underset{\beta}{\text{argmin}} \quad \frac{1}{\sigma^2}(y-X\beta)^T (y - X \beta) + \frac{1}{\tau^2} \|\beta\|_2^2 \\  &= \underset{\beta}{\text{argmin}} \quad (y-X\beta)^T (y - X \beta) + \frac{\sigma^2}{\tau^2} \|\beta\|_2^2, \end{aligned}

which is the ridge regression estimate when \lambda = \dfrac{\sigma^2}{\tau^2}.
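
As a quick numerical check of this equivalence (assuming NumPy and SciPy are available), we can minimize the negative log-posterior directly and compare with the closed-form ridge estimate at \lambda = \sigma^2 / \tau^2:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(3)
    n, p = 100, 4
    X = rng.normal(size=(n, p))
    sigma, tau = 1.0, 0.5
    y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + sigma * rng.normal(size=n)

    # Negative log-posterior, up to additive constants
    def neg_log_posterior(beta):
        resid = y - X @ beta
        return resid @ resid / (2 * sigma**2) + beta @ beta / (2 * tau**2)

    beta_map = minimize(neg_log_posterior, x0=np.zeros(p)).x

    # Closed-form ridge estimate with lambda = sigma^2 / tau^2
    lam = sigma**2 / tau**2
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    print(np.max(np.abs(beta_map - beta_ridge)))   # should be numerically zero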

X^TX and ridge regression

Remember that X^TX is positive semi-definite (PSD) for any matrix X \in \mathbb{R}^{m \times n}? That implies that for any \lambda > 0, the matrix X^TX + \lambda I is positive definite (no semi).

The proof is simple: c is an eigenvalue of X^TX if and only if c + \lambda is an eigenvalue of X^TX + \lambda I (the two matrices share the same eigenvectors). X^TX being PSD means that all its eigenvalues are non-negative. This in turn implies that all the eigenvalues of X^TX + \lambda I are positive, i.e. it is positive definite.
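
A quick numerical illustration of the eigenvalue shift (toy X with p > n so that X^TX is rank-deficient; \lambda chosen arbitrarily):

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(5, 8))            # p > n, so X^T X (8 x 8) is PSD but singular
    lam = 0.1
    eig_psd = np.linalg.eigvalsh(X.T @ X)
    eig_pd = np.linalg.eigvalsh(X.T @ X + lam * np.eye(8))
    print(np.allclose(eig_pd, eig_psd + lam))   # eigenvalues are shifted up by exactly lambda
    print(eig_pd.min() > 0)                     # all eigenvalues positive: positive definite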

In particular, this means that for any \lambda > 0, the matrix X^TX + \lambda I is invertible. And this fact is what makes ridge regression work!

Let’s recall the set-up for ridge regression. We have a response y \in \mathbb{R}^n we wish to predict using features, which we organize into a data matrix X \in \mathbb{R}^{n \times p}. In linear regression, our predictions are of the form X \hat{\beta}, where \hat{\beta} is the solution to the optimization problem

\text{minimize}_\beta \quad \dfrac{1}{2} \| y - X \beta \|_2^2.

Differentiating the objective function w.r.t. \beta and setting it equal to 0, we find that \hat{\beta} must satisfy the normal equations X^T X \hat{\beta} = X^T y. If X^T X is invertible, we can multiply both sides by its inverse to obtain the solution \hat{\beta} = (X^T X)^{-1}X^T y.

What if X^TX is not invertible? (Remember, X^TX must be PSD, but not necessarily PD.) One possible fix is to add a “ridge” of \lambda's down the diagonal of the matrix, giving the equation (X^TX + \lambda I) \hat{\beta} = X^Ty. With this modification and the fact that X^TX + \lambda I is invertible, we obtain the ridge regression solution \hat{\beta} =(X^TX + \lambda I)^{-1}X^T y.
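
To make this concrete, here is a small sketch (toy data with a duplicated column) where the normal equations cannot be solved by inverting X^TX, but the ridge system is solvable for any \lambda > 0:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 20
    x1 = rng.normal(size=n)
    X = np.column_stack([x1, x1, rng.normal(size=n)])   # first two columns are identical
    y = 3 * x1 + rng.normal(size=n)

    print(np.linalg.matrix_rank(X.T @ X))   # 2 < 3, so X^T X is singular

    lam = 1.0
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(beta_ridge)   # well-defined; the duplicated columns split the coefficient on x1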