# Generalized ridge regression and smoothing splines

## Generalized ridge regression

Assume that we are in the standard supervised learning setting, where we have a response vector $y \in \mathbb{R}^n$ and a design matrix ${\bf{X}} \in \mathbb{R}^{n \times p}$. Ridge regression is a commonly used regularization method which looks for $\beta$ that minimizes the sum of the RSS and a penalty term: $\hat{\beta} = \underset{\beta}{\text{argmin}} \; (y- {\bf{X}}\beta)^T (y - {\bf{X}}\beta) + \lambda \beta^T \beta,$

where $\lambda \geq 0$ is a hyperparameter. In generalized ridge regression, we solve a more complicated problem: $\hat{\beta} = \underset{\beta}{\text{argmin}} \; (y- {\bf{X}}\beta)^T {\bf{W}} (y - {\bf{X}}\beta) + \beta^T {\bf{\Omega}} \beta,$

where (i) ${\bf{W}} \in \mathbb{R}^{n \times n}$ is a (usually diagonal) weight matrix that gives different observations different weights, and (ii) ${\bf{\Omega}} \in \mathbb{R}^{p \times p}$ is a penalty matrix. (We’ve absorbed the hyperparameter $\lambda$ into $\bf{\Omega}$.)

(Note: Sometimes when people refer to “generalized ridge regression”, they mean the problem above but with ${\bf{W}} = {\bf{I}}_{n \times n}$ and ${\bf{\Omega}} = {\bf{V \Lambda V}}^T$, where ${\bf{V}}$ is from the singular value decomposition ${\bf{X}} = {\bf{UDV}}^T$ and ${\bf{\Lambda}}$ is some positive definite diagonal matrix.)

The generalized ridge regression problem can be solved in the same way as ordinary ridge regression. Taking the derivative of the objective with respect to $\beta$ and setting it equal to zero (assuming ${\bf{\Omega}}$ is symmetric, as penalty matrices typically are), we obtain \begin{aligned} 2 {\bf{X}}^T {\bf{W}}y - 2 {\bf{X}}^T {\bf{WX}}\hat{\beta} - 2 {\bf{\Omega}}\hat{\beta} = 0. \end{aligned}

If ${\bf{X}}^T {\bf{WX}} + {\bf{\Omega}}$ is invertible, then we can solve for $\hat{\beta}$: \begin{aligned} \hat{\beta} = ({\bf{X}}^T {\bf{WX}} + {\bf{\Omega}})^{-1} {\bf{X}}^T {\bf{W}}y. \end{aligned}
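As a quick sanity check, here is a minimal numpy sketch of this closed-form solution (the toy data and variable names are purely illustrative; we solve the linear system rather than forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# Observation weights (diagonal W) and a penalty matrix Omega.
# Taking Omega = lambda * I recovers ordinary ridge regression.
W = np.diag(rng.uniform(0.5, 1.5, size=n))
Omega = 1.0 * np.eye(p)

# beta_hat = (X^T W X + Omega)^{-1} X^T W y, via a linear solve.
beta_hat = np.linalg.solve(X.T @ W @ X + Omega, X.T @ W @ y)
print(beta_hat)
```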

## Smoothing splines

Given a set of observations $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^2$, a smoothing spline is the function $\hat{f}_\lambda$ which is the solution to $\hat{f}_\lambda = \underset{f}{\text{argmin}} \; \displaystyle\sum_{i=1}^n [y_i - f(x_i)]^2 + \lambda \displaystyle\int [f''(t)]^2 dt,$

where $\lambda \in [0, \infty]$ is a smoothing hyperparameter, and the argmin is taken over an appropriate Sobolev space for which the second term above is well-defined.

When $\lambda = 0$, $\hat{f}$ can be any function that interpolates the data. When $\lambda = \infty$, we must have $f''(t) = 0$ for almost every $t$, meaning that $\hat{f}$ must be linear: it is the least squares line. As $\lambda$ increases from 0 to $\infty$, the fitted functions go from very rough to very smooth.
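If you want to see these two extremes numerically, recent versions of SciPy (1.10+, if I recall correctly) ship `scipy.interpolate.make_smoothing_spline`, which fits exactly this kind of penalized spline. A rough check below; the function name and signature are an assumption about your SciPy version, and the toy data is illustrative:

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline  # SciPy >= 1.10 (assumed)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, size=30))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Essentially zero lambda: the fit should (nearly) interpolate the data.
rough = make_smoothing_spline(x, y, lam=1e-10)
print(np.max(np.abs(rough(x) - y)))          # close to 0

# Very large lambda: the fit should approach the least squares line.
smooth = make_smoothing_spline(x, y, lam=1e6)
slope, intercept = np.polyfit(x, y, deg=1)
print(np.max(np.abs(smooth(x) - (slope * x + intercept))))  # small
```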

It turns out that for $\lambda \in (0, \infty)$, we can show that the smoothing spline is a natural cubic spline with knots at the unique values of the $x_i$ (see Section 5.2.1 of Elements of Statistical Learning).

## How does generalized ridge regression relate to smoothing splines?

Since a smoothing spline is a natural cubic spline with knots at the unique values of the $x_i$, we can write the solution as a linear combination of the natural spline basis functions: \begin{aligned} f(x) = \sum_{j=1}^n N_j(x) \theta_j, \end{aligned}

where $N_1, \dots, N_n$ are the natural spline basis functions (with the $x_i$ assumed sorted and distinct): \begin{aligned} N_1(x) &= 1, \\ N_2(x) &= x, \\ N_{k+2}(x) &= \frac{(x - x_k)_+^3 - (x - x_n)_+^3}{x_n - x_k} - \frac{(x - x_{n-1})_+^3 - (x - x_n)_+^3}{x_n - x_{n-1}}, \quad k = 1, 2, \dots, n-2. \end{aligned}
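Here is a direct numpy transcription of these basis functions (a sketch; it assumes the $x_i$ are sorted and distinct, and the helper name is just for illustration):

```python
import numpy as np

def natural_spline_basis(x, knots):
    """Evaluate the natural cubic spline basis N_1, ..., N_n (knots sorted,
    distinct) at the points x, following the formulas above."""
    x, knots = np.asarray(x, float), np.asarray(knots, float)
    n = knots.size

    def d(k):  # d_k(x) = [(x - x_k)_+^3 - (x - x_n)_+^3] / (x_n - x_k), k is 1-indexed
        return (np.maximum(x - knots[k - 1], 0) ** 3
                - np.maximum(x - knots[-1], 0) ** 3) / (knots[-1] - knots[k - 1])

    cols = [np.ones_like(x), x]                          # N_1, N_2
    cols += [d(k) - d(n - 1) for k in range(1, n - 1)]   # N_{k+2}, k = 1, ..., n-2
    return np.column_stack(cols)

knots = np.sort(np.random.default_rng(0).uniform(0, 1, size=8))
N = natural_spline_basis(knots, knots)   # n x n matrix of basis values at the knots
print(N.shape)                           # (8, 8)
```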

Writing ${\bf{N}} \in \mathbb{R}^{n \times n}$ with $({\bf{N}})_{ij} = N_j(x_i)$, and ${\bf{\Omega_N}} \in \mathbb{R}^{n \times n}$ with $({\bf{\Omega_N}})_{ij} = \int N_i''(t) N_j''(t) \, dt$, the minimization problem over $f$ now becomes a minimization problem over $\theta = (\theta_1, \dots, \theta_n)^T \in \mathbb{R}^n$: $\hat{\theta}_\lambda = \underset{\theta \in \mathbb{R}^n}{\text{argmin}} \; (y - {\bf{N}}\theta)^T (y - {\bf{N}}\theta) + \lambda \theta^T {\bf{\Omega_N}} \theta,$

which is a generalized ridge regression problem with ${\bf{W}} = {\bf{I}}$ and ${\bf{\Omega}} = \lambda {\bf{\Omega_N}}$! Hence, we can solve for $\theta$: $\hat{\theta}_\lambda = ({\bf{N}}^T {\bf{N}} + \lambda {\bf{\Omega_N}})^{-1} {\bf{N}}^T y.$
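Putting the pieces together, here is a self-contained sketch that builds ${\bf{N}}$, approximates ${\bf{\Omega_N}}$ by numerically differentiating and integrating the basis functions on a fine grid, and then solves the generalized ridge problem. Everything here (the finite-difference second derivatives, the Riemann-sum integral, the choice of $\lambda$) is an illustrative approximation, not a production smoother:

```python
import numpy as np

def natural_spline_basis(x, knots):
    """N_1, ..., N_n evaluated at x (knots sorted and distinct)."""
    x, knots = np.asarray(x, float), np.asarray(knots, float)
    n = knots.size
    def d(k):
        return (np.maximum(x - knots[k - 1], 0) ** 3
                - np.maximum(x - knots[-1], 0) ** 3) / (knots[-1] - knots[k - 1])
    cols = [np.ones_like(x), x] + [d(k) - d(n - 1) for k in range(1, n - 1)]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, size=20))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Basis matrix N with (N)_ij = N_j(x_i).
N = natural_spline_basis(x, x)

# Approximate Omega_N: evaluate the basis on a fine grid, take second
# derivatives by finite differences, integrate products by a Riemann sum.
# (Outside [x_1, x_n] the second derivatives are zero, so this range suffices.)
grid = np.linspace(x[0], x[-1], 2001)
h = grid[1] - grid[0]
B = natural_spline_basis(grid, x)
B2 = np.gradient(np.gradient(B, h, axis=0), h, axis=0)  # approximate N_j''
Omega = (B2.T @ B2) * h

# Generalized ridge solution: theta_hat = (N^T N + lam * Omega_N)^{-1} N^T y.
lam = 1e-4
theta_hat = np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)

fitted = N @ theta_hat   # fitted values of the smoothing spline at the observed x's
print(np.round(fitted - y, 3))
```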

The one remaining loose end is to show that ${\bf{N}}^T {\bf{N}} + \lambda {\bf{\Omega_N}}$ is invertible. Slide 2 of reference 2 is a one-slide proof that the matrix is positive definite, and hence it must be invertible.

References:

1. van Wieringen, W. Ridge regression.
2. Hansen, N. R. (2009). Statistics Learning.

# Bayesian interpretation of ridge regression

Assume that we are in the standard supervised learning setting, where we have a response vector $y \in \mathbb{R}^n$ and a design matrix $X \in \mathbb{R}^{n \times p}$. Ordinary least squares seeks the coefficient vector $\beta \in \mathbb{R}^p$ which minimizes the residual sum of squares (RSS), i.e. $\hat{\beta} = \underset{\beta}{\text{argmin}} \; (y- X\beta)^T (y - X\beta).$

Ridge regression is a commonly used regularization method which looks for $\beta$ that minimizes the sum of the RSS and a penalty term: $\hat{\beta} = \underset{\beta}{\text{argmin}} \; (y- X\beta)^T (y - X\beta) + \lambda \| \beta\|_2^2,$

where $\|\beta\|_2^2 = \beta_1^2 + \dots + \beta_p^2$, and $\lambda \geq 0$ is a hyperparameter.

The ridge regression estimate has a Bayesian interpretation. Assume that the design matrix $X$ is fixed. The ordinary least squares model posits that the conditional distribution of the response $y$ is $y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I),$

where $\sigma > 0$ is some constant. In frequentist statistics, we think of $\beta$ as a fixed but unknown vector that we want to estimate. In Bayesian statistics, we can impose a prior distribution on $\beta$ and perform any estimation we want using the posterior distribution of $\beta$.

Let's say our prior distribution of $\beta$ is that the $\beta_j$'s are independent normals with the same variance, i.e. $\beta \sim \mathcal{N}(0, \tau^2 I)$ for some constant $\tau > 0$. This allows us to compute the posterior distribution of $\beta$: \begin{aligned} p(\beta \mid y, X) &\propto p(\beta) \cdot p(y \mid X, \beta) \\ &\propto \exp \left[ - \frac{1}{2} (\beta - 0)^T \frac{1}{\tau^2} I (\beta - 0) \right] \cdot \exp \left[ -\frac{1}{2}(y - X\beta)^T \frac{1}{\sigma^2} I (y - X\beta) \right] \\ &= \exp \left[ -\frac{1}{2\sigma^2}(y-X\beta)^T (y - X \beta) - \frac{1}{2\tau^2} \|\beta\|_2^2 \right]. \end{aligned}

From this expression, we can compute the mode of the posterior distribution, which is also known as the maximum a posteriori (MAP) estimate. It is \begin{aligned} \hat{\beta} &= \underset{\beta}{\text{argmax}} \quad \exp \left[ -\frac{1}{2\sigma^2}(y-X\beta)^T (y - X \beta) - \frac{1}{2\tau^2} \|\beta\|_2^2 \right] \\ &= \underset{\beta}{\text{argmin}} \quad \frac{1}{\sigma^2}(y-X\beta)^T (y - X \beta) + \frac{1}{\tau^2} \|\beta\|_2^2 \\ &= \underset{\beta}{\text{argmin}} \quad (y-X\beta)^T (y - X \beta) + \frac{\sigma^2}{\tau^2} \|\beta\|_2^2, \end{aligned}

which is the ridge regression estimate when $\lambda = \dfrac{\sigma^2}{\tau^2}$.
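We can check this correspondence numerically: minimize the negative log posterior directly (here with `scipy.optimize.minimize`) and compare the result with the closed-form ridge estimate at $\lambda = \sigma^2 / \tau^2$. The toy data and constants below are purely illustrative:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 100, 5
sigma, tau = 1.0, 0.5

X = rng.normal(size=(n, p))
beta_true = rng.normal(scale=tau, size=p)
y = X @ beta_true + rng.normal(scale=sigma, size=n)

# Negative log posterior (up to an additive constant).
def neg_log_posterior(beta):
    rss = np.sum((y - X @ beta) ** 2)
    return rss / (2 * sigma**2) + np.sum(beta**2) / (2 * tau**2)

map_estimate = minimize(neg_log_posterior, x0=np.zeros(p)).x

# Closed-form ridge estimate with lambda = sigma^2 / tau^2.
lam = sigma**2 / tau**2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.max(np.abs(map_estimate - ridge)))  # should agree to several decimal places
```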

# X^TX and ridge regression

Remember that $X^TX$ is positive semi-definite (PSD) for any matrix $X \in \mathbb{R}^{m \times n}$? That implies that for any $\lambda > 0$, the matrix $X^TX + \lambda I$ is positive definite (no semi).

The proof is simple: $c$ is an eigenvalue of $X^TX$ if and only if $c + \lambda$ is an eigenvalue of $X^TX + \lambda I$ (if $X^TX v = cv$, then $(X^TX + \lambda I)v = (c + \lambda)v$, and vice versa). $X^TX$ being PSD means that all its eigenvalues are non-negative. This in turn implies that all the eigenvalues of $X^TX + \lambda I$ are positive, i.e. it is positive definite.
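A quick numerical illustration of the eigenvalue shift (the matrix and $\lambda$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
lam = 0.7

# Eigenvalues of X^T X + lam * I are exactly the eigenvalues of X^T X, shifted by lam.
eig_plain = np.linalg.eigvalsh(X.T @ X)
eig_ridge = np.linalg.eigvalsh(X.T @ X + lam * np.eye(4))
print(np.allclose(eig_plain + lam, eig_ridge))  # True
```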

In particular, this means that for any $\lambda > 0$, the matrix $X^TX + \lambda I$ is invertible. And this fact is what makes ridge regression work!

Let’s recall the set-up for ridge regression. We have a response $y \in \mathbb{R}^n$ we wish to predict using features, which we organize into a data matrix $X \in \mathbb{R}^{n \times p}$. In linear regression, our predictions are of the form $X \hat{\beta}$, where $\hat{\beta}$ is the solution to the optimization problem $\text{minimize}_\beta \quad \dfrac{1}{2} \| y - X \beta \|_2^2.$

Differentiating the objective function w.r.t. $\beta$ and setting it equal to 0, we find that $\hat{\beta}$ must satisfy the normal equations $X^T X \hat{\beta} = X^T y$. If $X^T X$ is invertible, we can multiply both sides by its inverse to obtain the solution $\hat{\beta} = (X^T X)^{-1}X^T y$.

What if $X^TX$ is not invertible? (Remember, $X^TX$ must be PSD, but not necessarily PD.) One possible solution is to add a "ridge" of $\lambda$s down the diagonal of the matrix to obtain the equation $(X^TX + \lambda I) \hat{\beta} = X^Ty$. With this modification and the fact that $X^TX + \lambda I$ is invertible, we obtain the ridge regression solution $\hat{\beta} =(X^TX + \lambda I)^{-1}X^T y$.
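To see why the ridge helps, here is a small sketch with $p > n$, so that $X^TX$ is necessarily singular, yet the ridge system still has a unique solution; all names and numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                      # more features than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# rank(X^T X) <= n = 20 < p = 50, so X^T X is singular and has no inverse.
print(np.linalg.matrix_rank(X.T @ X))

lam = 1.0
# The ridge system (X^T X + lam I) beta_hat = X^T y has a unique solution.
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_hat.shape)  # (50,)
```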