# Bayesian interpretation of ridge regression

Assume that we are in the standard supervised learning setting, where we have a response vector $y \in \mathbb{R}^n$ and a design matrix $X \in \mathbb{R}^{n \times p}$. Ordinary least squares seeks the coefficient vector $\beta \in \mathbb{R}^p$ which minimizes the residual sum of squares (RSS), i.e.

$\hat{\beta} = \underset{\beta}{\text{argmin}} \; (y- X\beta)^T (y - X\beta).$

Ridge regression is a commonly used regularization method that instead seeks the $\beta$ minimizing the sum of the RSS and a penalty term:

$\hat{\beta} = \underset{\beta}{\text{argmin}} \; (y- X\beta)^T (y - X\beta) + \lambda \| \beta\|_2^2,$

where $\|\beta\|_2^2 = \beta_1^2 + \dots + \beta_p^2$, and $\lambda \geq 0$ is a hyperparameter.
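For concreteness, here is a small sketch (using NumPy, with simulated data and an arbitrary penalty value) of the standard closed-form ridge solution $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$, which reduces to the OLS solution at $\lambda = 0$:

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Closed-form minimizer of RSS + lam * ||beta||_2^2.

    Solves (X^T X + lam * I) beta = X^T y; reduces to OLS at lam = 0.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data (arbitrary choices, for illustration only)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.standard_normal(100)

beta_ols = ridge_estimate(X, y, 0.0)    # ordinary least squares
beta_ridge = ridge_estimate(X, y, 10.0) # penalized; shrunk toward zero
```

For $\lambda > 0$ the ridge solution has strictly smaller norm than the OLS solution, which is the sense in which the penalty "shrinks" the coefficients.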

The ridge regression estimate has a Bayesian interpretation. Assume that the design matrix $X$ is fixed. The ordinary least squares model posits that the conditional distribution of the response $y$ is

$y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I),$

where $\sigma > 0$ is some constant. In the frequentist framework, we think of $\beta$ as a fixed unknown vector that we want to estimate. In Bayesian statistics, we instead impose a prior distribution on $\beta$ and base any estimation on the posterior distribution of $\beta$.
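As a quick sketch of this sampling model (the dimensions, noise scale $\sigma$, and coefficient values below are all arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
sigma = 1.5                          # assumed noise scale
beta = np.array([2.0, -1.0, 0.5])    # fixed unknown coefficients (frequentist view)

X = rng.standard_normal((n, p))                 # fixed design matrix
y = X @ beta + sigma * rng.standard_normal(n)   # y | X, beta ~ N(X beta, sigma^2 I)
```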

Let’s say our prior distribution for $\beta$ is that the $\beta_j$’s are independent normals with mean zero and the same variance, i.e. $\beta \sim \mathcal{N}(0, \tau^2 I)$ for some constant $\tau > 0$. This allows us to compute the posterior distribution of $\beta$:

$$\begin{aligned} p(\beta \mid y, X) &\propto p(\beta) \cdot p(y \mid X, \beta) \\ &\propto \exp \left[ - \frac{1}{2} (\beta - 0)^T \frac{1}{\tau^2} I (\beta - 0) \right] \cdot \exp \left[ -\frac{1}{2}(y - X\beta)^T \frac{1}{\sigma^2} I (y - X\beta) \right] \\ &= \exp \left[ -\frac{1}{2\sigma^2}(y-X\beta)^T (y - X \beta) - \frac{1}{2\tau^2} \|\beta\|_2^2 \right]. \end{aligned}$$

From this expression, we can compute the mode of the posterior distribution, which is also known as the maximum a posteriori (MAP) estimate. It is

$$\begin{aligned} \hat{\beta} &= \underset{\beta}{\text{argmax}} \quad \exp \left[ -\frac{1}{2\sigma^2}(y-X\beta)^T (y - X \beta) - \frac{1}{2\tau^2} \|\beta\|_2^2 \right] \\ &= \underset{\beta}{\text{argmin}} \quad \frac{1}{\sigma^2}(y-X\beta)^T (y - X \beta) + \frac{1}{\tau^2} \|\beta\|_2^2 \\ &= \underset{\beta}{\text{argmin}} \quad (y-X\beta)^T (y - X \beta) + \frac{\sigma^2}{\tau^2} \|\beta\|_2^2, \end{aligned}$$

which is the ridge regression estimate when $\lambda = \dfrac{\sigma^2}{\tau^2}$.
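We can check this equivalence numerically. The sketch below (simulated data; arbitrary choices of $\sigma$ and $\tau$) computes the ridge estimate with $\lambda = \sigma^2/\tau^2$ and the MAP estimate by setting the gradient of the log-posterior to zero, and confirms that they coincide:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
sigma, tau = 1.0, 0.5   # assumed noise and prior scales
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

lam = sigma**2 / tau**2  # lambda = sigma^2 / tau^2

# Ridge closed form: minimizer of RSS + lam * ||beta||_2^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# MAP estimate: the log-posterior is
#   -(1/(2 sigma^2)) RSS - (1/(2 tau^2)) ||beta||_2^2 + const,
# whose gradient vanishes at (X^T X / sigma^2 + I / tau^2)^{-1} X^T y / sigma^2.
beta_map = np.linalg.solve(X.T @ X / sigma**2 + np.eye(p) / tau**2,
                           X.T @ y / sigma**2)

print(np.allclose(beta_ridge, beta_map))  # True
```

Multiplying the MAP system through by $\sigma^2$ recovers the ridge system exactly, so the agreement holds for any $\sigma, \tau > 0$, not just these values.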

## 8 thoughts on “Bayesian interpretation of ridge regression”

1. Hi, it seems that you omit the 1/2 in the exponent of your formula, although it doesn’t affect the results.

• The 1/2 is hidden in the \propto sign.

• Sorry, I mean the 1/2 coefficient in the exponent, which I don’t think can be omitted.

• Ah yes you are right. We can’t drop the 1/2 in the exponent when computing the posterior (but we can drop it when computing the MAP estimate). I’ve fixed the post to reflect this.

2. Hi! I wonder what ‘I’ is in the normal distribution notation N(XB, sigma^2 I)?

• It refers to the identity matrix (of the correct size)!

• Thank you!
