Bayesian interpretation of ridge regression

Assume that we are in the standard supervised learning setting, where we have a response vector y \in \mathbb{R}^n and a design matrix X \in \mathbb{R}^{n \times p}. Ordinary least squares seeks the coefficient vector \beta \in \mathbb{R}^p which minimizes the residual sum of squares (RSS), i.e.

\hat{\beta} = \underset{\beta}{\text{argmin}} \; (y- X\beta)^T (y - X\beta).

Ridge regression is a commonly used regularization method which looks for \beta that minimizes the sum of the RSS and a penalty term:

\hat{\beta} = \underset{\beta}{\text{argmin}} \; (y- X\beta)^T (y - X\beta) + \lambda \| \beta\|_2^2,

where \|\beta\|_2^2 = \beta_1^2 + \dots + \beta_p^2, and \lambda \geq 0 is a hyperparameter.
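As a rough illustration of the two objectives above, here is a minimal NumPy sketch. It relies on the standard closed form obtained by setting the gradient of the ridge objective to zero, (X^T X + \lambda I) \beta = X^T y, which the text above does not derive. The function name `ridge_closed_form` and the simulated data are made up for this example.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize (y - X b)^T (y - X b) + lam * ||b||_2^2 over b."""
    p = X.shape[1]
    # Setting the gradient to zero gives (X^T X + lam * I) b = X^T y.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

beta_ols = ridge_closed_form(X, y, 0.0)     # lam = 0 recovers least squares
beta_ridge = ridge_closed_form(X, y, 10.0)  # lam > 0 shrinks the coefficients

print("OLS:  ", beta_ols)
print("ridge:", beta_ridge)
```

With \lambda = 0 the solution coincides with ordinary least squares (assuming X^T X is invertible), and increasing \lambda pulls the coefficient vector toward zero.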

The ridge regression estimate has a Bayesian interpretation. Assume that the design matrix X is fixed. The Gaussian linear model underlying ordinary least squares posits that the conditional distribution of the response y is

y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I),

where \sigma > 0 is some constant. In the frequentist approach, we think of \beta as a fixed unknown vector that we want to estimate. In the Bayesian approach, we instead place a prior distribution on \beta and base any estimation on the posterior distribution of \beta given the data.

Let's say our prior on \beta is that the \beta_j's are independent normals with mean zero and the same variance, i.e. \beta \sim \mathcal{N}(0, \tau^2 I) for some constant \tau > 0. This allows us to compute the posterior distribution of \beta:

\begin{aligned} p(\beta \mid y, X) &\propto p(\beta) \cdot p(y \mid X, \beta) \\ &\propto \exp \left[ - \frac{1}{2} (\beta - 0)^T \frac{1}{\tau^2} I (\beta - 0) \right] \cdot \exp \left[ -\frac{1}{2}(y - X\beta)^T \frac{1}{\sigma^2} (y - X\beta) \right] \\ &= \exp \left[ -\frac{1}{2\sigma^2}(y-X\beta)^T (y - X \beta) - \frac{1}{2\tau^2} \|\beta\|_2^2 \right]. \end{aligned}

From this expression, we can compute the mode of the posterior distribution, which is also known as the maximum a posteriori (MAP) estimate. It is

\begin{aligned} \hat{\beta} &= \underset{\beta}{\text{argmax}} \quad \exp \left[ -\frac{1}{2\sigma^2}(y-X\beta)^T (y - X \beta) - \frac{1}{2\tau^2} \|\beta\|_2^2 \right] \\  &= \underset{\beta}{\text{argmin}} \quad \frac{1}{\sigma^2}(y-X\beta)^T (y - X \beta) + \frac{1}{\tau^2} \|\beta\|_2^2 \\  &= \underset{\beta}{\text{argmin}} \quad (y-X\beta)^T (y - X \beta) + \frac{\sigma^2}{\tau^2} \|\beta\|_2^2, \end{aligned}

which is the ridge regression estimate when \lambda = \dfrac{\sigma^2}{\tau^2}.
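The identity can be checked numerically. The sketch below, which assumes illustrative values for \sigma and \tau and simulated data (none of which come from the derivation above), computes the ridge solution with \lambda = \sigma^2 / \tau^2 through the usual closed form and verifies that it is a stationary point of the negative log posterior, and that nearby points have a larger negative log posterior.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
sigma, tau = 0.5, 2.0  # illustrative noise and prior scales (assumed)

X = rng.normal(size=(n, p))
beta_true = rng.normal(scale=tau, size=p)            # a draw from the prior
y = X @ beta_true + rng.normal(scale=sigma, size=n)  # y | X, beta ~ N(X beta, sigma^2 I)

def neg_log_posterior(beta):
    """Negative log posterior of beta, up to an additive constant."""
    resid = y - X @ beta
    return resid @ resid / (2 * sigma**2) + beta @ beta / (2 * tau**2)

# Ridge estimate with lambda = sigma^2 / tau^2.
lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Gradient of the negative log posterior, evaluated at the ridge estimate.
grad = -X.T @ (y - X @ beta_ridge) / sigma**2 + beta_ridge / tau**2

# Small random perturbations should only increase the negative log posterior.
base = neg_log_posterior(beta_ridge)
perturbed = [neg_log_posterior(beta_ridge + 0.01 * rng.normal(size=p))
             for _ in range(5)]

print("max |gradient| at ridge estimate:", np.abs(grad).max())
print("ridge estimate is a local minimum:", all(v > base for v in perturbed))
```

Because the negative log posterior is a strictly convex quadratic in \beta, a stationary point is the unique minimizer, so the gradient check alone already identifies the MAP estimate; the perturbation check is just a sanity test.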
