Assume that we are in the standard supervised learning setting, where we have a response vector $y \in \mathbb{R}^n$ and a design matrix $X \in \mathbb{R}^{n \times p}$. Ordinary least squares seeks the coefficient vector $\beta \in \mathbb{R}^p$ which minimizes the *residual sum of squares (RSS)*, i.e.

$$\hat{\beta}_{OLS} = \underset{\beta}{\text{argmin}} \; \| y - X\beta \|_2^2.$$
*Ridge regression* is a commonly used regularization method which looks for the $\hat{\beta}_{ridge}$ that minimizes the sum of the RSS and a penalty term:

$$\hat{\beta}_{ridge} = \underset{\beta}{\text{argmin}} \; \| y - X\beta \|_2^2 + \lambda \| \beta \|_2^2,$$

where $\| \beta \|_2^2 = \sum_{j=1}^p \beta_j^2$, and $\lambda \geq 0$ is a hyperparameter.
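Both estimates have closed forms: $\hat{\beta}_{ridge} = (X^T X + \lambda I)^{-1} X^T y$, which reduces to the OLS solution when $\lambda = 0$. Here is a minimal numpy sketch (the data dimensions and the value of $\lambda$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data; n, p, and the noise level are illustrative assumptions
n, p = 100, 5
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

def ridge_estimate(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge_estimate(X, y, 0.0)     # lam = 0 recovers OLS
beta_ridge = ridge_estimate(X, y, 10.0)  # a positive lam shrinks the coefficients

# The penalty pulls the coefficient vector toward zero
assert np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols)
```

Larger values of $\lambda$ shrink $\hat{\beta}_{ridge}$ more aggressively toward the zero vector.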

The ridge regression estimate has a Bayesian interpretation. Assume that the design matrix $X$ is fixed. The ordinary least squares model posits that the conditional distribution of the response $y$ is

$$y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I),$$

where $\sigma^2 > 0$ is some constant and $I$ is the identity matrix. In frequentist statistics, we think of $\beta$ as being some fixed unknown vector that we want to estimate. In Bayesian statistics, we can impose a prior distribution on $\beta$ and perform any estimation we want using the posterior distribution of $\beta$.
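To make the model concrete, here is a sketch that simulates from $y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$ and computes the frequentist point estimate (the true $\beta$ and the sizes are assumptions, known here only because we simulate):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate from the OLS model: y | X ~ N(X beta, sigma^2 I)
n, p, sigma = 500, 3, 1.0          # illustrative sizes
X = rng.standard_normal((n, p))
beta = np.array([2.0, -1.0, 0.5])  # the "fixed unknown" vector (known only because we simulate)
y = X @ beta + sigma * rng.standard_normal(n)  # sigma^2 I covariance means iid noise entries

# The frequentist point estimate: ordinary least squares
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
# With n much larger than p, beta_hat should be close to beta
```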

Let’s say our prior distribution for $\beta$ is that the $\beta_j$’s are independent normals with the same variance, i.e. $\beta_j \stackrel{iid}{\sim} \mathcal{N}(0, \tau^2)$ for some constant $\tau^2 > 0$. This allows us to compute the posterior distribution of $\beta$:

$$p(\beta \mid y) \propto p(y \mid \beta) \, p(\beta) \propto \exp \left( -\frac{1}{2\sigma^2} \| y - X\beta \|_2^2 \right) \exp \left( -\frac{1}{2\tau^2} \| \beta \|_2^2 \right).$$
From this expression, we can compute the mode of the posterior distribution, which is also known as the **maximum a posteriori (MAP) estimate**. It is

$$\hat{\beta}_{MAP} = \underset{\beta}{\text{argmin}} \; \| y - X\beta \|_2^2 + \frac{\sigma^2}{\tau^2} \| \beta \|_2^2,$$

which is the ridge regression estimate when $\lambda = \sigma^2 / \tau^2$.
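As a numerical sanity check, we can maximize the log posterior directly by gradient ascent and confirm that the maximizer matches the ridge closed form with $\lambda = \sigma^2/\tau^2$. This is a sketch with illustrative values of $\sigma$ and $\tau$:

```python
import numpy as np

rng = np.random.default_rng(2)

n, p, sigma, tau = 200, 4, 1.0, 0.5  # sigma, tau are illustrative assumptions
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + sigma * rng.standard_normal(n)

# Maximize the log posterior by gradient ascent:
#   log p(beta | y) = -||y - X beta||^2 / (2 sigma^2) - ||beta||^2 / (2 tau^2) + const
beta = np.zeros(p)
lr = 1e-3
for _ in range(20000):
    grad = X.T @ (y - X @ beta) / sigma**2 - beta / tau**2
    beta += lr * grad
beta_map = beta

# Ridge closed form with lam = sigma^2 / tau^2
lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

assert np.allclose(beta_map, beta_ridge, atol=1e-6)
```

The log posterior is concave here, so gradient ascent with a small step size converges to the unique MAP estimate, which coincides with the ridge solution as claimed.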


Hi, it seems that you omit the `1/2` in the exponent of your formula, although it doesn’t affect the results.


The 1/2 is hidden in the \propto sign.


Sorry, I mean the 1/2 coefficient in the exponent, which I don’t think it is valid to omit.


Ah yes you are right. We can’t drop the 1/2 in the exponent when computing the posterior (but we can drop it when computing the MAP estimate). I’ve fixed the post to reflect this.


Hi! I wonder what ‘I’ is in the normal distribution notation N(XB, sigma^2 I)?


It refers to the identity matrix (of the correct size)!


Thank you!
