Assume that we are in the standard supervised learning setting, where we have a response vector $y \in \mathbb{R}^n$ and a design matrix $X \in \mathbb{R}^{n \times p}$. Ordinary least squares seeks the coefficient vector $\beta \in \mathbb{R}^p$ which minimizes the residual sum of squares (RSS), i.e.

$$\hat{\beta}_{OLS} = \underset{\beta}{\text{argmin}} \; \| y - X\beta \|_2^2.$$
Ridge regression is a commonly used regularization method which looks for the $\beta$ that minimizes the sum of the RSS and a penalty term:

$$\hat{\beta}_{ridge} = \underset{\beta}{\text{argmin}} \; \| y - X\beta \|_2^2 + \lambda \| \beta \|_2^2,$$

where $\| \beta \|_2^2 = \sum_{j=1}^p \beta_j^2$, and $\lambda \geq 0$ is a hyperparameter.
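As a quick numerical sketch (the data and variable names below are made up for illustration), the ridge objective is minimized by the closed-form solution $(X^\top X + \lambda I)^{-1} X^\top y$, which we can check by verifying that the gradient of the penalized RSS vanishes there:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))                        # design matrix
y = X @ rng.normal(size=p) + rng.normal(size=n)    # response vector

lam = 2.0  # the penalty hyperparameter lambda

# Closed-form ridge estimate: (X^T X + lambda * I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# The gradient of RSS + lambda * ||beta||^2 at the estimate should be zero
grad = -2 * X.T @ (y - X @ beta_ridge) + 2 * lam * beta_ridge
print(np.allclose(grad, np.zeros(p), atol=1e-8))   # True
```

Note that the penalized objective is strictly convex whenever $\lambda > 0$, so this stationary point is the unique minimizer even when $X^\top X$ is singular.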
The ridge regression estimate has a Bayesian interpretation. Assume that the design matrix $X$ is fixed. The ordinary least squares model posits that the conditional distribution of the response is

$$y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I),$$
where $\sigma^2 > 0$ is some constant. In frequentist statistics, we think of $\beta$ as being some fixed unknown vector that we want to estimate. In Bayesian statistics, we can impose a prior distribution on $\beta$ and perform any estimation we want using the posterior distribution of $\beta$.
Let’s say our prior distribution for $\beta$ is that the $\beta_j$’s are independent normals with mean zero and the same variance, i.e. $\beta_j \overset{i.i.d.}{\sim} \mathcal{N}(0, \tau^2)$ for some constant $\tau^2 > 0$. This allows us to compute the posterior distribution of $\beta$:

$$\begin{aligned} p(\beta \mid y) &\propto p(y \mid \beta) \, p(\beta) \\ &\propto \exp \left( -\frac{1}{2\sigma^2} \| y - X\beta \|_2^2 \right) \exp \left( -\frac{1}{2\tau^2} \| \beta \|_2^2 \right). \end{aligned}$$
From this expression, we can compute the mode of the posterior distribution, which is also known as the maximum a posteriori (MAP) estimate. It is

$$\hat{\beta}_{MAP} = \underset{\beta}{\text{argmin}} \; \| y - X\beta \|_2^2 + \frac{\sigma^2}{\tau^2} \| \beta \|_2^2,$$

which is the ridge regression estimate when $\lambda = \sigma^2 / \tau^2$.
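To make this correspondence concrete, here is a small numerical check (data, noise variance $\sigma^2$, and prior variance $\tau^2$ are all made up): maximizing the log-posterior by gradient ascent recovers exactly the ridge estimate with $\lambda = \sigma^2/\tau^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)

sigma2, tau2 = 0.25, 1.0   # noise variance and prior variance (illustrative)
lam = sigma2 / tau2        # the implied ridge penalty

# Ridge closed form with lambda = sigma^2 / tau^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Maximize the log-posterior by gradient ascent:
# log p(beta | y) = -||y - X beta||^2 / (2 sigma^2) - ||beta||^2 / (2 tau^2) + const
beta = np.zeros(p)
step = 1e-3
for _ in range(20000):
    grad = X.T @ (y - X @ beta) / sigma2 - beta / tau2
    beta += step * grad

print(np.allclose(beta, beta_ridge, atol=1e-6))  # True: the MAP estimate is ridge
```

Since the posterior here is Gaussian, its mode coincides with its mean, so the same estimate also arises as the posterior mean.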
Hi, it seems that you omit `1/2` in the exponent of your formula, although it doesn’t affect the results.
The 1/2 is hidden in the \propto sign.
Sorry, I mean the 1/2 coefficient in the exponent, which I don’t think can be omitted.
Ah yes you are right. We can’t drop the 1/2 in the exponent when computing the posterior (but we can drop it when computing the MAP estimate). I’ve fixed the post to reflect this.
Hi! I wonder what ‘I’ is in the normal distribution notation N(XB, sigma^2 I)?
It refers to the identity matrix (of the correct size)!