Assume that we are in the standard supervised learning setting, where we have a response vector $y \in \mathbb{R}^n$ and a design matrix $X \in \mathbb{R}^{n \times p}$. Ordinary least squares seeks the coefficient vector $\beta \in \mathbb{R}^p$ which minimizes the residual sum of squares (RSS), i.e.

$$\hat{\beta}_{\text{OLS}} = \underset{\beta}{\operatorname{argmin}} \; \| y - X\beta \|_2^2.$$
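As a quick illustration, here is a minimal numpy sketch of the OLS fit; the dimensions, variable names, and simulated data are made up purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))        # design matrix (simulated for illustration)
beta_true = rng.normal(size=p)     # "true" coefficients used to simulate the response
y = X @ beta_true + rng.normal(size=n)

# OLS: minimize ||y - X beta||_2^2; lstsq solves this directly.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
```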
Ridge regression is a commonly used regularization method which looks for the $\beta$ that minimizes the sum of the RSS and a penalty term:

$$\hat{\beta}_{\text{ridge}} = \underset{\beta}{\operatorname{argmin}} \; \| y - X\beta \|_2^2 + \lambda \|\beta\|_2^2,$$

where $\|\beta\|_2^2 = \sum_{j=1}^p \beta_j^2$ and $\lambda \geq 0$ is a hyperparameter.
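This objective has the well-known closed-form solution $\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$. A minimal sketch, reusing the simulated `X`, `y`, and `p` from the snippet above and an arbitrary illustrative value of $\lambda$:

```python
lam = 1.0  # arbitrary illustrative value of the hyperparameter lambda

# Ridge closed form: (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```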
The ridge regression estimate has a Bayesian interpretation. Assume that the design matrix $X$ is fixed. The ordinary least squares model posits that the conditional distribution of the response $y$ is

$$y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I),$$

where $\sigma^2$ is some constant. In frequentist statistics we think of $\beta$ as being some fixed unknown vector that we want to estimate. In Bayesian statistics, we can impose a prior distribution on $\beta$ and perform any estimation we want using the posterior distribution of $\beta$.
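Written out (with $I$ the $n \times n$ identity matrix), this likelihood is

$$p(y \mid X, \beta) = (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \| y - X\beta \|_2^2 \right),$$

which is the form that reappears in the posterior computation below.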
Let’s say our prior distribution for $\beta$ is that the $\beta_j$’s are independent mean-zero normals with the same variance, i.e. $\beta \sim \mathcal{N}(0, \tau^2 I)$ for some constant $\tau^2$. This allows us to compute the posterior distribution of $\beta$:

$$p(\beta \mid y, X) \propto p(y \mid X, \beta)\, p(\beta) \propto \exp\left( -\frac{1}{2\sigma^2} \| y - X\beta \|_2^2 \right) \exp\left( -\frac{1}{2\tau^2} \|\beta\|_2^2 \right).$$
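To see the connection to ridge more explicitly, note that maximizing the posterior is equivalent to minimizing its negative logarithm, which (up to an additive constant) is

$$-\log p(\beta \mid y, X) = \frac{1}{2\sigma^2} \| y - X\beta \|_2^2 + \frac{1}{2\tau^2} \|\beta\|_2^2 + \text{const},$$

and multiplying through by $2\sigma^2$ does not change the minimizer.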
From this expression, we can compute the mode of the posterior distribution, which is also known as the maximum a posteriori (MAP) estimate. It is

$$\hat{\beta}_{\text{MAP}} = \underset{\beta}{\operatorname{argmin}} \; \| y - X\beta \|_2^2 + \frac{\sigma^2}{\tau^2} \|\beta\|_2^2,$$

which is the ridge regression estimate when $\lambda = \sigma^2 / \tau^2$.
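As a numerical sanity check, here is a short sketch (again with simulated data and illustrative values of $\sigma^2$ and $\tau^2$) comparing the ridge closed form with $\lambda = \sigma^2/\tau^2$ against a general-purpose minimizer applied to the negative log posterior:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 200, 4
sigma2, tau2 = 1.0, 0.5                      # illustrative noise and prior variances
X = rng.normal(size=(n, p))
beta_true = rng.normal(scale=np.sqrt(tau2), size=p)
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Ridge estimate with lambda = sigma^2 / tau^2
lam = sigma2 / tau2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Negative log posterior (up to additive constants)
def neg_log_post(beta):
    return (np.sum((y - X @ beta) ** 2) / (2 * sigma2)
            + np.sum(beta ** 2) / (2 * tau2))

beta_map = minimize(neg_log_post, x0=np.zeros(p)).x

# The two estimates should agree up to optimizer tolerance.
print(np.max(np.abs(beta_ridge - beta_map)))
```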
Hi, it seems that you omit the 1/2 in the exponent of your formula, although it doesn’t affect the results.
The 1/2 is hidden in the \propto sign.
Sorry, I mean the coefficient in the exponent, which I don’t think it is correct to omit.
Ah yes you are right. We can’t drop the 1/2 in the exponent when computing the posterior (but we can drop it when computing the MAP estimate). I’ve fixed the post to reflect this.
Hi! I wonder what the ‘I’ is in the normal distribution notation N(Xβ, σ²I)?
It refers to the identity matrix (of the correct size)!
Thank you!