Assume that we are in the standard supervised learning setting, where we have a response vector $y \in \mathbb{R}^n$ and a design matrix $X \in \mathbb{R}^{n \times p}$. Ordinary least squares (OLS) seeks the coefficient vector $\beta \in \mathbb{R}^p$ which minimizes the residual sum of squares (RSS), i.e.

$$\hat{\beta}_{OLS} = \underset{\beta}{\text{argmin}} \; \| y - X\beta \|_2^2.$$
Ridge regression is a commonly used regularization method which looks for the $\beta$ that minimizes the sum of the RSS and a penalty term:

$$\hat{\beta}_{ridge} = \underset{\beta}{\text{argmin}} \; \| y - X\beta \|_2^2 + \lambda \| \beta \|_2^2,$$

where $\| \beta \|_2^2 = \sum_{j=1}^p \beta_j^2$, and $\lambda \geq 0$ is a hyperparameter.
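As a quick numerical sketch of the ridge objective, the snippet below computes the standard closed-form minimizer $(X^\top X + \lambda I)^{-1} X^\top y$ (not derived above, but a well-known consequence of setting the gradient to zero) and checks that random perturbations never decrease the penalized RSS. The data and the value of $\lambda$ are illustrative random choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 2.0                  # illustrative sizes and penalty
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def ridge_objective(beta):
    # RSS plus the L2 penalty term lambda * ||beta||^2
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(beta ** 2)

# Closed-form minimizer: (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Sanity check: no random perturbation improves the objective
base = ridge_objective(beta_ridge)
for _ in range(100):
    assert ridge_objective(beta_ridge + 0.1 * rng.normal(size=p)) >= base
```

Because the objective is strictly convex for $\lambda > 0$, this minimizer is unique even when $X^\top X$ is singular.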
The ridge regression estimate has a Bayesian interpretation. Assume that the design matrix $X$ is fixed. The ordinary least squares model posits that the conditional distribution of the response $y$ is

$$y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I),$$
where $\sigma^2 > 0$ is some constant. In frequentist statistics, we think of $\beta$ as being some fixed unknown vector that we want to estimate. In Bayesian statistics, we can impose a prior distribution on $\beta$ and perform any estimation we want using the posterior distribution of $\beta$.
Let’s say our prior distribution for $\beta$ is that the $\beta_j$’s are independent normals with mean zero and the same variance, i.e. $\beta_j \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \tau^2)$ for some constant $\tau^2 > 0$. This allows us to compute the posterior distribution of $\beta$:

$$\begin{aligned} p(\beta \mid y, X) &\propto p(y \mid X, \beta) \, p(\beta) \\ &\propto \exp \left( -\frac{1}{2\sigma^2} \| y - X\beta \|_2^2 \right) \exp \left( -\frac{1}{2\tau^2} \| \beta \|_2^2 \right). \end{aligned}$$
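As an aside, completing the square in $\beta$ (a step not spelled out here) shows that this posterior is itself Gaussian:

$$\beta \mid y, X \sim \mathcal{N}\left( \left( X^\top X + \tfrac{\sigma^2}{\tau^2} I \right)^{-1} X^\top y, \; \sigma^2 \left( X^\top X + \tfrac{\sigma^2}{\tau^2} I \right)^{-1} \right),$$

so for this model the posterior mode coincides with the posterior mean.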
From this expression, we can compute the mode of the posterior distribution, which is also known as the maximum a posteriori (MAP) estimate. It is

$$\hat{\beta}_{MAP} = \underset{\beta}{\text{argmax}} \; \exp \left( -\frac{1}{2\sigma^2} \| y - X\beta \|_2^2 - \frac{1}{2\tau^2} \| \beta \|_2^2 \right) = \underset{\beta}{\text{argmin}} \; \| y - X\beta \|_2^2 + \frac{\sigma^2}{\tau^2} \| \beta \|_2^2,$$
which is exactly the ridge regression estimate when $\lambda = \sigma^2 / \tau^2$.
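To see the equivalence concretely, the sketch below minimizes the negative log posterior numerically and compares the result with the closed-form ridge estimate at $\lambda = \sigma^2/\tau^2$. The variances and simulated data are illustrative choices, not from the text.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 80, 4
sigma2, tau2 = 1.5, 0.5                  # illustrative noise and prior variances
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + np.sqrt(sigma2) * rng.normal(size=n)

def neg_log_posterior(beta):
    # Up to an additive constant: RSS/(2 sigma^2) + ||beta||^2/(2 tau^2)
    return (np.sum((y - X @ beta) ** 2) / (2 * sigma2)
            + np.sum(beta ** 2) / (2 * tau2))

# MAP estimate via generic numerical optimization
beta_map = minimize(neg_log_posterior, np.zeros(p), tol=1e-12).x

# Ridge estimate with lambda = sigma^2 / tau^2
lam = sigma2 / tau2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

assert np.allclose(beta_map, beta_ridge, atol=1e-3)
```

Note that only the ratio $\sigma^2/\tau^2$ matters for the estimate: a stronger prior (smaller $\tau^2$) or noisier data (larger $\sigma^2$) both shrink $\beta$ harder toward zero.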