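In this previous post, we noted that ridge regression has a Bayesian connection: it is the maximum a posteriori (MAP) estimate of the coefficient vector $\beta$ when the coordinates of $\beta$ have independent mean-zero Gaussian priors with the same variance, and the likelihood of the data is

$$y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I),$$

where $\sigma^2 > 0$ is some constant. The lasso has a similar interpretation, which was noted in the original paper introducing the method (Tibshirani 1996). The lasso estimate is given by the optimization problem

$$\min_{\beta} \; (y - X\beta)^T (y - X\beta) + \lambda \sum_{j=1}^p |\beta_j|,$$

where $\beta = (\beta_1, \dots, \beta_p)^T$ and $\lambda \geq 0$ is a hyperparameter. Assume $\beta$ has the prior distribution where the $\beta_j$'s are independent, each having a mean-zero Laplace distribution:

$$f(\beta) = \prod_{j=1}^p \frac{1}{2\tau} \exp\left( -\frac{|\beta_j|}{\tau} \right),$$

where $\tau > 0$ is some constant. The posterior density of $\beta$ is given by

$$p(\beta \mid y, X) \propto \exp\left( -\frac{(y - X\beta)^T (y - X\beta)}{2\sigma^2} \right) \exp\left( -\frac{1}{\tau} \sum_{j=1}^p |\beta_j| \right).$$

Maximizing this density is the same as minimizing its negative logarithm, and multiplying that by $2\sigma^2$ does not change the minimizer. The MAP estimate, i.e. the value of $\beta$ which maximizes the posterior density, is therefore given by

$$\hat{\beta} = \arg\min_{\beta} \; (y - X\beta)^T (y - X\beta) + \frac{2\sigma^2}{\tau} \sum_{j=1}^p |\beta_j|,$$

which is the lasso estimate for $\lambda = 2\sigma^2 / \tau$.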
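As a quick sanity check of this equivalence, here is a minimal sketch for the single-predictor case, where the lasso solution has a closed form (soft-thresholding). The data, the values of `sigma2` and `tau`, and all function names are made up for illustration. The sketch assumes the lasso objective is the residual sum of squares plus $\lambda |\beta|$ (no factor of $1/2$), so that the matching penalty is $\lambda = 2\sigma^2/\tau$. It compares the closed-form lasso estimate with a brute-force grid minimization of the negative log posterior.

```python
# Illustrative check with hypothetical data: for one predictor, the MAP
# estimate under a Laplace prior should match the lasso solution with
# lam = 2 * sigma2 / tau.

def soft_threshold(z, t):
    """Shrink z toward zero by t; return 0 if |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_1d(x, y, lam):
    """Minimizer of sum((y_i - x_i * b)^2) + lam * |b| for one predictor."""
    xy = sum(xi * yi for xi, yi in zip(x, y))
    xx = sum(xi * xi for xi in x)
    return soft_threshold(xy, lam / 2.0) / xx

def neg_log_posterior(b, x, y, sigma2, tau):
    """Negative log posterior of b, up to an additive constant."""
    rss = sum((yi - xi * b) ** 2 for xi, yi in zip(x, y))
    return rss / (2.0 * sigma2) + abs(b) / tau

# Made-up data and constants (assumptions for this sketch only).
x = [-1.5, -0.5, 0.0, 0.5, 1.5]
y = [-1.2, -0.1, 0.3, 0.2, 1.4]
sigma2, tau = 1.0, 0.5
lam = 2.0 * sigma2 / tau

b_lasso = lasso_1d(x, y, lam)

# Brute-force MAP: minimize the negative log posterior over a fine grid.
grid = [i / 10000.0 for i in range(-20000, 20001)]
b_map = min(grid, key=lambda b: neg_log_posterior(b, x, y, sigma2, tau))

print(b_lasso, b_map)
```

The two printed values should agree up to the grid resolution. Making `tau` smaller (a tighter prior) increases `lam`, and for a small enough `tau` both estimates are exactly zero.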
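The Bayesian Lasso (Park & Casella 2008) takes this connection further by taking a fully Bayesian approach. Here is the specification of the full model:

$$\begin{aligned} y \mid \mu, X, \beta, \sigma^2 &\sim \mathcal{N}(\mu 1_n + X\beta, \sigma^2 I_n), \\ \pi(\beta \mid \sigma^2) &= \prod_{j=1}^p \frac{\lambda}{2\sqrt{\sigma^2}} \exp\left( -\frac{\lambda |\beta_j|}{\sqrt{\sigma^2}} \right), \\ \pi(\sigma^2) &\propto \frac{1}{\sigma^2}. \end{aligned}$$

In the above, $\pi$ denotes the probability density function. $\mu$ is another parameter, which the authors suggest giving an independent flat prior. The prior for $\sigma^2$ is the standard non-informative scale-invariant prior.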
Because the Laplace distribution can be written as a scale mixture of normals, the posterior distribution can be sampled with a Gibbs sampler.
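The mixture representation in question is the standard identity $\frac{a}{2} e^{-a|z|} = \int_0^\infty \frac{1}{\sqrt{2\pi s}} e^{-z^2/(2s)} \cdot \frac{a^2}{2} e^{-a^2 s/2} \, ds$ for $a > 0$: a normal with variance $s$, where $s$ itself has an exponential distribution with rate $a^2/2$. The sketch below is a rough Monte Carlo check of that identity, not the Gibbs sampler itself. The rate `a`, the seed, and the function name are arbitrary choices for illustration. It draws from the two-stage mixture and compares two sample moments against their known Laplace values, $E|z| = 1/a$ and $E[z^2] = 2/a^2$.

```python
import math
import random

# Hypothetical rate parameter for the Laplace density (a/2) * exp(-a * |z|).
a = 2.0
rng = random.Random(0)  # fixed seed so the run is reproducible

def sample_laplace_via_mixture(a, rng):
    """Draw s ~ Exponential(rate=a^2/2), then z | s ~ Normal(0, variance=s)."""
    s = rng.expovariate(a * a / 2.0)
    return rng.gauss(0.0, math.sqrt(s))

n = 200_000
draws = [sample_laplace_via_mixture(a, rng) for _ in range(n)]

mean_abs = sum(abs(z) for z in draws) / n      # Laplace value: 1 / a
second_moment = sum(z * z for z in draws) / n  # Laplace value: 2 / a^2

print(mean_abs, 1.0 / a)
print(second_moment, 2.0 / a ** 2)
```

Each printed pair should be close. Introducing the latent variances is what makes Gibbs sampling practical here: conditional on them, the prior on the coefficients is Gaussian.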
What does the Bayesian lasso buy you? Well, it is a fully Bayesian method which seems to perform much like the lasso in practice. Because it is fully Bayesian, you get everything which comes with that point of view (e.g. credible intervals for any parameter of your choosing).
- Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso.
- Park, T. and Casella, G. (2008). The Bayesian Lasso.
Hi, there is a typo in the posterior of $\beta$: the `-` sign is missing in the second exponent. I also wonder why you can omit the coefficient `1/2`.
Thanks, fixed the typo! When computing the posterior, we only need to keep track of terms which depend on our parameter of interest.
For example, if $x$ is our parameter of interest and we know that the posterior satisfies $p(x) \propto f(x)$, then $p(x) = k f(x)$ for all $x$, where $k > 0$ is something that does not depend on $x$. In computing the MAP estimate, we are looking for $\arg\max_x p(x)$, which is the same as $\arg\max_x f(x)$.