The Bayesian lasso

In a previous post, we noted that ridge regression has a Bayesian connection: it is the maximum a posteriori (MAP) estimate of the coefficient vector \beta when the coordinates of \beta have independent mean-zero Gaussian priors with a common variance, and the likelihood of the data is

y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I),

where \sigma > 0 is some constant. The lasso has a similar interpretation which was noted in the original paper introducing the method (Tibshirani 1996). The lasso estimate is given by the optimization problem

\underset{\beta}{\text{minimize}} \quad (y - X\beta)^T (y - X\beta) + \lambda \|\beta\|_1,

where \|\beta \|_1 = |\beta_1| + \dots + |\beta_p| and \lambda \geq 0 is a hyperparameter. Assume that the \beta_j's are a priori independent, each with a mean-zero Laplace distribution, so that the prior density of \beta is

f(\beta) = \displaystyle\prod_{j=1}^p \frac{1}{2\tau} \exp \left(-\frac{|\beta_j|}{\tau} \right),

where \tau is some constant. The posterior density of \beta is given by

\begin{aligned} p(\beta \mid y, X) &\propto p(\beta) \cdot p(y \mid X, \beta) \\ &\propto \exp \left[ -\frac{|\beta_1|}{\tau} - \dots - \frac{|\beta_p|}{\tau} \right] \cdot \exp \left[ -\frac{1}{2}(y - X\beta)^T \frac{1}{\sigma^2} (y - X\beta) \right] \\ &= \exp \left[ -\frac{1}{2\sigma^2}(y-X\beta)^T (y - X \beta) - \frac{1}{\tau} \|\beta\|_1 \right]. \end{aligned}

The MAP estimate, i.e. the value of \beta which maximizes the posterior density, is given by

\begin{aligned} \hat{\beta} &= \underset{\beta}{\text{argmax}} \quad \exp \left[ -\frac{1}{2\sigma^2}(y-X\beta)^T (y - X \beta) - \frac{1}{\tau} \|\beta\|_1 \right] \\  &= \underset{\beta}{\text{argmin}} \quad \frac{1}{2\sigma^2}(y-X\beta)^T (y - X \beta) + \frac{1}{\tau} \|\beta\|_1 \\  &= \underset{\beta}{\text{argmin}} \quad (y-X\beta)^T (y - X \beta) + \frac{2\sigma^2}{\tau} \|\beta\|_1, \end{aligned}

which is the lasso estimate for \lambda = \dfrac{2\sigma^2}{\tau}.
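As a quick sanity check (this example is not from the original post; the simulated data and the values of \sigma, \tau and the grid are arbitrary), we can evaluate the log posterior and the penalized residual sum of squares on a grid for a single-predictor problem and confirm that they are optimized at the same value of \beta:

import numpy as np

rng = np.random.default_rng(0)

# simulate a one-predictor problem: y = x * beta_true + noise
n, beta_true, sigma, tau = 200, 1.5, 1.0, 0.5
x = rng.normal(size=n)
y = x * beta_true + sigma * rng.normal(size=n)

# residual sum of squares for each candidate value of beta on a grid
grid = np.linspace(-3, 3, 2001)
rss = ((y[:, None] - x[:, None] * grid[None, :]) ** 2).sum(axis=0)

log_post = -rss / (2 * sigma**2) - np.abs(grid) / tau    # log posterior, up to an additive constant
lasso_obj = rss + (2 * sigma**2 / tau) * np.abs(grid)    # RSS + lambda * |beta| with lambda = 2 sigma^2 / tau

print(grid[np.argmax(log_post)], grid[np.argmin(lasso_obj)])

The two printed values coincide, since the second objective is just the first multiplied by -2\sigma^2.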

The Bayesian Lasso (Park & Casella 2008) takes this connection further by adopting a fully Bayesian approach. Here is the specification of the full model:

\begin{aligned} f(\sigma^2) &\propto 1/\sigma^2, \\  f(\beta \mid \sigma^2) &= \prod_{j=1}^p \frac{\lambda}{2\sqrt{\sigma^2}} \exp \left(-\frac{\lambda |\beta_j|}{\sqrt{\sigma^2}} \right), \\  y \mid \mu, X, \beta, \sigma^2 &\sim \mathcal{N}(\mu 1 + X\beta, \sigma^2 I). \end{aligned}

In the above, f denotes a probability density function, and \mu is the intercept, to which the authors suggest giving an independent flat prior. The prior for \sigma^2 is the standard non-informative scale-invariant prior.

Because the Laplace distribution can be represented as a scale mixture of normals (with an exponential mixing density), all of the full conditional distributions have standard forms, and the posterior distribution can therefore be sampled from with a Gibbs sampler.
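Concretely, the Laplace density is a scale mixture of normals with an exponential mixing density:

\displaystyle \frac{a}{2} e^{-a|z|} = \int_0^\infty \frac{1}{\sqrt{2\pi s}} e^{-z^2/(2s)} \cdot \frac{a^2}{2} e^{-a^2 s/2} \, ds, \qquad a > 0.

Introducing latent scales \tau_1^2, \dots, \tau_p^2 with \beta_j \mid \sigma^2, \tau_j^2 \sim \mathcal{N}(0, \sigma^2 \tau_j^2) and \tau_j^2 \sim \text{Exp}(\lambda^2 / 2) recovers the conditional Laplace prior above, and with this augmentation every full conditional is a standard distribution. Below is a minimal sketch of such a Gibbs sampler in Python (it is not code from the paper): it treats \lambda as fixed, assumes the response has been centred so that the intercept \mu can be dropped, and the function name and all numerical settings are purely illustrative.

import numpy as np

def bayesian_lasso_gibbs(X, y, lam, n_iter=5000, burn_in=1000, seed=0):
    """Sketch of a Gibbs sampler for the Bayesian lasso with lambda held fixed.

    Assumes y has been centred, so the intercept is dropped.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y

    # initial values
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = np.mean((y - X @ beta) ** 2)
    tau2 = np.ones(p)

    beta_draws = np.empty((n_iter, p))
    sigma2_draws = np.empty(n_iter)
    for t in range(n_iter):
        # beta | rest ~ N(A^{-1} X^T y, sigma^2 A^{-1}),  A = X^T X + D_tau^{-1}
        A_inv = np.linalg.inv(XtX + np.diag(1.0 / tau2))
        beta = rng.multivariate_normal(A_inv @ Xty, sigma2 * A_inv)

        # sigma^2 | rest ~ Inverse-Gamma((n-1)/2 + p/2, RSS/2 + beta^T D_tau^{-1} beta / 2)
        resid = y - X @ beta
        ig_shape = (n - 1) / 2 + p / 2
        ig_scale = resid @ resid / 2 + beta @ (beta / tau2) / 2
        sigma2 = 1.0 / rng.gamma(ig_shape, 1.0 / ig_scale)

        # 1 / tau_j^2 | rest ~ Inverse-Gaussian(sqrt(lam^2 sigma^2 / beta_j^2), lam^2)
        ig_mean = np.sqrt(lam**2 * sigma2 / np.maximum(beta**2, 1e-12))
        tau2 = 1.0 / rng.wald(ig_mean, lam**2)

        beta_draws[t], sigma2_draws[t] = beta, sigma2

    return beta_draws[burn_in:], sigma2_draws[burn_in:]

For example, on simulated data (again, everything here is made up), posterior quantiles of the sampled coefficients immediately give credible intervals:

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + rng.normal(size=100)
y = y - y.mean()                              # centre the response

beta_draws, sigma2_draws = bayesian_lasso_gibbs(X, y, lam=1.0)
print(np.percentile(beta_draws, [2.5, 50, 97.5], axis=0))   # posterior medians and 95% credible intervals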

What does the Bayesian lasso buy you? Well, it is a fully Bayesian method which seems to perform much like the lasso in practice. Because it is fully Bayesian, you get everything which comes with that point of view (e.g. credible intervals for any parameter of your choosing).

References:

  1. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso.
  2. Park, T. and Casella, G. (2008). The Bayesian Lasso.

Bayesian interpretation of ridge regression

Assume that we are in the standard supervised learning setting, where we have a response vector y \in \mathbb{R}^n and a design matrix X \in \mathbb{R}^{n \times p}. Ordinary least squares seeks the coefficient vector \beta \in \mathbb{R}^p which minimizes the residual sum of squares (RSS), i.e.

\hat{\beta} = \underset{\beta}{\text{argmin}} \; (y- X\beta)^T (y - X\beta).
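Setting the gradient of the RSS with respect to \beta to zero gives the normal equations (a standard step, stated here for completeness):

X^T X \hat{\beta} = X^T y,

so that \hat{\beta} = (X^T X)^{-1} X^T y whenever X^T X is invertible.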

Ridge regression is a commonly used regularization method which looks for \beta that minimizes the sum of the RSS and a penalty term:

\hat{\beta} = \underset{\beta}{\text{argmin}} \; (y- X\beta)^T (y - X\beta) + \lambda \| \beta\|_2^2,

where \|\beta\|_2^2 = \beta_1^2 + \dots + \beta_p^2, and \lambda \geq 0 is a hyperparameter.
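Since this objective is differentiable, setting its gradient to zero gives the closed-form solution \hat{\beta} = (X^T X + \lambda I)^{-1} X^T y, which exists for any \lambda > 0. Here is a minimal illustration in Python (the simulated data, dimensions and \lambda values are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

def ridge(X, y, lam):
    # closed-form ridge estimate: solve (X^T X + lam * I) beta = X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta_ols = ridge(X, y, 0.0)      # lam = 0 recovers ordinary least squares (X^T X is invertible here)
beta_ridge = ridge(X, y, 50.0)   # a larger lam shrinks the coefficients toward zero
print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))   # the second norm is smaller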

The ridge regression estimate has a Bayesian interpretation. Assume that the design matrix X is fixed. The ordinary least squares model posits that the conditional distribution of the response y is

y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I),

where \sigma > 0 is some constant. In the frequentist framework, we think of \beta as a fixed unknown vector that we want to estimate. In Bayesian statistics, we can impose a prior distribution on \beta and perform any estimation we want using the posterior distribution of \beta.

Let's say our prior distribution for \beta is that the \beta_j's are independent mean-zero normals with the same variance, i.e. \beta \sim \mathcal{N}(0, \tau^2 I) for some constant \tau > 0. This allows us to compute the posterior distribution of \beta:

\begin{aligned} p(\beta \mid y, X) &\propto p(\beta) \cdot p(y \mid X, \beta) \\ &\propto \exp \left[ - \frac{1}{2} (\beta - 0)^T \frac{1}{\tau^2} I (\beta - 0) \right] \cdot \exp \left[ -\frac{1}{2}(y - X\beta)^T \frac{1}{\sigma^2} (y - X\beta) \right] \\ &= \exp \left[ -\frac{1}{2\sigma^2}(y-X\beta)^T (y - X \beta) - \frac{1}{2\tau^2} \|\beta\|_2^2 \right]. \end{aligned}

From this expression, we can compute the mode of the posterior distribution, which is also known as the maximum a posteriori (MAP) estimate. It is

\begin{aligned} \hat{\beta} &= \underset{\beta}{\text{argmax}} \quad \exp \left[ -\frac{1}{2\sigma^2}(y-X\beta)^T (y - X \beta) - \frac{1}{2\tau^2} \|\beta\|_2^2 \right] \\  &= \underset{\beta}{\text{argmin}} \quad \frac{1}{\sigma^2}(y-X\beta)^T (y - X \beta) + \frac{1}{\tau^2} \|\beta\|_2^2 \\  &= \underset{\beta}{\text{argmin}} \quad (y-X\beta)^T (y - X \beta) + \frac{\sigma^2}{\tau^2} \|\beta\|_2^2, \end{aligned}

which is the ridge regression estimate when \lambda = \dfrac{\sigma^2}{\tau^2}.
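One further remark, not needed for the derivation above: since the exponent is quadratic in \beta, the posterior is itself Gaussian,

\beta \mid y, X \sim \mathcal{N}\left( \left(X^T X + \tfrac{\sigma^2}{\tau^2} I\right)^{-1} X^T y, \ \sigma^2 \left(X^T X + \tfrac{\sigma^2}{\tau^2} I\right)^{-1} \right),

so the posterior mean and mode coincide, and both equal the ridge estimate with \lambda = \sigma^2 / \tau^2. The short Python script below (simulated data; all numerical values are arbitrary) checks this by comparing the closed-form ridge estimate against a direct numerical maximization of the log posterior:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 200, 4
sigma, tau = 1.0, 0.5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + sigma * rng.normal(size=n)

# closed-form ridge estimate with lambda = sigma^2 / tau^2
lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# negative log posterior (up to an additive constant)
def neg_log_post(beta):
    resid = y - X @ beta
    return resid @ resid / (2 * sigma**2) + beta @ beta / (2 * tau**2)

beta_map = minimize(neg_log_post, np.zeros(p)).x
print(np.max(np.abs(beta_ridge - beta_map)))   # numerically negligible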