Bayesian Additive Regression Trees (BART)

Bayesian Additive Regression Trees (BART), proposed by Chipman et al. (2010) (Reference 1), is a Bayesian “sum-of-trees” model where we use the sum of trees to model or approximate the target function f(x) = \mathbb{E}[Y \mid x], where Y is the response of interest and x represents the covariates that we are using to predict Y. (If there are p covariates, then x is a p-dimensional vector.) It is inspired by ensemble methods such as boosting, where a meta algorithm combines a large number of “weak learners”/”weak predictors” to get a much more accurate prediction.

As a Bayesian method, all we have to do is to specify the prior distribution for the parameters and the likelihood. From there, we can turn the proverbial Bayesian crank to get the posterior distribution of parameters and with it, posterior inference of any quantity we are interested in (e.g. point and estimate estimates of f(x) = \mathbb{E}[Y \mid x]).

Likelihood

The likelihood is easy to specify once we get definitions out of the way. Let T denote a binary decision tree. Assuming the tree has b terminal nodes, let M = \{ \mu_1, \dots, \mu_b \} be the set of parameter values associated with each of the terminal nodes such that if the x value ends up in the ith terminal node, the tree would return the value \mu_i. We can think of T as representing the structure of the tree, and of M as specifying the value we return to the user once the input hits one of the terminal nodes.

For a given T and M, let g(x; T, M) denote the function which assigns \mu_i to x if x ends up in the ith terminal node.

The likelihood for BART is

\begin{aligned} Y = \sum_{j=1}^m g(x; T_j, M_j) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2). \end{aligned}

The response is the sum of m binary decision trees with additive Gaussian noise.

Prior specification: General comments

We need to choose priors for m, \sigma, and (T_1, M_1), \dots, (T_m, M_m).

Chipman et al. actually decided not to put a prior on m for computational reasons. Instead, they suggested beginning with a default of m = 200, then check if one or two other choices of m makes any difference. According to them, as m increases, predictive performance improves dramatically at first, then levels off, and finally degrades slowly. (I believe this is the observed behavior with boosting as well.) As such, for prediction it appears only important to avoid having too few trees.

As for the other parameters, we introduce independence assumptions to simplify the prior specification. If \mu_{ij} \in M_j represents the terminal value of the ith node in T_j, then we assume that

\begin{aligned} p \left( (T_1, M_1), \dots, (T_m, M_m), \sigma \right) &= \left[ \prod_j p(T_j, M_j) \right] p(\sigma) \\  &= \left[ \prod_j p(M_j \mid T_j) p(T_j) \right] p(\sigma), \end{aligned}

and that

\begin{aligned} p(M_j \mid T_j) = \prod_i p(\mu_{ij} \mid T_j). \end{aligned}

To complete the prior specification, we just need to specify the priors for \sigma, T_j and \mu_{ij} \mid T_j. We do so in the following subsections. The paper notes that these priors were also used in an earlier paper by the same authors (Reference 2) where they considered a Bayesian model for a single decision tree.

The \sigma prior

For \sigma BART uses a standard conjugate prior, the inverse chi-square distribution

\sigma^2 \sim \nu \lambda / \chi_\nu^2,

where \nu and \lambda are the degrees of freedom and scale hyperparameters. Chipman et al. recommend choosing these two hyperparameters in a data-driven way:

  1. Get some estimate \hat{\sigma} of \sigma (e.g. sample standard deviation of Y, or fit ordinary least squares (OLS) of Y on X and take the residual standard deviation).
  2. Pick a value of \nu between 3 and 10 to get an appropriate shape.
  3. Pick a value of \lambda so that the qth quantile of the prior on \sigma is \hat{\sigma}, i.e. P(\sigma < \hat{\sigma}) = q. Chipman et al. recommend considering q = 0.75, 0.9 or 0.99.

For users who don’t want to choose \nu and q, the authors recommend defaults of \nu = 3 and q = 0.9.

The T_j prior

The prior for a tree T_j can be specified with 3 ingredients:

  1. The probability that a node at depth d (the root node having depth 0) is non-terminal.
  2. If a node is non-terminal, the probability that the ith covariate (out of p covariates) is the splitting variable for this node.
  3. Once the splitting variable for a node has been chosen, a probability over the possible cut-points for this variable.

Chipman et al. suggest the following:

  1. P(\text{node at depth } d \text{ is non-terminal}) = \alpha (1 + d)^{-\beta}, where \alpha \in (0, 1) and \beta \in [0, \infty) are hyperparameters. Authors suggest \alpha = 0.95 and \beta = 2 to favor small trees. With this choice, trees with 1, 2, 3, 4 and \geq 5 terminal nodes have prior probability of 0.05, 0.55, 0.28, 0.09 and 0.03 respectively.
  2. Chipman et al. suggest the uniform distribution over the p covariates.
  3. Given the splitting variable, Chipman et al. suggest the uniform prior on the discrete set of available splitting values for this variable.

The \mu_{ij} \mid T_j prior

For computational efficiency, Chipman et al. suggest using the conjugate normal distribution

p(\mu_{ij} \mid T_j) \sim \mathcal{N}(\mu_\mu, \sigma_\mu^2),

where \mu_\mu and \sigma_\mu are hyperparameters. As with the hyperparameters associated with the \sigma prior, the authors suggest setting them in a data-driven way. The way we do this is by ensuring that \mathbb{E}[Y \mid x] is in the right ballpark. With the prior above, \mathbb{E}[Y \mid x] has the prior \mathcal{N}(m\mu_\mu, m \sigma_\mu^2). Letting y_{min} and y_{max} be the min and max observed values of Y, choose \mu_\mu and \sigma_\mu such that

\begin{aligned} m \mu_\mu - k \sqrt{m}\sigma_\mu &= y_{min}, \\  m \mu_\mu + k \sqrt{m}\sigma_\mu &= y_{max}. \end{aligned}

If we choose k = 2, then this choice of hyperparameters means that the prior probability that \mathbb{E}[Y \mid x] \in (y_{min}, y_{max}) is 0.95.

Other notes

In theory, once we define a prior and likelihood we have a posterior. The practical question is whether we can derive this posterior or sample from it efficiently. Section 3 of the paper outlines a Bayesian backfitting MCMC algorithm that allows us to sample from the posterior distribution.

The set-up above applies for quantitative Y. For binary Y, the paper develops a probit model along similar lines in Section 4.

References:

  1. Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees.
  2. Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). Bayesian CART model search.
Advertisement

Laplace’s approximation for Bayesian posterior distribution

In Bayesian statistics, we want some parameter, \theta, that we want to estimate. To do so, we set a prior on the parameter, p(\theta), and assume a likelihood function, p(x \mid \theta), that models the relationship between the parameter and the data we see.

The key object in Bayesian inference is the posterior distribution, p(\theta \mid x), representing our belief about the distribution of the parameter given the prior and the data. It is computed using Bayes’ theorem:

\begin{aligned} p(\theta \mid x) = \frac{p(x \mid \theta) p (\theta)}{p(x)} =  \frac{p(x \mid \theta) p (\theta)}{\int p(x \mid \theta = t) p (\theta = t) d t}. \end{aligned}

A big difficulty of Bayesian statistics is evaluating the integral in the denominator above, also known as the normalization constant. Laplace’s approximation is one technique we can use to approximate the denominator.

(The rest of this post closely follows Reference 1.) For simpler notation, let’s rewrite the denominator as \int_a^b g(t) dt. Assume that we can find the point t_0 where the function g attains its maximum. Let h = \log g. Then, taking a second-order Taylor expansion of h around t_0, we have

\begin{aligned} \int_a^b g(t) dt &= \int_a^b \exp[h(t)] dt \\  &\approx \int_a^b \exp \left[ h(t_0) + h'(t_0) (t - t_0) + \frac{1}{2}h''(t_0) (t - t_0)^2 \right] dt \\  &=  \int_a^b \exp \left[ h(t_0) + \frac{1}{2}h''(t_0) (t - t_0)^2 \right] dt &(h'(t_0) = 0) \\  &= \exp [h(t_0)] \int_a^b \exp \left[ \frac{1}{2}h''(t_0) (t - t_0)^2 \right]dt \\  &= \exp [h(t_0)] \int_a^b \exp \left[ - \frac{1}{2} \frac{(t - t_0)^2}{-h''(t_0)^{-1}}  \right]dt. \end{aligned}

We recognize the integrand to be proportional to the PDF of a normal distribution with mean t_0 and variance -h''(t_0)^{-1}. Hence, we have a closed form for the integral in terms of the normal CDF:

\begin{aligned} \int_a^b g(t) dt = \exp [h(t_0)] \sqrt{\frac{2\pi}{-h''(t_0)}} \left[ \Phi(b \mid t_0, -h''(t_0)^{-1} - \Phi(a \mid t_0, -h''(t_0)^{-1} \right]. \end{aligned}

If a = -\infty and b = \infty (as is usually the case), the term in square brackets is equal to 1, and we obtain

\begin{aligned} \int_{-\infty}^\infty g(t) dt = \exp [h(t_0)] \sqrt{\frac{2\pi}{-h''(t_0)}}. \end{aligned}

Operationalizing Laplace’s approximation depends crucially on being able to locate t_0 well. Thus, Laplace’s approximation replaces an integration problem with a maximization problem.

How well does Laplace’s approximation do? That depends on how good the second-order Taylor approximation for h = \log g is. When the second-order Taylor approximation is exact, g is exactly proportional to a normal PDF. Hence, the closer g looks like a normal PDF, the better Laplace’s approximation will be.

References:

  1. Peng, R. D. (2018). Advanced Statistical Computing (Chapter 5.1).

The Bayesian lasso

In this previous post, we noted that ridge regression has a Bayesian connection: it is the maximum a posteriori (MAP) estimate of the coefficient vector \beta when the prior distribution of its coordinates are independent mean-zero Gaussians with the same variance, and the likelihood of the data is

y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I),

where \sigma > 0 is some constant. The lasso has a similar interpretation which was noted in the original paper introducing the method (Tibshirani 1996). The lasso estimate is given by the optimization problem

\underset{\beta}{\text{minimize}} \quad (y - X\beta)^T (y - X\beta) + \lambda \|\beta\|_1,

where \|\beta \|_1 = |\beta_1| + \dots + |\beta_p| and \lambda \geq 0 is a hyperparameter. Assume \beta has the prior distribution where the \beta_j‘s are independent and each having mean-zero Laplace distribution:

f(\beta) = \displaystyle\prod_{j=1}^p \frac{1}{2\tau} \exp \left(-\frac{|\beta_j|}{\tau} \right),

where \tau is some constant. The posterior density of \beta is given by

\begin{aligned} p(\beta \mid y, X) &\propto p(\beta) \cdot p(y \mid X, \beta) \\ &\propto \exp \left[ -\frac{|\beta_1|}{\tau} - \dots - \frac{|\beta_p|}{\tau} \right] \cdot \exp \left[ -(y - X\beta)^T \frac{1}{\sigma^2} (y - X\beta) \right] \\ &= \exp \left[ -\frac{1}{\sigma^2}(y-X\beta)^T (y - X \beta) - \frac{1}{\tau} \|\beta\|_1 \right]. \end{aligned}

The MAP estimate, i.e. the value of \beta which maximizes the posterior density, is given by

\begin{aligned} \hat{\beta} &= \underset{\beta}{\text{argmax}} \quad \exp \left[ -\frac{1}{\sigma^2}(y-X\beta)^T (y - X \beta) - \frac{1}{\tau} \|\beta\|_1 \right] \\  &= \underset{\beta}{\text{argmin}} \quad \frac{1}{\sigma^2}(y-X\beta)^T (y - X \beta) + \frac{1}{\tau} \|\beta\|_1 \\  &= \underset{\beta}{\text{argmin}} \quad (y-X\beta)^T (y - X \beta) + \frac{\sigma^2}{\tau} \|\beta\|_1, \end{aligned}

which is the lasso estimate for \lambda = \dfrac{\sigma^2}{\tau}.

The Bayesian Lasso (Park & Casella 2008) takes this connection further by taking a fully Bayesian approach. Here is the specification of the full model:

\begin{aligned} f(\sigma^2) &\propto 1/\sigma^2, \\  f(\beta \mid \sigma^2) &= \prod_{j=1}^p \frac{\lambda}{2\sqrt{\sigma^2}} \exp \left(-\frac{\lambda |\beta_j|}{\sqrt{\sigma^2}} \right), \\  y \mid \mu, X, \beta, \sigma^2 &\sim \mathcal{N}(\mu 1 + X\beta, \sigma^2 I). \end{aligned}

In the above, f denotes the probability density function. \mu is another parameter which the authors suggest giving an independent flat prior. The prior for \sigma^2 is the standard non-informative scale-invariant prior.

Because the Laplace distribution can be thought of as a mixture of normals, the posterior distribution can be sampled from via a Gibbs sampler.

What does the Bayesian lasso buy you? Well, it is a fully Bayesian method which seems to perform much like the lasso in practice. Because it is fully Bayesian, you get everything which comes with that point of view (e.g. credible intervals for any parameter of your choosing).

References:

  1. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso.
  2. Park, T. and Casella, G. (2008). The Bayesian Lasso.

Bayesian interpretation of ridge regression

Assume that we are in the standard supervised learning setting, where we have a response vector y \in \mathbb{R}^n and a design matrix X \in \mathbb{R}^{n \times p}. Ordinary least squares seeks the coefficient vector \beta \in \mathbb{R}^p which minimizes the residual sum of squares (RSS), i.e.

\hat{\beta} = \underset{\beta}{\text{argmin}} \; (y- X\beta)^T (y - X\beta).

Ridge regression is a commonly used regularization method which looks for \beta that minimizes the sum of the RSS and a penalty term:

\hat{\beta} = \underset{\beta}{\text{argmin}} \; (y- X\beta)^T (y - X\beta) + \lambda \| \beta\|_2^2,

where \|\beta\|_2^2 = \beta_1^2 + \dots + \beta_p^2, and \lambda \geq 0 is a hyperparameter.

The ridge regression estimate has a Bayesian interpretation. Assume that the design matrix X is fixed. The ordinary least squares model posits that the conditional distribution of the response y is

y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I),

where \sigma > 0 is some constant. In frequentism we think of \beta as being some fixed unknown vector that we want to estimate. In Bayesian statistics, we can impose a prior distribution on \beta and perform any estimation we want using the posterior distribution of \beta.

Let’s say our prior distribution of \beta is that the \beta_j‘s are independent normals with the same variance, i.e. \beta \sim \mathcal{N}(0, \tau^2 I) for some constant \tau. This allows us to compute the posterior distribution of \beta:

\begin{aligned} p(\beta \mid y, X) &\propto p(\beta) \cdot p(y \mid X, \beta) \\ &\propto \exp \left[ - \frac{1}{2} (\beta - 0)^T \frac{1}{\tau^2} I (\beta - 0) \right] \cdot \exp \left[ -\frac{1}{2}(y - X\beta)^T \frac{1}{\sigma^2} (y - X\beta) \right] \\ &= \exp \left[ -\frac{1}{2\sigma^2}(y-X\beta)^T (y - X \beta) - \frac{1}{2\tau^2} \|\beta\|_2^2 \right]. \end{aligned}

From this expression, we can compute the mode of the posterior distribution, which is also known as the maximum a posteriori (MAP) estimate. It is

\begin{aligned} \hat{\beta} &= \underset{\beta}{\text{argmax}} \quad \exp \left[ -\frac{1}{2\sigma^2}(y-X\beta)^T (y - X \beta) - \frac{1}{2\tau^2} \|\beta\|_2^2 \right] \\  &= \underset{\beta}{\text{argmin}} \quad \frac{1}{\sigma^2}(y-X\beta)^T (y - X \beta) + \frac{1}{\tau^2} \|\beta\|_2^2 \\  &= \underset{\beta}{\text{argmin}} \quad (y-X\beta)^T (y - X \beta) + \frac{\sigma^2}{\tau^2} \|\beta\|_2^2, \end{aligned}

which is the ridge regression estimate when \lambda = \dfrac{\sigma^2}{\tau^2}.