# Consistency and unbiasedness

Let $P$ be some underlying true probability distribution. We are interested in some parameter of the probability distribution $\theta = \theta(P)$ (e.g. the mean, median or variance of $P$). To estimate $\theta$, we have on hand samples $X_1, X_2, \dots$, which are i.i.d. draws from the distribution $P$. For each $n = 1, 2, \dots$, let $T_n$ be the estimator based on the first $n$ data points.

• The estimator $T_n$ is consistent if $T_n$ converges in probability to $\theta$.
• It is unbiased if it is, on average, equal to the true value of the parameter, i.e. $\theta$ if $\mathbb{E}[T_n] = \theta$.
• It is asymptotically unbiased if $\mathbb{E}[T_n] \rightarrow \theta$ as $n \rightarrow \infty$.

Unbiasedness is not the same as consistency

It’s important to note that unbiasedness and consistency do not imply each other. The examples below from Reference 1 show that.

Unbiasedness does not imply consistency: Let $X_i \stackrel{i.i.d.}{\sim} \mathcal{N}(\mu, \sigma^2)$, $i = 1, 2, \dots$. Consider the estimator $T_n = X_1$ for the mean $\mu$. We always have $\mathbb{E}[T_n] = \mathbb{E}[X_1] = \mu$, so it is unbiased. However, $T_n$ converges in distribution to $\mathcal{N}(\mu, \sigma^2)$, and so is not consistent.

Consistency does not imply unbiasedness: Let $X_i \stackrel{i.i.d.}{\sim} \mathcal{N}(\mu, \sigma^2)$, $i = 1, 2, \dots$. The maximum likelihood estimator (MLE) for $\sigma^2$ is $T_n = \dfrac{1}{n}\sum_{i=1}^n (X_i - \overline{X})^2$, where $\overline{X} = (X_1 + \dots + X_n) / n$. It is consistent (the MLE is always consistent), but it is not hard to show that $\mathbb{E}[T_n] = \dfrac{n-1}{n}\sigma^2$, i.e. it is biased.

Asymptotic unbiasedness and consistency

Asymptotic unbiasedness and consistency also do not imply each other.

Asymptotic unbiasedness does not imply consistency: This is a variation of the example for “unbiasedness does not imply consistency”. Instead of taking $T_n = X_1$, take $T_n = X_1 + 1/n$.

Consistency does not imply asymptotic unbiasedness: From Reference 2: consider a silly example where $\theta = 0$ and we want to estimate $\theta$ using random variables $T_n$ with

\begin{aligned} T_n = \begin{cases} 0 &\text{with probability } (n-1)/n, \\ n &\text{with probability 1/n}. \end{cases} \end{aligned}

$T_n$ is consistent since it converges in probability to 0, but it is not asymptotically unbiased: $\mathbb{E}[T_n] = 1$ for every $n$.

References:

# Git references

Git is an essential distributed version control system for managing programming projects (and in my case, maintaining R packages).  While essential, I find it quite difficult to remember common git idioms for simple operations. This post lists references that I personally find useful: hopefully some of you might find them useful too.

Why am I not simply posting the git commands? That’s because the correct git command depends a lot on the context. By linking to the full reference, you can check if the situation the reference is addressing matches yours before applying its solution.

(Note: This post will be updated periodically as I find other useful idioms that I keep forgetting.)

# What is a dgCMatrix object made of? (sparse matrix format in R)

I’ve been working with sparse matrices in R recently (those created using Matrix::Matrix with the option sparse=TRUE) and found it difficult to track down documentation about what the slots in the matrix object are. This post describes the slots in a class dgCMatrix object.

(Click here for full documentation of the Matrix package (and it is a lot–like, 215 pages a lot).)

Background

It turns out that there is some documentation on dgCMatrix objects within the Matrix package. One can access it using the following code:

library(Matrix)
?dgCMatrix-class


According to the documentation, the dgCMatrix class

…is a class of sparse numeric matrices in the compressed, sparse, column-oriented format. In this implementation the non-zero elements in the columns are sorted into increasing row order. dgCMatrix is the “standard” class for sparse numeric matrices in the Matrix package.

An example

We’ll use a small matrix as a running example in this post:

library(Matrix)
M <- Matrix(c(0, 0,  0, 2,
6, 0, -1, 5,
0, 4,  3, 0,
0, 0,  5, 0),
byrow = TRUE, nrow = 4, sparse = TRUE)
rownames(M) <- paste0("r", 1:4)
colnames(M) <- paste0("c", 1:4)
M
# 4 x 4 sparse Matrix of class "dgCMatrix"
#    c1 c2 c3 c4
# r1  .  .  .  2
# r2  6  . -1  5
# r3  .  4  3  .
# r4  .  .  5  .


Running str on x tells us that the dgCMatrix object has 6 slots. (To learn more about slots and S4 objects, see this section from Hadley Wickham’s Advanced R.)

str(M)
# Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
# ..@ i       : int [1:7] 1 2 1 2 3 0 1
# ..@ p       : int [1:5] 0 1 2 5 7
# ..@ Dim     : int [1:2] 4 4
# ..@ Dimnames:List of 2
# .. ..$: chr [1:4] "r1" "r2" "r3" "r4" # .. ..$ : chr [1:4] "c1" "c2" "c3" "c4"
# ..@ x       : num [1:7] 6 4 -1 3 5 2 5
# ..@ factors : list()


x, i and p

If a matrix M has nn non-zero entries, then its x slot is a vector of length nn containing all the non-zero values in the matrix. The non-zero elements in column 1 are listed first (starting from the top and ending at the bottom), followed by column 2, 3 and so on.

M
# 4 x 4 sparse Matrix of class "dgCMatrix"
#    c1 c2 c3 c4
# r1  .  .  .  2
# r2  6  . -1  5
# r3  .  4  3  .
# r4  .  .  5  .

M@x
# [1]  6  4 -1  3  5  2  5

as.numeric(M)[as.numeric(M) != 0]
# [1]  6  4 -1  3  5  2  5


The i slot is a vector of length nn. The kth element of M@i is the row index of the kth non-zero element (as listed in M@x). One big thing to note here is that the first row has index ZERO, unlike R’s indexing convention. In our example, the first non-zero entry, 6, is in the second row, i.e. row index 1, so the first entry of M@i is 1.

M
# 4 x 4 sparse Matrix of class "dgCMatrix"
#    c1 c2 c3 c4
# r1  .  .  .  2
# r2  6  . -1  5
# r3  .  4  3  .
# r4  .  .  5  .

M@i
# [1] 1 2 1 2 3 0 1


If the matrix has nvars columns, then the p slot is a vector of length nvars + 1. If we index the columns such that the first column has index ZERO, then M@p[1] = 0, and M@p[j+2] - M@p[j+1] gives us the number of non-zero elements in column j.

In our example, when j = 2, M@p[2+2] - M@p[2+1] = 5 - 2 = 3, so there are 3 non-zero elements in column index 2 (i.e. the third column).

M
# 4 x 4 sparse Matrix of class "dgCMatrix"
#    c1 c2 c3 c4
# r1  .  .  .  2
# r2  6  . -1  5
# r3  .  4  3  .
# r4  .  .  5  .

M@p
# [1] 0 1 2 5 7


With the x, i and p slots, one can reconstruct the entries of the matrix.

Dim and Dimnames

These two slots are fairly obvious. Dim is a vector of length 2, with the first and second entries denoting the number of rows and columns the matrix has respectively. Dimnames is a list of length 2: the first element being a vector of row names (if present) and the second being a vector of column names (if present).

factors

This slot is probably the most unusual of the lot, and its documentation was a bit difficult to track down. From the CRAN documentation, it looks like factors is

… [an] Object of class “list” – a list of factorizations of the matrix. Note that this is typically empty, i.e., list(), initially and is updated automagically whenever a matrix factorization is
computed.

My understanding is if we perform any matrix factorizations or decompositions on a dgCMatrix object, it stores the factorization under factors so that if asked for the factorization again, it can return the cached value instead of recomputing the factorization. Here is an example:

M@factors
# list()

Mlu <- lu(M)  # perform triangular decomposition
str(M@factors)
# List of 1
# $LU:Formal class 'sparseLU' [package "Matrix"] with 5 slots # .. ..@ L :Formal class 'dtCMatrix' [package "Matrix"] with 7 slots # .. .. .. ..@ i : int [1:4] 0 1 2 3 # .. .. .. ..@ p : int [1:5] 0 1 2 3 4 # .. .. .. ..@ Dim : int [1:2] 4 4 # .. .. .. ..@ Dimnames:List of 2 # .. .. .. .. ..$ : chr [1:4] "r2" "r3" "r4" "r1"
# .. .. .. .. ..$: NULL # .. .. .. ..@ x : num [1:4] 1 1 1 1 # .. .. .. ..@ uplo : chr "U" # .. .. .. ..@ diag : chr "N" # .. ..@ U :Formal class 'dtCMatrix' [package "Matrix"] with 7 slots # .. .. .. ..@ i : int [1:7] 0 1 0 1 2 0 3 # .. .. .. ..@ p : int [1:5] 0 1 2 5 7 # .. .. .. ..@ Dim : int [1:2] 4 4 # .. .. .. ..@ Dimnames:List of 2 # .. .. .. .. ..$ : NULL
# .. .. .. .. ..$: chr [1:4] "c1" "c2" "c3" "c4" # .. .. .. ..@ x : num [1:7] 6 4 -1 3 5 5 2 # .. .. .. ..@ uplo : chr "U" # .. .. .. ..@ diag : chr "N" # .. ..@ p : int [1:4] 1 2 3 0 # .. ..@ q : int [1:4] 0 1 2 3 # .. ..@ Dim: int [1:2] 4 4  Here is an example which shows that the decomposition is only performed once: set.seed(1) M <- runif(9e6) M[sample.int(9e6, size = 8e6)] <- 0 M <- Matrix(M, nrow = 3e3, sparse = TRUE) system.time(lu(M)) # user system elapsed # 13.527 0.161 13.701 system.time(lu(M)) # user system elapsed # 0 0 0  # A deep dive into glmnet: predict.glmnet I’m writing a series of posts on various function options of the glmnet function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation. In this post, instead of looking at one of the function options of glmnet, we’ll look at the predict method for a glmnet object instead. The object returned by glmnet (call it fit) has class "glmnet"; when we run predict(fit), it runs the predict method for class "glmnet" objects, i.e. predict.glmnet(fit). For reference, here is the full signature of the predict.glmnet function/method (v3.0-2): predict(object, newx, s = NULL, type = c("link", "response", "coefficients", "nonzero", "class"), exact = FALSE, newoffset, ...)  In the above, object is a fitted "glmnet" object (call it fit). Recall that every glmnet fit has a lambda sequence associated with it: this will be important in understanding what follows. (This sequence can be accessed via fit$lambda.)

For the rest of this post, we will use the following data example:

set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), nrow = n)
beta <- matrix(c(rep(1, 5), rep(0, 15)), ncol = 1)
y <- x %*% beta + rnorm(n)

fit <- glmnet(x, y)


Function option: newx

newx is simply the new x matrix at which we want predictions for. So for example, if we want predictions for the training x matrix, we would do

predict(fit, x)


If no other arguments are passed, we will get a matrix of predictions, each column corresponding to predictions for each value of $\lambda$ in fit$lambda. For our example, fit$lambda has length 68 and x consists of 100 rows/observations, so predict(fit, x) returns a $100 \times 68$ matrix.

length(fit$lambda) # [1] 68 dim(predict(fit, x)) # [1] 100 68  newx must be provided except when type="coefficients" or type="nonzero" (more on these types later). Function option: newoffset If the original glmnet call was fit with an offset, then an offset must be included in the predict call under the newoffset option. If not included, an error will be thrown. set.seed(2) offset <- rnorm(n) fit2 <- glmnet(x, y, offset = offset) predict(fit2, x) # Error: No newoffset provided for prediction, yet offset used in fit of glmnet  The reverse is true, in that if the original glmnet call was NOT fit with an offset, then predict will not allow you to include an offset in the prediction, EVEN if you pass it the newoffset option. It does not throw a warning or error, but simply ignore the newoffset option. You have been warned! This is demonstrated in the code snippet below. pred_no_offset <- predict(fit, x) pred_w_offset <- predict(fit, x, offset = offset) max(abs(pred_no_offset - pred_w_offset)) # [1] 0  Function option: s and exact s indicates the $\lambda$ values for which we want predictions at. If the user does not specify s, predict will give predictions for each of the $\lambda$ values in fit$lambda.

(Why is this option named s and not the more intuitive lambda? In page 5 of this vignette, the authors say they made this choice “in case later we want to allow one to specify the model size in other ways”. lambda controls the model size in the sense that the larger it is, the more coefficients will be forced to zero. There are other ways to specify model size. For example, one could imagine a function option where we specify the number of non-zero coefficients we want in the model, or where we specify the maximum $\ell_1$ norm the coefficient vector can have. None of these other options have been implemented at the moment.)

If the user-specified s values all belong to fit$lambda, then predict pulls out the coefficients corresponding to those values and returns predictions. In this case, the exact option has no effect. If the user-specified s value does NOT belong to fit$lambda, things get interesting. If exact=FALSE (the default), predict uses linear interpolation to make predictions. (More accurately, it does linear interpolation of the coefficients, which translates to linear interpolation of the predictions.) As stated in the documentation: “while this is often a good approximation, it can sometimes be a bit coarse”.

As a demonstration: In the snippet below, we look at the predictions at a value of $\lambda$ that lies between the two largest values in fit$lambda. If the function does as the documentation says, the last line should give a value of 0 (to machine precision). b1 <- as.numeric(predict(fit, x, s = fit$lambda[1]))
b2 <- as.numeric(predict(fit, x, s = fit$lambda[2])) b3 <- as.numeric(predict(fit, x, s = 0.3*fit$lambda[1] + 0.7*fit$lambda[2])) max(abs(b3 - (0.3*b1 + 0.7*b2))) # [1] 3.885781e-16  What happens if we have values in s that are not within the range of fit$lambda? First, I would recommend using exact=TRUE because extrapolation beyond the range of fit$lambda is dangerous in general. In my little experiments, it looks like predict simply returns the predictions for the $\lambda$ value in fit$lambda that is closest to s.

If exact=TRUE, predict merges s with fit$lambda to get a single (decreasing) $\lambda$ sequence, refits the glmnet model, then returns predictions at the $\lambda$ values in s. If your training data is very large, this refitting could take a long time. One note when using exact=TRUE is that you have to pass in additional arguments in order for the refitting to happen. That’s because the fitted glmnet object does not contain all the ingredients needed to do refitting. For our example, to predict for fit we need to supply x and y as well. For more complicated glmnet calls, more options have to be provided. predict(fit, x, s = fit$lambda[68] / 2, exact = TRUE)
# Error: used coef.glmnet() or predict.glmnet() with exact=TRUE
# so must in addition supply original argument(s)  x and y  in order to
# safely rerun glmnet
predict(fit, x, s = fit$lambda[68] / 2, exact = TRUE, x = x, y = y) # glmnet correctly returns predictions...  Function option: type The type option determines the type of prediction returned. type="coefficients" returns the model coefficients for the $\lambda$ values in s as a sparse matrix. type="nonzero" returns a list, with each element being a vector of the features which have non-zero features. For example, the code snippet below shows that for the second and third $\lambda$ values in fit$lambda, the features that have non-zero coefficients are feature 5 and features 3 and 5 respectively.

predict(fit, type = "nonzero", s = fit$lambda[2:3]) #$1
# [1] 5
#
# 2 # [1] 3 5  For type="coefficients" and type="nonzero", the user does not have to provide a newx argument since the return value does not depend on where we want the predictions. For the rest of the possible values of type, newx is required. For type="link" (the default) and type="response" it helps to know a little GLM theory. For a observation having values $x \in \mathbb{R}^p$, type="link" returns $x^T \beta$, where $\beta$ is the coefficient vector corresponding to a $\lambda$ value in s. For type="response", $x^T \beta$ is passed through the GLM’s inverse link function to return predictions on the y scale. For “gaussian” family it is still $x^T \beta$. For “binomial” and “poisson” families it is $\exp(x^T \beta) / (1 + \exp(x^T \beta))$ and $\exp(x^T \beta)$ respectively. For “multinomial” it returns fitted probabilities and for “cox” it returns fitted relative risk. The final possibility, type="class", applies only to “binomial” and “multinomial” families. For each observation, it simply returns the class with the highest predicted probability. Bonus: The coef method The coef method for glmnet is actually just a special case of the predict method. This can be seen from the source code: coef.glmnet # function (object, s = NULL, exact = FALSE, ...) # predict(object, s = s, type = "coefficients", exact = exact, # ...) # <bytecode: 0x7ff3ae934f20> # <environment: namespace:glmnet>  Bonus: predict.elnet, predict.lognet, … If you inspect the class of the object returned by a glmnet call, you will realize that it has more than one class. In the code below, we see that “gaussian” family results in an “elnet” class object. (“binomial” family returns a “lognet” object, “poisson” family returns a “fishnet” object, etc.) class(fit) # [1] "elnet" "glmnet"  These classes have their own predict methods as well, but they draw on this base predict.glmnet call. As an example, here is the code for predict.fishnet: glmnet:::predict.fishnet # function (object, newx, s = NULL, type = c("link", "response", # "coefficients", "nonzero"), exact = FALSE, newoffset, ...) # { # type = match.arg(type) # nfit = NextMethod("predict") # switch(type, response = exp(nfit), nfit) # } # <bytecode: 0x7ff3ab622040> # <environment: namespace:glmnet>  What happens here is that predict.glmnet is first called. If type is not "response", then we simply return whatever predict.glmnet would have returned. However, if type="response", then (i) we call predict.glmnet, and (ii) the predictions are passed through the function $x \mapsto \exp(x)$ before being returned. This is how predict is able to give the correct return output across the different family and type options. # Derivation of GLM likelihood equations In this previous post, I stated the likelihood equations (or score equations) for generalized linear models (GLMs). Any solution to the score equations is a maximum likelihood estimator (MLE) for the GLM. In this post, I work through the derivation of the score equations. Recap: What is a GLM? Assume we have $n$ data points $(x_{i1}, \dots, x_{ip}, y_i) \in \mathbb{R}^{p+1}$ for $i = 1, \dots, n$. We want to build a generalized linear model (GLM) of the response $y$ using the $p$ other features $X_1, \dots, X_p$. To that end, assume that the $x$ values are all fixed. Assume that $y_1, \dots, y_n$ are samples of independent random variables $Y_1, \dots, Y_n$ which have the probability density (or mass) function of the form \begin{aligned} f(y_i ; \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right]. \qquad -(1) \end{aligned} In the above, the form of $f$ is known, but not the values of the $\theta_i$‘s and $\phi$. Let $\mu_i = \mathbb{E} [Y_i]$. We assume that \begin{aligned} \eta_i = \sum_{j=1}^p \beta_j x_{ij}, \quad i = 1, \dots, n, \qquad -(2) \end{aligned} where $\eta_i = g(\mu_i)$ for some link function $g$, assumed to be known. Recap: The likelihood/score equations The goal of fitting a GLM is to find estimates for $\beta_1, \dots, \beta_p$. More specifically, we want to find the values of $\beta_1, \dots, \beta_p$ which maximize the (log)-likelihood of the data: \begin{aligned} L = \sum_{i=1}^n L_i = \sum_{i=1}^n \log f(y_i ; \theta_i, \phi). \end{aligned} To do that, we differentiate $L$ w.r.t. $\beta$ and set it to be 0. Using the chain rule as well as the form of $f$, after some algebra we have \begin{aligned} 0 = \frac{\partial L_i}{\partial \beta_j} = \frac{\partial L_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta_j} = \dots = \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i}. \qquad -(3) \end{aligned} Hence, the likelihood equations (or score equations) are \begin{aligned} \sum_{i=1}^n \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i} = 0, \quad j = 1, \dots, p. \qquad -(4) \end{aligned} $\beta$ appears implicitly in the equations above through $\mu_i$: $\mu_i = g^{-1}(\sum_{j=1}^p \beta_j x_{ij})$. (See the original post for a matrix form of these equations.) Likelihood equations: A derivation All this involves is evaluating the 4 partial derivatives in $(3)$ and multiplying them together. Using the form of the probability density function for $Y_i$ $(1)$: \begin{aligned} L_i &= \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi), \\ \dfrac{\partial L_i}{\partial \theta_i} &= \frac{y_i - b'(\theta_i)}{a(\phi)}. \end{aligned} The second partial derivative,\partial \theta_i / \partial \mu_i\$, is probably the trickiest of the lot: simplifying it requires some properties of exponential families. Under general regularity conditions, we have

\begin{aligned} \mathbb{E} \left( \dfrac{\partial L}{\partial \theta} \right) &= 0, &-(5) \\ - \mathbb{E} \left( \dfrac{\partial^2 L}{\partial \theta^2} \right) &= \mathbb{E} \left[ \left( \dfrac{\partial L}{\partial \theta} \right)^2 \right]. &-(6) \end{aligned}

Applying $(5)$ to our setting:

\begin{aligned} \mathbb{E} \left[ \frac{Y_i - b'(\theta_i)}{a(\phi)} \right] &= 0, \\ \mu_i = \mathbb{E}[Y_i] &= b'(\theta_i), \text{ and } \quad \dfrac{\partial \mu_i}{\partial \theta_i} = b''(\theta_i). \end{aligned}

Applying $(6)$ to our setting:

\begin{aligned} - \dfrac{b''(\theta_i)}{a(\phi)} = \mathbb{E} \left( \dfrac{\partial^2 L_i}{\partial \theta_i^2} \right) &= -\mathbb{E} \left[\left( \dfrac{\partial L_i}{\partial \theta_i} \right)^2 \right] = -\mathbb{E} \left[ \left( \frac{Y_i - b'(\theta_i)}{a(\phi)} \right)^2 \right], \\ \dfrac{b''(\theta_i)}{a(\phi)} &= \dfrac{\mathbb{E}[(Y_i - \mu_i)^2]}{a(\phi)^2}, \\ b''(\theta_i) &= \text{Var}(Y_i) / a(\phi). \end{aligned}

Thus,

\begin{aligned} \dfrac{\partial \mu_i}{\partial \theta_i} &= b''(\theta_i) = \text{Var}(Y_i) / a(\phi), \\ \dfrac{\partial \theta_i}{\partial \mu_i} &= a(\phi) / \text{Var}(Y_i). \end{aligned}

The third partial derivative, $\partial \mu_i / \partial \eta_i$, actually appears in the likelihood equations so we don’t have to do any algebraic manipulation here. Finally, using the systematic component of the GLM $(2)$, we have

\begin{aligned} \dfrac{\partial \eta_i}{\partial \beta_j} = x_{ij}. \end{aligned}

Putting these 4 parts together:

\begin{aligned} \frac{\partial L_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta_j} &= \frac{y_i - b'(\theta_i)}{a(\phi)} \cdot \frac{a(\phi)}{\text{Var}(Y_i)} \cdot \frac{\partial \mu_i}{\partial \eta_i} \cdot x_{ij} \\ &= \dfrac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)}\frac{\partial \mu_i}{\partial \eta_i}, \end{aligned}

as required.

References:

1. Agresti, A. Categorical Data Analysis (3rd ed), Chapter 4.

# rOpenSci community calls

This is a short PSA about an R resource that I recently learnt about (and participated in): rOpenSci community calls. According to the website, these community calls happen quarterly, and is a place where the public can learn about “best practices, new projects, Q&As with well known developers, and… rOpenSci developments”.

I heard about the most recent community call (“Maintaining an R package”) via an announcement on R-bloggers. The topic was of personal interest to me and the panelists were experienced/interesting enough that I felt I could learn a lot by participating. For reference, here were the speakers/panelists for this call:

• Julia Silge, Data scientist & software engineer @ RStudio
• Elin Waring, Professor of Sociology and Interim Dean of health sciences, human services and nursing @ Lehman College, CUNY
• Erin Grand, Data scientist @ Uncommon Schools
• Leonardo Collado-Torres, Research scientist @ Lieber institute for brain development
• Scott Chamberlain, Co-founder and technical lead @ rOpenSci

Here are some of my quick observations of the event as a participant:

• The calls are publicly hosted on Zoom, which made it really easy to join. Overall the video and sound quality was good and clear enough that I wasn’t straining to hear the speakers.
• At the beginning of the call, Stefanie, the community manager hosting this call, suggested that those who were comfortable share their screen so that we could put faces to names. That was a small simple touch that made the call more personal!
• As the call is happening, attendees can collaboratively update a shared document capturing the key points of the discussion. It is then made publicly available soon after the call is over. (As an example, this is the collaborative document of the call I attended.)
• Through the collaborative document, not only could participants ask the speakers questions, but other participants could answer and comment on those questions as well!
• rOpenSci does a really good job of recording different aspects of the call and archiving them for future reference. Each call has its own webpage with all the resources associated with it. For the call I attended, all the resources are here. There are a list of resource links (including one for the collaborative notes), as well as a video recording of the call itself!

I enjoyed listening in on the call and am very much looking forward to the next one! I hope that you will considering joining in as well.

# Bayesian Additive Regression Trees (BART)

Bayesian Additive Regression Trees (BART), proposed by Chipman et al. (2010) (Reference 1), is a Bayesian “sum-of-trees” model where we use the sum of trees to model or approximate the target function $f(x) = \mathbb{E}[Y \mid x]$, where $Y$ is the response of interest and $x$ represents the covariates that we are using to predict $Y$. (If there are $p$ covariates, then $x$ is a $p$-dimensional vector.) It is inspired by ensemble methods such as boosting, where a meta algorithm combines a large number of “weak learners”/”weak predictors” to get a much more accurate prediction.

As a Bayesian method, all we have to do is to specify the prior distribution for the parameters and the likelihood. From there, we can turn the proverbial Bayesian crank to get the posterior distribution of parameters and with it, posterior inference of any quantity we are interested in (e.g. point and estimate estimates of $f(x) = \mathbb{E}[Y \mid x]$).

Likelihood

The likelihood is easy to specify once we get definitions out of the way. Let $T$ denote a binary decision tree. Assuming the tree has $b$ terminal nodes, let $M = \{ \mu_1, \dots, \mu_b \}$ be the set of parameter values associated with each of the terminal nodes such that if the $x$ value ends up in the $i$th terminal node, the tree would return the value $\mu_i$. We can think of $T$ as representing the structure of the tree, and of $M$ as specifying the value we return to the user once the input hits one of the terminal nodes.

For a given $T$ and $M$, let $g(x; T, M)$ denote the function which assigns $\mu_i$ to $x$ if $x$ ends up in the $i$th terminal node.

The likelihood for BART is

\begin{aligned} Y = \sum_{j=1}^m g(x; T_j, M_j) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2). \end{aligned}

The response is the sum of $m$ binary decision trees with additive Gaussian noise.

We need to choose priors for $m$, $\sigma$, and $(T_1, M_1), \dots, (T_m, M_m)$.

Chipman et al. actually decided not to put a prior on $m$ for computational reasons. Instead, they suggested beginning with a default of $m = 200$, then check if one or two other choices of $m$ makes any difference. According to them, as $m$ increases, predictive performance improves dramatically at first, then levels off, and finally degrades slowly. (I believe this is the observed behavior with boosting as well.) As such, for prediction it appears only important to avoid having too few trees.

As for the other parameters, we introduce independence assumptions to simplify the prior specification. If $\mu_{ij} \in M_j$ represents the terminal value of the $i$th node in $T_j$, then we assume that

\begin{aligned} p \left( (T_1, M_1), \dots, (T_m, M_m), \sigma \right) &= \left[ \prod_j p(T_j, M_j) \right] p(\sigma) \\ &= \left[ \prod_j p(M_j \mid T_j) p(T_j) \right] p(\sigma), \end{aligned}

and that

\begin{aligned} p(M_j \mid T_j) = \prod_i p(\mu_{ij} \mid T_j). \end{aligned}

To complete the prior specification, we just need to specify the priors for $\sigma$, $T_j$ and $\mu_{ij} \mid T_j$. We do so in the following subsections. The paper notes that these priors were also used in an earlier paper by the same authors (Reference 2) where they considered a Bayesian model for a single decision tree.

The $\sigma$ prior

For $\sigma$ BART uses a standard conjugate prior, the inverse chi-square distribution

$\sigma^2 \sim \nu \lambda / \chi_\nu^2,$

where $\nu$ and $\lambda$ are the degrees of freedom and scale hyperparameters. Chipman et al. recommend choosing these two hyperparameters in a data-driven way:

1. Get some estimate $\hat{\sigma}$ of $\sigma$ (e.g. sample standard deviation of $Y$, or fit ordinary least squares (OLS) of $Y$ on $X$ and take the residual standard deviation).
2. Pick a value of $\nu$ between 3 and 10 to get an appropriate shape.
3. Pick a value of $\lambda$ so that the $q$th quantile of the prior on $\sigma$ is $\hat{\sigma}$, i.e. $P(\sigma < \hat{\sigma}) = q$. Chipman et al. recommend considering $q = 0.75$, $0.9$ or $0.99$.

For users who don’t want to choose $\nu$ and $q$, the authors recommend defaults of $\nu = 3$ and $q = 0.9$.

The $T_j$ prior

The prior for a tree $T_j$ can be specified with 3 ingredients:

1. The probability that a node at depth $d$ (the root node having depth 0) is non-terminal.
2. If a node is non-terminal, the probability that the $i$th covariate (out of $p$ covariates) is the splitting variable for this node.
3. Once the splitting variable for a node has been chosen, a probability over the possible cut-points for this variable.

Chipman et al. suggest the following:

1. $P(\text{node at depth } d \text{ is non-terminal}) = \alpha (1 + d)^{-\beta}$, where $\alpha \in (0, 1)$ and $\beta \in [0, \infty)$ are hyperparameters. Authors suggest $\alpha = 0.95$ and $\beta = 2$ to favor small trees. With this choice, trees with 1, 2, 3, 4 and $\geq 5$ terminal nodes have prior probability of 0.05, 0.55, 0.28, 0.09 and 0.03 respectively.
2. Chipman et al. suggest the uniform distribution over the $p$ covariates.
3. Given the splitting variable, Chipman et al. suggest the uniform prior on the discrete set of available splitting values for this variable.

The $\mu_{ij} \mid T_j$ prior

For computational efficiency, Chipman et al. suggest using the conjugate normal distribution

$p(\mu_{ij} \mid T_j) \sim \mathcal{N}(\mu_\mu, \sigma_\mu^2),$

where $\mu_\mu$ and $\sigma_\mu$ are hyperparameters. As with the hyperparameters associated with the $\sigma$ prior, the authors suggest setting them in a data-driven way. The way we do this is by ensuring that $\mathbb{E}[Y \mid x]$ is in the right ballpark. With the prior above, $\mathbb{E}[Y \mid x]$ has the prior $\mathcal{N}(m\mu_\mu, m \sigma_\mu^2)$. Letting $y_{min}$ and $y_{max}$ be the min and max observed values of $Y$, choose $\mu_\mu$ and $\sigma_\mu$ such that

\begin{aligned} m \mu_\mu - k \sqrt{m}\sigma_\mu &= y_{min}, \\ m \mu_\mu + k \sqrt{m}\sigma_\mu &= y_{max}. \end{aligned}

If we choose $k = 2$, then this choice of hyperparameters means that the prior probability that $\mathbb{E}[Y \mid x] \in (y_{min}, y_{max})$ is 0.95.

Other notes

In theory, once we define a prior and likelihood we have a posterior. The practical question is whether we can derive this posterior or sample from it efficiently. Section 3 of the paper outlines a Bayesian backfitting MCMC algorithm that allows us to sample from the posterior distribution.

The set-up above applies for quantitative $Y$. For binary $Y$, the paper develops a probit model along similar lines in Section 4.

References:

1. Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees.
2. Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). Bayesian CART model search.