What is the jackknife?

According to Wikipedia, the jackknife technique “was developed by Maurice Quenouille from 1949 and refined in 1956”, and was so named by John Tukey in 1958 (who also expanded on the technique). Quenouille’s 1956 paper “Notes on Bias in Estimation” gives a clear explanation of the intuition behind the jackknife estimate.

In the general statistical set-up, we are interested in estimating some parameter \theta, using observations x_1, \dots, x_n. Let \hat\theta_n denote our estimate for \theta based on these n observations. (For concreteness, one example you could consider is \theta = \sigma^2, i.e. variance of the distribution, and \hat\theta_n = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.)

There are several different properties we might want our estimate \hat\theta_n to have, e.g. efficiency, sufficiency, consistency, unbiasedness. The maximum likelihood estimate (MLE) has the first 3 properties but not the 4th (except in rare situations). If we have an MLE (or some other consistent estimate), how can we adjust it to obtain an estimate that is less biased?

Under some conditions, we can perform a Taylor expansion to get

\begin{aligned} \mathbb{E}[\hat\theta_n - \theta] = \frac{a_1}{n} + \frac{a_2}{n^2} + \frac{a_3}{n^3} + \dots \end{aligned}

If this is the case, then if we define \theta' = n\hat\theta_n - (n-1)\hat\theta_{n-1}, the equation above gives

\begin{aligned} \mathbb{E}[\theta'] &= n \left( \theta + \frac{a_1}{n} + \frac{a_2}{n^2} + \frac{a_3}{n^3} + \dots \right) - (n-1)\left( \theta + \frac{a_1}{n-1} + \frac{a_2}{(n-1)^2} + \frac{a_3}{(n-1)^3} + \dots \right) \\  &= \left( n\theta + a_1 + \frac{a_2}{n} + \frac{a_3}{n^2} + \dots \right) - \left( (n-1)\theta + a_1 + \frac{a_2}{n-1} + \frac{a_3}{(n-1)^2} + \dots \right) \\  &= \theta - \frac{a_2}{n(n-1)} + \dots \\  &= \theta + O\left( \frac{1}{n^2} \right), \end{aligned}

so \theta' is biased to order O\left( \frac{1}{n^2} \right), while our original estimate \hat\theta_n was biased to order O\left( \frac{1}{n} \right).

In defining \theta', we used \hat\theta_{n-1}, which was based on x_1, \dots, x_{n-1}. There is no a priori reason why we should use this particular set of n-1 points: we could just as easily have used \{ x_2, \dots, x_n \}, \{ x_1, x_3, x_4, \dots, x_n \}, or any other set of n-1 points instead. In fact, Quenouille argues that in order to minimize the loss in efficiency, one should use all possible sets of n-1 observations.

If we define \hat\theta_{(-i)} as the estimate of \theta based on all the observations except x_i, then replacing \hat\theta_{n-1} by the mean of these \hat\theta_{(-i)}'s (which we denote by \tilde\theta) yields the jackknife estimate:

\begin{aligned} \hat\theta_{jack} = n\hat\theta_n - (n-1) \tilde\theta = n\hat\theta_n - (n-1) \cdot \frac{1}{n}\sum_{i=1}^n \hat\theta_{(-i)}. \end{aligned}
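
To make this concrete, here is a short R sketch (my own illustration, not from the references) that jackknifes the plug-in variance estimator \hat\theta_n = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 from the example above. For this estimator the bias is exactly -\sigma^2/n (so a_1 = -\sigma^2 and the higher-order terms vanish), and a little algebra shows that \hat\theta_{jack} is exactly the usual unbiased sample variance \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2, which is what the code confirms:

# Jackknife bias correction for the plug-in variance estimator (illustrative sketch)
set.seed(42)
x <- rnorm(20, mean = 0, sd = 2)
n <- length(x)

plugin_var <- function(z) mean((z - mean(z))^2)   # \hat\theta_n: divides by n, so it is biased

theta_hat   <- plugin_var(x)                                # \hat\theta_n
theta_loo   <- sapply(1:n, function(i) plugin_var(x[-i]))   # leave-one-out estimates \hat\theta_{(-i)}
theta_tilde <- mean(theta_loo)                              # \tilde\theta

theta_jack <- n * theta_hat - (n - 1) * theta_tilde         # jackknife estimate
c(plugin = theta_hat, jackknife = theta_jack, unbiased = var(x))   # jackknife matches var(x)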

While Quenouille’s focus was on bias, Tukey saw the utility of the jackknife for robust interval estimation. He argued that the pseudo-values \theta_i^* = n\hat\theta_n - (n-1)\hat\theta_{(-i)} (whose mean is exactly \hat\theta_{jack}) could be treated as approximately i.i.d. random variables, so that the statistic

\dfrac{\sqrt{n}(\hat\theta_{jack} - \theta)}{\sqrt{\frac{1}{n-1} \sum_{i=1}^n (\theta_i^* - \hat\theta_{jack})^2} }

should have an approximate t distribution with n-1 degrees of freedom. Thus, we can make use of this pivotal statistic to do interval estimation.
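
As a concrete sketch (mine, not from the post), here is how one might compute this jackknife t-interval in R for the plug-in variance estimator above: the 95% interval is \hat\theta_{jack} \pm t_{n-1, 0.975} \cdot \widehat{se}, where \widehat{se} is the sample standard deviation of the pseudo-values divided by \sqrt{n}.

# Jackknife t-interval based on pseudo-values (illustrative sketch)
set.seed(1)
x <- rnorm(30, mean = 0, sd = 2)
n <- length(x)

plugin_var <- function(z) mean((z - mean(z))^2)               # \hat\theta_n

theta_hat <- plugin_var(x)
theta_loo <- sapply(1:n, function(i) plugin_var(x[-i]))       # \hat\theta_{(-i)}
pseudo    <- n * theta_hat - (n - 1) * theta_loo              # pseudo-values \theta_i^*

theta_jack <- mean(pseudo)                                    # jackknife point estimate
se_jack    <- sd(pseudo) / sqrt(n)                            # estimated standard error

# approximate 95% confidence interval for theta = sigma^2 (true value is 4 here)
theta_jack + c(-1, 1) * qt(0.975, df = n - 1) * se_jack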

References:

  1. Miller, R. (1974). The Jackknife–A Review.
  2. Quenouille, M. H. (1956). Notes on Bias in Estimation.
  3. Wikipedia. Jackknife resampling.

Beamer: inserting section slides before each section

When creating a presentation in LaTeX using Beamer, the \tableofcontents command automatically creates a table of contents for you based on the \section, \subsection and \subsubsection commands. For example, in one of my documents,

\begin{frame}{Outline of talk}
\tableofcontents
\end{frame}

creates a slide showing the outline of the talk.

Sometimes, in addition to a table of contents slide at the beginning of the presentation, I want a slide at the beginning of each section to remind the audience of where we are in the presentation. We can do this by adding the following lines of code before the \begin{document} command:

\AtBeginSection[]
{
    \begin{frame}
        \frametitle{Table of Contents}
        \tableofcontents[currentsection]
    \end{frame}
}

This will create a slide at the start of each section that looks just like the \tableofcontents slide, except that all but the current section will be grayed out. For example, the slide inserted before the second section shows the full outline with only the second section highlighted.

If you want this behavior for subsections as well, you can add the following code:

\AtBeginSubsection[]
{
    \begin{frame}
        \frametitle{Table of Contents}
        \tableofcontents[currentsection,currentsubsection]
    \end{frame}
}

References:

  1. WikiBooks. LaTeX/Presentations: Table of Contents.

The math behind quantile regression

In a previous post, we introduced quantile regression and demonstrated how to run it in R. In this post, we dig into the math behind quantile regression.

Actually, before we get into quantile regression, let’s talk about quantiles. Given some observations y_1, y_2, \dots, y_n \in \mathbb{R}, how do we define the 50th quantile, better known as the median? In high school, we are taught that if we line up all our observations in ascending order, the median is the element right in the middle if n is odd, or the average of the two middle elements if n is even. Quantiles can be defined in a similar manner: for \tau \in (0,1), the \tau-th quantile is a value such that at least a fraction \tau of the observations are less than or equal to it.

This is all well and good, but this notion is hard to generalize. Instead, we can define the \tau-th quantile as the solution to a minimization problem; the “algorithm” for finding the \tau-th quantile in the previous paragraph can be viewed as a way to solve this minimization problem.

Here is the minimization problem: if we define \rho_\tau(y) = y (\tau - 1\{ y < 0\}), then for a set of observations y_1, \dots, y_n \in \mathbb{R}, we define its \tau-th quantile as

\begin{aligned} \hat{q}_\tau &= \underset{q \in \mathbb{R}}{\text{argmin}} \quad \sum_{i=1}^n \rho_\tau (y_i - q) \\  &= \underset{q \in \mathbb{R}}{\text{argmin}} \quad \sum_{i=1}^n (y_i - q)(\tau - 1\{ y_i < q \}) \\  &= \underset{q \in \mathbb{R}}{\text{argmin}} \quad \left[ (\tau - 1) \sum_{y_i < q} (y_i - q) + \tau \sum_{y_i \geq q} (y_i - q) \right]. \quad (1) \end{aligned}

When \tau = 1/2, we see that the above reduces to

\begin{aligned} \hat{q}_{0.5} &= \underset{q \in \mathbb{R}}{\text{argmin}} \quad \left[ -\frac{1}{2} \sum_{y_i < q} (y_i - q) + \frac{1}{2} \sum_{y_i \geq q} (y_i - q) \right] \\  &= \underset{q \in \mathbb{R}}{\text{argmin}} \quad \sum_{i=1}^n |y_i - q|. \end{aligned}

Here is some hand-wavy reasoning for why the sample \tauth quantile minimizes the expression in (1). Assume that a proportion p of the terms are in the first summation \displaystyle\sum_{y_i < q} (y_i - q), and that 1-p of the terms are in the second summation \displaystyle\sum_{y_i \geq q} (y_i - q). If we increase q by some little \epsilon so that the number of terms in each summation doesn’t change, then overall the expression increases by

\begin{aligned} np(1-\tau)\epsilon - n(1-p)\tau \epsilon = n\epsilon (p - \tau). \end{aligned}

If p < \tau, then we can decrease the objective function value in (1) by making q larger. If p > \tau, then we can decrease the objective function value by making q smaller. The only place where we can’t make any improvement is when p = \tau, i.e. q is chosen such that a fraction \tau of the observations are less than it.
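
Here is a quick numerical check of this reasoning (my own sketch, not from the original post): evaluate the objective in (1) at each data point and confirm that the minimizer agrees with R’s sample quantile under the inverse-ECDF definition (type = 1).

# Numerical check: the minimizer of (1) is the sample tau-th quantile
set.seed(123)
y   <- rnorm(101)
tau <- 0.3

rho       <- function(u) u * (tau - (u < 0))    # check loss rho_tau
objective <- function(q) sum(rho(y - q))

# The objective is piecewise linear in q, so its minimum is attained at one of the y_i
obj_vals <- sapply(y, objective)
c(argmin   = y[which.min(obj_vals)],
  quantile = unname(quantile(y, tau, type = 1)))    # the two agree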

Let’s extend this minimization problem idea to quantile regression. For each response value y_i, we have a corresponding estimate \hat{y}_i. We can interpret the setting above as trying to minimize

\displaystyle\sum_{i=1}^n \rho_\tau (y_i - \hat{y}_i),

where the \hat{y}_i's are all forced to be equal to one value, q. In quantile regression, we instead require \hat{y}_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}, where x_{i1}, \dots, x_{ip} are the values of the p features for the ith observation. Thus, we can rewrite the minimization problem as

\underset{\beta_0, \dots, \beta_p}{\text{minimize}} \quad \displaystyle\sum_{i=1}^n \rho_\tau (y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip}),

or in vector notation (and incorporating the intercept as a feature),

\underset{\beta}{\text{minimize}} \quad \displaystyle\sum_{i=1}^n \rho_\tau (y_i - x_i^T \beta).

Notice that this is very similar to ordinary least squares regression. In that case, instead of applying \rho_\tau to each y_i - x_i^T \beta, we apply x \mapsto \frac{1}{2}x^2.
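
As a quick sanity check tying this back to the quantile discussion (my own snippet, using the quantreg package from the companion post): if the model contains only an intercept, the fitted coefficient is just the sample \tau-th quantile.

# Intercept-only quantile regression recovers the sample tau-th quantile
library(quantreg)
data(mtcars)

tau <- 0.3    # chosen so that n * tau is not an integer and the minimizer is unique
coef(rq(mpg ~ 1, data = mtcars, tau = tau))
quantile(mtcars$mpg, tau, type = 1)    # should agree with the coefficient above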

References:

  1. Wikipedia. Quantile regression.

Quantile regression in R

Quantile regression: what is it?

Let y \in \mathbb{R} be some response variable of interest, and let x \in \mathbb{R}^p be a vector of features or predictors that we want to use to model the response. In linear regression, we are trying to estimate the conditional mean function, \mathbb{E}[y \mid x], by a linear combination of the features.

While the conditional mean function is often what we want to model, sometimes we may want to model something else. On a recent episode of the Linear Digressions podcast, Katie and Ben talked about a situation in Uber where it might make sense to model a conditional quantile function.

Let’s make this more concrete. Say Uber came up with a new algorithm for dispatching drivers and we are interested in how this algorithm fares in terms of wait times for consumers. A simple (linear regression) model for this is

\mathbb{E}[wait\_time \mid algorithm] = a + b \cdot algorithm,

where algorithm = 1 if the new algorithm was used to dispatch a driver, and algorithm = 0 if the previous algorithm was used. From this model, we can say that under the old algorithm, mean wait time was a, but under the new algorithm, mean wait time is a + b. So if b < 0, I would infer that my new algorithm is "doing better".

But is it really? What if the new algorithm improves wait times for 90% of customers by 1 min, but makes the wait times for the remaining 10% longer by 5 min? Overall, I would see a decrease in mean wait time, but things got significantly worse for a segment of my population. What if that 10% whose wait times became 5 minutes longer were already having the longest wait times to begin with? That seems like a bad situation to have, but our earlier model would not pick it up.

One way to pick up such situations is to model conditional quantile functions instead. That is, instead of trying to estimate the mean of y given the features x, let’s try to estimate a quantile of y given the features x. In our example above, instead of trying to estimate the mean wait time, we could estimate the 95th quantile of wait time to catch anything going wrong out in the tails of the distribution.

Another example where estimating conditional quantiles is useful is in growth charts. You have probably seen one of these charts in a doctor’s office before. Each line in the growth chart represents some quantile for length/weight given the person’s age. We track our children’s length and weight on this chart to see if they are growing normally or not.

[Figure: WHO growth chart for boys. The lines are conditional quantile functions for different quantiles: of length given age and of weight given age.]

Quantile regression is a regression method for estimating these conditional quantile functions. Just as linear regression estimates the conditional mean function as a linear combination of the predictors, quantile regression estimates the conditional quantile function as a linear combination of the predictors.

Quantile regression in R

We can perform quantile regression in R easily with the quantreg package. I will demonstrate how to use it on the mtcars dataset. (For more details on the quantreg package, you can read the package vignette.)

Let’s load our packages and data:

library(quantreg)
data(mtcars)

We can perform quantile regression using the rq function. We can specify a tau option which tells rq which conditional quantile we want. The default value for tau is 0.5 which corresponds to median regression. Below, we fit a quantile regression of miles per gallon vs. car weight:

rqfit <- rq(mpg ~ wt, data = mtcars)
rqfit
# Call:
#     rq(formula = mpg ~ wt, data = mtcars)
# 
# Coefficients:
#     (Intercept)          wt 
# 34.232237   -4.539474 
# 
# Degrees of freedom: 32 total; 30 residual

Printing the fitted object to the console gives some rudimentary information on the regression fit. We can use the summary function to get more information (just like we do for the lm function):

summary(rqfit)
# Call: rq(formula = mpg ~ wt, data = mtcars)
# 
# tau: [1] 0.5
# 
# Coefficients:
#     coefficients lower bd upper bd
# (Intercept) 34.23224     32.25029 39.74085
# wt          -4.53947     -6.47553 -4.16390

In the table above, the lower bd and upper bd columns represent the endpoints of confidence intervals for the model coefficients. There are a number of ways these confidence intervals can be computed; the method can be specified using the se option when invoking the summary function. The default value is se = "rank", with the other options being "iid", "nid", "ker", "boot" and "BLB" (type ?summary.rq for details).
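
For example, to compute bootstrap standard errors instead (output not shown; the table then reports standard errors, t statistics and p-values rather than confidence bounds):

# bootstrap standard errors for the same fit (results vary with the random seed)
summary(rqfit, se = "boot")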

This next block of code plots the quantile regression line in blue and the linear regression line in red:

plot(mpg ~ wt, data = mtcars, pch = 16, main = "mpg ~ wt")
abline(lm(mpg ~ wt, data = mtcars), col = "red", lty = 2)
abline(rq(mpg ~ wt, data = mtcars), col = "blue", lty = 2)
legend("topright", legend = c("lm", "rq"), col = c("red", "blue"), lty = 2)

Median regression (i.e. 50th quantile regression) is sometimes preferred to linear regression because it is “robust to outliers”. The next plot illustrates this. We add two outliers to the data (colored in orange) and see how it affects our regressions. The dotted lines are the fits for the original data, while the solid lines are for the data with outliers. As before, red is for linear regression while blue is for quantile regression. See how the linear regression fit shifts a fair amount compared to the median regression fit (which barely moves!)?

y <- c(mtcars$mpg, 40, 36)
x <- c(mtcars$wt, 5, 4)
plot(y ~ x, pch = 16, main = "mpg ~ wt")
points(c(5, 4), c(40, 36), pch = 16, col = "dark orange")
abline(lm(mpg ~ wt, data = mtcars), col = "red", lty = 2)
abline(lm(y ~ x), col = "red")
abline(rq(mpg ~ wt, data = mtcars), col = "blue", lty = 2)
abline(rq(y ~ x), col = "blue")

As I mentioned before, the tau option tells rq which conditional quantile we want. What is interesting is that we can set tau to be a vector and rq will give us the fits for all those quantiles:

multi_rqfit <- rq(mpg ~ wt, data = mtcars, tau = seq(0, 1, by = 0.1))
multi_rqfit
# Call:
#     rq(formula = mpg ~ wt, tau = seq(0, 1, by = 0.1), data = mtcars)
# 
# Coefficients:
#     tau= 0.0  tau= 0.1 tau= 0.2  tau= 0.3  tau= 0.4  tau= 0.5  tau= 0.6  tau= 0.7  tau= 0.8  tau= 0.9
# (Intercept)  27.6875 37.509794  37.2768 34.876712 34.891176 34.232237 36.734405 38.497404 38.879916 44.391047
# wt           -3.7500 -6.494845  -6.2400 -5.205479 -4.852941 -4.539474 -5.016077 -5.351887 -5.250722 -6.266786
# tau= 1.0
# (Intercept) 44.781558
# wt          -5.627981
# 
# Degrees of freedom: 32 total; 30 residual

I don’t think there’s a clean one-liner to plot all the quantile fits at once. The following loop is one possible solution:

# plotting different quantiles
colors <- c("#ffe6e6", "#ffcccc", "#ff9999", "#ff6666", "#ff3333",
            "#ff0000", "#cc0000", "#b30000", "#800000", "#4d0000", "#000000")
plot(mpg ~ wt, data = mtcars, pch = 16, main = "mpg ~ wt")
for (j in 1:ncol(multi_rqfit$coefficients)) {
    abline(coef(multi_rqfit)[, j], col = colors[j])
}

rq exhibits interesting behavior if tau is set to some value outside of [0,1]: it gives solutions for ALL values of tau in [0,1]! (It turns out that there are a finite number of them.)
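
For example (a minimal sketch; per the quantreg documentation, when tau lies outside [0,1] the fitted object stores the whole solution path in its sol component):

# Asking for tau outside [0, 1] returns the entire quantile regression process
fit_all <- rq(mpg ~ wt, data = mtcars, tau = -1)
class(fit_all)      # "rq.process" (per the quantreg documentation)
dim(fit_all$sol)    # one column per breakpoint of tau at which the solution changes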

One final note: rq also supports more complicated formulas.

# no intercept
rq(mpg ~ wt - 1, data = mtcars)
# Call:
#     rq(formula = mpg ~ wt - 1, data = mtcars)
# 
# Coefficients:
#     wt 
# 4.993498 
# 
# Degrees of freedom: 32 total; 31 residual

# additive model
rq(mpg ~ wt + cyl, data = mtcars)
# Call:
#     rq(formula = mpg ~ wt + cyl, data = mtcars)
# 
# Coefficients:
#     (Intercept)          wt         cyl 
# 38.871429   -2.678571   -1.742857 
# 
# Degrees of freedom: 32 total; 29 residual

# model with interactions
rq(mpg ~ wt * cyl, data = mtcars)
# Call:
#     rq(formula = mpg ~ wt * cyl, data = mtcars)
# 
# Coefficients:
#     (Intercept)          wt         cyl      wt:cyl 
# 51.6650791  -7.2496674  -3.4759342   0.5960682 
# 
# Degrees of freedom: 32 total; 28 residual

# fit on all other predictors
rq(mpg ~ ., data = mtcars)
# Call:
#     rq(formula = mpg ~ ., data = mtcars)
# 
# Coefficients:
#     (Intercept)         cyl        disp          hp        drat          wt        qsec          vs          am 
# 7.61900909  0.64658010  0.02330302 -0.03149324  0.78058219 -4.56231735  0.57610198  1.58875367  1.33175736 
# gear        carb 
# 2.37453015 -0.33136465 
# 
# Degrees of freedom: 32 total; 21 residual

References:

  1. University of Virginia Library. Getting started with quantile regression.
  2. Wikipedia. Quantile regression.

Simulated annealing for optimization

Simulated annealing is a probabilistic technique for approximating the global optimum of a given objective function. Because it does not guarantee a global optimum, it is known as a metaheuristic.

According to Wikipedia, annealing is “a heat treatment that alters the physical and sometimes chemical properties of a material to increase its ductility and reduce its hardness, making it more workable. It involves heating a material above its recrystallization temperature, maintaining a suitable temperature for a suitable amount of time, and then cooling” (emphasis mine). As we will see, simulated annealing mimics this physical process in trying to find a function’s global optimum.

There are several versions of simulated annealing. Below, we give a high-level outline of a basic version for minimizing a function f: \mathcal{X} \rightarrow \mathbb{R}. (Simulated annealing can be used for constrained optimization as well; for simplicity we limit our exposition to unconstrained optimization.) Let i index the iteration number. Simulated annealing has a positive temperature parameter, denoted by T_i at iteration i, whose evolution strongly influences the result.

Start at some initial point x_0 \in \mathcal{X}. For each iteration i = 0, \dots, if the stopping criterion is not satisfied:

  1. Choose some random proposal point y from a probability distribution. The lower the temperature, the more concentrated the probability distribution’s mass is around x_i.
  2. With some acceptance probability, set x_{i+1} = y; otherwise, set x_{i+1} = x_i. The acceptance probability is a function of the function values f(x_i) and f(y), and of the current temperature T_i.
  3. Update the temperature parameter.

That’s all there is to it! The devil, of course, is in the details. Here are some examples of what each of the building blocks could be; a short code sketch putting the pieces together follows these examples.

Stopping criterion:

  • Terminate after a fixed number of iterations, or when the computational budget has been fully utilized.
  • Terminate when the objective function does not improve much, e.g. |f(x_i) - f(x_{i+1})| < 10^{-7}.

Choosing a random proposal point: This depends very much on what the underlying state space looks like. A benefit of simulated annealing is that it can be applied to very general state spaces \mathcal{X}, even discrete ones, as long as there is a notion of distance between two points in the space. (In particular, first-order information about the space (i.e. gradients) is not needed.)

  • If \mathcal{X} = \mathbb{R}^d, we could set y = x_i + T \epsilon, where \epsilon \sim \mathcal{N}(0,I_d), or \epsilon \sim U[-1, 1]^d.
  • The dependence on T could be different as well, e.g. y = x_i + \sqrt{T} \epsilon.
  • If \mathcal{X} is the space of permutations on \{ 1, 2, \dots, n\}, then y could be selected uniformly at random from the set of permutations which are at most T transpositions from x_i.

In the proposals above, the smaller T is, the closer y is likely to be to x_i.

Acceptance probability: The probability of accepting y as the next value in the sequence (i.e. as x_{i+1}) is a function which depends on f(x_i), f(y) and (crucially) the temperature T_i. Typically if f(y) < f(x_i), we would accept y with probability 1, but this does not have to be the case. In general, the acceptance probability should decrease as T_i decreases and as f(y) - f(x_i) becomes more positive.

  • One common acceptance function: if f(y) - f(x_i) < 0, accept. If not, accept with probability \dfrac{1}{1 + \exp \left( \frac{f(y) - f(x_i)}{T_i} \right)}.

Updating the temperature parameter: We just require T_i \rightarrow 0 as i \rightarrow \infty.

  • T_i = T_0 \cdot (0.95)^i.
  • T_i = T_0 / i.
  • T_i = T_0 / \log i.
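
To make the outline concrete, here is a minimal R sketch (my own illustrative code, not taken from the references) that puts these pieces together: a Gaussian proposal scaled by the temperature, the acceptance function from the bullet above, a geometric cooling schedule T_i = T_0 \cdot (0.95)^i, and a fixed iteration budget as the stopping criterion. The test function and all parameter values are arbitrary choices for illustration; keeping track of the best point visited is a common practical addition that is not part of the basic outline.

# Minimal simulated annealing sketch for minimizing a function f over R^d
simulated_annealing <- function(f, x0, T0 = 1, cooling = 0.95, n_iter = 2000) {
    x  <- x0
    fx <- f(x)
    Temp <- T0
    best_x <- x
    best_f <- fx
    for (i in seq_len(n_iter)) {
        # 1. Proposal: Gaussian step whose scale shrinks with the temperature
        y  <- x + Temp * rnorm(length(x))
        fy <- f(y)
        # 2. Acceptance: always accept improvements; otherwise accept with a
        #    probability that decreases as Temp falls and as fy - fx grows
        if (fy < fx || runif(1) < 1 / (1 + exp((fy - fx) / Temp))) {
            x  <- y
            fx <- fy
        }
        # keep track of the best point seen so far (a practical extra)
        if (fx < best_f) {
            best_x <- x
            best_f <- fx
        }
        # 3. Cooling schedule: geometric decay
        Temp <- Temp * cooling
    }
    list(par = best_x, value = best_f)
}

# Example: a bumpy function with many local minima and a global minimum at 0
set.seed(1)
f <- function(x) sum(x^2) + 3 * sum(sin(5 * x)^2)
simulated_annealing(f, x0 = c(2, -2))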

References:

  1. Henderson et al. The theory and practice of simulated annealing.
  2. Mathworks. How simulated annealing works.
  3. Mathworks. Simulated annealing terminology.
  4. Wikipedia. Simulated annealing.