Asymptotic normality of sample quantiles

Let X_1, \dots, X_n be i.i.d. random variables with the cumulative distribution function (CDF) F. The central limit theorem states that the sample mean \overline{X}_n is asymptotically normal, as long as F has a finite second moment:

\begin{aligned} \sqrt{n} (\overline{X}_n - \mu) \stackrel{d}{\rightarrow} \mathcal{N}(0, \sigma^2), \end{aligned}

where \mu and \sigma^2 are the mean and variance of the random variable with CDF F.

It turns out that for any 0 < p < 1, the pth sample quantile is asymptotically normal as well. Here is the result:

Theorem. Fix 0 < p < 1. If F is differentiable at the pth quantile F^{-1}(p) with positive derivative f(F^{-1}(p)), then the sample quantile F_n^{-1}(p) is asymptotically normal:

\begin{aligned} \sqrt{n}\left( F_n^{-1}(p) - F^{-1}(p) \right) \stackrel{d}{\rightarrow} \mathcal{N} \left( 0, \dfrac{p(1-p)}{f^2 (F^{-1}(p))} \right). \end{aligned}

This is frequently cited as Corollary 21.5 in Reference 1. A proof of this result can be found in Reference 2 (there, x refers to F^{-1}(p) above). (It is essentially an application of the Central Limit Theorem followed by an application of the delta method.)

The numerator of the variance, p(1-p), is largest when p = 1/2 and shrinks as we move to more extreme quantiles. This makes intuitive sense once we note that p(1-p) is the variance of the indicator 1\{X_i \leq F^{-1}(p)\}: counting how many observations fall below the cutoff is noisiest near the median. (In practice, extreme quantiles are usually still harder to estimate, because the density term in the denominator, discussed next, typically dominates.)

The denominator of the variance, f^2 (F^{-1}(p)), is largest when the CDF F is changing rapidly at the pth quantile. This again makes intuitive sense: if F is changing quickly there, we have a greater probability of observing samples in the neighborhood of F^{-1}(p), and so we can estimate it more accurately.
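
To see the theorem in action, here is a small simulation sketch of my own (not from the references): for standard normal data and p = 0.9, the empirical variance of \sqrt{n}(F_n^{-1}(p) - F^{-1}(p)) should be close to p(1-p)/f^2(F^{-1}(p)).

# Simulation check of the asymptotic variance for standard normal data
set.seed(1)
p <- 0.9
n <- 10000
nsim <- 2000
true_q <- qnorm(p)                                  # F^{-1}(p)
theoretical_var <- p * (1 - p) / dnorm(true_q)^2    # variance from the theorem
sample_qs <- replicate(nsim, quantile(rnorm(n), probs = p, names = FALSE))
empirical_var <- var(sqrt(n) * (sample_qs - true_q))
c(theoretical = theoretical_var, empirical = empirical_var)

The two numbers should agree up to Monte Carlo error.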

References:

  1. Van der Vaart, A. W. (2000). Asymptotic statistics.
  2. Stephens, D. A. Asymptotic distribution of sample quantiles.

The math behind quantile regression

In a previous post, we introduced quantile regression and demonstrated how to run it in R. In this post, we dig into the math behind quantile regression.

Actually, before we get into quantile regression, let’s talk about quantiles. Given some observations y_1, y_2, \dots, y_n \in \mathbb{R}, how do we define the 50th quantile, better known as the median? In high school, we are taught that if we line up all our observations in ascending order, the median is the element right in the middle if n is odd, or the average of the two middle elements if n is even. Quantiles can be defined in a similar manner: for \tau \in (0,1), the \tauth quantile is a value such that at least a fraction \tau of the observations are less than or equal to it.

This is all well and good, but this notion is hard to generalize. Instead, we can define the \tauth quantile as the solution to a minimization problem; the “algorithm” for finding the \tauth quantile in the previous paragraph can be viewed as a way of solving this minimization problem.

Here is the minimization problem: if we define \rho_\tau(y) = y (\tau - 1\{ y < 0\}) (often called the check loss), then for a set of observations y_1, \dots, y_n \in \mathbb{R}, we define its \tauth quantile as

\begin{aligned} \hat{q}_\tau &= \underset{q \in \mathbb{R}}{\text{argmin}} \quad \sum_{i=1}^n \rho_\tau (y_i - q) \\  &= \underset{q \in \mathbb{R}}{\text{argmin}} \quad \sum_{i=1}^n (y_i - q)(\tau - 1\{ y_i < q \}) \\  &= \underset{q \in \mathbb{R}}{\text{argmin}} \quad \left[ (\tau - 1) \sum_{y_i < q} (y_i - q) + \tau \sum_{y_i \geq q} (y_i - q) \right]. \quad (1) \end{aligned}

When \tau = 1/2, we see that the above reduces to

\begin{aligned} \hat{q}_{0.5} &= \underset{q \in \mathbb{R}}{\text{argmin}} \quad \left[ -\frac{1}{2} \sum_{y_i < q} (y_i - q) + \frac{1}{2} \sum_{y_i \geq q} (y_i - q) \right] \\  &= \underset{q \in \mathbb{R}}{\text{argmin}} \quad \sum_{i=1}^n |y_i - q|. \end{aligned}

Here is some hand-wavy reasoning for why the sample \tauth quantile minimizes the expression in (1). Assume that a proportion p of the terms are in the first summation \displaystyle\sum_{y_i < q} (y_i - q), and that a proportion 1-p of the terms are in the second summation \displaystyle\sum_{y_i \geq q} (y_i - q). If we increase q by some small \epsilon so that the number of terms in each summation doesn’t change, then each of the np terms in the first sum (with coefficient \tau - 1) changes by (1-\tau)\epsilon, while each of the n(1-p) terms in the second sum (with coefficient \tau) changes by -\tau\epsilon. Overall, the expression changes by

\begin{aligned} np(1-\tau)\epsilon - n(1-p)\tau \epsilon = n\epsilon (p - \tau). \end{aligned}

If p < \tau, this change is negative, so we can decrease the objective function value in (1) by making q larger. If p > \tau, we can decrease the objective function value by making q smaller. The only place where no improvement is possible is when p = \tau, i.e. when q is chosen such that a fraction \tau of the observations are less than it.
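
To make this concrete, here is a small illustrative R sketch (not from the original post; the helper names are my own): minimizing the check loss in (1) numerically over q recovers the sample quantile, up to the interpolation convention used by R’s quantile function.

# Check loss rho_tau(y) = y * (tau - 1{y < 0})
rho <- function(y, tau) y * (tau - (y < 0))
set.seed(1)
y <- rnorm(100)
tau <- 0.3
obj <- function(q) sum(rho(y - q, tau))    # objective (1) as a function of q
qhat <- optimize(obj, interval = range(y))$minimum
c(minimizer = qhat, sample_quantile = quantile(y, probs = tau, names = FALSE))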

Koenker and Bassett Jr (1978) extended this minimization idea to quantile regression. For each response value y_i, we have a corresponding estimate \hat{y}_i. We can interpret the setting above as trying to minimize

\displaystyle\sum_{i=1}^n \rho_\tau (y_i - \hat{y}_i),

where the \hat{y}_i's are all forced to be equal to one value, q. In quantile regression, we instead require \hat{y}_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}, where x_{i1}, \dots, x_{ip} are the values of the p features for the ith observation. Thus, we can rewrite the minimization problem as

\underset{\beta_0, \dots, \beta_p}{\text{minimize}} \quad \displaystyle\sum_{i=1}^n \rho_\tau (y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip}),

or in vector notation (and incorporating the intercept as a feature),

\underset{\beta}{\text{minimize}} \quad \displaystyle\sum_{i=1}^n \rho_\tau (y_i - x_i^T \beta).

Notice that this is very similar to ordinary least squares regression. In that case, instead of applying \rho_\tau to each y_i - x_i^T \beta, we apply x \mapsto \frac{1}{2}x^2.
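
As an illustration (an informal sketch of my own, using the built-in mtcars dataset), we can minimize this objective directly with a general-purpose optimizer and compare against quantreg's rq, which solves the same minimization problem via linear programming. The two fits should agree approximately.

library(quantreg)
rho <- function(y, tau) y * (tau - (y < 0))    # check loss
tau <- 0.5
X <- cbind(1, mtcars$wt)                       # intercept + one feature
obj <- function(beta) sum(rho(mtcars$mpg - X %*% beta, tau))
# Crude direct minimization (Nelder-Mead); only approximately optimal
beta_opt <- optim(c(mean(mtcars$mpg), 0), obj)$par
beta_rq <- coef(rq(mpg ~ wt, data = mtcars, tau = tau))
rbind(optim = beta_opt, rq = beta_rq)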

References:

  1. Wikipedia. Quantile regression.
  2. Koenker, R., and Bassett Jr, G. (1978). Regression quantiles.

Quantile regression in R

Quantile regression: what is it?

Let y \in \mathbb{R} be some response variable of interest, and let x \in \mathbb{R}^p be a vector of features or predictors that we want to use to model the response. In linear regression, we are trying to estimate the conditional mean function, \mathbb{E}[y \mid x], by a linear combination of the features.

While the conditional mean function is often what we want to model, sometimes we may want to model something else. On a recent episode of the Linear Digressions podcast, Katie and Ben talked about a situation at Uber where it might make sense to model a conditional quantile function.

Let’s make this more concrete. Say Uber came up with a new algorithm for dispatching drivers and we are interested in how this algorithm fares in terms of wait times for consumers. A simple (linear regression) model for this is

\mathbb{E}[wait\_time \mid algorithm] = a + b \cdot algorithm,

where algorithm = 1 if the new algorithm was used to dispatch a driver, and algorithm = 0 if the previous algorithm was used. From this model, we can say that under the old algorithm, mean wait time was a, but under the new algorithm, mean wait time is a + b. So if b < 0, I would infer that my new algorithm is "doing better".

But is it really? What if the new algorithm improves wait times for 90% of customers by 1 min, but makes the wait times for the remaining 10% longer by 5 min? Overall, I would see a decrease in mean wait time, but things got significantly worse for a segment of my population. What if that 10% whose wait times became 5 minutes longer already had the longest wait times to begin with? That seems like a bad situation to have, but our earlier model would not pick it up.

One way to pick up such situations is to model conditional quantile functions instead. That is, instead of trying to estimate the mean of y given the features x, let’s try to estimate a quantile of y given the features x. In our example above, instead of trying to estimate the mean wait time, we could estimate the 95th quantile of wait time to catch anything going wrong out in the tails of the distribution.

Another example where estimating conditional quantiles is useful is in growth charts. Most of you have probably seen a chart like the one below in a doctor’s office before. Each line in the growth chart represents some quantile for length/weight given the person’s age. We track our children’s length and weight on this chart to see if they are growing normally or not.

[Figure: WHO growth chart for boys. Each line is a conditional quantile function for a different quantile: length given age and weight given age.]

Quantile regression is a regression method for estimating these conditional quantile functions. Just as linear regression estimates the conditional mean function as a linear combination of the predictors, quantile regression estimates the conditional quantile function as a linear combination of the predictors.

Quantile regression in R

We can perform quantile regression in R easily with the quantreg package. I will demonstrate how to use it on the mtcars dataset. (For more details on the quantreg package, you can read the package’s vignette.)

Let’s load our packages and data:

library(quantreg)
data(mtcars)

We can perform quantile regression using the rq function. We can specify a tau option which tells rq which conditional quantile we want. The default value for tau is 0.5 which corresponds to median regression. Below, we fit a quantile regression of miles per gallon vs. car weight:

rqfit <- rq(mpg ~ wt, data = mtcars)
rqfit
# Call:
#     rq(formula = mpg ~ wt, data = mtcars)
# 
# Coefficients:
#     (Intercept)          wt 
# 34.232237   -4.539474 
# 
# Degrees of freedom: 32 total; 30 residual

Printing the fitted object to the console gives some rudimentary information on the regression fit. We can use the summary function to get more information (just like we do for the lm function):

summary(rqfit)
# Call: rq(formula = mpg ~ wt, data = mtcars)
# 
# tau: [1] 0.5
# 
# Coefficients:
#     coefficients lower bd upper bd
# (Intercept) 34.23224     32.25029 39.74085
# wt          -4.53947     -6.47553 -4.16390

In the table above, the lower bd and upper bd columns represent the endpoints of confidence intervals for the model coefficients. There are a number of ways these confidence intervals can be computed; this can be specified using the se option when invoking the summary function. The default value is se = "rank", with the other options being "iid", "nid", "ker", "boot" and "BLB" (type ?summary.rq for details).
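
For example, to get bootstrap-based standard errors instead of the rank-based intervals (output omitted here):

summary(rqfit, se = "boot")    # bootstrap standard errors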

This next block of code plots the quantile regression line in blue and the linear regression line in red:

plot(mpg ~ wt, data = mtcars, pch = 16, main = "mpg ~ wt")
abline(lm(mpg ~ wt, data = mtcars), col = "red", lty = 2)
abline(rq(mpg ~ wt, data = mtcars), col = "blue", lty = 2)
legend("topright", legend = c("lm", "rq"), col = c("red", "blue"), lty = 2)

Median regression (i.e. 50th quantile regression) is sometimes preferred to linear regression because it is “robust to outliers”. The next plot illustrates this. We add two outliers to the data (colored in orange) and see how it affects our regressions. The dotted lines are the fits for the original data, while the solid lines are for the data with outliers. As before, red is for linear regression while blue is for quantile regression. See how the linear regression fit shifts a fair amount compared to the median regression fit (which barely moves!)?

y <- c(mtcars$mpg, 40, 36)
x <- c(mtcars$wt, 5, 4)
plot(y ~ x, pch = 16, main = "mpg ~ wt")
points(c(5, 4), c(40, 36), pch = 16, col = "dark orange")
abline(lm(mpg ~ wt, data = mtcars), col = "red", lty = 2)
abline(lm(y ~ x), col = "red")
abline(rq(mpg ~ wt, data = mtcars), col = "blue", lty = 2)
abline(rq(y ~ x), col = "blue")

As I mentioned before, the tau option tells rq which conditional quantile we want. What is interesting is that we can set tau to be a vector and rq will give us the fits for all those quantiles:

multi_rqfit <- rq(mpg ~ wt, data = mtcars, tau = seq(0, 1, by = 0.1))
multi_rqfit
# Call:
#     rq(formula = mpg ~ wt, tau = seq(0, 1, by = 0.1), data = mtcars)
# 
# Coefficients:
#     tau= 0.0  tau= 0.1 tau= 0.2  tau= 0.3  tau= 0.4  tau= 0.5  tau= 0.6  tau= 0.7  tau= 0.8  tau= 0.9
# (Intercept)  27.6875 37.509794  37.2768 34.876712 34.891176 34.232237 36.734405 38.497404 38.879916 44.391047
# wt           -3.7500 -6.494845  -6.2400 -5.205479 -4.852941 -4.539474 -5.016077 -5.351887 -5.250722 -6.266786
# tau= 1.0
# (Intercept) 44.781558
# wt          -5.627981
# 
# Degrees of freedom: 32 total; 30 residual

I don’t think there’s a clean one-liner to plot all the quantile fits at once. The following loop is one possible solution:

# plotting different quantiles
colors <- c("#ffe6e6", "#ffcccc", "#ff9999", "#ff6666", "#ff3333",
            "#ff0000", "#cc0000", "#b30000", "#800000", "#4d0000", "#000000")
plot(mpg ~ wt, data = mtcars, pch = 16, main = "mpg ~ wt")
for (j in 1:ncol(multi_rqfit$coefficients)) {
    abline(coef(multi_rqfit)[, j], col = colors[j])
}

rq exhibits interesting behavior if tau is set to some value outside of [0,1]: it gives solutions for ALL values of tau in [0,1]! (It turns out that there are a finite number of them.)
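
For example, the following call computes the entire set of solutions (output omitted, since it reports every breakpoint of the quantile process):

rq(mpg ~ wt, data = mtcars, tau = -1)    # any tau outside [0, 1] gives the full solution set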

One final note: rq also supports more complicated formulas.

# no intercept
rq(mpg ~ wt - 1, data = mtcars)
# Call:
#     rq(formula = mpg ~ wt - 1, data = mtcars)
# 
# Coefficients:
#     wt 
# 4.993498 
# 
# Degrees of freedom: 32 total; 31 residual

# additive model
rq(mpg ~ wt + cyl, data = mtcars)
# Call:
#     rq(formula = mpg ~ wt + cyl, data = mtcars)
# 
# Coefficients:
#     (Intercept)          wt         cyl 
# 38.871429   -2.678571   -1.742857 
# 
# Degrees of freedom: 32 total; 29 residual

# model with interactions
rq(mpg ~ wt * cyl, data = mtcars)
# Call:
#     rq(formula = mpg ~ wt * cyl, data = mtcars)
# 
# Coefficients:
#     (Intercept)          wt         cyl      wt:cyl 
# 51.6650791  -7.2496674  -3.4759342   0.5960682 
# 
# Degrees of freedom: 32 total; 28 residual

# fit on all other predictors
rq(mpg ~ ., data = mtcars)
# Call:
#     rq(formula = mpg ~ ., data = mtcars)
# 
# Coefficients:
#     (Intercept)         cyl        disp          hp        drat          wt        qsec          vs          am 
# 7.61900909  0.64658010  0.02330302 -0.03149324  0.78058219 -4.56231735  0.57610198  1.58875367  1.33175736 
# gear        carb 
# 2.37453015 -0.33136465 
# 
# Degrees of freedom: 32 total; 21 residual

References:

  1. University of Virginia Library. Getting started with quantile regression.
  2. Wikipedia. Quantile regression.

Approximating large normal quantiles 2

In my STATS 300C notes, another approximation for large normal quantiles was given: let Z \sim \mathcal{N}(0,1), and let z(\alpha) be the \alpha-quantile of Z, that is, z(\alpha) is the value such that \mathbb{P}(Z \leq z(\alpha)) = \alpha. Fix \alpha, and let B = 2 \log n - 2 \log \alpha - \log (2\pi). Then for large n,

z\left( 1 - \dfrac{\alpha}{n} \right) \approx \sqrt{B \left(1 - \dfrac{\log B}{B} \right)}.

The proof is very similar to that of the approximation in the previous post. Let z\left( 1 - \dfrac{\alpha}{n} \right) = t. For large n,

\begin{aligned} \frac{\phi(t)}{t} &\approx \frac{\alpha}{n}, \\  -\frac{t^2}{2} - \log t - \frac{1}{2} \log (2\pi) &\approx \log \alpha - \log n, \\  t^2 + 2 \log t &\approx B.  \end{aligned}

n large implies B large, which in turn implies t large. This means that t^2 dominates the LHS, which gives us the first-order approximation t \approx \sqrt{B}. If we instead let t = \sqrt{B(1-u)} and substitute it into the equation above, we get

\begin{aligned} B(1-u) + \log [B(1-u)] &\approx B, \\  -Bu + \log B + \log (1-u) &\approx 0, \\  u &\approx \frac{\log B}{B}, \end{aligned}

since \log (1-u) is going to be small compared to the rest of the terms. Substituting this value of u back into the expression for t gives us the desired approximation.
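
Here is a quick numerical check of this approximation (my own sketch, not from the notes):

# Compare the exact quantile z(1 - alpha/n) with sqrt(B * (1 - log(B)/B))
alpha <- 0.05
n <- 10^(2:6)
B <- 2 * log(n) - 2 * log(alpha) - log(2 * pi)
exact <- qnorm(1 - alpha / n)
approx <- sqrt(B * (1 - log(B) / B))
round(cbind(n = n, exact = exact, approx = approx), 3)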

Approximating large normal quantiles

In the last post, we showed that for Z \sim \mathcal{N}(0,1) and large t, \mathbb{P}(Z > t) = 1 - \Phi(t) \approx \dfrac{\phi(t)}{t}. In this post, we’ll show how we can use this approximation to get an approximation for large quantiles of the normal distribution.

Let z(\alpha) be the \alpha-quantile of Z, that is, z(\alpha) is the value such that \mathbb{P}(Z \leq z(\alpha)) = \alpha. Fix \alpha. Then for large n,

z\left( 1 - \dfrac{\alpha}{n} \right) = t \quad \Leftrightarrow \quad \dfrac{\phi(t)}{t} \approx \dfrac{\alpha}{n}.

We are going to solve for t in the equation on the right:

\begin{aligned} \frac{1}{\sqrt{2\pi}} e^{-t^2 / 2 - \log t} &\approx \frac{\alpha}{n}, \\  - \log (\sqrt{2\pi}) - t^2 / 2 - \log t &\approx \log \alpha - \log n, \\  t^2 /2 + \log t &\approx \log n - \log \alpha - \log (\sqrt{2\pi}).  \end{aligned}

As n grows large, t has to grow large too. When these quantities are large, t^2 / 2 dominates the LHS while \log n dominates the RHS. Thus, the approximation becomes t^2 \approx 2 \log n, or t \approx \sqrt{2 \log n}. (Interestingly, this approximation does not depend on \alpha!)

We can get a finer approximation if we do not drop the \log t term in the LHS above. Instead, let t = \sqrt{2 \log n} (1 - u). Then we have

\begin{aligned} \frac{2 \log n \cdot (1-u)^2}{2} + \log [\sqrt{2 \log n} (1-u)] &\approx \log n, \\  (u^2 - 2u) \log n + \frac{1}{2}[\log 2 + \log \log n] + \log (1-u) &\approx 0, \\  u^2 - 2u + \frac{\log 2 + \log \log n}{2\log n} + \frac{\log (1 - u)}{\log n} &\approx 0. \end{aligned}

As n grows large, u is going to grow small. The terms that end up dominating are -2u and \dfrac{\log \log n}{2 \log n}, which gives us the approximation u \approx \dfrac{\log \log n}{4 \log n}. Hence, we have t \approx \sqrt{2 \log n} \left(1 - \dfrac{\log \log n}{4 \log n} \right).
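
Here is a quick numerical check (my own sketch). Since the approximation ignores \alpha, I compare against qnorm(1 - 1/n), i.e. \alpha = 1; the refined version tracks the exact quantile more closely than the crude \sqrt{2 \log n}, though convergence is slow.

# Compare qnorm(1 - 1/n) with the crude and refined approximations
n <- 10^(2:8)
exact <- qnorm(1 - 1 / n)
crude <- sqrt(2 * log(n))
refined <- sqrt(2 * log(n)) * (1 - log(log(n)) / (4 * log(n)))
round(cbind(n = n, exact = exact, crude = crude, refined = refined), 3)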

Credits: I learnt of this result in my STATS 300C class, but no proof was given (other than “use the \phi(t) / t approximation”).