Asymptotic normality of sample quantiles

Let X_1, \dots, X_n be i.i.d. random variables with cumulative distribution function (CDF) F. The central limit theorem tells us that the sample mean \overline{X}_n is asymptotically normal, as long as F has a finite second moment:

\begin{aligned} \sqrt{n} (\overline{X}_n - \mu) \stackrel{d}{\rightarrow} \mathcal{N}(0, \sigma^2), \end{aligned}

where \mu and \sigma^2 are the mean and variance of the random variable with CDF F.

It turns out that for any 0 < p < 1, the pth sample quantile is asymptotically normal as well. Here is the result:

Theorem. Fix 0 < p < 1. If F is differentiable at the pth quantile F^{-1}(p) with positive derivative f(F^{-1}(p)), then the sample quantile F_n^{-1}(p) is asymptotically normal:

\begin{aligned} \sqrt{n}\left( F_n^{-1}(p) - F^{-1}(p) \right) \stackrel{d}{\rightarrow} \mathcal{N} \left( 0, \dfrac{p(1-p)}{f^2 (F^{-1}(p))} \right). \end{aligned}

This result appears as Corollary 21.5 in Reference 1. A proof can be found in Reference 2 (there, x denotes F^{-1}(p) above); it is essentially an application of the Central Limit Theorem followed by the delta method.

The numerator of the variance, p(1-p), is largest when p = 1/2 and shrinks as we move to more extreme quantiles. This makes intuitive sense once we recognize p(1-p) as the variance of the indicator 1\{ X_i \leq F^{-1}(p) \}: the empirical CDF is most variable near the median, so for a fixed level of precision (and holding the density value fixed), we need more data to estimate the central quantiles than the extreme ones.

The denominator of the variance, f^2 (F^{-1}(p)), is largest when the CDF F is changing rapidly at the pth quantile. This again makes intuitive sense: if F is changing a lot there, we are likely to get samples in the neighborhood of F^{-1}(p), and so we can estimate it more accurately.
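As a sanity check, here is a small simulation sketch in Python with numpy (variable names are my own). For the median of standard normal data we have p = 1/2, F^{-1}(1/2) = 0 and f(0) = 1/\sqrt{2\pi}, so the theorem predicts an asymptotic variance of (1/4) \cdot 2\pi = \pi/2 \approx 1.571.

import numpy as np

rng = np.random.default_rng(0)
n, n_reps, p = 2000, 5000, 0.5

# For each replication, draw n standard normals and record the scaled sample median.
samples = rng.standard_normal((n_reps, n))
scaled = np.sqrt(n) * (np.quantile(samples, p, axis=1) - 0.0)  # F^{-1}(1/2) = 0

print("empirical variance:  ", scaled.var())  # close to 1.571
print("theoretical variance:", np.pi / 2)     # p(1-p) / f(F^{-1}(p))^2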

References:

  1. Van der Vaart, A. W. (2000). Asymptotic statistics.
  2. Stephens, D. A. Asymptotic distribution of sample quantiles.

General formula for the asymptotic covariance of the OLS estimator

This post derives the general formula for the covariance of the ordinary least squares (OLS) estimator.

Imagine we are in the regression setup with design matrix {\bf X} \in \mathbb{R}^{n \times p} and response {\bf y} \in \mathbb{R}^n. Let {\bf x}_i \in \mathbb{R}^p and y_i \in \mathbb{R} denote the ith row of \bf X and the ith element of \bf y respectively. We can always make the following decomposition:

y_i = \mathbb{E} [y_i \mid {\bf x}_i] + \varepsilon_i,

where \mathbb{E}[\varepsilon_i \mid {\bf x}_i] = 0 and \varepsilon_i is uncorrelated with any function of {\bf x}_i. (This is Theorem 3.1.1 of Reference 1.)

The population regression function approximates y_i as {\bf x}_i^T \beta, where \beta solves the minimization problem

\beta = \text{argmin}_b \: \mathbb{E} \left[ (y_i - {\bf x}_i^T b)^2 \right].

It can be shown that

\beta = \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E} [{\bf x}_i y_i].
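To see this, note that the objective is convex in b, so it suffices to set its gradient to zero (assuming \mathbb{E}[{\bf x}_i {\bf x}_i^T] is invertible):

\begin{aligned} \nabla_b \, \mathbb{E} \left[ (y_i - {\bf x}_i^T b)^2 \right] = -2 \, \mathbb{E} \left[ {\bf x}_i (y_i - {\bf x}_i^T b) \right] = {\bf 0} \quad \Longleftrightarrow \quad \mathbb{E}[{\bf x}_i {\bf x}_i^T] \, b = \mathbb{E}[{\bf x}_i y_i]. \end{aligned}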

The ordinary least squares (OLS) estimator is a sample version of this and is given by

\begin{aligned} \hat\beta = ({\bf X}^T {\bf X})^{-1} {\bf X}^T {\bf y} = \left( \sum_i {\bf x}_i {\bf x}_i^T \right)^{-1} \left( \sum_i {\bf x}_i y_i \right). \end{aligned}

We are often interested in estimating the covariance matrix of \hat\beta as it is needed to construct standard errors for \hat\beta. Defining \varepsilon_i = y_i - {\bf x}_i^T \beta as the ith residual, we can rewrite the above as

\begin{aligned} \hat{\beta} &= \beta + \left( \sum_i {\bf x}_i {\bf x}_i^T \right)^{-1} \left( \sum_i {\bf x}_i \varepsilon_i \right), \\  \sqrt{n}(\hat{\beta} - \beta) &= n \left( \sum_i {\bf x}_i {\bf x}_i^T \right)^{-1} \cdot \dfrac{1}{\sqrt{n}} \left( \sum_i {\bf x}_i \varepsilon_i \right). \end{aligned}

Since n \left( \sum_i {\bf x}_i {\bf x}_i^T \right)^{-1} = \left( \frac{1}{n} \sum_i {\bf x}_i {\bf x}_i^T \right)^{-1} \stackrel{P}{\rightarrow} \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} by the law of large numbers and the continuous mapping theorem, Slutsky’s Theorem implies that the quantity above has the same asymptotic distribution as \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \cdot \frac{1}{\sqrt{n}} \left( \sum_i {\bf x}_i \varepsilon_i \right). Since \mathbb{E}[{\bf x}_i \varepsilon_i] = {\bf 0}*, the Central Limit Theorem tells us that \frac{1}{\sqrt{n}} \left( \sum_i {\bf x}_i \varepsilon_i \right) is asymptotically normally distributed with mean zero and covariance \mathbb{E}[{\bf x}_i {\bf x}_i^T \varepsilon_i^2] (the matrix of fourth moments). Thus,

\begin{aligned} \sqrt{n}(\hat{\beta} - \beta) \stackrel{d}{\rightarrow} \mathcal{N} \left( {\bf 0}, \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E}[{\bf x}_i {\bf x}_i^T \varepsilon_i^2] \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \right). \end{aligned}

We can use the diagonal elements of \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E}[{\bf x}_i {\bf x}_i^T \varepsilon_i^2] \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} (divided by n, with the expectations replaced by sample averages and \varepsilon_i by the fitted residuals) to construct standard errors of \hat\beta. The standard errors computed in this way are called heteroskedasticity-consistent standard errors (or White standard errors, or Eicker-White standard errors). They are “robust” in the sense that they rely on few assumptions about the data and the model: only those needed to make the Central Limit Theorem go through.

*Note: We do NOT need to assume that \mathbb{E} [y_i \mid {\bf x}_i] is linear in order to conclude that \mathbb{E}[{\bf x}_i \varepsilon_i] = {\bf 0}. All we need are the relations \beta = \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E} [{\bf x}_i y_i] and \varepsilon_i = y_i - {\bf x}_i^T \beta. The derivation is as follows:

\begin{aligned} \mathbb{E}[{\bf x}_i \varepsilon_i] &= \mathbb{E}[{\bf x}_i (y_i - {\bf x}_i^T \beta)] \\  &= \mathbb{E}[{\bf x}_i y_i] - \mathbb{E}[ {\bf x}_i {\bf x}_i^T] \beta \\  &= \mathbb{E}[{\bf x}_i y_i] - \mathbb{E}[ {\bf x}_i {\bf x}_i^T] \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E} [{\bf x}_i y_i] \\  &= {\bf 0}. \end{aligned}
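To make the sandwich formula concrete, here is a minimal numpy sketch of these standard errors (the HC0 variant; function and variable names are my own). Expectations are replaced by sample averages and \varepsilon_i by the fitted residuals, and we divide by n at the end to get the covariance of \hat\beta itself rather than that of \sqrt{n}(\hat\beta - \beta).

import numpy as np

def white_se(X, y):
    """OLS estimate and heteroskedasticity-consistent (HC0) standard errors."""
    n = X.shape[0]
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    bread = np.linalg.inv(X.T @ X / n)           # estimates E[x_i x_i^T]^{-1}
    meat = (X * resid[:, None] ** 2).T @ X / n   # estimates E[x_i x_i^T eps_i^2]
    cov = bread @ meat @ bread / n               # sandwich, scaled back by 1/n
    return beta_hat, np.sqrt(np.diag(cov))

Statistical software typically also offers finite-sample corrections of this estimator (HC1 to HC3), which differ only in how the residuals are rescaled.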

Special case of homoskedastic errors

If we assume that the errors are homoskedastic, i.e.

\mathbb{E}[\varepsilon_i^2 \mid {\bf x}_i] = \sigma^2 for all i

for some constant \sigma^2 > 0, then \mathbb{E}[{\bf x}_i {\bf x}_i^T \varepsilon_i^2] = \mathbb{E} \left[ {\bf x}_i {\bf x}_i^T \, \mathbb{E}(\varepsilon_i^2 \mid {\bf x}_i) \right] = \sigma^2 \mathbb{E}[{\bf x}_i {\bf x}_i^T] by the tower property, and the asymptotic covariance simplifies:

\begin{aligned} \sqrt{n}(\hat{\beta} - \beta) \stackrel{d}{\rightarrow} \mathcal{N} \left( {\bf 0}, \sigma^2 \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \right). \end{aligned}

References:

  1. Angrist, J. D., and Pischke, J.-S. (2009). Mostly harmless econometrics (Section 3.1.3).

Variance of coefficients for linear models and generalized linear models

Variance of coefficients in linear models

Assume we are in the supervised learning setting with design matrix {\bf X} \in \mathbb{R}^{n \times p} and response y \in \mathbb{R}^n. If the linear regression model is true, i.e.

y = {\bf X} \beta + \epsilon,

where \beta \in \mathbb{R}^p is the vector of coefficients and the entries of \epsilon \in \mathbb{R}^n are i.i.d. with mean 0 and variance \sigma^2, it is well-known that the least squares estimate of \beta (which is also the maximum likelihood estimate (MLE) when the errors are Gaussian) is

\hat\beta = ({\bf X}^T{\bf X})^{-1} {\bf X}^T y.

Assuming {\bf X} is fixed and that \epsilon is the only source of randomness, it follows easily that

Var(\hat\beta) = \sigma^2 ({\bf X}^T{\bf X})^{-1}.

It’s remarkable that (i) we have a closed form expression for the variance of \hat\beta, and (ii) this variance does not depend on the value of \beta.
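A quick simulation sketch (variable names are my own) illustrates this: holding {\bf X} fixed and redrawing the noise many times, the empirical covariance of \hat\beta matches \sigma^2 ({\bf X}^T{\bf X})^{-1}.

import numpy as np

rng = np.random.default_rng(1)
n, sigma = 200, 1.5
X = rng.standard_normal((n, 3))        # fixed design matrix
beta = np.array([1.0, -2.0, 0.5])

# Refit on fresh noise many times, keeping X fixed.
betas = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta + sigma * rng.standard_normal(n)))
    for _ in range(20000)
])

print(np.cov(betas.T))                    # empirical covariance of beta_hat
print(sigma**2 * np.linalg.inv(X.T @ X))  # theoretical covariance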

Variance of coefficients in generalized linear models (GLMs)

We are not so lucky on both counts in the generalized linear model (GLM) setting. Assume that the true data generating model is given by a GLM, i.e.

\begin{aligned} Y_i &\stackrel{ind.}{\sim} f(y_i ; \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right], \\  \eta_i &= \sum_{j=1}^p \beta_j x_{ij}, \text{ and } \\  \eta_i &= g (\mu_i),  \end{aligned}

where i = 1, \dots, n indexes the observations, \mu_i = \mathbb{E}[Y_i], and g is the link function. (See this previous post for details.)

In this setting, we do not have a closed form solution for \hat\beta, the MLE of \beta. (See this previous post for the likelihood equations and ways to obtain the MLE numerically.) We are also not able to get a closed form solution for the variance of \hat\beta. We can however get a closed form solution for the asymptotic variance of \hat\beta:

\begin{aligned} Var( \hat\beta) &= ({\bf X}^T {\bf WX})^{-1}, \text{ where} \\  {\bf W}_{ij} &= \begin{cases} \left( \dfrac{\partial \mu_i}{\partial \eta_i} \right)^2 / Var(Y_i) &\text{if } i = j, \\ 0 &\text{otherwise}. \end{cases} \end{aligned}
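As an illustration, here is a sketch for logistic regression, assuming the statsmodels package is available (variable names are my own). With the canonical logit link we have \partial \mu_i / \partial \eta_i = \mu_i (1 - \mu_i) and Var(Y_i) = \mu_i (1 - \mu_i), so {\bf W}_{ii} = \mu_i (1 - \mu_i); evaluating ({\bf X}^T {\bf WX})^{-1} at the fitted probabilities recovers the covariance matrix that statsmodels reports (up to numerical tolerance).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = sm.add_constant(rng.standard_normal((500, 2)))
eta = X @ np.array([-0.5, 1.0, 2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
mu = fit.fittedvalues               # fitted probabilities mu_i
W = np.diag(mu * (1 - mu))          # W_ii = (d mu_i / d eta_i)^2 / Var(Y_i)

print(np.linalg.inv(X.T @ W @ X))   # asymptotic Var(beta_hat) at the fit
print(fit.cov_params())             # statsmodels' covariance matrix agrees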

References:

  1. Agresti, A. (2013). Categorical data analysis (3rd ed): Section 4.4.8.

Strategies for proving convergence in distribution

When starting out in theoretical statistics, it’s often difficult to know how to even get started on the problem at hand. It helps to have a few general strategies in mind. One thing you are often asked to do is to prove that an estimator \hat{\theta}_n (appropriately scaled) converges in distribution, and you are often asked to determine that limiting distribution. In this post, I list some strategies you can use for that.

Note:

  1. This list is not exhaustive. If you have a strategy for proving convergence in distribution that is not in the list, feel free to share it!
  2. The arguments here are not necessarily rigorous, so you need to check the conditions under which they apply.

Now for the strategies:

  1. Use the central limit theorem (CLT), or some version of it. This is especially applicable for i.i.d. sums, but there are versions of the CLT that don’t require identical distribution (and there are some that don’t even require complete independence!).
  2. Use the delta method.
  3. If your estimator is the maximum likelihood estimator (MLE), then it is asymptotically normal under some regularity conditions (see here).
  4. If your estimator is an M-estimator or a Z-estimator, then it is asymptotically normal under some regularity conditions (see here).
  5. Work directly with the definition of convergence in distribution: show that the CDFs converge pointwise to the limiting CDF (at all points of continuity of the limiting CDF).
  6. Use Lévy’s continuity theorem: If the characteristic functions of the random variable sequence converge pointwise to the characteristic function of the limiting random variable, then the random variable sequence converges in distribution to the limiting random variable. (A similar theorem applies for moment generating functions, see here.)
  7. Use the Portmanteau theorem.
  8. Slutsky’s theorem will often come in handy. Often the quantity we need to prove convergence in distribution for is a ratio A_n / B_n with randomness in both numerator and denominator. Slutsky’s theorem allows us to deal with the two sources of randomness separately. If we can show something like A_n / m_n \stackrel{d}{\rightarrow} P and m_n / B_n \stackrel{P}{\rightarrow} c for some constant c and deterministic sequence \{ m_n \}, then Slutsky’s theorem concludes that A_n / B_n \stackrel{d}{\rightarrow} cP. (When m_n = m for all n and c \neq 0, showing m_n / B_n \stackrel{P}{\rightarrow} c is the same as showing B_n \stackrel{P}{\rightarrow} m/c.) A worked example follows this list.
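Here is the promised worked example of strategy 8: the classic t-statistic. Suppose X_1, \dots, X_n are i.i.d. with mean \mu and finite variance \sigma^2, and let S_n denote the sample standard deviation. Write

\begin{aligned} \frac{\sqrt{n} (\overline{X}_n - \mu)}{S_n} = \sqrt{n} (\overline{X}_n - \mu) \cdot \frac{1}{S_n}. \end{aligned}

The first factor converges in distribution to \mathcal{N}(0, \sigma^2) by the CLT, while S_n \stackrel{P}{\rightarrow} \sigma by the law of large numbers and the continuous mapping theorem. Slutsky’s theorem then gives \sqrt{n} (\overline{X}_n - \mu) / S_n \stackrel{d}{\rightarrow} \mathcal{N}(0, 1).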

Strategies for proving that an estimator is consistent

When starting out in theoretical statistics, it’s often difficult to know how to even get started on the problem at hand. It helps to have a few general strategies in mind. In this post, I list some strategies you can use to prove that an estimator \hat{\theta}_n is consistent for a parameter \theta.

Note:

  1. This list is not exhaustive. If you have a strategy for proving consistency that is not in the list, feel free to share it!
  2. The arguments here are not necessarily rigorous, so you need to check the conditions under which they apply.

Now for the strategies:

  1. Use the (weak) law of large numbers. This is especially applicable for i.i.d. sums.
  2. Use Chebyshev’s inequality: If \hat\theta_n is unbiased, then \mathbb{P} \{ |\hat\theta_n - \theta | > \epsilon \} \leq \dfrac{Var(\hat\theta_n)}{\epsilon^2}. Thus, the estimator will be consistent if Var(\hat\theta_n) \rightarrow 0. (A worked example follows this list.)
  3. In strategy 2, we actually only need \hat\theta_n to be asymptotically unbiased (i.e. \mathbb{E}[\hat\theta_n] \rightarrow \theta) for the result to hold.
  4. If your estimator is the maximum likelihood estimator (MLE), then it is consistent under some regularity conditions (see here).
  5. If your estimator is an M-estimator or a Z-estimator, try to use the argmax consistency theorem (e.g. see slides 7 and 9 here).
  6. Try to use the continuous mapping theorem: if X_n \stackrel{P}{\rightarrow} X, then g(X_n) \stackrel{P}{\rightarrow} g(X) for any continuous function g.
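Here is the promised worked example for strategy 2. Let X_1, \dots, X_n be i.i.d. with mean \theta and variance \sigma^2 < \infty, and take \hat\theta_n = \overline{X}_n. The estimator is unbiased with Var(\hat\theta_n) = \sigma^2 / n, so Chebyshev’s inequality gives

\begin{aligned} \mathbb{P} \{ | \overline{X}_n - \theta | > \epsilon \} \leq \frac{\sigma^2}{n \epsilon^2} \rightarrow 0 \end{aligned}

for every \epsilon > 0, i.e. \overline{X}_n is consistent for \theta.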

Big O_P and little o_P notation

For deterministic functions, we use big O and little o notation to simplify expressions for asymptotic behavior. For sequences of random variables, we need slightly more complicated notation to account for their probabilistic nature: big O_P and little o_P notation. This notation is used frequently in theoretical statistics.

Definitions

The notation o_P(1) is shorthand for a sequence of random variables (or vectors) that converges to zero in probability. In other words, if X_1, X_2, \dots is a sequence of random variables such that X_n \stackrel{P}{\rightarrow} 0 as n \rightarrow \infty, we write X_n = o_P(1).

The notation O_P(1) is shorthand for a sequence of random variables (or vectors) that is “uniformly tight”, i.e. bounded in probability. For a sequence of random variables X_1, X_2, \dots, we write X_n = O_P(1) if for every \epsilon > 0, there is an M such that

\begin{aligned} \sup_{n} \mathbb{P}(\| X_n \| > M) < \epsilon. \end{aligned}

More generally, for random variables X_1, X_2, \dots and R_1, R_2, \dots,

  • We write X_n = o_P(R_n) to mean X_n = Y_n R_n with Y_n = o_P(1), and
  • We write X_n = O_P(R_n) to mean X_n = Y_n R_n with Y_n = O_P(1).
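As a canonical example: if X_1, X_2, \dots are i.i.d. with mean \mu and finite variance, then \sqrt{n} (\overline{X}_n - \mu) converges in distribution by the CLT, and any sequence that converges in distribution is uniformly tight. Hence \sqrt{n} (\overline{X}_n - \mu) = O_P(1), or equivalently \overline{X}_n - \mu = O_P(n^{-1/2}).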

Properties

Just as deterministic big O and little o notation has rules for manipulation (e.g. O(n^2) + O(n) = O(n^2)), probabilistic O_P and o_P notation has its own set of rules to help us work with it. Here are some basic rules (taken from Reference 1), each of which can be proved by going back to the definitions; a sample proof is given after the list:

  1. o_P(1) + o_P(1) = o_P(1).
  2. o_P(1) + O_P(1) = O_P(1).
  3. O_P(1) o_P(1) = o_P(1).
  4. \left(1 + o_P(1) \right)^{-1} = O_P(1).
  5. o_P(R_n) = R_n o_P(1).
  6. O_P(R_n) = R_n O_P(1).
  7. o_P \left( O_P(1) \right) = o_P(1).
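As an illustration of how these follow from the definitions, here is a proof sketch of rule 3. Suppose X_n = O_P(1) and Y_n = o_P(1), and fix \epsilon, \delta > 0. Choose M such that \sup_n \mathbb{P}(|X_n| > M) < \delta. Since |X_n Y_n| > \epsilon requires |X_n| > M or |Y_n| > \epsilon / M,

\begin{aligned} \mathbb{P}(|X_n Y_n| > \epsilon) \leq \mathbb{P}(|X_n| > M) + \mathbb{P}(|Y_n| > \epsilon / M) < \delta + o(1). \end{aligned}

As \delta was arbitrary, \mathbb{P}(|X_n Y_n| > \epsilon) \rightarrow 0, i.e. X_n Y_n = o_P(1).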

The following lemma (Lemma 2.12 of Reference 1) connects deterministic O and o notation to probabilistic O_P and o_P notation:

Lemma. Let R be a function with domain in \mathbb{R}^k such that R(0) = 0. Let X_n be a sequence of random vectors taking values in the domain of R such that X_n \stackrel{P}{\rightarrow} 0. Then for every p > 0:

  • If R(h) = o(\|h \|^p) as h \rightarrow 0, then R(X_n) = o_P(\| X_n\|^p).
  • If R(h) = O(\|h \|^p) as h \rightarrow 0, then R(X_n) = O_P(\| X_n\|^p).

References:

  1. van der Vaart, A. W. (2000). Asymptotic statistics (Section 2.2).
  2. Bartlett, P. Theoretical Statistics. Lecture 2.