Estimating pi with Monte Carlo simulation

Happy Pi Day! I don’t encounter \pi very much in my area of statistics, so this post might seem a little forced… In this post, I’m going to show one way to estimate \pi.

The starting point is the integral identity

\begin{aligned} 4 \int_0^1 \sqrt{1 - x^2} dx = \pi. \end{aligned}

There are two ways to see why this identity is true. The first is that the integral is simply computing the area of a quarter-circle with unit radius. The second is by explicitly evaluating the integral:

\begin{aligned} \int_0^1 \sqrt{1 - x^2}dx &= \left[ \frac{1}{2} \left( x \sqrt{1 - x^2} + \arcsin x \right) \right]_0^1 \\  &= \frac{1}{2} \left[ \frac{\pi}{2} - 0 \right] \\  &= \frac{\pi}{4}. \end{aligned}

If X \sim \text{Unif}[0,1], then the integral identity means that

\begin{aligned} \mathbb{E} \left[ 4\sqrt{1 - X^2} \right] = \pi. \end{aligned}

Hence, if we take i.i.d. draws X_1, \dots, X_n from the uniform distribution on [0,1], it is reasonable to expect that

\begin{aligned} \frac{1}{n}\sum_{i=1}^n 4\sqrt{1 - X_i^2} \approx \pi. \end{aligned}

The code below shows how well this estimation procedure does for one run as the sample size goes from 1 to 200:

N <- 200
x <- runif(N)
samples <- 4 * sqrt(1 - x^2)
estimates <- cumsum(samples) / 1:N
plot(1:N, estimates, type = "l",
     xlab = "Sample size", ylab = "Estimate of pi",
     main = "Estimates of pi vs. sample size")
abline(h = pi, col ="red", lty = 2)

The next plot shows the relative error on the y-axis instead (the red dotted line represents 1% relative error):

rel_error <- abs(estimates - pi) / pi * 100
plot(1:N, rel_error, type = "l",
xlab = "Sample size", ylab = "Relative error (%)",
main = "Relative error vs. sample size")
abline(h = 0)
abline(h = 1, col = "red", lty = 2)

The law of large populations

I recently read through a really interesting paper by Xiao-Li Meng (Reference 1). It’s relatively light on mathematical machinery (you can get through most of it with just a first course in statistics), but deep in terms of the statistical interpretation of the results. Meng is a first-rate communicator of statistics (just watch some of his videos on Youtube or read some of his musings on XL Files), making this work a joy to read.

(I did not check, but I think Meng covers some of the material in the paper in this Youtube video.)


This paper was motivated by the question “Which one should we trust more, a 5% survey sample or an 80% administrative dataset?” where 5% and 80% present the proportion of the population in the sample. It is pretty obvious that one should prefer a well-randomized 5% survey sample over a 5% administrative dataset, since the latter probably contains several sources of bias. But what if the administrative dataset is much larger? Can we say anything about it?

Meng answers this question to some degree, considering the case where the quantity of interest is a population mean, and we try to estimate it using a sample mean. In the rest of this post, I present just the key identity: read the paper for the details!


Define the following quantities:

  • N: size of the population, n: size of the sample.
  • \overline{G}_N = \dfrac{1}{N}\displaystyle\sum_{j=1}^N G_j: population quantity of interest.
  • R_j = 1 if j is in the sample, 0 otherwise.
  • \begin{aligned} \overline{G}_n = \frac{1}{n}\sum_{j \text{ in sample}} G_j = \frac{\sum_{j=1}^N R_j G_j}{\sum_{j=1}^N R_j} \end{aligned}: sample average (the estimator that we will consider).
  • f = n / N: the fraction of the population in the sample, or sampling rate.

All the quantities above are assumed to be fixed except possibly the R_j‘s. The R_j‘s define what Meng calls the R-mechanism: the way that the sample arrives to us. In simply random sampling, we know the joint distribution of the R_j;s. But it is possible that the sample is not uniformly random, e.g. self-reported data where an individual is more or less likely to report depending on their value of G. It’s even possible that R_j is deterministic: in this case, an individual always decides to report or always decides not to report. In any case, the R-mechanism and the R_j‘s are going to be vital for what follows.

In addition, for convenience let J be a random variable which is uniformly distributed over \{ 1, 2, \dots, N\} (and independent of any preceding randomness).

A fundamental identity for data quality-quantity tradeoff

In Equation (2.3), Meng presents a key identity that demonstrates the tradeoff between data quality and data quantity:

\begin{aligned} \overline{G}_n - \overline{G}_N = \underbrace{\rho_{R, G}}_{\text{Data Quality}} \times \underbrace{\sqrt{\frac{1-f}{f}}}_{\text{Data Quantity}} \times \underbrace{\sigma_G}_{\text{Problem Difficulty}}, \end{aligned}


  • \rho_{R, G} = Corr_J (R_J, G_J) is called the data defect correlation, measuring both sign and degree of selection bias caused by the R-mechanism, and
  • \sigma_G = SD_J (G_J) is the standard deviation of G_J. This captures problem difficulty: the more variation there is among the G_j‘s, the harder it will be to estimate \overline{G}_N.

What’s really interesting about this identity is that it is completely due to algebraic manipulation: no mathematical or probabilistic assumptions go into it! (Other than the definitions in the notation section above.) Conditional on knowing the values of the R_j‘s, Meng shows in Equation (2.2) that

\begin{aligned} \overline{G}_n - \overline{G}_N = ... = \dfrac{Cov_J (R_J, G_J)}{E_J [R_J]}. \end{aligned}

Noting that the R_j‘s are binary variables, we have Var_J (R_J) = (1-f) / f, leading to

\begin{aligned}\overline{G}_n - \overline{G}_N &=\dfrac{Cov_J (R_J, G_J)}{SD_J (R_J) SD_J (G_J) } \times \dfrac{SD_J(R_J)}{E_J [R_J]} \times SD_J (G_J) \\  &= Corr_J (R_J, G_J) \times \dfrac{\sqrt{(1-f)/f}}{f} \times \sigma_G \\  &= \rho_{R,G} \times \sqrt{\dfrac{1-f}{f}} \times \sigma_G, \end{aligned}

as required.

The data defect index

Squaring the identity above and taking expectations with respect to the R-mechanism gives us an expression for the mean-squared error (MSE) (Equation (2.4)):

\begin{aligned} MSE_R (\overline{G}_n) = \underbrace{E_R [\rho_{R, G}^2]}_{D_I} \times \underbrace{\frac{1-f}{f}}_{D_O} \times \underbrace{\sigma_G^2}_{D_U}. \end{aligned}

Thus, there are 3 ways to reduce MSE:

  1. Increase the data quality by reducing D_I: the data defect index.
  2. Increase the data quantity by reducing D_O: the dropout odds. This, however, turns out to be much less effective than increasing data quality.
  3. Reduce the difficulty of the estimation problem by reducing D_U: the degree of uncertainty (this is typically only possible with additional information).

The law of large populations (LLP)

With the identities above, Meng formulates a law of large populations:

Among studies sharing the same (fixed) average data defect correlation E_R [\rho_{R, G}] \neq 0, the (stochastic) error of \overline{G}_n, relative to its benchmark under simple random sampling, grows with the population size N at the rate of \sqrt{N}.


  1. Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election.

What is raking?

tl;dr: Raking (or iterative proportional fitting) is “a post-stratification procedure for adjusting sample weights in a survey so that the adjusted weights add up to known population population totals for the post-stratified classifications when only the marginal population totals are known.” (Reference 1)

While the idea behind raking is pretty simple, I had trouble finding an example to demonstrate the procedure. In what follows, I walk through a toy example of how to apply raking to an estimation problem. (It shouldn’t be too hard to transfer the ideas into mathematical notation.)

Problem background

Imagine that we want to estimate the mean number of hours of TV people in Fantasyland watch each week. Fantasyland is pretty diverse: we can imagine that it is made up of a number of distinct subpopulations, each having very different TV watching habits. For this toy example, assume that there are 2 age groups (child, adult) and 3 household income groups (< $40k/yr, $40-200k/yr, >$200k/yr). Assume that people who fall within the same (age, income) group watch exactly the same amount of TV. The following two tables show the number of people falling in each (age, income) group and how much TV each group watches per week:

If we knew the values in these two tables, then we would know the true value of the mean we are looking for:

\begin{aligned} \mu &= \dfrac{(10,000)(10) + (1,350)(60) + (150)(70) + (20,000)(1) + (20,000)(2) + (150)(70)}{51,650} \\ &\approx 5.07. \end{aligned}

We go and get some data…

Of course, we don’t know the values in the tables above, and so we get a random sample of Fantasyland inhabitants and ask them how much TV they watch. We can then tabulate our data in two tables much like the earlier ones:

(In this example, because we assume that people who fall within the same (age, income) group watch exactly the same amount of TV, the data on the right table is always going to be the same. We will omit it in future figures.)

Problem with simply taking the mean

One might estimate the true mean TV hours watched per week by taking a simple mean of the data we collected in our sample:

\begin{aligned} \hat{\mu} &= \dfrac{(10)(10) + (20)(60) + (30)(70) + (10)(1) + (20)(2) + (40)(70)}{130} \\ &\approx 48.1. \end{aligned}

This estimate is WAY off from the truth! What happened? Let’s look at the population and sample size tables again:

The relative proportions of the 6 subgroups in our sample are completely different from that in the population! For example, household income >$200k make less than 1% of our population, but more than 50% of our sample. As a result, the simple mean of our sample grossly overweights the values of this group, giving us an estimate that is severely biased upwards.

A possible workaround, but what if…?

If we knew the values in the population table, we could just use those values as sample weights instead of the relative frequencies in the sample we took. In this example, it would give us the true value of \mu exactly.

But what if we didn’t know the values in the cells, just the row and column totals? Can we do better than just the simple mean?

While this scenario might look contrived, it does happen in practice. For example, maybe Fantasyland did a census recently and disclosed the row and column totals, but for privacy purposes they did not release the cell values. (This scenario happens more often when we have many demographic variables. For example, in the US we know how many females there are, how many native Americans there are, how many live in Texas, and how many are between 30-39 years, but we don’t know the number of female native American Texans who are 30-39 years old.)

Raking to the rescue!

Raking, also known as iterative proportional fitting (IPF), is a method for adjusting sample weights so that they more accurately reflect the true population weights. The goal of raking is to adjust the sample weights so that the row and column totals (also known as the marginals) mimic those of the population.

It achieves its goal in the following way: Let’s say we are raking an n-way table for variables 1 to n, and that we only know the population marginals.

  1. Multiply all the cells in the sample size table by the same constant so that the cells sum up to the true population total.
  2. Look at variable 1. For each slice of the table that has the same value for variable 1, multiply the cells in this slice by the same constant so that these cells sum up to the true population marginal. Do the same for variable 2, 3, …, n.
  3. Repeat Step 2 until convergence.

Raking demonstrated

Raking is best demonstrated by example. In our TV watching example, in the first step we inflate all cells in our sample size table so that the cells sum up to the true population total, 51,650. (Note: Tables from here on out round the numbers to the nearest integer, so you may see some inconsistencies due to rounding.)

Since everything was inflated by the same factor, computing our estimate \hat{\mu} with these weights will just give us the same value: 48.1.

Let’s look at the variable Age. We will multiply the cells in row 1 by one constant and the cells in row 2 by another constant so that the row totals match the true population row totals:

Next, let’s look at the variable Household Income. We will multiply the cells in each column by a constant such that the column totals match those of the true population:

This completes one round of Step 2 of the raking algorithm. Notice that the sample weights more closely resemble those of the population already. If we use these weights for our estimate, we get something much more reasonable than the simple mean:

\begin{aligned} \hat{\mu} &= \dfrac{(7,514)(10) + (5,347)(60) + (60)(70) + (22,486)(1) + (16,003)(2) + (240)(70)}{51,650} \\ &\approx 9.1. \end{aligned}

After this second operation, the column marginals match the population ones, but the row marginals no longer match their population counterparts. We can do another round of adjustment: first matching the row totals:

and then matching the column totals:

We now have the estimate

\begin{aligned} \hat{\mu} &= \dfrac{(6,688)(10) + (4,759)(60) + (53)(70) + (23,312)(1) + (16,591)(2) + (247)(70)}{51,650} \\ &\approx 8.3. \end{aligned}

After the last operation, both the row and column totals match their population counterparts, so the raking procedure has converged.

Further notes on raking

  • If we only have one variable that we are adjusting for (say Age or Household Income in our example), then raking is the same as post-stratification. (I talk a little bit about post-stratification here.)
  • While the marginals of a raked table will be very close to that of the population, there are no guarantees for any of the totals based on more than one variable. In our example, we cannot guarantee that the cell values of the raked table will be close to the cell values in the population table.
  • Raking cannot work if any of the marginals are equal to zero, since it will involve dividing by zero. Reference 2 suggests adding a little amount to these cells (e.g. 0.001 if all other marginals are whole numbers) to circumvent this problem.
  • Raking will not adjust cells in the table that are equal to zero. The workaround is the same as the previous point: adding a little amount to these cells.

Further reading

Hunsinger has excel sheets that walk through raking for 2, 3 and 4 variables that I think are very helpful. They are available as links within the first 3 links of Reference 2 (e.g. link on slide 6 of “Iterative Proportional Fitting for a Two-Dimensional Table”).

Reference 3 contains a good list of tips and things to think about when using raking.


  1. Lavrakas, P. J. (ed). (2008). Entry on Raking in Encyclopedia of Survey Research Methods.
  2. Hunsinger, E.. Iterative Proportional Fitting Information and Code.
  3. Battaglia, M. P., Hoaglin, D. C., and Frankel, M. R. (2009). Practical Considerations in Raking Survey Data.

Simple random sampling, stratification and post-stratification

The problem

Imagine that you have a population of N subjects. Each subject has some parameter value denoted \theta_i for the ith subject. You are interested in the mean value of this parameter, i.e.

\begin{aligned} \theta = \dfrac{\theta_1 + \dots + \theta_N}{N} = \dfrac{1}{N}\sum_{i=1}^N \theta_i. \end{aligned}

For example, we might want to estimate the number of hours the average person spends watching TV each week. \theta_i would be the number of hours person i spends watching TV each week.

(Depending on how we define the setup, \theta_i could be viewed as random or deterministic. For simplicity, we’re going to assume that they are deterministic. If they are random, we can define \theta as the expectation of the RHS. Either way, \theta is a fixed quantity we are trying to estimate.)

Simple random sampling

The simplest estimation strategy is simple random sampling. Pick some sample size n, and randomly draw n subjects from the population. The random draw should be done such that all subsets of size n are equally likely. If the n subjects have parameter values \theta_1, \dots, \theta_n, then we would estimate \theta by

\begin{aligned} \hat{\theta}_{SRS} = \sum_{i=1}^n \theta_i. \end{aligned}


Imagine that our population is pretty heterogeneous, i.e. we can think of our population as a mix of several subpopulations, where subjects within the same subpopulation are more similar than subjects from different subpopulations. In this setting, stratified sampling can help reduce the variance of our estimate. Assume that our population is made up of J non-overlapping subpopulations (called strata), and that strata j has N_j subjects (so N = N_1+ \dots + N_J). Within each strata, we could perform a simple random sample, then combine the estimates appropriately.

If we take a sample of size n_j for the jth strata and the subjects have parameter values \theta_{j1}, \dots, \theta_{jn_j}, then our estimate for the jth strata would be

\begin{aligned} \hat{\theta}^{(j)} = \frac{1}{n_j} \sum_{k=1}^{n_j} \theta_{jk}, \end{aligned}

and our overall estimate would be

\begin{aligned} \hat{\theta}_{SS} = \sum_{j=1}^J \frac{N_j}{N} \hat{\theta}^{(j)} = \sum_{j=1}^J \sum_{k=1}^{n_j} \frac{N_j}{N n_j}  \theta_{jk}. \end{aligned}

We can think of the RHS expression above as a weighted sum of the observations:

\begin{aligned} \hat{\theta}_{SS} = \sum_{j=1}^J \sum_{k=1}^{n_j} w_{jk}\theta_{jk}, \quad \text{where} \quad w_{jk} = \frac{N_j}{N n_j}. \end{aligned}

In simple random sampling, we would have w_{jk} = 1/n for all j, k instead.

It’s not hard to see how stratification might help, as the following example shows. Assume that in our TV watching example, we have 100 kids who watch 70 hours of TV per week and 10,000 adults who watch 0.1 hours of TV per week (fictional numbers!). The true mean hours of TV per week is

\begin{aligned} \theta = \frac{100 \cdot 70 + 10,000 \cdot 0.1}{100 + 10,000} \approx 0.79 \text{ hours}. \end{aligned}

Let’s say we take a simple random sample of 10 subjects from this population to estimate \theta. Since kids represent less than 1% of the population, it’s highly likely that our 10 subjects will all be adults, resulting in an estimate of \hat{\theta}_{SRS} = 0.1. Not great!

The problem with the above is that we don’t sample from the subpopulation of kids at all. Stratified sampling ensures that this doesn’t happen. With stratification, we might sample 4 kids and 6 adults, which gives us an estimate

\begin{aligned} \hat{\theta}_{SS} = 4 \cdot \frac{100}{(10,000 + 100)(4)} \cdot 70 + 6 \cdot \frac{10,000}{(10,000 + 100)(6)} \cdot 0.1 \approx 0.79 \text{ hours} \end{aligned}

which is spot on.

The correction factor N_j / N is needed in the weights so that the relative sizes of the strata in our sample reflect that in the population. In our example, the proportion of kids in the sample is much higher than that in the population. This means that if we look at the estimates for children and adults and take a simple average (i.e. use weights w_{jk} = 1/(J n_j) instead of w_{jk} = N_j / (Nn_j)), we would be overweighting the kids’ responses. The calculation below shows that:

\begin{aligned} \sum_{j=1}^2 \sum_{k = 1}^{n_j} \frac{1}{n_j} \theta_{jk}  = 4 \cdot \frac{1}{2 * 4} \cdot 70 + 6 \cdot \frac{1}{2 * 6} \cdot 0.1 = 35.05 \text{ hours}.  \end{aligned}


In stratification, we identified the strata beforehand, then performed a simple random sample in each strata. In post-stratification, we perform the sampling first, the identify the strata afterwards and make adjustments based on that. Post-stratification is more common than you might think as it is often hard to identify which strata an object belongs to beforehand. In our TV example, you might be given just a list of phone numbers that you can call. Before calling, you don’t know if the subject is an adult or a kid: you can only find out after calling them.

The formulas for post-stratification are the same as those for stratification with the exception that in stratification, the numbers n_j are determined by the researcher before sampling is done, while in post-stratification they are determined by the data.

Because the n_j are determined by the data, there are times when post-stratification can go wrong. For example, what if one of the n_j‘s is zero? This corresponds to not sampling at all from a particular stratum (in our example, we would very likely have zero kids in our sample). Since n_j appears in the denominator of the weights, we will need to make some other corrections.

While post-stratification and stratification often give very similar results, for the reason above (and a few others), Reference 3 recommends stratification over post-stratification where possible.

(Note: Stratification and post-stratification can be applied for a wide range of sampling designs. In this post I’ve only illustrated them with the simple random sample. The point here is to make the high-level distinction between the concepts.)


  1. Statistics How To. Simple Random Sample: Definition and Examples.
  2. Penn State Eberly College of Science. 6.2 The Stratification Principle.
  3. Lavrakas, P. J. (ed). (2008). Entry on Post-Stratification in Encyclopedia of Survey Research Methods.

Consistency and unbiasedness

Let P be some underlying true probability distribution. We are interested in some parameter of the probability distribution \theta = \theta(P) (e.g. the mean, median or variance of P). To estimate \theta, we have on hand samples X_1, X_2, \dots, which are i.i.d. draws from the distribution P. For each n = 1, 2, \dots, let T_n be the estimator based on the first n data points.

  • The estimator T_n is consistent if T_n converges in probability to \theta.
  • It is unbiased if it is, on average, equal to the true value of the parameter, i.e. \theta if \mathbb{E}[T_n] = \theta.
  • It is asymptotically unbiased if \mathbb{E}[T_n] \rightarrow \theta as n \rightarrow \infty.

Unbiasedness is not the same as consistency

It’s important to note that unbiasedness and consistency do not imply each other. The examples below from Reference 1 show that.

Unbiasedness does not imply consistency: Let X_i \stackrel{i.i.d.}{\sim} \mathcal{N}(\mu, \sigma^2), i = 1, 2, \dots. Consider the estimator T_n = X_1 for the mean \mu. We always have \mathbb{E}[T_n] = \mathbb{E}[X_1] = \mu, so it is unbiased. However, T_n converges in distribution to \mathcal{N}(\mu, \sigma^2), and so is not consistent.

Consistency does not imply unbiasedness: Let X_i \stackrel{i.i.d.}{\sim} \mathcal{N}(\mu, \sigma^2), i = 1, 2, \dots. The maximum likelihood estimator (MLE) for \sigma^2 is T_n = \dfrac{1}{n}\sum_{i=1}^n (X_i - \overline{X})^2, where \overline{X} = (X_1 + \dots + X_n) / n. It is consistent (the MLE is always consistent), but it is not hard to show that \mathbb{E}[T_n] = \dfrac{n-1}{n}\sigma^2, i.e. it is biased.

Asymptotic unbiasedness and consistency

Asymptotic unbiasedness and consistency also do not imply each other.

Asymptotic unbiasedness does not imply consistency: This is a variation of the example for “unbiasedness does not imply consistency”. Instead of taking T_n = X_1, take T_n = X_1 + 1/n.

Consistency does not imply asymptotic unbiasedness: From Reference 2: consider a silly example where \theta = 0 and we want to estimate \theta using random variables T_n with

\begin{aligned} T_n = \begin{cases} 0 &\text{with probability } (n-1)/n, \\ n &\text{with probability 1/n}. \end{cases} \end{aligned}

T_n is consistent since it converges in probability to 0, but it is not asymptotically unbiased: \mathbb{E}[T_n] = 1 for every n.



  1. StackExchange. Answer to “What is the difference between a consistent estimator and an unbiased estimator?”
  2. StackExchange. Answer to “Consistency and asymptotically unbiasedness?”

Proof that sample mean is independent of sample variance (under normality)

Suppose we have i.i.d. observations X_1, \dots, X_n drawn from a population. We usually estimate the mean and variance of the population by the mean and variance of the sample we have:

\begin{aligned} \bar{X} &= \dfrac{X_1 + \dots + X_n}{n}, \\  S^2 &= \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2. \end{aligned}

Under the assumption that the population is normally distributed, the sample mean and sample variance are independent of each other.

The first proof of this fact is short but requires some basic knowledge of theoretical statistics. One can prove that the sample mean is a complete sufficient statistic and that the sample variance is an ancillary statistic. Then, by Basu’s Theorem, they must be independent of each other.

The second proof is longer and more explicit (and hence less magical than the first). Assume that X_1, \dots, X_n \stackrel{i.i.d.}{\sim} \mathcal{N}(\mu, \sigma^2). Then

\begin{aligned} \bar{X} &\sim \mathcal{N} \left( \mu, \frac{\sigma^2}{n}\right), \\  \bar{X} - X_j &= \frac{1}{n}(X_1 + \dots + X_{j-1} + X_{j+1} + \dots + X_n) - \frac{n-1}{n}X_j \\  &\sim \mathcal{N}\left( \frac{(n-1)\mu}{n} - \frac{(n-1)\mu}{n}, (n-1)\frac{\sigma^2}{n^2} + (n-1)^2 \frac{\sigma^2}{n^2} \right) \\  &= \mathcal{N} \left( 0, \frac{n-1}{n}\sigma^2 \right), \end{aligned}

for any j = 1, \dots, n. Furthermore, for any j = 1, \dots, n, we have

\begin{aligned} \text{Cov}(X_j - \bar{X}, \bar{X}) &= \text{Cov}(X_j, \bar{X}) -\text{Cov}(\bar{X}, \bar{X}) \\  &= \frac{\sigma^2}{n} - \frac{\sigma^2}{n} \\ &= 0. \end{aligned}

This implies that the vector \begin{pmatrix} \bar{X}, X_1 - \bar{X}, \dots, X_n - \bar{X}\end{pmatrix}^T has multivariate normal distribution with covariance matrix of the form

\begin{pmatrix} \frac{\sigma^2}{n} & \mathbf{0}^T \\ \mathbf{0} & \Sigma  \end{pmatrix}

for some \Sigma. Because of the multivariate normality, we can conclude that \bar{X} and {\mathbf{X}} = \begin{pmatrix} X_1 - \bar{X}, \dots, X_n - \bar{X}\end{pmatrix}^T are independent normal vectors, and so \bar{X} is also independent of {\mathbf{X}}^T {\mathbf{X}} = (n-1) S^2.


  1. StackExchange. Proof of the independence of the sample mean and sample variance.
  2. Statistics 351 (Fall 2009). Independence of \bar{X} and S^2 in a Normal Sample.

Consistency and root n-consistency

Loosely speaking, an estimator (of a parameter) is consistent if, as the number of data points collected grows, the estimator converges in probability to the true value of the parameter.

Now let’s be more precise. Let P be some underlying true probability distribution. We are interested in some parameter of the probability distribution \theta = \theta(P) (e.g. the mean, median or variance of P). To estimate \theta, we have on hand samples X_1, X_2, \dots, which are i.i.d. draws from the distribution P.

For each n = 1, 2, \dots, let T_n be the estimator based on the first n data points. (For example, T_n = \dfrac{X_1 + \dots + X_n}{n}, the sample mean, is a commonly used estimator for the mean.) The estimator T_n is said to be consistent if T_n converges in probability to \theta. (Note: Since \theta is actually unknown, this convergence in probability must hold true for all possible values of \theta.)


Once we know that an estimator is consistent (not all are!), it would also be nice to know how quickly it converges to the true value. This is where \sqrt{n}-consistency comes in. Many estimators are \sqrt{n}-consistent (“root n consistent”), meaning that

\begin{aligned} T_n - \theta = O_p(1/\sqrt{n}). \end{aligned}

(See this post for details on big O_p notation.) Another way of expressing this is to say that \sqrt{n}(T_n - \theta) is bounded in probability, i.e. O_p(1). Yet another definition is that the variance of T_n is O(1/n).

In general, an estimator is said to be n^\delta-consistent if its variance is O(1/n^{2\delta}).


  1. Wikipedia. Consistent estimator.
  2. Magee, L. Asymptotic concepts (Section 2.4).
  3. StackExchange. Answer to “root-n consistent estimator, but root-n doesn’t converge?”
  4. Chapter 7: Estimation (Slide 16).

Horvitz–Thompson estimator

Let’s say we have a finite population of N individuals, and we are interested in some trait that they have. Let X_i denote the value of the trait for individual i. We don’t get to see all these X_i‘s: we only sample n < N of them. With this sample of n individuals, we may be interested in obtaining an estimate of the total T = \sum_{i=1}^N X_i or the mean \tau = \frac{1}{N}\sum_{i=1}^N X_i.

Let’s add another wrinkle to our sampling scheme: we don’t know how we obtained it! (Maybe someone else gave it to us.) All we know is that the probability of individual i being included in the sample was \pi_i. Can we still come up with reasonable estimates for T and \tau?

It turns out that we can. In a 1952 paper, Daniel G. Horvitz and Donovan J. Thompson introduced what is now known as the Horvitz-Thompson estimator:

\hat{T}_{HT} = \displaystyle\sum_{i=1}^n \frac{X_i}{\pi_i}.

Note that the sum only goes over n terms, but it is an estimate for a sum over N terms. This estimator is performing inverse probability weighting: that is, we give each observation a weight which is the inverse of its probability of inclusion. The Horvitz-Thompson estimator is unbiased for T. The paper also worked out an expression for the estimator’s variance, but it’s substantially more complicated.

One potential application of this is if X_i = 1 for all individuals (i = 1, \dots, N). Here, we are just trying to estimate the size of the population N. If we knew the inclusion probabilities (a BIG if), then we could just use the Horvitz-Thompson estimator directly: \hat{T}_{HT} = \sum_{i=1}^n \frac{1}{\pi_i}. Usually we don’t know the \pi_i‘s, since they depend on the knowledge of N in some way! What we could do then is to get estimates \hat{\pi}_i for the inclusion probabilities, then use the plug-in principle to get the estimator \sum_{i=1}^n \frac{X_i}{\hat{\pi}_i}.

Credits: I learnt of this estimator through a talk Kristian Lum gave recently at the Stanford statistics seminar.