How to incorporate observation weights into an estimator

Suppose we have some distribution F and we want to estimate some quantity related to it (e.g. the mean of the distribution). We can write the quantity we want to estimate (the “estimand”) as a function of F: \theta = T(F) for some function T. (We use F to denote both the distribution and its cumulative distribution function (CDF).)

Here is a common estimation strategy: if we can draw samples from F, let’s draw x_1, \dots, x_n \stackrel{i.i.d.}{\sim} F. These samples determine an empirical CDF \hat{F}, which simply puts a weight of 1/n at each of the x_i’s. We can then estimate \theta with \hat\theta = T(\hat{F}). This is known as the plug-in estimator, since we are “plugging in” the empirical CDF for the true CDF.

What if we can’t draw samples from F, but can only draw samples from some other distribution G, i.e. x_1, \dots, x_n \stackrel{i.i.d.}{\sim} G? Estimating \theta = T(F) is not totally a lost cause if we can find observation weights w_1, \dots, w_n that sum up to 1 such that the implied empirical CDF is close to F or \hat{F}. By implied empirical CDF, I mean the distribution putting weight of w_i at x_i for i = 1, \dots, n. If we denote the implied empirical CDF by \hat{G}_w, then \hat{\theta}_w = T(\hat{G}_w) would be a reasonable estimator for \theta.
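To make the implied empirical CDF concrete, here is a minimal sketch in Python with NumPy. The function name weighted_ecdf and the toy data are my own choices for illustration; the code simply adds up the weights w_i of the observations with x_i \leq t.

```python
import numpy as np

def weighted_ecdf(x, w, t):
    """Evaluate the implied empirical CDF (weight w_i at x_i) at the point(s) t."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()  # normalise so that the weights sum to 1
    t = np.atleast_1d(np.asarray(t, dtype=float))
    # For each t, sum the weights of the observations with x_i <= t.
    indicator = x[None, :] <= t[:, None]
    return (w[None, :] * indicator).sum(axis=1)

# Toy data (hypothetical): four observations with unequal weights.
x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.1, 0.2, 0.3, 0.4])
cdf_vals = weighted_ecdf(x, w, [0.5, 2.0, 4.0])
```

Setting every weight to 1/n recovers the ordinary empirical CDF \hat{F}.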

The discussion above is pretty theoretical, so let’s see what it implies for a few examples.

Example 1: Weighted estimator for the mean

We can write the mean as \mu = \mathbb{E}_F[X] = \int x dF(x). The plug-in estimator is

\begin{aligned} \hat\mu = \int x d\hat{F}(x) = \sum_{i=1}^n x_i / n, \end{aligned}

which is simply the sample mean. With observation weights, the estimator becomes

\begin{aligned} \hat\mu_w = \int x d\hat{G}_w(x) = \sum_{i=1}^n w_i x_i, \end{aligned}

which we recognize as the weighted sample mean.
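As a quick sanity check, the weighted sample mean can be sketched in a few lines of Python with NumPy. The function name weighted_mean and the toy numbers are assumptions for illustration only.

```python
import numpy as np

def weighted_mean(x, w):
    """Plug-in estimate of the mean under weights w_i at points x_i."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()  # normalise so that the weights sum to 1
    return float(np.sum(w * x))

# Toy data (hypothetical).
x = np.array([1.0, 2.0, 4.0])
w = np.array([0.5, 0.25, 0.25])
mu_w = weighted_mean(x, w)            # 0.5*1 + 0.25*2 + 0.25*4 = 2.0
mu_unif = weighted_mean(x, np.ones(3))  # equal weights give the sample mean
```

NumPy's np.average(x, weights=w) computes the same quantity and normalises the weights internally.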

Example 2: Weighted estimator for the variance

We can write the variance as \sigma^2 = \mathbb{E}_F[X^2] - (\mathbb{E}_F[X])^2 =  \int x^2 dF(x) - \left( \int x dF(x) \right)^2. It follows that the plug-in estimator is

\begin{aligned} s^2 = \int x^2 d\hat{F}(x) - \left( \int x d\hat{F}(x) \right)^2 = \sum_{i=1}^n x_i^2/n - \left(\sum_{i=1}^n x_i / n \right)^2, \end{aligned}

and the weighted estimator is

\begin{aligned} s_w^2 = \int x^2 d\hat{G}_w(x) - \left( \int x d\hat{G}_w(x) \right)^2 = \sum_{i=1}^n w_i x_i^2 - \left(\sum_{i=1}^n w_i x_i \right)^2. \end{aligned}
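The weighted variance formula above translates directly into code. The sketch below is in Python with NumPy; the function name weighted_variance and the toy data are hypothetical. Note that with equal weights this reduces to the plug-in variance (dividing by n), not the unbiased version that divides by n - 1.

```python
import numpy as np

def weighted_variance(x, w):
    """Plug-in estimate of the variance: E_w[X^2] - (E_w[X])^2."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()  # normalise so that the weights sum to 1
    second_moment = np.sum(w * x ** 2)
    first_moment = np.sum(w * x)
    return float(second_moment - first_moment ** 2)

# Toy data (hypothetical).
x = np.array([1.0, 2.0, 4.0])
w = np.array([0.5, 0.25, 0.25])
var_w = weighted_variance(x, w)              # 5.5 - 2.0**2 = 1.5
var_unif = weighted_variance(x, np.ones(3))  # matches np.var(x), i.e. ddof=0
```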

Example 3: Weighted least squares

In this setting, the distribution F is the joint distribution of the covariates X \in \mathbb{R}^p and the response variable y \in \mathbb{R}. The (population) regression coefficient that we want to estimate is

\begin{aligned} \beta &= \text{argmin}_b \; \mathbb{E}_F[(y_i - X_i^T b)^2] \\  &= \mathbb{E}_F[X_i X_i^T]^{-1} \mathbb{E}_F[X_i y_i]. \end{aligned}

If we draw samples (X_1, y_1), \dots, (X_n, y_n) \stackrel{i.i.d.}{\sim} F with the X_i’s thought of as column vectors, and if we let \mathbf{X} \in \mathbb{R}^{n \times p} be the matrix whose rows are the X_i^T’s and \mathbf{y} \in \mathbb{R}^n be the column vector of the y_i’s, then the plug-in estimator is

\begin{aligned} \hat\beta &= \mathbb{E}_{\hat{F}}[X_i X_i^T]^{-1} \mathbb{E}_{\hat{F}}[X_i y_i] \\  &= \left( \sum_{i=1}^n X_i X_i^T / n \right)^{-1} \left( \sum_{i=1}^n X_i y_i / n \right)  \\  &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}, \end{aligned}

which you might recognize as the usual ordinary least squares (OLS) estimator. If we have observation weights, then we have the weighted least squares estimator

\begin{aligned} \hat\beta_w &= \mathbb{E}_{\hat{G}_w}[X_i X_i^T]^{-1} \mathbb{E}_{\hat{G}_w}[X_i y_i] \\  &= \left( \sum_{i=1}^n w_i X_i X_i^T \right)^{-1} \left( \sum_{i=1}^n w_i X_i y_i  \right)  \\  &= (\mathbf{X}^T \mathbf{WX})^{-1} \mathbf{X}^T\mathbf{Wy}, \end{aligned}

where \mathbf{W} is the diagonal matrix with diagonal entries w_1, \dots, w_n. (This is the same formula as the one I presented in this previous post on weighted least squares.)
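Here is a minimal sketch of the weighted least squares formula in Python with NumPy. The function name weighted_least_squares and the simulated data are assumptions for illustration. The code solves the linear system rather than forming the inverse explicitly, and it avoids building the n \times n matrix \mathbf{W} by scaling the columns of \mathbf{X}^T directly.

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Solve (X^T W X) b = X^T W y, where W = diag(w)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    XtW = X.T * w  # broadcasting scales column i of X^T by w_i, i.e. X^T W
    return np.linalg.solve(XtW @ X, XtW @ y)

# Simulated data (hypothetical): 50 observations, 3 covariates.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)
w = rng.uniform(0.5, 1.5, size=n)
w = w / w.sum()  # weights sum to 1

beta_w = weighted_least_squares(X, y, w)
beta_ols = weighted_least_squares(X, y, np.ones(n))  # equal weights give OLS
```

Multiplying all the weights by a positive constant leaves the estimate unchanged, since the constant cancels between the two factors.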

Example 4: Weighted Atkinson index

In this previous post, we introduced the Atkinson index as a measure of inequality for a given distribution. In that post, what we presented was actually the plug-in estimator for the Atkinson index. Assume that the inequality-aversion parameter \epsilon is not equal to 1. The Atkinson index for a distribution F is defined as

\begin{aligned} I(\epsilon) &= 1 - \left( \mathbb{E}_F [X^{1-\epsilon}] \right)^{1 / (1 - \epsilon)} \big/ \mathbb{E}_F [X]. \end{aligned}

If we replace F with \hat{F}, we get

\begin{aligned} \hat{I}(\epsilon) &= 1 - \left( \mathbb{E}_{\hat{F}} [X^{1-\epsilon}] \right)^{1 / (1 - \epsilon)} \big/ \mathbb{E}_{\hat{F}} [X] \\  &= 1 - \left( \frac{1}{n}\sum_{i=1}^n x_i^{1-\epsilon} \right)^{1 / (1-\epsilon)} \big/ \left( \frac{1}{n} \sum_{i=1}^n x_i \right), \end{aligned}

which is the formula I presented in the previous post. We can get a weighted Atkinson index by replacing F with \hat{G}_w:

\begin{aligned} \hat{I}_w(\epsilon) &= 1 - \left( \mathbb{E}_{\hat{G}_w} [X^{1-\epsilon}] \right)^{1 / (1 - \epsilon)} \big/ \mathbb{E}_{\hat{G}_w} [X] \\  &= 1 - \left( \sum_{i=1}^n w_i x_i^{1-\epsilon} \right)^{1 / (1-\epsilon)} \big/ \left( \sum_{i=1}^n w_i x_i \right). \end{aligned}

(As far as I can tell, this formula hasn’t appeared anywhere before, and none of the functions in R which compute weighted Atkinson index use this formula.)
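For readers who want to try the weighted Atkinson index, here is a minimal sketch in Python with NumPy. The function name weighted_atkinson and the toy data are hypothetical, and the code assumes \epsilon \neq 1 and strictly positive x_i, as in the formula above.

```python
import numpy as np

def weighted_atkinson(x, w, epsilon):
    """Plug-in Atkinson index under weights w_i at points x_i (epsilon != 1)."""
    if epsilon == 1:
        raise ValueError("this formula assumes epsilon != 1")
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()  # normalise so that the weights sum to 1
    # Weighted version of (E[X^(1 - epsilon)])^(1 / (1 - epsilon)).
    numerator = np.sum(w * x ** (1 - epsilon)) ** (1 / (1 - epsilon))
    return float(1 - numerator / np.sum(w * x))

# Toy data (hypothetical): two incomes with equal weights.
x = np.array([1.0, 4.0])
w = np.array([0.5, 0.5])
atk = weighted_atkinson(x, w, epsilon=0.5)  # 1 - 1.5**2 / 2.5 = 0.1
atk_equal = weighted_atkinson(np.array([3.0, 3.0]), w, epsilon=0.5)  # no inequality
```

With all weights equal to 1/n, this reduces to the unweighted plug-in estimator \hat{I}(\epsilon).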
