# How to incorporate observation weights into an estimator

Suppose we have some distribution $F$ and want to estimate some quantity related to it (e.g. the mean of the distribution). We can write the quantity we want to estimate (the “estimand”) as a function of $F$: $\theta = T (F)$ for some function $T$. (We use $F$ to denote both the distribution and its cumulative distribution function (CDF).)

Here is a common estimation strategy: if we can draw samples from $F$, let’s draw $x_1, \dots, x_n \stackrel{i.i.d.}{\sim} F$. These samples determine an empirical CDF $\hat{F}$, which simply puts a weight of $1/n$ at each of the $x_i$’s. We can then estimate $\theta$ with $\hat\theta = T(\hat{F})$. This is known as the plug-in estimator, since we are “plugging in” the empirical CDF for the true CDF.

What if we can’t draw samples from $F$, but can only draw samples from some other distribution $G$, i.e. $x_1, \dots, x_n \stackrel{i.i.d.}{\sim} G$? Estimating $\theta = T(F)$ is not a lost cause if we can find nonnegative observation weights $w_1, \dots, w_n$ that sum to 1 such that the implied empirical CDF is close to $F$ or $\hat{F}$. By implied empirical CDF, I mean the distribution putting weight $w_i$ at $x_i$ for $i = 1, \dots, n$. If we denote the implied empirical CDF by $\hat{G}_w$, then $\hat{\theta}_w = T(\hat{G}_w)$ would be a reasonable estimator for $\theta$.
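To make the implied empirical CDF concrete, here is a minimal sketch in Python with NumPy. The function name `weighted_ecdf` and the toy data are my own illustrative choices, and the sketch assumes the weights are nonnegative and sum to 1. It evaluates $\hat{G}_w(t) = \sum_{i=1}^n w_i 1\{x_i \leq t\}$; with $w_i = 1/n$ it reduces to the ordinary empirical CDF $\hat{F}$.

```python
import numpy as np

def weighted_ecdf(x, w):
    """Return a function t -> G_w(t) = sum_i w_i * 1{x_i <= t}.

    Hypothetical helper for illustration; assumes w is nonnegative and sums to 1.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    if x.shape != w.shape:
        raise ValueError("x and w must have the same shape")
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1")

    # Sort the sample once and accumulate the weights in that order.
    order = np.argsort(x)
    xs, cum_w = x[order], np.cumsum(w[order])

    def cdf(t):
        # k = number of sample points <= t; k == 0 means t lies below all of them.
        k = np.searchsorted(xs, t, side="right")
        return np.where(k > 0, cum_w[np.maximum(k - 1, 0)], 0.0)

    return cdf

x = np.array([3.0, 1.0, 2.0])
uniform = weighted_ecdf(x, np.full(3, 1 / 3))          # ordinary empirical CDF
tilted = weighted_ecdf(x, np.array([0.5, 0.2, 0.3]))   # weight 0.5 on x = 3, etc.
```

Here `tilted` puts mass 0.2 at 1, 0.3 at 2 and 0.5 at 3, so it jumps by those amounts at those points.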

The discussion above is pretty theoretical, so let’s see what it implies in a few examples.

Example 1: Weighted estimator for the mean

We can write the mean as $\mu = \mathbb{E}_F[X] = \int x dF(x)$. The plug-in estimator is \begin{aligned} \hat\mu = \int x d\hat{F}(x) = \sum_{i=1}^n x_i / n, \end{aligned}

which is simply the sample mean. With observation weights, the estimator becomes \begin{aligned} \hat\mu_w = \int x d\hat{G}_w(x) = \sum_{i=1}^n w_i x_i, \end{aligned}

which we recognize as the weighted sample mean.
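As a quick numerical check, here is a small sketch (the function name `weighted_mean` and the numbers are made up for illustration, and the weights are assumed to sum to 1). With uniform weights $w_i = 1/n$ it should agree with the ordinary sample mean.

```python
import numpy as np

def weighted_mean(x, w):
    """Plug-in estimate of the mean under G_w: sum_i w_i * x_i.

    Illustrative helper; assumes the weights sum to 1.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * x))

x = np.array([1.0, 2.0, 4.0])
w = np.array([0.5, 0.25, 0.25])

mu_w = weighted_mean(x, w)                       # 0.5*1 + 0.25*2 + 0.25*4
mu_uniform = weighted_mean(x, np.full(3, 1 / 3)) # same as the sample mean
```

NumPy's `np.average(x, weights=w)` computes the same quantity (it also normalizes the weights, which makes no difference when they already sum to 1).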

Example 2: Weighted estimator for the variance

We can write the variance as $\sigma^2 = \mathbb{E}_F[X^2] - (\mathbb{E}_F[X])^2 = \int x^2 dF(x) - \left( \int x dF(x) \right)^2$. It follows that the plug-in estimator is \begin{aligned} s^2 = \int x^2 d\hat{F}(x) - \left( \int x d\hat{F}(x) \right)^2 = \sum_{i=1}^n x_i^2/n - \left(\sum_{i=1}^n x_i / n \right)^2, \end{aligned}

and the weighted estimator is \begin{aligned} s_w^2 = \int x^2 d\hat{G}_w(x) - \left( \int x d\hat{G}_w(x) \right)^2 = \sum_{i=1}^n w_i x_i^2 - \left(\sum_{i=1}^n w_i x_i \right)^2. \end{aligned}
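The weighted variance formula can be sketched in a few lines. The name `weighted_variance` and the data are illustrative assumptions, and the weights are again assumed to sum to 1. When they do, the formula is algebraically equal to the centered form $\sum_i w_i (x_i - \hat\mu_w)^2$, which the code uses as a cross-check.

```python
import numpy as np

def weighted_variance(x, w):
    """Plug-in variance under G_w: sum_i w_i x_i^2 - (sum_i w_i x_i)^2.

    Illustrative helper; assumes the weights sum to 1.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    mean_w = np.sum(w * x)
    return float(np.sum(w * x**2) - mean_w**2)

x = np.array([1.0, 2.0, 4.0])
w = np.array([0.5, 0.25, 0.25])

var_w = weighted_variance(x, w)
# Centered form, equal to var_w when the weights sum to 1.
var_centered = float(np.sum(w * (x - np.sum(w * x)) ** 2))
# Uniform weights recover the plug-in variance (divisor n, not n - 1).
var_uniform = weighted_variance(x, np.full(3, 1 / 3))
```

Note that with uniform weights this matches `np.var(x)` with its default `ddof=0`, not the unbiased sample variance.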

Example 3: Weighted least squares

In this setting, the distribution $F$ is the joint distribution of the covariates $X \in \mathbb{R}^p$ and the response variable $y \in \mathbb{R}$. The (population) regression coefficient that we want to estimate is \begin{aligned} \beta &= \text{argmin}_b \; \mathbb{E}_F[(y_i - X_i^T b)^2] \\ &= \mathbb{E}_F[X_i X_i^T]^{-1} \mathbb{E}_F[X_i y_i]. \end{aligned}

If we draw samples $(X_1, y_1), \dots, (X_n, y_n) \stackrel{i.i.d.}{\sim} F$ with the $X_i$’s thought of as column vectors, and if we let $\mathbf{X} \in \mathbb{R}^{n \times p}$ be the matrix with rows being the $X_i^T$’s and $\mathbf{y} \in \mathbb{R}^n$ being the column vector of the $y_i$’s, then the plug-in estimator is \begin{aligned} \hat\beta &= \mathbb{E}_{\hat{F}}[X_i X_i^T]^{-1} \mathbb{E}_{\hat{F}}[X_i y_i] \\ &= \left( \sum_{i=1}^n X_i X_i^T / n \right)^{-1} \left( \sum_{i=1}^n X_i y_i / n \right) \\ &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}, \end{aligned}

which you might recognize as the usual ordinary least squares (OLS) estimator. If we have observation weights, then we have the weighted least squares estimator \begin{aligned} \hat\beta_w &= \mathbb{E}_{\hat{G}_w}[X_i X_i^T]^{-1} \mathbb{E}_{\hat{G}_w}[X_i y_i] \\ &= \left( \sum_{i=1}^n w_i X_i X_i^T \right)^{-1} \left( \sum_{i=1}^n w_i X_i y_i \right) \\ &= (\mathbf{X}^T \mathbf{WX})^{-1} \mathbf{X}^T\mathbf{Wy}, \end{aligned}

where $\mathbf{W}$ is the diagonal matrix with diagonal entries $w_1, \dots, w_n$. (This is the same formula as the one I presented in this previous post on weighted least squares.)
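Here is a minimal NumPy sketch of the weighted least squares formula. The function name `weighted_least_squares` and the simulated data are assumptions made for illustration only. Rather than forming the $n \times n$ matrix $\mathbf{W}$, it scales the rows of $\mathbf{X}$ by the weights and solves the normal equations, which gives the same estimate.

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Solve (X^T W X) b = X^T W y with W = diag(w).

    Illustrative helper; assumes X^T W X is invertible.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    Xw = X * w[:, None]                  # row i scaled by w_i, i.e. W @ X
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

# Simulated data, purely for demonstration.
rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n)
w = rng.random(n)
w = w / w.sum()                          # normalize so the weights sum to 1

beta_w = weighted_least_squares(X, y, w)
beta_ols = weighted_least_squares(X, y, np.full(n, 1 / n))  # uniform weights
```

Rescaling all the weights by a common positive constant leaves the estimate unchanged, since the constant cancels between the two factors.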

Example 4: Weighted Atkinson index

In this previous post, we introduced the Atkinson index as a measure of inequality for a given distribution. In that post, what we presented was actually the plug-in estimator for the Atkinson index. Assume that the inequality-aversion parameter $\epsilon$ is not equal to 1. The Atkinson index for a distribution $F$ is defined as \begin{aligned} I(\epsilon) &= 1 - \left( \mathbb{E}_F [X^{1-\epsilon}] \right)^{1 / (1 - \epsilon)} \big/ \mathbb{E}_F [X]. \end{aligned}

If we replace $F$ with $\hat{F}$, we get \begin{aligned} \hat{I}(\epsilon) &= 1 - \left( \mathbb{E}_{\hat{F}} [X^{1-\epsilon}] \right)^{1 / (1 - \epsilon)} \big/ \mathbb{E}_{\hat{F}} [X] \\ &= 1 - \left( \frac{1}{n}\sum_{i=1}^n x_i^{1-\epsilon} \right)^{1 / (1-\epsilon)} \big/ \left( \frac{1}{n} \sum_{i=1}^n x_i \right), \end{aligned}

which is the formula I presented in the previous post. We can get a weighted Atkinson index by replacing $F$ with $\hat{G}_w$: \begin{aligned} \hat{I}_w(\epsilon) &= 1 - \left( \mathbb{E}_{\hat{G}_w} [X^{1-\epsilon}] \right)^{1 / (1 - \epsilon)} \big/ \mathbb{E}_{\hat{G}_w} [X] \\ &= 1 - \left( \sum_{i=1}^n w_i x_i^{1-\epsilon} \right)^{1 / (1-\epsilon)} \big/ \left( \sum_{i=1}^n w_i x_i \right). \end{aligned}

(As far as I can tell, this formula hasn’t appeared anywhere before, and none of the R functions that compute a weighted Atkinson index use this formula.)
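The weighted Atkinson formula above is straightforward to sketch in code. The function name `weighted_atkinson` and the toy numbers are my own, and the sketch assumes strictly positive $x_i$, weights that sum to 1, and $\epsilon \neq 1$ (the case treated in this post).

```python
import numpy as np

def weighted_atkinson(x, w, epsilon):
    """Weighted plug-in Atkinson index for epsilon != 1.

    Illustrative helper; assumes x > 0 and weights that sum to 1.
    """
    if epsilon == 1:
        raise ValueError("this sketch only covers epsilon != 1")
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    if np.any(x <= 0):
        raise ValueError("x must be strictly positive")
    # (sum_i w_i x_i^(1 - eps))^(1 / (1 - eps)), the numerator of the ratio.
    numerator = np.sum(w * x ** (1 - epsilon)) ** (1 / (1 - epsilon))
    return float(1 - numerator / np.sum(w * x))

x = np.array([1.0, 4.0])
w = np.array([0.5, 0.5])

# (0.5*1 + 0.5*2)^2 = 2.25, weighted mean = 2.5, so the index is 1 - 0.9.
index = weighted_atkinson(x, w, epsilon=0.5)
# If every observation is identical, there is no inequality and the index is 0.
index_equal = weighted_atkinson(np.ones(3), np.array([0.2, 0.3, 0.5]), epsilon=2.0)
```

With uniform weights $w_i = 1/n$ this reduces to the unweighted plug-in estimator $\hat{I}(\epsilon)$.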