Basic manipulation of GIF frames with magick

The magick package is a really powerful tool for image processing in R. The official vignette is a great place to start learning how to use the package.

I’ve been playing around with using magick for manipulating GIFs and found some tips and tricks that don’t seem to be documented anywhere. Since the NBA restart is upon us, I’ll illustrate these tips with a GIF featuring one of the greatest of all time:

library(magick)

img_path <- paste0("https://media3.giphy.com/media/3o72F9VC4WjFDCPTag/",
                   "giphy.gif?cid=ecf05e470ceb6cd8542e8c243d0e0a2282c3390e5c",
                   "72fd17&rid=giphy.gif")
img <- magick::image_read(img_path)
img

If you print img, you will see that the GIF is displayed as a tibble, with each row representing a single frame of the GIF:

print(img)
# # A tibble: 44 x 7
#    format width height colorspace matte filesize density
#    <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
#  1 GIF      480    266 sRGB       TRUE   2617620 72x72  
#  2 GIF      480    266 sRGB       TRUE   2617620 72x72  
#  3 GIF      480    266 sRGB       TRUE   2617620 72x72  
#  4 GIF      480    266 sRGB       TRUE   2617620 72x72  
#  5 GIF      480    266 sRGB       TRUE   2617620 72x72  
#  6 GIF      480    266 sRGB       TRUE   2617620 72x72  
#  7 GIF      480    266 sRGB       TRUE   2617620 72x72  
#  8 GIF      480    266 sRGB       TRUE   2617620 72x72  
#  9 GIF      480    266 sRGB       TRUE   2617620 72x72  
# 10 GIF      480    266 sRGB       TRUE   2617620 72x72  
# # … with 34 more rows

Use the length() function to determine how many frames there are in the GIF:

length(img)
# [1] 44

You can index GIFs as you would with a vector. For example, img[12] returns the 12th frame of the GIF:

You can, of course, put more than one frame index in the square brackets. For example, img[44:1] will run the GIF in reverse (this can also be achieved with rev(img)):

This is a little more elaborate:

img[c(1:44, rep(c(rep(44, times = 5), 43:30, 
                  rep(30, times = 5), 31:43), 
                times = 2))]

Slowing down the dunk action:

img[c(1:30, rep(31:44, each = 3))]
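
Repeating frames is one way to slow things down. A hedged alternative (note that this re-times the whole GIF at a constant rate, not just the dunk) is to change the playback rate with image_animate(); the fps value here is just illustrative:

# re-render the animation at a slower, constant frame rate
image_animate(img, fps = 5)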

We can replace a single frame with a different image too:

logo <- image_read("https://jeroen.github.io/images/Rlogo.png") %>%
  image_resize("480x266!")
img[25] <- logo
img

I don’t know if there is a clean way to insert the R logo as the 25th frame and push the original 25th-44th frames down by one (that would be a neat functionality to have). The code below is a workaround, but it feels inelegant:

img <- img[c(1:25, 25:44)]
img[25] <- logo
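
A hedged sketch of another possibility is to rebuild the stack explicitly with image_join(), starting from a freshly read copy of the GIF (I haven’t checked whether this is any cleaner in practice):

img <- image_read(img_path)                     # re-read the original GIF
img <- image_join(img[1:24], logo, img[25:44])  # logo becomes the 25th frame
img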

Image contours in R

I recently came across this short fun post on R-bloggers that demonstrated how to use the image.ContourDetector package (available on CRAN) to extract contours from an image. The image of the contours looked really cool so I thought I would try it out for myself!

As an example, I wanted to extract the contours from the image below:

(Source: premierleague.com)

Here is the code taken from the R-bloggers post (slightly modified):

library(magick)
library(image.ContourDetector)

img_path <- paste0("https://resources.premierleague.com/premierleague/",
                   "photo/2018/11/27/d7e922c7-8e8e-4eff-b1c1-1a6678295257/",
                   "EPL-Instagram-Article.png")
img <- magick::image_read(img_path)
mat <- magick::image_data(img, channels = "gray")
mat <- drop(as.integer(mat, transpose = TRUE))
contourlines <- image_contour_detector(mat)
plot(contourlines)

And here is the result. Not bad for a little chunk of code!

Not bad, but can we do better? In particular, we can’t really see any of the facial features of the players.

It turns out that the image_contour_detector() function has an argument Q, which (according to the documentation) is a “numeric value with the pixel quantization step”. This is discussed in Section 5 of the paper introducing the method. My high-level understanding of the issue is that in digital image processing, pixel values are usually quantized to integers, while the algorithm was designed to work on pixels with real values. As a result, the algorithm can produce some artifacts. The paper gives the figure below as an example:

… I actually like the artifacts! The default value of the parameter is Q = 2. To get more contours, we could set Q = 0, i.e. change the second-last line above to

contourlines <- image_contour_detector(mat, Q = 0)
plot(contourlines)

I think it’s a bit better! Here are the contours for 3 different Q values, placed side by side:
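
For reference, here is a hedged sketch of how such a side-by-side comparison can be generated (the Q values are illustrative, and I’m assuming the contour object’s plot method plays nicely with par(mfrow)):

par(mfrow = c(1, 3))
for (q in c(0, 1, 2)) {
    cl <- image_contour_detector(mat, Q = q)
    plot(cl)
    title(main = paste("Q =", q))
}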

Density of the log-normal distribution

A random variable X has a log-normal distribution if \log X has a normal distribution. Since the normal distribution can be parameterized by the mean and standard deviation (usually denoted \mu and \sigma), the log-normal distribution can be parameterized similarly.

Given \mu \in \mathbb{R} and \sigma > 0, the density of the log-normal distribution is given by

\begin{aligned} f(x) = \dfrac{1}{x \sigma \sqrt{2\pi}} \exp \left[ - \dfrac{(\log x - \mu)^2}{2 \sigma^2} \right], \qquad \text{for } x \in (0, +\infty). \end{aligned}

The figure below shows the probability density function (PDF) of the log-normal distribution for \mu = 0 and varying \sigma. This is very much like the figure you see on the Wikipedia page.

Unfortunately this appears to be the most common picture we see for the log-normal, when in fact the picture looks quite different under different parameter settings. There are hardly any figures depicting what happens when we vary \mu.

Well, this post aims to fill that gap. All the figures in this post have the x-axis going from 0 to 3 for comparability (the y-axis will vary). Code for generating these figures is available here.
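
For reference, here is a minimal sketch of how such a figure can be produced with dlnorm() (the \sigma values and the y-axis limit below are illustrative choices, not necessarily the ones used for the actual figures):

x <- seq(0.001, 3, length.out = 500)
sigmas <- c(0.25, 0.5, 1, 1.5)
plot(NULL, xlim = c(0, 3), ylim = c(0, 2), xlab = "x", ylab = "density",
     main = "Log-normal PDF: mu = 0, varying sigma")
for (i in seq_along(sigmas)) {
    lines(x, dlnorm(x, meanlog = 0, sdlog = sigmas[i]), col = i, lwd = 2)
}
legend("topright", legend = paste("sigma =", sigmas),
       col = seq_along(sigmas), lwd = 2)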

Below are two figures where we vary \sigma as well. The first has \mu fixed at 1 while the second has \mu fixed at -0.75.

The next 3 figures have varying \mu and \sigma fixed at 0, 0.25 and 1.25 respectively.

Missing data and multivariate imputation by chained equations (MICE)

The dataset that an analyst gets is rarely complete. Often several of the data fields will be missing, and the data analyst needs to make a decision on what to do with these missing values. One strategy is to use techniques or models that can handle missing values, e.g. decision trees and random forests. Another strategy is to make an educated guess on what the missing values probably are. This process of educated guessing is called imputation. There are many imputation methods: see References 1 and 2 for some commonly used ones.

Single imputation vs. multiple imputation

Single imputation means that the analyst imputes the missing data once to get a complete dataset. From then on, the analyst proceeds as if a complete dataset was given to them. Multiple imputation means that the analyst makes several guesses for the missing data, resulting in, say, m possible complete datasets. For each of these complete datasets, the analyst proceeds with the rest of the data analysis pipeline. At the end, the m sets of results are pooled together. The diagram below, taken from Reference 3, is a schematic for how multiple imputation works.

A big advantage of multiple imputation is that it gives the analyst a sense of how variability in the missing data translates to variability in the final results. It also allows the analyst to see if the results are robust to the missing data. A big disadvantage is computation: instead of doing just one data analysis, we now have to do m of them.

Donald Rubin’s Multiple imputation for nonresponse in surveys (Reference 4) is one of the canonical references for multiple imputation. I haven’t read it myself but it is widely referenced (right now Google Scholar says that it’s been cited 20,433 times).

Multivariate imputation by chained equations (MICE)

One of the most popular multiple imputation methods is multivariate imputation by chained equations (MICE), introduced by van Buuren et al (1999) (Reference 5). It is implemented in the mice package in R. Reference 3 is the paper for the mice package: it contains a lot of detail on both how multiple imputation and MICE work in general, things to consider when using MICE and how to use the package. It is certainly worth a read, even though some of the code may be a bit old. (The package was v2.0 in the paper; right now the latest version of the package on CRAN is v3.10.0.) Reference 6 is the paper for the MICE implementation in Stata: I think its explanation of MICE in Section 2 is the clearest for the layman.

Let’s say we have p variables X_1, \dots, X_p \in \mathbb{R}^n, where n is the number of observations in our dataset. For each variable X_j, let X_j^{obs} and X_j^{mis} denote the observed and missing parts respectively. We want to make educated guesses for X_j^{mis}.

To use MICE, we must first specify the conditional distribution of each variable conditional on the rest of the variables, i.e.

\begin{aligned} &P(X_1 \mid X_2, \dots, X_p, \theta_1), \\  &P(X_2 \mid X_1, X_3, \dots, X_p, \theta_2), \\  &\vdots \\  &P(X_p \mid X_1, \dots, X_{p-1}, \theta_p). \end{aligned}

Starting from a simple draw from the observed marginal distributions, the tth iteration of chained equations is a Gibbs sampler that draws

\begin{aligned} \theta_1^{*(t)} &\sim P(\theta_1 \mid X_1^{obs}, X_2^{(t-1)}, \dots, X_p^{(t-1)}), \\  X_1^{*(t)} &\sim P(X_1 \mid X_1^{obs}, X_2^{(t-1)}, \dots, X_p^{(t-1)}, \theta_1^{*(t)}), \\  \vdots \\  \theta_p^{*(t)} &\sim P(\theta_p \mid X_p^{obs}, X_1^{(t)}, \dots, X_{p-1}^{(t)}), \\  X_p^{*(t)} &\sim P(X_p \mid X_p^{obs}, X_1^{(t)}, \dots, X_{p-1}^{(t)}, \theta_p^{*(t)} ), \end{aligned}

where X_j^{(t)} = (X_j^{obs}, X_j^{*(t)}) is the jth imputed variable at iteration t. Essentially, the missing values for X_j are replaced by simulated draws from the posterior predictive distribution of X_j. This sampling is usually done for 10-20 iterations, and the final imputed dataset is taken as a complete dataset to be used for analysis. If M imputations are required, then this process above is repeated M times (meaning we go through 10M to 20M iterations of the above in total).

The description of the conditional distributions above seems very abstract, and it is! In practice, we would define them via univariate modeling methods. For example, say we have just 3 variables: a continuous X_1, a binary valued X_2, and a categorical variable X_3 with more than 2 levels. We could define the conditional distributions in the following way:

  • X_1 modeled by Bayesian linear regression with predictors X_2, X_3
  • X_2 modeled by logistic regression with predictors X_1, X_3
  • X_3 modeled by linear discriminant analysis (LDA) with predictors X_1, X_2
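
To make this concrete, here is a minimal sketch of what this setup might look like with the mice package (df is a hypothetical data frame with columns X1, X2 and X3, coded as a numeric variable, a binary factor and a multi-level factor respectively; the lm() formula at the end is just for illustration):

library(mice)

# "norm" = Bayesian linear regression, "logreg" = logistic regression,
# "lda" = linear discriminant analysis (one method per column of df)
imp <- mice(df, m = 5, maxit = 10,
            method = c("norm", "logreg", "lda"), seed = 1)

# fit a model on each of the m completed datasets, then pool the results
fit <- with(imp, lm(X1 ~ X2 + X3))
summary(pool(fit))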

As of the time of writing, the R mice package v3.10.0 has a panoply of univariate imputation methods one can use. These are shown in the figure below, with the default methods in green (pmm is default for numeric variables):

References:

  1. Towards data science. (2019). 6 Different Ways to Compensate for Missing Values In a Dataset (Data Imputation with examples).
  2. The Analysis Factor. Seven Ways to Make up Data: Common Methods to Imputing Missing Data.
  3. van Buuren, S., and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R.
  4. Rubin, D. B. (2004). Multiple imputation for nonresponse in surveys.
  5. van Buuren, S., Boshuizen, H. C., and Knook, D. L. (1999). Multiple imputation of missing blood pressure covariates in survival analysis.
  6. Royston, P., and White, I. R. (2011). Multiple imputation by chained equations (MICE): implementation in Stata.

tidyr::complete to show all possible combinations of variables

This is an issue I often face, so I thought it best to write it down. When doing data analysis, we often want to know how many observations there are in each subgroup. These subgroups can be defined by multiple variables. In the code example below, I want to know how many vehicles there are for each (cyl, gear) combination:

library(tidyverse)
data(mtcars)
mtcars %>%
    group_by(cyl, gear) %>%
    summarize(count = n())

# # A tibble: 8 x 3
# # Groups:   cyl [3]
#     cyl  gear count
#   <dbl> <dbl> <int>
# 1     4     3     1
# 2     4     4     8
# 3     4     5     2
# 4     6     3     2
# 5     6     4     4
# 6     6     5     1
# 7     8     3    12
# 8     8     5     2

If you look carefully, you will notice that there are no vehicles with cyl == 8 and gear == 4. In general it’s probably better to include this combination as a row in the tibble, with count as 0. This is especially important in data pipelines where future processes might expect there to be length(unique(cyl)) * length(unique(gear)) rows in the dataset.

We can achieve this by ungrouping the dataset and applying tidyr::complete(). This ensures that every possible (cyl, gear) combination gets a row.

mtcars %>%
    group_by(cyl, gear) %>%
    summarize(count = n()) %>%
    ungroup() %>%
    complete(cyl, gear)

# # A tibble: 9 x 3
#     cyl  gear count
#   <dbl> <dbl> <int>
# 1     4     3     1
# 2     4     4     8
# 3     4     5     2
# 4     6     3     2
# 5     6     4     4
# 6     6     5     1
# 7     8     3    12
# 8     8     4    NA
# 9     8     5     2

For rows that didn’t appear in the original summary table, complete() fills in the remaining columns with NA. We can specify the value complete() should use for these cells with the fill option:

mtcars %>%
    group_by(cyl, gear) %>%
    summarize(count = n()) %>%
    ungroup() %>%
    complete(cyl, gear, fill = list(count = 0))

# # A tibble: 9 x 3
#     cyl  gear count
#   <dbl> <dbl> <int>
# 1     4     3     1
# 2     4     4     8
# 3     4     5     2
# 4     6     3     2
# 5     6     4     4
# 6     6     5     1
# 7     8     3    12
# 8     8     4     0
# 9     8     5     2

References:

  1. Reddit. Need help with dplyr: Show all possible group combinations.

Beamer: animated \itemize and \pause give two of the same slide

I’ve been having this particular issue with Beamer’s animation for some time but finally got down to looking for a fix today. Suppose I have a slide that consists of a bulleted list and some text after. This is the animation I want:

To achieve this, my first instinct is to add [<+->] after \begin{itemize} so that the bullets are animated, and to add \pause between \end{itemize} and the remaining text, like this:

\begin{frame}[t]
    \begin{itemize}[<+->]
        \item This should only come up on first slide.
        \item This should only come up on second slide.
    \end{itemize}

    \pause
    This should only come up on third slide.
\end{frame}

Unfortunately, this results in a duplicate slide:

I can’t just remove \pause because that will result in the line “This should only come up on third slide.” appearing on every slide.

It turns out that there is a very simple solution: add [\thebeamerpauses] after the \pause command:

\begin{frame}[t]
    \begin{itemize}[<+->]
        \item This should only come up on first slide.
        \item This should only come up on second slide.
    \end{itemize}

    \pause[\thebeamerpauses]
    This should only come up on third slide.
\end{frame}

For what it’s worth, subsequent \pause commands will work as intended:

\begin{frame}[t]
    \begin{itemize}[<+->]
        \item This should only come up on first slide.
        \item This should only come up on second slide.
    \end{itemize}

    \pause[\thebeamerpauses]
    This should only come up on third slide.
    \pause
    This should only come up on 4th slide.
\end{frame}

References:

  1. StackExchange. Showing a text after an itemize[<+->] with \pause creates double slide.

Welch’s t-test and the Welch-Satterthwaite equation

Welch’s t-test is probably the most commonly used hypothesis test for testing whether two populations have the same mean. Welch’s t-test is generally preferred over Student’s two-sample t-test: while both assume that the populations of the two groups are normal, Student’s t-test assumes that the two populations have the same variance while Welch’s t-test does not make any assumption on the variances.

Assume we have n_1 samples from group 1 and n_2 samples from group 2. For j = 1, 2, let \overline{X}_j and s_j^2 denote the sample mean and sample variance of group j respectively. Welch’s t-statistic is defined by

\begin{aligned} t = \dfrac{\overline{X}_1 - \overline{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}. \end{aligned}

Under the null hypothesis, t is approximately distributed as the t-distribution with degrees of freedom

\begin{aligned} \nu = \left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2 \bigg/ \left( \frac{s_1^4}{n_1^2 (n_1 - 1)} + \frac{s_2^4}{n_2^2 (n_2 - 1)} \right). \end{aligned}

The equation above is known as the Welch-Satterthwaite equation.
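
In R, t.test() performs Welch’s t-test by default (var.equal = FALSE). Here is a minimal sketch with simulated data, computing the t-statistic and the Welch-Satterthwaite degrees of freedom by hand as a check:

set.seed(1)
x <- rnorm(20, mean = 0, sd = 1)
y <- rnorm(30, mean = 0.5, sd = 2)

t.test(x, y)   # Welch two sample t-test

# the same t-statistic and Welch-Satterthwaite degrees of freedom by hand
n1 <- length(x); n2 <- length(y)
s1 <- var(x);   s2 <- var(y)
tstat <- (mean(x) - mean(y)) / sqrt(s1 / n1 + s2 / n2)
nu <- (s1 / n1 + s2 / n2)^2 /
    (s1^2 / (n1^2 * (n1 - 1)) + s2^2 / (n2^2 * (n2 - 1)))
c(t = tstat, df = nu)   # should match the t.test() output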

Welch-Satterthwaite equation

Where does the Welch-Satterthwaite equation come from? The commonly cited reference for this is F. E. Satterthwaite’s 1946 paper (Reference 1). Satterthwaite tackles a more general problem: let s_1^2, \dots, s_J^2 be J independent sample variances, with s_j^2 having r_j degrees of freedom. Consider the combined estimate

\hat{V}_s = a_1 s_1^2 + \dots + a_J s_J^2,

where a_1, \dots, a_J are constants. (Back then, this was called a complex estimate of variance.) The exact distribution of \hat{V}_s is too difficult to compute (no simple closed form? I’m not sure), so we approximate it with a chi-squared distribution that has the same variance as \hat{V}_s. The question is, what is the degrees of freedom for this approximating chi-squared distribution?

The paper states that the number of degrees of freedom is

\begin{aligned} r_s = \left( \sum_{j=1}^J a_j \mathbb{E}[s_j^2] \right)^2 \bigg/ \left( \sum_{j=1}^J \dfrac{(a_j \mathbb{E}[s_j^2])^2}{r_j} \right). \end{aligned}

In practice we don’t know the value of the expectations, so we replace them with the observed values:

\begin{aligned} \hat{r}_s = \left( \sum_{j=1}^J a_j s_j^2 \right)^2 \bigg/ \left( \sum_{j=1}^J \dfrac{(a_j s_j^2)^2}{r_j} \right). \end{aligned}

The degrees of freedom for Welch’s t-test is this formula above with J = 2, r_j = n_j - 1 and a_j = 1 / n_j.

Heuristic argument

Satterthwaite’s 1946 paper is the commonly cited reference, but that paper actually contains just the formulas and not the heuristic argument. For that, we have to go to his 1941 paper (Reference 2).

Consider first the simple case where we have just two independent sample variances s_1^2 and s_2^2, each with degrees of freedom r_1 and r_2. Let’s compute the degrees of freedom for the chi-squared distribution that approximates V = s_1^2 + s_2^2.

For j = 1, 2, let \sigma_j^2 = \mathbb{E}[s_j^2]. From our set-up, we have \dfrac{r_j s_j^2}{\sigma_j^2} \sim \chi_{r_j}^2, thus

\begin{aligned} \text{Var} (s_j^2) &= \left( \frac{\sigma_j^2}{r_j} \right)^2  \text{Var} [\chi_{r_j}^2] = \left( \frac{\sigma_j^2}{r_j} \right)^2  \cdot (2 r_j) \\  &= \frac{2\sigma_j^4}{r_j} \quad \text{for } j = 1, 2, \\  \text{Var} (V) &= 2\left( \frac{\sigma_1^4}{r_1} + \frac{\sigma_2^4}{r_2} \right), \end{aligned}

where the last equality holds because s_1^2 and s_2^2 are independent. On the other hand, approximating V by a chi-squared distribution with r degrees of freedom means that \dfrac{r V}{\mathbb{E}[V]} \sim \chi_r^2. Under this approximation,

\text{Var}(V) = \left(\dfrac{\mathbb{E}[V]}{r} \right)^2 \cdot (2 r) = \dfrac{2(\sigma_1^2 + \sigma_2^2)^2}{r}.

For this approximation to be good, we want the variance obtained under the chi-squared approximation to be the same as the true variance. Hence, we have

\begin{aligned} \dfrac{2(\sigma_1^2 + \sigma_2^2)^2}{r} &= 2\left( \frac{\sigma_1^4}{r_1} + \frac{\sigma_2^4}{r_2} \right), \\  r &= (\sigma_1^2 + \sigma_2^2)^2 \bigg/ \left( \frac{\sigma_1^4}{r_1} + \frac{\sigma_2^4}{r_2} \right). \end{aligned}

The argument above is perfectly general: we can rerun it to get the effective degrees of freedom for V = a_1 s_1^2 + \dots + a_J s_J^2.

References:

  1. Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components.
  2. Satterthwaite, F. E. (1941). Synthesis of variance.

The delete-m jackknife for unequal m

The ordinary jackknife

Imagine that we are interested in estimating some parameter \theta, using observations x_1, \dots, x_n. Let \hat\theta_n denote our estimator for \theta based on these n observations. The jackknife was introduced as a way to debias the estimator \hat\theta_n. Define \hat\theta_{(-i)} as the estimate of \theta based on all the observations except x_i, and define pseudo-values

\tilde{\theta}_i = n \hat{\theta}_n - (n-1) \hat{\theta}_{(-i)}

for each i. The jackknife estimator for \theta (sometimes known as the delete-1 jackknife) is simply the mean of the pseudo-values:

\begin{aligned} \hat{\theta}_{jack} = \frac{1}{n}\sum_{i=1}^n \tilde{\theta}_i. \end{aligned}

The jackknife can also be used to estimate the variance of the estimator:

v_{jack} = \dfrac{\hat{s}^2}{n} = \dfrac{1}{n} \cdot \dfrac{1}{n-1} \displaystyle\sum_{i=1}^n (\tilde{\theta}_i - \hat{\theta}_{jack})^2.

This previous post explains the ordinary jackknife in more detail, giving a heuristic argument for why it works.

The delete-m jackknife

Assume that your n data points actually came in groups. Instead of deleting one observation at a time, computing the estimator and averaging n values (which can be computationally expensive when n is large), you wonder if you can run the jackknife but delete one group of observations at a time. The delete-m jackknife does just that when your groups are of equal size.

Assume that the n data points can be split into J groups of equal size. Let m = n/J. For j = 1, \dots, J, define \hat\theta_{(-j)} as the estimate of \theta based on all the observations except those from group j, and define pseudo-value

\tilde{\theta}_j = J \hat{\theta}_n - (J-1) \hat{\theta}_{(-j)}.

The delete-m jackknife estimator for \theta is

\begin{aligned} \hat{\theta}_{jack-m} = \frac{1}{J}\sum_{j=1}^J \tilde{\theta}_j. \end{aligned}

and the delete-m jackknife variance estimator is

v_{jack-m} = \dfrac{1}{J} \cdot \dfrac{1}{J-1} \displaystyle\sum_{j=1}^J (\tilde{\theta}_j - \hat{\theta}_{jack-m})^2.

Notice that these are essentially the formulas for the delete-1 jackknife, but with n replaced with J.

The delete-m jackknife for unequal m

What if your data come in J groups but the groups are not of equal size? Busing et al. (1999) modify the formulas for the delete-m jackknife slightly to account for the differing group sizes and call this jackknife the delete-m_j jackknife. They use the same heuristic argument as in the delete-1 jackknife, but the formulas have to be tweaked so that the argument works for the different group sizes. I only provide the formulas here; the argument can be found in Reference 1 (Section 2.3).

Assume that there are m_j data points in group j, and let h_j = n/m_j. As in the delete-m jackknife, define \hat\theta_{(-j)} as the estimate of \theta based on all the observations except those from group j. Defining the jth pseudo-value as

\tilde{\theta}_j = h_j \hat{\theta}_n - (h_j-1) \hat{\theta}_{(-j)},

the delete-m_j jackknife estimator for \theta is

\begin{aligned} \hat{\theta}_{jack-m_j} = \sum_{j=1}^J \dfrac{1}{h_j} \tilde{\theta}_j. \end{aligned}

The delete-m_j jackknife variance estimator is

v_{jack-m_j} = \dfrac{1}{J} \cdot \displaystyle\sum_{j=1}^J \dfrac{1}{h_j - 1} (\tilde{\theta}_j - \hat{\theta}_{jack-m_j})^2.

Note that in the case of equal groups, h_j = J for all j, and we get back the formulas for the delete-m jackknife.
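
Here is a minimal sketch of the delete-m_j jackknife in R (the function delete_mj_jackknife and the choice of the sample mean as the estimator are my own, purely for illustration):

delete_mj_jackknife <- function(x, group, estimator) {
    n <- length(x)
    groups <- unique(group)
    J <- length(groups)
    theta_n <- estimator(x)

    h <- pseudo <- numeric(J)
    for (j in seq_len(J)) {
        in_group <- group == groups[j]
        h[j] <- n / sum(in_group)                 # h_j = n / m_j
        theta_minus_j <- estimator(x[!in_group])  # estimate without group j
        pseudo[j] <- h[j] * theta_n - (h[j] - 1) * theta_minus_j
    }

    estimate <- sum(pseudo / h)
    variance <- sum((pseudo - estimate)^2 / (h - 1)) / J
    list(estimate = estimate, variance = variance)
}

# example: jackknife the sample mean with 3 groups of unequal size
set.seed(1)
x <- rnorm(30)
group <- rep(1:3, times = c(5, 10, 15))
delete_mj_jackknife(x, group, mean)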

References:

  1. Busing, F. M. T. A., et al. (1999). Delete-m jackknife for unequal m.

Control variates

Control variates is a method for variance reduction. Imagine that we have X \sim P where P is some probability distribution that we can sample from, and that we are interested in estimating \mu = \mathbb{E}[f(X)] for some function f. The ordinary Monte Carlo procedure is to get i.i.d. samples X_1, \dots, X_n \stackrel{i.i.d.}{\sim} P, then estimate \mu via

\begin{aligned} \hat{\mu} = \frac{1}{n}\sum_{i=1}^n f(X_i). \end{aligned}

If the variance of f(X) is \text{Var}[f(X)] = \sigma_f^2, then we know that the Monte Carlo estimate has variance

\begin{aligned} \text{Var}(\hat{\mu}) &= \frac{\sigma_f^2}{n}. \end{aligned}

Now, imagine that there is some other function h such that we know the value of \theta = \mathbb{E}[h(X)]. For any fixed value \beta \in \mathbb{R}, consider the regression estimator

\begin{aligned} \hat{\mu}_\beta = \frac{1}{n}\sum_{i=1}^n [f(X_i) - \beta h(X_i)] + \beta \theta. \end{aligned}

It’s easy to see that \hat{\mu}_\beta is always unbiased for \mu:

\begin{aligned} \mathbb{E}[\hat{\mu}_\beta] = \frac{1}{n} \sum_{i=1}^n [\mathbb{E}[f(X_i)] - \beta \mathbb{E}[h(X_i)]] + \beta  \theta = \frac{n}{n} (\mu - \beta \theta) + \beta  \theta = \mu. \end{aligned}

Let’s compute \hat{\mu}_\beta’s variance. If we let \sigma_h^2 = \text{Var}[h(X)] and \rho = \text{Cor}(f(X), h(X)), then

\begin{aligned} \text{Var}(\hat{\mu}_\beta) &= \frac{1}{n} \text{Var} [f(X) - \beta h(X)]  \\  &= \frac{1}{n} \left[ \text{Var}(f(X)) - 2 \beta \text{Cov}(f(X), h(X)) + \beta^2 \text{Var}(h(X)) \right]  \\  &= \frac{\sigma_h^2 \beta^2 - 2\sigma_f \sigma_h \rho \beta + \sigma_f^2}{n}. \end{aligned}

\beta is a free parameter whose value we get to choose. To minimize the variance of \hat{\mu}_\beta, we should take \beta = \dfrac{\sigma_f \rho}{\sigma_h}. Plugging this into the variance formula above, we get

\begin{aligned} \text{Var}(\hat{\mu}_\beta) = \frac{- \sigma_f^2 \rho^2 + \sigma_f^2}{n} = \frac{\sigma_f^2}{n} (1 - \rho^2) = (1-\rho^2) \text{Var}(\hat{\mu}). \end{aligned}

Thus, this difference estimator never has larger variance than the simple Monte Carlo estimate, and it has strictly smaller variance whenever \rho \neq 0. The larger the absolute value of the correlation between f(X) and h(X), the smaller the variance of the difference estimator. h(X) is known as the control variate. In looking for control variates, we should try to find variables that are as correlated (or anti-correlated) with our variable of interest as possible.

But we can’t compute the optimal \beta

To get the variance formula above, we needed

\beta =  \dfrac{\sigma_f \rho}{\sigma_h} = \dfrac{\sigma_f \sigma_h \rho}{\sigma_h^2} = \dfrac{\text{Cov}(f(X), h(X))}{\text{Var}(h(X))}.

In most practical settings we won’t know any of the quantities on the RHS exactly. What value of \beta should we use then? Owen (Reference 2) suggests using the plug-in estimate

\hat{\beta} = \dfrac{\widehat{\text{Cov}}(f(X), h(X))}{\widehat{\text{Var}}(h(X))} = \dfrac{\sum_{i=1}^n (f(X_i) - \bar{f})(h(X_i) - \bar{h})}{\sum_{i=1}^n (h(X_i) - \bar{h})^2},

where \bar{f} = \frac{1}{n}\sum_{i=1}^n f(X_i) and \bar{h} = \frac{1}{n}\sum_{i=1}^n h(X_i). He claims that while \hat{\mu}_{\hat{\beta}} is biased for \mu in general, this bias is small. (See top of page 32 of Reference 2 for a heuristic argument.)
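
Here is a minimal sketch of control variates in action: estimating \mu = \mathbb{E}[e^U] for U \sim \text{Unif}(0,1) (whose true value is e - 1 \approx 1.718), with h(U) = U as the control variate so that \theta = 1/2 is known, and using the plug-in \hat{\beta} above:

set.seed(1)
n <- 10000
u <- runif(n)
fx <- exp(u)    # f(X_i)
hx <- u         # h(X_i), with known mean theta = 1/2
theta <- 0.5

beta_hat <- cov(fx, hx) / var(hx)   # plug-in estimate of the optimal beta

mu_mc <- mean(fx)                                     # simple Monte Carlo
mu_cv <- mean(fx - beta_hat * hx) + beta_hat * theta  # control variate estimate
c(true = exp(1) - 1, mc = mu_mc, cv = mu_cv)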

Multiple control variates

If we have J control variates \boldsymbol{h}(X) = \begin{pmatrix} h_1(X) & \dots & h_J(X) \end{pmatrix}^T, we can use all of them in the difference estimator:

\begin{aligned} \hat{\mu}_{\boldsymbol{\beta}} = \frac{1}{n}\sum_{i=1}^n [f(X_i) - \boldsymbol{\beta}^T \boldsymbol{h}(X_i)] + \boldsymbol{\beta}^T \boldsymbol{\theta}, \end{aligned}

where \boldsymbol{\beta} \in \mathbb{R}^J and \boldsymbol{\theta}= \mathbb{E}[\boldsymbol{h}(X)]. Similar to the one-dimensional case, it can be shown that the optimal value of \boldsymbol{\beta} is

\begin{aligned} \boldsymbol{\beta} = \text{Var}(\boldsymbol{h}(X))^{-1} \text{Cov}(f(X), \boldsymbol{h}(X)), \end{aligned}

and in practice we use the sample version

\begin{aligned} \hat{\boldsymbol{\beta}} &= \widehat{\text{Var}}(\boldsymbol{h}(X))^{-1} \widehat{\text{Cov}}(f(X), \boldsymbol{h}(X)) \\  &= \left(\frac{1}{n}\sum_{i=1}^n (\boldsymbol{h}(X_i) - \bar{\boldsymbol{h}})(\boldsymbol{h}(X_i) - \bar{\boldsymbol{h}})^T \right)^{-1} \frac{1}{n}\sum_{i=1}^n (\boldsymbol{h}(X_i) - \bar{\boldsymbol{h}})(f(X_i) - \bar{f}), \end{aligned}

where \bar{f} = \frac{1}{n}\sum_{i=1}^n f(X_i) and \bar{\boldsymbol{h}} = \frac{1}{n}\sum_{i=1}^n \boldsymbol{h}(X_i). See pp30-32 of Reference 2 for details and some discussion on the bias that this estimate introduces.

Post-stratification as a special case of control variates

Owen notes that post-stratification can be thought of as a special case of control variates. Let’s say we have strata D_1, \dots, D_J. Then post-stratification is simply the regression estimator with J control variates, the jth one being h_j(X) = 1\{ X \in D_j \} (i.e. does the observation belong to stratum D_j?). (See Example 8.4 of Reference 2.)

What if we don’t know \theta = \mathbb{E}[h(X)] exactly?

If we don’t know \theta = \mathbb{E}[h(X)] exactly but have some estimate \hat{\theta} instead, we could replace \theta by \hat{\theta} in the regression estimator:

\begin{aligned} \hat{\tau}_\beta = \frac{1}{n}\sum_{i=1}^n [f(X_i) - \beta h(X_i)] + \beta \hat{\theta}. \end{aligned}

Of course, \hat{\theta} can’t be just any estimate: if we take \hat{\theta} = \frac{1}{n}\sum_{i=1}^n h(X_i), then the regression estimator reduces to the standard Monte Carlo estimate. Instead, let’s assume that we used some data other than X_1, \dots, X_n to get this estimate, and that we have \hat{\theta} = \theta + \varepsilon with \varepsilon independent of the X_i’s. \hat{\tau}_\beta is still unbiased but has variance

\text{Var}(\hat{\tau}_\beta) = \dfrac{\sigma_h^2 \beta^2 - 2\sigma_f \sigma_h \rho \beta + \sigma_f^2}{n} + \beta^2 \text{Var}(\varepsilon).

The value of \beta which minimizes variance is now \beta = \dfrac{\sigma_f \sigma_h \rho}{\sigma_h^2 + n\text{Var}(\varepsilon)}, which gives variance

\begin{aligned} \text{Var}(\hat{\tau}_\beta) &= - \frac{(\sigma_f \sigma_h \rho)^2}{n\sigma_h^2 + n^2 \text{Var}(\varepsilon)} + \frac{\sigma_f^2}{n} \\  &= \dfrac{\sigma_f^2}{n} \left[ 1 - \frac{\rho^2 \sigma_h^2}{\sigma_h^2 + n \text{Var}(\varepsilon)} \right]. \end{aligned}

This is still better than the simple Monte Carlo estimate, but not as good as knowing \theta exactly:

\begin{aligned} \text{Var}(\hat{\tau}_\beta) &\leq \dfrac{\sigma_f^2}{n} [ 1 - 0] = \text{Var}(\hat{\mu}), \\  \text{Var}(\hat{\tau}_\beta) &\geq \dfrac{\sigma_f^2}{n} \left[ 1 - \frac{\rho^2 \sigma_h^2}{\sigma_h^2 + n (0)} \right] = \text{Var}(\hat{\mu}_\beta).  \end{aligned}

References:

  1. Wikipedia. Control variates.
  2. Owen, A. Variance Reduction (Section 8.9).

logit and expit

The logit function is defined as

\begin{aligned} \text{logit}(p) = \log \left(\frac{p}{1-p} \right), \end{aligned}

where p \in (0,1). The logit function transforms a variable constrained to the unit interval (usually a probability) to one that can take on any real value. In statistics, most people encounter the logit function in logistic regression, where we assume that the probability of a binary response Y being 1 is associated with features X_1, \dots, X_k through the relationship

\begin{aligned} \text{logit}[\mathbb{P}(Y = 1)] = \sum_{j=1}^k \beta_j X_j.  \end{aligned}

Because the logit can take on any real value, it makes more sense to model the logit of a probability as above instead of modeling the probability directly like

\begin{aligned} \mathbb{P}(Y = 1) = \sum_{j=1}^k \beta_j X_j.  \end{aligned}

If we model the probability directly, it is possible that \sum_{j=1}^k \beta_j X_j results in a value that lies outside the unit interval; in that situation we would have to do some post-processing (e.g. thresholding) to get valid probabilities.

The expit function is simply the inverse of the logit function. It is defined as

\begin{aligned} \text{expit}(x) = \frac{e^x}{1 + e^x}. \end{aligned}

It takes any real number x and transforms it to a value in (0, 1).

The picture below shows what the logit and expit functions look like. (Since they are inverses of each other, their graphs are reflections of each other across the y = x line.)
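
In R, these two functions are available as the quantile and distribution functions of the logistic distribution: qlogis() is the logit and plogis() is the expit. A quick sketch:

logit <- function(p) log(p / (1 - p))
expit <- function(x) exp(x) / (1 + exp(x))

p <- c(0.1, 0.5, 0.9)
all.equal(logit(p), qlogis(p))                       # TRUE
all.equal(expit(logit(p)), p)                        # TRUE: expit inverts logit
all.equal(expit(c(-2, 0, 2)), plogis(c(-2, 0, 2)))   # TRUE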

We can think of the expit function as a special case of the softmax function. If we define the softmax function for n variables as

\begin{aligned} \sigma (x_1, \dots, x_n) = \left( \frac{e^{x_1}}{e^{x_1} + \dots + e^{x_n}}, \dots, \frac{e^{x_n}}{e^{x_1} + \dots + e^{x_n}} \right), \end{aligned}

then when n = 2,

\begin{aligned} \sigma(x, 0) &= \left( \frac{e^x}{e^x + e^0}, \frac{e^0}{e^x + e^0} \right) \\  &= \left( \frac{e^x}{1 + e^x}, \frac{1}{1 + e^x} \right) \\  &= \left( \text{expit}(x), 1 - \text{expit}(x) \right). \end{aligned}