p-value around 0.32? You may have a problem with outliers…

I’ve been reading through Kohavi et al.’s Trustworthy Online Controlled Experiments, which gives a ton of great advice on how to do A/B testing well. This post is about one nugget regarding the t-test and outliers that I did not know before.

(This appears in Chapter 19 of the book, “The A/A Test”. In the book they mention this issue in the context of computing the p-values for several two-sample t-tests. In this post I use a one-sample t-test because it is easier to illustrate the point with fewer sources of randomness.)

Imagine we are doing a one-sample t-test with 100 observations. Say our observations are generated in the following way:

set.seed(1)
x <- runif(100, min = 0, max = 1) + 10
x[100] <- 100000

Every single value in x is much higher than zero. We have this one huge outlier, but it’s in the “right” direction (i.e. greater than all the other values in x, much greater than zero). If we run a two-sided t-test with this data for the null hypothesis that the mean of the data-generating distribution is equal to zero, we should get a really small p-value, right?

Wrong! Here is the R output:

t.test(x)$p.value
# [1] 0.3147106

The problem is that outlier in x. Sure, it increases the mean of x by a lot, but it also increases the variance a ton! Remember that the t-statistic is the ratio of the mean estimate to its standard error (the sample standard deviation divided by \sqrt{n}), so the outlier’s effects on the mean and on the variance “cancel” each other out.
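Here’s a quick check of that claim (my own addition, using the x generated above): the outlier shifts the mean by a couple of orders of magnitude, but it inflates the standard deviation by far more.

mean(x[1:99]); sd(x[1:99])   # without the outlier: roughly 10.5 and 0.3
mean(x); sd(x)               # with the outlier: both blow up, the sd far more so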

Here’s a sketch of the mathematical argument. Assume the outlier has value o, and that there are n observations in the sample. When the outlier is very large, it dominates the mean estimate, so the numerator of the t-statistic is something like o / n. It also dominates the variance estimate:

\begin{aligned} \text{Variance} &= \mathbb{E}[X^2] - (\mathbb{E}[X])^2, \\  \text{Sample variance } s^2 &\approx \frac{o^2}{n} - \frac{o^2}{n^2} \approx \frac{o^2}{n}. \end{aligned}

Hence, the t-statistic is roughly

\begin{aligned} t = \frac{\hat{\mu} - 0}{\sqrt{s^2 / n}} \approx \frac{o/n}{\sqrt{(o^2/n)/n}} = 1 \text{ or } -1, \end{aligned}

depending on whether o is positive or negative. Hence, the two-sided p-value is approximately \mathbb{P} \{ |\mathcal{N}(0,1)| \geq 1 \}:

2 * pnorm(-1)
# [1] 0.3173105
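To see the approximation kick in numerically, here is a small sketch (mine, not from the book): grow the outlier and watch the t-statistic settle near 1 and the p-value near 2 * pnorm(-1).

set.seed(1)
base <- runif(99, min = 0, max = 1) + 10
for (o in c(1e3, 1e5, 1e7)) {
  res <- t.test(c(base, o))   # one-sample t-test with the outlier o appended
  cat("outlier:", o, " t:", round(res$statistic, 3),
      " p-value:", round(res$p.value, 3), "\n")
}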

Another way to understand this intuitively is to make a plot of the data. First, let’s plot the data without the outlier:

plot(x[1:99], pch = 16, cex = 0.5, ylim = c(0, max(x[1:99])),
     xlab = "Observation index", ylab = "Value",
     main = "Plot of values without outlier")
abline(h = 0, col = "red", lty = 2)

It’s obvious here that the variation in x pales in comparison to the distance of x from the y = 0 line, so we expect a very small p-value. (In fact, R tells me the p-value is \approx 4 \times 10^{-158}!) Next, let’s plot the data with the outlier:

plot(x, pch = 16, cex = 0.5, ylim = c(0, max(x)),
     xlab = "Observation index", ylab = "Value",
     main = "Plot of values with outlier")
abline(h = 0, col = "red", lty = 2)

Not so obvious that the mean of x is different from zero now, right?
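(For completeness: the no-outlier p-value of \approx 4 \times 10^{-158} quoted above can be reproduced by running the same one-sample test on just the first 99 values.)

t.test(x[1:99])$p.value
# approximately 4e-158, as quoted above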

Welch’s t-test and the Welch-Satterthwaite equation

Welch’s t-test is probably the most commonly used hypothesis test for whether two populations have the same mean. It is generally preferred over Student’s two-sample t-test: both assume that the populations of the two groups are normal, but Student’s t-test additionally assumes that the two populations have the same variance, while Welch’s t-test does not assume equal variances.

Assume we have n_1 samples from group 1 and n_2 samples from group 2. For j = 1, 2, let \overline{X}_j and s_j^2 denote the sample mean and sample variance of group j respectively. Welch’s t-statistic is defined by

\begin{aligned} t = \dfrac{\overline{X}_1 - \overline{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}. \end{aligned}

Under the null hypothesis, t is approximately distributed as the t-distribution with degrees of freedom

\begin{aligned} \nu = \left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2 \bigg/ \left( \frac{s_1^4}{n_1^2 (n_1 - 1)} + \frac{s_2^4}{n_2^2 (n_2 - 1)} \right). \end{aligned}

The equation above is known as the Welch-Satterthwaite equation.
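As a quick sanity check (a hypothetical example of mine, not from the post), we can compute t and \nu by hand from the formulas above and compare them against t.test, which performs Welch’s t-test by default in R:

# Hypothetical example: two normal samples with different variances
set.seed(2)
x1 <- rnorm(30, mean = 0, sd = 1)
x2 <- rnorm(40, mean = 0.5, sd = 2)

n1 <- length(x1); n2 <- length(x2)
s1 <- var(x1); s2 <- var(x2)

# Welch's t-statistic and the Welch-Satterthwaite degrees of freedom
t_stat <- (mean(x1) - mean(x2)) / sqrt(s1 / n1 + s2 / n2)
nu <- (s1 / n1 + s2 / n2)^2 /
  (s1^2 / (n1^2 * (n1 - 1)) + s2^2 / (n2^2 * (n2 - 1)))

c(t = t_stat, df = nu)
t.test(x1, x2)  # Welch's t-test by default; reports the same t and df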

Welch-Satterthwaite equation

Where does the Welch-Satterthwaite equation come from? The commonly cited reference for this is F. E. Satterthwaite’s 1946 paper (Reference 1). Satterthwaite tackles a more general problem: let s_1^2, \dots, s_J^2 be J independent sample variances each having r_j degrees of freedom. Consider the combined estimate

\hat{V}_s = a_1 s_1^2 + \dots + a_J s_J^2,

where a_1, \dots, a_J are constants. (Back then, this was called a complex estimate of variance.) The exact distribution of \hat{V}_s is too difficult to compute (no simple closed form, as far as I can tell), so we approximate it with a scaled chi-squared distribution that has the same mean and variance as \hat{V}_s. The question is: what should the degrees of freedom of this approximating chi-squared distribution be?

The paper states that the number of degrees of freedom is

\begin{aligned} r_s = \left( \sum_{j=1}^J a_j \mathbb{E}[s_j^2] \right)^2 \bigg/ \left( \sum_{j=1}^J \dfrac{(a_j \mathbb{E}[s_j^2])^2}{r_j} \right). \end{aligned}

In practice we don’t know the value of the expectations, so we replace them with the observed values:

\begin{aligned} \hat{r}_s = \left( \sum_{j=1}^J a_j s_j^2 \right)^2 \bigg/ \left( \sum_{j=1}^J \dfrac{(a_j s_j^2)^2}{r_j} \right). \end{aligned}

The degrees of freedom for Welch’s t-test is this formula above with J = 2, r_j = n_j - 1 and a_j = 1 / n_j.
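Continuing the hypothetical two-sample example from the earlier code sketch (x1, x2 and their summaries), here is the plug-in formula as a small R function; specializing it as described recovers Welch’s degrees of freedom.

# Satterthwaite's plug-in degrees of freedom for a weighted sum of variances
# a: weights, v: sample variances, r: their degrees of freedom
satterthwaite_df <- function(a, v, r) {
  sum(a * v)^2 / sum((a * v)^2 / r)
}

# Specialize to Welch's setting: J = 2, a_j = 1/n_j, r_j = n_j - 1
satterthwaite_df(a = c(1 / n1, 1 / n2), v = c(s1, s2), r = c(n1 - 1, n2 - 1))
# should match the nu computed (and the df reported by t.test) above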

Heuristic argument

Satterthwaite’s 1946 paper is the commonly cited reference, but that paper actually contains just the formulas and not the heuristic argument. For that, we have to go to his 1941 paper (Reference 2).

Consider first the simple case where we have just two independent sample variances s_1^2 and s_2^2, with r_1 and r_2 degrees of freedom respectively. Let’s compute the degrees of freedom for the chi-squared distribution that approximates V = s_1^2 + s_2^2.

For j = 1, 2, let \sigma_j^2 = \mathbb{E}[s_j^2]. From our set-up, we have \dfrac{r_j s_j^2}{\sigma_j^2} \sim \chi_{r_j}^2, thus

\begin{aligned} \text{Var} (s_j^2) &= \left( \frac{\sigma_j^2}{r_j} \right)^2  \text{Var} (\chi_{r_j}^2) = \left( \frac{\sigma_j^2}{r_j} \right)^2  \cdot (2 r_j) \\  &= \frac{2\sigma_j^4}{r_j} \quad \text{for } j = 1, 2, \\  \text{Var} (V) &= 2\left( \frac{\sigma_1^4}{r_1} + \frac{\sigma_2^4}{r_2} \right), \end{aligned}

where the last equality holds because s_1^2 and s_2^2 are independent. On the other hand, approximating V by a chi-squared distribution with r degrees of freedom means that \dfrac{r V}{\mathbb{E}[V]} \sim \chi_r^2. Under this approximation (again using \text{Var}(\chi_r^2) = 2r),

\text{Var}(V) = \left(\dfrac{\mathbb{E}[V]}{r} \right)^2 \cdot (2 r) = \dfrac{2(\sigma_1^2 + \sigma_2^2)^2}{r}.

For this approximation to be good, we want the variance obtained under the chi-squared approximation to be the same as the true variance. Hence, we have

\begin{aligned} \dfrac{2(\sigma_1^2 + \sigma_2^2)^2}{r} &= 2\left( \frac{\sigma_1^4}{r_1} + \frac{\sigma_2^4}{r_2} \right), \\  r &= (\sigma_1^2 + \sigma_2^2)^2 \bigg/ \left( \frac{\sigma_1^4}{r_1} + \frac{\sigma_2^4}{r_2} \right). \end{aligned}

The argument above is perfectly general: we can rerun it to get the effective degrees of freedom for V = a_1 s_1^2 + \dots + a_J s_J^2.
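To convince ourselves that this variance-matching heuristic gives a sensible approximation, here is a small simulation sketch (my own parameters, assuming normal data): generate V = s_1^2 + s_2^2 many times and compare quantiles of r V / \mathbb{E}[V] against the chi-squared distribution with Satterthwaite’s r degrees of freedom.

set.seed(3)
r1 <- 5; r2 <- 10          # degrees of freedom of the two sample variances
sigma1 <- 1; sigma2 <- 2   # true standard deviations
B <- 1e5

# each replicate draws fresh normal samples and forms V = s_1^2 + s_2^2
V <- replicate(B, var(rnorm(r1 + 1, sd = sigma1)) +
                  var(rnorm(r2 + 1, sd = sigma2)))

r  <- (sigma1^2 + sigma2^2)^2 / (sigma1^4 / r1 + sigma2^4 / r2)  # Satterthwaite df
EV <- sigma1^2 + sigma2^2                                        # E[V]

# the quantiles should be close (the approximation is not exact)
probs <- c(0.1, 0.5, 0.9)
rbind(simulated = quantile(r * V / EV, probs),
      chisq     = qchisq(probs, df = r))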

References:

  1. Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components.
  2. Satterthwaite, F. E. (1941). Synthesis of variance.