How heavy-tailed is the t distribution? (Part 2)

In the previous post, we explored how heavy-tailed the t distribution is through the question: “What is the probability that the random variable is at least x standard deviations (SDs) away from the mean?” For the most part, the smaller the degrees of freedom, the larger this probability was (more “heavy-tailed”), until we realized that the trend reversed for really small degrees of freedom (df = 2.1 in the post). In fact, for 1 < df \leq 2, the variance of the t distribution is infinite, and so the random variable is always within 1 SD of the mean!

We need another way to think about heavy-tailedness. (The code to produce the figures in this post is available here.)

A first approach that doesn’t work

You might be wondering: why didn’t I just plot P\{ T > x \} with T \sim t_{df} against x, for various values of x and df? If I had done that, I would have ended up with the plot below (showing the log of the probabilities):

That seems to be exactly what we want: the smaller the degrees of freedom, the slower this probability decays…
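
(In case you want to reproduce this kind of plot, here’s a minimal sketch; the df values and x grid are illustrative choices, not necessarily those behind the figure above.)

library(ggplot2)

dfVal <- c(Inf, 30, 10, 5, 3, 2.1)
xVal <- seq(0, 10, length.out = 101)

tbl <- do.call(rbind, lapply(dfVal, function(df) {
  # One-sided tail probability P(T > x) on the log10 scale
  data.frame(df = df, x = xVal,
             log10Prob = log10(pt(xVal, df = df, lower.tail = FALSE)))
}))
tbl$df <- factor(tbl$df)

ggplot(tbl, aes(x = x, y = log10Prob, col = df)) +
  geom_line(linewidth = 1) +
  labs(y = "log10 P(T > x)", col = "Deg. of freedom") +
  theme_bw()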

The problem is that the comparison above ignores the scale of the random variables. Imagine making the plot above, but plotting lines for the normal distribution with different standard deviations instead of the t distribution with different degrees of freedom. This is what we would get:

That seems to give the same trend as the plot before! Can we then conclude that the \mathcal{N}(0, 10^2) distribution is more heavy-tailed than the \mathcal{N}(0, 1) distribution??
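
(Here is a sketch of the code for the normal version of the plot; the SD values are illustrative.)

library(ggplot2)

sdVal <- c(1, 2, 5, 10)
xVal <- seq(0, 10, length.out = 101)

tbl <- do.call(rbind, lapply(sdVal, function(s) {
  # Tail probability of N(0, s^2) beyond x, on the log10 scale
  data.frame(sd = s, x = xVal,
             log10Prob = log10(pnorm(xVal, sd = s, lower.tail = FALSE)))
}))
tbl$sd <- factor(tbl$sd)

ggplot(tbl, aes(x = x, y = log10Prob, col = sd)) +
  geom_line(linewidth = 1) +
  labs(y = "log10 P(X > x)", col = "SD") +
  theme_bw()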

One way to incorporate scale

The discussion above illustrates the need to take scale into account. We tried to do this in the previous post by scaling each distribution by its own SD, but that idea broke down for small degrees of freedom.

Here’s an idea: Pick some threshold x'. For each random variable T, find the scale factor k such that P \{ kT > x' \} = P \{ \mathcal{N}(0, 1) > x' \}. For this value of k, kT and \mathcal{N}(0, 1) are on the same scale w.r.t. this threshold. We then compare the tail probabilities of kT and \mathcal{N}(0, 1) (instead of T and \mathcal{N}(0, 1)).

Finding k is not hard: here’s a three-line function that does it for the t distribution in R:

getScaleFactor <- function(df, threshold) {
  # Tail probability of the standard normal beyond the threshold
  tailProb <- pnorm(threshold, lower.tail = FALSE)
  # The point at which t_df has that same tail probability
  tQuantile <- qt(tailProb, df = df, lower.tail = FALSE)
  # k such that P{kT > threshold} = P{N(0,1) > threshold}
  return(threshold / tQuantile)
}

Let’s plot the log10 of the tail probability P \{ kT > x \} with T \sim t_{df} against x for various values of x and df, with the scale factor k computed as above:

By definition, the tail probabilities will coincide when x is equal to the threshold used to compute the scale factors. We now see a clear trend with no breakdown: for smaller values of df, the tail probability P \{ kT > x \} is larger.
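
(Here’s a sketch of how this plot could be made with getScaleFactor() from above; the threshold of 3 and the df values are illustrative choices.)

library(ggplot2)

dfVal <- c(Inf, 30, 10, 5, 3, 2.1)
xVal <- seq(1, 10, length.out = 101)
threshold <- 3  # x' at which every curve is pinned to the normal

tbl <- do.call(rbind, lapply(dfVal, function(df) {
  k <- getScaleFactor(df, threshold)
  # P{kT > x} = P{T > x / k}
  data.frame(df = df, x = xVal,
             log10Prob = log10(pt(xVal / k, df = df, lower.tail = FALSE)))
}))
tbl$df <- factor(tbl$df)

ggplot(tbl, aes(x = x, y = log10Prob, col = df)) +
  geom_line(linewidth = 1) +
  labs(y = "log10 P(kT > x)", col = "Deg. of freedom") +
  theme_bw()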

Another side benefit of this way of looking at tail probabilities is that we can now compare distributions which have infinite variance, or even an undefined mean (like the Cauchy distribution, which is the t distribution with one degree of freedom)! Here is the same plot as above but for smaller degrees of freedom:
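
(As a quick illustration, the scale factors themselves get very small for these degrees of freedom; the values below are approximate.)

# Scale factors for very small df, with threshold = 3
sapply(c(1, 1.5, 2), function(df) getScaleFactor(df, threshold = 3))
# roughly 0.013, 0.07, 0.16 -- the heavier the tail, the more we must shrink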


How heavy-tailed is the t distribution?

It’s well known that the t distribution has heavier tails than the normal distribution, and the smaller the degrees of freedom, the more “heavy-tailed” it is. As the degrees of freedom go to 1, the t distribution approaches the Cauchy distribution, and as they go to infinity, it approaches the normal distribution.

One way to measure the “heavy-tailedness” of a distribution is by computing the probability of the random variable taking a value that is at least x standard deviations (SD) away from its mean. The larger those probabilities are, the more heavy-tailed a distribution is.

The code below computes the (two-sided) tail probabilities for the t distribution for a range of degrees-of-freedom values. Because the probabilities are so small, we compute the log10 of these probabilities instead. Hence, a value of -3 corresponds to a probability of 10^{-3}, or a 1-in-1,000 chance.

library(ggplot2)

dfVal <- c(Inf, 100, 50, 30, 10, 5, 3, 2.1)
sdVal <- 1:10

tbl <- lapply(dfVal, function(df) {
  # SD of the t distribution is sqrt(df / (df - 2)); for df = Inf (normal), it is 1
  stdDev <- if (is.infinite(df)) 1 else sqrt(df / (df - 2))
  # Two-sided tail probability of being at least noSD SDs from the mean
  data.frame(df = df,
             noSD = sdVal,
             log10Prob = log10(2 * pt(-sdVal * stdDev, df = df)))
})

tbl <- do.call(rbind, tbl)
tbl$df <- factor(tbl$df)

ggplot(tbl, aes(x = noSD, y = log10Prob, col = df)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Spectral", direction = 1) +
  labs(x = "No. of SD", y = "log10(Probability of being >= x SD from mean)",
       title = "Tail probabilities for t distribution",
       col = "Deg. of freedom") +
  theme_bw()

Don’t be fooled by the scale on the vertical axis! For a t distribution with 3 degrees of freedom, the probability of being 10 SD out is about 1-in-2,400. For a normal distribution (infinite degrees of freedom in the figure), that same probability is about 1-in-65,000,000,000,000,000,000,000! (That’s 65 followed by 21 zeros. As a comparison, the number of stars in the universe is estimated to be around 10^{24}, or 1 followed by 24 zeros.)
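
(You can check these numbers yourself with pt() and pnorm(); recall that t_3 has SD \sqrt{3}.)

2 * pt(-10 * sqrt(3), df = 3)  # ~ 4.2e-04, about 1 in 2,400
2 * pnorm(-10)                 # ~ 1.5e-23, about 1 in 6.5e22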

If you look closely at the figure, you might notice something a little odd with df = 2.1: it seems that for any number of SDs, the probability of being that many SDs out for df = 2.1 is lower than that for df = 3. Does that mean that df = 2.1 is less heavy-tailed than df = 3?

Not necessarily. A t distribution with \nu degrees of freedom has SD \sqrt{\nu / (\nu - 2)}. For \nu = 3, the SD is about 1.73, while for \nu = 2.1 the SD is about 4.58, much larger! Taking this to the extreme, consider a t distribution with \nu = 2. The variance is infinite in this case, so the random variable always takes values within 1 SD of the mean! Does that mean that this distribution is less heavy-tailed than the normal distribution?
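
(A quick check of those two SDs in R:)

sdT <- function(nu) sqrt(nu / (nu - 2))  # SD of the t distribution, nu > 2
sdT(3)    # ~ 1.73
sdT(2.1)  # ~ 4.58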

Looks like we might need another way to define heavy-tailedness!

Update (2021-11-06): This blog post contains a nice discussion on some of the weirdness we see when the degrees of freedom for the t distribution is between 2 and 3.

t distribution as a mixture of normals

In class, the t distribution is usually introduced like this: if X \sim \mathcal{N}(0,1) and Z \sim \chi_\nu^2 are independent, then T = \dfrac{X}{\sqrt{Z / \nu}} has the t distribution with \nu degrees of freedom, denoted t_\nu or t_{(\nu)}.

Did you know that the t distribution can also be viewed as a (continuous) mixture of normal random variables? Specifically, let W have inverse-gamma distribution \text{InvGam}\left(\dfrac{\nu}{2}, \dfrac{\nu}{2} \right), and define the conditional distribution X \mid W = w \sim \mathcal{N}(0, w). Then the unconditional distribution of X is the t distribution with \nu degrees of freedom.
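
One quick way to convince yourself of this is by simulation (a sketch; the choices \nu = 4 and n = 10^5 below are arbitrary):

set.seed(42)
nu <- 4
n <- 1e5
# If G ~ Gamma(shape = nu/2, rate = nu/2), then 1/G ~ InvGam(nu/2, nu/2)
w <- 1 / rgamma(n, shape = nu / 2, rate = nu / 2)
x <- rnorm(n, mean = 0, sd = sqrt(w))  # X | W = w ~ N(0, w)

# Sample quantiles should fall close to the y = x line
qqplot(qt(ppoints(n), df = nu), x,
       xlab = "t_4 quantiles", ylab = "Sample quantiles")
abline(0, 1, col = "red")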

The proof follows directly from computing the unconditional (or marginal) density of X:

\begin{aligned} f_X(x) &= \int_0^\infty f_{X \mid W}(x \mid w) f_W(w) \, dw \\  &\propto \int_0^\infty \frac{1}{\sqrt{w}} \exp \left( -\frac{x^2}{2w} \right) \cdot w^{-\nu/2 - 1} \exp \left( - \frac{\nu}{2w} \right) dw \\  &= \int_0^\infty w^{-\frac{\nu + 1}{2} - 1} \exp \left( - \frac{x^2 + \nu}{2w} \right) dw. \end{aligned}

Note that the integrand above is proportional to the PDF of the inverse-gamma distribution with \alpha = \dfrac{\nu + 1}{2} and \beta = \dfrac{x^2 + \nu}{2}. Hence, we can evaluate the last integral exactly to get

f_X(x) \propto \Gamma \left( \dfrac{\nu + 1}{2} \right) \left(\dfrac{x^2 + \nu}{2}\right)^{-\frac{\nu + 1}{2}} \propto\left( x^2 + \nu \right)^{-\frac{\nu + 1}{2}} \propto \left( 1 + \dfrac{x^2}{\nu} \right)^{-\frac{\nu + 1}{2}},

which is proportional to the PDF of the t_\nu distribution.
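
(If you’d like a numerical sanity check of this derivation, we can integrate the mixture density directly and compare with dt(); \nu and the evaluation points below are arbitrary choices.)

nu <- 4
# Density of InvGam(nu/2, nu/2)
dInvGam <- function(w) {
  (nu / 2)^(nu / 2) / gamma(nu / 2) * w^(-nu / 2 - 1) * exp(-nu / (2 * w))
}
# Marginal density of X via numerical integration
fX <- function(x) {
  integrate(function(w) dnorm(x, sd = sqrt(w)) * dInvGam(w),
            lower = 0, upper = Inf)$value
}
sapply(c(0, 1, 2.5), fX)
dt(c(0, 1, 2.5), df = nu)  # should match the line above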

Sources for the information above:

  1. Student-t as a mixture of normals, John D. Cook Consulting.