What is a stable distribution?

Let X be a real-valued (non-degenerate) random variable. X is said to be stable (or have a stable distribution) if for any two independent copies X_1, X_2 of X and any constants a, b > 0,

aX_1 + bX_2 \stackrel{d}{=} cX + d

for some constants c > 0 and d. (Here, \stackrel{d}{=} denotes equal in distribution.) X is said to be strictly stable if the above holds with d = 0, and is said to be symmetric stable if X has the same distribution as -X.

Here are some other equivalent definitions for stable random variables:

  1. X is said to be stable if for any n \geq 2, there are constants c_n > 0 and d_n such that

    X_1 + X_2 + \dots + X_n \stackrel{d}{=} c_nX + d_n,

    where X_1, \dots, X_n are independent copies of X.

  2. X is said to be stable if it has a domain of attraction, i.e. if there is a sequence of i.i.d. random variables Y_1, Y_2, \dots, a sequence of positive numbers d_1, d_2, \dots and a sequence of numbers a_1, a_2, \dots such that
    \begin{aligned} \dfrac{Y_1 + Y_2 + \dots + Y_n}{d_n} + a_n \stackrel{d}{\Rightarrow} X. \end{aligned}

    (Here, \stackrel{d}{\Rightarrow} denotes convergence in distribution.)

  3. X is said to be stable if there are parameters 0 < \alpha \leq 2, -1 \leq \beta \leq 1, \mu \in \mathbb{R}, c \geq 0 such that its characteristic function has the following form:

\begin{aligned} \varphi_X(t) = \mathbb{E} \left[ \exp(itX) \right] &= \exp \left[ it\mu - |c t|^\alpha (1 - i \beta \text{sgn}(t) \Phi) \right], \end{aligned}

    where

    \begin{aligned} \Phi &= \begin{cases} \tan (\pi \alpha / 2) &\text{if } \alpha \neq 1, \\ -\frac{2}{\pi} \log |c t| &\text{if } \alpha = 1. \end{cases} \end{aligned}

    and

    \begin{aligned} \text{sgn}(t) &= \begin{cases} 1 &\text{if } t > 0, \\ 0 &\text{if } t = 0, \\ -1 &\text{if } t < 0. \end{cases} \end{aligned}

These equivalent definitions are stated in Samorodnitsky & Taqqu (1994) (Reference 2) without proof; they point the reader to Gnedenko & Kolmogorov (1954) (Reference 3).

The last definition can be thought of as a parametrization of all stable distributions. (The Wikipedia article notes that it is not the only parametrization in use.) Unfortunately, this general formula for a stable distribution’s characteristic function cannot be converted into a general formula for a stable distribution’s probability density function (at least one that can be written analytically).
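
To make the parametrization concrete, here is a small R function evaluating the characteristic function in definition 3. (This is a sketch: stableCF is our own name, not from any package, and for \alpha = 1 it requires t \neq 0.)

# Characteristic function of a stable distribution, following the
# parametrization in definition 3 above. (A sketch: `stableCF` is our own
# helper, not from any package; for alpha = 1, t must be non-zero.)
stableCF <- function(t, alpha, beta, c, mu) {
  Phi <- if (alpha != 1) tan(pi * alpha / 2) else -2 / pi * log(abs(c * t))
  exp(1i * t * mu - abs(c * t)^alpha * (1 - 1i * beta * sign(t) * Phi))
}

# Sanity check against the alpha = 2 case: should match the N(mu, 2c^2)
# characteristic function exp(i * t * mu - c^2 * t^2).
stableCF(0.5, alpha = 2, beta = 0, c = 1, mu = 0)  # approximately exp(-0.25)
exp(-0.25)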

Examples of stable distributions

Here are some special cases of stable distributions which can be deduced by inspecting the characteristic function:

  • When \alpha = 2, we get the normal distribution with mean \mu and variance \sigma^2 = 2c^2.
  • When \alpha = 1 and \beta = 0, we get the Cauchy distribution with scale parameter c and shift parameter \mu.
  • When \alpha = 1/2 and \beta = 1, we get the Lévy distribution with scale parameter c and shift parameter \mu.

Honorable mention: The Poisson distribution is not stable: if X_1 and X_2 are two independent copies of a Poisson random variable, aX_1 + bX_2 is, in general, not Poisson-distributed. However, it is Poisson-distributed if a = b = 1.

References:

  1. Wikipedia. Stable distribution.
  2. Samorodnitsky, G., and Taqqu, M. S. (1994). Stable Non-Gaussian Random Processes.
  3. Gnedenko, B. V., and Kolmogorov, A. N. (1954). Limit distributions for sums of independent random variables.

The product of doubly stochastic matrices is doubly stochastic

A matrix \mathbf{D} \in \mathbb{R}^{n \times n} with non-negative entries is said to be doubly stochastic if each of its rows and columns sum to 1, i.e.

\begin{aligned} \sum_{k=1}^n d_{jk} &= 1 \qquad \text{for all } j, \\  \sum_{j=1}^n d_{jk} &= 1 \qquad \text{for all } k.\end{aligned}

If \mathbf{A}, \mathbf{B} \in \mathbb{R}^{n \times n} are both doubly stochastic, then their product \mathbf{C} = \mathbf{AB} is also doubly stochastic. We can prove this directly by some algebra. For \mathbf{C}’s rows: for each j,

\begin{aligned} \sum_{k=1}^n c_{jk} &=  \sum_{k=1}^n \left( \sum_{\ell = 1}^n a_{j\ell} b_{\ell k} \right) \\  &= \sum_{\ell = 1}^n a_{j\ell} \left( \sum_{k=1}^n b_{\ell k} \right) \\  &= \sum_{\ell = 1}^n a_{j\ell} \\  &= 1. \end{aligned}

For \mathbf{C}’s columns: for each k,

\begin{aligned} \sum_{j=1}^n c_{jk} &=  \sum_{j=1}^n \left( \sum_{\ell = 1}^n a_{j\ell} b_{\ell k} \right) \\  &= \sum_{\ell = 1}^n b_{\ell k} \left( \sum_{j=1}^n a_{j\ell} \right) \\  &= \sum_{\ell = 1}^n b_{\ell k} \\  &= 1. \end{aligned}

One nice little connection to Markov chains: if a matrix is a right stochastic matrix (i.e. rows sum to 1), we can view it as the transition probability matrix of a Markov chain. Since \mathbf{A} and \mathbf{B} are both doubly stochastic, we can view them as transition probability matrices for Markov chains denoted \mathcal{A} and \mathcal{B}. If we think of a new process where each step consists of taking one move according to \mathcal{A} and then one move according to \mathcal{B}, our identity above says that this new process is also a Markov chain: its transition probability matrix is exactly \mathbf{C} = \mathbf{AB}, since c_{jk} = \sum_\ell a_{j\ell} b_{\ell k} is the probability of moving from state j to state k in one composite step. This establishes the right stochasticity of \mathbf{C}.

Running the same argument with the transposes, we can conclude that \mathbf{C}^\top is right stochastic, implying that \mathbf{C} is left stochastic and is hence doubly stochastic.

Depending on how obvious the Markov chain connection is to you, you could see this as a nice implication of our identity or another proof of the identity.
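
For a quick numerical sanity check of the identity, the snippet below constructs two doubly stochastic matrices, multiplies them, and inspects the row and column sums. (A sketch: MakeDoublyStochastic is our own helper, built on Sinkhorn scaling, which converges for matrices with positive entries.)

# Build an (approximately) doubly stochastic matrix from a positive matrix
# by alternately normalizing rows and columns (Sinkhorn scaling).
MakeDoublyStochastic <- function(M, iters = 100) {
  for (i in seq_len(iters)) {
    M <- M / rowSums(M)          # make rows sum to 1
    M <- t(t(M) / colSums(M))    # make columns sum to 1
  }
  M
}

set.seed(1)
n <- 4
A <- MakeDoublyStochastic(matrix(runif(n^2), n, n))
B <- MakeDoublyStochastic(matrix(runif(n^2), n, n))
C <- A %*% B
rowSums(C)  # all approximately 1
colSums(C)  # all approximately 1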

What is mean independence?

Mean independence is a relationship between two random variables that lies between the usual definition of independence and uncorrelatedness. A random variable Y is said to be mean independent of X if (and only if)

\mathbb{E}[Y \mid X = x] = \mathbb{E}[Y]

for all x such that the probability mass/density of X at x is not zero.

Mean independence comes up most often in econometrics as an assumption that is weaker than independence but stronger than uncorrelatedness. Independence implies mean independence, and mean independence implies uncorrelatedness.

Independence implies mean independence

This is the most technical argument of the lot (adapted from Reference 2). By definition of conditional expectation, \mathbb{E}[Y \mid X] is the unique \sigma(X)-measurable random variable such that for any A \in \sigma(X) (the \sigma-algebra generated by X), we have

\mathbb{E}[\mathbb{E}[Y \mid X] 1_A] = \mathbb{E}[Y 1_A].

However, by independence of X and Y,

\mathbb{E}[Y 1_A] = \mathbb{E}[Y] \mathbb{E}[1_A] = \mathbb{E}[\mathbb{E}[Y]1_A].

Hence, \mathbb{E}[Y \mid X] must be equal to the constant random variable \mathbb{E}[Y], i.e. Y is mean independent of X.

(Update: Ok here’s an easier way to see this: independence implies that the conditional distribution of Y given X = x is the same for all x. Since the conditional mean \mathbb{E}[Y \mid X = x] is a function of the conditional distribution Y \mid X = x, it follows that \mathbb{E}[Y \mid X = x] is the same for all x, and so must be equal to \mathbb{E}[Y].)

Mean independence implies uncorrelatedness

(Proof adapted from Reference 3.) Assume Y is mean independent of X. By the law of iterated expectations,

\begin{aligned} \mathbb{E}[XY] &= \mathbb{E}[\mathbb{E}[XY \mid X]] \\  &= \mathbb{E}[X \mathbb{E}[Y \mid X]] \\  &= \mathbb{E}[X \mathbb{E}[Y]] \\  &= \mathbb{E}[Y] \mathbb{E}[X],  \end{aligned}

so \text{Cov}(X, Y) = 0.

Neither converse is true, as the counterexamples below show. (Counterexamples adapted from Reference 4.)

Mean independence does not imply independence

Let X = \cos \theta and Y = \sin \theta, with \theta \sim \text{Unif}[0, 2\pi]. For any x \in (-1, 1), \mathbb{P}(Y = \sqrt{1-x^2} \mid X = x) = \mathbb{P}(Y = -\sqrt{1-x^2} \mid X = x) = 0.5, so \mathbb{E}[Y \mid X = x] = 0, i.e. Y is mean independent of X. We can similarly establish that X is mean independent of Y.

On the other hand, \mathbb{P}(X > 3/4) > 0 and \mathbb{P}(Y > 3/4) > 0, but since X^2 + Y^2 = 1 < (3/4)^2 + (3/4)^2,

\mathbb{P}(X > 3/4 \text{ and } Y > 3/4) = 0 \neq \mathbb{P}(X > 3/4 ) \mathbb{P}(Y > 3/4).

Therefore, X and Y are not independent.
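
A quick simulation (a sketch) illustrates both halves of this example:

# Simulate the cos/sin example: the correlation is ~0 (consistent with
# mean independence), but the two variables are clearly not independent.
set.seed(1)
theta <- runif(1e6, 0, 2 * pi)
x <- cos(theta)
y <- sin(theta)
cor(x, y)                      # approximately 0
mean(x > 3/4 & y > 3/4)        # exactly 0: X and Y are never both > 3/4
mean(x > 3/4) * mean(y > 3/4)  # positive, so X and Y are not independent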

Uncorrelatedness does not imply mean independence

Let X and Y be such that (X, Y) \in \{ (1, 3), (-3, 1), (-1, -3), (3, -1) \} with equal probability. With this setup,

\mathbb{E}[XY] = \mathbb{E}[X] = \mathbb{E}[Y] = 0,

so \text{Cov}(X, Y) = 0, i.e. X and Y uncorrelated. However,

\mathbb{E}[Y \mid X = 1] = 3, \quad \mathbb{E}[Y \mid X = -3] = 1,

so Y is not mean independent of X.
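
Since the joint distribution is uniform over four points, we can verify these computations by direct enumeration (a quick sketch):

# Enumerate the four equally likely outcomes and compute moments directly.
xy <- data.frame(x = c(1, -3, -1, 3), y = c(3, 1, -3, -1))
mean(xy$x * xy$y)    # E[XY] = 0
mean(xy$x)           # E[X] = 0
mean(xy$y)           # E[Y] = 0, so Cov(X, Y) = 0
xy$y[xy$x == 1]      # E[Y | X = 1] = 3
xy$y[xy$x == -3]     # E[Y | X = -3] = 1: not mean independent of X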

Mean independence is not symmetric

Both independence and uncorrelatedness are symmetric; however (maybe surprisingly) mean independence is not. Here is a simple example from StackExchange demonstrating this. Let X and Y be such that (X, Y) \in \{ (2, 1), (-2, 1), (5, -1), (-5, -1) \} with equal probability. With this setup,

\mathbb{E}[X \mid Y = 1] = \mathbb{E}[X \mid Y = -1] = 0,

so X is mean independent of Y, but

\mathbb{E}[Y \mid X = 2] = 1, \quad \mathbb{E}[Y \mid X = 5] = -1,

so Y is not mean independent of X.

References:

  1. Wikipedia. Mean dependence.
  2. Dembo, A. Stat/219 Math 136 – Stochastic Processes: Notes on Section 4.1.2.
  3. Zhang, H. Mean-independence implies zero covariance.
  4. Merx, J.-P. Mean independent and correlated variables.

Computing E[1/(a+X)] for binomial X

Theorem: For a = 1, 2, \dots, let U_{n,a} = \mathbb{E}\left[\dfrac{1}{a+X} \right] with X \sim \text{Binom}(n, p). Then

\begin{aligned} U_{n,1} &= \dfrac{1 - (1-p)^{n+1}}{(n+1)p}, \\  U_{n, a} &= \dfrac{1 - (a-1) U_{n+1, a-1}}{(n+1)p} \qquad \text{for } a \geq 2. \end{aligned}

While the theorem does not give us a single explicit formula for \mathbb{E}\left[\dfrac{1}{a+X} \right], we can write out an expression for \mathbb{E}\left[\dfrac{1}{a+X} \right] for a given a by iteratively substituting the recurrence relation into itself. (Maybe we can use that to get a single explicit formula; I didn’t try too hard.)

The theorem is easy to prove if you know the identity presented in this previous post: for a > 0,

\begin{aligned} \mathbb{E}\left[ \dfrac{1}{a + X} \right] = \int_0^1 t^{a-1} \cdot P_X(t) dt, \end{aligned}

where P_X is the probability generating function for X. Letting q = 1-p, we have P_X(t) = (q + pt)^n for X \sim \text{Binom}(n, p) (proof here). For a = 1,

\begin{aligned} U_{n, 1} = \int_0^1 t^0 (q + pt)^n dt  = \left[ \dfrac{(q+pt)^{n+1}}{(n+1)p} \right]_0^1  = \dfrac{1 - q^{n+1}}{(n+1)p}. \end{aligned}

For a \geq 2, using integration by parts,

\begin{aligned} U_{n, a} &= \int_0^1 t^{a-1} (q + pt)^n dt \\  &= \left[ t^{a-1} \cdot \dfrac{(q+pt)^{n+1}}{(n+1)p} \right]_0^1 - \int_0^1 \dfrac{(q+pt)^{n+1}}{(n+1)p} \cdot (a-1)t^{a-2} dt \\  &= \left[ \dfrac{1}{(n+1)p} - 0 \right] - \dfrac{a-1}{(n+1)p} \int_0^1 t^{(a-1)-1} (q+pt)^{n+1} dt \\  &= \dfrac{1 - (a-1) U_{n+1, a-1}}{(n+1)p},  \end{aligned}

as required.

We can verify this recurrence relation via simulation. Below are two R functions:

# Computes U_{n,a} = E[1/(a+X)] exactly using the recurrence relation
GetExpectation <- function(n, p, a) {
  if (a == 1) {
    (1 - (1-p)^(n+1)) / ((n+1) * p)
  } else {
    (1 - (a - 1) * GetExpectation(n + 1, p, a - 1)) / ((n+1) * p)
  }
}

# Estimates E[1/(a+X)] for X ~ Binom(n, p) by Monte Carlo simulation
SimulateExpectation <- function(n, p, a, nSim = 1e6) {
  X <- rbinom(nSim, size = n, prob = p)
  mean(1 / (X + a))
}

The first computes U_{n,a} exactly with the recurrence relation, while the second does a Monte Carlo simulation. Here is the code that verifies the identity for n = 5, p = 0.7 and a = 4:

set.seed(1)
n <- 5
p <- 0.7
a <- 4
GetExpectation(n, p, a)
# [1] 0.1361207
SimulateExpectation(n, p, a)
# [1] 0.1361158

Computing E[1/(1+X)] with the probability generating function

I came across the following result recently when googling around for something else: Let X be a discrete random variable taking values in \{ 0, 1, 2, \dots \}, and let P_X be its probability generating function, i.e.

\begin{aligned} P_X(t) = \mathbb{E}[t^X] = \sum_{x=0}^\infty t^x \mathbb{P} \{ X = x\}. \end{aligned}

Then for a > 0, we have

\begin{aligned} \mathbb{E}\left[ \dfrac{1}{a + X} \right] = \int_0^1 t^{a-1} \cdot P_X(t) dt. \end{aligned}

I learnt of this identity from one of the answers in this link, where the identity was used to compute \mathbb{E}\left[ \frac{1}{1+X} \right] for X \sim \text{Binom}(n, p).

The proof of the identity amounts to the switching of a sum and an integral, which can be justified using Tonelli’s theorem (what follows may not be super rigorous, but it should check out):

\begin{aligned} \int_0^1 t^{a-1} \cdot P_X(t) dt &= \int_0^1 t^{a-1} \cdot \left( \sum_{x=0}^\infty t^x \mathbb{P} \{ X = x\} \right) dt \\  &= \sum_{x=0}^\infty \mathbb{P} \{ X = x\} \left( \int_0^1 t^{x + a - 1} dt \right) \\  &= \sum_{x=0}^\infty \mathbb{P} \{ X = x\} \left[ \dfrac{t^{x+a}}{x+a} \right]_0^1 \\  &= \sum_{x=0}^\infty \mathbb{P} \{ X = x\} \dfrac{1}{x+a} \\  &= \mathbb{E}\left[\dfrac{1}{X+a} \right]. \end{aligned}
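
As a quick numerical check (a sketch), we can compare the two sides of the identity for X \sim \text{Poisson}(\lambda), whose probability generating function is P_X(t) = e^{\lambda(t-1)}:

# Left: Monte Carlo estimate of E[1/(a+X)] for X ~ Poisson(2) and a = 3.
# Right: numerical integral of t^(a-1) * P_X(t) over [0, 1].
set.seed(1)
lambda <- 2
a <- 3
mean(1 / (a + rpois(1e6, lambda)))
integrate(function(t) t^(a - 1) * exp(lambda * (t - 1)), 0, 1)$value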

How heavy-tailed is the t distribution? (Part 2)

In this previous post, we explored how heavy-tailed the t distribution is through the question: “What is the probability that the random variable is at least x standard deviations (SDs) away from the mean?” For the most part, the smaller the degrees of freedom, the larger this probability was (more “heavy-tailed”), until we realized that the trend reversed for really small degrees of freedom (2.1 in the post). In fact, for 1 < df \leq 2, the variance of the t distribution is infinite, and so the random variable is always within 1 SD of the mean!

We need another way to think about heavy-tailedness. (The code to produce the figures in this post is available here.)

A first approach that doesn’t work

You might be wondering, why didn’t I just plot P\{ T > x \} with T \sim t_{df} against x, for various values of x and df? If I did that, I would have ended up with the plot below (for the log of the probabilities):

That seems to be exactly what we want: the smaller the degrees of freedom, the slower this probability decays…

The problem is that the comparison above ignores the scale of the random variables. Imagine if we tried to make the plot above, but instead of plotting lines for the t distribution with different degrees of freedom, let’s plot it for the normal distribution with different standard deviations. This is what we would get:

That seems to give the same trend as the plot before! Can we then conclude that the \mathcal{N}(0, 10^2) distribution is more heavy-tailed than the \mathcal{N}(0, 1) distribution??

One way to incorporate scale

The discussion above illustrates the need to take scale into account. We tried to do this in the previous post by scaling each distribution by its own SD, but that idea broke down for small degrees of freedom.

Here’s an idea: Pick some threshold x'. For each random variable T, find the scale factor k such that P \{ kT > x' \} = P \{ \mathcal{N}(0, 1) > x' \}. For this value of k, kT and \mathcal{N}(0, 1) are on the same scale w.r.t. this threshold. We then compare the tail probabilities of kT and \mathcal{N}(0, 1) (instead of T and \mathcal{N}(0, 1)).

Finding k is not hard: here’s a three-line function that does it for the t distribution in R:

getScaleFactor <- function(df, threshold) {
  tailProb <- pnorm(threshold, lower.tail = FALSE)        # P(N(0,1) > threshold)
  tQuantile <- qt(tailProb, df = df, lower.tail = FALSE)  # q with P(T > q) = tailProb
  return(threshold / tQuantile)                           # k such that P(kT > threshold) = tailProb
}
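
For example (a sketch), with threshold 3 and df = 5, we can check that the scaled t tail probability matches the standard normal’s at the threshold:

# P{kT > 3} = P{T > 3/k}, which should equal P{N(0,1) > 3} by construction.
k <- getScaleFactor(df = 5, threshold = 3)
pt(3 / k, df = 5, lower.tail = FALSE)  # P(kT > 3)
pnorm(3, lower.tail = FALSE)           # P(N(0,1) > 3): same value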

Let’s plot the log10 of the tail probability P \{ kT > x \} with T \sim t_{df} against x for various values of x and df, with the scale factor k computed as above:

By definition, the tail probabilities will coincide when x is equal to the threshold used to compute the scale factors. We now see a clear trend with no breakdown: for smaller values of df, the tail probability P \{ kT > x \} is larger.

Another side benefit of this way to looking at tail probabilities is that we can now compare distributions which have infinite variance, or even an undefined mean (like the Cauchy distribution, which is the t distribution with one degree of freedom)! Here is the same plot as above but for smaller degrees of freedom:

How heavy-tailed is the t distribution?

It’s well-known that the t distribution has heavier tails than the normal distribution, and the smaller the degrees of freedom, the more “heavy-tailed” it is. At one degree of freedom, the t distribution is exactly the Cauchy distribution, and as the degrees of freedom go to infinity, it converges to the normal distribution.

One way to measure the “heavy-tailedness” of a distribution is by computing the probability of the random variable taking a value that is at least x standard deviations (SD) away from its mean. The larger those probabilities are, the more heavy-tailed a distribution is.

The code below computes the (two-sided) tail probabilities for the t distribution for a range of degree of freedom values. Because the probabilities are so small, we compute the log10 of these probabilities instead. Hence, a value of -3 corresponds to a probability of 10^{-3}, or a 1-in-1,000 chance.

library(ggplot2)

dfVal <- c(Inf, 100, 50, 30, 10, 5, 3, 2.1)
sdVal <- 1:10

tbl <- lapply(dfVal, function(df) {
  stdDev <- if (is.infinite(df)) 1 else sqrt(df / (df - 2))
  data.frame(df = df,
             noSD = sdVal,
             log10Prob = log10(2 * pt(-(sdVal) * stdDev, df = df)))
})

tbl <- do.call(rbind, tbl)
tbl$df <- factor(tbl$df)

ggplot(tbl, aes(x = noSD, y = log10Prob, col = df)) +
  geom_line(size = 1) +
  scale_color_brewer(palette = "Spectral", direction = 1) +
  labs(x = "No. of SD", y = "log10(Probability of being >= x SD from mean)",
       title = "Tail probabilities for t distribution",
       col = "Deg. of freedom") +
  theme_bw()

Don’t be fooled by the scale on the vertical axis! For a t distribution with 3 degrees of freedom, the probability of being 10 SD out is about 1-in-2,400. For a normal distribution (infinite degrees of freedom in the figure), that same probability is about 1-in-65,000,000,000,000,000,000,000! (That’s 65 followed by 21 zeros. As a comparison, the number of stars in the universe is estimated to be around 10^24, or 1 followed by 24 zeros.)

If you look closely at the figure, you might notice something a little odd with df = 2.1: it seems that for any number of SDs, the probability of being that number of SD out for df = 2.1 is lower than that for df = 3. Does that mean that df = 2.1 is less heavy-tailed than df = 3?

Not necessarily. A t distribution with \nu degrees of freedom has SD \sqrt{\nu / (\nu - 2)}. For \nu = 3, the SD is about 1.73 while for \nu = 2.1 the SD is about 4.58, much larger! Taking this to the extreme, consider a t distribution with \nu = 2. The variance is infinite in this case, so the random variable always takes values within 1 SD of the mean! Does it mean that this distribution is less heavy tailed than the normal distribution?

Looks like we might need another way to define heavy-tailedness!

Update (2021-11-06): This blog post contains a nice discussion on some of the weirdness we see when the degrees of freedom for the t distribution is between 2 and 3.

Generating random draws from the Dir(1, 1, …, 1) distribution using the uniform distribution

Fix an integer K \geq 2 and parameters \alpha_1, \dots, \alpha_K > 0, commonly written as a vector \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K). \boldsymbol{X} = (X_1, \dots, X_K) has the Dirichlet distribution (we write \boldsymbol{X} \sim \text{Dir}(\boldsymbol{\alpha})) if it has support on the probability simplex \left\{ (x_1, \dots, x_K) \in \mathbb{R}^K: x_i \geq 0 \text{ and } \sum_{i=1}^K x_i = 1 \right\} and its probability density function (PDF) satisfies

\begin{aligned} f(x_1, \dots, x_K) &\propto \prod_{i=1}^K x_i^{\alpha_i - 1}. \end{aligned}

The \text{Dir}(1, 1, \dots, 1) distribution is special because its PDF is constant over its support:

\begin{aligned} f(x_1, \dots, x_K) \propto 1. \end{aligned}

The following lemma demonstrates one way to generate random draws from the \text{Dir}(1, 1, \dots, 1) distribution:

Lemma. Let U_1, \dots, U_{K-1} be i.i.d. \text{Unif}[0,1] random variables, and let U_{(1)}, \dots, U_{(K-1)} denote the order statistics of U_1, \dots, U_{K-1}. Then

\begin{aligned} \left( U_{(1)} - 0, U_{(2)} - U_{(1)}, \dots, 1 - U_{(K-1)} \right)  \sim \text{Dir}(1, 1, \dots, 1). \end{aligned}

(See Section 2 of Reference 1 for various ways to generate samples from the Dirichlet distribution for arbitrary \boldsymbol{\alpha}.)

The proof of the lemma starts with the following theorem, which gives the joint density of all the order statistics:

Theorem. Let X_1, \dots, X_n be i.i.d. random variables with PDF f. Then the joint density of the order statistics X_{(1)}, \dots, X_{(n)} is

\begin{aligned} f_{X_{(1)}, \dots, X_{(n)}}(y_1, \dots, y_n) = n! f(y_1) \dots f(y_n) 1 \{ y_1 < y_2 < \dots < y_n \}. \end{aligned}

(This is Theorem 6.1 of Reference 2; you can find a proof for this theorem there.) Applying this theorem to U_1, \dots, U_{K-1}, we have

\begin{aligned} f_{U_{(1)}, \dots, U_{(K-1)}}(y_1, \dots, y_{K-1}) = (K-1)! \, 1 \{ 0 \leq y_1 < y_2 < \dots < y_{K-1} \leq 1 \}. \end{aligned}

The lemma then follows by applying a change of variables to the PDF (see this link for the steps one needs to carry out for the change of variables).
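
In R, the lemma translates into a short function (a sketch; rDirichletFlat is our own name, not from any package):

# Draw from Dir(1, ..., 1) on the K-simplex: the gaps between the sorted
# uniforms (together with the endpoints 0 and 1) are the coordinates.
rDirichletFlat <- function(K) {
  diff(c(0, sort(runif(K - 1)), 1))
}

set.seed(1)
x <- rDirichletFlat(4)
x       # a point on the probability simplex
sum(x)  # equal to 1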

References:

  1. Frigyik, B. A., et. al. (2010). Introduction to the Dirichlet Distribution and Related Processes.
  2. DasGupta, A. Finite Sample Theory of Order Statistics and Extremes.

Marginal distributions of the Dirichlet distribution and the aggregation property

Introducing the Dirichlet distribution

Fix an integer K \geq 2 and parameters \alpha_1, \dots, \alpha_K > 0, commonly written as a vector \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K). The Dirichlet distribution is the continuous probability distribution having support on the probability simplex \left\{ (x_1, \dots, x_K) \in \mathbb{R}^K: x_i \geq 0 \text{ and } \sum_{i=1}^K x_i = 1 \right\} and whose probability density function is given by

\begin{aligned} f(x_1, \dots, x_K) = \dfrac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^K x_i^{\alpha_i - 1}. \end{aligned}

(Here, B(\boldsymbol{\alpha}) is the multivariate beta function; it acts as the normalizing constant so that the PDF integrates to 1 over its support.) If \boldsymbol{X} = (X_1, \dots, X_K) has the Dirichlet distribution, we write \boldsymbol{X} \sim \text{Dir}(\boldsymbol{\alpha}).

The Dirichlet distribution is a multivariate generalization of the beta distribution, which corresponds to the distribution of X_1 when K = 2.

Marginal distributions of the Dirichlet distribution

The connection between the Dirichlet and beta distributions goes beyond the multivariate generalization. It turns out that the marginal distributions of the Dirichlet distribution are beta distributions. Specifically, letting \alpha_0 = \sum_{i=1}^K \alpha_i, we have

\begin{aligned} X_i \sim \text{Beta}(\alpha_i, \alpha_0 - \alpha_i). \end{aligned}

This result follows directly from the lemma below (also known as the aggregation property of the Dirichlet distribution); just apply it repeatedly to any two elements not including x_i until just two elements are left.

Lemma (Aggregation property). Assume (X_1, \dots, X_K) \sim \text{Dir}(\alpha_1, \dots, \alpha_K). If the random variables with subscripts i and j are dropped from the vector and replaced by their sum, then the resulting random variable has a Dirichlet distribution as well:

\begin{aligned} (X_1, \dots, X_i + X_j, \dots, X_K) \sim \text{Dir}(\alpha_1, \dots, \alpha_i + \alpha_j, \dots, \alpha_K). \end{aligned}

Proof of the aggregation property

It remains to prove this lemma. Without loss of generality, we just have to prove it for (i, j) = (1, 2). Compute the PDF of (X_1 + X_2, X_3, \dots, X_K) directly:

\begin{aligned} f(x_1', x_3, \dots, x_K) &= \int_0^{x_1'} \dfrac{1}{B(\boldsymbol{\alpha})} u^{\alpha_1 - 1} (x_1' - u)^{\alpha_2 - 1} \prod_{i=3}^K x_i^{\alpha_i - 1} du \\  &\propto \prod_{i=3}^K x_i^{\alpha_i - 1} \int_0^{x_1'} u^{\alpha_1 - 1} (x_1' - u)^{\alpha_2 - 1} du. \end{aligned}

Applying a change of variables v = \dfrac{u}{x_1'},

\begin{aligned} f(x_1', x_3, \dots, x_K) &\propto \prod_{i=3}^K x_i^{\alpha_i - 1} \int_0^1 (x_1' v)^{\alpha_1 - 1} [x_1'(1 - v)]^{\alpha_2 - 1} (x_1')dv \\  &= (x_1')^{\alpha_1 + \alpha_2 - 1} \prod_{i=3}^K x_i^{\alpha_i - 1} \int_0^1 v^{\alpha_1 - 1} (1 - v)^{\alpha_2 - 1} dv \\  &\propto (x_1')^{\alpha_1 + \alpha_2 - 1} \prod_{i=3}^K x_i^{\alpha_i - 1}, \end{aligned}

which implies that (X_1 + X_2, X_3, \dots, X_K) has the \text{Dir}(\alpha_1 + \alpha_2, \alpha_3, \dots, \alpha_K) distribution.
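
We can check the marginal result numerically (a sketch) by sampling from the Dirichlet distribution via the standard construction that normalizes independent Gamma random variables, then comparing the first coordinate against the claimed Beta marginal:

# X_1 should be Beta(alpha_1, alpha_0 - alpha_1): check for alpha = (2, 3, 4).
set.seed(1)
alpha <- c(2, 3, 4)
n <- 1e5
g <- matrix(rgamma(n * length(alpha), shape = rep(alpha, each = n)),
            nrow = n)
x1 <- g[, 1] / rowSums(g)                            # first Dirichlet coordinate
c(mean(x1), alpha[1] / sum(alpha))                   # empirical vs Beta(2, 7) mean
ks.test(x1, pbeta, alpha[1], sum(alpha) - alpha[1])  # should not reject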

Simulating the dice game “Toss Up!” in R

Toss Up! is a very simple dice game that I’ve always wanted to simulate but never got around to doing so (until now!). This post outlines how to simulate a Toss Up! game in R, as well as how to evaluate the effectiveness of different game strategies. All the code for this blog post is available here.

Rules

The official rules for Toss Up! are available here. Here is an abbreviated version:

  • The game uses 10 six-sided dice. Each die has 3 green sides, 2 yellow sides and 1 red side.
  • For each turn, a player rolls all 10 dice. If any greens are rolled, set them aside.
    • At this point, the player can choose to end the turn and “bank” the points (one point for one green), or keep rolling the remaining dice.
    • After every roll, greens are set aside. If you roll enough times to make all 10 dice green, you have 10 points and you can either bank or roll all 10 dice again.
    • A player can keep going until he/she decides to stop rolling and bank all the points earned on this turn.
    • Catch: If, on a roll, none of the dice are green and at least one is red, the turn ends and no points can be banked.
  • First player to reach 100 points wins.

Simulating Toss Up!

There are several different ways to implement Toss Up!: I describe my (two-player) version below. To allow greater flexibility in the resulting code, I implement the game with three global constants that the programmer can change:

  1. NUM_DICE: The number of dice used in the game (default is 10).
  2. FACE_PROBS: A numeric vector of length 3 denoting the probability that a die comes up red, yellow or green respectively (default is c(1/6, 2/6, 3/6)).
  3. POINTS_TO_WIN: The number of points a player needs to obtain to win the game (default is 100).

Step 1: How can we describe the state of a game?

The state of the game can be encapsulated with 4 pieces of information:

  1. current_player: whose turn it is to make a decision.
  2. scores: a vector of length 2 containing the scores for players 1 and 2 respectively.
  3. turn_points: Number of points scored on the current turn so far (these points have not been banked yet).
  4. dice_rolls: A vector of variable length denoting the outcome of the dice rolls (0: red, 1: yellow, 2: green).

In my implementation, the state is stored as a list with the 4 elements above.
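
For concreteness, here is what one such state might look like (a hypothetical example):

# A hypothetical mid-game state: player 1 to act, scores 12 and 8, four
# unbanked points this turn, and five dice still in play
# (0 = red, 1 = yellow, 2 = green).
state <- list(current_player = 1,
              scores = c(12, 8),
              turn_points = 4,
              dice_rolls = c(2, 0, 1, 1, 2))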

Step 2: Updating the state

From a given state, there are 3 possibilities for what comes next:

  1. The dice rolls show no greens and at least one red. In this case, the current player’s turn is over. We need to change current_player, reset turn_points to 0, and re-roll the dice (NUM_DICE of them).
  2. The dice rolls either (i) have at least one green, or (ii) have no reds. In this case, the current player has a choice of what to do.
    1. If the player chooses to bank, then we need to update scores, reset turn_points to 0, change current_player and re-roll the dice (NUM_DICE of them).
    2. If the player chooses to roll, then we need to update turn_points and re-roll just the dice that were not green.

The function updateState below does all of the above. I have also added a verbose option which, if set to TRUE, prints information on the game state to the console. (The function DiceToString is a small helper function for printing out the dice rolls.)

DiceToString <- function(dice_rolls) {
  return(paste(sum(dice_rolls == 2), "Green,",
               sum(dice_rolls == 1), "Yellow,",
               sum(dice_rolls == 0), "Red"))
}

# Executes the current player's action and updates the state. Is also used if
# no greens and at least one red is rolled.
UpdateState <- function(state, action, verbose = FALSE) {
  if (verbose) {
    cat("Current roll:", DiceToString(state$dice_rolls))
    if (action != "rolled at least 1 red and no green")
      cat(" (bank", state$turn_points + sum(state$dice_rolls == 2), "pts?)",
          fill = TRUE)
    else
      cat("", fill = TRUE)
    cat(paste0("Player ", state$current_player, " ", action, "s"),
        fill = TRUE)
  }
    
  if (action == "bank") {
    # action = "bank": bank the points current player earned this turn, then
    # re-roll the dice.
    state$scores[state$current_player] <- state$scores[state$current_player] +
      state$turn_points + sum(state$dice_rolls == 2)
    state$turn_points <- 0
    state$dice_rolls <- sample(0:2, size = NUM_DICE, replace = TRUE, 
                               prob = FACE_PROBS)
    state$current_player <- 3 - state$current_player
  } else if (action == "roll") {
    # action = "roll": add the green dice to points earned this turn, then 
    # re-roll the remaining dice.
    state$turn_points <- state$turn_points + sum(state$dice_rolls == 2)
    ndice <- sum(state$dice_rolls != 2)
    if (ndice == 0) ndice <- NUM_DICE
    state$dice_rolls <- sample(0:2, size = ndice, replace = TRUE, 
                               prob = FACE_PROBS)
  } else if (action == "rolled at least 1 red and no green") {
    # action = "rolled at least 1 red and no green": 
    # no points banked, turn ends, re-roll dice.
    state$turn_points <- 0
    state$dice_rolls <- sample(0:2, size = NUM_DICE, replace = TRUE, 
                               prob = FACE_PROBS)
    state$current_player <- 3 - state$current_player
  } else {
    stop("action must be 'bank', 'roll', or 'rolled at least 1 red and no green'")
  }
  
  if (verbose) {
    if (action != "roll") {
      cat("Current scores:", state$scores, fill = TRUE)
      cat("", fill = TRUE)
    }
  }
  
  return(state)
}

Step 3: How to express player behavior and strategy?

We can think of a player as a black box which takes a game state as an input and outputs a decision, “bank” or “roll”. In other words, the player is a function!

Below are two extremely simple strategies expressed as functions. The first strategy simply banks after the first roll. The second strategy banks once the turn yields more than target points.

# strategy that stops after one roll
OneRoll <- function(state) {
  return("bank")
}

# strategy that stops only when the turn yields > `target` points
OverTargetPoints <- function(state, target = 10) {
  if (state$turn_points + sum(state$dice_rolls == 2) > target) {
    return("bank")
  } else {
    return("roll")
  }
}

(Note: In my implementation, strategy functions should assume that they are current_player: that is how they can determine their current score and that of their opponent correctly.)

Step 4: Simulating a full game

We can now put the pieces from the previous steps together to simulate a full game of Toss Up!. The SimulateGame function takes in two strategy functions as input. It sets up the initial state, then updates the state repeatedly until the game ends.

SimulateGame <- function(Strategy1, Strategy2, verbose = FALSE) {
  # set up initial state
  state <- list(current_player = 1,
                scores = c(0, 0),
                turn_points = 0,
                dice_rolls = sample(0:2, size = NUM_DICE, replace = TRUE, 
                                    prob = FACE_PROBS))
  
  # check if no greens and at least one red, if so change player
  while (sum(state$dice_rolls == 2) == 0 && sum(state$dice_rolls == 0) > 0) {
    state <- UpdateState(state, "rolled at least 1 red and no green", verbose)
  }
  
  # while the game has not ended:
  # - get the next action from the current player's strategy
  # - update the state
  while (max(state$scores) < POINTS_TO_WIN) {
    if (state$current_player == 1) {
      action <- Strategy1(state)
    } else {
      action <- Strategy2(state)
    }
    state <- UpdateState(state, action, verbose)
    
    # check if no greens and at least one red
    while (sum(state$dice_rolls == 2) == 0 && sum(state$dice_rolls == 0) > 0) {
      state <- UpdateState(state, "rolled at least 1 red and no green", verbose)
    }
  }
  
  # game has ended: return winner
  if (verbose) {
    cat(paste("Game ends: Player", which.max(state$scores), "wins!"),
        fill = TRUE)
  }
  return(which.max(state$scores))
}

Two examples of simulated games

The code below shows what a simulated game of Toss Up! might look like. In the code snippet below, players need 20 points to win. The first player uses the super conservative strategy of banking immediately, while the second player tries to win the entire game in one turn.

NUM_DICE <- 10
FACE_PROBS <- c(1/6, 2/6, 3/6)
POINTS_TO_WIN <- 20
set.seed(1)
winner <- SimulateGame(OneRoll,
                       function(state) { OverTargetPoints(state, 19) },
                       verbose = TRUE)

# Current roll: 4 Green, 3 Yellow, 3 Red (bank 4 pts?)
# Player 1 banks
# Current scores: 4 0
# 
# Current roll: 5 Green, 4 Yellow, 1 Red (bank 5 pts?)
# Player 2 rolls
# Current roll: 3 Green, 1 Yellow, 1 Red (bank 8 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 0 Red (bank 10 pts?)
# Player 2 rolls
# Current roll: 5 Green, 4 Yellow, 1 Red (bank 15 pts?)
# Player 2 rolls
# Current roll: 2 Green, 3 Yellow, 0 Red (bank 17 pts?)
# Player 2 rolls
# Current roll: 0 Green, 3 Yellow, 0 Red (bank 17 pts?)
# Player 2 rolls
# Current roll: 2 Green, 1 Yellow, 0 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 0 Green, 1 Yellow, 0 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 0 Green, 1 Yellow, 0 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 1 Green, 0 Yellow, 0 Red (bank 20 pts?)
# Player 2 banks
# Current scores: 4 20
# 
# Game ends: Player 2 wins!

In this particular simulation, it paid off to be a bit more risk-taking. As the simulation below shows, that’s not always the case.

NUM_DICE <- 10
FACE_PROBS <- c(0.5, 0.0, 0.5)
POINTS_TO_WIN <- 20
set.seed(1)
winner <- SimulateGame(OneRoll,
                       function(state) { OverTargetPoints(state, 19) },
                       verbose = TRUE)

# Current roll: 6 Green, 0 Yellow, 4 Red (bank 6 pts?)
# Player 1 banks
# Current scores: 6 0
# 
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 3 Red (bank 7 pts?)
# Player 2 rolls
# Current roll: 0 Green, 0 Yellow, 3 Red
# Player 2 rolled at least 1 red and no greens
# Current scores: 6 0
# 
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 1 banks
# Current scores: 11 0
# 
# Current roll: 7 Green, 0 Yellow, 3 Red (bank 7 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 1 Red (bank 9 pts?)
# Player 2 rolls
# Current roll: 1 Green, 0 Yellow, 0 Red (bank 10 pts?)
# Player 2 rolls
# Current roll: 3 Green, 0 Yellow, 7 Red (bank 13 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 5 Red (bank 15 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 3 Red (bank 17 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 1 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 0 Green, 0 Yellow, 1 Red
# Player 2 rolled at least 1 red and no greens
# Current scores: 11 0
# 
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 1 banks
# Current scores: 16 0
# 
# Current roll: 4 Green, 0 Yellow, 6 Red (bank 4 pts?)
# Player 2 rolls
# Current roll: 4 Green, 0 Yellow, 2 Red (bank 8 pts?)
# Player 2 rolls
# Current roll: 1 Green, 0 Yellow, 1 Red (bank 9 pts?)
# Player 2 rolls
# Current roll: 0 Green, 0 Yellow, 1 Red
# Player 2 rolled at least 1 red and no greens
# Current scores: 16 0
# 
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 1 banks
# Current scores: 21 0
# 
# Game ends: Player 1 wins!

Comparing some simple strategies

We can’t tell whether one strategy is better than another by looking at a single game since there is so much randomness involved. What we should do is simulate many games and see which strategy wins out over the long run.

OneRoll vs. OneRoll

Let’s do a sanity check. Here, we run 10,000 games of the strategy OneRoll vs. itself (this takes around 4 seconds on my machine). Since the strategy for both players is the same, we might expect player 1 to win around 50% of the time, right?

NUM_DICE <- 10
FACE_PROBS <- c(1/6, 2/6, 3/6)
POINTS_TO_WIN <- 100

nsim <- 10000
set.seed(1)
winners <- replicate(nsim, SimulateGame(OneRoll, OneRoll))
table(winners)  # not 50-50: player who starts first has an advantage

# winners
# 1    2 
# 5910 4090 

It looks like player 1 wins around 59% of the time! On second thought, we would expect player 1 to win more because he/she has a first mover advantage: if player 1 reaches 100 points, he/she wins even if player 2 might reach 100 points on the very next turn.

To make this sanity check work, we should have the players each go first half the time. The code below achieves that:

# for even numbered simulations, let player 2 start first
winners2 <- winners
winners2[2 * 1:(nsim/2)] <- 3 - winners2[2 * 1:(nsim/2)]
table(winners2)

# winners2
# 1    2 
# 5030 4970 

Now player 1 wins just 50.3% of the time, which is much closer to the expected 50%.

OneRoll vs. >20 points

Next, let’s compare the “one roll” strategy with the strategy which stops once the player can bank more than 20 points for the turn:

set.seed(1)
winners <- replicate(nsim, SimulateGame(
  OneRoll,
  function(state) { OverTargetPoints(state, 20) }))
table(winners)

# winners
# 1    2 
# 75 9925 

Wow! The strategy of banking only when >20 points have been earned on a turn wins almost all the time, even though it doesn’t have the first mover advantage!

>20 points vs. >50 points

Taking a bit more risk than always banking seems to pay off. What about even more risk? How does banking after >20 points compare with banking after >50 points?

set.seed(1)
winners <- replicate(nsim, SimulateGame(
  function(state) { OverTargetPoints(state, 20) },
  function(state) { OverTargetPoints(state, 50) }))
table(winners)

# winners
# 1    2 
# 6167 3833 

# switch order of players
set.seed(1)
winners <- replicate(nsim, SimulateGame(
  function(state) { OverTargetPoints(state, 50) },
  function(state) { OverTargetPoints(state, 20) }))
table(winners)

# winners
# 1    2 
# 4414 5586 

The >20 points strategy won 61.7% of the games when it went first, and won 55.9% of the games when it went second, so it’s clearly a superior strategy.

Where can we go from here?

There are several future directions we could take with this code base.

  • The strategies presented here are very simple. Can we program more complex strategies (e.g. some that depend on the score of the other player, or on how close the player is to winning)?
  • Among strategies of the form “bank once you have more than xx points”, is there an optimal value for xx? (A starting point is sketched after this list.)
  • What is the optimal strategy for Toss Up!? Can we learn it through methods such as Q-learning?
  • Is there even such a thing as an optimal strategy? Do there exist 3 strategies such that A beats B, B beats C, but C beats A?
  • Can we quantify the first mover advantage?
  • Can we modify the code so that we can visualize the players’ scores as the game progresses?
  • Can we modify the code so that more than 2 players can play?
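
On the second question, here is a rough sketch of how one might sweep over target values, pitting each candidate against the >20 baseline (win rates above 0.5 suggest a better target; a fuller experiment would also swap the player order to remove the first-mover advantage):

# Estimate the win rate of "bank once > target points" (going first)
# against the "> 20 points" baseline. Uses the constants and functions
# defined earlier in this post; small nsim keeps the run time manageable.
CompareTargets <- function(target, baseline = 20, nsim = 1000) {
  winners <- replicate(nsim, SimulateGame(
    function(state) { OverTargetPoints(state, target) },
    function(state) { OverTargetPoints(state, baseline) }))
  mean(winners == 1)
}

set.seed(1)
sapply(c(10, 20, 30, 40, 50), CompareTargets)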