Generating random draws from the Dir(1, 1, …, 1) distribution using the uniform distribution

Fix an integer K \geq 2 and parameters \alpha_1, \dots, \alpha_K > 0, commonly written as a vector \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K). A random vector \boldsymbol{X} = (X_1, \dots, X_K) has the Dirichlet distribution (we write \boldsymbol{X} \sim \text{Dir}(\boldsymbol{\alpha})) if it has support on the probability simplex \left\{ (x_1, \dots, x_K) \in \mathbb{R}^K: x_i \geq 0 \text{ and } \sum_{i=1}^K x_i = 1 \right\} and its probability density function (PDF) satisfies

\begin{aligned} f(x_1, \dots, x_K) &\propto \prod_{i=1}^K x_i^{\alpha_i - 1}. \end{aligned}

The \text{Dir}(1, 1, \dots, 1) distribution is special because its PDF is constant over its support:

\begin{aligned} f(x_1, \dots, x_K) \propto 1. \end{aligned}

The following lemma demonstrates one way to generate random draws from the \text{Dir}(1, 1, \dots, 1) distribution:

Lemma. Let U_1, \dots, U_{K-1} \stackrel{i.i.d.}{\sim} \text{Unif}[0,1], and let U_{(1)}, \dots, U_{(K-1)} denote the order statistics of U_1, \dots, U_{K-1}. Then

\begin{aligned} \left( U_{(1)} - 0, U_{(2)} - U_{(1)}, \dots, 1 - U_{(K-1)} \right)  \sim \text{Dir}(1, 1, \dots, 1). \end{aligned}

(See Section 2 of Reference 1 for various ways to generate samples from the Dirichlet distribution for arbitrary \boldsymbol{\alpha}.)
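
Here is a minimal R sketch of the lemma in action (the helper name rdirichlet_flat is mine, purely for illustration):

# one draw from Dir(1, ..., 1): successive differences of sorted uniforms
rdirichlet_flat <- function(K) {
  u <- sort(runif(K - 1))
  diff(c(0, u, 1))  # (U_(1) - 0, U_(2) - U_(1), ..., 1 - U_(K-1))
}

set.seed(1)
rdirichlet_flat(4)  # entries are non-negative and sum to 1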

The proof of the lemma starts with the following theorem, which gives the joint density of all the order statistics:

Theorem. Let X_1, \dots, X_n be i.i.d. random variables with PDF f. Then the joint density of the order statistics X_{(1)}, \dots, X_{(n)} is

\begin{aligned} f_{X_{(1)}, \dots, X_{(n)}}(y_1, \dots, y_n) = n! f(y_1) \dots f(y_n) 1 \{ y_1 < y_2 < \dots < y_n \}. \end{aligned}

(This is Theorem 6.1 of Reference 2; you can find a proof there.) Applying this theorem to U_1, \dots, U_{K-1}, we have

\begin{aligned} f_{U_{(1)}, \dots, U_{(K-1)}}(y_1, \dots, y_{K-1}) = (K-1)! \, 1 \{ 0 \leq y_1 < y_2 < \dots < y_{K-1} \leq 1 \}. \end{aligned}

The lemma then follows by applying a change of variables to this joint PDF; the steps are sketched below.
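
Concretely, consider the map from the order statistics to the spacings, x_1 = y_1 and x_i = y_i - y_{i-1} for i = 2, \dots, K-1. Its inverse is y_i = x_1 + \dots + x_i, a linear map whose matrix is lower-triangular with ones on the diagonal, so the Jacobian determinant is 1. Hence

\begin{aligned} f_{X_1, \dots, X_{K-1}}(x_1, \dots, x_{K-1}) = (K-1)! \, 1 \left\{ x_i > 0 \text{ for all } i \text{ and } \sum_{i=1}^{K-1} x_i < 1 \right\}, \end{aligned}

which is constant. Setting X_K = 1 - \sum_{i=1}^{K-1} X_i places the vector (X_1, \dots, X_K) on the probability simplex with a constant density, i.e. the \text{Dir}(1, 1, \dots, 1) distribution.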

References:

  1. Frigyik, B. A., et al. (2010). Introduction to the Dirichlet Distribution and Related Processes.
  2. DasGupta, A. Finite Sample Theory of Order Statistics and Extremes.

Marginal distributions of the Dirichlet distribution and the aggregation property

Introducing the Dirichlet distribution

Fix an integer K \geq 2 and parameters \alpha_1, \dots, \alpha_K > 0, commonly written as a vector \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K). The Dirichlet distribution is the continuous probability distribution having support on the probability simplex \left\{ (x_1, \dots, x_K) \in \mathbb{R}^K: x_i \geq 0 \text{ and } \sum_{i=1}^K x_i = 1 \right\} and whose probability density function is given by

\begin{aligned} f(x_1, \dots, x_K) = \dfrac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^K x_i^{\alpha_i - 1}. \end{aligned}

(Here, B(\boldsymbol{\alpha}) is the multivariate beta function; it acts as the normalizing constant so that the PDF integrates to 1 over its support.) If \boldsymbol{X} = (X_1, \dots, X_K) has the Dirichlet distribution, we write \boldsymbol{X} \sim \text{Dir}(\boldsymbol{\alpha}).

The Dirichlet distribution is a multivariate generalization of the beta distribution, which corresponds to the distribution of X_1 when K = 2.

Marginal distributions of the Dirichlet distribution

The connection between the Dirichlet and beta distributions goes beyond the multivariate generalization. It turns out that the marginal distributions of the Dirichlet distribution are beta distributions. Specifically, letting \alpha_0 = \sum_{i=1}^K \alpha_i, we have

\begin{aligned} X_i \sim \text{Beta}(\alpha_i, \alpha_0 - \alpha_i). \end{aligned}

This result follows directly from the lemma below (also known as the aggregation property of the Dirichlet distribution): just apply it repeatedly to any two elements other than X_i until only two elements are left.

Lemma (Aggregation property). Assume (X_1, \dots, X_K) \sim \text{Dir}(\alpha_1, \dots, \alpha_K). If the random variables with subscripts i and j are dropped from the vector and replaced by their sum, then the resulting random variable has a Dirichlet distribution as well:

\begin{aligned} (X_1, \dots, X_i + X_j, \dots, X_K) \sim \text{Dir}(\alpha_1, \dots, \alpha_i + \alpha_j, \dots, \alpha_K). \end{aligned}

Proof of the aggregation property

It remains to prove this lemma. Without loss of generality, it suffices to prove it for (i, j) = (1, 2). We compute the PDF of (X_1 + X_2, X_3, \dots, X_K) directly:

\begin{aligned} f(x_1', x_3, \dots, x_K) &= \int_0^{x_1'} \dfrac{1}{B(\boldsymbol{\alpha})} u^{\alpha_1 - 1} (x_1' - u)^{\alpha_2 - 1} \prod_{i=3}^K x_i^{\alpha_i - 1} du \\  &\propto \prod_{i=3}^K x_i^{\alpha_i - 1} \int_0^{x_1'} u^{\alpha_1 - 1} (x_1' - u)^{\alpha_2 - 1} du. \end{aligned}

Applying a change of variables v = \dfrac{u}{x_1'},

\begin{aligned} f(x_1', x_3, \dots, x_K) &\propto \prod_{i=3}^K x_i^{\alpha_i - 1} \int_0^1 (x_1' v)^{\alpha_1 - 1} [x_1'(1 - v)]^{\alpha_2 - 1} \, x_1' \, dv \\  &= (x_1')^{\alpha_1 + \alpha_2 - 1} \prod_{i=3}^K x_i^{\alpha_i - 1} \int_0^1 v^{\alpha_1 - 1} (1 - v)^{\alpha_2 - 1} dv \\  &\propto (x_1')^{\alpha_1 + \alpha_2 - 1} \prod_{i=3}^K x_i^{\alpha_i - 1}, \end{aligned}

which implies that (X_1 + X_2, X_3, \dots, X_K) has the \text{Dir}(\alpha_1 + \alpha_2, \alpha_3, \dots, \alpha_K) distribution.
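
As a quick numerical check, the sketch below simulates Dirichlet draws via the standard normalized-Gamma construction (a standard fact not covered in this post: if G_i are independent \text{Gamma}(\alpha_i, 1), then (G_1, \dots, G_K) / \sum_i G_i \sim \text{Dir}(\boldsymbol{\alpha})) and compares X_1 + X_2 against the \text{Beta}(\alpha_1 + \alpha_2, \alpha_3 + \alpha_4) distribution implied by the lemma together with the marginal result above:

set.seed(1)
alpha <- c(0.5, 1, 2, 3)
n <- 1e5

# Dirichlet draws via normalized independent Gamma(alpha_i, 1) variables
g <- matrix(rgamma(n * 4, shape = rep(alpha, each = n)), nrow = n)
x <- g / rowSums(g)

# X1 + X2 should be Beta(alpha1 + alpha2, alpha3 + alpha4)
ks.test(x[, 1] + x[, 2], pbeta,
        shape1 = alpha[1] + alpha[2], shape2 = alpha[3] + alpha[4])
# the p-value should typically be large, consistent with the beta marginal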

Simulating the dice game “Toss Up!” in R

Toss Up! is a very simple dice game that I’ve always wanted to simulate but never got around to doing so (until now!). This post outlines how to simulate a Toss Up! game in R, as well as how to evaluate the effectiveness of different game strategies. All the code for this blog post is available here.

Rules

The official rules for Toss Up! are available here. Here is an abbreviated version:

  • The game uses 10 six-sided dice. Each die has 3 green sides, 2 yellow sides and 1 red side.
  • For each turn, a player rolls all 10 dice. If any greens are rolled, set them aside.
    • At this point, the player can choose to end the turn and “bank” the points (one point for one green), or keep rolling the remaining dice.
    • After every roll, greens are set aside. If you roll enough times to make all 10 dice green, you have 10 points and you can either bank or roll all 10 dice again.
    • A player can keep going until he/she decides to stop rolling and bank all the points earned on this turn.
    • Catch: If, on a roll, none of the dice are green and at least one is red, the turn ends and no points can be banked.
  • First player to reach 100 points wins.

Simulating Toss Up!

There are several different ways to implement Toss Up!; I describe my (two-player) version below. To allow greater flexibility in the resulting code, I implement the game with three global constants that the programmer can change:

  1. NUM_DICE: The number of dice used in the game (default is 10).
  2. FACE_PROBS: A numeric vector of length 3 denoting the probability that a die comes up red, yellow or green respectively (default is c(1/6, 2/6, 3/6)).
  3. POINTS_TO_WIN: The number of points a player needs to obtain to win the game (default is 100).

Step 1: How can we describe the state of a game?

The state of the game can be encapsulated with 4 pieces of information:

  1. current_player: whose turn it is to make a decision.
  2. scores: a vector of length 2 containing the scores for players 1 and 2 respectively.
  3. turn_points: Number of points scored on the current turn so far (these points have not been banked yet).
  4. dice_rolls: A vector of variable length denoting the outcome of the dice rolls (0: red, 1: yellow, 2: green).

In my implementation, the state is stored as a list with the 4 elements above.

Step 2: Updating the state

From a given state, there are 3 possibilities for what comes next:

  1. The dice rolls don’t have any greens and at least one red. In this case, the current player’s turn is over. We need to change current_player, reset turn_points to 0, and re-roll the dice (NUM_DICE of them).
  2. The dice rolls either (i) have at least one green, or (ii) have no reds. In this case, the current player has a choice of what to do.
    1. If the player chooses to bank, then we need to update scores, reset turn_points to 0, change current_player and re-roll the dice (NUM_DICE of them).
    2. If the player chooses to roll, then we need to update turn_points and re-roll just the dice that were not green.

The function UpdateState below does all of the above. I have also added a verbose option which, if set to TRUE, prints information on the game state to the console. (The function DiceToString is a small helper function for printing out the dice rolls.)

DiceToString <- function(dice_rolls) {
  return(paste(sum(dice_rolls == 2), "Green,",
               sum(dice_rolls == 1), "Yellow,",
               sum(dice_rolls == 0), "Red"))
}

# Executes the current player's action and updates the state. Is also used if
# no greens and at least one red is rolled.
UpdateState <- function(state, action, verbose = FALSE) {
  if (verbose) {
    cat("Current roll:", DiceToString(state$dice_rolls))
    if (action != "rolled at least 1 red and no green")
      cat(" (bank", state$turn_points + sum(state$dice_rolls == 2), "pts?)",
          fill = TRUE)
    else
      cat("", fill = TRUE)
    cat(paste0("Player ", state$current_player, " ", action, "s"),
        fill = TRUE)
  }
    
  if (action == "bank") {
    # action = "bank": bank the points current player earned this turn, then
    # re-roll the dice.
    state$scores[state$current_player] <- state$scores[state$current_player] +
      state$turn_points + sum(state$dice_rolls == 2)
    state$turn_points <- 0
    state$dice_rolls <- sample(0:2, size = NUM_DICE, replace = TRUE, 
                               prob = FACE_PROBS)
    state$current_player <- 3 - state$current_player
  } else if (action == "roll") {
    # action = "roll": add the green dice to points earned this turn, then 
    # re-roll the remaining dice.
    state$turn_points <- state$turn_points + sum(state$dice_rolls == 2)
    ndice <- sum(state$dice_rolls != 2)
    if (ndice == 0) ndice <- NUM_DICE
    state$dice_rolls <- sample(0:2, size = ndice, replace = TRUE, 
                               prob = FACE_PROBS)
  } else if (action == "rolled at least 1 red and no green") {
    # action = "rolled at least 1 red and no green": 
    # no points banked, turn ends, re-roll dice.
    state$turn_points <- 0
    state$dice_rolls <- sample(0:2, size = NUM_DICE, replace = TRUE, 
                               prob = FACE_PROBS)
    state$current_player <- 3 - state$current_player
  } else {
    stop("action must be 'bank', 'roll', or 'rolled at least 1 red and no green'")
  }
  
  if (verbose) {
    if (action != "roll") {
      cat("Current scores:", state$scores, fill = TRUE)
      cat("", fill = TRUE)
    }
  }
  
  return(state)
}

Step 3: How to express player behavior and strategy?

We can think of a player as a black box which takes a game state as an input and outputs a decision, “bank” or “roll”. In other words, the player is a function!

Below are two extremely simple strategies expressed as functions. The first strategy simply banks after the first roll. The second strategy keeps rolling until the turn is worth more than target points, then banks.

# strategy that stops after one roll
OneRoll <- function(state) {
  return("bank")
}

# strategy that stops only when the turn yields > `target` points
OverTargetPoints <- function(state, target = 10) {
  if (state$turn_points + sum(state$dice_rolls == 2) > target) {
    return("bank")
  } else {
    return("roll")
  }
}

(Note: In my implementation, a strategy function should assume that it is playing as current_player: that is how it can correctly determine its own score and that of its opponent.)

Step 4: Simulating a full game

We can now put the pieces from the previous steps together to simulate a full game of Toss Up!. The SimulateGame function takes in two strategy functions as input. It sets up the initial state, then updates the state repeatedly until the game ends.

SimulateGame <- function(Strategy1, Strategy2, verbose = FALSE) {
  # set up initial state
  state <- list(current_player = 1,
                scores = c(0, 0),
                turn_points = 0,
                dice_rolls = sample(0:2, size = NUM_DICE, replace = TRUE, 
                                    prob = FACE_PROBS))
  
  # check if no greens and at least one red, if so change player
  while (sum(state$dice_rolls == 2) == 0 && sum(state$dice_rolls == 0) > 0) {
    state <- UpdateState(state, "rolled at least 1 red and no green", verbose)
  }
  
  # while the game has not ended:
  # - get the next action from the current player's strategy
  # - update the state
  while (max(state$scores) < POINTS_TO_WIN) {
    if (state$current_player == 1) {
      action <- Strategy1(state)
    } else {
      action <- Strategy2(state)
    }
    state <- UpdateState(state, action, verbose)
    
    # check if no greens and at least one red
    while (sum(state$dice_rolls == 2) == 0 && sum(state$dice_rolls == 0) > 0) {
      state <- UpdateState(state, "rolled at least 1 red and no green", verbose)
    }
  }
  
  # game has ended: return winner
  if (verbose) {
    cat(paste("Game ends: Player", which.max(state$scores), "wins!"),
        fill = TRUE)
  }
  return(which.max(state$scores))
}

Two examples of simulated games

The code below shows what a simulated game of Toss Up! might look like. In the code snippet below, players need 20 points to win. The first player uses the super conservative strategy of banking immediately, while the second player tries to win the entire game in one turn.

NUM_DICE <- 10
FACE_PROBS <- c(1/6, 2/6, 3/6)
POINTS_TO_WIN <- 20
set.seed(1)
winner <- SimulateGame(OneRoll,
                       function(state) { OverTargetPoints(state, 19) },
                       verbose = TRUE)

# Current roll: 4 Green, 3 Yellow, 3 Red (bank 4 pts?)
# Player 1 banks
# Current scores: 4 0
# 
# Current roll: 5 Green, 4 Yellow, 1 Red (bank 5 pts?)
# Player 2 rolls
# Current roll: 3 Green, 1 Yellow, 1 Red (bank 8 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 0 Red (bank 10 pts?)
# Player 2 rolls
# Current roll: 5 Green, 4 Yellow, 1 Red (bank 15 pts?)
# Player 2 rolls
# Current roll: 2 Green, 3 Yellow, 0 Red (bank 17 pts?)
# Player 2 rolls
# Current roll: 0 Green, 3 Yellow, 0 Red (bank 17 pts?)
# Player 2 rolls
# Current roll: 2 Green, 1 Yellow, 0 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 0 Green, 1 Yellow, 0 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 0 Green, 1 Yellow, 0 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 1 Green, 0 Yellow, 0 Red (bank 20 pts?)
# Player 2 banks
# Current scores: 4 20
# 
# Game ends: Player 2 wins!

In this particular simulation, it paid off to be a bit more risk-taking. As the simulation below shows, that’s not always the case.

NUM_DICE <- 10
FACE_PROBS <- c(0.5, 0.0, 0.5)
POINTS_TO_WIN <- 20
set.seed(1)
winner <- SimulateGame(OneRoll,
                       function(state) { OverTargetPoints(state, 19) },
                       verbose = TRUE)

# Current roll: 6 Green, 0 Yellow, 4 Red (bank 6 pts?)
# Player 1 banks
# Current scores: 6 0
# 
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 3 Red (bank 7 pts?)
# Player 2 rolls
# Current roll: 0 Green, 0 Yellow, 3 Red
# Player 2 rolled at least 1 red and no greens
# Current scores: 6 0
# 
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 1 banks
# Current scores: 11 0
# 
# Current roll: 7 Green, 0 Yellow, 3 Red (bank 7 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 1 Red (bank 9 pts?)
# Player 2 rolls
# Current roll: 1 Green, 0 Yellow, 0 Red (bank 10 pts?)
# Player 2 rolls
# Current roll: 3 Green, 0 Yellow, 7 Red (bank 13 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 5 Red (bank 15 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 3 Red (bank 17 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 1 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 0 Green, 0 Yellow, 1 Red
# Player 2 rolled at least 1 red and no greens
# Current scores: 11 0
# 
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 1 banks
# Current scores: 16 0
# 
# Current roll: 4 Green, 0 Yellow, 6 Red (bank 4 pts?)
# Player 2 rolls
# Current roll: 4 Green, 0 Yellow, 2 Red (bank 8 pts?)
# Player 2 rolls
# Current roll: 1 Green, 0 Yellow, 1 Red (bank 9 pts?)
# Player 2 rolls
# Current roll: 0 Green, 0 Yellow, 1 Red
# Player 2 rolled at least 1 red and no greens
# Current scores: 16 0
# 
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 1 banks
# Current scores: 21 0
# 
# Game ends: Player 1 wins!

Comparing some simple strategies

We can’t tell whether one strategy is better than another by looking at a single game since there is so much randomness involved. What we should do is simulate many games and see which strategy wins out over the long run.

OneRoll vs. OneRoll

Let’s do a sanity check. Here, we run 10,000 games of the strategy OneRoll against itself (this takes around 4 seconds on my machine). Since both players use the same strategy, we might expect player 1 to win around 50% of the time, right?

NUM_DICE <- 10
FACE_PROBS <- c(1/6, 2/6, 3/6)
POINTS_TO_WIN <- 100

nsim <- 10000
set.seed(1)
winners <- replicate(nsim, SimulateGame(OneRoll, OneRoll))
table(winners)  # not 50-50: player who starts first has an advantage

# winners
# 1    2 
# 5910 4090 

It looks like player 1 wins around 59% of the time! On second thought, we should expect player 1 to win more often because he/she has a first-mover advantage: if player 1 reaches 100 points first, he/she wins even if player 2 would have reached 100 points on the very next turn.

To make this sanity check work, we should have the players each go first half the time. The code below achieves that:

# for even numbered simulations, let player 2 start first
winners2 <- winners
winners2[2 * 1:(nsim/2)] <- 3 - winners2[2 * 1:(nsim/2)]
table(winners2)

# winners2
# 1    2 
# 5030 4970 

Now player 1 wins just 50.3% of the time, which is much closer to the expected 50%.

OneRoll vs. >20 points

Next, let’s compare the “one roll” strategy with the strategy which stops once the player can bank more than 20 points for the turn:

set.seed(1)
winners <- replicate(nsim, SimulateGame(
  OneRoll,
  function(state) { OverTargetPoints(state, 20) }))
table(winners)

# winners
# 1    2 
# 75 9925 

Wow! The strategy of banking only when >20 points have been earned on a turn wins almost all the time, even though it doesn’t have the first mover advantage!

>20 points vs. >50 points

Taking a bit more risk than always banking seems to pay off. What about even more risk? How does banking after >20 points compare with banking after >50 points?

set.seed(1)
winners <- replicate(nsim, SimulateGame(
  function(state) { OverTargetPoints(state, 20) },
  function(state) { OverTargetPoints(state, 50) }))
table(winners)

# winners
# 1    2 
# 6167 3833 

# switch order of players
set.seed(1)
winners <- replicate(nsim, SimulateGame(
  function(state) { OverTargetPoints(state, 50) },
  function(state) { OverTargetPoints(state, 20) }))
table(winners)

# winners
# 1    2 
# 4414 5586 

The >20 points strategy won 61.7% of the games when it went first, and won 55.9% of the games when it went second, so it’s clearly a superior strategy.

Where can we go from here?

There are several future directions we could take with this code base.

  • The strategies presented here are very simple. Can we program more complex strategies (e.g. some that depend on the score of the other player, or on how close the player is to winning)?
  • Among strategies of the form “bank once you have more than xx points”, is there an optimal value for xx?
  • What is the optimal strategy for Toss Up!? Can we learn it through methods such as Q-learning?
  • Is there even such a thing as an optimal strategy? Do there exist 3 strategies such that A beats B, B beats C, but C beats A?
  • Can we quantify the first mover advantage?
  • Can we modify the code so that we can visualize the players’ scores as the game progresses?
  • Can we modify the code so that more than 2 players can play?

Exploring the game “First Orchard” with simulation in R

My daughter received the board game First Orchard as a Christmas present and she’s hooked on it so far. In playing the game with her, a few probability/statistics questions came to mind. This post outlines how I answered some of them using simulation in R. All code for this blog post can be found here.

(In my googling I found that Matt Lane has an excellent blog post on this game, answering some of the questions that I was interested in.)

Gameplay

Before we get to the questions, let me give a quick explanation of how the game works (see Reference 1 for a more colorful explanation as well as an applet to play the game online).

  • It’s a cooperative game, with all players playing against the board.
  • The game starts with 16 fruit: 4 of each color (red, green, blue, yellow), and a raven at one end of a path that is 5 steps long.
  • On each player’s turn, the player rolls a 6-sided die.
    • If the die comes up red, green, blue or yellow, the player gets to “harvest” a fruit of that color if there are any left to harvest. If all 4 fruits of that color have been harvested, nothing happens.
    • If the die shows a fruit basket, the player gets to harvest a fruit of any color.
    • If the die shows a raven, the raven moves one step along the path.
  • The game ends when either all the fruit are harvested (players win) or when the raven reaches the end of the path (raven wins).

As you can see, this is a really simple game (hence its “suitable for 2+ years” rating). The only strategy is in choosing which fruit to take when a fruit basket shows up: see Reference 1 for some simulations of different strategies. Intuitively it seems like choosing the color with the most fruit remaining is the best strategy, and that’s what I bake into my code. (Is there a proof of this, and is it still true in the more general circumstances described in Reference 1?)

Code for simulating the game

The state of the game can be captured in a numeric vector of length 5. The first 4 numbers refer to the number of fruit left for each color, and the 5th number keeps track of the number of steps the raven has taken so far. I created 3 functions to simulate one game of First Orchard (see full code here):

  • SimulateTurn(state, verbose) simulates one die roll and updates the state of the game. For simplicity, if a 1-4 is rolled, a fruit is harvested from the corresponding tree. If a 5 is rolled, the raven takes a step. A rolled 6 is taken to mean “fruit basket”, and I remove a fruit from the tree with the most remaining fruit. (A sketch of this function appears after this list.)
  • CheckGameState(state, max_raven_steps) checks if the game has ended or not, and if so, who won.
  • SimulateGame(fruit_count, max_raven_steps, verbose) runs an entire game of First Orchard: while the game has not ended, run SimulateTurn. Once the game has ended, this function returns (i) who won, (ii) the number of turns taken, (iii) the number of steps the raven took, and (iv) the number of fruit left.
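
Here is a minimal sketch of what SimulateTurn might look like, assuming the state encoding described above, i.e. state = c(red, green, blue, yellow, raven_steps); the actual implementation lives in the linked code:

SimulateTurn <- function(state, verbose = FALSE) {
  roll <- sample(6, size = 1)
  if (roll <= 4) {
    # colors 1-4: harvest a fruit of that color, if any are left
    if (state[roll] > 0) state[roll] <- state[roll] - 1
  } else if (roll == 5) {
    # the raven takes a step
    state[5] <- state[5] + 1
  } else {
    # fruit basket: harvest from the tree with the most fruit remaining
    tree <- which.max(state[1:4])
    if (state[tree] > 0) state[tree] <- state[tree] - 1
  }
  if (verbose)
    cat("Roll:", roll, ", State:", paste(state, collapse = ","), fill = TRUE)
  return(state)
}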

We allow two game parameters to be defined by the user: the number of fruit of each type at the start of the game (fruit_count, default is 4) and the number of steps the raven must take in order for it to win (max_raven_steps, default is 5). These functions have a verbose option so that the user can see what happened in the game. The code below is an example of the output from SimulateGame:

set.seed(1)
results <- SimulateGame(fruit_count = 2, max_raven_steps = 3, 
                        verbose = TRUE)
# Roll: 1 , State: 1,2,2,2,0
# Roll: 4 , State: 1,2,2,1,0
# Roll: 1 , State: 0,2,2,1,0
# Roll: 2 , State: 0,1,2,1,0
# Roll: 5 , State: 0,1,2,1,1
# Roll: 3 , State: 0,1,1,1,1
# Roll: 6 , State: 0,0,1,1,1
# Roll: 2 , State: 0,0,1,1,1
# Roll: 3 , State: 0,0,0,1,1
# Roll: 3 , State: 0,0,0,1,1
# Roll: 1 , State: 0,0,0,1,1
# Roll: 5 , State: 0,0,0,1,2
# Roll: 5 , State: 0,0,0,1,3
# Raven wins
# # of turns: 13
# # of steps raven took: 3
# # fruit left: 1

Simulation time!

What is the probability of winning the game? How does that change as we vary (i) the number of fruit of each color, and (ii) the number of steps the raven must take in order for the players to lose?

We let the number of fruit of each color vary from 1 to 8, and the number of steps the raven must take from 1 to 8. For each parameter setting, we simulate the game 10,000 times and compute the player win probability. We plot the results as a heatmap.
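
Here is a sketch of that parameter sweep. It assumes that SimulateGame returns a list whose winner element equals "player" or "raven"; the actual names in the linked code may differ:

nsim <- 10000
params <- expand.grid(fruit_count = 1:8, max_raven_steps = 1:8)
params$win_prob <- apply(params, 1, function(p) {
  winners <- replicate(nsim, SimulateGame(fruit_count = p[1],
                                          max_raven_steps = p[2])$winner)
  mean(winners == "player")  # estimated player win probability
})
# params now holds one win probability per parameter setting and can be
# plotted as a heatmap, e.g. with ggplot2's geom_tile()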

As one might expect, win probability goes down as the number of fruit increases and as the number of steps the raven must take decreases. For the original game (4 fruit, 5 steps), the win probability is approximately 62%. Sounds reasonable: we would like a win probability >50% so that kids will not get discouraged by too much losing, but not so high that they think the game is trivial.

For the original game (4 fruit, 5 steps), what is the expected number of steps until the game ends? Does this change depending on whether the player or the raven wins?

We simulate the game 100,000 times and keep track of the number of steps taken in each game. The shortest game took 5 steps while the longest took 45 steps, with the modal number of steps being 21 (it also happens to be the mean and median). Here is the histogram for all 100,000 runs:

Here are the histograms split by the outcome:

Games where the raven wins tend to be shorter than those where the players win. Maybe that’s not too surprising, since a game where the raven wins needs as few as 5 steps, while a game where the players win needs at least 16 steps. On average, the game takes 19 steps when the raven wins and 22 steps when the players win.

For the original game (4 fruit, 5 steps), given that the raven loses, what is the distribution of the number of steps the raven has taken?

Because we programmed SimulateGame to return the number of steps the raven has taken as well, we don’t have to rerun the simulations: we can just use the 100,000 simulations we ran previously and look at the ones that the raven lost. Here is the histogram of steps the raven took in losing games, with the vertical red line representing the mean:

For the original game (4 fruit, 5 steps), given that the raven wins, what is the distribution of the number of unharvested fruit?

Again, we can just use the 100,000 simulations we ran previously and look at the ones that the raven won. Here is the histogram along with the mean and median marked out with vertical lines:

The modal number of fruit left in player-losing games is 1: ugh, tantalizingly close!

If there was no raven, how many turns would it take to harvest all the fruit?

“No raven” is the same as saying that the raven needs to take an infinite number of steps in order to win. Hence, we can use our existing simulation code with max_raven_steps = Inf to simulate this setting.

The shortest game took 16 turns while the longest game took 63 turns, with 22 and 24 turns being the modal and mean number of turns respectively. (In theory, a game could go on forever.) Here is the histogram:

References:

  1. Matt Lane. (2018). Harvesting Wins.

What is the Tweedie distribution?

The Tweedie distribution is a three-parameter family of distributions that (i) is a special case of exponential dispersion models, but (ii) is a generalization of several familiar probability distributions, including the normal, gamma, inverse Gaussian and Poisson distributions.

Exponential dispersion models

A random variable Y is a (reproductive) exponential dispersion model with mean parameter \mu and dispersion parameter \sigma^2 if its probability density function (PDF) can be written in the form

\begin{aligned} f_Y (y; \mu, \sigma^2) = h (\sigma^2, y) \exp \left[ \dfrac{\theta y - A(\theta)}{\sigma^2} \right], \end{aligned}

where \theta is such that \mu = A'(\theta). We write Y \sim ED(\mu, \sigma^2).

(Note: This density is always relative to some underlying reference measure. If the underlying measure is the Lebesgue measure, then f is the usual PDF that we associate with continuous random variables. If the the underlying measure is the counting measure on, say, integers, then f is the probability mass function (PMF) that we associate with discrete random variables.)
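
For example, the normal distribution is a reproductive exponential dispersion model: completing the square in the usual PDF gives

\begin{aligned} f_Y(y; \mu, \sigma^2) = \underbrace{\frac{1}{\sqrt{2\pi \sigma^2}} e^{-y^2 / (2\sigma^2)}}_{h(\sigma^2, y)} \exp \left[ \dfrac{\mu y - \mu^2 / 2}{\sigma^2} \right], \end{aligned}

which has the required form with \theta = \mu and A(\theta) = \theta^2 / 2, so that \mu = A'(\theta) as required.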

At first it may seem a bit strange for the PDF to be written in terms of \theta rather than \mu: why not treat \theta as the parameter, rather than \mu? We often parameterize the family by \mu because (i) it has a clear interpretation, (ii) we often express the variance of Y as a function of \mu (called the variance function), and (iii) an exponential dispersion model is characterized within the class of all exponential models by its variance function.

Exponential dispersion models were introduced by Jørgensen in 1987 (Reference 2) as a generalization of generalized linear models; (iii) in the previous paragraph was proved as Theorem 1 in Reference 2.

If one can switch the order of differentiation and integration (which is the case under certain regularity conditions), it can also be shown that

\begin{aligned} \mu &= A'(\theta), \\  Var(Y) &= V(\mu) = \sigma^2 A''(\theta). \end{aligned}

Probability density function of the Tweedie distribution

The Tweedie distribution is a special case of the exponential dispersion model, in that the variance function must have the form

Var(Y) = \sigma^2 \mu^p,

where p \in \mathbb{R} is known as the power parameter. Variance functions of this form are known as power variance functions. The Tweedie distribution is named after M. C. K. Tweedie, who studied the exponential dispersion model with power variance functions in a 1984 paper (Reference 3).

It can be shown that this implies the probability density function has the form

\begin{aligned} f_Y (y; \mu, \sigma^2) = h (\sigma^2, y) \exp \left[ \dfrac{\theta y - A_p(\theta)}{\sigma^2} \right], \end{aligned}

where

\begin{aligned} A_p(\theta) = \begin{cases} \dfrac{\alpha - 1}{\alpha} \left( \dfrac{\theta}{\alpha - 1}\right)^\alpha &\text{for } p \neq 1, 2, \\ - \log (-\theta) &\text{for } p =2, \\ e^\theta &\text{for } p = 1, \end{cases} \qquad \alpha = \dfrac{p-2}{p-1}. \end{aligned}

Special cases of the Tweedie distribution

Table 1 of Reference 2 shows what the Tweedie distribution corresponds to for various values of the parameter p. In the table, support refers to the values that Y can take on, \Theta refers to the set of valid \theta parameter values, i.e.

\Theta = \{ \theta \in \mathbb{R}: A_p(\theta) < \infty \},

and \Omega = A_p'(\text{int } \Theta).

In particular:

  1. p = 0 is equivalent to the normal distribution,
  2. p = 1 is equivalent to the Poisson distribution,
  3. p = 2 is equivalent to the Gamma distribution, and
  4. p = 3 is equivalent to the Inverse Gaussian distribution.

These can be checked by substituting the formulas for A_p(\theta) into the PDF formula.
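
For instance, to check the case p = 1, take \sigma^2 = 1 and \theta = \log \mu (so that \mu = A_1'(\theta) = e^\theta). The PDF formula becomes

\begin{aligned} f_Y(y; \mu, 1) = h(1, y) \exp \left[ y \log \mu - \mu \right] = h(1, y) \, \mu^y e^{-\mu}, \end{aligned}

which, with h(1, y) = 1/y! and the counting measure on \{0, 1, 2, \dots \} as the reference measure, is exactly the Poisson PMF.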

References:

  1. Wikipedia. Tweedie distribution.
  2. Jørgensen, B. (1987). Exponential dispersion models.
  3. Tweedie, M. C. K. (1984). An index which distinguishes between some important exponential families.

Simulating paths from a random walk

If you’ve ever visited this blog at wordpress.com, you might have noticed a header image that looks like this:

Ever wonder how it was generated? The image depicts 100 simulations of an asymmetric random walk. In this post, I’ll go through the code used to generate this image. All the code can also be found here.

For t = 1, 2, \dots, consider a sequence of i.i.d. random variables X_1, X_2, \dots such that for any t, X_t = 1 with probability p, and X_t = -1 with probability 1-p. (Here, p \in [0,1] is a parameter that we get to choose.) A random walk simply tracks the cumulative sum of these random variables, i.e.

S_0 = 0, \quad S_t = \sum_{s = 1}^t X_s.

In my image, I let the random walk run until it hits a fixed upper limit or a fixed lower limit. Here is an R function that generates one realization of this random walk:

# returns the random walk path values as a vector 
# (random walk always starts at 0)
# p: probability of increasing by 1
# stop if path value hits either `lower` or `upper`
run <- function(p, lower, upper) {
    values <- c(0)
    current <- 0
    while (current > lower & current < upper) {
        current <- current + ifelse(runif(1) < p, 1, -1)
        values <- c(values, current)
    }
    values
}

(There might be more efficient ways of doing this, but since computation is relatively fast this is good enough for our purposes.)

The code below creates a list of 100 elements, each element being a simulated random walk. The probability of going up at any one step is 0.48, and we stop the random walk once we hit -50 or 50.

N <- 100  # no. of paths to simulate
p <- 0.48
lower <- -50
upper <- 50

# simulate paths
set.seed(1055)
vlist <- replicate(N, run(p, lower, upper))

We can plot these paths along with the x-axis and the upper & lower limits:

# get length of longest path
max_length <- max(sapply(vlist, length))

# make plot
par(mar = rep(0, 4))  # no margins
plot(c(1, max_length), c(lower, upper), type = "n")
for (i in 1:N) {
    lines(1:length(vlist[[i]]), vlist[[i]])
}
abline(h = 0, lty = "dashed")
abline(h = lower, lwd = 2)
abline(h = upper, lwd = 2)

Without color, it’s hard to make much sense of the image. To introduce color, let’s create a function that picks a color for each path based on (i) whether the path hit the upper or lower limit, and (ii) how long the path is compared to the longest path in the image.

colorPicker <- function(values, max_length,
                        ls_color = c(178, 34, 34), ll_color = c(255, 204, 0),
                        us_color = c(0, 0, 102), ul_color = c(102, 204, 225)) {
    l <- length(values)
    if (values[l] < 0) {
        rgb_values <- (ls_color + (ll_color - ls_color) * l / max_length) / 255
    } else {
        rgb_values <- (us_color + (ul_color - us_color) * l / max_length) / 255
    }
    rgb(rgb_values[1], rgb_values[2], rgb_values[3])
}

If a path hits the lower limit and has length close to 0, its color will be close to ls_color / 255, and if a path hits the lower limit and has the longest length among all our simulated paths, it will have color ll_color / 255. Paths hitting the upper limit interpolate between us_color and ul_color in the same way. (You can look up what these RGB colors are on any color-picker website.)

We can now color our black-and-white image:

plot(c(1, max_length), c(lower, upper), type = "n")
for (i in 1:N) {
    lines(1:length(vlist[[i]]), vlist[[i]], 
          col = colorPicker(vlist[[i]], max_length), lwd = 0.5)
}
abline(h = 0, lty = "dashed")
abline(h = lower, lwd = 2)
abline(h = upper, lwd = 2)

From the image it is now obvious that most of the paths hit the lower limit first, and fairly quickly at that! Here is the same image, but with different choices of color scale:

plot(c(1, max_length), c(lower, upper), type = "n")
for (i in 1:N) {
    lines(1:length(vlist[[i]]), vlist[[i]], 
          col = colorPicker(vlist[[i]], max_length,
                            ls_color = c(230, 230, 230), 
                            ll_color = c(166, 166, 166),
                            us_color = c(255, 0, 0),
                            ul_color = c(0, 0, 255)),
          lwd = 0.5)
}
abline(h = 0, lty = "dashed")
abline(h = lower, lwd = 2)
abline(h = upper, lwd = 2)

The possibilities are endless!

If you haven’t seen random walks before, you might be surprised at how a slightly biased walk (p = 0.48 instead of p = 0.5) results in so many more paths hitting the lower limit before the upper limit, even though the two limits are the same distance from 0. This is an example of 100 simulations when p = 0.5:

This is an example of 100 simulations when p = 0.52:

Isn’t it interesting how slight deviations in the probability of going one step up (instead of down) completely change the dynamics? It turns out that there is a closed form for the probability of hitting one limit (as opposed to hitting the other). If the upper limit is x = b and the lower limit is x = -a, and q = 1-p, then

\begin{aligned} \mathbb{P} \{ \text{hit } x = b \text{ before } x = -a \} = \begin{cases} \frac{1 - (q/p)^a}{1 - (q/p)^{a+b}} &\text{if } p \neq 1/2, \\ \frac{a}{a+b} &\text{if } p = 1/2. \end{cases} \end{aligned}

For my blog header image, p = 0.48, a = b = 50, which means the probability of hitting the upper limit is approximately 0.0179: not high at all!
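
As a sanity check, we can compare the formula with the simulated paths from earlier (reusing p, lower, upper and vlist from the code above):

# theoretical probability of hitting the upper limit first
q <- 1 - p   # p = 0.48 from the earlier code
a <- -lower  # a = 50
b <- upper   # b = 50
(1 - (q / p)^a) / (1 - (q / p)^(a + b))  # approximately 0.0179

# empirical fraction of the 100 simulated paths ending at the upper limit
mean(sapply(vlist, function(v) tail(v, 1) == upper))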

Density of the log-normal distribution

A random variable X has a log-normal distribution if \log X has a normal distribution. Since the normal distribution can be parameterized by the mean and standard deviation (usually denoted \mu and \sigma), the log-normal distribution can be parameterized similarly.

Given \mu \in \mathbb{R} and \sigma > 0, the density of the log-normal distribution is given by

\begin{aligned} f(x) = \dfrac{1}{x \sigma \sqrt{2\pi}} \exp \left[ - \dfrac{(\log x - \mu)^2}{2 \sigma^2} \right], \qquad \text{for } x \in (0, +\infty). \end{aligned}

The figure below shows the probability density function (PDF) of the log-normal distribution for \mu = 0 and varying \sigma. This is very much like the figure you see on the Wikipedia page.

Unfortunately, this appears to be the most common picture we see for the log-normal, when in fact the picture looks quite different under different parameter settings. There are hardly any figures out there depicting what happens when we vary \mu.

Well, this post aims to fill that gap. All the figures in this post have the x-axis going from 0 to 3 for comparability (the y-axis will vary). Code for generating these figures is available here.
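
As a minimal example (in base R; the linked code may differ), the following sketch draws the densities for \mu = 0 and varying \sigma:

# log-normal densities for mu = 0 and varying sigma
x <- seq(0.01, 3, length.out = 300)
sigmas <- c(0.25, 0.5, 1, 1.5)
plot(x, dlnorm(x, meanlog = 0, sdlog = sigmas[1]), type = "l",
     xlab = "x", ylab = "density")
for (i in 2:length(sigmas)) {
  lines(x, dlnorm(x, meanlog = 0, sdlog = sigmas[i]), col = i)
}
legend("topright", legend = paste("sigma =", sigmas),
       col = seq_along(sigmas), lty = 1)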

Below are two figures where we vary \sigma as well. The first has \mu fixed at 1 while the second has \mu fixed at -0.75.

The next 3 figures have varying \mu and \sigma fixed at 0.25, 1 and 1.25 respectively.


A correlation coefficient that measures the degree of dependence between variables

I recently learned of a new correlation coefficient, introduced by Sourav Chatterjee, with some really cool properties. As stated in the abstract of the paper (Reference 1 below), this coefficient…

  1. … is as simple as the classical coefficients of correlation (e.g. Pearson correlation and Spearman correlation),
  2. … consistently estimates a quantity that is 0 iff the variables are independent and 1 iff one is a measurable function of the other, and
  3. … has a simple asymptotic theory under the hypothesis of independence (like the classical coefficients).

What was surprising to me was Point 2: this is the first known correlation coefficient measuring the degree of functional dependence, in the sense that its limit is 0 iff the variables are independent and 1 iff one is a measurable function of the other. (The “iff”s are not typos: they are short for “if and only if”.) This can be viewed as a generalization of the Pearson correlation coefficient, which measures the degree of linear dependence between X and Y. (The author points out in point 5 of the Introduction and in Section 6 that the maximal information coefficient and the maximal correlation coefficient do not have this property, even though they are sometimes thought to.)

Defining the sample correlation coefficient

Let X and Y be real-valued random variables such that Y is not a constant, and let (X_1, Y_1), \dots, (X_n, Y_n) be i.i.d. pairs of these random variables. The new correlation coefficient can be computed as follows:

  1. Rearrange the data as (X_{(1)}, Y_{(1)}), \dots, (X_{(n)}, Y_{(n)}) so that the X values are in increasing order, i.e. X_{(1)} \leq \dots \leq X_{(n)}. If there are ties, break them uniformly at random.
  2. For each index i, let r_i be the rank of Y_{(i)}, i.e. the number of j such that Y_{(j)} \leq Y_{(i)}, and let l_i be the number of j such that Y_{(j)} \geq Y_{(i)}.
  3. Define the new correlation coefficient as

\begin{aligned} \xi_n (X, Y) = 1 - \dfrac{n \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{2 \sum_{i=1}^n l_i (n - l_i)}. \end{aligned}

If there are no ties among the Y’s, the denominator simplifies and we don’t have to compute the l_i’s:

\begin{aligned} \xi_n (X, Y) = 1 - \dfrac{3 \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{n^2 - 1}. \end{aligned}
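
Here is a direct R implementation of the no-ties formula (a sketch; the function name xi_n is mine):

# Chatterjee's coefficient, assuming no ties among the y values
xi_n <- function(x, y) {
  n <- length(x)
  r <- rank(y[order(x)])  # r_i = rank of Y_(i) after sorting pairs by x
  1 - 3 * sum(abs(diff(r))) / (n^2 - 1)
}

set.seed(1)
x <- runif(1000)
xi_n(x, sin(2 * pi * x))  # close to 1: y is a noiseless function of x
xi_n(x, rnorm(1000))      # close to 0: independent of x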

What is the sample correlation coefficient estimating?

The following theorem tells us what \xi_n(X,Y) is trying to estimate:

Theorem: As n \rightarrow \infty, \xi_n(X, Y) converges almost surely to the deterministic limit

\begin{aligned} \xi (X, Y) = \dfrac{\int \text{Var}(\mathbb{E}[1_{\{ Y \geq t\}} \mid X]) d\mu(t) }{\int \text{Var}(1_{\{ Y \geq t\}}) d\mu(t) }. \end{aligned}

\xi(X, Y) seems like a nasty quantity but it has some nice properties:

  1. \xi(X, Y) always belongs to [0, 1]. (This follows immediately from the law of total variance.)
  2. \xi(X, Y) = 0 iff X and Y are independent, and \xi(X, Y) = 1 iff there is a measurable function f: \mathbb{R} \mapsto \mathbb{R} such that Y = f(X) almost surely.

This later paper by Mona Azadkia and Chatterjee extends \xi to capture degrees of conditional dependence.

Some properties of \xi_n(X, Y) and \xi(X, Y)

Here are some other key properties of this new correlation coefficient:

  1. \xi is not symmetric in X and Y, i.e. often we will have \xi(X, Y) \neq \xi(Y, X). It can be symmetrized by considering \max \{ \xi(X, Y), \xi(Y, X)\} as the correlation coefficient instead. Chatterjee notes that we might want this asymmetry in certain cases: “we may want to understand if Y is a function of X, and not just if one of the variables is a function of the other”.
  2. \xi_n(X, Y) remains unchanged if we apply strictly increasing transformations to X and Y as it is only based on the ranks of the data.
  3. Since \xi_n(X, Y) is only based on ranks, it can be computed in O(n \log n) time.
  4. We have some asymptotic theory for \xi_n under the assumption of independence:

    Theorem: Suppose X and Y are independent and Y is continuous. Then \sqrt{n} \xi_n (X, Y) \rightarrow \mathcal{N}(0, 2/5) in distribution as n \rightarrow \infty.

    (The paper has a corresponding result for when Y is not continuous.) This theorem allows us to construct a hypothesis test of independence based on this correlation coefficient.

References:

  1. Chatterjee, S. (2019). A new coefficient of correlation.

What is the von Mises distribution?

As the name suggests, circular data is data that is measured on a circle. One example of circular data is direction (e.g. how many degrees from True North); another is time. (Note: Time is interesting because we can think of it as linear (Jan 2019 -> … -> Dec 2019 -> Jan 2020) or circular (Jan -> … -> Dec -> Jan).)

A common probability distribution we see with circular data is the von Mises distribution (named after Richard von Mises, an early 20th-century Austrian mathematician). It is a close approximation to the wrapped normal distribution, which is the normal distribution “wrapped” around the unit circle. The von Mises distribution is simpler and more tractable, and hence is more commonly used.

The von Mises distribution has two parameters: \mu \in \mathbb{R}, which is a measure of location, and \kappa \geq 0, which is a measure of concentration. For any x \in [\mu - \pi, \mu + \pi], the probability density function (PDF) is given by

\begin{aligned} f(x) = \dfrac{\exp [\kappa \cos (x - \mu)]}{2\pi I_0 (\kappa)}, \end{aligned}

where I_0 is the modified Bessel function of order 0. Below is a picture of the PDF for different values of \kappa (taken from Wikipedia). You can see that it looks very much like the normal distribution. As \kappa increases, the distribution approaches the \mathcal{N}(\mu, 1/\kappa) distribution. For small \kappa, the distribution is very close to uniform, with \kappa = 0 corresponding exactly to the uniform distribution over the interval.
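
Since I_0 is available in base R as besselI, plotting the PDF takes only a few lines. A quick sketch (with \mu = 0; the function name dvonmises is mine):

# von Mises density via base R's modified Bessel function
dvonmises <- function(x, mu = 0, kappa = 1) {
  exp(kappa * cos(x - mu)) / (2 * pi * besselI(kappa, nu = 0))
}

x <- seq(-pi, pi, length.out = 400)
plot(x, dvonmises(x, kappa = 4), type = "l", ylab = "density")
lines(x, dvonmises(x, kappa = 1), col = 2)
lines(x, dvonmises(x, kappa = 0), col = 3)  # kappa = 0: uniform on the circle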

Below is a really nice visualization of circular data overlaid with the estimated von Mises distribution, taken from this blog post:

The mean, median and mode of this distribution are all equal to \mu. The circular variance is equal to 1 - I_1(\kappa)/I_0(\kappa), where I_j is the modified Bessel function of order j.

References:

  1. Wikipedia. von Mises distribution.
  2. Zeileis, A. Circular regression trees and forests.

What is a copula?

Copulas

Let d be some positive integer. A d-dimensional copula C : [0,1]^d \mapsto [0, 1] is a joint cumulative distribution function (CDF) of a d-dimensional random vector on [0,1]^d with uniform marginals, i.e. C(1, \dots, 1, u, 1, \dots, 1) = u if one of the arguments is u and the rest are 1.

Copulas are useful because they contain all information on the dependence structure between the elements of a d-dimensional random vector. One example of this is in the generation of random samples from multivariate probability distributions. Suppose we have a random vector (X_1, \dots, X_d) such that the marginal CDF for each X_i, denoted by F_i, is continuous. Then

(U_1, \dots, U_d) = \left( F_1(X_1), \dots, F_d(X_d) \right)

has uniformly distributed marginals, i.e. U_i \sim U[0,1] for i = 1, \dots, d. We can think of the copula as the joint CDF of U_1, \dots, U_d, i.e.

\begin{aligned} C(u_1, \dots, u_d) &= \text{Pr}[U_1 \leq u_1, \dots, U_d \leq u_d] \\  &= \text{Pr}\left[ X_1 \leq F_1^{-1}(u_1), \dots, X_d \leq F_d^{-1}(u_d) \right]. \end{aligned}

Now let’s say we want a random draw of (X_1, \dots, X_d). Assuming that we can draw a sample from the copula distribution implied by C, this gives us a sample of U_1, \dots, U_d. From this, we can obtain the desired sample

\begin{aligned} (X_1, \dots, X_d) = \left( F_1^{-1}(U_1), \dots, F_d^{-1}(U_d) \right). \end{aligned}

All this, of course, is predicated on being able to sample from the copula distribution.

Sklar’s theorem

Sklar’s theorem tells us that we can always decompose a d-dimensional joint CDF into a copula and the marginal CDFs, and that conversely, any copula and set of d univariate CDFs gives us a valid d-dimensional joint CDF. Below is the formal statement of the theorem:

Sklar’s theorem. (1959). Let F be a d-dimensional CDF with marginals F_1, \dots, F_d. Then there exists a copula C such that

F(x_1, \dots, x_d) = C \left( F_1(x_1), \dots, F_d(x_d) \right)

for all x_i \in [-\infty, \infty] and i = 1, \dots, d. If F_i is continuous for i = 1, \dots, d, then C is unique; otherwise C is uniquely determined only on \text{Range}(F_1) \times \dots \times \text{Range}(F_d).

In the opposite direction, for any copula C and univariate CDFs F_1, \dots, F_d, the function F as defined above is a multivariate CDF with marginals F_1, \dots, F_d.

Viewed in this light, copulas allow us to think of the dependence structure independently from the marginal distributions. (This may or may not be a good idea! The dependence between the U_i‘s is one step removed from reality, i.e. dependence between the X_i‘s.)

Gaussian copulas

The most famous family of copulas are Gaussian copulas. Let R be a d \times d correlation matrix, and let \Phi_R represent the joint CDF of a d-dimensional Gaussian distribution with mean 0 and covariance matrix R. Then the Gaussian copula with parameter matrix R is defined by

\begin{aligned} C(u) = \Phi_R \left( \Phi^{-1}(u_1), \dots, \Phi^{-1}(u_d) \right), \end{aligned}

where \Phi is the CDF of the standard Gaussian distribution.

One reason Gaussian copulas are popular is because they are easy to simulate. Let A be the square root matrix of R, i.e. R = A^T A. We can draw from the copula using this simple recipe:

  1. Draw Z = (Z_1, \dots, Z_d), where Z_1, \dots, Z_d \stackrel{i.i.d.}{\sim} \mathcal{N}(0,1).
  2. Set X = A^T Z. (Note that we have X \sim \mathcal{N} (0, R).)
  3. Return U = \left( \Phi(X_1), \dots, \Phi(X_d) \right).
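
In R, this recipe might look as follows for d = 2 (a sketch; chol returns an upper-triangular matrix A satisfying R = A^T A, which is all the recipe needs):

# draw 1000 samples from a 2-dimensional Gaussian copula
R <- matrix(c(1, 0.7, 0.7, 1), nrow = 2)  # correlation matrix
A <- chol(R)                              # R = t(A) %*% A

set.seed(1)
Z <- matrix(rnorm(2 * 1000), ncol = 2)    # step 1: i.i.d. N(0, 1) entries
X <- Z %*% A                              # step 2: rows of X are N(0, R)
U <- pnorm(X)                             # step 3: uniform marginals
cor(U)                                    # strong positive dependence retained

Applying marginal quantile functions to the columns of U, e.g. qexp(U[, 1]) and qgamma(U[, 2], shape = 2), then yields draws of (X_1, X_2) with those marginals and the Gaussian copula’s dependence structure.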

More on copulas

Reference 2 contains details on other popular copulas as well as how to estimate copulas from data. Reference 3 contains quite a bit of information on the history of copulas.

References:

  1. Wikipedia. Copula (probability theory).
  2. Haugh, Martin (2016). An introduction to copulas.
  3. Sempi, C. (2011). An introduction to copulas.