covidcast package for COVID-19-related data

(This is a PSA post, where I share a package that I think might be of interest to the community but that I haven’t looked too deeply into myself.)

Today I learnt of the covidcast R package, which provides access to the COVIDcast Epidata API published by the Delphi group at Carnegie Mellon University. According to the covidcast R package website,

This API provides daily access to a range of COVID-related signals that Delphi builds and maintains, from sources like symptom surveys and medical claims data, and also standard signals that we simply mirror, like confirmed cases and deaths.

(There is a corresponding python package with similar functionality.) The Delphi group has done a huge amount of work in logging a wide variety of COVID-related data and making it available, along with tools to visualize and make sense of the data.

For those interested in doing COVID-related analyses, I think this is a treasure trove of information for you to use. The covidcast package contains several different types of data (which they call “signals”), including public behavior (e.g. COVID searches on Google), early indicators (e.g. COVID-related doctor visits) and late indicators (e.g. deaths). Documentation on the signals available can be found here. (Note: The data is US-focused right now; I don’t know if there are plans to include data from other countries.)

Let me end off with a simple example showing what you can do with this package. This example is modified from one of the package vignettes; see the Articles section of the package website for more examples.

The package is not available on CRAN yet but can be downloaded from Github:

devtools::install_github("cmu-delphi/covidcast", ref = "main",
                         subdir = "R-packages/covidcast")

The code below pulls data on cumulative COVID cases per 100k people on 2020-12-31 at the county level. covidcast_signal is the function to use for pulling data, and it returns an object of class c("covidcast_signal", "data.frame").

library(covidcast)

# Cumulative COVID cases per 100k people on 2020-12-31
df <- covidcast_signal(data_source = "usa-facts",
                       signal = "confirmed_cumulative_prop",
                       start_day = "2020-12-31", end_day = "2020-12-31")
summary(df)
# A `covidcast_signal` data frame with 3142 rows and 9 columns.
# 
# data_source : usa-facts
# signal      : confirmed_cumulative_prop
# geo_type    : county
# 
# first date                          : 2020-12-31
# last date                           : 2020-12-31
# median number of geo_values per day : 3142

There is a plot method for objects of class covidcast_signal:

plot(df)

The automatic plot is usually not bad. The plot method comes with some arguments that the user can use to customize the plot (full documentation here):

breaks <- c(0, 500, 1000, 5000, 10000)
colors <- c("#D3D3D3", "#FEDDA2",  "#FD9950", "#C74E32", "#800026")
plot(df, choro_col = colors, choro_params = list(breaks = breaks),
     title = "Cumulative COVID cases per 100k people on 2020-12-31")

The plot returned is actually created using the ggplot2 package, so it is possible to add your own ggplot2 code on top of it:

library(ggplot2)
plot(df, choro_col = colors, choro_params = list(breaks = breaks),
     title = "Cumulative COVID cases per 100k people on 2020-12-31") +
  theme(title = element_text(face = "bold"))

The Mendoza line

The Mendoza Line is a term from baseball. Named after Mario Mendoza, it refers to the threshold of incompetent hitting. It is frequently taken to be a batting average of .200, although all the sources I looked at made sure to note that Mendoza’s career average was actually a little better: .215.

This post explores a few questions related to the Mendoza line:

  • Was Mario Mendoza really so bad as to warrant the expression being named after him?
  • How many players fell below the Mendoza line each year?
  • What might the Mendoza line look like if we allowed it to change dynamically?

All code for this post can be found here.

(Caveat: I don’t know baseball well, so some of the assumptions or conclusions I make below may not be good ones. If I made a mistake, let me know!)

The data

The Lahman package on CRAN contains all the baseball statistics from 1871 to 2019. We’ll use the Batting data frame for statistics and the People data frame for player names.

library(Lahman)
library(tidyverse)
data(Batting)
data(People)

# add AVG to Batting
Batting$AVG <- with(Batting, H / AB)

First, let’s look for Mario Mendoza and verify that his batting average is indeed .215:

 
# find Mario Mendoza in People
People %>% filter(nameFirst == "Mario" & nameLast == "Mendoza")
# his ID is mendoma01

Batting %>% filter(playerID == "mendoma01") %>%
  summarize(career_avg = sum(H) / sum(AB))
#   career_avg
# 1  0.2146597

Was Mario Mendoza really so bad as to warrant the expression being named after him?

Let’s compute the career batting averages for the players and limit our dataset to just the players with at least 1000 at bats in their career:

 
# Batting average for players with >= 1000 AB
avg_df <- Batting %>% group_by(playerID) %>%
  summarize(tot_AB = sum(AB), career_avg = sum(H) / sum(AB)) %>%
  filter(tot_AB >= 1000) %>%
  left_join(People, by = "playerID") %>%
  select(playerID, tot_AB, career_avg, nameFirst, nameLast) %>%
  arrange(desc(career_avg))

Let’s look at the top 10 players by batting average: we should see some famous names there! (If not, maybe 1000 ABs is not stringent enough of a criterion to rule out small sample size?)

 
# top 10
head(avg_df, n = 10)
# # A tibble: 10 x 5
#    playerID  tot_AB career_avg nameFirst    nameLast 
#    <chr>      <int>      <dbl> <chr>        <chr>    
#  1 cobbty01   11436      0.366 Ty           Cobb     
#  2 barnero01   2391      0.360 Ross         Barnes   
#  3 hornsro01   8173      0.358 Rogers       Hornsby  
#  4 jacksjo01   4981      0.356 Shoeless Joe Jackson  
#  5 meyerle01   1443      0.356 Levi         Meyerle  
#  6 odoulle01   3264      0.349 Lefty        O'Doul   
#  7 delahed01   7510      0.346 Ed           Delahanty
#  8 mcveyca01   2513      0.346 Cal          McVey    
#  9 speaktr01  10195      0.345 Tris         Speaker  
# 10 hamilbi01   6283      0.344 Billy        Hamilton 

Next, let’s look at the bottom 10 players by batting average:

 
# bottom 10
tail(avg_df, n = 10)
# # A tibble: 10 x 5
#    playerID  tot_AB career_avg nameFirst nameLast
#    <chr>      <int>      <dbl> <chr>     <chr>   
#  1 seaveto01   1315      0.154 Tom       Seaver  
#  2 donahre01   1150      0.152 Red       Donahue 
#  3 fellebo01   1282      0.151 Bob       Feller  
#  4 grovele01   1369      0.148 Lefty     Grove   
#  5 suttodo01   1354      0.144 Don       Sutton  
#  6 amesre01    1014      0.141 Red       Ames    
#  7 faberre01   1269      0.134 Red       Faber   
#  8 perryga01   1076      0.131 Gaylord   Perry   
#  9 pappami01   1073      0.123 Milt      Pappas  
# 10 frienbo01   1137      0.121 Bob       Friend  

Those averages are quite a bit lower than Mendoza’s! Notice also that all of these players have career ABs just over 1000, my threshold for this dataset, so maybe 1000 ABs is too loose a condition. But Mendoza himself only had 1337 career ABs, so if we made the condition more stringent (e.g. considering only players with >= 2000 ABs), it wouldn’t be fair to pick on him.

Among players with >= 1000 ABs, how poor was Mendoza’s performance?

 
# How far down was Mario Mendoza?
which(avg_df$playerID == "mendoma01") / nrow(avg_df)
# [1] 0.9630212

He’s roughly at the 5th quantile of all players for batting average.
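As a quick cross-check, we can compute the 5th quantile of career averages in this dataset and compare it against Mendoza’s .215:

# 5th quantile of career batting averages (players with >= 1000 ABs)
quantile(avg_df$career_avg, 0.05)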

How many players fell below the Mendoza line each year?

For this question, in each season we only consider players who had at least 100 ABs that season.

 
# Look at player-seasons with at least 100 ABs
batting_df <- Batting %>% filter(AB >= 100)

The nice thing about the tidyverse is that we can answer the question above with a series of pipes ending in a plot:

 
batting_df %>% group_by(yearID) %>%
  summarize(below200 = mean(AVG < 0.200)) %>%
  ggplot(aes(yearID, below200)) +
  geom_line() +
  labs(title = "Proportion of players below Mendoza line by year",
       x = "Year", y = "Prop. below .200") +
  theme_bw()

There is a fair amount of fluctuation, with the proportion of players under the Mendoza line going as high as 24% and as low as 1%. If I had to guess, for the last 50 years or so the proportion seems to fluctuate around 5%.
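One way to check that eyeballed figure is to average the yearly proportions over recent seasons (using 1970 onwards as a rough cutoff):

# mean proportion of players below the Mendoza line, 1970 onwards
batting_df %>% filter(yearID >= 1970) %>%
  group_by(yearID) %>%
  summarize(below200 = mean(AVG < 0.200)) %>%
  summarize(mean_prop = mean(below200))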

What might the Mendoza line look like if we allowed it to change dynamically?

Instead of defining the Mendoza line as having a batting average below .200, what happens if we define the Mendoza line for a particular season as the batting average of the player at the 5th quantile?

We can answer this by summarizing the data with the quantile function (dplyr v1.0.0 makes this easy):

 
batting_df %>% group_by(yearID) %>%
  summarize(bottom5 = quantile(AVG, 0.05)) %>%
  ggplot(aes(yearID, bottom5)) +
  geom_line() +
  geom_hline(yintercept = c(0.2), color = "red", linetype = "dashed") +
  labs(title = "Batting average of 5th quantile by year",
       x = "Year", y = "5th quantile batting average") +
  theme_bw()

That looks like a lot of fluctuation, but if you look closely at the y-axis, you’ll see that the values hover between 0.14 and 0.24. Here is the same line graph but with zero included on the y-axis and a loess smoothing curve:

 
batting_df %>% group_by(yearID) %>%
  summarize(bottom5 = quantile(AVG, 0.05)) %>%
  ggplot(aes(yearID, bottom5)) +
  geom_line() +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = c(0.2), color = "red", linetype = "dashed") +
  scale_y_continuous(limits = c(0, 0.24)) +
  labs(title = "Batting average of 5th quantile by year",
       x = "Year", y = "5th quantile batting average") +
  theme_bw()

I’m not sure how much to trust that smoother, but it comes awfully close to the Mendoza line!

For completeness, here are the lines representing the batting averages for players at various quantiles over time:

 
batting_df %>% group_by(yearID) %>%
  summarize(AVG = quantile(AVG, c(0.05, 1:4 / 5, 0.95)),
            quantile = c(0.05, 1:4 / 5, 0.95)) %>%
  mutate(quantile = factor(quantile, levels = c(0.95, 4:1 / 5, 0.05))) %>%
  ggplot(aes(x = yearID, y = AVG, col = quantile)) +
  geom_line() +
  geom_hline(yintercept = c(0.2), color = "red", linetype = "dashed") +
  labs(title = "Batting average for various quantiles by year",
       x = "Year", y = "Quantile batting average") +
  theme_bw()

At a glance the higher quantile lines look just like vertical translations of the 5th quantile line, suggesting that across years the entire distribution of batting averages shifts up or down (not just parts of the distribution).
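One rough way to check this is to plot the spread between the 95th and 5th quantiles over time: if the whole distribution merely shifts up or down, the spread should stay relatively flat.

# spread between the 95th and 5th quantile batting averages by year
batting_df %>% group_by(yearID) %>%
  summarize(spread = quantile(AVG, 0.95) - quantile(AVG, 0.05)) %>%
  ggplot(aes(yearID, spread)) +
  geom_line() +
  labs(title = "Spread between 95th and 5th quantile batting average by year",
       x = "Year", y = "Spread") +
  theme_bw()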

glmnet v4.1: regularized Cox models for (start, stop] and stratified data

My latest work on the glmnet package has just been pushed to CRAN! In this release (v4.1), we extend the scope of regularized Cox models to include (start, stop] data and strata variables. In addition, we provide the survfit method for plotting survival curves based on the model (as the survival package does).

Why is this a big deal? As explained in Therneau and Grambsch (2000), the ability to work with start-stop responses opens the door to fitting regularized Cox models with:

  • time-dependent covariates,
  • time-dependent strata,
  • left truncation,
  • multiple time scales,
  • multiple events per subject,
  • independent increment, marginal, and conditional models for correlated data, and
  • various forms of case-cohort models.

glmnet v4.1 is now available on CRAN here. We have reorganized the package’s vignettes, with the new functionality described in the vignette “Regularized Cox Regression” (PDF version/web version). Don’t hesitate to reach out if you have questions.
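To give a flavor of the new functionality, here is a minimal sketch with simulated data (treat it as a sketch and see the vignette for the authoritative usage): stratifySurv() attaches strata to a (start, stop] survival response, which can then be passed to glmnet with family = "cox".

library(glmnet)
library(survival)

# simulated (start, stop] data with two strata (purely illustrative)
set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
tstart <- runif(n, 0, 5)
tstop  <- tstart + rexp(n, rate = 1)
status <- rbinom(n, 1, 0.7)
strata <- rep(1:2, length.out = n)

y <- stratifySurv(Surv(tstart, tstop, status), strata)
fit <- glmnet(x, y, family = "cox")
cvfit <- cv.glmnet(x, y, family = "cox")

# survival curves from the fitted model (the survfit method needs the
# original x and y)
sf <- survfit(fit, s = cvfit$lambda.min, x = x, y = y)
plot(sf)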

(Note: This is joint work with Trevor Hastie, Balasubramanian Narasimhan and Rob Tibshirani.)

Simulating the dice game “Toss Up!” in R

Toss Up! is a very simple dice game that I’ve always wanted to simulate but never got around to doing so (until now!). This post outlines how to simulate a Toss Up! game in R, as well as how to evaluate the effectiveness of different game strategies. All the code for this blog post is available here.

Rules

The official rules for Toss Up! are available here. Here is an abbreviated version:

  • As shown in the picture above, the game uses 10 six-sided dice. Each die has 3 green sides, 2 yellow sides and one red side.
  • For each turn, a player rolls all 10 dice. If any greens are rolled, set them aside.
    • At this point, the player can choose to end the turn and “bank” the points (one point for one green), or keep rolling the remaining dice.
    • After every roll, greens are set aside. If you roll enough times to make all 10 dice green, you have 10 points and you can either bank or roll all 10 dice again.
    • A player can keep going until he/she decides to stop rolling and bank all the points earned on this turn.
    • Catch: If, on a roll, none of the dice are green and at least one is red, the turn ends and no points can be banked.
  • First player to reach 100 points wins.

Simulating Toss Up!

There are several different ways to implement Toss Up!: I describe my (two-player) version below. To allow greater flexibility in the resulting code, I implement the game with three global constants that the programmer can change:

  1. NUM_DICE: The number of dice used in the game (default is 10).
  2. FACE_PROBS: A numeric vector of length 3 denoting the probability that a die comes up red, yellow or green respectively (default is c(1/6, 2/6, 3/6)).
  3. POINTS_TO_WIN: The number of points a player needs to obtain to win the game (default is 100).

Step 1: How can we describe the state of a game?

The state of the game can be encapsulated with 4 pieces of information:

  1. current_player: whose turn it is to make a decision.
  2. scores: a vector of length 2 containing the scores for players 1 and 2 respectively.
  3. turn_points: Number of points scored on the current turn so far (these points have not been banked yet).
  4. dice_rolls: A vector of variable length denoting the outcome of the dice rolls (0: red, 1: yellow, 2: green).

In my implementation, the state is stored as a list with the 4 elements above.

Step 2: Updating the state

From a given state, there are 3 possibilities for what comes next:

  1. The dice rolls don’t have any greens and at least one red. In this case, the current player’s turn is over. We need to change current_player, reset turn_points to 0, and re-roll the dice (NUM_DICE of them).
  2. The dice rolls either (i) have at least one green, or (ii) have no reds. In this case, the current player has a choice of what to do.
    1. If the player chooses to bank, then we need to update scores, reset turn_points to 0, change current_player and re-roll the dice (NUM_DICE of them).
    2. If the player chooses to roll, then we need to update turn_points and re-roll just the dice that were not green.

The function UpdateState below does all of the above. I have also added a verbose option which, if set to TRUE, prints information on the game state to the console. (The function DiceToString is a small helper function for printing out the dice rolls.)

DiceToString <- function(dice_rolls) {
  return(paste(sum(dice_rolls == 2), "Green,",
               sum(dice_rolls == 1), "Yellow,",
               sum(dice_rolls == 0), "Red"))
}

# Executes the current player's action and updates the state. Is also used if
# no greens and at least one red is rolled.
UpdateState <- function(state, action, verbose = FALSE) {
  if (verbose) {
    cat("Current roll:", DiceToString(state$dice_rolls))
    if (action != "rolled at least 1 red and no green")
      cat(" (bank", state$turn_points + sum(state$dice_rolls == 2), "pts?)",
          fill = TRUE)
    else
      cat("", fill = TRUE)
    cat(paste0("Player ", state$current_player, " ", action, "s"),
        fill = TRUE)
  }
    
  if (action == "bank") {
    # action = "bank": bank the points current player earned this turn, then
    # re-roll the dice.
    state$scores[state$current_player] <- state$scores[state$current_player] +
      state$turn_points + sum(state$dice_rolls == 2)
    state$turn_points <- 0
    state$dice_rolls <- sample(0:2, size = NUM_DICE, replace = TRUE, 
                               prob = FACE_PROBS)
    state$current_player <- 3 - state$current_player
  } else if (action == "roll") {
    # action = "roll": add the green dice to points earned this turn, then 
    # re-roll the remaining dice.
    state$turn_points <- state$turn_points + sum(state$dice_rolls == 2)
    ndice <- sum(state$dice_rolls != 2)
    if (ndice == 0) ndice <- NUM_DICE
    state$dice_rolls <- sample(0:2, size = ndice, replace = TRUE, 
                               prob = FACE_PROBS)
  } else if (action == "rolled at least 1 red and no green") {
    # action = "rolled at least 1 red and no green": 
    # no points banked, turn ends, re-roll dice.
    state$turn_points <- 0
    state$dice_rolls <- sample(0:2, size = NUM_DICE, replace = TRUE, 
                               prob = FACE_PROBS)
    state$current_player <- 3 - state$current_player
  } else {
    stop("action must be 'bank', 'roll', or 'rolled at least 1 red and no green'")
  }
  
  if (verbose) {
    if (action != "roll") {
      cat("Current scores:", state$scores, fill = TRUE)
      cat("", fill = TRUE)
    }
  }
  
  return(state)
}

Step 3: How to express player behavior and strategy?

We can think of a player as a black box which takes a game state as an input and outputs a decision, “bank” or “roll”. In other words, the player is a function!

Below are two extremely simple strategies expressed as functions. The first strategy simply banks after the first roll. The second strategy keeps rolling until more than target points can be banked for the turn.

# strategy that stops after one roll
OneRoll <- function(state) {
  return("bank")
}

# strategy that stops only when the turn yields > `target` points
OverTargetPoints <- function(state, target = 10) {
  if (state$turn_points + sum(state$dice_rolls == 2) > target) {
    return("bank")
  } else {
    return("roll")
  }
}

(Note: In my implementation, strategy functions should assume that they are current_player: that is how they can determine their current score and that of their opponent correctly.)

Step 4: Simulating a full game

We can now put the pieces from the previous steps together to simulate a full game of Toss Up!. The SimulateGame function takes in two strategy functions as input. It sets up the initial state, then updates the state repeatedly until the game ends.

SimulateGame <- function(Strategy1, Strategy2, verbose = FALSE) {
  # set up initial state
  state <- list(current_player = 1,
                scores = c(0, 0),
                turn_points = 0,
                dice_rolls = sample(0:2, size = NUM_DICE, replace = TRUE, 
                                    prob = FACE_PROBS))
  
  # check if no greens and at least one red, if so change player
  while (sum(state$dice_rolls == 2) == 0 && sum(state$dice_rolls == 0) > 0) {
    state <- UpdateState(state, "rolled at least 1 red and no green", verbose)
  }
  
  # while the game has not ended:
  # - get the next action from the current player's strategy
  # - update the state
  while (max(state$scores) < POINTS_TO_WIN) {
    if (state$current_player == 1) {
      action <- Strategy1(state)
    } else {
      action <- Strategy2(state)
    }
    state <- UpdateState(state, action, verbose)
    
    # check if no greens and at least one red
    while (sum(state$dice_rolls == 2) == 0 && sum(state$dice_rolls == 0) > 0) {
      state <- UpdateState(state, "rolled at least 1 red and no green", verbose)
    }
  }
  
  # game has ended: return winner
  if (verbose) {
    cat(paste("Game ends: Player", which.max(state$scores), "wins!"),
        fill = TRUE)
  }
  return(which.max(state$scores))
}

Two examples of simulated games

The code below shows what a simulated game of Toss Up! might look like. In the code snippet below, players need 20 points to win. The first player uses the super conservative strategy of banking immediately, while the second player tries to win the entire game in one turn.

NUM_DICE <- 10
FACE_PROBS <- c(1/6, 2/6, 3/6)
POINTS_TO_WIN <- 20
set.seed(1)
winner <- SimulateGame(OneRoll,
                       function(state) { OverTargetPoints(state, 19) },
                       verbose = TRUE)

# Current roll: 4 Green, 3 Yellow, 3 Red (bank 4 pts?)
# Player 1 banks
# Current scores: 4 0
# 
# Current roll: 5 Green, 4 Yellow, 1 Red (bank 5 pts?)
# Player 2 rolls
# Current roll: 3 Green, 1 Yellow, 1 Red (bank 8 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 0 Red (bank 10 pts?)
# Player 2 rolls
# Current roll: 5 Green, 4 Yellow, 1 Red (bank 15 pts?)
# Player 2 rolls
# Current roll: 2 Green, 3 Yellow, 0 Red (bank 17 pts?)
# Player 2 rolls
# Current roll: 0 Green, 3 Yellow, 0 Red (bank 17 pts?)
# Player 2 rolls
# Current roll: 2 Green, 1 Yellow, 0 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 0 Green, 1 Yellow, 0 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 0 Green, 1 Yellow, 0 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 1 Green, 0 Yellow, 0 Red (bank 20 pts?)
# Player 2 banks
# Current scores: 4 20
# 
# Game ends: Player 2 wins!

In this particular simulation, it paid off to take on a bit more risk. As the simulation below shows, that’s not always the case.

NUM_DICE <- 10
FACE_PROBS <- c(0.5, 0.0, 0.5)
POINTS_TO_WIN <- 20
set.seed(1)
winner <- SimulateGame(OneRoll,
                       function(state) { OverTargetPoints(state, 19) },
                       verbose = TRUE)

# Current roll: 6 Green, 0 Yellow, 4 Red (bank 6 pts?)
# Player 1 banks
# Current scores: 6 0
# 
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 3 Red (bank 7 pts?)
# Player 2 rolls
# Current roll: 0 Green, 0 Yellow, 3 Red
# Player 2 rolled at least 1 red and no greens
# Current scores: 6 0
# 
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 1 banks
# Current scores: 11 0
# 
# Current roll: 7 Green, 0 Yellow, 3 Red (bank 7 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 1 Red (bank 9 pts?)
# Player 2 rolls
# Current roll: 1 Green, 0 Yellow, 0 Red (bank 10 pts?)
# Player 2 rolls
# Current roll: 3 Green, 0 Yellow, 7 Red (bank 13 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 5 Red (bank 15 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 3 Red (bank 17 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 1 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 0 Green, 0 Yellow, 1 Red
# Player 2 rolled at least 1 red and no greens
# Current scores: 11 0
# 
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 1 banks
# Current scores: 16 0
# 
# Current roll: 4 Green, 0 Yellow, 6 Red (bank 4 pts?)
# Player 2 rolls
# Current roll: 4 Green, 0 Yellow, 2 Red (bank 8 pts?)
# Player 2 rolls
# Current roll: 1 Green, 0 Yellow, 1 Red (bank 9 pts?)
# Player 2 rolls
# Current roll: 0 Green, 0 Yellow, 1 Red
# Player 2 rolled at least 1 red and no greens
# Current scores: 16 0
# 
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 1 banks
# Current scores: 21 0
# 
# Game ends: Player 1 wins!

Comparing some simple strategies

We can’t tell whether one strategy is better than another by looking at a single game since there is so much randomness involved. What we should do is simulate many games and see which strategy wins out over the long run.

OneRoll vs. OneRoll

Let’s do a sanity check. Here, we run 10,000 games of the strategy OneRoll vs. itself (this takes around 4 seconds on my machine). Since both players use the same strategy, we might expect player 1 to win around 50% of the time, right?

NUM_DICE <- 10
FACE_PROBS <- c(1/6, 2/6, 3/6)
POINTS_TO_WIN <- 100

nsim <- 10000
set.seed(1)
winners <- replicate(nsim, SimulateGame(OneRoll, OneRoll))
table(winners)  # not 50-50: player who starts first has an advantage

# winners
# 1    2 
# 5910 4090 

It looks like player 1 wins around 59% of the time! On second thought, we would expect player 1 to win more because he/she has a first mover advantage: if player 1 reaches 100 points first, he/she wins even if player 2 would have reached 100 points on the very next turn.

To make this sanity check work, we should have the players each go first half the time. The code below achieves that:

# for even numbered simulations, let player 2 start first
winners2 <- winners
winners2[2 * 1:(nsim/2)] <- 3 - winners2[2 * 1:(nsim/2)]
table(winners2)

# winners2
# 1    2 
# 5030 4970 

Now player 1 wins just 50.3% of the time, which is much closer to the expected 50%.

OneRoll vs. >20 points

Next, let’s compare the “one roll” strategy with the strategy which stops once the player can bank more than 20 points for the turn:

set.seed(1)
winners <- replicate(nsim, SimulateGame(
  OneRoll,
  function(state) { OverTargetPoints(state, 20) }))
table(winners)

# winners
# 1    2 
# 75 9925 

Wow! The strategy of banking only when >20 points have been earned on a turn wins almost all the time, even though it doesn’t have the first mover advantage!

>20 points vs. >50 points

Taking a bit more risk than always banking seems to pay off. What about even more risk? How does banking after >20 points compare with banking after >50 points?

set.seed(1)
winners <- replicate(nsim, SimulateGame(
  function(state) { OverTargetPoints(state, 20) },
  function(state) { OverTargetPoints(state, 50) }))
table(winners)

# winners
# 1    2 
# 6167 3833 

# switch order of players
set.seed(1)
winners <- replicate(nsim, SimulateGame(
  function(state) { OverTargetPoints(state, 50) },
  function(state) { OverTargetPoints(state, 20) }))
table(winners)

# winners
# 1    2 
# 4414 5586 

The >20 points strategy won 61.7% of the games when it went first, and won 55.9% of the games when it went second, so it’s clearly a superior strategy.

Where can we go from here?

There are several future directions we could take with this code base.

  • The strategies presented here are very simple. Can we program more complex strategies (e.g. some that depend on the score of the other player, or on how close the player is to winning)? See the sketch after this list for one possibility.
  • Among strategies of the form “bank once you have more than xx points”, is there an optimal value for xx?
  • What is the optimal strategy for Toss Up!? Can we learn it through methods such as Q-learning?
  • Is there even such a thing as an optimal strategy? Do there exist 3 strategies such that A beats B, B beats C, but C beats A?
  • Can we quantify the first mover advantage?
  • Can we modify the code so that we can visualize the players’ scores as the game progresses?
  • Can we modify the code so that more than 2 players can play?
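For example, here is a rough sketch (with arbitrary, untested thresholds) of a strategy that depends on the scores: play it safe when ahead, take more risks when behind, and bank immediately if doing so wins the game.

# sketch of a score-dependent strategy (thresholds are arbitrary)
AdaptiveStrategy <- function(state) {
  my_score   <- state$scores[state$current_player]
  opp_score  <- state$scores[3 - state$current_player]
  turn_total <- state$turn_points + sum(state$dice_rolls == 2)
  
  # bank immediately if this turn would win the game
  if (my_score + turn_total >= POINTS_TO_WIN) return("bank")
  
  # play it safe when ahead, chase harder when behind
  target <- if (my_score >= opp_score) 10 else 25
  OverTargetPoints(state, target)
}

# e.g. winners <- replicate(nsim, SimulateGame(AdaptiveStrategy, OneRoll))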

Exploring the game “First Orchard” with simulation in R

My daughter received the board game First Orchard as a Christmas present and she’s hooked on it so far. In playing the game with her, a few probability/statistics questions came to mind. This post outlines how I answered some of them using simulation in R. All code for this blog post can be found here.

(In my googling I found that Matt Lane has an excellent blog post on this game, answering some of the questions that I was interested in.)

Gameplay

Before we get to the questions, let me give a quick explanation of how the game works (see Reference 1 for a more colorful explanation as well as an applet to play the game online).

  • It’s a cooperative game, with all players playing against the board.
  • The game starts with 16 fruit: 4 of each color (red, green, blue, yellow), and a raven at one end of a path that is 5 steps long.
  • On each player’s turn, the player rolls a 6-sided die.
    • If the die comes up red, green, blue or yellow, the player gets to “harvest” a fruit of that color if there are any left to harvest. If all 4 fruits of that color have been harvested, nothing happens.
    • If the die shows a fruit basket, the player gets to harvest a fruit of any color.
    • If the die shows a raven, the raven moves one step along the path.
  • The game ends when either all the fruit are harvested (players win) or when the raven reaches the end of the path (raven wins).

As you can see this is a really simple game (hence its “suitable for 2+ years” rating). The only strategy is in choosing which fruit to take if a fruit basket shows up: see Reference 1 for some simulations of different strategies. Intuitively it seems like choosing the color with the most fruit remaining is the best strategy, and that’s what I bake into my code. (Is there a proof for this, and is it still true in the more general circumstances described in Reference 1?)

Code for simulating the game

The state of the game can be captured in a numeric vector of length 5. The first 4 numbers refer to the number of fruit left for each color, and the 5th number keeps track of the number of steps the raven has taken so far. I created 3 functions to simulate one game of First Orchard (see full code here):

  • SimulateTurn(state, verbose) simulates one die roll and updates the state of the game (see the sketch right after this list). For simplicity, if a 1-4 is rolled, a fruit is harvested from the corresponding tree. If a 5 is rolled, the raven takes a step. A rolled 6 is taken to mean “fruit basket”, and I remove a fruit from the tree with the most remaining fruit.
  • CheckGameState(state, max_raven_steps) checks if the game has ended or not, and if so, who won.
  • SimulateGame(fruit_count, max_raven_steps, verbose) runs an entire game of First Orchard: while the game has not ended, run SimulateTurn. Once the game has ended, this function returns (i) who won, (ii) the number of turns taken, (iii) the number of steps the raven took, and (iv) the number of fruit left.
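To make the descriptions concrete, here is one possible sketch of SimulateTurn that is consistent with the description above (the actual implementation is in the linked code and may differ in its details):

# one possible sketch of SimulateTurn; state is a length-5 vector giving the
# fruit left for colors 1-4 and the number of steps the raven has taken
SimulateTurn <- function(state, verbose = FALSE) {
  roll <- sample(6, size = 1)
  if (roll <= 4) {
    # harvest a fruit of that color, if any are left
    if (state[roll] > 0) state[roll] <- state[roll] - 1
  } else if (roll == 5) {
    # raven takes a step
    state[5] <- state[5] + 1
  } else {
    # fruit basket: harvest from the color with the most fruit remaining
    idx <- which.max(state[1:4])
    if (state[idx] > 0) state[idx] <- state[idx] - 1
  }
  if (verbose) cat("Roll:", roll, ", State:", paste(state, collapse = ","),
                   fill = TRUE)
  return(state)
}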

We allow two game parameters to be defined by the user: the number of fruit of each color at the start of the game (fruit_count, default is 4) and the number of steps the raven must take in order for it to win (max_raven_steps, default is 5). These functions also have a verbose option so that the user can see what happened in the game. The code below is an example of the output from SimulateGame:

set.seed(1)
results <- SimulateGame(fruit_count = 2, max_raven_steps = 3, 
                        verbose = TRUE)
# Roll: 1 , State: 1,2,2,2,0
# Roll: 4 , State: 1,2,2,1,0
# Roll: 1 , State: 0,2,2,1,0
# Roll: 2 , State: 0,1,2,1,0
# Roll: 5 , State: 0,1,2,1,1
# Roll: 3 , State: 0,1,1,1,1
# Roll: 6 , State: 0,0,1,1,1
# Roll: 2 , State: 0,0,1,1,1
# Roll: 3 , State: 0,0,0,1,1
# Roll: 3 , State: 0,0,0,1,1
# Roll: 1 , State: 0,0,0,1,1
# Roll: 5 , State: 0,0,0,1,2
# Roll: 5 , State: 0,0,0,1,3
# Raven wins
# # of turns: 13
# # of steps raven took: 3
# # fruit left: 1

Simulation time!

What is the probability of winning the game? How does that change as we vary (i) the number of fruit of each color, and (ii) the number of steps the raven must take in order for the players to lose?

We let the number of fruit of each color vary from 1 to 8, and the number of steps the raven must take from 1 to 8. For each parameter setting, we simulate the game 10,000 times and compute the player win probability. We plot the results as a heatmap.
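Here is a sketch of that parameter sweep. (I’m assuming here that SimulateGame returns a list with a winner element taking values “players” or “raven”; the element names in the actual code may differ.)

library(ggplot2)

# sweep over fruit_count and max_raven_steps (element name `winner` assumed)
param_grid <- expand.grid(fruit_count = 1:8, max_raven_steps = 1:8)
nsim <- 10000
param_grid$win_prob <- apply(param_grid, 1, function(params) {
  winners <- replicate(nsim, SimulateGame(params["fruit_count"],
                                          params["max_raven_steps"])$winner)
  mean(winners == "players")
})

ggplot(param_grid, aes(factor(fruit_count), factor(max_raven_steps),
                       fill = win_prob)) +
  geom_tile() +
  labs(x = "Fruit of each color", y = "Steps raven needs to win",
       fill = "Win prob.")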

As one might expect, win probability goes down as the number of fruit increases and as the number of steps the raven must take decreases. For the original game (4 fruit, 5 steps), the win probability is approximately 62%. Sounds reasonable: we would like a win probability >50% so that kids will not get discouraged by too much losing, but not so high that they think the game is trivial.

For the original game (4 fruit, 5 steps), what is the expected number of turns until the game ends? Does this change depending on who wins?

We simulate the game 100,000 times and keep track of the number of turns taken in each game. The shortest game took 5 turns while the longest took 45, with the modal number of turns being 21 (which also happens to be the mean and median). Here is the histogram for all 100,000 runs:

Here are the histograms split by the outcome:

Games where the raven wins tend to be shorter than those where the players win. Maybe that’s not too surprising: a raven win needs only 5 turns (the 5 raven rolls), while a player win needs at least 16 turns (one for each fruit). On average, the game takes 19 turns when the raven wins and 22 turns when the players win.

For the original game (4 fruit, 5 steps), given that the raven loses, what is the distribution of the number of steps the raven has taken?

Because we programmed SimulateGame to return the number of steps the raven has taken as well, we don’t have to rerun the simulations: we can just use the 100,000 simulations we ran previously and look at the ones that the raven lost. Here is the histogram of steps the raven took in losing games, with the vertical red line representing the mean:

For the original game (4 fruit, 5 steps), given that the raven wins, what is the distribution of the number of unharvested fruit?

Again, we can just use the 100,000 simulations we ran previously and look at the ones that the raven won. Here is the histogram along with the mean and median marked out with vertical lines:

The modal number of fruit left in player-losing games is 1: ugh tantalizingly close!

If there was no raven, how many turns would it take to harvest all the fruit?

“No raven” is the same as saying that the raven needs to take an infinite number of steps in order to win. Hence, we can use our existing simulation code with max_raven_steps = Inf to simulate this setting.

The shortest game took 16 turns while the longest game took 63 turns, with 22 and 24 turns being the modal and mean number of turns respectively. (In theory, a game could go on forever.) Here is the histogram:

References:

  1. Matt Lane. (2018). Harvesting Wins.

What is convex clustering?

Introduction

Let’s say we have n observations in p-dimensional space x_1, \dots, x_n \in \mathbb{R}^p. The problem of clustering is to group these observations into a (small) number of clusters such that the observations within each cluster are more alike than observations in other clusters.

Convex clustering is one method that can be used to cluster observations. Instead of directly assigning each observation to a cluster, it assigns each observation to a point called the “cluster centroid”. Two observations are then said to belong to the same cluster if they share the same cluster centroid.

Convex clustering computes the cluster centroids by minimizing a convex function, hence its name. Convex clustering has several (well, at least two) nice properties:

  1. It is the solution to a convex minimization problem. Hence, algorithms can find a global minimum instead of a local minimum (like k-means clustering). In addition, this allows us to give theoretical and computational guarantees for the solution.
  2. Convex clustering can give not just a single clustering solution but an entire path of clustering solutions, ranging from each observation being in its own cluster, to all observations belonging to a single cluster. The user can choose how many clusters they want to see, and the algorithm can give a solution that closely matches that. In this sense convex clustering can be viewed as an alternative to hierarchical clustering.

The mathematical details

Each of the observations x_j \in \mathbb{R}^p is assigned to a corresponding cluster centroid u_j \in \mathbb{R}^p. The u_j‘s are the solution to the minimization problem

\begin{aligned} \text{minimize}_{u_1, \dots, u_n} \quad \dfrac{1}{2}\sum_{j=1}^n \| x_j - u_j \|_2^2 + \lambda \sum_{i < j} w_{ij} \| u_i - u_j \|_q^2, \end{aligned}

where q \geq 1 is a fixed constant, \lambda \geq 0 is a tuning hyperparameter, and the w_{ij} > 0‘s are pair-specific weights.

From my searching, it looks like the 3 earliest references for convex clustering are Pelckmans et al. (2005), Lindsten et al. (2011) and Hocking et al. (2011), listed as References 1-3 at the end of this post. Pelckmans et al. (2005) is the first paper to propose this minimization problem without the w_{ij} term, and focuses on the cases where q = 1 or q = 2. Lindsten et al. (2011) has the same formulation but argues that q = 2 is a more appropriate choice. Hocking et al. (2011) is probably the first reference that includes the w_{ij} term. They suggest the choice w_{ij} = \exp (- \gamma \|x_i - x_j \|_2^2) to obtain clustering results that are sensitive to local density in the data. From my cursory glance at later literature, it seems that the choice of weights is crucial in obtaining a good solution.

The special cases of \lambda = 0 and \lambda = \infty

Here is the minimization problem again:

\begin{aligned} \text{minimize}_{u_1, \dots, u_n} \quad \dfrac{1}{2}\sum_{j=1}^n \| x_j - u_j \|_2^2 + \lambda \sum_{i < j} w_{ij} \| u_i - u_j \|_q^2, \end{aligned}

We see an inherent tradeoff between the two terms in the objective function. The first term encourages the cluster centroids to be near to their assigned points, while the second term encourages the cluster centroids to be near to each other.

Let’s look at what happens for extreme values of \lambda to see this. When \lambda = 0, there is no penalty at all for cluster centroids being far from each other. Hence, the optimal solution is u_j = x_j for all j, i.e. each point is in its own cluster. On the other hand, when \lambda = \infty, any difference in cluster centroids is penalized infinitely. This means that the optimal solution is u_i = u_j for all i and j, i.e. all points are in the same cluster.

For \lambda \in (0, \infty), the tradeoff is less stark. We can make u_j closer to the other centroids, but possibly at the cost of increasing \| x_j - u_j \|_2^2: how much we trade off depends on the size of \lambda.

R implementation for convex clustering

Your best bet for performing convex clustering in R is probably the clustRviz package (available here on Github) which can compute the convex clustering path via its CARP function. It does so using the CARP algorithm explained in Reference 4. It should be noted that the path it computes is approximate, although (i) approximations are often good enough for interpretation of clusters, and (ii) the algorithm has some theoretical backing: as the algorithm’s parameters converge to certain values, the approximate path of solutions converges to the exact solution path.
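Here is a quick, hypothetical sketch of what this might look like. I’m assuming that CARP() accepts a numeric data matrix and that the fitted object has print and plot methods; check the package documentation for the actual interface.

library(clustRviz)

# sketch only: compute an (approximate) convex clustering path
X <- as.matrix(iris[, 1:4])
carp_fit <- CARP(X)
carp_fit        # print a summary of the fit
plot(carp_fit)  # visualize the clustering path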

As far as I can tell there isn’t a convex clustering package currently on CRAN. The cvxclustr package was on CRAN but has been archived since 2020-10-28 due to check issues not being corrected in time. Older versions of the package are available here.

Hocking et al. (2011) provided R packages for their algorithm (clusterpath and clusterpathRcpp) that are available here on R-Forge, but they look old and I have no idea if they still work or are in use.

Generalizations of convex clustering

The convex clustering minimization problem can be generalized easily by changing the distance metric for either term in the objective function. This allows the problem to remain convex while enabling the solutions to have other desired properties. See Section 1.1 of Reference 5 for some examples.

References:

  1. Pelckmans, K. et al. (2005). Convex clustering shrinkage.
  2. Lindsten, F. et al. (2011). Clustering using sum-of-norms regularization: with application to particle filter output computation.
  3. Hocking, T. D. et al. (2011). Clusterpath: an algorithm for clustering using convex fusion penalties.
  4. Weylandt, M., Nagorski, J., and Allen, G. I. (2020). Dynamic visualization and fast computation for convex clustering via algorithmic regularization.
  5. Weylandt, M., and Michailidis, G. (2020). Automatic registration and convex clustering of time series.

Upper bound on the absolute value of roots of polynomials

Let p(x) = a_0 + a_1 x + \dots + a_n x^n be a polynomial with complex coefficients a_0, \dots, a_n \in \mathbb{C} and a_n \neq 0. Let r \in \mathbb{C} be a root of the polynomial, i.e.

p(r) = 0.

Can we say anything about |r|? Yes: we have a number of upper bounds on |r|. The two most common ones are Lagrange’s bound

\begin{aligned} |r| \leq \max \left\{ 1, \sum_{i=0}^{n-1} \left| \dfrac{a_i}{a_n} \right| \right\}, \end{aligned}

and Cauchy’s bound

\begin{aligned} |r| \leq 1 + \max \left\{ \left| \dfrac{a_{n-1}}{a_n} \right|, \left| \dfrac{a_{n-2}}{a_n} \right|, \dots, \left| \dfrac{a_0}{a_n} \right| \right\}. \end{aligned}

This Wikipedia article contains the proofs of the bounds above as well as other upper bounds.
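These bounds are easy to check numerically with base R’s polyroot function. The polynomial below is just an arbitrary example:

# check the bounds for p(x) = 2 + 3x - 5x^2 + x^4
a <- c(2, 3, -5, 0, 1)               # coefficients a_0, ..., a_n
n <- length(a) - 1

max(Mod(polyroot(a)))                # largest |r| among the roots
max(1, sum(abs(a[1:n] / a[n + 1])))  # Lagrange's bound
1 + max(abs(a[1:n] / a[n + 1]))      # Cauchy's bound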

(Note: This post feels more like a math “odd-and-end” rather than a stats one, but who knows? This may come in handy for some theoretical proof.)

(Credits: I first learned of the bound by Lagrange on John Cook’s blog.)

A shiny app for exploratory data analysis

I recently learnt how to build basic R Shiny apps. To practice using Shiny, I created a simple app that you can use to perform basic exploratory data analysis. You can use the app here to play around with the diamonds dataset from the ggplot2 package. To use the app for your own dataset, download the raw R code here (just the app.R file) and assign your dataset to raw_df. In the rest of this post, I outline how to use this app.

(Credits: I worked off the source code for the “Diamonds Explorer” app. There are a few versions of this app out there and I can’t find the exact source I used, but it was very close to the source code of this version.)

As you can see from the screenshot below, the main panel (on the right) has 4 tabs. The last two tabs simply give the output of calling the summary and str functions on the entire dataset; they are not affected by the controls in the side panel. The “Data Snippet” panel prints up to 15 rows of the dataset for a peek into what the dataset looks like. (These 15 rows are the first 15 rows of the dataset used to create the plot on the “Plot” tab.)

The most interesting tab is probably the “Plot” tab. First let me describe how the app selects the dataset it makes the plot for. By default, it picks 1000 random observations, or all the observations if the dataset has fewer than 1000 rows. The user can input the random seed for reproducibility. The user can also control the number of observations using the slider, and choose whether the observations are picked randomly or taken from the top of the dataset.

The type of plot the app makes depends on the types of the variables given to it. In the screenshot above, one numeric variable and one non-numeric variable are given, so the app makes a boxplot. If two numeric variables are given, it makes a scatterplot:

For scatterplots, the user has the option to jitter the points and/or to add a smoothing line:

If two non-numeric variables are given, the app makes a heatmap depicting how often each combination is present in the data:

The plots above depict the joint distribution of two variables in the dataset. If the user wants a visualization for just one variable, the user can set the “Y” variable to “None”. If the “X” variable is numeric, the app plots a histogram:

If the “X” variable is non-numeric, the app plots a bar plot showing counts:

Finally, let’s talk about color. For simplicity, the app only allows plots to be colored by non-numeric variables. Below is an example of a colored scatterplot:

As the screenshot below shows, color works for boxplots too. (Color does not work for heatmaps.)

Color can be used for the one-dimensional plots as well:

There are certainly several improvements that can be made to this app. For one, it would be nice if the user could upload their dataset through the app (instead of downloading my source code and assigning their dataset to the raw_df variable). There could also be more controls to let the user change certain aspects of the plot (e.g. transparency of points), but at some point the UI might become too overwhelming for the user.
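Finally, to summarize the plot-type logic described above in code, here is a condensed sketch (this is not the app’s actual code, which lives in the linked app.R):

library(ggplot2)

# condensed sketch of the plot-type logic: pick a geom based on variable types
MakePlot <- function(df, xvar, yvar = "None") {
  if (yvar == "None") {
    p <- ggplot(df, aes(x = .data[[xvar]]))
    if (is.numeric(df[[xvar]])) return(p + geom_histogram(bins = 30))
    return(p + geom_bar())
  }
  x_num <- is.numeric(df[[xvar]])
  y_num <- is.numeric(df[[yvar]])
  if (x_num && y_num) {
    ggplot(df, aes(.data[[xvar]], .data[[yvar]])) + geom_point()
  } else if (!x_num && y_num) {
    ggplot(df, aes(.data[[xvar]], .data[[yvar]])) + geom_boxplot()
  } else if (x_num && !y_num) {
    ggplot(df, aes(.data[[yvar]], .data[[xvar]])) + geom_boxplot() + coord_flip()
  } else {
    # two non-numeric variables: heatmap of counts
    counts <- as.data.frame(table(df[[xvar]], df[[yvar]]))
    ggplot(counts, aes(Var1, Var2, fill = Freq)) + geom_tile() +
      labs(x = xvar, y = yvar)
  }
}

# e.g. with the diamonds dataset:
# MakePlot(diamonds, "carat", "price")   # scatterplot
# MakePlot(diamonds, "cut", "price")     # boxplot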

Happy exploring!

The law of large populations

I recently read through a really interesting paper by Xiao-Li Meng (Reference 1). It’s relatively light on mathematical machinery (you can get through most of it with just a first course in statistics), but deep in terms of the statistical interpretation of the results. Meng is a first-rate communicator of statistics (just watch some of his videos on Youtube or read some of his musings on XL Files), making this work a joy to read.

(I did not check, but I think Meng covers some of the material in the paper in this Youtube video.)

Motivation

This paper was motivated by the question “Which one should we trust more, a 5% survey sample or an 80% administrative dataset?”, where 5% and 80% represent the proportion of the population in the sample. It is pretty obvious that one should prefer a well-randomized 5% survey sample over a 5% administrative dataset, since the latter probably contains several sources of bias. But what if the administrative dataset is much larger? Can we say anything about it?

Meng answers this question to some degree, considering the case where the quantity of interest is a population mean, and we try to estimate it using a sample mean. In the rest of this post, I present just the key identity: read the paper for the details!

Set-up

Define the following quantities:

  • N: size of the population, n: size of the sample.
  • \overline{G}_N = \dfrac{1}{N}\displaystyle\sum_{j=1}^N G_j: population quantity of interest.
  • R_j = 1 if j is in the sample, 0 otherwise.
  • \begin{aligned} \overline{G}_n = \frac{1}{n}\sum_{j \text{ in sample}} G_j = \frac{\sum_{j=1}^N R_j G_j}{\sum_{j=1}^N R_j} \end{aligned}: sample average (the estimator that we will consider).
  • f = n / N: the fraction of the population in the sample, or sampling rate.

All the quantities above are assumed to be fixed except possibly the R_j‘s. The R_j‘s define what Meng calls the R-mechanism: the way that the sample arrives to us. In simple random sampling, we know the joint distribution of the R_j‘s. But it is possible that the sample is not uniformly random, e.g. self-reported data where an individual is more or less likely to report depending on their value of G. It’s even possible that R_j is deterministic: in this case, an individual always decides to report or always decides not to report. In any case, the R-mechanism and the R_j‘s are going to be vital for what follows.

In addition, for convenience let J be a random variable which is uniformly distributed over \{ 1, 2, \dots, N\} (and independent of any preceding randomness).

A fundamental identity for data quality-quantity tradeoff

In Equation (2.3), Meng presents a key identity that demonstrates the tradeoff between data quality and data quantity:

\begin{aligned} \overline{G}_n - \overline{G}_N = \underbrace{\rho_{R, G}}_{\text{Data Quality}} \times \underbrace{\sqrt{\frac{1-f}{f}}}_{\text{Data Quantity}} \times \underbrace{\sigma_G}_{\text{Problem Difficulty}}, \end{aligned}

where

  • \rho_{R, G} = Corr_J (R_J, G_J) is called the data defect correlation, measuring both sign and degree of selection bias caused by the R-mechanism, and
  • \sigma_G = SD_J (G_J) is the standard deviation of G_J. This captures problem difficulty: the more variation there is among the G_j‘s, the harder it will be to estimate \overline{G}_N.

What’s really interesting about this identity is that it is completely due to algebraic manipulation: no mathematical or probabilistic assumptions go into it! (Other than the definitions in the notation section above.) Conditional on knowing the values of the R_j‘s, Meng shows in Equation (2.2) that

\begin{aligned} \overline{G}_n - \overline{G}_N = ... = \dfrac{Cov_J (R_J, G_J)}{E_J [R_J]}. \end{aligned}

Noting that the R_j‘s are binary variables with mean f, we have Var_J (R_J) = f(1-f), so that SD_J (R_J) / E_J [R_J] = \sqrt{(1-f)/f}. This leads to

\begin{aligned}\overline{G}_n - \overline{G}_N &=\dfrac{Cov_J (R_J, G_J)}{SD_J (R_J) SD_J (G_J) } \times \dfrac{SD_J(R_J)}{E_J [R_J]} \times SD_J (G_J) \\  &= Corr_J (R_J, G_J) \times \sqrt{\dfrac{1-f}{f}} \times \sigma_G \\  &= \rho_{R,G} \times \sqrt{\dfrac{1-f}{f}} \times \sigma_G, \end{aligned}

as required.
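Since the identity is purely algebraic, it is easy to verify numerically. Here is a quick check in R with a deliberately biased R-mechanism (note that \sigma_G is the population standard deviation, i.e. we divide by N):

# numerical check of the identity
set.seed(42)
N <- 1e5
G <- rnorm(N, mean = 50, sd = 10)
# individuals with larger G are more likely to end up in the sample
R <- rbinom(N, 1, 0.1 * plogis((G - 50) / 10))
f <- mean(R)                            # sampling rate n / N

sigma_G <- sqrt(mean((G - mean(G))^2))  # population SD of G
rho_RG <- cor(R, G)                     # data defect correlation

lhs <- mean(G[R == 1]) - mean(G)        # sample mean minus population mean
rhs <- rho_RG * sqrt((1 - f) / f) * sigma_G
c(lhs = lhs, rhs = rhs)                 # the two agree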

The data defect index

Squaring the identity above and taking expectations with respect to the R-mechanism gives us an expression for the mean-squared error (MSE) (Equation (2.4)):

\begin{aligned} MSE_R (\overline{G}_n) = \underbrace{E_R [\rho_{R, G}^2]}_{D_I} \times \underbrace{\frac{1-f}{f}}_{D_O} \times \underbrace{\sigma_G^2}_{D_U}. \end{aligned}

Thus, there are 3 ways to reduce MSE:

  1. Increase the data quality by reducing D_I: the data defect index.
  2. Increase the data quantity by reducing D_O: the dropout odds. This, however, turns out to be much less effective than increasing data quality.
  3. Reduce the difficulty of the estimation problem by reducing D_U: the degree of uncertainty (this is typically only possible with additional information).

The law of large populations (LLP)

With the identities above, Meng formulates a law of large populations:

Among studies sharing the same (fixed) average data defect correlation E_R [\rho_{R, G}] \neq 0, the (stochastic) error of \overline{G}_n, relative to its benchmark under simple random sampling, grows with the population size N at the rate of \sqrt{N}.

References:

  1. Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election.

Variance of coefficients for linear models and generalized linear models

Variance of coefficients in linear models

Assume we are in the supervised learning setting with design matrix {\bf X} \in \mathbb{R}^{n \times p} and response y \in \mathbb{R}^n. If the linear regression model is true, i.e.

y = {\bf X} \beta + \epsilon,

where \beta \in \mathbb{R}^p is the vector of coefficients and the entries of \epsilon \in \mathbb{R}^n are i.i.d. with mean 0 and variance \sigma^2, it is well-known that the maximum likelihood estimate (MLE) of \beta is

\hat\beta = ({\bf X}^T{\bf X})^{-1} {\bf X}^T y.

Assuming {\bf X} is fixed and that \epsilon is the only source of randomness, it follows easily that

Var(\hat\beta) = \sigma^2 ({\bf X}^T{\bf X})^{-1}.

It’s remarkable that (i) we have a closed form solution for the variance of \hat\beta, and (ii) this variance does not depend on the value of \beta.
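As a quick illustration, the vcov method for a fitted lm object returns exactly this formula with \sigma^2 replaced by its estimate:

# quick check: vcov(fit) equals sigma2_hat * (X^T X)^{-1}
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

X <- model.matrix(fit)
sigma2_hat <- summary(fit)$sigma^2
sigma2_hat * solve(t(X) %*% X)   # same as vcov(fit)
vcov(fit)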

Variance of coefficients in generalized linear models (GLMs)

We are not so lucky on both counts in the generalized linear model (GLM) setting. Assume that the true data generating model is given by a GLM, i.e.

\begin{aligned} Y_i &\stackrel{ind.}{\sim} f(y_i ; \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right], \\  \eta_i &= \sum_{j=1}^p \beta_j x_{ij}, \text{ and } \\  \eta_i &= g (\mu_i),  \end{aligned}

where i = 1, \dots, n indexes the observations, \mu_i = \mathbb{E}[Y_i], and g is the link function. (See this previous post for details.)

In this setting, we do not have a closed form solution for \hat\beta, the MLE of \beta. (See this previous post for the likelihood equations and ways to obtain the MLE numerically.) We are also not able to get a closed form solution for the variance of \hat\beta. We can however get a closed form solution for the asymptotic variance of \hat\beta:

\begin{aligned} Var( \hat\beta) &= ({\bf X}^T {\bf WX})^{-1}, \text{ where} \\  {\bf W}_{ij} &= \begin{cases} \left( \dfrac{\partial \mu_i}{\partial \eta_i} \right)^2 / Var(Y_i) &\text{if } i = j, \\ 0 &\text{otherwise}. \end{cases} \end{aligned}
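For a concrete check, consider a Poisson GLM with the log link: there \partial \mu_i / \partial \eta_i = \mu_i and Var(Y_i) = \mu_i, so {\bf W} = diag(\mu_i), and the formula can be compared against vcov of a fitted glm object (they should agree up to the tolerance of the iterative fit):

# Poisson GLM with log link: W = diag(mu)
set.seed(1)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x1 - 0.2 * x2))
fit <- glm(y ~ x1 + x2, family = poisson())

X <- model.matrix(fit)
W <- diag(fitted(fit))
solve(t(X) %*% W %*% X)   # asymptotic variance formula
vcov(fit)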

References:

  1. Agresti, A. (2013). Categorical data analysis (3rd ed): Section 4.4.8.