Today I learnt of the `covidcast` R package, which provides access to the COVIDcast Epidata API published by the Delphi group at Carnegie Mellon University. According to the `covidcast` R package website:

> This API provides daily access to a range of COVID-related signals that Delphi builds and maintains, from sources like symptom surveys and medical claims data, and also standard signals that we simply mirror, like confirmed cases and deaths.

(There is a corresponding python package with similar functionality.) The Delphi group has done a huge amount of work in logging a wide variety of COVID-related data and making it available, along with tools to visualize and make sense of the data.

For those interested in doing COVID-related analyses, I think this is a treasure trove of information for you to use. The `covidcast` package contains several different types of data (which they call “signals”), including public behavior (e.g. COVID searches on Google), early indicators (e.g. COVID-related doctor visits) and late indicators (e.g. deaths). Documentation on the signals available can be found here. (Note: The data is US-focused right now; I don’t know if there are plans to include data from other countries.)

Let me end off with a simple example showing what you can do with this package. This example is modified from one of the package vignettes; see the Articles section of the package website for more examples.

The package is not available on CRAN yet but can be downloaded from Github:

```r
devtools::install_github("cmu-delphi/covidcast", ref = "main",
                         subdir = "R-packages/covidcast")
```

The code below pulls data on cumulative COVID cases per 100k people on 2020-12-31 at the county level. `covidcast_signal` is the function to use for pulling data, and it returns an object of class `c("covidcast_signal", "data.frame")`.

```r
library(covidcast)

# Cumulative COVID cases per 100k people on 2020-12-31
df <- covidcast_signal(data_source = "usa-facts",
                       signal = "confirmed_cumulative_prop",
                       start_day = "2020-12-31",
                       end_day = "2020-12-31")
summary(df)
# A `covidcast_signal` data frame with 3142 rows and 9 columns.
#
# data_source : usa-facts
# signal      : confirmed_cumulative_prop
# geo_type    : county
#
# first date                          : 2020-12-31
# last date                           : 2020-12-31
# median number of geo_values per day : 3142
```

There is a `plot` method for class `covidcast_signal` objects:

```r
plot(df)
```

The automatic plot is usually not bad. The `plot` method comes with some arguments that the user can use to customize the plot (full documentation here):

```r
breaks <- c(0, 500, 1000, 5000, 10000)
colors <- c("#D3D3D3", "#FEDDA2", "#FD9950", "#C74E32", "#800026")
plot(df, choro_col = colors, choro_params = list(breaks = breaks),
     title = "Cumulative COVID cases per 100k people on 2020-12-31")
```

The plot returned is actually created using the `ggplot2` package, so it is possible to add your own `ggplot2` code on top of it:

```r
library(ggplot2)
plot(df, choro_col = colors, choro_params = list(breaks = breaks),
     title = "Cumulative COVID cases per 100k people on 2020-12-31") +
  theme(title = element_text(face = "bold"))
```

This post explores a few questions related to the Mendoza line:

- Was Mario Mendoza really so bad as to warrant the expression being named after him?
- How many players fell below the Mendoza line each year?
- What might the Mendoza line look like if we allowed it to change dynamically?

All code for this post can be found here.

(*Caveat:* I don’t know baseball well, so some of the assumptions or conclusions I make below may not be good ones. If I made a mistake, let me know!)

**The data**

The `Lahman` package on CRAN contains all the baseball statistics from 1871 to 2019. We’ll use the `Batting` data frame for statistics and the `People` data frame for player names.

```r
library(Lahman)
library(tidyverse)

data(Batting)
data(People)

# add AVG to Batting
Batting$AVG <- with(Batting, H / AB)
```

First, let’s look for Mario Mendoza and verify that his batting average is indeed .215:

```r
# find Mario Mendoza in People
People %>% filter(nameFirst == "Mario" & nameLast == "Mendoza")
# his ID is mendoma01

Batting %>%
  filter(playerID == "mendoma01") %>%
  summarize(career_avg = sum(H) / sum(AB))
#   career_avg
# 1  0.2146597
```

**Was Mario Mendoza really so bad as to warrant the expression being named after him?**

Let’s compute the career batting averages for the players and limit our dataset to just the players with at least 1000 at bats in their career:

```r
# Batting average for players with >= 1000 AB
avg_df <- Batting %>%
  group_by(playerID) %>%
  summarize(tot_AB = sum(AB), career_avg = sum(H) / sum(AB)) %>%
  filter(tot_AB >= 1000) %>%
  left_join(People, by = "playerID") %>%
  select(playerID, tot_AB, career_avg, nameFirst, nameLast) %>%
  arrange(desc(career_avg))
```

Let’s look at the top 10 players by batting average: we should see some famous names there! (If not, maybe 1000 ABs is not stringent enough of a criterion to rule out small sample size?)

```r
# top 10
head(avg_df, n = 10)
# # A tibble: 10 x 5
#    playerID  tot_AB career_avg nameFirst    nameLast
#    <chr>      <int>      <dbl> <chr>        <chr>
#  1 cobbty01   11436      0.366 Ty           Cobb
#  2 barnero01   2391      0.360 Ross         Barnes
#  3 hornsro01   8173      0.358 Rogers       Hornsby
#  4 jacksjo01   4981      0.356 Shoeless Joe Jackson
#  5 meyerle01   1443      0.356 Levi         Meyerle
#  6 odoulle01   3264      0.349 Lefty        O'Doul
#  7 delahed01   7510      0.346 Ed           Delahanty
#  8 mcveyca01   2513      0.346 Cal          McVey
#  9 speaktr01  10195      0.345 Tris         Speaker
# 10 hamilbi01   6283      0.344 Billy        Hamilton
```

Next, let’s look at the bottom 10 players by batting average:

```r
# bottom 10
tail(avg_df, n = 10)
# # A tibble: 10 x 5
#    playerID  tot_AB career_avg nameFirst nameLast
#    <chr>      <int>      <dbl> <chr>     <chr>
#  1 seaveto01   1315      0.154 Tom       Seaver
#  2 donahre01   1150      0.152 Red       Donahue
#  3 fellebo01   1282      0.151 Bob       Feller
#  4 grovele01   1369      0.148 Lefty     Grove
#  5 suttodo01   1354      0.144 Don       Sutton
#  6 amesre01    1014      0.141 Red       Ames
#  7 faberre01   1269      0.134 Red       Faber
#  8 perryga01   1076      0.131 Gaylord   Perry
#  9 pappami01   1073      0.123 Milt      Pappas
# 10 frienbo01   1137      0.121 Bob       Friend
```

Those numbers look quite a lot smaller than Mendoza’s! Notice also that all of them have ABs just over 1000, my threshold for this dataset. Maybe 1000 ABs is too loose of a condition… But Mendoza only had 1337 ABs, so if we make the condition more stringent (e.g. considering only players with >= 2000 ABs), it’s not fair to pick on him…

Among players with >= 1000 ABs, how poor was Mendoza’s performance?

```r
# How far down was Mario Mendoza?
which(avg_df$playerID == "mendoma01") / nrow(avg_df)
# [1] 0.9630212
```

Since `avg_df` is sorted in decreasing order of career average, this means about 96% of players in our dataset batted better than Mendoza: he sits roughly at the 5th quantile for batting average.

**How many players fell below the Mendoza line each year?**

For this question, in each season we only consider players who had at least 100 ABs that season.

```r
# Look at player-seasons with at least 100 ABs
batting_df <- Batting %>% filter(AB >= 100)
```

The nice thing about the tidyverse is that we can answer the question above with a series of pipes ending in a plot:

```r
batting_df %>%
  group_by(yearID) %>%
  summarize(below200 = mean(AVG < 0.200)) %>%
  ggplot(aes(yearID, below200)) +
  geom_line() +
  labs(title = "Proportion of players below Mendoza line by year",
       x = "Year", y = "Prop. below .200") +
  theme_bw()
```

There is a fair amount of fluctuation, with the proportion of players under the Mendoza line going as high as 24% and as low as 1%. If I had to guess, for the last 50 years or so the proportion seems to fluctuate around 5%.

**What might the Mendoza line look like if we allowed it to change dynamically?**

Instead of defining the Mendoza line as having a batting average below .200, what happens if we define the Mendoza line for a particular season as the batting average of the player at the 5th quantile?
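Concretely, the “player at the 5th quantile” for a season is just the 0.05 quantile of that season’s batting averages. A quick illustration with a made-up vector of averages (my own toy example, using R’s default type-7 interpolation):

```r
# The 0.05 quantile of a toy vector of batting averages.
# With n = 5 and type-7 interpolation, quantile() interpolates between the
# first and second order statistics: 0.150 + 0.2 * (0.180 - 0.150) = 0.156.
avgs <- c(0.150, 0.180, 0.200, 0.250, 0.300)
quantile(avgs, 0.05)
#    5%
# 0.156
```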

We can answer this easily by summarizing the data using the `quantile` function (dplyr v1.0.0 makes this easy):

```r
batting_df %>%
  group_by(yearID) %>%
  summarize(bottom5 = quantile(AVG, 0.05)) %>%
  ggplot(aes(yearID, bottom5)) +
  geom_line() +
  geom_hline(yintercept = 0.2, color = "red", linetype = "dashed") +
  labs(title = "Batting average of 5th quantile by year",
       x = "Year", y = "5th quantile batting average") +
  theme_bw()
```

That looks like a lot of fluctuation, but if you look closely at the y-axis, you’ll see that the values hover between 0.14 and 0.24. Here is the same line graph but with zero included on the y-axis and a loess smoothing curve:

```r
batting_df %>%
  group_by(yearID) %>%
  summarize(bottom5 = quantile(AVG, 0.05)) %>%
  ggplot(aes(yearID, bottom5)) +
  geom_line() +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = 0.2, color = "red", linetype = "dashed") +
  scale_y_continuous(limits = c(0, 0.24)) +
  labs(title = "Batting average of 5th quantile by year",
       x = "Year", y = "5th quantile batting average") +
  theme_bw()
```

I’m not sure how much to trust that smoother, but it comes awfully close to the Mendoza line!

For completeness, here are the lines representing the batting averages for players at various quantiles over time:

```r
batting_df %>%
  group_by(yearID) %>%
  summarize(AVG = quantile(AVG, c(0.05, 1:4 / 5, 0.95)),
            quantile = c(0.05, 1:4 / 5, 0.95)) %>%
  mutate(quantile = factor(quantile, levels = c(0.95, 4:1 / 5, 0.05))) %>%
  ggplot(aes(x = yearID, y = AVG, col = quantile)) +
  geom_line() +
  geom_hline(yintercept = 0.2, color = "red", linetype = "dashed") +
  labs(title = "Batting average for various quantiles by year",
       x = "Year", y = "Quantile batting average") +
  theme_bw()
```

At a glance the higher quantile lines look just like vertical translations of the 5th quantile line, suggesting that across years the entire distribution of batting averages shifts up or down (not just parts of the distribution).

A new version of the `glmnet` package has just been pushed to CRAN! In this release (v4.1), we extend the scope of regularized Cox models to include (start, stop] data and strata variables. In addition, we provide a `survfit` method for plotting survival curves based on the model (as the `survival` package does).
*Why is this a big deal?* As explained in Therneau and Grambsch (2000), the ability to work with start-stop responses opens the door to fitting regularized Cox models with:

- time-dependent covariates,
- time-dependent strata,
- left truncation,
- multiple time scales,
- multiple events per subject,
- independent increment, marginal, and conditional models for correlated data, and
- various forms of case-cohort models.

`glmnet` v4.1 is now available on CRAN here. We have reorganized the package’s vignettes, with the new functionality described in the vignette “Regularized Cox Regression” (PDF version/web version). Don’t hesitate to reach out if you have questions.

(*Note:* This is joint work with Trevor Hastie, Balasubramanian Narasimhan and Rob Tibshirani.)

**Rules**

The official rules for Toss Up! are available here. Here is an abbreviated version:

- As shown in the picture above, the game uses 10 six-sided dice. Each die has 3 green sides, 2 yellow sides and one red side.
- For each turn, a player rolls all 10 dice. If any greens are rolled, set them aside.
- At this point, the player can choose to end the turn and “bank” the points (one point for one green), or keep rolling the remaining dice.
- After every roll, greens are set aside. If you roll enough times to make all 10 dice green, you have 10 points and you can either bank or roll all 10 dice again.
- A player can keep going until he/she decides to stop rolling and bank all the points earned on this turn.
- **Catch:** If, on a roll, none of the dice are green and at least one is red, the turn ends and no points can be banked.

- First player to reach 100 points wins.
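A quick consequence of these rules: if you are holding $k$ dice, the probability that the next roll busts your turn (no greens and at least one red) is $P(\text{no green}) - P(\text{all yellow})$. A small sketch of this calculation (my own addition, using the dice probabilities above):

```r
# Probability that a roll of k dice ends the turn: no greens AND at least
# one red. Per die: P(green) = 3/6, P(yellow) = 2/6, P(red) = 1/6, so
#   P(bust) = P(no green) - P(no green and no red) = (1/2)^k - (1/3)^k
bust_prob <- function(k) (1/2)^k - (1/3)^k

round(bust_prob(1:10), 4)
```

The bust probability is highest when only one die is left (1/6) and shrinks rapidly as more dice are in hand, which is why risk-taking early in a turn is relatively cheap.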

**Simulating Toss Up!**

There are several different ways to implement Toss Up!: I describe my (two-player) version below. To allow for greater flexibility in the resulting code, I implement the game with three global constants that the programmer can change:

- `NUM_DICE`: The number of dice used in the game (default is 10).
- `FACE_PROBS`: A numeric vector of length 3 denoting the probability that a die comes up red, yellow or green respectively (default is `c(1/6, 2/6, 3/6)`).
- `POINTS_TO_WIN`: The number of points a player needs to obtain to win the game (default is 100).

**Step 1: How can we describe the state of a game?**

The state of the game can be encapsulated with 4 pieces of information:

- `current_player`: whose turn it is to make a decision.
- `scores`: a vector of length 2 containing the scores for players 1 and 2 respectively.
- `turn_points`: the number of points scored on the current turn so far (these points have not been banked yet).
- `dice_rolls`: a vector of variable length denoting the outcome of the dice rolls (0: red, 1: yellow, 2: green).

In my implementation, the state is stored as a list with the 4 elements above.
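For concreteness, here is a minimal sketch of constructing such a state at the start of a game (mirroring the constants and the 0/1/2 dice encoding described above):

```r
NUM_DICE <- 10
FACE_PROBS <- c(1/6, 2/6, 3/6)  # P(red), P(yellow), P(green)

set.seed(42)
state <- list(current_player = 1,
              scores = c(0, 0),       # both players start at 0
              turn_points = 0,        # nothing earned this turn yet
              # dice encoded as 0 = red, 1 = yellow, 2 = green
              dice_rolls = sample(0:2, size = NUM_DICE, replace = TRUE,
                                  prob = FACE_PROBS))
length(state$dice_rolls)  # 10
```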

**Step 2: Updating the state**

From a given state, there are 3 possibilities for what comes next:

- The dice rolls don’t have any greens and at least one red. In this case, the current player’s turn is over. We need to change `current_player`, reset `turn_points` to 0, and re-roll the dice (`NUM_DICE` of them).
- The dice rolls either (i) have at least one green, or (ii) have no reds. In this case, the current player has a choice of what to do.
  - If the player chooses to **bank**, then we need to update `scores`, reset `turn_points` to 0, change `current_player` and re-roll the dice (`NUM_DICE` of them).
  - If the player chooses to **roll**, then we need to update `turn_points` and re-roll just the dice that were not green.

The function `UpdateState` below does all of the above. I have also added a `verbose` option which, if set to `TRUE`, prints information on the game state to the console. (The function `DiceToString` is a small helper function for printing out the dice rolls.)

```r
DiceToString <- function(dice_rolls) {
  return(paste(sum(dice_rolls == 2), "Green,",
               sum(dice_rolls == 1), "Yellow,",
               sum(dice_rolls == 0), "Red"))
}

# Executes the current player's action and updates the state. Is also used if
# no greens and at least one red is rolled.
UpdateState <- function(state, action, verbose = FALSE) {
  if (verbose) {
    cat("Current roll:", DiceToString(state$dice_rolls))
    if (action != "rolled at least 1 red and no green")
      cat(" (bank", state$turn_points + sum(state$dice_rolls == 2),
          "pts?)", fill = TRUE)
    else
      cat("", fill = TRUE)
    cat(paste0("Player ", state$current_player, " ", action, "s"), fill = TRUE)
  }

  if (action == "bank") {
    # action = "bank": bank the points current player earned this turn, then
    # re-roll the dice.
    state$scores[state$current_player] <- state$scores[state$current_player] +
      state$turn_points + sum(state$dice_rolls == 2)
    state$turn_points <- 0
    state$dice_rolls <- sample(0:2, size = NUM_DICE, replace = TRUE,
                               prob = FACE_PROBS)
    state$current_player <- 3 - state$current_player
  } else if (action == "roll") {
    # action = "roll": add the green dice to points earned this turn, then
    # re-roll the remaining dice.
    state$turn_points <- state$turn_points + sum(state$dice_rolls == 2)
    ndice <- sum(state$dice_rolls != 2)
    if (ndice == 0) ndice <- NUM_DICE
    state$dice_rolls <- sample(0:2, size = ndice, replace = TRUE,
                               prob = FACE_PROBS)
  } else if (action == "rolled at least 1 red and no green") {
    # action = "rolled at least 1 red and no green":
    # no points banked, turn ends, re-roll dice.
    state$turn_points <- 0
    state$dice_rolls <- sample(0:2, size = NUM_DICE, replace = TRUE,
                               prob = FACE_PROBS)
    state$current_player <- 3 - state$current_player
  } else {
    stop("action must be 'bank', 'roll', or 'rolled at least 1 red and no green'")
  }

  if (verbose) {
    if (action != "roll") {
      cat("Current scores:", state$scores, fill = TRUE)
      cat("", fill = TRUE)
    }
  }
  return(state)
}
```

**Step 3: How to express player behavior and strategy?**

We can think of a player as a black box which takes a game state as an input and outputs a decision, “bank” or “roll”. In other words, the player is a function!

Below are two extremely simple strategies expressed as functions. The first strategy simply banks after the first roll. The second strategy banks once more than `target` points can be earned from the turn.

```r
# strategy that stops after one roll
OneRoll <- function(state) {
  return("bank")
}

# strategy that stops only when the turn yields > `target` points
OverTargetPoints <- function(state, target = 10) {
  if (state$turn_points + sum(state$dice_rolls == 2) > target) {
    return("bank")
  } else {
    return("roll")
  }
}
```

(**Note:** In my implementation, strategy functions should assume that they are `current_player`: that is how they can determine their current score and that of their opponent correctly.)

**Step 4: Simulating a full game**

We can now put the pieces from the previous steps together to simulate a full game of Toss Up!. The `SimulateGame` function takes in two strategy functions as input. It sets up the initial state, then updates the state repeatedly until the game ends.

```r
SimulateGame <- function(Strategy1, Strategy2, verbose = FALSE) {
  # set up initial state
  state <- list(current_player = 1,
                scores = c(0, 0),
                turn_points = 0,
                dice_rolls = sample(0:2, size = NUM_DICE, replace = TRUE,
                                    prob = FACE_PROBS))

  # check if no greens and at least one red, if so change player
  while (sum(state$dice_rolls == 2) == 0 && sum(state$dice_rolls == 0) > 0) {
    state <- UpdateState(state, "rolled at least 1 red and no green", verbose)
  }

  # while the game has not ended:
  # - get the next action from the current player's strategy
  # - update the state
  while (max(state$scores) < POINTS_TO_WIN) {
    if (state$current_player == 1) {
      action <- Strategy1(state)
    } else {
      action <- Strategy2(state)
    }
    state <- UpdateState(state, action, verbose)

    # check if no greens and at least one red
    while (sum(state$dice_rolls == 2) == 0 && sum(state$dice_rolls == 0) > 0) {
      state <- UpdateState(state, "rolled at least 1 red and no green", verbose)
    }
  }

  # game has ended: return winner
  if (verbose) {
    cat(paste("Game ends: Player", which.max(state$scores), "wins!"),
        fill = TRUE)
  }
  return(which.max(state$scores))
}
```

**Two examples of simulated games**

The code below shows what a simulated game of Toss Up! might look like. In the code snippet below, players need 20 points to win. The first player uses the super conservative strategy of banking immediately, while the second player tries to win the entire game in one turn.

```r
NUM_DICE <- 10
FACE_PROBS <- c(1/6, 2/6, 3/6)
POINTS_TO_WIN <- 20

set.seed(1)
winner <- SimulateGame(OneRoll,
                       function(state) { OverTargetPoints(state, 19) },
                       verbose = TRUE)
# Current roll: 4 Green, 3 Yellow, 3 Red (bank 4 pts?)
# Player 1 banks
# Current scores: 4 0
#
# Current roll: 5 Green, 4 Yellow, 1 Red (bank 5 pts?)
# Player 2 rolls
# Current roll: 3 Green, 1 Yellow, 1 Red (bank 8 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 0 Red (bank 10 pts?)
# Player 2 rolls
# Current roll: 5 Green, 4 Yellow, 1 Red (bank 15 pts?)
# Player 2 rolls
# Current roll: 2 Green, 3 Yellow, 0 Red (bank 17 pts?)
# Player 2 rolls
# Current roll: 0 Green, 3 Yellow, 0 Red (bank 17 pts?)
# Player 2 rolls
# Current roll: 2 Green, 1 Yellow, 0 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 0 Green, 1 Yellow, 0 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 0 Green, 1 Yellow, 0 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 1 Green, 0 Yellow, 0 Red (bank 20 pts?)
# Player 2 banks
# Current scores: 4 20
#
# Game ends: Player 2 wins!
```

In this particular simulation, it paid off to be a bit more risk-taking. As the simulation below shows, that’s not always the case.

```r
NUM_DICE <- 10
FACE_PROBS <- c(0.5, 0.0, 0.5)
POINTS_TO_WIN <- 20

set.seed(1)
winner <- SimulateGame(OneRoll,
                       function(state) { OverTargetPoints(state, 19) },
                       verbose = TRUE)
# Current roll: 6 Green, 0 Yellow, 4 Red (bank 6 pts?)
# Player 1 banks
# Current scores: 6 0
#
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 3 Red (bank 7 pts?)
# Player 2 rolls
# Current roll: 0 Green, 0 Yellow, 3 Red
# Player 2 rolled at least 1 red and no greens
# Current scores: 6 0
#
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 1 banks
# Current scores: 11 0
#
# Current roll: 7 Green, 0 Yellow, 3 Red (bank 7 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 1 Red (bank 9 pts?)
# Player 2 rolls
# Current roll: 1 Green, 0 Yellow, 0 Red (bank 10 pts?)
# Player 2 rolls
# Current roll: 3 Green, 0 Yellow, 7 Red (bank 13 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 5 Red (bank 15 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 3 Red (bank 17 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 1 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 0 Green, 0 Yellow, 1 Red
# Player 2 rolled at least 1 red and no greens
# Current scores: 11 0
#
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 1 banks
# Current scores: 16 0
#
# Current roll: 4 Green, 0 Yellow, 6 Red (bank 4 pts?)
# Player 2 rolls
# Current roll: 4 Green, 0 Yellow, 2 Red (bank 8 pts?)
# Player 2 rolls
# Current roll: 1 Green, 0 Yellow, 1 Red (bank 9 pts?)
# Player 2 rolls
# Current roll: 0 Green, 0 Yellow, 1 Red
# Player 2 rolled at least 1 red and no greens
# Current scores: 16 0
#
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 1 banks
# Current scores: 21 0
#
# Game ends: Player 1 wins!
```

**Comparing some simple strategies**

We can’t tell whether one strategy is better than another by looking at a single game since there is so much randomness involved. What we should do is simulate many games and see which strategy wins out over the long run.

**OneRoll vs. OneRoll**

Let’s do a sanity check. Here, we run 10,000 games of the strategy `OneRoll` vs. itself (this takes around 4 seconds on my machine). Since the strategy for both players is the same, we might expect player 1 to win around 50% of the time, right?

```r
NUM_DICE <- 10
FACE_PROBS <- c(1/6, 2/6, 3/6)
POINTS_TO_WIN <- 100

nsim <- 10000
set.seed(1)
winners <- replicate(nsim, SimulateGame(OneRoll, OneRoll))
table(winners)
# not 50-50: player who starts first has an advantage
# winners
#    1    2
# 5910 4090
```

It looks like player 1 wins around 59% of the time! On second thought, we would expect player 1 to win more because he/she has a *first-mover advantage*: if player 1 reaches 100 points, he/she wins even if player 2 might reach 100 points on the very next turn.

To make this sanity check work, we should have the players each go first half the time. The code below achieves that:

```r
# for even numbered simulations, let player 2 start first
winners2 <- winners
winners2[2 * 1:(nsim/2)] <- 3 - winners2[2 * 1:(nsim/2)]
table(winners2)
# winners2
#    1    2
# 5030 4970
```

Now player 1 wins just 50.3% of the time, which is much closer to the expected 50%.
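Is 50.3% statistically consistent with a fair 50-50 split? A quick check (my own addition, not in the original analysis) using a two-sided proportion test:

```r
# 5030 wins out of 10000 games: test against a true win probability of 0.5
pv <- prop.test(5030, 10000, p = 0.5)$p.value
pv  # well above 0.05: no evidence against a 50-50 split
```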

**OneRoll vs. >20 points**

Next, let’s compare the “one roll” strategy with the strategy which stops once the player can bank more than 20 points for the turn:

```r
set.seed(1)
winners <- replicate(nsim, SimulateGame(
  OneRoll,
  function(state) { OverTargetPoints(state, 20) }))
table(winners)
# winners
#    1    2
#   75 9925
```

Wow! The strategy of banking only when >20 points have been earned on a turn wins almost all the time, even though it doesn’t have the first mover advantage!

**>20 points vs. >50 points**

Taking a bit more risk than always banking seems to pay off. What about even more risk? How does banking after >20 points compare with banking after >50 points?

```r
set.seed(1)
winners <- replicate(nsim, SimulateGame(
  function(state) { OverTargetPoints(state, 20) },
  function(state) { OverTargetPoints(state, 50) }))
table(winners)
# winners
#    1    2
# 6167 3833

# switch order of players
set.seed(1)
winners <- replicate(nsim, SimulateGame(
  function(state) { OverTargetPoints(state, 50) },
  function(state) { OverTargetPoints(state, 20) }))
table(winners)
# winners
#    1    2
# 4414 5586
```

The >20 points strategy won 61.7% of the games when it went first, and won 55.9% of the games when it went second, so it’s clearly a superior strategy.
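Pooling the two orderings cancels out the first-mover advantage. A quick back-of-envelope calculation from the two tables above:

```r
# >20 strategy wins: 6167 games going first + 5586 games going second
wins_over20 <- 6167 + 5586
total_games <- 10000 + 10000
wins_over20 / total_games  # 0.58765, i.e. about a 59% overall win rate
```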

**Where can we go from here?**

There are several future directions we could take with this code base.

- The strategies presented here are very simple. Can we program more complex strategies (e.g. some that depend on the score of the other player, or on how close the player is to winning)?
- Among strategies of the form “bank once you have more than xx points”, is there an optimal value for xx?
- What is the optimal strategy for Toss Up!? Can we learn it through methods such as Q-learning?
- Is there even such a thing as an optimal strategy? Do there exist 3 strategies such that A beats B, B beats C, but C beats A?
- Can we quantify the first mover advantage?
- Can we modify the code so that we can visualize the players’ scores as the game progresses?
- Can we modify the code so that more than 2 players can play?

(In my googling I found that Matt Lane has an excellent blog post on this game, answering some of the questions that I was interested in.)

**Gameplay**

Before we get to the questions, let me give a quick explanation of how the game works (see Reference 1 for a more colorful explanation as well as an applet to play the game online).

- It’s a cooperative game, with all players playing against the board.
- The game starts with 16 fruit: 4 of each color (red, green, blue, yellow), and a raven at one end of a path that is 5 steps long.
- On each player’s turn, the player rolls a 6-sided die.
- If the die comes up red, green, blue or yellow, the player gets to “harvest” a fruit of that color if there are any left to harvest. If all 4 fruits of that color have been harvested, nothing happens.
- If the die shows a fruit basket, the player gets to harvest a fruit of any color.
- If the die shows a raven, the raven moves one step along the path.

- The game ends when either all the fruit are harvested (players win) or when the raven reaches the end of the path (raven wins).

As you can see, this is a really simple game (hence its “suitable for 2+ years” rating). The only strategy is in choosing which fruit to take if a fruit basket shows up: see Reference 1 for some simulations of different strategies. Intuitively it seems like choosing the color with the most fruit remaining is the best strategy, and that’s what I bake into my code. (*Is there a proof of this, and is it still true in the more general circumstances described in Reference 1?*)
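That greedy choice is easy to express in code. A minimal sketch (my own illustration, not the post's actual implementation), where the first four entries of `state` hold the remaining fruit counts and the fifth tracks the raven:

```r
# Greedy fruit-basket choice: harvest from the color with the most fruit left.
# state[1:4] = fruit counts per color; state[5] = raven's steps so far.
GreedyBasketChoice <- function(state) {
  which.max(state[1:4])  # ties broken in favor of the first maximum
}

state <- c(0, 2, 3, 3, 1)
GreedyBasketChoice(state)  # picks color 3 (first of the two tied maxima)
```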

**Code for simulating the game**

The state of the game can be captured in a numeric vector of length 5. The first 4 numbers refer to the number of fruit left for each color, and the 5th number keeps track of the number of steps the raven has taken so far. I created 3 functions to simulate one game of First Orchard (see full code here):

- `SimulateTurn(state, verbose)` takes one die roll and updates the state of the game. For simplicity, if a 1-4 is rolled, a fruit is harvested from the corresponding tree. If a 5 is rolled, the raven takes a step. A roll of 6 is taken to mean “fruit basket”, and I remove a fruit from the tree with the most remaining fruit.
- `CheckGameState(state, max_raven_steps)` checks if the game has ended or not, and if so, who won.
- `SimulateGame(fruit_count, max_raven_steps, verbose)` runs an entire game of First Orchard: while the game has not ended, run `SimulateTurn`. Once the game has ended, this function returns (i) who won, (ii) the number of turns taken, (iii) the number of steps the raven took, and (iv) the number of fruit left.

We allow for two game parameters to be defined by the user: the number of fruit of each type at the start of the game (`fruit_count`, default is 4) and the number of steps the raven must take in order for it to win (`max_raven_steps`, default is 5). These functions also have a `verbose` option so that the user can see what happened in the game. The code below is an example of the output from `SimulateGame`:

```r
set.seed(1)
results <- SimulateGame(fruit_count = 2, max_raven_steps = 3,
                        verbose = TRUE)
# Roll: 1 , State: 1,2,2,2,0
# Roll: 4 , State: 1,2,2,1,0
# Roll: 1 , State: 0,2,2,1,0
# Roll: 2 , State: 0,1,2,1,0
# Roll: 5 , State: 0,1,2,1,1
# Roll: 3 , State: 0,1,1,1,1
# Roll: 6 , State: 0,0,1,1,1
# Roll: 2 , State: 0,0,1,1,1
# Roll: 3 , State: 0,0,0,1,1
# Roll: 3 , State: 0,0,0,1,1
# Roll: 1 , State: 0,0,0,1,1
# Roll: 5 , State: 0,0,0,1,2
# Roll: 5 , State: 0,0,0,1,3
# Raven wins
# # of turns: 13
# # of steps raven took: 3
# # fruit left: 1
```

**Simulation time!**

**What is the probability of winning the game? How does that change as we vary (i) the number of fruit of each color, and (ii) the number of steps the raven must take in order for the players to lose?**

We let the number of fruit of each color vary from 1 to 8, and the number of steps the raven must take from 1 to 8. For each parameter setting, we simulate the game 10,000 times and compute the player win probability. We plot the results as a heatmap.

As one might expect, win probability goes down as the number of fruit increases and as the number of steps the raven must take decreases. For the original game (4 fruit, 5 steps), the win probability is approximately 62%. Sounds reasonable: we would like a win probability >50% so that kids will not get discouraged by too much losing, but not so high that they think the game is trivial.
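The simulated 62% can also be cross-checked exactly. The sketch below (my own addition, not from the simulations above) computes the players' win probability by memoized recursion over game states, assuming the greedy fruit-basket strategy:

```r
# Exact player win probability for First Orchard via memoized recursion.
# Each die face (4 colors, fruit basket, raven) has probability 1/6, and the
# fruit-basket choice is greedy: take from the color with the most fruit left.
memo <- new.env()

WinProb <- function(counts, raven_left) {
  if (raven_left == 0) return(0)    # raven reached the end: players lose
  if (sum(counts) == 0) return(1)   # all fruit harvested: players win
  counts <- sort(counts, decreasing = TRUE)  # colors are interchangeable
  key <- paste(c(counts, raven_left), collapse = ",")
  if (exists(key, envir = memo, inherits = FALSE))
    return(get(key, envir = memo))

  p <- 0
  for (i in 1:4) {                  # a specific non-empty color is rolled
    if (counts[i] > 0) {
      nc <- counts; nc[i] <- nc[i] - 1
      p <- p + WinProb(nc, raven_left) / 6
    }
  }
  nc <- counts; nc[1] <- nc[1] - 1  # fruit basket: take from largest pile
  p <- p + WinProb(nc, raven_left) / 6
  p <- p + WinProb(counts, raven_left - 1) / 6  # raven takes a step
  # rolling an already-empty color changes nothing, so solve
  #   p_total = p + (n_empty / 6) * p_total  for p_total
  n_empty <- sum(counts == 0)
  p <- p / (1 - n_empty / 6)
  assign(key, p, envir = memo)
  p
}

WinProb(c(4, 4, 4, 4), 5)  # should land near the simulated 62%
```

As a sanity check on the recursion, with one fruit left and the raven one step from winning, only three of the six faces matter (that color, the basket, the raven), so the players win with probability 2/3.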

**For the original game (4 fruit, 5 steps), what is the expected number of steps until the game ends? Does this change depending on whether the player or the raven wins?**

We simulate the game 100,000 times and keep track of the number of steps taken in each game. The shortest game took 5 steps while the longest took 45 steps, with the modal number of steps being 21 (it also happens to be the mean and median). Here is the histogram for all 100,000 runs:

Here are the histograms split by the outcome:

Games where the raven wins tend to be shorter than those where the players win. Maybe that’s not too surprising, since a game where the raven wins needs just 5 steps, while a game where the players win needs at least 16 steps. On average, the game takes 19 steps for raven wins and 22 steps for player wins.

**For the original game (4 fruit, 5 steps), given that the raven loses, what is the distribution of the number of steps the raven has taken?**

Because we programmed `SimulateGame` to return the number of steps the raven has taken as well, we don’t have to rerun the simulations: we can just use the 100,000 simulations we ran previously and look at the ones that the raven lost. Here is the histogram of steps the raven took in losing games, with the vertical red line representing the mean:

**For the original game (4 fruit, 5 steps), given that the raven wins, what is the distribution of the number of unharvested fruit?**

Again, we can just use the 100,000 simulations we ran previously and look at the ones that the raven won. Here is the histogram along with the mean and median marked out with vertical lines:

The modal number of fruit left in player-losing games is 1: ugh, tantalizingly close!

**If there was no raven, how many turns would it take to harvest all the fruit?**

“No raven” is the same as saying that the raven needs to take an infinite number of steps in order to win. Hence, we can use our existing simulation code with `max_raven_steps = Inf` to simulate this setting.

The shortest game took 16 turns while the longest game took 63 turns, with 22 and 24 turns being the modal and mean number of turns respectively. (In theory, a game could go on forever.) Here is the histogram:

References:

- Matt Lane. (2018). Harvesting Wins.

Let’s say we have $n$ observations $x_1, \dots, x_n$ in $p$-dimensional space, i.e. each $x_i \in \mathbb{R}^p$. The problem of *clustering* is to group these observations into a (small) number of clusters such that the observations within each cluster are more alike than observations in other clusters.

* Convex clustering* is one method that can be used to cluster observations. Instead of directly assigning each observation to a cluster, it assigns each observation to a point called the

Convex clustering computes the cluster centroids by minimizing a convex function, hence its name. Convex clustering has several (well, at least two) nice properties:

- It is the solution to a convex minimization problem. Hence, algorithms can find a global minimum instead of a local minimum (as *k*-means clustering might). In addition, this allows us to give theoretical and computational guarantees for the solution.
- Convex clustering can give not just a single clustering solution but an entire path of clustering solutions, ranging from each observation being in its own cluster to all observations belonging to a single cluster. The user can choose how many clusters they want to see, and the algorithm can give a solution that closely matches that. In this sense, convex clustering can be viewed as an alternative to hierarchical clustering.

**The mathematical details**

Each of the observations $x_1, \dots, x_n$ is assigned a corresponding cluster centroid $u_1, \dots, u_n \in \mathbb{R}^d$. The $u_i$’s are the solution to the minimization problem

$$\underset{u_1, \dots, u_n}{\text{minimize}} \quad \frac{1}{2} \sum_{i=1}^n \|x_i - u_i\|_2^2 + \gamma \sum_{i < j} w_{ij} \|u_i - u_j\|_q,$$

where $q \geq 1$ is a fixed constant, $\gamma \geq 0$ is a tuning hyperparameter, and the $w_{ij}$’s are pair-specific weights.

From my searching, it looks like the 3 earliest references for convex clustering are Pelckmans et al. (2005), Lindsten et al. (2011) and Hocking et al. (2011), listed as References 1-3 at the end of this post. Pelckmans et al. (2005) is the first paper to propose this minimization problem without the weights $w_{ij}$, and focuses on the cases where $q = 1$ or $q = \infty$. Lindsten et al. (2011) has the same formulation but argues that $q = 2$ is a more appropriate choice. Hocking et al. (2011) is probably the first reference that includes the weights $w_{ij}$. They suggest weights that decay with the distance between $x_i$ and $x_j$ to obtain clustering results that are sensitive to local density in the data. From my cursory glance at later literature, it seems that the choice of weights is crucial in obtaining a good solution.

**The special cases of $\gamma = 0$ and $\gamma = \infty$**

Here is the minimization problem again:

$$\underset{u_1, \dots, u_n}{\text{minimize}} \quad \frac{1}{2} \sum_{i=1}^n \|x_i - u_i\|_2^2 + \gamma \sum_{i < j} w_{ij} \|u_i - u_j\|_q.$$

We see an inherent tradeoff between the two terms in the objective function. The first term encourages the cluster centroids to be near to their assigned points, while the second term encourages the cluster centroids to be near to each other.

Let’s look at what happens for extreme values of $\gamma$ to see this. When $\gamma = 0$, there is no penalty at all for cluster centroids being far from each other. Hence, the optimal solution is $u_i = x_i$ for all $i$, i.e. each point is in its own cluster. On the other hand, when $\gamma = \infty$, any difference in cluster centroids is penalized infinitely. This means that the optimal solution is $u_i = u_j$ for all $i$ and $j$, i.e. all points are in the same cluster.

For $0 < \gamma < \infty$, the tradeoff is less stark. We can make $u_i$ closer to the other centroids, but at the cost of possibly increasing $\frac{1}{2}\|x_i - u_i\|_2^2$: how much we trade off depends on the size of $\gamma$.
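To see this tradeoff concretely, here is a small base-R illustration (my own toy example, not taken from any paper or package): we minimize the convex clustering objective for two 1-dimensional points with `optim`. With `gamma = 0` the centroids sit at the data points; for large `gamma` they fuse into a single cluster.

```r
# Convex clustering objective for two 1-D points x = c(0, 4) with unit
# weight: 0.5 * sum((x - u)^2) + gamma * |u1 - u2|
# (an illustrative toy example)
x <- c(0, 4)

obj <- function(u, gamma) {
  0.5 * sum((x - u)^2) + gamma * abs(u[1] - u[2])
}

fit_centroids <- function(gamma) {
  optim(par = c(1, 1), fn = obj, gamma = gamma)$par
}

fit_centroids(0)   # roughly (0, 4): each point is its own cluster
fit_centroids(10)  # centroids fuse near the mean of x, roughly (2, 2)
```

Sweeping `gamma` from 0 upwards traces out exactly the path of solutions described earlier, with the two centroids fusing once `gamma` is large enough.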

**R implementation for convex clustering**

Your best bet for performing convex clustering in R is probably the `clustRviz` package (available here on Github), which can compute the convex clustering path via its `CARP` function. It does so using the CARP algorithm explained in Reference 4. It should be noted that the path it computes is *approximate*, although (i) approximations are often good enough for interpretation of clusters, and (ii) the algorithm has some theoretical backing: as the algorithm’s parameters converge to certain values, the approximate path of solutions converges to the exact solution path.

As far as I can tell, there isn’t a solution on CRAN. The `cvxclustr` package was on CRAN but has been archived since 2020-10-28 due to check issues not being corrected in time. Older versions of the package are available here.

Hocking et al. (2011) provided R packages for their algorithm (`clusterpath` and `clusterpathRcpp`) that are available here on R-Forge, but they look old and I have no idea if they still work or are in use.

**Generalizations of convex clustering**

The convex clustering minimization problem can be generalized easily by changing the distance metric for either term in the objective function. This allows the problem to remain convex while enabling the solutions to have other desired properties. See Section 1.1 of Reference 5 for some examples.

References:

- Pelckmans, K. et al. (2005). Convex clustering shrinkage.
- Lindsten, F. et al. (2011). Clustering using sum-of-norms regularization: with application to particle filter output computation.
- Hocking, T. D. et al. (2011). Clusterpath: an algorithm for clustering using convex fusion penalties.
- Weylandt, M., Nagorski, J., and Allen, G. I. (2020). Dynamic visualization and fast computation for convex clustering via algorithmic regularization.
- Weylandt, M., and Michailidis, G. (2020). Automatic registration and convex clustering of time series.

*Can we say anything about the magnitude of the roots?* Yes: we have a number of upper bounds on $|z|$, where $z$ is any root of the polynomial $p(x) = a_n x^n + \dots + a_1 x + a_0$. The two most common ones are Lagrange’s bound

$$|z| \leq \max\left(1, \sum_{i=0}^{n-1} \left|\frac{a_i}{a_n}\right|\right),$$

and Cauchy’s bound

$$|z| \leq 1 + \max_{0 \leq i \leq n-1} \left|\frac{a_i}{a_n}\right|.$$

This Wikipedia article contains the proofs of the bounds above as well as other upper bounds.
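As a quick numerical check (my own example, not from the post), base R’s `polyroot` lets us compare the actual root moduli of a polynomial against both bounds:

```r
# Coefficients of p(x) = 2x^3 - 3x^2 + 5x - 7, in increasing order of
# degree (the convention polyroot() uses)
a <- c(-7, 5, -3, 2)
n <- length(a) - 1                         # degree of the polynomial
ratios <- abs(a[1:n] / a[n + 1])           # |a_i / a_n| for i = 0..n-1

lagrange_bound <- max(1, sum(ratios))      # max(1, 3.5 + 2.5 + 1.5) = 7.5
cauchy_bound <- 1 + max(ratios)            # 1 + 3.5 = 4.5

max_root_modulus <- max(Mod(polyroot(a)))  # largest |z| over all roots
max_root_modulus <= lagrange_bound         # TRUE
max_root_modulus <= cauchy_bound           # TRUE
```

For this polynomial Cauchy’s bound happens to be tighter, but neither bound dominates the other in general.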

(*Note:* This post feels more like a math “odd-and-end” rather than a stats one, but who knows? This may come in handy for some theoretical proof.)

(*Credits:* I first learned of the bound by Lagrange on John Cook’s blog.)

`diamonds` dataset from the `ggplot2` package. To use the app for your own dataset, download the raw R code here (just the `app.R` file) and assign your dataset to `raw_df`. In the rest of this post, I outline how to use this app.

(*Credits:* I worked off the source code for the “Diamonds Explorer” app. There are a few versions of this app out there and I can’t find the exact source I used, but it was very close to the source code of this version.)

As you can see from the screenshot below, the main panel (on the right) has 4 tabs. The last two tabs simply give the output of calling the `summary` and `str` functions on the entire dataset; they are not affected by the controls in the side panel. The “Data Snippet” panel prints up to 15 rows of the dataset for a peek into what the dataset looks like. (These 15 rows are the first 15 rows of the dataset used to create the plot on the “Plot” tab.)

The most interesting tab is probably the “Plot” tab. First, let me describe how the app selects the dataset it makes the plot for. By default, it picks 1,000 random observations, or all the observations if the dataset has fewer than 1,000 rows. The user can input the random seed for reproducibility. The user can also control the number of observations using the slider, and choose whether those observations are drawn randomly or taken from the top of the dataset.

The type of plot the app makes depends on the types of variables given to it. In the screenshot above, one numeric variable and one non-numeric variable are given, so the app makes a boxplot. If two numeric variables are given, it makes a scatterplot:

For scatterplots, the user has the option to jitter the points and/or to add a smoothing line:

If two non-numeric variables are given, the app makes a heatmap depicting how often each combination is present in the data:

The plots above depict the joint distribution of two variables in the dataset. If the user wants a visualization for just one variable, the user can set the “Y” variable to “None”. If the “X” variable is numeric, the app plots a histogram:

If the “X” variable is non-numeric, the app plots a bar plot showing counts:
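The dispatch logic described above can be summarized in a small helper function. This is my reconstruction of the app’s behavior from the description, not the app’s actual source:

```r
# Reconstruction (not the app's actual code) of how the plot type is
# chosen from the types of the X and Y variables
choose_plot_type <- function(x_is_numeric, y_is_numeric, y_is_none = FALSE) {
  if (y_is_none) {
    # One-variable visualizations: Y set to "None"
    if (x_is_numeric) "histogram" else "bar plot"
  } else if (x_is_numeric && y_is_numeric) {
    "scatterplot"
  } else if (x_is_numeric || y_is_numeric) {
    "boxplot"   # one numeric and one non-numeric variable
  } else {
    "heatmap"   # both non-numeric: counts of each combination
  }
}

choose_plot_type(TRUE, TRUE)                   # "scatterplot"
choose_plot_type(TRUE, FALSE)                  # "boxplot"
choose_plot_type(TRUE, FALSE, y_is_none = TRUE)  # "histogram"
```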

Finally, let’s talk about color. For simplicity, the app only allows plots to be colored by non-numeric variables. Below is an example of a colored scatterplot:

As the screenshot below shows, color works for boxplots too. (Color does not work for heatmaps.)

Color can be used for the one-dimensional plots as well:

There are certainly several improvements that can be made to this app. For one, it would be nice if the user could upload their dataset through the app (instead of downloading my source code and assigning their dataset to the `raw_df` variable). There could also be more controls to let the user change certain aspects of the plot (e.g. transparency of points), but at some point the UI might become too overwhelming for the user.

**Happy exploring!**

(I did not check, but I think Meng covers some of the material in the paper in this YouTube video.)

**Motivation**

This paper was motivated by the question *“Which one should we trust more, a 5% survey sample or an 80% administrative dataset?”*, where 5% and 80% represent the proportion of the population in the sample. It is pretty obvious that one should prefer a well-randomized 5% survey sample over a 5% administrative dataset, since the latter probably contains several sources of bias. But what if the administrative dataset is much larger? Can we say anything then?

Meng answers this question to some degree, considering the case where the quantity of interest is a population mean, and we try to estimate it using a sample mean. In the rest of this post, I present just the key identity: read the paper for the details!

**Set-up**

Define the following quantities:

- $N$: size of the population; $n$: size of the sample.
- $\bar{G}_N = \frac{1}{N} \sum_{j=1}^N G_j$: population quantity of interest, the mean of the population values $G_1, \dots, G_N$.
- $R_j = 1$ if $j$ is in the sample, $0$ otherwise.
- $\bar{G}_n = \sum_{j=1}^N R_j G_j \big/ \sum_{j=1}^N R_j$: sample average (the estimator that we will consider).
- $f = n/N$: the fraction of the population in the sample, or sampling rate.

All the quantities above are assumed to be fixed except possibly the $R_j$’s. The $R_j$’s define what Meng calls the *R-mechanism*: the way that the sample arrives to us. In simple random sampling, we know the joint distribution of the $R_j$’s. But it is possible that the sample is not uniformly random, e.g. self-reported data where an individual is more or less likely to report depending on their value of $G_j$. It’s even possible that $R_j$ is deterministic: in this case, an individual always decides to report or always decides not to report. In any case, the R-mechanism and the $R_j$’s are going to be vital for what follows.

In addition, for convenience let $J$ be a random variable which is uniformly distributed over $\{1, \dots, N\}$ (and independent of any preceding randomness).

**A fundamental identity for data quality-quantity tradeoff**

In Equation (2.3), Meng presents a key identity that demonstrates the tradeoff between data quality and data quantity:

$$\bar{G}_n - \bar{G}_N = \underbrace{\rho_{R,G}}_{\text{data quality}} \times \underbrace{\sqrt{\frac{1-f}{f}}}_{\text{data quantity}} \times \underbrace{\sigma_G}_{\text{problem difficulty}},$$

where

- $\rho_{R,G} = \mathrm{Corr}(R_J, G_J)$ is called the **data defect correlation**, measuring both sign and degree of selection bias caused by the $R$-mechanism, and
- $\sigma_G$ is the standard deviation of $G_J$. This captures **problem difficulty**: the more variation there is among the $G_j$’s, the harder it will be to estimate $\bar{G}_N$.

What’s really interesting about this identity is that it is completely due to algebraic manipulation: no mathematical or probabilistic assumptions go into it! (Other than the definitions in the notation section above.) Conditional on knowing the values of the $R_j$’s, Meng shows in Equation (2.2) that

$$\bar{G}_n - \bar{G}_N = \frac{\mathrm{Cov}(R_J, G_J)}{E(R_J)}.$$

Noting that the $R_j$’s are binary variables, we have $E(R_J) = f$ and $\sigma_R = \sqrt{f(1-f)}$, leading to

$$\bar{G}_n - \bar{G}_N = \frac{\rho_{R,G} \, \sigma_R \, \sigma_G}{f} = \rho_{R,G} \times \sqrt{\frac{1-f}{f}} \times \sigma_G,$$

as required.
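Because the identity is purely algebraic, it can be checked numerically for any finite population and any realized set of response indicators. Here is a quick base-R check of my own (the population and response mechanism are chosen arbitrarily), using divide-by-$N$ (finite-population) versions of the standard deviation and correlation:

```r
# Numerically verify Meng's identity on an arbitrary finite population
set.seed(123)
N <- 1000
G <- rnorm(N)                    # population values G_1, ..., G_N
R <- rbinom(N, 1, plogis(G))     # response indicators that depend on G

f <- mean(R)                     # realized sampling rate n / N
G_bar_N <- mean(G)               # population mean
G_bar_n <- sum(R * G) / sum(R)   # sample mean

# Finite-population (divide-by-N) moments over a uniform index J
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))
rho <- mean((R - mean(R)) * (G - mean(G))) / (pop_sd(R) * pop_sd(G))

lhs <- G_bar_n - G_bar_N
rhs <- rho * sqrt((1 - f) / f) * pop_sd(G)
all.equal(lhs, rhs)              # TRUE: the identity holds exactly
```

Note that the equality holds for the realized indicators themselves, with no averaging over the R-mechanism; that is exactly what "no probabilistic assumptions" means here.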

**The data defect index**

Squaring the identity above and taking expectations with respect to the R-mechanism gives us an expression for the mean-squared error (MSE) (Equation (2.4)):

$$E\left[(\bar{G}_n - \bar{G}_N)^2\right] = \underbrace{E[\rho_{R,G}^2]}_{\text{data defect index}} \times \underbrace{\frac{1-f}{f}}_{\text{dropout odds}} \times \underbrace{\sigma_G^2}_{\text{degree of uncertainty}}.$$

Thus, there are 3 ways to reduce the MSE:

- **Increase the data quality** by reducing $E[\rho_{R,G}^2]$, the **data defect index**.
- **Increase the data quantity** by reducing $(1-f)/f$, the **dropout odds**. This, however, turns out to be much less effective than increasing data quality.
- **Reduce the difficulty of the estimation problem** by reducing $\sigma_G^2$, the **degree of uncertainty** (this is typically only possible with additional information).

**The law of large populations (LLP)**

With the identities above, Meng formulates a *law of large populations*:

Among studies sharing the same (fixed) average data defect correlation $E[\rho_{R,G}] \neq 0$, the (stochastic) error of $\bar{G}_n$, relative to its benchmark under simple random sampling, grows with the population size $N$ at the rate of $\sqrt{N}$.

References:

- Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election.

Assume we are in the supervised learning setting with design matrix $X \in \mathbb{R}^{n \times p}$ and response $y \in \mathbb{R}^n$. If the *linear regression model* is true, i.e.

$$y = X\beta + \epsilon,$$

where $\beta \in \mathbb{R}^p$ is the vector of coefficients and the entries of $\epsilon$ are i.i.d. with mean 0 and variance $\sigma^2$, it is well-known that the maximum likelihood estimate (MLE) of $\beta$ is

$$\hat{\beta} = (X^T X)^{-1} X^T y.$$

Assuming $X$ is fixed and that $\epsilon$ is the only source of randomness, it follows easily that

$$\mathrm{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}.$$

It’s remarkable that (i) we have a closed-form solution for the variance of $\hat{\beta}$, and (ii) this variance does not depend on the value of $\beta$.
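As a quick sanity check of my own (simulated data, not from the post), we can confirm in R that `vcov` for an `lm` fit equals $\hat{\sigma}^2 (X^T X)^{-1}$, where $\hat{\sigma}^2$ is the usual unbiased estimate of $\sigma^2$:

```r
# Check that vcov() for lm equals sigma^2 * (X^T X)^{-1}, with sigma^2
# replaced by its usual unbiased estimate (simulated data)
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
X <- model.matrix(fit)             # includes the intercept column
manual_vcov <- sigma(fit)^2 * solve(t(X) %*% X)
all.equal(vcov(fit), manual_vcov, check.attributes = FALSE)  # TRUE
```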

**Variance of coefficients in generalized linear models (GLMs)**

We are not so lucky on both counts in the *generalized linear model (GLM)* setting. Assume that the true data-generating model is given by a GLM, i.e.

$$g(\mu_i) = x_i^T \beta,$$

where $i = 1, \dots, n$ indexes the observations, $\mu_i = E[Y_i]$, and $g$ is the link function. (See this previous post for details.)

In this setting, we do not have a closed form solution for $\hat{\beta}$, the MLE of $\beta$. (See this previous post for the likelihood equations and ways to obtain the MLE numerically.) We are also not able to get a closed form solution for the variance of $\hat{\beta}$. We can, however, get a closed form solution for the *asymptotic variance* of $\hat{\beta}$:

$$\widehat{\mathrm{Var}}(\hat{\beta}) = \left(X^T \hat{W} X\right)^{-1},$$

where $\hat{W}$ is the diagonal matrix with entries $\hat{w}_i = (\partial \mu_i / \partial \eta_i)^2 / \mathrm{Var}(Y_i)$ evaluated at $\hat{\beta}$, and $\eta_i = x_i^T \beta$ is the linear predictor.
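As a quick check with my own simulated data: R’s `glm` stores the working weights of the final IWLS iteration in `fit$weights`, which are $(\partial \mu_i / \partial \eta_i)^2 / \mathrm{Var}(Y_i)$ evaluated at $\hat{\beta}$, so for a binomial family (dispersion 1) the plug-in estimate $(X^T \hat{W} X)^{-1}$ should match `vcov(fit)`:

```r
# Check the asymptotic variance formula (X^T W X)^{-1} against vcov()
# for a logistic regression fit (simulated data)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))

fit <- glm(y ~ x, family = binomial)
X <- model.matrix(fit)
W <- diag(fit$weights)   # working weights (dmu/deta)^2 / Var(Y_i)
manual_vcov <- solve(t(X) %*% W %*% X)
all.equal(vcov(fit), manual_vcov,
          check.attributes = FALSE, tolerance = 1e-6)  # TRUE
```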

References:

- Agresti, A. (2013). Categorical data analysis (3rd ed): Section 4.4.8.