What is the Tukey loss function?

The Tukey loss function

The Tukey loss function, also known as Tukey’s biweight loss function, is a loss function used in robust statistics. It is similar to the Huber loss in that it behaves quadratically near the origin. However, it is even more insensitive to outliers, because the loss incurred by large residuals is constant rather than scaling linearly as it would for the Huber loss.

The loss function is defined by the formula

\begin{aligned} \ell (r) = \begin{cases} \frac{c^2}{6} \left(1 - \left[ 1 - \left( \frac{r}{c}\right)^2 \right]^3 \right) &\text{if } |r| \leq c, \\ \frac{c^2}{6} &\text{otherwise}. \end{cases} \end{aligned}

In the above, I use r as the argument to the function to represent the “residual”, while c is a positive tuning parameter that the user has to choose. A common choice is c = 4.685: Reference 1 notes that this value gives approximately 95% asymptotic efficiency relative to ordinary least squares when the true errors have the standard normal distribution.

You may be wondering why the loss function has a somewhat unusual constant of c^2 / 6 out front: it’s because this results in a nicer expression for the derivative of the loss function:

\begin{aligned} \ell' (r) = \begin{cases} r \left[ 1 - \left( \frac{r}{c}\right)^2 \right]^2 &\text{if } |r| \leq c, \\ 0 &\text{otherwise}. \end{cases} \end{aligned}

In the field of robust statistics, the derivative of the loss function is often of more interest than the loss function itself. In this field, it is common to denote the loss function and its derivative by the symbols \rho and \psi respectively.

(Note: The term “Tukey biweight function” is slightly ambiguous; as a commenter pointed out, it often refers to the derivative of the loss rather than the loss function itself. It’s better to be more specific, e.g. saying “Tukey biweight loss function” for the loss (as in this paper), and/or “Tukey biweight psi function” for the derivative of the loss (as in this implementation).)

Plots of the Tukey loss function

Here is R code that computes the Tukey loss and its derivative:

tukey_loss <- function(r, c) {
  ifelse(abs(r) <= c,
         c^2 / 6 * (1 - (1 - (r / c)^2)^3),
         c^2 / 6)
}

tukey_loss_derivative <- function(r, c) {
  ifelse(abs(r) <= c,
         r * (1 - (r / c)^2)^2,
         0)
}

Here are plots of the loss function and its derivative for a few values of the c parameter:

r <- seq(-6, 6, length.out = 301)
c <- 1:3

# plot of tukey loss
library(ggplot2)
theme_set(theme_bw())
loss_df <- data.frame(
  r = rep(r, times = length(c)),
  loss = unlist(lapply(c, function(x) tukey_loss(r, x))),
  c = rep(c, each = length(r))
)

ggplot(loss_df, aes(x = r, y = loss, col = factor(c))) +
  geom_line() +
  labs(title = "Plot of Tukey loss", y = "Tukey loss",
       col = "c") +
  theme(legend.position = "bottom")

# plot of tukey loss derivative
loss_deriv_df <- data.frame(
  r = rep(r, times = length(c)),
  loss_deriv = unlist(lapply(c, function(x) tukey_loss_derivative(r, x))),
  c = rep(c, each = length(r))
)

ggplot(loss_deriv_df, aes(x = r, y = loss_deriv, col = factor(c))) +
  geom_line() +
  labs(title = "Plot of derivative of Tukey loss", y = "Derivative of Tukey loss",
       col = "c") +
  theme(legend.position = "bottom")

Some history

According to what I could find, Tukey’s loss was proposed by Beaton & Tukey (1974) (Reference 2), but not in the form I presented above. Rather, they proposed weights to be used in an iterative reweighted least squares (IRLS) procedure.
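
The Tukey bisquare weight function used in such IRLS procedures is w(r) = \psi(r) / r, i.e. \left[1 - (r/c)^2\right]^2 for |r| \leq c and 0 otherwise. Here is a small R function for it, in the same style as the loss functions above (included for reference; it is not a reproduction of the original derivation in Reference 2):

tukey_weight <- function(r, c) {
  ifelse(abs(r) <= c,
         (1 - (r / c)^2)^2,
         0)
}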

References:

  1. Belagiannis, V., et al. (2015). Robust Optimization for Deep Regression.
  2. Beaton, A. E., and Tukey, J. W. (1974). The Fitting of Power Series, Meaning Polynomials, Illustrated on Band-Spectroscopic Data.

Is the EPL getting more unequal?

I recently heard that Manchester City were so far ahead in the English Premier League (EPL) that the race for first was basically over, even though there were still about 6-7 games to go (out of a total of 38). At the other end of the table, I heard that Sheffield United were so far behind that they were all but certain to be relegated. This got me wondering: has the EPL become more unequal over time? I guess there are several ways to interpret what “unequal” means, and this post is my attempt at quantifying it.

The data for this post is available here, while all the code is available in one place here. (I also have a previous blog post from 2018 on the histogram of points scored in the EPL over time.)

Let’s start by looking at a density plot of the points scored by the 20 teams and how that has changed over the last 12 years (note that the latest year is at the top of the chart):

library(tidyverse)
df <- read_csv("../data/epl_standings.csv")

theme_set(theme_bw())

# joy plot
library(ggridges)
ggplot(df, aes(x = Points, y = factor(Season))) +
    geom_density_ridges(scale = 3) +
    labs(y = "Season")

It looks like the spread for the more recent seasons is greater than that in earlier seasons, evidenced by the bulk of the distribution widening as we go from the bottom to the top of the chart.

Another way to view the data is to draw a boxplot for each season. In this view, the increase in spread that we saw in the joy plot is harder to make out.

# boxplot
ggplot(df, aes(x = Season, y = Points, group = Season)) +
    geom_boxplot() +
    labs(title = "Distribution of points by season")

One possible way to define “unequal” is to look at the difference between the number of points the top team earned vs. that for the bottom team: the larger the difference, the more unequal the EPL is becoming. The plot below shows the maximum and minimum number of points earned by a team across the years, as well as the difference between the two. We also show the linear regression lines for each of these values. With the upward slopes, the gap between the best and worst teams does appear to be increasing.

# plot of maximum and minimum
df %>% group_by(Season) %>%
    summarize(max = max(Points), min = min(Points)) %>%
    mutate(range = max - min) %>%
    pivot_longer(max:range) %>%
    ggplot(aes(x = Season, y = value, col = name)) +
    geom_line() +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_continuous(breaks = seq(min(df$Season), max(df$Season), by = 2)) +
    labs(title = "Max/min points by season", y = "Points") +
    theme(legend.title = element_blank(), legend.position = "bottom")

The problem with using range is that it only tracks the difference between the best and the worst teams while ignoring all the teams in the middle. A measure that takes the middle into account is variance. Note that the number of points that a team can score lies in the range 0 to 38 \times 3 = 114. For random variables bounded in the interval [0, 114], the smallest possible variance is 0 (all teams scoring the same number of points), while the maximum possible variance is (114 - 0)^2 / 4 = 3249 (half the teams scoring 114 points, half the teams scoring 0 points). Based on the configurations attaining these extremes, it seems reasonable to interpret a higher variance in the scores as the league being more unequal.

Here is a plot of point variance over time along with the linear regression line:

df %>% group_by(Season) %>%
    summarize(var = var(Points)) %>%
    ggplot(aes(x = Season, y = var)) +
    geom_line() +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_continuous(breaks = seq(min(df$Season), max(df$Season), by = 2)) +
    labs(title = "Variance of point distribution by season", y = "Variance")

There’s quite a lot of fluctuation from year to year, but there seems to be an increasing trend. However, notice that the y-axis only goes from 250 to 450, while our min and max variances were 0 and 3249. Perhaps it’s better to have the y-axis go from 0 to 3249 to have a better sense of scale?

Before doing that, note that it’s actually not possible for half the teams to score the maximum number of points. In the EPL, every team plays every other team exactly twice, so only one team can have the maximum of 114 points: earning it means beating everyone else, who therefore cannot also have it. I don’t have a proof for this, but I believe maximum variance happens when team 1 beats everyone, team 2 beats everyone except team 1, team 3 beats everyone except teams 1 and 2, and so on. In that configuration, team k wins 2(20 - k) games, so the point totals are 0, 6, 12, ..., 114, and the variance attained is just 1260.

Here’s the same line plot but with the y-axis going from 0 to 1260. With this scale, the change in variance over time looks pretty flat.

# point totals in the "team k beats all teams below it" configuration: 0, 6, ..., 114
max_var_dist <- seq(0, 38 * 3, by = 6)
max_var <- var(max_var_dist)
df %>% group_by(Season) %>%
    summarize(var = var(Points)) %>%
    ggplot(aes(x = Season, y = var)) +
    geom_line() +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    geom_hline(yintercept = c(0, max_var), col = "red", linetype = "dashed") +
    scale_x_continuous(breaks = seq(min(df$Season), max(df$Season), by = 2)) +
    labs(title = "Variance of point distribution by season", y = "Variance")

Which is the correct interpretation? Is the change in variance over time important enough that we can say the EPL has become more unequal, or is it essentially the same over time? This is where I think domain expertise comes into play. 1260 is a theoretical maximum for the variance, but my guess is that the layperson looking at two tables, one with variance 300 and one with variance 900, would be able to tell them apart and say that the latter is more unequal. Can the layperson really tell the difference between variances of 250 and 450? I would generate several tables having these variances and test if people could tell them apart.

Finally, the Gini coefficient is another measure of inequality, with 0 being the most equal and 1 being the most unequal.

# plot of Gini
library(DescTools)
df %>% group_by(Season) %>%
    summarize(gini = DescTools::Gini(Points)) %>%
    ggplot(aes(x = Season, y = gini)) +
    geom_line() +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    geom_hline(yintercept = c(0, 1), col = "red", linetype = "dashed") +
    scale_x_continuous(breaks = seq(min(df$Season), max(df$Season), by = 2)) +
    labs(title = "Gini coefficient of point dist by season", 
         y = "Gini coefficient")

Here are the plots with different scales for the y-axis:

As with variance, the different scales give very different interpretations. It will require some research to figure out if a change of Gini coefficient from 0.17 to 0.22 is perceptible to the viewer.

Estimating pi with Monte Carlo simulation

Happy Pi Day! I don’t encounter \pi very much in my area of statistics, so this post might seem a little forced… In this post, I’m going to show one way to estimate \pi.

The starting point is the integral identity

\begin{aligned} 4 \int_0^1 \sqrt{1 - x^2} dx = \pi. \end{aligned}

There are two ways to see why this identity is true. The first is that the integral is simply computing the area of a quarter-circle with unit radius. The second is by explicitly evaluating the integral:

\begin{aligned} \int_0^1 \sqrt{1 - x^2}dx &= \left[ \frac{1}{2} \left( x \sqrt{1 - x^2} + \arcsin x \right) \right]_0^1 \\  &= \frac{1}{2} \left[ \frac{\pi}{2} - 0 \right] \\  &= \frac{\pi}{4}. \end{aligned}

If X \sim \text{Unif}[0,1], then the integral identity means that

\begin{aligned} \mathbb{E} \left[ 4\sqrt{1 - X^2} \right] = \pi. \end{aligned}

Hence, by the law of large numbers, if we take i.i.d. draws X_1, \dots, X_n from the uniform distribution on [0,1], it is reasonable to expect that

\begin{aligned} \frac{1}{n}\sum_{i=1}^n 4\sqrt{1 - X_i^2} \approx \pi. \end{aligned}

The code below shows how well this estimation procedure does for one run as the sample size goes from 1 to 200:

set.seed(1)
N <- 200
x <- runif(N)
samples <- 4 * sqrt(1 - x^2)
estimates <- cumsum(samples) / 1:N
plot(1:N, estimates, type = "l",
     xlab = "Sample size", ylab = "Estimate of pi",
     main = "Estimates of pi vs. sample size")
abline(h = pi, col = "red", lty = 2)

The next plot shows the relative error on the y-axis instead (the red dotted line represents 1% relative error):

rel_error <- abs(estimates - pi) / pi * 100
plot(1:N, rel_error, type = "l",
     xlab = "Sample size", ylab = "Relative error (%)",
     main = "Relative error vs. sample size")
abline(h = 0)
abline(h = 1, col = "red", lty = 2)

What is a sunflower plot?

A sunflower plot is a type of scatterplot that tries to reduce overplotting. When multiple points have the same (x, y) values, a sunflower plot draws just one point there, but adds little edges (or “petals”) coming out of the point to indicate how many observations are really there.

It’s best to see this via an example. Here is a plot of carb vs. gear from the mtcars dataset:

plot(mtcars$gear, mtcars$carb,
     main = "Plot of carb vs. gear")

From the plot it looks like there are only 11 data points. However, if we check the Environment tab in RStudio, we see that there are actually 32 observations in the dataset: it’s just that some of the observations have the same (gear, carb) values.

Let’s see how a sunflower plot deals with this overplotting. It turns out that base R comes with a sunflower plot function that does just this:

sunflowerplot(mtcars$gear, mtcars$carb,
              main = "Plot of carb vs. gear")

This tells us, for example, that there are 3 observations with (gear, carb) = (3, 1).

We can change the color of the “petals” by specifying seg.col:

sunflowerplot(mtcars$gear, mtcars$carb,
              seg.col = "blue",
              main = "Plot of carb vs. gear")

By default, the first petal always points up. To have the first petal point in random directions, specify rotate = TRUE:

set.seed(1)
sunflowerplot(mtcars$gear, mtcars$carb,
              rotate = TRUE,
              main = "Plot of carb vs. gear")

When given (x, y) values, sunflowerplot counts the number of times each (x, y) value appears to determine the number of petals to draw. It is possible to override this behavior by passing the number argument, as the next code snippet shows:

set.seed(1)
sunflowerplot(1:10, 1:10, number = 1:10,
              main = "n observations at (n, n)")

Sunflower plots aren’t the only way to reduce overplotting. Another common technique is jittering, where random noise is added to each point. The code below shows how you can do this in base R. Of course, a drawback of this is that the points are not plotted at the exact location of the data.

set.seed(1)
plot(jitter(mtcars$gear), jitter(mtcars$carb),
     main = "Plot of carb vs. gear")

Sunflower plots will not solve all your overplotting issues. Here is an example (on the diamonds dataset) where it does a horrendous job:

library(ggplot2)
data(diamonds)
sunflowerplot(diamonds$carat, diamonds$price,
              main = "Plot of price vs. carat")

A better solution here would be to change the transparency of the points. The code snippet below shows how this can be done in base R:

plot(diamonds$carat, diamonds$price,
     col = rgb(0, 0, 0, alpha = 0.05),
     main = "Plot of price vs. carat")

covidcast package for COVID-19-related data

(This is a PSA post, where I share a package that I think might be of interest to the community but that I haven’t looked into too deeply myself.)

Today I learnt of the covidcast R package, which provides access to the COVIDcast Epidata API published by the Delphi group at Carnegie Mellon University. According to the covidcast R package website,

This API provides daily access to a range of COVID-related signals that Delphi builds and maintains, from sources like symptom surveys and medical claims data, and also standard signals that we simply mirror, like confirmed cases and deaths.

(There is a corresponding python package with similar functionality.) The Delphi group has done a huge amount of work in logging a wide variety of COVID-related data and making it available, along with tools to visualize and make sense of the data.

For those interested in doing COVID-related analyses, I think this is a treasure trove of information for you to use. The covidcast package contains several different types of data (which they call “signals”), including public behavior (e.g. COVID searches on Google), early indicators (e.g. COVID-related doctor visits) and late indicators (e.g. deaths). Documentation on the signals available can be found here. (Note: The data is US-focused right now; I don’t know if there are plans to include data from other countries.)

Let me end off with a simple example showing what you can do with this package. This example is modified from one of the package vignettes; see the Articles section of the package website for more examples.

The package is not available on CRAN yet but can be downloaded from Github:

devtools::install_github("cmu-delphi/covidcast", ref = "main",
                         subdir = "R-packages/covidcast")

The code below pulls data on cumulative COVID cases per 100k people on 2020-12-31 at the county level. covidcast_signal is the function to use for pulling data, and it returns an object of class c("covidcast_signal", "data.frame").

library(covidcast)

# Cumulative COVID cases per 100k people on 2020-12-31
df <- covidcast_signal(data_source = "usa-facts", 
                   signal = "confirmed_cumulative_prop",
                   start_day = "2020-12-31", end_day = "2020-12-31")
summary(df)
# A `covidcast_signal` data frame with 3142 rows and 9 columns.
# 
# data_source : usa-facts
# signal      : confirmed_cumulative_prop
# geo_type    : county
# 
# first date                          : 2020-12-31
# last date                           : 2020-12-31
# median number of geo_values per day : 3142

There is a plot method for objects of class covidcast_signal:

plot(df)

The automatic plot is usually not bad. The plot method comes with some arguments that the user can use to customize the plot (full documentation here):

breaks <- c(0, 500, 1000, 5000, 10000)
colors <- c("#D3D3D3", "#FEDDA2",  "#FD9950", "#C74E32", "#800026")
plot(df, choro_col = colors, choro_params = list(breaks = breaks),
     title = "Cumulative COVID cases per 100k people on 2020-12-31")

The plot returned is actually created using the ggplot2 package, so it is possible to add your own ggplot2 code on top of it:

library(ggplot2)
plot(df, choro_col = colors, choro_params = list(breaks = breaks),
     title = "Cumulative COVID cases per 100k people on 2020-12-31") +
  theme(title = element_text(face = "bold"))

The Mendoza line

The Mendoza Line is a term from baseball. Named after Mario Mendoza, it refers to the threshold of incompetent hitting. It is frequently taken to be a batting average of .200, although all the sources I looked at made sure to note that Mendoza’s career average was actually a little better: .215.

This post explores a few questions related to the Mendoza line:

  • Was Mario Mendoza really so bad as to warrant the expression being named after him?
  • How many players fell below the Mendoza line each year?
  • What might the Mendoza line look like if we allowed it to change dynamically?

All code for this post can be found here.

(Caveat: I don’t know baseball well, so some of the assumptions or conclusions I make below may not be good ones. If I made a mistake, let me know!)

The data

The Lahman package on CRAN contains all the baseball statistics from 1871 to 2019. We’ll use the Batting data frame for statistics and the People data frame for player names.

library(Lahman)
library(tidyverse)
data(Batting)
data(People)

# add AVG to Batting
Batting$AVG <- with(Batting, H / AB)

First, let’s look for Mario Mendoza and verify that his batting average is indeed .215:

# find Mario Mendoza in People
People %>% filter(nameFirst == "Mario" & nameLast == "Mendoza")
# his ID is mendoma01

Batting %>% filter(playerID == "mendoma01") %>%
  summarize(career_avg = sum(H) / sum(AB))
#   career_avg
# 1  0.2146597

Was Mario Mendoza really so bad as to warrant the expression being named after him?

Let’s compute the career batting averages for the players and limit our dataset to just the players with at least 1000 at bats in their career:

# Batting average for players with >= 1000 AB
avg_df <- Batting %>% group_by(playerID) %>%
  summarize(tot_AB = sum(AB), career_avg = sum(H) / sum(AB)) %>%
  filter(tot_AB >= 1000) %>%
  left_join(People, by = "playerID") %>%
  select(playerID, tot_AB, career_avg, nameFirst, nameLast) %>%
  arrange(desc(career_avg))

Let’s look at the top 10 players by batting average: we should see some famous names there! (If not, maybe 1000 ABs is not stringent enough of a criterion to rule out small sample size?)

# top 10
head(avg_df, n = 10)
# # A tibble: 10 x 5
#    playerID  tot_AB career_avg nameFirst    nameLast 
#    <chr>      <int>      <dbl> <chr>        <chr>    
#  1 cobbty01   11436      0.366 Ty           Cobb     
#  2 barnero01   2391      0.360 Ross         Barnes   
#  3 hornsro01   8173      0.358 Rogers       Hornsby  
#  4 jacksjo01   4981      0.356 Shoeless Joe Jackson  
#  5 meyerle01   1443      0.356 Levi         Meyerle  
#  6 odoulle01   3264      0.349 Lefty        O'Doul   
#  7 delahed01   7510      0.346 Ed           Delahanty
#  8 mcveyca01   2513      0.346 Cal          McVey    
#  9 speaktr01  10195      0.345 Tris         Speaker  
# 10 hamilbi01   6283      0.344 Billy        Hamilton 

Next, let’s look at the bottom 10 players by batting average:

# bottom 10
tail(avg_df, n = 10)
# # A tibble: 10 x 5
#    playerID  tot_AB career_avg nameFirst nameLast
#    <chr>      <int>      <dbl> <chr>     <chr>   
#  1 seaveto01   1315      0.154 Tom       Seaver  
#  2 donahre01   1150      0.152 Red       Donahue 
#  3 fellebo01   1282      0.151 Bob       Feller  
#  4 grovele01   1369      0.148 Lefty     Grove   
#  5 suttodo01   1354      0.144 Don       Sutton  
#  6 amesre01    1014      0.141 Red       Ames    
#  7 faberre01   1269      0.134 Red       Faber   
#  8 perryga01   1076      0.131 Gaylord   Perry   
#  9 pappami01   1073      0.123 Milt      Pappas  
# 10 frienbo01   1137      0.121 Bob       Friend  

Those numbers look quite a lot smaller than Mendoza’s! Notice also that all of them have ABs just over 1000, my threshold for this dataset. Maybe 1000 ABs is too loose of a condition… But Mendoza himself only had 1337 career ABs, so if we made the condition more stringent (e.g. considering only players with >= 2000 ABs), he wouldn’t even make the cut, and it wouldn’t be fair to pick on him…

Among players with >= 1000 ABs, how poor was Mendoza’s performance?

# How far down was Mario Mendoza?
which(avg_df$playerID == "mendoma01") / nrow(avg_df)
# [1] 0.9630212

He’s roughly at the 5th quantile of all players for batting average.

How many players fell below the Mendoza line each year?

For this question, in each season we only consider players who had at least 100 ABs that season.

# Look at player-seasons with at least 100 ABs
batting_df <- Batting %>% filter(AB >= 100)

The nice thing about the tidyverse is that we can answer the question above with a series of pipes ending in a plot:

batting_df %>% group_by(yearID) %>%
  summarize(below200 = mean(AVG < 0.200)) %>%
  ggplot(aes(yearID, below200)) +
  geom_line() +
  labs(title = "Proportion of players below Mendoza line by year",
       x = "Year", y = "Prop. below .200") +
  theme_bw()

There is a fair amount of fluctuation, with the proportion of players under the Mendoza line going as high as 24% and as low as 1%. If I had to guess, for the last 50 years or so the proportion seems to fluctuate around 5%.

What might the Mendoza line look like if we allowed it to change dynamically?

Instead of defining the Mendoza line as having a batting average below .200, what happens if we define the Mendoza line for a particular season as the batting average of the player at the 5th quantile?

We can answer this easily by summarizing the data using the quantile function (dplyr v1.0.0 makes this easy):

batting_df %>% group_by(yearID) %>%
  summarize(bottom5 = quantile(AVG, 0.05)) %>%
  ggplot(aes(yearID, bottom5)) +
  geom_line() +
  geom_hline(yintercept = c(0.2), color = "red", linetype = "dashed") +
  labs(title = "Batting average of 5th quantile by year",
       x = "Year", y = "5th quantile batting average") +
  theme_bw()

That looks like a lot of fluctuation, but if you look closely at the y-axis, you’ll see that the values hover between 0.14 and 0.24. Here is the same line graph but with zero included on the y-axis and a loess smoothing curve:

batting_df %>% group_by(yearID) %>%
  summarize(bottom5 = quantile(AVG, 0.05)) %>%
  ggplot(aes(yearID, bottom5)) +
  geom_line() +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = c(0.2), color = "red", linetype = "dashed") +
  scale_y_continuous(limits = c(0, 0.24)) +
  labs(title = "Batting average of 5th quantile by year",
       x = "Year", y = "5th quantile batting average") +
  theme_bw()

I’m not sure how much to trust that smoother, but it comes awfully close to the Mendoza line!

For completeness, here are the lines representing the batting averages for players at various quantiles over time:

batting_df %>% group_by(yearID) %>%
  summarize(AVG = quantile(AVG, c(0.05, 1:4 / 5, 0.95)),
            quantile = c(0.05, 1:4 / 5, 0.95)) %>%
  mutate(quantile = factor(quantile, levels = c(0.95, 4:1 / 5, 0.05))) %>%
  ggplot(aes(x = yearID, y = AVG, col = quantile)) +
  geom_line() +
  geom_hline(yintercept = c(0.2), color = "red", linetype = "dashed") +
  labs(title = "Batting average for various quantiles by year",
       x = "Year", y = "Quantile batting average") +
  theme_bw()

At a glance the higher quantile lines look just like vertical translations of the 5th quantile line, suggesting that across years the entire distribution of batting averages shifts up or down (not just parts of the distribution).

glmnet v4.1: regularized Cox models for (start, stop] and stratified data

My latest work on the glmnet package has just been pushed to CRAN! In this release (v4.1), we extend the scope of regularized Cox models to include (start, stop] data and strata variables. In addition, we provide the survfit method for plotting survival curves based on the model (as the survival package does).

Why is this a big deal? As explained in Therneau and Grambsch (2000), the ability to work with start-stop responses opens the door to fitting regularized Cox models with:

  • time-dependent covariates,
  • time-dependent strata,
  • left truncation,
  • multiple time scales,
  • multiple events per subject,
  • independent increment, marginal, and conditional models for correlated data, and
  • various forms of case-cohort models.

glmnet v4.1 is now available on CRAN here. We have reorganized the package’s vignettes, with the new functionality described in the vignette “Regularized Cox Regression” (PDF version/web version). Don’t hesitate to reach out if you have questions.
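
To give a flavor of the new functionality, here is a minimal sketch on simulated data (my own toy example, not taken from the vignette; see the vignette for full details):

library(glmnet)
library(survival)

# simulate some (start, stop] survival data
set.seed(1)
n <- 100; p <- 5
x <- matrix(rnorm(n * p), n, p)
tstart <- runif(n, 0, 5)
tstop <- tstart + rexp(n)        # ensures tstop > tstart
status <- rbinom(n, 1, 0.5)

# (start, stop] response and a regularized Cox fit
y <- Surv(tstart, tstop, status)
fit <- glmnet(x, y, family = "cox")

# strata are attached to the response with stratifySurv()
strata <- rep(1:2, length.out = n)
fit_strat <- glmnet(x, stratifySurv(y, strata), family = "cox")

# survival curves at a given lambda (survfit needs the original x and y)
sf <- survfit(fit, s = 0.05, x = x, y = y)
plot(sf)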

(Note: This is joint work with Trevor Hastie, Balasubramanian Narasimhan and Rob Tibshirani.)

Simulating the dice game “Toss Up!” in R

Toss Up! is a very simple dice game that I’ve always wanted to simulate but never got around to doing so (until now!). This post outlines how to simulate a Toss Up! game in R, as well as how to evaluate the effectiveness of different game strategies. All the code for this blog post is available here.

Rules

The official rules for Toss Up! are available here. Here is an abbreviated version:

  • As shown in the picture above, the game uses 10 six-sided dice. Each die has 3 green sides, 2 yellow sides and one red side.
  • For each turn, a player rolls all 10 dice. If any greens are rolled, set them aside.
    • At this point, the player can choose to end the turn and “bank” the points (one point for one green), or keep rolling the remaining dice.
    • After every roll, greens are set aside. If you roll enough times to make all 10 dice green, you have 10 points and you can either bank or roll all 10 dice again.
    • A player can keep going until he/she decides to stop rolling and bank all the points earned on this turn.
    • Catch: If, on a roll, none of the dice are green and at least one is red, the turn ends and no points can be banked.
  • First player to reach 100 points wins.

Simulating Toss Up!

There are several different ways to implement Toss Up!; I describe my (two-player) version below. To allow for greater flexibility in the resulting code, I implement the game with three global constants that the programmer can change:

  1. NUM_DICE: The number of dice used in the game (default is 10).
  2. FACE_PROBS: A numeric vector of length 3 denoting the probability that a die comes up red, yellow or green respectively (default is c(1/6, 2/6, 3/6)).
  3. POINTS_TO_WIN: The number of points a player needs to obtain to win the game (default is 100).

Step 1: How can we describe the state of a game?

The state of the game can be encapsulated with 4 pieces of information:

  1. current_player: whose turn it is to make a decision.
  2. scores: a vector of length 2 containing the scores for players 1 and 2 respectively.
  3. turn_points: Number of points scored on the current turn so far (these points have not been banked yet).
  4. dice_rolls: A vector of variable length denoting the outcome of the dice rolls (0: red, 1: yellow, 2: green).

In my implementation, the state is stored as a list with the 4 elements above.
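
For concreteness, here is what the initial state looks like (this mirrors the setup used later in SimulateGame):

# initial state: player 1 to move, nothing banked or accumulated yet,
# and all NUM_DICE dice rolled
state <- list(current_player = 1,
              scores = c(0, 0),
              turn_points = 0,
              dice_rolls = sample(0:2, size = NUM_DICE, replace = TRUE,
                                  prob = FACE_PROBS))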

Step 2: Updating the state

From a given state, there are 3 possibilities for what comes next:

  1. The dice rolls have no greens and at least one red. In this case, the current player’s turn is over. We need to change current_player, reset turn_points to 0, and re-roll the dice (NUM_DICE of them).
  2. The dice rolls either (i) have at least one green, or (ii) have no reds. In this case, the current player has a choice of what to do.
    1. If the player chooses to bank, then we need to update scores, reset turn_points to 0, change current_player and re-roll the dice (NUM_DICE of them).
    2. If the player chooses to roll, then we need to update turn_points and re-roll just the dice that were not green.

The function updateState below does all of the above. I have also added a verbose option which, if set to TRUE, prints information on the game state to the console. (The function DiceToString is a small helper function for printing out the dice rolls.)

DiceToString <- function(dice_rolls) {
  return(paste(sum(dice_rolls == 2), "Green,",
               sum(dice_rolls == 1), "Yellow,",
               sum(dice_rolls == 0), "Red"))
}

# Executes the current player's action and updates the state. Is also used if
# no greens and at least one red is rolled.
UpdateState <- function(state, action, verbose = FALSE) {
  if (verbose) {
    cat("Current roll:", DiceToString(state$dice_rolls))
    if (action != "rolled at least 1 red and no green")
      cat(" (bank", state$turn_points + sum(state$dice_rolls == 2), "pts?)",
          fill = TRUE)
    else
      cat("", fill = TRUE)
    cat(paste0("Player ", state$current_player, " ", action, "s"),
        fill = TRUE)
  }
    
  if (action == "bank") {
    # action = "bank": bank the points current player earned this turn, then
    # re-roll the dice.
    state$scores[state$current_player] <- state$scores[state$current_player] +
      state$turn_points + sum(state$dice_rolls == 2)
    state$turn_points <- 0
    state$dice_rolls <- sample(0:2, size = NUM_DICE, replace = TRUE, 
                               prob = FACE_PROBS)
    state$current_player <- 3 - state$current_player
  } else if (action == "roll") {
    # action = "roll": add the green dice to points earned this turn, then 
    # re-roll the remaining dice.
    state$turn_points <- state$turn_points + sum(state$dice_rolls == 2)
    ndice <- sum(state$dice_rolls != 2)
    if (ndice == 0) ndice <- NUM_DICE
    state$dice_rolls <- sample(0:2, size = ndice, replace = TRUE, 
                               prob = FACE_PROBS)
  } else if (action == "rolled at least 1 red and no green") {
    # action = "rolled at least 1 red and no green": 
    # no points banked, turn ends, re-roll dice.
    state$turn_points <- 0
    state$dice_rolls <- sample(0:2, size = NUM_DICE, replace = TRUE, 
                               prob = FACE_PROBS)
    state$current_player <- 3 - state$current_player
  } else {
    stop("action must be 'bank', 'roll', or 'rolled at least 1 red and no green'")
  }
  
  if (verbose) {
    if (action != "roll") {
      cat("Current scores:", state$scores, fill = TRUE)
      cat("", fill = TRUE)
    }
  }
  
  return(state)
}

Step 3: How to express player behavior and strategy?

We can think of a player as a black box which takes a game state as an input and outputs a decision, “bank” or “roll”. In other words, the player is a function!

Below are two extremely simple strategies expressed as functions. The first strategy simply banks after the first roll. The second strategy banks once the current turn is worth more than target points.

# strategy that stops after one roll
OneRoll <- function(state) {
  return("bank")
}

# strategy that stops only when the turn yields > `target` points
OverTargetPoints <- function(state, target = 10) {
  if (state$turn_points + sum(state$dice_rolls == 2) > target) {
    return("bank")
  } else {
    return("roll")
  }
}

(Note: In my implementation, strategy functions should assume that they are current_player: that is how they can determine their current score and that of their opponent correctly.)

Step 4: Simulating a full game

We can now put the pieces from the previous steps together to simulate a full game of Toss Up!. The SimulateGame function takes in two strategy functions as input. It sets up the initial state, then updates the state repeatedly until the game ends.

SimulateGame <- function(Strategy1, Strategy2, verbose = FALSE) {
  # set up initial state
  state <- list(current_player = 1,
                scores = c(0, 0),
                turn_points = 0,
                dice_rolls = sample(0:2, size = NUM_DICE, replace = TRUE, 
                                    prob = FACE_PROBS))
  
  # check if no greens and at least one red, if so change player
  while (sum(state$dice_rolls == 2) == 0 && sum(state$dice_rolls == 0) > 0) {
    state <- UpdateState(state, "rolled at least 1 red and no green", verbose)
  }
  
  # while the game has not ended:
  # - get the next action from the current player's strategy
  # - update the state
  while (max(state$scores) < POINTS_TO_WIN) {
    if (state$current_player == 1) {
      action <- Strategy1(state)
    } else {
      action <- Strategy2(state)
    }
    state <- UpdateState(state, action, verbose)
    
    # check if no greens and at least one red
    while (sum(state$dice_rolls == 2) == 0 && sum(state$dice_rolls == 0) > 0) {
      state <- UpdateState(state, "rolled at least 1 red and no green", verbose)
    }
  }
  
  # game has ended: return winner
  if (verbose) {
    cat(paste("Game ends: Player", which.max(state$scores), "wins!"),
        fill = TRUE)
  }
  return(which.max(state$scores))
}

Two examples of simulated games

The code below shows what a simulated game of Toss Up! might look like. In the code snippet below, players need 20 points to win. The first player uses the super conservative strategy of banking immediately, while the second player tries to win the entire game in one turn.

NUM_DICE <- 10
FACE_PROBS <- c(1/6, 2/6, 3/6)
POINTS_TO_WIN <- 20
set.seed(1)
winner <- SimulateGame(OneRoll,
                       function(state) { OverTargetPoints(state, 19) },
                       verbose = TRUE)

# Current roll: 4 Green, 3 Yellow, 3 Red (bank 4 pts?)
# Player 1 banks
# Current scores: 4 0
# 
# Current roll: 5 Green, 4 Yellow, 1 Red (bank 5 pts?)
# Player 2 rolls
# Current roll: 3 Green, 1 Yellow, 1 Red (bank 8 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 0 Red (bank 10 pts?)
# Player 2 rolls
# Current roll: 5 Green, 4 Yellow, 1 Red (bank 15 pts?)
# Player 2 rolls
# Current roll: 2 Green, 3 Yellow, 0 Red (bank 17 pts?)
# Player 2 rolls
# Current roll: 0 Green, 3 Yellow, 0 Red (bank 17 pts?)
# Player 2 rolls
# Current roll: 2 Green, 1 Yellow, 0 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 0 Green, 1 Yellow, 0 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 0 Green, 1 Yellow, 0 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 1 Green, 0 Yellow, 0 Red (bank 20 pts?)
# Player 2 banks
# Current scores: 4 20
# 
# Game ends: Player 2 wins!

In this particular simulation, it paid off to take a bit more risk. As the simulation below shows, that’s not always the case.

NUM_DICE <- 10
FACE_PROBS <- c(0.5, 0.0, 0.5)
POINTS_TO_WIN <- 20
set.seed(1)
winner <- SimulateGame(OneRoll,
                       function(state) { OverTargetPoints(state, 19) },
                       verbose = TRUE)

# Current roll: 6 Green, 0 Yellow, 4 Red (bank 6 pts?)
# Player 1 banks
# Current scores: 6 0
# 
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 3 Red (bank 7 pts?)
# Player 2 rolls
# Current roll: 0 Green, 0 Yellow, 3 Red
# Player 2 rolled at least 1 red and no greens
# Current scores: 6 0
# 
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 1 banks
# Current scores: 11 0
# 
# Current roll: 7 Green, 0 Yellow, 3 Red (bank 7 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 1 Red (bank 9 pts?)
# Player 2 rolls
# Current roll: 1 Green, 0 Yellow, 0 Red (bank 10 pts?)
# Player 2 rolls
# Current roll: 3 Green, 0 Yellow, 7 Red (bank 13 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 5 Red (bank 15 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 3 Red (bank 17 pts?)
# Player 2 rolls
# Current roll: 2 Green, 0 Yellow, 1 Red (bank 19 pts?)
# Player 2 rolls
# Current roll: 0 Green, 0 Yellow, 1 Red
# Player 2 rolled at least 1 red and no greens
# Current scores: 11 0
# 
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 1 banks
# Current scores: 16 0
# 
# Current roll: 4 Green, 0 Yellow, 6 Red (bank 4 pts?)
# Player 2 rolls
# Current roll: 4 Green, 0 Yellow, 2 Red (bank 8 pts?)
# Player 2 rolls
# Current roll: 1 Green, 0 Yellow, 1 Red (bank 9 pts?)
# Player 2 rolls
# Current roll: 0 Green, 0 Yellow, 1 Red
# Player 2 rolled at least 1 red and no greens
# Current scores: 16 0
# 
# Current roll: 5 Green, 0 Yellow, 5 Red (bank 5 pts?)
# Player 1 banks
# Current scores: 21 0
# 
# Game ends: Player 1 wins!

Comparing some simple strategies

We can’t tell whether one strategy is better than another by looking at a single game since there is so much randomness involved. What we should do is simulate many games and see which strategy wins out over the long run.

OneRoll vs. OneRoll

Let’s do a sanity check. Here, we run 10,000 games of the strategy OneRoll vs. itself (this takes around 4 seconds on my machine). Since both players use the same strategy, we might expect player 1 to win around 50% of the time, right?

NUM_DICE <- 10
FACE_PROBS <- c(1/6, 2/6, 3/6)
POINTS_TO_WIN <- 100

nsim <- 10000
set.seed(1)
winners <- replicate(nsim, SimulateGame(OneRoll, OneRoll))
table(winners)  # not 50-50: player who starts first has an advantage

# winners
# 1    2 
# 5910 4090 

It looks like player 1 wins around 59% of the time! On second thought, we would expect player 1 to win more because he/she has a first mover advantage: if player 1 reaches 100 points, he/she wins even if player 2 might reach 100 points on the very next turn.

To make this sanity check work, we should have the players each go first half the time. The code below achieves that:

# for even numbered simulations, let player 2 start first
winners2 <- winners
winners2[2 * 1:(nsim/2)] <- 3 - winners2[2 * 1:(nsim/2)]
table(winners2)

# winners2
# 1    2 
# 5030 4970 

Now player 1 wins just 50.3% of the time, which is much closer to the expected 50%.

OneRoll vs. >20 points

Next, let’s compare the “one roll” strategy with the strategy which stops once the player can bank more than 20 points for the turn:

set.seed(1)
winners <- replicate(nsim, SimulateGame(
  OneRoll,
  function(state) { OverTargetPoints(state, 20) }))
table(winners)

# winners
# 1    2 
# 75 9925 

Wow! The strategy of banking only when >20 points have been earned on a turn wins almost all the time, even though it doesn’t have the first mover advantage!

>20 points vs. >50 points

Taking a bit more risk than always banking seems to pay off. What about even more risk? How does banking after >20 points compare with banking after >50 points?

set.seed(1)
winners <- replicate(nsim, SimulateGame(
  function(state) { OverTargetPoints(state, 20) },
  function(state) { OverTargetPoints(state, 50) }))
table(winners)

# winners
# 1    2 
# 6167 3833 

# switch order of players
set.seed(1)
winners <- replicate(nsim, SimulateGame(
  function(state) { OverTargetPoints(state, 50) },
  function(state) { OverTargetPoints(state, 20) }))
table(winners)

# winners
# 1    2 
# 4414 5586 

The >20 points strategy won 61.7% of the games when it went first, and won 55.9% of the games when it went second, so it’s clearly a superior strategy.

Where can we go from here?

There are several future directions we could take with this code base.

  • The strategies presented here are very simple. Can we program more complex strategies (e.g. some that depend on the score of the other player, or on how close the player is to winning)?
  • Among strategies of the form “bank once you have more than xx points”, is there an optimal value for xx?
  • What is the optimal strategy for Toss Up!? Can we learn it through methods such as Q-learning?
  • Is there even such a thing as an optimal strategy? Do there exist 3 strategies such that A beats B, B beats C, but C beats A?
  • Can we quantify the first mover advantage?
  • Can we modify the code so that we can visualize the players’ scores as the game progresses?
  • Can we modify the code so that more than 2 players can play?

Exploring the game “First Orchard” with simulation in R

My daughter received the board game First Orchard as a Christmas present and she’s hooked on it so far. In playing the game with her, a few probability/statistics questions came to mind. This post outlines how I answered some of them using simulation in R. All code for this blog post can be found here.

(In my googling I found that Matt Lane has an excellent blog post on this game, answering some of the questions that I was interested in.)

Gameplay

Before we get to the questions, let me give a quick explanation of how the game works (see Reference 1 for a more colorful explanation as well as an applet to play the game online).

  • It’s a cooperative game, with all players playing against the board.
  • The game starts with 16 fruit: 4 of each color (red, green, blue, yellow), and a raven at one end of a path that is 5 steps long.
  • On each player’s turn, the player rolls a 6-sided die.
    • If the die comes up red, green, blue or yellow, the player gets to “harvest” a fruit of that color if there are any left to harvest. If all 4 fruits of that color have been harvested, nothing happens.
    • If the die shows a fruit basket, the player gets to harvest a fruit of any color.
    • If the die shows a raven, the raven moves one step along the path.
  • The game ends when either all the fruit are harvested (players win) or when the raven reaches the end of the path (raven wins).

As you can see this is a really simple game (hence its “suitable for 2+ years” rating). The only strategy is in choosing which fruit to take if a fruit basket shows up: see Reference 1 for some simulations of different strategies. Intuitively it seems like choosing the color with the most fruit remaining is the best strategy, and that’s what I bake into my code. (Is there a proof of this, and is it still true in the more general circumstances described in Reference 1?)

Code for simulating the game

The state of the game can be captured in a numeric vector of length 5. The first 4 numbers refer to the number of fruit left for each color, and the 5th number keeps track of the number of steps the raven has taken so far. I created 3 functions to simulate one game of First Orchard (see full code here):

  • SimulateTurn(state, verbose) takes one die roll and updates the state of the game. For simplicity, if a 1-4 is rolled, a fruit is harvested from the corresponding tree. If a 5 is rolled, the raven takes a step. A rolled 6 is taken to mean “fruit basket”, and I remove a fruit from the tree with the most remaining fruit. (A rough sketch of this function appears right after this list.)
  • CheckGameState(state, max_raven_steps) checks if the game has ended or not, and if so, who won.
  • SimulateGame(fruit_count, max_raven_steps, verbose) runs an entire game of First Orchard: while the game has not ended, run SimulateTurn. Once the game has ended, this function returns (i) who won, (ii) the number of turns taken, (iii) the number of steps the raven took, and (iv) the number of fruit left.
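
Since the full implementation is only linked above, here is a rough sketch of what SimulateTurn might look like, reconstructed from the description (the actual linked code may differ in its details):

# Rough sketch of SimulateTurn (my reconstruction, not the linked code).
# state = c(red, green, blue, yellow, raven_steps).
SimulateTurn <- function(state, verbose = FALSE) {
  roll <- sample(6, size = 1)
  if (roll <= 4) {
    # colors 1-4: harvest a fruit of that color if any are left
    if (state[roll] > 0) state[roll] <- state[roll] - 1
  } else if (roll == 5) {
    # raven takes one step
    state[5] <- state[5] + 1
  } else {
    # fruit basket: harvest from the color with the most fruit remaining
    biggest <- which.max(state[1:4])
    if (state[biggest] > 0) state[biggest] <- state[biggest] - 1
  }
  if (verbose) cat("Roll:", roll, ", State:", paste(state, collapse = ","),
                   fill = TRUE)
  return(state)
}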

We allow two game parameters to be defined by the user: the number of fruit of each type at the start of the game (fruit_count, default is 4) and the number of steps the raven must take in order for it to win (max_raven_steps, default is 5). The verbose option for these functions lets the user see what happened in the game. The code below is an example of the output from SimulateGame:

set.seed(1)
results <- SimulateGame(fruit_count = 2, max_raven_steps = 3, 
                        verbose = TRUE)
# Roll: 1 , State: 1,2,2,2,0
# Roll: 4 , State: 1,2,2,1,0
# Roll: 1 , State: 0,2,2,1,0
# Roll: 2 , State: 0,1,2,1,0
# Roll: 5 , State: 0,1,2,1,1
# Roll: 3 , State: 0,1,1,1,1
# Roll: 6 , State: 0,0,1,1,1
# Roll: 2 , State: 0,0,1,1,1
# Roll: 3 , State: 0,0,0,1,1
# Roll: 3 , State: 0,0,0,1,1
# Roll: 1 , State: 0,0,0,1,1
# Roll: 5 , State: 0,0,0,1,2
# Roll: 5 , State: 0,0,0,1,3
# Raven wins
# # of turns: 13
# # of steps raven took: 3
# # fruit left: 1

Simulation time!

What is the probability of winning the game? How does that change as we vary (i) the number of fruit of each color, and (ii) the number of steps the raven must take in order for the players to lose?

We let the number of fruit of each color vary from 1 to 8, and the number of steps the raven must take from 1 to 8. For each parameter setting, we simulate the game 10,000 times and compute the player win probability. We plot the results as a heatmap.
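
Here is a rough sketch of the parameter sweep (the winner element name is my guess at what SimulateGame returns; adjust it to match the actual code):

library(tidyverse)

# sweep over fruit counts and raven path lengths
# (slow: 64 settings x 10,000 games each)
nsim <- 10000
param_grid <- expand_grid(fruit_count = 1:8, max_raven_steps = 1:8)
win_prob <- pmap_dbl(param_grid, function(fruit_count, max_raven_steps) {
  mean(replicate(nsim,
                 SimulateGame(fruit_count = fruit_count,
                              max_raven_steps = max_raven_steps)$winner == "players"))
})

param_grid %>%
  mutate(win_prob = win_prob) %>%
  ggplot(aes(x = factor(fruit_count), y = factor(max_raven_steps), fill = win_prob)) +
  geom_tile() +
  labs(x = "No. of fruit per color", y = "No. of steps raven needs to win",
       fill = "Player win prob.")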

As one might expect, win probability goes down as the number of fruit increases and as the number of steps the raven must take decreases. For the original game (4 fruit, 5 steps), the win probability is approximately 62%. Sounds reasonable: we would like a win probability >50% so that kids will not get discouraged by too much losing, but not so high that they think the game is trivial.

For the original game (4 fruit, 5 steps), what is the expected number of steps until the game ends? Does this change depending on whether the player or the raven wins?

We simulate the game 100,000 times and keep track of the number of steps taken in each game. The shortest game took 5 steps while the longest took 45 steps, with the modal number of steps being 21 (it also happens to be the mean and median). Here is the histogram for all 100,000 runs:
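
Here is a sketch of how these runs could be generated (again, element names such as winner and num_turns are my guesses at SimulateGame’s return value):

nsim <- 100000
sims <- replicate(nsim,
                  SimulateGame(fruit_count = 4, max_raven_steps = 5),
                  simplify = FALSE)
num_turns <- sapply(sims, function(res) res$num_turns)
winner <- sapply(sims, function(res) res$winner)

hist(num_turns, breaks = 30, xlab = "Number of turns",
     main = "Number of turns per game")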

Here are the histograms split by the outcome:

Games where the raven wins tend to be shorter than those when players win. Maybe that’s not too surprising, since a game where the raven wins needs just 5 steps, while a game where the players win needs at least 16 steps. On average, the game takes 19 steps for raven wins and 22 steps for player wins.

For the original game (4 fruit, 5 steps), given that the raven loses, what is the distribution of the number of steps the raven has taken?

Because we programmed SimulateGame to return the number of steps the raven has taken as well, we don’t have to rerun the simulations: we can just use the 100,000 simulations we ran previously and look at the ones that the raven lost. Here is the histogram of steps the raven took in losing games, with the vertical red line representing the mean:

For the original game (4 fruit, 5 steps), given that the raven wins, what is the distribution of the number of unharvested fruit?

Again, we can just use the 100,000 simulations we ran previously and look at the ones that the raven won. Here is the histogram along with the mean and median marked out with vertical lines:

The modal number of fruit left in player-losing games is 1: ugh, tantalizingly close!

If there was no raven, how many turns would it take to harvest all the fruit?

“No raven” is the same as saying that the raven needs to take an infinite number of steps in order to win. Hence, we can use our existing simulation code with max_raven_steps = Inf to simulate this setting.

The shortest game took 16 turns while the longest game took 63 turns, with 22 and 24 turns being the modal and mean number of turns respectively. (In theory, a game could go on forever.) Here is the histogram:

References:

  1. Matt Lane. (2018). Harvesting Wins.

A shiny app for exploratory data analysis

I recently learnt how to build basic R Shiny apps. To practice using Shiny, I created a simple app for performing basic exploratory data analysis. You can use the app here to play around with the diamonds dataset from the ggplot2 package. To use the app for your own dataset, download the raw R code here (just the app.R file) and assign your dataset to raw_df. In the rest of this post, I outline how to use this app.

(Credits: I worked off the source code for the “Diamonds Explorer” app. There are a few versions of this app out there and I can’t find the exact source I used, but it was very close to the source code of this version.)

As you can see from the screenshot below, the main panel (on the right) has 4 tabs. The last two tabs simply give the output of calling the summary and str functions on the entire dataset; they are not affected by the controls in the side panel. The “Data Snippet” panel prints up to 15 rows of the dataset for a peek into what the dataset looks like. (These 15 rows are the first 15 rows of the dataset used to create the plot on the “Plot” tab.)

The most interesting tab is probably the “Plot” tab. First, let me describe how the app selects the dataset it makes the plot for. By default, it picks 1000 random observations, or all the observations if the dataset has fewer than 1000 rows. The user can input the random seed for reproducibility. The user can also control the number of observations using the slider, and choose the observations randomly or take the top n rows of the dataset.

The type of plot the app makes depends on the types of the variables given to it. In the screenshot above, one numeric variable and one non-numeric variable are given, so the app makes a boxplot. If two numeric variables are given, it makes a scatterplot:

For scatterplots, the user has the option to jitter the points and/or to add a smoothing line:

If two non-numeric variables are given, the app makes a heatmap depicting how often each combination is present in the data:

The plots above depict the joint distribution of two variables in the dataset. If the user wants a visualization for just one variable, the user can set the “Y” variable to “None”. If the “X” variable is numeric, the app plots a histogram:

If the “X” variable is non-numeric, the app plots a bar plot showing counts:
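
Putting the cases together, here is a rough sketch (my own illustration, not the app’s actual code) of the kind of plot-type dispatch described above; the function name MakeEDAPlot is made up for this example:

library(ggplot2)

# choose the geom based on the types of the selected variables
MakeEDAPlot <- function(df, xvar, yvar = "None") {
  x_num <- is.numeric(df[[xvar]])
  if (yvar == "None") {
    # one variable: histogram for numeric, bar plot of counts otherwise
    p <- ggplot(df, aes(x = .data[[xvar]]))
    if (x_num) p + geom_histogram() else p + geom_bar()
  } else {
    y_num <- is.numeric(df[[yvar]])
    p <- ggplot(df, aes(x = .data[[xvar]], y = .data[[yvar]]))
    if (x_num && y_num) {
      p + geom_point()    # two numeric variables: scatterplot
    } else if (x_num || y_num) {
      p + geom_boxplot()  # one numeric, one non-numeric: boxplot
    } else {
      p + geom_count()    # two non-numeric: heatmap-like count plot
    }
  }
}

# e.g. MakeEDAPlot(diamonds, "carat", "price")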

Finally, let’s talk about color. For simplicity, the app only allows plots to be colored by non-numeric variables. Below is an example of a colored scatterplot:

As the screenshot below shows, color works for boxplots too. (Color does not work for heatmaps.)

Color can be used for the one-dimensional plots as well:

There are certainly several improvements that can be made to this app. For one, it would be nice if the user could upload their dataset through the app (instead of downloading my source code and assigning their dataset to the raw_df variable). There could also be more controls to let the user change certain aspects of the plot (e.g. transparency of points), but at some point the UI might become too overwhelming for the user.

Happy exploring!