# Using different fonts with ggplot2

I was recently asked to convert all the fonts in my ggplot2-generated figures for a paper to Times New Roman. It turns out that this is easy, but it brought up a whole host of questions that I don’t have the full answer to.

If you want to go all out with using custom fonts, I suggest looking into the extrafont and showtext packages. This post will focus on what you can do without importing additional packages.

Let’s make a basic plot and see its default look (I am generating this on a Mac with the Quartz device):

```r
library(ggplot2)

base_fig <- ggplot(data = economics, aes(date, pop)) +
  geom_line() +
  labs(title = "Total US population over time",
       subtitle = "Population in thousands",
       x = "Date",
       y = "Total population (in thousands)")

base_fig
```


To change all text in the figure to Times New Roman, we just need to update the text option of the theme as follows:

```r
base_fig +
  theme(text = element_text(family = "Times New Roman"))
```


ggplot2 allows you to change the font of each part of the figure: you just need to know the correct option to modify in the theme. (For a full list of customizable components of the theme, see this documentation.)
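If you'd rather not dig through the documentation, you can also list the theme components directly from the console. (This is my own quick sketch, assuming ggplot2 is loaded; it is not from the official docs.)

```r
library(ggplot2)

# All modifiable components of the currently active theme
theme_names <- names(theme_get())

# Components such as "plot.title", "plot.subtitle" and "axis.text.x"
# appear in this list
head(sort(theme_names))
```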

```r
base_fig +
  theme(plot.title    = element_text(family = "mono"),
        plot.subtitle = element_text(family = "sans"),
        axis.title.x  = element_text(family = "Comic Sans MS"),
        axis.title.y  = element_text(family = "AppleGothic"),
        axis.text.x   = element_text(family = "Optima"),
        axis.text.y   = element_text(family = "Luminari"))
```


For standard fonts (i.e. widely available fonts used across much of academia and industry), the code above will suffice.

What fonts can I pass to element_text()? I couldn’t find an easy answer to this: it depends on the OS you are using and on the graphics device used to render the figure.

On a Mac, you can go to the “Font Book” application and have a look at the list of fonts there. Most of those fonts should work. One exception I noticed was with fonts designed for a different alphabet. (For example, using family = "Klee" in the code above did not work for me.)

What’s a graphics device? It’s the engine that renders your plot. Common graphics devices are Quartz and X11. This is something the user typically does not need to care about. See this link and ?Devices for more details.

What are “mono”, “sans” and “serif”? These are categories of “types of fonts” (see details here). When the user specifies one of these three keywords instead of a full font name (e.g. “Times New Roman”), the graphics device uses its default font associated with that keyword.
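As a quick sketch (my own illustration, assuming ggplot2 is loaded), you can build the same figure once per keyword and inspect the family stored in each plot object; the device substitutes its default font only at render time:

```r
library(ggplot2)

base <- ggplot(economics, aes(date, pop)) + geom_line()

# One plot per generic family keyword; when rendered, the device maps each
# keyword to its default font (e.g. "serif" -> Times on the Quartz device)
plots <- lapply(c("serif", "sans", "mono"), function(fam) {
  base + theme(text = element_text(family = fam))
})

# The plot objects store the keyword, not the resolved font
sapply(plots, function(p) p$theme$text$family)
```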

For the Quartz device, you can use quartzFonts() to see what the default font for each of these keywords is:

```r
quartzFonts()
# $serif
# [1] "Times-Roman"       "Times-Bold"        "Times-Italic"      "Times-BoldItalic"
#
# $sans
# [1] "Helvetica"             "Helvetica-Bold"        "Helvetica-Oblique"     "Helvetica-BoldOblique"
#
# $mono
# [1] "Courier"              "Courier-Bold"         "Courier-Oblique"      "Courier-BoldOblique"
```

# Documentation for internal functions

TLDR: To avoid triple colon (:::) and R CMD CHECK --as-cran errors due to documentation examples for internal functions, enclose the example code in \dontrun{}.

I recently encountered an issue when submitting an R package to CRAN that I couldn’t find a clean answer for. One of the comments from the manual check was the following:

> Using foo:::f instead of foo::f allows access to unexported objects. It should not be used in other context and is generally not recommended, as the semantics of unexported objects may be changed by the package author in routine maintenance.

I was a bit surprised at getting this message, as I was quite careful to avoid using the triple colon ::: in my functions. A bit of checking revealed that I had used it in one of the examples when documenting an internal function.

Now, a simple fix for this would have been to delete the documentation examples for this internal function and resubmit, but that seemed wrong to me. These examples, while not necessarily user-facing, are very important for developers of the package to get a quick handle on the internal functions. There has to be a better way!

Here’s a small example that replicates the problem I faced. (Note that I am working on a Mac, and parts of the post will reflect that.) Create a new R package via RStudio’s “New Project…” (I named this package mypackage), and add the following internal function as internal.R:

```r
#' Internal function
#'
#' This function prints a message.
#'
#' @examples
#' mypackage:::internal_function()
internal_function <- function() {
  print("This is an internal function.")
}
```

Running ?internal_function in the console brings up the documentation, as one would expect. Running R CMD CHECK --as-cran on this package does not throw any warnings or notes, but when you submit it to CRAN, you will get the comment at the beginning of the post.
My first thought was to replace the line #' mypackage:::internal_function() with #' internal_function(). However, this throws an error when running R CMD CHECK --as-cran. A simple workaround I found (from this thread) is to enclose the example in \dontrun{}:

```r
#' Internal function
#'
#' This function prints a message.
#'
#' @examples \dontrun{
#' internal_function()
#' }
internal_function <- function() {
  print("This is an internal function.")
}
```

R CMD CHECK --as-cran no longer throws an error, and there are no :::s to complain about.

Because this function is internal, it is worth adding the line @keywords internal to the roxygen comment block for this function; that way the function is removed from the package’s documentation index.

Disclaimer: This is the workaround I found: it is not necessarily the best or the correct way to deal with this issue. I would love to hear from others on how you avoid this problem when documenting internal functions!

# What is the Stable Unit Treatment Value Assumption (SUTVA)?

The Stable Unit Treatment Value Assumption (SUTVA) is a key assumption that is usually made in causal inference. Reference 1 gives a clear definition of SUTVA, which points out that SUTVA is really two assumptions rolled into one:

1. The potential outcomes for any unit do not vary with the treatments assigned to other units.
2. For each unit, there are no different forms or versions of each treatment level which lead to different potential outcomes.

When someone references SUTVA, they are typically thinking of assumption 1. However, assumption 2 is important too!

Note also that SUTVA does NOT say anything about how the treatment assigned to one unit affects the treatment assigned to another unit. In fact, in many designs the treatment assignment is not independent across units.
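This non-independence is easy to see in a quick simulation (my own sketch, not from the reference): under complete randomization, knowing that unit 1 was treated lowers the probability that unit 2 was treated.

```r
set.seed(1)
n <- 10; m <- 5  # n units, exactly m of them treated

# Draw many completely randomized assignments (columns = replications)
assignments <- replicate(10000, sample(rep(c(0, 1), c(n - m, m))))

# Unconditionally, P(unit 2 treated) is about m/n = 0.5...
p_unconditional <- mean(assignments[2, ])

# ...but conditional on unit 1 being treated, it drops toward (m-1)/(n-1) = 4/9
p_conditional <- mean(assignments[2, assignments[1, ] == 1])
```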
For example, in the completely randomized design for $n$ units, exactly $m$ units get treatment and $n - m$ units get control, where $m$ is fixed in advance. Unit 1 getting the treatment means that unit 2 is less likely to get treatment: the treatment assignments for units 1 and 2 are not independent of each other.

References:

1. Imbens, G. W., and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction.

# Estimating win probability from best-of-7 series is not straightforward

(Note: The code in this post is available here as a single R script.)

Let’s say A and B play 7 games and A wins 4 of them. What would be a reasonable estimate for A’s win probability (over B)? A natural estimate would be 4/7. More generally, if A wins $x$ games, then the estimator $\hat{p} = x/7$ is an unbiased estimator of the true win probability $p$. This is not difficult to show: since $x \sim \text{Binom}(7, p)$, we have

\begin{aligned} \mathbb{E}[x/7] = \mathbb{E}[x] / 7 = (7p) / 7 = p. \end{aligned}

Now, let’s say that A and B play a best-of-7 series (or equivalently, first to 4 wins), and A wins 4 of them. It seems natural to estimate A’s win probability by 4/7 again. If we define the estimator

$\hat{p} = \dfrac{\# \text{ games A wins}}{\# \text{ games played}},$

does this estimator have any nice properties? It’s hard to say anything about this estimator (if someone knows something, please let me know!). In fact, it’s not even unbiased!

## Estimator is biased: a simulation

First, let’s set up two functions. The first function, SampleSeries, simulates a series which ends when either player 1 reaches n_wins1 wins or player 2 reaches n_wins2 wins. (The true win probability for player 1 is p.) The second function, EstimateWinProb, samples n_sim series. For each series, it estimates player 1’s win probability as in $\hat{p}$ above, then returns the mean of $\hat{p}$ over the n_sim simulations.

```r
# Player 1 has win probability p. Simulates a series of games which ends when
# player 1 reaches n_wins1 wins or when player 2 reaches n_wins2 wins.
SampleSeries <- function(p, n_wins1, n_wins2 = n_wins1) {
  game_results <- rbinom(n_wins1 + n_wins2 - 1, size = 1, prob = p)
  if (sum(game_results) >= n_wins1) {
    n_games <- which(game_results == 1)[n_wins1]
  } else {
    n_games <- which(game_results == 0)[n_wins2]
  }
  player1_wins <- sum(game_results[1:n_games])
  return(c(player1_wins, n_games - player1_wins))
}

# Run SampleSeries n_sim times and for each run, estimate p by player 1's
# win percentage in that series. Returns the mean of the estimated probabilities.
EstimateWinProb <- function(p, n_sim, n_wins1, n_wins2 = n_wins1) {
  win_data <- replicate(n_sim, SampleSeries(p, n_wins1, n_wins2))
  prob_estimates <- apply(win_data, 2, function(x) x[1] / (x[1] + x[2]))
  return(mean(prob_estimates))
}
```

Next, we run a Monte Carlo simulation (20,000 simulations) to compute the mean of $\hat{p}$ for a range of values of $p$. (Because of the symmetry inherent in the setup, we only consider $p \geq 0.5$.)

```r
set.seed(1)
p <- 20:40 / 40
n_wins1 <- 4
n_sim <- 20000
mean_phat <- sapply(p, function(p) EstimateWinProb(p, n_sim, n_wins1))
plot(p, mean_phat, asp = 1, pch = 16, ylab = "Win prob estimator mean",
     main = "First to 4 wins")
abline(0, 1)
```

If $\hat{p}$ were unbiased, we would expect the black dots to lie on the black line $y = x$; it’s clear that they don’t. In this setting, $\hat{p}$ is biased upwards.

## Estimator is biased: exact computation

I was not able to find a nice closed-form expression for $\mathbb{E}[\hat{p}]$: for a best-of-7 series, WolframAlpha “simplified” my expression to give

\begin{aligned} \mathbb{E}[\hat{p}] = \dfrac{4x}{5} + \dfrac{2x^2}{15} + \dfrac{4x^3}{105} + \dfrac{101x^4}{21} - \dfrac{1252x^5}{105} + 10x^6 - \dfrac{20x^7}{7}. \end{aligned}

However, it is possible to write a function that computes $\mathbb{E}[\hat{p}]$ from first principles:

```r
# Return expected value of the "series win percentage" estimator
GetEstimatorMean <- function(p, n_wins1, n_wins2 = n_wins1) {
  # first sum is for when player 1 wins, second sum is for when player 2 wins
  summands1 <- sapply(0:(n_wins2 - 1), function(x)
    n_wins1 / (n_wins1 + x) * p^n_wins1 * (1 - p)^x * choose(x + n_wins1 - 1, x)
  )
  summands2 <- sapply(0:(n_wins1 - 1), function(x)
    x / (x + n_wins2) * p^x * (1 - p)^n_wins2 * choose(x + n_wins2 - 1, x)
  )
  return(sum(c(summands1, summands2)))
}
```

Let’s plot these exact values on the previous plot to make sure we got it right:

```r
lines(p, sapply(p, function(x) GetEstimatorMean(x, n_wins1)),
      col = "red", lty = 2, lwd = 1.5)
```

## What might be a better estimator?

Let’s go back to the setup where A and B play a best-of-7 series (or equivalently, first to 4 wins), and A wins 4 of them. Instead of estimating A’s win probability by 4/7, we could use the plot above to find which value of $p$ would have given $\mathbb{E}[\hat{p}] = 4/7$, and use that instead. This is essentially a method of moments estimator with a single data point, and so it has some nice properties (well, at least it’s consistent). The picture below depicts this method for obtaining estimates for when A wins 4 out of 4, 5, 6 and 7 games.

```r
p <- 20:40 / 40
original_estimate <- c(4 / 7:4)
mm_estimate <- c(0.56, 0.64, 0.77, 1)  # by trial and error
plot(p, sapply(p, function(x) GetEstimatorMean(x, n_wins1 = 4)),
     type = "l", asp = 1, ylab = "Win prob estimator mean")
abline(0, 1, lty = 2)
segments(x0 = 0, x1 = mm_estimate, y0 = original_estimate, col = "red", lty = 2)
segments(y0 = 0, y1 = original_estimate, x0 = mm_estimate, col = "red", lty = 2)
```

What are some other possible estimators in this setting?

## First to n wins: generalizing the above

Since we’ve coded up all this already, why look at just first to 4 wins? Let’s see what happens to the estimator $\hat{p}$ when we vary the number of wins the players need in order to win the series. The code below makes a plot showing how $\mathbb{E}[\hat{p}]$ varies as we change the number of wins needed from 1 to 8 (inclusive).

```r
library(tidyverse)
theme_set(theme_bw())

df <- expand.grid(p = 20:40 / 40, n_wins1 = 1:8)
df$mean_phat <- apply(df, 1, function(x) GetEstimatorMean(x[1], x[2]))
ggplot(df, aes(group = factor(n_wins1))) +
  geom_line(aes(x = p, y = mean_phat, col = factor(n_wins1))) +
  coord_fixed(ratio = 1) +
  scale_color_discrete(name = "n_wins") +
  labs(y = "Win prob estimator mean", title = "First to n_wins")
```
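Incidentally, the "trial and error" values used for mm_estimate earlier can be obtained numerically. Here is a sketch (my own addition; it re-derives the exact-mean formula so the snippet is self-contained, then inverts it with uniroot):

```r
# Exact E[p_hat] for a first-to-k series (same formula as GetEstimatorMean,
# specialized to n_wins1 = n_wins2 = k)
estimator_mean <- function(p, k) {
  s1 <- sapply(0:(k - 1), function(x)
    k / (k + x) * p^k * (1 - p)^x * choose(x + k - 1, x))
  s2 <- sapply(0:(k - 1), function(x)
    x / (x + k) * p^x * (1 - p)^k * choose(x + k - 1, x))
  sum(s1, s2)
}

# Solve E[p_hat](p) = observed win fraction for p
mm_estimate <- function(phat_obs, k) {
  uniroot(function(p) estimator_mean(p, k) - phat_obs,
          interval = c(1e-6, 1 - 1e-6), tol = 1e-9)$root
}

# A wins 4 of 7, 6, 5 games: roots land near the trial-and-error
# values 0.56, 0.64, 0.77
round(sapply(c(4/7, 4/6, 4/5), mm_estimate, k = 4), 2)
```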


The bias appears to peak around n_wins1 = 3, and past that point it decreases as n_wins1 increases. The picture below zooms in on a small portion of the picture above:

## Asymmetric series: different win criteria for the players

Notice that in the code above, we maintained separate parameters for player 1 and player 2 (n_wins1 and n_wins2). This allows us to generalize even further, to explore what happens when players 1 and 2 need a different number of wins in order to win the series.

The code below shows how $\mathbb{E}[\hat{p}]$ varies with $p$ for different combinations of n_wins1 and n_wins2. Each line within a facet corresponds to one value of n_wins1, while each facet of the plot corresponds to one value of n_wins2. The size of $\hat{p}$’s bias increases as the difference between n_wins1 and n_wins2 increases.

```r
df <- expand.grid(p = 20:40 / 40, n_wins1 = 1:8, n_wins2 = 1:8)
df$mean_phat <- apply(df, 1, function(x) GetEstimatorMean(x[1], x[2], x[3]))
ggplot(df, aes(group = factor(n_wins1))) +
  geom_line(aes(x = p, y = mean_phat, col = factor(n_wins1))) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  coord_fixed(ratio = 1) +
  scale_color_discrete(name = "n_wins1") +
  facet_wrap(~ n_wins2, ncol = 4, labeller = label_both) +
  labs(y = "Win prob estimator mean",
       title = "Player 1 to n_wins1 or Player 2 to n_wins2")
```

# NBA playoffs: Visualizing win percentage by seeding

## Background

With the NBA playoffs going on, I’ve been thinking about the following question: A and B are about to play a game. We know that among all players, A has rank/seed $a$ and B has rank/seed $b$. (A higher ranking/seeding corresponds to a smaller value of $a$, with $a = 1$ being the best. Players with higher rank/seed are better players.) Knowing only these two ranks, what is a reasonable model for the probability that the higher seed wins the game?

On one hand, if the probability is always 0.5, it means that ranking is uninformative: the result of the game is like an unbiased coin flip. On the other hand, if the probability is 1, it means that ranking is perfectly informative: a higher-ranked player will always beat a lower-ranked player. Reality is probably somewhere in between.

Of course, there is no single correct answer for this. The answer could also vary greatly across sports and games, and within a single sport it could vary from league to league. The idea is just to come up with something that approximates reality somewhat closely.

Unfortunately I will not have an answer to this question in this post. I will, however, show you some data that I scraped for the NBA playoffs and some interesting data visualizations.

## NBA playoffs: An abbreviated introduction

There are two conferences in the NBA. At the end of the regular season, the top 8 teams in each conference make it to the playoff round, where they play best-of-7 series to determine who advances. The teams are ranked within each conference according to their win-loss records in the regular season. The image below is the playoff bracket for the 2019-2020 season, giving you an idea of how a team’s seeding affects which opponents it will face in different playoff rounds.

## Data scraping

My initial idea was the following: scrape NBA playoff data, keeping track of how many times seed $a$ played against seed $b$, as well as how many times seed $a$ won. This could be the raw data we use to estimate/model the win probabilities. Since the seedings only make sense within each conference, for each playoff year I excluded the NBA final (where the winner of each conference faces the other).

I was fortunate to find that Land of BasketBall.com presents the information I need in a systematic manner, making it easy to scrape. The R scraping script and RDS data files I saved are available here.

Determining how many years of playoff data should go into my dataset was a surprisingly difficult question. See this wikipedia section for a full history of the NBA playoff format. I was surprised to find out that ranking by win-loss record within each conference was only instituted from the 2016 playoffs onwards. Before that, there were more complicated seeding rules, meaning that teams could be ranked higher than other teams with better win-loss records. Thus, if I wanted ranking to be as pure as possible, I could only use data from 2016-2020 (the latest playoffs with data at the time of writing). As you will see in the rest of this post, this is not enough data to do good estimation. On the other hand, going back further means that the idea of ranking could be diluted.

The rest of the post is based on just the 2016-2020 data. You can download data for 1984-2020 (1984 was when the playoffs first expanded from 12 teams to 16 teams) here, rerun the analysis and compare. (You could even try to go back further, to 1947, when the playoffs first started!)

## Data visualization

Let’s have a look at the raw data:

```r
library(tidyverse)

readRDS("nba_playoffs_data_2016-2020.rds") %>% head()
#   year    team1 wins1        team2 wins2 seed1 seed2
# 1 2016 Warriors     4      Rockets     1     1     8
# 2 2016 Clippers     2 TrailBlazers     4     4     5
# 3 2016  Thunder     4    Mavericks     1     3     6
# 4 2016    Spurs     4    Grizzlies     0     2     7
# 5 2016 Warriors     4 TrailBlazers     1     1     5
# 6 2016  Thunder     4        Spurs     2     3     2
```

Each row corresponds to one best-of-7 series. Let’s add some more columns that will be useful:

```r
results_df <- readRDS("nba_playoffs_data_2016-2020.rds") %>%
  mutate(higher_seed = pmin(seed1, seed2),
         lower_seed = pmax(seed1, seed2),
         n_games = wins1 + wins2) %>%
  mutate(seed_diff = lower_seed - higher_seed,
         higher_seed_wins = ifelse(higher_seed == seed1, wins1, wins2),
         lower_seed_wins = ifelse(higher_seed == seed1, wins2, wins1)) %>%
  mutate(series_winner = ifelse(wins1 > wins2, higher_seed, lower_seed))
```

In this first plot, we show the number of series played between each pair of seeds (we always order the seeds so that the higher seed is first, on the x-axis).

```r
theme_set(theme_bw())

results_df %>%
  group_by(higher_seed, lower_seed) %>%
  summarize(n_games = n()) %>%
  ggplot(aes(x = higher_seed, y = lower_seed)) +
  geom_tile(aes(fill = n_games)) +
  geom_text(aes(label = n_games), col = "white") +
  scale_x_continuous(breaks = 1:8) +
  scale_y_continuous(breaks = 1:8) +
  coord_fixed(ratio = 1) +
  scale_fill_gradient(low = "red", high = "blue", name = "# games") +
  labs(x = "Higher seed", y = "Lower seed",
       title = "# series played between seeds") +
  theme(panel.grid.major = element_blank())
```

Note that the x-axis starts with the first seed while the y-axis starts with the second seed, since the first seed can never be the lower seed in any matchup. We also won’t have any data in the bottom-right half of the plot since we always put the higher seed on the x-axis. Notice that we always have 10 matchups between seeds 1-8, 2-7, 3-6 and 4-5: these are the first-round matchups that will always happen. Notice also that some matchups have never happened (e.g. 3-4, and anything involving two of 5, 6, 7 and 8).

Next, we plot the number of games played between each pair of seeds:

```r
results_df %>%
  group_by(higher_seed, lower_seed) %>%
  summarize(n_games = sum(n_games)) %>%
  ggplot(aes(x = higher_seed, y = lower_seed)) +
  geom_tile(aes(fill = n_games)) +
  geom_text(aes(label = n_games), col = "white") +
  scale_x_continuous(breaks = 1:8) +
  scale_y_continuous(breaks = 1:8) +
  coord_fixed(ratio = 1) +
  scale_fill_gradient(low = "red", high = "blue", name = "# games") +
  labs(x = "Higher seed", y = "Lower seed",
       title = "# games played between seeds") +
  theme(panel.grid.major = element_blank())
```

From these numbers, it is clear that we won’t be able to get good estimates of the win probability between most pairs of seeds due to small sample size, except possibly for the 1-8, 2-7, 3-6, 4-5 and 2-3 matchups.
Next, let’s look at the win percentage for the higher seed in each pair:

```r
win_pct_df <- results_df %>%
  group_by(higher_seed, lower_seed) %>%
  summarize(higher_seed_wins = sum(higher_seed_wins),
            lower_seed_wins = sum(lower_seed_wins)) %>%
  mutate(higher_seed_win_pct =
           higher_seed_wins / (higher_seed_wins + lower_seed_wins)) %>%
  select(higher_seed, lower_seed, higher_seed_win_pct)

head(win_pct_df)
# # A tibble: 6 x 3
# # Groups:   higher_seed [2]
#   higher_seed lower_seed higher_seed_win_pct
#         <dbl>      <dbl>               <dbl>
# 1           1          2               0.5
# 2           1          3               0.75
# 3           1          4               0.645
# 4           1          5               0.684
# 5           1          8               0.8
# 6           2          3               0.554

ggplot(win_pct_df, aes(x = higher_seed, y = lower_seed)) +
  geom_tile(aes(fill = higher_seed_win_pct)) +
  geom_text(aes(label = round(higher_seed_win_pct, 2))) +
  scale_x_continuous(breaks = 1:8) +
  scale_y_continuous(breaks = 1:8) +
  coord_fixed(ratio = 1) +
  scale_fill_gradient2(low = "red", high = "blue", midpoint = 0.5,
                       name = "Higher seed win %") +
  labs(x = "Higher seed", y = "Lower seed",
       title = "Higher seed win percentage") +
  theme(panel.grid.major = element_blank())
```

We shouldn’t give these numbers too much credence because of the small sample size, but a couple of things are worth noting:

1. We would expect the higher seed to have more than a 50% chance of beating a lower seed. That’s not what we see here (2-4, 3-5).
2. Fixing the higher seed, we would expect that the lower seeded the opponent, the higher the win percentage. That’s not what we see when we look down each column. For example, seed 1 beat seed 3 75% of the time, but beat seed 4 only 65% of the time.
3. A very simple win probability model would be to say that the win probability depends only on the difference between the ranks. If that were the case, then win probabilities along each northeast-southwest diagonal should be about the same. That is clearly not the case (especially 1-3, 2-4 and 3-5).
4. The win probabilities for 1-8, 2-7, 3-6 and 4-5 make sense, and that’s probably because of the larger sample size.
I would have expected a win probability diagram that looks more like this:

```r
expand.grid(higher_seed = 1:8, lower_seed = 1:8) %>%
  filter(higher_seed < lower_seed) %>%
  mutate(seed_diff = lower_seed - higher_seed,
         higher_seed_win_pct = 0.5 + seed_diff * 0.4 / 7) %>%
  ggplot(aes(x = higher_seed, y = lower_seed)) +
  geom_tile(aes(fill = higher_seed_win_pct)) +
  geom_text(aes(label = round(higher_seed_win_pct, 2))) +
  scale_x_continuous(breaks = 1:8) +
  scale_y_continuous(breaks = 1:8) +
  coord_fixed(ratio = 1) +
  scale_fill_gradient2(low = "red", high = "blue", midpoint = 0.5,
                       name = "Higher seed win %") +
  labs(x = "Higher seed", y = "Lower seed",
       title = "Expected higher seed win percentage") +
  theme(panel.grid.major = element_blank())
```

Having seen the data, what might you do next in trying to answer my original question?

# Small gotcha when using negative indexing

Negative indexing is a commonly used method in R for dropping elements from a vector, or rows/columns from a matrix, that the user does not want. For example, the code below drops the third column from the matrix M:

```r
M <- matrix(1:9, nrow = 3)
M
#      [,1] [,2] [,3]
# [1,]    1    4    7
# [2,]    2    5    8
# [3,]    3    6    9

M[, -3]
#      [,1] [,2]
# [1,]    1    4
# [2,]    2    5
# [3,]    3    6
```

Now, let’s say we want to write a function with signature dropColumns(x, exclude) that takes in a matrix x and a vector of indices exclude, and returns the matrix without the columns listed in exclude. (The result in the code snippet above could be achieved with the call dropColumns(M, 3).) A natural attempt (without checking that the indices are within bounds) would be the following:

```r
dropColumns <- function(x, exclude) {
  x[, -exclude, drop = FALSE]
}
```

Let’s try it out:

```r
dropColumns(M, c(1, 2))
#      [,1]
# [1,]    7
# [2,]    8
# [3,]    9
```

What happens if exclude is the empty vector (i.e. we don’t want to drop any columns)? The code below shows two different ways of doing this, and BOTH of them fail to give the expected result!
```r
dropColumns(M, c())
# Error in -exclude : invalid argument to unary operator

dropColumns(M, integer(0))
#
# [1,]
# [2,]
# [3,]
```

Moral of the story: when doing negative indexing with variables, you need to check whether the variable is a vector of length zero. The revised function below is one way to get around this issue; let me know if there are more elegant ways to achieve this!

```r
dropColumns <- function(x, exclude) {
  if (length(exclude) > 0) x[, -exclude, drop = FALSE]
  else x
}

dropColumns(M, c())
#      [,1] [,2] [,3]
# [1,]    1    4    7
# [2,]    2    5    8
# [3,]    3    6    9

dropColumns(M, integer(0))
#      [,1] [,2] [,3]
# [1,]    1    4    7
# [2,]    2    5    8
# [3,]    3    6    9

dropColumns(M, c(1, 2))
#      [,1]
# [1,]    7
# [2,]    8
# [3,]    9
```

# Introducing cvwrapr for your cross-validation needs

Update (2021-06-11): cvwrapr v1.0 is now on CRAN! You can view it here.

TLDR: I’ve written an R package, cvwrapr, that helps users cross-validate hyperparameters. The code base is largely extracted from the glmnet package. The R package is available for download from Github, and contains two vignettes which demonstrate how to use it. Comments, feedback and bug reports welcome!

Imagine yourself in the following story: You are working on developing a supervised learning method. You ponder each detail of the algorithm day and night, obsessing over whether it does the right thing. After weeks and months of grinding away at your laptop, you finally have a fitter function that you are satisfied with. You bring it to your collaborator, who says: “Great stuff! … Now we need a function that does cross-validation for your hyperparameter.”

A wave of fatigue washes over you: weren’t you done with the project already? You galvanize yourself, thinking, “It’s not that bad: it’s just a for loop over the folds, right?” You write the for loop and get a matrix of out-of-fold predictions. You now realize you have to write more code to compute the CV error.
“Ok, that’s simple enough for mean squared error…”, but then a rush of questions floods your mind: “What about the CV standard errors? I can never remember what to divide the standard deviation by…” “We have to compute lambda.min and lambda.1se too, right?” “What about family=’binomial’? family=’cox’?” “Misclassification error? AUC?”

As these questions (and too much wine) make you doubt your life choices, a realization pops into your head: “Wait: doesn’t cv.glmnet do all this already??”

And that is the backstory for the cvwrapr package, which I’ve written as a personal project. It essentially rips out the cross-validation (CV) portion of the glmnet package and makes it more general. The R package is available for download from Github, and contains two vignettes which demonstrate how to use it. The rest of this post is a condensed version of the vignettes.

First, let’s set up some fake data.

```r
set.seed(1)
nobs <- 100; nvars <- 10
x <- matrix(rnorm(nobs * nvars), nrow = nobs)
y <- rowSums(x[, 1:2]) + rnorm(nobs)
```

## The lasso

The lasso is a popular regression method that induces sparsity of features in the fitted model. It comes with a hyperparameter $\lambda$ that the user usually picks by cross-validation (CV). The glmnet package has a function cv.glmnet that does this:

```r
library(glmnet)
glmnet_fit <- cv.glmnet(x, y)
```

The code snippet below shows the equivalent code using the cvwrapr package. The model-fitting function is passed to kfoldcv via the train_fun argument, while the prediction function is passed to the predict_fun argument. (See the vignette for more details on the constraints on these functions.)

```r
library(cvwrapr)
cv_fit <- kfoldcv(x, y, train_fun = glmnet, predict_fun = predict)
```

## The lasso with variable filtering

In some settings, we might want to exclude variables which are too sparse. For example, we may want to remove variables that are more than 80% sparse (i.e. more than 80% of observations have a feature value of zero), then fit the lasso on the remaining variables. To do CV correctly, the filtering step should be included in the CV loop as well.

The functions in the glmnet package have an exclude argument where the user can specify which variables to exclude from model-fitting. Unfortunately, at the time of writing, the exclude argument can only be a vector of numbers representing the feature columns to be excluded. Since we could be filtering out different variables in different CV folds, using the exclude argument will not work for us.

The cvwrapr package allows us to wrap the feature filtering step into the CV loop with the following code:

```r
# filter function
filter <- function(x, ...) which(colMeans(x == 0) > 0.8)

# model training function
train_fun <- function(x, y) {
  exclude <- which(colMeans(x == 0) > 0.8)
  if (length(exclude) == 0) {
    model <- glmnet(x, y)
  } else {
    model <- glmnet(x[, -exclude, drop = FALSE], y)
  }
  return(list(lambda = model$lambda,
              exclude = exclude,
              model = model))
}

# prediction function
predict_fun <- function(object, newx, s) {
  if (length(object$exclude) == 0) {
    predict(object$model, newx = newx, s = s)
  } else {
    predict(object$model, newx = newx[, -object$exclude, drop = FALSE],
            s = s)
  }
}

set.seed(1)
cv_fit <- kfoldcv(x, y, train_fun = train_fun,
                  predict_fun = predict_fun)
```
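To make concrete what kfoldcv abstracts away, here is a minimal hand-rolled sketch of the fold loop (my own illustration, using fresh toy data, squared-error loss and lm as the fitter). It also shows one common convention for the CV standard error, namely the standard deviation of the fold-wise errors divided by sqrt(K):

```r
set.seed(1)
K <- 5
xx <- rnorm(100)
yy <- 2 * xx + rnorm(100)

# Assign each observation to one of K folds
foldid <- sample(rep(1:K, length.out = 100))

# Fold-wise mean squared errors from out-of-fold predictions
fold_errs <- sapply(1:K, function(k) {
  fit <- lm(yy ~ xx, subset = foldid != k)
  pred <- predict(fit, data.frame(xx = xx[foldid == k]))
  mean((yy[foldid == k] - pred)^2)
})

cv_est <- mean(fold_errs)          # CV estimate of test error
cv_se  <- sd(fold_errs) / sqrt(K)  # one convention for the CV standard error
```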


Next, let’s look at a gradient boosting example. In gradient boosting, a common hyperparameter to cross-validate is the number of trees in the gradient boosting ensemble. The code below shows how one can achieve that with the cvwrapr and gbm packages:

```r
library(gbm)

# lambda represents the number of trees
train_fun <- function(x, y, lambda) {
  df <- data.frame(x, y)
  model <- gbm::gbm(y ~ ., data = df,
                    n.trees = max(lambda),
                    distribution = "gaussian")
  return(list(lambda = lambda, model = model))
}

predict_fun <- function(object, newx, s) {
  newdf <- data.frame(newx)
  predict(object$model, newdata = newdf, n.trees = s)
}

set.seed(3)
lambda <- 1:100
cv_fit <- kfoldcv(x, y, lambda = lambda, train_fun = train_fun,
                  predict_fun = predict_fun)
```

kfoldcv returns an object of class “cvobj”, which has a plot method:

```r
plot(cv_fit, log.lambda = FALSE, xlab = "No. of trees",
     main = "CV MSE vs. no. of trees")
```

The vertical line on the right corresponds to the hyperparameter value that gives the smallest CV error, while the vertical line on the left corresponds to the value whose CV error is within one standard error of the minimum.

## Computing different error metrics

Sometimes you may only have access to the out-of-fold predictions; in these cases you can use cvwrapr’s computeError function to compute the CV error for you (a non-trivial task!). The code below does CV for the lasso with the CV error metric being the default mean squared error:

```r
set.seed(1)
cv_fit <- kfoldcv(x, y, train_fun = glmnet, predict_fun = predict,
                  keep = TRUE)
plot(cv_fit)
```

What if we wanted to pick the hyperparameter according to mean absolute error instead? One way would be to call kfoldcv again with the additional argument type.measure = "mae". This would involve doing all the model-fitting again. Since we specified keep = TRUE in the call above, cv_fit contains the out-of-fold predictions: we can do CV for mean absolute error with these predictions without refitting the models. The call below achieves that (see the vignette for details on the additional arguments):

```r
mae_err <- computeError(cv_fit$fit.preval, y,
                        cv_fit$lambda, cv_fit$foldid,
                        type.measure = "mae", family = "gaussian")
plot(mae_err)
```


# The effect of rescaling all observation weights in the Cox model

TLDR: This post explains what happens to the Cox model when we scale observation weights for all observations (assuming the Breslow approximation for ties).

Imagine that we are in the survival analysis setting with right-censoring, and that we are considering the Cox proportional hazards model. Let:

• Data come in the form $(x_1, y_1, \delta_1), \dots, (x_n, y_n, \delta_n)$, where $x_j \in \mathbb{R}^p$, $y_j \in (0, \infty)$, and $\delta_j = 1$ if observation $j$ failed while $\delta_j = 0$ if it was censored.
• $t_1 < t_2 < \dots < t_m$ denote the unique failure times.
• $w_j$ be the observation weight for observation $j$.
• $\beta \in \mathbb{R}^p$ denote the coefficient vector for the Cox model.
• $\eta_j = x_j^T \beta$.
• $D_i$ denote the set of indices of failure at time $t_i$.
• $d_i = \sum_{j \in D_i} w_j$.
• $R_i$ denote the risk set at time $t_i$, i.e. the indices $j$ for which $y_j \geq t_i$.

Assuming the Breslow approximation for ties (read more about ties in Reference 1), the log partial likelihood for the model is

\begin{aligned} \ell(\beta) = \sum_{i=1}^m \left[ \left( \sum_{j \in D_i} w_j \eta_j \right) - d_i \log \left( \sum_{j \in R_i} w_j e^{\eta_j} \right) \right]. \end{aligned}

The fitted model finds $\beta$ which maximizes the log partial likelihood; equivalently, it minimizes the deviance

$D(\beta) = 2(\ell_{sat} - \ell(\beta)),$

where $\ell_{sat}$ is the log partial likelihood for the saturated model, defined to be

\begin{aligned} \ell_{sat} = - \sum_{i=1}^m d_i \log d_i. \end{aligned}

(See this previous post for more details on deviance and the saturated model. The formula above is from Section 2.6 of Reference 2.)

What happens to the model when we include each observation twice in the dataset? More generally, what happens if we scale all the observation weights by a (positive) factor of $k$, i.e. $w_j' = kw_j$ for all $j$? Three things happen:

1. The log partial likelihood does not simply scale by $k$.
2. The model coefficients remain the same.
3. The deviance scales by $k$.

These observations are a result of direct algebraic manipulation. When we scale the observation weights, we now have $w_j' = kw_j$ and $d_i' = kd_i$. Plugging these into the expression for the log partial likelihood gives

\begin{aligned} \ell' = k \ell - k \sum_{i=1}^m d_i \log k, \end{aligned}

where $\ell'$ and $\ell$ are the new and original log partial likelihoods respectively. Thus, the original log partial likelihood is scaled by $k$, but then has a (non-zero) term subtracted from it.

To see observation 2, notice that the new log partial likelihood is simply a scaled version of the original plus a constant, and so maximizing the new and original log partial likelihoods are equivalent.

Finally, algebraic manipulation shows that the new log partial likelihood for the saturated model is

\begin{aligned} \ell_{sat}' = k \ell_{sat} - k \sum_{i=1}^m d_i \log k. \end{aligned}

The second term above is exactly the same as the second term in $\ell'$. Hence, when computing the deviance, given by $2 (\ell_{sat} - \ell)$, that term cancels and we conclude that the new deviance is simply the original deviance scaled by $k$.
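Observation 2 is easy to verify numerically. Here is a sketch using the survival package (the data-generating code is purely illustrative), fitting the same Cox model with weights $w$ and $kw$:

```r
library(survival)

set.seed(1)
n <- 50
x <- rnorm(n)                        # single covariate
y <- rexp(n, rate = exp(0.5 * x))    # survival times
delta <- rbinom(n, 1, 0.7)           # 1 = failure, 0 = censored
w <- runif(n, 0.5, 2)                # observation weights
k <- 3                               # positive scaling factor

fit1 <- coxph(Surv(y, delta) ~ x, weights = w,     ties = "breslow")
fit2 <- coxph(Surv(y, delta) ~ x, weights = k * w, ties = "breslow")

# The fitted coefficients should agree (up to optimizer tolerance)
max(abs(coef(fit1) - coef(fit2)))
```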

References:

1. Breheny, P. Tied survival times; estimation of survival probabilities.
2. Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2011). Regularization paths for Cox’s proportional hazards model via coordinate descent.

# What is steepest descent?

Introduction

Assume we have a convex function $f: \mathbb{R}^d \to \mathbb{R}$ and we would like to find its minimum. A commonly used class of algorithms, called descent algorithms, function in the following manner:

1. Find a descent direction $\Delta x$.
2. Update $x = x + t \Delta x$, where $t$ is chosen in some manner (e.g. deterministically, backtracking line search, exact line search).
3. Repeat steps 1 and 2 until convergence.
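As a concrete sketch, here is a minimal R implementation of this loop, using the negative gradient as the descent direction and backtracking line search for step 2 (the function names and the quadratic test problem are just for illustration):

```r
descent <- function(f, grad_f, x, alpha = 0.3, beta = 0.8,
                    tol = 1e-8, max_iter = 1000) {
  for (iter in seq_len(max_iter)) {
    g <- grad_f(x)
    if (sqrt(sum(g^2)) < tol) break      # step 3: stop at convergence
    v <- -g                              # step 1: descent direction
    t <- 1                               # step 2: backtracking line search
    while (f(x + t * v) > f(x) + alpha * t * sum(g * v)) t <- beta * t
    x <- x + t * v                       # step 2: update
  }
  x
}

# Convex quadratic f(x) = x' A x / 2 - b' x, minimized at solve(A, b)
A <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
b <- c(1, -1)
f <- function(x) 0.5 * sum(x * (A %*% x)) - sum(b * x)
grad_f <- function(x) as.vector(A %*% x) - b
xstar <- descent(f, grad_f, c(5, 5))     # should be close to solve(A, b)
```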

Steepest descent

There are several descent directions to choose from: which should we use? If we look at the first order Taylor expansion of $f(x+v)$ around $x$:

\begin{aligned} f(x + v) \approx f(x) + \nabla f(x)^\top v, \end{aligned}

the term $\nabla f(x)^\top v$ is known as the directional derivative of $f$ at $x$ in the direction $v$. It makes sense to find a $v$ whose directional derivative is as negative as possible, i.e. as steep as possible.

The problem, of course, is that once we find some $v$ such that $\nabla f(x)^\top v$ is negative, we can scale $v$ up to make the directional derivative as negative as we like. This means that without normalization, we cannot choose among the directions satisfying $\nabla f(x)^\top v < 0$.

To that end, let $\| \cdot \|$ be some norm on $\mathbb{R}^d$. (What this norm is will matter later.) We define a normalized steepest descent direction with respect to $\| \cdot \|$ as

$\Delta x_{nsd} = \text{argmin} \; \left\{ \nabla f(x)^\top v : \| v \| \leq 1 \right\}.$

(The subscript “nsd” is short for “normalized steepest descent”.) This direction need not be unique. We can then descend in the direction of $\Delta x_{nsd}$.

It is also common to look at the unnormalized steepest descent step

$\Delta x_{sd} = \| \nabla f(x) \|_* \Delta x_{nsd},$

where $\| \cdot \|_*$ is the dual norm of $\| \cdot \|$. (For exact line search, it makes no difference whether we use $\Delta x_{nsd}$ or $\Delta x_{sd}$ since the scale factor does not matter.)
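As a quick sanity check on these definitions, take $\| \cdot \|$ to be the Euclidean norm. By the Cauchy-Schwarz inequality, $\nabla f(x)^\top v \geq -\| \nabla f(x) \|_2 \| v \|_2$, with equality when $v$ points opposite the gradient, so

\begin{aligned} \Delta x_{nsd} = -\frac{\nabla f(x)}{\| \nabla f(x) \|_2}, \qquad \Delta x_{sd} = \| \nabla f(x) \|_2 \, \Delta x_{nsd} = -\nabla f(x), \end{aligned}

where we used the fact that the Euclidean norm is its own dual. The unnormalized step is exactly the usual gradient descent step.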

Steepest descent as a class of algorithms

Notice that the steepest descent direction $\Delta x_{sd}$ depends on the norm that is chosen in its definition. Hence, steepest descent is really a generic name: specific algorithms arise when we choose $\| \cdot \|$ to be a particular norm.

When I was looking around the internet for information on steepest descent, I found some confusion between steepest descent and gradient descent, with some webpages saying that the two are the same. That is not quite correct: gradient descent is the special case of steepest descent where the chosen norm is the Euclidean norm. But in many contexts we only ever consider the Euclidean norm, which is probably why the two names get conflated.

Why use different norms?

This raises the question: why consider any norm in the definition of the steepest descent direction other than the Euclidean norm? I am not too familiar with material in this area, but it seems that the choice of norm can greatly affect the convergence rate of the algorithm. See Section 9.4.4 of Reference 1 for some examples. This argument might become clearer as well when we see that several popular variations of gradient descent are actually special cases of steepest descent.

Special cases of steepest descent

A few popular optimization algorithms can be viewed as steepest descent for a particular choice of norm. Again, see Section 9.4 of Reference 1 for details.

• Euclidean norm: If $\| \cdot \|$ is the Euclidean norm, then steepest descent is gradient descent.
• Quadratic norms: For any symmetric (strictly) positive definite matrix $P \in \mathbb{R}^{d \times d}$, define the norm $\| x \|_P = (x^\top P x)^{1/2} = \| P^{1/2} x \|_2$. Steepest descent under this norm amounts to doing gradient descent after doing a change of coordinates. Concretely, our variable is now $\bar{x} = P^{1/2} x$, and we are minimizing $\bar{f}$ defined by $\bar{f}(\bar{x}) = f(P^{-1/2}\bar{x}) = f(x)$. Steepest descent in the original space corresponds to gradient descent of $\bar{f}$ w.r.t. $\bar{x}$.
In particular, if $P = \nabla^2 f(x)$, then steepest descent becomes Newton’s method.
• $\ell_1$ norm: If $\| \cdot \| = \| \cdot \|_1$, steepest descent becomes coordinate descent.
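To make these special cases concrete, here is a small sketch (base R, with illustrative variable names) computing the normalized steepest descent direction for a single gradient under each of these norms:

```r
g <- c(3, -1)                 # suppose the gradient at x is g

# Euclidean norm: move directly opposite the gradient
nsd_l2 <- -g / sqrt(sum(g^2))

# l1 norm: move along the single coordinate with the largest |partial
# derivative| (this is why l1 steepest descent is coordinate descent)
i <- which.max(abs(g))
nsd_l1 <- numeric(length(g))
nsd_l1[i] <- -sign(g[i])

# Quadratic norm ||v||_P: direction proportional to -P^{-1} g,
# normalized so that its P-norm equals 1
P <- matrix(c(2, 0, 0, 8), 2, 2)
v <- -solve(P, g)
nsd_P <- v / sqrt(sum(v * (P %*% v)))
```

All three directions have unit norm (in their respective norms) and a negative directional derivative, but they point in different directions.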

References:

1. Boyd, S., and Vandenberghe, L. Convex Optimization (Section 9.4: Steepest descent method).

# What is Andrew’s Sine?

Notational set-up

Let’s say we have data $x_1, \dots, x_n \stackrel{i.i.d.}{\sim} P_\theta$ for some probability distribution $P_\theta$, and we want to use the data to estimate the parameter $\theta$. If our estimate $\hat\theta$ is the solution to a minimization problem of the form

\begin{aligned} \text{minimize}_\theta \quad \sum_{i=1}^n \rho (x_i ; \theta) \end{aligned}

for some function $\rho$, then $\hat\theta$ is called an M-estimator. In maximum likelihood estimation, we choose $\rho (x ; \theta) = - \log f(x ; \theta)$, where $f (\cdot ; \theta)$ is the probability density associated with $P_\theta$.

Define

$\psi (x ; \theta) = \dfrac{\partial \rho(x ; \theta)}{\partial \theta}.$

Sometimes, we prefer to think of $\hat\theta$ as the solution to the implicit equation

\begin{aligned} \sum_{i=1}^n \psi(x_i ; \theta) = 0. \end{aligned}

In the field of robust statistics, we want to choose $\rho$ and/or $\psi$ such that the solution to the problem above has some robustness properties. These are the functions being referred to when you come across the terms “rho function” or “psi function” in this field.
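For a familiar (and famously non-robust) example, taking $\rho(x ; \theta) = (x - \theta)^2 / 2$ (the Gaussian negative log-likelihood up to constants) gives $\psi(x ; \theta) = \theta - x$, and solving

\begin{aligned} \sum_{i=1}^n (\theta - x_i) = 0 \end{aligned}

yields the sample mean $\hat\theta = \bar{x}$. Robust rho functions grow more slowly than this quadratic in the tails, so extreme observations exert less influence on $\hat\theta$.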

Andrew’s Sine

In this blog we’ve already come across one rho/psi function used in robust statistics: the Tukey loss function. Andrew’s Sine is another psi function that appears in robust statistics. It is defined by

\begin{aligned} \psi(x) = \begin{cases} \sin (x / c) &\text{if } |x| \leq c\pi, \\ 0 &\text{otherwise}, \end{cases} \end{aligned}

where $c$ is a user-defined parameter. The rho function implied by this choice is

\begin{aligned} \rho(x) = \begin{cases} c[1 - \cos (x / c)] &\text{if } |x| \leq c\pi, \\ 2c &\text{otherwise}. \end{cases} \end{aligned}

Here are plots of both the rho and psi functions for a few choices of $c$:
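These are easy to reproduce in R; a sketch (the function names are mine, and the default value of $c$ is just one of the choices discussed below):

```r
# Andrew's Sine psi and rho functions with tuning parameter c
psi_andrews <- function(x, c = 1.339) {
  ifelse(abs(x) <= c * pi, sin(x / c), 0)
}
rho_andrews <- function(x, c = 1.339) {
  ifelse(abs(x) <= c * pi, c * (1 - cos(x / c)), 2 * c)
}

# e.g. overlaying a few choices of c:
# curve(rho_andrews(x, 1.339), -8, 8)
# curve(rho_andrews(x, 1.8), -8, 8, add = TRUE, lty = 2)
```

Note that both pieces of $\rho$ meet at $|x| = c\pi$ (where $\rho = 2c$), and $\psi$ redescends to exactly zero there.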

Some history

Andrew’s Sine is named after D. F. Andrews. The first mention of it I could find was in Andrews (1974) (Reference 3), but it appears to have been proposed first by Andrews et al. 1972 (Reference 4), for which I can’t find a copy.

Choosing $c$

Reference 3 recommends using $c = 1.5$ or $c = 1.8$ without giving an explanation. Reference 5 suggests $c = 1.339$, noting that with this value of $c$, the corresponding M-estimator gives 95% efficiency at the normal distribution.
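To see the estimator in action, here is a sketch of robust location estimation with Andrew's Sine via iteratively reweighted averaging (everything here, including the function name, is illustrative and not from any package). The IRLS weight for residual $r$ is $\psi(r)/r$, which vanishes for $|r| > c\pi$:

```r
andrews_location <- function(x, c = 1.339, max_iter = 100, tol = 1e-8) {
  mu <- median(x)                    # a robust starting point matters
  for (iter in seq_len(max_iter)) { # for a redescending psi function
    r <- x - mu
    u <- r / c
    w <- ifelse(abs(r) <= c * pi,
                ifelse(r == 0, 1, sin(u) / u),  # limit of sin(u)/u at 0 is 1
                0)                              # gross outliers get weight 0
    mu_new <- sum(w * x) / sum(w)
    if (abs(mu_new - mu) < tol) break
    mu <- mu_new
  }
  mu
}

set.seed(42)
x <- c(rnorm(100), 50, 60)    # standard normal data plus two gross outliers
andrews_location(x)           # close to 0, unlike mean(x), which is
                              # pulled toward the outliers
```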

References:

1. Wolfram MathWorld. Andrew’s Sine.
2. Penn State Department of Statistics, STAT 501. 13.3 – Robust Regression Methods.
3. Andrews, D. F. (1974). A Robust Method for Multiple Linear Regression.
4. Andrews, D. F., et al. (1972). Robust Estimates of Location: Survey and Advances.
5. Young, D. S. (2017). Handbook of Regression Methods.