# Slight inconsistency between forcats’ fct_lump_min and fct_lump_prop

I recently noticed a slight inconsistency between the forcats package’s fct_lump_min and fct_lump_prop functions. (I’m working with v0.5.1, which is the latest version at the time of writing.) These functions lump levels that meet a certain criteria into an “other” level. According to the documentation,

• fct_lump_min “lumps levels that appear fewer than min times”, and
• fct_lump_prop “lumps levels that appear fewer than prop * n times”, where n is the length of the factor variable.

Let’s try this out in an example:

x <- factor(c(rep(1, 6), rep(2, 3), rep(3, 1)))
x
#  [1] 1 1 1 1 1 1 2 2 2 3
# Levels: 1 2 3

fct_lump_min(x, min = 3)
#  [1] 1     1     1     1     1     1     2     2     2     Other
# Levels: 1 2 Other


The levels 1 and 2 appear at least 3 times, and so they are not converted to the “Other” level. The level 3 appears just once, and so is converted.

What do you think this line of code returns?

fct_lump_prop(x, prop = 0.3)


Since prop * n = 0.3 * 10 = 3, the documentation suggests that only the level 3 should be converted to “Other”. However, that is NOT the case:

fct_lump_prop(x, prop = 0.3)
#  [1] 1     1     1     1     1     1     Other Other Other Other
# Levels: 1 Other


The level 2 appears exactly 3 times, and is converted into “Other”, contrary to what the documentation says.

Digging, into the source code of fct_lump_prop, you will find this line of code:

new_levels <- ifelse(prop_n > prop, levels(f), other_level)


prop_n is the proportion of times each factor level appears. If the documentation is correct, the greater than sign should really be a greater than or equal sign.

So what’s the right fix here? One way is to fix the documentation to say that fct_lump_prop “lumps levels that appear at most prop * n times”, but that breaks consistency with fct_lump_min. Another is to make the fix in the code as suggested above, but that will change code behavior. Either is probably fine, and it’s up to the package owner to decide which makes more sense.

# A quirk when using data.table?

I recently came across this quirk in using data.table that I don’t really have a clean solution for. I outline the issue below as well as my current way around it. Appreciate any better solutions!

The problem surfaces quite generally, but I’ll illustrate it by trying to achieve the following task: write a function that takes a data table and a column name, and returns the data table with the data in that column scrambled.

The function below was my first attempt:

library(data.table)

scramble_col <- function(input_dt, colname) {
input_dt[[colname]] <- sample(input_dt[[colname]])
input_dt
}


The code snippet below shows that it seems to work:

input_dt <- data.table(x = 1:5)
set.seed(1)
input_dt <- scramble_col(input_dt, "x")
input_dt
#    x
# 1: 1
# 2: 4
# 3: 3
# 4: 5
# 5: 2


However, when I tried to add a new column that is a replica of the x column, I get a strange warning!

input_dt[, y := x]  # gives warning


There are few webpages out there that try to explain what’s going on with this warning, but I haven’t had time to fully digest what is going on. My high-level takeaway is that the assignment in the line input_dt[[colname]] <- sample(input_dt[[colname]]) is problematic.

This was my second attempt:

scramble_col <- function(input_dt, colname) {
input_dt[, c(colname) := sample(get(colname))]
}


This version works well: it doesn’t throw the warning when I added a second column.

input_dt <- data.table(x = 1:5)
set.seed(1)
input_dt <- scramble_col(input_dt, "x")
input_dt
#    x
# 1: 1
# 2: 4
# 3: 3
# 4: 5
# 5: 2
input_dt[, y := x]
input_dt
#    x y
# 1: 1 1
# 2: 4 4
# 3: 3 3
# 4: 5 5
# 5: 2 2


However, the function does not work for a particular input: when the column name is colname! When I run the following code, I get an error message.

input_dt <- data.table(colname = 1:5)
set.seed(1)
input_dt <- scramble_col(input_dt, "colname") # error


The function below was my workaround and I think it works for all inputs, but it seems a bit inelegant:

scramble_col <- function(input_dt, colname) {
new_col <- sample(input_dt[[colname]])
input_dt[, c(colname) := new_col]
}


Would love to hear if anyone has a better solution for this task, or if you have some insight into what is going on with the warning/error above.

# A short note on the startsWith function

The startsWith function comes with base R, and determines whether entries of an input start with a given prefix. (The endsWith function does the same thing but for suffixes.) The following code checks if each of “ant”, “banana” and “balloon” starts with “a”:

startsWith(c("ant", "banana", "balloon"), "a")
# [1]  TRUE FALSE FALSE


The second argument (the prefix to check) can also be a vector. The code below checks if “ant” starts with “a” and if “ant” starts with “b”:

startsWith("ant", c("a", "b"))
# [1]  TRUE FALSE


Where things might get a bit unintuitive is when both arguments are vectors of length >1. Why do you think the line of code below returned the result it did?

startsWith(c("ant", "banana", "balloon"), c("a", "b"))
# [1] TRUE TRUE FALSE


This makes sense when we look at the documentation for startsWith‘s return value:

A logical vector, of “common length” of x and prefix (or suffix), i.e., of the longer of the two lengths unless one of them is zero when the result is also of zero length. A shorter input is recycled to the output length.

startsWith(x, prefix) checks if x[i] starts with prefix[i] for each i. In our line of code above, the function checks if “ant” starts with “a” and “banana” starts with “b”. Since x had length greater than prefix, we “recycle” prefix and check if “balloon” starts with “a”.

If you want to check if each x[i] starts with any prefix[j] (with j possibly being different from i), we could do the following:

x <- c("ant", "banana", "balloon")
prefix <- c("a", "b")
has_prefix <- sapply(prefix, function(p) startsWith(x, p))
has_prefix
#          a     b
# [1,]  TRUE FALSE
# [2,] FALSE  TRUE
# [3,] FALSE  TRUE

apply(has_prefix, 1, any)
# [1] TRUE TRUE TRUE


# What is a horizon chart?

A horizon chart is a compact version of an area chart. In the words of Jonathan Schwabish (Reference 1, page 164), it is

… an area chart that is sliced into equal horizontal intervals and collapsed down into single bands, which makes the graph more compact and similar to a heatmap…

What are horizon charts good for? Here is Schwabish again:

Horizon charts are especially useful when you are visualizing time series data that are so close in value so that the data marks in, for example, a line chart, would lie atop each other. … the purpose of the horizon chart is not necessarily to enable readers to pick out specific values, but instead to easily spot general trends and identify extreme values.

Horizon charts can be a bit difficult to interpret at first, partly because the way to create them is a bit involved. Let’s illustrate how horizon charts are created with an example using the EuStockMarkets dataset from the datasets R package. (Most of this code is derived from the documentation for the horizonplot function in the latticeExtra package.)
The dataset contains the daily closing prices of 4 European stock indices from 1991-1998.

library(latticeExtra)

str(EuStockMarkets)
# Time-Series [1:1860, 1:4] from 1991 to 1999: 1629 1614 1607 1621 1618 ...
# - attr(*, "dimnames")=List of 2
#  ..$: NULL # ..$ : chr [1:4] "DAX" "SMI" "CAC" "FTSE"


Let’s make a line plot of just Germany’s DAX:

# look at just DAX
dax_ts <- EuStockMarkets[, 1]

# raw plot
plot(dax_ts, ylab = "DAX", main = "Plot of DAX (1991-1998)")


First, choose a baseline level $B$ and some bandwidth $\Delta$. Color each horizontal band between $B + i\Delta$ and $B + (i+1)\Delta$ with a different color ($i$ is an integer, usually between -3 and 2 inclusive). In the code below, we choose a baseline of 4,000 and a bandwidth of 1,000. Bands above 4,000 are in increasingly darker shades of blue, while bands below 4,000 are in increasingly darker shades of red.

xyplot(dax_ts,
panel = function(x, y, ...) {
col <-
c("#B41414","#E03231","#F7A99C","#9FC8DC","#468CC8","#0165B3")
for (i in c(-3:-1, 2:0)) {
if (i >= 0)
yi <- pmax(4000, pmin(y, 4000 + 1000 * (i+1)))
if (i < 0)
yi <- pmin(4000, pmax(y, 4000 + 1000 * i))
panel.xyarea(x, yi, origin = 4000,
col = col[i+4], border = NA)
}
panel.lines(x, y, col = "black")
panel.abline(h = 4000, lty = 2)
},
ylab = "DAX",
main = "Plot of DAX (1991-1998)")


Second, flip all the bands below the baseline along the baseline, then stack all the bands on top of each other, with the darker colors right on top… and that’s a horizon chart!

horizonplot(dax_ts, colorkey = TRUE,
origin = 4000, horizonscale = 1000,
col.regions = c("#B41414","#E03231","#F7A99C","#9FC8DC",
"#468CC8","#0165B3"),
main = "Plot of DAX: 1991-1998")


The main advantage of the horizon chart is its (vertical) compactness. The horizon chart above has dimensions 600 x 200, while the earlier time series plots had dimensions 600 x 400. For comparison, here is the line plot with dimensions 600 x 200:

The real power of horizon charts comes when we want to compare multiple time series against each other. Here is the time series plot for all 4 EU stock indices (dimensions 600 x 400):

plot(EuStockMarkets, main = "Plot of EU Stock Indices (1991-1998)")


Here’s a horizon chart for the same data with the same dimensions (600 x 400). Notice how it’s easier to see the tiny fluctuations in each time series and to compare across time series.

## these layers show scale and origin in each panel
infolayers <-
layer(panel.scaleArrow(x = 0.99, digits = 1, col = "grey",
srt = 90, cex = 0.7)) +
layer(lim <- current.panel.limits(),
panel.text(lim$x[1], lim$y[1], round(lim$y[1],1), font = 2, cex = 0.7, adj = c(-0.5,-0.5), col = "#9FC8DC")) ## horizon chart for EU stock indices horizonplot(EuStockMarkets, colorkey = TRUE, col.regions = c("#B41414","#E03231","#F7A99C","#9FC8DC", "#468CC8","#0165B3"), origin = 4000, horizonscale = 1000, main = "Plot of EU Stock Indices (1991-1998)") + infolayers  In the horizon chart above, we set the same baseline (4,000) and same bandwidth (1,000) across the 4 stock indices. This doesn’t make a whole lot of sense because the stock indices are on different scales. The code below generates a horizon chart with each index having its own baseline and bandwidth: horizonplot(EuStockMarkets, colorkey = TRUE, col.regions = c("#B41414","#E03231","#F7A99C","#9FC8DC", "#468CC8","#0165B3"), main = "Plot of EU Stock Indices (1991-1998)") + infolayers  Let’s finish off this post with an example of a horizon chart from Reference 1. We want to compare public health spending (as a percentage of GDP) over time for 10 countries. Here’s what a time series plot (line chart) looks like: And here is the horizon chart for the same data: I think it’s much easier to make comparisons and see trends in the horizon chart as compared to the line chart. References: 1. Schwabish, J. (2021). Better Data Visualizations. # Something to note when using the merge function in R Base R has a merge function which does join operations on data frames. As the documentation says, the function [merges] two data frames by common columns or row names, or do other versions of database join operations. One thing that I realized which may not be obvious is that merge can have somewhat unexpected behavior regarding the ordering of rows in the result. Let’s see an example with the mtcars dataset: data(mtcars) mtcars$ID <- 1:nrow(mtcars)
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb ID
# Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4  1
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4  2
# Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1  3
# Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  4
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  5
# Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  6


Let’s say we want to add a row cyl_word which is the value in the cyl column in words (e.g. 6 -> "Six"). The code below creates a reference table and joins the two tables to get the desired result:

cyl_df <- data.frame(cyl = c(4, 6, 8),
cyl_word = c("four", "six", "eight"))

joined_df <- merge(mtcars, cyl_df, by = "cyl")
#   cyl  mpg  disp hp drat    wt  qsec vs am gear carb ID cyl_word
# 1   4 22.8 140.8 95 3.92 3.150 22.90  1  0    4    2  9     four
# 2   4 22.8 108.0 93 3.85 2.320 18.61  1  1    4    1  3     four
# 3   4 24.4 146.7 62 3.69 3.190 20.00  1  0    4    2  8     four
# 4   4 21.5 120.1 97 3.70 2.465 20.01  1  0    3    1 21     four
# 5   4 30.4  75.7 52 4.93 1.615 18.52  1  1    4    2 19     four
# 6   4 33.9  71.1 65 4.22 1.835 19.90  1  1    4    1 20     four


Look at the ID column: the rows are not returned in the same order! Reading the documentation tells us that merge has an argument sort which has default value TRUE, meaning that the results are sorted by the values in the columns that we merged on (cyl in this case). That is why all the cyl == 4 rows appear first in the return value.

(Notice also that the row names have disappeared: this may not be something you want!)

One might think that if we set sort = FALSE, the output would have the same row ordering as the input. Unexpectedly (at least to me), this is not the case:

joined_df <- merge(mtcars, cyl_df, by = "cyl", sort = FALSE)
#   cyl  mpg  disp  hp drat    wt  qsec vs am gear carb ID cyl_word
# 1   6 21.0 160.0 110 3.90 2.620 16.46  0  1    4    4  1      six
# 2   6 21.0 160.0 110 3.90 2.875 17.02  0  1    4    4  2      six
# 3   6 17.8 167.6 123 3.92 3.440 18.90  1  0    4    4 11      six
# 4   6 21.4 258.0 110 3.08 3.215 19.44  1  0    3    1  4      six
# 5   6 18.1 225.0 105 2.76 3.460 20.22  1  0    3    1  6      six
# 6   6 19.2 167.6 123 3.92 3.440 18.30  1  0    4    4 10      six


The documentation says as much: under the “Value” section, it writes that (emphasis mine)

The rows are by default lexicographically sorted on the common columns, but for sort = FALSE are in an unspecified order.

How can we get the output back in the original row order? One way is to add an ID column like we did above, then sort the result by that column:

joined_df <- joined_df[order(joined_df$ID), ] head(joined_df) # cyl mpg disp hp drat wt qsec vs am gear carb ID cyl_word # 1 6 21.0 160 110 3.90 2.620 16.46 0 1 4 4 1 six # 2 6 21.0 160 110 3.90 2.875 17.02 0 1 4 4 2 six # 9 4 22.8 108 93 3.85 2.320 18.61 1 1 4 1 3 four # 4 6 21.4 258 110 3.08 3.215 19.44 1 0 3 1 4 six # 19 8 18.7 360 175 3.15 3.440 17.02 0 0 3 2 5 eight # 5 6 18.1 225 105 2.76 3.460 20.22 1 0 3 1 6 six testthat::expect_equal(joined_df$ID, 1:nrow(joined_df))  # test passes


Alternatively, use functions in the dplyr package:

library(dplyr)
joined_df <- left_join(mtcars, cyl_df, by = "cyl")
#    mpg cyl disp  hp drat    wt  qsec vs am gear carb ID cyl_word
# 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4  1      six
# 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4  2      six
# 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1  3     four
# 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  4      six
# 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  5    eight
# 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  6      six

testthat::expect_equal(joined_df$ID, 1:nrow(joined_df)) # test passes  # Changing the column names for model.matrix output In this previous post, I showed how you can include a dummy variable for the baseline level in the output of the model.matrix function. In this post, I show how you can make changes to the column names of model.matrix‘s output to make downstream parsing a little easier. Let’s use the iris dataset again: data(iris) str(iris) # 'data.frame': 150 obs. of 5 variables: #$ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... #$ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... #$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

x <- model.matrix(Sepal.Length ~ Species, iris)
#   (Intercept) Speciesversicolor Speciesvirginica
# 1           1                 0                0
# 2           1                 0                0
# 3           1                 0                0
# 4           1                 0                0
# 5           1                 0                0
# 6           1                 0                0


Notice the default behavior for the column names of the returned matrix: for a given level, the column name is the name of the variable concatenated with the name of the level, with no spaces in between. For example, the last column in the matrix above represents the virginica level of the Species variable.

Because the concatenation happens with no characters in between the variable and level names, it can be hard to programmatically separate the two parts in the returned column names. We can make our life easier by having model.matrix return the variable and level names with some special character, e.g. ., in between.

We can achieve this by modifying the contrasts.arg function argument. In our example, the default value for this argument is list(Species = contrasts(iris$Species)). The code below shows what contrasts(iris$Species) is:

contrasts(iris$Species) # versicolor virginica # setosa 0 0 # versicolor 1 0 # virginica 0 1  We can modify the column names of contrasts(iris$Species) to achieve the desired effect:

speciesContrast <- contrasts(iris$Species) colnames(speciesContrast) <- paste0(".", colnames(speciesContrast)) x <- model.matrix( Sepal.Length ~ Species, iris, contrasts.arg = list(Species = speciesContrast) ) head(x) # (Intercept) Species.versicolor Species.virginica # 1 1 0 0 # 2 1 0 0 # 3 1 0 0 # 4 1 0 0 # 5 1 0 0 # 6 1 0 0  We can do this programmatically for all factor variables in a data frame too. Here is our example data frame: df <- data.frame(x = factor(rep(c("a", "b", "c"), times = 3)), y = factor(rep(c("d", "e", "f"), times = 3)), z = 1:9) str(df) # 'data.frame': 9 obs. of 3 variables: #$ x: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3
#  $y: Factor w/ 3 levels "d","e","f": 1 2 3 1 2 3 1 2 3 #$ z: int  1 2 3 4 5 6 7 8 9

x <- model.matrix(~ ., data = df)
#   (Intercept) xb xc ye yf z
# 1           1  0  0  0  0 1
# 2           1  1  0  1  0 2
# 3           1  0  1  0  1 3
# 4           1  0  0  0  0 4
# 5           1  1  0  1  0 5
# 6           1  0  1  0  1 6


Here is the code that adds --- between the variable and level names:

ChangeColnames <- function(x) {
colnames(x) <- paste0("---", colnames(x))
x
}

x <- model.matrix(
~ .,
data = df,
contrasts.arg = lapply(df[, sapply(df, is.factor), drop = FALSE],
function(x) ChangeColnames(contrasts(x)))
)
#   (Intercept) x---b x---c y---e y---f z
# 1           1     0     0     0     0 1
# 2           1     1     0     1     0 2
# 3           1     0     1     0     1 3
# 4           1     0     0     0     0 4
# 5           1     1     0     1     0 5
# 6           1     0     1     0     1 6


# How to include all levels of a factor variable in a model matrix in R

In R, the model.matrix function is used to create the design matrix for regression. In particular, it is used to expand factor variables into dummy variables (also known as “one-hot encoding“).

Let’s see this in action on the iris dataset:

data(iris)
str(iris)
# 'data.frame':	150 obs. of  5 variables:
#  $Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... #$ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... #$ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... x <- model.matrix(Sepal.Length ~ Species, iris) head(x) # (Intercept) Speciesversicolor Speciesvirginica # 1 1 0 0 # 2 1 0 0 # 3 1 0 0 # 4 1 0 0 # 5 1 0 0 # 6 1 0 0  model.matrix returns a column of ones labeled (Intercept) by default. Also note that while the Species factor has 3 levels (“setosa”, “versicolor” and “virginica”), the return value of model.matrix only has dummy variables for the latter two levels. For a factor variable, model.matrix treats the first level it encounters as the “baseline” level and will not produce a dummy variable for it. This is to avoid the problem of multi-collinearity. However, there are situations where we might want dummy variables to be produced for all levels including the baseline level. (For example, when we do regularized regression, since multi-collinearity is no longer implies unidentifiability of the model.) We can induce this behavior by passing a specific value to the contrasts.arg argument: x <- model.matrix( Sepal.Length ~ Species, data = iris, contrasts.arg = list(Species = contrasts(iris$Species, contrasts = FALSE)))
#   (Intercept) Speciessetosa Speciesversicolor Speciesvirginica
# 1           1             1                 0                0
# 2           1             1                 0                0
# 3           1             1                 0                0
# 4           1             1                 0                0
# 5           1             1                 0                0
# 6           1             1                 0                0


Let’s have a closer look at what we passed as the value of Species in the list:

contrasts(iris$Species, contrasts = FALSE) # setosa versicolor virginica # setosa 1 0 0 # versicolor 0 1 0 # virginica 0 0 1  Notice that there are 3 columns, one for each level. If we didn’t pass this special value in, the default would have had just 2 columns, one for each of the levels we see in the output: contrasts(iris$Species, contrasts = TRUE)
#            versicolor virginica
# setosa              0         0
# versicolor          1         0
# virginica           0         1


It’s easy to modify the code above to include the baseline level for a different factor variable in another data frame. The code below is an example of how you can include the baseline level for all factor variables in the data frame.

df <- data.frame(x = factor(rep(c("a", "b", "c"), times = 3)),
y = factor(rep(c("d", "e", "f"), times = 3)),
z = 1:9)
str(df)
# 'data.frame':	9 obs. of  3 variables:
#  $x: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 #$ y: Factor w/ 3 levels "d","e","f": 1 2 3 1 2 3 1 2 3
#  $z: int 1 2 3 4 5 6 7 8 9 # default: no dummy variable for baseline level x <- model.matrix(~ ., data = df) head(x) # (Intercept) xb xc ye yf z # 1 1 0 0 0 0 1 # 2 1 1 0 1 0 2 # 3 1 0 1 0 1 3 # 4 1 0 0 0 0 4 # 5 1 1 0 1 0 5 # 6 1 0 1 0 1 6 # dummy variables for baseline levels included x <- model.matrix( ~ ., data = df, contrasts.arg = lapply(df[, sapply(df, is.factor), drop = FALSE], contrasts, contrasts = FALSE)) head(x) # (Intercept) xa xb xc yd ye yf z # 1 1 1 0 0 1 0 0 1 # 2 1 0 1 0 0 1 0 2 # 3 1 0 0 1 0 0 1 3 # 4 1 1 0 0 1 0 0 4 # 5 1 0 1 0 0 1 0 5 # 6 1 0 0 1 0 0 1 6  References: 1. StackOverflow. All Levels of a Factor in a Model Matrix in R. # Switching testthat editions and how it affects testing functions and formulas testthat is a popular R package used for unit testing. From v3.0.0, testthat introduces the idea of “editions”. This is testthat‘s way of maintaining backward compatibility. At the time of writing, the 3rd edition is the latest and incorporates the package developer’s latest recommendations, some of which could be backward incompatible. If the user wants testthat‘s old behavior, they can use earlier editions of the package. Use edition_get() to find out which edition of testthat is currently active, and use local_edition() to change the active edition: library(testthat) edition_get() # your return value may be different # [1] 2 local_edition(3) edition_get() # [1] 3  You can read more about the changes that the 3rd edition introduces in Reference 1. In the rest of this post, I’ll cover one difference between the 2nd and 3rd edition: how testthat handles the environment for functions and formulas. (This was how I stumbled upon the concept of testthat editions in the first place.) The 2nd edition ignores the environment of functions and formulas when testing for equality, while the 3rd edition takes them into account. Here’s a simple, albeit contrived, example that demonstrates this difference. Consider the following functions: f <- function(x) { g <- function(x) x g } actual_value <- f() expected_value <- function(x) x  Both actual_value and expected_value are simply the identity function: they return whatever is passed. Let’s see what happens when we test for equality in the 2nd and 3rd editions: local_edition(2) expect_equal(actual_value, expected_value) # passes local_edition(3) expect_equal(actual_value, expected_value) # Error: actual_value (actual) not equal to expected_value (expected). # # attr(environment(actual), 'handlers') is absent # attr(environment(expected), 'handlers') is a list # # environment(actual) is <env:0x7fa8e59b7228> # environment(expected) is <env:global>  The 2nd edition ignores the environments of the two functions and declares them as equal. The 3rd edition notices that the environments of the functions are not the same and throws an error. You can pass the argument ignore_function_env = TRUE if you don’t want the 3rd edition to test for equality of function environments: local_edition(3) expect_equal(actual_value, expected_value, ignore_function_env = TRUE) # passes  Here’s an example where testing for equality of function environments matters: f <- function(x) { y <- 3 g <- function(x) x + y g } actual_value <- f() expected_value <- g <- function(x) x + y  The 2nd edition test passes while the 3rd edition test does not: local_edition(2) expect_equal(actual_value, expected_value) # passes local_edition(3) expect_equal(actual_value, expected_value) # Error: actual_value (actual) not equal to expected_value (expected). # # attr(environment(actual), 'handlers') is absent # attr(environment(expected), 'handlers') is a list # # environment(actual) is <env:0x7fa8e15eb6a0> # environment(expected) is <env:global>  In this case, the test should fail. When actual_value is called, it pulls the value of y from the enclosing scope, which is the body of f. When expected_value is called, it pulls the value of y from its enclosing scope, which is the global environment. As the code snippet below shows, these values can be different: y <- 1 expected_value(10) # [1] 11 actual_value(10) # [1] 13  The situation is similar for formulas: the 3rd edition will check the formula’s environment while the 2nd edition won’t. Turn off the environment checks by passing ignore_formula_env = TRUE: f <- function(x, y) { formula_string <- paste(y, "~", x) as.formula(formula_string) } actual_value <- f("x", "y") expected_value <- as.formula("y ~ x") local_edition(2) expect_equal(actual_value, expected_value) # passes local_edition(3) expect_equal(actual_value, expected_value) # Error: actual_value (actual) not equal to expected_value (expected). # # attr(attr(actual, '.Environment'), 'handlers') is absent # attr(attr(expected, '.Environment'), 'handlers') is a list # # attr(actual, '.Environment') is <env:0x7fa8e559b4b8> # attr(expected, '.Environment') is <env:global> local_edition(3) expect_equal(actual_value, expected_value, ignore_formula_env = TRUE) # passes  While I understand why one might want to check the environment for functions and formulas during testing, I admit that the examples above seems quite contrived. I would love to hear of examples that others have encountered where having ignore_function_env = FALSE or ignore_formula_env = FALSE was crucial. References: 1. Wickham, H. testthat 3e. # Comparing the Bradley Terry model to betting odds In this previous post, I described the Bradley-Terry model and showed how we could use it to predict game outcomes in the NBA 2018-19 regular season. After ffitting the Bradley-Terry model on the first half of the regular season (with and without home advantage), I used the model to predict win probabilities for the second half of the season. The models gave test Brier scores of 0.213 and 0.219, which I said was no longer than random 50-50 guessing (which has a Brier score of 0.25). In private correspondence, Justin Dyer pointed out to me that it’s not clear a priori that Brier scores of ~0.21 are bad: we should benchmark the scores against other models, or the Brier scores implied by bookmaking odds. Presumably bookmaking odds incorporate a lot more information than the simple Bradley-Terry model above (e.g. injuries, team’s recent form). In this post, we compare the Bradley-Terry model against betting odds for that same data. All the code in this post can be found in one script here. Getting the data Betting odds data was obtained from here and games data was obtained from here. First, import the packages that we will use: library(DBI) library(plotROC) library(readr) library(tidyverse)  Next, we pull out the relevant game data from the SQLite database: seasonYear <- 2018 # represents 2018-2019 season # this filepath specific to my local drive mainFile <- "basketball.sqlite" # get all regular season games (only relevant columns # selected) mydb <- dbConnect(RSQLite::SQLite(), mainFile) df <- dbGetQuery(mydb, "SELECT * FROM Game") dbDisconnect(mydb) season_df <- df %>% mutate(GAME_DATE = as.Date(GAME_DATE), SEASON = as.numeric(SEASON)) %>% filter(SEASON == seasonYear) %>% select(GAME_DATE, TEAM_NAME_HOME, TEAM_NAME_AWAY, WL_HOME) %>% arrange(GAME_DATE) head(season_df) # GAME_DATE TEAM_NAME_HOME TEAM_NAME_AWAY WL_HOME # 1 2018-10-16 Golden State Warriors Oklahoma City Thunder W # 2 2018-10-16 Boston Celtics Philadelphia 76ers W # 3 2018-10-17 San Antonio Spurs Minnesota Timberwolves W # 4 2018-10-17 New York Knicks Atlanta Hawks W # 5 2018-10-17 Phoenix Suns Dallas Mavericks W # 6 2018-10-17 LA Clippers Denver Nuggets L  After that, we pull out the relevant betting odds data. The data frame contains several different types of odds as well as odds from different bookmakers. We use the Average_Line_ML column, representing the average moneyline bet from the bookmakers in the dataset. In the last line, we convert the moneyline odds to win probability (see this website for more details). # this filepath specific to my local drive vegas <- read_csv("vegas.txt") bet_df <- vegas %>% filter(Location == "home") %>% select(Date, Team, OppTeam, line = Average_Line_ML) %>% mutate(winProb = ifelse(line < 0, -line / (100 - line), 100 / (100 + line))) head(bet_df) # # A tibble: 6 × 5 # Date Team OppTeam line winProb # <date> <chr> <chr> <dbl> <dbl> # 1 2018-10-16 Boston Philadelphia -210. 0.678 # 2 2018-10-16 Golden State Oklahoma City -1015. 0.910 # 3 2018-10-17 Charlotte Milwaukee 145 0.408 # 4 2018-10-17 Detroit Brooklyn -242. 0.708 # 5 2018-10-17 Indiana Memphis -294. 0.746 # 6 2018-10-17 Orlando Miami 116 0.463  Join the two datasets together: season_teams <- sort(unique(season_df$TEAM_NAME_HOME))
bet_teams <- sort(unique(bet_df$Team)) names(bet_teams) <- season_teams df <- season_df %>% transmute(Date = GAME_DATE, Team = bet_teams[TEAM_NAME_HOME], OppTeam = bet_teams[TEAM_NAME_AWAY], HomeWinLoss = WL_HOME) %>% left_join(bet_df)  For some reason, 4 of the games did not have betting odds data. We impute a win probability of 0.5 to them (i.e. no information, guess a 50-50 coin flip): df$winProb[is.na(df$winProb)] <- 0.5  Fitting the Bradley-Terry model The chunk of code below fits the Bradley-Terry model (with and without home advantage) on the training data. This is an abbreviated version of the code in the previous post; go there if you would like more detail. # Get data in a form suitable for the BT model get_data_vec <- function(home_team, away_team, teams) { vec <- rep(0, length(teams)) vec[teams == home_team] <- 1 vec[teams == away_team] <- -1 vec } X <- apply(df, 1, function(row) get_data_vec(row["Team"], row["OppTeam"], bet_teams)) X <- t(X) colnames(X) <- bet_teams dim(X) y <- as.numeric(df$HomeWinLoss == "W")
bt_df <- as.data.frame(cbind(X, y))

# split into train and test
n_train <- nrow(X) / 2
train_df <- bt_df[1:n_train, ]
test_df <- bt_df[(n_train + 1):nrow(X), ]

# train BT models
train_mod_home <- glm(y ~ ., data = train_df, family = binomial())
train_mod_no_home <- glm(y ~ . + 0, data = train_df, family = binomial())


Comparing predictions

We get the predictions for the models and betting odds on the test set:

pred_home <- predict(train_mod_home, newdata = test_df, type = "response")
pred_no_home <- predict(train_mod_no_home, newdata = test_df, type = "response")
pred_bet <- df$winProb[(n_train + 1):nrow(X)]  Before comparing the models on test performance, let’s have a look at the predictions themselves. The code below plots the win probability predictions from betting odds against those from the Bradley-Terry model with home advantage. plot(pred_home, pred_bet, pch = 16, cex = 0.5, xlab = "Prob. of home win (BT model w. home adv.)", ylab = "Prob. of home win (betting odds)") abline(0, 1)  It looks like there is quite a bit of variation between the two but they generally give the same directional outcome (i.e. they would predict the same team winning). Now for test performance. The code below plots the test ROC curves for the 3 models: pred_df <- data.frame( truth = test_df$y,
pred_home = pred_home,
pred_no_home = pred_no_home,
pred_bet = pred_bet
) %>%
pivot_longer(pred_home:pred_bet,
names_to = "model",
values_to = "prediction")

roc_plot <- ggplot(pred_df) +
geom_roc(aes(d = truth, m = prediction, color = model), labels = FALSE)

roc_plot +
labs(x = "True positive fraction",
y = "False positive fraction",
title = "Test set ROC plots") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
theme(legend.position = "bottom")


From the ROC curve plot, betting odds (red line) clearly outperform the two Bradley-Terry models, but not by much! Finally, let’s compute the test AUCs and Brier scores:

# compute AUC
calc_auc(roc_plot)
#   PANEL group       AUC
# 1     1     1 0.7421478
# 2     1     2 0.7120678
# 3     1     3 0.7133042

# compute Brier score
mean((pred_bet - test_df$y)^2) # [1] 0.2027192 mean((pred_home - test_df$y)^2)
# [1] 0.2130266
mean((pred_no_home - test_dfy)^2) # [1] 0.2186777  Betting odds has a better test Brier score, but only marginally: 0.203 vs. 0.213 & 0.219! This is quite surprising in that the Bradley-Terrry model is so simple and uses much less information than betting odds, yet its performance is not too far off. # What is the Bradley-Terry model? The Bradley-Terry model The Bradley-Terry model, named after R. A. Bradley and M. E. Terry, is a probability model for predicting the outcome of a paired comparison. Imagine that we have $K$ teams competing against each other. The model assigns team $i$ a score $p_i$, with higher scores corresponding to better teams. Given two teams $i$ and $j$, the model asserts that \begin{aligned} \text{Prob(} i \text{ beats } j) &= \dfrac{p_i}{p_i + p_j}. \end{aligned} (Notice that the model implies that either i beats j or j beats i: there is no room for ties.) If we parameterize the scores by $p_i = \exp(\beta_i)$, then the model above is equivalent to \begin{aligned} \text{logit} \left( \text{Prob(} i \text{ beats } j) \right) &= \log \left( \dfrac{\text{Prob(} i \text{ beats } j) }{\text{Prob(} j \text{ beats } i) }\right) \\ &= \log \left( \dfrac{p_i}{p_j} \right) = \beta_i - \beta_j. \end{aligned} Thus, estimating the parameters $\beta_1, \dots, \beta_K$ amounts to fitting a logistic regression, which is pretty straightforward. The part that’s not as straightforward is converting typical win-loss data into the data matrix required for the logistic regression. One nice thing about the Bradley-Terry model is that it’s really easy to take home field advantage into account. Home field advantage amounts to including an intercept in the model (assuming that team i is the home team): \begin{aligned} \text{logit} \left( \text{Prob(} i \text{ beats } j) \right) &= \alpha + \beta_i - \beta_j. \end{aligned} The Bradley-Terry model doesn’t just apply to sports teams: it applies whenever we are trying to rank items, e.g. which documents are most relevant to a user query. An example: NBA regular season 2018-19 For the rest of this article, we demonstrate how to fit a Bradley-Terry model to NBA data. All the code in this post can be found in one script here. Preparing the data First, let’s import packages that we’ll use: library(DBI) library(tidyverse) library(ggrepel) library(plotROC)  Next, let’s pull the relevant data from an SQLite database (your filepath to the data will probably be different from mine): seasonYear <- 2018 # represents 2018-2019 season # this filepath specific to my local drive mainFile <- "../data/nba-kaggle-wyattowalsh/basketball.sqlite" # get all regular season games (only relevant columns # selected) mydb <- dbConnect(RSQLite::SQLite(), mainFile) df <- dbGetQuery(mydb, "SELECT * FROM Game") dbDisconnect(mydb) season_df <- df %>% mutate(GAME_DATE = as.Date(GAME_DATE), SEASON = as.numeric(SEASON)) %>% filter(SEASON == seasonYear) %>% select(GAME_DATE, TEAM_NAME_HOME, TEAM_NAME_AWAY, WL_HOME, WL_AWAY) %>% arrange(GAME_DATE) head(season_df) # GAME_DATE TEAM_NAME_HOME TEAM_NAME_AWAY WL_HOME WL_AWAY # 1 2018-10-16 Golden State Warriors Oklahoma City Thunder W L # 2 2018-10-16 Boston Celtics Philadelphia 76ers W L # 3 2018-10-17 San Antonio Spurs Minnesota Timberwolves W L # 4 2018-10-17 New York Knicks Atlanta Hawks W L # 5 2018-10-17 Phoenix Suns Dallas Mavericks W L # 6 2018-10-17 LA Clippers Denver Nuggets L W  The data contains information on which team is home and which is away: we can use that information when fitting the Bradley-Terry model with home field advantage. To make our subsequent analysis and plots more readable, we get team names and abbreviations. Note that teams will only contain the team names from the 2018-19 season, while team_abbrev_df contains teams from all the seasons present in the original SQLite database. # get team abbreviations and names team_abbrev_df <- df %>% select(team = TEAM_NAME_HOME, team_abbr = TEAM_ABBREVIATION_HOME) %>% distinct() teams <- sort(unique(season_dfTEAM_NAME_HOME))


Next, we make the data matrix needed for the Bradley Terry model. For each row of season_df representing one game, we need to turn that into a vector of 30 numbers. If team $i$ is the home team, there should be a 1 in the $i$th spot; if team $j$ is the away team, there should be a -1 in the $j$th spot; otherwise the vector element should be 0. The function get_data_vec() does this for us, and we apply it to every row in season_df.

# get dataframe for Bradley-Terry model
get_data_vec <- function(home_team, away_team, teams) {
vec <- rep(0, length(teams))
vec[teams == home_team] <- 1
vec[teams == away_team] <- -1
vec
}

X <- apply(season_df, 1,
function(row) get_data_vec(row["TEAM_NAME_HOME"],
row["TEAM_NAME_AWAY"],
teams))
X <- t(X)
colnames(X) <- teams

dim(X)
# [1] 1230   30


Bundle X and the response (did team $i$ win or not?) together in a data frame:

y <- as.numeric(season_df$WL_HOME == "W") bt_df <- as.data.frame(cbind(X, y))  Fitting the Bradley-Terry model Preparing the data was the hard part: fitting the model is simply logistic regression, which we can do with the glm() function: # Bradley-Terry model with home advantage bt_mod <- glm(y ~ ., data = bt_df, family = binomial()) summary(bt_mod) # Call: # glm(formula = y ~ ., family = binomial(), data = bt_df) # # Deviance Residuals: # Min 1Q Median 3Q Max # -2.3657 -1.0417 0.5943 0.9370 2.0810 # # Coefficients: (1 not defined because of singularities) # Estimate Std. Error z value Pr(>|z|) # (Intercept) 0.45312 0.06451 7.024 2.15e-12 *** # Atlanta Hawks -0.14263 0.33801 -0.422 0.673045 # Boston Celtics 0.95279 0.33708 2.827 0.004704 ** # Brooklyn Nets 0.57338 0.33323 1.721 0.085313 . # Charlotte Hornets 0.42252 0.33290 1.269 0.204369 # Chicago Bulls -0.56384 0.35012 -1.610 0.107310 # Cleveland Cavaliers -0.77299 0.35977 -2.149 0.031668 * # Dallas Mavericks 0.18056 0.34024 0.531 0.595646 # Denver Nuggets 1.34069 0.34632 3.871 0.000108 *** # Detroit Pistons 0.51927 0.33477 1.551 0.120878 # Golden State Warriors 1.49502 0.35119 4.257 2.07e-05 *** # Houston Rockets 1.27371 0.34470 3.695 0.000220 *** # Indiana Pacers 0.88599 0.33580 2.638 0.008328 ** # LA Clippers 0.98431 0.33971 2.898 0.003761 ** # Los Angeles Lakers 0.39871 0.33801 1.180 0.238164 # Memphis Grizzlies 0.18822 0.33950 0.554 0.579314 # Miami Heat 0.42317 0.33174 1.276 0.202095 # Milwaukee Bucks 1.59581 0.35456 4.501 6.77e-06 *** # Minnesota Timberwolves 0.35609 0.33764 1.055 0.291591 # New Orleans Pelicans 0.15370 0.33992 0.452 0.651159 # New York Knicks -0.90066 0.36632 -2.459 0.013945 * # Oklahoma City Thunder 1.03916 0.34122 3.045 0.002323 ** # Orlando Magic 0.56597 0.33298 1.700 0.089184 . # Philadelphia 76ers 1.07406 0.34054 3.154 0.001611 ** # Phoenix Suns -0.68639 0.36436 -1.884 0.059590 . # Portland Trail Blazers 1.28427 0.34507 3.722 0.000198 *** # Sacramento Kings 0.49275 0.33739 1.460 0.144160 # San Antonio Spurs 0.97029 0.33981 2.855 0.004298 ** # Toronto Raptors 1.47941 0.34897 4.239 2.24e-05 *** # Utah Jazz 1.08503 0.34131 3.179 0.001478 ** # Washington Wizards NA NA NA NA # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # # (Dispersion parameter for binomial family taken to be 1) # # Null deviance: 1662.6 on 1229 degrees of freedom # Residual deviance: 1441.9 on 1200 degrees of freedom # AIC: 1501.9 # # Number of Fisher Scoring iterations: 4  Some notes are in order: • Since an intercept is included in the model, this is the Bradley-Terry model with home field advantage. We’ll see how to fit a model without an intercept later in this post. • Notice that the Washington Wizards doesn’t have a coefficient associated with it. This is because the original model is unidentifiable: if we were to replace all the coefficients $\beta_j$ with $\beta_j + c$ for any constant $c$, the resulting model would be indistinguishable from the original one. R is smart enough to detect this and automatically chooses one of the levels (in this case the Washington Wizards) as having coefficient 0. If we want another team to have coefficient 0, we could have dropped its column in the data matrix X before fitting the model. • Larger coefficients correspond to better teams. In the next code block, we examine the relationship between the Bradley-Terry coefficients and the team’s overall win percentage for the season: # Compare BT coefficients with overall win percentage coef_df <- data.frame( team = teams, beta = c(summary(bt_mod)$coefficients[2:length(teams), "Estimate"], 0)
)

# get team win percentages
home_df <- season_df %>% group_by(TEAM_NAME_HOME) %>%
summarize(home_win  = sum(WL_HOME == "W"),
home_loss = sum(WL_HOME == "L"))
away_df <- season_df %>% group_by(TEAM_NAME_AWAY) %>%
summarize(away_win  = sum(WL_AWAY == "W"),
away_loss = sum(WL_AWAY == "L"))
win_pct_df <- inner_join(home_df, away_df,
by = c("TEAM_NAME_HOME" = "TEAM_NAME_AWAY")) %>%
transmute(team = TEAM_NAME_HOME,
win = home_win + away_win,
loss = home_loss + away_loss) %>%
mutate(win_pct = win / (win + loss)) %>%
arrange(desc(win_pct)) %>%
left_join(team_abbrev_df)

win_pct_df %>% inner_join(coef_df) %>%
ggplot(aes(x = win_pct, y = beta)) +
geom_point() +
geom_text_repel(aes(label = team_abbr)) +
labs(x = "Win percentage", y = "Bradley-Terry beta",
title = "Bradley-Terry beta vs. Win %")


As we probably expected, there’s high agreement between the two measures of team quality. (Why not use win percentage right off the bat? That’s because win percentage depends on “strength of schedule”. A team that plays weaker teams is likely to have a win percentage that is biased upwards, and vice versa for a team that plays stronger teams. Win percentage does not account for this but the Bradley-Terry model does.)

(I also just noticed that there are some extraneous labels in the figure; it’s because some teams have had multiple abbreviations throughout the history of the game, and I should have filtered them out from team_abbrev_df.)

Fitting the model without home advantage is the same as performing logistic regression without an intercept. This is easy to do: we just add + 0 in the formula argument.

bt_mod1 <- glm(y ~ . + 0, data = bt_df, family = binomial())


Let’s compare the coefficients for the teams from the two models:

coef_df1 <- data.frame(
team = teams,
beta = c(summary(bt_mod1)$coefficients[, "Estimate"], 0) ) plot(coef_df$beta, coef_df1$beta, pch = 16, xlab = "Beta with home adv.", ylab = "Beta w/o home adv.") abline(0, 1, col = "red", lty = 2)  The dotted red line is the $y = x$ line. For this data, it looks like both models give very similar coefficients for the teams. Notice though that the points don’t lie exactly on the line: the model without home advantage gives coefficients that are slightly shrunken toward zero. Predicting the second half of the season We end off this post by showing how to use the Bradley-Terry model to make predictions. We split our data into two halves: we will use the first half of the season as training data, and will use the model to predict the results for the second half of the season. n_train <- nrow(X) / 2 train_df <- bt_df[1:n_train, ] test_df <- bt_df[(n_train + 1):nrow(X), ]  Code for training and making predictions: train_mod_home <- glm(y ~ ., data = train_df, family = binomial()) pred_home <- predict(train_mod_home, newdata = test_df, type = "response") train_mod_no_home <- glm(y ~ . + 0, data = train_df, family = binomial()) pred_no_home <- predict(train_mod_no_home, newdata = test_df, type = "response")  Let’s compare the predictions made by the two models: plot(pred_home, pred_no_home, pch = 16, cex = 0.5, xlab = "Prob. of home win (model w. home adv.)", ylab = "Prob. of home win (model w/o home adv.)") abline(0, 1)  As you might expect, the model that takes home advantage into account always gives a higher predicted probability of a home win. What’s interesting is that the size of this home advantage changes across the range: it is largest for probabilities around 0.5 and smallest for probabilities close to the extremes. The code below plots the ROC curves for the two models on the test data. In this situation it looks like whether we incorporate home advantage or not doesn’t make much of a difference. pred_df <- data.frame( truth = test_df$y,
pred_home = pred_home,
pred_no_home = pred_no_home
) %>%
pivot_longer(pred_home:pred_no_home,
names_to = "model",
values_to = "prediction")

roc_plot <- ggplot(pred_df) +
geom_roc(aes(d = truth, m = prediction, color = model), labels = FALSE)

roc_plot +
labs(x = "True positive fraction",
y = "False positive fraction",
title = "Test set ROC plots") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
theme(legend.position = "bottom")


We can use the calc_auc() function from the plotROC package to compute the area under the curve (AUCs) from the plot object:

calc_auc(roc_plot)

# PANEL group       AUC
# 1     1     1 0.7120678
# 2     1     2 0.7133042


We can also compute the Brier scores for the two models:

mean((pred_home - test_df$y)^2) # [1] 0.2130266 mean((pred_no_home - test_df$y)^2)
# [1] 0.2186777


Recall that smaller Brier scores are better, and that a model that always outputs probability 0.5 has a Brier score of 0.25. Looking at the Brier scores and AUCs, it looks like these models are better than random guessing, but not a whole lot better.

Update (2022-02-01): In private correspondence, Justin Dyer correctly points out that it’s not clear a priori that Brier scores of ~0.21 are bad: we should benchmark the scores against other models, or the Brier scores implied by bookmaking odds. Presumably bookmaking odds incorporate a lot more information than the simple Bradley-Terry model above! We’ll leave this for a future blog post.

References: