Use mfcol to have plots drawn by column

To plot multiple figures on a single canvas in base R, we can change the graphical parameter mfrow. For instance, the code below tells R that subsequent figures will be drawn in a 2-by-3 array:

par(mfrow = c(2, 3))

If we then run this next block of code, we will get the image below:

set.seed(10)
n <- 50; p <- 3
x <- matrix(runif(n * p), ncol = p)
for (j in 1:p) {
    plot(x[, j], x[, j]^2 + 0.1 * rnorm(n), 
         xlab = paste0("x", j), ylab = paste0("f(x", j, ")^2 + noise"))
    plot(x[, j], x[, j]^3 + 0.1 * rnorm(n),
         xlab = paste0("x", j), ylab = paste0("f(x", j, ")^3 + noise"))
}

Notice how the plots are filled in by row? That is, the first plot goes in the top-left corner, the next plot goes to its right, and so on. What if we want the plots to be filled in by column instead? For instance, in our example above each feature is associated with two panels, so it would make more sense for the first column to be filled with the plots related to x1, and so on.

There is an easy fix for that: instead of specifying the mfrow parameter, specify the mfcol parameter. See the results below:

par(mfcol = c(2, 3))
set.seed(10)
n <- 50; p <- 3
x <- matrix(runif(n * p), ncol = p)
for (j in 1:p) {
    plot(x[, j], x[, j]^2 + 0.1 * rnorm(n), 
         xlab = paste0("x", j), ylab = paste0("f(x", j, ")^2 + noise"))
    plot(x[, j], x[, j]^3 + 0.1 * rnorm(n),
         xlab = paste0("x", j), ylab = paste0("f(x", j, ")^3 + noise"))
}

References:

  1. Graphical layouts.

Lesser known dplyr functions

The dplyr package is an essential tool for manipulating data in R. The “Introduction to dplyr” vignette gives a good overview of the common dplyr functions (list taken from the vignette itself):

  • filter() to select cases based on their values.
  • arrange() to reorder the cases.
  • select() and rename() to select variables based on their names.
  • mutate() and transmute() to add new variables that are functions of existing variables.
  • summarise() to condense multiple values to a single value.
  • sample_n() and sample_frac() to take random samples.

The “Two-table verbs” vignette gives a good introduction to using dplyr functions for joining two tables together. The “Window functions” vignette talks about, well, window functions, which are defined as functions which take n values and return n values (as opposed to aggregation functions, which take n values but return one value). The window functions mentioned there are (list taken from the vignette itself):

  • Ranking and ordering functions: row_number(), min_rank(), dense_rank(), cume_dist(), percent_rank(), and ntile(). These functions all take a vector to order by, and return various types of ranks.
  • Offsets: lead() and lag() allow you to access the previous and next values in a vector, making it easy to compute differences and trends (see the short example after this list).
  • Cumulative aggregates: cumsum(), cummin(), cummax() (from base R), and cumall(), cumany(), and cummean() (from dplyr).
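
As a quick illustration of the offset functions (my own toy example, not from the vignette), lag() shifts a vector back by one position and lead() shifts it forward, which makes successive differences a one-liner:

library(dplyr)

x <- c(2, 5, 11, 20)
lag(x)
# [1] NA  2  5 11
lead(x)
# [1]  5 11 20 NA
x - lag(x)   # successive differences
# [1] NA  3  6  9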

However, dplyr comes with several other functions that are not mentioned in the vignettes (or at least, not at length). In this post I’ll talk about some of them. (For the full list of dplyr functions, see the reference manual.)

  • Counting functions: n() and n_distinct()
  • If-else functions: if_else() and case_when()
  • Comparison functions: between() and near()
  • Selecting specific elements based on position: nth(), first() and last()
  • Selecting specific rows based on position/value: slice() and top_n()
  • Utilities: coalesce() and pull()

To illustrate these functions, I will use the flights dataset in the nycflights13 package.

library(tidyverse)
library(nycflights13)
data(flights)

Counting functions: n() and n_distinct()

Ok, these aren’t exactly lesser known functions but they’re certainly useful. n() counts the number of rows in each group. For example, to count the number of flights in each month:

flights %>% group_by(month) %>% 
    summarize(count = n())
# # A tibble: 12 x 2
#    month count
#    <int> <int>
# 1      1 27004
# 2      2 24951
# 3      3 28834
# 4      4 28330
# 5      5 28796
# 6      6 28243
# 7      7 29425
# 8      8 29327
# 9      9 27574
# 10    10 28889
# 11    11 27268
# 12    12 28135

n_distinct() counts the number of unique values in each group. To count the number of distinct values of day in the dataset:

flights %>% summarize(cnt = n_distinct(day))
# # A tibble: 1 x 1
#     cnt
#   <int>
# 1    31

as we expect, since the longest month only has 31 days. We can also count the number of unique sets of values across columns. To count the number of distinct (month, day) values:

flights %>% summarize(cnt = n_distinct(month, day))
# # A tibble: 1 x 1
#     cnt
#   <int>
# 1   365

as expected (number of days in a year).

If-else functions: if_else() and case_when()

if_else() returns one of two values depending on whether a given condition is true or false. It works like base R’s ifelse() function, except that it is stricter: the values returned for the true and false cases must be of the same type. if_else() is commonly used to create new columns. For example, we could create a new column that is “United” if a flight was from United Airlines, and “Otherwise” for all other carriers:

flights %>%
    transmute(carrier = carrier,
              isUA = if_else(carrier == "UA", "United", "Otherwise")) %>%
    head()
# # A tibble: 6 x 2
#   carrier isUA     
#   <chr>   <chr>    
# 1 UA      United   
# 2 UA      United   
# 3 AA      Otherwise
# 4 B6      Otherwise
# 5 DL      Otherwise
# 6 UA      United 

case_when() is like if_else() except that we have more than 2 possible values for the output. It is “the R equivalent of the SQL CASE WHEN statement”. If none of the cases match, an NA is returned. As with if_else(), the returned values for all the cases must be of the same type.

Here is some code to return “United” if carrier is UA, “JetBlue” if it is B6, and “Otherwise” for all other carriers (notice that the syntax within case_when() is a little different from the usual, and how we use TRUE to catch all the cases that don’t meet the first two conditions):

flights %>%
    transmute(carrier = carrier,
              newCol = case_when(
                  carrier == "UA" ~ "United",
                  carrier == "B6" ~ "JetBlue",
                  TRUE ~ "Otherwise"
              )) %>%
    head()
# # A tibble: 6 x 2
#   carrier newCol   
#   <chr>   <chr>    
# 1 UA      United   
# 2 UA      United   
# 3 AA      Otherwise
# 4 B6      JetBlue  
# 5 DL      Otherwise
# 6 UA      United 

Comparison functions: between() and near()

between(x, left, right) is a shortcut for x >= left & x <= right that has an efficient implementation. It makes for more concise code too, at the expense of the reader not being able to tell whether the endpoints are included (they are, on both sides) unless they are familiar with the function.
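
Here is a small example of my own:

x <- c(1, 5, 9, 12)
between(x, 5, 10)
# [1] FALSE  TRUE  TRUE FALSE

# e.g. keep only flights in the summer months
flights %>% filter(between(month, 6, 8))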

near() is a safer way than == to compare two vectors of floating point numbers element-wise. This is because it uses a tolerance based on the machine’s double precision. Here is the example provided in near()’s documentation:

sqrt(2) ^ 2 == 2
# [1] FALSE
near(sqrt(2) ^ 2, 2)
# [1] TRUE

Selecting specific elements based on position: nth(), first() and last()

nth(x, n) returns the nth element of the vector x. An optional vector to order by can be passed to the function as well. If n is negative, we count from the end of x (e.g. n = -2 returns the second-to-last value of x). first() and last() are obvious special cases of nth().

What’s interesting is that these functions are actually wrappers around [[. This means that they can be used to pull out the nth element in a list as well!
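
Here is a small example of my own to illustrate:

x <- c(10, 20, 30, 40)
nth(x, 2)
# [1] 20
nth(x, -2)
# [1] 30
first(x)
# [1] 10
last(x)
# [1] 40

# pass order_by to rank by another vector: x is reordered by y, then we take the 2nd element
y <- c(4, 3, 2, 1)
nth(x, 2, order_by = y)
# [1] 30

# because these are wrappers around [[, they work on lists too
first(list(a = 1:3, b = letters))
# [1] 1 2 3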

Selecting specific rows based on position/value: slice() and top_n()

slice() allows us to select certain rows based on their ordinal position (i.e. row number). This works especially well with grouped dataframes. For example, if we want the first row for each date:

flights %>% group_by(month, day) %>%
    slice(1) %>%
    select(month, day, dep_time)
# # A tibble: 365 x 3
# # Groups:   month, day [365]
#    month   day dep_time
#    <int> <int>    <int>
#  1     1     1      517
#  2     1     2       42
#  3     1     3       32
#  4     1     4       25
#  5     1     5       14
#  6     1     6       16
#  7     1     7       49
#  8     1     8      454
#  9     1     9        2
# 10     1    10        3
# # … with 355 more rows

A simple modification gives us the first two rows for each date:

flights %>% group_by(month, day) %>%
    slice(1:2) %>%
    select(month, day, dep_time)
# # A tibble: 730 x 3
# # Groups:   month, day [365]
#    month   day dep_time
#    <int> <int>    <int>
#  1     1     1      517
#  2     1     1      533
#  3     1     2       42
#  4     1     2      126
#  5     1     3       32
#  6     1     3       50
#  7     1     4       25
#  8     1     4      106
#  9     1     5       14
# 10     1     5       37
# # … with 720 more rows

In contrast to slice() which picks out rows by their position (within the group), top_n() picks out rows by their value in a pre-specified column (within the group). For example, if we wanted to see the carriers for the top 3 longest departure delays for each day, we could use this code:

flights %>% group_by(month, day) %>%
    top_n(3, dep_delay) %>%
    select(month, day, dep_delay, carrier)
# # A tibble: 1,108 x 4
# # Groups:   month, day [365]
#    month   day dep_delay carrier
#    <int> <int>     <dbl> <chr>  
#  1     1     1       853 MQ     
#  2     1     1       290 EV     
#  3     1     1       379 EV     
#  4     1     2       334 UA     
#  5     1     2       337 AA     
#  6     1     2       379 UA     
#  7     1     3       268 DL     
#  8     1     3       252 B6     
#  9     1     3       291 9E     
# 10     1     4       208 B6     
# # … with 1,098 more rows

Notice that while we get the top 3 rows in dep_delay for each day, the rows are not sorted according to the dep_delay column. To get the bottom three rows instead, use a negative number:

flights %>% group_by(month, day) %>%
    top_n(-3, dep_delay) %>%
    select(month, day, dep_delay, carrier)
# # A tibble: 1,504 x 4
# # Groups:   month, day [365]
#    month   day dep_delay carrier
#    <int> <int>     <dbl> <chr>  
#  1     1     1       -15 MQ     
#  2     1     1       -14 F9     
#  3     1     1       -15 AA     
#  4     1     2       -13 AA     
#  5     1     2       -13 UA     
#  6     1     2       -12 AA     
#  7     1     2       -12 9E     
#  8     1     3       -12 EV     
#  9     1     3       -12 EV     
# 10     1     3       -13 B6     
# # … with 1,494 more rows

Notice how Jan 2nd has 4 entries, not 3? That’s because there was a tie for 3rd place: top_n() either keeps all rows with a tied value or none of them. Another gotcha: if the column to sort by is not specified, top_n() defaults to the last variable in the data frame, NOT to ordinal position/row number.
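
As a quick sketch of that second gotcha (output omitted): if we call top_n() without a weighting column on flights, dplyr picks the last column of the data frame (time_hour) and prints a message saying so, which is probably not what we intended.

# no column specified: selects by time_hour, the last column of flights
flights %>% group_by(month, day) %>% top_n(1)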

Utilities: coalesce() and pull()

From the documentation: “Given a set of vectors, coalesce() finds the first non-missing value at each position. This is inspired by the SQL COALESCE function which does the same thing for NULLs.”

I haven’t had to use this function myself so far, but I imagine it could be useful when you have a number of different data sources reporting the same thing but with different missing values in each source. Here is a contrived example:

x <- c(NA, NA, 3)
y <- c(1, NA, 4)
z <- c(NA, 2, NA)
coalesce(x, y, z)
# [1] 1 2 3

Notice that the 4 in the third position of y is ignored: x already has a non-missing value (3) there.

pull() works like [[: it allows you to pull out a particular column (as a vector). The variable you pass to the function can be a variable name, a positive integer (giving the column position when counting from the left) or a negative integer (giving the column position when counting from the right).
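
A quick example of my own:

# pull out the carrier column as a character vector
flights %>% pull(carrier) %>% head()
# [1] "UA" "UA" "AA" "B6" "DL" "UA"

# by position: 1 is the first column from the left, -1 the first from the right
flights %>% pull(1) %>% head()    # year
flights %>% pull(-1) %>% head()   # time_hour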

Mixing up R markdown shortcut keys in RStudio, or how to unfold all chunks

When using R markdown in RStudio, I like to insert a new chunk using the shortcut Cmd+Option+I. Unfortunately I often press the wrong key instead of “I” and end up folding all the chunks, getting something like this:

It often takes me a while (on Google) to figure out what I did and how to undo it. With this note to remind me, no longer!! The shortcut I accidentally used was Cmd+Option+O, which folds up all chunks. To unfold all chunks, use Cmd+Shift+Option+O.

The full list of RStudio keyboard shortcuts can be found here.

Update (2019-08-27): For Windows & Linux users: the equivalent shortcut keys for insert chunk, fold up all chunks and unfold all chunks are Ctrl+Alt+I, Alt+O and Shift+Alt+O respectively.

Visualizing the relationship between multiple variables

Visualizing the relationship between multiple variables can get messy very quickly. This post is about how the ggpairs() function in the GGally package does this task, as well as my own method for visualizing pairwise relationships when all the variables are categorical.

For all the code in this post in one file, click here.

The GGally::ggpairs() function does a really good job of visualizing the pairwise relationship for a group of variables. Let’s demonstrate this on a small segment of the vehicles dataset from the fueleconomy package:

library(fueleconomy)
data(vehicles)
df <- vehicles[1:100, ]
str(df)
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	100 obs. of  12 variables:
#  $ id   : int  27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
#  $ make : chr  "AM General" "AM General" "AM General" "AM General" ...
#  $ model: chr  "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
#  $ year : int  1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
#  $ class: chr  "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
#  $ trans: chr  "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
#  $ drive: chr  "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
#  $ cyl  : int  4 4 6 6 4 6 6 4 4 6 ...
#  $ displ: num  2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
#  $ fuel : chr  "Regular" "Regular" "Regular" "Regular" ...
#  $ hwy  : int  17 17 13 13 17 13 21 26 28 26 ...
#  $ cty  : int  18 18 13 13 16 13 14 20 22 18 ...

Let’s see how GGally::ggpairs() visualizes relationships between quantitative variables:

library(GGally)
quant_df <- df[, c("cyl", "hwy", "cty")]
ggpairs(quant_df)

  • Along the diagonal, we see a density plot for each of the variables.
  • Below the diagonal, we see scatterplots for each pair of variables.
  • Above the diagonal, we see the (Pearson) correlation between each pair of variables.

The visualization changes a little when we have a mix of quantitative and categorical variables. Below, fuel is a categorical variable while hwy is a quantitative variable.

mixed_df <- df[, c("fuel", "hwy")]
ggpairs(mixed_df)

  • For a categorical variable on the diagonal, we see a barplot depicting the number of times each category appears.
  • In one of the corners (top-right), for each categorical value we have a boxplot for the quantitative variable.
  • In one of the corners (bottom-left), for each categorical value we have a histogram for the quantitative variable.

The only behavior for GGally::ggpairs() we haven’t observed yet is for a pair of categorical variables. In the code fragment below, all 3 variables are categorical:

cat_df <- df[, c("fuel", "make", "drive")]
ggpairs(cat_df)

For each pair of categorical variables, we have a barplot depicting the number of times each pair of categorical values appears.

You may have noticed that the plots above the diagonal are essentially transposes of the plots below the diagonal, and so they don’t really convey any more information. What follows below is my attempt to make the plots above the diagonal more useful. Instead of plotting the transposed barplot, I will plot a heatmap showing the relative proportion of observations having each pair of categorical values.

First, the scaffold for the plot. I will use the gridExtra package to put several ggplot2 objects together. The code below puts the same barplots below the diagonal, variable names along the diagonal, and empty canvases above the diagonal. (Notice that I need some tricks to make the barplots with the variables as strings, namely the use of aes_string() and as.formula() within facet_grid().)

library(gridExtra)
library(tidyverse)
grobs <- list()
idx <- 0
for (i in 1:ncol(cat_df)) {
    for (j in 1:ncol(cat_df)) {
        idx <- idx + 1
        
        # get feature names (note that i & j are reversed)
        x_feat <- names(cat_df)[j]
        y_feat <- names(cat_df)[i]
        
        if (i < j) {
            # empty canvas
            grobs[[idx]] <- ggplot() + theme_void()
        } else if (i == j) {
            # just the name of the variable
            label_df <- data.frame(x = 0, y = 0, label = x_feat)
            grobs[[idx]] <- ggplot(label_df, aes(x = x, y = y, label = label)) +
                geom_text(fontface = "bold", hjust = 0.5) +
                coord_cartesian(xlim = c(-1, 1), ylim = c(-1, 1)) +
                theme_void()
        }
        else {
            # 2-dimensional barplot
            grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) + 
                geom_bar() +
                facet_grid(as.formula(paste(y_feat, "~ ."))) +
                theme(legend.position = "none", axis.title = element_blank())
        }
    }
}
grid.arrange(grobs = grobs, ncol = ncol(cat_df))

This is essentially showing the same information as GGally::ggpairs(). To add the frequency proportion heatmaps, replace the code in the (i < j) branch with the following:

# frequency proportion heatmap
# get frequency proportions
freq_df <- cat_df %>% 
    group_by_at(c(x_feat, y_feat)) %>%
    summarize(proportion = n() / nrow(cat_df)) %>% 
    ungroup()
            
# get all pairwise combinations of values
temp_df <- expand.grid(unique(cat_df[[x_feat]]), 
                       unique(cat_df[[y_feat]]))
names(temp_df) <- c(x_feat, y_feat)
            
# join to get frequency proportion
temp_df <- temp_df %>%
    left_join(freq_df, by = c(setNames(x_feat, x_feat),
                              setNames(y_feat, y_feat))) %>%
    replace_na(list(proportion = 0))
            
grobs[[idx]] <- ggplot(temp_df, aes_string(x = x_feat, y = y_feat)) + 
    geom_tile(aes(fill = proportion)) +
    geom_text(aes(label = sprintf("%0.2f", round(proportion, 2)))) +
    scale_fill_gradient(low = "white", high = "#007acc") +
    theme(legend.position = "none", axis.title = element_blank())

Notice that each heatmap has its own limits for the color scale. If you want to have the same color scale for all the plots, you can add limits = c(0, 1) to the scale_fill_gradient() layer of the plot.
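
In other words, the scale layer in the (i < j) branch would become something like this:

scale_fill_gradient(low = "white", high = "#007acc", limits = c(0, 1))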

The one thing we lose here over the GGally::ggpairs() version is the marginal barplot for each variable. This is easy to add but then we don’t really have a place to put the variable names. Replacing the code in the (i == j) branch with the following is one possible option.

# df for positioning the variable name
label_df <- data.frame(x = 0.5 + length(unique(cat_df[[x_feat]])) / 2, 
                       y = max(table(cat_df[[x_feat]])) / 2, label = x_feat)
# marginal barplot with variable name on top
grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) +
    geom_bar() +
    geom_label(data = label_df, aes(x = x, y = y, label = label),
               size = 5)

In this final version, we clean up some of the axes so that more of the plot space can be devoted to the plot itself, not the axis labels:

theme_update(legend.position = "none", axis.title = element_blank())

grobs <- list()
idx <- 0
for (i in 1:ncol(cat_df)) {
    for (j in 1:ncol(cat_df)) {
        idx <- idx + 1
        
        # get feature names (note that i & j are reversed)
        x_feat <- names(cat_df)[j]
        y_feat <- names(cat_df)[i]
        
        if (i < j) {
            # frequency proportion heatmap
            # get frequency proportions
            freq_df <- cat_df %>% 
                group_by_at(c(x_feat, y_feat)) %>%
                summarize(proportion = n() / nrow(cat_df)) %>% 
                ungroup()
            
            # get all pairwise combinations of values
            temp_df <- expand.grid(unique(cat_df[[x_feat]]), 
                                   unique(cat_df[[y_feat]]))
            names(temp_df) <- c(x_feat, y_feat)
            
            # join to get frequency proportion
            temp_df <- temp_df %>%
                left_join(freq_df, by = c(setNames(x_feat, x_feat),
                                          setNames(y_feat, y_feat))) %>%
                replace_na(list(proportion = 0))
            
            grobs[[idx]] <- ggplot(temp_df, aes_string(x = x_feat, y = y_feat)) + 
                geom_tile(aes(fill = proportion)) +
                geom_text(aes(label = sprintf("%0.2f", round(proportion, 2)))) +
                scale_fill_gradient(low = "white", high = "#007acc") +
                theme(axis.ticks = element_blank(), axis.text = element_blank())
        } else if (i == j) {
            # df for positioning the variable name
            label_df <- data.frame(x = 0.5 + length(unique(cat_df[[x_feat]])) / 2, 
                                   y = max(table(cat_df[[x_feat]])) / 2, label = x_feat)
            # marginal barplot with variable name on top
            grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) +
                geom_bar() +
                geom_label(data = label_df, aes(x = x, y = y, label = label),
                           size = 5)
        }
        else {
            # 2-dimensional barplot
            grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) + 
                geom_bar() +
                facet_grid(as.formula(paste(y_feat, "~ ."))) +
                theme(axis.ticks.x = element_blank(), axis.text.x = element_blank())
        }
    }
}
grid.arrange(grobs = grobs, ncol = ncol(cat_df))

Changing the variable inside an R formula

I recently encountered a situation where I wanted to run several linear models, but where the response variables would depend on previous steps in the data analysis pipeline. Let me illustrate using the mtcars dataset:

data(mtcars)
head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Let’s say I wanted to fit a linear model of mpg vs. hp and get the coefficients. This is easy:

lm(mpg ~ hp, data = mtcars)$coefficients
#> (Intercept)          hp 
#> 30.09886054 -0.06822828 

But what if I wanted to fit a linear model of y vs. hp, where y is a response variable that I won’t know until runtime? Or what if I want to fit 3 linear models: each of mpg, disp, drat vs. hp? Or what if I want to fit 300 such models? There has to be a way to do this programmatically.

It turns out that there are at least 4 different ways to achieve this in R. For all these methods, let’s assume that the responses we want to fit models for are in a character vector:

response_list <- c("mpg", "disp", "drat")

Here are the 4 ways I know (in decreasing order of preference):

1. as.formula()

as.formula() converts a string to a formula object. Hence, we can programmatically create the formula we want as a string, then pass that string to as.formula():

for (y in response_list) {
    lmfit <- lm(as.formula(paste(y, "~ hp")), data = mtcars)
    print(lmfit$coefficients)
}
#> (Intercept)          hp 
#> 30.09886054 -0.06822828 
#> (Intercept)          hp 
#>    20.99248     1.42977 
#> (Intercept)          hp 
#>  4.10990867 -0.00349959 

2. Don’t specify the data option

Passing the data = mtcars option to lm() gives us more succinct and readable code. However, lm() also accepts the response vector and data matrix themselves:

for (y in response_list) {
    lmfit <- lm(mtcars[[y]] ~ mtcars$hp)
    print(lmfit$coefficients)
}
#> (Intercept)          hp 
#> 30.09886054 -0.06822828 
#> (Intercept)          hp 
#>    20.99248     1.42977 
#> (Intercept)          hp 
#>  4.10990867 -0.00349959 

Edit: Commenter Tommaso Gennari shared a really nice solution that makes use of the fact that when you give lm() just a data frame, the first column is used as a dependent variable and the remaining columns are treated as independent variables:

for (y in response_list) {
    lmfit <- lm(mtcars[, c(y, "hp")])
    print(lmfit$coefficients)
}
#> (Intercept)          hp 
#> 30.09886054 -0.06822828 
#> (Intercept)          hp 
#>    20.99248     1.42977 
#> (Intercept)          hp 
#>  4.10990867 -0.00349959 

3. get()

get() searches for an R object by name and returns that object if it exists.

for (y in response_list) {
    lmfit <- lm(get(y) ~ hp, data = mtcars)
    print(lmfit$coefficients)
}
#> (Intercept)          hp 
#> 30.09886054 -0.06822828 
#> (Intercept)          hp 
#>    20.99248     1.42977 
#> (Intercept)          hp 
#>  4.10990867 -0.00349959 

4. eval(parse())

This one is a little complicated. parse() returns the parsed but unevaluated expressions, while eval() evaluates those expressions (in a specified environment).

for (y in response_list) {
    lmfit <- lm(eval(parse(text = y)) ~ hp, data = mtcars)
    print(lmfit$coefficients)
}
#> (Intercept)          hp 
#> 30.09886054 -0.06822828 
#> (Intercept)          hp 
#>    20.99248     1.42977 
#> (Intercept)          hp 
#>  4.10990867 -0.00349959 

 

Of course, for any of these methods, we could replace the outer loop with one of the apply() functions or with purrr::map().
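
For example, a purrr::map() version of the as.formula() approach might look like this (a sketch along the same lines as option 1):

library(purrr)

# fit one model per response, then extract the coefficients from each fit
fits <- map(response_list,
            function(y) lm(as.formula(paste(y, "~ hp")), data = mtcars))
map(fits, "coefficients")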

References:

  1. johnramey. Converting a String to a Variable Name On-The-Fly and Vice-versa in R.

Be careful of NA/NaN/Inf values when using base R’s plotting functions!

I was recently working on a supervised learning problem (i.e. building a model using some features to predict some response variable) with a fairly large dataset. I used base R’s plot and hist functions for exploratory data analysis and all looked well. However, when I started building my models, I began to run into errors. For example, when trying to fit the lasso using the glmnet package, I encountered this error:

I thought this error message was rather cryptic. However, after some debugging, I realized the error was exactly what it said it was: there were NA/NaN/Inf values in my data matrix! The problem was that I had expected these problematic values to have been flagged during my exploratory data analysis. However, R’s plot and hist functions silently drop these values before giving a plot.

Here’s some code to demonstrate the issue. Let’s create some fake data with NA/NaN/Inf values:

n <- 50  # no. of observations
p <- 2   # no. of features

# create fake data matrix
set.seed(1)
x <- matrix(rnorm(n * p), nrow = n)

# make some entries invalid
x[1:3, 1] <- NA
x[4:5, 2] <- Inf

head(x)
#>            [,1]       [,2]
#> [1,]         NA  0.3981059
#> [2,]         NA -0.6120264
#> [3,]         NA  0.3411197
#> [4,]  1.5952808        Inf
#> [5,]  0.3295078        Inf
#> [6,] -0.8204684  1.9803999

These two lines of code give us plots in return, without any warning message to the console that data points have been dropped:

plot(x[, 1], x[, 2])
hist(x[,1])

The ggplot2 package does a better job of handling such values. While it also makes the plot, it sends a warning to the console that some values have been dropped in the process:

library(ggplot2)
df <- data.frame(x = x[,1])
ggplot(df, aes(x)) + geom_histogram()


Moral(s) of the story:

  1. Don’t assume that your data is free of NA/NaN/Inf values. Check! (One quick way to check is sketched below.)
  2. Base R’s hist and plot functions do not warn about invalid values being removed. Either follow the advice in the previous point, or use plotting code that flags such removals (e.g. ggplot2).
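
For the fake data matrix x above, one quick check (my own sketch) is to count the non-finite entries in each column:

# count the NA/NaN/Inf entries in each column
colSums(!is.finite(x))
#> [1] 3 2

# or just check whether any entry is problematic
any(!is.finite(x))
#> [1] TRUE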

Looking at flood insurance claims with choroplethr

I recently learned how to use the choroplethr package through a short tutorial by the package author Ari Lamstein (youtube link here). To cement what I learned, I thought I would use this package to visualize flood insurance claims.

I am using the FIMA NFIP redacted claims dataset from FEMA, and it contains more than 2 million claims transactions going all the way back to 1970. (The dataset can be accessed from this URL.) From what I can tell, this dataset is updated somewhat regularly. For this post, the version of the dataset used was published on 26 Jun 2019, and I downloaded it on 10 Jul 2019. (Credit: I learned of this dataset from Jeremy Singer-Vine’s Data Is Plural newsletter, 2019.07.10 edition.)

To access all the code in this post in a single script, click here.

First, let’s load libraries and the dataset:

library(tidyverse)
library(choroplethr)
library(choroplethrMaps)


df <- read_csv("openfema_claims20190331.csv", 
               col_types = cols(asofdate = col_date(format = "%Y-%m-%d"), 
                                dateofloss = col_date(format = "%Y-%m-%d"), 
                                originalconstructiondate = col_date(format = "%Y-%m-%d"), 
                                originalnbdate = col_date(format = "%Y-%m-%d")))

I get some parsing failures, indicating that there might be some issues with the dataset, but I will ignore them for now since they are not relevant to this post. I end up with around 2.4 million observations of 39 variables. We are only going to be interested in a handful of variables. Here they are, along with the descriptions given in the accompanying data dictionary:

    • yearofloss: Year in which the flood loss occurred (YYYY).
    • state: The two-character alpha abbreviation of the state in which the insured property is located.
    • countycode: The Federal Information Processing Standard (FIPS) defined unique 5-digit code for the identification of counties and equivalent entities of the United States, its possessions, and insular areas. The NFIP relies on our geocoding service to assign county.
    • amountpaidonbuildingclaim: Dollar amount paid on the building claim. In some instances, a negative amount may appear which occurs when a check issued to a policy holder isn’t cashed and has to be re-issued.
    • amountpaidoncontentsclaim: Dollar amount paid on the contents claim. In some instances, a negative amount may appear, which occurs when a check issued to a policy holder isn’t cashed and has to be re-issued.
    • amountpaidonincreasedcostofcomplianceclaim: Dollar amount paid on the Increased Cost of Compliance (ICC) claim. Increased Cost of Compliance (ICC) coverage is one of several flood insurance resources for policyholders who need additional help rebuilding after a flood. It provides up to $30,000 to help cover the cost of mitigation measures that will reduce the flood risk. As we are not explicitly defining the building and contents components of the claim, I don’t think it is necessary to define ICC.

I didn’t understand the explanation for negative amounts in the claim fields. Since there were fewer than 20 records with such values, I decided to remove them. I also removed all rows without any state information (12 records).

# select just the columns we want and remove rows with no state info or
# negative claim amounts
small_df <- df %>% 
    select(year = yearofloss, state, countycode, 
           claim_bldg = amountpaidonbuildingclaim, 
           claim_contents = amountpaidoncontentsclaim, 
           claim_ICC = amountpaidonincreasedcostofcomplianceclaim) %>% 
    filter(!is.na(state)) %>%
    replace_na(list(claim_bldg = 0, claim_contents = 0, claim_ICC = 0)) %>%
    filter(!(claim_bldg < 0 | claim_contents < 0 | claim_ICC < 0))

The code below makes a histogram of the number of claims per year:

# histogram of year
with(small_df, hist(year, breaks = c(min(year):max(year)), 
                    main = "No. of claims / year", xlab = "Year"))

While there are supposed to be records from 1970, the records before 1978 seem to be pretty sparse. We also don’t have a full year of records for 2019 yet. As such, I filtered for just records between 1978 and 2018 (inclusive). I also assumed that the total claim amount was the sum of the 3 claim columns: this is the only claim amount I work with in this post.

small_df <- small_df %>% filter(year >= 1978 & year <= 2018) %>%
    mutate(total_cost = claim_bldg + claim_contents + claim_ICC)

Number of claims by state

For my first map, I wanted to show the number of claims made by state. To do that, I have to give the state_choropleth() function a very specific dataframe: it must have a region column with state names (e.g. “new york”, not “NY” or “New York”), and a value column with the number of claims. The dataset has states as two-letter abbreviations, so we have to map those abbreviations to the state names. Thankfully it is easy to do that with the mapvalues() function in the plyr package and the state.regions dataframe in the choroplethrMaps package:

state_no_claims_df <- small_df %>% group_by(state) %>%
    summarize(claim_cnt = n())

# map to state name
data(state.regions)
state_no_claims_df$region <- plyr::mapvalues(state_no_claims_df$state, 
                                             from = state.regions$abb, 
                                             to = state.regions$region)

# rename columns for choroplethr
names(state_no_claims_df) <- c("state", "value", "region")

# basic map
state_choropleth(state_no_claims_df)

Note that we get a warning because our dataframe has regions that are not mappable (AS, GU, PR, VI: presumably places like Guam and Puerto Rico). Nevertheless we get a colored map:

Not bad for a really short command! By default, the value column is split into 7 buckets (according to quantiles). To see which states are above/below the median number of claims, we can set the num_colors option to 2:

# above and below median
state_choropleth(state_no_claims_df, num_colors = 2)

For a continuous scale, set num_colors to 1 (if you have negative values, you can set num_colors to 0 for a diverging scale):

# continuous scale
state_choropleth(state_no_claims_df, num_colors = 1)

From the map above, it is clear that Louisiana had the largest number of claims, followed by Texas and Florida.

It is easy to add a title and legend label to the map:

# one with title
state_choropleth(state_no_claims_df, num_colors = 1, 
                 title = "No. of claims by state", legend = "No. of claims")

The output of state_choropleth() is a ggplot object, so we can modify it as we would other ggplot2 graphics. One bug I noticed was that in adding the color scale below, the legend name reverted to “value” (instead of “No. of claims”); I have not figured out how to amend this yet.

# choroplethr output is a ggplot object
gg1 <- state_choropleth(state_no_claims_df, num_colors = 1, 
                        title = "No. of claims by state", legend = "No. of claims")
class(gg1)
#> [1] "gg"     "ggplot"

gg1 + scale_fill_distiller(palette = "Spectral") +
    theme(plot.title = element_text(size = rel(2), face = "bold", hjust = 0.5))
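
(A possible workaround that I haven't tested extensively: give the replacement scale its own legend title via its name argument.)

# set the legend title on the new scale directly (untested workaround)
gg1 + scale_fill_distiller(palette = "Spectral", name = "No. of claims") +
    theme(plot.title = element_text(size = rel(2), face = "bold", hjust = 0.5))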

state_choropleth() allows us to plot just the states we want by passing a vector of states to the zoom option. In the code below, we just plot the states which border the Gulf of Mexico:

# zoom in on gulf states: have a shoreline on gulf of mexico
gulf_states <- c("texas", "louisiana", "mississippi", "alabama", "florida")
state_choropleth(state_no_claims_df, num_colors = 1, 
                 title = "No. of claims by state (Gulf states)", legend = "No. of claims",
                 zoom = gulf_states)

This last barplot shows the top 10 states by no. of claims (in descending order). 6 states have over 100,000 claims, with North Carolina barely hitting that number.

# bar plot of no. of claims
ggplot(data = state_no_claims_df %>% arrange(desc(value)) %>% head(n = 10)) +
    geom_bar(aes(x = reorder(state, -value), y = value), stat = "identity",
             fill = "gray", col = "black") +
    geom_abline(intercept = 100000, slope = 0, col = "red", lty = 2) +
    labs(x = "State", y = "No. of claims", title = "Top 10 states by no. of claims")

Total claim amount by state

This section doesn’t contain any new choroplethr syntax, but is for the completeness of the data story. Instead of looking at the number of claims by state, I look at the total claim amounts by state. My initial guess is that we shouldn’t see that much variation from the plots above.

The code below extracts the summary information from the claims level dataset and makes the choropleth map and the barplot (the red line indicates the $1 billion mark):

state_claims_amt_df <- small_df %>% group_by(state) %>%
    summarize(value = log10(sum(total_cost)))

# map to state name
state_claims_amt_df$region <- plyr::mapvalues(state_claims_amt_df$state, 
                                              from = state.regions$abb, 
                                              to = state.regions$region)

state_choropleth(state_claims_amt_df, num_colors = 1, 
                 title = "log10(Claim amount) by state", legend = "") +
    scale_fill_distiller(palette = "Spectral") +
    theme(plot.title = element_text(size = rel(2), face = "bold", hjust = 0.5))

# bar plot of claim amt
ggplot(data = state_claims_amt_df %>% arrange(desc(value)) %>% head(n = 10)) +
    geom_bar(aes(x = reorder(state, -value), y = 10^value), stat = "identity", 
             fill = "gray", col = "black") +
    geom_abline(intercept = 10^9, slope = 0, col = "red", lty = 2) +
    scale_y_continuous(labels = function(x) format(x, scientific = TRUE)) +
    labs(x = "State", y = "Total claims", title = "Top 10 states by total claims")

Perhaps unsurprisingly, the top 5 states by number of claims are also the top 5 states by total claim amount. What might be more surprising is how much more money was claimed in LA and TX compared to the other states.

Number of claims by county

choroplethr makes plotting by county really easy, as long as you have the 5-digit FIPS codes for the counties. Instead of the state_choropleth() function, use the county_choropleth() function. As before, the dataframe passed to this function must have a “region” column with the FIPS codes (as numbers, not strings), and a “value” column with the values to be plotted. The county.regions dataset in the choroplethrMaps package can help with the conversion of region values if necessary.

county_summary_df <- small_df %>% group_by(countycode) %>%
    summarize(claim_cnt = n(), claim_amt = sum(total_cost)) %>%
    mutate(countycode = as.numeric(countycode))

names(county_summary_df) <- c("region", "value", "claim_amt")

county_choropleth(county_summary_df)

As with state_choropleth(), the default is to split the values into 7 bins according to quantiles. We get a bunch of warnings because several counties have NA values. By default, these counties are filled in black. Since I think the black fill affects how we read the map in this case, I use the code below to change the NA color to white. This makes sense in this context too, since counties have NA values precisely because they have no claims.

# change fill color for NAs to be white
county_choropleth(county_summary_df, title = "No. of claims by county") +
    scale_fill_brewer(na.value = "white")

We can draw county maps for specific states as well. The code below makes two plots: the first zooms in on just Florida, the second zooms in on all the Gulf states.

# zoom into florida only
county_choropleth(county_summary_df, title = "No. of claims by county (Florida)",
                  state_zoom = "florida") +
    scale_fill_brewer(na.value = "white")

# zoom into gulf
county_choropleth(county_summary_df, title = "No. of claims by county (Gulf states)",
                  state_zoom = gulf_states) +
    scale_fill_brewer(na.value = "white")

Top 5 states: a (slightly) deeper look

In this section we take a temporal look at the top 5 states. The code below filters for them and summarizes the data:

top_states <- (state_no_claims_df %>% arrange(desc(value)) %>% head(n = 5))[[1]]
top_df <- small_df %>% filter(state %in% top_states)
top_regions <- plyr::mapvalues(top_states, from = state.regions$abb, 
                               to = state.regions$region)

state_yr_summary_df <- top_df %>% group_by(state, year) %>%
    summarize(claim_cnt = n(), claim_amt = sum(total_cost))

The first chunk of code plots the number of claims for each state for each year, while the second chunk does the same for total claim amount:

ggplot(data = state_yr_summary_df, aes(x = year, y = claim_cnt, col = state)) +
    geom_line() + geom_point() +
    scale_y_continuous(labels = function(x) format(x, scientific = TRUE)) +
    labs(x = "Year", y = "No. of claims", title = "No. of claims by state") +
    theme_bw()

ggplot(data = state_yr_summary_df, aes(x = year, y = claim_amt, col = state)) +
    geom_line() + geom_point() +
    scale_y_continuous(labels = function(x) format(x, scientific = TRUE)) +
    labs(x = "Year", y = "Claim amount", title = "Claim amount by state") +
    theme_bw()

It looks like for these states, most of the claims and most of the damage come from just one or two particular events. From the state and the year, you can probably guess the disaster that caused each spike.