tidyr::complete to show all possible combinations of variables

This is an issue I often face, so I thought it best to write it down. When doing data analysis, we often want to known how many observations there are in each subgroup. These subgroups can be defined by multiple variables. In the code example below, I want to know how many vehicles there are for each (cyl, gear) combination:

library(tidyverse)
data(mtcars)
mtcars %>%
    group_by(cyl, gear) %>%
    summarize(count = n())

# # A tibble: 8 x 3
# # Groups:   cyl [3]
#     cyl  gear count
#   <dbl> <dbl> <int>
# 1     4     3     1
# 2     4     4     8
# 3     4     5     2
# 4     6     3     2
# 5     6     4     4
# 6     6     5     1
# 7     8     3    12
# 8     8     5     2

If you look carefully, you will notice that there are no vehicles with cyl == 8 and gear == 4. In general it’s probably better to include this combination as a row in the tibble, with count as 0. This is especially important in data pipelines where future processes might expect there to be length(unique(cyl)) * length(unique(gear)) rows in the dataset.

We can achieve this by ungrouping the dataset and applying tidyr::complete(). This ensures that every possible (cyl, gear) combination gets a row.

mtcars %>%
    group_by(cyl, gear) %>%
    summarize(count = n()) %>%
    ungroup() %>%
    complete(cyl, gear)

# # A tibble: 9 x 3
#     cyl  gear count
#   <dbl> <dbl> <int>
# 1     4     3     1
# 2     4     4     8
# 3     4     5     2
# 4     6     3     2
# 5     6     4     4
# 6     6     5     1
# 7     8     3    12
# 8     8     4    NA
# 9     8     5     2

For rows that didn’t appear in the original summary table, complete() fills up the remaining columns with NA. We can specify the value complete() should use to fill in these cells with the fill option:

mtcars %>%
    group_by(cyl, gear) %>%
    summarize(count = n()) %>%
    ungroup() %>%
    complete(cyl, gear, fill = list(count = 0))

# # A tibble: 9 x 3
#     cyl  gear count
#   <dbl> <dbl> <int>
# 1     4     3     1
# 2     4     4     8
# 3     4     5     2
# 4     6     3     2
# 5     6     4     4
# 6     6     5     1
# 7     8     3    12
# 8     8     4     0
# 9     8     5     2

References:

  1. Reddit. Need help with dplyr: Show all possible group combinations.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s