Visualizing the relationship between multiple variables can get messy very quickly. This post is about how the `ggpairs()`

function in the `GGally`

package does this task, as well as my own method for visualizing pairwise relationships when all the variables are categorical.

For all the code in this post in one file, click here.

The `GGally::ggpairs()`

function does a really good job of visualizing the pairwise relationship for a group of variables. Let’s demonstrate this on a small segment of the `vehicles`

dataset from the `fueleconomy`

package:

library(fueleconomy) data(vehicles) df <- vehicles[1:100, ] str(df) # Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 100 obs. of 12 variables: # $ id : int 27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ... # $ make : chr "AM General" "AM General" "AM General" "AM General" ... # $ model: chr "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ... # $ year : int 1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ... # $ class: chr "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ... # $ trans: chr "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ... # $ drive: chr "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ... # $ cyl : int 4 4 6 6 4 6 6 4 4 6 ... # $ displ: num 2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ... # $ fuel : chr "Regular" "Regular" "Regular" "Regular" ... # $ hwy : int 17 17 13 13 17 13 21 26 28 26 ... # $ cty : int 18 18 13 13 16 13 14 20 22 18 ...

Let’s see how `GGally::ggpairs()`

visualizes relationships between quantitative variables:

library(GGally) quant_df <- df[, c("cyl", "hwy", "cty")] ggpairs(quant_df)

- Along the diagonal, we see a density plot for each of the variables.
- Below the diagonal, we see scatterplots for each pair of variables.
- Above the diagonal, we see the (Pearson) correlation between each pair of variables.

The visualization changes a little when we have a mix of quantitative and categorical variables. Below, `fuel`

is a categorical variable while `hwy`

is a quantitative variable.

mixed_df <- df[, c("fuel", "hwy")] ggpairs(mixed_df)

- For a categorical variable on the diagonal, we see a barplot depicting the number of times each category appears.
- In one of the corners (top-right), for each categorical value we have a boxplot for the quantitative variable.
- In one of the corners (bottom-left), for each categorical value we have a histogram for the quantitative variable.

The only behavior for `GGally::ggpairs()`

we haven’t observed yet is for a pair of categorical variables. In the code fragment below, all 3 variables are categorical:

cat_df <- df[, c("fuel", "make", "drive")] ggpairs(cat_df)

For each pair of categorical variables, we have a barplot depicting the number of times each pair of categorical values appears.

You may have noticed that the plots above the diagonal are essentially transposes of the plot below the diagonal, and so they don’t really convey any more information. What follows below is my attempt to make the plots above the diagonal more useful. **Instead of plotting the transpose barplot, I will plot a heatmap showing the relative proportion of observations having each pair of categorical values.**

First, the scaffold for the plot. I will use the `gridExtra`

package to put several `ggplot2`

objects together. The code below puts the same barplots below the diagonal, variable names along the diagonal, and empty canvases above the diagonal. (Notice that I need some tricks to make the barplots with the variables as strings, namely the use of `aes_string()`

and `as.formula()`

within `facet_grid()`

.)

library(gridExtra) library(tidyverse) grobs <- list() idx <- 0 for (i in 1:ncol(cat_df)) { for (j in 1:ncol(cat_df)) { idx <- idx + 1 # get feature names (note that i & j are reversed) x_feat <- names(cat_df)[j] y_feat <- names(cat_df)[i] if (i < j) { # empty canvas grobs[[idx]] <- ggplot() + theme_void() } else if (i == j) { # just the name of the variable label_df <- data.frame(x = -0, y = 0, label = x_feat) grobs[[idx]] <- ggplot(label_df, aes(x = x, y = y, label = label), fontface = "bold", hjust = 0.5) + geom_text() + coord_cartesian(xlim = c(-1, 1), ylim = c(-1, 1)) + theme_void() } else { # 2-dimensional barplot grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) + geom_bar() + facet_grid(as.formula(paste(y_feat, "~ ."))) + theme(legend.position = "none", axis.title = element_blank()) } } } grid.arrange(grobs = grobs, ncol = ncol(cat_df))

This is essentially showing the same information as `GGally::ggpairs()`

. To add the frequency proportion heatmaps, replace the code in the `(i < j)`

branch with the following:

# frequency proportion heatmap # get frequency proportions freq_df <- cat_df %>% group_by_at(c(x_feat, y_feat)) %>% summarize(proportion = n() / nrow(cat_df)) %>% ungroup() # get all pairwise combinations of values temp_df <- expand.grid(unique(cat_df[[x_feat]]), unique(cat_df[[y_feat]])) names(temp_df) <- c(x_feat, y_feat) # join to get frequency proportion temp_df <- temp_df %>% left_join(freq_df, by = c(setNames(x_feat, x_feat), setNames(y_feat, y_feat))) %>% replace_na(list(proportion = 0)) grobs[[idx]] <- ggplot(temp_df, aes_string(x = x_feat, y = y_feat)) + geom_tile(aes(fill = proportion)) + geom_text(aes(label = sprintf("%0.2f", round(proportion, 2)))) + scale_fill_gradient(low = "white", high = "#007acc") + theme(legend.position = "none", axis.title = element_blank())

Notice that each heatmap has its own limits for the color scale. If you want to have the same color scale for all the plots, you can add `limits = c(0, 1)`

to the `scale_fill_gradient()`

layer of the plot.

The one thing we lose here over the `GGally::ggpairs()`

version is the marginal barplot for each variable. This is easy to add but then we don’t really have a place to put the variable names. Replacing the code in the `(i == j)`

branch with the following is one possible option.

# df for positioning the variable name label_df <- data.frame(x = 0.5 + length(unique(cat_df[[x_feat]])) / 2, y = max(table(cat_df[[x_feat]])) / 2, label = x_feat) # marginal barplot with variable name on top grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) + geom_bar() + geom_label(data = label_df, aes(x = x, y = y, label = label), size = 5)

In this final version, we clean up some of the axes so that more of the plot space can be devoted to the plot itself, not the axis labels:

theme_update(legend.position = "none", axis.title = element_blank()) grobs <- list() idx <- 0 for (i in 1:ncol(cat_df)) { for (j in 1:ncol(cat_df)) { idx <- idx + 1 # get feature names (note that i & j are reversed) x_feat <- names(cat_df)[j] y_feat <- names(cat_df)[i] if (i < j) { # frequency proportion heatmap # get frequency proportions freq_df <- cat_df %>% group_by_at(c(x_feat, y_feat)) %>% summarize(proportion = n() / nrow(cat_df)) %>% ungroup() # get all pairwise combinations of values temp_df <- expand.grid(unique(cat_df[[x_feat]]), unique(cat_df[[y_feat]])) names(temp_df) <- c(x_feat, y_feat) # join to get frequency proportion temp_df <- temp_df %>% left_join(freq_df, by = c(setNames(x_feat, x_feat), setNames(y_feat, y_feat))) %>% replace_na(list(proportion = 0)) grobs[[idx]] <- ggplot(temp_df, aes_string(x = x_feat, y = y_feat)) + geom_tile(aes(fill = proportion)) + geom_text(aes(label = sprintf("%0.2f", round(proportion, 2)))) + scale_fill_gradient(low = "white", high = "#007acc") + theme(axis.ticks = element_blank(), axis.text = element_blank()) } else if (i == j) { # df for positioning the variable name label_df <- data.frame(x = 0.5 + length(unique(cat_df[[x_feat]])) / 2, y = max(table(cat_df[[x_feat]])) / 2, label = x_feat) # marginal barplot with variable name on top grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) + geom_bar() + geom_label(data = label_df, aes(x = x, y = y, label = label), size = 5) } else { # 2-dimensional barplot grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) + geom_bar() + facet_grid(as.formula(paste(y_feat, "~ ."))) + theme(axis.ticks.x = element_blank(), axis.text.x = element_blank()) } } } grid.arrange(grobs = grobs, ncol = ncol(cat_df))

Why not report generalized correlation coefficients from R package generalCorr

You only need the code gmcmtx0(mtx) and report both above diag and below diag values instead of Pearson corr. Pearson is very bad for nonlinear data x=1:20; y=sin(x); cor(x,y) is near zero

gmcmtx0(cbind(x,y)) has superior properties. Perfect dependence between x and y is correctly estimated.

LikeLiked by 1 person

Nice – I really like this visualisation when all vars are metrics. But what about when you have a categorical variable with, say 15 categories?

LikeLike

I think that a variable with many categories would be a problem for any pairs plot. In those cases I would suggest some coalescing of categories, or maybe just focus on the categories of most interest.

LikeLiked by 1 person