How to include all levels of a factor variable in a model matrix in R

In R, the model.matrix function is used to create the design matrix for regression. In particular, it expands factor variables into dummy variables (also known as “one-hot encoding”).

Let’s see this in action on the iris dataset:

data(iris)
str(iris)
# 'data.frame':	150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

x <- model.matrix(Sepal.Length ~ Species, iris)
head(x)
#   (Intercept) Speciesversicolor Speciesvirginica
# 1           1                 0                0
# 2           1                 0                0
# 3           1                 0                0
# 4           1                 0                0
# 5           1                 0                0
# 6           1                 0                0

model.matrix returns a column of ones labeled (Intercept) by default. Also note that while the Species factor has 3 levels (“setosa”, “versicolor” and “virginica”), the return value of model.matrix only has dummy variables for the latter two levels. Under R’s default treatment contrasts, model.matrix treats the first level of the factor (the first entry of levels(iris$Species), here “setosa”) as the “baseline” level and does not produce a dummy variable for it. This avoids perfect multi-collinearity: dummy variables for all three levels would sum to the intercept column.
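Since the baseline is determined by the order of the factor’s levels, one way to change it is to reorder the levels with relevel. The following is a minimal sketch on a copy of iris; the names iris_relevel and x_relevel are just illustrative.

```r
data(iris)

# Work on a copy so the original iris data frame is left untouched.
iris_relevel <- iris

# Move "virginica" to the front of the level order; it becomes the baseline.
iris_relevel$Species <- relevel(iris_relevel$Species, ref = "virginica")
levels(iris_relevel$Species)
# [1] "virginica"  "setosa"     "versicolor"

# The dummy variables are now for the two non-baseline levels.
x_relevel <- model.matrix(Sepal.Length ~ Species, iris_relevel)
colnames(x_relevel)
# [1] "(Intercept)"       "Speciessetosa"     "Speciesversicolor"
```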

However, there are situations where we might want dummy variables for all levels, including the baseline level. (For example, in regularized regression, multi-collinearity no longer implies unidentifiability of the model.) We can induce this behavior by passing a specific value to the contrasts.arg argument:

x <- model.matrix(
  Sepal.Length ~ Species,
  data = iris,
  contrasts.arg = list(Species = contrasts(iris$Species, contrasts = FALSE)))
head(x)
#   (Intercept) Speciessetosa Speciesversicolor Speciesvirginica
# 1           1             1                 0                0
# 2           1             1                 0                0
# 3           1             1                 0                0
# 4           1             1                 0                0
# 5           1             1                 0                0
# 6           1             1                 0                0
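To see the multi-collinearity concretely, we can check that the three dummy columns add up to the intercept column, so the matrix has 4 columns but only rank 3. This is a quick sketch using base R’s qr; the names x_full, dummy_sum and intercept_is_sum are illustrative.

```r
data(iris)

# Rebuild the matrix with a dummy column for every level of Species.
x_full <- model.matrix(
  Sepal.Length ~ Species,
  data = iris,
  contrasts.arg = list(Species = contrasts(iris$Species, contrasts = FALSE)))

# Each row has exactly one dummy equal to 1, so the dummies sum to the intercept.
dummy_sum <- rowSums(
  x_full[, c("Speciessetosa", "Speciesversicolor", "Speciesvirginica")])
intercept_is_sum <- all(dummy_sum == x_full[, "(Intercept)"])
intercept_is_sum
# [1] TRUE

# Four columns, but one of them is redundant.
ncol(x_full)
# [1] 4
qr(x_full)$rank
# [1] 3
```

An unpenalized fit such as lm would therefore not be able to estimate all four coefficients separately, which is why this matrix is mainly useful with penalized methods.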

Let’s have a closer look at what we passed as the value of Species in the list:

contrasts(iris$Species, contrasts = FALSE)
#            setosa versicolor virginica
# setosa          1          0         0
# versicolor      0          1         0
# virginica       0          0         1

Notice that there are 3 columns, one for each level. If we didn’t pass this special value in, the default contrast matrix would have had just 2 columns, one for each of the non-baseline levels we saw in the first model.matrix output:

contrasts(iris$Species, contrasts = TRUE)
#            versicolor virginica
# setosa              0         0
# versicolor          1         0
# virginica           0         1
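One way to think about the contrast matrix is as a lookup table: each observation’s dummy columns are the row of the contrast matrix that corresponds to its level. The sketch below checks this for the full (3-column) contrast matrix; cm, looked_up and same are illustrative names.

```r
data(iris)

# Full contrast matrix: one row and one column per level.
cm <- contrasts(iris$Species, contrasts = FALSE)

x_full <- model.matrix(
  Sepal.Length ~ Species,
  data = iris,
  contrasts.arg = list(Species = cm))

# Pick the contrast-matrix row for each observation via its integer level code.
looked_up <- cm[as.integer(iris$Species), ]

# Compare with the dummy columns of the model matrix (everything but the intercept).
same <- all(unname(looked_up) == unname(x_full[, -1, drop = FALSE]))
same
# [1] TRUE
```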

It’s easy to modify the code above to include the baseline level for a different factor variable in another data frame. The code below is an example of how you can include the baseline level for all factor variables in the data frame.

df <- data.frame(x = factor(rep(c("a", "b", "c"), times = 3)),
                 y = factor(rep(c("d", "e", "f"), times = 3)),
                 z = 1:9)
str(df)
# 'data.frame':	9 obs. of  3 variables:
#  $ x: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3
#  $ y: Factor w/ 3 levels "d","e","f": 1 2 3 1 2 3 1 2 3
#  $ z: int  1 2 3 4 5 6 7 8 9

# default: no dummy variable for baseline level
x <- model.matrix(~ ., data = df)
head(x)
#   (Intercept) xb xc ye yf z
# 1           1  0  0  0  0 1
# 2           1  1  0  1  0 2
# 3           1  0  1  0  1 3
# 4           1  0  0  0  0 4
# 5           1  1  0  1  0 5
# 6           1  0  1  0  1 6

# dummy variables for baseline levels included
x <- model.matrix(
  ~ .,
  data = df,
  contrasts.arg = lapply(df[, sapply(df, is.factor), drop = FALSE],
                         contrasts, contrasts = FALSE))
head(x)
#   (Intercept) xa xb xc yd ye yf z
# 1           1  1  0  0  1  0  0 1
# 2           1  0  1  0  0  1  0 2
# 3           1  0  0  1  0  0  1 3
# 4           1  1  0  0  1  0  0 4
# 5           1  0  1  0  0  1  0 5
# 6           1  0  0  1  0  0  1 6
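A related approach that sometimes comes up is to drop the intercept with - 1 in the formula. As far as we can tell, on its own this only yields a full set of dummy variables for the first factor in the formula, which is why the contrasts.arg approach is still needed when there are several factors. The sketch below compares the two on the same data frame; x_no_int and x_no_int_full are illustrative names.

```r
df <- data.frame(x = factor(rep(c("a", "b", "c"), times = 3)),
                 y = factor(rep(c("d", "e", "f"), times = 3)),
                 z = 1:9)

# Dropping the intercept alone: all levels for x, but y still loses its baseline.
x_no_int <- model.matrix(~ . - 1, data = df)
colnames(x_no_int)
# [1] "xa" "xb" "xc" "ye" "yf" "z"

# Dropping the intercept and passing full contrasts: all levels for x and y.
x_no_int_full <- model.matrix(
  ~ . - 1,
  data = df,
  contrasts.arg = lapply(df[, sapply(df, is.factor), drop = FALSE],
                         contrasts, contrasts = FALSE))
colnames(x_no_int_full)
# [1] "xa" "xb" "xc" "yd" "ye" "yf" "z"
```

Note that the second matrix is still rank-deficient, because the x dummies and the y dummies each sum to a column of ones.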

References:

  1. StackOverflow. All Levels of a Factor in a Model Matrix in R.