How to include all levels of a factor variable in a model matrix in R

In R, the `model.matrix` function is used to create the design matrix for regression. In particular, it is used to expand factor variables into dummy variables (also known as “one-hot encoding”).

Let’s see this in action on the `iris` dataset:

```
data(iris)
str(iris)
# 'data.frame':	150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

x <- model.matrix(Sepal.Length ~ Species, iris)
head(x)
#   (Intercept) Speciesversicolor Speciesvirginica
# 1           1                 0                0
# 2           1                 0                0
# 3           1                 0                0
# 4           1                 0                0
# 5           1                 0                0
# 6           1                 0                0
```

`model.matrix` returns a column of ones labeled `(Intercept)` by default. Also note that while the `Species` factor has 3 levels (“setosa”, “versicolor” and “virginica”), the return value of `model.matrix` only has dummy variables for the latter two levels. For a factor variable, `model.matrix` treats the first level of the factor (the first entry of `levels()`, here “setosa”) as the “baseline” level and does not produce a dummy variable for it. This avoids perfect multi-collinearity: with an intercept in the model, the dummy variables for all levels would sum to the intercept column.
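Because the baseline is determined by the order of `levels()`, it can be changed by reordering the levels. The snippet below is a minimal sketch using base R's `relevel`; the copy `iris_releveled` is just an illustrative name.

```r
data(iris)

# The baseline is the first entry of levels(), not the first value seen in the data.
levels(iris$Species)
# "setosa" "versicolor" "virginica"

# Work on a copy so the original iris data frame is left unchanged.
iris_releveled <- iris
iris_releveled$Species <- relevel(iris_releveled$Species, ref = "virginica")

# "virginica" is now the baseline, so dummy columns exist only for the other two levels.
colnames(model.matrix(Sepal.Length ~ Species, iris_releveled))
# "(Intercept)" "Speciessetosa" "Speciesversicolor"
```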

However, there are situations where we might want dummy variables to be produced for all levels, including the baseline level. (For example, in regularized regression, where multi-collinearity no longer implies unidentifiability of the model.) We can induce this behavior by passing a specific value to the `contrasts.arg` argument:

```
x <- model.matrix(
  Sepal.Length ~ Species,
  data = iris,
  contrasts.arg = list(Species = contrasts(iris$Species, contrasts = FALSE)))
head(x)
#   (Intercept) Speciessetosa Speciesversicolor Speciesvirginica
# 1           1             1                 0                0
# 2           1             1                 0                0
# 3           1             1                 0                0
# 4           1             1                 0                0
# 5           1             1                 0                0
# 6           1             1                 0                0
```
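To make the multi-collinearity concrete, the following sketch checks that the three dummy columns add up to the intercept column, so the matrix has four columns but rank three. The names `x_full` and `dummy_cols` are illustrative only.

```r
data(iris)

x_full <- model.matrix(
  Sepal.Length ~ Species,
  data = iris,
  contrasts.arg = list(Species = contrasts(iris$Species, contrasts = FALSE)))

# Every row belongs to exactly one species, so the three dummy columns
# add up to the intercept column.
dummy_cols <- c("Speciessetosa", "Speciesversicolor", "Speciesvirginica")
all(rowSums(x_full[, dummy_cols]) == x_full[, "(Intercept)"])
# TRUE

# Four columns, but only three of them are linearly independent.
ncol(x_full)
# 4
qr(x_full)$rank
# 3
```

An unregularized least-squares fit on this matrix would not have a unique solution, which is why the baseline level is dropped by default.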

Let’s have a closer look at what we passed as the value of `Species` in the list:

```
contrasts(iris$Species, contrasts = FALSE)
#            setosa versicolor virginica
# setosa          1          0         0
# versicolor      0          1         0
# virginica       0          0         1
```

Notice that there are 3 columns, one for each level. If we didn’t pass this special value in, the default would have had just 2 columns, one for each non-baseline level, which are the levels we saw in the first output:

```
contrasts(iris$Species, contrasts = TRUE)
#            versicolor virginica
# setosa              0         0
# versicolor          1         0
# virginica           0         1
```
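The two matrices above can be related to more familiar objects. As a rough check, and assuming the session uses R's default `options("contrasts")` (treatment contrasts for unordered factors), the full version is an identity matrix with level names attached, and the default version matches `contr.treatment`:

```r
data(iris)
k <- nlevels(iris$Species)

# contrasts = FALSE: a k x k identity matrix, one indicator column per level.
full <- contrasts(iris$Species, contrasts = FALSE)
all.equal(unname(full), diag(k))
# TRUE

# contrasts = TRUE: assumed here to be treatment contrasts (the R default),
# which drop the column for the first level.
reduced <- contrasts(iris$Species, contrasts = TRUE)
all.equal(reduced, contr.treatment(levels(iris$Species)))
# TRUE under the default options("contrasts")
```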

It’s easy to modify the code above to include the baseline level for a different factor variable in another data frame. The code below is an example of how you can include the baseline level for all factor variables in the data frame.

```
df <- data.frame(x = factor(rep(c("a", "b", "c"), times = 3)),
                 y = factor(rep(c("d", "e", "f"), times = 3)),
                 z = 1:9)
str(df)
# 'data.frame':	9 obs. of  3 variables:
#  $ x: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3
#  $ y: Factor w/ 3 levels "d","e","f": 1 2 3 1 2 3 1 2 3
#  $ z: int  1 2 3 4 5 6 7 8 9

# default: no dummy variable for baseline level
x <- model.matrix(~ ., data = df)
head(x)
#   (Intercept) xb xc ye yf z
# 1           1  0  0  0  0 1
# 2           1  1  0  1  0 2
# 3           1  0  1  0  1 3
# 4           1  0  0  0  0 4
# 5           1  1  0  1  0 5
# 6           1  0  1  0  1 6

# dummy variables for baseline levels included
x <- model.matrix(
  ~ .,
  data = df,
  contrasts.arg = lapply(df[, sapply(df, is.factor), drop = FALSE],
                         contrasts, contrasts = FALSE))
head(x)
#   (Intercept) xa xb xc yd ye yf z
# 1           1  1  0  0  1  0  0 1
# 2           1  0  1  0  0  1  0 2
# 3           1  0  0  1  0  0  1 3
# 4           1  1  0  0  1  0  0 4
# 5           1  0  1  0  0  1  0 5
# 6           1  0  0  1  0  0  1 6
```
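If this pattern is needed in several places, it can be wrapped in a small function. The sketch below uses a hypothetical helper name, `model_matrix_all_levels`, which is not part of base R; it only picks up columns that are already factors. It also shows why simply removing the intercept from the formula is not a substitute when there is more than one factor.

```r
# Hypothetical helper (not part of base R): build a model matrix in which
# every factor column of `data` keeps a dummy variable for each of its levels.
model_matrix_all_levels <- function(data, formula = ~ .) {
  is_fac <- vapply(data, is.factor, logical(1))
  model.matrix(
    formula,
    data = data,
    contrasts.arg = lapply(data[is_fac], contrasts, contrasts = FALSE))
}

df <- data.frame(x = factor(rep(c("a", "b", "c"), times = 3)),
                 y = factor(rep(c("d", "e", "f"), times = 3)),
                 z = 1:9)

x_all <- model_matrix_all_levels(df)
colnames(x_all)
# "(Intercept)" "xa" "xb" "xc" "yd" "ye" "yf" "z"

# Dropping the intercept in the formula alone is not enough: only the
# first factor then keeps all of its levels.
colnames(model.matrix(~ . - 1, data = df))
# "xa" "xb" "xc" "ye" "yf" "z"
```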

References:

1. StackOverflow. All Levels of a Factor in a Model Matrix in R.