In R, the model.matrix function is used to create the design matrix for regression. In particular, it is used to expand factor variables into dummy variables (also known as “one-hot encoding”).
Let’s see this in action on the iris dataset:
    data(iris)
    str(iris)
    # 'data.frame': 150 obs. of 5 variables:
    #  $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
    #  $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
    #  $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
    #  $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
    #  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

    x <- model.matrix(Sepal.Length ~ Species, iris)
    head(x)
    #   (Intercept) Speciesversicolor Speciesvirginica
    # 1           1                 0                0
    # 2           1                 0                0
    # 3           1                 0                0
    # 4           1                 0                0
    # 5           1                 0                0
    # 6           1                 0                0
model.matrix returns a column of ones labeled (Intercept) by default. Also note that while the Species factor has 3 levels (“setosa”, “versicolor” and “virginica”), the return value of model.matrix only has dummy variables for the latter two levels. For a factor variable, model.matrix treats the first level of the factor (the first entry of levels(), not the first value that appears in the data) as the “baseline” level and does not produce a dummy variable for it. This is to avoid the problem of multi-collinearity: if all three dummy columns were included alongside the intercept, they would sum to the intercept column in every row.
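To check which level ends up as the baseline, here is a minimal sketch using base R's relevel(), assuming the default treatment contrasts are in effect. The names iris2 and x2 are illustrative only.

```r
data(iris)

# The baseline is the first entry of levels(), which is "setosa" here.
baseline <- levels(iris$Species)[1]

# relevel() moves the chosen level to the front, making it the new baseline.
iris2 <- iris
iris2$Species <- relevel(iris2$Species, ref = "virginica")

x2 <- model.matrix(Sepal.Length ~ Species, data = iris2)
colnames(x2)
# "(Intercept)" "Speciessetosa" "Speciesversicolor"
```

With "virginica" as the first level, it is the level that no longer gets a dummy column.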
However, there are situations where we might want dummy variables to be produced for all levels, including the baseline level. (For example, when we do regularized regression, multi-collinearity no longer implies that the model is unidentifiable.) We can induce this behavior by passing a specific value to the contrasts.arg argument:
    x <- model.matrix(
        Sepal.Length ~ Species,
        data = iris,
        contrasts.arg = list(Species = contrasts(iris$Species, contrasts = FALSE)))
    head(x)
    #   (Intercept) Speciessetosa Speciesversicolor Speciesvirginica
    # 1           1             1                 0                0
    # 2           1             1                 0                0
    # 3           1             1                 0                0
    # 4           1             1                 0                0
    # 5           1             1                 0                0
    # 6           1             1                 0                0
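To make the multi-collinearity concrete, the sketch below checks that the three Species dummy columns add up to the intercept column, so the 4-column matrix only has rank 3. It uses base R only; x_full and dummy_cols are illustrative names.

```r
data(iris)

x_full <- model.matrix(
    Sepal.Length ~ Species,
    data = iris,
    contrasts.arg = list(Species = contrasts(iris$Species, contrasts = FALSE)))

# Positions of the three Species dummy columns.
dummy_cols <- grep("^Species", colnames(x_full))

# Every row has exactly one dummy equal to 1, so the dummies sum to the intercept.
sums_match <- all(rowSums(x_full[, dummy_cols]) == x_full[, "(Intercept)"])
sums_match        # TRUE

ncol(x_full)      # 4 columns ...
qr(x_full)$rank   # ... but rank 3
```

This rank deficiency is why unpenalized least squares cannot pin down a unique coefficient vector for this matrix.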
Let’s have a closer look at what we passed as the value of Species in the list:
    contrasts(iris$Species, contrasts = FALSE)
    #            setosa versicolor virginica
    # setosa          1          0         0
    # versicolor      0          1         0
    # virginica       0          0         1
Notice that there are 3 columns, one for each level. If we didn’t pass this special value in, the default would have had just 2 columns, one for each non-baseline level:
    contrasts(iris$Species, contrasts = TRUE)
    #            versicolor virginica
    # setosa              0         0
    # versicolor          1         0
    # virginica           0         1
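As a quick check of how the two contrast matrices relate, the sketch below assumes R's default treatment contrasts for unordered factors. The names full and default are illustrative.

```r
data(iris)

full    <- contrasts(iris$Species, contrasts = FALSE)
default <- contrasts(iris$Species, contrasts = TRUE)

# With contrasts = FALSE, the matrix is an identity matrix: one column per level.
is_identity <- all(full == diag(nlevels(iris$Species)))

# The default matrix is the same thing with the baseline column dropped.
is_subset <- all(default == full[, -1])

c(is_identity, is_subset)   # TRUE TRUE
```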
It’s easy to modify the code above to include the baseline level for a different factor variable in another data frame. The code below is an example of how you can include the baseline level for all factor variables in the data frame.
    df <- data.frame(x = factor(rep(c("a", "b", "c"), times = 3)),
                     y = factor(rep(c("d", "e", "f"), times = 3)),
                     z = 1:9)
    str(df)
    # 'data.frame': 9 obs. of 3 variables:
    #  $ x: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3
    #  $ y: Factor w/ 3 levels "d","e","f": 1 2 3 1 2 3 1 2 3
    #  $ z: int 1 2 3 4 5 6 7 8 9

    # default: no dummy variable for baseline level
    x <- model.matrix(~ ., data = df)
    head(x)
    #   (Intercept) xb xc ye yf z
    # 1           1  0  0  0  0 1
    # 2           1  1  0  1  0 2
    # 3           1  0  1  0  1 3
    # 4           1  0  0  0  0 4
    # 5           1  1  0  1  0 5
    # 6           1  0  1  0  1 6

    # dummy variables for baseline levels included
    x <- model.matrix(
        ~ .,
        data = df,
        contrasts.arg = lapply(df[, sapply(df, is.factor), drop = FALSE],
                               contrasts, contrasts = FALSE))
    head(x)
    #   (Intercept) xa xb xc yd ye yf z
    # 1           1  1  0  0  1  0  0 1
    # 2           1  0  1  0  0  1  0 2
    # 3           1  0  0  1  0  0  1 3
    # 4           1  1  0  0  1  0  0 4
    # 5           1  0  1  0  0  1  0 5
    # 6           1  0  0  1  0  0  1 6
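If this pattern is needed in several places, the lapply() call can be wrapped in a small function. The sketch below is one possible way to do that; full_dummy_contrasts is a hypothetical helper name, not part of base R, and it assumes the data frame contains at least one factor column.

```r
# Hypothetical helper: build a contrasts.arg list that keeps every level
# of every factor column in `data`.
full_dummy_contrasts <- function(data) {
    is_fac <- vapply(data, is.factor, logical(1))
    lapply(data[, is_fac, drop = FALSE], contrasts, contrasts = FALSE)
}

df <- data.frame(x = factor(rep(c("a", "b", "c"), times = 3)),
                 y = factor(rep(c("d", "e", "f"), times = 3)),
                 z = 1:9)

x_all <- model.matrix(~ ., data = df, contrasts.arg = full_dummy_contrasts(df))
colnames(x_all)
# "(Intercept)" "xa" "xb" "xc" "yd" "ye" "yf" "z"

# Sanity check: each factor contributes exactly one 1 per row.
all(rowSums(x_all[, c("xa", "xb", "xc")]) == 1)   # TRUE
all(rowSums(x_all[, c("yd", "ye", "yf")]) == 1)   # TRUE
```

vapply() is used instead of sapply() here only so that the result is guaranteed to be a logical vector.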
References:
- StackOverflow. All Levels of a Factor in a Model Matrix in R.