A deep dive into glmnet: penalty.factor

The glmnet function (from the package of the same name) is probably the most used function for fitting the elastic net model in R. (It also fits the lasso and ridge regression, since they are special cases of elastic net.) The glmnet function is very powerful and has several options that users may not know about. In this series of posts, I hope to shed some light on what these options do.

Here is the full signature of the glmnet function (v3.0-2):

glmnet(x, y, family = c("gaussian", "binomial", "poisson", "multinomial",
  "cox", "mgaussian"), weights, offset = NULL, alpha = 1,
  nlambda = 100, lambda.min.ratio = ifelse(nobs < nvars, 0.01, 1e-04),
  lambda = NULL, standardize = TRUE, intercept = TRUE,
  thresh = 1e-07, dfmax = nvars + 1, pmax = min(dfmax * 2 + 20,
  nvars), exclude, penalty.factor = rep(1, nvars), lower.limits = -Inf,
  upper.limits = Inf, maxit = 1e+05, type.gaussian = ifelse(nvars <
  500, "covariance", "naive"), type.logistic = c("Newton",
  "modified.Newton"), standardize.response = FALSE,
  type.multinomial = c("ungrouped", "grouped"), relax = FALSE,
  trace.it = 0, ...)

In this post, we will focus on the penalty.factor option.

Unless otherwise stated, n will denote the number of observations, p will denote the number of features, and fit will denote the output/result of the glmnet call.


When this option is not set, for each value of \lambda in lambda, glmnet minimizes the following objective function, where RSS is the residual sum of squares and \alpha is the value of the alpha option:

\begin{aligned} \underset{\beta}{\text{minimize}} \quad \frac{1}{2}\frac{\text{RSS}}{n} + \lambda \displaystyle\sum_{j=1}^p \left(\frac{1 - \alpha}{2}\|\beta_j \|_2^2 + \alpha \|\beta_j \|_1 \right). \end{aligned}

When the option is set to a vector c(c_1, ..., c_p), glmnet minimizes the following objective instead:

\begin{aligned} \underset{\beta}{\text{minimize}} \quad \frac{1}{2}\frac{\text{RSS}}{n} + \lambda \displaystyle\sum_{j=1}^p c_j \left(\frac{1 - \alpha}{2}\|\beta_j \|_2^2 + \alpha \|\beta_j \|_1 \right). \end{aligned}
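To make the role of the c_j concrete, here is a minimal sketch on simulated data. The seed, the data and the names pf and fit_pf are my own assumptions for illustration. It sets the factor for the last feature to 0, which the glmnet documentation describes as no shrinkage for that variable, so that variable should stay in the model along the whole path.

```r
library(glmnet)

set.seed(42)  # assumed seed, only for reproducibility of this sketch
n <- 100; p <- 5
X <- matrix(rnorm(n * p), nrow = n)
y <- X[, 1] + X[, 2] + 3 * rnorm(n)

# c_5 = 0: feature 5 is left unpenalized; the other features share one weight
pf <- c(1, 1, 1, 1, 0)
fit_pf <- glmnet(X, y, penalty.factor = pf)

# Coefficients at the first (largest) lambda on the path: the penalized
# features are expected to be zero there, the unpenalized one is not
fit_pf$beta[, 1]
```

Inspecting later columns of fit_pf$beta shows the penalized features entering the model as lambda decreases.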

The documentation states that “the penalty factors are internally rescaled to sum to nvars and the lambda sequence will reflect this change.” However, in my own experiments, the penalty factors do appear to be rescaled to sum to nvars, but the lambda sequence stays the same. Let’s generate some data:

library(glmnet)

n <- 100; p <- 5; true_p <- 2
X <- matrix(rnorm(n * p), nrow = n)
beta <- matrix(c(rep(1, true_p), rep(0, p - true_p)), ncol = 1)
y <- X %*% beta + 3 * rnorm(n)

We fit two models: fit, which uses the default options for glmnet, and fit2, which sets penalty.factor = rep(2, 5):

fit <- glmnet(X, y)
fit2 <- glmnet(X, y, penalty.factor = rep(2, 5))

What we find is that these two models have the exact same lambda sequence and produce the same beta coefficients.

sum(fit$lambda != fit2$lambda)
# [1] 0
sum(fit$beta != fit2$beta)
# [1] 0

The same thing happens when we supply our own lambda sequence:

fit3 <- glmnet(X, y, lambda = c(1, 0.1, 0.01), penalty.factor = rep(10, 5))
fit4 <- glmnet(X, y, lambda = c(1, 0.1, 0.01), penalty.factor = rep(1, 5))
sum(fit3$lambda != fit4$lambda)
# [1] 0
sum(fit3$beta != fit4$beta)
# [1] 0

Hence, my conclusion is that if penalty.factor is set to c(c_1, ..., c_p), glmnet is really minimizing

\begin{aligned} \underset{\beta}{\text{minimize}} \quad \frac{1}{2}\frac{\text{RSS}}{n} + \lambda \displaystyle\sum_{j=1}^p \frac{c_j}{\bar{c}} \left(\frac{1 - \alpha}{2}\|\beta_j \|_2^2 + \alpha \|\beta_j \|_1 \right), \end{aligned}

where \bar{c} = \frac{1}{p}\sum_{j=1}^p c_j.
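The experiments above only use constant penalty factors, for which c_j / \bar{c} = 1 for every j. A further check, sketched below under my own assumptions (the seed, the vector pf and the names fit_a, fit_b, fit_c are illustrative), uses unequal factors: if the rescaled objective is the one being minimized, multiplying all factors by a constant, or dividing them by their mean beforehand, should leave the coefficients unchanged up to floating-point error.

```r
library(glmnet)

set.seed(1)  # assumed seed, only for reproducibility of this sketch
n <- 100; p <- 5
X <- matrix(rnorm(n * p), nrow = n)
y <- X[, 1] + X[, 2] + 3 * rnorm(n)

lam <- c(1, 0.1, 0.01)
pf <- c(1, 2, 3, 4, 5)  # unequal penalty factors, chosen for illustration

fit_a <- glmnet(X, y, lambda = lam, penalty.factor = pf)
fit_b <- glmnet(X, y, lambda = lam, penalty.factor = 10 * pf)        # same ratios, larger scale
fit_c <- glmnet(X, y, lambda = lam, penalty.factor = pf / mean(pf))  # already sums to p

# Both comparisons should return TRUE if only the ratios c_j / mean(c) matter
all.equal(as.matrix(fit_a$beta), as.matrix(fit_b$beta))
all.equal(as.matrix(fit_a$beta), as.matrix(fit_c$beta))
```

Comparing fit_a against a fit with the default penalty.factor on the same lam is a useful contrast, since there the ratios differ and the coefficients generally will too.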

4 thoughts on “A deep dive into glmnet: penalty.factor”

    • Do you mean the elastic net penalty or the version with different weights on each coefficient?

      The lasso penalty induces sparsity of coefficients but has some drawbacks (e.g. lasso selects at most n variables, can have trouble if there are features with very high pairwise correlation, poor prediction performance in presence of high correlations). Elastic net is a variant that addresses some of these issues.

      As for the modified penalty, you may want to use this if you believe that some features are less informative than others and so you want to penalize them more. In particular, if you always want to keep a feature in the model, you can set its penalty.factor to 0.

      (As an aside, the SLOPE procedure uses a similar penalty, but there the coefficients are in sorted order. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4689150/)


      • Thanks, that makes sense!! I was thinking of the modified penalty.

        Aren’t we usually better off just letting the GLM manage the shrinkage itself, without interfering in this way?

        I’m a seasoned practitioner with GLMNet, but am not an expert on all the theory!


  1. Well, sometimes if we have additional side information on the features, it may be worth trying to incorporate some of that in the model fitting.

    Another use of this penalty is when we are trying to use glmnet as an intermediate step in a more complicated model. And I think having this option doesn’t make glmnet any slower, so why not make glmnet more flexible 🙂
