Assume we are in the supervised learning setting, with being the matrix of predictors ( observations each with features) and being the response we are interested in predicting. Let denote the values for the th variable.
The lasso (Tibshirani 1996) solves the optimization problem
where is a hyperparameter which the user chooses. The lasso is a popular regression method, especially when is very large, because it often sets several of the entries in to be exactly zero.
What are the strong rules?
Suppose we knew beforehand that at the lasso solution, some set of variables would have their associated beta coefficient be 0. Then, in the optimization problem above, we could replace with a smaller matrix (i.e. with just the columns not in ) and with . This gives us an equivalent but smaller optimization problem to solve, resulting in faster computation.
Strong rules (Tibshirani et al 2012) are one way to choose a candidate set of variables beforehand. There are two versions of the strong rules. The basic strong rule discards predictor at the hyperparameter if
where is the smallest value such that the lasso solution .
The other version, the sequential strong rule, applies when we are computing the lasso solution for a path of lambda values (which is usually the case). If the lasso solution at is , then at the sequential strong rule discards predictor if
For , take . Note that when this happens we get the basic strong rule, i.e. the basic version is a special case of the sequential version.
Why use strong rules?
It is possible that the strong rule discards predictors that should actually be included. So why bother with them? Why not use “SAFE” rules instead (El Ghaoui et al. 2010), which never mistakenly discard a predictor that should be included? Tibshirani et al. give two reasons:
- While strong rules can make mistakes, in practice this appears to be rare. To avoid mistakes, we just need to check KKT conditions at the end, and add any predictors we missed out. On the other hand, they are able to discard a very large proportion of predictors (more so than SAFE rules).
- The motivating arguments behind the strong rules are simple and can be extended easily to a wide array of other methods.
Intuition behind the strong rules
For a given , the KKT conditions for the lasso problem are
where if , otherwise. Let (the LHS of the equation above). Assume that for each , (thought of as a function of ) is 1-Lipschitz, i.e.
for any . Then, assuming that the sequential strong rule holds, we have
which implies (by the KKT conditions) that . See Section 2.2 of the strong rules paper for further intuition.
Is the assumption that is 1-Lipschitz for all reasonable? In practice it appears to be a good heuristic. Section 3.3 of the strong rules paper presents some settings where it can be proven that the assumption holds true.
glmnet package in R which implements the elastic net uses strong rules to speed up the computation of the elastic net solution. It’s one reason why the
glmnet() function is so fast!
Note that if we are not using the default values for the
penalty.factor options, the strong rules are slightly different. The underlying code accounts for these changes.
- Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso.
- Tibshirani, R. et al. (2012). Strong rules for discarding predictors in lasso-type problems.
- El Ghaoui et al. (2010). Safe Feature Elimination for the LASSO and Sparse Supervised Learning Problems.