Strong rules for the lasso

Assume we are in the supervised learning setting, with {\bf X} \in \mathbb{R}^{n \times p} being the matrix of predictors (n observations each with p features) and y \in \mathbb{R}^n being the response we are interested in predicting. Let x_j \in \mathbb{R}^n denote the values for the jth variable.

The lasso (Tibshirani 1996) solves the optimization problem

\begin{aligned} \hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \: \frac{1}{2} \| y - {\bf X} \beta \|_2^2 + \lambda \| \beta \|_1, \end{aligned}

where \lambda \geq 0 is a hyperparameter which the user chooses. The lasso is a popular regression method, especially when p is very large, because it often sets several of the entries in \hat{\beta} to be exactly zero.

What are the strong rules?

Suppose we knew beforehand that at the lasso solution, some set of variables \mathcal{D} \subseteq \{ 1, \dots, p \} would have their associated beta coefficient be 0. Then, in the optimization problem above, we could replace {\bf X} with a smaller matrix {\bf X}_{\mathcal{D}^c} (i.e. \bf X with just the columns not in \mathcal{D}) and \beta with \beta_{\mathcal{D}^c}. This gives us an equivalent but smaller optimization problem to solve, resulting in faster computation.

Strong rules (Tibshirani et al 2012) are one way to choose a candidate set of variables \mathcal{D} beforehand. There are two versions of the strong rules. The basic strong rule discards predictor j at the hyperparameter \lambda if

\begin{aligned} |x_j^T y| < 2 \lambda - \lambda_{max}, \end{aligned}

where \lambda_{max} = \max_j |x_j^T y| is the smallest \lambda value for which the lasso solution is identically zero (i.e. \hat{\beta}(\lambda) = 0 for all \lambda \geq \lambda_{max}).

The other version, the sequential strong rule, applies when we are computing the lasso solution for a path of lambda values \lambda_1 > \dots > \lambda_m (which is usually the case). If the lasso solution at \lambda_{k-1} is \hat{\beta}(\lambda_{k-1}), then at \lambda_k the sequential strong rule discards predictor j if

\begin{aligned} |x_j^T [y - {\bf X} \hat{\beta}(\lambda_{k-1})]| < 2 \lambda_k - \lambda_{k-1}. \end{aligned}

For k = 1, take \lambda_0 = \lambda_{max}. Since \hat{\beta}(\lambda_{max}) = 0, the sequential rule then reduces to the basic strong rule, i.e. the basic version is a special case of the sequential version.
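
To make the rules concrete, here is a minimal R sketch of the screening step (an illustration only, not the implementation used in glmnet; the helper name seq_strong_keep is ours):

# A minimal sketch of the sequential strong rule screen. beta_prev is the lasso
# solution at the previous value lambda_prev; the function returns the indices
# of the predictors that are NOT discarded at lambda_new.
seq_strong_keep <- function(X, y, beta_prev, lambda_prev, lambda_new) {
  resid <- y - X %*% beta_prev                   # residual at the previous solution
  scores <- abs(drop(crossprod(X, resid)))       # |x_j^T (y - X beta_hat(lambda_prev))|
  which(scores >= 2 * lambda_new - lambda_prev)  # discard j if score < 2 lambda_new - lambda_prev
}

# The basic strong rule is the special case beta_prev = 0, lambda_prev = lambda_max:
#   lambda_max <- max(abs(crossprod(X, y)))
#   keep <- seq_strong_keep(X, y, rep(0, ncol(X)), lambda_max, lambda_new)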

Why use strong rules?

It is possible for the strong rules to discard predictors that should actually be included. So why bother with them? Why not use “SAFE” rules instead (El Ghaoui et al. 2010), which never mistakenly discard a predictor that should be included? Tibshirani et al. give two reasons:

  1. While strong rules can make mistakes, in practice this appears to be rare. To guard against mistakes, we simply check the KKT conditions at the end and add back any predictors that were wrongly discarded. At the same time, strong rules discard a much larger proportion of predictors than SAFE rules do.
  2. The motivating arguments behind the strong rules are simple and can be extended easily to a wide array of other methods.

Intuition behind the strong rules

For a given \lambda, the KKT conditions for the lasso problem are

\begin{aligned} x_j^T [y - {\bf X} \hat{\beta}(\lambda)] = \lambda \gamma_j(\lambda), \qquad j = 1, \dots, p, \end{aligned}

where \gamma_j(\lambda) = \text{sign}(\hat{\beta}_j(\lambda)) if \hat{\beta}_j (\lambda) \neq 0, \gamma_j(\lambda) \in [-1, 1] otherwise. Let c_j(\lambda) = x_j^T [y - {\bf X} \hat{\beta}(\lambda)] (the LHS of the equation above). Assume that for each j, c_j (thought of as a function of \lambda) is 1-Lipschitz, i.e.

\begin{aligned} |c_j(\lambda) - c_j(\tilde{\lambda})| \leq |\lambda - \tilde{\lambda}| \end{aligned}

for any \lambda, \tilde{\lambda}. Then, if the condition in the sequential strong rule holds for predictor j at \lambda_k (i.e. |c_j(\lambda_{k-1})| < 2\lambda_k - \lambda_{k-1}), we have

\begin{aligned} |c_j(\lambda_k)| &\leq |c_j(\lambda_k) - c_j(\lambda_{k-1})| + |c_j(\lambda_{k-1})| \\ &< |\lambda_k - \lambda_{k-1}| + (2\lambda_k - \lambda_{k-1}) \\ &= (\lambda_{k-1} - \lambda_k) + (2\lambda_k - \lambda_{k-1}) \\ &= \lambda_k,  \end{aligned}

which implies that \hat{\beta}_j (\lambda_k) = 0: by the KKT conditions, \hat{\beta}_j(\lambda_k) \neq 0 would require |c_j(\lambda_k)| = \lambda_k. See Section 2.2 of the strong rules paper for further intuition.

Is the assumption that c_j is 1-Lipschitz for all j reasonable? In practice it appears to be a good heuristic. Section 3.3 of the strong rules paper presents some settings where it can be proven that the assumption holds true.

Computation

The glmnet package in R, which fits the lasso and elastic net, uses strong rules to speed up computation of the solution path. It’s one reason why the glmnet() function is so fast!

Note that if we are not using the default values for the weights and/or penalty.factor options, the strong rules are slightly different. The underlying code accounts for these changes.
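
As a rough illustration (assuming the glmnet package is installed), the sketch below fits a lasso path and counts how often the sequential strong rule would discard a predictor that is actually active at the next \lambda. We set standardize = FALSE and intercept = FALSE so that glmnet's objective, \frac{1}{2n} \|y - {\bf X}\beta\|_2^2 + \lambda \|\beta\|_1, matches the formulas above up to the 1/n scaling of the inner products:

library(glmnet)

set.seed(1)
n <- 200; p <- 1000
X <- matrix(rnorm(n * p), nrow = n)
y <- drop(X[, 1:5] %*% rep(1, 5)) + rnorm(n)

fit <- glmnet(X, y, standardize = FALSE, intercept = FALSE)
lam <- fit$lambda
violations <- 0
for (k in 2:length(lam)) {
  resid <- y - X %*% fit$beta[, k - 1]
  scores <- abs(drop(crossprod(X, resid))) / n   # 1/n scaling to match glmnet's lambda
  discarded <- scores < 2 * lam[k] - lam[k - 1]
  active <- fit$beta[, k] != 0
  violations <- violations + sum(discarded & active)
}
violations   # number of predictors the rule would have wrongly discarded (rare in practice)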

References:

  1. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso.
  2. Tibshirani, R. et al. (2012). Strong rules for discarding predictors in lasso-type problems.
  3. El Ghaoui et al. (2010). Safe Feature Elimination for the LASSO and Sparse Supervised Learning Problems.

Orthogonal polynomials for random variables

Two polynomials p(x) and q(x) are said to be orthogonal to each other w.r.t. some inner product \langle \cdot, \cdot \rangle if \langle p(x), q(x) \rangle = 0. A sequence of polynomials p_1(x), p_2(x), \dots form an orthogonal polynomial sequence if for any i \neq j, \langle p_i(x), p_j(x) \rangle = 0. If we have in addition that \langle p_i(x), p_i(x) \rangle = 1 for all i, the sequence is said to be orthonormal. (Note: Some references assume this latter condition holds even when they call the sequence “orthogonal”.)

For any probability distribution on the real line which has an associated probability density function w, we can define an inner product on the space of polynomials by

\begin{aligned} \langle p(x), q(x) \rangle = \int p(x)q(x) w(x) dx. \end{aligned}

(Note: The inner product can be defined more generally; see reference 1 for details.)

Different choices of probability distribution lead to different families of orthogonal polynomials. The table below shows the orthogonal polynomials for some special probability distributions (see reference 2 for a few more):

Distribution                    Orthogonal polynomial family
Normal \mathcal{N}(0, 1)        Hermite polynomials
Uniform \mathcal{U}(-1, 1)      Legendre polynomials
Gamma \Gamma(k, 1)              Laguerre polynomials
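
For example, we can check numerically that the first few (probabilists') Hermite polynomials, He_1(x) = x, He_2(x) = x^2 - 1 and He_3(x) = x^3 - 3x, are orthogonal with respect to the standard normal density. A quick sketch using R's integrate():

# inner product <p, q> = integral of p(x) q(x) w(x) dx, with w the standard normal density
ip <- function(p, q) {
  integrate(function(x) p(x) * q(x) * dnorm(x), lower = -Inf, upper = Inf)$value
}

He1 <- function(x) x
He2 <- function(x) x^2 - 1
He3 <- function(x) x^3 - 3 * x

ip(He1, He2)   # ~0
ip(He1, He3)   # ~0
ip(He2, He3)   # ~0
ip(He2, He2)   # ~2: orthogonal, but not orthonormal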

Interestingly, any family of orthogonal polynomials satisfies the following recurrence relation:

\begin{aligned} p_{n+1}(x) = (a_n x + b_n) p_n(x) + c_n p_{n-1}(x). \end{aligned}

Different orthogonal families correspond to different choices of \{ a_n \}, \{ b_n \} and \{ c_n \} (reference 2 lists these recurrence coefficients for the commonly used families).
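
As a quick sanity check of the recurrence, the sketch below evaluates polynomials via the three-term recurrence with the Legendre coefficients a_n = (2n+1)/(n+1), b_n = 0, c_n = -n/(n+1) (quoted here for illustration; recur_eval is our own helper) and compares the result with the closed form P_3(x) = (5x^3 - 3x)/2:

# evaluate p_0, ..., p_N at the points x via
# p_{n+1}(x) = (a_n x + b_n) p_n(x) + c_n p_{n-1}(x), starting from p_0 = 1 and p_1
recur_eval <- function(x, N, p1, a, b, cc) {
  P <- matrix(0, nrow = length(x), ncol = N + 1)
  P[, 1] <- 1          # p_0(x) = 1
  P[, 2] <- p1(x)      # p_1(x)
  for (n in seq_len(N - 1)) {
    P[, n + 2] <- (a(n) * x + b(n)) * P[, n + 1] + cc(n) * P[, n]
  }
  P
}

x <- seq(-1, 1, length.out = 5)
P <- recur_eval(x, N = 3, p1 = function(x) x,
                a = function(n) (2 * n + 1) / (n + 1),
                b = function(n) 0,
                cc = function(n) -n / (n + 1))
cbind(recurrence = P[, 4], closed_form = (5 * x^3 - 3 * x) / 2)   # the columns should match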

References:

  1. Wikipedia. Orthogonal polynomials: Definition for 1-variable case for a real measure.
  2. OpenTURNS. Orthogonal polynomials.

Eigenvalues of random matrices and Wigner’s semicircle law

Random matrix theory is concerned with the study of properties of large matrices whose entries are sampled according to some known probability distribution. Usually we are interested in objects like their eigenvalues and eigenvectors.

What is interesting is that the asymptotic behavior of the random matrices is often universal, i.e. it does not depend on the probability distribution of the entries. Wigner’s semicircle law is a foundational example of this universality. Reference 1 goes so far as to say that “the semicircle law is as important to random matrix theory as the central limit theorem is to scalar probability theory”. That’s a pretty big claim!

First, let’s define the sequence of matrices for which the semicircle law applies:

Definition (Wigner matrices): Fix an infinite family of real-valued random variables \{ Y_{ij} \}_{j \geq i \geq 1}. Define a sequence of symmetric random matrices {\bf Y}_n \in \mathbb{R}^{n \times n} by

\begin{aligned} ({\bf Y}_n)_{ij} = \begin{cases} Y_{ij} &\text{if } i \leq j, \\ Y_{ji} &\text{if } i > j. \end{cases} \end{aligned}

Assume that:

  1. The \{ Y_{ij} \} are independent.
  2. The diagonal entries \{ Y_{ii}\}_{i \geq 1} are identically distributed, and the off-diagonal entries \{ Y_{ij} \}_{1\leq i < j} are identically distributed.
  3. \mathbb{E}[Y_{ij}^2] < \infty for all i, j.
  4. \mathbb{E}[Y_{12}^2] > 0 (so that the off-diagonal terms are not trivially all 0 almost surely).

Then the matrices {\bf X}_n = n^{-1/2} {\bf Y}_n are Wigner matrices. (Sometimes, the entire sequence of matrices is referred to as a Wigner matrix.)

For t > 0, define the distribution

\sigma_t (dx) = \dfrac{1}{2\pi t} \sqrt{(4t - x^2)_+} dx.

Wigner’s semicircle law states that if \mathbb{E}[Y_{12}^2] = t, then the limiting distribution of the eigenvalues of {\bf X}_n is \sigma_t. The result is stated more formally below:

Theorem (Wigner’s semicircle law): Let {\bf X}_n = n^{-1/2} {\bf Y}_n be a sequence of Wigner matrices such that \mathbb{E}[Y_{ij}] = 0 for all i, j and \mathbb{E}[Y_{12}^2] = t > 0. Let I \subset \mathbb{R} be an interval. Define the random variables

\begin{aligned} E_n(I) = \dfrac{\# \left( \{ \lambda_1 ({\bf X}_n), \dots, \lambda_n ({\bf X}_n) \} \cap I \right)}{n}, \end{aligned}

where \lambda_1 ({\bf X}_n), \dots, \lambda_n ({\bf X}_n) are the eigenvalues of {\bf X}_n. Then E_n(I) \rightarrow \sigma_t(I) in probability as n \rightarrow \infty.

(Note: Convergence in probability can be strengthened to almost sure convergence.)

See reference 2 for a proof of the semicircle law. Here, we demonstrate the semicircle law empirically. The code below generates a 1000 \times 1000 Wigner matrix with Y_{ij} \stackrel{iid}{\sim} \mathcal{N}(0, 1) and computes its eigenvalues:

# make Wigner matrix and compute eigenvalues
n <- 1000
set.seed(113)
Y <- matrix(rnorm(n * n), nrow = n)
Y[lower.tri(Y)] <- 0           # keep the diagonal and the upper triangle
Y <- Y + t(Y * upper.tri(Y))   # mirror the strictly upper triangle to make Y symmetric
X <- Y / sqrt(n)               # Wigner matrix X_n = n^{-1/2} Y_n
eigval <- eigen(X)$values

In this setting, t = \mathbb{E}[Y_{12}^2] = 1. The code below computes the true limiting distribution and plots a histogram of the eigenvalues along with the true limit:

# truth
t <- 1
xx <- seq(-2 * sqrt(t), 2 * sqrt(t), length.out = 201)
yy <- ifelse(4 * t - xx^2 > 0, sqrt(4 * t - xx^2), 0) / (2 * pi * t)

# make plot
hist(eigval, breaks = 100, freq = FALSE, main = "Histogram of eigenvalues\nNormal(0, 1)")
lines(xx, yy, col = "red")
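
We can also check the statement of the theorem directly: for an interval I, compare the fraction of eigenvalues falling in I with \sigma_t(I). The snippet below reuses eigval and t from the code above; the interval [0, 1] is an arbitrary choice:

# empirical vs limiting measure of the interval I = [a, b]
a <- 0; b <- 1
E_n <- mean(eigval >= a & eigval <= b)
sigma_I <- integrate(function(x) sqrt(pmax(4 * t - x^2, 0)) / (2 * pi * t),
                     lower = a, upper = b)$value
c(empirical = E_n, limit = sigma_I)   # these should be close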

We can easily amend the code so that the entries have a Laplace distribution with rate 1 (an Exp(1) variable with a random sign), so this time t = \mathbb{E}[Y_{12}^2] = 2:

# make Wigner matrix and compute eigenvalues
n <- 1000
set.seed(113)
Y <- matrix(rexp(n * n), nrow = n) * matrix(ifelse(rnorm(n * n) > 0, 1, -1), nrow = n)
Y[lower.tri(Y)] <- 0
Y <- Y + t(Y * upper.tri(Y))
X <- Y / sqrt(n)
eigval <- eigen(X)$values

# truth
t <- 2
xx <- seq(-2 * sqrt(t), 2 * sqrt(t), length.out = 201)
yy <- ifelse(4 * t - xx^2 > 0, sqrt(4 * t - xx^2), 0) / (2 * pi * t)

# make plot
hist(eigval, breaks = 100, freq = FALSE, main = "Histogram of eigenvalues\nLaplace rate 1")
lines(xx, yy, col = "red")

What happens if the distribution of Y_{ij} does not have a finite second moment? Then the theorem no longer applies. For example, the code below draws the entries i.i.d. from the Cauchy distribution (i.e. the t distribution with 1 degree of freedom):

# make Wigner matrix and compute eigenvalues
n <- 1000
set.seed(113)
Y <- matrix(rt(n * n, df = 1), nrow = n)
Y[lower.tri(Y)] <- 0
Y <- Y + t(Y * upper.tri(Y))
X <- Y / sqrt(n)
eigval <- eigen(X)$values

# make plot
hist(eigval, breaks = 100, freq = FALSE, main = "Histogram of eigenvalues\nCauchy")

Notice how different the scale on the x axis is from the previous plots! Most of the eigenvalues still lie in a narrow range close to 0, but a handful of eigenvalues are very far out. This is reminiscent of the central limit theorem in scalar probability theory, which also requires a finite second moment for the desired limiting behavior.

References:

  1. Feier, A. R. Methods of proof in random matrix theory.
  2. Kemp, T. Math 247A: Introduction to random matrix theory.

Use mfcol to have plots drawn by column

To plot multiple figures on a single canvas in base R, we can change the graphical parameter mfrow. For instance, the code below tells R that subsequent figures will be drawn in a 2-by-3 array:

par(mfrow = c(2, 3))

If we then run this next block of code, we will get the image below:

set.seed(10)
n <- 50; p <- 3
x <- matrix(runif(n * p), ncol = p)
for (j in 1:p) {
    plot(x[, j], x[, j]^2 + 0.1 * rnorm(n), 
         xlab = paste0("x", j), ylab = paste0("x", j, "^2 + noise"))
    plot(x[, j], x[, j]^3 + 0.1 * rnorm(n),
         xlab = paste0("x", j), ylab = paste0("x", j, "^3 + noise"))
}

Notice how the plots are filled in by row? That is, the first plot goes in the top-left corner, the next plot goes to its right, and so on. What if we want the plots to be filled in by column instead? For instance, in our example above each feature is associated with two panels, so it would make more sense for the first column to be filled with the plots related to x1, the second with those for x2, and so on.

There is an easy fix for that: instead of specifying the mfrow parameter, specify the mfcol parameter. See the results below:

par(mfcol = c(2, 3))
set.seed(10)
n <- 50; p <- 3
x <- matrix(runif(n * p), ncol = p)
for (j in 1:p) {
    plot(x[, j], x[, j]^2 + 0.1 * rnorm(n), 
         xlab = paste0("x", j), ylab = paste0("x", j, "^2 + noise"))
    plot(x[, j], x[, j]^3 + 0.1 * rnorm(n),
         xlab = paste0("x", j), ylab = paste0("x", j, "^3 + noise"))
}
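
One small habit worth mentioning (not required for the example above): par() returns the old settings invisibly, so we can save them and restore the layout once we are done.

old_par <- par(mfcol = c(2, 3))   # change the layout, saving the old settings
# ... make the plots as above ...
par(old_par)                      # restore the previous layout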


The Demmler-Reinsch basis for smoothing splines

Smoothing splines

Given a set of observations (x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^2, a smoothing spline is the function \hat{f}_\lambda which is the solution to

\hat{f}_\lambda = \underset{f}{\text{argmin}} \; \displaystyle\sum_{i=1}^n [y_i - f(x_i)]^2 + \lambda \displaystyle\int [f''(t)]^2 dt,

where \lambda \in [0, \infty] is a smoothing hyperparameter, and the argmin is taken over an appropriate Sobolev space for which the second term above is well-defined.

\hat{f}_\lambda is meant to be a “smooth” approximation of the relationship between y and x based on the observations. As \lambda increases from 0 to \infty, the solution goes from very rough (at \lambda = 0, any function interpolating the data) to very smooth (at \lambda = \infty, the least squares straight-line fit).

It turns out that for \lambda \in (0, \infty), the smoothing spline is a natural cubic spline with knots at the unique values of the x_i (see Section 5.4 of The Elements of Statistical Learning; natural cubic splines themselves are defined in Section 5.2.1). Hence, we can write the solution as a linear combination of the natural spline basis functions:

\begin{aligned} f(x) = \sum_{j=1}^n N_j(x) \theta_j, \end{aligned}

where N_1, \dots, N_n are the natural spline basis functions (here we take the x_i to be sorted in increasing order):

\begin{aligned} N_1(x) &= 1, \\  N_2(x) &= x, \\  N_{k+2}(x) &= \frac{(x - x_k)_+^3 - (x - x_n)_+^3}{x_n - x_k} - \frac{(x - x_{n-1})_+^3 - (x - x_n)_+^3}{x_n - x_{n-1}}, \quad k = 1, \dots, n-2. \end{aligned}

Writing {\bf{N}} \in \mathbb{R}^{n \times n} with N_{ij} = N_j(x_i), and {\bf{\Omega_N}} \in \mathbb{R}^{n \times n} with \Omega_{ij} = \int N_i''(t) N_j''(t) dt, the minimization problem over f now becomes a minimization problem over \theta = (\theta_1, \dots, \theta_n)^T \in \mathbb{R}^n:

\hat{\theta}_\lambda = \underset{\theta \in \mathbb{R}^n}{\text{argmin}} \;  (y - {\bf{N}}\theta)^T (y - {\bf{N}}\theta) + \lambda \theta^T {\bf{\Omega_N}} \theta.

We can solve for \theta (see previous post for details):

\hat{\theta}_\lambda = ({\bf{N}}^T {\bf{N}} + \lambda {\bf{\Omega_N}})^{-1} {\bf{N}}^T y.

Smoothing splines as a linear smoother

With this solution, our smoothed estimate of y is \hat{f}_\lambda = {\bf N}({\bf{N}}^T {\bf{N}} + \lambda {\bf{\Omega_N}})^{-1} {\bf{N}}^T y, or \hat{f}_\lambda = {\bf{S}}_\lambda y, where {\bf{S}}_\lambda = {\bf N}({\bf{N}}^T {\bf{N}} + \lambda {\bf{\Omega_N}})^{-1} {\bf{N}}^T. Note that the smoothing matrix {\bf{S}}_\lambda depends only on \lambda and the x_i's, not on the y_i's: hence the smoothing spline is called a linear smoother (the fitted values are a fixed linear transformation of the response y).
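
To see the linear smoother property numerically, here is a small sketch that recovers {\bf S}_\lambda column by column by smoothing the unit vectors e_i with smooth.spline() at a fixed value of its lambda argument. (smooth.spline defines lambda on an internally rescaled x axis, so it is not exactly the \lambda in the formulas above, but that does not matter for this check.)

set.seed(42)
n <- 30
x <- sort(runif(n))
y <- sin(2 * pi * x) + 0.2 * rnorm(n)

# column i of S_lambda is the smooth of the ith unit vector
get_S <- function(x, lambda) {
  n <- length(x)
  S <- matrix(0, n, n)
  for (i in 1:n) {
    e_i <- numeric(n); e_i[i] <- 1
    fit <- smooth.spline(x, e_i, lambda = lambda, all.knots = TRUE)
    S[, i] <- predict(fit, x)$y
  }
  S
}

S <- get_S(x, lambda = 1e-4)
fitted_direct <- predict(smooth.spline(x, y, lambda = 1e-4, all.knots = TRUE), x)$y
max(abs(S %*% y - fitted_direct))   # should be ~0: the fit is S_lambda %*% y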

Reinsch form for a smoothing spline

By manipulating the singular value decomposition (SVD) of \bf{N} (see slide 3 of reference 2), we can rewrite {\bf{S}}_\lambda as

{\bf{S}}_\lambda = ({\bf{I}} + \lambda {\bf K})^{-1},

where {\bf K} depends only on the x_i's and nothing else. The rewriting above is known as the Reinsch form for {\bf{S}}_\lambda. It is helpful because it allows us to better understand what happens as we vary \lambda.

(To be explicit: if the SVD of \bf{N} is {\bf N} = {\bf UZV}^T, then {\bf K} = {\bf U} {\bf Z}^{-1} {\bf V}^T {\bf \Omega_N} {\bf V} {\bf Z}^{-1} {\bf U}^T.)

The Demmler-Reinsch basis

Note that the matrix \bf K is positive semidefinite, and so it has an eigendecomposition. Write {\bf K} = {\bf UDU}^T, where \bf D is a diagonal matrix with 0 = d_1 = d_2 < d_3 \leq \dots \leq d_n, and where this \bf U (a slight abuse of notation: it is not the same as the \bf U in the SVD above) has orthonormal columns. (Yes: we can show that d_1 and d_2 are always equal to 0, because the null space of \bf K is spanned by functions linear in x.) The columns of \bf U, u_1, \dots, u_n, form the Demmler-Reinsch basis: they are the eigenvectors of \bf K.

It follows immediately that the columns of \bf U are also the eigenvectors of {\bf S}_\lambda, but with different associated eigenvalues:

\begin{aligned} \rho_k (\lambda) = \frac{1}{1 + \lambda d_k}. \end{aligned}

The basis vectors tell us exactly how the smoothing spline acts: it decomposes the response y with respect to u_1, \dots, u_n, then shrinks each component (differently) by a factor of \rho_k(\lambda):

\begin{aligned} {\bf S}_\lambda y = \sum_{k=1}^n \rho_k(\lambda) (u_k \cdot y) u_k. \end{aligned}

The expression for \rho_k(\lambda) explains the role of \lambda: it controls how much each u_k component is shrunk. The larger \lambda (and the larger d_k), the more that component is shrunk toward zero; the first two components, which have d_1 = d_2 = 0, are never shrunk.

The importance of the Demmler-Reinsch basis lies in the fact that it is a function of the x_i's alone: it is not influenced by \lambda or the response y.
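
Continuing the numerical sketch from the linear smoother section (reusing get_S() and x from there), we can see both facts in action: smoother matrices for different values of \lambda share the same eigenvectors (equivalently, they commute), and their eigenvalues are linked through \rho_k(\lambda) = 1 / (1 + \lambda d_k).

lam1 <- 1e-4; lam2 <- 1e-2
S1 <- get_S(x, lambda = lam1)
S2 <- get_S(x, lambda = lam2)

# shared eigenvectors <=> the two smoother matrices commute (up to numerical error)
max(abs(S1 %*% S2 - S2 %*% S1))

# recover d_k from the eigenvalues at lam1, then predict the eigenvalues at lam2
rho1 <- eigen((S1 + t(S1)) / 2, symmetric = TRUE)$values
rho2 <- eigen((S2 + t(S2)) / 2, symmetric = TRUE)$values
d <- (1 / rho1 - 1) / lam1
max(abs(1 / (1 + lam2 * d) - rho2))   # should be close to 0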

References:

  1. Hansen, N. R. (2009). Statistics Learning.
  2. Hastie, T., Tibshirani, R., and Friedman, J. The elements of statistical learning (Section 5.4.1).