The SCAD penalty

Assume that we are in the regression context with response y \in \mathbb{R}^n and design matrix X \in \mathbb{R}^{n \times p}. The LASSO solves the following minimization problem:

\text{minimize}_\beta \quad\frac{1}{2} \| y - X\beta\|_2^2 + \lambda \| \beta\|_1.
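As a concrete illustration, here is a minimal sketch of fitting the LASSO with scikit-learn (a library choice of convenience; any LASSO solver works). One caveat: sklearn's Lasso minimizes \frac{1}{2n} \| y - X\beta\|_2^2 + \alpha \|\beta\|_1, so \alpha = \lambda / n corresponds to the \lambda in the objective above.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]          # sparse ground truth
y = X @ beta_true + rng.standard_normal(n)

lam = 5.0
# sklearn's Lasso minimizes (1/(2n)) ||y - Xb||^2 + alpha ||b||_1,
# so alpha = lam / n matches the objective above.
fit = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y)
print(np.round(fit.coef_, 2))             # most entries shrunk to exactly 0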

The LASSO is a special case of the bridge estimators, first studied by Frank and Friedman (1993), which solve

\text{minimize}_\beta \quad\frac{1}{2} \| y - X\beta\|_2^2 + \lambda \sum_{j=1}^p | \beta_j|^q,

with q > 0. The LASSO corresponds to the case q = 1 and ridge regression to the case q = 2. We typically do not consider q < 1, as the resulting minimization problem is non-convex and hard to solve globally.

When the design matrix is orthogonal, the minimization problem above decouples across coordinates, and we obtain the LASSO estimates (\hat{\beta}_\lambda)_j = \text{sgn}(z_j)(|z_j| - \lambda)_+, where z_j = X_j^T y is the OLS solution. This is known as soft-thresholding: we shrink each coefficient toward zero by a fixed amount (in this case \lambda) without letting it cross zero.
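In code, soft-thresholding is a one-liner. Below is a minimal NumPy sketch (the name soft_threshold is chosen here for illustration):

import numpy as np

def soft_threshold(z, lam):
    # Shrink each entry of z toward zero by lam, clipping at zero.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

print(soft_threshold(np.array([-3.0, -0.5, 0.5, 3.0]), 1.0))
# [-2. -0.  0.  2.]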

It’s nice to have a thresholding rule in the orthogonal case (coefficients with small enough |z_j| are set exactly to zero); it’s also nice for the solution to be continuous in z_j. Fan & Li (2001) show that the LASSO (q = 1) is the only bridge estimator with both of these properties.

One problem with the LASSO is that the penalty term is linear in the size of the regression coefficient, hence it tends to give substantially biased estimates for large regression coefficients. To that end, Fan & Li (2001) propose the SCAD (smoothly clipped absolute deviation) penalty:

\text{minimize}_\beta \quad\frac{1}{2} \| y - X\beta\|_2^2 + \sum_{j=1}^p p_\lambda(\beta_j),

where the penalty function p_\lambda is symmetric and its derivative for \beta > 0 is

p_\lambda'(\beta) = \lambda \left[ I(\beta \leq \lambda) + \dfrac{(a\lambda - \beta)_+}{(a-1)\lambda}I(\beta > \lambda) \right],

with a > 2. (Fan & Li suggest a = 3.7, based on a Bayesian argument.) This corresponds to a quadratic spline function with knots at \lambda and a\lambda.
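For reference, integrating p_\lambda' from zero (and extending symmetrically to negative arguments) gives the penalty itself; it is quadratic on the middle piece and constant beyond a\lambda, so large coefficients incur no additional penalty:

p_\lambda(\beta) = \begin{cases} \lambda |\beta| &\text{if } |\beta| \leq \lambda, \\ \dfrac{2a\lambda|\beta| - \beta^2 - \lambda^2}{2(a-1)} &\text{if } \lambda < |\beta| \leq a\lambda, \\ \dfrac{(a+1)\lambda^2}{2} &\text{if } |\beta| > a\lambda. \end{cases}

Under orthogonal design, we get the solution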

\begin{aligned} (\hat{\beta}_\lambda)_j = \begin{cases} \text{sgn}(z_j) (|z_j| - \lambda)_+ &\text{if } |z_j| \leq 2\lambda, \\ \dfrac{(a-1)z_j - \text{sgn}(z_j)a\lambda}{a-2} &\text{if } 2\lambda < |z_j| \leq a\lambda, \\ z_j &\text{otherwise.} \end{cases}  \end{aligned}
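A direct NumPy translation of this rule might look as follows (a minimal sketch; scad_threshold is a name chosen here, and the default a = 3.7 follows Fan & Li's suggestion):

import numpy as np

def scad_threshold(z, lam, a=3.7):
    # SCAD thresholding rule under orthogonal design (Fan & Li, 2001).
    z = np.asarray(z, dtype=float)
    abs_z = np.abs(z)
    out = np.empty_like(z)
    # |z| <= 2*lam: identical to soft-thresholding.
    small = abs_z <= 2 * lam
    out[small] = np.sign(z[small]) * np.maximum(abs_z[small] - lam, 0.0)
    # 2*lam < |z| <= a*lam: linear interpolation toward the identity.
    mid = (abs_z > 2 * lam) & (abs_z <= a * lam)
    out[mid] = ((a - 1) * z[mid] - np.sign(z[mid]) * a * lam) / (a - 2)
    # |z| > a*lam: leave the OLS estimate untouched (no bias for large signals).
    out[abs_z > a * lam] = z[abs_z > a * lam]
    return out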

The plot below shows what the SCAD estimates look like (\lambda = 1, a = 3). The dotted line is the y = x line. The black line represents soft-thresholding (the LASSO estimates), while the red line represents the SCAD estimates. We see that the SCAD estimates agree with soft-thresholding for |x| \leq 2\lambda and with hard-thresholding for |x| > a\lambda; the estimates in the remaining region linearly interpolate between these two regimes.
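A sketch reproducing such a plot with matplotlib, reusing the soft_threshold and scad_threshold functions from the snippets above:

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-6, 6, 601)
lam, a = 1.0, 3.0  # the values used in the plot described above

plt.plot(z, z, linestyle=":", color="grey", label="y = x")
plt.plot(z, soft_threshold(z, lam), color="black", label="soft-thresholding (LASSO)")
plt.plot(z, scad_threshold(z, lam, a), color="red", label="SCAD")
plt.xlabel("z")
plt.ylabel("estimate")
plt.legend()
plt.show()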
