# The SCAD penalty

Assume that we are in the regression context with response $y \in \mathbb{R}^n$ and design matrix $X \in \mathbb{R}^{n \times p}$. The LASSO solves the following minimization problem: $\text{minimize}_\beta \quad\frac{1}{2} \| y - X\beta\|_2^2 + \lambda \| \beta\|_1.$

The LASSO is a special case of the bridge estimators, first studied by Frank and Friedman (1993), which solve $\text{minimize}_\beta \quad\frac{1}{2} \| y - X\beta\|_2^2 + \lambda \sum_{j=1}^p |\beta_j|^q,$

with $q > 0$. The LASSO corresponds to $q = 1$ and ridge regression to $q = 2$. We typically do not consider $q < 1$, since the resulting minimization problem is non-convex and hard to solve globally.

When the design matrix has orthonormal columns ($X^T X = I$), the minimization problem above decouples across coordinates, and we obtain the LASSO estimates $(\hat{\beta}_\lambda)_j = \text{sgn}(z_j)(|z_j| - \lambda)_+$, where $z_j = X_j^T y$ is the OLS solution. This is known as soft-thresholding: we shrink the magnitude of $z_j$ by a fixed amount (in this case $\lambda$) without letting it cross zero.
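To make the soft-thresholding rule concrete, here is a minimal Python sketch that applies it to coefficients of either sign. The function name `soft_threshold` and the sample values are illustrative assumptions, not taken from the references.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything with |z| <= lam becomes exactly 0."""
    if abs(z) <= lam:
        return 0.0
    # Move z toward zero by lam while keeping its sign.
    return z - lam if z > 0 else z + lam


# Hypothetical OLS coefficients z_j under an orthonormal design, with lambda = 1.
z_values = [-3.0, -0.5, 0.0, 0.8, 2.5]
lasso_estimates = [soft_threshold(z, 1.0) for z in z_values]
print(lasso_estimates)
```

Coefficients with magnitude at most $\lambda$ are set to zero, and every other coefficient moves toward zero by exactly $\lambda$.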

It’s nice to have a thresholding rule in the orthogonal case; it’s also nice for the solution to be continuous in $z_j$. Fan & Li (2001) show that the only bridge estimator which has both these properties is the LASSO.

One problem with the LASSO is that the penalty term is linear in the size of the regression coefficient, hence it tends to give substantially biased estimates for large regression coefficients. To address this, Fan & Li (2001) propose the SCAD (smoothly clipped absolute deviation) penalty, applied to each coefficient separately: $\text{minimize}_\beta \quad\frac{1}{2} \| y - X\beta\|_2^2 + \sum_j p(\beta_j),$

where the derivative of the penalty function, for $\beta > 0$, is $p'(\beta) = \lambda \left[ I(\beta \leq \lambda) + \dfrac{(a\lambda - \beta)_+}{(a-1)\lambda}I(\beta > \lambda) \right],$

with $a > 2$. This corresponds to a quadratic spline function with knots at $\lambda$ and $a\lambda$. Explicitly, the penalty is $\begin{aligned} p(\beta) = \begin{cases} \lambda |\beta| &\text{if } |\beta| \leq \lambda, \\ \dfrac{2a\lambda |\beta| - \beta^2 - \lambda^2}{2(a-1)} &\text{if } \lambda < |\beta| \leq a\lambda, \\ \dfrac{\lambda^2 (a + 1)}{2} &\text{otherwise.} \end{cases} \end{aligned}$
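As a sanity check on the closed form, the sketch below codes the derivative and the penalty separately and compares the penalty against a numerical integral of the derivative. The function names and the midpoint-rule integrator are illustrative assumptions; this is a rough check, not a reference implementation.

```python
def scad_derivative(beta, lam, a):
    """p'(beta) for beta > 0: lam up to the first knot, then decaying linearly to 0."""
    if beta <= lam:
        return lam
    return max(a * lam - beta, 0.0) / (a - 1)


def scad_penalty(beta, lam, a):
    """Closed-form SCAD penalty for a single coefficient (assumes a > 2)."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b**2 - lam**2) / (2 * (a - 1))
    return lam**2 * (a + 1) / 2


def integrate_derivative(beta, lam, a, steps=10_000):
    """Midpoint-rule integral of p' from 0 to |beta| (numerical check only)."""
    b = abs(beta)
    h = b / steps
    return h * sum(scad_derivative((i + 0.5) * h, lam, a) for i in range(steps))


# lambda = 1 and a = 3; one point in each of the three regions.
for beta in [0.5, 2.0, 5.0]:
    print(beta, scad_penalty(beta, 1.0, 3.0), integrate_derivative(beta, 1.0, 3.0))
```

With $\lambda = 1$ and $a = 3$, the closed form gives 0.5, 1.75 and 2.0 at these three points, and the numerical integral should agree up to discretization error.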

Below is a plot of the penalty function, where we have set $\lambda = 1$ and $a = 3$. The SCAD penalty function is in red while the LASSO penalty function is in black for comparison. The dotted lines are the penalty’s transition points ($\pm \lambda$ and $\pm a \lambda$).

Under orthogonal design, we get the SCAD solution $\begin{aligned} (\hat{\beta}_\lambda)_j = \begin{cases} \text{sgn}(z_j) (|z_j| - \lambda)_+ &\text{if } |z_j| \leq 2\lambda, \\ \dfrac{(a-1)z_j - \text{sgn}(z_j)a\lambda}{a-2} &\text{if } 2\lambda < |z_j| \leq a\lambda, \\ z_j &\text{otherwise.} \end{cases} \end{aligned}$
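A minimal Python sketch of the SCAD thresholding rule above, with soft-thresholding alongside for comparison. The function names are illustrative assumptions, and the values $\lambda = 1$, $a = 3$ simply match the ones used for the plots.

```python
def soft_threshold(z, lam):
    """LASSO rule under orthogonal design: shrink |z| by lam, clipping at zero."""
    if abs(z) <= lam:
        return 0.0
    return z - lam if z > 0 else z + lam


def scad_threshold(z, lam, a):
    """SCAD rule under orthogonal design (assumes a > 2)."""
    sign = 1.0 if z > 0 else -1.0
    if abs(z) <= 2 * lam:
        # Same as soft-thresholding in this region.
        return soft_threshold(z, lam)
    if abs(z) <= a * lam:
        # Linear piece joining the soft-thresholded value to z itself.
        return ((a - 1) * z - sign * a * lam) / (a - 2)
    # Large coefficients are left untouched.
    return z


lam, a = 1.0, 3.0
for z in [0.5, 1.5, 2.5, 4.0]:
    print(z, soft_threshold(z, lam), scad_threshold(z, lam, a))
```

With these values the two rules agree at $z = 0.5$ and $z = 1.5$. At $z = 4$, soft-thresholding returns 3.0 while SCAD returns 4.0, which is the reduction in bias for large coefficients that motivates the penalty.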

The plot below shows what the SCAD estimates look like ($\lambda = 1$, $a = 3$). The dotted line is the $y = x$ line. The line in black represents soft-thresholding (LASSO estimates) while the line in red represents the SCAD estimates. We see that the SCAD estimates are the same as soft-thresholding for $|x| \leq 2\lambda$ and coincide with hard-thresholding (no shrinkage) for $|x| > a\lambda$; in between, for $2\lambda < |x| \leq a\lambda$, the estimates interpolate linearly between these two regimes.

References:

1. Fan, J., and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties.
2. Breheny, P. Adaptive lasso, MCP, and SCAD.