Let’s say we have a convex, differentiable function and that we want to minimize it. In this previous post, we saw that a gradient descent step at time (from point ) could be viewed as the minimizer of a quadratic approximation of at :
What if is not differentiable at all points? If has some special structure, we are not totally hosed. Suppose can be decomposed as the sum of two functions , where is convex and differentiable, while is convex but not differentiable. Instead of doing the gradient update as above (i.e. minimizing the quadratic approximation of around ), we can try to minimize the sum of and the quadratic approximation of around :
This update is known as the generalized gradient update. (Why? This update encompasses a number of gradient descent algorithms, as we’ll see in the examples below.)
Let’s try to solve this new minimization problem. To do that, for a convex function we introduce the notion of a proximal operator, defined as
With this notation,
This reformulation is useful if the proximal operator can be computed cheaply. We finish off this post with 3 examples where this is the case:
Example 1: for all .
That is, itself is convex and differentiable. In this case, , so , which is the usual gradient descent update.
Example 2: if , otherwise, for some set .
Here, the proximal operator reduces to , which is the usual Euclidean projection onto . Hence, this case corresponds to projected gradient descent.
Example 3: for some .
Here, the proximal operator reduces to
The minimization problem can be solved independently for each . The result is that , where is the soft-thresholding operator.
When applied to the LASSO minimization problem (i.e. and ), the proximal gradient update becomes
This method of solving the LASSO is sometimes called the Iterative Soft-Thresholding Algorithm (ISTA).
- Gordon, G. & Tibshirani, R. Generalized gradient descent.
- Hastie, T., Tibshirani, R., & Wainwright, M. Statistical learning with sparsity (Section 5.3).