Proximal operators and generalized gradient descent

Let’s say we have a convex, differentiable function f: \mathbb{R}^p \rightarrow \mathbb{R} and that we want to minimize it. In this previous post, we saw that a gradient descent step at time t (from point \beta^t) could be viewed as the minimizer of a quadratic approximation of f at \beta^t:

\beta^{t+1} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ f(\beta^t) + \langle \nabla f(\beta^t), \beta - \beta^t \rangle  + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 \right\}.

What if f is not differentiable at all points? If f has some special structure, we are not totally hosed. Suppose f can be decomposed as the sum of two functions f = g + h, where g is convex and differentiable, while h is convex but not differentiable. Instead of doing the gradient update as above (i.e. minimizing the quadratic approximation of f around \beta^t), we can try to minimize the sum of h and the quadratic approximation of g around \beta^t:

\beta^{t+1} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ g(\beta^t) + \langle \nabla g(\beta^t), \beta - \beta^t \rangle  + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 + h(\beta) \right\}.

This update is known as the generalized gradient update. (Why? This update encompasses a number of gradient descent algorithms, as we’ll see in the examples below.)

Let’s try to solve this new minimization problem. To do that, for a convex function h: \mathbb{R}^p \mapsto \mathbb{R} we introduce the notion of a proximal operator, defined as

\begin{aligned} \textbf{prox}_h (z) = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ \frac{1}{2} \| z - \beta \|_2^2 + h(\beta) \right\}. \end{aligned}

With this notation,

\begin{aligned} \beta^{t+1} &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ g(\beta^t) + \langle \nabla g(\beta^t), \beta - \beta^t \rangle  + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 + h(\beta) \right\} \\  &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ \frac{1}{2} \| \beta^t - \beta \|_2^2 - s^t \langle \nabla g(\beta^t), \beta^t - \beta \rangle  + s^t h(\beta) \right\}  \\  &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ \frac{1}{2} \| \beta^t - \beta - s^t \nabla g(\beta^t) \|_2^2  + s^t h(\beta) \right\} \\  &= \textbf{prox}_{s^t h} \left( \beta^t - s^t \nabla g(\beta^t) \right). \end{aligned}

This reformulation is useful if the proximal operator can be computed cheaply. We finish off this post with 3 examples where this is the case:

Example 1: h(z) = 0 for all z \in \mathbb{R}^p.

That is, f itself is convex and differentiable. In this case, \textbf{prox}_h(z) = z, so \beta^{t+1} = \beta^t - s^t \nabla g(\beta^t), which is the usual gradient descent update.

Example 2: h(z) = 0 if z \in \mathcal{C}, h(z) = +\infty otherwise, for some set \mathcal{C} \subseteq \mathbb{R}^p.

Here, the proximal operator reduces to \begin{aligned} \textbf{prox}_h (z) = \underset{\beta \in \mathcal{C}}{\text{argmin}} \left\{ \| z - \beta \|_2^2 \right\}\end{aligned}, which is the usual Euclidean projection onto \mathcal{C}. Hence, this case corresponds to projected gradient descent.

Example 3: h(z) = \lambda \| z \|_1 for some \lambda \geq 0.

Here, the proximal operator reduces to

\begin{aligned} \textbf{prox}_h (z) &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ \frac{1}{2} \| z - \beta \|_2^2 + \lambda \|\beta\|_1 \right\} \\  &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \sum_{j=1}^n \frac{1}{2}(z_j - \beta_j)^2 + \lambda |\beta_j|. \end{aligned}

The minimization problem can be solved independently for each \beta_j. The result is that \textbf{prox}_h (z) = \mathcal{S}_\lambda (z), where \mathcal{S} is the soft-thresholding operator.

When applied to the LASSO minimization problem (i.e. g(\beta) = \frac{1}{2n} \| y - X\beta \|_2^2 and h(\beta) = \lambda \|\beta\|_1), the proximal gradient update becomes

\begin{aligned} \beta^{t+1} = \mathcal{S}_{s^t \lambda} \left(\beta^t - s^t \frac{1}{n} X^T (y - X\beta^t) \right). \end{aligned}

This method of solving the LASSO is sometimes called the Iterative Soft-Thresholding Algorithm (ISTA).

References:

  1. Gordon, G. & Tibshirani, R. Generalized gradient descent.
  2. Hastie, T., Tibshirani, R., & Wainwright, M. Statistical learning with sparsity (Section 5.3).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s