# Proximal operators and generalized gradient descent

Let’s say we have a convex, differentiable function $f: \mathbb{R}^p \rightarrow \mathbb{R}$ and that we want to minimize it. In this previous post, we saw that a gradient descent step at time $t$ (from point $\beta^t$) could be viewed as the minimizer of a quadratic approximation of $f$ at $\beta^t$:

$\beta^{t+1} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ f(\beta^t) + \langle \nabla f(\beta^t), \beta - \beta^t \rangle + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 \right\}.$

What if $f$ is not differentiable at all points? If $f$ has some special structure, we are not totally hosed. Suppose $f$ can be decomposed as the sum of two functions $f = g + h$, where $g$ is convex and differentiable, while $h$ is convex but not differentiable. Instead of doing the gradient update as above (i.e. minimizing the quadratic approximation of $f$ around $\beta^t$), we can try to minimize the sum of $h$ and the quadratic approximation of $g$ around $\beta^t$:

$\beta^{t+1} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ g(\beta^t) + \langle \nabla g(\beta^t), \beta - \beta^t \rangle + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 + h(\beta) \right\}.$

This update is known as the generalized gradient update. (Why? This update encompasses a number of gradient descent algorithms, as we’ll see in the examples below.)

Let’s try to solve this new minimization problem. To do that, for a convex function $h: \mathbb{R}^p \mapsto \mathbb{R}$ we introduce the notion of a proximal operator, defined as

\begin{aligned} \textbf{prox}_h (z) = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ \frac{1}{2} \| z - \beta \|_2^2 + h(\beta) \right\}. \end{aligned}

With this notation,

\begin{aligned} \beta^{t+1} &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ g(\beta^t) + \langle \nabla g(\beta^t), \beta - \beta^t \rangle + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 + h(\beta) \right\} \\ &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ \frac{1}{2} \| \beta^t - \beta \|_2^2 - s^t \langle \nabla g(\beta^t), \beta^t - \beta \rangle + s^t h(\beta) \right\} \\ &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ \frac{1}{2} \| \beta^t - \beta - s^t \nabla g(\beta^t) \|_2^2 + s^t h(\beta) \right\} \\ &= \textbf{prox}_{s^t h} \left( \beta^t - s^t \nabla g(\beta^t) \right). \end{aligned}

This reformulation is useful if the proximal operator can be computed cheaply. We finish off this post with 3 examples where this is the case:

Example 1: $h(z) = 0$ for all $z \in \mathbb{R}^p$.

That is, $f$ itself is convex and differentiable. In this case, $\textbf{prox}_h(z) = z$, so $\beta^{t+1} = \beta^t - s^t \nabla g(\beta^t)$, which is the usual gradient descent update.

Example 2: $h(z) = 0$ if $z \in \mathcal{C}$, $h(z) = +\infty$ otherwise, for some set $\mathcal{C} \subseteq \mathbb{R}^p$.

Here, the proximal operator reduces to \begin{aligned} \textbf{prox}_h (z) = \underset{\beta \in \mathcal{C}}{\text{argmin}} \left\{ \| z - \beta \|_2^2 \right\}\end{aligned}, which is the usual Euclidean projection onto $\mathcal{C}$. Hence, this case corresponds to projected gradient descent.

Example 3: $h(z) = \lambda \| z \|_1$ for some $\lambda \geq 0$.

Here, the proximal operator reduces to

\begin{aligned} \textbf{prox}_h (z) &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ \frac{1}{2} \| z - \beta \|_2^2 + \lambda \|\beta\|_1 \right\} \\ &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \sum_{j=1}^n \frac{1}{2}(z_j - \beta_j)^2 + \lambda |\beta_j|. \end{aligned}

The minimization problem can be solved independently for each $\beta_j$. The result is that $\textbf{prox}_h (z) = \mathcal{S}_\lambda (z)$, where $\mathcal{S}$ is the soft-thresholding operator.

When applied to the LASSO minimization problem (i.e. $g(\beta) = \frac{1}{2n} \| y - X\beta \|_2^2$ and $h(\beta) = \lambda \|\beta\|_1$), the proximal gradient update becomes

\begin{aligned} \beta^{t+1} = \mathcal{S}_{s^t \lambda} \left(\beta^t - s^t \frac{1}{n} X^T (y - X\beta^t) \right). \end{aligned}

This method of solving the LASSO is sometimes called the Iterative Soft-Thresholding Algorithm (ISTA).

References:

1. Gordon, G. & Tibshirani, R. Generalized gradient descent.
2. Hastie, T., Tibshirani, R., & Wainwright, M. Statistical learning with sparsity (Section 5.3).