# Proximal operators and generalized gradient descent

Let’s say we have a convex, differentiable function $f: \mathbb{R}^p \rightarrow \mathbb{R}$ and that we want to minimize it. In this previous post, we saw that a gradient descent step at time $t$ (from point $\beta^t$) could be viewed as the minimizer of a quadratic approximation of $f$ at $\beta^t$: $\beta^{t+1} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ f(\beta^t) + \langle \nabla f(\beta^t), \beta - \beta^t \rangle + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 \right\}.$

What if $f$ is not differentiable at all points? If $f$ has some special structure, we are not totally hosed. Suppose $f$ can be decomposed as the sum of two functions $f = g + h$, where $g$ is convex and differentiable, while $h$ is convex but not differentiable. Instead of doing the gradient update as above (i.e. minimizing the quadratic approximation of $f$ around $\beta^t$), we can try to minimize the sum of $h$ and the quadratic approximation of $g$ around $\beta^t$: $\beta^{t+1} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ g(\beta^t) + \langle \nabla g(\beta^t), \beta - \beta^t \rangle + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 + h(\beta) \right\}.$

This update is known as the generalized gradient update. (Why? This update encompasses a number of gradient descent algorithms, as we’ll see in the examples below.)

Let’s try to solve this new minimization problem. To do that, for a convex function $h: \mathbb{R}^p \mapsto \mathbb{R}$ we introduce the notion of a proximal operator, defined as \begin{aligned} \textbf{prox}_h (z) = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ \frac{1}{2} \| z - \beta \|_2^2 + h(\beta) \right\}. \end{aligned}

With this notation, \begin{aligned} \beta^{t+1} &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ g(\beta^t) + \langle \nabla g(\beta^t), \beta - \beta^t \rangle + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 + h(\beta) \right\} \\ &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ \frac{1}{2} \| \beta^t - \beta \|_2^2 - s^t \langle \nabla g(\beta^t), \beta^t - \beta \rangle + s^t h(\beta) \right\} \\ &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ \frac{1}{2} \| \beta^t - \beta - s^t \nabla g(\beta^t) \|_2^2 + s^t h(\beta) \right\} \\ &= \textbf{prox}_{s^t h} \left( \beta^t - s^t \nabla g(\beta^t) \right). \end{aligned}

This reformulation is useful if the proximal operator can be computed cheaply. We finish off this post with 3 examples where this is the case:

Example 1: $h(z) = 0$ for all $z \in \mathbb{R}^p$.

That is, $f$ itself is convex and differentiable. In this case, $\textbf{prox}_h(z) = z$, so $\beta^{t+1} = \beta^t - s^t \nabla g(\beta^t)$, which is the usual gradient descent update.

Example 2: $h(z) = 0$ if $z \in \mathcal{C}$, $h(z) = +\infty$ otherwise, for some set $\mathcal{C} \subseteq \mathbb{R}^p$.

Here, the proximal operator reduces to \begin{aligned} \textbf{prox}_h (z) = \underset{\beta \in \mathcal{C}}{\text{argmin}} \left\{ \| z - \beta \|_2^2 \right\}\end{aligned}, which is the usual Euclidean projection onto $\mathcal{C}$. Hence, this case corresponds to projected gradient descent.

Example 3: $h(z) = \lambda \| z \|_1$ for some $\lambda \geq 0$.

Here, the proximal operator reduces to \begin{aligned} \textbf{prox}_h (z) &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ \frac{1}{2} \| z - \beta \|_2^2 + \lambda \|\beta\|_1 \right\} \\ &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \sum_{j=1}^n \frac{1}{2}(z_j - \beta_j)^2 + \lambda |\beta_j|. \end{aligned}

The minimization problem can be solved independently for each $\beta_j$. The result is that $\textbf{prox}_h (z) = \mathcal{S}_\lambda (z)$, where $\mathcal{S}$ is the soft-thresholding operator.

When applied to the LASSO minimization problem (i.e. $g(\beta) = \frac{1}{2n} \| y - X\beta \|_2^2$ and $h(\beta) = \lambda \|\beta\|_1$), the proximal gradient update becomes \begin{aligned} \beta^{t+1} = \mathcal{S}_{s^t \lambda} \left(\beta^t - s^t \frac{1}{n} X^T (y - X\beta^t) \right). \end{aligned}

This method of solving the LASSO is sometimes called the Iterative Soft-Thresholding Algorithm (ISTA).

References:

1. Gordon, G. & Tibshirani, R. Generalized gradient descent.
2. Hastie, T., Tibshirani, R., & Wainwright, M. Statistical learning with sparsity (Section 5.3).

# Gradient descent as a minimization problem

Let’s say that I have a convex, differentiable function $f: \mathbb{R}^p \rightarrow \mathbb{R}$ and that I am interested in minimizing it. A common first-order iterative method for minimizing $f$ is gradient descent. In words, a each step, compute the gradient of $f$ at the current point and move a certain distance in the direction of the negative gradient. In math, if $\beta^t$ is the current iterate, then the next iterate is given by $\beta^{t+1} = \beta^t - s^t \nabla f(\beta^t),$

where $s^t > 0$ is some step-size, which can be predetermined or adaptively chosen.

Did you know that this gradient descent step can be framed as an optimization problem as well? Indeed, $\beta^{t+1} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ f(\beta^t) + \langle \nabla f(\beta^t), \beta - \beta^t \rangle + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 \right\}. \qquad (1)$

The proof follows directly from differentiating the expression inside the square brackets. Note that as a function of $\beta$, the expression is differentiable and convex, hence its minimum will be achieved at the stationary point, i.e. $\nabla f(\beta^t) + \frac{1}{s^t} (\beta - \beta^t) = 0$, which rearranges to $\beta = \beta^t - s^t \nabla f(\beta^t)$.

You can think of the expression on the RHS in $(1)$ as a quadratic approximation. Taking a second-order Taylor expansion of $f$ and $\beta^t$, we have $f(\beta) \approx f(\beta^t) + \langle \nabla f(\beta^t), \beta - \beta^t \rangle + \frac{1}{2} (\beta - \beta^t)^T \nabla^2 f(\beta^t) (\beta - \beta^t).$

Replacing the Hessian $\nabla^2 f(\beta^t)$ with $\dfrac{1}{t}I$ gives the expression in $(1)$.

What does this alternate formulation give us? This formulation generalizes easily to the idea of projected gradient descent. The problem above was unconstrained, in the sense that $\beta$ could potentially be any vector in $\mathbb{R}^p$. If we require $\beta$ to lie in some convex constraint set $\mathcal{C}$, then we could generalize each gradient step to be $\beta^{t+1} = \underset{\beta \in \mathcal{C}}{\text{argmin}} \left\{ f(\beta^t) + \langle \nabla f(\beta^t), \beta - \beta^t \rangle + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 \right\}.$

This corresponds to taking a gradient step $\beta^t - s^t \nabla f(\beta^t)$, then projecting the result onto the set $\mathcal{C}$.

This formulation will also lead us to a more general class of methods called proximal gradient methods, of which project gradient descent is a special case.

References:

1. Hastie, Tibshirani & Wainwright. Statistical Learning with Sparsity. (In particular, pp101-103.)