Proximal operators and generalized gradient descent

Let’s say we have a convex, differentiable function f: \mathbb{R}^p \rightarrow \mathbb{R} that we want to minimize. In a previous post (included below), we saw that a gradient descent step at time t (from the point \beta^t) can be viewed as the minimizer of a quadratic approximation of f at \beta^t:

\beta^{t+1} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ f(\beta^t) + \langle \nabla f(\beta^t), \beta - \beta^t \rangle  + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 \right\}.

What if f is not differentiable at all points? If f has some special structure, we are not totally hosed. Suppose f can be decomposed as the sum of two functions, f = g + h, where g is convex and differentiable, while h is convex but not necessarily differentiable. Instead of performing the gradient update as above (i.e. minimizing the quadratic approximation of f around \beta^t), we can minimize the sum of h and the quadratic approximation of g around \beta^t:

\beta^{t+1} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ g(\beta^t) + \langle \nabla g(\beta^t), \beta - \beta^t \rangle  + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 + h(\beta) \right\}.

This update is known as the generalized gradient update. (Why? This update encompasses a number of gradient descent algorithms, as we’ll see in the examples below.)

Let’s try to solve this new minimization problem. To do that, for a convex function h: \mathbb{R}^p \rightarrow \mathbb{R} we introduce the notion of a proximal operator, defined as

\begin{aligned} \textbf{prox}_h (z) = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ \frac{1}{2} \| z - \beta \|_2^2 + h(\beta) \right\}. \end{aligned}

With this notation,

\begin{aligned} \beta^{t+1} &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ g(\beta^t) + \langle \nabla g(\beta^t), \beta - \beta^t \rangle  + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 + h(\beta) \right\} \\  &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ \frac{1}{2} \| \beta^t - \beta \|_2^2 - s^t \langle \nabla g(\beta^t), \beta^t - \beta \rangle  + s^t h(\beta) \right\}  \\  &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ \frac{1}{2} \| \beta^t - \beta - s^t \nabla g(\beta^t) \|_2^2  + s^t h(\beta) \right\} \\  &= \textbf{prox}_{s^t h} \left( \beta^t - s^t \nabla g(\beta^t) \right). \end{aligned}

(To get the second line, multiply the objective by s^t and drop the terms not depending on \beta; the third line completes the square, again dropping a term that is constant in \beta.)
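To make the update concrete, here is a minimal sketch of the generalized gradient step in Python/NumPy. The names prox_grad_step, grad_g and prox are my own placeholders, not from the references:

```python
def prox_grad_step(beta, step, grad_g, prox):
    """One generalized gradient update:
    beta^{t+1} = prox_{s h}(beta - s * grad_g(beta)),
    where grad_g(beta) returns the gradient of the smooth part g and
    prox(z, s) evaluates prox_{s h}(z) for the nonsmooth part h.
    """
    return prox(beta - step * grad_g(beta), step)
```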

This reformulation is useful if the proximal operator can be computed cheaply. We finish off this post with 3 examples where this is the case:

Example 1: h(z) = 0 for all z \in \mathbb{R}^p.

That is, f itself is convex and differentiable. In this case, \textbf{prox}_h(z) = z, so \beta^{t+1} = \beta^t - s^t \nabla g(\beta^t), which is the usual gradient descent update.
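In the sketch above, this corresponds to passing the identity as the proximal operator:

```python
# h = 0: prox_{s h} is the identity, so the generalized update reduces
# to the plain gradient descent step.
identity_prox = lambda z, step: z
# beta_next = prox_grad_step(beta, step, grad_g, identity_prox)
```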

Example 2: h(z) = 0 if z \in \mathcal{C}, h(z) = +\infty otherwise, for some set \mathcal{C} \subseteq \mathbb{R}^p.

Here, the proximal operator reduces to \begin{aligned} \textbf{prox}_h (z) = \underset{\beta \in \mathcal{C}}{\text{argmin}} \left\{ \| z - \beta \|_2^2 \right\}\end{aligned}, which is the usual Euclidean projection onto \mathcal{C}. Hence, this case corresponds to projected gradient descent.
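As an illustration (the choice of set is mine, not from the post), if \mathcal{C} is the Euclidean ball of radius r, the projection has a simple closed form, and plugging it in as prox in the sketch above gives projected gradient descent:

```python
import numpy as np

def project_l2_ball(z, step, r=1.0):
    """Euclidean projection onto C = {beta : ||beta||_2 <= r}.
    The step size is ignored: scaling the indicator function by s
    leaves it unchanged, so its prox does not depend on s."""
    norm = np.linalg.norm(z)
    return z if norm <= r else (r / norm) * z
```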

Example 3: h(z) = \lambda \| z \|_1 for some \lambda \geq 0.

Here, the proximal operator reduces to

\begin{aligned} \textbf{prox}_h (z) &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ \frac{1}{2} \| z - \beta \|_2^2 + \lambda \|\beta\|_1 \right\} \\  &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ \sum_{j=1}^p \frac{1}{2}(z_j - \beta_j)^2 + \lambda |\beta_j| \right\}. \end{aligned}

The minimization problem can be solved independently for each coordinate \beta_j. The result is that \textbf{prox}_h (z) = \mathcal{S}_\lambda (z), where \mathcal{S}_\lambda is the soft-thresholding operator, defined coordinate-wise by [\mathcal{S}_\lambda(z)]_j = \text{sign}(z_j) \max(|z_j| - \lambda, 0).
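In code, soft-thresholding is a one-liner; a NumPy sketch:

```python
import numpy as np

def soft_threshold(z, lam):
    """Coordinate-wise soft-thresholding: sign(z_j) * max(|z_j| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
```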

When applied to the LASSO minimization problem (i.e. g(\beta) = \frac{1}{2n} \| y - X\beta \|_2^2 and h(\beta) = \lambda \|\beta\|_1, so that \nabla g(\beta) = -\frac{1}{n} X^T (y - X\beta)), the proximal gradient update becomes

\begin{aligned} \beta^{t+1} = \mathcal{S}_{s^t \lambda} \left(\beta^t + s^t \frac{1}{n} X^T (y - X\beta^t) \right). \end{aligned}

This method of solving the LASSO is sometimes called the Iterative Soft-Thresholding Algorithm (ISTA).
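Putting the pieces together, here is a minimal ISTA sketch on synthetic data. The fixed step size s = 1/L, with L the largest eigenvalue of X^T X / n (the Lipschitz constant of \nabla g), is a standard choice but an assumption on my part, not taken from the post:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ista(X, y, lam, n_iters=500):
    """Iterative Soft-Thresholding Algorithm for
    min_beta (1/2n) ||y - X beta||_2^2 + lam * ||beta||_1."""
    n, p = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n).max()  # Lipschitz constant of grad g
    s = 1.0 / L                                # fixed step size (assumed choice)
    beta = np.zeros(p)
    for _ in range(n_iters):
        grad = -X.T @ (y - X @ beta) / n       # gradient of the smooth part g
        beta = soft_threshold(beta - s * grad, s * lam)
    return beta

# Tiny usage example on random data:
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = X[:, :3] @ np.ones(3) + 0.1 * rng.standard_normal(100)
beta_hat = ista(X, y, lam=0.1)
```

With a step size of 1/L the objective is non-increasing across iterations; backtracking line search is a common alternative when L is expensive to compute.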

References:

  1. Gordon, G. & Tibshirani, R. Generalized gradient descent.
  2. Hastie, T., Tibshirani, R., & Wainwright, M. Statistical learning with sparsity (Section 5.3).

Gradient descent as a minimization problem

Let’s say that I have a convex, differentiable function f: \mathbb{R}^p \rightarrow \mathbb{R} and that I am interested in minimizing it. A common first-order iterative method for minimizing f is gradient descent. In words, at each step, compute the gradient of f at the current point and move a certain distance in the direction of the negative gradient. In math, if \beta^t is the current iterate, then the next iterate is given by

\beta^{t+1} = \beta^t - s^t \nabla f(\beta^t),

where s^t > 0 is some step-size, which can be predetermined or adaptively chosen.
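As a quick sketch (the least-squares objective here is my own hypothetical example), the iteration looks like this in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def grad_f(beta):
    # gradient of f(beta) = 0.5 * ||A beta - b||_2^2
    return A.T @ (A @ beta - b)

beta = np.zeros(5)
step = 1.0 / np.linalg.eigvalsh(A.T @ A).max()  # fixed step size, a safe (assumed) choice

for _ in range(200):
    beta = beta - step * grad_f(beta)  # beta^{t+1} = beta^t - s^t * grad f(beta^t)
```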

Did you know that this gradient descent step can be framed as an optimization problem as well? Indeed,

\beta^{t+1} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ f(\beta^t) + \langle \nabla f(\beta^t), \beta - \beta^t \rangle  + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 \right\}. \qquad (1)

The proof follows directly from differentiating the expression inside the braces. Note that as a function of \beta, the expression is differentiable and convex, hence its minimum is achieved at the stationary point, i.e. where \nabla f(\beta^t) + \frac{1}{s^t} (\beta - \beta^t) = 0, which rearranges to \beta = \beta^t - s^t \nabla f(\beta^t).
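As a sanity check (the quadratic test function and the use of scipy.optimize.minimize are my own choices for illustration), one can verify numerically that the argmin of the quadratic model coincides with the gradient step:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
p = 5
M = rng.standard_normal((p, p))
A = M @ M.T + np.eye(p)             # positive definite, so f is convex
b = rng.standard_normal(p)
grad_f = lambda beta: A @ beta - b  # gradient of f(beta) = 0.5 beta'A beta - b'beta

beta_t, s_t = rng.standard_normal(p), 0.1

def model(beta):
    # quadratic approximation in (1); the constant f(beta^t) is omitted
    d = beta - beta_t
    return grad_f(beta_t) @ d + d @ d / (2 * s_t)

res = minimize(model, x0=np.zeros(p))
print(np.allclose(res.x, beta_t - s_t * grad_f(beta_t), atol=1e-5))  # True
```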

You can think of the expression on the RHS of (1) as a quadratic approximation. Taking a second-order Taylor expansion of f around \beta^t, we have

f(\beta) \approx  f(\beta^t) + \langle \nabla f(\beta^t), \beta - \beta^t \rangle  + \frac{1}{2} (\beta - \beta^t)^T \nabla^2 f(\beta^t) (\beta - \beta^t).

Replacing the Hessian \nabla^2 f(\beta^t) with \frac{1}{s^t} I gives the expression in (1).

What does this alternate formulation give us? This formulation generalizes easily to the idea of projected gradient descent. The problem above was unconstrained, in the sense that \beta could potentially be any vector in \mathbb{R}^p. If we require \beta to lie in some convex constraint set \mathcal{C}, then we could generalize each gradient step to be

\beta^{t+1} = \underset{\beta \in \mathcal{C}}{\text{argmin}} \left\{ f(\beta^t) + \langle \nabla f(\beta^t), \beta - \beta^t \rangle  + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 \right\}.

This corresponds to taking a gradient step \beta^t - s^t \nabla f(\beta^t), then projecting the result onto the set \mathcal{C}.
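For instance, if \mathcal{C} is a box (an illustrative choice on my part), the projection is coordinate-wise clipping:

```python
import numpy as np

def projected_gd_step(beta, step, grad_f, lo=0.0, hi=1.0):
    """One projected gradient step onto the box C = [lo, hi]^p:
    take a gradient step, then clip each coordinate back into C."""
    return np.clip(beta - step * grad_f(beta), lo, hi)
```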

This formulation will also lead us to a more general class of methods called proximal gradient methods, of which projected gradient descent is a special case.

References:

  1. Hastie, T., Tibshirani, R., & Wainwright, M. Statistical learning with sparsity (pp. 101–103).