Gradient descent as a minimization problem

Let’s say that I have a convex, differentiable function $f: \mathbb{R}^p \rightarrow \mathbb{R}$ and that I am interested in minimizing it. A common first-order iterative method for minimizing $f$ is gradient descent. In words, a each step, compute the gradient of $f$ at the current point and move a certain distance in the direction of the negative gradient. In math, if $\beta^t$ is the current iterate, then the next iterate is given by $\beta^{t+1} = \beta^t - s^t \nabla f(\beta^t),$

where $s^t > 0$ is some step-size, which can be predetermined or adaptively chosen.

Did you know that this gradient descent step can be framed as an optimization problem as well? Indeed, $\beta^{t+1} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ f(\beta^t) + \langle \nabla f(\beta^t), \beta - \beta^t \rangle + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 \right\}.$

The proof follows directly from differentiating the expression inside the square brackets. Note that as a function of $\beta$, the expression is differentiable and convex, hence its minimum will be achieved at the stationary point, i.e. $\nabla f(\beta^t) + \frac{1}{s^t} (\beta - \beta^t) = 0$, which rearranges to $\beta = \beta^t - s^t \nabla f(\beta^t)$.

What does this alternate formulation give us? This formulation generalizes easily to the idea of projected gradient descent. The problem above was unconstrained, in the sense that $\beta$ could potentially be any vector in $\mathbb{R}^p$. If we require $\beta$ to lie in some convex constraint set $\mathcal{C}$, then we could generalize each gradient step to be $\beta^{t+1} = \underset{\beta \in \mathcal{C}}{\text{argmin}} \left\{ f(\beta^t) + \langle \nabla f(\beta^t), \beta - \beta^t \rangle + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 \right\}.$

This corresponds to taking a gradient step $\beta^t - s^t \nabla f(\beta^t)$, then projecting the result onto the set $\mathcal{C}$.

This formulation will also lead us to a more general class of methods called proximal gradient methods, of which project gradient descent is a special case.

References:

1. Hastie, Tibshirani & Wainwright. Statistical Learning with Sparsity. (In particular, pp101-103.)