Let’s say that I have a convex, differentiable function and that I am interested in minimizing it. A common first-order iterative method for minimizing is gradient descent. In words, a each step, compute the gradient of at the current point and move a certain distance in the direction of the negative gradient. In math, if is the current iterate, then the next iterate is given by
where is some step-size, which can be predetermined or adaptively chosen.
Did you know that this gradient descent step can be framed as an optimization problem as well? Indeed,
The proof follows directly from differentiating the expression inside the square brackets. Note that as a function of , the expression is differentiable and convex, hence its minimum will be achieved at the stationary point, i.e. , which rearranges to .
You can think of the expression on the RHS in as a quadratic approximation. Taking a second-order Taylor expansion of and , we have
Replacing the Hessian with gives the expression in .
What does this alternate formulation give us? This formulation generalizes easily to the idea of projected gradient descent. The problem above was unconstrained, in the sense that could potentially be any vector in . If we require to lie in some convex constraint set , then we could generalize each gradient step to be
This corresponds to taking a gradient step , then projecting the result onto the set .
This formulation will also lead us to a more general class of methods called proximal gradient methods, of which project gradient descent is a special case.
- Hastie, Tibshirani & Wainwright. Statistical Learning with Sparsity. (In particular, pp101-103.)