Let’s say that I have a convex, differentiable function and that I am interested in minimizing it. A common first-order iterative method for minimizing is **gradient descent**. In words, a each step, compute the gradient of at the current point and move a certain distance in the direction of the negative gradient. In math, if is the current iterate, then the next iterate is given by

where is some step-size, which can be predetermined or adaptively chosen.

**Did you know that this gradient descent step can be framed as an optimization problem as well?** Indeed,

The proof follows directly from differentiating the expression inside the square brackets. Note that as a function of , the expression is differentiable and convex, hence its minimum will be achieved at the stationary point, i.e. , which rearranges to .

You can think of the expression on the RHS in as a **quadratic approximation**. Taking a second-order Taylor expansion of and , we have

Replacing the Hessian with gives the expression in .

**What does this alternate formulation give us?** This formulation generalizes easily to the idea of **projected gradient descent**. The problem above was unconstrained, in the sense that could potentially be any vector in . If we require to lie in some convex constraint set , then we could generalize each gradient step to be

This corresponds to taking a gradient step , then projecting the result onto the set .

This formulation will also lead us to a more general class of methods called **proximal gradient methods**, of which project gradient descent is a special case.

References:

- Hastie, Tibshirani & Wainwright. Statistical Learning with Sparsity. (In particular, pp101-103.)

### Like this:

Like Loading...

*Related*

Pingback: Proximal operators and generalized gradient descent | Statistical Odds & Ends