Gradient descent as a minimization problem

Let’s say that I have a convex, differentiable function f: \mathbb{R}^p \rightarrow \mathbb{R} and that I am interested in minimizing it. A common first-order iterative method for minimizing f is gradient descent. In words, at each step, we compute the gradient of f at the current point and move a certain distance in the direction of the negative gradient. In math, if \beta^t is the current iterate, then the next iterate is given by

\beta^{t+1} = \beta^t - s^t \nabla f(\beta^t),

where s^t > 0 is some step-size, which can be predetermined or adaptively chosen.
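
For concreteness, here is a minimal sketch of this update in Python. The quadratic objective f(\beta) = \frac{1}{2} \| A\beta - b \|_2^2, the random data, and the fixed step size are all illustrative choices, not anything from a particular application.

```python
import numpy as np

def gradient_descent(grad_f, beta0, step_size, n_iter=1000):
    """Repeatedly apply beta^{t+1} = beta^t - s^t * grad f(beta^t) with a fixed step size."""
    beta = beta0
    for _ in range(n_iter):
        beta = beta - step_size * grad_f(beta)
    return beta

# Illustrative example: f(beta) = 0.5 * ||A beta - b||_2^2, so grad f(beta) = A^T (A beta - b).
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
grad_f = lambda beta: A.T @ (A @ beta - b)

beta_hat = gradient_descent(grad_f, np.zeros(5), step_size=0.01)
```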

Did you know that this gradient descent step can be framed as an optimization problem as well? Indeed,

\beta^{t+1} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \left\{ f(\beta^t) + \langle \nabla f(\beta^t), \beta - \beta^t \rangle  + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 \right\}.

The proof follows directly from differentiating the expression inside the braces. Note that as a function of \beta, the expression is differentiable and convex, hence its minimum is achieved at the stationary point, i.e. \nabla f(\beta^t) + \frac{1}{s^t} (\beta - \beta^t) = 0, which rearranges to \beta = \beta^t - s^t \nabla f(\beta^t).
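
Here is a small numerical check (not part of the original argument) that minimizing the displayed expression recovers the explicit gradient step. The quadratic f, the random point \beta^t, and the use of scipy.optimize.minimize are all illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
f = lambda beta: 0.5 * np.sum((A @ beta - b) ** 2)
grad_f = lambda beta: A.T @ (A @ beta - b)

beta_t, s = rng.normal(size=5), 0.1

# Surrogate: f(beta^t) + <grad f(beta^t), beta - beta^t> + (1 / (2 s)) * ||beta - beta^t||_2^2
surrogate = lambda beta: (f(beta_t)
                          + grad_f(beta_t) @ (beta - beta_t)
                          + np.sum((beta - beta_t) ** 2) / (2 * s))

beta_numeric = minimize(surrogate, x0=beta_t).x           # minimize the surrogate directly
beta_closed = beta_t - s * grad_f(beta_t)                 # explicit gradient step
print(np.allclose(beta_numeric, beta_closed, atol=1e-5))  # True
```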

What does this alternate formulation give us? For one, it generalizes easily to the idea of projected gradient descent. The problem above was unconstrained, in the sense that \beta could be any vector in \mathbb{R}^p. If we require \beta to lie in some convex constraint set \mathcal{C}, then we can generalize each gradient step to

\beta^{t+1} = \underset{\beta \in \mathcal{C}}{\text{argmin}} \left\{ f(\beta^t) + \langle \nabla f(\beta^t), \beta - \beta^t \rangle  + \frac{1}{2s^t} \| \beta - \beta^t \|_2^2 \right\}.

This corresponds to taking a gradient step \beta^t - s^t \nabla f(\beta^t), then projecting the result onto the set \mathcal{C}. To see this, complete the square: up to terms that do not depend on \beta, the objective above equals \frac{1}{2s^t} \| \beta - (\beta^t - s^t \nabla f(\beta^t)) \|_2^2, so minimizing it over \mathcal{C} is exactly a Euclidean projection of the gradient step onto \mathcal{C}.
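
As an illustration, here is a sketch of projected gradient descent where \mathcal{C} is taken to be the nonnegative orthant \{\beta : \beta \geq 0\}, whose Euclidean projection is just coordinate-wise clipping at zero; the constraint set and the least-squares objective are illustrative choices.

```python
import numpy as np

def projected_gradient_descent(grad_f, project, beta0, step_size, n_iter=1000):
    """Take a gradient step, then project the result back onto the convex set C."""
    beta = beta0
    for _ in range(n_iter):
        beta = project(beta - step_size * grad_f(beta))
    return beta

# C = {beta : beta >= 0}: the Euclidean projection clips negative coordinates to zero.
project_nonneg = lambda beta: np.maximum(beta, 0.0)

rng = np.random.default_rng(2)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
grad_f = lambda beta: A.T @ (A @ beta - b)

beta_hat = projected_gradient_descent(grad_f, project_nonneg, np.zeros(5), step_size=0.01)
```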

This formulation will also lead us to a more general class of methods called proximal gradient methods, of which projected gradient descent is a special case.
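
As a preview of that more general class, here is a minimal sketch of the proximal gradient (ISTA) iteration for the lasso, where the quadratic term above is kept but the constraint is replaced by an \ell_1 penalty whose proximal operator is soft-thresholding; the problem setup, penalty level, and step size are illustrative.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1: shrink each coordinate toward zero by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def proximal_gradient_lasso(A, b, lam, step_size, n_iter=1000):
    """ISTA: gradient step on 0.5 * ||A beta - b||_2^2, then prox of the l1 penalty."""
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ beta - b)
        beta = soft_threshold(beta - step_size * grad, step_size * lam)
    return beta

# Illustrative usage with random data and a made-up penalty level.
rng = np.random.default_rng(3)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
beta_hat = proximal_gradient_lasso(A, b, lam=1.0, step_size=0.01)
```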

References:

  1. Hastie, Tibshirani & Wainwright. Statistical Learning with Sparsity. (In particular, pp. 101-103.)