Assume we have a convex function $f: \mathbb{R}^n \mapsto \mathbb{R}$ and we would like to find its minimum. A commonly used class of algorithms, called descent algorithms, works in the following manner:
- Find a descent direction $\Delta x$.
- Update $x \leftarrow x + t \Delta x$, where the step size $t > 0$ is chosen in some manner (e.g. deterministically, by backtracking line search, or by exact line search).
- Repeat steps 1 and 2 until convergence.
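The steps above can be sketched in code. Here is a minimal sketch, with the objective, starting point, and line search parameters made up for illustration; the descent direction used is the negative gradient, the simplest choice (later sections generalize this):

```python
import numpy as np

def descent(f, grad, x0, alpha=0.25, beta=0.5, tol=1e-8, max_iter=1000):
    """Generic descent method with backtracking line search (a sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # step 3: stop when the gradient is tiny
            break
        dx = -g                       # step 1: a descent direction
        t = 1.0                       # step 2: backtracking line search
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
            t *= beta
        x = x + t * dx
    return x

# Illustrative convex quadratic, minimized at the origin.
f = lambda x: x[0]**2 + 10 * x[1]**2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
x_min = descent(f, grad, [5.0, -3.0])
```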
There are several descent directions to choose from: which should we use? If we look at the first-order Taylor expansion of $f$ around $x$,
$$f(x + v) \approx f(x) + \nabla f(x)^T v,$$
the term $\nabla f(x)^T v$ is known as the directional derivative of $f$ at $x$ in the direction $v$. It makes sense to find a $v$ whose directional derivative is as negative as possible, i.e. as steep as possible.
The problem, of course, is that once we find some $v$ such that $\nabla f(x)^T v$ is negative, we can scale up the norm of $v$ as much as we like, driving the directional derivative toward negative infinity. This means that without normalization, we cannot choose among the directions satisfying $\nabla f(x)^T v < 0$.
To that end, let $\|\cdot\|$ be some norm on $\mathbb{R}^n$. (What this norm is will matter later.) We define a normalized steepest descent direction with respect to $\|\cdot\|$ as
$$\Delta x_{nsd} = \underset{v}{\operatorname{argmin}} \left\{ \nabla f(x)^T v : \|v\| = 1 \right\}.$$
(The subscript “nsd” is short for “normalized steepest descent”.) This direction need not be unique. We can then descend in the direction of $\Delta x_{nsd}$.
It is also common to look at the unnormalized steepest descent step
$$\Delta x_{sd} = \|\nabla f(x)\|_* \, \Delta x_{nsd},$$
where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$. (For exact line search, it makes no difference whether we use $\Delta x_{nsd}$ or $\Delta x_{sd}$, since the scale factor does not matter.)
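As a quick numerical sanity check of these definitions, consider the Euclidean norm (whose dual is itself, so the unnormalized step works out to $-\nabla f(x)$). The gradient vector below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical gradient at the current point (for illustration only).
g = np.array([3.0, -4.0])

# Euclidean norm: the normalized steepest descent direction is -g / ||g||_2,
# and since the l2 norm is self-dual, ||g||_* * nsd is exactly -g.
nsd = -g / np.linalg.norm(g)
sd = np.linalg.norm(g) * nsd

# Sanity check: among many random unit vectors v, none achieves a directional
# derivative g^T v more negative than that of nsd (Cauchy-Schwarz).
for _ in range(1000):
    v = rng.standard_normal(g.shape[0])
    v /= np.linalg.norm(v)
    assert g @ v >= g @ nsd
```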
Steepest descent as a class of algorithms
Notice that the steepest descent direction depends on the norm chosen in its definition. Hence, steepest descent is really a generic name: specific algorithms arise when we take $\|\cdot\|$ to be a particular norm.
Steepest descent vs. gradient descent
When I was looking around the internet for information on steepest descent, I noticed some confusion between steepest descent and gradient descent, with some webpages saying that the two are the same. That is not quite correct: gradient descent is the special case of steepest descent where the norm chosen is the Euclidean norm. But I suppose in many contexts we only ever consider the Euclidean norm, which is why we might conflate the two algorithms.
Why use different norms?
This raises the question: why consider any norm in the definition of the steepest descent direction other than the Euclidean norm? I am not too familiar with material in this area, but it seems that the choice of norm can greatly affect the convergence rate of the algorithm. See Section 9.4.4 of Reference 1 for some examples. The motivation should also become clearer once we see that several popular variations of gradient descent are actually special cases of steepest descent.
Special cases of steepest descent
A few popular optimization algorithms can be viewed as steepest descent for a particular choice of norm. Again, see Section 9.4 of Reference 1 for details.
- Euclidean norm: If $\|\cdot\|$ is the Euclidean norm, then steepest descent is gradient descent.
- Quadratic norms: For any symmetric (strictly) positive definite matrix $P$, define the norm $\|z\|_P = (z^T P z)^{1/2}$. Steepest descent under this norm amounts to doing gradient descent after doing a change of coordinates. Concretely, our variable is now $\bar{x} = P^{1/2} x$, and we are minimizing $\bar{f}$ defined by $\bar{f}(\bar{x}) = f(P^{-1/2} \bar{x}) = f(x)$. Steepest descent on $f$ in the original space corresponds to gradient descent on $\bar{f}$ w.r.t. $\bar{x}$.
  In particular, if $P = \nabla^2 f(x)$ (the Hessian at the current point), then steepest descent becomes Newton’s method.
- $\ell_1$ norm: If $\|\cdot\|$ is the $\ell_1$ norm, steepest descent becomes a form of coordinate descent: each step updates only the single coordinate whose partial derivative is largest in absolute value.
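To make the last two special cases concrete, here is a small numerical sketch; the quadratic objective and the choice of $P$ are made up for illustration. It uses the closed forms from Section 9.4 of Reference 1: under $\|\cdot\|_P$ the unnormalized steepest descent step is $-P^{-1}\nabla f(x)$, and under the $\ell_1$ norm it is $-(\partial f/\partial x_i)\, e_i$ for the coordinate $i$ with the largest partial derivative in absolute value:

```python
import numpy as np

# Made-up convex quadratic f(x) = 0.5 * x^T Q x, for illustration.
Q = np.array([[10.0, 0.0],
              [0.0,  1.0]])
grad = lambda x: Q @ x

x = np.array([1.0, 3.0])
g = grad(x)

# Quadratic norm ||z||_P: the unnormalized steepest descent step is -P^{-1} g.
# Taking P = Q (the Hessian of f here) makes this the Newton step, which for
# a quadratic jumps straight to the minimizer at the origin.
P = Q
step_P = -np.linalg.solve(P, g)

# Same step via the change of coordinates xbar = P^{1/2} x: a gradient step
# on fbar(xbar) = f(P^{-1/2} xbar), mapped back to x-space, is also -P^{-1} g.
w, V = np.linalg.eigh(P)
P_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
step_via_xbar = P_inv_sqrt @ (-(P_inv_sqrt @ g))

# l1 norm: the step -(df/dx_i) e_i touches only the coordinate i with the
# largest |partial derivative| -- a coordinate descent step.
i = np.argmax(np.abs(g))
step_l1 = np.zeros_like(x)
step_l1[i] = -g[i]
```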
1. Boyd, S., and Vandenberghe, L. Convex Optimization (Section 9.4: Steepest descent method).