Assume we are trying to minimize some convex, differentiable function $f$. One way to do this is to use a descent method. A general descent method can be described as follows: Given a starting point $x$ in the domain of $f$, iterate over the following 3 steps until some stopping criterion is satisfied:
- Choose a descent direction $\Delta x$.
- Choose a step size $t > 0$.
- Update $x := x + t \Delta x$.
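For concreteness, here is a minimal Python sketch of this template. The callables `choose_direction` and `choose_step_size` are hypothetical placeholders standing in for a direction rule and a line search, and the stopping criterion shown is just one possible choice:

```python
import numpy as np

def descent_method(f, grad_f, x0, choose_direction, choose_step_size,
                   tol=1e-8, max_iter=1000):
    """Generic descent method; the direction rule and line search are pluggable."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        direction = choose_direction(f, grad_f, x)     # e.g. -grad_f(x) for gradient descent
        t = choose_step_size(f, grad_f, x, direction)  # exact or backtracking line search
        x_new = x + t * direction                      # the update step
        if np.linalg.norm(x_new - x) < tol:            # one possible stopping criterion
            return x_new
        x = x_new
    return x
```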
There are several ways to choose a descent direction (see this previous post on steepest descent for some examples). This post is about two ways to choose the step size. This step is called a line search because the step size determines where along the line the next iterate will be.
Exact line search
Exact line search finds the step size $t$ at which $f(x + t \Delta x)$ attains its minimum value, where we let $t$ range over non-negative values. The diagram below demonstrates this:

The curve is the value of the function along the ray $x + t \Delta x$ with $t \geq 0$. $t^\star$ is the point where this function is minimized. It should be clear from the description that this choice will result in smaller values of the function in each iteration.
Since exact line search requires solving a minimization problem (albeit in one dimension), it can be expensive unless this minimum can be computed efficiently (or even analytically).
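As an illustration (my own, not taken from the post's diagrams), the minimization can be done analytically for a quadratic $f(x) = \tfrac{1}{2} x^T A x - b^T x$ with $A$ positive definite: setting the derivative of $f(x + t \Delta x)$ with respect to $t$ to zero gives a closed-form step size. A minimal Python sketch under this assumption:

```python
import numpy as np

def exact_line_search_quadratic(A, b, x, direction):
    """Exact step for f(x) = 0.5 * x^T A x - b^T x along `direction`.

    Setting d/dt f(x + t*direction) = 0 gives
    t = -grad_f(x)^T direction / (direction^T A direction).
    """
    grad = A @ x - b
    t = -(grad @ direction) / (direction @ A @ direction)
    return max(t, 0.0)  # keep t in the allowed non-negative range

# One gradient descent step with an exact line search
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x = np.zeros(2)
direction = -(A @ x - b)   # steepest descent direction
t = exact_line_search_quadratic(A, b, x, direction)
x = x + t * direction
```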
Backtracking line search
Backtracking line search uses a heuristic to quickly determine a step size which decreases the objective function value by a certain amount. The user needs to set two parameters, $\alpha \in (0, 0.5)$ and $\beta \in (0, 1)$. In each iteration, after the descent direction $\Delta x$ has been chosen, the step size is determined by the following:
- Set $t = 1$.
- While $f(x + t \Delta x) > f(x) + \alpha t \nabla f(x)^T \Delta x$, set $t := \beta t$.

In other words, we keep reducing the step size by the same multiplicative factor $\beta$ until the stopping condition $f(x + t \Delta x) \leq f(x) + \alpha t \nabla f(x)^T \Delta x$ holds.
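Here is a short Python sketch of this procedure (the default values $\alpha = 0.3$ and $\beta = 0.8$ are just illustrative picks from the ranges suggested in Reference 1):

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, direction, alpha=0.3, beta=0.8):
    """Shrink t by a factor of beta until the sufficient-decrease condition
    f(x + t*direction) <= f(x) + alpha * t * grad_f(x)^T direction holds."""
    t = 1.0
    fx = f(x)
    slope = grad_f(x) @ direction  # negative, since `direction` is a descent direction
    while f(x + t * direction) > fx + alpha * t * slope:
        t *= beta
    return t

# One gradient descent step on f(x) = ||x||^2
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x = np.array([1.0, -2.0])
direction = -grad_f(x)
t = backtracking_line_search(f, grad_f, x, direction)
x = x + t * direction
```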
Let’s draw a picture to understand the intuition behind the inequality:
This is the same picture as that for exact line search, except we've added the line $f(x) + \alpha t \nabla f(x)^T \Delta x$. This line always has negative slope: because $\Delta x$ is a descent direction, we must have $\nabla f(x)^T \Delta x < 0$. The stopping condition, $f(x + t \Delta x) \leq f(x) + \alpha t \nabla f(x)^T \Delta x$, holds for the blue segment $[0, t_0]$. The operation $t := \beta t$ makes the step size smaller and smaller until $t$ falls in the blue segment, and by the way the inequality is set up, we know that the objective function will have a strictly smaller value there.
How do we know that $t_0$ is always positive (i.e. the blue segment exists)? It's because of how we chose the term $\alpha t \nabla f(x)^T \Delta x$. It can be shown that the derivative of $f(x + t \Delta x)$ w.r.t. $t$ at the point $t = 0$ is $\nabla f(x)^T \Delta x$. Hence, the line $f(x) + t \nabla f(x)^T \Delta x$ is tangent to $f(x + t \Delta x)$ at $t = 0$. Multiplying the term $\nabla f(x)^T \Delta x$ by some $\alpha < 1$ ensures that the stopping condition will hold for some non-trivial segment $[0, t_0]$.
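For completeness, the derivative claim follows from the chain rule:

$$\left. \frac{d}{dt} f(x + t \Delta x) \right|_{t=0} = \left. \nabla f(x + t \Delta x)^T \Delta x \right|_{t=0} = \nabla f(x)^T \Delta x.$$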
(Note: The last diagram is taken from Reference 1, and the other diagrams are this base + my annotations.)
References:
- Boyd, S., and Vandenberghe, L. (2004). Convex Optimization. Chapter 9.