# Quasi-Newton methods in optimization

Consider the unconstrained minimization problem \begin{aligned} \min_x f(x) \end{aligned}

where $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable and $\text{dom}(f) = \mathbb{R}^n$. Newton’s method is a second-order descent method for finding a minimizer. Starting at some initial point, at the $k$th iteration we update the candidate solution with the formula $x_{k+1} = x_k - t_k [\nabla^2 f(x_k)]^{-1} \nabla f(x_k),$

where $\nabla f$ and $\nabla^2 f$ are the gradient and Hessian of $f$ respectively, and $t_k$ is a step size chosen appropriately (e.g. with backtracking line search). Another way to write the update is \begin{aligned} \nabla^2 f(x_k) p_k &= - \nabla f(x_k), \\ x_{k+1} &= x_k + t_k p_k. \end{aligned}

That is, we (i) compute the gradient and Hessian of $f$ at $x_k$, (ii) solve for $p_k$, then (iii) update $x_{k+1}$ according to the second line.
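To make the three steps concrete, here is a minimal sketch of one Newton iteration in Python with backtracking line search standing in for the step size rule. This is an illustration, not code from the lecture; the callables `f`, `grad_f` and `hess_f` (returning the objective, gradient and Hessian at a point) are assumed to be supplied by the user.

```python
import numpy as np

def newton_step(f, grad_f, hess_f, x, alpha=0.25, beta=0.5):
    """One Newton iteration with backtracking line search."""
    g = grad_f(x)                  # (i) gradient at x_k
    H = hess_f(x)                  # (i) Hessian at x_k
    p = np.linalg.solve(H, -g)     # (ii) solve H p = -g for the direction p_k
    t, fx = 1.0, f(x)
    # backtracking line search: shrink t until sufficient decrease holds
    while f(x + t * p) > fx + alpha * t * g @ p:
        t *= beta
    return x + t * p               # (iii) x_{k+1} = x_k + t_k p_k
```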

A big drawback of Newton’s method is that computing the Hessian can be very expensive. Quasi-Newton methods address this problem: update $x_{k+1}$ with \begin{aligned} B_k p_k &= - \nabla f(x_k), \\ x_{k+1} &= x_k + t_k p_k, \end{aligned}

where $B_k$ is an approximation of $\nabla^2 f(x_k)$ which is easy to compute.
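In code, the only change from the Newton sketch above is that the linear solve uses a maintained approximation `B` instead of evaluating the Hessian; again a sketch under the same assumptions:

```python
import numpy as np

def quasi_newton_step(f, grad_f, B, x, alpha=0.25, beta=0.5):
    """Like newton_step, but with an approximation B in place of the
    exact Hessian, so no second derivatives are ever evaluated."""
    g = grad_f(x)
    p = np.linalg.solve(B, -g)     # solve B_k p_k = -grad f(x_k)
    t, fx = 1.0, f(x)
    while f(x + t * p) > fx + alpha * t * g @ p:
        t *= beta                  # backtracking line search
    return x + t * p
```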

Quasi-Newton methods include SR1, DFP, BFGS and L-BFGS, which I hope to describe in future posts.

## The secant equation

The question becomes how to generate the $B_k$’s. Writing the first-order Taylor expansion of $\nabla f$ around $x_k$ and evaluating it at $x_{k+1}$, \begin{aligned} \nabla f(x_{k+1}) \approx \nabla f(x_k) + \nabla^2 f(x_k) (x_{k+1} - x_k). \end{aligned}

It seems reasonable for $B_{k+1}$ to play the role of $\nabla^2 f(x_k)$ above, i.e. \begin{aligned} \nabla f(x_{k+1}) = \nabla f(x_k) + B_{k+1} (x_{k+1} - x_k), \end{aligned}

or equivalently \begin{aligned} B_{k+1} s_k = y_k, \end{aligned}

where $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$. This last equation is known as the secant equation, and is sometimes written as $B^+ s = y$ for brevity. Note that it gives only $n$ equations for the $n^2$ entries of $B_{k+1}$, so there is a lot of freedom in how to choose the $B_k$’s. On top of satisfying the secant equation, we often require three other conditions:

1. $B^+$ is symmetric.
2. $B^+$ is “close” to $B$.
3. $B$ being positive definite implies that $B^+$ is positive definite.
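To see why the secant equation is a natural requirement, note that for a quadratic $f(x) = \frac{1}{2} x^T A x - b^T x$ the gradient is $Ax - b$, so $y_k = A s_k$ exactly, and the true Hessian $A$ satisfies the secant equation for any pair of points. A quick numeric check (NumPy assumed; the variable names here are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)      # symmetric positive definite Hessian
b = rng.standard_normal(n)
grad = lambda x: A @ x - b       # gradient of f(x) = x^T A x / 2 - b^T x

x_k = rng.standard_normal(n)
x_k1 = rng.standard_normal(n)    # stand-in for x_{k+1}
s = x_k1 - x_k                   # s_k
y = grad(x_k1) - grad(x_k)       # y_k equals A s_k exactly here

print(np.allclose(A @ s, y))     # True: B^+ = A satisfies B^+ s = y
```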

Each quasi-Newton iteration now looks like this:

1. Solve for $p_k$ in $B_k p_k = - \nabla f(x_k)$.
2. Update $x$ with $x_{k+1} = x_k + t_k p_k$.
3. Update $B$ according to some procedure so that the secant equation and the other conditions above are satisfied.

The different quasi-Newton methods amount to different updates for $B$.
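Putting the pieces together, here is a sketch of the full loop, with the update rule passed in as a callback so that different methods plug into step 3. As one illustration I use the SR1 update $B^+ = B + \frac{(y - Bs)(y - Bs)^T}{(y - Bs)^T s}$ (one of the methods mentioned above), which keeps $B^+$ symmetric and satisfies the secant equation, though unlike BFGS it does not guarantee positive definiteness. The function names, safeguard threshold, and steepest-descent fallback below are my own choices, not part of the original post.

```python
import numpy as np

def quasi_newton(f, grad_f, x0, update_B, max_iter=100, tol=1e-8,
                 alpha=0.25, beta=0.5):
    """Generic quasi-Newton loop; update_B implements step 3."""
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)                  # a common initialization: B_0 = I
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:     # stop when the gradient is small
            break
        p = np.linalg.solve(B, -g)      # step 1: B_k p_k = -grad f(x_k)
        if g @ p >= 0:                  # fall back to steepest descent if
            p = -g                      # B_k has lost positive definiteness
        t, fx = 1.0, f(x)
        while f(x + t * p) > fx + alpha * t * g @ p:
            t *= beta                   # backtracking line search
        x_new = x + t * p               # step 2
        s, y = x_new - x, grad_f(x_new) - g
        B = update_B(B, s, y)           # step 3: enforce B^+ s = y
        x = x_new
    return x

def sr1_update(B, s, y, eps=1e-8):
    """SR1: symmetric rank-one correction satisfying B^+ s = y."""
    r = y - B @ s
    if abs(r @ s) < eps * np.linalg.norm(r) * np.linalg.norm(s):
        return B                        # standard safeguard: skip tiny denominators
    return B + np.outer(r, r) / (r @ s)
```

For instance, on the quadratic example above one could run `quasi_newton(lambda x: 0.5 * x @ A @ x - b @ x, grad, np.zeros(n), sr1_update)`, which drives the gradient $Ax - b$ toward zero.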
