Quasi-Newton methods: L-BFGS

BFGS

In this previous post, we described how Quasi-Newton methods can be used to minimize a twice-differentiable function $f$ whose domain is all of $\mathbb{R}^n$. BFGS is a popular quasi-Newton method. At each iteration, we take the following steps:

1. Solve for $p_k$ in $p_k = - H_k \nabla f(x_k)$.
2. Update $x$ with $x_{k+1} = x_k + t_k p_k$.
3. Update $H$ according to the equation

\begin{aligned} H_{k+1}&= \left( I - \dfrac{s_k y_k^\top}{y_k^\top s_k}\right) H_k \left( I - \dfrac{y_k s_k^\top}{y_k^\top s_k} \right) + \dfrac{s_k s_k^\top}{y_k^\top s_k}, \end{aligned}

where $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.

Limited memory BFGS (L-BFGS)

The issue with BFGS is that we have to maintain the $n \times n$ matrix $H_k$, which is too costly when $n$ is large. The main idea behind limited memory BFGS (L-BFGS) is that we don’t really need $H_k$: what we really want is $p_k = - H_k \nabla f(x_k)$. If we can compute $H_k \nabla f(x_k)$ implicitly by maintaining a few $n \times 1$ vectors, then we never have to deal with an $n \times n$ matrix.

Let $g$ be any $n \times 1$ vector. We have

\begin{aligned} H_{k+1} g &= \left( I - \dfrac{s_k y_k^\top}{y_k^\top s_k}\right) H_k \left( I - \dfrac{y_k s_k^\top}{y_k^\top s_k} \right) g + \dfrac{s_k s_k^\top g}{y_k^\top s_k} \\ &= \left( I - \dfrac{s_k y_k^\top}{y_k^\top s_k}\right) H_k \left( g - \dfrac{s_k^\top g}{y_k^\top s_k} y_k \right) + \dfrac{s_k^\top g}{y_k^\top s_k} s_k \\ &= \left( I - \dfrac{s_k y_k^\top}{y_k^\top s_k}\right) H_k \left( g - \alpha_k y_k \right) + \alpha_k s_k &(\alpha_k = \dfrac{s_k^\top g}{y_k^\top s_k}) \\ &= \left( I - \dfrac{s_k y_k^\top}{y_k^\top s_k}\right) H_k q_k + \alpha_k s_k &(q_k = g - \alpha_k y_k) \\ &= p_k - \dfrac{s_k y_k^\top p_k}{y_k^\top s_k} + \alpha_k s_k &(p_k = H_k q_k) \\ &= p_k - (\alpha_k - \beta_k) s_k. &(\beta_k = \dfrac{y_k^\top p_k}{y_k^\top s_k}). \end{aligned}

Thus, if we had an efficient way to compute $p_k = H_k q_k$, we have an efficient way to compute $H_{k+1} g$ involving just $n \times 1$ vectors:

1. Compute $\alpha_k = \dfrac{s_k^\top g}{y_k^\top s_k}$.
2. Compute $q_k = g - \alpha_k y_k$.
3. Compute $p_k = H_k q_k$ efficiently.
4. Compute $\beta_k = \dfrac{y_k^\top p_k}{y_k^\top s_k}$.
5. Compute $H_{k+1}g = p_k - (\alpha_k - \beta_k) s_k$.

But we do have an efficient way to computing $p_k = H_k q_k$: simply replace $k+1$ with $k$ and $g$ with $q_k$ above! By recursively unfolding Step 3, we get the quasi-Newton update for iteration $k$ (computation of $p_k = -H_k \nabla f(x_k)$):

1. $q \leftarrow - \nabla f(x_k)$.
2. For $i = k-1, \dots, 0$, set $\alpha_i \leftarrow \dfrac{s_i^\top q}{y_i^\top s_i}$, and $q \leftarrow q - \alpha y_i$.
3. $p \leftarrow H_0 q$.
4. For $i = 0, \dots, k-1$, set $\beta \leftarrow \dfrac{y_i^\top p}{y_i^\top s_i}$, and $p \leftarrow p + (\alpha_i - \beta) s_i$.
5. Return $p$.

Limited memory BFGS goes one step further: instead of doing 2 loops of length $k$ for the $k$th iteration, do 2 loops of length $\min (m, k)$ instead, where $m$ is usually set to 5 or 10. Here is the appropriately modified update:

1. $q \leftarrow - \nabla f(x_k)$.
2. For $i = k-1, \dots, \min (k-m), 0$, set $\alpha_i \leftarrow \dfrac{s_i^\top q}{y_i^\top s_i}$, and $q \leftarrow q - \alpha y_i$.
3. $p \leftarrow H_{0,k} q$.
4. For $i = \min (k-m, 0), \dots, k-1$, set $\beta \leftarrow \dfrac{y_i^\top p}{y_i^\top s_i}$, and $p \leftarrow p + (\alpha_i - \beta) s_i$.
5. Return $p$.

A popular choice for $H_{0,k}$ is $H_{0,k} = \dfrac{y_{k-1}^\top s_{k-1}}{y_{k-1}^\top y_{k-1}}I$.

References:

1. Peña, J. (2016). Quasi-Newton Methods. (CMU Convex Optimization: Fall 2016, Lecture on Nov 2.)