Quasi-Newton methods: L-BFGS

BFGS

In this previous post, we described how Quasi-Newton methods can be used to minimize a twice-differentiable function f whose domain is all of \mathbb{R}^n. BFGS is a popular quasi-Newton method. At each iteration, we take the following steps:

  1. Solve for p_k in p_k = - H_k \nabla f(x_k).
  2. Update x with x_{k+1} = x_k + t_k p_k.
  3. Update H according to the equation

\begin{aligned} H_{k+1}&= \left( I - \dfrac{s_k y_k^\top}{y_k^\top s_k}\right) H_k \left( I - \dfrac{y_k s_k^\top}{y_k^\top s_k} \right)  + \dfrac{s_k s_k^\top}{y_k^\top s_k}, \end{aligned}

where s_k = x_{k+1} - x_k and y_k = \nabla f(x_{k+1}) - \nabla f(x_k).

Limited memory BFGS (L-BFGS)

The issue with BFGS is that we have to maintain the n \times n matrix H_k, which is too costly when n is large. The main idea behind limited memory BFGS (L-BFGS) is that we don’t really need H_k: what we really want is p_k = - H_k \nabla f(x_k). If we can compute H_k \nabla f(x_k) implicitly by maintaining a few n \times 1 vectors, then we never have to deal with an n \times n matrix.

Let g be any n \times 1 vector. We have

\begin{aligned} H_{k+1} g &= \left( I - \dfrac{s_k y_k^\top}{y_k^\top s_k}\right) H_k \left( I - \dfrac{y_k s_k^\top}{y_k^\top s_k} \right) g  + \dfrac{s_k s_k^\top g}{y_k^\top s_k}  \\  &= \left( I - \dfrac{s_k y_k^\top}{y_k^\top s_k}\right) H_k \left( g - \dfrac{s_k^\top g}{y_k^\top s_k} y_k \right)  + \dfrac{s_k^\top g}{y_k^\top s_k} s_k \\  &= \left( I - \dfrac{s_k y_k^\top}{y_k^\top s_k}\right) H_k \left( g - \alpha_k y_k \right)  + \alpha_k s_k &(\alpha_k = \dfrac{s_k^\top g}{y_k^\top s_k}) \\  &= \left( I - \dfrac{s_k y_k^\top}{y_k^\top s_k}\right) H_k q_k  + \alpha_k s_k &(q_k = g - \alpha_k y_k) \\  &= p_k - \dfrac{s_k y_k^\top p_k}{y_k^\top s_k} + \alpha_k s_k &(p_k = H_k q_k) \\  &= p_k - (\alpha_k - \beta_k) s_k. &(\beta_k = \dfrac{y_k^\top p_k}{y_k^\top s_k}). \end{aligned}

Thus, if we had an efficient way to compute p_k = H_k q_k, we have an efficient way to compute H_{k+1} g involving just n \times 1 vectors:

  1. Compute \alpha_k = \dfrac{s_k^\top g}{y_k^\top s_k}.
  2. Compute q_k = g - \alpha_k y_k.
  3. Compute p_k = H_k q_k efficiently.
  4. Compute \beta_k = \dfrac{y_k^\top p_k}{y_k^\top s_k}.
  5. Compute H_{k+1}g = p_k - (\alpha_k - \beta_k) s_k.

But we do have an efficient way to computing p_k = H_k q_k: simply replace k+1 with k and g with q_k above! By recursively unfolding Step 3, we get the quasi-Newton update for iteration k (computation of p_k = -H_k \nabla f(x_k)):

  1. q \leftarrow - \nabla f(x_k).
  2. For i = k-1, \dots, 0, set \alpha_i \leftarrow \dfrac{s_i^\top q}{y_i^\top s_i}, and q \leftarrow q - \alpha y_i.
  3. p \leftarrow H_0 q.
  4. For i = 0, \dots, k-1, set \beta \leftarrow \dfrac{y_i^\top p}{y_i^\top s_i}, and p \leftarrow p + (\alpha_i - \beta) s_i.
  5. Return p.

Limited memory BFGS goes one step further: instead of doing 2 loops of length k for the kth iteration, do 2 loops of length \min (m, k) instead, where m is usually set to 5 or 10. Here is the appropriately modified update:

  1. q \leftarrow - \nabla f(x_k).
  2. For i = k-1, \dots, \min (k-m), 0, set \alpha_i \leftarrow \dfrac{s_i^\top q}{y_i^\top s_i}, and q \leftarrow q - \alpha y_i.
  3. p \leftarrow H_{0,k} q.
  4. For i = \min (k-m, 0), \dots, k-1, set \beta \leftarrow \dfrac{y_i^\top p}{y_i^\top s_i}, and p \leftarrow p + (\alpha_i - \beta) s_i.
  5. Return p.

A popular choice for H_{0,k} is H_{0,k} = \dfrac{y_{k-1}^\top s_{k-1}}{y_{k-1}^\top y_{k-1}}I .

References:

  1. Peña, J. (2016). Quasi-Newton Methods. (CMU Convex Optimization: Fall 2016, Lecture on Nov 2.)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s