Quasi-Newton methods: L-BFGS

BFGS

In this previous post, we described how Quasi-Newton methods can be used to minimize a twice-differentiable function f whose domain is all of \mathbb{R}^n. BFGS is a popular quasi-Newton method. At each iteration, we take the following steps:

  1. Compute p_k = - H_k \nabla f(x_k).
  2. Update x with x_{k+1} = x_k + t_k p_k.
  3. Update H according to the equation

\begin{aligned} H_{k+1}&= \left( I - \dfrac{s_k y_k^\top}{y_k^\top s_k}\right) H_k \left( I - \dfrac{y_k s_k^\top}{y_k^\top s_k} \right)  + \dfrac{s_k s_k^\top}{y_k^\top s_k}, \end{aligned}

where s_k = x_{k+1} - x_k and y_k = \nabla f(x_{k+1}) - \nabla f(x_k).

Limited memory BFGS (L-BFGS)

The issue with BFGS is that we have to maintain the n \times n matrix H_k, which is too costly when n is large. The main idea behind limited memory BFGS (L-BFGS) is that we don’t really need H_k: what we really want is p_k = - H_k \nabla f(x_k). If we can compute H_k \nabla f(x_k) implicitly by maintaining a few n \times 1 vectors, then we never have to deal with an n \times n matrix.

Let g be any n \times 1 vector. We have

\begin{aligned} H_{k+1} g &= \left( I - \dfrac{s_k y_k^\top}{y_k^\top s_k}\right) H_k \left( I - \dfrac{y_k s_k^\top}{y_k^\top s_k} \right) g  + \dfrac{s_k s_k^\top g}{y_k^\top s_k}  \\  &= \left( I - \dfrac{s_k y_k^\top}{y_k^\top s_k}\right) H_k \left( g - \dfrac{s_k^\top g}{y_k^\top s_k} y_k \right)  + \dfrac{s_k^\top g}{y_k^\top s_k} s_k \\  &= \left( I - \dfrac{s_k y_k^\top}{y_k^\top s_k}\right) H_k \left( g - \alpha_k y_k \right)  + \alpha_k s_k &(\alpha_k = \dfrac{s_k^\top g}{y_k^\top s_k}) \\  &= \left( I - \dfrac{s_k y_k^\top}{y_k^\top s_k}\right) H_k q_k  + \alpha_k s_k &(q_k = g - \alpha_k y_k) \\  &= p_k - \dfrac{s_k y_k^\top p_k}{y_k^\top s_k} + \alpha_k s_k &(p_k = H_k q_k) \\  &= p_k + (\alpha_k - \beta_k) s_k &(\beta_k = \dfrac{y_k^\top p_k}{y_k^\top s_k}). \end{aligned}

Thus, if we have an efficient way to compute p_k = H_k q_k, we have an efficient way to compute H_{k+1} g involving just n \times 1 vectors:

  1. Compute \alpha_k = \dfrac{s_k^\top g}{y_k^\top s_k}.
  2. Compute q_k = g - \alpha_k y_k.
  3. Compute p_k = H_k q_k efficiently.
  4. Compute \beta_k = \dfrac{y_k^\top p_k}{y_k^\top s_k}.
  5. Compute H_{k+1}g = p_k + (\alpha_k - \beta_k) s_k.

But we do have an efficient way of computing p_k = H_k q_k: simply replace k+1 with k and g with q_k above! By recursively unfolding Step 3, we get the quasi-Newton update for iteration k (computation of p_k = -H_k \nabla f(x_k)):

  1. q \leftarrow - \nabla f(x_k).
  2. For i = k-1, \dots, 0, set \alpha_i \leftarrow \dfrac{s_i^\top q}{y_i^\top s_i}, and q \leftarrow q - \alpha_i y_i.
  3. p \leftarrow H_0 q.
  4. For i = 0, \dots, k-1, set \beta \leftarrow \dfrac{y_i^\top p}{y_i^\top s_i}, and p \leftarrow p + (\alpha_i - \beta) s_i.
  5. Return p.

Limited memory BFGS goes one step further: instead of doing 2 loops of length k for the kth iteration, we do 2 loops of length \min (m, k), where m is usually set to 5 or 10. Here is the appropriately modified update:

  1. q \leftarrow - \nabla f(x_k).
  2. For i = k-1, \dots, \max (k-m, 0), set \alpha_i \leftarrow \dfrac{s_i^\top q}{y_i^\top s_i}, and q \leftarrow q - \alpha_i y_i.
  3. p \leftarrow H_{0,k} q.
  4. For i = \max (k-m, 0), \dots, k-1, set \beta \leftarrow \dfrac{y_i^\top p}{y_i^\top s_i}, and p \leftarrow p + (\alpha_i - \beta) s_i.
  5. Return p.

A popular choice for H_{0,k} is H_{0,k} = \dfrac{y_{k-1}^\top s_{k-1}}{y_{k-1}^\top y_{k-1}}I .
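Here is a minimal NumPy sketch of the two-loop recursion (my own implementation, with my own variable names, not code from the references). It takes the current gradient and the \min(m, k) most recent (s_i, y_i) pairs, stored oldest to newest, and returns p = -H_k \nabla f(x_k); it assumes y_i^\top s_i > 0 for all stored pairs:

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: returns p = -H_k @ grad from the stored (s, y) pairs.

    s_list, y_list: the min(m, k) most recent pairs, ordered oldest to newest.
    """
    q = -np.asarray(grad, dtype=float)
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest pair down to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        alphas.append(alpha)
    # Initial approximation H_{0,k} = (y_{k-1}^T s_{k-1} / y_{k-1}^T y_{k-1}) I.
    if s_list:
        gamma = np.dot(y_list[-1], s_list[-1]) / np.dot(y_list[-1], y_list[-1])
    else:
        gamma = 1.0
    p = gamma * q
    # Second loop: oldest pair up to newest.
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * np.dot(y, p)
        p += (alpha - beta) * s
    return p
```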

References:

  1. Peña, J. (2016). Quasi-Newton Methods. (CMU Convex Optimization: Fall 2016, Lecture on Nov 2.)

Quasi-Newton methods: DFP and BFGS

Introduction

In this previous post, we described how Quasi-Newton methods can be used to minimize a twice-differentiable function f whose domain is all of \mathbb{R}^n. At each iteration of the method, we take the following steps:

  1. Solve for p_k in B_k p_k = - \nabla f(x_k).
  2. Update x with x_{k+1} = x_k + t_k p_k.
  3. Update B according to some procedure.

Many methods update B so that it satisfies the secant equation:

\begin{aligned} B_{k+1} s_k = y_k, \end{aligned}

where s_k = x_{k+1} - x_k and y_k = \nabla f(x_{k+1}) - \nabla f(x_k). This is often written as B^+ s = y for brevity. If we let H denote the inverse of B, then the secant equation is equivalent to H^+ y = s.

DFP

Recall that the SR1 update assumes that the next B is the current B plus a symmetric rank-one matrix. The DFP update (Davidon-Fletcher-Powell) assumes that the next H is the current H plus a symmetric rank-two matrix, i.e.

\begin{aligned} H^+ = H + auu^\top + bvv^\top. \end{aligned}

Plugging this into the secant equation gives

\begin{aligned} s - Hy = (au^\top y) u + (b v^\top y)v. \end{aligned}

Setting u = s and v = Hy, we can solve for a and b to get the update

\begin{aligned} H^+ = H - \dfrac{Hyy^\top H}{y^\top H y} + \dfrac{ss^\top}{y^\top s}. \end{aligned}

Thus, each iteration of DFP can be written as:

  1. Set p_k = - H_k \nabla f(x_k).
  2. Update x with x_{k+1} = x_k + t_k p_k.
  3. Set H_{k+1} = H_k - \dfrac{H_k y_k y_k^\top H_k}{y_k^\top H_k y_k} + \dfrac{s_k s_k^\top}{y_k^\top s_k}.

One advantage that the DFP update has over the SR1 update is that B being positive definite implies that B^+ is also positive definite.
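For concreteness, here is a minimal NumPy sketch of the H update in step 3 (the helper name is mine):

```python
import numpy as np

def dfp_update_H(H, s, y):
    """DFP update: H_{k+1} = H - (H y y^T H) / (y^T H y) + (s s^T) / (y^T s)."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)
```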

BFGS

Whereas DFP assumes H^+ is H plus a symmetric rank-two matrix, the BFGS update (Broyden-Fletcher-Goldfarb-Shanno) assumes that B^+ is B plus a symmetric rank-two matrix. Going through the same chain of reasoning as the DFP update but with B replacing H and y and s switching roles, we get the updates

\begin{aligned} B^+ &= B - \dfrac{Bss^\top B}{s^\top Bs} + \dfrac{yy^\top}{y^\top s}, \\  H^+ &= \left( I - \dfrac{sy^\top}{y^\top s}\right) H \left( I - \dfrac{ys^\top}{y^\top s} \right)  + \dfrac{ss^\top}{y^\top s}. \end{aligned}

Like the DFP update, the BFGS update maintains the positive definiteness of B. In addition, BFGS also has a self-correcting property that makes it attractive in practice (see Nocedal 1992, Reference 2). The BFGS update is the most popular quasi-Newton update in practice.
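Here is a minimal NumPy sketch of the inverse-Hessian form of the BFGS update above (the helper name is mine); the trailing comments show how it would slot into one quasi-Newton iteration:

```python
import numpy as np

def bfgs_update_H(H, s, y):
    """BFGS update: H^+ = (I - rho s y^T) H (I - rho y s^T) + rho s s^T."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(y, s)
    return V.T @ H @ V + rho * np.outer(s, s)

# One quasi-Newton iteration with this update would look like:
#   p = -H @ grad_f(x);  x_new = x + t * p
#   s, y = x_new - x, grad_f(x_new) - grad_f(x)
#   H = bfgs_update_H(H, s, y)
```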

References:

  1. Peña, J. (2016). Quasi-Newton Methods. (CMU Convex Optimization: Fall 2016, Lecture on Nov 2.)
  2. Nocedal, J. (1992). Theory of algorithms for unconstrained optimization.

Quasi-Newton methods: SR1

Introduction

In this previous post, we described how Quasi-Newton methods can be used to minimize a twice-differentiable function f whose domain is all of \mathbb{R}^n. At each iteration of the method, we take the following steps:

  1. Solve for p_k in B_k p_k = - \nabla f(x_k).
  2. Update x with x_{k+1} = x_k + t_k p_k.
  3. Update B according to some procedure.

Many methods update B so that it satisfies the secant equation:

\begin{aligned} B_{k+1} s_k = y_k, \end{aligned}

where s_k = x_{k+1} - x_k and y_k = \nabla f(x_{k+1}) - \nabla f(x_k). This is often written as B^+ s = y for brevity. Because this is a system of n equations in the n \times n entries of B_{k+1}, there are many ways to update B.

Symmetric rank-one update (SR1)

One way to update B easily is to assume that the next B is the current B plus a symmetric rank-one matrix, i.e.

B^+ = B + auu^\top

for some a \in \mathbb{R} and u \in \mathbb{R}^n. Plugging this update into the secant equation and rearranging, we get

(au^\top s) u = y -  Bs,

which implies that u must be a multiple of y - Bs. Plugging this into the equation and solving for a, we obtain the update

\begin{aligned} B^+ = B + \dfrac{(y - Bs)(y - Bs)^\top}{(y - Bs)^\top s}. \end{aligned}

Recall from the quasi-Newton updates that we need to solve the equation B p = - \nabla f(x) for p, which means that we need to compute the inverse H = B^{-1} to get p = - H \nabla f(x). Using the Sherman-Morrison-Woodbury formula, we can convert the update for B into an update for H:

\begin{aligned} H^+ = H + \dfrac{(s - Hy)(s - Hy)^\top}{(s - Hy)^\top y}. \end{aligned}

This is known as the SR1 update. SR1 is simple to understand and implement, but has two shortcomings. First, the denominator in the update for B, (y - Bs)^\top s, might be approximately zero, causing numerical issues. Second, it does not preserve positive definiteness.
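Here is a minimal NumPy sketch of the SR1 update for H (the helper name is mine). The skip rule guarding against a near-zero denominator is a commonly used safeguard, not part of the derivation above:

```python
import numpy as np

def sr1_update_H(H, s, y, eps=1e-8):
    """SR1 update of H, skipping the update when the denominator is tiny."""
    v = s - H @ y
    denom = v @ y
    if abs(denom) < eps * np.linalg.norm(v) * np.linalg.norm(y):
        return H  # skip the update to avoid numerical blow-up
    return H + np.outer(v, v) / denom
```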

References:

  1. Peña, J. (2016). Quasi-Newton Methods. (CMU Convex Optimization: Fall 2016, Lecture on Nov 2.)

Quasi-Newton methods in optimization

Consider the unconstrained minimization problem

\begin{aligned} \min_x f(x) \end{aligned}

where f : \mathbb{R}^n \mapsto \mathbb{R} is twice differentiable and \text{dom}(f) = \mathbb{R}^n. Newton’s method is a second-order descent method for finding the minimum. Starting at some initial point, at the kth iteration we update the candidate solution with the formula

x_{k+1} = x_k - t_k [\nabla^2 f(x_k)]^{-1} \nabla f(x_k),

where \nabla f and \nabla^2 f are the gradient and Hessian of f respectively, and t_k is a step size chosen appropriately (e.g. with backtracking line search). Another way to write the update is

\begin{aligned} \nabla^2 f(x_k) p_k &= - \nabla f(x_k), \\  x_{k+1} &= x_k + t_k p_k. \end{aligned}

That is, we (i) compute the gradient and Hessian of f at x_k, (ii) solve for p_k, then (iii) update x_{k+1} according to the second line.
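As a minimal illustration, here is one Newton iteration in NumPy on a toy objective (the objective and the fixed step size are my own choices, purely for illustration); note that we solve the linear system for p_k rather than explicitly inverting the Hessian:

```python
import numpy as np

# Toy objective (assumed for illustration): f(x) = exp(x1 + x2) + x1^2 + x2^2.
f = lambda x: np.exp(x[0] + x[1]) + x[0] ** 2 + x[1] ** 2

def grad(x):
    e = np.exp(x[0] + x[1])
    return np.array([e + 2 * x[0], e + 2 * x[1]])

def hess(x):
    e = np.exp(x[0] + x[1])
    return np.array([[e + 2, e], [e, e + 2]])

x = np.array([1.0, -0.5])
t = 1.0                                 # step size (e.g. from backtracking line search)
p = np.linalg.solve(hess(x), -grad(x))  # solve  Hessian * p = -gradient
x_next = x + t * p
print(f(x_next) < f(x))                 # True: the Newton step decreases f here
```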

A big drawback of Newton’s method is that computing the Hessian can be very expensive. Quasi-Newton methods address this problem: update x_{k+1} with

\begin{aligned} B_k p_k &= - \nabla f(x_k), \\  x_{k+1} &= x_k + t_k p_k, \end{aligned}

where B_k is an approximation of \nabla^2 f(x_k) which is easy to compute.

Quasi-Newton methods include SR1, DFP, BFGS and L-BFGS, which I hope to describe in future posts.

The secant equation

The question becomes how to generate the B_k‘s. Writing the first-order Taylor expansion of \nabla f around x_k and evaluating it at x_{k+1},

\begin{aligned} \nabla f(x_{k+1}) \approx \nabla f(x_k) + \nabla^2 f(x_k) (x_{k+1} - x_k).  \end{aligned}

It seems reasonable for B_{k+1} to play the role of \nabla^2 f(x_k) above, i.e.

\begin{aligned} \nabla f(x_{k+1}) = \nabla f(x_k) + B_{k+1} (x_{k+1} - x_k),  \end{aligned}

or equivalently

\begin{aligned} B_{k+1} s_k = y_k, \end{aligned}

where s_k = x_{k+1} - x_k and y_k = \nabla f(x_{k+1}) - \nabla f(x_k). This last equation is known as the secant equation, and is sometimes written as B^+ s = y for brevity. Note that this equation involves n equations in the n \times n entries of B_{k+1}, so there is a lot of freedom in how to choose the B_k‘s. On top of satisfying the secant equation, we often require 3 other conditions:

  1. B^+ is symmetric.
  2. B^+ is “close” to B.
  3. B being positive definite implies that B^+ is positive definite.

Each quasi-Newton iteration now looks like this:

  1. Solve for p_k in B_k p_k = - \nabla f(x_k).
  2. Update x with x_{k+1} = x_k + t_k p_k.
  3. Update B according to some procedure so that the secant equation (and other conditions) are satisfied.

The different quasi-Newton methods amount to different updates for B.
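As a quick sanity check of the secant equation: for a quadratic f(x) = \frac{1}{2} x^\top A x - b^\top x, we have \nabla f(x_{k+1}) - \nabla f(x_k) = A(x_{k+1} - x_k) exactly, so B_{k+1} = A satisfies the secant equation for any pair of iterates. A small numerical sketch (toy data mine):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian of the quadratic
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b               # gradient of f(x) = 0.5 x^T A x - b^T x

x_k = rng.standard_normal(2)
x_k1 = rng.standard_normal(2)
s = x_k1 - x_k
y = grad(x_k1) - grad(x_k)
print(np.allclose(A @ s, y))             # True: B_{k+1} = A satisfies B_{k+1} s = y
```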

References:

  1. Peña, J. (2016). Quasi-Newton Methods. (CMU Convex Optimization: Fall 2016, Lecture on Nov 2.)

Exact line search and backtracking line search

Assume we are trying to minimize some convex function f: \mathbb{R}^n \mapsto \mathbb{R} which is differentiable. One way to do this is to use a descent method. A general descent method can be described as follows: Given a starting point x in the domain of f, iterate over the following 3 steps until some stopping criterion is satisfied:

  1. Choose a descent direction \Delta x.
  2. Choose a step size t > 0.
  3. Update x \leftarrow x + t \Delta x.

There are several ways to choose a descent direction (see this previous post on steepest descent for some examples). This post is about two ways to choose the step size. This step is called a line search because the step size determines where along the line \{x + t \Delta x : t \geq 0 \} the next iterate will be.

Exact line search

Exact line search finds t^\star such that f(x + t \Delta x) attains its minimum value at t = t^\star, where we let t range over non-negative values. The diagram below demonstrates this:

The curve is the value of the function along the ray x + t \Delta x with t \geq 0. t^* is the point where this function is minimized. It should be clear from the description that this choice gives the greatest possible decrease in the function value along the ray at each iteration.

Since exact line search requires solving a minimization problem (albeit in one dimension), it can be expensive unless t^* can be computed efficiently (or even analytically).
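One case where t^* is available analytically is a quadratic: if f(x) = \frac{1}{2} x^\top A x - b^\top x with A positive definite, then minimizing f(x + t \Delta x) over t gives t^* = -\dfrac{\nabla f(x)^\top \Delta x}{\Delta x^\top A \Delta x}. A small NumPy sketch of this (toy data mine):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])    # positive definite
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x = np.array([2.0, 2.0])
dx = -grad(x)                             # descent direction (here: negative gradient)
t_star = -(grad(x) @ dx) / (dx @ A @ dx)  # exact line search for a quadratic
print(f(x + t_star * dx) <= f(x))         # True
```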

Backtracking line search

Backtracking line search uses a heuristic to quickly determine a step size which decreases the objective function value by a certain amount. The user needs to set two parameters, 0 < \alpha < 0.5 and 0 < \beta < 1. In each iteration, after the descent direction \Delta x has been chosen, the step size is determined by the following:

  1. Set t = 1.
  2. While f(x + t\Delta x) > f(x) + \alpha t \nabla f(x)^\top \Delta x, set t \leftarrow \beta t.

In other words, we keep reducing the step size by the same multiplicative factor until f(x + t\Delta x) \leq f(x) + \alpha t \nabla f(x)^\top \Delta x holds.
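Here is a minimal NumPy sketch of this loop (the function name and the particular values of \alpha and \beta are illustrative choices of mine):

```python
import numpy as np

def backtracking_line_search(f, grad_x, x, dx, alpha=0.3, beta=0.8):
    """Shrink t until f(x + t*dx) <= f(x) + alpha * t * grad_x^T dx."""
    t = 1.0
    while f(x + t * dx) > f(x) + alpha * t * (grad_x @ dx):
        t *= beta
    return t

# Example on a simple quadratic, with dx the negative gradient:
f = lambda x: x @ x
x = np.array([1.0, 2.0])
grad_x = 2 * x
t = backtracking_line_search(f, grad_x, x, -grad_x)
```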

Let’s draw a picture to understand the intuition behind the inequality:

This is the same picture as that for exact line search, except we’ve added the line y = f(x) + \alpha t \nabla f(x)^\top \Delta x. This line always has negative slope: because \Delta x is a descent direction, we must have \nabla f(x)^\top \Delta x < 0. The stopping condition, f(x + t\Delta x) \leq f(x) + \alpha t \nabla f(x)^\top \Delta x, holds for the blue segment [0, t_0]. The operation t \leftarrow \beta t makes the step size smaller and smaller until t falls in the blue segment, and by the way it’s set up, we know that the objective function will have a strictly smaller value.

How do we know that t_0 is always > 0 (i.e. the blue segment exists)? It’s because of how we chose the term \alpha t \nabla f(x)^\top \Delta x. It can be shown that the derivative of f(x + t \Delta x) w.r.t. t at the point t = 0 is \nabla f(x)^\top \Delta x. Hence, the line y = f(x) + t \nabla f(x)^\top \Delta x is tangent to y = f(x + t \Delta x) at t = 0. Multiplying the term t \nabla f(x)^\top \Delta x by some 0 < \alpha < 1 ensures that the stopping condition will hold for some non-trivial segment [0, t_0].

(Note: The last diagram is taken from Reference 1, and the other diagrams are this base + my annotations.)

References:

  1. Boyd, S., and Vandenberghe, L. (2004). Convex Optimization. Chapter 9.

Affine hull vs. convex hull

The affine hull and convex hull are closely related concepts. Let S be a set in \mathbb{R}^n. The affine hull of S is the set of all affine combinations of elements of S:

\begin{aligned} \text{aff}(S) = \left\{ \sum_{i=1}^k a_i x_i : k > 0, x_i \in S, a_i \in \mathbb{R}, \sum_{i=1}^k a_i = 1 \right\}. \end{aligned}

The convex hull of S is the set of all convex combinations of the elements of S:

\begin{aligned} \text{conv}(S) = \left\{ \sum_{i=1}^k a_i x_i : k > 0, x_i \in S, a_i \in \mathbb{R}, a_i \geq 0, \sum_{i=1}^k a_i = 1 \right\}. \end{aligned}

Putting the definitions side by side, we see that the only difference is that for the convex hull, the weights in the linear combination (the a_i‘s) have an additional restriction of being non-negative. The only restriction on the a_i‘s for combinations in the affine hull is that they sum to 1. This also means that the affine hull always contains the convex hull.

Let’s have a look at a few examples. If S consists of two points, the convex hull is the line segment joining them (including the endpoints) while the affine hull is the entire line through these two points:

For 3 non-collinear points in two dimensions, the convex hull is the triangle with these 3 points as vertices while the affine hull is the entire plane \mathbb{R}^2:

The final illustration below shows the affine and convex hulls of 3 non-collinear points in \mathbb{R}^3. They are both two-dimensional sets living in 3-dimensional space. Notice also how the affine hull need not pass through the origin (whereas all subspaces of \mathbb{R}^3 must pass through it).
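To make the difference concrete in code, here is a small NumPy sketch (my own illustration) with 3 non-collinear points in \mathbb{R}^2: convex combinations (non-negative weights summing to 1) stay inside the triangle, while affine combinations with negative weights can land anywhere in the plane:

```python
import numpy as np

rng = np.random.default_rng(0)
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # 3 non-collinear points

# Convex combination: non-negative weights summing to 1.
w_conv = rng.dirichlet(np.ones(3))
x_conv = w_conv @ pts        # lies in the triangle

# Affine combination: weights sum to 1 but may be negative.
w_aff = np.array([1.5, -0.25, -0.25])
x_aff = w_aff @ pts          # lies in the plane, but outside the triangle
```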

What are the KKT conditions?

Consider an optimization problem in standard form:

\begin{aligned} \text{minimize} \quad& f_0(x) \\  \text{subject to} \quad& f_i(x) \leq 0, \quad i = 1, \dots, m, \\  & h_i(x) = 0, \quad i = 1, \dots, p, \end{aligned}

with the variable x \in \mathbb{R}^n. Assume that the f_i‘s and h_i‘s are differentiable. (At this point, we are not assuming anything about their convexity.)

As before, define the Lagrangian as the function

\begin{aligned} L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x). \end{aligned}

Let x^\star and (\lambda^\star, \nu^\star) be the primal and dual optimal points respectively (i.e. points where the primal and dual problems are optimized). The Karush-Kuhn-Tucker (KKT) conditions refer to the following set of 4 conditions:

  1. Primal feasibility: f_i(x^\star) \leq 0 for i = 1, \dots, m, and h_i(x^\star) = 0 for i = 1, \dots, p.
  2. Dual feasibility: \lambda_i^\star \geq 0 for i = 1, \dots, m.
  3. Stationarity: \partial_x L(x^\star, \lambda^\star, \nu^\star) = 0, i.e. \nabla f_0(x^\star) + \displaystyle\sum_{i=1}^m \lambda_i^\star \nabla f_i(x^\star) + \displaystyle\sum_{i=1}^p \nu_i^\star \nabla h_i(x^\star) = 0.
  4. Complementary slackness: \lambda_i^\star f_i(x^\star) = 0 for i = 1, \dots, m.

Here are 3 statements (theorems, if you will) involving the KKT conditions which highlight their importance:

  1. (Necessity) For any optimization problem for which strong duality holds, any pair of primal and dual optimal points must satisfy the KKT conditions.
  2. (Sufficiency) For a convex optimization problem (i.e. f_i‘s convex and h_i‘s affine), if the KKT conditions hold for some (\tilde{x}, \tilde{\lambda}, \tilde{\nu}), then \tilde{x} and (\tilde{\lambda}, \tilde{\nu}) are primal and dual optimal, and there is zero duality gap.
  3. (Necessity & sufficiency) For a convex optimization problem that satisfies Slater’s condition, then \tilde{x} is primal optimal if and only if there exists (\tilde{\lambda}, \tilde{\nu}) such that the KKT conditions are satisfied.

Note: There are more general versions of the KKT conditions; see Reference 2 for details.
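As a concrete illustration (a toy problem of my own, not from the references), consider minimizing f_0(x) = x^2 subject to f_1(x) = 1 - x \leq 0. The optimum is x^\star = 1, and stationarity 2x^\star - \lambda^\star = 0 gives \lambda^\star = 2; the sketch below checks all four conditions numerically:

```python
import numpy as np

x_star, lam_star = 1.0, 2.0

f1 = lambda x: 1.0 - x            # inequality constraint f1(x) <= 0
grad_f0 = lambda x: 2.0 * x
grad_f1 = lambda x: -1.0

primal_feasible = f1(x_star) <= 0
dual_feasible = lam_star >= 0
stationary = np.isclose(grad_f0(x_star) + lam_star * grad_f1(x_star), 0.0)
comp_slack = np.isclose(lam_star * f1(x_star), 0.0)
print(primal_feasible, dual_feasible, stationary, comp_slack)  # all True
```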

References:

  1. Boyd, S., and Vandenberghe, L. (2004). Convex Optimization. Chapter 5.
  2. Wikipedia. Karush-Kuhn-Tucker conditions.

Lagrange dual, weak duality and strong duality

Consider an optimization problem in standard form:

\begin{aligned} \text{minimize} \quad& f_0(x) \\  \text{subject to} \quad& f_i(x) \leq 0, \quad i = 1, \dots, m, \\  & h_i(x) = 0, \quad i = 1, \dots, p, \end{aligned}

with the variable x \in \mathbb{R}^n. Let \mathcal{D} be the domain for x, i.e. the intersection of the domains of the f_i‘s and the h_i‘s. Let p^\star denote the optimal value of the problem.

The Lagrange dual function is the function g: \mathbb{R}^m \times \mathbb{R}^p \mapsto \mathbb{R} defined as the minimum value of the Lagrangian over x:

\begin{aligned} g(\lambda, \nu) &= \inf_{x \in \mathcal{D}} L(x, \lambda, \nu) \\  &= \inf_{x \in \mathcal{D}} \left\{ f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \right\}. \end{aligned}

The Lagrange dual problem is the optimization problem

\begin{aligned} \text{maximize} \quad& g(\lambda, \nu) \\  \text{subject to} \quad& \lambda \geq 0, \end{aligned}

where the inequality above is understood to be element-wise. Let d^\star denote the optimal value of this problem.

The Lagrange dual problem is always a convex optimization problem, because the objective being maximized is concave (see this previous post for a proof) and the constraint is convex. One reason the dual problem is of critical importance is that it provides a lower bound on p^\star (it is the best lower bound that can be obtained from the Lagrange dual function) and in some cases, gives the value of p^\star.

Weak duality

Weak duality refers to the fact that we always have

d^\star \leq p^\star.

This inequality holds all the time, even when the original problem is not convex. The proof is straightforward. If \tilde{x} is feasible for the original problem, then for any \lambda \geq 0 and any \nu we have

\begin{aligned} f_0(\tilde{x}) &\geq f_0(\tilde{x}) + \sum_{i=1}^m \lambda_i f_i(\tilde{x}) + \sum_{i=1}^p \nu_i h_i(\tilde{x}) \\  &\geq \inf_{x \in \mathcal{D}} \left\{ f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \right\} \\  &= g(\lambda, \nu). \end{aligned}

Since the above holds for any \lambda \geq 0 and any \nu, maximizing over these variables gives us f_0(\tilde{x}) \geq d^\star. Since this holds for all feasible \tilde{x}, minimizing over all feasible \tilde{x} gives us p^\star \geq d^\star.

The difference p^\star - d^\star is known as the optimal duality gap.
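As a small worked example (mine, not from Reference 1): minimize x^2 subject to 1 - x \leq 0. The Lagrangian is x^2 + \lambda(1 - x), minimizing over x gives g(\lambda) = \lambda - \lambda^2/4, and maximizing over \lambda \geq 0 gives d^\star = 1 = p^\star, so weak duality holds (here with equality). A quick numerical check:

```python
import numpy as np

g = lambda lam: lam - lam ** 2 / 4   # Lagrange dual function of the toy problem
lams = np.linspace(0, 10, 10001)
d_star = g(lams).max()               # numerical maximum of the dual
p_star = 1.0                         # primal optimum, attained at x = 1
print(d_star <= p_star + 1e-12)      # weak duality holds (here with equality)
```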

Strong duality

Strong duality refers to the property that d^\star = p^\star. This does not always hold, but it can hold for important classes of problems. Conditions which guarantee strong duality are known as constraint qualifications.

A convex optimization problem is a problem of the form

\begin{aligned} \text{minimize} \quad& f_0(x) \\  \text{subject to} \quad& f_i(x) \leq 0, \quad i = 1, \dots, m, \\  & Ax = b, \end{aligned}

with f_0, \dots, f_m being convex functions. Convex optimization problems don’t always have strong duality, but there are some simple constraint qualifications under which strong duality holds. The most commonly cited one is Slater’s condition, which seems fairly easy to satisfy.

Slater’s condition states that the problem admits a strictly feasible point, that is, a point x in the relative interior of \mathcal{D} (the domain of the problem) such that

\begin{aligned} f_i(x) < 0, \quad i = 1, \dots, m, \quad \text{and} \quad Ax = b. \end{aligned}

Slater’s theorem says that strong duality holds if Slater’s condition holds and the problem is convex. A proof of this can be found in Section 5.3.2 of Reference 1; see Section 2.1.3 of the same reference for more details on the concept of relative interior.

References:

  1. Boyd, S., and Vandenberghe, L. (2004). Convex Optimization. Chapter 5.

The Lagrange dual function is always concave

Consider an optimization problem in standard form:

\begin{aligned} \text{minimize} \quad& f_0(x) \\  \text{subject to} \quad& f_i(x) \leq 0, \quad i = 1, \dots, m, \\  & h_i(x) = 0, \quad i = 1, \dots, p, \end{aligned}

with the variable x \in \mathbb{R}^n. Let \mathcal{D} be the domain for x, i.e. the intersection of the domains of the f_i‘s and the h_i‘s.

The Lagrangian associated with this problem is the function L: \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \mapsto \mathbb{R} defined as

\begin{aligned} L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x), \end{aligned}

with domain \mathcal{L} = \mathcal{D} \times \mathbb{R}^m \times \mathbb{R}^p. The Lagrange dual function is the function g: \mathbb{R}^m \times \mathbb{R}^p \mapsto \mathbb{R} defined as the minimum value of the Lagrangian over x:

\begin{aligned} g(\lambda, \nu) = \inf_{x \in \mathcal{D}} L(x, \lambda, \nu). \end{aligned}

The dual function is a very important function in optimization theory. One interesting fact about the dual function is that it is always concave, even if the original optimization problem is not convex. The proof can be stated in one line: for each x, L(x, \lambda, \nu) is an affine function of (\lambda, \nu), and the pointwise infimum of a family of affine functions is always concave.

Here’s a more explicit version of the proof. Let \theta = (\lambda, \nu) \in \mathbb{R}^{m + p}. Fix \theta_1, \theta_2 \in \mathbb{R}^{m + p} and t \in [0, 1]. For each x \in \mathcal{D},

\begin{aligned} L(x, t\theta_1 + (1-t)\theta_2) &= t L(x, \theta_1) + (1-t) L(x, \theta_2) \\  &\geq t g (\theta_1) + (1-t) g(\theta_2), \end{aligned}

where the first equality uses the definition of L (or the fact that L is affine in \theta). Since this holds for all x \in \mathcal{D}, we can take an infimum over x to obtain g(t\theta_1 + (1-t)\theta_2) \geq t g (\theta_1) + (1-t) g(\theta_2), i.e. g is concave.
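As a quick numerical illustration (the toy problem and grid are my own): take the one-constraint problem of minimizing x^2 subject to 1 - x \leq 0, replace the infimum over x by a minimum over a finite grid, and check midpoint concavity of the resulting dual. Note that the grid version is itself a pointwise minimum of finitely many affine functions of \lambda, so it is exactly concave:

```python
import numpy as np

# Toy problem (my own example): minimize f0(x) = x^2 subject to f1(x) = 1 - x <= 0.
xs = np.linspace(-5, 5, 2001)              # grid standing in for the domain D
L = lambda lam: xs ** 2 + lam * (1 - xs)   # Lagrangian over the grid, for fixed lambda
g = lambda lam: L(lam).min()               # dual function: minimum of L over x

# Midpoint concavity check: g((a+b)/2) >= (g(a) + g(b)) / 2 for random a, b >= 0.
rng = np.random.default_rng(0)
a, b = rng.uniform(0, 10, size=2)
print(g((a + b) / 2) >= (g(a) + g(b)) / 2)  # True
```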

References:

  1. Boyd, S., and Vandenberghe, L. (2004). Convex Optimization. Chapter 5.

What is Danskin’s Theorem?

Danskin’s Theorem is a theorem from convex analysis that gives information about the derivatives of a particular kind of function. It was first proved in 1967 (Reference 1, what a title!). The statement of the theorem is pretty long, so we’ll walk our way slowly through it.

Set-up

Let \phi : \mathbb{R}^n \times Z \mapsto \mathbb{R} be a continuous function, with Z \subset \mathbb{R}^m being a compact set. Assume that the function is convex in the first argument, i.e. for each z \in Z, the mapping x \mapsto \phi(x, z) is convex.

Define a new function f: \mathbb{R}^n \mapsto \mathbb{R} by f(x) = \max_{z \in Z} \phi (x, z). Define the set of maximizing points as

\begin{aligned} Z(x) = \left\{ \hat{z} : \phi (x, \hat{z}) = \max_{z \in Z} \phi (x, z) \right\} .\end{aligned}

For a direction y, let \phi' (x, z; y) denote the directional derivative of the function \phi (\cdot, z) at x in the direction y.

Danskin’s Theorem

Danskin’s Theorem provides information about the function f. Without any other assumptions,

Part 1: The function f is convex.

Part 2: The function f has a directional semi-differential. The semi-differential of f at the point x in the direction y, denoted \partial_y f(x), is given by \partial_y f(x) = \max_{z \in Z(x)} \phi' (x, z; y).

The next two parts of Danskin’s Theorem (which I label as Parts 3 and 4) make further assumptions.

Part 3: Assume that for some x, there is a unique maximizing point, i.e. Z(x) = \{ \hat{z} \} for some \hat{z}, and that \phi ( \cdot, \hat{z}) is differentiable at x. Then f is differentiable at x, and

\begin{aligned} \frac{\partial f}{\partial x} = \frac{\partial \phi (x, \hat{z})}{\partial x}. \end{aligned}

Part 4: Assume that \phi ( \cdot, z) is differentiable for all z \in Z, and that the derivative \partial \phi / \partial x is continuous with respect to z for all x. Then the subdifferential of f(x) is given by

\begin{aligned} \partial f(x) = \text{conv} \left\{ \frac{\partial \phi (x, z)}{\partial x}: z \in Z(x) \right\}, \end{aligned}

where \text{conv} is the convex hull operation.

A proof of the theorem can be found in Appendix B of Reference 3.

Example

All that was pretty abstract, so let’s end off with an example (adapted from the supplemental material of Reference 4). Let C be a convex compact subset of \mathbb{R}^n, and let \gamma \geq 0 be some constant. Consider the function \phi : \mathbb{R}^n \times C \mapsto \mathbb{R} with

\begin{aligned} \phi (x, z) = -x^\top z - \frac{\gamma}{2} z^\top z \end{aligned}

\phi is clearly a continuous function, and for fixed z, \phi is affine in x, hence convex in x. This means that the assumptions for Danskin’s Theorem hold. The function f is

\begin{aligned} f(x) &= \max_{z \in C} \phi (x, z) \\  &= - \min_{z \in C} \left\{ x^\top z + \frac{\gamma}{2} z^\top z \right\}. \end{aligned}

If \gamma > 0, the function being minimized above is strongly convex; since it is being minimized over a convex set, it has a unique minimizer \hat{z}(x) (proof here). Hence, the additional assumptions of Part 3 apply, and we can conclude that the function f is differentiable at x, and

\begin{aligned} \nabla f(x) = \frac{\partial \phi (x, \hat{z}(x))}{\partial x} = - \hat{z}(x). \end{aligned}
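We can sanity-check this numerically. Below is a sketch with C taken to be the unit Euclidean ball and \gamma = 1 (this particular choice of C is mine, not from Reference 4); the unique minimizer \hat{z}(x) is the projection of -x/\gamma onto C, and a finite-difference gradient of f matches -\hat{z}(x):

```python
import numpy as np

gamma = 1.0

def z_hat(x):
    """Unique minimizer of x^T z + (gamma/2) z^T z over the unit ball C."""
    z = -x / gamma                       # unconstrained minimizer
    norm = np.linalg.norm(z)
    return z if norm <= 1 else z / norm  # project onto C

f = lambda x: -(x @ z_hat(x) + gamma / 2 * z_hat(x) @ z_hat(x))

# Finite-difference gradient of f should match -z_hat(x) (Part 3 of the theorem).
x = np.array([1.5, -0.7])
eps = 1e-6
fd_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(2)])
print(np.allclose(fd_grad, -z_hat(x), atol=1e-5))  # True
```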

If \gamma = 0, the maximizing point need not be unique, and so the assumptions in Part 3 may not apply. The assumption in Part 4 still applies though, and so f has a subdifferential given by

\begin{aligned} \partial f(x) &= \text{conv} \left\{ \frac{\partial \phi (x, z)}{\partial x}: z \in Z(x) \right\} \\  &= \text{conv} \left\{ -z : z \in \text{argmax}_{z \in C} \left( -x^\top z \right) \right\} \\  &= \text{conv} \left\{ -z : z \in \text{argmin}_{z \in C} x^\top z \right\} \\  &= - \text{conv} \left\{ z : z \in \text{argmin}_{z \in C} x^\top z \right\} \\  &= - \left\{ z : z \in \text{argmin}_{z \in C} x^\top z \right\}. \end{aligned}

We can drop the convex hull operator in the last line because if z_1 and z_2 minimize x^\top z, then so does any convex combination tz_1 + (1-t)z_2.

References:

  1. Danskin, J. M. (1967). The Theory of Max-Min and its Application to Weapons Allocation Problems.
  2. Wikipedia. Danskin’s theorem.
  3. Bertsekas, D. P. (2016). Nonlinear Programming.
  4. Basu, K., et. al. (2020). ECLIPSE: An Extreme-Scale Linear Program Solver for Web-Applications.