What is steepest descent?

Introduction

Assume we have a convex function f: \mathbb{R}^d \mapsto \mathbb{R} and we would like to find its minimum. A commonly used class of algorithms, called descent algorithms, works in the following manner:

  1. Find a descent direction \Delta x.
  2. Update x = x + t \Delta x, where t is chosen in some manner (e.g. deterministically, backtracking line search, exact line search).
  3. Repeat steps 1 and 2 until convergence.

Steepest descent

There are several descent directions to choose from: which should we use? If we look at the first order Taylor expansion of f(x+v) around x:

\begin{aligned} f(x + v) \approx f(x) + \nabla f(x)^\top v, \end{aligned}

the term \nabla f(x)^\top v is known as the directional derivative of f at x in the direction v. It makes sense to find a v whose directional derivative is as negative as possible, i.e. as steep as possible.

The problem, of course, is that once we find some v such that \nabla f(x)^\top v is negative, we can scale v up as much as we like, making the directional derivative arbitrarily negative. This means that without some normalization, we cannot choose among the directions having \nabla f(x)^\top v < 0.

To that end, let \| \cdot \| be some norm on \mathbb{R}^d. (What this norm is will matter later.) We define a normalized steepest descent direction with respect to \| \cdot \| as

\Delta x_{nsd} = \text{argmin} \; \left\{ \nabla f(x)^\top v : \| v \| \leq 1 \right\}.

(The subscript “nsd” is short for “normalized steepest descent”.) This direction need not be unique. We can then descend in the direction of \Delta x_{nsd}.

It is also common to look at the unnormalized steepest descent step

\Delta x_{sd} = \| \nabla f(x) \|_* \Delta x_{nsd},

where \| \cdot \|_* is the dual norm of \| \cdot \|. (For exact line search, it makes no difference whether we use \Delta x_{nsd} or \Delta x_{sd} since the scale factor does not matter.)
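
Here is a small sketch of the unnormalized steepest descent step \Delta x_{sd} for a few choices of norm, using the standard closed-form solutions (this is my own illustration, not code from Reference 1; the connection between these norms and familiar algorithms is spelled out later in this post):

```python
import numpy as np

def steepest_descent_step(grad, norm="euclidean", P=None):
    """Unnormalized steepest descent step -Delta x_sd for a few norms (standard closed forms)."""
    if norm == "euclidean":
        return -grad                          # gradient descent step
    if norm == "quadratic":                   # ||v||_P = sqrt(v^T P v), P symmetric positive definite
        return -np.linalg.solve(P, grad)      # -P^{-1} grad
    if norm == "l1":
        i = np.argmax(np.abs(grad))           # steepest coordinate
        step = np.zeros_like(grad)
        step[i] = -grad[i]
        return step                           # a coordinate descent step
    raise ValueError(norm)

grad = np.array([3.0, -1.0, 0.5])             # hypothetical gradient at the current point
P = np.diag([4.0, 1.0, 0.25])                 # hypothetical quadratic-norm matrix
for norm in ["euclidean", "quadratic", "l1"]:
    print(norm, steepest_descent_step(grad, norm, P))
```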

Steepest descent as a class of algorithms

Notice that the steepest descent direction \Delta x_{sd} depends on the norm that is chosen in its definition. Hence, steepest descent is really a generic name: specific algorithms arise when we choose \| \cdot \| to be a particular norm.

Steepest descent vs. gradient descent

When I was looking around the internet for information on steepest descent, there appeared to be some confusion between steepest descent and gradient descent, with some webpages saying that the two are the same. That is not quite correct: gradient descent is a special case of steepest descent where the norm chosen is the Euclidean norm. But I suppose in many contexts we only consider the Euclidean norm, which is why the two algorithms are often conflated.

Why use different norms?

This raises the question: why consider any norm in the definition of the steepest descent direction other than the Euclidean norm? I am not too familiar with material in this area, but it seems that the choice of norm can greatly affect the convergence rate of the algorithm. See Section 9.4.4 of Reference 1 for some examples. This should also become clearer when we see that several popular variations of gradient descent are actually special cases of steepest descent.

Special cases of steepest descent

A few popular optimization algorithms can be viewed as steepest descent for a particular choice of norm. Again, see Section 9.4 of Reference 1 for details.

  • Euclidean norm: If \| \cdot \| is the Euclidean norm, then steepest descent is gradient descent.
  • Quadratic norms: For any symmetric (strictly) positive definite matrix P \in \mathbb{R}^{d \times d}, define the norm \| x \|_P = (x^\top P x)^{1/2} = \| P^{1/2} x \|_2. Steepest descent under this norm amounts to doing gradient descent after doing a change of coordinates. Concretely, our variable is now \bar{x} = P^{1/2} x, and we are minimizing \bar{f} defined by \bar{f}(\bar{x}) = f(P^{-1/2}\bar{x}) = f(x). Steepest descent in the original space corresponds to gradient descent of \bar{f} w.r.t. \bar{x}.
    In particular, if P = \nabla^2 f(x), then steepest descent becomes Newton’s method.
  • \ell_1 norm: If \| \cdot \| = \| \cdot \|_1, steepest descent becomes coordinate descent.

References:

  1. Boyd, S., and Vandenberghe, L. Convex Optimization (Section 9.4: Steepest descent method).

What is the orthogonal Procrustes problem?

The orthogonal Procrustes problem is a matrix approximation problem that can be stated as follows: for \mathbf{M} \in \mathbb{R}^{n \times p} and \mathbf{N} \in \mathbb{R}^{n \times p}, solve

\begin{aligned} \widehat{\mathbf{A}} = \underset{\mathbf{A} \in \mathbb{R}^{p \times p}}{\text{argmin}} \: \| \mathbf{M} - \mathbf{NA} \|_F^2 \quad \text{subject to} \quad \mathbf{A}^T \mathbf{A} = \mathbf{I}, \end{aligned}

where \| \cdot \|_F represents the Frobenius norm. Essentially we want to find the orthogonal matrix that most closely maps \mathbf{N} to \mathbf{M}. The solution can be expressed in terms of the singular value decomposition (SVD) of a special matrix: if we have the SVD \mathbf{M}^T \mathbf{N} = \mathbf{UDV}^T, then \widehat{\mathbf{A}} = \mathbf{VU}^T.

This problem was posed and solved in Green (1952) (Reference 2). The transformation sending \mathbf{N} to \mathbf{N \widehat{A}} is sometimes known as the Procrustes rotation. We defer the proof to the next section, where it appears as a special case of a more general result.

The reduced rank Procrustes rotation

In the original Procrustes problem, \mathbf{M} and \mathbf{N} had to have the same dimensions. If instead \mathbf{M} \in \mathbb{R}^{n \times p} and \mathbf{N} \in \mathbb{R}^{n \times k}, we have the reduced rank Procrustes problem

\begin{aligned} \widehat{\mathbf{A}} = \underset{\mathbf{A} \in \mathbb{R}^{p \times k}}{\text{argmin}} \: \| \mathbf{M} - \mathbf{NA}^T \|_F^2 \quad \text{subject to} \quad \mathbf{A}^T \mathbf{A} = \mathbf{I}_{k \times k}. \end{aligned}

(Here I have replaced \mathbf{A} with \mathbf{A}^T to be consistent with Zou et al. (2006) Theorem 4 (Reference 3), where I got this from. It doesn’t change the substance of the problem.) The solution is the same as before: if we have the SVD \mathbf{M}^T \mathbf{N} = \mathbf{UDV}^T, then \widehat{\mathbf{A}} = \mathbf{UV}^T.

Here is the proof (from Reference 3): by the circular property of the trace operator, we can write

\begin{aligned} \| \mathbf{M} - \mathbf{NA}^T \|_F^2 &= \text{tr}(\mathbf{M}^T \mathbf{M}) - 2 \text{tr}(\mathbf{M}^T \mathbf{NA}^T) + \text{tr}(\mathbf{AN}^T \mathbf{NA}^T) \\  &= \text{tr}(\mathbf{M}^T \mathbf{M}) - 2 \text{tr}(\mathbf{M}^T \mathbf{NA}^T) + \text{tr}(\mathbf{A}^T\mathbf{AN}^T \mathbf{N}) \\  &= \text{tr}(\mathbf{M}^T \mathbf{M}) - 2 \text{tr}(\mathbf{M}^T \mathbf{NA}^T) + \text{tr}(\mathbf{N}^T \mathbf{N}). \end{aligned}

Only the second term depends on \mathbf{A}, so we want to find \mathbf{A} that maximizes \text{tr}(\mathbf{M}^T \mathbf{NA}^T). If we have the SVD \mathbf{M}^T \mathbf{N} = \mathbf{UDV}^T, then

\begin{aligned} \text{tr}(\mathbf{M}^T \mathbf{NA}^T) &= \text{tr}(\mathbf{UDV}^T \mathbf{A}^T) = \text{tr}(\mathbf{UD}(\mathbf{AV})^T) = \text{tr}((\mathbf{AV})^T\mathbf{UD}) \\  &= \sum_{i = 1}^d [(\mathbf{AV})^T\mathbf{U}]_{ii} d_i,  \end{aligned}

where d_i is the ith entry on the diagonal of \mathbf{D}, and d is the number of non-zero entries in \mathbf{D} (also the number of non-zero singular values of \mathbf{M}^T \mathbf{N}). Hence, to maximize the sum above, we want the diagonal entries of (\mathbf{AV})^T\mathbf{U} to be as large as possible. But each entry of (\mathbf{AV})^T\mathbf{U} is the inner product of two unit vectors (a column of \mathbf{AV} and a column of \mathbf{U}), so no entry can exceed 1 in absolute value. Thus, the sum is maximized when all the diagonal entries of (\mathbf{AV})^T\mathbf{U} are equal to 1. This occurs when

\begin{aligned} (\mathbf{AV})^T\mathbf{U} &= \mathbf{I}, \\  \mathbf{AV} &= \mathbf{U}, \\  \mathbf{A} &= \mathbf{UV}^T. \end{aligned}
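
Here is a quick numpy sketch of the solution (my own illustration of the formula above, not code from any of the references):

```python
import numpy as np

# Reduced rank Procrustes: minimize ||M - N A^T||_F^2 subject to A^T A = I_k.
rng = np.random.default_rng(0)
n, p, k = 10, 5, 3
M = rng.normal(size=(n, p))
N = rng.normal(size=(n, k))

U, D, Vt = np.linalg.svd(M.T @ N, full_matrices=False)   # thin SVD: U is p x k, Vt is k x k
A_hat = U @ Vt                                            # the Procrustes solution, p x k

print(np.allclose(A_hat.T @ A_hat, np.eye(k)))            # constraint A^T A = I_k holds
print(np.linalg.norm(M - N @ A_hat.T, "fro"))             # objective value at the minimizer
```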

Where does the problem get its name?

The problem gets its name from Hurley and Cattell (1962) (Reference 4). Quoting directly from the paper:

It will be recalled that the Greek hero Theseus encountered in his wanderings a character called Procrustes, whose beds fitted all travelers. Those who were too short for his beds he cruelly stretched and those who were too tall he cut down to size. If an investigator is satisfied—as many are—to announce that the fit is good, from visual judgment, then this program lends itself to the brutal feat of making almost any data fit almost any hypothesis! Because of this possible proclivity we gave the code name Procrustes to this program, for this reference describes what it does, for better or worse.

That got dark pretty quickly… The commentary continues with instructive warnings which (in my opinion) apply to the field of statistics in general:

To publish widely a program which permits any tyro, by pressing a computer button, to seem to verify any theory, is as irresponsible as loosing opium on the open market. That computers and their programs can be a real danger to proper values and directions in research must already be evident in several fields. … The obviously still vaster possibilities of abuse in Procrustes constituted a major reason, as explained above, for our abstaining from early publication. …

We are not completely without hope:

Fortunately, [other statistical tests] have been developed by forced marches in the meantime, which make it possible to supply an antidote along with the drug, or a means of detection along with a means of crime. Though still admittedly imperfect, [these tests] give a basis for evaluation. When a Procrustean fit that might satisfy the eye - and such as has been quoted as verification of a favored hypothesis - can be challenged by such [tests], the danger of abuse of Procrustes has passed.

References:

  1. Wikipedia. Orthogonal Procrustes problem.
  2. Green, B. F. (1952). The orthogonal approximation of an oblique structure in factor analysis.
  3. Zou, H., et al. (2006). Sparse principal component analysis.
  4. Hurley, J. R., and Cattell, R. B. (1962). The procrustes program: Producing direct rotation to test a hypothesized factor structure.

What is the Frank-Wolfe method?

The Frank-Wolfe method is an iterative first-order optimization algorithm for constrained convex optimization. It solves the following problem:

\begin{aligned} \text{minimize}_x \quad& f(x) \\ \text{subject to} \quad &x \in \mathcal{S}, \end{aligned}

where f is a convex, differentiable real-valued function, and \mathcal{S} is a compact convex set.

The method itself is very easy to describe:

  1. Initialize k \leftarrow 0, x_0 any point in \mathcal{S}.
  2. For k = 0, 1, \dots until convergence:
    1. Set \tilde{x}_k = \text{argmin}_{x \in \mathcal{S}} \: x^T \nabla f(x_k).
    2. Set x_{k+1} = x_k + \gamma_k (\tilde{x}_k - x_k), where \gamma_k \in [0,1].

The key step here is the first substep of Step 2 (finding \tilde{x}_k): if we can solve the minimization problem

\begin{aligned} \text{minimize}_{x \in \mathcal{S}} \quad x^T w \end{aligned}

for any w quickly, then Frank-Wolfe will give us a very fast method.

How do we choose \gamma_k? A common choice is to set \gamma_k = 2/(k+2). Another common choice is to do an exact line search: \gamma_k = \text{argmin}_{0 \leq \gamma \leq 1} \: f(x_k + \gamma (\tilde{x}_k - x_k)).
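
Here is a minimal sketch of Frank-Wolfe on a toy problem of my own (not from the reference) where the linear minimization is trivial: minimizing a quadratic over the probability simplex, for which the argmin in Step 2 is attained at a coordinate basis vector.

```python
import numpy as np

# Toy problem: minimize f(x) = 0.5 * ||x - b||^2 over the probability simplex
# S = {x : x >= 0, sum(x) = 1}. The linear subproblem argmin_{x in S} x^T w
# is attained at the vertex e_i with i = argmin_j w_j.
rng = np.random.default_rng(0)
n = 5
b = rng.normal(size=n)

def grad_f(x):
    return x - b

x = np.ones(n) / n                        # start at the centre of the simplex
for k in range(200):
    g = grad_f(x)
    i = np.argmin(g)                      # linear minimization oracle over S
    x_tilde = np.zeros(n)
    x_tilde[i] = 1.0
    gap = g @ (x - x_tilde)               # duality gap: f(x_k) - f* <= gap
    if gap < 1e-6:
        break
    gamma = 2.0 / (k + 2)                 # the standard step size choice
    x = x + gamma * (x_tilde - x)

print("solution:", x, "gap:", gap)
```

(The quantity gap in the code is f(x_k) minus the lower bound f(x_k) + \nabla f(x_k)^T (\tilde{x}_k - x_k); the proposition in the next section shows it upper bounds f(x_k) - f^*, which makes it a natural stopping criterion.)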

Lower bound property

The iterates of the Frank-Wolfe method satisfy a lower bound property as follows:

Proposition. Let f^* be the optimal value of the optimization problem at the beginning of this post. Then for any k,

\begin{aligned} f(x_k) + \nabla f(x_k)^T (\tilde{x}_k - x_k) \leq f^*. \end{aligned}

Here is the proof: Assume that the minimum of f is attained at some point x^*. Then

\begin{aligned} f^* = f(x^*) &\geq f(x_k) + (x^* - x_k)^T \nabla f(x_k) &\text{(since } f \text{ convex)} \\ &= f(x_k) - x_k^T \nabla f(x_k) + (x^*)^T \nabla f(x_k) \\ &\geq f(x_k) - x_k^T \nabla f(x_k) + \min_{y \in \mathcal{S}} y^T \nabla f(x_k) \\ &= f(x_k) - x_k^T \nabla f(x_k) + \tilde{x}_k^T \nabla f(x_k) &\text{(by definition of } \tilde{x}_k \text{)} \\ &= f(x_k) + (\tilde{x}_k - x_k)^T \nabla f(x_k).  \end{aligned}

References:

  1. Wikipedia. Frank-Wolfe algorithm.

(Strictly/strongly) convex functions

Convex functions

Let \mathcal{X} \subseteq \mathbb{R}^n be a convex set. A function f: \mathcal{X} \mapsto \mathbb{R} is convex if

\begin{aligned} f (tx + (1-t)y) \leq t f(x) + (1-t) f(y) \quad \text{for all } x, y \in \mathcal{X}, t \in [0, 1]. \quad -(1)\end{aligned}

Loosely speaking, a function is convex if it curves upwards like a bowl.

If f is differentiable, being a convex function is equivalent to

\begin{aligned} f(y) \geq f(x) + \nabla f(x)^T (y-x) \quad \text{for all } x, y \in \mathcal{X}, \quad -(2)\end{aligned}

or

\begin{aligned} (\nabla f(x) - \nabla f(y))^T (x-y) \geq 0 \quad \text{for all } x, y \in \mathcal{X}. \quad -(3)\end{aligned}

I find Equation (2) particularly intuitive. What it is saying is that if I make my best linear approximation to my function at a point x, then my function is always going to lie above (or on) the linear approximation.

If f is twice continuously differentiable, being a convex function is equivalent to having a positive semidefinite Hessian: \nabla^2 f(x) \succeq 0 for all x \in \mathcal{X}.

Strictly convex functions

f is strictly convex if

\begin{aligned} f (tx + (1-t)y) < t f(x) + (1-t) f(y) \quad \text{for all } x \neq y \in \mathcal{X}, t \in (0, 1).\end{aligned}

This condition is essentially Equation (1) with the inequality made strict, except in the cases where we cannot hope for a strict inequality (x = y, or t equal to 0 or 1). If f is differentiable, being strictly convex means

\begin{aligned} f(y) > f(x) + \nabla f(x)^T (y-x) \quad \text{for all } x \neq y \in \mathcal{X},\end{aligned}

and if f is twice continuously differentiable, having a positive definite Hessian everywhere is sufficient for strict convexity (though not necessary: f(x) = x^4 is strictly convex even though its second derivative vanishes at 0).

Every strictly convex function is convex, but the reverse is not true. For example, the function g(x) = \max( |x| - 1, 0) is convex but not strictly convex.

We can think of strictly convex functions as having strictly positive “curvature” at all points. The function above is “flat” along each of (-\infty, -1], [-1, 1] and [1, \infty), and so is not strictly convex.

Strongly convex functions

Strong convexity is one formulation that allows us to talk about how “convex” or “curved” a convex function is. f is strongly convex with parameter m > 0 if

\begin{aligned} f (tx + (1-t)y) &\leq t f(x) + (1-t) f(y) - \frac{1}{2}mt(1-t)\|x-y\|_2^2 \\ &\qquad\qquad \text{for all } x, y \in \mathcal{X}, t \in [0, 1]. \quad -(4)\end{aligned}

Equation (4) is just like Equation (1) except the RHS has an added negative term which makes it smaller.

If f is differentiable, being strongly convex with parameter m is equivalent to

\begin{aligned} f(y) \geq f(x) + \nabla f(x)^T (y-x) + \frac{m}{2}\|y-x\|_2^2 \quad \text{for all } x, y \in \mathcal{X}, \quad -(5)\end{aligned}

or

\begin{aligned} (\nabla f(x) - \nabla f(y))^T (x-y) \geq m \|x - y\|_2^2 \quad \text{for all } x, y \in \mathcal{X}. \quad -(6)\end{aligned}

(Equations (5) and (6) are analogs of Equations (2) and (3).) Like Equation (2), Equation (5) has a nice intuitive explanation: for any point x, if I make my best linear approximation to my function there, then my function is always going to lie above (or on) the linear approximation plus the quadratic term \frac{m}{2}\|y-x\|_2^2.

If f is twice continuously differentiable, then f is strongly convex with parameter m if and only if \nabla^2 f(x) \succeq mI for all x \in \mathcal{X}.

If f is strongly convex (with any parameter m > 0), then f is strictly convex. The converse is not true: for example, the function f(x) = \exp (x) is strictly convex but not strongly convex.

You can think of the parameter m as measuring how “curved” the function is: the larger m is, the more curved f is. That is why f(x) = \exp (x) is not strongly convex: as x goes to negative infinity, the curve becomes flatter and flatter (its curvature \exp(x) \to 0, so no single m > 0 works everywhere).
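
A quick numerical way to see this (my own sketch): for a twice continuously differentiable function, the best strong convexity parameter m is the infimum of the smallest eigenvalue of the Hessian over the domain.

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T Q x: the Hessian is Q everywhere, so m = lambda_min(Q).
rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
Q = B @ B.T + np.eye(3)                        # symmetric positive definite
print("m for the quadratic:", np.linalg.eigvalsh(Q).min())

# f(x) = exp(x): f''(x) = exp(x), whose infimum over x <= a goes to 0 as a -> -infinity,
# so no single m > 0 works on all of R (strictly convex but not strongly convex).
for a in [0, -5, -10, -20]:
    print(f"inf of f'' over (-inf, {a}]:", np.exp(a))
```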

Strong convexity and optimization/theoretical statistics

Strong convexity is a nice property to have because it allows us to draw conclusions of the following form:

Let’s say f achieves its minimum at x_0. If the value of f(x) is close to f(x_0), then x must be close to x_0.

If f is minimized at x_0, then \nabla f(x_0) = 0, and so Equation (5) tells us that the function must lie above the quadratic curve y = f(x_0) + \frac{m}{2}\|x-x_0\|^2. Thus,

\begin{aligned} & f(x) \text{ close to } f(x_0) \\  \Rightarrow \qquad & f(x) - f(x_0) \text{ small} \\  \Rightarrow \qquad & \left[f(x_0) + \frac{m}{2}\|x-x_0\|^2 \right] - f(x_0) \text{ small} \\  \Rightarrow \qquad &  \frac{m}{2}\|x-x_0\|^2 \text{ small} \\  \Rightarrow \qquad & x \text{ close to } x_0.  \end{aligned}
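
Concretely, combining f(x) \geq f(x_0) + \frac{m}{2}\|x-x_0\|^2 with the size of f(x) - f(x_0) gives the explicit bound

\begin{aligned} \|x - x_0\| \leq \sqrt{\frac{2}{m}\left(f(x) - f(x_0)\right)}, \end{aligned}

so the larger m is, the smaller the guaranteed distance between x and x_0.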

The picture below demonstrates the argument above. In the picture, we know that x_1 must lie in the interval I. It is also clear from the picture that the larger m is, the smaller the interval I is going to be.

A similar pictorial argument also shows us why strict convexity is not enough to give us such a result.

References:

  1. Wikipedia. Convex function.

What is the Charnes-Cooper transformation?

In what follows, bold capital letters represent matrices, smaller letters in bold represent vectors, and non-bolded characters are scalars.

The Charnes-Cooper transformation (Reference 1) is a transformation that allows us to solve linear-fractional programs by solving a linear program.

A linear program is an optimization problem of the form

\begin{aligned} \underset{\bf x}{\text{maximize}} \quad& {\bf c}^T{\bf x} \\  \text{subject to} \quad& {\bf Ax} \leq {\bf b}, \\  &{\bf x} \geq {\bf 0},\end{aligned}

where {\bf A} \in \mathbb{R}^{m \times n}, {\bf x, c} \in \mathbb{R}^n and {\bf b} \in \mathbb{R}^m. In the above, the inequalities are to be interpreted element-wise. We like linear programs because we have many efficient algorithms for solving them (e.g. the simplex method and interior point methods).

A linear-fractional program is a generalization of the linear program where the objective function is replaced by a ratio of two affine functions, and the optimization is over the same type of constraints:

\begin{aligned} \underset{\bf x}{\text{maximize}} \quad& \frac{{\bf c}^T{\bf x} + \alpha}{{\bf d}^T{\bf x} + \beta} \\  \text{subject to} \quad& {\bf Ax} \leq {\bf b}, \\  &{\bf x} \geq {\bf 0},\end{aligned}

where {\bf d} \in \mathbb{R}^n and \alpha, \beta are scalar constants.

The Charnes-Cooper transformation

The Charnes-Cooper transformation is a change of variables which allows us to transform a linear-fractional program into a linear program. Define new variables

\begin{aligned} {\bf y} = \dfrac{1}{{\bf d}^T {\bf x} + \beta} \cdot {\bf x}, \qquad t = \dfrac{1}{{\bf d}^T {\bf x} + \beta}. \end{aligned}

Under the assumption that the feasible region is non-empty and bounded, and that the denominator {\bf d}^T {\bf x} + \beta is positive on the feasible region, Reference 1 showed that the linear-fractional program above is equivalent to the linear program below:

\begin{aligned} \underset{{\bf y}, t}{\text{maximize}} \quad& {\bf c}^T{\bf y} + \alpha t \\  \text{subject to} \quad& {\bf Ay} \leq {\bf b}t, \\  &{\bf d}^T {\bf y} + \beta t = 1, \\  &{\bf y} \geq {\bf 0}, \; t \geq 0.\end{aligned}

Given the solution {\bf y} and t, we can reconstruct the solution to our original problem: {\bf x} = {\bf y} / t.
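
Here is a small sketch of the transformation in action on a made-up linear-fractional program (the numbers are arbitrary and only illustrate the bookkeeping), solving the transformed LP with scipy's linprog:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical problem: maximize (2*x1 + x2) / (x1 + 2*x2 + 1)
# subject to x1 + x2 <= 4, 2*x1 + x2 <= 6, x >= 0.
A = np.array([[1.0, 1.0],
              [2.0, 1.0]])
b = np.array([4.0, 6.0])
c = np.array([2.0, 1.0]); alpha = 0.0
d = np.array([1.0, 2.0]); beta = 1.0          # denominator is positive on the feasible region

# Transformed LP in (y, t): maximize c^T y + alpha*t
#   subject to A y - b t <= 0,  d^T y + beta*t = 1,  y >= 0,  t >= 0.
# scipy's linprog minimizes, so we negate the objective.
obj = -np.concatenate([c, [alpha]])
A_ub = np.hstack([A, -b.reshape(-1, 1)])
b_ub = np.zeros(A.shape[0])
A_eq = np.concatenate([d, [beta]]).reshape(1, -1)
b_eq = np.array([1.0])
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))

y_sol, t_sol = res.x[:-1], res.x[-1]
x_sol = y_sol / t_sol                          # recover the original variable x = y / t
print("x =", x_sol, "objective =", (c @ x_sol + alpha) / (d @ x_sol + beta))
```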

References:

  1. Charnes, A., and Cooper, W. W. (1962). Programming with linear fractional functionals.
  2. Wikipedia. Linear-fractional programming.

Consensus ADMM

In this previous post, I introduced the alternating direction method of multipliers (ADMM), an algorithm which solves a particular class of convex optimization problems. ADMM has gained popularity in recent years because it allows certain problems to be solved in a distributed manner. Consensus optimization, the subject of this post, is one example of this.

(Note: The material for this post is taken from Chapter 7 of Reference 1.)

Recap

Recall that ADMM solves problems that can be written in the form

\begin{aligned} \text{minimize} \quad &f(x) + g(z) \\  \text{subject to}\quad & Ax + Bz = c, \end{aligned}

with variables x \in \mathbb{R}^n, z \in \mathbb{R}^m, and A \in \mathbb{R}^{p \times n}, B \in \mathbb{R}^{p \times m}, and c \in \mathbb{R}^p. f and g are assumed to be convex functions. For brevity, I will call this problem the “ADMM problem”, or I may say that the problem is in “ADMM form”.

The ADMM algorithm consists of the iterations

\begin{aligned} x^{k+1} &= \text{argmin}_x \; L_\rho \left(x, z^k, y^k \right), \\  z^{k+1} &= \text{argmin}_z \; L_\rho \left(x^{k+1}, z, y^k \right), \\  y^{k+1} &= y^k + \rho (Ax^{k+1} + Bz^{k+1} - c). \end{aligned}

In the above, superscripts refer to the iteration number.

Global Variable Consensus Optimization

Consider the optimization problem

\begin{aligned} \text{minimize} \quad \displaystyle\sum_{i=1}^N f_i(x), \end{aligned}

where x \in \mathbb{R}^n and f_i : \mathbb{R}^n \mapsto \mathbb{R} \cup \{ +\infty \} are convex. (Note that each term can encode constraints by setting f_i(x) = +\infty when a constraint is violated.) Imagine that we had N worker machines available to us (along with one central machine). Could we break the optimization up into N parallel parts, with the optimization of f_i being done by machine i (roughly speaking)?

We can do so if we introduce auxiliary variables. For f_i, replace the argument x with a “local copy” x_i \in \mathbb{R}^n, then constrain each of the x_i to be equal to some other auxiliary variable z \in \mathbb{R}^n. In other words, solve the equivalent problem

\begin{aligned} \text{minimize} \quad &\displaystyle\sum_{i=1}^N f_i(x_i) \\  \text{subject to} \quad & x_i - z = 0, \qquad i = 1, \dots, N. \end{aligned}

At first it might seem like we have taken a step back, since we now have to solve for N+1 variables instead of just one variable. However, notice that the problem is in ADMM form, and so we can write the ADMM updates:

\begin{aligned} x_i^{k+1} &= \text{argmin}_{x_i} \; \left[ f_i(x_i) + (y_i^k)^T (x_i - z^k) + \dfrac{\rho}{2}\|x_i - z^k \|_2^2 \right], \\  z^{k+1} &= \dfrac{1}{N}\sum_{i=1}^N \left( x_i^{k+1} + \dfrac{1}{\rho} y_i^k \right), \\  y_i^{k+1} &= y_i^k + \rho (x_i^{k+1} - z^{k+1}). \end{aligned}

The first and third steps can be done in parallel across N machines, with machine i solving the minimization problem associated with f_i. After step 1, all the new values x_1^{k+1}, \dots, x_N^{k+1} are fed to a central processing element (sometimes called the central collector or fusion center) to compute the z update. This new value of z is then broadcast back to the N machines to compute the y updates in parallel.

Global Variable Consensus Optimization: An example

One example of global variable consensus optimization is linear regression over a huge dataset. Let’s say that the design matrix is X \in \mathbb{R}^{Nk \times p} and the response vector is y \in \mathbb{R}^{Nk}. Linear regression solves the optimization problem

\text{minimize}_\beta \quad \| y - X\beta \|_2^2.

If X is huge, it may be broken up into parts and stored on different machines. For example, assume that X and y are broken up row-wise into N parts, with part i, denoted by X_i \in \mathbb{R}^{k \times p} and y_i \in \mathbb{R}^k being on machine i. We can then write the optimization problem as

\text{minimize}_\beta \quad \sum_{i=1}^N \| y_i - X_i\beta \|_2^2,

which is the global variable consensus problem. This allows us to use consensus ADMM to solve for \beta in a distributed manner.
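
Here is a minimal sketch of consensus ADMM for this example (my own illustration, with the N machines simulated by a loop; the closed-form x_i-update below is specific to the squared-error loss):

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, p, rho = 5, 20, 3, 1.0              # N blocks of k rows each, p features
beta_true = rng.normal(size=p)
X = rng.normal(size=(N * k, p))
y = X @ beta_true + 0.1 * rng.normal(size=N * k)
X_blocks = np.split(X, N)                 # X_i stored on "machine" i
y_blocks = np.split(y, N)

x_loc = np.zeros((N, p))                  # local copies x_i
u_dual = np.zeros((N, p))                 # dual variables (y_i in the post; renamed to avoid clashing with the data y)
z = np.zeros(p)                           # consensus variable

for it in range(100):
    # Step 1 (parallel): x_i-update; closed form since f_i(b) = ||y_i - X_i b||^2
    for i in range(N):
        Xi, yi = X_blocks[i], y_blocks[i]
        lhs = 2 * Xi.T @ Xi + rho * np.eye(p)
        rhs = 2 * Xi.T @ yi + rho * z - u_dual[i]
        x_loc[i] = np.linalg.solve(lhs, rhs)
    # Step 2 (central collector): z-update is an average
    z = np.mean(x_loc + u_dual / rho, axis=0)
    # Step 3 (parallel): dual updates
    u_dual += rho * (x_loc - z)

print("consensus estimate:", z)
print("direct least squares:", np.linalg.lstsq(X, y, rcond=None)[0])
```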

General Form Consensus Optimization

The global variable consensus problem can be generalized. In the global variable consensus problem, machine i deals with f_i(x_i), where x_i is a local copy of the global variable z. In general form consensus optimization, machine i deals with f_i(x_i), where x_i is (instead) a subvector of the global variable z.

This generalization is easy conceptually but cumbersome in terms of notation. Let z \in \mathbb{R}^n be the global variable. For each i = 1, \dots, N, let x_i \in \mathbb{R}^{n_i} be a “local” variable that maps to a subvector of z. Let \mathcal{G} be a mapping such that \mathcal{G}(i,j) = g if and only if local variable (x_i)_j corresponds to the global variable z_g.

Perhaps an example will help make this notation more understandable. In the diagram below, n = 4, N = 3, and n_1 = 4, n_2 = 2, n_3 = 3. (x_2)_1 corresponds to z_1 and (x_2)_2 corresponds to z_3, so \mathcal{G}(2, 1) = 1 and \mathcal{G}(2,2) = 3.

The general form consensus optimization problem is of the form

\begin{aligned} \text{minimize} \quad \displaystyle\sum_{i=1}^N f_i(x_i). \end{aligned}

Let \tilde{z}_i \in \mathbb{R}^{n_i} be defined by (\tilde{z}_i)_j = z_{\mathcal{G}(i, j)}. That is, \tilde{z}_i is the global variable’s idea of what local variable x_i should be. We can rewrite the optimization problem above as

\begin{aligned} \text{minimize} \quad &\displaystyle\sum_{i=1}^N f_i(x_i) \\  \text{subject to} \quad & x_i - \tilde{z}_i = 0, \qquad i = 1, \dots, N. \end{aligned}

Again, this is in ADMM form and so we can write the ADMM updates easily:

\begin{aligned} x_i^{k+1} &= \text{argmin}_{x_i} \; \left[ f_i(x_i) + (y_i^k)^T x_i + \dfrac{\rho}{2}\|x_i - \tilde{z}_i^k \|_2^2 \right], \\  z^{k+1} &= \text{argmin}_z \; \left[ \sum_{i=1}^N \left( - (y_i^k)^T \tilde{z}_i + \dfrac{\rho}{2} \| x_i^{k+1} - \tilde{z}_i \|_2^2 \right) \right], \\  y_i^{k+1} &= y_i^k + \rho (x_i^{k+1} - \tilde{z}_i^{k+1}). \end{aligned}

As in global variable consensus, steps 1 and 3 decouple into N subproblems, one for each x_i and y_i. Unlike global variable consensus, step 2 also decouples! We can compute the update for each z_g, g = 1, \dots, n separately:

\begin{aligned} z_g^{k+1} = \dfrac{\sum_{\mathcal{G}(i,j) = g} \left[ (x_i^{k+1})_j + (1/\rho) (y_i^k)_j \right]}{\sum_{\mathcal{G}(i,j) = g} 1} \quad \text{for each } g = 1, \dots, n. \end{aligned}

Hence, less broadcasting is required between steps 1 and 2 and steps 2 and 3. For example, in the diagram above, we only have to broadcast along the arrows from left to right between steps 1 and 2, and along the arrows from right to left between steps 2 and 3. (For global variable consensus, we have to broadcast along the complete bipartite graph between the left and right.)

General Form Consensus Optimization: An example

One example of global variable consensus optimization is linear regression over a huge dataset, where the dataset is split up row-wise across many machines. An example of general form consensus optimization is where the dataset is further split up column-wise: that is, each machine only has feature values for a subset of features, and for only a subset of the observations.

References:

  1. Boyd, S., et al. 2010. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.

Alternating Direction Method of Multipliers (ADMM)

The alternating direction method of multipliers (ADMM) is an algorithm for solving particular types of convex optimization problems. It was originally proposed in the mid-1970s by Glowinski & Marrocco (1975) and Gabay & Mercier (1976). According to Reference 1,

ADMM is an algorithm that is intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers.

(Both dual ascent and the method of multipliers are described in Section 2 of Reference 1. I may write blog posts on these in the future.) ADMM is gaining a lot of popularity now because it often allows optimization to be done in a distributed manner, making problems of ever-increasing size tractable.

The problem ADMM solves

ADMM solves problems of the form

\begin{aligned} \text{minimize} \quad &f(x) + g(z) \\  \text{subject to}\quad & Ax + Bz = c, \end{aligned}

with variables x \in \mathbb{R}^n, z \in \mathbb{R}^m, and A \in \mathbb{R}^{p \times n}, B \in \mathbb{R}^{p \times m}, and c \in \mathbb{R}^p. f and g are assumed to be convex functions.

The art of convex optimization involves learning how to formulate your problem in the form you want (in this case, in the problem described above). One reason for ADMM’s appeal is its broad applicability: several different common types of convex optimization problems can be framed as the problem above. In particular, ADMM deals with an objective function that is separable. Such objective functions appear widely in machine learning, where we want to minimize the sum of a loss function and a penalty (or regularization) function.

The ADMM algorithm

Define the augmented Lagrangian

\begin{aligned} L_\rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + \dfrac{\rho}{2} \| Ax + Bz - c \|_2^2, \end{aligned}

where \rho > 0 is some penalty parameter. (y is known as the dual variable.) The ADMM algorithm consists of the iterations

\begin{aligned} x^{k+1} &= \text{argmin}_x \; L_\rho \left(x, z^k, y^k \right), \\  z^{k+1} &= \text{argmin}_z \; L_\rho \left(x^{k+1}, z, y^k \right), \\  y^{k+1} &= y^k + \rho (Ax^{k+1} + Bz^{k+1} - c). \end{aligned}

In the above, superscripts refer to the iteration number.

These updates are known as the unscaled version of ADMM. The scaled version of ADMM, whose updates are listed below, is often used instead because it typically results in shorter formulas.

\begin{aligned} x^{k+1} &= \text{argmin}_x \; \left( f(x) + \dfrac{\rho}{2} \| Ax + Bz^k - c + u^k \|_2^2 \right), \\  z^{k+1} &= \text{argmin}_z \; \left( g(z) + \dfrac{\rho}{2} \| Ax^{k+1} + Bz - c + u^k \|_2^2 \right), \\  u^{k+1} &= u^k + Ax^{k+1} + Bz^{k+1} - c. \end{aligned}

Here, u = y / \rho is known as the scaled dual variable.

Things to note in practice

According to both References 1 and 2, in practice ADMM usually obtains a modestly accurate solution in a handful of iterations, but requires a large number of iterations for a highly accurate solution.

While ADMM is guaranteed to converge under mild conditions on f and g and for all values of the parameter \rho (see Section 3.2 of Reference 1 for details), in practice the value of \rho can greatly affect the performance of the algorithm. There appears to be some work done on determining what the optimal value of \rho is in particular cases, but I haven’t seen any general heuristic for choosing it. Reference 1 often sets \rho = 1, and in one example used values of \rho in the range 0.1 to 100.

How many iterations should we run ADMM for before terminating? Section 3.3 of Reference 1 gives a few possible heuristics. The easiest to implement is to set thresholds \varepsilon^{\text{pri}} and \varepsilon^{\text{dual}} for feasibility tolerances for the primal and dual feasibility conditions respectively (i.e. how far off we are willing to be for our primal and dual conditions). We terminate once the following two conditions are met:

\begin{aligned} \| r^k \|_2 &= \| Ax^k + Bz^k - c \|_2 \leq \varepsilon^{\text{pri}}, \text{ and} \\  \| s^k \|_2 &= \| \rho A^T B (z^k - z^{k-1}) \|_2 \leq \varepsilon^{\text{dual}}. \end{aligned}

r^k and s^k are known as the primal and dual residuals respectively. See Section 3.3 of Reference 1 for more details.
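
As a concrete illustration (my own sketch, not code from the references), here is scaled ADMM for the lasso, minimize \frac{1}{2}\|X\beta - b\|_2^2 + \lambda \|z\|_1 subject to \beta - z = 0, including the stopping criterion above (for this splitting the constraint matrices are A = I, B = -I, c = 0):

```python
import numpy as np

def soft_threshold(v, kappa):
    """Element-wise soft-thresholding, the proximal operator of kappa * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(X, b, lam, rho=1.0, eps_pri=1e-6, eps_dual=1e-6, max_iter=1000):
    """Scaled ADMM for: minimize 0.5*||X beta - b||^2 + lam*||z||_1, subject to beta - z = 0."""
    n = X.shape[1]
    beta, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    XtX, Xtb = X.T @ X, X.T @ b
    lhs = XtX + rho * np.eye(n)                              # reused in every beta-update
    for _ in range(max_iter):
        beta = np.linalg.solve(lhs, Xtb + rho * (z - u))     # x-update
        z_old = z
        z = soft_threshold(beta + u, lam / rho)              # z-update (prox of lam*||.||_1)
        u = u + beta - z                                     # scaled dual update
        # Stopping criterion: with A = I, B = -I, c = 0 the primal residual is
        # beta - z and the dual residual is -rho*(z - z_old).
        if (np.linalg.norm(beta - z) <= eps_pri
                and np.linalg.norm(rho * (z - z_old)) <= eps_dual):
            break
    return z
```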

References:

  1. Boyd, S., et al. 2010. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.
  2. Tibshirani, R. Alternating Direction Method of Multipliers.
  3. Boyd, S. ADMM.

An identity linking the nuclear norm and the Frobenius norm

Mazumder et al. (2010) prove a very nice little lemma (Lemma 6 in the paper) that links the nuclear norm of a matrix (i.e. the sum of its singular values) to the solution of an optimization problem involving Frobenius norms. Here is the lemma:

Lemma: For any matrix Z \in \mathbb{R}^{m \times n}, we have

\begin{aligned} \| Z \|_* = \min_{U, V : Z = UV^T} \frac{1}{2} \left( \| U\|_F^2 + \|V\|_F^2 \right). \end{aligned}

If \text{rank}(Z) = k \leq \min (m, n), then the minimum above is attained at a factor decomposition Z = U_{m \times k} V_{n \times k}^T.

The proof of the lemma is just a page long and can be found in Appendix A.5 of the paper.
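
Here is a quick numerical check of the lemma (my own sketch): given the SVD Z = UDV^T, take U_0 = UD^{1/2} and V_0 = VD^{1/2}; then Z = U_0 V_0^T and \frac{1}{2}(\|U_0\|_F^2 + \|V_0\|_F^2) equals the sum of the singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 4))
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
nuclear_norm = s.sum()

U0 = U * np.sqrt(s)                 # scale the columns of U by sqrt of the singular values
V0 = Vt.T * np.sqrt(s)
print(np.allclose(U0 @ V0.T, Z))    # the factorization Z = U0 V0^T holds
print(nuclear_norm,
      0.5 * (np.linalg.norm(U0, "fro")**2 + np.linalg.norm(V0, "fro")**2))  # equal values
```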

References:

  1. Mazumder, R., Hastie, T., and Tibshirani, R. (2010) Spectral Regularization Algorithms for Learning Large Incomplete Matrices.

FISTA: A description

This post introduces the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA), a fast optimization algorithm for a particular class of unconstrained convex optimization problems. In particular, we seek to minimize F(x) = f(x) + g(x) over x \in \mathbb{R}^n where:

  • g: \mathbb{R}^n \mapsto \mathbb{R} is a continuous convex function which is possibly non-smooth,
  • f: \mathbb{R}^n \mapsto \mathbb{R} is a smooth convex function that is continuously differentiable with Lipschitz continuous gradient L(f): i.e. \begin{aligned} \| \nabla f(x) - \nabla f(y) \| \leq L(f) \| x - y \| \end{aligned} for every x, y \in \mathbb{R}^n, where \| \cdot \| is the standard Euclidean norm and L(f) > 0 is the Lipschitz constant of \nabla f.

We also assume that the problem is solvable, i.e. there exists some x^* \in \mathbb{R}^n such that F(x^*) \leq F(x) for all x \in \mathbb{R}^n.

ISTA

FISTA is a refinement of the Iterative Shrinkage-Thresholding Algorithm (ISTA), which we introduce first. Define

\begin{aligned} p_L(y) = \text{argmin}_x \; \left\{ g(x) + \frac{L}{2} \left\| x - \left( y - \frac{1}{L} \nabla f(y) \right) \right\|^2 \right\}. \end{aligned}

(This is very close to the proximal operator for g.) ISTA (with constant step size) is as follows:

  • Input: L = L(f), a Lipschitz constant of \nabla f.
  • Step 0: Take x_0 \in \mathbb{R}^n.
  • Step k (k \geq 1): Set x_k = p_L (x_{k-1}).

(The steps for ISTA with backtracking can be found in the FISTA paper.) ISTA is fast if the operator p_L can be computed cheaply. Beck & Teboulle show that ISTA has a worst-case convergence rate of O(1/k), i.e. F(x_k) - F(x^*) \simeq O(1/k).

FISTA

FISTA improves on ISTA by choosing a better sequence of y_k‘s to apply the operator p_L to. It is inspired by Nesterov (1983). The formulas appear pretty magical to me: it seems to be using some sort of “momentum”, since it applies p_L to a specific linear combination of x_{k-1} and x_{k-2} (instead of just x_{k-1}).

Here are the steps for FISTA with constant step size (FISTA with backtracking is in the FISTA paper):

  • Input: L = L(f), a Lipschitz constant of \nabla f.
  • Step 0: Take y_1 = x_0 \in \mathbb{R}^n, t_1 = 1.
  • Step k (k \geq 1):
    • Set x_k = p_L (y_k).
    • Set t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}.
    • Set y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1}).

The operations for FISTA are pretty much the same as that for ISTA, but FISTA enjoys faster convergence: F(x_k) - F(x^*) \simeq O(1/k^2). Here is the convergence theorem for FISTA with constant step size:
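
Here is a minimal sketch of FISTA with constant step size (my own implementation of the steps above; the lasso set-up at the bottom, where p_L reduces to soft-thresholding, is just one example of a problem with a cheap p_L):

```python
import numpy as np

def fista(grad_f, prox_g, L, x0, n_iter=500):
    """FISTA with constant step size 1/L.

    grad_f : gradient of the smooth part f
    prox_g : function (v, step) -> argmin_x { g(x) + (1/(2*step)) * ||x - v||^2 }
    L      : a Lipschitz constant of grad_f
    """
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(n_iter):
        x = prox_g(y - grad_f(y) / L, 1.0 / L)          # x_k = p_L(y_k)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)     # momentum step
        x_prev, t = x, t_next
    return x_prev

# Example set-up (assumed, not from the paper): the lasso, with
# f(x) = 0.5*||Ax - b||^2 and g(x) = lam*||x||_1, so that p_L is soft-thresholding
# and L can be taken as the largest eigenvalue of A^T A.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 100)); b = rng.normal(size=50); lam = 0.1
grad_f = lambda x: A.T @ (A @ x - b)
prox_g = lambda v, step: np.sign(v) * np.maximum(np.abs(v) - lam * step, 0.0)
L = np.linalg.eigvalsh(A.T @ A).max()
x_hat = fista(grad_f, prox_g, L, np.zeros(100))
```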

Theorem (convergence rate of FISTA): Let \{ x_k\} and \{y_k\} be generated by FISTA with constant step size. Then for any k \geq 1, we have

\begin{aligned} F(x_k) - F(x^*) \leq \frac{2L(f) \|x_0 - x^*\|^2}{(k+1)^2}, \end{aligned}

for any minimizer x^*.

See Theorem 4.4 in the FISTA paper for the full statement (it includes the bound for FISTA with backtracking.)

References:

  1. Beck, A., and Teboulle, M. (2009) A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.
  2. Nesterov, Y. E. (1983). A method for solving the convex programming problem with convergence rate O(1/k^2).

Test functions for optimization and the Branin-Hoo function

In the field of optimization, test functions are functions against which various characteristics of optimization algorithms are measured (e.g. convergence rate, robustness). They often serve as benchmarks in papers to compare newly proposed optimization algorithms with existing ones. Different test functions have different properties (e.g. degree of smoothness, number of local/global minima) to describe the range of functions optimization algorithms might face in production.

One such function I came across recently in a paper is the Branin-Hoo function (or Branin function). The Branin-Hoo function takes a two-dimensional input, and has the form

\begin{aligned} f(x_1, x_2) = a(x_2 - bx_1^2 + cx_1 - r)^2 + s(1-t) \cos (x_1) + s. \end{aligned}

The typical parameter values are a = 1, b = 5.1 / (4\pi^2), c = 5/ \pi, r = 6, s = 10 and t = 1/ (8\pi). The function is usually evaluated over the square x_1 \in [-5, 10], x_2 \in [0, 15].

The function has 3 global minima: x^* = (-\pi, 12.275), (\pi, 2.275) and (9.42478, 2.475), with the minimum function value being f(x^*) = 0.397887. Here is a picture of the function surface:

Branin Function surface. Credit: https://www.sfu.ca/~ssurjano/branin.html

And here is a contour plot of the surface:

Branin function contour plot. Credit: https://uqworld.org/t/branin-function/53
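
Here is a short snippet that evaluates the function (with the typical parameter values) at the three minimizers listed above, as a quick check of my own:

```python
import numpy as np

def branin(x1, x2):
    """Branin-Hoo function with the usual parameter values."""
    a, b, c = 1.0, 5.1 / (4 * np.pi**2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
    return a * (x2 - b * x1**2 + c * x1 - r)**2 + s * (1 - t) * np.cos(x1) + s

for x1, x2 in [(-np.pi, 12.275), (np.pi, 2.275), (9.42478, 2.475)]:
    print(f"f({x1:.5f}, {x2}) = {branin(x1, x2):.6f}")   # each value is approximately 0.397887
```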

I couldn’t track down the origin of this function and its etymology. One of the sources that it has been traced to is a 1978 article by Dixon and Szego, but I could not find a copy. Another source is a 1972 article by Franklin Branin, but it sits behind a paywall so I could not verify if the function was introduced there.

For a list of more test functions, see the Wikipedia page on test functions for optimization or the list compiled by Sonja Surjanovic & Derek Bingham (Simon Fraser University).

References:

  1. Surjanovic, S., and Bingham, D. Branin Function.
  2. Dixon, L. C. W., and Szego, G. P. (1978). The global optimization problem: an introduction.
  3. Branin Jr., Franklin H. (1972). Widely convergent method of finding multiple solutions of simultaneous nonlinear equations.