# PHDG for solving the saddle point formulation of linear programs (LPs)

Saddle point formulation of linear programs (LPs)

Consider the linear program of the form

\begin{aligned} \text{minimize}_x \quad & c^\top x \\ \text{subject to} \quad& Ax \leq b, \end{aligned}

where $c, x \in \mathbb{R}^n$, $A \in \mathbb{R}^{m \times n}$, and $b \in \mathbb{R}^m$. The Lagrangian for this problem is

$L(x,y) = c^\top x + y^\top (Ax - b) = y^\top Ax + c^\top x - b^\top y,$

where the domain of $L$ is $(x, y) \in \mathbb{R}^n \times \mathbb{R}_+^m$ (elements of $y$ must be non-negative).

In this previous post, we introduced the saddle point theorem for convex optimization. Here is that same saddle point theorem, but for LPs:

Theorem. If $(x^*, y^*) \in \mathbb{R}^n \times \mathbb{R}_+^m$ is a saddle point for the Lagrangian $L$, then $x^*$ solves the primal problem. Conversely, if $x^*$ is a finite solution to the primal problem, then there is a $y* \in \mathbb{R}_+^m$ such that $(x^*, y^*)$ is a saddle point for $L$.

Finding the saddle point of an LP Lagrangian can be written as either of the two (equivalent) formulations:

\begin{aligned} &\text{Primal Formulation:} && \min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}_+^m} \left[ y^\top Ax + c^\top x - b^\top y \right], \\ &\text{Dual Formulation:} && \max_{y \in \mathbb{R}_+^m} \min_{x \in \mathbb{R}^n} \left[ y^\top Ax + c^\top x - b^\top y \right]. \end{aligned}

The saddle point formulation of the LP allows us to use primal-dual methods to solve the problem. One such method is called the Primal-Dual Hybrid Gradient (PDHG) method introduced by Chambolle & Pock (2011) (Reference 1). PDHG solves the more general problem

\begin{aligned} \min_{x \in X} \max_{y \in Y} \left\{ \left\langle Kx, y \right\rangle + G(x) - F^*(y) \right\}, \end{aligned}

where $X, Y$ are finite-dimensional real vector spaces, $K : X \mapsto Y$ is a continuous linear operator with induced norm $\| K \| = \max \{ \|Kx \|: x \in X \text{ and } \|x\| \leq 1 \}$, $G: X \mapsto [0, +\infty]$ and $F^*: Y \mapsto [0, +\infty]$ are proper, convex, lower-semicontinuous functions.

The PDHG algorithm simply loops over the following 3 steps until convergence:

1. (Dual step) $y^{n+1} = \text{prox}_{\sigma F^*}\left( y^n + \sigma K \bar{x}^n \right)$.
2. (Primal step) $x^{n+1} = \text{prox}_{\tau G} \left( x^n - \tau K^* y^{n+1}\right)$.
3. (Over-relaxation) $\bar{x}^{n+1} = x^{n+1} + \theta (x^{n+1} - x^n)$.

In the above, $\tau, \sigma > 0$ and $\theta \in [0, 1]$ are hyperparameters to be chosen by the user. (Note: Reference 1 doesn’t have proximal operators but has resolvent operators instead. See this post for why we can rewrite the algorithm with prox operators.) It is also possible to switch the order of the dual and primal steps, which gives rise to the following 2 steps until convergence (see Reference 2):

1. $x^{n+1} = \text{prox}_{\tau G} \left( x^n - \tau K^* y^n \right)$.
2. $y^{n+1} = \text{prox}_{\sigma F^*}\left( y^n + \sigma K \left[ x^n + \theta (x^{n+1} - x^n)\right] \right)$.

PDHG for LPs

Recall the LP problem is equivalent to

\begin{aligned} \min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}_+^m} \left[ y^\top Ax + c^\top x - b^\top y \right]. \end{aligned}

To map this problem onto the PDHG problem, we need $X \leftarrow \mathbb{R}^n$, $Y \leftarrow \mathbb{R}_+^m$, $K(x) = Ax$, $G(x) = c^\top x$, $F^*(y) = b^\top y$. With this, the PDHG algorithm becomes

1. $x^{n+1} = \text{prox}_{\tau G} \left( x^n - \tau A^\top y^n \right)$.
2. $y^{n+1} = \text{prox}_{\sigma F^*}\left( y^n + \sigma A \left[ x^n + \theta (x^{n+1} - x^n)\right] \right)$.

Note that for $h: X \mapsto \mathbb{R}$ such that $h(x) = c^\top x$, we have $\text{prox}_h (x) = \text{proj}_X (x - c)$. (See this post for the proof.) Hence, we can replace the prox operators above with Euclidean projection operators:

1. $x^{n+1} = \text{proj}_{\mathbb{R}^n} \left( x^n - \tau A^\top y^n - \tau c \right) = x^n - \tau \left( A^\top y^n + c \right)$.
2. $y^{n+1} = \text{proj}_{\mathbb{R}_+^m}\left( y^n + \sigma A \left[ x^n + \theta (x^{n+1} - x^n) \right] - \sigma b \right)$.

References:

1. Chambolle, A., and Pock, T. (2011). A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging.
2. Pock, T., and Chambolle, A. (2011). Diagonal preconditioning for first order primal-dual algorithms in convex optimization.

# Proximal operator for inner product in R^n

Let $h: \mathcal{X} \mapsto \mathbb{R}$ be a lower semi-continuous convex function with $\mathcal{X} \subseteq \mathbb{R}^n$ being a convex set. The proximal operator for $h$ is defined by

\begin{aligned} \text{prox}_h (x) = \text{argmin}_{\beta \in \mathcal{X}} \left\{ \frac{1}{2} \| x - \beta \|_2^2 + h(\beta) \right\}. \end{aligned}

In this post, we derive the value of the proximal operator for the function $h: \mathcal{X} \mapsto \mathbb{R}$ where $h(x) = c^\top x$ for some $c \in \mathbb{R}^n$.

\begin{aligned} \text{prox}_h (x) &= \text{argmin}_{\beta \in \mathcal{X}} \left\{ \frac{1}{2} \| x - \beta \|_2^2 + c^\top \beta \right\} \\ &= \text{argmin}_{\beta \in \mathcal{X}} \left\{ \frac{1}{2} \left( \beta^\top \beta - 2x^\top \beta \right) + c^\top \beta \right\} \\ &= \text{argmin}_{\beta \in \mathcal{X}} \left\{ \frac{1}{2} \left(\beta^\top \beta - 2(x - c)^\top \beta \right) \right\} \\ &= \text{argmin}_{\beta \in \mathcal{X}} \left\{ \frac{1}{2} \left\| \beta - (x - c) \right\|_2^2 \right\} \\ &= \text{proj}_{\mathcal{X}} (x - c), \end{aligned}

where $\text{proj}_{\mathcal{X}}(x)$ is the Euclidean projection of $x$ onto $\mathcal{X}$. That is, the proximal operator is a translation followed by a Euclidean projection.

# Strong duality for linear programs (LPs)

I keep forgetting the possible scenarios for strong duality of linear programs (LPs), so I’m putting it here as a reference.

Theorem (Strong duality of LPs). Consider a primal LP and its dual. There are 4 possibilities:
1. Both primal and dual have no feasible solutions.
2. The primal is infeasible and the dual is unbounded.
3. The dual is infeasible and the primal is unbounded.
4. Both primal and dual have feasible solutions and their values are equal.

Examples of each possibility can be found in Table 4.2 of Reference 1. A proof of this theorem can be found in multiple places (e.g. Reference 2).

References:

1. Duality in Linear Programming.
2. Williamson, D. P., and Kircher, K. (2014). Lecture 8: Strong duality.

# Saddle point theorem for convex optimization

Consider the convex optimization problem

\begin{aligned} \underset{x}{\text{minimize}} \quad & f_0(x) \\ \text{subject to} \quad & f_i(x) \leq 0, \quad i = 1, \dots, s, \\ &f_i(x) = 0, \quad i = s+1, \dots, m, \end{aligned}

where $f_0, f_1, \dots, f_s$ are convex functions from $\mathbb{R}^n$ to $\mathbb{R}$ and $f_{s+1}, \dots, f_m$ are affine functions from $\mathbb{R}^n$ to $\mathbb{R}$. The Lagrangian for this problem can be written as

\begin{aligned} L(x, y) = f_0(x) + \sum_{i=1}^m y_m f_m(x), \end{aligned}

where the domain of $y$ is $\mathcal{K} = \mathbb{R}_+^s \times \mathbb{R}^{m-s}$. (Dual variables associated with inequalities must be non-negative, while there are no restrictions on dual variables associated with equalities.)

A point $(x^*, y^*) \in \mathbb{R}^n \times \mathcal{K}$ is called a saddle point if for all $x \in \mathbb{R}^n$ and $y \in \mathcal{K}$ we have

\begin{aligned} L(x^*, y) \leq L(x^*, y^*) \leq L(x, y^*). \end{aligned}

Another way of saying this is if we fix $x = x^*$, then the Lagrangian is maximized at $y^*$ (red line) and if we fix $y = y^*$, then the Lagrangian is minimized at $x^*$ (blue line).

Illustration of saddle point. (Image from https://davidlibland.github.io/posts/2018-11-10-saddle-points-and-sdg.html, with some of my annotations on top.)

The saddle point theorem links optimality in the primal problem with the existence of a saddle point for the Lagrangian:

Theorem. If $(x^*, y^*) \in \mathbb{R}^n \times \mathcal{K}$ is a saddle point for the Lagrangian $L$, then $x^*$ solves the primal problem. Conversely, if $x^*$ is a solution to the primal problem at which Slater’s condition holds, then there is a $y* \in \mathcal{K}$ such that $(x^*, y^*)$ is a saddle point for $L$.

A proof of this theorem can be found in both References 1 and 2. The first statement of the theorem provides motivation for an optimization algorithm: try to find a saddle point for the Lagrangian, because once we do, we have also found an optimal point for the primal problem.

Minimax and maximin

For any Lagrangian $L$ we have

\begin{aligned} \inf_{x} L(x, y) &\leq L(x, y) \leq \sup_y L(x, y) &\text{for all } (x, y), \\ \sup_y \inf_x L(x, y) &\leq \sup_y L(x, y) &\text{(by taking sup over all } y), \\ \sup_y \inf_x L(x, y) &\leq \inf_x \sup_y L(x, y) &\text{(by taking inf over all } x), \end{aligned}

that is, the minimax is always greater than the maximin. However, if $L$ has a saddle point $(x^*, y^*)$, then

\begin{aligned} \inf_x \sup_y L(x, y) \leq \sup_y L(x^*, y) \leq L(x^*, y^*) \leq \inf_x L(x, y^*) \leq \sup_y \inf_x L(x, y), \end{aligned}

i.e. the minimax is less than or equal to the maximin. Thus, if $L$ has a saddle point, the minimax and maximin are equal:

\begin{aligned}\sup_y \inf_x L(x, y) = \inf_x \sup_y L(x, y). \end{aligned}

In particular, note that the LHS of the above is equal to the maximum value of the dual problem, while the RHS is equal to the minimum value of the primal problem (see e.g. this). Thus, if $L$ has a saddle point, then we have strong duality.

(This is closely related to von Neumann’s minimax theorem, which gives a sufficient condition for the minimax to be equal to the maximin.)

References:

1. Burke, J. V. Convex Optimization, Saddle Point Theory, and Lagrangian Duality.
2. Tangkijwanichakul, T. (2020). Saddle Point Theorem.

# Deriving the dual of a linear program (LP)

Did you know that the dual of a linear program (LP) is also a linear program? I derive the dual LP in this post.

Consider the primal LP

\begin{aligned} \underset{x}{\text{minimize}} \quad & c^\top x \\ \text{subject to} \quad & Ax = b, \\ & Gx \leq h. \end{aligned}

The Lagrangian is

\begin{aligned} L(x, u, v) &= c^\top x + u^\top (Ax - b) + v^\top (Gx - h) \\ &= \left( A^\top u + G^\top v + c \right)^\top x - b^\top u - h^\top v, \end{aligned}

where $v \geq 0$ and there are no restrictions on $u$. The Lagrange dual function is

\begin{aligned} g(u, v) &= \inf_x L(x, u, v) \\ &= - b^\top u - h^\top v + \inf_x \left\{ \left( A^\top u + G^\top v + c \right)^\top x \right\} \\ &= \begin{cases} - b^\top u - h^\top v &\text{if } A^\top u + G^\top v + c = 0, \\ -\infty &\text{otherwise}. \end{cases} \end{aligned}

The dual problem is maximizing the dual function subject to the constraints of the variables, i.e.

\begin{aligned} \underset{u, v}{\text{maximize}} \quad & g(u, v) \\ \text{subject to} \quad & v \geq 0. \end{aligned}

or equivalently

\begin{aligned} \underset{u, v}{\text{maximize}} \quad & - b^\top u - h^\top v \\ \text{subject to} \quad & -A^\top u - G^\top v = c, \\ & v \geq 0. \end{aligned}

Special case 1: Standard form LP

Consider the primal LP in standard form:

\begin{aligned} \underset{x}{\text{minimize}} \quad & c^\top x \\ \text{subject to} \quad & Ax \leq b, \\ & x \geq 0. \end{aligned}

To match notation with the previous section, we need $A \leftarrow 0$, $b \leftarrow 0$, $G \leftarrow \begin{pmatrix} A \\ -I \end{pmatrix}$, $h \leftarrow \begin{pmatrix} b \\ 0 \end{pmatrix}$, where $I$ is the identity matrix. With these substitutions, the dual problem becomes

\begin{aligned} \underset{v}{\text{maximize}} \quad & - h^\top v \\ \text{subject to} \quad & - G^\top v = c, \\ & v \geq 0. \end{aligned}

If we write $v = \begin{pmatrix} y \\ z \end{pmatrix}$ where $y$ is the dual variable corresponding to the inequality $Ax \leq b$, then the above becomes

\begin{aligned} \underset{y, z}{\text{maximize}} \quad & - b^\top y \\ \text{subject to} \quad & - A^\top y + z = c, \\ & y, z \geq 0. \end{aligned}

$z$ can viewed as a slack variable, and so the dual LP is equivalent to

\begin{aligned} \underset{y}{\text{maximize}} \quad & - b^\top y \\ \text{subject to} \quad & A^\top y \geq -c, \\ & y \geq 0. \end{aligned}

Special case 2: Standard form LP (maximization)

Consider the primal LP in standard form but where we want to maximize the objective function instead of minimizing it:

\begin{aligned} \underset{x}{\text{maximize}} \quad & c^\top x \\ \text{subject to} \quad & Ax \leq b, \\ & x \geq 0. \end{aligned}

This is equivalent to minimizing the negative of the objective function subject to the same constraints:

\begin{aligned} \underset{x}{\text{minimize}} \quad & (-c)^\top x \\ \text{subject to} \quad & Ax \leq b, \\ & x \geq 0. \end{aligned}

Thus by replacing $c$ by $-c$ in the previous section, the dual LP in this case is

\begin{aligned} \underset{y}{\text{maximize}} \quad & - b^\top y \\ \text{subject to} \quad & A^\top y \geq -(-c), \\ & y \geq 0, \end{aligned}

or equivalently

\begin{aligned} \underset{y}{\text{minimize}} \quad & b^\top y \\ \text{subject to} \quad & A^\top y \geq c, \\ & y \geq 0. \end{aligned}

(This is the version that you’ll see most often in other websites, and is the one currently on the Wikipedia page.)

# What is Moreau’s Decomposition Theorem?

Let $f: \mathbb{R}^n \mapsto \mathbb{R}$ be a convex function. Moreau’s Decomposition Theorem is the following result:

Theorem (Moreau Decomposition). For all $x \in \mathbb{R}^n$,

$x = \textbf{prox}_{f} (x) + \textbf{prox}_{f^*} (x)$,

where $\textbf{prox}_{f}$ is the proximal operator for $f$ and $f^*$ is the convex conjugate of $f$.

Here is the proof: Let $z = \textbf{prox}_{f} (x)$. Then,

\begin{aligned} z = \textbf{prox}_{f} (x) \quad &\Leftrightarrow \quad x - z \in \partial f(z) \\ &\Leftrightarrow \quad z \in \partial f^*(x-z) \\ &\Leftrightarrow \quad x - (x-z) \in \partial f^*(x-z) \\ &\Leftrightarrow \quad x - z = \textbf{prox}_{f^*} (x) \\ &\Leftrightarrow \quad x = \textbf{prox}_{f} (x) + \textbf{prox}_{f^*} (x). \end{aligned}

The second equivalence is a result involving convex conjugates and subdifferentials (see this post for statement and proof) while the 4th equivalence is a property of the proximal operator (see the note at the end of this post).

Extended Moreau Decomposition

The extended version of Moreau’s Decomposition Theorem involves a scaling factor $\lambda > 0$. It states that for all $x$,

\begin{aligned} x = \textbf{prox}_{\lambda f} (x) + \textbf{prox}_{\lambda^{-1} f^*} (x / \lambda). \end{aligned}

Proof: Applying Moreau decomposition to the function $\lambda f$ gives

$x = \textbf{prox}_{\lambda f} (x) + \textbf{prox}_{(\lambda f)^*} (x).$

Using the definitions of the proximal operator and the convex conjugate,

\begin{aligned} \textbf{prox}_{(\lambda f)^*} (x) &= \underset{u}{\text{argmin}} \left\{ \frac{1}{2} \| x - u \|_2^2 + (\lambda f)^* (u) \right\} \\ &= \underset{u}{\text{argmin}} \left\{ \frac{1}{2} \| x - u \|_2^2 + \sup_v \left[ u^\top v - \lambda f(v) \right] \right\} \\ &= \underset{u}{\text{argmin}} \left\{ \frac{1}{2} \|(x/\lambda) - (u / \lambda) \|_2^2 + \frac{1}{\lambda} \sup_v \left[ (u / \lambda)^\top v - f(v) \right] \right\} \\ &= \underset{u}{\text{argmin}} \left\{ \frac{1}{2} \|(x/\lambda) - (u / \lambda) \|_2^2 + \frac{1}{\lambda} f^*( u / \lambda) \right\} \\ &= \underset{u}{\text{argmin}} \left\{ \frac{1}{2} \| (x / \lambda) - u \|_2^2 + \lambda^{-1}f^* (u) \right\} \\ &= \textbf{prox}_{\lambda^{-1} f^*} (x / \lambda). \end{aligned}

References:

1. Gu, Q. (2016). “SYS 6003: Optimization. Lecture 25.”

# Relations, resolvents and the proximal operator

Relations and resolvents

A relation $F$ on $\mathbb{R}^n$ is a subset of $\mathbb{R}^n \times \mathbb{R}^n$. We use the notation $F(x)$ to mean the set $F(x) = \{ y: (x, y) \in F \}$. You can think of $F$ as an operator that maps vectors $x \in \mathbb{R}^n$ to sets $F(x) \subseteq \mathbb{R}^n$. (Along this line of thinking, functions are a special kind of relation where every vector is mapped to a set consisting of exactly one element.)

The inverse relation is defined as $F^{-1} = \{ (y, x): (x, y) \in F \}$.

For a relation $R$ and some parameter $\lambda \in \mathbb{R}$, the resolvent of $R$ is defined as the relation

$S = (I + \lambda R)^{-1}.$

In other words,

\begin{aligned} S &= \{ (y, x): (x, y) \in I + \lambda R \}, \\ S &= \{ (x + \lambda y, x): (x, y) \in R \}. \end{aligned}

Connection with the proximal operator

Let $f: \mathbb{R}^n \mapsto \mathbb{R}$ be some convex function. Recall that the proximal operator of $f$ is defined by

\begin{aligned} \textbf{prox}_f (x) = \underset{u \in \mathbb{R}^n}{\text{argmin}} \left\{ \frac{1}{2} \| x - u \|_2^2 + f(u) \right\}. \end{aligned}

It turns out that the resolvent of the subdifferential operator is the proximal operator. Here is the proof: Let $z = (I + \lambda \partial f)^{-1}(x)$ for some $\lambda > 0$. By definition of the resolvent,

\begin{aligned} z = (I + \lambda \partial f)^{-1}(x) \quad &\Leftrightarrow \quad x \in z + \lambda \partial f(z) \\ &\Leftrightarrow \quad 0 \in \lambda \partial f(z) + (z - x) \\ &\Leftrightarrow \quad 0 \in \partial \left[ \lambda f(z) + \frac{1}{2} \| z - x \|_2^2 \right] \\ &\Leftrightarrow \quad z = \underset{u}{\text{argmin}} \; \lambda f(u) + \frac{1}{2} \| u - x \|_2^2 \\ &\Leftrightarrow \quad z = \textbf{prox}_{\lambda f} (x). \end{aligned}

(Note: Setting $\lambda = 1$, the chain of reasoning above gives $z = \textbf{prox}_{f} (x) \; \Leftrightarrow x - z \in \partial f(z)$, which is useful to know.)

References:

1. Pilanci, M. (2022). “Monotone Operators.”

# Convex conjugate functions, Fenchel’s inequality, subdifferentials and optimality

Convex conjugate functions and Fenchel’s inequality

Let $f: \mathbb{R}^n \mapsto \mathbb{R}$ be some function. The convex conjugate of $f$ (also know as the Fenchel transformation), is the function $f^*: \mathbb{R}^n \mapsto \mathbb{R}$ defined as

$f^*(y) = \underset{x}{\sup} \left\{ y^\top x - f(x) \right\}.$

Fenchel’s equality is the following statement:

Theorem (Fenchel’s inequality). For any $x, y \in \mathbb{R}^n$,

$f(x) + f^*(y) \geq x^\top y.$

The proof of Fenchel’s equality follows directly from the definition of the convex conjugate:

\begin{aligned} f(x) + f^*(y) &= f(x) + \underset{u}{\sup} \left\{ y^\top u - f(u) \right\} \\ &\geq f(x) + y^\top x - f(x) \\ &= x^\top y. \end{aligned}

A direct application of Fenchel’s inequality shows that the conjugate of the conjugate, denoted $f^{**}$, always satisfies

$f^{**} \leq f.$

To see this: Fenchel’s inequality says $f(x) \geq x^\top y - f^*(y)$ for all $x, y$. Taking a supremum over all $y$, the inequality becomes

$f(x) \geq \underset{y}{\sup} \left\{ x^\top y - f^*(y) \right\} = f^{**}(x).$

It turns out that if $f$ is closed and convex, then we actually have $f^{**} = f$. The proof is a bit cumbersome and so I’m not providing it in this post. (For a proof, see Reference 1 or Slides 5.13-5.14 of Reference 2.)

Subdifferentials and optimality

A vector $g \in \mathbb{R}^n$ is a subgradient of $f$ at point $x \in \mathbb{R}^n$ if

$f(y) \geq f(x) + g^\top (y - x)$ for all $y \in \mathbb{R}^n$.

The subdifferential of $f$ at point $x$, denoted $\partial f(x)$, is the set of all the subgradients of $f$ and point $x$.

Convex conjugates appear in the following theorem concerning subdifferentials:

Theorem. If $f$ is closed and convex, then for any $x, y \in \mathbb{R}^n$,

\begin{aligned} y \in \partial f(x) \; \Leftrightarrow \; x \in \partial f^*(y) \; \Leftrightarrow \; f(x) + f^*(y) = x^\top y. \end{aligned}

Proof (adapted from Slide 5.15 of Reference 2 and Reference 3): First we show that $y \in \partial f(x) \; \Rightarrow \; x \in \partial f^*(y)$. Using the definitions of subgradients and convex conjugates,

\begin{aligned} y \in \partial f(x) &\Rightarrow \; f(u) \geq f(x) + y^\top (u - x) \quad \text{for all } u \\ &\Rightarrow \; y^\top u - f(u) \leq y^\top x - f(x) \quad \text{for all } u \\ &\Rightarrow f^*(y) = y^\top x - f(x). \end{aligned}

Thus, for any $v$,

\begin{aligned} f^*(v) &= \underset{u}{\sup} \left\{ v^\top u - f(u) \right\} \\ &\geq v^\top x - f(x) \\ &= y^\top x - f(x) + x^\top (v - y) \\ &= f^*(y) + x^\top (v - y), \end{aligned}

which implies that $x \in \partial f^*(y)$.

To get the reverse implication $\partial x \in f^*(y) \; \Rightarrow \; y \in \partial f(x)$, apply the same logic above with $f^*$ in place of $f$ and use the fact that $f^{**} = f$ for closed convex $f$.

Next, we show that $y \in \partial f(x) \; \Rightarrow \; f(x) + f^*(y) = x^\top y$. Using the definitions of subgradients,

\begin{aligned} y \in \partial f(x) &\Rightarrow \; f(u) \geq f(x) + y^\top (u - x) \quad \text{for all } u \\ &\Rightarrow \; x^\top y \geq f(x) + (y^\top u - f(u)) \quad \text{for all } u. \end{aligned}

Taking supremum over all $u$, the inequality becomes $x^\top y \geq f(x) + f^*(y)$. Combining this with Fenchel’s inequality gives us equality $x^\top y = f(x) + f^*(y)$.

Conversely, if $x^\top y = f(x) + f^*(y)$, we have

\begin{aligned} x^\top y &\geq f(x) + f^*(y), \\ x^\top y &\geq f(x) + y^\top u - f(u) \quad \text{for all } u, \\ f(u) &\geq f(x) + y^\top (u - x) \quad \text{for all } u, \\ y &\in \partial f(x). \end{aligned}

References:

1. StackExchange. “How to prove the conjugate of the conjugate function is itself?”
2. Vandenberghe, L. (2022). “5. Conjugate functions.”

# What is Minkowski’s Representation Theorem?

Minkowski’s representation theorem is a fundamental result on how polyhedra can be represented.

Preliminary definitions

• In $\mathbb{R}^n$, a polyhedron is a set which can be described as $\mathcal{P} = \{ x \in \mathbb{R}^n: Ax \leq b\}$, where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$.
• $x \in \mathcal{P}$ is an extreme point of a polyhedron $\mathcal{P}$ if there do not exist $x_1, x_2 \in \mathcal{P}$ such that $x = \frac{1}{2}(x_1 + x_2)$.
• $r \in \mathbb{R}^n$ is a ray of a polyhedron $\mathcal{P} = \{ x \in \mathbb{R}^n: Ax \leq b\}$ if $r$ is non-zero and $Ar \leq 0$. Let $\mathcal{P}^{ray}$ denote the set of rays of $\mathcal{P}$.
• $r \in \mathcal{P}^{ray}$ is an extreme ray of a polyhedron $\mathcal{P}$ if there do not exist linearly independent $r_1, r_2 \in \mathcal{P}^{ray}$ such that $r = \frac{1}{2}(r_1 + r_2)$.

It can be shown that for any polyhedron, the number of extreme points and extreme rays is finite (proof in Reference 1).

Minkowski’s representation theorem

Minkowski’s representation theorem essentially says that a polyhedron can be described by its extreme points and extreme rays.

Theorem: If $\mathcal{P} = \{ x \in \mathbb{R}^n: Ax \leq b\}$ is non-empty and $\text{rank}(A) = n$, then

\begin{aligned} \mathcal{P} = \bigg\{ x \in \mathbb{R}^n: &x = \sum_{k \in K} \lambda_k x_k + \sum_{j \in J} \mu_j r_j, \\ &\sum_{k \in K} \lambda_k = 1, \lambda_k \geq 0 \;\forall k, \mu_j \geq 0 \;\forall j \bigg\}, \end{aligned}

where $\{ x_k \}_{k \in K}$ and $\{ r_j\}_{j \in J}$ are the set of extreme points and extreme rays of $\mathcal{P}$ respectively.

A proof of this theorem can be found in Reference 1. As a special case, if the polyhedron is bounded, then it does not have any extreme rays and so any point in the polyhedron can be described as a convex combination of its extreme points.

Minkowski representation & projection

The theorem tells us that a polyhedron can be expressed as the set of convex combination of its extreme points and rays. For notational convenience, let $x_1, \dots, x_K$ denote the extreme points and extreme rays of the polyhedron $\mathcal{P}$, and let $\delta_k = 1$ if $x_k$ is an extreme point, and $\delta_k = 0$ otherwise. The Minkowski representation of the polyhedron $\mathcal{P} \subseteq \mathbb{R}^n$ consists of variables $\lambda_1, \dots, \lambda_K$, one for each extreme point/ray:

\begin{aligned} \mathcal{P}^{Mk} = \left\{ \lambda \in \mathbb{R}^K : \sum_{k = 1}^K \delta_k \lambda_k = 1, \lambda \geq 0 \right\}, \end{aligned}

and the Minkowski projection $\pi^{Mk}: \mathbb{R}^K \mapsto \mathbb{R}^n$ defined by

\begin{aligned} \pi^{Mk}(\lambda) = \sum_{k=1}^K \lambda_k x_k. \end{aligned}

You can think of the Minkowski representation as expressing each point in the polyhedron in terms of “coordinates” associated with the extreme points/rays.

Note that these coordinates are not unique: each point in the polyhedron may be associated with more than one set of coordinates. For example, consider the bounded triangle in $\mathbb{R}^2$ with coordinates $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$, $\begin{pmatrix} 5 \\ 1 \end{pmatrix}$ and $\begin{pmatrix} 1 \\ 5 \end{pmatrix}$. The point $\begin{pmatrix} 2 \\ 2 \end{pmatrix}$ can be written as

\begin{aligned} \begin{pmatrix} 2 \\ 2 \end{pmatrix} = 2 \begin{pmatrix} 1 \\ 1 \end{pmatrix} + 0 \begin{pmatrix} 5 \\ 1 \end{pmatrix} + 0 \begin{pmatrix} 1 \\ 5 \end{pmatrix}, \end{aligned}

or

\begin{aligned} \begin{pmatrix} 2 \\ 2 \end{pmatrix} = 0 \begin{pmatrix} 1 \\ 1 \end{pmatrix} + \frac{1}{3} \begin{pmatrix} 5 \\ 1 \end{pmatrix} + \frac{1}{3} \begin{pmatrix} 1 \\ 5 \end{pmatrix}. \end{aligned}

References:

1. Nemhauser, G., and Wolsey, L. (1988). Integer and Combinatorial Optimization. (Chapter I.4).
2. Tebboth, J. R. (2001). A Computational Study of Dantzig-Wolfe Decomposition. (Section 3.3).

# What is Sharpness-Aware Minimization (SAM)?

Introduction

Consider the general machine learning set-up. We have a class of models parameterized by $w \in \mathcal{W} \subseteq \mathbb{R}^d$ (e.g. for linear regression, $w$ would be the coefficients of the model). Each of these models takes in some input $x \in \mathcal{X}$ and outputs a result $y \in \mathcal{Y} \in \mathcal{Y}$. We want to select the parameter $w$ which minimizes some population loss:

\begin{aligned} L_P(w) = \mathbb{E}_{(x,y) \sim D}[l(w, x, y)], \end{aligned}

where $l$ is the loss incurred for a single data point, and $D$ is the population distribution for our data. Unfortunately we don’t know $D$ and hence can’t evaluate $L_P(w)$ exactly. Instead, we often have a dataset $(x_1, y_1), \dots, (x_n, y_n)$ with which we can define the empirical loss

\begin{aligned} L_S(w) = \frac{1}{n}\sum_{i=1}^n l(w, x_i, y_i). \end{aligned}

A viable approach is to select $w$ by minimizing the empirical loss: the hope here is that the empirical loss is a good approximation to the population loss.

For many modern ML models (especially overparameterized models), the loss functions are non-convex with multiple local and global minima. In this setup, it’s known that the parameter $\hat{w}$ obtained by minimizing the empirical loss does not necessarily translate into small population loss. That is, good performance on the training set does not “generalize” well. In fact, many different $w$ can give similar values of $L_S(w)$ but very different values of $L_P(w)$.

“Flatness” and generalization performance

One thing that is emerging from the literature is a connection between the “flatness” of minima and generalization performance. In particular, models corresponding to a minimum point whose loss function neighborhood is relatively “flat” tends to have better generalization performance. The intuition is as follows: think of the empirical loss $L_S$ as a random function, with the randomness coming from which data points are chosen from the sample. As we draw several of these empirical loss functions, minima associated with flat areas of the function tend to stay in the same area (and hence are “robust”), while minima associated with sharp areas move around a lot.

Here is a stylized example to illustrate the intuition. Imagine that the line in black is the population loss. There are 10 blue lines in the figure, each representing a possible empirical loss function. You can see that there is a lot of variation in the part of the loss associated with the sharper minimum on the left as compared to the flatter minimum on the right.

Imagine for each of the 10 empirical loss functions, we locate the $x$ value of the two minima, but record their value on the true population loss. The points in red correspond to the population loss for the sharper minimum while the points in blue correspond to that for the flatter minimum. We can see that the loss values for the blue points don’t fluctuate as much as that for the red points.

Sharpness-aware minimization (SAM)

They are many ways to define “flatness” or “sharpness”. Sharpness-aware minimization (SAM), introduced by Foret et. al. (2020) (Reference 1), is one way to formalize the notion of sharpness and use it in model training. Instead of finding parameter values which minimize the empirical loss at a point:

\begin{aligned} \hat{w} = \text{argmin}_{w \in \mathcal{W}} L_S(w), \end{aligned}

find parameter values whose entire neighborhoods have uniformly small empirical loss:

\begin{aligned} \hat{w} = \text{argmin}_{w \in \mathcal{W}} L_S^{SAM}(w) = \text{argmin}_{w \in \mathcal{W}} \left\{ \max_{\| \epsilon \|_p \leq \rho} L_S(w + \epsilon) \right\}, \end{aligned}

where $\rho \geq 0$ is a hyperparameter and $\| \cdot \|_p$ is the Lp-norm, with $p$ typically chosen to be 2.

The figure below shows what $L_S^{SAM}$ looks like for different values of $\rho$ in our simple example. As $\rho$ increases, the value of $L_S^{SAM}$ increases a lot for the sharp minimum on the left but not very much for the flat minimum on the right.

In practice we don’t actually minimize $L_S^{SAM}$, but minimize the SAM loss with an L2 regularization term. Also, $L_S^{SAM}$ can’t be computed analytically: the paper has details on how to get around it (via approximations and such). One final note is on what value $\rho$ should take. In the paper, $\rho$ is treated as a hyperparameter which needs to be tuned (e.g. with grid search).

References:

1. Foret, P., et. al. (2020). Sharpness-Aware Minimization for Efficiently Improving Generalization.