More details on sharpness-aware minimization (SAM)


In this previous post, we discussed the intuition behind sharpness-aware minimization (SAM). In this post, we describe how SAM is implemented.

Consider the general supervised learning set-up, and assume that we have a class of models parameterized by w \in \mathcal{W} \subseteq \mathbb{R}^d. If we have a dataset (x_1, y_1), \dots, (x_n, y_n), a common approach is to select w by minimizing the empirical loss

\begin{aligned} L_S(w) = \frac{1}{n}\sum_{i=1}^n l(w, x_i, y_i), \end{aligned}

where l is the loss incurred for a single data point. Instead of minimizing the empirical loss, sharpness-aware minimization (SAM) (Foret et al. 2020, Reference 1) solves the problem

\begin{aligned} \hat{w} &= \text{argmin}_{w \in \mathcal{W}} \left\{ L_S^{SAM}(w) + \lambda \| w \|_2^2 \right\} \\ &= \text{argmin}_{w \in \mathcal{W}} \left\{ \max_{\| \epsilon \|_p \leq \rho} L_S(w + \epsilon) + \lambda \| w \|_2^2 \right\}, \end{aligned}

where \rho, \lambda \geq 0 are hyperparameters and \| \cdot \|_p is the Lp-norm, with p typically chosen to be 2.

We could try to solve this problem using first-order methods, e.g. gradient descent:

\begin{aligned} w^{(k+1)} \leftarrow w^{(k)} - \eta_k \left[ \nabla_w L_S^{SAM}(w^{(k)}) + 2 \lambda w^{(k)} \right] \end{aligned}

for some step size \eta_k. However, we need some tricks to approximate \nabla_w L_S^{SAM}(w) efficiently.

Simplification 1: Taylor expansion

The main difficulty in computing \nabla_w L_S^{SAM}(w) is that we need to solve the maximization problem before taking a gradient w.r.t. w. Let’s approximate the objective function in the maximization problem with a first-order Taylor expansion. If we define \epsilon^*(w) = \text{argmax}_{\| \epsilon \|_p \leq \rho} L_S(w + \epsilon), then

\begin{aligned} \epsilon^*(w) \approx \underset{\| \epsilon \|_p \leq \rho}{\text{argmax}} \left\{ L_S(w) + \epsilon^\top \nabla_w L_S(w) \right\} = \underset{\| \epsilon \|_p \leq \rho}{\text{argmax}} \; \epsilon^\top \nabla_w L_S(w). \end{aligned}

Defining the last quantity as \hat{\epsilon}(w), we recognize it as the solution to the classical dual norm problem:

\begin{aligned} \hat{\epsilon}(w) = \rho \cdot \text{sign} (\nabla_w L_S(w)) |\nabla_w L_S(w)|^{q-1} / \left( \| \nabla_w L_S(w) \|_q^q \right)^{1/p}, \end{aligned}

where 1/p + 1/q = 1. With this expression,

\begin{aligned} L_S^{SAM}(w) &\approx L_S(w + \hat{\epsilon}(w)), \\  \nabla_w L_S^{SAM}(w) &\approx \nabla_w L_S(w + \hat{\epsilon}(w)) \\  &= \nabla_w L_S(w) |_{w + \hat\epsilon(w)} \cdot \dfrac{d(w + \hat\epsilon(w))}{dw} \\  &= \nabla_w L_S(w) |_{w + \hat\epsilon(w)} + \nabla_w L_S(w) |_{w + \hat\epsilon(w)} \cdot \dfrac{d\hat\epsilon(w)}{dw}. \end{aligned}

Simplification 2: Drop second-order terms

Since \hat{\epsilon}(w) is a function of \nabla_w L_S(w), computing \dfrac{d\hat\epsilon(w)}{dw} implicitly depends on the Hessian of L_S(w). While Reference 1 notes that such a computation can be done, we can just drop this term to accelerate computation. This gives us our final gradient approximation which we use in practice:

\begin{aligned} \nabla_w L_S^{SAM}(w) \approx \nabla_w L_S(w) |_{w + \hat\epsilon(w)}. \end{aligned}

In other words, the gradient of the SAM loss at w is approximately the gradient of the empirical loss at w + \hat\epsilon(w), a point that is distance \rho away from w in a specially chosen direction.


  1. Foret, P., et. al. (2020). Sharpness-Aware Minimization for Efficiently Improving Generalization.

Solving the dual norm problem

Let’s say we are in \mathbb{R}^d. Consider the maximization problem

\begin{aligned} \max_{\| x \|_p \leq \rho} c^\top x, \end{aligned}

where c \in \mathbb{R}^d is given and \| \cdot \|_p is the Lp-norm. (We may assume that c \neq 0.) The problem can be solved by applying Hölder’s inequality directly:

\begin{aligned} c^\top x &\leq \| c \|_q \| x \|_p \leq \| c \|_q \rho, \end{aligned}

where 1/p + 1/q = 1.

When does equality hold? For the first inequality, equality holds if the vectors \text{sign}(c) |c|^q and \text{sign}(x) |x|^p are pointing in the same direction, i.e.

\begin{aligned} x = k \cdot \text{sign}(c) |c|^{q/p} = k \cdot \text{sign}(c) |c|^{q-1}\end{aligned}

for some constant k. For the second inequality, equality holds if

\begin{aligned} \|x\|_p &= \rho, \\  \left\| k \cdot \text{sign}(c) |c|^{q/p} \right\|_p &= \rho, \\  k \left( \| c\|_q^q \right)^{1/p} &= \rho, \\  k &= \rho / \left( \| c\|_q^q \right)^{1/p}. \end{aligned}

Hence, the original maximization problem is maximized at x = \rho \cdot \text{sign}(c) |c|^{q-1} / \left( \| c\|_q^q \right)^{1/p}.

Getting matplotlib’s default colors

When you make plots with Python’s matplotlib package, the package chooses some default colors for you. In the code below, we draw the lines y = x^2 + i for i = 1, \dots, 11.

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0.0, 2.0, 0.01)

for i in range(1, 12):
    y = x**2 + i
    plt.plot(x, y)

Notice that the first and the 11th line have the same color. That’s because matplotlib only has 10 default colors. We can use the code snippet below to get the hex codes for these 10 colors:

# ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']

The code snippet below is an example of how to pick out specific default colors. For example, if we wanted all the lines to be green (the 3rd color from the bottom):

default_color = plt.rcParams['axes.prop_cycle'].by_key()['color']
for i in range(1, 12):
    y = x**2 + i
    plt.plot(x, y, color = default_color[2])

Update (2023-05-27): As commenter jared f points out, there are several other ways to define colors and get the default matplotlib colors. This reference describes all the different ways to do so.


  1. Stack Overflow. “Get default line colour cycle.

Truncated SVD as the best matrix approximation in Frobenius norm

For any matrix \mathbf{M} \in \mathbb{R}^{n \times p}, it has a singular value decomposition \mathbf{M} = \mathbf{UDV}^\top, where \mathbf{U} \in \mathbb{R}^{n \times n} is orthogonal, \mathbf{D} \in \mathbf{R}^{n \times p} is diagonal with d_{11} \geq d_{22} \geq \dots \geq 0, and \mathbf{V} \in \mathbb{R}^{p \times p} is orthogonal.

If the \mathbf{M} has rank r, then exactly r of the entries on the diagonal of \mathbf{D} are strictly positive. For q < r, the truncated SVD of \mathbf{M} of rank q is

\begin{aligned} \mathbf{M}_q = \mathbf{U}_q \mathbf{D}_q \mathbf{V}_q^\top, \end{aligned}

where \mathbf{U}_q \in \mathbb{R}^{n \times q} is the first q columns of \mathbf{U}, and \mathbf{D}_q \in \mathbb{R}^{q \times q} and \mathbf{V}_q \in \mathbb{R}^{p \times q} are analogously defined.

The truncated SVD of \mathbf{M} is optimal in the sense that it is the best rank-q approximation of \mathbf{M} in the Frobenius norm. This is made concrete in the following theorem:

Theorem. For q < r, \mathbf{M}_q = \mathbf{U}_q \mathbf{D}_q \mathbf{V}_q^\top is the solution to the minimization problem

\begin{aligned} \underset{\mathbf{X}: \text{rank}(\mathbf{X}) = q}{\text{minimize}}  \quad \| \mathbf{X} - \mathbf{M} \|_F^2, \end{aligned}

where \| \cdot \|_F represents the Frobenius norm.

We prove this theorem in the remainder of this post.

Since \mathbf{X} is assumed to have rank q, we can decompose it as \mathbf{X} = \mathbf{Q}_1 \mathbf{A}, where \mathbf{Q}_1 \in \mathbb{R}^{m \times q} is orthogonal and \mathbf{A} \in \mathbb{R}^{q \times n}. “Complete” the orthogonal matrix: i.e. \mathbf{Q} \in \mathbb{R}^{m \times m} with \mathbf{Q} = (\mathbf{Q}_1 ; \mathbf{Q}_2). \mathbf{Q}^\top \mathbf{M} is a m \times n matrix: the top q rows are \mathbf{Q}_1^\top \mathbf{M} and the rest of the rows are \mathbf{Q}_2^\top \mathbf{M}. We have

\begin{aligned} \| \mathbf{X} - \mathbf{M} \|_F^2 &= \left\| \mathbf{QQ}^\top \mathbf{M} - \mathbf{Q}_1 \mathbf{A} \right\|_F^2 \\  &= \left\| \mathbf{Q}_1 (\mathbf{Q}_1^\top \mathbf{M}) + \mathbf{Q}_2 (\mathbf{Q}_2^\top \mathbf{M}) - \mathbf{Q}_1 \mathbf{A} \right\|_F^2 \\  &= \left\| \mathbf{Q}_1 (\mathbf{Q}_1^\top \mathbf{M} - \mathbf{A}) \right\|_F^2 + \left\| \mathbf{Q}_2 (\mathbf{Q}_2^\top \mathbf{M}) \right\|_F^2 \\  &\geq \left\| \mathbf{Q}_2 (\mathbf{Q}_2^\top \mathbf{M}) \right\|_F^2, \end{aligned}

with equality at \mathbf{A} = \mathbf{Q}_1^\top \mathbf{M}. Hence, for fixed \mathbf{Q}_1, the objective function in the minimization problem is minimized at \mathbf{A} = \mathbf{Q}_1^\top \mathbf{M}, i.e. \mathbf{X} = \mathbf{Q}_1 \mathbf{Q}_1^\top \mathbf{M}. It remains to find \mathbf{Q}_1. We have

\begin{aligned} &\underset{\mathbf{Q}_1 \in \mathbb{R}^{m \times q}: \mathbf{Q}_1^\top \mathbf{Q}_1 = \mathbf{I}_q}{\text{minimize}} \: \left\| \mathbf{M} - \mathbf{Q}_1 \mathbf{Q}_1^\top \mathbf{M} \right\|_F^2 \\  \Leftrightarrow \quad& \underset{\mathbf{Q}_1 \in \mathbb{R}^{m \times q}: \mathbf{Q}_1^\top \mathbf{Q}_1 = \mathbf{I}_q}{\text{minimize}} \: \text{tr} \left[ (\mathbf{M} - \mathbf{Q}_1 \mathbf{Q}_1^\top \mathbf{M})^\top (\mathbf{M} - \mathbf{Q}_1 \mathbf{Q}_1^\top \mathbf{M}) \right] \\  \Leftrightarrow \quad& \underset{\mathbf{Q}_1 \in \mathbb{R}^{m \times q}: \mathbf{Q}_1^\top \mathbf{Q}_1 = \mathbf{I}_q}{\text{maximize}} \: \text{tr} \left( \mathbf{M}^\top \mathbf{Q}_1 \mathbf{Q}_1^\top \mathbf{M} \right) \\  \Leftrightarrow \quad& \underset{\mathbf{Q}_1 \in \mathbb{R}^{m \times q}: \mathbf{Q}_1^\top \mathbf{Q}_1 = \mathbf{I}_q}{\text{maximize}} \: \text{tr} \left( \mathbf{Q}_1^\top \mathbf{MM}^\top \mathbf{Q}_1 \right) \\  \Leftrightarrow \quad& \underset{\mathbf{Q}_1 \in \mathbb{R}^{m \times q}: \mathbf{Q}_1^\top \mathbf{Q}_1 = \mathbf{I}_q}{\text{maximize}} \: \text{tr} \left( \mathbf{Q}_1^\top \mathbf{UDV}^\top \mathbf{VD}^\top \mathbf{U}^\top \mathbf{Q}_1 \right) \\  \Leftrightarrow \quad& \underset{\mathbf{Q}_* \in \mathbb{R}^{m \times q}: \mathbf{Q}_*^\top \mathbf{Q}_* = \mathbf{I}_q}{\text{maximize}} \: \text{tr} \left( \mathbf{Q}_*^\top \mathbf{D} \mathbf{D}^\top \mathbf{Q}_* \right) \\  \Leftrightarrow \quad& \underset{\mathbf{Q}_* \in \mathbb{R}^{m \times q}: \mathbf{Q}_*^\top \mathbf{Q}_* = \mathbf{I}_q}{\text{maximize}} \: \text{tr} \left( \mathbf{D} \mathbf{D}^\top \mathbf{Q}_* \mathbf{Q}_*^\top \right), \end{aligned}

where \mathbf{Q}_* = \mathbf{U}^\top \mathbf{Q}_1. \mathbf{DD}^\top is an m \times m matrix with the first r diagonal entries being d_{11}^2, \dots, d_{rr}^2 and the rest of the entries being 0. Letting \mathbf{H} = \mathbf{Q}_* \mathbf{Q}_*^\top, the minimization problem is equivalent to

\begin{aligned} \underset{\mathbf{Q}_* \in \mathbb{R}^{m \times q}: \mathbf{Q}_*^\top \mathbf{Q}_* = \mathbf{I}_q}{\text{maximize}} \: \sum_{i=1}^r d_{ii}^2 h_{ii}. \end{aligned}

Note that for any \mathbf{Q}_* \in \mathbb{R}^{m \times q} with \mathbf{Q}_*^\top \mathbf{Q}_* = \mathbf{I}_q, we have \sum_i h_{ii} = \text{tr}(\mathbf{Q}_* \mathbf{Q}_*^\top) = \text{tr} (\mathbf{Q}_*^\top \mathbf{Q}_*) = \text{tr}(\mathbf{I}_q) = q. For each i, we have h_{ii} = \sum_{j=1}^q (\mathbf{Q}_*)_{ij}^2 \geq 0. Furthermore,

\begin{aligned} h_{ii} = \sum_{j=1}^q (\mathbf{Q}_*)_{ij}^2 &= \sum_{j=1}^q (\mathbf{U}^\top \mathbf{Q}_1)_{ij}^2  \\  &\leq \sum_{j=1}^m (\mathbf{U}^\top \mathbf{Q}_1; \mathbf{U}^\top \mathbf{Q}_2)_{ij}^2 \\  &= \sum_{j=1}^m (\mathbf{U}^\top \mathbf{Q})_{ij}^2 \\  &= (\mathbf{U}^\top \mathbf{Q} \mathbf{Q}^\top \mathbf{U})_{ii} \\  &= 1. \end{aligned}

Hence, \sum_{i=1}^r d_{ii}^2 h_{ii} is maximized when h_{ii} = 1 for i = 1, \dots, q, h_{ii} = 0 otherwise. It remains to find \mathbf{Q}_1 such that h_{ii} = 1 for i = 1, \dots, q, h_{ii} = 0 otherwise. If we let \mathbf{Q}_1 = \mathbf{U}_q, then

\begin{aligned} \mathbf{Q}_* &= \mathbf{U}^\top \mathbf{U}_q = \begin{pmatrix} \mathbf{I}_q \\ 0 \end{pmatrix}, \\  \mathbf{Q}_* \mathbf{Q}_*^\top &= \begin{pmatrix} \mathbf{I}_q & 0 \\ 0 & 0 \end{pmatrix} ,\end{aligned}

which satisfies the requirements. Hence, the required rank-q matrix that minimizes the Frobenius norm from \mathbf{M} is

\begin{aligned} \mathbf{X} &= \mathbf{U}_q \mathbf{U}_q^\top \mathbf{M}  \\  &= \mathbf{U}_q (\mathbf{U}_q^\top \mathbf{U}) \mathbf{DV}^\top \\  &= \mathbf{U}_q \begin{pmatrix} \mathbf{I}_q & 0 \end{pmatrix}\mathbf{DV}^\top \\  &= \mathbf{U}_q \begin{pmatrix} \mathbf{D}_q & 0 \end{pmatrix}\mathbf{V}^\top  \\  &= \mathbf{U}_q \mathbf{D}_q \mathbf{V}_q^\top. \end{aligned}

(Credits: I learnt of this proof from one of the lectures by Trevor Hastie in his course “Multivariate Analysis”.)

The Woodbury matrix identity and some special cases

The Woodbury matrix identity is a commonly used identity for inverting a matrix of a particular form. The identity states that

\begin{aligned} (A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}, \end{aligned}

where A, U, C and V are conformable matrices, i.e. have dimensions such that the matrix multiplication in the identity makes sense, and the relevant matrix inverses exist.

The Woodbury matrix identity can be verified directly by multiplying the RHS above by A + UCV and showing that it reduces to the identity. The intuition behind the identity is a bit harder to describe… In the theory of matrices, there is the concept of the Schur complement which turns out to be really important for inverting matrices. The Woodbury matrix identity relates two different Schur complements for a block matrix: see the “Derivation from LDU decomposition” here.

The Woodbury matrix identity is especially useful if A and/or C are easy to invert (e.g. if they are orthogonal: M^{-1} = M^\top, or even better, are equal to the identity matrix). The Woodbury matrix identity is also useful in some special cases; below are some possibilities.

Example 1: A = \lambda I for some \lambda \in \mathbb{R}. The identity becomes

\begin{aligned} (\lambda I + UCV)^{-1} = \dfrac{I}{\lambda} - \dfrac{U}{\lambda}(C^{-1} + \frac{1}{\lambda}VU)^{-1}\dfrac{V}{\lambda}. \end{aligned}

If we let \lambda = 1, we get

\begin{aligned} (I + UCV)^{-1} = I - U(C^{-1} + VU)^{-1}V. \end{aligned}

Further letting C = I gives

\begin{aligned} (I + UV)^{-1} = I - U(I + VU)^{-1}V. \end{aligned}

Example 2: U = V = I. The identity becomes

\begin{aligned} (A + C)^{-1} &= A^{-1} - A^{-1}(C^{-1} + A^{-1})^{-1}A^{-1} \\  &= A^{-1} - A^{-1}(AC^{-1} + I)^{-1} \\  &= A^{-1} - (AC^{-1}A + A)^{-1}. \end{aligned}

This is also known as Hua’s identity. (Setting C = U = I or C = V = I will give this result as well.)

Example 3: C = I. The identity becomes

\begin{aligned} (A + UV)^{-1} = A^{-1} - A^{-1}U(I + VA^{-1}U)^{-1}VA^{-1}.\end{aligned}

If we let U = u \in \mathbb{R}^n and V = v^\top with v \in \mathbb{R}^n (assuming A \in \mathbb{R}^n), then we can simplify further:

\begin{aligned} (A + uv^\top)^{-1} &= A^{-1} - A^{-1}u (1 + v^\top A^{-1}u)^{-1}v^\top A^{-1} \\  &= A^{-1} - \frac{A^{-1}u v^\top A^{-1}}{1 + v^\top A^{-1}u}. \end{aligned}

This is also known as the Sherman-Morrison formula.

If we further assume that A = I, we have

\begin{aligned} (I + uv^\top)^{-1} = I - \frac{u v^\top}{1 + v^\top u}. \end{aligned}

PHDG for solving the saddle point formulation of linear programs (LPs)

Saddle point formulation of linear programs (LPs)

Consider the linear program of the form

\begin{aligned} \text{minimize}_x \quad & c^\top x \\  \text{subject to} \quad& Ax \leq b, \end{aligned}

where c, x \in \mathbb{R}^n, A \in \mathbb{R}^{m \times n}, and b \in \mathbb{R}^m. The Lagrangian for this problem is

L(x,y) = c^\top x + y^\top (Ax - b) = y^\top Ax + c^\top x - b^\top y,

where the domain of L is (x, y) \in \mathbb{R}^n \times \mathbb{R}_+^m (elements of y must be non-negative).

In this previous post, we introduced the saddle point theorem for convex optimization. Here is that same saddle point theorem, but for LPs:

Theorem. If (x^*, y^*) \in \mathbb{R}^n \times \mathbb{R}_+^m is a saddle point for the Lagrangian L, then x^* solves the primal problem. Conversely, if x^* is a finite solution to the primal problem, then there is a y* \in \mathbb{R}_+^m such that (x^*, y^*) is a saddle point for L.

Finding the saddle point of an LP Lagrangian can be written as either of the two (equivalent) formulations:

\begin{aligned} &\text{Primal Formulation:} && \min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}_+^m} \left[ y^\top Ax + c^\top x - b^\top y \right], \\  &\text{Dual Formulation:} && \max_{y \in \mathbb{R}_+^m} \min_{x \in \mathbb{R}^n} \left[ y^\top Ax + c^\top x - b^\top y \right].  \end{aligned}

Primal-Dual Hybrid Gradient (PDHG) method

The saddle point formulation of the LP allows us to use primal-dual methods to solve the problem. One such method is called the Primal-Dual Hybrid Gradient (PDHG) method introduced by Chambolle & Pock (2011) (Reference 1). PDHG solves the more general problem

\begin{aligned} \min_{x \in X} \max_{y \in Y} \left\{ \left\langle Kx, y \right\rangle + G(x) - F^*(y) \right\}, \end{aligned}

where X, Y are finite-dimensional real vector spaces, K : X \mapsto Y is a continuous linear operator with induced norm \| K \| = \max \{ \|Kx \|: x \in X \text{ and } \|x\| \leq 1 \}, G: X \mapsto [0, +\infty] and F^*: Y \mapsto [0, +\infty] are proper, convex, lower-semicontinuous functions.

The PDHG algorithm simply loops over the following 3 steps until convergence:

  1. (Dual step) y^{n+1} = \text{prox}_{\sigma F^*}\left( y^n + \sigma K \bar{x}^n \right).
  2. (Primal step) x^{n+1} = \text{prox}_{\tau G} \left( x^n - \tau K^* y^{n+1}\right).
  3. (Over-relaxation) \bar{x}^{n+1} = x^{n+1} + \theta (x^{n+1} - x^n).

In the above, \tau, \sigma > 0 and \theta \in [0, 1] are hyperparameters to be chosen by the user. (Note: Reference 1 doesn’t have proximal operators but has resolvent operators instead. See this post for why we can rewrite the algorithm with prox operators.) It is also possible to switch the order of the dual and primal steps, which gives rise to the following 2 steps until convergence (see Reference 2):

  1. x^{n+1} = \text{prox}_{\tau G} \left( x^n - \tau K^* y^n \right).
  2. y^{n+1} = \text{prox}_{\sigma F^*}\left( y^n + \sigma K \left[ x^n + \theta (x^{n+1} - x^n)\right] \right).

PDHG for LPs

Recall the LP problem is equivalent to

\begin{aligned} \min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}_+^m} \left[ y^\top Ax + c^\top x - b^\top y \right]. \end{aligned}

To map this problem onto the PDHG problem, we need X \leftarrow \mathbb{R}^n, Y \leftarrow \mathbb{R}_+^m, K(x) = Ax, G(x) = c^\top x, F^*(y) = b^\top y. With this, the PDHG algorithm becomes

  1. x^{n+1} = \text{prox}_{\tau G} \left( x^n - \tau A^\top y^n \right).
  2. y^{n+1} = \text{prox}_{\sigma F^*}\left( y^n + \sigma A \left[ x^n + \theta (x^{n+1} - x^n)\right] \right).

Note that for h: X \mapsto \mathbb{R} such that h(x) = c^\top x, we have \text{prox}_h (x) = \text{proj}_X (x - c). (See this post for the proof.) Hence, we can replace the prox operators above with Euclidean projection operators:

  1. x^{n+1} = \text{proj}_{\mathbb{R}^n} \left( x^n - \tau A^\top y^n - \tau c \right) = x^n - \tau \left( A^\top y^n + c \right).
  2. y^{n+1} = \text{proj}_{\mathbb{R}_+^m}\left( y^n + \sigma A \left[ x^n + \theta (x^{n+1} - x^n) \right] - \sigma b \right).


  1. Chambolle, A., and Pock, T. (2011). A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging.
  2. Pock, T., and Chambolle, A. (2011). Diagonal preconditioning for first order primal-dual algorithms in convex optimization.

Proximal operator for inner product in R^n

Let h: \mathcal{X} \mapsto \mathbb{R} be a lower semi-continuous convex function with \mathcal{X} \subseteq \mathbb{R}^n being a convex set. The proximal operator for h is defined by

\begin{aligned} \text{prox}_h (x) = \text{argmin}_{\beta \in \mathcal{X}} \left\{ \frac{1}{2} \| x - \beta \|_2^2 + h(\beta) \right\}. \end{aligned}

In this post, we derive the value of the proximal operator for the function h: \mathcal{X} \mapsto \mathbb{R} where h(x) = c^\top x for some c \in \mathbb{R}^n.

\begin{aligned} \text{prox}_h (x) &= \text{argmin}_{\beta \in \mathcal{X}} \left\{ \frac{1}{2} \| x - \beta \|_2^2 + c^\top \beta \right\} \\  &= \text{argmin}_{\beta \in \mathcal{X}} \left\{ \frac{1}{2} \left( \beta^\top \beta - 2x^\top \beta \right)  + c^\top \beta \right\} \\  &= \text{argmin}_{\beta \in \mathcal{X}} \left\{ \frac{1}{2} \left(\beta^\top \beta - 2(x - c)^\top \beta \right) \right\} \\  &= \text{argmin}_{\beta \in \mathcal{X}} \left\{ \frac{1}{2} \left\| \beta - (x - c) \right\|_2^2 \right\} \\  &= \text{proj}_{\mathcal{X}} (x - c), \end{aligned}

where \text{proj}_{\mathcal{X}}(x) is the Euclidean projection of x onto \mathcal{X}. That is, the proximal operator is a translation followed by a Euclidean projection.

Strong duality for linear programs (LPs)

I keep forgetting the possible scenarios for strong duality of linear programs (LPs), so I’m putting it here as a reference.

Theorem (Strong duality of LPs). Consider a primal LP and its dual. There are 4 possibilities:
1. Both primal and dual have no feasible solutions.
2. The primal is infeasible and the dual is unbounded.
3. The dual is infeasible and the primal is unbounded.
4. Both primal and dual have feasible solutions and their values are equal.

Examples of each possibility can be found in Table 4.2 of Reference 1. A proof of this theorem can be found in multiple places (e.g. Reference 2).


  1. Duality in Linear Programming.
  2. Williamson, D. P., and Kircher, K. (2014). Lecture 8: Strong duality.

Saddle point theorem for convex optimization

Consider the convex optimization problem

\begin{aligned} \underset{x}{\text{minimize}} \quad & f_0(x) \\  \text{subject to} \quad & f_i(x) \leq 0, \quad i = 1, \dots, s, \\  &f_i(x) = 0, \quad i = s+1, \dots, m, \end{aligned}

where f_0, f_1, \dots, f_s are convex functions from \mathbb{R}^n to \mathbb{R} and f_{s+1}, \dots, f_m are affine functions from \mathbb{R}^n to \mathbb{R}. The Lagrangian for this problem can be written as

\begin{aligned} L(x, y) = f_0(x) + \sum_{i=1}^m y_m f_m(x), \end{aligned}

where the domain of y is \mathcal{K} = \mathbb{R}_+^s \times \mathbb{R}^{m-s}. (Dual variables associated with inequalities must be non-negative, while there are no restrictions on dual variables associated with equalities.)

A point (x^*, y^*) \in \mathbb{R}^n \times \mathcal{K} is called a saddle point if for all x \in \mathbb{R}^n and y \in \mathcal{K} we have

\begin{aligned} L(x^*, y) \leq L(x^*, y^*) \leq L(x, y^*). \end{aligned}

Another way of saying this is if we fix x = x^*, then the Lagrangian is maximized at y^* (red line) and if we fix y = y^*, then the Lagrangian is minimized at x^* (blue line).

Illustration of saddle point. (Image from, with some of my annotations on top.)

Saddle point theorem

The saddle point theorem links optimality in the primal problem with the existence of a saddle point for the Lagrangian:

Theorem. If (x^*, y^*) \in \mathbb{R}^n \times \mathcal{K} is a saddle point for the Lagrangian L, then x^* solves the primal problem. Conversely, if x^* is a solution to the primal problem at which Slater’s condition holds, then there is a y* \in \mathcal{K} such that (x^*, y^*) is a saddle point for L.

A proof of this theorem can be found in both References 1 and 2. The first statement of the theorem provides motivation for an optimization algorithm: try to find a saddle point for the Lagrangian, because once we do, we have also found an optimal point for the primal problem.

Minimax and maximin

For any Lagrangian L we have

\begin{aligned} \inf_{x} L(x, y) &\leq L(x, y) \leq \sup_y L(x, y) &\text{for all } (x, y), \\  \sup_y \inf_x L(x, y) &\leq \sup_y L(x, y) &\text{(by taking sup over all } y), \\  \sup_y \inf_x L(x, y) &\leq \inf_x \sup_y L(x, y) &\text{(by taking inf over all } x), \end{aligned}

that is, the minimax is always greater than the maximin. However, if L has a saddle point (x^*, y^*), then

\begin{aligned} \inf_x \sup_y L(x, y) \leq \sup_y L(x^*, y) \leq L(x^*, y^*) \leq \inf_x L(x, y^*) \leq \sup_y \inf_x L(x, y), \end{aligned}

i.e. the minimax is less than or equal to the maximin. Thus, if L has a saddle point, the minimax and maximin are equal:

\begin{aligned}\sup_y \inf_x L(x, y) = \inf_x \sup_y L(x, y). \end{aligned}

In particular, note that the LHS of the above is equal to the maximum value of the dual problem, while the RHS is equal to the minimum value of the primal problem (see e.g. this). Thus, if L has a saddle point, then we have strong duality.

(This is closely related to von Neumann’s minimax theorem, which gives a sufficient condition for the minimax to be equal to the maximin.)


  1. Burke, J. V. Convex Optimization, Saddle Point Theory, and Lagrangian Duality.
  2. Tangkijwanichakul, T. (2020). Saddle Point Theorem.

Deriving the dual of a linear program (LP)

Did you know that the dual of a linear program (LP) is also a linear program? I derive the dual LP in this post.

Consider the primal LP

\begin{aligned} \underset{x}{\text{minimize}} \quad & c^\top x \\  \text{subject to} \quad & Ax = b, \\  & Gx \leq h. \end{aligned}

The Lagrangian is

\begin{aligned} L(x, u, v) &= c^\top x + u^\top (Ax - b) + v^\top (Gx - h) \\  &= \left( A^\top u + G^\top v + c \right)^\top x - b^\top u - h^\top v, \end{aligned}

where v \geq 0 and there are no restrictions on u. The Lagrange dual function is

\begin{aligned} g(u, v) &= \inf_x L(x, u, v) \\  &= - b^\top u - h^\top v + \inf_x \left\{ \left( A^\top u + G^\top v + c \right)^\top x \right\} \\  &= \begin{cases} - b^\top u - h^\top v &\text{if } A^\top u + G^\top v + c = 0, \\  -\infty &\text{otherwise}. \end{cases} \end{aligned}

The dual problem is maximizing the dual function subject to the constraints of the variables, i.e.

\begin{aligned} \underset{u, v}{\text{maximize}} \quad & g(u, v) \\  \text{subject to} \quad & v \geq 0. \end{aligned}

or equivalently

\begin{aligned} \underset{u, v}{\text{maximize}} \quad & - b^\top u - h^\top v \\  \text{subject to} \quad & -A^\top u - G^\top v = c, \\  & v \geq 0. \end{aligned}

Special case 1: Standard form LP

Consider the primal LP in standard form:

\begin{aligned} \underset{x}{\text{minimize}} \quad & c^\top x \\  \text{subject to} \quad & Ax \leq b, \\  & x \geq 0. \end{aligned}

To match notation with the previous section, we need A \leftarrow 0, b \leftarrow 0, G \leftarrow \begin{pmatrix} A \\ -I \end{pmatrix}, h \leftarrow \begin{pmatrix} b \\ 0 \end{pmatrix}, where I is the identity matrix. With these substitutions, the dual problem becomes

\begin{aligned} \underset{v}{\text{maximize}} \quad & - h^\top v \\  \text{subject to} \quad & - G^\top v = c, \\  & v \geq 0. \end{aligned}

If we write v = \begin{pmatrix} y \\ z \end{pmatrix} where y is the dual variable corresponding to the inequality Ax \leq b, then the above becomes

\begin{aligned} \underset{y, z}{\text{maximize}} \quad & - b^\top y \\  \text{subject to} \quad & - A^\top y + z = c, \\  & y, z \geq 0. \end{aligned}

z can viewed as a slack variable, and so the dual LP is equivalent to

\begin{aligned} \underset{y}{\text{maximize}} \quad & - b^\top y \\  \text{subject to} \quad & A^\top y \geq -c, \\  & y \geq 0. \end{aligned}

Special case 2: Standard form LP (maximization)

Consider the primal LP in standard form but where we want to maximize the objective function instead of minimizing it:

\begin{aligned} \underset{x}{\text{maximize}} \quad & c^\top x \\  \text{subject to} \quad & Ax \leq b, \\  & x \geq 0. \end{aligned}

This is equivalent to minimizing the negative of the objective function subject to the same constraints:

\begin{aligned} \underset{x}{\text{minimize}} \quad & (-c)^\top x \\  \text{subject to} \quad & Ax \leq b, \\  & x \geq 0. \end{aligned}

Thus by replacing c by -c in the previous section, the dual LP in this case is

\begin{aligned} \underset{y}{\text{maximize}} \quad & - b^\top y \\  \text{subject to} \quad & A^\top y \geq -(-c), \\  & y \geq 0, \end{aligned}

or equivalently

\begin{aligned} \underset{y}{\text{minimize}} \quad & b^\top y \\  \text{subject to} \quad & A^\top y \geq c, \\  & y \geq 0. \end{aligned}

(This is the version that you’ll see most often in other websites, and is the one currently on the Wikipedia page.)