# Convex conjugate functions, Fenchel’s inequality, subdifferentials and optimality

Convex conjugate functions and Fenchel’s inequality

Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be some function. The convex conjugate of $f$ (also known as the Fenchel transform), is the function $f^*: \mathbb{R}^n \rightarrow \mathbb{R} \cup \{ +\infty \}$ defined as $f^*(y) = \underset{x}{\sup} \left\{ y^\top x - f(x) \right\}.$ Note that $f^*$ is always convex, since it is a pointwise supremum of functions that are affine in $y$.
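As a quick numerical sketch (the function and grid here are my own choices, not from any reference), we can approximate the supremum in the definition by a maximum over a fine grid. For $f(x) = x^2/2$ the conjugate is known in closed form to be $f^*(y) = y^2/2$, so the grid approximation should land close to that:

```python
import numpy as np

# Approximate the convex conjugate of f(x) = x^2/2 on a 1-D grid.
# For this f, the conjugate is f*(y) = y^2/2, which we can check against.
xs = np.linspace(-10.0, 10.0, 100001)

def f(x):
    return 0.5 * x**2

def f_conj(y):
    # f*(y) = sup_x { y*x - f(x) }, approximated by a max over the grid
    return np.max(y * xs - f(xs))

for y in (-2.0, 0.0, 1.5):
    print(y, f_conj(y), 0.5 * y**2)  # grid value vs. closed form
```

The grid maximum agrees with $y^2/2$ up to the grid spacing, since the maximizer $x = y$ lies well inside the grid.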

Fenchel’s inequality is the following statement:

Theorem (Fenchel’s inequality). For any $x, y \in \mathbb{R}^n$, $f(x) + f^*(y) \geq x^\top y.$

The proof of Fenchel’s inequality follows directly from the definition of the convex conjugate: \begin{aligned} f(x) + f^*(y) &= f(x) + \underset{u}{\sup} \left\{ y^\top u - f(u) \right\} \\ &\geq f(x) + y^\top x - f(x) \\ &= x^\top y. \end{aligned}
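The inequality is easy to sanity-check numerically for a conjugate pair we know in closed form. For $f(x) = e^x$, a standard computation gives $f^*(y) = y \log y - y$ for $y > 0$ (the supremum of $yx - e^x$ is attained at $x = \log y$); the sample points below are arbitrary:

```python
import math
import random

# Check Fenchel's inequality f(x) + f*(y) >= x*y for the pair
# f(x) = e^x,  f*(y) = y*log(y) - y  (valid for y > 0).
def f(x):
    return math.exp(x)

def f_conj(y):
    return y * math.log(y) - y

random.seed(0)
for _ in range(1000):
    x = random.uniform(-5.0, 5.0)
    y = random.uniform(1e-6, 10.0)
    assert f(x) + f_conj(y) >= x * y - 1e-9  # small tolerance for rounding
print("Fenchel's inequality holds on all sampled points")
```

Equality is attained exactly when $x = \log y$, matching the equality condition discussed later in the post.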

A direct application of Fenchel’s inequality shows that the conjugate of the conjugate (the biconjugate), denoted $f^{**}$, always satisfies $f^{**} \leq f.$

To see this: Fenchel’s inequality says $f(x) \geq x^\top y - f^*(y)$ for all $x, y$. Taking a supremum over all $y$, the inequality becomes $f(x) \geq \underset{y}{\sup} \left\{ x^\top y - f^*(y) \right\} = f^{**}(x).$
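The inequality $f^{**} \leq f$ can be strict when $f$ is not convex: $f^{**}$ is the closed convex envelope of $f$. A grid-based sketch (the double-well function here is my own illustrative choice) makes this visible:

```python
import numpy as np

# For a nonconvex f, the biconjugate f** is the convex envelope, so
# f** <= f pointwise, with strict inequality where f is "non-convex".
xs = np.linspace(-3.0, 3.0, 2001)

def f(x):
    # nonconvex double well: min((x-1)^2, (x+1)^2)
    return np.minimum((x - 1.0)**2, (x + 1.0)**2)

ys = np.linspace(-10.0, 10.0, 2001)
# f*(y) = max_x { y*x - f(x) } on the grid
f_star = np.array([np.max(y * xs - f(xs)) for y in ys])
# f**(x) = max_y { x*y - f*(y) } on the grid
f_bistar = np.array([np.max(x * ys - f_star) for x in xs])

assert np.all(f_bistar <= f(xs) + 1e-8)  # f** <= f everywhere
# At x = 0 the envelope is (approximately) 0 while f(0) = 1.
print(f(np.array([0.0]))[0], f_bistar[np.argmin(np.abs(xs))])
```

Between the two wells the envelope flattens out to (approximately) zero, strictly below $f$, while on the convex pieces the two functions agree.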

It turns out that if $f$ is closed and convex, then we actually have $f^{**} = f$. The proof is a bit cumbersome and so I’m not providing it in this post. (For a proof, see Reference 1 or Slides 5.13-5.14 of Reference 2.) More generally, the Fenchel-Moreau theorem gives necessary and sufficient conditions for $f^{**} = f$.

Subdifferentials and optimality

A vector $g \in \mathbb{R}^n$ is a subgradient of $f$ at point $x \in \mathbb{R}^n$ if $f(y) \geq f(x) + g^\top (y - x)$ for all $y \in \mathbb{R}^n$.

The subdifferential of $f$ at point $x$, denoted $\partial f(x)$, is the set of all the subgradients of $f$ at point $x$.
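The classic example is $f(x) = |x|$, whose subdifferential at $0$ is the whole interval $[-1, 1]$. A small (illustrative, not exhaustive) check of the subgradient inequality $f(y) \geq f(0) + g \cdot y$ at randomly sampled points:

```python
import random

# For f(x) = |x|, g is a subgradient at 0 iff |y| >= g*y for all y,
# i.e. iff g lies in [-1, 1]. We spot-check this on random y's.
def is_subgradient_at_zero(g, trials=1000):
    random.seed(0)
    return all(abs(y) >= g * y
               for y in (random.uniform(-5.0, 5.0) for _ in range(trials)))

print(is_subgradient_at_zero(0.5))   # inside [-1, 1]: inequality holds
print(is_subgradient_at_zero(1.2))   # outside: fails for some y > 0
```

Random sampling can only refute, not prove, membership in $\partial f(0)$; here the closed-form answer $[-1, 1]$ is what the check is illustrating.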

Convex conjugates appear in the following theorem concerning subdifferentials:

Theorem. If $f$ is closed and convex, then for any $x, y \in \mathbb{R}^n$, \begin{aligned} y \in \partial f(x) \; \Leftrightarrow \; x \in \partial f^*(y) \; \Leftrightarrow \; f(x) + f^*(y) = x^\top y. \end{aligned}

Proof (adapted from Slide 5.15 of Reference 2 and Reference 3): First we show that $y \in \partial f(x) \; \Rightarrow \; x \in \partial f^*(y)$. Using the definitions of subgradients and convex conjugates, \begin{aligned} y \in \partial f(x) &\Rightarrow \; f(u) \geq f(x) + y^\top (u - x) \quad \text{for all } u \\ &\Rightarrow \; y^\top u - f(u) \leq y^\top x - f(x) \quad \text{for all } u \\ &\Rightarrow f^*(y) = y^\top x - f(x), \end{aligned} where the last implication holds because the supremum defining $f^*(y)$ is attained at $u = x$.

Thus, for any $v$, \begin{aligned} f^*(v) &= \underset{u}{\sup} \left\{ v^\top u - f(u) \right\} \\ &\geq v^\top x - f(x) \\ &= y^\top x - f(x) + x^\top (v - y) \\ &= f^*(y) + x^\top (v - y), \end{aligned}

which implies that $x \in \partial f^*(y)$.

To get the reverse implication $x \in \partial f^*(y) \; \Rightarrow \; y \in \partial f(x)$, apply the same logic as above with $f^*$ in place of $f$ and use the fact that $f^{**} = f$ for closed convex $f$.

Next, we show that $y \in \partial f(x) \; \Rightarrow \; f(x) + f^*(y) = x^\top y$. Using the definition of a subgradient, \begin{aligned} y \in \partial f(x) &\Rightarrow \; f(u) \geq f(x) + y^\top (u - x) \quad \text{for all } u \\ &\Rightarrow \; x^\top y \geq f(x) + (y^\top u - f(u)) \quad \text{for all } u. \end{aligned}

Taking the supremum over all $u$, the inequality becomes $x^\top y \geq f(x) + f^*(y)$. Combining this with Fenchel’s inequality gives the equality $x^\top y = f(x) + f^*(y)$.

Conversely, suppose $x^\top y = f(x) + f^*(y)$. Since $f^*(y) \geq y^\top u - f(u)$ for all $u$ (by the definition of $f^*$), \begin{aligned} x^\top y &= f(x) + f^*(y) \\ &\geq f(x) + y^\top u - f(u) \quad \text{for all } u. \end{aligned} Rearranging gives $f(u) \geq f(x) + y^\top (u - x)$ for all $u$, i.e. $y \in \partial f(x)$.
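The theorem is easy to verify numerically for a smooth example (my own choice of function and points). For $f(x) = x^2/2$ we have $f^*(y) = y^2/2$, and the unique subgradient at $x$ is the gradient $x$ itself, so the Fenchel equality should hold exactly when $y = x$:

```python
# Check the equality condition for f(x) = x^2/2, f*(y) = y^2/2,
# where the (sub)gradient of f at x is x itself.
def f(x):
    return 0.5 * x**2

def f_conj(y):
    return 0.5 * y**2

x = 1.7
y = x  # y is the subgradient of f at x
assert abs(f(x) + f_conj(y) - x * y) < 1e-12  # Fenchel equality holds

y_bad = x + 0.3  # not a subgradient of f at x
assert f(x) + f_conj(y_bad) > x * y_bad  # strict inequality
print("equality holds exactly when y is a subgradient of f at x")
```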

References:

1. StackExchange. “How to prove the conjugate of the conjugate function is itself?”
2. Vandenberghe, L. (2022). “5. Conjugate functions.”

# Wolpert’s “no free lunch” theorem for supervised learning

I recently learnt of the existence of a “no free lunch” (NFL) theorem for supervised learning. The result was derived by David H. Wolpert in a 1996 paper. (Note: Wolpert, together with William G. Macready, proved a similar theorem for optimization as well; see references.) In the paper, Wolpert provides an easy-to-understand high-level explanation of his result:

Some of those theorems show, loosely speaking, that for any two algorithms A and B, there are “as many” targets for which algorithm A has lower expected [off-training set] OTS error than algorithm B as vice versa (whether one averages over training sets or not). In particular, such equivalence holds even if one of the algorithms is random guessing: there are “as many” targets for which any particular learning algorithm gets confused by the data and performs worse than random as for which it performs better.

In the quote above, “target” refers to the true relationship between the predictors $x$ and the response $y$ (which we usually denote by $f$), an “algorithm” refers to the learning procedure that produces a fitted model (which we usually denote by $\hat{f}$), and “off-training set (OTS) error” refers to the error which $\hat{f}$ incurs on test points $(x_{test}, y_{test})$ whose $x_{test}$ does not appear in the training data.

The main takeaway is NOT that searching for better algorithms/models is pointless. Rather, it is that no single model works best for every problem. Hence, for a given problem, we should try to find the best model for that problem, rather than being content with the output of a single algorithm/model.

The NFL theorems have been somewhat contentious; this Quora article gives some arguments for why the implications of the NFL theorems are not as far-reaching as they sound:

1. While no one model/algorithm works best for every problem, it may be that one model/algorithm works best for all real-world problems, or all problems that we care about, practically speaking.
2. There are certain assumptions underlying the NFL theorems which may not hold for some algorithms or loss functions.
3. NFL theorems prove that the arithmetic mean of performance across all problems is the same for all algorithms, but maybe the arithmetic mean is not what we care about. If one algorithm performs better more than 50% of the time against all other algorithms (at the expense of doing much worse the rest of the time), perhaps we should consider that a “free lunch” to be had.

References:

1. Wolpert, D. H. (1996). The Lack of A Priori Distinctions Between Learning Algorithms.
2. Wolpert, D. H. and Macready, W. G. (1997). No free lunch theorems for optimization.