Generating correlation matrix for AR(1) model

Assume that we are in the time series data setting, where we have data at equally-spaced times 1, 2, \dots which we denote by random variables X_1, X_2, \dots. The AR(1) model, commonly used in econometrics, assumes that the correlation between X_i and X_j is \text{Cor}(X_i, X_j) = \rho^{|i-j|}, where \rho is some parameter that usually has to be estimated.

If we were writing out the full correlation matrix for n consecutive data points X_1, \dots, X_n, it would look something like this:

\begin{pmatrix} 1 & \rho & \rho^2 & \dots & \rho^{n-1} \\ \rho & 1 & \rho & \dots & \rho^{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} &\dots & 1 \end{pmatrix}

(Side note: This is an example of a correlation matrix which has Toeplitz structure.)

Given \rho, how can we generate this matrix quickly in R? The function below is my (current) best attempt:

ar1_cor <- function(n, rho) {
  # exponent[i, j] = |i - j|: the row-wise matrix holds j - 1 in every row,
  # and subtracting 1:n - 1 recycles column-wise, subtracting i - 1 from row i
  exponent <- abs(matrix(1:n - 1, nrow = n, ncol = n, byrow = TRUE) -
                    (1:n - 1))
  rho^exponent
}

In the function above, n is the number of rows in the desired correlation matrix (which is the same as the number of columns), and rho is the \rho parameter. The function makes use of the fact that when subtracting a vector from a matrix, R automatically recycles the vector to have the same number of elements as the matrix, and it does so in a column-wise fashion.

Here is an example of how the function can be used:

ar1_cor(4, 0.9)
#       [,1] [,2] [,3]  [,4]
# [1,] 1.000 0.90 0.81 0.729
# [2,] 0.900 1.00 0.90 0.810
# [3,] 0.810 0.90 1.00 0.900
# [4,] 0.729 0.81 0.90 1.000

Such a function might be useful when generating data that has this correlation structure. For example, it could be passed as the Sigma parameter of MASS::mvrnorm(), which generates samples from a multivariate normal distribution.
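
For instance, here is a quick sketch of that use (assuming the MASS package is installed): we draw 100 samples from a multivariate normal distribution whose covariance matrix (equal to the correlation matrix, since all variances are 1) is AR(1) with \rho = 0.9.

library(MASS)
set.seed(42)
# 100 draws from N(0, Sigma), with Sigma having AR(1) correlation structure
samples <- mvrnorm(n = 100, mu = rep(0, 4), Sigma = ar1_cor(4, 0.9))
cor(samples)  # sample correlations should be roughly ar1_cor(4, 0.9)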

Can you think of other ways to generate this matrix?
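
Here is one alternative: the matrix is symmetric Toeplitz with first row (1, \rho, \dots, \rho^{n-1}), so base R's toeplitz() builds it directly; outer() gives another one-liner. Both should produce the same matrix as ar1_cor(4, 0.9):

toeplitz(0.9^(0:3))            # symmetric Toeplitz matrix from its first row
0.9^abs(outer(1:4, 1:4, "-"))  # |i - j| exponents computed via outer()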

Guide for using Microsoft Equation

Microsoft Equation Editor is the easiest way to include math symbols in Microsoft Office documents. I’d always found the interface clunky, but I recently discovered that you can insert many symbols using LaTeX-like code (instead of fumbling around in the dropdown menus)! Linking to the immensely helpful guide that I found online.

What is balanced accuracy?

Balanced accuracy is a metric that one can use when evaluating how good a binary classifier is. It is especially useful when the classes are imbalanced, i.e. one of the two classes appears a lot more often than the other. This happens often in settings such as anomaly detection and disease diagnosis.

As with all discussions on the performance of a binary classifier, we start with a confusion matrix:
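
                     Actual positive   Actual negative
Predicted positive          TP                FP
Predicted negative          FN                TN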

In the above, the “positive” or “negative” in TP/FP/TN/FN refers to the prediction made, not the actual class. (Hence, a “false positive” is a case where we wrongly predicted positive.)

Balanced accuracy is based on two more commonly used metrics: sensitivity (also known as the true positive rate or recall) and specificity (also known as the true negative rate). Sensitivity answers the question: “How many of the positive cases did I detect?” Or, to put it in a manufacturing setting: “How many (truly) defective products did I manage to recall?” Specificity answers the same question for the negative cases. Here are the formulas for sensitivity and specificity in terms of the confusion matrix:
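
\begin{aligned} \text{Sensitivity} = \dfrac{TP}{TP + FN}, \qquad \text{Specificity} = \dfrac{TN}{TN + FP}. \end{aligned}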

Balanced accuracy is simply the arithmetic mean of the two:

\begin{aligned} \textbf{Balanced accuracy} = \dfrac{\text{Sensitivity} + \text{Specificity}}{2}. \end{aligned}

Let’s use an example to illustrate how balanced accuracy can be a better judge of performance in the imbalanced class setting. Assume that we have a binary classifier and it gave us the results in the confusion matrix below:
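
                     Actual positive   Actual negative
Predicted positive           5                 50
Predicted negative          10              10000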

The accuracy of this classifier, i.e. the proportion of correct predictions, is \dfrac{5 + 10000}{5 + 50 + 10 + 10000} \approx 99.4\%. That sounds really impressive until you realize that simply by predicting all negative, we would have obtained an accuracy of \dfrac{0 + 10050}{0 + 0 + 15 + 10050} \approx 99.9\%, which is better than our classifier!

Balanced accuracy attempts to account for the imbalance in classes. Here is the computation for balanced accuracy for our classifier:

\begin{aligned} \text{Sensitivity} &= \dfrac{5}{5 + 10} \approx 33.3 \%, \\  \text{Specificity} &= \dfrac{10000}{50 + 10000} \approx 99.5 \%, \\  \text{Balanced accuracy} &= \dfrac{\text{Sensitivity} + \text{Specificity}}{2} \\  &\approx \dfrac{33.3\% + 99.5\%}{2} \\  &= 66.4\%. \end{aligned}

Our classifier is doing a great job at picking out the negatives but not so for the positives. Balanced accuracy still seems a little high if identifying the positives is what we care about, but it’s much lower than what accuracy suggested.

For comparison, let’s do the computation for the classifier that always predicts 0 (negative):

\begin{aligned} \text{Sensitivity} &= \dfrac{0}{0 + 15} = 0 \%, \\  \text{Specificity} &= \dfrac{10050}{0 + 10050} = 100 \%, \\  \text{Balanced accuracy} &= \dfrac{\text{Sensitivity} + \text{Specificity}}{2} \\  &= \dfrac{0\% + 100\%}{2} \\  &= 50\%. \end{aligned}

Based on balanced accuracy, we would say that our classifier is doing a little better than the naive “all negatives” classifier, but not much better. This seems like a reasonable conclusion since our classifier is able to pick out some positives but not many of them.

Here is some R code that you can use to compute these measures:

# confusion matrix entries for the "all negatives" classifier
TP <- 0
TN <- 10050
FP <- 0
FN <- 15

# metrics
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
balanced_accuracy <- (sensitivity + specificity) / 2

# print out metrics
options(digits = 3)  # show 3 significant digits
cat("Accuracy:", accuracy, "\n",
    "Sensitivity:", sensitivity, "\n",
    "Specificity:", specificity, "\n",
    "Balanced accuracy:", balanced_accuracy)

Note: This reference points out that balanced accuracy can be extended easily to the multi-class setting: there it is simply the arithmetic mean of the recall for all the classes.

Note: Another popular metric one can use for imbalanced datasets is the F1 score, which is the harmonic mean of precision and recall.
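
Continuing the R snippet above, a quick sketch of the F1 computation (note that precision is undefined when the classifier makes no positive predictions, as in the all-negatives example, where TP + FP = 0):

precision <- TP / (TP + FP)  # NaN for the all-negatives classifier
recall <- sensitivity        # recall is just another name for sensitivity
f1 <- 2 * precision * recall / (precision + recall)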

Convex hull of a closed set is not necessarily closed

A set C \subseteq \mathbb{R}^n is convex if for any two points x, y \in C and any t \in [0,1], the point tx + (1-t)y also lies in C. Given a set of points C, the convex hull of C, which we denote by \text{conv}(C), is defined to be the (unique) minimal convex set containing C. (Alternatively, we may define \text{conv}(C) to be the intersection of all convex sets containing C.)

We usually hope that \text{conv}(C) retains the properties of C. For example, if C is open (compact resp.), then \text{conv}(C) remains open (compact resp.). (The proofs can be found here and here.)

However, if C is closed, its convex hull \text{conv}(C) need not be closed. Here is a counter-example: In \mathbb{R}^2, the set

\begin{aligned} C = \left\{ (x, y): y \geq \dfrac{1}{1+x^2} \right\} \end{aligned}

is closed, but \text{conv}(C) is the upper half-plane without the x-axis. To see this, note that every point of C has y > 0, so any convex combination of points in C also has y > 0; conversely, C contains the vertical rays \{ (\pm M, y): y \geq 1/(1+M^2) \}, so \text{conv}(C) contains the strip [-M, M] \times [1/(1+M^2), \infty) for every M, and these strips fill up the open upper half-plane as M \to \infty. Below is a visual representation of the set C. The black line is y = 1/(1+x^2) and C is the shaded area including the black line.

Side note 1: The function y = 1/(1+x^2) actually has a name: the Witch of Agnesi! See this Wikipedia article for the history behind this function. It is also proportional to the probability density function of the standard Cauchy distribution, which is 1/[\pi(1+x^2)].

Side note 2: The statement “convex hull of a compact set is also compact” can be false if we are defining our sets over a space other than \mathbb{R}^n! See here for counter-example.

What is a diagonally dominant matrix?

A diagonally dominant matrix is a square matrix A \in \mathbb{R}^{n \times n} such that for each i = 1, \dots, n,

\begin{aligned} |A_{ii}| \geq \sum_{j \neq i} |A_{ij}|. \end{aligned}

It is said to be strictly diagonally dominant if the inequality above is strict for all values of i.

In words, a diagonally dominant matrix is a square matrix such that in each row, the absolute value of the term on the diagonal is greater than or equal to the sum of absolute values of the rest of the terms in that row.
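
To make the definition concrete, here is a small R helper (my own sketch, not from the references) that checks diagonal dominance row by row:

# TRUE if each diagonal entry dominates the sum of absolute values
# of the off-diagonal entries in its row
is_diag_dominant <- function(A, strict = FALSE) {
  d <- abs(diag(A))
  off <- rowSums(abs(A)) - d  # off-diagonal absolute row sums
  if (strict) all(d > off) else all(d >= off)
}

is_diag_dominant(matrix(c(3, 1, -1, 2), nrow = 2))  # TRUE: |3| >= |-1| and |2| >= |1|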

Properties

  1. A strictly diagonally dominant matrix is non-singular, i.e. it has an inverse. This is known as the Lévy-Desplanques theorem; a proof of the theorem can be found here.
  2. A symmetric diagonally dominant real matrix with non-negative diagonal entries is positive semidefinite (PSD). A proof can be found in the Wikipedia article (Reference 2). Similarly, a symmetric strictly diagonally dominant real matrix with positive diagonal entries is positive definite.

(Note: Diagonally dominant matrices can be defined for matrices with complex entries as well. The references and links in this post apply to matrices with complex entries; here I just focus on matrices with real entries.)

References:

  1. Wolfram MathWorld. Diagonally Dominant Matrix.
  2. Wikipedia. Diagonally dominant matrix.

Consensus ADMM

In this previous post, I introduced the alternating direction method of multipliers (ADMM), an algorithm which solves a particular class of convex optimization problems. ADMM has gained popularity in recent years because it allows certain problems to be solved in a distributed manner. Consensus optimization, the subject of this post, is one example of this.

(Note: The material for this post is taken from Chapter 7 of Reference 1.)

Recap

Recall that ADMM solves problems that can be written in the form

\begin{aligned} \text{minimize} \quad &f(x) + g(z) \\  \text{subject to}\quad & Ax + Bz = c, \end{aligned}

with variables x \in \mathbb{R}^n, z \in \mathbb{R}^m, and A \in \mathbb{R}^{p \times n}, B \in \mathbb{R}^{p \times m}, and c \in \mathbb{R}^p. f and g are assumed to be convex functions. For brevity, I will call this problem the “ADMM problem”, or I may say that the problem is in “ADMM form”.

The ADMM algorithm consists of the iterations

\begin{aligned} x^{k+1} &= \text{argmin}_x \; L_\rho \left(x, z^k, y^k \right), \\  z^{k+1} &= \text{argmin}_z \; L_\rho \left(x^{k+1}, z, y^k \right), \\  y^{k+1} &= y^k + \rho (Ax^{k+1} + Bz^{k+1} - c). \end{aligned}

In the above, superscripts refer to the iteration number, and L_\rho is the augmented Lagrangian

\begin{aligned} L_\rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + \dfrac{\rho}{2} \| Ax + Bz - c \|_2^2, \end{aligned}

where \rho > 0 is a penalty parameter and y is the dual variable.

Global Variable Consensus Optimization

Consider the optimization problem

\begin{aligned} \text{minimize} \quad \displaystyle\sum_{i=1}^N f_i(x), \end{aligned}

where x \in \mathbb{R}^n and f_i : \mathbb{R}^n \to \mathbb{R} \cup \{ +\infty \} are convex. (Note that each term can encode constraints by setting f_i(x) = +\infty when a constraint is violated.) Imagine that we had N worker machines available to us (along with one central machine). Could we break the optimization up into N parallel parts, with the optimization of f_i being done by machine i (roughly speaking)?

We can do so if we introduce auxiliary variables. For f_i, replace the argument x with a “local copy” x_i \in \mathbb{R}^n, then constrain each of the x_i to be equal to some other auxiliary variable z \in \mathbb{R}^n. In other words, solve the equivalent problem

\begin{aligned} \text{minimize} \quad &\displaystyle\sum_{i=1}^N f_i(x_i) \\  \text{subject to} \quad & x_i - z = 0, \qquad i = 1, \dots, N. \end{aligned}

At first it might seem like we have taken a step back, since we now have to solve for N+1 variables instead of just one variable. However, notice that the problem is in ADMM form, and so we can write the ADMM updates:

\begin{aligned} x_i^{k+1} &= \text{argmin}_{x_i} \; \left[ f_i(x_i) + (y_i^k)^T (x_i - z^k) + \dfrac{\rho}{2}\|x_i - z^k \|_2^2 \right], \\  z^{k+1} &= \dfrac{1}{N}\sum_{i=1}^N \left( x_i^{k+1} + \dfrac{1}{\rho} y_i^k \right), \\  y_i^{k+1} &= y_i^k + \rho (x_i^{k+1} - z^{k+1}). \end{aligned}

The first and third steps can be done in parallel across N machines, with machine i solving the minimization problem associated with f_i. After step 1, all the new values x_1^{k+1}, \dots, x_N^{k+1} are fed to a central processing element (sometimes called the central collector or fusion center) to compute the z update. This new value of z is then broadcast back to the N machines to compute the y updates in parallel.

Global Variable Consensus Optimization: An example

One example of global variable consensus optimization is linear regression over a huge dataset. Let’s say that the design matrix is X \in \mathbb{R}^{Nk \times p} and the response vector is y \in \mathbb{R}^{Nk}. Linear regression solves the optimization problem

\text{minimize}_\beta \quad \| y - X\beta \|_2^2.

If X is huge, it may be broken up into parts and stored on different machines. For example, assume that X and y are broken up row-wise into N parts, with part i, denoted by X_i \in \mathbb{R}^{k \times p} and y_i \in \mathbb{R}^k being on machine i. We can then write the optimization problem as

\text{minimize}_\beta \quad \sum_{i=1}^N \| y_i - X_i\beta \|_2^2,

which is the global variable consensus problem. This allows us to use consensus ADMM to solve for \beta in a distributed manner.
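
As an illustration, here is a minimal R sketch of the consensus ADMM updates for this least squares problem (the function name and data layout are my own; X and y are assumed to be pre-split into lists of N row blocks). In a real distributed setting, the x- and y-updates inside the loop would run in parallel on the N machines:

consensus_admm_ls <- function(X_list, y_list, rho = 1, n_iter = 100) {
  N <- length(X_list)
  p <- ncol(X_list[[1]])
  x    <- replicate(N, rep(0, p), simplify = FALSE)  # local copies x_i
  dual <- replicate(N, rep(0, p), simplify = FALSE)  # dual variables y_i
  z <- rep(0, p)
  for (k in 1:n_iter) {
    for (i in 1:N) {
      # x-update (one subproblem per machine): for least squares the minimizer
      # solves (2 X_i^T X_i + rho I) x_i = 2 X_i^T y_i - y_i^k + rho z^k
      A <- 2 * crossprod(X_list[[i]]) + rho * diag(p)
      b <- 2 * crossprod(X_list[[i]], y_list[[i]]) - dual[[i]] + rho * z
      x[[i]] <- drop(solve(A, b))
    }
    # z-update (central collector): average of x_i^{k+1} + (1/rho) y_i^k
    z <- Reduce(`+`, Map(function(xi, di) xi + di / rho, x, dual)) / N
    # y-update (back on the machines)
    dual <- Map(function(di, xi) di + rho * (xi - z), dual, x)
  }
  z
}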

General Form Consensus Optimization

The global variable consensus problem can be generalized. In the global variable consensus problem, machine i deals with f_i(x_i), where x_i is a local copy of the global variable z. In general form consensus optimization, machine i deals with f_i(x_i), where x_i is (instead) a subvector of the global variable z.

This generalization is easy conceptually but cumbersome in terms of notation. Let z \in \mathbb{R}^n be the global variable. For each i = 1, \dots, N, let x_i \in \mathbb{R}^{n_i} be a “local” variable that maps to a subvector of z. Let \mathcal{G} be a mapping such that \mathcal{G}(i,j) = g if and only if local variable (x_i)_j corresponds to the global variable z_g.

Perhaps an example will help make this notation more understandable. In the diagram below, n = 4, N = 3, and n_1 = 4, n_2 = 2, n_3 = 3. (x_2)_1 corresponds to z_1 and (x_2)_2 corresponds to z_3, so \mathcal{G}(2, 1) = 1 and \mathcal{G}(2,2) = 3.

The general form consensus optimization problem is of the form

\begin{aligned} \text{minimize} \quad \displaystyle\sum_{i=1}^N f_i(x_i). \end{aligned}

Let \tilde{z}_i \in \mathbb{R}^{n_i} be defined by (\tilde{z}_i)_j = z_{\mathcal{G}(i, j)}. That is, \tilde{z}_i is the global variable’s idea of what local variable x_i should be. We can rewrite the optimization problem above as

\begin{aligned} \text{minimize} \quad &\displaystyle\sum_{i=1}^N f_i(x_i) \\  \text{subject to} \quad & x_i - \tilde{z}_i = 0, \qquad i = 1, \dots, N. \end{aligned}

Again, this is in ADMM form and so we can write the ADMM updates easily:

\begin{aligned} x_i^{k+1} &= \text{argmin}_{x_i} \; \left[ f_i(x_i) + (y_i^k)^T x_i + \dfrac{\rho}{2}\|x_i - \tilde{z}_i^k \|_2^2 \right], \\  z^{k+1} &= \text{argmin}_z \; \left[ \sum_{i=1}^N \left( - (y_i^k)^T \tilde{z}_i + \dfrac{\rho}{2} \| x_i^{k+1} - \tilde{z}_i \|_2^2 \right) \right], \\  y_i^{k+1} &= y_i^k + \rho (x_i^{k+1} - \tilde{z}_i^{k+1}). \end{aligned}

As in global variable consensus, steps 1 and 3 decouple into N subproblems, one for each x_i and y_i. Unlike global variable consensus, step 2 also decouples! We can compute the update for each z_g, g = 1, \dots, n separately:

\begin{aligned} z_g^{k+1} = \dfrac{\sum_{\mathcal{G}(i,j) = g} \left[ (x_i^{k+1})_j + (1/\rho) (y_i^k)_j \right]}{\sum_{\mathcal{G}(i,j) = g} 1} \quad \text{for each } g = 1, \dots, n. \end{aligned}

Hence, less broadcasting is required between steps 1 and 2 and steps 2 and 3. For example, in the diagram above, we only have to broadcast along the arrows from left to right between steps 1 and 2, and along the arrows from right to left between steps 2 and 3. (For global variable consensus, we have to broadcast along the complete bipartite graph between the left and right.)

General Form Consensus Optimization: An example

One example of global variable consensus optimization is linear regression over a huge dataset, where the dataset is split up row-wise across many machines. An example of general form consensus optimization is where the dataset is further split up column-wise: that is, each machine only has feature values for a subset of features, and for only a subset of the observations.

References:

  1. Boyd, S., et al. 2010. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.

Alternating Direction Method of Multipliers (ADMM)

The alternating direction method of multipliers (ADMM) is an algorithm for solving particular types of convex optimization problems. It was originally proposed in the mid-1970s by Glowinski & Marrocco (1975) and Gabay & Mercier (1976). According to Reference 1,

ADMM is an algorithm that is intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers.

(Both dual ascent and the method of multipliers are described in Section 2 of Reference 1. I may write blog posts on these in the future.) ADMM is gaining a lot of popularity now because it often allows optimization to be done in a distributed manner, making problems of ever-increasing size tractable.

The problem ADMM solves

ADMM solves problems of the form

\begin{aligned} \text{minimize} \quad &f(x) + g(z) \\  \text{subject to}\quad & Ax + Bz = c, \end{aligned}

with variables x \in \mathbb{R}^n, z \in \mathbb{R}^m, and A \in \mathbb{R}^{p \times n}, B \in \mathbb{R}^{p \times m}, and c \in \mathbb{R}^p. f and g are assumed to be convex functions.

The art of convex optimization involves learning how to formulate your problem in the form you want (in this case, in the problem described above). One reason for ADMM’s appeal is its broad applicability: several different common types of convex optimization problems can be framed as the problem above. In particular, ADMM deals with an objective function that is separable. Such objective functions appear widely in machine learning, where we want to minimize the sum of a loss function and a penalty (or regularization) function.

The ADMM algorithm

Define the augmented Lagrangian

\begin{aligned} L_\rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + \dfrac{\rho}{2} \| Ax + Bz - c \|_2^2, \end{aligned}

where \rho > 0 is some penalty parameter. (y is known as the dual variable.) The ADMM algorithm consists of the iterations

\begin{aligned} x^{k+1} &= \text{argmin}_x \; L_\rho \left(x, z^k, y^k \right), \\  z^{k+1} &= \text{argmin}_z \; L_\rho \left(x^{k+1}, z, y^k \right), \\  y^{k+1} &= y^k + \rho (Ax^{k+1} + Bz^{k+1} - c). \end{aligned}

In the above, superscripts refer to the iteration number.

These updates are known as the unscaled version of ADMM. The scaled version of ADMM refers to the iteration updates listed below. The scaled version is often used because it typically results in shorter formulas.

\begin{aligned} x^{k+1} &= \text{argmin}_x \; \left( f(x) + \dfrac{\rho}{2} \| Ax + Bz^k - c + u^k \|_2^2 \right), \\  z^{k+1} &= \text{argmin}_z \; \left( g(z) + \dfrac{\rho}{2} \| Ax^{k+1} + Bz - c + u^k \|_2^2 \right), \\  u^{k+1} &= u^k + Ax^{k+1} + Bz^{k+1} - c. \end{aligned}

Here, u = y / \rho is known as the scaled dual variable.
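
As a concrete illustration of the scaled updates, here is a short R sketch (my own, along the lines of the lasso example in Reference 1) for the problem \text{minimize} \; \dfrac{1}{2}\|Ax - b\|_2^2 + \lambda \|z\|_1 subject to x - z = 0, where the z-update reduces to soft-thresholding:

soft_threshold <- function(v, t) sign(v) * pmax(abs(v) - t, 0)

admm_lasso <- function(A, b, lambda, rho = 1, n_iter = 200) {
  p <- ncol(A)
  z <- u <- rep(0, p)
  AtA_rho <- crossprod(A) + rho * diag(p)  # reused in every x-update
  Atb <- crossprod(A, b)
  for (k in 1:n_iter) {
    x <- drop(solve(AtA_rho, Atb + rho * (z - u)))  # x-update: ridge-type solve
    z <- soft_threshold(x + u, lambda / rho)        # z-update: soft-thresholding
    u <- u + x - z                                  # scaled dual update
  }
  z
}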

Things to note in practice

According to both References 1 and 2, in practice ADMM usually obtains a modestly accurate solution in a handful of iterations, but requires a large number of iterations for a highly accurate solution.

While ADMM is guaranteed to converge under mild conditions on f and g and for all values of the parameter \rho (see Section 3.2 of Reference 1 for details), in practice the value of \rho can greatly affect the performance of the algorithm. There appears to be some work done on determining what the optimal value of \rho is in particular cases, but I haven’t seen any general heuristic for choosing it. Reference 1 often sets \rho = 1, and in one example used values of \rho in the range 0.1 to 100.

How many iterations should we run ADMM for before terminating? Section 3.3 of Reference 1 gives a few possible heuristics. The easiest to implement is to set thresholds \varepsilon^{\text{pri}} and \varepsilon^{\text{dual}} for feasibility tolerances for the primal and dual feasibility conditions respectively (i.e. how far off we are willing to be for our primal and dual conditions). We terminate once the following two conditions are met:

\begin{aligned} \| r^k \|_2 &= \| Ax^k + Bz^k - c \|_2 \leq \varepsilon^{\text{pri}}, \text{ and} \\  \| s^k \|_2 &= \| \rho A^T B (z^k - z^{k-1}) \|_2 \leq \varepsilon^{\text{dual}}. \end{aligned}

r^k and s^k are known as the primal and dual residuals respectively. See Section 3.3 of Reference 1 for more details.
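
Inside an ADMM loop, this test might look something like the following sketch (A, B, c, rho, the current iterates x and z, the previous iterate z_old, and the tolerances eps_pri and eps_dual are all assumed to be defined):

r <- A %*% x + B %*% z - c             # primal residual r^k
s <- rho * t(A) %*% B %*% (z - z_old)  # dual residual s^k
if (sqrt(sum(r^2)) <= eps_pri && sqrt(sum(s^2)) <= eps_dual) break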

References:

  1. Boyd, S., et al. 2010. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.
  2. Tibshirani, R. Alternating Direction Method of Multipliers.
  3. Boyd, S. ADMM.