If we were writing out the full correlation matrix for $n$ consecutive data points, it would look something like this (the $(i, j)$ entry is $\rho^{|i - j|}$):

(*Side note:* This is an example of a correlation matrix which has Toeplitz structure.)

*Given $n$ and $\rho$, how can we generate this matrix quickly in R?* The function below is my (current) best attempt:

```r
ar1_cor <- function(n, rho) {
  exponent <- abs(matrix(1:n - 1, nrow = n, ncol = n, byrow = TRUE) -
                    (1:n - 1))
  rho^exponent
}
```

In the function above, `n` is the number of rows in the desired correlation matrix (which is the same as the number of columns), and `rho` is the autocorrelation parameter. The function makes use of the fact that when subtracting a vector from a matrix, R automatically recycles the vector to have the same number of elements as the matrix, and it does so in a column-wise fashion.

Here is an example of how the function can be used:

```r
ar1_cor(4, 0.9)
#       [,1] [,2] [,3]  [,4]
# [1,] 1.000 0.90 0.81 0.729
# [2,] 0.900 1.00 0.90 0.810
# [3,] 0.810 0.90 1.00 0.900
# [4,] 0.729 0.81 0.90 1.000
```

Such a function might be useful when trying to generate data that has such a correlation structure. For example, it could be passed as the `Sigma` parameter for `MASS::mvrnorm()`, which generates samples from a multivariate normal distribution.

**Can you think of other ways to generate this matrix?**
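One alternative makes direct use of the Toeplitz structure noted above: base R's `toeplitz()` builds a symmetric Toeplitz matrix from its first row, and the first row of the AR(1) correlation matrix is $(1, \rho, \rho^2, \dots, \rho^{n-1})$. Here is a sketch of that approach (the function name `ar1_cor_toeplitz` is my own, and I haven't benchmarked it against the version above):

```r
# Sketch: use base R's toeplitz(), which builds a symmetric Toeplitz matrix
# from its first row. The first row of the AR(1) matrix is (1, rho, ..., rho^(n-1)).
ar1_cor_toeplitz <- function(n, rho) {
  toeplitz(rho^(0:(n - 1)))
}

ar1_cor_toeplitz(4, 0.9)  # same matrix as ar1_cor(4, 0.9) above
```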

As with all discussions on the performance of a binary classifier, we start with a confusion matrix:

In the above, the “positive” or “negative” in TP/FP/TN/FN refers to the prediction made, not the actual class. (Hence, a “false positive” is a case where we wrongly predicted positive.)

Balanced accuracy is based on two more commonly used metrics: *sensitivity* (also known as the *true positive rate* or *recall*) and *specificity* (also known as the *true negative rate*). In terms of the confusion matrix, $\text{sensitivity} = \frac{TP}{TP + FN}$ and $\text{specificity} = \frac{TN}{TN + FP}$.

*Balanced accuracy* is simply the arithmetic mean of the two:

$$\text{balanced accuracy} = \frac{\text{sensitivity} + \text{specificity}}{2}.$$

Let’s use an example to illustrate how balanced accuracy can be a better judge of performance in the imbalanced class setting. Assume that we have a binary classifier and it gave us the results in the confusion matrix below:

The accuracy of this classifier, i.e. the proportion of correct predictions, is . That sounds really impressive until you realize that simply by predicting all negative, we would have obtained an accuracy of , which is better than our classifier!

Balanced accuracy attempts to account for the imbalance in classes. Here is the computation for balanced accuracy for our classifier:

Our classifier is doing a great job at picking out the negatives but not so for the positives. Balanced accuracy still seems a little high if identifying the positives is what we care about, but it’s much lower than what accuracy suggested.

For comparison, let’s do the computation for the classifier that always predicts 0 (negative): its sensitivity is $0$ and its specificity is $1$, so its balanced accuracy is $(0 + 1) / 2 = 0.5$.

Based on balanced accuracy, we would say that our classifier is doing a little better than the naive “all negatives” classifier, but not much better. This seems like a reasonable conclusion since our classifier is able to pick out some positives but not many of them.

Here is some R code that you can use to compute these measures:

```r
TP <- 0
TN <- 10050
FP <- 0
FN <- 15

# metrics
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
balanced_accuracy <- (sensitivity + specificity) / 2

# print out metrics
options(digits = 4)
cat("Accuracy:", accuracy, "\n",
    "Sensitivity:", sensitivity, "\n",
    "Specificity:", specificity, "\n",
    "Balanced accuracy:", balanced_accuracy)
```

*Note:* This reference points out that balanced accuracy can be extended easily to the multi-class setting: there it is simply the arithmetic mean of the recall for all the classes.

*Note:* Another popular metric one can use for imbalanced datasets is the

We usually hope that the convex hull $\text{conv}(S)$ retains the properties of $S$. For example, if $S$ is open (resp. compact), then $\text{conv}(S)$ remains open (resp. compact). (The proofs can be found here and here.)

However, *if $S$ is closed, its convex hull need not be closed.* Here is a counter-example: in $\mathbb{R}^2$, the set

$$S = \left\{ (x, y) \in \mathbb{R}^2 : y \geq \frac{1}{1 + x^2} \right\}$$

is closed, but $\text{conv}(S)$ is the upper half-plane *without* the $x$-axis, i.e. $\{(x, y) : y > 0\}$. Below is a visual representation of the set $S$: the black line is the curve $y = 1 / (1 + x^2)$, and $S$ is denoted by the shaded area including the black line.

*Side note 1:* The function $x \mapsto 1 / (1 + x^2)$ actually has a name: the *witch of Agnesi*.

*Side note 2:* The statement “convex hull of a compact set is also compact” can be false if we are defining our sets over a space other than $\mathbb{R}^n$! See here for a counter-example.

References:

- Wikipedia. Convex hull.

A square matrix $A$ is said to be *diagonally dominant* if $|a_{ii}| \geq \sum_{j \neq i} |a_{ij}|$ for all $i$. It is said to be *strictly diagonally dominant* if the inequality above is strict for all values of $i$.

In words, a diagonally dominant matrix is a square matrix such that in each row, the absolute value of the term on the diagonal is greater than or equal to the sum of absolute values of the rest of the terms in that row.

**Properties**

- A strictly diagonally dominant matrix is non-singular, i.e. has an inverse. This is known as the **Levy-Desplanques theorem**; a proof of the theorem can be found here.
- A symmetric diagonally dominant real matrix with non-negative diagonal entries is positive semidefinite (PSD). A proof can be found in the Wikipedia article (Reference 2). Similarly, a symmetric strictly diagonally dominant real matrix with positive diagonal entries is positive definite.

(*Note:* Diagonally dominant matrices can be defined for matrices with complex entries as well. The references and links in this post apply to matrices with complex entries; here I just focus on matrices with real entries.)
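To make the definition concrete, here is a small R sketch that checks (strict) diagonal dominance row by row; the function name `is_diag_dominant` and the example matrix are my own inventions:

```r
# Sketch: row-by-row check of (strict) diagonal dominance.
is_diag_dominant <- function(A, strict = FALSE) {
  d <- abs(diag(A))
  off <- rowSums(abs(A)) - d     # sum of absolute off-diagonal entries per row
  if (strict) all(d > off) else all(d >= off)
}

A <- matrix(c(3, -1, 1,
              1,  4, 2,
              0,  1, 5), nrow = 3, byrow = TRUE)
is_diag_dominant(A, strict = TRUE)  # TRUE: e.g. row 1 has |3| > |-1| + |1|
```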

References:

- Wolfram MathWorld. Diagonally Dominant Matrix.
- Wikipedia. Diagonally dominant matrix.

(Note: The material for this post is taken from Chapter 7 of Reference 1.)

**Recap**

Recall that ADMM solves problems that can be written in the form

$$\begin{aligned} \text{minimize} \quad & f(x) + g(z) \\ \text{subject to} \quad & Ax + Bz = c, \end{aligned}$$

with variables $x \in \mathbb{R}^n$ and $z \in \mathbb{R}^m$, and $A \in \mathbb{R}^{p \times n}$, $B \in \mathbb{R}^{p \times m}$, and $c \in \mathbb{R}^p$. $f$ and $g$ are assumed to be convex functions. For brevity, I will call this problem the *“ADMM problem”*, or I may say that the problem is in *ADMM form*.

The *ADMM algorithm* consists of the iterations

$$\begin{aligned} x^{k+1} &:= \underset{x}{\text{argmin}} \; L_\rho(x, z^k, y^k), \\ z^{k+1} &:= \underset{z}{\text{argmin}} \; L_\rho(x^{k+1}, z, y^k), \\ y^{k+1} &:= y^k + \rho(Ax^{k+1} + Bz^{k+1} - c), \end{aligned}$$

where $L_\rho$ is the augmented Lagrangian, $y$ is the dual variable and $\rho > 0$ is a penalty parameter.

In the above, superscripts refer to the iteration number.

**Global Variable Consensus Optimization**

Consider the optimization problem

$$\underset{x \in \mathbb{R}^n}{\text{minimize}} \quad \sum_{i=1}^N f_i(x),$$

where the $f_i : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ are convex. (Note that each term can encode constraints by setting $f_i(x) = +\infty$ when a constraint is violated.) Imagine that we had $N$ worker machines available to us (along with one central machine). **Could we break the optimization up into parallel parts, with the optimization of $f_i$ being done by machine $i$ (roughly speaking)?**

We can do so if we introduce auxiliary variables. For $i = 1, \dots, N$, replace the argument $x$ with a “local copy” $x_i$, then constrain each of the $x_i$ to be equal to some other auxiliary variable $z$. In other words, solve the equivalent problem

$$\begin{aligned} \text{minimize} \quad & \sum_{i=1}^N f_i(x_i) \\ \text{subject to} \quad & x_i = z, \quad i = 1, \dots, N. \end{aligned}$$

At first it might seem like we have taken a step back, since we now have to solve for $N + 1$ variables instead of just one variable. However, notice that the problem is in ADMM form, and so we can write the ADMM updates (in scaled form, with scaled dual variables $u_i$):

$$\begin{aligned} x_i^{k+1} &:= \underset{x_i}{\text{argmin}} \left( f_i(x_i) + \frac{\rho}{2}\|x_i - z^k + u_i^k\|_2^2 \right), \quad i = 1, \dots, N, \\ z^{k+1} &:= \frac{1}{N}\sum_{i=1}^N \left( x_i^{k+1} + u_i^k \right), \\ u_i^{k+1} &:= u_i^k + x_i^{k+1} - z^{k+1}, \quad i = 1, \dots, N. \end{aligned}$$

The first and third steps can be done in parallel across the $N$ machines, with machine $i$ solving the minimization problem associated with $f_i$. After step 1, all the new $x_i^{k+1}$ values are fed to a central processing element (sometimes called the *central collector* or *fusion center*), which computes $z^{k+1}$ and broadcasts it back to the worker machines.

**Global Variable Consensus Optimization: An example**

One example of global variable consensus optimization is linear regression over a huge dataset. Let’s say that the design matrix is $X \in \mathbb{R}^{n \times d}$ and the response vector is $y \in \mathbb{R}^n$. Linear regression solves the optimization problem

$$\underset{\beta \in \mathbb{R}^d}{\text{minimize}} \quad \|y - X\beta\|_2^2.$$

If $n$ is huge, the data may be broken up into parts and stored on different machines. For example, assume that $X$ and $y$ are broken up row-wise into $N$ parts, with part $i$, denoted by $X_i$ and $y_i$, being on machine $i$. We can then write the optimization problem as

$$\underset{\beta \in \mathbb{R}^d}{\text{minimize}} \quad \sum_{i=1}^N \|y_i - X_i \beta\|_2^2,$$

which is the global variable consensus problem. This allows us to use consensus ADMM to solve for $\beta$ in a distributed manner.
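To make this concrete, here is an R sketch of consensus ADMM for the row-split least squares problem. Everything here (the simulated data, the value of `rho`, the number of iterations) is an arbitrary choice for illustration, and each `lapply` stands in for work that would be done in parallel on separate machines:

```r
# Sketch: consensus ADMM for least squares with rows split across N machines.
set.seed(42)
n <- 300; d <- 5; N <- 3
X <- matrix(rnorm(n * d), n, d)
y <- X %*% rnorm(d) + rnorm(n)
blocks <- split(1:n, rep(1:N, length.out = n))  # row indices held by each machine

rho <- 50  # penalty parameter (arbitrary choice)
x_list <- replicate(N, numeric(d), simplify = FALSE)  # local copies x_i
u_list <- replicate(N, numeric(d), simplify = FALSE)  # scaled duals u_i
z <- numeric(d)

for (k in 1:300) {
  # Step 1 (parallel): each machine solves a ridge-like subproblem
  x_list <- lapply(1:N, function(i) {
    Xi <- X[blocks[[i]], , drop = FALSE]
    yi <- y[blocks[[i]]]
    solve(crossprod(Xi) + rho * diag(d),
          crossprod(Xi, yi) + rho * (z - u_list[[i]]))
  })
  # Step 2: the central collector averages the (x_i + u_i)
  z <- Reduce(`+`, Map(`+`, x_list, u_list)) / N
  # Step 3 (parallel): dual updates
  u_list <- lapply(1:N, function(i) u_list[[i]] + x_list[[i]] - z)
}

# z should agree with the ordinary least squares solution on the full data
max(abs(z - solve(crossprod(X), crossprod(X, y))))
```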

**General Form Consensus Optimization**

The global variable consensus problem can be generalized. In the global variable consensus problem, machine $i$ deals with $f_i(x_i)$, where $x_i$ is a local copy of the global variable $z$. In general form consensus optimization, machine $i$ deals with $f_i(x_i)$, where *$x_i$ is (instead) a copy of a subvector of the global variable $z$*.

This generalization is easy conceptually but cumbersome in terms of notation. Let $z \in \mathbb{R}^n$ be the global variable. For each $i = 1, \dots, N$, let $x_i \in \mathbb{R}^{n_i}$ be a “local” variable that maps to a subvector of $z$. Let $G$ be a mapping such that $G(i, j) = g$ if and only if local variable component $(x_i)_j$ corresponds to the global variable component $z_g$.

Perhaps an example will help make this notation more understandable. In the diagram below, each local variable component $(x_i)_j$ (on the left) is joined to the global variable component $z_{G(i,j)}$ (on the right) that it corresponds to.

The general form consensus optimization problem is of the form

$$\begin{aligned} \text{minimize} \quad & \sum_{i=1}^N f_i(x_i) \\ \text{subject to} \quad & (x_i)_j = z_{G(i,j)} \quad \text{for all } i, j. \end{aligned}$$

Let $\tilde{z}_i \in \mathbb{R}^{n_i}$ be defined by $(\tilde{z}_i)_j = z_{G(i,j)}$. That is, $\tilde{z}_i$ is the global variable’s idea of what local variable $x_i$ should be. We can rewrite the optimization problem above as

$$\begin{aligned} \text{minimize} \quad & \sum_{i=1}^N f_i(x_i) \\ \text{subject to} \quad & x_i = \tilde{z}_i, \quad i = 1, \dots, N. \end{aligned}$$

Again, this is in ADMM form and so we can write the ADMM updates easily (again in scaled form, with scaled dual variables $u_i$):

$$\begin{aligned} x_i^{k+1} &:= \underset{x_i}{\text{argmin}} \left( f_i(x_i) + \frac{\rho}{2}\|x_i - \tilde{z}_i^k + u_i^k\|_2^2 \right), \\ z^{k+1} &:= \underset{z}{\text{argmin}} \; \sum_{i=1}^N \frac{\rho}{2}\|x_i^{k+1} - \tilde{z}_i + u_i^k\|_2^2, \\ u_i^{k+1} &:= u_i^k + x_i^{k+1} - \tilde{z}_i^{k+1}. \end{aligned}$$

As in global variable consensus, steps 1 and 3 decouple into $N$ subproblems, one for each $x_i$ and $u_i$. *Unlike global variable consensus, step 2 also decouples!* We can compute the update for each component $z_g$, $g = 1, \dots, n$, separately: $z_g^{k+1}$ is simply the average of $(x_i^{k+1})_j + (u_i^k)_j$ over all pairs $(i, j)$ with $G(i, j) = g$.

Hence, less broadcasting is required between steps 1 and 2 and steps 2 and 3. For example, in the diagram above, we only have to broadcast along the arrows from left to right between steps 1 and 2, and along the arrows from right to left between steps 2 and 3. (For global variable consensus, we have to broadcast along the complete bipartite graph between the left and right.)

**General Form Consensus Optimization: An example**

One example of global variable consensus optimization is linear regression over a huge dataset, where the dataset is split up row-wise across many machines. An example of general form consensus optimization is where the dataset is further split up column-wise: that is, each machine only has feature values for a subset of features, and for only a subset of the observations.

References:

- Boyd, S., et al. 2010. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.

ADMM is an algorithm that is intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers.

(Both *dual ascent* and the *method of multipliers* are described in Reference 1.)

**The problem ADMM solves**

ADMM solves problems of the form

$$\begin{aligned} \text{minimize} \quad & f(x) + g(z) \\ \text{subject to} \quad & Ax + Bz = c, \end{aligned}$$

with variables $x \in \mathbb{R}^n$ and $z \in \mathbb{R}^m$, and $A \in \mathbb{R}^{p \times n}$, $B \in \mathbb{R}^{p \times m}$, and $c \in \mathbb{R}^p$. $f$ and $g$ are assumed to be convex functions.

The art of convex optimization involves learning how to formulate your problem in the form you want (in this case, in the problem described above). One reason for ADMM’s appeal is its * broad applicability*: several different common types of convex optimization problems can be framed as the problem above. In particular, ADMM deals with an objective function that is separable. Such objective functions appear widely in machine learning, where we want to minimize the sum of a loss function and a penalty (or regularization) function.

**The ADMM algorithm**

Define the **augmented Lagrangian**

$$L_\rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2,$$

where $\rho > 0$ is some penalty parameter. ($y$ is known as the *dual variable*.) The ADMM algorithm consists of the iterations

$$\begin{aligned} x^{k+1} &:= \underset{x}{\text{argmin}} \; L_\rho(x, z^k, y^k), \\ z^{k+1} &:= \underset{z}{\text{argmin}} \; L_\rho(x^{k+1}, z, y^k), \\ y^{k+1} &:= y^k + \rho(Ax^{k+1} + Bz^{k+1} - c). \end{aligned}$$

In the above, superscripts refer to the iteration number.

These updates are known as the *unscaled version* of ADMM. The *scaled version* is obtained by substituting $u = y / \rho$ into the updates above:

$$\begin{aligned} x^{k+1} &:= \underset{x}{\text{argmin}} \left( f(x) + \frac{\rho}{2}\|Ax + Bz^k - c + u^k\|_2^2 \right), \\ z^{k+1} &:= \underset{z}{\text{argmin}} \left( g(z) + \frac{\rho}{2}\|Ax^{k+1} + Bz - c + u^k\|_2^2 \right), \\ u^{k+1} &:= u^k + Ax^{k+1} + Bz^{k+1} - c. \end{aligned}$$

Here, $u$ is known as the *scaled dual variable*.
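As a concrete illustration of the scaled updates, here is an R sketch of ADMM for the lasso, written in ADMM form with $f(x) = \frac{1}{2}\|Ax - b\|_2^2$, $g(z) = \lambda\|z\|_1$ and constraint $x - z = 0$, so that the $z$-update is elementwise soft thresholding. The simulated data and the values of `lambda` and `rho` are arbitrary choices:

```r
# Sketch: scaled ADMM for the lasso, minimize 0.5*||Ax - b||^2 + lambda*||x||_1.
set.seed(1)
n <- 100; p <- 10
A <- matrix(rnorm(n * p), n, p)
b <- A %*% c(3, -2, rep(0, p - 2)) + rnorm(n)
lambda <- 20; rho <- 50  # arbitrary choices

soft_threshold <- function(a, kappa) sign(a) * pmax(abs(a) - kappa, 0)

x <- z <- u <- numeric(p)
x_solve <- solve(crossprod(A) + rho * diag(p))  # factor once, reuse each iteration
Atb <- crossprod(A, b)
for (k in 1:200) {
  x <- x_solve %*% (Atb + rho * (z - u))    # x-update: ridge-type solve
  z <- soft_threshold(x + u, lambda / rho)  # z-update: soft thresholding
  u <- u + x - z                            # scaled dual update
}
round(drop(z), 3)  # sparse estimate of the coefficient vector
```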

**Things to note in practice**

According to both References 1 and 2, in practice ADMM usually obtains a modestly accurate solution in a handful of iterations, but requires a large number of iterations for a highly accurate solution.

While ADMM is guaranteed to converge under mild conditions on $f$ and $g$ and for all values of the parameter $\rho$ (see Section 3.2 of Reference 1 for details), in practice the value of $\rho$ can greatly affect the performance of the algorithm. There appears to be some work done on determining what the optimal value of $\rho$ is in particular cases, but I haven’t seen any general heuristic for choosing it. Reference 1 often sets $\rho = 1$, and in one example used values of $\rho$ in the range 0.1 to 100.

*How many iterations should we run ADMM for before terminating?* Section 3.3 of Reference 1 gives a few possible heuristics. The easiest to implement is to set thresholds $\epsilon^{\text{pri}} > 0$ and $\epsilon^{\text{dual}} > 0$ for feasibility tolerances for the primal and dual feasibility conditions respectively (i.e. how far off we are willing to be for our primal and dual conditions). We terminate once the following two conditions are met:

$$\|r^k\|_2 \leq \epsilon^{\text{pri}} \quad \text{and} \quad \|s^k\|_2 \leq \epsilon^{\text{dual}}.$$

$r^k = Ax^k + Bz^k - c$ and $s^k = \rho A^T B (z^k - z^{k-1})$ are known as the primal and dual residuals respectively. See Section 3.3 of Reference 1 for more details.

References:

- Boyd, S., et al. 2010. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.
- Tibshirani, R. Alternating Direction Method of Multipliers.
- Boyd, S. ADMM.

One example where the trace trick is useful is in proving the following lemma for the expectation of a quadratic form:

Lemma. Suppose $X \in \mathbb{R}^n$ is a random vector such that $\mathbb{E}[X] = \mu$ and $\text{Cov}[X] = \Sigma$. Then for any fixed matrix $A \in \mathbb{R}^{n \times n}$,

$$\mathbb{E}[X^T A X] = \text{tr}(A\Sigma) + \mu^T A \mu.$$

Here is the proof: Note that

$$X^T A X = (X - \mu)^T A (X - \mu) + (X - \mu)^T A \mu + \mu^T A (X - \mu) + \mu^T A \mu.$$

Taking expectations on both sides, the middle two terms on the RHS are equal to 0, giving

$$\mathbb{E}[X^T A X] = \mu^T A \mu + \mathbb{E}\left[(X - \mu)^T A (X - \mu)\right].$$

Note that the expression inside the expectation for the second term on the RHS is a $1 \times 1$ matrix. Hence, applying the trace trick and using the fact that the matrix trace is invariant under cyclic permutations,

$$\mathbb{E}\left[(X-\mu)^T A (X-\mu)\right] = \mathbb{E}\left[\text{tr}\left( A (X-\mu)(X-\mu)^T \right)\right] = \text{tr}\left( A \, \mathbb{E}\left[(X-\mu)(X-\mu)^T\right] \right) = \text{tr}(A\Sigma).$$
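The lemma is easy to check numerically. Here is a quick Monte Carlo sketch in R (the particular $\mu$, $\Sigma$ and $A$ are arbitrary choices; a Gaussian is used for $X$, though any distribution with these first two moments would do):

```r
# Sketch: Monte Carlo check of E[X' A X] = tr(A Sigma) + mu' A mu.
set.seed(123)
d <- 3
mu <- c(1, -1, 2)
L <- matrix(c(1,    0,   0,
              0.5,  1,   0,
              -0.3, 0.2, 1), nrow = d, byrow = TRUE)
Sigma <- L %*% t(L)             # a valid covariance matrix (arbitrary choice)
A <- matrix(rnorm(d * d), d, d)

B <- 1e5                                       # number of Monte Carlo draws
X <- matrix(rnorm(B * d), B, d) %*% t(L)       # rows have covariance Sigma
X <- sweep(X, 2, mu, "+")                      # shift to mean mu
empirical <- mean(rowSums((X %*% t(A)) * X))   # average of x' A x over draws
theoretical <- sum(diag(A %*% Sigma)) + drop(t(mu) %*% A %*% mu)
c(empirical = empirical, theoretical = theoretical)  # the two should be close
```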

**Brier score for binary predictions**

Assume that we have $n$ binary outcomes $y_1, \dots, y_n \in \{0, 1\}$ that we want to predict. Let our predictions be denoted by $p_1, \dots, p_n$, where $p_t$ is the predicted probability that $y_t = 1$, for $t = 1, \dots, n$. The *Brier score* for these predictions is given by the formula

$$BS = \frac{1}{n} \sum_{t=1}^n (p_t - y_t)^2.$$

The Brier score is only one of several loss functions one could use for binary classification. See this Wikipedia article for a longer list, and see this reference for a comparison between the Brier score and logistic loss.

**Brier score for multi-category predictions**

The Brier score can be generalized for multi-class predictions. Let $R$ be the number of possible outcomes (e.g. if the possible outcomes are A, B and C, then $R = 3$). Without loss of generality, let the outcome classes be denoted by $1, \dots, R$. For case $t$, if the true outcome is class $j$, define

$$o_{tj} = 1 \quad \text{and} \quad o_{tk} = 0 \text{ for } k \neq j.$$

Let $f_{tk}$ denote the predicted probability of the $t$th outcome being in class $k$. Then the *(multi-category) Brier score* is given by

$$BS = \frac{1}{n} \sum_{t=1}^n \sum_{k=1}^R (f_{tk} - o_{tk})^2.$$

In this setting, the best possible Brier score is 0 (probability 1 for the correct class, probability 0 for everything else) while the worst possible Brier score is 2 (probability 1 for one of the wrong classes, probability 0 for everything else). Note that in the special case of $R = 2$ (the binary setting), the Brier score defined in this way is exactly twice that of the Brier score as defined in the previous section.
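As a sketch, the multi-category Brier score can be computed in a few lines of R (the function name `brier_score` is my own):

```r
# Sketch: multi-category Brier score. probs is an n x R matrix of predicted
# class probabilities; outcome[t] in 1..R is the true class of case t.
brier_score <- function(probs, outcome) {
  n <- nrow(probs)
  onehot <- matrix(0, n, ncol(probs))
  onehot[cbind(1:n, outcome)] <- 1       # the o_{tk} indicators
  mean(rowSums((probs - onehot)^2))
}

# A perfect forecast scores 0; a maximally wrong one scores 2:
brier_score(matrix(c(1, 0, 0), nrow = 1), outcome = 1)  # 0
brier_score(matrix(c(0, 1, 0), nrow = 1), outcome = 1)  # 2
```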

**Don’t use Brier scores for ordinal predictions**

The Wikipedia article for Brier score notes that it is “inappropriate for ordinal variables which can take on three or more values.” This is because any misclassification is treated in the same way. For example, let’s say we are trying to predict whether a soccer team is going to score (i) 0-1 goals (“class 1”), (ii) 2-3 goals (“class 2”) or (iii) more than 3 goals (“class 3”). Let’s say we only want to predict one outcome which turned out to be class 1. Consider the following two predictions:

A. Class 2 with 100% probability.

B. Class 3 with 100% probability.

Because of the ordinal scale, we would prefer prediction A over prediction B. However, the Brier score for each of these predictions would be the same: 2.

**Brier skill score**

*A noted shortcoming of the Brier score is that it doesn’t fare well when the classes are imbalanced.* For example, suppose that we are trying to predict 10 outcomes, 9 of which are 1 (the remaining case being 0). Suppose that our model has a Brier score of 0.15. Is it a good model?

Even though the score sounds low, the model isn’t good at all. Consider the naive prediction of “always predict class 1 with 100% probability”: this has a Brier score of

$$\frac{1}{10}\left[ 9 \times (1 - 1)^2 + (1 - 0)^2 \right] = 0.1,$$

which is better than our model!

For such settings, *Brier skill scores* are more meaningful in that they compare our model’s Brier score with that of some “reference forecast”. The formula for a model’s Brier skill score is

$$BSS = 1 - \frac{BS}{BS_{\text{ref}}},$$

where $BS$ and $BS_{\text{ref}}$ are the Brier scores of the model and the reference forecast respectively. The Brier skill score takes on values from $-\infty$ to $1$, with higher scores meaning more accurate predictions (compared with the reference). A BSS of 0 indicates the same accuracy as that of the reference.

The reference is often some naive prediction one can make, e.g. always predict the dominant class with 100% probability, or predict each class with its long-term proportion.
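Putting the last few formulas together, here is a small R sketch that reproduces the example above: the naive reference forecast has a Brier score of 0.1, so a model with a Brier score of 0.15 has a negative Brier skill score:

```r
# Sketch: Brier skill score for the example above (9 outcomes equal to 1, one
# equal to 0; reference = always predict class 1 with probability 1).
brier <- function(p, y) mean((p - y)^2)

y <- c(rep(1, 9), 0)
bs_ref <- brier(rep(1, 10), y)   # naive reference: Brier score 0.1
bs_model <- 0.15                 # the model's Brier score, taken from the text

bss <- 1 - bs_model / bs_ref     # negative: the model is worse than the reference
bss
```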

References:

- Brier, G. W. (1950). Verification of forecasts expressed in terms of probability.
- Wikipedia. Brier score.
- Statistics How To. Brier Score: Definition, Examples.

Let $d$ be some positive integer. A *$d$-dimensional copula* is a joint cumulative distribution function (CDF) of a $d$-dimensional random vector on $[0, 1]^d$ with uniform marginals, i.e. $C(1, \dots, 1, u, 1, \dots, 1) = u$ if one of the arguments is $u \in [0, 1]$ and the rest are $1$.

*Copulas are useful because they contain all information on the dependence structure between the elements of a $d$-dimensional random vector. One example of this is in the generation of random samples from multivariate probability distributions.* Suppose we have a random vector $X = (X_1, \dots, X_d)$ such that the marginal CDF for each $X_j$, denoted by $F_j$, is continuous. Then

$$U = (F_1(X_1), \dots, F_d(X_d))$$

has uniformly distributed marginals, i.e. $F_j(X_j) \sim \text{Unif}[0, 1]$ for $j = 1, \dots, d$. We can think of the copula as the joint CDF of $U$, i.e.

$$C(u_1, \dots, u_d) = P(U_1 \leq u_1, \dots, U_d \leq u_d).$$

Now let’s say we want a random draw of $X$. *Assuming that we can draw a sample $(u_1, \dots, u_d)$ from the copula distribution implied by $C$,* this gives us a sample of $U$. From this, we can obtain the desired sample via

$$x_j = F_j^{-1}(u_j), \quad j = 1, \dots, d.$$

All this, of course, is predicated on being able to sample from the copula distribution.

**Sklar’s theorem**

Sklar’s theorem tells us that we can always decompose a $d$-dimensional joint CDF into a copula and the marginal CDFs, and that conversely, any copula and set of univariate CDFs gives us a valid $d$-dimensional joint CDF. Below is the formal statement of the theorem:

Sklar’s theorem (1959). Let $F$ be a $d$-dimensional CDF with marginals $F_1, \dots, F_d$. Then there exists a copula $C$ such that

$$F(x_1, \dots, x_d) = C(F_1(x_1), \dots, F_d(x_d))$$

for all $x_1, \dots, x_d \in [-\infty, \infty]$. If $F_j$ is continuous for $j = 1, \dots, d$, then $C$ is unique; otherwise $C$ is uniquely determined only on $\text{Ran}(F_1) \times \dots \times \text{Ran}(F_d)$.

In the opposite direction, for any copula $C$ and univariate CDFs $F_1, \dots, F_d$, the function $F$ as defined above is a multivariate CDF with marginals $F_1, \dots, F_d$.

Viewed in this light, copulas allow us to think of the dependence structure independently from the marginal distributions. (This may or may not be a good idea! The dependence between the $U_j$’s is one step removed from reality, i.e. dependence between the $X_j$’s.)

**Gaussian copulas**

The most famous family of copulas is that of the *Gaussian copulas*. Let $P \in \mathbb{R}^{d \times d}$ be a correlation matrix, and let $\Phi_P$ represent the joint CDF of a $d$-dimensional Gaussian distribution with mean $0$ and covariance matrix $P$. Then the Gaussian copula with parameter matrix $P$ is defined by

$$C_P(u_1, \dots, u_d) = \Phi_P\left( \Phi^{-1}(u_1), \dots, \Phi^{-1}(u_d) \right),$$

where $\Phi$ is the CDF of the standard (univariate) Gaussian distribution.

*One reason Gaussian copulas are popular is because they are easy to simulate.* Let $A$ be a square root matrix of $P$, i.e. $P = AA^T$. We can draw from the copula $C_P$ using this simple recipe:

- Draw $z = (z_1, \dots, z_d)$, where $z_1, \dots, z_d \overset{iid}{\sim} \mathcal{N}(0, 1)$.
- Set $x = Az$. (Note that we have $x \sim \mathcal{N}(0, P)$.)
- Return $u = (\Phi(x_1), \dots, \Phi(x_d))$.
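The recipe above can be implemented in a few lines of R; here is a sketch (the function name `rgauss_copula` is my own, and the Cholesky factor is used as the square root matrix $A$):

```r
# Sketch: draw n samples from the Gaussian copula with parameter matrix P,
# following the recipe above (A = Cholesky factor of P, so P = A %*% t(A)).
rgauss_copula <- function(n, P) {
  d <- ncol(P)
  A <- t(chol(P))                  # lower-triangular square root of P
  Z <- matrix(rnorm(d * n), d, n)  # i.i.d. N(0, 1) draws, one column per sample
  X <- A %*% Z                     # columns are now N(0, P) draws
  t(pnorm(X))                      # apply Phi to each margin: n x d, entries in [0, 1]
}

set.seed(1)
P <- matrix(c(1, 0.8, 0.8, 1), 2, 2)
U <- rgauss_copula(10000, P)
cor(U)[1, 2]  # strong positive dependence between the uniform margins
```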

**More on copulas**

Reference 2 contains details on other popular copulas as well as how to estimate copulas from data. Reference 3 contains quite a bit of information on the history of copulas.

References:

- Wikipedia. Copula (probability theory).
- Haugh, Martin (2016). An introduction to copulas.
- Sempi, C. (2011). An introduction to copulas.