A correlation coefficient that measures the degree of dependence between variables

I recently learned of a new correlation coefficient, introduced by Sourav Chatterjee, with some really cool properties. As stated in the abstract of the paper (Reference 1 below), this coefficient…

1. … is as simple as the classical coefficients of correlation (e.g. Pearson correlation and Spearman correlation),
2. … consistently estimates a quantity that is 0 iff the variables are independent and 1 iff one is a measurable function of the other, and
3. … has a simple asymptotic theory under the hypothesis of independence (like the classical coefficients).

What was surprising to me was Point 2: that this is the first known correlation coefficient that measures the degree of functional dependence such that we get independence iff val=0 and deterministic functional dependence iff val=1. (The “iff”s are not typos: they are short for “if and only if”.) This can be viewed as a generalization of the Pearson correlation coefficient which measures the degree of linear dependence between $X$ and $Y$. (The author points out in point 5 of the Introduction and Section 6 that the maximal information coefficient and the maximal correlation coefficient do not have this property, even though they are sometimes thought to have it.)

Defining the sample correlation coefficient

Let $X$ and $Y$ be real-valued random variables such that $Y$ is not a constant, and let $(X_1, Y_1), \dots, (X_n, Y_n)$ be i.i.d. pairs of these random variables. The new correlation coefficient can be computed as follows:

1. Rearrange the data as $(X_{(1)}, Y_{(1)}), \dots, (X_{(n)}, Y_{(n)})$ so that the $X$ values are in increasing order, i.e. $X_{(1)} \leq \dots \leq X_{(n)}$. If there are ties, break them uniformly at random.
2. For each index $i$, let $r_i$ be the rank of $Y_{(i)}$, i.e. the number of $j$ such that $Y_{(j)} \leq Y_{(i)}$, and let $\l_i$ be the number of $j$ such that $Y_{(j)} \geq Y_{(i)}$.
3. Define the new correlation coefficient as

\begin{aligned} \xi_n (X, Y) = 1 - \dfrac{n \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{2 \sum_{i=1}^n l_i (n - l_i)}. \end{aligned}

If there are no ties among the $Y$‘s, the denominator simplifies and we don’t have to compute the $l_i$‘s:

\begin{aligned} \xi_n (X, Y) = 1 - \dfrac{3 \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{n^2 - 1}. \end{aligned}

What is the sample correlation coefficient estimating?

The following theorem tells us what $\xi_n(X,Y)$ is trying to estimate:

Theorem: As $n \rightarrow \infty$, $\xi_n(X, Y)$ converges almost surely to the deterministic limit

\begin{aligned} \xi (X, Y) = \dfrac{\int \text{Var}(\mathbb{E}[1_{\{ Y \geq t\}} \mid X]) d\mu(t) }{\int \text{Var}(1_{\{ Y \geq t\}}) d\mu(t) }. \end{aligned}

$\xi(X, Y)$ seems like a nasty quantity but it has some nice properties:

1. $\xi(X, Y)$ always belongs to $[0, 1]$. (This follows immediately from the law of total variance.)
2. $\xi(X, Y) = 0$ iff $X$ and $Y$ are independent, and $\xi(X, Y) = 1$ iff there is a measurable function $f: \mathbb{R} \mapsto \mathbb{R}$ such that $Y = f(X)$ almost surely.

This later paper by Mona Azadkia and Chatterjee extends $\xi$ to capture degrees of conditional dependence.

Some properties of $\xi_n(X, Y)$ and $\xi(X, Y)$

Here are some other key properties of this new correlation coefficient:

1. $\xi$ is not symmetric in $X$ and $Y$, i.e. often we will have $\xi(X, Y) \neq \xi(Y, X)$. It can be symmetrized by considering $\max \{ \xi(X, Y), \xi(Y, X)\}$ as the correlation coefficient instead. Chatterjee notes that we might want this asymmetry in certain cases: “we may want to understand if $Y$ is a function of $X$, and not just if one of the variables is a function of the other”.
2. $\xi_n(X, Y)$ remains unchanged if we apply strictly increasing transformations to $X$ and $Y$ as it is only based on the ranks of the data.
3. Since $\xi_n(X, Y)$ is only based on ranks, it can be computed in $O(n \log n)$ time.
4. We have some asymptotic theory for $\xi_n$ under the assumption of independence:

Theorem: Suppose $X$ and $Y$ are independent and $Y$ is continuous. Then $\sqrt{n} \xi_n (X, Y) \rightarrow \mathcal{N}(0, 2/5)$ in distribution as $n \rightarrow \infty$.

(The paper has a corresponding result for when $Y$ is not continuous.) This theorem allows us to construct a hypothesis test of independence based on this correlation coefficient.

References:

1. Chatterjee, S. (2019). A new coefficient of correlation.

What is Cramér’s V statistic?

Cramér’s V statistic is a commonly used measure of association between two categorical variables. It takes on values between 0 and 1 (inclusive), with 0 corresponding to no association between the variables and 1 corresponding to one variable being completely determined by the other.

It is named after Harald Cramér, a Swedish mathematician, who published it in 1946. (Note: Cramer’s rule for computing the solution to systems of linear equations is NOT named after him, but after Gabriel Cramer, a 18th-century Genevan mathematician.) It is an easily computable statistic: Reference 2 gives instructions for how to compute Cramér’s V in SAS and R.

Now for the math! Let’s say we have two categorical variables: $X$ taking values from $\{ a_1, \dots, a_I \}$ and $Y$ taking values from $\{ b_1, \dots, b_J \}$. We have $n$ observations of these variables. For possible combination of values $(a_i, b_j)$, let $n_{ij}$ denote the number of times we see that combination in our sample. (We will have $n = \sum_{i,j} n_{ij}$.) Define row and column totals

\begin{aligned} n_{i \cdot} = \sum_{j=1}^J n_{ij}, \qquad n_{\cdot j} = \sum_{i=1}^I n_{ij}. \end{aligned}

With this notation, the chi-squared statistic is defined as

\begin{aligned} \chi^2 = \sum_{i=1}^I \sum_{j=1}^J \dfrac{\left( n_{ij} - n_{i \cdot}n_{\cdot j} / n \right)^2}{n_{i \cdot}n_{\cdot j} / n}. \end{aligned}

Cramér’s V statistic, commonly denoted by $V$ but sometimes by $\phi_c$, is simply a transformation of the chi-squared statistic:

\begin{aligned} V = \sqrt{\dfrac{\chi^2 / n}{\min (I, J) - 1}}. \end{aligned}

According to Reference 3, $V$ has a known sampling distribution, and so it is possible to compute its standard error and significance.

References:

1. Wikipedia. Cramér’s V.
2. Kleinman, K. Example 8.39: calculating Cramer’s V.
3. Nominal Association: Phi and Cramer’s V.

Equicorrelation matrix

The equicorrelation matrix is the $\mathbb{R}^{n \times n}$ matrix where the entries on the diagonal are all equal to $1$ and all off-diagonal entries are equal to some parameter $\rho$ which lies in $[-1, 1]$. If we were to write out the matrix, it would look something like this:

${\bf \Sigma} = \begin{pmatrix} 1 & \rho & \rho & \dots & \rho \\ \rho & 1 & \rho & \dots & \rho \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \rho &\dots & 1 \end{pmatrix}$

Alternatively, we can write it as ${\bf \Sigma} = \rho {\bf 11}^T + (1-\rho){\bf I}$, where ${\bf 1} \in \mathbb{R}^n$ is the column vector with all entries being 1 and $\bf I$ is the identity matrix.

Here are some useful properties of the equicorrelation matrix:

• It is a Toeplitz matrix, and hence has the properties that all Toeplitz matrices has (see e.g. this link).
• It has two eigenvalues. The first eigenvalue is $1 + (n-1)\rho$, with associated eigenvector $v_1 = {\bf 1}$. The second eigenvalue is $1 - \rho$, with $n-1$ associated eigenvectors $v_2, \dots, v_n$, where the entries of $v_k$ are\begin{aligned} (v_k)_j = \begin{cases} 1 &\text{if } j = 1, \\ -1 &\text{if } j = k, \\ 0 &\text{otherwise.} \end{cases} \end{aligned}This can be verified directly by doing the matrix multiplication.
• $\text{det} ({\bf \Sigma}) = (1-\rho)^{n-1}[1 + (n-1)\rho]$. This is because the determinant of a square matrix is equal to the product of its eigenvalues.
• $\bf \Sigma$ is positive definite if and only if $-\frac{1}{n-1} < \rho < 1$. A sketch of the proof can be found in the answer here. It boils down to proving some properties of the determinant expression in the previous point.
• ${\bf \Sigma}^{-1} = \dfrac{1}{1-\rho} \left( {\bf I} - \dfrac{\rho}{1 + (n-1)\rho} {\bf 11}^T \right)$. This can be verified directly by matrix multiplication. It can also be derived using the Sherman-Morrison formula.

Generating correlation matrix for AR(1) model

Assume that we are in the time series data setting, where we have data at equally-spaced times $1, 2, \dots$ which we denote by random variables $X_1, X_2, \dots$. The AR(1) model, commonly used in econometrics, assumes that the correlation between $X_i$ and $X_j$ is $\text{Cor}(X_i, X_j) = \rho^{|i-j|}$, where $\rho$ is some parameter that usually has to be estimated.

If we were writing out the full correlation matrix for $n$ consecutive data points $X_1, \dots, X_n$, it would look something like this:

$\begin{pmatrix} 1 & \rho & \rho^2 & \dots & \rho^{n-1} \\ \rho & 1 & \rho & \dots & \rho^{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} &\dots & 1 \end{pmatrix}$

(Side note: This is an example of a correlation matrix which has Toeplitz structure.)

Given $\rho$, how can we generate this matrix quickly in R? The function below is my (current) best attempt:

ar1_cor <- function(n, rho) {
exponent <- abs(matrix(1:n - 1, nrow = n, ncol = n, byrow = TRUE) -
(1:n - 1))
rho^exponent
}


In the function above, n is the number of rows in the desired correlation matrix (which is the same as the number of columns), and rho is the $\rho$ parameter. The function makes use of the fact that when subtracting a vector from a matrix, R automatically recycles the vector to have the same number of elements as the matrix, and it does so in a column-wise fashion.

Here is an example of how the function can be used:

ar1_cor(4, 0.9)
#       [,1] [,2] [,3]  [,4]
# [1,] 1.000 0.90 0.81 0.729
# [2,] 0.900 1.00 0.90 0.810
# [3,] 0.810 0.90 1.00 0.900
# [4,] 0.729 0.81 0.90 1.000


Such a function might be useful when trying to generate data that has such a correlation structure. For example, it could be passed as the Sigma parameter for MASS::mvrnorm(), which generates samples from a multivariate normal distribution.

Can you think of other ways to generate this matrix?

Spearman’s rho and Kendall’s tau

Pearson correlation is the most commonly used metric for describing the relationship between two variables. If we have samples $x_1, \dots, x_n$ for a random variable $X$ and corresponding samples $y_1, \dots, y_n$ for another random variable $Y$, the (sample Pearson) correlation between the two variables is

\begin{aligned} r_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) }{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}, \end{aligned}

where $\bar{x}$ and $\bar{y}$ are the sample means of $X$ and $Y$ respectively. (In this post, we will talk only of sample versions of the metrics, rather than population versions.)

While commonly used, Pearson correlation really only measures the strength of the linear relationship between two variables. It is possible for two variables to have a very strong non-linear relationship but have a small Pearson correlation. In this post, we present two non-parametric methods for measuring the correlation between two random variables: Spearman’s rho and Kendall’s tau.

Note: Computing these two metrics is easy in R! Simply modify the method argument for the cor() function:

cor(x, y, method = "spearman")
cor(x, y, method = "kendall")


Spearman’s rho

While Pearson correlation assesses the linear relationship between two variables, Spearman’s rank correlation coefficient (or Spearman’s rho), assesses the monotonic relationship between two variables.

To compute Spearman’s rho, we first compute the ranks of the sample values: replace $x_1, \dots, x_n$ with ranks $a_1, \dots, a_n$, where the smallest $x$ value is replaced with 1, the next smallest is replaced with 2, and so on. If there are ties, replace them with the average of the ranks they occupy. (Likewise, replace $y_1, \dots, y_n$ with ranks $b_1, \dots, b_n$.) Spearman’s rho is simply the Pearson correlation for these ranks:

\begin{aligned} \rho_{xy} = \frac{\sum_{i=1}^n (a_i - \bar{a})(b_i - \bar{b}) }{\sqrt{\sum_{i=1}^n (a_i - \bar{a})^2} \sqrt{\sum_{i=1}^n (b_i - \bar{b})^2}}. \end{aligned}

We always have $-1 \leq \rho_{xy} \leq 1$. A positive Spearman’s rho is interpreted as $Y$ tending to increase when $X$ increases; a rho of 1 indicates a perfect monotonically increasing relationship. (Similar statements apply for negative values and -1.)

If there are no ties, there is a simpler formula for Spearman’s rho:

\begin{aligned} \rho_{xy} = 1 - \frac{6 \sum_{i=1}^n (a_i - b_i)^2 }{n(n^2 - 1)} .\end{aligned}

Spearman’s rho can be computed as long as we can rank the $x$‘s and $y$‘s.

Kendall’s tau

Instead of converting the data to ranks and then computing the Pearson correlation, Kendall’s rank correlation coefficient (or Kendall’s tau), considering the similarity of orderings of $x_1, \dots, x_n$ and $y_1, \dots, y_n$. For any pair of indices $1 \leq i < j \leq n$,

• If both $x_i > x_j$ and $y_i > y_j$, or both $x_i < x_j$ and $y_i < y_j$, $(i, j)$ is called a concordant pair.
• If both $x_i > x_j$ and $y_i < y_j$, or both $x_i < x_j$ and $y_i > y_j$, $(i, j)$ is called a discordant pair.
• If $x_i = x_j$ or $y_i = y_j$, $(i, j)$ is neither concordant nor discordant.

Kendall’s tau is defined as

\begin{aligned} \tau_{xy} = \frac{|\text{concordant pairs}| - |\text{discordant pairs}|}{n(n-1)/2}. \end{aligned}

We always have $-1 \leq \tau_{xy} \leq 1$. $\tau_{xy} = 1$ if and only if the two rankings are the same, and $\tau_{xy} = -1$ if and only if one ranking is the reverse of the other. The wikipedia page describes a number of ways to break ties.

Which one to use?

Most of the time, these two measures align closely and lead to the same inferences. There doesn’t seem to be a clear advantage in using one over the other; you are probably better off reporting both of them. Nevertheless, here are the arguments for/against each of the measures, as gleaned from the internet:

In favor of Spearman’s rho:

• Appears to be more popular in practice.

In favor of Kendall’s tau:

• Spearman’s rho is more sensitive to error and discrepancies in the data.
• When data is normal, Kendall’s tau has smaller gross error sensitivity and smaller asymptotic variance.

In favor of neither:

• Both can be computed in $O(n \log n)$ time.
• Spearman rho values tend to be higher than Kendall tau values.

References:

1. Statistics Solutions. Kendall’s Tau and Spearman’s Rank Correlation Coefficient.
2. Cross Validated. Kendall Tau or Spearman’s rho?
3. Researchgate. Does Spearman’s rho have any advantage over Kendall’s tau?
4. STATISTICA Help. Nonparametrics Statistics Notes – Correlations (Spearman, Kendall tau, Gamma).

Toeplitz covariance structure

When someone says that their model has Toeplitz covariance (or correlation) structure, what do they mean?

In linear algebra, a Toeplitz matrix is also known as a diagonal-constant matrix: i.e. “each descending diagonal from left to right is constant”:

$A = \begin{bmatrix} a_0 & a_{-1} & \dots & \dots & a_{-(n-1)} \\ a_1 & a_0 & a_{-1} &\ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ \vdots & \ddots & a_1 & a_0 & a_{-1} \\ a_{n-1} & \dots & \dots & a_1 & a_0 \end{bmatrix}$.

A model is said to have Toeplitz covariance (correlation resp.) structure if the covariance  (correlation resp.) matrix is a Toeplitz matrix. Here are 2 places where we see such structures pop up:

• We have time series data at equally-spaced times $1, 2, \dots$, denoted by $X_1, X_2, \dots$. This model has Toeplitz covariance (correlation resp.) structure if for any $n$, the covariance (correlation resp.) matrix of $X_1, X_2, \dots, X_n$ is Toeplitz. The AR(1) model, commonly used in econometrics, is an example of this, since it has $\text{Cor}(X_i, X_j) = \rho^{|i-j|}$ for some $\rho$.
• We have a continuous-time stochastic process $\{ X_t \}$ which is weakly stationary, i.e. for any 2 times $t$ and $s$, $\text{Cov}(X_t, X_s)$ depends only on $t - s$. In this setting, for any equally-spaced times $t_1, \dots, t_n$, the covariance matrix of $X_{t_1}, \dots, X_{t_n}$ will be Toeplitz.

Why work with Toeplitz covariance structure? Other than the fact that they arise naturally in certain situations (like the two above), operations with Toeplitz matrices are fast, and a matrix inverse always exists.

References:

1. Toeplitz matrix, Wikipedia.
2. Guidelines for Selecting the Covariance Structure in Mixed Model Analysis, Chuck Kincaid.
3. Toeplitz Covariance Matrix Estimation for Adaptive Beamforming and Ultrasound Imaging, Michael Pettersson.