# A correlation coefficient that measures the degree of dependence between variables

I recently learned of a new correlation coefficient, introduced by Sourav Chatterjee, with some really cool properties. As stated in the abstract of the paper (Reference 1 below), this coefficient…

1. … is as simple as the classical coefficients of correlation (e.g. Pearson correlation and Spearman correlation),
2. … consistently estimates a quantity that is 0 iff the variables are independent and 1 iff one is a measurable function of the other, and
3. … has a simple asymptotic theory under the hypothesis of independence (like the classical coefficients).

What was surprising to me was Point 2: this is the first known correlation coefficient that measures the degree of functional dependence, in the sense that its limiting value is 0 iff the variables are independent and 1 iff one variable is a measurable function of the other. (The “iff”s are not typos: they are short for “if and only if”.) This can be viewed as a generalization of the Pearson correlation coefficient, which measures the degree of linear dependence between $X$ and $Y$. (The author points out in point 5 of the Introduction and Section 6 that the maximal information coefficient and the maximal correlation coefficient do not have this property, even though they are sometimes thought to have it.)

## Defining the sample correlation coefficient

Let $X$ and $Y$ be real-valued random variables such that $Y$ is not a constant, and let $(X_1, Y_1), \dots, (X_n, Y_n)$ be i.i.d. copies of the pair $(X, Y)$. The new correlation coefficient can be computed as follows:

1. Rearrange the data as $(X_{(1)}, Y_{(1)}), \dots, (X_{(n)}, Y_{(n)})$ so that the $X$ values are in increasing order, i.e. $X_{(1)} \leq \dots \leq X_{(n)}$. If there are ties, break them uniformly at random.
2. For each index $i$, let $r_i$ be the rank of $Y_{(i)}$, i.e. the number of $j$ such that $Y_{(j)} \leq Y_{(i)}$, and let $l_i$ be the number of $j$ such that $Y_{(j)} \geq Y_{(i)}$.
3. Define the new correlation coefficient as \begin{aligned} \xi_n (X, Y) = 1 - \dfrac{n \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{2 \sum_{i=1}^n l_i (n - l_i)}. \end{aligned}

If there are no ties among the $Y$'s, the denominator simplifies and we don't have to compute the $l_i$'s: \begin{aligned} \xi_n (X, Y) = 1 - \dfrac{3 \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{n^2 - 1}. \end{aligned}
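The three-step recipe above can be sketched in NumPy. This is a minimal illustration, not the author's code: the `xi_n` name and the quadratic-time rank computation are my own choices, made for clarity rather than speed.

```python
import numpy as np

def xi_n(x, y, rng=None):
    """Chatterjee's xi_n via the general (ties-allowed) formula."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    rng = np.random.default_rng(rng)
    # Step 1: sort by x, breaking ties in x uniformly at random
    # (the random array is the secondary sort key).
    order = np.lexsort((rng.random(n), x))
    y_sorted = y[order]
    # Step 2: r_i = #{j : Y_(j) <= Y_(i)},  l_i = #{j : Y_(j) >= Y_(i)}.
    # (O(n^2) for clarity; ranks obtained from a sort give O(n log n).)
    r = np.array([(y_sorted <= yi).sum() for yi in y_sorted])
    l = np.array([(y_sorted >= yi).sum() for yi in y_sorted])
    # Step 3: the general formula; with no ties among the y's it
    # reduces to 1 - 3 * sum |r_{i+1} - r_i| / (n^2 - 1).
    return 1 - n * np.abs(np.diff(r)).sum() / (2 * (l * (n - l)).sum())
```

One detail worth noting: for a perfectly monotone sample such as `xi_n(x, x)`, the no-ties formula gives exactly $1 - 3/(n+1)$, so the coefficient only approaches 1 as $n \to \infty$ even under exact functional dependence.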

## What is the sample correlation coefficient estimating?

The following theorem tells us what $\xi_n(X,Y)$ is trying to estimate:

Theorem: As $n \rightarrow \infty$, $\xi_n(X, Y)$ converges almost surely to the deterministic limit \begin{aligned} \xi (X, Y) = \dfrac{\int \text{Var}(\mathbb{E}[1_{\{ Y \geq t\}} \mid X]) \, d\mu(t) }{\int \text{Var}(1_{\{ Y \geq t\}}) \, d\mu(t) }, \end{aligned} where $\mu$ is the law of $Y$.

$\xi(X, Y)$ seems like a nasty quantity, but it has some nice properties:

1. $\xi(X, Y)$ always belongs to $[0, 1]$. (This follows immediately from the law of total variance.)
2. $\xi(X, Y) = 0$ iff $X$ and $Y$ are independent, and $\xi(X, Y) = 1$ iff there is a measurable function $f: \mathbb{R} \mapsto \mathbb{R}$ such that $Y = f(X)$ almost surely.
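Property 2 is easy to check empirically. The simulation below is my own illustration (the `xi_no_ties` helper is not from the paper); it uses the no-ties formula, which applies here because the simulated variables are continuous. An independent pair gives a coefficient near 0, while a noiseless but non-monotone functional relationship like $Y = \cos X$ gives a coefficient near 1, even though its Pearson correlation is near 0.

```python
import numpy as np

def xi_no_ties(x, y):
    """xi_n assuming no ties: 1 - 3 * sum|r_{i+1} - r_i| / (n^2 - 1)."""
    n = len(x)
    y_by_x = np.asarray(y)[np.argsort(x)]      # y values sorted by x
    r = np.argsort(np.argsort(y_by_x)) + 1     # ranks of those y values
    return 1 - 3 * np.abs(np.diff(r)).sum() / (n**2 - 1)

rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, 10_000)

print(xi_no_ties(x, rng.normal(size=10_000)))  # independent: close to 0
print(xi_no_ties(x, np.cos(x)))                # Y = cos(X): close to 1
print(np.corrcoef(x, np.cos(x))[0, 1])         # Pearson: close to 0
```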

A later paper by Mona Azadkia and Chatterjee extends $\xi$ to capture degrees of conditional dependence.

## Some properties of $\xi_n(X, Y)$ and $\xi(X, Y)$

Here are some other key properties of this new correlation coefficient:

1. $\xi$ is not symmetric in $X$ and $Y$, i.e. often we will have $\xi(X, Y) \neq \xi(Y, X)$. It can be symmetrized by considering $\max \{ \xi(X, Y), \xi(Y, X)\}$ as the correlation coefficient instead. Chatterjee notes that we might want this asymmetry in certain cases: “we may want to understand if $Y$ is a function of $X$, and not just if one of the variables is a function of the other”.
2. $\xi_n(X, Y)$ remains unchanged if we apply strictly increasing transformations to $X$ and $Y$ as it is only based on the ranks of the data.
3. Since $\xi_n(X, Y)$ depends on the data only through ranks, it can be computed in $O(n \log n)$ time (the cost of sorting).
4. We have some asymptotic theory for $\xi_n$ under the assumption of independence:

Theorem: Suppose $X$ and $Y$ are independent and $Y$ is continuous. Then $\sqrt{n} \xi_n (X, Y) \rightarrow \mathcal{N}(0, 2/5)$ in distribution as $n \rightarrow \infty$.

(The paper has a corresponding result for when $Y$ is not continuous.) This theorem allows us to construct a hypothesis test of independence based on this correlation coefficient.
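Here is a hedged sketch of such a test (the helper names are my own, and the no-ties formula is used, so continuous data is assumed). Under the null hypothesis, $\sqrt{n} \, \xi_n / \sqrt{2/5}$ is approximately standard normal, and since dependence pushes $\xi_n$ upward, a one-sided p-value is the natural choice.

```python
import numpy as np
from math import erfc, sqrt

def xi_no_ties(x, y):
    """xi_n assuming no ties among the y values."""
    n = len(x)
    y_by_x = np.asarray(y)[np.argsort(x)]
    r = np.argsort(np.argsort(y_by_x)) + 1
    return 1 - 3 * np.abs(np.diff(r)).sum() / (n**2 - 1)

def xi_independence_test(x, y):
    """One-sided p-value for H0: X and Y independent (Y continuous).

    Under H0, sqrt(n) * xi_n is approximately N(0, 2/5).
    """
    z = sqrt(len(x)) * xi_no_ties(x, y) / sqrt(2 / 5)
    return 0.5 * erfc(z / sqrt(2))  # P(N(0, 1) > z)

rng = np.random.default_rng(1)
x = rng.normal(size=500)
# Independent pair: p-value gives no systematic evidence against H0.
print(xi_independence_test(x, rng.normal(size=500)))
# Strong (non-monotone) dependence: tiny p-value, reject H0.
print(xi_independence_test(x, x**2 + 0.1 * rng.normal(size=500)))
```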

References:

1. Chatterjee, S. (2019). A new coefficient of correlation.
2. Azadkia, M. and Chatterjee, S. (2019). A simple measure of conditional dependence.