A correlation coefficient that measures the degree of dependence between variables

I recently learned of a new correlation coefficient, introduced by Sourav Chatterjee, with some really cool properties. As stated in the abstract of the paper (Reference 1 below), this coefficient…

  1. … is as simple as the classical coefficients of correlation (e.g. Pearson correlation and Spearman correlation),
  2. … consistently estimates a quantity that is 0 iff the variables are independent and 1 iff one is a measurable function of the other, and
  3. … has a simple asymptotic theory under the hypothesis of independence (like the classical coefficients).

What was surprising to me was Point 2: this is the first known correlation coefficient that measures the degree of functional dependence, in the sense that the coefficient is 0 iff the variables are independent and 1 iff one variable is a measurable function of the other. (The “iff”s are not typos: they are short for “if and only if”.) This can be viewed as a generalization of the Pearson correlation coefficient, which measures the degree of linear dependence between X and Y. (The author points out in point 5 of the Introduction and in Section 6 that the maximal information coefficient and the maximal correlation coefficient do not have this property, even though they are sometimes thought to.)

Defining the sample correlation coefficient

Let X and Y be real-valued random variables such that Y is not a constant, and let (X_1, Y_1), \dots, (X_n, Y_n) be i.i.d. pairs with the same distribution as (X, Y). The new correlation coefficient can be computed as follows:

  1. Rearrange the data as (X_{(1)}, Y_{(1)}), \dots, (X_{(n)}, Y_{(n)}) so that the X values are in increasing order, i.e. X_{(1)} \leq \dots \leq X_{(n)}. If there are ties, break them uniformly at random.
  2. For each index i, let r_i be the rank of Y_{(i)}, i.e. the number of indices j such that Y_{(j)} \leq Y_{(i)}, and let l_i be the number of indices j such that Y_{(j)} \geq Y_{(i)}.
  3. Define the new correlation coefficient as

\begin{aligned} \xi_n (X, Y) = 1 - \dfrac{n \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{2 \sum_{i=1}^n l_i (n - l_i)}. \end{aligned}

If there are no ties among the Y’s, the denominator simplifies and we don’t have to compute the l_i’s:

\begin{aligned} \xi_n (X, Y) = 1 - \dfrac{3 \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{n^2 - 1}. \end{aligned}
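To make this concrete, here is a minimal Python sketch of the recipe above (the function name xi_n is my own choice, and I use the general formula with the l_i’s since it is valid whether or not there are ties in Y):

```python
import numpy as np

def xi_n(x, y, rng=None):
    """Sample version of Chatterjee's coefficient xi_n(X, Y)."""
    rng = np.random.default_rng() if rng is None else rng
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    # Step 1: sort the pairs by X; the random secondary key breaks
    # ties among equal X values uniformly at random.
    y = y[np.lexsort((rng.random(n), x))]
    # Step 2: r_i = #{j : Y_(j) <= Y_(i)} and l_i = #{j : Y_(j) >= Y_(i)},
    # obtained in O(n log n) time via one sort and binary searches.
    y_sorted = np.sort(y)
    r = np.searchsorted(y_sorted, y, side="right")
    l = n - np.searchsorted(y_sorted, y, side="left")
    # Step 3: the general formula, valid with or without ties in Y.
    return 1 - n * np.sum(np.abs(np.diff(r))) / (2 * np.sum(l * (n - l)))
```

Since only ranks are used, the whole computation boils down to a couple of sorts, which is where the O(n \log n) running time mentioned below comes from.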

What is the sample correlation coefficient estimating?

The following theorem tells us what \xi_n(X,Y) is trying to estimate:

Theorem: As n \rightarrow \infty, \xi_n(X, Y) converges almost surely to the deterministic limit

\begin{aligned} \xi (X, Y) = \dfrac{\int \text{Var}(\mathbb{E}[1_{\{ Y \geq t\}} \mid X]) \, d\mu(t) }{\int \text{Var}(1_{\{ Y \geq t\}}) \, d\mu(t) }, \end{aligned}

where \mu is the law of Y.

\xi(X, Y) seems like a nasty quantity, but it has some nice properties:

  1. \xi(X, Y) always belongs to [0, 1]. (This follows immediately from the law of total variance.)
  2. \xi(X, Y) = 0 iff X and Y are independent, and \xi(X, Y) = 1 iff there is a measurable function f: \mathbb{R} \mapsto \mathbb{R} such that Y = f(X) almost surely.
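As a quick sanity check on Property 2, here is a small simulation using the xi_n function sketched above (the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)

# Independent pair: xi_n should be close to 0.
print(xi_n(x, rng.normal(size=n), rng=rng))

# Y a deterministic (and non-monotonic) function of X: xi_n close to 1.
print(xi_n(x, x**2, rng=rng))

# But X is not a function of Y = X^2 (it could be +sqrt(Y) or -sqrt(Y)),
# so the reversed coefficient stays well below 1.
print(xi_n(x**2, x, rng=rng))
```

The last line also previews the asymmetry discussed in the next section.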

A later paper by Mona Azadkia and Chatterjee extends \xi to capture degrees of conditional dependence.

Some properties of \xi_n(X, Y) and \xi(X, Y)

Here are some other key properties of this new correlation coefficient:

  1. \xi is not symmetric in X and Y, i.e. often we will have \xi(X, Y) \neq \xi(Y, X). It can be symmetrized by considering \max \{ \xi(X, Y), \xi(Y, X)\} as the correlation coefficient instead. Chatterjee notes that we might want this asymmetry in certain cases: “we may want to understand if Y is a function of X, and not just if one of the variables is a function of the other”.
  2. \xi_n(X, Y) remains unchanged if we apply strictly increasing transformations to X and Y, since it depends only on the ranks of the data.
  3. Since \xi_n(X, Y) is only based on ranks, it can be computed in O(n \log n) time.
  4. We have some asymptotic theory for \xi_n under the assumption of independence:

    Theorem: Suppose X and Y are independent and Y is continuous. Then \sqrt{n} \xi_n (X, Y) \rightarrow \mathcal{N}(0, 2/5) in distribution as n \rightarrow \infty.

    (The paper has a corresponding result for when Y is not continuous.) This theorem allows us to construct a hypothesis test of independence based on this correlation coefficient.
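For illustration, here is one way such a test could look (the helper name xi_independence_test is my own, and it reuses the xi_n sketch from above; the test is one-sided because under dependence \xi_n concentrates around a positive limit):

```python
import numpy as np
from scipy.stats import norm

def xi_independence_test(x, y, rng=None):
    """Test H0: X and Y independent (Y continuous), using the
    asymptotic null distribution sqrt(n) * xi_n ~ N(0, 2/5)."""
    n = len(x)
    stat = np.sqrt(n) * xi_n(x, y, rng=rng)
    # One-sided p-value: large positive xi_n is evidence against H0.
    return stat, norm.sf(stat, scale=np.sqrt(2 / 5))

# Example: independent samples should give a large p-value.
rng = np.random.default_rng(1)
x, y = rng.normal(size=1000), rng.normal(size=1000)
print(xi_independence_test(x, y, rng=rng))
```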

References:

  1. Chatterjee, S. (2019). A new coefficient of correlation.
