I recently learned of a new correlation coefficient, introduced by Sourav Chatterjee, with some really cool properties. As stated in the abstract of the paper (Reference 1 below), this coefficient…
- … is as simple as the classical coefficients of correlation (e.g. Pearson correlation and Spearman correlation),
- … consistently estimates a quantity that is 0 iff the variables are independent and 1 iff one is a measurable function of the other, and
- … has a simple asymptotic theory under the hypothesis of independence (like the classical coefficients).
What was surprising to me was Point 2: this is the first known correlation coefficient that measures the degree of functional dependence such that we get independence iff the coefficient equals 0 and deterministic functional dependence iff it equals 1. (The “iff”s are not typos: they are short for “if and only if”.) This can be viewed as a generalization of the Pearson correlation coefficient, which measures the degree of linear dependence between two random variables. (The author points out in point 5 of the Introduction and in Section 6 that the maximal information coefficient and the maximal correlation coefficient do not have this property, even though they are sometimes thought to have it.)
Defining the sample correlation coefficient
Let $X$ and $Y$ be real-valued random variables such that $Y$ is not a constant, and let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be i.i.d. pairs with the same distribution as $(X, Y)$. The new correlation coefficient can be computed as follows:
- Rearrange the data as $(X_{(1)}, Y_{(1)}), \ldots, (X_{(n)}, Y_{(n)})$ so that the $X$ values are in increasing order, i.e. $X_{(1)} \leq \cdots \leq X_{(n)}$. If there are ties, break them uniformly at random.
- For each index $i$, let $r_i$ be the rank of $Y_{(i)}$, i.e. the number of $j$ such that $Y_{(j)} \leq Y_{(i)}$, and let $l_i$ be the number of $j$ such that $Y_{(j)} \geq Y_{(i)}$.
- Define the new correlation coefficient as

$$\xi_n(X, Y) := 1 - \frac{n \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{2 \sum_{i=1}^n l_i (n - l_i)}.$$

If there are no ties among the $Y_i$'s, the denominator simplifies and we don't have to compute the $l_i$'s:

$$\xi_n(X, Y) := 1 - \frac{3 \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{n^2 - 1}.$$
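To make the computation concrete, here is a minimal sketch in Python of the no-ties version of the coefficient. (The function name `xi_n` is my own; the sketch assumes the $Y_i$'s are distinct, so the simplified formula applies.)

```python
import numpy as np

def xi_n(x, y):
    """Sample version of Chatterjee's correlation coefficient (no-ties formula).

    A sketch assuming no ties among the y values; with ties, we would need
    random tie-breaking and the general formula involving the l_i's.
    """
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    # Step 1: sort the pairs by x, keeping each y with its x.
    y_sorted = y[np.argsort(x)]
    # Step 2: r[i] = rank of y_sorted[i], i.e. #{j : y_sorted[j] <= y_sorted[i]}.
    r = np.argsort(np.argsort(y_sorted)) + 1
    # Step 3: the no-ties formula.
    return 1 - 3 * np.abs(np.diff(r)).sum() / (n**2 - 1)
```

Note that for a perfectly monotone sample (e.g. $Y_i = X_i$) the sum of rank differences is $n - 1$, so the coefficient is $1 - 3/(n+1)$ rather than exactly 1: the coefficient only approaches 1 as $n \to \infty$.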
What is the sample correlation coefficient estimating?
The following theorem tells us what $\xi_n(X, Y)$ is trying to estimate:

Theorem: As $n \to \infty$, $\xi_n(X, Y)$ converges almost surely to the deterministic limit

$$\xi(X, Y) = \frac{\int \text{Var}(\mathbb{E}[1_{\{Y \geq t\}} \mid X]) \, d\mu(t)}{\int \text{Var}(1_{\{Y \geq t\}}) \, d\mu(t)},$$

where $\mu$ is the law of $Y$.

$\xi(X, Y)$ seems like a nasty quantity but it has some nice properties:
- $\xi(X, Y)$ always belongs to $[0, 1]$. (This follows immediately from the law of total variance.)
- $\xi(X, Y) = 0$ iff $X$ and $Y$ are independent, and $\xi(X, Y) = 1$ iff there is a measurable function $f: \mathbb{R} \to \mathbb{R}$ such that $Y = f(X)$ almost surely.
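These two endpoint properties are easy to see empirically. Below is a small simulation sketch in Python (the helper `xi_n` is my own implementation of the no-ties formula for the sample coefficient): for independent draws the coefficient lands near 0, while for $Y = \sin(4X)$, a deterministic but non-monotone function of $X$ where the Pearson correlation would be near 0, it lands near 1.

```python
import numpy as np

def xi_n(x, y):
    # No-ties formula for the sample coefficient (assumes distinct y values).
    n = len(x)
    r = np.argsort(np.argsort(np.asarray(y)[np.argsort(x)])) + 1
    return 1 - 3 * np.abs(np.diff(r)).sum() / (n**2 - 1)

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y_indep = rng.normal(size=2000)   # independent of x
y_func = np.sin(4 * x)            # a deterministic, non-monotone function of x

print(xi_n(x, y_indep))  # near 0
print(xi_n(x, y_func))   # near 1
```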
Some properties of $\xi_n$ and $\xi$
Here are some other key properties of this new correlation coefficient:
- $\xi_n(X, Y)$ is not symmetric in $X$ and $Y$, i.e. often we will have $\xi_n(X, Y) \neq \xi_n(Y, X)$. It can be symmetrized by considering $\max\{\xi_n(X, Y), \xi_n(Y, X)\}$ as the correlation coefficient instead. Chatterjee notes that we might want this asymmetry in certain cases: “we may want to understand if $Y$ is a function of $X$, and not just if one of the variables is a function of the other”.
- $\xi_n(X, Y)$ remains unchanged if we apply strictly increasing transformations to $X$ and $Y$, as it is based only on the ranks of the data.
- Since $\xi_n(X, Y)$ is based only on ranks, it can be computed in $O(n \log n)$ time.
- We have some asymptotic theory for $\xi_n(X, Y)$ under the assumption of independence:

Theorem: Suppose $X$ and $Y$ are independent and $Y$ is continuous. Then $\sqrt{n} \, \xi_n(X, Y) \to N(0, 2/5)$ in distribution as $n \to \infty$.

(The paper has a corresponding result for the case where $Y$ is not continuous.) This theorem allows us to construct a hypothesis test of independence based on this correlation coefficient.
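As an illustration, here is one way such a test could be sketched in Python, using the $N(0, 2/5)$ limit to produce a one-sided p-value. (The helper names `xi_n` and `independence_test` are my own; this is a sketch under the no-ties assumption, not the paper's implementation. Dependence pushes the coefficient upward, so the test rejects for large positive values of the standardized statistic.)

```python
import math
import numpy as np

def xi_n(x, y):
    # No-ties formula for the sample coefficient (assumes distinct y values).
    n = len(x)
    r = np.argsort(np.argsort(np.asarray(y)[np.argsort(x)])) + 1
    return 1 - 3 * np.abs(np.diff(r)).sum() / (n**2 - 1)

def independence_test(x, y):
    """One-sided p-value for H0: X and Y are independent.

    Under H0 (with Y continuous), sqrt(n) * xi_n is approximately N(0, 2/5),
    so we standardize and compute an upper-tail normal probability.
    """
    z = math.sqrt(len(x)) * xi_n(x, y) / math.sqrt(2 / 5)
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(N(0,1) > z)

rng = np.random.default_rng(1)
x = rng.normal(size=500)
p_indep = independence_test(x, rng.normal(size=500))           # independent pair
p_dep = independence_test(x, x + 0.5 * rng.normal(size=500))   # dependent pair
print(p_indep, p_dep)
```

At small sample sizes one might prefer a permutation p-value (permute one variable and recompute the coefficient) over this asymptotic approximation.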
References:
1. Chatterjee, S. (2019). A new coefficient of correlation.