What is the Atkinson index?

The Atkinson index, introduced by Atkinson (1970) (Reference 1), is a measure of inequality used in economics. Given a population with values x_1, \dots, x_n and an inequality-aversion parameter \epsilon, the Atkinson index is defined as

\begin{aligned} A_\epsilon (x_1, \dots, x_n) = \begin{cases} 1 - \left( \frac{1}{n} \sum_{i=1}^n x_i^{1 - \epsilon} \right)^{1/(1 - \epsilon)} \big/ \left( \frac{1}{n} \sum_{i=1}^n x_i \right) &\text{if } \epsilon \neq 1, \\ 1 - \left( \prod_{i=1}^n x_i \right)^{1/n} \big/ \left( \frac{1}{n} \sum_{i=1}^n x_i \right) &\text{if } \epsilon = 1. \end{cases} \end{aligned}

If we denote the Hölder mean by

\begin{aligned} M_p (x_1, \dots, x_n) = \begin{cases} \left( \frac{1}{n} \sum_{i=1}^n x_i^p \right)^{1/p} &\text{if } p \neq 0, \\ \left( \prod_{i=1}^n x_i \right)^{1/n} &\text{if } p = 0, \end{cases} \end{aligned}

then the Atkinson index is simply

\begin{aligned} A_\epsilon (x_1, \dots, x_n) = 1 - \dfrac{M_{1-\epsilon} (x_1, \dots, x_n)}{M_1 (x_1, \dots, x_n)}. \end{aligned}

While the index is defined for all \epsilon, we restrict \epsilon to be \geq 0. (Some of the properties in the next section would not hold otherwise.)

Properties of the Atkinson index

The Atkinson index has a number of nice properties:

  1. The index lies between 0 and 1, and is equal to 0 if and only if x_1 = \dots = x_n. Smaller values of the index indicate lower levels of inequality.
  2. It satisfies the population replication axiom: If we replicate the population \{ x_1, \dots, x_n \} any number of times, the index remains the same.
  3. It satisfies homogeneity: If we multiply x_1, \dots, x_n by some positive constant k, the index remains the same.
  4. It satisfies the principle of transfers: If we transfer t > 0 from x_i to x_j such that x_i - t > x_j + t, the index does not increase.
  5. It is subgroup decomposable: If we partition the population into subgroups, we can express the index for the entire population as a weighted sum of the indices for the population subgroups plus the index for the subgroup means. (The popular Gini index is not subgroup decomposable.)
  6. It has only one parameter, \epsilon, which represents the decision maker’s aversion to inequality. (See Section 2.3 of Reference 3 for one method of eliciting \epsilon.)
  7. It is computationally scalable: the sums in the numerator and denominator are amenable to the map-reduce paradigm, and so the index can be computed in a distributed fashion.
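
To make the last property concrete, here is a minimal sketch (my own illustration, not part of the DescTools package): for \epsilon \neq 1, the index depends on the data only through n, \sum_i x_i and \sum_i x_i^{1-\epsilon}, so each chunk of a distributed dataset can be summarized independently (“map”) and the summaries added up (“reduce”).

atkinson_from_summaries <- function(n, sum_x, sum_x_pow, eps) {
  # Atkinson index for eps != 1, computed from the three sufficient summaries
  1 - (sum_x_pow / n)^(1 / (1 - eps)) / (sum_x / n)
}

set.seed(1)
eps <- 0.5
x <- rexp(1e5)                                    # pretend x is spread across 4 machines
chunks <- split(x, rep(1:4, length.out = length(x)))

# "map": per-chunk summaries; "reduce": add the summaries together
summaries <- sapply(chunks, function(ch)
  c(n = length(ch), s = sum(ch), sp = sum(ch^(1 - eps))))
totals <- rowSums(summaries)

atkinson_from_summaries(totals["n"], totals["s"], totals["sp"], eps)
# agrees with Atkinson(x, parameter = eps) from the DescTools package (introduced below)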

Some intuition for the Atkinson index

In R, the Atkinson function in the DescTools package implements the Atkinson index. It is so simple that I can reproduce the whole function here (most of the function is dedicated to checking for NA values):

function (x, n = rep(1, length(x)), parameter = 0.5, na.rm = FALSE) 
{
    x <- rep(x, n)
    if (na.rm) 
        x <- na.omit(x)
    if (any(is.na(x)) || any(x < 0)) 
        return(NA_real_)
    if (is.null(parameter)) 
        parameter <- 0.5
    if (parameter == 1) 
        A <- 1 - (exp(mean(log(x)))/mean(x))
    else {
        x <- (x/mean(x))^(1 - parameter)
        A <- 1 - mean(x)^(1/(1 - parameter))
    }
    A
}

To get some intuition for the Atkinson index, let’s look at the index for a population consisting of just 2 people. By homogeneity, we can assume that the first person has value 1; we will denote the second person’s value by x. We plot the Atkinson index for \{ 1, x \} and \epsilon = 1, with x ranging from 10^{-4} to 10^4:

library(DescTools)

x <- 10^(-40:40 / 10)
eps <- 1
atkinsonIndex <- sapply(x,
  function(x) Atkinson(c(1, x), parameter = eps))

# log10 x axis
par(mfrow = c(1, 2))
plot(x, atkinsonIndex, type = "l", log = "x",
     ylab = "Atkinson index for (1, x)",
     main = "Atkinson index, eps = 1 (log x-axis)")

# regular x axis
plot(x, atkinsonIndex, type = "l", xlim = c(0, 1000),
     ylab = "Atkinson index for (1, x)",
     main = "Atkinson index, eps = 1 (linear x-axis)")

The two plots show the same curve, with the only difference being the x-axis (log scale on the left, linear scale on the right). The curve is symmetric around x = 1 when the x-axis is on the log scale. We expect this since, by homogeneity, the index for \{ 1, x \} is the same as the index for \{ 1/x, 1\}.

Next, we look at the Atkinson index for \{ 1, x\} for a range of values for the \epsilon parameter:

x <- 10^(0:40 / 10)
epsList <- 10^(-2:2 / 4)

plot(c(1, 10^4), c(0, 1), log = "x", type = "n",
     xlab = "x", ylab = "Atkinson index for (1, x)",
     main = "Atkinson index for various epsilon")
for (i in seq_along(epsList)) {
  atkinsonIndex <- sapply(x,
                          function(x) Atkinson(c(1, x), parameter = epsList[i]))
  lines(x, atkinsonIndex, col = i, lty = i, lwd = 2)
}
legend("topleft", legend = sprintf("%.2f", epsList), 
       col = seq_along(epsList), lty = seq_along(epsList), lwd = 2)

The larger \epsilon is, the more “inequality-averse” we are. For fixed x, the Atkinson index for \{ 1, x\} increases as \epsilon increases.

Finally, let’s look at what values the Atkinson index might take for samples taken from different distributions. In each of the panels below, we take 100 samples, each of size 1000. The samples are drawn from a log-t distribution with a given number of degrees of freedom (that is, the log of the values follows a t distribution). For each of these 100 samples, we compute the Atkinson index (with the default \epsilon = 0.5), then make a histogram of the 100 index values. (The t distribution with df = 50 is basically indistinguishable from the standard normal distribution.)

nsim <- 100
n <- 1000
dfList <- c(50, 10, 5, 3)

png("various-t-df.png", width = 900, height = 700, res = 120)
par(mfrow = c(2, 2))
set.seed(1)
for (dfVal in dfList) {
  atkinsonIndices <- replicate(nsim, Atkinson(exp(rt(n, df = dfVal))))
  hist(atkinsonIndices, xlim = c(0, 1),
       xlab = "Atkinson Index",
       main = paste("Histogram of Atkinson indices, df =", dfVal))
}
dev.off()

References:

  1. Atkinson, A. B. (1970). On the Measurement of Inequality.
  2. Wikipedia. Atkinson index.
  3. Saint-Jacques, G., et al. (2020). Fairness through Experimentation: Inequality in A/B testing as an approach to responsible design.

In R, 1 == "1" evaluates to TRUE

The title says it all. This appears to work for any number, including non-integers and negative numbers:

> 1 == "1"
[1] TRUE
> 123 == "123"
[1] TRUE
> -6 == "-6"
[1] TRUE
> 1.2 == "1.2"
[1] TRUE
> -0.1 == "-0.1"
[1] TRUE

I can see how this feature might be convenient, but my gut instinct is that it is unintuitive behavior. Hence it is dangerous and probably should never be used in production code.

Here’s a (contrived) example of possible unintended behavior:

> 0 == -0
[1] TRUE
> 0 == "-0"
[1] FALSE
> -0 == "-0"
[1] FALSE
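
The explanation (see ?Comparison) is that when the two arguments of == have different types, one is coerced to the other, with character taking precedence over numeric; the numeric is converted to a string before the comparison. A quick check:

> as.character(0)
[1] "0"
> as.character(-0)
[1] "0"
> "0" == "-0"
[1] FALSE
> as.character(1.2)
[1] "1.2"

So 0 == "-0" and -0 == "-0" both compare the string "0" with the string "-0", which are unequal, while 1.2 == "1.2" compares "1.2" with "1.2".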

What is an Equal Percent Bias Reducing (EPBR) matching method?

Introduction

A fundamental problem faced in causal inference is that the distribution of the pre-treatment covariates among the treatment units is different from that among the control units. Any difference in the dependent variable between treatment and control cannot be ascribed to the treatment itself because it could be due to the distributional differences as well. Matching methods match each of the treatment units to one or more control units which are as similar as possible with respect to the pre-treatment variables to obtain better “balance”. After matching, we might feel more comfortable saying that the treatment and control group are the same except for the fact that one group received the treatment. This provides a more apples-to-apples comparison.

The equal percent bias reducing (EPBR) property

The equal percent bias reducing (EPBR) property was introduced by Rubin (1976) (Reference 1). We are going to need a fair amount of notation to describe this property. The notation here follows that in Iacus et al. (2011) (Reference 2). (For an answer to “why EPBR?”, skip to the next section.)

Assume that we are in the potential outcomes framework for causal inference:

  • We have a sample of n units. Let n_T and n_C denote the number of units in treatment and control respectively. (It follows that n_T + n_C = n.) We typically assume that n_T \geq n_C. Assume that n_T and n_C are fixed.
  • Let \mathcal{T} and \mathcal{C} denote the indices of the units in treatment and control respectively.
  • For unit i, let \mathbf{X}_i be a k-dimensional vector of pre-treatment covariates that we observe for the unit.
  • For t = 0, 1, let \mu_t = \mathbb{E}[\mathbf{X} \mid T = t] be a vector of expected values.

For a matching method,

  • Let M_\mathcal{T} \subseteq \mathcal{T} and M_\mathcal{C} \subseteq \mathcal{C} denote the indices of the matched units in the two groups.
  • Let m_T and m_C denote the number of treated and control units matched.
  • Let \bar{\mathbf{X}}_{m_T} = \frac{1}{m_T}\sum_{i \in M_\mathcal{T}} \mathbf{X}_i and \bar{\mathbf{X}}_{m_C} = \frac{1}{m_C}\sum_{i \in M_\mathcal{C}} \mathbf{X}_i denote the sample means of the treatment and control units in the matched data.

A matching method is said to be equal percent bias reducing (EPBR) if it satisfies

\begin{aligned} \mathbb{E}\left[ \bar{\mathbf{X}}_{m_T} - \bar{\mathbf{X}}_{m_C} \right] = \gamma (\mu_1 - \mu_0) \end{aligned}

for some scalar 0 < \gamma < 1. Note that the equation above is vector-valued. Hence, it means that the percent reduction in “bias” (the difference in means between treatment and control) is the same across the pre-treatment variables.

Why do we care about EPBR?

A matching method yields the same percent reduction in bias for every linear function of the pre-treatment covariates if and only if it is EPBR. As the original paper states, for a linear function \beta^\top \mathbf{X}, the expected bias in the observed data is (\mu_1 - \mu_0)^\top \beta. On the other hand, the expected bias for the matched data is

\begin{aligned} \mathbb{E}\left[ (\bar{\mathbf{X}}_{m_T} - \bar{\mathbf{X}}_{m_C})^\top \beta \right] = \gamma (\mu_1 - \mu_0)^\top \beta, \end{aligned}

which is smaller in magnitude since 0 < \gamma < 1.

This implication matters if we believe that the dependent variable (or some transformation of it) is (approximately) linearly related to the pre-treatment covariates. For a long time, the models we set up in causal inference made this assumption, and so EPBR was an appealing property for a matching method to have.

Phrased negatively, if a matching method is not EPBR, then matching can substantially increase the bias for some linear function of the pre-treatment covariates. This means that with the assumption above, a non-EPBR matching method could increase the bias of our estimation procedure.

Of course, if we don’t believe that the dependent variable is linearly related to the pre-treatment covariates, or if we believe that the dependent variable is much more affected by some variables than others, then the EPBR property loses its appeal. The original paper acknowledges as much:

It should be noted also that an EPBR matching method helps us to get mean balance only. This can be quite different from what we really want, which is for the distributions of the treatment and control samples to be as similar as possible.

Some examples of EPBR

For a matching method to be EPBR, we usually need some assumptions. Reference 2 notes that the most general version of the assumptions requires that

  1. The \mathbf{X}_i are drawn from a population distribution which is an ellipsoidally symmetric density or a discriminant mixture of proportional ellipsoidally symmetric densities; and
  2. The matching algorithm is invariant to affine transformations of \mathbf{X}.

Several matching methods are known to satisfy the second assumption (e.g. propensity score matching, unweighted Mahalanobis matching, discriminant matching), but the first assumption is much harder to check and satisfy.
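
As a small numerical check of the second assumption (my own illustration, not from the references), Mahalanobis distances are invariant to invertible affine transformations of the covariates, which is what makes unweighted Mahalanobis matching affine invariant:

set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3)                # covariates for 200 units
A <- matrix(c(2, 0, 1, 1, 3, 0, 0, 1, 1), nrow = 3)  # an invertible matrix
b <- c(5, -2, 0)
Y <- X %*% A + matrix(b, nrow(X), 3, byrow = TRUE)   # affine transform of the covariates

d_X <- mahalanobis(X, colMeans(X), cov(X))
d_Y <- mahalanobis(Y, colMeans(Y), cov(Y))
max(abs(d_X - d_Y))   # essentially zero: the distances are unchanged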

References:

  1. Rubin, D. B. (1976). Multivariate Matching Methods That are Equal Percent Bias Reducing, I: Some Examples.
  2. Iacus, S. M., et al. (2011). Multivariate Matching Methods That Are Monotonic Imbalance Bounding.

Yule-Walker equations for AR(p) processes and Toeplitz matrices

In this post I wrote some time back, I introduced the notion of a Toeplitz matrix, which simply put is a diagonal-constant matrix:

A = \begin{pmatrix} a_0 & a_{-1} & \dots & \dots & a_{-(n-1)} \\  a_1 & a_0 & a_{-1} &\ddots & \vdots \\  \vdots & \ddots & \ddots & \ddots & \vdots \\  \vdots & \ddots & a_1 & a_0 & a_{-1} \\  a_{n-1} & \dots & \dots & a_1 & a_0  \end{pmatrix}.

In that post I mentioned two settings where Toeplitz matrices appear. In this past week I came across another one: the Yule-Walker equations for AR(p) processes.

Recall that an AR(p) process is defined by

\begin{aligned} X_t = Z_t + \phi_1 X_{t-1} + \dots + \phi_p X_{t-p}, \end{aligned}

with Z_0, Z_1, \dots being i.i.d. random variables with mean 0 and variance \sigma^2. Assuming (weak) stationarity, we define the autocorrelation function as \rho (k) = \text{Cor} (X_t, X_{t-k}) for k \in \mathbb{Z}. Notice that \rho (0) = 1 and that \rho (k) = \rho (-k) for all k.

Multiplying the definition above by X_{t-k}, taking expectations and using the fact that \mathbb{E}[X_t] = 0 for all t,

\begin{aligned} \text{Cov}(X_t, X_{t-k}) = \phi_1 \text{Cov}(X_{t-1}, X_{t-k}) + \dots  + \phi_p \text{Cov}(X_{t-p}, X_{t-k}). \end{aligned}

Since all the X_t’s have the same variance (by stationarity), the above implies

\begin{aligned} \rho (k) = \phi_1 \rho(k-1) + \dots  + \phi_p \rho (k - p). \qquad (1) \end{aligned}

Equation (1) is valid for all k \geq 1. Hence, if we plug in k = 1, 2, \dots, p and write the resulting equations in matrix form, we get

\begin{aligned} \begin{pmatrix} \rho (1) \\ \rho (2) \\ \vdots \\ \rho (p-1) \\ \rho(p) \end{pmatrix} = \begin{pmatrix} 1 & \rho (1) & \rho (2) & \dots & \rho (p-1) \\ \rho (1) & 1 & \rho (1) & \dots & \rho (p-2) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho (p-2) & \rho (p-3) & \rho (p-4) & \dots & \rho (1) \\ \rho (p-1) & \rho (p-2) & \rho (p-3) & \dots & 1 \end{pmatrix} \begin{pmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_{p-1} \\ \phi_p \end{pmatrix} \end{aligned}

These are known as the Yule-Walker equations. The first matrix on the RHS is a Toeplitz matrix. (It is also symmetric!)

Note: What are the Yule-Walker equations for? They are used to estimate the coefficients of an AR(p) process: we use the data to estimate the autocorrelation function \rho, plug these estimates into the equation above, then pre-multiply both sides by the inverse of the Toeplitz autocorrelation matrix to solve for \phi_1, \dots, \phi_p.
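
Here is a minimal sketch of this procedure in R (the AR(2) coefficients and sample size are my own choices for illustration), solving the Toeplitz system by hand and comparing with the built-in Yule-Walker fit from ar.yw:

set.seed(1)
p <- 2
x <- arima.sim(list(ar = c(0.5, 0.3)), n = 5000)

# sample autocorrelations rho(0), rho(1), ..., rho(p)
rho <- acf(x, lag.max = p, plot = FALSE)$acf[, 1, 1]

R <- toeplitz(rho[1:p])   # Toeplitz autocorrelation matrix (entries rho(0), ..., rho(p-1))
r <- rho[2:(p + 1)]       # vector (rho(1), ..., rho(p))

solve(R, r)               # estimates of (phi_1, phi_2), close to c(0.5, 0.3)
ar.yw(x, order.max = p, aic = FALSE)$ar   # built-in Yule-Walker estimates, nearly identical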

References:

  1. The State University of New York’s Practical Time Series course on Coursera. Week 4: All slides together for the next two lessons.

Why we can ignore the constant term for AR(p) processes

In this previous post, I introduced the AR(p) process as having the form

\begin{aligned} X_t = Z_t + \phi_1 X_{t-1} + \dots + \phi_p X_{t-p}, \qquad (1) \end{aligned}

with Z_0, Z_1, \dots being i.i.d. random variables with mean 0 and variance \sigma^2. In some references, the AR(p) process has an additional constant term on the RHS:

\begin{aligned} X_t = Z_t + \phi_0 + \phi_1 X_{t-1} + \dots + \phi_p X_{t-p}, \qquad (2) \end{aligned}

If we assume the AR(p) process is weakly stationary, then we can take \phi_0 = 0 without loss of generality. (And we almost always assume stationarity when working with AR(p) processes.)

Since we assume stationarity, \mathbb{E}[X_t] = \mu for all t for some \mu \in \mathbb{R}. Taking expectations on both sides for (2):

\begin{aligned} \mu = 0 + \phi_0 + \phi_1 \mu + \dots + \phi_p \mu. \qquad (3) \end{aligned}

Subtracting (3) from (2):

\begin{aligned} (X_t - \mu) = Z_t + \phi_1 (X_{t-1} - \mu) + \dots + \phi_p (X_{t-p} - \mu). \end{aligned}

If we let Y_t = X_t - \mu, then the above is simply (1) but with the X_t’s replaced by Y_t’s. Thus, if we have an AR(p) process that has a constant term, we can simply work with the mean-centered version instead.
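
As a quick numerical illustration (the simulation settings are my own), an AR(2) process with a constant term is just a shifted zero-mean AR(2) process, and arima() reports the mean \mu (labelled “intercept”) separately from the AR coefficients:

set.seed(1)
phi <- c(0.5, 0.3)
mu  <- 10                                      # the process mean; phi_0 = mu * (1 - sum(phi))
x   <- arima.sim(list(ar = phi), n = 5000) + mu

fit <- arima(x, order = c(2, 0, 0))
fit$coef
# "ar1" and "ar2" are close to 0.5 and 0.3, and the "intercept" reported by
# arima() is actually the mean mu (about 10), not the constant term phi_0.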

References:

  1. The State University of New York’s Practical Time Series course on Coursera. Week 4: All slides together for the next two lessons.

The relationship between MA(q)/AR(p) processes and ACF/PACF

The relationship between the MA(q) process and ACF

Assume that Z_0, Z_1, \dots are i.i.d. random variables with mean 0 and variance \sigma^2. The moving average process of order q (or the MA(q) process) has the form

\begin{aligned} X_t = \theta_0 Z_t + \theta_1 Z_{t-1} + \dots + \theta_q Z_{t-q}. \end{aligned}

Note that the RHS actually has q + 1 (not q) terms in total. (Also in many references, \theta_0 is constrained to be equal to 1; here I have left it unconstrained.)

For a weakly stationary process X_t, its autocovariance function is defined as

\gamma (k) = \text{Cov}(X_t, X_{t-k}) = \mathbb{E}[X_t X_{t-k}] \quad \text{for } k \in \mathbb{Z}.

(The second equality is true if \mathbb{E}[X_t] = 0 for all t, which is usually the case in these settings.) Its autocorrelation function (ACF) is defined as

\rho (k) = \text{Cor}(X_t, X_{t-k}) = \dfrac{\text{Cov}(X_t, X_{t-k})}{[\text{Var}(X_t) \text{Var}(X_{t-k})]^{1/2}} = \dfrac{\gamma (k)}{\gamma (0)} \quad \text{for } k \in \mathbb{Z}.

For an MA(q) process, the ACF cuts off after lag q. That is,

\rho (k) = \gamma(k) = 0 \quad \text{for } k \geq q + 1.
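
Concretely, writing X_t = \sum_{j=0}^q \theta_j Z_{t-j} and X_{t-k} = \sum_{l=0}^q \theta_l Z_{t-k-l}, the independence of the Z_t’s means that only terms with matching indices (j = l + k) contribute to the covariance, so

\begin{aligned} \gamma (k) = \sigma^2 \sum_{l=0}^{q-k} \theta_{l+k} \theta_l \quad \text{for } 0 \leq k \leq q, \end{aligned}

while for k \geq q + 1 the two sums share no common Z terms and the covariance vanishes.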

This fact can be verified by directly plugging the definition of the MA(q) process into the definition of the autocovariance function, as in the display above. Here is a little simulation demonstrating it for an MA(3) process:

set.seed(1)
x <- arima.sim(list(ma = c(1, 0.5, 1)), n = 2000)
acf(x, main = "Autocorrelation function for x")

The ACF cuts off after lag 3. Note that there are 4 (not 3) significantly non-zero lines: one for lag 0 (we always have \rho(0) = 1) and 3 for the 3 positive lags. For an MA(q) process, we expect to see q+1 “peaks” in the ACF plot.

This can be used as a method to determine the order of a moving average process: estimate the ACF, and the largest lag with significantly non-zero autocorrelation is the order of the process. (Note, however, that if you try different random seeds or different coefficients, the break between the non-zero and zero autocorrelations is not always obvious.)

The relationship between the AR(p) process and PACF

The autoregressive process of order p (or the AR(p) process) has the form

\begin{aligned} X_t = Z_t + \phi_1 X_{t-1} + \dots + \phi_p X_{t-p}. \end{aligned}

Here there are p + 1 terms on the RHS, with one of them belonging to the Z process and p of them belonging to the X process. The coefficient of Z_t is constrained to be 1.

For a weakly stationary process X_t, its partial autocorrelation function is defined by

\begin{aligned} \rho (X_t, X_{t-k}) = \dfrac{\text{Cov}(X_t, X_{t-k} \mid X_{t-1}, X_{t-2}, \dots, X_{t - (k-1)})}{ [\text{Var}(X_t \mid X_{t-1}, X_{t-2}, \dots, X_{t - (k-1)}) \text{Var}(X_{t-k} \mid X_{t-1}, X_{t-2}, \dots, X_{t - (k-1)})]^{1/2} } \quad \text{for } k \geq 1. \end{aligned}

This is basically the same formula as that for the ACF except we condition on X_{t-1}, \dots, X_{t-(k-1)}.

For an AR(p) process, the PACF cuts off after lag p. That is,

\rho (X_t, X_{t-k}) = 0 \quad \text{for } k \geq p + 1.

Here is a simulation illustrating this for an AR(3) process:

set.seed(1)
x <- arima.sim(list(ar = c(1, -0.6, 0.4)), n = 2000)
acf(x, type = "partial", main = "Partial autocorrelation function for x")

Note that the PACF plot starts at lag 1, not lag 0 like the ACF plot. As expected, the PACF is essentially zero for lags above 3. For an AR(p) process, we expect to see p “peaks” in the PACF plot.

This can be used as a method to determine the order of an autoregressive process: estimate the PACF and the largest lag with significantly non-zero partial autocorrelation is the order of the process.

What are Shapley values (in interpretable machine learning)?

Shapley values are a cornerstone of interpretable machine learning, a way to attribute how much each feature played a role in a model’s prediction. This previous post describes Shapley values as conceived in the context of game theory; in this post we will explain how Shapley values can apply in the context of interpretable ML.

The question we want to answer

Imagine that we are in the supervised learning setting. We have inputs x \in \mathcal{X}, where \mathcal{X} is some M-dimensional space, and we want to use these inputs to predict an output y \in \mathbb{R}. For convenience, let the features be indexed by 1, \dots, M and let M denote both the set \{1, \dots, M\} and its size. (The context will make clear which I’m referring to.) Let’s say that we have already trained a model f : \mathcal{X} \mapsto \mathbb{R}.

For a particular test input x, the model gives the prediction f(x). How much does feature i (i = 1, \dots, M) contribute to the prediction f(x)?

(Note: The question is more commonly phrased as how much does feature i contribute to the gain in the prediction f(x) over the average prediction. We will see why in the final “Notes” section.)

Shapley values to the rescue…

This is a question that Shapley values can answer. In the interpretable ML setting, the players are the features and the payout is the gain in the prediction f(x) over the average prediction. All we have to do is apply the formula for Shapley values and we are done!

… or not (yet)?

Recall that the definition of the Shapley value for feature i is

\begin{aligned} \phi_i (v) = \sum_{S \subseteq M \setminus \{ i \}} \dfrac{|S|! (M - |S| - 1)!}{M!} [v (S \cup \{i\} ) - v(S)]. \end{aligned}

In order to use this formula, we need to define v, and that is where things start to get complicated.

A first stab at defining v(S) for a given subset S is to train a model (denoted f_S) using only the features in S, then set v(S) = f_S(x). (Recall that x is the test input whose prediction we want to interpret.) The problem with this is that we have to train O(2^M) models, which is computationally infeasible.

A second approach would be to stick with just one model, and to set v(S) = f(z(x, S)), where

\begin{aligned} z(x, S)_i = \begin{cases} x_i & \text{if } i \in S \\ \text{missing} & \text{otherwise.}\end{cases} \end{aligned}

That is, z(x, S) is x for the features in S and is missing otherwise. The problem here is that the model f might not be able to accept missing values as input (trees being a notable exception).

A way out

The trick is to fill in the missing components of z(x, S) appropriately. To do so we are going to need more notation. Given two inputs x, r \in \mathcal{X}, define the composite input z(x, r, S) \in \mathcal{X} as having entries

\begin{aligned} z(x, r, S)_i = \begin{cases} x_i & \text{if } i \in S \\ r_i & \text{otherwise.}\end{cases} \end{aligned}

For each subset S, let D_{x, S} denote some reference distribution on the space of inputs \mathcal{X}. We can then define v by

\begin{aligned} v(S) = \mathbb{E}_{r \sim D_{x, S}} \left[ f (z(x, r, S)) \right]. \end{aligned}

That is, the payoff associated with subset S is the average prediction over composite inputs where the features in S match the test input x and the other features are drawn from some reference distribution. It’s more common to define v by

\begin{aligned} v(S) = \mathbb{E}_{r \sim D_{x, S}} \left[ f (z(x, r, S)) \right] - \mathbb{E}_{r \sim D_{x, \emptyset}} \left[ f (r) \right] \end{aligned}

so that the payoff for the empty set is zero:

\begin{aligned} v(\emptyset) &= \mathbb{E}_{r \sim D_{x, \emptyset}} \left[ f (z(x, r, \emptyset)) \right] - \mathbb{E}_{r \sim D_{x, \emptyset}} \left[ f (r) \right] \\  &= \mathbb{E}_{r \sim D_{x, \emptyset}} \left[ f (r) \right] - \mathbb{E}_{r \sim D_{x, \emptyset}} \left[ f (r) \right] \\  &= 0. \end{aligned}

The remaining question is how the reference distribution D_{x, S} should be defined. It turns out that there are several different possibilities for this: see Slide 29 of Reference 1 for details. (Reference 1 is a great resource for understanding Shapley values for interpretable ML and is worth skimming through.)

Given the reference distribution, how do you compute v(S)?

For some choices of reference distribution, the expectation in v(S) can be computed analytically. For example, if we have a training set x_1, \dots, x_n available to us, the reference distribution could be the point mass at the training mean (x_1 + \dots + x_n) / n. This is equivalent to plugging in the feature-wise training means for the features not in S.

For some other choices, the best way to estimate v(S) would be to do Monte Carlo simulation by repeatedly sampling from D_{x, S} and evaluating f(z(x, r, S)), i.e.

\begin{aligned} r_1, \dots, r_B &\stackrel{i.i.d.}{\sim} D_{x, S}, \\  \mathbb{E}_{r \sim D_{x, S}} \left[ f (z(x, r, S)) \right] &\approx \dfrac{1}{B}\sum_{b=1}^B f(z(x, r_b, S)). \end{aligned}
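
Here is a minimal sketch of this Monte Carlo estimate in R (my own illustration, not taken from Reference 1). I take the reference distribution to be the empirical distribution of the training rows, and assume f is a fitted model with a standard predict() method; the model, data set and feature names below are just for illustration.

estimate_v <- function(f, x, S, X_train, B = 500) {
  idx <- sample(nrow(X_train), B, replace = TRUE)
  Z <- X_train[idx, , drop = FALSE]           # reference draws r_1, ..., r_B
  Z[, S] <- x[rep(1, B), S, drop = FALSE]     # features in S are fixed at the test input x
  mean(predict(f, newdata = Z))               # Monte Carlo estimate of E[f(z(x, r, S))]
}

# Illustration with a linear model on mtcars
f <- lm(mpg ~ wt + hp + disp, data = mtcars)
feats <- c("wt", "hp", "disp")
x <- mtcars[1, feats]                         # the input whose prediction we want to explain
v_S <- estimate_v(f, x, S = c("wt", "hp"), X_train = mtcars[, feats])
v_0 <- mean(predict(f))                       # baseline term E[f(r)] under the empirical reference
v_S - v_0                                     # the payoff v(S) as defined above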

Final notes

Note 1: By the efficiency property of Shapley values, we have

\begin{aligned} \sum_{i=1}^M \phi_i(v) &= v(M) \\  &= \mathbb{E}_{r \sim D_{x, M}} \left[ f (z(x, r, M)) \right] - \mathbb{E}_{r \sim D_{x, \emptyset}} \left[ f (r) \right] \\  &= \mathbb{E}_{r \sim D_{x, M}} \left[ f (x) \right] - \mathbb{E}_{r \sim D_{x, \emptyset}} \left[ f (r) \right] \\  &= f(x) - \mathbb{E}_{r \sim D_{x, \emptyset}} \left[ f (r) \right]. \end{aligned}

That is, Shapley values apportion the gain in the prediction f(x) over the average prediction to the individual features.

Note 2: Shapley values are tied not just to a particular model f, but also to a specific input x. Thus, Shapley values can vary from prediction to prediction.

References:

  1. Taly, A. The Explanation Game: Explaining ML models with Shapley Values.

What are Shapley values (in game theory)?

Shapley values are a cornerstone of interpretable machine learning, a way to attribute how much each feature played a role in a model’s prediction. While I may talk about Shapley values in the context of interpretable ML in the future, this post is about Shapley values in its original context: cooperative game theory.

Imagine the following coalitional game: there is a set of players N, and each player gets to decide whether they want to “cooperate” or not. Let S denote the set of players who cooperate. Based on S, the game gives a payout of v(S) which is to be split among all the players. (You can think of v as a function v : 2^N \mapsto \mathbb{R}, with v(\emptyset) = 0.) Note that v(S) can vary greatly depending on who is in S: this reflects the range of bargaining power different players have. What is a “fair” way to distribute the payout among the players if all of them cooperate?

The Shapley value is one way to distribute the gains among the players. The Shapley value for the ith player (how much player i should get) is

\begin{aligned} \phi_i (v) = \frac{1}{|N|!} \sum_R \left[ v (P_i^R \cup \{i\}) - v(P_i^R) \right],  \end{aligned}

where the sum ranges over all |N|! orders R of the players, and P_i^R is the set of players which precede player i in the order R. Here, we can imagine that there are |N|! different ways of adding players to the coalition one at a time; the Shapley value is just the mean of the additional payout that player i brings across all these different ways.

Written this way, it looks like O(|N|!) time is needed to compute Shapley values. However, noting that the set P_i^R is the same for many orders R, we can rewrite the formula as

\begin{aligned} \phi_i (v) = \sum_{S \subseteq N \setminus \{ i \}} \dfrac{|S|! (|N| - |S| - 1)!}{|N|!} [v (S \cup \{i\} ) - v(S)]. \end{aligned}

This formula is more commonly cited as the definition of Shapley values, though I personally find it less intuitive. In this form, the sum takes O(2^{|N|}) time to compute.

There is yet another way to view the Shapley value formula:

\begin{aligned} \phi_i (v) = \frac{1}{|N|} \sum_{S \subseteq N \setminus \{ i \}} \binom{|N|-1}{|S|}^{-1} [v (S \cup \{i\} ) - v(S)], \end{aligned}

or in words,

\begin{aligned} \phi_i (v) = \frac{1}{\text{no. of players}} \sum_{\text{coalitions excluding } i} \dfrac{\text{additional payout } i \text{ brings to coalition}}{\text{no. of coalitions excluding } i \text{ of this size}}. \end{aligned}

See the Wikipedia article (Reference 1) for some worked examples on how to compute Shapley values. In general it takes exponential time to compute them because the sum is over all possible subsets of N. (This is one of its largest drawbacks.)
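
As a small worked example (the game and payoffs below are my own), here is a direct R implementation of the subset formula for a 3-player game; the payoff function v takes a character vector of player names:

shapley <- function(v, players) {
  n <- length(players)
  sapply(players, function(i) {
    others <- setdiff(players, i)
    total <- 0
    for (s in 0:length(others)) {
      subsets <- if (s == 0) list(character(0)) else combn(others, s, simplify = FALSE)
      for (S in subsets) {
        weight <- factorial(s) * factorial(n - s - 1) / factorial(n)   # |S|! (|N|-|S|-1)! / |N|!
        total <- total + weight * (v(c(S, i)) - v(S))
      }
    }
    total
  })
}

# An illustrative 3-player game
v <- function(S) {
  S <- sort(S)
  if (length(S) == 0) return(0)
  if (identical(S, c("A", "B", "C"))) return(100)
  if (identical(S, c("A", "B")) || identical(S, c("A", "C"))) return(60)
  if (identical(S, c("B", "C"))) return(20)
  if (identical(S, "A")) return(10)
  0
}

shapley(v, c("A", "B", "C"))
# A gets 50, B and C get 25 each: the values sum to v({A, B, C}) = 100
# (efficiency), and B and C get the same amount (symmetry).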

Properties of Shapley values

Shapley values are popular because they have a number of desirable properties:

  1. Efficiency. The sum of the Shapley values of all players is equal to the payout when everyone cooperates: \sum_{i \in N} \phi_i(v) = v(N).
  2. Symmetry / equal treatment of equals. If two players i and j are equivalent in the sense that v (S \cup \{i\}) = v (S \cup \{j\}) for all S \subseteq N \setminus \{i, j\}, then \phi_i (v) = \phi_j (v).
  3. Linearity. If v and w are two coalitional games, then \phi_i (v + w) = \phi_i (v) + \phi_i (w). Also, for any a \in \mathbb{R}, \phi_i (av) = a \phi_i(v).
  4. Null player. If player i doesn’t contribute in the sense that v(S \cup \{i\}) = v(S) for all S \subseteq N \setminus \{i\}, then \phi_i (v) = 0.

In fact, the Shapley value is the only map from the set of all games to payoff vectors that satisfies all 4 properties above, which is one reason for its popularity.

References:

  1. Wikipedia. Shapley value.

Bloom’s result relating ITT and TOT

Consider the randomized controlled trial (RCT) setup with non-compliance, i.e. those who are offered the treatment can choose to take the treatment or refuse it. For each individual i, define the dummy variables

\begin{aligned} Z_i &= 1 \{ \text{individual i is assigned to treatment} \}, \\  D_i &= 1 \{ \text{individual i actually takes the treatment} \}. \end{aligned}

Let Y_{1i} denote the potential outcome for individual i if the treatment is taken (i.e. D_i = 1), Y_{0i} denote the potential outcome for individual i if the treatment is not taken (i.e. D_i = 0), and Y_i denote the outcome that is actually observed for individual i.

There are several quantities that might be of interest to the researcher. The first is known as the intention-to-treat effect (ITT), which is defined as

\text{ITT} = \mathbb{E} [Y_i \mid Z_i = 1] - \mathbb{E}[Y_i \mid Z_i = 0].

It is simply the average difference in outcomes for those who were assigned to treatment and those who were assigned to control. Another quantity that might be of interest is the effect of treatment on the treated (TOT), defined as

\text{TOT} = \mathbb{E}[Y_{1i} - Y_{0i} \mid D_i = 1].

Bloom’s result connects these two quantities via a simple equation if a bunch of assumptions hold. The assumptions are going to require more notation, so before giving the full technical result here is the result stated informally:

\begin{aligned} \text{TOT} = \dfrac{\text{ITT}}{\text{Compliance}}. \end{aligned}

Now to define the additional quantities needed for the technical result:

  1. Y_i(d, z) is the potential outcome for individual i if D_i = d (did they actually take the treatment?) and Z_i = z (were they assigned treatment?).
  2. D_{1i} is individual i’s treatment status when Z_i = 1, and D_{0i} is individual i’s treatment status when Z_i = 0.

Theorem (Bloom’s result). Assume that:

  1. (Independence) \{ Y_i(D_{1i}, 1), Y_i(D_{0i}, 0), D_{1i}, D_{0i} \} \perp \!\!\! \perp Z_i.
  2. (Exclusion) Y_i(d, 0) = Y_i(d, 1) := Y_{di} for d = 0, 1.
  3. (First stage) \mathbb{E}[D_{1i} - D_{0i}] \neq 0.
  4. (Monotonicity) D_{1i} - D_{0i} \geq 0 for all i, or vice versa.
  5. (No assigned controls get treatment) \mathbb{P} \{ D_i = 1 \mid Z_i = 0\} = 0.

Then,

\begin{aligned} \mathbb{E}[Y_{1i} - Y_{0i} \mid D_i = 1] = \dfrac{\mathbb{E}[Y_i \mid Z_i = 1] - \mathbb{E}[Y_i \mid Z_i = 0]}{\mathbb{P} \{ D_i = 1 \mid Z_i = 1\} }. \end{aligned}

The proof involves some manipulation of conditional expectations and can be found in Reference 1. See this one-pager for an example relating ITT and TOT.
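
As a sanity check, here is a quick simulation (the parameter values are my own) with one-sided non-compliance and a constant treatment effect of 2 for everyone; dividing the ITT by the compliance rate recovers the TOT:

set.seed(1)
n <- 1e6
z <- rbinom(n, 1, 0.5)                 # random assignment
complier <- rbinom(n, 1, 0.6)          # takes the treatment if (and only if) offered
d <- z * complier                      # no assigned controls get treatment
y0 <- rnorm(n)
y1 <- y0 + 2                           # constant treatment effect of 2
y <- ifelse(d == 1, y1, y0)

itt        <- mean(y[z == 1]) - mean(y[z == 0])
compliance <- mean(d[z == 1])
tot        <- mean(y1[d == 1] - y0[d == 1])

c(TOT = tot, ITT_over_compliance = itt / compliance)   # both close to 2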

References:

  1. Angrist, J. D., and Pischke, J.-S. (2009). Mostly harmless econometrics (Section 4.4.3, p164).

General formula for the asymptotic covariance of the OLS estimator

This post derives the general formula for the asymptotic covariance of the ordinary least squares (OLS) estimator.

Imagine we are in the regression setup with design matrix {\bf X} \in \mathbb{R}^{n \times p} and response {\bf y} \in \mathbb{R}^n. Let {\bf x}_i \in \mathbb{R}^p and y_i \in \mathbb{R} denote the ith row of \bf X and the ith element of \bf y respectively. We can always make the following decomposition:

y_i = \mathbb{E} [y_i \mid {\bf x_i}] + \varepsilon_i,

where \mathbb{E}[\varepsilon_i \mid {\bf x}_i] = 0 and \varepsilon_i is uncorrelated with any function of {\bf x}_i. (This is Theorem 3.1.1 of Reference 1.)

The population regression function approximates y_i as {\bf x}_i^T \beta, where \beta solves the minimization problem

\beta = \text{argmin}_b \: \mathbb{E} \left[ (y_i - {\bf x}_i^T b)^2 \right].

It can be shown that

\beta = \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E} [{\bf x}_i y_i].

The ordinary least squares (OLS) estimator is a sample version of this and is given by

\begin{aligned} \hat\beta = ({\bf X}^T {\bf X})^{-1} {\bf X}^T {\bf y} = \left( \sum_i {\bf x}_i {\bf x}_i^T \right)^{-1} \left( \sum_i {\bf x}_i y_i \right). \end{aligned}

We are often interested in estimating the covariance matrix of \hat\beta as it is needed to construct standard errors for \hat\beta. Defining \varepsilon_i = y_i - {\bf x}_i^T \beta as the ith residual, we can rewrite the above as

\begin{aligned} \hat{\beta} &= \beta + \left( \sum_i {\bf x}_i {\bf x}_i^T \right)^{-1} \left( \sum_i {\bf x}_i \varepsilon_i \right), \\  \sqrt{n}(\hat{\beta} - \beta) &= n \left( \sum_i {\bf x}_i {\bf x}_i^T \right)^{-1} \cdot \dfrac{1}{\sqrt{n}} \left( \sum_i {\bf x}_i \varepsilon_i \right). \end{aligned}

By Slutsky’s Theorem, the quantity above has the same asymptotic distribution as \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \cdot \frac{1}{\sqrt{n}} \left( \sum_i {\bf x}_i \varepsilon_i \right). Since \mathbb{E}[{\bf x}_i \varepsilon_i] = {\bf 0}*, the Central Limit Theorem tells us that \frac{1}{\sqrt{n}} \left( \sum_i {\bf x}_i \varepsilon_i \right) is asymptotically normally distributed with mean zero and covariance \mathbb{E}[{\bf x}_i {\bf x}_i^T \varepsilon_i^2] (the matrix of fourth moments). Thus,

\begin{aligned} \sqrt{n}(\hat{\beta} - \beta) \stackrel{d}{\rightarrow} \mathcal{N} \left( {\bf 0}, \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E}[{\bf x}_i {\bf x}_i^T \varepsilon_i^2] \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \right). \end{aligned}

We can use the diagonal elements of \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E}[{\bf x}_i {\bf x}_i^T \varepsilon_i^2] \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} to construct standard errors of \hat\beta. The standard errors computed in this way are called heteroskedasticity-consistent standard errors (or White standard errors, or Eicker-White standard errors). They are “robust” in the sense that they use few assumptions on the data and the model: only those needed to make the Central Limit Theorem go through.
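
Here is a small numerical sketch of the sandwich formula (my own simulation); if the sandwich package is installed, the result matches its HC0 estimator:

set.seed(1)
n <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = 1 + abs(x1))   # heteroskedastic errors

X   <- cbind(1, x1, x2)
fit <- lm(y ~ x1 + x2)
e   <- residuals(fit)

bread <- solve(crossprod(X))          # (X^T X)^{-1}
meat  <- crossprod(X * e)             # sum_i e_i^2 x_i x_i^T
V     <- bread %*% meat %*% bread     # sandwich estimate of Cov(beta_hat)
sqrt(diag(V))                         # heteroskedasticity-consistent standard errors

# Same thing via the sandwich package (if installed):
# sqrt(diag(sandwich::vcovHC(fit, type = "HC0")))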

*Note: We do NOT need to assume that \mathbb{E} [y_i \mid {\bf x_i}] is linear in order to conclude that \mathbb{E}[{\bf x}_i \varepsilon_i] = {\bf 0}. All we need are the relations \beta = \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E} [{\bf x}_i y_i] and \varepsilon_i = y_i - {\bf x}_i^T \beta. The derivation is as follows:

\begin{aligned} \mathbb{E}[{\bf x}_i \varepsilon_i] &= \mathbb{E}[{\bf x}_i (y_i - {\bf x}_i^T \beta)] \\  &= \mathbb{E}[{\bf x}_i y_i] - \mathbb{E}[ {\bf x}_i {\bf x}_i^T] \beta \\  &= \mathbb{E}[{\bf x}_i y_i] - \mathbb{E}[ {\bf x}_i {\bf x}_i^T] \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \mathbb{E} [{\bf x}_i y_i] \\  &= {\bf 0}. \end{aligned}

Special case of homoskedastic errors

If we assume that the errors are homoskedastic, i.e.

\mathbb{E}[\varepsilon_i^2 \mid {\bf x}_i] = \sigma^2 \quad \text{for all } i

for some constant \sigma^2 > 0, then the asymptotic covariance simplifies a little:

\begin{aligned} \sqrt{n}(\hat{\beta} - \beta) \stackrel{d}{\rightarrow} \mathcal{N} \left( {\bf 0}, \sigma^2 \mathbb{E}[{\bf x}_i {\bf x}_i^T]^{-1} \right). \end{aligned}

References:

  1. Angrist, J. D., and Pischke, J.-S. (2009). Mostly harmless econometrics (Section 3.1.3).