Slight inconsistency between forcats’ fct_lump_min and fct_lump_prop

I recently noticed a slight inconsistency between the forcats package’s fct_lump_min and fct_lump_prop functions. (I’m working with v0.5.1, which is the latest version at the time of writing.) These functions lump levels that meet a certain criterion into an “other” level. According to the documentation,

  • fct_lump_min “lumps levels that appear fewer than min times”, and
  • fct_lump_prop “lumps levels that appear fewer than prop * n times”, where n is the length of the factor variable.

Let’s try this out in an example:

x <- factor(c(rep(1, 6), rep(2, 3), rep(3, 1)))
x
#  [1] 1 1 1 1 1 1 2 2 2 3
# Levels: 1 2 3

fct_lump_min(x, min = 3)
#  [1] 1     1     1     1     1     1     2     2     2     Other
# Levels: 1 2 Other

The levels 1 and 2 appear at least 3 times, and so they are not converted to the “Other” level. The level 3 appears just once, and so is converted.

What do you think this line of code returns?

fct_lump_prop(x, prop = 0.3)

Since prop * n = 0.3 * 10 = 3, the documentation suggests that only the level 3 should be converted to “Other”. However, that is NOT the case:

fct_lump_prop(x, prop = 0.3)
#  [1] 1     1     1     1     1     1     Other Other Other Other
# Levels: 1 Other

The level 2 appears exactly 3 times, and is converted into “Other”, contrary to what the documentation says.

Digging into the source code of fct_lump_prop, you will find this line:

new_levels <- ifelse(prop_n > prop, levels(f), other_level)

prop_n is the proportion of times each factor level appears. For the code to match the documentation, the greater-than sign should really be a greater-than-or-equal sign.

So what’s the right fix here? One way is to fix the documentation to say that fct_lump_prop “lumps levels that appear at most prop * n times”, but that breaks consistency with fct_lump_min. Another is to make the fix in the code as suggested above, but that will change code behavior. Either is probably fine, and it’s up to the package owner to decide which makes more sense.
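In the meantime, if you want the behavior described in the documentation, one workaround is to translate the proportion into a count yourself and call fct_lump_min instead (a sketch based on the toy example above):

prop <- 0.3
fct_lump_min(x, min = prop * length(x))
#  [1] 1     1     1     1     1     1     2     2     2     Other
# Levels: 1 2 Other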

What is the synthetic control estimator?

Introduction

The synthetic control estimator (Abadie et al. (2010), Reference 1) is a technique for performing causal inference in the time series setting. The intervention/treatment is assumed to happen at a certain point in time for the treatment unit. The synthetic control estimator finds weights for a set of control units such that the weighted combination of control units mimics the behavior of the treatment unit in the pre-intervention period. The behavior of this weighted combination of control units (the “synthetic control”) in the post-intervention period is assumed to be a good proxy for what would have happened to the treatment unit, had it not received the intervention. The difference between what actually happened for the treatment unit and the synthetic control in the post-intervention period is an estimate of the treatment effect.

Seminal example

The seminal example for synthetic controls, discussed in Abadie et al. (2010), is estimating the effect of California’s tobacco control program. Specifically, a synthetic control was used to estimate the effect of Proposition 99, which went into effect in January 1989, on per-capita cigarette sales. The figure below shows that we cannot simply compare California against the rest of the US, as cigarette sale trends were pretty different before the intervention (so any changes post-intervention cannot be attributed solely to the intervention):

We also can’t compare California directly with any single state for the same reason. A synthetic control takes a weighted average of some other states such that the pre-intervention cigarette sales trend for this weighted average closely matches that of California. Because the pre-intervention sale trends match, it is more plausible that the post-intervention difference is due to the intervention (and not something else). The figure below shows the trends for real and synthetic California:

The weights give us a way to interpret the control as well. In this example, synthetic California is 0.164 Colorado, 0.069 Connecticut, 0.199 Montana, 0.234 Nevada and 0.334 Utah. (We will need domain knowledge to know whether this makes sense or not.)

Mathematical details

Assume that we have J + 1 units, with just the first unit being exposed to the treatment. (The remaining J units are controls.) Assume that there are T time periods in total, with 1, \dots, T_0 being pre-intervention periods and T_0 + 1, \dots, T being post-intervention periods. Let Y_{it}(0) denote the outcome that would be observed for unit i at time t if it were not exposed to the intervention in periods T_0 + 1 to T, and let Y_{it}(1) denote the outcome that would be observed if it were exposed. What we want to estimate is

\alpha_{1t} = Y_{1t}(1) - Y_{1t}(0)

for t > T_0. In the above, Y_{1t}(1) is observed while Y_{1t}(0) is not and has to be estimated by the synthetic control. If we have the synthetic control weights w_2, \dots, w_{J+1}, then we estimate \alpha_{1t} with

\begin{aligned} \hat\alpha_{1t} = Y_{1t}(1) - \sum_{j=2}^{J+1} w_j Y_{jt}(0). \end{aligned}

Thus, the goal is to find a set of weights w_2, w_3, \dots, w_{J+1} such that the weighted average of the J controls “looks like the treatment unit” before the intervention. The devil is of course in the specifics of what it means to “look like the treatment unit”. There are several ways to do this!

Mathematical details: Variation 1

The first is to minimize the squared difference between the treatment and synthetic control in the pre-intervention period:

\begin{aligned} \text{minimize} \qquad & \sum_{t=1}^{T_0} \left( Y_{1t}(0) - \sum_{j=2}^{J+1} w_j Y_{jt}(0) \right)^2 \\  \text{subject to} \qquad & w_2, \dots, w_{J+1} \geq 0, \\  & w_2 + \dots + w_{J+1} = 1. \end{aligned}

The constraints on the weights are not strictly necessary, but interpretation comes more easily when the synthetic control is a convex combination of the control units. (E.g. how do you interpret California as being 1.2 times Utah and -0.2 times Colorado?)

We can write the optimization problem above more compactly if we introduce matrix notation. Let \mathbf{W} = (w_2, \dots, w_{J+1})^\top \in \mathbb{R}^J be the vector of weights. Let \mathbf{Y}_1 = (Y_{11}(0), \dots, Y_{1T_0}(0))^\top \in \mathbb{R}^{T_0}, and let \mathbf{Y}_0 \in \mathbb{R}^{T_0 \times J} be the matrix such that the jth column of \mathbf{Y}_0 is the pre-intervention time series for unit j + 1. Then we can write the above as

\begin{aligned} \text{minimize} \qquad & \| \mathbf{Y}_1 - \mathbf{Y}_0 \mathbf{W} \|_F^2 \\  \text{subject to} \qquad & w_2, \dots, w_{J+1} \geq 0, \\  & w_2 + \dots + w_{J+1} = 1, \end{aligned}

where \| \cdot \|_F denotes the Frobenius norm.
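For concreteness, here is a minimal sketch of how one might solve this quadratic program in R using the quadprog package (my choice for illustration; the function name synth_weights and the inputs Y1 and Y0, mirroring the notation above, are my own):

library(quadprog)

# Y1: length-T0 vector of the treated unit's pre-intervention outcomes
# Y0: T0 x J matrix whose columns are the control units' pre-intervention outcomes
synth_weights <- function(Y1, Y0) {
  J <- ncol(Y0)
  Dmat <- 2 * crossprod(Y0) + 1e-8 * diag(J)  # quadratic term, with a small ridge for stability
  dvec <- drop(2 * crossprod(Y0, Y1))         # linear term
  Amat <- cbind(rep(1, J), diag(J))           # constraints: sum(w) = 1, then w_j >= 0
  bvec <- c(1, rep(0, J))
  solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
}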

Mathematical details: Variation 2

In Variation 1, we have an optimization problem involving a T_0 \times J matrix. Solving the optimization problem could take a while if T_0 is large. Instead of minimizing the squared difference in each pre-intervention time period, Abadie et al. (2010) suggest minimizing the difference for a few linear combinations of the pre-intervention time periods. Specifically, for m = 1, \dots, M, define \overline{Y}_j^m = \sum_{t=1}^{T_0} v_t^m Y_{jt}(0), where v_t^m are some pre-specified weights. Let \overline{\mathbf{Y}}_1 = \left( \overline{Y}_1^1, \dots, \overline{Y}_1^M \right)^\top \in \mathbb{R}^M, and define \overline{\mathbf{Y}}_0 \in \mathbb{R}^{M \times J} analogously. The synthetic control weights are the solution to

\begin{aligned} \text{minimize} \qquad & \| \overline{\mathbf{Y}}_1 - \overline{\mathbf{Y}}_0 \mathbf{W} \|_F^2 \\  \text{subject to} \qquad & w_2, \dots, w_{J+1} \geq 0, \\  & w_2 + \dots + w_{J+1} = 1. \end{aligned}

(With today’s computational power, I’m not sure there’s a need to make this adjustment.)

Mathematical details: Variation 3

In the previous two variations, we assumed that the only data we had was for the response of interest. In some cases, we might have observed covariates in the pre-intervention period (e.g. gender, age). It makes intuitive sense that a good synthetic control matches the treatment unit’s behavior for both the pre-intervention response and other observed covariates.

Assume for each unit j, we have observed covariates Z_{j1}, \dots, Z_{jr}. Let \mathbf{X}_1 = \left( Z_{11}, \dots, Z_{1r}, \overline{Y}_1^1, \dots, \overline{Y}_1^M \right)^\top \in \mathbb{R}^{r + M}, and define \mathbf{X}_0 \in \mathbb{R}^{(r + M) \times J} analogously. The synthetic control weights are the solution to

\begin{aligned} \text{minimize} \qquad & \| \mathbf{X}_1 - \mathbf{X}_0 \mathbf{W} \|_F^2 \\  \text{subject to} \qquad & w_2, \dots, w_{J+1} \geq 0, \\  & w_2 + \dots + w_{J+1} = 1. \end{aligned}

Mathematical details: Variation 4

In the previous variations, we assumed that the distance of interest was squared distance (represented by the Frobenius norm). This need not be the case: it may be more important to match on some covariates rather than others. To accommodate this, we can solve the problem

\begin{aligned} \text{minimize} \qquad & \| \mathbf{X}_1 - \mathbf{X}_0 \mathbf{W} \|_\mathbf{V}^2 \\  \text{subject to} \qquad & w_2, \dots, w_{J+1} \geq 0, \\  & w_2 + \dots + w_{J+1} = 1, \end{aligned}

where \mathbf{V} is some symmetric and positive semidefinite matrix, and where \| \mathbf{X} \|_\mathbf{V}^2 = \mathbf{X}^\top \mathbf{V} \mathbf{X}. Section 2.3 of Abadie et al. (2010) has some discussion on how to choose \mathbf{V}.
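In practice, if \mathbf{V} is positive definite, the weighted problem can be reduced to the unweighted one by pre-multiplying the data by a Cholesky factor of \mathbf{V}, since \| \mathbf{z} \|_\mathbf{V}^2 = \| \mathbf{U}\mathbf{z} \|_2^2 when \mathbf{V} = \mathbf{U}^\top \mathbf{U}. A sketch, reusing the hypothetical synth_weights function from Variation 1:

U <- chol(V)                                # V assumed positive definite here
w_hat <- synth_weights(U %*% X1, U %*% X0)  # same solver as before, on transformed data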

Adapting to cross-sectional data

While the synthetic control estimator was introduced for the time series setting, it can be adapted to the cross-sectional setting as well (see e.g. Abadie & L’Hour (2021), Reference 2). Assume again that we have J+1 units with just the first being exposed to treatment. The responses for unit j under treatment and control are denoted by Y_j(1) and Y_j(0) respectively. (This is the same as before except there is no subscript for time.) We wish to estimate the treatment effect for the treated unit:

\tau_1 = Y_1(1) - Y_1(0).

The synthetic control estimate for this is

\begin{aligned} \hat\tau_1 = Y_1(1) - \sum_{j=2}^{J+1} w_j Y_j(0), \end{aligned}

where the w_j’s are the synthetic control weights. These weights are determined by the same optimization problem as before:

\begin{aligned} \text{minimize} \qquad & \| \mathbf{X}_1 - \mathbf{X}_0 \mathbf{W} \|_\mathbf{V}^2 \\  \text{subject to} \qquad & w_2, \dots, w_{J+1} \geq 0, \\  & w_2 + \dots + w_{J+1} = 1. \end{aligned}

The difference is just in the definition of \mathbf{X}_1 and \mathbf{X}_0. Recall that in the time series case, \mathbf{X}_1 = \left( Z_{11}, \dots, Z_{1r}, \overline{Y}_1^1, \dots, \overline{Y}_1^M \right)^\top \in \mathbb{R}^{r + M}, where the Z’s were observed covariates and the \overline{Y}’s were linear combinations of the response in the pre-intervention period. For the cross-sectional setting, \mathbf{X}_1 = \left( Z_{11}, \dots, Z_{1r} \right)^\top \in \mathbb{R}^r consists of just the observed covariates. (\mathbf{X}_0 is defined analogously.)

The synthetic control estimator can be thought of as a type of nearest-neighbor estimator. However, as noted in Reference 2, it departs from the nearest-neighbor estimator in 2 major ways:

  1. The number of matches for the treated unit is not fixed a priori. (For k-nearest neighbors, the number of matches is fixed at k.)
  2. The matched units are not given equal weight: rather, the weights are determined by the optimization problem described above.

References:

  1. Abadie, A., et al. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program.
  2. Abadie, A., and L’Hour, J. (2021). A penalized synthetic control estimator for disaggregated data.

A quirk when using data.table?

I recently came across this quirk in using data.table that I don’t really have a clean solution for. I outline the issue below as well as my current way around it. Appreciate any better solutions!

The problem surfaces quite generally, but I’ll illustrate it by trying to achieve the following task: write a function that takes a data table and a column name, and returns the data table with the data in that column scrambled.

The function below was my first attempt:

library(data.table)

scramble_col <- function(input_dt, colname) {
  input_dt[[colname]] <- sample(input_dt[[colname]])
  input_dt
}

The code snippet below shows that it seems to work:

input_dt <- data.table(x = 1:5)
set.seed(1)
input_dt <- scramble_col(input_dt, "x")
input_dt
#    x
# 1: 1
# 2: 4
# 3: 3
# 4: 5
# 5: 2

However, when I tried to add a new column that is a replica of the x column, I got a strange warning!

input_dt[, y := x]  # gives warning

There are a few webpages out there that try to explain this warning, but I haven’t had time to fully digest what is going on. My high-level takeaway is that the assignment in the line input_dt[[colname]] <- sample(input_dt[[colname]]) is problematic.

This was my second attempt:

scramble_col <- function(input_dt, colname) {
  input_dt[, c(colname) := sample(get(colname))]
}

This version works well: it doesn’t throw the warning when I add a second column.

input_dt <- data.table(x = 1:5)
set.seed(1)
input_dt <- scramble_col(input_dt, "x")
input_dt
#    x
# 1: 1
# 2: 4
# 3: 3
# 4: 5
# 5: 2
input_dt[, y := x]
input_dt
#    x y
# 1: 1 1
# 2: 4 4
# 3: 3 3
# 4: 5 5
# 5: 2 2

However, the function breaks for one particular input: when the column itself is named colname! When I run the following code, I get an error message.

input_dt <- data.table(colname = 1:5)
set.seed(1)
input_dt <- scramble_col(input_dt, "colname") # error

The function below was my workaround and I think it works for all inputs, but it seems a bit inelegant:

scramble_col <- function(input_dt, colname) {
  new_col <- sample(input_dt[[colname]])
  input_dt[, c(colname) := new_col]
}

Would love to hear if anyone has a better solution for this task, or if you have some insight into what is going on with the warning/error above.
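One alternative that should sidestep both issues is data.table’s set() function, which assigns by reference and takes the column name as a string (a sketch, not extensively tested):

scramble_col <- function(input_dt, colname) {
  set(input_dt, j = colname, value = sample(input_dt[[colname]]))
  input_dt
}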

7 guiding principles for using synthetic control estimators

The synthetic control estimator (Abadie et al. 2010, Reference 1) is a technique for performing causal inference in the time series setting. The intervention/treatment is assumed to happen at a certain point in time for the treatment unit. The synthetic control estimator finds weights for a set of control units such that the weighted combination of control units mimics the behavior of the treatment unit in the pre-intervention period. The behavior of this weighted combination of control units (the “synthetic control”) in the post-intervention period is assumed to be a good proxy for what would have happened to the treatment unit, had it not received the intervention. The difference between what actually happened for the treatment unit and the synthetic control in the post-intervention period is an estimate of the treatment effect.

Abadie & Vives-i-Bastida (2022) (Reference 2) recently proposed 7 guidelines to keep in mind when using synthetic control estimators in practice. I thought they were pretty useful and worth reproducing in full as a blog post. Here they are (with some paraphrasing on my part):

  1. Low volatility. Closely fitting a highly volatile series is likely the result of over-fitting at work, especially if it happens over a short pre-intervention period. Synthetic controls were designed for settings with aggregate series, where aggregation attenuates the magnitude of the noise.
  2. Extended pre-intervention period. A good control unit must closely reproduce the trajectory of the outcome variable for the treated unit over an extended pre-intervention period. A good fit, if it is the result of a secular agreement between the treated and the synthetic control units in their responses to unobserved factors, should persist in time.
  3. Small donor pool. A larger donor pool (i.e. the set of control units from which the synthetic control is created) is not necessarily better than a smaller one. Adopting a small donor pool of untreated units that are close to the treated unit in the space of the predictors helps reduce over-fitting and interpolation biases.
  4. Sparsity makes synthetic controls interpretable.
  5. Covariates matter. Under the assumption that the response is a linear combination of observed covariates, unobserved covariates and noise, an observed covariate that is not controlled for is effectively thrown into the unobserved component. Thus, including important covariates when determining the synthetic control weights is likely to result in a better fit.
  6. Fit matters. The bound on the bias is predicated on close fit. A deficient fit raises concerns about the validity of a synthetic control. (There are however some exceptions and qualifications to this; see references cited in the paper.)
  7. Out-of-sample validation is key. The goal of synthetic controls is to predict the trajectory of the outcome variable for the treated unit in the absence of the intervention of interest. The quality of a synthetic control can be assessed by measuring predictive power in pre-intervention periods that are left out of the sample used to calculate the synthetic control weights.

References:

  1. Abadie, A., et al. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program.
  2. Abadie, A., and Vives-i-Bastida, J. (2022). Synthetic controls in action.

T-learners, S-learners and X-learners

T-learners, S-learners and X-learners are all meta-algorithms that one can use for estimating the conditional average treatment effect (CATE) in the causal inference setting. The information here is largely taken from Künzel et al. (2019) (Reference 1). All 3 methods are implemented in Microsoft’s EconML Python package.

Set-up

  • We index individuals by i.
  • Each individual is either in control (W_i = 0) or treatment (W_i = 1).
  • We have some outcome/response metric of interest. If the individual is in control (treatment resp.), the response metric is Y_i(0) (Y_i(1) resp.). We only get to observe one of them, which we denote by Y_i^{obs} = Y_i(W_i).
  • For each individual, we have a vector of pre-treatment covariates X_i.

The conditional average treatment effect (CATE) is defined as

\begin{aligned} \tau(x) := \mathbb{E}[Y(1) - Y(0) \mid X = x]. \end{aligned}

If we define the response under control and the response under treatment as

\begin{aligned} \mu_0(x) &:= \mathbb{E}[Y(0) \mid X = x], \\  \mu_1(x) &:= \mathbb{E}[Y(1) \mid X = x], \end{aligned}

then we can write the CATE as

\begin{aligned} \tau(x) = \mu_1(x) - \mu_0(x). \end{aligned}

T-learner

The T-learner consists of 2 steps:

  1. Use observations in the control group to estimate the response under control, \hat\mu_0(x). Similarly, use observations in the treatment group to estimate the response under treatment, \hat\mu_1(x). Any machine learning method can be used to get these estimates.
  2. Estimate the CATE by \hat\tau_T(x) = \hat\mu_1(x)- \hat\mu_0(x).
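Here is a minimal sketch of these two steps in R on simulated data, using lm() as the base learner (any supervised learning method could be substituted; the simulated data and all variable names are my own):

set.seed(42)
n <- 1000
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
W <- rbinom(n, 1, 0.5)                        # treatment indicator
Y <- X$x1 + X$x2 + W * (2 * X$x1) + rnorm(n)  # simulated outcome (true CATE = 2 * x1)
dat <- cbind(X, W = W, Y = Y)

# Step 1: fit separate response models on the control and treatment groups
mu0_hat <- lm(Y ~ x1 + x2, data = dat[dat$W == 0, ])
mu1_hat <- lm(Y ~ x1 + x2, data = dat[dat$W == 1, ])

# Step 2: the CATE estimate is the difference of the two predictions
tau_hat_T <- predict(mu1_hat, newdata = X) - predict(mu0_hat, newdata = X)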

S-learner

The S-learner treats the treatment variable W_i as if it were just another covariate like those in the vector X_i. Instead of having two models for the response as a function of the covariates X, the S-learner has a single model for the response as a function of X and the treatment W:

\begin{aligned} \mu(x, w) := \mathbb{E}[Y^{obs} \mid X = x, W = w]. \end{aligned}

The S-learner consists of 2 steps:

  1. Use all the observations to estimate the response function above, \hat\mu (x, w).
  2. Estimate the CATE by \hat\tau_S(x) = \hat\mu (x, 1) - \hat\mu (x, 0).
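A sketch of the S-learner on the same simulated data as above (the interaction terms let the linear base learner express a treatment effect that varies with x; without them, the estimated CATE would just be the coefficient on W):

# Step 1: a single response model with W as just another covariate
mu_hat <- lm(Y ~ (x1 + x2) * W, data = dat)

# Step 2: predicted response with W = 1 minus predicted response with W = 0
tau_hat_S <- predict(mu_hat, newdata = transform(dat, W = 1)) -
  predict(mu_hat, newdata = transform(dat, W = 0))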

X-learner

The X-learner consists of 3 steps (sharing the first step with the T-learner):

  1. Use observations in the control group to estimate the response under control, \hat\mu_0(x), and use observations in the treatment group to estimate the response under treatment, \hat\mu_1(x).
  2. Use the estimates in Step 1 to obtain estimates of the individual treatment effects (ITE). For observations in the control group, the ITE estimate is \hat{D}_i = \hat\mu_1(X_i) - Y_i^{obs}; for observations in the treatment group, it is \hat{D}_i = Y_i^{obs} - \hat\mu_0(X_i). Build a model for the ITE using just observations from the control group (with the imputed/estimated ITE as the response), \hat\tau_0(x). Do the same with just observations from the treatment group to get \hat\tau_1(x).
  3. Estimate the CATE by combining the two estimates above: \hat\tau_X(x) = g(x) \hat\tau_0(x) + [1-g(x)]\hat\tau_1(x), where g(x) \in [0,1] is a weight function. A good choice for g is an estimate of the propensity score.
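A sketch of these three steps, continuing from the T-learner fits above and using logistic regression for the propensity score g(x):

# Step 2: impute individual treatment effects, then model them within each group
ctrl <- dat$W == 0; trt <- dat$W == 1
D0 <- predict(mu1_hat, newdata = dat[ctrl, ]) - dat$Y[ctrl]
D1 <- dat$Y[trt] - predict(mu0_hat, newdata = dat[trt, ])
tau0_hat <- lm(D0 ~ x1 + x2, data = cbind(dat[ctrl, c("x1", "x2")], D0 = D0))
tau1_hat <- lm(D1 ~ x1 + x2, data = cbind(dat[trt, c("x1", "x2")], D1 = D1))

# Step 3: combine the two models using an estimated propensity score as g(x)
g_hat <- glm(W ~ x1 + x2, family = binomial, data = dat)
g <- predict(g_hat, newdata = dat, type = "response")
tau_hat_X <- g * predict(tau0_hat, newdata = dat) +
  (1 - g) * predict(tau1_hat, newdata = dat)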

When to use what?

The conclusions here were drawn from Reference 1, the paper which proposed the X-learner. So while I think they make intuitive sense, just keep that in mind when reading this section.

  • Overall
    • The choice of base learner (for the intermediate models) can make a large difference in prediction accuracy. (This is an important advantage of metalearners in general.)
    • There is no universally best metalearner: for each of these 3 metalearners, there are situations where it performs best.
  • T-learner
    • Performs well if there are no common trends in the response under control and response under treatment and if the treatment effect is very complicated.
    • Because data is not pooled across treatment groups, it is difficult for the T-learner to mimic a behavior (e.g. discontinuity) that appears in all the treatment groups.
  • S-learner
    • Since the treatment indicator plays no special role, the base learners can completely ignore it during model-fitting. This is good if the CATE is zero in many places.
    • The S-learner can be biased toward zero.
    • For some base learners (e.g. k-nearest neighbors), treating the treatment indicator like any other covariate may not make sense.
  • X-learner
    • The X-learner can adapt to structural properties such as sparsity or smoothness of the CATE. (This is useful as CATE is often zero or approximately linear.)
      • When CATE is zero, it usually is not as good as the S-learner but is better than the T-learner.
      • When CATE is complex, it outperforms the S-learner, and is often better than the T-learner too.
    • It is particularly effective when the number of units in one treatment group (often the control group) is much larger than in the other.

References:

  1. Künzel, S. R., et al. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning.

An example demonstrating the problem with “net treatment effects”

In this previous post, we described the concept of principal stratification in causal inference. Principal stratification was proposed as a framework for how to think about causal effects in the presence of post-treatment variables. It was a response to a standard method at that time known as “net treatment effects adjusting for post-treatment variables”. In this post, we present two examples that demonstrate why net treatment effects were the wrong way to think about causal effects in this setting. These examples are taken from Frangakis & Rubin (2002) (Reference 1).

Background

  • We index individuals by i.
  • Each individual is either in control (Z_i = 1) or treatment (Z_i = 2).
  • We have some outcome/response metric of interest. If the individual is in control (treatment resp.), the response metric is Y_i(1) (Y_i(2) resp.). We only get to observe one of them, which we denote by Y_i^{obs} = Y_i(Z_i).
  • We have a post-treatment variable that takes on value S_i(1) if the individual is in control and S_i(2) if the individual is in treatment. We only get to observe one of them, which we denote by S_i^{obs} = S_i(Z_i).

Net treatment effects propose comparing the distributions

\text{pr} \left\{ Y_i^{obs} \mid S_i^{obs} = s, Z_i = 1 \right\} and \text{pr} \left\{ Y_i^{obs} \mid S_i^{obs} = s, Z_i = 2 \right\}.

The resulting difference is called the “net treatment effect of assignment Z adjusting for the post-treatment variable S^{obs}”.

Example 1

Consider a scenario where we have 3 subgroups of patients (sicker, normal, healthier) based on their possible post-treatment values (S_i(1), S_i(2)). We assume here that the post-treatment variable can only take on two values: L for low and H for high. Assume also that there are the same number of individuals in each subgroup, and that we care about the average treatment effect.

From the 3rd and 4th columns in the table below, we see that the treatment has no effect on the response Y for sicker and healthier patients, but for normal patients, the treatment increases the response from 30 to 50 for a treatment effect of +20.

Consider the net treatment effect of assignment Z adjusting for S^{obs} for S^{obs} = L. This compares the means

\mathbb{E} \left[ Y_i^{obs} \mid S_i^{obs} = L, Z_i = 1 \right] and \mathbb{E} \left[ Y_i^{obs} \mid S_i^{obs} = L, Z_i = 2 \right],

which is equivalent to comparing the blue box to the orange box. This comparison suggests that the treatment effect is 10 - 20 = -10: a negative effect! This is clearly an incorrect conclusion, and it arises because the blue box includes normal patients while the orange box does not.

Example 2

The set-up for the second example is the same as the first, except that the potential outcome values in the 3rd and 4th columns have changed. In this example, every subgroup has a treatment effect of +10.

The net treatment effect compares either the top blue box with the top orange box, or the bottom blue box with the bottom orange box. Either way, the net treatment effect is zero! This is clearly an incorrect conclusion. Again, this incorrect conclusion arises because the groups being compared are not the same. For S^{obs} = L, we are comparing sicker patients with sicker and normal ones, while for S^{obs} = H, we are comparing normal and healthier patients to just healthier ones.

Principal stratification corrects for this by saying that we should only make comparisons between groups which have the same membership.

References:

  1. Frangakis, C. E., and Rubin, D. B. (2002). Principal stratification in causal inference.

A short note on the startsWith function

The startsWith function comes with base R, and determines whether entries of an input start with a given prefix. (The endsWith function does the same thing but for suffixes.) The following code checks if each of “ant”, “banana” and “balloon” starts with “a”:

startsWith(c("ant", "banana", "balloon"), "a")
# [1]  TRUE FALSE FALSE

The second argument (the prefix to check) can also be a vector. The code below checks if “ant” starts with “a” and if “ant” starts with “b”:

startsWith("ant", c("a", "b"))
# [1]  TRUE FALSE

Where things might get a bit unintuitive is when both arguments are vectors of length >1. Why do you think the line of code below returned the result it did?

startsWith(c("ant", "banana", "balloon"), c("a", "b"))
# [1] TRUE TRUE FALSE

This makes sense when we look at the documentation for startsWith’s return value:

A logical vector, of “common length” of x and prefix (or suffix), i.e., of the longer of the two lengths unless one of them is zero when the result is also of zero length. A shorter input is recycled to the output length.

startsWith(x, prefix) checks whether x[i] starts with prefix[i] for each i. In the line of code above, the function checks whether “ant” starts with “a” and whether “banana” starts with “b”. Since x is longer than prefix, prefix is “recycled” and the function also checks whether “balloon” starts with “a”.

If you want to check whether each x[i] starts with any of the prefixes (not just the one in the matching position), you could do the following:

x <- c("ant", "banana", "balloon")
prefix <- c("a", "b")
has_prefix <- sapply(prefix, function(p) startsWith(x, p))
has_prefix
#          a     b
# [1,]  TRUE FALSE
# [2,] FALSE  TRUE
# [3,] FALSE  TRUE

apply(has_prefix, 1, any)
# [1] TRUE TRUE TRUE
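An equivalent base R one-liner that skips the intermediate matrix:

Reduce(`|`, lapply(prefix, function(p) startsWith(x, p)))
# [1] TRUE TRUE TRUE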

What is principal stratification in causal inference?

Principal stratification was proposed in Frangakis & Rubin (2002) (Reference 1) as a way to make sense of treatment effects that are adjusted for post-treatment variables. The exposition below largely follows that in their paper.

Background

Imagine that we are in the potential outcomes framework for causal inference (the notation here follows that of Reference 1):

  • We index individuals by i.
  • Each individual is either in control (Z_i = 1) or treatment (Z_i = 2).
  • We have some outcome/response metric of interest. If the individual is in control (treatment resp.), the response metric is Y_i(1) (Y_i(2) resp.). We only get to observe one of them, which we denote by Y_i^{obs} = Y_i(Z_i).

We are interested in a comparison between the two sets

\{ Y_i(1): i \in \text{set}_1\} and \{ Y_i(2): i \in \text{set}_2\},

where \text{set}_1 and \text{set}_2 are identical. We call such comparisons causal effects. For example, the average treatment effect is a comparison of the means of these two sets where \text{set}_1 and \text{set}_2 are the entire population, and so is a causal effect.

Adjusting for treatment effects: An incorrect approach

In addition to the above, assume that after each unit is assigned to one of the treatment arms, we also measure a post-treatment variable S_i^{obs}. For simplicity, assume that S_i^{obs} is binary, taking on the value 1 or 2. (The framework can be extended easily to multi-category or continuous post-treatment variables.) S_i^{obs} typically encodes characteristics of the unit and of the treatment, so we may want to “adjust” the causal effect to take this into account.

One possible approach (suggested and used fairly widely in the 1990s, sometimes known as “net treatment effects”) is to make a comparison between the distributions

\text{pr} \left\{ Y_i^{obs} \mid S_i^{obs} = s, Z_i = 1 \right\} and \text{pr} \left\{ Y_i^{obs} \mid S_i^{obs} = s, Z_i = 2 \right\}.

However, Reference 1 notes that such comparisons cannot be considered causal effects at all! The key to understanding this is to recognize that because S_i^{obs} is a post-treatment variable, it could be affected by the treatment. Thus, we should recognize that there are potential outcomes for S_i as well (S_i(1) and S_i(2)), and that S_i^{obs} = S_i(Z_i).

With this language, consider the special case where treatment assignment is completely randomized, i.e. \text{pr} (Z_i = 1 \mid S_i(1), S_i(2), Y_i(1), Y_i(2)) = p for some 0 < p < 1. Under this assignment, the comparison above is equivalent to the comparison between

\text{pr} \left\{ Y_i(1) \mid S_i(1) = s \right\} and \text{pr} \left\{ Y_i(2) \mid S_i(2) = s \right\}.

In general, the groups \{i: S_i(1) = s\} and \{i: S_i(2) = s\} are not the same, so the comparison is not a causal effect!

Adjusting for treatment effects: Principal stratification

Frangakis & Rubin propose that we should adjust for post-treatment variables by thinking of treatment effects within each principal stratum. Individuals are placed into different principal strata based on their vector of potential outcomes for S. Here is a formal definition:

Definition (Principal stratification). The basic principal stratification P_0 w.r.t. post-treatment variable S is the partition of units such that, within any set of P_0, all units have the same vector (S_i(1), S_i(2)).

A principal stratification P w.r.t. S is a partition of units whose sets are unions of sets in the basic principal stratification P_0.

This allows us to define treatment effects that are always causal effects (as compared to the net treatment effects from the previous section):

Definition (Principal effects). Let P be a principal stratification w.r.t. S and let S_i^P be the stratum of P that unit i belongs to. For a given principal stratum \zeta \in P, a principal effect for \zeta is defined as a comparison between the sets

\left\{ Y_i(1): S_i^P = \zeta \right\} and \left\{ Y_i(2): S_i^P = \zeta \right\}.

Theorem. Principal effects are always causal effects.

The main benefit of principal stratification is that it gives us conceptual clarity of the estimand: what is it that we are actually trying to estimate? If the quantity that we are estimating does not make sense, then strategies to estimate those quantities are misguided.

Principal stratification gives us quantities that we can claim to be causal effects. Unfortunately, in general we do not know which principal stratum a unit belongs to since we can only observe one of the potential outcomes for S_i. What we could do is to treat it as a missing data problem and use techniques from the missing data literature to help us out.

Connection with non-compliance

Imagine that S_i is the post-treatment variable indicating whether unit i actually took the treatment or not (1 if they did not take the treatment, 2 if they did take the treatment). We can then divide the population into 4 principal strata, as shown in the diagram below:
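In the usual instrumental-variables terminology, the four basic principal strata, defined by the pair (S_i(1), S_i(2)), are:

  • (1, 1): never-takers, who do not take the treatment regardless of assignment;
  • (1, 2): compliers, who take the treatment only if assigned to it;
  • (2, 1): defiers, who take the treatment only if assigned to control;
  • (2, 2): always-takers, who take the treatment regardless of assignment.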

Principal effects are defined for each of the 4 groups. You may recognize the treatment effect for the compliers as the complier average causal effect (CACE), which is the typical estimand in instrumental variables studies. Hence, the CACE is a principal effect, and any techniques we use to estimate principal effects can be used to estimate the CACE.

References:

  1. Frangakis, C. E., and Rubin, D. B. (2002). Principal stratification in causal inference.

Fieller’s confidence sets for ratio of two means

Assume you have a sample (X_1, Y_1), \dots, (X_n, Y_n) drawn i.i.d. from some underlying distribution. Let \mathbb{E}[X] = \mu_1 and \mathbb{E}[Y] = \mu_2. Our goal is to estimate the ratio of means \rho := \mu_2 / \mu_1 and provide a level-(1-\alpha) confidence set for it, i.e. a set R such that \mathbb{P}_\rho \{ \rho \in R \} \geq 1 - \alpha. Fieller’s confidence sets are one way to do this. (The earliest reference to Fieller’s work is listed as Reference 1. The exposition here follows that in von Luxburg & Franz 2009 (Reference 2).)

Define the standard estimators for the means and covariances:

\begin{aligned} \hat\mu_1 &:= \dfrac{1}{n}\sum_{i=1}^n X_i, \quad \hat\mu_2 := \dfrac{1}{n}\sum_{i=1}^n Y_i, \\  \hat{c}_{11} &:= \dfrac{1}{n(n-1)}\sum_{i=1}^n (X_i - \hat\mu_1)^2, \\  \hat{c}_{22} &:= \dfrac{1}{n(n-1)}\sum_{i=1}^n (Y_i - \hat\mu_2)^2, \\  \hat{c}_{12} &:= \hat{c}_{21} := \dfrac{1}{n(n-1)} \sum_{i=1}^n (X_i - \hat\mu_1)(Y_i - \hat\mu_2). \end{aligned}

Let q := q(t_{n-1}, 1 - \alpha/2) be the (1 - \alpha/2)-quantile of the t distribution with n-1 degrees of freedom. Compute the quantities

\begin{aligned} q_\text{exclusive}^2 &= \dfrac{\hat\mu_1^2}{\hat{c}_{11}}, \\  q_\text{complete}^2 &= \dfrac{\hat\mu_2^2 \hat{c}_{11} - 2\hat\mu_1 \hat\mu_2 \hat{c}_{12} + \hat\mu_1^2 \hat{c}_{22} }{\hat{c}_{11}\hat{c}_{22} - \hat{c}_{12}^2}, \end{aligned}

and

\begin{aligned} \ell_{1, 2} = \dfrac{1}{\hat\mu_1^2 - q^2 \hat{c}_{11}} \left[ (\hat\mu_1 \hat\mu_2 - q^2 \hat{c}_{12}) \pm \sqrt{(\hat\mu_1 \hat\mu_2 - q^2 \hat{c}_{12})^2 - (\hat\mu_1^2 - q^2 \hat{c}_{11})(\hat\mu_2^2 - q^2 \hat{c}_{22})} \right] .\end{aligned}

The confidence set R_\text{Fieller} is defined as follows:

\begin{aligned} R_\text{Fieller} = \begin{cases} (-\infty, \infty) &\text{if } q_\text{complete}^2 \leq q^2, \\  (-\infty, \min(\ell_1, \ell_2)] \cup [ \max(\ell_1, \ell_2), \infty) &\text{if } q_\text{exclusive}^2 < q^2 < q_\text{complete}^2, \\ [\min(\ell_1, \ell_2), \max(\ell_1, \ell_2) ] &\text{otherwise.} \end{cases} \end{aligned}

The following result is often known as Fieller’s theorem:

Theorem (Fieller). If (X, Y) is jointly normal, then R_\text{Fieller} is an exact confidence region of level 1 - \alpha for \rho, i.e. \mathbb{P}_\rho \{ \rho \in R_\text{Fieller} \} = 1 - \alpha.

This is Theorem 3 of Reference 2, and there is a short proof of the result there. Of course, if (X, Y) is not jointly normal (which is almost always the case), then Fieller’s confidence sets are no longer exact but approximate.

Fieller’s theorem is also valid in the more general setting where we are given two independent samples X_1, \dots, X_n and Y_1, \dots, Y_m (rather than paired samples), and use unbiased estimators for the means and independent unbiased estimators for the covariances. Reference 2 notes that the degrees of freedom for the t distribution needs to be chosen appropriately in this case.

I like the way Reference 2 explicitly lays out the 3 possibilities for the Fieller confidence set:

\begin{aligned} R_\text{Fieller} = \begin{cases} (-\infty, \infty) &\text{if } q_\text{complete}^2 \leq q^2, \\  (-\infty, \min(\ell_1, \ell_2)] \cup [ \max(\ell_1, \ell_2), \infty) &\text{if } q_\text{exclusive}^2 < q^2 < q_\text{complete}^2, \\ [\min(\ell_1, \ell_2), \max(\ell_1, \ell_2) ] &\text{otherwise.} \end{cases} \end{aligned}

The first case corresponds to the setting where both \mathbb{E}[X] and \mathbb{E}[Y] are close to 0: here, we can’t really conclude anything about the ratio. The second case corresponds to the setting where the denominator \mathbb{E}[X] is close to zero while \mathbb{E}[Y] is not: here the ratio is something like C/\epsilon or -C/\epsilon, where C is bounded away from zero and \epsilon is small. The final case corresponds to the setting where \mathbb{E}[X] is clearly bounded away from zero, so the ratio is well behaved. The Wikipedia article for Fieller’s theorem describes the confidence set using a single formula. Even though all 3 cases above are contained within this formula, I found the formula a little misleading because on the surface it looks like the confidence set is always of the form [a, b] for some real numbers a and b.
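To make the recipe concrete, here is a small R function implementing the formulas above for the paired-sample case (a sketch; the function and variable names are my own):

# Fieller confidence set for rho = E[Y] / E[X] from paired samples (x, y)
fieller_ci <- function(x, y, alpha = 0.05) {
  n <- length(x)
  mu1 <- mean(x); mu2 <- mean(y)
  c11 <- var(x) / n; c22 <- var(y) / n; c12 <- cov(x, y) / n
  q <- qt(1 - alpha / 2, df = n - 1)
  q_excl2 <- mu1^2 / c11
  q_comp2 <- (mu2^2 * c11 - 2 * mu1 * mu2 * c12 + mu1^2 * c22) / (c11 * c22 - c12^2)
  if (q_comp2 <= q^2) return(c(-Inf, Inf))            # case 1: the whole real line
  disc <- (mu1 * mu2 - q^2 * c12)^2 - (mu1^2 - q^2 * c11) * (mu2^2 - q^2 * c22)
  ell <- ((mu1 * mu2 - q^2 * c12) + c(-1, 1) * sqrt(disc)) / (mu1^2 - q^2 * c11)
  if (q_excl2 < q^2) {                                # case 2: complement of an interval
    list(lower = c(-Inf, min(ell)), upper = c(max(ell), Inf))
  } else {                                            # case 3: a bounded interval
    c(min(ell), max(ell))
  }
}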

References:

  1. Fieller, E. C. (1932). The distribution of the index in a normal bivariate population.
  2. Von Luxburg, U., and Franz, V. H. (2009). A geometric approach to confidence sets for ratios: Fieller’s theorem, generalizations and bootstrap.

Asymptotic normality of sample quantiles

Let X_1, \dots, X_n be i.i.d. random variables with the cumulative distribution function (CDF) F. The central limit theorem demonstrates that the sample mean \overline{X}_n is asymptotically normal (as long as F has a finite second moment):

\begin{aligned} \sqrt{n} (\overline{X}_n - \mu) \stackrel{d}{\rightarrow} \mathcal{N}(0, \sigma^2), \end{aligned}

where \mu and \sigma^2 are the mean and variance of the random variable with CDF F.

It turns out that for any 0 < p < 1, the pth sample quantile is asymptotically normal as well. Here is the result:

Theorem. Fix 0 < p < 1. If F is differentiable at the pth quantile F^{-1}(p) with positive derivative f(F^{-1}(p)), then the sample quantile F_n^{-1}(p) is asymptotically normal:

\begin{aligned} \sqrt{n}\left( F_n^{-1}(p) - F^{-1}(p) \right) \stackrel{d}{\rightarrow} \mathcal{N} \left( 0, \dfrac{p(1-p)}{f^2 (F^{-1}(p))} \right). \end{aligned}

This is frequently cited as Corollary 21.5 in Reference 1. A proof of this result can be found in Reference 2 (there, x refers to F^{-1}(p) above). (It is essentially an application of the Central Limit Theorem followed by an application of the delta method.)

The numerator of the variance, p(1-p), is largest when p = 1/2 and shrinks as we move toward more extreme quantiles. This is the variance of the empirical CDF at work: the proportion of observations falling below a given point is most variable when that proportion is near 1/2.

The denominator of the variance f^2 (F^{-1}(p)) is largest when the CDF F is changing rapidly at the pth quantile. This again makes intuitive sense: if F is changing a lot there, it means that we have greater probability of getting some samples in the neighborhood of F^{-1}(p), and so we can estimate it more accurately.
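A quick simulation can check the theorem for the median (p = 1/2) of a standard normal, where the asymptotic standard deviation \sqrt{p(1-p)} / (\sqrt{n} f(F^{-1}(p))) works out to about 0.0396 for n = 1000 (the empirical standard deviation from the simulation should come out close to this; variable names are my own):

set.seed(1)
n <- 1000; p <- 0.5; n_sim <- 5000
sample_medians <- replicate(n_sim, quantile(rnorm(n), probs = p))
sd(sample_medians)                                # empirical standard deviation
sqrt(p * (1 - p)) / (sqrt(n) * dnorm(qnorm(p)))   # theoretical value: ~0.0396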

References:

  1. Van der Vaart, A. W. (2000). Asymptotic statistics.
  2. Stephens, D. A. Asymptotic distribution of sample quantiles.