T-learners, S-learners and X-learners

T-learners, S-learners and X-learners are all meta-algorithms (metalearners) for estimating the conditional average treatment effect (CATE) in the causal inference setting. The information here is largely taken from Künzel et al. (2019) (Reference 1). All 3 methods are implemented in Microsoft’s EconML Python package.


Throughout, we use the following set-up and notation:

  • We index individuals by i.
  • Each individual is either in control (W_i = 0) or treatment (W_i = 1).
  • We have some outcome/response metric of interest. Each individual has two potential outcomes: Y_i(0), the response if the individual is in control, and Y_i(1), the response if the individual is in treatment. We only get to observe one of them, which we denote by Y_i^{obs} = Y_i(W_i).
  • For each individual, we have a vector of pre-treatment covariates X_i.

The conditional average treatment effect (CATE) is defined as

\begin{aligned} \tau(x) := \mathbb{E}[Y(1) - Y(0) \mid X = x]. \end{aligned}

If we define the response under control and the response under treatment as

\begin{aligned} \mu_0(x) &:= \mathbb{E}[Y(0) \mid X = x], \\  \mu_1(x) &:= \mathbb{E}[Y(1) \mid X = x], \end{aligned}

then we can write the CATE as

\begin{aligned} \tau(x) = \mu_1(x) - \mu_0(x). \end{aligned}
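
To make the notation concrete, here is a minimal simulation sketch in Python with NumPy. The data-generating process (the particular functions mu0 and mu1, the noise level and the sample size) is made up for illustration and is not taken from Reference 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data-generating process, chosen only for illustration.
X = rng.uniform(-1.0, 1.0, size=n)   # one pre-treatment covariate X_i
W = rng.integers(0, 2, size=n)       # W_i = 0 (control) or 1 (treatment)

def mu0(x):
    """Expected response under control, E[Y(0) | X = x]."""
    return 1.0 + 2.0 * x

def mu1(x):
    """Expected response under treatment, E[Y(1) | X = x]."""
    return 1.5 + 3.0 * x

def tau(x):
    """CATE: tau(x) = mu1(x) - mu0(x), which is 0.5 + x here."""
    return mu1(x) - mu0(x)

# Both potential outcomes exist inside the simulation ...
Y0 = mu0(X) + rng.normal(0.0, 1.0, size=n)
Y1 = mu1(X) + rng.normal(0.0, 1.0, size=n)

# ... but in real data we only ever observe Y_i(W_i).
Y_obs = np.where(W == 1, Y1, Y0)
```

In real data only X, W and Y_obs are available, which is why the CATE has to be estimated rather than computed directly from Y1 - Y0.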


The T-learner consists of 2 steps:

  1. Use observations in the control group to estimate the response under control, \hat\mu_0(x). Similarly, use observations in the treatment group to estimate the response under treatment, \hat\mu_1(x). Any machine learning method can be used to get these estimates.
  2. Estimate the CATE by \hat\tau_T(x) = \hat\mu_1(x)- \hat\mu_0(x).
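
The two steps above can be sketched in a few lines of Python. This is only an illustration, not EconML's implementation: the base learner here is ordinary least squares via NumPy (any regression method could be swapped in), and the helper names fit_linear and t_learner as well as the simulated data are my own assumptions.

```python
import numpy as np

def fit_linear(X, y):
    """Illustrative base learner: least squares with an intercept.
    Returns a function that predicts for new covariates."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def t_learner(X, W, Y, fit=fit_linear):
    # Step 1: fit one response model per treatment group.
    mu0_hat = fit(X[W == 0], Y[W == 0])
    mu1_hat = fit(X[W == 1], Y[W == 1])
    # Step 2: the CATE estimate is the difference of the two models.
    return lambda X_new: mu1_hat(X_new) - mu0_hat(X_new)

# Simulated data (hypothetical): the true CATE is tau(x) = 0.5 + x.
rng = np.random.default_rng(0)
n = 5000
X = rng.uniform(-1.0, 1.0, size=(n, 1))
W = rng.integers(0, 2, size=n)
Y = 1.0 + 2.0 * X[:, 0] + W * (0.5 + X[:, 0]) + rng.normal(0.0, 0.5, size=n)

tau_hat = t_learner(X, W, Y)
x_grid = np.array([[-0.5], [0.0], [0.5]])
est = tau_hat(x_grid)   # should be close to the true values 0.0, 0.5, 1.0
```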


The S-learner treats the treatment variable W_i as if it were just another covariate like those in the vector X_i. Instead of having two models for the response as a function of the covariates X, the S-learner has a single model for the response as a function of X and the treatment W:

\begin{aligned} \mu(x, w) := \mathbb{E}[Y^{obs} \mid X = x, W = w]. \end{aligned}

The S-learner consists of 2 steps:

  1. Use all the observations to estimate the response function above, \hat\mu (x, w).
  2. Estimate the CATE by \hat\tau_S(x) = \hat\mu (x, 1) - \hat\mu (x, 0).
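
Here is a minimal Python sketch of these two steps, again using NumPy only. The base learner is an assumption of mine: least squares on main effects plus pairwise interactions of all input columns, so that W is treated exactly like any other column. (With a purely additive linear model, the S-learner's CATE estimate would be a constant, namely the coefficient on W.) The helper names and the simulated data are hypothetical.

```python
import numpy as np
from itertools import combinations

def expand(Z):
    """Intercept, main effects and pairwise interactions of Z's columns."""
    d = Z.shape[1]
    cols = [np.ones(len(Z))] + [Z[:, j] for j in range(d)]
    cols += [Z[:, i] * Z[:, j] for i, j in combinations(range(d), 2)]
    return np.column_stack(cols)

def fit_base(Z, y):
    """Illustrative base learner: least squares on the expanded features."""
    coef, *_ = np.linalg.lstsq(expand(Z), y, rcond=None)
    return lambda Z_new: expand(Z_new) @ coef

def s_learner(X, W, Y, fit=fit_base):
    # Step 1: a single model for the response, with W as an extra column.
    mu_hat = fit(np.column_stack([X, W]), Y)

    # Step 2: predict with W set to 1 and to 0, then take the difference.
    def tau_hat(X_new):
        m = len(X_new)
        treated = np.column_stack([X_new, np.ones(m)])
        control = np.column_stack([X_new, np.zeros(m)])
        return mu_hat(treated) - mu_hat(control)

    return tau_hat

# Simulated data (hypothetical): the true CATE is tau(x) = 0.5 + x.
rng = np.random.default_rng(0)
n = 5000
X = rng.uniform(-1.0, 1.0, size=(n, 1))
W = rng.integers(0, 2, size=n)
Y = 1.0 + 2.0 * X[:, 0] + W * (0.5 + X[:, 0]) + rng.normal(0.0, 0.5, size=n)

tau_hat = s_learner(X, W, Y)
x_grid = np.array([[-0.5], [0.0], [0.5]])
est = tau_hat(x_grid)   # should be close to the true values 0.0, 0.5, 1.0
```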


The X-learner consists of 3 steps (sharing the first step with the T-learner):

  1. Use observations in the control group to estimate the response under control, \hat\mu_0(x), and use observations in the treatment group to estimate the response under treatment, \hat\mu_1(x).
  2. Use the estimates in Step 1 to obtain estimates of the individual treatment effects (ITE). For observations in the control group, the ITE estimate is \hat{D}_i = \hat\mu_1(X_i) - Y_i^{obs}; for observations in the treatment group, it is \hat{D}_i = Y_i^{obs} - \hat\mu_0(X_i). Using just the observations from the control group, build a model with the imputed ITE \hat{D}_i as the response and X_i as the predictors to get \hat\tau_0(x). Do the same with just the observations from the treatment group to get \hat\tau_1(x).
  3. Estimate the CATE by combining the two estimates above: \hat\tau_X(x) = g(x) \hat\tau_0(x) + [1-g(x)]\hat\tau_1(x), where g(x) \in [0,1] is a weight function. A good choice for g is an estimate of the propensity score, P(W = 1 \mid X = x).
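
The three steps can be sketched in Python as follows. As before, this is an illustration under my own assumptions rather than EconML's implementation: the base learner is least squares via NumPy, the helper names fit_linear and x_learner are hypothetical, and the data are simulated with an unbalanced design (roughly 20% treated).

```python
import numpy as np

def fit_linear(X, y):
    """Illustrative base learner: least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def x_learner(X, W, Y, g, fit=fit_linear):
    ctrl, trt = (W == 0), (W == 1)

    # Step 1: response models, exactly as in the T-learner.
    mu0_hat = fit(X[ctrl], Y[ctrl])
    mu1_hat = fit(X[trt], Y[trt])

    # Step 2: impute individual treatment effects, then model them per group.
    D_ctrl = mu1_hat(X[ctrl]) - Y[ctrl]   # control group: mu1_hat(X_i) - Y_i
    D_trt = Y[trt] - mu0_hat(X[trt])      # treatment group: Y_i - mu0_hat(X_i)
    tau0_hat = fit(X[ctrl], D_ctrl)
    tau1_hat = fit(X[trt], D_trt)

    # Step 3: combine the two CATE models with the weight function g.
    def tau_hat(X_new):
        w = g(X_new)
        return w * tau0_hat(X_new) + (1.0 - w) * tau1_hat(X_new)

    return tau_hat

# Simulated data (hypothetical): true CATE is 0.5 + x, about 20% treated.
rng = np.random.default_rng(0)
n = 5000
X = rng.uniform(-1.0, 1.0, size=(n, 1))
W = rng.binomial(1, 0.2, size=n)
Y = 1.0 + 2.0 * X[:, 0] + W * (0.5 + X[:, 0]) + rng.normal(0.0, 0.5, size=n)

# Assumption: assignment does not depend on X, so the treated share is a
# reasonable (constant) estimate of the propensity score.
treated_share = W.mean()
g = lambda X_new: np.full(len(X_new), treated_share)

tau_hat = x_learner(X, W, Y, g)
x_grid = np.array([[-0.5], [0.0], [0.5]])
est = tau_hat(x_grid)   # should be close to the true values 0.0, 0.5, 1.0
```

If treatment assignment depended on the covariates, g would instead come from a fitted propensity model.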

When to use what?

The conclusions here were drawn from Reference 1, the paper which proposed the X-learner. So while I think they make intuitive sense, keep in mind when reading this section that they come from the X-learner's own authors.

  • Overall
    • The choice of base learner (for the intermediate models) can make a large difference in prediction accuracy. (This is an important advantage of metalearners in general.)
    • There is no universally best metalearner: for each of these 3 metalearners, there are situations where it performs best.
  • T-learner
    • Performs well if there are no common trends in the response under control and response under treatment and if the treatment effect is very complicated.
    • Because data is not pooled across treatment groups, it is difficult for the T-learner to mimic a behavior (e.g. discontinuity) that appears in both the control and treatment groups.
  • S-learner
    • Since the treatment indicator plays no special role, the base learners can completely ignore it during model-fitting. This is good if the CATE is zero in many places.
    • The S-learner can be biased toward zero.
    • For some base learners (e.g. k-nearest neighbors), treating the treatment indicator like any other covariate may not make sense.
  • X-learner
    • The X-learner can adapt to structural properties such as sparsity or smoothness of the CATE. (This is useful as CATE is often zero or approximately linear.)
      • When CATE is zero, it usually is not as good as the S-learner but is better than the T-learner.
      • When CATE is complex, it outperforms the S-learner, and is often better than the T-learner too.
    • It is particularly effective when the number of units in one treatment group (often the control group) is much larger than in the other.


References

  1. Künzel, S. R., et al. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning.
