Introduction
The synthetic control estimator (Abadie et al. (2010), Reference 1) is a technique for performing causal inference in the time series setting. The intervention/treatment is assumed to happen at a certain point in time for the treatment unit. The synthetic control estimator finds weights for a set of control units such that the weighted combination of control units mimics the behavior of the treatment unit in the pre-intervention period. The behavior of this weighted combination of control units (the “synthetic control”) in the post-intervention period is assumed to be a good proxy for what would have happened to the treatment unit, had it not received the intervention. The difference between what actually happened for the treatment unit and the synthetic control in the post-intervention period is an estimate of the treatment effect.
Seminal example
The seminal example for synthetic controls, discussed in Abadie et al. (2010), is estimating the effect of California’s tobacco control program. Specifically, a synthetic control was used to estimate the effect of Proposition 99, which went into effect in January 1989, on per-capita cigarette sales. The figure below shows that we cannot simply compare California against the rest of the US, because cigarette sales trends were quite different before the intervention (so any post-intervention differences cannot be attributed solely to the intervention):

We also can’t compare California directly with any single state for the same reason. A synthetic control takes a weighted average of some other states such that the pre-intervention cigarette sales trend for this weighted average closely matches that of California. Because the pre-intervention sales trends match, it is more plausible that the post-intervention difference is due to the intervention (and not something else). The figure below shows the trends for real and synthetic California:

The weights give us a way to interpret the control as well. In this example, synthetic California is 0.164 Colorado, 0.069 Connecticut, 0.199 Montana, 0.234 Nevada and 0.334 Utah. (We will need domain knowledge to know whether this makes sense or not.)
Mathematical details
Assume that we have $latex J+1$ units, with just the first unit being exposed to the treatment. (The remaining $latex J$ units are controls.) Assume that there are $latex T$ time periods in total, with $latex 1, \dots, T_0$ being pre-intervention periods and $latex T_0 + 1, \dots, T$ being post-intervention periods. Let $latex Y_{it}^N$ denote the outcome that would be observed for unit $latex i$ at time $latex t$ if it was not exposed to the intervention in periods $latex T_0 + 1$ to $latex T$, and let $latex Y_{it}^I$ denote the outcome that would be observed if it was exposed. What we want to estimate is

$latex \alpha_{1t} = Y_{1t}^I - Y_{1t}^N$

for $latex t = T_0 + 1, \dots, T$. In the above, $latex Y_{1t}^I$ is observed while $latex Y_{1t}^N$ is not and has to be estimated by the synthetic control. If we have the synthetic control weights $latex w_2, \dots, w_{J+1}$, then we estimate $latex \alpha_{1t}$ with

$latex \hat{\alpha}_{1t} = Y_{1t}^I - \displaystyle\sum_{j=2}^{J+1} w_j Y_{jt}^N.$

Thus, the goal is to find a set of weights $latex w_2, \dots, w_{J+1}$ such that the weighted average of the $latex J$ controls “looks like the treatment unit” before the intervention. The devil is of course in the specifics of what it means to “look like the treatment unit”. There are several ways to do this!
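To make the estimate concrete, here is a minimal Python sketch that computes the estimated effect in each post-intervention period once the weights are known. The function name `estimate_effect` and all of the numbers are made up for illustration; they do not come from the California study.

```python
import numpy as np


def estimate_effect(y_treated_post, Y_controls_post, w):
    """Observed treated outcome minus the weighted average of the controls.

    y_treated_post:  shape (n_post,), observed outcomes of the treated unit
    Y_controls_post: shape (n_post, J), one column per control unit
    w:               shape (J,), synthetic control weights
    """
    y_treated_post = np.asarray(y_treated_post, dtype=float)
    Y_controls_post = np.asarray(Y_controls_post, dtype=float)
    w = np.asarray(w, dtype=float)
    synthetic = Y_controls_post @ w  # synthetic control outcome per period
    return y_treated_post - synthetic


# Hypothetical data: 2 post-intervention periods, 2 control units.
y_post = np.array([10.0, 9.0])
Y_post = np.array([[12.0, 14.0],
                   [12.0, 16.0]])
w = np.array([0.5, 0.5])

alpha_hat = estimate_effect(y_post, Y_post, w)
print(alpha_hat)  # one estimated effect per post-intervention period
```

The hard part, of course, is choosing `w`; the function above only applies weights that have already been found.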
Mathematical details: Variation 1
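The first is to minimize the sum of squared differences between the treatment unit and the synthetic control over the pre-intervention period:

$latex \begin{aligned} \text{minimize}_{w_2, \dots, w_{J+1}} \quad& \sum_{t=1}^{T_0} \left( Y_{1t}^N - \sum_{j=2}^{J+1} w_j Y_{jt}^N \right)^2 \\ \text{subject to} \quad& w_j \geq 0 \text{ for } j = 2, \dots, J+1, \quad \sum_{j=2}^{J+1} w_j = 1. \end{aligned}$

The constraints on the weights are not strictly necessary, but interpretation comes more easily when the synthetic control is a convex combination of the control units. (E.g. how do you interpret California as being 1.2 times Utah and -0.2 times Colorado?)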
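We can write the optimization problem above more compactly if we introduce matrix notation. Let $latex \mathbf{W} = (w_2, \dots, w_{J+1})^T \in \mathbb{R}^J$ be the vector of weights. Let $latex \mathbf{X}_1 = (Y_{11}^N, \dots, Y_{1T_0}^N)^T \in \mathbb{R}^{T_0}$, and let $latex \mathbf{X}_0 \in \mathbb{R}^{T_0 \times J}$ be the matrix such that the $latex j$th column of $latex \mathbf{X}_0$ is the pre-intervention time series for unit $latex j+1$. Then we can write the above as

$latex \begin{aligned} \text{minimize}_{\mathbf{W}} \quad& \| \mathbf{X}_1 - \mathbf{X}_0 \mathbf{W} \|^2 \\ \text{subject to} \quad& w_j \geq 0 \text{ for } j = 2, \dots, J+1, \quad \sum_{j=2}^{J+1} w_j = 1, \end{aligned}$

where $latex \| \cdot \|$ denotes the Frobenius norm (which for a vector is just the Euclidean norm).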
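Here is a minimal sketch of Variation 1 in Python, assuming NumPy and SciPy are available. It uses SciPy's general-purpose SLSQP solver rather than a dedicated quadratic programming routine; that is my choice for the sketch, not something prescribed by Abadie et al. (2010). The function name `fit_weights` and the simulated data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def fit_weights(x1, X0):
    """Minimize ||x1 - X0 @ w||^2 subject to w >= 0 and sum(w) == 1.

    x1: shape (T0,), pre-intervention series of the treated unit
    X0: shape (T0, J), one column per control unit
    """
    J = X0.shape[1]

    def objective(w):
        return np.sum((x1 - X0 @ w) ** 2)

    def gradient(w):
        return -2.0 * X0.T @ (x1 - X0 @ w)

    result = minimize(
        objective,
        x0=np.full(J, 1.0 / J),  # start from equal weights
        jac=gradient,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * J,  # nonnegative weights
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
        options={"ftol": 1e-12, "maxiter": 500},
    )
    return result.x


# Simulated data (an assumption for illustration): the treated unit is an
# exact convex combination of the first two controls.
rng = np.random.default_rng(0)
T0, J = 20, 4
X0 = rng.normal(size=(T0, J))
true_w = np.array([0.6, 0.4, 0.0, 0.0])
x1 = X0 @ true_w

w_hat = fit_weights(x1, X0)
print(np.round(w_hat, 3))
```

Because the simulated treated unit is an exact convex combination of the controls, `w_hat` should land close to `true_w`. With real data the pre-intervention fit will generally not be exact.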
Mathematical details: Variation 2
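In Variation 1, we have an optimization problem involving a $latex T_0 \times J$ matrix. Solving the optimization problem could take a while if $latex T_0$ is large. Instead of minimizing the squared difference in each pre-intervention time period, Abadie et al. (2010) suggest minimizing the difference for a few linear combinations of the pre-intervention time periods. Specifically, for $latex m = 1, \dots, M$ and each unit $latex i$, define $latex \bar{Y}_i^{K_m} = \sum_{s=1}^{T_0} k_{ms} Y_{is}^N$, where $latex K_m = (k_{m1}, \dots, k_{mT_0})^T$ are some pre-specified weights. Let $latex \mathbf{X}_1 = (\bar{Y}_1^{K_1}, \dots, \bar{Y}_1^{K_M})^T \in \mathbb{R}^M$, and define $latex \mathbf{X}_0 \in \mathbb{R}^{M \times J}$ analogously (one column per control unit). The synthetic control weights are the solution to

$latex \begin{aligned} \text{minimize}_{\mathbf{W}} \quad& \| \mathbf{X}_1 - \mathbf{X}_0 \mathbf{W} \|^2 \\ \text{subject to} \quad& w_j \geq 0 \text{ for } j = 2, \dots, J+1, \quad \sum_{j=2}^{J+1} w_j = 1. \end{aligned}$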
(With today’s computational power, I’m not sure there’s a need to make this adjustment.)
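As a rough illustration of how the linear combinations shrink the problem, the sketch below stacks a few sets of pre-specified weights as rows of a matrix `K` and applies them to the pre-intervention outcomes in one matrix product. The particular summaries chosen (overall mean, first-half mean, last value) and the simulated outcomes are my own assumptions, not the choices made in Abadie et al. (2010).

```python
import numpy as np

rng = np.random.default_rng(0)
T0, J, M = 30, 5, 3

# Hypothetical pre-intervention outcomes: column 0 is the treated unit,
# columns 1..J are the control units.
Y_pre = rng.normal(size=(T0, J + 1))

# Each row of K is one set of pre-specified weights over the T0 periods.
K = np.zeros((M, T0))
K[0, :] = 1.0 / T0                     # mean over the whole pre-period
K[1, : T0 // 2] = 1.0 / (T0 // 2)      # mean over the first half
K[2, -1] = 1.0                         # last pre-intervention value

# Entry (m, i) is the m-th linear combination of unit i's outcomes.
summaries = K @ Y_pre                  # shape (M, J + 1)

X1 = summaries[:, 0]                   # shape (M,), treated unit
X0 = summaries[:, 1:]                  # shape (M, J), control units
print(X1.shape, X0.shape)
```

The optimization then works with an `M`-row matrix instead of a `T0`-row one, which is the whole point of this variation.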
Mathematical details: Variation 3
In the previous two variations, we assumed that the only data we had was for the response of interest. In some cases, we might have observed covariates in the pre-intervention period (e.g. gender, age). It makes intuitive sense that a good synthetic control matches the treatment unit’s behavior for both the pre-intervention response and other observed covariates.
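Assume for each unit $latex i$, we have observed covariates $latex Z_i \in \mathbb{R}^r$. Let $latex \mathbf{X}_1 = (Z_1^T, \bar{Y}_1^{K_1}, \dots, \bar{Y}_1^{K_M})^T \in \mathbb{R}^{r+M}$, and define $latex \mathbf{X}_0 \in \mathbb{R}^{(r+M) \times J}$ analogously. The synthetic control weights are the solution to

$latex \begin{aligned} \text{minimize}_{\mathbf{W}} \quad& \| \mathbf{X}_1 - \mathbf{X}_0 \mathbf{W} \|^2 \\ \text{subject to} \quad& w_j \geq 0 \text{ for } j = 2, \dots, J+1, \quad \sum_{j=2}^{J+1} w_j = 1. \end{aligned}$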
Mathematical details: Variation 4
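In the previous variations, we assumed that the distance of interest was squared Euclidean distance (represented by the Frobenius norm), so that every component being matched counts equally. This need not be the case: it may be more important to match on some covariates than on others. To accommodate this, we can solve the problem

$latex \begin{aligned} \text{minimize}_{\mathbf{W}} \quad& \| \mathbf{X}_1 - \mathbf{X}_0 \mathbf{W} \|_V \\ \text{subject to} \quad& w_j \geq 0 \text{ for } j = 2, \dots, J+1, \quad \sum_{j=2}^{J+1} w_j = 1, \end{aligned}$

where $latex V$ is some symmetric and positive semidefinite matrix, and where $latex \| x \|_V = \sqrt{x^T V x}$. Section 2.3 of Abadie et al. (2010) has some discussion on how to choose $latex V$.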
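One way to see what the weighting matrix (call it `V`) does: when `V` is diagonal with nonnegative entries, the weighted norm is just the Euclidean norm after rescaling each row by the square root of its diagonal entry, so a solver for the unweighted problem can be reused on the rescaled data. The sketch below checks this identity numerically. The diagonal form of `V`, the helper name `v_norm` and all the numbers are assumptions made for illustration.

```python
import numpy as np


def v_norm(x, V):
    """Weighted norm sqrt(x^T V x) for a symmetric PSD matrix V."""
    return float(np.sqrt(x @ V @ x))


# Hypothetical importance weights: the first feature matters most.
v = np.array([4.0, 1.0, 0.25])
V = np.diag(v)

X1 = np.array([1.0, 2.0, 3.0])           # treated unit, 3 features
X0 = np.array([[1.5, 0.0],
               [2.0, 1.0],
               [2.0, 5.0]])              # 2 control units, one per column
W = np.array([0.7, 0.3])                 # some candidate weights

residual = X1 - X0 @ W
lhs = v_norm(residual, V)

# Same quantity via row rescaling followed by the plain Euclidean norm.
scale = np.sqrt(v)
rhs = np.linalg.norm(scale * X1 - (scale[:, None] * X0) @ W)

print(np.isclose(lhs, rhs))
```

For a general (non-diagonal) positive semidefinite `V`, the same idea works with a matrix square root of `V` in place of the elementwise square root.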
Adapting to cross-sectional data
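While the synthetic control estimator was introduced for the time series setting, it can be adapted to the cross-sectional setting as well (see e.g. Abadie & L’Hour (2021), Reference 2). Assume again that we have $latex J+1$ units with just the first being exposed to treatment. The responses for unit $latex i$ under treatment and control are denoted by $latex Y_i^I$ and $latex Y_i^N$ respectively. (This is the same as before except there is no subscript for time.) We wish to estimate the treatment effect for the treated unit:

$latex \alpha_1 = Y_1^I - Y_1^N.$

The synthetic control estimate for this is

$latex \hat{\alpha}_1 = Y_1^I - \displaystyle\sum_{j=2}^{J+1} w_j Y_j^N,$

where the $latex w_j$‘s are the synthetic control weights. These weights are determined by the same optimization problem as before:

$latex \begin{aligned} \text{minimize}_{\mathbf{W}} \quad& \| \mathbf{X}_1 - \mathbf{X}_0 \mathbf{W} \|_V \\ \text{subject to} \quad& w_j \geq 0 \text{ for } j = 2, \dots, J+1, \quad \sum_{j=2}^{J+1} w_j = 1. \end{aligned}$

The difference is just in the definition of $latex \mathbf{X}_1$ and $latex \mathbf{X}_0$. Recall that in the time series case, $latex \mathbf{X}_1 = (Z_1^T, \bar{Y}_1^{K_1}, \dots, \bar{Y}_1^{K_M})^T$, where the $latex Z$‘s were observed covariates and the $latex \bar{Y}$‘s were linear combinations of the response in the pre-intervention period. For the cross-sectional setting, $latex \mathbf{X}_1 = Z_1$ consists of just the observed covariates. ($latex \mathbf{X}_0$ is defined analogously.)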
The synthetic control estimator can be thought of as a type of nearest-neighbor estimator. However, as noted in Reference 2, it departs from the nearest-neighbor estimator in two major ways:
- The number of matches for the treated unit is not fixed a priori. (For $latex k$-nearest neighbors, the number of matches is fixed at $latex k$.)
- The matched units are not given equal weight: rather, the weights are determined by the optimization problem described above.
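For contrast, a nearest-neighbor matching estimator can be written in the same "weights on controls" form: it puts weight 1/k on each of the k closest controls and zero on the rest. The sketch below is only meant to make the two differences above concrete. The helper name `knn_weights`, the use of plain Euclidean distance and all the numbers are assumptions for illustration.

```python
import numpy as np


def knn_weights(x1, X0, k):
    """Equal weights 1/k on the k controls closest to the treated unit.

    x1: shape (p,), covariates of the treated unit
    X0: shape (p, J), one column of covariates per control unit
    """
    distances = np.linalg.norm(X0 - x1[:, None], axis=0)  # one per control
    nearest = np.argsort(distances)[:k]
    w = np.zeros(X0.shape[1])
    w[nearest] = 1.0 / k
    return w


# Hypothetical cross-sectional data: 2 covariates, 4 control units.
x1 = np.array([1.0, 2.0])
X0 = np.array([[1.1, 3.0, 0.8, 5.0],
               [2.1, 0.0, 2.0, 5.0]])
y_treated = 10.0
y_controls = np.array([8.0, 3.0, 7.0, 1.0])

w = knn_weights(x1, X0, k=2)
alpha_hat = y_treated - y_controls @ w  # same form as the synthetic control estimate
print(w, alpha_hat)
```

Here the number of nonzero weights is fixed at `k` and they are all equal, whereas the synthetic control weights come out of the constrained optimization and can be spread unevenly over any number of controls.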
References:
- Abadie, A., et al. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program.
- Abadie, A., and L’Hour, J. (2021). A penalized synthetic control estimator for disaggregated data.