# Strategies for proving convergence in distribution

When starting out in theoretical statistics, it’s often difficult to know how to even get started on the problem at hand. It helps to have a few general strategies in mind. One thing you are often asked to do is to prove that an estimator $\hat{\theta}_n$ (appropriately scaled) converges in distribution, and you are often asked to determine that limiting distribution. In this post, I list some strategies you can use for that.

Note:

1. This list is not exhaustive. If you have a strategy for proving convergence in distribution that is not in the list, feel free to share it!
2. The arguments here are not necessarily rigorous, so you need to check the conditions under which they apply.

Now for the strategies:

1. Use the central limit theorem (CLT), or some version of it. This is especially applicable for i.i.d. sums, but there are versions of the CLT that don’t require identical distribution (and there are some that don’t even require complete independence!).
2. Use the delta method.
3. If your estimator is the maximum likelihood estimator (MLE), then it is asymptotically normal under some regularity conditions (see here).
4. If your estimator is an M-estimator or a Z-estimator, then it is asymptotically normal under some regularity conditions (see here).
5. Work directly with the definition of convergence in distribution: show that the CDFs converge pointwise to the limiting CDF (at all continuity points of the limiting CDF).
6. Use Lévy’s continuity theorem: If the characteristic functions of the random variable sequence converge pointwise to the characteristic function of the limiting random variable, then the random variable sequence converges in distribution to the limiting random variable. (A similar theorem applies for moment generating functions, see here.)
7. Use the Portmanteau theorem.
8. Slutsky’s theorem will often come in handy. Often the thing we need to prove convergence in distribution for is a ratio $A_n / B_n$ with randomness in both numerator and denominator. Slutsky’s theorem allows us to deal with the two sources of randomness separately. If we can show something like $A_n / m_n \stackrel{d}{\rightarrow} P$ and $m_n / B_n \stackrel{P}{\rightarrow} c$ for some constant $c$ and deterministic sequence $\{ m_n \}$, then Slutsky’s theorem concludes that $A_n / B_n \stackrel{d}{\rightarrow} cP$. (When $m_n = m$ for all $n$, then showing $m_n / B_n \stackrel{P}{\rightarrow} c$ is the same as showing $B_n \stackrel{P}{\rightarrow} cm$.)
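To make strategies 1 and 8 concrete, here is a small simulation sketch (a toy example of my own, not tied to any particular problem): for i.i.d. exponential data, the CLT handles the numerator of the $t$-statistic and Slutsky's theorem handles the random denominator, so the $t$-statistic should look approximately standard normal.

```r
# Toy illustration of strategies 1 (CLT) and 8 (Slutsky):
# sqrt(n) * (xbar - mu) -> N(0, sigma^2) by the CLT, and s -> sigma in
# probability, so by Slutsky the t-statistic is asymptotically N(0, 1),
# even though the data (exponential) are far from normal.
set.seed(1)
n <- 500
tstats <- replicate(10000, {
  x <- rexp(n, rate = 1)  # mean 1, sd 1
  sqrt(n) * (mean(x) - 1) / sd(x)
})

# empirical quantiles should be close to standard normal quantiles
round(quantile(tstats, c(0.025, 0.5, 0.975)), 2)
round(qnorm(c(0.025, 0.5, 0.975)), 2)
```

The match is not perfect at $n = 500$ (the exponential's skewness shows up in the tails), which is a good reminder that these are asymptotic statements.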

# Strategies for proving that an estimator is consistent

When starting out in theoretical statistics, it’s often difficult to know how to even get started on the problem at hand. It helps to have a few general strategies in mind. In this post, I list some strategies you can use to prove that an estimator $\hat{\theta}_n$ is consistent for a parameter $\theta$.

Note:

1. This list is not exhaustive. If you have a strategy for proving consistency that is not in the list, feel free to share it!
2. The arguments here are not necessarily rigorous, so you need to check the conditions under which they apply.

Now for the strategies:

1. Use the (weak) law of large numbers. This is especially applicable for i.i.d. sums.
2. Use Chebyshev’s inequality: If $\hat\theta_n$ is unbiased, then $\mathbb{P} \{ |\hat\theta_n - \theta | > \epsilon \} \leq \dfrac{Var(\hat\theta_n)}{\epsilon^2}$. Thus, the estimator will be consistent if $Var(\hat\theta_n) \rightarrow 0$.
3. Actually in strategy 2, we only need $\hat\theta_n$ to be asymptotically unbiased (i.e. $\mathbb{E}[\hat\theta_n] \rightarrow \theta$) and the result would still hold.
4. If your estimator is the maximum likelihood estimator (MLE), then it is consistent under some regularity conditions (see here).
5. If your estimator is an M-estimator or a Z-estimator, try to use the argmax consistency theorem (e.g. see slides 7 and 9 here).
6. Try to use the continuous mapping theorem: if $X_n \stackrel{P}{\rightarrow} X$, then $g(X_n) \stackrel{P}{\rightarrow} g(X)$ for any continuous function $g$.
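As a quick sanity check of strategy 2, here is a simulation sketch (a toy example of my own): the sample mean of $N(\mu, 1)$ data is unbiased with variance $1/n \rightarrow 0$, so Chebyshev's inequality guarantees consistency, and the empirical exceedance probabilities shrink accordingly.

```r
# Toy illustration of strategy 2: xbar is unbiased for mu with
# Var(xbar) = 1 / n -> 0, so Chebyshev's inequality implies consistency.
set.seed(1)
mu <- 2
eps <- 0.1
for (n in c(10, 100, 1000)) {
  xbars <- replicate(5000, mean(rnorm(n, mean = mu, sd = 1)))
  cat(sprintf("n = %4d: P(|xbar - mu| > %.1f) approx %.3f (Chebyshev bound: %.3f)\n",
              n, eps, mean(abs(xbars - mu) > eps), min(1, 1 / (n * eps^2))))
}
```

Note how crude the Chebyshev bound is relative to the empirical probabilities; it only needs to go to zero, not to be tight.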

# Attributes in R

In R, objects are allowed to have attributes, which is a way for users to tag additional information to an R object.

There are a few reasons why one might want to use attributes. One reason that I encountered recently was to ensure that the type of object returned from a function remains consistent across a range of function options.

For example, imagine that you have a function that does a lot of complicated work to get an intermediate result res1, and just a little bit more work to get the final result res2. In some cases you might want to just return res2, while in other cases you might want to return res1 as well to save you computation in the future.

Without attributes, you might do something like this:

f <- function(keep_intermediate) {
  ...
  if (keep_intermediate) {
    return(list(res1 = res1, res2 = res2))
  } else {
    return(res2)
  }
}


The tricky thing here is that the return value is a list if keep_intermediate = TRUE, but may not be if keep_intermediate = FALSE. With attributes, you can avoid this issue:

f <- function(keep_intermediate) {
  ...
  if (keep_intermediate) {
    attr(res2, "res1") <- res1
  }
  return(res2)
}
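Here is a runnable toy version of that pattern (the function body and the choice of res1 and res2 are made up for illustration):

```r
# Hypothetical example of the attribute pattern above: res1 is the
# "expensive" intermediate result, res2 the final result.
f <- function(x, keep_intermediate = FALSE) {
  res1 <- cumsum(x)   # pretend this is the expensive computation
  res2 <- sum(res1)   # a little more work for the final result
  if (keep_intermediate) {
    attr(res2, "res1") <- res1
  }
  res2
}

out <- f(1:3, keep_intermediate = TRUE)
out                # a number either way...
attr(out, "res1")  # ...with the intermediate tagged on when requested
```

Either way, the caller always gets back a numeric result; the intermediate just comes along as an attribute when asked for.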


Working with attributes in R

Use the attributes() function to look at all the attributes an object has (returned as a list). The code below shows that by default a matrix will have a dim attribute, and the output of the lm() function will have names and class attributes.

x <- matrix(rnorm(10), ncol = 2)
attributes(x)
# $dim
# [1] 5 2

fit <- lm(rnorm(5) ~ x)
attributes(fit)
# $names
# [1] "coefficients" "residuals" "effects"
# [4] "rank" "fitted.values" "assign"
# [7] "qr" "df.residual" "xlevels"
# [10] "call" "terms" "model"
#
# $class
# [1] "lm"


Use the attr() function to set an attribute for an object:

x <- 1:3
attr(x, "test") <- "this is a test"
x
# [1] 1 2 3
# attr(,"test")
# [1] "this is a test"


Note that the method above will not work when the object is NULL:

y <- NULL
attr(y, "attr1") <- "test"
# Error in attr(y, "attr1") <- "test" : attempt to set an attribute on NULL


If the object is NULL, we can do the following instead:

attributes(y)$attr1 <- "this works"
attributes(y)["attr2"] <- "this also works"
str(y)
# list()
# - attr(*, "attr1")= chr "this works"
# - attr(*, "attr2")= chr "this also works"


Checking whether an object has a particular attribute doesn’t seem that easy; the code below is what I have (perhaps there is an easier way!). Note that

attributes(attributes(y))
# $names
# [1] "attr1" "attr2"


Hence, the code below checks if y has the "attr1" and "attr3" attributes:

"attr1" %in% attributes(attributes(y))$names
# [1] TRUE
"attr3" %in% attributes(attributes(y))$names
# [1] FALSE


Update (2020-10-20): Commenter Kent Johnson notes that the code below is an easier way to check for attributes:

names(attributes(y))
# [1] "attr1" "attr2"
"attr1" %in% names(attributes(y))
# [1] TRUE
"attr3" %in% names(attributes(y))
# [1] FALSE


You can use the following code to remove an attribute:

attr(y, "attr1") <- NULL
str(y)
# list()
# - attr(*, "attr2")= chr "this also works"


References:

1. Stack Overflow. How to set attributes for a variable in R?
2. de Vries, A., and Meys, J. How to Play With Attributes in R.

# What is calibration?

What is calibration?

In the context of binary classification, calibration refers to the process of transforming the output scores from a binary classifier into class probabilities. If we think of the classifier as a “black box” that transforms input data into a score, we can think of calibration as a post-processing step that converts the score into a probability of the observation belonging to class 1.

The scores from some classifiers can already be interpreted as probabilities (e.g. logistic regression), while the scores from other classifiers require an additional calibration step before they can be interpreted as such (e.g. support vector machines). (Note: The idea of calibration extends naturally to multi-class classification; for simplicity I do not discuss it here.)

What does it mean for a classifier to be well-calibrated?

A classifier is said to be well-calibrated if the estimated probabilities it outputs are accurate. In other words, for any $p \in [0, 1]$, if I consider all the observations to which the classifier assigns a probability $p$ of being in class 1, the long-run proportion of those which are truly in class 1 is $p$.

Note that it is possible for a classifier to have high sensitivity/specificity/AUC while being poorly calibrated.
This is because these metrics only quantify how good the classifier is at ranking the probability of observations being in class 1 relative to each other. For example, compare classifier A, which always outputs the true probability, with classifier B, which always outputs the true probability divided by 2. Both classifiers have the same sensitivity/specificity/AUC, but classifier A is perfectly calibrated while classifier B is not (its estimated probabilities for being in class 1 are too pessimistic).

(Aside: Nate Silver’s The Signal and the Noise has an excellent chapter on calibration in weather forecasting.)

How can I find out if my model is well-calibrated?

To assess how well-calibrated a classifier is, we can plot a calibration curve: a plot of actual probability vs. estimated probability for each observation. If the model is perfectly calibrated, then the points should line up on the $y = x$ line. The difficulty here is that we don’t get to see actual probabilities: we only get to see 0s and 1s (“did this observation fall in class 1 or not?”). In practice, this is what we do:

1. Sort observations by the classifier’s estimated probabilities.
2. Bin the observations into equally sized bins (it is common to pick 10 bins).
3. For each bin, plot the actual proportion of observations in class 1 against the mean estimated probability for the observations in the bin.

With this procedure, we will end up with a plot that looks something like this (taken from Reference 1):

Note that there is a tradeoff when it comes to selecting the number of bins: with too few bins we won’t have enough points on the curve, while with too many bins we will have too few observations in each bin, leading to more noise. It is common to select 10 bins.

In theory you should be able to construct a calibration curve with just the predictions, actual class membership, and a parameter for the number of bins.
However, all the functions I’ve found in R that plot calibration curves are more sophisticated in their output and require significantly more complex inputs… Does anyone know of a routine that plots calibration curves with these bare-bones inputs? Below is my homebrew version (I assume that the tidyverse package is loaded):

GetCalibrationCurve <- function(y, y_pred, bins = 10) {
  data.frame(y = y, y_pred = y_pred) %>%
    arrange(y_pred) %>%
    mutate(pos = row_number() / n(),
           bin = ceiling(pos * bins)) %>%
    group_by(bin) %>%
    summarize(estimated_prob = mean(y_pred),
              actual_prob = mean(y))
}


The function returns a data frame with one row for each bin, giving the estimated and actual probabilities for the observations in that bin. Here is an example of how this function can be used to make a calibration curve:

# generate data
set.seed(1)
x <- matrix(rnorm(100 * 10), nrow = 100)
eta <- x[, 1] + x[, 2]^2 - x[, 3]^4
mu <- exp(eta) / (1 + exp(eta))
y <- sapply(mu, function(p) rbinom(1, size = 1, prob = p))
df <- data.frame(x, y)

# fit logistic regression model
fit <- glm(y ~ ., data = df, family = binomial())
y_pred <- predict(fit, df, type = "response")

# plot calibration curve
df <- GetCalibrationCurve(y, y_pred, bins = 10)
ggplot(df, aes(estimated_prob, actual_prob)) +
  geom_point() +
  geom_line() +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
  theme_bw() +
  labs(title = "Calibration curve",
       x = "Estimated probability",
       y = "Actual probability")


This model does not seem to be well calibrated. (See Reference 1 for code for plotting calibration curves in Python.)

How can I calibrate my model?

The Wikipedia page for calibration lists a number of methods for calibration. Based on my googling it looks like Platt scaling and isotonic regression are the more commonly used methods (I might write a post on them in the future). Reference 1 gives Python code for running these two methods.
R has several different functions that perform calibration, but none of them seem very easy to use. Reference 2 has some R code for both Platt scaling and isotonic regression.

Can I run a hypothesis test to check if my model is well-calibrated?

The most commonly used hypothesis test for checking model calibration is the Hosmer-Lemeshow goodness-of-fit test. It does have its deficiencies (see the discussion section of Reference 3), and several other methods have been proposed as alternatives. It doesn’t seem like any of them has become the new de facto standard.

References:

1. Poulopoulos, D. Classifier calibration.
2. NSS. (2016). Using Platt scaling and isotonic regression to minimize logloss error in R.
3. Dreiseitl, S., and Osl, M. (2012). Testing the calibration of classification models from first principles.

# LaTeX: underbrace and underset to get text underneath text

In LaTeX, when using \underbrace and \underset, I often forget the correct syntax and where the text on top and below should go. This is a short reference to remind myself.

\underbrace: Use \underbrace{text-above}_{text-below} to get $\underbrace{text-above}_{text-below}$

\underset: Use \underset{text-below}{text-above} to get $\underset{text-below}{text-above}$

# p-value around 0.32? You may have a problem with outliers…

I’ve been reading through Kohavi et al.’s Trustworthy Online Controlled Experiments, which gives a ton of great advice on how to do A/B testing well. This post is about one nugget regarding the $t$-test and outliers that I did not know before. (This appears in Chapter 19 of the book, “The A/A Test”. In the book they mention this issue in the context of computing the $p$-values for several two-sample $t$-tests. In this post I use a one-sample $t$-test because it is easier to illustrate the point with fewer sources of randomness.)

Imagine we are doing a one-sample $t$-test with 100 observations.
Say our observations are generated in the following way:

set.seed(1)
x <- runif(100, min = 0, max = 1) + 10
x[100] <- 100000


Every single value in x is much higher than zero. We have this one huge outlier, but it’s in the “right” direction (i.e. greater than all other values in x, much greater than zero). If we run a two-sided $t$-test with this data for the null hypothesis that the mean of the data generating distribution is equal to zero, we should get a really small $p$-value, right? Wrong! Here is the R output:

t.test(x)$p.value
# [1] 0.3147106


The problem is that outlier in x. Sure, it increases the mean of x by a lot, but it also increases the variance a ton! Remember that the $t$-statistic is the ratio of the mean estimate to its estimated standard error, so the outlier’s effects on the mean and the variance “cancel” each other out.

Here’s a sketch of the mathematical argument. Assume the outlier has value $o$, and that there are $n$ observations in the sample. When the outlier is very large, it dominates the mean estimate, so the numerator of the $t$-statistic is something like $o / n$. It also dominates the variance estimate:

\begin{aligned} \text{Variance} &= \mathbb{E}[X^2] - (\mathbb{E}[X])^2, \\ \text{sample variance } s^2 &\approx \frac{o^2}{n} - \frac{o^2}{n^2} \approx \frac{o^2}{n}. \end{aligned}

Hence, the $t$-statistic is roughly

\begin{aligned} t = \frac{\hat{\mu} - 0}{\sqrt{s^2 / n}} \approx \frac{o/n}{\sqrt{(o^2/n)/n}} = 1 \text{ or } -1, \end{aligned}

depending on whether $o$ is positive or negative. Hence, the two-sided $p$-value is approximately $\mathbb{P} \{ |\mathcal{N}(0,1)| \geq 1 \}$:

2 * pnorm(-1)
# [1] 0.3173105
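To watch the approximation kick in, we can reuse the simulated data from above and let the outlier grow (the particular outlier values here are arbitrary choices of mine):

```r
# As the outlier o grows, the one-sample t-test p-value approaches
# 2 * pnorm(-1) ~ 0.317 instead of shrinking to zero.
set.seed(1)
x <- runif(100, min = 0, max = 1) + 10
for (o in c(1e3, 1e5, 1e7)) {
  x[100] <- o
  print(t.test(x)$p.value)
}
```

For moderate outliers the $p$-value is still small; it is only once the outlier dominates both the mean and the variance that it settles near 0.317.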


Another way to understand this intuitively is to make a plot of the data. First, let’s plot the data without the outlier:

plot(x[1:99], pch = 16, cex = 0.5, ylim = c(0, max(x[1:99])),
     xlab = "Observation index", ylab = "Value",
     main = "Plot of values without outlier")
abline(h = 0, col = "red", lty = 2)


It’s obvious here that the variation in x pales in comparison to the distance of x from the $y = 0$ line, so we expect a very small $p$-value. (In fact, R tells me the $p$-value is $\approx 4 \times 10^{-158}$!) Next, let’s plot the data with the outlier:

plot(x, pch = 16, cex = 0.5, ylim = c(0, max(x)),
     xlab = "Observation index", ylab = "Value",
     main = "Plot of values with outlier")
abline(h = 0, col = "red", lty = 2)


Not so obvious that the mean of x is different from zero now right?

# What is the Frisch–Waugh–Lovell theorem?

The Frisch–Waugh–Lovell (FWL) theorem (named after econometricians Ragnar Frisch, Frederick V. Waugh and Michael C. Lovell) relates the coefficients for two different regressions.

Assume we have a response $y \in \mathbb{R}^n$ and two data matrices $X_1 \in \mathbb{R}^{n \times p_1}$ and $X_2 \in \mathbb{R}^{n \times p_2}$. On one hand, we could perform ordinary least squares (OLS) of $y$ on $X_1$ and $X_2$ (jointly) to get coefficient vectors $\hat\beta_1 \in \mathbb{R}^{p_1}$ and $\hat\beta_2 \in \mathbb{R}^{p_2}$. (The implicit model here is $y = X_1 \beta_1 + X_2 \beta_2 + \text{noise}$.)

Alternatively, we could do the following:

1. Perform OLS of $y$ on $X_1$ to get residuals $\tilde{y}$.
2. Perform OLS of $X_2$ on $X_1$ (in a column-wise fashion) to get residuals $\tilde{X_2}$.
3. Perform OLS of $\tilde{y}$ on $\tilde{X_2}$ to get coefficient vector $\tilde{\beta}_2$.

The Frisch–Waugh–Lovell theorem states that $\hat{\beta}_2 = \tilde{\beta}_2$.

The proof of the theorem is not hard. Reference 2 proves the theorem using two well-known facts about OLS, while Reference 3 proves the theorem using just matrix algebra.
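The theorem is also easy to verify numerically. Here is a quick sketch in R on simulated data of my own (lm.fit does the least squares fits):

```r
# Numerical check of the FWL theorem on simulated data.
set.seed(42)
n <- 100
X1 <- cbind(1, matrix(rnorm(n * 2), nrow = n))  # p1 = 3 (includes intercept)
X2 <- matrix(rnorm(n * 2), nrow = n)            # p2 = 2
y <- drop(X1 %*% c(1, 2, 3) + X2 %*% c(4, 5) + rnorm(n))

# OLS of y on (X1, X2) jointly; keep the coefficients on X2
beta2_hat <- lm.fit(cbind(X1, X2), y)$coefficients[4:5]

# residualize y and each column of X2 on X1, then regress residuals
H1 <- X1 %*% solve(crossprod(X1), t(X1))  # projection onto the columns of X1
y_tilde <- drop(y - H1 %*% y)
X2_tilde <- X2 - H1 %*% X2
beta2_tilde <- lm.fit(X2_tilde, y_tilde)$coefficients

all.equal(unname(beta2_hat), unname(beta2_tilde))
# [1] TRUE
```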

References:

1. Wikipedia. Frisch-Waugh-Lovell theorem.
2. Lovell, M. C. (2008). A simple proof of the FWL theorem.
3. Belzile, L. (2019). Frisch-Waugh-Lovell theorem.

# Simulating paths from a random walk

If you’ve ever visited this blog at wordpress.com, you might have noticed a header image that looks like this:

Ever wonder how it was generated? The image depicts 100 simulations of an asymmetric random walk. In this post, I’ll go through the code used to generate this image. All the code can also be found here.

For $t = 1, 2, \dots$, consider a series of i.i.d. random variables $X_1, X_2, \dots$ such that for any $t$, $X_t = 1$ with probability $p$, and $X_t = -1$ with probability $1-p$. ($p \in [0,1]$ is a parameter that we get to choose.) A random walk simply tracks the cumulative sum of these random variables, i.e.

$S_0 = 0, \quad S_t = \sum_{s = 1}^t X_s.$

In my image, I let the random walk run until it hits a fixed upper limit or a fixed lower limit. Here is an R function that generates one realization of this random walk:

# returns the random walk path values as a vector
# (random walk always starts at 0)
# p: probability of increasing by 1
# stop if path value hits either lower or upper
run <- function(p, lower, upper) {
  values <- c(0)
  current <- 0
  while (current > lower & current < upper) {
    current <- current + ifelse(runif(1) < p, 1, -1)
    values <- c(values, current)
  }
  values
}


(There might be more efficient ways of doing this, but since computation is relatively fast this is good enough for our purposes.)

The code below creates a list of 100 elements, each element being a simulated random walk. The probability of going up at any one step is 0.48, and we stop the random walk once we hit -50 or 50.

N <- 100  # no. of paths to simulate
p <- 0.48
lower <- -50
upper <- 50

# simulate paths
set.seed(1055)
vlist <- replicate(N, run(p, lower, upper))


We can plot these paths along with the x-axis and the upper & lower limits:

# get length of longest path
max_length <- max(sapply(vlist, length))

# make plot
par(mar = rep(0, 4))  # no margins
plot(c(1, max_length), c(lower, upper), type = "n")
for (i in 1:N) {
  lines(1:length(vlist[[i]]), vlist[[i]])
}
abline(h = 0, lty = "dashed")
abline(h = lower, lwd = 2)
abline(h = upper, lwd = 2)


Without color, it’s hard to make much sense of the image. To introduce color, let’s create a function that picks a color for each path based on (i) whether the path hit the upper or lower limit, and (ii) how long the path is compared to the longest path in the image.

colorPicker <- function(values, max_length,
                        ls_color = c(178, 34, 34), ll_color = c(255, 204, 0),
                        us_color = c(0, 0, 102), ul_color = c(102, 204, 225)) {
  l <- length(values)
  if (values[l] < 0) {
    rgb_values <- (ls_color + (ll_color - ls_color) * l / max_length) / 255
  } else {
    rgb_values <- (us_color + (ul_color - us_color) * l / max_length) / 255
  }
  rgb(rgb_values[1], rgb_values[2], rgb_values[3])
}


If a path hits the lower limit and is very short, its color will be close to ls_color / 255, and if a path hits the lower limit and has the longest length among all our simulated paths, it will have color ll_color / 255. There is a similar relationship for paths hitting the upper limit and us_color and ul_color. (You can see what these colors are at a website such as this.)

We can now color our black-and-white image:

plot(c(1, max_length), c(lower, upper), type = "n")
for (i in 1:N) {
  lines(1:length(vlist[[i]]), vlist[[i]],
        col = colorPicker(vlist[[i]], max_length), lwd = 0.5)
}
abline(h = 0, lty = "dashed")
abline(h = lower, lwd = 2)
abline(h = upper, lwd = 2)


From the image it is now obvious that most of the paths hit the lower limit first, and fairly quickly at that! Here is the same image, but with different choices of color scale:

plot(c(1, max_length), c(lower, upper), type = "n")
for (i in 1:N) {
  lines(1:length(vlist[[i]]), vlist[[i]],
        col = colorPicker(vlist[[i]], max_length,
                          ls_color = c(230, 230, 230),
                          ll_color = c(166, 166, 166),
                          us_color = c(255, 0, 0),
                          ul_color = c(0, 0, 255)),
        lwd = 0.5)
}
abline(h = 0, lty = "dashed")
abline(h = lower, lwd = 2)
abline(h = upper, lwd = 2)


The possibilities are endless!

If you haven’t seen random walks before, you might be surprised at how a slightly biased walk ($p = 0.48$ instead of $p = 0.5$) results in so many more paths hitting the lower limit before the upper limit, even though the two limits are the same distance from 0. This is an example of 100 simulations when $p = 0.5$:

This is an example of 100 simulations when $p = 0.52$:

Isn’t it interesting how slight deviations in the probability of going one step up (instead of down) completely change the dynamics? It turns out that there is a closed form for the probability of hitting one limit (as opposed to hitting the other). If the upper limit is $x = b$ and the lower limit is $x = -a$, and $q = 1-p$, then

\begin{aligned} \mathbb{P} \{ \text{hit } x = b \text{ before } x = -a \} = \begin{cases} \frac{1 - (q/p)^a}{1 - (q/p)^{a+b}} &\text{if } p \neq 1/2, \\ \frac{a}{a+b} &\text{if } p = 1/2. \end{cases} \end{aligned}

For my blog header image, $p = 0.48$, $a = b = 50$, which means the probability of hitting the upper limit is approximately 0.0179: not high at all!
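Here is the formula as an R function (a direct transcription of the display above), which reproduces that number:

```r
# Probability of hitting the upper limit x = b before the lower limit x = -a,
# for a walk that goes up with probability p at each step.
hit_upper_prob <- function(p, a, b) {
  q <- 1 - p
  if (p == 0.5) {
    a / (a + b)
  } else {
    (1 - (q / p)^a) / (1 - (q / p)^(a + b))
  }
}

hit_upper_prob(0.48, 50, 50)  # approximately 0.0179
```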

# Confusion over ratio metrics in causal inference

Set-up

Assume that we are in the potential outcomes set-up in causal inference. We have a sample of $n$ individuals, and individual $i$ has potential outcomes $Y_i(1)$ and $Y_i(0)$. $Y_i(1)$ denotes the value of individual $i$‘s response if the individual is in the treatment group, while $Y_i(0)$ denotes the value if the individual is in the control group. The fundamental problem of causal inference is that as the experimenter, we only ever get to observe one of $Y_i(1)$ and $Y_i(0)$, but NEVER BOTH.

In causal inference, a common target that we want to estimate is the average treatment effect (ATE), defined as the expected difference in potential outcomes:

\begin{aligned} ATE = \mathbb{E}[Y(1) - Y(0)]. \end{aligned}

Ratio metrics: why we care

In some cases, we might be interested in a ratio of potential outcomes instead. This is not as unusual as one might think! We often hear claims such as “if you take this supplement, you will be x% stronger”: x can be expressed as

\begin{aligned} x = 100 \left( \frac{after}{before} - 1 \right). \end{aligned}

The work involved in estimating $x$ is essentially the same as that in estimating $\frac{after}{before}$.

The wrong way to define ratio metrics

What is the target we are trying to estimate for a ratio metric? Instinctively, one might define the target as

\begin{aligned} \mathbb{E}\left[\frac{Y(1)}{Y(0)} \right]. \end{aligned}

This is incorrect! In statistical parlance, the quantity above is unidentifiable: even if we have infinite data from this model (i.e. we see an infinite number of observations), we still cannot estimate this target!

Let’s see this through an example. In this first set-up, imagine that there is no treatment effect, i.e. $Y(1) = Y(0)$ for all individuals. Imagine also that $Y(0) = 1$ for half of the population and $Y(0) = 2$ for the other half, so that $\mathbb{E}[Y(1) / Y(0)] = 1$.

If I were to run a huge randomized experiment with half of the observations in control and half of the observations in treatment, what would I see? I would see 50% of the controls having value 1, 50% of the controls having value 2, and the same for the treatment group.

Now, imagine a second set-up, where again $Y(0) = 1$ for half of the population and $Y(0) = 2$ for half of the population. However, if $Y(0) = 1$, then $Y(1) = 2$, and if $Y(0) = 2$, then $Y(1) = 1$. In this set-up, $\mathbb{E}[Y(1) / Y(0)] = \frac{1}{2} \cdot (2/1) + \frac{1}{2} \cdot (1/2) = 1.25$.

What would I see if I were to run a huge randomized experiment? I would see EXACTLY the same data as that in the first set-up: 50% 1s and 50% 2s in the control group, and the same in the treatment group! We will not be able to differentiate between set-ups 1 and 2, even with infinite data, as the observed data will be the same.

(Notice that this problem does not arise for the ATE: both of these set-ups have the same ATE: 0.)
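The two set-ups are easy to simulate. In the sketch below (my own toy code), the targets $\mathbb{E}[Y(1)/Y(0)]$ differ, but the marginal distributions that a randomized experiment would reveal are identical:

```r
# Simulating the two set-ups: same observable data, different E[Y(1)/Y(0)].
set.seed(1)
n <- 1e5
y0 <- sample(c(1, 2), n, replace = TRUE)
y1_setup1 <- y0                     # set-up 1: no treatment effect
y1_setup2 <- ifelse(y0 == 1, 2, 1)  # set-up 2: potential outcomes swapped

# the unidentifiable target differs: 1 vs 1.25
c(mean(y1_setup1 / y0), mean(y1_setup2 / y0))

# but an experiment only reveals the marginal distributions, which match
table(y1_setup1) / n
table(y1_setup2) / n
```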

What we do instead

The target that practitioners use for ratio metrics is

\begin{aligned} \frac{\mathbb{E}[Y(1)]}{\mathbb{E}[Y(0)]}. \end{aligned}

If you run it through the two set-ups above, you will find that this target will have the same value in both settings: 1.

Reflections

It took me a while to wrap my head around this. One takeaway I have is that the fundamental problem of causal inference forces us to think hard about what quantities we can even hope to estimate. This is why I think the issue of identification comes up a lot more in causal inference than in the rest of statistics.

I want to end this post off by highlighting two other things to worry about when using ratio metrics:

1. What happens if the denominator can be negative? How do you interpret the target in that case?
2. What happens if the denominator is very close to zero, or worse, equal to zero? Having something close to zero in the denominator usually causes estimates to be very unstable.

All that to say: approach ratio metrics with care!

# Using dplyr::filter when the condition is a string

This took me a while to figure out and so I thought I would post this as future reference. Let’s say I have the mtcars data and I want to filter for just the rows with cyl == 6. I would do something like this:

library(tidyverse)
data(mtcars)
mtcars %>% filter(cyl == 6)

#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
# Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
# Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
# Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6


What if I had the filter condition as a string instead? The code below doesn’t work:

filter_string <- "cyl == 6"
mtcars %>% filter(filter_string)

# Error: Problem with filter() input ..1.
# x Input ..1 must be a logical vector, not a character.
#   Input ..1 is filter_string.
# Run rlang::last_error() to see where the error occurred.


This is one possible solution:

mtcars %>% filter(!! rlang::parse_expr(filter_string))

#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
# Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
# Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
# Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6


This can be useful if you are trying to run several different filters automatically. In the following contrived example, I want to compute the mean MPG for two different slices of the data:

filters <- c("carb == 4", "am == 1")
for (filter in filters) {
  print(paste0("Mean mpg for ",
               filter,
               ": ",
               mtcars %>% filter(!! rlang::parse_expr(filter)) %>%
                 summarize(mean_mpg = mean(mpg)) %>%
                 pull()))
}

# [1] "Mean mpg for carb == 4: 15.79"
# [1] "Mean mpg for am == 1: 24.3923076923077"



You can read about parse_expr here and about !! here. I don’t fully understand tidy evaluation at this point, but the code above should work in a wide variety of situations.

(Disclaimer: There was a reference I came across for the !! + rlang::parse_expr trick that I can’t find now. If anyone knows where it is please let me know and I’ll acknowledge it here in the references.)