Note:

- This list is not exhaustive. If you have a strategy for proving asymptotic normality that is not in the list, feel free to share it!
- The arguments here are not necessarily rigorous, so you need to check the conditions under which they apply.

Now for the strategies:

- Use the central limit theorem (CLT), or some version of it. This is especially applicable for i.i.d. sums, but there are versions of the CLT that don’t require identical distribution (and there are some that don’t even require complete independence!).
- Use the delta method.
- If your estimator is the maximum likelihood estimator (MLE), then it is asymptotically normal under some regularity conditions (see here).
- If your estimator is an M-estimator or a Z-estimator, then it is asymptotically normal under some regularity conditions (see here).
- Work directly with the definition of convergence in distribution: show that the CDFs converge pointwise to the limiting CDF (at all points of continuity for the limiting CDF).
- Use Lévy’s continuity theorem: If the characteristic functions of the random variable sequence converge pointwise to the characteristic function of the limiting random variable, then the random variable sequence converges in distribution to the limiting random variable. (A similar theorem applies for moment generating functions, see here.)
- Use the Portmanteau theorem.
- Slutsky’s theorem will often come in handy. Often the thing we need to prove convergence in distribution for is a ratio with randomness in both numerator and denominator. Slutsky’s theorem allows us to deal with the two sources of randomness separately. If we can show something like $X_n \stackrel{d}{\rightarrow} X$ and $Y_n \stackrel{p}{\rightarrow} c$ for some constant $c \neq 0$, then Slutsky’s theorem concludes that $X_n / Y_n \stackrel{d}{\rightarrow} X / c$. (When $Y_n \neq 0$ for all $n$, showing $Y_n \stackrel{p}{\rightarrow} c$ is the same as showing $1 / Y_n \stackrel{p}{\rightarrow} 1 / c$.)
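
The CLT strategy can be sanity-checked with a quick simulation. The sketch below (illustrative only, not part of any proof) draws standardized sample means from a skewed distribution and compares a tail probability against the standard normal's:

```r
# Standardized means of Exp(1) samples (mean 1, variance 1) are
# approximately N(0, 1) for large n by the CLT.
set.seed(42)
n <- 1000     # sample size for each mean
reps <- 5000  # number of simulated sample means
z <- replicate(reps, sqrt(n) * (mean(rexp(n, rate = 1)) - 1))

# Empirical tail probability vs. the standard normal tail probability
c(empirical = mean(z > 1.96), normal = pnorm(-1.96))
```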

Note:

- This list is not exhaustive. If you have a strategy for proving consistency that is not in the list, feel free to share it!
- The arguments here are not necessarily rigorous, so you need to check the conditions under which they apply.

Now for the strategies:

- Use the (weak) law of large numbers. This is especially applicable for i.i.d. sums.
- Use Chebyshev’s inequality: if $\hat{\theta}_n$ is an unbiased estimator of $\theta$, then $\mathbb{P}(|\hat{\theta}_n - \theta| \geq \epsilon) \leq \text{Var}(\hat{\theta}_n) / \epsilon^2$ for any $\epsilon > 0$. Thus, the estimator will be consistent if $\text{Var}(\hat{\theta}_n) \rightarrow 0$.
- Actually, in the previous strategy we only need $\hat{\theta}_n$ to be asymptotically unbiased (i.e. $\mathbb{E}[\hat{\theta}_n] \rightarrow \theta$) and the result would still hold.
- If your estimator is the maximum likelihood estimator (MLE), then it is consistent under some regularity conditions (see here).
- If your estimator is an M-estimator or a Z-estimator, try to use the argmax consistency theorem (e.g. see slides 7 and 9 here).
- Try to use the continuous mapping theorem: if $X_n \stackrel{p}{\rightarrow} X$, then $g(X_n) \stackrel{p}{\rightarrow} g(X)$ for any continuous function $g$.
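
The law of large numbers / Chebyshev strategies can be illustrated with a simulation (a sketch, not a proof): since $\text{Var}(\bar{X}_n) = \sigma^2 / n \rightarrow 0$, the sample mean's error shrinks as the sample size grows.

```r
# Error of the sample mean for increasing sample sizes: the estimate
# concentrates around the true mean because its variance goes to 0.
set.seed(1)
true_mean <- 5
sizes <- c(1e2, 1e4, 1e6)
errors <- sapply(sizes, function(n) abs(mean(rnorm(n, mean = true_mean)) - true_mean))
errors  # shrinks roughly like 1 / sqrt(n)
```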

There are a few reasons why one might want to use attributes. One reason that I encountered recently was to ensure that the type of object returned from a function remains consistent across a range of function options.

For example, imagine that you have a function that does a lot of complicated work to get an intermediate result `res1`, and just a little bit more work to get the final result `res2`. In some cases you might want to just return `res2`, while in other cases you might want to return `res1` as well to save you computation in the future.

Without attributes, you might do something like this:

```r
f <- function(keep_intermediate) {
  ...
  if (keep_intermediate) {
    return(list(res1 = res1, res2 = res2))
  } else {
    return(res2)
  }
}
```

The tricky thing here is that the return value is a list if `keep_intermediate = TRUE`, but may not be if `keep_intermediate = FALSE`. With attributes, you can avoid this issue:

```r
f <- function(keep_intermediate) {
  ...
  if (keep_intermediate) {
    attr(res2, "res1") <- res1
  }
  return(res2)
}
```
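
Here is a concrete toy version of this pattern (the computations are made up for illustration). Note that the return value is a plain number either way, with the intermediate result tucked away as an attribute:

```r
f <- function(keep_intermediate) {
  res1 <- 1:10        # pretend this was expensive to compute
  res2 <- sum(res1)   # just a little more work
  if (keep_intermediate) {
    attr(res2, "res1") <- res1
  }
  res2
}
f(FALSE)               # [1] 55
attr(f(TRUE), "res1")  # [1]  1  2  3  4  5  6  7  8  9 10
```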

**Working with attributes in R**

Use the `attributes()` function to look at all the attributes an object has (returned as a list). The code below shows that by default a matrix will have a `dim` attribute, and the output of the `lm()` function will have `names` and `class` attributes.

```r
x <- matrix(rnorm(10), ncol = 2)
attributes(x)
# $dim
# [1] 5 2

fit <- lm(rnorm(5) ~ x)
attributes(fit)
# $names
#  [1] "coefficients"  "residuals"     "effects"
#  [4] "rank"          "fitted.values" "assign"
#  [7] "qr"            "df.residual"   "xlevels"
# [10] "call"          "terms"         "model"
#
# $class
# [1] "lm"
```

Use the `attr()` function to set an attribute for an object:

```r
x <- 1:3
attr(x, "test") <- "this is a test"
x
# [1] 1 2 3
# attr(,"test")
# [1] "this is a test"
```

Note that the method above will not work when the object is `NULL`:

```r
y <- NULL
attr(y, "attr1") <- "test"
# Error in attr(y, "attr1") <- "test" : attempt to set an attribute on NULL
```

If the object is `NULL`, we can do the following instead:

```r
attributes(y)$attr1 <- "this works"
attributes(y)["attr2"] <- "this also works"
str(y)
# list()
# - attr(*, "attr1")= chr "this works"
# - attr(*, "attr2")= chr "this also works"
```

Checking if an object has a particular attribute doesn’t seem that easy; the code below is what I have (perhaps there is an easier way!). Note that

```r
attributes(attributes(y))
# $names
# [1] "attr1" "attr2"
```

Hence, the code below checks if `y` has the `"attr1"` and `"attr3"` attributes:

"attr1" %in% attributes(attributes(y))$names # [1] TRUE "attr3" %in% attributes(attributes(y))$names # [1] FALSE

*Update (2020-10-20):* Commenter Kent Johnson notes that the code below is an easier way to check for attributes:

```r
names(attributes(y))
# [1] "attr1" "attr2"
"attr1" %in% names(attributes(y))
# [1] TRUE
"attr3" %in% names(attributes(y))
# [1] FALSE
```

You can use the following code to remove an attribute:

```r
attr(y, "attr1") <- NULL
str(y)
# list()
# - attr(*, "attr2")= chr "this also works"
```
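
Relatedly, if you want to strip all attributes from an object at once, you can assign `NULL` to `attributes()`:

```r
z <- 1:3
attr(z, "a") <- "one"
attr(z, "b") <- "two"
attributes(z) <- NULL  # removes every attribute in one go
attributes(z)          # NULL
```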


In the context of binary classification, *calibration* refers to the process of transforming the output scores from a binary classifier to class probabilities. If we think of the classifier as a “black box” that transforms input data into a score, we can think of calibration as a post-processing step that converts the score into a probability of the observation belonging to class 1.

The scores from some classifiers can already be interpreted as probabilities (e.g. logistic regression), while the scores from some classifiers require an additional calibration step before they can be interpreted as such (e.g. support vector machines).

(*Note:* The idea of calibration can be extended naturally to multi-class classification; for simplicity I do not talk about it here.)

**What does it mean for a classifier to be well-calibrated?**

A classifier is said to be *well-calibrated* if the estimated probabilities it outputs are accurate. In other words, for any $p \in [0, 1]$: if I consider all the observations to which the classifier assigns probability $p$ of being in class 1, the long-run proportion of those which are truly in class 1 is $p$.

Note that it is possible for a classifier to have high sensitivity/specificity/AUC while being poorly calibrated. This is because these metrics only quantify how good the classifier is at ranking the probability of observations being in class 1 *relative* to each other. For example, compare classifier A which always outputs the true probability with classifier B which always outputs the true probability divided by 2. Both classifiers have the same sensitivity/specificity/AUC, but classifier A is perfectly calibrated while classifier B is not (its estimated probabilities for being in class 1 are too pessimistic).


**How can I find out if my model is well-calibrated?**

To assess how well-calibrated a classifier is, we can plot a *calibration curve*: a plot of actual probability vs. estimated probability for each observation. If the model is perfectly calibrated, then the points should line up on the $y = x$ line. The difficulty here is that we don’t get to see actual probabilities: we only get to see 0s and 1s (“did this observation fall in class 1 or not?”). In practice, this is what we do:

- Sort observations by the classifier’s estimated probabilities.
- Bin the observations into equally sized bins.
- For each bin, plot the actual proportion of observations in class 1 against the mean estimated probability for the observations in the bin.

With this procedure, we will end up with a plot that looks something like this (taken from Reference 1):

Note that there is a tradeoff when it comes to selecting the number of bins: with too few bins we won’t have enough points on the curve, while with too many bins we will have too few observations in each bin, leading to a noisier curve. It is common to select 10 bins.

In theory you should be able to construct a calibration curve with just the predictions, actual class membership, and a parameter for the number of bins. However, all the functions I’ve found in R that plot calibration curves are more sophisticated in their output and require significantly more complex inputs… Does anyone know of a routine that plots calibration curves with these bare-bones inputs? Below is my homebrew version (I assume that the `tidyverse` package is loaded):

```r
GetCalibrationCurve <- function(y, y_pred, bins = 10) {
  data.frame(y = y, y_pred = y_pred) %>%
    arrange(y_pred) %>%
    mutate(pos = row_number() / n(),
           bin = ceiling(pos * bins)) %>%
    group_by(bin) %>%
    summarize(estimated_prob = mean(y_pred),
              actual_prob = mean(y))
}
```

The function returns a dataframe with one row for each bin, giving the estimated and actual probabilities for the observations in that bin. Here is an example of how this function can be used to make a calibration curve:

```r
# generate data
set.seed(1)
x <- matrix(rnorm(100 * 10), nrow = 100)
eta <- x[, 1] + x[, 2]^2 - x[, 3]^4
mu <- exp(eta) / (1 + exp(eta))
y <- sapply(mu, function(p) rbinom(1, size = 1, prob = p))
df <- data.frame(x, y)

# fit logistic regression model
fit <- glm(y ~ ., data = df, family = binomial())
y_pred <- predict(fit, df, type = "response")

# plot calibration curve
df <- GetCalibrationCurve(y, y_pred, bins = 10)
ggplot(df, aes(estimated_prob, actual_prob)) +
  geom_point() +
  geom_line() +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
  theme_bw() +
  labs(title = "Calibration curve",
       x = "Estimated probability", y = "Actual probability")
```

This model does not seem to be calibrated well.

(See Reference 1 for code for plotting calibration curves in Python.)

**How can I calibrate my model?**

The Wikipedia page for calibration lists a number of methods for calibration. Based on my googling it looks like Platt scaling and isotonic regression are the more commonly used methods (I might write a post on them in the future). Reference 1 gives Python code for running these two methods. R has several different functions that perform calibration but none of them seem very easy to use. Reference 2 has some R code for both Platt scaling and isotonic regression.
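
To make Platt scaling concrete, here is a minimal sketch (with made-up data and a hypothetical `platt_scale` helper, not any package's API): fit a logistic regression of the true labels on the classifier's scores, then use its fitted probabilities as the calibrated estimates.

```r
# Minimal Platt scaling sketch: logistic regression of labels on scores
platt_scale <- function(y, scores) {
  calib_fit <- glm(y ~ scores, family = binomial())
  function(new_scores) {
    predict(calib_fit, data.frame(scores = new_scores), type = "response")
  }
}

# Example: scores that are systematically too pessimistic (true prob / 2)
set.seed(1)
p_true <- runif(1000)
y <- rbinom(1000, size = 1, prob = p_true)
scores <- p_true / 2
calibrate <- platt_scale(y, scores)

# Calibration should shrink the average gap to the true probabilities
mean(abs(calibrate(scores) - p_true))  # compare with mean(abs(scores - p_true))
```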

**Can I run a hypothesis test to check if my model is well-calibrated?**

The most commonly used hypothesis test for checking model calibration is the Hosmer-Lemeshow goodness-of-fit test. It does have its deficiencies (see the discussion section of Reference 3), and several other methods have been proposed as alternatives. It doesn’t seem like any of them has become the new de facto standard.

References:

- Poulopoulos, D. Classifier calibration.
- NSS. (2016). Using Platt scaling and isotonic regression to minimize logloss error in R.
- Dreiseitl, S., and Osl, M. (2012). Testing the calibration of classification models from first principles.

When using `\underbrace` and `\underset`, I often forget the correct syntax and where the text on top and below should go. This is a short reference to remind myself.

- `\underbrace`: Use `\underbrace{text-above}_{text-below}`, which places `text-below` beneath the brace drawn under `text-above`.
- `\underset`: Use `\underset{text-below}{text-above}`, which places `text-below` directly beneath `text-above`.
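
For instance, the two commands side by side in one display (arbitrary example content):

```latex
\[
\underbrace{a + b + c}_{\text{three terms}}
\qquad
\underset{n \to \infty}{\lim} a_n
\]
```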

(This appears in Chapter 19 of the book, “The A/A Test”. In the book they mention this issue in the context of computing the p-values for several two-sample t-tests. In this post I use a one-sample t-test because it is easier to illustrate the point with fewer sources of randomness.)

Imagine we are doing a one-sample t-test with 100 observations. Say our observations are generated in the following way:

```r
set.seed(1)
x <- runif(100, min = 0, max = 1) + 10
x[100] <- 100000
```

Every single value in `x` is much higher than zero. We have this one huge outlier, but it’s in the “right” direction (i.e. greater than all other values in `x`, much greater than zero). **If we run a two-sided t-test with this data for the null hypothesis that the mean of the data generating distribution is equal to zero, we should get a really small p-value, right?**

*Wrong!* Here is the R output:

```r
t.test(x)$p.value
# [1] 0.3147106
```

The problem is that outlier in `x`. Sure, it increases the mean of `x` by a lot, *but it also increases the variance a ton!* Remember that the t-statistic is the ratio of the mean estimate to its standard error, so the outlier’s effects on the mean and the variance “cancel” each other out.

Here’s a sketch of the mathematical argument. Assume the outlier has value $B$, and that there are $n$ observations in the sample. When the outlier is very large, it dominates the mean estimate, so the numerator of the t-statistic is something like $\bar{x} \approx B / n$. It also dominates the variance estimate:

$$s^2 \approx \frac{1}{n} \left( B - \frac{B}{n} \right)^2 \approx \frac{B^2}{n}.$$

Hence, the t-statistic is roughly

$$t = \frac{\bar{x}}{s / \sqrt{n}} \approx \frac{B / n}{\sqrt{B^2 / n} / \sqrt{n}} = \pm 1,$$

depending on whether $B$ is positive or negative. Hence, the two-sided p-value is approximately $2 \Phi(-1)$:

```r
2 * pnorm(-1)
# [1] 0.3173105
```
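
We can also check the $t \approx \pm 1$ heuristic directly by recomputing the t-statistic by hand on the same simulated data:

```r
# The outlier inflates the mean and the standard error by comparable
# amounts, pushing the t-statistic toward 1.
set.seed(1)
x <- runif(100, min = 0, max = 1) + 10
x[100] <- 100000
t_stat <- mean(x) / (sd(x) / sqrt(length(x)))
t_stat  # close to 1
```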

Another way to understand this intuitively is to make a plot of the data. First, let’s plot the data without the outlier:

```r
plot(x[1:99], pch = 16, cex = 0.5, ylim = c(0, max(x[1:99])),
     xlab = "Observation index", ylab = "Value",
     main = "Plot of values without outlier")
abline(h = 0, col = "red", lty = 2)
```

It’s obvious here that the variation in `x` pales in comparison to the distance of `x` from the $y = 0$ line, so we expect a very small p-value. (In fact, R tells me the p-value is vanishingly small!) Next, let’s plot the data with the outlier:

```r
plot(x, pch = 16, cex = 0.5, ylim = c(0, max(x)),
     xlab = "Observation index", ylab = "Value",
     main = "Plot of values with outlier")
abline(h = 0, col = "red", lty = 2)
```

Not so obvious that the mean of `x` is different from zero now, right?

Assume we have a response $y$ and two data matrices $X_1$ and $X_2$. On one hand, we could perform ordinary least squares (OLS) of $y$ on $X_1$ and $X_2$ (jointly) to get coefficient vectors $\hat{\beta}_1$ and $\hat{\beta}_2$. (The implicit model here is $y = X_1 \beta_1 + X_2 \beta_2 + \epsilon$.)

Alternatively, we could do the following:

- Perform OLS of $y$ on $X_1$ to get residuals $\tilde{y}$.
- Perform OLS of $X_2$ on $X_1$ (in a column-wise fashion) to get residuals $\tilde{X}_2$.
- Perform OLS of $\tilde{y}$ on $\tilde{X}_2$ to get coefficient vector $\tilde{\beta}_2$.

**The Frisch–Waugh–Lovell theorem states that $\tilde{\beta}_2 = \hat{\beta}_2$.**

The proof of the theorem is not hard. Reference 2 proves the theorem using two well-known facts about OLS, while Reference 3 proves the theorem using just matrix algebra.
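
The theorem is also easy to verify numerically. The sketch below (simulated data, base R's `lm`) compares the $X_2$ coefficients from the joint regression with those from the residualized regression:

```r
# Verify FWL: the coefficients on X2 from the joint regression equal the
# coefficients from regressing residualized y on residualized X2.
set.seed(1)
n <- 100
X1 <- matrix(rnorm(n * 2), n, 2)
X2 <- matrix(rnorm(n * 3), n, 3)
y <- as.vector(X1 %*% c(1, 2) + X2 %*% c(3, 4, 5) + rnorm(n))

beta2_joint <- coef(lm(y ~ X1 + X2 - 1))[3:5]  # joint OLS (no intercept)
y_tilde  <- resid(lm(y ~ X1 - 1))              # residualize y on X1
X2_tilde <- resid(lm(X2 ~ X1 - 1))             # residualize X2 on X1, column-wise
beta2_fwl <- coef(lm(y_tilde ~ X2_tilde - 1))

max(abs(beta2_joint - beta2_fwl))  # essentially zero (numerical error)
```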

References:

- Wikipedia. Frisch-Waugh-Lovell theorem.
- Lovell, M. C. (2008). A simple proof of the FWL theorem.
- Belzile, L. (2019). Frisch-Waugh-Lovell theorem.

Ever wonder how it was generated? *The image depicts 100 simulations of an asymmetric random walk.* In this post, I’ll go through the code used to generate this image. All the code can also be found here.

For $i = 1, 2, \dots$, consider a series of i.i.d. random variables $X_1, X_2, \dots$ such that for any $i$, $X_i = 1$ with probability $p$, and $X_i = -1$ with probability $1 - p$. ($p$ is a parameter that we get to choose.) A random walk simply tracks the cumulative sum of these random variables, i.e. $S_n = \sum_{i=1}^n X_i$.

In my image, I let the random walk run until it hits a fixed upper limit or a fixed lower limit. Here is an R function that generates one realization of this random walk:

```r
# returns the random walk path values as a vector
# (random walk always starts at 0)
# p: probability of increasing by 1
# stop if path value hits either `lower` or `upper`
run <- function(p, lower, upper) {
  values <- c(0)
  current <- 0
  while (current > lower & current < upper) {
    current <- current + ifelse(runif(1) < p, 1, -1)
    values <- c(values, current)
  }
  values
}
```

(There might be more efficient ways of doing this, but since computation is relatively fast this is good enough for our purposes.)
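
One way to speed this up (a sketch of an alternative, not the code used for the image): simulate steps in chunks and use `cumsum()` instead of growing the vector one step at a time.

```r
# Chunked version of `run`: draw many steps at once, take the cumulative
# sum, and truncate the path at the first boundary hit.
run_fast <- function(p, lower, upper, chunk = 10000) {
  values <- 0
  repeat {
    steps <- ifelse(runif(chunk) < p, 1, -1)
    path <- values[length(values)] + cumsum(steps)
    hit <- which(path <= lower | path >= upper)
    if (length(hit) > 0) return(c(values, path[seq_len(hit[1])]))
    values <- c(values, path)
  }
}
```

Since each step is ±1, the returned path still starts at 0 and ends exactly at `lower` or `upper`.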

The code below creates a list of 100 elements, each element being a simulated random walk. The probability of going up at any one step is 0.48, and we stop the random walk once we hit -50 or 50.

```r
N <- 100  # no. of paths to simulate
p <- 0.48
lower <- -50
upper <- 50

# simulate paths
set.seed(1055)
vlist <- replicate(N, run(p, lower, upper))
```

We can plot these paths along with the x-axis and the upper & lower limits:

```r
# get length of longest path
max_length <- max(sapply(vlist, length))

# make plot
par(mar = rep(0, 4))  # no margins
plot(c(1, max_length), c(lower, upper), type = "n")
for (i in 1:N) {
  lines(1:length(vlist[[i]]), vlist[[i]])
}
abline(h = 0, lty = "dashed")
abline(h = lower, lwd = 2)
abline(h = upper, lwd = 2)
```

Without color, it’s hard to make much sense of the image. To introduce color, let’s create a function that picks a color for a path based on (i) whether the path hit the upper or lower limit, and (ii) how long the path is when compared to the longest path in the image.

```r
colorPicker <- function(values, max_length,
                        ls_color = c(178, 34, 34),
                        ll_color = c(255, 204, 0),
                        us_color = c(0, 0, 102),
                        ul_color = c(102, 204, 225)) {
  l <- length(values)
  if (values[l] < 0) {
    rgb_values <- (ls_color + (ll_color - ls_color) * l / max_length) / 255
  } else {
    rgb_values <- (us_color + (ul_color - us_color) * l / max_length) / 255
  }
  rgb(rgb_values[1], rgb_values[2], rgb_values[3])
}
```

If a path hits the lower limit and has length 0, it will have color `ls_color / 255`, and if a path hits the lower limit and has the longest length among all our simulated paths, it will have color `ll_color / 255`. There is a similar relationship for paths hitting the upper limit and `us_color` and `ul_color`. (You can see what these colors are at a website such as this.)

We can now color our black-and-white image:

```r
plot(c(1, max_length), c(lower, upper), type = "n")
for (i in 1:N) {
  lines(1:length(vlist[[i]]), vlist[[i]],
        col = colorPicker(vlist[[i]], max_length), lwd = 0.5)
}
abline(h = 0, lty = "dashed")
abline(h = lower, lwd = 2)
abline(h = upper, lwd = 2)
```

From the image it is now obvious that most of the paths hit the lower limit first, and fairly quickly at that! Here is the same image, but with different choices of color scale:

```r
plot(c(1, max_length), c(lower, upper), type = "n")
for (i in 1:N) {
  lines(1:length(vlist[[i]]), vlist[[i]],
        col = colorPicker(vlist[[i]], max_length,
                          ls_color = c(230, 230, 230),
                          ll_color = c(166, 166, 166),
                          us_color = c(255, 0, 0),
                          ul_color = c(0, 0, 255)),
        lwd = 0.5)
}
abline(h = 0, lty = "dashed")
abline(h = lower, lwd = 2)
abline(h = upper, lwd = 2)
```

The possibilities are endless!

If you haven’t seen random walks before, you might be surprised at how a slightly biased walk ($p = 0.48$ instead of $p = 0.5$) results in so many more paths hitting the lower limit before the upper limit, even though the two limits are the same distance from 0. This is an example of 100 simulations when the walk is unbiased:

This is an example of 100 simulations when the walk is slightly biased:

*Isn’t it interesting how slight deviations in the probability of going one step up (instead of down) completely change the dynamics?* It turns out that there is a closed form for the probability of hitting one limit (as opposed to hitting the other). If the upper limit is $b > 0$ and the lower limit is $-a < 0$, and $p \neq 1/2$, then

$$\mathbb{P}(\text{hit } b \text{ before } -a) = \frac{1 - \left( \frac{1-p}{p} \right)^a}{1 - \left( \frac{1-p}{p} \right)^{a+b}}.$$

For my blog header image, $a = b = 50$ and $p = 0.48$, which means the probability of hitting the upper limit is approximately 0.0179: not high at all!
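
The closed form is easy to evaluate in R, and it matches the number quoted above:

```r
# P(hit upper limit b before lower limit -a) for an asymmetric walk
# starting at 0 with up-probability p != 1/2
hit_upper_prob <- function(p, a, b) {
  r <- (1 - p) / p
  (1 - r^a) / (1 - r^(a + b))
}
hit_upper_prob(0.48, a = 50, b = 50)  # approximately 0.0179
```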

Assume that we are in the potential outcomes set-up in causal inference. We have a sample of individuals, and individual $i$ has potential outcomes $Y_i(1)$ and $Y_i(0)$. $Y_i(1)$ denotes the value of individual $i$’s response if the individual is in the treatment group, while $Y_i(0)$ denotes the value if the individual is in the control group. The **fundamental problem of causal inference** is that as the experimenter, we only ever get to observe one of $Y_i(1)$ and $Y_i(0)$, never both.

In causal inference, a common target that we want to estimate is the **average treatment effect (ATE)**, defined as the expected difference in potential outcomes:

$$\text{ATE} = \mathbb{E}[Y(1) - Y(0)].$$

**Ratio metrics: why we care**

In some cases, we might be interested in a ratio of potential outcomes instead. This is not as unusual as one might think! We often hear claims such as “if you take this supplement, you will be x% stronger”: x can be expressed in terms of a ratio of the outcome with the supplement to the outcome without it.

The work involved in estimating such a ratio is essentially the same as that in estimating the x% in the claim.

**The wrong way to define ratio metrics**

What is the target we are trying to estimate for a ratio metric? Instinctively, one might define the target as

$$\mathbb{E}\left[ \frac{Y(1)}{Y(0)} \right].$$

*This is incorrect!* In statistical parlance, the quantity above is unidentifiable: even if we have infinite data from this model (i.e. we see an infinite number of observations), we still cannot estimate this target!

Let’s see this through an example. In this first set-up, imagine that there is no treatment effect, i.e. $Y_i(1) = Y_i(0)$ for all individuals. Imagine also that $Y_i(0) = 1$ for half of the population and $Y_i(0) = 2$ for the other half, and so $\mathbb{E}[Y(1) / Y(0)] = 1$.

If I were to run a huge randomized experiment with half of the observations in control and half of the observations in treatment, what would I see? I would see 50% of the controls having value 1, 50% of the controls having value 2, and the same for the treatment group.

Now, imagine a second set-up, where again $Y_i(0) = 1$ for half of the population and $Y_i(0) = 2$ for the other half. However, if $Y_i(0) = 1$, then $Y_i(1) = 2$, and if $Y_i(0) = 2$, then $Y_i(1) = 1$. In this set-up, $\mathbb{E}[Y(1) / Y(0)] = \frac{1}{2} \cdot 2 + \frac{1}{2} \cdot \frac{1}{2} = \frac{5}{4}$.

What would I see if I were to run a huge randomized experiment? I would see EXACTLY the same data as that in the first set-up: 50% 1s and 50% 2s in the control group, and the same in the treatment group! **We will not be able to differentiate between set-ups 1 and 2, even with infinite data, as the observed data will be the same.**

(Notice that this problem does not arise for the ATE: both of these set-ups have the same ATE: 0.)

**What we do instead**

The target that practitioners use for ratio metrics is

$$\frac{\mathbb{E}[Y(1)]}{\mathbb{E}[Y(0)]}.$$

If you run it through the two set-ups above, you will find that this target has the same value in both settings: 1.
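
The two set-ups can be written out numerically to confirm this (a sketch with a population of 1,000 individuals):

```r
# Set-up 1: no treatment effect; Y(0) is 1 for half and 2 for half
y0_a <- rep(c(1, 2), each = 500); y1_a <- y0_a
# Set-up 2: the outcomes swap under treatment
y0_b <- rep(c(1, 2), each = 500); y1_b <- rep(c(2, 1), each = 500)

# The unidentifiable target E[Y(1)/Y(0)] differs across the set-ups...
c(mean(y1_a / y0_a), mean(y1_b / y0_b))  # 1.00 1.25
# ...but the standard target E[Y(1)] / E[Y(0)] is 1 in both
c(mean(y1_a) / mean(y0_a), mean(y1_b) / mean(y0_b))  # 1 1
```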

**Reflections**

It took me a while to wrap my head around this. One takeaway I have is that the fundamental problem of causal inference forces us to think hard about what quantities we can even hope to estimate. This is why I think the issue of identification comes up a lot more in causal inference than in the rest of statistics.

I want to end this post off by highlighting two other things to worry about when using ratio metrics:

- What happens if the denominator can be negative? How do you interpret the target in that case?
- What happens if the denominator is very close to zero, or worse, equal to zero? Having something close to zero in the denominator usually causes estimates to be very unstable.

All that to say: approach ratio metrics with care!

Say I am working with the `mtcars` data and I want to filter for just the rows with `cyl == 6`. I would do something like this:

```r
library(tidyverse)
data(mtcars)
mtcars %>% filter(cyl == 6)
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
# Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
# Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
# Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
```

*What if I had the filter condition as a string instead?* The code below doesn’t work:

```r
filter_string <- "cyl == 6"
mtcars %>% filter(filter_string)
# Error: Problem with `filter()` input `..1`.
# x Input `..1` must be a logical vector, not a character.
# Input `..1` is `filter_string`.
# Run `rlang::last_error()` to see where the error occurred.
```

This is one possible solution:

```r
mtcars %>% filter(!! rlang::parse_expr(filter_string))
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
# Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
# Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
# Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
```

This can be useful if you are trying to run several different filters automatically. In the following contrived example, I want to compute the mean MPG for two different slices of the data:

```r
filters <- c("carb == 4", "am == 1")
for (filter in filters) {
  print(paste0("Mean mpg for ", filter, ": ",
               mtcars %>%
                 filter(!! rlang::parse_expr(filter)) %>%
                 summarize(mean_mpg = mean(mpg)) %>%
                 pull()))
}
# [1] "Mean mpg for carb == 4: 15.79"
# [1] "Mean mpg for am == 1: 24.3923076923077"
```

You can read about `parse_expr` here and about `!!` here. I don’t fully understand tidy evaluation at this point, but the code above should work in a wide variety of situations.
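
For completeness, a base R alternative (no rlang) is to `eval()` the parsed string with the data frame as the evaluation environment. This is a sketch; `eval(parse(...))` is generally discouraged for untrusted input:

```r
# Evaluate the filter string with mtcars as the evaluation environment
filter_string <- "cyl == 6"
keep <- eval(parse(text = filter_string), envir = mtcars)
result <- mtcars[keep, ]
nrow(result)  # [1] 7
```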

(Disclaimer: There was a reference I came across for the `!!` + `rlang::parse_expr` trick that I can’t find now. If anyone knows where it is, please let me know and I’ll acknowledge it here in the references.)