Playing Wordle in R

The game Wordle has taken the world (or at least my facebook feed) by storm. It’s a really simple word game that’s a lot like the classic Mastermind. Here are the rules from the Wordle website:

The logic behind the game is pretty simple, so I thought I’d code up an R version so that those of you who can’t get enough of it can play it on your own! The full code is available here.

In my version, I allow the user to set 3 parameters:

  • dictionary: A vector of possible words that the computer can choose from for the word to be guessed.
  • wordLength: The length of the word to be guessed. Wordle sets this parameter to 5.
  • nGuesses: Number of guesses the player is allowed. Wordle sets this parameter to 6.

While the user sees guesses and answers as strings (e.g."early"), it’s much easier to work with vectors of characters in R (e.g. c("e", "a", "r", "l", "y")). Since these are going to be short words, there isn’t much of a performance difference.

I only have one helper function, which takes a guess and the answer (as character vectors) and evaluates the color of the tiles (“G” for green, “Y” for yellow, “-” for neither). I use two passes over the guess vector: the first to evaluate which tiles should be green, and the second to evaluate which should be yellow. There might be a way to do it over one pass, but I wanted to achieve certain behavior when letters were repeated (either in guess or answer) that wasn’t easy to do so in one pass. Again, there isn’t much of a performance hit since the words are short.

evaluateGuess <- function(guessVec, answerVec) {
  wordLength <- length(answerVec)

  resVec <- rep("-", wordLength)
  # first pass: mark exact matches (green)
  for (i in 1:wordLength) {
    if (guessVec[i] == answerVec[i]) {
      resVec[i] <- "G"
      answerVec[i] <- "-"  # mark unavailable for yellow
  # second pass: mark yellow
  for (i in 1:wordLength) {
    if (resVec[i] != "G") {
      idx <- match(guessVec[i], answerVec)
      if (! {
        resVec[i] <- "Y"
        answerVec[idx] <- "-"

# example
evaluateGuess(strsplit("early", "")[[1]], 
              strsplit("later", "")[[1]])
# [1] "Y" "G" "Y" "Y" "-"

Here is the main function for playing one round of Wordle:

playGame <- function(dictionary, wordLength = 5, nGuesses = 6) {
  # select an answer
  possibleAnswers <- dictionary[nchar(dictionary) == wordLength]
  answer <- sample(possibleAnswers, 1)
  answerVec <- strsplit(answer, "")[[1]]
  print(paste("You have", nGuesses, "chances to guess a word of length", 
  guessCnt <- 0
  lettersLeft <- LETTERS
  while (guessCnt < nGuesses) {
    # display "keyboard"
    print(paste(c("Letters left:", lettersLeft), collapse = " "))
    # read in guess
    guessCnt <- guessCnt + 1
    guess <- readline(paste0("Enter guess ", guessCnt, ": "))
    while (nchar(guess) != wordLength) {
      guess <- readline(paste0("Guess must have ", wordLength, " characters: "))
    guess <- toupper(guess)
    guessVec <- strsplit(guess, "")[[1]]
    # evaluate guess and update keyboard
    resVec <- evaluateGuess(guessVec, answerVec)
    # update keyboard
    lettersLeft <- setdiff(lettersLeft, guessVec)
    # print result
    print(paste(strsplit(guess, "")[[1]], collapse = " "))
    print(paste(resVec, collapse = " "))
    if (all(resVec == "G")) {
      print("You won!")
  print(paste("Sorry, you lost! Answer was ", answer))

Finally, we need to get a dictionary of words for our function playGame to choose words from. I found at least two possibilities: Collin’s Scrabble word list and the 10,000 most common English words according to Google’s Trillion Word Corpus. Here’s how you can read each of them into R after downloading the text file (you might need to change the file path):

# scrabble
dictionary <- read.csv("../Dictionaries/Collins Scrabble Words (2019).txt", 
                       header = FALSE, skip = 2)[, 1]

# google
dictionary <- read.csv("../Dictionaries/google-10000-english-usa-no-swears.txt",
                       head = FALSE)[, 1]

We’re now ready to play the game! Here’s an example of one call of playGame():

# [1] "You have 6 chances to guess a word of length 5"
# [1] "Letters left: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z"
# Enter guess 1: state
# [1] "S T A T E"
# [1] "Y - Y G Y"
# [1] "Letters left: B C D F G H I J K L M N O P Q R U V W X Y Z"
# Enter guess 2: Seat
# Guess must have 5 characters: seats
# [1] "S E A T S"
# [1] "Y G Y G -"
# [1] "Letters left: B C D F G H I J K L M N O P Q R U V W X Y Z"
# Enter guess 3: pesta
# [1] "P E S T A"
# [1] "- G G G G"
# [1] "Letters left: B C D F G H I J K L M N O Q R U V W X Y Z"
# Enter guess 4: vesta
# [1] "V E S T A"
# [1] "G G G G G"
# [1] "You won!"
# [1] 4

While this is playable, there are a number of ways this can be improved. If you are interested, perhaps you can implement these changes! (Some are much harder than others!)

  • For each guess, I only check that the input was of the correct length. There are a number of different checks that should also be implemented:
    • All the characters in the input string should be letters.
    • The guess should be a valid word.
    • In hard mode, any revealed hints must be used in all subsequent guesses.
  • One key UI component that makes Wordle fun is its keyboard: it tells you which letters you have used already and whether they resulted in yellow or green tiles (or neither). See figure at the end of this post for an example of this. In my version, I only display letters that have never been used in past guesses.
    • If you’ve implemented hard mode, determining which letters on the keyboard should be green or yellow is easy.
    • If hard mode is not on, determining which letters should be green or yellow is not so straightforward because (i) past hints might not be used in the current guess, (ii) letters which were yellow can turn green, and (iii) the situation is trickier if letters are repeated.
  • How do I handle cases where the guess and/or the answer has repeatedly letters? How might you change the code if you want different behavior?
  • This version of Wordle is played in the console. Can you make a version that has a UI like the actual game (maybe a Shiny app)?
  • The playability of Wordle depends quite a lot on the dictionary you give it. Are there better dictionaries out there?
  • Train an AI algorithm to play (this version of) Wordle optimally.

Wordle keyboard from an unspecified round.

What do we mean by triggering in A/B testing?

What do we mean by triggering in the context of A/B testing? According to Kohavi et al. (2020) (Reference 1),

Users are triggered into the analysis of an experiment if there is (potentially) some difference in the system or user behavior between the variant they are in and any other variant (counterfactual).

(Reference 1 has an entire chapter on triggering which I recommend for A/B testing practitioners.) That’s a bit opaque, and I think it’s probably easier to understand triggering through an example.

Imagine that you are running an experiment comparing two different versions of the checkout page for your e-commerce website. Users in the control group (denoted C) get to see the existing checkout page, while users in the treatment group (denoted T) get to see the new checkout page which you think drives more revenue.

The moment a user comes to your website and is assigned to your experiment, you assign the user to either the treatment group or the control group. (This is usually done by hashing some user identifier and checking which box the hash ends up in.) You then track the user as they navigate through your website. The moment the user hits the checkout page, they get to see either the new checkout page or the old one (depending on whether they were assigned to treatment or control); we say that this user is triggered. However, if the user never visits the checkout page, they will never get a chance to be exposed to the treatment, and we say that the user is not triggered.

Triggering splits our experiment sample into 4 parts: users in treatment who were triggered, users in treatment who were not triggered, and likewise for control. The figure below illustrates this. If assignment to treatment/control is completely random (e.g. coin flip), then we would expect triggering rates to be the same across treatment and control, as suggested by the figure.

Pictorial representation of triggering. Figure modified from Deng et. al. (2021) (Reference 2).

Connection with non-compliance

There is a close connection between triggering and the literature of non-compliance in economics. The basic idea in these set-ups is that there is a difference between being assigned treatment and actually taking the treatment. “A user not visiting the checkout page” is conceptually very similar as “a patient assigned to take a new experimental drug not taking it”. (One might even argue that they are exactly the same. I haven’t given it too much thought yet but I think there might be subtle differences between the two, especially in interpretation.)

Triggered analysis vs. all-up analysis

When the triggering rate is low, it is usually recommended that one performs a triggered analysis (comparing T1 against C1) in addition to an all-up analysis (comparing T = T0 + T1 against C = C0 + C1).

The simplest analysis one could perform is the difference of means. Let Y denote the response metric of interest, and for any set I, let \overline{Y}_I denote the mean of the Y values for the users in I. For triggered analysis, we would look at \overline{Y}_{T1} - \overline{Y}_{C1}, while for all-up analysis, we would look at \overline{Y}_T - \overline{Y}_C.

It is important to note that the two estimators are estimating different things. For each individual i, define the dummy variables

\begin{aligned} Z_i &= 1 \{ \text{individual i is assigned to treatment} \}, \\  D_i &= 1 \{ \text{individual i is triggered} \} \\  &= 1 \{ \text{individual i is actually exposed to treatment} \}. \end{aligned}

Let Y_{1i} denote the potential outcome for individual i if the treatment is taken (i.e. D_i = 1), Y_{0i} denote the potential outcome for individual i if the treatment is not taken (i.e. D_i = 0), and Y_i denote the outcome that is actually observed for individual i.

The overall difference of means \overline{Y}_T - \overline{Y}_C is estimating the intent-to-treat effect (ITT):

\text{ITT} = \mathbb{E} [Y_i \mid Z_i = 1] - \mathbb{E}[Y_i \mid Z_i = 0].

On the other hand, the triggered difference of means \overline{Y}_{T1} - \overline{Y}_{C1} is estimating the effect of treatment on the treated (TOT):

\text{TOT} = \mathbb{E}[Y_{1i} - Y_{0i} \mid D_i = 1].

These two quantities are not the same in general. Under some assumptions, there is a simple relation between ITT and TOT known as Bloom’s result.

A small gotcha to be aware of

If the TOT is +5 increase and 10% of the users are triggered, does it make sense that the ITT is +5 * 0.1 = +0.5? Yes, because for the users who weren’t triggered, the treatment has zero effect on them (since they will never be exposed to it).

In many situations, we are not only interested in the absolute value of the difference between treatment and control, but the change relative to the control baseline (also known as “lift”). If we see a +5% increase for the triggered population and 10% of users are triggered, does it  follow that we have a +5% * 0.1 = +0.5% increase for the overall population?

The answer is no: the overall increase can be anywhere from 0% to 5%! This is because we don’t know how much of the metric the triggered population is responsible for. For example, assume the metric is revenue and that the total revenue under control is 100. At one extreme, when triggered users are responsible for all of the revenue, the overall increase is

\begin{aligned} \dfrac{(0 + 105) - 100}{100} = 5\%. \end{aligned}

At the other extreme, when triggered users are responsible for \epsilon revenue with \epsilon small, the overall increase is

\begin{aligned} \dfrac{[(100 - \epsilon) + 1.05\epsilon) - 100}{100} = 0.05 \epsilon \%. \end{aligned}


  1. Kohavi, R., et. al. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing.
  2. Deng, A., et. al. (2021). Variance Reduction for Experiments with One-Sided Triggering using CUPED.

Mathematical odds and ends

I’ve just started a new blog, Mathematical Odds & Ends! As you might guess, the purpose of that blog is very similar to this one: to be a collection of mathematical tidbits that I find interesting, as well as a knowledge repository for (re-)learning the derivation of or intuition behind mathematical results.

I had been wanting to write some mathematical but non-statistical posts and first considered just putting them on this blog, but eventually decided to keep this blog “pure” and true to its original purpose 🙂

While I don’t expect to be posting very often on Mathematical Odds & Ends, please consider following the blog/subscribing to the RSS feed if the content interests you! Here are links to the first 3 posts of the blog to whet your appetite:

A stats joke to start the new year

I just came across this old stats joke that I posted on my old blog which is too good not to post again. From page 68 of Simon Singh’s The Simpsons and their Mathematical Secrets:

While heading to a conference on board a train, three statisticians meet three biologists. The biologists complain about the cost of the train fare, but the statisticians reveal a cost-saving trick. As soon as they hear the inspector’s voice, the statisticians squeeze into the toilet. The inspector knocks on the toilet door and shouts: “Tickets, please!” The statisticians pass a single ticket under the door, and the inspector stamps it and returns it. The biologists are impressed. Two days later, on the return train, the biologists showed the statisticians that they have bought only one ticket, but the statisticians reply: “Well, we have no ticket at all.” Before they can ask any questions, the inspector’s voice is heard in the distance. This time the biologists bundle into the toilet. One of the statisticians secretly follows them, knocks on the toilet door and asks: “Tickets please!” The biologists slip the ticket under the door. The statistician takes the ticket, dashes into another toilet with his colleagues, and waits for the real inspector. The moral of the story is simple: “Don’t use a statistical technique that you don’t understand.”

Happy new year and don’t use statistical techniques you don’t understand! 🙂

Simulating dice bingo

Note: This post was inspired by the “Classroom Bingo” probability puzzle in the Royal Statistical Society’s Significance magazine (Dec 2021 edition).


Imagine that we are playing bingo, but where the numbers are generated by the roll of two 6-sided dice with faces 1, 2, …, 6. Each round, the two dice are rolled. If the sum of the two dice appears on your bingo card, you can strike it off. The game ends when you strike off all the numbers on your card.

For this simple game, you are asked to write just two numbers on your bingo card (which can be the same number if you wish). What two numbers should you write in order to finish the game as quickly as possible (on average)?


With the roll of two dice, the most probable outcome is 7 (6 out of 36), followed by 6 and 8 (5 out of 36), and so on. Hence, the two numbers you should write down are probably (7, 7) or (7, 6) / (7, 8). Which is better?


This problem space is small enough that we can estimate the number of rounds for each possible bingo card via Monte Carlo simulation. Here is the function that runs one round of dice bingo in R:

# Roll `nDice` dice with sides 1, 2, ..., `nFaces` and return the sum
RollDice <- function(nDice, nFaces) {
  dice <- sample(nFaces, size = nDice, replace = TRUE)

SimulateBingo <- function(bingoVec, nDice, nFaces, verbose = FALSE) {
  nRolls <- 0
  while (length(bingoVec) > 0) {
    nRolls <- nRolls + 1
    currentRoll <- RollDice(nDice, nFaces)
    # check if roll matches a number on card; if so, remove that number
    matchIndex <- match(currentRoll, bingoVec)
    if (!
      bingoVec <- bingoVec[-matchIndex]
    if (verbose) print(paste0("Roll ", nRolls, ": ", currentRoll, ", (", 
                              paste(bingoVec, collapse = ","), 
                              ") remaining"))

The SimulateBingo function is flexible enough for us to have any number of numbers on our bingo card at the beginning of the game. Here’s an example output from SimulateBingo(), where we have 3 numbers (6, 6, 7) on our card at the beginning, and at each round we roll 2 dice with sides 1 to 6:

SimulateBingo(c(6, 6, 7), 2, 6, verbose = TRUE)

# [1] "Roll 1: 11, (6,6,7) remaining"
# [1] "Roll 2: 7, (6,6) remaining"
# [1] "Roll 3: 6, (6) remaining"
# [1] "Roll 4: 9, (6) remaining"
# [1] "Roll 5: 3, (6) remaining"
# [1] "Roll 6: 4, (6) remaining"
# [1] "Roll 7: 9, (6) remaining"
# [1] "Roll 8: 5, (6) remaining"
# [1] "Roll 9: 7, (6) remaining"
# [1] "Roll 10: 5, (6) remaining"
# [1] "Roll 11: 9, (6) remaining"
# [1] "Roll 12: 7, (6) remaining"
# [1] "Roll 13: 11, (6) remaining"
# [1] "Roll 14: 9, (6) remaining"
# [1] "Roll 15: 6, () remaining"
# [1] 15

Next, we introduce a function that runs SimulateBingo() multiple times and returns the mean number of rounds needed to complete the game:

EstimateRounds <- function(nSim, bingoVec, nDice, nFaces) {
  rounds <- replicate(nSim, SimulateBingo(bingoVec, nDice, nFaces))

(For R aficionados wondering why I didn’t use ... to pass function arguments to SimulateBingo(): it’s because replicate() and ... don’t play well together. See more details here.)

The following code runs 100,000 simulations of the game for each possible pair of numbers on the bingo card (this takes a while to run):

nDice <- 2
nFaces <- 6
nSim <- 100000

df <- data.frame(matrix(nrow = 0, ncol = 3))
maxSum <- nDice * nFaces
minSum <- nDice
for (x in minSum:maxSum) {
  for (y in x:maxSum) {
    print(paste("Simulating bingoVec = (", x, ", ", y, ")"))
    meanRounds <- EstimateRounds(nSim, bingoVec = c(x, y), nDice, nFaces)
    df <- rbind(df, c(x, y, meanRounds))
names(df) <- c("x", "y", "mean_rounds")

Let’s make a heatmap to visualize the results:

ggplot(df, aes(factor(x), factor(y))) +
  geom_tile(aes(fill = mean_rounds)) +
  geom_text(aes(label = sprintf(mean_rounds, fmt = "%.2f"))) +
  scale_fill_distiller(palette = "Spectral") +
  labs(x = "Bingo value 1", y = "Bingo value 2", 
       title = "Mean # of dice rolls to reach bingo")
  theme_update(legend.position = "null")

Notice the symmetry in the plot: the mean number of rounds for (x, y) is basically the same as that for (14 - x, 14 - y). This makes sense due to symmetry inherent in the problem (the probability of rolling x is the same as rolling 7 - x for one die).

The smallest number of rounds is obtained for the pairs (6, 7) and (7, 8)!


Let T_1 denote the number of rounds until (and including) the first time a number on the bingo card is hit, and let T_2 denote the number of rounds from then until (and including) the second number is hit. The number of rounds to complete the game is just T = T_1 + T_2.

Note that T_1 and T_2 are geometric random variables, and so their expectation is the reciprocal of the probability of success.

For (7, 7), the probability of success of T_1 and T_2 is the same: it’s the probability of rolling a 7, which is 6 / 36 = 1/6. Hence,

\begin{aligned} \mathbb{E}[T] &= \mathbb{E}[T_1] + \mathbb{E}[T_2] =  6 + 6 = 12. \end{aligned}

For (6, 7), the computation is a bit trickier because the distribution of T_2 depends on whether the 6 or the 7 was hit first. If 6 was hit first, then T_2 \sim \text{Geom}(6/36), and if 7 was hit first, then T_2 \sim \text{Geom}(5, 36). What is the probability that 6 is hit first? It’s

\text{Prob 6 first} = \dfrac{5/36}{5/36 + 6/36} = \dfrac{5}{11}.


\begin{aligned} \mathbb{E}[T_2] &= \mathbb{E}[\text{Geom}(6/36)] \cdot \dfrac{5}{11} + \mathbb{E}[\text{Geom}(5/36)] \cdot \left( 1 - \dfrac{5}{11} \right) \\  &= \dfrac{366}{55} \\  &\approx 6.65 \end{aligned}

T_1 is a geometric distribution with success probability equal to the probability that 6 or 7 is rolled, i.e. 5/36 + 6/36 = 11/36, implying that \mathbb{E}[T_1] = 36/11. Hence,

\begin{aligned} \mathbb{E}[T] &= \mathbb{E}[T_1] + \mathbb{E}[T_2] \\  &=  \dfrac{36}{11} + \dfrac{366}{55} \\  &= \dfrac{546}{55} \\  &\approx 9.93, \end{aligned}

which is quite a bit less than the expected value for (7, 7)! (Notice that these values are very close to the simulated results.)

On second thought this makes sense: even though it takes longer to hit the second number in (6, 7), this is more than made up for in the time it takes to hit the first number, since we can hit either 6 or 7, not just 7. (Looking at the simulation results suggests that quite a few other pairs of numbers are better than (7, 7).)


There are many interesting questions to ponder as extensions to this game:

  1. What if you had to write down 3 numbers on your bingo card instead? Or 4? Or n?
  2. What if there were more dice?
  3. What if the dice had non-standard faces?

Verifying a stat from The Athletic NBA Show

A few weeks ago, I was listening to The Athletic NBA Show podcast (Episode 581: “5 Players I was wrong about, 20 Games in Contenders, and Sam Vecenie on the 2021 Rookie Class”) and the following statistic caught my attention:

Question: Since the 2000-2001 NBA finals, there have been 42 teams in the NBA finals. (This episode aired in Dec 2021, so the most recent finals was the 2020-2021 edition.) At the 20-game mark in the regular season, how many of these 42 teams were below 0.500 (that is, less than 50% winning percentage)?

Answer: Zero (!) (One of the guests on the show guessed 6.)

(If you want to hear the full segment on the stat, it’s the first topic of the podcast, right after the recap of the week.) Since a full regular season in the NBA is 82 games per team, 20 games represents just under a quarter of the season. It seemed a little surprising that NBA finalists were so consistently >= 0.500 so early in the season, so I wanted to verify the statistic for myself.

What I found was even more surprising: none of the teams in the NBA finals since the 1981-1982 edition were below 0.500 at the 20-game mark! That’s 40 years of NBA finals, almost double the number considered in the podcast. The 1980-1981 Houston Rockets were the most recent NBA finalists to be under 0.500 at the 20-game mark: they were 9-11.

The rest of this post goes through the data and R code to verify this statistic. The full R script is available here.

To verify this statistic, I needed two from two different sources. The first data source is Tristan Malherbe’s NBA Finals and MVPs dataset hosted on This tells us which teams were in the NBA finals in each year. The dataset only goes to the 2017-2018 season, so I added NBA finals data up to the 2020-2021 season (inclusive) manually. The second data source is Wyatt Walsh’s Basketball Dataset hosted on Kaggle. This is an extremely comprehensive dataset on the NBA stored as an SQLite database; we only use the Game table, which contains statistics on every regular season game.

(Note: If you play around with the data, you might find some oddities/errors. For example, there are one or two teams that had more than 82 games in the regular season when there should be at most 82 games. There is also some missing data. Since this is quick analysis for a blog post I didn’t try to ensure 100% correctness; this is something one should do or strive toward for more high-stakes analyses.)

First, let’s load the regular season data. I’m being a bit lazy by importing the whole Game table, but since the data isn’t very big we don’t suffer a huge efficiency loss.



firstYear <- 1980  # represents 1980-1981 season

# these filepaths are specific to my local drive
mainFile <- "../data/nba-kaggle-wyattowalsh/basketball.sqlite"
finalsFile <- "../data/datatouille-nba-finals-and-mvps/data/data-updated.csv"

# get all regular season games (only relevant columns 
# selected)
mydb <- dbConnect(RSQLite::SQLite(), mainFile)
df <- dbGetQuery(mydb, "SELECT * FROM Game")
regular_df <- df %>% mutate(GAME_DATE = as.Date(GAME_DATE),
                              SEASON = as.numeric(SEASON)) %>% 
  filter(SEASON >= firstYear) %>%
  arrange(SEASON, GAME_DATE)

# 1   1980 1980-10-10      Boston Celtics    Cleveland Cavaliers       W       L
# 2   1980 1980-10-10     Detroit Pistons     Washington Bullets       L       W
# 3   1980 1980-10-10 Seattle SuperSonics     Los Angeles Lakers       L       W
# 4   1980 1980-10-10        Phoenix Suns  Golden State Warriors       W       L
# 5   1980 1980-10-10  Philadelphia 76ers        Milwaukee Bucks       L       W
# 6   1980 1980-10-10           Utah Jazz Portland Trail Blazers       W       L

Next, let’s load the NBA finals data. Note that this data uses a different convention for labeling the season (e.g. 2021 refers to the 2020-2021 season), so we have to subtract 1 in order for the season column to match across the two data sources. For the rest of this analysis, season x refers to the season starting in the year x.

# get NBA finals info
# season = year - 1 to match regular_df
finals_df <- read_csv(finalsFile) %>%
  select(season = year, nba_champion, nba_vice_champion) %>%
  mutate(season = season - 1) %>%
  filter(season >= firstYear)
names(finals_df) <- toupper(names(finals_df))

# # A tibble: 6 × 3
#    <dbl> <chr>              <chr>             
# 1   1980 Boston Celtics     Houston Rockets   
# 2   1981 Los Angeles Lakers Philadelphia 76ers
# 3   1982 Philadelphia 76ers Los Angeles Lakers
# 4   1983 Boston Celtics     Los Angeles Lakers
# 5   1984 Los Angeles Lakers Boston Celtics    
# 6   1985 Boston Celtics     Houston Rockets 

I’ve never heard the term “vice champion” before! Since the dataset uses that to refer to the team that was in the finals but didn’t win, I’ll use that term too. Join the two data frames and keep just the games that an NBA finalist played in:

joined_df <- regular_df %>%
  inner_join(finals_df, by = "SEASON") %>%
                                 TEAM_NAME_AWAY == NBA_CHAMPION,

Next, let’s compute the win percentages across games for the NBA finalists. I did it separately for the champions and vice champions then joined the two datasets together since it was a little easier to figure out the code.

# Compute win percentages across games for champions
champion_df <- joined_df %>% filter(FOR_CHAMPION) %>%
  group_by(SEASON) %>%
  mutate(GAME = 1,
         WIN = ifelse((TEAM_NAME_HOME == NBA_CHAMPION & WL_HOME == "W") |
                        (TEAM_NAME_HOME != NBA_CHAMPION & WL_HOME == "L"),
                      1, 0)) %>%
  transmute(TEAM_TYPE = "Champion",
            SEASON = factor(SEASON),
            GAMES_PLAYED = cumsum(GAME),
            WINS = cumsum(WIN),
            WIN_PCT = WINS / GAMES_PLAYED)

# Compute win percentages across games for vice champions
vice_champion_df <- joined_df %>% filter(FOR_VICE_CHAMPION) %>%
  group_by(SEASON) %>%
  mutate(GAME = 1,
         WIN = ifelse((TEAM_NAME_HOME == NBA_VICE_CHAMPION & WL_HOME == "W") |
                        (TEAM_NAME_HOME != NBA_VICE_CHAMPION & WL_HOME == "L"),
                      1, 0)) %>%
  transmute(TEAM_TYPE = "Vice Champion", 
            SEASON = factor(SEASON),
            GAMES_PLAYED = cumsum(GAME),
            WINS = cumsum(WIN),
            WIN_PCT = WINS / GAMES_PLAYED)

# put champion & vice champion data together
# final_win_pct_df is used for the year labels in the plots
win_pct_df <- rbind(champion_df, vice_champion_df)
final_win_pct_df <- win_pct_df %>% group_by(TEAM_TYPE, SEASON) %>%
  slice_tail(n = 1)

# # A tibble: 6 × 5
# # Groups:   SEASON [1]
#   <chr>     <fct>         <dbl> <dbl>   <dbl>
# 1 Champion  1980              1     1   1    
# 2 Champion  1980              2     1   0.5  
# 3 Champion  1980              3     2   0.667
# 4 Champion  1980              4     2   0.5  
# 5 Champion  1980              5     3   0.6  
# 6 Champion  1980              6     3   0.5 

We are now ready to make the plot! First, let’s make the plot of win percentage vs. games played for the NBA champions:

ggplot(champion_df, aes(x = GAMES_PLAYED, y = WIN_PCT, col = SEASON)) +
  geom_line() +
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  geom_vline(xintercept = 20, linetype = "dashed") +
  geom_text(data = filter(final_win_pct_df, TEAM_TYPE == "Champion"), 
            aes(x = max(GAMES_PLAYED), y = WIN_PCT, label = SEASON),
            hjust = 0, size = 3) +
  coord_cartesian(ylim = c(0, 1)) +
  theme(legend.position = "none") +
  labs(title = "Win percentage by games played (champions)",
       x = "Games played", y = "Win percentage")

Indeed, at 20 games all NBA champions were 0.500 or better. Only one team was exactly at 0.500: the 2005-2006 Miami Heat.

Let’s make the same plot for the vice champions:

ggplot(vice_champion_df, aes(x = GAMES_PLAYED, y = WIN_PCT, col = SEASON)) +
  geom_line() +
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  geom_vline(xintercept = 20, linetype = "dashed") +
  geom_text(data = filter(final_win_pct_df, TEAM_TYPE == "Vice Champion"), 
            aes(x = max(GAMES_PLAYED), y = WIN_PCT, label = SEASON),
            hjust = 0, size = 3) +
  coord_cartesian(ylim = c(0, 1)) +
  theme(legend.position = "none") +
  labs(title = "Win percentage by games played (vice champions)",
       x = "Games played", y = "Win percentage")

Again, at 20 games all NBA vice champions after the 1980-1981 season were 0.500 or better. Only one team at 0.500 was the 2004-2005 Detroit Pistons. As we can see from the plot, the most recent NBA finalist to be below 0.500 was the 1980-1981 Houston Rockets, who also happen to be the only NBA finalist in this time window to finish the entire regular season under 0.500.

Computing the saturated log partial likelihood for the Cox proportional hazards model

TLDR: This post explains how to derive the saturated partial likelihood for the Cox proportional hazards model.

Imagine that we are in the survival analysis setting with right-censoring, and that we are considering the Cox proportional hazards model. Let:

  • Data come in the form (x_1, y_1, \delta_1), \dots, (x_n, y_n, \delta_n), where x_j \in \mathbb{R}^p, y_j \in (0, \infty), and \delta_j = 1 if failure occurs, and \delta_j = 0 if the time is censored.
  • t_1 < t_2 < \dots, t_m denote the unique failure times.
  • w_j be the observation weight for observation j.
  • \beta \in \mathbb{R}^p denote the coefficient vector for the Cox model.
  • \eta_j = x_j^T \beta.
  • D_i denote the set of indices of failure at time t_i.
  • d_i = \sum_{j \in D_i} w_j.
  • R_i denote the risk set at time t_i, i.e. the indices j for which y_j \geq t_i.

Assuming the Breslow approximation for ties (read more about ties in Reference 1), the partial likelihood for the model is

\begin{aligned} L(\beta) = \prod_{i=1}^m \frac{\exp(\sum_{j \in D_i} w_j \eta_j)}{\left( \sum_{j \in R_i} w_j e^{\eta_j} \right)^{d_i}}, \end{aligned}

and the log partial likelihood is

\begin{aligned} \ell(\beta) = \sum_{i=1}^m  \left[ \left( \sum_{j \in D_i} w_j \eta_j \right) - d_i \log \left( \sum_{j \in R_i} w_j e^{\eta_j}  \right) \right]. \end{aligned}

The fitted model finds \beta which maximizes the log partial likelihood; equivalently, it minimizes the deviance

D(\beta) = 2(\ell_{saturated} - \ell(\beta)),

where \ell_{saturated} is the log partial likelihood for the saturated model, i.e. where the \eta_j‘s are allowed to vary freely (as opposed to being constrained by \eta_j = x_j^T \beta). (See this previous post for more details on deviance and the saturated model.)

Reference 2 states (in Section 2.6) that

\begin{aligned} \ell_{saturated} = - \sum_{i=1}^m d_i \log (d_i). \end{aligned}

How did they derive this? In the paper, they say it was a “simple calculation”. Perhaps I missed something: my proof below doesn’t feel that straightforward…

If anyone knows of a simple way to get this result, let me know!!


(I don’t think it’s air-tight as written but can be made rigorous.)

Step 1: Deal with censored observations

The first thing to notice is that if observation j is censored, i.e. \delta_j = 0, then making \eta_j as negative as possible will only increase the log partial likelihood. Hence, we should set \eta_j = -\infty for these observations. Hence, if we let E_i denote the set of indices for failure times that happen after t_i, we are left to maximize

\begin{aligned} \ell = \sum_{i=1}^m  \left[ \left( \sum_{j \in D_i} w_j \eta_j \right) - d_i \log \left( \sum_{j \in D_i \cup E_i} w_j e^{\eta_j}  \right) \right]. \end{aligned}

(The last sum is now over D_i \cup E_i instead of R_i.)

Step 2: Deal with just immediate deaths

Notice that w_j e^{\eta_j} \geq 0 for every j. Hence,

\begin{aligned} \ell &= \sum_{i=1}^m  \left[ \left( \sum_{j \in D_i} w_j \eta_j \right) - d_i \log \left( \sum_{j \in D_i \cup E_i} w_j e^{\eta_j}  \right) \right] \\  &\leq \sum_{i=1}^m  \left[ \left( \sum_{j \in D_i} w_j \eta_j \right) - d_i \log \left( \sum_{j \in D_i } w_j e^{\eta_j} \right) \right]. \end{aligned}

By doing this, each \eta_j appears in exactly one term of the sum. Hence, maximizing \ell decouples into m independent maximization problems: maximize \ell_i where

\begin{aligned} \ell_i = \left( \sum_{j \in D_i} w_j \eta_j \right) - d_i \log \left( \sum_{j \in D_i } w_j e^{\eta_j} \right). \end{aligned}

Step 3: Maximize each term

Fix time index i. For k \in D_i, to find the \eta_k that maximizes \ell_i, take a partial derivative and set it to zero:

\begin{aligned} \frac{\partial \ell_i}{\partial \eta_k} = w_k - \frac{d_i w_k e^{\eta_k}}{\sum_{j \in D_i} w_j e^{\eta_j}} &= 0, \\  \frac{d_i e^{\eta_k}}{\sum_{j \in D_i} w_j e^{\eta_j}} &= 1, \\  \eta_k &= \log \left( \sum_{j \in D_i} w_j e^{\eta_j}\right) - \log d_i. \end{aligned}

(By differentiating again, we can show that \dfrac{\partial^2 \ell_j}{\partial \eta_k^2} < 0 for all values of \eta_k, meaning that the above does indeed give a local maximum. I believe (but haven’t proven) that \ell_j is concave, so this is in fact a global maximum. This is the one step in the proof argument that I haven’t verified.) The implication of the above is that at the maximum, the \eta_k‘s are equal for all k \in D_i. In fact, they can all be equal to any value: the first-order equation will still hold.

Letting A = \log \left( \sum_{j \in D_i} w_j e^{\eta_j}\right), we have

\begin{aligned} \ell_i &= \left(\sum_{j \in D_i} w_j \left( A - \log d_i \right) \right) - d_i A = -d_i \log d_i, \end{aligned}

and hence

\begin{aligned} \ell &\leq \sum_{i=1}^m \ell_i \leq \sum_{i=1}^m -d_i \log d_i. \end{aligned}

Step 4: Showing that we can get to equality

Step 3 shows that \ell is upper-bounded by \sum_{i=1}^m -d_i \log d_i through two inequalities. Can we actually attain equality?

From Step 3, we know that we can obtain equality as long as there are constants \alpha_1, \dots, \alpha_m such that

\begin{aligned} \eta_k = \alpha_i \quad \text{for all } k \in D_i. \end{aligned}

Let’s assume that this equality condition holds. Under this condition, can we achieve equality for the first inequality as well?

Recall how we obtained the first inequality:

\begin{aligned} \ell &= \sum_{i=1}^m  \left[ \left( \sum_{j \in D_i} w_j \eta_j \right) - d_i \log \left( \sum_{j \in D_i \cup E_i} w_j e^{\eta_j}  \right) \right] \\  &\leq \sum_{i=1}^m  \left[ \left( \sum_{j \in D_i} w_j \eta_j \right) - d_i \log \left( \sum_{j \in D_i } w_j e^{\eta_j} \right) \right]. \end{aligned}

That is, equality is only obtained if w_j e^{\eta_j} = 0 for all j \in E_i. This is not possible unless \eta_j = -\infty for all j. However, we can get arbitrarily close to equality by making \sum_{j \in E_i} w_j e^{\eta_j} both small in absolute terms and small compared to \sum_{j \in D_i} w_j e^{\eta_j}. For example, if we set up the \alpha‘s such that \alpha_1 \rightarrow 0 and \alpha_{i+1} / \alpha_i \rightarrow 0 for all i, then \ell \rightarrow \sum_{j=1}^m - d_j \log d_j.


  1. Breheny, P. Tied survival times; estimation of survival probabilities.
  2. Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2011). Regularization paths for Cox’s proportional hazards model via coordinate descent.

What is mean independence?

Mean independence is a relationship between two random variables that lies between the usual definition of independence and uncorrelatedness. A random variable Y is said to mean independent of X if (and only if)

\mathbb{E}[Y \mid X = x] = \mathbb{E}[Y]

for all x such that the probability mass/density of X at x is not zero.

Mean independence comes up most often in econometrics as an assumption that is weaker than independence but stronger than uncorrelatedness. Independence implies mean independence, and mean independence implies uncorrelatedness.

Independence implies mean independence

This is the most technical argument of the lot (adapted from Reference 2). By definition of conditional expectation, \mathbb{E}[Y \mid X] is the unique random variable such that for any A \in \sigma(X) (the \sigma-algebra generated by X), we have

\mathbb{E}[\mathbb{E}[Y \mid X] 1_A] = \mathbb{E}[Y 1_A].

However, by independence of X and Y,

\mathbb{E}[Y 1_A] = \mathbb{E}[Y] \mathbb{E}[1_A] = \mathbb{E}[\mathbb{E}[Y]1_A].

Hence, \mathbb{E}[Y \mid X] must be equal to the constant random variable E[Y], i.e. Y is mean independent of X.

(Update: Ok here’s an easier way to see this: independence implies that the conditional distribution of Y given X = x is the same for all x. Since the conditional mean \mathbb{E}[Y \mid X = x] is a function of the conditional distribution Y \mid X = x, it follows that \mathbb{E}[Y \mid X = x] is the same for all x, and so must be equal to \mathbb{E}[Y].)

Mean independence implies uncorrelatedness

(Proof adapted from Reference 3.) Assume Y is mean independent of X. By the law of iterated expectations,

\begin{aligned} \mathbb{E}[XY] &= \mathbb{E}[\mathbb{E}[XY \mid X]] \\  &= \mathbb{E}[X \mathbb{E}[Y \mid X]] \\  &= \mathbb{E}[X \mathbb{E}[Y]] \\  &= \mathbb{E}[Y] \mathbb{E}[X],  \end{aligned}

so \text{Cov}(X, Y) = 0.

Both converses are not true, as the counterexamples below show. (Counterexamples adapted from Reference 4.)

Mean independence does not imply independence

Let X = \cos \theta and Y = \sin \theta, with \theta \sim \text{Unif}[0, 2\pi]. For any x \in [-1, 1], \mathbb{P}(Y = \sqrt{1-x^2}) = \mathbb{P}(Y = -\sqrt{1-x^2}) = 0.5, so \mathbb{E}[Y \mid X = x] = 0, i.e. Y is mean independent of X. We can similarly establish that X is mean independent of Y.

On the other hand, \mathbb{P}(X > 3/4) > 0 and \mathbb{P}(Y > 3/4) > 0, but since X^2 + Y^2 = 1 < (3/4)^2 + (3/4)^2,

\mathbb{P}(X > 3/4 \text{ and } Y > 3/4) = 0 \neq \mathbb{P}(X > 3/4 ) \mathbb{P}(Y > 3/4).

Therefore, X and Y are not independent.

Uncorrelatedness does not imply mean independence

Let X and Y be such that (X, Y) \in \{ (1, 3), (-3, 1), (-1, -3), (3, -1) \} with equal probability. With this setup,

\mathbb{E}[XY] = \mathbb{E}[X] = \mathbb{E}[Y] = 0,

so \text{Cov}(X, Y) = 0, i.e. X and Y uncorrelated. However,

\mathbb{E}[Y \mid X = 1] = 3, \quad \mathbb{E}[Y \mid X = -3] = 1,

so Y is not mean independent of X.

Mean independence is not symmetric

Both independence and uncorrelatedness are symmetric; however (maybe surprisingly) mean independence is not. Here is a simple from StackExchange demonstrating this. Let X and Y be such that (X, Y) \in \{ (2, 1), (-2, 1), (5, -1), (-5, -1) \} with equal probability. With this setup,

\mathbb{E}[X \mid Y = 1] = \mathbb{E}[X \mid Y = -1] = 0,

so X is mean independent of Y, but

\mathbb{E}[Y \mid X = 2] = 1, \quad \mathbb{E}[Y \mid X = 5] = -1,

so Y is not mean independent of X.


  1. Wikipedia. Mean dependence.
  2. Dembo, A. Stat/219 Math 136 – Stochastic Processes: Notes on Section 4.1.2.
  3. Zhang, H. Mean-independence implies zero covariance.
  4. Merx, J.-P. Mean independent and correlated variables.

Doing linear regression in an online manner

Assume we are in the linear regression context: we have i.i.d. data (x_1, y_1), \dots, (x_n, y_n) with x_i \in \mathbb{R}^p and y_i \in \mathbb{R}, and we want to do linear regression of y on x. Assume further that the error terms in the model are i.i.d. normal with variance \sigma^2. Letting

\begin{aligned} \mathbf{X} = \begin{pmatrix}  \rule{1em}{0.4pt} & x_1^T & \rule{1em}{0.4pt} \\ & \vdots &  \\  \rule{1em}{0.4pt} & x_n^T & \rule{1em}{0.4pt} \end{pmatrix} \in \mathbb{R}^{n \times p}, \quad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^n, \end{aligned}

we have the standard estimates

\begin{aligned} \hat\beta &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}, \\  S^2 &= \frac{1}{n-p-1} \sum_{i=1}^n \left( x_i^T \hat\beta - y_i \right)^2, \\  \widehat{\text{Var}}(\hat\beta) &= S^2 (\mathbf{X}^T \mathbf{X})^{-1} = S^2 \left( \sum_{i=1}^n x_ix_i^T \right)^{-1}. \end{aligned}

We can use these formulas if we have all n data points at the same time. What if the data points come in one at a time, and we lose access to a data point once the next one comes in? Is there a way to update these linear regression estimates each time a new data point comes in?

It turns out that there is such a way: Reference 1 states the formulas with some hints at the derivation. I thought it would be worth working out the details.

Updating (\mathbf{X}^T \mathbf{X})^{-1}

Let \mathbf{X}_t \in \mathbb{R}^{t \times p} represent the \mathbf{X} matrix for the first t observations. The update for (\mathbf{X}^T \mathbf{X})^{-1} is a straightforward application of the matrix inversion lemma:

\begin{aligned} (\mathbf{X}_{t+1}^T \mathbf{X}_{t+1})^{-1} &= \left( \mathbf{X}_t^T \mathbf{X}_t + x_{t+1}x_{t+1}^T \right)^{-1} \\  &= \left( \mathbf{X}_t^T \mathbf{X}_t \right)^{-1} - \left( \mathbf{X}_t^T \mathbf{X}_t \right)^{-1} x_{t+1} \left( 1 + x_{t+1}^T \left( \mathbf{X}_t^T \mathbf{X}_t \right)^{-1} x_{t+1} \right)^{-1} x_{t+1}^T \left( \mathbf{X}_t^T \mathbf{X}_t \right)^{-1} \\  &= \left( \mathbf{X}_t^T \mathbf{X}_t \right)^{-1} - \dfrac{\left( \mathbf{X}_t^T \mathbf{X}_t \right)^{-1} x_{t+1} x_{t+1}^T \left( \mathbf{X}_t^T \mathbf{X}_t \right)^{-1}}{1 + x_{t+1}^T \left( \mathbf{X}_t^T \mathbf{X}_t \right)^{-1} x_{t+1}}.  \end{aligned}

Updating \hat\beta

The update for \hat\beta is similarly straightforward but involves some ugly math. Using the updating formula for (\mathbf{X}^T \mathbf{X})^{-1} and letting \mathbf{C} = \mathbf{X}_t^T \mathbf{X}_t, x = x_{t+1} and y = y_{t+1},

\begin{aligned} \hat\beta_{t+1} &= \left( \mathbf{X}_{t+1}^T \mathbf{X}_{t+1} \right)^{-1} \left( \mathbf{X}_{t+1}^T \mathbf{y}_{t+1} \right) \\  &= \left( \mathbf{X}_t^T \mathbf{X}_t + x_{t+1}x_{t+1}^T \right)^{-1} \left( \mathbf{X}_t^T \mathbf{y}_t + y_{t+1}x_{t+1} \right) \\  &= \left[ \mathbf{C}^{-1} - \dfrac{\mathbf{C}^{-1} x x^T \mathbf{C}^{-1}}{1 + x^T \mathbf{C}^{-1} x} \right] \left( \mathbf{X}_t^T \mathbf{y}_t + yx \right) \\  &= \hat\beta_t + y \mathbf{C}^{-1} x - \dfrac{\mathbf{C}^{-1} x x^T \mathbf{C}^{-1}}{1 + x^T \mathbf{C}^{-1} x} \left( \mathbf{X}_t^T \mathbf{y}_t + yx \right) \\  &= \hat\beta_t - \dfrac{\mathbf{C}^{-1} x x^T \hat\beta_t}{1 + x^T \mathbf{C}^{-1} x} + y \mathbf{C}^{-1} x - \dfrac{y\mathbf{C}^{-1} x x^T \mathbf{C}^{-1} x}{1 + x^T \mathbf{C}^{-1} x} \\  &= \hat\beta_t - \dfrac{\mathbf{C}^{-1} x x^T \hat\beta_t}{1 + x^T \mathbf{C}^{-1} x} + \dfrac{(1 + x^T \mathbf{C}^{-1} x) y \mathbf{C}^{-1} x - y\mathbf{C}^{-1} x (x^T \mathbf{C}^{-1} x )}{1 + x^T \mathbf{C}^{-1} x} \\  &= \hat\beta_t - \dfrac{\mathbf{C}^{-1} x x^T \hat\beta_t}{1 + x^T \mathbf{C}^{-1} x} + \dfrac{y \mathbf{C}^{-1}x}{1 + x^T \mathbf{C}^{-1} x} \\  &= \hat\beta_t + \dfrac{\mathbf{C}^{-1} x (y - x^T \hat\beta_t)}{1 + x^T \mathbf{C}^{-1} x} \\  &= \hat\beta_t + \dfrac{(y_{t+1} - x_{t+1}^T \hat\beta_t)\left(\mathbf{X}_t^T \mathbf{X}_t \right)^{-1} x_{t+1} }{1 + x_{t+1}^T \left(\mathbf{X}_t^T \mathbf{X}_t \right)^{-1} x_{t+1}}.  \end{aligned}

Update (2021-12-03): I was initially not able to derive the update for \hat\beta as presented in Reference 1. The update above is indeed correct, and matches the final result at this other reference. Thanks to commenter Jiaxin, I was able to push through the computations and verify that the update in Reference 1 is correct. Here is the proof of the equivalence (the first line being the update for \hat\beta as in Reference 1):

\begin{aligned} \hat\beta_{t+1} &= \hat\beta_t + (y_{t+1} - x_{t+1}^T \hat\beta_t) \left(\mathbf{X}_{t+1}^T \mathbf{X}_{t+1}\right)^{-1} x_{t+1}   \\  &= \hat\beta_t + (y_{t+1} - x_{t+1}^T \hat\beta_t) \left[ \mathbf{C}^{-1} - \dfrac{\mathbf{C}^{-1} x_{t+1} x_{t+1}^T \mathbf{C}^{-1}}{1 + x_{t+1}^T \mathbf{C}^{-1} x_{t+1}} \right] x_{t+1} \\  &= \hat\beta_t + (y_{t+1} - x_{t+1}^T \hat\beta_t) \dfrac{\left[ \mathbf{C}^{-1} +  x_{t+1}^T \mathbf{C}^{-1} x_{t+1}\mathbf{C}^{-1} - \mathbf{C}^{-1} x_{t+1} x_{t+1}^T \mathbf{C}^{-1} \right] x_{t+1}}{1 + x_{t+1}^T \mathbf{C}^{-1} x_{t+1}} \\  &= \hat\beta_t +  \dfrac{ (y_{t+1} - x_{t+1}^T \hat\beta_t) \mathbf{C}^{-1} x_{t+1}}{1 + x_{t+1}^T \mathbf{C}^{-1} x_{t+1}} \\  &\qquad + \dfrac{y_{t+1} - x_{t+1}^T \hat\beta_t}{1 + x_{t+1}^T \mathbf{C}^{-1} x_{t+1}} \left[  (x_{t+1}^T \mathbf{C}^{-1} x_{t+1}) \mathbf{C}^{-1} x_{t+1} - \mathbf{C}^{-1} x_{t+1} (x_{t+1}^T \mathbf{C}^{-1}  x_{t+1}) \right] \\  &= \hat\beta_t +  \dfrac{ (y_{t+1} - x_{t+1}^T \hat\beta_t) \mathbf{C}^{-1} x_{t+1}}{1 + x_{t+1}^T \mathbf{C}^{-1} x_{t+1}}, \end{aligned}

as required.

Updating S^2

The update for S^2 has a slightly nicer form if we remove the normalizing factor in front, i.e. redefine it as

\begin{aligned} S_t^2 = \sum_{i=1}^t \left( x_i^T \hat\beta_t - y_i \right)^2 = \sum_{i=1}^t r_{i, t}^2, \end{aligned}

where we have defined r_{i,t} = x_i^T \hat\beta_t - y_i to be the residual for observation i at time step t.

The derivation of the update rule for S^2 is the most involved of the lot. Let \gamma = \hat\beta_{t+1} - \hat\beta_t, and let D = 1 + x_{t+1}^T \left(\mathbf{X}_t^T \mathbf{X}_t \right)^{-1} x_{t+1} = 1 + x_{t+1}^T \mathbf{C}^{-1} x_{t+1}. With this notation, we can rewrite the first update we derived for \hat\beta as

\gamma = - \dfrac{r_{t+1, t} \mathbf{C}^{-1} x_{t+1}}{D}.


\begin{aligned} S_{t+1}^2 &= \sum_{i=1}^{t+1} \left[ (x_i^T \hat\beta_t - y_i) + x_i^T (\hat\beta_{t+1} - \hat\beta_t) \right]^2 \\  &= \sum_{i=1}^{t+1} \left[ r_{i, t} + x_i^T \gamma \right]^2 \\  &= \left[ r_{t+1, t} + x_{t+1}^T \gamma \right]^2 + \sum_{i=1}^t \left[ r_{i, t}^2 + 2r_{i, t} x_i^T \gamma + \gamma^T x_i x_i^T \gamma \right] \\  &=  \left[ r_{t+1, t} + x_{t+1}^T \gamma \right]^2 + S_t^2 + 2 \gamma^T \left( \sum_{i=1}^t r_{i, t} x_i \right) + \gamma^T \left( \sum_{i=1}^t x_i x_i^T \right) \gamma \\  &= \left[ r_{t+1, t} + x_{t+1}^T \gamma \right]^2 + S_t^2 + \gamma^T \mathbf{C} \gamma, \end{aligned}

where in the last step, we use a property of the OLS residuals at time t: \sum_{i=1}^t r_{i, t} x_i  = 0. Now, substituting the expression we had for \gamma,

\begin{aligned} S_{t+1}^2 &= S_t^2 + \left[ r_{t+1, t} + x_{t+1}^T \gamma \right]^2 +  \gamma^T \mathbf{C} \gamma \\  &= S_t^2 + \left[ r_{t+1, t} - \dfrac{r_{t+1, t} x_{t+1}^T \mathbf{C}^{-1} x_{t+1}}{D} \right]^2 \\  &\qquad+ \left(- \dfrac{r_{t+1, t} \mathbf{C}^{-1} x_{t+1}}{D} \right)^T \mathbf{C} \left( - \dfrac{r_{t+1, t} \mathbf{C}^{-1} x_{t+1}}{D}\right) \\  &= S_t^2 + \left[ r_{t+1, t} - \dfrac{r_{t+1, t} (D - 1)}{D} \right]^2 + \dfrac{r_{t+1,t}^2 x_{t+1}^T \mathbf{C}^{-1} \mathbf{C} \mathbf{C}^{-1} x_{t+1}}{D^2} \\  &= S_t^2 + \dfrac{r_{t+1, t}^2}{D^2} + \dfrac{r_{t+1, t}^2 (D - 1)}{D^2} \\  &= S_t^2 + \dfrac{r_{t+1, t}^2}{D} \\  &= S_t^2 + \dfrac{(x_{t+1}^T \hat\beta_t - y_{t+1})^2}{1 + x_{t+1}^T ( \mathbf{X}_t^T \mathbf{X}_t )^{-1}x_{t+1}}.  \end{aligned}

(This matches the formula in Reference 1.)


  1. Chou, W. (2021). Randomized Controlled Trials without Data Retention.

Computing E[1/(a+X)] for binomial X

Theorem: For a = 1, 2, \dots, let U_{n,a} = \mathbb{E}\left[\dfrac{1}{a+X} \right] with X \sim \text{Binom}(n, p). Then

\begin{aligned} U_{n,1} &= \dfrac{1 - (1-p)^{n+1}}{(n+1)p}, \\  U_{n, a} &= \dfrac{1 - (a-1) U_{n+1, a-1}}{(n+1)p} \qquad \text{for } a \geq 2. \end{aligned}

While the theorem does not give us a single explicit formula for \mathbb{E}\left[\dfrac{1}{a+X} \right], we can write out an expression for \mathbb{E}\left[\dfrac{1}{a+X} \right] for a given a by iteratively substituting the recurrence relation into itself. (Maybe we can use that to get a single explicit formula; I didn’t try too hard.)

The theorem is easy to prove if you know the identity presented in this previous post: for a > 0,

\begin{aligned} \mathbb{E}\left[ \dfrac{1}{a + X} \right] = \int_0^1 t^{a-1} \cdot P_X(t) dt, \end{aligned}

where P_X is the probability generating function for X. Letting q = 1-p, we have P_X(t) = (q + pt)^n for X \sim \text{Binom}(n, p) (proof here). For a = 1,

\begin{aligned} U_{n, 1} = \int_0^1 t^0 (q + pt)^n dt  = \left[ \dfrac{(q+pt)^{n+1}}{(n+1)p} \right]_0^1  = \dfrac{1 - q^{n+1}}{(n+1)p}. \end{aligned}

For a \geq 2, using integration by parts,

\begin{aligned} U_{n, a} &= \int_0^1 t^{a-1} (q + pt)^n dt \\  &= \left[ t^{a-1} \cdot \dfrac{(q+pt)^{n+1}}{(n+1)p} \right]_0^1 - \int_0^1 \dfrac{(q+pt)^{n+1}}{(n+1)p} \cdot (a-1)t^{a-2} dt \\  &= \left[ \dfrac{1}{(n+1)p} - 0 \right] - \dfrac{a-1}{(n+1)p} \int_0^1 t^{(a-1)-1} (q+pt)^{n+1} dt \\  &= \dfrac{1 - (a-1) U_{n+1, a-1}}{(n+1)p},  \end{aligned}

as required.

We can verify this recurrence relation via simulation. Below are two R functions:

GetExpectation <- function(n, p, a) {
  if (a == 1) {
    (1 - (1-p)^(n+1)) / ((n+1) * p)
  } else {
    (1 - (a - 1) * GetExpectation(n + 1, p, a - 1)) / ((n+1) * p)

SimulateExpectation <- function(n, p, a, nSim = 1e6) {
  X <- rbinom(nSim, size = n, prob = p)
  mean(1 / (X + a))

The first computes U_{n,a} exactly with the recurrence relation, while the second does a Monte Carlo simulation. Here is the code that verifies the identity for n = 5, p = 0.7 and a = 4:

n <- 5
p <- 0.7
a <- 4
GetExpectation(n, p, a)
# [1] 0.1361207
SimulateExpectation(n, p, a)
# [1] 0.1361158