The NBA playoffs are in full swing! A total of 16 teams are competing in a playoff-format competition, with the winner of each best-of-7 series moving on to the next round. In each matchup, two teams play 7 basketball games against each other, and the team that wins more games progresses. Of course, we often don’t have to play all 7 games: we can determine the overall winner once either team reaches 4 wins.

During one of the games, a commentator made a remark along the lines of *"you have no idea how hard it is for the better team to lose in a best-of-7 series"*. In this post, I'm going to try to quantify how hard it is to do so! I will do this not via analytical methods, but by Monte Carlo simulation. (You can see the code all at once here.)

Let’s imagine an idealized set-up, where for any given game, team A has probability `p` of beating team B. This probability is assumed to be the same from game to game, and the outcome of each game is assumed to be independent of the others. With these assumptions, the number of games that team A wins has a binomial distribution `Binom(7, p)`.

In our simulation, for each `p`, we will generate a large number of random values from `Binom(7, p)` and determine the proportion of them that are `>= 4`: that would be our estimate for the winning probability of the 7-game series. *How many random values should we draw?* We should draw enough so that we are reasonably confident of the accuracy of the proportion estimate. If we draw `N` samples, then the plug-in estimate of standard error is `sqrt(p_hat * (1 - p_hat) / N)`. In what follows, we will plot estimates with error bars indicating ±1 standard error.
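Before simulating, it's worth noting that the series win probability is also available in closed form, which gives us something to check our estimates against. A minimal sketch (not part of the post's code), using R's built-in binomial CDF:

```r
# Exact series win probability: team A wins a best-of-n series (n odd)
# iff it wins more than n/2 of the n games, so for X ~ Binom(n, p),
# P(win series) = P(X >= (n+1)/2) = 1 - pbinom(floor(n/2), n, p).
exact_series_prob <- function(n, p) {
  1 - pbinom(n %/% 2, n, p)
}

exact_series_prob(7, 0.7)  # roughly 0.874
exact_series_prob(1, 0.6)  # exactly 0.6: best-of-1 is just one game
```

The simulation estimates below should land within a standard error or two of these exact values.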

First, let’s set up a function that takes in three parameters: the number of simulation runs (`simN`), the total number of games in a series (`n`), and the probability that team A wins (`p`). The function returns the proportion of the `simN` runs in which team A wins.

```r
sim_fn <- function(simN, n, p) {
  if (n %% 2 == 0) {
    stop("n should be an odd number")
  }
  mean(rbinom(simN, n, p) > n / 2)
}
```

Next, we set up a data frame with each row representing a different win probability for team A. Because of the symmetry inherent in the problem, we focus on win probabilities which are `>= 0.5`.

```r
n <- 7
df <- data.frame(p = seq(0.5, 1, length.out = 26), n = n)
```

For a start, we set `simN` to be 100. For each row, we run `simN` simulation runs to get the win probability estimate and compute the associated standard error estimate. We also compute the values one standard error above and below the probability estimate, to be used when plotting error bars.

```r
set.seed(1043)
simN <- 100
df$win_prob <- apply(df, 1, function(x) sim_fn(simN, x[2], x[1]))

# compute std err & p_hat \pm 1 std err
df$std_err <- with(df, sqrt(win_prob * (1 - win_prob) / simN))
df$lower <- with(df, win_prob - std_err)
df$upper <- with(df, win_prob + std_err)
```

Here is the code for the plot and the plot itself:

```r
library(ggplot2)
ggplot(df, aes(x = p, y = win_prob)) +
  geom_line() +
  geom_linerange(aes(ymin = lower, ymax = upper)) +
  labs(title = paste("Win probability for best of", n, "series"),
       x = "Win prob for single game", y = "Win prob for series")
```

We can see that the line is still quite wiggly with large error bars, suggesting that 100 simulation runs is not enough. (Indeed, we expect the graph to be monotonically increasing.) Below, we run 100,000 simulations for each value of `p` instead (with the same random seed):

We get a much smoother line, and the error bars are so small we can hardly see them. My conclusion here is that while it is harder for the weaker team to win a best-of-7 series than a single game, the odds are not insurmountable. For example, a team that has a 70% chance of winning any one game, which is a huge advantage, still has about a 13% chance of losing a best-of-7 series: not insignificant!

How do these probabilities change if we consider best-of-`n` series for different values of `n`? The code below is very similar to that above, except that our data frame contains data for `n` equal to all the odd numbers from 1 to 15, not just `n = 7`.

```r
p <- seq(0.5, 1, length.out = 26)
n <- seq(1, 15, length.out = 8)
df <- expand.grid(p, n)
names(df) <- c("p", "n")

set.seed(422)
simN <- 100000
df$win_prob <- apply(df, 1, function(x) sim_fn(simN, x[2], x[1]))

# compute std err & p_hat \pm 1 std err
df$std_err <- with(df, sqrt(win_prob * (1 - win_prob) / simN))
df$lower <- with(df, win_prob - std_err)
df$upper <- with(df, win_prob + std_err)

ggplot(df, aes(x = p, y = win_prob, col = factor(n))) +
  geom_line() +
  geom_linerange(aes(ymin = lower, ymax = upper)) +
  labs(title = "Win probability for best of n series",
       x = "Win prob for single game", y = "Win prob for series") +
  theme(legend.position = "bottom")
```

Here is the plot:

As we expect, the series win probabilities increase as `n` increases. Also, the `n = 1` graph corresponds to the line `y = x`. It looks like the win probabilities don’t change much from best-of-9 to best-of-15.

“This probability is assumed to be the same from game to game”

Yet, sports models consider the host team to have an advantage, which varies depending on sport. How would accounting for the host team advantage affect these models?


To account for this, we could give team A two parameters: p_H for win probability at home and p_A for win probability on the road. We could then simulate over a grid of (p_H, p_A) values. We will also have to consider two different cases: whether team A starts at home or starts away.
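In that spirit, here is a rough sketch of what that simulation could look like, assuming a 2-2-1-1-1 format in which team A hosts games 1, 2, 5 and 7 (the format and the parameter values below are illustrative assumptions, not something worked out in the post):

```r
# Sketch: series win probability when team A's single-game win probability
# depends on venue (p_h at home, p_a on the road). Assumes team A hosts
# games 1, 2, 5 and 7; swap the pattern for the "starts away" case.
sim_home_away <- function(simN, p_h, p_a) {
  at_home <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
  probs <- ifelse(at_home, p_h, p_a)
  # simulate all 7 games per run; truncating early doesn't change the winner
  wins <- replicate(simN, sum(rbinom(7, 1, probs)))
  mean(wins >= 4)
}

set.seed(1)
sim_home_away(10000, 0.75, 0.65)
```

Note that `rbinom(7, 1, probs)` draws one Bernoulli result per game, each with its own venue-dependent probability.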


It looks like your win-probability calculation assumes they will play the entire series even if a team has already won? If a team wins the first 4 games of the best of 7, they don’t play the remaining 3. I’m not sure if this actually affects the probability of a team winning, however.


It won’t affect the probability. We could have simulated game by game and stopped once one team hit 4 wins, but in R that is a little cumbersome to code up and potentially slower to run as well. (To see that it’s the same: think of our method as simulating 7 full games, then truncating the series once one team has hit 4 wins. That’s the same as simulating game by game until one team wins 4 games.)
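For the curious, a game-by-game version might look something like this (a sketch under the same i.i.d. assumptions, not the post's code); its estimates line up with the all-7-games approach:

```r
# Simulate a best-of-7 game by game, stopping as soon as either team
# reaches 4 wins. Returns TRUE if team A wins the series.
sim_series_early_stop <- function(p) {
  wins_a <- 0
  wins_b <- 0
  while (wins_a < 4 && wins_b < 4) {
    if (runif(1) < p) wins_a <- wins_a + 1 else wins_b <- wins_b + 1
  }
  wins_a == 4
}

set.seed(1)
mean(replicate(10000, sim_series_early_stop(0.7)))  # close to 0.874
```

The per-run `while` loop is exactly the kind of thing that is slower in R than a single vectorized `rbinom` call, which is why the post avoids it.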


Nice analysis! I’m sure that an expert statistician could derive an analytic result with three lines of algebra, but it would be boring and wouldn’t mean anything to a general audience. So let’s kick it up a notch, and leave that smart guy behind.

The probability of winning the Nth game in a series isn’t independent of N. Players and coaches learn from the previous games. (I play an individual sport; when I face a lower-ranked club-mate in a tournament it’s insanely hard to beat them. They know me too well.) A simple model might be that the probability of victory in game N converges to 0.5 like exp(-N).
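As a quick illustration of how a model like that could be wired into the same machinery (the decay form and the rate parameter below are purely hypothetical, chosen only to make the point):

```r
# Toy "learning" model: team A's win probability for game N decays toward
# 0.5 as p_N = 0.5 + (p1 - 0.5) * exp(-c * (N - 1)). With a vector of
# per-game probabilities, rbinom(7, 1, game_probs) draws one result per game.
sim_learning <- function(simN, p1, c = 0.3) {
  game_probs <- 0.5 + (p1 - 0.5) * exp(-c * (0:6))
  wins <- replicate(simN, sum(rbinom(7, 1, game_probs)))
  mean(wins >= 4)
}

set.seed(1)
sim_learning(10000, 0.7)  # noticeably lower than the constant-p ~0.87
```

Under this model the stronger team's series advantage erodes as the series goes on, so longer series help it less than in the i.i.d. set-up.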

It’d be interesting to see if 7 games is some kind of optimum balance between predictability and excitement.
