Deviance for logistic regression

In a previous post, I explained what deviance is and how it can be viewed as a generalization of the residual sum of squares in linear models. In this post, I will derive the deviance formula for logistic regression, i.e. for data with a binary response.

We have data (\boldsymbol{x}_1, y_1), \dots, (\boldsymbol{x}_K, y_K), where \boldsymbol{x}_i \in \mathbb{R}^p and y_i \in \{ 0, 1 \}. (As usual, y_i denotes the response variable, while the \boldsymbol{x}_i are the variables we use to explain or predict the response.) Recall that deviance is

\begin{aligned} D_{\mathcal{M}} = -2 (\log L_{\mathcal{M}} - \log L_{\mathcal{S}}), \end{aligned}

where L_{\mathcal{M}} denotes the maximum achievable likelihood under our model while L_{\mathcal{S}} denotes the likelihood under the “saturated model”. Our model treats the \boldsymbol{x}_i‘s as fixed, and y_i = 1 with probability p_i, where p_i is a function of \boldsymbol{x}_i.

Let’s compute L_{\mathcal{M}} first. If, given \boldsymbol{x}_i, our model predicts the success probability to be \hat{p}_i, then the likelihood associated with this data point is

\begin{aligned} \hat{p}_i^{y_i} \left(1 - \hat{p}_i \right)^{1 - y_i}. \end{aligned}

Since the data points are assumed to be independent, we have

\begin{aligned} L_{\mathcal{M}} &= \prod_{i=1}^K \hat{p}_i^{y_i} \left(1 - \hat{p}_i \right)^{1 - y_i}, \\  \log L_{\mathcal{M}} &= \sum_{i=1}^K y_i \log \hat{p}_i + (1 - y_i) \log \left(1 - \hat{p}_i \right). \end{aligned}

(Note: If all we are interested in is fitting our model via maximum likelihood, we can stop here, since the remaining calculations are for a term which does not involve the model parameters.) Next, let’s compute L_{\mathcal{S}}. In the saturated model, the success probability for data point i is simply y_i, so (using the convention 0 \log 0 = 0)

\begin{aligned} L_{\mathcal{S}} &= \prod_{i=1}^K y_i^{y_i} \left(1 - y_i \right)^{1 - y_i}, \\  \log L_{\mathcal{S}} &= \sum_{i=1}^K y_i \log y_i + (1 - y_i) \log \left(1 - y_i \right). \end{aligned}

Hence,

\begin{aligned} \text{Deviance} &= -2 \sum_{i=1}^K \left[ y_i \log \hat{p}_i + (1 - y_i) \log \left(1 - \hat{p}_i \right) - y_i \log y_i - (1 - y_i) \log \left(1 - y_i \right) \right] \\  &= -2 \sum_{i=1}^K \left[ y_i \log \left( \frac{\hat{p}_i}{y_i}\right) + (1-y_i) \log \left( \frac{1-\hat{p}_i}{1-y_i} \right) \right] \\  &= 2 \sum_{i=1}^K \left[ y_i \log \left( \frac{y_i}{\hat{p}_i}\right) + (1-y_i) \log \left( \frac{1-y_i}{1-\hat{p}_i} \right) \right]. \end{aligned}

In the logistic regression model we can simplify this further. There, we have

\begin{aligned} \text{logit}(\hat{p}_i) = \boldsymbol{x}_i^T \beta, \quad \Leftrightarrow \quad \hat{p}_i = \frac{e^{\boldsymbol{x}_i^T \beta}}{1 + e^{\boldsymbol{x}_i^T \beta}}.  \end{aligned}

Plugging this expression into the deviance formula:

\begin{aligned} \text{Deviance} &= 2 \sum_{i=1}^K \left[ y_i \log \left( \frac{y_i(1 + e^{\boldsymbol{x}_i^T \beta})}{e^{\boldsymbol{x}_i^T \beta}}\right) + (1-y_i) \log \left( (1-y_i)(1 + e^{\boldsymbol{x}_i^T \beta}) \right) \right] \\  &= 2 \sum_{i=1}^K \left[ y_i \log y_i - y_i \boldsymbol{x}_i^T \beta + (1-y_i) \log (1-y_i) + \log (1 + e^{\boldsymbol{x}_i^T \beta}) \right].  \end{aligned}

If we are trying to find the \beta which minimizes deviance (i.e. maximum likelihood estimation), the first and third terms do not depend on \beta, so

\begin{aligned} \hat{\beta}_{MLE} &= \underset{\beta}{\text{argmin}} \:  \sum_{i=1}^K \left[ - y_i \boldsymbol{x}_i^T \beta + \log (1 + e^{\boldsymbol{x}_i^T \beta}) \right] \\  &= \underset{\beta}{\text{argmax}} \:  \sum_{i=1}^K \left[ y_i \boldsymbol{x}_i^T \beta - \log (1 + e^{\boldsymbol{x}_i^T \beta}) \right]. \end{aligned}
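As a sanity check on the algebra (my own addition, not part of the original derivation), we can compare the formula above with the deviance that R's glm reports. Since each y_i is either 0 or 1, the y_i \log y_i and (1 - y_i) \log (1 - y_i) terms vanish and the deviance is simply -2 \log L_{\mathcal{M}}. A minimal sketch using the built-in mtcars data, with am (a 0/1 variable) as the response:

# fit a logistic regression model (response and predictors chosen purely for illustration)
data(mtcars)
fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)
p_hat <- fitted(fit)   # predicted success probabilities
y <- mtcars$am

# deviance formula derived above
dev_manual <- -2 * sum(y * log(p_hat) + (1 - y) * log(1 - p_hat))
c(manual = dev_manual, glm = deviance(fit))   # the two values should agree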


What is the Poisson binomial distribution?

I recently learnt of the Poisson binomial distribution from my friend Evan Rosenman.

If we have n independent and identically distributed Bernoulli random variables X_1, \dots, X_n, each with probability of success p, then their sum is a binomial random variable with n trials and probability of success p:

B = X_1 + \dots + X_n \sim \text{Binom}(n, p).

What if the X_i‘s are independent but not identically distributed? That is, X_i \sim \text{Bern}(p_i), where the p_i‘s are not all equal? In that case, we say that their sum B has the Poisson binomial distribution.

The Poisson binomial distribution has support \{ 0, 1, \dots, n\}. The probability mass function involves a combinatorial sum. For k \in \{ 0, 1, \dots, n\}, let F_k denote the collection of all subsets of \{ 1, \dots, n\} of size k. Then

\begin{aligned} \mathbb{P}(B = k) &= \sum_{A \in F_k} \prod_{i \in A} p_i \prod_{j \notin A} (1- p_j). \end{aligned}

Because B is the sum of independent random variables, its mean and variance are easy to compute:

\begin{aligned} \mathbb{E}[B] &= \sum_{i=1}^n \mathbb{E}[X_i] = \sum_{i=1}^n p_i, \\  \text{Var} (B) &= \sum_{i=1}^n \text{Var} (X_i) = \sum_{i=1}^n p_i(1-p_i). \end{aligned}

One application where the Poisson binomial distribution comes up naturally is in voting. For a given voting precinct, we can model individual i as voting for a certain party (say, party A) with probability p_i. Then, assuming that the votes are independent, the total number of votes for party A has the Poisson binomial distribution.

In R, the probability mass/distribution/quantile and random number generation functions for the Poisson-binomial distribution are (d/p/q/r)poisbinom from the poisbinom package.
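As a quick check of the formulas above (a sketch of my own, using only base R rather than the package), we can evaluate the PMF by direct enumeration for a small n and verify that it sums to 1 and that its mean and variance match \sum_i p_i and \sum_i p_i (1 - p_i):

# Poisson binomial PMF by direct enumeration (only feasible for small n)
dpoisbinom_naive <- function(k, p) {
    n <- length(p)
    if (k == 0) return(prod(1 - p))
    sets <- combn(n, k)   # each column is a subset A of {1, ..., n} of size k
    sum(apply(sets, 2, function(A) prod(p[A]) * prod(1 - p[-A])))
}

p <- c(0.1, 0.5, 0.7, 0.9)
k <- 0:length(p)
pmf <- sapply(k, dpoisbinom_naive, p = p)
sum(pmf)                            # should be 1
sum(k * pmf)                        # should equal sum(p)
sum(k^2 * pmf) - sum(k * pmf)^2     # should equal sum(p * (1 - p))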

Two interesting facts about high-dimensional random projections

John Cook recently wrote an interesting blog post on random vectors and random projections. In the post, he states two surprising facts of high-dimensional geometry and gives some intuition for the second fact. In this post, I will provide R code to demonstrate both of them.

Fact 1: Two randomly chosen vectors in a high-dimensional space are very likely to be nearly orthogonal.

Cook does not discuss this fact as it is “well known”. Let me demonstrate it empirically. Below, the first function generates a p-dimensional unit vector uniformly at random. The second function takes in two p-dimensional vectors, x1 and x2, and computes the angle between them. (For details, see Cook’s blog post.)

genRandomVec <- function(p) {
    # generate a p-dimensional unit vector uniformly at random
    x <- rnorm(p)
    x / sqrt(sum(x^2))
}

findAngle <- function(x1, x2) {
    # angle (in radians) between x1 and x2, via the normalized dot product
    dot_prod <- sum(x1 * x2) / (sqrt(sum(x1^2) * sum(x2^2)))
    acos(dot_prod)
}

Next, we use the replicate function to generate 100,000 pairs of 10,000-dimensional vectors and plot a histogram of the angles they make:

simN <- 100000  # no. of simulations
p <- 10000
set.seed(100)
angles <- replicate(simN, findAngle(genRandomVec(p), genRandomVec(p)))
angles <- angles / pi * 180  # convert from radians to degrees
hist(angles)

Note the scale of the x-axis: the angles are very closely bunched up around 90 degrees, as claimed.

This phenomenon only happens for “high” dimensions. If we change the value of p above to 2, we obtain a very different histogram:

How “high” does the dimension have to be before we see this phenomenon kick in? Well, it depends on how tightly bunched up we want the angles to be around 90 degrees. The histogram below is the same simulation but for p = 20 (notice the wider x-axis scale):

It seems like the bell-shaped curve already starts to appear with p = 3!

Fact 2: Generate 10,000 random vectors in 20,000 dimensional space. Now, generate another random vector in that space. Then the angle between this vector and its projection on the span of the first 10,000 vectors is very likely to be very near 45 degrees.

Cook presents a very cool intuitive explanation of this fact which I highly recommend. Here, I present simulation evidence of the fact.

The difficulty in this simulation is computing the projection of a vector onto the span of many vectors. It can be shown that the projection of a vector v onto the column span of a (full-rank) matrix A is given by \text{proj}_A (v) = A(A^T A)^{-1}A^T v (see this post and this post for details). For our fact, A is a 20,000 \times 10,000 matrix, so computing (A^T A)^{-1} is going to take prohibitively long.

I don’t know another way to compute the projection of a vector onto the span of other vectors. (Does anyone know of better ways?) Fortunately, based on my simulations in Fact 1, this phenomenon will probably kick in for much smaller dimensions too!
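(One possible alternative, offered here only as a sketch and not from Cook's post: base R's qr.fitted computes exactly this projection from a QR decomposition of A, without ever forming (A^T A)^{-1} explicitly, which is more numerically stable.)

projectVecQR <- function(v, A) {
    # projection of v onto the column span of A via a QR decomposition,
    # equivalent to A %*% solve(t(A) %*% A) %*% t(A) %*% v but without the explicit inverse
    qr.fitted(qr(A), v)
}

Even so, a 20,000 x 10,000 decomposition is still a heavy computation, so the simulations below stick to smaller dimensions anyway.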

First, let’s write up two functions: one that takes a vector v and matrix A and returns the projection of v onto the column span of A:

projectVec <- function(v, A) {
    # projection of v onto the column span of A: A (A^T A)^{-1} A^T v
    A %*% solve(t(A) %*% A) %*% t(A) %*% v
}

and a function that does one run of the simulation. Here, p is the dimensionality of each of the vectors, and I assume that we are looking at the span of p/2 vectors:

simulationRun <- function(p) {
    A <- replicate(p/2, genRandomVec(p))
    v <- genRandomVec(p)
    proj_v <- projectVec(v, A)
    findAngle(proj_v, v)
}

The code below runs 10,000 simulations for p = 20, taking about 2 seconds on my laptop:

simN <- 10000 # no. of simulations
p <- 20     # dimension of the vectors
set.seed(54)
angles <- replicate(simN, simulationRun(p))
angles <- angles / pi * 180  # convert from radians to degrees
hist(angles, breaks = seq(0, 90, 5))
abline(v = 45, col = "red", lwd = 3)

We can already see the bunching around 45 degrees:

The simulation for p = 200 takes just under 2 minutes on my laptop, and we see tighter bunching around 45 degrees (note the x-axis scale).

The sinh-arcsinh normal distribution

This month’s issue of Significance magazine has a very nice summary article of the sinh-arcsinh normal distribution. (Unfortunately, the article seems to be behind a paywall.)

This distribution was first introduced by Chris Jones and Arthur Pewsey in 2009 as a generalization of the normal distribution. The normal distribution is symmetric, has light to moderate tails, and is defined by just two parameters (\mu for location and \sigma for scale); the sinh-arcsinh distribution has two additional parameters which control asymmetry and tail weight.

Given the 4 parameters, the sinh-arcsinh normal distribution is defined as

\begin{aligned} X = \mu + \sigma \cdot \text{sinh}\left[ \frac{\text{sinh}^{-1} (Z) + \nu}{\tau} \right], \end{aligned}

where Z is a standard normal random variable, and \text{sinh}(x) = \dfrac{e^x - e^{-x}}{2} and \text{sinh}^{-1}(x) = \log \left( x + \sqrt{1 + x^2} \right) are the hyperbolic sine function and its inverse.

  • \mu controls the location of the distribution (where it is “centered” at),
  • \sigma controls the scale (the larger it is, the more spread out the distribution is),
  • \nu controls the asymmetry of the distribution (can be any real value, more positive means more right skew, more negative means more left skew), and
  • \tau controls tail weight (any positive real value, \tau > 1 means lighter than normal distribution, \tau < 1 means heavier).

From the expression, we can also see that when \nu = 0 and \tau = 1, the distribution reduces to the normal distribution with mean \mu and standard deviation \sigma.

In R, the gamlss.dist package provides functions for working with this distribution. The package provides functions for 3 different parametrizations of this distribution; the parametrization above corresponds to the SHASHo set of functions. As is usually the case in R, dSHASHo, pSHASHo, qSHASHo and rSHASHo are for the density, distribution function, quantile function and random generation for the distribution.
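Before the plots, here is a quick sanity check (my own addition) of the claim that \nu = 0 and \tau = 1 recovers the normal distribution: the SHASHo density with those settings should coincide with dnorm.

library(gamlss.dist)

# with nu = 0 and tau = 1, the SHASHo density should equal the N(mu, sigma^2) density
x <- seq(-4, 4, length.out = 101)
all.equal(dSHASHo(x, mu = 1, sigma = 2, nu = 0, tau = 1),
          dnorm(x, mean = 1, sd = 2))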

First, we demonstrate the effect of skewness (i.e. varying \nu).

library(gamlss.dist)
library(dplyr)
library(ggplot2)

x <- seq(-6, 6, length.out = 301)
nu_list <- -3:3
df <- data.frame()
for (nu in nu_list) {
    temp_df <- data.frame(x = x, 
                          y = dSHASHo(x, mu = 0, sigma = 1, nu = nu, tau = 1))
    temp_df$nu <- nu
    df <- rbind(df, temp_df)
}

As \nu becomes more positive, the distribution becomes more right-skewed:

df %>% filter(nu >= 0) %>%
    ggplot(aes(x = x, y = y, col = factor(nu))) +
    geom_line() + theme_bw()

As \nu becomes more negative, the distribution becomes more left-skewed:

df %>% filter(nu <= 0) %>%
    ggplot(aes(x = x, y = y, col = factor(nu))) +
    geom_line() + theme_bw()

Next, we demonstrate the effect that varying \tau has on the weight of the tails. The code and picture below are for the case where there is no skewness in the distribution:

tau_list <- c(0.25, 0.75, 1, 1.5)
df <- data.frame()
for (tau in tau_list) {
    temp_df <- data.frame(x = x, 
                          y = dSHASHo(x, mu = 0, sigma = 1, nu = 0, tau = tau))
    temp_df$tau <- tau
    df <- rbind(df, temp_df)
}

ggplot(data = df, aes(x = x, y = y, col = factor(tau))) +
    geom_line() + theme_bw()

By changing nu = 0 to nu = 1 in the code above, we see the effect of tail weight when there is skewness:

(Note: For reasons unclear to me, the Significance article uses different symbols for the 4 parameters: \xi instead of \mu, \eta instead of \sigma, \epsilon instead of \nu and \delta instead of \tau.)

The authors note that it is possible to perform maximum likelihood estimation with this distribution. It is an example of GAMLSS regression, which can be performed in R using the gamlss package.

References:

  1. Jones, M. C. and Pewsey, A. (2019). The sinh-arcsinh normal distribution. Significance.
  2. Jones, M. C. and Pewsey, A. (2009). Sinh-arcsinh distributions. Biometrika.

Testing numeric variables for NA/NaN/Inf

In R, a numeric variable is either a number (like 0, 42, or -3.14), or one of 4 special values: NA, NaN, Inf or -Inf. It can be hard to remember how the is.x functions treat each of the special values, especially NA and NaN! The table below summarizes how each of these values is treated by different base R functions. Functions are listed in alphabetical order.

Function      Typical number   NA      NaN     Inf     -Inf
is.double     TRUE             FALSE   TRUE    TRUE    TRUE
is.finite     TRUE             FALSE   FALSE   FALSE   FALSE
is.infinite   FALSE            FALSE   FALSE   TRUE    TRUE
is.integer    *1               FALSE   FALSE   FALSE   FALSE
is.na         FALSE            TRUE    TRUE    FALSE   FALSE
is.nan        FALSE            FALSE   TRUE    FALSE   FALSE
is.null       FALSE            FALSE   FALSE   FALSE   FALSE
is.numeric    TRUE             *2      TRUE    TRUE    TRUE

*1: is.integer can return TRUE or FALSE, depending on whether the number is stored as an integer: is.integer(1L) returns TRUE, but is.integer(1) returns FALSE since 1 is stored as a double.

*2: R actually has different types of NAs (see here for more details). is.numeric(NA) returns FALSE, but is.numeric(NA_integer_) and is.numeric(NA_real_) return TRUE. Interestingly, is.numeric(NA_complex_) returns FALSE.
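The table can be reproduced with a few lines of base R; here is a minimal sketch (the output is a logical matrix rather than the nicely formatted table above):

# apply each is.* function to a typical number and to each special value
vals <- list("Typical number" = 1, "NA" = NA, "NaN" = NaN, "Inf" = Inf, "-Inf" = -Inf)
fns <- c("is.double", "is.finite", "is.infinite", "is.integer",
         "is.na", "is.nan", "is.null", "is.numeric")
sapply(vals, function(v) sapply(fns, function(f) match.fun(f)(v)))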

Many ways to do the same thing: linear regression

One feature of R (could be positive, could be negative) is that there are many ways to do the same thing. In this post, I list out the different ways we can get certain results from a linear regression model. Feel free to comment if you know more ways other than those listed!

In what follows, we will use the linear regression object lmfit:

data(mtcars)
lmfit <- lm(mpg ~ hp + cyl, data = mtcars)

Extracting coefficients of the linear model

# print the lm object to screen
lmfit

# part of the summary output
summary(lmfit)

# extract from summary output
summary(lmfit)$coefficients[, 1]

# use the coef function
coef(lmfit)

# extract using list syntax
lmfit$coefficients

Getting fitted values for the training data set

# use the predict function
predict(lmfit)

# extract from lm object
lmfit$fitted.values

Getting residuals for the training data set

# use the resid function
resid(lmfit)

# extract from lm object
lmfit$residuals

# extract from lm summary
summary(lmfit)$residuals
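Two more ways that I'm aware of (my additions to the lists above): the generic fitted and residuals functions from the stats package also work on lm objects.

# another way to get fitted values for the training data
fitted(lmfit)

# another way to get residuals (resid is an abbreviated form of residuals)
residuals(lmfit)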

Influence functions and plug-in estimators

In statistics, we frequently assume that our data are X_1, \dots, X_n \stackrel{iid}{\sim} F, where F is some distribution, and we want to estimate some quantity that depends on F. (Assume that the X‘s live in some set S.) In parametric statistics, we assume that the true F belongs to some parametric family (e.g. Gaussian distributions \mathcal{N}(\mu, \sigma^2)), and we estimate some functional of the parameters (e.g. for the Gaussian parametric family, estimating the mean means estimating \mu). In nonparametric statistics, we try not to make assumptions except when we need to (e.g. assuming that F has a finite first moment). The object of estimation can be expressed as a function of F, written as \theta = T(F). Common objects of estimation are the mean \mu = \int x dF(x), the variance \sigma^2 = \int (x-\mu)^2 dF(x), and the median m = F^{-1}(1/2).

(Side note: The letter F is used to denote several things at once. It can refer to the distribution itself as an object. For distributions F defined on \mathbb{R}, it can refer to the cumulative distribution function (CDF) of the distribution, F(x) = \mathbb{P} \{ X \leq x \}, where X \sim F. It can also refer more generally to the (probability) measure F defines on the space S, where the X‘s live. What F refers to can be deduced from the context.)

To estimate \theta = T(F), we often use the plug-in estimator \hat{\theta}_n = T(\hat{F}_n), where \hat{F}_n is the empirical distribution. Simply put, the empirical distribution \hat{F}_n is our best guess for the true distribution F, so by analogy T(\hat{F}_n) is probably a good estimate of T(F).
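To make this concrete, here is a small sketch (my own example) of plug-in estimation in R: the plug-in estimates of the mean, variance and median are simply the corresponding quantities of the empirical distribution. (Note that the plug-in variance divides by n, whereas R's var divides by n - 1.)

set.seed(42)
x <- rexp(100)   # a sample from some distribution F (here, Exponential(1))

mean(x)                      # plug-in estimate of the mean, integral of x dF_n(x)
mean((x - mean(x))^2)        # plug-in estimate of the variance (divides by n, unlike var())
quantile(x, 0.5, type = 1)   # plug-in estimate of the median, inverse of the empirical CDF at 1/2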

How good is this estimate? It turns out that we have a central limit theorem for this situation. Before stating the theorem, we need to define the influence function. For a given statistic T and distribution F, the influence function L_F is defined as

\begin{aligned} L_F(x) = \lim_{\epsilon \rightarrow 0} \dfrac{T((1-\epsilon)F + \epsilon \delta_x) - T(F)}{\epsilon} \quad \text{for all } x \in \mathbb{R}, \end{aligned}

where \delta_x is the point mass at x. You can think of the influence function as a derivative, trying to quantify the change in the statistic T when F is “contaminated” with a point mass at x.

Here is the CLT for the plug-in estimator:

Assume that T(F) is of the form T(F) = \int a(x) dF(x). Let \tau^2 = \int L_F^2 (x) dF(x). If \tau^2 < \infty, then \sqrt{n}[T(F) - T(\hat{F}_n)] converges in distribution to \mathcal{N}(0, \tau^2).
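For example (a check of my own, not from the reference), take T(F) = \int x \, dF(x), i.e. the mean. Then L_F(x) = x - \mu, \tau^2 = \text{Var}(X), and the theorem reduces to the ordinary central limit theorem for the sample mean. A quick simulation:

# simulate the CLT for the plug-in estimator of the mean, i.e. the sample mean
set.seed(1)
n <- 200
nsim <- 10000
mu <- 1; tau <- 1   # Exponential(1) has mean 1 and variance 1
stats <- replicate(nsim, sqrt(n) * (mean(rexp(n)) - mu))
c(sd = sd(stats), tau = tau)          # sd should be close to tau
hist(stats, breaks = 50, freq = FALSE)
curve(dnorm(x, mean = 0, sd = tau), add = TRUE, col = "red", lwd = 2)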

References:

  1. Wasserman, L. All of nonparametric statistics (Chapter 2).