The minimax concave penalty (MCP)

Assume that we are in the regression context with response y \in \mathbb{R}^n and design matrix X \in \mathbb{R}^{n \times p}. In an earlier post, we described the SCAD (smoothly clipped absolute deviation) penalty, which was introduced as an alternative to the lasso that gives less biased estimates for the non-zero regression coefficients.

The minimax concave penalty (MCP) is another alternative that gives less biased regression coefficients in sparse models. Introduced by Zhang (2010), MCP solves

\text{minimize}_\beta \quad\frac{1}{2} \| y - X\beta\|_2^2 + \sum_{j=1}^p p(\beta_j),

where the penalty p is applied coordinate-wise and its derivative is

\begin{aligned} p'(\beta) = \begin{cases} \text{sgn}(\beta) \left(\lambda - \dfrac{|\beta|}{a} \right) &\text{if } |\beta| \leq a\lambda, \\ 0 &\text{otherwise.} \end{cases} \end{aligned}

Here, a is a second hyperparameter which is usually taken to be greater than 1. The penalty function can be written explicitly:

\begin{aligned} p(\beta) = \begin{cases} \lambda |\beta| - \dfrac{\beta^2}{2a} &\text{if } |\beta| \leq a\lambda, \\ \dfrac{a\lambda^2}{2} &\text{otherwise.} \end{cases} \end{aligned}

Below is a plot of the lasso (black), SCAD (red) and MCP (blue) penalty functions, where we have set \lambda = 1 and a = 3. The dotted lines are the penalties’ transition points (\pm \lambda and \pm a \lambda). The SCAD and lasso penalties are the same for \beta \in [-\lambda, \lambda]. The SCAD and MCP penalties are constant for \beta \notin [-a \lambda, a \lambda].
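
The figure itself isn't reproduced here, but it is straightforward to recreate. Below is a minimal R sketch of my own (assuming the standard lasso, SCAD and MCP penalty formulas with \lambda = 1 and a = 3); it is not the original plotting code.

# Sketch: lasso, SCAD and MCP penalties with lambda = 1, a = 3 (not the original code)
lasso_pen <- function(beta, lambda = 1) lambda * abs(beta)
scad_pen <- function(beta, lambda = 1, a = 3) {
  ifelse(abs(beta) <= lambda,
         lambda * abs(beta),
         ifelse(abs(beta) <= a * lambda,
                (2 * a * lambda * abs(beta) - beta^2 - lambda^2) / (2 * (a - 1)),
                (a + 1) * lambda^2 / 2))
}
mcp_pen <- function(beta, lambda = 1, a = 3) {
  ifelse(abs(beta) <= a * lambda,
         lambda * abs(beta) - beta^2 / (2 * a),
         a * lambda^2 / 2)
}

beta <- seq(-4, 4, length.out = 401)
plot(beta, lasso_pen(beta), type = "l", ylab = "penalty")   # lasso in black
lines(beta, scad_pen(beta), col = "red")                    # SCAD in red
lines(beta, mcp_pen(beta), col = "blue")                    # MCP in blue
abline(v = c(-3, -1, 1, 3), lty = 3)                        # transition points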

MCP is named as such because of the following result: Among all continuously differentiable penalty functions p satisfying p'(0+) = \lambda (“selection”) and p'(t) = 0 for all t \geq a\lambda (“unbiasedness”), MCP minimizes the maximum concavity

\begin{aligned} \kappa = \sup_{0 < t_1 < t_2} \left\{ - \dfrac{p'(t_2) - p'(t_1)}{t_2 - t_1} \right\}. \end{aligned}

Why is minimizing the maximum concavity desirable? Zhang 2010 gives a fairly terse explanation:

Although the unbiasedness and selection features [i.e. the conditions we place on the penalty function] preclude convex penalties, the MCP provides the sparse convexity to the broadest extent by minimizing the maximum concavity.

In other words, among penalty functions satisfying the two conditions, MCP is the “most convex” of them all. Personally, I’m not sure how important a feature that is, since MCP is still not convex, and I’m not sure that minimizing the maximum concavity makes computation any easier.
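
As an aside, if you want to fit MCP-penalized regressions in practice, Patrick Breheny (the author of Reference 2) maintains the ncvreg package, which implements MCP and SCAD via coordinate descent. Here is a minimal sketch of my own, assuming ncvreg is installed (its gamma argument plays the role of a above):

# Sketch assuming the ncvreg package; gamma corresponds to the a parameter above
library(ncvreg)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), nrow = n)
y <- X[, 1] - X[, 2] + rnorm(n)
fit <- ncvreg(X, y, penalty = "MCP", gamma = 3)  # fits a whole path of lambda values
coef(fit, lambda = 0.1)                          # coefficients at lambda = 0.1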

References:

  1. Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty.
  2. Breheny, P. Adaptive lasso, MCP, and SCAD.

OLS coefficients as a weighted sum of OLS coefficients on subsets of data

I recently learned of an interesting result from Subrahmanyam (1972) that expresses the ordinary least squares (OLS) coefficients as a weighted sum of OLS coefficients computed on subsets of the data.

Imagine that one has a data matrix X \in \mathbb{R}^{n \times p} consisting of n observations, each with p features, as well as a response vector y \in \mathbb{R}^n. Assume that n \geq p, and that X has rank p (i.e. full rank).

Let’s now introduce a bit more notation for matrices that we will use later.

  • Let S = X^T X.
  • Let X^j be the matrix obtained by replacing the jth column of X with y.
  • Let S^j be the matrix obtained by replacing the jth column of S = X^T X with X^T y. Note that S^j = X^T X^{j}.
  • For any subset K \subseteq \{ 1, 2, \dots, n\} and any matrix A with n rows, let A_{(K)} be the matrix obtained by taking just the rows of A with indices in K.
  • Let \hat{\beta} be the OLS coefficients for data (X, y).
  • Let \hat{\beta}^{K} be the OLS coefficients for data (X_{(K)}, y_{(K)}). Note that this is NOT the same as \hat{\beta}_{(K)}, which are a subset of the OLS coefficients of the whole data.

The result is based on the following lemma on determinants:

Lemma.

\begin{aligned} |S| = \sum_K |X_{(K)}|^2, \quad |S^j| = \sum_K |X_{(K)}| \cdot |X_{(K)}^j|, \end{aligned}

where |A| is the determinant of the matrix A, and the sums are taken over all subsets K \subseteq \{ 1, 2, \dots, n\} of size p.

This lemma is apparently a well-known result. It appeared as an exercise problem in Rao’s Linear Statistical Inference and its Applications (p32 exercise 2.2 in the 2nd edition).

And now for the result:

Theorem. 

\begin{aligned} \hat{\beta} = \sum_K w_K \hat{\beta}^K = \sum_K \dfrac{|X_{(K)}|^2}{\sum_L |X_{(L)}|^2} \cdot \hat{\beta}^K, \end{aligned}

where the sums for K and L are over all subsets K \subseteq \{ 1, 2, \dots, n\} of size p.

The proof simply uses Cramer’s rule and the fact that the OLS solution \hat{\beta} for data (X, y) solves the normal equations (X^T X)\hat{\beta} = X^T y, i.e. S\hat{\beta} = X^T y:

\begin{aligned} \hat{\beta}_j &= \dfrac{|S^j|}{|S|} &\text{(Cramer's rule)} \\  &= \dfrac{ \sum_K |X_{(K)}| \cdot |X_{(K)}^j| }{ \sum_L |X_{(L)}|^2 } &\text{(lemma above)} \\  &= \dfrac{1}{ \sum_L |X_{(L)}|^2 } \sum_K |X_{(K)}|^2 \cdot \dfrac{ |X_{(K)}^j| }{|X_{(K)}|} \\  &= \dfrac{1}{ \sum_L |X_{(L)}|^2 } \sum_K |X_{(K)}|^2 \cdot \hat{\beta}_j^K, \end{aligned}

where the last step is Cramer’s rule applied to the square system X_{(K)} \hat{\beta}^K = y_{(K)} (subsets with |X_{(K)}| = 0 contribute zero weight and can simply be dropped from the sum).

In their paper introducing quantile regression, Koenker and Bassett Jr (1978) mention this result. They note that OLS places positive weight on all \hat{\beta}^K such that the corresponding data matrix X_{(K)} has non-zero determinant. In contrast, they introduce a class of estimators which place positive weight on only a few of the \hat{\beta}^K.
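
The theorem is also easy to check numerically. Here is a small R sketch of my own (not from the references) that verifies it on simulated data with n = 6 and p = 2, enumerating all subsets of size 2:

# Numerical check of the theorem (simulated data, my own sketch)
set.seed(1)
n <- 6; p <- 2
X <- matrix(rnorm(n * p), nrow = n)
y <- X %*% c(1, -1) + rnorm(n)

beta_hat <- solve(t(X) %*% X, t(X) %*% y)    # OLS on the full data

subsets <- combn(n, p)                       # all subsets K of size p
dets  <- apply(subsets, 2, function(K) det(X[K, , drop = FALSE]))
betas <- apply(subsets, 2, function(K) solve(X[K, , drop = FALSE], y[K]))
w <- dets^2 / sum(dets^2)                    # weights |X_(K)|^2 / sum_L |X_(L)|^2

cbind(beta_hat, betas %*% w)                 # the two columns should match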

References:

  1. Subrahmanyam, M. (1972). A property of simple least squares estimates.
  2. Koenker, R., and Bassett Jr, G. (1978). Regression quantiles.

Non-negative least squares

Imagine that one has a data matrix X \in \mathbb{R}^{n \times p} consisting of n observations, each with p features, as well as a response vector y \in \mathbb{R}^n. We want to build a model for y using the feature columns in X. In ordinary least squares (OLS), one seeks a vector of coefficients \hat{\beta} \in \mathbb{R}^p such that

\begin{aligned} \hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \quad  \| y - X\beta \|_2^2. \end{aligned}

In non-negative least squares (NNLS), we seek a vector of coefficients \hat{\beta} \in \mathbb{R}^p that minimizes \| y - X \beta\|_2^2 subject to the additional requirement that each element of \hat{\beta} be non-negative.

There are a number of ways to perform NNLS in R. The first two methods come from Reference 1, while I came up with the third. (I’m not sharing the third method that Reference 1 details because that post notes it is buggy.)

Let’s generate some fake data that we will use for the rest of the post:

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), nrow = n)
y <- x %*% matrix(rep(c(1, -1), length.out = p), ncol = 1) + rnorm(n)

Method 1: the nnls package

library(nnls)
mod1 <- nnls(x, y)
mod1$x 
# [1] 0.9073423 0.0000000 1.2971069 0.0000000 0.9708051 
# [6] 0.0000000 1.2002310 0.0000000 0.3947028 0.0000000

Method 2: the glmnet package

The glmnet() function solves the minimization problem

\begin{aligned} \hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \quad \frac{1}{2n} \| y - X\beta \|_2^2 + \lambda \left[ \frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha \|\beta\|_1 \right], \end{aligned}

where \alpha and \lambda are hyperparameters the user chooses. By setting \alpha = 1 (the default) and \lambda = 0, glmnet() ends up solving the OLS problem. Setting lower.limits = 0 forces the coefficients to be non-negative. We should also set intercept = FALSE so that we don't have an extraneous intercept term.

library(glmnet)
mod2 <- glmnet(x, y, lambda = 0, lower.limits = 0, intercept = FALSE)
coef(mod2)
# 11 x 1 sparse Matrix of class "dgCMatrix"
# s0
# (Intercept) .        
# V1          0.9073427
# V2          .        
# V3          1.2971070
# V4          .        
# V5          0.9708049
# V6          .        
# V7          1.2002310
# V8          .        
# V9          0.3947028
# V10         . 

Method 3: the bvls package

NNLS is a special case of bounded-variable least squares (BVLS), where instead of having constraints \beta_j \geq 0 for each j = 1, \dots, p, one has constraints a_j \leq \beta_j \leq b_j for each j. BVLS is implemented in the bvls() function of the bvls package:

library(bvls)
mod3 <- bvls(x, y, bl = rep(0, p), bu = rep(Inf, p))
mod3$x
# [1] 0.9073423 0.0000000 1.2971069 0.0000000 0.9708051 
# [6] 0.0000000 1.2002310 0.0000000 0.3947028 0.0000000

In the above, bl contains the lower limits for the coefficients while bu contains the upper limits for the coefficients.
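
Beyond these three packages, NNLS is just a quadratic program, so a general-purpose QP solver also works. Here is a sketch of my own using the quadprog package (assuming it is installed); it is not one of the methods from Reference 1:

# NNLS as a quadratic program: minimize (1/2) b'(X'X)b - (X'y)'b subject to b >= 0
library(quadprog)
mod4 <- solve.QP(Dmat = t(x) %*% x, dvec = drop(t(x) %*% y),
                 Amat = diag(p), bvec = rep(0, p))
mod4$solution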

References:

  1. Things I Thought At One Point. Three ways to do non-negative least squares in R.

An unofficial vignette for the gamsel package

I’ve been working on a project/package that closely mirrors that of GAMSEL (generalized additive model selection), a method for fitting sparse generalized additive models (GAMs). In preparing my package, I realized that (i) the gamsel package which implements GAMSEL doesn’t have a vignette, and (ii) I could modify the vignette for my package minimally to create one for gamsel. So here it is!

For a markdown version of the vignette, go here. Unfortunately LaTeX doesn’t play well on Github… The Rmarkdown file is available here: you can download it and knit it on your own machine.

Introduction

This is an unofficial vignette for the gamsel package. GAMSEL (Generalized Additive Model Selection) is a method for fitting sparse generalized additive models proposed by Alexandra Chouldechova and Trevor Hastie. Here is the abstract of the paper:

We introduce GAMSEL (Generalized Additive Model Selection), a penalized likelihood approach for fitting sparse generalized additive models in high dimension. Our method interpolates between null, linear and additive models by allowing the effect of each variable to be estimated as being either zero, linear, or a low-complexity curve, as determined by the data. We present a blockwise coordinate descent procedure for efficiently optimizing the penalized likelihood objective over a dense grid of the tuning parameter, producing a regularization path of additive models. We demonstrate the performance of our method on both real and simulated data examples, and compare it with existing techniques for additive model selection.

Let there be n observations, each consisting of p features. Let \mathbf{X} \in \mathbb{R}^{n \times p} denote the overall feature matrix, and let y \in \mathbb{R}^n denote the vector of responses. Let X_j \in \mathbb{R}^n denote the jth column of \mathbf{X}.

For each variable X_j, let U_j \in \mathbb{R}^{n \times m_j} be the matrix of evaluations of m_j basis functions u_1, \dots, u_{m_j} at points x_{1j}, \dots, x_{nj}. The model response from GAMSEL is of the form

\begin{aligned} \hat{y} = \alpha_0 + \sum_{j=1}^p \alpha_j X_j + U_j \beta_j, \end{aligned}

where \beta_j \in \mathbb{R}^{m_j}.

For more details on the method, see the arXiv paper. For `gamsel`’s official R documentation, see this link.

The gamsel() function

The purpose of this section is to give users a general sense of the gamsel() function, which is probably the most important function of the package. First, we load the gamsel package:

library(gamsel)
#> Loading required package: foreach
#> Loading required package: mda
#> Loading required package: class
#> Loaded mda 0.4-10
#> Loaded gamsel 1.8-1

Let’s generate some data:

set.seed(1)
n <- 100; p <- 12
x <- matrix(rnorm(n * p), ncol = p)
f4 <- 2 * x[, 4]^2 + 4 * x[, 4] - 2
f5 <- -2 * x[, 5]^2 + 2
f6 <- 0.5 * x[, 6]^3
mu <- rowSums(x[, 1:3]) + f4 + f5 + f6
y <- mu + sqrt(var(mu) / 4) * rnorm(n)

We fit the model using the most basic call to gamsel():

fit <- gamsel(x, y)

The function gamsel() fits GAMSEL for a path of lambda values and returns a gamsel object. Typical usage is to have gamsel() specify the lambda sequence on its own. (By default, the model is fit for 50 different lambda values.) The returned gamsel object contains some useful information on the fitted model. For a given value of the \lambda hyperparameter, GAMSEL gives the predictions of the form

\begin{aligned} \hat{y} = \alpha_0 + \sum_{j=1}^p \alpha_j X_j + U_j \beta_j, \end{aligned}

where \alpha_0 \in \mathbb{R}, \alpha_j \in \mathbb{R} and \beta_j \in \mathbb{R}^{m_j} are fitted coefficients.

Printing the returned gamsel object tells us how many features, linear components and non-linear components were included in the model for each lambda value respectively. It also shows the fraction of null deviance explained by the model and the lambda value for that model.

fit
#> 
#> Call:  gamsel(x = x, y = y) 
#> 
#>       NonZero Lin NonLin    %Dev  Lambda
#>  [1,]       0   0      0 0.00000 80.1300
#>  [2,]       1   1      0 0.03693 72.9400
#>  [3,]       1   1      0 0.06754 66.4000
#>  [4,]       1   1      0 0.09290 60.4400
#>  [5,]       1   1      0 0.11390 55.0200
#>  [6,]       1   1      0 0.13130 50.0900
#>  [7,]       1   1      0 0.14580 45.5900
#>  .... (redacted for conciseness)

gamsel has a tuning parameter gamma which is between 0 and 1. Smaller values of gamma penalize the linear components less than the non-linear components, resulting in more linear components for the fitted model. The default value is gamma = 0.4.

fit2 <- gamsel(x, y, gamma = 0.8)

By default, each variable is given m_j = 10 basis functions. This can be modified with the degrees option, and this value can differ from variable to variable (to allow for this, pass a vector of length equal to the number of variables to the degrees option).

By default, the maximum degrees of freedom for each variable is 5. This can be modified with the dfs option, with larger values allowing more “wiggly” fits. Again, this value can differ from variable to variable.
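
For example, here is a sketch (with arbitrarily chosen values) that gives every variable 8 basis functions and at most 4 degrees of freedom; either argument could instead be a vector that varies across variables:

# Sketch with arbitrary values: 8 basis functions, at most 4 df for every variable
fit3 <- gamsel(x, y, degrees = rep(8, p), dfs = rep(4, p))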

Predictions

Predictions from this model can be obtained by using the predict method of the gamsel() function output: each column gives the predictions for a value of lambda.

# get predictions for all values of lambda
all_predictions <- predict(fit, x)
dim(all_predictions)
#> [1] 100  50

# get predictions for 20th model for first 5 observations
all_predictions[1:5, 20]
#> [1]  1.88361056 -4.47189543  8.05935149 -0.04271584  5.93270321

One can also specify the lambda indices for which predictions are desired:

# get predictions for 20th model for first 5 observations
predict(fit, x[1:5, ], index = 20)
#>              l20
#> [1,]  1.88361056
#> [2,] -4.47189543
#> [3,]  8.05935149
#> [4,] -0.04271584
#> [5,]  5.93270321

The predict method has a type option which allows it to return different types of information to the user. type = "terms" gives the fitted function values for each variable. This can be useful for understanding the effect that each variable has on the response. Note that what is returned is a 3-dimensional array, with dimensions (observations) x (lambda indices) x (variables)!

# effect of variables for first 10 observations and 20th model
# effect of variables for first 10 observations and 20th model
indiv_fits <- predict(fit, x[1:10, ], index = c(20), type = "terms")
dim(indiv_fits)
#> [1] 10  1 12
# effect of variable 4 on these first 10 observations
plot(x[1:10, 4], indiv_fits[, , 4])

type = "nonzero" returns a list of indices of non-zero coefficients at a given lambda.

# variables selected by GAMSEL at the 10th and 50th lambda values
predict(fit, x, index = c(10, 50), type = "nonzero")
#> $l10
#> [1] 2 3 4 5 6
#> 
#> $l50
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12

Plots and summaries

Let’s fit the basic gamsel model again:

fit <- gamsel(x, y)

fit is a class “gamsel” object which comes with a plot method. The plot method shows us the relationship our predicted response has with each input feature, i.e. it plots \alpha_j X_j + U_j \beta_j vs. X_j for each j. Besides passing fit to the plot() call, the user must also pass an input matrix x: this is used to determine the coordinate limits for the plot. It is recommended that the user simply pass in the same input matrix that the GAMSEL model was fit on.

By default, plot() gives the fitted functions for the last value of the lambda sequence in fit, and gives plots for all the features. For high-dimensional data, this latter default is problematic as it will produce too many plots! You can pass a vector of indices to the which option to specify which features you want plots for. The code below gives plots for the first 4 features:

par(mfrow = c(1, 4))
par(mar = c(4, 2, 2, 2))
plot(fit, x, which = 1:4)

The user can specify the index of the lambda value to show using the index option:

# show fitted functions for x2, x5 and x8 at the model for the 15th lambda value
par(mfrow = c(1, 3))
plot(fit, x, index = 15, which = c(2, 5, 8))

Linear functions are colored green, non-linear functions are colored red, while zero functions are colored blue.

Class “gamsel” objects also have a summary method which allows the user to see the coefficient profiles of the linear and non-linear features. On each plot (one for linear features and one for non-linear features), the x-axis is the \lambda value going from large to small. For linear components, the y-axis is the coefficient for each variable while for the nonlinear components, it is the norm of the nonlinear coefficients.

par(mfrow = c(1, 2))
summary(fit)

We can include annotations to show which profile belongs to which feature by specifying label = TRUE.

par(mfrow = c(1, 2))
summary(fit, label = TRUE)

Cross-validation

We can perform k-fold cross-validation (CV) for GAMSEL with cv.gamsel(). It does 10-fold cross-validation by default:

cvfit <- cv.gamsel(x, y)

We can change the number of folds using the nfolds option:

cvfit <- cv.gamsel(x, y, nfolds = 5)

If we want to specify which observation belongs to which fold, we can do that by specifying the foldid option, which is a vector of length n, with the ith element being the fold number for observation i.

set.seed(3)
foldid <- sample(rep(seq(10), length = n))
cvfit <- cv.gamsel(x, y, foldid = foldid)

A cv.gamsel() call returns a cv.gamsel object. We can plot this object to get the CV curve with error bars (one standard error in each direction). The left vertical dotted line represents lambda.min, the lambda value which attains minimum CV error, while the right vertical dotted line represents lambda.1se, the largest lambda value with CV error within one standard error of the minimum CV error.

plot(cvfit)

The numbers at the top represent the number of features in our original input matrix that are included in the model.

The two special lambda values, as well as their indices in the lambda path, can be extracted directly from the cv.gamsel object:

# lambda values
cvfit$lambda.min
#> [1] 3.959832
cvfit$lambda.1se
#> [1] 8.39861
# corresponding lambda indices
cvfit$index.min
#> [1] 33
cvfit$index.1se
#> [1] 25

Logistic regression with binary data

In the examples above, y was a quantitative variable (i.e. takes values along the real number line). As such, using the default family = "gaussian" for gamsel() was appropriate. In theory, the GAMSEL algorithm is very flexible and can be used when y is not a quantitative variable. In practice, gamsel() has been implemented for binary response data. The user can use gamsel() (or cv.gamsel()) to fit a model for binary data by specifying family = "binomial". All the other functions we talked about above can be used in the same way.

In this setting, the response y should be a numeric vector containing just 0s and 1s. When doing prediction, note that by default predict() gives just the value of the linear predictors, i.e.

\begin{aligned} \log [\hat{p} / (1 - \hat{p})] = \hat{y} = \alpha_0 + \sum_{j=1}^p \alpha_j X_j + U_j \beta_j, \end{aligned}

where \hat{p} is the predicted probability. To get the predicted probability, the user has to pass type = "response" to the predict() call.

# fit binary model
bin_y <- ifelse(y > 0, 1, 0)
binfit <- gamsel(x, bin_y, family = "binomial")

# linear predictors for first 5 observations at 10th model
predict(binfit, x[1:5, ])[, 10]
#> [1]  0.1293867 -0.4531994  0.4528493 -0.2539989  0.3576436

# predicted probabilities for first 5 observations at 10th model
predict(binfit, x[1:5, ], type = "response")[, 10]
#> [1] 0.5323016 0.3886003 0.6113165 0.4368395 0.5884699

The hidden diagnostic plots for the lm object

When plotting an lm object in R, one typically sees a 2 by 2 panel of diagnostic plots, much like the one below:

set.seed(1)
x <- matrix(rnorm(200), nrow = 20)
y <- rowSums(x[,1:3]) + rnorm(20)
lmfit <- lm(y ~ x)
summary(lmfit)
par(mfrow = c(2, 2))
plot(lmfit)

This link has an excellent explanation of each of these 4 plots, and I highly recommend giving it a read.

Most R users are familiar with these 4 plots. But did you know that the plot() function for lm objects can actually give you 6 plots? It says so right in the documentation (see ?plot.lm).

We can specify which of the 6 plots we want when calling this function using the which option. By default, we are given plots 1, 2, 3 and 5. Let’s have a look at what plots 4 and 6 are.

Plot 4 is of Cook’s distance vs. observation number (i.e. row number). Cook’s distance is a measure of how influential a given observation is on the linear regression fit, with a value > 1 typically indicating a highly influential point. By plotting this value against row number, we can see if highly influential points exhibit any relationship to their position in the dataset. This is useful for time series data as it can indicate if our fit is disproportionately influenced by data from a particular time period.

Here is what plot 4 might look like:

plot(lmfit, which = 4)

Plot 6 is of Cook’s distance against (leverage)/(1 – leverage). An observation’s leverage must fall in the interval [0,1], so plotting against (leverage)/(1 – leverage) allows the x-axis to span the whole positive real line. The contours on the plot represent points where the absolute value of the standardized residual is the same. On this plot they happen to be straight lines; the documentation says so as well but I haven’t had time to check it mathematically.

Here is what plot 6 might look like:

plot(lmfit, which = 6)

I’m not too sure how one should interpret this plot. As far as I know, one should take extra notice of points with high leverage and/or high Cook’s distance. So any observation in the top-left, top-right or bottom-right corner should be taken note of. If anyone knows of a better way to interpret this plot, let me know!
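
If you want to examine the quantities behind plot 6 yourself, the building blocks are easy to compute (a quick sketch of my own, not what plot.lm does internally):

# Recompute the raw ingredients of plot 6 by hand (sketch)
h <- hatvalues(lmfit)        # leverages
d <- cooks.distance(lmfit)   # Cook's distances
plot(h / (1 - h), d,
     xlab = "Leverage / (1 - Leverage)", ylab = "Cook's distance")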

Random FORTRAN FAQ

I recently got myself into a project involving FORTRAN. Here are some random FORTRAN FAQ that I’ve accumulated through working through code.

Q: What does 0d0 mean?
A: Double precision 0. The first 0 is the value, D means “times 10 to the power of”, and the second 0 is the exponent. (Reference)

Q: What does implicit double precision(a-h,o-z) mean?
A: In FORTRAN, variables have default types based on their starting character, and an implicit statement overrides those defaults. The statement above means that any variable whose name starts with a letter from a-h or o-z is a double-precision real. To turn off implicit typing entirely, use the statement implicit none. (Reference)

Q: What are the default types for FORTRAN variables?
A: By default, variables starting with letters i-n or I-N are integer type, while those beginning with any other letter are real type. (Reference)

Q: What does the sign() function do? Why does it take two arguments?
A: It is a sign copying function. sign(a,b) returns the value of a with the sign of b. More specifically, if b is greater than or equal to 0, sign(a,b) returns the absolute value of a; if not, it returns negative the absolute value of a. (Reference)

Fitting a generalized linear model (GLM)

Assume that you have n data points (x_{i1}, \dots, x_{ip}, y_i) \in \mathbb{R}^{p+1} for i = 1, \dots, n. We want to build a generalized linear model (GLM) of the response y using the p other features X_1, \dots, X_p.

To that end, assume that the x values are all fixed. Assume that y_1, \dots, y_n are samples of independent random variables Y_1, \dots, Y_n which have the probability density (or mass) function of the form

\begin{aligned} f(y_i ; \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right]. \end{aligned}

In the above, the form of f is known, but not the values of the \theta_i's and \phi.

Let \mu_i = \mathbb{E} [Y_i]. We assume that

\begin{aligned} \eta_i = \sum_{j=1}^p \beta_j x_{ij}, \quad i = 1, \dots, n, \end{aligned}

where \eta_i = g(\mu_i) for some link function g, assumed to be known.

The goal of fitting a GLM is to find estimates for \beta_1, \dots, \beta_p, which in turn give us estimates for \eta_i = g (\mu_i) = g(\mathbb{E}[Y_i]).

Likelihood equations

More specifically, we want to find the values of \beta_1, \dots, \beta_p which maximize the (log)-likelihood of the data:

\begin{aligned} L = \sum_{i=1}^n L_i = \sum_{i=1}^n \log f(y_i ; \theta_i, \phi). \end{aligned}

To do that, we differentiate L w.r.t. \beta and set the partial derivatives to 0. Using the chain rule as well as the form of f, after some algebra we have

\begin{aligned} \frac{\partial L_i}{\partial \beta_j} = \frac{\partial L_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta_j} = \dots = \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i}. \end{aligned}

Summing over i and setting each partial derivative of L to 0, the likelihood equations (or score equations) are

\begin{aligned} \sum_{i=1}^n \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i} = 0, \quad j = 1, \dots, p. \end{aligned}

\beta appears implicitly in the equations above through \mu_i: \mu_i = g^{-1}(\sum_{j=1}^p \beta_j x_{ij}).

If we let {\bf W} \in \mathbb{R}^{n \times n} be the diagonal matrix such that w_{ii} = \left(\dfrac{\partial \mu_i}{\partial \eta_i}\right)^2 / \text{Var}(Y_i), and {\bf D} \in \mathbb{R}^{n \times n} be the diagonal matrix such that d_{ii} = \dfrac{\partial \mu_i}{\partial \eta_i}, then we can rewrite the likelihood equations in matrix form:

\begin{aligned} {\bf X}^T {\bf WD}^{-1} (y - \mu) = 0. \end{aligned}

Newton-Raphson method

Unfortunately there isn’t a closed-form solution for \beta (except in very special cases). The Newton-Raphson method is an iterative method that can be used instead. At each iteration, we make a quadratic approximation of the log-likelihood and maximize that. More specifically, let u \in \mathbb{R}^p be the gradient \partial L / \partial \beta, let {\bf H} \in \mathbb{R}^{p \times p} be the Hessian matrix, and assume that \beta^{(t)} is our guess for \hat{\beta} at time t. (We can think of u and \bf H as functions of \beta.) Let u^{(t)} and {\bf H}^{(t)} be the evaluations of u and \bf H at \beta^{(t)}. Taylor expansion of L around \beta^{(t)} gives

\begin{aligned} L(\beta) \approx L(\beta^{(t)}) + (u^{(t)})^T (\beta - \beta^{(t)}) + \frac{1}{2} (\beta - \beta^{(t)})^T {\bf H}^{(t)} (\beta - \beta^{(t)}). \end{aligned}

Taking a derivative w.r.t. \beta and setting it to 0, we can rearrange terms to solve for \beta, giving us the next guess

\begin{aligned} \beta^{(t+1)} = \beta^{(t)} - ({\bf H}^{(t)})^{-1} u^{(t)}. \end{aligned}

Why use Newton-Raphson? It usually converges quickly. It has second-order convergence, i.e. for large enough t, |\beta_j^{(t+1)} - \hat{\beta}_j| \leq c |\beta_j^{(t)} - \hat{\beta}_j|^2 for all j and some c > 0.

Fisher scoring method

In Newton-Raphson, we used the observed Hessian matrix \bf H evaluated at the current guess \beta^{(t)}. In Fisher scoring, we use the information matrix instead, which you can think of as (the negative of) the expectation of the Hessian. If the elements of the Hessian are h_{ab} = \partial^2 L(\beta) / \partial \beta_a \partial \beta_b, the elements of the information matrix \bf J are j_{ab} = -\mathbb{E} [\partial^2 L(\beta) / \partial \beta_a \partial \beta_b].

The updates for \beta under Fisher scoring are

\begin{aligned} \beta^{(t+1)} = \beta^{(t)} + ({\bf J}^{(t)})^{-1} u^{(t)}. \end{aligned}

If you go through the algebra for the elements of the Fisher information matrix, you will find that \bf J has a very nice expression: it is simply {\bf J} = {\bf X}^T \bf WX. (We had previously defined \bf W: a diagonal matrix with entries w_{ii} = \left(\dfrac{\partial \mu_i}{\partial \eta_i}\right)^2 / \text{Var}(Y_i).)

If we are using the canonical link function, the observed Hessian matrix and the information matrix are the same, and so Fisher scoring is exactly the same as the Newton-Raphson method.

Why use this over Newton-Raphson? The inverse of the information matrix is also the asymptotic covariance of \hat{\beta}, so Fisher scoring gives you this as a by-product. It is also closely related to weighted least squares (as we will see below).

Fisher scoring as iterated reweighted least squares (IRLS)

Multiplying both sides of the Fisher scoring update by {\bf J}^{(t)}, we get

\begin{aligned} ({\bf J}^{(t)})\beta^{(t+1)} = u^{(t)} + ({\bf J}^{(t)})\beta^{(t)}. \end{aligned}

Using expression for \bf J and u that we had before, the above is the same as

\begin{aligned} ({\bf X}^T {\bf W}^{(t)} {\bf X}) \beta^{(t+1)} &= {\bf X}^T {\bf W}^{(t)} ({\bf D}^{(t)})^{-1}(y - \mu^{(t)}) + ({\bf X}^T {\bf W}^{(t)} {\bf X}) \beta^{(t)} \\ &= {\bf X}^T {\bf W}^{(t)} [{\bf X}\beta^{(t)} +  ({\bf D}^{(t)})^{-1}(y - \mu^{(t)})] \\ &= {\bf X}^T {\bf W}^{(t)} z^{(t)}, \end{aligned}

where z^{(t)} = {\bf X}\beta^{(t)} +  ({\bf D}^{(t)})^{-1}(y - \mu^{(t)}).

Looking closely at the last equation, we recognize it as the normal equations for weighted least squares, where the response is z^{(t)} and the weight matrix is {\bf W}^{(t)}.

We can think of z^{(t)} as a linearized form of the link function g evaluated at y:

\begin{aligned} z_i^{(t)} &= \eta_i^{(t)} + (y_i - \mu_i^{(t)}) \frac{\partial \eta_i^{(t)}}{\partial \mu_i^{(t)}} \\ &= g(\mu_i^{(t)}) + (y_i - \mu_i^{(t)}) g'(\mu_i^{(t)}) \\ &\approx g(y_i). \end{aligned}

This connection gives us a way to implement the Fisher scoring method using routines for solving weighted least squares problems.
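
As a concrete illustration, here is a bare-bones R sketch of IRLS for logistic regression (my own code, only meant to mirror the updates above, not to replace glm()). Since the logit link is canonical, this is also the Newton-Raphson method.

# Bare-bones IRLS for logistic regression (sketch mirroring the updates above).
# For the logit link, d mu / d eta = mu (1 - mu) and Var(Y_i) = mu_i (1 - mu_i),
# so the diagonal of W is mu (1 - mu) and D^{-1} (y - mu) = (y - mu) / (mu (1 - mu)).
irls_logistic <- function(X, y, n_iter = 25) {
  beta <- rep(0, ncol(X))
  for (t in seq_len(n_iter)) {
    eta <- drop(X %*% beta)
    mu  <- 1 / (1 + exp(-eta))
    w   <- mu * (1 - mu)                      # diagonal of W
    z   <- eta + (y - mu) / w                 # working response z
    beta <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))  # weighted least squares
  }
  beta
}

# Compare with glm() on simulated data
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, 1 / (1 + exp(-drop(X %*% c(-0.5, 1, 2)))))
cbind(irls = irls_logistic(X, y),
      glm = coef(glm(y ~ X - 1, family = binomial)))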

References:

  1. Agresti, A. Categorical Data Analysis (3rd ed), Chapter 4.