Fitting a generalized linear model (GLM)

Assume that you have n data points (x_{i1}, \dots, x_{ip}, y_i) \in \mathbb{R}^{p+1} for i = 1, \dots, n. We want to build a generalized linear model (GLM) of the response y using the p other features X_1, \dots, X_p.

To that end, assume that the x values are all fixed. Assume that y_1, \dots, y_n are realizations of independent random variables Y_1, \dots, Y_n whose probability density (or mass) functions have the form

\begin{aligned} f(y_i ; \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right]. \end{aligned}

In the above, the form of f is known, but not the values of the \theta_i's and \phi.

Let \mu_i = \mathbb{E} [Y_i]. We assume that

\begin{aligned} \eta_i = \sum_{j=1}^p \beta_j x_{ij}, \quad i = 1, \dots, n, \end{aligned}

where \eta_i = g(\mu_i) for some link function g, assumed to be known.

The goal of fitting a GLM is to find estimates for \beta_1, \dots, \beta_p, which in turn give us estimates for \eta_i = g (\mu_i) = g(\mathbb{E}[Y_i]).
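
For concreteness, the Poisson distribution is one member of this family: if Y_i \sim \text{Poisson}(\mu_i), then

\begin{aligned} f(y_i ; \mu_i) = \exp \left[ y_i \log \mu_i - \mu_i - \log (y_i!) \right], \end{aligned}

which matches the form above with \theta_i = \log \mu_i, b(\theta_i) = e^{\theta_i}, a(\phi) = 1 and c(y_i, \phi) = -\log (y_i!). The canonical link for the Poisson is g(\mu) = \log \mu.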

Likelihood equations

More specifically, we want to find the values of \beta_1, \dots, \beta_p which maximize the log-likelihood of the data:

\begin{aligned} L = \sum_{i=1}^n L_i = \sum_{i=1}^n \log f(y_i ; \theta_i, \phi). \end{aligned}

To do that, we differentiate L w.r.t. \beta and set the result to 0. Using the chain rule along with the form of f, after some algebra we have

\begin{aligned} 0 = \frac{\partial L_i}{\partial \beta_j} = \frac{\partial L_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta_j} = \dots = \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i}.  \end{aligned}
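
(The omitted algebra uses two standard facts about this family: \mathbb{E}[Y_i] = b'(\theta_i) and \text{Var}(Y_i) = b''(\theta_i) a(\phi). With these,

\begin{aligned} \frac{\partial L_i}{\partial \theta_i} = \frac{y_i - b'(\theta_i)}{a(\phi)} = \frac{y_i - \mu_i}{a(\phi)}, \qquad \frac{\partial \theta_i}{\partial \mu_i} = \frac{1}{b''(\theta_i)} = \frac{a(\phi)}{\text{Var}(Y_i)}, \qquad \frac{\partial \eta_i}{\partial \beta_j} = x_{ij}, \end{aligned}

and multiplying the four factors together gives the right-most expression.)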

Hence, the likelihood equations (or score equations) are

\begin{aligned} \sum_{i=1}^n \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i} = 0, \quad j = 1, \dots, p. \end{aligned}

\beta appears implicitly in the equations above through \mu_i: \mu_i = g^{-1}(\sum_{j=1}^p \beta_j x_{ij}).

If we let {\bf W} \in \mathbb{R}^{n \times n} be the diagonal matrix such that w_{ii} = \left(\dfrac{\partial \mu_i}{\partial \eta_i}\right)^2 / \text{Var}(Y_i), and {\bf D} \in \mathbb{R}^{n \times n} be the diagonal matrix such that d_{ii} = \dfrac{\partial \mu_i}{\partial \eta_i}, then we can rewrite the likelihood equations in matrix form:

\begin{aligned} {\bf X}^T {\bf WD}^{-1} (y - \mu) = 0. \end{aligned}
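
A useful special case: when g is the canonical link, \eta_i = \theta_i, so \partial \mu_i / \partial \eta_i = b''(\theta_i) = \text{Var}(Y_i) / a(\phi) and {\bf WD}^{-1} = \frac{1}{a(\phi)} {\bf I}. The likelihood equations then reduce to

\begin{aligned} {\bf X}^T y = {\bf X}^T \mu. \end{aligned}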

Newton-Raphson method

Unfortunately there isn't a closed-form solution for \beta (except in very special cases). The Newton-Raphson method is an iterative method that can be used instead. At each iteration, we make a quadratic approximation of the log-likelihood and maximize that. More specifically, let u \in \mathbb{R}^p be the gradient \partial L / \partial \beta, let {\bf H} \in \mathbb{R}^{p \times p} be the Hessian matrix, and assume that \beta^{(t)} is our guess for \hat{\beta} at time t. (We can think of u and \bf H as functions of \beta.) Let u^{(t)} and {\bf H}^{(t)} be the evaluations of u and \bf H at \beta^{(t)}. A Taylor expansion of L around \beta^{(t)} gives

\begin{aligned} L(\beta) \approx L(\beta^{(t)}) + (u^{(t)})^T (\beta - \beta^{(t)}) + \frac{1}{2} (\beta - \beta^{(t)})^T {\bf H}^{(t)} (\beta - \beta^{(t)}). \end{aligned}

Taking a derivative w.r.t. \beta and setting it to 0, we can rearrange terms to solve for \beta, giving us the next guess

\begin{aligned} \beta^{(t+1)} = \beta^{(t)} - ({\bf H}^{(t)})^{-1} u^{(t)}. \end{aligned}
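
Here is a minimal Python sketch of this update (not from the original text; grad_loglik and hess_loglik are hypothetical functions returning u and \bf H at a given \beta):

```python
import numpy as np

def newton_raphson(beta0, grad_loglik, hess_loglik, n_iter=25, tol=1e-8):
    # Generic Newton-Raphson iteration for maximizing a log-likelihood.
    # grad_loglik(beta) returns the gradient u; hess_loglik(beta) returns the Hessian H.
    beta = np.asarray(beta0, dtype=float)
    for _ in range(n_iter):
        u = grad_loglik(beta)                      # gradient at the current guess
        H = hess_loglik(beta)                      # Hessian at the current guess
        step = np.linalg.solve(H, u)               # solve H * step = u rather than inverting H
        beta_new = beta - step                     # beta^(t+1) = beta^(t) - H^{-1} u
        if np.max(np.abs(beta_new - beta)) < tol:  # stop once the update is negligible
            return beta_new
        beta = beta_new
    return beta
```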

Why use Newton-Raphson? It usually converges quickly. It has second-order convergence, i.e. for large enough t, |\beta_j^{(t+1)} - \hat{\beta}_j| \leq c |\beta_j^{(t)} - \hat{\beta}_j|^2 for all j and some c > 0.

Fisher scoring method

In Newton-Raphson, we used the observed Hessian matrix \bf H evaluated at the current guess \beta^{(t)}. In Fisher scoring, we use the information matrix instead, which you can think of as (the negative of) the expectation of the Hessian. If the elements of the Hessian are h_{ab} = \partial^2 L(\beta) / \partial \beta_a \partial \beta_b, the elements of the information matrix \bf J are j_{ab} = -\mathbb{E} [\partial^2 L(\beta) / \partial \beta_a \partial \beta_b].

The updates for \beta under Fisher scoring are

\begin{aligned} \beta^{(t+1)} = \beta^{(t)} + ({\bf J}^{(t)})^{-1} u^{(t)}. \end{aligned}

If you go through the algebra for the elements of the Fisher information matrix, you will find that \bf J has a very nice expression: it is simply {\bf J} = {\bf X}^T \bf WX. (We had previously defined \bf W: a diagonal matrix with entries w_{ii} = \left(\dfrac{\partial \mu_i}{\partial \eta_i}\right)^2 / \text{Var}(Y_i).)
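
One way to see this uses the identity -\mathbb{E}[\partial^2 L / \partial \beta_a \partial \beta_b] = \mathbb{E}[(\partial L / \partial \beta_a)(\partial L / \partial \beta_b)] together with the score computed earlier (the cross terms vanish because the Y_i are independent):

\begin{aligned} j_{ab} = \sum_{i=1}^n \mathbb{E} \left[ \frac{(Y_i - \mu_i)^2 x_{ia} x_{ib}}{\text{Var}(Y_i)^2} \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 \right] = \sum_{i=1}^n \frac{x_{ia} x_{ib}}{\text{Var}(Y_i)} \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 = ({\bf X}^T {\bf WX})_{ab}. \end{aligned}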

If we are using the canonical link function, the observed Hessian matrix and the information matrix are the same, and so Fisher scoring is exactly the same as the Newton-Raphson method.

Why use this over Newton-Raphson? The inverse of the information matrix is also the asymptotic covariance of \hat{\beta}, so Fisher scoring gives you this as a by-product. It is also closely related to weighted least squares (as we will see below).

Fisher scoring as iteratively reweighted least squares (IRLS)

Multiplying both sides of the Fisher scoring update by {\bf J}^{(t)}, we get

\begin{aligned} ({\bf J}^{(t)})\beta^{(t+1)} = u^{(t)} + ({\bf J}^{(t)})\beta^{(t)}. \end{aligned}

Using the expressions for \bf J and u that we had before, the above is the same as

\begin{aligned} ({\bf X}^T {\bf W}^{(t)} {\bf X}) \beta^{(t+1)} &= {\bf X}^T {\bf W}^{(t)} ({\bf D}^{(t)})^{-1}(y - \mu^{(t)}) + ({\bf X}^T {\bf W}^{(t)} {\bf X}) \beta^{(t)} \\ &= {\bf X}^T {\bf W}^{(t)} [{\bf X}\beta^{(t)} +  ({\bf D}^{(t)})^{-1}(y - \mu^{(t)})] \\ &= {\bf X}^T {\bf W}^{(t)} z^{(t)}, \end{aligned}

where z^{(t)} = {\bf X}\beta^{(t)} +  ({\bf D}^{(t)})^{-1}(y - \mu^{(t)}).

Looking closely at the last equation, we recognize it as the normal equations (equivalently, the likelihood equations) for weighted least squares, where the response is z^{(t)} and the weight matrix is {\bf W}^{(t)}.

We can think of z^{(t)} as a linearized form of the link function g evaluated at y:

\begin{aligned} z_i^{(t)} &= \eta_i^{(t)} + (y_i - \mu_i^{(t)}) \frac{\partial \eta_i^{(t)}}{\partial \mu_i^{(t)}} \\ &= g(\mu_i^{(t)}) + (y_i - \mu_i^{(t)}) g'(\mu_i^{(t)}) \\ &\approx g(y_i). \end{aligned}

This connection gives us a way to implement the Fisher scoring method using routines for solving weighted least squares.
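
As an illustration, here is a minimal Python sketch of IRLS for a Poisson GLM with the canonical log link (not code from the original text). For this model \mu_i = e^{\eta_i}, \partial \mu_i / \partial \eta_i = \mu_i and \text{Var}(Y_i) = \mu_i, so w_{ii} = \mu_i and d_{ii} = \mu_i:

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-8):
    # Fit a Poisson GLM with log link by iteratively reweighted least squares.
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta                   # linear predictor eta = X beta
        mu = np.exp(eta)                 # inverse link: mu = exp(eta)
        W = mu                           # w_ii = (dmu/deta)^2 / Var(Y_i) = mu_i
        z = eta + (y - mu) / mu          # working response z = eta + D^{-1}(y - mu)
        XtW = X.T * W                    # X^T W, broadcasting the diagonal weights
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)  # weighted least squares step
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

For simulated data such as y = np.random.poisson(np.exp(X @ beta_true)), the iterates typically converge in a handful of steps.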



Generalized least squares (GLS) and weighted least squares (WLS)

Assume that we are in the standard regression setting where we have n observations, responses y = (y_1, \dots, y_n) \in \mathbb{R}^n, and a feature matrix X \in \mathbb{R}^{n \times p}, where x_{ij} denotes the value of the jth feature for the ith observation. Assume that X is fixed. In ordinary least squares (OLS), we assume that the true model is

\begin{aligned} y = X\beta + \varepsilon, \end{aligned}

where \mathbb{E}[\varepsilon] = 0 and \text{Cov}(\varepsilon) = \sigma^2 I for some known \sigma > 0. The OLS estimate of \beta is

\begin{aligned} \hat{\beta}_{OLS} = (X^T X)^{-1} X^T y. \end{aligned}

Under the assumptions above, the Gauss-Markov theorem says that \hat{\beta}_{OLS} is the best linear unbiased estimator (BLUE) for \beta.

Generalized least squares (GLS)

In generalized least squares (GLS), instead of assuming that \text{Cov}(\varepsilon) = \sigma^2 I, we assume that \text{Cov}(\varepsilon) = V for some known, non-singular covariance matrix V \in \mathbb{R}^{n \times n}. We have

\begin{aligned} y &= X\beta + \varepsilon, &(1)\\  V^{-1/2}y &= V^{-1/2} X \beta + \varepsilon', &(2) \end{aligned}

where \text{Cov}(\varepsilon') = \text{Cov}(V^{-1/2}\varepsilon) = I. Applying the OLS formula to this last equation (2), we get the generalized least squares estimate (GLS) for \beta:

\begin{aligned} \hat{\beta}_{GLS} &= [(V^{-1/2} X)^T(V^{-1/2} X)]^{-1}(V^{-1/2} X)^T(V^{-1/2}y) \\  &= (X^TV^{-1}X)^{-1}X^TV^{-1}y.  \end{aligned}

Why not just apply the OLS formula to (1) directly? Because the assumptions for the Gauss-Markov theorem hold for (2) but not, in general, for (1). Applying OLS to (2) therefore lets us conclude that \hat{\beta}_{GLS} is the best linear unbiased estimator (BLUE) for \beta in this setup.

Note that if we let W = V^{-1} denote the inverse covariance matrix, then the GLS solution has a slightly nicer form:

\begin{aligned} \hat{\beta}_{GLS} = (X^TWX)^{-1}X^TWy.  \end{aligned}
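
A short Python sketch of this formula (illustrative only, assuming V is positive definite):

```python
import numpy as np

def gls_estimate(X, y, V):
    # Generalized least squares: beta_hat = (X^T V^{-1} X)^{-1} X^T V^{-1} y.
    W = np.linalg.inv(V)              # inverse covariance matrix W = V^{-1}
    XtW = X.T @ W
    return np.linalg.solve(XtW @ X, XtW @ y)
```

(In practice one would typically use a Cholesky factorization of V rather than forming the explicit inverse.)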

Weighted least squares (WLS) Part 1

One can think of weighted least squares (WLS) as a special case of GLS. In WLS, we assume that the covariance matrix V for \varepsilon is diagonal. That is, each observation's error term may have its own variance, but the errors are pairwise uncorrelated. If \text{Var}(\varepsilon_i) = \sigma_i^2 for all i = 1, \dots, n and W is the diagonal matrix such that w_{ii} = \frac{1}{\sigma_i^2}, then

\begin{aligned} \hat{\beta}_{WLS} = (X^TWX)^{-1}X^TWy.  \end{aligned}

Weighted least squares (WLS) Part 2

Another sense in which the least squares problem can be “weighted” is where the observations are assigned different weights. Instead of trying to solve

\begin{aligned} \text{minimize} \: \sum_{i=1}^n \left( y_i - \sum_{j=1}^p x_{ij}\beta_j \right)^2, \end{aligned}

we want to solve

\begin{aligned} \text{minimize} \: \sum_{i=1}^n w_i \left( y_i - \sum_{j=1}^p x_{ij}\beta_j \right)^2, \end{aligned}

where the w_i's are some known “weights” that we place on the observations. (We might do so, for example, if we know that some observations are more trustworthy than others.) Note that

\begin{aligned} \sum_{i=1}^n w_i \left( y_i - \sum_{j=1}^p x_{ij}\beta_j \right)^2 = \sum_{i=1}^n \left( \sqrt{w_i} y_i - \sum_{j=1}^p \sqrt{w_i}x_{ij}\beta_j \right)^2. \end{aligned}

Thus, if W^{1/2} is a diagonal matrix with the ith diagonal entry being \sqrt{w_i}, solving the above is equivalent to performing OLS of W^{1/2} y on W^{1/2} X:

\begin{aligned} \hat{\beta}_{WLS} &= [(W^{1/2} X)^T (W^{1/2} X)]^{-1} (W^{1/2} X) ^T(W^{1/2} y) \\  &= (X^T W X)^{-1} X^TWy. \end{aligned}

This is the same expression that we had in WLS Part 1, except that the weights in W mean something different.
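
To make the equivalence concrete, here is a small Python sketch (illustrative, not from the original text) that computes the WLS estimate by running OLS on the rescaled data W^{1/2} y and W^{1/2} X:

```python
import numpy as np

def wls_estimate(X, y, w):
    # Weighted least squares via rescaling: OLS of sqrt(w)*y on sqrt(w)*X.
    sw = np.sqrt(w)                    # square roots of the observation weights
    Xw = X * sw[:, None]               # W^{1/2} X
    yw = y * sw                        # W^{1/2} y
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)  # ordinary least squares on rescaled data
    return beta
```

This returns the same estimate as computing (X^T W X)^{-1} X^T W y directly.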