Fitting a generalized linear model (GLM)

Assume that you have $n$ data points $(x_{i1}, \dots, x_{ip}, y_i) \in \mathbb{R}^{p+1}$ for $i = 1, \dots, n$. We want to build a generalized linear model (GLM) of the response $y$ using the $p$ other features $X_1, \dots, X_p$.

To that end, assume that the $x$ values are all fixed. Assume that $y_1, \dots, y_n$ are samples of independent random variables $Y_1, \dots, Y_n$ which have the probability density (or mass) function of the form

\begin{aligned} f(y_i ; \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right]. \end{aligned}

In the above, the form of $f$ is known, but not the values of the $\theta_i$'s and $\phi$.
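As a concrete example (a standard case, not specific to this post), the Poisson distribution with mean $\lambda_i$ has this form:

\begin{aligned} f(y_i ; \lambda_i) = \frac{\lambda_i^{y_i} e^{-\lambda_i}}{y_i!} = \exp \left[ y_i \log \lambda_i - \lambda_i - \log (y_i!) \right], \end{aligned}

so $\theta_i = \log \lambda_i$, $b(\theta_i) = e^{\theta_i}$, $a(\phi) = 1$ and $c(y_i, \phi) = -\log (y_i!)$.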

Let $\mu_i = \mathbb{E} [Y_i]$. We assume that

\begin{aligned} \eta_i = \sum_{j=1}^p \beta_j x_{ij}, \quad i = 1, \dots, n, \end{aligned}

where $\eta_i = g(\mu_i)$ for some link function $g$, assumed to be known.

The goal of fitting a GLM is to find estimates for $\beta_1, \dots, \beta_p$, which in turn give us estimates for $\eta_i = g (\mu_i) = g(\mathbb{E}[Y_i])$.

Likelihood equations

More specifically, we want to find the values of $\beta_1, \dots, \beta_p$ which maximize the (log)-likelihood of the data:

\begin{aligned} L = \sum_{i=1}^n L_i = \sum_{i=1}^n \log f(y_i ; \theta_i, \phi). \end{aligned}

To do that, we differentiate $L$ with respect to $\beta$ and set the result to 0. Using the chain rule as well as the form of $f$, after some algebra we have

\begin{aligned} 0 = \frac{\partial L_i}{\partial \beta_j} = \frac{\partial L_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta_j} = \dots = \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i}. \end{aligned}
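The algebra in between uses two standard exponential-family identities, which follow from differentiating $L_i = [y_i \theta_i - b(\theta_i)]/a(\phi) + c(y_i, \phi)$:

\begin{aligned} \frac{\partial L_i}{\partial \theta_i} = \frac{y_i - b'(\theta_i)}{a(\phi)}, \qquad \mu_i = b'(\theta_i), \qquad \text{Var}(Y_i) = b''(\theta_i) a(\phi). \end{aligned}

Hence $\partial \theta_i / \partial \mu_i = 1 / b''(\theta_i)$, and since $\partial \eta_i / \partial \beta_j = x_{ij}$, multiplying the four factors together gives the expression on the right.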

Hence, the likelihood equations (or score equations) are

\begin{aligned} \sum_{i=1}^n \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i} = 0, \quad j = 1, \dots, p. \end{aligned}

$\beta$ appears implicitly in the equations above through $\mu_i$: $\mu_i = g^{-1}(\sum_{j=1}^p \beta_j x_{ij})$.
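To make the score concrete, here is a small numeric sketch (simulated data; the choice of a Bernoulli response with logit link is mine, purely for illustration). For the logit link, $\text{Var}(Y_i) = \mu_i(1-\mu_i) = \partial \mu_i / \partial \eta_i$, so the score simplifies to ${\bf X}^T (y - \mu)$; we can sanity-check it against a finite-difference gradient of the log-likelihood:

```python
import numpy as np

# Illustrative sketch: evaluate the GLM score for logistic regression.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
mu_true = 1.0 / (1.0 + np.exp(-(X @ np.array([0.5, -1.0, 0.25]))))
y = rng.binomial(1, mu_true)

def score(beta):
    """Score vector sum_i (y_i - mu_i) x_ij / Var(Y_i) * dmu/deta.
    For the logit link this collapses to X^T (y - mu)."""
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - mu)

def log_lik(beta):
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))

# Check the score against a central finite-difference gradient.
beta0 = np.zeros(p)
eps = 1e-6
fd = np.array([(log_lik(beta0 + eps * e) - log_lik(beta0 - eps * e)) / (2 * eps)
               for e in np.eye(p)])
assert np.allclose(score(beta0), fd, atol=1e-4)
```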

If we let ${\bf W} \in \mathbb{R}^{n \times n}$ be the diagonal matrix such that $w_{ii} = \left(\dfrac{\partial \mu_i}{\partial \eta_i}\right)^2 / \text{Var}(Y_i)$, and ${\bf D} \in \mathbb{R}^{n \times n}$ be the diagonal matrix such that $d_{ii} = \dfrac{\partial \mu_i}{\partial \eta_i}$, then we can rewrite the likelihood equations in matrix form:

\begin{aligned} {\bf X}^T {\bf WD}^{-1} (y - \mu) = 0. \end{aligned}
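As a quick check of this matrix form, here is a sketch with simulated data and a probit link (my choice of example; a non-canonical link makes ${\bf W}{\bf D}^{-1}$ non-trivial):

```python
import numpy as np
from scipy.stats import norm

# Illustrative sketch: verify X^T W D^{-1} (y - mu) matches the
# component-wise likelihood equations for a probit model.
rng = np.random.default_rng(1)
n, p = 200, 2
X = rng.normal(size=(n, p))
beta = np.array([0.3, -0.7])
eta = X @ beta
mu = norm.cdf(eta)                   # inverse link: mu = Phi(eta)
y = rng.binomial(1, norm.cdf(X @ np.array([0.5, -0.5])))

dmu_deta = norm.pdf(eta)             # d mu / d eta = phi(eta)
var_y = mu * (1 - mu)                # Bernoulli variance
W = np.diag(dmu_deta**2 / var_y)
D = np.diag(dmu_deta)

score_matrix_form = X.T @ W @ np.linalg.inv(D) @ (y - mu)

# Same quantity computed directly from the likelihood equations.
score_direct = X.T @ ((y - mu) * dmu_deta / var_y)
assert np.allclose(score_matrix_form, score_direct)
```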

Newton-Raphson method

Unfortunately there isn’t a closed-form solution for $\beta$ (except in very special cases). The Newton-Raphson method is an iterative method that can be used instead. At each iteration, we make a quadratic approximation of the log-likelihood and maximize that. More specifically, let $u \in \mathbb{R}^p$ be the gradient $\partial L / \partial \beta$, let ${\bf H} \in \mathbb{R}^{p \times p}$ be the Hessian matrix, and assume that $\beta^{(t)}$ is our guess for $\hat{\beta}$ at time $t$. (We can think of $u$ and $\bf H$ as functions of $\beta$.) Let $u^{(t)}$ and ${\bf H}^{(t)}$ be the evaluations of $u$ and $\bf H$ at $\beta^{(t)}$. A Taylor expansion of $L$ around $\beta^{(t)}$ gives

\begin{aligned} L(\beta) \approx L(\beta^{(t)}) + (u^{(t)})^T (\beta - \beta^{(t)}) + \frac{1}{2} (\beta - \beta^{(t)})^T {\bf H}^{(t)} (\beta - \beta^{(t)}). \end{aligned}

Taking a derivative w.r.t. $\beta$ and setting it to 0, we can rearrange terms to solve for $\beta$, giving us the next guess

\begin{aligned} \beta^{(t+1)} = \beta^{(t)} - ({\bf H}^{(t)})^{-1} u^{(t)}. \end{aligned}
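A minimal sketch of this update for logistic regression, where the gradient and Hessian have simple closed forms (simulated data, illustrative only):

```python
import numpy as np

# Illustrative sketch: Newton-Raphson for logistic regression.
# For this model u = X^T (y - mu) and H = -X^T diag(mu(1 - mu)) X.
rng = np.random.default_rng(2)
n, p = 500, 3
X = rng.normal(size=(n, p))
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([1.0, -0.5, 0.25])))))

beta = np.zeros(p)
for _ in range(25):
    mu = 1 / (1 + np.exp(-(X @ beta)))
    u = X.T @ (y - mu)                  # gradient u^{(t)}
    H = -(X.T * (mu * (1 - mu))) @ X    # Hessian H^{(t)}
    step = np.linalg.solve(H, u)
    beta = beta - step                  # beta^{(t+1)} = beta^{(t)} - H^{-1} u
    if np.max(np.abs(step)) < 1e-10:
        break

# At the maximizer the likelihood equations hold, so X^T (y - mu) is ~0.
mu_hat = 1 / (1 + np.exp(-(X @ beta)))
assert np.max(np.abs(X.T @ (y - mu_hat))) < 1e-6
```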

Why use Newton-Raphson? It usually converges quickly: it has second-order (quadratic) convergence, i.e. for large enough $t$, $|\beta_j^{(t+1)} - \hat{\beta}_j| \leq c |\beta_j^{(t)} - \hat{\beta}_j|^2$ for all $j$ and some $c > 0$.

Fisher scoring method

In Newton-Raphson, we used the observed Hessian matrix $\bf H$ evaluated at the current guess $\beta^{(t)}$. In Fisher scoring, we use the information matrix instead, which you can think of as (the negative of) the expectation of the Hessian. If the elements of the Hessian are $h_{ab} = \partial^2 L(\beta) / \partial \beta_a \partial \beta_b$, the elements of the information matrix $\bf J$ are $j_{ab} = -\mathbb{E} [\partial^2 L(\beta) / \partial \beta_a \partial \beta_b]$.

The updates for $\beta$ under Fisher scoring are

\begin{aligned} \beta^{(t+1)} = \beta^{(t)} + ({\bf J}^{(t)})^{-1} u^{(t)}. \end{aligned}

If you go through the algebra for the elements of the Fisher information matrix, you will find that $\bf J$ has a very nice expression: it is simply ${\bf J} = {\bf X}^T \bf WX$. (We had previously defined $\bf W$: a diagonal matrix with entries $w_{ii} = \left(\dfrac{\partial \mu_i}{\partial \eta_i}\right)^2 / \text{Var}(Y_i)$.)

If we are using the canonical link function, the observed Hessian matrix and the information matrix are the same, and so Fisher scoring is exactly the same as the Newton-Raphson method.

Why use this over Newton-Raphson? The inverse of the information matrix is also the asymptotic covariance of $\hat{\beta}$, so Fisher scoring gives you this as a by-product. It is also closely related to weighted least squares (as we will see below).
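A sketch of that by-product for a logistic model (simulated data; all names illustrative): after Fisher scoring converges, approximate standard errors come from the square roots of the diagonal of $({\bf X}^T {\bf W X})^{-1}$.

```python
import numpy as np

# Illustrative sketch: Fisher scoring for logistic regression, then
# standard errors from the inverse information matrix J = X^T W X.
# (For the canonical logit link this is identical to Newton-Raphson.)
rng = np.random.default_rng(3)
n, p = 400, 2
X = rng.normal(size=(n, p))
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([0.8, -0.4])))))

beta = np.zeros(p)
for _ in range(25):
    mu = 1 / (1 + np.exp(-(X @ beta)))
    w = mu * (1 - mu)                   # W entries for the logit link
    J = (X.T * w) @ X                   # information matrix X^T W X
    beta = beta + np.linalg.solve(J, X.T @ (y - mu))

# Asymptotic covariance of beta_hat is J^{-1}; SEs are its diagonal roots.
se = np.sqrt(np.diag(np.linalg.inv(J)))
assert se.shape == (p,) and np.all(se > 0)
```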

Fisher scoring as iteratively reweighted least squares (IRLS)

Multiplying both sides of the Fisher scoring update by ${\bf J}^{(t)}$, we get

\begin{aligned} ({\bf J}^{(t)})\beta^{(t+1)} = u^{(t)} + ({\bf J}^{(t)})\beta^{(t)}. \end{aligned}

Using the expressions for $\bf J$ and $u$ that we derived before, the above is the same as

\begin{aligned} ({\bf X}^T {\bf W}^{(t)} {\bf X}) \beta^{(t+1)} &= {\bf X}^T {\bf W}^{(t)} ({\bf D}^{(t)})^{-1}(y - \mu^{(t)}) + ({\bf X}^T {\bf W}^{(t)} {\bf X}) \beta^{(t)} \\ &= {\bf X}^T {\bf W}^{(t)} [{\bf X}\beta^{(t)} + ({\bf D}^{(t)})^{-1}(y - \mu^{(t)})] \\ &= {\bf X}^T {\bf W}^{(t)} z^{(t)}, \end{aligned}

where $z^{(t)} = {\bf X}\beta^{(t)} + ({\bf D}^{(t)})^{-1}(y - \mu^{(t)})$.

Looking closely at the last equation, we recognize it as the set of normal equations for weighted least squares, where the response is $z^{(t)}$ and the weight matrix is ${\bf W}^{(t)}$.
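Here is a sketch of IRLS for a Poisson GLM with log link (my choice of example; simulated data), where each iteration is one weighted least squares solve:

```python
import numpy as np

# Illustrative sketch: IRLS for a Poisson GLM with log link.
rng = np.random.default_rng(4)
n, p = 300, 2
X = rng.normal(size=(n, p))
y = rng.poisson(np.exp(X @ np.array([0.5, -0.3])))

beta = np.zeros(p)
for _ in range(50):
    eta = X @ beta
    mu = np.exp(eta)                 # inverse of the log link
    # For the log link, dmu/deta = mu and Var(Y_i) = mu, so w_ii = mu_i
    # and D^{-1}(y - mu) = (y - mu)/mu.
    w = mu
    z = eta + (y - mu) / mu          # working response z^{(t)}
    # Weighted least squares: solve (X^T W X) beta = X^T W z by scaling
    # rows with sqrt(w) and calling an ordinary least squares routine.
    sw = np.sqrt(w)
    beta_new, *_ = np.linalg.lstsq(X * sw[:, None], z * sw, rcond=None)
    converged = np.max(np.abs(beta_new - beta)) < 1e-10
    beta = beta_new
    if converged:
        break

# At the fixed point the likelihood equations X^T (y - mu) = 0 hold
# (canonical link).
mu_hat = np.exp(X @ beta)
assert np.max(np.abs(X.T @ (y - mu_hat))) < 1e-6
```

Solving the weighted problem on $\sqrt{w}$-scaled data, rather than explicitly forming ${\bf X}^T {\bf W} {\bf X}$, is a common choice because the scaled design matrix is better conditioned.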

We can think of $z^{(t)}$ as a linearized form of the link function $g$ evaluated at $y$:

\begin{aligned} z_i^{(t)} &= \eta_i^{(t)} + (y_i - \mu_i^{(t)}) \frac{\partial \eta_i^{(t)}}{\partial \mu_i^{(t)}} \\ &= g(\mu_i^{(t)}) + (y_i - \mu_i^{(t)}) g'(\mu_i^{(t)}) \\ &\approx g(y_i). \end{aligned}

This connection gives us a way to implement the Fisher scoring method using routines for solving weighted least squares problems.

References:

1. Agresti, A. Categorical Data Analysis (3rd ed), Chapter 4.