Assume that you have data points for . We want to build a * generalized linear model (GLM)* of the response using the other features .

To that end, assume that the values are all fixed. Assume that are samples of independent random variables which have the probability density (or mass) function of the form

In the above, the form of is known, but not the values of the ‘s and .

Let . We assume that

where for some *link function* , assumed to be known.

The goal of fitting a GLM is to find estimates for , which in turn give us estimates for .

**Likelihood equations**

More specifically, we want to find the values of which maximize the (log)-likelihood of the data:

To do that, we differentiate w.r.t. and set it to be 0. Using the chain rule as well as the form of , after some algebra we have

Hence, the * likelihood equations* (or

*) are*

**score equations**appears implicitly in the equations above through : .

If we let be the diagonal matrix such that , and be the diagonal matrix such that , then we can rewrite the likelihood equations in matrix form:

**Newton-Raphson method**

Unfortunately there isn’t a closed form solution for (except in very special cases). The Newton-Raphson method is an iterative method that can be used instead. At each iteration, we make a quadratic approximation of the log likelihood and maximize that. More specifically, let be the gradien , be the Hessian matrix, and assume that is our guess for at time . (We can think of and as functions of .) Let and be the evaluation of and at . Taylor expansion of around gives

Taking a derivative w.r.t. and setting it to 0, we can rearrange terms to solve for , giving us the next guess

* Why use Newton-Raphson?* It usually converges quickly. It has second-order convergence, i.e. for large enough , for all and some .

**Fisher scoring method**

In Newton-Raphson, we used the observed Hessian matrix evaluated at the current guess . In * Fisher scoring*, we use the information matrix instead, which you can think of as (the negative of) the expectation of the Hessian. If the elements of the Hessian are , the elements of the information matrix are .

The updates for under Fisher scoring are

If you go through the algebra for the elements of the Fisher information matrix, you will find that has a very nice expression: it is simply . (We had previously defined : a diagonal matrix with entries .)

If we are using the canonical link function, the observed Hessian matrix and the information matrix are the same, and so Fisher scoring is exactly the same as the Newton-Raphson method.

* Why use this over Newton-Raphson?* The inverse of the information matrix is also the asymptotic covariance of , so Fisher scoring gives you this as a by-product. It is also closely related to weighted least squares (as we will see below).

**Fisher scoring as iterated reweighted least squares (IRLS)**

Multiplying both sides of the Fisher scoring update by , we get

Using expression for and that we had before, the above is the same as

where .

Looking closely at the last equation, we recognize those as the likelihood equations for * weighted least squares*, where the response is and the weight matrix is .

We can think of as a linearized form of the link function evaluated at :

This connection gives us a way to implement the Fisher scoring method using routines for solving weight least squares.

References:

- Agresti, A. Categorical Data Analysis (3rd ed), Chapter 4.