Assume that you have data points for . We want to build a generalized linear model (GLM) of the response using the other features .
To that end, assume that the values are all fixed. Assume that are samples of independent random variables which have the probability density (or mass) function of the form
In the above, the form of is known, but not the values of the ‘s and .
Let . We assume that
where for some link function , assumed to be known.
The goal of fitting a GLM is to find estimates for , which in turn give us estimates for .
More specifically, we want to find the values of which maximize the (log)-likelihood of the data:
To do that, we differentiate w.r.t. and set it to be 0. Using the chain rule as well as the form of , after some algebra we have
Hence, the likelihood equations (or score equations) are
appears implicitly in the equations above through : .
If we let be the diagonal matrix such that , and be the diagonal matrix such that , then we can rewrite the likelihood equations in matrix form:
Unfortunately there isn’t a closed form solution for (except in very special cases). The Newton-Raphson method is an iterative method that can be used instead. At each iteration, we make a quadratic approximation of the log likelihood and maximize that. More specifically, let be the gradien , be the Hessian matrix, and assume that is our guess for at time . (We can think of and as functions of .) Let and be the evaluation of and at . Taylor expansion of around gives
Taking a derivative w.r.t. and setting it to 0, we can rearrange terms to solve for , giving us the next guess
Why use Newton-Raphson? It usually converges quickly. It has second-order convergence, i.e. for large enough , for all and some .
Fisher scoring method
In Newton-Raphson, we used the observed Hessian matrix evaluated at the current guess . In Fisher scoring, we use the information matrix instead, which you can think of as (the negative of) the expectation of the Hessian. If the elements of the Hessian are , the elements of the information matrix are .
The updates for under Fisher scoring are
If you go through the algebra for the elements of the Fisher information matrix, you will find that has a very nice expression: it is simply . (We had previously defined : a diagonal matrix with entries .)
If we are using the canonical link function, the observed Hessian matrix and the information matrix are the same, and so Fisher scoring is exactly the same as the Newton-Raphson method.
Why use this over Newton-Raphson? The inverse of the information matrix is also the asymptotic covariance of , so Fisher scoring gives you this as a by-product. It is also closely related to weighted least squares (as we will see below).
Fisher scoring as iterated reweighted least squares (IRLS)
Multiplying both sides of the Fisher scoring update by , we get
Using expression for and that we had before, the above is the same as
Looking closely at the last equation, we recognize those as the likelihood equations for weighted least squares, where the response is and the weight matrix is .
We can think of as a linearized form of the link function evaluated at :
This connection gives us a way to implement the Fisher scoring method using routines for solving weight least squares.
- Agresti, A. Categorical Data Analysis (3rd ed), Chapter 4.