Assume that you have data points $(x_{i1}, \dots, x_{ip}, y_i)$ for $i = 1, \dots, n$. We want to build a generalized linear model (GLM) of the response $y$ using the $p$ other features $x_1, \dots, x_p$.

To that end, assume that the $x$ values are all fixed. Assume that $y_1, \dots, y_n$ are samples of independent random variables $Y_1, \dots, Y_n$ which have the probability density (or mass) function of the form

$$f(y_i; \theta_i, \phi) = \exp\left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right].$$

In the above, the form of $f$ is known (i.e. the functions $a$, $b$ and $c$ are known), but not the values of the $\theta_i$'s and $\phi$.
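For example, the Poisson distribution with mean $\mu_i$ is of this form:

$$f(y_i; \theta_i) = \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!} = \exp\left[ y_i \log \mu_i - \mu_i - \log(y_i!) \right],$$

with $\theta_i = \log \mu_i$, $b(\theta_i) = e^{\theta_i}$, $a(\phi) = 1$ and $c(y_i, \phi) = -\log(y_i!)$.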
Let $\mu_i = \mathbb{E}[Y_i]$. We assume that $g(\mu_i) = \eta_i$, where $\eta_i = \sum_{j=1}^p \beta_j x_{ij} = x_i^T \beta$ for some link function $g$, assumed to be known.

The goal of fitting a GLM is to find estimates for $\beta_1, \dots, \beta_p$, which in turn give us estimates for $\mu_1, \dots, \mu_n$.
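Continuing the Poisson example above, the usual (canonical) choice is the log link $g(\mu) = \log \mu$, so that $\log \mu_i = \sum_{j=1}^p \beta_j x_{ij}$; fitting this model is what is commonly called Poisson regression.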
Likelihood equations
More specifically, we want to find the values of $\beta_1, \dots, \beta_p$ which maximize the (log-)likelihood of the data:

$$\ell(\beta) = \sum_{i=1}^n \ell_i = \sum_{i=1}^n \log f(y_i; \theta_i, \phi).$$

To do that, we differentiate $\ell$ w.r.t. each $\beta_j$ and set it to be 0. Using the chain rule as well as the form of $f$, after some algebra we have

$$\frac{\partial \ell_i}{\partial \beta_j} = \frac{\partial \ell_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i} \frac{\partial \eta_i}{\partial \beta_j} = \frac{(y_i - \mu_i) x_{ij}}{\mathrm{Var}(Y_i)} \cdot \frac{\partial \mu_i}{\partial \eta_i}.$$

Hence, the likelihood equations (or score equations) are

$$\sum_{i=1}^n \frac{(y_i - \mu_i) x_{ij}}{\mathrm{Var}(Y_i)} \cdot \frac{\partial \mu_i}{\partial \eta_i} = 0, \qquad j = 1, \dots, p.$$

$\beta$ appears implicitly in the equations above through $\mu_i$: $\mu_i = g^{-1}\left( \sum_{j=1}^p \beta_j x_{ij} \right)$.
If we let $W$ be the diagonal matrix such that $W_{ii} = \dfrac{(\partial \mu_i / \partial \eta_i)^2}{\mathrm{Var}(Y_i)}$, and $D$ be the diagonal matrix such that $D_{ii} = \dfrac{\partial \mu_i}{\partial \eta_i}$, then we can rewrite the likelihood equations in matrix form:

$$X^T W D^{-1} (y - \mu) = 0.$$

(Here $X$ is the $n \times p$ matrix with entries $x_{ij}$, and $y$ and $\mu$ are the vectors with entries $y_i$ and $\mu_i$.)
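To make these quantities concrete, here is a minimal numpy sketch (my own illustration, with hypothetical names, not taken from the reference below) that evaluates $\mu$, $W$, $D$ and the left-hand side $X^T W D^{-1}(y - \mu)$ for Poisson regression with the log link, where $\partial \mu_i / \partial \eta_i = \mu_i$ and $\mathrm{Var}(Y_i) = \mu_i$:

```python
import numpy as np

def poisson_score(beta, X, y):
    """Score vector X^T W D^{-1} (y - mu) for Poisson regression with the log link."""
    eta = X @ beta         # linear predictor eta_i = x_i^T beta
    mu = np.exp(eta)       # mu_i = g^{-1}(eta_i) for the log link
    W = np.diag(mu)        # W_ii = (d mu_i/d eta_i)^2 / Var(Y_i) = mu_i^2 / mu_i = mu_i
    D = np.diag(mu)        # D_ii = d mu_i / d eta_i = mu_i
    return X.T @ W @ np.linalg.inv(D) @ (y - mu)
```

For this particular link $W D^{-1}$ is the identity, so the score collapses to $X^T(y - \mu)$; writing $W$ and $D$ out explicitly just mirrors the general matrix form above.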
Newton-Raphson method
Unfortunately there isn't a closed form solution for $\hat{\beta}$ (except in very special cases). The Newton-Raphson method is an iterative method that can be used instead. At each iteration, we make a quadratic approximation of the log-likelihood and maximize that. More specifically, let

$$u = \left( \frac{\partial \ell}{\partial \beta_1}, \dots, \frac{\partial \ell}{\partial \beta_p} \right)^T$$

be the gradient, $H$ be the Hessian matrix, and assume that $\beta^{(t)}$ is our guess for $\hat{\beta}$ at time $t$. (We can think of $u$ and $H$ as functions of $\beta$.) Let $u^{(t)}$ and $H^{(t)}$ be the evaluation of $u$ and $H$ at $\beta^{(t)}$. Taylor expansion of $\ell$ around $\beta^{(t)}$ gives

$$\ell(\beta) \approx \ell(\beta^{(t)}) + (u^{(t)})^T (\beta - \beta^{(t)}) + \frac{1}{2} (\beta - \beta^{(t)})^T H^{(t)} (\beta - \beta^{(t)}).$$

Taking a derivative w.r.t. $\beta$ and setting it to 0, we can rearrange terms to solve for $\beta$, giving us the next guess

$$\beta^{(t+1)} = \beta^{(t)} - (H^{(t)})^{-1} u^{(t)}.$$

Why use Newton-Raphson? It usually converges quickly. It has second-order convergence, i.e. for all large enough $t$,

$$\| \beta^{(t+1)} - \hat{\beta} \| \leq c \, \| \beta^{(t)} - \hat{\beta} \|^2$$

for some constant $c > 0$.
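As an illustration, here is a minimal sketch of the iteration (assuming a Bernoulli response with the logit link, i.e. logistic regression, for which the gradient is $X^T(y - \mu)$ and the Hessian is $-X^T \mathrm{diag}(\mu_i(1 - \mu_i)) X$; the function name is hypothetical):

```python
import numpy as np

def newton_raphson_logistic(X, y, n_iter=25, tol=1e-10):
    """Newton-Raphson for logistic regression (Bernoulli response, logit link)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))          # mu_i = P(Y_i = 1 | x_i)
        u = X.T @ (y - mu)                            # gradient u^{(t)}
        H = -X.T @ (X * (mu * (1.0 - mu))[:, None])   # Hessian H^{(t)}
        step = np.linalg.solve(H, u)                  # (H^{(t)})^{-1} u^{(t)}
        beta = beta - step                            # beta^{(t+1)} = beta^{(t)} - (H^{(t)})^{-1} u^{(t)}
        if np.max(np.abs(step)) < tol:
            break
    return beta
```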
Fisher scoring method
In Newton-Raphson, we used the observed Hessian matrix evaluated at the current guess $\beta^{(t)}$. In Fisher scoring, we use the information matrix instead, which you can think of as (the negative of) the expectation of the Hessian. If the elements of the Hessian are

$$H_{jk} = \frac{\partial^2 \ell(\beta)}{\partial \beta_j \partial \beta_k},$$

the elements of the information matrix $\mathcal{I}$ are

$$\mathcal{I}_{jk} = -\mathbb{E}\left[ \frac{\partial^2 \ell(\beta)}{\partial \beta_j \partial \beta_k} \right].$$

The updates for $\beta$ under Fisher scoring are

$$\beta^{(t+1)} = \beta^{(t)} + (\mathcal{I}^{(t)})^{-1} u^{(t)}.$$

If you go through the algebra for the elements of the Fisher information matrix, you will find that $\mathcal{I}$ has a very nice expression: it is simply $\mathcal{I} = X^T W X$. (We had previously defined $W$: a diagonal matrix with entries $W_{ii} = (\partial \mu_i / \partial \eta_i)^2 / \mathrm{Var}(Y_i)$.)
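Sketching that algebra: using the standard identity $\mathcal{I}_{jk} = \mathbb{E}\left[ \frac{\partial \ell}{\partial \beta_j} \frac{\partial \ell}{\partial \beta_k} \right]$ together with the score expression above and the independence of the $Y_i$,

$$\mathcal{I}_{jk} = \sum_{i=1}^n \mathbb{E}\left[ \frac{(Y_i - \mu_i)^2 x_{ij} x_{ik}}{\mathrm{Var}(Y_i)^2} \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 \right] = \sum_{i=1}^n \frac{x_{ij} x_{ik}}{\mathrm{Var}(Y_i)} \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 = (X^T W X)_{jk}.$$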
If we are using the canonical link function, the observed Hessian matrix and the information matrix are the same, and so Fisher scoring is exactly the same as the Newton-Raphson method.
Why use this over Newton-Raphson? The inverse of the information matrix is also the asymptotic covariance matrix of $\hat{\beta}$, so Fisher scoring gives you this as a by-product. It is also closely related to weighted least squares (as we will see below).
Fisher scoring as iteratively reweighted least squares (IRLS)
Multiplying both sides of the Fisher scoring update by $\mathcal{I}^{(t)}$, we get

$$\mathcal{I}^{(t)} \beta^{(t+1)} = \mathcal{I}^{(t)} \beta^{(t)} + u^{(t)}.$$

Using the expressions for $\mathcal{I} = X^T W X$ and $u = X^T W D^{-1}(y - \mu)$ that we had before, the above is the same as

$$X^T W^{(t)} X \beta^{(t+1)} = X^T W^{(t)} X \beta^{(t)} + X^T W^{(t)} (D^{(t)})^{-1} (y - \mu^{(t)}) = X^T W^{(t)} z^{(t)},$$

where $z^{(t)} = X \beta^{(t)} + (D^{(t)})^{-1} (y - \mu^{(t)})$.

Looking closely at the last equation, we recognize those as the likelihood equations (i.e. the normal equations) for weighted least squares, where the response is $z^{(t)}$ and the weight matrix is $W^{(t)}$.

We can think of $z^{(t)}$ as a linearized form of the link function $g$ evaluated at $y$:

$$z_i^{(t)} = \eta_i^{(t)} + (y_i - \mu_i^{(t)}) \frac{\partial \eta_i}{\partial \mu_i}\bigg|_{\mu_i = \mu_i^{(t)}} = g(\mu_i^{(t)}) + (y_i - \mu_i^{(t)}) \, g'(\mu_i^{(t)}) \approx g(y_i).$$

This connection gives us a way to implement the Fisher scoring method using routines for solving weighted least squares.
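Here is a minimal sketch of that implementation (again assuming Poisson regression with the log link, where $W_{ii} = D_{ii} = \mu_i$; the function name is hypothetical). A weighted least squares problem can be handed to an ordinary least squares routine by rescaling the rows by $\sqrt{W_{ii}}$:

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-10):
    """Fisher scoring via IRLS for Poisson regression with the log link."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        w = mu                          # W_ii = (d mu_i/d eta_i)^2 / Var(Y_i) = mu_i
        z = eta + (y - mu) / mu         # working response z = eta + D^{-1} (y - mu)
        sw = np.sqrt(w)                 # weighted least squares via row rescaling
        beta_new, *_ = np.linalg.lstsq(X * sw[:, None], z * sw, rcond=None)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta
```

At convergence, $(X^T W X)^{-1}$ with $W$ evaluated at the final estimate is the estimated asymptotic covariance matrix mentioned above.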
References:
- Agresti, A. Categorical Data Analysis (3rd ed), Chapter 4.