Fitting a generalized linear model (GLM)

Assume that you have n data points (x_{i1}, \dots, x_{ip}, y_i) \in \mathbb{R}^{p+1} for i = 1, \dots, n. We want to build a generalized linear model (GLM) of the response y using the p other features X_1, \dots, X_p.

To that end, assume that the x values are all fixed, and that y_1, \dots, y_n are realizations of independent random variables Y_1, \dots, Y_n whose probability density (or mass) functions have the form

\begin{aligned} f(y_i ; \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right]. \end{aligned}

In the above, the form of f is known, but not the values of the \theta_i's and \phi.
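As a quick worked example (a standard one, added here for concreteness): if Y_i \sim \text{Poisson}(\mu_i), then

\begin{aligned} f(y_i; \mu_i) = \frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!} = \exp \left[ y_i \log \mu_i - \mu_i - \log (y_i!) \right], \end{aligned}

which is of the form above with \theta_i = \log \mu_i, b(\theta_i) = e^{\theta_i}, a(\phi) = 1 and c(y_i, \phi) = -\log(y_i!).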

Let \mu_i = \mathbb{E} [Y_i]. We assume that

\begin{aligned} \eta_i = \sum_{j=1}^p \beta_j x_{ij}, \quad i = 1, \dots, n, \end{aligned}

where \eta_i = g(\mu_i) for some link function g, assumed to be known.

The goal of fitting a GLM is to find estimates for \beta_1, \dots, \beta_p, which in turn give us estimates for \eta_i = g (\mu_i) = g(\mathbb{E}[Y_i]).

Likelihood equations

More specifically, we want to find the values of \beta_1, \dots, \beta_p which maximize the (log)-likelihood of the data:

\begin{aligned} L = \sum_{i=1}^n L_i = \sum_{i=1}^n \log f(y_i ; \theta_i, \phi). \end{aligned}

To do that, we differentiate L with respect to \beta and set the derivative to 0. Using the chain rule as well as the form of f, after some algebra we have

\begin{aligned} 0 = \frac{\partial L_i}{\partial \beta_j} = \frac{\partial L_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta_j} = \dots = \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i}.  \end{aligned}
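(To fill in the omitted algebra: differentiating L_i = [y_i \theta_i - b(\theta_i)]/a(\phi) + c(y_i, \phi) and using the standard exponential-family identities \mu_i = b'(\theta_i) and \text{Var}(Y_i) = b''(\theta_i) a(\phi), we get

\begin{aligned} \frac{\partial L_i}{\partial \theta_i} = \frac{y_i - b'(\theta_i)}{a(\phi)} = \frac{y_i - \mu_i}{a(\phi)}, \qquad \frac{\partial \theta_i}{\partial \mu_i} = \frac{1}{b''(\theta_i)} = \frac{a(\phi)}{\text{Var}(Y_i)}, \qquad \frac{\partial \eta_i}{\partial \beta_j} = x_{ij}, \end{aligned}

and multiplying these three factors together with \partial \mu_i / \partial \eta_i gives the expression on the right.)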

Hence, the likelihood equations (or score equations) are

\begin{aligned} \sum_{i=1}^n \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i} = 0, \quad j = 1, \dots, p. \end{aligned}

\beta appears implicitly in the equations above through \mu_i: \mu_i = g^{-1}(\sum_{j=1}^p \beta_j x_{ij}).

If we let {\bf W} \in \mathbb{R}^{n \times n} be the diagonal matrix such that w_{ii} = \left(\dfrac{\partial \mu_i}{\partial \eta_i}\right)^2 / \text{Var}(Y_i), and {\bf D} \in \mathbb{R}^{n \times n} be the diagonal matrix such that d_{ii} = \dfrac{\partial \mu_i}{\partial \eta_i}, then we can rewrite the likelihood equations in matrix form:

\begin{aligned} {\bf X}^T {\bf WD}^{-1} (y - \mu) = 0. \end{aligned}

Newton-Raphson method

Unfortunately there isn't a closed-form solution for \beta (except in very special cases). The Newton-Raphson method is an iterative method that can be used instead. At each iteration, we make a quadratic approximation of the log-likelihood and maximize that. More specifically, let u \in \mathbb{R}^p be the gradient \partial L / \partial \beta, let {\bf H} \in \mathbb{R}^{p \times p} be the Hessian matrix, and assume that \beta^{(t)} is our guess for \hat{\beta} at time t. (We can think of u and \bf H as functions of \beta.) Let u^{(t)} and {\bf H}^{(t)} be u and \bf H evaluated at \beta^{(t)}. A Taylor expansion of L around \beta^{(t)} gives

\begin{aligned} L(\beta) \approx L(\beta^{(t)}) + (u^{(t)})^T (\beta - \beta^{(t)}) + \frac{1}{2} (\beta - \beta^{(t)})^T {\bf H}^{(t)} (\beta - \beta^{(t)}). \end{aligned}

Taking a derivative w.r.t. \beta and setting it to 0, we can rearrange terms to solve for \beta, giving us the next guess

\begin{aligned} \beta^{(t+1)} = \beta^{(t)} - ({\bf H}^{(t)})^{-1} u^{(t)}. \end{aligned}

Why use Newton-Raphson? It usually converges quickly. It has second-order convergence, i.e. for large enough t, |\beta_j^{(t+1)} - \hat{\beta}_j| \leq c |\beta_j^{(t)} - \hat{\beta}_j|^2 for all j and some c > 0.
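In code, the update above is a short loop. Below is a minimal sketch in Python/NumPy (the function and argument names are my own, not from any particular library); grad and hess are assumed to compute u and \bf H for a given \beta:

    import numpy as np

    def newton_raphson(grad, hess, beta0, tol=1e-8, max_iter=100):
        # Generic Newton-Raphson ascent for maximizing a log-likelihood.
        # grad(beta) returns the score u = dL/dbeta; hess(beta) returns the Hessian H.
        beta = np.asarray(beta0, dtype=float)
        for _ in range(max_iter):
            u = grad(beta)
            H = hess(beta)
            step = np.linalg.solve(H, u)   # solve H * step = u rather than inverting H
            beta = beta - step             # beta^{(t+1)} = beta^{(t)} - H^{-1} u
            if np.max(np.abs(step)) < tol:
                break
        return beta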

Fisher scoring method

In Newton-Raphson, we used the observed Hessian matrix \bf H evaluated at the current guess \beta^{(t)}. In Fisher scoring, we use the information matrix instead, which you can think of as (the negative of) the expectation of the Hessian. If the elements of the Hessian are h_{ab} = \partial^2 L(\beta) / \partial \beta_a \partial \beta_b, the elements of the information matrix \bf J are j_{ab} = -\mathbb{E} [\partial^2 L(\beta) / \partial \beta_a \partial \beta_b].

The updates for \beta under Fisher scoring are

\begin{aligned} \beta^{(t+1)} = \beta^{(t)} + ({\bf J}^{(t)})^{-1} u^{(t)}. \end{aligned}

If you go through the algebra for the elements of the Fisher information matrix, you will find that \bf J has a very nice expression: it is simply {\bf J} = {\bf X}^T \bf WX. (We had previously defined \bf W: a diagonal matrix with entries w_{ii} = \left(\dfrac{\partial \mu_i}{\partial \eta_i}\right)^2 / \text{Var}(Y_i).)

If we are using the canonical link function, the observed Hessian matrix and the information matrix are the same, and so Fisher scoring is exactly the same as the Newton-Raphson method.

Why use this over Newton-Raphson? The inverse of the information matrix is also the asymptotic covariance of \hat{\beta}, so Fisher scoring gives you this as a by-product. It is also closely related to weighted least squares (as we will see below).
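To make this concrete, here is a sketch of Fisher scoring (Python/NumPy, not library code) for a Poisson GLM with the log link, for which \mu_i = e^{\eta_i}, \text{Var}(Y_i) = \mu_i and \partial \mu_i / \partial \eta_i = \mu_i, so that u = {\bf X}^T (y - \mu) and {\bf J} = {\bf X}^T \text{diag}(\mu) {\bf X}:

    import numpy as np

    def fisher_scoring_poisson(X, y, n_iter=25):
        # Fisher scoring for a Poisson GLM with log link.
        # Here W = diag(mu), the score is u = X^T (y - mu), and J = X^T W X.
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            mu = np.exp(X @ beta)
            u = X.T @ (y - mu)             # score vector
            J = X.T @ (mu[:, None] * X)    # information matrix X^T W X
            beta = beta + np.linalg.solve(J, u)
        return beta

(Since the log link is the canonical link for the Poisson, these iterates are also exactly the Newton-Raphson iterates.)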

Fisher scoring as iteratively reweighted least squares (IRLS)

Multiplying both sides of the Fisher scoring update by {\bf J}^{(t)}, we get

\begin{aligned} ({\bf J}^{(t)})\beta^{(t+1)} = u^{(t)} + ({\bf J}^{(t)})\beta^{(t)}. \end{aligned}

Using the expressions for \bf J and u that we had before, the above is the same as

\begin{aligned} ({\bf X}^T {\bf W}^{(t)} {\bf X}) \beta^{(t+1)} &= {\bf X}^T {\bf W}^{(t)} ({\bf D}^{(t)})^{-1}(y - \mu^{(t)}) + ({\bf X}^T {\bf W}^{(t)} {\bf X}) \beta^{(t)} \\ &= {\bf X}^T {\bf W}^{(t)} [{\bf X}\beta^{(t)} +  ({\bf D}^{(t)})^{-1}(y - \mu^{(t)})] \\ &= {\bf X}^T {\bf W}^{(t)} z^{(t)}, \end{aligned}

where z^{(t)} = {\bf X}\beta^{(t)} +  ({\bf D}^{(t)})^{-1}(y - \mu^{(t)}).

Looking closely at the last line, we recognize it as the likelihood equations for weighted least squares, where the response is z^{(t)} and the weight matrix is {\bf W}^{(t)}.

We can think of z^{(t)} as a linearized form of the link function g evaluated at y:

\begin{aligned} z_i^{(t)} &= \eta_i^{(t)} + (y_i - \mu_i^{(t)}) \frac{\partial \eta_i^{(t)}}{\partial \mu_i^{(t)}} \\ &= g(\mu_i^{(t)}) + (y_i - \mu_i^{(t)}) g'(\mu_i^{(t)}) \\ &\approx g(y_i). \end{aligned}

This connection gives us a way to implement the Fisher scoring method using routines for solving weighted least squares problems.
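Continuing the Poisson/log-link sketch from the previous section (again illustrative only; in practice one would reach for a packaged routine such as R's glm or statsmodels' GLM), each Fisher scoring step becomes a weighted least squares fit of the working response z^{(t)} on \bf X:

    import numpy as np

    def irls_poisson(X, y, n_iter=25):
        # IRLS for a Poisson GLM with log link: each iteration is a weighted
        # least squares regression of the working response z on X.
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            eta = X @ beta
            mu = np.exp(eta)
            w = mu                       # W = diag((dmu/deta)^2 / Var(Y)) = diag(mu)
            z = eta + (y - mu) / mu      # working response z = eta + D^{-1}(y - mu)
            sw = np.sqrt(w)
            # weighted least squares: OLS of sqrt(w) * z on sqrt(w) * X
            beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * z, rcond=None)
        return beta

These iterates coincide with those produced by the Fisher scoring sketch above.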

References:

  1. Agresti, A. Categorical Data Analysis (3rd ed), Chapter 4.

Generalized least squares (GLS) and weighted least squares (WLS)

Assume that we are in the standard regression setting with n observations: responses y = (y_1, \dots, y_n) \in \mathbb{R}^n and a feature matrix X \in \mathbb{R}^{n \times p}, where x_{ij} denotes the value of the jth feature for the ith observation. Assume that X is fixed. In ordinary least squares (OLS), we assume that the true model is

\begin{aligned} y = X\beta + \varepsilon, \end{aligned}

where \mathbb{E}[\varepsilon] = 0 and \text{Cov}(\varepsilon) = \sigma^2 I for some constant \sigma > 0. The OLS estimate of \beta is

\begin{aligned} \hat{\beta}_{OLS} = (X^T X)^{-1} X^T y. \end{aligned}

Under the assumptions above, the Gauss-Markov theorem says that \hat{\beta}_{OLS} is the best linear unbiased estimator (BLUE) for \beta.

Generalized least squares (GLS)

In generalized least squares (GLS), instead of assuming that \text{Cov}(\varepsilon) = \sigma^2 I, we assume instead that \text{Cov}(\varepsilon) = V for some known, non-singular covariance matrix V \in \mathbb{R}^{n \times n}. We have

\begin{aligned} y &= X\beta + \varepsilon, &(1)\\  V^{-1/2}y &= V^{-1/2} X \beta + \varepsilon', &(2) \end{aligned}

where \text{Cov}(\varepsilon') = \text{Cov}(V^{-1/2}\varepsilon) = I. Applying the OLS formula to this last equation (2), we get the generalized least squares estimate (GLS) for \beta:

\begin{aligned} \hat{\beta}_{GLS} &= [(V^{-1/2} X)^T(V^{-1/2} X)]^{-1}(V^{-1/2} X)^T(V^{-1/2}y) \\  &= (X^TV^{-1}X)^{-1}X^TV^{-1}y.  \end{aligned}

Why not just apply the OLS formula to (1) directly? Because the assumptions of the Gauss-Markov theorem hold for (2) but not (in general) for (1), so it is the estimator obtained from (2), \hat{\beta}_{GLS}, that is the best linear unbiased estimator (BLUE) for \beta in this setup.

Note that if we let W = V^{-1} denote the inverse covariance matrix, then the GLS solution has a slightly nicer form:

\begin{aligned} \hat{\beta}_{GLS} = (X^TWX)^{-1}X^TWy.  \end{aligned}
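As a sketch of the computation (Python/NumPy, assuming V is known and positive definite), we can whiten the data with a Cholesky factor of V, which plays the role of V^{-1/2} in (2), and then run OLS:

    import numpy as np

    def gls(X, y, V):
        # GLS estimate (X^T V^{-1} X)^{-1} X^T V^{-1} y, computed by whitening.
        # If V = C C^T (Cholesky), then Cov(C^{-1} eps) = C^{-1} V C^{-T} = I,
        # so C^{-1} plays the role of V^{-1/2} in equation (2).
        C = np.linalg.cholesky(V)
        Xw = np.linalg.solve(C, X)   # "V^{-1/2} X"
        yw = np.linalg.solve(C, y)   # "V^{-1/2} y"
        beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
        return beta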

Weighted least squares (WLS) Part 1

One can think of weighted least squares (WLS) as a special case of GLS. In WLS, we assume that the covariance matrix V of \varepsilon is diagonal. That is, each observation's error term may have its own variance, but the errors are pairwise uncorrelated. If \text{Var}(\varepsilon_i) = \sigma_i^2 for i = 1, \dots, n and W is the diagonal matrix with w_{ii} = \frac{1}{\sigma_i^2}, then

\begin{aligned} \hat{\beta}_{WLS} = (X^TWX)^{-1}X^TWy.  \end{aligned}

Weighted least squares (WLS) Part 2

Another sense in which the least squares problem can be “weighted” is where the observations are assigned different weights. Instead of trying to solve

\begin{aligned} \text{minimize} \: \sum_{i=1}^n \left( y_i - \sum_{j=1}^p x_{ij}\beta_j \right)^2, \end{aligned}

we want to solve

\begin{aligned} \text{minimize} \: \sum_{i=1}^n w_i \left( y_i - \sum_{j=1}^p x_{ij}\beta_j \right)^2, \end{aligned}

where the w_i's are some known “weights” that we place on the observations. (We might do so, for example, if we know that some observations are more trustworthy than others.) Note that

\begin{aligned} \sum_{i=1}^n w_i \left( y_i - \sum_{j=1}^p x_{ij}\beta_j \right)^2 = \sum_{i=1}^n \left( \sqrt{w_i} y_i - \sum_{j=1}^p \sqrt{w_i}x_{ij}\beta_j \right)^2. \end{aligned}

Thus, if W^{1/2} is a diagonal matrix with the ith diagonal entry being \sqrt{w_i}, solving the above is equivalent to performing OLS of W^{1/2} y on W^{1/2} X:

\begin{aligned} \hat{\beta}_{WLS} &= [(W^{1/2} X)^T (W^{1/2} X)]^{-1} (W^{1/2} X) ^T(W^{1/2} y) \\  &= (X^T W X)^{-1} X^TWy. \end{aligned}

This is the same expression that we had in WLS Part 1, except that the weights in W mean something different.
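As a quick numerical check (a sketch with made-up data), OLS on the rescaled data reproduces the closed-form expression:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 3
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)
    w = rng.uniform(0.5, 2.0, size=n)   # observation weights

    # OLS of sqrt(w) * y on sqrt(w) * X
    sw = np.sqrt(w)
    beta_scaled_ols, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)

    # closed form (X^T W X)^{-1} X^T W y
    W = np.diag(w)
    beta_closed_form = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

    assert np.allclose(beta_scaled_ols, beta_closed_form)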