# Fitting a generalized linear model (GLM)

Assume that you have $n$ data points $(x_{i1}, \dots, x_{ip}, y_i) \in \mathbb{R}^{p+1}$ for $i = 1, \dots, n$. We want to build a generalized linear model (GLM) of the response $y$ using the $p$ other features $X_1, \dots, X_p$.

To that end, assume that the $x$ values are all fixed. Assume that $y_1, \dots, y_n$ are samples of independent random variables $Y_1, \dots, Y_n$ whose probability density (or mass) functions have the form

\begin{aligned} f(y_i ; \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right]. \end{aligned}

In the above, the form of $f$ is known, but not the values of the $\theta_i$'s and $\phi$.

Let $\mu_i = \mathbb{E} [Y_i]$. We assume that

\begin{aligned} \eta_i = \sum_{j=1}^p \beta_j x_{ij}, \quad i = 1, \dots, n, \end{aligned}

where $\eta_i = g(\mu_i)$ for some link function $g$, assumed to be known.

The goal of fitting a GLM is to find estimates for $\beta_1, \dots, \beta_p$, which in turn give us estimates for $\eta_i = g (\mu_i) = g(\mathbb{E}[Y_i])$.

## Likelihood equations

More specifically, we want to find the values of $\beta_1, \dots, \beta_p$ which maximize the (log)-likelihood of the data:

\begin{aligned} L = \sum_{i=1}^n L_i = \sum_{i=1}^n \log f(y_i ; \theta_i, \phi). \end{aligned}

To do that, we differentiate $L$ w.r.t. $\beta$ and set the derivative equal to 0. Using the chain rule as well as the form of $f$, after some algebra each term of the derivative is

\begin{aligned} \frac{\partial L_i}{\partial \beta_j} = \frac{\partial L_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta_j} = \dots = \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i}. \end{aligned}

Hence, the likelihood equations (or score equations) are

\begin{aligned} \sum_{i=1}^n \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i} = 0, \quad j = 1, \dots, p. \end{aligned}

$\beta$ appears implicitly in the equations above through $\mu_i$: $\mu_i = g^{-1}(\sum_{j=1}^p \beta_j x_{ij})$.

If we let ${\bf W} \in \mathbb{R}^{n \times n}$ be the diagonal matrix such that $w_{ii} = \left(\dfrac{\partial \mu_i}{\partial \eta_i}\right)^2 / \text{Var}(Y_i)$, and ${\bf D} \in \mathbb{R}^{n \times n}$ be the diagonal matrix such that $d_{ii} = \dfrac{\partial \mu_i}{\partial \eta_i}$, then we can rewrite the likelihood equations in matrix form:

\begin{aligned} {\bf X}^T {\bf WD}^{-1} (y - \mu) = 0. \end{aligned}
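
As a concrete illustration, here is a minimal numpy sketch (not from the reference; it assumes a Poisson GLM with log link, so that $\mu_i = \exp(\eta_i)$, $\text{Var}(Y_i) = \mu_i$ and $\partial \mu_i / \partial \eta_i = \mu_i$) of evaluating the left-hand side ${\bf X}^T {\bf WD}^{-1} (y - \mu)$ at a given $\beta$:

```python
import numpy as np

def poisson_score(beta, X, y):
    """Evaluate the score X^T W D^{-1} (y - mu) for a Poisson GLM with log link."""
    eta = X @ beta                   # linear predictor eta_i = sum_j beta_j x_ij
    mu = np.exp(eta)                 # inverse link: mu_i = exp(eta_i)
    var = mu                         # Poisson: Var(Y_i) = mu_i
    dmu_deta = mu                    # d mu_i / d eta_i = exp(eta_i) = mu_i
    w = dmu_deta**2 / var            # diagonal entries of W
    d = dmu_deta                     # diagonal entries of D
    return X.T @ ((w / d) * (y - mu))  # X^T W D^{-1} (y - mu)
```

At the maximum likelihood estimate $\hat{\beta}$ this vector is (numerically) zero. Since the log link is the canonical link for the Poisson family, ${\bf WD}^{-1} = {\bf I}$ here and the score reduces to ${\bf X}^T (y - \mu)$.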

## Newton-Raphson method

Unfortunately there isn’t a closed-form solution for $\beta$ (except in very special cases). The Newton-Raphson method is an iterative method that can be used instead. At each iteration, we make a quadratic approximation of the log-likelihood and maximize that. More specifically, let $u \in \mathbb{R}^p$ be the gradient $\partial L / \partial \beta$, let ${\bf H} \in \mathbb{R}^{p \times p}$ be the Hessian matrix, and assume that $\beta^{(t)}$ is our guess for $\hat{\beta}$ at time $t$. (We can think of $u$ and $\bf H$ as functions of $\beta$.) Let $u^{(t)}$ and ${\bf H}^{(t)}$ be the evaluations of $u$ and $\bf H$ at $\beta^{(t)}$. A Taylor expansion of $L$ around $\beta^{(t)}$ gives

\begin{aligned} L(\beta) \approx L(\beta^{(t)}) + (u^{(t)})^T (\beta - \beta^{(t)}) + \frac{1}{2} (\beta - \beta^{(t)})^T {\bf H}^{(t)} (\beta - \beta^{(t)}). \end{aligned}

Taking a derivative w.r.t. $\beta$ and setting it to 0, we can rearrange terms to solve for $\beta$, giving us the next guess

\begin{aligned} \beta^{(t+1)} = \beta^{(t)} - ({\bf H}^{(t)})^{-1} u^{(t)}. \end{aligned}
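
Here is a minimal sketch of the generic update, assuming user-supplied functions `grad` and `hess` (hypothetical names) that return $u$ and $\bf H$ at a given $\beta$:

```python
import numpy as np

def newton_raphson(grad, hess, beta0, tol=1e-8, max_iter=100):
    """Generic Newton-Raphson: beta <- beta - H^{-1} u at each iteration."""
    beta = beta0.copy()
    for _ in range(max_iter):
        u = grad(beta)                   # gradient of the log-likelihood at beta
        H = hess(beta)                   # Hessian of the log-likelihood at beta
        step = np.linalg.solve(H, u)     # H^{-1} u, without forming the inverse
        beta = beta - step
        if np.linalg.norm(step) < tol:   # stop when the update is negligible
            break
    return beta
```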

Why use Newton-Raphson? It usually converges quickly. It has second-order convergence, i.e. for large enough $t$, $|\beta_j^{(t+1)} - \hat{\beta}_j| \leq c |\beta_j^{(t)} - \hat{\beta}_j|^2$ for all $j$ and some $c > 0$.

## Fisher scoring method

In Newton-Raphson, we used the observed Hessian matrix $\bf H$ evaluated at the current guess $\beta^{(t)}$. In Fisher scoring, we use the information matrix instead, which you can think of as (the negative of) the expectation of the Hessian. If the elements of the Hessian are $h_{ab} = \partial^2 L(\beta) / \partial \beta_a \partial \beta_b$, the elements of the information matrix $\bf J$ are $j_{ab} = -\mathbb{E} [\partial^2 L(\beta) / \partial \beta_a \partial \beta_b]$.

The updates for $\beta$ under Fisher scoring are

\begin{aligned} \beta^{(t+1)} = \beta^{(t)} + ({\bf J}^{(t)})^{-1} u^{(t)}. \end{aligned}

If you go through the algebra for the elements of the Fisher information matrix, you will find that $\bf J$ has a very nice expression: it is simply ${\bf J} = {\bf X}^T \bf WX$. (We had previously defined $\bf W$: a diagonal matrix with entries $w_{ii} = \left(\dfrac{\partial \mu_i}{\partial \eta_i}\right)^2 / \text{Var}(Y_i)$.)
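
One way to see this is via the information identity $-\mathbb{E}[\partial^2 L / \partial \beta_a \partial \beta_b] = \mathbb{E}[(\partial L / \partial \beta_a)(\partial L / \partial \beta_b)]$ together with the score computed earlier; since the $Y_i$'s are independent, the cross terms vanish and

\begin{aligned} j_{ab} = \mathbb{E} \left[ \frac{\partial L}{\partial \beta_a} \frac{\partial L}{\partial \beta_b} \right] = \sum_{i=1}^n \frac{\mathbb{E}[(Y_i - \mu_i)^2] \, x_{ia} x_{ib}}{\text{Var}(Y_i)^2} \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 = \sum_{i=1}^n x_{ia} w_{ii} x_{ib} = ({\bf X}^T {\bf WX})_{ab}. \end{aligned}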

If we are using the canonical link function, the observed Hessian matrix and the information matrix are the same, and so Fisher scoring is exactly the same as the Newton-Raphson method.

Why use this over Newton-Raphson? The inverse of the information matrix is also the asymptotic covariance of $\hat{\beta}$, so Fisher scoring gives you this as a by-product. It is also closely related to weighted least squares (as we will see below).

## Fisher scoring as iterated reweighted least squares (IRLS)

Multiplying both sides of the Fisher scoring update by ${\bf J}^{(t)}$, we get

\begin{aligned} ({\bf J}^{(t)})\beta^{(t+1)} = u^{(t)} + ({\bf J}^{(t)})\beta^{(t)}. \end{aligned}

Using the expressions for $\bf J$ and $u$ that we had before, the above is the same as

\begin{aligned} ({\bf X}^T {\bf W}^{(t)} {\bf X}) \beta^{(t+1)} &= {\bf X}^T {\bf W}^{(t)} ({\bf D}^{(t)})^{-1}(y - \mu^{(t)}) + ({\bf X}^T {\bf W}^{(t)} {\bf X}) \beta^{(t)} \\ &= {\bf X}^T {\bf W}^{(t)} [{\bf X}\beta^{(t)} + ({\bf D}^{(t)})^{-1}(y - \mu^{(t)})] \\ &= {\bf X}^T {\bf W}^{(t)} z^{(t)}, \end{aligned}

where $z^{(t)} = {\bf X}\beta^{(t)} + ({\bf D}^{(t)})^{-1}(y - \mu^{(t)})$.

Looking closely at the last equation, we recognize it as the set of normal equations for weighted least squares, where the response is $z^{(t)}$ and the weight matrix is ${\bf W}^{(t)}$.

We can think of $z^{(t)}$ as a linearized form of the link function $g$ evaluated at $y$:

\begin{aligned} z_i^{(t)} &= \eta_i^{(t)} + (y_i - \mu_i^{(t)}) \frac{\partial \eta_i^{(t)}}{\partial \mu_i^{(t)}} \\ &= g(\mu_i^{(t)}) + (y_i - \mu_i^{(t)}) g'(\mu_i^{(t)}) \\ &\approx g(y_i). \end{aligned}

This connection gives us a way to implement the Fisher scoring method using routines for solving weighted least squares problems.
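
To make the connection concrete, here is a minimal IRLS sketch (again for a Poisson GLM with log link, an assumption on my part), with the weighted least squares step solved directly via `np.linalg.solve`:

```python
import numpy as np

def irls_poisson(X, y, max_iter=25, tol=1e-8):
    """Fit a Poisson GLM with log link by iterated reweighted least squares."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        eta = X @ beta
        mu = np.exp(eta)
        w = mu                          # W: (dmu/deta)^2 / Var(Y) = mu^2 / mu
        z = eta + (y - mu) / mu         # working response z = eta + D^{-1}(y - mu)
        XtW = X.T * w                   # X^T W without forming the diagonal matrix
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta

# Hypothetical usage on simulated data:
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = rng.poisson(np.exp(0.5 + 0.8 * X[:, 1]))
print(irls_poisson(X, y))   # roughly (0.5, 0.8)
```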

References:

1. Agresti, A. Categorical Data Analysis (3rd ed), Chapter 4.

# Generalized least squares (GLS) and weighted least squares (WLS)

Assume that we are in the standard regression setting where we have $n$ observations, responses $y = (y_1, \dots, y_n) \in \mathbb{R}^n$, and $p$ feature values $X \in \mathbb{R}^{n \times p}$, where $x_{ij}$ denotes the value of the $j$th feature for the $i$th observation. Assume that $X$ is fixed. In ordinary least squares (OLS), we assume that the true model is

\begin{aligned} y = X\beta + \varepsilon, \end{aligned}

where $\mathbb{E}[\varepsilon] = 0$ and $\text{Cov}(\varepsilon) = \sigma^2 I$ for some known $\sigma \in \mathbb{R}$. The OLS estimate of $\beta$ is

\begin{aligned} \hat{\beta}_{OLS} = (X^T X)^{-1} X^T y. \end{aligned}

Under the assumptions above, the Gauss-Markov theorem says that $\hat{\beta}_{OLS}$ is the best linear unbiased estimator (BLUE) for $\beta$.
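
As a quick numpy sketch on simulated data, the closed-form expression above agrees with numpy's built-in least squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=100)

# Closed form (X^T X)^{-1} X^T y, solved without explicitly inverting X^T X
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# The same fit via numpy's least squares solver
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(beta_ols, beta_lstsq))  # True
```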

## Generalized least squares (GLS)

In generalized least squares (GLS), instead of assuming that $\text{Cov}(\varepsilon) = \sigma^2 I$, we assume instead that $\text{Cov}(\varepsilon) = V$ for some known, non-singular covariance matrix $V \in \mathbb{R}^{n \times n}$. We have

\begin{aligned} y &= X\beta + \varepsilon, &(1)\\ V^{-1/2}y &= V^{-1/2} X \beta + \varepsilon', &(2) \end{aligned}

where $\text{Cov}(\varepsilon') = \text{Cov}(V^{-1/2}\varepsilon) = I$. Applying the OLS formula to this last equation $(2)$, we get the generalized least squares estimate (GLS) for $\beta$:

\begin{aligned} \hat{\beta}_{GLS} &= [(V^{-1/2} X)^T(V^{-1/2} X)]^{-1}(V^{-1/2} X)^T(V^{-1/2}y) \\ &= (X^TV^{-1}X)^{-1}X^TV^{-1}y. \end{aligned}

Why not just apply the OLS formula to $(1)$ directly? Because the Gauss-Markov assumptions do not hold for $(1)$ when $V \neq \sigma^2 I$; they do hold for $(2)$, and so we can conclude that $\hat{\beta}_{GLS}$ is the best linear unbiased estimator (BLUE) for $\beta$ in this setup.

Note that if we let $W = V^{-1}$ denote the inverse covariance matrix, then the GLS solution has a slightly nicer form:

\begin{aligned} \hat{\beta}_{GLS} = (X^TWX)^{-1}X^TWy. \end{aligned}
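
Here is a small numpy sketch on simulated data (with an AR(1)-style error covariance chosen purely for illustration), checking that whitening by $V^{-1/2}$ followed by OLS matches the closed-form GLS expression:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])

# AR(1)-style covariance for the errors (an illustrative choice)
rho = 0.7
V = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(V)
y = X @ beta_true + L @ rng.normal(size=n)   # errors with Cov = V

# Route 1: whiten with V^{-1/2}, then apply the OLS formula
evals, evecs = np.linalg.eigh(V)
V_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
Xw, yw = V_inv_sqrt @ X, V_inv_sqrt @ y
beta_whitened = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)

# Route 2: closed form (X^T V^{-1} X)^{-1} X^T V^{-1} y
XtVinv = X.T @ np.linalg.inv(V)
beta_gls = np.linalg.solve(XtVinv @ X, XtVinv @ y)

print(np.allclose(beta_whitened, beta_gls))  # True
```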

## Weighted least squares (WLS) Part 1

One can think of weighted least squares (WLS) as a special case of GLS. In WLS, we assume that the covariance matrix $V$ for $\varepsilon$ is diagonal. That is, each observation's error term may have its own variance, but the errors are pairwise uncorrelated. If $\text{Var}(\varepsilon_i) = \sigma_i^2$ for all $i = 1, \dots, n$ and $W$ is the diagonal matrix such that $w_{ii} = \frac{1}{\sigma_i^2}$, then

\begin{aligned} \hat{\beta}_{WLS} = (X^TWX)^{-1}X^TWy. \end{aligned}

## Weighted least squares (WLS) Part 2

Another sense in which the least squares problem can be “weighted” is where the observations are assigned different weights. Instead of trying to solve

\begin{aligned} \text{minimize} \: \sum_{i=1}^n \left( y_i - \sum_{j=1}^p x_{ij}\beta_j \right)^2, \end{aligned}

we want to solve

\begin{aligned} \text{minimize} \: \sum_{i=1}^n w_i \left( y_i - \sum_{j=1}^p x_{ij}\beta_j \right)^2, \end{aligned}

where the $w_i$'s are some known “weights” that we place on the observations. (We might do so, for example, if we know that some observations are more trustworthy than others.) Note that

\begin{aligned} \sum_{i=1}^n w_i \left( y_i - \sum_{j=1}^p x_{ij}\beta_j \right)^2 = \sum_{i=1}^n \left( \sqrt{w_i} y_i - \sum_{j=1}^p \sqrt{w_i}x_{ij}\beta_j \right)^2. \end{aligned}

Thus, if $W^{1/2}$ is a diagonal matrix with the $i$th diagonal entry being $\sqrt{w_i}$, solving the above is equivalent to performing OLS of $W^{1/2} y$ on $W^{1/2} X$:

\begin{aligned} \hat{\beta}_{WLS} &= [(W^{1/2} X)^T (W^{1/2} X)]^{-1} (W^{1/2} X) ^T(W^{1/2} y) \\ &= (X^T W X)^{-1} X^TWy. \end{aligned}

This is the same expression that we had in WLS Part 1, except that the weights in $W$ mean something different.
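
Here is a short numpy sketch of this equivalence on simulated data (with arbitrary illustrative weights): rescaling the rows by $\sqrt{w_i}$ and running OLS gives the same estimate as the weighted formula.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 2
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0]) + rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)        # known observation weights (illustrative)

# Route 1: rescale rows by sqrt(w_i), then run OLS
sw = np.sqrt(w)
Xs, ys = X * sw[:, None], y * sw
beta_rescaled = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

# Route 2: the weighted formula (X^T W X)^{-1} X^T W y
XtW = X.T * w
beta_wls = np.linalg.solve(XtW @ X, XtW @ y)

print(np.allclose(beta_rescaled, beta_wls))  # True
```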