# Derivation of GLM likelihood equations

In a previous post, I stated the likelihood equations (or score equations) for generalized linear models (GLMs); any solution to the score equations is a candidate for the maximum likelihood estimator (MLE) of the GLM. In this post, I work through the derivation of the score equations.

## Recap: What is a GLM?

Assume we have $n$ data points $(x_{i1}, \dots, x_{ip}, y_i) \in \mathbb{R}^{p+1}$ for $i = 1, \dots, n$. We want to build a generalized linear model (GLM) of the response $y$ using the $p$ other features $X_1, \dots, X_p$. To that end, assume that the $x$ values are all fixed, and that $y_1, \dots, y_n$ are samples of independent random variables $Y_1, \dots, Y_n$ which have probability density (or mass) functions of the form \begin{aligned} f(y_i ; \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right]. \qquad -(1) \end{aligned}

In the above, the form of $f$ (i.e. the functions $a$, $b$ and $c$) is known, but the values of the $\theta_i$'s and $\phi$ are not.

Let $\mu_i = \mathbb{E} [Y_i]$. We assume that \begin{aligned} \eta_i = \sum_{j=1}^p \beta_j x_{ij}, \quad i = 1, \dots, n, \qquad -(2) \end{aligned}

where $\eta_i = g(\mu_i)$ for some link function $g$, assumed to be known.
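As a concrete example (my own, not from the original post), the Poisson distribution with mean $\mu_i$ fits the form $(1)$: \begin{aligned} f(y_i ; \theta_i) = \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!} = \exp \left[ y_i \log \mu_i - \mu_i - \log (y_i!) \right], \end{aligned}

so $\theta_i = \log \mu_i$, $b(\theta_i) = e^{\theta_i}$, $a(\phi) = 1$ and $c(y_i, \phi) = -\log(y_i!)$. The canonical link here is $g(\mu) = \log \mu$.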

## Recap: The likelihood/score equations

The goal of fitting a GLM is to find estimates for $\beta_1, \dots, \beta_p$. More specifically, we want to find the values of $\beta_1, \dots, \beta_p$ which maximize the (log)-likelihood of the data: \begin{aligned} L = \sum_{i=1}^n L_i = \sum_{i=1}^n \log f(y_i ; \theta_i, \phi). \end{aligned}

To do that, we differentiate $L$ with respect to each $\beta_j$ and set the derivatives to zero. Using the chain rule along with the form of $f$, after some algebra we have \begin{aligned} 0 = \frac{\partial L_i}{\partial \beta_j} = \frac{\partial L_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta_j} = \dots = \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i}. \qquad -(3) \end{aligned}

Hence, the likelihood equations (or score equations) are \begin{aligned} \sum_{i=1}^n \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i} = 0, \quad j = 1, \dots, p. \qquad -(4) \end{aligned} $\beta$ appears implicitly in the equations above through $\mu_i$: $\mu_i = g^{-1}(\sum_{j=1}^p \beta_j x_{ij})$. (See the original post for a matrix form of these equations.)
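To make the score equations concrete, here is a minimal numerical sketch (my own, not from the original post): it fits a Poisson GLM with log link by Newton's method / iteratively reweighted least squares on simulated data, then verifies that the fitted coefficients satisfy $(4)$. The variable names and simulated data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))               # fixed design matrix
beta_true = np.array([0.5, -0.3, 0.2])
y = rng.poisson(np.exp(X @ beta_true))    # Poisson responses, log link

# For the Poisson with log link: mu = exp(eta), Var(Y_i) = mu_i and
# d mu_i / d eta_i = mu_i, so the score (4) reduces to X^T (y - mu).
beta = np.zeros(p)
for _ in range(25):
    mu = np.exp(X @ beta)
    score = X.T @ (y - mu)                # left-hand side of (4)
    fisher = X.T @ (mu[:, None] * X)      # Fisher information X^T W X
    beta = beta + np.linalg.solve(fisher, score)

mu = np.exp(X @ beta)
print(np.abs(X.T @ (y - mu)).max())       # ~0: the score equations hold
```

At convergence the score vector is numerically zero, which is exactly the statement of $(4)$ for this model.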

## Likelihood equations: A derivation

The derivation amounts to evaluating the four partial derivatives in $(3)$ and multiplying them together.

Using the form of the probability density function for $Y_i$ $(1)$: \begin{aligned} L_i &= \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi), \\ \dfrac{\partial L_i}{\partial \theta_i} &= \frac{y_i - b'(\theta_i)}{a(\phi)}. \end{aligned}

The second partial derivative, $\partial \theta_i / \partial \mu_i$, is probably the trickiest of the lot: simplifying it requires some properties of exponential families. Under general regularity conditions, we have \begin{aligned} \mathbb{E} \left( \dfrac{\partial L}{\partial \theta} \right) &= 0, &-(5) \\ - \mathbb{E} \left( \dfrac{\partial^2 L}{\partial \theta^2} \right) &= \mathbb{E} \left[ \left( \dfrac{\partial L}{\partial \theta} \right)^2 \right]. &-(6) \end{aligned}
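For completeness, here is a sketch of why $(5)$ holds (a standard argument, assuming we may interchange differentiation and integration): differentiating both sides of $\int f(y ; \theta, \phi) \, dy = 1$ with respect to $\theta$ gives \begin{aligned} 0 = \int \dfrac{\partial f}{\partial \theta} \, dy = \int \dfrac{\partial \log f}{\partial \theta} f \, dy = \mathbb{E} \left( \dfrac{\partial L}{\partial \theta} \right). \end{aligned}

Differentiating once more and applying the product rule yields $(6)$.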

Applying $(5)$ to our setting: \begin{aligned} \mathbb{E} \left[ \frac{Y_i - b'(\theta_i)}{a(\phi)} \right] &= 0, \\ \mu_i = \mathbb{E}[Y_i] &= b'(\theta_i), \text{ and } \quad \dfrac{\partial \mu_i}{\partial \theta_i} = b''(\theta_i). \end{aligned}

Applying $(6)$ to our setting: \begin{aligned} - \dfrac{b''(\theta_i)}{a(\phi)} = \mathbb{E} \left( \dfrac{\partial^2 L_i}{\partial \theta_i^2} \right) &= -\mathbb{E} \left[\left( \dfrac{\partial L_i}{\partial \theta_i} \right)^2 \right] = -\mathbb{E} \left[ \left( \frac{Y_i - b'(\theta_i)}{a(\phi)} \right)^2 \right], \\ \dfrac{b''(\theta_i)}{a(\phi)} &= \dfrac{\mathbb{E}[(Y_i - \mu_i)^2]}{a(\phi)^2}, \\ b''(\theta_i) &= \text{Var}(Y_i) / a(\phi). \end{aligned}

Thus, \begin{aligned} \dfrac{\partial \mu_i}{\partial \theta_i} &= b''(\theta_i) = \text{Var}(Y_i) / a(\phi), \\ \dfrac{\partial \theta_i}{\partial \mu_i} &= a(\phi) / \text{Var}(Y_i). \end{aligned}
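As a sanity check, for a Poisson response we have $b(\theta) = e^\theta$ and $a(\phi) = 1$, so these identities give \begin{aligned} \mu_i = b'(\theta_i) = e^{\theta_i}, \qquad b''(\theta_i) = e^{\theta_i} = \mu_i = \text{Var}(Y_i), \end{aligned}

recovering the familiar fact that the Poisson variance equals its mean.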

The third partial derivative, $\partial \mu_i / \partial \eta_i$, actually appears in the likelihood equations so we don’t have to do any algebraic manipulation here. Finally, using the systematic component of the GLM $(2)$, we have \begin{aligned} \dfrac{\partial \eta_i}{\partial \beta_j} = x_{ij}. \end{aligned}

Putting these 4 parts together: \begin{aligned} \frac{\partial L_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta_j} &= \frac{y_i - b'(\theta_i)}{a(\phi)} \cdot \frac{a(\phi)}{\text{Var}(Y_i)} \cdot \frac{\partial \mu_i}{\partial \eta_i} \cdot x_{ij} \\ &= \dfrac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)}\frac{\partial \mu_i}{\partial \eta_i}, \end{aligned}

as required.
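The chain-rule product can also be verified numerically. The sketch below (my own illustration; the observation, coefficients and helper `loglik` are assumptions) compares the right-hand side of $(3)$ against a central finite-difference derivative of $L_i$ for a single Poisson observation with log link.

```python
import numpy as np

# One Poisson observation with log link: mu = exp(x_i . beta).
x_i = np.array([0.4, -1.2, 0.7])
beta = np.array([0.3, 0.1, -0.5])
y_i = 2.0

def loglik(b):
    mu = np.exp(x_i @ b)
    return y_i * np.log(mu) - mu          # Poisson L_i, up to c(y_i, phi)

mu = np.exp(x_i @ beta)
# Right-hand side of (3): Var(Y_i) = mu_i and d mu_i / d eta_i = mu_i here
analytic = (y_i - mu) * x_i / mu * mu

eps = 1e-6
numeric = np.array([
    (loglik(beta + eps * e) - loglik(beta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.abs(analytic - numeric).max())   # ~0: the two derivatives agree
```

The two gradients agree to numerical precision, matching the algebra above.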
