Derivation of GLM likelihood equations

In a previous post, I stated the likelihood equations (or score equations) for generalized linear models (GLMs). Any solution to the score equations is a maximum likelihood estimator (MLE) for the GLM. In this post, I work through the derivation of the score equations.

Recap: What is a GLM?

Assume we have n data points (x_{i1}, \dots, x_{ip}, y_i) \in \mathbb{R}^{p+1} for i = 1, \dots, n. We want to build a generalized linear model (GLM) of the response y using the p other features X_1, \dots, X_p. To that end, assume that the x values are all fixed, and that y_1, \dots, y_n are realizations of independent random variables Y_1, \dots, Y_n with probability density (or mass) functions of the form

\begin{aligned} f(y_i ; \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right]. \qquad -(1) \end{aligned}

In the above, the form of f is known, but not the values of the \theta_i's and \phi.
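For concreteness, the Poisson distribution with mean \mu_i is one standard example of form (1):

\begin{aligned} f(y_i ; \mu_i) = \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!} = \exp \left[ y_i \log \mu_i - \mu_i - \log (y_i!) \right], \end{aligned}

so that \theta_i = \log \mu_i, b(\theta_i) = e^{\theta_i}, a(\phi) = 1 and c(y_i, \phi) = -\log (y_i!). We will return to this example below.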

Let \mu_i = \mathbb{E} [Y_i]. We assume that

\begin{aligned} \eta_i = \sum_{j=1}^p \beta_j x_{ij}, \quad i = 1, \dots, n, \qquad -(2) \end{aligned}

where \eta_i = g(\mu_i) for some link function g, assumed to be known.
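For the Poisson example, the canonical choice is the log link g(\mu) = \log \mu, which makes \eta_i = \log \mu_i equal to the natural parameter \theta_i.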

Recap: The likelihood/score equations

The goal of fitting a GLM is to find estimates for \beta_1, \dots, \beta_p. More specifically, we want to find the values of \beta_1, \dots, \beta_p which maximize the (log)-likelihood of the data:

\begin{aligned} L = \sum_{i=1}^n L_i = \sum_{i=1}^n \log f(y_i ; \theta_i, \phi). \end{aligned}

To do that, we differentiate L with respect to each \beta_j and set the derivatives equal to 0. Using the chain rule as well as the form of f, after some algebra we have

\begin{aligned} \frac{\partial L_i}{\partial \beta_j} = \frac{\partial L_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta_j} = \dots = \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i}. \qquad -(3) \end{aligned}

Hence, summing over i and setting each derivative to 0, the likelihood equations (or score equations) are

\begin{aligned} \sum_{i=1}^n \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i} = 0, \quad j = 1, \dots, p. \qquad -(4) \end{aligned}

\beta appears implicitly in the equations above through \mu_i: \mu_i = g^{-1}(\sum_{j=1}^p \beta_j x_{ij}). (See the original post for a matrix form of these equations.)
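To make (4) concrete, here is a minimal numerical sketch (Python with numpy and scipy, on simulated data) for a Poisson GLM with the canonical log link. In that case \text{Var}(Y_i) = \mu_i and \partial \mu_i / \partial \eta_i = \mu_i, so (4) reduces to X^\top (y - \mu) = 0. The data and variable names below are purely illustrative.

import numpy as np
from scipy.optimize import root

# Simulate Poisson regression data (illustrative only).
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([0.5, -0.3, 0.2])
y = rng.poisson(np.exp(X @ beta_true))

def score(beta):
    # Canonical log link: mu_i = exp(eta_i), Var(Y_i) = mu_i and
    # d mu_i / d eta_i = mu_i, so (4) becomes X^T (y - mu) = 0.
    mu = np.exp(X @ beta)
    return X.T @ (y - mu)

beta_hat = root(score, x0=np.zeros(p)).x
print(beta_hat)  # should be close to beta_true

In practice GLM software solves these equations by iteratively reweighted least squares, but a generic root finder is enough to illustrate that the MLE is a zero of the score.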

Likelihood equations: A derivation

All this involves is evaluating the four partial derivatives in (3) and multiplying them together.

Using the form (1) of the probability density function for Y_i:

\begin{aligned} L_i &= \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi), \\  \dfrac{\partial L_i}{\partial \theta_i} &= \frac{y_i - b'(\theta_i)}{a(\phi)}. \end{aligned}

The second partial derivative, \partial \theta_i / \partial \mu_i, is probably the trickiest of the lot: simplifying it requires some properties of exponential families. Under general regularity conditions, we have

\begin{aligned} \mathbb{E} \left( \dfrac{\partial L}{\partial \theta} \right) &= 0, &-(5) \\  - \mathbb{E} \left( \dfrac{\partial^2 L}{\partial \theta^2} \right) &= \mathbb{E} \left[ \left( \dfrac{\partial L}{\partial \theta} \right)^2 \right]. &-(6) \end{aligned}
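Both identities can be obtained by differentiating the identity \int f(y; \theta, \phi) \, dy = 1 with respect to \theta under the integral sign (the regularity conditions are what justify swapping the derivative and the integral). For (5), for instance,

\begin{aligned} \mathbb{E} \left( \dfrac{\partial L}{\partial \theta} \right) = \int \dfrac{\partial \log f}{\partial \theta} \, f \, dy = \int \dfrac{\partial f}{\partial \theta} \, dy = \dfrac{\partial}{\partial \theta} \int f \, dy = 0, \end{aligned}

and differentiating once more gives (6).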

Applying (5) to our setting:

\begin{aligned} \mathbb{E} \left[ \frac{Y_i - b'(\theta_i)}{a(\phi)} \right] &= 0, \\  \mu_i = \mathbb{E}[Y_i] &= b'(\theta_i), \text{ and } \quad \dfrac{\partial \mu_i}{\partial \theta_i} = b''(\theta_i). \end{aligned}
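In the Poisson example, b'(\theta_i) = e^{\theta_i} = \mu_i, which is indeed the Poisson mean.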

Applying (6) to our setting:

\begin{aligned} - \dfrac{b''(\theta_i)}{a(\phi)} = \mathbb{E} \left( \dfrac{\partial^2 L_i}{\partial \theta_i^2} \right) &= -\mathbb{E} \left[\left( \dfrac{\partial L_i}{\partial \theta_i} \right)^2 \right] = -\mathbb{E} \left[ \left( \frac{Y_i - b'(\theta_i)}{a(\phi)} \right)^2 \right], \\  \dfrac{b''(\theta_i)}{a(\phi)} &= \dfrac{\mathbb{E}[(Y_i - \mu_i)^2]}{a(\phi)^2}, \\  b''(\theta_i) &= \text{Var}(Y_i) / a(\phi). \end{aligned}

Thus,

\begin{aligned} \dfrac{\partial \mu_i}{\partial \theta_i} &= b''(\theta_i) = \text{Var}(Y_i) / a(\phi), \\  \dfrac{\partial \theta_i}{\partial \mu_i} &= a(\phi) / \text{Var}(Y_i). \end{aligned}
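Continuing the Poisson example, b''(\theta_i) = e^{\theta_i} = \mu_i, which matches \text{Var}(Y_i) / a(\phi) = \mu_i / 1; hence \partial \theta_i / \partial \mu_i = 1 / \mu_i.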

The third partial derivative, \partial \mu_i / \partial \eta_i, actually appears in the likelihood equations so we don’t have to do any algebraic manipulation here. Finally, using the systematic component of the GLM (2), we have

\begin{aligned} \dfrac{\partial \eta_i}{\partial \beta_j} = x_{ij}. \end{aligned}

Putting these four parts together:

\begin{aligned} \frac{\partial L_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta_j} &= \frac{y_i - b'(\theta_i)}{a(\phi)} \cdot \frac{a(\phi)}{\text{Var}(Y_i)} \cdot \frac{\partial \mu_i}{\partial \eta_i} \cdot x_{ij} \\  &= \dfrac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)}\frac{\partial \mu_i}{\partial \eta_i}, \end{aligned}

as required.
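As a sanity check, under the canonical link we have \eta_i = \theta_i, so \partial \mu_i / \partial \eta_i = b''(\theta_i) = \text{Var}(Y_i) / a(\phi) and the likelihood equations (4) simplify to

\begin{aligned} \sum_{i=1}^n \frac{(y_i - \mu_i) x_{ij}}{a(\phi)} = 0, \quad j = 1, \dots, p, \end{aligned}

which, for the Poisson GLM with a(\phi) = 1, is exactly the X^\top (y - \mu) = 0 system solved in the numerical sketch above.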

References:

  1. Agresti, A. Categorical Data Analysis (3rd ed), Chapter 4.