Understanding the components of a generalized linear model (GLM)

Generalized linear models (GLMs) are significantly more complicated than ordinary linear models: there is more notation, there are more concepts to keep track of, and there is more confusion about what’s random (or not) and what’s known (or not). This post lays out the setup of a GLM in detail to clear up that confusion.

Assume that we have n data points (x_{i1}, \dots, x_{ip}, y_i) \in \mathbb{R}^{p+1} for i = 1, \dots, n. We want to build a model of the response y using the p other features X_1, \dots, X_p. Throughout the discussion, assume that the x values are fixed (i.e. non-random).

A GLM consists of three components:

  1. A random component,
  2. A systematic component, and
  3. A link function.

Random component

We assume that y_1, \dots, y_n are samples of independent random variables Y_1, \dots, Y_n respectively. We assume that Y_i has the probability density (or mass) function of the form

\begin{aligned} f(y_i ; \theta_i) = a(\theta_i) b(y_i) \exp [y_i Q(\theta_i)]. \end{aligned}

In the above, the form of f (and hence that of a, b and Q) is assumed to be known. The \theta_i’s are unknown and have to be estimated; the value of \theta_i can vary across i.

The family of distributions above is known as an exponential family, and Q(\theta) is called the natural parameter. If Q(\theta) = \theta, the exponential family is said to be in canonical form.

Let \mu_i = \mathbb{E} [Y_i]. Often we will not estimate the \theta_i’s directly, but rather some function of \mu_i (as we will see soon).

Systematic component

The systematic component relates some vector (\eta_1, \dots, \eta_n) to the p features. We assume that the relationship is given by

\begin{aligned} \eta_i = \sum_{j=1}^p \beta_j x_{ij} \end{aligned}

for i = 1, \dots, n. The coefficients \beta_1, \dots, \beta_p are unknown and have to be estimated. What are the \eta_i’s? Read on!

Link function

The link function is a function g such that \eta_i = g(\mu_i) for i = 1, \dots, n. The function g is assumed to be known; it is something the data modeler picks.

If \eta_i = g(\mu_i) = \mu_i, the link function is called the identity link. If \eta_i = g(\mu_i) = Q(\theta_i), the link function is called the canonical link.

In a GLM, we wish to estimate the \beta_j’s. This in turn gives us estimates of the g(\mu_i)’s, and hence (by inverting g) estimates of the \mu_i’s.
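
To make this pipeline concrete, here is a minimal sketch in Python’s statsmodels (my choice of library; the post itself is software-agnostic), with simulated data and variable names of my own. The family argument plays the role of the random component, the design matrix encodes the systematic component, and each family comes with a default link g. With the Gaussian family and its default identity link, the GLM reduces to the ordinary linear model:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 200, 2
X = rng.normal(size=(n, p))     # fixed features x_{ij}; a column of 1s would act as an intercept
beta = np.array([1.0, -2.0])    # "true" coefficients, unknown in practice
eta = X @ beta                  # systematic component: eta_i = sum_j beta_j * x_{ij}
y = eta + rng.normal(size=n)    # Gaussian random component; identity link means mu_i = eta_i

# family chooses the random component; Gaussian defaults to the identity link g(mu) = mu
res = sm.GLM(y, X, family=sm.families.Gaussian()).fit()
print(res.params)               # estimates of beta_1, ..., beta_p
print(res.predict(X)[:5])       # fitted mu_i, obtained by inverting g at the fitted eta_i
```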

Example 1: Logistic regression

With binary data, we assume that Y_i \sim \text{Bern}(\pi_i). We can write the probability mass function as

\begin{aligned} f(y_i; \pi_i) &= \pi_i^{y_i} (1-\pi_i)^{1-y_i} \\  &= (1 - \pi_i) [\pi_i / (1 - \pi_i)]^{y_i} \\  &= (1 - \pi_i) \exp \left[ y_i \log \left( \frac{\pi_i}{1 - \pi_i} \right) \right]. \end{aligned}

To match this to the earlier formula for exponential families, take \theta_i = \pi_i, a(\theta_i) = 1 - \theta_i, b(y_i) = 1 and Q(\theta_i) = \log [\theta_i / (1 - \theta_i)]. The natural parameter for this family is Q(\pi_i) = \log [\pi_i / (1 - \pi_i)].
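
As a quick numeric sanity check of this factorization (pi = 0.3 is an arbitrary choice), the factored form matches the usual Bernoulli PMF at both possible values of y_i:

```python
import numpy as np

pi = 0.3  # arbitrary success probability
for y in (0, 1):
    direct = pi**y * (1 - pi)**(1 - y)                       # standard Bernoulli PMF
    factored = (1 - pi) * np.exp(y * np.log(pi / (1 - pi)))  # exponential family form
    assert np.isclose(direct, factored)
```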

In logistic regression, we take the link function to be the canonical link. Since \mu_i = \mathbb{E}[Y_i] = \pi_i here, this means that our systematic component is

\begin{aligned} \log \left( \frac{\pi_i}{1 - \pi_i} \right) = \sum_{j=1}^p \beta_j x_{ij}, \quad i = 1, \dots, n. \end{aligned}
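
A sketch of logistic regression in the same statsmodels setup as before (again with simulated data; in statsmodels, the Binomial family defaults to the canonical logit link):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 500, 2
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0])
pi = 1 / (1 + np.exp(-(X @ beta)))  # inverting the logit link: pi_i = g^{-1}(eta_i)
y = rng.binomial(1, pi)             # Bernoulli responses

res = sm.GLM(y, X, family=sm.families.Binomial()).fit()  # canonical logit link by default
print(res.params)                   # estimates of beta_j
print(res.predict(X)[:5])           # fitted probabilities pi_i
```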

Example 2: Poisson regression

Assume that our data are counts, and that the time period over which the counts were collected is the same across i. A simple model is to assume that Y_i \sim \text{Poisson}(\mu_i). We can write the probability mass function as

\begin{aligned} f(y_i ; \mu_i) = \frac{\mu_i^{y_i} e^{-\mu_i} }{y_i !} = e^{-\mu_i} \frac{1}{y_i !} \exp (y_i \log \mu_i). \end{aligned}

To match this to the earlier formula for exponential families, take \theta_i = \mu_i, a(\theta_i) = \exp(-\theta_i), b(y_i) = 1 / {y_i} ! and Q(\theta_i) = \log \theta_i. The natural parameter for this family is Q(\mu_i) = \log \mu_i.

In Poisson loglinear regression, we take the link function to be the canonical link, and so the systematic component is

\begin{aligned} \log \mu_i = \sum_{j=1}^p \beta_j x_{ij}, \quad i = 1, \dots, n. \end{aligned}
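
And the corresponding sketch for Poisson loglinear regression (the Poisson family in statsmodels defaults to the canonical log link); the final line checks that the fitted means are exactly exp of the fitted linear predictor:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 500, 2
X = rng.normal(size=(n, p))
beta = np.array([0.5, -0.25])
mu = np.exp(X @ beta)   # inverting the log link: mu_i = exp(eta_i)
y = rng.poisson(mu)     # count responses

res = sm.GLM(y, X, family=sm.families.Poisson()).fit()  # canonical log link by default
print(res.params)       # estimates of beta_j
assert np.allclose(res.predict(X), np.exp(X @ res.params))  # mu_hat_i = exp(eta_hat_i)
```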

Extending the GLM to exponential dispersion families

For the random component of GLMs, we assume that Y_i has the probability density (or mass) function of the form

\begin{aligned} f(y_i ; \theta_i) = a(\theta_i) b(y_i) \exp [y_i Q(\theta_i)]. \end{aligned}

In some cases, it helps to add an additional parameter, called the dispersion parameter, to model the data more accurately. For example, by using the Poisson distribution for count data, we assume that \text{Var}(Y_i) = \mu_i = \mathbb{E}[Y_i], which may not hold in practice (count data are often overdispersed, with variance exceeding the mean).

With this new dispersion parameter \phi, it is common to write the PDF/PMF of Y_i in the form below (note that a and b here denote different functions from those in the earlier exponential family form):

\begin{aligned} f(y_i ; \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right]. \end{aligned}

When \phi is known, the above reduces to the exponential family that we introduced at first, and we can use all the GLM machinery. Usually \phi is not known; in that case we can estimate it first, then treat the estimate as known for the rest of the GLM procedure.
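
One common moment-based estimate of \phi is the Pearson statistic divided by the residual degrees of freedom, as in the quasi-Poisson approach. A sketch in the same statsmodels setup (simulated counts that are deliberately overdispersed relative to the Poisson):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
mu = np.exp(X @ np.array([0.5, -0.25]))
# negative binomial draws have mean mu but variance mu + mu^2/5 > mu (overdispersed)
y = rng.negative_binomial(5, 5 / (5 + mu))

res = sm.GLM(y, X, family=sm.families.Poisson()).fit()
phi_hat = res.pearson_chi2 / res.df_resid  # Pearson X^2 / (n - p)
print(phi_hat)  # values well above 1 suggest overdispersion relative to the Poisson
```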

References:

  1. Agresti, A. Categorical Data Analysis (3rd ed.), Chapter 4.