# Understanding the components of a generalized linear model (GLM)

Generalized linear models (GLMs) are significantly more complicated than ordinary linear models: there is more notation, more conceptual terminology, and more confusion about what is random (or not) and what is known (or not). This post lays out the setup of a GLM in detail to clear up that confusion.

Assume that you have $n$ data points $(x_{i1}, \dots, x_{ip}, y_i) \in \mathbb{R}^{p+1}$ for $i = 1, \dots, n$. We want to build a model of the response $y$ using the $p$ other features $X_1, \dots, X_p$. Assume that the $x$ values are all fixed throughout the discussion.

A GLM consists of three components:

1. A random component,
2. A systematic component, and
3. A link function.

## Random component

We assume that $y_1, \dots, y_n$ are samples of independent random variables $Y_1, \dots, Y_n$ respectively. We assume that $Y_i$ has the probability density (or mass) function of the form \begin{aligned} f(y_i ; \theta_i) = a(\theta_i) b(y_i) \exp [y_i Q(\theta_i)]. \end{aligned}

In the above, the form of $f$ (and hence of $a$, $b$ and $Q$) is assumed to be known. The unknowns are the $\theta_i$'s, which must be estimated. The value of $\theta_i$ can vary across $i$.

The family of distributions above is known as an exponential family, and $Q(\theta)$ is called the natural parameter. If $Q(\theta) = \theta$, the exponential family is said to be in canonical form.

Let $\mu_i = \mathbb{E} [Y_i]$. Often we will not estimate the $\theta_i$'s directly, but rather some function of $\mu_i$ (as we will see soon).

## Systematic component

The systematic component relates some vector $(\eta_1, \dots, \eta_n)$ to the $p$ features. We assume that the relationship is given by \begin{aligned} \eta_i = \sum_{j=1}^p \beta_j x_{ij} \end{aligned}

for $i = 1, \dots, n$. The coefficients $\beta_1, \dots, \beta_p$ are unknown and must be estimated. What are the $\eta_i$'s? Read on!

The link function is a function $g$ such that $\eta_i = g(\mu_i)$ for $i = 1, \dots, n$. The function $g$ is assumed to be known; it is chosen by the data modeler.

If $\eta_i = g(\mu_i) = \mu_i$, the link function is called the identity link. If $\eta_i = g(\mu_i) = Q(\theta_i)$, the link function is called the canonical link.
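To make these definitions concrete, here is a small sketch of my own (in Python; not part of the original setup) of the identity, logit, and log links together with their inverses:

```python
import numpy as np

# Each link is a pair (g, g_inv) with eta = g(mu) and mu = g_inv(eta).
identity = (lambda mu: mu, lambda eta: eta)
logit = (lambda mu: np.log(mu / (1 - mu)),   # canonical link for the Bernoulli
         lambda eta: 1 / (1 + np.exp(-eta)))
log = (lambda mu: np.log(mu),                # canonical link for the Poisson
       lambda eta: np.exp(eta))

mu = np.array([0.2, 0.5, 0.9])
for g, g_inv in (identity, logit, log):
    # Round trip: applying g and then g^{-1} recovers mu.
    assert np.allclose(g_inv(g(mu)), mu)
```

The inverse link $g^{-1}$ is what maps an estimated linear predictor $\hat\eta_i$ back to an estimated mean $\hat\mu_i$.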

In a GLM, we wish to estimate the $\beta_j$'s. This in turn gives us an estimate for the $g(\mu_i)$'s, and applying $g^{-1}$ gives us an estimate for the $\mu_i$'s.

## Example 1: Logistic regression

With binary data, we assume that $Y_i \sim \text{Bern}(\pi_i)$. We can write the probability mass function as \begin{aligned} f(y_i; \pi_i) &= \pi_i^{y_i} (1-\pi_i)^{1-y_i} \\ &= (1 - \pi_i) [\pi_i / (1 - \pi_i)]^{y_i} \\ &= (1 - \pi_i) \exp \left[ y_i \log \left( \frac{\pi_i}{1 - \pi_i} \right) \right]. \end{aligned}

To match this to the earlier formula for exponential families, take $\theta_i = \pi_i$, $a(\theta_i) = 1 - \theta_i$, $b(y_i) = 1$ and $Q(\theta_i) = \log [\theta_i / (1 - \theta_i)]$. The natural parameter for this family is $Q(\pi_i) = \log [\pi_i / (1 - \pi_i)]$.
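As a quick numerical sanity check (my own, not from the post), we can confirm that this factorization reproduces the Bernoulli PMF:

```python
import numpy as np

def bern_pmf(y, pi):
    # Standard Bernoulli PMF: pi^y * (1 - pi)^(1 - y).
    return pi**y * (1 - pi)**(1 - y)

def expfam_pmf(y, theta):
    # a(theta) = 1 - theta, b(y) = 1, Q(theta) = log[theta / (1 - theta)].
    a = 1 - theta
    b = 1.0
    Q = np.log(theta / (1 - theta))
    return a * b * np.exp(y * Q)

for pi in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert np.isclose(bern_pmf(y, pi), expfam_pmf(y, pi))
```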

In logistic regression, we take the link function to be the canonical link. That is, our systematic component is \begin{aligned} \log \left( \frac{\pi_i}{1 - \pi_i} \right) = \sum_{j=1}^p \beta_j x_{ij}, \quad i = 1, \dots, n. \end{aligned}
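To see the machinery end to end, here is a minimal Fisher-scoring (IRLS) fit of the logistic model on simulated data — a sketch of my own, with no intercept so that it matches the systematic component above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0])
pi = 1 / (1 + np.exp(-X @ beta_true))   # inverse logit gives the true means
y = rng.binomial(1, pi)

beta = np.zeros(p)
for _ in range(25):                      # Fisher scoring (= Newton here)
    mu = 1 / (1 + np.exp(-X @ beta))     # mu_i = E[Y_i]
    W = mu * (1 - mu)                    # Var(Y_i) under the model
    grad = X.T @ (y - mu)                # score vector
    H = X.T @ (W[:, None] * X)           # Fisher information
    beta = beta + np.linalg.solve(H, grad)

print(beta)  # should be close to beta_true
```

With the canonical link, Fisher scoring and Newton-Raphson coincide, which is one practical reason canonical links are popular.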

## Example 2: Poisson regression

Assume that our data is count data, and that the time period over which the count data was collected is the same across $i$. A simple model is to assume that $Y_i \sim \text{Poisson}(\mu_i)$. We can write the probability mass function as \begin{aligned} f(y_i ; \mu_i) = \frac{\mu_i^{y_i} e^{-\mu_i} }{y_i !} = e^{-\mu_i} \frac{1}{y_i !} \exp (y_i \log \mu_i). \end{aligned}

To match this to the earlier formula for exponential families, take $\theta_i = \mu_i$, $a(\theta_i) = \exp(-\theta_i)$, $b(y_i) = 1 / {y_i} !$ and $Q(\theta_i) = \log \theta_i$. The natural parameter for this family is $Q(\mu_i) = \log \mu_i$.
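As a quick numerical sanity check (again my own, not from the post), this factorization reproduces the Poisson PMF:

```python
import numpy as np
from math import factorial

def pois_pmf(y, mu):
    # Standard Poisson PMF: mu^y * exp(-mu) / y!.
    return mu**y * np.exp(-mu) / factorial(y)

def expfam_pmf(y, theta):
    # a(theta) = exp(-theta), b(y) = 1/y!, Q(theta) = log(theta).
    a = np.exp(-theta)
    b = 1 / factorial(y)
    Q = np.log(theta)
    return a * b * np.exp(y * Q)

for mu in (0.5, 2.0, 7.0):
    for y in (0, 1, 5):
        assert np.isclose(pois_pmf(y, mu), expfam_pmf(y, mu))
```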

In Poisson loglinear regression, we take the link function to be the canonical link, and so the systematic component is \begin{aligned} \log \mu_i = \sum_{j=1}^p \beta_j x_{ij}, \quad i = 1, \dots, n. \end{aligned}
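Here is a minimal Fisher-scoring (IRLS) fit of the Poisson loglinear model on simulated data — my own sketch, with no intercept to match the systematic component above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 2
X = rng.normal(size=(n, p))
beta_true = np.array([0.5, -0.3])
mu = np.exp(X @ beta_true)               # inverse of the log link
y = rng.poisson(mu)

beta = np.zeros(p)
for _ in range(25):                      # Fisher scoring / IRLS
    mu = np.exp(X @ beta)                # mu_i = E[Y_i] = Var(Y_i)
    grad = X.T @ (y - mu)                # score vector
    H = X.T @ (mu[:, None] * X)          # Fisher information
    beta = beta + np.linalg.solve(H, grad)

print(beta)  # should be close to beta_true
```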

## Extending the GLM to exponential dispersion families

For the random component of GLMs, we assume that $Y_i$ has the probability density (or mass) function of the form \begin{aligned} f(y_i ; \theta_i) = a(\theta_i) b(y_i) \exp [y_i Q(\theta_i)]. \end{aligned}

In some cases, it helps to add an additional parameter, called the dispersion parameter, to model the data more accurately. For example, by using the Poisson distribution for count data, we assume that $\text{Var}(Y_i) = \mu_i = \mathbb{E}[Y_i]$, which may not be the case.

With this new dispersion parameter $\phi$, it is common to write the PDF/PMF of $Y_i$ in the form \begin{aligned} f(y_i ; \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right]. \end{aligned} (Note that the $a$ and $b$ here are not the same functions as the $a$ and $b$ in the earlier exponential family form.)

When $\phi$ is known, the above reduces to the exponential family that we introduced at first, and we can use all the GLM machinery. Usually $\phi$ is not known; in practice, we estimate it first, then treat the estimate as known for the rest of the GLM procedure.
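One standard estimate of $\phi$ (not spelled out in the post; the sketch and its simulated setup are my own) is based on the Pearson statistic: $\hat\phi = \frac{1}{n-p} \sum_i (y_i - \hat\mu_i)^2 / V(\hat\mu_i)$, where $V$ is the model's variance function. For overdispersed counts with known means:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 1000, 1

# Overdispersed counts: negative binomial with mean mu and Var = mu + mu^2/size,
# so the true dispersion relative to the Poisson is 1 + mu/size = 2.5 here.
mu_hat = np.full(n, 3.0)   # pretend these are the fitted means from a GLM
size = 2.0
y = rng.negative_binomial(size, size / (size + mu_hat))

V = mu_hat                  # Poisson variance function: V(mu) = mu
phi_hat = np.sum((y - mu_hat)**2 / V) / (n - p)
print(phi_hat)  # well above 1, signaling overdispersion relative to the Poisson
```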

## References

1. Agresti, A. Categorical Data Analysis (3rd ed), Chapter 4.