*Generalized linear models (GLMs)* are significantly more complicated than ordinary linear models: there is more notation, there are more conceptual terms, and there is more confusion about what is random (or not) and what is known (or not). This post lays out the setup of a GLM in detail to clear up that confusion.

Assume that you have $n$ data points $(x_{i1}, \dots, x_{ip}, y_i)$ for $i = 1, \dots, n$. We want to build a model of the response $y$ using the other $p$ features $x_1, \dots, x_p$. Assume that the $x$ values are all fixed throughout the discussion.

A GLM consists of three components:

- A random component,
- A systematic component, and
- A link function.

**Random component**

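We assume that $y_1, \dots, y_n$ are samples of $n$ independent random variables $Y_1, \dots, Y_n$ respectively. We assume that $Y_i$ has the probability density (or mass) function of the form

$$f(y_i; \theta_i) = a(\theta_i) \, b(y_i) \exp \left[ y_i Q(\theta_i) \right].$$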

In the above, the form of $f$ (and hence, that of $a$, $b$ and $Q$) is assumed to be known. What is unknown are the $\theta_i$'s, which have to be estimated. The value of $\theta_i$ can vary across $i$.

The family of distributions above is known as an *exponential family*, and $Q(\theta)$ is called the *natural parameter*. If $Q(\theta) = \theta$, the exponential family is said to be in *canonical form*.

Let $\mu_i = \mathbb{E}[Y_i]$. Often we will not estimate the $\theta_i$'s directly, but rather some function of $\mu_i$ (as we will see soon).

**Systematic component**

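The systematic component relates some vector $(\eta_1, \dots, \eta_n)$ to the $p$ features. We assume that the relationship is given by

$$\eta_i = \sum_{j=1}^p \beta_j x_{ij}$$

for $i = 1, \dots, n$. $\beta_1, \dots, \beta_p$ are not known and have to be estimated. What are the $\eta_i$'s? Read on!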

**Link function**

The link function is a function $g$ such that $\eta_i = g(\mu_i)$ for $i = 1, \dots, n$, where $\mu_i = \mathbb{E}[Y_i]$. The function $g$ is assumed to be known, and is something which the data modeler picks.

If $g(\mu) = \mu$, the link function is called the *identity link*. If $g(\mu_i) = Q(\theta_i)$, i.e. the link maps the mean to the natural parameter, the link function is called the *canonical link*.

In a GLM, we wish to estimate the $\beta_j$'s. This in turn gives us an estimate for the $\eta_i$'s, which will give us an estimate for the $\mu_i$'s.
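The chain from coefficients to linear predictors to means can be sketched in a few lines of Python. This is only an illustration, writing $\eta_i$ for the linear predictor and $\mu_i$ for the mean of $Y_i$: the design matrix `X`, the coefficient values in `beta` and the choice of the log link are all made-up assumptions, and in practice the coefficients would be estimated from data rather than fixed by hand.

```python
import numpy as np

# Hypothetical fixed design matrix: n = 3 observations, p = 2 features
# (the first column of ones plays the role of an intercept).
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])

# Hypothetical coefficient values; in a real GLM these are estimated.
beta = np.array([0.5, -0.25])


def link(mu):
    """Log link g(mu) = log(mu), chosen here only for illustration."""
    return np.log(mu)


def inverse_link(eta):
    """Inverse of the log link: mu = exp(eta)."""
    return np.exp(eta)


# Systematic component: eta_i = sum_j beta_j * x_ij
eta = X @ beta

# Link function: eta_i = g(mu_i), so mu_i = g^{-1}(eta_i)
mu = inverse_link(eta)

print("eta:", eta)
print("mu: ", mu)
```

Swapping in a different link only requires changing `link` and `inverse_link`; the systematic component stays the same.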

**Example 1: Logistic regression**

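With *binary data*, we assume that $Y_i \sim \text{Bernoulli}(\pi_i)$, so that $\mu_i = \pi_i$. We can write the probability mass function as

$$f(y_i; \pi_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} = (1 - \pi_i) \exp \left[ y_i \log \frac{\pi_i}{1 - \pi_i} \right].$$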

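To match this to the earlier formula for exponential families, take $\theta_i = \pi_i$ with $a(\pi_i) = 1 - \pi_i$, $b(y_i) = 1$, and $Q(\pi_i) = \log \dfrac{\pi_i}{1 - \pi_i}$. The natural parameter for this family is $\log \dfrac{\pi_i}{1 - \pi_i}$, the log odds (logit) of $\pi_i$.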

In *logistic regression*, we take the link function to be the canonical link. That is, our systematic component is

$$\log \frac{\pi_i}{1 - \pi_i} = \sum_{j=1}^p \beta_j x_{ij}.$$
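As a sanity check on the matching step, the short sketch below compares the usual Bernoulli PMF with the factorized form $a(\pi) \, b(y) \exp[y \, Q(\pi)]$, assuming the exponential family is written as $f(y; \theta) = a(\theta) \, b(y) \exp[y \, Q(\theta)]$ (the form used in Agresti). The function names and the grid of test values are illustrative choices, not part of any library.

```python
import numpy as np


def bernoulli_pmf(y, pi):
    """Standard Bernoulli PMF: pi^y * (1 - pi)^(1 - y) for y in {0, 1}."""
    return pi**y * (1.0 - pi) ** (1 - y)


def exponential_family_pmf(y, pi):
    """The same PMF written as a(pi) * b(y) * exp(y * Q(pi))."""
    a = 1.0 - pi
    b = 1.0
    Q = np.log(pi / (1.0 - pi))  # natural parameter: the logit of pi
    return a * b * np.exp(y * Q)


# Compare the two forms on a small grid of (pi, y) values.
checks = [
    np.isclose(bernoulli_pmf(y, pi), exponential_family_pmf(y, pi))
    for pi in (0.1, 0.5, 0.9)
    for y in (0, 1)
]
print(all(checks))
```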

**Example 2: Poisson regression**

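Assume that our data is *count data*, and that the time period over which the count data was collected is the same across $i$. A simple model is to assume that $Y_i \sim \text{Poisson}(\mu_i)$. We can write the probability mass function as

$$f(y_i; \mu_i) = \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!} = e^{-\mu_i} \cdot \frac{1}{y_i!} \cdot \exp \left[ y_i \log \mu_i \right].$$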

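To match this to the earlier formula for exponential families, take $\theta_i = \mu_i$ with $a(\mu_i) = e^{-\mu_i}$, $b(y_i) = 1 / y_i!$, and $Q(\mu_i) = \log \mu_i$. The natural parameter for this family is $\log \mu_i$.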

In *Poisson loglinear regression*, we take the link function to be the canonical link, and so the systematic component is

$$\log \mu_i = \sum_{j=1}^p \beta_j x_{ij}.$$
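To make the split between what is fixed and what is random concrete, here is a minimal simulation sketch of a Poisson loglinear model. The design matrix, the "true" coefficients and the seed are all invented for illustration; the point is only that the features and coefficients are fixed numbers, while the responses are random draws whose means are determined by the log link.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Fixed (non-random) features: an intercept column and one covariate.
n = 5
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])

# Hypothetical "true" coefficients; in practice these are unknown.
beta = np.array([0.2, 0.3])

# Systematic component and canonical (log) link: log(mu_i) = eta_i.
eta = X @ beta
mu = np.exp(eta)

# Random component: Y_i ~ Poisson(mu_i), drawn independently.
y = rng.poisson(mu)

print("means: ", np.round(mu, 3))
print("counts:", y)
```

Rerunning with a different seed changes `y` but leaves `X`, `beta`, `eta` and `mu` untouched, which mirrors the setup above.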

**Extending the GLM to exponential dispersion families**

For the random component of GLMs, we assume that $Y_i$ has the probability density (or mass) function of the form

$$f(y_i; \theta_i) = a(\theta_i) \, b(y_i) \exp \left[ y_i Q(\theta_i) \right].$$

In some cases, it helps to add an additional parameter, called the *dispersion parameter*, to model the data more accurately. For example, by using the Poisson distribution for count data, we assume that $\text{Var}(Y_i) = \mathbb{E}[Y_i]$, which may not be the case.

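With this new dispersion parameter $\phi$, it is common to write the PDF/PMF of $Y_i$ in the form

$$f(y_i; \theta_i, \phi) = \exp \left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\},$$

where $a$, $b$ and $c$ are known functions (the $a$ and $b$ here are not the same functions as in the earlier exponential family formula).

When $\phi$ is known, the above reduces to the exponential family that we introduced at first, with natural parameter $\theta_i / a(\phi)$, and we can use all the GLM machinery. Usually $\phi$ is not known: what we can do is estimate it first, then treat it as known for the rest of the GLM procedure.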
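The post does not say how the dispersion parameter is estimated. One common choice, shown here purely as an assumption, is the Pearson-based estimate: the Pearson chi-squared statistic divided by the residual degrees of freedom. The sketch assumes the variance of each response equals the dispersion parameter times a variance function of the mean (for a Poisson-type model, the mean itself). The function name `pearson_dispersion`, the toy responses and the "fitted" means are hypothetical; in practice the fitted means would come from an actual GLM fit.

```python
import numpy as np


def pearson_dispersion(y, mu_hat, n_params, variance_function=lambda mu: mu):
    """Pearson-based dispersion estimate.

    Computes sum_i (y_i - mu_hat_i)^2 / V(mu_hat_i), divided by (n - n_params).
    The default V(mu) = mu corresponds to a Poisson-type variance function.
    """
    y = np.asarray(y, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    pearson_chi2 = np.sum((y - mu_hat) ** 2 / variance_function(mu_hat))
    return pearson_chi2 / (len(y) - n_params)


# Toy numbers (hypothetical): observed counts and fitted means from some
# model with two estimated coefficients.
y = [2, 4, 6, 8]
mu_hat = [3.0, 3.0, 7.0, 7.0]

phi_hat = pearson_dispersion(y, mu_hat, n_params=2)
print(phi_hat)
```

An estimate well above 1 would suggest overdispersion relative to the Poisson assumption, and an estimate well below 1 underdispersion.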

References:

- Agresti, A. Categorical Data Analysis (3rd ed), Chapter 4.
