Generalized linear models (GLMs) are significantly more complicated than ordinary linear models. There is more notation, more conceptual terms, and more confusion about what’s random (or not) and what’s known (or not). This post will lay out the setup of a GLM in detail to clarify any possible confusion.
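Assume that you have $n$ data points $(x_{i1}, \dots, x_{ip}, y_i) \in \mathbb{R}^{p+1}$ for $i = 1, \dots, n$. We want to build a model of the response $y$ using the $p$ other features $x_1, \dots, x_p$. Assume that the $x$ values are all fixed throughout the discussion.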
A GLM consists of three components:
- A random component,
- A systematic component, and
- A link function.
Random component
We assume that $y_1, \dots, y_n$ are samples of independent random variables $Y_1, \dots, Y_n$ respectively. We assume that $Y_i$ has a probability density (or mass) function of the form
$$f(y_i; \theta_i) = a(\theta_i) \, b(y_i) \exp[y_i Q(\theta_i)].$$
In the above, the form of $f$ (and hence, that of $a$, $b$ and $Q$) is assumed to be known. What is unknown are the $\theta_i$'s, which have to be estimated. The value of $\theta_i$ can vary across $i$.
The family of distributions above is known as an exponential family, and $Q(\theta)$ is called the natural parameter. If $Q(\theta) = \theta$, the exponential family is said to be in canonical form.
Let $\mu_i = \mathbb{E}[Y_i]$. Often we will not estimate the $\theta_i$'s directly, but rather some function of $\theta_i$ (as we will see soon).
Systematic component
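The systematic component relates some vector $(\eta_1, \dots, \eta_n)$ to the $p$ features. We assume that the relationship is given by
$$\eta_i = \sum_{j=1}^p \beta_j x_{ij}$$
for $i = 1, \dots, n$. The coefficients $\beta_1, \dots, \beta_p$ are not known and have to be estimated. What are the $\eta_i$'s? Read on!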
Link function
The link function is a function $g$ such that $\eta_i = g(\mu_i)$ for $i = 1, \dots, n$. The function $g$ is assumed to be known, and is something which the data modeler picks.
If $g(\mu) = \mu$, the link function is called the identity link. If $g(\mu_i) = Q(\theta_i)$, the link function is called the canonical link.
In a GLM, we wish to estimate the $\beta_j$'s. This in turn gives us an estimate for the $\eta_i$'s, which will give us an estimate for the $\mu_i$'s by inverting the link function: $\mu_i = g^{-1}(\eta_i)$.
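The chain from coefficients to linear predictors to means can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coefficient values and the small design matrix are made up, and the helper names `linear_predictor` and `fitted_means` are hypothetical, not part of any library. The link is passed in through its inverse, here either the identity link or a log link chosen as an example.

```python
import math

def linear_predictor(beta, x_row):
    """Systematic component: eta_i = sum_j beta_j * x_ij for one data point."""
    return sum(b * x for b, x in zip(beta, x_row))

def fitted_means(beta, X, inverse_link):
    """Map each row of X to mu_i = g^{-1}(eta_i)."""
    return [inverse_link(linear_predictor(beta, row)) for row in X]

# Hypothetical coefficients and fixed feature values (first column is an intercept).
beta = [0.5, -1.0]
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]

# Identity link: g(mu) = mu, so mu_i is just eta_i.
mu_identity = fitted_means(beta, X, lambda eta: eta)

# Log link (an illustrative choice): g(mu) = log(mu), so mu_i = exp(eta_i).
mu_log = fitted_means(beta, X, math.exp)
```

With the identity link the means can be negative, while the log link always returns positive means. That is one reason the choice of link is left to the modeler.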
Example 1: Logistic regression
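With binary data, we assume that $Y_i \sim \text{Bernoulli}(\pi_i)$. We can write the probability mass function as
$$f(y_i; \pi_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} = (1 - \pi_i) \exp\left[ y_i \log \frac{\pi_i}{1 - \pi_i} \right].$$
To match this to the earlier formula for exponential families, take $\theta_i = \pi_i$, $a(\pi_i) = 1 - \pi_i$, $b(y_i) = 1$ and $Q(\pi_i) = \log \frac{\pi_i}{1 - \pi_i}$. The natural parameter for this family is $\log \frac{\pi_i}{1 - \pi_i}$.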
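The match between the Bernoulli PMF $\pi^y (1 - \pi)^{1 - y}$ and the exponential family form $a(\theta) \, b(y) \exp[y Q(\theta)]$ can be checked numerically. The sketch below assumes the standard choices $a(\pi) = 1 - \pi$, $b(y) = 1$ and $Q(\pi) = \log \frac{\pi}{1 - \pi}$; the function names are hypothetical.

```python
import math

def bernoulli_pmf(y, pi):
    """Bernoulli PMF written directly: pi^y * (1 - pi)^(1 - y)."""
    return pi ** y * (1 - pi) ** (1 - y)

def bernoulli_exp_family(y, pi):
    """Same PMF in the form a(pi) * b(y) * exp(y * Q(pi))."""
    a = 1 - pi                      # a(pi)
    b = 1.0                         # b(y)
    Q = math.log(pi / (1 - pi))     # natural parameter (log odds)
    return a * b * math.exp(y * Q)

# Compare the two forms on a few probabilities and both outcomes.
checks = [
    math.isclose(bernoulli_pmf(y, pi), bernoulli_exp_family(y, pi))
    for pi in (0.1, 0.5, 0.9)
    for y in (0, 1)
]
```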
In logistic regression, we take the link function to be the canonical link. That is, our systematic component is
$$\log \frac{\pi_i}{1 - \pi_i} = \sum_{j=1}^p \beta_j x_{ij}.$$
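Solving the logistic regression systematic component for the success probability gives $\pi_i = 1 / (1 + e^{-\eta_i})$, where $\eta_i$ is the linear predictor. A minimal sketch, with made-up coefficients and a single made-up feature row, and with hypothetical function names:

```python
import math

def logit(pi):
    """Canonical link for Bernoulli data: log odds of pi."""
    return math.log(pi / (1 - pi))

def inverse_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical coefficients and one feature row (first entry is an intercept).
beta = [-1.0, 2.0]
x_row = [1.0, 0.5]

eta = sum(b * x for b, x in zip(beta, x_row))  # linear predictor
pi = inverse_logit(eta)                        # estimated success probability
```

Because the inverse logit maps any real number into $(0, 1)$, the estimated probability is always valid regardless of the coefficient values.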
Example 2: Poisson regression
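Assume that our data is count data, and that the time period over which the count data was collected is the same across $i$. A simple model is to assume that $Y_i \sim \text{Poisson}(\mu_i)$. We can write the probability mass function as
$$f(y_i; \mu_i) = \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!} = e^{-\mu_i} \cdot \frac{1}{y_i!} \cdot \exp[y_i \log \mu_i].$$
To match this to the earlier formula for exponential families, take $\theta_i = \mu_i$, $a(\mu_i) = e^{-\mu_i}$, $b(y_i) = 1 / y_i!$ and $Q(\mu_i) = \log \mu_i$. The natural parameter for this family is $\log \mu_i$.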
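As with the Bernoulli case, the Poisson factorization can be checked numerically. The sketch below assumes the standard choices $a(\mu) = e^{-\mu}$, $b(y) = 1 / y!$ and $Q(\mu) = \log \mu$; the function names are hypothetical.

```python
import math

def poisson_pmf(y, mu):
    """Poisson PMF written directly: exp(-mu) * mu^y / y!."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

def poisson_exp_family(y, mu):
    """Same PMF in the form a(mu) * b(y) * exp(y * Q(mu))."""
    a = math.exp(-mu)            # a(mu)
    b = 1 / math.factorial(y)    # b(y)
    Q = math.log(mu)             # natural parameter
    return a * b * math.exp(y * Q)

# Compare the two forms over a few means and counts.
checks = [
    math.isclose(poisson_pmf(y, mu), poisson_exp_family(y, mu))
    for mu in (0.5, 3.0, 10.0)
    for y in range(15)
]
```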
In Poisson loglinear regression, we take the link function to be the canonical link, and so the systematic component is
$$\log \mu_i = \sum_{j=1}^p \beta_j x_{ij}.$$
Extending the GLM to exponential dispersion families
For the random component of GLMs, we assume that $Y_i$ has the probability density (or mass) function of the form
$$f(y_i; \theta_i) = a(\theta_i) \, b(y_i) \exp[y_i Q(\theta_i)].$$
In some cases, it helps to add an additional parameter, called the dispersion parameter, to model the data more accurately. For example, by using the Poisson distribution for count data, we assume that $\mathbb{E}[Y_i] = \text{Var}(Y_i)$, which may not be the case.
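The restriction that the Poisson distribution forces the mean and the variance to be equal can be seen by computing both directly from the PMF. This is a rough numerical sketch: the rate `mu = 3.5` is an arbitrary illustrative value, and the infinite support is truncated at 60, where the remaining probability mass is negligible for this rate.

```python
import math

def poisson_pmf(y, mu):
    """Poisson PMF: exp(-mu) * mu^y / y!."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

mu = 3.5             # hypothetical rate, chosen only for illustration
support = range(60)  # truncated support; the tail beyond this is negligible here

# Mean and variance computed from the definition of expectation.
mean = sum(y * poisson_pmf(y, mu) for y in support)
variance = sum((y - mean) ** 2 * poisson_pmf(y, mu) for y in support)
```

Both quantities come out equal to `mu` up to floating-point error. Count data whose sample variance is far from its sample mean therefore cannot be described well by a plain Poisson model.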
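With this new dispersion parameter $\phi$, it is common to write the PDF/PMF of $Y_i$ in the form
$$f(y_i; \theta_i, \phi) = \exp\left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right],$$
where $a$, $b$ and $c$ are known functions (these are not the same $a$ and $b$ as in the earlier formula). When $\phi$ is known, the above reduces to the exponential family that we introduced at first, and we can use all the GLM machinery. Usually $\phi$ is not known: what we can do is estimate it first, then treat it as known for the rest of the GLM procedure.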
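A standard example of a family with a dispersion parameter is the normal distribution, where the variance plays the role of $\phi$. The sketch below assumes the dispersion form $\exp\left[ (y \theta - b(\theta)) / a(\phi) + c(y, \phi) \right]$ with the usual textbook choices $\theta = \mu$, $b(\theta) = \theta^2 / 2$, $a(\phi) = \phi = \sigma^2$ and $c(y, \phi) = -y^2 / (2\phi) - \frac{1}{2} \log(2 \pi \phi)$, and checks numerically that it reproduces the normal density. The function names are hypothetical.

```python
import math

def normal_pdf(y, mu, sigma2):
    """Normal density with mean mu and variance sigma2."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def normal_dispersion_form(y, theta, phi):
    """Same density written as exp[(y*theta - b(theta)) / a(phi) + c(y, phi)]."""
    b = theta ** 2 / 2                                        # b(theta)
    a = phi                                                   # a(phi)
    c = -y ** 2 / (2 * phi) - 0.5 * math.log(2 * math.pi * phi)  # c(y, phi)
    return math.exp((y * theta - b) / a + c)

# Compare the two forms on a small grid of values.
checks = [
    math.isclose(normal_pdf(y, mu, s2), normal_dispersion_form(y, mu, s2))
    for y in (-1.0, 0.0, 2.5)
    for mu in (-0.5, 1.0)
    for s2 in (0.5, 2.0)
]
```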
References:
- Agresti, A. Categorical Data Analysis (3rd ed), Chapter 4.