Log-likelihood of probit regression is globally concave

Assume we are in the supervised learning setting with n observations, where observation i consists of the response y_i and features (x_{i1},\dots, x_{ip}). A generalized linear model (GLM) consists of 3 components:

  1. A random component, or a family of distributions f indexed by \mu_i (usually an exponential family), such that y_i \sim f_{\mu_i},
  2. A systematic component \eta_i = \sum_{j=1}^p \beta_j x_{ij}, and
  3. A link function g such that \eta_i = g(\mu_i).

(See this previous post for more details of the components of a GLM.) The user gets to define the family of distributions f and the link function g, and \beta = (\beta_1, \dots, \beta_p)^T is the parameter to be determined by maximum likelihood estimation.

For one-dimensional exponential families with the canonical link function, it is known that the log-likelihood of the GLM is globally concave in \beta (see, for example, Reference 1). Hence, the MLE \hat\beta can be found using methods such as gradient descent or coordinate descent. When non-canonical links are used, the GLM’s log-likelihood is no longer guaranteed to be concave in \beta. However, in some situations we can still show that the log-likelihood is concave in \beta. In this post, we show that the log-likelihood for probit regression is concave in \beta.

In the probit regression model, \mu_i = \Phi (\eta_i) = \Phi \left( \sum_{j=1}^p \beta_j x_{ij} \right), where \Phi is the cumulative distribution function (CDF) of the standard normal distribution. The responses are binary with y_i \sim \text{Bern}(\mu_i). The likelihood function is

\begin{aligned} L(\beta) &= \prod_{i=1}^n \mu_i^{y_i} (1- \mu_i)^{1 - y_i} \\  &= \prod_{i=1}^n [\Phi(x_i^T \beta)]^{y_i} [1 - \Phi(x_i^T \beta)]^{1 - y_i}, \end{aligned}

and the log-likelihood function is

\begin{aligned} \ell(\beta) = \sum_{i=1}^n y_i \log [\Phi(x_i^T \beta)] + (1-y_i) \log [1 - \Phi(x_i^T \beta)].  \end{aligned}
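
Before proving concavity, here is a minimal Python sketch (simulated data, hypothetical names) of how this log-likelihood can be coded and maximized numerically; norm.logcdf is used for numerical stability, and 1 - \Phi(\eta) is computed as \Phi(-\eta). Because of the concavity shown below, any local maximum found by a generic optimizer is the global MLE.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p = 500, 2
X = rng.normal(size=(n, p))                      # features x_{ij} (placeholder)
beta_true = np.array([1.0, -0.5])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)  # probit responses

def neg_log_lik(beta):
    eta = X @ beta
    # y_i log Phi(eta_i) + (1 - y_i) log[1 - Phi(eta_i)], with 1 - Phi(eta) = Phi(-eta)
    return -np.sum(y * norm.logcdf(eta) + (1 - y) * norm.logcdf(-eta))

fit = minimize(neg_log_lik, x0=np.zeros(p), method="BFGS")
print(fit.x)                                     # should be close to beta_true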

To show that \ell is concave in \beta, we make two reductions:

  1. Since the sum of concave functions is concave, it is enough to show that \beta \mapsto y \log [\Phi(x^T \beta)] + (1-y) \log [1 - \Phi(x^T \beta)] is concave.
  2. Since composition with an affine function preserves concavity, it is enough to show that f(x) = y \log [\Phi(x)] + (1-y) \log [1 - \Phi(x)] is concave in x. (Here, x \in \mathbb{R}.)

From here, we can show that f is concave by showing that its second derivative is negative, i.e. f''(x) < 0 for all x. Since y takes only the values 0 and 1, we consider the two cases separately.

Let \phi denote the probability density function of the standard normal distribution. Recall that \phi'(x) = -x \phi(x). When y = 1,

\begin{aligned} f'(x) &= \dfrac{\phi(x)}{\Phi(x)}, \\  f''(x)&= \dfrac{\Phi(x) [-x \phi(x)] - \phi(x)^2}{\Phi(x)^2} \\  &= \dfrac{\phi(x)}{\Phi(x)^2}[- x \Phi(x) - \phi(x)] \\  &< 0, \end{aligned}

since x \Phi(x) + \phi(x) > 0 for all x (see this previous post for a proof).

When y = 0,

\begin{aligned} f'(x) &= \dfrac{-\phi(x)}{1-\Phi(x)}, \\  f''(x) &= \dfrac{[1- \Phi(x)][x \phi(x)] + \phi(x) [-\phi(x)]}{[1 - \Phi(x)]^2} \\  &= -\dfrac{\phi(x)}{[1-\Phi(x)]^2} \left[ -x + x \Phi(x) + \phi(x) \right]. \end{aligned}

To show concavity of f, it remains to show that x\Phi(x) + \phi(x) > x for all x. But this is true: see this previous post for a proof.
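
As a quick numerical sanity check (not a substitute for the proof), the two inequalities used above can be evaluated on a grid; the quantity x \Phi(x) + \phi(x) - x is computed as \phi(x) - x [1 - \Phi(x)] via the survival function to avoid cancellation for large x.

import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 2001)
g1 = x * norm.cdf(x) + norm.pdf(x)          # x Phi(x) + phi(x), used in the y = 1 case
g2 = norm.pdf(x) - x * norm.sf(x)           # x Phi(x) + phi(x) - x, used in the y = 0 case
print(g1.min() > 0, g2.min() > 0)           # both should print True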

References:

  1. Chapter 9: Generalized Linear Models.

Variance of coefficients for linear models and generalized linear models

Variance of coefficients in linear models

Assume we are in the supervised learning setting with design matrix {\bf X} \in \mathbb{R}^{n \times p} and response y \in \mathbb{R}^n. If the linear regression model is true, i.e.

y = {\bf X} \beta + \epsilon,

where \beta \in \mathbb{R}^p is the vector of coefficients and the entries of \epsilon \in \mathbb{R}^n are i.i.d. with mean 0 and variance \sigma^2, it is well-known that the ordinary least squares estimate of \beta (which is also the maximum likelihood estimate (MLE) when the errors are Gaussian) is

\hat\beta = ({\bf X}^T{\bf X})^{-1} {\bf X}^T y.

Assuming {\bf X} is fixed and that \epsilon is the only source of randomness, it follows easily that

Var(\hat\beta) = \sigma^2 ({\bf X}^T{\bf X})^{-1}.

It’s remarkable that (i) we have a closed form solution for the variance of \hat\beta, and (ii) this variance does not depend on the value of \beta.
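
A short simulation sketch (hypothetical {\bf X}, \beta and \sigma) comparing the formula above with the empirical covariance of \hat\beta over repeated draws of \epsilon, with {\bf X} held fixed:

import numpy as np

rng = np.random.default_rng(42)
n, p, sigma = 100, 3, 2.0
X = rng.normal(size=(n, p))                       # fixed design matrix
beta = np.array([1.0, -2.0, 0.5])
theory = sigma**2 * np.linalg.inv(X.T @ X)        # sigma^2 (X^T X)^{-1}

B = 5000
estimates = np.empty((B, p))
for b in range(B):
    y = X @ beta + sigma * rng.normal(size=n)     # new noise, same X
    estimates[b] = np.linalg.lstsq(X, y, rcond=None)[0]

empirical = np.cov(estimates, rowvar=False)       # should be close to `theory`
print(np.round(theory, 3))
print(np.round(empirical, 3))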

Variance of coefficients in generalized linear models (GLMs)

We are not so lucky on either count in the generalized linear model (GLM) setting. Assume that the true data generating model is given by a GLM, i.e.

\begin{aligned} Y_i &\stackrel{ind.}{\sim} f(y_i ; \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right], \\  \eta_i &= \sum_{j=1}^p \beta_j x_{ij}, \text{ and } \\  \eta_i &= g (\mu_i),  \end{aligned}

where i = 1, \dots, n indexes the observations, \mu_i = \mathbb{E}[Y_i], and g is the link function. (See this previous post for details.)

In this setting, we do not have a closed form solution for \hat\beta, the MLE of \beta. (See this previous post for the likelihood equations and ways to obtain the MLE numerically.) We are also not able to get a closed form solution for the variance of \hat\beta. We can however get a closed form solution for the asymptotic variance of \hat\beta:

\begin{aligned} Var( \hat\beta) &= ({\bf X}^T {\bf WX})^{-1}, \text{ where} \\  {\bf W}_{ij} &= \begin{cases} \left( \dfrac{\partial \mu_i}{\partial \eta_i} \right)^2 / Var(Y_i) &\text{if } i = j, \\ 0 &\text{otherwise}. \end{cases} \end{aligned}
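
For a concrete (if hypothetical) example, in logistic regression with the canonical logit link we have \partial \mu_i / \partial \eta_i = \mu_i (1 - \mu_i) = Var(Y_i), so the diagonal weights reduce to w_{ii} = \mu_i(1 - \mu_i). A minimal sketch of the estimated asymptotic covariance, assuming some fitted coefficients beta_hat are already available (both X and beta_hat below are placeholders):

import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))                 # design matrix (placeholder)
beta_hat = np.array([0.5, -1.0, 0.25])      # fitted coefficients (placeholder)

mu = 1 / (1 + np.exp(-(X @ beta_hat)))      # fitted means
w = mu * (1 - mu)                           # w_ii for the logit link
info = X.T @ (w[:, None] * X)               # X^T W X
asym_cov = np.linalg.inv(info)              # estimated asymptotic covariance of beta_hat
std_errs = np.sqrt(np.diag(asym_cov))       # standard errors
print(std_errs)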

References:

  1. Agresti, A. (2013). Categorical Data Analysis (3rd ed), Section 4.4.8.

Derivation of GLM likelihood equations

In this previous post, I stated the likelihood equations (or score equations) for generalized linear models (GLMs). Any solution to the score equations is a maximum likelihood estimator (MLE) for the GLM. In this post, I work through the derivation of the score equations.

Recap: What is a GLM?

Assume we have n data points (x_{i1}, \dots, x_{ip}, y_i) \in \mathbb{R}^{p+1} for i = 1, \dots, n. We want to build a generalized linear model (GLM) of the response y using the p other features X_1, \dots, X_p. To that end, assume that the x values are all fixed. Assume that y_1, \dots, y_n are samples of independent random variables Y_1, \dots, Y_n which have the probability density (or mass) function of the form

\begin{aligned} f(y_i ; \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right]. \qquad -(1) \end{aligned}

In the above, the form of f is known, but not the values of the \theta_i’s and \phi.

Let \mu_i = \mathbb{E} [Y_i]. We assume that

\begin{aligned} \eta_i = \sum_{j=1}^p \beta_j x_{ij}, \quad i = 1, \dots, n, \qquad -(2) \end{aligned}

where \eta_i = g(\mu_i) for some link function g, assumed to be known.

Recap: The likelihood/score equations

The goal of fitting a GLM is to find estimates for \beta_1, \dots, \beta_p. More specifically, we want to find the values of \beta_1, \dots, \beta_p which maximize the (log)-likelihood of the data:

\begin{aligned} L = \sum_{i=1}^n L_i = \sum_{i=1}^n \log f(y_i ; \theta_i, \phi). \end{aligned}

To do that, we differentiate L w.r.t. \beta and set it to be 0. Using the chain rule as well as the form of f, after some algebra we have

\begin{aligned} 0 = \frac{\partial L_i}{\partial \beta_j} = \frac{\partial L_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta_j} = \dots = \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i}. \qquad -(3) \end{aligned}

Hence, the likelihood equations (or score equations) are

\begin{aligned} \sum_{i=1}^n \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i} = 0, \quad j = 1, \dots, p. \qquad -(4) \end{aligned}

\beta appears implicitly in the equations above through \mu_i: \mu_i = g^{-1}(\sum_{j=1}^p \beta_j x_{ij}). (See the original post for a matrix form of these equations.)
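
As a numerical sanity check (simulated data, hypothetical names), consider Poisson regression with the canonical log link, for which Var(Y_i) = \partial \mu_i / \partial \eta_i = \mu_i, so that (4) reduces to \sum_i (y_i - \mu_i) x_{ij} = 0. The sketch below compares this score with a finite-difference gradient of the log-likelihood.

import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta = np.array([0.3, -0.2, 0.1])           # an arbitrary evaluation point
y = rng.poisson(np.exp(X @ np.array([0.2, -0.1, 0.3])))

def log_lik(b):
    eta = X @ b
    # sum of y_i eta_i - exp(eta_i), dropping the -log(y_i!) terms (free of beta)
    return np.sum(y * eta - np.exp(eta))

mu = np.exp(X @ beta)
score = X.T @ (y - mu)                      # score formula for the canonical link

eps = 1e-6
fd = np.array([(log_lik(beta + eps * e) - log_lik(beta - eps * e)) / (2 * eps)
               for e in np.eye(p)])         # central finite differences
print(np.allclose(score, fd, rtol=1e-4))    # should print True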

Likelihood equations: A derivation

All this involves is evaluating the 4 partial derivatives in (3) and multiplying them together.

Using the form of the probability density function for Y_i (1):

\begin{aligned} L_i &= \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi), \\  \dfrac{\partial L_i}{\partial \theta_i} &= \frac{y_i - b'(\theta_i)}{a(\phi)}. \end{aligned}

The second partial derivative, \partial \theta_i / \partial \mu_i, is probably the trickiest of the lot: simplifying it requires some properties of exponential families. Under general regularity conditions, we have

\begin{aligned} \mathbb{E} \left( \dfrac{\partial L}{\partial \theta} \right) &= 0, &-(5) \\  - \mathbb{E} \left( \dfrac{\partial^2 L}{\partial \theta^2} \right) &= \mathbb{E} \left[ \left( \dfrac{\partial L}{\partial \theta} \right)^2 \right]. &-(6) \end{aligned}

Applying (5) to our setting:

\begin{aligned} \mathbb{E} \left[ \frac{Y_i - b'(\theta_i)}{a(\phi)} \right] &= 0, \\  \mu_i = \mathbb{E}[Y_i] &= b'(\theta_i), \text{ and } \quad \dfrac{\partial \mu_i}{\partial \theta_i} = b''(\theta_i). \end{aligned}

Applying (6) to our setting:

\begin{aligned} - \dfrac{b''(\theta_i)}{a(\phi)} = \mathbb{E} \left( \dfrac{\partial^2 L_i}{\partial \theta_i^2} \right) &= -\mathbb{E} \left[\left( \dfrac{\partial L_i}{\partial \theta_i} \right)^2 \right] = -\mathbb{E} \left[ \left( \frac{Y_i - b'(\theta_i)}{a(\phi)} \right)^2 \right], \\  \dfrac{b''(\theta_i)}{a(\phi)} &= \dfrac{\mathbb{E}[(Y_i - \mu_i)^2]}{a(\phi)^2}, \\  b''(\theta_i) &= \text{Var}(Y_i) / a(\phi). \end{aligned}

Thus,

\begin{aligned} \dfrac{\partial \mu_i}{\partial \theta_i} &= b''(\theta_i) = \text{Var}(Y_i) / a(\phi), \\  \dfrac{\partial \theta_i}{\partial \mu_i} &= a(\phi) / \text{Var}(Y_i). \end{aligned}

The third partial derivative, \partial \mu_i / \partial \eta_i, actually appears in the likelihood equations so we don’t have to do any algebraic manipulation here. Finally, using the systematic component of the GLM (2), we have

\begin{aligned} \dfrac{\partial \eta_i}{\partial \beta_j} = x_{ij}. \end{aligned}

Putting these 4 parts together:

\begin{aligned} \frac{\partial L_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta_j} &= \frac{y_i - b'(\theta_i)}{a(\phi)} \cdot \frac{a(\phi)}{\text{Var}(Y_i)} \cdot \frac{\partial \mu_i}{\partial \eta_i} \cdot x_{ij} \\  &= \dfrac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)}\frac{\partial \mu_i}{\partial \eta_i}, \end{aligned}

as required.
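
As a concrete example of the identities used above, consider the Poisson distribution written in the form (1) with a(\phi) = 1:

\begin{aligned} f(y_i ; \theta_i) = \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i !} = \exp \left[ y_i \log \mu_i - \mu_i - \log (y_i!) \right], \end{aligned}

so that \theta_i = \log \mu_i, b(\theta_i) = e^{\theta_i} and c(y_i, \phi) = -\log(y_i!). Then b'(\theta_i) = e^{\theta_i} = \mu_i and b''(\theta_i) = e^{\theta_i} = \mu_i = \text{Var}(Y_i), consistent with \mu_i = b'(\theta_i) and b''(\theta_i) = \text{Var}(Y_i) / a(\phi).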

References:

  1. Agresti, A. (2013). Categorical Data Analysis (3rd ed), Chapter 4.

Fitting a generalized linear model (GLM)

Assume that you have n data points (x_{i1}, \dots, x_{ip}, y_i) \in \mathbb{R}^{p+1} for i = 1, \dots, n. We want to build a generalized linear model (GLM) of the response y using the p other features X_1, \dots, X_p.

To that end, assume that the x values are all fixed. Assume that y_1, \dots, y_n are samples of independent random variables Y_1, \dots, Y_n which have the probability density (or mass) function of the form

\begin{aligned} f(y_i ; \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right]. \end{aligned}

In the above, the form of f is known, but not the values of the \theta_i’s and \phi.

Let \mu_i = \mathbb{E} [Y_i]. We assume that

\begin{aligned} \eta_i = \sum_{j=1}^p \beta_j x_{ij}, \quad i = 1, \dots, n, \end{aligned}

where \eta_i = g(\mu_i) for some link function g, assumed to be known.

The goal of fitting a GLM is to find estimates for \beta_1, \dots, \beta_p, which in turn give us estimates for \eta_i = g (\mu_i) = g(\mathbb{E}[Y_i]).

Likelihood equations

More specifically, we want to find the values of \beta_1, \dots, \beta_p which maximize the (log)-likelihood of the data:

\begin{aligned} L = \sum_{i=1}^n L_i = \sum_{i=1}^n \log f(y_i ; \theta_i, \phi). \end{aligned}

To do that, we differentiate L w.r.t. \beta and set it to be 0. Using the chain rule as well as the form of f, after some algebra we have

\begin{aligned} 0 = \frac{\partial L_i}{\partial \beta_j} = \frac{\partial L_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta_j} = \dots = \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i}.  \end{aligned}

Hence, the likelihood equations (or score equations) are

\begin{aligned} \sum_{i=1}^n \frac{(y_i - \mu_i)x_{ij}}{\text{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i} = 0, \quad j = 1, \dots, p. \end{aligned}

\beta appears implicitly in the equations above through \mu_i: \mu_i = g^{-1}(\sum_{j=1}^p \beta_j x_{ij}).

If we let {\bf W} \in \mathbb{R}^{n \times n} be the diagonal matrix such that w_{ii} = \left(\dfrac{\partial \mu_i}{\partial \eta_i}\right)^2 / \text{Var}(Y_i), and {\bf D} \in \mathbb{R}^{n \times n} be the diagonal matrix such that d_{ii} = \dfrac{\partial \mu_i}{\partial \eta_i}, then we can rewrite the likelihood equations in matrix form:

\begin{aligned} {\bf X}^T {\bf WD}^{-1} (y - \mu) = 0. \end{aligned}

Newton-Raphson method

Unfortunately there isn’t a closed form solution for \beta (except in very special cases). The Newton-Raphson method is an iterative method that can be used instead. At each iteration, we make a quadratic approximation of the log-likelihood and maximize that. More specifically, let u \in \mathbb{R}^p be the gradient \partial L / \partial \beta, let {\bf H} \in \mathbb{R}^{p \times p} be the Hessian matrix, and assume that \beta^{(t)} is our guess for \hat{\beta} at time t. (We can think of u and \bf H as functions of \beta.) Let u^{(t)} and {\bf H}^{(t)} be the evaluations of u and \bf H at \beta^{(t)}. Taylor expansion of L around \beta^{(t)} gives

\begin{aligned} L(\beta) \approx L(\beta^{(t)}) + (u^{(t)})^T (\beta - \beta^{(t)}) + \frac{1}{2} (\beta - \beta^{(t)})^T {\bf H}^{(t)} (\beta - \beta^{(t)}). \end{aligned}

Taking a derivative w.r.t. \beta and setting it to 0, we can rearrange terms to solve for \beta, giving us the next guess

\begin{aligned} \beta^{(t+1)} = \beta^{(t)} - ({\bf H}^{(t)})^{-1} u^{(t)}. \end{aligned}

Why use Newton-Raphson? It usually converges quickly. It has second-order convergence, i.e. for large enough t, |\beta_j^{(t+1)} - \hat{\beta}_j| \leq c |\beta_j^{(t)} - \hat{\beta}_j|^2 for all j and some c > 0.
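
A minimal sketch of the update in Python, taking the gradient and Hessian as user-supplied callables (hypothetical placeholders); solving a linear system is preferable to explicitly inverting {\bf H}:

import numpy as np

def newton_raphson(grad, hess, beta0, max_iter=25, tol=1e-8):
    """Iterate beta <- beta - H^{-1} u until the step becomes negligible."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        u = grad(beta)                    # gradient at the current guess
        H = hess(beta)                    # Hessian at the current guess
        step = np.linalg.solve(H, u)      # solves H step = u
        beta = beta - step
        if np.linalg.norm(step) < tol:
            break
    return beta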

Fisher scoring method

In Newton-Raphson, we used the observed Hessian matrix \bf H evaluated at the current guess \beta^{(t)}. In Fisher scoring, we use the information matrix instead, which you can think of as (the negative of) the expectation of the Hessian. If the elements of the Hessian are h_{ab} = \partial^2 L(\beta) / \partial \beta_a \partial \beta_b, the elements of the information matrix \bf J are j_{ab} = -\mathbb{E} [\partial^2 L(\beta) / \partial \beta_a \partial \beta_b].

The updates for \beta under Fisher scoring are

\begin{aligned} \beta^{(t+1)} = \beta^{(t)} + ({\bf J}^{(t)})^{-1} u^{(t)}. \end{aligned}

If you go through the algebra for the elements of the Fisher information matrix, you will find that \bf J has a very nice expression: it is simply {\bf J} = {\bf X}^T \bf WX. (We had previously defined \bf W: a diagonal matrix with entries w_{ii} = \left(\dfrac{\partial \mu_i}{\partial \eta_i}\right)^2 / \text{Var}(Y_i).)

If we are using the canonical link function, the observed Hessian matrix and the information matrix are the same, and so Fisher scoring is exactly the same as the Newton-Raphson method.

Why use this over Newton-Raphson? The inverse of the information matrix is also the asymptotic covariance of \hat{\beta}, so Fisher scoring gives you this as a by-product. It is also closely related to weighted least squares (as we will see below).

Fisher scoring as iterated reweighted least squares (IRLS)

Multiplying both sides of the Fisher scoring update by {\bf J}^{(t)}, we get

\begin{aligned} ({\bf J}^{(t)})\beta^{(t+1)} = u^{(t)} + ({\bf J}^{(t)})\beta^{(t)}. \end{aligned}

Using expression for \bf J and u that we had before, the above is the same as

\begin{aligned} ({\bf X}^T {\bf W}^{(t)} {\bf X}) \beta^{(t+1)} &= {\bf X}^T {\bf W}^{(t)} ({\bf D}^{(t)})^{-1}(y - \mu^{(t)}) + ({\bf X}^T {\bf W}^{(t)} {\bf X}) \beta^{(t)} \\ &= {\bf X}^T {\bf W}^{(t)} [{\bf X}\beta^{(t)} +  ({\bf D}^{(t)})^{-1}(y - \mu^{(t)})] \\ &= {\bf X}^T {\bf W}^{(t)} z^{(t)}, \end{aligned}

where z^{(t)} = {\bf X}\beta^{(t)} +  ({\bf D}^{(t)})^{-1}(y - \mu^{(t)}).

Looking closely at the last line, we recognize these as the normal equations for weighted least squares, where the response is z^{(t)} and the weight matrix is {\bf W}^{(t)}.

We can think of z^{(t)} as a linearized form of the link function g evaluated at y:

\begin{aligned} z_i^{(t)} &= \eta_i^{(t)} + (y_i - \mu_i^{(t)}) \frac{\partial \eta_i^{(t)}}{\partial \mu_i^{(t)}} \\ &= g(\mu_i^{(t)}) + (y_i - \mu_i^{(t)}) g'(\mu_i^{(t)}) \\ &\approx g(y_i). \end{aligned}

This connection gives us a way to implement the Fisher scoring method using routines for solving weighted least squares.
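
As an illustration (simulated data, hypothetical names), here is a minimal IRLS sketch for logistic regression with the canonical logit link, where {\bf W} and {\bf D} share the diagonal entries \mu_i(1 - \mu_i):

import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -0.5, 0.25])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(p)
for _ in range(25):
    eta = X @ beta
    mu = 1 / (1 + np.exp(-eta))
    w = mu * (1 - mu)                     # diagonal of W (equal to D for the logit link)
    z = eta + (y - mu) / w                # working response z = X beta + D^{-1}(y - mu)
    # Weighted least squares step: solve (X^T W X) beta = X^T W z
    beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    if np.allclose(beta_new, beta, atol=1e-10):
        beta = beta_new
        break
    beta = beta_new

print(beta)                               # should be close to the logistic regression MLE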

References:

  1. Agresti, A. (2013). Categorical Data Analysis (3rd ed), Chapter 4.

Understanding the components of a generalized linear model (GLM)

Generalized linear models (GLMs) are significantly more complicated than ordinary linear models. There is more notation, more terminology, and more confusion about what’s random (or not) and what’s known (or not). This post will lay out the setup of a GLM in detail to clarify any possible confusion.

Assume that you have n data points (x_{i1}, \dots, x_{ip}, y_i) \in \mathbb{R}^{p+1} for i = 1, \dots, n. We want to build a model of the response y using the p other features X_1, \dots, X_p. Assume that the x values are all fixed throughout the discussion.

A GLM consists of three components:

  1. A random component,
  2. A systematic component, and
  3. A link function.

Random component

We assume that y_1, \dots, y_n are samples of independent random variables Y_1, \dots, Y_n respectively. We assume that Y_i has the probability density (or mass) function of the form

\begin{aligned} f(y_i ; \theta_i) = a(\theta_i) b(y_i) \exp [y_i Q(\theta_i)]. \end{aligned}

In the above, the form of f (and hence, that of a, b and Q) is assumed to be known. The \theta_i’s are unknown and have to be estimated; the value of \theta_i can vary across i.

The family of distributions above is known as an exponential family, and Q(\theta) is called the natural parameter. If Q(\theta) = \theta, the exponential family is said to be in canonical form.

Let \mu_i = \mathbb{E} [Y_i]. Often we will not estimate the \theta_i’s directly, but rather some function of \mu_i (as we will see soon).

Systematic component

The systematic component relates some vector (\eta_1, \dots, \eta_n) to the p features. We assume that the relationship is given by

\begin{aligned} \eta_i = \sum_{j=1}^p \beta_j x_{ij} \end{aligned}

for i = 1, \dots, n. The coefficients \beta_1, \dots, \beta_p are not known and have to be estimated. What are the \eta_i’s? Read on!

Link function

The link function is a function g such that \eta_i = g(\mu_i) for i = 1, \dots, n. The function g is assumed to be known, and is something which the data modeler picks.

If \eta_i = g(\mu_i) = \mu_i, the link function is called the identity link. If \eta_i = g(\mu_i) = Q(\theta_i), the link function is called the canonical link.

In a GLM, we wish to estimate the \beta_j’s. This in turn gives us an estimate for the g(\mu_i)’s, which will give us an estimate for the \mu_i’s.

Example 1: Logistic regression

With binary data, we assume that Y_i \sim \text{Bern}(\pi_i). We can write the probability mass function as

\begin{aligned} f(y_i; \pi_i) &= \pi_i^{y_i} (1-\pi_i)^{1-y_i} \\  &= (1 - \pi_i) [\pi_i / (1 - \pi_i)]^{y_i} \\  &= (1 - \pi_i) \exp \left[ y_i \log \left( \frac{\pi_i}{1 - \pi_i} \right) \right]. \end{aligned}

To match this to the earlier formula for exponential families, take \theta_i = \pi_i, a(\theta_i) = 1 - \theta_i, b(y_i) = 1 and Q(\theta_i) = \log [\theta_i / (1 - \theta_i)]. The natural parameter for this family is Q(\pi_i) = \log [\pi_i / (1 - \pi_i)].

In logistic regression, we take the link function to be the canonical link. That is, our systematic component is

\begin{aligned} \log \left( \frac{\pi_i}{1 - \pi_i} \right) = \sum_{j=1}^p \beta_j x_{ij}, \quad i = 1, \dots, n. \end{aligned}

Example 2: Poisson regression

Assume that our data is count data, and that the time period over which the count data was collected is the same across i. A simple model is to assume that Y_i \sim \text{Poisson}(\mu_i). We can write the probability mass function as

\begin{aligned} f(y_i ; \mu_i) = \frac{\mu_i^{y_i} e^{-\mu_i} }{y_i !} = e^{-\mu_i} \frac{1}{y_i !} \exp (y_i \log \mu_i). \end{aligned}

To match this to the earlier formula for exponential families, take \theta_i = \mu_i, a(\theta_i) = \exp(-\theta_i), b(y_i) = 1 / {y_i} ! and Q(\theta_i) = \log \theta_i. The natural parameter for this family is Q(\mu_i) = \log \mu_i.

In Poisson loglinear regression, we take the link function to be the canonical link, and so the systematic component is

\begin{aligned} \log \mu_i = \sum_{j=1}^p \beta_j x_{ij}, \quad i = 1, \dots, n. \end{aligned}
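
For instance, assuming the statsmodels package is available, a Poisson loglinear model can be fit along these lines (simulated data for illustration; the Poisson family uses the canonical log link by default):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = rng.poisson(np.exp(0.5 + 0.3 * x))    # counts with a log-linear mean

X = sm.add_constant(x)                    # adds an intercept column
model = sm.GLM(y, X, family=sm.families.Poisson())
result = model.fit()
print(result.params)                      # estimates of the intercept and slope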

Extending the GLM to exponential dispersion families

For the random component of GLMs, we assume that Y_i has the probability density (or mass) function of the form

\begin{aligned} f(y_i ; \theta_i) = a(\theta_i) b(y_i) \exp [y_i Q(\theta_i)]. \end{aligned}

In some cases, it helps to add an additional parameter, called the dispersion parameter, to model the data more accurately. For example, by using the Poisson distribution for count data, we assume that \text{Var}(Y_i) = \mu_i = \mathbb{E}[Y_i], which may not be the case.

With this new dispersion parameter \phi, it is common to write the PDF/PMF of Y_i in the form

\begin{aligned} f(y_i ; \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right]. \end{aligned}

When \phi is known, the above reduces to the exponential family that we introduced at first, and we can use all the GLM machinery. Usually \phi is not known: what we can do is estimate it first, then treat it as known for the rest of the GLM procedure.
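
For example, the normal distribution with mean \mu_i and dispersion \phi = \sigma^2 fits this form:

\begin{aligned} f(y_i ; \mu_i, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left[ -\frac{(y_i - \mu_i)^2}{2\sigma^2} \right] = \exp \left[ \frac{y_i \mu_i - \mu_i^2 / 2}{\sigma^2} - \frac{y_i^2}{2\sigma^2} - \frac{1}{2} \log (2\pi\sigma^2) \right], \end{aligned}

with \theta_i = \mu_i, b(\theta_i) = \theta_i^2 / 2, a(\phi) = \phi = \sigma^2 and c(y_i, \phi) = -y_i^2 / (2\phi) - \log(2\pi\phi)/2.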

References:

  1. Agresti, A. (2013). Categorical Data Analysis (3rd ed), Chapter 4.