Log-likelihood of probit regression is globally concave

Assume we are in the supervised learning setting with n observations, where observation i consists of the response y_i and features (x_{i1},\dots, x_{ip}). A generalized linear model (GLM) consists of three components:

  1. A random component, or a family of distributions f indexed by \mu_i (usually an exponential family), such that y_i \sim f_{\mu_i},
  2. A systematic component \eta_i = \sum_{j=1}^p \beta_j x_{ij}, and
  3. A link function g such that \eta_i = g(\mu_i).

(See this previous post for more details of the components of a GLM.) The user gets to define the family of distributions f and the link function g, and \beta = (\beta_1, \dots, \beta_p)^T is the parameter to be determined by maximum likelihood estimation.

For one-dimensional exponential families with the canonical link function, it is known that the log-likelihood of the GLM is globally concave in \beta (see, for example, Reference 1). Hence, the MLE \hat\beta can be found using methods such as gradient descent or coordinate descent. When non-canonical links are used, the GLM's log-likelihood is no longer guaranteed to be concave in \beta. However, in some situations we can still show that the log-likelihood is concave in \beta. In this post, we show that the log-likelihood for probit regression is concave in \beta.
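As an aside, this concavity is exactly what makes simple first-order methods reliable here. Below is a minimal sketch, not production code, that fits the probit model (defined formally in the next section) by plain gradient ascent; the synthetic data, step size, and iteration count are all illustrative assumptions. The gradient follows from differentiating the log-likelihood derived below: \nabla \ell(\beta) = \sum_{i=1}^n \dfrac{\phi(\eta_i)(y_i - \mu_i)}{\mu_i (1 - \mu_i)} x_i.

```python
# Minimal sketch: probit regression fit by gradient ascent.
# All data and tuning constants below are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -1.0, 0.5])
# Probit data-generating process: y = 1 iff a latent normal variable crosses 0.
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)

beta = np.zeros(p)
step = 1e-3  # fixed step size, chosen ad hoc for this example
for _ in range(2000):
    eta = X @ beta
    mu = norm.cdf(eta)
    # Gradient of the probit log-likelihood:
    # sum_i phi(eta_i) * (y_i - mu_i) / (mu_i * (1 - mu_i)) * x_i.
    w = norm.pdf(eta) * (y - mu) / np.clip(mu * (1.0 - mu), 1e-12, None)  # clip guards against underflow
    beta = beta + step * (X.T @ w)

print(beta)  # should land reasonably close to beta_true for moderate n
```

Because the log-likelihood is concave (the point of this post), this iteration cannot get stuck in a spurious local maximum; in practice one would use Newton's method (Fisher scoring), but gradient ascent keeps the sketch minimal.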

In the probit regression model, \mu_i = \Phi (\eta_i) = \Phi \left( \sum_{j=1}^p \beta_j x_{ij} \right) = \Phi(x_i^T \beta), where \Phi is the cumulative distribution function (CDF) of the standard normal distribution and x_i = (x_{i1}, \dots, x_{ip})^T. The responses are binary with y_i \sim \text{Bern}(\mu_i). The likelihood function is

\begin{aligned} L(\beta) &= \prod_{i=1}^n \mu_i^{y_i} (1- \mu_i)^{1 - y_i} \\  &= \prod_{i=1}^n [\Phi(x_i^T \beta)]^{y_i} [1 - \Phi(x_i^T \beta)]^{1 - y_i}, \end{aligned}

and the log-likelihood function is

\begin{aligned} \ell(\beta) = \sum_{i=1}^n y_i \log [\Phi(x_i^T \beta)] + (1-y_i) \log [1 - \Phi(x_i^T \beta)].  \end{aligned}
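As a quick sanity check, here is a minimal sketch (the synthetic data and coefficients are illustrative assumptions) that evaluates \ell(\beta) in a numerically stable way and spot-checks concavity along a random chord. It uses the symmetry 1 - \Phi(\eta) = \Phi(-\eta) together with scipy's norm.logcdf to avoid taking the log of a tiny difference.

```python
# Minimal sketch: stable evaluation of the probit log-likelihood,
# plus a spot-check of concavity along a random chord.
import numpy as np
from scipy.stats import norm

def probit_loglik(beta, X, y):
    """ell(beta) = sum_i [y_i log Phi(x_i^T beta) + (1 - y_i) log(1 - Phi(x_i^T beta))]."""
    eta = X @ beta
    # log(1 - Phi(eta)) = log Phi(-eta) by symmetry of the standard normal.
    return np.sum(y * norm.logcdf(eta) + (1.0 - y) * norm.logcdf(-eta))

# Illustrative synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < norm.cdf(X @ np.array([0.5, -1.0, 0.25]))).astype(float)

# For a concave function: ell(t*b1 + (1-t)*b2) >= t*ell(b1) + (1-t)*ell(b2).
b1, b2, t = rng.normal(size=3), rng.normal(size=3), 0.3
lhs = probit_loglik(t * b1 + (1 - t) * b2, X, y)
rhs = t * probit_loglik(b1, X, y) + (1 - t) * probit_loglik(b2, X, y)
assert lhs >= rhs
```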

To show that \ell is concave in \beta, we make two reductions:

  1. Since the sum of concave functions is concave, it is enough to show that \beta \mapsto y \log [\Phi(x^T \beta)] + (1-y) \log [1 - \Phi(x^T \beta)] is concave.
  2. Since composition with an affine function preserves concavity, it is enough to show that f(x) = y \log [\Phi(x)] + (1-y) \log [1 - \Phi(x)] is concave in x. (Here, x \in \mathbb{R} is a scalar; we reuse the symbol x for brevity.)

From here, we can show that f is concave by showing that its second derivative is strictly negative: f''(x) < 0 for all x. Since y can only take the values 0 and 1, we consider these two cases separately.

Let \phi denote the probability density function (PDF) of the standard normal distribution. Since \phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}, differentiating gives \phi'(x) = -x \phi(x). When y = 1,

\begin{aligned} f'(x) &= \dfrac{\phi(x)}{\Phi(x)}, \\  f''(x)&= \dfrac{\Phi(x) [-x \phi(x)] - \phi(x)^2}{\Phi(x)^2} \\  &= \dfrac{\phi(x)}{\Phi(x)^2}[- x \Phi(x) - \phi(x)] \\  &< 0, \end{aligned}

since x \Phi(x) + \phi(x) > 0 for all x (see this previous post for a proof).
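For readers who like a numerical cross-check, the following sketch (the grid and the finite-difference step are arbitrary choices) compares the closed-form f''(x) above with a central second difference of f(x) = \log \Phi(x) and confirms that it is negative across the grid.

```python
# Sketch: cross-check the closed-form second derivative for the y = 1 case.
import numpy as np
from scipy.stats import norm

x = np.linspace(-6, 6, 101)
f2_closed = norm.pdf(x) / norm.cdf(x) ** 2 * (-x * norm.cdf(x) - norm.pdf(x))

# Central second difference of f(x) = log Phi(x).
h = 1e-4
f2_numeric = (norm.logcdf(x + h) - 2 * norm.logcdf(x) + norm.logcdf(x - h)) / h ** 2

assert np.all(f2_closed < 0)
assert np.allclose(f2_closed, f2_numeric, atol=1e-4)
```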

When y = 0,

\begin{aligned} f'(x) &= \dfrac{-\phi(x)}{1-\Phi(x)}, \\  f''(x) &= \dfrac{[1- \Phi(x)][x \phi(x)] + \phi(x) [-\phi(x)]}{[1 - \Phi(x)]^2} \\  &= -\dfrac{\phi(x)}{[1-\Phi(x)]^2} \left[ -x + x \Phi(x) + \phi(x) \right]. \end{aligned}

To show concavity of f in this case, it remains to show that x\Phi(x) + \phi(x) > x for all x. This is indeed true (see this previous post for a proof), so f''(x) < 0 here as well, completing the argument.
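Here is a quick numerical check of this inequality (a sketch; the grid bounds are arbitrary). Rewriting x\Phi(x) + \phi(x) - x as \phi(x) - x[1 - \Phi(x)] and using scipy's survival function norm.sf = 1 - \Phi avoids catastrophic cancellation when x is large and positive.

```python
# Sketch: verify x * Phi(x) + phi(x) > x on a grid, in a stable form.
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 2001)
# The margin phi(x) - x * (1 - Phi(x)) equals x * Phi(x) + phi(x) - x.
margin = norm.pdf(x) - x * norm.sf(x)
assert np.all(margin > 0)
```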

References:

  1. Chapter 9: Generalized Linear Models.
