# Log-likelihood of probit regression is globally concave

Assume we are in the supervised learning setting with $n$ observations, where observation $i$ consists of the response $y_i$ and features $(x_{i1},\dots, x_{ip})$. A generalized linear model (GLM) consists of 3 components:

1. A random component, or a family of distributions $f$ indexed by $\mu_i$ (usually an exponential family), such that $y_i \sim f_{\mu_i}$,
2. A systematic component $\eta_i = \sum_{j=1}^p \beta_j x_{ij}$, and
3. A link function $g$ such that $\eta_i = g(\mu_i)$.

(See this previous post for more details of the components of a GLM.) The user gets to define the family of distributions $f$ and the link function $g$, and $\beta = (\beta_1, \dots, \beta_p)^T$ is the parameter to be determined by maximum likelihood estimation.

For one-dimensional exponential families with the canonical link function, it is known that the log-likelihood of the GLM is globally concave in $\beta$ (see, for example, Reference 1). Hence, any local maximum is a global maximum, and the MLE $\hat\beta$ can be found using methods such as gradient ascent or coordinate ascent (equivalently, descent on the negative log-likelihood). When non-canonical links are used, the GLM’s log-likelihood is no longer guaranteed to be concave in $\beta$. However, in some situations we can still show that the log-likelihood is concave in $\beta$. In this post, we show that the log-likelihood for probit regression is concave in $\beta$.

In the probit regression model, $\mu_i = \Phi (\eta_i) = \Phi \left( \sum_{j=1}^p \beta_j x_{ij} \right) = \Phi(x_i^T \beta)$, where $x_i = (x_{i1}, \dots, x_{ip})^T$ and $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution. The responses are binary with $y_i \sim \text{Bern}(\mu_i)$. The likelihood function is

\begin{aligned} L(\beta) &= \prod_{i=1}^n \mu_i^{y_i} (1- \mu_i)^{1 - y_i} \\ &= \prod_{i=1}^n [\Phi(x_i^T \beta)]^{y_i} [1 - \Phi(x_i^T \beta)]^{1 - y_i}, \end{aligned}

and the log-likelihood function is

\begin{aligned} \ell(\beta) = \sum_{i=1}^n y_i \log [\Phi(x_i^T \beta)] + (1-y_i) \log [1 - \Phi(x_i^T \beta)]. \end{aligned}
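To make the formula concrete, here is a minimal sketch of the log-likelihood in Python using only the standard library. The function names `norm_cdf` and `probit_log_likelihood` and the list-of-lists data layout are illustrative assumptions, not part of the original post. The sketch uses the identity $1 - \Phi(\eta) = \Phi(-\eta)$ to avoid cancellation when $\eta$ is large.

```python
import math


def norm_cdf(z):
    """Standard normal CDF, Phi(z), via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def probit_log_likelihood(beta, X, y):
    """Log-likelihood of probit regression (illustrative sketch).

    beta: list of p coefficients
    X:    list of n rows, each a list of p features
    y:    list of n binary responses (0 or 1)
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        # Linear predictor eta_i = x_i^T beta
        eta = sum(b * x for b, x in zip(beta, x_i))
        # y_i = 1 contributes log Phi(eta); y_i = 0 contributes
        # log(1 - Phi(eta)), computed as log Phi(-eta)
        p = norm_cdf(eta) if y_i == 1 else norm_cdf(-eta)
        total += math.log(p)
    return total
```

At $\beta = 0$ every observation contributes $\log(1/2)$, which gives a quick sanity check. For very extreme values of the linear predictor the probability can underflow to zero, so a production implementation would likely want a dedicated log-CDF routine instead.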

To show that $\ell$ is concave in $\beta$, we make two reductions:

1. Since the sum of concave functions is concave, it is enough to show that $\beta \mapsto y \log [\Phi(x^T \beta)] + (1-y) \log [1 - \Phi(x^T \beta)]$ is concave.
2. Since composition with an affine function preserves concavity, it is enough to show that $f(x) = y \log [\Phi(x)] + (1-y) \log [1 - \Phi(x)]$ is concave in $x$. (Here, $x \in \mathbb{R}$.)

From here, we can show that $f$ is concave by showing that its second derivative is negative: $f''(x) < 0$ for all $x$. Since $y$ can only take on the values of 0 and 1, we can consider those cases separately.

Let $\phi$ denote the probability density function of the standard normal distribution. Recall that $\phi'(x) = -x \phi(x)$. When $y = 1$,

\begin{aligned} f'(x) &= \dfrac{\phi(x)}{\Phi(x)}, \\ f''(x)&= \dfrac{\Phi(x) [-x \phi(x)] - \phi(x)^2}{\Phi(x)^2} \\ &= \dfrac{\phi(x)}{\Phi(x)^2}[- x \Phi(x) - \phi(x)] \\ &< 0, \end{aligned}

since $x \Phi(x) + \phi(x) > 0$ for all $x$ (see this previous post for a proof).
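One short way to see this inequality (a sketch, not necessarily the argument used in the linked post): let $g(x) = x\Phi(x) + \phi(x)$. Then

\begin{aligned} g'(x) &= \Phi(x) + x\phi(x) + \phi'(x) \\ &= \Phi(x) + x\phi(x) - x\phi(x) \\ &= \Phi(x) > 0, \end{aligned}

so $g$ is strictly increasing. Since both $x\Phi(x) \to 0$ and $\phi(x) \to 0$ as $x \to -\infty$, we have $g(x) \to 0$ as $x \to -\infty$, and therefore $g(x) > 0$ for all $x$.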

When $y = 0$,

\begin{aligned} f'(x) &= \dfrac{-\phi(x)}{1-\Phi(x)}, \\ f''(x) &= \dfrac{[1- \Phi(x)][x \phi(x)] + \phi(x) [-\phi(x)]}{[1 - \Phi(x)]^2} \\ &= -\dfrac{\phi(x)}{[1-\Phi(x)]^2} \left[ -x + x \Phi(x) + \phi(x) \right]. \end{aligned}

To show concavity of $f$, it remains to show that $x\Phi(x) + \phi(x) > x$ for all $x$, since the bracketed term is then positive and $f''(x) < 0$. But this is true: see this previous post for a proof.
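As a numerical sanity check (not a proof), the two second-derivative formulas can be evaluated on a grid. The sketch below uses only the Python standard library. The helper names `phi`, `Phi`, `f2_y1` and `f2_y0` and the grid $[-6, 6]$ are assumptions made for illustration. For $y = 0$, the bracketed term $-x + x\Phi(x) + \phi(x)$ is rewritten as $\phi(x) - x[1 - \Phi(x)]$, with $1 - \Phi(x)$ computed as $\Phi(-x)$ for numerical stability.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def f2_y1(x):
    """f''(x) when y = 1: phi / Phi^2 * (-x * Phi - phi)."""
    return phi(x) / Phi(x) ** 2 * (-x * Phi(x) - phi(x))


def f2_y0(x):
    """f''(x) when y = 0, with 1 - Phi(x) computed as Phi(-x)."""
    sf = Phi(-x)
    # -x + x * Phi(x) + phi(x) equals phi(x) - x * (1 - Phi(x))
    return -phi(x) / sf ** 2 * (phi(x) - x * sf)


# Evaluate both second derivatives on an evenly spaced grid over [-6, 6]
grid = [k / 10.0 for k in range(-60, 61)]
all_negative = all(f2_y1(x) < 0 and f2_y0(x) < 0 for x in grid)
print(all_negative)
```

Because $\log[1 - \Phi(x)] = \log \Phi(-x)$, the two cases are mirror images of each other: `f2_y0(x)` should agree with `f2_y1(-x)` up to floating-point error.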

References: