A lower bound on x Phi(x) + phi(x)

I recently came across this lower bound that I thought was interesting and not immediately obvious. Let \phi and \Phi be the probability density function (PDF) and the cumulative distribution function (CDF) of the standard normal distribution, i.e.

\begin{aligned} \phi (x) = \dfrac{1}{\sqrt{2\pi}} \exp \left( -\dfrac{x^2}{2}\right), \qquad \Phi(x) = \int_{-\infty}^x \phi (t)dt. \end{aligned}

Then the following inequality holds:

\begin{aligned} x \Phi(x) + \phi(x) > \max(0, x). \end{aligned}

The plot below illustrates this inequality, with the LHS drawn in red (solid) and the RHS in black (dashed):

# Plot f(x) = x * pnorm(x) + dnorm(x) (red, solid) against max(0, x) (black, dashed)
x <- seq(-30, 30) / 10   # grid from -3 to 3 in steps of 0.1
plot(x, x * pnorm(x) + dnorm(x), type = 'l', col = "red",
     ylab = "f(x)")
lines(x, pmax(0, x), lty = 2)

The derivation of the inequality below is mine; there may be faster ways to do it. A key formula we will use in the proof is the derivative of \phi: \phi'(x) = -x\phi(x).

Let f(x) = x \Phi(x) + \phi(x). First, we show that f(x) > 0 for all x. Since, for all x,

\begin{aligned} f'(x) = x\phi(x) + \Phi(x) - x\phi(x) = \Phi(x) > 0, \end{aligned}

the function f is strictly increasing on the entire real line. Hence, using L’Hôpital’s rule,

\begin{aligned} f(x) &> \lim_{x \rightarrow -\infty} f(x) \\  &= \lim_{x \rightarrow -\infty} \frac{\Phi(x) + \phi(x) / x}{1/x} \\  &= \lim_{x \rightarrow -\infty} \frac{\phi(x) + \frac{x(-x\phi(x)) - \phi(x)}{x^2} }{-1/x^2} \\  &= \lim_{x \rightarrow -\infty} \frac{x^2 \phi(x) - x^2 \phi(x) - \phi(x) }{-1} \\  &= \lim_{x \rightarrow -\infty} \phi(x) \\  &= 0. \end{aligned}

Next, we show that f(x) > x for all x. Note that we only have to prove this for x > 0, since f(x) > 0 \geq x for x \leq 0. Letting g(x) = \Phi(x) + \dfrac{\phi(x)}{x}, we see that proving f(x) > x for x > 0 is equivalent to proving g(x) > 1 for x > 0. Since

\begin{aligned} g'(x) = \phi(x) + \frac{x(-x\phi(x)) - \phi(x)}{x^2} = - \phi(x) / x^2 < 0 \end{aligned}

for all x > 0, the function g is strictly decreasing for x > 0. Thus, for this range of x values,

\begin{aligned} g(x) &> \lim_{x \rightarrow \infty} g(x) \\  &= \lim_{x \rightarrow \infty} \frac{x \Phi(x) + \phi(x)}{x} \\  &= \lim_{x \rightarrow \infty} \frac{x\phi(x) + \Phi(x) - x\phi(x)}{1} \\  &= \lim_{x \rightarrow \infty} \Phi(x) \\  &= 1, \end{aligned}

as required. (We used L’Hôpital’s rule again in the sequence of equalities above.)
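
As a quick numerical sanity check (not part of the proof, and the grid here is an arbitrary choice), the gap between the LHS and RHS of the inequality stays positive:

# numerical sanity check: f(x) - max(0, x) should be positive on the grid
x <- seq(-5, 5, by = 0.01)
gap <- x * pnorm(x) + dnorm(x) - pmax(0, x)
min(gap)   # small but positive (the gap shrinks towards 0 at both ends)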

Density of the log-normal distribution

A random variable X has a log-normal distribution if \log X has a normal distribution. Since the normal distribution can be parameterized by the mean and standard deviation (usually denoted \mu and \sigma), the log-normal distribution can be parameterized similarly.

Given \mu \in \mathbb{R} and \sigma > 0, the density of the log-normal distribution is given by

\begin{aligned} f(x) = \dfrac{1}{x \sigma \sqrt{2\pi}} \exp \left[ - \dfrac{(\log x - \mu)^2}{2 \sigma^2} \right], \qquad \text{for } x \in (0, +\infty). \end{aligned}

The figure below shows the probability density function (PDF) of the log-normal distribution for \mu = 0 and varying \sigma. This is very much like the figure you see on the Wikipedia page.

Unfortunately this appears to be the most common picture we see for the log-normal, when in fact the picture looks quite different for other parameter settings. In particular, there are hardly any figures depicting what happens when we vary \mu.

Well, this post aims to fill that gap. All the figures in this post have the x-axis going from 0 to 3 for comparability (the y-axis will vary). Code for generating these figures is available here.

Below are two figures where we vary \sigma as well. The first has \mu fixed at 1 while the second has \mu fixed at -0.75.

The next 3 figures have varying \mu and \sigma fixed at 0.25, 1 and 1.25 respectively.

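As a rough sketch of how such a figure can be generated (not the exact code linked above; the parameter values here are arbitrary), one can overlay dlnorm curves for several values of \mu:

# Sketch: log-normal densities on (0, 3] with sigma fixed and mu varying
x <- seq(0.01, 3, by = 0.01)
mus <- c(-1, -0.5, 0, 0.5, 1)
sigma <- 0.5
plot(x, dlnorm(x, meanlog = mus[1], sdlog = sigma), type = "l",
     ylab = "density", main = paste("Log-normal PDF, sigma =", sigma))
for (i in 2:length(mus)) {
  lines(x, dlnorm(x, meanlog = mus[i], sdlog = sigma), col = i)
}
legend("topright", legend = paste("mu =", mus), col = seq_along(mus), lty = 1)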

What is Isserlis’ theorem?

I learnt about Isserlis’ theorem (also known as Wick’s probability theorem) at a talk today. The theorem comes from a paper from 1918, which is listed as Reference 1 below. In the words of Reference 2, the theorem

… allows [us] to express the expectation of a monomial in an arbitrary number of components of a zero mean Gaussian vector X \in \mathbb{R}^d in terms of the entries of its covariance matrix only.

We introduce some notation (as in Reference 2) to state the theorem succinctly. Let A = \{ \alpha_1, \dots, \alpha_N \} be a set of integers such that 1 \leq \alpha_i \leq d for all i. The \alpha_i need not be distinct. For any vector X \in \mathbb{R}^d, denote

\begin{aligned} X_A = \prod_{\alpha_i \in A} X_{\alpha_i}, \end{aligned}

with the convention that X_\emptyset = 1. Let \Pi (A) denote the set of all pairings of A, i.e. partitions of A into disjoint pairs. For a pairing \sigma \in \Pi (A), let A / \sigma denote the set of indices i such that the pairs in \sigma are \{ (\alpha_i, \alpha_{\sigma(i)}) : i \in A / \sigma \}.

(As an example, if A = \{ 1, 2, 3, 4 \}, one possible pairing \sigma is \{\{ 1, 3\}, \{ 2, 4\} \}. For this pairing, a possible choice of A / \sigma is A / \sigma = \{ 1, 2\}, with \sigma(1) = 3 and \sigma(2) = 4.)

We are now ready to state the theorem:

Theorem (Isserlis’ theorem): Let A = \{ \alpha_1, \dots, \alpha_N \} be a set of integers such that 1 \leq \alpha_i \leq d for all i, and let X \in \mathbb{R}^d be a Gaussian vector with zero mean. If N is even, then

\begin{aligned} \mathbb{E} [X_A] = \sum_{\sigma \in \Pi (A)} \prod_{i \in A / \sigma} \mathbb{E} [X_{\alpha_i} X_{\alpha_{\sigma(i)}}]. \end{aligned}

If N is odd, then \mathbb{E}[X_A] = 0.

Here are some special cases of Isserlis’ theorem to demonstrate how to interpret the equation above. If \alpha_i = i for 1 \leq i \leq 4, there are 3 possible pairings, giving us

\begin{aligned} \mathbb{E}[X_1 X_2 X_3 X_4] = \mathbb{E}[X_1 X_2] \mathbb{E}[X_3 X_4] + \mathbb{E}[X_1 X_3] \mathbb{E}[X_2 X_4] + \mathbb{E}[X_1 X_4] \mathbb{E}[X_2 X_3]. \end{aligned}

If we take \alpha_i = 1 for 1 \leq i \leq 4, there are still 3 possible pairings, and we get

\begin{aligned} \mathbb{E}[X_1 X_1 X_1 X_1] &= \mathbb{E}[X_1 X_1] \mathbb{E}[X_1 X_1] + \mathbb{E}[X_1 X_1] \mathbb{E}[X_1 X_1] + \mathbb{E}[X_1 X_1] \mathbb{E}[X_1 X_1], \\  \mathbb{E}[X_1^4] &= 3 \left(\mathbb{E}[X_1^2] \right)^2. \end{aligned}

This tells us that the 4th moment of a mean-zero 1-dimensional Gaussian random variable is 3 times the square of its 2nd moment.

As a final example, if we take \alpha_1 = \alpha_2 = 1 and \alpha_3 = \alpha_4 = 2, we still have 3 possible pairings, and we get

\begin{aligned} \mathbb{E}[X_1 X_1 X_2 X_2] &= \mathbb{E}[X_1 X_1] \mathbb{E}[X_2 X_2] + \mathbb{E}[X_1 X_2] \mathbb{E}[X_1 X_2] + \mathbb{E}[X_1 X_2] \mathbb{E}[X_1 X_2], \\  \mathbb{E} [X_1^2 X_2^2] &= \mathbb{E}[X_1^2] \mathbb{E}[X_2^2] + 2 \left( \mathbb{E}[X_1 X_2] \right)^2. \end{aligned}
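
To see the last identity in action, here is a quick Monte Carlo check (my own illustration; the covariance matrix below is an arbitrary choice), using MASS::mvrnorm to draw the zero-mean Gaussian vectors:

# Monte Carlo check of E[X1^2 X2^2] = E[X1^2] E[X2^2] + 2 (E[X1 X2])^2
library(MASS)
set.seed(1)
Sigma <- matrix(c(2, 0.8, 0.8, 1), nrow = 2)   # Var(X1) = 2, Var(X2) = 1, Cov = 0.8
X <- mvrnorm(n = 1e6, mu = c(0, 0), Sigma = Sigma)
lhs <- mean(X[, 1]^2 * X[, 2]^2)                      # estimate of E[X1^2 X2^2]
rhs <- Sigma[1, 1] * Sigma[2, 2] + 2 * Sigma[1, 2]^2  # E[X1^2] E[X2^2] + 2 (E[X1 X2])^2
c(lhs = lhs, rhs = rhs)                               # should be close to each other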

References:

  1. Isserlis, L. (1918) On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables.
  2. Vignat, C. (2011) A generalized Isserlis theorem for location mixtures of Gaussian random vectors.

Proof that sample mean is independent of sample variance (under normality)

Suppose we have i.i.d. observations X_1, \dots, X_n drawn from a population. We usually estimate the mean and variance of the population by the mean and variance of the sample we have:

\begin{aligned} \bar{X} &= \dfrac{X_1 + \dots + X_n}{n}, \\  S^2 &= \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2. \end{aligned}

Under the assumption that the population is normally distributed, the sample mean and sample variance are independent of each other.

The first proof of this fact is short but requires some basic knowledge of theoretical statistics. Treating \mu as the unknown parameter (with \sigma^2 fixed), one can show that the sample mean is a complete sufficient statistic for \mu and that the sample variance is an ancillary statistic. Basu’s theorem then implies that they are independent of each other.

The second proof is longer and more explicit (and hence less magical than the first). Assume that X_1, \dots, X_n \stackrel{i.i.d.}{\sim} \mathcal{N}(\mu, \sigma^2). Then

\begin{aligned} \bar{X} &\sim \mathcal{N} \left( \mu, \frac{\sigma^2}{n}\right), \\  \bar{X} - X_j &= \frac{1}{n}(X_1 + \dots + X_{j-1} + X_{j+1} + \dots + X_n) - \frac{n-1}{n}X_j \\  &\sim \mathcal{N}\left( \frac{(n-1)\mu}{n} - \frac{(n-1)\mu}{n}, (n-1)\frac{\sigma^2}{n^2} + (n-1)^2 \frac{\sigma^2}{n^2} \right) \\  &= \mathcal{N} \left( 0, \frac{n-1}{n}\sigma^2 \right), \end{aligned}

for any j = 1, \dots, n. Furthermore, for any j = 1, \dots, n, we have

\begin{aligned} \text{Cov}(X_j - \bar{X}, \bar{X}) &= \text{Cov}(X_j, \bar{X}) -\text{Cov}(\bar{X}, \bar{X}) \\  &= \frac{\sigma^2}{n} - \frac{\sigma^2}{n} \\ &= 0. \end{aligned}

Since the vector \begin{pmatrix} \bar{X}, X_1 - \bar{X}, \dots, X_n - \bar{X}\end{pmatrix}^T is a linear transformation of (X_1, \dots, X_n)^T, it has a multivariate normal distribution, and the covariance computation above implies that its covariance matrix has the form

\begin{pmatrix} \frac{\sigma^2}{n} & \mathbf{0}^T \\ \mathbf{0} & \Sigma  \end{pmatrix}

for some \Sigma. Because of the multivariate normality, we can conclude that \bar{X} and {\mathbf{X}} = \begin{pmatrix} X_1 - \bar{X}, \dots, X_n - \bar{X}\end{pmatrix}^T are independent normal vectors, and so \bar{X} is also independent of {\mathbf{X}}^T {\mathbf{X}} = (n-1) S^2.
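
A quick simulation (illustrative only; near-zero correlation is of course necessary but not sufficient for independence) is consistent with this:

# Across repeated normal samples, sample means and variances show no dependence
set.seed(1)
n <- 10
n_reps <- 1e5
samples <- matrix(rnorm(n * n_reps, mean = 2, sd = 3), nrow = n_reps)  # each row is a sample of size n
means <- rowMeans(samples)
vars <- apply(samples, 1, var)
cor(means, vars)              # close to 0
plot(means, vars, pch = ".")  # no visible dependence structure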

References:

  1. StackExchange. Proof of the independence of the sample mean and sample variance.
  2. Statistics 351 (Fall 2009). Independence of \bar{X} and S^2 in a Normal Sample.

What is the Shapiro-Wilk test?

I recently learnt of the Shapiro-Wilk test from this blog post. So what is it?

The Shapiro-Wilk test is a statistical test of the hypothesis that a group of values comes from a normal distribution. (The mean and variance of this normal distribution need not be 0 and 1 respectively.) Empirically, this test appears to have the best power among common tests for normality.

Assume that the data are x_1, \dots, x_n \in \mathbb{R} and that we want to test whether they come from a population that is normally distributed. The test statistic is

\begin{aligned} W =\dfrac{\left( \sum_{i=1}^n a_i x_{(i)} \right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}, \end{aligned}

where

  • \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i is the mean of the x_i's,
  • x_{(1)} \leq x_{(2)} \leq \dots \leq x_{(n)} are the order statistics,
  • a_1, \dots, a_n are “constants generated from the means, variances and covariances of the order statistics of a sample of size n from a normal distribution” (Ref 3).

Let m = (m_1, \dots, m_n) be the expected values of the standard normal order statistics, and let V be the corresponding covariance matrix. Then

\begin{aligned} a = (a_1, \dots, a_n) = \dfrac{V^{-1}m}{\|V^{-1}m\|_2}. \end{aligned}

We reject the null hypothesis (the data come from a normal distribution) if W is small. In R, the Shapiro-Wilk test can be performed with the shapiro.test() function.
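
For instance (an illustrative snippet, not from the references), applying the test to a normal sample and to a clearly non-normal sample:

# Shapiro-Wilk test on simulated data
shapiro.test(rnorm(100))   # typically a large p-value: no evidence against normality
shapiro.test(rexp(100))    # typically a very small p-value: normality rejected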

As far as I can tell there isn’t a closed form for the distribution of W under the null.

What is the intuition behind this test? Reference 2 has a good explanation for this:

The basic idea behind the Shapiro-Wilk test is to estimate the variance of the sample in two ways: (1) the regression line in the QQ-plot allows us to estimate the variance, and (2) the variance of the sample can also be regarded as an estimator of the population variance. Both estimates should be approximately equal in the case of a normal distribution and thus should result in a quotient close to 1.0. If the quotient is significantly lower than 1.0, then the null hypothesis (of having a normal distribution) should be rejected.

Why is it called the Shapiro-Wilk test? It was proposed by S. S. Shapiro and M. B. Wilk in a 1965 Biometrika paper, "An Analysis of Variance Test for Normality". This original paper proves a number of properties of the W statistic, e.g. \dfrac{na_1^2}{n-1} \leq W \leq 1.

References:

  1. Wikipedia. Shapiro-Wilk test.
  2. Fundamentals of Statistics. Shapiro-Wilk test.
  3. Engineering Statistics Handbook. Section 7.2.1.3. Anderson-Darling and Shapiro-Wilk tests.

Marginally Gaussian does not imply jointly Gaussian

A multivariate random variable \mathbf{X} = (X_1, \dots, X_p) \in \mathbb{R}^p is said to have joint multivariate normal/Gaussian distribution if for any \mathbf{a} \in \mathbb{R}^p, \mathbf{a^T X} has the normal/Gaussian distribution. In other words, any linear combination of the elements of \mathbf{X} must have the normal distribution.

In particular, if we define \mathbf{a} such that it has zeros in all entries except the jth entry where it has a one, we see that X_j must be normally distributed. Hence, joint Gaussianity implies marginal Gaussianity.

Is the converse true? That is, does X_j being normally distributed for j = 1, \dots, p imply that \mathbf{X} = (X_1, \dots, X_p) is jointly normally distributed? The answer is NO. The first reference below has a nice set of heatmaps demonstrating this: there are 6 heatmaps in total, each depicting a bivariate distribution with Gaussian marginals, but only 3 of them (the two on the left and the top-center one) are jointly Gaussian; the other 3 are not. I've reproduced the plot below for convenience:

Bivariate distributions with Gaussian marginals; only 3 of the 6 are jointly Gaussian distributions.
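
A standard explicit counterexample (my own illustration, not taken from the reference) makes this concrete: let X \sim \mathcal{N}(0,1) and Y = SX, where S is an independent random sign. Then X and Y are each standard normal, but (X, Y) is not jointly Gaussian, since the linear combination X + Y has an atom at 0. A quick R sketch:

# X and Y = S * X are each N(0, 1), but (X, Y) is not jointly Gaussian
set.seed(1)
n <- 1e5
X <- rnorm(n)
S <- sample(c(-1, 1), n, replace = TRUE)   # random sign, independent of X
Y <- S * X                                 # Y is marginally N(0, 1)
shapiro.test(Y[1:5000])   # typically does not reject normality of Y
mean(X + Y == 0)          # about 0.5: X + Y has an atom at 0, so not jointly Gaussian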

References:

  1. CrossValidated. Is it possible to have a pair of Gaussian random variables for which the joint distribution is not Gaussian?

Laplace distribution as a mixture of normals

Previously on this blog we showed that the t-distribution can be expressed as a continuous mixture of normal distributions. Today, I learned from this paper that the Laplace distribution can be viewed as a continuous mixture of normal distributions as well.

The Laplace distribution with mean \mu \in \mathbb{R} and scale b > 0 has the probability density function

\begin{aligned} f(x) = \frac{1}{2b} \exp \left(-\frac{|x-\mu|}{b} \right). \end{aligned}

(The Laplace distribution is sometimes known as the double exponential distribution, since its density looks like two exponential densities placed back to back.) Note that if we can show that the Laplace distribution with mean 0 is a mixture of normals, then by shifting all these normals by \mu, it follows that the Laplace distribution with mean \mu is also a mixture of normals.

Fix b > 0. Let W \sim \text{Exp}\left(\frac{1}{2b^2}\right), i.e. f_W(w) = \dfrac{1}{2b^2} \exp \left(-\dfrac{w}{2b^2} \right) for w \geq 0, and let X \mid W = w \sim \mathcal{N}(0, w). We claim that X has the Laplace distribution with mean 0 and scale b. This is equivalent to showing that for any x,

\begin{aligned} f_X(x) &= \frac{1}{2b} \exp \left(-\frac{|x|}{b} \right), \\  \Leftrightarrow \quad \int_0^\infty f_{X \mid W = w}(x) f_W(w) dw &= \frac{1}{2b} \exp \left(-\frac{|x|}{b} \right). \end{aligned}

The key to showing this identity is noticing that the integrand on the LHS looks very much like the probability density function (PDF) of the inverse Gaussian distribution. Letting \lambda = |x|^2 and \mu = |x|b (here \mu and \lambda denote the mean and shape parameters of the inverse Gaussian, not the Laplace mean), and letting Z be an inverse Gaussian random variable with mean \mu and shape \lambda,

\begin{aligned} \int_0^\infty f_{X \mid W = w}(x) f_W(w) dw &= \int_0^\infty \frac{1}{\sqrt{2\pi w}} \exp \left( - \frac{x^2}{2w} \right) \frac{1}{2b^2} \exp \left( - \frac{w}{2b^2} \right) dw \\  &= \frac{1}{2b^2} \int_0^\infty \frac{1}{\sqrt{2\pi w}} \exp \left( - \frac{w^2 + |x|^2b^2}{2b^2 w} \right) dw \\  &= \frac{1}{2b^2} \int_0^\infty \frac{1}{\sqrt{2\pi w}} \exp \left( - \frac{(w - |x|b)^2}{2b^2 w} - \frac{2w|x|b}{2b^2w} \right) dw \\  &= \frac{1}{2b^2} e^{-|x|/b} \int_0^\infty \frac{1}{\sqrt{2\pi w}} \exp \left( - \frac{|x|^2(w - |x|b)^2}{2|x|^2b^2 w} \right) dw \\  &= \frac{1}{2b^2} e^{-|x|/b} \frac{1}{\sqrt{\lambda}} \int_0^\infty w \frac{\sqrt{\lambda}}{\sqrt{2\pi w^3}} \exp \left( - \frac{\lambda(w - \mu)^2}{2\mu^2 w} \right) dw \\  &= \frac{1}{2b^2} e^{-|x|/b} \frac{1}{|x|} \mathbb{E}[Z] \\  &= \frac{1}{2|x|b^2} e^{-|x|/b} \mu \\  &= \frac{1}{2b} e^{-|x|/b}. \end{aligned}
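
As a quick check of this claim (an illustrative simulation, with b chosen arbitrarily), we can simulate from the mixture and overlay the Laplace density:

# W ~ Exp(rate = 1/(2 b^2)), X | W ~ N(0, sd = sqrt(W)); compare with Laplace(0, b)
set.seed(1)
b <- 1.5
n <- 1e5
W <- rexp(n, rate = 1 / (2 * b^2))
X <- rnorm(n, mean = 0, sd = sqrt(W))
plot(density(X), main = "Exponential mixture of normals vs. Laplace density")
curve(exp(-abs(x) / b) / (2 * b), add = TRUE, col = "red", lty = 2)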

t distribution as a mixture of normals

In class, the t distribution is usually introduced like this: if X \sim \mathcal{N}(0,1) and Z \sim \chi_\nu^2 are independent, then T = \dfrac{X}{\sqrt{Z / \nu}} has t distribution with \nu degrees of freedom, denoted t_\nu or t_{(\nu)}.

Did you know that the t distribution can also be viewed as a (continuous) mixture of normal random variables? Specifically, let W have inverse-gamma distribution \text{InvGam}\left(\dfrac{\nu}{2}, \dfrac{\nu}{2} \right), and define the conditional distribution X \mid W = w \sim \mathcal{N}(0, w). Then the unconditional distribution of X is the t distribution with \nu degrees of freedom.

The proof follows directly from computing the unconditional (or marginal) density of X:

\begin{aligned} f_X(x) &= \int_0^\infty f_{X \mid W}(x) f_W(w) dw \\  &\propto \int_0^\infty \frac{1}{\sqrt{w}} \exp \left( -\frac{x^2}{2w} \right) \cdot w^{-\nu/2 - 1} \exp \left( - \frac{\nu}{2w} \right) dw \\  &= \int_0^\infty w^{-\frac{\nu + 1}{2} - 1} \exp \left( - \frac{x^2 + \nu}{2w} \right) dw. \end{aligned}

Note that the integrand above is proportional to the PDF of the inverse-gamma distribution with \alpha = \dfrac{\nu + 1}{2} and \beta = \dfrac{x^2 + \nu}{2}. Hence, we can evaluate the last integral exactly to get

f_X(x) \propto \Gamma \left( \dfrac{\nu + 1}{2} \right) \left(\dfrac{x^2 + \nu}{2}\right)^{-\frac{\nu + 1}{2}} \propto\left( x^2 + \nu \right)^{-\frac{\nu + 1}{2}} \propto \left( 1 + \dfrac{x^2}{\nu} \right)^{-\frac{\nu + 1}{2}},

which is proportional to the PDF of the  t_\nu distribution.
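
Here is a small simulation (illustrative; \nu chosen arbitrarily) of the mixture, using the fact that the reciprocal of a \text{Gamma}(\nu/2, \text{rate} = \nu/2) draw has the \text{InvGam}(\nu/2, \nu/2) distribution:

# W ~ InvGamma(nu/2, nu/2) via 1 / Gamma, then X | W ~ N(0, sd = sqrt(W)); compare with t_nu
set.seed(1)
nu <- 5
n <- 1e5
W <- 1 / rgamma(n, shape = nu / 2, rate = nu / 2)
X <- rnorm(n, mean = 0, sd = sqrt(W))
plot(density(X), xlim = c(-6, 6), main = "Inverse-gamma mixture of normals vs. t density")
curve(dt(x, df = nu), add = TRUE, col = "red", lty = 2)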

Sources for the information above:

  1. Student-t as a mixture of normals, John D. Cook Consulting.

Approximating large normal quantiles 2

In my STATS 300C notes, another approximation for large normal quantiles was given: let Z \sim \mathcal{N}(0,1), and let z(\alpha) be the \alpha-quantile of Z, that is, z(\alpha) is the value such that \mathbb{P}(Z \leq z(\alpha)) = \alpha. Fix \alpha, and let B = 2 \log n - 2 \log \alpha - \log (2\pi). Then for large n,

z\left( 1 - \dfrac{\alpha}{n} \right) \approx \sqrt{B \left(1 - \dfrac{\log B}{B} \right)}.

The proof is very similar to the approximation in the previous post. Let z\left( 1 - \dfrac{\alpha}{n} \right) = t. For large n,

\begin{aligned} \frac{\phi(t)}{t} &\approx \frac{\alpha}{n}, \\  -\frac{t^2}{2} - \log t - \frac{1}{2} \log (2\pi) &\approx \log \alpha - \log n, \\  t^2 + 2 \log t &\approx B.  \end{aligned}

n large implies B large, which in turn implies t large. This means that t^2 dominates the LHS, which gives us the 1st order approximation t \approx \sqrt{B}. If we let t = \sqrt{B(1-u)} and substitute it in the equation above instead, we get

\begin{aligned} B(1-u) + \log [B(1-u)] &\approx B, \\  -Bu + \log B + \log (1-u) &\approx 0, \\  u &\approx \frac{\log B}{B}, \end{aligned}

since \log (1-u) is going to be small compared to the rest of the terms. Substituting this value of u back into the expression for t gives us the desired approximation.
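
A quick numerical comparison (illustrative, with \alpha = 0.05) of the exact quantile from qnorm against this approximation:

# Compare z(1 - alpha/n) with sqrt(B * (1 - log(B)/B))
alpha <- 0.05
n <- c(1e3, 1e6, 1e9)
B <- 2 * log(n) - 2 * log(alpha) - log(2 * pi)
exact <- qnorm(1 - alpha / n)
approx2 <- sqrt(B * (1 - log(B) / B))
cbind(n, exact, approx2)   # the approximation tracks the exact quantile closely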

Approximating large normal quantiles

In the last post, we showed that for Z \sim \mathcal{N}(0,1) and large t, \mathbb{P}(Z > t) = 1 - \Phi(t) \approx \dfrac{\phi(t)}{t}. In this post, we’ll show how we can use this approximation to get an approximation for large quantiles of the normal distribution.

Let z(\alpha) be the \alpha-quantile of Z, that is, z(\alpha) is the value such that \mathbb{P}(Z \leq z(\alpha)) = \alpha. Fix \alpha. Then for large n,

z\left( 1 - \dfrac{\alpha}{n} \right) = t \quad \Leftrightarrow \quad 1 - \Phi(t) = \dfrac{\alpha}{n} \quad \Rightarrow \quad \dfrac{\phi(t)}{t} \approx \dfrac{\alpha}{n}.

We are going to solve for t in the equation on the right:

\begin{aligned} \frac{1}{\sqrt{2\pi}} e^{-t^2 / 2 - \log t} &\approx \frac{\alpha}{n}, \\  - \log (\sqrt{2\pi}) - t^2 / 2 - \log t &\approx \log \alpha - \log n, \\  t^2 /2 + \log t &\approx \log n - \log \alpha - \log (\sqrt{2\pi}).  \end{aligned}

As n grows large, t has to grow large too. When these quantities are large, t^2 / 2 dominates the LHS while \log n dominates the RHS. Thus, the approximation becomes t^2 \approx 2 \log n, or t \approx \sqrt{2 \log n}. (Interestingly, this approximation does not depend on \alpha!)

We can get a finer approximation if we do not drop the \log t term in the LHS above. Instead, let t = \sqrt{2 \log n} (1 - u). Then we have

\begin{aligned} \frac{2 \log n \cdot (1-u)^2}{2} + \log [\sqrt{2 \log n} (1-u)] &\approx \log n, \\  (u^2 - 2u) \log n + \frac{1}{2}[\log 2 + \log \log n] + \log (1-u) &\approx 0, \\  u^2 - 2u + \frac{\log 2 + \log \log n}{2\log n} + \frac{\log (1 - u)}{\log n} &\approx 0. \end{aligned}

As n grows large, u is going to grow small. The terms that end up dominating are -2u and \dfrac{\log \log n}{2 \log n}, which gives us the approximation u \approx \dfrac{\log \log n}{4 \log n}. Hence, we have t \approx \sqrt{2 \log n} \left(1 - \dfrac{\log \log n}{4 \log n} \right).
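
A quick numerical comparison (illustrative; since these approximations do not involve \alpha, we take \alpha = 1, i.e. the (1 - 1/n)-quantile):

# Compare z(1 - 1/n) with sqrt(2 log n) and with the refined approximation
n <- c(1e3, 1e6, 1e9)
exact <- qnorm(1 - 1 / n)
crude <- sqrt(2 * log(n))
finer <- sqrt(2 * log(n)) * (1 - log(log(n)) / (4 * log(n)))
cbind(n, exact, crude, finer)   # finer tracks exact more closely than crude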

Credits: I learnt of this result in my STATS 300C class, but no proof was given (other than “use the \phi(t) / t approximation”).