What is the Shapiro-Wilk test?

I recently learnt of the Shapiro-Wilk test from this blog post. So what is it?

The Shapiro-Wilk test is a statistical test of the hypothesis that a sample of values comes from a normal distribution. (The mean and variance of this normal distribution need not be 0 and 1 respectively.) Empirically, this test appears to have the best power among tests for normality.

Assume that the data are x_1, \dots, x_n \in \mathbb{R} and that we want to test whether they come from a population that is normally distributed. The test statistic is

\begin{aligned} W =\dfrac{\left( \sum_{i=1}^n a_i x_{(i)} \right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}, \end{aligned}

where

  • \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i is the mean of the x_i's,
  • x_{(1)} \leq x_{(2)} \leq \dots \leq x_{(n)} are the order statistics,
  • a_1, \dots, a_n are “constants generated from the means, variances and covariances of the order statistics of a sample of size n from a normal distribution” (Ref 3).

Let m = (m_1, \dots, m_n) be the expected values of the standard normal order statistics, and let V be the corresponding covariance matrix. Then

\begin{aligned} a = (a_1, \dots, a_n) = \dfrac{V^{-1}m}{\|V^{-1}m\|_2}. \end{aligned}
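To get a feel for the statistic, here is a minimal Python sketch using only the standard library. Computing a exactly needs m and V, which have no simple closed form, so the sketch makes two substitutions that are not part of the test as defined above: it approximates m_i by Blom's formula \Phi^{-1}((i - 0.375)/(n + 0.25)), and it takes a = m / \|m\|_2, i.e. it treats V as the identity. This is the Shapiro-Francia simplification, so the result is an approximation to W rather than W itself. The function names are my own.

```python
from statistics import NormalDist, fmean


def approx_coefficients(n):
    """Approximate weights a_1..a_n, assuming a = m / ||m|| (Shapiro-Francia)."""
    std_normal = NormalDist()
    # Blom's approximation to the expected standard normal order statistics.
    m = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    norm = sum(v * v for v in m) ** 0.5
    return [v / norm for v in m]


def approx_w(xs):
    """Approximate W = (sum a_i x_(i))^2 / sum (x_i - xbar)^2."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 observations")
    a = approx_coefficients(n)
    ordered = sorted(xs)  # the order statistics x_(1) <= ... <= x_(n)
    xbar = fmean(ordered)
    denominator = sum((x - xbar) ** 2 for x in ordered)
    if denominator == 0:
        raise ValueError("all observations are identical")
    numerator = sum(ai * xi for ai, xi in zip(a, ordered)) ** 2
    return numerator / denominator
```

A sample whose sorted values fall exactly on a straight line against the approximate m_i gives a value of 1 (up to rounding), while a strongly skewed sample such as [0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 1.0, 1.5, 3.0, 9.0] gives a noticeably smaller value.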

We reject the null hypothesis (the data come from a normal distribution) if W is small. In R, the Shapiro-Wilk test can be performed with the shapiro.test() function.

As far as I can tell there isn’t a closed form for the distribution of W under the null.
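Without a closed form, one way to see what "small" means is to simulate the null distribution. The sketch below does this in Python with the standard library only. It is not how shapiro.test() computes its p-value, and to stay self-contained it uses the Shapiro-Francia simplification of the statistic (a = m / \|m\|_2 with Blom's approximation for m) rather than the exact coefficients, so treat the numbers as illustrative. Since the statistic does not change when the data are shifted or rescaled, simulating standard normal samples is enough. All function names and defaults are my own.

```python
import random
from statistics import NormalDist, fmean


def approx_w(xs):
    """Shapiro-Francia style approximation to W (assumes a = m / ||m||)."""
    n = len(xs)
    std_normal = NormalDist()
    # Blom's approximation to the expected standard normal order statistics.
    m = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    norm = sum(v * v for v in m) ** 0.5
    ordered = sorted(xs)
    xbar = fmean(ordered)
    numerator = sum(mi / norm * xi for mi, xi in zip(m, ordered)) ** 2
    return numerator / sum((x - xbar) ** 2 for x in ordered)


def simulate_null(n, reps=2000, seed=0):
    """Sorted statistics for `reps` simulated standard normal samples of size n."""
    rng = random.Random(seed)
    return sorted(
        approx_w([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(reps)
    )


def monte_carlo_p_value(xs, reps=2000, seed=0):
    """Share of simulated null statistics at or below the observed one."""
    observed = approx_w(xs)
    null_values = simulate_null(len(xs), reps, seed)
    at_or_below = sum(1 for w in null_values if w <= observed)
    # The +1 terms keep the estimate away from exactly zero.
    return (at_or_below + 1) / (reps + 1)
```

Small values of the statistic land in the lower tail of the simulated values, which is why the test rejects on the left.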

What is the intuition behind this test? Reference 2 has a good explanation for this:

“The basis idea behind the Shapiro-Wilk test is to estimate the variance of the sample in two ways: (1) the regression line in the QQ-Plot allows to estimate the variance, and (2) the variance of the sample can also be regarded as an estimator of the population variance. Both estimated values should approximately equal in the case of a normal distribution and thus should result in a quotient of close to 1.0. If the quotient is significantly lower than 1.0 then the null hypothesis (of having a normal distribution) should be rejected.”
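To make the quotient concrete, here is a sketch of the algebra as I understand it. The symbols y, \hat{\sigma}, R^2, C^2 and s^2 are introduced here for convenience. Write y = (x_{(1)}, \dots, x_{(n)}) for the vector of order statistics. Under the null, y has mean \mu + \sigma m and covariance \sigma^2 V, so the generalized least squares slope of the QQ-plot line is

\begin{aligned} \hat{\sigma} = \dfrac{m^T V^{-1} y}{m^T V^{-1} m}. \end{aligned}

With R^2 = m^T V^{-1} m, C^2 = m^T V^{-1} V^{-1} m = \|V^{-1}m\|_2^2 and s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2, we have a^T y = R^2 \hat{\sigma} / C, and so

\begin{aligned} W = \dfrac{(a^T y)^2}{\sum_{i=1}^n (x_i - \bar{x})^2} = \dfrac{R^4}{C^2 (n-1)} \cdot \dfrac{\hat{\sigma}^2}{s^2}. \end{aligned}

In other words, up to a constant that depends only on n, W is the ratio of the squared regression-based estimate of \sigma to the usual sample variance.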

Why is it called the Shapiro-Wilk test? It was proposed by S. S. Shapiro and M. B. Wilk in a 1965 Biometrika paper “An Analysis of Variance Test for Normality”. This original paper proves a number of properties of the W statistic, e.g. \dfrac{na_1^2}{n-1} \leq W \leq 1.
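The upper bound is short enough to sketch (this is my own reconstruction, not copied from the paper). By the symmetry of the normal distribution, a_i = -a_{n+1-i}, so \sum_{i=1}^n a_i = 0, and \|a\|_2 = 1 by construction. Cauchy-Schwarz then gives

\begin{aligned} \left( \sum_{i=1}^n a_i x_{(i)} \right)^2 = \left( \sum_{i=1}^n a_i (x_{(i)} - \bar{x}) \right)^2 \leq \sum_{i=1}^n a_i^2 \sum_{i=1}^n (x_{(i)} - \bar{x})^2 = \sum_{i=1}^n (x_i - \bar{x})^2, \end{aligned}

and dividing by the right-hand side yields W \leq 1.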

References:

  1. Wikipedia. Shapiro-Wilk test.
  2. Fundamentals of Statistics. Shapiro-Wilk test.
  3. Engineering Statistics Handbook. Section 7.2.1.3. Anderson-Darling and Shapiro-Wilk tests.
