What is the Shapiro-Wilk test?

I recently learnt of the Shapiro-Wilk test from this blog post. So what is it?

The Shapiro-Wilk test is a statistical test of the hypothesis that a group of values comes from a normal distribution. (The mean and variance of this normal distribution need not be 0 and 1 respectively.) Empirically, this test appears to have the best power among tests of normality.

Assume that the data are x_1, \dots, x_n \in \mathbb{R} and that we want to test if they come from a population that is normally distributed. The test statistic is

\begin{aligned} W =\dfrac{\left( \sum_{i=1}^n a_i x_{(i)} \right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}, \end{aligned}


where

  • \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i is the mean of the x_i's,
  • x_{(1)} \leq x_{(2)} \leq \dots \leq x_{(n)} are the order statistics,
  • a_1, \dots, a_n are "constants generated from the means, variances and covariances of the order statistics of a sample of size n from a normal distribution" (Ref 3).

Let m = (m_1, \dots, m_n) be the expected values of the standard normal order statistics, and let V be the corresponding covariance matrix. Then

\begin{aligned} a = (a_1, \dots, a_n) = \dfrac{V^{-1}m}{\|V^{-1}m\|_2}. \end{aligned}

We reject the null hypothesis (the data come from a normal distribution) if W is small. In R, the Shapiro-Wilk test can be performed with the shapiro.test() function.
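The test is implemented outside R as well. As a quick sketch (in Python, using scipy's `shapiro`, which implements the same test), data drawn from a normal distribution give W close to 1, while clearly non-normal data pull W down and produce a tiny p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Data actually drawn from a normal distribution: W should be close to 1
normal_sample = rng.normal(loc=5, scale=2, size=100)
W_normal, p_normal = stats.shapiro(normal_sample)

# Clearly non-normal (exponential) data: W drops and the test rejects
skewed_sample = rng.exponential(size=100)
W_skewed, p_skewed = stats.shapiro(skewed_sample)
```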

As far as I can tell there isn’t a closed form for the distribution of W under the null.

What is the intuition behind this test? Reference 2 has a good explanation for this:

The basic idea behind the Shapiro-Wilk test is to estimate the variance of the sample in two ways: (1) the regression line in the QQ-plot allows us to estimate the variance, and (2) the variance of the sample can also be regarded as an estimator of the population variance. Both estimated values should be approximately equal in the case of a normal distribution and thus should result in a quotient close to 1.0. If the quotient is significantly lower than 1.0, then the null hypothesis (of having a normal distribution) should be rejected.

Why is it called the Shapiro-Wilk test? It was proposed by S. S. Shapiro and M. B. Wilk in a 1965 Biometrika paper, "An Analysis of Variance Test for Normality (Complete Samples)". This original paper proves a number of properties of the W statistic, e.g. \dfrac{na_1^2}{n-1} \leq W \leq 1.


  1. Wikipedia. Shapiro-Wilk test.
  2. Fundamentals of Statistics. Shapiro-Wilk test.
  3. Engineering Statistics Handbook. Section Anderson-Darling and Shapiro-Wilk tests.

Be careful of NA/NaN/Inf values when using base R’s plotting functions!

I was recently working on a supervised learning problem (i.e. building a model using some features to predict some response variable) with a fairly large dataset. I used base R’s plot and hist functions for exploratory data analysis and all looked well. However, when I started building my models, I began to run into errors. For example, when trying to fit the lasso using the glmnet package, I encountered this error:

I thought this error message was rather cryptic. However, after some debugging, I realized the error was exactly what it said it was: there were NA/NaN/Inf values in my data matrix! The problem was that I had expected these problematic values to have been flagged during my exploratory data analysis. However, R’s plot and hist functions silently drop these values before giving a plot.

Here’s some code to demonstrate the issue. Let’s create some fake data with NA/NaN/Inf values:

n <- 50  # no. of observations
p <- 2   # no. of features

# create fake data matrix
x <- matrix(rnorm(n * p), nrow = n)

# make some entries invalid
x[1:3, 1] <- NA
x[4:5, 2] <- Inf

head(x)
#>            [,1]       [,2]
#> [1,]         NA  0.3981059
#> [2,]         NA -0.6120264
#> [3,]         NA  0.3411197
#> [4,]  1.5952808        Inf
#> [5,]  0.3295078        Inf
#> [6,] -0.8204684  1.9803999

The two lines of code below give plots in return, without any warning message to the console that data points have been dropped:

hist(x[, 1])
plot(x[, 1], x[, 2])

The ggplot2 package does a better job of handling such values. While it also makes the plot, it sends a warning to the console that some values have been dropped in the process:

library(ggplot2)

df <- data.frame(x = x[, 1])
ggplot(df, aes(x)) + geom_histogram()

Moral(s) of the story:

  1. Don’t assume that your data is free of NA/NaN/Inf values. Check!
  2. Base R’s hist and plot functions do not warn about invalid values being removed. Either follow the advice in the previous point or use code that flags such removals (e.g. ggplot2).

An identity linking the nuclear norm and the Frobenius norm

Mazumder et al. (2010) prove a very nice little lemma (Lemma 6 in the paper) that links the nuclear norm of a matrix (i.e. the sum of its singular values) to the solution of an optimization problem involving Frobenius norms. Here is the lemma:

Lemma: For any matrix Z \in \mathbb{R}^{m \times n}, we have

\begin{aligned} \| Z \|_* = \min_{U, V : Z = UV^T} \frac{1}{2} \left( \| U\|_F^2 + \|V\|_F^2 \right). \end{aligned}

If \text{rank}(Z) = k \leq \min (m, n), then the minimum above is attained at a factor decomposition Z = U_{m \times k} V_{n \times k}^T.

The proof of the lemma is just a page long and is provided in Appendix A.5 of the paper.
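The lemma is easy to check numerically. From the SVD Z = A D B^T, a minimizing factorization splits the singular values evenly between the two factors: U = A D^{1/2}, V = B D^{1/2}. A sketch (in Python with numpy, rather than R):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 3))

# Nuclear norm: sum of the singular values of Z
A, s, Bt = np.linalg.svd(Z, full_matrices=False)
nuclear_norm = s.sum()

# Split the singular values evenly: U = A sqrt(D), V = B sqrt(D),
# so that U V^T = Z (broadcasting scales column i by sqrt(s_i))
U = A * np.sqrt(s)
V = Bt.T * np.sqrt(s)

# The Frobenius-norm objective at this factorization
objective = 0.5 * (np.sum(U**2) + np.sum(V**2))
```

Here `np.sum(U**2)` is ‖U‖_F² = sum of s, and likewise for V, so the objective equals the nuclear norm exactly.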


  1. Mazumder, R., Hastie, T., and Tibshirani, R. (2010) Spectral Regularization Algorithms for Learning Large Incomplete Matrices.

Marginally Gaussian does not imply jointly Gaussian

A multivariate random variable \mathbf{X} = (X_1, \dots, X_p) \in \mathbb{R}^p is said to have a joint multivariate normal (Gaussian) distribution if for any \mathbf{a} \in \mathbb{R}^p, \mathbf{a^T X} has a normal (Gaussian) distribution. In other words, any linear combination of the elements of \mathbf{X} must be normally distributed.

In particular, if we define \mathbf{a} such that it has zeros in all entries except the jth entry where it has a one, we see that X_j must be normally distributed. Hence, joint Gaussianity implies marginal Gaussianity.

Is the converse true? That is, does X_j being normally distributed for j = 1, \dots, p imply that \mathbf{X} = (X_1, \dots, X_p) is jointly normally distributed? The answer is NO. The first reference below has some nice heatmaps demonstrating this. There are a total of 6 heatmaps, each depicting a bivariate distribution that is marginally Gaussian. Only 3 of them (the two on the left and the top-center one) are jointly Gaussian; the other 3 are not. I've reproduced the plot below for convenience:

Bivariate distributions with marginal Gaussianity. Only 3 of them are joint Gaussian distributions.
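The classic counterexample (discussed in the CrossValidated reference below) is also easy to simulate: take X ~ N(0,1) and flip its sign with an independent fair coin to get Y. Both marginals are standard normal, but X + Y puts probability 1/2 on the single point 0, which no Gaussian can do. A quick sketch in Python with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.normal(size=n)
w = rng.choice([-1.0, 1.0], size=n)  # independent random sign
y = w * x                            # marginally N(0, 1) by symmetry

# If (X, Y) were jointly Gaussian, X + Y would be Gaussian too.
# Instead it equals exactly 0 whenever the sign flips (about half the time):
s = x + y
prop_zero = (s == 0.0).mean()
```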


  1. CrossValidated. Is it possible to have a pair of Gaussian random variables for which the joint distribution is not Gaussian?

FISTA: A description

This post introduces the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA), a fast optimization algorithm for a particular class of unconstrained convex optimization problems. In particular, we seek to minimize F(x) = f(x) + g(x) over x \in \mathbb{R}^n where:

  • g: \mathbb{R}^n \mapsto \mathbb{R} is a continuous convex function which is possibly non-smooth,
  • f: \mathbb{R}^n \mapsto \mathbb{R} is a smooth convex function that is continuously differentiable with Lipschitz continuous gradient: i.e. there is a constant L(f) > 0 (the Lipschitz constant of \nabla f) such that \begin{aligned} \| \nabla f(x) - \nabla f(y) \| \leq L(f) \| x - y \| \end{aligned} for every x, y \in \mathbb{R}^n, where \| \cdot \| is the standard Euclidean norm.

We also assume that the problem is solvable, i.e. there exists some x^* \in \mathbb{R}^n such that F(x^*) \leq F(x) for all x \in \mathbb{R}^n.


FISTA is a refinement of the Iterative Shrinkage-Thresholding Algorithm (ISTA), which we introduce first. Define

\begin{aligned} p_L(y) = \text{argmin}_x \; \left\{ g(x) + \frac{L}{2} \left\| x - \left( y - \frac{1}{L} \nabla f(y) \right) \right\|^2 \right\}. \end{aligned}

(This is very close to the proximal operator for g.) ISTA (with constant step size) is as follows:

  • Input: L = L(f), a Lipschitz constant of \nabla f.
  • Step 0: Take x_0 \in \mathbb{R}^n.
  • Step k (k \geq 1): Set x_k = p_L (x_{k-1}).

(The steps for ISTA with backtracking can be found in the FISTA paper.) ISTA is fast if the operator p_L can be computed cheaply. Beck & Teboulle show that ISTA has a worst-case convergence rate of O(1/k), i.e. F(x_k) - F(x^*) \simeq O(1/k).
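One important case where p_L is cheap: when g(x) = \lambda \|x\|_1 (the lasso penalty), the minimization defining p_L separates across coordinates and reduces to elementwise soft-thresholding of the gradient step. A sketch (in Python with numpy, rather than R):

```python
import numpy as np

def soft_threshold(v, tau):
    # elementwise sign(v) * max(|v| - tau, 0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def p_L(y, grad_f, L, lam):
    # argmin_x { lam*||x||_1 + (L/2) * ||x - (y - grad_f(y)/L)||^2 }
    # = soft-threshold the gradient step at level lam/L
    return soft_threshold(y - grad_f(y) / L, lam / L)
```

For example, `soft_threshold(np.array([3.0, -0.5, 1.0]), 1.0)` shrinks 3.0 down to 2.0 and zeroes out the two entries with magnitude at most 1.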


FISTA improves on ISTA by choosing a better sequence of y_k's to apply the operator p_L to. It is inspired by Nesterov (1983). The formulas appear pretty magical to me: the method seems to be using some sort of "momentum", since it applies p_L to a specific linear combination of x_{k-1} and x_{k-2} (instead of just x_{k-1}).

Here are the steps for FISTA with constant step size (FISTA with backtracking is in the FISTA paper):

  • Input: L = L(f), a Lipschitz constant of \nabla f.
  • Step 0: Take y_1 = x_0 \in \mathbb{R}^n, t_1 = 1.
  • Step k (k \geq 1):
    • Set x_k = p_L (y_k).
    • Set t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}.
    • Set y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1}).

The operations for FISTA are pretty much the same as those for ISTA, but FISTA enjoys faster convergence: F(x_k) - F(x^*) \simeq O(1/k^2). Here is the convergence theorem for FISTA with constant step size:

Theorem (convergence rate of FISTA): Let \{ x_k\} and \{y_k\} be generated by FISTA with constant step size. Then for any k \geq 1, we have

\begin{aligned} F(x_k) - F(x^*) \leq \frac{2L(f) \|x_0 - x^*\|^2}{(k+1)^2}, \end{aligned}

for any minimizer x^*.

See Theorem 4.4 in the FISTA paper for the full statement (it includes the bound for FISTA with backtracking).


  1. Beck, A., and Teboulle, M. (2009) A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.
  2. Nesterov, Y. E. (1983). A method for solving the convex programming problem with convergence rate O(1/k^2).

Looking at flood insurance claims with choroplethr

I recently learned how to use the choroplethr package through a short tutorial by the package author Ari Lamstein (youtube link here). To cement what I learned, I thought I would use this package to visualize flood insurance claims.

I am using the FIMA NFIP redacted claims dataset from FEMA, and it contains more than 2 million claims transactions going all the way back to 1970. (The dataset can be accessed from this URL.) From what I can tell, this dataset is updated somewhat regularly. For this post, the version of the dataset used was published on 26 Jun 2019, and I downloaded it on 10 Jul 2019. (Credit: I learned of this dataset from Jeremy Singer-Vine’s Data Is Plural newsletter, 2019.07.10 edition.)

To access all the code in this post in a single script, click here.

First, let’s load libraries and the dataset:


library(tidyverse)
library(choroplethr)
library(choroplethrMaps)

df <- read_csv("openfema_claims20190331.csv", 
               col_types = cols(asofdate = col_date(format = "%Y-%m-%d"), 
                                dateofloss = col_date(format = "%Y-%m-%d"), 
                                originalconstructiondate = col_date(format = "%Y-%m-%d"), 
                                originalnbdate = col_date(format = "%Y-%m-%d")))

I get some parsing failures, indicating that there might be some issues with the dataset, but I will ignore them for now since it’s not a relevant point for this post. I end up with around 2.4 million observations of 39 variables. We are only going to be interested in a handful of variables. Here they are along with the description given in the accompanying data dictionary:

    • yearofloss: Year in which the flood loss occurred (YYYY).
    • state: The two-character alpha abbreviation of the state in which the insured property is located.
    • countycode: The Federal Information Processing Standard (FIPS) defined unique 5-digit code for the identification of counties and equivalent entities of the United States, its possessions, and insular areas. The NFIP relies on a geocoding service to assign the county.
    • amountpaidonbuildingclaim: Dollar amount paid on the building claim. In some instances, a negative amount may appear which occurs when a check issued to a policy holder isn’t cashed and has to be re-issued.
    • amountpaidoncontentsclaim: Dollar amount paid on the contents claim. In some instances, a negative amount may appear, which occurs when a check issued to a policy holder isn’t cashed and has to be re-issued.
    • amountpaidonincreasedcostofcomplianceclaim: Dollar amount paid on the Increased Cost of Compliance (ICC) claim. ICC coverage is one of several flood insurance resources for policyholders who need additional help rebuilding after a flood. It provides up to $30,000 to help cover the cost of mitigation measures that will reduce the flood risk. As we are not explicitly defining the building and contents components of the claim, I don't think it is necessary to define ICC further.

I didn't understand the explanation for negative amounts in the claim fields. Since there were fewer than 20 records with such values, I decided to remove them. I also removed all rows without any state information (12 records).

# select just the columns we want and remove rows with no state info or
# negative claim amounts
small_df <- df %>% 
    select(year = yearofloss, state, countycode, 
           claim_bldg = amountpaidonbuildingclaim, 
           claim_contents = amountpaidoncontentsclaim, 
           claim_ICC = amountpaidonincreasedcostofcomplianceclaim) %>% 
    filter(!is.na(state)) %>%
    replace_na(list(claim_bldg = 0, claim_contents = 0, claim_ICC = 0)) %>%
    filter(!(claim_bldg < 0 | claim_contents < 0 | claim_ICC < 0))

The code below makes a histogram of the number of claims per year:

# histogram of year
with(small_df, hist(year, breaks = c(min(year):max(year)), 
                    main = "No. of claims / year", xlab = "Year"))

While there are supposed to be records from 1970, the records before 1978 seem to be pretty sparse. We also don’t have a full year of records for 2019 yet. As such, I filtered for just records between 1978 and 2018 (inclusive). I also assumed that the total claim amount was the sum of the 3 claim columns: this is the only claim amount I work with in this post.

small_df <- small_df %>% filter(year >= 1978 & year <= 2018) %>%
    mutate(total_cost = claim_bldg + claim_contents + claim_ICC)

Number of claims by state

For my first map, I wanted to show the number of claims made by state. To do that, I have to give the state_choropleth() function a very specific dataframe: it must have a region column with state names (e.g. "new york", not "NY" or "New York") and a value column with the number of claims. The dataset has states as two-letter abbreviations, so we have to map those abbreviations to state names. Thankfully it is easy to do that with the mapvalues() function in the plyr package and the state.regions dataframe in the choroplethrMaps package:

state_no_claims_df <- small_df %>% group_by(state) %>%
    summarize(claim_cnt = n())

# map to state name
state_no_claims_df$region <- plyr::mapvalues(state_no_claims_df$state, 
                                             from = state.regions$abb, 
                                             to = state.regions$region)

# rename columns for choroplethr
names(state_no_claims_df) <- c("state", "value", "region")

# basic map
state_choropleth(state_no_claims_df)
Note that we get a warning because our dataframe has regions that are not mappable (AS, GU, PR, VI: presumably places like Guam and Puerto Rico). Nevertheless we get a colored map:

Not bad for a really short command! By default, the value column is split into 7 buckets (according to quantiles). To see which states are above/below the median number of claims, we can set the num_colors option to 2:

# above and below median
state_choropleth(state_no_claims_df, num_colors = 2)

For a continuous scale, set num_colors to 1 (if you have negative values, you can set num_colors 0 for a diverging scale):

# continuous scale
state_choropleth(state_no_claims_df, num_colors = 1)

From the map above, it is clear that Louisiana had the most number of claims, followed by Texas and Florida.

It is easy to add a title and legend label to the map:

# one with title
state_choropleth(state_no_claims_df, num_colors = 1, 
                 title = "No. of claims by state", legend = "No. of claims")

The output of state_choropleth is a ggplot object, so we can modify the output as we would with ggplot graphics. One bug I noticed was that in adding the color scale below, the legend name reverted to “value” (instead of “No. of claims”); I have not figured out how to amend this yet.

# choroplethr output is a ggplot object
gg1 <- state_choropleth(state_no_claims_df, num_colors = 1, 
                        title = "No. of claims by state", legend = "No. of claims")

class(gg1)
#> [1] "gg"     "ggplot"

gg1 + scale_fill_distiller(palette = "Spectral") +
    theme(plot.title = element_text(size = rel(2), face = "bold", hjust = 0.5))

state_choropleth() allows us to plot just the states we want by passing a vector of states to the zoom option. In the code below, we just plot the states which border the Gulf of Mexico:

# zoom in on gulf states: have a shoreline on gulf of mexico
gulf_states <- c("texas", "louisiana", "mississippi", "alabama", "florida")
state_choropleth(state_no_claims_df, num_colors = 1, 
                 title = "No. of claims by state (Gulf states)", legend = "No. of claims",
                 zoom = gulf_states)

This last barplot shows the no. of claims by state (in descending order). 6 states have over 100,000 claims, with North Carolina barely hitting that number.

# bar plot of no. of claims
ggplot(data = state_no_claims_df %>% arrange(desc(value)) %>% head(n = 10)) +
    geom_bar(aes(x = reorder(state, -value), y = value), stat = "identity",
             fill = "gray", col = "black") +
    geom_abline(intercept = 100000, slope = 0, col = "red", lty = 2) +
    labs(x = "State", y = "No. of claims", title = "Top 10 states by no. of claims")

Total claim amount by state

This section doesn't contain any new choroplethr syntax but is included for completeness of the data story. Instead of looking at the number of claims by state, I look at the total claim amount by state. My initial guess is that we shouldn't see much variation from the plots above.

The code below extracts the summary information from the claims level dataset and makes the choropleth map and the barplot (the red line indicates the $1 billion mark):

state_claims_amt_df <- small_df %>% group_by(state) %>%
    summarize(value = log10(sum(total_cost)))

# map to state name
state_claims_amt_df$region <- plyr::mapvalues(state_claims_amt_df$state, 
                                              from = state.regions$abb, 
                                              to = state.regions$region)

state_choropleth(state_claims_amt_df, num_colors = 1, 
                 title = "log10(Claim amount) by state", legend = "") + 
    scale_fill_distiller(palette = "Spectral") + 
    theme(plot.title = element_text(size = rel(2), face = "bold", hjust = 0.5))

# bar plot of claim amt
ggplot(data = state_claims_amt_df %>% arrange(desc(value)) %>% head(n = 10)) +
    geom_bar(aes(x = reorder(state, -value), y = 10^value), stat = "identity", 
             fill = "gray", col = "black") +
    geom_abline(intercept = 10^9, slope = 0, col = "red", lty = 2) +
    scale_y_continuous(labels = function(x) format(x, scientific = TRUE)) +
    labs(x = "State", y = "Total claims", title = "Top 10 states by total claims")

Perhaps not surprisingly, the top 5 states with the most claims are also the top 5 states with the largest claim amounts. What might be more surprising is how much more money was claimed in LA and TX compared to the other states.

Number of claims by county

choroplethr makes plotting by county really easy, as long as you have the 5-digit FIPS codes for the counties. Instead of using the state_choropleth() function, use the county_choropleth() function. As before, the dataframe passed to this function must have a "region" column with the FIPS codes (as numbers, not strings) and a "value" column with the values to be plotted. The county.regions dataset in the choroplethrMaps package can help with converting region values if necessary.

county_summary_df <- small_df %>% group_by(countycode) %>%
    summarize(claim_cnt = n(), claim_amt = sum(total_cost)) %>%
    mutate(countycode = as.numeric(countycode))

names(county_summary_df) <- c("region", "value", "claim_amt")


As with state_choropleth(), the default is to split the values into 7 bins according to quantiles. We get a bunch of warnings because several counties have NA values. By default, these counties are filled black. Since I think this affects how we read the map, I use the code below to change the NA fill color to white. This makes sense in this context too, since counties have NA values precisely because they have no claims.

# change fill color for NAs to be white
county_choropleth(county_summary_df, title = "No. of claims by county") +
    scale_fill_brewer(na.value = "white")

We can draw county maps for specific states as well. The code below makes two plots: the first zooms in on just Florida, the second zooms in on all the Gulf states.

# zoom into florida only
county_choropleth(county_summary_df, title = "No. of claims by county (Florida)",
                  state_zoom = "florida") +
    scale_fill_brewer(na.value = "white")

# zoom into gulf
county_choropleth(county_summary_df, title = "No. of claims by county (Gulf states)",
                  state_zoom = gulf_states) +
    scale_fill_brewer(na.value = "white")

Top 5 states: a (slightly) deeper look

In this section we take a temporal look at the top 5 states. The code below filters for them and summarizes the data:

top_states <- (state_no_claims_df %>% arrange(desc(value)) %>% head(n = 5))[[1]]
top_df <- small_df %>% filter(state %in% top_states)
top_regions <- plyr::mapvalues(top_states, from = state.regions$abb, 
                               to = state.regions$region)

state_yr_summary_df <- top_df %>% group_by(state, year) %>%
    summarize(claim_cnt = n(), claim_amt = sum(total_cost))

The first chunk of code plots the number of claims for each state for each year, while the second chunk does the same for total claim amount:

ggplot(data = state_yr_summary_df, aes(x = year, y = claim_cnt, col = state)) +
    geom_line() + geom_point() +
    scale_y_continuous(labels = function(x) format(x, scientific = TRUE)) +
    labs(x = "Year", y = "No. of claims", title = "No. of claims by state")

ggplot(data = state_yr_summary_df, aes(x = year, y = claim_amt, col = state)) +
    geom_line() + geom_point() +
    scale_y_continuous(labels = function(x) format(x, scientific = TRUE)) +
    labs(x = "Year", y = "Claim amount", title = "Claim amount by state")

It looks like for these states, most of the claims and most of the damage comes in just one or two particular incidents. From the state and the year, you can probably guess the disaster that caused the spike.

Laplace’s approximation for Bayesian posterior distribution

In Bayesian statistics, we have some parameter \theta that we want to estimate. To do so, we set a prior distribution on the parameter, p(\theta), and assume a likelihood function, p(x \mid \theta), that models the relationship between the parameter and the data we see.

The key object in Bayesian inference is the posterior distribution, p(\theta \mid x), representing our belief about the distribution of the parameter given the prior and the data. It is computed using Bayes’ theorem:

\begin{aligned} p(\theta \mid x) = \frac{p(x \mid \theta) p (\theta)}{p(x)} =  \frac{p(x \mid \theta) p (\theta)}{\int p(x \mid \theta = t) p (\theta = t) d t}. \end{aligned}

A big difficulty of Bayesian statistics is evaluating the integral in the denominator above, also known as the normalization constant. Laplace’s approximation is one technique we can use to approximate the denominator.

(The rest of this post closely follows Reference 1.) For simpler notation, let’s rewrite the denominator as \int_a^b g(t) dt. Assume that we can find the point t_0 where the function g attains its maximum. Let h = \log g. Then, taking a second-order Taylor expansion of h around t_0, we have

\begin{aligned} \int_a^b g(t) dt &= \int_a^b \exp[h(t)] dt \\  &\approx \int_a^b \exp \left[ h(t_0) + h'(t_0) (t - t_0) + \frac{1}{2}h''(t_0) (t - t_0)^2 \right] dt \\  &=  \int_a^b \exp \left[ h(t_0) + \frac{1}{2}h''(t_0) (t - t_0)^2 \right] dt &(h'(t_0) = 0) \\  &= \exp [h(t_0)] \int_a^b \exp \left[ \frac{1}{2}h''(t_0) (t - t_0)^2 \right]dt \\  &= \exp [h(t_0)] \int_a^b \exp \left[ - \frac{1}{2} \frac{(t - t_0)^2}{-h''(t_0)^{-1}}  \right]dt. \end{aligned}

We recognize the integrand as proportional to the PDF of a normal distribution with mean t_0 and variance -h''(t_0)^{-1}. Hence, we can approximate the integral in closed form in terms of the normal CDF:

\begin{aligned} \int_a^b g(t) dt \approx \exp [h(t_0)] \sqrt{\frac{2\pi}{-h''(t_0)}} \left[ \Phi(b \mid t_0, -h''(t_0)^{-1}) - \Phi(a \mid t_0, -h''(t_0)^{-1}) \right]. \end{aligned}

If a = -\infty and b = \infty (as is usually the case), the term in square brackets is equal to 1, and we obtain

\begin{aligned} \int_{-\infty}^\infty g(t) dt \approx \exp [h(t_0)] \sqrt{\frac{2\pi}{-h''(t_0)}}. \end{aligned}

Operationalizing Laplace’s approximation depends crucially on being able to locate t_0 well. Thus, Laplace’s approximation replaces an integration problem with a maximization problem.
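As a concrete check (a hypothetical example in Python, not one from the reference): apply the approximation to g(t) = t^{\alpha - 1} e^{-t} on (0, \infty), whose integral is \Gamma(\alpha). Here h(t) = (\alpha - 1)\log t - t, so t_0 = \alpha - 1 and h''(t_0) = -1/(\alpha - 1), and the approximation recovers Stirling's formula:

```python
import math

alpha = 10.0

# g(t) = t^(alpha-1) * exp(-t); h = log g is maximized at t0 = alpha - 1
t0 = alpha - 1.0
h_t0 = (alpha - 1.0) * math.log(t0) - t0
h2_t0 = -(alpha - 1.0) / t0**2          # h''(t0) = -1/(alpha - 1)

# Laplace's approximation: integral ~ exp(h(t0)) * sqrt(2*pi / -h''(t0))
laplace = math.exp(h_t0) * math.sqrt(2.0 * math.pi / -h2_t0)

exact = math.gamma(alpha)               # Gamma(10) = 9! = 362880
rel_err = abs(laplace - exact) / exact  # about 1% for alpha = 10
```

The relative error shrinks as alpha grows, since the integrand then looks more and more like a normal PDF.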

How well does Laplace's approximation do? That depends on how good the second-order Taylor approximation to h = \log g is. When the second-order Taylor approximation is exact, g is exactly proportional to a normal PDF. Hence, the closer g is to a normal PDF, the better Laplace's approximation will be.


  1. Peng, R. D. (2018). Advanced Statistical Computing (Chapter 5.1).