Be careful of NA/NaN/Inf values when using base R’s plotting functions!

I was recently working on a supervised learning problem (i.e. building a model using some features to predict some response variable) with a fairly large dataset. I used base R’s plot and hist functions for exploratory data analysis and all looked well. However, when I started building my models, I began to run into errors. For example, when trying to fit the lasso using the glmnet package, I encountered this error:

I thought this error message was rather cryptic. However, after some debugging, I realized the error was exactly what it said it was: there were NA/NaN/Inf values in my data matrix! The problem was that I had expected these problematic values to have been flagged during my exploratory data analysis. However, R’s plot and hist functions silently drop these values before giving a plot.
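One way to catch such values before they reach a model-fitting function is an explicit guard. The sketch below is a minimal example, not part of glmnet: `assert_all_finite` is a hypothetical helper name, and it assumes the input is a numeric matrix.

```r
# Hypothetical helper (not from glmnet): stop early if a numeric matrix
# contains any NA/NaN/Inf entry, and report where the first one is.
assert_all_finite <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x))
  # is.finite() is FALSE for NA, NaN, Inf and -Inf
  bad <- which(!is.finite(x), arr.ind = TRUE)
  if (nrow(bad) > 0) {
    stop(nrow(bad), " non-finite entries; first at row ", bad[1, "row"],
         ", column ", bad[1, "col"], call. = FALSE)
  }
  invisible(x)
}

# A clean matrix passes silently
assert_all_finite(matrix(c(1, 2, 3, 4), nrow = 2))
```

Calling the guard just before fitting turns a cryptic downstream error into one that points at the offending entry.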

Here’s some code to demonstrate the issue. Let’s create some fake data with NA/NaN/Inf values:

n <- 50  # no. of observations
p <- 2   # no. of features

# create fake data matrix
x <- matrix(rnorm(n * p), nrow = n)

# make some entries invalid
x[1:3, 1] <- NA
x[4:5, 2] <- Inf
head(x)
#>            [,1]       [,2]
#> [1,]         NA  0.3981059
#> [2,]         NA -0.6120264
#> [3,]         NA  0.3411197
#> [4,]  1.5952808        Inf
#> [5,]  0.3295078        Inf
#> [6,] -0.8204684  1.9803999

The following two lines of code produce plots without printing any warning to the console that data points have been dropped:

plot(x[, 1], x[, 2])
hist(x[, 1])
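If you want to stay with base R but still be told when points go missing, one option is a thin wrapper around plot. This is only a sketch: `count_nonfinite_pairs` and `plot_with_warning` are hypothetical names, not base R functions.

```r
# Hypothetical helper: number of (x, y) pairs with at least one
# NA/NaN/Inf coordinate, i.e. points that will not appear in the plot.
count_nonfinite_pairs <- function(x, y) {
  sum(!(is.finite(x) & is.finite(y)))
}

# Hypothetical wrapper: warn first, then defer to base plot().
plot_with_warning <- function(x, y, ...) {
  n_bad <- count_nonfinite_pairs(x, y)
  if (n_bad > 0) {
    warning(n_bad, " point(s) with NA/NaN/Inf coordinates will not be drawn",
            call. = FALSE)
  }
  plot(x, y, ...)
}

count_nonfinite_pairs(c(1, NA, 3, 4), c(1, 2, Inf, NaN))  # 3
```

Calling `plot_with_warning(x[, 1], x[, 2])` on the fake data above would draw the same scatterplot as plot, but with a warning about the points left out.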

The ggplot2 package does a better job of handling such values. While it also makes the plot, it sends a warning to the console that some values have been dropped in the process:

library(ggplot2)

df <- data.frame(x = x[, 1])
ggplot(df, aes(x)) + geom_histogram()

Moral(s) of the story:

  1. Don’t assume that your data is free of NA/NaN/Inf values. Check!
  2. Base R’s hist and plot functions do not warn about invalid values being removed. Either follow the advice in the previous point or use code that flags such removals (e.g. ggplot2).
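As a sketch of what the "Check!" in the first point could look like, the snippet below counts NA, NaN and infinite entries per column. `summarize_invalid` is a hypothetical helper name, and the small matrix is made up purely for illustration.

```r
# Hypothetical helper: per-column counts of NA, NaN and +/-Inf entries.
summarize_invalid <- function(x) {
  x <- as.matrix(x)
  rbind(
    n_na  = colSums(is.na(x) & !is.nan(x)),  # is.na() is also TRUE for NaN
    n_nan = colSums(is.nan(x)),
    n_inf = colSums(is.infinite(x))
  )
}

# Illustrative data: 6 observations, 2 features, some invalid entries
set.seed(1)
x <- matrix(rnorm(12), nrow = 6)
x[1:3, 1] <- NA
x[4:5, 2] <- Inf
x[6, 2]   <- NaN

summarize_invalid(x)
```

For a quick yes/no answer, `all(is.finite(x))` does the job in one line; the per-column table is more useful when deciding what to do about the bad values.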

6 thoughts on “Be careful of NA/NaN/Inf values when using base R’s plotting functions!”

  1. Great point. We encourage all our students to include a missing data ID step right after they take the data in.


    Is great for this.


  2. Hi, thank you for sharing your experience. I would recommend always counting the missing values of your features and target before doing any analysis. Ideally, even when features have many categories (say, up to the hundreds), it is also good practice to tabulate them, i.e. look at the univariate frequencies.
    If you have many missing values, further exploration might be needed, for example cross-counting them across features (and do you need to impute the missing values?). Also, understanding why they are missing, if you do not already, might be very useful for the meaningfulness of your model and predictions. Good luck in your modelling endeavors!


  3. Is it my browser or your HTML renderer that turned "x <- 5" into "X &lt:- 5"? 😦 BTW, the "silent drop" is by design: this allows the user to insert NA values when a break in a plot is desired. Consider plot(c(1,2,3,NA,4,5),t='l'). And face it: using 'plot' or 'hist' to do your data validation is not really a good idea. How about checking is.finite(c(1,2,3,NA,4,5,Inf,8,9,NaN,10))?


  4. Just to be clear, this has nothing to do with NA values. You get almost the same plot without any missing values, so no, according to “missing data analysis”, default behaviour is no worse than ggplot.

