Documentation for internal functions

TLDR: To avoid triple quotes and R CMD CHECK --as-cran errors due to documentation examples for internal functions, enclose the example code in \dontrun{}.

I recently encountered an issue when submitting an R package to CRAN that I couldn’t find a clean answer for. One of the comments from the manual check was the following:

Using foo:::f instead of foo::f allows access to unexported objects. It should not be used in other context and is generally not recommended, as the semantics of unexported objects may be changed by the package author in routine maintenance.

I was a bit surprised at getting this message as I was quite careful to avoid using the triple quotes ::: in my functions. A bit of checking revealed that I had used them in one of the examples when documenting an internal function.

Now, a simple fix for this would have been to delete the documentation examples for this internal function and resubmit, but that seemed wrong to me. These examples, while not necessarily user-facing, and very important for developers of the package to get a quick handle on the internal functions. There has to be a better way!

Here’s a small example that replicates the problem I faced. (Note that I am working on a Mac, and parts of the post will reflect that.) Create a new R package via RStudio’s “New Project…” (I named this package mypackage, and add the following internal function as internal.R:

#' Internal function
#'
#' This function prints a message.
#'
#' @examples
#' mypackage:::internal_function()
internal_function <- function() {
  print("This is an internal function.")
}

Running ?internal_function in the console, I get the following documentation, as one would expect:

Running R CMD CHECK --as-cran on this package does not throw any warnings or notes, but when you submit it on CRAN, you will get the comment at the beginning of the post.

My first thought was to replace the line #' mypackage:::internal_function() with #' internal_function(). However, this will throw an error when running R CMD CHECK --as-cran:

A simple workaround I found (from this thread) is to enclose the example in \dontrun{}:

#' Internal function
#'
#' This function prints a message.
#'
#' @examples \dontrun{
#' internal_function()
#' }
internal_function <- function() {
  print("This is an internal function.")
}

R CMD CHECK --as-cran no longer throws an error, and there are no :::s to complain about. The documentation file for internal_function now looks like this:

Because this function is internal, it is worth adding a line @keywords internal in the roxygen comment block for this function; that way the function is removed from the package’s documentation index.

Disclaimer: This is the workaround I found: it is not necessarily the best way or the correct way to deal with this issue. I would love to hear from others on how you avoid this problem when documenting internal functions!

Introducing cvwrapr for your cross-validation needs

Update (2021-06-11): cvwrapr v1.0 is now on CRAN! You can view it here.

TLDR: I’ve written an R package, cvwrapr, that helps users to cross-validate hyperparameters. The code base is largely extracted from the glmnet package. The R package is available for download from Github, and contains two vignettes which demonstrate how to use it. Comments, feedback and bug reports welcome!

Imagine yourself in the following story:

You are working on developing a supervised learning method. You ponder each detail of the algorithm day and night, obsessing over whether it does the right thing. After weeks and months of grinding away at your laptop, you finally have a fitter function that you are satisfied with. You bring it to your collaborator who says:

“Great stuff! … Now we need a function that does cross-validation for your hyperparameter.”

A wave of fatigue washes over you: weren’t you done with the project already? You galvanize yourself, thinking, “it’s not that bad: it’s just a for loop over the folds right?” You write the for loop and get a matrix of out-of-fold predictions. You now realize you have to write more code to compute the CV error. “Ok, that’s simple enough for mean-squared error…”, but then a rush of questions flood your mind:

“What about the CV standard errors? I can never remember what to divide the standard deviation by…”
“We have to compute lambda.min and lambda.1se too right?”
“What about family=’binomial’? family=’cox’?”
“Misclassification error? AUC?”

As these questions (and too much wine) make you doubt your life choices, a realization pops into your head:

“Wait: doesn’t cv.glmnet do all this already??”

And that is the backstory for the cvwrapr package, which I’ve written as a personal project.  It essentially rips out the cross-validation (CV) portion of the glmnet package and makes it more general. The R package is available for download from Github, and contains two vignettes which demonstrate how to use it. The rest of this post is a condensed version of the vignettes.

First, let’s set up some fake data.

set.seed(1)
nobs <- 100; nvars <- 10
x <- matrix(rnorm(nobs * nvars), nrow = nobs)
y <- rowSums(x[, 1:2]) + rnorm(nobs)

The lasso

The lasso is a popular regression method that induces sparsity of features in the fitted model. It comes with a hyperparmater \lambda that the user usually picks by cross-validation (CV). The glmnet package has a function cv.glmnet that does this:

library(glmnet)
glmnet_fit <- cv.glmnet(x, y)

The code snippet below shows the equivalent code using the cvwrapr package. The model-fitting function is passed to kfoldcv via the train_fun argument, while the prediction function is passed to the predict_fun argument. (See the vignette for more details on the constraints on these functions.)

library(cvwrapr)
cv_fit <- kfoldcv(x, y, train_fun = glmnet, 
                  predict_fun = predict)

The lasso with variable filtering

In some settings, we might want to exclude variables which are too sparse. For example, we may want to remove variables that are more than 80% sparse (i.e. more than 80% of observations having feature value of zero), then fit the lasso for the remaining variables.

To do CV correctly, the filtering step should be included in the CV loop as well. The functions in the glmnet package have an exclude argument where the user can specify which variables to exclude from model-fitting. Unfortunately, at the time of writing, the exclude argument can only be a vector of numbers, representing the feature columns to be excluded. Since we could be filtering out different variables in different CV folds, using the exclude argument will not work for us.

The cvwrapr package allows us to wrap the feature filtering step into the CV loop with the following code:

# filter function
filter <- function(x, ...) which(colMeans(x == 0) > 0.8)

# model training function
train_fun <- function(x, y) {
  exclude <- which(colMeans(x == 0) > 0.8)
  if (length(exclude) == 0) {
    model <- glmnet(x, y)
  } else {
    model <- glmnet(x[, -exclude, drop = FALSE], y)
  }
  return(list(lambda = model$lambda,
              exclude = exclude,
              model = model))
}

# prediction function
predict_fun <- function(object, newx, s) {
  if (length(object$exclude) == 0) {
    predict(object$model, newx = newx, s = s)
  } else {
    predict(object$model, 
            newx = newx[, -object$exclude, drop = FALSE], 
            s = s)
  }
}

set.seed(1)
cv_fit <- kfoldcv(x, y, train_fun = train_fun, 
                  predict_fun = predict_fun)

Gradient boosting: number of trees

Next, let’s look at a gradient boosting example. In gradient boosting, a common hyperparameter to cross-validate is the number of trees in the gradient boosting ensemble. The code below shows how one can achieve that with the cvwrapr and gbm packages:

library(gbm)

# lambda represents # of trees
train_fun <- function(x, y, lambda) {
  df <- data.frame(x, y)
  model <- gbm::gbm(y ~ ., data = df, 
                    n.trees = max(lambda),
                    distribution = "gaussian")
  
  return(list(lambda = lambda, model = model))
}

predict_fun <- function(object, newx, s) {
  newdf <- data.frame(newx)
  predict(object$model, newdata = newdf, n.trees = s)
}

set.seed(3)
lambda <- 1:100
cv_fit <- kfoldcv(x, y, lambda = lambda, 
                  train_fun = train_fun, 
                  predict_fun = predict_fun)

kfoldcv returns an object of class “cvobj”, which has a plot method:

plot(cv_fit, log.lambda = FALSE, xlab = "No. of trees", 
     main = "CV MSE vs. no. of trees")

The vertical line on the right corresponds to the hyperparameter value that gives the smallest CV error, while the vertical line on the left corresponds to the value whose CV error is within one standard error of the minimum.

Computing different error metrics

Sometimes you may only have access to the out-of-fold predictions; in these cases you can use cvwrapr‘s computeError function to compute the CV error for you (a non-trivial task!).

The code below does CV for the lasso with the CV error metric being the default mean squared error:

set.seed(1)
cv_fit <- kfoldcv(x, y, train_fun = glmnet, 
                  predict_fun = predict,
                  keep = TRUE)
plot(cv_fit)

What if we wanted to pick the hyperparameter according to mean absolute error instead? One way would be to call kfoldcv again with the additional argument type.measure = "mae". This would involve doing all the model-fitting again.

Since we specified keep = TRUE in the call above, cv_fit contains the out-of-fold predictions: we can do CV for mean absolute error with these predictions without refitting the models. The call below achieves that (see vignette for details on the additional arguments):

mae_err <- computeError(cv_fit$fit.preval, y, 
                        cv_fit$lambda, cv_fit$foldid, 
                        type.measure = "mae", family = "gaussian")
plot(mae_err)

covidcast package for COVID-19-related data

(This is a PSA post, where I share a package that I think that might be of interest to the community but I haven’t looked too deeply into myself.)

Today I learnt of the covidcast R package, which provides access to the COVIDcast Epidata API published by the Delphi group at Carnegie Mellon University. According to the covidcast R package website,

This API provides daily access to a range of COVID-related signals Delphi that builds and maintains, from sources like symptom surveys and medical claims data, and also standard signals that we simply mirror, like confirmed cases and deaths.

(There is a corresponding python package with similar functionality.) The Delphi group has done a huge amount of work in logging a wide variety of COVID-related data and making it available, along with tools to visualize and make sense of the data.

For those interested in doing COVID-related analyses, I think this is a treasure trove of information for you to use. The covidcast package contains several different types of data (which they call “signals”), including public behavior (e.g. COVID searches on Google), early indicators (e.g. COVID-related doctor visits) and late indicators (e.g. deaths). Documentation on the signals available can be found here. (Note: The data is US-focused right now; I don’t know if there are plans to include data from other countries.)

Let me end off with a simple example showing what you can do with this package. This example is modified from one of the package vignettes; see the Articles section of the package website for more examples.

The package is not available on CRAN yet but can be downloaded from Github:

devtools::install_github("cmu-delphi/covidcast", ref = "main",
                         subdir = "R-packages/covidcast")

The code below pulls data on cumulative COVID cases per 100k people on 2020-12-31 at the county level. covidcast_signal is the function to use for pulling data, and it returns an object of class c("covidcast_signal", "data.frame").

library(covidcast)

# Cumulative COVID cases per 100k people on 2020-12-31
df <- covidcast_signal(data_source = "usa-facts", 
                   signal = "confirmed_cumulative_prop",
                   start_day = "2020-12-31", end_day = "2020-12-31")
summary(df)
# A `covidcast_signal` data frame with 3142 rows and 9 columns.
# 
# data_source : usa-facts
# signal      : confirmed_cumulative_prop
# geo_type    : county
# 
# first date                          : 2020-12-31
# last date                           : 2020-12-31
# median number of geo_values per day : 3142

There is a plot method for calss covidcast_signal objects:

plot(df)

The automatic plot is usually not bad. The plot method comes with some arguments that the user can use to customize the plot (full documentation here):

breaks <- c(0, 500, 1000, 5000, 10000)
colors <- c("#D3D3D3", "#FEDDA2",  "#FD9950", "#C74E32", "#800026")
plot(df, choro_col = colors, choro_params = list(breaks = breaks),
     title = "Cumulative COVID cases per 100k people on 2020-12-31")

The plot returned is actually created using the ggplot2 package, so it is possible to add your own ggplot2 code on top of it:

library(ggplot2)
plot(df, choro_col = colors, choro_params = list(breaks = breaks),
     title = "Cumulative COVID cases per 100k people on 2020-12-31") +
  theme(title = element_text(face = "bold"))