# Generating correlation matrix for AR(1) model

Assume that we are in the time series data setting, where we have data at equally-spaced times $1, 2, \dots$ which we denote by random variables $X_1, X_2, \dots$. The AR(1) model, commonly used in econometrics, assumes that the correlation between $X_i$ and $X_j$ is $\text{Cor}(X_i, X_j) = \rho^{|i-j|}$, where $\rho$ is some parameter that usually has to be estimated.

If we were writing out the full correlation matrix for $n$ consecutive data points $X_1, \dots, X_n$, it would look something like this: $\begin{pmatrix} 1 & \rho & \rho^2 & \dots & \rho^{n-1} \\ \rho & 1 & \rho & \dots & \rho^{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} &\dots & 1 \end{pmatrix}$

(Side note: This is an example of a correlation matrix which has Toeplitz structure.)

Given $\rho$, how can we generate this matrix quickly in R? The function below is my (current) best attempt:

ar1_cor <- function(n, rho) {
exponent <- abs(matrix(1:n - 1, nrow = n, ncol = n, byrow = TRUE) -
(1:n - 1))
rho^exponent
}


In the function above, n is the number of rows in the desired correlation matrix (which is the same as the number of columns), and rho is the $\rho$ parameter. The function makes use of the fact that when subtracting a vector from a matrix, R automatically recycles the vector to have the same number of elements as the matrix, and it does so in a column-wise fashion.

Here is an example of how the function can be used:

ar1_cor(4, 0.9)
#       [,1] [,2] [,3]  [,4]
# [1,] 1.000 0.90 0.81 0.729
# [2,] 0.900 1.00 0.90 0.810
# [3,] 0.810 0.90 1.00 0.900
# [4,] 0.729 0.81 0.90 1.000


Such a function might be useful when trying to generate data that has such a correlation structure. For example, it could be passed as the Sigma parameter for MASS::mvrnorm(), which generates samples from a multivariate normal distribution.

Can you think of other ways to generate this matrix?

# Non-negative least squares

Imagine that one has a data matrix $X \in \mathbb{R}^{n \times p}$ consisting of $n$ observations, each with $p$ features, as well as a response vector $y \in \mathbb{R}^n$. We want to build a model for $y$ using the feature columns in $X$. In ordinary least squares (OLS), one seeks a vector of coefficients $\hat{\beta} \in \mathbb{R}^p$ such that \begin{aligned} \hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \quad \| y - X\beta \|_2^2. \end{aligned}

In non-negative least squares (NNLS), we seek a vector coefficients $\hat{\beta} \in \mathbb{R}^p$ such that it minimizes $\| y - X \beta\|_2^2$ subject to the additional requirement that each element of $\hat{\beta}$ is non-negative.

There are a number of ways to perform NNLS in R. The first two methods come from Reference 1, while I came up with the third. (I’m not sharing the third way Reference 1 details because it claims that the method is buggy.)

Let’s generate some fake data that we will use for the rest of the post:

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), nrow = n)
y <- x %*% matrix(rep(c(1, -1), length.out = p), ncol = 1) + rnorm(n)


Method 1: the nnls package

library(nnls)
mod1 <- nnls(x, y)
mod1x #  0.9073423 0.0000000 1.2971069 0.0000000 0.9708051 #  0.0000000 1.2002310 0.0000000 0.3947028 0.0000000  Method 2: the glmnet package The glmnet() function solves the minimization problem \begin{aligned} \hat{\beta} = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \quad \frac{1}{2n} \| y - X\beta \|_2^2 + \lambda \left[ \frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha \|\beta\|_1 \right], \end{aligned} where $\alpha$ and $\lambda$ are hyperparameters the user chooses. By setting $\alpha = 1$ (the default) and $\lambda = 0$, glmnet() ends up solving the OLS problem. By setting lower.limits = 0, this forces the coefficients to be non-negative. We should also set intercept = FALSE so that we don’t have an extraneous intercept term. library(glmnet) mod2 <- glmnet(x, y, lambda = 0, lower.limits = 0, intercept = FALSE) coef(mod2) # 11 x 1 sparse Matrix of class "dgCMatrix" # s0 # (Intercept) . # V1 0.9073427 # V2 . # V3 1.2971070 # V4 . # V5 0.9708049 # V6 . # V7 1.2002310 # V8 . # V9 0.3947028 # V10 .  Method 3: the bvls package NNLS is a special case of bounded-variable least squares (BVLS), where instead of having constraints $\beta_j \geq 0$ for each $j = 1, \dots, p$, one has constraints $a_j \leq \beta_j \leq b_j$ for each $j$. BVLS is implemented in the bvls() function of the bvls package: library(bvls) mod3 <- bvls(x, y, bl = rep(0, p), bu = rep(Inf, p)) mod3x
#  0.9073423 0.0000000 1.2971069 0.0000000 0.9708051
#  0.0000000 1.2002310 0.0000000 0.3947028 0.0000000


In the above, bl contains the lower limits for the coefficients while bu contains the upper limits for the coefficients.

References:

1. Things I Thought At One Point. Three ways to do non-negative least squares in R.

# An unofficial vignette for the gamsel package

I’ve been working on a project/package that closely mirrors that of GAMSEL (generalized additive model selection), a method for fitting sparse generalized additive models (GAMs). In preparing my package, I realized that (i) the gamsel package which implements GAMSEL doesn’t have a vignette, and (ii) I could modify the vignette for my package minimally to create one for gamsel. So here it is!

For a markdown version of the vignette, go here. Unfortunately LaTeX doesn’t play well on Github… The Rmarkdown file is available here: you can download it and knit it on your own machine.

Introduction

This is an unofficial vignette for the gamsel package. GAMSEL (Generalized Additive Model Selection) is a method for fitting sparse generalized additive models proposed by Alexandra Chouldechova and Trevor Hastie. Here is the abstract of the paper:

We introduce GAMSEL (Generalized Additive Model Selection), a penalized likelihood approach for fitting sparse generalized additive models in high dimension. Our method interpolates between null, linear and additive models by allowing the effect of each variable to be estimated as being either zero, linear, or a low-complexity curve, as determined by the data. We present a blockwise coordinate descent procedure for efficiently optimizing the penalized likelihood objective over a dense grid of the tuning parameter, producing a regularization path of additive models. We demonstrate the performance of our method on both real and simulated data examples, and compare it with existing techniques for additive model selection.

Let there be $n$ observations, each consisting of $p$ features. Let $\mathbf{X} \in \mathbb{R}^{n \times p}$ denote the overall feature matrix, and let $y \in \mathbb{R}^n$ denote the vector of responses. Let $X_j \in \mathbb{R}^n$ denote the $j$th column of $\mathbf{X}$.

For each variable $X_j$, let $U_j \in \mathbb{R}^{n \times m_j}$ be the matrix of evaluations of $m_j$ basis functions $u_1, \dots, u_{m_j}$ at points $x_{1j}, \dots, x_{nj}$. The model response from GAMSEL is of the form \begin{aligned} \hat{y} = \alpha_0 + \sum_{j=1}^p \alpha_j X_j + U_j \beta_j, \end{aligned}

where $\beta_j \in \mathbb{R}^{m_j}$.

For more details on the method, see the arXiv paper. For gamsel’s official R documentation, see this link.

The gamsel() function

The purpose of this section is to give users a general sense of the gamsel() function, which is probably the most important function of the package. First, we load the gamsel package:

library(gamsel)


Let’s generate some data:

set.seed(1)
n <- 100; p <- 12
x = matrix(rnorm((n) * p), ncol = p)
f4 = 2 * x[,4]^2 + 4 * x[,4] - 2
f5 = -2 * x[, 5]^2 + 2
f6 = 0.5 * x[, 6]^3
mu = rowSums(x[, 1:3]) + f4 + f5 + f6
y = mu + sqrt(var(mu) / 4) * rnorm(n)


We fit the model using the most basic call to gamsel():

fit <- gamsel(x, y)


The function gamsel() fits GAMSEL for a path of lambda values and returns a gamsel object. Typical usage is to have gamsel() specify the lambda sequence on its own. (By default, the model is fit for 50 different lambda values.) The returned gamsel object contains some useful information on the fitted model. For a given value of the $\lambda$ hyperparameter, GAMSEL gives the predictions of the form \begin{aligned} \hat{y} = \alpha_0 + \sum_{j=1}^p \alpha_j X_j + U_j \beta_j, \end{aligned}

where $\alpha_0 \in \mathbb{R}$, $\alpha_j \in \mathbb{R}$ and $\beta_j \in \mathbb{R}^{m_j}$ are fitted coefficients.

Printing the returned gamsel object tells us how many features, linear components and non-linear components were included in the model for each lambda value respectively. It also shows the fraction of null deviance explained by the model and the lambda value for that model.

fit
#>
#> Call:  gamsel(x = x, y = y)
#>
#>       NonZero Lin NonLin    %Dev  Lambda
#>  [1,]       0   0      0 0.00000 80.1300
#>  [2,]       1   1      0 0.03693 72.9400
#>  [3,]       1   1      0 0.06754 66.4000
#>  [4,]       1   1      0 0.09290 60.4400
#>  [5,]       1   1      0 0.11390 55.0200
#>  [6,]       1   1      0 0.13130 50.0900
#>  [7,]       1   1      0 0.14580 45.5900
#>  .... (redacted for conciseness)


gamsel has a tuning parameter gamma which is between 0 and 1. Smaller values of gamma penalize the linear components less than the non-linear components, resulting in more linear components for the fitted model. The default value is gamma = 0.4.

fit2 <- gamsel(x, y, gamma = 0.8)


By default, each variable is given $m_j = 10$ basis functions. This can be modified with the degrees option, and this value can differ from variable to variable (to allow for this, pass a vector of length equal to the number of variables to the degrees option).

By default, the maximum degrees of freedom for each variable is 5. This can be modified with the dfs option, with larger values allowing more “wiggly” fits. Again, this value can differ from variable to variable.

Predictions

Predictions from this model can be obtained by using the predict method of the gamsel() function output: each column gives the predictions for a value of lambda.

# get predictions for all values of lambda
all_predictions <- predict(fit, x) dim(all_predictions) #>  100  50

# get predictions for 20th model for first 5 observations
all_predictions[1:5, 20]
#>   1.88361056 -4.47189543  8.05935149 -0.04271584  5.93270321


One can also specify the lambda indices for which predictions are desired:

# get predictions for 20th model for first 5 observations
predict(fit, x[1:5, ], index = 20)
#>              l20
#> [1,]  1.88361056
#> [2,] -4.47189543
#> [3,]  8.05935149
#> [4,] -0.04271584
#> [5,]  5.93270321


The predict method has a type option which allows it to return different types of information to the user. type = "terms" gives a matrix of fitted functions, with as many columns as there are variables. This can be useful for understanding the effect that each variable has on the response. Note that what is returned is a 3-dimensional array!

# effect of variables for first 10 observations and 20th model
indiv_fits <- predict(fit, x[1:10, ], index = c(20), type = "terms") dim(indiv_fits) #>  10  1 12
# effect of variable 4 on these first 10 observations
plot(x[1:10, 4], indiv_fits[, , 4]) type = "nonzero" returns a list of indices of non-zero coefficients at a given lambda.

# variables selected by GAMSEL and the 10th and 50th lambda values
predict(fit, x, index = c(10, 50), type = "nonzero")
#> $l10 #>  2 3 4 5 6 #> #>$l50
#>    1  2  3  4  5  6  7  8  9 10 11 12


Plots and summaries

Let’s fit the basic gamsel model again:

fit <- gamsel(x, y)


fit is a class “gamsel” object which comes with a plot method. The plot method shows us the relationship our predicted response has with each input feature, i.e. it plots $\alpha_j X_j + U_j \beta_j$ vs. $X_j$ for each $j$. Besides passing fit to the plot() call, the user must also pass an input matrix x: this is used to determine the coordinate limits for the plot. It is recommended that the user simply pass in the same input matrix that the GAMSEL model was fit on.

By default, plot() gives the fitted functions for the last value of the lambda key in fit, and gives plots for all the features. For high-dimensional data, this latter default is problematic as it will produce too many plots! You can pass a vector of indices to the which option to specify which features you want plots for. The code below gives plots for the first 4 features:

par(mfrow = c(1, 4))
par(mar = c(4, 2, 2, 2))
plot(fit, x, which = 1:4) The user can specify the index of the lambda value to show using the index option:

# show fitted functions for x2, x5 and x8 at the model for the 15th lambda value
par(mfrow = c(1, 3))
plot(fit, x, index = 15, which = c(2, 5, 8)) Linear functions are colored green, non-linear functions are colored red, while zero functions are colored blue.

Class “gamsel” objects also have a summary method which allows the user to see the coefficient profiles of the linear and non-linear features. On each plot (one for linear features and one for non-linear features), the x-axis is the $\lambda$ value going from large to small. For linear components, the y-axis is the coefficient for each variable while for the nonlinear components, it is the norm of the nonlinear coefficients.

par(mfrow = c(1, 2))
summary(fit) We can include annotations to show which profile belongs to which feature by specifying label = TRUE.

par(mfrow = c(1, 2))
summary(fit, label = TRUE) Cross-validation

We can perform $k$-fold cross-validation (CV) for GAMSEL with cv.gamsel(). It does 10-fold cross-validation by default:

cvfit <- cv.gamsel(x, y)


We can change the number of folds using the nfolds option:

cvfit <- cv.gamsel(x, y, nfolds = 5)


If we want to specify which observation belongs to which fold, we can do that by specifying the foldid option, which is a vector of length $n$, with the $i$th element being the fold number for observation $i$.

set.seed(3)
foldid <- sample(rep(seq(10), length = n))
cvfit <- cv.gamsel(x, y, foldid = foldid)


A cv.gamsel() call returns a cv.gamsel object. We can plot this object to get the CV curve with error bars (one standard error in each direction). The left vertical dotted line represents lambda.min, the lambda value which attains minimum CV error, while the right vertical dotted line represents lambda.1se, the largest lambda value with CV error within one standard error of the minimum CV error.

plot(cvfit) The numbers at the top represent the number of features in our original input matrix that are included in the model.

The two special lambda values, as well as their indices in the lambda path, can be extracted directly from the cv.gamsel object:

# lambda values
cvfit$lambda.min #>  3.959832 cvfit$lambda.1se
#>  8.39861
# corresponding lambda indices
cvfit$index.min #>  33 cvfit$index.1se
#>  25


Logistic regression with binary data

In the examples above, y was a quantitative variable (i.e. takes values along the real number line). As such, using the default family = "gaussian" for gamsel() was appropriate. In theory, the GAMSEL algorithm is very flexible and can be used when y is not a quantitative variable. In practice, gamsel() has been implemented for binary response data. The user can use gamsel() (or cv.gamsel()) to fit a model for binary data by specifying family = "binomial". All the other functions we talked about above can be used in the same way.

In this setting, the response y should be a numeric vector containing just 0s and 1s. When doing prediction, note that by default predict() gives just the value of the linear predictors, i.e. \begin{aligned} \log [\hat{p} / (1 - \hat{p})] = \hat{y} = \alpha_0 + \sum_{j=1}^p \alpha_j X_j + U_j \beta_j, \end{aligned}

where $\hat{p}$ is the predicted probability. To get the predicted probability, the user has to pass type = "response" to the predict() call.

# fit binary model
bin_y <- ifelse(y > 0, 1, 0)
binfit <- gamsel(x, bin_y, family = "binomial")

# linear predictors for first 5 observations at 10th model
predict(binfit, x[1:5, ])[, 10]
#>   0.1293867 -0.4531994  0.4528493 -0.2539989  0.3576436

# predicted probabilities for first 5 observations at 10th model
predict(binfit, x[1:5, ], type = "response")[, 10]
#>  0.5323016 0.3886003 0.6113165 0.4368395 0.5884699


# The hidden diagnostic plots for the lm object

When plotting an lm object in R, one typically sees a 2 by 2 panel of diagnostic plots, much like the one below:

set.seed(1)
x <- matrix(rnorm(200), nrow = 20)
y <- rowSums(x[,1:3]) + rnorm(20)
lmfit <- lm(y ~ x)
summary(lmfit)
par(mfrow = c(2, 2))
plot(lmfit) This link has an excellent explanation of each of these 4 plots, and I highly recommend giving it a read.

Most R users are familiar with these 4 plots. But did you know that the plot() function for lm objects can actually give you 6 plots? It says so right in the documentation: We can specify which of the 6 plots we want when calling this function using the which option. By default, we are given plots 1, 2, 3 and 5. Let’s have a look at what plots 4 and 6 are.

Plot 4 is of Cook’s distance vs. observation number (i.e. row number). Cook’s distance is a measure of how influential a given observation is on the linear regression fit, with a value > 1 typically indicating a highly influential point. By plotting this value against row number, we can see if highly influential points exhibit any relationship to their position in the dataset. This is useful for time series data as it can indicate if our fit is disproportionately influenced by data from a particular time period.

Here is what plot 4 might look like:

plot(lmfit, which = 4) Plot 6 is of Cook’s distance against (leverage)/(1 – leverage). An observation’s leverage must fall in the interval $[0,1]$, so plotting against (leverage)/(1 – leverage) allows the x-axis to span the whole positive real line. The contours on the plot represent points where the absolute value of the standardized residual is the same. On this plot they happen to be straight lines; the documentation says so as well but I haven’t had time to check it mathematically.

Here is what plot 6 might look like:

plot(lmfit, which = 6) I’m not too sure how one should interpret this plot. As far as I know, one should take extra notice of points with high leverage and/or high Cook’s distance. So any observation in the top-left, top-right or bottom-right corner should be taken note of. If anyone knows of a better way to interpret this plot, let me know!

# Use mfcol to have plots drawn by column

To plot multiple figures on a single canvas in base R, we can change the graphical parameter mfrow. For instance, the code below tells R that subsequent figures will by drawn in a 2-by-3 array:

par(mfrow = c(2, 3))


If we then run this next block of code, we will get the image below:

set.seed(10)
n <- 50; p <- 3
x <- matrix(runif(n * p), ncol = p)
for (j in 1:p) {
plot(x[, j], x[, j]^2 + 0.1 * rnorm(n),
xlab = paste0("x", j), ylab = paste0("f(x", j, ")^2 + noise"))
plot(x[, j], x[, j]^3 + 0.1 * rnorm(n),
xlab = paste0("x", j), ylab = paste0("f(x", j, ")^3 + noise"))
} Notice how the plots are filled in by row? That is, the first plot goes in the top-left corner, the next plot goes to its right, and so on. What if we want the plots to be filled in by column instead? For instance, in our example above each feature is associated with two panels. It would make more sense for the first column to filled with plots related to x1, and so on.

There is an easy fix for that: instead of specifying the mfrow parameter, specify the mfcol parameter. See the results below:

par(mfcol = c(2, 3))
set.seed(10)
n <- 50; p <- 3
x <- matrix(runif(n * p), ncol = p)
for (j in 1:p) {
plot(x[, j], x[, j]^2 + 0.1 * rnorm(n),
xlab = paste0("x", j), ylab = paste0("f(x", j, ")^2 + noise"))
plot(x[, j], x[, j]^3 + 0.1 * rnorm(n),
xlab = paste0("x", j), ylab = paste0("f(x", j, ")^3 + noise"))
} References:

# Lesser known dplyr functions

The dplyr package is an essential tool for manipulating data in R. The “Introduction to dplyr” vignette gives a good overview of the common dplyr functions (list taken from the vignette itself):

• filter() to select cases based on their values.
• arrange() to reorder the cases.
• select() and rename() to select variables based on their names.
• mutate() and transmute() to add new variables that are functions of existing variables.
• summarise() to condense multiple values to a single value.
• sample_n() and sample_frac() to take random samples.

The “Two-table verbs” vignette gives a good introduction to using dplyr function for joining two tables together. The “Window functions” vignette talks about, well, window functions, which are defined as functions which take n values and return n values (as opposed to aggregation functions, which take n values but return one value). The window functions mentioned there are (list taken from the vignette itself):

• Ranking and ordering functions: row_number()min_rank()dense_rank()cume_dist()percent_rank(), and ntile(). These functions all take a vector to order by, and return various types of ranks.
• Offsets: lead() and lag() allow you to access the previous and next values in a vector, making it easy to compute differences and trends.
• Cumulative aggregates: cumsum()cummin()cummax() (from base R), and cumall()cumany(), and cummean() (from dplyr).

However, dplyr comes with several other functions that are not mentioned in the vignettes (or at least, not at length). In this post I’ll talk about some of them. (For the full list of dplyr functions, see the reference manual.)

• Counting functions: n() and n_distinct()
• If-else functions: if_else() and case_when()
• Comparison functions: between() and near()
• Selecting specific elements based on position: nth(), first() and last()
• Selecting specific rows based on position/value: slice() and top_n()
• Utilities: coalesce() and pull()

To illustrate these functions, I will use the flights dataset in the nycflights13 package.

library(tidyverse)
library(nycflights13)
data(flights)


Counting functions: n() and n_distinct()

Ok, these aren’t exactly lesser known functions but they’re certainly useful. n() counts the number of rows in each group. For example, to count the number of flights in each month:

flights %>% group_by(month) %>%
summarize(count = n())
# # A tibble: 12 x 2
#    month count
#    <int> <int>
# 1      1 27004
# 2      2 24951
# 3      3 28834
# 4      4 28330
# 5      5 28796
# 6      6 28243
# 7      7 29425
# 8      8 29327
# 9      9 27574
# 10    10 28889
# 11    11 27268
# 12    12 28135


n_distinct() counts the number of unique values in each group. To count the number of distinct values of day in the dataset:

flights %>% summarize(cnt = n_distinct(day))
# # A tibble: 1 x 1
#     cnt
#   <int>
# 1    31


as we expect, since the longest month only has 31 days. We can also count the number of unique sets of values across columns. To count the number of distinct (month, day) values:

flights %>% summarize(cnt = n_distinct(month,day))
# # A tibble: 1 x 1
#     cnt
#   <int>
# 1   365


as expected (number of days in a year).

If-else functions: if_else() and case_when()

if_else() returns a value which depends on whether a given condition is true or not. It works like the base ifelse() function, except that it is stricter in that the returned value must be of the same type, whether the given condition is true or not. if_else() is commonly used to create new columns. For example, we could create a new column that is “United” if a flight was from United Airlines, “Other” otherwise:

flights %>%
transmute(carrier = carrier,
isUA = if_else(carrier == "UA", "United", "Otherwise")) %>%
# # A tibble: 6 x 2
#   carrier isUA
#   <chr>   <chr>
# 1 UA      United
# 2 UA      United
# 3 AA      Otherwise
# 4 B6      Otherwise
# 5 DL      Otherwise
# 6 UA      United


case_when() is like if_else() except that we have more than 2 possible values for the output. It is “the R equivalent of the SQL CASE WHEN statement”. If none of the cases match, an NA is returned. As with if_else(), the returned values for all the cases must be of the same type.

Here is some code to return “United” if carrier is UA, “JetBlue” if it is B6, and “Other” for all other carriers (notice that the syntax within case_when() is a little different from the usual, and how we use TRUE to catch all the cases that don’t meet the first two conditions):

flights %>%
transmute(carrier = carrier,
newCol = case_when(
carrier == "UA" ~ "United",
carrier == "B6" ~ "JetBlue",
TRUE ~ "Otherwise"
)) %>%
# # A tibble: 6 x 2
#   carrier newCol
#   <chr>   <chr>
# 1 UA      United
# 2 UA      United
# 3 AA      Otherwise
# 4 B6      JetBlue
# 5 DL      Otherwise
# 6 UA      United


Comparison functions: between() and near()

between(x, left, right) is a shortcut for x >= left & x <= right that has an efficient implementation. It makes for more concise code too, at the expense of being unable to tell if the endpoints are included or not unless one is familiar with the function.

near() is a safer way than using == to compare if two vectors of floating point numbers are equal element-wise. This is because it uses tolerance which is based on the machine’s double precision. Here is the example provided in near()‘s documentation:

sqrt(2) ^ 2 == 2
#  FALSE
near(sqrt(2) ^ 2, 2)
#  TRUE


Selecting specific elements based on position: nth(), first() and last()

nth(x, n) returns the nth element in the vector x. An optional vector to determine the order can be passed to the function as well. If n is negative, we count from the end of x (e.g. -2 will return the second last value in x). first() and last() are obvious special cases of nth().

What’s interesting is that these functions are actually wrappers around [[. This means that they can be used to pull out the nth element in a list as well!

Selecting specific rows based on position/value: slice() and top_n()

slice() allows us to select certain rows based on their ordinal position (i.e. row number). This works especially well with grouped dataframes. For example, if we want the first row for each date:

flights %>% group_by(month, day) %>%
slice(1) %>%
select(month, day, dep_time)
# # A tibble: 365 x 3
# # Groups:   month, day 
#    month   day dep_time
#    <int> <int>    <int>
#  1     1     1      517
#  2     1     2       42
#  3     1     3       32
#  4     1     4       25
#  5     1     5       14
#  6     1     6       16
#  7     1     7       49
#  8     1     8      454
#  9     1     9        2
# 10     1    10        3
# # … with 355 more rows


A simple modification gives us the first two rows for each date:

flights %>% group_by(month, day) %>%
slice(1:2) %>%
select(month, day, dep_time)
# # A tibble: 730 x 3
# # Groups:   month, day 
#    month   day dep_time
#    <int> <int>    <int>
#  1     1     1      517
#  2     1     1      533
#  3     1     2       42
#  4     1     2      126
#  5     1     3       32
#  6     1     3       50
#  7     1     4       25
#  8     1     4      106
#  9     1     5       14
# 10     1     5       37
# # … with 720 more rows


In contrast to slice() which picks out rows by their position (within the group), top_n() picks out rows by their value in a pre-specified column (within the group). For example, if we wanted to see the carriers for the top 3 longest departure delays for each day, we could use this code:

flights %>% group_by(month, day) %>%
top_n(3, dep_delay) %>%
select(month, day, dep_delay, carrier)
# # A tibble: 1,108 x 4
# # Groups:   month, day 
#    month   day dep_delay carrier
#    <int> <int>     <dbl> <chr>
#  1     1     1       853 MQ
#  2     1     1       290 EV
#  3     1     1       379 EV
#  4     1     2       334 UA
#  5     1     2       337 AA
#  6     1     2       379 UA
#  7     1     3       268 DL
#  8     1     3       252 B6
#  9     1     3       291 9E
# 10     1     4       208 B6
# # … with 1,098 more rows


Notice that while we get the top 3 rows in dep_delay for each day, the rows are not sorted according to the dep_delay column. To get the bottom three rows instead, use a negative number:

flights %>% group_by(month, day) %>%
top_n(-3, dep_delay) %>%
select(month, day, dep_delay, carrier)
# # A tibble: 1,504 x 4
# # Groups:   month, day 
#    month   day dep_delay carrier
#    <int> <int>     <dbl> <chr>
#  1     1     1       -15 MQ
#  2     1     1       -14 F9
#  3     1     1       -15 AA
#  4     1     2       -13 AA
#  5     1     2       -13 UA
#  6     1     2       -12 AA
#  7     1     2       -12 9E
#  8     1     3       -12 EV
#  9     1     3       -12 EV
# 10     1     3       -13 B6
# # … with 1,494 more rows


Notice how Jan 2nd has 4 entries, not 3? That’s because there was a tie for 3rd place. top_n() either takes all rows with a value or none of them. Another gotcha: if the column that you are sorting against is not specified, it defaults to the last variable in the data frame, NOT to ordinal position/row number.

Utilities: coalesce() and pull()

From the documnetation: “Given a set of vectors, coalesce() finds the first non-missing value at each position. This is inspired by the SQL COALESCE function which does the same thing for NULLs.”

I haven’t had to use this function myself so far, but I imagine it could be useful when you have a number of different data sources reporting the same thing but with different missing values in each source. Here is a contrived example:

x <- c(NA, NA, 3)
y <- c(1, NA, 4)
z <- c(NA, 2, NA)
coalesce(x, y, z)
#  1 2 3


Notice that the third value in y showing a 4 is ignored.

pull() works like [[: it allows you to pull out a particular column (as a vector). The variable you pass to the function can be a variable name, a positive integer (giving the column position when counting from the left) or a negative integer (giving the column position when counting from the right).

# Mixing up R markdown shortcut keys in RStudio, or how to unfold all chunks

When using R markdown in RStudio, I like to insert a new chunk using the shortcut Cmd+Option+I. Unfortunately I often press a key instead of “I” and end up folding all the chunks, getting something like this: It often takes me a while (on Google) to figure out what I did and how to undo it. With this note to remind me, no longer!! The shortcut I accidentally used was Cmd+Option+O, which folds up all chunks. To unfold all chunks, use Cmd+Shift+Option+O.

The full list of RStudio keyboard shortcuts can be found here.

Update (2019-08-27): For Windows & Linux users: the equivalent shortcut keys for insert chunk, fold up all chunks and unfold all chunks are Ctrl+Alt+I, Alt+O and Shift+Alt+O respectively.