A quirk when using data.table?

I recently came across a quirk when using data.table that I don’t really have a clean solution for. I outline the issue below, along with my current way around it. I'd appreciate any better solutions!

The problem surfaces quite generally, but I’ll illustrate it by trying to achieve the following task: write a function that takes a data table and a column name, and returns the data table with the data in that column scrambled.

The function below was my first attempt:

library(data.table)

scramble_col <- function(input_dt, colname) {
  input_dt[[colname]] <- sample(input_dt[[colname]])
  input_dt
}

The code snippet below shows that it seems to work:

input_dt <- data.table(x = 1:5)
set.seed(1)
input_dt <- scramble_col(input_dt, "x")
input_dt
#    x
# 1: 1
# 2: 4
# 3: 3
# 4: 5
# 5: 2

However, when I tried to add a new column that is a replica of the x column, I got a strange warning!

input_dt[, y := x]  # gives warning

There are a few webpages out there that try to explain this warning, but I haven’t had time to fully digest them. My high-level takeaway is that the assignment in the line input_dt[[colname]] <- sample(input_dt[[colname]]) is problematic.
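For comparison, here is a minimal sketch of a variant that returns a scrambled copy and leaves the caller's table untouched. It assumes the warning comes from the copy that base R makes inside the function, as one of the comments below suggests, and sidesteps it with data.table's own copy() and set(). The name scramble_col_copy is made up for this illustration.

```r
library(data.table)

# Hypothetical variant: work on an explicit data.table copy, so the
# caller's table is not modified and the result is a regular data.table.
scramble_col_copy <- function(input_dt, colname) {
  out <- copy(input_dt)                                  # explicit copy
  set(out, j = colname, value = sample(out[[colname]]))  # update the copy by reference
  out
}

dt <- data.table(x = 1:5)
set.seed(1)
res <- scramble_col_copy(dt, "x")
res[, y := x]  # add a column to the result afterwards
dt$x           # the original table still holds 1:5
```

Under that assumption, adding the y column to the result should not trigger the warning, because the returned object was created by data.table itself.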

This was my second attempt:

scramble_col <- function(input_dt, colname) {
  input_dt[, c(colname) := sample(get(colname))]
}

This version works well: it doesn’t throw the warning when I add a second column.

input_dt <- data.table(x = 1:5)
set.seed(1)
input_dt <- scramble_col(input_dt, "x")
input_dt
#    x
# 1: 1
# 2: 4
# 3: 3
# 4: 5
# 5: 2
input_dt[, y := x]
input_dt
#    x y
# 1: 1 1
# 2: 4 4
# 3: 3 3
# 4: 5 5
# 5: 2 2

However, the function does not work for a particular input: when the column name is colname! When I run the following code, I get an error message.

input_dt <- data.table(colname = 1:5)
set.seed(1)
input_dt <- scramble_col(input_dt, "colname") # error
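A small sketch of what seems to be going on, assuming the scoping rule described in the comments below: names inside dt[...] are looked up among the columns first, and only then in the calling function. The helper show_lookup is hypothetical and only reports what the name colname resolves to.

```r
library(data.table)

# Hypothetical helper: returns the class of whatever `colname`
# refers to when evaluated inside input_dt[...].
show_lookup <- function(input_dt, colname) {
  input_dt[, class(colname)]
}

res_arg <- show_lookup(data.table(x = 1:3), "x")              # no column called colname
res_col <- show_lookup(data.table(colname = 1:3), "colname")  # column called colname
res_arg
res_col
```

If the lookup works as assumed, the first call sees the string passed as the argument, while the second sees the integer column. That would mean get(colname) is handed a vector of integers instead of a name, which is consistent with the error.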

The function below was my workaround and I think it works for all inputs, but it seems a bit inelegant:

scramble_col <- function(input_dt, colname) {
  new_col <- sample(input_dt[[colname]])
  input_dt[, c(colname) := new_col]
}
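One caveat with the workaround: by the same lookup rule, a column that happens to be called new_col would presumably shadow the local variable inside the brackets. Below is a sketch of a variant built on data.table's set(), along the lines of the set solution mentioned in the comments, which takes the column name and the new values as ordinary function arguments. The name scramble_col_set is an assumption made for this example.

```r
library(data.table)

# Hypothetical variant: set() receives the column name and the values as
# plain arguments, so no name has to be resolved inside input_dt[...].
scramble_col_set <- function(input_dt, colname) {
  set(input_dt, j = colname, value = sample(input_dt[[colname]]))
  invisible(input_dt)  # the table is modified by reference
}

dt <- data.table(colname = 1:5, new_col = 6:10)
set.seed(1)
scramble_col_set(dt, "colname")
dt[, y := colname]  # adding a column afterwards
dt
```

Like the := versions, this modifies the table in place, so reassigning the result is optional.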

Would love to hear if anyone has a better solution for this task, or if you have some insight into what is going on with the warning/error above.

9 thoughts on “A quirk when using data.table?”

I like these data.table problems. Tried a couple other ideas here: https://tshafer.com/blog/2022/06/shuffling-columns-data-table

kjytay on June 4, 2022 at 3:30 pm said:

Thanks for thinking about the problem! Based on the timings, looks like your set solution is the way to go.

The first issue you described, which gives the data.table warning, is due to the way R works when you assign a value to a function argument. When you do that, R creates a copy of the data the argument points to. In this case, where the argument points to a data.table, R makes a 'shallow copy', which also copies the object's attributes. However, data.table uses an attribute to store internal data (here, a reference to the data.table object itself), and when this attribute is copied as-is, it is no longer valid for the new object created by the shallow copy. This is why you're getting this warning.

The solutions provided by TS above use data.table's 'assignment by reference': both the 'set' function and the ':=' operator update the object by reference, so no copy is made and the object's integrity is kept. If what you want is to update the original object, this is also the most efficient way to do it. If you do want a modified copy (without modifying the original object), the best way is to explicitly create a copy of the object passed to the function, e.g. with 'input_dt <- data.table::as.data.table(input_dt)' as the first statement. Then the rest of your function would work.

As for the second issue, the error from the 'get' function: the problem is not the particular name of the column, but the fact that the name of the function argument is identical to the name of one of the columns of the data.table object. In data.table functions (and something like 'dt[, …]' is actually a call to the function '`[.data.table`(dt, …)'), names are evaluated first with the data.table object as an environment (in other words, the data.table object is placed first in the search path). So when you call 'get(colname, …)', 'colname' is looked up first in the data.table object, and since there is a column with that name, this column is passed to 'get' rather than the value of the function's 'colname' argument. This is why you are getting this error. If you want to do something similar to what you were trying, you can do it like this:

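scramble_col <- function(input_dt, colname) {
  vals = sample(input_dt[[colname]])
  input_dt[, c(colname) := vals]
}

That way you first use 'colname' to get and scramble the values of the column, and then you assign it to the data.table column.

Last thing, a comment about the benchmark done by TS: the benchmark was done on a 5-row data.table. This is a very small size, and with the various overheads incurred in the various functions, the benchmark is very inaccurate. To get a reliable test, we should use a much larger data.table. And indeed, when trying with a data.table of, say, more than 50,000 rows, we can see that the differences between the various functions become negligible. However, using 'set' is still the fastest way, and this is expected since this function has the minimum possible overhead.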

kjytay on June 5, 2022 at 11:32 pm said:

Thanks for this detailed explanation!

Glad to help. It seems that the last part of my comment has somehow gone wrong. If possible, please edit my comment and replace the last paragraph (starting with 'scramble_col <-') with the text below (and then this comment is unnecessary and can be deleted):

Amit Ramon on June 6, 2022 at 5:07 am said:

Trying again 🙂 There were less-than and greater-than characters that got interpreted as HTML and caused some of the text to be hidden. So the last paragraph of my original comment should be replaced by the text below, and then my 2 other comments can be deleted:

scramble_col <- function(input_dt, colname) {
  vals = sample(input_dt[[colname]])
  input_dt[, c(colname) := vals]
}

That way you first use 'colname' to get and scramble the values of the column, and then you assign it to the data.table column.

Last thing, a comment about the benchmark done by TS: the benchmark was done on a 5-row data.table. This is a very small size, and with the various overheads incurred in the various functions, the benchmark is very inaccurate. To get a reliable test, we should use a much larger data.table. And indeed, when trying with a data.table of, say, more than 50,000 rows, we can see that the differences between the various functions become negligible. However, using 'set' is still the fastest way, and this is expected since this function has the minimum possible overhead.

TS on June 7, 2022 at 9:53 pm said:

The point re: vector size is true to an extent, but if you run the test it’s a little surprising how long the difference persists. (Whether the difference matters in practice depends on the use case.) We basically need to wait for sample() to take about as long as the assignment, and the difference in timings (mostly) doesn’t disappear until reaching vector sizes in the 10,000–1,000,000-element range.
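For anyone who wants to check the size dependence discussed in this thread, here is a rough timing sketch using only base R's system.time(). The function names, table sizes and repetition counts are arbitrary choices for illustration, not the setup of the original benchmark, and the timings will vary by machine.

```r
library(data.table)

# Hypothetical stand-ins for the two approaches discussed in this thread.
scramble_assign <- function(input_dt, colname) {
  vals <- sample(input_dt[[colname]])
  input_dt[, c(colname) := vals]
}
scramble_set <- function(input_dt, colname) {
  set(input_dt, j = colname, value = sample(input_dt[[colname]]))
}

for (n in c(5L, 50000L, 1000000L)) {
  dt <- data.table(x = seq_len(n))
  reps <- if (n <= 5L) 1000L else 20L  # more repetitions for the tiny table
  t_assign <- system.time(for (i in seq_len(reps)) scramble_assign(dt, "x"))[["elapsed"]]
  t_set <- system.time(for (i in seq_len(reps)) scramble_set(dt, "x"))[["elapsed"]]
  cat(sprintf("n = %d: ':=' %.4fs, set %.4fs over %d reps\n", n, t_assign, t_set, reps))
}
```

A dedicated benchmarking package would give more reliable numbers; this is only meant to show how the gap can be examined as the table grows.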
  