A review of data.table (1.6.5)

Data.table is fast compared to ddply and ave

I wrote the following response to a question on R-help. It compares the speed of data.table to ave and ddply:

It is part of this R-help thread: http://www.mail-archive.com/r-help@r-project.org/msg142659.html

My mail:

After reading this interesting discussion I delved a bit deeper into the subject matter. The following snippet of code (see the end of my mail) compares three ways of performing this task: ddply, ave, and one option not yet mentioned, data.table (a package). The code generates mock datasets that vary in size and in the number of levels of the grouping factor. The results look like this, with timings in seconds (the script also includes a ggplot plot that summarises the table):

> res

   datsize noClasses   tave  tddply tdata.table
...note that I cut out part of the table for readability...
17   1e+07        10  9.160   3.500       1.064
18   1e+07        50 10.126   4.483       1.364
19   1e+07       100 10.485   5.016       1.407
20   1e+07       200 10.680   6.901       1.435
21   1e+07       500 10.801  12.569       1.474
22   1e+07      1000 10.923  21.001       1.540
23   1e+07      2500 11.514  51.020       1.622
24   1e+07     10000 12.158 182.752       1.737

It is clear that data.table is by far the fastest of the three and that it scales well with the number of factor levels. ddply does not: in the rows shown, its timing grows from 3.5 s at 10 levels to 182.8 s at 10,000 levels. data.table is also faster than ave by up to a factor of 10.
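To make the comparison concrete, the speed-up factors for the rows shown above can be worked out directly. This is only a small sketch: the numbers are copied from the table excerpt, and the names `shown`, `ave_vs_dt` and `ddply_vs_dt` are made up for this illustration.

```r
# Timings (elapsed seconds) copied from the rows shown above (datsize = 1e7)
shown <- data.frame(
  noClasses   = c(10, 50, 100, 200, 500, 1000, 2500, 10000),
  tave        = c(9.160, 10.126, 10.485, 10.680, 10.801, 10.923, 11.514, 12.158),
  tddply      = c(3.500, 4.483, 5.016, 6.901, 12.569, 21.001, 51.020, 182.752),
  tdata.table = c(1.064, 1.364, 1.407, 1.435, 1.474, 1.540, 1.622, 1.737)
)

# How many times slower than data.table each alternative is
shown$ave_vs_dt   <- shown$tave / shown$tdata.table
shown$ddply_vs_dt <- shown$tddply / shown$tdata.table

print(round(shown[, c("noClasses", "ave_vs_dt", "ddply_vs_dt")], 1))
```

For these eight rows ave comes out roughly 7 to 9 times slower than data.table, while ddply goes from about 3 times slower at 10 levels to more than 100 times slower at 10,000 levels. The rows that were cut from the table are not covered by this calculation.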

cheers, Paul

library(data.table)
library(plyr)      # ddply, summarise
library(ggplot2)
library(reshape2)  # melt; loaded last so its melt() is the one used on the data.frame

theme_set(theme_bw())

# Dataset sizes and numbers of factor levels to test
datsize = c(10e4, 10e5, 10e6)
noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 10e3)
comb = expand.grid(datsize = datsize, noClasses = noClasses)

# Time the three approaches for every combination of size and number of levels
res = ddply(comb, .(datsize, noClasses), function(x) {
  expdata = data.frame(value = runif(x$datsize),
                       cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
  expdataDT = data.table(expdata)
  t1 = system.time(res1 <- with(expdata, ave(value, cat, FUN = sum)))
  t2 = system.time(res2 <- ddply(expdata, .(cat), summarise, val = sum(value)))
  t3 = system.time(res3 <- expdataDT[, sum(value), by = cat])
  # Element 3 of system.time() is the elapsed time in seconds
  return(data.frame(tave = t1[3], tddply = t2[3], tdata.table = t3[3]))
}, .progress = 'text')
res

# Plot log timings against the number of factor levels, one panel per dataset size
ggplot(aes(x = noClasses, y = log(value), color = variable),
       data = melt(res, id.vars = c("datsize", "noClasses"))) +
  facet_wrap(~ datsize) + geom_line()
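For readers who want to see what the three timed calls actually return, here is a minimal sketch on a toy dataset. The data frame `toy` and the result names are invented for this example; the three calls themselves are the same ones used in the script.

```r
library(plyr)
library(data.table)

# Toy data: five values in two groups (hypothetical, for illustration only)
toy <- data.frame(value = c(1, 2, 3, 4, 5),
                  cat = c("a", "b", "a", "b", "a"))
toyDT <- data.table(toy)

# ave: returns a vector as long as the input, holding each row's group sum
r_ave <- with(toy, ave(value, cat, FUN = sum))

# ddply: returns a data.frame with one row per group
r_ddply <- ddply(toy, .(cat), summarise, val = sum(value))

# data.table: returns a data.table with one row per group (sum in column V1)
r_dt <- toyDT[, sum(value), by = cat]

print(r_ave)
print(r_ddply)
print(r_dt)
```

One thing this makes visible is that ave produces one value per input row, whereas ddply and data.table produce one row per group. The timings therefore compare calls with slightly different output shapes, which may be worth keeping in mind when reading the table.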