A review of data.table (1.6.5)
Data.table is fast compared to ddply and ave
I wrote the following response to a question on R-help. It compares the speed of data.table with that of ave and ddply.
The mail is part of this R-help thread: http://www.mail-archive.com/r-help@r-project.org/msg142659.html
My mail:
After reading this interesting discussion I delved a bit deeper into the subject matter. The following snippet of code (see the end of my mail) compares three ways of performing this task: ddply, ave, and one option not yet mentioned, data.table (a package). The code generates mock datasets that vary in size and in the number of levels of the grouping factor. The timings are elapsed seconds as reported by system.time. The results look like this (the script also contains a ggplot plot that summarises the table):
> res
   datsize noClasses   tave  tddply tdata.table
...note that I cut out part of the table for readability...
17   1e+07        10  9.160   3.500       1.064
18   1e+07        50 10.126   4.483       1.364
19   1e+07       100 10.485   5.016       1.407
20   1e+07       200 10.680   6.901       1.435
21   1e+07       500 10.801  12.569       1.474
22   1e+07      1000 10.923  21.001       1.540
23   1e+07      2500 11.514  51.020       1.622
24   1e+07     10000 12.158 182.752       1.737
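It is clear that data.table is by far the fastest of the three and scales quite nicely with the number of factor levels, in contrast to ddply: for 1e7 rows, ddply goes from 3.5 s at 10 levels to about 183 s at 10000 levels, while data.table stays between roughly 1.1 s and 1.7 s. It is also faster than ave by up to a factor of 10 (roughly 7 to 9 in the rows shown).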
cheers, Paul
library(plyr)      # ddply(), .() and summarise()
library(reshape2)  # melt(), used for the plot at the end
library(ggplot2)
library(data.table)
theme_set(theme_bw())

# Dataset sizes (1e5, 1e6 and 1e7 rows) and numbers of factor levels to test
datsize = c(10e4, 10e5, 10e6)
noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 10e3)
comb = expand.grid(datsize = datsize, noClasses = noClasses)

# For each combination, build a mock dataset and time the three approaches
res = ddply(comb, .(datsize, noClasses), function(x) {
  expdata = data.frame(value = runif(x$datsize),
                       cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
  expdataDT = data.table(expdata)
  t1 = system.time(res1 <- with(expdata, ave(value, cat, FUN = sum)))
  t2 = system.time(res2 <- ddply(expdata, .(cat), summarise, val = sum(value)))
  t3 = system.time(res3 <- expdataDT[, sum(value), by = cat])
  # Element 3 of system.time() is the elapsed time in seconds
  return(data.frame(tave = t1[3], tddply = t2[3], tdata.table = t3[3]))
}, .progress = 'text')
res

# Plot log(elapsed time) against the number of levels, one panel per dataset size
ggplot(aes(x = noClasses, y = log(value), color = variable),
       data = reshape2::melt(res, id.vars = c("datsize", "noClasses"))) +
  facet_wrap(~ datsize) + geom_line()
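
To make the comparison concrete, here is a minimal sketch that runs the same three calls on a tiny data set. The names toy, toyDT and the sums_* variables are made up for this illustration and are not part of the benchmark above. It shows that the three approaches compute the same group sums but return them in different shapes: ave gives one value per input row, while ddply and data.table give one row per group.

```r
library(plyr)
library(data.table)

# Hypothetical toy data: six values in three groups
toy <- data.frame(value = c(1, 2, 3, 4, 5, 6),
                  cat   = c("a", "b", "a", "c", "b", "a"))
toyDT <- data.table(toy)

# ave(): one entry per input row, holding the sum of that row's group
sums_ave <- with(toy, ave(value, cat, FUN = sum))
# 10 7 10 4 7 10

# ddply(): one row per group, ordered by group
sums_ddply <- ddply(toy, .(cat), summarise, val = sum(value))

# data.table: one row per group, in order of first appearance;
# the unnamed result column is called V1
sums_dt <- toyDT[, sum(value), by = cat]

# Same group sums from ddply and data.table once the rows are aligned
all.equal(sums_ddply$val, sums_dt[order(cat)]$V1)
```

Because ave fills a vector as long as the input while the other two only return one row per group, the three timings in the benchmark do not measure exactly identical work, which is worth keeping in mind when reading the table.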