A review of data.table (1.2)


The fast way to do SQL like operations in R

data.table provides a data type similar to a data.frame with the addition of indexes on columns. This means that if you are either trying to quickly access a subset of rows or merge two data.tables together, instead of dealing with O(N^2) type operations, the indexes can be used to improve performance with O(log N) index searches.

This is typically important when you are working with large normalized (in the RDBMS sense of the word) data sets. While there are way of working with this data both in the base packages (merge(), aggregate(), which(), subset()...), these methods are unable to operate as fast as similar operations on an indexed table inside a database system. By taking data.frames and converting them to data.tables, you can add indexes to the data and achieve performance gains.

In addition, while the syntax is a bit quirky, data.table lets you express much of what you would express in a SQL SELECT query inside a single data.table call.

Basic syntax:

dt <- DT(df, key=’colname1,colname2’) +ENDSRC Takes a data frame (df ) and creates a data.table (dt) which is indexed by colname1 then colname2 (for ties) +BEGINSRC dt[i, j] +ENDSRC • If i is not another data.table, then it works exactly like a data.frame for subsetting rows. • If i is a data.table, then does a fast (O(log (n))) join of i’s keys with dt’s. Returns intersection. • If j is a single column index, then it works exactly like a data.frame for selecting a column. • If j is a data.table, then it performs expressions in the scope of the data.table. +BEGINSRC dt[i, j, mult={’first’, ’last’, ’all’}, nomatch={0, NA}, roll={FALSE, TRUE}, by=’colname’] • When performing a join (using a data.table as i), the default behavior is to only return the first match. You can tweak this with the mult option. • Likewise, the default join behavior is an INNER JOIN. For an OUTER JOIN, specify nomatch = NA. • When doing joins on time series, roll lets you join against imperfectly matched times by using the closest prior time. • by lets you GROUP BY colname, for each group the j expressions are evaluated separately. (Inefficient according to docs)