Optimisations in data.table
data.table
internally optimises certain expressions in order to improve
performance. This section briefly summarises those optimisations.
Note that there's no additional input needed from the user to take advantage of these optimisations. They happen automatically.
Run the code under the example section to get a feel for the performance benefits from these optimisations.
data.table
reads the global option datatable.optimize
to figure
out what level of optimisation is required. The default value Inf
activates all available optimisations.
At optimisation level >= 1
, i.e., getOption("datatable.optimize")
>= 1, these are the optimisations:
The base function order
is internally replaced with
data.table
's fast ordering. That is, DT[order(...)]
gets internally optimised to DT[forder(...)]
.
The expression DT[, lapply(.SD, fun), by=.]
gets optimised
to DT[, list(fun(a), fun(b), ...), by=.]
where a,b, ...
are
columns in .SD
. This improves performance tremendously.
Similarly, the expression DT[, c(.N, lapply(.SD, fun)), by=.]
gets optimised to DT[, list(.N, fun(a), fun(b), ...)]
. .N
is
just for example here.
base::mean
function is internally optimised to use
data.table
's fastmean
function. mean()
from base
is an S3 generic and gets slow with many groups.
At optimisation level >= 2
, i.e., getOption("datatable.optimize")
>= 2, additional optimisations are implemented on top of the optimisations already shown above.
Expressions in j
which contain only the functions
min, max, mean, median, var, sd, sum, prod, first, last, head, tail
(for example,
DT[, list(mean(x), median(x), min(y), max(y)), by=z]
), they are very
effectively optimised using what we call GForce. These functions
are automatically replaced with a corresponding GForce version
with pattern g*
, e.g., prod
becomes gprod
.
Normally, once the rows belonging to each group are identified, the values
corresponding to the group are gathered and the j
-expression is
evaluated. This can be improved by computing the result directly without
having to gather the values or evaluating the expression for each group
(which can get costly with large number of groups) by implementing it
specifically for a particular function. As a result, it is extremely fast.
In addition to all the functions above, '.N' is also optimised to
use GForce, when used separately or when combined with the functions mentioned
above. Note further that GForce-optimized functions must be used separately,
i.e., code like DT[ , max(x) - min(x), by=z]
will not currently
be optimized to use gmax, gmin
.
Expressions of the form DT[i, j, by]
are also optimised when
i
is a subset operation and j
is any/all of the functions
discussed above.
At optimisation level >= 3
, i.e., getOption("datatable.optimize")
>= 3, additional optimisations for subsets in i are implemented on top of the optimisations already shown above. Subsetting operations are - if possible - translated into joins to make use of blazing fast binary search using indices and keys. The following queries are optimized:
Supported operators: ==
, %in%
. Non-equi operators(>, <, etc.) are not supported yet because non-equi joins are slower than vector based subsets.
Queries on multiple columns are supported, if the connector is '&
', e.g. DT[x == 2 & y == 3]
is supported, but DT[x == 2 | y == 3]
is not.
Optimization will currently be turned off when doing subset when cross product of elements provided to filter on exceeds > 1e4. This most likely happens if multiple %in%
, or %chin%
queries are combined, e.g. DT[x %in% 1:100 & y %in% 1:200]
will not be optimized since 100 * 200 = 2e4 > 1e4
.
Queries with multiple criteria on one column are not supported, e.g. DT[x == 2 & x %in% c(2,5)]
is not supported.
Queries with non-missing j are supported, e.g. DT[x == 3 & y == 5, .(new = x-y)]
or DT[x == 3 & y == 5, new := x-y]
are supported. Also extends to queries using with = FALSE
.
"notjoin" queries, i.e. queries that start with !
, are only supported if there are no &
connections, e.g. DT[!x==3]
is supported, but DT[!x==3 & y == 4]
is not.
If in doubt, whether your query benefits from optimization, call it with the verbose = TRUE
argument. You should see "Optimized subsetting...".
Auto indexing: In case a query is optimized, but no appropriate key or index is found, data.table
automatically creates an index on the first run. Any successive subsets on the same
column then reuse this index to binary search (instead of
vector scan) and is therefore fast.
Auto indexing can be switched off with the global option
options(datatable.auto.index = FALSE)
. To switch off using existing
indices set global option options(datatable.use.index = FALSE)
.
## Not run: # Generate a big data.table with a relatively many columns set.seed(1L) DT = lapply(1:20, function(x) sample(c(-100:100), 5e6L, TRUE)) setDT(DT)[, id := sample(1e5, 5e6, TRUE)] print(object.size(DT), units="Mb") # 400MB, not huge, but will do # 'order' optimisation options(datatable.optimize = 1L) # optimisation 'on' system.time(ans1 <- DT[order(id)]) options(datatable.optimize = 0L) # optimisation 'off' system.time(ans2 <- DT[order(id)]) identical(ans1, ans2) # optimisation of 'lapply(.SD, fun)' options(datatable.optimize = 1L) # optimisation 'on' system.time(ans1 <- DT[, lapply(.SD, min), by=id]) options(datatable.optimize = 0L) # optimisation 'off' system.time(ans2 <- DT[, lapply(.SD, min), by=id]) identical(ans1, ans2) # optimisation of 'mean' options(datatable.optimize = 1L) # optimisation 'on' system.time(ans1 <- DT[, lapply(.SD, mean), by=id]) system.time(ans2 <- DT[, lapply(.SD, base::mean), by=id]) identical(ans1, ans2) # optimisation of 'c(.N, lapply(.SD, ))' options(datatable.optimize = 1L) # optimisation 'on' system.time(ans1 <- DT[, c(.N, lapply(.SD, min)), by=id]) options(datatable.optimize = 0L) # optimisation 'off' system.time(ans2 <- DT[, c(N=.N, lapply(.SD, min)), by=id]) identical(ans1, ans2) # GForce options(datatable.optimize = 2L) # optimisation 'on' system.time(ans1 <- DT[, lapply(.SD, median), by=id]) system.time(ans2 <- DT[, lapply(.SD, function(x) as.numeric(stats::median(x))), by=id]) identical(ans1, ans2) # optimized subsets options(datatable.optimize = 2L) system.time(ans1 <- DT[id == 100L]) # vector scan system.time(ans2 <- DT[id == 100L]) # vector scan system.time(DT[id %in% 100:500]) # vector scan options(datatable.optimize = 3L) system.time(ans1 <- DT[id == 100L]) # index + binary search subset system.time(ans2 <- DT[id == 100L]) # only binary search subset system.time(DT[id %in% 100:500]) # only binary search subset again ## End(Not run)
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.