Extending dplyr with new data frame subclasses
These three functions, along with names<- and 1d numeric [ (i.e. x[loc]) methods, provide a minimal interface for extending dplyr to work with new data frame subclasses. This means that for simple cases you should only need to provide a couple of methods, rather than a method for every dplyr verb.
These functions are a stop-gap measure until we figure out how to solve the problem more generally, but it's likely that any code you write to implement them will find a home in what comes next.
dplyr_row_slice(data, i, ...)
dplyr_col_modify(data, cols)
dplyr_reconstruct(data, template)
data | A tibble. We use tibbles because they avoid some inconsistent subset-assignment use cases.
i | A numeric or logical vector that indexes the rows of data.
cols | A named list used to modify columns. A NULL value indicates that the column should be deleted.
template | Template to use for restoring attributes.
This section gives you basic advice if you want to extend dplyr to work with your custom data frame subclass, and you want the dplyr methods to behave in basically the same way.
If you have data frame attributes that don't depend on the rows or columns (and should unconditionally be preserved), you don't need to do anything.
If you have scalar attributes that depend on rows, implement a dplyr_reconstruct() method. Your method should recompute the attribute depending on the rows now present.
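As a minimal sketch of the row-dependent scalar case: the class name counted_df, its n_obs attribute, and the constructor new_counted_df() are all hypothetical and exist only for illustration. The method recomputes the attribute from whatever rows the verb left behind and restores the class from the template.

```r
library(dplyr)

# Hypothetical subclass: "n_obs" is a scalar attribute that must always
# equal the current number of rows.
new_counted_df <- function(x) {
  stopifnot(is.data.frame(x))
  structure(x, n_obs = nrow(x), class = c("counted_df", class(x)))
}

# `data` has already been modified by the verb; `template` is the original
# object. Recompute the row-dependent attribute and restore the class.
dplyr_reconstruct.counted_df <- function(data, template) {
  attr(data, "n_obs") <- nrow(data)
  class(data) <- class(template)
  data
}

df <- new_counted_df(tibble(x = 1:5))
out <- filter(df, x > 2)
attr(out, "n_obs")
```

Because filter() subsets with dplyr_row_slice(), whose data frame method finishes by calling dplyr_reconstruct(), no filter() method should be needed here.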
If you have scalar attributes that depend on columns, implement a dplyr_reconstruct() method and a 1d [ method. For example, if your class requires that certain columns be present, your method should return a data.frame or tibble when those columns are removed.
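The following sketch illustrates the column-dependent case under stated assumptions: keyed_df is a hypothetical class that is only valid while an id column is present, and new_keyed_df() is an illustrative constructor. Both methods fall back to a plain tibble once that column is gone.

```r
library(dplyr)

# Hypothetical subclass that requires an "id" column.
new_keyed_df <- function(x) {
  stopifnot(is.data.frame(x), "id" %in% names(x))
  structure(x, class = c("keyed_df", "tbl_df", "tbl", "data.frame"))
}

# Keep the subclass only while the required column is still present;
# otherwise return a plain tibble.
dplyr_reconstruct.keyed_df <- function(data, template) {
  if ("id" %in% names(data)) {
    class(data) <- class(template)
  } else {
    class(data) <- c("tbl_df", "tbl", "data.frame")
  }
  data
}

# 1d `[`: subset with the next method, then re-check the class invariant.
`[.keyed_df` <- function(x, ...) {
  out <- NextMethod()
  if (!is.data.frame(out)) {
    return(out)
  }
  dplyr_reconstruct(out, x)
}

df <- new_keyed_df(tibble(id = 1:3, value = c(10, 20, 30)))
kept <- select(df, id)        # still a keyed_df
dropped <- select(df, value)  # falls back to a plain tibble
```

Routing `[` through dplyr_reconstruct() keeps the validity check in one place, so verbs that go through either generic apply the same rule.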
If your attributes are vectorised over rows, implement a dplyr_row_slice() method. This gives you access to i so you can modify the row attribute accordingly. You'll also need to think carefully about how to recompute the attribute in dplyr_reconstruct(), and you will need to carefully verify the behaviour of each verb, and provide additional methods as needed.
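A minimal sketch of a row-vectorised attribute, assuming a hypothetical class labelled_rows that stores one label per row in a row_labels attribute (both names are invented for this example). The method lets the default data frame method do the slicing and then applies the same index i to the attribute.

```r
library(dplyr)

# Hypothetical subclass: "row_labels" holds one label per row.
new_labelled_rows <- function(x, labels) {
  stopifnot(is.data.frame(x), length(labels) == nrow(x))
  structure(x, row_labels = labels, class = c("labelled_rows", class(x)))
}

# Slice the rows with the next method, then slice the attribute with the
# same index so the two stay aligned. `i` may be numeric or logical.
dplyr_row_slice.labelled_rows <- function(data, i, ...) {
  out <- NextMethod()
  attr(out, "row_labels") <- attr(data, "row_labels")[i]
  out
}

df <- new_labelled_rows(tibble(x = c(3, 1, 2)), labels = c("c", "a", "b"))
sorted <- arrange(df, x)
kept <- filter(df, x > 1)
attr(sorted, "row_labels")
attr(kept, "row_labels")
```

This sketch covers only verbs that go through dplyr_row_slice(); verbs that change rows in other ways, such as joins, would still need the attribute recomputed in dplyr_reconstruct().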
If your attributes are vectorised over columns, implement dplyr_col_modify(), 1d [, and names<- methods. All of these methods know which columns are being modified, so you can update the column attribute accordingly. You'll also need to think carefully about how to recompute the attribute in dplyr_reconstruct(), and you will need to carefully verify the behaviour of each verb, and provide additional methods as needed.
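The sketch below shows one way the three methods might fit together for a column-vectorised attribute. Everything specific here is an assumption made for illustration: the class units_df, its units attribute (a named character vector with one entry per column), the helper sync_units(), and the policy that new or recomputed columns get an NA unit.

```r
library(dplyr)

# Hypothetical subclass: "units" is a named character vector, one per column.
new_units_df <- function(x, units) {
  stopifnot(is.data.frame(x), setequal(names(units), names(x)))
  structure(x, units = units[names(x)], class = c("units_df", class(x)))
}

# Illustrative helper: align units with the columns now present,
# using NA for columns that have no recorded unit.
sync_units <- function(out, units) {
  units <- units[names(out)]
  names(units) <- names(out)
  attr(out, "units") <- units
  out
}

# Columns named in `cols` are new, recomputed, or deleted (NULL).
dplyr_col_modify.units_df <- function(data, cols) {
  out <- NextMethod()
  units <- attr(data, "units")
  units[names(cols)] <- NA_character_  # assumed policy: unit is now unknown
  sync_units(out, units)
}

# 1d `[`: keep only the units of the selected columns.
`[.units_df` <- function(x, ...) {
  out <- NextMethod()
  if (!is.data.frame(out)) {
    return(out)
  }
  sync_units(out, attr(x, "units"))
}

# Renaming: carry each unit over to the new column name.
`names<-.units_df` <- function(x, value) {
  old <- names(x)
  out <- NextMethod()
  units <- attr(x, "units")[old]
  names(units) <- names(out)
  attr(out, "units") <- units
  out
}

df <- new_units_df(
  tibble(len = c(1, 2), wt = c(5, 6)),
  units = c(len = "m", wt = "kg")
)
sel <- select(df, weight = wt)
mut <- mutate(df, ratio = wt / len)
ren <- rename(df, height = len)
```

Other verbs are not covered by this sketch and would need to be checked individually, as the text above advises.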
arrange(), filter(), slice(), semi_join(), and anti_join() work by generating a vector of row indices, and then subsetting with dplyr_row_slice().
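To make the row-index mechanism concrete, this sketch calls dplyr_row_slice() directly on a plain tibble with indices computed by hand, and compares the results with the corresponding verbs. The use of order() and a logical comparison to build the indices is an illustration only; dplyr computes them internally in its own way.

```r
library(dplyr)

df <- tibble(x = c(3, 1, 2), y = c("c", "a", "b"))

# A numeric index: the row order that sorts by x.
loc <- order(df$x)
by_hand_sorted <- dplyr_row_slice(df, loc)

# A logical index: which rows satisfy the condition.
keep <- df$x > 1
by_hand_kept <- dplyr_row_slice(df, keep)

all.equal(by_hand_sorted, arrange(df, x))
all.equal(by_hand_kept, filter(df, x > 1))
```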
mutate() generates a list of new column values (using NULL to indicate when columns should be deleted), then passes that to dplyr_col_modify().
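A small sketch of the column list described above, passed to dplyr_col_modify() by hand on a plain tibble. The column names are made up for the example; the list adds one column and uses NULL to delete another.

```r
library(dplyr)

df <- tibble(x = 1:3, y = c("a", "b", "c"))

# A named list of full-length columns; NULL marks a column for deletion.
cols <- list(z = df$x * 2, y = NULL)
out <- dplyr_col_modify(df, cols)

names(out)
```

The result should have the same columns as mutate(df, z = x * 2, y = NULL).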
transmute() does the same then uses 1d [ to select the columns.
summarise() works similarly to mutate() but the data modified by dplyr_col_modify() comes from group_data().
select() uses 1d [ to select columns, then names<- to rename them.
rename() just uses names<-. relocate() just uses 1d [.
inner_join(), left_join(), right_join(), and full_join() coerce x to a tibble, modify the rows, then use dplyr_reconstruct() to convert back to the same type as x.
nest_join() uses dplyr_col_modify() to cast the key variables to a common type and add the nested-df that y becomes.
distinct() does a mutate() if any expressions are present, then uses 1d [ to select variables to keep, then dplyr_row_slice() to select distinct rows.
Note that group_by() and ungroup() don't use any of these generics and you'll need to provide methods directly.