arrow: dataset_factory – R documentation

dataset_factory

Create a DatasetFactory

Description

A Dataset can be constructed using one or more DatasetFactory objects. This function helps you construct a DatasetFactory that you can pass to open_dataset().

Usage

dataset_factory(
  x,
  filesystem = NULL,
  format = c("parquet", "arrow", "ipc", "feather", "csv", "tsv", "text"),
  partitioning = NULL,
  ...
)

Arguments

`x`	A string path to a directory containing data files, a vector of one or more string paths to data files, or a list of `DatasetFactory` objects whose datasets should be combined. If a list of `DatasetFactory` objects is supplied, it will be used to construct a `UnionDatasetFactory` and the other arguments will be ignored.
`filesystem`	A `FileSystem` object; if omitted, the `FileSystem` will be detected from `x`.
`format`	A `FileFormat` object, or a string identifier of the format of the files in `x`. Currently supported values: (1) "parquet"; (2) "ipc"/"arrow"/"feather", all aliases for each other (for Feather, note that only version 2 files are supported); (3) "csv"/"text", aliases for the same thing (because comma is the default delimiter for text files); (4) "tsv", equivalent to passing `format = "text", delimiter = "\t"`. Default is "parquet", unless a `delimiter` is also specified, in which case it is assumed to be "text".
`partitioning`	One of: (1) a `Schema`, in which case the file paths relative to `x` will be parsed, and path segments will be matched with the schema fields. For example, `schema(year = int16(), month = int8())` would create partitions for file paths like "2019/01/file.parquet", "2019/02/file.parquet", etc.; (2) a character vector that defines the field names corresponding to those path segments (that is, you're providing the names that would correspond to a `Schema`, but the types will be autodetected); (3) a `HivePartitioning` or `HivePartitioningFactory`, as returned by `hive_partition()`, which parses explicit or autodetected fields from Hive-style path segments; (4) `NULL` for no partitioning.
`...`	Additional format-specific options, passed to `FileFormat$create()`. For CSV options, note that you can specify them either with the Arrow C++ library naming ("delimiter", "quoting", etc.) or the `readr`-style naming used in `read_csv_arrow()` ("delim", "quote", etc.). Not all `readr` options are currently supported; please file an issue if you encounter one that `arrow` should support.
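
As a rough illustration of the `partitioning` argument, the sketch below recreates the "2019/01/file.parquet" layout mentioned above in a temporary directory and reads it with a `Schema`. The root directory, the data frame, and all variable names are made up for this example and are not part of the package documentation.

```r
library(arrow)

# Hypothetical root directory, created only for this illustration
root <- tempfile("partitioned_")
for (m in c("01", "02")) {
  dir.create(file.path(root, "2019", m), recursive = TRUE)
  write_parquet(
    data.frame(value = c(1.5, 2.5)),
    file.path(root, "2019", m, "file.parquet")
  )
}

# Path segments are matched, in order, with the schema fields
factory <- dataset_factory(
  root,
  format = "parquet",
  partitioning = schema(year = int16(), month = int8())
)

# A list of factories (here, a list of one) can be passed to open_dataset()
ds <- open_dataset(list(factory))

# Field names of the resulting Dataset: the data column plus the partition fields
part_names <- ds$schema$names
```

Passing `partitioning = c("year", "month")` instead should produce the same field names, with the types autodetected rather than fixed by the schema.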

Details

If you only have a single DatasetFactory (for example, you have a single directory containing Parquet files), you can call open_dataset() directly. Use dataset_factory() when you want to combine different directories, file systems, or file formats.
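
The case of combining different directories and file formats could look roughly like the sketch below. It assumes two temporary directories and a small made-up data frame (none of which come from the package documentation), and it keeps the column types identical across both formats so that the schemas can be unified.

```r
library(arrow)

# Hypothetical scratch directories, created only for this illustration
pq_dir <- tempfile("pq_")
csv_dir <- tempfile("csv_")
dir.create(pq_dir)
dir.create(csv_dir)

# Made-up data: one string column and one double column
df <- data.frame(
  id = c("a", "b", "c", "d"),
  value = c(1.5, 2.5, 3.5, 4.5),
  stringsAsFactors = FALSE
)

# The same columns stored in two different file formats
write_parquet(df[1:2, ], file.path(pq_dir, "part1.parquet"))
write.csv(df[3:4, ], file.path(csv_dir, "part2.csv"), row.names = FALSE)

# One factory per directory and format
pq_factory <- dataset_factory(pq_dir, format = "parquet")
csv_factory <- dataset_factory(csv_dir, format = "csv")

# A list of factories yields a single Dataset spanning both sources
ds <- open_dataset(list(pq_factory, csv_factory))

# Materialize the rows to check that both sources were read
combined <- as.data.frame(Scanner$create(ds)$ToTable())
```

If the two sources disagreed on a column type (say, integer in one and double in the other), building the combined Dataset would likely fail, so aligning types up front is the safer choice.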

Value

A DatasetFactory object. Pass this to open_dataset(), in a list potentially with other DatasetFactory objects, to create a Dataset.
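
A minimal sketch of the round trip from factory to Dataset, which also passes a format-specific option through `...`. The pipe-delimited file, its contents, and the variable names are assumptions made for this example only.

```r
library(arrow)

# Hypothetical directory holding one pipe-delimited text file
txt_dir <- tempfile("txt_")
dir.create(txt_dir)
writeLines(
  c("id|value", "a|1.5", "b|2.5"),
  file.path(txt_dir, "data.txt")
)

# `delimiter` uses the Arrow C++ option naming and is forwarded via `...`
txt_factory <- dataset_factory(txt_dir, format = "text", delimiter = "|")

# The factory is turned into a Dataset by passing it to open_dataset() in a list
ds <- open_dataset(list(txt_factory))
rows <- as.data.frame(Scanner$create(ds)$ToTable())
```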

arrow

Integration to 'Apache' 'Arrow'

v4.0.0.1

Apache License (>= 2.0)

Authors

Neal Richardson [aut, cre], Ian Cook [aut], Jonathan Keane [aut], Romain François [aut] (<https://orcid.org/0000-0002-2444-4226>), Jeroen Ooms [aut], Javier Luraschi [ctb], Jeffrey Wong [ctb], Apache Arrow [aut, cph]

Initial release