arrow: write_dataset – R documentation

write_dataset

Write a dataset

Description

This function writes a dataset to a directory of files. Writing to an efficient binary storage format such as Parquet or Feather, and partitioning on the columns you expect to filter by, can make the dataset much faster to read and query.
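As a rough illustration of the description above, the following sketch writes the built-in `mtcars` data frame as Parquet, partitioned on `cyl`, and then reads back a single partition. The output directory is a hypothetical temporary path chosen for this example, and the read-back step assumes that `open_dataset()` and the `dplyr` package are available; it is not part of `write_dataset()` itself.

```r
library(arrow)
library(dplyr)

# Hypothetical output location: a fresh directory under the session temp dir
out_dir <- tempfile("mtcars_parquet_")

# Write one directory per distinct value of `cyl`, in Parquet format
write_dataset(mtcars, out_dir, format = "parquet", partitioning = "cyl")

# Relative paths of the files that were written
written <- list.files(out_dir, recursive = TRUE)
print(written)

# Read the dataset back and keep only the rows where cyl is 4
four_cyl <- open_dataset(out_dir) %>%
  filter(cyl == 4) %>%
  collect()
print(nrow(four_cyl))
```

Because `cyl` is encoded in the directory names, a filter on that column only needs to touch the files under the matching partition.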

Usage

write_dataset(
  dataset,
  path,
  format = c("parquet", "feather", "arrow", "ipc"),
  partitioning = dplyr::group_vars(dataset),
  basename_template = paste0("part-{i}.", as.character(format)),
  hive_style = TRUE,
  ...
)

Arguments

`dataset`	Dataset, RecordBatch, Table, `arrow_dplyr_query`, or `data.frame`. If an `arrow_dplyr_query` or `grouped_df`, `schema` and `partitioning` will be taken from the result of any `select()` and `group_by()` operations done on the dataset. `filter()` queries will be applied to restrict written rows. Note that `select()`-ed columns may not be renamed.
`path`	string path, URI, or `SubTreeFileSystem` referencing a directory to write to (directory will be created if it does not exist)
`format`	a string identifier of the file format. Default is to use "parquet" (see FileFormat)
`partitioning`	`Partitioning` or a character vector of columns to use as partition keys (to be written as path segments). Default is to use the current `group_by()` columns.
`basename_template`	string template for the names of files to be written. Must contain `"{i}"`, which will be replaced with an autoincremented integer to generate basenames of datafiles. For example, `"part-{i}.feather"` will yield `"part-0.feather", ...`.
`hive_style`	logical: write partition segments as Hive-style (`key1=value1/key2=value2/file.ext`) or as just bare values. Default is `TRUE`.
`...`	additional format-specific arguments. For available Parquet options, see `write_parquet()`. The available Feather options are:
- `use_legacy_format`: logical: write data formatted so that Arrow libraries versions 0.14 and lower can read it. Default is `FALSE`. You can also enable this by setting the environment variable `ARROW_PRE_0_15_IPC_FORMAT=1`.
- `metadata_version`: A string like "V5" or the equivalent integer indicating the Arrow IPC MetadataVersion. Default (NULL) will use the latest version, unless the environment variable `ARROW_PRE_1_0_METADATA_VERSION=1`, in which case it will be V4.
- `codec`: A Codec which will be used to compress body buffers of written files. Default (NULL) will not compress body buffers.
- `null_fallback`: character to be used in place of missing values (`NA` or `NULL`) when using Hive-style partitioning. See `hive_partition()`.
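The sketch below shows how several of the arguments above might be combined. It relies on `group_by()` to supply the partition columns, compares `hive_style = TRUE` with `hive_style = FALSE`, and passes a custom `basename_template`. The directory names and the `chunk-{i}.feather` template are hypothetical choices for this example, and `dplyr` is assumed to be installed.

```r
library(arrow)
library(dplyr)

# Hypothetical output locations under the session temp dir
hive_dir <- tempfile("mtcars_hive_")
bare_dir <- tempfile("mtcars_bare_")

# Partition columns come from group_by(); directory names are key=value
mtcars %>%
  group_by(cyl) %>%
  write_dataset(hive_dir, format = "feather")

# Same grouping, but bare values as directory names and a custom file name
# template (it must contain "{i}")
mtcars %>%
  group_by(cyl) %>%
  write_dataset(
    bare_dir,
    format = "feather",
    basename_template = "chunk-{i}.feather",
    hive_style = FALSE
  )

# Top-level partition directories for each layout
hive_segments <- sort(list.files(hive_dir))
bare_segments <- sort(list.files(bare_dir))
print(hive_segments)
print(bare_segments)

# File names produced by the custom template
bare_files <- list.files(bare_dir, recursive = TRUE)
print(bare_files)
```

With bare value segments the column name is not stored in the path, so a reader generally has to be told which column each path segment represents.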

Value

The input dataset, invisibly

arrow

Integration to 'Apache' 'Arrow'

v4.0.0.1

Apache License (>= 2.0)

Authors

Neal Richardson [aut, cre], Ian Cook [aut], Jonathan Keane [aut], Romain François [aut] (<https://orcid.org/0000-0002-2444-4226>), Jeroen Ooms [aut], Javier Luraschi [ctb], Jeffrey Wong [ctb], Apache Arrow [aut, cph]

Initial release