Apply an R Function in Spark
Applies an R function to a Spark object (typically, a Spark DataFrame).
Usage

spark_apply(
  x,
  f,
  columns = NULL,
  memory = !is.null(name),
  group_by = NULL,
  packages = NULL,
  context = NULL,
  name = NULL,
  barrier = NULL,
  fetch_result_as_sdf = TRUE,
  partition_index_param = "",
  arrow_max_records_per_batch = NULL,
  ...
)
Arguments

x
  An object (usually a spark_tbl) coercible to a Spark DataFrame.

f
  A function that transforms a data frame partition into a data frame.
  The function f has signature f(df, context, group1, group2, ...), where
  df is a data frame with the data to be processed, context is an optional
  object passed as the context parameter, and group1 to groupN contain the
  values of the group_by columns. When group_by is not specified, f takes
  only one argument. Can also be an rlang anonymous function; for example,
  ~ .x + 1 defines an expression that adds one to the given .x data frame.

columns
  A vector of column names, or a named vector of column types for the
  transformed object. When not specified, a sample of 10 rows is taken to
  infer the output columns automatically; to avoid this performance
  penalty, specify the column types. The sample size is configurable using
  the sparklyr.apply.schema.infer configuration option.

memory
  Boolean; should the table be cached into memory?

group_by
  Column name used to group by data frame partitions.

packages
  Boolean to distribute .libPaths() packages to each node, a list of
  packages to distribute, or a package bundle created with
  spark_apply_bundle(). Defaults to TRUE, or to the sparklyr.apply.packages
  value set in spark_config(). For clusters using Yarn cluster mode,
  packages can point to a package bundle created with spark_apply_bundle()
  and made available as a Spark file using config$sparklyr.shell.files.
  For clusters using Livy, packages can be manually installed on the
  driver node. For offline clusters where available.packages() is not
  available, manually download the packages database from
  https://cran.r-project.org/web/packages/packages.rds and set
  Sys.setenv(sparklyr.apply.packagesdb = "<path-to-rds>"); otherwise, all
  packages will be used by default. For clusters where R packages are
  already installed in every worker node, the spark.r.libpaths config
  entry can be set in spark_config() to the local packages library.

context
  Optional object to be serialized and passed back to f().

name
  Optional table name while registering the resulting data frame.

barrier
  Optional; supports Barrier Execution Mode in the scheduler.

fetch_result_as_sdf
  Whether to return the transformed results as a Spark DataFrame (defaults
  to TRUE). When set to FALSE, results are returned as a list of R objects
  instead. NOTE: fetch_result_as_sdf must be set to FALSE when the
  transformation function returns R objects that cannot be stored in a
  Spark DataFrame (for example, complex numbers or other R types with no
  equivalent Spark SQL data type).

partition_index_param
  Optional; if non-empty, f also receives the index of the partition being
  processed as a named argument with this name, in addition to the data
  frame representing the partition.

arrow_max_records_per_batch
  Maximum size of each Arrow record batch; ignored if Arrow serialization
  is not enabled.

...
  Optional arguments; currently unused.
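To make the columns and context arguments concrete, the following is a minimal sketch, not taken from the package examples. It assumes an open connection named sc created with spark_connect(); the output column name id_squared and the scale value are purely illustrative.

library(sparklyr)

# declare the output schema up front to skip the 10-row inference sample
sdf_len(sc, 4) %>%
  spark_apply(
    function(df) data.frame(id_squared = df$id^2),
    columns = c(id_squared = "double")
  )

# `context` is serialized once and handed to `f` alongside each partition
sdf_len(sc, 4) %>%
  spark_apply(
    function(df, ctx) df * ctx$scale,
    context = list(scale = 100)
  )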
Configuration

spark_config() settings can be specified to change the workers' environment. For instance, to set additional environment variables on each worker node, use the sparklyr.apply.env.* config; to launch workers without --vanilla, set sparklyr.apply.options.vanilla to FALSE; to run a custom script before launching Rscript, use sparklyr.apply.options.rscript.before.
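As a sketch of how these settings might be supplied at connection time (the environment variable name MY_SETTING and the script path are illustrative placeholders, not part of the original documentation):

library(sparklyr)

config <- spark_config()
# export an extra environment variable to every worker (name is a placeholder)
config[["sparklyr.apply.env.MY_SETTING"]] <- "some-value"
# launch worker R sessions without the --vanilla flag
config[["sparklyr.apply.options.vanilla"]] <- FALSE
# run a custom script before each Rscript worker starts (path is a placeholder)
config[["sparklyr.apply.options.rscript.before"]] <- "/path/to/setup.sh"

sc <- spark_connect(master = "local", config = config)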
Examples

## Not run:

library(sparklyr)
sc <- spark_connect(master = "local[3]")

# create a Spark data frame with 10 elements, then multiply by 10 in R
sdf_len(sc, 10) %>% spark_apply(function(df) df * 10)

# using barrier mode
sdf_len(sc, 3, repartition = 3) %>%
  spark_apply(nrow, barrier = TRUE, columns = c(id = "integer")) %>%
  collect()

## End(Not run)
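Beyond the examples above, a further sketch (not from the original page) shows fetch_result_as_sdf = FALSE, which returns plain R objects; this is useful when the result has no Spark SQL representation, such as complex numbers.

# a sketch, assuming the same `sc` connection as in the examples above
sdf_len(sc, 2, repartition = 2) %>%
  spark_apply(
    function(df) complex(real = df$id, imaginary = 1),
    fetch_result_as_sdf = FALSE
  )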