Fast CSV writer
As write.csv
but much faster (e.g. 2 seconds versus 1 minute) and just as flexible. Modern machines almost surely have more than one CPU so fwrite
uses them; on all operating systems including Linux, Mac and Windows.
fwrite(x, file = "", append = FALSE, quote = "auto", sep = ",", sep2 = c("","|",""), eol = if (.Platform$OS.type=="windows") "\r\n" else "\n", na = "", dec = ".", row.names = FALSE, col.names = TRUE, qmethod = c("double","escape"), logical01 = getOption("datatable.logical01", FALSE), # due to change to TRUE; see NEWS logicalAsInt = logical01, # deprecated scipen = getOption('scipen', 0L), dateTimeAs = c("ISO","squash","epoch","write.csv"), buffMB = 8L, nThread = getDTthreads(verbose), showProgress = getOption("datatable.showProgress", interactive()), compress = c("auto", "none", "gzip"), yaml = FALSE, bom = FALSE, verbose = getOption("datatable.verbose", FALSE))
x |
Any |
file |
Output file name. |
append |
If |
quote |
When |
sep |
The separator between columns. Default is |
sep2 |
For columns of type |
eol |
Line separator. Default is |
na |
The string to use for missing values in the data. Default is a blank string |
dec |
The decimal separator, by default |
row.names |
Should row names be written? For compatibility with |
col.names |
Should the column names (header row) be written? The default is |
qmethod |
A character string specifying how to deal with embedded double quote characters when quoting strings.
|
logical01 |
Should |
logicalAsInt |
Deprecated. Old name for 'logical01'. Name change for consistency with 'fread' for which 'logicalAsInt' would not make sense. |
scipen |
|
dateTimeAs |
How
The first three options are fast due to new specialized C code. The epoch to date-part conversion uses a fast approach by Howard Hinnant (see references) using a day-of-year starting on 1 March. You should not be able to notice any difference in write speed between those three options. The date range supported for |
buffMB |
The buffer size (MB) per thread in the range 1 to 1024, default 8MB. Experiment to see what works best for your data on your hardware. |
nThread |
The number of threads to use. Experiment to see what works best for your data on your hardware. |
showProgress |
Display a progress meter on the console? Ignored when |
compress |
If |
yaml |
If |
bom |
If |
verbose |
Be chatty and report timings? |
fwrite
began as a community contribution with pull request #1613 by Otto Seiskari. This gave Matt Dowle the impetus to specialize the numeric formatting and to parallelize: https://www.h2o.ai/blog/fast-csv-writing-for-r/. Final items were tracked in issue #1664 such as automatic quoting, bit64::integer64
support, decimal/scientific formatting exactly matching write.csv
between 2.225074e-308 and 1.797693e+308 to 15 significant figures, row.names
, dates (between 0000-03-01 and 9999-12-31), times and sep2
for list
columns where each cell can itself be a vector.
To save space, fwrite
prefers to write wide numeric values in scientific notation – e.g. 10000000000
takes up much more space than 1e+10
. Most file readers (e.g. fread
) understand scientific notation, so there's no fidelity loss. Like in base R, users can control this by specifying the scipen
argument, which follows the same rules as options('scipen')
. fwrite
will see how much space a value will take to write in scientific vs. decimal notation, and will only write in scientific notation if the latter is more than scipen
characters wider. For 10000000000
, then, 1e+10
will be written whenever scipen<6
.
CSVY Support:
The following fields will be written to the header of the file and surrounded by ---
on top and bottom:
source
- Contains the R version and data.table
version used to write the file
creation_time_utc
- Current timestamp in UTC time just before the header is written
schema
with element fields
giving name
-type
(class
) pairs for the table; multi-class objects (e.g. c('POSIXct', 'POSIXt')
) will have their first class written.
header
- same as col.names
(which is header
on input)
sep
sep2
eol
na.strings
- same as na
dec
qmethod
logical01
DF = data.frame(A=1:3, B=c("foo","A,Name","baz")) fwrite(DF) write.csv(DF, row.names=FALSE, quote=FALSE) # same fwrite(DF, row.names=TRUE, quote=TRUE) write.csv(DF) # same DF = data.frame(A=c(2.1,-1.234e-307,pi), B=c("foo","A,Name","bar")) fwrite(DF, quote='auto') # Just DF[2,2] is auto quoted write.csv(DF, row.names=FALSE) # same numeric formatting DT = data.table(A=c(2,5.6,-3),B=list(1:3,c("foo","A,Name","bar"),round(pi*1:3,2))) fwrite(DT) fwrite(DT, sep="|", sep2=c("{",",","}")) ## Not run: set.seed(1) DT = as.data.table( lapply(1:10, sample, x=as.numeric(1:5e7), size=5e6)) # 382MB system.time(fwrite(DT, "/dev/shm/tmp1.csv")) # 0.8s system.time(write.csv(DT, "/dev/shm/tmp2.csv", # 60.6s quote=FALSE, row.names=FALSE)) system("diff /dev/shm/tmp1.csv /dev/shm/tmp2.csv") # identical set.seed(1) N = 1e7 DT = data.table( str1=sample(sprintf("%010d",sample(N,1e5,replace=TRUE)), N, replace=TRUE), str2=sample(sprintf("%09d",sample(N,1e5,replace=TRUE)), N, replace=TRUE), str3=sample(sapply(sample(2:30, 100, TRUE), function(n) paste0(sample(LETTERS, n, TRUE), collapse="")), N, TRUE), str4=sprintf("%05d",sample(sample(1e5,50),N,TRUE)), num1=sample(round(rnorm(1e6,mean=6.5,sd=15),2), N, replace=TRUE), num2=sample(round(rnorm(1e6,mean=6.5,sd=15),10), N, replace=TRUE), str5=sample(c("Y","N"),N,TRUE), str6=sample(c("M","F"),N,TRUE), int1=sample(ceiling(rexp(1e6)), N, replace=TRUE), int2=sample(N,N,replace=TRUE)-N/2 ) # 774MB system.time(fwrite(DT,"/dev/shm/tmp1.csv")) # 1.1s system.time(write.csv(DT,"/dev/shm/tmp2.csv", # 63.2s row.names=FALSE, quote=FALSE)) system("diff /dev/shm/tmp1.csv /dev/shm/tmp2.csv") # identical unlink("/dev/shm/tmp1.csv") unlink("/dev/shm/tmp2.csv") ## End(Not run)
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.