Importing csv files into ff data.frames
Function read.table.ffdf reads separated flat files into ffdf objects, very much like (and using) read.table.
It can also work with any convenience wrappers like read.csv and provides its own convenience wrappers (e.g. read.csv.ffdf) for R's usual wrappers.
read.table.ffdf(
  x = NULL
, file, fileEncoding = ""
, nrows = -1, first.rows = NULL, next.rows = NULL
, levels = NULL, appendLevels = TRUE
, FUN = "read.table", ...
, transFUN = NULL
, asffdf_args = list()
, BATCHBYTES = getOption("ffbatchbytes")
, VERBOSE = FALSE
)
read.csv.ffdf(...)
read.csv2.ffdf(...)
read.delim.ffdf(...)
read.delim2.ffdf(...)
x |
NULL or an optional |
file |
the name of the file which the data are to be read from.
Each row of the table appears as one line of the file. If it does
not contain an absolute path, the file name is
relative to the current working directory,
Alternatively, |
fileEncoding |
character string: if non-empty declares the
encoding used on a file (not a connection) so the character data can
be re-encoded. See |
nrows |
integer: the maximum number of rows to read in (includes first.rows in case a 'first' chunk is read). Negative and other invalid values are ignored. |
first.rows |
integer: number of rows to be read in the first chunk, see details. Default is the value given at |
next.rows |
integer: number of rows to be read in further chunks, see details.
By default calculated as |
levels |
NULL or an optional list, each element named with col.names of factor columns specifies the |
appendLevels |
logical.
A vector of permissions to expand |
FUN |
character: name of a function that is called for reading each chunk, see |
... |
further arguments, passed to |
transFUN |
NULL or a function that is called on each data.frame chunk after reading with |
asffdf_args |
further arguments passed to |
BATCHBYTES |
integer: bytes allowed for the size of the |
VERBOSE |
logical: TRUE to report timings for each processed chunk (default FALSE) |
read.table.ffdf has been designed to read very large (many rows) separated flat files in row-chunks and store the result in an ffdf object on disk, but quickly accessible via ff techniques.
The first chunk is read with a default of 1000 rows; for subsequent chunks the number of rows is calculated so as not to require more RAM than getOption("ffbatchbytes").
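The arithmetic behind such a chunk size can be sketched in base R. This is only an illustration, not the package's internal code: the per-column byte costs and the 16 MiB budget below are assumptions chosen for the example, and the names col_bytes, batch_bytes and next_rows_estimate are hypothetical.

```r
# Hypothetical RAM cost in bytes per value for each column of a chunk
# (assumed: 4 for logical, integer and factor codes, 8 for doubles,
# POSIXct and Date values).
col_bytes <- c(log = 4, int = 4, dbl = 8, fac = 4, ord = 4, dct = 8, dat = 8)
bytes_per_row <- sum(col_bytes)

# Hypothetical memory budget for one chunk: 16 MiB.
batch_bytes <- 16 * 2^20

# Largest whole number of rows whose RAM cost stays within the budget.
next_rows_estimate <- batch_bytes %/% bytes_per_row
next_rows_estimate
```

A wider table (more bytes per row) yields proportionally fewer rows per chunk for the same budget.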
The following could be indications to change the parameter first.rows:
- set first.rows=-1 to read the complete file in one go (requires enough RAM)
- set first.rows to a smaller number if the pre-allocation of RAM for the first chunk with parameter nrows in read.table is too large, i.e. with many columns on a machine with little RAM.
- set first.rows to a larger number if you expect better factor level ordering (factor levels are sorted in the first chunk, but not in subsequent chunks; however, factor level ordering can be fixed later, see below).
By default the ffdf object is created on the fly at the end of reading the 'first' chunk, see argument first.rows.
The creation of the ffdf object is done via as.ffdf and can be fine-tuned by passing argument asffdf_args.
Even more control is possible by passing in an ffdf object as argument x to which the read records are appended.
read.table.ffdf has been designed to behave as much like read.table as possible. However, note the following differences:
- Arguments 'colClasses' and 'col.names' are now enforced also during 'next.rows' chunks. For example, giving colClasses=NA will force that no colClasses are derived from the first.rows chunk or, respectively, from the ffdf object in parameter x.
- colClass 'ordered' is allowed and will create an ordered factor.
- character vectors are not supported; character data must be read as one of the following colClasses: 'Date', 'POSIXct', 'factor', 'ordered'. By default character columns are read as factors. Accordingly, arguments 'as.is' and 'stringsAsFactors' are not allowed.
- the sequence of levels.ff from chunked reading can depend on chunk size: by default new levels found in a chunk are appended to the levels found in previous chunks, and no attempt is made to sort and recode the levels during chunked processing. Levels can be sorted and recoded most efficiently after all records have been read, using sortLevels.
- the default for argument 'comment.char' is "" even for those FUN that have a different default. However, explicit specification of 'comment.char' will have priority.
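The dependence of level order on chunk size can be emulated in base R. This is a minimal sketch of the behaviour described above, not the ff implementation; in particular, appending new levels in order of first appearance within a chunk is an assumption made for the illustration, and all variable names are hypothetical.

```r
vals <- c("c", "a", "b", "a", "d")

# Reading everything in one go: factor() sorts the levels.
levels_one_go <- levels(factor(vals))

# Emulated chunked reading: levels are sorted within the first chunk only,
# levels first seen in a later chunk are appended at the end (assumed order:
# first appearance within that chunk).
chunk1 <- vals[1:2]
chunk2 <- vals[3:5]
levels_chunked <- levels(factor(chunk1))
levels_chunked <- c(levels_chunked, setdiff(unique(chunk2), levels_chunked))

# The same data gets different integer codes under the two level sequences.
codes_one_go  <- as.integer(factor(vals, levels = levels_one_go))
codes_chunked <- as.integer(factor(vals, levels = levels_chunked))

# Sorting the levels afterwards recodes the data and restores the sorted order.
f_chunked <- factor(vals, levels = levels_chunked)
f_sorted  <- factor(f_chunked, levels = sort(levels(f_chunked)))
```

For ffdf objects the corresponding post-hoc fix is sortLevels, whose return value must be assigned, as the examples on this page show.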
Note that using the 'skip' argument still requires reading the file from the beginning in order to count the lines to be skipped.
If you first read part of the file in order to understand its structure and then want to continue, a more efficient solution than using 'skip' is to open a file connection and pass it to argument 'file'.
read.table.ffdf does the same in order to skip efficiently over previously read chunks.
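The connection idiom can be sketched with base R's read.csv alone. This is an illustration with made-up data: an open connection keeps its read position, so a second read continues where the first one stopped and no lines need to be skipped again. The variable names are hypothetical.

```r
# Write a small CSV file with a header and 10 data rows (illustrative data).
csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, val = letters[1:10]), csvfile, row.names = FALSE)

# Open the connection explicitly so that the read position is retained
# between calls.
con <- file(csvfile, open = "r")

# First read: the header plus 4 data rows, e.g. to inspect the structure.
head_chunk <- read.csv(con, nrows = 4)

# Second read: continues after the rows already consumed; there is no header
# line at this position, so the column names are passed in.
rest_chunk <- read.csv(con, header = FALSE, col.names = names(head_chunk))

close(con)
unlink(csvfile)
```

Passing a file name instead of the open connection would restart at the top of the file on every call, which is what makes 'skip' expensive for large files.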
Jens Oehlschlägel, Christophe Dutang
message("create some csv data on disk")
x <- data.frame(
  log=rep(c(FALSE, TRUE), length.out=26)
, int=1:26
, dbl=1:26 + 0.1
, fac=factor(letters)
, ord=ordered(LETTERS)
, dct=Sys.time()+1:26
, dat=seq(as.Date("1910/1/1"), length.out=26, by=1)
, stringsAsFactors = TRUE
)
x <- x[c(13:1, 13:1),]
csvfile <- tempPathFile(path=getOption("fftempdir"), extension="csv")
write.csv(x, file=csvfile, row.names=FALSE)

cat("Simply read csv with header\n")
y <- read.csv(file=csvfile, header=TRUE)
y

cat("Read csv with header\n")
ffy <- read.csv.ffdf(file=csvfile, header=TRUE)
ffy
sapply(ffy[,], class)

message("reading with colClasses (an ordered factor won't work in read.csv)")
try(read.csv(file=csvfile, header=TRUE, colClasses=c(ord="ordered")
, stringsAsFactors = TRUE))
# TODO could fix this with the following two commands (Gabor Grothendieck)
# but does not know what bad side-effects this could have
#setOldClass("ordered")
#setAs("character", "ordered", function(from) ordered(from))
y <- read.csv(file=csvfile, header=TRUE, colClasses=c(dct="POSIXct", dat="Date")
, stringsAsFactors = TRUE)
ffy <- read.csv.ffdf(
  file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
)
rbind(
  ram_class = sapply(y, function(x)paste(class(x), collapse = ","))
, ff_class = sapply(ffy[,], function(x)paste(class(x), collapse = ","))
, ff_vmode = vmode(ffy)
)

message("NOTE that reading in chunks can change the sequence of levels and thus the coding")
message("(Sorting levels during chunked reading can be too expensive)")
levels(ffy$fac[])
ffy <- read.csv.ffdf(
  file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, first.rows=6
, next.rows=10
, VERBOSE=TRUE
)
levels(ffy$fac[])

message("If we don't know the levels we can sort them after reading")
message("(Will rewrite all factor codes)")
message("NOTE that you MUST assign the return value of sortLevels()")
ffy <- sortLevels(ffy)
levels(ffy$fac[])

message("If we KNOW the levels we can fix levels upfront")
ffy <- read.csv.ffdf(
  file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, first.rows=6
, next.rows=10
, levels=list(fac=letters, ord=LETTERS)
)
levels(ffy$fac[])

message("Or we inspect a sufficiently large chunk of data and use those")
table(ffy$fac[], exclude=NULL)
ffy <- read.csv.ffdf(
  file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, nrows=13
, VERBOSE=TRUE
)
message("append the rest to ffy")
ffy <- read.csv.ffdf(
  x=ffy
, file=csvfile
, header=FALSE
, skip=1 + nrow(ffy)
, VERBOSE=TRUE
)
table(ffy$fac[], exclude=NULL)

message("We can turn unexpected factor levels to NA, say we only allowed a:l")
ffy <- read.csv.ffdf(
  file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, levels=list(fac=letters[1:12], ord=LETTERS[1:12])
, appendLevels=FALSE
)
sapply(colnames(ffy), function(i)sum(is.na(ffy[[i]][])))

message("let's store some columns more efficiently")
sum(.ffbytes[vmode(ffy)])
ffy$log <- clone(ffy$log, vmode="boolean")
ffy$fac <- clone(ffy$fac, vmode="byte")
ffy$ord <- clone(ffy$ord, vmode="byte")
sum(.ffbytes[vmode(ffy)])

message("let's make a template with zero rows")
ffx <- clone(ffy)
nrow(ffx) <- 0

message("reading with template and colClasses")
ffy <- read.csv.ffdf(
  x=ffx
, file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, next.rows = 12
, VERBOSE = TRUE
)
rbind(
  ff_class = sapply(ffy[,], function(x)paste(class(x), collapse = ","))
, ff_vmode = vmode(ffy)
)
levels(ffx$fac[])
levels(ffy$fac[])

message("reading with template without colClasses")
ffy <- read.csv.ffdf(
  x=ffx
, file=csvfile
, header=TRUE
, next.rows = 12
, VERBOSE = TRUE
)
rbind(
  ff_class = sapply(ffy[,], function(x)paste(class(x), collapse = ","))
, ff_vmode = vmode(ffy)
)
levels(ffx$fac[])
levels(ffy$fac[])

message("We can fine-tune the creation of the ffdf")
message("- let's create the ff files outside of fftempdir")
message("- let's reduce required disk space and thus file.system cache RAM")
message("By default we had record size 36.25")
ffy <- read.csv.ffdf(
  file=csvfile
, header=TRUE
, colClasses=c(ord="ordered", dct="POSIXct", dat="Date")
, asffdf_args=list(
    vmode = c(
      log="boolean"
    , int="byte"
    , dbl="single"
    , fac="nibble" # no NAs
    , ord="nibble" # no NAs
    , dct="single"
    , dat="single"
    )
  , col_args=list(pattern = "./csv") # create in getwd() with prefix csv
  )
)
vmode(ffy)
message("This recordsize is more than 50% reduced")
sum(.ffbytes[vmode(ffy)]) / 36.25

message("Don't forget to wrap-up files that are not in fftempdir")
delete(ffy); rm(ffy)
message("It's a good habit to also wrap-up temporary stuff (or at least know how this is done)")
rm(ffx); gc()

fwffile <- tempfile()

cat(file=fwffile, "123456", "987654", sep="\n")
x <- read.fwf(fwffile, widths=c(1,2,3), stringsAsFactors = TRUE) #> 1 23 456 \ 9 87 654
y <- read.table.ffdf(file=fwffile, FUN="read.fwf", widths=c(1,2,3))
stopifnot(identical(x, y[,]))
x <- read.fwf(fwffile, widths=c(1,-2,3), stringsAsFactors = TRUE) #> 1 456 \ 9 654
y <- read.table.ffdf(file=fwffile, FUN="read.fwf", widths=c(1,-2,3))
stopifnot(identical(x, y[,]))
unlink(fwffile)

cat(file=fwffile, "123", "987654", sep="\n")
x <- read.fwf(fwffile, widths=c(1,0, 2,3), stringsAsFactors = TRUE) #> 1 NA 23 NA \ 9 NA 87 654
y <- read.table.ffdf(file=fwffile, FUN="read.fwf", widths=c(1,0, 2,3))
stopifnot(identical(x, y[,]))
unlink(fwffile)

cat(file=fwffile, "123456", "987654", sep="\n")
x <- read.fwf(fwffile, widths=list(c(1,0, 2,3), c(2,2,2))
, stringsAsFactors = TRUE) #> 1 NA 23 456 98 76 54
y <- read.table.ffdf(file=fwffile, FUN="read.fwf", widths=list(c(1,0, 2,3), c(2,2,2)))
stopifnot(identical(x, y[,]))
unlink(fwffile)

unlink(csvfile)