ff classes for representing (large) atomic data
The ff package provides atomic data structures that are stored on disk but behave (almost) as if they were in RAM by mapping only a section (pagesize) into main memory, which is the effective main memory consumption per ff object.
Several access optimization techniques such as Hybrid Index Preprocessing (as.hi, update.ff) and Virtualization (virtual, vt, vw) are implemented to achieve good performance even with large datasets.
In addition to the basic access functions, the ff package also provides compatibility functions that facilitate writing code for ff and ram objects (clone, as.ff, as.ram) and very basic support for operating on ff objects (ffapply).
While the (possibly packed) raw data is stored in a flat file, meta information about the atomic data structure such as its dimension, virtual storage mode (vmode), factor level encoding, internal length etc. is stored as an ordinary R object (external pointer plus attributes) and can be saved in the workspace.
The raw flat file data encoding is always in native machine format for optimal performance and provides several packing schemes for different data types such as logical, raw, integer and double (an extended version supports more tightly packed virtual data types).
Flat file data can be shared among ff objects in the same R process or even across different R processes thanks to memory mapping, although the caching effects have not been tested extensively.
Please do read and understand the limitations and warnings in LimWarn before you do anything serious with package ff.
ff( initdata = NULL
, length = NULL
, levels = NULL
, ordered = NULL
, dim = NULL
, dimorder = NULL
, bydim = NULL
, symmetric = FALSE
, fixdiag = NULL
, names = NULL
, dimnames = NULL
, ramclass = NULL
, ramattribs = NULL
, vmode = NULL
, update = NULL
, pattern = NULL
, filename = NULL
, overwrite = FALSE
, readonly = FALSE
, pagesize = NULL # getOption("ffpagesize")
, caching = NULL # getOption("ffcaching")
, finalizer = NULL
, finonexit = NULL # getOption("fffinonexit")
, FF_RETURN = TRUE
, BATCHSIZE = .Machine$integer.max
, BATCHBYTES = getOption("ffbatchbytes")
, VERBOSE = FALSE
)
initdata | scalar or vector of the |
length | optional vector |
levels | optional character vector of levels if (in this case initdata must be composed of these) (default: derive from initdata) |
ordered | indicate whether the levels are ordered (TRUE) or non-ordered factor (FALSE, default) |
dim | |
dimorder | physical layout (default seq_along(dim)), see |
bydim | dimorder by which to interpret the 'initdata', generalization of the 'byrow' parameter in |
symmetric | extended feature: TRUE creates symmetric matrix (default FALSE) |
fixdiag | extended feature: non-NULL scalar requires fixed diagonal for symmetric matrix (default NULL is free diagonal) |
names | NOT taken from initdata, see |
dimnames | NOT taken from initdata, see |
ramclass | class attribute attached when moving all or parts of this ff into ram, see |
ramattribs | additional attributes attached when moving all or parts of this ff into ram, see |
vmode | virtual storage mode (default: derive from 'initdata'), see |
update | set to FALSE to avoid updating with 'initdata' (default TRUE) (used by |
pattern | root pattern with or without path for automatic ff filename creation (default NULL translates to "ff"), see also argument 'filename' |
filename | ff |
overwrite | set to TRUE to allow overwriting existing files (default FALSE) |
readonly | set to TRUE to forbid writing to existing files |
pagesize | pagesize in bytes for the memory mapping (default from |
caching | caching scheme for the backend, currently 'mmnoflush' or 'mmeachflush' (flush mmpages at each swap, default from |
finalizer | name of finalizer function called when ff object is |
finonexit | logical scalar determining whether and |
FF_RETURN | logical scalar or ff object to be used. The default TRUE creates a new ff file. FALSE returns a ram object. Handing over an ff object here uses this or stops if not |
BATCHSIZE | integer scalar limiting the number of elements to be processed in |
BATCHBYTES | integer scalar limiting the number of bytes to be processed in |
VERBOSE | set to TRUE for verbosing in |
The atomic data is stored in filename as a native encoded raw flat file on disk; OS-specific limitations of the file system apply.
The number of elements per ff object is limited by integer indexing, i.e. .Machine$integer.max.
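To make the element limit concrete, the following base R sketch estimates the largest flat file that a single ff object could need. The bytes-per-element values are assumptions for unpacked native integer (4 bytes) and double (8 bytes) storage; packed vmodes would need less.

```r
# Assumed storage widths in bytes for two unpacked native types
bytes_per_element <- c(integer = 4, double = 8)

# Upper bound on the number of elements per ff object
max_elements <- .Machine$integer.max

# Largest possible file size per storage type, in bytes and in GiB
max_file_bytes <- max_elements * bytes_per_element
max_file_gib <- max_file_bytes / 2^30

print(round(max_file_gib, 2))
```

Comparing these sizes with the file size limit of the target file system shows whether a full-length ff object can be stored there at all.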
Atomic objects created with ff are is.open: a C++ object is ready to access the file via memory-mapping.
Currently the C++ backend provides two caching schemes: 'mmnoflush' lets the OS decide when to flush memory mapped pages
and 'mmeachflush' will flush memory mapped pages at each page swap per ff file.
These minimal memory resources can be released by closing or deleting the ff file (close, delete).
ff objects can be saved and loaded across R sessions (save, load). If the ff file still exists in the same location,
it will be opened automatically at the first attempt to access its data. If the ff object is removed (rm),
the ff object's finalizer is invoked at the next garbage collection (see gc).
Raw data files can be made accessible as an ff object by explicitly giving the filename and vmode but no size information (length or dim).
The ff object will open the file and handle the data with respect to the given vmode.
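As a minimal sketch of this, the code below first writes a small raw file of doubles with base R and then opens it through ff by passing only filename and vmode. The temporary path and the variable names are illustrative assumptions, and finalizer = "close" is chosen here so that finalization does not delete the file.

```r
library(ff)

# Write ten doubles in native machine format; this stands in for an
# existing raw data file (the path is a throwaway temporary file).
raw_path <- tempfile(fileext = ".bin")
writeBin(as.double(1:10), raw_path)

# Open the existing file: filename and vmode are given, length and dim are not.
x <- ff(filename = raw_path, vmode = "double", finalizer = "close")

n_elements <- length(x)    # derived from the file size and the vmode
first_values <- x[1:3]     # read through the memory mapping

# Release the mapping before removing the throwaway file
close(x)
unlink(raw_path)
```

Because the file carries no metadata, a wrong vmode would silently reinterpret the bytes, so the vmode has to match the encoding that was used to write the file.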
The close finalizer will close the ff file, the delete finalizer will delete the ff file.
The default finalizer deleteIfOpen will delete open files and do nothing for closed files. If the default finalizer is used,
two actions are needed to protect the ff file against deletion: create the file outside the standard 'fftempdir' and close the ff object before removing it or before quitting R.
When R is exited through q, the finalizer will be invoked depending on the 'fffinonexit' option; furthermore the 'fftempdir' is unlinked.
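The two protective actions can be sketched as follows. The directory and file names are hypothetical placeholders: a subfolder of tempdir() is used only so that the sketch leaves nothing behind, whereas real data should go to a persistent directory that is neither getOption("fftempdir") nor beneath it. The finalizer is set explicitly to "deleteIfOpen" to match the default described here.

```r
library(ff)

# Action 1: create the file at an explicit location (placeholder path,
# see the note above about choosing a persistent directory in real use).
keep_dir <- file.path(tempdir(), "ff_keep")
dir.create(keep_dir, showWarnings = FALSE)
keep_path <- file.path(keep_dir, "kept.ff")
x <- ff(1:12, filename = keep_path, finalizer = "deleteIfOpen")

# Action 2: close the ff object before removing it, so that the
# deleteIfOpen finalizer finds a closed file and leaves it alone.
close(x)
rm(x)
invisible(gc())   # triggers finalization of the removed object

survived <- file.exists(keep_path)

# Clean up the placeholder directory used by this sketch
unlink(keep_dir, recursive = TRUE)
```

Skipping the close() call would leave the object open at finalization time, in which case deleteIfOpen removes the file.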
physical | an external pointer of class ' |
virtual | an empty list which carries attributes with copy by value semantics: changing a virtual attribute of a copy does not change the original |
The 'ff_pointer' carries the following 'physical' or readonly attributes, which are accessible via physical:
vmode | see vmode |
maxlength | see maxlength |
pattern | see parameter 'pattern' |
filename | see filename |
pagesize | see parameter 'pagesize' |
caching | see parameter 'caching' |
finalizer | see parameter 'finalizer' |
finonexit | see parameter 'finonexit' |
readonly | see is.readonly |
class | The external pointer needs class 'ff_pointer' to allow method dispatch of finalizers |
The 'virtual' component carries the following attributes (some of which might be NULL):
Length | see length.ff |
Levels | see levels.ff |
Names | see names.ff |
VW | see vw.ff |
Dim | see dim.ff |
Dimorder | see dimorder |
Symmetric | see symmetric.ff |
Fixdiag | see fixdiag.ff |
ramclass | see ramclass |
ramattribs | see ramattribs |
You should not rely on the internal structure of ff objects or their ram versions. Instead use the accessor functions like vmode, physical and virtual.
Still it would be wise to avoid attributes AND classes 'vmode', 'physical' and 'virtual' in any other packages.
Note that the 'ff' object's class attribute also has copy-by-value semantics ('virtual').
For the 'ff' object the following class attributes are known:
vector | c("ff_vector","ff") |
matrix | c("ff_matrix","ff_array","ff") |
array | c("ff_array","ff") |
symmetric matrix | c("ff_symm","ff") |
distance matrix | c("ff_dist","ff_symm","ff") |
reserved for future use | c("ff_mixed","ff") |
The following methods and functions are available for ff objects:
Type | Name | Assign | Comment |
Basic functions | |||
function | ff | | constructor for ff and ram objects |
generic | update | | updates one ff object with the content of another |
generic | clone | | clones an ff object optionally changing some of its features |
method | print | | print ff |
method | str | | ff object structure |
Class test and coercion | |||
function | is.ff | | check if inherits from ff |
generic | as.ff | | coerce to ff, if not yet |
generic | as.ram | | coerce to ram retaining some of the ff information |
generic | as.bit | | coerce to bit |
Virtual storage mode | |||
generic | vmode | <- | get and set virtual mode (setting only for ram, not for ff objects) |
generic | as.vmode | | coerce to vmode (only for ram, not for ff objects) |
Physical attributes | |||
function | physical | <- | set and get physical attributes |
generic | filename | <- | get and set filename |
generic | pattern | <- | get pattern and set filename path and prefix via pattern |
generic | maxlength | | get maxlength |
generic | is.sorted | <- | set and get if is marked as sorted |
generic | na.count | <- | set and get NA count, if set to non-NA only swap methods can change and na.count is maintained automatically |
generic | is.readonly | | get if is readonly |
Virtual attributes | |||
function | virtual | <- | set and get virtual attributes |
method | length | <- | set and get length |
method | dim | <- | set and get dim |
generic | dimorder | <- | set and get the order of dimension interpretation |
generic | vt | | virtually transpose ff_array |
method | t | | create transposed clone of ff_array |
generic | vw | <- | set and get virtual windows |
method | names | <- | set and get names |
method | dimnames | <- | set and get dimnames |
generic | symmetric | | get if is symmetric |
generic | fixdiag | <- | set and get fixed diagonal of symmetric matrix |
method | levels | <- | levels of factor |
generic | recodeLevels | | recode a factor to different levels |
generic | sortLevels | | sort the levels and recode a factor |
method | is.factor | | if is factor |
method | is.ordered | | if is ordered (factor) |
generic | ramclass | | get ramclass |
generic | ramattribs | | get ramattribs |
Access functions | |||
function | get.ff | | get single ff element (currently [[ is a shortcut) |
function | set.ff | | set single ff element (currently [[<- is a shortcut) |
function | getset.ff | | set single ff element and get old value in one access operation |
function | read.ff | | get vector of contiguous elements |
function | write.ff | | set vector of contiguous elements |
function | readwrite.ff | | set vector of contiguous elements and get old values in one access operation |
method | [ | | get vector of indexed elements, uses HIP, see hi |
method | [<- | | set vector of indexed elements, uses HIP, see hi |
generic | swap | | set vector of indexed elements and get old values in one access operation |
generic | add | | (almost) unifies '+=' operation for ff and ram objects |
generic | bigsample | | sample from ff object |
Opening/Closing/Deleting | |||
generic | is.open | | check if ff is open |
method | open | | open ff object (is done automatically on access) |
method | close | | close ff object (releases C++ memory and protects against file deletion if the deleteIfOpen finalizer is used) |
generic | delete | | deletes ff file (unconditionally) |
generic | deleteIfOpen | | deletes ff file if ff object is open (finalization method) |
generic | finalizer | <- | get and set finalizer |
generic | finalize | | force finalization |
Other | |||
function | geterror.ff | | get error code |
function | geterrstr.ff | | get error message |
option | description | default |
fftempdir | default directory for creating ff files | tempdir |
fffinalizer | name of default finalizer | deleteIfOpen |
fffinonexit | default for invoking finalizer on exit of R | TRUE |
ffpagesize | default pagesize | getdefaultpagesize |
ffcaching | caching scheme for the C++ backend | 'mmnoflush' |
ffdrop | default for the drop parameter in the ff subscript methods | TRUE |
ffbatchbytes | default for the byte limit in batched/chunked processing | memory.limit() %/% 100 |
The following table gives an overview of file size limits for common file systems (see https://en.wikipedia.org/wiki/Comparison_of_file_systems for further details):
File System | File size limit |
FAT16 | 2GB |
FAT32 | 4GB |
NTFS | 16EB |
ext2/3/4 | 16GB to 2TB |
ReiserFS | 4GB (up to version 3.4) / 8TB (from version 3.5) |
XFS | 8EB |
JFS | 4PB |
HFS | 2GB |
HFS Plus | 8EB |
UFS1 | 4GB to 256TB |
UFS2 | 512GB to 32PB |
UDF | 16EB |
Package Version 1.0
Daniel Adler | dadler@uni-goettingen.de |
R package design, C++ generic file vectors, Memory-Mapping, 64-bit Multi-Indexing adapter and Documentation, Platform ports | |
Oleg Nenadic | onenadi@uni-goettingen.de |
Index sequence packing, Documentation | |
Walter Zucchini | wzucchi@uni-goettingen.de |
Array Indexing, Sampling, Documentation | |
Christian Gläser | christian_glaeser@gmx.de |
Wrapper for biglm package | |
Package Version 2.0
Jens Oehlschlägel | Jens.Oehlschlaegel@truecluster.com |
R package redesign; Hybrid Index Preprocessing; transparent object creation and finalization; vmode design; virtualization and hybrid copying; arrays with dimorder and bydim; symmetric matrices; factors and POSIXct; virtual windows and transpose; new generics update, clone, swap, add, as.ff and as.ram; ffapply and collapsing functions. R-coding, C-coding and Rd-documentation. | |
Daniel Adler | dadler@uni-goettingen.de |
C++ generic file vectors, vmode implementation and low-level bit-packing/unpacking, arithmetic operations and NA handling, Memory-Mapping and backend caching. C++ coding and platform ports. R-code extensions for opening existing flat files readonly and shared. | |
Package under GPL-2, included C++ code released by Daniel Adler under the less restrictive ISCL
Note that the standard finalizers are generic functions: their dispatch to the 'ff_pointer' method happens at finalization time, while their 'ff' methods exist for direct calling.
message("make sure you understand the following ff options before you start using the ff package!!")
oldoptions <- options(fffinalizer="deleteIfOpen", fffinonexit="TRUE", fftempdir=tempdir())
message("an integer vector")
ff(1:12)
message("a double vector of length 12")
ff(0, 12)
message("a 2-bit logical vector of length 12 (vmode='boolean' has 1 bit)")
ff(vmode="logical", length=12)
message("an integer matrix 3x4 (standard colwise physical layout)")
ff(1:12, dim=c(3,4))
message("an integer matrix 3x4 (rowwise physical layout, but filled in standard colwise order)")
ff(1:12, dim=c(3,4), dimorder=c(2,1))
message("an integer matrix 3x4 (standard colwise physical layout, but filled in rowwise order aka matrix(, byrow=TRUE))")
ff(1:12, dim=c(3,4), bydim=c(2,1))
gc()
options(oldoptions)

if (ffxtensions()){
  message("a 26-dimensional boolean array using 1-bit representation (file size 8 MB compared to 256 MB int in ram)")
  a <- ff(vmode="boolean", dim=rep(2, 26))
  dimnames(a) <- dummy.dimnames(a)
  rm(a); gc()
}

## Not run:
message("This 2GB biglm example can take long, you might want to change the size in order to define a size appropriate for your computer")
require(biglm)

b <- 1000
n <- 100000
k <- 3
memory.size(max = TRUE)
system.time(
  x <- ff(vmode="double", dim=c(b*n,k), dimnames=list(NULL, LETTERS[1:k]))
)
memory.size(max = TRUE)
system.time(
  ffrowapply({
    l <- i2 - i1 + 1
    z <- rnorm(l)
    x[i1:i2,] <- z + matrix(rnorm(l*k), l, k)
  }, X=x, VERBOSE=TRUE, BATCHSIZE=n)
)
memory.size(max = TRUE)

form <- A ~ B + C
first <- TRUE
system.time(
  ffrowapply({
    if (first){
      first <- FALSE
      fit <- biglm(form, as.data.frame(x[i1:i2,,drop=FALSE], stringsAsFactors = TRUE))
    }else
      fit <- update(fit, as.data.frame(x[i1:i2,,drop=FALSE], stringsAsFactors = TRUE))
  }, X=x, VERBOSE=TRUE, BATCHSIZE=n)
)
memory.size(max = TRUE)
first
fit
summary(fit)
rm(x); gc()
## End(Not run)