Construct a Ctype struct
Construct arbitrarily complex ‘struct’ures in R for use with on-disk C structs.
struct(..., bytes, offset)
is.struct(x)
...      Field types contained in struct.
bytes    The total number of bytes in the struct. See details.
offset   The byte offset of members of the struct. See details.
x        Object to test.
struct provides a high-level, R-based description of a C-based struct data type on disk.
The types of data that can be contained within a structure (byte array) on disk can be any permutation of the following: int8, uint8, int16, uint16, int32, real32, and real64. ‘struct’s are not recursive; that is, all structs contained within a struct must be logically flattened (core elements extracted).
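As a rough illustration of the byte-array view, the sketch below tabulates the on-disk width of each type listed above and derives the member offsets of a packed (unpadded) struct. It uses base R only; `type_bytes` and `packed_offsets` are hypothetical names introduced for this illustration and are not part of mmap.

```r
# Width in bytes of each fixed-size type listed above.
type_bytes <- c(int8 = 1L, uint8 = 1L, int16 = 2L, uint16 = 2L,
                int32 = 4L, real32 = 4L, real64 = 8L)

# Hypothetical helper: offsets of members laid out back to back (no padding).
packed_offsets <- function(types) {
  widths <- unname(type_bytes[types])
  cumsum(c(0L, head(widths, -1L)))
}

members <- c("int8", "real32", "real64")
offsets <- packed_offsets(members)     # 0 1 5
total   <- sum(type_bytes[members])    # 13
```

For a packed layout like this one, the offsets and total are simply running sums of the member widths, which is why bytes and offset only need to be given explicitly when the data on disk deviates from it.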
All C types are converted to the appropriate R type internally.
It is best to consider a struct a simple byte array where, at specified offsets, a valid C variable type exists. Describing the struct using the R function struct allows mmap extraction to proceed as if the entire structure were one block (a single ‘i’ value), so each block of bytes can be read into R with one operation.
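The "byte array with typed values at offsets" view can be sketched with base R alone. The example below builds one packed 13-byte record in memory and decodes each member from its offset with readBin. The helper `field` is a hypothetical name for this illustration, and the code only mimics the idea; it is not how mmap is implemented.

```r
# Build one packed 13-byte record: int8, 32-bit float, 64-bit double.
rec <- c(writeBin(7L,     raw(), size = 1L),
         writeBin(2.5,    raw(), size = 4L),
         writeBin(100.25, raw(), size = 8L))

# Hypothetical helper: take `n_bytes` bytes starting at a 0-based offset.
field <- function(bytes, offset, n_bytes) bytes[offset + seq_len(n_bytes)]

# Decode each member from its offset within the record.
decoded <- list(
  readBin(field(rec, 0, 1), "integer", size = 1L, signed = TRUE),
  readBin(field(rec, 1, 4), "numeric", size = 4L),
  readBin(field(rec, 5, 8), "numeric", size = 8L)
)
```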
One important distinction between the R struct (and the examples that follow) and a C struct is related to byte-alignment. Note that the R version is effectively serializing the data, without padding to word boundaries. See the following section on ANSI C for more details on reading data generated by an external process such as C/C++.
A list of values, one element for each type of R data.
ANSI C structs will typically contain padding where the language rules and/or the generating C program require it. In general, if the struct on disk has padding, bytes and offset must be supplied to maintain alignment with the extraction and replacement code in mmap for R.
A simple example of this is a struct holding an 8-byte double (real64) and a 4-byte integer (int32). When created by a C/C++ program, the result will typically be a 16-byte struct, in which the final 4 bytes are padding.
To accommodate this from mmap, it is required to specify the corrected bytes (e.g. bytes=16 in this example). For cases where padding is not at the end of the struct (e.g. if an additional 8-byte double was added as the final member of the previous struct), it would also be necessary to correct the offset to reflect the internal padding. Here, the correct setting would be offset=c(0,8,16), since the 4-byte integer will be padded to 8 bytes to allow the final double to begin on a word boundary (on a 64-bit platform).
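The arithmetic behind these values can be sketched as follows. The helper `padded_layout` is hypothetical and assumes the common "natural alignment" rule (each member aligned to its own width, the whole struct padded to a multiple of its widest member); actual compilers, ABIs, and packing pragmas may lay things out differently, so the result should be checked against the generating program.

```r
# Hypothetical helper: natural-alignment layout for members of given widths.
# Assumes each member is aligned to its own width and the struct is padded
# to a multiple of its widest member; real compilers and ABIs may differ.
padded_layout <- function(widths) {
  offsets <- numeric(length(widths))
  pos <- 0
  for (k in seq_along(widths)) {
    pos <- ceiling(pos / widths[k]) * widths[k]   # round up to alignment
    offsets[k] <- pos
    pos <- pos + widths[k]
  }
  align <- max(widths)
  list(offset = offsets, bytes = ceiling(pos / align) * align)
}

two   <- padded_layout(c(8, 4))      # double, int
three <- padded_layout(c(8, 4, 8))   # double, int, double
```

Under these assumptions, `two` gives bytes = 16 and `three` gives offset = c(0, 8, 16), matching the values described above; the results are the kind of values that would be passed to struct as bytes and offset.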
This is a general mechanism to adjust for offset, but it requires knowledge of both the struct on disk and the generating process. At some point in the near future struct will attempt to properly adjust for offset if mmap is used on data created outside of R.
It is important to note that this alignment is also dependent on the underlying hardware word size (size_t) and is more complicated than the above example.
‘struct’s can be thought of as ‘rows’ in a database. If many different types always need to be returned together, it will be more efficient to store them together in a struct on disk. This reduces the number of page hits required to fetch all required data. Conversely, if individual columns are desired, it will likely make sense to simply store vectors in separate files on disk and read them in with mmap individually as needed.
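The row-oriented layout can be illustrated with plain file connections. The sketch below writes three packed 12-byte records and fetches one whole row with a single seek; `read_row` and `record_bytes` are hypothetical names for this illustration, and base R connections stand in for mmap here.

```r
# Write three packed 12-byte records (int32 followed by a 64-bit double).
path <- tempfile()
con <- file(path, open = "wb")
ids  <- c(10L, 20L, 30L)
vals <- c(1.5, 2.5, 3.5)
for (k in seq_along(ids)) {
  writeBin(ids[k],  con, size = 4L)
  writeBin(vals[k], con, size = 8L)
}
close(con)

# Row layout: record i starts at (i - 1) * record_bytes, so both fields
# of a row sit next to each other and are fetched after a single seek.
record_bytes <- 12L
read_row <- function(path, i) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  seek(con, where = (i - 1L) * record_bytes)
  list(id  = readBin(con, "integer", size = 4L),
       val = readBin(con, "numeric", size = 8L))
}

row2 <- read_row(path, 2L)
unlink(path)
```

With a column layout, the same row would need one seek per file, which is the extra cost the paragraph above refers to.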
Note that not all behavior of struct extraction and replacement is defined for all virtual and real types yet. This is an ongoing development and will be completed in the near future.
Jeffrey A. Ryan
library(mmap)

tmp <- tempfile()
f <- file(tmp, open="ab")
u_int_8 <- c(1L, 255L, 22L)          # 1 byte, valid range 0:255
int_8 <- c(1L, -127L, -22L)          # 1 byte, valid range -128:127
u_int_16 <- c(1L, 65000L, 1000L)     # 2 byte, valid range 0:65535
int_16 <- c(1L, 25000L, -1000L)      # 2 byte, valid range -32768:32767
int_32 <- c(98743L, -9083299L, 0L)   # 4 byte, standard R integer
float_32 <- c(9832.22, 3.14159, 0.00001)
cplx_64 <- c(1+0i, 0+8i, 2+2i)       # not yet supported in struct
char_ <- writeBin(as.raw(1:3), raw())
fixed_width_string <- c("ab","cd","ef")
for(i in 1:3) {
  writeBin(u_int_8[i], f, size=1L)
  writeBin(int_8[i], f, size=1L)
  writeBin(u_int_16[i], f, size=2L)
  writeBin(int_16[i], f, size=2L)
  writeBin(int_32[i], f, size=4L)
  writeBin(float_32[i], f, size=4L)  # store as 32bit - prec issues
  writeBin(float_32[i], f, size=8L)  # store as 64bit
  writeBin(cplx_64[i], f)
  writeBin(char_[i], f)
  writeBin(fixed_width_string[i], f)
}
close(f)

m <- mmap(tmp, struct(uint8(),
                      int8(),
                      uint16(),
                      int16(),
                      int32(),
                      real32(),
                      real64(),
                      cplx(),
                      char(),   # also raw()
                      char(2)   # character array of n characters each
                      ))
length(m)  # only 3 'struct' elements
str(m[])
m[1:2]

# add a post-processing function to convert some elements (rows) to a data.frame
extractFUN(m) <- function(x,i,...) {
  x <- x[i]
  # names follow the order of the struct members defined above
  data.frame(u_int_8=x[[1]],
             int_8=x[[2]],
             u_int_16=x[[3]],
             int_16=x[[4]],
             int_32=x[[5]],
             float_32=x[[6]]
            )
}
m[1:2]
munmap(m)

# grouping commonly fetched data by row reduces
# disk IO, as values reside together on a page
# in memory (which is paged in by mmap). Here
# we try 3 columns, or one row of 3 values.
# note that with structs we replicate a row-based
# structure.
#
# 13 byte struct
x <- c(writeBin(1L, raw(), size=1),
       writeBin(3.14, raw(), size=4),
       writeBin(100.1, raw(), size=8))
writeBin(rep(x,1e6), tmp)
length(x)
m <- mmap(tmp, struct(int8(),real32(),real64()))
length(m)
m[1]

# create the columns in separate files (like a column
# store)
t1 <- tempfile()
t2 <- tempfile()
t3 <- tempfile()
writeBin(rep(x[1],1e6), t1)
writeBin(rep(x[2:5],1e6), t2)
writeBin(rep(x[6:13],1e6), t3)
m1 <- mmap(t1, int8())
m2 <- mmap(t2, real32())
m3 <- mmap(t3, real64())
list(m1[1],m2[1],m3[1])

i <- 5e5:6e5

# note that times are ~3x faster for the struct
# due to decreased disk IO and CPU cost to process
# (the loop counter is j so that the index vector i is not overwritten)
system.time(for(j in 1:100) m[i])
system.time(for(j in 1:100) m[i])
system.time(for(j in 1:100) list(m1[i],m2[i],m3[i]))
system.time(for(j in 1:100) list(m1[i],m2[i],m3[i]))
system.time(for(j in 1:100) {m1[i];m2[i];m3[i]})  # no cost to list()

# you can skip struct members by specifying offset and bytes
m <- mmap(tmp, struct(int8(),
                      #real32(),  here we are skipping the 4 byte float
                      real64(),
                      offset=c(0,5), bytes=13))
# alternatively you can add padding directly
n <- mmap(tmp, struct(int8(), pad(4), real64()))

pad(4)
pad(int32())
m[1]
n[1]

munmap(m)
munmap(n)
munmap(m1)
munmap(m2)
munmap(m3)
unlink(t1)
unlink(t2)
unlink(t3)
unlink(tmp)