Object-Oriented Interface to Foreign Files
Importer objects are objects that refer to an external data file. Currently only Stata files, SPSS system, portable, and fixed-column files are supported.
Data are actually imported by ‘translating’ an importer file into a data.set using as.data.set or subset.
The importer mechanism is more flexible and extensible than read.spss and read.dta of package "foreign", as most of the parsing of the file headers is done in R. It is also adapted to efficiently load large data sets. Most importantly, importer objects support the labels, missing.values, and descriptions provided by this package.
    spss.file(file,...)

    spss.fixed.file(file,
        columns.file,
        varlab.file=NULL,
        codes.file=NULL,
        missval.file=NULL,
        count.cases=TRUE,
        to.lower=getOption("spss.fixed.to.lower",FALSE),
        iconv=TRUE,
        encoded=getOption("spss.fixed.encoding","cp1252"))

    spss.portable.file(file,
        varlab.file=NULL,
        codes.file=NULL,
        missval.file=NULL,
        count.cases=TRUE,
        to.lower=getOption("spss.por.to.lower",FALSE),
        iconv=TRUE,
        encoded=getOption("spss.por.encoding","cp1252"))

    spss.system.file(file,
        varlab.file=NULL,
        codes.file=NULL,
        missval.file=NULL,
        count.cases=TRUE,
        to.lower=getOption("spss.sav.to.lower",FALSE),
        iconv=TRUE,
        encoded=getOption("spss.sav.encoding","cp1252"),
        ignore.scale.info = FALSE)

    Stata.file(file,
        iconv=TRUE,
        encoded=if(new_format) getOption("Stata.new.encoding","utf-8")
                else getOption("Stata.old.encoding","cp1252"))

    ## The most important methods for "importer" objects are:

    ## S4 method for signature 'importer'
    subset(x, subset, select, drop = FALSE, ...)

    ## S4 method for signature 'importer'
    as.data.set(x,row.names=NULL,optional=NULL,
        compress.storage.modes=FALSE,...)

    ## S4 method for signature 'importer'
    head(x,n=20,...)

    ## S4 method for signature 'importer'
    tail(x,n=20,...)
file | character string; the path to the file containing the data.
... | other arguments.
columns.file | character string; the path to an SPSS/PSPP syntax file that specifies the columns occupied by the variables.
varlab.file | character string; the path to an SPSS/PSPP syntax file that specifies the variable labels.
codes.file | character string; the path to an SPSS/PSPP syntax file that specifies the value labels.
missval.file | character string; the path to an SPSS/PSPP syntax file that specifies the missing values.
count.cases | logical; should the cases in the file be counted? This takes effect only if the data file does not already contain information about the number of cases.
to.lower | logical; should variable names be changed to lower case?
iconv | logical; should strings (in labels and variables) be converted to the encoding of the platform?
encoded | a character string; the way characters are encoded in the imported file. For the available encoding options see …
ignore.scale.info | logical; should information about measurement scale levels provided in the file be ignored?
x | an object that inherits from class "importer".
subset | a logical vector or an expression containing variables from the external data file that evaluates to logical.
select | a vector of variable names from the external data file. This may also be a named vector, where the names give the names into which the variables from the external data file are renamed.
drop | a logical value that determines what happens if only one column is selected. If TRUE and only one column is selected, …
row.names | ignored, present only for compatibility.
optional | ignored, present only for compatibility.
compress.storage.modes | logical value; if TRUE, floating point values are converted to integers if this is possible without loss of information.
n | integer; the number of rows to be shown by head or tail.
A call to a ‘constructor’ for an importer object, that is, spss.fixed.file, spss.portable.file, spss.system.file, or Stata.file, causes R to read in the header of the data file and/or the syntax files that contain information about the variables, such as the columns that they occupy (in case of spss.fixed.file), variable labels, value labels and missing values. The information in the file header and/or the accompanying files is then processed to prepare the file for importing. Thus the inner structure of an importer object may well vary according to what type of file is to be imported and what additional information is given.
The as.data.set and subset methods for "importer" objects internally use the generic functions seekData, readData, readSlice, and readChunk, which have methods for the subclasses of "importer". These functions are not callable from outside the package, however.
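The two-stage workflow described above (first construct the importer, then translate it into a data.set) can be sketched as follows. This is a minimal illustration that assumes the memisc package is installed and reuses the NES 1948 portable file bundled with it; the object names por_path, nes1948, first_rows and nes1948_ds are chosen here for illustration only.

```r
library(memisc)

# Unpack the example file that ships with memisc
# (the same file that is used in the package examples).
por_path <- unzip(system.file("anes/NES1948.ZIP", package = "memisc"),
                  "NES1948.POR", exdir = tempfile())

# Stage 1: the constructor reads the file header and prepares the import.
nes1948 <- spss.portable.file(por_path)

# The variable descriptions come from the header information.
description(nes1948)

# Look at the first few rows without importing the whole file.
first_rows <- head(nes1948, n = 5)

# Stage 2: translate the importer into a data.set.
nes1948_ds <- as.data.set(nes1948)
```

Only the final call imports the complete data; the earlier steps work on the importer object itself.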
The subset method for "importer" objects reads in the data ‘chunk-wise’ to create the subset of observations if the option "subset.chunk.size" is set to a non-NULL value, e.g. by options(subset.chunk.size=1000). This may be useful in case of very large data sets from which only a tiny subset of observations is needed for analysis.
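A chunk-wise subset might look like the sketch below. It assumes the memisc package is installed and uses the bundled NES 1948 file, which is far too small to need chunking and serves only to show the calls. The condition V480018 == 1 is an arbitrary illustration, and the names por_path, old_opts and vote_code_1 are hypothetical.

```r
library(memisc)

# Unpack the example file that ships with memisc.
por_path <- unzip(system.file("anes/NES1948.ZIP", package = "memisc"),
                  "NES1948.POR", exdir = tempfile())
nes1948 <- spss.portable.file(por_path)

# Ask the subset method to read the file in chunks of 1000 cases,
# remembering the previous option value.
old_opts <- options(subset.chunk.size = 1000)

# Keep only the cases that satisfy the condition, and only two variables.
vote_code_1 <- subset(nes1948,
                      V480018 == 1,
                      select = c(V480018, V480047))

# Restore the previous setting.
options(old_opts)
```

Saving and restoring the option keeps the chunk size from leaking into later calls in the same session.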
Since the functions described here are a more or less complete rewrite based on the description of the file structure provided by the documentation for PSPP, they are perhaps not as thoroughly tested as the functions in the foreign package, apart from the frequent use by the author of this package.
spss.fixed.file, spss.portable.file, spss.system.file, and Stata.file return objects of class "spss.fixed.importer", "spss.portable.importer", "spss.system.importer", and "Stata.importer" or "Stata_new.importer", respectively, which, by inheritance, are also objects of class "importer". "Stata.importer" is for files in the format of Stata versions up to 12, while "Stata_new.importer" is for files in the newer format of Stata versions from 13 onwards.
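The class relationship can be inspected directly. The following sketch assumes the memisc package is installed and again uses the bundled NES 1948 portable file; por_path and nes1948 are illustrative names.

```r
library(memisc)

# Unpack the example file that ships with memisc.
por_path <- unzip(system.file("anes/NES1948.ZIP", package = "memisc"),
                  "NES1948.POR", exdir = tempfile())
nes1948 <- spss.portable.file(por_path)

# The specific importer class of the object ...
class(nes1948)

# ... and whether it inherits from the general "importer" class.
methods::is(nes1948, "importer")
```

Testing for "importer" rather than for a specific subclass lets the same code handle any of the supported file types.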
Objects of class "importer" have at least the following two slots:
ptr | an external pointer
variables | a list of objects of class …
The as.data.frame method for importer objects does the actual data import and returns a data frame. Note that, in contrast to read.spss, the variable names of the resulting data frame will be lower case, unless the importer function is called with to.lower=FALSE. If long variable names are defined (in case of a PSPP/SPSS system file), they take precedence and are not coerced to lower case.
    # Extract American National Election Study of 1948
    nes1948.por <- unzip(system.file("anes/NES1948.ZIP",package="memisc"),
                         "NES1948.POR",exdir=tempfile())

    # Get information about the variables contained.
    nes1948 <- spss.portable.file(nes1948.por)

    # The data are not yet loaded:
    show(nes1948)

    # ... but one can see what variables are present:
    description(nes1948)

    # Now a subset of the data is loaded:
    vote.socdem.48 <- subset(nes1948,
                  select=c(
                      V480018,
                      V480029,
                      V480030,
                      V480045,
                      V480046,
                      V480047,
                      V480048,
                      V480049,
                      V480050
                      ))

    # Let's make the names more descriptive:
    vote.socdem.48 <- rename(vote.socdem.48,
                      V480018 = "vote",
                      V480029 = "occupation.hh",
                      V480030 = "unionized.hh",
                      V480045 = "gender",
                      V480046 = "race",
                      V480047 = "age",
                      V480048 = "education",
                      V480049 = "total.income",
                      V480050 = "religious.pref"
                )

    # It is also possible to do both
    # in one step:
    # vote.socdem.48 <- subset(nes1948,
    #              select=c(
    #                  vote           = V480018,
    #                  occupation.hh  = V480029,
    #                  unionized.hh   = V480030,
    #                  gender         = V480045,
    #                  race           = V480046,
    #                  age            = V480047,
    #                  education      = V480048,
    #                  total.income   = V480049,
    #                  religious.pref = V480050
    #                  ))

    # We examine the data more closely:
    codebook(vote.socdem.48)

    # ... and conduct some analyses.
    #
    t(genTable(percent(vote)~occupation.hh,data=vote.socdem.48))

    # We consider only the two main candidates.
    vote.socdem.48 <- within(vote.socdem.48,{
      truman.dewey <- vote
      valid.values(truman.dewey) <- 1:2
      truman.dewey <- relabel(truman.dewey,
                  "VOTED - FOR TRUMAN" = "Truman",
                  "VOTED - FOR DEWEY"  = "Dewey")
      })

    # Logistic regression of the Truman vote on religious preference.
    summary(truman.relig.glm <- glm((truman.dewey=="Truman")~religious.pref,
        data=vote.socdem.48,
        family="binomial"
    ))