Update a Data Frame or Cleanup a Data Frame after Importing
cleanup.import will correct errors and shrink the size of data frames.
By default, double precision numeric variables are changed to integer
when they contain no fractional components. Infinite values or values
greater than 1e20 in absolute value are set to NA. This solves problems
of importing Excel spreadsheets that contain occasional character
values for numeric columns, as S converts these to Inf without warning.
There is also an option to convert variable names to lower case and to
add labels to variables. The latter can be made easier by importing a
CNTLOUT dataset created by SAS PROC FORMAT and using the sasdict option
as shown in the example below. cleanup.import can also transform
character or factor variables to dates.
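The two default numeric rules described above (out-of-range values set to NA, whole-valued doubles stored as integer) can be sketched in base R. This is only an illustration of the behaviour, not the code used by cleanup.import; the helper name tidy_numeric is hypothetical, and its big argument simply mirrors the documented default of 1e20.

```r
# Hypothetical helper illustrating the documented rules; not part of Hmisc.
tidy_numeric <- function(x, big = 1e20) {
  # Only double precision vectors are candidates for cleanup.
  if (!is.double(x)) return(x)
  # Infinite values and values larger than `big` in absolute value become NA.
  x[is.infinite(x) | (!is.na(x) & abs(x) > big)] <- NA
  # Store as integer when no non-missing value has a fractional component
  # and every value fits in R's integer range.
  ok <- x[!is.na(x)]
  if (all(ok == floor(ok)) && all(abs(ok) <= .Machine$integer.max)) {
    x <- as.integer(x)
  }
  x
}

dat <- data.frame(a = c(1, 2, Inf), b = c(0.5, 1e25, 3), s = c("u", "v", "w"))
dat[] <- lapply(dat, tidy_numeric)  # dat[] keeps the data frame structure
str(dat)
```

Here column a ends up as an integer vector with its Inf replaced by NA, while column b stays double because 0.5 has a fractional part.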
upData is a function facilitating the updating of a data frame without
attaching it in search position one. New variables can be added, old
variables can be modified, variables can be removed or renamed, and
"labels" and "units" attributes can be provided. Observations can be
subsetted. Various checks are made for errors and inconsistencies, with
warnings issued to help the user. Levels of factor variables can be
replaced, especially using the list notation of the standard
merge.levels function. Unless force.single is set to FALSE, upData also
converts double precision vectors to integer if no fractional values
are present in a vector. upData is also used to process R workspace
objects created by StatTransfer, which puts variable and value labels
as attributes on the data frame rather than on each variable. If such
attributes are present, they are used to define all the labels and
value labels (through conversion to factor variables) before any label
changes take place, and force.single is set to a default of FALSE, as
StatTransfer already does conversion to integer. Variables having
labels but not classed "labelled" (e.g., data imported using the haven
package) have that class added to them by upData.
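The last rule above, adding the "labelled" class to variables that carry a label attribute but lack the class, can be sketched in base R as follows. This is an illustration under stated assumptions rather than the upData implementation: the helper name add_labelled_class is hypothetical, and the label is assumed to be stored in an attribute named "label".

```r
# Hypothetical helper illustrating the rule; not the Hmisc code.
add_labelled_class <- function(df) {
  for (nm in names(df)) {
    v <- df[[nm]]
    # Act only on variables that have a label but are not yet "labelled".
    if (!is.null(attr(v, "label")) && !inherits(v, "labelled")) {
      class(v) <- c("labelled", class(v))
      df[[nm]] <- v
    }
  }
  df
}

df <- data.frame(age = c(34, 51), sex = c("f", "m"))
attr(df$age, "label") <- "Age in years"  # mimic a label attached on import
df <- add_labelled_class(df)
class(df$age)
```

Only age gains the class; sex has no label attribute and is left untouched.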
The dataframeReduce function removes variables from a data frame
that are problematic for certain analyses. Variables can be removed
because the fraction of missing values exceeds a threshold, because they
are character or categorical variables having too many levels, or
because they are binary and have too small a prevalence in one of the
two values. Categorical variables can also have their levels combined
when a level is of low prevalence.
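Two of the removal criteria described above (too large a fraction of missing values, too many levels) can be sketched in base R. This is a simplified illustration, not the dataframeReduce implementation: the helper name drop_problem_vars is hypothetical, its arguments merely echo fracmiss and maxlevels, and the prevalence and level-combining rules are omitted.

```r
# Hypothetical helper; only the missing-fraction and level-count rules are sketched.
drop_problem_vars <- function(data, fracmiss = 1, maxlevels = NULL) {
  keep <- vapply(data, function(v) {
    # Drop when the proportion of NAs exceeds the threshold.
    too_missing <- mean(is.na(v)) > fracmiss
    # Drop character/factor variables with more distinct values than allowed.
    too_many <- !is.null(maxlevels) &&
      (is.character(v) || is.factor(v)) &&
      length(unique(v[!is.na(v)])) > maxlevels
    !(too_missing || too_many)
  }, logical(1))
  data[, keep, drop = FALSE]
}

d <- data.frame(
  x = c(1, NA, NA, NA),       # 75% missing
  g = c("a", "b", "c", "d"),  # four distinct values
  y = c(2, 4, 6, 8)
)
reduced <- drop_problem_vars(d, fracmiss = 0.5, maxlevels = 3)
names(reduced)
```

With these thresholds x is removed for missingness and g for having too many levels, leaving only y.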
cleanup.import(obj, labels, lowernames=FALSE,
               force.single=TRUE, force.numeric=TRUE, rmnames=TRUE,
               big=1e20, sasdict, print, datevars=NULL, datetimevars=NULL,
               dateformat='%F', fixdates=c('none','year'),
               autodate=FALSE, autonum=FALSE, fracnn=0.3,
               considerNA=NULL, charfactor=FALSE)

upData(object, ..., subset, rename, drop, keep, labels, units, levels,
       force.single=TRUE, lowernames=FALSE, caplabels=FALSE,
       moveUnits=FALSE, charfactor=FALSE, print=TRUE, html=FALSE)

dataframeReduce(data, fracmiss=1, maxlevels=NULL, minprev=0, print=TRUE)
obj |
a data frame or list |
object |
a data frame or list |
data |
a data frame |
force.single |
By default, double precision variables are converted to single precision
(in S-Plus only) unless |
force.numeric |
Sometimes importing will cause a numeric variable to be
changed to a factor vector. By default, |
rmnames |
set to FALSE to prevent cleanup.import from removing names or .Names attributes from variables |
labels |
a character vector the same length as the number of variables in
|
lowernames |
set this to |
big |
a value such that values larger than this in absolute value are set to
missing by |
sasdict |
the name of a data frame containing a raw imported SAS PROC CONTENTS CNTLOUT= dataset. This is used to define variable names and to add attributes to the new data frame specifying the original SAS dataset name and label. |
print |
set to |
datevars |
character vector of names (after |
datetimevars |
character vector of names (after |
dateformat |
for |
fixdates |
for any of the variables listed in |
autodate |
set to |
autonum |
set to |
fracnn |
see |
considerNA |
for |
charfactor |
set to |
... |
for |
subset |
an expression that evaluates to a logical vector
specifying which rows of |
rename |
list or named vector specifying old and new names for variables. Variables are
renamed before any other operations are done. For example, to rename
variables |
drop |
a vector of variable names to remove from the data frame |
keep |
a vector of variable names to keep, with all other variables dropped |
units |
a named vector or list defining |
levels |
a named list defining |
caplabels |
set to |
moveUnits |
set to |
html |
set to |
fracmiss |
the maximum permissible proportion of |
maxlevels |
the maximum number of levels of a character or categorical or factor variable before the variable is dropped |
minprev |
the minimum proportion of non-missing observations in a category for a binary variable to be retained, and the minimum relative frequency of a category before it will be combined with other small categories |
a new data frame
Frank Harrell, Vanderbilt University
## Not run:
dat <- read.table('myfile.asc')
dat <- cleanup.import(dat)
## End(Not run)

dat <- data.frame(a=1:3, d=c('01/02/2004',' 1/3/04',''))
cleanup.import(dat, datevars='d', dateformat='%m/%d/%y', fixdates='year')

dat <- data.frame(a=(1:3)/7, y=c('a','b1','b2'), z=1:3)
dat2 <- upData(dat, x=x^2, x=x-5, m=x/10,
               rename=c(a='x'), drop='z',
               labels=c(x='X', y='test'),
               levels=list(y=list(a='a',b=c('b1','b2'))))
dat2
describe(dat2)
dat <- dat2    # copy to original name and delete dat2 if OK
rm(dat2)
dat3 <- upData(dat, X=X^2, subset = x < (3/7)^2 - 5, rename=c(x='X'))

# Remove hard to analyze variables from a redundancy analysis of all
# variables in the data frame
d <- dataframeReduce(dat, fracmiss=.1, minprev=.05, maxlevels=5)
# Could run redun(~., data=d) at this point or include dataframeReduce
# arguments in the call to redun

# If you import a SAS dataset created by PROC CONTENTS CNTLOUT=x.datadict,
# the LABELs from this dataset can be added to the data. Let's also
# convert names to lower case for the main data file
## Not run:
mydata2 <- cleanup.import(mydata2, lowernames=TRUE, sasdict=datadict)
## End(Not run)