Convert a SAS Dataset to an S Data Frame
Converts a SAS dataset into an S data frame. You may choose to extract only a subset of variables or a subset of observations in the SAS dataset. You may have the function automatically convert PROC FORMAT-coded variables to factor objects. The original SAS codes are stored in an attribute called sas.codes, and these may be added back to the levels of a factor variable using the code.levels function.
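To make the relationship between factor levels and the stored codes concrete, here is a minimal base-R sketch. It is not the Hmisc implementation: it assumes the sas.codes attribute holds one code per level, in level order, and the helper name add_codes and the toy data are hypothetical.

```r
# Hypothetical stand-in for a PROC FORMAT-coded variable returned by sas.get
x <- factor(c("good", "best", "better", "good"),
            levels = c("good", "better", "best"))
attr(x, "sas.codes") <- c(1, 2, 3)  # assumed layout: one code per level

# Prepend each original code to its level, in the style "1:good"
add_codes <- function(f) {
  codes <- attr(f, "sas.codes")
  levels(f) <- paste(codes, levels(f), sep = ":")
  f
}

y <- add_codes(x)
levels(y)  # "1:good" "2:better" "3:best"
```

With real data, code.levels would normally be called instead of a hand-written helper.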
Information about special missing values may be captured in an attribute of each variable having special missing values. This attribute is called special.miss, and such variables are given class special.miss. There are print, [], format, and is.special.miss methods for such variables.
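The idea of keeping special missing codes alongside ordinary NA values can be sketched in base R as follows. The layout of the bookkeeping list (codes and obs) and the helper is_special are assumptions made for illustration only; they are not taken from the special.miss attribute as actually stored by sas.get.

```r
# Toy numeric variable: in S, every SAS missing value arrives as NA
x <- c(3, NA, 5, NA, NA)

# Hypothetical bookkeeping: observations 2 and 5 carried special codes
# .D and .R; observation 4 was an ordinary missing value
sm <- list(codes = c("D", "R"), obs = c(2L, 5L))

# TRUE where an observation had a special missing value
# (optionally restricted to one code)
is_special <- function(x, sm, code = NULL) {
  hit <- if (is.null(code)) sm$obs else sm$obs[sm$codes == code]
  seq_along(x) %in% hit
}

is_special(x, sm)       # FALSE TRUE FALSE FALSE TRUE
is_special(x, sm, "D")  # FALSE TRUE FALSE FALSE FALSE
```

On a variable of class special.miss, is.special.miss(x, code) plays the role of the hypothetical helper.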
The chron function is used to set up date, time, and date-time variables. If using S-Plus 5 or later, the timeDate function is used instead. Under R, Dates is used for dates and chron for date-times. Times without dates still need to be stored in POSIX date-time format. Such SAS time variables are given a major class of POSIXt and a format.POSIXt function so that the date portion (which will always be 1/1/1970) will not print by default.
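The storage conventions described above can be illustrated with base R alone. This sketch relies on two general facts about SAS (dates count days since 1 January 1960, times count seconds since midnight); the choice of the GMT time zone and the variable names are assumptions for the example, not a description of what sas.get does internally.

```r
# A SAS date value is a day count from 1960-01-01
sas_date <- as.numeric(as.Date("2002-03-03") - as.Date("1960-01-01"))
d1 <- as.Date(sas_date, origin = "1960-01-01")

# A SAS time value is seconds since midnight; stored as a POSIX
# date-time it necessarily falls on 1970-01-01
sas_time <- 11 * 3600 + 13 * 60 + 45
t1 <- as.POSIXct(sas_time, origin = "1970-01-01", tz = "GMT")

format(d1)              # "2002-03-03"
format(t1, "%Y-%m-%d")  # "1970-01-01"
format(t1, "%H:%M:%S")  # "11:13:45"
```

Printing only the time-of-day part, as in the last line, is the effect the format.POSIXt method mentioned above is meant to achieve.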
If a date variable represents a partial date (0.5 added if month missing, 0.25 added if day missing, 0.75 if both), an attribute partial.date is added to the variable, and the variable also becomes a class imputed variable.
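The fractional encoding of partial dates can be decoded with a few lines of base R. This is a sketch under stated assumptions: the raw values are taken to be SAS day counts (origin 1 January 1960) with the fractions listed above, and the names d, frac, and status are hypothetical.

```r
# Toy raw date values: one complete date and three partial dates
d <- c(15402, 15402.25, 15402.5, 15402.75)

# The fractional part says which components were missing
frac <- d - floor(d)
meaning <- c("complete", "day missing", "month missing",
             "month and day missing")
status <- meaning[match(frac, c(0, 0.25, 0.5, 0.75))]

# The integer part is the (possibly imputed) date itself
dates <- as.Date(floor(d), origin = "1960-01-01")
data.frame(date = dates, status = status)
```

The fractions 0.25, 0.5, and 0.75 are exactly representable in binary floating point, so the exact comparison in match() is safe here.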
The describe function uses information about partial dates and special missing values.
There is an option to automatically uncompress (or gunzip) compressed SAS datasets.
sas.get(libraryName, member, variables=character(0), ifs=character(0),
        format.library=libraryName, id,
        dates.=c("sas","yymmdd","yearfrac","yearfrac2"),
        keep.log=TRUE, log.file="_temp_.log", macro=sas.get.macro,
        data.frame.out=existsFunction("data.frame"), clean.up=FALSE,
        quiet=FALSE, temp=tempfile("SaS"), formats=TRUE,
        recode=formats, special.miss=FALSE, sasprog="sas",
        as.is=.5, check.unique.id=TRUE, force.single=FALSE,
        pos, uncompress=FALSE, defaultencoding="latin1")

is.special.miss(x, code)

## S3 method for class 'special.miss'
x[..., drop=FALSE]

## S3 method for class 'special.miss'
print(x, ...)

## S3 method for class 'special.miss'
format(x, ...)

sas.codes(object)

code.levels(object)
libraryName |
character string naming the directory in which the dataset is kept. |
drop |
logical. If |
member |
character string giving the second part of the two part SAS dataset name. (The first part is irrelevant here - it is mapped to the UNIX directory name.) |
x |
a variable that may have been created by |
variables |
vector of character strings naming the variables in the SAS dataset.
The S dataset will contain only those variables from the
SAS dataset.
To get all of the variables (the default), an empty string may be given.
It is a fatal error if any one of the variables is not
in the SAS dataset. You can use |
ifs |
a vector of character strings, each containing one SAS “subsetting if” statement. These will be used to extract a subset of the observations in the SAS dataset. |
format.library |
The UNIX directory containing the file ‘formats.sct’, which contains the definitions of the user defined formats used in this dataset. By default, we look for the formats in the same directory as the data. The user defined formats must be available (so SAS can read the data). |
formats |
Set |
recode |
This parameter defaults to |
special.miss |
For numeric variables, any missing values are stored as NA in S.
You can recover special missing values by setting |
id |
The name of the variable to be used as the row names of the S dataset.
The id variable becomes the |
dates. |
specifies the format for storing SAS dates in the resulting data frame |
as.is |
If |
check.unique.id |
If B23 . |
force.single |
By default, SAS numeric variables having LENGTH > 4 are stored as S double precision numerics, which allow for the same precision as a SAS LENGTH 8 variable. Set LENGTH statement. R does not have single precision, so no attempt is made to convert to single if running R. |
dates |
One of the character strings YYMMDD (year%%100, month, day).
Note that R will store these as numbers, not as
character strings. If |
keep.log |
logical flag: if |
log.file |
the name of the SAS log file. |
macro |
the name of an S object in the current search path that contains the text of
the SAS macro called by R. The R object is a character vector that
can be edited using for example |
data.frame.out |
logical flag: if |
clean.up |
logical flag: if |
quiet |
logical flag: if |
temp |
the prefix to use for the temporary files. Two characters will be added to this; the resulting name must fit on your file system. |
sasprog |
the name of the system command to invoke SAS |
uncompress |
set to |
pos |
by default, a list or data frame which contains all the variables is returned.
If you specify |
code |
a special missing value code (A through Z or _) to check
against. If |
defaultencoding |
encoding to assume if the SAS dataset does not specify one. Defaults to "latin1". |
object |
a variable in a data frame created by |
... |
ignored |
If you specify special.miss = TRUE and there are no special missing values in the SAS dataset, the SAS step will bomb.
For variables having a PROC FORMAT VALUE format with some of the levels undefined, sas.get will interpret those values as NA if you are using recode.
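The effect described above mirrors how base R's factor() treats values that are not among the declared levels. The following sketch uses hypothetical codes and labels and does not call sas.get; it only shows why an undefined level ends up as NA.

```r
# Raw codes as they might appear in the SAS dataset; 9 has no
# entry in the (hypothetical) VALUE format
codes  <- c(1, 2, 9, 1)
values <- c(1, 2)
labs   <- c("good", "better")

# Recoding through the format: codes outside 'values' become NA
q <- factor(codes, levels = values, labels = labs)
is.na(q)  # FALSE FALSE TRUE FALSE
```

If such codes must be preserved, one option is to define a label for every value in the SAS format before extracting the data.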
The SAS macro ‘sas_get’ uses record lengths of up to 4096 in two places. If you are exporting records that are very long (because of a large number of variables and/or long character variables), you may want to edit these LRECLs to quadruple them, for example.
If data.frame.out is TRUE, the output will be a data frame resembling the SAS dataset. If id was specified, that column of the data frame will be used as the row names of the data frame. Each variable in the data frame or vector in the list will have the attributes label and format containing SAS labels and formats. Underscores in formats are converted to periods. Formats for character variables have $ placed in front of their names.
If formats is TRUE and there are any appropriate format definitions in format.library, the returned object will have attribute formats containing lists named the same as the format names (with periods substituted for underscores and character formats prefixed by $). Each of these lists has a vector called values and one called labels with the PROC FORMAT; VALUE ... definitions.
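As a rough picture of the structure described above, the following base-R sketch builds a list shaped like the formats attribute and looks up labels in it. The format names, values, and labels are invented for the example, and the helper s_name is hypothetical; only the naming rules (periods for underscores, a leading $ for character formats) and the values/labels layout come from the text.

```r
# Apply the naming rules: "_" becomes ".", character formats get "$"
s_name <- function(name, character = FALSE) {
  out <- gsub("_", ".", name, fixed = TRUE)
  if (character) paste0("$", out) else out
}

# A list shaped like the 'formats' attribute of a returned object
fmts <- list()
fmts[[s_name("qual_fmt")]] <-
  list(values = 1:3, labels = c("good", "better", "best"))
fmts[[s_name("sex_fmt", character = TRUE)]] <-
  list(values = c("F", "M"), labels = c("Female", "Male"))

names(fmts)  # "qual.fmt" "$sex.fmt"

# Translate raw codes to labels using one format definition
f <- fmts[["qual.fmt"]]
f$labels[match(c(3, 1), f$values)]  # "best" "good"
```

On a real result, the corresponding list would be obtained with attr(d, "formats"), where d is the object returned by sas.get.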
If data.frame.out is FALSE, the output will be a list of vectors, each containing a variable from the SAS dataset. If id was specified, that element of the list will be used as the id attribute of the entire list.
If a SAS error occurs and quiet is FALSE, then the SAS log file will be printed under the control of the less pager.
The references cited below explain the structure of SAS datasets and how they are stored under UNIX. See SAS Language for a discussion of the “subsetting if” statement.
You must be able to run SAS (by typing sas) on your system. If the S command !sas does not start SAS, then this function cannot work.
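Under R, one way to check this requirement before calling sas.get is to ask whether the command can be found on the search path. This is a small sketch using base R's Sys.which; the variable names are hypothetical, and the command name should match whatever is passed as sasprog.

```r
# Command used to invoke SAS (the default value of 'sasprog')
sasprog <- "sas"

# Sys.which returns "" when the command is not found on the PATH
sas_path <- Sys.which(sasprog)
sas_available <- unname(nzchar(sas_path))

if (!sas_available) {
  message("'", sasprog, "' was not found on the PATH; sas.get cannot run SAS")
}
```

Finding the executable does not guarantee that SAS is licensed or starts correctly, so a failed run should still be diagnosed from the SAS log.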
If you are reading time or date-time variables, you will need to execute the command library(chron) to print those variables or the data frame if the timeDate function is not available.
Terry Therneau, Mayo Clinic
Frank Harrell, Vanderbilt University
Bill Dunlap, University of Washington and Insightful Corporation
Michael W. Kattan, Cleveland Clinic Foundation
Reinhold Koch (encoding)
SAS Institute Inc. (1990). SAS Language: Reference, Version 6. First Edition. SAS Institute Inc., Cary, North Carolina.
SAS Institute Inc. (1988). SAS Technical Report P-176, Using the SAS System, Release 6.03, under UNIX Operating Systems and Derivatives. SAS Institute Inc., Cary, North Carolina.
SAS Institute Inc. (1985). SAS Introductory Guide. Third Edition. SAS Institute Inc., Cary, North Carolina.
## Not run:
sas.contents("saslib", "mice")
# [1] "dose"   "ld50"   "strain" "lab_no"
# attr(, "n"):
# [1] 117
mice <- sas.get("saslib", mem="mice", var=c("dose", "strain", "ld50"))
plot(mice$dose, mice$ld50)

nude.mice <- sas.get(lib=unix("echo $HOME/saslib"), mem="mice",
                     ifs="if strain='nude'")

nude.mice.dl <- sas.get(lib=unix("echo $HOME/saslib"), mem="mice",
                        var=c("dose", "ld50"), ifs="if strain='nude'")

# Get a dataset from current directory, recode PROC FORMAT; VALUE ...
# variables into factors with labels of the form "good(1)" "better(2)",
# get special missing values, recode missing codes .D and .R into new
# factor levels "Don't know" and "Refused to answer" for variable q1
d <- sas.get(".", "mydata", recode=2, special.miss=TRUE)
attach(d)
nl <- length(levels(q1))
lev <- c(levels(q1), "Don't know", "Refused")
q1.new <- as.integer(q1)
q1.new[is.special.miss(q1,"D")] <- nl+1
q1.new[is.special.miss(q1,"R")] <- nl+2
q1.new <- factor(q1.new, 1:(nl+2), lev)
# Note: would like to use factor() in place of as.integer ... but
# factor in this case adds "NA" as a category level

d <- sas.get(".", "mydata")
sas.codes(d$x)    # for PROC FORMATted variables returns original data codes
d$x <- code.levels(d$x)   # or attach(d); x <- code.levels(x)
# This makes levels such as "good" "better" "best" into e.g.
# "1:good" "2:better" "3:best", if the original SAS values were 1,2,3

# Retrieve the same variables from another dataset (or an update of
# the original dataset); libraryName and member are both required
mydata2 <- sas.get(".", "mydata2", var=names(d))
# This only works if none of the original SAS variable names contained _
mydata2 <- cleanup.import(mydata2)  # will make true integer variables

# Code from Don MacQueen to generate SAS dataset to test import of
# date, time, date-time variables
# data ssd.test;
#     d1='3mar2002'd ;
#     dt1='3mar2002 9:31:02'dt;
#     t1='11:13:45't;
#     output;
#
#     d1='3jun2002'd ;
#     dt1='3jun2002 9:42:07'dt;
#     t1='11:14:13't;
#     output;
#     format d1 mmddyy10. dt1 datetime. t1 time.;
# run;
## End(Not run)