Selection By Filtering (SBF)
Model fitting after applying univariate filters
sbf(x, ...)

## Default S3 method:
sbf(x, y, sbfControl = sbfControl(), ...)

## S3 method for class 'formula'
sbf(form, data, ..., subset, na.action, contrasts = NULL)

## S3 method for class 'recipe'
sbf(x, data, sbfControl = sbfControl(), ...)

## S3 method for class 'sbf'
predict(object, newdata = NULL, ...)
x |
a data frame containing training data where samples are in rows and
features are in columns. For the recipes method, |
... |
for |
y |
a numeric or factor vector containing the outcome for each sample. |
sbfControl |
a list of values that define how this function acts. See
|
form |
A formula of the form |
data |
Data frame from which variables specified in |
subset |
An index vector specifying the cases to be used in the training sample. (NOTE: If given, this argument must be named.) |
na.action |
A function to specify the action to be taken if NAs are found. The default action is for the procedure to fail. An alternative is na.omit, which leads to rejection of cases with missing values on any required variable. (NOTE: If given, this argument must be named.) |
contrasts |
a list of contrasts to be used for some or all the factors appearing as variables in the model formula. |
object |
an object of class sbf |
newdata |
a matrix or data frame of predictors. The object must have non-null column names.
More details on this function can be found at http://topepo.github.io/caret/feature-selection-using-univariate-filters.html.
This function can be used to get resampling estimates for models when simple, filter-based feature selection is applied to the training data.
For each iteration of resampling, the predictor variables are univariately filtered prior to modeling. Performance of this approach is estimated using resampling. The same filter and model are then applied to the entire training set and the final model (and final features) are saved.
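The resampling scheme described above can be sketched in base R. This is only an illustration of the idea, not caret's implementation: the simulated data, the correlation-test filter, the 0.05 cutoff, and the helper names `univ_filter` and `fit_filtered` are all assumptions made for the example.

```r
## Hypothetical illustration of filtering inside resampling (not caret's code)
set.seed(42)
n <- 120
p <- 20
x <- as.data.frame(matrix(rnorm(n * p), nrow = n))
names(x) <- paste0("V", seq_len(p))
y <- 2 * x$V1 - 1.5 * x$V2 + rnorm(n)   # only V1 and V2 carry signal

## Assumed univariate filter: keep predictors whose correlation test
## against the outcome has a p-value below alpha
univ_filter <- function(x, y, alpha = 0.05) {
  pvals <- vapply(x, function(col) cor.test(col, y)$p.value, numeric(1))
  names(pvals)[pvals < alpha]
}

## Fit a linear model on the retained predictors
## (falls back to an intercept-only model if nothing survives the filter)
fit_filtered <- function(x, y, keep) {
  dat <- cbind(x[, keep, drop = FALSE], .outcome = y)
  if (length(keep) == 0) {
    return(lm(.outcome ~ 1, data = dat))
  }
  lm(.outcome ~ ., data = dat)
}

## 5-fold cross-validation: the filter is re-run on each training subset,
## so the held-out samples never influence which predictors are kept
folds <- sample(rep(1:5, length.out = n))
rmse <- numeric(5)
kept <- vector("list", 5)
for (k in 1:5) {
  train <- folds != k
  kept[[k]] <- univ_filter(x[train, , drop = FALSE], y[train])
  fit <- fit_filtered(x[train, , drop = FALSE], y[train], kept[[k]])
  pred <- predict(fit, newdata = x[!train, , drop = FALSE])
  rmse[k] <- sqrt(mean((y[!train] - pred)^2))
}
resampled_rmse <- mean(rmse)

## Finally, the same filter and model are applied to the full training set
final_vars <- univ_filter(x, y)
final_fit <- fit_filtered(x, y, final_vars)
```

Here `kept` plays the role of the per-resample variable lists and `final_vars` the role of the variables retained on the whole training set. Running the filter inside the loop, rather than once up front, is what keeps the resampled performance estimate honest about the selection step.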
sbf can be used with "explicit parallelism", where different resamples (e.g. cross-validation groups) can be split up and run on multiple machines or processors. By default, sbf will use a single processor on the host machine. As of version 4.99 of this package, parallel processing is built on the foreach package. To run the resamples in parallel, the code for sbf does not change; prior to the call to sbf, a parallel backend is registered with foreach (see the examples below).
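As a minimal sketch of registering a backend, the following assumes the doParallel package is installed; it is one of several foreach backends and is used here only for illustration. The choice of two workers is arbitrary.

```r
## Assumes the doParallel package is installed (it attaches foreach and parallel)
library(doParallel)

cl <- makePSOCKcluster(2)   # two worker processes; adjust to the machine
registerDoParallel(cl)

## Confirm how many workers foreach will now use
workers <- getDoParWorkers()

## ... call sbf() here exactly as in the sequential case ...

stopCluster(cl)
registerDoSEQ()             # return foreach to sequential execution
```

Stopping the cluster and re-registering the sequential backend afterwards avoids leaving idle worker processes behind.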
The modeling and filtering techniques are specified in sbfControl. Example functions are given in lmSBF.
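To illustrate how a filter might be customized, the sketch below defines a scoring function and a filtering function in base R. The names `corScore` and `corFilter` are hypothetical, and the commented lines assume that lmSBF is a list with `score` and `filter` elements taking these arguments, as described in the caret documentation for the helper function lists; check that documentation before relying on it.

```r
## Hypothetical score: p-value from a correlation test between a single
## predictor (x) and the outcome (y)
corScore <- function(x, y) cor.test(x, y)$p.value

## Hypothetical filter: retain predictors whose score is below 0.05
corFilter <- function(score, x, y) score < 0.05

## Standalone check on simulated data
set.seed(1)
signal <- rnorm(100)
outcome <- signal + rnorm(100, sd = 0.5)
keep_signal <- corFilter(corScore(signal, outcome))

## With caret loaded, the functions could be swapped into a copy of lmSBF
## (assumed element names, see the lead-in):
## customSBF <- lmSBF
## customSBF$score <- corScore
## customSBF$filter <- corFilter
## ctrl <- sbfControl(functions = customSBF, method = "cv")
```

Copying lmSBF and replacing only the elements of interest leaves the model fitting and prediction functions untouched.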
For sbf, an object of class sbf with elements:
pred |
if |
variables |
a list of variable names that survived the filter at each resampling iteration |
results |
a data frame of results aggregated over the resamples |
fit |
the final model fit with only the filtered variables |
optVariables |
the names of the variables that survived the filter using the training set |
call |
the function call |
control |
the control object |
resample |
if
|
metrics |
a character vector of names of the performance measures |
dots |
a list of optional arguments that were passed in |
For predict.sbf, a vector of predictions.
Max Kuhn
## Not run:
library(caret)
data(BloodBrain)

## Use a GAM as the filter, then fit a random forest model
RFwithGAM <- sbf(bbbDescr, logBBB,
                 sbfControl = sbfControl(functions = rfSBF,
                                         verbose = FALSE,
                                         method = "cv"))
RFwithGAM

predict(RFwithGAM, bbbDescr[1:10,])

## classification example with parallel processing

## library(doMC)

## Note: if the underlying model also uses foreach, the
## number of cores specified below will double (along with
## the memory requirements)
## registerDoMC(cores = 2)

data(mdrr)
mdrrDescr <- mdrrDescr[,-nearZeroVar(mdrrDescr)]
mdrrDescr <- mdrrDescr[, -findCorrelation(cor(mdrrDescr), .8)]

set.seed(1)
filteredNB <- sbf(mdrrDescr, mdrrClass,
                  sbfControl = sbfControl(functions = nbSBF,
                                          verbose = FALSE,
                                          method = "repeatedcv",
                                          repeats = 5))
confusionMatrix(filteredNB)

## End(Not run)