Nearest centroid predictor
Nearest centroid predictor for binary (i.e., two-outcome) data. Implements a range of options and improvements, such as accounting for within-class heterogeneity using sample networks and several methods of feature selection and weighting.
nearestCentroidPredictor(
    # Input training and test data
    x, y,
    xtest = NULL,

    # Feature weights and selection criteria
    featureSignificance = NULL,
    assocFnc = "cor", assocOptions = "use = 'p'",
    assocCut.hi = NULL, assocCut.lo = NULL,
    nFeatures.hi = 10, nFeatures.lo = 10,
    weighFeaturesByAssociation = 0,
    scaleFeatureMean = TRUE, scaleFeatureVar = TRUE,

    # Predictor options
    centroidMethod = c("mean", "eigensample"),
    simFnc = "cor", simOptions = "use = 'p'",
    useQuantile = NULL,
    sampleWeights = NULL,
    weighSimByPrediction = 0,

    # What should be returned
    CVfold = 0, returnFactor = FALSE,

    # General options
    randomSeed = 12345,
    verbose = 2, indent = 0)
x |
Training features (predictive variables). Each column corresponds to a feature and each row to an observation. |
y |
The response variable. Can be a single vector or a matrix with arbitrarily many columns. The number of rows (observations) must equal the number of rows (observations) in x. |
xtest |
Optional test set data. A matrix of the same number of columns (i.e., features) as |
featureSignificance |
Optional vector of feature significance for the response variable. If given, it is used for feature selection (see details). Should preferably be signed, that is features can have high negative significance. |
assocFnc |
Character string specifying the association function. The association function should behave roughly as
|
assocOptions |
Character string specifying options to the association function. |
assocCut.hi |
Association (or featureSignificance) threshold for including features in the predictor. Features with
association higher than |
assocCut.lo |
Association (or featureSignificance) threshold for including features in the predictor. Features with
association lower than |
nFeatures.hi |
Number of highest-associated features (or features with highest |
nFeatures.lo |
Number of lowest-associated features (or features with highest |
weighFeaturesByAssociation |
(Optional) power to downweigh features that are less associated with the response. See details. |
scaleFeatureMean |
Logical: should the training features be scaled to mean zero? Unless there are good reasons not to scale, the features should be scaled. |
scaleFeatureVar |
Logical: should the training features be scaled to unit variance? Again, unless there are good reasons not to scale, the features should be scaled. |
centroidMethod |
One of |
simFnc |
Character string giving the similarity function for measuring the similarity between test samples and
centroids. This function should
behave roughly like the function |
simOptions |
Character string specifying the options to the similarity function. |
useQuantile |
If non-NULL, the "nearest quantiloid" will be used instead of the nearest centroid. See details. |
sampleWeights |
Optional specification of sample weights. Useful for example if one wants to explore boosting. |
weighSimByPrediction |
(Optional) power to downweigh features that are not well predicted between training and test sets. See details. |
CVfold |
Non-negative integer specifying cross-validation. Zero means no cross-validation will be performed. Values above zero specify the number of samples to be considered test data for each step of cross-validation. |
returnFactor |
Logical: should a factor be returned? |
randomSeed |
Integer specifying the seed for the random number generator. If |
verbose |
Integer controlling how verbose the diagnostic messages should be. Zero means silent. |
indent |
Indentation for the diagnostic messages. Zero means no indentation, each unit adds two spaces. |
The nearest centroid predictor works by forming a representative profile (centroid) across features for each class from the training data, then assigning each test sample to the class of the nearest representative profile. The representative profile can be formed either as the mean or as the first principal component ("eigensample"); this choice is governed by the option centroidMethod.
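The basic idea can be illustrated with a minimal base R sketch. It is not the function's actual implementation: it assumes mean centroids and Pearson correlation as the similarity (mirroring the defaults centroidMethod = "mean" and simFnc = "cor"), and the toy data are made up for illustration.

```r
# Toy data (hypothetical): 6 training samples (rows) x 4 features (columns).
x <- rbind(
  c(1, 2, 3, 4),
  c(2, 3, 4, 6),
  c(1, 3, 3, 5),
  c(4, 3, 2, 1),
  c(6, 4, 3, 2),
  c(5, 3, 3, 1)
)
y <- c("A", "A", "A", "B", "B", "B")
xtest <- rbind(c(0, 1, 2, 5), c(9, 5, 4, 0))

# One centroid per class: the mean profile across that class's samples.
# The result is a features x classes matrix.
classes <- sort(unique(y))
centroids <- sapply(classes, function(cl) colMeans(x[y == cl, , drop = FALSE]))

# Similarity of each test sample to each centroid: correlation across features.
# Rows of 'sim' are test samples, columns are classes.
sim <- cor(t(xtest), centroids, use = "p")

# Assign each test sample to the class whose centroid is most similar.
predictedTest <- classes[apply(sim, 1, which.max)]
print(predictedTest)
```

Here the first test sample follows the increasing profile of class "A" and the second the decreasing profile of class "B", so they are assigned accordingly.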
When the number of features is large and only a small fraction is likely to be associated with the outcome, feature selection can be used to restrict the features that actually enter the centroid. Feature selection can be based either on the features' association with the outcome, calculated from the training data using assocFnc, or on user-supplied feature significance (e.g., derived from literature; argument featureSignificance). In either case, features can be selected by high and low association thresholds or by taking a fixed number of highest- and lowest-associated features.
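The two selection modes can be sketched in base R as follows. This is only an illustration under stated assumptions: the association is taken to be the Pearson correlation of each feature with a 0/1 outcome, the data are made up, and the variable names nHi, nLo, cutHi and cutLo are hypothetical stand-ins for nFeatures.hi, nFeatures.lo, assocCut.hi and assocCut.lo. Whether the function itself uses strict inequalities is not stated in this page.

```r
# Toy data (hypothetical): 6 samples, 4 features, binary outcome.
y <- c(0, 0, 0, 1, 1, 1)
x <- cbind(
  c(1, 2, 1, 5, 6, 5),   # feature 1: higher in class 1
  c(3, 1, 2, 2, 3, 1),   # feature 2: unrelated to the outcome
  c(6, 5, 6, 1, 2, 1),   # feature 3: lower in class 1
  c(1, 3, 2, 3, 1, 3)    # feature 4: weakly related to the outcome
)

# Signed association of each feature with the outcome.
assoc <- as.vector(cor(x, y, use = "p"))

# Mode 1: fixed numbers of highest- and lowest-associated features.
nHi <- 1
nLo <- 1
ord <- order(assoc, decreasing = TRUE)
selectedByCount <- sort(c(ord[seq_len(nHi)], rev(ord)[seq_len(nLo)]))

# Mode 2: thresholds on the association.
cutHi <- 0.5
cutLo <- -0.5
selectedByCut <- which(assoc > cutHi | assoc < cutLo)

print(selectedByCount)
print(selectedByCut)
```

In this toy example both modes pick features 1 and 3, the only ones strongly (positively or negatively) associated with the outcome.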
As an alternative to centroids, the predictor can also assign test samples based on a given quantile of the distances from the training samples in each class (argument useQuantile). This may be advantageous if the samples in each class form irregular clusters. Note that setting useQuantile=0 (i.e., using minimum distance in each class) essentially gives a nearest neighbor predictor: each test sample will be assigned to the class of its nearest training neighbor.
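A small base R sketch of the quantile idea follows. It is an assumption-laden illustration rather than the function's own code: it uses Euclidean distance (the function works with a configurable similarity), and quantilePredict is a hypothetical helper written for this example.

```r
# Toy data (hypothetical): class "A" is a tight cluster plus one distant sample.
x <- rbind(
  c(0, 0), c(0, 1), c(1, 0), c(5, 5),   # class "A"
  c(4, 4), c(4, 6), c(6, 4), c(6, 6)    # class "B"
)
y <- rep(c("A", "B"), each = 4)
xtest <- rbind(c(4.8, 4.8))

# Hypothetical helper: for each test sample, take the given quantile of its
# distances to the training samples of each class and pick the class for
# which that quantile is smallest.
quantilePredict <- function(x, y, xtest, useQuantile) {
  classes <- sort(unique(y))
  apply(xtest, 1, function(s) {
    d <- sqrt(rowSums(sweep(x, 2, s)^2))   # Euclidean distances to all training samples
    q <- sapply(classes, function(cl)
      quantile(d[y == cl], probs = useQuantile, names = FALSE))
    classes[which.min(q)]
  })
}

nearestNeighbor <- quantilePredict(x, y, xtest, useQuantile = 0)
medianBased <- quantilePredict(x, y, xtest, useQuantile = 0.5)
print(c(nearestNeighbor, medianBased))
```

With useQuantile = 0 the test sample is assigned to "A" because its single nearest training neighbor, (5, 5), belongs to that class. With useQuantile = 0.5 the typical distance to class "B" is smaller, so the sample is assigned to "B".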
If features exhibit non-trivial correlations among themselves (as, for example, in gene expression data), one can attempt to down-weigh features that do not exhibit the same correlation in the test set. This is done by using essentially the same predictor to predict _features_ from all other features in the test data (using the training data to train the feature predictor). Because test features are known, the prediction accuracy can be evaluated. If a feature is predicted badly (meaning the error in the test set is much larger than the error in the cross-validation prediction in the training data), its quality in the training or test data may be low (for example, due to excessive noise or outliers). Such features can be downweighed using the argument weighSimByPrediction. The extra factor is min(1, ((root mean square cross-validation prediction error in the training data)/(root mean square prediction error in the test set))^weighSimByPrediction); that is, it is never bigger than 1 and it decreases as the test error grows relative to the cross-validation error.
Unless the features' mean and variance can be ascribed clear meaning, the (training) features should be scaled to mean 0 and variance 1 before the centroids are formed.
The function implements a basic option for removal of spurious effects in the training and test data, by removing a fixed number of leading principal components from the features. This sometimes leads to better prediction accuracy but should be used with caution.
If samples within each class are heterogeneous, a single centroid may not represent each class well. This function can deal with within-class heterogeneity by clustering samples (separately in each class), then using one representative (mean, eigensample) or quantile for each cluster in each class to assign test samples. Various similarity measures, specified by adjFnc, can be used to construct the sample network adjacency. Similarly, the user can specify a clustering function using clusteringFnc. The requirements on the clustering function are described in a separate section below.
A list with the following components:
predicted |
The back-substitution prediction in the training set. |
predictedTest |
Prediction in the test set. |
featureSignificance |
A vector of feature significance calculated by |
selectedFeatures |
A vector giving the indices of the features that were selected for the predictor. |
centroidProfile |
The representative profiles of each class (or cluster). Only returned in
|
testSample2centroidSimilarities |
A matrix of calculated similarities between the test samples and class/cluster centroids. |
featureValidationWeights |
A vector of validation weights (see Details) for the selected features. If
|
CVpredicted |
Cross-validation prediction on the training data. Present only if |
sampleClusterLabels |
A list with two components (one per class). Each component is a vector of sample cluster labels for samples in the class. |
Peter Langfelder