Compute (Skewness-adjusted) Multivariate Outlyingness
For an n * p data matrix (or data frame) x
,
compute the “outlyingness” of all n observations.
Outlyingness here is a generalization of the Donoho-Stahel
outlyingness measure, where skewness is taken into account via the
medcouple, mc()
.
adjOutlyingness(x, ndir = 250, p.samp = p, clower = 4, cupper = 3, alpha.cutoff = 0.75, coef = 1.5, qr.tol = 1e-12, keep.tol = 1e-12, only.outlyingness = FALSE, maxit.mult = max(100, p), trace.lev = 0, mcReflect = n <= 100, mcScale = TRUE, mcMaxit = 2*maxit.mult, mcEps1 = 1e-12, mcEps2 = 1e-15, mcTrace = max(0, trace.lev-1))
x |
a numeric |
ndir |
positive integer specifying the number of directions that should be searched. |
p.samp |
the sample size to use for finding good random
directions, must be at least |
clower, cupper |
the constant to be used for the lower and upper
tails, in order to transform the data towards symmetry. You can set
|
alpha.cutoff |
number in (0,1) specifying the quantiles
(α, 1-α) which determine the “outlier”
cutoff. The default, using quartiles, corresponds to the definition
of the medcouple ( |
coef |
positive number specifying the factor with which the
interquartile range ( |
qr.tol |
positive tolerance to be used for |
keep.tol |
positive tolerance to determine which of the sample
direction should be kept, namely only those for which
||x|| * ||B|| is larger than |
only.outlyingness |
logical indicating if the final outlier determination should be skipped. In that case, a vector is returned, see ‘Value:’ below. |
maxit.mult |
integer factor; |
trace.lev |
an integer, if positive allows to monitor the direction search. |
FIXME: Details in the comment of the Matlab code; also in the reference(s).
The method as described can be useful as preprocessing in FASTICA (http://research.ics.aalto.fi/ica/fastica/ see also the R package fastICA.
If only.outlyingness
is true, a vector adjout
,
otherwise, as by default, a list with components
adjout |
numeric of |
cutoff |
cutoff for “outlier” with respect to the adjusted
outlyingnesses, and depending on |
nonOut |
logical of |
The result is random as it depends on the sample of
ndir
directions chosen; specifically, to get sub samples the algorithm uses
sample.int(n, p.samp)
which from R version 3.6.0 depends on
RNGkind(*, sample.kind)
. Exact reproducibility of results
from R versions 3.5.3 and earlier, requires setting
RNGversion("3.5.0")
.
In any case, do use set.seed()
yourself
for reproducibility!
Till Aug/Oct. 2014, the default values for clower
and cupper
were
accidentally reversed, and the signs inside exp(.)
where swapped
in the (now corrected) two expressions
tup <- Q3 + coef * IQR * exp(.... + clower * tmc * (tmc < 0)) tlo <- Q1 - coef * IQR * exp(.... - cupper * tmc * (tmc < 0))
already in the code from Antwerpen (‘mcrsoft/adjoutlingness.R’), contrary to the published reference.
Further, the original algorithm had not been scale-equivariant in the direction construction, which has been amended in 2014-10 as well.
The results, including diagnosed outliers, therefore have changed, typically slightly, since robustbase version 0.92-0.
Guy Brys; help page and improvements by Martin Maechler
Brys, G., Hubert, M., and Rousseeuw, P.J. (2005) A Robustification of Independent Component Analysis; Journal of Chemometrics, 19, 1–12.
Hubert, M., Van der Veeken, S. (2008) Outlier detection for skewed data; Journal of Chemometrics 22, 235–246; doi: 10.1002/cem.1123.
For the up-to-date reference, please consult https://wis.kuleuven.be/statdatascience/robust
## An Example with bad condition number and "border case" outliers dim(longley) # 16 x 7 // set seed, as result is random : set.seed(31) ao1 <- adjOutlyingness(longley, mcScale=FALSE) ## which are outlying ? which(!ao1$nonOut) ## for this seed, two: "1956", "1957"; (often: none) ## For seeds 1:100, we observe (Linux 64b) if(FALSE) { adjO <- sapply(1:100, function(iSeed) { set.seed(iSeed); adjOutlyingness(longley)$nonOut }) table(nrow(longley) - colSums(adjO)) } ## #{outl.}: 0 1 2 3 ## #{cases}: 74 17 6 3 ## An Example with outliers : dim(hbk) set.seed(1) ao.hbk <- adjOutlyingness(hbk) str(ao.hbk) hist(ao.hbk $adjout)## really two groups table(ao.hbk$nonOut)## 14 outliers, 61 non-outliers: ## outliers are : which(! ao.hbk$nonOut) # 1 .. 14 --- but not for all random seeds! ## here, they are the same as found by (much faster) MCD: cc <- covMcd(hbk) stopifnot(all(cc$mcd.wt == ao.hbk$nonOut)) ## This is revealing: About 1--2 cases, where outliers are *not* == 1:14 ## but needs almost 1 [sec] per call: if(interactive()) { for(i in 1:30) { print(system.time(ao.hbk <- adjOutlyingness(hbk))) if(!identical(iout <- which(!ao.hbk$nonOut), 1:14)) { cat("Outliers:\n"); print(iout) } } } ## "Central" outlyingness: *not* calling mc() anymore, since 2014-12-11: trace(mc) out <- capture.output( oo <- adjOutlyingness(hbk, clower=0, cupper=0) ) untrace(mc) stopifnot(length(out) == 0) ## A rank-deficient case T <- tcrossprod(data.matrix(toxicity)) try(adjOutlyingness(T, maxit. = 20, trace.lev = 2)) # fails and recommends: T. <- fullRank(T) aT <- adjOutlyingness(T.) plot(sort(aT$adjout, decreasing=TRUE), log="y") plot(T.[,9:10], col = (1:2)[1 + (aT$adjout > 10000)]) ## .. (not conclusive; directions are random, more 'ndir' makes a difference!)
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.