Shape-based distance
Distance based on coefficient-normalized cross-correlation as proposed by Paparrizos and Gravano (2015) for the k-Shape clustering algorithm.
SBD(x, y, znorm = FALSE, error.check = TRUE, return.shifted = TRUE)
sbd(x, y, znorm = FALSE, error.check = TRUE, return.shifted = TRUE)
x, y: Univariate time series.
znorm: Logical. Should each series be z-normalized before calculating the distance?
error.check: Logical indicating whether the function should try to detect inconsistencies and give more informative error messages. Also used internally to avoid repeating checks.
return.shifted: Logical. Should the shifted version of y be returned?
This distance works best if the series are z-normalized. If not, at least they should have appropriate amplitudes, since the values of the signals do affect the outcome.
If x and y do not have the same length, it would be best if the longer sequence is provided in y, because it will be shifted to match x. After matching, the series may have to be truncated or extended and padded with zeros if needed.
The output values lie between 0 and 2, with 0 indicating perfect similarity.
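The range follows from the definition in Paparrizos and Gravano (2015): the distance is one minus the maximum of the coefficient-normalized cross-correlation, which lies between -1 and 1. The following is a minimal base-R sketch of that calculation, not the package's implementation; the name sbd_sketch is hypothetical, and the FFT padding length is an assumption chosen only to avoid wrap-around.

```r
# Hypothetical helper for illustration; not the code used by SBD().
sbd_sketch <- function(x, y) {
  n <- length(x)
  m <- length(y)
  # Pad to a power of 2 that is at least n + m - 1 so circular
  # cross-correlation equals linear cross-correlation.
  len <- stats::nextn(n + m - 1L, 2L)
  fx <- fft(c(x, rep(0, len - n)))
  fy <- fft(c(y, rep(0, len - m)))
  # R's inverse FFT is unnormalized, hence the division by len.
  cc <- Re(fft(fx * Conj(fy), inverse = TRUE)) / len
  # Keep only valid lags: 0..(n - 1) at the start, -(m - 1)..-1 at the end.
  idx <- c(seq_len(n), len - rev(seq_len(m - 1L)) + 1L)
  # Coefficient normalization: divide by the product of the Euclidean norms.
  ncc <- cc[idx] / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
  1 - max(ncc)
}
```

A series compared with itself, or with a zero-padded shifted copy of itself, gives a value of approximately 0 in this sketch.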
For return.shifted = FALSE, the numeric distance value, otherwise a list with:
dist: The shape-based distance between x and y.
yshift: A shifted version of y so that it optimally matches x (based on NCCc()).
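How a series can be shifted and then truncated or zero-padded to the length of x might be sketched as follows. This is an illustration under stated assumptions, not the package's code: the helper name shift_to_match and the sign convention (a positive shift delays y, a negative shift drops its leading values) are hypothetical.

```r
# Hypothetical helper for illustration; not the code used by SBD().
# y:     series to shift
# shift: integer lag (assumed convention: positive delays y)
# n:     target length, i.e. length(x)
shift_to_match <- function(y, shift, n) {
  if (shift >= 0L) {
    # Delay y by prepending zeros.
    shifted <- c(rep(0, shift), y)
  } else if (-shift >= length(y)) {
    # Everything was shifted out of range.
    shifted <- numeric(0)
  } else {
    # Advance y by dropping its first -shift values.
    shifted <- y[(1L - shift):length(y)]
  }
  len <- length(shifted)
  if (len < n) {
    c(shifted, rep(0, n - len))  # extend and pad with zeros
  } else {
    shifted[seq_len(n)]          # truncate to the length of x
  }
}
```

The result always has length n, so it can be compared element-wise with x.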
The version registered with dist is custom (loop = FALSE in pr_DB). The custom function handles multi-threaded parallelization directly (with RcppParallel). It uses all available threads by default (see RcppParallel::defaultNumThreads()), but this can be changed by the user with RcppParallel::setThreadOptions().
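As a hedged illustration of limiting the thread count, the sketch below sets two threads before computing a distance matrix and restores the default afterwards. It assumes that dtwclust and RcppParallel are installed; the value 2 and the subset of ten series are arbitrary choices for the example.

```r
# Runs only if both packages are available.
if (requireNamespace("RcppParallel", quietly = TRUE) &&
    requireNamespace("dtwclust", quietly = TRUE)) {
  library(dtwclust)  # attaches proxy and registers "SBD" in pr_DB
  data(uciCT)        # provides CharTraj

  # Limit the registered SBD function to 2 threads (arbitrary value).
  RcppParallel::setThreadOptions(numThreads = 2L)
  D <- proxy::dist(CharTraj[1:10], method = "SBD", znorm = TRUE)

  # Restore the default thread count.
  RcppParallel::setThreadOptions(numThreads = "auto")
}
```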
An exception to the above is when it is called within a foreach parallel loop made by dtwclust. If the parallel workers do not have the number of threads explicitly specified, this function will default to 1 thread per worker. See the parallelization vignette for more information (browseVignettes("dtwclust")).
It also includes symmetric optimizations to calculate only half a distance matrix when appropriate, which requires that only one list of series be provided in x. If you want to avoid this optimization, call dist by giving the same list of series in both x and y.
In some situations, e.g. for relatively small distance matrices, the overhead introduced by the logic that computes only half the distance matrix can be bigger than just calculating the whole matrix.
If you wish to calculate the distance between several time series, it would be better to use the version registered with the proxy package, since it includes some small optimizations. See the examples.
This distance is calculated with the help of the Fast Fourier Transform, so it can be sensitive to numerical precision. Thus, this function (and the functions that depend on it) might return different values in 32-bit installations compared to 64-bit ones.
Paparrizos J and Gravano L (2015). “k-Shape: Efficient and Accurate Clustering of Time Series.” In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, series SIGMOD '15, pp. 1855-1870. ISBN 978-1-4503-2758-9, doi: 10.1145/2723372.2737793.
# load data
data(uciCT)

# distance between series of different lengths
sbd <- SBD(CharTraj[[1]], CharTraj[[100]], znorm = TRUE)$dist

# cross-distance matrix for series subset (notice the two-list input)
sbD <- proxy::dist(CharTraj[1:10], CharTraj[1:10], method = "SBD", znorm = TRUE)