Representative Curves
curveRep
finds representative curves from a
relatively large collection of curves. The curves usually represent
time-response profiles as in serial (longitudinal or repeated) data
with possibly unequal time points and greatly varying sample sizes per
subject. After excluding records containing missing x
or
y
, records are first stratified into kn
groups having similar
sample sizes per curve (subject). Within these strata, curves are
next stratified according to the distribution of x
points per
curve (typically measurement times per subject). The
clara
clustering/partitioning function is used
to do this, clustering on one, two, or three x
characteristics
depending on the minimum sample size in the current interval of sample
size. If the interval has a minimum number of unique values
of
one, clustering is done on the single x
values. If the minimum
number of unique x
values is two, clustering is done to create
groups that are similar on both min(x)
and max(x)
. For
groups containing no fewer than three unique x
values,
clustering is done on the trio of values min(x)
, max(x)
,
and the longest gap between any successive x
. Then within
sample size and x
distribution strata, clustering of
time-response profiles is based on p
values of y
all
evaluated at the same p
equally-spaced x
's within the
stratum. An option allows per-curve data to be smoothed with
lowess
before proceeding. Outer x
values are
taken as extremes of x
across all curves within the stratum.
Linear interpolation within curves is used to estimate y
at the
grid of x
's. For curves within the stratum that do not extend
to the most extreme x
values in that stratum, extrapolation
uses flat lines from the observed extremes in the curve unless
extrap=TRUE
. The p
y
values are clustered using
clara
.
print
and plot
methods show results. By specifying an
auxiliary idcol
variable to plot
, other variables such
as treatment may be depicted to allow the analyst to determine for
example whether subjects on different treatments are assigned to
different time-response profiles. To write the frequencies of a
variable such as treatment in the upper left corner of each panel
(instead of the grand total number of clusters in that panel), specify
freq
.
curveSmooth
takes a set of curves and smooths them using
lowess
. If the number of unique x
points in a curve is
less than p
, the smooth is evaluated at the unique x
values. Otherwise it is evaluated at an equally spaced set of
x
points over the observed range. If fewer than 3 unique
x
values are in a curve, those points are used and smoothing is not done.
curveRep(x, y, id, kn = 5, kxdist = 5, k = 5, p = 5, force1 = TRUE, metric = c("euclidean", "manhattan"), smooth=FALSE, extrap=FALSE, pr=FALSE) ## S3 method for class 'curveRep' print(x, ...) ## S3 method for class 'curveRep' plot(x, which=1:length(res), method=c('all','lattice'), m=NULL, probs=c(.5, .25, .75), nx=NULL, fill=TRUE, idcol=NULL, freq=NULL, plotfreq=FALSE, xlim=range(x), ylim=range(y), xlab='x', ylab='y', colorfreq=FALSE, ...) curveSmooth(x, y, id, p=NULL, pr=TRUE)
x |
a numeric vector, typically measurement times.
For |
y |
a numeric vector of response values |
id |
a vector of curve (subject) identifiers, the same length as
|
kn |
number of curve sample size groups to construct.
|
kxdist |
maximum number of x-distribution clusters to derive
using |
k |
maximum number of x-y profile clusters to derive using |
p |
number of |
force1 |
By default if any curves have only one point, all curves
consisting of one point will be placed in a separate stratum. To
prevent this separation, set |
metric |
see |
smooth |
By default, linear interpolation is used on raw data to
obtain |
extrap |
set to |
pr |
set to |
which |
an integer vector specifying which sample size intervals
to plot. Must be specified if |
method |
The default makes individual plots of possibly all
x-distribution by sample size by cluster combinations. Fewer may be
plotted by specifying |
m |
the number of curves in a cluster to randomly sample if there
are more than |
nx |
applies if |
probs |
3-vector of probabilities with the central quantile first. Default uses quartiles. |
fill |
for |
idcol |
a named vector to be used as a table lookup for color
assignments (does not apply when |
freq |
a named vector to be used as a table lookup for a grouping
variable such as treatment. The names are curve |
plotfreq |
set to |
colorfreq |
set to |
xlim, ylim, xlab, ylab |
plotting parameters. Default ranges are
the ranges in the entire set of raw data given to |
... |
arguments passed to other functions. |
In the graph titles for the default graphic output, n
refers to the
minimum sample size, x
refers to the sequential x-distribution
cluster, and c
refers to the sequential x-y profile cluster. Graphs
from method = "lattice"
are produced by
xyplot
and in the panel titles
distribution
refers to the x-distribution stratum and
cluster
refers to the x-y profile cluster.
a list of class "curveRep"
with the following elements
res |
a hierarchical list first split by sample size intervals,
then by x distribution clusters, then containing a vector of cluster
numbers with |
ns |
a table of frequencies of sample sizes per curve after
removing |
nomit |
total number of records excluded due to |
missfreq |
a table of frequencies of number of |
ncuts |
cut points for sample size intervals |
kn |
number of sample size intervals |
kxdist |
number of clusters on x distribution |
k |
number of clusters of curves within sample size and distribution groups |
p |
number of points at which to evaluate each curve for clustering |
x |
|
y |
|
id |
input data after removing |
curveSmooth
returns a list with elements x,y,id
.
The references describe other methods for deriving
representative curves, but those methods were not used here. The last
reference which used a cluster analysis on principal components
motivated curveRep
however. The kml
package does k-means clustering of longitudinal data with imputation.
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
Segal M. (1994): Representative curves for longitudinal data via regression trees. J Comp Graph Stat 3:214-233.
Jones MC, Rice JA (1992): Displaying the important features of large collections of similar curves. Am Statistician 46:140-145.
Zheng X, Simpson JA, et al (2005): Data from a study of effectiveness suggested potential prognostic factors related to the patterns of shoulder pain. J Clin Epi 58:823-830.
## Not run: # Simulate 200 curves with pre-curve sample sizes ranging from 1 to 10 # Make curves with odd-numbered IDs have an x-distribution that is random # uniform [0,1] and those with even-numbered IDs have an x-dist. that is # half as wide but still centered at 0.5. Shift y values higher with # increasing IDs set.seed(1) N <- 200 nc <- sample(1:10, N, TRUE) id <- rep(1:N, nc) x <- y <- id for(i in 1:N) { x[id==i] <- if(i %% 2) runif(nc[i]) else runif(nc[i], c(.25, .75)) y[id==i] <- i + 10*(x[id==i] - .5) + runif(nc[i], -10, 10) } w <- curveRep(x, y, id, kxdist=2, p=10) w par(ask=TRUE, mfrow=c(4,5)) plot(w) # show everything, profiles going across par(mfrow=c(2,5)) plot(w,1) # show n=1 results # Use a color assignment table, assigning low curves to green and # high to red. Unique curve (subject) IDs are the names of the vector. cols <- c(rep('green', N/2), rep('red', N/2)) names(cols) <- as.character(1:N) plot(w, 3, idcol=cols) par(ask=FALSE, mfrow=c(1,1)) plot(w, 1, 'lattice') # show n=1 results plot(w, 3, 'lattice') # show n=4-5 results plot(w, 3, 'lattice', idcol=cols) # same but different color mapping plot(w, 3, 'lattice', m=1) # show a single "representative" curve # Show median, 10th, and 90th percentiles of supposedly representative curves plot(w, 3, 'lattice', m='quantiles', probs=c(.5,.1,.9)) # Same plot but with much less grouping of x variable plot(w, 3, 'lattice', m='quantiles', probs=c(.5,.1,.9), nx=2) # Smooth data before profiling. This allows later plotting to plot # smoothed representative curves rather than raw curves (which # specifying smooth=TRUE to curveRep would do, if curveSmooth was not used) d <- curveSmooth(x, y, id) w <- with(d, curveRep(x, y, id)) # Example to show that curveRep can cluster profiles correctly when # there is no noise. In the data there are four profiles - flat, flat # at a higher mean y, linearly increasing then flat, and flat at the # first height except for a sharp triangular peak set.seed(1) x <- 0:100 m <- length(x) profile <- matrix(NA, nrow=m, ncol=4) profile[,1] <- rep(0, m) profile[,2] <- rep(3, m) profile[,3] <- c(0:3, rep(3, m-4)) profile[,4] <- c(0,1,3,1,rep(0,m-4)) col <- c('black','blue','green','red') matplot(x, profile, type='l', col=col) xeval <- seq(0, 100, length.out=5) s <- x matplot(x[s], profile[s,], type='l', col=col) id <- rep(1:100, each=m) X <- Y <- id cols <- character(100) names(cols) <- as.character(1:100) for(i in 1:100) { s <- id==i X[s] <- x j <- sample(1:4,1) Y[s] <- profile[,j] cols[i] <- col[j] } table(cols) yl <- c(-1,4) w <- curveRep(X, Y, id, kn=1, kxdist=1, k=4) plot(w, 1, 'lattice', idcol=cols, ylim=yl) # Found 4 clusters but two have same profile w <- curveRep(X, Y, id, kn=1, kxdist=1, k=3) plot(w, 1, 'lattice', idcol=cols, freq=cols, plotfreq=TRUE, ylim=yl) # Incorrectly combined black and red because default value p=5 did # not result in different profiles at x=xeval w <- curveRep(X, Y, id, kn=1, kxdist=1, k=4, p=40) plot(w, 1, 'lattice', idcol=cols, ylim=yl) # Found correct clusters because evaluated curves at 40 equally # spaced points and could find the sharp triangular peak in profile 4 ## End(Not run)
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.