ks: ks-package – R documentation

Pricing

Become an expert in R — Interactive courses, Cheat Sheets, certificates and more!

Get Started for Free

Documentation

ks-package

Description

Kernel smoothing for data from 1- to 6-dimensions.

Details

There are three main types of functions in this package:

computing kernel estimators - these function names begin with ‘k’
computing bandwidth selectors - these begin with ‘h’ (1-d) or ‘H’ (>1-d)
displaying kernel estimators - these begin with ‘plot’.

The kernel used throughout is the normal (Gaussian) kernel K. For 1-d data, the bandwidth h is the standard deviation of the normal kernel, whereas for multivariate data, the bandwidth matrix H is the variance matrix.

–For kernel density estimation, kde computes

hat(f)(x) = n^(-1) sum_i K_H (x - X_i).

The bandwidth matrix H is a matrix of smoothing parameters and its choice is crucial for the performance of kernel estimators. For display, its plot method calls plot.kde.

–For kernel density estimation, there are several varieties of bandwidth selectors

plug-in hpi (1-d); Hpi, Hpi.diag (2- to 6-d)
least squares (or unbiased) cross validation (LSCV or UCV) hlscv (1-d); Hlscv, Hlscv.diag (2- to 6-d)
biased cross validation (BCV) Hbcv, Hbcv.diag (2- to 6-d)
smoothed cross validation (SCV) hscv (1-d); Hscv, Hscv.diag (2- to 6-d)
normal scale hns (1-d); Hns (2- to 6-d).

–For kernel density support estimation, the main function is ksupp which is (the convex hull of)

{x: hat(f) > tau}

for a suitable level tau. This is closely related to the tau-level set of hat(f).

–For truncated kernel density estimation, the main function is kde.truncate

hat(f)(x) 1{x in Omega}/int hat(f) 1{x in Omega}

for a bounded data support Omega. The standard density estimate hat(f) is truncated and rescaled to give unit integral over Omega. Its plot method calls plot.kde.

–For boundary kernel density estimation where the kernel function is modified explicitly in the boundary region, the main function is kde.boundary

hat(f)(x) = n^(-1) sum_i K*_H (x - X_i)

for a boundary kernel K*. Its plot method calls plot.kde.

–For variable kernel density estimation where the bandwidth is not a constant matrix, the main functions are kde.balloon

hat(f)_ball(x) = n^(-1) sum_i K_H(x) (x - X_i)

and kde.sp

hat(f)_SP(x) = n^(-1) sum_i K_H(X_i) (x - X_i).

For the balloon estimation hat(f)_ball the bandwidth varies with the estimation point x, whereas for the sample point estimation hat(f)_SP the bandwidth varies with the data point X_i, i=1, ..., n. Their plot methods call plot.kde. The bandwidth selectors for kde.balloon are based on the normal scale bandwidth Hns(,deriv.order=2) via the MSE minimal formula, and for kde.SP on Hns(,deriv.order=4) via the Abramson formula.

–For kernel density derivative estimation, the main function is kdde

hat(f)^(r)(x) = n^(-1) sum_i D^r K_H (x - X_i).

The bandwidth selectors are a modified subset of those for kde, i.e. Hlscv, Hns, Hpi, Hscv with deriv.order>0. Its plot method is plot.kdde for plotting each partial derivative singly.

–For kernel summary curvature estimation, the main function is kcurv

hat(s)(x) = -1{D^2 hat(f)(x) <0)*abs(det(D^2 hat(f)(x)))}

where D^2 hat(f)(x) is the kernel Hessian matrix estimate. It has the same structure as a kernel density estimate so its plot method calls plot.kde.

–For kernel discriminant analysis, the main function is kda which computes density estimates for each the groups in the training data, and the discriminant surface. Its plot method is plot.kda. The wrapper function hkda, Hkda computes bandwidths for each group in the training data for kde, e.g. hpi, Hpi.

–For kernel functional estimation, the main function is kfe which computes the r-th order integrated density functional

hat(psi)_r = n^(-2) sum_i sum_j D^r K_H (X_i - X_j).

The plug-in selectors are hpi.kfe (1-d), Hpi.kfe (2- to 6-d). Kernel functional estimates are usually not required to computed directly by the user, but only within other functions in the package.

–For kernel-based 2-sample testing, the main function is kde.test which computes the integrated L2 distance between the two density estimates as the test statistic, comprising a linear combination of 0-th order kernel functional estimates:

hat(T) = hat(psi)_0,1 + hat(psi)_0,2 - (hat(psi)_0,12 + hat(psi)_0,21),

and the corresponding p-value. The psi are zero order kernel functional estimates with the subscripts indicating that 1 = sample 1 only, 2 = sample 2 only, and 12, 21 = samples 1 and 2. The bandwidth selectors are hpi.kfe, Hpi.kfe with deriv.order=0.

–For kernel-based local 2-sample testing, the main function is kde.local.test which computes the squared distance between the two density estimates as the test statistic

hat(U)(x) = [hat(f)_1(x) - hat(f)_2(x)]^2

and the corresponding local p-values. The bandwidth selectors are those used with kde, e.g. hpi, Hpi.

–For kernel cumulative distribution function estimation, the main function is kcde

hat(F)(x) = n^(-1) sum_i intK_H (x - X_i)

where intK is the integrated kernel. The bandwidth selectors are hpi.kcde, Hpi.kcde. Its plot method is plot.kcde. There exist analogous functions for the survival function hat(bar(F)).

–For kernel estimation of a ROC (receiver operating characteristic) curve to compare two samples from hat(F)_1, hat(F)_2, the main function is kroc

{hat(F)_hat(Y1))(z), hat(F_hat(Y2))(z)}

based on the cumulative distribution functions of hat(Yj)=hat(bar(F))_1(X_j), j=1,2.

The bandwidth selectors are those used with kcde, e.g. hpi.kcde, Hpi.kcde for hat(F)_hat(Yj), hat(bar(F))_1. Its plot method is plot.kroc.

–For kernel estimation of a copula, the main function is kcopula

hat(C)(z) = hat(F)(hat(F)_1^(-1)(z_1),..., hat(F)_d^(-1)(z_d))

where hat(F)_j^(-1)(z_j) is the z_j-th quantile of of the j-th marginal distribution hat(F_j). The bandwidth selectors are those used with kcde for hat(F), hat(F)_j. Its plot method is plot.kcde.

–For kernel mean shift clustering, the main function is kms. The mean shift recurrence relation of the candidate point x

x_j+1 = x_j + H D hat(f)(x_j)/hat(f)(x_j),

where j>=0 and x_0 = x, is iterated until x converges to its local mode in the density estimate hat(f) by following the density gradient ascent paths. This mode determines the cluster label for x. The bandwidth selectors are those used with kdde(,deriv.order=1).

–For kernel density ridge estimation, the main function is kdr. The kernel density ridge recurrence relation of the candidate point x

x_j+1 = x_j + U_(d-1)(x_j) U_(d-1)(x_j)^T H D hat(f)(x_j)/hat(f)(x_j),

where j>=0, x_0 = x and U_(d-1) is the 1-dimensional projected density gradient, is iterated until x converges to the ridge in the density estimate. The bandwidth selectors are those used with kdde(,deriv.order=2).

– For kernel feature significance, the main function kfs. The hypothesis test at a point x is H0(x): H f(x) < 0, i.e. the density Hessian matrix H f(x) is negative definite. The test statistic is

W(x) = ||S(x)^(-1/2) vech H hat{f}(x)||^2

where H hat{f} is the Hessian estimate, vech is the vector-half operator, and S is an estimate of the null variance. W(x) is approximately chi-squared distributed with d(d+1)/2 degrees of freedom. If H0(x) is rejected, then x belongs to a significant modal region. The bandwidth selectors are those used with kdde(,deriv.order=2). Its plot method is plot.kfs.

–For deconvolution density estimation, the main function is kdcde. A weighted kernel density estimation with the contaminated data W_1, ..., W_n,

hat(f)(x) = n^(-1) sum_i alpha_i K_H (x - W_i),

is utilised, where the weights alpha_1, ..., alpha_n are chosen via a quadratic optimisation involving the error variance and the regularisation parameter. The bandwidth selectors are those used with kde.

–Binned kernel estimation is an approximation to the exact kernel estimation and is available for d=1, 2, 3, 4. This makes kernel estimators feasible for large samples.

–For an overview of this package with 2-d density estimation, see vignette("kde").

–For ks >= 1.11.1, the misc3d and rgl (3-d plot), OceanView (quiver plot), oz (Australian map) packages have been moved from Depends to Suggests. This was done to allow ks to be installed on systems where these latter graphical-based packages can't be installed. Furthermore, since the future of OpenGL in R is not certain, plot3D becomes the default for 3D plotting for ks >= 1.12.0. RGL plots are still supported though these may be deprecated in the future.

Author(s)

Tarn Duong for most of the package. M. P. Wand for the binned estimation, univariate plug-in selector and univariate density derivative estimator code. J. E. Chacon for the unconstrained pilot functional estimation and fast implementation of derivative-based estimation code. A. and J. Gramacki for the binned estimation for unconstrained bandwidth matrices.

References

Bowman, A. & Azzalini, A. (1997) Applied Smoothing Techniques for Data Analysis. Oxford University Press, Oxford.

Chacon, J.E. & Duong, T. (2018) Multivariate Kernel Smoothing and Its Applications. Chapman & Hall/CRC, Boca Raton.

Duong, T. (2004) Bandwidth Matrices for Multivariate Kernel Density Estimation. Ph.D. Thesis, University of Western Australia.

Scott, D.W. (1992) Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, New York.

Silverman, B. (1986) Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, London.

Simonoff, J. S. (1996) Smoothing Methods in Statistics. Springer-Verlag, New York.

Wand, M.P. & Jones, M.C. (1995) Kernel Smoothing. Chapman & Hall/CRC, London.

ks-package

Description

Details

Author(s)

References

See Also

ks

We don't support your browser anymore