The Empirical Distribution Based on a Set of Observations
Density, distribution function, quantile function, and random generation for the empirical distribution based on a set of observations
demp(x, obs, discrete = FALSE, density.arg.list = NULL) pemp(q, obs, discrete = FALSE, prob.method = ifelse(discrete, "emp.probs", "plot.pos"), plot.pos.con = 0.375) qemp(p, obs, discrete = FALSE, prob.method = ifelse(discrete, "emp.probs", "plot.pos"), plot.pos.con = 0.375) remp(n, obs)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
obs |
numeric vector of observations. Missing ( |
discrete |
logical scalar indicating whether the assumed parent distribution of |
density.arg.list |
list with arguments to the R |
prob.method |
character string indicating what method to use to compute the empirical
probabilities. Possible values are |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant. The default value is |
Let x_1, x_2, …, x_n denote a random sample of n observations
from some unknown probability distribution (i.e., the elements of the argument
obs
), and let x_{(i)} denote the i^{th} order statistic, that is,
the i^{th} largest observation, for i = 1, 2, …, n.
Estimating Density
The function demp
computes the empirical probability density function. If
the observations are assumed to come from a discrete distribution, the probability
density (mass) function is estimated by:
\hat{f}(x) = \widehat{Pr}(X = x) = \frac{∑^n_{i=1} I_{[x]}(x_i)}{n}
where I is the indicator function:
I_{[x]}(y) = | 1 | if y = x, |
0 | if y \ne x |
That is, the estimated probability of observing the value x is simply the observed proportion of observations equal to x.
If the observations are assumed to come from a continuous distribution, the
function demp
calls the R function density
to compute the
estimated density based on the values specified in the argument obs
,
and then uses linear interpolation to estimate the density at the values
specified in the argument x
. See the R help file for
density
for more information on how the empirical density is
computed in the continuous case.
Estimating Probabilities
The function pemp
computes the estimated cumulative distribution function
(cdf), also called the empirical cdf (ecdf). If the observations are assumed to
come from a discrete distribution, the value of the cdf evaluated at the i^{th}
order statistic is usually estimated by:
\hat{F}[x_{(i)}] = \widehat{Pr}(X ≤ x_{(i)}) = \hat{p}_i = \frac{∑^n_{j=1} I_{(-∞, x_{(i)}]}(x_j)}{n}
where:
I_{(-∞, x]}(y) = | 1 | if y ≤ x, |
0 | if y > x |
(D'Agostino, 1986a). That is, the estimated value of the cdf at the i^{th}
order statistic is simply the observed proportion of observations less than or
equal to the i^{th} order statistic. This estimator is sometimes called the
“empirical probabilities” estimator and is intuitively appealing.
The function pemp
uses the above equations to compute the empirical cdf when
prob.method="emp.probs"
.
For any general value of x, when the observations are assumed to come from a discrete distribution, the value of the cdf is estimated by:
\hat{F}(x) = | 0 | if x < x_{(1)}, |
\hat{p}_i | if x_{(i)} ≤ x < x_{(i+1)}, | |
1 | if x ≥ x_{(n)} |
The function pemp
uses the above equation when discrete=TRUE
.
If the observations are assumed to come from a continuous distribution, the value of the cdf evaluated at the i^{th} order statistic is usually estimated by:
\hat{F}[x_{(i)}] = \hat{p}_i = \frac{i - a}{n - 2a + 1}
where a denotes the plotting position constant and 0 ≤ a ≤ 1
(Cleveland, 1993, p.18; D'Agostino, 1986a, pp.8,25). The estimators defined by
the above equation are called plotting positions and are used to construct
probability plots. The function pemp
uses the above equation
when prob.method="plot.pos"
.
For any general value of x, the value of the cdf is estimated by linear interpolation:
\hat{F}(x) = | \hat{p}_1 | if x < x_{(1)}, |
(1 - r)\hat{p}_i + r\hat{p}_{i+1} | if x_{(i)} ≤ x < x_{(i+1)}, | |
\hat{p}_n | if x ≥ x_{(n)} |
where
r = \frac{x - x_{(i)}}{x_{(i+1)} - x_{(i)}}
(Chambers et al., 1983). The function pemp
uses the above two equations
when discrete=FALSE
.
Estimating Quantiles
The function qemp
computes the estimated quantiles based on the observed
data. If the observations are assumed to come from a discrete distribution, the
p^{th} quantile is usually estimated by:
\hat{x}_p = | x_{(1)} | if p ≤ \hat{p}_1, |
x_{(i)} | if \hat{p}_{i-1} < p ≤ \hat{p}_i, | |
x_n | if p > \hat{p}_n |
The function qemp
uses the above equation when discrete=TRUE
.
If the observations are assumed to come from a continuous distribution, the p^{th} quantile is usually estimated by linear interpolation:
\hat{x}_p = | x_{(1)} | if p ≤ \hat{p}_1, |
(1 - r)x_{(i-1)} + rx_{(i)} | if \hat{p}_{i-1} < p ≤ \hat{p}_i, | |
x_n | if p > \hat{p}_n | |
where
r = \frac{p - \hat{p}_{i-1}}{\hat{p}_i - \hat{p}_{i-1}}
The function qemp
uses the above two equations when discrete=FALSE
.
Generating Random Numbers From the Empirical Distribution
The function remp
simply calls the R function sample
to
sample the elements of obs
with replacement.
density (demp
), probability (pemp
), quantile (qemp
), or
random sample (remp
) for the empirical distribution based on the data
contained in the vector obs
.
The function demp
let's you perform nonparametric density estimation.
The function pemp
computes the value of the empirical cumulative
distribution function (ecdf) for user-specified quantiles. The ecdf is a
nonparametric estimate of the true cdf (see ecdfPlot
). The
function qemp
computes nonparametric estimates of quantiles
(see the help files for eqnpar
and quantile
).
The function remp
let's you sample a set of observations with replacement,
which is often done while bootstrapping or performing some other kind of
Monte Carlo simulation.
Steven P. Millard (EnvStats@ProbStatInfo.com)
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11–16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7–62.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice and Visualization. John Wiley and Sons, New York.
Sheather, S. J. and Jones M. C. (1991). A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation. Journal of the Royal Statististical Society B, 683–690.
Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Wegman, E.J. (1972). Nonparametric Probability Density Estimation. Technometrics 14, 533-546.
# Create a set of 100 observations from a gamma distribution with # parameters shape=4 and scale=5. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(3) obs <- rgamma(100, shape=4, scale=5) # Now plot the empirical distribution (with a histogram) and the true distribution: dev.new() hist(obs, col = "cyan", xlim = c(0, 65), freq = FALSE, ylab = "Relative Frequency") pdfPlot('gamma', list(shape = 4, scale = 5), add = TRUE) box() # Now plot the empirical distribution (based on demp) with the # true distribution: x <- qemp(p = seq(0, 1, len = 100), obs = obs) y <- demp(x, obs) dev.new() plot(x, y, xlim = c(0, 65), type = "n", xlab = "Value of Random Variable", ylab = "Relative Frequency") lines(x, y, lwd = 2, col = "cyan") pdfPlot('gamma', list(shape = 4, scale = 5), add = TRUE) # Alternatively, you can create the above plot with the function # epdfPlot: dev.new() epdfPlot(obs, xlim = c(0, 65), epdf.col = "cyan", xlab = "Value of Random Variable", main = "Empirical and Theoretical PDFs") pdfPlot('gamma', list(shape = 4, scale = 5), add = TRUE) # Clean Up #--------- rm(obs, x, y)
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.