Probability That at Least One Set of Future Observations Violates the Given Rule Based on a Nonparametric Simultaneous Prediction Interval
Compute the probability that at least one set of future observations violates the given rule based on a nonparametric simultaneous prediction interval for the next r future sampling occasions. The three possible rules are: k-of-m, California, or Modified California. The probability is based on assuming the true distribution of the observations is normal.
predIntNparSimultaneousTestPower(n, n.median = 1, k = 1, m = 2, r = 1, rule = "k.of.m", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), delta.over.sigma = 0, pi.type = "upper", r.shifted = r, method = "approx", NMC = 100, ci = FALSE, ci.conf.level = 0.95, integrate.args.list = NULL, evNormOrdStats.method = "royston")
n |
vector of positive integers specifying the sample sizes.
Missing ( |
n.median |
vector of positive odd integers specifying the sample size associated with the
future medians. The default value is |
k |
for the k-of-m rule ( |
m |
vector of positive integers specifying the maximum number of future observations (or
medians) on one future sampling “occasion”.
The default value is |
r |
vector of positive integers specifying the number of future sampling
“occasions”. The default value is |
rule |
character string specifying which rule to use. The possible values are
|
lpl.rank |
vector of non-negative integers indicating the rank of the order statistic to use for
the lower bound of the prediction interval. When |
n.plus.one.minus.upl.rank |
vector of non-negative integers related to the rank of the order statistic to use for
the upper bound of the prediction interval. A value of |
delta.over.sigma |
numeric vector indicating the ratio Δ/σ. The quantity
Δ (delta) denotes the difference between the mean of the population
that was sampled to construct the prediction interval, and the mean of the
population that will be sampled to produce the future observations. The quantity
σ (sigma) denotes the population standard deviation for both populations.
The default value is |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
r.shifted |
vector of positive integers specifying the number of future sampling occasions for
which the scaled mean is shifted by Δ/σ. All values must be
integeters between |
method |
character string indicating what method to use to compute the power. The possible
values are |
NMC |
positive integer indicating the number of Monte Carlo trials to run when |
ci |
logical scalar indicating whether to compute a confidence interval for the power
when |
ci.conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated with the
confidence interval for the power. The argument is ignored if |
integrate.args.list |
list of arguments to supply to the |
evNormOrdStats.method |
character string indicating which method to use in the call to
|
What is a Nonparametric Simultaneous Prediction Interval?
A nonparametric prediction interval for some population is an interval on the real line
constructed so that it will contain at least k of m future observations from
that population with some specified probability (1-α)100\%, where
0 < α < 1 and k and m are some pre-specified positive integers
and k ≤ m. The quantity (1-α)100\% is called
the confidence coefficient or confidence level associated with the prediction
interval. The function predIntNpar
computes a standard
nonparametric prediction interval.
The function predIntNparSimultaneous
computes a nonparametric simultaneous
prediction interval that will contain a certain number of future observations
with probability (1-α)100\% for each of r future sampling
“occasions”,
where r is some pre-specified positive integer. The quantity r may
refer to r distinct future sampling occasions in time, or it may for example
refer to sampling at r distinct locations on one future sampling occasion,
assuming that the population standard deviation is the same at all of the r
distinct locations.
The function predIntNparSimultaneous
computes a nonparametric simultaneous
prediction interval based on one of three possible rules:
For the k-of-m rule (rule="k.of.m"
), at least k of
the next m future observations will fall in the prediction
interval with probability (1-α)100\% on each of the r future
sampling occasions. If obserations are being taken sequentially, for a particular
sampling occasion, up to m observations may be taken, but once
k of the observations fall within the prediction interval, sampling can stop.
Note: For this rule, when r=1, the results of predIntNparSimultaneous
are equivalent to the results of predIntNpar
.
For the California rule (rule="CA"
), with probability
(1-α)100\%, for each of the r future sampling occasions, either
the first observation will fall in the prediction interval, or else all of the next
m-1 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise,
m-1 more observations must be taken.
For the Modified California rule (rule="Modified.CA"
), with probability
(1-α)100\%, for each of the r future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.
Nonparametric simultaneous prediction intervals can be extended to using medians
in place of single observations (USEPA, 2009, Chapter 19). That is, you can
create a nonparametric simultaneous prediction interval that will contain a
specified number of medians (based on which rule you choose) on each of r
future sampling occassions, where each each median is based on b individual
observations. For the function predIntNparSimultaneous
, the argument
n.median
corresponds to b.
The Form of a Nonparametric Prediction Interval
Let \underline{x} = x_1, x_2, …, x_n denote a vector of n
independent observations from some continuous distribution, and let
x_{(i)} denote the the i'th order statistics in \underline{x}.
A two-sided nonparametric prediction interval is constructed as:
[x_{(u)}, x_{(v)}] \;\;\;\;\;\; (1)
where u and v are positive integers between 1 and n, and u < v. That is, u denotes the rank of the lower prediction limit, and v denotes the rank of the upper prediction limit. To make it easier to write some equations later on, we can also write the prediction interval (1) in a slightly different way as:
[x_{(u)}, x_{(n + 1 - w)}] \;\;\;\;\;\; (2)
where
w = n + 1 - v \;\;\;\;\;\; (3)
so that w is a positive integer between 1 and n-1, and
u < n+1-w. In terms of the arguments to the function predIntNparSimultaneous
,
the argument lpl.rank
corresponds to u, and the argument
n.plus.one.minus.upl.rank
corresponds to w.
If we allow u=0 and w=0 and define lower and upper bounds as:
x_{(0)} = lb \;\;\;\;\;\; (4)
x_{(n+1)} = ub \;\;\;\;\;\; (5)
then Equation (2) above can also represent a one-sided lower or one-sided upper prediction interval as well. That is, a one-sided lower nonparametric prediction interval is constructed as:
[x_{(u)}, x_{(n + 1)}] = [x_{(u)}, ub] \;\;\;\;\;\; (6)
and a one-sided upper nonparametric prediction interval is constructed as:
[x_{(0)}, x_{(n + 1 - w)}] = [lb, x_{(n + 1 - w)}] \;\;\;\;\;\; (7)
Usually, lb = -∞ or lb = 0 and ub = ∞.
Note: For nonparametric simultaneous prediction intervals, only lower
(pi.type="lower"
) and upper (pi.type="upper"
) prediction
intervals are available.
Computing Power
The "power" of the prediction interval is defined as the probability that
at least one set of future observations violates the given rule based on a
simultaneous prediction interval for the next r future sampling occasions,
where the population for the future observations is allowed to differ from
the population for the observations used to construct the prediction interval.
For the function predIntNparSimultaneousTestPower
, power is computed assuming
both the background and future the observations come from normal distributions with
the same standard deviation, but the means of the distributions are allowed to differ.
The quantity Δ (upper case delta) denotes the difference between the
mean of the population that was sampled to construct the prediction interval, and
the mean of the population that will be sampled to produce the future observations.
The quantity σ (sigma) denotes the population standard deviation of both
of these populations. The argument delta.over.sigma
corresponds to the
quantity Δ/σ.
Approximate Power (method="approx"
)
Based on Gansecki (2009), the power of a nonparametric simultaneous prediction
interval when the underlying observations come from a nomral distribution
can be approximated by the power of a normal simultaneous prediction
interval (see predIntNormSimultaneousTestPower
) where the multiplier
K is replaced with the expected value of the normal order statistic that
corresponds to the rank of the order statistic used for the upper or lower bound
of the prediction interval. Gansecki (2009) uses the approximation:
K = Φ^{-1}(\frac{i - 0.5}{n}) \;\;\;\;\;\; (8)
where Φ denotes the cumulative distribution function of the standard
normal distribution and i denotes the rank of the order statistic used
as the prediction limit. By default, the value of the argument evNormOrdStats.method="royston"
, so the function
predIntNparSimultaneousTestPower
uses the exact value of the
expected value of the normal order statistic in the call to
evNormOrdStatsScalar
. You can change the
method of computing the expected value of the normal order statistic by
changing the value of the argument evNormOrdStats.method
.
Power Based on Monte Carlo Simulation (method="simulate"
)
When method="simulate"
, the power of the nonparametric simultaneous
prediction interval is estimated based on a Monte Carlo simulation. The argument
NMC
determines the number of Monte Carlo trials. If ci=TRUE
, a
confidence interval for the power is created based on the NMC
Monte Carlo
estimates of power.
vector of values between 0 and 1 equal to the probability that the rule will be violated.
See the help file for predIntNparSimultaneous
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled difference if one of the objectives of the sampling program is to determine
whether two distributions differ from each other. The functions
predIntNparSimultaneousTestPower
and plotPredIntNparSimultaneousTestPowerCurve
can be
used to investigate these relationships for the case of normally-distributed
observations.
Steven P. Millard (EnvStats@ProbStatInfo.com)
See the help file for predIntNparSimultaneous
.
Gansecki, M. (2009). Using the Optimal Rank Values Calculator. US Environmental Protection Agency, Region 8, March 10, 2009.
# Example 19-5 of USEPA (2009, p. 19-33) shows how to compute nonparametric upper # simultaneous prediction limits for various rules based on trace mercury data (ppb) # collected in the past year from a site with four background wells and 10 compliance # wells (data for two of the compliance wells are shown in the guidance document). # The facility must monitor the 10 compliance wells for five constituents # (including mercury) annually. # Here we will compute the confidence levels and powers associated with # two different sampling plans: # 1) the 1-of-2 retesting plan for a median of order 3 using the # background maximum and # 2) the 1-of-4 plan on individual observations using the 3rd highest # background value. # Power will be computed assuming a normal distribution and setting # delta.over.sigma equal to 2, 3, and 4. # The data for this example are stored in EPA.09.Ex.19.5.mercury.df. # We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 20 to use to construct the prediction limit. # There are 10 compliance wells and we will monitor 5 different # constituents at each well annually. For this example, USEPA (2009) # recommends setting r to the product of the number of compliance wells and # the number of evaluations per year. # To determine the minimum confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the maximum allowed individual Type I Error level per constituent to: # 1 - (1 - SWFPR)^(1 / Number of Constituents) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / Number of Constituents) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. Thus, the required individual Type I Error level # and confidence level per constituent are given as follows: # n = 20 based on 4 Background Wells # nw = 10 Compliance Wells # nc = 5 Constituents # ne = 1 Evaluation per year n <- 20 nw <- 10 nc <- 5 ne <- 1 # Set number of future sampling occasions r to # Number Compliance Wells x Number Evaluations per Year r <- nw * ne conf.level <- (1 - 0.1)^(1 / nc) conf.level #[1] 0.9791484 # So the required confidence level is 0.98, or 98%. # Now determine the confidence level associated with each plan. # Note that both plans achieve the required confidence level. # 1) the 1-of-2 retesting plan for a median of order 3 using the # background maximum predIntNparSimultaneousConfLevel(n = 20, n.median = 3, k = 1, m = 2, r = r) #[1] 0.9940354 # 2) the 1-of-4 plan based on individual observations using the 3rd highest # background value. predIntNparSimultaneousConfLevel(n = 20, k = 1, m = 4, r = r, n.plus.one.minus.upl.rank = 3) #[1] 0.9864909 #------------------------------------------------------------------------------ # Compute approximate power of each plan to detect contamination at just 1 well # assuming true underying distribution of Hg is Normal at all wells and # using delta.over.sigma equal to 2, 3, and 4. #------------------------------------------------------------------------------ # Computer aproximate power for # 1) the 1-of-2 retesting plan for a median of order 3 using the # background maximum predIntNparSimultaneousTestPower(n = 20, n.median = 3, k = 1, m = 2, r = r, delta.over.sigma = 2:4, r.shifted = 1) #[1] 0.3953712 0.9129671 0.9983054 # Compute approximate power for # 2) the 1-of-4 plan based on individual observations using the 3rd highest # background value. predIntNparSimultaneousTestPower(n = 20, k = 1, m = 4, r = r, n.plus.one.minus.upl.rank = 3, delta.over.sigma = 2:4, r.shifted = 1) #[1] 0.4367972 0.8694664 0.9888779 #---------- ## Not run: # Compare estimated power using approximation method with estimated power # using Monte Carlo simulation for the 1-of-4 plan based on individual # observations using the 3rd highest background value. predIntNparSimultaneousTestPower(n = 20, k = 1, m = 4, r = r, n.plus.one.minus.upl.rank = 3, delta.over.sigma = 2:4, r.shifted = 1, method = "simulate", ci = TRUE, NMC = 1000) #[1] 0.437 0.863 0.989 #attr(,"conf.int") # [,1] [,2] [,3] #LCL 0.4111999 0.8451148 0.9835747 #UCL 0.4628001 0.8808852 0.9944253 ## End(Not run) #========== # Cleanup #-------- rm(n, nw, nc, ne, r, conf.level)
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.