Nonparametric Simultaneous Prediction Interval for a Continuous Distribution
Construct a nonparametric simultaneous prediction interval for the next r sampling “occasions” based on one of three possible rules: k-of-m, California, or Modified California. The simultaneous prediction interval assumes the observations from from a continuous distribution.
predIntNparSimultaneous(x, n.median = 1, k = 1, m = 2, r = 1, rule = "k.of.m", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), lb = -Inf, ub = Inf, pi.type = "upper", integrate.args.list = NULL)
x |
a numeric vector of observations. Missing ( |
n.median |
positive odd integer specifying the sample size associated with the future medians.
The default value is |
k |
for the k-of-m rule ( |
m |
positive integer specifying the maximum number of future observations (or
medians) on one future sampling “occasion”.
The default value is |
r |
positive integer specifying the number of future sampling “occasions”.
The default value is |
rule |
character string specifying which rule to use. The possible values are
|
lpl.rank |
positive integer indicating the rank of the order statistic to use for the lower
bound of the prediction interval. When |
n.plus.one.minus.upl.rank |
positive integer related to the rank of the order statistic to use for the upper
bound of the prediction interval. A value of |
lb, ub |
scalars indicating lower and upper bounds on the distribution. By default,
|
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
integrate.args.list |
a list of arguments to supply to the |
What is a Nonparametric Simultaneous Prediction Interval?
A nonparametric prediction interval for some population is an interval on the real line
constructed so that it will contain at least k of m future observations from
that population with some specified probability (1-α)100\%, where
0 < α < 1 and k and m are some pre-specified positive integers
and k ≤ m. The quantity (1-α)100\% is called
the confidence coefficient or confidence level associated with the prediction
interval. The function predIntNpar
computes a standard
nonparametric prediction interval.
The function predIntNparSimultaneous
computes a nonparametric simultaneous
prediction interval that will contain a certain number of future observations
with probability (1-α)100\% for each of r future sampling
“occasions”,
where r is some pre-specified positive integer. The quantity r may
refer to r distinct future sampling occasions in time, or it may for example
refer to sampling at r distinct locations on one future sampling occasion,
assuming that the population standard deviation is the same at all of the r
distinct locations.
The function predIntNparSimultaneous
computes a nonparametric simultaneous
prediction interval based on one of three possible rules:
For the k-of-m rule (rule="k.of.m"
), at least k of
the next m future observations will fall in the prediction
interval with probability (1-α)100\% on each of the r future
sampling occasions. If obserations are being taken sequentially, for a particular
sampling occasion, up to m observations may be taken, but once
k of the observations fall within the prediction interval, sampling can stop.
Note: For this rule, when r=1, the results of predIntNparSimultaneous
are equivalent to the results of predIntNpar
.
For the California rule (rule="CA"
), with probability
(1-α)100\%, for each of the r future sampling occasions, either
the first observation will fall in the prediction interval, or else all of the next
m-1 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise,
m-1 more observations must be taken.
For the Modified California rule (rule="Modified.CA"
), with probability
(1-α)100\%, for each of the r future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.
Nonparametric simultaneous prediction intervals can be extended to using medians
in place of single observations (USEPA, 2009, Chapter 19). That is, you can
create a nonparametric simultaneous prediction interval that will contain a
specified number of medians (based on which rule you choose) on each of r
future sampling occassions, where each each median is based on b individual
observations. For the function predIntNparSimultaneous
, the argument
n.median
corresponds to b.
The Form of a Nonparametric Prediction Interval
Let \underline{x} = x_1, x_2, …, x_n denote a vector of n
independent observations from some continuous distribution, and let
x_{(i)} denote the the i'th order statistics in \underline{x}.
A two-sided nonparametric prediction interval is constructed as:
[x_{(u)}, x_{(v)}] \;\;\;\;\;\; (1)
where u and v are positive integers between 1 and n, and u < v. That is, u denotes the rank of the lower prediction limit, and v denotes the rank of the upper prediction limit. To make it easier to write some equations later on, we can also write the prediction interval (1) in a slightly different way as:
[x_{(u)}, x_{(n + 1 - w)}] \;\;\;\;\;\; (2)
where
w = n + 1 - v \;\;\;\;\;\; (3)
so that w is a positive integer between 1 and n-1, and
u < n+1-w. In terms of the arguments to the function predIntNparSimultaneous
,
the argument lpl.rank
corresponds to u, and the argument
n.plus.one.minus.upl.rank
corresponds to w.
If we allow u=0 and w=0 and define lower and upper bounds as:
x_{(0)} = lb \;\;\;\;\;\; (4)
x_{(n+1)} = ub \;\;\;\;\;\; (5)
then Equation (2) above can also represent a one-sided lower or one-sided upper prediction interval as well. That is, a one-sided lower nonparametric prediction interval is constructed as:
[x_{(u)}, x_{(n + 1)}] = [x_{(u)}, ub] \;\;\;\;\;\; (6)
and a one-sided upper nonparametric prediction interval is constructed as:
[x_{(0)}, x_{(n + 1 - w)}] = [lb, x_{(n + 1 - w)}] \;\;\;\;\;\; (7)
Usually, lb = -∞ or lb = 0 and ub = ∞.
Note: For nonparametric simultaneous prediction intervals, only lower
(pi.type="lower"
) and upper (pi.type="upper"
) prediction
intervals are available.
Constructing Nonparametric Simultaneous Prediction Intervals for Future Observations
First we will show how to construct a nonparametric simultaneous prediction interval based on
future observations (i.e., b=1, n.median=1
), and then extend the formulas to
future medians.
Simultaneous Prediction Intervals for the k-of-m Rule (rule="k.of.m"
)
For the k-of-m rule (rule="k.of.m"
) with w=1
(i.e., n.median=1
), at least k of the next m future
observations will fall in the prediction interval
with probability (1-α)100\% on each of the r future sampling
occasions. If observations are being taken sequentially, for a particular
sampling occasion, up to m observations may be taken, but once k
of the observations fall within the prediction interval, sampling can stop.
Note: When r=1, this kind of simultaneous prediction
interval becomes the same as a standard nonparametric prediction interval
(see predIntNpar
).
Chou and Owen (1986) developed the theory for nonparametric simultaneous prediction limits for various rules, including the 1-of-m rule. Their theory, however, does not cover the California or Modified California rules, and uses an r-fold summation involving a minimum of 2^r terms. Davis and McNichols (1994b; 1999) extended the results of Chou and Owen (1986) to include the California and Modified California rule, and developed algorithms that involve summing far fewer terms.
Davis and McNichols (1999) give formulas for the probabilities associated with the one-sided upper simultaneous prediction interval shown in Equation (7). For the k-of-m rule, the probability that at least k of the next m future observations will be contained in the interval given in Equation (7) for each of r future sampling occasions is given by:
1 - α | = | E[∑_{i=0}^{m-k} {{k-1+i} \choose {k-1}} Y^k (1-Y)^i]^r |
= | \int_0^1 [∑_{i=0}^{m-k} {{k-1+i} \choose {k-1}} y^k (1-y)^i]^r f(y) dy \;\;\;\;\;\; (8) | |
where Y denotes a random variable with a beta distribution
with parameters v and n+1-v, and f() denotes the pdf of this
distribution. Note that v denotes the rank of the order statistic used as the
upper prediction limit (i.e., n.plus.one.minus.upl.rank=
n+1-v), and
that v is usually equal to n.
Also note that the summation term in Equation (8) corresponds to the cumulative
distribution function of a Negative Binomial distribution
with parameters size=
k and prob=
y evaluated at
q=
m-k.
When pi.type="lower"
, Y denotes a random variable with a
beta distribution with parameters n+1-u and
u. Note that u denotes the rank of the order statistic used as the
lower prediction limit (i.e., lpl.rank=
u), and
that u is usually equal to 1.
Simultaneous Prediction Intervals for the California Rule (rule="CA"
)
For the California rule (rule="CA"
), with probability (1-α)100\%,
for each of the r future sampling occasions, either the first observation will
fall in the prediction interval, or else all of the next m-1 observations will
fall in the prediction interval. That is, if the first observation falls in the
prediction interval then sampling can stop. Otherwise, m-1 more observations
must be taken.
In this case, the probability is given by:
1 - α | = | E[∑_{i=0}^r {r \choose i} Y^{r - i + (m-1)i} (1-Y)^i] |
= | \int_0^1 [∑_{i=0}^r {r \choose i} y^{r - i + (m-1)i} (1-y)^i] f(y) dy \;\;\;\;\;\; (9) | |
Simultaneous Prediction Intervals for the Modified California Rule (rule="Modified.CA"
)
For the Modified California rule (rule="Modified.CA"
), with probability
(1-α)100\%, for each of the r future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.
In this case, the probability is given by:
1 - α | = | E[Y^r (1 + Q + Q^2 - 2Q^3)^r] |
= | \int_0^1 [y^r (1 + q + q^2 - 2q^3)^r] f(y) dy \;\;\;\;\;\; (10) | |
where Q = 1 - Y and q = 1 - y.
Davis and McNichols (1999) provide algorithms for computing the probabilities based on expanding
polynomials and the formula for the expected value of a beta random variable. In the discussion
section of Davis and McNichols (1999), however, Vangel points out that numerical integration is
adequate, and this is how these probabilities are computed in the function
predIntNparSimultaneous
.
Constructing Nonparametric Simultaneous Prediction Intervals for Future Medians
USEPA (2009, Chapter 19; Cameron, 2011) extends nonparametric simultaneous
prediction intervals to testing future medians for the case of the 1-of-1 and
1-of-2 plans for medians of order 3. In general, each of the rules
(k-of-m, California, and Modified California) can be easily
extended to the case of using medians as long as the medians are based on an
odd (as opposed to even) sample size.
For each of the above rules, if we are interested in using medians instead of
single observations (i.e., b ≥ 1; n.median
≥ 1), and we
force b to be odd, then a median will be less than a prediction limit
once (b+1)/2 observations are less than the prediction limit. Thus,
Equations (8) - (10) are modified by replacing y with the term:
∑_{i=0}^{b - b'} {{b' - 1 + i} \choose {b' - 1}} y^{b'} (1 - y)^i \;\;\;\;\;\; (11)
where
b' = \frac{b+1}{2} \;\;\;\;\;\; (12)
a list of class "estimate"
containing the simultaneous prediction interval
and other information. See the help file for estimate.object
for
details.
Prediction and tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973; Krishnamoorthy and Mathew, 2009). In the context of environmental statistics, prediction intervals are useful for analyzing data from groundwater detection monitoring programs at hazardous and solid waste facilities (e.g., Gibbons et al., 2009; Millard and Neerchal, 2001; USEPA, 2009).
Steven P. Millard (EnvStats@ProbStatInfo.com)
Cameron, Kirk. (2011). Personal communication, February 16, 2011. MacStat Consulting, Ltd., Colorado Springs, Colorado.
Chew, V. (1968). Simultaneous Prediction Intervals. Technometrics 10(2), 323–331.
Danziger, L., and S. Davis. (1964). Tables of Distribution-Free Tolerance Limits. Annals of Mathematical Statistics 35(5), 1361–1365.
Davis, C.B. (1994). Environmental Regulatory Statistics. In Patil, G.P., and C.R. Rao, eds., Handbook of Statistics, Vol. 12: Environmental Statistics. North-Holland, Amsterdam, a division of Elsevier, New York, NY, Chapter 26, 817–865.
Davis, C.B., and R.J. McNichols. (1987). One-sided Intervals for at Least p of m Observations from a Normal Population on Each of r Future Occasions. Technometrics 29, 359–370.
Davis, C.B., and R.J. McNichols. (1994a). Ground Water Monitoring Statistics Update: Part I: Progress Since 1988. Ground Water Monitoring and Remediation 14(4), 148–158.
Davis, C.B., and R.J. McNichols. (1994b). Ground Water Monitoring Statistics Update: Part II: Nonparametric Prediction Limits. Ground Water Monitoring and Remediation 14(4), 159–175.
Davis, C.B., and R.J. McNichols. (1999). Simultaneous Nonparametric Prediction Limits (with Discusson). Technometrics 41(2), 89–112.
Gibbons, R.D. (1987a). Statistical Prediction Intervals for the Evaluation of Ground-Water Quality. Ground Water 25, 455–465.
Gibbons, R.D. (1991b). Statistical Tolerance Limits for Ground-Water Monitoring. Ground Water 29, 563–570.
Gibbons, R.D., and J. Baker. (1991). The Properties of Various Statistical Prediction Intervals for Ground-Water Detection Monitoring. Journal of Environmental Science and Health A26(4), 535–553.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York, 392pp.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178–188.
Hall, I.J., R.R. Prairie, and C.K. Motlagh. (1975). Non-Parametric Prediction Intervals. Journal of Quality Technology 7(3), 109–114.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
# Generate 20 observations from a lognormal mixture distribution with # parameters mean1=1, cv1=0.5, mean2=5, cv2=1, and p.mix=0.1. Use # predIntNparSimultaneous to construct an upper one-sided prediction interval # using the maximum observed value using the 1-of-3 rule. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnormMixAlt(n = 20, mean1 = 1, cv1 = 0.5, mean2 = 5, cv2 = 1, p.mix = 0.1) predIntNparSimultaneous(dat, k = 1, m = 3, lb = 0) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 99.94353% # #Prediction Limit Rank(s): 20 # #Minimum Number of #Future Observations #Interval Should Contain: 1 # #Total Number of #Future Observations: 3 # #Prediction Interval: LPL = 0.000000 # UPL = 1.817311 #---------- # Compare the confidence levels for the 1-of-3 rule, California Rule, and # Modified California Rule. predIntNparSimultaneous(dat, k = 1, m = 3, lb = 0)$interval$conf.level #[1] 0.9994353 predIntNparSimultaneous(dat, m = 3, rule = "CA", lb = 0)$interval$conf.level #[1] 0.9919066 predIntNparSimultaneous(dat, rule = "Modified.CA", lb = 0)$interval$conf.level #[1] 0.9984943 #========= # Repeat the above example, but create the baseline data using just # n=8 observations and set r to 4 future sampling occasions set.seed(598) dat <- rlnormMixAlt(n = 8, mean1 = 1, cv1 = 0.5, mean2 = 5, cv2 = 1, p.mix = 0.1) predIntNparSimultaneous(dat, k = 1, m = 3, r = 4, lb = 0) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: dat # #Sample Size: 8 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 97.7599% # #Prediction Limit Rank(s): 8 # #Minimum Number of #Future Observations #Interval Should Contain #(per Sampling Occasion): 1 # #Total Number of #Future Observations #(per Sampling Occasion): 3 # #Number of Future #Sampling Occasions: 4 # #Prediction Interval: LPL = 0.000000 # UPL = 5.683453 #---------- # Compare the confidence levels for the 1-of-3 rule, California Rule, and # Modified California Rule. predIntNparSimultaneous(dat, k = 1, m = 3, r = 4, lb = 0)$interval$conf.level #[1] 0.977599 predIntNparSimultaneous(dat, m = 3, r = 4, rule = "CA", lb = 0)$interval$conf.level #[1] 0.8737798 predIntNparSimultaneous(dat, r = 4, rule = "Modified.CA", lb = 0)$interval$conf.level #[1] 0.9510178 #========== # Example 19-5 of USEPA (2009, p. 19-33) shows how to compute nonparametric upper # simultaneous prediction limits for various rules based on trace mercury data (ppb) # collected in the past year from a site with four background wells and 10 compliance # wells (data for two of the compliance wells are shown in the guidance document). # The facility must monitor the 10 compliance wells for five constituents # (including mercury) annually. # Here we will compute the confidence level associated with two different sampling plans: # 1) the 1-of-2 retesting plan for a median of order 3 using the background maximum and # 2) the 1-of-4 plan on individual observations using the 3rd highest background value. # The data for this example are stored in EPA.09.Ex.19.5.mercury.df. # We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 20 to use to construct the prediction limit. # There are 10 compliance wells and we will monitor 5 different # constituents at each well annually. For this example, USEPA (2009) # recommends setting r to the product of the number of compliance wells and # the number of evaluations per year. # To determine the minimum confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the maximum allowed individual Type I Error level per constituent to: # 1 - (1 - SWFPR)^(1 / Number of Constituents) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / Number of Constituents) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. Thus, the required individual Type I Error level # and confidence level per constituent are given as follows: # n = 20 based on 4 Background Wells # nw = 10 Compliance Wells # nc = 5 Constituents # ne = 1 Evaluation per year n <- 20 nw <- 10 nc <- 5 ne <- 1 # Set number of future sampling occasions r to # Number Compliance Wells x Number Evaluations per Year r <- nw * ne conf.level <- (1 - 0.1)^(1 / nc) conf.level #[1] 0.9791484 alpha <- 1 - conf.level alpha #[1] 0.02085164 #---------- # Look at the data: head(EPA.09.Ex.19.5.mercury.df) # Event Well Well.type Mercury.ppb.orig Mercury.ppb Censored #1 1 BG-1 Background 0.21 0.21 FALSE #2 2 BG-1 Background <.2 0.20 TRUE #3 3 BG-1 Background <.2 0.20 TRUE #4 4 BG-1 Background <.2 0.20 TRUE #5 5 BG-1 Background <.2 0.20 TRUE #6 6 BG-1 Background NA FALSE longToWide(EPA.09.Ex.19.5.mercury.df, "Mercury.ppb.orig", "Event", "Well", paste.row.name = TRUE) # BG-1 BG-2 BG-3 BG-4 CW-1 CW-2 #Event.1 0.21 <.2 <.2 <.2 0.22 0.36 #Event.2 <.2 <.2 0.23 0.25 0.2 0.41 #Event.3 <.2 <.2 <.2 0.28 <.2 0.28 #Event.4 <.2 0.21 0.23 <.2 0.25 0.45 #Event.5 <.2 <.2 0.24 <.2 0.24 0.43 #Event.6 <.2 0.54 # Construct the upper simultaneous prediction limit using the 1-of-2 # retesting plan for a median of order 3 based on the background maximum Hg.Back <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb[Well.type == "Background"]) pred.int.1.of.2.med.3 <- predIntNparSimultaneous(Hg.Back, n.median = 3, k = 1, m = 2, r = r, lb = 0) pred.int.1.of.2.med.3 #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: Hg.Back # #Sample Size: 20 # #Number NA/NaN/Inf's: 4 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 99.40354% # #Prediction Limit Rank(s): 20 # #Minimum Number of #Future Medians #Interval Should Contain #(per Sampling Occasion): 1 # #Total Number of #Future Medians #(per Sampling Occasion): 2 # #Number of Future #Sampling Occasions: 10 # #Sample Size for Medians: 3 # #Prediction Interval: LPL = 0.00 # UPL = 0.28 # Note that the achieved confidence level of 99.4% is greater than the # required confidence level of 97.9%. # Now determine whether either compliance well indicates evidence of # Mercury contamination. # Compliance Well 1 #------------------ Hg.CW.1 <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb.orig[Well == "CW-1"]) Hg.CW.1 #[1] "0.22" "0.2" "<.2" "0.25" "0.24" "<.2" # The median of the first 3 observations is 0.2, which is less than # the UPL of 0.28, so there is no evidence of contamination. # Compliance Well 2 #------------------ Hg.CW.2 <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb.orig[Well == "CW-2"]) Hg.CW.2 #[1] "0.36" "0.41" "0.28" "0.45" "0.43" "0.54" # The median of the first 3 observations is 0.36, so 3 more observations have to # be looked at. The median of the second 3 observations is 0.45, which is # larger than the UPL of 0.28, so there is evidence of contamination. #---------- # Now create the upper simultaneous prediction limit using the 1-of-4 plan # on individual observations using the 3rd highest background value. pred.int.1.of.4.3rd <- predIntNparSimultaneous(Hg.Back, k = 1, m = 4, r = r, lb = 0, n.plus.one.minus.upl.rank = 3) pred.int.1.of.4.3rd #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: Hg.Back # #Sample Size: 20 # #Number NA/NaN/Inf's: 4 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 98.64909% # #Prediction Limit Rank(s): 18 # #Minimum Number of #Future Observations #Interval Should Contain #(per Sampling Occasion): 1 # #Total Number of #Future Observations #(per Sampling Occasion): 4 # #Number of Future #Sampling Occasions: 10 # #Prediction Interval: LPL = 0.00 # UPL = 0.24 # Note that the achieved confidence level of 98.6% is greater than the # required confidence level of 97.9%. # Now determine whether either compliance well indicates evidence of # Mercury contamination. # Compliance Well 1 #------------------ Hg.CW.1 <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb.orig[Well == "CW-1"]) Hg.CW.1 #[1] "0.22" "0.2" "<.2" "0.25" "0.24" "<.2" # The first observation is less than the UPL of 0.24, which is less than # the UPL of 0.28, so there is no evidence of contamination. # Compliance Well 2 #------------------ Hg.CW.2 <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb.orig[Well == "CW-2"]) Hg.CW.2 #[1] "0.36" "0.41" "0.28" "0.45" "0.43" "0.54" # All of the first 4 observations are greater than the UPL of 0.24, so there # is evidence of contamination. #========== # Cleanup #-------- rm(dat, n, nw, nc, ne, r, conf.level, alpha, Hg.Back, pred.int.1.of.2.med.3, pred.int.1.of.4.3rd, Hg.CW.1, Hg.CW.2)
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.