Estimate Mean and Coefficient of Variation for a Gamma Distribution Based on Type I Censored Data
Estimate the mean and coefficient of variation of a gamma distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for the mean.
egammaAltCensored(x, censored, method = "mle", censoring.side = "left", ci = FALSE, ci.method = "profile.likelihood", ci.type = "two-sided", conf.level = 0.95, n.bootstraps = 1000, pivot.statistic = "z", ci.sample.size = sum(!censored))
x |
numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed.
censored |
numeric or logical vector indicating which values of |
method |
character string specifying the method of estimation. Currently, the only
available method is maximum likelihood (method="mle").
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are "left" (the default) and "right".
ci |
logical scalar indicating whether to compute a confidence interval for the
mean. The default value is ci=FALSE.
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean. The possible values are "profile.likelihood" (the default), "normal.approx", and "bootstrap".
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is conf.level=0.95.
n.bootstraps |
numeric scalar indicating how many bootstraps to use to construct the
confidence interval for the mean when ci.method="bootstrap". The default value is n.bootstraps=1000.
pivot.statistic |
character string indicating which pivot statistic to use in the construction
of the confidence interval for the mean when ci.method="normal.approx". The default value is pivot.statistic="z".
ci.sample.size |
numeric scalar indicating what sample size to assume to construct the
confidence interval for the mean if ci.method="normal.approx". The default value is the number of uncensored observations.
If x or censored contain any missing (NA), undefined (NaN) or infinite (Inf, -Inf) values, they will be removed prior to performing the estimation.
Let \underline{x} denote a vector of N observations from a
gamma distribution with parameters shape=κ and scale=θ.
The relationship between these parameters and the mean μ
and coefficient of variation τ of this distribution is given by:
κ = τ^{-2} \;\;\;\;\;\; (1)
θ = μ/κ \;\;\;\;\;\; (2)
μ = κ \; θ \;\;\;\;\;\; (3)
τ = κ^{-1/2} \;\;\;\;\;\; (4)
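As a small illustration of Equations (1)-(4), the following sketch converts between the two parameterizations in base R. The function names are hypothetical and are not part of EnvStats.

```r
# Convert between (mean, cv) and (shape, scale) for a gamma distribution,
# following Equations (1)-(4). Function names are illustrative only.
mean_cv_to_shape_scale <- function(mean, cv) {
  shape <- cv^(-2)        # Equation (1)
  scale <- mean / shape   # Equation (2)
  c(shape = shape, scale = scale)
}

shape_scale_to_mean_cv <- function(shape, scale) {
  c(mean = shape * scale, # Equation (3)
    cv   = shape^(-1/2))  # Equation (4)
}

p <- mean_cv_to_shape_scale(mean = 10, cv = 0.5)
p                                                    # shape = 4, scale = 2.5
shape_scale_to_mean_cv(p[["shape"]], p[["scale"]])   # mean = 10, cv = 0.5
```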
Assume n (0 < n < N) of these observations are known and c (c=N-n) of these observations are all censored below (left-censored) or all censored above (right-censored) at k fixed censoring levels
T_1, T_2, …, T_k; \; k ≥ 1 \;\;\;\;\;\; (5)
For the case when k ≥ 2, the data are said to be Type I multiply censored. For the case when k=1, set T = T_1. If the data are left-censored and all n known observations are greater than or equal to T, or if the data are right-censored and all n known observations are less than or equal to T, then the data are said to be Type I singly censored (Nelson, 1982, p.7), otherwise they are considered to be Type I multiply censored.
Let c_j denote the number of observations censored below or above censoring level T_j for j = 1, 2, …, k, so that
∑_{j=1}^k c_j = c \;\;\;\;\;\; (6)
Let x_{(1)}, x_{(2)}, …, x_{(N)} denote the “ordered” observations, where now “observation” means either the actual observation (for uncensored observations) or the censoring level (for censored observations). For right-censored data, if a censored observation has the same value as an uncensored one, the uncensored observation should be placed first. For left-censored data, if a censored observation has the same value as an uncensored one, the censored observation should be placed first.
Note that in this case the quantity x_{(i)} does not necessarily represent the i'th “largest” observation from the (unknown) complete sample.
Finally, let Ω (omega) denote the set of n subscripts in the
“ordered” sample that correspond to uncensored observations.
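The ordering convention for ties can be sketched in base R with a secondary sort key. This is only an illustration of the rule described above; the helper name order_censored and the toy data are assumptions, not part of EnvStats.

```r
# Order observations so that ties between censored and uncensored values
# follow the convention described above. `order_censored` is a hypothetical helper.
order_censored <- function(x, censored, censoring.side = c("left", "right")) {
  censoring.side <- match.arg(censoring.side)
  # Secondary sort key (FALSE sorts before TRUE):
  #   left-censored data:  censored value first   -> key = !censored
  #   right-censored data: uncensored value first -> key = censored
  tie_key <- if (censoring.side == "left") !censored else censored
  ord <- order(x, tie_key)
  list(x = x[ord],
       censored = censored[ord],
       omega = which(!censored[ord]))  # subscripts of uncensored observations
}

# Toy data: the value 5 appears once as a censoring level and once as an observation
x    <- c(5, 12.1, 5, 2, 3.3)
cens <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

left  <- order_censored(x, cens, "left")
right <- order_censored(x, cens, "right")
left$omega    # 2 4 5
right$omega   # 2 3 5
```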
Estimation
Maximum Likelihood Estimation (method="mle")
For Type I left censored data, the likelihood function is given by:
L(μ, τ | \underline{x}) = {N \choose c_1 c_2 … c_k n} ∏_{j=1}^k [F(T_j)]^{c_j} ∏_{i \in Ω} f[x_{(i)}] \;\;\;\;\;\; (7)
where f and F denote the probability density function (pdf) and cumulative distribution function (cdf) of the population (Cohen, 1963; Cohen, 1991, pp.6, 50). That is,
f(t) = \frac{t^{κ-1} e^{-t/θ}}{θ^κ Γ(κ)} \;\;\;\;\;\; (8)
(Johnson et al., 1994, p.343), where κ and θ are defined in terms of μ and τ by Equations (1) and (2) above.
For left singly censored data, Equation (7) simplifies to:
L(μ, τ | \underline{x}) = {N \choose c} [F(T)]^{c} ∏_{i = c+1}^N f[x_{(i)}] \;\;\;\;\;\; (9)
Similarly, for Type I right censored data, the likelihood function is given by:
L(μ, τ | \underline{x}) = {N \choose c_1 c_2 … c_k n} ∏_{j=1}^k [1 - F(T_j)]^{c_j} ∏_{i \in Ω} f[x_{(i)}] \;\;\;\;\;\; (10)
and for right singly censored data this simplifies to:
L(μ, τ | \underline{x}) = {N \choose c} [1 - F(T)]^{c} ∏_{i = 1}^n f[x_{(i)}] \;\;\;\;\;\; (11)
The maximum likelihood estimators are computed by minimizing the
negative log-likelihood function.
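The following is a minimal base R sketch of this estimation step, not the code used inside egammaAltCensored. It assumes the likelihood in Equations (7) and (10) with the constant multinomial term dropped, optimizes on the log scale to keep both parameters positive, and uses the manganese values shown in the example table on this page. The name gamma_nll and the starting values are assumptions made for illustration.

```r
# Manganese data (ppb) transcribed from the example table on this page;
# censored values are recorded at their detection limit (2 or 5).
x <- c(5, 12.1, 16.9, 21.6, 2,   5, 7.7, 53.6, 9.5, 45.9,
       5, 5.3, 12.6, 106.3, 34.5, 6.3, 11.9, 10, 2, 77.2,
       17.9, 22.7, 3.3, 8.4, 2)
censored <- seq_along(x) %in% c(1, 5, 6, 11, 19, 25)

# Negative log-likelihood as a function of log(mean) and log(cv).
# The multinomial constant is omitted because it does not involve the parameters.
gamma_nll <- function(log_par, x, censored, censoring.side = "left") {
  mu  <- exp(log_par[1])
  tau <- exp(log_par[2])
  shape <- tau^(-2)     # Equation (1)
  scale <- mu / shape   # Equation (2)
  ll_uncensored <- dgamma(x[!censored], shape = shape, scale = scale, log = TRUE)
  # log F(T_j) for left censoring, log[1 - F(T_j)] for right censoring
  ll_censored <- pgamma(x[censored], shape = shape, scale = scale,
                        lower.tail = (censoring.side == "left"), log.p = TRUE)
  -(sum(ll_uncensored) + sum(ll_censored))
}

# Crude starting values: sample mean and sample cv, treating censored values as observed
start <- log(c(mean(x), sd(x) / mean(x)))
fit <- optim(start, gamma_nll, x = x, censored = censored)
est <- setNames(exp(fit$par), c("mean", "cv"))
est
```

If this sketch matches the likelihood used by the package, the result should be close to the estimates printed in the example output for these data.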
Confidence Intervals
This section explains how confidence intervals for the mean μ are
computed.
Likelihood Profile (ci.method="profile.likelihood")
This method was proposed by Cox (1970, p.88), and Venzon and Moolgavkar (1988)
introduced an efficient method of computation. This method is also discussed by
Stryhn and Christensen (2003) and Royston (2007).
The idea behind this method is to invert the likelihood-ratio test to obtain a
confidence interval for the mean μ while treating the coefficient of variation
τ as a nuisance parameter. Equation (7) above
shows the form of the likelihood function L(μ, τ | \underline{x}) for
multiply left-censored data, where μ and τ are defined by
Equations (3) and (4), and Equation (10) shows the function for
multiply right-censored data.
Following Stryhn and Christensen (2003), denote the maximum likelihood estimates of the mean and coefficient of variation by (μ^*, τ^*). The likelihood ratio test statistic (G^2) of the hypothesis H_0: μ = μ_0 (where μ_0 is a fixed value) equals the drop in 2 log(L) between the “full” model and the reduced model with μ fixed at μ_0, i.e.,
G^2 = 2 \{log[L(μ^*, τ^*)] - log[L(μ_0, τ_0^*)]\} \;\;\;\;\;\; (12)
where τ_0^* is the maximum likelihood estimate of τ for the reduced model (i.e., when μ = μ_0). Under the null hypothesis, the test statistic G^2 follows a chi-squared distribution with 1 degree of freedom.
Alternatively, we may express the test statistic in terms of the profile likelihood function L_1 for the mean μ, which is obtained from the usual likelihood function by maximizing over the parameter τ, i.e.,
L_1(μ) = max_{τ} L(μ, τ) \;\;\;\;\;\; (13)
Then we have
G^2 = 2 \{log[L_1(μ^*)] - log[L_1(μ_0)]\} \;\;\;\;\;\; (14)
A two-sided (1-α)100\% confidence interval for the mean μ consists of all values of μ_0 for which the test is not significant at level α:
μ_0: G^2 ≤ χ^2_{1, {1-α}} \;\;\;\;\;\; (15)
where χ^2_{ν, p} denotes the p'th quantile of the
chi-squared distribution with ν degrees of freedom.
One-sided lower and one-sided upper confidence intervals are computed in a similar
fashion, except that the quantity 1-α in Equation (15) is replaced with
1-2α.
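A rough base R sketch of Equations (12)-(15) for left-censored data follows. It is not the Venzon and Moolgavkar algorithm and not the package's implementation: it simply profiles out τ with optimize and finds the two roots of G^2 minus the chi-squared cutoff with uniroot. The search intervals for τ and for μ_0 are assumptions chosen for these data (the manganese values from the example table on this page).

```r
x <- c(5, 12.1, 16.9, 21.6, 2,   5, 7.7, 53.6, 9.5, 45.9,
       5, 5.3, 12.6, 106.3, 34.5, 6.3, 11.9, 10, 2, 77.2,
       17.9, 22.7, 3.3, 8.4, 2)
censored <- seq_along(x) %in% c(1, 5, 6, 11, 19, 25)

# Negative log-likelihood for left-censored data (constant term dropped)
nll <- function(mu, tau) {
  shape <- tau^(-2)
  scale <- mu / shape
  -(sum(dgamma(x[!censored], shape = shape, scale = scale, log = TRUE)) +
    sum(pgamma(x[censored], shape = shape, scale = scale, log.p = TRUE)))
}

# Full model: maximize over both parameters (on the log scale)
fit <- optim(log(c(mean(x), sd(x) / mean(x))),
             function(p) nll(exp(p[1]), exp(p[2])))
mu_hat <- exp(fit$par[1])

# Profile negative log-likelihood: minimize over tau for fixed mu, Equation (13).
# The tau search interval [0.2, 5] is an assumption for this data set.
profile_nll <- function(mu) {
  optimize(function(log_tau) nll(mu, exp(log_tau)),
           interval = log(c(0.2, 5)))$objective
}

# G^2(mu0) from Equation (14) minus the cutoff in Equation (15)
g2_minus_crit <- function(mu0, conf.level = 0.95) {
  2 * (profile_nll(mu0) - fit$value) - qchisq(conf.level, df = 1)
}

# The confidence limits are the roots on either side of the MLE
lcl <- uniroot(g2_minus_crit, lower = mu_hat / 10, upper = mu_hat)$root
ucl <- uniroot(g2_minus_crit, lower = mu_hat, upper = mu_hat * 10)$root
c(LCL = lcl, UCL = ucl)
```

The brackets passed to uniroot must contain a sign change; for other data sets they may need to be widened.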
Normal Approximation (ci.method="normal.approx")
This method constructs approximate (1-α)100\% confidence intervals for
μ based on the assumption that the estimator of μ is
approximately normally distributed. That is, a two-sided (1-α)100\%
confidence interval for μ is constructed as:
[\hat{μ} - t_{1-α/2, m-1}\hat{σ}_{\hat{μ}}, \; \hat{μ} + t_{1-α/2, m-1}\hat{σ}_{\hat{μ}}] \;\;\;\; (16)
where \hat{μ} denotes the estimate of μ, \hat{σ}_{\hat{μ}} denotes the estimated asymptotic standard deviation of the estimator of μ, m denotes the assumed sample size for the confidence interval, and t_{p,ν} denotes the p'th quantile of Student's t-distribution with ν degrees of freedom. One-sided confidence intervals are computed in a similar fashion.
The argument ci.sample.size determines the value of m and by default is equal to the number of uncensored observations. This is simply an ad-hoc method of constructing confidence intervals and is not based on any published theoretical results.
When pivot.statistic="z", the p'th quantile from the standard normal distribution is used in place of the p'th quantile from Student's t-distribution.
The standard deviation of the mle of μ is
estimated based on the inverse of the Fisher Information matrix.
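The sketch below illustrates Equation (16) in base R under one explicit assumption: the Fisher information is approximated by the observed information, that is, a numerical Hessian of the negative log-likelihood in (μ, τ) evaluated at the MLE. The package may compute the information matrix differently. The data are the manganese values from the example table on this page.

```r
x <- c(5, 12.1, 16.9, 21.6, 2,   5, 7.7, 53.6, 9.5, 45.9,
       5, 5.3, 12.6, 106.3, 34.5, 6.3, 11.9, 10, 2, 77.2,
       17.9, 22.7, 3.3, 8.4, 2)
censored <- seq_along(x) %in% c(1, 5, 6, 11, 19, 25)

# Negative log-likelihood in (mean, cv) for left-censored data
nll <- function(par) {
  shape <- par[2]^(-2)
  scale <- par[1] / shape
  -(sum(dgamma(x[!censored], shape = shape, scale = scale, log = TRUE)) +
    sum(pgamma(x[censored], shape = shape, scale = scale, log.p = TRUE)))
}

# MLE, found on the log scale so both parameters stay positive
fit <- optim(log(c(mean(x), sd(x) / mean(x))), function(p) nll(exp(p)))
est <- exp(fit$par)                      # c(mean, cv)

# Observed information (assumption: used here in place of the Fisher information)
H     <- optimHess(est, nll)
vc    <- solve(H)                        # approximate covariance of (mean, cv)
se_mu <- sqrt(vc[1, 1])

alpha <- 0.05
m     <- sum(!censored)                  # default ci.sample.size
ci_z  <- est[1] + c(-1, 1) * qnorm(1 - alpha / 2) * se_mu           # pivot.statistic = "z"
ci_t  <- est[1] + c(-1, 1) * qt(1 - alpha / 2, df = m - 1) * se_mu  # pivot.statistic = "t"
rbind(z = ci_z, t = ci_t)
```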
Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap")
The bootstrap is a nonparametric method of estimating the distribution
(and associated distribution parameters and quantiles) of a sample statistic,
regardless of the distribution of the population from which the sample was drawn.
The bootstrap was introduced by Efron (1979) and a general reference is
Efron and Tibshirani (1993).
In the context of deriving an approximate (1-α)100\% confidence interval for the population mean μ, the bootstrap can be broken down into the following steps:
1. Create a bootstrap sample by taking a random sample of size N from the observations in \underline{x}, where sampling is done with replacement. Note that because sampling is done with replacement, the same element of \underline{x} can appear more than once in the bootstrap sample. Thus, the bootstrap sample will usually not look exactly like the original sample (e.g., the number of censored observations in the bootstrap sample will often differ from the number of censored observations in the original sample).
2. Estimate μ based on the bootstrap sample created in Step 1, using the same method that was used to estimate μ using the original observations in \underline{x}. Because the bootstrap sample usually does not match the original sample, the estimate of μ based on the bootstrap sample will usually differ from the original estimate based on \underline{x}.
3. Repeat Steps 1 and 2 B times, where B is some large number. For the function egammaAltCensored, the number of bootstraps B is determined by the argument n.bootstraps (see the section ARGUMENTS above). The default value of n.bootstraps is 1000.
4. Use the B estimated values of μ to compute the empirical cumulative distribution function of this estimator of μ (see ecdfPlot), and then create a confidence interval for μ based on this estimated cdf.
The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is computed as:
[\hat{G}^{-1}(\frac{α}{2}), \; \hat{G}^{-1}(1-\frac{α}{2})] \;\;\;\;\;\; (17)
where \hat{G}(t) denotes the empirical cdf evaluated at t and thus \hat{G}^{-1}(p) denotes the p'th empirical quantile, that is, the p'th quantile associated with the empirical cdf. Similarly, a one-sided lower confidence interval is computed as:
[\hat{G}^{-1}(α), \; ∞] \;\;\;\;\;\; (18)
and a one-sided upper confidence interval is computed as:
[0, \; \hat{G}^{-1}(1-α)] \;\;\;\;\;\; (19)
The function egammaAltCensored calls the R function quantile to compute the empirical quantiles used in Equations (17)-(19).
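The bootstrap steps and the percentile limits in Equations (17)-(19) can be sketched in base R as follows. This is an illustration only: fit_mean is a hypothetical helper that refits the left-censored gamma MLE of the mean, B is kept at 200 for speed, and the data are the manganese values from the example table on this page.

```r
x <- c(5, 12.1, 16.9, 21.6, 2,   5, 7.7, 53.6, 9.5, 45.9,
       5, 5.3, 12.6, 106.3, 34.5, 6.3, 11.9, 10, 2, 77.2,
       17.9, 22.7, 3.3, 8.4, 2)
censored <- seq_along(x) %in% c(1, 5, 6, 11, 19, 25)

# Hypothetical helper: MLE of the mean for left-censored gamma data
fit_mean <- function(x, censored) {
  nll <- function(p) {
    shape <- exp(p[2])^(-2)
    scale <- exp(p[1]) / shape
    -(sum(dgamma(x[!censored], shape = shape, scale = scale, log = TRUE)) +
      sum(pgamma(x[censored], shape = shape, scale = scale, log.p = TRUE)))
  }
  exp(optim(log(c(mean(x), sd(x) / mean(x))), nll)$par[1])
}

set.seed(42)
B <- 200               # small for speed; the function's default is 1000
N <- length(x)

# Steps 1-3: resample (value, censoring indicator) pairs and re-estimate the mean
boot_means <- replicate(B, {
  idx <- sample.int(N, size = N, replace = TRUE)
  fit_mean(x[idx], censored[idx])
})

# Step 4: percentile limits from the empirical quantiles
alpha <- 0.05
two_sided <- quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2))  # Equation (17)
lower_lim <- quantile(boot_means, probs = alpha)       # Equation (18), upper limit is Inf
upper_lim <- quantile(boot_means, probs = 1 - alpha)   # Equation (19), lower limit is 0
two_sided
```

Resampling the observation and its censoring indicator together is what allows the number of censored values to vary between bootstrap samples.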
The percentile method bootstrap confidence interval is only first-order accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability that the confidence interval will contain the true value of μ can be off by k/√{N}, where k is some constant. Efron and Tibshirani (1993, pp.184-188) proposed a bias-corrected and accelerated interval that is second-order accurate, meaning that the probability that the confidence interval will contain the true value of μ may be off by k/N instead of k/√{N}. The two-sided bias-corrected and accelerated confidence interval is computed as:
[\hat{G}^{-1}(α_1), \; \hat{G}^{-1}(α_2)] \;\;\;\;\;\; (20)
where
α_1 = Φ[\hat{z}_0 + \frac{\hat{z}_0 + z_{α/2}}{1 - \hat{a}(\hat{z}_0 + z_{α/2})}] \;\;\;\;\;\; (21)
α_2 = Φ[\hat{z}_0 + \frac{\hat{z}_0 + z_{1-α/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-α/2})}] \;\;\;\;\;\; (22)
\hat{z}_0 = Φ^{-1}[\hat{G}(\hat{μ})] \;\;\;\;\;\; (23)
\hat{a} = \frac{∑_{i=1}^N (\hat{μ}_{(\cdot)} - \hat{μ}_{(i)})^3}{6[∑_{i=1}^N (\hat{μ}_{(\cdot)} - \hat{μ}_{(i)})^2]^{3/2}} \;\;\;\;\;\; (24)
where the quantity \hat{μ}_{(i)} denotes the estimate of μ using all the values in \underline{x} except the i'th one, and
\hat{μ}_{(\cdot)} = \frac{1}{N} ∑_{i=1}^N \hat{μ}_{(i)} \;\;\;\;\;\; (25)
A one-sided lower confidence interval is given by:
[\hat{G}^{-1}(α_1), \; ∞] \;\;\;\;\;\; (26)
and a one-sided upper confidence interval is given by:
[0, \; \hat{G}^{-1}(α_2)] \;\;\;\;\;\; (27)
where α_1 and α_2 are computed as for a two-sided confidence interval, except α/2 is replaced with α in Equations (21) and (22).
The constant \hat{z}_0 incorporates the bias correction, and the constant \hat{a} is the acceleration constant. The term “acceleration” refers to the rate of change of the standard error of the estimate of μ with respect to the true value of μ (Efron and Tibshirani, 1993, p.186). For a normal (Gaussian) distribution, the standard error of the estimate of μ does not depend on the value of μ, hence the acceleration constant is not really necessary.
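Equations (20)-(25) can be sketched as a small base R function that takes the bootstrap estimates, the leave-one-out (jackknife) estimates, and the original estimate. The function name bca_probs is hypothetical. To keep the example short, the sample mean of an uncensored simulated sample stands in for the censored-data MLE of μ; this substitution is an assumption made only for illustration.

```r
# Adjusted probabilities alpha_1 and alpha_2 for the two-sided BCa interval.
# `bca_probs` is an illustrative name, not an EnvStats function.
bca_probs <- function(boot, jack, est, alpha = 0.05) {
  z0 <- qnorm(mean(boot <= est))               # Equation (23): Phi^{-1}[G-hat(mu-hat)]
  d  <- mean(jack) - jack                      # mu-hat_(.) - mu-hat_(i), using Equation (25)
  a  <- sum(d^3) / (6 * sum(d^2)^(3/2))        # Equation (24)
  z  <- qnorm(c(alpha / 2, 1 - alpha / 2))
  pnorm(z0 + (z0 + z) / (1 - a * (z0 + z)))    # Equations (21) and (22)
}

set.seed(1)
y <- rgamma(30, shape = 2, scale = 5)   # uncensored stand-in sample
estimator <- mean                       # stand-in for the censored-data MLE of the mean

est  <- estimator(y)
boot <- replicate(1000, estimator(sample(y, replace = TRUE)))
jack <- vapply(seq_along(y), function(i) estimator(y[-i]), numeric(1))

probs <- bca_probs(boot, jack, est)
bca_ci <- quantile(boot, probs = probs)   # Equation (20)
bca_ci
```

When the bias correction and the acceleration are both zero, the adjusted probabilities reduce to α/2 and 1-α/2, so the interval coincides with the percentile interval.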
When ci.method="bootstrap", the function egammaAltCensored computes both the percentile method and bias-corrected and accelerated method bootstrap confidence intervals.
a list of class "estimateCensored" containing the estimated parameters and other information. See estimateCensored.object for details.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation. Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators for parameters of a normal or lognormal distribution based on censored environmental data.
In practice, it is better to use a confidence interval for the mean or a joint confidence region for the mean and standard deviation (or coefficient of variation), rather than rely on a single point-estimate of the mean. Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and coefficient of variation of a gamma distribution when data are subjected to single or multiple censoring. See, for example, Singh et al. (2006).
Steven P. Millard (EnvStats@ProbStatInfo.com)
Cohen, A.C. (1963). Progressively Censored Samples in Life Testing. Technometrics 5, 327–339.
Cohen, A.C. (1991). Truncated and Censored Samples. Marcel Dekker, New York, New York, 312pp.
Cox, D.R. (1970). Analysis of Binary Data. Chapman & Hall, London. 142pp.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions, Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Johnson, N.L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York, Chapter 17.
Millard, S.P., P. Dixon, and N.K. Neerchal. (2014; in preparation). Environmental Statistics with R. CRC Press, Boca Raton, Florida.
Nelson, W. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, 634pp.
Royston, P. (2007). Profile Likelihood for Estimation and Confidence Intervals. The Stata Journal 7(3), pp. 376–387.
Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Stryhn, H., and J. Christensen. (2003). Confidence Intervals by the Profile Likelihood Method, with Applications in Veterinary Epidemiology. Contributed paper at ISVEE X (November 2003, Chile). http://people.upei.ca/hstryhn/stryhn208.pdf.
Venzon, D.J., and S.H. Moolgavkar. (1988). A Method for Computing Profile-Likelihood-Based Confidence Intervals. Journal of the Royal Statistical Society, Series C (Applied Statistics) 37(1), pp. 87–94.
# Chapter 15 of USEPA (2009) gives several examples of estimating the mean
# and standard deviation of a lognormal distribution on the log-scale using
# manganese concentrations (ppb) in groundwater at five background wells.
# In EnvStats these data are stored in the data frame
# EPA.09.Ex.15.1.manganese.df.

# Here we will estimate the mean and coefficient of variation
# ON THE ORIGINAL SCALE using the MLE and
# assuming a gamma distribution.

# First look at the data:
#-----------------------

EPA.09.Ex.15.1.manganese.df

#   Sample   Well Manganese.Orig.ppb Manganese.ppb Censored
#1       1 Well.1                 <5           5.0     TRUE
#2       2 Well.1               12.1          12.1    FALSE
#3       3 Well.1               16.9          16.9    FALSE
#...
#23      3 Well.5                3.3           3.3    FALSE
#24      4 Well.5                8.4           8.4    FALSE
#25      5 Well.5                 <2           2.0     TRUE

longToWide(EPA.09.Ex.15.1.manganese.df,
  "Manganese.Orig.ppb", "Sample", "Well",
  paste.row.name = TRUE)

#         Well.1 Well.2 Well.3 Well.4 Well.5
#Sample.1     <5     <5     <5    6.3   17.9
#Sample.2   12.1    7.7    5.3   11.9   22.7
#Sample.3   16.9   53.6   12.6     10    3.3
#Sample.4   21.6    9.5  106.3     <2    8.4
#Sample.5     <2   45.9   34.5   77.2     <2

# Now estimate the mean and coefficient of variation
# using the MLE, and compute a confidence interval
# for the mean using the profile-likelihood method.
#---------------------------------------------------

with(EPA.09.Ex.15.1.manganese.df,
  egammaAltCensored(Manganese.ppb, Censored, ci = TRUE))

#Results of Distribution Parameter Estimation
#Based on Type I Censored Data
#--------------------------------------------
#
#Assumed Distribution:            Gamma
#
#Censoring Side:                  left
#
#Censoring Level(s):              2 5
#
#Estimated Parameter(s):          mean = 19.664797
#                                 cv   =  1.252936
#
#Estimation Method:               MLE
#
#Data:                            Manganese.ppb
#
#Censoring Variable:              Censored
#
#Sample Size:                     25
#
#Percent Censored:                24%
#
#Confidence Interval for:         mean
#
#Confidence Interval Method:      Profile Likelihood
#
#Confidence Interval Type:        two-sided
#
#Confidence Level:                95%
#
#Confidence Interval:             LCL = 12.25151
#                                 UCL = 34.35332

#----------

# Compare the confidence interval for the mean
# based on assuming a lognormal distribution versus
# assuming a gamma distribution.

with(EPA.09.Ex.15.1.manganese.df,
  elnormAltCensored(Manganese.ppb, Censored,
    ci = TRUE))$interval$limits
#     LCL      UCL
#12.37629 69.87694

with(EPA.09.Ex.15.1.manganese.df,
  egammaAltCensored(Manganese.ppb, Censored,
    ci = TRUE))$interval$limits
#     LCL      UCL
#12.25151 34.35332