EnvStats: cdfCompare – R documentation


cdfCompare

Plot Two Cumulative Distribution Functions

Description

For one sample, plots the empirical cumulative distribution function (ecdf) along with a theoretical cumulative distribution function (cdf). For two samples, plots the two ecdf's. These plots are used to graphically assess goodness of fit.

Usage

cdfCompare(x, y = NULL, discrete = FALSE, 
    prob.method = ifelse(discrete, "emp.probs", "plot.pos"), plot.pos.con = NULL, 
    distribution = "norm", param.list = NULL, 
    estimate.params = is.null(param.list), est.arg.list = NULL, 
    x.col = "blue", y.or.fitted.col = "black", 
    x.lwd = 3 * par("cex"), y.or.fitted.lwd = 3 * par("cex"), 
    x.lty = 1, y.or.fitted.lty = 2, digits = .Options$digits, ..., 
    type = ifelse(discrete, "s", "l"), main = NULL, xlab = NULL, ylab = NULL, 
    xlim = NULL, ylim = NULL)

Arguments

`x`	numeric vector of observations. Missing (`NA`), undefined (`NaN`), and infinite (`Inf`, `-Inf`) values are allowed but will be removed.
`y`	a numeric vector (not necessarily of the same length as `x`). Missing (`NA`), undefined (`NaN`), and infinite (`Inf`, `-Inf`) values are allowed but will be removed. The default value is `y=NULL`, in which case the empirical cdf of `x` will be plotted along with the theoretical cdf specified by the argument `distribution`.
`discrete`	logical scalar indicating whether the assumed parent distribution of `x` is discrete (`discrete=TRUE`) or continuous (`discrete=FALSE`; the default).
`prob.method`	character string indicating what method to use to compute the plotting positions (empirical probabilities). Possible values are `plot.pos` (plotting positions, the default if `discrete=FALSE`) and `emp.probs` (empirical probabilities, the default if `discrete=TRUE`). See the help file for `ecdfPlot` for more explanation.
`plot.pos.con`	numeric scalar between 0 and 1 containing the value of the plotting position constant. When `y` is supplied, the default value is `plot.pos.con=0.375`. When `y` is not supplied, for the normal, lognormal, three-parameter lognormal, zero-modified normal, and zero-modified lognormal distributions, the default value is `plot.pos.con=0.375`. For the Type I extreme value (Gumbel) distribution (`distribution="evd"`), the default value is `plot.pos.con=0.44`. For all other distributions, the default value is `plot.pos.con=0.4`. See the help files for `ecdfPlot` and `qqPlot` for more information. This argument is ignored if `prob.method="emp.probs"`.
`distribution`	when `y` is not supplied, a character string denoting the distribution abbreviation. The default value is `distribution="norm"`. See the help file for `Distribution.df` for a list of possible distribution abbreviations. This argument is ignored if `y` is supplied.
`param.list`	when `y` is not supplied, a list with values for the parameters of the distribution. The default value is `param.list=list(mean=0, sd=1)`. See the help file for `Distribution.df` for the names and possible values of the parameters associated with each distribution. This argument is ignored if `y` is supplied or `estimate.params=TRUE`.
`estimate.params`	when `y` is not supplied, a logical scalar indicating whether to compute the cdf for `x` based on estimating the distribution parameters (`estimate.params=TRUE`) or using the known distribution parameters specified in `param.list` (`estimate.params=FALSE`). The default value is `TRUE` unless the argument `param.list` is supplied. The argument `estimate.params` is ignored if `y` is supplied.
`est.arg.list`	when `y` is not supplied and `estimate.params=TRUE`, a list whose components are optional arguments associated with the function used to estimate the parameters of the assumed distribution (see the help file Estimating Distribution Parameters). For example, all functions used to estimate distribution parameters have an optional argument called `method` that specifies the method to use to estimate the parameters. (See the help file for `Distribution.df` for a list of available estimation methods for each distribution.) To override the default estimation method, supply the argument `est.arg.list` with a component called `method`; for example `est.arg.list=list(method="mle")`. The default value is `est.arg.list=NULL` so that all default values for the estimating function are used. This argument is ignored if `estimate.params=FALSE` or `y` is supplied.
`x.col`	a numeric scalar or character string determining the color of the empirical cdf (based on `x`) line or points. The default value is `x.col="blue"`. See the entry for `col` in the help file for `par` for more information.
`y.or.fitted.col`	a numeric scalar or character string determining the color of the empirical cdf (based on `y`) or the theoretical cdf line or points. The default value is `y.or.fitted.col="black"`. See the entry for `col` in the help file for `par` for more information.
`x.lwd`	a numeric scalar determining the width of the empirical cdf (based on `x`) line. The default value is `x.lwd=3*par("cex")`. See the entry for `lwd` in the help file for `par` for more information.
`y.or.fitted.lwd`	a numeric scalar determining the width of the empirical cdf (based on `y`) or theoretical cdf line. The default value is `y.or.fitted.lwd=3*par("cex")`. See the entry for `lwd` in the help file for `par` for more information.
`x.lty`	a numeric scalar determining the line type of the empirical cdf (based on `x`) line. The default value is `x.lty=1`. See the entry for `lty` in the help file for `par` for more information.
`y.or.fitted.lty`	a numeric scalar determining the line type of the empirical cdf (based on `y`) or theoretical cdf line. The default value is `y.or.fitted.lty=2`. See the entry for `lty` in the help file for `par` for more information.
`digits`	when `y` is not supplied, a scalar indicating how many significant digits to print for the distribution parameters. The default value is `digits=.Options$digits`.
`type, main, xlab, ylab, xlim, ylim, ...`	additional graphical parameters (see `lines` and `par`). In particular, the argument `type` specifies the type of plot to draw. By default, the function `cdfCompare` plots a step function (`type="s"`) when `discrete=TRUE`, and draws straight lines between points (`type="l"`) when `discrete=FALSE`. The user may override these defaults by supplying the graphics parameter `type` (`type="s"` for a step function, `type="l"` for linear interpolation, `type="p"` for points only, etc.).
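
The two `prob.method` options can be illustrated in base R. This is a minimal sketch, not the package's own code: it assumes the plotting positions take the form (i - a) / (n - 2a + 1), with a equal to `plot.pos.con`, which is the formula used by `stats::ppoints`; see the help file for `ecdfPlot` for the authoritative definition. The data values and the names `plot_pos` and `emp_probs` are made up for illustration.

```r
# Assumption: plotting positions follow (i - a) / (n - 2a + 1), where a plays
# the role of plot.pos.con. The data below are invented for illustration.
x <- c(4.1, 2.7, 5.3, 3.8, 4.9, 2.7, 6.0)
x_sorted <- sort(x)
n <- length(x_sorted)
a <- 0.375                       # plotting position constant

# Analogue of prob.method = "plot.pos": one position per order statistic
i <- seq_len(n)
plot_pos <- (i - a) / (n - 2 * a + 1)

# Analogue of prob.method = "emp.probs": proportion of observations <= each value
emp_probs <- ecdf(x)(x_sorted)

data.frame(x = x_sorted, plot_pos = plot_pos, emp_probs = emp_probs)
```

Note that the plotting positions stay strictly between 0 and 1, whereas the empirical probabilities reach 1 at the largest observation and repeat for tied values.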

Details

When both x and y are supplied, the function cdfCompare creates the empirical cdf plot of x and y on the same plot by calling the function ecdfPlot.

When y is not supplied, the function cdfCompare creates the empirical cdf plot of x (by calling ecdfPlot) and the theoretical cdf plot (by calling cdfPlot and using the argument distribution) on the same plot.
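
The one-sample case can be approximated with base R graphics. The following is only a rough sketch of the idea, not the internals of cdfCompare: it assumes a normal distribution, estimates the parameters with `mean` and `sd` as a stand-in for `estimate.params=TRUE`, and uses `ppoints` with a constant of 0.375 for the plotting positions. The variable names are hypothetical.

```r
set.seed(250)
x <- rnorm(20, mean = 10, sd = 2)

# Empirical cdf coordinates: order statistics and plotting positions
x_sorted <- sort(x)
p <- ppoints(length(x), a = 0.375)

# Fitted normal cdf, with parameters estimated from the sample
# (an assumed stand-in for estimate.params = TRUE)
mu_hat <- mean(x)
sd_hat <- sd(x)
grid <- seq(min(x), max(x), length.out = 200)
fitted <- pnorm(grid, mean = mu_hat, sd = sd_hat)

# Empirical cdf as a solid blue line, fitted cdf as a dashed black line
plot(x_sorted, p, type = "l", col = "blue", lwd = 3, ylim = c(0, 1),
     xlab = "x", ylab = "Cumulative Probability")
lines(grid, fitted, col = "black", lwd = 3, lty = 2)
```

The colors and line types mirror the defaults listed under Arguments, so the result should resemble what `cdfCompare(x)` draws, without the title and parameter annotations.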

Value

When y is supplied, cdfCompare invisibly returns a list with components:

`x.ecdf.list`	a list with components `Order.Statistics` and `Cumulative.Probabilities`, giving coordinates of the points that have been plotted for the `x` values.
`y.ecdf.list`	a list with components `Order.Statistics` and `Cumulative.Probabilities`, giving coordinates of the points that have been plotted for the `y` values.

When y is not supplied, cdfCompare invisibly returns a list with components:

`x.ecdf.list`	a list with components `Order.Statistics` and `Cumulative.Probabilities`, giving coordinates of the points that have been plotted for the `x` values.
`fitted.cdf.list`	a list with components `Quantiles` and `Cumulative.Probabilities`, giving coordinates of the points that have been plotted for the fitted cdf.
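
Because the value is returned invisibly, it has to be assigned in order to be inspected. The sketch below assumes the EnvStats package is installed and simply skips the call otherwise; the name `res` is arbitrary, and the component names are those listed above.

```r
res <- NULL
if (requireNamespace("EnvStats", quietly = TRUE)) {
  library(EnvStats)
  set.seed(250)
  x <- rnorm(20, mean = 10, sd = 2)

  # The plot is drawn as a side effect; the coordinates are kept in res
  res <- cdfCompare(x)

  # y was not supplied, so the documented components are
  # x.ecdf.list and fitted.cdf.list
  str(res$x.ecdf.list)
  str(res$fitted.cdf.list)
}
```

The stored coordinates can be reused, for example to redraw the comparison with a different graphics system.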

Note

An empirical cumulative distribution function (ecdf) plot is a graphical tool that can be used in conjunction with other graphical tools such as histograms, strip charts, and boxplots to assess the characteristics of a set of data. It is easy to determine quartiles and the minimum and maximum values from such a plot. Also, ecdf plots allow you to assess local density: a higher density of observations occurs where the slope is steep.
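
Reading quartiles off an ecdf amounts to finding the smallest observation at which the step function reaches the desired probability. A minimal base R sketch follows, using `stats::ecdf` rather than the EnvStats plotting functions; the helper `read_off` is hypothetical.

```r
set.seed(250)
x <- rnorm(20, mean = 10, sd = 2)
Fn <- ecdf(x)                    # step-function empirical cdf
x_sorted <- sort(x)

# Smallest observation at which the ecdf reaches probability p
read_off <- function(p) x_sorted[which(Fn(x_sorted) >= p)[1]]
quartiles <- sapply(c(0.25, 0.5, 0.75), read_off)

quartiles
range(x)                         # the ecdf leaves 0 at min(x) and reaches 1 at max(x)
```

These values agree with `quantile(x, c(0.25, 0.5, 0.75), type = 1)`, which inverts the empirical cdf in the same way.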

Chambers et al. (1983, pp.11-16) plot the observed order statistics on the y-axis vs. the ecdf on the x-axis and call this a quantile plot.

Empirical cumulative distribution function (ecdf) plots are often plotted with theoretical cdf plots (see cdfPlot and cdfCompare) to graphically assess whether a sample of observations comes from a particular distribution. The Kolmogorov-Smirnov goodness-of-fit test (see gofTest) is the statistical companion of this kind of comparison; it is based on the maximum vertical distance between the empirical cdf plot and the theoretical cdf plot. More often, however, quantile-quantile (Q-Q) plots are used instead of ecdf plots to graphically assess departures from an assumed distribution (see qqPlot).
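
The maximum vertical distance mentioned above can be computed by hand. This sketch uses base R's `ks.test` as a stand-in for gofTest and assumes the distribution parameters are known rather than estimated; the variable names are illustrative.

```r
set.seed(250)
x <- rnorm(20, mean = 10, sd = 2)
n <- length(x)

# Hypothesized cdf evaluated at the order statistics
Fx <- pnorm(sort(x), mean = 10, sd = 2)

# The ecdf jumps from (i - 1)/n to i/n at the i-th order statistic,
# so the gap is checked on both sides of every jump
i <- seq_len(n)
D <- max(i / n - Fx, Fx - (i - 1) / n)

D
ks.test(x, "pnorm", mean = 10, sd = 2)$statistic
```

Both lines should print the same number. When the parameters are estimated from the same sample, the statistic is computed the same way, but the standard Kolmogorov-Smirnov p-value no longer applies without adjustment.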

Author(s)

Steven P. Millard (EnvStats@ProbStatInfo.com)

References

Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.

Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.

D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.

Examples

  # Generate 20 observations from a normal (Gaussian) distribution 
  # with mean=10 and sd=2 and compare the empirical cdf with a 
  # theoretical normal cdf that is based on estimating the parameters. 
  # (Note: the call to set.seed simply allows you to reproduce this example.)

  set.seed(250) 
  x <- rnorm(20, mean = 10, sd = 2) 
  dev.new()
  cdfCompare(x)

  #----------

  # Generate 30 observations from an exponential distribution with parameter 
  # rate=0.1 (see the R help file for Exponential) and compare the empirical 
  # cdf with the empirical cdf of the normal observations generated in the 
  # previous example:

  set.seed(432)
  y <- rexp(30, rate = 0.1) 
  dev.new()
  cdfCompare(x, y)

  #==========

  # Generate 20 observations from a Poisson distribution with parameter lambda=10 
  # (see the R help file for Poisson) and compare the empirical cdf with a 
  # theoretical Poisson cdf based on estimating the distribution parameters. 
  # (Note: the call to set.seed simply allows you to reproduce this example.)

  set.seed(250) 
  x <- rpois(20, lambda = 10) 
  dev.new()
  cdfCompare(x, distribution = "pois")

  #==========

  # Clean up
  #---------
  rm(x, y)
  graphics.off()

EnvStats

Package for Environmental Statistics, Including US EPA Guidance

v2.4.0

GPL (>= 3)

Authors

Steven P. Millard [aut], Alexander Kowarik [ctb, cre] (<https://orcid.org/0000-0001-8598-4130>)

Initial release

2020-10-20