An EDA Graphical Summary
Plots a simple four-panel graphical distributional summary for a dataset, comprising a histogram, a horizontal Tukey boxplot or box-and-whisker plot (Garrett, 1988), an empirical cumulative distribution function (ECDF), and a cumulative normal percentage probability (CPP) plot. The plots in all four panels have identical x-axis scaling. Optionally, the EDA graphics may be plotted with logarithmic (base 10) scaling.
shape(xx, xlab = deparse(substitute(xx)), log = FALSE, xlim = NULL, nclass = NULL, ifbw = FALSE, wend = 0.05, ifnright = TRUE, colr = 8, cex = 0.8, ...)
xx |
name of the variable to be plotted. |
xlab |
by default the character string for |
log |
to display the data with logarithmic (x-axis) scaling, set |
xlim |
is determined by |
nclass |
the default procedure for preparing the histogram depends on sample size. Where N <= 500 the Scott (1979) rule is used, and where N > 500 the Freedman-Diaconis (1981) rule; both rules are resistant to the presence of outliers and usually provide informative histograms. Alternatively, the user may define the histogram binning by setting |
ifbw |
the default is to plot a horizontal Tukey boxplot; if a box-and-whisker plot is required set |
wend |
if |
colr |
by default the histogram and Tukey boxplot, or box-and-whisker plot, are infilled in grey, |
ifnright |
controls where the sample size is plotted in the histogram display; by default this is in the upper right corner of the plot. If the data distribution is such that the upper left corner would be preferable, set |
cex |
by default the size of the text for the sample size, N, is set to 80%, i.e. |
... |
further arguments to be passed to methods. For example, the size of the axis scale annotation can be changed by setting |
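The default binning described for nclass can be sketched with base R alone. This is only an illustration of the documented rule (Scott for N <= 500, Freedman-Diaconis above that), not the package's own code; the helper name choose_nclass and the simulated data are assumptions made for the example.

```r
# Hypothetical helper (not part of rgr): pick a bin count from the sample size,
# following the rule described for the nclass argument.
choose_nclass <- function(x) {
  x <- x[!is.na(x)]                      # ignore missing values
  if (length(x) <= 500) {
    grDevices::nclass.scott(x)           # Scott rule for smaller samples
  } else {
    grDevices::nclass.FD(x)              # Freedman-Diaconis rule for larger samples
  }
}

set.seed(1)
small <- rlnorm(200)    # simulated skewed data, N <= 500
large <- rlnorm(2000)   # simulated skewed data, N > 500

n_small <- choose_nclass(small)
n_large <- choose_nclass(large)
print(c(small = n_small, large = n_large))
```

Both nclass.scott and nclass.FD are standard functions in grDevices, so the sketch runs without rgr installed.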
A histogram is displayed upper left, an ECDF is displayed below it (lower left). To the right of the histogram a horizontal Tukey boxplot (default) or box-and-whisker plot (option) is displayed (upper right). In the lower right quadrant a cumulative normal percentage probability (CPP) plot is displayed. The x-axis scaling is identical in all four plots.
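The panel arrangement can be approximated in base R graphics. The sketch below is not how shape is implemented; the function name four_panel, the plotting-position choice (ppoints) and the probability tick marks are all assumptions chosen to illustrate a shared x-axis across the four panels.

```r
# Hypothetical base-R approximation of the four-panel layout (not rgr code).
four_panel <- function(x, xlab = "x") {
  x <- sort(x[!is.na(x)])
  xlim <- range(x)                        # one x-axis range shared by all panels
  old <- par(mfrow = c(2, 2))             # panels fill row by row
  on.exit(par(old), add = TRUE)

  # Upper left: histogram
  hist(x, xlim = xlim, main = "", xlab = xlab, col = "grey")
  # Upper right: horizontal Tukey boxplot (ylim sets the data axis here)
  boxplot(x, horizontal = TRUE, ylim = xlim, xlab = xlab, col = "grey")
  # Lower left: empirical cumulative distribution function
  plot(ecdf(x), xlim = xlim, main = "", xlab = xlab, ylab = "ECDF")
  # Lower right: cumulative normal percentage probability plot
  probs <- c(0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99)
  plot(x, qnorm(ppoints(length(x))), xlim = xlim, yaxt = "n",
       xlab = xlab, ylab = "Cumulative %")
  axis(2, at = qnorm(probs), labels = 100 * probs)

  invisible(xlim)
}

set.seed(1)
lims <- four_panel(rlnorm(100), xlab = "simulated data")
```

Passing the same xlim to every panel is what keeps the histogram, boxplot, ECDF and CPP plot visually aligned.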
In a box-and-whisker plot there are two special cases. When wend = 0 the whiskers extend to the observed minimum and maximum, which are then not plotted with the plus symbol. When wend = 0.25 neither the whiskers nor the data minimum and maximum are plotted; only the median and the box representing the span of the middle 50 percent of the data are displayed.
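The role of wend can be illustrated with quantiles. The following sketch assumes that whisker ends sit at the wend and 1 - wend quantiles, consistent with the description above; the helper whisker_ends and its return structure are hypothetical and not part of rgr.

```r
# Hypothetical helper (not rgr code): locate box and whisker positions for a
# given wend, assuming whiskers end at the wend and (1 - wend) quantiles.
whisker_ends <- function(x, wend = 0.05) {
  stopifnot(wend >= 0, wend <= 0.25)
  x <- x[!is.na(x)]
  q <- quantile(x, probs = c(wend, 0.25, 0.5, 0.75, 1 - wend), names = FALSE)
  list(
    whiskers = q[c(1, 5)],
    box      = q[c(2, 4)],
    median   = q[3],
    # Extremes get a plus symbol only between the two special cases
    mark_extremes = wend > 0 && wend < 0.25
  )
}

x <- 1:101
w0   <- whisker_ends(x, wend = 0)     # whiskers reach the minimum and maximum
w10  <- whisker_ends(x, wend = 0.1)   # whiskers at the 10th and 90th percentiles
w25  <- whisker_ends(x, wend = 0.25)  # whiskers coincide with the box edges
print(w10$whiskers)
```

With wend = 0.25 the whisker positions equal the quartiles, which is why only the box and median remain visible in that case.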
Any less-than-detection-limit values represented by negative values, or zeros or other numeric codes representing blanks in the data, must be removed prior to executing this function; see ltdl.fix.df.
Any NAs in the data vector are removed prior to displaying the plots.
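As a minimal illustration of the cleaning step, the sketch below simply drops coded values from a vector using base R. It is not a substitute for ltdl.fix.df, whose behaviour is not reproduced here; the example values, including -9999 as a blank code, are invented for the illustration.

```r
# Invented example vector: negatives stand for less-than-detection-limit
# values, 0 and -9999 for blanks, and NA for missing data.
raw <- c(12, -5, 0, 30, NA, 8, -9999, 45)

# Keep only strictly positive, non-missing values before plotting.
clean <- raw[!is.na(raw) & raw > 0]
print(clean)
```

Strictly positive data are also required if the display is to use logarithmic scaling.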
If the default selection for xlim is inappropriate it can be set, e.g., xlim = c(0, 200) or c(2, 200), the latter being appropriate for a logarithmically scaled plot, i.e. log = TRUE. If the defined limits lie within the observed data range, truncated plots will be displayed. If this occurs, the number of data points omitted is displayed below the total number of observations in the various panels.
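The number of points lost to truncation can be checked before plotting. The helper below is a hypothetical sketch that assumes a point is omitted when it falls strictly outside the supplied limits; the exact counting used by the package is not shown in this page.

```r
# Hypothetical helper (not rgr code): count observations outside xlim.
count_omitted <- function(x, xlim) {
  x <- x[!is.na(x)]
  sum(x < xlim[1] | x > xlim[2])
}

x <- c(1, 3, 5, 50, 150, 250, 400)       # invented example values
omitted_a <- count_omitted(x, c(0, 200))  # 250 and 400 fall outside
omitted_b <- count_omitted(x, c(2, 200))  # 1, 250 and 400 fall outside
print(c(omitted_a, omitted_b))
```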
If it is desired to prepare a display of data falling within a defined part of the actual data range, then either a data subset can be prepared externally using the appropriate R syntax, or xx may be defined in the function call as, for example, Cu[Cu < some.value], which would remove the influence of one or more outliers having values greater than some.value. In this case the number of data values displayed will be the number that are < some.value.
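The subsetting idiom can be tried on a small vector first to confirm how many values will be displayed. The data and the threshold below are invented for the illustration.

```r
# Invented example: six typical values and one extreme outlier.
Cu <- c(5, 8, 12, 15, 22, 31, 2500)
some.value <- 1000

# Logical indexing keeps only the values below the threshold.
Cu_sub <- Cu[Cu < some.value]
n_shown <- length(Cu_sub)
print(n_shown)
```

The same expression can be passed directly as the first argument of the plotting call, so no separate object needs to be created.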
In some R installations the generation of multi-panel displays and the use of function eqscplot from package MASS cause warning messages related to graphics parameters to be displayed on the current device. These may be suppressed by entering options(warn = -1) on the R command line, or that line may be included in a ‘first’ function prepared by the user that loads the ‘rgr’ package, etc.
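Because options(warn = -1) silences warnings for the whole session, it can be useful to restore the previous setting afterwards. The wrapper below is a sketch of that pattern using only base R; the name quiet_call is hypothetical, and the as.numeric call merely stands in for a plotting call that emits a warning.

```r
# Hypothetical wrapper: silence warnings for one expression only, then
# restore whatever warn level was in force before.
quiet_call <- function(expr) {
  old <- options(warn = -1)
  on.exit(options(old), add = TRUE)
  force(expr)                 # evaluated while warnings are suppressed
}

before <- getOption("warn")
result <- quiet_call(as.numeric("not a number"))  # would normally warn
after <- getOption("warn")
print(c(before = before, after = after))
```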
For summary statistics displays to complement the graphics, see gx.summary1, gx.summary2 and inset.
Robert G. Garrett
Garrett, R.G., 1988. IDEAS - An Interactive Computer Graphics Tool to Assist the Exploration Geochemist. In Current Research Part F, Geological Survey of Canada Paper 88-1F, pp. 1-13. See p. 5 for a description of box-and-whisker plots.
Venables, W.N. and Ripley, B.D., 2001. Modern Applied Statistics with S-Plus, 3rd Edition, Springer, 501 p. See p. 119 for a description of histogram bin selection computations.
## Make test data available
data(kola.o)
attach(kola.o)

## Generates an initial display to have a first look at the data and
## decide how best to proceed
shape(Cu)

## Provides a more appropriate initial display and indicates the
## quartiles
shape(Cu, xlab = "Cu (mg/kg) in <2 mm O-horizon soil", log = TRUE,
      ifqs = TRUE)

## Causes the Freedman-Diaconis rule to be used to select the number of
## histogram bins and changes the ECDF and CPP plotting symbols to a
## cross/x
shape(Cu, xlab = "Cu (mg/kg) in <2 mm O-horizon soil", log = TRUE,
      nclass = "fd", pch = 4)

## Replaces the Tukey boxplot with a box-and-whisker plot where the
## whiskers extend to the 10th and 90th percentiles and the minimum
## and maximum observed values are marked with a plus sign.
shape(Cu, xlab = "Cu (mg/kg) in <2 mm O-horizon soil", log = TRUE,
      ifbw = TRUE, wend = 0.1)

## Detach test data
detach(kola.o)