Generally–Altered, –Inflated and –Truncated Poisson Regression
Fits a generally–altered, –inflated and –truncated Poisson regression by MLE. The GAIT combo model having 5 types of special values is implemented. This allows mixtures of Poissons on nested and/or partitioned support as well as a multinomial logit model for altered and inflated values. Truncation may include the upper tail.
gaitpoisson(alt.mix = NULL, inf.mix = NULL, alt.mlm = NULL, inf.mlm = NULL, truncate = NULL, max.support = Inf, zero = c("pobs", "pstr"), eq.ap = FALSE, eq.ip = FALSE, parallel.ap = FALSE, parallel.ip = FALSE, llambda.p = "loglink", llambda.a = llambda.p, llambda.i = llambda.p, type.fitted = c("mean", "lambdas", "pobs.mlm", "pstr.mlm", "pobs.mix", "pstr.mix", "Pobs.mix", "Pstr.mix", "nonspecial", "Numer", "Denom.p", "sum.mlm.i", "sum.mix.i", "ptrunc.p", "cdf.max.s"), gpstr.mix = ppoints(7) / 3, gpstr.mlm = ppoints(7) / (3 + length(inf.mlm)), imethod = 1, imux = 0.5, ilambda.p = NULL, ilambda.a = ilambda.p, ilambda.i = ilambda.p, ipobs.mix = NULL, ipstr.mix = NULL, ipobs.mlm = NULL, ipstr.mlm = NULL, byrow.ai = FALSE, ishrinkage = 0.95, probs.y = 0.35)
truncate, max.support |
Vector of truncated values,
i.e., nonnegative integers.
For the first five arguments (for the special values)
a |
alt.mix, inf.mix |
Vector of altered and inflated values corresponding to finite mixture models. These are described as parametric or structured. The parameter If Due to its flexibility, it is easy to misuse this function
and ideally the values of the above arguments should be well
justified by the application on hand.
Adding inappropriate or
unnecessary values to these arguments willy-nilly
is a recipe for disaster, especially for |
alt.mlm, inf.mlm |
Vector of altered and inflated values corresponding
to the multinomial logit model (MLM) probabilities of
observing those values—see
|
llambda.p, llambda.a, llambda.i |
Link functions;
the suffixes |
eq.ap, eq.ip |
Single logical each.
Constrain the rate parameters to be equal?
See For the GIT–Pois–Pois submodel,
after plotting the responses,
if the distribution of the spikes
above the nominal probabilities
has roughly the same shape
as the ordinary values then setting
|
parallel.ap, parallel.ip |
Single logical each.
Constrain the MLM probabilities to be equal?
If so then this applies to all
|
type.fitted |
See The choice Option The choice |
gpstr.mix, gpstr.mlm |
See |
imethod, ipobs.mix, ipstr.mix, ipobs.mlm, ipstr.mlm |
See |
imux |
Numeric, general downward multiplier for initial values for
the sample proportions (MLEs actually).
The value 1 makes no adjustment, and in general it
should lie in (0, 1] with a value near 0.5 recommended.
If too high then |
ilambda.p, ilambda.a, ilambda.i |
Initial values for the rate parameters;
see |
probs.y, ishrinkage |
See |
byrow.ai |
Details are at |
zero |
See For the MLM probabilities,
to model It is noted that, amongst other things,
|
The full
GAIT–Pois–Pois–MLM–Pois-MLM combo model
may be fitted with this family function.
There are five types of special values and all arguments for these
may be used in a single model.
Here, the MLM represents the nonparametric while the Pois
refers to the Poisson mixtures.
The defaults for this function correspond to an ordinary Poisson
regression so that poissonff
is called instead.
A MLM with only one probability to model is equivalent to
logistic regression
(binomialff
and logitlink
).
The order of the linear/additive predictors is best explained by
an example.
Suppose a combo model has length(a.mix) > 1
and
length(i.mix) > 1
, a.mlm = 3:5
and
i.mlm = 6:9
, say.
Then loglink(lambda.p)
is the first.
The second is multilogitlink(pobs.mix)
followed
by loglink(lambda.a)
because a.mix
is long enough.
The fourth is multilogitlink(pstr.mix)
followed
by loglink(lambda.i)
because i.mix
is long enough.
Next are the probabilities for the alt.mlm
values.
Lastly are the probabilities for the inf.mlm
values.
All the probabilities are estimated by one big MLM and effectively
the "(Others)"
column of left over probabilities is
associated with the nonspecial values.
The dimension of the vector of linear/additive predictors
is M=12.
Two mixture submodels that may be fitted can be abbreviated
GAT–Pois–Pois or
GIT–Pois–Pois.
For the GAT model
the distribution being fitted is a (spliced) mixture
of two Poissons with differing (partitioned) support.
Likewise, for the GIT model
the distribution being fitted is a mixture
of two Poissons with nested support.
The two rate parameters may be constrained to be equal using
eq.ap
and eq.ip
.
A good first step is to apply spikeplot
for selecting
candidate values for altering and inflating. Deciding between
parametrically or nonparametrically can also be determined from
examining the spike plot. Misspecified
alt.mix
/alt.mlm
/inf.mix
/inf.mlm
will result
in convergence problems (setting trace = TRUE
is a very
good idea.)
This function currently does not handle multiple responses.
Further details are at Gaitpois
.
A well-conditioned data–model combination should pose no
difficulties for the automatic starting value selection
being successful.
Failure to obtain initial values from this self-starting
family function indicates the degree of inflation may
be marginal and/or a misspecified model.
If this problem is worth surmounting
the arguments to focus on especially are
imux
,
gpstr.mix
and
gpstr.mlm
.
See below for the stepping-stone trick.
Apart from the order of the linear/additive predictors,
the following are (or should be) equivalent:
gaitpoisson()
and poissonff()
,
gaitpoisson(alt.mix = 0)
and zapoisson(zero = "pobs0")
,
gaitpoisson(inf.mix = 0)
and zipoisson(zero = "pstr0")
,
gaitpoisson(truncate = 0)
and pospoisson()
.
Likewise,
if
alt.mix
and inf.mix
are assigned a scalar then
it effectively
moves that scalar to alt.mlm
and inf.mlm
because
there is no lambda.a
or lambda.i
being estimated.
Thus
gaitpoisson(alt.mix = 0)
and gaitpoisson(alt.mlm = 0)
are the effectively same,
and ditto for
gaitpoisson(inf.mix = 0)
and gaitpoisson(inf.mlm = 0)
.
A nonparametric special case submodel is the GAIT–Pois–MLM–MLM, which is where the ordinary values have a Poisson distribution, and there are altered and inflated values having unstructured probabilities. Thus the distribution being fitted is a mixture of a Poisson and two MLMs with the support of one of the MLMs being equal to the set of altered values and the other for the inflated values. Hence the probability for each inflated value comes from two sources: the parent distribution and a MLM.
An object of class "vglmff"
(see vglmff-class
).
The object is used by modelling functions such as vglm
,
rrvglm
and vgam
.
The fitted.values
slot of the fitted object,
which should be extracted by the generic function fitted
,
returns the mean mu by default.
The choice type.fitted = "pobs.mlm"
returns
a matrix whose columns are
the altered probabilities (Greek symbol omega).
The choice "pstr.mlm"
returns
a matrix whose columns are
the inflated probabilities (Greek symbol phi).
The choice "ptrunc.p"
returns the probability of having
a truncated value with respect to the parent distribution.
It includes any truncated values in the upper tail
beyond max.support
.
The probability of a value less than or equal to
max.support
with respect to the parent distribution
is "cdf.max.s"
.
The choice "sum.mlm.i"
adds two terms.
This gives the probability of an inflated value,
and the formula can be loosely written down
as something like
"pstr.mlm" + "Numer" * dpois(inf.mlm, lambda.p) / "Denom"
.
Amateurs tend to be overzealous fitting
zero-inflated models when the fitted mean is low—the
warning of ziP
should be heeded.
For GAIT regression the warning
applies here to all
inf.mix
and inf.mlm
values,
not just 0.
Default values for this and similar family functions
may change in the future, e.g., eq.ap
and eq.ip
.
Important internal changes might occur too, such as the
ordering of the linear/additive predictors and
the quantities returned as the fitted values.
Using inf.mlm
requires more caution than alt.mlm
because
gross inflation is ideally needed for it to work safely.
Ditto for inf.mix
versus alt.mix
.
Data exhibiting deflation or no inflation will produce
numerical problems,
hence set trace = TRUE
to monitor convergence.
More than c.10 IRLS iterations should raise suspicion.
Parameter estimates close to the boundary of the parameter space
indicate model misspecification
and this can be detected by hdeff
.
This function is quite memory-hungry with respect to
length(c(alt.mix, inf.mix, alt.mlm, inf.mlm))
.
It is often a good idea to set eq.ip = TRUE
,
especially when length(inf.mix)
is not much more than
2 or the values
of inf.mix
are not spread over the range of the response.
This way the estimation can borrow strength from both the
inflated and non-inflated values.
If the inf.mix
values form a single small
cluster then this can easily create estimation difficulties—the
idea is somewhat similar to multicollinearity.
Numerical problems can easily arise because of the
flexibility of this distribution and/or the lack of
sizeable inflation; it is a good idea to
gain experience with simulated data first before applying
it to real data.
Numerical problems may arise if any of the special values
are in remote places of the support, e.g.,
a value y
such that dpois(y, lambda.p)
is
very close to 0. This is because the ratio of two
tiny values can be unstable.
Good initial values may be difficult to obtain using self-starting
procedures, especially when there are covariates. If so, then it is
advisable to use a trick: fit an intercept-only model first and then
use etastart = predict(int.only.model)
to fit the model
with covariates.
This uses the simpler model as a stepping-stone.
The labelling of the linear/additive predictors has been
abbreviated to reduce space.
For example, multilogitlink(pobs.mix)
and
multilogitlink(pstr.mix)
would be more accurately
multilogitlink(cbind(pobs.mix, pstr.mix))
because one grand MLM is fitted.
This shortening may result in modifications needed in other
parts of VGAM to compensate.
T. W. Yee
Yee, T. W. and Ma, C. (2021). Generally–altered, –inflated and –truncated regression, with application to heaped and seeped counts. In preparation.
a.mix <- c(5, 10) # Alter these values parametrically i.mlm <- c(4, 14) # Inflate these values i.mix <- c(3, 15) # Inflate these values tvec <- c(6, 7) # Truncate these values pobs.mix <- pstr.mix <- pstr.mlm <- 0.1 # So parallel.ip = TRUE, etc. max.support <- 20; set.seed(1) gdata <- data.frame(x2 = runif(nn <- 1000)) gdata <- transform(gdata, lambda.p = exp(2 + 0.5 * x2)) gdata <- transform(gdata, y1 = rgaitpois(nn, lambda.p, alt.mix = a.mix, inf.mix = i.mix, pobs.mix = pobs.mix, pstr.mix = pstr.mix, inf.mlm = i.mlm, pstr.mlm = pstr.mlm, truncate = tvec, max.support = max.support)) gaitpoisson(alt.mix = a.mix, inf.mix = i.mix, inf.mlm = i.mlm) with(gdata, table(y1)) ## Not run: spikeplot(with(gdata, y1)) gaitpfit <- vglm(y1 ~ x2, crit = "coef", trace = TRUE, data = gdata, gaitpoisson(alt.mix = a.mix, inf.mix = i.mix, parallel.ip = TRUE, inf.mlm = i.mlm, eq.ap = TRUE, eq.ip = TRUE, truncate = tvec, max.support = max.support)) head(fitted(gaitpfit, type.fitted = "Pstr.mix")) head(predict(gaitpfit)) t(coef(gaitpfit, matrix = TRUE)) # Easier to see with t() summary(gaitpfit, HDEtest = FALSE) # summary(gaitpfit) is better
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.