Specifying generalized additive models
This page is intended to provide some more information on how to specify GAMs. A GAM is a GLM in which the linear predictor depends, in part, on a sum of smooth functions of predictors and (possibly) linear functionals of smooth functions of (possibly dummy) predictors.
Specifically let y_i denote an independent random variable with mean mu_i and an exponential family distribution, or failing that a known mean variance relationship suitable for use of quasi-likelihood methods. Then the the linear predictor of a GAM has a structure something like
g(mu_i)=X_i b + f_1(x_1i,x_2i) + f_2(x_3i) + L_i f_3(x_4) + ...
where g is a known smooth monotonic ‘link’ function, X_i b is the parametric part of the linear predictor, the x_j are predictor variables, the f_j are smooth functions and L_i is some linear functional of f_3. There may of course be multiple linear functional terms, or none.
The key idea here is that the dependence of the response on the predictors can be represented as a parametric sub-model plus the sum of some (functionals of) smooth functions of one or more of the predictor variables. Thus the model is quite flexible relative to strictly parametric linear or generalized linear models, but still has much more structure than the completely general model that says that the response is just some smooth function of all the covariates.
Note one important point. In order for the model to be identifiable
the smooth functions usually have to be constrained to have zero mean (usually
taken over the set of covariate values). The constraint is needed if the term involving the
smooth includes a constant function in its span. gam
always applies such constraints
unless there is a by
variable present, in which case an assessment is made of whether
the constraint is needed or not (see below).
Consider the example model.
g(mu_i) = b_0 + b_1 x_1i + b_2 x_2i + f1(x_3i) + f2(x_4i,x_5i)
where the response variables y_i has expectation mu_i and g is a link function.
The gam
formula for this would be y ~ x1 + x2 + s(x3) + s(x4,x5)
.
This would use the default basis for the smooths (a thin plate
regression spline basis for each), with automatic selection of the
effective degrees of freedom for both smooths. The dimension of the
smoothing basis is given a default value as well (the dimension of the
basis sets an upper limit on the maximum possible degrees of
freedom for the basis - the limit is typically one less than basis
dimension). Full details of how to control smooths are given in
s
and te
, and further discussion of basis
dimension choice can be found in choose.k
.
For the moment suppose that we would like to change
the basis of the first smooth to a cubic regression spline basis with
a dimension of 20, while fixing the second term at 25 degrees of
freedom. The appropriate formula would be:y ~ x1 + x2 + s(x3,bs="cr",k=20) + s(x4,x5,k=26,fx=TRUE)
.
The above assumes that x_4 and x_5 are naturally on
similar scales (e.g. they might be co-ordinates), so that isotropic smoothing
is appropriate. If this assumption is false then tensor product smoothing might be
better (see te
). y ~ x1 + x2 + s(x3) + te(x4,x5)
would generate a tensor product smooth of x_4 and x_5.
By default this smooth would have basis dimension 25 and use cubic regression spline marginals.
Varying the defaults is easy. For exampley ~ x1 + x2 + s(x3) + te(x4,x5,bs=c("cr","ps"),k=c(6,7))
specifies that the tensor product should use a rank 6 cubic regression spline marginal
and a rank 7 P-spline marginal to create a smooth with basis dimension 42.
Sometimes it is interesting to specify smooth models with a main effects + interaction structure such as
E(y) = f1 (x) + f2(z) + f3(x,z)
or
E(y) = f1(x) + f2(z) + f3(v) + f4(x,z) + f5(z,v) + f6(z,v) + f7(x,z,v)
for example. Such models should be set up using ti
terms in the model formula. For example: y ~ ti(x) + ti(z) + ti(x,z)
, ory ~ ti(x) + ti(z) + ti(v) + ti(x,z) + ti(x,v) + ti(z,v)+ti(x,z,v)
.
The ti
terms produce interactions with the component main effects excluded appropriately. (There is in fact no need to use ti
terms for the main effects here, s
terms could also be used.)
gam
allows nesting (or ‘overlap’) of te
and s
smooths, and automatically generates side conditions to
make such models identifiable, but the resulting models are much less stable and interpretable than those constructed using ti
terms.
by
variables are the means for constructing ‘varying-coefficient models’ (geographic regression models) and
for letting smooths ‘interact’ with factors or parametric terms. They are also the key to specifying general linear
functionals of smooths.
Factor smooth interactions (see also factor.smooth.interaction
).
If a by
variable is a factor
then it
generates an indicator vector for each level
of the factor, unless it is an ordered
factor.
In the non-ordered case, the model matrix for the smooth term is then replicated for each factor level,
and each copy has its rows multiplied by the corresponding rows of its
indicator variable. The smoothness penalties are also duplicated for each
factor level. In short a different smooth is generated
for each factor level (the id
argument to s
and te
can be used to force all
such smooths to have the same smoothing parameter). ordered
by
variables are handled in the same
way, except that no smooth is generated for the first level of the ordered factor (see b3
example below).
This is useful for setting up
identifiable models when the same smooth occurs more than once in a model, with different factor by
variables.
As an example, consider the model
E(y_i) = b_0 + f(x_i)z_i
where f is a smooth function, and z_i is a numeric variable.
The appropriate formula is:y ~ s(x,by=z)
- the by
argument ensures that the smooth function gets multiplied by
covariate z
. Note that when using factor by variables, centering constraints are applied to the smooths,
which usually means that the by variable should be included as a parametric term, as well.
The example code below also illustrates the use of factor by
variables.
by
variables may be supplied as numeric matrices as part of specifying general linear functional terms.
If a by
variable is present and numeric (rather than a factor) then the corresponding smooth is only subjected
to an identifiability constraint if (i) the by
variable is a constant vector, or, (ii) for a matrix
by
variable, L
, if L%*%rep(1,ncol(L))
is constant or (iii) if a user defined smooth constructor
supplies an identifiability constraint explicitly, and that constraint has an attibute "always.apply"
.
It is sometimes desirable to insist that different smooth terms have the same degree of smoothness.
This can be done by using the id
argument to s
or te
terms. Smooths
which share an id
will have the same smoothing parameter. Really this only makes sense if the
smooths use the same basis functions, and the default behaviour is to force this to happen: all smooths
sharing an id
have the same basis functions as the first smooth occurring with that id
. Note
that if you want exactly the same function for each smooth, then this is best achieved by making use of the
summation convention covered under ‘linear functional terms’.
As an example suppose that E(y_i)=mu_i and
g(mu_i) = f1(x_1i) + f2(x_2i,x_3i) + f3(x_4i)
but that f1 and f3 should have the same smoothing parameters (and x_2
and x_3 are on different scales). Then
the gam
formulay ~ s(x1,id=1) + te(x_2,x3) + s(x4,id=1)
would achieve the desired result. id
can be numbers or character strings. Giving an id
to a
term with a factor by
variable causes the smooths at each level of the factor to have the same smoothing
parameter.
Smooth term id
s are not supported by gamm
.
General linear functional terms have a long history in the spline literature including in the penalized GLM context (see e.g. Wahba 1990). Such terms encompass varying coefficient models/ geographic regression, functional GLMs (i.e. GLMs with functional predictors), GLASS models, etc, and allow smoothing with respect to aggregated covariate values, for example.
Such terms are implemented in mgcv
using a simple ‘summation convention’ for smooth terms: If the covariates of a
smooth are supplied as matrices, then summation of the evaluated smooth over the columns of the matrices is implied. Each
covariate matrix and any by
variable matrix must be of the same dimension. Consider, for example the terms(X,Z,by=L)
where X
, Z
and L
are n by p matrices. Let f denote the thin plate regression
spline specified. The resulting contibution to the ith
element of the linear predictor is
sum_j^p L_ij f(X_ij,Z_ij)
If no L
is supplied then all its elements are taken as 1. In R code terms, let F
denote the n by p
matrix obtained by evaluating the smooth at the values in X
and Z
. Then the contribution of the term to the
linear predictor is rowSums(L*F)
(note that it's element by element multiplication here!).
The summation convention applies to te
terms as well as s
terms. More details and examples
are provided in
linear.functional.terms
.
Random effects can be added to gam
models using s(...,bs="re")
terms (see
smooth.construct.re.smooth.spec
),
or the paraPen
argument to gam
covered below. See gam.vcomp
, random.effects
and
smooth.construct.re.smooth.spec
for further details. An alternative is to use the approach of gamm
.
In case the ability to add smooth classes, smooth identities, by
variables and the summation convention are
still not sufficient to implement exactly the penalized GLM that you require, gam
also allows you to penalize the
parametric terms in the model formula. This is mostly useful in
allowing one or more matrix terms to be included in the formula, along with a
sequence of quadratic penalty matrices for each.
Suppose that you have set up a model matrix X, and want to penalize the corresponding coefficients, b
with two penalties b'S1 b and b'S2 b.
Then something like the
following would be appropriate:gam(y ~ X - 1,paraPen=list(X=list(S1,S2)))
The paraPen
argument should be a list with elements having names corresponding to the terms being penalized.
Each element of paraPen
is itself a list, with optional elements L
, rank
and sp
: all other elements
must be penalty matrices. If present, rank
is a vector giving the rank of each penalty matrix
(if absent this is determined numerically). L
is a matrix that maps underlying log smoothing parameters to the
log smoothing parameters that actually multiply the individual quadratic penalties: taken as the identity if not supplied.
sp
is a vector of (underlying) smoothing parameter values: positive values are taken as fixed, negative to signal that
the smoothing parameter should be estimated. Taken as all negative if not supplied.
An obvious application of paraPen
is to incorporate random effects, and an example of this is provided below. In this
case the supplied penalty matrices will be (generalized) inverse covariance matrices for the random effects — i.e. precision
matrices. The final estimate of the covariance matrix corresponding to one of these penalties is given by the (generalized)
inverse of the penalty matrix multiplied by the estimated scale parameter and divided by the estimated
smoothing parameter for the penalty. For example, if you use an identity matrix to penalize some coefficients that are to be viewed as i.i.d.
Gaussian random effects, then their estimated variance will be the estimated scale parameter divided by the estimate of the
smoothing parameter, for this penalty. See the ‘rail’ example below.
P-values for penalized parametric terms should be treated with caution. If you must have them, then use the option freq=TRUE
in
anova.gam
and summary.gam
, which will tend to give reasonable results for random effects implemented this way,
but not for terms with a rank defficient penalty (or penalties with a wide eigen-spectrum).
Simon N. Wood simon.wood@r-project.org
Wahba (1990) Spline Models of Observational Data SIAM.
Wood S.N. (2017) Generalized Additive Models: An Introduction with R (2nd edition). Chapman and Hall/CRC Press.
require(mgcv) set.seed(10) ## simulate date from y = f(x2)*x1 + error dat <- gamSim(3,n=400) b<-gam(y ~ s(x2,by=x1),data=dat) plot(b,pages=1) summary(b) ## Factor `by' variable example (with a spurious covariate x0) ## simulate data... dat <- gamSim(4) ## fit model... b <- gam(y ~ fac+s(x2,by=fac)+s(x0),data=dat) plot(b,pages=1) summary(b) ## note that the preceding fit is the same as.... b1<-gam(y ~ s(x2,by=as.numeric(fac==1))+s(x2,by=as.numeric(fac==2))+ s(x2,by=as.numeric(fac==3))+s(x0)-1,data=dat) ## ... the `-1' is because the intercept is confounded with the ## *uncentred* smooths here. plot(b1,pages=1) summary(b1) ## repeat forcing all s(x2) terms to have the same smoothing param ## (not a very good idea for these data!) b2 <- gam(y ~ fac+s(x2,by=fac,id=1)+s(x0),data=dat) plot(b2,pages=1) summary(b2) ## now repeat with a single reference level smooth, and ## two `difference' smooths... dat$fac <- ordered(dat$fac) b3 <- gam(y ~ fac+s(x2)+s(x2,by=fac)+s(x0),data=dat,method="REML") plot(b3,pages=1) summary(b3) rm(dat) ## An example of a simple random effects term implemented via ## penalization of the parametric part of the model... dat <- gamSim(1,n=400,scale=2) ## simulate 4 term additive truth ## Now add some random effects to the simulation. Response is ## grouped into one of 20 groups by `fac' and each groups has a ## random effect added.... fac <- as.factor(sample(1:20,400,replace=TRUE)) dat$X <- model.matrix(~fac-1) b <- rnorm(20)*.5 dat$y <- dat$y + dat$X%*%b ## now fit appropriate random effect model... PP <- list(X=list(rank=20,diag(20))) rm <- gam(y~ X+s(x0)+s(x1)+s(x2)+s(x3),data=dat,paraPen=PP) plot(rm,pages=1) ## Get estimated random effects standard deviation... sig.b <- sqrt(rm$sig2/rm$sp[1]);sig.b ## a much simpler approach uses "re" terms... rm1 <- gam(y ~ s(fac,bs="re")+s(x0)+s(x1)+s(x2)+s(x3),data=dat,method="ML") gam.vcomp(rm1) ## Simple comparison with lme, using Rail data. ## See ?random.effects for a simpler method require(nlme) b0 <- lme(travel~1,data=Rail,~1|Rail,method="ML") Z <- model.matrix(~Rail-1,data=Rail, contrasts.arg=list(Rail="contr.treatment")) b <- gam(travel~Z,data=Rail,paraPen=list(Z=list(diag(6))),method="ML") b0 (b$reml.scale/b$sp)^.5 ## `gam' ML estimate of Rail sd b$reml.scale^.5 ## `gam' ML estimate of residual sd b0 <- lme(travel~1,data=Rail,~1|Rail,method="REML") Z <- model.matrix(~Rail-1,data=Rail, contrasts.arg=list(Rail="contr.treatment")) b <- gam(travel~Z,data=Rail,paraPen=list(Z=list(diag(6))),method="REML") b0 (b$reml.scale/b$sp)^.5 ## `gam' REML estimate of Rail sd b$reml.scale^.5 ## `gam' REML estimate of residual sd ################################################################ ## Approximate large dataset logistic regression for rare events ## based on subsampling the zeroes, and adding an offset to ## approximately allow for this. ## Doing the same thing, but upweighting the sampled zeroes ## leads to problems with smoothness selection, and CIs. ################################################################ n <- 50000 ## simulate n data dat <- gamSim(1,n=n,dist="binary",scale=.33) p <- binomial()$linkinv(dat$f-6) ## make 1's rare dat$y <- rbinom(p,1,p) ## re-simulate rare response ## Now sample all the 1's but only proportion S of the 0's S <- 0.02 ## sampling fraction of zeroes dat <- dat[dat$y==1 | runif(n) < S,] ## sampling ## Create offset based on total sampling fraction dat$s <- rep(log(nrow(dat)/n),nrow(dat)) lr.fit <- gam(y~s(x0,bs="cr")+s(x1,bs="cr")+s(x2,bs="cr")+s(x3,bs="cr")+ offset(s),family=binomial,data=dat,method="REML") ## plot model components with truth overlaid in red op <- par(mfrow=c(2,2)) fn <- c("f0","f1","f2","f3");xn <- c("x0","x1","x2","x3") for (k in 1:4) { plot(lr.fit,select=k,scale=0) ff <- dat[[fn[k]]];xx <- dat[[xn[k]]] ind <- sort.int(xx,index.return=TRUE)$ix lines(xx[ind],(ff-mean(ff))[ind]*.33,col=2) } par(op) rm(dat) ## A Gamma example, by modify `gamSim' output... dat <- gamSim(1,n=400,dist="normal",scale=1) dat$f <- dat$f/4 ## true linear predictor Ey <- exp(dat$f);scale <- .5 ## mean and GLM scale parameter ## Note that `shape' and `scale' in `rgamma' are almost ## opposite terminology to that used with GLM/GAM... dat$y <- rgamma(Ey*0,shape=1/scale,scale=Ey*scale) bg <- gam(y~ s(x0)+ s(x1)+s(x2)+s(x3),family=Gamma(link=log), data=dat,method="REML") plot(bg,pages=1,scheme=1)
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.