 |
GAM: Generalized additive models
by Trevor Hastie and Robert Tibshirani
Introduction
In the statistical analysis of
clinical trials and observational studies,
the identification and adjustment for
prognostic factors is an important component.
Valid comparisons of different treatments requires the appropriate
adjustment
for relevant prognostic factors.
The failure to consider important prognostic
variables, particularly in observational studies, can lead to errors in estimating
treatment differences. In addition, incorrect modeling of prognostic factors can
result in the failure to identify nonlinear trends or
threshold effects on survival.
This
article describes flexible statistical methods that may be
used to identify and characterize the effect of potential
prognostic factors on an outcome variable. These methods
are called ``generalized additive models'', and extend
the traditional linear statistical model. They can be applied
in any setting where a linear or generalized linear model
is typically used. These settings include standard continuous response
regression, categorical or ordered categorical
response data, count data, survival data and time series.
One of the most commonly used statistical models in medical research
is the logistic regression model for binary data.
We use it here as a specific
illustration of a generalized additive mode. Logistic regression (and
many other techniques) model
the effects of prognostic factors
in terms of a linear predictor of the form
, where the
are parameters.
The generalized additive model replaces
with where is a unspecified (``non-parametric'')
function.
This function is estimated in a flexible manner using a scatterplot
smoother. The estimated function can reveal
possible nonlinearities in the effect of the .
We first give some background on the methodology,
and then discuss the details of
the logistic regression model and its
generalization.
Some related developments are discussed in the last section.
The building block of the generalized additive model
algorithm is the scatterplot smoother.
We will first describe scatterplot smoothing in a simple setting,
and then indicate how it is used in generalized additive modelling.
Suppose that we have a scatterplot of points
like that shown in figure 1.
|
Figure 1: Left panel shows a fictitious scatterplot
of an outcome measure y plotted against a prognostic factor x.
In the right panel, a scatterplot smooth has been added to describe the
trend of y on x.
|
Here y is a response or outcome variable, and x is a prognostic
factor.
We wish to fit a smooth curve f(x) that summarizes the
dependence of y on x.
If we were to find the curve that simply minimizes
, the result would be an interpolating curve
that would not be smooth at all.
The cubic spline smoother imposes smoothness on f(x).
We seek the function f(x) that minimizes
Notice that measures the ``wiggliness'' of the
function f:
linear fs have , while non-linear fs
produce values bigger than zero.
is a non-negative smoothing parameter that must
be chosen by the data analyst.
It governs the tradeoff between the goodness of fit to the data and
(as measured by and wiggleness of the
function.
Larger values of force f to be smoother.
For any value of , the solution to (1)
is a cubic spline, i.e., a piecewise cubic polynomial with pieces
joined at the unique observed values of x in the dataset.
Fast and stable numerical procedures are available for
computation of the fitted curve.
The right panel of figure
1 shows a cubic spline fit to the data.
What value of did we use in figure 1?
In fact it is not a convenient to express the
desired smoothness of f in terms of , as the meaning
of depends on the units of the prognostic factor x.
Instead, it is possible to define an
``effective number of parameters'' or ``degrees of freedom''
of a cubic spline smoother, and then use a
numerical search to determine the value
of to yield this number. In figure 1
we chose the effective number of parameters to be 5.
Roughly speaking, this means that the complexity of the curve
is about the same as a polynomial regression of degrees 4.
However, the cubic spline smoother ``spreads out'' its parameters
in a more even manner, and hence is much more flexible than
a polynomial regression. Note that the degrees of freedom of a smoother
need not be an integer.
The above discussion tells how to fit a curve to a single
prognostic factor.
With multiple prognostic factors, if denotes the
value of the jth prognostic factor for the
ith observation, we fit the additive model
A criterion like (1) can be specified for this problem,
and a simple iterative procedure exists for estimating the s.
We
apply a cubic spline smoother to
the outcome as a function of ,
for each prognostic factor in turn. The process is continues until
the estimates stabilize.
These procedure is known as ``backfitting'' and the resulting fit
is analogous to a multiple regression for linear models.
When generalized additive models are fit to binary
response data (and in many other settings),
the appropriate error criterion is a penalized log likelihood
or a penalized log partial-likelihood.
To maximize it, the backfitting procedure is used
in conjunction with a maximum likelihood or
maximum partial likelihood algorithm.
The usual Newton-Raphson routine for maximizing log-likelihoods
in these models can be cast
in a IRLS (iteratively reweighted least squares)
form.
This involves a repeated weighted linear regression of a constructed
response variable on the covariates: each regression yields
a new value of the parameter estimates which give a new
constructed variable, and the process is iterated.
In the generalized additive model, the weighted linear regression
is simply replaced by a weighted backfitting algorithm.
Details can be found
in chapter 6 of HT90.
The generalized additive logistic model
Generalized additive models can be used in virtually any
setting where linear models are used.
The basic idea is to replace , the linear component
of the model with an additive component .
In the logistic regression model the outcome is 0 or 1, with 1 indicating an event
(like death or relapse of a disease) and 0 indicating no event.
We wish to model , the probability of an event
given prognostic factors . The linear
logistic model assumes that the log-odds are linear:
The generalized additive logistic model assumes instead that
The functions are estimated by an algorithm
like the one described earlier.
To illustrate this, we describe a study on
the survival of children after cardiac surgery for heart defects,
taken from Wi90.
The data
were collected during for the period 1983-1988.
A pre-operation warm-blood cardioplegia procedure, thought to improve chances for survival, was introduced
in February 1988.
This was not used on all of the children after February 1988,
only on those for which it was thought appropriate and only by surgeons who
chose to use the new procedure.
The main question is whether the introduction of the warming procedure
improved survival; the importance of risk factors age, weight and diagnostic
category is also of interest.
Figure 2: Estimated functions for weight and age for warm
cardioplegia data. The shaded region represents twice the pointwise asymptotic standard errors of the estimated curve.
If the warming procedure was given in a randomized manner, we could simply
focus on the post-February 1988 data and compare the survival of those who received the
new procedure to those who did not.
However allocation was not random so we can only try to assess the effectiveness
of the warming procedure as it was applied.
For this analysis, we use all of the data (1983-1988).
To adjust for changes that might have occurred over the five-year period,
we include the date of the operation as a covariate.
However operation date is strongly confounded with the warming operation
and thus a general nonparametric fit for date of operation might
unduly remove some of the effect attributable to the warming procedure.
To avoid this, we allow only a linear effect for operation date.
Hence we
must assume that any time trend
is either a consistently increasing or decreasing trend.
|
Figure 3: Estimated probabilities for warm cardioplegia
data, conditioned on two ages (columns), and three diagnostic classes (rows). Broken line is standard treatment, solid line is warm cardioplegia.
Bars indicate one standard error.
|
We fit a generalized additive logistic model to the binary response death, with smooth terms for
age and weight, a linear term for operation date, a categorical variable
for diagnosis, and a binary variable for the warming operation.
All the smooth terms are fitted with 4 degrees of freedom.
The resulting curves for age and weight are shown in figure
2.
As one would expect, the highest risk is for the lighter babies, with
a decreasing risk over 3 kg.
Somewhat surprisingly, there seems to be a low risk age around 200 days,
with higher risk for younger and older children.
Note that the numerical algorithm is not able to achieve exactly
4 degrees of freedom for the age and weight terms, but 3.80 and 3.86
degrees of freedom respectively.
An analysis of deviance can be carried out for inference from
a generalized additive model, analogous to that done for generalized
linear models. The only new twist is estimation of
the degrees of freedom
or effective number of parameters of the fitted model,
which was discussed in the previous section.
This analysis shows that the warming procedure is strongly beneficial
to survival.
There are strong differences in the diagnosis categories, while
the estimated effect of operation date is not large.
Since a logistic regression is additive on the logit scale but not on the
probability scale, a plot of the fitted probabilities is often
informative.
Figure 3 shows the fitted probabilities broken down by
age and diagnosis, and is a concise summary of the findings of this study.
The beneficial effect of the treatment at the lower
weights is evident.
As with all nonrandomized studies, the results here should be interpreted
with caution.
In particular, one must ensure that the children were not chosen for the warming operation
based on their prognosis.
To investigate this,
we perform a second analysis in which a dummy variable (say period), corresponding
to before versus after February 1988, is inserted in place of the
dummy variable for the warming operation.
The purpose of this is to investigate whether the overall treatment
strategy improved after February 1988.
If this turns out not to be the case, it will imply that warming was
used only for patients with a good prognosis, who would have survived anyway.
A linear adjustment for operation date is included as before.
The results are qualitatively very similar to the first analysis:
age and weight are significant, with effects similar to those
in Fig. 2; diagnosis is significant, while operation date
(linear effect) is not.
Period is highly significant.
Hence there seems to be a significant overall improvement in survival
after February 1988.
For more details, see Wi90.
Discussion
The nonlinear modeling procedures described here are useful
for two reasons. First, they help to prevent model
misspecification, which can lead to incorrect conclusions
regarding treatment efficacy. Second, they provide
information about the relationship between prognostic
factors and disease risk that is not revealed by the use of
standard modeling techniques. Linearity always remains a special case, and thus
simple linear relationships can be easily confirmed with
flexible modeling of covariate effects.
The most comprehensive source for generalized additive models
is the text of that name by Hastie and Tibshirani (1990), from which the
example was taken.
HST92 provide a detailed example of the use of generalized
additive models in the proportional hazads setting.
Different applications of this work in medical problems
are discussed in HBS89 and HH90.
GS94
discuss penalization and spline models in a variety of settings.
Wa90 is a good source
for the mathematical background of spline models.
Efron & Tibshirani (1991) give an exposition of
modern developments in statistics
(including generalized additive models),
for a nonmathematical audience.
There has been some recent related work in this area.
KST93 describe a different method for flexible
hazard modelling. Fr91 proposed a generalization of
additive modelling that finds interactions among prognostic factors.
Of particular interest in the proportional hazards setting
is the
varying coefficient model of HT93,
in which the parameter effects can change with other factors
such as time.
The model has the form
The parameter functions are estimated by scatterplot smoothers
in a similar fashion to the methods described earlier.
This gives a useful way of modelling departures
from the proportional hazards assumption by estimating the way in
which the
parameters change with time.
Software for fitting generalized additive models
is available as part of the S/S-PLUS statistical language BCW88,
CH91, in a Fortran program called gamfit available
at statlib (in general/gamfit at the ftp site lib.stat.cmu.edu) and also
in the GAIM package for MS-DOS computers.
Acknowledgements
The second author gratefully acknowledges support from the
Natural Sciences and Engineering Research
Council of Canada.
References
- [Becker et al.] Becker, Chambers & Wilks1988BCW88
Becker, R., Chambers, J. & Wilks, A. (1988), The New S Language,
Wadsworth International Group.
- [Chambers & Hastie] Chambers & Hastie1991CH91
, J. & Hastie, T. (1991), Statistical Models in S,
Wadsworth/Brooks Cole, Pacific Grove.
- Efron, B. & Tibshirani, R. (1991) , `Statistical analsysis in the computer
age', Science.
[Friedman]Friedman1991Fr91
Friedman, J. (1991), `Multivariate adaptive regression splines (with
discussion)', Annals of Statistics 19(1), 1-141.
[Green & Silverman]Green & Silverman1994GS94
Green, P. & Silverman, B. (1994), Nonparametric regression and
generalized linear models: a roughness peanlty approach, Chapman and Hall.
[Hastie & Herman]Hastie & Herman1990HH90
Hastie, T. & Herman, A. (1990), `An analysis of gestational age, neonatal
size and neonatal death using nonparametric logistic regression',
Journal of Clinical Epidemiology 43, 1179-90.
[Hastie & Tibshirani]Hastie & Tibshirani1990HT90
Hastie, T. & Tibshirani, R. (1990), Generalized Additive Models,
Chapman and Hall.
[Hastie & Tibshirani]Hastie & Tibshirani1993HT93
Hastie, T. & Tibshirani, R. (1993), Discriminant analysis by gaussian
mixtures, In preparation.
[Hastie et al.]Hastie, Botha & Schnitzler1989HBS89
Hastie, T., Botha, J. & Schnitzler, C. (1989), `Regression with an ordered
categorical response', Statistics in Medicine 43, 884-889.
[Hastie et al.]Hastie, Sleeper & Tibshirani1992HST92
Hastie, T., Sleeper, L. & Tibshirani, R. (1992), `Flexible covariate effects
in the cox model', Breast Cancer Research and Treatment -- special
issue.
[Kooperberg et al.]Kooperberg, Stone & Truong1993KST93
Kooperberg, C., Stone, C. & Truong, Y. (1993), Hazard regression, Technical
report, Dept of statistics, Univ. of Cal. Berkeley.
[Wahba]Wahba1990Wa90
Wahba, G. (1990), Spline Models for Observational Data, SIAM,
Philadelphia.
[Williams et al.]Williams, Rebeyka, Tibshirani, Coles, Lightfoot,
Freedom & Trusler1990Wi90
Williams, W., Rebeyka, I., Tibshirani, R., Coles, J., Lightfoot, N., Freedom,
R. & Trusler, G. (1990), `Warm induction cardioplegia in the infant: a
technique to avoid rapid cooling myocardial contracture', J. Thorac. and
Cardio. Surg 100, 896-901.
 |
 |