Sequoia Hall
Home Academics Research Seminars Consulting Industrial Affiliates   People   Admissions
    



GAM: Generalized additive models

by Trevor Hastie and Robert Tibshirani

Introduction
In the statistical analysis of clinical trials and observational studies, the identification and adjustment for prognostic factors is an important component. Valid comparisons of different treatments requires the appropriate adjustment for relevant prognostic factors. The failure to consider important prognostic variables, particularly in observational studies, can lead to errors in estimating treatment differences. In addition, incorrect modeling of prognostic factors can result in the failure to identify nonlinear trends or threshold effects on survival.

This article describes flexible statistical methods that may be used to identify and characterize the effect of potential prognostic factors on an outcome variable. These methods are called ``generalized additive models'', and extend the traditional linear statistical model. They can be applied in any setting where a linear or generalized linear model is typically used. These settings include standard continuous response regression, categorical or ordered categorical response data, count data, survival data and time series.

One of the most commonly used statistical models in medical research is the logistic regression model for binary data. We use it here as a specific illustration of a generalized additive mode. Logistic regression (and many other techniques) model the effects of prognostic factors in terms of a linear predictor of the form , where the are parameters. The generalized additive model replaces with where is a unspecified (``non-parametric'') function. This function is estimated in a flexible manner using a scatterplot smoother. The estimated function can reveal possible nonlinearities in the effect of the .

We first give some background on the methodology, and then discuss the details of the logistic regression model and its generalization. Some related developments are discussed in the last section.

Smoothing methods and generalized additive models

The building block of the generalized additive model algorithm is the scatterplot smoother. We will first describe scatterplot smoothing in a simple setting, and then indicate how it is used in generalized additive modelling.

Suppose that we have a scatterplot of points like that shown in figure 1.

  
Figure 1: Left panel shows a fictitious scatterplot of an outcome measure y plotted against a prognostic factor x. In the right panel, a scatterplot smooth has been added to describe the trend of y on x.

Here y is a response or outcome variable, and x is a prognostic factor. We wish to fit a smooth curve f(x) that summarizes the dependence of y on x. If we were to find the curve that simply minimizes , the result would be an interpolating curve that would not be smooth at all.

The cubic spline smoother imposes smoothness on f(x). We seek the function f(x) that minimizes

 

Notice that measures the ``wiggliness'' of the function f: linear fs have , while non-linear fs produce values bigger than zero. is a non-negative smoothing parameter that must be chosen by the data analyst. It governs the tradeoff between the goodness of fit to the data and (as measured by and wiggleness of the function. Larger values of force f to be smoother.

For any value of , the solution to (1) is a cubic spline, i.e., a piecewise cubic polynomial with pieces joined at the unique observed values of x in the dataset. Fast and stable numerical procedures are available for computation of the fitted curve. The right panel of figure 1 shows a cubic spline fit to the data.

What value of did we use in figure 1? In fact it is not a convenient to express the desired smoothness of f in terms of , as the meaning of depends on the units of the prognostic factor x. Instead, it is possible to define an ``effective number of parameters'' or ``degrees of freedom'' of a cubic spline smoother, and then use a numerical search to determine the value of to yield this number. In figure 1 we chose the effective number of parameters to be 5. Roughly speaking, this means that the complexity of the curve is about the same as a polynomial regression of degrees 4. However, the cubic spline smoother ``spreads out'' its parameters in a more even manner, and hence is much more flexible than a polynomial regression. Note that the degrees of freedom of a smoother need not be an integer.

The above discussion tells how to fit a curve to a single prognostic factor. With multiple prognostic factors, if denotes the value of the jth prognostic factor for the ith observation, we fit the additive model

 
A criterion like (1) can be specified for this problem, and a simple iterative procedure exists for estimating the s. We apply a cubic spline smoother to the outcome as a function of , for each prognostic factor in turn. The process is continues until the estimates stabilize. These procedure is known as ``backfitting'' and the resulting fit is analogous to a multiple regression for linear models.

When generalized additive models are fit to binary response data (and in many other settings), the appropriate error criterion is a penalized log likelihood or a penalized log partial-likelihood. To maximize it, the backfitting procedure is used in conjunction with a maximum likelihood or maximum partial likelihood algorithm. The usual Newton-Raphson routine for maximizing log-likelihoods in these models can be cast in a IRLS (iteratively reweighted least squares) form. This involves a repeated weighted linear regression of a constructed response variable on the covariates: each regression yields a new value of the parameter estimates which give a new constructed variable, and the process is iterated. In the generalized additive model, the weighted linear regression is simply replaced by a weighted backfitting algorithm. Details can be found in chapter 6 of HT90.

The generalized additive logistic model
Generalized additive models can be used in virtually any setting where linear models are used. The basic idea is to replace , the linear component of the model with an additive component .

In the logistic regression model the outcome is 0 or 1, with 1 indicating an event (like death or relapse of a disease) and 0 indicating no event. We wish to model , the probability of an event given prognostic factors . The linear logistic model assumes that the log-odds are linear:

The generalized additive logistic model assumes instead that

The functions are estimated by an algorithm like the one described earlier.

To illustrate this, we describe a study on the survival of children after cardiac surgery for heart defects, taken from Wi90. The data were collected during for the period 1983-1988. A pre-operation warm-blood cardioplegia procedure, thought to improve chances for survival, was introduced in February 1988. This was not used on all of the children after February 1988, only on those for which it was thought appropriate and only by surgeons who chose to use the new procedure. The main question is whether the introduction of the warming procedure improved survival; the importance of risk factors age, weight and diagnostic category is also of interest.

  
Figure 2: Estimated functions for weight and age for warm cardioplegia data. The shaded region represents twice the pointwise asymptotic standard errors of the estimated curve.

If the warming procedure was given in a randomized manner, we could simply focus on the post-February 1988 data and compare the survival of those who received the new procedure to those who did not. However allocation was not random so we can only try to assess the effectiveness of the warming procedure as it was applied. For this analysis, we use all of the data (1983-1988). To adjust for changes that might have occurred over the five-year period, we include the date of the operation as a covariate. However operation date is strongly confounded with the warming operation and thus a general nonparametric fit for date of operation might unduly remove some of the effect attributable to the warming procedure. To avoid this, we allow only a linear effect for operation date. Hence we must assume that any time trend is either a consistently increasing or decreasing trend.

  
Figure 3: Estimated probabilities for warm cardioplegia data, conditioned on two ages (columns), and three diagnostic classes (rows). Broken line is standard treatment, solid line is warm cardioplegia. Bars indicate one standard error.

We fit a generalized additive logistic model to the binary response death, with smooth terms for age and weight, a linear term for operation date, a categorical variable for diagnosis, and a binary variable for the warming operation. All the smooth terms are fitted with 4 degrees of freedom.

The resulting curves for age and weight are shown in figure 2. As one would expect, the highest risk is for the lighter babies, with a decreasing risk over 3 kg. Somewhat surprisingly, there seems to be a low risk age around 200 days, with higher risk for younger and older children. Note that the numerical algorithm is not able to achieve exactly 4 degrees of freedom for the age and weight terms, but 3.80 and 3.86 degrees of freedom respectively.

An analysis of deviance can be carried out for inference from a generalized additive model, analogous to that done for generalized linear models. The only new twist is estimation of the degrees of freedom or effective number of parameters of the fitted model, which was discussed in the previous section. This analysis shows that the warming procedure is strongly beneficial to survival. There are strong differences in the diagnosis categories, while the estimated effect of operation date is not large.

Since a logistic regression is additive on the logit scale but not on the probability scale, a plot of the fitted probabilities is often informative. Figure 3 shows the fitted probabilities broken down by age and diagnosis, and is a concise summary of the findings of this study. The beneficial effect of the treatment at the lower weights is evident. As with all nonrandomized studies, the results here should be interpreted with caution. In particular, one must ensure that the children were not chosen for the warming operation based on their prognosis. To investigate this, we perform a second analysis in which a dummy variable (say period), corresponding to before versus after February 1988, is inserted in place of the dummy variable for the warming operation. The purpose of this is to investigate whether the overall treatment strategy improved after February 1988. If this turns out not to be the case, it will imply that warming was used only for patients with a good prognosis, who would have survived anyway. A linear adjustment for operation date is included as before. The results are qualitatively very similar to the first analysis: age and weight are significant, with effects similar to those in Fig. 2; diagnosis is significant, while operation date (linear effect) is not. Period is highly significant. Hence there seems to be a significant overall improvement in survival after February 1988. For more details, see Wi90.

Discussion
The nonlinear modeling procedures described here are useful for two reasons. First, they help to prevent model misspecification, which can lead to incorrect conclusions regarding treatment efficacy. Second, they provide information about the relationship between prognostic factors and disease risk that is not revealed by the use of standard modeling techniques. Linearity always remains a special case, and thus simple linear relationships can be easily confirmed with flexible modeling of covariate effects.

The most comprehensive source for generalized additive models is the text of that name by Hastie and Tibshirani (1990), from which the example was taken. HST92 provide a detailed example of the use of generalized additive models in the proportional hazads setting. Different applications of this work in medical problems are discussed in HBS89 and HH90. GS94 discuss penalization and spline models in a variety of settings. Wa90 is a good source for the mathematical background of spline models.

Efron & Tibshirani (1991) give an exposition of modern developments in statistics (including generalized additive models), for a nonmathematical audience.

There has been some recent related work in this area. KST93 describe a different method for flexible hazard modelling. Fr91 proposed a generalization of additive modelling that finds interactions among prognostic factors. Of particular interest in the proportional hazards setting is the varying coefficient model of HT93, in which the parameter effects can change with other factors such as time. The model has the form  

The parameter functions are estimated by scatterplot smoothers in a similar fashion to the methods described earlier. This gives a useful way of modelling departures from the proportional hazards assumption by estimating the way in which the parameters change with time.

Software for fitting generalized additive models is available as part of the S/S-PLUS statistical language BCW88, CH91, in a Fortran program called gamfit available at statlib (in general/gamfit at the ftp site lib.stat.cmu.edu) and also in the GAIM package for MS-DOS computers.

Acknowledgements
The second author gratefully acknowledges support from the Natural Sciences and Engineering Research Council of Canada.

References

  • [Becker et al.] Becker, Chambers & Wilks1988BCW88 Becker, R., Chambers, J. & Wilks, A. (1988), The New S Language, Wadsworth International Group.
  • [Chambers & Hastie] Chambers & Hastie1991CH91 , J. & Hastie, T. (1991), Statistical Models in S, Wadsworth/Brooks Cole, Pacific Grove.
  • Efron, B. & Tibshirani, R. (1991) , `Statistical analsysis in the computer age', Science.

[Friedman]Friedman1991Fr91 Friedman, J. (1991), `Multivariate adaptive regression splines (with discussion)', Annals of Statistics 19(1), 1-141.

[Green & Silverman]Green & Silverman1994GS94 Green, P. & Silverman, B. (1994), Nonparametric regression and generalized linear models: a roughness peanlty approach, Chapman and Hall.

[Hastie & Herman]Hastie & Herman1990HH90 Hastie, T. & Herman, A. (1990), `An analysis of gestational age, neonatal size and neonatal death using nonparametric logistic regression', Journal of Clinical Epidemiology 43, 1179-90.

[Hastie & Tibshirani]Hastie & Tibshirani1990HT90 Hastie, T. & Tibshirani, R. (1990), Generalized Additive Models, Chapman and Hall.

[Hastie & Tibshirani]Hastie & Tibshirani1993HT93 Hastie, T. & Tibshirani, R. (1993), Discriminant analysis by gaussian mixtures, In preparation.

[Hastie et al.]Hastie, Botha & Schnitzler1989HBS89 Hastie, T., Botha, J. & Schnitzler, C. (1989), `Regression with an ordered categorical response', Statistics in Medicine 43, 884-889.

[Hastie et al.]Hastie, Sleeper & Tibshirani1992HST92 Hastie, T., Sleeper, L. & Tibshirani, R. (1992), `Flexible covariate effects in the cox model', Breast Cancer Research and Treatment -- special issue.

[Kooperberg et al.]Kooperberg, Stone & Truong1993KST93 Kooperberg, C., Stone, C. & Truong, Y. (1993), Hazard regression, Technical report, Dept of statistics, Univ. of Cal. Berkeley.

[Wahba]Wahba1990Wa90 Wahba, G. (1990), Spline Models for Observational Data, SIAM, Philadelphia.

[Williams et al.]Williams, Rebeyka, Tibshirani, Coles, Lightfoot, Freedom & Trusler1990Wi90 Williams, W., Rebeyka, I., Tibshirani, R., Coles, J., Lightfoot, N., Freedom, R. & Trusler, G. (1990), `Warm induction cardioplegia in the infant: a technique to avoid rapid cooling myocardial contracture', J. Thorac. and Cardio. Surg 100, 896-901.



Contact  | Sitemap  | Directories  | Maps  & Directions  | Giving to Stanford
Copyright 2004Stanford University. All Rights Reserved. Stanford, CA 94305, (650) 723-2300
Terms of Use Copyright Complaints