United States
Environmental Protection Agency
Office of Research and Development
Washington, D.C. 20460
EPA/600/R-01/081
October 2001
Parametric and
Nonparametric Logistic
Regressions for Prediction
of Presence/Absence of an
Amphibian
Q: What environmental factors determine the distribution of the
Red-Spotted Toad in a fragmented desert landscape?
Q: Can logistic regression using GLM or GAM-MARS be used to
address this question?
-------
Note: To conserve space on the cover, the title of this report has been abbreviated.
The full title appears on the following page.
-------
Parametric and Nonparametric
(MARS; Multivariate Additive
Regression Splines)
Logistic Regressions for Prediction of
A Dichotomous Response Variable
With an Example for
Presence/Absence of an Amphibian*
by
Maliha S. Nash and David F. Bradford
U.S. Environmental Protection Agency
Office of Research and Development
National Exposure Research Laboratory
Environmental Sciences Division
Las Vegas, Nevada
* Full Title
-------
Notice
The U.S. Environmental Protection Agency (EPA), through its Office of Research and Development
(ORD), funded and performed the research described here. It has been peer reviewed by the EPA and
approved for publication. Mention of trade names or commercial products does not constitute
endorsement or recommendation for use.
-------
Table of Contents
Notice ii
List of Figures iv
List of Tables v
Acknowledgments vi
Section 1 Introduction 1
1.1 Example Data Set Used 5
1.2 Logistic Regression 5
1.2.1 Maximum Likelihood Estimator 6
1.2.2 Assumptions 6
1.2.3 Steps to Follow 7
1.2.4 Model Selection 17
1.2.5 Dependence Between Observations 17
1.2.6 Interactions 18
Section 2 MARS 19
2.1.1 Model Fitting 19
2.1.2 Final Model 21
Section 3 Conclusion 25
References 26
Appendix 1 GENMOD Output 28
Appendix 2 SAS Statements for Standard Logistic Regression 32
Appendix 3 MARS Output 35
-------
List of Figures
Figure #                                                                                    Page
1  Presence (=1), absence (=0), and (-) prediction of amphibians as related with variable
   for latitude (UTM-N; see Table 1 for variable description)  2
2  Box-plots showing the presence (1) and absence (0) of toads as related to the independent
   variables. Independent variables are: Elevation (U), UTM-N (W; latitude), and vegetation
   cover (< 1 m high) over adjacent land, mean % (Q). "+" is the mean value; lines of the box
   from top are: maximum, quartile 3, quartile 2, quartile 1, and minimum values  13
3  Deviance (DIFDEV) values for the predicted probability of the presence and absence
   of toads  15
4  Chi-square (DIFCHISQ) values for the predicted probability of the presence and absence
   of toads  15
5  Square root of the absolute value of the deviance residual for each independent variable
   that was significant in the final model  16
6  Regression for the main effect variables in the model using MARS  24
-------
List of Tables
Table # Page
1 Description for Metrics Used in Analyses 2
2 SAS Output for Model Fit, Testing Global Null Hypothesis, and Association of
Predicted Probability and Observed Toad Presence 10
3 Final Stepwise Logistic Regression Analysis Model (n = 122 sites) 11
4 Coefficients and Their Statistics From MARS. Model F = 14.128, p < 0.001, df = 7,114 . . 21
-------
Acknowledgments
We are grateful for the valuable inputs and suggestions provided by Drs. James Wickham, Sandra
Catlin, Kurt Riitters, Chad Cross and Bill Bossing, which improved the comprehensiveness and clarity of
this report.
-------
Section 1
Introduction
The purpose of this report is to provide a reference manual that can be used by investigators for
making informed use of logistic regression using two statistical methods (standard logistic regression and
Multivariate Adaptive Regression Splines (MARS)). The details for analyses of relationships between a
dependent binary response variable (e.g., presence/absence) and a set of independent variables are
provided step-by-step for use by scientists who are not statisticians. Details of such statistical analyses
and their assumptions are often omitted from published literature, yet such details are essential to the
proper conduct of statistical analyses and interpretation of results. In this report, we use a data set for
amphibian presence/absence and associated habitat variables as an example.
Relationships between a response variable and independent variable(s) are commonly quantified and
described by regression models. The values of the coefficients and predictions from the fitted model are
used to infer and describe patterns of relationships, the effect of the independent variables on the
response, and the strength of association between the independent and response variable. All these will
help to analyze and understand a phenomenon, in this case a biological one. The generalized linear
model (GLM) offers a wide range of regression models, of which simple regression, analysis of
covariance, and ANOVA are special cases. In GLM, the functional relationship between the expected
value of the response variable and the independent variables is described via a link function as:

    g(μ) = β₀ + Σᵢ βᵢXᵢ     (1)

where g(·) is the link function, μ is the expected value of the response variable, β₀ and the βᵢ are
regression coefficients, and the Xᵢ are the independent variables. In simple regression, the link function
is the identity, so the model is stated directly for the mean of the response variable as:

    g(μ) = μ = β₀ + Σᵢ βᵢXᵢ     (2)
The above model represents the simplest relationship between the response and independent variables
in a linear manner. The above relationship also implies that the dependent variable is continuous and
random.
When the relationships between the mean response and the independent variables are not linear, a
different link function can be used to describe the relationships. The logistic model, for example, can be
used, where the link function is defined as:

    logit(μ) = β₀ + Σᵢ βᵢXᵢ     (3)
where logit(μ) = log(μ/(1 − μ)) is the "logit" link, the logit transformation of the probability.
When the binary response variable (presence/absence) is plotted against an independent variable (Figure
1), the data fall on the one and zero lines. Whether the UTM-N variable (i.e., latitude) enhances or
suppresses the presence of amphibians cannot be assessed by examining Figure 1 as is
normally done with continuous response variables in linear regression analysis. Therefore, the linear
model (Equation 2) is not valid for count and dichotomous response variables. Instead, the response
variable is related nonlinearly to the independent variable(s) via a link, such as the "logit" (Equation 3; solid
line in Figure 1). The logit is usually chosen as the link function because the resulting logistic regression can be an
increasing or decreasing function, the link function is differentiable, and it relates the linear predictor to
the expected value of the response. The logit transformation in Equation 3 maps probabilities in the range
(0, 1) to values in (−∞, ∞).
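The mapping can be checked numerically. The following Python sketch (an illustration added here, not part of the report's SAS analysis) shows that the logit sends probabilities near 0 and 1 to large negative and positive values, and that the inverse logit recovers the original probability:

```python
import math

def logit(p):
    """Logit link: maps a probability in (0, 1) to (-inf, inf)."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Inverse logit (logistic function): maps (-inf, inf) back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(logit(0.5))                  # 0.0 -- probability 0.5 maps to the origin
print(round(logit(0.999), 2))      # large positive value as p approaches 1
print(round(inv_logit(logit(0.25)), 4))  # round trip recovers 0.25
```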
[Figure 1. Presence (=1), absence (=0), and predicted probability (solid line) of toad presence as related
to latitude (UTM-N; see Table 1 for variable description).]

Table 1. Description of Metrics Used in Analyses

Variable Type/Name    P         Description

Patch Size Metrics
  Water
    LogWPArea (A)     NS        Surface water area (log10 [m2 + 10])
    SiteSurf (B)      NS        Surface water linear extent (percent)
  Vegetation
    LogRiparea (C)    0.0085    Riparian zone area (log10 [m2 + 10]). Zone boundaries delineated by
                                indicator taxa defined for SITEEMER, SITENATI, and SITERIPA
    SITEEMER (D)      NS        Emergent-type vegetation linear extent (percent). Indicator taxa: Typha,
                                Eleocharis, Scirpus, Mimulus, Anemopsis; Juncus & Carex if in stream
                                channel
    SITENATI (E)      NS        Native riparian trees linear extent (percent). Indicator taxa: Salix,
                                Populus, Fraxinus
    SiteRipa (F)      0.0859    Riparian shrubs/herbs linear extent (percent). Indicator taxa: Baccharis,
                                Pluchea, Vitis, Allenrolfea, Equisetum; Juncus or Carex if outside stream
                                channel
    SiteMesq (G)      NS        Phreatophytes linear extent (percent). Indicator taxa: Prosopis, Chilopsis

Patch Quality Variables
  Site Scale
    SiteOver (H)      NS        Linear extent of channel entirely overgrown (< 1 m height) with vegetation
                                (percent)
    Per_Rock (I)      0.0704    Bedrock substrate cover (percent)
    AveGrain (J)      NS        Predominant substrate grain size (median of 5 categories: < 0.1, 0.1-0.5,
                                0.5-8, 8-30, > 30 cm)
  Plot Scale
    PlotFlsu (K)      0.0142    Submerged or floating vegetation cover, including filamentous algae, mean
                                (percent)
    P_Avedep (L)      NS        Water depth, mean (cm)
    P_Wetper (M)      NS        Wetted perimeter width, mean (m)
    Plotsubm (N)      0.0251    Plot substrate size, for granular substrate, mean (log10 of cm)
    PlotCanp (O)      0.0037    Vegetation cover over water, mean (percent)
    PlotEmer (P)      NS        Emergent vegetation within 15 cm of point, mean (percent)
    P_Cany (Q)        0.0006    Vegetation cover (< 1 m high) over adjacent land, mean (percent)
    PLT_Rock (R)      0.0391    Bedrock substrate cover, mean (percent)
  Water
    LogEC (S)         NS        Electrical conductivity (log10 of µS/cm)
    PH (T)            0.0802    pH

Geographic Metrics
    Elevation (U)     0.0038    Elevation (m)
    UTM-E (V)         0.0031    UTM-East coordinate (km), centered on mean (UTM − 659); approximately
                                corresponds to longitude
    UTM-N (W)         0.0006    UTM-North coordinate (km), centered on mean (UTM − 3989); approximately
                                corresponds to latitude

Exotic Vegetation
    SiteTama (X)      0.0249    Tamarix spp. (exotic plant) linear extent (percent) in 400-m area
    SiteTamu (Y)      0.0341    Tamarix spp. (exotic plant) linear extent (percent), in 40-m segments

P value represents univariate logistic regression analyses of the presence/absence of the toad with each
environmental variable. NS indicates P > 0.10. Letter inside parentheses corresponds to the variable used in the
SAS program for simplicity.
To relax the GLM assumption of a strictly linear relationship between the response and the
independent variables, the generalized additive model (GAM) was introduced as an extension. It obtains
predictions via nonparametric methods. Fitting a model that accounts for local behavior may describe the
behavior of the data more accurately than a linear relationship can. GAM uses various methods of
smoothing to describe the data with little interference from the user. As described earlier, the
link function in GLM is modeled in a linear form, whereas in GAM it is modeled in an additive form,
describing the relationship between the transformed mean response and the independent variables with
nonparametric functions as:

    g(E(Y)) = g(μ) = c₀ + Σᵢ fᵢ(Xᵢ)     (4)

where g(·) is the link transform function, E(Y) is the expected response, c₀ is a constant (intercept), and fᵢ
is a nonparametric function. The most commonly used link function is the canonical link (Agresti, 1996).
The main difference between GLM and GAM for binary data is that GLM logistic regression assumes
that the log odds are linearly related to the independent variables, whereas GAM assumes that the log
odds are related to a sum of smooth functions of the independent variables.
A parametric method that is comparable to logistic regression analysis, but is used mainly to classify
the observations into groups of populations, is known as discriminant analysis. For this data set, where
the response variable is binary, discriminant analyses combine the independent variables linearly
separating the data into two groups. This method requires that the data in each group be multivariate
normal with a homogeneous variance-covariance matrix (Press and Wilson, 1978). Both discriminant and logistic
regression analyses produce solutions in terms of probability of presence/absence (0-1 response as in the
binary case). But the logistic model is preferable if multivariate normality is at all suspect. The logistic
model requires more computational effort, but that is not of great concern. If multivariate normality
holds, the coefficients of the discriminant model may have a smaller variance. In logistic regression, a
maximum likelihood method for estimation is used, which does not require that independent variables be
multivariate normal (see Maximum Likelihood Estimator (MLE) below).
MARS and Classification and Regression Tree (CART) are two common examples of GAM that have
been used in many studies (Walker, 1990; Efron and Tibshirani, 1991; Moore et al., 1991; White and
Sifneos, 1997). Both models can be used for regression modeling of binary response, but CART is more
useful for classification than regression. Like standard logistic regression, MARS offers backward,
forward, and stepwise selection, which helps choose the independent variables most related to the
response. Stepwise selection is often preferred because a removed variable may have a chance to
be included again.
In any GAM model, the final model is a summation of a group of functions that fit the data locally.
The final model is data driven and represents closely the behavior of the data. The process of fitting and
validating a model requires a large number of observations, especially when there are many independent
variables relative to the number of observations (see text for MARS). For CART, for example, one needs
more than 128 observations in a data set (Miller, 1994). We used MARS as a nonparametric method for
logistic regression analysis, with the "stepwise" option for selecting the most significant variable, making
it similar to programs for parametric logistic regression analysis.
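To make the idea of local fitting concrete, MARS builds its model from hinge (piecewise-linear spline) basis functions of the form max(0, x − t) and max(0, t − x), joined at knots. The sketch below is a simplified illustration with a hypothetical knot and hypothetical coefficients, not output from the MARS program:

```python
def hinge(x, knot, direction=+1):
    """MARS-style basis function: max(0, +(x - knot)) or max(0, -(x - knot))."""
    return max(0.0, direction * (x - knot))

def toy_mars(x, c0=1.0, c1=0.5, c2=-0.3, knot=5.0):
    """A toy piecewise-linear fit: constant plus two hinges sharing one knot.
    All coefficient and knot values here are hypothetical."""
    return c0 + c1 * hinge(x, knot, +1) + c2 * hinge(x, knot, -1)

print(toy_mars(5.0))  # 1.0 -- at the knot only the constant term remains
print(toy_mars(7.0))  # 2.0 -- slope c1 applies to the right of the knot
```

The fitted function is linear on each side of the knot but can change slope there, which is how the final MARS model adapts to local behavior of the data.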
Regression analysis and parametric generalizations make specific assumptions about the functional
form of the relationship of the response to the predictions (e.g., linear); nonparametric regression, on the
other hand, places minimal assumptions about the form of the relationship (e.g., Y is a smooth or additive
function of the X's). Thus, extensive computation and a weaker theoretical framework are exchanged
for relaxation of assumptions. The choice between the two methods (parametric versus
nonparametric) depends on the user and the size of the data set. If the interest is in estimation and
inference about the independent variables, then GLM is the preferred method. If the interest is to reveal
structural behavior of the response with the independent variable, then GAM is the method of choice,
especially when little is known about the nature of the relationship. Some studies have yielded similar
results for parametric and nonparametric methods when applied to the same data sets (Bradford et al.,
1998; Sheskin, 2000).
Below, we describe procedures for a generalized linear model with logistic regression analysis using SAS
(SAS Institute, Cary, NC, version 8) and a generalized additive model (MARS) using a computer program
also called MARS (Salford Systems, 1999, User Guide), which herein will be noted in italics to
distinguish it from the statistical method of the same name. The MARS program can be easily used for
continuous and binary dependent variables, handle missing values using surrogate variables, include all
possible interactions, account for collinearity between independent variables, and prevent overfitting for
the final model. The program has an algorithm to search for the basis functions and knots and to define
the final optimal model with its statistics. The software can be run through a graphical user interface (GUI)
or with commands at the command prompt, and it produces a classic text output for the analyses.
1.1 Example Data Set Used
The main objective of conducting the regression analyses described above is to predict the
presence/absence of the red-spotted toad (Bufo punctatus) in the northeastern Mojave Desert, USA, as a
function of the surrounding environment (Bradford et al., submitted). Twenty-five environmental
variables were used in this report (Table 1). These variables represent topography, patch size
metrics, patch quality, exotic vegetation, and spatial direction. Although the total number of sites was
128, only 122 sites were used in the regression analysis because of missing values for some of the
independent variables. Also, the data herein differed slightly from the data set and model used for
biological interpretation (Bradford et al., submitted).
1.2 Logistic Regression
The response variable is dichotomous or binary (i.e., the presence or absence of toads). Our interest
is in estimating the probability of the presence of toads as a function of many independent variables (see
Table 1 for variable descriptions). The 25 independent variables were grouped based on habitat, scale,
directional, and environmental characteristics. For simplicity hereafter, we will use alphabetic letters for
the independent variables and will refer the reader to Table 1 for variable descriptions.
Equation (1) is a multiple logistic regression model that relates the proportion of presence (p(x)) to
the independent variables (the xᵢ's):

    g(μ) = Logit(p(x)) = Log{p(x)/(1 − p(x))} = β₀ + Σᵢ βᵢxᵢ     (5)

The coefficients βᵢ in Equation 5 are estimated using maximum likelihood and are used to
predict the probability of toad presence as a function of the xᵢ's (Equation 6):

    p = exp(β₀ + Σᵢ βᵢxᵢ) / (1 + exp(β₀ + Σᵢ βᵢxᵢ))     (6)
Detailed mathematical descriptions of the logistic regression analysis are given in Hosmer and
Lemeshow (1989), Christensen (1997), and many others.
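Equation 6 can be evaluated directly once coefficients are in hand. The sketch below uses hypothetical coefficients (not those of the fitted model reported in Table 3) to show how a linear predictor is converted to a probability of presence:

```python
import math

def predict_presence(x, beta0, betas):
    """Equation 6: p = exp(eta) / (1 + exp(eta)), eta = beta0 + sum(beta_i * x_i)."""
    eta = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return math.exp(eta) / (1.0 + math.exp(eta))

# Hypothetical two-variable example: eta = -0.5 + 2.0*1.2 - 1.0*0.4 = 1.5
p = predict_presence([1.2, 0.4], beta0=-0.5, betas=[2.0, -1.0])
print(round(p, 4))  # 0.8176
```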
1.2.1 Maximum Likelihood Estimator
The Maximum Likelihood Estimator (MLE) is a method used with logistic regression analysis to
estimate coefficients for the fitted model. The log-likelihood function for the logit model is as follows:

    Log L = Σᵢ yᵢ(β′xᵢ) − Σᵢ Log(1 + exp(β′xᵢ))     (7)

The coefficients β do not enter Equation 7 linearly. The likelihood function (Equation 7) is maximized
by choosing values of β iteratively, using a method such as the Newton-Raphson method or the Fisher-
scoring algorithm in SAS. The Fisher-scoring algorithm is the default method and Newton-Raphson is
the alternative method in SAS. Newton-Raphson uses the observed information matrix, whereas Fisher
scoring uses the expected information matrix; when the data are binary, however, the results are the same.
Newton-Raphson is widely used in statistics and mathematics, and, therefore, both are explained below for the user.
In the Newton-Raphson method, the first (U(β)) and second (H(β)) derivatives of Equation 7 with respect to β are
obtained and used in Equation 8 for estimation. The vector of first derivatives is called the gradient or
score, and the matrix of second derivatives is called the Hessian. The value of β is estimated by the Newton-
Raphson algorithm as:

    β⁽ʲ⁺¹⁾ = β⁽ʲ⁾ − H⁻¹(β⁽ʲ⁾) U(β⁽ʲ⁾)     (8)

To start the estimation, a value of zero for β is used to produce the
first iteration estimate. This value is substituted back into the right-hand side (Equation 8) for the second
iteration. Iterations continue until the difference between two consecutive iterations is less than or equal
to a very small value, e.g., 0.001. More on maximum likelihood is given in Allison (1999, p. 36) and
Hastie and Tibshirani (1990).
The Fisher-scoring algorithm is also known as the iteratively reweighted least squares algorithm. As
mentioned earlier, the parameter estimate is the same using Newton-Raphson or Fisher-scoring, except
that the covariance for the parameter estimates may not be the same. To start the estimation, a value of
zero is used for the slopes (β), and the logit of the observed cumulative proportion of the
dependent variable is used for the intercept, to produce the first iteration estimates. These values are
substituted back for the second iteration. Iterations continue until the differences between two
consecutive iterations are less than a very small value.
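The iteration in Equation 8 can be sketched compactly for a model with an intercept and one independent variable. The code below is an illustrative implementation on a small artificial data set, not the SAS procedure: it starts from zero, forms the score U(β) and Hessian H(β) of Equation 7, applies Equation 8, and stops when consecutive estimates differ by less than 0.001. For binary data the observed and expected information coincide, so the same loop also describes Fisher scoring.

```python
import math

def newton_raphson_logit(xs, ys, tol=1e-3, max_iter=50):
    """Fit logit(p) = b0 + b1*x by maximizing Equation 7 with Newton-Raphson."""
    b0, b1 = 0.0, 0.0                      # start from zero, as in the text
    for _ in range(max_iter):
        # Score U (first derivatives) and Hessian H (second derivatives)
        u0 = u1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            u0 += y - p
            u1 += (y - p) * x
            w = p * (1.0 - p)
            h00 -= w
            h01 -= w * x
            h11 -= w * x * x
        # beta_new = beta - H^{-1} U (Equation 8), written out for the 2x2 case
        det = h00 * h11 - h01 * h01
        d0 = (h11 * u0 - h01 * u1) / det
        d1 = (h00 * u1 - h01 * u0) / det
        b0, b1 = b0 - d0, b1 - d1
        if abs(d0) < tol and abs(d1) < tol:
            break
    return b0, b1

# Artificial data: presence (1) becomes more likely as x increases
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = newton_raphson_logit(xs, ys)
print(round(b0, 3), round(b1, 3))  # negative intercept, positive slope
```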
1.2.2 Assumptions
When standard regression analysis is fit to a set of data, there are basic assumptions on the errors
(differences of the observed and the predicted values) that are considered normal, such as an expected
mean value of zero, constant variance (homoscedasticity), no serial correlation, and no error in
measurements. For multiple standard regression analysis, it is important to evaluate the collinearity
between the independent variables (xj's). There should not be "perfect collinearity" between the
independent variables (Berry and Felman, 1985). A detailed discussion on multicollinearity is given
below.
Logistic regression analyses are different from linear regression analyses because of the nature of the
error. The assumption that differentiates it from linear regression is that the dependent variable is binary
and has a binomial distribution with a single trial and parameter p(x) (e.g., probability of presence
given x).
Logistic regression also requires that no correlation exists between observations (Hosmer and
Lemeshow, 1989). If a number of observations are located close to each other and form a cluster, these
observations may be dependent, and it is necessary to account for that in the model. Dependency will
affect the standard error of the coefficient and make the estimate unstable. Fortunately, there is a Generalized
Estimating Equations (GEE) option in PROC GENMOD in SAS (Diggle et al., 1994) that accounts for the
dependence between observations. The analysis of dependence between observations is run after finalizing the
model and is described below.
Normality of the data is an issue of concern for many researchers. When the sample size is large (> 30),
the central limit theorem can be invoked, and the sampling distribution will
approximate normality (Madansky, 1988). In standard linear regression analysis, however, basic
assumptions need to undergo diagnostic checking on the residuals for normality or independence, for
example. The link function in logistic regression allows the random component (response variable) to
have a distribution other than normal. The prediction in logistic regression is done through the link
function that models the mean response; the prediction is not obtained directly from the mean response as
in standard linear regression. Logistic regression analysis uses the maximum likelihood estimators, which
are approximately normal. Therefore, the coefficients' p-values and confidence intervals can be estimated
using the normal and chi-square distributions. Logistic regression does not require multivariate normality
of the independent variables.
Linear relationships between the dependent and independent variables were used in Equation 5. To
linearize relationships between dependent and independent variables, log transformation of the surface water
area variable (A) was done as an example. Sheskin (2000) described many methods of data
transformation. When nonlinear relationships exist and transformation does not linearize them, then a
nonparametric method should be used, provided that the sample is large (Efron and Tibshirani, 1991).
1.2.3 Steps to Follow
Prior to running the logistic regression analysis, a few analyses are needed to understand the data
better. Univariate logistic regression analysis is important to examine the strength of the relationships
between the independent variables and the dependent variables. Collinearity is critical in regression
analysis and needs to be studied first to exclude any collinear independent variables. Variable selection,
goodness of fit, prediction of the model, and diagnostic check of the fitted model are explained in detail
below.
a) Univariate Logistic Regression Analysis: Each of the variables in Table 1 was regressed with
the presence/absence of toads. Univariate regression analysis reveals the strength of relationship
and association of each independent variable with toad presence/absence. Table 1 gives the
significance level for the slope coefficient for each of the independent variables. Eighteen of the
25 variables showed a significant relationship (p < 0.05) with the presence/absence of toads.
b) Collinearity: It is necessary to determine the magnitude of the collinearity of the independent
variables. Collinearity can make the model coefficient unstable (Allison, 1999) and adversely
affect the coefficient interpretation (Christensen, 1997), but it has no effect on the model
prediction. A higher value of collinearity elevates the standard error of the estimated coefficient,
which decreases the coefficient level of significance. In other words, a coefficient may be
significant, but because of the presence of other correlated variables its significance may be
diminished. Alternatively, a coefficient may be artificially significant and become insignificant
when correlated variables are removed. Therefore, caution must be taken when explaining
coefficients when a high degree of collinearity is present.
We examined collinearity in four ways:
1. First, we examined the simple pairwise (Pearson) correlation between the independent
variables. Different cutoff values for r have been recommended as an indication of serious
collinearity (e.g., r = 0.8, Berry and Felman, 1985; r = 0.9, Griffith and Amerhein, 1997).
Belsley et al. (1980, p. 96) reported that r values < 0.7 can be considered with no fear of
serious collinearity. However, Berry and Felman (1985) indicated that the cutoff value of r
depends on the number of observations. When the number of observations is small (e.g.,
n < 30), the recommended cutoff value of r is 0.70, and when the number of
observations is > 30, the cutoff value of r is 0.85. We used r = 0.85 as the cutoff
value for collinearity. If an independent variable was highly correlated with other independent
variables (r ≥ 0.85) and was not highly associated with the dependent variable (presence of
toads), that variable was not included in the logistic multiple regression analysis. Pairwise
correlation values ranged from −0.77 to 0.77 for the independent variables that entered the
final regression analysis model.
2. The second test for collinearity was using the Variance Inflation Factor (VIF) and Tolerance
(VIF is the inverse of the tolerance (=1-R2), Griffith and Amerhein, 1997) as an indication of
the degree of collinearity between the independent variables. VIF represents the amount of
inflation in the variance when the collinearity of a variable with others exists. A preliminary
multiple regression analysis that includes all the independent variables can be used to
examine VIF. A VIF value of one indicates that there is no linear relation (R2 = zero)
between the independent variables. A VIF of more than one means that R2 is more than zero,
which indicates some linear relation between the independent variables. VIF may increase to
some value that makes the model unstable (i.e., "imprecise" in its prediction). A question
may be asked: What is the cutoff value for VIF? Different values are given in the literature.
Neter et al. (1996) and Griffith and Amerhein (1997, p. 99) indicated that a VIF value that
exceeds 10 can indicate serious collinearity. For our analyses, we used a VIF value of 10 as the
cutoff for collinearity.
Note: VIF is used later (page 13) for the diagnostic checking of the final logistic model.
3. The third test was to examine the absolute correlation between the coefficient estimates. A
pairwise correlation r > 0.9 indicates that one variable has to be excluded from the regression
analysis (Griffith and Amerhein, 1997). Griffith and Amerhein (1997) suggested that this test
reveals a better indication of the linearity between a single variable and the linear
combinations of others. The correlation between coefficient estimates can be output by
adding the option CORRB to the Proc Logistic statement in SAS. The highest r value (= 0.887)
in our case was between salinity (S) and elevation (U) coefficients, and the lowest (= -0.72)
was between latitude (W) and surface water area (A).
4. The fourth test was very simple and is used frequently by many researchers. It is performed
by observing changes in the value and sign of a coefficient that is already in the model upon
the addition of other independent variables. If a big change occurs in coefficient value or the
sign is reversed, then this is an indication of collinearity. This method may be the best,
though somewhat inefficient, since it is really model instability that is of concern.
Scientific knowledge is often useful in avoiding problems with collinearity. Although
multicollinearity can occur with three or more variables, such that it masks its presence in any
subset of two of these variables, it is most often seen with two. Hence, the first test (though
without stringent cutoff values) is quite useful and simple.
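The first two checks above are easy to relate numerically: with only two predictors, the tolerance is simply 1 − r², so VIF = 1/(1 − r²). The sketch below (with hypothetical, strongly correlated predictor values, not the report's data) computes the Pearson correlation and the implied VIF; note that the r = 0.85 cutoff corresponds to a VIF of only about 3.6, well under the VIF cutoff of 10.

```python
import math

def pearson_r(x, y):
    """Simple pairwise Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x, y):
    """VIF = 1 / tolerance = 1 / (1 - R^2); with two predictors, R^2 = r^2."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)

# Hypothetical, nearly collinear predictors
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8]
print(round(pearson_r(x1, x2), 3), round(vif_two_predictors(x1, x2), 1))
print(round(1.0 / (1.0 - 0.85 ** 2), 1))  # VIF implied by the r = 0.85 cutoff: 3.6
```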
c) Variable Selection: Stepwise selection (SAS) was used in selecting variables for the logistic
multiple regression analysis. In this selection method, there are two probability values that
control the variable to be entered (SLENTRY) or removed (SLSTAY) from the model. There
were two important recommendations stating that the probability values for variable entry should
be:
1. Higher than 0.05, giving an opportunity for an important biological variable to enter the
model, and
2. Higher than that of removing the variable (Efroymson, 1960; Bendel and Afifi, 1977; Mickey
and Greenland, 1989). A significance level of entry of 0.25 has been recommended for
stepwise regression analysis (Mickey and Greenland, 1989; Hosmer and Lemeshow, 1989).
We used values of 0.3 and 0.1 for SLENTRY and SLSTAY, respectively. We felt the
significance level of 0.1 was appropriate for identifying independent variables influencing
the presence/absence of toads in the final model.
One key point to make is that some biological variables are important in explaining or predicting
a biological phenomenon, yet their significance level may be more than 0.05 in the final model.
A decision has to be made between a sound biological (or physical) model and a statistical model;
therefore, it may necessitate increasing the significance level to more than 0.05.
Before finalizing the fitted model, variables that are in the model have to be checked for VIF. A
variable with high VIF should not be left in the model. A variable with high VIF would cause an
unstable model, and, when deleted, would cause a major change in the coefficient estimates.
d) Assessing Model Fit: Proc Logistic (SAS) produces many statistics to test for predictiveness
and effectiveness of the fitted model. Statistics such as Akaike Information Criterion (AIC),
Shwarz Criterion (SC), and negative twice the log likelihood (-2 Log L) are given for the model
with and without the independent variables (intercept only). A substantial reduction in these
statistics upon including the independent variables indicates a good predictive model. For example,
the values of −2 Log L before and after including the independent variables were 144.38 and
55.58, respectively (Table 2). The difference between these two values is the maximum likelihood ratio
χ² value (= 88.80), which was significant (p < 0.0001; Table 2). Analogous to a standard
regression analysis, where the model F value is used to test the null hypothesis (H₀: all
coefficients are equal to zero), we used the Wald χ² in logistic regression analysis. The Wald χ² is 21.81
(p = 0.0027; Table 2), rejecting the null hypothesis that states all coefficients are zero.
Table 2. SAS Output for Model Fit, Testing Global Null Hypothesis, and Association
of Predicted Probability and Observed Toad Presence

Model Fit Statistics

Criterion    Intercept Only                    Intercept and Covariates
AIC(a)       146.377 (= 144.377 + 2*1)         71.583 (= 55.583 + 2*8)
SC(b)        149.181 (= 144.377 + 1*Ln(122))   94.015 (= 55.583 + 8*Ln(122))
-2 Log L     144.377                           55.583

R-Square 0.5170    Max-rescaled R-Square 0.7453

Testing Global Null Hypothesis: Beta = 0

Test                Chi-Square                      DF    Pr > ChiSq
Likelihood Ratio    88.7946 (= 144.377 - 55.583)    7     < 0.0001
Score               55.5509                         7     < 0.0001
Wald                21.8141                         7     0.0027

Association of Predicted Probabilities and Observed Responses

Percent Concordant    95.4    Somers' D    0.909
Percent Discordant     4.5    Gamma        0.909
Percent Tied           0.0    Tau-a        0.368
Pairs                 2992    c            0.954

(a) AIC = -2LogL + 2*k, where k is the number of coefficients in the model.
(b) SC = -2LogL + k*Ln(n), where n is the total number of observations (SC is also known as the
Bayesian Information Criterion).
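The footnote formulas reproduce the values in Table 2 exactly, which is a useful check when reading SAS output. The sketch below recomputes AIC and SC from −2 Log L with k = 1 coefficient (intercept only) and k = 8 (intercept plus seven covariates), with n = 122 sites:

```python
import math

def aic(neg2logl, k):
    """AIC = -2LogL + 2k, where k is the number of coefficients."""
    return neg2logl + 2 * k

def sc(neg2logl, k, n):
    """SC (Schwarz / Bayesian Information Criterion) = -2LogL + k*ln(n)."""
    return neg2logl + k * math.log(n)

# Values from Table 2: intercept-only model vs. full model, n = 122 sites
print(round(aic(144.377, 1), 3), round(sc(144.377, 1, 122), 3))  # 146.377 149.181
print(round(aic(55.583, 8), 3), round(sc(55.583, 8, 122), 3))    # 71.583 94.015
print(round(144.377 - 55.583, 3))  # likelihood-ratio chi-square: 88.794
```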
The Wald chi-square statistic appears twice in SAS logistic output: once to test the hypothesis
that all coefficients are zero, and again to test the significance level of each coefficient. The Wald
statistic for the model is given under "Testing Global Null Hypothesis," and for each coefficient
under "Analysis of Maximum Likelihood Estimates." For the calculation of the overall model
Wald statistic, refer to Long (1997) and Rao (1973). For each coefficient, the Wald statistic is
easily calculated as the squared ratio of the coefficient's value to its standard error (Table 3).
It is important to note that when testing the best fitted model, it is safer to use the log likelihood
ratio than the Wald test: when a coefficient is large, the Wald test can lead to a Type II error. In
general, the two statistics should not disagree substantially, and if they do, it may indicate that an
asymptotic test is not appropriate. Pearson's chi-square is generally the best test of the overall
model, but all of these tests fail for strictly binary data. For nested models, the log likelihood
ratio is generally preferred.
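The per-coefficient Wald statistic described above is simple to compute; a sketch, using the elevation (U) values from Table 3:

```python
def wald_chi2(estimate, std_err):
    """Wald chi-square for a single coefficient: (estimate / SE)^2."""
    return (estimate / std_err) ** 2

# Elevation (U): estimate = -0.00858, standard error = 0.00211 (Table 3)
print(round(wald_chi2(-0.00858, 0.00211), 2))  # 16.54
```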
Table 3. Final Stepwise Logistic Regression Analysis Model (n = 122 sites)

Variable   Estimate (β)  Standard Error  P          Standardized Estimate  Odds Ratio  Variable Description
Intercept  10.9961       2.9378          0.0002
U          -0.00858      0.00211         < 0.0001   -2.1904                0.991       Elevation
W          -0.0868       0.0211          < 0.0001   -1.7455                0.917       Latitude
S          -6.4869       1.8971          0.0006     -1.3683                0.002       Water salinity
I           0.1865       0.0635          0.0033      1.3399                1.205       Bedrock substrate cover
O          -0.0601       0.0170          0.0004     -1.0299                0.942       Vegetation over water
A           1.8175       0.7135          0.0109      0.7666                6.157       Surface water area
Q          -0.0383       0.0153          0.0125     -0.5487                0.962       Vegetation over adjacent land

Dependent variable is patch occupancy by the red-spotted toad (B. punctatus). Variables are arranged in order of
importance in influencing patch occupancy (i.e., by absolute value of the standardized estimate). β is a parameter
estimate, and Odds Ratio is exp(β). P value is for the Wald statistic.

standardized estimate = (β × std_x) / std_y

where β is the coefficient estimate in Table 3 (e.g., for U, β = -0.00858),
std_x is the standard deviation of the independent variable (e.g., std for U = 463.159),
std_y is the standard deviation of the dependent variable (1.8138, see Allison, 1999);
therefore the standardized estimate for U = (-0.00858 × 463.159) / 1.8138 = -2.1904.

The Wald test for overall model significance: χ² = 21.8, df = 7, P = 0.0027. Hosmer-Lemeshow goodness of
fit test: χ² = 3.91, df = 8, P = 0.87. Percent of sites correctly classified as having present or absent populations is
95.4 percent.
R² is known in standard regression analysis as the coefficient of determination. In logistic
regression analysis, it is known as the generalized R² and can be used to gauge the predictive
ability of the fitted model. To decide between models, choose the model with the highest R²
value, indicating a better predictive model. Allison (1999) used R² to identify the predictive
power of the fitted model. The generalized R² is calculated differently than the R² of standard
regression analysis and should not be used to describe the percent of explained variability as in
standard regression analysis:

R² = 1 - exp(-Likelihood ratio χ² / n)
   = 1 - exp(-88.80 / 122) = 0.5170, or 52 percent (Table 2)

The generalized R² has an upper bound that is not equal to one, unlike standard regression
analysis. The logistic "Max-rescaled R²" is determined by dividing the generalized R² by this
upper bound (see calculation below):

Upper R² = 1 - exp(-Intercept-only (-2 Log L) / n)
         = 1 - exp(-144.377 / 122) = 0.6938, or 69 percent
Max-rescaled R2 = 0.5170/ 0.6938 = 0.7453 or 75 percent (Table 2, Allison, 1999)
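These two R² calculations can be sketched in Python (a minimal illustration; the function names are ours):

```python
import math

def generalized_r2(neg2logl_null, neg2logl_model, n):
    """Generalized (Cox-Snell) R^2 from the -2 log-likelihoods."""
    lr_chi2 = neg2logl_null - neg2logl_model
    return 1.0 - math.exp(-lr_chi2 / n)

def max_rescaled_r2(neg2logl_null, neg2logl_model, n):
    """Max-rescaled R^2: generalized R^2 divided by its upper bound."""
    r2 = generalized_r2(neg2logl_null, neg2logl_model, n)
    upper = 1.0 - math.exp(-neg2logl_null / n)
    return r2 / upper

# Values from Table 2 (n = 122 sites)
print(round(generalized_r2(144.377, 55.583, 122), 4))   # 0.517
print(round(max_rescaled_r2(144.377, 55.583, 122), 4))  # 0.7453
```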
Measures of association between observed and predicted values, also known as ordinal measures
of association, are given by SAS (Table 2). In explaining concordance, there are 7,381 pairs
(122*121/2) of all possible pair combinations (presence/presence, absence/absence, and
presence/absence). For the concordance/discordance calculations, only the presence/absence pairs
are considered (2,992 pairs). The model predicts the probability of presence, so if the predicted
probability for the presence site is higher than that for the absence site in a pair, the pair is
concordant; otherwise, it is discordant. Our fitted model has a strong prediction for toad presence,
measured by a concordance of 95.4 percent.
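The pairwise concordance calculation can be sketched as follows (a naive implementation under our own naming, not the SAS algorithm):

```python
from itertools import product

def concordance(y, p):
    """Pairwise association between observed 0/1 responses y and predicted
    probabilities p, over all presence/absence pairs."""
    present = [pi for yi, pi in zip(y, p) if yi == 1]
    absent = [pi for yi, pi in zip(y, p) if yi == 0]
    conc = sum(1 for pp, pa in product(present, absent) if pp > pa)
    disc = sum(1 for pp, pa in product(present, absent) if pp < pa)
    pairs = len(present) * len(absent)
    tied = pairs - conc - disc
    return {
        "pairs": pairs,
        "pct_concordant": 100.0 * conc / pairs,
        "pct_discordant": 100.0 * disc / pairs,
        "somers_d": (conc - disc) / pairs,
        "c": (conc + 0.5 * tied) / pairs,  # area under the ROC curve
    }

# Toy example: 2 presences, 2 absences -> 4 presence/absence pairs
stats = concordance([1, 1, 0, 0], [0.9, 0.6, 0.6, 0.2])
print(stats["pairs"], stats["pct_concordant"])  # 4 75.0
```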
There are many measures of the ability of the fitted model to predict the presence of toads.
These measures of association are all high in value except for Tau-a (0.37; Table 2). Comparing
Somers' D (0.91) and R² (0.52), there is a large difference between the two values, and the R²
value is not very high. With a well-fit model in standard regression analysis, R² is expected to be
high (e.g., > 0.90). Contrary to standard regression analysis, a low R² value is expected in logistic
regression analysis, even for a well-fit model. Christensen (1997, p. 128) suggested using R² as a
relative measure of goodness of fit but not as a measure of absolute goodness of fit. The logistic
regression R² becomes high only when the predicted probabilities approach zero or one.
Therefore, R² can be used to compare two models, choosing the one with the higher value,
whereas the ability of the model to predict presence/absence can be assessed by the concordance.
The Hosmer and Lemeshow goodness of fit statistic is 3.905 with p = 0.8656, indicating a good
fit. When the probability is not significant (p > 0.05), a good fit is indicated. Thus, the fitted
logistic model appeared to be "the right" model for presence/absence of toads.
e) Coefficient Estimates: The overall model χ² indicated that at least one coefficient is not zero. To
test whether an individual independent variable in the model is significant, the Wald statistic was
used (Table 3). All coefficients are significantly different from zero (p < 0.013; Table 3). An
increase in surface water area and in extent of bedrock substrate resulted in a higher probability of
toad presence in the study area. Conversely, higher elevation, latitude, mean percent vegetation
cover over water, mean percent vegetation cover on the adjacent land, and electrical conductivity
resulted in a lower probability of toads being present. Further interpretation of the physical
relationships of these variables to the presence of toads is given in Bradford et al. (submitted).
Another statistic is the relative importance of the independent variables in the model, which is
measured by the standardized estimate (Table 3). This statistic makes comparison possible across
variables with different measurement units. Elevation (U) is the most important environmental
variable affecting the presence of toads, followed in importance by latitude (W), salinity (S), and
percent bedrock substrate (I). A simple figure such as a box-plot can show the relationship
between presence and absence for each of the independent variables (Figure 2). Values at
quartiles 1, 2, and 3 for elevation, latitude, and vegetation cover over water and over adjacent land
are higher for sites where the species is absent. The odds ratio can be defined as the ratio of the
odds of presence to the odds of absence. For elevation, the predicted odds of toad presence are
0.991 times the odds of absence per unit. The independent variables in this study are continuous,
so the odds ratio represents the change in the predicted odds of toad presence per one-unit increase
in the independent variable. An increase of one unit (meter) in elevation decreases the predicted
odds of presence of toads by 0.9 percent (1 - 0.991; Table 3). An increase of 100 units in elevation
changes the predicted odds by a factor of exp(100 × -0.00858) = 0.424. That is, with 100 more
meters of elevation, the predicted odds of presence are multiplied by 0.424, a decrease of
58 percent (1 - 0.424) in the predicted odds of presence (see Hosmer and Lemeshow, 1989, p. 63).
In another example, the predicted odds of presence for bedrock substrate cover increase by a
factor of 1.21 per unit. An increase of one unit (1 percent) in bedrock substrate cover therefore
increases the predicted odds of presence of toads by 21 percent.
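The odds-ratio arithmetic above can be verified directly; a sketch using the Table 3 coefficients:

```python
import math

beta_u = -0.00858  # elevation (U) coefficient, per meter (Table 3)
beta_i = 0.1865    # bedrock substrate cover (I) coefficient, per percent

print(round(math.exp(beta_u), 3))        # 0.991  odds ratio per 1 m of elevation
print(round(math.exp(100 * beta_u), 3))  # 0.424  odds ratio per 100 m of elevation
print(round(math.exp(beta_i), 3))        # 1.205  odds ratio per 1 percent bedrock cover
```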
[Figure 2: three box-plot panels, one per variable (y-axes: Elevation (m), UTM-N Coordinate, and Vegetation Cover Over Adjacent Land (%)); x-axis: absence (0) and presence (1).]

Figure 2. Box-plots show the presence (1) and absence (0) of toads as related to the independent
variables. Independent variables are: Elevation (U), UTM-N (W; latitude), and vegetation
cover (< 1 m high) over adjacent land, mean % (Q). "+" is the mean value; the box lines, from
top, are: maximum, Quartile 3, Quartile 2, Quartile 1, and minimum values.
f) Diagnostic Checking

1. VIF: Before interpreting the logistic regression output, the variance inflation factor (VIF)
for each independent variable in the final model was examined for collinearity by
incorporating a weight (the variance of the binomial) into the VIF calculation (Allison, 1999,
p. 50). The steps below first output the predicted values into file 'O1' from the logistic
regression analysis in SAS and then run a weighted linear regression analysis as:

Data O2;
Set O1; /* O1 is the output data set from the logistic regression */
W = Pred*(1-Pred);
run;
Options ps = 255 ls = 100;
Proc Reg data = O2;
Weight W;
Model Toad = U W S I O A Q / TOL VIF;
run;
VIF values for the independent variables in our case ranged from 1.3 to 8.0. These values are
< 10, indicating no serious collinearity (Allison, 1999).
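The weighted VIF computation can be sketched outside SAS as well. A minimal Python illustration under our own naming (the weights are the binomial variances p(1-p) from the fitted model; each column is regressed on the others with those weights):

```python
import numpy as np

def weighted_vif(X, w):
    """Variance inflation factors from weighted regressions of each column
    of X on the remaining columns (weights w = p*(1-p))."""
    sw = np.sqrt(w)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j] * sw
        others = np.delete(X, j, axis=1) * sw[:, None]
        Z = np.column_stack([sw, others])  # weighted intercept column
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        rss = float(np.sum((y - Z @ beta) ** 2))
        ybar = np.average(X[:, j], weights=w)
        tss = float(np.sum(w * (X[:, j] - ybar) ** 2))
        r2 = 1.0 - rss / tss
        vifs.append(1.0 / (1.0 - r2))
    return vifs

# Two nearly collinear columns -> both VIFs are large
x1 = np.arange(1.0, 9.0)
x2 = x1 + np.tile([0.1, -0.1], 4)
vifs = weighted_vif(np.column_stack([x1, x2]), np.full(8, 0.25))
print(all(v > 10 for v in vifs))  # True
```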
2. Deviance: A low residual value in linear regression analysis indicates that an observation
is close to its predicted value. This is not the case in logistic regression analysis, where we
may find a high residual value even for well-behaved observations (Christensen, 1997).
Using residuals to check the fitted model as described for linear regression analysis is
therefore not appropriate for logistic regression analysis. SAS outputs many statistics to
describe the influence of each observation on the fitted model and to flag outliers. DFBETAS
can detect an observation that causes instability in a model coefficient. DIFDEV and
DIFCHISQ detect the change in deviance and Pearson chi-square when an observation is
deleted from the model. The C and CBAR statistics are similar to Cook's D distance in
standard regression analysis, where a confidence interval is used to detect outliers.
The hat matrix diagonal, also known as leverage, indicates how extreme an observation is
with respect to the independent variables. High and low leverage values flag observations
that are poorly fit. A change in deviance or Pearson χ² with a value of more than 4 (4
approximates the upper 95th percentile of the χ² distribution with one degree of freedom;
χ²(0.95, 1) = 3.84) needs to be considered and explained on the basis of knowledge of those
observations to make sense of the behavior. Sometimes such data may be deleted, but the
behavior should be investigated further: perhaps more data are needed, and more data may
result in a lower value (i.e., less than 4). Plotting the differences in deviance and Pearson χ²
is recommended to visualize the overall fit of a covariate pattern and to look for clusters or
points that are isolated from the overall pattern (Figures 3 and 4). In each figure there are two
curved lines; the curve (left) that decreases with predicted probability describes the behavior
of the covariates when toads are present, and the other (right) describes the pattern when they
are absent. Figures 3 and 4 show that one value is more than 10 and three others are more
than 4. These values were not removed from the data because of their physical importance to
the presence and absence of the toad. The hat matrix and DFBETAS are also output by SAS
to show high-magnitude values when an observation is deleted. More detail is given by
Hosmer and Lemeshow (1989).
One additional diagnostic check to make is on the variance homogeneity. This can be done
by plotting the square root of the residual deviance vs. the predictor and examining the
behavior of points (Figure 5). The points are scattered randomly over the domain with no
clusters, indicating no heterogeneity in variance.
[Figure 3 plot: deviance change (DIFDEV, y-axis) vs. estimated probability (x-axis, 0.0-1.0).]

Figure 3. Deviance (DIFDEV) values for the predicted probability of the
presence and absence of toads.
[Figure 4 plot: chi-square change (DIFCHISQ, y-axis, 0-40) vs. estimated probability (x-axis, 0.0-1.0).]

Figure 4. Chi-square (DIFCHISQ) values for the predicted probability of the
presence and absence of toads.
[Figure 5: seven panels, one per significant variable, each plotting the variable (y-axis) against the square root of the deviance residual (x-axis, "√ Dev. Res.", 0.0-1.8).]

Figure 5. Square root of the absolute value of the deviance residual for each independent variable (Xi)
that was significant in the final model.
1.2.4 Model Selection

One may have many models in hand and would like to choose only one of them. A comparison has
to be made between the saturated model (all independent variables included) and reduced models
(models with a specific subset of independent variables). Many statistics can be used in model
selection, such as R² and the Wald statistic, as mentioned earlier. An analog of Mallows Cp,¹ which is
used in standard linear regression (see footnote), can be used in logistic regression to select a model. The Cp
statistic in logistic regression is related to the Akaike Information Criterion (AIC). The model with the
lowest Cp can be chosen as the best fitted model. Hosmer and Lemeshow (1989, p. 134) describe in
detail the use of a modified Cp denoted Cq; lower Cq values indicate a better model to choose.
1.2.5 Dependence Between Observations
In logistic regression, it is important that observations are independent. If dependence between
observations within a group is of concern, the data may be grouped in a logical way (in our case by
watershed and by mountain range) to examine whether sites/observations that are located near each other
are dependent. The final model was run again using Proc GENMOD with the two grouping schemes
(Appendix 1). The estimates and their probabilities were compared with those from Proc Logistic to
note any change in coefficient values and their significance levels. If an estimate's significance changes
from significant to insignificant or vice versa, then dependence is an issue, and the analysis has to be
done by groups. Thus, in that situation one global logistic regression analysis would not be valid for
all the observations. The following SAS statements use generalized estimating equations (GEE)
through the Repeated option.
Options ps = 255 ls = 100;
Proc Genmod data = neweuc;
Class WsGrp; /* by watershed group */
Model Toad = A U W I O Q S / D = B;
Repeated subject = WsGrp / Type = exch;
run;

Options ps = 255 ls = 100;
Proc Genmod data = neweuc;
Class Rangegrp; /* by mountain range group */
Model Toad = A U W I O Q S / D = B;
Repeated subject = Rangegrp / Type = exch;
run;
¹ Mallows Cp is a statistic that was developed by Mallows (1973) as:

Cp = RSS_p / s² - (n - 2p)

where RSS_p is the residual sum of squares from the regression model with p coefficients;
s² is the residual mean square, an unbiased estimate of the error variance (σ²);
p is the number of coefficients, including the intercept; and
n is the number of observations.

Cp can be used to select the best fitted model. For a satisfactory model, Cp and p should be close in value. Normally, a "1-1"
plot of Cp (y-axis) vs. p (number of coefficients; x-axis) is constructed to give an idea of the best fitted model (Draper and
Smith, 1981, p. 300). If a model's Cp value is above the "1-1" line, the model is biased. The best fitted models
have a low Cp that is close to the number of coefficients.
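The footnote's formula is a one-liner; a sketch with purely illustrative numbers (not from this report):

```python
def mallows_cp(rss_p, s2, n, p):
    """Mallows Cp = RSS_p / s^2 - (n - 2p), where p counts coefficients
    including the intercept."""
    return rss_p / s2 - (n - 2 * p)

# Hypothetical 3-coefficient model: RSS consistent with the full model's
# error variance gives Cp close to p, indicating a satisfactory model
print(mallows_cp(rss_p=97.0, s2=1.0, n=100, p=3))  # 3.0
```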
Results in our case indicated that the estimates (coefficients) and their significance levels remained
largely unchanged when either grouping was considered (Appendix 1). Therefore, it was concluded that
there was no dependence between sites.
SAS statements for running the standard logistic regression analysis are given in Appendix 2.
1.2.6 Interactions
Interaction refers to the effect of the cross product of two independent variables on the mean
response. For example, if the probability of presence as a function of latitude depends on elevation,
then an interaction term (= latitude × elevation) must be included. Interaction between variables,
therefore, should be considered if researchers deem it necessary to ensure that the final model is sound
biologically and statistically. See Bradford et al. (submitted) for inclusion of interaction terms in the
model.
Note: Researchers may find a need to reduce the number of variables in the model, but this
should not be done by arbitrary exclusion. An investigation of the magnitude of each coefficient
may explain the importance of each variable. A coefficient's value describes the variable's relation
to the dependent variable and defines the rate of change of the dependent variable per one-unit
change in the independent variable. When two coefficients that are very close in value have
opposite signs, the difference between the two variables may be used in the model instead of the
individual variables. One must assess the statistical significance and physical contribution of the
new coefficient to decide whether the difference better represents or contributes to the model; one
can then compare the new model's R² with that of the original. These results may reveal some
interesting biological phenomena.
Section 2
MARS
In the above logistic regression analysis, a linear relationship was assumed between the response
variable, log(p(x)/(1-p(x))), and the independent variables. A researcher may wish to explore whether
there are any nonlinear relationships between the response and the independent variables; that is, the
response may behave piecewise over the domain. In standard regression analysis, prior to fitting the
model, the researcher has to specify the form of the nonlinear behavior (e.g., a squared term). Modern
nonparametric regression analysis can be used to model linear and nonlinear behavior without prior
specification of the form of the data's behavior. To relax the assumption in GLM, the Generalized
Additive Model (GAM) was introduced. In GAM, the expected response value is a sum of smoothed
functions of the independent variables. MARS and Classification and Regression Trees (CART) are two
of the modern nonparametric regression analysis methods used in environmental studies (Walker, 1990;
Moore et al., 1991; White and Sifneos, 1997; Miller, 1994). Computer algorithms, such as MARS (Salford
Systems, 1999) and CART (Salford Systems, 1999), were developed for data analysis. The variable
selection option in the MARS (Salford Systems, 1999) algorithm and in standard logistic regression
(SAS) makes the two methods procedurally similar for comparing results. We use MARS in italics to
differentiate the program from the method.
MARS, which was developed by Friedman (1991), is "extremely promising as an automatic high
dimensional smoother" (Hastie and Tibshirani, 1990). It is data-driven rather than user-driven, as is the
case in simple regression analysis. For each of the variables, an algorithm is employed to determine the
function of the variable (e.g., linear, cubic, etc.) by using a sequence of local nonparametric regression
analyses on the data.
2.1.1 Model Fitting
In simple regression analysis, regression is done globally and may be sensitive to outliers. MARS, on
the other hand, reduces the effect of outliers on the final model. It builds flexible models by fitting
piecewise linear regression analysis to data to approximate the nonlinear behavior of the independent
variables. Prior to building the model, we need to define and explain a few steps that need to be
understood when running MARS.
1) Knots: When one regression line does not fit well to all the data, several regression lines (piecewise)
are used to describe the overall behavior over the entire domain of the independent variable. The
value of the independent variable where the slope of a line changes is called a knot. The knot defines
the end of one domain and beginning of another. Between two knots, a linear (or cubic) regression
line is fit to that group of data. When the slope is not changing along the entire domain, then one line
fits all the data resulting in no knots.
2) Basis Functions: Basis functions are a set of functions used to re-express the relationship between
the dependent and independent variables. For example, the basis function (BF1) on the variable
elevation is defined by MARS as:

BF1 = max(0, elevation - 219) (9)

Data for the elevation variable are grouped into two sets: the first set is assigned 0 for all elevation
values below a threshold (here, c = 219 m), and the second set contains the elevation values
that are more than 219 m. Elevation has no relation to the probability of presence (i.e., slope = 0) for
values below the threshold of 219 m, but has a negative relationship (slope < 0) above this threshold.
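Equation 9 is simply a hinge function; a minimal sketch:

```python
def hinge(x, knot):
    """MARS basis (hinge) function: max(0, x - knot)."""
    return max(0.0, x - knot)

# BF1 = max(0, elevation - 219): zero below the 219 m knot, linear above it
print(hinge(150.0, 219.0))  # 0.0
print(hinge(400.0, 219.0))  # 181.0
```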
3) Fitting a Model: The initial model starts with a constant only (c0), then adds basis functions (single
or multivariate interaction terms) one at a time to build up a comprehensive model containing the
maximum number of basis functions and their interactions specified by the user (Equation 10):

Y = c0 + Σ ci*BFi + error (10)

where Y is the response variable, c0 is a constant, and ci is the coefficient for basis function BFi. Not
all of these basis functions contribute significantly to the model. The least contributing one is deleted
by the stepwise method, and the final model contains the significant functions that contribute to the
dependent variable. Analogous to CART, the initial model is overfitted and then pruned to give the
optimum model with the lowest mean square error (MSE).
4) Fitting Logistic Regression: As a modification of Equation 10, Hastie and Tibshirani (1990)
introduced logistic regression analysis in the modern additive form as:

log[p(x)/(1 - p(x))] = α + Σ fi(xi) (11)

where α is a constant and the fi(xi) are estimated local smooth functions; as mentioned earlier, this
model can also contain interactions of order ≥ 2. Our task is to accurately predict the probability of
the presence of toads given many independent variables. The same data (122 observations with 25
variables) used for the standard logistic regression analysis were used again in the MARS program.
5) Model Validation: With large sample sizes, the data are split into a training set (e.g., 90 percent of
the data) and a test set (e.g., 10 percent of the data). The training data set is used for building the
model, and the test data set is used to validate the fitted model. When the sample is not large, cross
validation (CV) is the best method to use for validation. In CV, one observation is left out and
smoothing is done on the remaining n-1 observations. The CV score is the mean sum of squared
differences between the Yi's and the predicted values f^(-i)(xi) obtained with observation i excluded:

CV = (1/n) Σ {Yi - f^(-i)(xi)}² (12)

The MARS outputs CV and PSE (PSE = (1/n) Σ {Yi - f(xi)}²) can be used to assess the final model.
PSE is the mean predictive squared error when all observations are included, whereas CV is based on
n-1 observations at a time. When CV and PSE are close in value, a minimum CV is reached and an optimal
model is produced. Another measure for the optimal model is the Generalized Cross Validation (GCV),
a measure of mean square error; the model with the minimum GCV is to be chosen. GCV is analogous to
Cp in simple regression analysis, known as "Mallows' Cp" (see the footnote in Section 1.2.4; see Hastie and
Tibshirani (1990) for a detailed description of Cp and GCV).
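The leave-one-out scheme of Equation 12 can be sketched generically (a minimal illustration; the fit/predict callables stand in for the smoother — here an ordinary least-squares line, not the MARS algorithm):

```python
import numpy as np

def loo_cv_mse(x, y, fit, predict):
    """Leave-one-out cross validation: mean squared prediction error over
    n fits, each excluding one observation (Equation 12)."""
    n = len(y)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        model = fit(x[mask], y[mask])
        errs.append((y[i] - predict(model, x[i])) ** 2)
    return float(np.mean(errs))

# Example with a least-squares line as the smoother
fit = lambda x, y: np.polyfit(x, y, 1)
predict = lambda m, xi: np.polyval(m, xi)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.1, 2.9, 4.0])
print(loo_cv_mse(x, y, fit, predict))  # small, since the data are nearly linear
```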
2.1.2 Final Model
The model was run with 121-fold cross validation, a minimum of 40 observations between
knots, and a maximum of 15 basis functions (these options are under the "Testing" option in the GUI).
The degrees of freedom is the number of observations between knots to be considered in smoothing and
can be defined in three different ways (see Testing in the MARS user guide; GUI: Model set-up, Model).

Different models can be fitted with different options in the GUI, and the model with the lowest PSE
and GCV is chosen. PSE and GCV are given in the MARS output (see Appendix 3 at the end of cross
validation; (estimated) PSE = 0.124 and GCV = 0.136). The values of GCV and PSE are similar; therefore,
we can conclude that an optimal model has been reached.
The final model with all the above options yielded:

P(Y=1 | X) = f(U) + f(W) + f(Q) + f(S) + f(O) + f(C) + f(I)

The independent variables were similar to those selected by Proc Logistic (Table 4), except that A
(surface water area) was replaced by C (riparian zone area). Both of these metrics reflect the extent of
moist habitat. The number of variables was also similar for both methods. The MARS input and output
for this data set are given in Appendix 3.
Table 4. Coefficients and Their Statistics From MARS. Model F = 14.128, p < 0.001, df = 7,114

Variable               Estimates    t-ratio   P value     Variable description
Intercept              1.690        8.082     0.745E-12
Basis function 1 (W)   -0.005       -5.981    0.261E-07   Latitude
Basis function 2 (U)   -0.405E-03   -4.335    0.316E-04   Elevation
Basis function 3 (O)   -0.003       -2.431    0.017       Vegetation over water
Basis function 4 (S)   -0.302       -2.902    0.004       Water salinity
Basis function 5 (C)   0.111        2.966     0.004       Riparian zone area
Basis function 6 (Q)   -0.004       -2.850    0.005       Vegetation over adjacent land
Basis function 7 (I)   0.005        2.237     0.027       Bedrock substrate cover
All 15 basis functions with their contributions by GCV to the preliminary model are given in
Appendix 3 ("Forward Stepwise Knot Placement"). GCV gives the amount of degradation in the model
when a variable is deleted.

The importance of each variable in the model is another valuable output from MARS, which describes
the amount of reduction in goodness of fit when the variable is removed. Latitude and elevation
were the most important variables, with 100 percent and 70 percent importance, respectively.
Extent of riparian zone, salinity, and percent vegetation cover over adjacent land were similar in their
importance (~37 percent), whereas percent vegetation cover over water and percent bedrock
substrate were the least important (< 30 percent; see the "Relative Variable Importance" section of the
output for these values, Appendix 3).
The basis functions for the final model were:

                                  Minimum
BF1 = max(0, W - 3927.375)        3927.375
BF2 = max(0, U - 219)             219.0
BF3 = max(0, O - 0.585E-6)        0.0
BF4 = max(0, S + 1.053)           1.053
BF5 = max(0, C - 1.00)            1.00
BF6 = max(0, Q - 0.191E-5)        0.0
BF7 = max(0, I + 0.192E-6)        0.0

The above basis functions indicate linear relationships between the dependent and independent
variables (Figure 6). The basis functions all behaved linearly and started at the minimum values of each
variable; in other words, there was no piecewise fitting. The final model is:

Y = 1.69 - 0.005*BF1 - 0.405E-3*BF2 - 0.003*BF3 - 0.302*BF4 + 0.111*BF5 - 0.004*BF6 +
0.005*BF7.
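The final model can be evaluated directly from the printed basis functions; a sketch (coefficients and knots transcribed from above; note the fitted value approximates the 0/1 response, so it is not constrained to [0, 1]):

```python
def mars_fit(W, U, O, S, C, Q, I):
    """Evaluate the final MARS model from its seven hinge basis functions
    (coefficients and knots transcribed from the report)."""
    bf = [
        max(0.0, W - 3927.375),  # BF1: latitude (UTM-N)
        max(0.0, U - 219.0),     # BF2: elevation
        max(0.0, O - 0.585e-6),  # BF3: vegetation over water
        max(0.0, S + 1.053),     # BF4: water salinity
        max(0.0, C - 1.00),      # BF5: riparian zone area
        max(0.0, Q - 0.191e-5),  # BF6: vegetation over adjacent land
        max(0.0, I + 0.192e-6),  # BF7: bedrock substrate cover
    ]
    coef = [-0.005, -0.405e-3, -0.003, -0.302, 0.111, -0.004, 0.005]
    return 1.69 + sum(c * b for c, b in zip(coef, bf))

# At the knots every basis function is zero, leaving only the intercept
print(round(mars_fit(3927.375, 219.0, 0.0, -1.053, 1.0, 0.0, -0.192e-6), 2))  # 1.69
```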
One important note: when different model specification options were used in the GUI, the sign of
some variables was reversed in the final model. This is an effect of collinearity between the independent
variables. An option called "penalty" ("Options and Limits" under Model in the GUI) can be used when
variables are added to reduce the potential for including collinear variables in a model.

For the final model, the "overall" F value was significant (F = 14.13, P = 0.425E-12, df = 7,114;
Table 4; Appendix 3). Three different R² values appear in the GUI and the output (Appendix 3) that
describe the degree of association between the basis functions and the dependent variable, calculated as
in ordinary least squares regression. These R² values are therefore not appropriate for assessing the
predictive power of the model when the response variable is binary; they can, however, be used to
compare models. The description below of each R² is for clarification.
In GUI, Naive R2 (final MARS-Model R2), Naive-Adjusted R2, and GCV R2 are given. In the output
(Appendix 3), values of R2, adjusted R2, and the uncentered R2 are given. An explanation of each is given
below:
NaTve (GUI) and R-squared (Appendix 3): The degree of the association between the dependent
variable and the basis functions. It was calculated in the same way as the R2 for the simple
ordinary least regression analysis.
Naive-Adjusted R2 (GUI) and Adjusted R2 (Appendix 3): It is the same as the adjusted R2 for the
simple ordinary least regression analysis, adjusted for the number of basis functions.
GCV R2: It is adjusted for the effective number of the parameters in the model, and it is always the
lowest in value of the MARS R2.
Uncentered R2: It should not be used to test for goodness of fit. It is used in econometric diagnostic
tests.
R² and Adjusted R² were 0.465 and 0.432, respectively. The R² values from MARS and from the
simple logistic regression analysis are both close to 50 percent. Values of the coefficients, standard
errors, t-ratios, and associated P values are given in Appendix 3. All coefficients were significant
(P < 0.03).
[Figure 6: seven panels, one per main-effect variable (x-axes: UTM-N, Elevation (m), Vegetation Cover Over Water (%), Water Salinity (uS/cm), Riparian Zone Area (m²), Vegetation Cover Over Adjacent Land (%), Bedrock Substrate Cover (%)); curve maxima range from 0.265 to 0.806.]

Figure 6. Regression for the main effect variables in the model using MARS.
Section 3
Conclusion
The goal of this report was to provide a reference manual to assist researchers in making informed use
of logistic regression, using both standard logistic regression and MARS. The details of the analyses of
relationships between presence/absence of an amphibian and environmental variables were presented in
a manner intended to be usable by nonstatisticians.
As noted, when comparing standard logistic regression with another parametric method such as
discriminant analysis, the former does not require multivariate normality, which often makes it more
applicable to field data. Logistic regression using the nonparametric method, MARS, allows the user to
fit a group of models to the data that reveal the structural behavior of the data with little input from the
user. Results using the standard regression (GLM) and the generalized additive approach (MARS) were
similar for our example data set.
Logistic regression analysis was used to predict the probability of toad presence with respect to a
number of environmental variables. Variable importance measured by the standardized estimate indicated
that the geographical metrics (latitude and elevation) were the most important factors influencing toad
presence, operating in a negative relationship. On the other hand, water salinity and percent bedrock
substrate had lesser impacts on toad presence, with the former operating in a negative direction and the
latter operating in a positive direction.
MARS is a nonparametric logistic regression analysis that is procedurally close to simple parametric
logistic regression analysis because variables are selected through stepwise regression. MARS and simple
logistic regression analysis yielded similar models, and both indicated that latitude and elevation are the
most important variables influencing toad presence. For this data set, simple logistic regression analysis is
the better method to use because of the low number of observations (n = 122). The MARS User Guide
(p. 34) recommends that, with between 10 and 20 degrees of freedom between knots, about 1,000
observations be available when 30 variables are used; that is, roughly 33 observations per independent
variable. Although six independent variables were significant in both the simple logistic regression
analysis and the nonparametric logistic regression analysis (MARS), and the R2 values were similar, the
application of MARS to this data set is still not superior to that of the parametric procedure.
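Because the final MARS model is simply a sum of hinge basis functions (see "Basis Functions" in Appendix 3), its piecewise-linear prediction can be evaluated directly. The sketch below re-expresses the printed equation in Python (an illustration of the fitted equation only, not the MARS software itself):

```python
def hinge(x, knot):
    """MARS basis function: max(0, x - knot)."""
    return max(0.0, x - knot)

# Final-model coefficients and knots from the MARS output (Appendix 3).
# W latitude, U elevation, O veg. cover over water, S water salinity,
# C riparian zone area, Q veg. cover over adjacent land, I bedrock cover.
TERMS = [
    (-0.005,       "W", 3927.375),
    (-0.405123e-3, "U", 219.000),
    (-0.003,       "O", 0.584602e-6),
    (-0.302,       "S", -1.053),
    (0.111,        "C", 1.000),
    (-0.004,       "Q", 0.191016e-5),
    (0.005,        "I", -0.192323e-6),
]

def mars_score(x):
    """Piecewise-linear MARS prediction Y for a dict of predictor values."""
    return 1.690 + sum(c * hinge(x[v], knot) for c, v, knot in TERMS)
```

At or below every knot the hinge terms vanish, so the prediction collapses to the intercept (1.690); each predictor contributes only beyond its knot, which is what makes the fitted response piecewise linear.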
On a final note, the decision to use simple logistic regression, MARS, or any other nonparametric
method must be made on the basis of the suitability and interpretability of the statistical model for
describing the phenomenon. If the interest is in inferring the significance and importance of the
environmental variables in the model, then GLM logistic regression is the method to use. If the interest is
in visualizing and examining the structural relationship between the response and the independent
variables, especially when there is little or no knowledge about the data, then GAM is the method to use.
A combination of the two methods may help reveal important relationships in building the right statistical
model.
25
-------
References
Agresti, A. 1996. An Introduction to Categorical Data Analysis. John Wiley & Sons, Inc. New York.
Allison, P.D. 1999. Logistic Regression Using the SAS System: Theory and Application. SAS Inst. Inc.,
Cary, NC.
Belsley, D.A., E. Kuh, and R.E. Welsch. 1980. Regression Diagnostics: Identifying Influential Data and
Sources of Collinearity. John Wiley & Sons, New York.
Bendel, R.B. and A.A. Afifi. 1977. Comparison of stopping rules in Forward "Stepwise" Regression.
Journal of the American Statistical Association. 72:357, 46-54.
Berry, W.D. and S. Feldman. 1985. Multiple Regression in Practice. Sage Publications, Beverly Hills.
Bradford, D.F., S.E. Franson, G.R. Miller, A.C. Neale, G.E. Canterbury, and D.T. Heggem. 1998. Bird
species assemblages as indicators of biological integrity in Great Basin rangeland. Environmental
Monitoring and Assessment. 49:1-22.
Bradford, D.F., A.C. Neale, M.S. Nash, D.W. Sada, and J.R. Jaeger. Submitted to Ecology. Habitat patch
occupancy by the red-spotted toad (Bufo punctatus) in a naturally fragmented, desert landscape.
CART, 1999. Robust Decision-Tree Technology for data mining, predictive modeling and data
preprocessing, Interface and documentation.
Christensen, R. 1997. Log-Linear Models and Logistic Regression. Springer, New York.
Diggle, P.J., K.Y. Liang, and S.L. Zeger. 1994. The analysis of longitudinal data. Oxford University
Press, New York.
Draper, N.R. and H. Smith. 1981. Applied Regression Analysis, second edition, John Wiley & Sons, Inc.
New York (p. 300).
Efron, B. and R. Tibshirani. 1991. Statistical data analysis in the computer age. Science 253:390-395.
Efroymson, M.A. 1960. "Multiple regression analysis" in Anthony Ralston and Herbert S. Wilf, eds.,
Mathematical Methods for Digital Computers, John Wiley & Sons, Inc. New York.
Friedman, J.H. 1991. Multivariate Adaptive Regression Splines (with discussion). Annals of Statistics
19(1):1-67 (software).
Griffith, D.A. and C.G. Amrhein. 1997. Multivariate Statistical Analysis for Geographers. Prentice Hall,
New Jersey.
Hastie, T.J. and R.J. Tibshirani. 1990. Generalized additive models. Chapman & Hall/CRC, New York.
Hosmer, D.W. and S. Lemeshow. 1989. Applied Logistic Regression. Wiley, New York.
26
-------
Long, J.S. 1997. Regression Models for Categorical and Limited Dependent Variables. Sage
Publications. London.
Mallows, C.L. 1973. Some comments on Cp. Technometrics, 15, 661-675.
Madansky, A. 1988. Prescriptions for Working Statisticians. Springer-Verlag, New York.
Mickey, J. and S. Greenland. 1989. A study of the impact of confounder selection criteria on effect
estimation. American Journal of Epidemiology, 129, 125-137.
Miller, T.W. 1994. Model selection in tree-structured regression in Proceedings of the Statistical
Computing Section. ASA, p. 158-163.
Moore, D.M., E.G. Lees, and S.M. Davey. 1991. A new method for predicting vegetation distributions
using decision tree analysis in a geographic information system. Env. Manag. 5(1):59-71.
Neter, J., M.H. Kutner, C.J. Nachtsheim, and W. Wasserman. 1996. Applied linear statistical models.
WCB McGraw-Hill, MA.
Press, S.J. and S. Wilson. 1978. Choosing between logistic regression and discriminant analysis. JASA,
73(364):699-705.
Rao, C.R. 1973. Linear Statistical Inference and Its Applications, second edition. Wiley, Inc., New York.
Salford Systems. 1999. MARS, User Guide. Cal. Stat. Software, Inc., San Diego, California.
Sheskin, D.J. 2000. Handbook of Parametric and Nonparametric Statistical Procedures, second edition.
Chapman and Hall/CRC, New York.
Walker, P.A. 1990. Modeling wildlife distribution using a geographic information system: Kangaroos in
relation to climate. Journal of Biogeography. 17:279-286.
White, D. and J. Sifneos. 1997. Mapping multivariate spatial relationships from regression trees by
partitions of color visual variables. CSM 57th annual convention, ASPRS 63rd Annual Convention,
Seattle, Washington. April 7-10, 1997. 86-95.
27
-------
Appendix 1
GENMOD Output
Dependence Between Observations
Based on Watershed Group
8:38 Thursday, April 19, 2001
The GENMOD Procedure
Model Information
Data Set WORK.NEWEUC
Distribution Binomial
Link Function Logit
Dependent Variable Toad
Observations Used 122
Probability Modeled Pr (Toad = 1.0000)
Missing Values 6
Class Level Information
Class Levels Values
WSgrp 11 AMFL AMRI COLD COLO ELDO IVAN LASV MESQ PAHR PAIV VIRG
Response Profile
Ordered Level Ordered Value Count
1 1.0000 88
2 0.0000 34
Parameter Information
Parameter Effect
Prm1 Intercept
Prm2 A
Prm3 U
Prm4 W
Prm5 I
Prm6 O
Prm7 Q
Prm8 S
28
-------
Criteria for Assessing Goodness of Fit
Criterion             DF       Value    Value/DF
Deviance             114     55.5828      0.4876
Scaled Deviance      114     55.5828      0.4876
Pearson Chi-Square   114     74.4921      0.6534
Scaled Pearson X2    114     74.4921      0.6534
Log Likelihood               -27.7914
Algorithm Converged
Analysis of Initial Parameter Estimates
                               Standard        Wald 95%
Parameter  DF   Estimate      Error      Confidence Limits     Chi-Square   Pr > ChiSq
Intercept   1   357.1798    85.5390    189.5266    524.8331       17.44      <.0001
A           1     1.8175     0.7135      0.4191      3.2158        6.49      0.0109
U           1    -0.0086     0.0021     -0.0127     -0.0044       16.45      <.0001
W           1    -0.0868     0.0211     -0.1281     -0.0455       16.99      <.0001
I           1     0.1865     0.0635      0.0620      0.3109        8.62      0.0033
O           1    -0.0601     0.0170     -0.0933     -0.0268       12.54      0.0004
Q           1    -0.0383     0.0153     -0.0683     -0.0083        6.24      0.0125
S           1    -6.4872     1.8972    -10.2056     -2.7687       11.69      0.0006
Scale       0     1.0000     0.0000      1.0000      1.0000
Note: The scale parameter was held fixed.
GEE Model Information
Correlation Structure Exchangeable
Subject Effect WSgrp (11 levels)
Number of Clusters 11
Clusters with Missing Values 5
Correlation Matrix Dimension 34
Maximum Cluster Size 33
Minimum Cluster Size 1
Algorithm Converged
Analysis of GEE Parameter Estimates
Empirical Standard Error Estimates
                          Standard
Parameter   Estimate       Error      95% Confidence Limits       Z     Pr > |Z|
Intercept   354.2233     73.0343     211.0788    497.3679       4.85    <.0001
A             1.8578      0.4825       0.9122      2.8035       3.85    0.0001
U            -0.0084      0.0012      -0.0107     -0.0061      -7.22    <.0001
W            -0.0861      0.0181      -0.1216     -0.0507      -4.77    <.0001
I             0.1881      0.0857       0.0201      0.3561       2.19    0.0282
O            -0.0597      0.0083      -0.0759     -0.0434      -7.21    <.0001
Q            -0.0384      0.0236      -0.0846      0.0078      -1.63    0.1032
S            -6.3573      0.9988      -8.3149     -4.3998      -6.37    <.0001
29
-------
Dependence Between Observations
Based on Mountain Range Group
The GENMOD Procedure
Model Information
Data Set WORK.NEWEUC
Distribution Binomial
Link Function Logit
Dependent Variable Toad
Observations Used 122
Probability Modeled Pr (Toad = 1.0000)
Missing Values 6
Class Level Information
Class Levels Values
Rangegrp 8 CLARK DRIVE ELDO KING MCHI RIVE SHEE SPRI
Response Profile
Ordered Level Ordered Value Count
1 1.0000 88
2 0.0000 34
Parameter Information
Parameter   Effect
Prm1        Intercept
Prm2        A
Prm3        U
Prm4        W
Prm5        I
Prm6        O
Prm7        Q
Prm8        S
Criteria for Assessing Goodness of Fit
Criterion DF Value Value/DF
Deviance 114 55.5828 0.4876
Scaled Deviance 114 55.5828 0.4876
Pearson Chi-Square 114 74.4921 0.6534
Scaled Pearson X2 114 74.4921 0.6534
Log Likelihood -27.7914
Algorithm Converged
30
-------
Analysis of Initial Parameter Estimates
                               Standard        Wald 95%
Parameter  DF   Estimate      Error      Confidence Limits     Chi-Square   Pr > ChiSq
Intercept   1   357.1798    85.5390    189.5266    524.8331       17.44      <.0001
A           1     1.8175     0.7135      0.4191      3.2158        6.49      0.0109
U           1    -0.0086     0.0021     -0.0127     -0.0044       16.45      <.0001
W           1    -0.0868     0.0211     -0.1281     -0.0455       16.99      <.0001
I           1     0.1865     0.0635      0.0620      0.3109        8.62      0.0033
O           1    -0.0601     0.0170     -0.0933     -0.0268       12.54      0.0004
Q           1    -0.0383     0.0153     -0.0683     -0.0083        6.24      0.0125
S           1    -6.4872     1.8972    -10.2056     -2.7687       11.69      0.0006
Scale       0     1.0000     0.0000      1.0000      1.0000
Note: The scale parameter was held fixed.
GEE Model Information
Correlation Structure          Exchangeable
Subject Effect                 Rangegrp (8 levels)
Number of Clusters             8
Clusters with Missing Values   4
Correlation Matrix Dimension   55
Maximum Cluster Size           53
Minimum Cluster Size           4
Algorithm Converged
Analysis of GEE Parameter Estimates
Empirical Standard Error Estimates
                          Standard
Parameter   Estimate       Error      95% Confidence Limits       Z     Pr > |Z|
Intercept   367.1993     53.8241     261.7060    472.6926       6.82    <.0001
A             1.8007      0.5507       0.7215      2.8800       3.27    0.0011
U            -0.0088      0.0009      -0.0106     -0.0070      -9.60    <.0001
W            -0.0893      0.0133      -0.1154     -0.0632      -6.71    <.0001
I             0.1786      0.0210       0.1374      0.2197       8.51    <.0001
O            -0.0598      0.0089      -0.0772     -0.0425      -6.76    <.0001
Q            -0.0378      0.0239      -0.0847      0.0091      -1.58    0.1144
S            -6.2212      0.8184      -7.8252     -4.6173      -7.60    <.0001
31
-------
Appendix 2
SAS Statements for Standard Logistic Regression
************ Comprehensive Final Model *****************;
Options ps=255 ls=255;
Proc Corr data=neweuc;
var A B C D E F G H I J K L M N O P Q R S T U V W X Y;
run;
Options ps=255 ls=150;
Proc Logistic data=neweuc Descending;
Model Toad = A B C D E F G H I J K L M N O P Q R S T U V W X Y
/ Selection=stepwise tech=newton details sle=0.30 sls=0.1
Waldcl Waldrl Plcl Influence Iplot Lackfit Rsq CORRB;
output out=ol predicted=pred difdev=dev difchisq=chi;
run;
Proc Print data=ol;
var dev chi pred Toad;
run;
goptions reset=global gunit=pct noborder
ftext=Swissb htext=2; * horigin=0.2 in vorigin=0.2 in;
axis1 length=40 order=(0 to 2000 by 500) label=(justify=c 'Elevation (m)');
axis2 length=40 order=(0 to 1) label=(justify=c angle=-270 'Presence/Absence');
Symbol1 v=dot c=black h=1;
Proc gplot data=ol;
Format Toad F1.0 Elev2 F4.0;
Plot Toad * U /
frame
haxis=axis1
vaxis=axis2;
Title ' ';
Footnote1 j=c 'Figure 1. Presence (=1) and absence (=0) of the amphibians ';
Footnote2 j=r ' ';
Footnote3 j=r ' ';
Footnote4 j=r ' ';
Footnote5 j=r 'pg 21';
run;
32
-------
*********************************************;
goptions reset=global gunit=pct noborder
ftext=Swissb htext=2 horigin=0.2 in vorigin=0.2 in;
axis1 length=60;
axis2 length=60 label=(justify=c angle=-270 'Deviance');
Proc gplot data=ol;
Plot Dev*Pred /
frame
haxis=axis1
vaxis=axis2;
Symbol1 v=dot h=1.2;
Title ' ';
Footnote1 j=c 'Figure 2. Deviance (DIFDEV) values for the predicted probability for the study area';
Footnote2 j=r ' ';
Footnote3 j=r ' ';
Footnote4 j=r ' ';
Footnote5 j=r 'pg';
run;
axis3 length=60 label=(justify=c angle=-270 'Chi-Square');
Proc gplot data=ol;
* format Per_min F3.0;
Plot chi*pred /
frame
haxis=axis1
vaxis=axis3;
Symbol1 v=dot h=1.2;
Title ' ';
Footnote1 j=c 'Figure 3. Chi-square (DIFCHISQ) values for the predicted probability for the study area';
Footnote2 j=r ' ';
Footnote3 j=r ' ';
Footnote4 j=r ' ';
Footnote5 j=r 'pg';
run;
**** Test for Collinearity for the independent variables in the final model ****;
Data O2;
Set Ol; /* Ol contains the residuals from the regression equation */
W=Pred*(1-Pred);
run;
Options ps=255 ls=100;
Proc Reg data=O2;
Weight W;
Model Toad = U W S I O A Q / TOL VIF;
run;
33
-------
**** End of Test for Collinearity ****;
**** Test for the dependence between observations ****;
Options ps=255 ls=100;
Title ' Based on Watershed groups ';
Proc Genmod data = neweuc ;
Class WsGrp ;
Model Toad = U W S I O A Q / D=B ;
Repeated subject = Wsgrp / Type=exch;
run;
Title ' Based on Mountain Range groups ';
Proc Genmod data = neweuc ;
Class Rangegrp;
Model Toad = U W S I O A Q/D=B ;
Repeated subject = Rangegrp / Type=exch;
run;
34
-------
Appendix 3
MARS Output
MARS Version 1.0.0.14
Reading Data, up to 292142 Records.
Records Read: 122
Records Kept in Learning Sample: 122
Learning Sample Statistics
(For variable description see Table 1)
Variable       Mean         SD       N         Sum
Toad          0.721      0.450     122      88.000
V           659.172     41.397     122   80418.953
W          3989.784     36.474     122  486753.597
U          1134.431    463.159     122  138400.552
H             9.180     20.076     122    1120.000
B            52.828     35.958     122    6445.000
D            35.492     35.258     122    4330.000
F            57.131     40.272     122    6970.000
E            17.500     32.367     122    2135.000
X            24.754     37.795     122    3020.000
G            29.303     35.199     122    3575.000
Y            23.402     36.592     122    2855.000
J             2.754      0.805     122     336.000
I             6.189     13.033     122     755.000
T             8.094      0.522     122     987.510
L             6.895     11.238     122     841.200
M            45.277     84.631     122    5523.800
Q            36.163     25.985     122    4411.900
N            31.311     33.763     122    3820.000
Continued..
35
-------
Learning Sample Statistics, Continued
Variable       Mean         SD       N         Sum
K            36.221     35.146     122    4419.000
Q            23.336     31.104     122    2847.000
P            33.221     35.112     122    4053.000
R            11.484     23.568     122    1401.000
S            -0.005      0.383     122      -0.659
A             1.981      0.765     122     241.701
C             2.940      0.899     122     358.620
Ordinal Response Variable
Variable    min    Q25    Q50    Q75    max
Toad          0      0      1      1      1
Ordinal Predictor Variables: 25
Variable       min       Q25       Q50       Q75       max
V           586.21    632.86    641.92    699.45    733.08
W          3927.38   3953.22   3990.68   4019.19   4080.23
U           219.00    620.00   1256.00   1515.00   1735.00
H             0.00      0.00      0.00     10.00    100.00
B            10.00     20.00     50.00     90.00    100.00
D             0.00      0.00     20.00     70.00    100.00
F             0.00     10.00     70.00    100.00    100.00
E             0.00      0.00      0.00     20.00    100.00
X             0.00      0.00      0.00     30.00    100.00
G             0.00      0.00     10.00    100.00    100.00
Y             0.00      0.00      0.00     30.00    100.00
J             1.00      2.20      2.90      3.10      5.00
I             0.00      0.00      0.00     10.00     80.00
T             5.55      7.80      8.10      8.40      9.37
L             0.20      1.40      3.00      9.10     92.80
M             0.20      1.00      2.60     60.40    458.00
Q             0.00     14.30     30.60     56.00     99.00
N             0.00      0.00     17.00     60.00    100.00
K             0.00      0.00     27.00     70.00    100.00
Continued..
36
-------
Ordinal Predictor Variables: 25, Continued
Variable       min       Q25       Q50       Q75       max
Q             0.00      0.00     10.00     30.00    100.00
P             0.00      0.00     20.00     60.00    100.00
R             0.00      0.00      0.00     10.00    100.00
S            -1.05     -0.26     -0.08      0.20      0.96
A             1.00      1.00      1.91      2.65      3.66
C             1.00      2.52      3.17      3.60      4.45
121-fold cross validation used to estimate DF.
Estimated optimal DF(8) = 1.182 with (estimated) PSE = 0.124
Forward Stepwise Knot Placement
BasFn(s)     GCV    IndBsFns   EfPrms   Variable           Knot   Parent BsF
0          0.204       0.0       1.0
1          0.185       1.0       2.8       W           3927.375
2          0.164       2.0       4.6       U            219.000
3          0.152       3.0       6.4       O        .584602E-06
4          0.145       4.0       8.2       S             -1.053
5          0.142       5.0       9.9       C              1.000
6          0.138       6.0      11.7       Q        .191016E-05
7          0.136       7.0      13.5       I       -.192323E-06
8          0.139       8.0      15.3       T              5.550
9          0.142       9.0      17.1       H       -.142210E-06
10         0.145      10.0      18.9       A              1.000
11         0.148      11.0      20.7       V            586.205
12         0.152      12.0      22.5       D       -.955236E-06
13         0.155      13.0      24.2       X        .954165E-06
14         0.159      14.0      26.0       J              1.000
15         0.164      15.0      27.8       F        .376499E-05
Final Model (After Backward Stepwise Elimination)
Basis Fun   Coefficient     Variable   Parent           Knot
0              1.690
1             -0.005           W                    3927.375
2             -.405123E-03     U                     219.000
Continued..
37
-------
Final Model, Continued
Basis Fun   Coefficient     Variable   Parent           Knot
3             -0.003           O                 .584602E-06
4             -0.302           S                      -1.053
5              0.111           C                       1.000
6             -0.004           Q                 .191016E-05
7              0.005           I                -.192323E-06
Piecewise Linear GCV = 0.136, #efprms=13.518
ANOVA Decomposition on 7 Basis Functions
fun   std. dev.    -gcv   #bsfns   #efprms   Variable   Variable Description
1       0.192     0.173      1      1.788       W       Latitude
2       0.187     0.153      1      1.788       U       Elevation
3       0.082     0.139      1      1.788       O       Vegetation Over Water
4       0.115     0.141      1      1.788       S       Water Salinity
5       0.099     0.142      1      1.788       C       Riparian
6       0.099     0.141      1      1.788       Q       Vegetation Over Adjacent Land
7       0.069     0.138      1      1.788       I       Bedrock Substrate Cover
Piecewise Cubic Fit on 7 Basis Functions, GCV = 0.136
Relative Variable Importance
      Variable   Importance    -gcv   Variable Description
 2       W        100.000     0.173   Latitude
 3       U         68.489     0.153   Elevation
25       C         39.559     0.142   Riparian
23       S         38.052     0.141   Water Salinity
17       Q         36.812     0.141   Vegetation Cover (< 1 m high) Over Adjacent Land
20       O         25.756     0.139   Vegetation Cover Over Water
13       I         19.493     0.138   Bedrock Substrate Cover
 1       V          0.000     0.136   Longitude
 4       H          0.000     0.136   Linear Extent of Channel with Vegetation
 5       B          0.000     0.136   Surface Water Linear Extent
 6       D          0.000     0.136   Emergent-type vegetation (Typha, Eleocharis, Scirpus, Mimulus,
                                      Anemopsis; Juncus & Carex) inside Stream Channel
 7       F          0.000     0.136   Riparian Shrubs/Herbs (Baccharis, Pluchea, Vitis, Allenrolfea,
                                      Equisetum; Juncus or Carex) outside Stream Channel
Continued....
38
-------
Relative Variable Importance, Continued
      Variable   Importance    -gcv   Variable Description
 8       E          0.000     0.136   Native Riparian Trees (Salix, Populus, Fraxinus)
 9       X          0.000     0.136   Tamarix spp. (exotic plant) in 400-m Area
10       G          0.000     0.136   Phreatophytes (Prosopis, Chilopsis)
11       Y          0.000     0.136   Tamarix spp. (exotic plant) in 40-m Segments
12       J          0.000     0.136   Predominant substrate grain size
14       T          0.000     0.136   pH
15       L          0.000     0.136   Water Depth
16       M          0.000     0.136   Wetted Perimeter Width
18       N          0.000     0.136   Plot Substrate Size, for Granular Substrate
19       K          0.000     0.136   Submerged or Floating Vegetation Cover
21       P          0.000     0.136   Emergent Vegetation within 15 cm of point
22       R          0.000     0.136   Bedrock Substrate Cover
24       A          0.000     0.136   Plot Surface Water Area
Basis Functions
BF1 = max(0, W - 3927.375);      /* Latitude */
BF2 = max(0, U - 219.000);       /* Elevation */
BF3 = max(0, O - .584602E-06);   /* Vegetation Over Water */
BF4 = max(0, S + 1.053);         /* Water Salinity */
BF5 = max(0, C - 1.000);         /* Riparian */
BF6 = max(0, Q - .191016E-05);   /* Vegetation Over Adjacent Land */
BF7 = max(0, I + .192323E-06);   /* Bedrock Substrate Cover */
Y = 1.690 - 0.005 * BF1 - .405123E-03 * BF2 - 0.003 * BF3 - 0.302 * BF4 + 0.111 * BF5 - 0.004 * BF6
+ 0.005 * BF7;
model DEPVAR = BF1 BF2 BF3 BF4 BF5 BF6 BF7;
Ordinary Least Squares Results
N: 122.000 R-Squared: 0.465
Mean Dep Var: 0.721 Adj. R-Squared 0.432
Uncentered R-squared = R-0 Squared: 0.851
39
-------
Parameter            Estimate         S.E.          T-ratio    P-value
Constant               1.690          0.209           8.082   .745404E-12
Basis Function 1      -0.005         .881709E-03     -5.981   .260884E-07
Basis Function 2      -.405123E-03   .934568E-04     -4.335   .316045E-04
Basis Function 3      -0.003          0.001          -2.431   0.017
Basis Function 4      -0.302          0.104          -2.902   0.004
Basis Function 5       0.111          0.037           2.966   0.004
Basis Function 6      -0.004          0.001          -2.850   0.005
Basis Function 7       0.005          0.002           2.237   0.027
F-statistic = 14.128   [MDF, NDF] = [7, 114]
P-value = .424771E-12
S.E. of Regression = 0.339
Residual Sum of Squares = 13.132
Regression Sum of Squares = 11.392
40
40
------- |