Contribution of Nutrients and E. coli to
Surface Water Condition in the Ozarks

Part 1. Using Partial Least Squares Predictions When Standard Regression Assumptions Are Violated

Maliha S. Nash* and D. Lopez

U.S. EPA, 944 East Harmon Avenue, Las Vegas, Nevada 89119

Corresponding author e-mail: nash.maliha@epa.gov

1. Introduction

Investigating associations between surface water constituents and landscapes involves statistical
analyses of fundamentally different data sets. Data on surface water conditions are generally obtained
through field sampling programs, which are expensive and labor intensive; consequently, the total
number of sample sites is usually small. The data set may also contain missing values due to the
realities of sampling or cost. Landscape data, however, are derived from remote sensing platforms,
thereby permitting wall-to-wall coverage. The landscape data sets may contain a very large number of
variables, although many of these are not wholly independent (i.e., they may be collinear).
Single- and multiple-regression analyses have frequently been used to relate water nutrient
concentrations to selected landscape variables, but these methods are sensitive to missing values and
to dependence among the predictors (landscape variables). Reliable, statistically significant results
generally cannot be obtained unless the total number of samples greatly exceeds the number of
variables. Partial least squares (PLS) analysis offers a number of advantages over the more
traditionally used regression analyses. It has been found to be useful both for providing accurate
predictions and for interpreting relationships between data sets containing a high degree of
collinearity (see references in Nash et al., 2005).

2. Data Description

2.1. SITE DESCRIPTION

The study area is the Upper White River region (21,848 km²) in the
Ozarks of Missouri and Arkansas (Figure 1).

Figure 1. The Upper White River study area is in the Ozarks of Missouri and Arkansas, where
244 water quality sampling locations were sampled (A) and used as "pour points," from which
244 contributing subwatersheds were delineated (B). A combination of multiple Landsat
Thematic Mapper imagery (C) and digital aerial photography was used to produce a 2000 land
cover map of the study area (D), which was used to calculate landscape metrics.

3. Results

The TP PLS model resulted in one significant factor explaining 59% of the variability in
TP (see table). The most significant contributors are the watershed percent barren and
stream density. Stream density relates inversely to TP, whereas percent barren enhances
TP in surface water. The forest-related variables contribute equally, with a negative
effect on TP. Urban enhanced TP, but mostly within the proximity (RurbO) of
the sampling site.

The TAM PLS model resulted in one significant factor explaining 93% of the variability in
TAM (see table). Riparian and natural within all distances have a negative effect on
TAM, whereas urban has a positive effect.

The E. coli PLS model resulted in two significant factors explaining 81% of the variability
in the E. coli population (see table). Urban in riparian within an area of 0 m or more of the
sampling site enhanced E. coli.

The prediction of the constituents in the 244 watersheds (from a small field-based data
sample) was used to visualize the joint behavior of the predicted TP, TAM, and E. coli in
surface water of the Upper White River (Figure 2).

Landscape Metrics Coefficient VIP Coefficient VIP Coefficient VIP

Figure 2. Three-dimensional plot of predicted TAM,
TP, and E. coli cell counts among 244 subwatersheds
in the Upper White River region of the Ozarks.

2.2. DATA

For each of the selected sites, the watershed support area was delineated and a suite of landscape variables
was calculated. There were 244 sites, each with its supporting watershed. A total of 46 landscape metrics
were generated for each watershed. Measured total phosphorus (TP), total ammonia (TAM), and E. coli were
available at only 18, 6, and 15 sites, respectively. Landscape metrics were for the year 2000, and surface
water constituents were averaged over the period 1997-2002. Each of the surface water constituents from
the above sites was used in PLS to predict values for the remaining sites among the 244.

2.3. STATISTICAL METHODOLOGY

PLS is a multivariate analysis technique that permits analysis and prediction for data sets with missing
values, with collinearity and with a relatively small number of observations (see references in Nash et
al., 2005).

In the PLS analyses, both data sets (i.e., water and landscape variables) are first centered and scaled. A
linear combination of the independent variables is composed (T = L0 W, where L0 denotes the landscape
data, T is the score, and W is the weight), forming a number of orthogonal latent variables [T] that are
fewer in number (dimensions) than the original landscape variables. The linear combination in [T] is
formed so that the covariance between [T] and the linear combination of the dependent variables [U] is
maximized (U = B0 V, where B0 denotes the water data, U is the score, and V is the weight). Prediction of
both water and landscape data is via regression on the common latent variables (T). Modeling and
prediction in PLS, therefore, are not solely based on the conditional distribution of the responses (water
variables) given the independent variables (landscape variables); instead, PLS accounts for both
landscape and water together through [T] (see references in Nash et al., 2005).
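The construction of the latent variables described above can be sketched for a single water constituent. The code below is a minimal illustration, not the SAS procedure used in this study: it assumes the common NIPALS form of PLS with one response, and the function names (`standardize`, `pls1_scores`) and the synthetic data are hypothetical. Only the dimensions (18 sites, 46 landscape metrics) are taken from the TP data set.

```python
import numpy as np

def standardize(a):
    """Center each column and scale it to unit sample standard deviation."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

def pls1_scores(L0, b0, n_factors):
    """Extract orthogonal latent scores T = L0 W for one response b0 (NIPALS sketch)."""
    X = L0.copy()
    n, p = X.shape
    T = np.zeros((n, n_factors))
    W = np.zeros((p, n_factors))
    for a in range(n_factors):
        w = X.T @ b0                      # direction with maximal covariance with the response
        w /= np.linalg.norm(w)            # unit-length weight vector
        t = X @ w                         # score: linear combination of landscape variables
        loading = X.T @ t / (t @ t)       # loading used to deflate X
        X = X - np.outer(t, loading)      # remove the part of X explained by this factor
        T[:, a], W[:, a] = t, w
    return T, W

# Synthetic, collinear stand-in data (assumption: NOT the study data).
rng = np.random.default_rng(0)
n_sites, n_metrics = 18, 46
Z = rng.normal(size=(n_sites, 3))
raw_landscape = Z @ rng.normal(size=(3, n_metrics)) + 0.1 * rng.normal(size=(n_sites, n_metrics))
raw_water = Z[:, 0] + 0.1 * rng.normal(size=n_sites)

L0 = standardize(raw_landscape)
b0 = standardize(raw_water)
T, W = pls1_scores(L0, b0, n_factors=2)
gram = T.T @ T                            # off-diagonal entries vanish for orthogonal scores
```

Because each factor is extracted from the deflated landscape matrix, the score vectors in `T` are mutually orthogonal even though the 46 metrics are strongly collinear and outnumber the sites.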

PLS produces n-1 factors, with each factor containing a pair of scores (Ti, Ui). Linear combinations on
each data set are called factors. For example, if the number of sites (observations) is 89, then 88 factors
will be produced. Not all of these factors are significant under the Cross Validation (CV) method; only
the significant factors are used in the final model. The fitted models are tested using the test data sets,
and the predicted values are compared with the observed values using PRESS (Predicted Residual Sum of
Squares) to assess the predictive ability of the model. The root mean PRESS and its significance level
(the lower the value, the better the model) are used to select the final model.
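The cross-validation step can be illustrated with a leave-one-out sketch. This is a simplified illustration under stated assumptions: it uses a single-response NIPALS fit, defines root mean PRESS as sqrt(PRESS / n), and omits the significance test; the normalization and test used by SAS are not reproduced here. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def pls1_coef(X, y, n_factors):
    """Regression coefficients of a single-response PLS model on centered, scaled X."""
    Xa = X.copy()
    p = X.shape[1]
    W = np.zeros((p, n_factors))
    P = np.zeros((p, n_factors))
    q = np.zeros(n_factors)
    for a in range(n_factors):
        w = Xa.T @ y
        w /= np.linalg.norm(w)
        t = Xa @ w
        tt = t @ t
        P[:, a] = Xa.T @ t / tt           # X loading
        q[a] = y @ t / tt                 # regression of the response on the score
        W[:, a] = w
        Xa = Xa - np.outer(t, P[:, a])    # deflate X
    return W @ np.linalg.solve(P.T @ W, q)

def loo_press(X, y, n_factors):
    """Leave-one-out PRESS: each site is predicted from a model fitted without it."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        x_mean = X[keep].mean(axis=0)
        x_sd = X[keep].std(axis=0, ddof=1)
        y_mean = y[keep].mean()
        pred = y_mean                     # zero-factor model predicts the training mean
        if n_factors > 0:
            coef = pls1_coef((X[keep] - x_mean) / x_sd, y[keep] - y_mean, n_factors)
            pred = y_mean + ((X[i] - x_mean) / x_sd) @ coef
        press += (y[i] - pred) ** 2
    return press

# Synthetic stand-in data (assumption: NOT the study data).
rng = np.random.default_rng(1)
n_sites, n_metrics = 18, 46
Z = rng.normal(size=(n_sites, 2))
X = Z @ rng.normal(size=(2, n_metrics)) + 0.1 * rng.normal(size=(n_sites, n_metrics))
y = 2.0 * Z[:, 0] + 0.1 * rng.normal(size=n_sites)

rm_press = [np.sqrt(loo_press(X, y, k) / n_sites) for k in range(4)]
best_k = int(np.argmin(rm_press))         # candidate number of factors for the final model
```

Choosing the number of factors with the lowest root mean PRESS is the simplest rule; a significance test would additionally favor a smaller model whose PRESS is not significantly worse than the minimum.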

After defining the significant PLS factors, the scores, weights, and VIP (Variable Influence on Projection)
are used to examine the strength of the relationship, irregularities, and the contribution of each
independent variable (landscape) to the model. VIP values of less than 0.8 are considered to be small.
The quality of the model was determined by examining the residuals for both the response and the
landscape variables for any possible outliers. SAS was used for all statistical analyses.
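As an illustration of how VIP summarizes each landscape variable's contribution, the sketch below uses one widely used definition of VIP for a single response: the squared unit-length weights of each variable, weighted by the response sum of squares explained by each factor. This is an assumption about the formula, since the poster does not state it; the function names and the synthetic data are hypothetical. The 0.8 cutoff is the one quoted above.

```python
import numpy as np

def pls1_fit(X, y, n_factors):
    """Single-response NIPALS fit returning weights W, scores T, and response loadings q."""
    Xa = X.copy()
    n, p = X.shape
    W = np.zeros((p, n_factors))
    T = np.zeros((n, n_factors))
    q = np.zeros(n_factors)
    for a in range(n_factors):
        w = Xa.T @ y
        w /= np.linalg.norm(w)            # unit-length weights, as the VIP formula assumes
        t = Xa @ w
        q[a] = y @ t / (t @ t)
        Xa = Xa - np.outer(t, Xa.T @ t / (t @ t))
        W[:, a], T[:, a] = w, t
    return W, T, q

def vip(W, T, q):
    """VIP_j = sqrt(p * sum_a SS_a * w_ja^2 / sum_a SS_a), with SS_a = q_a^2 * t_a't_a."""
    p = W.shape[0]
    ss = q ** 2 * np.sum(T ** 2, axis=0)  # response sum of squares explained per factor
    return np.sqrt(p * (W ** 2 @ ss) / ss.sum())

# Synthetic stand-in data (assumption: NOT the study data).
rng = np.random.default_rng(2)
n_sites, n_metrics = 15, 46
Z = rng.normal(size=(n_sites, 2))
X = Z @ rng.normal(size=(2, n_metrics)) + 0.1 * rng.normal(size=(n_sites, n_metrics))
y = Z[:, 0] - 0.5 * Z[:, 1] + 0.1 * rng.normal(size=n_sites)
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
y = (y - y.mean()) / y.std(ddof=1)

W, T, q = pls1_fit(X, y, n_factors=2)
vip_values = vip(W, T, q)
small = vip_values < 0.8                  # variables with a small influence on the projection
```

Under this definition the squared VIP values average to one across the variables, which is why cutoffs slightly below 1, such as 0.8, are used to flag weak contributors.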

[Table: PLS coefficient and VIP value of each landscape metric for TP, TAM, and E. coli, followed by rows for the number of significant factors and the percent variation explained. The row labels and column alignment are not recoverable.]

Coefficients of the non-centered values of landscape metrics used to
predict ln(TP), TAM, and ln(E. coli). The number of significant
PLS factors and the percent variation explained by PLS for the
responses are in the last two rows.
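Coefficients for non-centered values can be obtained from the coefficients of the centered and scaled model by undoing the standardization. The sketch below shows that conversion under the assumption that both the landscape metrics and the response were standardized with their sample means and standard deviations; the function name, the coefficient values, and the data are hypothetical and are not taken from the table.

```python
import numpy as np

def to_raw_scale(beta_std, x_mean, x_sd, y_mean, y_sd):
    """Convert coefficients for standardized variables to an intercept and raw-scale slopes."""
    slopes = beta_std * y_sd / x_sd       # undo the scaling of each metric and of the response
    intercept = y_mean - x_mean @ slopes  # undo the centering
    return intercept, slopes

# Hypothetical data: 15 sites, 4 landscape metrics, a positive response modeled on the ln scale.
rng = np.random.default_rng(3)
X_raw = rng.uniform(0.0, 100.0, size=(15, 4))
ln_response = np.log(rng.uniform(1.0, 500.0, size=15))

x_mean, x_sd = X_raw.mean(axis=0), X_raw.std(axis=0, ddof=1)
y_mean, y_sd = ln_response.mean(), ln_response.std(ddof=1)
beta_std = np.array([0.4, -0.3, 0.1, 0.0])   # illustrative standardized coefficients

intercept, slopes = to_raw_scale(beta_std, x_mean, x_sd, y_mean, y_sd)

# Both routes give the same prediction on the ln scale.
pred_via_std = y_mean + y_sd * (((X_raw - x_mean) / x_sd) @ beta_std)
pred_via_raw = intercept + X_raw @ slopes
pred_original_units = np.exp(pred_via_raw)   # back-transform from ln to the measured units
```

Raw-scale coefficients can be applied directly to the landscape metrics of unsampled watersheds, which is convenient when predicting for all 244 sites.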

4. Discussion and Conclusion

The results indicate that PLS may prove to be a valuable statistical analysis tool for
ecological studies. The PLS methodology is less sensitive than standard regression
methods to missing values, collinearity, and small numbers of observations. The joint
behavior of TP and TAM as related to E. coli (Figure 2) could not be examined using the
measurements from the study area sites alone (18, 6, and 15 sites for TP, TAM, and
E. coli, respectively); this limitation was overcome by prediction from the PLS models
for the 244 sites. Hence, further analyses and comparisons within and between the four
groups (high TP-high E. coli, low TP-high E. coli, low TP-high E. coli, and moderate TP,
TAM, and E. coli) may reveal the spatial characteristics of the watersheds and their
effect on surface water quality.

The model results may help landscape ecologists produce indicators of surface
water condition, such that unique combinations of these indicators can be used to
infer the potential cause(s) and origin(s) of nonpoint pollution, which may lead to
eutrophication in aquatic ecosystems, the loss of aquatic ecosystem function, and
harm to humans who consume from (or recreate in) the aquatic resources of
the Ozarks. The PLS results discussed in this poster are actively being used to
prioritize subwatersheds in the Ozarks for watershed management activities.

Reference

Nash, M.S., D.J. Chaloud and R.D. Lopez. 2005. Application of Canonical Correlation Analysis
and Partial Least Square Analyses in Landscape Ecology. EPA/600/X-05/004.


Notice: Although this work was reviewed by EPA and approved for publication, it may not necessarily reflect official Agency policy. Mention of trade names or commercial products does not constitute endorsement or recommendation by EPA for use.

