Contribution of Nutrients and E. coli to Surface Water Condition in the Ozarks
Part 1. Using Partial Least Squares Predictions When Standard Regression Assumptions Are Violated

Maliha S. Nash* and D. Lopez
U.S. EPA, 944 East Harmon Avenue, Las Vegas, Nevada 89119
*Corresponding author e-mail: nash.maliha@epa.gov

1. Introduction
Investigating associations between surface water constituents and landscapes involves statistical analysis of fundamentally different data sets. Data on surface water conditions are generally obtained through field sampling programs. Field sampling and laboratory analysis are expensive and labor intensive; consequently, the total number of sample sites is usually small. The data set may also contain missing values due to the realities of sampling or cost. Landscape data, by contrast, are derived from remote sensing platforms, which permits wall-to-wall coverage. Landscape data sets may contain a very large number of variables, although many of these are not wholly independent (i.e., they may be collinear).

Single- and multiple-regression analyses have frequently been used to relate water nutrient concentrations to selected landscape variables, but these methods are sensitive to missing values and to dependence among the predictors (landscape variables). Reliable, statistically significant results generally cannot be obtained unless the total number of samples greatly exceeds the number of variables. Partial least squares (PLS) analysis offers a number of advantages over the more traditionally used regression analyses. It has been found to be useful both for providing accurate predictions and for interpreting relationships between data sets containing a high degree of collinearity (see references in Nash et al., 2005).

2. Data Description
2.1. SITE DESCRIPTION
The study area is the Upper White River region (21,848 km²) in the Ozarks of Missouri and Arkansas (Figure 1).

Figure 1.
The Upper White River study area is in the Ozarks of Missouri and Arkansas, where 244 water quality sampling locations were sampled (A) and used as "pour points," from which 244 contributing subwatersheds were delineated (B). A combination of multiple Landsat Thematic Mapper imagery (C) and digital aerial photography was used to produce a 2000 land cover map of the study area (D), which was used to calculate landscape metrics.

3. Results
The TP PLS model resulted in one significant factor explaining 59% of the variability in TP (see table). The most significant contributors are watershed percent barren and stream density. Stream density is inversely related to TP, whereas percent barren enhances TP in surface water. The forest-related variables contribute equally, each with a negative effect on TP. Urban land cover enhanced TP, but mostly in the proximity (Rurb0) of the sampling site.

The TAM PLS model resulted in one significant factor explaining 93% of the variability in TAM (see table). Riparian and natural land cover at all distances have a negative effect on TAM, whereas urban land cover has a positive effect.

The E. coli PLS model resulted in two significant factors explaining 81% of the variability in the E. coli population (see table). Urban land cover in the riparian zone within 0 m or more of the sampling site enhanced E. coli.

Predictions of the constituents for the 244 watersheds (from a small field-based data sample) were used to visualize the joint behavior of the predicted TP, TAM, and E. coli in surface water of the Upper White River (Figure 2).

Table columns: Landscape Metrics | Coefficient | VIP | Coefficient | VIP | Coefficient | VIP

Figure 2. Three-dimensional plot of predicted TAM, TP, and E. coli cell counts among 244 subwatersheds in the Upper White River region of the Ozarks.

2.2. DATA
For each of the selected sites, the watershed support area was delineated and a suite of landscape variables was calculated. There were 244 sites, each with its supporting watershed.
A total of 46 landscape metrics were generated for each watershed. Measured total phosphorus (TP), total ammonia (TAM), and E. coli were available at only 18, 6, and 15 sites, respectively. Landscape metrics were for the year 2000, and surface water constituents were averaged over the period 1997-2002. The measurements of each surface water constituent at these sites were used in PLS to predict that constituent at the remaining sites among the 244.

2.3. STATISTICAL METHODOLOGY
PLS is a multivariate analysis technique that permits analysis and prediction for data sets with missing values, with collinearity, and with a relatively small number of observations (see references in Nash et al., 2005). In the PLS analyses, both data sets (e.g., water and landscape variables) are first centered and scaled. Linear combinations of the independent variables (T = L0 W, where T is the score, W is the weight, and L0 is taken here to be the centered and scaled landscape data) form a number of orthogonal latent variables [T] that are fewer in number (dimensions) than the original landscape variables. The linear combinations in [T] are formed so that the covariance between [T] and the linear combinations of the dependent variables [U] is maximized (U = B0 V, where U is the score, V is the weight, and B0 is taken here to be the centered and scaled water data). Both water and landscape data are then predicted by regression on the common latent variables (T). Modeling and prediction in PLS, therefore, are not based solely on the conditional distribution of the responses (water variables) given the independent variables (landscape variables); instead, PLS accounts for both landscape and water data together through [T] (see references in Nash et al., 2005).

The linear combinations on each data set are called factors. PLS produces n-1 factors, with each factor containing a pair of scores (Ti, Ui). For example, if the number of sites (observations) is 89, then 88 factors will be produced. Not all of these factors are significant; significance is assessed with the cross-validation (CV) method, and only the significant factors are used in the final model.
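The steps described above (center and scale, extract latent factors that covary with the response, then choose the number of factors by cross-validation) can be sketched in a few lines of Python. This is only an illustrative sketch, not the analysis reported here, which was run in SAS. It assumes a single response and uses NIPALS-style deflation with leave-one-out cross-validation, neither of which is confirmed by the source. The function names, the three predictors, and all data values are hypothetical and synthetic.

```python
import math

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(u * v for u, v in zip(a, b))

def mean_sd(values):
    """Mean and sample standard deviation of a list of numbers."""
    m = sum(values) / len(values)
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def fit_pls1(X, y, n_factors):
    """Fit a single-response PLS model by NIPALS-style deflation (assumed variant)."""
    p = len(X[0])
    x_stats = [mean_sd([row[j] for row in X]) for j in range(p)]
    y_mean, y_sd = mean_sd(y)
    # Center and scale both data sets first.
    E = [[(row[j] - x_stats[j][0]) / x_stats[j][1] for j in range(p)] for row in X]
    f = [(v - y_mean) / y_sd for v in y]
    factors = []
    for _ in range(n_factors):
        # Weights: unit direction in predictor space with maximal covariance with y.
        w = [dot([row[j] for row in E], f) for j in range(p)]
        norm = math.sqrt(dot(w, w))
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]                                  # scores T
        tt = dot(t, t)
        load = [dot([row[j] for row in E], t) / tt for j in range(p)]   # predictor loadings
        q = dot(f, t) / tt                                              # response regressed on T
        # Deflate so the next factor's scores are orthogonal to this factor's.
        E = [[row[j] - ti * load[j] for j in range(p)] for row, ti in zip(E, t)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        factors.append((w, load, q))
    return {"x_stats": x_stats, "y_mean": y_mean, "y_sd": y_sd, "factors": factors}

def predict_pls1(model, x):
    """Predict the response for one new site from its raw predictor values."""
    e = [(xj - m) / s for xj, (m, s) in zip(x, model["x_stats"])]
    y_hat = 0.0
    for w, load, q in model["factors"]:
        t = dot(e, w)
        y_hat += q * t
        e = [ej - t * lj for ej, lj in zip(e, load)]
    return model["y_mean"] + model["y_sd"] * y_hat

def loo_press(X, y, n_factors):
    """Leave-one-out sum of squared prediction errors for a given factor count."""
    press = 0.0
    for i in range(len(X)):
        model = fit_pls1(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_factors)
        press += (y[i] - predict_pls1(model, X[i])) ** 2
    return press

# Synthetic, made-up data for 10 sampled sites (NOT values from this study):
# percent forest, percent urban (collinear with forest), stream density, and ln(TP).
forest = [80.0, 75.0, 70.0, 62.0, 55.0, 50.0, 41.0, 35.0, 30.0, 22.0]
urban = [5.0, 8.0, 10.0, 15.0, 20.0, 24.0, 30.0, 36.0, 40.0, 47.0]
stream = [1.2, 0.9, 1.4, 1.0, 1.3, 0.8, 1.1, 1.5, 0.7, 1.0]
X = [list(row) for row in zip(forest, urban, stream)]
y = [-3.1, -2.9, -2.8, -2.4, -2.2, -2.0, -1.6, -1.4, -1.1, -0.8]

# Baseline: predict each held-out site with the mean of the others.
baseline_press = sum(
    (y[i] - sum(y[:i] + y[i + 1:]) / (len(y) - 1)) ** 2 for i in range(len(y))
)
press_by_factors = {k: loo_press(X, y, k) for k in (1, 2)}
best_k = min(press_by_factors, key=press_by_factors.get)

# Refit on all sampled sites and predict a hypothetical unsampled watershed.
model = fit_pls1(X, y, best_k)
prediction = predict_pls1(model, [60.0, 17.0, 1.1])
print("Baseline PRESS:", round(baseline_press, 3))
print("PRESS by number of factors:", press_by_factors)
print("Chosen factors:", best_k, "| predicted ln(TP):", round(prediction, 3))
```

In this sketch the factor count with the lowest leave-one-out error is kept. A statistical package may instead apply a significance test to the cross-validated error, so the selected number of factors can differ.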
The fitted models are tested using the test data sets, and the predicted values are compared with the observed values using PRESS (Predicted Residual Sum of Squares) to assess the predictive ability of the model. The root mean PRESS and its significance level are used to select the final model; the lower the root mean PRESS, the better the model. After the significant PLS factors are identified, scores, weights, and VIP (Variable Influence on Projection) are used to examine the strength of the relationship, irregularities, and the contribution of each independent (landscape) variable to the model. VIP values of less than 0.8 are generally considered to be small. The quality of the model was assessed by examining the residuals for both the response and the landscape variables for any possible outliers. SAS was used for all statistical analyses.

0.00368 0.00249 -0.00319 -0.00334 -0.00327 -0.00326 0.00270 0.00397 0.00378 0.00512 0.03096 0.68643 0.87681 0.92441 1.04904 0.95938 0.98311 1.02739 -0.00027 -0.00025 -0.00023 1.03313 0.00031 1.01606 1.02533 0.00042 0.00036 1.01764 1.05578 0.87789 0.91965 0.92898 0.94123 1.07047 1.06935 -0.37189 1.16001 -0.00052 -0.00206 -0.00002 -0.00078 -0.00139 -0.00002 -0.00078 -0.00139 0.00559 0.00722 0.00676 0.00595 0.00002 0.00078 0.00139 0.01028 0.06105 2.98807 0.62089 0.87270 2.58041 -0.94186 0.00006 0.00007 -0.00048 0.82952 0.94050 0.94904 0.95368 0.95435 0.94904 0.95368 0.95435 1.10603 1.05721 1.05577 1.05542 0.94904 0.95368 0.95435 1.07970 1.10793 1.26598 0.89036 0.88389 1.31004 1.24181 0.93605 1.40485 0.96849
Number of Factors

Table. Coefficients of the non-centered values of landscape metrics used to predict ln(TP), TAM, and ln(E. coli). The number of significant PLS factors and the percent variation explained by PLS for the responses are in the last two rows.

4. Discussion and Conclusion
The results indicate PLS may prove to be a valuable statistical analysis tool for ecological studies.
The PLS methodology is less sensitive than other statistical methods to the data limitations noted in the introduction (missing values, collinearity, and a small number of observations). Examining the joint behavior of TP and TAM in relation to E. coli (Figure 2) was not possible using the measurements from the study area sites (18, 6, and 15 sites for TP, TAM, and E. coli, respectively), but this limitation was overcome by prediction from the PLS models for the 244 sites. Hence, further analyses and comparisons within and between the four groups (high TP-high E. coli, low TP-high E. coli, low TP-high E. coli, and moderate TP, TAM, and E. coli) may reveal the spatial characteristics of the watersheds and their effect on surface water quality. The model results may help landscape ecologists produce indicators of surface water condition, such that unique combinations of these indicators can be used to infer the potential cause(s) and origin(s) of nonpoint pollution, which may lead to eutrophication in aquatic ecosystems, the loss of aquatic ecosystem function, and harm to people who consume from (or recreate in) the aquatic resources of the Ozarks. The PLS results discussed in this poster are actively being used to prioritize subwatersheds in the Ozarks for watershed management activities.

Reference
Nash, M.S., D.J. Chaloud, and R.D. Lopez. 2005. Application of Canonical Correlation Analysis and Partial Least Square Analyses in Landscape Ecology. EPA/600/X-05/004.

Notice: Although this work was reviewed by EPA and approved for publication, it may not necessarily reflect official Agency policy. Mention of trade names or commercial products does not constitute endorsement or recommendation by EPA for use.