Development of the Virtual Beach Model, Phase 1: An Empirical Model

Zhongfu Ge¹, Walter E. Frick²
¹ National Research Council; ² US EPA, NERL, ERD, Athens, GA

Example of Modeling for Huntington Beach, Ohio, a Lake Erie Beach

Data come from 247 days in 2000-2004. Four explanatory variables are available: turbidity (TB), wave height (WH), antecedent 24-hour rainfall (RW), and wind direction (WD). Cross-correlation with time delays shows that the data do not need to be synchronized. (USGS data)

[Figure panels: log(TB); sqrt(RW)]
Transformations can help normalize scatter.
[Figure panel: term in log(TB) and WH, southerly winds]

Dynamic Modeling

In the proposed dynamic modeling approach, models are updated as new observations are added to the database. This graphic shows predictions for 60 days. (Note: measurements are not available for every day.)

[Image of E. coli]

Conclusions

• Model selection should use Cp and R² as joint criteria; R² or the t-statistic alone was found inadequate.
• Transformations tend to improve results.
• Interaction terms can improve model R² and are especially useful when few variables are available (here, 48% with interactions compared to 41% without).
• Optimal models are both beach-specific and time-varying.
• Dynamic modeling based on a growing database is recommended.

Abstract

With increasing attention focused on multiple linear regression (MLR) modeling of beach fecal bacteria concentrations, the validity of the entire statistical process should be carefully evaluated to ensure satisfactory predictions. This work aims to identify pitfalls and misunderstandings in the statistical aspects of modeling. It stresses the importance of preliminary inspection of raw data, useful transformations, development of interaction terms, adjustment for time-series effects, identification of outliers, correlation studies, and model selection criteria. It is recommended that model selection be conducted using R² and the Cp statistic as joint criteria.
The methodology is illustrated with actual data from Huntington Beach, OH, in 2000-2004. Dynamic modeling is advanced as a new concept for prediction purposes, because beach bacteria MLR models are in fact beach-specific and time-varying. This work also serves as a statistical basis for US EPA's public-domain pathogen assessment software, Virtual Beach.

[Figure: Example Virtual Beach input screen.] Rows and columns can be highlighted by the user to initiate actions, such as omitting a variable (last column), or by the program, for example to show an identified outlier case (first row).

Objectives

• To demonstrate multiple linear regression modeling of E. coli concentrations
• To clarify some misunderstandings and pitfalls of MLR modeling found in practice
• To promote the idea of dynamic modeling based on a growing database

[Table legend: a mark indicates that the variable is in the model; R² is consistently around 48%.]
The table shows how the best models change with time (additional data). Although the variables that produce the best models change with time, the R² value remains steady, around 48%.

Model Selection

After outlier identification (not shown), model selection is accomplished using backward elimination, a process facilitated by Virtual Beach. Starting with the full model, the poorest predictor is eliminated at each step, and Cp and BIC (Bayesian Information Criterion) vary as predictors are removed. The minima in these statistics help to identify the best models.
[Figure axis: sequence of elimination]

Other data inspection issues include multicollinearity. This table shows variance inflation factors (VIF) for each explanatory variable.

The full model before backward elimination is given here. The graphic depicts the normality of the residuals.

$\mu\{\log(\mathrm{EC})\} = \beta_0 + \beta_1\,\mathrm{TB} + \beta_2\,\mathrm{WH} + \beta_3\,\mathrm{RW} + \beta_4\,\mathrm{WD} + \beta_5\,\mathrm{TB}\cdot\mathrm{WH} + \beta_6\,\mathrm{TB}\cdot\mathrm{RW} + \beta_7\,\mathrm{WH}\cdot\mathrm{RW}$

Inspection of the scatter plots shows that transformation will help equalize variances over the range of values. Here are two distributions after transformation.
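The backward elimination described under Model Selection can be sketched as follows. This is a minimal illustration on synthetic data, not the Virtual Beach implementation. It assumes that Cp is Mallows' Cp computed against the full-model error variance, that BIC takes the usual Gaussian form n·ln(SSE/n) + p·ln(n), and that the "poorest predictor" is the one whose removal raises the residual sum of squares least (equivalent to the smallest |t|-statistic). The names `sse`, `backward_elimination`, and `x0`..`x3` are hypothetical.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def backward_elimination(X, y, names):
    """Drop one predictor per step and record Cp and BIC for each model.

    X holds the predictors only; an intercept column is added here
    and is never eliminated.
    """
    n = len(y)
    ones = np.ones((n, 1))
    active = list(range(X.shape[1]))
    full_p = X.shape[1] + 1
    # Error variance estimated from the full model (assumed Mallows' Cp convention).
    s2 = sse(np.hstack([ones, X]), y) / (n - full_p)
    path = []
    while True:
        Xa = np.hstack([ones, X[:, active]])
        p = Xa.shape[1]
        e = sse(Xa, y)
        path.append({
            "vars": [names[j] for j in active],
            "cp": e / s2 - n + 2 * p,
            "bic": n * np.log(e / n) + p * np.log(n),
        })
        if not active:
            break
        # Poorest predictor: the one whose removal leaves the smallest SSE.
        worst = min(
            active,
            key=lambda j: sse(
                np.hstack([ones, X[:, [k for k in active if k != j]]]), y
            ),
        )
        active.remove(worst)
    return path

# Synthetic data (assumption): only x0 and x1 actually drive the response.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)
names = ["x0", "x1", "x2", "x3"]

path = backward_elimination(X, y, names)
best = min(path, key=lambda step: step["cp"])
print(best["vars"], round(best["cp"], 2))
```

Reading the minimum of `cp` (or `bic`) along `path` mirrors the poster's use of the minima in these statistics; R² for each step could be recorded alongside to apply the joint criterion.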
Including interaction terms can substantially improve model performance. Wind direction was categorized into northerly (WD=0) and southerly (WD=1) winds. This change produces data histograms that exhibit equal spreads.
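As a rough illustration of how the transformed variables, the wind-direction indicator, and the interaction terms could enter a regression, the sketch below builds both design matrices and compares R². Everything numeric here is an assumption: the data are synthetic (only the sample size of 247 echoes the poster), the base-10 logarithm is one possible choice for log(TB), the interactions are formed from the transformed variables, and the 90-270 degree cutoff for "southerly" is hypothetical because the poster does not state how wind direction was split. The function names `design_matrix` and `r_squared` are also hypothetical.

```python
import numpy as np

def design_matrix(tb, wh, rw, wd_deg, interactions=True):
    """Columns: intercept, log10(TB), WH, sqrt(RW), WD [, TB*WH, TB*RW, WH*RW]."""
    t = np.log10(tb)   # log transform of turbidity (base 10 assumed)
    r = np.sqrt(rw)    # square-root transform of rainfall
    # Hypothetical cutoff: winds from 90-270 degrees count as southerly (WD=1).
    wd = ((wd_deg > 90.0) & (wd_deg < 270.0)).astype(float)
    cols = [np.ones_like(t), t, wh, r, wd]
    if interactions:
        cols += [t * wh, t * r, wh * r]
    return np.column_stack(cols)

def r_squared(X, y):
    """Coefficient of determination of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    centered = y - y.mean()
    return 1.0 - float(resid @ resid) / float(centered @ centered)

# Synthetic stand-in data (assumption); not the Huntington Beach measurements.
rng = np.random.default_rng(1)
n = 247
tb = rng.lognormal(mean=2.0, sigma=0.8, size=n)   # turbidity, strictly positive
wh = rng.uniform(0.0, 1.5, size=n)                # wave height
rw = rng.exponential(0.5, size=n)                 # 24-hour rainfall
wd_deg = rng.uniform(0.0, 360.0, size=n)          # wind direction in degrees
log_ec = (1.0 + 0.8 * np.log10(tb) + 0.5 * wh
          + 0.6 * np.log10(tb) * wh + rng.normal(scale=0.5, size=n))

X_main = design_matrix(tb, wh, rw, wd_deg, interactions=False)
X_full = design_matrix(tb, wh, rw, wd_deg, interactions=True)
r2_main = r_squared(X_main, log_ec)
r2_full = r_squared(X_full, log_ec)
print(f"R2 without interactions: {r2_main:.3f}")
print(f"R2 with interactions:    {r2_full:.3f}")
```

Because the main-effects model is nested in the interaction model, R² can only stay the same or rise when interactions are added, which is one reason the poster pairs R² with Cp instead of relying on R² alone.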