Development of the Virtual Beach Model, Phase 1: An Empirical Model

Zhongfu Ge (1), Walter E. Frick (2)
(1) National Research Council; (2) US EPA, NERL, ERD, Athens, GA

Example of modeling for Huntington Beach, Ohio, a Lake Erie beach

Data come from 247 days in 2000-
2004. Four explanatory variables are
available: turbidity (TB), wave height
(WH), antecedent 24-hour rainfall
(RW), and wind direction (WD).
Cross-correlation with time delays
shows that the data do not need to be
synchronized. (USGS data)
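
The time-delay check described above can be sketched as a lagged correlation scan. This is a minimal illustration, not the authors' procedure: the function name `lagged_correlation`, the synthetic series, and the assumption of an evenly spaced daily record are all hypothetical (the poster notes elsewhere that measurements are not available for every day).

```python
import numpy as np

def lagged_correlation(x, y, max_lag):
    """Pearson correlation between x delayed by `lag` steps and y, for lag = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    result = {}
    for lag in range(max_lag + 1):
        # Pair x[t - lag] with y[t]: drop the last `lag` values of x
        # and the first `lag` values of y.
        xs = x[: len(x) - lag]
        ys = y[lag:]
        result[lag] = float(np.corrcoef(xs, ys)[0, 1])
    return result

# Synthetic illustration only (not the Huntington Beach data).
rng = np.random.default_rng(0)
turbidity = rng.normal(size=200)
log_ec = 0.8 * turbidity + rng.normal(scale=0.5, size=200)  # same-day response

corr = lagged_correlation(turbidity, log_ec, max_lag=3)
best_lag = max(corr, key=lambda k: abs(corr[k]))
```

If the strongest correlation sits at lag 0, as it does by construction in this synthetic series, the explanatory variable and the response can be used without shifting one against the other.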

log(TB) and sqrt(RW)

Transformations can help normalize scatter

[Figure panel: transformed term combining log(TB) and WH]

Southerly winds

Dynamic Modeling

In the proposed dynamic
modeling approach, models
are updated as new
observations are added to the
database. This graphic shows
predictions for 60 days. (Note:
measurements are not
available for every day.)
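
One way to picture the dynamic approach is an expanding-window refit: before each new prediction, the regression is re-estimated on all observations collected so far. The sketch below is an assumption about how this could be done, not the Virtual Beach implementation; `fit_ols`, `dynamic_predictions`, `min_train`, and the synthetic data are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def dynamic_predictions(X, y, min_train):
    """Predict observation t from a model fitted to observations 0..t-1 only."""
    preds = np.full(len(y), np.nan)  # no prediction until min_train rows exist
    for t in range(min_train, len(y)):
        beta = fit_ols(X[:t], y[:t])          # refit on the growing database
        preds[t] = beta[0] + X[t] @ beta[1:]  # predict the new day
    return preds

# Synthetic illustration only: two predictors, 90 observations.
rng = np.random.default_rng(2)
X = rng.normal(size=(90, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.3, size=90)

preds = dynamic_predictions(X, y, min_train=30)
rmse = float(np.sqrt(np.nanmean((preds - y) ** 2)))
```

Here the set of variables is held fixed and only the coefficients are updated; a fuller version would also repeat the variable selection at each step, since the poster reports that the best variable set changes as data accumulate.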

Image of E. coli

Conclusions

• Model selection should be based on Cp and R2 as joint criteria; R2 or the t-statistic alone is found to be inadequate

• Transformations tend to improve results

• Interaction terms can improve model R2 and are especially useful when variables are limited (herein 48% compared to 41%
without interactions)

• Optimal models are both beach-specific and time-varying

• Dynamic modeling based on a growing database is recommended

Abstract

With increasing attention focused on the use of
multiple linear regression (MLR) modeling of beach
fecal bacteria concentration, the validity of the entire
statistical process should be carefully evaluated to
assure satisfactory predictions. This work aims to
identify pitfalls and misunderstandings of the statistical
aspects of modeling. The importance of preliminary
inspection of raw data, useful transformations,
development of interaction terms, adjustment for
time-series effects, identification of outliers, correlation
studies, and model selection criteria is stressed. It is
recommended that the model selection process be
conducted using R2 and the Cp statistic as joint criteria.
The methodology is illustrated with actual data from
Huntington Beach, OH, in 2000-2004. Dynamic
modeling, as a new concept, is advanced for prediction
purposes, as beach bacteria MLR models are in fact
beach-specific and time-varying. This work also serves
as a statistical basis for US EPA's public domain
pathogen assessment software, Virtual Beach.

Example Virtual Beach input screen. Rows and
columns can be highlighted by the user to initiate
actions, such as omitting a variable (last column), or
by the program, for example, to show an identified
outlier case (first row).

Objectives

• To demonstrate multiple linear regression modeling
of E. coli concentrations

• To clarify some misunderstandings and pitfalls of
MLR modeling found in practice

• To promote the idea of dynamic modeling based on
a growing database

Table legend: a mark indicates that the variable is in the model; R2 is consistently around 48%.

The table shows how the best models change with time (additional data).
Although the variables that produce the best models change with time,
the R2 value remains steady, around 48%.

Model Selection

After outlier identification (not shown), model selection is accomplished
using backward elimination, a process facilitated by Virtual Beach.

Starting with the full model, each elimination of the poorest predictor
changes the Cp and BIC statistics. (BIC = Bayesian Information Criterion.)
The minima in these statistics help to identify the best models.
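
The elimination path can be sketched as follows. This is an illustrative sketch under stated assumptions, not the Virtual Beach code: the poorest predictor is taken to be the one with the smallest absolute t-statistic, Mallows' Cp is computed as SSE_p / s2_full - n + 2p (p counts the intercept), and BIC as n ln(SSE_p / n) + p ln(n). The function names and the synthetic data are hypothetical; the variable labels are borrowed from the poster for readability only.

```python
import numpy as np

def ols(A, y):
    """Return the residual sum of squares and the t-statistics of an OLS fit."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = float(resid @ resid)
    n, p = A.shape
    sigma2 = sse / (n - p)
    cov = sigma2 * np.linalg.inv(A.T @ A)
    t = beta / np.sqrt(np.diag(cov))
    return sse, t

def backward_elimination(X, y, names):
    """Drop the weakest predictor one at a time, recording Cp and BIC at each step."""
    n = len(y)
    intercept = np.ones((n, 1))
    active = list(range(X.shape[1]))
    sse_full, _ = ols(np.hstack([intercept, X]), y)
    s2_full = sse_full / (n - X.shape[1] - 1)  # error variance of the full model
    path = []
    while active:
        A = np.hstack([intercept, X[:, active]])
        sse, t = ols(A, y)
        p = A.shape[1]
        cp = sse / s2_full - n + 2 * p
        bic = n * np.log(sse / n) + p * np.log(n)
        path.append(([names[j] for j in active], cp, bic))
        weakest = int(np.argmin(np.abs(t[1:])))  # skip the intercept
        active.pop(weakest)
    return path

# Synthetic illustration only: two informative and two uninformative predictors.
rng = np.random.default_rng(3)
names = ["TB", "WH", "RW", "WD"]
X = rng.normal(size=(150, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=150)

path = backward_elimination(X, y, names)
best_by_bic = min(path, key=lambda row: row[2])
```

Each entry of `path` holds the variables still in the model with the corresponding Cp and BIC, so the minima mentioned above can be read off directly. By construction, Cp of the full model equals its number of parameters.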

[Figure: Cp and BIC plotted against the sequence of elimination]

Other data inspection issues include multicollinearity. This table
shows variance inflation factors (VIF) for each explanatory
variable.
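
For reference, the VIF of a predictor is 1 / (1 - R2_j), where R2_j comes from regressing that predictor on all the others. A minimal sketch follows; the function name and the synthetic columns are hypothetical and the values are not those of the poster's table.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X, from regressing it on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic illustration only: column c is nearly a copy of column a.
rng = np.random.default_rng(4)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
```

Values near 1 indicate little collinearity, while large values (here for the two nearly identical columns) signal that coefficient estimates will be unstable.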

The full model before backwards elimination is given here. The
graphic depicts the normality of the residuals.

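$$\mu\{\log(\mathrm{EC})\} = \beta_0 + \beta_1\,\mathrm{TB} + \beta_2\,\mathrm{WH} + \beta_3\,\mathrm{RW} + \beta_4\,\mathrm{WD} + \beta_5\,\mathrm{TB}\cdot\mathrm{WH} + \beta_6\,\mathrm{TB}\cdot\mathrm{RW} + \beta_7\,\mathrm{WH}\cdot\mathrm{RW}$$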
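
A sketch of how the eight-term full model could be assembled and fitted is given below. It assumes the predictors passed in are already transformed as desired; the helper names and the synthetic data are hypothetical, and the R2 values it produces have no connection to the 48% and 41% reported for Huntington Beach.

```python
import numpy as np

def full_design_matrix(tb, wh, rw, wd):
    """Columns: intercept, TB, WH, RW, WD, TB*WH, TB*RW, WH*RW."""
    tb, wh, rw, wd = (np.asarray(v, dtype=float) for v in (tb, wh, rw, wd))
    return np.column_stack(
        [np.ones_like(tb), tb, wh, rw, wd, tb * wh, tb * rw, wh * rw]
    )

def r_squared(A, y):
    """Coefficient of determination of the least-squares fit of y on A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - float(resid @ resid) / float(np.sum((y - y.mean()) ** 2))

# Synthetic illustration only; n matches the 247 days but the values are made up.
rng = np.random.default_rng(1)
n = 247
tb = rng.normal(size=n)
wh = rng.normal(size=n)
rw = rng.uniform(0.0, 2.0, size=n)
wd = rng.integers(0, 2, size=n)
log_ec = (1.0 + 0.6 * tb + 0.3 * wh + 0.4 * rw + 0.5 * wd
          + 0.4 * tb * wh + rng.normal(scale=0.8, size=n))

A_full = full_design_matrix(tb, wh, rw, wd)
r2_full = r_squared(A_full, log_ec)          # with interaction terms
r2_main = r_squared(A_full[:, :5], log_ec)   # main effects only
```

Because the main-effects model is nested in the full model, `r2_full` can never be smaller than `r2_main`; whether the gain justifies the extra terms is what the Cp criterion is meant to judge.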

Inspection of the scatter
plots shows that
transformation will help
equalize variances over the
range of values. Here are
two distributions after
transformation.
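
The two transformations named on the poster, a logarithm for turbidity and a square root for rainfall, can be sketched as below. The base-10 logarithm is an assumption (the poster does not state the base), and the function names and synthetic data are hypothetical.

```python
import numpy as np

def transform_predictors(turbidity, rainfall):
    """Return log10(turbidity) and sqrt(rainfall); turbidity must be positive."""
    tb = np.log10(np.asarray(turbidity, dtype=float))
    rw = np.sqrt(np.asarray(rainfall, dtype=float))
    return tb, rw

def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    v = np.asarray(values, dtype=float)
    centered = v - v.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

# Synthetic illustration only: right-skewed turbidity-like and rainfall-like values.
rng = np.random.default_rng(5)
raw_turbidity = rng.lognormal(mean=2.0, sigma=1.0, size=500)
raw_rainfall = rng.exponential(scale=1.0, size=500)

log_tb, sqrt_rw = transform_predictors(raw_turbidity, raw_rainfall)
skew_before = skewness(raw_turbidity)
skew_after = skewness(log_tb)
```

Comparing `skew_before` with `skew_after` gives a quick numerical check of what the scatter plots show visually: the transformed variable is far closer to symmetric.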

Including interaction terms
can substantially improve
model performance.

Wind direction was categorized into northerly
(WD=0) and southerly (WD=1) winds. This
change produces data histograms that exhibit
equal spreads.
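
The binary coding can be illustrated as follows. The poster does not state where the northerly/southerly boundary lies, so the cutoffs here are an assumption: directions strictly between 90 and 270 degrees are treated as southerly, and everything else (including exactly 90 and 270) as northerly. The function name is hypothetical.

```python
import numpy as np

def wind_indicator(direction_deg):
    """Return WD = 1 for southerly winds and WD = 0 for northerly winds.

    Assumed convention: direction is the compass bearing the wind blows from,
    in degrees, and 'southerly' means strictly between 90 and 270.
    """
    d = np.mod(np.asarray(direction_deg, dtype=float), 360.0)
    southerly = (d > 90.0) & (d < 270.0)
    return southerly.astype(int)

wd = wind_indicator([0, 45, 180, 200, 350, 360])
```

Reducing a circular variable to a two-level indicator avoids the discontinuity at 0/360 degrees that would otherwise distort a linear term.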
