Virtual Beach v 2.2 User Guide

Mike Cyterski, Mike Galvin, Kurt Wolfe, and Rajbir Parmar

Virtual Beach

Empirical Modeling Software for
Pathogen Indicators in Recreational Waters

U.S. Environmental Protection Agency
Office of Research and Development
National Exposure Research Laboratory
Ecosystems Research Division


-------
Table of Contents

1.	Introduction	4

1.1	On Predictive Modeling	4

1.2	Recommended User Background	4

1.3	History and Comparison of Version 2.2 to Earlier Versions	5

2.	Installation and Execution	8

2.1 Viewing this Documentation	8

3.	Operational Overview	9

4.	Project Management	10

5.	Beach Location Mapping Interface	11

5.1	Finding a Location	11

5.2	Defining the Beach Orientation	12

5.3	Finding nearby Water Quality, Flow, and Climate Information Sources	13

5.4	Saving Beach Information in a Project File	15

6.	Data Processing	16

6.1	Data Requirements and Considerations	16

6.2	Importing a Dataset	17

6.3	Validating the Imported Data	18

6.4	Working with a Dataset Post-Validation	20

6.5	Computing Alongshore and Onshore/Offshore Wind, Wave and Current Components	23

Notes on wind, wave and current component calculations:	24

6.6	Creation of New Independent Variables	27

6.7	Transforming the Independent Variables	29

Plotting Transformed IVs	32

Notes on Transformed IVs	32

6.8	Saving Processed Data	34

6.9	Go to Modeling	34

7.	Modeling	35

7.1	Selecting Variables for Model Building	35

7.2	Modeling Control Options	35

7.3	Linear Regression Modeling Methods	38

7.4	Using the Genetic Algorithm	40

7.5	Evaluating Model Output	41

7.6	Viewing X-Y Scatterplots	46

7.7	ROC Curves	47

7.8	Cross-Validation	48

7.9	Report Generation	49

8.	Residual Analysis	51

Viewing the Data Table	55

9.	Prediction	57

9.1	Model Statement	57

9.2	Model Evaluation Thresholds	57

9.3	Prediction Form	58

9.4	Viewing Plots	62

9.5	Prediction Form Manipulation	63

10.	Future Enhancements	63

11.	User Feedback	63

12.	Acknowledgments	63



-------
List of Figures

Figure 1. The five major component tabs of VB 2.2	5

Figure 2. Beach Location interface	11

Figure 3. Beach Location tab controls and their function	12

Figure 4. Adding shoreline and water markers to define beach orientation	13

Figure 5. NOAA/NCDC station marker showing station ID information	14

Figure 6. USGS/NWIS station marker showing station ID information	14

Figure 7. Beach Location interface showing station markers	15

Figure 8. Importing a dataset into the Data Processing tab	17

Figure 9. Data validation required to begin data processing	18

Figure 10. Context-sensitive choices for the "Take Action Within" drop-down menu	19

Figure 11. Post-validation enabling of the Data Processing functionality	20

Figure 12. Right-click options on columns that are not the response variable	21

Figure 13. Four different plots available for evaluation of IVs	21

Figure 14. Disabling an observation from within the XY scatterplot	22

Figure 15. Available choices when right-clicking the current response variable	23

Figure 16. Window for computation of alongshore and offshore/onshore components	24

Figure 17. A and O component definitions for wind, current, and wave data	25

Figure 18. Principal beach orientations given in degrees	26

Figure 19. Window for the formulation of "Manipulates"	27

Figure 20. Creation of a new IV defined as the mean of two existent IVs	28

Figure 21. Formation of two-way cross-products of a set of four existent IVs	29

Figure 22. The range of choices for IV transformations	30

Figure 23. Pearson correlation coefficient scores for judging the efficacy of IV transformations	31

Figure 24. Scatterplots (Response vs. IV) for six different data transformations of a single IV	32

Figure 25. Selecting variables for MLR processing within the Modeling tab	35

Figure 26. Setting modeling options within the Modeling interface	36

Figure 27. Setting evaluation thresholds and threshold transformation information	37

Figure 28. Model building interface	39

Figure 29. Using the IV filter to select a subset of variables from the best-fit models	40

Figure 30. Genetic algorithm options within the modeling interface	41

Figure 31. Modeling results shown after completion of an exhaustive regression run	42

Figure 32. Modeling Interface showing variable statistics for the selected Best-Fit model	43

Figure 33. Modeling interface showing model evaluation metrics for the selected Best-Fit model	43

Figure 34. Modeling interface showing a time series plot for the selected model	44

Figure 35. An XY scatter plot of observed versus predicted values for the selected model	45

Figure 36. The ROC curves and AUC table for the Best Fit models	46

Figure 37. The cross-validation results for each of the 10 best-fit models	48

Figure 38. A text report generated on the modeling results	49

Figure 39. Plots of the various model evaluation metrics for the 10 best-fit models	50

Figure 40. Scaled versus un-scaled views of selected model evaluation criterion	50

Figure 41. Information available on the Residuals tab	51

Figure 42. Plot of studentized predictions vs. residuals and the A-D test of normality	52

Figure 43. A table and plot of the DFFITS scores for the residuals	53

Figure 44. DFFITS/Cook's Distance controls for removing highly influential data points	54

Figure 45. "View Data Table" window	55

Figure 46. Observed vs. Predicted plot on the Residual tab	56

Figure 47. Residuals interface showing a list of rebuilt models	56

Figure 48. The MLR Prediction interface	58

Figure 49. Importation of IV data using the "Column Mapper" window	59

Figure 50. Importation of observational data using the "Column Mapper" window	59

Figure 51. The IV validation window on the MLR Prediction tab	60

Figure 52. A prediction grid after IVs and observational data have been imported	61

Figure 53. Prediction interface plotting of the observations versus predictions	62



-------
1. INTRODUCTION

Virtual Beach version 2.2 (VB 2.2) is a decision support tool. It is designed to
construct site-specific Multiple Linear Regression (MLR) models to predict pathogen
indicator levels (or fecal indicator bacteria, FIB) at recreational beaches. MLR analysis
has outperformed persistence models (which use the most recent FIB concentration as the
sole predictor of the next FIB concentration, i.e., y_t = y_(t-1)) at beaches where
conditions, such as weather, water conditions, and human and animal traffic levels,
change significantly from day to day (Frick, Ge et al. 2008).
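
The persistence model mentioned above can be illustrated with a minimal Python sketch. This is not part of VB 2.2; the function names and the sample log10 FIB values are hypothetical and serve only to show what "y_t = y_(t-1)" means in practice.

```python
def persistence_forecast(fib_series):
    """One-step-ahead persistence predictions: the forecast for day t is the value on day t-1."""
    # The first observation has no predecessor, so forecasts cover days 2..n only.
    return list(fib_series[:-1])


def mean_squared_error(observed, predicted):
    """Average squared difference between paired observations and predictions."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical log10 FIB concentrations on five consecutive sampling days.
log_fib = [1.45, 0.87, 0.80, 1.74, 1.03]

preds = persistence_forecast(log_fib)          # forecasts for days 2..5
mse = mean_squared_error(log_fib[1:], preds)   # error of the persistence model
```

An MLR model would be judged against this kind of baseline: if its prediction error is not smaller than the persistence error, the added IVs are contributing little.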

1.1	On Predictive Modeling

In any predictive modeling endeavor, variability and uncertainty are always
associated with model output, arising from a variety of sources that are impossible to
eradicate completely from the modeling exercise. Virtual Beach 2.2 attempts to be
forthright about this fact by issuing a probability of exceedance for any regulatory
standard that the user wishes to investigate. Even so, there is no guarantee that every
model prediction will be correct, and a prediction that water quality is good enough
for public recreation might be erroneous. Decisions to allow or not allow swimming at
beaches must be made, however, and in the best-case scenarios the regression models
developed with Virtual Beach 2.2 will outperform less rigorous predictive efforts.

1.2	Recommended User Background

Virtual Beach 2.2 is our attempt to create a decision support software tool that
will assist someone with little statistical knowledge in developing a multiple linear
regression model based on their available data. Some familiarity with regression
modeling and residual analysis will no doubt benefit a VB 2.2 user, although we believe
that, after only a few sessions, someone with very little background in statistics can
produce defensible regression models using VB 2.2. We note that these MLR models, or
any other statistical models, will only be as effective as the data used to develop them.
No statistician, however skilled, can turn a dataset filled with worthless independent
variables (i.e., IVs) into a useful predictive device.

VB 2.2 has five major components:

•	Beach location map interface where users can locate their site, define the
orientation of the beach, and examine nearby potential data sources.

•	Data processing spreadsheet interface that facilitates the import and manipulation
of MLR model variable data.

•	Modeling interface presenting options for performing MLR analyses.

•	Residuals component to examine regression residuals, allow optional elimination
of highly influential data records, and perform recalculation of the regression
model.

•	Prediction interface allowing entry of new data and subsequent estimation of
pathogen indicator levels using a selected MLR model.

Each component is accessible from the application's main window via selectable
tabs. The Beach Location and Data Processing tabs are always visible, the Modeling tab
becomes visible once the input data have been validated, and the Residuals and MLR
Prediction tabs appear when model-building is complete and a model is selected.

Figure 1. The five major component tabs of VB 2.2 - the modeling tab is currently active

1.3	History and Comparison of Version 2.2 to Earlier Versions

Virtual Beach 2.2 is derived from the Virtual Beach Model Builder application
(VB1.0 - also known as Virtual Beach v1.0) developed by Walter Frick and Zhongfu Ge.
VB1.0 can be characterized as a MLR model-building tool that supports a primarily
manual analysis of data sets via visual inspection of data plots and manipulation of
variables (e.g., transformations, creating interaction terms), followed by an iterative
process of testing, comparing and evaluating models. The fitness of developed models is
computed and tracked, allowing for comparison and eventual selection of a "best" model
for the dataset under consideration. This model can then produce estimates of pathogen
indicator levels using current or forecasted environmental data from the site.

VB 2.2 enhances the functionality of its predecessor, performing similar functions
(visual inspection of univariate data plots, manual transformations of individual variables,
MLR model building, prediction, etc.), but also automating and extending functionality in
several ways:

•	The Map component provides users with information on the location and
availability of local data sources (NWIS/NCDC data) through the map interface.
These sources can provide recently collected and/or forecasted data for generating
predictions by a chosen MLR model.

•	The Map component provides a convenient method for defining beach orientation
by overlaying the beach on current shoreline layers (satellite images, Google
Maps, MS Virtual Earth, etc.). Given this orientation, VB 2.2 can calculate wind,
wave, or current components (the A component is parallel to shore and the O
component is perpendicular to shore), which can be important predictor variables.

•	Although manual processing and analysis of imported data (visual inspection of
univariate data plots and the transformations/interactions of variables) has been
retained, the Data Processing component of VB 2.2 provides automated
generation of all possible 2nd-order interaction terms amongst a set of IVs,
formation of more complex functions of multiple columns, and automated testing
of a suite of variable transformations for improved model linearity. This
functionality increases the number of models to evaluate during later selection
routines and removes the burden of manual assessment placed on users of VB1.0.

•	Multi-collinearity amongst predictor variables is handled automatically in the
Model Building component. Any model containing an IV with a high degree of
correlation with other IVs (as measured by a large Variance Inflation Factor
[VIF]) is removed from consideration during model selection. The VIF threshold
is user-defined with a default value of 5.

•	During model selection, MLR models are ranked by a user-selected evaluation
criterion. Possible criteria include R², Adjusted R², Akaike Information Criterion
(AIC), Corrected AIC, Predicted Error Sum of Squares (PRESS), Bayes
Information Criterion (BIC), Accuracy, Sensitivity, Specificity, or the model's
Root Mean Square Error (RMSE). Regardless of which criterion is chosen, the
software records the ten best models in terms of that criterion. In comparison,
VB1.0 had only a single comparative criterion, Mallows' Cp.

•	As the number of IVs in a dataset increases, the number of possible MLR models
increases exponentially (considering transforms/interactions), resulting in
trillions of possible models from a modest number (12-13) of IVs. VB 2.2
implements a Genetic Algorithm (GA) that effectively and efficiently searches for
the best possible MLR model. Alternatively, VB 2.2 users can perform an
exhaustive calculation in which all possible combinations of IVs are tested if the
number of possible models is reasonably small (circa 100,000). Both the GA and
exhaustive approaches greatly expand the model-building capabilities of VB 2.2,
compared to VB1.0.

•	Users no longer have to enter data values in transformed, interacted, or
component-decomposed form to make a prediction with a chosen MLR model.
On the VB 2.2 MLR Prediction tab, a user-selected model is coded into an input
grid with data entry columns matching the model's main effects. Any
mathematical manipulation of these IVs is then automatically performed prior to
making predictions.
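
The multi-collinearity screen described in the list above can be illustrated with a small Python sketch. It is not VB 2.2's implementation: it covers only the special case of two IVs, where the VIF reduces to 1 / (1 - r²) with r the Pearson correlation between them, and the function names and sample values are hypothetical. The threshold of 5 is the default stated above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two IVs: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


MAX_VIF = 5.0  # default threshold quoted in this guide

# Hypothetical IVs that are almost proportional to one another.
wave_height = [0.2, 0.5, 0.9, 1.4, 1.8]
wind_speed = [2.0, 5.5, 8.5, 14.5, 17.5]

vif = vif_two_predictors(wave_height, wind_speed)
rejected = vif > MAX_VIF  # such a model would be dropped from consideration
```

With more than two IVs, each IV is regressed on all of the others and the resulting R² replaces r² in the same formula.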



-------
2. INSTALLATION AND EXECUTION

VB 2.2 was developed with MS Visual Studio 2010 and is written in C#. It uses multiple
public domain system components (Weifen Luo Docking UI, ZedGraph, and GMap.Net)
and a single licensed statistical library (Extreme Optimization). No license or
software purchase is required by the user to install and run the application, but an internet
connection is required to display maps. Users must have the Microsoft Windows XP or
Windows 7 OS with the DotNet Framework 4.0 to ensure proper installation and operation.
Assorted errors have occurred when running under the Windows Vista OS. Certain VB 2.2
data manipulation and model-building operations are computationally intensive, so faster
CPUs are better, but most new laptops or desktop systems will be adequate. Disk space
requirements are modest (less than 5 MB) if the DotNet Framework is installed; if not,
the Framework installer requires ~175 MB of disk space. The VB 2.2 application
installer will attempt to download and install the DotNet Framework 4.0 if it is not
installed on the target system; this also requires a network connection. If necessary, a
user can freely obtain the DotNet Framework 4 installer at:

http://www.microsoft.com/download/en/details.aspx?id=17851

The EPA's Center for Exposure Assessment Modeling (CEAM) web site
distributes VB 2.2 at:

http://www.epa.gov/ceampubl/swater/vb2/index.html

Obtain and initiate execution of the VB 2.2 application installer and follow the on-screen
instructions. The VB 2.2 application installer can be found at:

https://iemhub.org/resources/vbmb2 for iemHub Virtual Beach Group members;
https://iemhub.org/groups/virtualbeach/join to request Group member access.

Finally, the software can be obtained by request (see the contacts list in the
Feedback section at the end of this document). After installation, a shortcut will appear
on your desktop to start the software.

2.1 Viewing this Documentation

Virtual Beach's User Guide can be accessed within the software via the top-level
Help -> User Guide menu selection or in a context-sensitive fashion via the F1 key.
Invoking F1 will launch Adobe Acrobat or Adobe Reader (if installed) and open the User
Guide to the appropriate page. Note that if the Guide is already open, the F1 key will
have no effect; users must close Reader (or Acrobat) for F1 to launch and open to the
correct page. Alternatively, if the Guide is already open, users can navigate to the area
of interest via the Table of Contents. The User Guide (Virtual_Beach_2_User_Guide.pdf)
can also be opened independently of program operation; it resides within the
Documentation folder of the program's installation folder.



-------
3. OPERATIONAL OVERVIEW

Virtual Beach 2.2 is simple to operate. Its functionality is divided into five functions,
each with its own component or interface:

Beach Location - a mapping tab that provides a basis for generating orthogonal
(alongshore and offshore/onshore) wind, current, and/or wave components for
the beach under consideration; its use is optional. Such components can be powerful
predictors of pathogen indicator levels at the beach, so using the beach definition
component is recommended if the dataset under consideration contains wind, wave or
current data. This tab is also useful for locating nearby NWIS/NCDC climate and water
quality data sources for a specific location.

Data Processing - a spreadsheet tab to support data manipulation procedures on an
imported dataset. In addition to wind/current/wave component generation, users can
generate new independent variables that represent the products, means, sums, minimums,
and maximums of other IVs, as well as common data transformations for the IVs.
Statistical indicators help users select the best IV transformations in MLR model-
building.

Modeling - this tab allows selection of any eligible IVs for consideration in MLR model-
building and model-generation. Model-generation is accommodated by user-selected
model evaluation criteria and automatic generation of the ten best-fit models from a
search in which all possible combinations of predictor variables are tested, or via a
heuristic searching algorithm (the Genetic Algorithm or GA). Regression fit and model
variable statistics are generated to help evaluate the usefulness of predictive variables and
overall fit. Time series and XY scatter plots, as well as reports on best-fit models, can be
viewed and/or saved for further analysis and recording.

Residual Analysis - this tab displays plots of a model's regression residuals, including
their normality statistics, and provides means to eliminate highly influential data records
and recalculate the regression model. Altered data sets can be exported for external use
and rebuilt models can be selected for the prediction tab.

Prediction - this tab comprises three grids where users can enter or import the
needed IVs for the chosen model, enter or import observations that will be compared to
model predictions, and examine model predictions and exceedance probabilities. Time
series and XY scatter plots of observations versus predictions are shown to help users
gauge model effectiveness.
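
The Modeling tab's choice between testing all possible combinations of predictor variables and using the Genetic Algorithm depends on how many candidate models exist. A minimal Python sketch of that count follows. It assumes that every non-empty subset of IVs is one candidate model, optionally capped at a maximum number of variables per model; the function names are hypothetical, and the 100,000 cutoff is the "circa 100,000" figure from Section 1.3 treated here as a rough rule of thumb rather than a limit enforced by VB 2.2.

```python
from math import comb


def count_models(n_ivs, max_vars=None):
    """Number of non-empty IV subsets, optionally capped at max_vars IVs per model."""
    if max_vars is None:
        max_vars = n_ivs
    return sum(comb(n_ivs, k) for k in range(1, max_vars + 1))


EXHAUSTIVE_LIMIT = 100_000  # rough rule of thumb, assumed for illustration


def suggested_search(n_ivs, max_vars=None):
    """Suggest a search strategy based on the number of candidate models."""
    if count_models(n_ivs, max_vars) <= EXHAUSTIVE_LIMIT:
        return "exhaustive"
    return "genetic algorithm"
```

Under these assumptions, 7 IVs give 127 candidate models and 16 IVs give 65,535, both small enough for an exhaustive run, while 17 IVs (131,071 models) already exceed the cutoff.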



-------
4. PROJECT MANAGEMENT

Oftentimes the user will put an imported dataset through lengthy pre-processing
to prepare it for analysis. To avoid repeating all of this work, "project" files can be saved
and re-opened via the Project -> Save and Project -> Open menu selections. Subsequent
opening of a saved project file will load the processed data sheet and information on the
Beach Location tab, including the beach orientation if the user had defined it. However,
no modeling information is saved inside a project file.

In addition to project files, "model" files can be opened and saved using choices
under the "Model" menu at the top of the VB 2.2 interface. A model file contains
information on the IVs, regression parameters, and other metadata for the currently
selected model in the Modeling, Residual, or MLR Prediction tab. Whenever a model
file is saved, VB 2.2 will prompt the user to enter a Decision Criterion (DC), Regulatory
Standard (RS) and Threshold Transformation for the model. These parameters will be
used as initial values (they can be changed when the model file is opened) for later
calculations of model sensitivity and specificity, which depend on the numbers of false
negative and false positive model predictions (see Sections 7.6 and 7.7).
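
To make the roles of the Decision Criterion and Regulatory Standard concrete, the following Python sketch counts false negatives and false positives and derives sensitivity and specificity from them. It is an illustration only: the function names and sample values are hypothetical, and the use of strict ">" comparisons is an assumption rather than a documented VB 2.2 behavior.

```python
def classify(observed, predicted, regulatory_standard, decision_criterion):
    """Count true/false positives and negatives for paired observations and predictions."""
    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        exceeds = obs > regulatory_standard   # the water actually exceeded the standard
        flagged = pred > decision_criterion   # the model predicted an exceedance
        if exceeds and flagged:
            tp += 1
        elif exceeds:
            fn += 1   # false negative: a missed exceedance
        elif flagged:
            fp += 1   # false positive: an unnecessary advisory
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of actual exceedances that the model flagged."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


def specificity(tn, fp):
    """Fraction of actual non-exceedances that the model left unflagged."""
    return tn / (tn + fp) if (tn + fp) else float("nan")


# Hypothetical untransformed concentrations and thresholds.
observed = [300, 120, 250, 90, 400, 60]
predicted = [280, 240, 200, 80, 500, 100]
tp, fp, tn, fn = classify(observed, predicted,
                          regulatory_standard=235, decision_criterion=235)
```

Lowering the Decision Criterion below the Regulatory Standard trades specificity for sensitivity, which is why both values are stored with the model file and can be adjusted later.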

When users open a previously saved model file from within VB 2.2, they are
taken directly to the MLR Prediction tab where they can use the saved model to generate
predictions. Model files are designed for situations where a statistically-savvy developer
is charged with developing regression models for a number of beach sites. After the
developer chooses a "best" model for a site, the model file can be saved and then
delivered to the beach manager who will not use VB 2.2 for full-scale model
development, but only to input new data, generate predictions, and make decisions
regarding swimming advisories.



-------
5. BEACH LOCATION MAPPING INTERFACE

On VB 2.2 application startup, the map interface is shown, but users can go
directly to the Data Processing tab if desired.

Figure 2. Beach Location interface

Zoom Slider - drag the slider up and down to zoom in and out, respectively.

Map Controls - add a Lat/Long and click the "GoToLat/Long" button, or enter
a Place and click "GoToPlace."

Map Settings - select a map type from the dropdown menu to change the
display in the map window.

Beach Orientation - use the buttons to add or remove markers on the map.
Once the beach shoreline is delineated by placing the 1st and 2nd beach
markers, click in the water and then click "Add Water Marker," which will
place the computed orientation angle into the "Beach Orientation" box.

Show Station Locations - if zoomed in enough, select a station type and
then click "Show Station Locations" to display such stations on the map.

Current Location - click anywhere on the map to display that point's Lat
and Long.

Loading - map loading progress bar that shows network download
activity for map images.

Figure 3. Beach Location tab controls and their function

5.2 Defining the Beach Orientation

Map control allows delineation of a beach on the map to ascertain its orientation,
which is useful if wind, wave, and/or current flow components are to be used in MLR
model-building. Maps, as opposed to satellite or hybrid images, provide less shoreline
detail so it is recommended that the map setting type use a hybrid or satellite image prior
to adding point locations that define beach boundaries. Once displayed, click on the map
(a red marker will appear) and select the "Add 1st Beach Marker" button; this represents
the first point of the extent of your beach shoreline. Repeat this for the second beach
marker and click on the map to indicate which side of the shoreline represents the water;
then hit the "Add Water Marker" button. Marker points will turn green as you add them.
Once the water marker is added, a shaded box (the beach) appears and the computed
orientation angle will be displayed.


Figure 4. Adding shoreline and water markers to define beach orientation

Points can be added or removed until the user is satisfied with the beach
representation. To use the computed beach orientation on the component calculation
screen of the Data Processing tab (see Section 6.5), users can either save and then
re-open a project file, or note the beach orientation on the mapping screen and
manually enter that angle on the component calculation screen.
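
The reason the orientation angle matters can be shown with a minimal Python sketch that splits a wind, wave, or current vector into a shore-parallel (A) and a shore-perpendicular (O) part. The sketch rests on assumptions that are not taken from VB 2.2: angles are in degrees, the direction and the beach orientation are measured in the same reference frame, and the sign conventions are arbitrary. The definitions VB 2.2 actually uses are given in Section 6.5, and the function name here is hypothetical.

```python
import math


def along_onshore_components(speed, direction_deg, beach_orientation_deg):
    """Split a vector into alongshore (A) and onshore/offshore (O) parts.

    Assumed convention: the angle between the vector and the shoreline is
    direction_deg - beach_orientation_deg; the signs of A and O are illustrative.
    """
    theta = math.radians(direction_deg - beach_orientation_deg)
    a = speed * math.cos(theta)  # component parallel to the shoreline
    o = speed * math.sin(theta)  # component perpendicular to the shoreline
    return a, o


# A vector aligned with the shoreline is entirely alongshore.
a_parallel, o_parallel = along_onshore_components(10.0, 0.0, 0.0)

# A vector at right angles to the shoreline is entirely onshore/offshore.
a_perp, o_perp = along_onshore_components(10.0, 90.0, 0.0)
```

Whatever the convention, the two components are orthogonal, so A² + O² always equals the squared speed; an incorrect orientation angle changes how the speed is shared between them.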

5.3 Finding nearby Water Quality, Flow, and Climate Information Sources

Possible nearby data sources for the area of interest may be located and displayed
on the map. USGS NWIS and NOAA NCDC station markers within a zoomed-in map area
can be displayed by checking the appropriate items in the map window and
clicking the "Show Station Locations" button. Note that the "Show Station Locations"
button is only enabled when zoomed in to an appropriate level (e.g., zoom level three as
measured from the top of the zoom control slider). If stations in either of the selected
categories (NWIS and/or NCDC; the STORET station category, although present on the
control, is not yet functional) are present within the map display area, they will appear.
Also note that the network server that produces NCDC station locations restricts location
requests to one every 30 seconds; an error message will be displayed if a subsequent
request is made before that wait time has elapsed. Once station location markers are
displayed on the map, hovering over the top-left corner of any station marker will
display station ID information. With that information, users can visit the appropriate
web address to gather water/weather data for the area of interest.

Figure 5. NOAA/NCDC station marker showing station ID information

Station ID: USGS-Q2Z17890
Station Name: NORTH OCONEE RIVER AT US 78, AT ATHENS, GA

Figure 6. USGS/NWIS station marker showing station ID information

USGS NWIS web site URL: http://waterdata.usgs.gov/nwis/inventory

NOAA NCDC web site URL: http://www.ncdc.noaa.gov/oa/climate/stationlocator.html


Figure 7. Beach Location interface showing station markers near Gary, Indiana

5.4 Saving Beach Information in a Project File

Use the Project -> Save menu bar selection to open a Save File dialog and save
the project information to disk. Beach marker and angle information is saved under the
file name provided; the file can be saved anywhere, but using the "Project Files" folder
(found in the VB 2.2 root install folder) is recommended.



-------
6. DATA PROCESSING

6.1 Data Requirements and Considerations

VB 2.2 accepts files from Excel 2007 or earlier (Excel 2010 is not currently
supported), as well as comma-separated-value (CSV) text files. Input data must conform
to certain standards:

•	The first row of any data column must be a header with the IV's name. For best
operation of the software, the column name should be composed of letters,
numbers (don't begin the column name with a number), and/or underscores.
Other characters in column names can cause problems.

•	The first (left-most) column of the dataset must be identification for the
observations, typically a date or time stamp that indicates when the observation
was collected. The only requirement is that each row MUST have a unique ID.
VB 2.2 will not import datasets with non-unique IDs in the first column. If the
first column is a time stamp, VB 2.2's plotting functions will work best if the
column is in chronological order, from earliest to most recent observations.

•	The second column of the dataset will initially be set as the dependent or response
variable; however, this can be changed after data are imported. Any subsequent
columns will be considered to be IVs.

•	Variable measurement units are not considered, but certainly affect predictions.
Make sure any data used for predictions are in the same units as those used to
build the models; for example, do not build a MLR model with water temperature
in degrees Fahrenheit, then later import water temperature in degrees Celsius for
predictions. It is prudent to include unit information in the column names (e.g.,
WaterTempC) to remind the user of the proper units when making predictions.

•	Missing data (blank cells) are permitted on import, but must be dealt with in Data
Processing prior to modeling.

•	If present in the imported Excel data sheet (other than in column names or the
first ID column), cells with non-numeric values (i.e., symbols or text) are turned
into empty cells. If such non-numeric characters are present in an imported .csv
file, they will be imported to the data grid, but will be recognized as anomalous
data during the required validation scan and will have to be dealt with (deleted or
turned into a numeric value) at that time.

•	VB 2.2 recognizes any column of data with only two different values as
categorical. If you have a column of categorical data with more than two values,
you can designate it as categorical, using methods described below. The
ramification of a variable being identified as categorical is that VB 2.2 leaves it
out of transformation processes.

•	There is no hard-coded limit on the number of IV columns one can import;
however, a practical limit exists that depends on system processing resources.
There is also an inherent limit: documentation indicates that the grid components
used in the application are designed for a maximum of 300 columns before
performance issues degrade the application. Modeling 250+ columns of data
presents circa 2 x 10^20 possible data combinations for MLR processing. The
Genetic Algorithm handles this modeling task, but choosing "Run all
combinations" would likely take an immense amount of time to complete.
Depending on how many additional IVs will be created by the user, importing a
dataset with fewer than 100 IVs should be acceptable.
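
Several of the requirements listed above can be checked before import. The Python sketch below is one way to do so for a CSV file; it is not part of VB 2.2, and the function name, the regular expression, and the sample data are all hypothetical. It checks column names, the uniqueness of IDs in the first column, and which columns hold exactly two distinct values and would therefore be treated as categorical.

```python
import csv
import io
import re

# Letters, digits, and underscores only, not starting with a digit.
NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def check_dataset(csv_text):
    """Return (problems, categorical_columns) for CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    problems = []

    for name in header:
        if not NAME_RE.match(name):
            problems.append(f"bad column name: {name!r}")

    ids = [row[0] for row in body]
    if len(set(ids)) != len(ids):
        problems.append("non-unique IDs in first column")

    # Every column after the ID column with exactly two distinct non-blank values.
    categorical = []
    for j, name in enumerate(header[1:], start=1):
        values = {row[j].strip() for row in body if row[j].strip() != ""}
        if len(values) == 2:
            categorical.append(name)

    return problems, categorical


sample = (
    "tstamp,LogCFU,WaterTempC,Rain\n"
    "1,1.45,20.1,0\n"
    "2,0.87,21.3,1\n"
    "3,0.80,19.8,0\n"
)
problems, categorical = check_dataset(sample)
```

Including the unit in a column name, as in WaterTempC, also passes the name check and follows the recommendation above.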

6.2 Importing a Dataset

When users first click on the Data Processing tab, they open a dataset using the
"Import" button. This brings up a dialog screen where a directory explorer can be used to
find the data file and open it. If the dataset is an Excel file with multiple sheets, a dialog
box opens to ask the user which to import.


Figure 8. Importing a dataset into the Data Processing tab

Once imported, the data grid is shown as a spreadsheet on the right. The second
column of the spreadsheet will be highlighted in blue to indicate its status as the current
response variable. Information about the dataset, such as number of rows and columns,
name of the ID column and name of the response variable, appear on the left. At this
point the grid cannot be edited or interacted with in any manner. To access additional
processing functionality, the data must be validated.

17


-------
6.3 Validating the Imported Data

The "Validate" options window can be accessed by clicking the "Validate" button
at the top of the Data Processing tab. This window primarily launches a required data
scan to identify blank and non-numeric data cells in the imported spreadsheet. However,
one can also find and replace other specified values (e.g., a missing data tag like -999) in
the dataset using the "(Optional) Find:" input box.

Figure 9. Data validation required to begin data processing

To validate the data, the user clicks "Scan." VB then goes through the
spreadsheet, cell by cell, looking for blanks, non-numeric, or user-specified values
entered in the "Find:" input box. If one of these types of cells is found, the scan will stop
to highlight that cell. Users must decide how to deal with the cell using choices in the
"Action" section: they can replace the bad cell with a specified value, using the "Replace
With:" input box, or they can delete the row or column containing the bad cell. The user
must decide where to implement the chosen action with the "Take Action Within" menu.
Possible choices are "Only this Cell," "Only this Row," "Only this Column," "Entire
Row," "Entire Column," and "Entire Sheet." Items in this menu are context-sensitive,
i.e., they change depending on which Action is selected. This setup gives the user
flexibility, for example, to delete all rows containing missing values within one specific
column of data (Action would be "Delete Row" taken within the "Entire Column"), and
replace all missing values with a user-specified numeric value within another column of
data (Action would be "Replace With:" taken within "Entire Column"). The cell, row,
and column reference will always refer to the highlighted cell. After setting the "Take
Action Within" menu, the user clicks the "Take Action" button, VB 2.2 makes the
specified changes to the spreadsheet, and the scan continues. When the entire
spreadsheet has been scanned and all bad cells have been fixed, VB 2.2 reports that "no
anomalous data have been found," and the user can click the "Return" button to close the
Scan window.
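
The scan-and-fix workflow described above can be illustrated with a small sketch. This is not VB 2.2's code; the function names, the list-of-strings data layout, and the choice to skip the first (ID) column are assumptions made for illustration only.

```python
def find_bad_cells(rows, find_value=None):
    """Return (row, col) positions of blank, non-numeric, or flagged cells.

    `rows` holds the data rows as lists of raw cell strings (no header row).
    Column 0 is assumed to be the ID column and is not scanned.
    """
    bad = []
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if c == 0:
                continue  # assumed: the ID column is skipped
            try:
                value = float(str(cell).strip())  # ValueError for blanks/text
            except ValueError:
                bad.append((r, c))
                continue
            if find_value is not None and value == find_value:
                bad.append((r, c))  # user-specified tag such as -999
    return bad


def replace_in_column(rows, col, bad_cells, new_value):
    """Mimic "Replace With:" taken within "Entire Column"."""
    for r, c in bad_cells:
        if c == col:
            rows[r][c] = str(new_value)


def delete_rows_for_column(rows, col, bad_cells):
    """Mimic "Delete Row" taken within "Entire Column"."""
    doomed = {r for r, c in bad_cells if c == col}
    return [row for r, row in enumerate(rows) if r not in doomed]
```

In this sketch, deleting rows for one column and replacing values in another are separate calls, which mirrors the per-column flexibility the "Take Action Within" menu is described as providing.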

As stated earlier, VB 2.2 will not attempt to transform categorical data columns.
It automatically identifies columns with only two unique values as categorical, but if the
user has other categorical IVs with more than two categories, those should be identified
to VB 2.2 by the "Identify Categorical Variables" button.
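
As a rough illustration of the categorical rule just described, the sketch below treats a column as categorical when it has exactly two distinct values or when the user has flagged it. The function names and the dictionary layout are hypothetical, not part of VB 2.2.

```python
def is_categorical(values, user_flagged=False):
    """True for a two-valued column or one the user has flagged."""
    return user_flagged or len(set(values)) == 2


def transformable_columns(columns, user_flagged=()):
    """Names of columns that would still be offered for transformation."""
    return [name for name, values in columns.items()
            if not is_categorical(values, name in user_flagged)]
```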

Figure 10. Context-sensitive choices for the "Take Action Within" drop-down menu

19


-------
6.4 Working with a Dataset Post-Validation

After the dataset has passed the validation scan, the function buttons across the
top of the Data Processing tab are enabled.

Figure 11. Post-validation enabling of the Data Processing functionality

At this point, the grid cells (other than the ID column) are editable - that is, users
can manually enter new numeric data into the cells by double-clicking on a cell and
typing in a new value. VB 2.2 does not allow blank cells or non-numeric data in cells.
Additionally, a right mouse-click on an IV column header presents options:

20


-------










Figure 12. Right-click options on columns that are not the response variable

"Disable Column" turns the column's text red and prevents the column from being
passed to the Modeling tab of VB. Previously-disabled columns can be activated using
"Enable Column." "Set Response Variable" will make that IV the new response variable
and it becomes blue as a visual indication of this change. "View Plots" shows a new
screen with column statistics at the far left and four plots for that IV: (1) a scatterplot of
the IV versus the response variable in the upper left panel, (2) a plot of the IV values
versus the ID column at the upper right (a time series plot if the ID is an observation
date), (3) a box-and-whiskers plot at the bottom left, and (4) a histogram for the IV at the
bottom right.

Figure 13. Four different plots available for evaluation of IVs

21


-------
The scatter plot (upper left) is probably the most-examined, as it can indicate a
non-linear relationship between the IV and the response variable, problems with
homogeneity of variance across the range of the IV, or outliers. Ensuring that the IVs are
linearly related to the response variable raises the probability of producing a robust,
meaningful analysis. If the relationship between the response and the IV is not well-
approximated by a straight line (a fundamental assumption of MLR), it may be beneficial
to transform the IV. Using VB 2.2 to accomplish this will be explained later in this
document. The scatterplot also shows the best-fit regression line in red, along with the
correlation coefficient ("r") and the significance (p-value) of the correlation coefficient at
the top of the plot. For the most part, p-values below 0.05 are considered statistically
significant.

Identifying odd values (potential outliers or bad data) of any IV can often be done by
visually inspecting these plots. If users double-click on the data point marker for any
observation in one of the top panels or the bottom left panel (i.e., not the histogram), they
can disable that point (the row) in the data grid.

The final choice, "Delete Column," deletes a column from the data grid, but the
original columns of the imported data sheet (VB 2.2 thinks of these as "main effects")
cannot be deleted. Rows can be disabled and enabled, but not deleted, from the data grid
by right-clicking the row header (far left of each row) and making the desired choice.

If the user right-clicks on the column header of the response variable, a different
set of choices is shown:

22


-------


Figure 15. Available choices when right-clicking the current response variable

Users can transform the response variable in three ways: log10, loge, or a power
transformation (raising the response to an exponent). They can also un-transform the
response, view the plots shown previously for the IVs, or define a transformation of the
response variable. This last option is used when a datasheet is imported with an
already-transformed response variable. For example, users could import a datasheet with
log10-transformed fecal indicator bacteria levels and then define the response as being
log10-transformed. Doing this facilitates later comparisons with observations, decision criteria,
and regulatory standards. When users transform the response variable within VB 2.2
using the "Transform" option, VB 2.2 automatically defines the response as having the
chosen transformation and, in doing so, synchronizes the units of measurement for later
comparisons.

6.5 Computing Alongshore and Onshore/Offshore Wind, Wave and Current
Components

Orthogonal wind, current, and wave vectors can be powerful predictors of beach
bacterial concentrations. Depending on the orientation of the beach, wind and currents
can influence the movement of bacteria from a nearby source to the beach, and wave
action can re-suspend bacteria buried in beach sediment. To make more sense of these
data, researchers typically decompose wind/current/wave magnitude and direction into A
(alongshore) and O (offshore/onshore) components for analysis (see equations at the end
of this section).

If direction and magnitude (speed/height) data are available, A and O components can
be calculated with the "Compute A, O" button. Clicking it brings up a window where
users specify which columns of the data grid contain the relevant magnitude and direction
data, using drop-down menus (Figure 16). There is also an input box at the bottom of the
form for the beach orientation angle. If the user defined the angle on the "Beach
Location" tab, that value should be seen here. After clicking "OK," new data columns
are added to the far right of the data grid, representing the A and O components of the
specified wind, current, or wave data. Unlike the originally imported IVs, these
components can be deleted from the data grid after they are created. Names of these new
columns will be: WindA_comp(X,Y,Z), CurrentO_comp(X,Y,Z), WaveA_comp(X,Y,Z),
etc., where X is the name of the column of data used for magnitude, Y is the name of the
column used for direction, and Z is the beach orientation angle.

Figure 16. Window for computation of alongshore and offshore/onshore components

Notes on wind, wave and current component calculations:

Direction is an angular degree measure. Moving in a clockwise direction from north
(0 degrees), values are positive, and negative while moving counter-clockwise. Wind
and current speed (as well as wave height) can be measured in any unit. VB 2.2 adheres
to scientific convention where wind direction is specified as the direction from which the
wind blows, while current and wave directions are specified as the direction toward
which the current or waves move. Thus, wind blowing from west to east has a direction
of either 270 or -90 degrees, while a current/wave moving from west to east has a
direction of 90 degrees.

The A component measures the force of the wind/current/wave moving parallel to
the shoreline (Figure 17). A positive A component means winds/currents/waves are
moving from right to left as you look out at the water. A negative A component means
winds/currents/waves are moving left to right as you look out at the water. The O
component measures force perpendicular to the shoreline. A negative O value indicates
movement from the land surface directly offshore (unlikely to see with wave action). A
positive O indicates waves/wind/currents from the water to the shore. These relationships
apply no matter how the beach is oriented (Figure 18).

Figure 17. A and O component definitions for wind, current, and wave data

25


-------
Figure 18. Principal beach orientations given in degrees

Equations for calculation of Wind A/O components:

Wind A: -SPD * cosine ((DIR-BO) * PI/180)

Wind O: SPD * sine ((DIR-BO) * PI/180)

where SPD is wind speed, DIR is wind direction, BO is the beach orientation (in degrees)
and PI = 3.1416. Current A/O and Wave A/O are these same equations multiplied by -1.
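
The equations above translate directly into code. The following is a minimal sketch rather than VB 2.2's implementation; the function names are illustrative, and it uses Python's full-precision pi instead of the rounded 3.1416, so results may differ slightly in the last digits.

```python
import math


def wind_components(speed, direction_deg, beach_orientation_deg):
    """Return (A, O) for wind, following the Wind A / Wind O equations."""
    angle = math.radians(direction_deg - beach_orientation_deg)
    a = -speed * math.cos(angle)
    o = speed * math.sin(angle)
    return a, o


def current_or_wave_components(magnitude, direction_deg, beach_orientation_deg):
    """Return (A, O) for currents or waves: the wind equations times -1."""
    a, o = wind_components(magnitude, direction_deg, beach_orientation_deg)
    return -a, -o
```

Because sine and cosine are periodic, directions of 270 and -90 degrees give the same components, which matches the note that either value may be used for the same wind.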

26


-------
6.6 Creation of New Independent Variables

Users may click the "Manipulate" button to create new columns of data that might
serve as useful IVs. On the screen that pops up, there is a list of available IVs on the far
left, under "Independent Variables." If users wish to create a new term, they add any
available IV used in this new term by selecting it and using the ">" button to add it to the
"Variables in Expression" box. Clicking and dragging down through the "Independent
Variables" list allows for multiple IVs to be added at once.

Figure 19. Window for the formulation of "Manipulates" - arithmetic combinations of existing
columns within the data grid

For example: if users wish to create a new IV that is a row-by-row mean value of
the "centershintemp" and "centerwaisttemp" variables, they add those two to the
"Variables in Expression" box, choose the "Mean" function, click "Add" to move that
expression to the lower box, and then click "OK." This adds a new column of data,
representing a row-by-row average of the two IVs, to the end of the data grid (far right).

27


-------


Figure 20. Creation of a new IV defined as the mean of two existent IVs

Users can create a row-by-row sum, maximum, minimum, mean, or product from
any number of IVs that are added to the "Variables in Expression" box. More than one
expression can be created before the "OK" button is clicked, and IVs can be easily moved
in and out of the box using "<" and ">" keys. Any created expressions can be removed
from the lower box with the "Remove" button. No matter how many IVs are added to the
"Variables in Expression" box, clicking "2nd Order Interactions" will add the cross-
products for all possible pairings of those IVs. Thus, four IVs will produce six
interactions, five IVs will produce ten interactions, and so on. Note that the names of the
columns used to create any manipulate are inside the parentheses of that manipulate's
column name.
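
The row-by-row expressions and the pairwise interaction count described above can be sketched as follows. The helper names and the generated column labels are illustrative assumptions and do not reproduce VB 2.2's exact naming.

```python
from itertools import combinations
from math import prod
from statistics import mean


def row_by_row(columns, func):
    """Apply func (sum, max, min, mean, or prod) across columns, row by row."""
    return [func(values) for values in zip(*columns)]


def second_order_interactions(named_columns):
    """Cross-products for every possible pairing of the supplied columns."""
    out = {}
    for (name_a, col_a), (name_b, col_b) in combinations(named_columns.items(), 2):
        # Label format is illustrative only.
        out[f"PROD[{name_a},{name_b}]"] = [x * y for x, y in zip(col_a, col_b)]
    return out
```

With n columns, the pairing step yields n(n-1)/2 interaction columns, which is where the counts of six (for four IVs) and ten (for five IVs) come from.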

28


-------
Figure 21. Formation of two-way cross-products of a set of four existent IVs

VB 2.2 does not allow previously created "manipulates" — new columns of data
created through the "Manipulate" button — to be further manipulated. Previously-created
manipulates will not appear in the "Independent Variables" section at the left. They can,
however, be chosen as the response variable or deleted from the data grid, using the
appropriate menu choices, accessed by a right-click of the column header.

6.7 Transforming the Independent Variables

VB 2.2 gives users the ability to transform non-categorical IVs to assist in
linearizing the relationship between the IVs and the response variable, which is a
fundamental assumption of an MLR analysis. VB 2.2 provides the following
transformations, where Xt is the transformed IV and X is the original IV:

Log10: Xt = log10(X)

Loge: Xt = loge(X)

Inverse: Xt = 1/X

Square: Xt = X^2

Square Root: Xt = X^0.5

Quad Root: Xt = X^0.25

Polynomial: Xt = a + bX + cX^2

General Exponent: Xt = X^e where the user specifies the value of e

When users click the "Transform" button, they are presented a choice of
transformations to investigate:

29


-------
Figure 22. The range of choices for IV transformations

When users click "Go", the chosen transforms are applied to each non-categorical
IV. VB 2.2 then opens a table that allows comparison of the success of each transform
using a Pearson correlation coefficient, a measure of linear dependence between the
response variable and the IVs. For the polynomial transformation, the Pearson
coefficient is calculated as the square root of the adjusted R^2 value derived from the
regression of the response on Xt. Because this adjusted R^2 value can possibly be
negative, an empirically-derived formula is applied when adjusted R^2 values fall below
0.1:

Polynomial Pearson Coefficient = (-6.67*REi^2 + 13.9*REi - 6.24)*(R^2)^0.5

where REi = 1.015 - 1.856*R^2 + 1.862*adjR^2 - 0.000153*N, R^2 and adjR^2 are defined
by the regression of the response on Xt, and N = number of observations.
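
A small sketch of this calculation follows. It assumes the R^2, adjusted R^2, and N values have already been obtained from the regression; the function and argument names are hypothetical, and the 0.1 cutoff is applied as "adjusted R^2 below 0.1", as stated above.

```python
import math


def polynomial_pearson(r2, adj_r2, n):
    """Pearson-style score for the polynomial transform.

    r2, adj_r2: R^2 and adjusted R^2 from regressing the response on Xt.
    n: number of observations.
    """
    if adj_r2 >= 0.1:
        return math.sqrt(adj_r2)
    # Empirical fallback for small or negative adjusted R^2 values.
    re_i = 1.015 - 1.856 * r2 + 1.862 * adj_r2 - 0.000153 * n
    return (-6.67 * re_i ** 2 + 13.9 * re_i - 6.24) * math.sqrt(r2)
```

The fallback uses the square root of R^2, which is never negative, so the score stays defined even when the adjusted R^2 drops below zero.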

The table that VB 2.2 creates groups all transformed versions of each IV together and
lists the IV name, type of transformation, and the associated Pearson coefficient. By
default, the transformation (this includes the un-transformed version of the IV, denoted
by "none") with the largest absolute value of the Pearson coefficient is highlighted in
bold text for selection. Users may override the default selection by left-clicking on the
row header of a transformed IV they choose. They may also override the default by setting
a Threshold percentage and clicking "Threshold Select" on the left side of the box. This
selects the un-transformed IV unless the transformed IV with the highest absolute value
Pearson coefficient exceeds the un-transformed IV Pearson coefficient by the specified
percentage. In essence, the user is saying, "Unless the Pearson coefficient of the
transformed IV is some % greater than the Pearson coefficient of the un-transformed IV,
use the un-transformed IV." This can be useful because transforming IVs makes
interpreting model coefficients more difficult; unless an improvement is seen,
transformation may not be worth the trouble. Users can also revert to the default by
clicking "Go" under the "Auto Select" section at the left.
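
The threshold rule can be sketched as below. The guide does not spell out the exact comparison, so this sketch assumes a relative test on absolute values: the best transform is chosen only if its coefficient exceeds the un-transformed coefficient by at least the given percentage. The function name and input layout are illustrative.

```python
def threshold_select(coefficients, threshold_pct):
    """Pick a transform name, falling back to "none" below the threshold.

    coefficients: dict mapping transform name -> Pearson coefficient,
    which must include the un-transformed entry under the key "none".
    """
    base = abs(coefficients["none"])
    best = max(coefficients, key=lambda name: abs(coefficients[name]))
    # Assumed interpretation: relative improvement over the un-transformed IV.
    if abs(coefficients[best]) > base * (1 + threshold_pct / 100.0):
        return best
    return "none"
```

With a threshold of zero this reduces to the default behavior of selecting the largest absolute coefficient.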

Figure 23. Pearson correlation coefficient scores for judging the efficacy of IV transformations

31


-------
Plotting Transformed IVs

Users may prefer to examine plots visually to determine which transformation of
an IV to choose. If users right-click on a row header in this correlation table, they can view
an array of scatterplots, time series plots, or frequency plots for each data transformation
of the IV represented by that header. Scatterplots will show the best-fit regression line,
the correlation coefficient, and the p-value for that correlation coefficient.

Figure 24. Scatterplots (Response vs. IV) for six different data transformations of a single IV

After choosing a transformation for each IV, users click "OK." This populates
the data grid with new columns representing transformed versions of the IVs. The small
checkbox in the bottom left corner of Figure 23 controls whether the untransformed
version of the IV remains enabled in the data grid after the user clicks "OK." When the
box is checked, for any IV in which the user chooses a transformed version, the un-
transformed version will be disabled in the data grid. Notice that transformed versions of
an IV are put into the data grid immediately after the original, un-transformed IV.

Notes on Transformed IVs

Any transformations put into the data grid can be deleted with the "Delete
Column" choice after right-clicking on their column header. Transformed IVs will
appear in the list of IVs on the "Manipulate" screen; however, transformed IVs cannot be
further transformed and will not appear in the transform table if the user goes back to the
"Transform" window.

VB 2.2 transformations have specific processing for certain data values and are
not pure mathematical transformations; they were designed to maintain data order
while helping to linearize the response-IV relationship. For the SQUARE (b=2),
SQUAREROOT (b=0.5), QUADROOT (b=0.25), INVERSE (b=-1) and GENERAL
EXPONENT (b is user-defined) transformations, VB 2.2 uses the signed equivalent of
the mathematical function:

x^b == sign(x)*abs(x)^b
For example: (-2)^2 = -4, (-9)^0.5 = -3, (-4)^-0.5 = -0.5, (-2)^-2 = -0.25

To avoid potentially undefined values (e.g., 1/x when x = 0), the INVERSE and
GENERAL EXPONENT (if the user sets b < 0) transformations have special processing:

If x = 0, then VB 2.2 finds z, the non-zero value of the IV in question with the
smallest absolute value. For the purpose of computing the transformation,
once z is defined, VB 2.2 substitutes z/2 for x. From this definition, note that z can be
either a positive or negative number.

LOG10 and LOGe transforms are also the signed equivalent of the mathematical
functions:

loge(x) == loge(x)
loge(-x) == -loge(x)
log10(x) == log10(x)
log10(-x) == -log10(x)

In addition, if (-1 < x < 1), then loge(x) = 0 and log10(x) = 0

VB 2.2 will not compute the INVERSE, GENERAL EXPONENT (with a
negative b), LOG10 and LOGe transformations for data columns if more than 10% of the
IV's values are zero. Programmatically, zero is defined as any number whose absolute
value is less than 1.0e-21.
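
The signed transformations and the zero-handling rules above might be sketched as follows. This is an illustration under stated assumptions, not VB 2.2's source: the function names are hypothetical, and the z/2 substitute is assumed to keep the sign of z.

```python
import math

ZERO_TOL = 1.0e-21  # magnitudes below this are treated as zero


def signed_power(x, b):
    """sign(x) * abs(x)**b; x must be non-zero when b is negative."""
    return math.copysign(abs(x) ** b, x)


def signed_log10(x):
    """Signed log10, defined as 0 for -1 < x < 1."""
    if -1 < x < 1:
        return 0.0
    return math.copysign(math.log10(abs(x)), x)


def zero_substitute(column):
    """Return z/2, where z is the non-zero value of smallest magnitude."""
    non_zero = [v for v in column if abs(v) >= ZERO_TOL]
    return min(non_zero, key=abs) / 2


def inverse_transform(column):
    """Signed 1/x per value, or None if more than 10% of values are zero."""
    zeros = sum(1 for v in column if abs(v) < ZERO_TOL)
    if zeros > 0.10 * len(column):
        return None  # the transformation is not computed
    sub = zero_substitute(column) if zeros else None
    return [signed_power(sub if abs(v) < ZERO_TOL else v, -1) for v in column]
```

Keeping the sign outside the power or logarithm is what preserves the ordering of negative and positive values, which is the stated design goal of these transformations.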

POLYNOMIAL transformations are the result of a linear regression of the
response variable on the IV and the square of the IV:

Poly(X) = a + b*X + c*X^2

where a, b, and c are determined by a multiple linear regression of the response
variable on X and X^2.
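
For readers who want to reproduce a polynomial transform outside VB 2.2, the sketch below fits a, b, and c by ordinary least squares using only the standard library. It solves the normal equations directly, which is an implementation choice made here for self-containment, not a statement about how VB 2.2 performs the regression. The function names are illustrative.

```python
def fit_quadratic(x, y):
    """Least-squares a, b, c for y ~ a + b*x + c*x**2 (normal equations)."""
    rows = [[1.0, xi, xi * xi] for xi in x]
    # Augmented 3x4 system: (X'X | X'y).
    m = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def poly_transform(x, y):
    """Return Poly(X) = a + b*X + c*X^2 for each value in x."""
    a, b, c = fit_quadratic(x, y)
    return [a + b * xi + c * xi * xi for xi in x]
```

The sketch assumes the IV takes at least three distinct values; otherwise the system is singular and the division fails.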

In general, the name of the transformed column of data that VB 2.2 creates is
simply the type of transformation, with the original data column name in parentheses.
For example, WaterTemp would become LOG10(WaterTemp). There are some
exceptions, however:

33


-------
INVERSE(X,Y) : X is the original data column name and Y is the z/2 value
discussed earlier in this section.

POWER(X,Y) : When Y is positive, X is the original data column name and Y is
the exponent specified by the user.

POWER(X,Y,Z) : When Y is negative, X is the original data column name, Y is
the exponent specified by the user, and Z is the z/2 value discussed earlier in this section.

POLY(X, a,b,c): X is the original data column name and a, b, and c are the
values of the polynomial regression coefficients.

Finally, because transformations are determined by the current response variable,
when users change the response variable in the data grid (using the column header right-
click menu), all transformed IVs in the data grid are erased (a message warns the user).

6.8	Saving Processed Data

Data can be saved in a project file (Project -> Save) at any time during data processing.
When the file is opened, the data grid will be repopulated as it appeared when the project
was saved. Also, users may highlight the entire table or sections of the table and use
Control-C and Control-V to copy and paste the data grid into a word processing or
spreadsheet application.

6.9	Go to Modeling

After data processing is complete, users must click the "Go to Modeling" button
to open the Modeling tab. If users have already done modeling work and returned to the
data sheet to make changes, they will receive a message that the data sheet has changed
and any prior information on the Modeling, Residual, or MLR Prediction tabs will be
erased. Users can then choose to move forward to the Modeling tab or revert to the
previous version of the data sheet prior to making changes.

34


-------
7. MODELING

The Modeling tab facilitates finding the best model based on criteria selected by
the user. As the number of IVs increases, the number of possible models in the solution
space increases exponentially. Users may select all or a subset of the IVs for
consideration in the model to reduce the size of the solution space.

7.1 Selecting Variables for Model Building

All eligible IVs are listed in the left column ("Available Variables") under the
Variable Selection sub-tab. Any variable users wish to consider for model inclusion must
then be moved to the "Independent Variables" list by highlighting the IV and clicking the
">" key. Any number of IVs can be moved or removed from this list.

Figure 25. Selecting variables for MLR processing within the Modeling tab

As you add or remove IVs from the "Independent Variables" list, the number of
possible MLR models is displayed in the status strip at the bottom right of the application
window. The number of possible models can grow exceedingly large; 66 IVs represent
7.38 × 10^19 possibilities. More than 66 variables produces a number that exceeds the
capacity of the program to store it; in such cases, "more than 9.2e019" is displayed.
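
The model counts quoted in this guide (for example, 32,767 models for 15 IVs) are consistent with counting every non-empty subset of the selected IVs, i.e., 2^k - 1 for k IVs. The sketch below reproduces that arithmetic. The function name count_models and the optional max_vars cap are illustrative assumptions, not part of VB 2.2.

```python
from math import comb

def count_models(num_ivs, max_vars=None):
    """Count non-empty IV subsets, optionally capped at max_vars IVs per model."""
    if max_vars is None or max_vars > num_ivs:
        max_vars = num_ivs
    # Sum the number of ways to choose 1, 2, ..., max_vars IVs.
    return sum(comb(num_ivs, size) for size in range(1, max_vars + 1))

print(count_models(15))     # all subsets of 15 IVs
print(count_models(66))     # about 7.38e19
print(count_models(24, 8))  # 24 IVs, at most 8 per model
```

Capping the model size reduces the count substantially, which is why the "Maximum Number of Variables in a Model" setting matters for run time.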

7.2 Modeling Control Options

The first decision users make on this tab involves which evaluation criteria will be used
to judge model fitness. There are currently ten criteria available in the drop-down menu:

•	Akaike Information Criterion (AIC)

•	Corrected Akaike Information Criterion (AICC)

•	R²

•	Adjusted R²

•	Predicted Error Sum of Squares (PRESS)

•	Bayesian Information Criterion (BIC)

•	RMSE

•	Sensitivity

•	Specificity

•	Accuracy

Figure 26. Setting modeling options within the Modeling interface

The "Maximum VIF" (Variance Inflation Factor) parameter is used to selectively
discard models that contain variables with a high degree of multi-collinearity, i.e., IVs
that are highly correlated with other IVs. If any IV in a model has a VIF exceeding the
threshold, that model will be discarded. The default VIF threshold used in the application
is 5. A VIF of 5 means that 80% (1 - 1/5) of the variability in an IV can be explained by
the variability of other IVs in the model. A VIF of 10 means that 90% (1 - 1/10) of the
variability can be explained, and so on. If users aren't concerned with multi-collinearity
among the explanatory variables in a regression model, they can raise the Maximum VIF
value. However, multi-collinearity leads to poorly estimated regression coefficients (i.e.,
large standard deviations of these coefficients).
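
The relationship between a VIF and the share of explained variability follows from the standard definition VIF = 1 / (1 - R²), where R² comes from regressing one IV on the other IVs. The sketch below is illustrative only. The function names and the sample data are assumptions, and the two-IV shortcut (R² equals the squared Pearson correlation) applies only when a model has exactly two IVs.

```python
def vif_from_r_squared(r_squared):
    """VIF of an IV, given R^2 from regressing it on the other IVs."""
    return 1.0 / (1.0 - r_squared)

def r_squared_from_vif(vif):
    """Share of an IV's variability explained by the other IVs."""
    return 1.0 - 1.0 / vif

def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def passes_vif_filter(vifs, max_vif=5.0):
    """A model is kept only if no IV exceeds the Maximum VIF."""
    return all(v <= max_vif for v in vifs)

# Hypothetical two-IV model: both IVs share the same VIF.
iv_a = [1.0, 2.0, 3.0, 4.0, 5.0]
iv_b = [2.0, 4.0, 5.0, 4.0, 5.0]
vif = vif_from_r_squared(pearson_r(iv_a, iv_b) ** 2)
print(vif, passes_vif_filter([vif, vif]))
```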

The "Maximum Number of Variables in a Model" parameter tells VB 2.2 how
large the models being evaluated can be. As a rule, most modelers prefer to have about
10 observations per estimated parameter in their models, otherwise possibilities increase
for model over-fitting and poor estimation of regression parameters. VB 2.2's
recommendation is close to this rule. It equals (1 + n/10) where n is the number of
observations in the dataset. The maximum allowable number equals n/5. VB 2.2 won't
let users set this value over the maximum. The total number of available parameters is
also given here.
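
As a worked example of the two limits, the sketch below evaluates 1 + n/10 and n/5 for a dataset of 37 observations. It assumes fractional parts are truncated, which is a guess about how VB 2.2 rounds, and the function name variable_limits is hypothetical.

```python
def variable_limits(n_obs):
    """Return (recommended, maximum) number of variables for n_obs observations."""
    recommended = 1 + n_obs // 10   # assumption: n/10 is truncated
    maximum = n_obs // 5            # assumption: n/5 is truncated
    return recommended, maximum

print(variable_limits(37))  # (4, 7)
```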

If we define p as the number of parameters in a model, n as the number of
observations in the dataset, RSS as the residual sum of squares for a model, and TSS as
the total sum of squares for a model, then the evaluation criteria for a model can be
defined as:

•	Akaike Information Criterion (AIC): $2p + n \ln(\mathrm{RSS})$

•	Corrected Akaike Information Criterion (AICC): $\ln(\mathrm{RSS}/n) + (n+p)/(n-p-2)$

•	R²: $1 - \mathrm{RSS}/\mathrm{TSS}$

•	Adjusted R²: $1 - (1 - R^2)(n - 1)/(n - p - 1)$

•	Bayes (Schwarz) Information Criterion (BIC): $n \ln(\mathrm{RSS}/n) + p \ln(n)$

•	Root Mean Squared Error (RMSE): $(\mathrm{RSS}/n)^{1/2}$

•	Predicted Error Sum of Squares (PRESS): $1 - \sum_i (y_i - \hat{y}_{-i})^2 / \sum_i (y_i - y_m)^2$

where $y_i$ is the ith observation, $\hat{y}_{-i}$ is the model estimate of the ith observation when the model coefficients
are fitted with the ith observation removed from the dataset, and $y_m$ is the mean value of y in the dataset

•	Accuracy: (true positives + true negatives) / number of total observations

•	Specificity: true negatives / (true negatives + false positives)

•	Sensitivity: true positives / (true positives + false negatives)
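
The RSS-based criteria above can be transcribed directly into code. The sketch below follows the formulas as they are printed in this section. It is not taken from the VB 2.2 source, so the values the application reports may differ by constants or by how p is counted. The function name regression_criteria and the input values are illustrative assumptions.

```python
import math

def regression_criteria(rss, tss, n, p):
    """Evaluate the RSS-based criteria as printed in this guide.

    rss: residual sum of squares, tss: total sum of squares,
    n: number of observations, p: number of parameters.
    """
    r2 = 1.0 - rss / tss
    return {
        "AIC": 2 * p + n * math.log(rss),
        "AICC": math.log(rss / n) + (n + p) / (n - p - 2),
        "R2": r2,
        "AdjR2": 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1),
        "BIC": n * math.log(rss / n) + p * math.log(n),
        "RMSE": math.sqrt(rss / n),
    }

# Hypothetical inputs, chosen only to exercise the formulas.
criteria = regression_criteria(rss=20.0, tss=50.0, n=25, p=3)
for name, value in criteria.items():
    print(f"{name}: {value:.4f}")
```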

Sensitivity, specificity and accuracy are special cases that require users to enter
both a Decision Criterion (DC) and Regulatory Standard (RS) so that true/false positives
and true/false negatives can be defined. The DC is a modeled (predicted) value the user
chooses. Model predictions above this threshold are considered exceedances, while
model predictions below this value are considered non-exceedances. The RS is typically
a safety limit on fecal indicator bacteria (FIB) levels set by a state or federal agency. The
"Threshold Transform" radio buttons tell VB 2.2 how to transform the DC and RS for
comparison to model predictions and observations. If a transformation definition is set
for the response variable (either manually by the user or automatically by transforming
the response) during data processing, that definition will be set as the default here. Users
should understand that changing the threshold transform definition can lead to problems
when comparing modeling predictions to observations. Caution should be exercised.
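
To make the roles of the DC and RS concrete, the sketch below classifies paired predictions and observations and then computes sensitivity, specificity and accuracy using the conventional definitions (sensitivity is the true positive rate, as in Section 7.7). The function names are hypothetical, and treating a value exactly equal to a threshold as a non-exceedance is an assumption. The example counts are illustrative.

```python
import math

def confusion_counts(predictions, observations, dc, rs):
    """Return (tp, fp, tn, fn) for paired predictions and observations."""
    tp = fp = tn = fn = 0
    for pred, obs in zip(predictions, observations):
        predicted_exceedance = pred > dc   # assumption: strict inequality
        observed_exceedance = obs > rs     # assumption: strict inequality
        if predicted_exceedance and observed_exceedance:
            tp += 1
        elif predicted_exceedance:
            fp += 1
        elif observed_exceedance:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and accuracy from the four counts."""
    def ratio(num, den):
        return num / den if den else float("nan")
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }

# With a log10-transformed response, thresholds must be on the same scale.
dc = rs = math.log10(235)

# Illustrative counts: 37 observations, two missed exceedances.
metrics = classification_metrics(tp=0, fp=0, tn=35, fn=2)
print(metrics)
```

A model that never predicts an exceedance can still score a high accuracy when exceedances are rare, which is why sensitivity should be checked alongside it.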

Figure 27. Setting evaluation thresholds and threshold transformation information within the
modeling interface

-------
7.3 Linear Regression Modeling Methods

There are two options for exploring the solution space.

1.	Manual - this option is for a directed model search. If the 'Run all combinations'
box is not checked, a single model including every IV that was added to the
"Independent Variables" column will be evaluated. If 'Run all combinations' is
checked, an exhaustive search is performed. The exhaustive search evaluates
every model that can be constructed with the selected IVs, but does not evaluate
any with more parameters than the "Maximum Number of Variables in a Model"
input box. For example, if there are 24 IVs to evaluate and the maximum number
of IVs in a model is set at 8, the exhaustive routine examines every possible 1-, 2-,
3-, 4-, 5-, 6-, 7-, and 8-parameter model. As the number of IVs rises, the number
of possible models quickly becomes so large that the exhaustive routine cannot
maintain reasonable computation times and the user is advised to switch to the
genetic algorithm.

2.	Genetic Algorithm - the Genetic Algorithm (GA) option explores solution spaces
too large to handle exhaustively. Genetic algorithms are loosely based on the
natural evolutionary process, in which individuals in a population reproduce and
mutate. Individuals with high fitness (regression models that produce small
residuals) are more likely to reproduce and pass their genes (IVs) to the next
generation. The goal is to find a good solution without having to examine every
possible option and the GA balances random and directed searching.
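
The exhaustive option described above can be sketched as an enumeration of every IV subset up to the maximum model size, followed by ranking. This is a minimal illustration, not VB 2.2's implementation. The score callable stands in for fitting a regression and computing the chosen criterion, and all names are hypothetical.

```python
from itertools import combinations

def candidate_models(ivs, max_vars):
    """Yield every subset of ivs containing 1 to max_vars variables."""
    for size in range(1, min(max_vars, len(ivs)) + 1):
        for subset in combinations(ivs, size):
            yield subset

def exhaustive_search(ivs, max_vars, score, keep=10):
    """Return the `keep` best subsets, where a lower score is better (as with AIC)."""
    ranked = sorted(candidate_models(ivs, max_vars), key=score)
    return ranked[:keep]

# Placeholder score: a real run would fit each model and return its criterion.
def toy_score(subset):
    return abs(len(subset) - 2) + (0 if "waveheight" in subset else 1)

ivs = ["airtemp", "waveheight", "WindSpeed", "WindDirection"]
best = exhaustive_search(ivs, max_vars=3, score=toy_score, keep=3)
print(best)
```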

Figure 28. Model building interface using a manual search (left panel) or the Genetic Algorithm
(right panel)

Choosing between an exhaustive and a GA search depends on your data set,
available hardware and time constraints. Fifteen IVs produce about 32,000 model
possibilities; on our system (Dell Precision T5400 workstation running MS Windows XP SP3
with dual 2.66 GHz Xeon processors and 4 GB RAM), the exhaustive search was
completed in approximately 90 seconds. Sixteen IVs represent more than 65,000
possibilities, roughly double the number for 15 IVs. Some model building results are
summarized below:

Exhaustive Search - Run All Combinations

Number of IVs    Number of MLR models    Approximate time required to generate and filter models (seconds)
15               32767                   90
16               65535                   110
17               131071                  280

By contrast, the GA with 17 IVs was completed in less than seven seconds. We note,
however, that the exhaustive search did find a slightly better model than the GA did using
the selected AIC evaluation criterion (49.2 versus 55).

An alternative modeling strategy could be to use the GA on your entire list of IVs,
then the exhaustive search on a subset of the initial IVs - any IV that appears in one of
the best ten models found by the GA. This two-step modeling process is facilitated with
the "IV Filter" list control.

Figure 29. Using the IV filter to select a subset of variables from the best-fit models

When the GA ends and the 10 best models are shown, use the "Clear List" button
to remove all IVs from the selection list. Select each model in the "Best Fits" list, one at a
time, and click the "Add to List" button; this action adds any IVs in the model to the
Independent Variable list. After doing this for the ten best models, users will likely have a
much more manageable IV list and can run an exhaustive search to find the very best
combination of IVs. Regardless of the method chosen to build models, the "Best Fits"
window shows the top ten models found, in terms of the evaluation criterion chosen.
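
In effect, the "IV Filter" step builds the union of the IVs that appear in the best GA models. A minimal sketch of that bookkeeping follows. The function name and the variable names in the example are hypothetical.

```python
def iv_filter(best_models):
    """Collect every IV used by any of the given models, keeping first-seen order."""
    selected = []
    for model in best_models:
        for iv in model:
            if iv not in selected:
                selected.append(iv)
    return selected

# Hypothetical GA results: each model is a tuple of IV names.
ga_best = [
    ("clouds", "turbidity", "airtemp"),
    ("clouds", "turbidity", "dewpoint"),
    ("windspeed", "clouds", "airtemp"),
]
print(iv_filter(ga_best))
# ['clouds', 'turbidity', 'airtemp', 'dewpoint', 'windspeed']
```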

7.4 Using the Genetic Algorithm

There are five parameters users can set to adjust performance of the GA:

a)	Seed value: seeds the internal random number generator used to produce random values.
Setting this seed to a known value will make the GA run reproducible.
Changing the seed will create a new series of random values, possibly returning
different results.

b)	Population size: number of individuals in the population of each generation. A
larger population broadens the search at each generation, but slows processing
time.

c)	Number of generations: how long to run the search since individuals can
reproduce and mutate once each generation. The fitness of every individual in
the population is evaluated at the end of each generation.

d)	Mutation rate: chance each individual has of undergoing random mutation in
each generation. The higher the mutation rate, the more random (less directed)
the search of parameter space is.

e)	Crossover rate: probability that two selected individuals in the population will
exchange genome parts. Exchanging genes creates new individuals in the
population.

The best GA parameter values depend on the dataset being investigated, but
typical values of the mutation rate are between 0.001 and 0.1 (0.1 and 10%) and typical
values of the crossover rate are between 0.4 and 0.75 (40 and 75%). For most datasets, a
population size and generation number of 100 will be sufficient. Larger datasets may
require an increase in these numbers for optimal solutions.
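
The following sketch shows how the five parameters interact in a simple genetic algorithm for IV subset selection. It is not VB 2.2's algorithm: tournament selection, single-point crossover, single-gene mutation and the repair step are all assumptions chosen for brevity, and the fitness callable is a placeholder for fitting a regression and returning a criterion to minimize. It assumes at least two IVs and a population of at least two.

```python
import random

def genetic_search(n_ivs, fitness, max_vars, pop_size=100, generations=100,
                   mutation_rate=0.05, crossover_rate=0.5, seed=None):
    """Minimize `fitness` over IV subsets encoded as tuples of 0/1 genes."""
    rng = random.Random(seed)  # a fixed seed makes the run reproducible

    def repair(genes):
        # Keep between 1 and max_vars IVs switched on.
        genes = list(genes)
        on = [i for i, g in enumerate(genes) if g]
        while len(on) > max_vars:
            genes[on.pop(rng.randrange(len(on)))] = 0
        if not on:
            genes[rng.randrange(n_ivs)] = 1
        return tuple(genes)

    def tournament(pop):
        # The fitter of two random individuals becomes a parent.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) <= fitness(b) else b

    pop = [repair(rng.randint(0, 1) for _ in range(n_ivs)) for _ in range(pop_size)]
    best = min(pop, key=fitness)
    for _ in range(generations):
        next_pop = []
        while len(next_pop) < pop_size:
            p1, p2 = tournament(pop), tournament(pop)
            child = list(p1)
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_ivs)       # single-point crossover
                child = list(p1[:cut] + p2[cut:])
            if rng.random() < mutation_rate:
                i = rng.randrange(n_ivs)            # flip one random gene
                child[i] = 1 - child[i]
            next_pop.append(repair(child))
        pop = next_pop
        best = min(pop + [best], key=fitness)       # never lose the best model
    return best

# Placeholder fitness: distance from a hypothetical "true" subset of 8 IVs.
TARGET = (1, 0, 0, 1, 0, 1, 0, 0)
def toy_fitness(genes):
    return sum(g != t for g, t in zip(genes, TARGET))

print(genetic_search(8, toy_fitness, max_vars=4, pop_size=30, generations=20, seed=42))
```

Running the function twice with the same seed returns the same subset, which mirrors the reproducibility described for the seed value.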

Figure 30. Genetic algorithm options within the modeling interface

7.5 Evaluating Model Output

After selecting a method to build models and an evaluation criterion to rank them,
users then click the "Run" button. Model selection and evaluation progress is displayed
on the "Progress" graph at the lower right of the Modeling tab. Note that the "Run"
button changes to "Cancel;" the process is interruptible should progress be unacceptably
slow. Once model-building is completed, the ten best MLR fits are displayed in the
"Best Fits" box. Selecting a model from the list results in (see Figure 31):

1.	A list of the model's IVs with associated regression coefficients and statistics
is displayed on the "Variable Statistics" subtab.

2.	A list of the model's evaluation metrics is shown on the "Model Statistics"
subtab.

3.	The "Results" subtab will show the observations and model fits versus the
observation number. If observations are chronologically ordered, this is
basically a time series plot.

4.	The "Observed versus Predicted" subtab can show plots and tables based on
observations versus model fits.

5.	The "ROC Curves" subtab shows a plot of the Receiver Operating
Characteristic curve of each "Best Fits" model, as well as a table showing the
computed AUC (area-under-the-curve) for each ROC (see Section 7.7).

6.	Clicking on "View Report" generates a text report of model and variable
statistics for the selected model.

7.	The "Residuals" tab will appear at the top, allowing users to proceed to the
residual analysis component of the application.

8.	The "Prediction" tab will appear at the top, allowing users to proceed to the
prediction component of the application.

Note that selecting a different model from the "Best Fits" list updates the Variable
and Model statistics tables and the displays of the plotting subtabs.

Figure 31. Modeling results shown after completion of an exhaustive regression run

Parameter        Coefficient    Standardized Coefficient    Std. Error    t-Statistic    P-Value
(Intercept)      1.8228                                     0.2994        6.0879         7.4508e-07
uv               -0.0007        -0.5050                     0.0002        -3.7750        0.0006
waveheight       1.6811         0.2239                      1.0139        1.6580         0.1068
WindDirection    -0.0030        -0.4177                     0.0010        -3.1185        0.0038

Figure 32. Modeling Interface showing variable statistics for the selected Best-Fit model

Metric                          Value
R Squared                       0.4195
Adjusted R Squared              0.3667
Akaike Information Criterion    7.2471
Corrected AIC                   9.1826
Bayesian Info Criterion         -25.3092
PRESS                           17.0349
RMSE                            0.6188
Sensitivity                     0.0000
Specificity                     1.0000
Accuracy                        0.9459

Figure 33. Modeling interface showing model evaluation metrics for the selected Best-Fit model

Figure 34. Modeling interface showing a time series plot for the selected model

Figure 35. An XY scatter plot of observed versus predicted values for the selected model

Figure 36. The ROC curves and AUC table for the Best Fit models

7.6 Viewing X-Y Scatterplots

In multiple locations within VB 2.2 (Modeling, Residual and MLR Prediction
tabs), users can access a subtab that allows them to view information for comparing
observations to model predictions (Figure 35). From this space, users can view four
different pieces of data:

1)	A plot of predictions versus observations: "Pred vs. Obs"

2)	A table summarizing model errors (false negatives/false positives) as the decision
criterion (DC) varies across the range of the response variable: "Error Table: DC as
CFU"

3)	A plot of the percent probability of exceedance (calculated based on the current DC)
versus observations: "% Exc vs. Obs"

4)	A table summarizing model errors as the percent probability of exceedance is
varied: "Error Table: DC as % Exc"

These four are chosen with the drop-down menu at the top left corner of the form.
On both plots, a right-button click in the plot area shows a menu of functions
for saving, copying, printing or manipulating the plot view. The plot area can be zoomed
and un-zoomed: drag with the left mouse button to select an area for zooming in; right-click
and select "Un-Zoom" or "Set Scale to Default" to see the entire data set. To pan to an area
of the plot not in view, hold the Shift key down and use the left mouse button to drag the
view. To view (x,y) values of any data point, hover the cursor over the data point. If the
information does not appear, right-click on the graph and make sure "Show Point Values"
is selected.

Regarding interpretation of these plots, the green (Regulatory Standard) and
blue (Decision Criterion) lines permit model evaluation and provide information on
which to base a DC to be used for predictive purposes. On the plots, false positives
are data points in the upper left quadrant of the graph, where the model
predictions exceed the DC but observations are below the RS. In such cases, a beach
advisory would be incorrectly issued based on the model prediction, leading to potential
economic losses. False negatives (points in the lower right quadrant) represent a
potentially more serious scenario: model predictions below the DC and observations that
exceed the RS. In other words, swimming at the beach may have been allowed when it
should have been prohibited due to elevated FIB concentrations.

A model that produces no false positives or false negatives would be an ideal
decision tool, but this is often unattainable with real data. Examining the two tables (#2
and #4 mentioned above) on this subtab should allow users to set a robust DC (either
using units of the actual response variable or a percentage probability of exceedance) that
minimizes both errors. Note that in most cases, the RS is set based on federal or state law
and should not be adjusted by the user. However, the user is free to adjust the DC to
minimize false negatives and false positives.
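
The idea behind the error tables can be sketched as a sweep over candidate DC values, counting false positives and false negatives at each one while the RS stays fixed. This is an illustration only: the function name, the strict inequalities and the sample data (in log10 units) are assumptions.

```python
def error_table(predictions, observations, rs, dc_values):
    """Return (dc, false_positives, false_negatives) for each candidate DC."""
    rows = []
    pairs = list(zip(predictions, observations))
    for dc in dc_values:
        # False positive: predicted exceedance, observed non-exceedance.
        fp = sum(1 for p, o in pairs if p > dc and o <= rs)
        # False negative: predicted non-exceedance, observed exceedance.
        fn = sum(1 for p, o in pairs if p <= dc and o > rs)
        rows.append((dc, fp, fn))
    return rows

# Hypothetical log10 predictions and observations; RS = log10(235) is about 2.37.
preds = [1.2, 1.8, 2.1, 2.5, 2.9]
obs = [1.0, 2.6, 1.9, 2.2, 3.0]
for dc, fp, fn in error_table(preds, obs, rs=2.37, dc_values=[1.5, 2.0, 2.6]):
    print(f"DC={dc}: false positives={fp}, false negatives={fn}")
```

Lowering the DC trades false negatives for false positives, so the table helps locate a DC where both are acceptably low.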

7.7 ROC Curves

In addition to time series and scatterplots which show results for an individual
model, users may also compare all "Best Fits" models using the ROC Curves tab. A
Receiver Operating Characteristic curve shows a model's true positive rate (sensitivity)
plotted against its false positive rate (1 - specificity) as a decision threshold varies
between the model's minimum and maximum predicted values. Models can then be
compared using the area under their ROC curves (AUC). Models having the largest
AUC values perform best over the entire decision space.
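
A ROC curve and its AUC can be computed with a few lines of code. The sketch below sweeps the decision threshold over the model's predicted values and integrates with the trapezoid rule. It is not VB 2.2's code: the function names and sample data are hypothetical, and the data must contain at least one observed exceedance and one non-exceedance.

```python
def roc_points(predictions, observations, rs):
    """Return (false positive rate, true positive rate) points, from (0, 0) to (1, 1)."""
    positives = [o > rs for o in observations]
    n_pos = sum(positives)
    n_neg = len(positives) - n_pos
    points = [(0.0, 0.0)]
    # Lower the decision threshold step by step through the predicted values.
    for dc in sorted(set(predictions), reverse=True):
        tp = sum(1 for p, pos in zip(predictions, positives) if p >= dc and pos)
        fp = sum(1 for p, pos in zip(predictions, positives) if p >= dc and not pos)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical log10 predictions and observations; RS = log10(235) is about 2.37.
preds = [1.2, 1.8, 2.1, 2.5, 2.9]
obs = [1.0, 2.6, 1.9, 2.2, 3.0]
print(auc(roc_points(preds, obs, rs=2.37)))
```

An AUC of 0.5 corresponds to a model that ranks exceedances no better than chance, and 1.0 to a model that ranks every exceedance above every non-exceedance.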

The model with the largest AUC appears in red text in the ROC tab's model list.
A single ROC may be plotted by selecting a model in the list and clicking "Plot."
Multiple models can be selected in the usual Windows fashion with Shift-Click (select all
items between the first and second selection) or Control-Click (select only the clicked
items). The background cell color of models not selected for plot display will be gray
after the "Plot" button is clicked.

Clicking "View Table" will replace the ROC plot with a table showing the false
positives, false negatives, sensitivity, and specificity at every evaluated value of the
Decision Criterion for a single selected model. Users need only click on a model in the
list to the left of this table to see its results. The ROC plot will return to view after
clicking "View Plot."

AUC calculations are performed and curves are plotted when the "ROC Curve"
tab is selected. If this tab is active and new models are subsequently built, leaving this
tab and then returning will generate the new plots and AUC values.

7.8 Cross-Validation

Clicking the "Cross-Validation" button on the Modeling tab brings up a sub-
screen. On it users can set two parameters: sample size for the testing data (T) and
number of random samples (R) taken. When cross-validation is started, a random sample
of size T is taken from the modeling dataset and set aside. Each "Best Fits" model is then
re-fit to the remaining training data. The IVs in each model stay the same, but the
regression coefficients are adjusted to reflect the least-squares fit to the training data.
The Mean Squared Error of Prediction (MSEP) is then calculated based on the T testing
data points for each candidate model. The process (taking a random testing sample; re-
fitting regression coefficients for the ten candidate models based on the training data;
using the re-fit models to make predictions; and computing 10 MSEP values) will be
done R times. A table will show average MSEP values for each candidate model.
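
The procedure above can be sketched for one candidate model as follows. To keep the example short it fits a single-IV regression in closed form, whereas VB 2.2 re-fits multiple regression models. The function names are hypothetical, and test_size and trials correspond to T and R.

```python
import random

def fit_line(xs, ys):
    """Least-squares (intercept, slope) for a single IV."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope

def average_msep(xs, ys, test_size, trials, seed=None):
    """Average MSEP over `trials` random splits with `test_size` held-out points."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    total = 0.0
    for _ in range(trials):
        test = set(rng.sample(indices, test_size))        # random testing sample
        train = [i for i in indices if i not in test]     # remaining training data
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared_errors = [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in test]
        total += sum(squared_errors) / test_size          # MSEP for this split
    return total / trials

# Hypothetical data: a noisy linear relationship.
xs = [float(i) for i in range(12)]
ys = [2.0 * x + 1.0 + (0.3 if i % 2 else -0.3) for i, x in enumerate(xs)]
print(average_msep(xs, ys, test_size=3, trials=500, seed=1))
```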

Cross-validation is a widespread, useful technique for examining the predictive
power of models, i.e., their ability to make predictions for data they have not seen before.
For users wishing to emphasize the predictive ability of a potential model, cross-
validation allows them to evaluate which candidate model consistently makes the best
predictions (i.e., has the lowest MSEP). Note that the PRESS statistic Virtual Beach 2.2
provides as a model evaluation criterion is a cross-validation statistic with T set to 1. The
PRESS algorithm removes one observation at a time from the dataset, re-fits the model
regression coefficients, and then calculates the squared residual for the removed
observation. It does this once for every observation in the dataset to compute the model's
PRESS value, which gives a limited look at a model's predictive potential.
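
The leave-one-out procedure described above can be sketched as follows, summing the squared residual of each removed observation. For brevity the example uses a single-IV regression fitted in closed form, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares (intercept, slope) for a single IV."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope

def press(xs, ys):
    """Sum of squared leave-one-out prediction residuals."""
    total = 0.0
    for i in range(len(xs)):
        # Re-fit the model without observation i, then predict it.
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (b0 + b1 * xs[i])) ** 2
    return total

print(press([0.0, 1.0, 2.0], [0.0, 1.0, 3.0]))  # 2.25
```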

We recommend using approximately 25% of the total number of observations for
testing and running 500-1000 trials.

Figure 37. The cross-validation results for each of the 10 best-fit models

-------
7.9 Report Generation

A text report of modeling results can be generated, copied to the system
clipboard, or saved to a text file using the "View Report" button. Users can view the
report within VB 2.2 by selecting the desired models and clicking on "Generate Report
for Selected Models." The report contains descriptive statistics for each model variable
and model evaluation statistic. Any number of best-fit models can be selected for
reporting.

A recommended approach to saving the information in an external application is
to copy the report to the clipboard (with the "CopytoClipboard" button) and paste it into a
rich-text application like MS Word, Write or WordPad. NotePad or other text editors
will work, but column formats will likely be lost and make the report difficult to
interpret.

Figure 38. A text report generated on the modeling results

Comparative bar graphs can be displayed to view evaluation criteria for all top
models. Click on "View Evaluation Graphs" to see these plots. Hovering the mouse over
any plot displays the relevant evaluation criterion, and hovering over any bar displays the
associated model. Note that the evaluation criteria graphs are scaled to emphasize
differences between the model scores, although the differences may, in fact, be quite small.
With the cursor over any graph, right-click and select "Set Scale to Default" to
view the un-scaled graph.

Figure 39. Plots of the various model evaluation metrics for the 10 best-fit models

Figure 40. Scaled versus un-scaled views of selected model evaluation criterion

-------
8. RESIDUAL ANALYSIS

Once a model is selected in the "Best Fits" window on the Modeling tab, the
"Residuals" and "MLR Prediction" tabs appear at the top of the interface. Users may
click "Residuals" to view information about residuals of the selected model, but this is
not mandatory; they may take the selected model immediately to prediction mode by
clicking on "MLR Prediction." There are four subtabs on the Residuals tab: Predicted vs
Residuals, Observed vs Predicted, DFFITS, and Cook's Distance.

Figure 41. Information available on the Residuals tab, including a plot of studentized residuals
versus predictions, the Anderson-Darling residual normality test, and regression statistics

The Predicted vs Residuals subtab shows a graph of the studentized residuals
versus the model's predicted values. The Anderson-Darling Normality Statistic
(http://en.wikipedia.org/wiki/Anderson-Darling) is shown with its significance (p-value).
Linear regression assumes normally-distributed residuals, so if this A-D normality test
fails (the A-D p-value is less than 0.05), the user should 1) transform the response
variable, 2) transform some of the IVs, or 3) consider deleting offending high-leverage
observations, which can be done on this tab.

Figure 42. Plot of predictions vs. studentized residuals and the A-D test of normality

On the DFFITS and Cook's Distance subtabs, observations are listed in a grid at the left,
sorted by the absolute value of the respective measure (largest first). A plot of the
DFFITS/Cook's Distances for each record (observation) versus the Record ID is shown at
the right. Data points with very large DFFITS/Cook's Distances (i.e., those that lie
outside the horizontal red boundaries on the graph) distort the fitted values and the
standard deviations of the regression coefficients.

[Screenshot of the DFFITS subtab. Left: a grid with columns Record, Date/Time, and DFFITS, sorted by absolute DFFITS (the largest shown is record 357, with DFFITS = -1.995793). Below the grid: the Iterative Rebuild "Go" button; the Auto Rebuild controls with "Stop when all DFFITS less than:" options for an iterative threshold using 2*SQR(p/n) (here 0.2491) or a constant threshold; and the "View Data Table" button. Right: a plot of DFFITS versus Record with horizontal cutoff lines at 0.2491 and -0.2491.]

Figure 43. A table and plot of the DFFITS scores for the residuals

Clicking the Iterative Rebuild "Go" button removes the observation with the
largest absolute value DFFITS/Cook's Distance, re-fits the regression, and calculates new
DFFITS/Cook's Distances for the remaining observations. This model is named
"Rebuild1," and it is added to the "Models" window at the top left of the screen.

Clicking on the Iterative Rebuild "Go" button again would produce a model called
"Rebuild2," which is calculated after removing the observation with the largest absolute
value DFFITS/Cook's Distance remaining in the dataset (it is the 2nd largest absolute
value in the original dataset). The user can continue to click "Go" and remove
observations with the largest remaining DFFITS/Cook's Distances, thus creating
"Rebuild3," "Rebuild4," "Rebuild5," etc. VB will not allow a user to delete any
observations if 10 or fewer observations remain in the dataset.

Whenever a "rebuild" is created by pressing "Go," the information displayed on
the Residuals tab (variable and model statistics, Observed vs Predicted plot, Predicted vs
Residuals plot, DFFITS values, etc.) is automatically updated to reflect this new model
(even if another model is highlighted in the "Models" window). However, the user can
select any model in the "Models" window to view its associated data and plots.

The user has complete freedom to carry out the outlier removal process while
toggling back and forth between the DFFITS and Cook's Distance subtabs. For example,
the first removal can be based on a DFFITS value, the next removal can be based on a
Cook's Distance, the next two removals can be based on DFFITS, etc. To clear the
"Models" window for whatever reason, the user can simply click the "Clear" button.

Rather than using Iterative Rebuild, the user has two additional choices for Auto
Rebuild, both of which remove all observations above some threshold. The "iterative
threshold" choice bases removals on a threshold that is updated every time an observation
is deleted. For DFFITS, this threshold is 2*(p/n)^0.5 (shown in the interface as
2*SQR(p/n)), where p is the number of IVs in the model and n is the current number of
observations in the dataset. For Cook's Distance, the threshold is 4/n.
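As a small worked example of the two threshold formulas above, the following Python sketch evaluates them for a hypothetical model with p = 6 IVs as n shrinks. The function names and the values of p and n are illustrative assumptions, not part of VB.

```python
import math


def dffits_threshold(p, n):
    """Iterative DFFITS cutoff, 2 * (p / n) ** 0.5."""
    return 2.0 * math.sqrt(p / n)


def cooks_threshold(n):
    """Iterative Cook's Distance cutoff, 4 / n."""
    return 4.0 / n


# Each removal reduces n by one, so both cutoffs rise slightly.
for n in (100, 99, 98):
    print(n, round(dffits_threshold(6, n), 4), round(cooks_threshold(n), 4))
```

Because n appears in the denominator of both formulas, the thresholds become slightly more permissive after every deletion.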

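[Screenshot of the removal controls: an Iterative Rebuild "Go" button; an Auto Rebuild "Go" button with "Stop when all DFFITS less than:" options for an iterative threshold using 2*SQR(p/n) (here 0.2491) or a constant threshold (here 0.2491, selected); and a "View Data Table" button.]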

Figure 44. DFFITS/Cook's Distance controls for removing highly influential data points

In the "iterative threshold" process, step one is to check if any DFFITS/Cook's
Distances are above the threshold; if so, VB removes the observation with the largest
absolute value DFFITS/Cook's Distance and then recalculates the regression model, the
DFFITS/Cook's Distances, and the threshold (because n has been reduced by 1). VB
then checks to see if any of these new DFFITS/Cook's Distances are above the new
threshold. If so, the process repeats. VB will continue until no DFFITS/Cook's
Distances remain that exceed the current threshold, or until half of the dataset has been
removed, whichever comes first. For example, if a dataset has 100 observations, VB will
allow 50 to be removed before it breaks out of the Auto Rebuild removal loop. At that
point the user can click the Auto Rebuild "Go" button again to potentially remove
another 25 observations of the remaining 50. We note that, in practice, one should not
remove more than 5-10% of the original dataset as outliers; the need to remove more
indicates a poor MLR fit and warrants a different analytical technique.

Using the "constant threshold" Auto Rebuild option differs from the "iterative
threshold" only in that the threshold remains static (i.e., the value the user types into the
input box) regardless of how many observations are deleted. Updated DFFITS/Cook's
Distances are still calculated after every removal event. VB will also stop this process if
half the number of starting observations has been deleted. There is an upper limit to the
number that can be entered into the "constant threshold" input box (DFFITS = 3, Cook's
Distance = 16/n).
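The two Auto Rebuild modes described above can be sketched as a simple loop. The Python below is a minimal illustration, not VB's code: it uses a one-IV regression so that leverage and DFFITS have closed forms, the function names are hypothetical, and details such as whether a value exactly equal to the threshold is removed are assumptions. It also assumes the data are not a perfect fit (otherwise the leave-one-out variance is zero).

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def dffits(xs, ys):
    """DFFITS of every observation for the one-IV fit above."""
    n, k = len(xs), 2                                   # k = intercept + slope
    a, b = fit_line(xs, ys)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sse = sum(e * e for e in resid)
    scores = []
    for x, e in zip(xs, resid):
        h = 1.0 / n + (x - mx) ** 2 / sxx               # leverage
        s2 = (sse - e * e / (1.0 - h)) / (n - k - 1)    # variance with this row left out
        t = e / math.sqrt(s2 * (1.0 - h))               # externally studentized residual
        scores.append(t * math.sqrt(h / (1.0 - h)))
    return scores


def auto_rebuild(xs, ys, p=1, constant_threshold=None):
    """Remove the most influential row until none exceeds the threshold."""
    xs, ys = list(xs), list(ys)
    n_start, removed = len(xs), []
    # Stop once half of the starting rows are gone, or 10 or fewer rows remain.
    while len(removed) < n_start // 2 and len(xs) > 10:
        scores = dffits(xs, ys)
        if constant_threshold is None:
            threshold = 2.0 * math.sqrt(p / len(xs))    # "iterative threshold"
        else:
            threshold = constant_threshold              # "constant threshold"
        worst = max(range(len(xs)), key=lambda i: abs(scores[i]))
        if abs(scores[worst]) <= threshold:
            break
        removed.append((xs.pop(worst), ys.pop(worst)))  # one "rebuild" per removal
    return removed, fit_line(xs, ys)
```

Each pass through the loop corresponds to one rebuilt model; the only difference between the two modes is whether the threshold is recomputed from the current n or held at the user-supplied value.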

Upon completion of the Auto Rebuild process, multiple models may have been
added to the "Models" window. For example, if 10 observations were removed, then
"Rebuild1" through "Rebuild10" will appear in the "Models" window.

If a user has interest in both DFFITS and Cook's Distances as outlier metrics, we
suggest one of the following methods:

1)	To see if the two criteria would produce different results:
Apply DFFITS removal to your model of choice. Note the results and then clear the
Residuals tab using the "Clear" button. Next, perform a removal process based on
Cook's Distance and compare the results.

2)	To filter out observations that offend either the DFFITS or Cook's Distance criteria:
Run DFFITS removal on the model (i.e., remove all observations above your
specified DFFITS threshold), then click the Cook's Distance subtab and perform
additional outlier removal based on its threshold. After this process, the remaining
observations are "OK" from the perspective of both metrics.

Note that the highlighted model in the "Models" box is used if the "MLR
Prediction" tab is clicked, not necessarily the model whose information is displayed on
the Residuals tab. Also note that any observations removed from the "Residuals" tab are
not removed from the primary dataset shown on the "Data Processing" tab.

Viewing the Data Table

From the DFFITS or Cook's Distance subtabs, users can click on "View Data
Table" to display a history of the observation removal process for the model highlighted
in the "Models" box. From this window, users may export the dataset for external use or
re-importation into VB 2.2.

[Screenshot of the "View Data Table" window. The upper grid, "Records Eliminated from Model Data Set", lists one removed observation per rebuild (values truncated as displayed):]

Model	Residual Value	Residual Type	Date	logEcoli	clouds	SQR[turbidity]	SQR[Previous24h...
Rebuild1	-1.339716	DFFITS	8/16/2007	3.58546073	5	16.06237840420...	1.118033988749...
Rebuild2	-1.013314	DFFITS	6/1/2009	0.301029996	4	2.664582518894...	0
Rebuild3	0.685558	DFFITS	7/25/2008	2.939519253	3	5.540758070878...	0

[The lower grid, "Model Data Set - Inactive Records in Red", shows the full dataset (columns Date, logEcoli, clouds, SQR[turbidity], and further transformed IVs) together with a "Save Data" button.]


Figure 45. "View Data Table" window for examining the dataset after removal of influential data
points

The "Observed vs Predicted" subtab is the same as that in Section 7.6. There are
two plots and two tables to examine, along with controls to modify the Decision Criterion
(blue horizontal line) and Regulatory Standard (green vertical line), so that users can
judge the effects these changes have on model outcomes (false positives, false negatives,
sensitivity, specificity, etc.).
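The model outcomes listed above follow from counting how each prediction/observation pair falls relative to the two thresholds. The sketch below is one plausible reading, assuming a prediction "exceeds" when it is strictly greater than the Decision Criterion and an observation "exceeds" when it is strictly greater than the Regulatory Standard; the function name and the example values are hypothetical.

```python
def evaluate(predictions, observations, decision_criterion, regulatory_standard):
    """Count outcomes and derive sensitivity, specificity, and accuracy."""
    tp = fp = tn = fn = 0
    for pred, obs in zip(predictions, observations):
        predicted_exceed = pred > decision_criterion
        observed_exceed = obs > regulatory_standard
        if predicted_exceed and observed_exceed:
            tp += 1
        elif predicted_exceed:
            fp += 1            # Type I error: false positive
        elif observed_exceed:
            fn += 1            # Type II error: false negative
        else:
            tn += 1
    return {
        "false_positives": fp,
        "false_negatives": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Illustrative log10 values; 2.371 is approximately log10(235).
stats = evaluate([2.5, 2.0, 2.6, 1.0, 2.4], [2.6, 2.5, 2.0, 1.5, 2.5], 2.371, 2.371)
print(stats)
```

Raising the Decision Criterion in a calculation like this trades false positives for false negatives, which is the effect the threshold controls on this subtab let the user explore.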

[Screenshot of the "Observed vs Predicted" subtab: "Plot Thresholds" controls (Decision Criterion (Horizontal), Regulatory Standard (Vertical), and a Threshold Transform choice of None, Log10, Ln, or Power, with Log10 selected), an "Update" button, and a scatterplot titled "Predictions vs Observations" (x-axis: Observations) with Decision Threshold and Regulatory Threshold lines. The "Model Evaluation" panel reports the False Positives (Type I) count, Specificity = 0.9882, False Negatives (Type II) = 80, Sensitivity = 0.3043, and Accuracy = 0.8772.]

Figure 46. Observed vs. Predicted plot on the Residual tab with model evaluation threshold control
and model evaluation statistics

[Screenshot of the Virtual Beach 2.2 Residuals tab after three observation removals. The "Models" list at the top left contains SelectedModel, Rebuild1, Rebuild2, and Rebuild3; the Variable Statistics panel and the "Predictions vs Studentized Residuals" plot reflect the displayed model.]

A.D. Normality Statistic = 0.1526
A.D. Statistic P-value = 0.9546

Figure 47. Residuals interface showing a list of rebuilt models resulting from observation deletions,
and the associated statistics and residual plots for these rebuilds

9. PREDICTION

The MLR Prediction interface allows users to estimate or predict FIB
concentrations with a selected regression model. Whether a user was previously on the
Modeling tab (with a model selected in "Best Fits") or on the Residuals tab (with a model
selected in "Models"), the interface of the MLR Prediction tab will look the same.

9.1	Model Statement

At the top is the linear expression for the chosen model, with values of the
regression coefficients and names of each IV in the model (Figure 48).

9.2	Model Evaluation Thresholds

There are input boxes for the Decision Criterion (DC) and Regulatory Standard
(RS). Setting these allows model predictions to be evaluated and model specificity,
sensitivity, and accuracy to be calculated. When users first arrive at the Prediction tab,
values of the DC and RS will be set to what was on the Modeling tab. The "Threshold
Transform" option tells VB 2.2 how to transform the DC and RS for comparison to
model predictions and observations. If a transformation definition was set for the
response variable during data processing (either manually by the user or automatically by
transforming the response), that definition will be set here as the default. Users should be
aware that changing the threshold transform definition can cause problems when
comparing model predictions to observations, because the thresholds may then be on a
different scale than the predictions. Caution should be exercised.
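To see why the threshold transform matters, consider a threshold of 235 and a model whose response is on a log10 scale. The sketch below is illustrative only; the function name and the string labels for the transforms are assumptions chosen to mirror the None, Log10, Ln, and Power options.

```python
import math


def transform_threshold(value, transform="none", exponent=1.0):
    """Put a raw threshold on the same scale as the model's response."""
    if transform == "none":
        return value
    if transform == "log10":
        return math.log10(value)
    if transform == "ln":
        return math.log(value)
    if transform == "power":
        return value ** exponent
    raise ValueError(f"unknown transform: {transform}")


prediction = 2.5                                   # a hypothetical log10 prediction
print(prediction > transform_threshold(235, "log10"))   # compared on the log10 scale
print(prediction > transform_threshold(235, "none"))    # mismatched scales
```

With the matching transform the threshold becomes roughly 2.37, so a log10 prediction of 2.5 is an exceedance; compared against the untransformed 235, the same prediction would never register as one.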

[Screenshot of the Virtual Beach 2.2 MLR Prediction tab. The model statement at the top reads:]

LogCFU = 1.8228075 - 0.00067864774*(uv) + 1.6810716*(wave height) - 0.0030005423*(WindDirection)

[Below it are the "Model Evaluation Thresholds" inputs (Decision Criterion (Horizontal) and Regulatory Standard (Vertical)), the "Threshold Transform" choices (None, Log10, Ln, Power), and the buttons "IV Data Validation", "Import IVs", "Import Obs", "Make Predictions", "Clear", and "Export As CSV".]

Figure 48. The MLR Prediction interface

9.3 Prediction Form

Most of the prediction form is in three separate data panels: the left panel holds
IV data; the middle panel is for observational data, e.g., lab results of FIB concentrations;
and the right panel shows model predictions and evaluation metrics. Each panel also
contains a column for a unique ID for each row of data (e.g., the date that data were
collected). The panels have separate horizontal and vertical scroll bars that become
visible if the number of rows or columns exceeds the viewable area. The three panels
independently scroll horizontally, but scroll as a group vertically. Panels can be re-sized
by clicking and dragging the blue vertical partitions. The order of the columns in the left
and right panels can be changed by clicking and dragging the column headers left or right.

Users can import IV and observational data from a file using "Import IVs" and
"Import Obs" buttons in the "Prediction Form" button bank located in the middle right of
the screen, or users can type data into the input grids. Either way, they should be certain
that the entered IV data are in the same units as those used to build the model.

Depending on which model was selected for prediction, the IV panel will have
one column for every unique IV that appears in the model, plus a column for the row's
unique ID. When a data file is imported with the "Import IVs" button, a "Column
Mapper" window opens. This window allows users to tell VB 2.2 which columns in the
imported file should be used for the row IDs and each IV found in the model. By
default, the first column of the imported file maps to the ID field, but users can choose
another column if needed. If a column in the imported file has a name identical
to an IV in the model, VB 2.2 automatically selects that column as the
appropriate one for that IV.


Figure 49. Importation of IV data using the "Column Mapper" window
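The default mapping behavior described above (first column as the ID, identically named columns matched to IVs) can be sketched as follows. This is an illustration under stated assumptions, not VB's code: the function name is hypothetical, the example column names are invented, and the name match is assumed to be exact and case-sensitive.

```python
def default_column_map(file_columns, model_ivs):
    """Propose a mapping of {field: file column, or None if the user must choose}."""
    mapping = {"ID": file_columns[0] if file_columns else None}
    for iv in model_ivs:
        # Only an identical column name is selected automatically.
        mapping[iv] = iv if iv in file_columns else None
    return mapping


proposal = default_column_map(
    ["stamp", "uv", "wave_ht", "WindDirection"],      # columns in a hypothetical file
    ["uv", "waveheight", "WindDirection"],            # IVs in the example model
)
for field, column in proposal.items():
    print(field, "->", column if column is not None else "(choose manually)")
```

Any field left unmapped in a proposal like this is one the user would need to assign by hand in the "Column Mapper" window.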

As with IV data, observational data can be typed into the middle panel or
imported using "Import Obs." For observational data, only two columns are needed:
row IDs for every observation and the actual observations. A "Column Mapper" window
appears when observational data are imported from a file. After they have been imported
or manually entered, users can specify the scale/transformation of the observations for a
proper comparison to model predictions. This is done by right-clicking on the
"Observation" column header and defining the transformation: none, log10, ln, or a
power transformation. "None" is the default choice. For example, if Log10 observations
are imported, the user would need to change the right-click menu choice to "Log10."

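[Screenshot of the "Column Mapper" window for observational data: the "Obs IDs" and "Obs" fields are each mapped to a column of the imported file (here, the Observation field is mapped to a column named "LogCFU").]
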

Figure 50. Importation of observational data using the "Column Mapper" window

The "Make Predictions" button remains disabled until the IV data (imported from
a file or manually typed) are validated using the "IV Data Validation" button. This scan
ensures there are no blank cells or non-numeric data in the IV columns of the IV data
panel and checks that every row ID is unique (non-numeric data are allowed for the ID
column). This validation scan window is very similar to the validation scan window in
the Data Processing tab; however, "Delete Column" is not a choice. "Replace With" and
"Delete Row" are the only ways to deal with problems in the IV data grid.

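The checks performed by the IV data validation scan (no blank cells, no non-numeric IV values, unique row IDs) can be sketched as below. The function name, the row representation as dictionaries, and the problem labels are assumptions made for illustration.

```python
def validate_iv_rows(rows, iv_names, id_name="ID"):
    """Return (row_index, column, problem) tuples; an empty list means valid."""
    problems, seen_ids = [], set()
    for i, row in enumerate(rows):
        row_id = row.get(id_name, "")
        if row_id in seen_ids:                       # IDs may be non-numeric but must be unique
            problems.append((i, id_name, "duplicate ID"))
        seen_ids.add(row_id)
        for name in iv_names:
            cell = str(row.get(name, "")).strip()
            if cell == "":
                problems.append((i, name, "blank cell"))
                continue
            try:
                # Note: float() also accepts "nan" and "inf"; a stricter
                # check may be wanted in practice.
                float(cell)
            except ValueError:
                problems.append((i, name, "non-numeric value"))
    return problems


rows = [
    {"ID": "38507.33", "uv": "360", "waveheight": "0.15"},
    {"ID": "38507.33", "uv": "", "waveheight": "calm"},   # deliberately bad row
]
for problem in validate_iv_rows(rows, ["uv", "waveheight"]):
    print(problem)
```

Each reported problem would then be resolved by replacing the value or deleting the row, which are the two remedies available in the IV data grid.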
[Screenshot of the Virtual Beach 2.2 MLR Prediction tab showing the model statement, the Model Evaluation Thresholds controls, and IV data entered in the left panel of the prediction form, one row per ID.]

[Screenshot of the Virtual Beach 2.2 MLR Prediction tab. The IV, observation, and prediction panels of the prediction form are combined in the table below; additional columns of the right panel are cut off in the screenshot.]

ID	uv	waveheight	WindDirection	Observation	Model_Prediction
38507.33	360	0.15	0	1.452	1.831
38507.46	1403	0.2	10	0.8653	1.177
38507.63	1555	0.2	20	0.8016	1.044
38508.33	337	0.2	30	1.738	1.84
38508.46	1305	0.2	40	1.028	1.153
38508.63	1568	0.2	50	0.301	0.9449
38521.46	1342	0.02	60	1.627	0.7657
38521.63	1276	0.01	70	1.247	0.7636
38522.33	225	0.01	80	1.773	1.447
38522.46	1260	0.01	90	0.9379	0.7145
38522.63	1409	0.01	100	0.9542	0.5833
38528.33	295	0.1	110	1.079	1.461
38528.46	1800	0.15	120	0.97	0.4933
38528.63	900	0.18	130	1.195	1.125
38535.33	293	0.15	140	1.239	1.456
38535.46	1537	0.15	150	0.699	0.5818
38535.63	1763	0.3	160	-0.1761	0.6506
38536.33	286	0.05	170	1.176	1.203

Figure 52. A prediction grid after IVs and observational data have been imported, and model
predictions have been made

The ID column of the model output panel is taken directly from the ID column of
the IV panel, not the observation panel. The "Make Predictions" button makes one
model prediction per row in the IV data panel, regardless of how many observations are
entered in the observation panel.
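The one-prediction-per-IV-row behavior can be illustrated by evaluating the example model statement from Figure 48 on the first two IV rows shown in Figure 52. The dictionary layout and function name below are assumptions; only the coefficients and data values come from the figures.

```python
# Coefficients from the example model statement (Figure 48).
INTERCEPT = 1.8228075
COEFFICIENTS = {
    "uv": -0.00067864774,
    "waveheight": 1.6810716,
    "WindDirection": -0.0030005423,
}


def predict(row):
    """Evaluate the linear model for one row of IV data."""
    return INTERCEPT + sum(coef * row[name] for name, coef in COEFFICIENTS.items())


iv_rows = [
    {"ID": "38507.33", "uv": 360, "waveheight": 0.15, "WindDirection": 0},
    {"ID": "38507.46", "uv": 1403, "waveheight": 0.2, "WindDirection": 10},
]

# One prediction per IV row, keyed by the IV panel's ID.
predictions = {row["ID"]: predict(row) for row in iv_rows}
for row_id, value in predictions.items():
    print(row_id, f"{value:.4g}")
```

Formatted to four significant digits, the two values are 1.831 and 1.177, which agree with the Model_Prediction column in Figure 52.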

The Model Prediction column contains predicted values of the response variable.
Right-clicking on this column header allows the user to change how the predictions are
displayed in the table (as linear, log, or power units). The Decision Criterion and
Regulatory Standard are values set by the user (shown in the left panel as transformed by
the choice of "Threshold Transform"). The Exceedance Probability (expressed as a
percentage, i.e., the probability x 100) is defined as the probability that the model
prediction will be larger than the Decision Criterion, based on uncertainty bounds
(confidence intervals) around the model predictions.
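This guide does not spell out how the exceedance probability is computed. As a rough illustration only, the sketch below assumes the prediction uncertainty is normally distributed with a known standard error; the function name, the `std_error` parameter, and the numeric values are all assumptions, and a regression-based calculation would more typically use a t distribution with a prediction-specific standard error.

```python
from statistics import NormalDist


def exceedance_probability(prediction, decision_criterion, std_error):
    """Percent chance that the value exceeds the criterion (normal approximation)."""
    z = (decision_criterion - prediction) / std_error
    return 100.0 * (1.0 - NormalDist().cdf(z))


# Hypothetical log10-scale values: criterion near log10(235), standard error 0.4.
for pred in (1.831, 2.371, 2.9):
    print(pred, round(exceedance_probability(pred, 2.371, 0.4), 1))
```

Under these assumptions, a prediction exactly equal to the Decision Criterion gives 50, and the value rises toward 100 as the prediction moves further above the criterion.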

To compare model predictions to observations, VB 2.2 looks at the prediction ID
and attempts to find an observation in the observation panel with that same ID. VB 2.2
does not require unique IDs for each row in the observation panel, but note that a model
prediction is compared to the first observation found with the same ID. When a model
prediction and its matching observation disagree, the type of error (false exceedance or
false non-exceedance) appears in the "Error Type" column.
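The ID-matching rule described above (each prediction is compared with the first observation that has the same ID) can be sketched as follows. The function name, the tuple layout, and the exact error labels are assumptions for illustration; the comparison operators are also assumed.

```python
def compare(predictions, observations, decision_criterion, regulatory_standard):
    """predictions and observations are lists of (id, value) pairs in grid order."""
    first_obs = {}
    for obs_id, value in observations:
        first_obs.setdefault(obs_id, value)          # keep only the first match per ID
    results = []
    for pred_id, pred in predictions:
        if pred_id not in first_obs:
            results.append((pred_id, None))          # no observation to compare against
            continue
        obs = first_obs[pred_id]
        if pred > decision_criterion and obs <= regulatory_standard:
            error = "False Exceedance"
        elif pred <= decision_criterion and obs > regulatory_standard:
            error = "False Non-Exceedance"
        else:
            error = ""                               # prediction and observation agree
        results.append((pred_id, error))
    return results


preds = [("a", 2.5), ("b", 2.0), ("c", 2.0)]
obs = [("a", 2.0), ("a", 3.0), ("b", 2.6)]            # duplicate ID "a": first one wins
print(compare(preds, obs, 2.371, 2.371))
```

In this example the second observation with ID "a" is ignored, which is why duplicate observation IDs should be avoided if the later value is the one intended for comparison.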

It is important to note that accurately assessing model output depends on
synchronized transformation information regarding the Decision Criterion, Regulatory
Standard, model predictions, and observations. Users must be careful to ensure that all
of these values are expressed in comparable units.

9.4 Viewing Plots

After predictions have been made, a scatterplot of observations versus predictions
can be viewed by clicking "Plot" in the "Prediction Grid" button bank. If no
observational data were entered, a message asking for observational data appears. The
features and functionality of the form that appears when the "Plot" button is clicked are
described in Section 7.6. The data are based on comparing model predictions (right panel
of the Prediction Form) with observations (middle panel) that share the same unique ID.

[Screenshot of the plot window: "Plot Thresholds" controls (Decision Criterion (Horizontal) and Regulatory Standard (Vertical), both 235, with a Threshold Transform of Log10 selected), an "Update" button, a "Close" button, and a scatterplot titled "Predictions vs Observations" (x-axis: Observations) with Decision Threshold and Regulatory Threshold lines. The "Model Evaluation" panel reports False Positives (Type I) = 7, Specificity = 0.9882, False Negatives (Type II) = 80, Sensitivity = 0.3043, and Accuracy = 0.8772.]


Figure 53. Prediction interface plotting of the observations versus predictions, with model evaluation
threshold controls

9.5 Prediction Form Manipulation

Two other buttons are found in the "Prediction Grid" button bank. If a user wants
to view the table in a spreadsheet or word processing program, "Export as CSV" saves
the contents of the entire table (three panels) in .csv format. "Clear" deletes all
information in the predictive table. As with most of the tabular information in VB 2.2,
data in individual panels can be selected with a left click and drag. Control-C and
Control-V can then be used to copy and paste the data into another application such as
WordPad or Excel.

10.	FUTURE ENHANCEMENTS

VB 2.2 is a Windows application and undergoes continuous improvement and
functional expansion. In version 3.0, slated for release in 2012, project management
enhancements will allow site-based seasonal prediction and model assessment. The map
interface will give users access to site-specific data such as water quality, water flow
gauge readings, and weather data. Model-building functionality will
grow beyond MLR to include Gradient Boosting Machines (Decision Trees), Binary
Logistic Regression, Partial Least Squares regression, and Neural Networks.

11.	USER FEEDBACK

Opinions and experiences from the user community are welcomed by the Virtual
Beach design/development team. Users are encouraged to report problems, issues and
likes/dislikes to:

Mike Cyterski - 706 355-8142 (cyterski.mike@epa.gov)

Mike Galvin - 706 355-8318 (galvin.mike@epa.gov)

Rajbir Parmar - 706 355-8306 (parmar.rajbir@epa.gov)

Kurt Wolfe - 706 355-8311 (wolfe.kurt@epa.gov)

12.	ACKNOWLEDGMENTS

We would like to thank the following people, who generously donated their time
and expertise for software testing and review of this document:

Adam Mednick, Wisconsin DNR
David Rockwell, NOAA
Fran Rauschenberg, USEPA
Wesley Brooks, USGS
Mike Fienen, USGS
Donna Francy, USGS
Richard Zepp, USEPA
Steve Corsi, USGS



-------