Michael Cyterski
                   Mike Galvin
                   Rajbir Parmar
                   Kurt Wolfe
National Exposure Research Laboratory, Ecosystem Research Division, Athens, GA 30605

Notice
The research described in this document was funded by the U.S. Environmental Protection
Agency through the Office of Research and Development. The research described herein was
conducted at the Ecosystems Research Division of the USEPA National Exposure Research
Laboratory in Athens, Georgia. It has been subjected to the Agency's peer and administrative
review and has been approved for publication as an EPA document.  Mention of trade names or
commercial products does not constitute endorsement or recommendation for use.
Abstract
This report describes the development and design of Virtual Beach 2.2 (VB2.2) and
provides guidance for its proper use. VB2.2 is a tool that allows beach managers to analyze
environmental data in order to make decisions regarding beach closures due to microbial
contamination. It does this by facilitating the construction of statistical models for the
prediction of fecal indicator bacteria (FIB) levels.  Some familiarity  with multiple linear
regression (MLR) modeling and residual analysis will benefit a VB user; however, it is not
required.
VB2.2 has five major components:
• Beach location mapping interface where users can locate their site, define the orientation of
  the beach, and examine nearby potential data sources.
• Data processing spreadsheet interface that facilitates the import and manipulation of data.
• Modeling interface that presents options for performing MLR analyses.
• Residuals component to examine regression residuals, allow optional elimination of highly
  influential data records, and perform recalculation of the chosen regression model.
• Prediction interface allowing the entry of new data and subsequent estimation of pathogen
  indicator levels using a selected MLR model.

                                                                       Table  of Contents
1.0 Introduction	1
    1.1 On Predictive Modeling	1
    1.2 Recommended User Background	1
    1.3 History and Comparison of Version 2.2 to Earlier Versions	2
2.0 Installation and Execution	5
    2.1 Viewing this Documentation	5
3.0 Operational Overview	6
4.0 Project Management	7
5.0 Beach Location Mapping Interface	8
    5.1 Finding a Location	8
    5.2 Defining the Beach Orientation	10
    5.3 Finding nearby Water Quality, Flow, and Climate Information Sources	11
    5.4 Saving Beach Information in a Project File	12
6.0 Data Processing	13
    6.1 Data Requirements and Considerations	13
    6.2 Importing a Dataset	14
    6.3 Validating the Imported Data	15
    6.4 Working with a Dataset Post-Validation	16
    6.5 Computing Alongshore and Onshore/Offshore Wind, Wave and Current Components	20
    6.6 Creation of New Independent Variables 	23
    6.7 Transforming the Independent Variables	26
    6.8 Saving Processed Data	31
    6.9 Go to Modeling	31
7.0 Modeling	32
    7.1 Selecting Variables for Model Building	32
    7.2 Modeling Control Options	32
    7.3 Linear Regression Modeling Methods	34
    7.4 Using the Genetic Algorithm	37
    7.5 Evaluating Model Output	38
    7.6 Viewing X-Y Scatterplots	43
    7.7 ROC Curves	44
    7.8 Cross-Validation	45
    7.9 Report Generation	45
8.0 Residual Analysis	48
9.0 Prediction	55
    9.1 Model Statement	55
    9.2 Model Evaluation Thresholds	55
    9.3 Prediction Form	56
    9.4 Viewing Plots	60
    9.5 Prediction Form Manipulation	60
10.0 Future Enhancements	62
11.0 User Feedback	63
12.0 Acknowledgments	64

                                                                         Table  of Figures
Figure 1. The five major component tabs of VB 2.2	2
Figure 2. Beach Location interface	8
Figure 3. Beach Location tab controls and their function	9
Figure 4. Adding shoreline and water markers to define beach orientation	10
Figure 5. NOAA/NCDC station marker showing station ID information	11
Figure 6. USGS/NWIS station marker showing station ID information	11
Figure 7. Beach Location interface showing station markers	12
Figure 8. Importing a dataset into the Data Processing tab	14
Figure 9. Data validation required to begin data processing	15
Figure 10. Context-sensitive choices for the "Take Action Within" drop-down menu	16
Figure 11. Post-validation enabling  of the Data Processing functionality	17
Figure 13. Four different plots available for evaluation of IVs	18
Figure 14. Disabling an observation from within the XY scatterplot	19
Figure 15. Available choices when right-clicking the current response variable	20
Figure 16. Window for computation of alongshore and offshore/onshore components	21
Figure 17. A and O component definitions for wind, current, and wave data	22
Figure 18. Principal beach orientations given in degrees	23
Figure 19. Window for the formulation of "Manipulates"	24
Figure 20. Creation of a new IV defined as the mean of two existent IVs	25
Figure 21. Formation of two-way cross-products of a set of four existent IVs	26
Figure 22. The range of choices for IV transformations	27
Figure 23. Pearson correlation coefficient scores	28
Figure 24. Scatterplots (Response vs. IV) for six different data transformations	29
Figure 25. Selecting variables for MLR processing within the Modeling tab	32
Figure 26. Setting modeling options within the Modeling interface	33
Figure 27. Setting evaluation thresholds and threshold transformation information	34
Figure 28. Model building interface	36
Figure 29. Using the IV filter to select a subset of variables from the best-fit models	37
Figure 30. Genetic algorithm options within the modeling interface	38
Figure 31. Modeling results shown after completion of an exhaustive regression run	39
Figure 32. Modeling Interface showing variable statistics for the selected Best-Fit model	40
Figure 33. Modeling interface showing model evaluation metrics 	40
Figure 34. Modeling interface showing a time series plot for the selected model	41
Figure 35. An XY scatter plot of observed versus predicted values for the selected model	42
Figure 36. The ROC curves and AUC table for the Best Fit models	43
Figure 37. The cross-validation results for each of the 10 best-fit models	45
Figure 38. A text report generated on the modeling results	46
Figure 39. Plots of the various model evaluation metrics for the 10 best-fit models	47
Figure 40. Scaled versus un-scaled views of selected model evaluation criterion	47
Figure 41. Information available on the Residuals tab	48
Figure 42. Plot of studentized predictions vs. residuals and the A-D test of normality	49
Figure 43. A table and plot of the DFFITS scores for the residuals	50
Figure 44. DFFITS/Cook's Distance controls for removing highly influential data points	51
Figure 45. "View Data Table" window for examining the dataset	52
Figure 46. Observed vs. Predicted plot on the Residual tab	53
Figure 47. Residuals interface showing a list of rebuilt models	54
Figure 48. The MLR Prediction interface	56
Figure 49. Importation of IV data using the "Column Mapper" window	57

Figure 50. Importation of observational data using the "Column Mapper" window	57
Figure 51. The IV validation window on the MLR Prediction tab	58
Figure 52. A prediction grid after IVs and observational data have been imported	59
Figure 53. Prediction interface plotting of the observations versus predictions	60

                                                                                      1.0
                                                                         Introduction
       Virtual Beach version 2.2 (VB 2.2) is a decision support tool.  It is designed to construct
site-specific Multiple Linear Regression (MLR) models to predict pathogen indicator levels
(or fecal indicator bacteria, FIB) at recreational beaches. MLR analysis has outperformed
persistence models (using the most recent FIB concentration as the sole predictor of the next FIB
concentration, i.e., yt = yt-1) at beaches where conditions, such as weather, water conditions, and
human and animal traffic levels, change significantly from day to day (Frick, Ge et al. 2008).
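To make the persistence baseline concrete, here is a minimal sketch in Python (purely illustrative; it is not part of VB 2.2 and the variable names are invented):

# Persistence model: each prediction is simply the most recent observation (yt = yt-1).
def persistence_forecast(fib_series):
    """Return persistence predictions for observations 2..n of a FIB time series."""
    return fib_series[:-1]

# An MLR model instead predicts FIB from same-day environmental IVs, for example
#   log10(FIB) = b0 + b1*turbidity + b2*wave_height + b3*wind_A_component + ...
# with the coefficients b0..b3 estimated from historical data collected at the site.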
1.1 On Predictive Modeling

       In any predictive modeling endeavor, variability and uncertainty are always associated
with model output, arising from a variety of reasons that are impossible to eradicate completely
from the modeling exercise. Virtual Beach 2.2 attempts to be forthright with this fact by issuing a
probability of exceedance for any regulatory standard that the user wishes to investigate. Even so,
there is no guarantee that every model prediction will be correct; the model may at times predict
water quality to be good enough for public recreation when in fact it is not. Decisions to
allow or not allow swimming at beaches must be made, however, and in the best case scenarios
the regression models developed with Virtual Beach 2.2 will  outperform less rigorous predictive
efforts.
1.2 Recommended User Background

       Virtual Beach 2.2 is our attempt to create a decision support software tool that will assist
someone with little statistical knowledge in developing a multiple linear regression model based
on their available data.  Some familiarity with regression modeling and residual analysis will no
doubt benefit a VB 2.2 user, although we believe that, after only a few sessions, someone with
very little background in statistics can produce defensible regression models using VB 2.2. We
note that these MLR models, or any other statistical models, will only be as effective as the data
used to develop them. No statistician, however skilled, can turn a dataset filled with worthless
independent variables (i.e., IVs) into a useful predictive device.

VB 2.2 has five major components:

   •  Beach location map interface where users can locate their site, define the orientation of the
       beach, and examine nearby potential data sources.
   •  Data processing spreadsheet interface that facilitates the import and manipulation of MLR
       model variable data.
   •  Modeling interface presenting options for performing MLR analyses.
   •  Residuals component to examine regression residuals, allow optional elimination of highly
       influential data records, and perform recalculation of the regression model.
    •  Prediction interface allowing entry of new data and subsequent estimation of pathogen
       indicator levels using a selected MLR model.

       Each component is accessible from the application's main window via selectable tabs. The
Beach Location and Data Processing tabs are always visible, the Modeling tab becomes visible
once the input data have been validated, and the Residuals and MLR Prediction tabs appear when
model-building is complete and a model is selected.
Figure 1. The five major component tabs of VB 2.2 - the modeling tab is currently active

1.3 History and Comparison of Version 2.2 to Earlier Versions

       Virtual Beach 2.2 is derived from the Virtual Beach Model Builder application (VB 1.0,
also known as Virtual Beach v1.0) developed by Walter Frick and Zhongfu Ge. VB 1.0 can be
characterized as a MLR model-building tool that supports a primarily manual analysis of data sets
via visual inspection of data plots and manipulation of variables (e.g., transformations,  creating
interaction terms), followed by an iterative process of testing, comparing and evaluating models.
The fitness  of developed models is computed and tracked, allowing for comparison and eventual
selection of a "best" model for the dataset under consideration. This model can then produce
estimates of pathogen indicator levels using current or forecasted environmental data from the site.
       VB  2.2 enhances the functionality of its predecessor, performing similar functions (visual
inspection of univariate data plots, manual transformations of individual variables, MLR model
building, prediction, etc.), but also automating and  extending functionality in several ways:

The Map component provides users with information on the location and availability of
local data sources (NWIS/NCDC data) through the map interface. These sources can
provide recently collected and/or forecasted data for generating predictions by a chosen
MLR model.

The Map component provides a convenient method for defining beach orientation by
overlaying the beach on current shore-line layers (satellite images, Google Maps, MS
Virtual Earth, etc.). Given this orientation, VB 2.2 can calculate wind, wave, or current
components (A component is parallel to shore and O component is perpendicular to shore),
which can be important predictor variables.

Although manual processing and analysis of imported data (visual inspection of univariate
data plots and the transformations/interactions of variables) has been retained, the Data
Processing component of VB 2.2 provides automated generation of all possible 2nd order
interaction terms among a set of IVs, formation of more complex functions of multiple
columns, and automated testing of a suite of variable transformations for improved model
linearity.  This functionality increases the number of models to evaluate during later
selection routines  and removes the burden/difficulty of manual assessment placed on users
of VB 1.0.

Multi-collinearity amongst predictor variables is handled automatically in the Model
Building  component. Any model containing an  IV with a high degree of correlation
with other IVs (as measured by a large Variance Inflation Factor [VIF]) is  removed from
consideration during model selection.  The VIF threshold is user-defined with a default
value of 5.
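For readers unfamiliar with the VIF, the following short Python/numpy sketch (illustrative only; it is not VB 2.2's internal code) shows one standard way to compute it for each candidate IV:

import numpy as np

def variance_inflation_factors(iv_matrix):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing IV column j
    on all of the other IV columns (plus an intercept)."""
    n, k = iv_matrix.shape
    vifs = []
    for j in range(k):
        y = iv_matrix[:, j]
        others = np.delete(iv_matrix, j, axis=1)
        X = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
        vifs.append(1.0 / (1.0 - r2) if r2 < 1.0 else float("inf"))
    return vifs

# A candidate model would be discarded if any of its IVs has a VIF above the
# user-defined threshold (default 5).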

During model selection, MLR models are ranked by a user-selected evaluation criterion.
Possible criteria include R2, Adjusted R2, Akaike Information Criterion (AIC), Corrected
AIC, Predicted Error Sum of Squares (PRESS),  Bayes Information Criterion (BIC),
Accuracy, Sensitivity, Specificity, or the model's Root Mean Square Error (RMSE).
Regardless of which criterion is chosen, the software records the ten best models in terms
of that criterion. In  comparison, VB1.0 had only a single comparative criterion, Mallow's
Cp.
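As a rough guide to what several of these criteria measure, the sketch below computes some of them for an ordinary least-squares fit using standard textbook formulas (VB 2.2's exact implementation is not reproduced here and may differ in constants):

import numpy as np

def fit_criteria(y, X):
    """Return R2, adjusted R2, RMSE, AIC (least-squares form), and PRESS for a
    design matrix X whose first column is the intercept."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)
    rmse = np.sqrt(sse / n)
    aic = n * np.log(sse / n) + 2 * p
    # PRESS: leave-one-out prediction error, computed from the hat matrix diagonal
    h = np.diag(X @ np.linalg.pinv(X.T @ X) @ X.T)
    press = float(((resid / (1 - h)) ** 2).sum())
    return {"R2": r2, "AdjR2": adj_r2, "RMSE": rmse, "AIC": aic, "PRESS": press}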

As the number of IVs in a dataset increases, possible MLR models increase exponentially
(considering transforms/interactions), resulting in trillions of possible models from a
modest number (12-13) of IVs.  VB 2.2 implements a Genetic Algorithm (GA) that
effectively and efficiently searches for the best possible MLR model. Alternatively, VB 2.2
users can perform an exhaustive  calculation in which all possible combinations of IVs are
used and  tested if the number of possible models is reasonably small (circa 100,000). Both
the GA and exhaustive approaches greatly expand the model-building capabilities of VB
2.2, compared to VB 1.0.

Users no  longer have to enter data values in transformed, interacted, or component-
decomposed form to make a prediction with a chosen MLR model. On the VB 2.2 MLR
Prediction tab, a user-selected model is coded into an input grid with data entry columns
matching the model's main effects. Any mathematical manipulation of these IVs is then
       automatically performed prior to making predictions.
       VB 2.2 is developed with MS Visual Studio 2010, written in C#, using multiple public
domain system components (Weifen Luo Docking UI, ZedGraph, and GMap.Net) and employs
a single licensed statistical library (Extreme Optimization). No license or software purchase is
required by the user to install and run the application, but an internet connection is required to
display maps. Users must have Microsoft Windows XP or Windows 7 with the DotNet Framework 4.0
to assure proper installation and operation. Assorted errors have occurred when running Windows
Vista OS.  Certain VB 2.2 data manipulation and model-building operations are computationally
intensive so faster CPUs are better, but most new laptops or desktop systems will be adequate.
Disk space requirements are modest (less than 5 MB) if the DotNet Framework is installed; if
not, the Framework installer requires ~ 175 MB of disk space.  The VB 2.2 application installer
will attempt to download and install the DotNet Framework 4.0 if it is not installed on the target
system; this also requires a network connection. If necessary, a user can freely obtain the DotNet
Framework 4 installer at:

http://www.microsoft.com/download/en/details.aspx?id=17851

       The EPA's Center for Exposure Assessment Modeling (CEAM) web site distributes VB 2.2
at:

http://www.epa.gov/ceampubl/swater/vb2/index.html

Obtain and initiate execution of the VB 2.2 application installer and follow the on-screen
instructions. The VB 2.2 application installer can be found at:

https://iemhub.org/resources/vbmb2 for iemHub Virtual Beach Group members;
https://iemhub.org/groups/virtualbeach/join to request Group member access.

       Finally, the software can be obtained by request (see the contacts list in the Feedback
section at the end of this document). After installation, a shortcut will appear on your desktop to
start the software.

                                                                                  2.0
                                                  Installation and Execution
2.1 Viewing this Documentation
       Virtual Beach's User Guide can be accessed within the software via the top-level Help →
User Guide menu selection or in a context-sensitive fashion via the F1 key.  Invoking F1 will launch
Adobe Acrobat or Adobe Reader (if installed) and open the User Guide to the appropriate page.
Note that if the Guide is already open, the F1 key will have no effect; users must close Reader (or
Acrobat) for F1 to launch and open to the correct page. Alternatively, if the Guide is already open, users
can navigate to the area of interest via the Table of Contents. The User Guide (Virtual_Beach_2_
User_Guide.pdf) can also be opened independently of program operation; it resides within the
Documentation folder of the program's installation folder.

                                                                                     3.0

                                                         Operational Overview

       Virtual Beach 2.2 is simple to operate: its functionality is organized into five functions, each with its
own component or interface:

Beach Location - a mapping tab used to define the orientation of the beach, which provides the basis for
generating orthogonal (alongshore and offshore/onshore) wind, current, and/or wave components for the
beach under consideration; its use is optional. Such components can be powerful predictors of
pathogen indicator levels at the beach, so using the beach definition component is recommended
if the dataset under consideration contains wind, wave or current data. This tab is also useful for
locating nearby NWIS/NCDC climate and water quality data sources for a specific location.

Data Processing - a spreadsheet tab to support data manipulation procedures on an imported
dataset. In addition to wind/current/wave component generation, users can generate new
independent variables that represent the products, means, sums, minimums,  and maximums of
other IVs, as well as common data transformations for the IVs.  Statistical indicators help users
select the best IV transformations in MLR model-building.

Modeling - this tab allows selection of any eligible IVs for consideration in MLR model-building
and model-generation. Model-generation is accommodated by user-selected model evaluation
criteria and automatic generation of the ten best-fit models from a search in which all possible
combinations of predictor variables are tested, or via a heuristic searching algorithm (the Genetic
Algorithm or GA). Regression fit and model variable statistics are generated to help evaluate
the usefulness of predictive variables and overall fit.  Time series and XY scatter plots, as well as
reports on best-fit models, can be viewed and/or saved for further analysis and recording.

Residual Analysis - this tab displays plots of a model's regression residuals, including their
normality statistics, and provides means to eliminate highly influential data records and recalculate
the regression model. Altered data sets can be exported for external use and rebuilt models can be
selected for the prediction tab.

Prediction - this tab consists of three grids where users can enter or import the needed IVs
for the chosen model, enter or import observations that will be compared to  model predictions,
and examine model predictions and exceedance probabilities. Time series and XY scatter plots of
observations versus predictions are shown to help users gauge model effectiveness.

                                                                                   4.0

                                                         Project Management

       Oftentimes the user will put an imported dataset through lengthy pre-processing to prepare
it for analysis.  To avoid repeating all of this work, "project" files can be saved and re-opened via
the Project → Save and Project → Open menu selections. Subsequent opening of a saved project
file will load the processed data sheet and information on the Beach Location tab, including the
beach orientation if the user had defined it.  However, no modeling information is saved inside a
project file.
       In addition to project files, "model" files can be opened and saved using choices under
the "Model" menu at the top of the VB 2.2  interface.  A model file contains information on the
IVs, regression parameters, and other metadata for the currently selected model in the Modeling,
Residual, or MLR Prediction tab.  Whenever a model file is saved, VB 2.2 will prompt the user to
enter a Decision Criterion (DC), Regulatory Standard (RS) and  Threshold Transformation for the
model. These parameters will be used as initial values (they can be changed when the model file is
opened) for later calculations of model sensitivity and specificity, which depend on the numbers of
false negative and false positive model predictions (see Sections 7.6 and 7.7).
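For reference, sensitivity and specificity follow the usual definitions: a true exceedance is an observation above the Regulatory Standard, and a predicted exceedance is a model output above the Decision Criterion. A hedged Python sketch of the standard bookkeeping (illustrative only; VB 2.2's internal code is not reproduced here):

def sensitivity_specificity(observed, predicted, regulatory_standard, decision_criterion):
    """Both thresholds and both data series must be on the same (transformed) scale."""
    tp = fn = tn = fp = 0
    for obs, pred in zip(observed, predicted):
        exceed_obs = obs > regulatory_standard
        exceed_pred = pred > decision_criterion
        if exceed_obs and exceed_pred:
            tp += 1                      # correctly predicted exceedance
        elif exceed_obs:
            fn += 1                      # false negative: missed exceedance
        elif exceed_pred:
            fp += 1                      # false positive: needless advisory
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity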
       When users open a previously saved model file from within VB 2.2,  they are taken directly
to the MLR Prediction tab where they can use the saved model to generate predictions.  Model
files are designed for situations where a statistically-savvy developer is charged with developing
regression models for a number of beach sites. After the developer chooses  a "best" model for
a site, the model file can be saved and then  delivered to the beach manager who will not use VB
2.2 for full-scale model development, but only to input new data, generate predictions, and make
decisions regarding swimming advisories.

                                                                              5.0
                                    Beach Location Mapping  Interface
      On VB 2.2 application startup, the map interface is shown, but users can go directly to the
Data Processing tab if desired.
Figure 2. Beach Location interface - the default map type is Yahoo Map, but users have many
mapping options


5.1 Finding a Location
      The map interface provides map controls that allow users to look up a location manually
by panning and zooming (mouse drag on the map and use of the mouse wheel or zoom control).
Alternately, a decimal latitude/longitude or place name can be entered. The control uses Google
Maps' reverse geo-coding network service to find locations.

The Beach Location tab provides the following controls (see Figure 3):

Zoom Slider - drag the slider up and down to zoom in and out, respectively.

Map Controls - enter a latitude/longitude and click the "GoToLat/Long" button, or enter a place name and
click "GoToPlace."

Map Settings - select a map type from the drop-down menu to change the display in the map window.

Beach Orientation - use the buttons to add or remove markers on the map. Once the beach shoreline is
delineated by placing the 1st and 2nd beach markers, click in the water and then click "Add Water Marker,"
which will place the correct orientation angle into the "Beach Orientation" box.

Show Station Locations - if zoomed in far enough, select a station type and then click "Show Station
Locations" to display such stations on the map.

Current Location - click anywhere on the map to display that point's Lat and Long.

Loading - a map-loading progress bar that shows network download activity for map images.
Figure 3. Beach Location tab controls and their function

5.2 Defining the Beach Orientation

       Map control allows delineation of a beach on the map to ascertain its orientation, which
is useful if wind, wave, and/or current flow components are to be used in MLR model-building.
Maps, as opposed to satellite or hybrid images, provide less shoreline detail, so it is recommended
that the map type be set to a hybrid or satellite image before adding the point locations that define
the beach boundaries.  Once displayed, click on the map (a red marker will appear) and select the
"Add 1st Beach Marker" button; this represents the first point of the extent of your beach shoreline.
Repeat this for the second beach marker and click on the map to indicate which side of the
shoreline represents the water; then hit the "Add Water Marker" button.  Marker points will turn
green as you add them. Once the water marker is added, a shaded box (the beach) appears and the
computed orientation angle will be displayed.
Figure 4. Adding shoreline and water markers to define beach orientation
       Points can be added or removed until the user is satisfied with the beach representation.
To recall the computed beach orientation in the data processing components creation screen (see
Data Processing section below), users can either save and then re-open a project file or they can
note the beach orientation on the mapping screen and manually enter that angle on the components
calculation screen.

5.3 Finding nearby Water Quality, Flow, and Climate Information Sources

       Possible nearby data sources for the area of interest may be located and displayed on the
map.  USGS NWIS and NOAA NCDC station markers at a zoomed-in map area can be located
and displayed by checking appropriate items in the map window and clicking the "Show Station
Locations" button. Note that the "Show Station Locations" button is only enabled when zoomed-
in to an appropriate level (e.g., zoom level three as measured from the top of the zoom control
slider). If either of the selected station categories (NWIS and/or NCDC; the STORET station
category, although present on the control, is not yet functional) is present within the map display
area, they will appear. Also note that the network  server that produces NCDC station locations
restricts location requests to one every 30 seconds - a one-half minute delay is required for
subsequent location requests and an error message will be displayed if the appropriate wait time
has not elapsed. Once station location markers are displayed on the map, hovering over the top-
left hand corner of any station marker will display station ID information.  With that information,
users can visit the appropriate web address to gather water/weather data for the area of interest.
Figure 5. NOAA/NCDC station marker showing station ID information
Figure 6. USGS/NWIS station marker showing station ID information
USGS NWIS web site URL: http://waterdata.usgs.gov/nwis/inventory
NOAA NCDC web site URL: http://www.ncdc.noaa.gov/oa/climate/stationlocator.html

Figure 7.  Beach Location interface showing station markers near Gary, Indiana
5.4 Saving Beach Information in a Project File

        Use the Project → Save menu bar selection to open a Save File dialog and to save the project
information to disk.  Beach marker and angle information is saved in the file name provided; the
saved file can be anywhere, but using the "Project Files" folder (found in the VB 2.2 root install
folder) is recommended.

                                                                                   6.0

                                                                Data Processing

6.1 Data Requirements and Considerations

       VB 2.2 accepts files from Excel 2007 or earlier (Excel 2010 is not currently supported), as
well as comma-separated-value (CSV) text files. Input data must conform to certain standards:

   •   The first row of any data column must be a header with the IV's name. For best operation
        of the software, the column name should be composed of letters, numbers (don't begin the
        column name with a number), and/or underscores, i.e., "_".  Other characters in column
        names can cause problems. An example dataset layout is sketched after this list.
   •   The first (left-most) column of the dataset must be identification for the observations,
       typically a date or time stamp that indicates when the observation was collected. The only
       requirement is that each row MUST have a unique ID.  VB 2.2 will  not import datasets
       with non-unique IDs in the first column. If the first  column is a time stamp, VB 2.2's
       plotting functions will work best if the column is in  chronological order, from earliest to
       most recent observations.
   •   The second column of the dataset will initially be set as the dependent or response variable;
       however, this can be changed after data are imported. Any subsequent columns will be
       considered to be IVs.
   •   Variable measurement units are not considered, but  certainly affect predictions. Make sure
       any data used for predictions are in the same units as those used to build the models; for
       example, do not build a MLR model with water temperature in degrees Fahrenheit, then
       later import water temperature in degrees Celsius for predictions. It is prudent to include
       unit information in the column names (e.g., WaterTemp_C) to remind the user of the proper
       units when making predictions.
   •   Missing data (blank cells) are permitted  on import, but must be dealt with in Data
       Processing prior to modeling.
   •   If present in the imported Excel data sheet (other than in column names or the first ID
       column), cells with non-numeric values (i.e., symbols or text) are turned into empty cells.
       If such non-numeric characters are present in an imported .csv file, they will be imported to
       the data grid, but will be recognized as anomalous data during the required validation scan
       and will have to be dealt with (deleted or turned into a numeric value) at that time.
   •   VB 2.2 recognizes any  column of data with only two different values as categorical.  If
       you have a column of categorical data with more than two values, you can designate it as
       categorical, using methods described below.  The ramification of a variable being identified
       as categorical is that VB 2.2 leaves it out of transformation processes.
   •   There is no hard-coded limit on the number of IV columns one can import; however,
       a practical limit exists that depends on system processing resources.  There is also an
        inherent limit: documentation indicates that the grid components used in the application
        are designed for a maximum of 300 columns before performance issues degrade the
        application.  Modeling 250+ columns of data presents roughly 2 x 10^20 possible data
        combinations for MLR processing. The Genetic Algorithm handles this modeling task,
        but choosing "Run all combinations" would likely take an immense amount of time to
        complete. Depending on how many additional IVs will be created by the user, importing a
        dataset with fewer than 100 IVs should be acceptable.
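For illustration, the first few rows of a CSV file meeting these requirements might look like the following (all column names and values are invented):

tstamp,LogCFU,WaterTemp_C,Turbidity_NTU,WaveHeight_m,WindSpeed,WindDirection
2005-06-04 11:00,1.452,21.3,5.2,0.15,3.1,270
2005-06-05 11:00,0.865,22.0,4.8,0.20,2.4,245
2005-06-06 11:00,1.739,21.8,9.6,0.20,4.0,300

Here the first column is the unique ID (a time stamp), the second column (LogCFU) is initially treated as the response variable, and the remaining columns are IVs with units noted in their names.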

6.2 Importing a Dataset
       When users first click on the Data Processing tab, they open a dataset using the "Import"
button.  This brings up a dialog screen where a directory explorer can be used to find the data file
and open it.  If the dataset is an Excel file with multiple sheets, a dialog box opens to ask the user
which sheet to import.
Figure 8.  Importing a dataset into the Data Processing tab

       Once imported, the data grid is shown as a spreadsheet on the right. The second column
of the spreadsheet will be highlighted in blue to indicate its status as the current response variable.
Information about the dataset, such as number of rows and columns, name of the ID column
and name of the response variable, appear on the left. At this point the grid cannot be edited or
interacted with in any manner. To access additional processing functionality, the data must be
validated.

6.3 Validating the Imported Data

       The "Validate" options window can be accessed by clicking the "Validate" button at the top
of the Data Processing tab. This window primarily launches a required data scan to identify blank
and non-numeric data cells in the imported spreadsheet. However, one can also find and replace
other specified values (e.g., a missing data tag like -999) in the dataset using the "(Optional) Find:"
input box.
Figure 9. Data validation required to begin data processing

       To validate the data, the user clicks "Scan." VB then goes through the spreadsheet, cell
by cell, looking for blanks, non-numeric, or user-specified values entered in the "Find:" input
box. If one of these types of cells is found, the scan will stop to highlight that cell. Users must
decide how to deal with the cell using choices in the "Action" section: they can replace the bad
cell with a specified value, using the "Replace With:" input box, or they can delete the row or
column containing the bad cell. The user must decide where to implement the chosen action
with the "Take Action Within" menu.  Possible choices are "Only this Cell," "Only this Row,"
"Only this Column," "Entire Row," "Entire Column," and "Entire Sheet." Items in this menu
are context-sensitive, i.e., they change depending on which Action is selected. This setup gives
the user flexibility, for example, to delete all rows containing missing values within one specific
column of data (Action would be "Delete Row" taken within the "Entire Column"), and replace all
missing values with a user-specified numeric value within another column of data (Action would

be "Replace With:" taken within "Entire Column").  The cell, row, and column reference will
always refer to the highlighted cell. After setting the "Take Action Within" menu, the user clicks
the "Take Action" button, VB 2.2 makes the specified changes to the spreadsheet, and the scan
continues. When the entire spreadsheet has been scanned and all bad cells have been fixed, VB 2.2
reports that "no anomalous data have been found," and the user can click the "Return" button to
close the Scan window.
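Conceptually, the scan does something like the following pandas sketch (for illustration only; VB 2.2 is a compiled C# application and its internal code is not reproduced here):

import pandas as pd

def scan_for_anomalies(df, find_value=None):
    """Yield (row ID, column name) locations of blank, non-numeric, or
    user-specified values in every column except the first (ID) column."""
    for col in df.columns[1:]:
        numeric = pd.to_numeric(df[col], errors="coerce")
        bad = numeric.isna()                   # blanks and non-numeric text
        if find_value is not None:
            bad |= numeric == find_value       # e.g. a -999 missing-data tag
        for idx in df.index[bad]:
            yield idx, col

# Every location reported by the scan must then be resolved (replace the cell
# with a number, or delete the offending row or column) before modeling.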
       As stated earlier, VB 2.2 will not attempt to transform categorical data columns.  It
automatically identifies columns with only two unique values as categorical, but if the user has
other categorical IVs with more than two categories, those should be identified to VB 2.2 by the
"Identify Categorical Variables" button.
Figure 10. Context-sensitive choices for the "Take Action Within" drop-down menu
6.4 Working with a Dataset Post-Validation
       After the dataset has passed the validation scan, the function buttons across the top of the
Data Processing tab are enabled.

Figure 11.  Post-validation enabling of the Data Processing functionality

       At this point, the grid cells (other than the ID column) are editable - that is, users can
manually enter new numeric data into the cells by double-clicking on a cell and typing in a new
value. VB 2.2 does not allow blank cells or non-numeric data in cells.  Additionally, a right mouse-
click on an IV column header presents options:
Figure 12.  Right-click options on columns that are not the response variable

"Disable Column" turns the column's text red and prevents the column from being passed to the
Modeling tab of VB. Previously-disabled columns can be activated using "Enable Column."  "Set
Response Variable"  will make that IV the new response variable and it becomes blue as a visual
indication of this change. "View Plots" shows a new screen with column statistics at the far left
and four plots for that IV: (1) a scatterplot of the IV versus the response variable in the upper left
panel, (2) a plot of the IV values versus the ID column at the upper right (a time series plot if the
ID is an observation date), (3) a box-and-whiskers plot at the bottom left, and (4) a histogram for
the IV at the bottom right.
Figure 13. Four different plots available for evaluation of IVs

       The scatter plot (upper left) is probably the most-examined, as it can indicate a non-linear
relationship between the IV and the response variable, problems with homogeneity of variance
across the range of the IV, or outliers. Ensuring that the IVs are linearly related to the response
variable raises the probability of producing a robust, meaningful analysis. If the relationship
between the response and the IV is not well-approximated by a straight line (a fundamental
assumption of MLR), it may be beneficial to transform the IV. Using VB 2.2 to accomplish this
will be explained later in this document.  The scatterplot also shows the best-fit regression line in
red, along with the correlation coefficient ("r") and the significance (p-value) of the correlation
coefficient at the top of the plot. For the most part, p-values below 0.05 are considered statistically
significant.
   Identifying odd values (potential outliers or bad data) of any IV can often be done by visually
inspecting these plots.  If users double-click on the data point marker for any observation in one

of the top panels or the bottom left panel (i.e., not the histogram), they can disable that point (the
row) in the data grid.



Figure 14. Disabling an observation from within the XY scatterplot

       The final choice - "Delete Column"-- deletes a column from the data grid, but the original
columns of the imported data sheet (VB 2.2 thinks of these as "main effects") cannot be deleted.
Rows can be disabled and enabled, but not deleted, from the data grid by right-clicking the row
header (far left of each row) and making the desired choice.
       If the user right-clicks on the column header of the response variable, a different set of
choices is shown:

Figure 15. Available choices when right-clicking the current response variable

       Users can transform the response variable in three ways: log10, loge, or a power
transformation (raising the response to an exponent: y^x). They can also un-transform the response,
view the plots shown previously for the IVs, or define a transformation of the response variable.
This option is used when a datasheet is imported with an already-transformed response variable.
For example, users could import a datasheet with log10-transformed fecal indicator bacteria levels
and then define the response as being log10-transformed.  Doing this facilitates later comparisons
with observations, decision criteria, and regulatory standards. When users transform the response
variable within VB 2.2 using the "Transform" option, VB 2.2 automatically defines the response as
having the chosen transformation and, in doing so, synchronizes the units of measurement for later
comparisons.
6.5 Computing Alongshore and Onshore/Offshore Wind, Wave and Current Components

   Orthogonal wind, current, and wave vectors can be powerful predictors of beach bacterial
concentrations. Depending on the orientation of the beach, wind and currents can influence the
movement of bacteria from a nearby source to the beach, and wave action can re-suspend bacteria
buried in beach sediment.  To make more sense of these  data, researchers typically decompose
wind/current/wave magnitude and direction into A (alongshore) and O (offshore/onshore)
components for analysis (see equations at the end of this section).
   If direction and magnitude (speed/height) data are available, A and O components can be
calculated with the "Compute A, O" button. Clicking it  brings up  a window where users specify
which columns of the data grid contain the relevant magnitude and direction data, using drop-down
menus (Figure 16).  There is also an input box at the bottom of the form for the beach orientation
angle. If the user defined the angle on the "Beach Location" tab, that value should be seen here.
After clicking "OK," new data columns are added to the far right of the data grid, representing the
A and O components of the specified wind, current, or wave data.  Unlike the originally imported
IVs,  these components can be deleted from the data grid after they are  created. Names  of these

new columns will be: WindA_comp(X,Y,Z), CurrentO_comp(X,Y,Z), WaveA_comp(X,Y,Z), etc,
where X is the name of the column of data used for magnitude, Y is the name of the column used
for direction, and Z is the beach orientation angle.
Figure 16. Window for computation of alongshore and offshore/onshore components

Notes on wind, wave and current component calculations:
   Direction is an angular degree measure: values are positive moving clockwise from north (0
degrees) and negative moving counter-clockwise. Wind and current
speed (as well as wave height) can be measured in any unit. VB 2.2 adheres to scientific
convention where wind direction is specified as the direction from which the wind blows, while

current and wave directions are specified as the direction toward which the current or waves move.
Thus, wind blowing from west to east has a direction of either 270 or -90 degrees, while a current/
wave moving from west to east has a direction of 90 degrees.
       The A component measures the force of the wind/current/wave moving parallel to the
shoreline (Figure 17). A positive A component means winds/currents/waves are moving from
right to left as you look out at the water. A negative A component means winds/currents/waves are
moving left to right as you look out at the water. The O component measures force perpendicular
to the shoreline. A negative O value indicates movement from the land surface directly offshore
(unlikely to be seen with wave action).  A positive O indicates waves/wind/currents moving from the water to
the shore. These relationships apply no matter how the beach is oriented (Figure 18).
Figure 17. A and O component definitions for wind, current, and wave data

Figure 18. Principal beach orientations given in degrees
Equations for calculation of Wind A/O components:

                    Wind A: -SPD * cos( (DIR - BO) * PI/180 )

                     Wind O: SPD * sin( (DIR - BO) * PI/180 )

where SPD is wind speed, DIR is wind direction, BO is the beach orientation (in degrees), and PI
= 3.1416. The Current A/O and Wave A/O components are these same equations multiplied by -1.
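The same calculation in a short Python sketch (illustrative only; the function names are not part of VB 2.2):

import math

def wind_components(speed, direction_deg, beach_orientation_deg):
    """Alongshore (A) and onshore/offshore (O) components of a wind record.
    Wind direction is the direction the wind blows FROM, in degrees from north."""
    theta = math.radians(direction_deg - beach_orientation_deg)
    a = -speed * math.cos(theta)    # parallel to the shoreline
    o = speed * math.sin(theta)     # perpendicular to the shoreline
    return a, o

def current_or_wave_components(magnitude, direction_deg, beach_orientation_deg):
    """Currents and waves use the direction they move TOWARD, so the wind
    equations are simply multiplied by -1."""
    a, o = wind_components(magnitude, direction_deg, beach_orientation_deg)
    return -a, -o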

6.6 Creation of New Independent Variables
       Users may click the "Manipulate" button to create new columns of data that might serve
as useful IVs.  On the screen that pops up, there is a list of available IVs on the far left, under
"Independent Variables." If users wish to create a new term, they add any available IV used in this
new term by selecting it and using the ">" button to add it to the "Variables in Expression" box.
Clicking and dragging down through the "Independent Variables" list allows for multiple IVs to be
added at once.
Figure 19.  Window for the formulation of "Manipulates" - arithmetic combinations of existing columns
within the data grid

       For example: if users wish to create a new IV that is a row-by-row mean value of the
"centershintemp" and "centerwaisttemp" variables, they add those two to the "Variables in
Expression" box, then choose the "Mean" function, "Add" that expression to the lower box, then
click "OK." That adds a new column of data that represents a row-by-row average of the two IVs,
to the end of the data grid (far right).

Figure 20. Creation of a new IV defined as the mean of two existent IVs

       Users can create a row-by-row sum, maximum, minimum, mean, or product from any
number of IVs that are added to the "Variables in Expression" box.  More than one expression
can be created before the "OK" button is clicked, and IVs can be easily moved in and out of the
box using "<" and ">" keys. Any created expressions can be removed from the lower box with
the "Remove" button. No matter how many IVs are added to the "Variables in Expression" box,
clicking "2nd Order Interactions" will add the cross-products for all possible pairings of those IVs.
Thus, four IVs will produce six interactions, five IVs will produce ten interactions, and so on.
Note that the names of the columns used to create any manipulate are inside the parentheses of that
manipulate's column name.
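
As a concrete illustration of these row-by-row operations (a pandas-based Python sketch using the
example column names above; this is not how VB 2.2 is implemented), a mean column and the full set
of second-order interactions could be built as follows:

    from itertools import combinations
    import pandas as pd

    df = pd.DataFrame({
        "centershintemp": [21.0, 22.5, 20.8],
        "centerwaisttemp": [22.1, 23.0, 21.4],
        "WindSpeed": [3.2, 5.1, 4.4],
        "airtemp": [25.0, 26.2, 24.1],
    })

    # Row-by-row mean of two IVs, appended as a new column at the far right
    df["MEAN[centershintemp,centerwaisttemp]"] = df[["centershintemp", "centerwaisttemp"]].mean(axis=1)

    # 2nd-order interactions: cross-products of all possible pairs of the chosen IVs.
    # Four IVs yield C(4, 2) = 6 interaction columns; five IVs yield 10, and so on.
    ivs = ["centershintemp", "centerwaisttemp", "WindSpeed", "airtemp"]
    for a, b in combinations(ivs, 2):
        df["PROD[{},{}]".format(a, b)] = df[a] * df[b]

    print(df.columns.tolist())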

-------
Figure 21. Formation of two-way cross-products of a set of four existing IVs

       VB 2.2 does not allow previously created "manipulates" — new columns of data created
through the "Manipulate" button — to be further manipulated. Previously-created manipulates will
not appear in the "Independent Variables" section at the left. They can, however, be chosen as the
response variable or deleted from the data grid, using the appropriate menu choices, accessed by a
right-click of the column header.

6.7 Transforming the Independent Variables

       VB 2.2 gives users the ability to transform non-categorical IVs to assist in linearizing the
relationship between the IVs and the response variable, which is a fundamental assumption of an
MLR analysis.  VB 2.2 provides the following transformations, where Xt is the transformed IV and
X is the original IV:

Log10: Xt = log10(X)
Loge: Xt = loge(X)
Inverse:  Xt = 1/X
Square: Xt = X^2
Square Root: Xt = X^0.5
Quad Root:  Xt = X^0.25
Polynomial: Xt = a + bX + cX^2
General Exponent:  Xt = X^e, where the user specifies the value of e

-------
       When users click the "Transform" button, they are presented a choice of transformations to
investigate:
Figure 22. The range of choices for IV transformations

       When users click "Go", the chosen transforms are applied to each non-categorical IV. VB
2.2 then opens a table that allows comparison of the success of each transform using a Pearson
correlation coefficient, a measure of linear dependence between the response variable and the
IVs.  For the polynomial transformation, the Pearson coefficient is calculated as the square root of
the adjusted R2 value derived from the regression of the response on Xt. Because this adjusted R2
value can possibly be negative, an empirically-derived formula is applied when  adjusted R2 values
fall below 0.1:

            Polynomial Pearson Coefficient = (-6.67*RE1^2 + 13.9*RE1 - 6.24)*(R2)^0.5

where RE1 = 1.015 - 1.856*R2 + 1.862*adjR2 - 0.000153*N; R2 and adjR2 are defined by the
regression of the response on Xt, and N = number of observations.
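
A Python sketch of this calculation, based on the reconstruction of the empirical formula above
(illustrative only, not the VB 2.2 source):

    import numpy as np

    def poly_pearson(x, y):
        """Pearson-style score for the polynomial transform of one IV.

        Fits y = a + b*x + c*x^2, computes R2 and adjusted R2, and applies the
        empirical correction when adjusted R2 falls below 0.1.
        """
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(y)
        X = np.column_stack([np.ones(n), x, x ** 2])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ coef) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - rss / tss
        adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2 - 1)   # two predictors: x and x^2
        if adj_r2 >= 0.1:
            return np.sqrt(adj_r2)
        re1 = 1.015 - 1.856 * r2 + 1.862 * adj_r2 - 0.000153 * n
        return (-6.67 * re1 ** 2 + 13.9 * re1 - 6.24) * np.sqrt(r2)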

       The table that VB 2.2 creates groups all transformed versions of each IV by the IV name,
type of transformation, and the associated Pearson coefficient. By default, the transformation
(including the un-transformed version of the IV, denoted by "none") with the largest absolute
value of the Pearson coefficient is highlighted in black text for selection. Users may override the
default selection by left-clicking on the row header of the transformed IV they prefer. They may

-------
also override the default by setting a Threshold percentage and clicking "Threshold Select" on
the left side of the box. This selects the un-transformed IV unless the transformed IV with the
highest absolute value Pearson coefficient exceeds the un-transformed IV Pearson coefficient
by the specified percentage.  In essence, the user is saying, "Unless the Pearson coefficient of
the transformed IV is some % greater than the Pearson coefficient of the un-transformed IV, use
the un-transformed IV." This can be useful because transforming IVs makes interpreting model
coefficients more difficult; unless an improvement is seen, transformation may not be worth the
trouble. Users can also revert to the default by clicking "Go" under the "Auto Select" section at
the left.
Figure 23.  Pearson correlation coefficient scores for judging the efficacy of IV transformations
Plotting Transformed IVs
       Users may prefer to examine plots to determine which transformation of an IV to
choose. If users right-click on a row header in this correlation table, they can view an array of
scatterplots, time series plots, or frequency plots for each data transformation of the IV represented
by that header.  Scatterplots will show the best-fit regression line, the correlation coefficient, and
the p-value for that correlation coefficient.

-------
Figure 24.  Scatterplots (Response vs. IV) for six different data transformations of a single IV

       After choosing a transformation for each IV, users click "OK." This populates the data
grid with new columns representing transformed versions of the IVs.  The small checkbox in the
bottom left corner of Figure 23 controls whether the untransformed version of the IV remains
enabled in the data grid after the user clicks "OK."  When the box is checked, for any IV for which
the user chooses a transformed version, the un-transformed version will be disabled in the data
grid.  Notice that transformed versions of an IV are put into the data grid immediately after the
original, un-transformed IV.
Notes on Transformed IVs
       Any transformations put into the data grid can be deleted with the "Delete Column" choice
after right-clicking on their column header.  Transformed IVs will appear in the list of IVs on the
"Manipulate" screen; however, transformed IVs cannot be further transformed and will not appear
in the transform table if the user goes back to the "Transform" window.
       VB 2.2 transformations have specific processing for certain data values and are not
pure mathematical transformations - they were designed to maintain data order  while helping
to linearize the response-IV relationship. For the SQUARE (b=2), SQUAREROOT (b=0.5),

-------
QUADROOT (b=0.25), INVERSE (b=-1) and GENERAL EXPONENT (b is user-defined)
transformations, VB 2.2 uses the signed equivalent of the mathematical function:

                                x^b == sign(x)*abs(x)^b

For example:   (-2)^2 = -4     (-9)^0.5 = -3     (-4)^-0.5 = -0.5     (-2)^-2 = -0.25

       To avoid potentially undefined values (i.e., 1/x when x = 0), the INVERSE and GENERAL
EXPONENT (if the user sets b < 0) transformations have special processing:

       If x = 0, then VB 2.2 will find the minimum of abs(z), where z is the set of all non-zero
values for the IV in question. For the purpose of computing the transformation, once z is defined,
VB 2.2 substitutes z/2 for x.  From this definition, note that z can be either a positive or negative
number.

       LOG10 and LOGe transforms are also the signed equivalent of the mathematical functions:
                                   loge(x) == loge(x)
                                  loge(-x) == -loge(x)
                                  log10(x) == log10(x)
                                  log10(-x) == -log10(x)

In addition, if (-1 < x < 1), then loge(x) = 0 and log10(x) = 0.

       VB 2.2 will not compute the INVERSE, GENERAL EXPONENT (with a negative b),
LOG10 and LOGe transformations for data columns if more than 10% of the IV's values are zero.
Programmatically, zero is defined as any number whose absolute value is less than 1.0e-21.
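
The signed transformations and the zero-value substitution described above can be sketched in
Python as follows (an illustration of the stated rules, not the VB 2.2 code; it assumes the column
passes the 10% zero-value check):

    import numpy as np

    def signed_power(x, b):
        """Signed power transform: sign(x) * abs(x)**b."""
        x = np.asarray(x, dtype=float)
        if b < 0:
            # For negative exponents, substitute z/2 for exact zeros, where z is the
            # non-zero value of smallest absolute magnitude in the column.
            nonzero = x[np.abs(x) >= 1.0e-21]
            z = nonzero[np.argmin(np.abs(nonzero))]
            x = np.where(np.abs(x) < 1.0e-21, z / 2.0, x)
        return np.sign(x) * np.abs(x) ** b

    def signed_log10(x):
        """Signed log10: sign(x) * log10(abs(x)), and 0 whenever -1 < x < 1."""
        x = np.asarray(x, dtype=float)
        out = np.zeros_like(x)
        mask = np.abs(x) >= 1.0
        out[mask] = np.sign(x[mask]) * np.log10(np.abs(x[mask]))
        return out

    print(signed_power(np.array([-2.0]), 2))      # [-4.]
    print(signed_power(np.array([-9.0]), 0.5))    # [-3.]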
      POLYNOMIAL transformations are the result of a linear regression of the response
variable on the IV and the square of the IV:

                               Poly(X) = a + b*X + c*X^2

where a, b, and c are determined by a multiple linear regression of the response variable on X
and X^2.

      In general, the name of the transformed column of data that VB 2.2 creates is simply
the type of transformation, with the original data column name in parentheses. For example,
WaterTemp would become LOG10(WaterTemp). There are some exceptions, however:

      INVERSE(X,Y):  X is the original data column name and Y is the z/2 value discussed
earlier in this section.

      POWER(X,Y) : When Y is positive, X is the original data column name and Y is the
exponent specified by the user.

-------
       POWER(X,Y,Z) :  When Y is negative, X is the original data column name, Y is the
exponent specified by the user, and Z is the z/2 value discussed earlier in this section.

       POLY(X, a,b,c):  X is the original data column name and a, b, and c are the values of the
polynomial regression coefficients.

       Finally, because transformations are determined by the current response variable, when
users change the response variable in the data grid (using the column header right-click menu), all
transformed IVs in the data grid are erased (a message warns the user).

6.8 Saving Processed Data

    Data can be saved in a project file (Project → Save) at any time during data processing. When
the file is opened,  the data grid will be repopulated as it appeared when the project was saved.
Also, users may highlight the entire table or sections of the table and use Control-C and Control-V
to copy and paste  the data grid into a word processing or spreadsheet application.

6.9 Go to Modeling

       After data  processing is complete, users must click the "Go to Modeling" button to open
the Modeling tab.  If users have already done modeling work and returned to the data sheet
to make changes, they will receive a message that the data sheet has changed and any prior
information  on the Modeling, Residual, or MLR Prediction tabs will be erased.  Users can then
choose to move forward to the Modeling tab or revert to the previous version of the data sheet
prior to making changes.
       The Modeling tab facilitates finding the best model based on criteria selected by the user.
As the number of IVs increases, the number of possible models in the solution space increases
exponentially. Users may select all or a subset of the IVs for consideration in the model to reduce
the size of the solution space.

-------
7.0 Modeling
7.1 Selecting Variables for Model Building

    All eligible IVs are listed in the left column ("Available Variables") under the Variable
Selection sub-tab. Any variable users wish to consider for model inclusion must be moved to the
"Independent Variables" list by highlighting the IV and clicking the ">" button. Any number of
IVs can be moved to or removed from this list.
Figure 25. Selecting variables for MLR processing within the Modeling tab

   As users add or remove IVs from the "Independent Variables" list, the number of possible MLR
models is displayed in the status strip at the bottom right of the application window. The number
of possible models can grow exceedingly large; 66 IVs represent 7.38*10^19 possibilities.  More
than 66 variables produce a number that exceeds the capacity of the program to store it - in such
cases, "more than 9.2e019" is displayed.
7.2 Modeling Control Options
The first decision users make on this tab involves which evaluation criteria will be used to judge
model fitness.  There are currently ten criteria available in the drop-down menu:

-------
   •   Akaike Information Criterion (AIC)
   •   Corrected Akaike Information Criterion (AICC)
   •   R2
   •   Adjusted R2
   •   Predicted Error Sum of Squares (PRESS)
   •   Bayesian Information Criterion (BIC)
   •   RMSE
   •   Sensitivity
   •   Specificity
   •   Accuracy

Figure 26. Setting modeling options within the Modeling interface

       The "Maximum VIF" (Variance Inflation Factor) parameter is used selectively to discard
models that contain variables with a high degree of multi-collinearity, i.e., IVs that are greatly
correlated with other IVs.  If any IV in a model has a VIF exceeding the threshold, that model will
be discarded. The default VIF value used in the application is set to 5.  A VIF of 5 means that 80%
(1/5) of the variability in an IV can be explained by the variability of other IVs in the model. A
VIF of 10 means that 90% (1/10) of the variability can be explained, and so on. If users aren't
concerned with muli-collinearity among the explanatory variables in a regression model, they can
lower the Maximum VIF value.  However, multi-collinearity leads to poorly estimated regression
coefficients (i.e., large standard deviations of these coefficients).
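
For reference, a VIF can be computed by regressing each IV on the remaining IVs and applying
VIF = 1/(1 - R2), which is where the 80% and 90% figures above come from. A Python sketch
(illustrative only, not the VB 2.2 code):

    import numpy as np

    def variance_inflation_factors(X):
        """VIF for each column of X (rows are observations, columns are IVs)."""
        X = np.asarray(X, dtype=float)
        n, p = X.shape
        vifs = []
        for j in range(p):
            y = X[:, j]
            others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            coef, *_ = np.linalg.lstsq(others, y, rcond=None)
            resid = y - others @ coef
            r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
            vifs.append(1.0 / (1.0 - r2))   # VIF_j = 1 / (1 - R2_j)
        return vifs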
       The "Maximum Number of Variables in a Model" parameter tells VB 2.2 how large the
models being evaluated can be. As a rule, most modelers prefer to have about 10 observations per
estimated parameter in their models, otherwise possibilities increase for model over-fitting and
poor estimation of regression parameters. VB 2.2's recommendation is  close to this rule. It equals
(1 + n/10) where n is the number of observations in the dataset. The maximum allowable number
equals n/5. VB 2.2 won't let users set this value over the maximum.  The total number of available
parameters is also given here.
       If we define p as the number of parameters in a model, n as the number of observations in
the dataset, RSS as the residual sum of squares for a model, and TSS as the total sum of squares for
a model, then the evaluation criteria for a model can be defined as:

• Akaike Information  Criterion (AIC): 2p + n*ln(RSS)

• Corrected  Akaike Information Criterion (AICC): ln(RSS/n) + (n+p)/(n-p-2)

-------
• R2: 1 - RSS/TSS

• Adjusted R2: 1 - (1 - R2)(n - 1)/(n - p - 1)

• Bayes (Schwarz) Information Criterion (BIC): n*ln(RSS/n) + p*ln(n)

• Root Mean Squared Error (RMSE): (RSS/n)^0.5

• Predicted Error Sum of Squares (PRESS): 1 - Σ(yi - ŷ(i))^2 / Σ(yi - ym)^2

where yi is the i-th observation, ŷ(i) is the model estimate of the i-th observation when the model
coefficients are fitted with the i-th observation removed from the dataset, and ym is the mean value
of y in the dataset.

• Accuracy: (true positives + true negatives) / number of total observations

• Specificity: true negatives / (true negatives + false positives)

• Sensitivity: true positives / (true positives + false negatives)
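
The following Python sketch computes the fit-based criteria above from a model's fitted values, and
PRESS by leave-one-out refitting (illustrative only; the design matrix X is assumed to include an
intercept column):

    import numpy as np

    def evaluation_criteria(y, fitted, p):
        """Fit-based criteria defined above; p is the number of model parameters."""
        y, fitted = np.asarray(y, float), np.asarray(fitted, float)
        n = len(y)
        rss = np.sum((y - fitted) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        r2 = 1 - rss / tss
        return {
            "AIC": 2 * p + n * np.log(rss),
            "AICC": np.log(rss / n) + (n + p) / (n - p - 2),
            "R2": r2,
            "Adjusted R2": 1 - (1 - r2) * (n - 1) / (n - p - 1),
            "BIC": n * np.log(rss / n) + p * np.log(n),
            "RMSE": np.sqrt(rss / n),
        }

    def press_statistic(X, y):
        """PRESS: 1 - sum((y_i - yhat_(i))^2) / sum((y_i - ybar)^2), where yhat_(i)
        is the prediction for observation i from a model fit without it."""
        X, y = np.asarray(X, float), np.asarray(y, float)
        n = len(y)
        sq_err = 0.0
        for i in range(n):
            keep = np.arange(n) != i
            coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
            sq_err += (y[i] - X[i] @ coef) ** 2
        return 1 - sq_err / np.sum((y - y.mean()) ** 2)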

       Sensitivity, specificity and accuracy are special cases that require users to enter both
a Decision Criterion (DC) and Regulatory Standard (RS) so that true/false positives and true/
false negatives can be defined. The DC is a modeled (predicted) value the user chooses.  Model
predictions above this threshold are considered exceedances, while model predictions below
this value are considered non-exceedances. The RS is typically a safety limit on fecal indicator
bacteria (FIB) levels set by a state or federal agency.  The "Threshold Transform" radio buttons tell
VB 2.2 how to transform the DC and RS for comparison to model predictions and observations.
If a transformation definition is set for the response variable (either manually by the user or
automatically by transforming the response) during data processing, that definition will be set as
the default here. Users should understand that changing the threshold transform definition can lead
to problems when comparing modeling predictions to observations.  Caution should be exercised.
Figure 27. Setting evaluation thresholds and threshold transformation information within the
modeling interface

-------
7.3 Linear Regression Modeling Methods

There are two options for exploring the solution space.
       Manual - this option is for a directed model search.  If the 'Run all combinations' box
       is not checked, a single model including every IV that was added to the "Independent
       Variables" column will be evaluated. If 'Run all combinations' is checked, an exhaustive
       search is performed.  The exhaustive search evaluates every model that can be constructed
       with the selected IVs, but does not evaluate any with more parameters than the value set in
       the "Maximum Number of Variables in a Model" input box. For example, if there are 24 IVs to evaluate
       and the maximum number of IVs in a model is set at 8, the exhaustive routine examines
       every possible 1-, 2-, 3-, 4-, 5-, 6-, 7-, and 8-parameter model. As the number of IVs rises,
       the number of possible models quickly gets so large that the exhaustive routine cannot
       maintain reasonable computation times and the user is advised to switch to the genetic
       algorithm.

       Genetic Algorithm - the Genetic Algorithm (GA) option explores solution spaces too large
       to handle exhaustively. Genetic algorithms are loosely based on the natural evolutionary
       process, in which individuals in a population reproduce and  mutate. Individuals with high
       fitness (regression models that produce small residuals) are more likely to reproduce and
       pass their genes (IVs) to the next generation. The goal is to find a good solution without
       having to examine every possible option and the GA balances random and directed
       searching.

-------
Figure 28. Model building interface using a manual search (left panel) or the Genetic Algorithm
(right panel)

         Choosing between an exhaustive and a GA search depends on your data set, available
hardware and time constraints.  Fifteen IVs produce about 32,000 model possibilities; on our
system (Dell  Precision T5400 workstation running MS Win XPSP3 w/ dual Xeon 2.66 GHz
processors having 4 GB RAM), the exhaustive search was completed in approximately 90 seconds.
Sixteen IVs represent more than 65,000 possibilities which is more than double that of 15 IVs.
Some model building results are summarized below:
Exhaustive Search - Run All Combinations

Number of IVs    Number of MLR models    Approximate Time Required to Generate
                                         and Filter Models (seconds)
15               32767                   90
16               65535
17               131071                  280
    By contrast, the GA with 17 IVs was completed in less than seven seconds. We note, however,

-------
that the exhaustive search did find a slightly better model than the GA did using the selected AIC
evaluation criterion (49.2 versus 55).
   An alternative modeling strategy could be to use the GA on your entire list of IVs, then the
exhaustive search on a subset of the initial IVs - any IV that appears in one of the best ten models
found by the GA. This two-step modeling process is facilitated with the "IV Filter" list control.

Figure 29. Using the IV filter to select a subset of variables from the best-fit models

       When the GA ends and the 10 best models are shown, use the "Clear List" button to
remove all IVs from the selection list.  Select a model from the "Best Fits" list one at a time and
click the "Add to List" button; this action adds any IVs in the model to the Independent Variable
list. After doing this for the ten best models, users likely have a much more manageable IV list
and can run an exhaustive search to find the very best combination of IVs. Regardless of the
method chosen to build models, the "Best Fits" window shows the top ten models found, in terms
of the evaluation criterion chosen.
7.4 Using the Genetic Algorithm
There are five parameters users can set to adjust performance of the GA:

     a)  Seed value:  seeds the internal random number generator that produces random values.  Setting this
         seed to a known value will make the GA run reproducible.  Changing the seed will create
         a new series of random values, possibly returning different results.
     b)  Population size: number of individuals in the population of each generation. A larger
         population broadens the search at each generation, but slows processing time.
     c)  Number of generations: how long to run the search since individuals can reproduce
         and mutate once each generation. The fitness of every individual in the population is
         evaluated at the end of each generation.
     d)  Mutation rate: chance each individual has of undergoing random mutation in each
         generation. The higher the mutation rate, the more random (less directed) the search of

-------
         parameter space is.
     e)  Crossover rate:  probability that two selected individuals in the population will exchange
         genome parts. Exchanging genes creates new individuals in the population.

       The best GA parameter values depend on the dataset being investigated, but typical values
of the mutation rate are between 0.001 and 0.1 (0.1 and 10%) and typical values of the crossover
rate are between 0.4 and 0.75 (40 and 75%).  For most datasets, a population size and generation
number of 100 will be sufficient. Larger datasets may require an increase in these numbers for
optimal solutions.
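
The sketch below shows how these five parameters typically interact in a simple genetic algorithm
for IV-subset selection. It is a toy Python illustration of the general approach, not the algorithm
implemented in VB 2.2; the supplied fitness function is assumed to return a value to minimize (for
example, the AIC of the corresponding regression model) and to penalize the empty subset.

    import random

    def run_ga(ivs, fitness, pop_size=100, generations=100,
               mutation_rate=0.01, crossover_rate=0.5, seed=None):
        """Each individual is a 0/1 mask over the IVs; lower fitness is better."""
        rng = random.Random(seed)               # a fixed seed makes the run reproducible
        n = len(ivs)
        pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]

        def tournament():
            a, b = rng.choice(pop), rng.choice(pop)
            return a if fitness(a) <= fitness(b) else b

        for _ in range(generations):
            new_pop = []
            while len(new_pop) < pop_size:
                p1, p2 = tournament(), tournament()
                if rng.random() < crossover_rate:    # exchange genome parts
                    cut = rng.randrange(1, n)
                    child = p1[:cut] + p2[cut:]
                else:
                    child = p1[:]
                if rng.random() < mutation_rate:     # random mutation of one gene
                    pos = rng.randrange(n)
                    child[pos] = 1 - child[pos]
                new_pop.append(child)
            pop = new_pop
        best = min(pop, key=fitness)
        return [iv for iv, keep in zip(ivs, best) if keep]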
Figure 30.  Genetic algorithm options within the modeling interface
7.5 Evaluating Model Output
       After selecting a method to build models and an evaluation criterion to rank them,
users then click the "Run" button. Model selection and evaluation progress is displayed on the
"Progress" graph at the lower right of the Modeling tab. Note that the "Run" button changes to
"Cancel;" the process is interruptible should progress be unacceptably slow. Once model-building
is completed, the ten best MLR fits are displayed in the "Best Fits" box. Selecting a model from
the list results in (see Figure 31):

        1.   A list of the model's IVs with associated regression coefficients and statistics is
           displayed on the "Variable Statistics" subtab.
       2.   A list of the model's evaluation metrics is shown on the "Model Statistics" subtab.
       3.   The "Results" subtab will show the observations and model fits versus the observation
           number. If observations are chronologically ordered, this is basically a time series plot.
       4.   The "Observed versus Predicted" subtab can show plots and tables based on
           observations versus model fits.

-------
        5.  The "ROC Curves" subtab shows a plot of the Receiver Operating Characteristic curve
            of each "Best Fits" model, as well as a table showing the computed AUC (area-under-
            the-curve) for each ROC (see Section 7.7).
        6.  Clicking on "View Report" generates a text report of model and variable statistics for
            the selected model.
        7.  The "Residuals" tab will appear at the top, allowing users to proceed to the residual
            analysis component of the application.
        8.  The "Prediction" tab  will appear at the top, allowing users to proceed to the prediction
            component of the application.

        Note that selecting a different model from the "Best Fits" list updates the Variable and
Model statistics tables and displays of the plotting  subtabs.
Figure 31.  Modeling results shown after completion of an exhaustive regression run

-------
Figure 32. Modeling Interface showing variable statistics for the selected Best-Fit model
Figure 33. Modeling interface showing model evaluation metrics for the selected Best-Fit model

-------
Figure 34.  Modeling interface showing a time series plot for the selected model

-------
Figure 35. An XY scatter plot of observed versus predicted values for the selected model

-------
Figure 36. The ROC curves and AUC table for the Best Fit models
7.6 Viewing X-Y Scatterplots

       In multiple locations within VB 2.2 (Modeling, Residual and MLR Prediction tabs), users
can access a subtab that allows them to view information for comparing observations to model
predictions (Figure 35). From this space, users can view four different pieces of data:

1) A plot of predictions versus observations: "Pred vs. Obs"
2) A table summarizing model errors (false negatives/false positives) as the decision criterion (DC)
varies across the range of the response variable: "Error Table: DC as CFU"
3) A plot of the percent probability of exceedance (calculated based on the current DC) versus
observations: "% Exc vs. Obs"
4) A table summarizing model errors as the percent of probability of exceedance is varied: "Error
Table: DC as % Exc"

-------
       These four views are chosen with the drop-down menu at the top left corner of the form. On
either plot, a right-button click in the plot area shows a menu of functions for saving, copying,
printing, or manipulating the plot view. The plot area can be zoomed and un-zoomed: dragging with
the left mouse button zooms in on an area; right-clicking and selecting "Un-Zoom" or "Set Scale to
Default" shows the entire data set. To pan to an area of the plot not in view, hold the Shift
key down and use the left mouse button to drag the view.  To view (x,y) values of any data point,
hover the cursor over the data point.  If the information does not appear, right-click on the graph
and make sure "Show Point Values" is selected.
       Regarding interpretation of these plots, the green (Regulatory Standard) and blue
(Decision Criterion) lines permit model evaluation and provide information on which to base a
DC to be used for predictive purposes.  On the plots, false positives are data points in the
upper left quadrant of the graph, where the model predictions exceed the DC but the observations
are below the RS. In such cases, a beach advisory would be incorrectly issued based on the
model prediction, leading to potential economic losses. False negatives (points in the lower right
quadrant) represent a potentially more serious scenario: model predictions below the DC and
observations that exceed the RS.  In other words, swimming at the beach may have been allowed
when it should have been prohibited due to elevated FIB concentrations.
       A model that produces no false positives or false negatives would be  an ideal decision
tool, but this is often unattainable with real data. Examining the two tables (#2 and #4 mentioned
above) on this subtab should allow users to set a robust DC  (either using units of the actual
response variable or a percentage probability of exceedance) that minimizes both errors.  Note that
in most cases, the RS is set based on federal or state law and should not be adjusted by the user;
however, the user is free to adjust the DC to minimize false negatives and false positives.

7.7 ROC Curves

       In addition to time series and scatterplots which show results  for an individual model,
users may also compare all "Best Fits" models using the ROC Curves tab. A Receiver Operating
Characteristic curve shows a model's true positive rate (sensitivity) plotted against its false positive
rate (1  - specificity) as a decision threshold varies  between the model's minimum and maximum
predicted values. Models can then be compared using the area under their ROC curves (AUC).
Models having the largest AUC values perform best over  the entire decision  space.
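
As an illustration of how such a curve can be computed (a Python sketch, not the VB 2.2
implementation), the true and false positive rates are tallied while the decision threshold sweeps
across the predicted values, and the AUC is the trapezoidal area under the resulting curve:

    import numpy as np

    def roc_curve_auc(obs, pred, regulatory_standard):
        """ROC points and AUC for one model; an observation is a true exceedance
        when obs > regulatory_standard, and a call is positive when pred >= threshold."""
        obs, pred = np.asarray(obs, float), np.asarray(pred, float)
        actual = obs > regulatory_standard
        fpr, tpr = [0.0], [0.0]
        for t in np.sort(np.unique(pred))[::-1]:          # sweep thresholds high to low
            called = pred >= t
            tpr.append(np.sum(called & actual) / max(actual.sum(), 1))       # sensitivity
            fpr.append(np.sum(called & ~actual) / max((~actual).sum(), 1))   # 1 - specificity
        fpr.append(1.0)
        tpr.append(1.0)
        auc = sum(0.5 * (tpr[i] + tpr[i - 1]) * (fpr[i] - fpr[i - 1])         # trapezoid rule
                  for i in range(1, len(fpr)))
        return fpr, tpr, auc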
       The model with the largest AUC appears in red text in the ROC tab's model list. A single
ROC may be  plotted by selecting a model in the list and clicking "Plot."  Multiple models can
be selected in the usual Windows fashion with Shift-Click (select all  items between the first and
second selection) or Control-Click (select only the clicked items). The background cell color of
models not selected for plot display will be gray after the  "Plot"  button is clicked.
       Clicking "View Table" will replace the ROC plot with a table showing the false positives,
false negatives, sensitivity, and specificity at every evaluated value of the Decision Criterion for a
single selected model. Users need only  click on a model in  the list to the left of this table to see its
results.  The ROC plot will return to view after clicking "View Plot."
       AUC calculations are performed and curves are plotted when the "ROC Curve" tab is
selected. If this tab is active and new models are subsequently built,  leaving this  tab and then
returning will generate the new plots and AUC values.

-------
7.8 Cross-Validation

       Clicking the "Cross-Validation" button on the Modeling tab brings up a sub-screen. On it
users can set two parameters:  sample size for the testing data (T) and number of random samples
(R) taken. When cross-validation is started, a random sample of size T is taken from the modeling
dataset and set aside.  Each "Best Fits" model is then re-fit to the remaining training data. The
IVs in each model stay the same, but the regression coefficients are adjusted to reflect the least-
squares fit to the training data. The Mean Squared Error of Prediction (MSEP) is then calculated
based on the T testing data points for each candidate model. The process (taking a random testing
sample; re-fitting regression coefficients for the ten candidate models based on the training data;
using the re-fit models to make predictions; and computing 10 MSEP values) will be done R times.
A table will show average MSEP values for each candidate model.
       Cross-validation is a widespread, useful technique for examining the predictive power of
models, i.e., their ability to make predictions for data they have not seen before. For users wishing
to emphasize the predictive ability of a potential model, cross-validation allows them to evaluate
which candidate model consistently makes the best predictions (i.e., has the lowest MSEP). Note
that the PRESS statistic Virtual Beach 2.2 provides as a model evaluation criterion is a cross-
validation statistic with T set to 1.  The PRESS algorithm removes one  observation at a time from
the dataset, re-fits the model regression coefficients, and then calculates the squared residual for the
removed observation.  It does  this once for every observation in the dataset to compute the model's
PRESS value — a confined look at  a model's predictive potential.
       Recommended values are approximately 25% of the total number of observations for the
testing sample size (T) and 500-1000 trials (R).
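
A Python sketch of this repeated holdout procedure for a single candidate model (illustrative only;
the design matrix X is assumed to include an intercept column):

    import numpy as np

    def cross_validate_msep(X, y, test_size, n_trials, seed=0):
        """Average MSEP over n_trials random holdouts of test_size observations."""
        rng = np.random.default_rng(seed)
        X, y = np.asarray(X, float), np.asarray(y, float)
        n = len(y)
        mseps = []
        for _ in range(n_trials):
            test = rng.choice(n, size=test_size, replace=False)
            train = np.setdiff1d(np.arange(n), test)
            coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)   # re-fit coefficients
            resid = y[test] - X[test] @ coef
            mseps.append(np.mean(resid ** 2))                            # MSEP for this trial
        return float(np.mean(mseps))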
Figure 37. The cross-validation results for each of the 10 best-fit models
7.9 Report Generation

       A text report of modeling results can be generated, copied to the system clipboard, or saved
to a text file using the "View Report" button.  Users can view the report within VB 2.2 by selecting
the desired models and clicking on "Generate Report for Selected Models." The report contains

-------
descriptive statistics for each model variable and model evaluation statistic. Any number of best-
fit models can be selected for reporting.
        A recommended approach to saving the information in an external application is to copy
the report to the clipboard (with the "CopytoClipboard" button) and paste it into a rich-text
application like MS Word, Write or WordPad. NotePad or other text editors will work, but column
formats will likely be lost and make the report difficult to interpret.
Figure 38.  A text report generated on the modeling results

        Comparative bar graphs can be displayed to view evaluation criteria for all top models.
Click on "View Evaluation Graphs" to see these plots. Hover the mouse over any plot to display
the relevant evaluation criteria and hovering over any bar displays the associated model. Note
that the  evaluation criteria graphs are scaled to emphasize differences between the model scores
although the difference may, in fact, be quite small. With the cursor over any graph, right-mouse
click and select "Set Scale to Default" to view the un-scaled graph.

-------
Figure 39.  Plots of the various model evaluation metrics for the 10 best-fit models
Figure 40.  Scaled versus un-scaled views of selected model evaluation criterion

-------
8.0 Residual Analysis

       Once a model is selected in the "Best Fits" window on the Modeling tab, the "Residuals"
and "MLR Prediction" tabs appear at the top of the interface. Users may click "Residuals" to view
information about residuals of the selected model, but this is not mandatory; they may take the
selected model immediately to prediction mode by clicking on "MLR Prediction."  There are four
subtabs on the Residuals tab:  Predicted vs Residuals, Observed vs Predicted, DFFITS, and Cook's
Distance.
Figure 41.  Information available on the Residuals tab, including a plot of Studentized residuals
versus predictions, the Anderson-Darling residual normality test, and regression statistics

       The Predicted vs Residuals subtab shows a graph of the Studentized residuals versus their
predicted model values.  The Anderson-Darling Normality Statistic (http://en.wikipedia.org/wiki/
Anderson-Darling) is shown with its significance (p-value). Linear regression assumes normally-

-------
distributed residuals, so if this A-D normality test fails (the A-D p-value is less than 0.05), the user
should 1) transform the response variable, 2) transform some of the IVs, or 3) consider deleting
problematic high-leverage observations, which can be done on this tab.
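
For users who want to reproduce this check outside VB 2.2, the Anderson-Darling test is available
in scipy. Note that scipy reports the test statistic with critical values at fixed significance
levels rather than an exact p-value, so this sketch flags non-normality at the 5% level instead of
reporting the p-value shown on the tab:

    import numpy as np
    from scipy import stats

    def check_residual_normality(residuals):
        """Return the A-D statistic and whether it exceeds the 5% critical value."""
        result = stats.anderson(np.asarray(residuals, dtype=float), dist="norm")
        crit_5pct = result.critical_values[list(result.significance_level).index(5.0)]
        return result.statistic, result.statistic > crit_5pct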
Figure 42.  Plot of studentized predictions vs. residuals and the A-D test of normality

       On DFFITS and Cook's Distance subtabs, observations are sorted by the largest (absolute
value) respective measure in a grid at the left. A plot of the DFFITS/Cook's Distances for each
record (observation) versus the Record ID is shown at the right.  Data points with very large
DFFITS/Cook's Distances (i.e., those lying outside the horizontal red boundaries on the graph)
distort the fitted values and standard deviations of the regression coefficients.

-------
Figure 43.  A table and plot of the DFFITS scores for the residuals

        Clicking the Iterative Rebuild "Go" button removes the observation with the largest
absolute value DFFITS/Cook's Distance, re-fits the regression, and calculates new DFFITS/Cook's
Distances for the remaining observations. This model is named "Rebuild1," and it is added to the
"Models" window at the top left of the screen. Clicking the Iterative Rebuild "Go" button again
produces a model called "Rebuild2," which is calculated after removing the observation with the
largest absolute value DFFITS/Cook's Distance remaining in the dataset (i.e., the 2nd largest
absolute value in the original dataset). The user can continue to click "Go" and remove
observations with the largest remaining DFFITS/Cook's Distances, thus creating "Rebuild3,"
"Rebuild4," "Rebuild5," etc. VB will not allow a user to delete any observations if 10 or fewer
observations remain in the dataset.
       Whenever a "rebuild" is created by pressing "Go," the information displayed on the
Residual tab (variable and model statistics, Observed vs Predicted plot, Predicted vs Residuals
plot, DFFITS values,  etc.) is automatically updated to reflect this new model (even if another
model is highlighted in the "Models" window). However, the user can select any model in the
"Models" window to view its associated data and plots.
       The user has complete freedom to carry out the outlier removal process while toggling
back and forth between the DFFITS and Cook's Distance subtabs. For example, the first removal
can be based on a DFFITS value, the next removal can be based on a Cook's Distance, the next
two removals can be based on DFFITS, etc.  If the user wishes to  clear the "Models" window for
whatever reason, simply click the "Clear" button.
       Rather than using Iterative Rebuild, the user has  two additional choices for Auto Rebuild,
both of which remove all observations above some threshold. The "iterative threshold" choice
bases removals on a threshold that is updated every time an observation is deleted. For DFFITS,
this threshold is 2*(p/n)^0.5, where p is the number of IVs in the model and n is the current number
of observations in the dataset.  For Cook's Distance, the threshold is 4/n.

-------
Figure 44. DFFITS/Cook's Distance controls for removing highly influential data points

       In the "iterative threshold" process, step one is to check if any DFFITS/Cook's Distances
are above the threshold; if so, VB removes the observation with the largest absolute value DFFITS/
Cook's Distance and then recalculates the regression model, the DFFITS/Cook's Distances, and the
threshold (because n has been reduced by 1).  VB then checks to see if any of these new DFFITS/
Cook's Distances are above the new threshold. If so, the process repeats. VB will continue until
no DFFITS/Cook's Distances remain that exceed the current threshold, or until half of the dataset
has been removed, whichever comes first. For example, if a dataset has 100 observations, VB will
allow 50 to be removed before it breaks out of the Auto Rebuild removal loop. At that point the
user can click the Auto Rebuild "Go" button again to potentially remove another 25 observations
of the remaining 50. We note that, in practice, one should not remove more than 5-10% of the
original dataset as outliers; the need to remove more indicates a poor MLR fit and warrants a
different analytical technique.
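
A Python sketch of the iterative-threshold loop just described, using statsmodels to obtain the
DFFITS values (illustrative only, not the VB 2.2 code):

    import numpy as np
    import statsmodels.api as sm

    def auto_rebuild_dffits(X, y, min_fraction_remaining=0.5):
        """Repeatedly drop the most influential observation while any |DFFITS|
        exceeds 2*sqrt(p/n); stop once half the starting observations are gone.

        X: matrix of IVs (an intercept is added here); y: response values.
        """
        X, y = np.asarray(X, float), np.asarray(y, float)
        start_n = len(y)
        while len(y) > start_n * min_fraction_remaining:
            fit = sm.OLS(y, sm.add_constant(X)).fit()
            dffits, _ = fit.get_influence().dffits
            threshold = 2.0 * np.sqrt(X.shape[1] / len(y))   # 2*sqrt(p/n), p = number of IVs
            worst = int(np.argmax(np.abs(dffits)))
            if abs(dffits[worst]) <= threshold:
                break                                        # nothing exceeds the threshold
            X = np.delete(X, worst, axis=0)                  # remove the most influential record
            y = np.delete(y, worst)
        return X, y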
       Using the "constant threshold" Auto Rebuild option differs from the "iterative threshold"
only in that the threshold remains static (i.e., the value the user types into the input box) regardless
of how many observations are deleted. Updated DFFITS/Cook's Distances are still  calculated after
every removal event. VB will also stop this process if half the number of starting observations
has been deleted.  There is an upper limit to the number that can be entered into the "constant
threshold" input box (DFFITS = 3, Cook's Distance = 16/n).
       Upon completion of the Auto Rebuild process, multiple models may have been added to
the "Models" window. For example, if 10 observations were removed, then "Rebuildl" through
"RebuildlO" will appear in the "Models" window.
       If a user has  interest in both DFFITS and Cook's Distances as outlier metrics, we suggest
one of the following methods:

1) To see if the two criteria would produce different results:
   Apply DFFITS removal to your model of choice.  Note the results and then clear the Residual
   tab using the "Clear" button. Next perform a removal process based on Cook's Distance and
   compare the results.

2) To filter out observations that violate either the DFFITS or the Cook's Distance criterion:
   Run DFFITS removal on the model (i.e., remove all observations above your specified DFFITS
   threshold), then click the Cook's Distance subtab and perform additional outlier removal based
   on its threshold.  After this process, remaining observations are "OK" from the perspective of
   both metrics.

       Note that the highlighted model in the "Models" box is used if the "MLR Prediction" tab is
clicked, not necessarily the model whose information is displayed on the Residuals tab. Also note
that any observations removed from the "Residuals" tab are not removed from the primary dataset
shown  on the "Data Processing" tab.
Viewing the Data Table
       From the DFFITS or Cook's Distance subtabs, users can click on "View Data Table" to
display a history of the observation removal process for the model highlighted in the "Models" box.
From this window, users may export the dataset for external use or re-importation into VB 2.2.
[Screenshot of the "Model Data" window: an upper grid listing records eliminated from the model dataset (rebuild name, DFFITS/Cook's Distance value, residual type, and date), and a lower grid showing the model dataset with inactive (removed) records displayed in red, along with a "Save Data" button for export.]
Figure 45. "View Data Table" window for examining the dataset after removal of influential data
points

       The "Observed vs Predicted" subtab is the same as that in Section 7.6.  There are two plots
and two tables to examine, along with controls to modify the Decision Criterion (blue horizontal
line) and Regulatory Standard (green vertical line), to judge the effects these changes have on model
outcomes (false positives, false negatives, sensitivity, specificity, etc.).

[Screenshot: the "Observed vs Predicted" subtab, with a plot view selector, Plot Thresholds controls for the Decision Criterion (horizontal) and Regulatory Standard (vertical), Threshold Transform options, Model Evaluation statistics (false positives/Type I, specificity, false negatives/Type II, sensitivity, accuracy), and the Predictions vs Observations scatterplot with its decision and regulatory threshold lines.]
Figure 46.  Observed vs. Predicted plot on the Residual tab with model evaluation threshold
control and model evaluation statistics

[Screenshot: the Residuals tab with the "Models" list (SelectedModel, Rebuild1, Rebuild2, Rebuild3), the Variable Statistics table (parameter, coefficient, standardized coefficient, standard error, t-statistic, p-value), the A.D. normality statistic and its p-value, and the Predictions vs Studentized Residuals plot.]
Figure 47. Residuals interface showing a list of rebuilt models resulting from observation
deletions, and the associated statistics and residual plots for these rebuilds

-------
9.0 Prediction
       The MLR Prediction interface allows users to estimate or predict FIB concentrations with
a selected regression model. Whether a user was previously on the Modeling tab (with a model
selected in "Best Fits") or on the Residuals tab (with a model selected in "Models"), the interface
of the MLR Prediction tab will look the same.
9.1 Model Statement

       At the top is the linear expression for the chosen model, with values of the regression
coefficients and names of each IV in the model (Figure 48).
9.2 Model Evaluation Thresholds

       There are input boxes for the Decision Criterion (DC) and Regulatory Standard (RS).
Setting these allows model predictions to be evaluated and model specificity, sensitivity, and
accuracy to be calculated. When users first arrive at the Prediction tab, values of the DC and
RS will be set to what was on the Modeling tab. The "Threshold Transform" button tells VB
2.2 how to transform the DC and RS for comparison to model predictions and observations. If a
transformation was defined for the response variable during data processing (whether specified
manually by the user or applied automatically when the response was transformed), that definition
will be the default here. Users should be aware that changing the threshold transform can cause
problems when comparing model predictions to observations, so caution should be exercised.
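       As a simple illustration of why this matters (not VB 2.2 code), the sketch below puts a
Decision Criterion entered in regulatory units onto the scale of a log10-transformed response; the
value 235 is the one shown in the figures, and the transform choice is hypothetical:

    import math

    def transform_threshold(value, how, exponent=1.0):
        """Put a Decision Criterion or Regulatory Standard on the same scale as model predictions."""
        if how == "none":
            return value
        if how == "log10":
            return math.log10(value)
        if how == "ln":
            return math.log(value)
        if how == "power":
            return value ** exponent
        raise ValueError("unknown transform: " + how)

    # A criterion of 235 (e.g., CFU/100 mL) compared against Log10 predictions:
    print(transform_threshold(235, "log10"))   # about 2.37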

[Screenshot: the MLR Prediction tab. The model statement reads "LogCFU = 1.8228075 - 0.00067864774*(uv) + 1.6810716*(waveheight) - 0.0030005423*(WindDirection)". Below it are the Model Evaluation Thresholds inputs (Decision Criterion and Regulatory Standard, both 235) with Threshold Transform options (None, Log10, Ln, Power), the "Import IVs" and "Import Obs" buttons, and the empty Predictive Record grid.]
Figure 48. The MLR Prediction interface
9.3 Prediction Form
       Most of the prediction form is in three separate data panels:  the left panel holds IV data;
the middle panel is for observational  data, e.g., lab results of FIB concentrations; and the right
section shows model predictions and evaluation metrics. Each panel also contains a column for a
unique ID for each row of data (e.g., the date that data were collected). The panels have separate
horizontal and vertical scroll bars that become visible if the number of rows or columns exceeds
the viewable area.  The three panels independently scroll horizontally, but scroll as a group
vertically. Panels can be re-sized by clicking and dragging the blue vertical  partitions. Order
of the columns in the left and right panels can be changed by clicking and dragging the column
headers left or right.
       Users can import IV and observational data from a file using "Import IVs"  and "Import
Obs" buttons in the "Prediction Form" button bank located in the middle right of the screen, or
users can type data into the input grids. Either way, they should be certain that the entered IV data
are in the same units as those used to build the model.
       Depending on which model was selected for prediction, the IV panel will have one column
for every unique IV that appears in the model, plus a column for the row's unique ID.  When a data

-------
file is imported with the "Import IVs" button, a "Column Mapper" window opens.  This window
allows users to tell VB 2.2 which columns in the imported datasheet should be used for the row
IDs and each IV found in the model. By default, the first column of the imported file maps to the
ID field, but users can choose another column if needed. If a column in the imported spreadsheet
has an identical name to an IV in the model, that column will be automatically selected by VB 2.2
as the appropriate one for that IV.
[Screenshot: the "Column Mapper" window pairing each model variable (uv, waveheight, WindDirection) with a column from the imported file, with Ok and Cancel buttons.]
Figure 49. Importation of IV data using the "Column Mapper" window

      As with IV data, observational data can be typed into the middle panel or imported
using "Import Obs." For observational data, only two columns are needed:  row IDs for
every observation and the actual observations. A "Column Mapper" window appears when
observational data are imported from a file. After they have been imported or manually entered,
users can specify the scale/transformation of the observations for a proper comparison to model
predictions.  This is done by right-clicking on the "Observation" column header and defining the
transformation: none, log10, loge, or a power transformation.  "None" is the default choice. For
example, if Log10 observations are imported, the user would need to change the right-click menu
choice to "Log10."
Figure 50. Importation of observational data using the "Column Mapper" window

-------
       The "Make Predictions" button remains disabled until the IV data (imported from a file or
manually typed) are validated using the "IV Data Validation" button.  This scan ensures there are
no blank cells or non-numeric data in the IV columns of the IV data panel and checks that every
row ID is unique (non-numeric data are allowed for the ID column).  This validation scan window
is very similar to the one on the Data Processing tab; however, "Delete
Column" is not a choice. "Replace With" and "Delete Row" are the  only ways to deal with
problems in the IV data grid.
[Screenshot: the MLR Prediction tab with IV data (ID, uv, waveheight, WindDirection) loaded into the left panel and the IV Data Validation window open, offering an optional "Find" field with Scan and Cancel buttons.]
Figure 51. The IV validation window on the MLR Prediction tab

       Once IV data have been validated, clicking the "Make Predictions" button will generate
model predictions.  Observational data need not be present to make predictions, but observations
are needed for model evaluation (sensitivity, specificity, false negatives, false positives, etc.). After
clicking "Make Predictions", VB 2.2 uses the model, IV data, and observational data to fill the
right panel with the following data columns:  ID, Model Prediction, Decision Criterion, Regulatory
Standard, Exceedance Probability, and Error Type.

[Screenshot: the MLR Prediction tab after predictions have been made. The left panel holds the IV data (ID, uv, waveheight, WindDirection), the middle panel holds the imported observations, and the right panel lists the ID and Model_Prediction columns produced by "Make Predictions", alongside the "IV Data Validation", "Make Predictions", "Import IVs", "Import Obs", "Plot", "Clear", and "Export As CSV" buttons.]
Figure 52. A prediction grid after IVs and observational data have been imported, and model
predictions have been made

       The ID column of the model output panel is taken directly from the ID column of the IV
panel, not the observation panel. The "Make Predictions" button makes one model prediction
per row in the IV data panel, regardless of how many observations are entered in the observation
panel.
       The Model Prediction column contains predicted values of the response variable. Right-
clicking on this column header allows the user to change how the predictions are displayed in the
table (as linear, log, or power units).  The Decision Criterion and Regulatory Standard are values
set by the user (shown in the left panel as transformed by the choice of "Threshold Transform").
The Exceedance Probability (actually the probability x 100) is defined as the probability that
the model prediction will be larger than the Decision Criterion, based on uncertainty bounds
(confidence intervals) around the model predictions.
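       VB 2.2's exact calculation is not reproduced in this document; one common way to obtain
such a probability, assuming approximately normal uncertainty around each prediction, is sketched
below with purely illustrative numbers:

    from scipy.stats import norm

    def exceedance_probability(prediction, std_error, decision_criterion):
        """Probability (x 100) that the true value exceeds the Decision Criterion,
        with the prediction and criterion expressed on the same (transformed) scale."""
        z = (decision_criterion - prediction) / std_error
        return 100 * (1 - norm.cdf(z))

    # Hypothetical log10-scale prediction of 1.84 with a standard error of 0.35,
    # compared to a log10 Decision Criterion of about 2.37:
    print(exceedance_probability(1.84, 0.35, 2.37))   # roughly 6-7 percent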
       To compare model predictions to observations, VB 2.2 looks at the prediction ID and
attempts to find an observation in the observation panel with that same ID. VB 2.2 does not
require unique IDs for each row in the observation panel, but note that a model prediction is
compared to the first observation found with the same ID.  When comparing model predictions
to observations, an error (false exceedance or false non-exceedance) appears in the "Error Type"
column.
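       The sketch below illustrates this matching and error-typing logic; the classification rule
shown (prediction compared to the Decision Criterion, matched observation compared to the
Regulatory Standard) follows the false positive/false negative definitions used earlier in this guide
and is an assumption about, not a transcription of, VB 2.2's internal rule:

    def error_type(prediction, observations_by_id, row_id, dc, rs):
        """Return the error label for one prediction, or None if no error applies."""
        matches = observations_by_id.get(row_id)
        if not matches:
            return None                        # no observation with this ID: nothing to evaluate
        obs = matches[0]                       # the first observation with a matching ID is used
        if prediction > dc and obs <= rs:
            return "false exceedance"          # model predicts an exceedance that was not observed
        if prediction <= dc and obs > rs:
            return "false non-exceedance"      # model misses an observed exceedance
        return None

    # Hypothetical log10-scale values:
    obs = {"38508.33": [1.738]}
    print(error_type(1.84, obs, "38508.33", dc=2.37, rs=2.37))   # None: both below the thresholds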

-------
       It is important to note that accurate assessment of model output depends on the Decision
Criterion, Regulatory Standard, model predictions, and observations all being expressed on the
same transformation scale. Users must be careful to ensure each value is in comparable units.
9.4 Viewing Plots

       After predictions have been made, a scatterplot of observations versus predictions can be
viewed by clicking "Plot" in the "Prediction Grid" button bank.  If no observational data were
entered, a message asking for observational data appears. The features and functionality of the
form that appears when the "Plot" button is clicked are described in Section 7.6. The data are
based on comparing model predictions (right pane of the Prediction Form) with observations
(middle pane) that share the same, unique ID.
[Screenshot: the "Pred vs Obs" plot view with Plot Thresholds controls (Decision Criterion, Regulatory Standard, Threshold Transform), Model Evaluation statistics (false positives/Type I, specificity, false negatives/Type II, sensitivity, accuracy), and the Predictions vs Observations scatterplot.]
Figure 53. Prediction interface plotting of the observations versus predictions, with model
evaluation threshold controls

9.5 Prediction Form Manipulation
       Two other buttons are found in the "Prediction Grid" button bank. If a user wants to view
the table in a spreadsheet or word processing program, "Export as CSV" saves the contents of the
entire table (three panels) in .csv format. "Clear" deletes all information in the predictive table. As
with most of the tabular information in VB 2.2, data in individual panels can be selected with a left
click and drag.  Control-C and Control-V can then be used to copy and paste the data into another
application such as WordPad or Excel.

-------
10.0 Future Enhancements

      VB 2.2 is a Windows application and undergoes continuous improvement and functional
expansion. In version 3.0, slated for release in 2012, project management enhancements will
allow site-based seasonal prediction and model assessment.  The map interface will provide access
to site-specific data and information such as water quality, water flow gauge readings, and
weather data. Model-building functionality will grow beyond MLR to include Gradient Boosting
Machines (Decision Trees), Binary Logistic Regression, Partial Least Squares regression, and
Neural Networks.

-------
11.0 User Feedback

      Opinions and experiences from the user community are welcomed by the Virtual Beach
design/development team.  Users are encouraged to report problems, issues and likes/dislikes to:

Mike Cyterski - 706 355-8142 (cyterski.mike@epa.gov)
Mike Galvin - 706 355-8318 (galvin.mike@epa.gov)
Rajbir Parmar - 706 355-8306 (parmar.rajbir@epa.gov)
Kurt Wolfe - 706 355-8311 (wolfe.kurt@epa.gov)

-------
12.0 Acknowledgments

We would like to thank the following people, who generously donated their time and expertise for
software testing and review of this document:

Adam Mednick, Wisconsin DNR
David Rockwell, NOAA
Fran Rauschenberg, USEPA
Wesley Brooks, USGS
Mike Fienen, USGS
Donna Francy, USGS
Richard Zepp, USEPA
Steve Corsi, USGS
