Virtual Beach v 2.2 User Guide
Mike Cyterski, Mike Galvin, Kurt Wolfe, and Rajbir Parmar
Virtual Beach
Empirical Modeling Software for
Pathogen Indicators in Recreational Waters
U.S. Environmental Protection Agency
Office of Research and Development
National Exposure Research Laboratory
Ecosystems Research Division
-------
Table of Contents
1. Introduction
1.1 On Predictive Modeling
1.2 Recommended User Background
1.3 History and Comparison of Version 2.2 to Earlier Versions
2. Installation and Execution
2.1 Viewing this Documentation
3. Operational Overview
4. Project Management
5. Beach Location Mapping Interface
5.1 Finding a Location
5.2 Defining the Beach Orientation
5.3 Finding nearby Water Quality, Flow, and Climate Information Sources
5.4 Saving Beach Information in a Project File
6. Data Processing
6.1 Data Requirements and Considerations
6.2 Importing a Dataset
6.3 Validating the Imported Data
6.4 Working with a Dataset Post-Validation
6.5 Computing Alongshore and Onshore/Offshore Wind, Wave and Current Components
Notes on wind, wave and current component calculations
6.6 Creation of New Independent Variables
6.7 Transforming the Independent Variables
Plotting Transformed IVs
Notes on Transformed IVs
6.8 Saving Processed Data
6.9 Go to Modeling
7. Modeling
7.1 Selecting Variables for Model Building
7.2 Modeling Control Options
7.3 Linear Regression Modeling Methods
7.4 Using the Genetic Algorithm
7.5 Evaluating Model Output
7.6 Viewing X-Y Scatterplots
7.7 ROC Curves
7.8 Cross-Validation
7.9 Report Generation
8. Residual Analysis
Viewing the Data Table
9. Prediction
9.1 Model Statement
9.2 Model Evaluation Thresholds
9.3 Prediction Form
9.4 Viewing Plots
9.5 Prediction Form Manipulation
10. Future Enhancements
11. User Feedback
12. Acknowledgments
-------
List of Figures
Figure 1. The five major component tabs of VB 2.2
Figure 2. Beach Location interface
Figure 3. Beach Location tab controls and their function
Figure 4. Adding shoreline and water markers to define beach orientation
Figure 5. NOAA/NCDC station marker showing station ID information
Figure 6. USGS/NWIS station marker showing station ID information
Figure 7. Beach Location interface showing station markers
Figure 8. Importing a dataset into the Data Processing tab
Figure 9. Data validation required to begin data processing
Figure 10. Context-sensitive choices for the "Take Action Within" drop-down menu
Figure 11. Post-validation enabling of the Data Processing functionality
Figure 12. Right-click options on columns that are not the response variable
Figure 13. Four different plots available for evaluation of IVs
Figure 14. Disabling an observation from within the XY scatterplot
Figure 15. Available choices when right-clicking the current response variable
Figure 16. Window for computation of alongshore and offshore/onshore components
Figure 17. A and O component definitions for wind, current, and wave data
Figure 18. Principal beach orientations given in degrees
Figure 19. Window for the formulation of "Manipulates"
Figure 20. Creation of a new IV defined as the mean of two existent IVs
Figure 21. Formation of two-way cross-products of a set of four existent IVs
Figure 22. The range of choices for IV transformations
Figure 23. Pearson correlation coefficient scores for judging the efficacy of IV transformations
Figure 24. Scatterplots (Response vs. IV) for six different data transformations of a single IV
Figure 25. Selecting variables for MLR processing within the Modeling tab
Figure 26. Setting modeling options within the Modeling interface
Figure 27. Setting evaluation thresholds and threshold transformation information
Figure 28. Model building interface
Figure 29. Using the IV filter to select a subset of variables from the best-fit models
Figure 30. Genetic algorithm options within the modeling interface
Figure 31. Modeling results shown after completion of an exhaustive regression run
Figure 32. Modeling Interface showing variable statistics for the selected Best-Fit model
Figure 33. Modeling interface showing model evaluation metrics for the selected Best-Fit model
Figure 34. Modeling interface showing a time series plot for the selected model
Figure 35. An XY scatter plot of observed versus predicted values for the selected model
Figure 36. The ROC curves and AUC table for the Best Fit models
Figure 37. The cross-validation results for each of the 10 best-fit models
Figure 38. A text report generated on the modeling results
Figure 39. Plots of the various model evaluation metrics for the 10 best-fit models
Figure 40. Scaled versus un-scaled views of selected model evaluation criterion
Figure 41. Information available on the Residuals tab
Figure 42. Plot of studentized predictions vs. residuals and the A-D test of normality
Figure 43. A table and plot of the DFFITS scores for the residuals
Figure 44. DFFITS/Cook's Distance controls for removing highly influential data points
Figure 45. "View Data Table" window
Figure 46. Observed vs. Predicted plot on the Residual tab
Figure 47. Residuals interface showing a list of rebuilt models
Figure 48. The MLR Prediction interface
Figure 49. Importation of IV data using the "Column Mapper" window
Figure 50. Importation of observational data using the "Column Mapper" window
Figure 51. The IV validation window on the MLR Prediction tab
Figure 52. A prediction grid after IVs and observational data have been imported
Figure 53. Prediction interface plotting of the observations versus predictions
-------
1. INTRODUCTION
Virtual Beach version 2.2 (VB 2.2) is a decision support tool. It is designed to
construct site-specific Multiple Linear Regression (MLR) models to predict pathogen
indicator levels (or fecal indicator bacteria, FIB) at recreational beaches. MLR analysis
has outperformed persistence models (using the most recent FIB concentration as the sole
predictor of the next FIB concentration, i.e., y_t = y_{t-1}) at beaches where conditions, such
as weather, water conditions, and human and animal traffic levels, change significantly
from day to day (Frick, Ge et al. 2008).
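As a point of reference, the persistence baseline mentioned above can be sketched in a few lines of Python. This is only an illustration and not code from VB 2.2; the function names and the sample values are hypothetical.

```python
import math

def persistence_forecast(observations):
    """Predict each value as the previous observation: y_t = y_(t-1).

    The first observation has no predecessor, so no prediction is made for it.
    """
    return list(observations[:-1])

def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical log10 FIB concentrations on consecutive sampling days.
log_fib = [1.0, 2.0, 2.0, 4.0]
predictions = persistence_forecast(log_fib)   # forecasts for days 2 to 4
error = rmse(predictions, log_fib[1:])
```

An MLR model is worth the extra effort at a site only if its prediction error is lower than the error of a baseline of this kind.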
1.1 On Predictive Modeling
In any predictive modeling endeavor, variability and uncertainty are always
associated with model output, arising from a variety of reasons that are impossible to
eradicate completely from the modeling exercise. Virtual Beach 2.2 attempts to be
forthright with this fact by issuing a probability of exceedance for any regulatory
standard that the user wishes to investigate. Even so, there is no guarantee that every
model prediction will be correct, and a situation where the model predicts water quality
to be good enough for public recreation might be erroneous. Decisions to allow or not
allow swimming at beaches must be made, however, and in the best case scenarios the
regression models developed with Virtual Beach 2.2 will outperform less rigorous
predictive efforts.
1.2 Recommended User Background
Virtual Beach 2.2 is our attempt to create a decision support software tool that
will assist someone with little statistical knowledge in developing a multiple linear
regression model based on their available data. Some familiarity with regression
modeling and residual analysis will no doubt benefit a VB 2.2 user, although we believe
that, after only a few sessions, someone with very little background in statistics can
produce defensible regression models using VB 2.2. We note that these MLR models, or
any other statistical models, will only be as effective as the data used to develop them.
No statistician, however skilled, can turn a dataset filled with worthless independent
variables (i.e., IVs) into a useful predictive device.
VB 2.2 has five major components:
• Beach location map interface where users can locate their site, define the
orientation of the beach, and examine nearby potential data sources.
• Data processing spreadsheet interface that facilitates the import and manipulation
of MLR model variable data.
• Modeling interface presenting options for performing MLR analyses.
• Residuals component to examine regression residuals, allow optional elimination
of highly influential data records, and perform recalculation of the regression
model.
• Prediction interface allowing entry of new data and subsequent estimation of
pathogen indicator levels using a selected MLR model.
Each component is accessible from the application's main window via selectable
tabs. The Beach Location and Data Processing tabs are always visible, the Modeling tab
becomes visible once the input data have been validated, and the Residuals and MLR
Prediction tabs appear when model-building is complete and a model is selected.
Figure 1. The five major component tabs of VB 2.2 - the modeling tab is currently active
1.3 History and Comparison of Version 2.2 to Earlier Versions
Virtual Beach 2.2 is derived from the Virtual Beach Model Builder application
(VB1.0 - also known as Virtual Beach v1.0) developed by Walter Frick and Zhongfu Ge.
VB1.0 can be characterized as a MLR model-building tool that supports a primarily
manual analysis of data sets via visual inspection of data plots and manipulation of
variables (e.g., transformations, creating interaction terms), followed by an iterative
process of testing, comparing and evaluating models. The fitness of developed models is
computed and tracked, allowing for comparison and eventual selection of a "best" model
for the dataset under consideration. This model can then produce estimates of pathogen
indicator levels using current or forecasted environmental data from the site.
VB 2.2 enhances the functionality of its predecessor, performing similar functions
(visual inspection of univariate data plots, manual transformations of individual variables,
MLR model building, prediction, etc.), but also automating and extending functionality in
several ways:
• The Map component provides users with information on the location and
availability of local data sources (NWIS/NCDC data) through the map interface.
These sources can provide recently collected and/or forecasted data for generating
predictions by a chosen MLR model.
• The Map component provides a convenient method for defining beach orientation
by overlaying the beach on current shore-line layers (satellite images, Google
Maps, MS Virtual Earth, etc). Given this orientation, VB 2.2 can calculate wind,
wave, or current components (A component is parallel to shore and O component
is perpendicular to shore), which can be important predictor variables.
• Although manual processing and analysis of imported data (visual inspection of
univariate data plots and the transformations/interactions of variables) has been
retained, the Data Processing component of VB 2.2 provides automated
generation of all possible 2nd order interaction terms amongst a set of IVs,
formation of more complex functions of multiple columns, and automated testing
of a suite of variable transformations for improved model linearity. This
functionality increases the number of models to evaluate during later selection
routines and removes the burden/difficulty of manual assessment placed on users
of VB1.0.
• Multi-collinearity amongst predictor variables is handled automatically in the
Model Building component. Any model containing an IV with a high degree of
correlation with other IVs (as measured by a large Variance Inflation Factor
[VIF]) is removed from consideration during model selection. The VIF threshold
is user-defined with a default value of 5.
• During model selection, MLR models are ranked by a user-selected evaluation
criterion. Possible criteria include R^2, Adjusted R^2, Akaike Information Criterion
(AIC), Corrected AIC, Predicted Error Sum of Squares (PRESS), Bayes
Information Criterion (BIC), Accuracy, Sensitivity, Specificity, or the model's
Root Mean Square Error (RMSE). Regardless of which criterion is chosen, the
software records the ten best models in terms of that criterion. In comparison,
VB1.0 had only a single comparative criterion, Mallows' Cp.
• As the number of IVs in a dataset increases, possible MLR models increase
exponentially (considering transforms/interactions), resulting in trillions of
possible models from a modest number (12-13) of IVs. VB 2.2 implements a
Genetic Algorithm (GA) that effectively and efficiently searches for the best
possible MLR model. Alternatively, VB 2.2 users can perform an exhaustive
calculation in which all possible combinations of IVs are used and tested if the
number of possible models is reasonably small (circa 100,000). Both the GA and
exhaustive approaches greatly expand the model-building capabilities of VB 2.2,
compared to VB1.0.
• Users no longer have to enter data values in transformed, interacted, or
component-decomposed form to make a prediction with a chosen MLR model.
On the VB 2.2 MLR Prediction tab, a user-selected model is coded into an input
grid with data entry columns matching the model's main effects. Any
mathematical manipulation of these IVs is then automatically performed prior to
making predictions.
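The multi-collinearity screening described in the list above can be illustrated with the standard VIF definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 is obtained by regressing IV j on the remaining IVs in the candidate model. The following Python sketch uses only the standard library; it is not VB 2.2's implementation (which relies on a licensed statistical library), and all function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations act on both sides at once.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x

def r_squared(y, predictors):
    """R^2 of a least-squares fit of y on the predictors plus an intercept."""
    rows = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def variance_inflation_factors(columns):
    """Map each IV name to VIF_j = 1 / (1 - R_j^2)."""
    vifs = {}
    for name, values in columns.items():
        others = [v for other, v in columns.items() if other != name]
        r2 = r_squared(values, others)
        vifs[name] = float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return vifs

def passes_vif_filter(columns, max_vif=5.0):
    """True when no IV in the candidate model exceeds the VIF threshold."""
    return all(v <= max_vif for v in variance_inflation_factors(columns).values())
```

With the default threshold of 5, a candidate model is rejected as soon as one of its IVs shares more than 80 percent of its variance with the others (R_j^2 > 0.8).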
-------
2. INSTALLATION AND EXECUTION
VB 2.2 is developed with MS Visual Studio 2010, written in C#, using multiple
public domain system components (Weifen Luo Docking UI, ZedGraph, and GMap.Net)
and employs a single licensed statistical library (Extreme Optimization). No license or
software purchase is required by the user to install and run the application, but an internet
connection is required to display maps. Users must have the Microsoft Windows XP or
Windows 7 OS with the DotNet Framework 4.0 to ensure proper installation and operation.
Assorted errors have occurred when running under the Windows Vista OS. Certain VB 2.2 data
manipulation and model-building operations are computationally intensive so faster
CPUs are better, but most new laptops or desktop systems will be adequate. Disk space
requirements are modest (less than 5 MB) if the DotNet Framework is installed; if not,
the Framework installer requires ~ 175 MB of disk space. The VB 2.2 application
installer will attempt to download and install the DotNet Framework 4.0 if it is not
installed on the target system; this also requires a network connection. If necessary, a
user can freely obtain the DotNet Framework 4 installer at:
http://www.microsoft.com/download/en/details.aspx?id=17851
The EPA's Center for Exposure Assessment Modeling (CEAM) web site
distributes VB 2.2 at:
http://www.epa.gov/ceampubl/swater/vb2/index.html
Obtain and initiate execution of the VB 2.2 application installer and follow the on-screen
instructions. The VB 2.2 application installer can be found at:
https://iemhub.org/resources/vbmb2 for iemHub Virtual Beach Group members;
https://iemhub.org/groups/virtualbeach/join to request Group member access.
Finally, the software can be obtained by request (see the contacts list in the
Feedback section at the end of this document). After installation, a shortcut will appear
on your desktop to start the software.
2.1 Viewing this Documentation
Virtual Beach's User Guide can be accessed within the software via the top-level
Help -> User Guide menu selection or in a context-sensitive fashion via the F1 key.
Invoking F1 will launch Adobe Acrobat or Adobe Reader (if installed) and open the User
Guide to the appropriate page. Note that if the Guide is already open, the F1 key will
have no effect; users must close Reader (or Acrobat) for F1 to launch and open to the
correct page. Alternatively, if the Guide is already open, users can navigate to the area of
interest via the Table of Contents. The User Guide (Virtual_Beach_2_User_Guide.pdf) can also
be opened independently of program operation; it resides within the Documentation
folder of the program's installation folder.
-------
3. OPERATIONAL OVERVIEW
Virtual Beach 2.2 is simple to operate: it is categorized into five functions, each
with its own component or interface:
Beach Location - a mapping tab that provides a basis for generating
orthogonal (alongshore and offshore/onshore) wind, current, and/or wave components for
the beach under consideration; its use is optional. Such components can be powerful
predictors of pathogen indicator levels at the beach, so using the beach definition
component is recommended if the dataset under consideration contains wind, wave or
current data. This tab is also useful for locating nearby NWIS/NCDC climate and water
quality data sources for a specific location.
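To make the idea of orthogonal components concrete, the Python sketch below projects a speed/direction pair onto axes parallel and perpendicular to the shoreline. It is an illustration under stated assumptions only: the angle and sign conventions used here are hypothetical, and the conventions VB 2.2 actually applies are the ones documented in Section 6.5.

```python
import math

def shore_components(speed, direction_deg, beach_orientation_deg):
    """Split a speed/direction pair into alongshore (A) and cross-shore (O) parts.

    ASSUMPTION: both angles are in degrees on the same compass reference, A is
    the projection parallel to the shoreline and O is the projection
    perpendicular to it. The sign conventions of VB 2.2 are not reproduced here.
    """
    relative = math.radians(direction_deg - beach_orientation_deg)
    alongshore = speed * math.cos(relative)
    cross_shore = speed * math.sin(relative)
    return alongshore, cross_shore

# A 10 m/s wind aligned with the shoreline has no cross-shore part.
a_component, o_component = shore_components(10.0, 90.0, 90.0)
```

Whatever the convention, the two components always recombine to the original speed (A^2 + O^2 = speed^2), so no information is lost by the decomposition.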
Data Processing - a spreadsheet tab to support data manipulation procedures on an
imported dataset. In addition to wind/current/wave component generation, users can
generate new independent variables that represent the products, means, sums, minimums,
and maximums of other IVs, as well as common data transformations for the IVs.
Statistical indicators help users select the best IV transformations in MLR model-
building.
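The kinds of derived IVs listed above (products, means, sums, minimums, and maximums) amount to simple row-wise operations on existing columns. A minimal Python sketch follows, with hypothetical column names and values; it does not reflect how VB 2.2 stores its data grid.

```python
from itertools import combinations

def two_way_products(columns):
    """Return one new column per unordered pair of IVs: their row-wise product."""
    products = {}
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        products[f"{name_a}_x_{name_b}"] = [x * y for x, y in zip(a, b)]
    return products

def row_wise(columns, func):
    """Apply func (e.g. sum, min, max or a mean) across the given IVs, row by row."""
    return [func(values) for values in zip(*columns.values())]

def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)

# Hypothetical IVs with three observations each.
ivs = {
    "WaterTempC": [20.0, 22.0, 24.0],
    "Turbidity": [5.0, 7.0, 3.0],
    "WaveHeight": [0.5, 1.0, 0.2],
    "Rainfall": [0.0, 2.0, 1.0],
}
cross_products = two_way_products(ivs)   # 4 IVs give 6 pairwise products
temp_turb_mean = row_wise({k: ivs[k] for k in ("WaterTempC", "Turbidity")}, mean)
```

Because the number of pairwise products grows as n(n-1)/2, forming them for every IV in a large dataset quickly inflates the pool of candidate variables.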
Modeling - this tab allows selection of any eligible IVs for consideration in MLR model-
building and model-generation. Model-generation is accommodated by user-selected
model evaluation criteria and automatic generation of the ten best-fit models from a
search in which all possible combinations of predictor variables are tested, or via a
heuristic searching algorithm (the Genetic Algorithm or GA). Regression fit and model
variable statistics are generated to help evaluate the usefulness of predictive variables and
overall fit. Time series and XY scatter plots, as well as reports on best-fit models, can be
viewed and/or saved for further analysis and recording.
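The size of the exhaustive search can be worked out directly: n eligible IVs give 2^n - 1 non-empty subsets, or fewer if the number of variables per model is capped. The Python sketch below counts and enumerates those subsets; the function names are hypothetical and the enumeration order is not necessarily the one VB 2.2 uses.

```python
from itertools import combinations
from math import comb

def count_models(n_ivs, max_vars=None):
    """Number of non-empty IV subsets, optionally capped at max_vars IVs per model."""
    limit = n_ivs if max_vars is None else min(max_vars, n_ivs)
    return sum(comb(n_ivs, k) for k in range(1, limit + 1))

def enumerate_models(iv_names, max_vars=None):
    """Yield every candidate IV subset as a tuple, smallest models first."""
    limit = len(iv_names) if max_vars is None else min(max_vars, len(iv_names))
    for size in range(1, limit + 1):
        yield from combinations(iv_names, size)

seven_iv_models = count_models(7)               # 2**7 - 1 = 127
seven_iv_models_capped = count_models(7, 4)     # 7 + 21 + 35 + 35 = 98
```

Each additional IV doubles the count, which is why a heuristic search such as the GA becomes necessary once transformed and interaction variables are added to the pool.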
Residual Analysis - this tab displays plots of a model's regression residuals, including
their normality statistics, and provides means to eliminate highly influential data records
and recalculate the regression model. Altered data sets can be exported for external use
and rebuilt models can be selected for the prediction tab.
Prediction - this tab comprises three grids where users can enter or import the
needed IVs for the chosen model, enter or import observations that will be compared to
model predictions, and examine model predictions and exceedance probabilities. Time
series and XY scatter plots of observations versus predictions are shown to help users
gauge model effectiveness.
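One common way to turn a point prediction into an exceedance probability is to assume that the prediction error is normally distributed with a known standard error. The Python sketch below follows that assumption; it is not taken from VB 2.2, whose exact calculation is not described in this overview, and the function and parameter names are hypothetical.

```python
import math

def exceedance_probability(prediction, threshold, std_error):
    """P(true value > threshold), ASSUMING a normal error around the prediction.

    All three arguments must be on the same scale (e.g. log10 units).
    """
    if std_error <= 0:
        raise ValueError("std_error must be positive")
    z = (threshold - prediction) / std_error
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 1.0 - cdf

# Hypothetical example: a log10 prediction of 2.1 against a log10(235) standard.
p_exceed = exceedance_probability(2.1, math.log10(235.0), 0.4)
```

A prediction exactly at the threshold yields a probability of 0.5, and the probability approaches 0 or 1 as the prediction moves several standard errors below or above the threshold.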
-------
4. PROJECT MANAGEMENT
Oftentimes the user will put an imported dataset through lengthy pre-processing
to prepare it for analysis. To avoid repeating all of this work, "project" files can be saved
and re-opened via the Project -> Save and Project -> Open menu selections. Subsequent
opening of a saved project file will load the processed data sheet and information on the
Beach Location tab, including the beach orientation if the user had defined it. However,
no modeling information is saved inside a project file.
In addition to project files, "model" files can be opened and saved using choices
under the "Model" menu at the top of the VB 2.2 interface. A model file contains
information on the IVs, regression parameters, and other metadata for the currently
selected model in the Modeling, Residual, or MLR Prediction tab. Whenever a model
file is saved, VB 2.2 will prompt the user to enter a Decision Criterion (DC), Regulatory
Standard (RS) and Threshold Transformation for the model. These parameters will be
used as initial values (they can be changed when the model file is opened) for later
calculations of model sensitivity and specificity, which depend on the numbers of false
negative and false positive model predictions (see Sections 7.6 and 7.7).
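The relationship between the two thresholds and the resulting sensitivity and specificity can be sketched as follows. This Python illustration assumes that an observation counts as an exceedance when it is strictly above the RS and that a prediction signals an exceedance when it is strictly above the DC; the strict comparison and all names are assumptions, not VB 2.2 code.

```python
def classify_predictions(observed, predicted, regulatory_standard, decision_criterion):
    """Count true/false positives and negatives.

    ASSUMPTION: an observation is an actual exceedance when it is above the
    Regulatory Standard (RS); a prediction signals an exceedance when it is
    above the Decision Criterion (DC). Inputs must share the scale of RS and DC.
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for obs, pred in zip(observed, predicted):
        actual = obs > regulatory_standard
        flagged = pred > decision_criterion
        if actual and flagged:
            counts["TP"] += 1
        elif actual:
            counts["FN"] += 1
        elif flagged:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts

def sensitivity(counts):
    """Fraction of actual exceedances that the model flagged."""
    return counts["TP"] / (counts["TP"] + counts["FN"])

def specificity(counts):
    """Fraction of actual non-exceedances that the model left unflagged."""
    return counts["TN"] / (counts["TN"] + counts["FP"])
```

Lowering the DC relative to the RS flags more days as exceedances, which trades specificity (more false positives) for sensitivity (fewer false negatives).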
When users open a previously saved model file from within VB 2.2, they are
taken directly to the MLR Prediction tab where they can use the saved model to generate
predictions. Model files are designed for situations where a statistically-savvy developer
is charged with developing regression models for a number of beach sites. After the
developer chooses a "best" model for a site, the model file can be saved and then
delivered to the beach manager who will not use VB 2.2 for full-scale model
development, but only to input new data, generate predictions, and make decisions
regarding swimming advisories.
-------
5. BEACH LOCATION MAPPING INTERFACE
On VB 2.2 application startup, the map interface is shown, but users can go
directly to the Data Processing tab if desired.
Figure 2. Beach Location interface
-------
• Zoom Slider - drag the slider up and down to zoom in and out, respectively.
• Map Controls - enter a Lat/Long and click the "GoToLat/Long" button, or enter a
Place and click "GoToPlace."
• Map Settings - select a map type from the drop-down menu to change the display in
the map window.
• Beach Orientation - use the buttons to add or remove markers on the map. Once the
beach shoreline is delineated by placing the 1st and 2nd beach markers, click in the
water and then click "Add Water Marker," which places the computed orientation
angle into the "Beach Orientation" box.
• Show Station Locations - if zoomed in enough, select a station type and then click
"Show Station Locations" to display such stations on the map.
• Current Location - click anywhere on the map to display that point's Lat and Long.
• Loading - a progress bar that shows network download activity for map images.
Figure 3. Beach Location tab controls and their function
5.2 Defining the Beach Orientation
The map controls allow delineation of a beach on the map to ascertain its orientation,
which is useful if wind, wave, and/or current flow components are to be used in MLR
model-building. Maps provide less shoreline detail than satellite or hybrid images, so it
is recommended that the map type be set to a hybrid or satellite image before adding the
point locations that define the beach boundaries. Once the image is displayed, click on the map
(a red marker will appear) and select the "Add 1st Beach Marker" button; this represents
the first point of the extent of your beach shoreline. Repeat this for the second beach
marker and click on the map to indicate which side of the shoreline represents the water;
then hit the "Add Water Marker" button. Marker points will turn green as you add them.
Once the water marker is added, a shaded box (the beach) appears and the computed
orientation angle will be displayed.
Figure 4. Adding shoreline and water markers to define beach orientation
Points can be added or removed until the user is satisfied with the beach
representation. To recall the computed beach orientation in the data processing
components creation screen (see Data Processing section below), users can either save
and then re-open a project file or they can note the beach orientation on the mapping
screen and manually enter that angle on the components calculation screen.
5.3 Finding nearby Water Quality, Flow, and Climate Information Sources
Possible nearby data sources for the area of interest may be located and displayed
on the map. USGS NWIS and NOAA NCDC station markers at a zoomed-in map area
can be located and displayed by checking appropriate items in the map window and
clicking the "Show Station Locations" button. Note that the "Show Station Locations"
button is only enabled when zoomed in to an appropriate level (e.g., zoom level three as
measured from the top of the zoom control slider). If stations in either of the selected
categories (NWIS and/or NCDC; the STORET station category, although present on the
control, is not yet functional) are present within the map display area, they will appear.
Also note that the network server that produces NCDC station locations restricts location
requests to one every 30 seconds - a one-half minute delay is required for subsequent
location requests and an error message will be displayed if the appropriate wait time has
not elapsed. Once station location markers are displayed on the map, hovering over the
top-left hand corner of any station marker will display station ID information. With that
information, users can visit the appropriate web address to gather water/weather data for
the area of interest.
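The 30-second restriction on NCDC location requests behaves like a simple client-side throttle. The Python sketch below illustrates that behavior; the class and method names are hypothetical, it performs no network access, and it is not the mechanism VB 2.2 itself uses.

```python
import time

class StationRequestThrottle:
    """Refuse a new station-location request until the minimum interval has passed."""

    def __init__(self, min_interval_s=30.0, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self._clock = clock          # injectable so the behavior can be tested
        self._last_request = None

    def seconds_until_allowed(self):
        """Remaining wait time in seconds; 0.0 when a request may be sent now."""
        if self._last_request is None:
            return 0.0
        elapsed = self._clock() - self._last_request
        return max(0.0, self.min_interval_s - elapsed)

    def try_request(self):
        """Record a request and return True, or return False if it is too soon."""
        if self.seconds_until_allowed() > 0.0:
            return False
        self._last_request = self._clock()
        return True
```

A caller would check `try_request()` before contacting the server and, on `False`, report the value of `seconds_until_allowed()` to the user instead of sending the request.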
Figure 5. NOAA/NCDC station marker showing station ID information
Figure 6. USGS/NWIS station marker showing station ID information
USGS NWIS web site URL: http://waterdata.usgs.gov/nwis/inventory
NOAA NCDC web site URL: http://www.ncdc.noaa.gov/oa/climate/stationlocator.html
-------
Figure 7. Beach Location interface showing station markers near Gary, Indiana
5.4 Saving Beach Information in a Project File
Use the Project -> Save menu bar selection to open a Save File dialog and to save
the project information to disk. Beach marker and angle information is saved in the file
name provided; the saved file can be anywhere, but using the "Project Files" folder
(found in the VB 2.2 root install folder) is recommended.
-------
6. DATA PROCESSING
6.1 Data Requirements and Considerations
VB 2.2 accepts files from Excel 2007 or earlier (Excel 2010 is not currently
supported), as well as comma-separated-value (CSV) text files. Input data must conform
to certain standards:
• The first row of any data column must be a header with the IV's name. For best
operation of the software, the column name should be composed of letters,
numbers (but should not begin with a number), and/or underscores. Other
characters in column names can cause problems.
• The first (left-most) column of the dataset must be an identifier for the
observations, typically a date or time stamp that indicates when the observation
was collected. The only requirement is that each row MUST have a unique ID.
VB 2.2 will not import datasets with non-unique IDs in the first column. If the
first column is a time stamp, VB 2.2's plotting functions will work best if the
column is in chronological order, from earliest to most recent observations.
• The second column of the dataset will initially be set as the dependent or response
variable; however, this can be changed after data are imported. Any subsequent
columns will be considered to be IVs.
• Variable measurement units are not considered, but certainly affect predictions.
Make sure any data used for predictions are in the same units as those used to
build the models; for example, do not build a MLR model with water temperature
in degrees Fahrenheit, then later import water temperature in degrees Celsius for
predictions. It is prudent to include unit information in the column names (e.g.,
WaterTempC) to remind the user of the proper units when making predictions.
• Missing data (blank cells) are permitted on import, but must be dealt with in Data
Processing prior to modeling.
• If present in the imported Excel data sheet (other than in column names or the
first ID column), cells with non-numeric values (i.e., symbols or text) are turned
into empty cells. If such non-numeric characters are present in an imported .csv
file, they will be imported to the data grid, but will be recognized as anomalous
data during the required validation scan and will have to be dealt with (deleted or
turned into a numeric value) at that time.
• VB 2.2 recognizes any column of data with only two different values as
categorical. If you have a column of categorical data with more than two values,
you can designate it as categorical, using methods described below. The
ramification of a variable being identified as categorical is that VB 2.2 leaves it
out of transformation processes.
• There is no hard-coded limit on the number of IV columns one can import;
however, a practical limit exists that depends on system processing resources.
There is also an inherent limit: documentation indicates that the grid components
used in the application are designed for a maximum of 300 columns before
performance issues degrade the application. Modeling 250+ columns of data
presents circa 2 x 10^20 possible data combinations for MLR processing. The
Genetic Algorithm handles this modeling task, but choosing "Run all
combinations" would likely take an immense amount of time to complete.
Depending on how many additional IVs will be created by the user, importing a
dataset with fewer than 100 IVs should be acceptable.
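Several of the requirements above can be checked before a file is ever imported. The Python sketch below tests a CSV file's column names, the uniqueness of the IDs in the first column, and which columns hold exactly two distinct values (and would therefore be treated as categorical). It is an independent pre-check with hypothetical names, not VB 2.2's import routine, and it assumes every row has the same number of cells as the header.

```python
import csv
import io
import re

# Letters, digits and underscores only, and not starting with a digit.
COLUMN_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def check_dataset(csv_text):
    """Return (problems, two_valued_columns) for a CSV laid out as described above."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    problems = []

    for name in header:
        if not COLUMN_NAME.match(name):
            problems.append(f"column name may cause problems: {name!r}")

    ids = [row[0] for row in data]
    if len(set(ids)) != len(ids):
        problems.append("first column contains non-unique IDs")

    # Columns after the ID column with exactly two distinct non-blank values.
    two_valued = []
    for index in range(1, len(header)):
        values = {row[index] for row in data if row[index].strip()}
        if len(values) == 2:
            two_valued.append(header[index])
    return problems, two_valued
```

Running a check of this kind first makes it easier to see why an import is rejected, for example because of a duplicated time stamp.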
6.2 Importing a Dataset
When users first click on the Data Processing tab, they open a dataset using the
"Import" button. This brings up a dialog screen where a directory explorer can be used to
find the data file and open it. If the dataset is an Excel file with multiple sheets, a dialog
box opens to ask the user which sheet to import.
Figure 8. Importing a dataset into the Data Processing tab
Once imported, the data grid is shown as a spreadsheet on the right. The second
column of the spreadsheet will be highlighted in blue to indicate its status as the current
response variable. Information about the dataset, such as number of rows and columns,
name of the ID column and name of the response variable, appear on the left. At this
point the grid cannot be edited or interacted with in any manner; to access additional
processing functionality, the data must be validated.
6.3 Validating the Imported Data
The "Validate" options window can be accessed by clicking the "Validate" button
at the top of the Data Processing tab. This window primarily launches a required data
scan to identify blank and non-numeric data cells in the imported spreadsheet. However,
one can also find and replace other specified values (e.g., a missing data tag like -999) in
the dataset using the "(Optional) Find:" input box.
Figure 9. Data validation required to begin data processing
To validate the data, the user clicks "Scan." VB then goes through the
spreadsheet, cell by cell, looking for blanks, non-numeric, or user-specified values
entered in the "Find:" input box. If one of these types of cells is found, the scan will stop
to highlight that cell. Users must decide how to deal with the cell using choices in the
"Action" section: they can replace the bad cell with a specified value, using the "Replace
With:" input box, or they can delete the row or column containing the bad cell. The user
must decide where to implement the chosen action with the "Take Action Within" menu.
Possible choices are "Only this Cell," "Only this Row," "Only this Column," "Entire
Row," "Entire Column," and "Entire Sheet." Items in this menu are context-sensitive,
i.e., they change depending on which Action is selected. This setup gives the user
flexibility, for example, to delete all rows containing missing values within one specific
column of data (Action would be "Delete Row" taken within the "Entire Column"), and
replace all missing values with a user-specified numeric value within another column of
data (Action would be "Replace With:" taken within "Entire Column"). The cell, row,
and column references always refer to the highlighted cell. After setting the "Take
Action Within" menu, the user clicks the "Take Action" button. VB 2.2 then makes the
specified changes to the spreadsheet, and the scan continues. When the entire
spreadsheet has been scanned and all bad cells have been fixed, VB 2.2 reports that "no
anomalous data have been found," and the user can click the "Return" button to close the
Scan window.
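The scan-and-fix loop described above can be sketched in a few lines of Python. This is only an illustration of the logic, not VB 2.2's own code: the function names (is_bad, scan, delete_rows, replace_in_column) and the list-of-lists grid are hypothetical, and cells are assumed to arrive as text.

```python
def is_bad(cell, find=None):
    """True if the cell is blank, non-numeric, or equal to the optional `find` value."""
    text = "" if cell is None else str(cell).strip()
    if text == "":
        return True
    try:
        value = float(text)
    except ValueError:
        return True
    return find is not None and value == float(find)


def scan(grid, find=None):
    """Return (row, column) of the first bad cell, or None when the grid is clean."""
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if is_bad(cell, find):
                return (r, c)
    return None


def delete_rows(grid, col, find=None):
    """'Delete Row' taken within 'Entire Column': drop rows whose cell in `col` is bad."""
    return [row for row in grid if not is_bad(row[col], find)]


def replace_in_column(grid, col, value, find=None):
    """'Replace With:' taken within 'Entire Column': overwrite bad cells in `col`."""
    return [[value if (c == col and is_bad(cell, find)) else cell
             for c, cell in enumerate(row)]
            for row in grid]
```

For example, deleting rows with bad values in one column and replacing a -999 missing-data tag in another leaves a grid on which scan returns None, which corresponds to the "no anomalous data have been found" message.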
As stated earlier, VB 2.2 will not attempt to transform categorical data columns.
It automatically identifies columns with only two unique values as categorical, but if the
user has other categorical IVs with more than two categories, those should be identified
to VB 2.2 by the "Identify Categorical Variables" button.
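As a rough illustration of the rule above (two unique values means categorical, anything else must be flagged by the user), consider the following sketch. The function names and the user_flagged argument are hypothetical and do not come from VB 2.2.

```python
def is_categorical(values, user_flagged=False):
    """A column is categorical if the user flags it or it holds exactly two distinct values."""
    return user_flagged or len(set(values)) == 2


def transformable_columns(columns, flagged=()):
    """Names of the columns that would remain eligible for transformation."""
    return [name for name, values in columns.items()
            if not is_categorical(values, name in flagged)]
```

A three-level column such as a coded sky condition would be treated as numeric unless it is named in flagged.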
Figure 10. Context-sensitive choices for the "Take Action Within" drop-down menu
6.4 Working with a Dataset Post-Validation
After the dataset has passed the validation scan, the function buttons across the
top of the Data Processing tab are enabled.
Figure 11. Post-validation enabling of the Data Processing functionality
At this point, the grid cells (other than the ID column) are editable - that is, users
can manually enter new numeric data into the cells by double-clicking on a cell and
typing in a new value. VB 2.2 does not allow blank cells or non-numeric data in cells.
Additionally, a right mouse-click on an IV column header presents options:
Figure 12. Right-click options on columns that are not the response variable
"Disable Column" turns the column's text red and prevents the column from being
passed to the Modeling tab of VB. Previously-disabled columns can be activated using
"Enable Column." "Set Response Variable" will make that IV the new response variable
and it becomes blue as a visual indication of this change. "View Plots" shows a new
screen with column statistics at the far left and four plots for that IV: (1) a scatterplot of
the IV versus the response variable in the upper left panel, (2) a plot of the IV values
versus the ID column at the upper right (a time series plot if the ID is an observation
date), (3) a box-and-whiskers plot at the bottom left, and (4) a histogram for the IV at the
bottom right.
Figure 13. Four different plots available for evaluation of IVs
The scatter plot (upper left) is probably the most-examined, as it can indicate a
non-linear relationship between the IV and the response variable, problems with
homogeneity of variance across the range of the IV, or outliers. Ensuring that the IVs are
linearly related to the response variable raises the probability of producing a robust,
meaningful analysis. If the relationship between the response and the IV is not well-
approximated by a straight line (a fundamental assumption of MLR), it may be beneficial
to transform the IV. Using VB 2.2 to accomplish this will be explained later in this
document. The scatterplot also shows the best-fit regression line in red, along with the
correlation coefficient ("r") and the significance (p-value) of the correlation coefficient at
the top of the plot. For the most part, p-values below 0.05 are considered statistically
significant.
Identifying odd values (potential outliers or bad data) of any IV can often be done by
visually inspecting these plots. If users double-click on the data point marker for any
observation in one of the top panels or the bottom left panel (i.e., not the histogram), they
can disable that point (the row) in the data grid.
The final choice, "Delete Column," deletes a column from the data grid, but the
original columns of the imported data sheet (VB 2.2 thinks of these as "main effects")
cannot be deleted. Rows can be disabled and enabled, but not deleted, from the data grid
by right-clicking the row header (far left of each row) and making the desired choice.
If the user right-clicks on the column header of the response variable, a different
set of choices is shown:
Figure 15. Available choices when right-clicking the current response variable
Users can transform the response variable in three ways: log10, ln (natural log), or a power
transformation (raising the response to an exponent). They can also un-transform the
response, view the plots shown previously for the IVs, or define a transformation of the
response variable. This last option is used when a datasheet is imported with an
already-transformed response variable. For example, users could import a datasheet with
log10-transformed fecal indicator bacteria levels and then define the response as being
log10-transformed. Doing this facilitates later comparisons with observations, decision criteria,
and regulatory standards. When users transform the response variable within VB 2.2
using the "Transform" option, VB 2.2 automatically defines the response as having the
chosen transformation and, in doing so, synchronizes the units of measurement for later
comparisons.
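The reason a defined transformation matters can be seen in a small sketch: a threshold stated in raw units can only be compared with a model output once both are on the same scale. The functions below are hypothetical helpers, not VB 2.2 code, and they assume a strictly positive response for the log options.

```python
import math


def transform_response(y, kind, exponent=None):
    """Apply one of the three response transformations ('log10', 'ln', 'power')."""
    if kind == "none":
        return y
    if kind == "log10":
        return math.log10(y)
    if kind == "ln":
        return math.log(y)
    if kind == "power":
        return y ** exponent
    raise ValueError(f"unknown transformation: {kind}")


def untransform_response(t, kind, exponent=None):
    """Invert the transformation so values return to the original units."""
    if kind == "none":
        return t
    if kind == "log10":
        return 10.0 ** t
    if kind == "ln":
        return math.exp(t)
    if kind == "power":
        return t ** (1.0 / exponent)
    raise ValueError(f"unknown transformation: {kind}")
```

With a log10-defined response, a raw-unit threshold (235 is used here purely as an illustrative number) would be compared against predictions as transform_response(235, "log10"), or the prediction would be converted back with untransform_response.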
6.5 Computing Alongshore and Onshore/Offshore Wind, Wave and Current
Components
Orthogonal wind, current, and wave vectors can be powerful predictors of beach
bacterial concentrations. Depending on the orientation of the beach, wind and currents
can influence the movement of bacteria from a nearby source to the beach, and wave
action can re-suspend bacteria buried in beach sediment. To make more sense of these
data, researchers typically decompose wind/current/wave magnitude and direction into A
(alongshore) and O (offshore/onshore) components for analysis (see equations at the end
of this section).
If direction and magnitude (speed/height) data are available, A and O components can
be calculated with the "Compute A, O" button. Clicking it brings up a window where
users specify which columns of the data grid contain the relevant magnitude and direction
data, using drop-down menus (Figure 16). There is also an input box at the bottom of the
form for the beach orientation angle. If the user defined the angle on the "Beach
Location" tab, that value should be seen here. After clicking "OK," new data columns
are added to the far right of the data grid, representing the A and O components of the
specified wind, current, or wave data. Unlike the originally imported IVs, these
components can be deleted from the data grid after they are created. Names of these new
columns will be: WindA_comp(X,Y,Z), CurrentO_comp(X,Y,Z), WaveA_comp(X,Y,Z),
etc., where X is the name of the column of data used for magnitude, Y is the name of the
column used for direction, and Z is the beach orientation angle.
Figure 16. Window for computation of alongshore and offshore/onshore components
Notes on wind, wave and current component calculations:
Direction is an angular degree measure. Moving in a clockwise direction from north
(0 degrees), values are positive, and negative while moving counter-clockwise. Wind
and current speed (as well as wave height) can be measured in any unit. VB 2.2 adheres
to scientific convention where wind direction is specified as the direction from which the
wind blows, while current and wave directions are specified as the direction toward
which the current or waves move. Thus, wind blowing from west to east has a direction
of either 270 or -90 degrees, while a current/wave moving from west to east has a
direction of 90 degrees.
The A component measures the force of the wind/current/wave moving parallel to
the shoreline (Figure 17). A positive A component means winds/currents/waves are
moving from right to left as you look out at the water. A negative A component means
winds/currents/waves are moving left to right as you look out at the water. The O
component measures force perpendicular to the shoreline. A negative O value indicates
movement from the land surface directly offshore (unlikely to see with wave action). A
positive O indicates waves/wind/currents from the water to the shore. These relationships
apply no matter how the beach is oriented (Figure 18).
Figure 17. A and O component definitions for wind, current, and wave data
Figure 18. Principal beach orientations given in degrees
Equations for calculation of Wind A/O components:
Wind A: -SPD * cosine ((DIR-BO) * PI/180)
Wind O: SPD * sine ((DIR-BO) * PI/180)
where SPD is wind speed, DIR is wind direction, BO is the beach orientation (in degrees)
and PI = 3.1416. Current A/O and Wave A/O are these same equations multiplied by -1.
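The equations above translate directly into code. The sketch below is a plain transcription with hypothetical function names; it uses math.radians in place of the PI/180 factor and applies the stated sign flip for currents and waves.

```python
import math


def wind_components(speed, direction_deg, beach_orientation_deg):
    """Return (A, O) for wind: A = -SPD*cos(DIR-BO), O = SPD*sin(DIR-BO)."""
    angle = math.radians(direction_deg - beach_orientation_deg)
    a = -speed * math.cos(angle)
    o = speed * math.sin(angle)
    return a, o


def current_or_wave_components(magnitude, direction_deg, beach_orientation_deg):
    """Current and wave components are the wind equations multiplied by -1."""
    a, o = wind_components(magnitude, direction_deg, beach_orientation_deg)
    return -a, -o
```

Because the wind direction is a "from" direction and the current/wave direction is a "toward" direction, the sign flip keeps the A and O conventions the same for all three data types.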
6.6 Creation of New Independent Variables
Users may click the "Manipulate" button to create new columns of data that might
serve as useful IVs. On the screen that pops up, there is a list of available IVs on the far
left, under "Independent Variables." If users wish to create a new term, they add any
available IV used in this new term by selecting it and using the ">" button to add it to the
"Variables in Expression" box. Clicking and dragging down through the "Independent
Variables" list allows for multiple IVs to be added at once.
Figure 19. Window for the formulation of "Manipulates" - arithmetic combinations of existing
columns within the data grid
For example, if users wish to create a new IV that is a row-by-row mean value of
the "centershintemp" and "centerwaisttemp" variables, they add those two to the
"Variables in Expression" box, choose the "Mean" function, click "Add" to move that expression
to the lower box, then click "OK." That adds a new column of data, representing a
row-by-row average of the two IVs, to the end of the data grid (far right).
Figure 20. Creation of a new IV defined as the mean of two existent IVs
Users can create a row-by-row sum, maximum, minimum, mean, or product from
any number of IVs that are added to the "Variables in Expression" box. More than one
expression can be created before the "OK" button is clicked, and IVs can be easily moved
in and out of the box using "<" and ">" keys. Any created expressions can be removed
from the lower box with the "Remove" button. No matter how many IVs are added to the
"Variables in Expression" box, clicking "2nd Order Interactions" will add the cross-
products for all possible pairings of those IVs. Thus, four IVs will produce six
interactions, five IVs will produce ten interactions, and so on. Note that the names of the
columns used to create any manipulate are inside the parentheses of that manipulate's
column name.
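A minimal sketch of these row-by-row operations follows. It is not VB 2.2's implementation: the dictionary-of-lists data layout, the function names, and the bracketed label format are assumptions made for illustration.

```python
from itertools import combinations
import math
import statistics

# Row-wise functions offered on the Manipulate screen.
FUNCS = {
    "SUM": sum,
    "MAX": max,
    "MIN": min,
    "MEAN": statistics.fmean,
    "PROD": math.prod,
}


def manipulate(columns, names, func):
    """Return (label, values) for a row-by-row combination of the named columns."""
    label = f"{func}[{','.join(names)}]"
    rows = zip(*(columns[name] for name in names))
    return label, [FUNCS[func](row) for row in rows]


def second_order_interactions(columns, names):
    """Cross-products for every pairing of the named columns."""
    return dict(manipulate(columns, pair, "PROD") for pair in combinations(names, 2))
```

The number of pairings is n(n-1)/2, which gives the six interactions for four IVs and ten for five IVs mentioned above.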
Figure 21. Formation of two-way cross-products of a set of four existent IVs
VB 2.2 does not allow previously created "manipulates" (new columns of data
created through the "Manipulate" button) to be further manipulated. Previously-created
manipulates will not appear in the "Independent Variables" section at the left. They can,
however, be chosen as the response variable or deleted from the data grid, using the
appropriate menu choices, accessed by a right-click of the column header.
6.7 Transforming the Independent Variables
VB 2.2 gives users the ability to transform non-categorical IVs to assist in
linearizing the relationship between the IVs and the response variable, which is a
fundamental assumption of an MLR analysis. VB 2.2 provides the following
transformations, where Xt is the transformed IV and X is the original IV:
Log10: Xt = log10(X)
Loge: Xt = loge(X)
Inverse: Xt = 1/X
Square: Xt = X^2
Square Root: Xt = X^0.5
Quad Root: Xt = X^0.25
Polynomial: Xt = a + bX + cX^2
General Exponent: Xt = X^e, where the user specifies the value of e
When users click the "Transform" button, they are presented a choice of
transformations to investigate:
Figure 22. The range of choices for IV transformations
When users click "Go," the chosen transforms are applied to each non-categorical
IV. VB 2.2 then opens a table that allows comparison of the success of each transform
using a Pearson correlation coefficient, a measure of linear dependence between the
response variable and the IVs. For the polynomial transformation, the Pearson
coefficient is calculated as the square root of the adjusted R^2 value derived from the
regression of the response on Xt. Because this adjusted R^2 value can possibly be
negative, an empirically-derived formula is applied when adjusted R^2 values fall below
0.1:
Polynomial Pearson Coefficient = (-6.67*REi^2 + 13.9*REi - 6.24)*(R^2)^0.5
where REi = 1.015 - 1.856*R^2 + 1.862*adjR^2 - 0.000153*N, R^2 and adjR^2 are defined
by the regression of the response on Xt, and N = number of observations.
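The rule for the polynomial case can be sketched as follows, assuming the formula reads (-6.67*RE^2 + 13.9*RE - 6.24)*(R^2)^0.5 with RE = 1.015 - 1.856*R^2 + 1.862*adjR^2 - 0.000153*N. The function name is hypothetical, and the R^2 and adjusted R^2 inputs are taken as given rather than computed from a regression.

```python
import math


def polynomial_pearson(r2, adj_r2, n):
    """Pearson-style score for a polynomial transform of an IV.

    Uses sqrt(adjusted R^2) when adjusted R^2 is at least 0.1; below that,
    falls back to the empirical formula quoted in the guide.
    """
    if adj_r2 >= 0.1:
        return math.sqrt(adj_r2)
    re = 1.015 - 1.856 * r2 + 1.862 * adj_r2 - 0.000153 * n
    return (-6.67 * re ** 2 + 13.9 * re - 6.24) * math.sqrt(r2)
```

The fallback exists because the square root of a negative adjusted R^2 is undefined; the formula depends only on R^2, adjusted R^2, and the number of observations.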
The table that VB 2.2 creates groups all transformed versions of each IV by the
IV name and lists the type of transformation and the associated Pearson coefficient. By default, the
transformation (this includes the un-transformed version of the IV, denoted by "none")
with the largest absolute value of the Pearson coefficient is highlighted in bold text for
selection. Users may override the default selection by left-clicking on the row header of
a transformed IV they choose. They may also override the default by setting a Threshold
percentage and clicking "Threshold Select" on the left side of the box. This selects the
un-transformed IV unless the transformed IV with the highest absolute value Pearson
coefficient exceeds the un-transformed IV Pearson coefficient by the specified
percentage. In essence, the user is saying, "Unless the Pearson coefficient of the
transformed IV is some % greater than the Pearson coefficient of the un-transformed IV,
use the un-transformed IV." This can be useful because transforming IVs makes
interpreting model coefficients more difficult; unless an improvement is seen,
transformation may not be worth the trouble. Users can also revert to the default by
clicking "Go" under the "Auto Select" section at the left.
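The two selection modes can be sketched as below. This is an illustration only: the function names are hypothetical, the coefficients in the usage note are made-up values, and "exceeds by the specified percentage" is assumed to mean a relative increase in the absolute coefficient.

```python
def auto_select(pearson):
    """Pick the version of the IV with the largest absolute Pearson coefficient."""
    return max(pearson, key=lambda name: abs(pearson[name]))


def threshold_select(pearson, threshold_pct):
    """Keep the un-transformed IV ('none') unless the best transform beats it
    by more than `threshold_pct` percent (relative, on absolute values)."""
    best = auto_select(pearson)
    baseline = abs(pearson["none"])
    if abs(pearson[best]) > baseline * (1.0 + threshold_pct / 100.0):
        return best
    return "none"
```

With coefficients {"none": -0.47, "SQUARE": -0.49, "INVERSE": 0.33}, auto-selection picks SQUARE, while a 10% threshold falls back to the un-transformed IV because 0.49 is only about 4% larger than 0.47.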
Figure 23. Pearson correlation coefficient scores for judging the efficacy of IV transformations
Plotting Transformed IVs
Users may prefer to examine plots visually to determine which transformation of
an IV to choose. If users right-click on a row header in this correlation table, they can view
an array of scatterplots, time series plots, or frequency plots for each data transformation
of the IV represented by that header. Scatterplots will show the best-fit regression line,
the correlation coefficient, and the p-value for that correlation coefficient.
Figure 24. Scatterplots (Response vs. IV) for six different data transformations of a single IV
After choosing a transformation for each IV, users click "OK." This populates
the data grid with new columns representing transformed versions of the IVs. The small
checkbox in the bottom left corner of Figure 23 controls whether the untransformed
version of the IV remains enabled in the data grid after the user clicks "OK." When the
box is checked, for any IV in which the user chooses a transformed version, the un-
transformed version will be disabled in the data grid. Notice that transformed versions of
an IV are put into the data grid immediately after the original, un-transformed IV.
Notes on Transformed IVs
Any transformations put into the data grid can be deleted with the "Delete
Column" choice after right-clicking on their column header. Transformed IVs will
appear in the list of IVs on the "Manipulate" screen; however, transformed IVs cannot be
further transformed and will not appear in the transform table if the user goes back to the
"Transform" window.
VB 2.2 transformations have specific processing for certain data values and are
not pure mathematical transformations; they were designed to maintain data order
while helping to linearize the response-IV relationship. For the SQUARE (b=2),
SQUAREROOT (b=0.5), QUADROOT (b=0.25), INVERSE (b=-1) and GENERAL
EXPONENT (b is user-defined) transformations, VB 2.2 uses the signed equivalent of
the mathematical function:
x^b == sign(x)*abs(x)^b
For example: (-2)^2 = -4, (-9)^0.5 = -3, (-4)^-0.5 = -0.5, (-2)^-2 = -0.25
To avoid potentially undefined values (i.e., 1/x when x = 0), the INVERSE and
GENERAL EXPONENT (if the user sets b < 0) transformations have special processing:
If x = 0, then VB 2.2 will find the minimum of abs(z), where z is the set of all
non-zero values for the IV in question. For the purpose of computing the transformation,
once z is defined, VB 2.2 substitutes z/2 for x. From this definition, note that z can be
either a positive or negative number.
LOG10 and LOGe transforms are also the signed equivalent of the mathematical
functions, where x is positive:
loge(x) == loge(x)
loge(-x) == -loge(x)
log10(x) == log10(x)
log10(-x) == -log10(x)
In addition, if (-1 < x < 1), then loge(x) = 0 and log10(x) = 0
VB 2.2 will not compute the INVERSE, GENERAL EXPONENT (with a
negative b), LOG10 and LOGe transformations for data columns if more than 10% of the
IV's values are zero. Programmatically, zero is defined as any number whose absolute
value is less than 1.0e-21.
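The special processing described in these notes can be summarized in a short sketch. The function names are hypothetical and the code is a reading of the rules as stated, not VB 2.2's source.

```python
import math

ZERO_TOL = 1.0e-21  # values smaller than this in absolute value count as zero


def sign(x):
    return -1.0 if x < 0 else 1.0


def signed_power(x, b):
    """sign(x) * abs(x)^b; callers must substitute for zeros first when b < 0."""
    return sign(x) * abs(x) ** b


def signed_log10(x):
    """Signed log10, defined as 0 for -1 < x < 1."""
    if -1 < x < 1:
        return 0.0
    return sign(x) * math.log10(abs(x))


def signed_ln(x):
    """Signed natural log, defined as 0 for -1 < x < 1."""
    if -1 < x < 1:
        return 0.0
    return sign(x) * math.log(abs(x))


def zero_substitute(values):
    """Return z/2, where z is the non-zero value with the smallest absolute value."""
    nonzero = [v for v in values if abs(v) >= ZERO_TOL]
    return min(nonzero, key=abs) / 2


def too_many_zeros(values):
    """True when more than 10% of the values are zero (transform is skipped)."""
    zeros = sum(1 for v in values if abs(v) < ZERO_TOL)
    return zeros > 0.10 * len(values)
```

Because z keeps its sign, the substituted value z/2 can be negative, so the inverse of a zero entry can come out negative as well.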
POLYNOMIAL transformations are the result of a linear regression of the
response variable on the IV and the square of the IV:
Poly(X) = a + b*X + c*X^2
where a, b, and c are determined by a multiple linear regression of the
response variable on X and X^2.
In general, the name of the transformed column of data that VB 2.2 creates is
simply the type of transformation, with the original data column name in parentheses.
For example, WaterTemp would become LOG10(WaterTemp). There are some
exceptions, however:
INVERSE(X,Y) : X is the original data column name and Y is the z/2 value
discussed earlier in this section.
POWER(X,Y) : When Y is positive, X is the original data column name and Y is
the exponent specified by the user.
POWER(X,Y,Z) : When Y is negative, X is the original data column name, Y is
the exponent specified by the user, and Z is the z/2 value discussed earlier in this section.
POLY(X, a,b,c): X is the original data column name and a, b, and c are the
values of the polynomial regression coefficients.
Finally, because transformations are determined by the current response variable,
when users change the response variable in the data grid (using the column header right-
click menu), all transformed IVs in the data grid are erased (a message warns the user).
6.8 Saving Processed Data
Data can be saved in a project file (Project→Save) at any time during data processing.
When the file is opened, the data grid will be repopulated as it appeared when the project
was saved. Also, users may highlight the entire table or sections of the table and use
Control-C and Control-V to copy and paste the data grid into a word processing or
spreadsheet application.
6.9 Go to Modeling
After data processing is complete, users must click the "Go to Modeling" button
to open the Modeling tab. If users have already done modeling work and returned to the
data sheet to make changes, they will receive a message that the data sheet has changed
and any prior information on the Modeling, Residual, or MLR Prediction tabs will be
erased. Users can then choose to move forward to the Modeling tab or revert to the
previous version of the data sheet prior to making changes.
7. MODELING
The Modeling tab facilitates finding the best model based on criteria selected by
the user. As the number of IVs increases, the number of possible models in the solution
space increases exponentially. Users may select all or a subset of the IVs for
consideration in the model to reduce the size of the solution space.
7.1 Selecting Variables for Model Building
All eligible IVs are listed in the left column ("Available Variables") under the
Variable Selection sub-tab. Any variable users wish to consider for model inclusion must
then be moved to the "Independent Variables" list by highlighting the IV and clicking the
">" key. Any number of IVs can be moved or removed from this list.
Figure 25. Selecting variables for MLR processing within the Modeling tab
As you add or remove IVs from the "Independent Variables" list, the number of
possible MLR models is displayed in the status strip at the bottom right of the application
window. The number of possible models can grow exceedingly large; 66 IVs represent
7.38*10^19 possibilities. More than 66 variables produces a number that exceeds the
capacity of the program to store it - in such cases, "more than 9.2e019" is displayed.
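The growth in model count can be checked with a short calculation. The sketch assumes the count is 2^n for n candidate IVs, which matches the 7.38*10^19 figure quoted for 66 IVs; the capped variant, which limits the number of IVs per model, is an additional assumption for illustration and is not taken from VB 2.2.

```python
import math


def possible_models(n_ivs):
    """Number of IV subsets, assuming every IV is either in or out of a model."""
    return 2 ** n_ivs


def possible_models_capped(n_ivs, max_vars):
    """Number of non-empty subsets containing at most `max_vars` IVs."""
    return sum(math.comb(n_ivs, k) for k in range(1, max_vars + 1))
```

Each added IV doubles the uncapped count, which is why an exhaustive search quickly becomes impractical and the Genetic Algorithm is offered as an alternative.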
7.2 Modeling Control Options
The first decision users make on this tab involves which evaluation criteria will be used
to judge model fitness. There are currently ten criteria available in the drop-down menu:
• Akaike Information Criterion (AIC)
• Corrected Akaike Information Criterion (AICC)
• R^2
• Adjusted R^2
• Predicted Error Sum of Squares (PRESS)
• Bayesian Information Criterion (BIC)
• RMSE
• Sensitivity
• Specificity
• Accuracy
Figure 26. Setting modeling options within the Modeling interface
The "Maximum VIF" (Variance Inflation Factor) parameter is used to discard
models that contain variables with a high degree of multi-collinearity, i.e., IVs
that are highly correlated with other IVs. If any IV in a model has a VIF exceeding the
threshold, that model will be discarded. The default VIF value used in the application is
5. A VIF of 5 means that 80% (1 - 1/5) of the variability in an IV can be explained by
the variability of other IVs in the model. A VIF of 10 means that 90% (1 - 1/10) of the
variability can be explained, and so on. If users aren't concerned with multi-collinearity
among the explanatory variables in a regression model, they can raise the Maximum VIF
value. However, multi-collinearity leads to poorly estimated regression coefficients (i.e.,
large standard deviations of these coefficients).
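The relationship between a VIF and the share of an IV's variability explained by the other IVs follows from the standard definition VIF = 1/(1 - R²), where R² comes from regressing that IV on the remaining IVs. The sketch below assumes that definition; the function names are illustrative only.

```python
def vif_from_r_squared(r_squared: float) -> float:
    """VIF for an IV whose regression on the other IVs has this R-squared."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def explained_fraction(vif: float) -> float:
    """Fraction of an IV's variability explained by the other IVs."""
    if vif < 1.0:
        raise ValueError("a VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


def model_passes_vif(vifs, max_vif: float = 5.0) -> bool:
    """A model is kept only if no IV has a VIF above the threshold."""
    return all(v <= max_vif for v in vifs)


print(explained_fraction(5.0))    # share explained at the default threshold
print(explained_fraction(10.0))
print(model_passes_vif([1.2, 3.4, 6.1]))
```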
The "Maximum Number of Variables in a Model" parameter tells VB 2.2 how
large the models being evaluated can be. As a rule, most modelers prefer to have about
10 observations per estimated parameter in their models, otherwise possibilities increase
for model over-fitting and poor estimation of regression parameters. VB 2.2's
recommendation is close to this rule. It equals (1 + n/10) where n is the number of
observations in the dataset. The maximum allowable number equals n/5. VB 2.2 won't
let users set this value over the maximum. The total number of available parameters is
also given here.
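As a rough illustration of these limits, the sketch below applies the two formulas to the 37-observation example shown in this guide's screenshots (Recommended: 4, Max: 7). Rounding down, and capping the maximum at the number of available IVs, are assumptions made to match those displayed values.

```python
import math


def recommended_variables(n_obs: int) -> int:
    """Recommended model size, 1 + n/10, rounded down (assumed rounding)."""
    return math.floor(1 + n_obs / 10)


def maximum_variables(n_obs: int, n_available: int) -> int:
    """Largest allowed model size, n/5, capped at the available IV count."""
    return min(n_available, math.floor(n_obs / 5))


print(recommended_variables(37), maximum_variables(37, 7))
```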
If we define p as the number of parameters in a model, n as the number of
observations in the dataset, RSS as the residual sum of squares for a model, and TSS as
the total sum of squares for a model, then the evaluation criteria for a model can be
defined as:
• Akaike Information Criterion (AIC): 2p + n*ln(RSS)
• Corrected Akaike Information Criterion (AICC): ln(RSS/n) + (n+p)/(n-p-2)
• R²: 1 - RSS/TSS
• Adjusted R²: 1 - (1 - R²)(n - 1)/(n - p - 1)
• Bayes (Schwarz) Information Criterion (BIC): n*ln(RSS/n) + p*ln(n)
• Root Mean Squared Error (RMSE): (RSS/n)^(1/2)
• Predicted Error Sum of Squares (PRESS): 1 - Σ(y_i - y_-i)² / Σ(y_i - y_m)²
where y_i is the ith observation, y_-i is the model estimate of the ith observation when the model coefficients
are fitted with the ith observation removed from the dataset, and y_m is the mean value of y in the dataset
• Accuracy: (true positives + true negatives) / number of total observations
• Specificity: true negatives / (true negatives + false positives)
• Sensitivity: true positives / (true positives + false negatives)
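Several of these criteria can be checked by hand. The sketch below is a minimal illustration, not VB 2.2's implementation: it uses the standard definitions of sensitivity (true positive rate) and specificity (true negative rate), and all function names are illustrative. The sample inputs are the values shown in this guide's example screenshots (n = 37, three IVs, R² = 0.4195, no false positives, two false negatives).

```python
import math


def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """1 - (1 - R^2)(n - 1)/(n - p - 1), with p the number of IVs."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def bic(rss: float, n: int, p: int) -> float:
    """n*ln(RSS/n) + p*ln(n), as listed above."""
    return n * math.log(rss / n) + p * math.log(n)


def rmse(rss: float, n: int) -> float:
    """Square root of RSS/n, as listed above."""
    return math.sqrt(rss / n)


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, sensitivity and specificity from the four outcome counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        # Undefined ratios (no actual positives/negatives) are reported as 0.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


print(round(adjusted_r_squared(0.4195, 37, 3), 4))
print(classification_metrics(tp=0, tn=35, fp=0, fn=2))
```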
Sensitivity, specificity and accuracy are special cases that require users to enter
both a Decision Criterion (DC) and Regulatory Standard (RS) so that true/false positives
and true/false negatives can be defined. The DC is a modeled (predicted) value the user
chooses. Model predictions above this threshold are considered exceedances, while
model predictions below this value are considered non-exceedances. The RS is typically
a safety limit on fecal indicator bacteria (FIB) levels set by a state or federal agency. The
"Threshold Transform" radio buttons tell VB 2.2 how to transform the DC and RS for
comparison to model predictions and observations. If a transformation definition is set
for the response variable (either manually by the user or automatically by transforming
the response) during data processing, that definition will be set as the default here. Users
should understand that changing the threshold transform definition can lead to problems
when comparing modeling predictions to observations. Caution should be exercised.
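To see why the threshold transform must match the transform applied to the response variable, consider a model fitted to log10-transformed counts: a threshold of 235 has to be compared on the same log10 scale. The sketch below is illustrative only; the function name and the interpretation of "Power" as raising the value to a user-supplied exponent are assumptions.

```python
import math


def transform_threshold(value: float, transform: str = "none",
                        exponent: float = None) -> float:
    """Put a DC or RS on the same scale as the modeled response."""
    if transform == "none":
        return value
    if transform == "log10":
        return math.log10(value)
    if transform == "ln":
        return math.log(value)
    if transform == "power":
        if exponent is None:
            raise ValueError("a power transform needs an exponent")
        return value ** exponent
    raise ValueError("unknown transform: " + transform)


# A threshold of 235 expressed on the log10 scale.
print(round(transform_threshold(235, "log10"), 4))
```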
Figure 27. Setting evaluation thresholds and threshold transformation information within the
modeling interface
7.3 Linear Regression Modeling Methods
There are two options for exploring the solution space.
1. Manual - this option is for a directed model search. If the 'Run all combinations'
box is not checked, a single model including every IV that was added to the
"Independent Variables" column will be evaluated. If 'Run all combinations' is
checked, an exhaustive search is performed. The exhaustive search evaluates
every model that can be constructed with the selected IVs, but does not evaluate
any with more parameters than the "Maximum Number of Variables in a Model"
input box. For example, if there are 24 IVs to evaluate and the maximum number
of IVs in a model is set at 8, the exhaustive routine examines every possible 1-, 2-,
3-, 4-, 5-, 6-, 7-, and 8-parameter model. As the number of IVs rises, the number
of possible models quickly gets so large that the exhaustive routine cannot
maintain reasonable computation times and the user is advised to switch to the
genetic algorithm.
2. Genetic Algorithm - the Genetic Algorithm (GA) option explores solution spaces
too large to handle exhaustively. Genetic algorithms are loosely based on the
natural evolutionary process, in which individuals in a population reproduce and
mutate. Individuals with high fitness (regression models that produce small
residuals) are more likely to reproduce and pass their genes (IVs) to the next
generation. The goal is to find a good solution without having to examine every
possible option; the GA balances random and directed searching.
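The size of an exhaustive search with a cap on model size can be counted directly. The sketch below assumes each model is a distinct subset of the selected IVs containing between one IV and the maximum number allowed; the function names are illustrative. For the example above (24 IVs, at most 8 per model) it gives 1,271,625 models.

```python
from itertools import combinations
from math import comb


def count_capped_models(num_ivs: int, max_vars: int) -> int:
    """Number of IV subsets of size 1 up to max_vars."""
    return sum(comb(num_ivs, k) for k in range(1, min(num_ivs, max_vars) + 1))


def enumerate_models(ivs, max_vars: int):
    """Yield every candidate model as a tuple of IV names."""
    for k in range(1, min(len(ivs), max_vars) + 1):
        yield from combinations(ivs, k)


print(count_capped_models(24, 8))
print(list(enumerate_models(["airtemp", "waveheight", "WindSpeed"], 2)))
```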
Figure 28. Model building interface using a manual search (left panel) or the Genetic Algorithm
(right panel)
Choosing between an exhaustive and a GA search depends on your data set,
available hardware and time constraints. Fifteen IVs produce about 32,000 model
possibilities; on our system (Dell Precision T5400 workstation running MS Windows XP SP3
with dual Xeon 2.66 GHz processors and 4 GB RAM), the exhaustive search was
completed in approximately 90 seconds. Sixteen IVs represent more than 65,000
possibilities, which is roughly double the number for 15 IVs. Some model building results are
summarized below:
Exhaustive Search - Run All Combinations

Number of IVs    Number of MLR models    Approximate Time Required to Generate
                                         and Filter Models (seconds)
15               32767                   90
16               65535                   110
17               131071                  280
By contrast, the GA with 17 IVs was completed in less than seven seconds. We note,
however, that the exhaustive search did find a slightly better model than the GA did using
the selected AIC evaluation criterion (49.2 versus 55).
An alternative modeling strategy could be to use the GA on your entire list of IVs,
then the exhaustive search on a subset of the initial IVs - any IV that appears in one of
the best ten models found by the GA. This two-step modeling process is facilitated with
the "IV Filter" list control.
Figure 29. Using the IV filter to select a subset of variables from the best-fit models
When the GA ends and the 10 best models are shown, use the "Clear List" button
to remove all IVs from the selection list. Select each model in the "Best Fits" list, one at a
time, and click the "Add to List" button; this action adds any IVs in the model to the
Independent Variable list. After doing this for the ten best models, users likely have a
much more manageable IV list and can run an exhaustive search to find the very best
combination of IVs. Regardless of the method chosen to build models, the "Best Fits"
window shows the top ten models found, in terms of the evaluation criterion chosen.
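Conceptually, the IV filter builds the union of the IVs that appear in the best-fit models. The sketch below illustrates that idea with hypothetical IV names; it is not VB 2.2 code.

```python
def filter_ivs(best_models):
    """Union of IVs across models, kept in first-seen order."""
    selected = []
    for model in best_models:
        for iv in model:
            if iv not in selected:
                selected.append(iv)
    return selected


# Hypothetical best-fit models returned by a GA run.
ga_best = [
    ["turbidity", "clouds", "airtemp"],
    ["turbidity", "dewpoint"],
    ["clouds", "windspeed", "airtemp"],
]
print(filter_ivs(ga_best))
```

The shortened list can then be searched exhaustively, since the number of candidate models falls sharply as IVs are dropped.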
7.4 Using the Genetic Algorithm
There are five parameters users can set to adjust performance of the GA:
a) Seed value: seeds the internal random number generator that produces random values.
Setting this seed to a known value will make the GA run reproducible.
Changing the seed will create a new series of random values, possibly returning
different results.
b) Population size: number of individuals in the population of each generation. A
larger population broadens the search at each generation, but slows processing
time.
c) Number of generations: how long to run the search since individuals can
reproduce and mutate once each generation. The fitness of every individual in
the population is evaluated at the end of each generation.
d) Mutation rate: chance each individual has of undergoing random mutation in
each generation. The higher the mutation rate, the more random (less directed)
the search of parameter space is.
e) Crossover rate: probability that two selected individuals in the population will
exchange genome parts. Exchanging genes creates new individuals in the
population.
The best GA parameter values depend on the dataset being investigated, but
typical values of the mutation rate are between 0.001 and 0.1 (0.1 and 10%) and typical
values of the crossover rate are between 0.4 and 0.75 (40 and 75%). For most datasets, a
population size and generation number of 100 will be sufficient. Larger datasets may
require an increase in these numbers for optimal solutions.
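The roles of the five parameters can be seen in a minimal genetic algorithm for subset selection. This is a sketch under stated assumptions, not VB 2.2's algorithm: individuals are lists of booleans (one per IV), parents are chosen by two-way tournament, crossover is single-point, and lower fitness is better (as with AIC). The toy fitness function simply counts mismatches against a hypothetical "ideal" subset.

```python
import random


def run_ga(num_ivs, fitness, max_vars, seed=None, population_size=100,
           generations=100, mutation_rate=0.05, crossover_rate=0.5):
    """Return the best IV subset found (a list of booleans)."""
    rng = random.Random(seed)  # a fixed seed makes the run reproducible

    def random_individual():
        chosen = set(rng.sample(range(num_ivs), rng.randint(1, max_vars)))
        return [i in chosen for i in range(num_ivs)]

    def score(individual):
        # Empty or oversized models are treated as the worst possible.
        size = sum(individual)
        return fitness(individual) if 1 <= size <= max_vars else float("inf")

    def pick(population):
        # Tournament selection: the fitter of two random individuals.
        a, b = rng.sample(population, 2)
        return a if score(a) <= score(b) else b

    population = [random_individual() for _ in range(population_size)]
    best = min(population, key=score)
    for _ in range(generations):
        children = []
        while len(children) < population_size:
            mother, father = pick(population), pick(population)
            child = list(mother)
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, num_ivs)  # single-point crossover
                child = mother[:cut] + father[cut:]
            if rng.random() < mutation_rate:
                flip = rng.randrange(num_ivs)    # toggle one IV in or out
                child[flip] = not child[flip]
            children.append(child)
        population = children
        best = min(population + [best], key=score)
    return best


# Toy fitness: distance from a hypothetical ideal subset of 8 IVs.
ideal = [True, False, True, False, False, True, False, False]


def toy_fitness(individual):
    return sum(a != b for a, b in zip(individual, ideal))


print(run_ga(8, toy_fitness, max_vars=4, seed=42))
```

Running the function twice with the same seed returns the same subset, which is the reproducibility property described for the seed value.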
Figure 30. Genetic algorithm options within the modeling interface
7.5 Evaluating Model Output
After selecting a method to build models and an evaluation criterion to rank them,
users then click the "Run" button. Model selection and evaluation progress is displayed
on the "Progress" graph at the lower right of the Modeling tab. Note that the "Run"
button changes to "Cancel"; the process can be interrupted should progress be unacceptably
slow. Once model-building is completed, the ten best MLR fits are displayed in the
"Best Fits" box. Selecting a model from the list results in (see Figure 31):
1. A list of the model's IVs with associated regression coefficients and statistics
is displayed on the "Variable Statistics" subtab.
2. A list of the model's evaluation metrics is shown on the "Model Statistics"
subtab.
3. The "Results" subtab will show the observations and model fits versus the
observation number. If observations are chronologically ordered, this is
basically a time series plot.
4. The "Observed versus Predicted" subtab can show plots and tables based on
observations versus model fits.
5. The "ROC Curves" subtab shows a plot of the Receiver Operating
Characteristic curve of each "Best Fits" model, as well as a table showing the
computed AUC (area-under-the-curve) for each ROC (see Section 7.7).
6. Clicking on "View Report" generates a text report of model and variable
statistics for the selected model.
7. The "Residuals" tab will appear at the top, allowing users to proceed to the
residual analysis component of the application.
8. The "Prediction" tab will appear at the top, allowing users to proceed to the
prediction component of the application.
Note that selecting a different model from the "Best Fits" list updates the Variable
and Model statistics tables and displays of the plotting subtabs.
Figure 31. Modeling results shown after completion of an exhaustive regression run
Variable Statistics

Parameter        Coefficient   Standardized   Std. Error   t-Statistic   P-Value
                               Coefficient
(Intercept)      1.8228                       0.2994       6.0879        7.4508e-07
uv               -0.0007       -0.5050        0.0002       -3.7750       0.0006
waveheight       1.6811        0.2239         1.0139       1.6580        0.1068
WindDirection    -0.0030       -0.4177        0.0010       -3.1185       0.0038
Figure 32. Modeling Interface showing variable statistics for the selected Best-Fit model
Model Statistics

Metric                           Value
R Squared                        0.4195
Adjusted R Squared               0.3667
Akaike Information Criterion     7.2471
Corrected AIC                    9.1826
Bayesian Info Criterion          -25.3092
PRESS                            17.0349
RMSE                             0.6188
Sensitivity                      0.0000
Specificity                      1.0000
Accuracy                         0.9459
Number of Observations           37
Figure 33. Modeling interface showing model evaluation metrics for the selected Best-Fit model
Figure 34. Modeling interface showing a time series plot for the selected model
Model Evaluation (Threshold Transform: Log10)

False Positives (Type I):     0      Specificity:   1
False Negatives (Type II):    2      Sensitivity:   0
Accuracy:                     0.9459
Figure 35. An XY scatter plot of observed versus predicted values for the selected model
Model Fit    AUC
7.2471       .739683
8.2076       .635714
9.1112       .732143
9.2219       .754464
9.2231       .754464
9.2471       .739683
10.176       .63
10.2047      .635714
10.2063      .635714
10.2076      .635714
Figure 36. The ROC curves and AUC table for the Best Fit models
7.6 Viewing X-Y Scatterplots
In multiple locations within VB 2.2 (Modeling, Residual and MLR Prediction
tabs), users can access a subtab that allows them to view information for comparing
observations to model predictions (Figure 35). From this space, users can view four
different pieces of data:
1) A plot of predictions versus observations: "Pred vs. Obs"
2) A table summarizing model errors (false negatives/false positives) as the decision
criterion (DC) varies across the range of the response variable: "Error Table: DC as
CFU"
3) A plot of the percent probability of exceedance (calculated based on the current DC)
versus observations: "% Exc vs. Obs"
4) A table summarizing model errors as the percent probability of exceedance is
varied: "Error Table: DC as % Exc"
These four are chosen with the drop-down menu at the top left corner of the form.
On both of the two plots, a right-button click in the plot area shows a menu of functions
for saving, copying, printing or manipulating the plot view. The plot area can be zoomed
and un-zoomed: left-button mouse drags an area for zooming in; with right-button click,
select "Un-Zoom" or "Set Scale to Default" to see the entire data set. To pan to an area
of the plot not in view, hold the Shift key down and use the left mouse button to drag the
view. To view (x,y) values of any data point, hover the cursor over the data point. If the
information does not appear, right-click on the graph and make sure "Show Point Values"
is selected.
With regard to interpretation of these plots, the green (Regulatory Standard) and
blue (Decision Criterion) lines permit model evaluation and provide information on
which to base a DC to be used for predictive purposes. On the plots, false positives
are data points in the upper left quadrant of the graph, in which the model
predictions exceed the DC but observations are below the RS. In such cases, a beach
advisory would be incorrectly issued based on the model prediction, leading to potential
economic losses. False negatives (points in the lower right quadrant) represent a
potentially more serious scenario: model predictions below the DC and observations that
exceed the RS. In other words, swimming at the beach may have been allowed when it
should have been prohibited due to elevated FIB concentrations.
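The four quadrants can be expressed as a simple classification rule. The sketch below is illustrative: the function names are hypothetical, and treating a value exactly equal to a threshold as a non-exceedance is an assumption.

```python
from collections import Counter


def classify(observed, predicted, decision_criterion, regulatory_standard):
    """Name the quadrant of one (observation, prediction) pair."""
    predicted_exceeds = predicted > decision_criterion
    observed_exceeds = observed > regulatory_standard
    if predicted_exceeds and observed_exceeds:
        return "true positive"     # upper right quadrant
    if predicted_exceeds:
        return "false positive"    # upper left quadrant
    if observed_exceeds:
        return "false negative"    # lower right quadrant
    return "true negative"         # lower left quadrant


def tally(observations, predictions, dc, rs):
    """Count the outcomes over a whole dataset."""
    return Counter(classify(o, p, dc, rs)
                   for o, p in zip(observations, predictions))


# Hypothetical counts (CFU) compared against DC = RS = 235.
obs = [50, 300, 120, 400, 90]
pred = [60, 180, 260, 500, 40]
print(tally(obs, pred, dc=235, rs=235))
```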
A model that produces no false positives or false negatives would be an ideal
decision tool, but this is often unattainable with real data. Examining the two tables (#2
and #4 mentioned above) on this subtab should allow users to set a robust DC (either
using units of the actual response variable or a percentage probability of exceedance) that
minimizes both errors. Note that in most cases, the RS is set based on federal or state law
and should not be adjusted by the user; however, the user is free to adjust the DC to
minimize false negatives and false positives.
7.7 ROC Curves
In addition to time series and scatterplots which show results for an individual
model, users may also compare all "Best Fits" models using the ROC Curves tab. A
Receiver Operating Characteristic curve shows a model's true positive rate (sensitivity)
plotted against its false positive rate (1 - specificity) as a decision threshold varies
between the model's minimum and maximum predicted values. Models can then be
compared using the area under their ROC curves (AUC). Models having the largest
AUC values perform best over the entire decision space.
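A ROC curve and its AUC can be computed directly from observations and predictions. The sketch below is a minimal illustration rather than VB 2.2's routine: it sweeps the decision threshold over the distinct predicted values (plus one value below the minimum, so that the curve reaches the point where every prediction is flagged) and integrates with the trapezoidal rule. The function names are illustrative.

```python
def roc_points(observed, predicted, regulatory_standard):
    """Return (false positive rate, true positive rate) pairs, sorted."""
    actual = [o > regulatory_standard for o in observed]
    positives = sum(actual)
    negatives = len(actual) - positives
    thresholds = sorted(set(predicted))
    thresholds = [thresholds[0] - 1.0] + thresholds
    points = []
    for dc in thresholds:
        tp = sum(1 for a, p in zip(actual, predicted) if a and p > dc)
        fp = sum(1 for a, p in zip(actual, predicted) if not a and p > dc)
        tpr = tp / positives if positives else 0.0
        fpr = fp / negatives if negatives else 0.0
        points.append((fpr, tpr))
    return sorted(points)


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# A model that ranks every exceedance above every non-exceedance.
obs = [1.0, 2.0, 3.0, 4.0]
print(auc(roc_points(obs, obs, regulatory_standard=2.5)))
```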
The model with the largest AUC appears in red text in the ROC tab's model list.
A single ROC may be plotted by selecting a model in the list and clicking "Plot."
Multiple models can be selected in the usual Windows fashion with Shift-Click (select all
items between the first and second selection) or Control-Click (select only the clicked
items). The background cell color of models not selected for plot display will be gray
after the "Plot" button is clicked.
Clicking "View Table" will replace the ROC plot with a table showing the false
positives, false negatives, sensitivity, and specificity at every evaluated value of the
Decision Criterion for a single selected model. Users need only click on a model in the
list to the left of this table to see its results. The ROC plot will return to view after
clicking "View Plot."
AUC calculations are performed and curves are plotted when the "ROC Curves"
tab is selected. If this tab is active and new models are subsequently built, leaving this
tab and then returning will generate the new plots and AUC values.
7.8 Cross-Validation
Clicking the "Cross-Validation" button on the Modeling tab brings up a sub-
screen. On it users can set two parameters: sample size for the testing data (T) and
number of random samples (R) taken. When cross-validation is started, a random sample
of size T is taken from the modeling dataset and set aside. Each "Best Fits" model is then
re-fit to the remaining training data. The IVs in each model stay the same, but the
regression coefficients are adjusted to reflect the least-squares fit to the training data.
The Mean Squared Error of Prediction (MSEP) is then calculated based on the T testing
data points for each candidate model. The process (taking a random testing sample; re-
fitting regression coefficients for the ten candidate models based on the training data;
using the re-fit models to make predictions; and computing 10 MSEP values) will be
done R times. A table will show average MSEP values for each candidate model.
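The procedure can be sketched for a single candidate model. To stay self-contained, the example below assumes a model with one IV fitted by ordinary least squares; VB 2.2 models generally have several IVs, and the function names here are illustrative.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for one IV (xs must vary)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def average_msep(xs, ys, test_size, trials, seed=None):
    """Average MSEP over repeated random testing samples of size T."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    total = 0.0
    for _ in range(trials):
        test = set(rng.sample(indices, test_size))
        train = [i for i in indices if i not in test]
        # Re-fit the coefficients on the training data only.
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Mean squared error of prediction on the held-out points.
        total += sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in test) / test_size
    return total / trials


xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]
print(average_msep(xs, ys, test_size=5, trials=200, seed=1))
```

Repeating this for each candidate model and comparing the averages identifies the model with the lowest MSEP.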
Cross-validation is a widespread, useful technique for examining the predictive
power of models, i.e., their ability to make predictions for data they have not seen before.
For users wishing to emphasize the predictive ability of a potential model, cross-
validation allows them to evaluate which candidate model consistently makes the best
predictions (i.e., has the lowest MSEP). Note that the PRESS statistic Virtual Beach 2.2
provides as a model evaluation criterion is a cross-validation statistic with T set to 1. The
PRESS algorithm removes one observation at a time from the dataset, re-fits the model
regression coefficients, and then calculates the squared residual for the removed
observation. It does this once for every observation in the dataset to compute the model's
PRESS value, which offers a limited look at a model's predictive potential.
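The leave-one-out loop described above can be sketched as follows. As a simplifying assumption, the example uses a one-IV least-squares model and returns the plain sum of squared leave-one-out residuals; the function names are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for one IV (xs must vary)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def press(xs, ys):
    """Sum of squared residuals, each from a fit that omits that point."""
    total = 0.0
    for i in range(len(xs)):
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (b0 + b1 * xs[i])) ** 2
    return total


print(press([0.0, 1.0, 2.0], [0.0, 1.0, 3.0]))
```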
Recommended settings are a testing sample of approximately 25% of the total
number of observations and 500-1000 trials.
Cross Validation (Total Number of Observations: 225)

Fitness               MSEP
-143.323483044...     0.178258878933...
-143.092024887...     0.183755617610...
-142.911814497...     0.189189307571...
-142.824883297...     0.172544273813...
-142.625947884...     0.184948801378...
-142.456029460...     0.178419303326...
-141.434871829...     0.175263600776...
-141.336885984...     0.178221812478...
-141.288453099...     0.180921289930...

The remaining columns (Ind Var 1 through Ind Var 7) list the IVs in each model.
Figure 37. The cross-validation results for each of the 10 best-fit models
7.9 Report Generation
A text report of modeling results can be generated, copied to the system
clipboard, or saved to a text file using the "View Report" button. Users can view the
report within VB 2.2 by selecting the desired models and clicking on "Generate Report
for Selected Models." The report contains descriptive statistics for each model variable
and model evaluation statistic. Any number of best-fit models can be selected for
reporting.
A recommended approach to saving the information in an external application is
to copy the report to the clipboard (with the "CopytoClipboard" button) and paste it into a
rich-text application like MS Word, Write or WordPad. NotePad or other text editors
will work, but column formats will likely be lost and make the report difficult to
interpret.
Figure 38. A text report generated on the modeling results
Comparative bar graphs can be displayed to view evaluation criteria for all top
models. Click on "View Evaluation Graphs" to see these plots. Hovering the mouse over
any plot displays the relevant evaluation criterion, and hovering over any bar displays the
associated model. Note that the evaluation criteria graphs are scaled to emphasize
differences between the model scores although the difference may, in fact, be quite small.
With the cursor over any graph, right-mouse click and select "Set Scale to Default" to
view the un-scaled graph.
Figure 39. Plots of the various model evaluation metrics for the 10 best-fit models
Figure 40. Scaled versus un-scaled views of selected model evaluation criterion
8. RESIDUAL ANALYSIS
Once a model is selected in the "Best Fits" window on the Modeling tab, the
"Residuals" and "MLR Prediction" tabs appear at the top of the interface. Users may
click "Residuals" to view information about residuals of the selected model, but this is
not mandatory; they may take the selected model immediately to prediction mode by
clicking on "MLR Prediction." There are four subtabs on the Residuals tab: Predicted vs
Residuals, Observed vs Predicted, DFFITS, and Cook's Distance.
Figure 41. Information available on the Residuals tab, including a plot of studentized residuals
versus predictions, the Anderson-Darling residual normality test, and regression statistics
The Predicted vs Residuals subtab shows a graph of the studentized residuals
versus their predicted model values. The Anderson-Darling Normality Statistic
(http://en.wikipedia.org/wiki/Anderson-Darling) is shown with its significance (p-value).
Linear regression assumes normally-distributed residuals, so if this A-D normality test
fails (the A-D p-value is less than 0.05), the user should 1) transform the response
variable, 2) transform some of the IVs, or 3) consider deleting offending high-leverage
observations, which can be done on this tab.
Figure 42. Plot of predictions vs. studentized residuals and the A-D test of normality
On the DFFITS and Cook's Distance subtabs, observations are sorted by the largest
(absolute value) respective measure in a grid at the left. A plot of the DFFITS/Cook's
Distances for each record (observation) versus the Record ID is shown at the right. Data
points with very large DFFITS/Cook's Distances (i.e., those that lie outside the horizontal red
boundaries on the graph) distort the fitted values and standard deviation of the regression
coefficients.
Figure 43. A table and plot of the DFFITS scores for the residuals
Clicking the Iterative Rebuild "Go" button removes the observation with the
largest absolute value DFFITS/Cook's Distance, re-fits the regression, and calculates new
DFFITS/Cook's Distances for the remaining observations. This model is named
"Rebuild1," and it is added to the "Models" window at the top left of the screen.
Clicking on the Iterative Rebuild "Go" button again would produce a model called
"Rebuild2," which is calculated after removing the observation with the largest absolute
value DFFITS/Cook's Distance remaining in the dataset (it is the 2nd largest absolute
value in the original dataset). The user can continue to click "Go" and remove
observations with the largest remaining DFFITS/Cook's Distances, thus creating
"Rebuild3," "Rebuild4," "Rebuild5," etc. VB will not allow a user to delete any
observations if 10 or fewer observations remain in the dataset.
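One click of the Iterative Rebuild "Go" button can be sketched as a single removal step. The example below is an illustration under stated assumptions: `influence` stands in for whatever routine re-fits the model and returns one DFFITS or Cook's Distance value per record, and the toy influence measure (deviation from the mean) is purely hypothetical.

```python
def rebuild_once(records, influence, min_remaining=10):
    """Drop the record with the largest absolute influence, if allowed.

    Returns (remaining_records, removed_record); removed_record is None
    when min_remaining or fewer records are left and nothing is deleted.
    """
    if len(records) <= min_remaining:
        return list(records), None
    scores = influence(records)
    worst = max(range(len(records)), key=lambda i: abs(scores[i]))
    return records[:worst] + records[worst + 1:], records[worst]


def toy_influence(values):
    """Hypothetical stand-in: each value's deviation from the mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


data = [float(i) for i in range(12)] + [100.0]
data, removed = rebuild_once(data, toy_influence)   # analogous to "Rebuild1"
print(removed, len(data))
```

Calling the function again on the returned list corresponds to the next rebuild, since the influence values are recomputed from the remaining records each time.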
Whenever a "rebuild" is created by pressing "Go," the information displayed on
the Residual tab (variable and model statistics, Observed vs Predicted plot, Predicted vs
Residuals plot, DFFITS values, etc.) is automatically updated to reflect this new model
(even if another model is highlighted in the "Models" window). However, the user can
select any model in the "Models" window to view its associated data and plots.
The user has complete freedom to carry out the outlier removal process while
toggling back and forth between the DFFITS and Cook's Distance subtabs. For example,
the first removal can be based on a DFFITS value, the next removal can be based on a
Cook's Distance, the next two removals can be based on DFFITS, etc. If the user wishes
to clear the "Models" window for whatever reason, simply click the "Clear" button.
Rather than using Iterative Rebuild, the user has two additional choices for Auto
Rebuild, both of which remove all observations above some threshold. The "iterative
threshold" choice bases removals on a threshold that is updated every time an observation
is deleted. For DFFITS, this threshold is 2*(p/n)^0.5, where p is the number of IVs in the
model and n is the current number of observations in the dataset. For Cook's Distance,
the threshold is 4/n.
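[Screenshot: the Iterative Rebuild "Go" button with the current value 2*SQR(p/n) = 0.2491; the
Auto Rebuild "Go" button with the "Stop when all DFFITS less than:" options ("iterative threshold
using 2*SQR(p/n)" or "constant threshold," here 0.2491); and the "View Data Table" button.]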
Figure 44. DFFITS/Cook's Distance controls for removing highly influential data points
In the "iterative threshold" process, step one is to check if any DFFITS/Cook's
Distances are above the threshold; if so, VB removes the observation with the largest
absolute value DFFITS/Cook's Distance and then recalculates the regression model, the
DFFITS/Cook's Distances, and the threshold (because n has been reduced by 1). VB
then checks to see if any of these new DFFITS/Cook's Distances are above the new
threshold. If so, the process repeats. VB will continue until no DFFITS/Cook's
Distances remain that exceed the current threshold, or until half of the dataset has been
removed, whichever comes first. For example, if a dataset has 100 observations, VB will
allow 50 to be removed before it breaks out of the Auto Rebuild removal loop. At that
point the user can click the Auto Rebuild "Go" button again to potentially remove
another 25 observations of the remaining 50. We note that, in practice, one should not
remove more than 5-10% of the original dataset as outliers; the need to remove more
indicates a poor MLR fit and warrants a different analytical technique.
Using the "constant threshold" Auto Rebuild option differs from the "iterative
threshold" only in that the threshold remains static (i.e., the value the user types into the
input box) regardless of how many observations are deleted. Updated DFFITS/Cook's
Distances are still calculated after every removal event. VB will also stop this process if
half the number of starting observations has been deleted. There is an upper limit to the
number that can be entered into the "constant threshold" input box (DFFITS = 3, Cook's
Distance = 16/n).
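The Auto Rebuild loop described above can be sketched in Python with NumPy. This is a minimal illustration, not VB's own code: the function names `influence` and `auto_rebuild` are hypothetical, DFFITS and Cook's Distance are computed from their standard textbook definitions, and applying the 10-observation floor to Auto Rebuild is an assumption. The thresholds follow the text (p is the number of IVs, n the current number of observations).

```python
import numpy as np


def influence(X, y):
    """Return (DFFITS, Cook's Distance) for an ordinary least-squares fit of y on X."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])          # design matrix with intercept
    k = A.shape[1]                                # fitted parameters (IVs + intercept)
    Q, _ = np.linalg.qr(A)
    h = np.sum(Q ** 2, axis=1)                    # leverages (hat-matrix diagonal)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ beta                              # raw residuals
    s2 = e @ e / (n - k)                          # residual variance
    s2_del = (s2 * (n - k) - e ** 2 / (1 - h)) / (n - k - 1)  # leave-one-out variance
    t = e / np.sqrt(s2_del * (1 - h))             # externally studentized residuals
    dffits = t * np.sqrt(h / (1 - h))
    cooks = e ** 2 / (k * s2) * h / (1 - h) ** 2
    return dffits, cooks


def auto_rebuild(X, y, metric="dffits", constant=None, min_obs=10):
    """Remove the most influential observation until none exceeds the threshold.

    constant=None uses the "iterative threshold"; a number keeps the threshold fixed.
    Returns the original row indices in the order they were removed.
    """
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    y = np.asarray(y, dtype=float)
    keep = np.arange(len(y))
    max_removals = len(y) // 2                    # stop after half of the starting rows
    removed = []
    while len(removed) < max_removals and len(keep) > min_obs:
        n, p = len(keep), X.shape[1]              # p = number of IVs, as in the text
        dffits, cooks = influence(X[keep], y[keep])
        score = np.abs(dffits) if metric == "dffits" else cooks
        if constant is not None:
            threshold = constant
        else:
            threshold = 2 * np.sqrt(p / n) if metric == "dffits" else 4 / n
        worst = int(np.argmax(score))
        if score[worst] <= threshold:
            break                                 # nothing left above the threshold
        removed.append(int(keep[worst]))          # each removal is one "rebuild"
        keep = np.delete(keep, worst)
    return removed
```

Each pass through the loop corresponds to one "Rebuild" model; calling the function again on the surviving rows mimics clicking the Auto Rebuild "Go" button a second time.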
Upon completion of the Auto Rebuild process, multiple models may have been
added to the "Models" window. For example, if 10 observations were removed, then
"Rebuild1" through "Rebuild10" will appear in the "Models" window.
If a user has interest in both DFFITS and Cook's Distances as outlier metrics, we
suggest one of the following methods:
1) To see if the two criteria would produce different results:
Apply DFFITS removal to your model of choice. Note the results and then clear the
Residual tab using the "Clear" button. Next perform a removal process based on
Cook's Distance and compare the results.
2) To filter out observations that offend either DFFITS or Cook's Distance criteria:
Run DFFITS removal on the model (i.e., remove all observations above your
specified DFFITS threshold), then click the Cook's Distance subtab and perform
additional outlier removal based on its threshold. After this process, remaining
observations are "OK" from the perspective of both metrics.
Note that the highlighted model in the "Models" box is used if the "MLR
Prediction" tab is clicked, not necessarily the model whose information is displayed on
the Residuals tab. Also note that any observations removed from the "Residuals" tab are
not removed from the primary dataset shown on the "Data Processing" tab.
Viewing the Data Table
From the DFFITS or Cook's Distance subtabs, users can click on "View Data
Table" to display a history of the observation removal process for the model highlighted
in the "Models" box. From this window, users may export the dataset for external use or
re-importation into VB 2.2.
Records Eliminated from Model Data Set

Model     Residual Value  Residual Type  Date       logEcoli     clouds  SQR[turbidity]     SQR[Previous24h...
Rebuild1  -1.339716       DFFITS         8/16/2007  3.58546073   5       16.06237840420...  1.118033988749...
Rebuild2  -1.013314       DFFITS         6/1/2009   0.301029996  4       2.664582518894...  0
Rebuild3  0.685558        DFFITS         7/25/2008  2.939519253  3       5.540758070878...  0

Model Data Set - Inactive Records in Red                                  [Button: "Save Data"]

Date      logEcoli     clouds  SQR[turbidity]     SQR[Previous24h...  POLY[airtemp]
6/1/2007  1.230448921  4       1.717556403731...  0                   1.507064992941...
6/2/2007  2.939519253  4       1.612451549659...  0                   1.603774691988...
6/3/2007  1.897627091  2       6.606814663663...  0.223606797749...   1.783618147049...
6/4/2007  1.204119983  3       3.154362059117...  0                   1.783618147049...
Figure 45. "View Data Table" window for examining the dataset after removal of influential data
points
The "Observed vs Predicted" subtab is the same as that in Section 7.6. There are
two plots and two tables to examine, along with controls to modify the Decision Criterion
(blue horizontal line) and Regulatory Standard (green vertical line), to judge effects these
changes have on model outcomes (false positives, false negatives, sensitivity, specificity,
etc.).
[Screenshot: the "Observed vs Predicted" subtab, with Plot Thresholds controls (Decision Criterion,
Regulatory Standard, Threshold Transform, and an "Update" button), Model Evaluation statistics
(false positives, false negatives, specificity = 0.9882, sensitivity = 0.3043, accuracy = 0.8772),
and a "Predictions vs Observations" scatterplot with Decision Threshold and Regulatory Threshold
lines.]
Figure 46. Observed vs. Predicted plot on the Residual tab with model evaluation threshold control
and model evaluation statistics
[Screenshot: the Residuals tab of Virtual Beach 2.2. The "Models" list contains SelectedModel,
Rebuild1, Rebuild2, and Rebuild3; the "Variable Statistics" and "Model Statistics" tables list each
parameter with its Coefficient, Standardized Coefficient, Std. Error, t-Statistic, and P-Value. The
"Predicted vs Residuals" subtab shows A.D. Normality Statistic = 0.1526, A.D. Statistic
P-value = 0.9546, and a plot of predictions versus studentized residuals.]
Figure 47. Residuals interface showing a list of rebuilt models resulting from observation deletions,
and the associated statistics and residual plots for these rebuilds
9. PREDICTION
The MLR Prediction interface allows users to estimate or predict FIB
concentrations with a selected regression model. Whether a user was previously on the
Modeling tab (with a model selected in "Best Fits") or on the Residuals tab (with a model
selected in "Models"), the interface of the MLR Prediction tab will look the same.
9.1 Model Statement
At the top is the linear expression for the chosen model, with values of the
regression coefficients and names of each IV in the model (Figure 48).
9.2 Model Evaluation Thresholds
There are input boxes for the Decision Criterion (DC) and Regulatory Standard
(RS). Setting these allows model predictions to be evaluated and model specificity,
sensitivity, and accuracy to be calculated. When users first arrive at the Prediction tab,
values of the DC and RS will be set to what was on the Modeling tab. The "Threshold
Transform" button tells VB 2.2 how to transform the DC and RS for comparison to
model predictions and observations. If a transformation definition was set for the
response variable during data processing (either manually by the user or automatically by
transforming the response), that definition will be set here as the default. Users should be
aware that changing the threshold transform definition can cause problems when
comparing modeling predictions to observations. Caution should be exercised.
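To illustrate why the threshold transform matters, the following Python sketch puts the DC and RS on the scale of the response variable and then tallies the evaluation statistics. It is a hedged illustration only: the function names, the option strings, the treatment of the power option as a simple exponent, and the use of strict ">" comparisons are assumptions, not VB 2.2 internals.

```python
import math


def transform_threshold(value, transform="none", exponent=1.0):
    """Put a DC or RS entered in raw units on the scale of the response variable."""
    if transform == "none":
        return value
    if transform == "log10":
        return math.log10(value)
    if transform == "ln":
        return math.log(value)
    if transform == "power":
        return value ** exponent
    raise ValueError(f"unknown transform: {transform}")


def evaluate(predictions, observations, dc, rs):
    """Count outcomes and return specificity, sensitivity, and accuracy.

    dc and rs must already be on the same scale as the predictions and observations.
    """
    tp = fp = tn = fn = 0
    for pred, obs in zip(predictions, observations):
        predicted_exceed = pred > dc      # prediction above the Decision Criterion
        observed_exceed = obs > rs        # observation above the Regulatory Standard
        if predicted_exceed and observed_exceed:
            tp += 1
        elif predicted_exceed:
            fp += 1                       # false positive (Type I)
        elif observed_exceed:
            fn += 1                       # false negative (Type II)
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "false_positives": fp,
        "false_negatives": fn,
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "accuracy": (tp + tn) / total if total else float("nan"),
    }


# Hypothetical log10-scale values; a raw threshold of 235 becomes about 2.371.
dc = rs = transform_threshold(235, "log10")
stats = evaluate([1.8, 2.5, 2.6, 2.0, 2.1], [1.5, 2.6, 2.0, 2.9, 1.9], dc, rs)
```

If the thresholds were left untransformed (235) while the model predicts log10 values, no prediction in this example would exceed the DC, which is the kind of mismatch the caution above refers to.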
[Screenshot: the MLR Prediction tab of Virtual Beach 2.2. The model statement at the top reads:
LogCFU = 1.8228075 - 0.00067864774*(uv) + 1.6810716*(wave height) - 0.0030005423*(WindDirection)
Below it are the Model Evaluation Thresholds (Decision Criterion, Regulatory Standard), the
Threshold Transform options (None, Log10, Ln, Power), the button bank (IV Data Validation,
Import IVs, Import Obs, Make Predictions, Clear, Export As CSV), and the Predictive Record grid.]
Figure 48. The MLR Prediction interface
9.3 Prediction Form
Most of the prediction form is in three separate data panels: the left panel holds
IV data; the middle panel is for observational data, e.g., lab results of FIB concentrations;
and the right section shows model predictions and evaluation metrics. Each panel also
contains a column for a unique ID for each row of data (e.g., the date that data were
collected). The panels have separate horizontal and vertical scroll bars that become
visible if the number of rows or columns exceeds the viewable area. The three panels
independently scroll horizontally, but scroll as a group vertically. Panels can be re-sized
by clicking and dragging the blue vertical partitions. Order of the columns in the left and
right panels can be changed by clicking and dragging the column headers left or right.
Users can import IV and observational data from a file using "Import IVs" and
"Import Obs" buttons in the "Prediction Form" button bank located in the middle right of
the screen, or users can type data into the input grids. Either way, they should be certain
that the entered IV data are in the same units as those used to build the model.
Depending on which model was selected for prediction, the IV panel will have
one column for every unique IV that appears in the model, plus a column for the row's
unique ID. When a data file is imported with the "Import IVs" button, a "Column
Mapper" window opens. This window allows users to tell VB 2.2 which columns in the
imported datasheet should be used for the row IDs and each IV found in the model. By
default, the first column of the imported file maps to the ID field, but users can choose
another column if needed. If a column in the imported spreadsheet has an identical name
to an IV in the model, that column will be automatically selected by VB 2.2 as the
appropriate one for that IV.
Figure 49. Importation of IV data using the "Column Mapper" window
As with IV data, observational data can be typed into the middle panel or
imported using "Import Obs." For observational data, only two columns are needed:
row IDs for every observation and the actual observations. A "Column Mapper" window
appears when observational data are imported from a file. After they have been imported
or manually entered, users can specify the scale/transformation of the observations for a
proper comparison to model predictions. This is done by right-clicking on the
"Observation" column header and defining the transformation: none, log10, ln (natural log), or a
power transformation. "None" is the default choice. For example, if log10-transformed observations
are imported, the user would need to change the right-click menu choice to "Log10."
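[Screenshot: the "Column Mapper" window for observational data, mapping the "ID" and "Observation"
fields to columns of the imported file (here, "Observation" is mapped to "LogCFU").]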
Figure 50. Importation of observational data using the "Column Mapper" window
The "Make Predictions" button remains disabled until the IV data (imported from
a file or manually typed) are validated using the "IV Data Validation" button. This scan
ensures there are no blank cells or non-numeric data in the IV columns of the IV data
panel and checks that every row ID is unique (non-numeric data are allowed for the ID
column). This validation scan window is very similar to the validation scan window in
the Data Processing tab; however, "Delete Column" is not a choice. "Replace With" and
"Delete Row" are the only ways to deal with problems in the IV data grid.
[Screenshot: the MLR Prediction tab after IV data have been imported. The IV data panel holds the
imported rows (ID, uv, waveheight, WindDirection) beneath the model statement and threshold
controls.]
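The checks performed by the IV data validation scan (no blank cells, no non-numeric IV values, unique row IDs) can be sketched as follows. The function name `validate_iv_rows`, the dictionary-based row format, and the problem labels are hypothetical choices made for illustration; they are not taken from VB 2.2.

```python
def validate_iv_rows(rows, iv_names, id_name="ID"):
    """Return a list of (row_number, column, problem) tuples; empty means valid."""
    problems = []
    seen_ids = set()
    for number, row in enumerate(rows, start=1):
        row_id = row.get(id_name)
        if row_id in seen_ids:
            problems.append((number, id_name, "duplicate ID"))
        seen_ids.add(row_id)              # IDs may be non-numeric; only uniqueness matters
        for name in iv_names:
            cell = row.get(name)
            if cell is None or str(cell).strip() == "":
                problems.append((number, name, "blank cell"))
                continue
            try:
                float(cell)               # IV cells must parse as numbers
            except (TypeError, ValueError):
                problems.append((number, name, "non-numeric value"))
    return problems


rows = [
    {"ID": "38507.33", "uv": "360", "waveheight": "0.15", "WindDirection": "0"},
    {"ID": "38507.33", "uv": "", "waveheight": "abc", "WindDirection": "10"},
]
problems = validate_iv_rows(rows, ["uv", "waveheight", "WindDirection"])
```

In this sketch a non-empty result would keep "Make Predictions" disabled until the offending cells are replaced or the rows deleted.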
[Screenshot: the MLR Prediction tab after predictions have been made, showing the same model
statement as Figure 48. The fully visible rows of the IV, observation, and prediction panels are:]

ID        uv    waveheight  WindDirection  Observation  Model_Prediction
38507.33  360   0.15        0              1.452        1.831
38507.46  1403  0.2         10             0.8653       1.177
38507.63  1555  0.2         20             0.8016       1.044
38508.33  337   0.2         30             1.738        1.84
38508.46  1305  0.2         40             1.028        1.153
38508.63  1568  0.2         50             0.301        0.9449
38521.46  1342  0.02        60             1.627        0.7657
38521.63  1276  0.01        70             1.247        0.7636
38522.33  225   0.01        80             1.773        1.447
38522.46  1260  0.01        90             0.9379       0.7145
38522.63  1409  0.01        100            0.9542       0.5833
38528.33  295   0.1         110            1.079        1.461
38528.46  1800  0.15        120            0.97         0.4933
38528.63  900   0.18        130            1.195        1.125
38535.33  293   0.15        140            1.239        1.456
38535.46  1537  0.15        150            0.699        0.5818
38535.63  1763  0.3         160            -0.1761      0.6506
38536.33  286   0.05        170            1.176        1.203
Figure 52. A prediction grid after IVs and observational data have been imported, and model
predictions have been made
The ID column of the model output panel is taken directly from the ID column of
the IV panel, not the observation panel. The "Make Predictions" button makes one
model prediction per row in the IV data panel, regardless of how many observations are
entered in the observation panel.
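As a worked check of what "Make Predictions" computes, the short Python sketch below evaluates the model statement from Figure 48 for the first two IV rows shown in Figure 52. The variable and function names are illustrative only, and the IV key "waveheight" follows the column header rather than the "(wave height)" spelling in the model statement.

```python
# Coefficients from the model statement shown in Figure 48.
INTERCEPT = 1.8228075
COEFFICIENTS = {
    "uv": -0.00067864774,
    "waveheight": 1.6810716,
    "WindDirection": -0.0030005423,
}

# First two IV rows shown in Figure 52, keyed by row ID.
iv_rows = {
    "38507.33": {"uv": 360, "waveheight": 0.15, "WindDirection": 0},
    "38507.46": {"uv": 1403, "waveheight": 0.2, "WindDirection": 10},
}


def predict(iv_values):
    """Evaluate the linear model for one row of IV data."""
    return INTERCEPT + sum(COEFFICIENTS[name] * iv_values[name] for name in COEFFICIENTS)


# One prediction per IV row; the output ID is copied from the IV panel.
predictions = {row_id: predict(values) for row_id, values in iv_rows.items()}
for row_id, value in predictions.items():
    print(row_id, f"{value:.4g}")
```

To four significant figures the results are 1.831 and 1.177, matching the Model_Prediction values displayed for those rows in Figure 52.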
The Model Prediction column contains predicted values of the response variable.
Right-clicking on this column header allows the user to change how the predictions are
displayed in the table (as linear, log, or power units). The Decision Criterion and
Regulatory Standard are values set by the user (shown in the left panel as transformed by
the choice of "Threshold Transform"). The Exceedance Probability (actually the
probability x 100) is defined as the probability that the model prediction will be larger
than the Decision Criterion, based on uncertainty bounds (confidence intervals) around
the model predictions.
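The guide does not give the formula VB 2.2 uses for the Exceedance Probability. One way to picture the idea, assuming (purely for illustration) that the prediction error is normally distributed with a known standard error on the same transformed scale as the prediction, is the sketch below. The function name and the `std_error` input are hypothetical.

```python
import math


def exceedance_probability(prediction, decision_criterion, std_error):
    """Percent chance that the true value exceeds the Decision Criterion.

    Assumes normally distributed prediction error with a known standard
    error; prediction and decision_criterion share one transformed scale.
    """
    z = (decision_criterion - prediction) / std_error
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z > z) for a standard normal
    return 100.0 * upper_tail
```

Under this assumption a prediction exactly equal to the Decision Criterion gives 50, and the value rises toward 100 as the prediction moves further above the criterion relative to its uncertainty.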
To compare model predictions to observations, VB 2.2 looks at the prediction ID
and attempts to find an observation in the observation panel with that same ID. VB 2.2
does not require unique IDs for each row in the observation panel, but note that a model
prediction is compared to the first observation found with the same ID. When comparing
model predictions to observations, an error (false exceedance or false non-exceedance)
appears in the "Error Type" column.
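The ID-matching rule above (each prediction is compared to the first observation with the same ID) can be sketched as follows. The function name, the tuple layout, the exact error labels, and the strict ">" comparisons are assumptions made for illustration.

```python
def compare_by_id(predictions, observations, dc, rs):
    """Pair each prediction with the first observation sharing its ID.

    predictions and observations are lists of (id, value) tuples; dc and rs
    are on the same scale as the values. Returns (id, prediction, observation,
    error_type) tuples for predictions that have a matching observation.
    """
    first_obs = {}
    for obs_id, value in observations:
        first_obs.setdefault(obs_id, value)      # later duplicates are ignored
    results = []
    for pred_id, pred in predictions:
        if pred_id not in first_obs:
            continue                             # nothing to compare against
        obs = first_obs[pred_id]
        if pred > dc and obs <= rs:
            error = "False Exceedance"
        elif pred <= dc and obs > rs:
            error = "False Non-Exceedance"
        else:
            error = ""
        results.append((pred_id, pred, obs, error))
    return results


# Hypothetical log10-scale data: ID "a" appears twice among the observations.
matched = compare_by_id(
    [("a", 2.5), ("b", 2.0), ("c", 2.6)],
    [("a", 2.0), ("a", 2.9), ("b", 2.9)],
    dc=2.37,
    rs=2.37,
)
```

Here prediction "a" is judged against 2.0 (the first observation with that ID), not 2.9, and prediction "c" is left unevaluated because no observation shares its ID.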
It is important to note that accurately assessing model output depends on
synchronized transformation information regarding the Decision Criterion, Regulatory
Standard, model predictions, and observations. Users must be careful to ensure each
value is in a comparable unit.
9.4 Viewing Plots
After predictions have been made, a scatterplot of observations versus predictions
can be viewed by clicking "Plot" in the "Prediction Grid" button bank. If no
observational data were entered, a message asking for observational data appears. The
features and functionality of the form that appears when the "Plot" button is clicked are
described in Section 7.6. The data are based on comparing model predictions (right pane
of the Prediction Form) with observations (middle pane) that share the same, unique ID.
[Screenshot: the plot window, with Plot Thresholds controls (Decision Criterion, Regulatory
Standard, Threshold Transform set to Log10, and an "Update" button), Model Evaluation statistics
(false positives, false negatives = 80, specificity = 0.9882, sensitivity = 0.3043,
accuracy = 0.8772), a "Close" button, and a "Predictions vs Observations" scatterplot with
Decision Threshold and Regulatory Threshold lines.]
Figure 53. Prediction interface plotting of the observations versus predictions, with model evaluation
threshold controls
9.5 Prediction Form Manipulation
Two other buttons are found in the "Prediction Grid" button bank. If a user wants
to view the table in a spreadsheet or word processing program, "Export as CSV" saves
the contents of the entire table (three panels) in .csv format. "Clear" deletes all
information in the predictive table. As with most of the tabular information in VB 2.2,
data in individual panels can be selected with a left click and drag. Control-C and
Control-V can then be used to copy and paste the data into another application such as
WordPad or Excel.
10. FUTURE ENHANCEMENTS
VB 2.2 is a Windows application and undergoes continuous improvement and
functional expansion. In version 3.0, slated for release in 2012, project management
enhancements will allow site-based seasonal prediction and model assessment. The map
interface will provide user access and information to site-specific data such as water
quality, water flow gauge readings and weather data. Model-building functionality will
grow beyond MLR to include Gradient Boosting Machines (Decision Trees), Binary
Logistic Regression, Partial Least Squares regression, and Neural Networks.
11. USER FEEDBACK
Opinions and experiences from the user community are welcomed by the Virtual
Beach design/development team. Users are encouraged to report problems, issues and
likes/dislikes to:
Mike Cyterski - 706 355-8142 (cyterski.mike@epa.gov)
Mike Galvin - 706 355-8318 (galvin.mike@epa.gov)
Rajbir Parmar - 706 355-8306 (parmar.rajbir@epa.gov)
Kurt Wolfe - 706 355-8311 (wolfe.kurt@epa.gov)
12. ACKNOWLEDGMENTS
We would like to thank the following people, who generously donated their time
and expertise for software testing and review of this document:
Adam Mednick, Wisconsin DNR
David Rockwell, NOAA
Fran Rauschenberg, USEPA
Wesley Brooks, USGS
Mike Fienen, USGS
Donna Francy, USGS
Richard Zepp, USEPA
Steve Corsi, USGS