Michael Cyterski
Mike Galvin
Rajbir Parmar
Kurt Wolfe
National Exposure Research Laboratory, Ecosystems Research Division, Athens, GA 3060
-------
-------
Notice
The research described in this document was funded by the U.S. Environmental Protection
Agency through the Office of Research and Development. The research described herein was
conducted at the Ecosystems Research Division of the USEPA National Exposure Research
Laboratory in Athens, Georgia. It has been subjected to the Agency's peer and administrative
review and has been approved for publication as an EPA document. Mention of trade names or
commercial products does not constitute endorsement or recommendation for use.
Abstract
This report describes the development and design of Virtual Beach 2.2 (VB2.2) and
provides guidance for its proper use. VB2.2 is a tool that allows beach managers to analyze
environmental data in order to make decisions regarding beach closures due to microbial
contamination. It does this by facilitating the construction of statistical models for the
prediction of fecal indicator bacteria (FIB) levels. Some familiarity with multiple linear
regression (MLR) modeling and residual analysis will benefit a VB user; however, it is not
required.
VB2.2 has five major components:
- Beach location mapping interface where users can locate their site, define the orientation of
  the beach, and examine nearby potential data sources.
- Data processing spreadsheet interface that facilitates the import and manipulation of data.
- Modeling interface that presents options for performing MLR analyses.
- Residuals component to examine regression residuals, allow optional elimination of highly
  influential data records, and perform recalculation of the chosen regression model.
- Prediction interface allowing the entry of new data and subsequent estimation of pathogen
  indicator levels using a selected MLR model.
-------
-------
Table of Contents
1.0 Introduction
1.1 On Predictive Modeling
1.2 Recommended User Background
1.3 History and Comparison of Version 2.2 to Earlier Versions
2.0 Installation and Execution
2.1 Viewing this Documentation
3.0 Operational Overview
4.0 Project Management
5.0 Beach Location Mapping Interface
5.1 Finding a Location
5.2 Defining the Beach Orientation
5.3 Finding nearby Water Quality, Flow, and Climate Information Sources
5.4 Saving Beach Information in a Project File
6.0 Data Processing
6.1 Data Requirements and Considerations
6.2 Importing a Dataset
6.3 Validating the Imported Data
6.4 Working with a Dataset Post-Validation
6.5 Computing Alongshore and Onshore/Offshore Wind, Wave and Current Components
6.6 Creation of New Independent Variables
6.7 Transforming the Independent Variables
6.8 Saving Processed Data
6.9 Go to Modeling
7.0 Modeling
7.1 Selecting Variables for Model Building
7.2 Modeling Control Options
7.3 Linear Regression Modeling Methods
7.4 Using the Genetic Algorithm
7.5 Evaluating Model Output
7.6 Viewing X-Y Scatterplots
7.7 ROC Curves
7.8 Cross-Validation
7.9 Report Generation
8.0 Residual Analysis
9.0 Prediction
9.1 Model Statement
9.2 Model Evaluation Thresholds
9.3 Prediction Form
9.4 Viewing Plots
9.5 Prediction Form Manipulation
10.0 Future Enhancements
11.0 User Feedback
12.0 Acknowledgments
-------
Table of Figures
Figure 1. The five major component tabs of VB 2.2
Figure 2. Beach Location interface
Figure 3. Beach Location tab controls and their function
Figure 4. Adding shoreline and water markers to define beach orientation
Figure 5. NOAA/NCDC station marker showing station ID information
Figure 6. USGS/NWIS station marker showing station ID information
Figure 7. Beach Location interface showing station markers
Figure 8. Importing a dataset into the Data Processing tab
Figure 9. Data validation required to begin data processing
Figure 10. Context-sensitive choices for the "Take Action Within" drop-down menu
Figure 11. Post-validation enabling of the Data Processing functionality
Figure 13. Four different plots available for evaluation of IVs
Figure 14. Disabling an observation from within the XY scatterplot
Figure 15. Available choices when right-clicking the current response variable
Figure 16. Window for computation of alongshore and offshore/onshore components
Figure 17. A and O component definitions for wind, current, and wave data
Figure 18. Principal beach orientations given in degrees
Figure 19. Window for the formulation of "Manipulates"
Figure 20. Creation of a new IV defined as the mean of two existent IVs
Figure 21. Formation of two-way cross-products of a set of four existent IVs
Figure 22. The range of choices for IV transformations
Figure 23. Pearson correlation coefficient scores
Figure 24. Scatterplots (Response vs. IV) for six different data transformations
Figure 25. Selecting variables for MLR processing within the Modeling tab
Figure 26. Setting modeling options within the Modeling interface
Figure 27. Setting evaluation thresholds and threshold transformation information
Figure 28. Model building interface
Figure 29. Using the IV filter to select a subset of variables from the best-fit models
Figure 30. Genetic algorithm options within the modeling interface
Figure 31. Modeling results shown after completion of an exhaustive regression run
Figure 32. Modeling interface showing variable statistics for the selected best-fit model
Figure 33. Modeling interface showing model evaluation metrics
Figure 34. Modeling interface showing a time series plot for the selected model
Figure 35. An XY scatter plot of observed versus predicted values for the selected model
Figure 36. The ROC curves and AUC table for the best-fit models
Figure 37. The cross-validation results for each of the 10 best-fit models
Figure 38. A text report generated on the modeling results
Figure 39. Plots of the various model evaluation metrics for the 10 best-fit models
Figure 40. Scaled versus un-scaled views of selected model evaluation criterion
Figure 41. Information available on the Residuals tab
Figure 42. Plot of studentized predictions vs. residuals and the A-D test of normality
Figure 43. A table and plot of the DFFITS scores for the residuals
Figure 44. DFFITS/Cook's Distance controls for removing highly influential data points
Figure 45. "View Data Table" window for examining the dataset
Figure 46. Observed vs. Predicted plot on the Residual tab
Figure 47. Residuals interface showing a list of rebuilt models
Figure 48. The MLR Prediction interface
Figure 49. Importation of IV data using the "Column Mapper" window
Figure 50. Importation of observational data using the "Column Mapper" window
Figure 51. The IV validation window on the MLR Prediction tab
Figure 52. A prediction grid after IVs and observational data have been imported
Figure 53. Prediction interface plotting of the observations versus predictions
-------
-------
1.0 Introduction
Virtual Beach version 2.2 (VB 2.2) is a decision support tool. It is designed to construct
site-specific multiple linear regression (MLR) models to predict pathogen indicator levels
(or fecal indicator bacteria, FIB) at recreational beaches. MLR analysis has outperformed
persistence models (which use the most recent FIB concentration as the sole predictor of the next
FIB concentration, i.e., $y_t = y_{t-1}$) at beaches where conditions, such as weather, water
conditions, and human and animal traffic levels, change significantly from day to day (Frick, Ge
et al. 2008).
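To make the persistence baseline concrete, the following minimal Python sketch forecasts each FIB value as the previous observation. It is an illustration only and not part of VB 2.2; the function name and the sample values are hypothetical.

```python
def persistence_forecast(observations):
    """Forecast each value as the previous observation (y_t = y_{t-1})."""
    # The first observation has no predecessor, so it receives no forecast;
    # the returned list lines up with observations[1:].
    return list(observations[:-1])


# Hypothetical log10 FIB concentrations on consecutive sampling days.
log_fib = [1.9, 2.4, 2.1, 3.0]
forecasts = persistence_forecast(log_fib)

# Forecast errors for days 2 through 4 (observed minus forecast).
errors = [obs - pred for obs, pred in zip(log_fib[1:], forecasts)]
```

An MLR model is worth its extra effort at a given beach only if its errors on the same days are smaller than those of this baseline.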
1.1 On Predictive Modeling
In any predictive modeling endeavor, variability and uncertainty are always associated
with model output, arising from a variety of sources that are impossible to eradicate completely
from the modeling exercise. Virtual Beach 2.2 attempts to be forthright about this fact by issuing a
probability of exceedance for any regulatory standard that the user wishes to investigate. Even so,
there is no guarantee that every model prediction will be correct, and a situation where the model
predicts water quality to be good enough for public recreation might be erroneous. Decisions to
allow or not allow swimming at beaches must be made, however, and in the best case scenarios
the regression models developed with Virtual Beach 2.2 will outperform less rigorous predictive
efforts.
1.2 Recommended User Background
Virtual Beach 2.2 is our attempt to create a decision support software tool that will assist
someone with little statistical knowledge in developing a multiple linear regression model based
on their available data. Some familiarity with regression modeling and residual analysis will no
doubt benefit a VB 2.2 user, although we believe that, after only a few sessions, someone with
very little background in statistics can produce defensible regression models using VB 2.2. We
note that these MLR models, or any other statistical models, will only be as effective as the data
used to develop them. No statistician, however skilled, can turn a dataset filled with worthless
independent variables (IVs) into a useful predictive device.
VB 2.2 has five major components:
- Beach location map interface where users can locate their site, define the orientation of the
  beach, and examine nearby potential data sources.
- Data processing spreadsheet interface that facilitates the import and manipulation of MLR
  model variable data.
- Modeling interface presenting options for performing MLR analyses.
- Residuals component to examine regression residuals, allow optional elimination of highly
  influential data records, and perform recalculation of the regression model.
- Prediction interface allowing entry of new data and subsequent estimation of pathogen
  indicator levels using a selected MLR model.
Each component is accessible from the application's main window via selectable tabs. The
Beach Location and Data Processing tabs are always visible, the Modeling tab becomes visible
once the input data have been validated, and the Residuals and MLR Prediction tabs appear when
model-building is complete and a model is selected.
Figure 1. The five major component tabs of VB 2.2 - the modeling tab is currently active
1.3 History and Comparison of Version 2.2 to Earlier Versions
Virtual Beach 2.2 is derived from the Virtual Beach Model Builder application (VB 1.0,
also known as Virtual Beach v1.0) developed by Walter Frick and Zhongfu Ge. VB 1.0 can be
characterized as an MLR model-building tool that supports a primarily manual analysis of data sets
via visual inspection of data plots and manipulation of variables (e.g., transformations, creating
interaction terms), followed by an iterative process of testing, comparing and evaluating models.
The fitness of developed models is computed and tracked, allowing for comparison and eventual
selection of a "best" model for the dataset under consideration. This model can then produce
estimates of pathogen indicator levels using current or forecasted environmental data from the site.
VB 2.2 enhances the functionality of its predecessor, performing similar functions (visual
inspection of univariate data plots, manual transformations of individual variables, MLR model
building, prediction, etc.), but also automating and extending functionality in several ways:
- The Map component provides users with information on the location and availability of
  local data sources (NWIS/NCDC data) through the map interface. These sources can
  provide recently collected and/or forecasted data for generating predictions by a chosen
  MLR model.
- The Map component provides a convenient method for defining beach orientation by
  overlaying the beach on current shoreline layers (satellite images, Google Maps, MS
  Virtual Earth, etc.). Given this orientation, VB 2.2 can calculate wind, wave, or current
  components (the A component is parallel to shore and the O component is perpendicular to
  shore), which can be important predictor variables.
- Although manual processing and analysis of imported data (visual inspection of univariate
  data plots and the transformations/interactions of variables) has been retained, the Data
  Processing component of VB 2.2 provides automated generation of all possible second-order
  interaction terms among a set of IVs, formation of more complex functions of multiple
  columns, and automated testing of a suite of variable transformations for improved model
  linearity. This functionality increases the number of models to evaluate during later
  selection routines and removes the burden/difficulty of manual assessment placed on users
  of VB 1.0.
- Multicollinearity among predictor variables is handled automatically in the Model
  Building component. Any model containing an IV with a high degree of correlation
  with other IVs (as measured by a large Variance Inflation Factor [VIF]) is removed from
  consideration during model selection. The VIF threshold is user-defined, with a default
  value of 5.
- During model selection, MLR models are ranked by a user-selected evaluation criterion.
  Possible criteria include R², Adjusted R², Akaike Information Criterion (AIC), Corrected
  AIC, Predicted Error Sum of Squares (PRESS), Bayes Information Criterion (BIC),
  Accuracy, Sensitivity, Specificity, or the model's Root Mean Square Error (RMSE).
  Regardless of which criterion is chosen, the software records the ten best models in terms
  of that criterion. In comparison, VB 1.0 had only a single comparative criterion, Mallows'
  Cp.
- As the number of IVs in a dataset increases, the number of possible MLR models increases
  exponentially (considering transforms/interactions), resulting in trillions of possible models
  from a modest number (12-13) of IVs. VB 2.2 implements a Genetic Algorithm (GA) that
  effectively and efficiently searches for the best possible MLR model. Alternatively, VB 2.2
  users can perform an exhaustive calculation in which all possible combinations of IVs are
  used and tested if the number of possible models is reasonably small (circa 100,000). Both
  the GA and exhaustive approaches greatly expand the model-building capabilities of VB
  2.2, compared to VB 1.0.
- Users no longer have to enter data values in transformed, interacted, or
  component-decomposed form to make a prediction with a chosen MLR model. On the VB 2.2 MLR
  Prediction tab, a user-selected model is coded into an input grid with data entry columns
  matching the model's main effects. Any mathematical manipulation of these IVs is then
  automatically performed prior to making predictions.
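The exponential growth in the number of candidate models mentioned above can be illustrated with a short Python sketch. It assumes that every candidate term (a main effect, a transformed IV, or an interaction) can be included or excluded independently, so a pool of n terms yields 2^n - 1 non-empty models; the function name and the optional cap on terms per model are illustrative assumptions and do not describe VB 2.2's internal counting.

```python
from math import comb


def count_models(n_terms, max_terms=None):
    """Count non-empty subsets of n_terms candidate terms.

    If max_terms is given, only subsets with at most that many terms
    are counted (an assumed cap on model size).
    """
    if max_terms is None or max_terms >= n_terms:
        return 2 ** n_terms - 1
    return sum(comb(n_terms, k) for k in range(1, max_terms + 1))


print(count_models(7))               # 127
print(count_models(17))              # 131071, near the "circa 100,000" scale
print(count_models(40))              # 1099511627775, about 1.1 trillion
print(count_models(7, max_terms=4))  # 98
```

Under this assumption an exhaustive search stays practical only up to roughly 17 candidate terms, which is why a heuristic search such as the GA becomes attractive once transforms and interactions enlarge the pool.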
VB 2.2 is developed with MS Visual Studio 2010, written in C#, using multiple public
domain system components (Weifen Luo Docking UI, ZedGraph, and GMap.Net) and employs
a single licensed statistical library (Extreme Optimization). No license or software purchase is
required by the user to install and run the application, but an internet connection is required to
display maps. Users must have the Microsoft Windows XP or Windows 7 OS with the DotNet Framework 4.0
to ensure proper installation and operation. Assorted errors have occurred when running under the
Windows Vista OS. Certain VB 2.2 data manipulation and model-building operations are computationally
intensive, so faster CPUs are better, but most new laptops or desktop systems will be adequate.
Disk space requirements are modest (less than 5 MB) if the DotNet Framework is installed; if
not, the Framework installer requires ~175 MB of disk space. The VB 2.2 application installer
will attempt to download and install the DotNet Framework 4.0 if it is not installed on the target
system; this also requires a network connection. If necessary, a user can freely obtain the DotNet
Framework 4 installer at:
http://www.microsoft.com/download/en/details.aspx?id=17851
The EPA's Center for Exposure Assessment Modeling (CEAM) web site distributes VB 2.2
at:
http://www.epa.gov/ceampubl/swater/vb2/index.html
Obtain and initiate execution of the VB 2.2 application installer and follow the on-screen
instructions. The VB 2.2 application installer can be found at:
https://iemhub.org/resources/vbmb2 for iemHub Virtual Beach Group members;
https://iemhub.org/groups/virtualbeach/join to request Group member access.
Finally, the software can be obtained by request (see the contacts list in the Feedback
section at the end of this document). After installation, a shortcut will appear on your desktop to
start the software.
-------
2.0 Installation and Execution
2.1 Viewing this Documentation
Virtual Beach's User Guide can be accessed within the software via the top-level Help → User
Guide menu selection or in a context-sensitive fashion via the F1 key. Pressing F1 will launch
Adobe Acrobat or Adobe Reader (if installed) and open the User Guide to the appropriate page.
Note that if the Guide is already open, the F1 key will have no effect; users must close Reader (or
Acrobat) for F1 to launch it and open to the correct page. Alternatively, if the Guide is already
open, users can navigate to the area of interest via the Table of Contents. The User Guide
(Virtual_Beach_2_User_Guide.pdf) can also be opened independently of program operation; it
resides within the Documentation folder of the program's installation folder.
-------
3.0 Operational Overview
Virtual Beach 2.2 is simple to operate: it is organized into five functions, each with its
own component or interface:
Beach Location - a mapping tab whose utility is meant to provide a basis for generating
orthogonal (alongshore and offshore/onshore) wind, current, and/or wave components for the
beach under consideration; its use is optional. Such components can be powerful predictors of
pathogen indicator levels at the beach, so using the beach definition component is recommended
if the dataset under consideration contains wind, wave or current data. This tab is also useful for
locating nearby NWIS/NCDC climate and water quality data sources for a specific location.
Data Processing - a spreadsheet tab to support data manipulation procedures on an imported
dataset. In addition to wind/current/wave component generation, users can generate new
independent variables that represent the products, means, sums, minimums, and maximums of
other IVs, as well as common data transformations for the IVs. Statistical indicators help users
select the best IV transformations in MLR model-building.
Modeling - this tab allows selection of any eligible IVs for consideration in MLR model-building
and model generation. Model generation is accommodated by user-selected model evaluation
criteria and automatic generation of the ten best-fit models from a search in which all possible
combinations of predictor variables are tested, or via a heuristic searching algorithm (the Genetic
Algorithm or GA). Regression fit and model variable statistics are generated to help evaluate
the usefulness of predictive variables and overall fit. Time series and XY scatter plots, as well as
reports on best-fit models, can be viewed and/or saved for further analysis and recording.
Residual Analysis - this tab displays plots of a model's regression residuals, including their
normality statistics, and provides means to eliminate highly influential data records and recalculate
the regression model. Altered data sets can be exported for external use and rebuilt models can be
selected for the prediction tab.
Prediction - this tab comprises three grids where users can enter or import the needed IVs
for the chosen model, enter or import observations that will be compared to model predictions,
and examine model predictions and exceedance probabilities. Time series and XY scatter plots of
observations versus predictions are shown to help users gauge model effectiveness.
-------
4.0 Project Management
Oftentimes the user will put an imported dataset through lengthy pre-processing to prepare
it for analysis. To avoid repeating all of this work, "project" files can be saved and re-opened via
the Project → Save and Project → Open menu selections. Subsequent opening of a saved project
file will load the processed data sheet and information on the Beach Location tab, including the
beach orientation if the user had defined it. However, no modeling information is saved inside a
project file.
In addition to project files, "model" files can be opened and saved using choices under
the "Model" menu at the top of the VB 2.2 interface. A model file contains information on the
IVs, regression parameters, and other metadata for the currently selected model in the Modeling,
Residual, or MLR Prediction tab. Whenever a model file is saved, VB 2.2 will prompt the user to
enter a Decision Criterion (DC), Regulatory Standard (RS) and Threshold Transformation for the
model. These parameters will be used as initial values (they can be changed when the model file is
opened) for later calculations of model sensitivity and specificity, which depend on the numbers of
false negative and false positive model predictions (see Sections 7.6 and 7.7).
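The dependence of sensitivity and specificity on the Decision Criterion and Regulatory Standard can be sketched as follows. This is a minimal illustration, not VB 2.2's code: it assumes an observation counts as an exceedance when it is strictly greater than the RS, that a prediction triggers an advisory when it is strictly greater than the DC, and that both series are in the same (transformed or untransformed) units. The function name and sample values are hypothetical.

```python
def classification_metrics(observed, predicted, decision_criterion, regulatory_standard):
    """Tally prediction outcomes and derive sensitivity, specificity, and accuracy."""
    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        exceeds = obs > regulatory_standard    # water actually exceeded the standard
        flagged = pred > decision_criterion    # model predicted an exceedance
        if exceeds and flagged:
            tp += 1
        elif exceeds:
            fn += 1    # false negative: exceedance missed by the model
        elif flagged:
            fp += 1    # false positive: advisory predicted unnecessarily
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "accuracy": (tp + tn) / total if total else float("nan"),
    }


# Hypothetical E. coli counts with DC = RS = 235 (the freshwater standard).
observed = [100, 300, 250, 50, 400]
predicted = [120, 280, 200, 260, 390]
metrics = classification_metrics(observed, predicted, 235, 235)
```

Lowering the DC below the RS flags more days, which tends to raise sensitivity at the cost of specificity.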
When users open a previously saved model file from within VB 2.2, they are taken directly
to the MLR Prediction tab where they can use the saved model to generate predictions. Model
files are designed for situations where a statistically-savvy developer is charged with developing
regression models for a number of beach sites. After the developer chooses a "best" model for
a site, the model file can be saved and then delivered to the beach manager who will not use VB
2.2 for full-scale model development, but only to input new data, generate predictions, and make
decisions regarding swimming advisories.
-------
5.0 Beach Location Mapping Interface
On VB 2.2 application startup, the map interface is shown, but users can go directly to the
Data Processing tab if desired.
Figure 2. Beach Location interface - the default map type is Yahoo Map, but users have many
mapping options
5.1 Finding a Location
The map interface provides map controls that allow users to look up a location manually
by panning and zooming (mouse drag on the map and use of the mouse wheel or zoom control).
Alternately, a decimal latitude/longitude or place name can be entered. The control uses Google
Maps' reverse geo-coding network service to find locations.
-------
Zoom Slider - drag the slider up and down to zoom in and out, respectively.
Map Controls - add a Lat/Long and click the "GoToLat/Long" button, or enter a Place and click
"GoToPlace."
Map Settings - select a map type from the dropdown menu to change the display in the map window.
Beach Orientation - use the buttons to add or remove markers on the map. Once the beach shoreline
is delineated by placing the 1st and 2nd beach markers, click in the water and then click "Add
Water Marker," which will lead to the correct orientation angle being placed into the "Beach
Orientation" box.
Show Station Locations - if zoomed in enough, select a station type and then click "Show Station
Locations" to display such stations on the map.
Current Location - click anywhere on the map to display that point's Lat and Long.
Loading - map loading progress bar that shows network download activity for map images.
Figure 3. Beach Location tab controls and their function
5.2 Defining the Beach Orientation
Map control allows delineation of a beach on the map to ascertain its orientation, which
is useful if wind, wave, and/or current flow components are to be used in MLR model-building.
Maps, as opposed to satellite or hybrid images, provide less shoreline detail, so it is recommended
that a hybrid or satellite map type be selected before adding the point locations that define
beach boundaries. Once the image is displayed, click on the map (a red marker will appear) and select the
"Add 1st Beach Marker" button; this represents the first point of the extent of your beach shoreline.
Repeat this for the second beach marker, then click on the map to indicate which side of the
shoreline represents the water and click the "Add Water Marker" button. Marker points will turn
green as you add them. Once the water marker is added, a shaded box (the beach) appears and the
computed orientation angle will be displayed.
Figure 4. Adding shoreline and water markers to define beach orientation
Points can be added or removed until the user is satisfied with the beach representation.
To recall the computed beach orientation in the data processing components creation screen (see
Data Processing section below), users can either save and then re-open a project file or they can
note the beach orientation on the mapping screen and manually enter that angle on the components
calculation screen.
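Once an orientation angle is known, a speed and direction pair can be split into a shore-parallel (A) part and a shore-perpendicular (O) part. The Python sketch below shows one way this could be done. The angle and sign conventions are assumptions made for illustration (both angles in degrees measured the same way, A taken along the orientation axis); they are not taken from VB 2.2, whose own definitions are given in the Data Processing section.

```python
import math


def shore_components(speed, direction_deg, beach_orientation_deg):
    """Split a vector into alongshore (A) and onshore/offshore (O) components.

    Assumed convention (hypothetical): theta is the angle between the
    vector direction and the beach orientation axis, A = speed * cos(theta)
    and O = speed * sin(theta).
    """
    theta = math.radians(direction_deg - beach_orientation_deg)
    a_component = speed * math.cos(theta)
    o_component = speed * math.sin(theta)
    return a_component, o_component


# A 10-unit wind aligned with the orientation axis is entirely alongshore.
a, o = shore_components(10.0, 90.0, 90.0)

# The same wind rotated by 90 degrees is entirely onshore/offshore.
a2, o2 = shore_components(10.0, 180.0, 90.0)
```

Whatever convention is used, the two components are orthogonal, so the squares of A and O always sum to the square of the original speed.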
-------
5.3 Finding nearby Water Quality, Flow, and Climate Information Sources
Possible nearby data sources for the area of interest may be located and displayed on the
map. USGS NWIS and NOAA NCDC station markers within a zoomed-in map area can be located
and displayed by checking the appropriate items in the map window and clicking the "Show Station
Locations" button. Note that the "Show Station Locations" button is only enabled when zoomed
in to an appropriate level (e.g., zoom level three as measured from the top of the zoom control
slider). If stations of either selected category (NWIS and/or NCDC; the STORET station
category, although present on the control, is not yet functional) are present within the map display
area, they will appear. Also note that the network server that produces NCDC station locations
restricts location requests to one every 30 seconds; an error message will be displayed if a
subsequent request is made before the 30-second wait has elapsed. Once station location markers
are displayed on the map, hovering over the top-left corner of any station marker will display
station ID information. With that information, users can visit the appropriate web address to
gather water/weather data for the area of interest.
Station ID: 09043199399
Station Name: ATHENS 2
Figure 5. NOAA/NCDC station marker showing station ID information
Station ID: USGS^EZI7890
Station Name: NORTH OCONEE RIVER AT US 78, AT ATHENS, GA
Figure 6. USGS/NWIS station marker showing station ID information
USGS NWIS web site URL: http://waterdata.usgs.gov/nwis/inventory
NOAA NCDC web site URL: http://www.ncdc.noaa.gov/oa/climate/stationlocator.html
-------
Figure 7. Beach Location interface showing station markers near Gary, Indiana
5.4 Saving Beach Information in a Project File
Use the Project → Save menu bar selection to open a Save File dialog and to save the project
information to disk. Beach marker and angle information is saved in the file name provided; the
saved file can be anywhere, but using the "Project Files" folder (found in the VB 2.2 root install
folder) is recommended.
-------
6.0 Data Processing
6.1 Data Requirements and Considerations
VB 2.2 accepts files from Excel 2007 or earlier (Excel 2010 is not currently supported), as
well as comma-separated-value (CSV) text files. Input data must conform to certain standards:
- The first row of any data column must be a header with the IV's name. For best operation
  of the software, the column name should be composed of letters, numbers (don't begin the
  column name with a number), and/or underscores ("_"). Other characters in column
  names can cause problems.
- The first (left-most) column of the dataset must be an identifier for the observations,
  typically a date or time stamp that indicates when the observation was collected. The only
  requirement is that each row MUST have a unique ID. VB 2.2 will not import datasets
  with non-unique IDs in the first column. If the first column is a time stamp, VB 2.2's
  plotting functions will work best if the column is in chronological order, from earliest to
  most recent observations.
- The second column of the dataset will initially be set as the dependent or response variable;
  however, this can be changed after data are imported. Any subsequent columns will be
  considered to be IVs.
- Variable measurement units are not considered, but certainly affect predictions. Make sure
  any data used for predictions are in the same units as those used to build the models; for
  example, do not build an MLR model with water temperature in degrees Fahrenheit, then
  later import water temperature in degrees Celsius for predictions. It is prudent to include
  unit information in the column names (e.g., WaterTemp_C) to remind the user of the proper
  units when making predictions.
- Missing data (blank cells) are permitted on import, but must be dealt with in Data
  Processing prior to modeling.
- If present in the imported Excel data sheet (other than in column names or the first ID
  column), cells with non-numeric values (i.e., symbols or text) are turned into empty cells.
  If such non-numeric characters are present in an imported .csv file, they will be imported to
  the data grid, but will be recognized as anomalous data during the required validation scan
  and will have to be dealt with (deleted or turned into a numeric value) at that time.
- VB 2.2 recognizes any column of data with only two different values as categorical. If
  you have a column of categorical data with more than two values, you can designate it as
  categorical, using methods described below. The consequence of a variable being identified
  as categorical is that VB 2.2 leaves it out of transformation processes.
- There is no hard-coded limit on the number of IV columns one can import; however,
  a practical limit exists that depends on system processing resources. There is also an
  inherent limit: documentation indicates that the grid components used in the application
  are designed for a maximum of 300 columns before performance issues degrade the
  application. Modeling 250+ columns of data presents circa 2(10)^20 possible data
  combinations for MLR processing. The Genetic Algorithm handles this modeling task,
  but choosing "Run all combinations" would likely take an immense amount of time to
  complete. Depending on how many additional IVs will be created by the user, importing a
  dataset with fewer than 100 IVs should be acceptable.
6.2 Importing a Dataset
When users first click on the Data Processing tab, they open a dataset using the "Import"
button. This brings up a dialog screen where a directory explorer can be used to find the data file
and open it. If the dataset is an Excel file with multiple sheets, a dialog box opens to ask the user
which to import.
Figure 8. Importing a dataset into the Data Processing tab
Once imported, the data grid is shown as a spreadsheet on the right. The second column
of the spreadsheet will be highlighted in blue to indicate its status as the current response variable.
Information about the dataset, such as number of rows and columns, name of the ID column
and name of the response variable, appear on the left. At this point the grid cannot be edited or
interacted with in any manner; to access additional processing functionality, the data must be
validated.
6.3 Validating the Imported Data
The "Validate" options window can be accessed by clicking the "Validate" button at the top
of the Data Processing tab. This window primarily launches a required data scan to identify blank
and non-numeric data cells in the imported spreadsheet. However, one can also find and replace
other specified values (e.g., a missing data tag like -999) in the dataset using the "(Optional) Find:"
input box.
Figure 9. Data validation required to begin data processing
To validate the data, the user clicks "Scan." VB then goes through the spreadsheet, cell
by cell, looking for blanks, non-numeric, or user-specified values entered in the "Find:" input
box. If one of these types of cells is found, the scan will stop to highlight that cell. Users must
decide how to deal with the cell using choices in the "Action" section: they can replace the bad
cell with a specified value, using the "Replace With:" input box, or they can delete the row or
column containing the bad cell. The user must decide where to implement the chosen action
with the "Take Action Within" menu. Possible choices are "Only this Cell," "Only this Row,"
"Only this Column," "Entire Row," "Entire Column," and "Entire Sheet." Items in this menu
are context-sensitive, i.e., they change depending on which Action is selected. This setup gives
the user flexibility, for example, to delete all rows containing missing values within one specific
column of data (Action would be "Delete Row" taken within the "Entire Column"), and replace all
missing values with a user-specified numeric value within another column of data (Action would
be "Replace With:" taken within "Entire Column"). The cell, row, and column reference will
always refer to the highlighted cell. After setting the "Take Action Within" menu, the user clicks
the "Take Action" button, VB 2.2 makes the specified changes to the spreadsheet, and the scan
continues. When the entire spreadsheet has been scanned and all bad cells have been fixed, VB 2.2
reports that "no anomalous data have been found," and the user can click the "Return" button to
close the Scan window.
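The scan-and-act workflow described above can be sketched in Python. This is only an illustration of the logic, not VB 2.2's own code: the function names, the list-of-lists grid, and the sample values are all hypothetical, and the first column is assumed to be the ID column that the scan skips.

```python
import math


def is_anomalous(cell, find_value=None):
    """True if the cell is blank, non-numeric, or equal to the optional find value."""
    if cell is None:
        return True
    text = str(cell).strip()
    if text == "":
        return True
    try:
        number = float(text)
    except ValueError:
        return True
    if math.isnan(number):
        return True
    return find_value is not None and number == float(find_value)


def scan(rows, find_value=None, first_data_column=1):
    """Yield (row, column) positions of anomalous cells, skipping the ID column."""
    for r, row in enumerate(rows):
        for c in range(first_data_column, len(row)):
            if is_anomalous(row[c], find_value):
                yield (r, c)


def delete_rows_within_column(rows, column, find_value=None):
    """Imitate the 'Delete Row' action taken within 'Entire Column'."""
    return [row for row in rows if not is_anomalous(row[column], find_value)]


def replace_within_column(rows, column, new_value, find_value=None):
    """Imitate the 'Replace With:' action taken within 'Entire Column'."""
    return [
        [new_value if (c == column and is_anomalous(cell, find_value)) else cell
         for c, cell in enumerate(row)]
        for row in rows
    ]


# Illustrative grid: ID column, response column, one IV column.
grid = [
    ["38507.33", "1.452", "360"],
    ["38507.46", "", "1403"],        # blank cell
    ["38507.63", "0.8016", "-999"],  # user-specified missing-data tag
    ["38508.33", "n/a", "337"],      # non-numeric cell
]
bad = list(scan(grid, find_value=-999))
```

With this sample grid, `bad` holds the positions of the blank cell, the -999 tag, and the text cell, in scan order. Applying `delete_rows_within_column(grid, 1)` and then `replace_within_column(..., 2, "0", find_value=-999)` corresponds to mixing two different actions across two columns, as in the example in the text.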
As stated earlier, VB 2.2 will not attempt to transform categorical data columns. It
automatically identifies columns with only two unique values as categorical, but if the user has
other categorical IVs with more than two categories, those should be identified to VB 2.2 by the
"Identify Categorical Variables" button.
Figure 10. Context-sensitive choices for the "Take Action Within" drop-down menu
6.4 Working with a Dataset Post-Validation
After the dataset has passed the validation scan, the function buttons across the top of the
Data Processing tab are enabled.
Figure 11. Post-validation enabling of the Data Processing functionality
At this point, the grid cells (other than the ID column) are editable - that is, users can
manually enter new numeric data into the cells by double-clicking on a cell and typing in a new
value. VB 2.2 does not allow blank cells or non-numeric data in cells. Additionally, a right mouse-
click on an IV column header presents options:
Figure 12. Right-click options on columns that are not the response variable
"Disable Column" turns the column's text red and prevents the column from being passed to the
Modeling tab of VB. Previously-disabled columns can be activated using "Enable Column." "Set
Response Variable" will make that IV the new response variable and it becomes blue as a visual
indication of this change. "View Plots" shows a new screen with column statistics at the far left
and four plots for that IV: (1) a scatterplot of the IV versus the response variable in the upper left
panel, (2) a plot of the IV values versus the ID column at the upper right (a time series plot if the
ID is an observation date), (3) a box-and-whiskers plot at the bottom left, and (4) a histogram for
the IV at the bottom right.
[Screenshot: "Variable airtemp" window. The statistics panel lists Variable Name, Row Count, Maximum Value, Minimum Value, Average Value, Unique Values, Zero Count, Median Value, Data Range, AD Statistic, AD Stat P-Value, Mean Value, Standard Deviation, Variance, Kurtosis, and Skewness. The plot panels are Scatter Plot, Time Series Plot, BoxWhisker Plot, and Frequency Plot.]
Figure 13. Four different plots available for evaluation of IVs
The scatter plot (upper left) is probably the most-examined, as it can indicate a non-linear
relationship between the IV and the response variable, problems with homogeneity of variance
across the range of the IV, or outliers. Ensuring that the IVs are linearly related to the response
variable raises the probability of producing a robust, meaningful analysis. If the relationship
between the response and the IV is not well-approximated by a straight line (a fundamental
assumption of MLR), it may be beneficial to transform the IV. Using VB 2.2 to accomplish this
will be explained later in this document. The scatterplot also shows the best-fit regression line in
red, along with the correlation coefficient ("r") and the significance (p-value) of the correlation
coefficient at the top of the plot. For the most part, p-values below 0.05 are considered statistically
significant.
Identifying odd values (potential outliers or bad data) of any IV can often be done by visually
inspecting these plots. If users double-click on the data point marker for any observation in one
of the top panels or the bottom left panel (i.e., not the histogram), they can disable that point (the
row) in the data grid.
Figure 14. Disabling an observation from within the XY scatterplot
The final choice, "Delete Column," deletes a column from the data grid, but the original
columns of the imported data sheet (VB 2.2 thinks of these as "main effects") cannot be deleted.
Rows can be disabled and enabled, but not deleted, from the data grid by right-clicking the row
header (far left of each row) and making the desired choice.
If the user right-clicks on the column header of the response variable, a different set of
choices is shown:
Figure 15. Available choices when right-clicking the current response variable
Users can transform the response variable in three ways: log10, loge, or a power
transformation (raising the response to an exponent: y^x). They can also un-transform the response,
view the plots shown previously for the IVs, or define a transformation of the response variable.
The define option is used when a datasheet is imported with an already-transformed response variable.
For example, users could import a datasheet with log10-transformed fecal indicator bacteria levels
and then define the response as being log10-transformed. Doing this facilitates later comparisons
with observations, decision criteria, and regulatory standards. When users transform the response
variable within VB 2.2 using the "Transform" option, VB 2.2 automatically defines the response as
having the chosen transformation and, in doing so, synchronizes the units of measurement for later
comparisons.
6.5 Computing Alongshore and Onshore/Offshore Wind, Wave and Current Components
Orthogonal wind, current, and wave vectors can be powerful predictors of beach bacterial
concentrations. Depending on the orientation of the beach, wind and currents can influence the
movement of bacteria from a nearby source to the beach, and wave action can re-suspend bacteria
buried in beach sediment. To make more sense of these data, researchers typically decompose
wind/current/wave magnitude and direction into A (alongshore) and O (offshore/onshore)
components for analysis (see equations at the end of this section).
If direction and magnitude (speed/height) data are available, A and O components can be
calculated with the "Compute A, O" button. Clicking it brings up a window where users specify
which columns of the data grid contain the relevant magnitude and direction data, using drop-down
menus (Figure 16). There is also an input box at the bottom of the form for the beach orientation
angle. If the user defined the angle on the "Beach Location" tab, that value should be seen here.
After clicking "OK," new data columns are added to the far right of the data grid, representing the
A and O components of the specified wind, current, or wave data. Unlike the originally imported
IVs, these components can be deleted from the data grid after they are created. Names of these
new columns will be: WindA_comp(X,Y,Z), CurrentO_comp(X,Y,Z), WaveA_comp(X,Y,Z), etc.,
where X is the name of the column of data used for magnitude, Y is the name of the column used
for direction, and Z is the beach orientation angle.
Figure 16. Window for computation of alongshore and offshore/onshore components
Notes on wind, wave and current component calculations:
Direction is an angular degree measure. Moving in a clockwise direction from north (0
degrees), values are positive, and negative while moving counter-clockwise. Wind and current
speed (as well as wave height) can be measured in any unit. VB 2.2 adheres to scientific
convention where wind direction is specified as the direction from which the wind blows, while
current and wave directions are specified as the direction toward which the current or waves move.
Thus, wind blowing from west to east has a direction of either 270 or -90 degrees, while a current/
wave moving from west to east has a direction of 90 degrees.
The A component measures the force of the wind/current/wave moving parallel to the
shoreline (Figure 17). A positive A component means winds/currents/waves are moving from
right to left as you look out at the water. A negative A component means winds/currents/waves are
moving left to right as you look out at the water. The O component measures force perpendicular
to the shoreline. A negative O value indicates movement from the land surface directly offshore
(unlikely to see with wave action). A positive O indicates waves/wind/currents from the water to
the shore. These relationships apply no matter how the beach is oriented (Figure 18).
Figure 17. A and O component definitions for wind, current, and wave data
Figure 18. Principal beach orientations given in degrees
Equations for calculation of Wind A/O components:
Wind A: -SPD * cosine ( (DIR-BO) * PI/180)
Wind O: SPD * sine ( (DIR-BO) * PI/180)
where SPD is wind speed, DIR is wind direction, BO is the beach orientation (in degrees) and PI
= 3.1416. Current A/O and Wave A/O are these same equations multiplied by -1.
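The A/O equations for wind, and their sign-flipped versions for currents and waves, can be sketched in Python as follows. This is a minimal illustration rather than VB 2.2's own code: the function names are hypothetical, and `math.radians` is used in place of the rounded constant PI = 3.1416, so results may differ from VB 2.2 in the last few digits.

```python
import math


def wind_components(speed, direction_deg, beach_orientation_deg):
    """Return (A, O) for wind, following the Wind A and Wind O equations."""
    angle = math.radians(direction_deg - beach_orientation_deg)
    alongshore = -speed * math.cos(angle)
    onshore = speed * math.sin(angle)
    return alongshore, onshore


def current_or_wave_components(magnitude, direction_deg, beach_orientation_deg):
    """Return (A, O) for a current or wave: the wind equations multiplied by -1."""
    a, o = wind_components(magnitude, direction_deg, beach_orientation_deg)
    return -a, -o


# Example: speed 10 (any unit), direction 90 degrees, beach orientation 0 degrees.
wind_a, wind_o = wind_components(10.0, 90.0, 0.0)
wave_a, wave_o = current_or_wave_components(10.0, 90.0, 0.0)
```

The multiplication by -1 reflects the direction conventions given in the notes: wind direction is the direction the wind blows from, while current and wave directions are the direction of travel.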
6.6 Creation of New Independent Variables
Users may click the "Manipulate" button to create new columns of data that might serve
as useful IVs. On the screen that pops up, there is a list of available IVs on the far left, under
"Independent Variables." If users wish to create a new term, they add any available IV used in this
new term by selecting it and using the ">" button to add it to the "Variables in Expression" box.
Clicking and dragging down through the "Independent Variables" list allows for multiple IVs to be
added at once.
Figure 19. Window for the formulation of "Manipulates" - arithmetic combinations of existing columns
within the data grid
For example, if users wish to create a new IV that is a row-by-row mean value of the
"centershintemp" and "centerwaisttemp" variables, they add those two to the "Variables in
Expression" box, choose the "Mean" function, "Add" that expression to the lower box, then
click "OK." That adds a new column of data, representing a row-by-row average of the two IVs,
to the end of the data grid (far right).
Figure 20. Creation of a new IV defined as the mean of two existing IVs
Users can create a row-by-row sum, maximum, minimum, mean, or product from any
number of IVs that are added to the "Variables in Expression" box. More than one expression
can be created before the "OK" button is clicked, and IVs can be easily moved in and out of the
box using "<" and ">" keys. Any created expressions can be removed from the lower box with
the "Remove" button. No matter how many IVs are added to the "Variables in Expression" box,
clicking "2nd Order Interactions" will add the cross-products for all possible pairings of those IVs.
Thus, four IVs will produce six interactions, five IVs will produce ten interactions, and so on.
Note that the names of the columns used to create any manipulate are inside the parentheses of that
manipulate's column name.
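The row-by-row expressions and the pairwise cross-products described above can be sketched in Python. This is an illustration only: the function names, the dictionary-of-lists data layout, the sample values, and the bracketed label format are assumptions made for the example, not VB 2.2's actual implementation.

```python
from itertools import combinations
from math import prod
from statistics import mean

# Row-wise functions offered on the "Manipulate" screen.
FUNCTIONS = {"SUM": sum, "MAX": max, "MIN": min, "MEAN": mean, "PROD": prod}


def manipulate(columns, names, function):
    """Return (label, values) for a row-by-row combination of the named columns."""
    label = f"{function}[{','.join(names)}]"
    rows = zip(*(columns[name] for name in names))
    return label, [FUNCTIONS[function](row) for row in rows]


def second_order_interactions(columns, names):
    """Return the cross-product column for every possible pairing of the names."""
    return [manipulate(columns, pair, "PROD") for pair in combinations(names, 2)]


# Illustrative two-row dataset.
columns = {
    "centershintemp": [28.0, 30.0],
    "centerwaisttemp": [29.0, 31.0],
    "WindSpeed": [1.0, 2.0],
    "airtemp": [29.0, 30.0],
}
mean_label, mean_values = manipulate(
    columns, ["centershintemp", "centerwaisttemp"], "MEAN")
interactions = second_order_interactions(columns, list(columns))
```

Because every unordered pair is formed exactly once, n IVs yield n(n-1)/2 interactions, which matches the counts in the text (six for four IVs, ten for five).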
Figure 21. Formation of two-way cross-products of a set of four existing IVs
VB 2.2 does not allow previously created "manipulates" (new columns of data created
through the "Manipulate" button) to be further manipulated. Previously-created manipulates will
not appear in the "Independent Variables" section at the left. They can, however, be chosen as the
response variable or deleted from the data grid, using the appropriate menu choices, accessed by a
right-click of the column header.
6.7 Transforming the Independent Variables
VB 2.2 gives users the ability to transform non-categorical IVs to assist in linearizing the
relationship between the IVs and the response variable, which is a fundamental assumption of an
MLR analysis. VB 2.2 provides the following transformations, where Xt is the transformed IV and
X is the original IV:
Log10: Xt = log10(X)
Loge: Xt = loge(X)
Inverse: Xt = 1/X
Square: Xt = X^2
Square Root: Xt = X^0.5
Quad Root: Xt = X^0.25
Polynomial: Xt = a + bX + cX^2
General Exponent: Xt = X^e where the user specifies the value of e
When users click the "Transform" button, they are presented a choice of transformations to
investigate:
Figure 22. The range of choices for IV transformations
When users click "Go", the chosen transforms are applied to each non-categorical IV. VB
2.2 then opens a table that allows comparison of the success of each transform using a Pearson
correlation coefficient, a measure of linear dependence between the response variable and the
IVs. For the polynomial transformation, the Pearson coefficient is calculated as the square root of
the adjusted R^2 value derived from the regression of the response on Xt. Because this adjusted R^2
value can possibly be negative, an empirically-derived formula is applied when adjusted R^2 values
fall below 0.1:
Polynomial Pearson Coefficient = (-6.67*RE1^2 + 13.9*RE1 - 6.24)*(R^2)^0.5
where RE1 = 1.015 - 1.856*R^2 + 1.862*adjR^2 - 0.000153*N, R^2 and adjR^2 are defined by the
regression of the response on Xt, and N = number of observations.
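The two-branch calculation of the polynomial Pearson coefficient can be sketched in Python. This is a hedged illustration: the function name is hypothetical, the constants are taken from the formula in the text, and the formula is read here as (-6.67*RE1^2 + 13.9*RE1 - 6.24) times the square root of R^2.

```python
import math


def polynomial_pearson(r2, adj_r2, n):
    """Pearson-like score for the polynomial transform of one IV.

    r2 and adj_r2 come from the regression of the response on Xt;
    n is the number of observations.
    """
    if adj_r2 >= 0.1:
        # Ordinary case: square root of the adjusted R^2.
        return math.sqrt(adj_r2)
    # Empirical formula used when adjusted R^2 falls below 0.1.
    re1 = 1.015 - 1.856 * r2 + 1.862 * adj_r2 - 0.000153 * n
    return (-6.67 * re1 ** 2 + 13.9 * re1 - 6.24) * math.sqrt(r2)


ordinary = polynomial_pearson(r2=0.25, adj_r2=0.16, n=37)
low_fit = polynomial_pearson(r2=0.05, adj_r2=-0.005, n=37)
```

The second call shows why the fallback exists: the adjusted R^2 is negative there, so its square root is undefined, while the empirical formula still returns a small positive score.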
The table that VB 2.2 creates groups all transformed versions of each IV by the IV name,
type of transformation, and the associated Pearson coefficient. By default, the transformation
(this includes the un-transformed version of the IV, denoted by "none") with the largest absolute
value of the Pearson coefficient is highlighted in bold text for selection. Users may override the
default selection by left-clicking on the row header of a transformed IV they choose. They may
also override the default by setting a Threshold percentage and clicking "Threshold Select" on
the left side of the box. This selects the un-transformed IV unless the transformed IV with the
highest absolute value Pearson coefficient exceeds the un-transformed IV Pearson coefficient
by the specified percentage. In essence, the user is saying, "Unless the Pearson coefficient of
the transformed IV is some % greater than the Pearson coefficient of the un-transformed IV, use
the un-transformed IV." This can be useful because transforming IVs makes interpreting model
coefficients more difficult; unless an improvement is seen, transformation may not be worth the
trouble. Users can also revert to the default by clicking "Go" under the "Auto Select" section at
the left.
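The difference between the default selection and "Threshold Select" can be sketched in Python. This is an illustration under stated assumptions: the function names and the coefficient values are hypothetical, and "exceeds by the specified percentage" is interpreted as the best transformed |r| being greater than the un-transformed |r| multiplied by (1 + threshold/100).

```python
def auto_select(coefficients):
    """Default: pick the entry with the largest absolute Pearson coefficient."""
    return max(coefficients, key=lambda name: abs(coefficients[name]))


def threshold_select(coefficients, threshold_pct):
    """Keep 'none' unless the best transform beats it by threshold_pct percent."""
    transformed = [name for name in coefficients if name != "none"]
    if not transformed:
        return "none"
    best = max(transformed, key=lambda name: abs(coefficients[name]))
    required = abs(coefficients["none"]) * (1 + threshold_pct / 100.0)
    return best if abs(coefficients[best]) > required else "none"


# Hypothetical Pearson coefficients for one IV and its transforms.
scores = {"none": -0.40, "LOG10": -0.44, "SQUARE": 0.50}
default_choice = auto_select(scores)
choice_20 = threshold_select(scores, 20)
choice_30 = threshold_select(scores, 30)
```

With these values, a 20% threshold still accepts SQUARE (0.50 against a required 0.48), while a 30% threshold falls back to the un-transformed IV (0.50 against a required 0.52).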
Figure 23. Pearson correlation coefficient scores for judging the efficacy of IV transformations
Plotting Transformed IVs
Users may prefer to examine plots visually to determine which transformation of an IV to
choose. If users right-click on a row header in this correlation table, they can view an array of
scatterplots, time series plots, or frequency plots for each data transformation of the IV represented
by that header. Scatterplots will show the best-fit regression line, the correlation coefficient, and
the p-value for that correlation coefficient.
Figure 24. Scatterplots (Response vs. IV) for six different data transformations of a single IV
After choosing a transformation for each IV, users click "OK." This populates the data
grid with new columns representing transformed versions of the IVs. The small checkbox in the
bottom left corner of Figure 23 controls whether the untransformed version of the IV remains
enabled in the data grid after the user clicks "OK." When the box is checked, for any IV in which
the user chooses a transformed version, the un-transformed version will be disabled in the data
grid. Notice that transformed versions of an IV are put into the data grid immediately after the
original, un-transformed IV.
Notes on Transformed IVs
Any transformations put into the data grid can be deleted with the "Delete Column" choice
after right-clicking on their column header. Transformed IVs will appear in the list of IVs on the
"Manipulate" screen; however, transformed IVs cannot be further transformed and will not appear
in the transform table if the user goes back to the "Transform" window.
VB 2.2 transformations have specific processing for certain data values and are not
pure mathematical transformations - they were designed to maintain data order while helping
to linearize the response-IV relationship. For the SQUARE (b=2), SQUAREROOT (b=0.5),
QUADROOT (b=0.25), INVERSE (b=-1) and GENERAL EXPONENT (b is user-defined)
transformations, VB 2.2 uses the signed equivalent of the mathematical function:
x^b == sign(x)*abs(x)^b
For example: (-2)^2 = -4, (-9)^0.5 = -3, (-4)^-0.5 = -0.5, (-2)^-2 = -0.25
To avoid potentially undefined values (i.e., 1/x when x = 0), the INVERSE and GENERAL
EXPONENT (if the user sets b < 0) transformations have special processing:
If x = 0, then VB 2.2 will find the minimum of abs(z), where z is the set of all non-zero
values for the IV in question. For the purpose of computing the transformation, once z is defined,
VB 2.2 substitutes z/2 for x. From this definition, note that z can be either a positive or negative
number.
LOG10 and LOGe transforms are also the signed equivalent of the mathematical functions:
loge(x) == loge(x)
loge(-x) == -loge(x)
log10(x) == log10(x)
log10(-x) == -log10(x)
In addition, if (-1 < x < 1), then loge(x) = 0 and log10(x) = 0
VB 2.2 will not compute the INVERSE, GENERAL EXPONENT (with a negative b),
LOG10 and LOGe transformations for data columns if more than 10% of the IV's values are zero.
Programmatically, zero is defined as any number whose absolute value is less than 1.0e-21.
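The special processing rules above can be collected into a short Python sketch. It is an illustration of the stated rules, not VB 2.2's own code; all function names are hypothetical, and the behavior for a zero value with a non-negative exponent (returning 0) is an assumption that follows from sign(0) = 0.

```python
import math

ZERO_TOLERANCE = 1.0e-21  # the text defines zero as abs(x) below this value


def is_zero(x):
    return abs(x) < ZERO_TOLERANCE


def half_smallest_nonzero(values):
    """Return z/2, where z is the non-zero value closest to zero (sign kept)."""
    nonzero = [v for v in values if not is_zero(v)]
    return min(nonzero, key=abs) / 2


def signed_power(x, b, zero_substitute=None):
    """sign(x) * abs(x)^b, with z/2 substituted for a zero when b < 0."""
    if is_zero(x):
        if b >= 0:
            return 0.0
        if zero_substitute is None:
            raise ValueError("a zero substitute (z/2) is needed when b < 0")
        x = zero_substitute
    return math.copysign(abs(x) ** b, x)


def signed_log(x, log=math.log10):
    """0 inside (-1, 1); otherwise sign(x) * log(abs(x))."""
    if -1 < x < 1:
        return 0.0
    return math.copysign(log(abs(x)), x)


def too_many_zeros(values, limit=0.10):
    """True when more than 10% of the values count as zero."""
    return sum(1 for v in values if is_zero(v)) > limit * len(values)
```

For instance, `signed_power(-9, 0.5)` gives -3.0 and `signed_log(-100)` gives -2.0, which keeps negative inputs ordered below positive ones. Passing `log=math.log` gives the LOGe variant.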
POLYNOMIAL transformations are the result of a linear regression of the response
variable on the IV and the square of the IV:
Poly(X) = a + b*X + c*X^2
where a, b, and c are determined by a multiple linear regression of the response
variable on X and X^2.
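A polynomial transformation of this kind can be sketched in plain Python by solving the least-squares normal equations for a, b, and c. This is an illustrative sketch with hypothetical function names and made-up data; VB 2.2's own regression routine is not shown in this document and may differ numerically.

```python
def fit_quadratic(x, y):
    """Least-squares fit of y = a + b*x + c*x^2 via the 3x3 normal equations.

    Needs at least three distinct x values; otherwise the system is singular.
    """
    s = [sum(v ** k for v in x) for k in range(5)]                   # sums of x^k
    t = [sum((v ** k) * w for v, w in zip(x, y)) for k in range(3)]  # sums of x^k * y
    m = [
        [s[0], s[1], s[2], t[0]],
        [s[1], s[2], s[3], t[1]],
        [s[2], s[3], s[4], t[2]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [mr - factor * mc for mr, mc in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def poly_transform(x, coefficients):
    """Apply Poly(X) = a + b*X + c*X^2 to every value of the IV."""
    a, b, c = coefficients
    return [a + b * v + c * v * v for v in x]


# Made-up data that follows y = 1 + 2x + 3x^2 exactly.
iv = [0.0, 1.0, 2.0, 3.0, 4.0]
response = [1.0, 6.0, 17.0, 34.0, 57.0]
coeffs = fit_quadratic(iv, response)
transformed_iv = poly_transform(iv, coeffs)
```

Because the transformed IV is the fitted value of the response, a change of response variable changes a, b, and c, which is consistent with transformed IVs being tied to the current response variable.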
In general, the name of the transformed column of data that VB 2.2 creates is simply
the type of transformation, with the original data column name in parentheses. For example,
WaterTemp would become LOG10(WaterTemp). There are some exceptions, however:
INVERSE(X,Y): X is the original data column name and Y is the z/2 value discussed
earlier in this section.
POWER(X,Y) : When Y is positive, X is the original data column name and Y is the
exponent specified by the user.
POWER(X,Y,Z) : When Y is negative, X is the original data column name, Y is the
exponent specified by the user, and Z is the z/2 value discussed earlier in this section.
POLY(X, a,b,c): X is the original data column name and a, b, and c are the values of the
polynomial regression coefficients.
Finally, because transformations are determined by the current response variable, when
users change the response variable in the data grid (using the column header right-click menu), all
transformed IVs in the data grid are erased (a message warns the user).
6.8 Saving Processed Data
Data can be saved in a project file (Project -> Save) at any time during data processing. When
the file is opened, the data grid will be repopulated as it appeared when the project was saved.
Also, users may highlight the entire table or sections of the table and use Control-C and Control-V
to copy and paste the data grid into a word processing or spreadsheet application.
6.9 Go to Modeling
After data processing is complete, users must click the "Go to Modeling" button to open
the Modeling tab. If users have already done modeling work and returned to the data sheet
to make changes, they will receive a message that the data sheet has changed and any prior
information on the Modeling, Residual, or MLR Prediction tabs will be erased. Users can then
choose to move forward to the Modeling tab or revert to the previous version of the data sheet
prior to making changes.
The Modeling tab facilitates finding the best model based on criteria selected by the user.
As the number of IVs increases, the number of possible models in the solution space increases
exponentially. Users may select all or a subset of the IVs for consideration in the model to reduce
the size of the solution space.
-------
7.0 Modeling
7.1 Selecting Variables for Model Building
All eligible IVs are listed in the left column ("Available Variables") under the Variable
Selection sub-tab. Any variable users wish to consider for model inclusion must then be moved to
the "Independent Variables" list by highlighting the IV and clicking the ">" button. Any number of
IVs can be moved to or removed from this list.
[Screenshot: Modeling tab, Variable Selection sub-tab. Number of Observations: 37; Available
Variables (7): uv, airtemp, waveheight, centershintemp, centerwaisttemp, WindSpeed, WindDirection;
Independent Variables (0).]
Figure 25. Selecting variables for MLR processing within the Modeling tab
As you add or remove IVs from the "Independent Variables" list, the number of possible MLR
models is displayed in the status strip at the bottom right of the application window. The number
of possible models can grow exceedingly large; 66 IVs represent 7.38 × 10^19 possibilities. More
than 66 variables produce a number that exceeds the capacity of the program to store it; in such
cases, "more than 9.2e019" is displayed.
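The growth in the number of candidate models can be checked with a short calculation. The sketch below assumes that every non-empty subset of the selected IVs counts as one model, which is consistent with the counts reported elsewhere in this section (127 models for 7 IVs, 32767 for 15 IVs). The function name count_models is hypothetical and is not part of VB 2.2.

```python
def count_models(num_ivs: int) -> int:
    """Number of non-empty subsets of num_ivs independent variables.

    Assumption: each non-empty combination of IVs is one candidate MLR model.
    """
    return 2 ** num_ivs - 1


if __name__ == "__main__":
    for k in (7, 15, 16, 17, 66):
        print(f"{k:>2} IVs -> {count_models(k):.4g} possible models")
```

Each additional IV roughly doubles the count, which is why the exhaustive search becomes impractical long before the 66-IV storage limit is reached.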
7.2 Modeling Control Options
The first decision users make on this tab involves which evaluation criteria will be used to judge
model fitness. There are currently ten criteria available in the drop-down menu:
Akaike Information Criterion (AIC)
Corrected Akaike Information Criterion (AICC)
R2
Adjusted R2
Predicted Error Sum of Squares (PRESS)
Bayesian Information Criterion (BIC)
RMSE
Sensitivity
Specificity
Accuracy
[Screenshot: Control Options sub-tab. Evaluation Criteria drop-down set to Akaike Information
Criterion (AIC); Maximum Number of Variables in a Model (Available: 7, Recommended: 4, Max: 7);
Maximum VIF: 5.]
Figure 26. Setting modeling options within the Modeling interface
The "Maximum VIF" (Variance Inflation Factor) parameter is used to discard models that
contain variables with a high degree of multi-collinearity, i.e., IVs that are highly
correlated with other IVs. If any IV in a model has a VIF exceeding the threshold, that model will
be discarded. The default threshold used in the application is 5. A VIF of 5 means that 80%
(1 - 1/5) of the variability in an IV can be explained by the variability of other IVs in the model. A
VIF of 10 means that 90% (1 - 1/10) of the variability can be explained, and so on. If users aren't
concerned with multi-collinearity among the explanatory variables in a regression model, they can
raise the Maximum VIF value. However, multi-collinearity leads to poorly estimated regression
coefficients (i.e., large standard errors of these coefficients).
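The VIF of an IV follows from regressing that IV on all the other IVs in the model: VIF = 1 / (1 - R2), where R2 is the coefficient of determination of that auxiliary regression. The following is a minimal sketch using NumPy; it illustrates the standard definition and is not taken from the VB 2.2 source. The function name vif and the example data are hypothetical.

```python
import numpy as np


def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factor of each column of X (n observations x k IVs)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Auxiliary regression: IV j on an intercept plus all other IVs.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=200)
    b = rng.normal(size=200)
    c = a + 0.1 * rng.normal(size=200)  # nearly a copy of a: strongly collinear
    print(vif(np.column_stack([a, b, c])))
```

With a Maximum VIF of 5, any model containing both a and c in this example would be discarded, while b alone would pass.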
The "Maximum Number of Variables in a Model" parameter tells VB 2.2 how large the
models being evaluated can be. As a rule, most modelers prefer to have about 10 observations per
estimated parameter in their models; otherwise, the risk of model over-fitting and poor
estimation of regression parameters increases. VB 2.2's recommendation is close to this rule: it equals
(1 + n/10), where n is the number of observations in the dataset. The maximum allowable number
equals n/5, and VB 2.2 won't let users set this value over the maximum. The total number of available
parameters is also given here.
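As a worked example of these two limits, the sketch below applies the (1 + n/10) and n/5 rules. Rounding down to a whole number is an assumption; it is consistent with the 37-observation example in Figure 26 (Recommended: 4, Max: 7), but the text does not state the rounding rule. The function name variable_limits is hypothetical.

```python
from typing import Tuple


def variable_limits(n_obs: int) -> Tuple[int, int]:
    """Return (recommended, maximum) number of variables for n_obs observations.

    Assumption: fractional results are truncated to whole numbers.
    """
    recommended = 1 + n_obs // 10  # floor(1 + n/10)
    maximum = n_obs // 5           # floor(n/5)
    return recommended, maximum


if __name__ == "__main__":
    print(variable_limits(37))
```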
If we define p as the number of parameters in a model, n as the number of observations in
the dataset, RSS as the residual sum of squares for a model, and TSS as the total sum of squares for
a model, then the evaluation criteria for a model can be defined as:
Akaike Information Criterion (AIC): $2p + n \ln(\mathrm{RSS})$
Corrected Akaike Information Criterion (AICC): $\ln(\mathrm{RSS}/n) + (n+p)/(n-p-2)$
R2: $1 - \mathrm{RSS}/\mathrm{TSS}$
Adjusted R2: $1 - (1-R^2)(n-1)/(n-p-1)$
Bayes (Schwarz) Information Criterion (BIC): $n \ln(\mathrm{RSS}/n) + p \ln(n)$
Root Mean Squared Error (RMSE): $(\mathrm{RSS}/n)^{1/2}$
Predicted Error Sum of Squares (PRESS): $1 - \sum_i (y_i - \hat{y}_{-i})^2 / \sum_i (y_i - \bar{y})^2$
where $y_i$ is the $i$th observation, $\hat{y}_{-i}$ is the model estimate of the $i$th observation when the
model coefficients are fitted with the $i$th observation removed from the dataset, and $\bar{y}$ is the
mean value of $y$ in the dataset
Accuracy: (true positives + true negatives) / number of total observations
Sensitivity: true positives / (true positives + false negatives)
Specificity: true negatives / (true negatives + false positives)
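The closed-form criteria above can be evaluated directly from RSS, TSS, n and p. The sketch below transcribes the formulas as they are printed in this section; the exact expressions used inside VB 2.2 may differ (for example, in whether p counts the intercept), so treat it as illustrative only. PRESS is left out because it requires refitting the model. The function name evaluation_criteria is hypothetical.

```python
import math
from typing import Dict


def evaluation_criteria(rss: float, tss: float, n: int, p: int) -> Dict[str, float]:
    """Evaluate the printed formulas for one model.

    rss: residual sum of squares, tss: total sum of squares,
    n: number of observations, p: number of parameters.
    """
    r2 = 1.0 - rss / tss
    return {
        "AIC": 2 * p + n * math.log(rss),
        "AICC": math.log(rss / n) + (n + p) / (n - p - 2),
        "R2": r2,
        "Adjusted R2": 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1),
        "BIC": n * math.log(rss / n) + p * math.log(n),
        "RMSE": math.sqrt(rss / n),
    }


if __name__ == "__main__":
    # Hypothetical sums of squares chosen so that R2 = 0.4195 with n = 37, p = 3.
    for name, value in evaluation_criteria(58.05, 100.0, 37, 3).items():
        print(f"{name}: {value:.4f}")
```

With R2 = 0.4195, n = 37 and p = 3, the adjusted R2 formula gives approximately 0.3667, matching the values displayed for the example model in Figure 33.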
Sensitivity, specificity and accuracy are special cases that require users to enter both
a Decision Criterion (DC) and Regulatory Standard (RS) so that true/false positives and true/
false negatives can be defined. The DC is a modeled (predicted) value the user chooses. Model
predictions above this threshold are considered exceedances, while model predictions below
this value are considered non-exceedances. The RS is typically a safety limit on fecal indicator
bacteria (FIB) levels set by a state or federal agency. The "Threshold Transform" radio buttons tell
VB 2.2 how to transform the DC and RS for comparison to model predictions and observations.
If a transformation definition is set for the response variable (either manually by the user or
automatically by transforming the response) during data processing, that definition will be set as
the default here. Users should understand that changing the threshold transform definition can lead
to problems when comparing modeling predictions to observations. Caution should be exercised.
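How the DC and RS turn predictions and observations into true/false positives and negatives can be sketched as follows. The code uses the standard definitions (sensitivity as the true positive rate, specificity as the true negative rate), which agree with Section 7.7 and with the values shown in Figure 35. Treating a value exactly equal to a threshold as a non-exceedance is an assumption, and both thresholds are assumed to be on the same (possibly transformed) scale as the data. The function name classify is hypothetical.

```python
from typing import Dict, Sequence


def classify(predicted: Sequence[float], observed: Sequence[float],
             dc: float, rs: float) -> Dict[str, float]:
    """Count decision outcomes and derive sensitivity, specificity and accuracy."""
    tp = fp = tn = fn = 0
    for pred, obs in zip(predicted, observed):
        model_exceeds = pred > dc   # model predicts an exceedance
        truly_exceeds = obs > rs    # observation exceeds the regulatory standard
        if model_exceeds and truly_exceeds:
            tp += 1
        elif model_exceeds:
            fp += 1
        elif truly_exceeds:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "accuracy": (tp + tn) / total if total else float("nan"),
    }


if __name__ == "__main__":
    # Hypothetical data: 35 correct non-exceedances and 2 missed exceedances.
    predicted = [100.0] * 37
    observed = [50.0] * 35 + [300.0] * 2
    print(classify(predicted, observed, dc=235.0, rs=235.0))
```

The hypothetical data reproduce the pattern in Figure 35: no false positives, two false negatives, sensitivity 0, specificity 1, and accuracy 35/37.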
[Screenshot: Model Evaluation Thresholds box with Decision Criterion (Horizontal) and Regulatory
Standard (Vertical) entry fields, Threshold Transform radio buttons (None, Log10, Ln, Power), and
Current US Regulatory Standards: E. coli, Freshwater: 235; Enterococci, Freshwater: 61;
Enterococci, Saltwater: 104.]
Figure 27. Setting evaluation thresholds and threshold transformation information within the
modeling interface
-------
7.3 Linear Regression Modeling Methods
There are two options for exploring the solution space.
Manual - this option is for a directed model search. If the 'Run all combinations' box
is not checked, a single model including every IV that was added to the "Independent
Variables" column will be evaluated. If 'Run all combinations' is checked, an exhaustive
search is performed. The exhaustive search evaluates every model that can be constructed
with the selected IVs, but does not evaluate any with more parameters than the "Maximum
Number of Variables in a Model" input box. For example, if there are 24 IVs to evaluate
and the maximum number of IVs in a model is set at 8, the exhaustive routine examines
every possible 1-, 2-, 3-, 4-, 5-, 6-, 7-, and 8-parameter model. As the number of IVs rises,
the number of possible models quickly gets so large that the exhaustive routine cannot
maintain reasonable computation times and the user is advised to switch to the genetic
algorithm.
Genetic Algorithm - the Genetic Algorithm (GA) option explores solution spaces too large
to handle exhaustively. Genetic algorithms are loosely based on the natural evolutionary
process, in which individuals in a population reproduce and mutate. Individuals with high
fitness (regression models that produce small residuals) are more likely to reproduce and
pass their genes (IVs) to the next generation. The goal is to find a good solution without
having to examine every possible option and the GA balances random and directed
searching.
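The exhaustive search described under the Manual option can be sketched as an enumeration of every IV subset up to the maximum model size. This is a minimal illustration, not the VB 2.2 implementation; the function names candidate_models and count_candidates are hypothetical.

```python
from itertools import combinations
from math import comb
from typing import Iterator, Sequence, Tuple


def candidate_models(ivs: Sequence[str], max_vars: int) -> Iterator[Tuple[str, ...]]:
    """Yield every non-empty subset of ivs with at most max_vars members."""
    for size in range(1, min(max_vars, len(ivs)) + 1):
        yield from combinations(ivs, size)


def count_candidates(num_ivs: int, max_vars: int) -> int:
    """Number of models the enumeration above would visit."""
    return sum(comb(num_ivs, k) for k in range(1, min(max_vars, num_ivs) + 1))


if __name__ == "__main__":
    for model in candidate_models(["uv", "airtemp", "waveheight"], 2):
        print(model)
    print(count_candidates(24, 8))
```

For the example in the text (24 IVs with at most 8 per model), the count is 1,271,625 models, each of which must be fitted and scored.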
-------
[Screenshot: two views of the Modeling tab control options. Left panel: Manual sub-tab with the
"Run all combinations" check box. Right panel: Genetic Algorithm sub-tab with Set Seed Value,
Population Size, Number of Generations, Mutation Rate, and Crossover Rate fields.]
Figure 28. Model building interface using a manual search (left panel) or the Genetic Algorithm
(right panel)
Choosing between an exhaustive and a GA search depends on your data set, available
hardware and time constraints. Fifteen IVs produce about 32,000 possible models; on our
system (Dell Precision T5400 workstation running MS Windows XP SP3 with dual Xeon 2.66 GHz
processors and 4 GB RAM), the exhaustive search was completed in approximately 90 seconds.
Sixteen IVs represent more than 65,000 possibilities, about double the number for 15 IVs; each
additional IV roughly doubles the size of the search. Some model building results are summarized below:
Exhaustive Search - Run All Combinations

Number of IVs    Number of MLR models    Approximate Time Required to Generate and Filter Models (seconds)
15               32767                   90
16               65535                   10
17               131071                  280
By contrast, the GA with 17 IVs was completed in less than seven seconds. We note, however,
that the exhaustive search did find a slightly better model than the GA did using the selected AIC
evaluation criterion (49.2 versus 55).
An alternative modeling strategy could be to use the GA on your entire list of IVs, then the
exhaustive search on a subset of the initial IVs: any IV that appears in one of the best ten models
found by the GA. This two-step modeling process is facilitated with the "IV Filter" list control.
[Screenshot: Model Information panel with the "Best Fits" list of model scores and the "IV Filter"
controls.]
Figure 29. Using the IV filter to select a subset of variables from the best-fit models
When the GA ends and the 10 best models are shown, use the "Clear List" button to
remove all IVs from the selection list. Select a model from the "Best Fits" list one at a time and
click the "Add to List" button; this action adds any IVs in the model to the Independent Variable
list. After doing this for the ten best models, users likely have a much more manageable IV list
and can run an exhaustive search to find the very best combination of IVs. Regardless of the
method chosen to build models, the "Best Fits" window shows the top ten models found, in terms
of the evaluation criterion chosen.
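The effect of the "Clear List" / "Add to List" steps is to build the union of the IVs that appear in the best-fit models. A minimal sketch of that bookkeeping follows; the function name iv_filter and the example models are hypothetical.

```python
from typing import List, Sequence


def iv_filter(best_fit_models: Sequence[Sequence[str]]) -> List[str]:
    """Return each IV that appears in any model, in order of first appearance."""
    selected: List[str] = []          # starts empty, like "Clear List"
    for model in best_fit_models:     # one "Add to List" click per model
        for iv in model:
            if iv not in selected:
                selected.append(iv)
    return selected


if __name__ == "__main__":
    models = [
        ["uv", "waveheight", "WindDirection"],
        ["uv", "airtemp"],
        ["waveheight", "WindSpeed"],
    ]
    print(iv_filter(models))
```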
7.4 Using the Genetic Algorithm
There are five parameters users can set to adjust performance of the GA:
a) Seed value: initializes the internal random number generator. Setting the
seed to a known value will make the GA run reproducible. Changing the seed will create
a new series of random values, possibly returning different results.
b) Population size: number of individuals in the population of each generation. A larger
population broadens the search at each generation, but slows processing time.
c) Number of generations: how long to run the search since individuals can reproduce
and mutate once each generation. The fitness of every individual in the population is
evaluated at the end of each generation.
d) Mutation rate: chance each individual has of undergoing random mutation in each
generation. The higher the mutation rate, the more random (less directed) the search of
parameter space is.
e) Crossover rate: probability that two selected individuals in the population will exchange
genome parts. Exchanging genes creates new individuals in the population.
The best GA parameter values depend on the dataset being investigated, but typical values
of the mutation rate are between 0.001 and 0.1 (0.1 and 10%) and typical values of the crossover
rate are between 0.4 and 0.75 (40 and 75%). For most datasets, a population size and generation
number of 100 will be sufficient. Larger datasets may require an increase in these numbers for
optimal solutions.
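To show how the five parameters interact, the following is a minimal genetic algorithm sketch for choosing a subset of IVs. It is not the VB 2.2 algorithm: the selection scheme (the fitter half of each generation reproduces), the single-point crossover, the single-bit mutation, and the default rates are simplifying assumptions, and the fitness function is supplied by the caller (lower is better, as with AIC). All names are hypothetical.

```python
import random
from typing import Callable, List, Optional, Tuple

Genome = Tuple[bool, ...]  # one flag per candidate IV: True means "in the model"


def genetic_search(num_ivs: int,
                   fitness: Callable[[Genome], float],
                   seed: Optional[int] = None,
                   population_size: int = 100,
                   generations: int = 100,
                   mutation_rate: float = 0.05,
                   crossover_rate: float = 0.5) -> Genome:
    """Return the best genome found; lower fitness values are better."""
    rng = random.Random(seed)  # a fixed seed makes the run reproducible
    population: List[Genome] = [
        tuple(rng.random() < 0.5 for _ in range(num_ivs))
        for _ in range(population_size)
    ]
    best = min(population, key=fitness)
    for _ in range(generations):
        ranked = sorted(population, key=fitness)
        best = min(best, ranked[0], key=fitness)
        parents = ranked[: max(2, population_size // 2)]  # fitter half reproduces
        children: List[Genome] = []
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            child = list(mother)
            # Crossover: with the given probability, swap in the father's tail.
            if num_ivs > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, num_ivs)
                child[cut:] = father[cut:]
            # Mutation: with the given probability, flip one randomly chosen IV.
            if rng.random() < mutation_rate:
                flip = rng.randrange(num_ivs)
                child[flip] = not child[flip]
            children.append(tuple(child))
        population = children
    return min(population + [best], key=fitness)


if __name__ == "__main__":
    # Toy fitness: distance from a hypothetical "ideal" set of IVs.
    target = (True, False, True, True, False, False, True, False)
    score = lambda g: sum(a != b for a, b in zip(g, target))
    print(genetic_search(len(target), score, seed=1))
```

In practice the fitness function would fit an MLR model to the IVs flagged in the genome and return the chosen evaluation criterion; the toy function here only keeps the example self-contained.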
[Screenshot: Genetic Algorithm sub-tab with the Set Seed Value check box, the Population Size and
Number of Generations fields, and the Run button.]
Figure 30. Genetic algorithm options within the modeling interface
7.5 Evaluating Model Output
After selecting a method to build models and an evaluation criterion to rank them,
users then click the "Run" button. Model selection and evaluation progress is displayed on the
"Progress" graph at the lower right of the Modeling tab. Note that the "Run" button changes to
"Cancel;" the process is interruptible should progress be unacceptably slow. Once model-building
is completed, the ten best MLR fits are displayed in the "Best Fits" box. Selecting a model from
the list results in (see Figure 31):
1. A list of the model's IVs with associated regression coefficients and statistics is
displayed on the "Variable Statistics" subtab.
2. A list of the model's evaluation metrics is shown on the "Model Statistics" subtab.
3. The "Results" subtab will show the observations and model fits versus the observation
number. If observations are chronologically ordered, this is basically a time series plot.
4. The "Observed versus Predicted" subtab can show plots and tables based on
observations versus model fits.
5. The "ROC Curves" subtab shows a plot of the Receiver Operating Characteristic curve
of each "Best Fits" model, as well as a table showing the computed AUC (area-under-
the-curve) for each ROC (see Section 7.7).
6. Clicking on "View Report" generates a text report of model and variable statistics for
the selected model.
7. The "Residuals" tab will appear at the top, allowing users to proceed to the residual
analysis component of the application.
8. The "Prediction" tab will appear at the top, allowing users to proceed to the prediction
component of the application.
Note that selecting a different model from the "Best Fits" list updates the Variable and
Model statistics tables and displays of the plotting subtabs.
[Screenshot: Modeling tab after an exhaustive run, showing the control options, the "Best Fits"
list, the Progress plot (percent complete), the Variable Statistics table for the selected model,
and the status strip reading "Total number of possible models: 127".]
Figure 31. Modeling results shown after completion of an exhaustive regression run
-------
[Screenshot: Model Information panel with the "Best Fits" list and "IV Filter" controls. The
Variable Statistics subtab for the selected model reads:]

Parameter        Coefficient    Standardized Coefficient    Std. Error    t-Statistic    P-Value
(Intercept)      1.8228                                     0.2994        6.0879         7.4508e-07
uv               -0.0007        -0.5050                     0.0002        -3.7750        0.0006
waveheight       1.6811         0.2239                      1.0139        1.6580         0.1068
WindDirection    -0.0030        -0.4177                     0.0010        -3.1185        0.0038
Figure 32. Modeling Interface showing variable statistics for the selected Best-Fit model
[Screenshot: Model Information panel with the "Best Fits" list and "IV Filter" controls. The
Model Statistics subtab for the selected model reads:]

Metric                              Value
R Squared                           0.4195
Adjusted R Squared                  0.3667
Akaike Information Criterion        7.2471
Corrected AIC                       9.1826
Bayesian Info Criterion             -25.3092
PRESS                               17.0348
RMSE                                0.6188
Sensitivity                         0.0000
Specificity                         1.0000
Accuracy                            0.9459
Number of Observations              37
Figure 33. Modeling interface showing model evaluation metrics for the selected Best-Fit model
-------
[Screenshot: Model Information panel with the Model Statistics subtab and the "Results" plot of
observations and model fits versus observation number.]
Figure 34. Modeling interface showing a time series plot for the selected model
-------
[Screenshot: "Pred vs Obs" view with the Decision Criterion (horizontal) and Regulatory Standard
(vertical) both set to 235, the Log10 threshold transform selected, and a scatter plot of
Predictions (y-axis) versus Observations (x-axis). Model Evaluation box: False Positives (Type I): 0;
Specificity: 1; False Negatives (Type II): 2; Sensitivity: 0; Accuracy: 0.9459.]
Figure 35. An XY scatter plot of observed versus predicted values for the selected model
-------
[Screenshot: ROC Curves subtab with the plot "Receiver Operating Characteristic Curves for
Best-Fit Models" (x-axis: 1 - Specificity) and the following AUC table:]

Model Fit    AUC
7.2471       .739683
8.2076       .635714
9.1112       .732143
9.2219       .754464
9.2231       .754464
9.2471       .739683
10.176       .63
10.2047      .635714
10.2063      .635714
10.2076      .635714
Figure 36. The ROC curves and AUC table for the Best Fit models
7.6 Viewing X-Y Scatterplots
In multiple locations within VB 2.2 (Modeling, Residual and MLR Prediction tabs), users
can access a subtab that allows them to view information for comparing observations to model
predictions (Figure 35). From this space, users can view four different pieces of data:
1) A plot of predictions versus observations: "Pred vs. Obs"
2) A table summarizing model errors (false negatives/false positives) as the decision criterion (DC)
varies across the range of the response variable: "Error Table: DC as CFU"
3) A plot of the percent of probability of exceedance (calculated based on the current DC) versus
observations: "% Exc vs. Obs"
4) A table summarizing model errors as the percent of probability of exceedance is varied: "Error
Table: DC as % Exc"
-------
These four views are chosen with the drop-down menu at the top left corner of the form. On
either plot, a right-button click in the plot area shows a menu of functions for saving,
copying, printing or manipulating the plot view. The plot area can be zoomed and un-zoomed:
left-button mouse drags an area for zooming in; with right-button click, select "Un-Zoom" or "Set
Scale to Default" to see the entire data set. To pan to an area of the plot not in view, hold the Shift
key down and use the left mouse button to drag the view. To view (x,y) values of any data point,
hover the cursor over the data point. If the information does not appear, right-click on the graph
and make sure "Show Point Values" is selected.
Regarding interpretation of these plots, the green (Regulatory Standard) and blue
(Decision Criterion) lines permit model evaluation and provide information on which to base a
DC to be used for predictive purposes. On the plots, false positives are data points in the
upper left quadrant of the graph, in which the model predictions exceed the DC, but observations
are below the RS. In such cases, a beach advisory would be incorrectly issued based on the
model prediction, leading to potential economic losses. False negatives (points in the lower right
quadrant) represent a potentially more serious scenario: model predictions below the DC and
observations that exceed the RS. In other words, swimming at the beach may have been allowed
when it should have been prohibited due to elevated FIB concentrations.
A model that produces no false positives or false negatives would be an ideal decision
tool, but this is often unattainable with real data. Examining the two tables (#2 and #4 mentioned
above) on this subtab should allow users to set a robust DC (either using units of the actual
response variable or a percentage probability of exceedance) that minimizes both errors. Note that
in most cases, the RS is set based on federal or state law and should not be adjusted by the user;
however, the user is free to adjust the DC to minimize false negatives and false positives.
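The idea behind the "Error Table: DC as CFU" view can be sketched by counting false positives and false negatives for a series of candidate DC values while the RS stays fixed. This is a minimal illustration with hypothetical names and data; treating values exactly equal to a threshold as non-exceedances is an assumption.

```python
from typing import List, Sequence, Tuple


def error_table(predicted: Sequence[float], observed: Sequence[float],
                rs: float, candidate_dcs: Sequence[float]) -> List[Tuple[float, int, int]]:
    """Return (DC, false positives, false negatives) for each candidate DC."""
    rows = []
    for dc in candidate_dcs:
        pairs = list(zip(predicted, observed))
        # False positive: model exceeds the DC but the observation is within the RS.
        fp = sum(1 for p, o in pairs if p > dc and o <= rs)
        # False negative: model is within the DC but the observation exceeds the RS.
        fn = sum(1 for p, o in pairs if p <= dc and o > rs)
        rows.append((dc, fp, fn))
    return rows


if __name__ == "__main__":
    predicted = [1.0, 2.0, 3.0, 4.0]
    observed = [1.0, 3.0, 2.0, 4.0]
    for dc, fp, fn in error_table(predicted, observed, rs=2.5, candidate_dcs=[0.5, 2.5, 4.5]):
        print(f"DC={dc}: false positives={fp}, false negatives={fn}")
```

Lowering the DC trades false negatives for false positives, so scanning such a table is one way to pick a DC that balances the two errors.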
7.7 ROC Curves
In addition to time series and scatterplots which show results for an individual model,
users may also compare all "Best Fits" models using the ROC Curves tab. A Receiver Operating
Characteristic curve shows a model's true positive rate (sensitivity) plotted against its false positive
rate (1 - specificity) as a decision threshold varies between the model's minimum and maximum
predicted values. Models can then be compared using the area under their ROC curves (AUC).
Models having the largest AUC values perform best over the entire decision space.
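The ROC construction can be sketched by sweeping the decision threshold across the model's predicted values, recording (1 - specificity, sensitivity) at each step, and integrating with the trapezoid rule. This is a generic illustration rather than the VB 2.2 routine; the function name roc_auc, the use of each distinct prediction as a threshold, and the strict "greater than" comparisons are assumptions.

```python
from typing import List, Sequence, Tuple


def roc_auc(predicted: Sequence[float], observed: Sequence[float], rs: float) -> float:
    """Area under the ROC curve for one model, using the trapezoid rule."""
    actual = [o > rs for o in observed]   # True where the RS was really exceeded
    positives = sum(actual)
    negatives = len(actual) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one exceedance and one non-exceedance")
    points: List[Tuple[float, float]] = []
    # -inf classifies everything as an exceedance; the maximum prediction, nothing.
    for dc in [float("-inf")] + sorted(set(predicted)):
        tp = sum(1 for p, a in zip(predicted, actual) if p > dc and a)
        fp = sum(1 for p, a in zip(predicted, actual) if p > dc and not a)
        points.append((fp / negatives, tp / positives))  # (1 - specificity, sensitivity)
    points.sort()
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    observed = [0.0, 0.0, 10.0, 10.0]
    print(roc_auc([1.0, 2.0, 3.0, 4.0], observed, rs=5.0))  # predictions rank perfectly
    print(roc_auc([4.0, 3.0, 2.0, 1.0], observed, rs=5.0))  # predictions rank in reverse
```

A model whose predictions rank every exceedance above every non-exceedance has an AUC of 1, while a model that ranks them in exactly the wrong order has an AUC of 0.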
The model with the largest AUC appears in red text in the ROC tab's model list. A single
ROC may be plotted by selecting a model in the list and clicking "Plot." Multiple models can
be selected in the usual Windows fashion with Shift-Click (select all items between the first and
second selection) or Control-Click (select only the clicked items). The background cell color of
models not selected for plot display will be gray after the "Plot" button is clicked.
Clicking "View Table" will replace the ROC plot with a table showing the false positives,
false negatives, sensitivity, and specificity at every evaluated value of the Decision Criterion for a
single selected model. Users need only click on a model in the list to the left of this table to see its
results. The ROC plot will return to view after clicking "View Plot."
AUC calculations are performed and curves are plotted when the "ROC Curves" tab is
selected. If this tab is active and new models are subsequently built, leaving this tab and then
returning will generate the new plots and AUC values.
-------
7.8 Cross-Validation
Clicking the "Cross-Validation" button on the Modeling tab brings up a sub-screen. On it
users can set two parameters: sample size for the testing data (T) and number of random samples
(R) taken. When cross-validation is started, a random sample of size T is taken from the modeling
dataset and set aside. Each "Best Fits" model is then re-fit to the remaining training data. The
IVs in each model stay the same, but the regression coefficients are adjusted to reflect the
least-squares fit to the training data. The Mean Squared Error of Prediction (MSEP) is then calculated
based on the T testing data points for each candidate model. The process (taking a random testing
sample; re-fitting regression coefficients for the ten candidate models based on the training data;
using the re-fit models to make predictions; and computing 10 MSEP values) will be done R times.
A table will show average MSEP values for each candidate model.
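The procedure above can be sketched for a single candidate model as a repeated random hold-out. This is a minimal illustration using NumPy, not the VB 2.2 code; the function name cross_validate, the argument names, and the example data are hypothetical, with test_size playing the role of T and trials the role of R.

```python
from typing import Optional

import numpy as np


def cross_validate(X: np.ndarray, y: np.ndarray, test_size: int, trials: int,
                   seed: Optional[int] = None) -> float:
    """Average MSEP over `trials` random hold-out samples of `test_size` rows.

    X holds the model's IV columns (without an intercept column); y is the response.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # add the intercept column
    msep = []
    for _ in range(trials):
        order = rng.permutation(n)
        test, train = order[:test_size], order[test_size:]
        # Re-fit the coefficients on the training rows only.
        coef, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
        errors = y[test] - design[test] @ coef
        msep.append(np.mean(errors ** 2))
    return float(np.mean(msep))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(100, 1))
    y = 1.0 + 2.0 * x[:, 0] + rng.normal(scale=0.1, size=100)
    print(cross_validate(x, y, test_size=25, trials=500, seed=0))
```

Running the same function for each of the ten candidate models, with the same T and R, gives the average MSEP values that the table compares.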
Cross-validation is a widespread, useful technique for examining the predictive power of
models, i.e., their ability to make predictions for data they have not seen before. For users wishing
to emphasize the predictive ability of a potential model, cross-validation allows them to evaluate
which candidate model consistently makes the best predictions (i.e., has the lowest MSEP). Note
that the PRESS statistic Virtual Beach 2.2 provides as a model evaluation criterion is a cross-
validation statistic with T set to 1. The PRESS algorithm removes one observation at a time from
the dataset, re-fits the model regression coefficients, and then calculates the squared residual for the
removed observation. It does this once for every observation in the dataset to compute the model's
PRESS value, a limited look at a model's predictive potential.
Recommended settings are a testing sample of approximately 25% of the total number of
observations and 500-1000 trials.
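The leave-one-out loop behind PRESS can be sketched as follows. The function returns the raw sum of squared leave-one-out residuals described in this paragraph; the normalization given with the PRESS formula in Section 7.2 is not applied here. The function name press and the example data are hypothetical, and NumPy is assumed.

```python
import numpy as np


def press(X: np.ndarray, y: np.ndarray) -> float:
    """Sum of squared residuals, each computed with that observation left out."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # add the intercept column
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i               # drop observation i
        coef, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        total += float(y[i] - design[i] @ coef) ** 2
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 2))
    y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=30)
    print(press(X, y))
```

For ordinary least squares the same quantity can be obtained without refitting, from the ordinary residuals and leverages, as the sum of (e_i / (1 - h_ii))^2; the explicit loop is shown because it mirrors the description in the text.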
[Screenshot: cross-validation dialog. Total Number of Observations: 225; Number of Observations
Used for Testing: 40; Number of Trials field; results table with columns Fitness, MSEP, and
Ind Var 1 through Ind Var 7 for each best-fit model.]
Figure 37. The cross-validation results for each of the 10 best-fit models
7.9 Report Generation
A text report of modeling results can be generated, copied to the system clipboard, or saved
to a text file using the "View Report" button. Users can view the report within VB 2.2 by selecting
the desired models and clicking on "Generate Report for Selected Models." The report contains
descriptive statistics for each model variable and model evaluation statistic. Any number of best-
fit models can be selected for reporting.
A recommended approach to saving the information in an external application is to copy
the report to the clipboard (with the "CopytoClipboard" button) and paste it into a rich-text
application like MS Word, Write or WordPad. NotePad or other text editors will work, but column
formats will likely be lost and make the report difficult to interpret.
[Screenshot: "MLR Model Building Report - Best Fits" window with a model selection list and a text
report giving the project details, the number of observations (225), the model evaluation
criterion (Akaike), the fitted model equation for logEcoli, the model evaluation score, and all
evaluation metrics.]
Figure 38. A text report generated on the modeling results
Comparative bar graphs can be displayed to view evaluation criteria for all top models.
Click on "View Evaluation Graphs" to see these plots. Hovering the mouse over any plot displays
the relevant evaluation criterion, and hovering over any bar displays the associated model. Note
that the evaluation criteria graphs are scaled to emphasize differences between the model scores,
although the differences may, in fact, be quite small. With the cursor over any graph, right-click
and select "Set Scale to Default" to view the un-scaled graph.
-------
[Screenshot: "Model Evaluation Criteria" window with a bar graph of Adjusted R2 for each best-fit
model and the equation of the model under the cursor.]
Figure 39. Plots of the various model evaluation metrics for the 10 best-fit models
[Screenshot: two "Model Evaluation Criteria" windows showing R2 bar graphs by best-fit model
number, one scaled and one un-scaled.]
Figure 40. Scaled versus un-scaled views of selected model evaluation criterion
-------
8.0 Residual Analysis
Once a model is selected in the "Best Fits" window on the Modeling tab, the "Residuals"
and "MLR Prediction" tabs appear at the top of the interface. Users may click "Residuals" to view
information about residuals of the selected model, but this is not mandatory; they may take the
selected model immediately to prediction mode by clicking on "MLR Prediction." There are four
subtabs on the Residuals tab: Predicted vs Residuals, Observed vs Predicted, DFFITS, and Cook's
Distance.
[Screenshot: Residuals tab with the Variable Statistics table for the selected model and the
Predicted vs Residuals subtab. A.D. Normality Statistic = 0.5732; A.D. Statistic P-value = 0.1364;
plot of Predictions vs Studentized Residuals.]
Figure 41. Information available on the Residuals tab, including a plot of Studentized residuals
versus predictions, the Anderson-Darling residual normality test, and regression statistics
The Predicted vs Residuals subtab shows a graph of the Studentized residuals versus their
predicted model values. The Anderson-Darling Normality Statistic
(http://en.wikipedia.org/wiki/Anderson-Darling) is shown with its significance (p-value). Linear
regression assumes normally-distributed residuals, so if this A-D normality test fails (the A-D
p-value is less than 0.05), the user should 1) transform the response variable, 2) transform some
of the IVs, or 3) consider deleting offending high-leverage observations, which can be done on this tab.
[Screenshot: Predicted vs Residuals subtab. A.D. Normality Statistic = 1.1610; A.D. Statistic
P-value = 0.0043; scatter plot of Studentized Residuals (y-axis) versus Predictions (x-axis).]
Figure 42. Plot of studentized predictions vs. residuals and the A-D test of normality
On the DFFITS and Cook's Distance subtabs, observations are listed in a grid at the left, sorted
by the absolute value of the respective measure, largest first. A plot of the DFFITS/Cook's
Distances for each record (observation) versus the Record ID is shown at the right. Data points
with very large DFFITS/Cook's Distances (i.e., those that lie outside the horizontal red boundaries
on the graph) distort the fitted values and the standard errors of the regression coefficients.
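For readers unfamiliar with these two measures, the following sketch computes DFFITS and Cook's
Distance for a regression with a single IV, using the standard textbook formulas. It is a
simplified illustration under that one-IV assumption, not the code VB 2.2 uses, and the function
name is hypothetical.

```python
import math

def influence_measures(x, y):
    """Return (dffits, cooks_d) lists for the least-squares line y = a + b*x."""
    n = len(x)
    k = 2                                     # fitted parameters: intercept and slope
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]     # leverage h_i
    sse = sum(e * e for e in resid)
    mse = sse / (n - k)
    dffits, cooks = [], []
    for e, h in zip(resid, lev):
        # Error variance with this observation left out (guarded against round-off).
        mse_del = max((sse - e * e / (1 - h)) / (n - k - 1), 1e-12)
        t = e / math.sqrt(mse_del * (1 - h))                # externally Studentized residual
        dffits.append(t * math.sqrt(h / (1 - h)))
        cooks.append(e * e * h / (k * mse * (1 - h) ** 2))
    return dffits, cooks
```

Sorting the records by the absolute value of either list, largest first, gives the ordering
shown in the grid on these subtabs.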
-------
Figure 43. A table and plot of the DFFITS scores for the residuals
Clicking the Iterative Rebuild "Go" button removes the observation with the largest
absolute value DFFITS/Cook's Distance, re-fits the regression, and calculates new DFFITS/Cook's
Distances for the remaining observations. This model is named "Rebuild1," and it is added to the
"Models" window at the top left of the screen. Clicking the Iterative Rebuild "Go" button again
produces a model called "Rebuild2," which is calculated after removing the observation
with the largest absolute value DFFITS/Cook's Distance remaining in the dataset (often, though not
necessarily, the observation with the 2nd largest absolute value in the original dataset, because
the measures are recalculated after each removal). The user can continue to click "Go" and remove
observations with the largest remaining DFFITS/Cook's Distances, thus creating "Rebuild3,"
"Rebuild4," "Rebuild5," etc. VB will not allow a user to delete any observations if 10 or fewer
observations remain in the dataset.
Whenever a "rebuild" is created by pressing "Go," the information displayed on the
Residuals tab (variable and model statistics, Observed vs Predicted plot, Predicted vs Residuals
plot, DFFITS values, etc.) is automatically updated to reflect this new model (even if another
model is highlighted in the "Models" window). However, the user can select any model in the
"Models" window to view its associated data and plots.
The user has complete freedom to carry out the outlier removal process while toggling
back and forth between the DFFITS and Cook's Distance subtabs. For example, the first removal
can be based on a DFFITS value, the next removal can be based on a Cook's Distance, the next
two removals can be based on DFFITS, etc. To clear the "Models" window for whatever reason,
click the "Clear" button.
As an alternative to Iterative Rebuild, Auto Rebuild offers two choices, both of which
remove all observations above some threshold. The "iterative threshold" choice
bases removals on a threshold that is updated every time an observation is deleted. For DFFITS,
this threshold is 2*(p/n)^0.5 (shown in the interface as 2*SQR(p/n)), where p is the number of IVs
in the model and n is the current number of observations in the dataset. For Cook's Distance, the
threshold is 4/n.
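The two thresholds can be written as short functions. This is only a restatement of the formulas
above; the function names are hypothetical.

```python
import math

def dffits_threshold(p, n):
    """Iterative DFFITS cutoff: p = number of IVs, n = current number of observations."""
    return 2 * math.sqrt(p / n)

def cooks_threshold(n):
    """Iterative Cook's Distance cutoff for n current observations."""
    return 4 / n

# Example: a model with 4 IVs fitted to 100 observations.
print(round(dffits_threshold(4, 100), 4))   # 0.4
print(round(cooks_threshold(100), 4))       # 0.04
```

Because n appears in the denominator, both cutoffs rise slightly each time an observation is
removed, which is why the threshold must be recalculated after every deletion.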
-------
Figure 44. DFFITS/Cook's Distance controls for removing highly influential data points
In the "iterative threshold" process, step one is to check if any DFFITS/Cook's Distances
are above the threshold; if so, VB removes the observation with the largest absolute value DFFITS/
Cook's Distance and then recalculates the regression model, the DFFITS/Cook's Distances, and the
threshold (because n has been reduced by 1). VB then checks to see if any of these new DFFITS/
Cook's Distances are above the new threshold. If so, the process repeats. VB will continue until
no DFFITS/Cook's Distances remain that exceed the current threshold, or until half of the dataset
has been removed, whichever comes first. For example, if a dataset has 100 observations, VB will
allow 50 to be removed before it breaks out of the Auto Rebuild removal loop. At that point the
user can click the Auto Rebuild "Go" button again to potentially remove another 25 observations
of the remaining 50. We note that, in practice, one should not remove more than 5-10% of the
original dataset as outliers; the need to remove more indicates a poor MLR fit and warrants a
different analytical technique.
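The control flow of this removal loop can be sketched as follows. The influence calculation is
passed in as a function so the sketch stays short; in VB that role is played by the DFFITS or
Cook's Distance values from the re-fitted model. The names auto_rebuild, score_fn, and toy_scores
are hypothetical, the DFFITS form of the iterative threshold is assumed, and toy_scores is only a
stand-in for a real influence measure.

```python
import math

def auto_rebuild(records, score_fn, p, constant_threshold=None):
    """Drop the highest-scoring record until none exceeds the threshold."""
    active = list(records)
    removed = []
    max_removals = len(active) // 2            # stop once half the data is gone
    while len(removed) < max_removals:
        scores = score_fn(active)              # recomputed after every removal
        if constant_threshold is None:
            threshold = 2 * math.sqrt(p / len(active))   # iterative DFFITS cutoff
        else:
            threshold = constant_threshold     # user-supplied static cutoff
        worst = max(range(len(active)), key=lambda i: abs(scores[i]))
        if abs(scores[worst]) <= threshold:
            break                              # nothing left above the threshold
        removed.append(active.pop(worst))
    return active, removed

def toy_scores(values):
    """Stand-in influence score (NOT real DFFITS): scaled distance from the mean."""
    m = sum(values) / len(values)
    return [(v - m) / 10 for v in values]

kept, dropped = auto_rebuild([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], toy_scores, p=1)
print(dropped)   # [100]
```

Passing a value for constant_threshold gives the "constant threshold" behavior, in which the
cutoff stays fixed while the scores are still recalculated after every removal.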
Using the "constant threshold" Auto Rebuild option differs from the "iterative threshold"
only in that the threshold remains static (i.e., the value the user types into the input box) regardless
of how many observations are deleted. Updated DFFITS/Cook's Distances are still calculated after
every removal event. VB will also stop this process if half the number of starting observations
has been deleted. There is an upper limit to the number that can be entered into the "constant
threshold" input box (DFFITS = 3, Cook's Distance = 16/n).
Upon completion of the Auto Rebuild process, multiple models may have been added to
the "Models" window. For example, if 10 observations were removed, then "Rebuild1" through
"Rebuild10" will appear in the "Models" window.
If a user has interest in both DFFITS and Cook's Distances as outlier metrics, we suggest
one of the following methods:
1) To see if the two criteria would produce different results:
Apply DFFITS removal to your model of choice. Note the results and then clear the Residuals
tab using the "Clear" button. Next perform a removal process based on Cook's Distance and
compare the results.
-------
2) To filter out observations that offend either DFFITS or Cook's Distance criteria:
Run DFFITS removal on the model (i.e., remove all observations above your specified DFFITS
threshold), then click the Cook's Distance subtab and perform additional outlier removal based
on its threshold. After this process, remaining observations are "OK" from the perspective of
both metrics.
Note that the highlighted model in the "Models" box is used if the "MLR Prediction" tab is
clicked, not necessarily the model whose information is displayed on the Residuals tab. Also note
that any observations removed from the "Residuals" tab are not removed from the primary dataset
shown on the "Data Processing" tab.
Viewing the Data Table
From the DFFITS or Cook's Distance subtabs, users can click on "View Data Table" to
display a history of the observation removal process for the model highlighted in the "Models"
box. From this window, users may export the dataset for external use or re-importation into VB 2.2.
Figure 45. "View Data Table" window for examining the dataset after removal of influential data
points
The "Observed vs Predicted" subtab is the same as that in Section 7.6. There are two plots
and two tables to examine, along with controls to modify the Decision Criterion (blue horizontal
line) and Regulatory Standard (green vertical line), so that users can judge the effects of these
changes on model outcomes (false positives, false negatives, sensitivity, specificity, etc.).
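The model outcomes listed above can be computed from paired predictions and observations as in
the sketch below. It assumes that an exceedance means a value strictly greater than the
corresponding threshold; whether VB 2.2 uses ">" or ">=" is not stated in this guide. The function
name is hypothetical.

```python
def evaluate_model(predictions, observations, decision_criterion, regulatory_standard):
    """Count errors and compute sensitivity, specificity, and accuracy."""
    tp = fp = tn = fn = 0
    for pred, obs in zip(predictions, observations):
        predicted_exceedance = pred > decision_criterion    # assumed strict ">"
        observed_exceedance = obs > regulatory_standard
        if predicted_exceedance and observed_exceedance:
            tp += 1
        elif predicted_exceedance:
            fp += 1          # false positive (Type I)
        elif observed_exceedance:
            fn += 1          # false negative (Type II)
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "false_positives": fp,
        "false_negatives": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
        "accuracy": (tp + tn) / total if total else None,
    }
```

Raising the Decision Criterion generally trades false positives for false negatives, which is the
effect the threshold controls on this subtab let the user explore.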
-------
Figure 46. Observed vs. Predicted plot on the Residuals tab with model evaluation threshold
control and model evaluation statistics
-------
Figure 47. Residuals interface showing a list of rebuilt models resulting from observation
deletions, and the associated statistics and residual plots for these rebuilds
-------
9.0 Prediction
The MLR Prediction interface allows users to estimate or predict FIB concentrations with
a selected regression model. Whether a user was previously on the Modeling tab (with a model
selected in "Best Fits") or on the Residuals tab (with a model selected in "Models"), the interface
of the MLR Prediction tab will look the same.
9.1 Model Statement
At the top is the linear expression for the chosen model, with values of the regression
coefficients and names of each IV in the model (Figure 48).
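A model statement of this kind is simply an intercept plus a sum of coefficient-times-IV terms.
The sketch below evaluates one such expression; the coefficients were transcribed from the example
screenshot for Figure 48 and may contain transcription errors, and the names INTERCEPT,
COEFFICIENTS, and predict are hypothetical.

```python
# Example model: LogCFU = intercept + sum(coefficient * IV value).
# Coefficient values are transcribed from the Figure 48 screenshot (assumed accurate).
INTERCEPT = 1.8228075
COEFFICIENTS = {
    "uv": -0.00067864774,
    "waveheight": 1.6810716,
    "WindDirection": -0.0030005423,
}

def predict(row):
    """Evaluate the linear model for one row of IV data (a dict keyed by IV name)."""
    return INTERCEPT + sum(coef * row[name] for name, coef in COEFFICIENTS.items())

print(round(predict({"uv": 360, "waveheight": 0.15, "WindDirection": 0}), 3))    # 1.831
print(round(predict({"uv": 1403, "waveheight": 0.2, "WindDirection": 10}), 3))   # 1.177
```

The result is in the units of the transformed response (here, a log-scale value), so it must be
compared with thresholds expressed on the same scale.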
9.2 Model Evaluation Thresholds
There are input boxes for the Decision Criterion (DC) and Regulatory Standard (RS).
Setting these allows model predictions to be evaluated and model specificity, sensitivity, and
accuracy to be calculated. When users first arrive at the Prediction tab, values of the DC and
RS will be set to what was on the Modeling tab. The "Threshold Transform" button tells VB
2.2 how to transform the DC and RS for comparison to model predictions and observations. If a
transformation definition was set for the response variable during data processing (either manually
by the user or automatically by transforming the response), that definition will be set here as
the default. Users should be aware that changing the threshold transform definition can cause
problems when comparing model predictions to observations, so it should be changed with caution.
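As a sketch of what the threshold transform does, the function below converts a threshold entered
in original units to the scale of the model's response. The option names mirror the choices
described in this guide (none, log10, ln, power); the function itself is hypothetical.

```python
import math

def transform_threshold(value, transform="none", exponent=1.0):
    """Put a Decision Criterion or Regulatory Standard on the response scale."""
    if transform == "none":
        return value
    if transform == "log10":
        return math.log10(value)
    if transform == "ln":
        return math.log(value)
    if transform == "power":
        return value ** exponent
    raise ValueError(f"unknown transform: {transform}")

# A threshold of 235 compared against a log10-transformed response:
print(round(transform_threshold(235, "log10"), 4))   # 2.3711
```

If the response was modeled as log10 values but the threshold transform is left at "None," a
prediction near 2.4 would be compared with 235 and would never register as an exceedance.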
-------
Figure 48. The MLR Prediction interface
9.3 Prediction Form
Most of the prediction form is in three separate data panels: the left panel holds IV data;
the middle panel is for observational data, e.g., lab results of FIB concentrations; and the right
panel shows model predictions and evaluation metrics. Each panel also contains a column for a
unique ID for each row of data (e.g., the date that data were collected). The panels have separate
horizontal and vertical scroll bars that become visible if the number of rows or columns exceeds
the viewable area. The three panels scroll independently horizontally, but scroll as a group
vertically. Panels can be re-sized by clicking and dragging the blue vertical partitions. The
order of the columns in the left and right panels can be changed by clicking and dragging the
column headers left or right.
Users can import IV and observational data from a file using "Import IVs" and "Import
Obs" buttons in the "Prediction Form" button bank located in the middle right of the screen, or
users can type data into the input grids. Either way, they should be certain that the entered IV data
are in the same units as those used to build the model.
Depending on which model was selected for prediction, the IV panel will have one column
for every unique IV that appears in the model, plus a column for the row's unique ID. When a data
file is imported with the "Import IVs" button, a "Column Mapper" window opens. This window
allows users to tell VB 2.2 which columns in the imported datasheet should be used for the row
IDs and each IV found in the model. By default, the first column of the imported file maps to the
ID field, but users can choose another column if needed. If a column in the imported spreadsheet
has an identical name to an IV in the model, that column will be automatically selected by VB 2.2
as the appropriate one for that IV.
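The default mapping rule described above can be sketched as follows. The function name and the
use of None to mark an IV that still needs a manual choice are assumptions made for illustration.

```python
def default_column_map(model_ivs, imported_columns):
    """Propose a mapping from the ID field and each model IV to an imported column."""
    # The first imported column maps to the ID field by default.
    mapping = {"ID": imported_columns[0] if imported_columns else None}
    for iv in model_ivs:
        # Only identically named columns are matched automatically.
        mapping[iv] = iv if iv in imported_columns else None
    return mapping

proposed = default_column_map(
    ["uv", "waveheight", "WindDirection"],
    ["stamp", "uv", "waveheight", "winddir"],
)
print(proposed)
# {'ID': 'stamp', 'uv': 'uv', 'waveheight': 'waveheight', 'WindDirection': None}
```

In this example "winddir" is not matched to "WindDirection" because the names differ, so the user
would select that column by hand in the Column Mapper.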
Figure 49. Importation of IV data using the "Column Mapper" window
As with IV data, observational data can be typed into the middle panel or imported
using "Import Obs." For observational data, only two columns are needed: row IDs for
every observation and the actual observations. A "Column Mapper" window appears when
observational data are imported from a file. After they have been imported or manually entered,
users can specify the scale/transformation of the observations for a proper comparison to model
predictions. This is done by right-clicking on the "Observation" column header and defining the
transformation: none, log10, loge, or a power transformation. "None" is the default choice. For
example, if log10 observations are imported, the user would need to change the right-click menu
choice to "Log10."
Figure 50. Importation of observational data using the "Column Mapper" window
-------
The "Make Predictions" button remains disabled until the IV data (imported from a file or
manually typed) are validated using the "IV Data Validation" button. This scan ensures there are
no blank cells or non-numeric data in the IV columns of the IV data panel and checks that every
row ID is unique (non-numeric data are allowed for the ID column). This validation scan window
is very similar to the validation scan window on the Data Processing tab; however, "Delete
Column" is not a choice. "Replace With" and "Delete Row" are the only ways to deal with
problems in the IV data grid.
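The checks performed by this scan can be sketched as below: blank cells, non-numeric IV values,
and duplicate row IDs. The function name, the row format (one dict per row), and the problem
labels are assumptions made for illustration.

```python
def validate_iv_rows(rows, iv_names, id_name="ID"):
    """Return a list of (row_index, column, problem) tuples; empty if the data pass."""
    problems = []
    seen_ids = set()
    for index, row in enumerate(rows):
        row_id = row.get(id_name)
        if row_id in seen_ids:
            problems.append((index, id_name, "duplicate ID"))
        seen_ids.add(row_id)                 # IDs may be non-numeric
        for name in iv_names:
            cell = row.get(name)
            if cell is None or str(cell).strip() == "":
                problems.append((index, name, "blank"))
                continue
            try:
                float(cell)                  # IV cells must be numeric
            except (TypeError, ValueError):
                problems.append((index, name, "non-numeric"))
    return problems

rows = [
    {"ID": "a", "uv": "360", "waveheight": "0.15"},
    {"ID": "a", "uv": "", "waveheight": "high"},
]
print(validate_iv_rows(rows, ["uv", "waveheight"]))
# [(1, 'ID', 'duplicate ID'), (1, 'uv', 'blank'), (1, 'waveheight', 'non-numeric')]
```

Each reported problem would then be resolved with the equivalent of "Replace With" or "Delete Row"
before predictions are made.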
Figure 51. The IV validation window on the MLR Prediction tab
Once IV data have been validated, clicking the "Make Predictions" button will generate
model predictions. Observational data need not be present to make predictions, but observations
are needed for model evaluation (sensitivity, specificity, false negatives, false positives, etc.). After
clicking "Make Predictions", VB 2.2 uses the model, IV data, and observational data to fill the
right panel with the following data columns: ID, Model Prediction, Decision Criterion, Regulatory
Standard, Exceedance Probability, and Error Type.
-------
Figure 52. A prediction grid after IVs and observational data have been imported, and model
predictions have been made
The ID column of the model output panel is taken directly from the ID column of the IV
panel, not the observation panel. The "Make Predictions" button makes one model prediction
per row in the IV data panel, regardless of how many observations are entered in the observation
panel.
The Model Prediction column contains predicted values of the response variable. Right-
clicking on this column header allows the user to change how the predictions are displayed in the
table (as linear, log, or power units). The Decision Criterion and Regulatory Standard are values
set by the user (shown in the left panel as transformed by the choice of "Threshold Transform").
The Exceedance Probability (actually the probability x 100) is defined as the probability that
the model prediction will be larger than the Decision Criterion, based on uncertainty bounds
(confidence intervals) around the model predictions.
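One way to picture an exceedance probability is sketched below. It assumes the uncertainty around
a prediction is normally distributed with a known standard error; this guide does not specify the
distribution or interval VB 2.2 actually uses, so the function (whose name is hypothetical) is an
illustration of the concept rather than a reproduction of VB's calculation.

```python
from statistics import NormalDist

def exceedance_probability(prediction, decision_criterion, std_error):
    """Percent chance that the true value exceeds the Decision Criterion.

    Assumes normally distributed prediction uncertainty (an assumption, not
    confirmed for VB 2.2). All three arguments must be on the same scale.
    """
    uncertainty = NormalDist(mu=prediction, sigma=std_error)
    return 100 * (1 - uncertainty.cdf(decision_criterion))

# A prediction exactly at the Decision Criterion gives an even chance.
print(round(exceedance_probability(2.371, 2.371, 0.5), 1))   # 50.0
```

Predictions well below the Decision Criterion yield values near 0, and predictions well above it
yield values near 100.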
To compare model predictions to observations, VB 2.2 looks at the prediction ID and
attempts to find an observation in the observation panel with that same ID. VB 2.2 does not
require unique IDs for each row in the observation panel, but note that a model prediction is
compared to the first observation found with the same ID. If the comparison reveals an error
(false exceedance or false non-exceedance), it appears in the "Error Type" column.
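The ID matching and error labeling described above can be sketched as follows. The function name,
the (ID, value) pair format, and the strict ">" comparison are assumptions; the first-match rule
for duplicate observation IDs follows the text.

```python
def error_types(predictions, observations, decision_criterion, regulatory_standard):
    """Label each prediction by comparing it with the first observation sharing its ID.

    predictions and observations are lists of (row_id, value) pairs on the same scale.
    Returns (row_id, label) pairs; the label is None when no observation matches.
    """
    first_obs = {}
    for row_id, value in observations:
        first_obs.setdefault(row_id, value)      # keep only the first match per ID
    results = []
    for row_id, pred in predictions:
        if row_id not in first_obs:
            results.append((row_id, None))       # nothing to compare against
            continue
        predicted = pred > decision_criterion    # assumed strict ">"
        observed = first_obs[row_id] > regulatory_standard
        if predicted and not observed:
            label = "False Exceedance"
        elif observed and not predicted:
            label = "False Non-Exceedance"
        else:
            label = ""                           # prediction and observation agree
        results.append((row_id, label))
    return results
```

For example, with both thresholds at 2.371, a prediction of 2.5 paired with observations 2.0 and
3.0 under the same ID is labeled a false exceedance, because only the first observation (2.0) is
used.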
-------
It is important to note that accurately assessing model output depends on synchronized
transformation information regarding the Decision Criterion, Regulatory Standard, model
predictions, and observations. Users must be careful to ensure each value is in a comparable unit.
9.4 Viewing Plots
After predictions have been made, a scatterplot of observations versus predictions can be
viewed by clicking "Plot" in the "Prediction Grid" button bank. If no observational data were
entered, a message asking for observational data appears. The features and functionality of the
form that appears when the "Plot" button is clicked are described in Section 7.6. The data are
based on comparing model predictions (right pane of the Prediction Form) with observations
(middle pane) that share the same, unique ID.
Figure 53. Prediction interface plotting of the observations versus predictions, with model
evaluation threshold controls
9.5 Prediction Form Manipulation
Two other buttons are found in the "Prediction Grid" button bank. If a user wants to view
the table in a spreadsheet or word processing program, "Export as CSV" saves the contents of the
entire table (three panels) in .csv format. "Clear" deletes all information in the predictive
table. As with most of the tabular information in VB 2.2, data in individual panels can be
selected with a left click and drag. Control-C and Control-V can then be used to copy and paste
the data into another application such as WordPad or Excel.
-------
10.0 Future Enhancements
VB 2.2 is a Windows application and undergoes continuous improvement and functional
expansion. In version 3.0, slated for release in 2012, project management enhancements will
allow site-based seasonal prediction and model assessment. The map interface will give users
access to site-specific data such as water quality, water flow gauge readings, and
weather data. Model-building functionality will grow beyond MLR to include Gradient Boosting
Machines (Decision Trees), Binary Logistic Regression, Partial Least Squares regression, and
Neural Networks.
-------
11.0 User Feedback
Opinions and experiences from the user community are welcomed by the Virtual Beach
design/development team. Users are encouraged to report problems, issues and likes/dislikes to:
Mike Cyterski - 706 355-8142 (cyterski.mike@epa.gov)
Mike Galvin - 706 355-8318 (galvin.mike@epa.gov)
Rajbir Parmar - 706 355-8306 (parmar.rajbir@epa.gov)
Kurt Wolfe - 706 355-8311 (wolfe.kurt@epa.gov)
-------
12.0 Acknowledgments
We would like to thank the following people, who generously donated their time and expertise for
software testing and review of this document:
Adam Mednick, Wisconsin DNR
David Rockwell, NOAA
Fran Rauschenberg, USEPA
Wesley Brooks, USGS
Mike Fienen, USGS
Donna Francy, USGS
Richard Zepp, USEPA
Steve Corsi, USGS
-------
-------
-------
United States
Environmental Protection
Agency
Office of Research and Development (8101R)
Washington, DC 20460