United States
Environmental Protection
Agency
National Risk Management
Research Laboratory
Ada, OK 74820
Research and Development
EPA/600/SR-97/007
February 1997
EPA Project Summary
Ground-Water Model Testing:
Systematic Evaluation and
Testing of Code Functionality and
Performance
Paul K. M. van der Heijde and D. A. Kanzer
Effective use of ground-water simulation codes as management decision tools requires the establishment of their functionality, performance characteristics, and applicability to the problems at hand. This is accomplished through a systematic code-testing protocol and code-selection strategy. The protocol contains two main elements: functionality analysis and performance evaluation. Functionality analysis is the description and measurement of the capabilities of a simulation code; performance evaluation concerns the appraisal of a code's operational characteristics (e.g., computational accuracy and efficiency, sensitivity to problem design and parameter selection, and reproducibility).
Testing of ground-water simulation codes may take the form of (1) benchmarking with known, independently derived analytical solutions; (2) intracomparison using different code functions representing the same system responses; (3) intercomparison with comparable simulation codes; or (4) comparison with field or laboratory experiments. The results of the various tests
are analyzed using standardized statistical and graphical techniques to identify performance strengths and weaknesses of the code and testing procedures. The protocol is demonstrated and evaluated using a three-dimensional finite-difference flow and solute transport simulation code, FTWORK.
Introduction
Ground-water modeling has become an
important methodology in support of the
planning and decision-making processes
involved in ground-water resources
development, ground-water protection, and
aquifer restoration. In ground-water
modeling, it is crucial that the code's credibility be established and its suitability determined. This is accomplished through systematic evaluation of the code's correctness, performance, sensitivity to input uncertainty, and applicability to typical field problems. Such a systematic approach is referred to as a "code-testing and evaluation protocol."
Without subjecting a ground-water
simulation code to such systematic testing
and evaluation, results obtained with the
code may suffer from low levels of
confidence. Acceptance of a modeling code depends not only on a series of successful tests, but also on a history of successful applications to a variety of site conditions and management problems.
A review of the existing literature indicates that previous code-testing studies (1) have not systematically addressed code features or provided insight into the completeness and effectiveness of the testing performed, and (2) are inconsistent and incomplete in documenting the code's functions and features. A new code-testing protocol, known as the functionality analysis, performance evaluation and applicability assessment protocol (van der Heijde et al., 1993), is presented to address these deficiencies.
The report begins with a review of the existing code-testing literature. A comprehensive code-testing protocol is then formulated, and testing strategies are presented using various graphical and statistical tools. The protocol is demonstrated using a numerical code, FTWORK (Faust et al., 1990), which is designed to simulate three-dimensional flow and solute transport in the saturated zone of the subsurface.
Code Testing
A systematic approach to code testing
combines elements of error-detection,
evaluation of the operational characteristics
of the code, and assessment of its suitability
to solve certain types of management
problems, with well-designed test problems,
relevant test data sets, and informative
performance measures.
The code-testing protocol described in
the report is applied in a step-wise fashion
(Table 1). First, the code is analyzed with
respect to its simulation functions and
operational characteristics. Potential code
performance issues are identified, based
on analysis of simulated processes,
mathematical solution methods, computer
limitations and execution environment. This
is followed by the formulation of a test strategy, consisting of the design or selection of relevant test problems. The set of test problems is chosen such that all code functions and features of concern are addressed. Results of the testing are documented in tables and matrices providing an overview of the completeness of the testing, in various types of informative graphs, and with a set of statistical measures. The actual testing may take the form of:
1) benchmarking using known, independently derived analytical solutions;
2) intracomparison using different code functions representing the same system responses;
3) intercomparison with comparable simulation codes; or
4) comparison with field or laboratory experiments.
It is important that each test be documented
with respect to test objectives, model setup
for both the tested code and the benchmark,
if applicable (structure, discretization,
parameters), and results for each test (for
both the tested code and the benchmark).
Functionality of a ground-water modeling
code is defined as the set of functions and
features which the code offers the user in
terms of model framework geometry,
simulated processes, boundary conditions,
and analytical and operational capacities.
The code's functionality must be defined in
sufficient detail for potential users to assess
the code's utility, as well as to enable the
code developer to design a meaningful code-
testing strategy. Functionality analysis
involves the identification and description
of the code's functions, and the subsequent
evaluation of each code function or group of
functions for conceptual correctness and
error-free operation. The information generated by functionality analysis is organized into a summary structure, or matrix, that brings together the description of code functionality, code-evaluation status, and appropriate test problems. The functionality matrix is formulated by combining a complete description of the code functions and features with the objectives of the test cases. The functionality matrix illustrates the extent of the functionality analysis.
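As an illustration only (the function names and test identifiers below are hypothetical, not taken from the report), a functionality matrix can be sketched in a few lines of Python, with rows for code functions, columns for test problems, and a check of which functions remain untested:

```python
# Minimal sketch of a functionality matrix: rows are code functions,
# columns are test problems, and each cell records whether a test
# addresses that function. Function names and test IDs are hypothetical.

functions = ["confined flow", "unconfined flow", "areal recharge",
             "linear retardation", "first-order decay"]
tests = {
    "T1": {"confined flow"},                           # e.g., 1-D analytical flow benchmark
    "T2": {"unconfined flow", "areal recharge"},       # e.g., unconfined pumping test
    "T3": {"linear retardation", "first-order decay"}  # e.g., 1-D transport benchmark
}

# Print the matrix with an 'x' where a test exercises a function.
print(" " * 22 + "".join(t.center(4) for t in tests))
for f in functions:
    row = "".join(("x" if f in covered else "-").center(4) for covered in tests.values())
    print(f"{f:<22}{row}")

# Functions not covered by any test reveal gaps in the testing strategy.
untested = [f for f in functions if not any(f in c for c in tests.values())]
print("untested functions:", untested or "none")
```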
Performance evaluation is aimed at characterizing the operational characteristics of the code in terms of:
1) computational accuracy (e.g., in comparison with a benchmark);
2) reliability (e.g., reproducibility of results, convergence and stability of solution algorithms, and absence of terminal failures);
3) sensitivity to grid orientation and resolution, time discretization, and model parameters;
4) efficiency of coded algorithms (in terms
of numerical accuracy versus code
execution time, and memory and mass
storage requirements); and
5) resources required for model setup and
analysis (e.g., input preparation time,
effort needed for graphical
representation of simulation output).
Results of the performance evaluation are
reported both quantitatively and qualitatively
in checklists and in tabular form. Reporting
on performance evaluation should provide potential users with information on the
performance as a function of problem
complexity and setup, selection of simulation
control parameters, and spatial and temporal
discretization. The functionality matrix and
performance tables, together with the
supporting test results and comments,
should provide the information needed to
select a code for a site-specific application
and to evaluate the appropriateness of a
code used at a particular site.
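As a hedged illustration of how such performance results might be tabulated, the following Python sketch records, for each test problem, a few of the quantitative indicators listed above; the test names and all values are invented placeholders, not results from the report:

```python
# Hypothetical sketch of a performance table: one row per test problem,
# recording quantitative indicators of the kind discussed above.
import csv

rows = [
    # test id, RMSE vs benchmark [m], solver iterations, run time [s], converged?
    ("T1", 0.003, 45, 1.2, True),
    ("T2", 0.012, 210, 8.7, True),
    ("T3", 0.160, 500, 32.4, False),   # flags a potential performance issue
]

with open("performance_table.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["test", "rmse_m", "iterations", "runtime_s", "converged"])
    writer.writerows(rows)
```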
Testing Strategy
Comprehensive testing of a code's
functionality and performance is
accomplished through a variety of test
methods. Determining the importance of
the tested functions and the ratio of tested
versus non-tested functions provides an
indication of the completeness of the testing.
Based on the analysis of functionality and
performance issues, a code-testing strategy
is developed. Such a code-testing strategy
should consist of:
1) formulation of test objectives (as related
to code functionality and performance
issues), and of test priorities (Table 2);
2) selection and/or design of test problems
and determination of type and extent of
testing for selected code functions;
3) determination of level of effort to be
spent on sensitivity analysis for each
test problem;
4) selection of the qualitative and
quantitative measures to be used in the
evaluation of the code's performance;
and
5) determination of the level of detail to be
included in the test report and the format
of reporting.
The test procedure includes three levels of
testing (van der Heijde and Elnawawy,
1992). At Level I, a code is tested for
correctness of coded algorithms, code logic
and programming errors by: (1) conducting
step-by-step numerical walk-throughs of the
complete code or through selected parts of
the code; (2) performing simple, conceptual
or intuitive tests aimed at specific code
functions; and (3) comparing with
independent, accurate benchmarks (e.g.,
analytical solutions). If the benchmark
computations themselves have been made
using a computer code, this computer code
should be, in turn, subjected to rigorous
testing by comparing computed results with
independently derived and published data.
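As an illustration of such a Level I benchmark (a sketch only, with hypothetical aquifer parameters and stand-in "simulated" values that would normally come from the tested code's output), computed drawdowns can be compared with the Theis analytical solution for transient confined flow to a well:

```python
# Sketch of a Level I benchmark: drawdown from the Theis analytical
# solution versus values computed by the code under test.
# Aquifer parameters and the "simulated" values are hypothetical.
import numpy as np
from scipy.special import exp1   # exponential integral E1(u) equals the Theis well function W(u)

Q = 500.0      # pumping rate [m^3/d]
T = 250.0      # transmissivity [m^2/d]
S = 1.0e-4     # storativity [-]
r = 50.0       # radial distance to observation point [m]
t = np.array([0.1, 0.5, 1.0, 5.0, 10.0])   # times [d]

u = r**2 * S / (4.0 * T * t)
s_analytical = Q / (4.0 * np.pi * T) * exp1(u)   # Theis drawdown [m]

# In a real test these would be read from the simulation output file.
s_simulated = s_analytical * (1.0 + 0.02 * np.random.default_rng(0).standard_normal(t.size))

deviation = s_simulated - s_analytical
print("max absolute deviation [m]:", np.abs(deviation).max())
```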
At Level II, a code is tested to: (1) evaluate
functions not addressed at Level I; and (2)
evaluate potentially problematic
combinations of functions. At this level, code-
testing is performed by intracomparison (i.e.,
comparison between runs with the same
code using different functions to represent
a particular feature), and intercomparison
(i.e., comparison between different codes
simulating the same problem). Typically,
synthetic data sets are used representing
hypothetical, often simplified ground-water
systems.
At Level III, a code (and its underlying
theoretical framework) is tested to determine
how well a model's theoretical foundation and computer implementation describe actual system behavior, and to demonstrate
a code's applicability to representative field
problems. At this level, testing is performed
by simulating a field or laboratory experiment
and comparing the calculated and
independently observed cause-and-effect
responses. Because measured values of
model input, system parameters and system
responses are samples of the real system,
they inherently incorporate measurement
errors, are subject to uncertainty, and may
suffer from interpretive bias. Therefore, this
type of testing always retains an element of
incompleteness and subjectivity.
The test strategy requires that Level I testing be conducted (often during code development) and, if successfully completed, be followed by Level II testing.
The code may gain further credibility and
user confidence by subjecting it to Level III
testing (i.e., field or laboratory testing).
Ideally, code testing should be performed for the full range of parameters and stresses the code is designed to simulate; in practice, this is often not feasible due to budget and
time constraints. Therefore, prospective
code users need to assess whether the
documented tests adequately address the
conditions expected in the target
application(s). If previous testing has not
been sufficient in this respect, additional
testing may be necessary.
Evaluation Measures
Evaluation of code-testing results should
be based on: (1) visual inspection of the
graphical representation of variables
computed with the numerical model and its
benchmark; and (2) quantitative measures
of the goodness-of-fit. Such quantitative measures, or evaluation or performance criteria, characterize the differences
between the results derived with the
simulation code and the benchmark, or
between the results obtained with two
comparable simulation codes.
Graphical measures are especially useful for obtaining a first, qualitative impression of test results, and for evaluating
test results that do not lend themselves to
statistical analysis. For example, graphical
representation of solution convergence
characteristics may indicate numerical
oscillations and instabilities in the iteration
process. Practical considerations may
prevent the use of all data-pairs in the
generation of graphical measures. Thus, a
subset of data-pairs may be selected for
use with graphical measures. Five types of graphical evaluation techniques are particularly well suited: (1) X-Y plots or line graphs of spatial or temporal behavior of variables; (2) one-dimensional column plots or histograms (for test deviations); (3) combined plots of line graphs and column plots of deviations; (4) contour and surface plots; and (5) three-dimensional column plots
or histograms. The conclusions from visual
inspection of graphic representations of
testing results may be described qualitatively
(and subjectively) by such attributes as
"poor," "reasonable," "good," and "very
good."
There are three general procedures,
coupled with standard linear regression
statistics and estimation of error statistics,
to provide quantitative goodness-of-fit
measures (Donigian and Rao, 1986): (1) paired-data performance - the comparison of simulated and observed data in time and space; (2) time- and space-integrated paired-data performance - the comparison of spatially and temporally integrated or averaged simulated and observed data; and (3) frequency-domain performance - the comparison of simulated and observed frequency distributions. The organization
and evaluation of code intercomparison
results can be cumbersome due to the
potentially large number of data-pairs
involved if every computational node is
included in the analysis. This can be
mitigated by analyzing smaller,
representative sub-samples of the full set of
model domain data-pairs. The
representativeness of the selected data-
pairs is often a subjective judgment. For
example, in simulating one-dimensional,
uniform flow, the data pairs should be located
at least on two lines parallel to the flow
direction, one in the center of the model
domain and one at the edge.
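As a sketch of such sub-sampling (with a hypothetical grid and placeholder head arrays), the data pairs along the center row and an edge row of the model domain, both parallel to the flow direction, could be extracted as follows:

```python
# Sketch of sub-sampling data pairs for a hypothetical one-dimensional,
# uniform-flow test on a rectangular grid: keep the rows of nodes along the
# center of the domain and along one edge, both parallel to the flow direction.
import numpy as np

nrow, ncol = 21, 41                                            # hypothetical grid dimensions
heads_sim = np.random.default_rng(1).random((nrow, ncol))      # placeholder simulated heads
heads_bench = heads_sim + 0.01                                 # placeholder benchmark heads

center, edge = nrow // 2, 0                                    # row indices parallel to flow
subset_sim = heads_sim[[center, edge], :]
subset_bench = heads_bench[[center, edge], :]
print("data pairs retained:", subset_sim.size, "of", heads_sim.size)
```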
Useful quantitative evaluation measures for code-testing include: (1) Mean Error (ME), defined as the mean difference (i.e., deviation) between the model-calculated values and the benchmark values; (2) Mean Absolute Error (MAE), defined as the average of the absolute values of the deviations; (3) Positive Mean Error (PME) and Negative Mean Error (NME), defined as the ME for the positive deviations and negative deviations, respectively; (4) Mean Error Ratio (MER), a composite measure indicating systematic overprediction or underprediction by the code; (5) Maximum Positive Error (MPE) and Maximum Negative Error (MNE), defined as the maximum positive and negative deviations, respectively, indicating potential inconsistencies or sensitive model behavior; and (6) Root Mean Squared Error (RMSE), defined as the square root of the average of the squared differences between the model-calculated values and their benchmark equivalents.
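These measures reduce to simple arithmetic on the deviations between paired values. The following Python sketch illustrates their computation for a small set of hypothetical data pairs; note that the exact formula for MER is not given in this summary, so the ratio used below is an assumption:

```python
# Minimal sketch of the quantitative evaluation measures applied to a set
# of (simulated, benchmark) data pairs. The data pairs are hypothetical;
# the MER formula (ratio of positive to negative mean error) is an assumption.
import numpy as np

simulated = np.array([10.2,  9.8, 11.1, 10.6,  9.5])
benchmark = np.array([10.0, 10.0, 11.0, 10.5,  9.8])
d = simulated - benchmark                              # deviations

ME   = d.mean()                                        # Mean Error
MAE  = np.abs(d).mean()                                # Mean Absolute Error
PME  = d[d > 0].mean() if (d > 0).any() else 0.0       # Positive Mean Error
NME  = d[d < 0].mean() if (d < 0).any() else 0.0       # Negative Mean Error
MER  = abs(PME / NME) if NME != 0 else np.inf          # assumed composite ratio
MPE  = d.max()                                         # Maximum Positive Error
MNE  = d.min()                                         # Maximum Negative Error
RMSE = np.sqrt((d**2).mean())                          # Root Mean Squared Error

for name, value in [("ME", ME), ("MAE", MAE), ("PME", PME), ("NME", NME),
                    ("MER", MER), ("MPE", MPE), ("MNE", MNE), ("RMSE", RMSE)]:
    print(f"{name:>4s} = {value:8.4f}")
```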
Various computed variables may be the
focus of graphic or statistical comparison,
including hydraulic heads (in space and
time), head gradients, global water balance,
internal and boundary fluxes, velocities
(direction and magnitude), flow path lines,
capture zones, travel times, location of
free surfaces and seepage surfaces,
concentrations, mass fluxes, and
breakthrough curves at observation points
and sinks (wells, streams).
Code-testing Protocol
Demonstration
The code-testing and evaluation protocol
is applied to a block-centered finite-
difference simulation code, FTWORK (Faust et al., 1990), which was designed to simulate
transient and steady-state three-
dimensional saturated ground-water flow
and transient transport of a single dissolved
component under confined and unconfined
conditions. To demonstrate the use of the
code-testing protocol, the following steps
have been taken, featuring the FTWORK
code: (1) identifying and examining code
functionality; (2) determining type and
objectives of tests performed and
documented by the code developers; (3)
evaluating the suitability of performed tests
for use in protocol demonstration; (4)
compiling protocol summary structure (i.e.,
checklists) using performed tests; (5)
designing and conducting new tests, based
on deficiencies in performed tests; and (6)
summarizing the combined results of tests
performed by code developers and tests
performed as part of the protocol
demonstration.
Most of the tests originally performed by
the developers were adapted, augmented,
and reanalyzed to ensure consistency with
the protocol. Additional tests were designed
and executed to evaluate capabilities and
characteristics of the FTWORK code not addressed in the FTWORK documentation.
Discussion and Conclusions
Historically, reporting on simulation code-
testing has been limited to the use of author-
selected verification problems. Few studies
have focused on author-independent
evaluation of a code, or at code
intercomparison. Main deficiencies in
reported code-testing efforts include
incompleteness of the performed testing,
absence of discussion regarding tested code
functions as compared with available code
functions and features, and lack of detail in
test problem implementation. This makes it
difficult to recreate the data sets for additional
analysis. The protocol presented in this
report aims to address these issues. In
addition, the protocol covers many other
test issues, ranging from performance and
resource utilization to usefulness as a
decision-making support tool.
The code-testing protocol is designed to
be applicable to all types of simulation codes
dealing with fluid flow and transport phenomena
in the unsaturated and saturated zones of
the subsurface. Selection and
implementation of test problems will differ
for the different types of codes. However,
evaluation techniques are in principle
independent of code type. Test results are
presented in a form that is unbiased by the
requirements posed by specific applications.
The aim is to provide enough detail to establish
confidence in the code's capabilities and to
efficiently determine its applicability to
specific field problems. Because users of
code-testing results may differ in terms of
objectives, the protocol leaves it to the users
to determine if a tested code is suitable to
their needs.
The most critical element of the code-
testing protocol is the functionality analysis
(including elements of what is often called
"code verification"). Many different test
configurations can be used and, for some
code types, a large number of benchmark
solutions may be available. For other code
types, intercomparison may be the only
available option. Selection of benchmarks
and design of test problems should be guided
by test objectives and in the context of the
completeness of the testing exercise.
Protocol tools such as functionality tables
and functionality matrices are effective aids
in the design of test problems. Well-designed
tests not only identify code functionality
problems, but should also provide important
information on correct implementation of
code features. Functionality analysis may
be limited because not all code features can
be adequately addressed using benchmark
solutions. Often, code intracomparison,
code intercomparison and conceptual
testing are required, resulting in a more
subjective assessment of code accuracy
and operational constraints.
The functionality analysis, performance
evaluation and applicability assessment
protocol, presented in the full report, provides
a comprehensive framework for systematic
and in-depth evaluation of a variety of
ground-water simulation codes. While allowing flexibility in implementation, the protocol, if properly applied, ensures that all potential coding problems are addressed. It should be noted that the protocol does not replace scientific review or the use of sound programming principles. Code-testing under the protocol is most effective when performed as part of the code development process.
Additional testing in accordance with the
protocol may be performed under direction
of regulatory agencies, or by end-users. If
properly documented, code-testing in
accordance with the protocol supports
effective independent review and
assessment for application suitability. As
such, the protocol contributes significantly to improved quality assurance in code
development and use in ground-water
modeling.
REFERENCES
Donigian, Jr., A.S. and Rao, P.S.C. 1986.
Example model testing studies. In: Vadose
Zone Modeling of Organic Pollutants (eds.
S.C. Hern and S.M. Melancon), Lewis
Publishers, Chelsea, Michigan.
Faust, C.R., P.M. Sims, C.P. Spalding,
P.F. Anderson, and D.E. Stephenson. 1990.
FTWORK: A three-dimensional groundwater
flow and solute transport code. WSRC-RP-
89-1085. Westinghouse Savannah River
Company, Aiken, South Carolina.
van der Heijde, P.K.M., and O.A.
Elnawawy. 1992. Quality Assurance and
Quality Control in the Development and
Application of Ground-Water Models. EPA
600/R-93/011, Office of Research and
Development, U.S. Environmental
Protection Agency, Washington, D.C.
van der Heijde, P.K.M., S.S. Paschke,
and D.A. Kanzer. 1993. Ground-water flow
and solute transport model functionality
testing and performance evaluation. In: H. J.
Morel-Seytoux (ed.), Proc. Thirteenth AGU
Hydrology Days, Fort Collins, Colorado, pp.
378-389.
Table 1. Procedures of code-testing and evaluation protocol.
CODE TESTING AND EVALUATION PROTOCOL
Step 1  analyze the code documentation with respect to simulation functions, operational features, mathematical framework, and software implementation;
Step 2  identify code performance issues based on understanding of simulated processes, mathematical methods, computer limitations, and software environment;
Step 3  develop a testing strategy that addresses relevant code functionality and performance issues, including selection and/or design of test problems and determination of appropriate evaluation measures;
Step 4  execute test problems and analyze results using standard graphic and statistical evaluation techniques;
Step 5  collect code performance issues and code test problems in overview tables and display matrices reflecting correctness, accuracy, efficiency, and field applicability;
Step 6  identify performance strengths and weaknesses of the code and testing procedure;
Step 7  document each test setup and results in report form and as electronic files (text, data, results, graphics); and
Step 8  communicate results (e.g., prepare executive summary, overview report, etc.).
Table 2. Major test issues for three-dimensional finite-difference saturated ground-water flow and solute transport codes.
General Features
• mass balances (regular versus irregular grid)
• variable grid (consistency in parameter and stress allocation)
Hydrogeologic Zoning, Parameterization, and Flow Characteristics
• aquifer pinchout, aquitard pinchout
• variable thickness layers
• storativity conversion in space and time (confined-unconfined)
• anisotropy
• unconfined conditions
• dewatering
• sharp contrast in hydraulic conductivity
Boundary Conditions for Flow
• default no-flow assumption
• areal recharge in top active cells
• induced infiltration from streams (leaky boundary) with potential for dewatering below the base of the semi-pervious boundary
• drain boundary
• prescribed fluid flux
• irregular geometry and internal no-flow regions
Transport and Fate Processes
• hydrodynamic dispersion (longitudinal and transverse)
• advection-dominated transport
• retardation (linear and Freundlich)
• decay (zero and first-order)
• spatial variability of dispersivity
• effect of presence or absence of the cross-term for dispersivity
Boundary Conditions for Solute Transport
• default zero solute-flux assumption
• prescribed solute flux
• prescribed concentration on stream boundaries
• irregular geometry and internal zero-transport zones
• concentration-dependent solute flux into streams
Sources and Sinks
• effects of time-varying discharging and recharging wells on flow
• multi-aquifer screened wells
• solute injection well with prescribed concentration (constant and time-varying flow rate)
• solute extraction well with ambient concentration
Paul K. van der Heijde and David A. Kanzer are with the International Ground Water
Modeling Center, Institute for Ground-Water Research and Education, Colorado
School of Mines, Golden, CO 80401-1887.
Joseph R. Williams is the EPA Project Officer (see below).
The complete report, entitled "Ground-Water Model Testing: Systematic Evaluation
and Testing of Code Functionality and Performance," (Order No. xxx; Cost:
$xx.00, subject to change) will be available only from:
National Technical Information Service
5285 Port Royal Road
Springfield, VA 22161
Telephone: 703-487-4650
The EPA Project Officer can be contacted at
Subsurface Protection and Remediation Division
National Risk Management Research Laboratory
U. S. Environmental Protection Agency
Ada, OK 74820