United States
Environmental Protection
Agency
Environmental Monitoring and Support
Laboratory
Research Triangle Park NC 27711
Research and Development
Validation of Air
Monitoring Data
-------
RESEARCH REPORTING SERIES
Research reports of the Office of Research and Development, U.S. Environmental
Protection Agency, have been grouped into nine series. These nine broad cate-
gories were established to facilitate further development and application of en-
vironmental technology. Elimination of traditional grouping was consciously
planned to foster technology transfer and a maximum interface in related fields.
The nine series are:
1 Environmental Health Effects Research
2 Environmental Protection Technology
3 Ecological Research
4 Environmental Monitoring
5 Socioeconomic Environmental Studies
6 Scientific and Technical Assessment Reports (STAR)
7 Interagency Energy-Environment Research and Development
8 "Special" Reports
9 Miscellaneous Reports
This report has been assigned to the ENVIRONMENTAL MONITORING series.
This series describes research conducted to develop new or improved methods
and instrumentation for the identification and quantification of environmental
pollutants at the lowest conceivably significant concentrations. It also includes
studies to determine the ambient concentrations of pollutants in the environment
and/or the variance of pollutants as a function of time or meteorological factors.
This document is available to the public through the National Technical Informa-
tion Service, Springfield, Virginia 22161.
-------
EPA-600/4-80-030
June 1980
Validation of Air
Monitoring Data
by
A. Carl Nelson, Jr., Dave W. Armentrout
and Ted R. Johnson
PEDCo Environmental, Inc.
Durham, North Carolina 27701
Contract No. 68-02-2722
Task No. 14
EPA Project Officer: Raymond C. Rhodes
Data Management and Analysis Division
Prepared for
U.S. ENVIRONMENTAL PROTECTION AGENCY
Environmental Monitoring Systems Laboratory
Research Triangle Park, North Carolina 27711
-------
DISCLAIMER
This report has been reviewed by the Environmental Monitoring Systems
Laboratory, U.S. Environmental Protection Agency, and approved for publication.
Mention of trade names or commercial products does not constitute endorsement
or recommendation for use.
ii
-------
ACKNOWLEDGEMENT
This report was prepared for the Environmental Monitoring
Systems Laboratory of the U.S. Environmental Protection Agency.
Mr. Raymond C. Rhodes was the Project Officer. PEDCo Environ-
mental, Inc., appreciates the direction and extensive review pro-
vided. Appreciation is also expressed to Ms. Debora Pizer and
Mr. Jose Sune for the information on CHAMP, to Mr. Robert B.
Jurgens for information on RAMS, to Mr. E. Gardner Evans for
examples of data validation tests and to Mr. Seymour Hochheiser
for discussions and reviews. In the preparation of this report,
materials from the Data Validation lectures of the Air Pollution
Training Institute (APTI), 470 course (Quality Assurance for Air
Pollution Measurement Systems) and examples from the Office of
Air Quality Planning and Standards (OAQPS) guideline have been
used. Mr. A. Carl Nelson served as PEDCo's Project Manager.
Mr. Nelson, Mr. Dave W. Armentrout, and Mr. Ted R. Johnson were
the principal authors.
-------
FOREWORD
Measurement and monitoring research efforts are designed to anticipate
potential environmental problems, to support regulatory actions by developing
an in-depth understanding of the nature and processes that impact health and
the ecology, to provide innovative means of monitoring compliance with
regulations and to evaluate the effectiveness of health and environmental
protection efforts through the monitoring of long-term trends. The
Environmental Monitoring Systems Laboratory, Research Triangle Park,
North Carolina, has the responsibility for: assessment of environmental
monitoring technology and systems; implementation of agency-wide quality
assurance programs for air pollution measurement systems; and supplying
technical support to other groups in the Agency including the Office of
Air, Noise and Radiation, the Office of Toxic Substances, and the Office
of Enforcement.
Data validation, an element of quality assurance, is necessary to
provide accurate and reliable environmental data. Data of known and
acceptable quality are needed for measuring compliance with regulations,
assessing health effects, and developing optimum strategies to cope with
environmental pollution situations. A unified treatment of validation
of particular types of data bases is needed to support broad-scale uses
of these data. This report presents a systematic review of procedures and
techniques which have proven useful in performing the data validation
function. Recommendations are given for the selection and implementation
of procedures and techniques appropriate for organizations performing
air monitoring, depending upon their particular mode of operations,
computational and statistical capability, and monitoring objectives.
iv
-------
ABSTRACT
Data validation refers to those activities performed after the data
have been obtained and thus serves as a final screening of the data before
they are used in a decision making process. This report provides organi-
zations that are monitoring ambient air levels and stationary source
emissions with a collection of data validation procedures and with
criteria for selection of the appropriate procedures for the particular
application. Hypothetical examples and case studies are given to
illustrate the use of the procedures. Statistical procedures
and tables are in the appendices.
-------
CONTENTS
Page
Acknowledgment iii
Abstract v
Figures ix
Tables x
Executive Summary xi
1.0 Introduction 1-1
1.1 Purpose 1-1
1.2 Scope and Organization of the Document 1-1
1.3 References 1-2
2.0 Background 2-1
2.1 Definition and Scope of Data Validation 2-1
2.2 Data Validation Procedures 2-2
2.3 Selection of Data Validation Procedures 2-2
2.4 Implementation of Data Validation 2-3
2.5 Brief Literature Review 2-3
2.6 References 2-4
3.0 Data Validation Procedures 3-1
3.1 Routine Procedures 3-1
3.1.1 Data identification checks 3-2
3.1.2 Unusual event review 3-2
3.1.3 Deterministic relationship checks 3-2
3.1.4 Data processing procedures 3-3
3.2 Tests for Internal Consistency 3-5
3.2.1 Data plots 3-5
3.2.2 Dixon ratio test 3-6
3.2.3 Grubbs test 3-12
3.2.4 Gap test 3-14
3.2.5 "Johnson" p test 3-16
3.2.6 Multivariate tests 3-17
3.3 Tests for Historical Consistency 3-19
3.3.1 Gross limit checks 3-19
3.3.2 Pattern and successive value tests 3-21
3.3.3 Parameter relationship test 3-21
3.3.4 Shewhart control chart 3-23
vi
-------
CONTENTS (continued)
Page
3.4 Tests for Consistency of Parallel Data Sets 3-29
3.4.1 The sign test 3-29
3.4.2 Wilcoxon signed-rank test 3-32
3.4.3 Rank sum test 3-34
3.4.4 Intersite correlation test 3-38
3.5 References 3-44
4.0 Selection and Implementation of Procedures 4-1
4.1 Organizational Criteria 4-2
4.1.1 Number of data sets 4-2
4.1.2 Historical data requirements 4-3
4.1.3 Nature of data anomalies 4-3
4.1.4 Strip chart data 4-3
4.1.5 Size of data set 4-4
4.1.6 Magnetic tape data 4-4
4.1.7 Data transmitted by telephone lines 4-4
4.1.8 Timeliness of procedure 4-5
4.2 Analytical Criteria 4-5
4.2.1 Statistical sophistication 4-5
4.2.2 Computational requirements 4-5
4.2.3 Expense of analysis 4-6
4.2.4 Sensitivity of the procedure 4-6
4.2.5 Use of data 4-7
4.3 Implementation of Data Validation Process 4-7
4.4 References 4-9
5.0 Hypothetical Examples and Case Studies 5-1
5.1 Hypothetical Example for Ambient Air Monitoring 5-1
5.1.1 Without computerized support 5-1
5.1.2 With computerized support 5-4
5.2 Hypothetical Example for Source Testing 5-4
5.3 Case Study of a Manual Data Validation System 5-5
5.3.1 Quality control 5-6
5.3.2 Data validation 5-8
vii
-------
CONTENTS (continued)
Page
5.4 Case Study of the CHAMP Automated Data Validation
System 5-10
5.4.1 Quality control functions 5-12
5.4.2 Data validation 5-14
5.5 Case Study of a Regional Air Monitoring System
(RAMS) Data Validation 5-23
5.5.1 Quality control 5-23
5.5.2 Data validation 5-27
5.6 References 5-35
6.0 Bibliography 6-1
Appendix A - Statistical Tables A-1
Appendix B - Fitting Distributions to Data B-1
Appendix C - Calculation of Limits for Shewhart Control Chart C-1
-------
FIGURES
Number Page
3-1 Computer plot of data with single anomalous value 3-7
3-2 Computer plot of data with single anomalous value 3-8
3-3 Illustration of Dixon ratio test for two TSP
monthly data sets 3-10
3-4 Illustration of Grubbs test for two TSP monthly
data sets 3-13
3-5 Illustration of gap test for hourly ozone data
(Newport, KY, 1976) 3-15
3-6 P(x) versus concentration for Weibull distribu-
tion fitted to 1976 NO2 data from Essex, MD.
Acceptance range defined as 0.05 < P(x) < 0.95 3-18
3-7 Shewhart control chart for mean values with 1978
data plotted 3-26
3-8 Shewhart control chart for range values with 1978
data plotted 3-27
3-9 Intersite correlation test data 3-42
5-1 Automated quality control tests 5-13
5-2 Example CHAMP data validation report (partial
printout) 5-16
5-3 High/low critical values for CHAMP secondary
parameters (partial printout) 5-16
5-4 CHAMP validation system, invalidity causes by
hour 5-17
5-5 Five-minute values of invalid secondary param-
eters 5-17
5-6 Example CHAMP journal entries for data valida-
tion 5-18
5-7 Example CHAMP validation data review 5-19
5-8 Example ozone daily data curve from CHAMP
data validation 5-21
5-9 Example NO curve from CHAMP data validation 5-22
5-10 Location of RAMS stations 5-25
5-11 RAMS data flow: RAMS station 5-26
5-12 RAMS data flow, central facility 5-28
5-13 RAMS hourly average temperature data,
January 21, to February 9, 1975 5-31
5-14 Minute temperature data from Station 103, from
1200 to 2359 hours, February 1, 1975 5-32
5-15 Reference weather data used in RAMS data
validation 5-34
5-16 Examples of instrument responses that can be
detected through minute successive differences 5-35
ix
-------
TABLES
Number
3-1 Examples of Hourly and Daily Gross Limit Checks
for Ambient Air Monitoring 3-20
3-2 Summary of Limit Values used in EPA Region V
for Pattern Tests 3-22
3-3 TSP Data from Site 397140014H01 Selected as
Historical Data Base for Shewhart Control Chart 3-25
3-4 TSP Data from Site 397140014H01 for Control Chart
(1978) 3-25
3-5 Ozone Data (ppb) Recorded August 1, 1978, at
Monitoring Sites 909929911101 (Site A) and
090020013101 (Site B) 3-31
3-6 Ozone Data (ppb) Incorporating Simulated
+5 ppb Calibration Error at Site A 3-33
3-7 Procedure for Wilcoxon Signed-Rank Test 3-34
3-8 Application of Wilcoxon Signed-Rank Test to
Ozone Data (ppb) in Table 3-4 3-35
3-9 Procedure for Wilcoxon Rank Sum Test 3-36
3-10 Application of Rank Sum Test to Ozone Data
in Table 3-5 3-37
3-11 TSP Data from Sites 397140014H01 and
397140020H01 for Intersite Correlation
Test, 1978 3-40
3-12 χ2 Values for Two Degrees of Freedom, for
Various Probability Levels 3-43
4-1 Factors to Consider in the Selection of Data
Validation Procedures 4-1
4-2 Selection of Data Validation Procedures 4-2
5-1 CHAMP Environmental Parameters 5-11
5-2 Parameters Monitored in the Regional Air
Monitoring System 5-24
5-3 Typical Gross Limit Values used in the Regional
Air Monitoring System Data Validation Pro-
cedure 5-29
A-1 Dixon Criteria for Testing of Extreme Observation
(Single Sample) A-1
A-2 Critical Values for 5% and 1% Tests of Dis-
cordancy for Two Outliers in a Normal Sample A-2
A-3 Critical T Values for One-Sided Grubbs Test when
Standard Deviation is Calculated from Sample A-3
A-4 Wilcoxon Signed-Rank Test A-6
A-5 Rank Sum Test α = P[H is true] A-7
B-1 Estimation of Distribution Parameters B-4
C-1 Factors for Estimating Control Limits of
Shewhart Chart C-2
-------
EXECUTIVE SUMMARY
An essential element of quality assurance is the validation
of data. The data validation procedures in this report span the
range of application from the collectors of data (e.g., State or
local agencies) to the users of the data (researchers of regional
or Federal agencies, consultants, and industrial personnel).
It is stressed that data validation is most effectively per-
formed where the data are collected; questionable data can be
most easily checked at this level. The validation procedures are
described in increasing order of complexity and statistical
sophistication. They are grouped into four categories for ease
in understanding their role in applications: procedures which
should be applied routinely, those which are used to test the
internal consistency within a given data set, and procedures for
comparing data sets with historical data and with other data
sets. The procedures applicable at the originating level are
described first. The more complex statistical procedures may be
used if the level of computer capability, training, and support
is adequate.
The proper implementation of data validation will ensure its
effectiveness. An individual should be assigned the responsibil-
ity of data validation, even though it may be a part-time activ-
ity. One responsibility will be to develop the data validation
plan as an integral part of the agency quality assurance plan. A
data flow diagram should be developed to identify each step of
data handling process which may result in an error or in lost
data. The data validation plan can be developed from the tech-
niques available in this report or in the literature (many refer-
ences are included in this report). A review and evaluation of
the effectiveness of the data validation process should be made
periodically 1) to optimize the sensitivity of the checks, and 2)
-------
to summarize and evaluate the basic reasons and causes of invali-
dation. Validation limits may be changed to alter the sensitiv-
ity. A particular procedure may not be adequate or may be too
costly (relative to its effectiveness) to be used on a routine
basis.
In summary, data validation is an effective means of ensur-
ing the integrity of the data for use in decision making.
xii
-------
1.0 INTRODUCTION
1.1 PURPOSE
The purpose of this report is to provide organizations that
are monitoring ambient air levels and stationary source emissions
with (1) a collection of useful data validation procedures and
with (2) criteria for selecting the appropriate procedures for
the application. Data validation procedures are an integral part
of a complete quality assurance program.
In this report, data validation will refer to those ac-
tivities performed after the fact—that is, after the data have
been obtained. Quality control has the purpose of minimizing the
amount of bad data collected or obtained. Data validation is to
prevent the remaining bad data from getting through the data col-
lection and storage system. Data validation thus serves as a
final screen before the data are used in the decision making pro-
cess. Whether data validation is performed by a data validator
with this specific assignment, by a researcher using an existing
data bank, or by a member of a field team or local agency, it is
preferable that data validation be performed as soon as possible
after the data have been obtained. At this stage, the question-
able data can be checked by recalling information concerning un-
usual events and meteorological conditions which can aid in the
validation process. Also, timely corrective actions may be taken
when indicated to minimize further generation of questionable
data.
1.2 SCOPE AND ORGANIZATION OF THE DOCUMENT
Although this report includes discussion of the general
theory of data validation, the examples described here concern
air pollution data. In particular, the validation procedures can
be applied to ambient air monitoring data, source test data, and
meteorological data. This document presents general data valida-
tion procedures in Section 3, guidelines for selection and
1-1
-------
implementation of data validation procedures in Section 4, and
case studies and hypothetical examples in Section 5. The refer-
ences are given at the end of each section. Appendices A, B, and
C contain tables and supplemental mathematical and statistical
background. The reader will not need a mathematical or statisti-
cal background to follow the guidelines in this document; he or
she needs only an interest in developing a data validation pro-
cess.
This document is not intended to replace previous publica-
tions. Some recent publications are recommended to the reader
for additional information.1-5 Section 2 contains additional
background information.
1.3 REFERENCES
1. U.S. Environmental Protection Agency. Screening Proce-
dures for Ambient Air Quality Data. EPA-450/2-78-037,
July 1978.
2. Rhodes, R. C., and S. Hochheiser. Data Validation Con-
ference Proceedings. Presented by Office of Research
and Development, U.S. Environmental Protection Agency,
Research Triangle Park, North Carolina, EPA 600/9-79-
042, September 1979.
3. U.S. Department of Commerce. Computer Science and
Technology: Performance Assurance and Data Integrity
Practices. National Bureau of Standards, Washington,
D.C., January 1978.
4. Barnett, V., and T. Lewis. Outliers in Statistical
Data. John Wiley and Sons, New York, 1978.
5. Naus, J. I. Data Quality Control and Editing. Marcel
Dekker, Inc., New York, 1975.
1-2
-------
2.0 BACKGROUND
2.1 DEFINITION AND SCOPE OF DATA VALIDATION
Data validation is "a systematic process for reviewing a
body of data against a set of criteria to provide assurance that
the data are adequate for their intended use."1 In technical
literature, data validation may be referred to as data editing,
data screening, data checking, data auditing, data verification,
data certification, and technical data review. Each of these
terms refers to a set of procedures for detecting, evaluating,
and correcting erroneous or questionable data. Data validation
is one of the primary activities for ensuring the reporting and
use of good data, whereas quality control (QC) activities are de-
signed for the purpose of acquiring good data.
The procedures included in data validation vary with the
method of obtaining data. In one application the data may be
recorded on a field data form and returned to a laboratory for
further handling prior to storage in a data bank. In another ap-
plication, such as CHAMP (Section 5), the data are obtained
directly from the monitor; several data validation checks (in-
cluding plotting the data) are made using the computer system,
and questionable data are identified along with information on
ancillary data needed by the validator to aid in the data valida-
tion.
It becomes more difficult to distinguish between QC and data
validation activities when using a computerized system for ac-
quiring, processing, checking, and finally storing the data for
retrieval. However, all procedures applied after the data are
first obtained and which result in the identification of ques-
tionable data and which result in subsequent investigation of
these data, will be considered as data validation. The QC pro-
cess will apply to the techniques applied prior to obtaining the
data, such as calibration checks of the instruments and the
use of control samples. Although these QC techniques may result
2-1
-------
in data being questioned and eliminated, these data are screened
by the QC system of checks and not by the data validation system.
Recommended QC checks and guidelines for their use are contained
in the Quality Assurance Handbook. This handbook discusses data
validation briefly in Volume I and in the introduction to Volume
II.1
The data validation process includes not only the identifi-
cation or flagging of questionable data but also the investiga-
tion of apparent anomalies. The latter step is often performed
by a person other than the one performing the first step, partic-
ularly when the data validation is being performed at an organi-
zational level removed from the source of the data.
2.2 DATA VALIDATION PROCEDURES
Specific data validation procedures are described in Section
3. Although the examples and terminology used are primarily from
air pollution, these procedures are not limited to this one data
category. A given procedure may be used for air monitoring data
(e.g., hi-vol data), for meteorological data, for source test
data, and in fact for almost any data obtained by a similar mea-
surement system.
2.3 SELECTION OF DATA VALIDATION PROCEDURES
In addition to general descriptions of the data validation
procedures, this document discusses how to select the most appro-
priate procedure for a particular application. The procedure(s)
most appropriate for a local agency with five monitoring stations
and one station reporting meteorological data may not be appro-
priate for the performance of a Reference Method 6 source test at
a utility plant. A State or Federal agency with a data bank
storage and retrieval system may require still other procedures.
In each application, the data validation procedures should be
selected with respect to considerations such as the volume of
data, type of data output/transmission, computational capability,
graphics capability, and the nature of expected errors. These
factors are considered in Section 4.
2-2
-------
2.4 IMPLEMENTATION OF DATA VALIDATION
The selection of the data validation procedures is followed
by the implementation of the data validation process. Decisions
must be made concerning:
1. Personnel performing the procedures,
2. Frequency of data checks,
3. Methods of flagging data,
4. Procedures for investigating anomalies, and
5. Summarization of data validation including a summary of
and subsequent decisions concerning flagged data.
Criteria for making these decisions are in Section 4.3.
2.5 BRIEF LITERATURE REVIEW
There are several good references on the application of
validation procedures to specific types of data.2-12 Reference 2
contains a discussion of the Dixon ratio test, the Shewhart
quality control test, pattern tests for four pollutants, copies
of three technical papers, a comparison of these tests, and com-
puter programs for three test procedures (gap test, pattern test
and Shewhart control chart). This development work done in EPA,
Office of Air Quality Planning and Standards (OAQPS) became the
basis for the Air Data Screening System, which is now being
implemented in 27 states through the Air Quality Data Handling
System (AQDHS). References 3 through 7 contain background infor-
mation on the statistical test procedures in the OAQPS Guide-
line.2 These references demonstrate and compare the effective-
ness of the screening procedures of reference 2. In particular,
Reference 3 describes the application and evaluation of the Dixon
ratio test. Reference 4 compares seven screening test procedures
using continuous 1-hour measurements for three pollutants and two
tests (Dixon ratio and Shewhart control chart) using 24-hour data
for three pollutants. Reference 5 compares the Dixon ratio test
and the Shewhart control chart; good agreement was indicated and
use of both procedures was recommended; however, the control
chart procedure was preferred if only one procedure was used.
Reference 6 describes the application of the control chart proce-
2-3
-------
dure to data from monitoring sites in Region V. Reference 7 de-
scribes an automated pattern test procedure and includes a com-
puter program for same. The limit values used in this pattern
test are in Section 3.3. Reference 8 contains a collection of
papers on a variety of applications in air pollution. Reference
9 is a recent publication on data validation as applied to com-
puter systems. A major portion of the literature on data valida-
tion would be classified as statistical in content. Furthermore,
many of the procedures for identifying possible data anomalies
are included throughout the statistical literature under the sub-
ject of outliers. Reference 10, a recent text on this subject,
summarizes the results dispersed throughout the literature. A
monograph on statistical applications to data QC and editing is
Reference 11. Reference 12 contains several statistical tests
for outliers and the corresponding tables. Several additional
references will be given at the ends of respective sections.
2.6 REFERENCES
1. U.S. Environmental Protection Agency. Quality Assur-
ance Handbook: Vol. I, Principles; Vol. II, Ambient
Air Specific Methods; and Vol. Ill, Stationary Source
Specific Methods. EPA-600/9-76-005, 1976.
2. U.S. Environmental Protection Agency. Screening Proce-
dures for Ambient Air Quality Data. EPA-450/2-78-037,
July 1978.
3. W. F. Hunt, Jr., T. C. Curran, N. H. Frank, and R. B.
Faoro, "Use of Statistical Quality Control Procedures
in Achieving and Maintaining Clean Air," Transactions
of the Joint European Organization for Quality Control/
International Academy for Quality Conference, Venice
Lido, Italy, September 1975.
4. W. F. Hunt, Jr., R. B. Faoro, T. C. Curran, and W. M.
Cox, "The Application of Quality Control Procedures to
the Ambient Air Pollution Problem in the USA," Trans-
actions of the European Organization for Quality Con-
trol, Copenhagen, Denmark, June 1976.
5. W. F. Hunt, Jr., R. B. Faoro, and S. K. Goranson, "A
Comparison of the Dixon Ratio Test and Shewhart Control
Test Applied to the National Aerometric Data Bank,"
Transactions of the American Society for Quality Con-
trol, Toronto, Canada, June 1976.
2-4
-------
6. W. F. Hunt, Jr., J. B. Clark, and S. K. Goranson, "The
Shewhart Control Chart Test: A Recommended Procedure
for Screening 24-Hour Air Pollution Measurements," J.
Air Poll. Control Assoc. 28:508 (1979).
7. R. B. Faoro, T. C. Curran, and W. F. Hunt, Jr., "Auto-
mated Screening of Hourly Air Quality Data," Trans-
actions of the American Society for Quality Control,
Chicago, Ill., May 1978.
8. Rhodes, R. C., and S. Hochheiser. Data Validation Con-
ference Proceedings. Presented by Office of Research
and Development, U.S. Environmental Protection Agency,
Research Triangle Park, North Carolina, EPA 600/9-79-
042, September 1979.
9. U.S. Department of Commerce. Computer Science and
Technology: Performance Assurance and Data Integrity
Practices. National Bureau of Standards, Washington,
D.C., January 1978.
10. Barnett, V. and T. Lewis. Outliers in Statistical
Data. John Wiley and Sons, New York, 1978.
11. Naus, J. I. Data Quality Control and Editing. Marcel
Dekker, Inc., New York, 1975.
12. 1978 Annual Book of ASTM Standards, Part 41. Standard
Recommended Practice for Dealing with Outlying Observa-
tions, ASTM Designation: E 178-75. pp. 212-240.
2-5
-------
3.0 DATA VALIDATION PROCEDURES
This section contains descriptions, along with examples, of
recommended data validation procedures. These procedures fall
into four categories:
1. Check and review procedures which should be used to
some extent in every validation process,
2. Procedures for testing the internal consistency of a
single data set,
3. Procedures for testing the consistency of data sets
with previous data (historical or temporal consistency), and
4. Procedures for testing the consistency of two or more
data sets collected at the same time or under similar conditions
(consistency of parallel data sets).
These four categories are described in Sections 3.1 to 3.4. In
each section, the procedures are arranged in increasing order of
statistical complexity. Hence the user of this report wishing to
use the simplest procedures should refer to the first one or two
procedures in each subsection. In particular, a local or State
agency, with a small staff but without computer facilities and
statistical support, would probably use only the procedures de-
scribed in Sections 3.1.1, 3.1.2, 3.1.3, 3.2.1, 3.3.1, 3.3.2,
3.3.3, and 3.3.4. The selection of appropriate data validation
procedures for a particular application is in Section 4. Imple-
mentation of the data validation process is described in Section
4.3.
3.1 ROUTINE PROCEDURES
Validation checks which should be made routinely during the
processing of data include checks for proper data identification
codes, review of unusual events, deterministic relationship
checks, and performance checks of the data processing system.
3-1
-------
3.1.1 Data Identification Checks
Data with improper identification codes are useless. Iden-
tification fields which must be correct are: 1) time (start and
stop time and date), 2) location, 3) sampling/analytical method
code, 4) pollutant method interval unit, 5) parameter, and 6)
decimal. Examples of data identification problems that have been
noted by the EPA regional offices include: 1) improper State
identification codes; 2) data identified for a nonexistent day
(e.g., October 35); and 3) duplicate data from one monitoring
site but no data from another. Since most of these types of
problems are the result of human error, an individual other than
the original person preparing the forms should scan the data
coding forms prior to using the data for computer entry or manual
summary. The data listings should also be checked after entry
into a computer system or data bank.
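The date-validity and duplicate checks described above can be sketched in Python (a minimal illustration, not part of the original report; the record layout and the field names site, date, and hour are hypothetical):

```python
import datetime

def check_identification(records):
    """Flag records whose identification fields are malformed or duplicated."""
    flags = []
    seen = set()
    for i, rec in enumerate(records):
        # A nonexistent date (e.g., October 35) fails strict parsing.
        try:
            datetime.datetime.strptime(rec["date"] + " " + rec["hour"],
                                       "%Y-%m-%d %H")
        except ValueError:
            flags.append((i, "invalid date/time"))
            continue
        # Duplicate data from one monitoring site for the same interval.
        key = (rec["site"], rec["date"], rec["hour"])
        if key in seen:
            flags.append((i, "duplicate site/interval"))
        seen.add(key)
    return flags

records = [
    {"site": "397140014H01", "date": "1978-10-04", "hour": "13"},
    {"site": "397140014H01", "date": "1978-10-35", "hour": "13"},  # nonexistent day
    {"site": "397140014H01", "date": "1978-10-04", "hour": "13"},  # duplicate
]
print(check_identification(records))
# → [(1, 'invalid date/time'), (2, 'duplicate site/interval')]
```

As recommended above, such a scan is best run by someone other than the person who prepared the coding forms, both before computer entry and again on the listings after entry.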
3.1.2 Unusual Event Review
A log should be maintained by each agency to record extrin-
sic events (e.g., construction activity, duststorms, unusual
traffic volume, and traffic jams) that could explain unusual
data. Depending on the purpose of data collection, this informa-
tion could also be used to explain why no data are reported for a
specified time interval, or it could be the basis for deleting
data from a file for specific analytical purposes.
3.1.3 Deterministic Relationship Checks
Data sets which contain two or more physically or chemically
related parameters should be routinely checked to ensure that the
measured values on an individual parameter do not exceed the mea-
sured values of an aggregate parameter which includes the indi-
vidual parameter. For example, NO2 values should not exceed NOx
values recorded at the same time and location. The following
table lists some, but not all, of the possible deterministic
relationship checks involving air quality and meteorological
parameters. The measured values of the individual parameters
(first column of table) should not exceed the corresponding
measured values of the aggregate parameter (second column).
3-2
-------
Individual parameter        Aggregate parameter
NO (nitric oxide)           NOx (total nitrogen oxides)
NO2 (nitrogen dioxide)      NOx (total nitrogen oxides)
CH4 (methane)               THC (total hydrocarbons)
SO2 (sulfur dioxide)        total sulfur
H2S (hydrogen sulfide)      total sulfur
Pb (lead)                   TSP (total suspended particulate)
dewpoint                    temperature
Data sets in which individual parameter values exceed the cor-
responding aggregate values should be flagged for further inves-
tigation. Minor exceptions to allow for measurement system noise
may be permitted in cases where the individual value is a large
percentage of the aggregate value.
The deterministic checks listed above are based on theoreti-
cal relationships between the parameters. Empirical relation-
ships can often be developed by reviewing historical data and
noting parameter behavior which seldom or never occurs. Param-
eter relationship checks based on historical data of this kind
are discussed in Section 3.3.3.
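The individual/aggregate comparison can be sketched as follows (an illustrative Python fragment, not from the report; the parameter keys are invented, and the tolerance argument stands in for the minor exceptions for measurement system noise noted above):

```python
# (individual, aggregate) pairs drawn from the deterministic table above
PAIRS = [("NO", "NOx"), ("NO2", "NOx"), ("CH4", "THC"),
         ("SO2", "total_sulfur"), ("H2S", "total_sulfur"),
         ("Pb", "TSP"), ("dewpoint", "temperature")]

def deterministic_check(record, tolerance=0.0):
    """Return the pairs for which the individual value exceeds its aggregate."""
    violations = []
    for ind, agg in PAIRS:
        if ind in record and agg in record:
            if record[ind] > record[agg] + tolerance:
                violations.append((ind, agg))
    return violations

# Hypothetical hourly averages for one site (ppm)
hourly = {"NO": 0.04, "NO2": 0.09, "NOx": 0.08, "CH4": 1.2, "THC": 1.6}
print(deterministic_check(hourly))
# → [('NO2', 'NOx')]  -- NO2 exceeds NOx, so the record is flagged
```

Records returned by such a check would be flagged for further investigation rather than deleted outright, as described above.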
3.1.4 Data Processing Procedures
Reference 1 identifies 67 procedures currently in use for
detecting and, when possible, correcting errors as they occur in
computer systems. A review of reference 1 reveals that several
procedures fall within the categories of internal, historical,
and parallel data consistency checks while others are peculiar to
data processing.1 Some of the latter techniques are:
1. Context and staged edits (e.g., a field edit for check-
ing the data values against the field specifications for length,
character set, and value range).
2. Addition of quality flags to items in a data base to
condition processing to avoid a mismatch between the data quality
and its use.
3. Redundancy in batches, files, and inputs to improve
reliability.
4. Checks on data sequence (e.g., input data are checked
for correct time sequence).
5. Editing by classification of category, class limits,
normal limits, and trend limits. For example, the behavior of an
individual data item can be compared to its previous behavior or
to the aggregate of individuals in its group. Procedures of this
type are included in Sections 3.2 and 3.3.
6. Parallel check calculations, useful when the same re-
sults can be obtained by two independent calculation procedures.
7. Built-in test data, verification tests, and diagnostics
to provide a test environment without the risk of allowing an un-
checked program access to real files.
8. On-line testing to exercise the fault detection logic.
9. Clearly defined organizational responsibilities to
ensure that the correct data validation procedures are continu-
ally employed.
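The field edit of item 1 can be sketched as a staged check; the function name and argument layout are our own illustration, not a procedure from reference 1:

```python
import re

def field_edit(value, length, pattern, vmin, vmax):
    """Staged field edit: length, character set, then value range (item 1 above)."""
    if len(value) != length:
        return False                      # fails the length specification
    if not re.fullmatch(pattern, value):
        return False                      # fails the character-set specification
    return vmin <= float(value) <= vmax   # value-range specification
```

For example, a four-digit numeric field constrained to 0-100 accepts "0042" but rejects "00x2" and "0999".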
No single grand or ideal solution to the problem of error
control within automated data processing exists. Each organiza-
tion must analyze its data processing system and the possible
errors peculiar to the system. However, all organizations should
develop a data flow diagram which indicates the steps in data
handling at which an error can occur. In general these steps can
be classified as (1) data input, (2) data transmission, (3) data
processing and (4) data output.
Principal sources of error in the data input stage include
key-punch errors and the use of mislabeled computer files. One
should always review the input for errors of this kind. This can
be done conveniently if the input data are included in the computer
output in a format for easy review.
The principal result of transmission error is the loss or
alteration of data. A simple way to check for transmission
error is to transmit the data a second time and then pair the two
data streams. Gaps and alterations will be immediately apparent
unless the transmission error is systematic.
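The pairing of two transmissions can be sketched in a few lines; this is an illustrative sketch with an assumed record-stream representation:

```python
def transmission_diffs(first, second):
    """Compare two transmissions of the same record stream.

    Returns the positions where the streams disagree; a length mismatch
    marks a gap starting at the end of the shorter stream.
    """
    diffs = [i for i, (a, b) in enumerate(zip(first, second)) if a != b]
    if len(first) != len(second):
        diffs.append(min(len(first), len(second)))
    return diffs
```

As noted above, a systematic error that corrupts both transmissions identically will not be caught by this comparison.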
Processing errors are more difficult to characterize and to
detect. They are usually caused by deficiencies in the computer
programs which manipulate the data files, perform mathematical cal-
culations, and format the output results. A standard method of
checking for processing errors is to make up a small, typical
data set, perform the appropriate data manipulations and calcu-
lations by hand, and compare with the results from the data pro-
cessing system. This procedure should provide a good check if
the data processing errors are not related to the size of the
data set. This procedure is also appropriate for checking the
part of the data processing system which outputs the data.
A detailed discussion of possible data validation procedures
for computer applications would require a report of size at least
equal to this entire report and would be repetitive of informa-
tion available in the pertinent literature, some of which is
listed in reference 1. Because reference 1 contains sufficient
detail for computer-trained personnel to develop a data valida-
tion system peculiar to their own system, no attempt is made here
to duplicate its material.
3.2 TESTS FOR INTERNAL CONSISTENCY
Internal consistency tests check for values in a data set
which appear atypical when compared to values of the whole data
set. Common anomalies of this type include unusually high or low
values (outliers) and large differences in adjacent values. The
following tests for internal consistency are listed in order of
increasing statistical sophistication. These tests will not de-
tect errors which alter all values of the data set by either an
additive or multiplicative factor (e.g., an error in the use of
the scale of a meter or recorder).
3.2.1 Data Plots
Data plotting (including strip chart records) is one of the
most effective means of identifying possible data anomalies.
However, plotting all data points may require considerable manual
effort or computer time. Nevertheless, data plots will often
identify unusual data that would not ordinarily be identified by
other internal consistency tests.
Figures 3-1 and 3-2 show computer plots of data with two
types of unusual values. In Figure 3-1, there is an unusually
high value which would be identified by almost all of the test
procedures described in Section 3.2. On the other hand, the
anomalous 4AM value in Figure 3-2 is similar in magnitude to
several other values recorded for August 25 and 26 and would not
be detected by most of these tests. The large difference between
the 4AM value and the neighboring values in the time sequence is
immediately apparent from the data plot.
Although data plots are particularly appropriate for check-
ing the internal consistency of data, they may also be used for
checking historical consistency (e.g., the Shewhart control chart
in Section 3.3.4) and parallel consistency (e.g., the intersite
correlation test in Section 3.4.4).
3.2.2 Dixon Ratio Test
The Dixon ratio tests3,4 are the simplest of the statistical
tests recommended for evaluating the internal consistency of
data. This section describes Dixon ratio tests for evaluating
(1) the largest value in a data set and (2) the largest pair of
values in a data set. Both procedures test the assumption H0
that the highest value(s) in a sample are consistent with the
spread of the data. Other Dixon ratio tests which may be of
interest to the data analyst are described in references 3 and 4.
3.2.2.1 Testing a Single High Value - The only data preparation
required for testing a single high value is the identification of
the lowest and the highest two or three (depending on n) values,
where x1, x2, x3, . . ., xn-2, xn-1, xn is the arrangement of the
lowest three and highest three values in ascending order of mag-
nitude. The statistic of interest for n ≤ 7 is the ratio of the
difference between the highest and second highest values to the
difference between the highest and lowest values (the range of
the values in the data set); thus,
R = (xn - xn-1)/(xn - x1). Equation 3-1
[Figure 3-1. Computer plot of data with single anomalous value. Hourly oxides of nitrogen (parts per million) at site 058440101001, 2 Sep 77, measured by instrumental chemiluminescence; the studentized t and modified Chauvenet tests flag the highest value as significant at the 1% level.]
[Figure 3-2. Computer plot of data with single anomalous value. Hourly data (parts per million) at site 058440101001, 26 Aug 77.]
The ratio R is a fraction between 0 and 1. As this ratio in-
creases, the probability that the highest value is consistent
with the rest of the data—that is, P[H0 is true]—decreases.
Table A-l (Appendix A) lists critical values of R associated with
P[H0 is true] = 5% and P[H0 is true] = 1% for normally distrib-
uted data sets. Note that critical values are not listed for n
values exceeding 25 and that the test procedure changes with n
(Equation 3-1 is used for n ≤ 7). The Dixon ratio test is not
recommended for large data sets.
The above procedure assumes the data are normally and inde-
pendently distributed. Since normal distributions are symmetri-
cal in shape, they have equal mean and median values. However,
air pollution data are not usually normally distributed. Data
which are nonnormal are usually positively skewed (mean > medi-
an). In these cases a Dixon ratio test based on the lognormal
distribution may be more appropriate; thus calculate R by using
R = (ln xn - ln xn-1)/(ln xn - ln x1). Equation 3-2
Because Equation 3-2 is less sensitive to outliers of a normal
distribution than Equation 3-1, it should be used only for test-
ing data which are known to be adequately fitted by the lognormal
distribution. If the data are not normally distributed and if
one is unsure of the adequacy of the lognormal distribution, the
gap test described in Section 3.2.4 should be considered.
Figure 3-3 illustrates the use of the Dixon ratio test in
evaluating 2 months of 24-hour TSP data. The two data sets are
identical except for the extreme values. Using the normality as-
sumption (Equation 3-1), the Dixon ratio of data set A is
RA = (154 - 117)/(154 - 42) = 0.33,
and the Dixon ratio of data set B is
RB = (420 - 154)/(420 - 56) = 0.73.

[Figure 3-3. Illustration of Dixon ratio test for two TSP monthly data sets (TSP, ug/m3): RA = 0.33 for data set A; RB = 0.73 for data set B.]

We accept the assumption H0 for data set A since 0.33 is smaller
than 0.642, the 5% critical value for n = 5 in Table A-l. The
value of 0.73 is larger than the 5% critical value but less than
the 1% critical value; consequently, 0.01 < P[H0 is true] < 0.05
and the value 420 appears to be inconsistent with the rest of the
data set. Table A-l contains several forms of the Dixon ratio
test procedure as a function of sample size (n = 3 to 25) and
whether the largest or smallest value is suspect. The applica-
tion of the Dixon ratio test to air pollution data and the com-
parison of the test with other test procedures is given in ref-
erences 5, 6, and 7.
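The ratio of Equations 3-1 and 3-2 can be computed directly; this is an illustrative sketch (the function name is ours, and critical values must still be taken from Table A-l). The low value 42 in data set A is inferred from the ratio shown in the text:

```python
import math

def dixon_ratio(values, lognormal=False):
    """Dixon ratio for a single high value, n <= 7 (Equations 3-1 and 3-2)."""
    x = sorted(math.log(v) for v in values) if lognormal else sorted(values)
    # (highest - second highest) / (highest - lowest)
    return (x[-1] - x[-2]) / (x[-1] - x[0])
```

Applied to the two monthly TSP data sets, the sketch reproduces the ratios worked out above (0.33 and 0.73).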
3.2.2.2 Testing a Pair of High Values - The Dixon ratio test
described above (Section 3.2.2.1) will not identify data sets
which have two or more outliers of similar magnitude. To test
for a pair of high values, calculate
R = (xn - xn-2)/(xn - x1) Equation 3-3
for normal data sets and
R = (ln xn - ln xn-2)/(ln xn - ln x1) Equation 3-4
for lognormal data sets. Decisions concerning acceptance or
rejection of the assumption H0—that the two highest values are
consistent with the rest of the data set—are made according to
the procedure described in Section 3.2.2.1 using Table A-2.
Reference 3 describes related procedures appropriate for testing
three or more outliers. Note: The Dixon ratio tests and similar
statistical procedures for identifying outliers in a data set
are not recommended for repeated use on the same data set. The
error risks specified in tables were theoretically derived on
the assumption that no extreme values have been removed from the
data prior to the test. In practice, however, the tests are
often applied successively. The user should be very cautious in
doing this, that is, never discarding the data until a just cause
has been determined. The user might also limit the rejected data
to a small percentage of the data set, say 5%. In all applica-
tions, the treatment of the outliers should be documented in
order that a subsequent user of the results will know the impact
of the outliers. For example, analyses can be given with and
without the discarded data.
3.2.3 Grubbs Test
Like the Dixon ratio test described in Section 3.2.2.1, the
Grubbs test3,8 can be used to determine whether the largest
observation xn in a sample from a normal distribution is too
large with respect to the internal consistency of the data set.
The Grubbs test differs from the Dixon ratio test in that its
test statistic,
T = (xn - x̄)/s, Equation 3-5
is calculated using all of the values in the data set. In addi-
tion to xn, the data analyst must determine the arithmetic mean,
x̄ = (1/n) Σ xi, Equation 3-6
and the standard deviation
s = [Σ(xi - x̄)^2/(n - 1)]^1/2. Equation 3-7
The following equations can be used for positively skewed data
which approximate the lognormal distribution:
x̄′ = (1/n) Σ ln xi Equation 3-8
s′ = [Σ(ln xi - x̄′)^2/(n - 1)]^1/2. Equation 3-9
High values of T indicate the likelihood that the maximum value
is too high. Table A-3 gives upper probability levels for T for
3 ≤ n ≤ 147.
Figure 3-4 illustrates the Grubbs test in evaluating the 2
months of TSP data used in Section 3.2.2 to illustrate the Dixon
ratio test. The T statistic for data set A is
TA = (154 - 91.2)/45.5 = 1.380.
Examining the n = 5 row in Table A-3, we find that P[H0 is true]
> 10% since 1.380 < 1.602. It follows that a maximum value of
154 is not unexpected for this particular data set. The T sta-
tistic for data set B is
TB = (420 - 166.8)/146.1 = 1.733.

[Figure 3-4. Illustration of Grubbs test for two TSP monthly data sets (TSP, ug/m3): TA = 1.380 for data set A; TB = 1.733 for data set B.]

Table A-3 indicates that P[H0 is true] < 0.025, since 1.733 >
1.715. The value 420 appears to be inconsistent with the data
set and should be flagged for further investigation.
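Equation 3-5 with the sample mean and standard deviation of Equations 3-6 and 3-7 can be sketched as follows (the function name is ours, and the probability levels must still come from Table A-3):

```python
import math
from statistics import mean, stdev  # stdev uses the n - 1 divisor, as in Equation 3-7

def grubbs_t(values, lognormal=False):
    """Grubbs T statistic for the largest observation (Equation 3-5)."""
    x = [math.log(v) for v in values] if lognormal else list(values)
    return (max(x) - mean(x)) / stdev(x)
```

Applied to the two monthly TSP data sets, the sketch reproduces the statistics worked out above (1.380 and 1.733).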
3.2.4 Gap Test
The gap test4,9 identifies spurious outliers by examining
the frequency distribution for large gaps. The length of the gap
between the largest value xn and the next largest value xn-1 is
xn - xn-1; the gap between the second and third largest values is
xn-1 - xn-2,
and similarly for other gaps. The occurrence of a gap length
larger than a predetermined critical value indicates a possible
data anomaly.
The test assumes that the upper percentiles of the frequency
distribution can be closely fit by a two-parameter exponential
curve defined as
F(x) = 1 - exp[-λ(x - θ)]* Equation 3-10
where F(x) is the fraction of the total observations less than or
equal to x; λ is the slope parameter; and θ is the location
parameter. Only the λ value of the fitted curve is used in the
gap test. Values of λ can be estimated from the expression
λ = (ln[1 - F(x1)] - ln[1 - F(x2)])/(x2 - x1) Equation 3-11
where F(x1) and F(x2) are the quantile values corresponding to
the concentration values x1 and x2. For fits using three or more
quantiles, use the least squares procedure for the 2-parameter
exponential distribution described in Appendix B.
The probability of a gap of at least k units occurring in a
data set free from erroneous values is
P[gap ≥ k] = exp[-λk]. Equation 3-12
A small P value indicates that the gap size is in question and
that the corresponding data value(s) should be flagged.
Figure 3-5 illustrates the use of the gap test for hourly
ozone data having a suspect value of 23 pphm. Substituting the
*exp[a] is a convenient means of writing (typing) e^a, where e is
the base for natural logarithms.
[Figure 3-5. Frequency distribution of hourly ozone concentrations (pphm) used to illustrate the gap test; the suspect 23-pphm value lies beyond a large gap in the distribution.]
90th and 99th percentile values, 4.5 pphm and 9.0 pphm, into
Equation 3-11 yields
λ = (ln 0.10 - ln 0.01)/(9.0 - 4.5) = 0.512.
The gap is eight units in length; consequently
P[gap ≥ 8] = exp[-(0.512)(8)] = 0.017.
There is less than a 2% probability that a gap of this length or
larger would occur in the data set under the assumption stated.
The 23-pphm value should be flagged for further investigation.
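The two-quantile estimate of Equation 3-11 and the gap probability of Equation 3-12 can be sketched directly (function names are ours):

```python
import math

def gap_lambda(x1, f1, x2, f2):
    """Estimate the exponential slope parameter from two quantiles (Equation 3-11)."""
    return (math.log(1.0 - f1) - math.log(1.0 - f2)) / (x2 - x1)

def gap_probability(lam, k):
    """P[gap >= k] under the fitted exponential tail (Equation 3-12)."""
    return math.exp(-lam * k)
```

With the 90th and 99th percentile values from the ozone example (4.5 and 9.0 pphm), the sketch reproduces λ = 0.512 and a gap probability below 2%.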
3.2.5 "Johnson" p Test
The "Johnson" p test assumes that the sampling distribution
of n observations can be approximated by the cumulative distribu-
tion F(x), where F(x) is an identifiable function which defines
the fraction of individuals in the sampled population having a
value less than or equal to x. Let xn be the largest recorded
value in a sample of n observations from the population distribu-
tion F(x). The probability that all n observations from F(x) are
less than xn is [F(xn)]^n; that is, the chance that a single ob-
servation is less than xn, F(xn), raised to the power n. Thus
the probability P(xn) that at least one value exceeds xn (i.e.,
the largest value is at least xn) is
P(xn) = 1 - [F(xn)]^n. Equation 3-13
We can now define an acceptance region for xn by requiring that
P(xn) fall between 0.05 and 0.95; that is,
0.05 < P(xn) < 0.95. Equation 3-14
Data sets with maximum values which cause P(xn) to fall outside
of the acceptance region should be flagged for further investiga-
tion.
The selection of an appropriate cumulative distribution F(x)
to characterize the data is important in determining a reasonable
acceptance region. Two distributions which often provide close
fits to ambient air quality data are the Weibull and the lognor-
mal.10,11 Appendix B contains procedures for fitting these dis-
tributions to data and for selecting the appropriate F(x) func-
tion.
Figure 3-6 shows the probability function of Equation 3-13
corresponding to a Weibull distribution fitted to 1976 NO2 data
from Essex, MD. The limits of the acceptance region correspond-
ing to 0.05 < P(xn) < 0.95 can be determined directly from the
graph by noting that P(165) = 0.05 and P(125) = 0.95. The value
of 140 ppb corresponds to P(xn) = 0.50 and is considered a likely
maximum value for this data set. If the data set has a recorded
maximum value less than 125 ppb or higher than 165 ppb, it should
be investigated further.
If the observed (recorded) maximum value is consistent with
the overall distribution of the data set, there is a 10% proba-
bility that the maximum value of a valid data set will fall out-
side this acceptance region. If, in the general case, the data
analyst decides to check all sets with xn values outside the ac-
ceptance region, he or she would expect to check 1 out of 10 data
sets unnecessarily. Where data validation requirements are more
stringent, the acceptance range could be reduced to 0.10 < P(xn)
< 0.90. On the other hand, if valid data are being checked too
frequently, the acceptance range can be changed to 0.01 < P(xn) <
0.99. The acceptable range of xn values should not be too
narrow, since the P values are calculated from fitted distribu-
tions that only approximate the data set.
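Equation 3-13 and the acceptance-region check can be sketched as follows. The Weibull shape and scale below are illustrative stand-ins, not the Essex, MD fit from the report; Appendix B gives the actual fitting procedures:

```python
import math

def p_max(F, x_max, n):
    """P(xn) = 1 - [F(xn)]^n (Equation 3-13)."""
    return 1.0 - F(x_max) ** n

def max_value_flagged(F, x_max, n, low=0.05, high=0.95):
    """Flag the data set when P(xn) falls outside the acceptance region."""
    return not (low < p_max(F, x_max, n) < high)

# Hypothetical two-parameter Weibull fit (shape k, scale c are assumptions).
def weibull_cdf(x, k=1.3, c=30.0):
    return 1.0 - math.exp(-((x / c) ** k))
```

Because F(x) is nondecreasing, P(xn) decreases as the recorded maximum grows, so an implausibly large maximum drives P(xn) toward zero and out of the acceptance region.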
3.2.6 Multivariate Tests
The procedures given in Section 3.2.2 through 3.2.5 can be
used for testing data sets involving more than one variable by
applying them independently to each variable; however, this
approach may be inefficient, particularly when the variables are
statistically correlated. In this case, the data analyst should
consider the test procedure given in Section 3.4.4 for correlated
data. Although this test is included under parallel data sets
because of the particular application to correlated data from two
sets, it can be applied to correlated variables within the same
data set; for example, to the concentrations of two pollutants,
such as TSP and Pb. In some cases, a multivariate test of this
kind will show that a value of one variable that appears to be an
[Figure 3-6. P(xn) versus concentration (ppb) for Weibull distribution fitted to 1976 NO2 data from Essex, MD. Acceptance range for xn defined as 0.05 < P(xn) < 0.95.]
outlier using a single variable test procedure will be consist-
ent with the data set when one or more other variables are con-
sidered. Conversely, there may be a value of one variable that
is consistent with the other data in the data set when only one
variable is considered, but is definitely a possible outlier when
two variables are considered.
Multivariate test procedures which have been successfully
used to perform data validation checks include cluster analy-
sis,12 principal component analysis,13 and correlation ana-
lyses. Applications of these usually require computerized pro-
cedures. For example, cluster analysis can be applied using a
program called NORMIX.14
3.3 TESTS FOR HISTORICAL CONSISTENCY
Tests for historical consistency check the data set with re-
spect to similar data recorded in the past. It is important to
note that some of the data validation procedures to be described
in this section will detect relatively small changes in the
average value and/or the dispersion of the data values. In
particular, these procedures will detect changes where each item
is increased (decreased) by a constant or by a multiplicative
factor. This is not the case for the procedures in Section 3.2
which yield the same value for the test statistic when all data
are changed by the same constant or multiplicative factor.
3.3.1 Gross Limit Checks
Gross limit checks are useful in detecting data values that
are either highly unlikely or generally considered impossible.
Upper and lower limits are developed by examining historical data
for a site (or for other sites in the area). Whenever possible,
the limits should be specific to each monitoring site and should
consider both the parameter and instrument/method characteris-
tics. Table 3-1 shows examples of gross limit checks that have
been used for ambient air monitoring data in the St. Louis
area.15,16 Although this technique can be easily adapted to com-
puter application, it is particularly appropriate for technicians
who reduce data manually or who scan strip charts to detect un-
usual events.
TABLE 3-1. EXAMPLES OF HOURLY AND DAILY GROSS LIMIT CHECKS FOR AMBIENT
AIR MONITORING15,16

Parameter                       Lower limit       Upper limit
O3                              0 ppm             1 ppm(a)
NO2                             0 ppm             2 ppm
NO                              0 ppm             3 ppm
NOx                             0 ppm             5 ppm
Total suspended particulates    0 ug/m3           2000 ug/m3
CO                              0 ppm             100 ppm
Total hydrocarbons              0 ppm             25 ppm(a)
Methane                         0 ppm             25 ppm(a)
Total sulfur                    0 ppm             1 ppm
SO2                             0 ppm             1 ppm
H2S                             0 ppm             1 ppm
Aerosol scatter                 0.000001 m^-1     0.0040 m^-1(b)
Windspeed                       0 m/s             22.2 m/s
Wind direction                  0°                360° (540° for some wind systems)
Temperature                     -20°C             45°C
Dewpoint                        -30°C             45°C
Temperature gradient            -5°C              5°C
Barometric pressure             950 mb            1050 mb

a These limits have been changed from the original in references 15 and 16 based
on after-the-fact considerations.
b Upper limit for a 24-hour average.
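For automated screening, the table reduces to a lookup of (lower, upper) bounds per parameter. The sketch below transcribes a subset of Table 3-1; the dictionary layout and function name are our own:

```python
# (lower, upper) gross limits from Table 3-1; units as listed in the table.
GROSS_LIMITS = {
    "O3": (0.0, 1.0), "NO2": (0.0, 2.0), "NO": (0.0, 3.0), "NOx": (0.0, 5.0),
    "TSP": (0.0, 2000.0), "CO": (0.0, 100.0), "total sulfur": (0.0, 1.0),
    "SO2": (0.0, 1.0), "H2S": (0.0, 1.0), "windspeed": (0.0, 22.2),
    "temperature": (-20.0, 45.0), "pressure": (950.0, 1050.0),
}

def gross_limit_flag(parameter, value):
    """True when a value falls outside the gross limits for its parameter."""
    lower, upper = GROSS_LIMITS[parameter]
    return not (lower <= value <= upper)
```

As the text notes, the limits should be tailored to each site and method rather than taken verbatim from this example.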
3.3.2 Pattern and Successive Value Tests
The pattern and successive value tests check the data for
pollutant behavior which has never or very rarely occurred. Like
the gross limit checks, they require that a set of boundary
values or limits be determined empirically from prescreened his-
torical data. Values representing pollutant behavior outside of
these predetermined limits are then flagged for further investi-
gation.
EPA has recommended the use of the pattern tests which place
upper limits on:
1. The individual concentration value (maximum hour test),
2. The difference in adjacent concentration values (adja-
cent hour test),
3. The difference or percentage difference between a value
and both of its adjacent values (spike test), and
4. The average of four or more consecutive values (consec-
utive value test).4
The maximum hour test (a form of gross limit check) can be used
with both continuous and intermittent data. The other three
tests should be used only with continuous data.
Table 3-2 is a summary of limit values developed by EPA for
hourly average data. These values were selected on the basis of
empirical tests on actual data sets. Note that the limit values
vary with different data stratifications (e.g., day/night).
These limit values will usually be inappropriate for other
pollutants, data stratifications, averaging times, or EPA
regions. In these cases, the data analyst should develop the
required limit values by examining historical data similar to the
data being tested. These limit values can be later modified if
they flag too many values that are later proven to be correct or
if they flag too few errors. Pattern tests should continue to
evolve to meet the needs of the analyst and the characteristics
of the data.
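The adjacent hour and spike tests reduce to simple difference checks against empirically chosen limits. This is an illustrative sketch (function names ours; limits would come from a table such as Table 3-2):

```python
def adjacent_hour_flags(hours, limit):
    """Indices whose difference from the preceding hour exceeds the limit."""
    return [i for i in range(1, len(hours)) if abs(hours[i] - hours[i - 1]) > limit]

def spike_flags(hours, limit):
    """Indices that differ from BOTH adjacent values by more than the limit."""
    return [i for i in range(1, len(hours) - 1)
            if abs(hours[i] - hours[i - 1]) > limit
            and abs(hours[i] - hours[i + 1]) > limit]
```

In the sequence 100, 120, 700, 130, 140 with a limit of 300, the adjacent hour test flags both sides of the excursion, while the spike test isolates the single spiked value.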
TABLE 3-2. SUMMARY OF LIMIT VALUES USED IN EPA REGION V
FOR PATTERN TESTS4

Pollutant (units)          Data stratification      Maximum   Adjacent   Spike       Consecutive
                                                    hour      hour                   4-hour
Ozone-total oxidant        summer day               1000      300        200(300%)   500
(ug/m3)                    summer night             750       200        100(300%)   500
                           winter day               500       250        200(300%)   500
                           winter night             300       200        100(300%)   300
Carbon monoxide            rush traffic hours       75        25         20(500%)    40
(mg/m3)                    nonrush traffic hours    50        25         20(500%)    40
Sulfur dioxide (ug/m3)     EPA region               2600      500        200(500%)   1000
Nitrogen dioxide (ug/m3)   None                     1200      500        200(300%)   1000

3.3.3 Parameter Relationship Test
Parameter relationship tests can be divided into two main
categories: deterministic tests involving the theoretical rela-
tionships between parameters (e.g., NO ≤ NOx) and empirical tests
which check whether or not a parameter is behaving normally in
relation to the observed behavior of one or more other param-
eters. Deterministic parameter checks are discussed in Section
3.1.3; empirical parameter checks are discussed in this section
since determining the "normal" behavior of related parameters
requires the detailed review of historical data.
The following area-specific example illustrates the testing
of meteorological data using a combination of successive value
tests, gross limit tests, and parameter relationship tests. One
should consult with the local National Weather Service office for
relationships for a specific area. The validation protocol calls
for the following procedures to be applied to ambient temperature
data based on the availability of hourly averages reported in
monthly formats:
1. Check the hourly average temperature. The minimum
should occur between 04-09 hours, and the maximum should occur
between 12-17 hours.
2. Inspect the hourly data for each day. Hourly changes
should not exceed 10°F. If a decrease of 10°F or more occurs,
check the wind direction and the precipitation summaries. The
wind direction should have changed to a more northerly direction,
and/or rainfall of 0.15 in. or more per hour should have fallen.
3. Hourly values should not exceed predetermined maximum
or minimum values based on month of the year. For example, in
November the maximum allowable temperature is 85°F and the mini-
mum allowable temperature is 10°F.
If any of the above criteria are not met, then the data for the
appropriate time period are flagged for anomaly investigation.
In this example, relationship checks have been developed for
temperature and wind direction as well as temperature and precip-
itation. Other pairs of parameters for which relationship checks
could be developed include solar insolation and cloud cover;
windspeed aloft and ground windspeed; ozone and NO; and tempera-
ture and humidity.
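The three temperature checks can be sketched as a single pass over a day of hourly values. This sketch is ours; for brevity it omits the wind-direction/precipitation cross-check that a real protocol would apply to 10-degree drops:

```python
def temperature_flags(temps_f, tmax=85.0, tmin=10.0):
    """Apply the three area-specific checks to 24 hourly temperatures (deg F)."""
    flags = []
    # 1. Daily minimum between 04-09 hours, maximum between 12-17 hours.
    if not 4 <= temps_f.index(min(temps_f)) <= 9:
        flags.append("daily minimum outside 04-09 hours")
    if not 12 <= temps_f.index(max(temps_f)) <= 17:
        flags.append("daily maximum outside 12-17 hours")
    # 2. Hourly changes should not exceed 10 F.
    for h in range(1, 24):
        if abs(temps_f[h] - temps_f[h - 1]) > 10.0:
            flags.append("hour %02d: change exceeds 10 F" % h)
    # 3. Monthly gross limits (defaults shown are the November values).
    for h, t in enumerate(temps_f):
        if not tmin <= t <= tmax:
            flags.append("hour %02d: outside monthly limits" % h)
    return flags
```

An empty list means no time period needs to be flagged for anomaly investigation.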
3.3.4 Shewhart Control Chart
The gross limit checks and the pattern tests described in
Sections 3.3.1 and 3.3.2, respectively, use critical values
("control limits") based on historical data to identify possible
data anomalies involving single values or small numbers of con-
secutive values. The Shewhart control chart is a valuable sup-
plement to these two tests in that it identifies data sets which
have mean or range values that are inconsistent with past data
sets.
The classical use of a quality control chart is to determine
the limits on the basis of historical data and to apply these
limits to future data to determine the state of control. In the
data validation process the control chart can be used in this
classical sense (particularly at the local agency) or in an
after-the-fact sense. In the latter case, the control chart
technique may be applied to data already recorded/documented by
using a portion of the data to determine the limits to be used
for the remaining data. The discussion of quality control charts
herein applies to either use (classical or after-the-fact).
The first step in the development of a Shewhart control
chart is the selection of a suitable sample size for the test.
Each sample should contain between 4 and 15 values and should re-
present a well-defined time period (day, month, quarter, etc.)
for which there is at least 10 to 15 historical data samples.
Where possible, these time periods should relate to the NAAQS
(National Ambient Air Quality Standards) of interest. Months or
quarters would be appropriate time periods for tests of 24-hour
TSP, SO2, and NO2 data collected at 6- or 12-day intervals.
The second step is to calculate the limits on the Shewhart
control chart following the procedures in Appendix C or in a
standard reference text.18 The details of the calculation will
be illustrated by the following example.
Average 24-hour TSP concentrations at Philadelphia monitor-
ing site 397140014H01 are recorded every sixth day. We desire to
use a Shewhart control chart to check 1978 data as they are re-
ported. A sample size of five is chosen so that incoming data
will be checked 12 times a year if all values are recorded. A
review of the TSP data for 1975-77 reveals 25 months which con-
tain exactly five TSP values. Table 3-3 lists the mean and the
range of each of these data sets. No seasonal trends are appar-
ent over the 3-year period. We apply the Grubbs test (Section
3.2.3) to the data sets with suspicious range values (>100). In
each case T < 1.602 and P[H0 is true] > 10%, so we decide not to re-
ject the data set. These 25 data sets now form the historical
data base for determining the limits for the control chart.
Since all data sets, including the one to be tested, contain
five observations, Equations C-l, C-5, C-6, and C-10 (Appendix C)
are used with d2 = 2.326 and c2 = 0.8407. If z = 2, then
x̄ (mean of the sample means) = 56.5
σx̄ (standard deviation of the mean) = 9.0
ULx̄ (upper 2σ limit for the mean) = 74.5
LLx̄ (lower 2σ limit for the mean) = 38.5
R̄ (mean range) = 47.0
ULR (upper 2σ limit for R) = 80.9
LLR (lower 2σ limit for R) = 13.0
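As a check, the central line and mean limits can be recomputed from the Table 3-3 values. This sketch estimates the standard deviation of the mean as R̄/(d2·√n), one common range-based estimate that reproduces the quoted limits to within rounding; the exact Appendix C formulas are not reproduced here:

```python
import math

# Monthly means and ranges transcribed from Table 3-3 (25 samples of n = 5).
MEANS = [54.6, 63.8, 59.0, 63.0, 68.2, 41.8, 68.4, 57.6, 82.4, 90.2, 43.8,
         72.6, 73.4, 34.6, 53.4, 52.2, 40.4, 63.6, 45.4, 53.4, 58.6, 46.0,
         45.6, 49.8, 30.4]
RANGES = [67, 39, 25, 23, 54, 26, 81, 39, 87, 117, 48, 80, 83, 50, 29, 44,
          28, 57, 31, 19, 26, 12, 33, 54, 22]

def shewhart_mean_limits(means, ranges, n, d2, z=2.0):
    """Center line and z-sigma limits for the mean, with sigma = Rbar/(d2*sqrt(n))."""
    grand_mean = sum(means) / len(means)
    r_bar = sum(ranges) / len(ranges)
    sigma_xbar = r_bar / (d2 * math.sqrt(n))
    return grand_mean, grand_mean - z * sigma_xbar, grand_mean + z * sigma_xbar
```

With n = 5 and d2 = 2.326, the sketch returns approximately 56.5, 38.4, and 74.5, matching the values listed above.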
TABLE 3-3. TSP DATA FROM SITE 397140014H01 SELECTED AS HISTORICAL DATA BASE
FOR SHEWHART CONTROL CHART (1975-1977)

Month-year   Mean (x̄), ug/m3   Range (R), ug/m3
1-75         54.6               67
5-75         63.8               39
6-75         59.0               25
7-75         63.0               23
8-75         68.2               54
10-75        41.8               26
11-75        68.4               81
12-75        57.6               39
1-76         82.4               87
4-76         90.2               117
5-76         43.8               48
7-76         72.6               80
9-76         73.4               83
10-76        34.6               50
11-76        53.4               29
12-76        52.2               44
3-77         40.4               28
4-77         63.6               57
6-77         45.4               31
7-77         53.4               19
8-77         58.6               26
9-77         46.0               12
10-77        45.6               33
11-77        49.8               54
12-77        30.4               22
Figures 3-7 and 3-8 have been constructed using these values, and
the mean and range values from eleven 1978 data sets have been
plotted. The raw data for 1978 are in Table 3-4.
TABLE 3-4. TSP DATA FROM SITE 397140014H01 FOR CONTROL CHART (1978)

Data set   Month   Mean   Range
1          1       30.6   27
2          2       47.4   60
3          3       54.4   39
4          4       31.8   29
5          5       53.6   46
6          6       64.8   46
7          8       68.8   87
8          9       43.2   31
9          10      52.4   59
10         11      60.8   71
11         12      31.6   22
[Figure 3-7. Control chart for the mean: 1978 monthly TSP means (ug/m3) from site 397140014H01 plotted against the central line and 2σ limits derived from the 1975-1977 data.]
[Figure 3-8. Control chart for the range: 1978 monthly TSP ranges (ug/m3) from site 397140014H01 plotted against the central line and 2σ limits derived from the 1975-1977 data.]
Figure 3-7 shows three mean values below the LL_x̄ but no
mean values above the UL_x̄. The overall distribution of plotted
points suggests a possible explanation for the anomalous low
values. Of the 11 points, eight are below the x̄ line while only
three are above it. We should investigate two hypotheses: (1)
air quality has improved and (2) the TSP monitor has developed a
negative bias. The first hypothesis can be checked by seeing if
TSP data from nearby monitors show similar trends. Measurement
bias may be revealed through calibration checks and careful
inspection of quality assurance records. In particular, check
for changes in the location of the sampling mechanism.
The 1978 range values also tend to be smaller than expected;
seven are less than R̄ while only four are greater than R̄ (Figure
3-8). None of the range values is less than the LL_R, however.
The sole outlier is slightly above the UL_R. A good preliminary
check of data sets with anomalous range values is the Grubbs test.
Following the procedure described in Section 3.2.3, we find that
x_n = 129, x̄ = 68.8, s = 34.6, and T = 1.740. Since inspection of
Table A-3 reveals that 0.010 < P[H0 is true] < 0.025 when
T = 1.740, we are suspicious of the highest value, x_n = 129.
Further investigation should focus on determining the validity
of this measurement.
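As a minimal sketch (assuming Python), the Grubbs check above can be reproduced from the summary statistics reported in the text; the raw values of the suspect data set are not repeated here.

```python
def grubbs_t(x_max, mean, s):
    """Grubbs T statistic for the highest value in a sample (Section 3.2.3)."""
    return (x_max - mean) / s

# Summary statistics reported for the suspect 1978 data set
t = grubbs_t(129, 68.8, 34.6)
# t is about 1.740; Table A-3 then gives 0.010 < P[H0 is true] < 0.025
```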
Assuming that we have determined the cause(s) of the anoma-
lies noted on Figures 3-7 and 3-8, we must now decide on the ap-
propriate control limits for plotting 1979 data. If an improve-
ment in air quality has indeed occurred, we should calculate new
control limits which reflect the lower x values expected. A rec-
ommended procedure in this case is to omit the earliest year of
data and to add the most recent. If the problem is traced to
measurement bias, either the original control limits can be re-
tained or new control limits can be calculated by adding any
valid 1978 data sets to the 25 data sets in the historical data
base.
It should be apparent from this example that a physical
control chart need not be constructed to test individual mean and
range values; a computer program can be easily developed which
3-28
-------
will list all data sets which fail the test. The main benefit of
the visible control chart is its capability to indicate trends
and other data patterns. If new data are consistent with histor-
ical data, the points plotted on the control chart should fall
within the limits, randomly above and below the central line. A
long run of points on one side of the line (even though none of
the points lies outside the control limits) may indicate a sys-
tematic bias in the data or an actual trend in air quality that
requires investigation. Common practice is to check these pos-
sibilities whenever the run exceeds 6 points. Used in this way,
the control chart can warn the data analyst of data problems
before they become serious.
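Such a screening program might be sketched as follows (assuming Python). The control-limit values and central line below are hypothetical placeholders, not the limits derived in the text; only the Table 3-4 means and ranges are taken from the example.

```python
def screen(sets, limits, center, run_limit=6):
    """Flag data sets whose mean or range falls outside the control limits,
    and warn when more than run_limit consecutive means fall on one side
    of the central line."""
    ll_x, ul_x, ll_r, ul_r = limits
    failures = [label for label, m, r in sets
                if not (ll_x <= m <= ul_x and ll_r <= r <= ul_r)]
    run = side = 0
    long_run = False
    for _, m, _ in sets:
        s = 1 if m > center else -1
        run = run + 1 if s == side else 1   # extend or restart the run
        side = s
        long_run = long_run or run > run_limit
    return failures, long_run

# Table 3-4 data sets as (label, mean, range); limits/center are hypothetical
sets = [(1, 30.6, 27), (2, 47.4, 60), (3, 54.4, 39), (4, 31.8, 29),
        (5, 53.6, 46), (6, 64.8, 46), (7, 68.8, 87), (8, 43.2, 31),
        (9, 52.4, 59), (10, 60.8, 71), (11, 31.6, 22)]
failures, long_run = screen(sets, limits=(35.0, 70.0, 10.0, 80.0), center=50.6)
# with these placeholder limits, data sets 1, 4, 7, and 11 are flagged
```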
3.4 TESTS FOR CONSISTENCY OF PARALLEL DATA SETS
The tests for internal consistency described in Section 3.2
implicitly assume that most of the values in a data set are cor-
rect. Consequently, if all of the values in a data set incorpo-
rate a small positive bias, tests such as the Dixon ratio test
would not identify the data set as inconsistent. One method of
identifying a systematic bias of this type is to compare the data
set with other data sets which presumably have been sampled from
the same population (i.e., same air mass and time period) and to
check for differences in the average value or the overall distri-
bution of values. This section describes four such procedures—
the sign test, the Wilcoxon signed-rank test, the Wilcoxon rank
sum test, and the intersite correlation test—which are recom-
mended for comparisons involving two "parallel" data sets. These
four tests are presented in order of increasing sensitivity to
differences between the data sets and increasing computational
complexity. The first three tests are nonparametric; that is,
they do not assume that the data have a particular distribu-
tion.19 Consequently, these tests can be used for the nonnormal
data sets which frequently occur in air quality analysis.
3.4.1 The Sign Test
The sign test19 is a relatively simple procedure for testing
the assumption H0--that two related (paired) samples, such as
3-29
-------
data sets from adjacent monitoring instruments, have the same
median. The data analyst simply determines the sign (+ or -) of
the algebraic difference between each of the pairs of data
points, and then counts the total number of positive signs (n+)
and negative signs (n-); differences of zero are ignored. The
probability that both samples have the same median is

P[H0 is true] = 2 N! (1/2)^N  Σ(j=0 to n) 1/[j!(N-j)!] ,   Equation 3-15

where n is the smaller of the two numbers n+ and n-, and N =
n+ + n-. For large N (say >25), one can use the normal approxi-
mation to the above probability by calculating

z = (2n - N)/√N .   Equation 3-16

In this case, P[H0 is true] is equal to twice the area to the
left of z under the standard normal curve. The following table
lists P values corresponding to selected values of z.

     z        P           z        P
  -1.282    0.20       -2.326    0.020
  -1.645    0.10       -2.576    0.010
  -1.960    0.05       -2.807    0.005
A z value of -1.85 would imply that the probability of the sam-
ples representing populations with the same distribution is be-
tween 5% and 10%.
Table 3-5 lists ozone data recorded August 1, 1978, at two
monitoring stations in Washington, D.C. The difference column
contains 13 positive values and 6 negative values. Consequently,
n+ = 13, n- = 6, n = 6, and N = 19. Substituting these values in
Equation 3-15 yields

P[H0 is true] = (2)(19!)(1/2)^19  Σ(j=0 to 6) 1/[j!(19-j)!] = 0.167.

If the normal approximation (Equation 3-16) is used, we have

z = [2(6) - 19]/√19 = -1.606,

and 0.10 < P[H0 is true] < 0.20. Generally the data analyst will
not reject H0 if P[H0 is true] > 0.05.
3-30
-------
TABLE 3-5. OZONE DATA (ppb) RECORDED AUGUST 1, 1978, AT
MONITORING SITES 090020011101 (SITE A) AND
090020013101 (SITE B)
Hour    Site A    Site B    Difference
  1       65        50         +15
  2       40        50         -10
  3       35        45         -10
  4       30        35          -5
  5       30        25          +5
  6       15        15           0
  7       15        10          +5
  8        5         5           0
  9       10         5          +5
 10       10        10           0
 11       35        50         -15
 12       65        55         +10
 13       65        60          +5
 14      100        90         +10
 15      130       110         +20
 16       90        65         +25
 17       70        65          +5
 18       70        70           0
 19       85        65         +20
 20       55        50          +5
 21       45        35         +10
 22       25        40         -15
 23       20        20           0
 24       20        30         -10
3-31
-------
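As a minimal sketch (assuming Python), the sign-test computation just illustrated can be reproduced from the Table 3-5 readings:

```python
from math import comb, sqrt

def sign_test(a, b):
    """Sign test for paired samples (Section 3.4.1); zero differences dropped."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n_plus = sum(d > 0 for d in diffs)
    n_minus = len(diffs) - n_plus
    n, N = min(n_plus, n_minus), len(diffs)
    # Equation 3-15: exact two-tailed binomial probability
    p_exact = 2 * sum(comb(N, j) for j in range(n + 1)) / 2 ** N
    # Equation 3-16: large-sample normal approximation
    z = (2 * n - N) / sqrt(N)
    return n_plus, n_minus, p_exact, z

site_a = [65, 40, 35, 30, 30, 15, 15, 5, 10, 10, 35, 65,
          65, 100, 130, 90, 70, 70, 85, 55, 45, 25, 20, 20]
site_b = [50, 50, 45, 35, 25, 15, 10, 5, 5, 10, 50, 55,
          60, 90, 110, 65, 65, 70, 65, 50, 35, 40, 20, 30]
n_plus, n_minus, p, z = sign_test(site_a, site_b)
# n_plus = 13, n_minus = 6, p ≈ 0.167, z ≈ -1.606
```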
Table 3-6 was developed to simulate a calibration error in
the site A data. Each reading in the site A column in Table 3-5
was increased by 5 ppb. Readings in the site B column were not
changed. There are now 18 positive values and 5 negative values.
It follows that n+ = 18, n- = 5, n = 5, and N = 23. Using Equa-
tion 3-15, the probability that H0 is true is

P[H0 is true] = (2)(23!)(1/2)^23  Σ(j=0 to 5) 1/[j!(23-j)!] = 0.0106.

The normal approximation (Equation 3-16) yields

z = [2(5) - 23]/√23 = -2.711,

corresponding to P[H0 is true] = 0.007. In either case, P[H0 is
true] < 0.05, and there is good reason to believe that a funda-
mental inconsistency exists between the two data sets.
3.4.2 Wilcoxon Signed-Rank Test
Like the sign test, the signed-rank test can be used to test
the assumption (H0) that two samples come from populations having
the same medians. The Wilcoxon test is generally more powerful
than the sign test since it considers both the sign and the mag-
nitude of the difference between paired data. Table 3-7 lists
the steps in the procedure. The large sample approximation given
in step 5 is appropriate when N > 20.
Table 3-8 illustrates the application of the signed-rank
test to the ozone data listed in Table 3-5. The absolute dif-
ferences were first listed according to increasing magnitude and
then assigned ranks. Note that differences of zero are ignored;
there are 19 nonzero differences. The sum of the ranks associ-
ated with the minus sign (T1) is 65.5 and the sum of the ranks
associated with the plus sign (T2) is 124.5. Table A-4 indicates
that when N = 19 the probability of H0 being true would be less
than 10% if T2 ≤ 53 or T2 ≥ 137. Since 53 < 124.5 < 137, we can
say P[H0 is true] > 0.10. Using the large sample approximation
(Equation 3-17),
3-32
-------
z = [T1 - N(N+1)/4] / √[N(N+1)(2N+1)/24] ,   Equation 3-17

we find that

z = [65.5 - 19(19+1)/4] / √[19(19+1)(2·19+1)/24] = -1.187,
which corresponds to P[H0 is true] = 0.235. In neither case is
P[H0 is true] < 0.10; consequently, there is not good reason to
reject H0.
TABLE 3-6. OZONE DATA (ppb) INCORPORATING SIMULATED
+5 ppb CALIBRATION ERROR AT SITE A
Hour    Site A    Site B    Difference
  1       70        50         +20
  2       45        50          -5
  3       40        45          -5
  4       35        35           0
  5       35        25         +10
  6       20        15          +5
  7       20        10         +10
  8       10         5          +5
  9       15         5         +10
 10       15        10          +5
 11       40        50         -10
 12       70        55         +15
 13       70        60         +10
 14      105        90         +15
 15      135       110         +25
 16       95        65         +30
 17       75        65         +10
 18       75        70          +5
 19       90        65         +25
 20       60        50         +10
 21       50        35         +15
 22       30        40         -10
 23       25        20          +5
 24       25        30          -5
3-33
-------
TABLE 3-7. PROCEDURE FOR WILCOXON SIGNED-RANK TEST19
1. Determine the sign and magnitude of the algebraic difference between each
   of the pairs of data points; assume that N scores remain after throwing
   out all zeros.
2. Assign ranks to the absolute values of these N differences; use the average
   rank in case of ties. Make sure that the ranks increase as the absolute
   values increase.
3. Assign to each rank the sign which it represents.
4. Determine the sum of the ranks associated with a minus sign (T1) and the
   sum of the ranks associated with a plus sign (T2).
5. To test H0, either use published tables (see Appendix A, Table A-4) based
   on T1 and N or use a large sample normal approximation with

   z = [T1 - N(N+1)/4] / √[N(N+1)(2N+1)/24] .   Equation 3-17

   The probability that H0 is true--that is, P[H0 is true]--is equal to twice
   the area to the left of z under the standard normal curve.
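A minimal Python sketch of steps 1 through 5, applied below to the Table 3-5 readings:

```python
from math import sqrt

def signed_rank(a, b):
    """Wilcoxon signed-rank test per Table 3-7 (ties get average ranks)."""
    diffs = sorted((x - y for x, y in zip(a, b) if x != y), key=abs)
    N = len(diffs)
    ranks = [0.0] * N
    i = 0
    while i < N:                        # assign average ranks to tied |d|
        j = i
        while j < N and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    t1 = sum(r for d, r in zip(diffs, ranks) if d < 0)
    t2 = sum(r for d, r in zip(diffs, ranks) if d > 0)
    # Equation 3-17: large-sample normal approximation
    z = (t1 - N * (N + 1) / 4) / sqrt(N * (N + 1) * (2 * N + 1) / 24)
    return t1, t2, z

site_a = [65, 40, 35, 30, 30, 15, 15, 5, 10, 10, 35, 65,
          65, 100, 130, 90, 70, 70, 85, 55, 45, 25, 20, 20]
site_b = [50, 50, 45, 35, 25, 15, 10, 5, 5, 10, 50, 55,
          60, 90, 110, 65, 65, 70, 65, 50, 35, 40, 20, 30]
t1, t2, z = signed_rank(site_a, site_b)
# t1 = 65.5, t2 = 124.5, z ≈ -1.187
```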
If we repeat the signed-rank test with the biased ozone data
in Table 3-6, we find that T1 = 38.5, T2 = 237.5, and N = 23.
Table A-4 indicates that if T2 ≤ 54 or if T2 ≥ 222 when N = 23,
then P[H0 is true] < 0.01. Since 237.5 > 222 we reject H0; there
is good reason to believe that samples A and B represent popula-
tions with different medians. The normal approximation supports
this conclusion since

z = [38.5 - 23(23+1)/4] / √[23(23+1)(2·23+1)/24] = -3.026

and a z value of -3.026 corresponds to P[H0 is true] = 0.0025.
3.4.3 Rank Sum Test
The rank sum procedure is useful in testing the assumption
(H0) that two samples represent populations with the same distri-
bution. Unlike the sign test and the signed-rank test, the rank
3-34
-------
TABLE 3-8. APPLICATION OF WILCOXON SIGNED-RANK TEST TO OZONE
DATA (ppb) IN TABLE 3-5

Absolute difference    (Signed) Rank
         0                   *
         0                   *
         0                   *
         0                   *
         0                   *
         5                  -4
         5                  +4
         5                  +4
         5                  +4
         5                  +4
         5                  +4
         5                  +4
        10               -10.5
        10               -10.5
        10               +10.5
        10               +10.5
        10               +10.5
        10               -10.5
        15                 +15
        15                 -15
        15                 -15
        20               +17.5
        20               +17.5
        25                 +19
3-35
-------
sum test is applicable to independent (unrelated) samples. Table
3-9 lists the steps in the procedure appropriate for tests of 10
or more data pairs.
TABLE 3-9. PROCEDURE FOR WILCOXON RANK SUM TEST
1. Combine the n1 observations from population 1 and the n2 observations
   from population 2, arrange them in ascending order of size, and then
   assign ranks from 1 to (n1 + n2), where n1 ≤ n2. In case of ties, use the
   average rank.
2. Calculate T1, the sum of the ranks assigned to observations from popu-
   lation 1.
3. Compare T1 with the critical values in Table A-5, α = 0.10.20 If T_l
   ≤ T1 ≤ T_r for the values of T_l and T_r listed for samples of n1 and n2
   in the table, then P[H0 is true] ≥ 0.10. If T1 is outside the indi-
   cated range, go to the α = 0.05 table and repeat the test. If T_l ≤ T1
   ≤ T_r, then 0.05 < P[H0 is true] < 0.10. If not, continue the search
   until a value of α is found for which T_l ≤ T1 ≤ T_r. If α ≤ 0.05 there
   is good reason to reject H0.
4. To test H0 for n2 > 20, use the large sample normal approximation, Equa-
   tion 3-18. The probability that H0 is true, i.e., P[H0 is true], is
   equal to twice the area to the right of z under the standard normal
   curve.
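The steps above can be sketched as follows (assuming Python); applied to the Table 3-5 ozone readings, the function reproduces the worked example that follows:

```python
from math import sqrt

def rank_sum(sample1, sample2):
    """Wilcoxon rank sum test (Table 3-9); ties receive average ranks."""
    pooled = sorted([(v, 1) for v in sample1] + [(v, 2) for v in sample2])
    n = len(pooled)
    ranks = [0.0] * n
    i = 0
    while i < n:                        # assign average ranks to tied values
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    t1 = sum(r for (v, src), r in zip(pooled, ranks) if src == 1)
    n1, n2 = len(sample1), len(sample2)
    # Equation 3-18: large-sample normal approximation
    z = (t1 - n1 * (n1 + n2 + 1) / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return t1, z

site_a = [65, 40, 35, 30, 30, 15, 15, 5, 10, 10, 35, 65,
          65, 100, 130, 90, 70, 70, 85, 55, 45, 25, 20, 20]
site_b = [50, 50, 45, 35, 25, 15, 10, 5, 5, 10, 50, 55,
          60, 90, 110, 65, 65, 70, 65, 50, 35, 40, 20, 30]
t1, z = rank_sum(site_a, site_b)
# t1 = 598, z ≈ 0.206
```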
Table 3-10 illustrates the application of the rank sum test
to the ozone data listed in Table 3-5. Note that tied values are
assigned an average rank value. The sum T1 of the ranks from the
first sample is equal to 598. In Table A-5, α = 0.10, T_l = 507
and T_r = 669 for n1 = n2 = 24.20 Consequently T_l ≤ T1 ≤ T_r and
P[H0 is true] ≥ 0.10; thus H0 should not be rejected. Using the
large sample normal approximation (Equation 3-18),

z = [T1 - n1(n1+n2+1)/2] / √[n1·n2(n1+n2+1)/12] ,   Equation 3-18
3-36
-------
and substituting the data,

z = [598 - (24)(24+24+1)/2] / √[(24)(24)(24+24+1)/12] = 0.206.

A z value of 0.206 corresponds to P[H0 is true] = 0.837.
TABLE 3-10. APPLICATION OF RANK SUM TEST TO OZONE
DATA IN TABLE 3-5

Value   Data set   Rank        Value   Data set   Rank
   5       A        2            45       B       25.5
   5       B        2            45       A       25.5
   5       B        2            50       B       28.5
  10       B        5.5          50       B       28.5
  10       A        5.5          50       B       28.5
  10       A        5.5          50       B       28.5
  10       B        5.5          55       B       31.5
  15       A        9            55       A       31.5
  15       B        9            60       B       33
  15       A        9            65       A       36.5
  20       A       12            65       A       36.5
  20       B       12            65       A       36.5
  20       A       12            65       B       36.5
  25       B       14.5          65       B       36.5
  25       A       14.5          65       B       36.5
  30       A       17            70       A       41
  30       A       17            70       A       41
  30       B       17            70       B       41
  35       A       20.5          85       A       43
  35       B       20.5          90       B       44.5
  35       A       20.5          90       A       44.5
  35       B       20.5         100       A       46
  40       A       23.5         110       B       47
  40       B       23.5         130       A       48
3-37
-------
Application of the rank sum test to the biased data in Table
3-6 yields T1 = 626.5. Table A-5 shows that T_l ≤ 626.5 ≤ T_r for
α = 0.10; it follows that P[H0 is true] ≥ 0.10. The large sample
normal approximation is consistent with this result since z =
0.794 and P[H0 is true] = 0.427. In either case, the rank sum
test does not indicate that H0 should be rejected.
Note that the rank sum test accepts H0 with respect to the
data in Table 3-6, but the sign test and the signed-rank test re-
ject H0. This apparent contradiction is the result of a funda-
mental difference in the construction of the tests. The sign and
signed-rank tests are paired-value tests; consequently, they are
particularly sensitive to differences between related observa-
tions. Both of these tests changed from accepting H0 to reject-
ing H0 when a small bias (5 ppb) was introduced into data set A.
Because the rank sum test compares sample distributions without
actually pairing the data, it is relatively insensitive to small
data shifts. The rank sum test is more appropriate for identi-
fying censored samples.
The data listed in Table 3-10 are similar to those in Table
3-5, except that all values from site A exceeding 60 ppb have
been omitted. In this case n1 = 15, n2 = 24, and T1 = 225. The
largest α value in Table A-5 for which T_l ≤ T1 ≤ T_r is 0.02.
Consequently, 0.02 < P[H0 is true] ≤ 0.05, and there is good rea-
son to reject H0. Substitution of the appropriate values into
the equation for the large sample approximation yields z = -2.165
and P[H0 is true] = 0.030.
The sign and signed-rank tests are not as sensitive to cen-
sored data as the rank sum test, mainly because they ignore un-
paired data. In the example just described, the sign and signed-
rank tests would not use the nine values from site B for which
there are no corresponding values listed for site A. Unlike the
rank sum test, neither of these paired data tests would have
rejected H0.
3.4.4 Intersite Correlation Test
The intersite correlation test is suggested as a means of
comparing two correlated parameters being measured at the same
3-38
-------
site or at neighboring sites. An example is given to illustrate
the procedure for TSP measurements on every sixth day at each of
the neighboring sites over a period of 1 year.
It would be possible to treat each site independently; how-
ever, this approach does not consider the relationship between
the measured TSP concentrations at the two sites, and hence may
err either by not identifying a potential data anomaly or by
identifying a possible outlier and later finding that it is con-
sistent with that measured at a neighboring site.
The data for the example are in Table 3-11. Denote by x and
y the measurements of µg TSP/m3 at sites 397140020H01 and
397140014H01, respectively. The original data and the logarithms
are both given because TSP data are almost always better approxi-
mated by using the lognormal distribution.
Because the lognormal distribution is preferable, these data
are plotted in Figure 3-9 on log-log paper (2 cycles). Note that
there is a relatively high correlation (about 0.90) between the
two measurements.
The calculation procedure is given for two cases: (1) as-
suming the data to be normally distributed (actually a bivariate
normal) and (2) assuming the data are lognormally distributed.
(In the latter case, only the changes in the calculations from
case 1 are given.)21
Calculations (Assuming Bivariate Normal)
1. Calculate the mean and the standard deviation for each
variable.

x̄ = 56.5        ȳ = 49.0
sx = 26.3       sy = 22.9

2. Calculate the correlation coefficient, r, for the
two measurements,

r = [Σxy - (Σx)(Σy)/n] / [(n-1) sx sy] = 0.91 ,   Equation 3-19

where n is the number of paired observations.
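Steps 1 and 2 can be sketched as below (assuming Python); applying the function to the 55 data pairs of Table 3-11 reproduces the statistics given above:

```python
from math import sqrt

def corr_stats(x, y):
    """Means, standard deviations, and r per Equation 3-19."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
    sy = sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))
    r = ((sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n)
         / ((n - 1) * sx * sy))
    return xbar, ybar, sx, sy, r

# Table 3-11 data: site 397140020H01 (x) and site 397140014H01 (y)
x = [43, 40, 24, 31, 50, 13, 65, 54, 58, 53, 77, 59, 75, 36, 28, 30, 31,
     57, 41, 65, 31, 69, 60, 87, 76, 33, 73, 57, 36, 65, 40, 88, 71, 175,
     85, 56, 88, 75, 33, 50, 57, 108, 46, 69, 42, 76, 101, 38, 57, 45, 28,
     33, 52, 38, 43]
y = [31, 34, 13, 40, 49, 19, 79, 39, 51, 46, 72, 49, 72, 33, 18, 24, 24,
     47, 32, 66, 28, 68, 74, 83, 80, 37, 69, 55, 28, 51, 42, 53, 56, 129,
     64, 46, 46, 57, 26, 41, 41, 90, 31, 63, 37, 69, 108, 37, 47, 43, 26,
     25, 45, 39, 23]
xbar, ybar, sx, sy, r = corr_stats(x, y)
# xbar ≈ 56.5, ybar = 49.0, r ≈ 0.91
```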
3-39
-------
TABLE 3-11. TSP DATA FROM SITES 397140014H01 AND
397140020H01 FOR INTERSITE CORRELATION
TEST, 1978
Site 20 (x),   Site 14 (y),
   µg/m3          µg/m3        ln x    ln y
     43             31         3.76    3.43
     40             34         3.69    3.53
     24             13         3.18    2.56
     31             40         3.43    3.69
     50             49         3.91    3.89
     13             19         2.56    2.94
     65             79         4.17    4.37
     54             39         3.99    3.66
     58             51         4.06    3.93
     53             46         3.97    3.83
     77             72         4.34    4.28
     59             49         4.08    3.89
     75             72         4.32    4.28
     36             33         3.58    3.50
     28             18         3.33    2.89
     30             24         3.40    3.18
     31             24         3.43    3.18
     57             47         4.04    3.85
     41             32         3.71    3.47
     65             66         4.17    4.19
     31             28         3.43    3.33
     69             68         4.23    4.22
     60             74         4.09    4.30
     87             83         4.47    4.42
     76             80         4.33    4.38
     33             37         3.50    3.61
     73             69         4.29    4.23
     57             55         4.04    4.01
     36             28         3.58    3.33
     65             51         4.17    3.93
     40             42         3.69    3.74
     88             53         4.48    3.97
     71             56         4.26    4.03
    175            129         5.16    4.86
     85             64         4.44    4.16
(continued)
3-40
-------
TABLE 3-11 (continued)
Site 20 (x),   Site 14 (y),
   µg/m3          µg/m3        ln x    ln y
     56             46         4.03    3.83
     88             46         4.48    3.83
     75             57         4.32    4.04
     33             26         3.50    3.26
     50             41         3.91    3.71
     57             41         4.04    3.71
    108             90         4.68    4.50
     46             31         3.83    3.43
     69             63         4.23    4.14
     42             37         3.74    3.61
     76             69         4.33    4.23
    101            108         4.62    4.68
     38             37         3.64    3.61
     57             47         4.04    3.85
     45             43         3.81    3.76
     28             26         3.33    3.26
     33             25         3.50    3.22
     52             45         3.95    3.81
     38             39         3.64    3.66
     43             23         3.76    3.14
3. Obtain a probability ellipse, as described in steps 4
through 7 and shown in Figure 3-9.
4. Assume that the means, standard deviations, and corre-
lation coefficient are known values describing all data from the
sites. There should be considerable representative data used in
the determination of these statistics, at least 50 days (measure-
ments) for each site. This is similar to the assumption made in
the development of a quality control chart.
5. Assume a probability level such as 95%; that is, the
ellipse to be constructed should contain 95% of the data points.
6. Determine a critical value of χ² (chi-square) for two
degrees of freedom at 95% probability; see Table 3-12 for a list
of values of χ² for various probability levels from 50% to
99.95%. The value is 5.99 for 95% probability.
7. Plot the ellipse with coordinates (x, y) which satisfy
the equation,
3-41
-------
[Figure 3-9 (log-log plot): TSP concentration (µg TSP/m3) at site
397140014H01 versus site 397140020H01, with the 95% probability
ellipse shown.]
Figure 3-9. Intersite correlation test data.
3-42
-------
1/(1-r²) { [(x-x̄)/sx]² - 2r[(x-x̄)/sx][(y-ȳ)/sy] + [(y-ȳ)/sy]² } = χ²   Equation 3-20
TABLE 3-12. χ² VALUES FOR TWO DEGREES OF FREEDOM,
FOR VARIOUS PROBABILITY LEVELS

Confidence level, %      χ²
       50               1.39
       60               1.83
       70               2.41
       80               3.22
       90               4.61
       95               5.99
       97.5             7.38
       99               9.21
       99.5            10.6
       99.9            13.8
       99.95           15.2
or on substitution,

1/[1-(0.91)²] { [(x-56.5)/26.3]² - 2(0.91)[(x-56.5)/26.3][(y-49.0)/22.9]
+ [(y-49.0)/22.9]² } = 5.99.

This ellipse falls within (is inscribed within) a rectangle with
center at (x̄, ȳ) and with two sides of lengths 2sx√5.99 and
2sy√5.99, respectively. In this example,

2sx√5.99 = 128.9
2sy√5.99 = 112.0.

Note that √5.99 is the square root of the χ² value corre-
sponding to the confidence level selected.
The computations are tedious and should be programmed for
computerized solution on repeated use. Another approach is to
use the ellipse plotting procedure adapted for the NORMIX
cluster analysis.14
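As a sketch (assuming Python), the check of Equation 3-20 reduces to evaluating a quadratic form for each data point and comparing it with the χ² critical value:

```python
def ellipse_stat(x, y, xbar, ybar, sx, sy, r):
    """Left-hand side of Equation 3-20 for one data point."""
    u = (x - xbar) / sx
    v = (y - ybar) / sy
    return (u * u - 2 * r * u * v + v * v) / (1 - r * r)

# Statistics from the bivariate normal case of the worked example
stat = ellipse_stat(175, 129, 56.5, 49.0, 26.3, 22.9, 0.91)
flag = stat > 5.99    # True: (175, 129) falls outside the 95% ellipse
```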
Calculations (Assuming Lognormal Distribution)
If the analysis is performed using the logarithms of x and
y (say x' and y'), one can substitute x' and y' throughout the
above steps. After all of the computations have been completed
using the logarithms, transform the results back to the original
3-43
-------
data by using either e^x' or 10^x', depending on whether natural or
common logarithms have been used.
The computations for the logarithms are:

x̄' = 3.94        ȳ' = 3.79
sx' = 0.44       sy' = 0.47
r = 0.90

The lengths of the sides of the rectangle are:

2sx'√5.99 = 2.15
2sy'√5.99 = 2.30.
If a data point falls outside of the ellipse then the values
for both x and y should be flagged for checking.
There are four data points outside the 95% confidence
ellipse. If the data had been studied one variable at a time,
the point (175,129) would be subject to question, as indicated in
Section 3.3.4. Even though this point falls outside the ellipse
it does not appear to be inconsistent with the remaining data
based on data from both of the sites; however, further study of
these relatively high values would be advisable. The other three
data points have at least one coordinate value within the range
of the other data, and studying one variable at a time would not
necessarily suggest these coordinate values as possible anoma-
lies. Two of these points should be studied further, based on
the correlated data.
3.5 REFERENCES
1. U.S. Department of Commerce. Computer Science and
Technology: Performance Assurance and Data Integrity
Practices. National Bureau of Standards, Washington,
D.C. January 1978.
2. Data Validation Program for SAROAD, Northrup Services,
EST-TN-78-09, December 1978, (also see Program Documen-
tation Manual, EMSL).
3. Barnett, V., and T. Lewis. Outliers in Statistical
Data. John Wiley and Sons, New York, 1978.
4. U.S. Environmental Protection Agency. Screening Proce-
dures for Ambient Air Quality Data. EPA-450/2-78-037,
July 1978.
3-44
-------
5. W. F. Hunt, Jr., T. C. Curran, N. H. Frank, and R. B.
Faoro, "Use of Statistical Quality Control Procedures
in Achieving and Maintaining Clean Air," Transactions
of the Joint European Organization for Quality Control/
International Academy for Quality Conference, Venice
Lido, Italy, September 1975.
6. W. F. Hunt, Jr., R. B. Faoro, T. C. Curran, and W. M.
Cox, "The Application of Quality Control Procedures to
the Ambient Air Pollution Problem in the USA," Trans-
actions of the European Organization for Quality Con-
trol, Copenhagen, Denmark, June 1976.
7. W. F. Hunt, Jr., R. B. Faoro, and S. K. Goranson, "A
Comparison of the Dixon Ratio Test and Shewhart Control
Test Applied to the National Aerometric Data Bank,"
Transactions of the American Society for Quality Con-
trol, Toronto, Canada, June 1976.
8. Grubbs, F. E., and G. Beck. Extension of Sample Sizes
and Percentage Points for Significance Test of Outlying
Observations. Technometrics, Vol. 14, No. 4, November
1972.
9. Curran, T. C., W. F. Hunt, Jr., and R. B. Faoro.
Quality Control for Hourly Air Pollution Data. Pre-
sented at the 31st Annual Technical Conference of the
American Society for Quality Control, Philadelphia,
May 16-18, 1977.
10. Johnson, T. A Comparison of the Two-Parameter Weibull
and Lognormal Distributions Fitted to Ambient Ozone
Data. PEDCo Environmental, Inc., Durham, North
Carolina. Presented at the Air Pollution Control
Association Specialty Conference on Quality Assurance
in Air Pollution Measurement, New Orleans, March 11-14,
1979.
11. Larsen, R. I. A Mathematical Model for Relating Air
Quality Measurements to Air Quality Standards, Publi-
cation No. AP-89, U.S. Environmental Protection Agency,
1971.
12. Marriott, F. H. C. The Interpretation of Multiple
Observations. Academic Press, New York, 1974.
13. Hawkins, D. M. The Detection of Errors in Multivariate
Data Using Principal Components. Journal of the
American Statistical Association, Vol. 69, No. 346.
1974.
14. Wolfe, J. H. NORMIX: Computation Methods for Esti-
mating the Parameters of Multivariate Normal Mixtures
of Distributions, Research Memorandum SRM 68-2. U.S.
Naval Personnel Research Activity, San Diego, 1967.
3-45
-------
15. U.S. Environmental Protection Agency. Guidelines for
Air Quality Maintenance Planning and Analysis. Vol.
11. Air Quality Monitoring and Data Analysis. EPA-
450/4-74-012, 1974.
16. U.S. Environmental Protection Agency. Quality Assur-
ance and Data Validation for the Regional Air Monitor-
ing System of the St. Louis Regional Air Pollution
Study. EPA-600/4-76-016, 1976.
17. W. F. Hunt, Jr., J. B. Clark, and S. K. Goranson, "The
Shewhart Control Chart Test: A Recommended Procedure
for Screening 24-Hour Air Pollution Measurements," J.
Air Poll. Control Assoc. 28:508 (1979).
18. Grant, E. L., and R. S. Leavenworth. Statistical
Quality Control. McGraw-Hill Book Company, New York.
19. Siegel, S. Nonparametric Statistics for the Be-
havioral Sciences, McGraw-Hill, 1956.
20. Remington, R. D., and M. A. Schork. Statistics with
Applications to the Biological and Health Sciences.
Prentice-Hall, Inc., Englewood Cliffs, New Jersey,
1970.
21. Hald, A. Statistical Theory with Engineering Applica-
tions. John Wiley and Sons, New York, 1952.
3-46
-------
4.0 SELECTION AND IMPLEMENTATION OF PROCEDURES
There are several factors to be evaluated before one can
select the most appropriate data validation procedure for the
specific application. These factors can be categorized into two
sets of decision criteria: (1) those based on an organizational
perspective and (2) those based on analytical characteristics.
Table 4-1 gives a breakdown of the factors to be considered in
the selection of data validation procedures.
TABLE 4-1. FACTORS TO CONSIDER IN THE SELECTION
OF DATA VALIDATION PROCEDURES
Organizational Criteria
1. Number of data sets
2. Historical data requirements
3. Nature of data anomalies
4. Manual methods
5. Continuous methods
6. Strip chart data
7. Magnetic tape data
8. Data transmitted by telephone lines
9. Timeliness of the procedure
Analytical Criteria
1. Statistical sophistication
2. Computational requirements
3. Expense of analysis
4. Sensitivity of the procedure
5. Use of data
The organizational criteria can be used for an initial screen-
ing of the procedures based on the data needs, the analytical
capabilities of the agency, and staff training. For example, a
local agency with limited staff and without computer facilities
and statistical support would select from those procedures in
Sections 3.1.1, 3.1.2, 3.1.3, 3.2.1, 3.3.1, and 3.3.2--that is,
4-1
-------
data ID checks, unusual event review, deterministic relationship
checks, data plots, gross limit checks, and pattern checks. On
the other hand, a Federal agency with extensive capabilities can
use any of the validation procedures with heavy emphasis on com-
puterized procedures, computer graphics, Shewhart control charts,
the gap test, and/or the Johnson p test.
The factors are described in Sections 4.1 and 4.2. Table
4-2 gives a selection procedure using three scenarios based on
whether or not a local, State, or Federal agency has computer and
statistical resources available. Section 4.3 contains a discus-
sion of the implementation of the data validation process.
TABLE 4-2. SELECTION OF DATA VALIDATION PROCEDURES

Scenario of agency/resources          Select procedures from those in
                                      subsections listed below

State or local agency
1. Without both computer and          3.1.1, 3.1.2, 3.1.3,
   statistical support                3.2.1 (limited to manual effort),
                                      3.3.1, 3.3.2
2. With computer but limited          The above plus 3.2.1
   statistical support                (computerized graphics), 3.2.2,
                                      3.3.3, 3.3.4, 3.4.1

Federal, State, or local agency
   With both computer and             Any procedure in Section 3.
   statistical support
4.1 ORGANIZATIONAL CRITERIA
The selection criteria which can be used in preliminary
screening of the validation procedures are given in this section.
4.1.1 Number of Data Sets
The test procedures for internal consistency are designed
for the validation of a single data set, without the use of his-
torical data. If there is an instrument error which consistently
alters all of the values in the data set, these test procedures
will not be of any value. Hence it is necessary to use at least
4-2
-------
one test which identifies or flags data based on previous or
historical data or on comparison with other (parallel) data
sets. If an extensive analysis of past data is available or can
be readily performed then the test procedures of Section 3.3
should be considered, otherwise the data should be compared with
other parallel data sets which are described in Section 3.4.
4.1.2 Historical Data Requirements
The major test procedures which require historical data for
setting limits are the gross limit checks and pattern tests, the
Shewhart control chart, and the intersite correlation test. In
all of these, data are required over an extensive time period so
that the results can be applied to new data. Limits can be al-
tered as appropriate, to take into account both the additional
data and the experience gained in using the procedures.
4.1.3 Nature of Data Anomalies
Some tests are designed for a single outlier (extreme
value), whereas other tests are designed to detect shifts in
either the variation of the data and/or in the mean (or median).
For example, there are Dixon ratio tests for a single outlier and
an outlier pair; the sign test checks for shifts in the mean or
median. The Shewhart test detects shifts in either the mean or
the variance (range or standard deviation) and anomalous outliers
by means of the range check. The graphical techniques are likely
to identify any one of the data anomaly types if the graph is not
too complex.
4.1.4 Strip Chart Data
The analysis of strip chart data can be limited to simple
visual scans for gross anomalies. However, it may also be de-
sirable to establish limits within which successive values must
fall for the data to be valid. Hence, the successive difference
test would be particularly useful. Also, gross limit and pattern
tests can be easily applied to strip chart data.
4-3
-------
4.1.5 Size of Data Set (Manual and Continuous Methods, Table
4-1)
Manual methods are most appropriate for small data sets,
typically daily averages for each day or for every sixth day.
Procedures which can be performed without a computer are the
Dixon ratio, the Shewhart control chart, data plots, and the
routine procedures described in Sections 3.1.1, 3.1.2, and 3.1.3.
Computer methods are desirable for validating large data
sets (e.g., from continuous monitoring of hourly average concen-
trations) since data handling and calculations by manual means
can become tedious and can result in errors. However, the data
validation procedure can be added to the standard analysis and
data storage procedures to minimize the total computer time.
4.1.6 Magnetic Tape Data
Data recorded on magnetic tape can be easily validated using
computerized methods. Almost all of the procedures can be con-
sidered. The more sensitive and economical ones for the expected
types of anomalies would be best used. Turnaround time should be
minimal. Section 3.1.4 describes several procedures which should
be routinely used to maintain data integrity.1 For example, the
data can be blocked into logical periods (days, weeks, months,
etc.) and tests performed for data sets using internal consistency
checks (e.g., Dixon ratio) or comparison with historical data
(e.g., control chart). In addition, routine checks can be made of
the descriptive identification codes.
4.1.7 Data Transmitted by Telephone Lines
The recommendations for data transmitted by telephone lines
are the same as those for data on magnetic tape. Experience
gained through CHAMP suggests it is not always advisable to use
the telemetry data when the magnetic tape data are available and
when the latter can be used in a timely manner. In any case,
there should be data validation checks to ensure that the telem-
etry data are the same as the data on magnetic tape and/or the
raw data. Section 3.1.4 describes several procedures to be con-
sidered for routine use.1
4-4
-------
The principal result of transmission error is the loss or
alteration of data. One way to check for transmission error is
to transmit the data a second time and then pair the two data
streams. Gaps and alterations will be immediately apparent un-
less the transmission error is systematic.
4.1.8 Timeliness of Procedure
One of the key aspects of a good data validation program is
the timely identification of data anomalies and the feedback of
this information to the data source for corrective action and for
review of the raw data and background information needed to check
the validity of the suspect data. The shortest possible turn-
around time is desired so that the information loop can be closed
before vital information is lost and/or before more questionable
data are generated. The combination of procedures selected must
satisfy these time constraints. For example, some procedures
with very quick response times can be used jointly with more com-
plex procedures in order to catch gross errors quickly and to
catch less obvious errors within a slower response time but still
within a time frame that can result in useful data validation
implementation.
4.2 ANALYTICAL CRITERIA
The selection factors which are analytical in nature are
described in this section.
4.2.1 Statistical Sophistication
The test procedures are arranged within each subsection of
Section 3 from those that are least sophisticated to those that
are most complex. Although some test procedures may appear to be
statistically complex, these same test procedures may be very
convenient to use. On the other hand, some of those with least
sophistication can require considerable effort (e.g., plotting of
all of the data).
4.2.2 Computational Requirements
Although a computerized approach is desirable for several of
the procedures, all of them can be accomplished by manual means.
4-5
-------
The only procedures requiring computer help for routine use with
large volumes of data are data plots, gap and Johnson p tests,
intersite correlation, and rank sum tests. For small data sets,
many of these tests can be performed easily by manual means.
Some procedures require considerable computation to derive the
limits from historical data, but the test is easily performed
after this initial step, which need not be repeated until one
suspects that the limits need to be changed to reflect real
changes in air quality.
4.2.3 Expense of Analysis
The expense of the analysis in manual/computer time is rel-
atively small compared to data costs and/or the cost of provid-
ing invalid data to a data bank for use by a decision maker.
However, these costs must be considered from the standpoint of
using an efficient data validation procedure. If a procedure
ignores too many data anomalies or flags too many good data,
either the limits need to be adjusted or the procedure eliminated
from further use. The expensive procedures are not always the
best (e.g., the Shewhart test will not cost much on a continuing
basis, and it will detect many types of invalid data). The
graphical plots can be very time consuming, by manual means or by
computer; however, the results will also prove very useful for
detecting invalid data that would be missed by other procedures.
4.2.4 Sensitivity of the Procedure
The capability of the data validation procedure to correctly
identify invalid data is critical in the selection. One should
also refer to Section 4.1.3 above which discusses the nature of
the data anomaly. Some procedures will identify a single out-
lier; others will detect shifts in the median (or mean) level,
while still others will detect changes in the variability of the
data.
Some procedures, such as the Dixon ratio test for a single
large value (Section 3.2.2.1), may be very insensitive if there
is a second large value. In these cases, another Dixon ratio
4-6
-------
test (Section 3.2.2.2) can be used which is sensitive to pairs of
outliers. It is desirable that the user of the procedure deter-
mine the types of invalid data and then design the data valida-
tion program to discriminate for these specific types of data.
Thus it is not possible to say that one procedure is always best;
rather, the appropriateness of a procedure varies with the type
of data anomaly.
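A minimal sketch of the single-outlier versus paired-outlier distinction follows. The ratio definitions use the common Dixon r10 and r20 statistics, which are assumed here rather than quoted from Section 3.2.2; critical values would still come from the appropriate tables.

```python
# r10 tests the single largest value; r20 skips the second-largest
# and so is less easily masked when two large outliers occur together.
def dixon_r10(x):
    s = sorted(x)
    return (s[-1] - s[-2]) / (s[-1] - s[0])

def dixon_r20(x):
    s = sorted(x)
    return (s[-1] - s[-3]) / (s[-1] - s[0])

data = [12, 13, 14, 15, 48, 50]   # two suspiciously large values
# r10 is small here (masked by the pair of large values);
# r20 is large and would flag the problem.
```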
Decisions concerning sensitivity requirements can be made
best after some experience has been gained in using the proce-
dure. The data analyst should continually note the percentage of
flagged data which upon investigation are found to be erroneous
values. A control chart may be used to maintain a record of the
performance of the data validation procedure. The sensitivity
and cost of implementation of different procedures should be com-
pared. In comparing the sensitivities one must be careful to
obtain an independent assessment of each procedure. This may be
a problem when the procedures are applied successively (or in
series). An exponential identifier (e.g., see Section 5.5.1 for
the use of 10^34 for invalid data) can be used in computer appli-
cations to indicate the procedure(s) which identified the data
as questionable and to aid in statistical analysis of the ef-
ficiency of the procedures.
4.2.5 Use of Data
The intended use of data should be considered in determining
the time and level of effort to be put into data validation. If
the data are to be used in making important policy decisions,
then they should be more carefully screened than if the data are
being used in a less critical role. If in doubt, err on the side
of more careful validation; data are often used in a manner un-
known to the generator/originator of the data.
4.3 IMPLEMENTATION OF DATA VALIDATION PROCESS
Making the data validation process effective requires con-
sideration of several important factors in both the selection
and implementation of the procedures. Some of the factors to be
4-7
-------
considered in implementation were identified in Section 2.4 and
are discussed in this section.
Initially, a person with responsibility for the data vali-
dation activities needs to be identified. In a local or
State agency, this may be a part-time function; whereas in a re-
gional or Federal agency, one or more persons may be given this
responsibility on a full-time basis.
The data validation procedures need to be selected by taking
into consideration the several factors identified in Sections 4.1
and 4.2. The process must be consistent with, for example, the
use of the data, the network size, data volume, and staff size
and training.
After the selection of the procedures, the limits, patterns,
and confidence levels must be determined (using historical data)
for a particular application to obtain an efficient procedure.
The limits may need to be adjusted after some experience with the
procedures. The tighter the limits, the greater the quantity of
data that are flagged and hence the greater is the effort of
checking on the flagged data. There is a tradeoff between the
stringency of the data validation and the costs of flagging too
many data values (and the completeness of the remaining data,
particularly if the data identified by the validation process are
deleted or flagged relative to further use in some computations).
A data validation plan should be prepared and documented as
a part of the quality assurance plan (as described in Volume I of
the Quality Assurance Handbook).2 This plan should include:
1. The data flow (reporting) hierarchy;
2. The data validation check points;
3. The methods for checking the data;
4. The documentation of the process (the data values flag-
ged for investigation, the values inferred to be anomalies and
those for which no clear decision could be made);
5. The techniques used for anomaly investigation;
6. The reporting schedules;
7. The mechanism for feedback of validation information
to the data collection, to the quality control procedures, and to
the validation process;
4-8
-------
8. The resources (including costs) for performing data
validation;3
9. A listing of the QC checks used to reject data outright
and without further investigation (e.g., the use of zero and span
checks to reject data when the span drift exceeds a specified
limit);
10. The routine handling of data in blocks (or time
periods) with a corresponding sign-off form indicating the data
have been subjected to the data validation process; and
11. The procedures for tracking the causes of invalid data.
A periodic reporting procedure must be used to summarize
the performance of the data validation process with regard to
this plan. Finally, the validation process can be made most ef-
fective by adjusting, if necessary, the limits (or criteria) in
order to minimize the number or percentage of incorrect infer-
ences with regard to possible data anomalies. Each investigation
costs time and money, so the incorrect inferences must be mini-
mized. A good documentation system should enable the validator
to solve this problem.3 An effective reporting and feedback
system will also contribute to the motivation of the personnel
involved with the process and to timely corrective action.
4.4 REFERENCES
1. U.S. Department of Commerce. Computer Science and
Technology: Performance Assurance and Data Integrity
Practices. National Bureau of Standards, Washington,
D.C., January 1978.
2. U.S. Environmental Protection Agency. Quality Assur-
ance Handbook. Vol. I. Principles. EPA-600/9-76-005,
1976.
3. Smith, F. Guideline for the Development and Implemen-
tation of a Quality Cost System for Air Pollution Mea-
surement Programs. Research Triangle Institute.
RTI/1507/01F, November 1979.
4-9
-------
5.0 HYPOTHETICAL EXAMPLES AND CASE STUDIES
Two hypothetical examples and three case studies are de-
scribed in this section. The two examples are very simple ones
involving ambient air monitoring and source testing and are given
first in Sections 5.1 and 5.2. The three case studies are given
last and in increasing order of sophistication (Sections 5.3,
5.4, and 5.5). Few simple examples like those in Sections 5.1
and 5.2 are documented in the quality assurance or pollution con-
trol literature, probably because the individuals involved do not
feel their programs reveal any innovative procedures. Hence these
examples are hypothetical, though they are intended to reflect
realistically the needs of a local agency performing ambient air
monitoring and source tests.
5.1 HYPOTHETICAL EXAMPLE FOR AMBIENT AIR MONITORING
Suppose that a local agency has four hi-vol monitoring
stations for TSP, operating every sixth day, and two stations for
ozone, operating 5 months per year. At one station, temperature,
rainfall, and wind data are being obtained. Two cases will be
considered: one without and one with computerized methods. In
each case the preferred procedures will be selected for the data
validation process.
5.1.1 Without Computerized Support - Several very simple and
useful manual procedures which can be used to validate the data
will be suggested for each pollutant/parameter.
TSP - There are approximately 60 measurements per year at
each site. If all four sites have different monitoring sched-
ules, the data from each site should be validated independently.
A gross limit check should be established for TSP, perhaps the
air quality standard value if this value has a small probability
of occurrence, say less than 5%. A gross limit check value can
be obtained by analyzing the historical data for the most recent
5-1
-------
year and determining a single value which has a small likelihood
of exceedance. Gross limit checks are discussed in Section
3.3.1. In some cases, multiple limits or a pattern test should
be used depending on the results of the following analysis for
each of the four stations:
1. Analyze 1 year of data (about 60 values), examine the
data for a seasonal/day-of-week pattern (e.g., four seasons,
weekday and weekend day, eight categories),
2. If there are no significant patterns, then use a single
gross limit check; otherwise, determine a limit for the indicated
pattern, say weekend versus weekday values.
Typical TSP data can be adequately approximated by the
lognormal distribution; hence it is usually necessary to make the
lognormal transformation of the data and then to statistically
analyze the transformed data (including the analysis described
above) and ultimately to transform back (if desired) to the
actual data values to obtain the appropriate limits. The limits
determined in this way are usually larger but more statistically
accurate than if no transformation were used.
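A sketch of this limit derivation follows (the 5% exceedance level corresponds to z = 1.645 in the upper tail; the function name and interface are illustrative, not from this report):

```python
# Derive a gross limit from a year of historical TSP data by fitting a
# lognormal distribution: log-transform, take mean + z*sd of the logs,
# then transform back to concentration units.
import math

def lognormal_gross_limit(values, z=1.645):
    logs = [math.log(v) for v in values]
    n = len(logs)
    mean = sum(logs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in logs) / (n - 1))
    return math.exp(mean + z * sd)   # back-transform to actual units
```

As the text notes, the limit obtained this way is usually larger than one computed from the untransformed data, because it respects the long upper tail of the lognormal distribution.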
A second procedure, which is usually beneficial in detecting
changes in the data from one or more of the four hi-vol stations,
is to develop a quality control chart for the following:
1. The hi-vol average and range values for each set of
five consecutive readings (about 1 month of data) for each of the
four stations, and
2. The average and range for the four station averages.
The four charts developed in step 1 monitor the individual sta-
tions for suspicious changes in either the average or range. The
last chart monitors the group of four stations for suspicious
changes in the overall average and for the variation among the
four station averages. To develop the control limits for these
charts, use the data for 1 year (60 values, or 12 sets of five
each) and the methodology described in Appendix H, Volume I of
the Quality Assurance Handbook and in Appendix C of this report.1
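The limit computation for step 1 can be sketched as below, using the standard Shewhart factors for subgroups of five (A2 = 0.577, D3 = 0, D4 = 2.114); the exact methodology is in the appendices cited above.

```python
# X-bar and R control limits from historical subgroups of five readings
# (e.g., 12 sets covering one year of every-sixth-day hi-vol data).
def xbar_r_limits(subgroups):
    means = [sum(g) / len(g) for g in subgroups]
    ranges = [max(g) - min(g) for g in subgroups]
    xbar = sum(means) / len(means)    # grand average
    rbar = sum(ranges) / len(ranges)  # average range
    return {"xbar_ucl": xbar + 0.577 * rbar,   # A2 for n = 5
            "xbar_lcl": xbar - 0.577 * rbar,
            "r_ucl": 2.114 * rbar,             # D4 for n = 5
            "r_lcl": 0.0}                      # D3 for n = 5
```

Future subgroup averages and ranges falling outside these limits would be flagged as suspicious changes.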
5-2
-------
Finally, a year of data can be plotted in a few hours since
there are only 240 measurements per year (60 values at four
sites). The data for all four hi-vols should be plotted on the
same graph or on graphs placed one below the other to facilitate
comparisons.
O3 - The two O3 sites are assumed to be in operation for 5
months of the year and to record hourly average values. For
these data, the preferred data validation procedures would be
gross limit and pattern checks of Section 3.1. Limits such as
those in Table 3-2 of Section 3.3 should be derived from histor-
ical data for at least 3 years if available. The data should be
statistically analyzed for day-night patterns and perhaps for two
groupings within the 5 months—that is, the 3 months with higher
O3 levels and 2 months with lower O3 levels. Each of the 5
months also could be treated separately in the analysis and then
grouped as the data and the statistical analysis indicate.
The Shewhart quality control chart may be used to monitor
both stations. The historical data would need to be grouped in
the same manner as they are to be used in the quality control
chart and then analyzed for averages and ranges to obtain the
quality control limits. Note: The data do not have to be plot-
ted on charts to apply this procedure; however, the visible chart
may be beneficial in detecting trends/patterns in the data that
would not be evident on review of a data record.
Meteorological Data - Assume that hourly average tempera-
ture, rainfall, windspeed, and wind direction are available for
the data validation.
The following pattern tests should be used with necessary
modifications to be consistent with the meteorological conditions
for the agency:
1. The minimum hourly average should occur between 04-09
hours, and the maximum between 12-17 hours.
2. Hourly changes should not exceed 6°C (10°F). If a de-
crease of 6°C (10°F) or more occurs, the wind direction should
have changed to a more northerly direction, and/or rainfall of
0.15 in. or more should have occurred.
5-3
-------
3. Hourly values should not fall outside of the specified
maximum and minimum values for each month. These limit values
need to be determined from historical data.
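Pattern test 2 above can be sketched as follows (the per-hour wind-shift flag and rainfall amounts are assumed inputs derived from the wind and rain records):

```python
# Flag an hourly temperature drop of 6 deg C or more unless it is
# explained by a shift to a more northerly wind or >= 0.15 in. of rain.
def check_temp_drops(temps_c, northerly_shift, rainfall_in):
    """temps_c: hourly averages; northerly_shift: per-hour booleans;
    rainfall_in: per-hour rainfall. Returns unexplained-drop hours."""
    flagged = []
    for h in range(1, len(temps_c)):
        drop = temps_c[h - 1] - temps_c[h]
        if drop >= 6 and not (northerly_shift[h] or rainfall_in[h] >= 0.15):
            flagged.append(h)
    return flagged
```

The thresholds would be adjusted to be consistent with the meteorological conditions for the agency, as noted above.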
5.1.2 With Computerized Support -
TSP - The routine manual procedures suggested in Section
5.1.1 may be computerized if desired. No additional procedures
are recommended because the data are not voluminous and because
the recommended procedures can be performed manually.
O3 - In the case of O3, the volume of the data makes the use
of a computer worthwhile. For example, a plot of the O3 data by
hour can reveal any unusual values and/or changes in consecutive
values from the typical diurnal and seasonal patterns.
In addition, a check of the extreme values for 5 months
using the gap or "Johnson" p test (Sections 3.2.4 and 3.2.5)
could be beneficial. Particularly in the Johnson p test, the
distribution of the upper portion of the data is approximated in
the process of checking the extreme values. Hence the identifi-
cation of a possible data anomaly may actually be of secondary
interest in the use of the Johnson p test. The computations can
be easily performed on a programmable calculator or a computer
given the data values and procedure(s) in Section 3.2.5 and
Appendix B.
Meteorological Data - The availability of a computer should
not materially affect the selection of the procedure; however, it
may result in the use of a computerized method of performing the
routine manual analyses.
5.2 HYPOTHETICAL EXAMPLE FOR SOURCE TESTING
Suppose that a Method 6 source test is performed at a util-
ity plant with a flue gas desulfurization (FGD) system. One of
the obvious data validation checks to perform is a report review
using gross limit and/or parameter relationship checks. Some
data checks to be performed are:
1. Barometric pressure - At sea level, the value should be
approximately 760 mm (30 in.) Hg and between 736 and 787 mm (29 and
31 in.) Hg; at other elevations, the value should decrease 2.8 cm
(1.1 in.) Hg per 305 m (1000 ft) above sea level.
5-4
-------
2. Moisture data - Nomographs provide moisture content at
saturation as a function of stack absolute pressure and stack gas
temperature. If the reported value is higher than the maximum
that was read from the nomograph, the data are suspect.
3. Volumetric flow rate data - These data are difficult to
cross-check accurately, but gross limits can be determined from
engineering considerations.2 In addition, the data should be
compared with data obtained under similar process conditions from
previous tests on the same source, similar sources, or tests per-
formed at the inlet to the control device.
4. Emission results - These are the most difficult but
also the most important data to validate. A sulfur balance
should yield a good check of the SO2 emission results. However,
there is considerable variability in the sulfur in coal (relative
standard deviation of 15% is not unusual). The limited amount of
data for a series of three runs does not make the use of a Dixon
ratio or Grubbs test practical.2 That is, the Dixon test might
lead to eliminating one extreme value from a group of three when
the extreme actually reflects process variations on the order of
two to one caused by factors beyond the control of the plant.
Hence, decisions concerning the validity of data need also be
based on other experiences with this or similar plants.
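The barometric pressure check (item 1 above) can be sketched as follows; the tolerance band is assumed from the 736 to 787 mm Hg sea-level range given there, and the function name is illustrative.

```python
# Validate a reported barometric pressure against the expected value:
# about 760 mm Hg at sea level, decreasing roughly 28 mm Hg per 305 m
# (1000 ft) of elevation.
def barometric_pressure_ok(reported_mm_hg, elevation_m, tolerance_mm=27):
    """Tolerance of ~27 mm Hg mirrors the 736-787 mm sea-level band."""
    expected = 760.0 - 28.0 * (elevation_m / 305.0)
    return abs(reported_mm_hg - expected) <= tolerance_mm

# At sea level, 750 mm Hg passes; at 1000 m the expected value is
# about 668 mm Hg, so 750 mm Hg would be flagged as suspect.
```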
5.3 CASE STUDY OF A MANUAL DATA VALIDATION SYSTEM
This section describes an automated monitoring and data
acquisition system for which the validation procedures are per-
formed manually. The monitoring system includes 26 sites
which monitor 124 environmental parameters. In the case of one
parameter, SO2, a comprehensive formal quality assurance program
has been developed. In the case of the other parameters, no
formal quality assurance procedures have been documented. A
brief description of the validation procedures developed for the
parameters without the quality assurance programs will be fol-
lowed by a more detailed description of the validation procedure
for SO2. The case study illustrates the complementary nature of
5-5
-------
the quality control and the data validation procedures within a
monitoring system, and it shows clearly that, as more comprehen-
sive quality control procedures are applied during data genera-
tion, the need for in-depth data validation procedures can be
limited to less frequent review or to spot checks of items listed
under QC below. The network to which this case refers is not
identified for proprietary reasons.
5.3.1 Quality Control
The major QC procedures for the instruments other than the
continuous SO2 monitors are: daily zero-span checks, 6-month
multipoint calibration, and 6-month scheduled preventive main-
tenance. Instruments are rotated in the field. Each instrument
is brought to the central laboratory for maintenance and calibra-
tion every 6 months and is replaced in the field by another in-
strument (a better practice would be to calibrate the instruments
in the field). Instruments are checked on site by technicians
two or three times a week on the average. Data are transmitted
automatically to a central computer. Strip charts are generated
as backup to the computer data base.
A total quality assurance program was designed for the SO2
monitors in this network. Field level QC procedures include the
following:
1. Operator checklists - Upon entering a monitoring shel-
ter, the field technician completes a checklist of key parameter
observations: pressure gauge readings, critical temperature
readings, weather or site-specific conditions, and so forth, that
may significantly influence SO2 data quality. Checklists are
designed to instruct the operator to take specific action under
specific parameter conditions—for example, if a parameter is out
of tolerance, action may include phone consultation with the
chief technician. Checklists are forwarded to the laboratory
weekly.
2. Zero-span control charts - The field operator checks
the SO2 strip charts and manually plots the zero-span results for
each day on a control chart. Predetermined control limits are
5-6
-------
established for each zero-span control chart. Control charts are
forwarded to the central laboratory for review and approval
monthly. If an out-of-limits condition occurs during the month,
the field technician consults by phone with the chief technician.
3. Maintenance reports - Formal reports of unscheduled
maintenance performed in the field are completed as required and
forwarded weekly to the chief technician for review. This proce-
dure facilitates identification of chronic instrument problems.
4. Strip chart review - The field operator looks for
unusual traces on the strip chart that indicate equipment mal-
functions. If malfunctions are suspected, the operator makes a
note on the strip chart and flags the problem for the chief
technician. Strip charts are forwarded to the central laboratory
for review approximately monthly.
5. Calibration and preventive maintenance - Instruments
are calibrated and maintained on a rotating 6-month schedule.
(These schedules were selected prior to recent quality assurance
standards which recommend a quarterly schedule.)
6. Laboratory strip chart scan - The quality assurance co-
ordinator scans SO2 strip charts upon receipt from the field to
identify uncharacteristic traces that may have been missed by the
field technician. For time periods when the field operator, the
chief technician, or the quality assurance coordinator knows that
the data are unacceptable, a form is completed to inform a data
validator to delete the data from the data base. A data delete
form is prepared only when there are backup data or information
(e.g., zero/span data, a remark in the daily log, etc.) to
justify the action.
7. Quality assurance filing system - All maintenance and
calibration records, zero-span control charts, field checklists,
strip charts, and data delete or change request forms are marked
with the site and instrument identification. These documents are
filed permanently by site, instrument, and date to provide future
backup for answering data validation questions.
5-7
-------
In addition to the QC procedures above, the central air quality
monitoring laboratory has a teletype capability to poll any
monitoring site for specific parameters.
8. Data review and audit - The quality assurance coordi-
nator performs an annual audit of the files and traces the oc-
currence of events from causes through subsequent corrective
actions. Also the laboratory supervisor reviews the data delete
forms on a monthly basis to suggest corrective actions.
5.3.2 Data Validation
The actual data validation in this system is performed by a
contractor not associated with the routine operation of the
system. The validation contractor's responsibilities include:
1. Strip chart review,
2. Interparameter relationship review,
3. Recapture of valid data lost during data transmission,
4. Confirmation of excursions above the ambient air qual-
ity standards, and
5. Deletion from the data file of all data invalidated
through the QC or validation procedures.
The following two exceptions apply:
1. Strip chart review is not performed since the QC pro-
cedures for SO2 specify that this technique be applied at the
field level. Although this is the only real point of departure
between validation of SO2 data and validation of the other param-
eters, it is significant because an in-depth, after-the-fact re-
view of strip charts is time consuming and expensive for a net-
work of this size. For example, after-the-fact chart review for
SO2 would consume an average of 1 to 2 hours of technician
time per site-parameter-month combination, depending on the kinds
of problems encountered and the volume of communication required
between the data validator and the monitoring laboratory.
2. Interparameter relationship review is applied to mete-
orological data only, except in the case of excursions above the
ambient air quality standards.
5-8
-------
5.3.2.1 Strip Chart Review - Strip charts are scanned manually
to identify questionable data periods. The purpose of the scan
is to identify:
1. Unusual spikes,
2. Invalid data periods identified by field technicians,
and
3. Uncharacteristic traces, for example, a relatively flat
or constant windspeed trace over several hours.
5.3.2.2 Interparameter Relationship Review - As indicated pre-
viously, the interparameter relationship review is performed for
meteorological parameters only. Hourly averages are reviewed
from formatted computer reports that display data for each hour
of each month. One review compares dewpoint and ambient tempera-
tures and then flags hourly dewpoint values that exceed the
hourly temperature. Daily average dewpoint values that exceed
the daily average temperature are also indicated. Another review
considers windspeeds measured at two tower levels. Data are
flagged whenever the average windspeed at the lower level exceeds
that at the upper level. Procedures like these must be based on
judgments by an experienced meteorologist. Validation instruc-
tions in many cases direct the validator to backup strip charts
for questionable time periods identified during the interparam-
eter relationship review.
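The two interparameter relationships described above can be sketched as a simple screening pass (the list-based interface is illustrative; the actual review works from formatted monthly computer reports):

```python
# Flag hours where dewpoint exceeds ambient temperature, or where the
# average windspeed at the lower tower level exceeds the upper level.
def interparameter_flags(dewpoints, temps, ws_lower, ws_upper):
    """All arguments are parallel lists of hourly averages.
    Returns the hour indices that fail either relationship."""
    return sorted({h for h in range(len(temps))
                   if dewpoints[h] > temps[h] or ws_lower[h] > ws_upper[h]})
```

Flagged hours would then be referred to the backup strip charts, with final judgments made by an experienced meteorologist as noted above.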
5.3.2.3 Recapture of Valid Data Lost During Data Transmission -
The data validation contractor receives a magnetic tape of data
collected through the data transmission system at the same time
the strip charts, data change requests, and data listings needed
for validation are received. The objective of the monitoring
system is to reach a level of at least 90 percent valid data
collection. The data validator reviews the data listings for
time periods when data capture is below 90 percent. The valida-
tor then checks the strip charts for those time periods and
reduces data for specific hours that appear to be valid but were
not picked up by the data transmission system. The data values
are picked up from the strip charts through an electronic data
digitizer and then used to update the data file on the magnetic
tape.
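The screening for low-capture periods can be sketched as follows (the period granularity is an assumption; any block of hours or days could be used):

```python
# Compute percent data capture per period and list periods below the
# 90 percent objective; these become candidates for strip chart recovery.
def low_capture_periods(valid_counts, expected_count, objective=0.90):
    """valid_counts: valid observations per period (e.g., hours per day);
    expected_count: the number possible per period."""
    return [i for i, n in enumerate(valid_counts)
            if n / expected_count < objective]

# e.g., 24 possible hourly values per day
days = low_capture_periods([24, 20, 23, 18], 24)
```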
5-9
-------
5.3.2.4 Confirmation of Excursions above the Ambient Air Quality
Standards - The data validator receives a listing showing specific
hour, 3-hour, or 24-hour values that constitute excursions above
the ambient air quality standards. The strip charts are examined
for these specific time periods to verify the reported values.
If the reported data differ from the strip chart values and the
strip chart values are valid, then the data are changed in the
file. The standards report is annotated and returned to the
client's quality assurance coordinator for filing.
The confirmation procedure requires the data validators to
check meteorological data records and strip charts for the time
periods when standards excursions were reported. Unusual condi-
tions or invalid meteorological data are also recorded on the
standards report.
5.3.2.5 Deletion of Invalid Data - The final step in the valida-
tion procedure is the update of the data tape. Data picked up
during data validation, as explained in Section 5.3.2.3, are
merged into the file. During the update procedure, data found to
be invalid by either the data validators, by the field operators
through the QC procedures, or by the quality assurance coordina-
tor (in the case of S02) are deleted from the file. A listing of
each value added to or deleted from the permanent file is pro-
duced for filing with each monitoring station's records.
5.4 CASE STUDY OF THE CHAMP AUTOMATED DATA VALIDATION SYSTEM
The U.S. EPA and the Rockwell International Science Center
designed and implemented an automated data acquisition system for
the Community Health Air Monitoring Program (CHAMP). CHAMP was
implemented to provide reliable air quality data to support the
Community Health Environmental Surveillance System (CHESS), a
program to study the effects of air pollutants on community
health.
The CHAMP network included 23 monitoring sites in six
cities. The network provided data for the 19 environmental
parameters listed in Table 5-1. Monitors at each site were
polled by minicomputers at the sites, and the central computer at
5-10
-------
TABLE 5-1. CHAMP ENVIRONMENTAL PARAMETERS
Computer code   Parameter
NOx             nitrogen oxides
NO              nitric oxide
NO2             nitrogen dioxide (calculated)
SNO             sampled nitrogen oxide
O3              ozone
SO2             sulfur dioxide
CH4             methane
THC             total hydrocarbons
NMHC            nonmethane hydrocarbons
WS              wind speed
VWM             vector wind magnitude
VWD             vector wind direction
TOUT            temperature outside
TIN             temperature inside
DEWP            dewpoint
BP              barometric pressure
HVFL            hi-vol flow
RSPI            respirable suspended particulate flow rate, inlet
RSPF            respirable suspended particulate flow rate, final
5-11
-------
Research Triangle Park (RTP), North Carolina, polled the sites
every 2 hours. These data were used to review system status at
each remote monitoring site and to provide feedback for field
personnel to perform maintenance as needed.
Magnetic tapes were generated at each site, and the tapes
were forwarded to RTP every 2 weeks. The data validation
procedures were applied to the data on tape rather than to the
data transmitted via telephone lines to the central facility be-
cause the data on the tapes were generally more complete and more
reliable than the telemetry data.
Strip charts for gaseous pollutants were generated at remote
sites. The strip charts provided a backup, and data that were
lost due to computer failure were recaptured from the strip
charts. The strip charts also provided a visual QC check for
station operators.
5.4.1 Quality Control Functions
Several automated QC checks were performed by the on-site
computers as the data were generated. Manual QC procedures were
used also.
The automated QC procedures were based on limits set for the
analog values of the critical operating parameters. Critical
limits were entered into the computer; if these limits were ex-
ceeded, then the associated 5-minute averages were flagged as
possibly invalid. Direct environmental measurements, referred to
as "primary channels," were checked for nonzero voltage, normal
setting of all valves, power to the instrument, and digitally
measured tolerances for proper ambient sampling. Instrument
operations (e.g., hydrogen flow in an S02 monitor), referred to
as "secondary channels," were also tested to verify that values
were within tolerances; and status conditions necessary for cor-
rect sampling were machine checked every five minutes. The same
control procedure was applied to the calibration operation to
verify proper flows of the calibration gases and correct valve
settings on the flow system. Figure 5-1 illustrates the QC pro-
cedure.
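The automated flagging step can be sketched as below; the parameter names and limit values are hypothetical stand-ins for the critical limits entered into the on-site computers.

```python
# A 5-minute average is flagged as possibly invalid if any critical
# operating parameter is outside its entered limits.
def flag_five_minute_average(value, parameters, limits):
    """parameters: name -> measured analog value; limits: name -> (lo, hi).
    Returns (value, flagged?)."""
    out_of_tolerance = [name for name, v in parameters.items()
                        if not (limits[name][0] <= v <= limits[name][1])]
    return value, len(out_of_tolerance) > 0

# Hypothetical secondary-channel check for an SO2 monitor:
val, flagged = flag_five_minute_average(
    0.042,
    {"h2_flow": 1.2, "sample_valve": 1.0},
    {"h2_flow": (0.8, 1.1), "sample_valve": (0.9, 1.1)})
```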
5-12
-------
[Figure 5-1 flowchart: Level I data are converted from volts to
engineering units; flow rates, valve settings, and calibration
constants are checked, and out-of-tolerance values are rejected;
if a zero and span check was performed within the last 24 hours,
the data are accepted as Level II data; otherwise they are
flagged as possibly invalid.]
Figure 5-1. Automated quality control tests.
5-13
-------
During calibration and instrument tests, the sampling indi-
cator was in the off position. During sampling, the indicator
was in the on position. During QC checking, the status bits were
read to determine the on or off condition.
Error conditions flagged by the automated system were re-
viewed daily at the central computer facility. As a result of
the QC review of data received, either personnel could be dis-
patched to remote sites for instrument adjustments or field
operators could be advised of needed changes. This real-time
turnaround afforded by automated QC checks was important for
maintaining continuous operation throughout the system. It
allowed needed adjustments to be made quickly so that a minimum
quantity of data was lost or invalidated.
Manual QC checks included visual inspection of control
charts for unusual values or large changes in reported data
values. If the field operator noted unusual events at a site
which might affect subsequent data validation or review, he or
she was instructed to enter explanatory entries into the data
file.
Manual QC calibration audits were performed on the entire
system quarterly. Zero-span calibration was performed every 3
days. A tickler file was maintained at each site to indicate
when preventive maintenance was required.
5.4.2 Data Validation
The data validation procedures were performed at the CHAMP
central computer facility at RTP. Computer listings of the data
that had been subjected to QC checks were reviewed.
A report listing the hourly average values for each of the
19 parameters (primary channels) measured at the remote monitor-
ing stations was inspected. For each station, the time (hour),
the parameter name, the parameter hourly average value, and the
number of valid 5-minute values used to compute the hourly
average were listed as shown in Figure 5-2. The data shown for
ozone for hours 13 through 15 illustrate the application of this
technique. For hour 13, twelve valid 5-minute values were used
to compute the hourly value of 0.1356. For hour 14, only four
valid 5-minute values were used in calculating the hourly average
of 0.1532. The "B" indicates that data were missing and invalid
for the remaining eight 5-minute periods during the hour. In-
valid data would be the result of primary or secondary channels
being outside acceptable limits during the QC tests on site.
During hour 15, only one valid 5-minute average was available.
The "M" indicates that the remaining eleven 5-minute averages
were missing.
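The hourly entries just described can be derived from twelve 5-minute slots as sketched below. This is an assumed reconstruction of the report's convention, in which each slot holds a value, an invalid marker (printed "B"), or nothing (printed "M" for missing).

```python
def hourly_entry(slots):
    """slots: 12 entries for one hour, each a float (valid reading),
    the string 'invalid', or None (missing).
    Returns (average, flag, n_valid), mirroring entries such as
    '0.1532B 4' in the hourly listing."""
    valid = [v for v in slots if isinstance(v, float)]
    n = len(valid)
    avg = sum(valid) / n if n else None
    if any(v == "invalid" for v in slots):
        flag = "B"   # some 5-minute data invalid
    elif any(v is None for v in slots):
        flag = "M"   # some 5-minute data missing
    else:
        flag = " "   # all twelve 5-minute values valid
    return avg, flag, n
```

An hour with four valid readings and eight invalid ones yields a "B" flag and a count of 4, matching the ozone entry for hour 14.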
As indicated in Figure 5-3, a summary of secondary param-
eters associated with each of the primary parameters in the
previous listing was produced. The summary shows the high and
low critical limits for each parameter. Note that in the case of
ozone, the critical value range for the flow of ethylene (FETH)
was set at 20.000 to 30.000.
A third listing, Figure 5-4, shows the reasons for invalid
data flagged in the hourly data listing illustrated in Figure
5-2. Again using the ozone parameter, the listing indicates that
FETH was outside of the critical limits during hour 14. Figure
5-5 shows the invalid 5-minute values for ozone and the 5-minute
values of each of the associated secondary parameters that were
determined to be invalid during the QC tests. No 5-minute value
for a primary parameter was valid unless the following criteria
were met:
1. The values of all the 5-minute averages for all of the
associated secondary parameters had to be available.
2. The 5-minute values of all of the associated secondary
parameters had to be valid.
Figure 5-6 lists journal entries for the time period corre-
sponding to the preceding data listings. The journal entries
were useful in arriving at decisions on the validity of flagged
data.
A data review, illustrated in Figure 5-7 was produced for a
quick overview of 5-minute average status for each data period.
Each hour in this data listing is represented by 12 status bits,
CHAMP DATA, VALIDATED ON 01-AUG-79 WITH VERSION 7.00 OF VALDAT
VALIDATION REPORT FOR STATION - 0841 FOR DAY - 265 - 1977
HOURLY AVERAGES

[Printout excerpt, ozone (O3) column; each entry gives the hourly
average, a validity flag, and the number of valid 5-minute values:

    TIME    O3
    12:0    0.1300  12
    13:0    0.1356  12
    14:0    0.1532B  4
    15:0    0.1574M  1
    16:0    0.1739M  6

Other columns (NOx, NO, NO2, SNO2, SO2, CH4, THC, NMHC, WS, VWD,
temperatures, dewpoint, barometric pressure, and sampler flows) are
listed in the same format.]

Figure 5-2. Example CHAMP data validation report (partial printout).
STATION - 0841 DAY 265-1977
VALIDATION CRITERIA USED

[Printout excerpt: low and high critical limits for the secondary
parameters associated with NO2, O3, and SO2 (e.g., for O3, the
ethylene flow FETH: LOW LIMIT = 20.000, HIGH LIMIT = 30.000).]

Figure 5-3. High/low critical values for CHAMP secondary parameters (partial printout).
STATION - 0841 DAY 265-1977
INVALIDITY CAUSES BY HOUR

03     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
FETH
FTH/SF

RSPI   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
RSPI
NEGV

RSPF   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
RSPF
NEGV

Figure 5-4. CHAMP validation system, invalidity causes by hour.
STATION - 0841 DAY 265-1977
INVALID SECONDARIES-FIVE MINUTE AVERAGES
14: 7 03 = 0.1449 FETH= 17.692 14:12 03 = 0.0626 FETH= -0.774 14:17 03 = 0.0182 FETH = -2.327
14:22 03 = 0.0102 FETH= -2.688 14:27 03 = 0.0147 FETH= 0.814 15: 5 RSPI= -0.924 RSPI= -0.924
15: 5 RSPF= -0.999 RSPF= -0.999
Figure 5-5. Five-minute values of invalid secondary parameters.
STATION - 0841 DAY 265-1977
JOURNAL ENTRIES
STATION - 0841 DAY - 265 TIME - 13:45
WORD 9 BIT 1 GOES TO 1 STATE SOMETIMES: NO APPARENT REASON, R
NEPHELOMETER POWER IS NOT OFF. RICK F ON
STATION - 0841 DAY - 265 TIME - 13:48
OPC/PHA: WAS AT 50000 SECONDS & COUNTING, REFUSED TO TRANSFER R
DATA TO COMPUTER. USED A TO RESTART. RICK F E
STATION - 0841 DAY - 265 TIME - 15: 1
DAS: WAS PRINTING 7242, 5246, 0323 DURING POLL ATTEMPTS, R
NO RECORDS TAKEN, ZEROED CORE, LOADED PROGRAMS & CONSTANTS. R
RICK F D
STATION - 0841 DAY - 265 TIME - 16:31
DAS CRASHED E 0002: DURING POLL ATTEMPT. TTY PRINTED 0316, R
0316, THEN TURNED OFF DIGITAL DISPLAY, THEN WENT TO E 0002.
ZEROED CORE, LOADED PROGRAMS & CONSTANTS. RICK F E
STATION - 0841 DAY - 265 TIME - 16:46
TSP, RSP, ANDERSEN FILTERS CHANGED, S02 BUBBLER RUNNING: ANALYZER
B.O. EAA, OPC/PHA CALIBRATION ABORTED: DAS CRASHED, WARM, A
LITE HAZE TODAY. RICK F 01
STATION - 0841 DAY - 265 TIME - 22:22
DAVID TORRES AT GLENDORA TO CHECK OUT MODEM PROBLEMS

Figure 5-6. Example CHAMP journal entries for data validation.
Station - 0841
Day 265-1977

[Printout: for each parameter (NO, NOx, NO2, SNO2, SO2, CH4, and the
remaining channels), one row of 12 status characters per hour for
hours 0 through 23, one character per 5-minute interval. Most
entries read "000000000000" (all acceptable); the ozone row carries
invalid and missing flags during hours 14 and 15.]

Figure 5-7. Example CHAMP validation data review.
each representing a 5-minute time interval. The 5-minute inter-
val flags are represented symbolically as follows:
"0" - acceptable data
"I" - invalid data
"-" - missing data
Note that the flags for ozone data during hours 14 and 15 can be
traced through the previous illustrations. For the cases in
which 5-minute averages were missing, operator logs and strip
charts were reviewed to determine if data could be captured; in
cases where acceptable data were available, the data were entered
into the data base.
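The 12-character hour strings described above lend themselves to a simple summary, sketched here in Python. The flag symbols follow the text ("0" acceptable, "I" invalid, "-" missing); the function itself is an illustration, not part of the CHAMP software.

```python
def summarize_hour(status):
    """Count the 5-minute interval states in one 12-character hour string
    from the data review listing."""
    assert len(status) == 12, "one hour holds twelve 5-minute intervals"
    return {
        "acceptable": status.count("0"),
        "invalid": status.count("I"),
        "missing": status.count("-"),
    }
```

Applied to an hour like "0IIIII--0000", it reports 5 acceptable, 5 invalid, and 2 missing intervals, the kind of tally a reviewer would cross-check against the operator logs.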
Another feature of the CHAMP data validation routine was the
application of a graphic technique. Figure 5-8 shows a plot of
ozone as a function of time for day 265, the same period used in
the preceding illustrations. The valid 5-minute averages are
represented as O's; invalid 5-minute averages are entered as
"I's". (The stacking of 5-minute values is merely a function of
printer limitations). Better resolution was obtained in some
cases by using a continuous line plotter. The illustration
shows, however, that a useful graphic technique can be applied
(relatively inexpensively) that flows well with other data re-
ports produced for a specific time period.
The graphic display technique was useful in two ways: first, for
quickly spotting extreme data values that warranted further
investigation; and second, for tracking hourly pollutant trends
daily. In the case of ozone, the trend is clear in Figure 5-8.
Figure 5-9 complements the ozone pattern; it confirms that NOx
trends during the same day behaved as would be expected from the
photochemical relationships involved.
After the validation procedure was completed for CHAMP data,
the data were approved for incorporation into the reporting data
base. The automated and manual QC and data validation procedures
in the CHAMP system have lent significant credibility to the
final reported data.
[Figure 5-8: plot of 5-minute ozone (O3) concentration (ppm) versus
hour of day, Station 0841, day 265; valid 5-minute averages plotted
as O's, invalid averages as I's.]
[Figure 5-9: plot of 5-minute NOx concentration (ppm) versus hour of
day, Station 0841, day 265.]
5.5 CASE STUDY OF A REGIONAL AIR MONITORING SYSTEM (RAMS) DATA
VALIDATION
The U.S. EPA operated the Regional Air Monitoring System
(RAMS) in and around the St. Louis metropolitan area from 1975
through mid-1977. The system was part of the Regional Air Pollu-
tion Study (RAPS), undertaken to collect aerometric data for
urban and rural scale dispersion model development and evalua-
tion. The RAMS included 25 monitoring sites at which the param-
eters shown in Table 5-2 were monitored. Data were transmitted
to a central computer facility from each monitoring site via
leased lines. Figure 5-10 shows the relative locations of moni-
toring stations in the RAMS network. Figure 5-11 shows the flow
of data within each station. In the RAMS system, minicomputers
served QC and data handling functions similar to those in the
CHAMP network. Unlike the CHAMP data base, which was developed
with backup data tapes from each station, the RAMS data base was
produced from data transmitted directly to the central computer.
The station tapes in RAMS served as backup in instances when the
telemetry system or central computer malfunctioned. Station data
buffers were polled by the central computer every minute.
5.5.1 Quality Control
Several automated QC checks (based on the operating status
of each instrument) were incorporated in RAMS to prescreen the
data. The automated QC checks included:
o System status checks,
o Analog checks, and
o Zero/span checks.
The system status checks included approximately 35 electrical
checks of critical parameters (e.g., flame-out, valve on-off
status) to determine the capability of each instrument to produce
valid data. When status checks indicated malfunctions, a flag
was appended to the value of the associated environmental param-
eter to indicate that the 1-minute value was invalid.a
a System programmers developing programs to summarize data must be
aware of the extent to which flags are used. In RAMS, the sample
value was multiplied by a large negative power of 10 when a status
condition was invalid. Problems arose when data analysis programs
recognized such values as zero.
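The pitfall in this footnote can be illustrated as follows. The readings and the scaling factor are invented (the exact RAMS multiplier is not legible in this copy); the point is that a sentinel value near zero silently biases an average, while an explicit not-a-number marker can be excluded.

```python
import math

readings = [0.041, 0.043, 0.040]            # valid samples (invented values)
flagged_as_tiny = readings + [0.042 * 1e-12]  # sentinel: sample scaled toward zero
flagged_as_nan  = readings + [float("nan")]   # explicit invalid marker

# Naive averaging treats the sentinel as a legitimate near-zero reading
# and dilutes the mean:
biased_mean = sum(flagged_as_tiny) / len(flagged_as_tiny)

# NaN-aware averaging drops the invalid sample instead:
valid = [v for v in flagged_as_nan if not math.isnan(v)]
correct_mean = sum(valid) / len(valid)
```

Here the biased mean is pulled down by roughly a quarter, exactly the kind of error the footnote warns summary programs against.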
TABLE 5-2. PARAMETERS MONITORED IN THE REGIONAL
AIR MONITORING SYSTEM

                                         Measurement      Number of
                                         interval (min)   stations

Air quality:      Sulfur dioxide             3.75            13
                  Total sulfur               1               12
                  Hydrogen sulfide           3.75            13
                  Ozone                      1               25
                  Nitric oxide               1               25
                  Oxides of nitrogen         1               25
                  Nitrogen dioxide           1               25
                  Carbon monoxide            5               25
                  Methane                    5               25
                  Total hydrocarbons         5               25

Meteorological:   Windspeed                  1               25
                  Wind direction             1               25
                  Temperature                1               25
                  Temperature gradient       1                7
                  Pressure                   1                7
                  Dewpoint                   1               25
                  Aerosol scatter            1               25

Solar radiation:  Pyranometer                1                6
                  Pyrheliometer              1                4
                  Pyrgeometer                1                4
[Figure 5-10 map: locations of the 25 stations of the Regional Air
Monitoring Station (RAMS) network (e.g., Stations 109, 115, 116,
117, 122, 123) in and around the St. Louis area; scale in km.]

Figure 5-10. Location of RAMS stations.
[Figure 5-11 block diagram: air quality, meteorological, and solar
radiation instruments are sampled every 0.5 s and reduced to 1-min
arithmetic averages, which pass through a multiplexer (MUX) and
analog-to-digital converter (ADC); daily calibration, the station
log, and instrument and system status feed the same data stream,
which is telecommunicated to the central facility.]

Figure 5-11. RAMS data flow: RAMS station.
Analog checks determined the status of several key parameter
conditions such as permeation tube bath temperature and reference
voltages. If acceptable limits for these parameters were ex-
ceeded, then the corresponding environmental data were invali-
dated. In this case, a value of 1034 was substituted for the
1-minute reading of the environmental parameter in the data file.
Zero/span checks were performed daily, with the zero/span
commands coming from the central computer. If predetermined
instrument drift limits were exceeded, the data from that instru-
ment for the previous day were flagged as invalid, and a field
operator was notified.
5.5.2 Data Validation
Figure 5-12 shows the data flow within the RAMS central
facility. As the figure indicates, RAMS validation included two
levels of data checks:
o Intrastation checks of minute data, and
o Interstation checks of hourly data.
The RAMS minute values that were not invalidated by the QC
checks mentioned previously were converted to engineering units
and then subjected to the intrastation checks. The automated
intrastation checks included gross value limit checks, relational
checks, and 10-minute time continuity checks. The gross limit
checks were based on the operating ranges of each instrument.
Examples of typical limits are in Table 5-3. Using the operating
ranges avoided the problem of setting statistical limits for each
parameter.
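A gross limit check of this kind can be sketched in a few lines; the ranges below are taken from Table 5-3, and the function is an illustration rather than the RAMS code.

```python
# Operating-range limits from Table 5-3 (a subset, for illustration).
GROSS_LIMITS = {
    "ozone": (0.0, 1.0),              # ppm
    "carbon_monoxide": (0.0, 100.0),  # ppm
    "windspeed": (0.0, 22.2),         # m/s
    "temperature": (-20.0, 45.0),     # deg C
}

def gross_limit_check(parameter, value):
    """Return True if a minute value lies within the instrument's
    operating range; out-of-range values are candidates for invalidation."""
    low, high = GROSS_LIMITS[parameter]
    return low <= value <= high
```

Because the limits are instrument ranges rather than statistical bounds, the same table serves every station without per-site tuning, which is the advantage noted above.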
For some parameters, interparameter conditions were set as
validation criteria; for example, total sulfur had to be less
than the sum of SO2 and hydrogen sulfide (when compared on a
sulfur basis in ppm). Instrument noise bands precluded strict
interpretation of some of the interparameter checks.
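The relational check just described can be sketched with an explicit noise band; the 0.005-ppm tolerance is an assumed value (the report does not give one), included to show how strict interpretation is relaxed for instrument noise.

```python
NOISE_TOLERANCE = 0.005  # ppm, assumed noise band (not from the report)

def total_sulfur_consistent(total_sulfur, so2, h2s):
    """Apply the stated RAMS condition that total sulfur be less than the
    sum of SO2 and hydrogen sulfide (all in ppm on a sulfur basis),
    relaxed by an instrument noise band."""
    return total_sulfur < so2 + h2s + NOISE_TOLERANCE
```

A reading that violates the relation by more than the noise band would be flagged for review rather than invalidated outright.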
Ten-minute continuity checks were used to see if a variable
changed over a period of time. It was recognized early in the
RAMS program that a constant voltage output from a sensor indi-
cated mechanical or electrical failures in the sensor instrumen-
tation. A daily report was generated to show questionable time
[Figure 5-12 block diagram: minute data from the 25 sites flow into
the central PDP-11/40 and are converted to engineering units;
intrastation validation and constant checks are applied to the
minute data; hour averages are computed and subjected to
interstation validation; minute data invalidated by interstation
checks are flagged; a 24-hour calibration and drift summary is
produced; and a RAMS system tape is created with calibrations,
minute data, and hour averages.]

Figure 5-12. RAMS data flow, central facility.
TABLE 5-3. TYPICAL GROSS LIMIT VALUES USED IN THE REGIONAL
AIR MONITORING SYSTEM DATA VALIDATION PROCEDURE

                          Instrumental or natural limits
Parameter                 Lower             Upper

Ozone                     0 ppm             1 ppm
Nitric oxide              0 ppm             5 ppm
Oxides of nitrogen        0 ppm             5 ppm
Carbon monoxide           0 ppm             100 ppm
Methane                   0 ppm             25 ppm
Total hydrocarbons        0 ppm             25 ppm
Sulfur dioxide            0 ppm             1 ppm
Total sulfur              0 ppm             1 ppm
Hydrogen sulfide          0 ppm             1 ppm
Aerosol scatter           0.000001 m-1      0.0040 m-1
Windspeed                 0 m/s             22.2 m/s
Wind direction            0°                360° (540° for some
                                            wind systems)
Temperature               -20°C             45°C
Dewpoint                  -30°C             45°C
Temperature gradient      -5°C              5°C
Barometric pressure       950 mb            1050 mb
Pyranometers              -0.50             2.50 Langleys/min
Pyrgeometers              0.30              0.75 Langleys/min
Pyrheliometers            -0.50             2.50 Langleys/min

Interparameter conditions (legible portion): NO x O3 < 0.04;
NOx - NO [remainder of column not legible].
periods so that field personnel could be dispatched to a site if
instrument malfunctions were suspected. These checks could not
be applied to some parameters; for example, barometric pressure,
since it can remain constant (to the number of digits recorded)
much longer than ten minutes.
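The continuity (stuck-sensor) check described above can be sketched as follows; the exempt-parameter set and the representation of minute values are assumptions for illustration, not the RAMS implementation.

```python
# Parameters allowed to remain constant for long periods (per the text,
# barometric pressure is one such case).
EXEMPT = {"barometric_pressure"}

def stuck_sensor(parameter, minute_values, window=10):
    """Return True if the last `window` minute values are identical to
    the recorded precision, suggesting a mechanical or electrical
    failure in the sensor instrumentation."""
    if parameter in EXEMPT or len(minute_values) < window:
        return False
    recent = minute_values[-window:]
    return len(set(recent)) == 1
```

A temperature channel reporting the same value for ten straight minutes would be reported as questionable, while an identical run from the barometer would pass.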
The interstation checks were applied to hourly averages com-
piled from the minute data that passed both the on-site QC tests
and the after-the-fact intrastation checks. Interstation con-
tinuity checks were used to look for consistency in parameter be-
havior throughout the network during specific time periods. The
interstation checks in RAMS were applied to meteorological param-
eters only. Initial application to pollutant parameters showed
that too many false data anomalies were flagged. The intersta-
tion checks consisted of plotting curves of hourly averages of
each parameter. Several curves representing data from multiple
sites were plotted on the same page for ease of comparison among
stations. The mean and difference of values that fell outside
the 90th percentile of the combined network distribution were re-
ported. Data curves were reviewed over periods of several days
to see trend relationships to the rest of the network. When
problems were detected through the review of the hourly average
curves, then plots of the minute data for the corresponding time
periods were produced. Figure 5-13 is a continuous plot of hour-
ly average temperature data for several stations in the RAMS net-
work. The hourly data were relatively uniform among the nine
stations represented; however, the small negative depression on
the circled portion of the curve for Station 103 on the thirty-
second Julian day. Figure 5-14, a plot of minute temperature
data for the same station and time period, reveals that a problem
occurred during most of hour 14 of the day in question and that
the data reported for that hour are invalid.
This example shows that, although hourly average data may
not have indicated instrument problems over the long term, prob-
lems such as temporary voltage surges may indeed have occurred.
In the example, statistical analysis of hourly average values
would not have detected the problem. In the short term this kind
[Figure 5-13: continuous plot of hourly average temperature (°C)
versus time for nine stations in the RAMS network, station numbers
identified along the curves; the curve for Station 103 shows a
small circled negative depression on Julian day 32.]

Figure 5-13. Hourly average temperature data from RAMS stations.
[Figure 5-14: plot of minute temperature (°C) versus time, 1200 to
2359 hours; the trace drops sharply during most of hour 14.]

Figure 5-14. Minute temperature data from Station 103, from 1200 to 2359 hours, February 1, 1975.
of information may not be important for hourly or daily data
fluctuations, but it can be important as a quality assurance
feedback mechanism. Even though the QC checks were extensive,
they were not sufficient to detect this problem. The validation
procedure was a supplement to the QC procedures that may be
useful in improving those procedures as well as in identifying
additional instrument function problems.
The use of data plotting techniques for hourly and minute data,
coupled with followup through the field operations loop, was helpful in
answering questions regarding data values flagged during the
intrastation validation tests. A special data listing was pro-
duced to show the average standard deviation and the number of
excursions above statistical limits. The largest deviations were
flagged for followup. This technique was extremely time consum-
ing, since the followup was manual.
Meteorological data were used in RAMS to provide insight to
questionable data periods. Weather summaries like the one in
Figure 5-15 provided convenient references for determining condi-
tions that might have significantly altered pollutant patterns
during specific periods of time.
In retrospect, two specific changes in approach might have
improved the validation effort in RAMS:
1. Graphic techniques such as those described herein could
have been more useful if they had been applied closer to real
time. The major problem is that (as mentioned previously) the
graphical techniques are manpower intensive, so condensing the
time frame (especially for a network approaching RAMS size) could
markedly increase the validation cost.
2. Successive difference checks could have been applied to
minute (or hourly average) data to provide feedback to the QC
system. These checks could solve the problem of an instrument
malfunction going undetected for a long period of time. Figure
5-16 illustrates the kinds of instrument-response problems that
might be detected using successive difference checks of minute
(or hourly average) data. Optimum limits need to be determined
for the successive difference checks by a statistical analysis of
LAMBERT FIELD, ST. LOUIS WEATHER SUMMARY

[Figure 5-15 strip charts, January 1977: weather; precipitation
(in., T = trace); ceiling (ft, scale to 25,000); sky cover (%);
visibility (mi); pressure (mb); temperature (deg C); relative
humidity (%); wind direction (deg); windspeed (m/s).]

Figure 5-15. Reference weather data used in RAMS data validation.
historical data. For some applications it will be necessary to
determine successive differences on a percentage basis (i.e.,
above a specified level).3
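A successive-difference screen of the kind suggested here can be sketched as follows. The percentage limit and the concentration level are assumed values; as the text notes, optimum limits must come from statistical analysis of historical data.

```python
def flag_successive_differences(values, pct_limit=50.0, level=0.01):
    """Return indices of minute (or hourly) values whose step from the
    previous value exceeds pct_limit percent; the percentage test is
    applied only above a specified concentration `level`, as suggested
    in the text. Both limits are illustrative assumptions."""
    flagged = []
    for i in range(1, len(values)):
        prev, cur = values[i - 1], values[i]
        if max(abs(prev), abs(cur)) <= level:
            continue  # below the specified level; skip the percentage test
        base = max(abs(prev), level)
        if abs(cur - prev) / base * 100.0 > pct_limit:
            flagged.append(i)
    return flagged
```

A sudden spike or step in an otherwise smooth trace is flagged at both its rising and falling edges, while small fluctuations near the noise floor are ignored.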
a) SINGLE OUTLIER   b) STEP FUNCTION   c) SPIKE
d) STUCK            e) MISSING         f) CALIBRATION
g) DRIFT

Figure 5-16. Examples of instrument responses that can be
detected through minute successive differences.
5.6 REFERENCES
1. U.S. Environmental Protection Agency. Quality Assur-
ance Handbook. Vol. I. Principles. EPA-600/9-76-005,
1976.
2. Rhodes, R. C., and S. Hochheiser. Data Validation Con-
ference Proceedings. Presented by Office of Research
and Development, U.S. Environmental Protection Agency,
Research Triangle Park, North Carolina, EPA-600/9-79-042,
September 1979.
3. Hartwell, T., Use of Successive Time Difference and
Dixon Ratio Test for Data Validation, Data Validation
Conference Proceedings, EPA-600/9-79-042, September
1979.
6.0 BIBLIOGRAPHY
Barnett, V., and T. Lewis. Outliers in Statistical Data. John
Wiley and Sons, New York, 1978.
Curran, T. C., W. F. Hunt, Jr., and R. B. Faoro. Quality Control
for Hourly Air Pollution Data. Presented at the 31st Annual
Technical Conference of the American Society for Quality
Control, Philadelphia, May 16-18, 1977.
Data Validation Program for SAROAD, Northrup Services, EST-TN-
78-09, December 1978, (also see Program Documentation
Manual, EMSL).
Faoro, R. B., T. C. Curran, and W. F. Hunt, Jr., "Automated
Screening of Hourly Air Quality Data," Transactions of the
American Society for Quality Control, Chicago, Ill., May
1978.
Grant, E. L., and R. S. Leavenworth. Statistical Quality Con-
trol. McGraw-Hill Book Company, New York.
Grubbs, F. E., and G. Beck. Extension of Sample Sizes and Per-
centage Points for Significance Test of Outlying Observa-
tions. Technometrics, Vol. 14, No. 4, November 1972.
Hald, A. Statistical Theory with Engineering Applications. New
York, 1952.
Hartwell, T., Use of Successive Time Difference and Dixon Ratio
Test for Data Validation, Data Validation Conference Pro-
ceedings, EPA-600/9-79-042, September 1979.
Hawkins, D. M. The Detection of Errors in Multivariate Data
Using Principal Components. Journal of the American Statis-
tical Association, Vol. 69, No. 346. 1974.
Hunt, Jr., W. F., J. B. Clark, and S. K. Goranson, "The Shewhart
Control Chart Test: A Recommended Procedure for Screening
24-Hour Air Pollution Measurements," J. Air Poll. Control
Assoc. 28:508 (1979).
Hunt, Jr., W. F., T. C. Curran, N. H. Frank, and R. B. Faoro,
"Use of Statistical Quality Control Procedures in Achieving
and Maintaining Clean Air," Transactions of the Joint
European Organization for Quality Control/International
Academy for Quality Conference, Venice Lido, Italy,
September 1975.
Hunt, Jr., W. F., R. B. Faoro, T. C. Curran, and W. M. Cox, "The
Application of Quality Control Procedures to the Ambient Air
Pollution Problem in the USA," Transactions of the European
Organization for Quality Control, Copenhagen, Denmark, June
1976.
Hunt, Jr., W. F., R. B. Faoro, and S. K. Goranson, "A Comparison
of the Dixon Ratio Test and Shewhart Control Test Applied to
the National Aerometric Data Bank," Transactions of the
American Society for Quality Control, Toronto, Canada, June
1976.
Johnson, T. A Comparison of the Two-Parameter Weibull and Log-
normal Distributions Fitted to Ambient Ozone Data. PEDCo
Environmental, Inc., Durham, North Carolina, and The Air
Pollution Control Association. Quality Assurance in Air
Pollution Measurement. Presented at the Air Pollution
Control Association, New Orleans, March 11-14, 1979.
Larsen, R. I. A Mathematical Model for Relating Air Quality
Measurements to Air Quality Standards, Publication No.
AP-89, U.S. Environmental Protection Agency, 1971.
Marriott, F. H. C. The Interpretation of Multiple Observations.
Academic Press, New York, 1974.
Naus, J. I. Data Quality Control and Editing. Marcel Dekker,
Inc., New York, 1975.
Remington, R. D., and M. A. Schork. Statistics with Applications
to the Biological and Health Sciences. Prentice-Hall, Inc.,
Englewood Cliffs, New Jersey, 1970.
Rhodes, R. C., and S. Hochheiser. Data Validation Conference
Proceedings. Presented by Office of Research and Develop-
ment, U.S. Environmental Protection Agency, Research Tri-
angle Park, North Carolina, EPA 600/9-79-042, September
1979.
Siegel, S. Nonparametric Statistics for the Behavioral Sciences,
McGraw-Hill, 1956.
Smith, F. Guideline for the Development and Implementation of a
Quality Cost System for Air Pollution Measurement Programs.
Research Triangle Institute. RTI/1507/01F, November 1979.
U.S. Department of Commerce. Computer Science and Technology:
Performance Assurance and Data Integrity Practices.
National Bureau of Standards, Washington, D.C., January
1978.
U.S. Environmental Protection Agency. Guidelines for Air Quality
Maintenance Planning and Analysis. Vol. 11. Air Quality
Monitoring and Data Analysis. EPA-450/4-74-012, 1974.
U.S. Environmental Protection Agency. Quality Assurance and Data
Validation for the Regional Air Monitoring System of the St.
Louis Regional Air Pollution Study. EPA-600/4-76-016, 1976.
U.S. Environmental Protection Agency. Quality Assurance Hand-
book: Vol. I, Principles; Vol. II, Ambient Air Specific
Methods; and Vol. III, Stationary Source Specific Methods.
EPA-600/9-76-005, 1976.
U.S. Environmental Protection Agency. Screening Procedures for
Ambient Air Quality Data. EPA-450/2-78-037, July 1978.
Wolfe, J. H. NORMIX: Computation Methods for Estimating the
Parameters of Multivariate Normal Mixtures of Distributions,
Research Memorandum SRM 68-2. U.S. Naval Personnel Research
Activity, San Diego, 1967.
1978 Annual Book of ASTM Standards, Part 41. Standard Recom-
mended Practice for Dealing with Outlying Observations, ASTM
Designation: E 178-75. pp. 212-240.
APPENDIX A
STATISTICAL TABLES
TABLE A-1. DIXON CRITERIA FOR TESTING OF EXTREME
OBSERVATION (SINGLE SAMPLE)*

Criteria (x1 < x2 < ... < xn-2 < xn-1 < xn):

  r10 = (x2 - x1)/(xn - x1)      if smallest value is suspected;
      = (xn - xn-1)/(xn - x1)    if largest value is suspected.
  r11 = (x2 - x1)/(xn-1 - x1)    if smallest value is suspected;
      = (xn - xn-1)/(xn - x2)    if largest value is suspected.
  r21 = (x3 - x1)/(xn-1 - x1)    if smallest value is suspected;
      = (xn - xn-2)/(xn - x2)    if largest value is suspected.
  r22 = (x3 - x1)/(xn-2 - x1)    if smallest value is suspected;
      = (xn - xn-2)/(xn - x3)    if largest value is suspected.

                       Significance level
  n   Criterion     10%      5%      1%
  3     r10        .886    .941    .988
  4     r10        .679    .765    .889
  5     r10        .557    .642    .780
  6     r10        .482    .560    .698
  7     r10        .434    .507    .637
  8     r11        .479    .554    .683
  9     r11        .441    .512    .635
 10     r11        .409    .477    .597
 11     r21        .517    .576    .679
 12     r21        .490    .546    .642
 13     r21        .467    .521    .615
 14     r22        .492    .546    .641
 15     r22        .472    .525    .616
 16     r22        .454    .507    .595
 17     r22        .438    .490    .577
 18     r22        .424    .475    .561
 19     r22        .412    .462    .547
 20     r22        .401    .450    .535
 21     r22        .391    .440    .524
 22     r22        .382    .430    .514
 23     r22        .374    .421    .505
 24     r22        .367    .413    .497
 25     r22        .360    .406    .489

*Reproduced with permission from W. J. Dixon, "Processing Data for Outliers,"
Biometrics, March 1953, Vol. 9, No. 1, Appendix, Page 89.

Criterion r10 applies for 3 <= n <= 7
Criterion r11 applies for 8 <= n <= 10
Criterion r21 applies for 11 <= n <= 13
Criterion r22 applies for 14 <= n <= 25
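As an illustration of how the Dixon criteria are applied, the sketch below computes r10 for a three-observation sample and compares it against the tabled 5% critical value for n = 3 (0.941). The sample values are invented.

```python
def dixon_r10(sample, suspect_largest=True):
    """Compute Dixon's r10 ratio for the most extreme observation in a
    small sample (applicable for 3 <= n <= 7)."""
    x = sorted(sample)
    if suspect_largest:
        return (x[-1] - x[-2]) / (x[-1] - x[0])
    return (x[1] - x[0]) / (x[-1] - x[0])

# Example: n = 3, largest value suspected; the value is declared an
# outlier at the 5% level if r10 exceeds the tabled critical value.
ratio = dixon_r10([0.10, 0.12, 0.55])
is_outlier = ratio > 0.941
```

Here the ratio is about 0.956, so the largest observation would be rejected at the 5% level.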
TABLE A-2. CRITICAL VALUES FOR 5% AND 1% TESTS OF
DISCORDANCY FOR TWO OUTLIERS IN A NORMAL SAMPLE*

  n      5%       1%
  4     0.967    0.992
  5     0.845    0.929
  6     0.736    0.836
  7     0.661    0.778
  8     0.607    0.710
  9     0.565    0.667
 10     0.531    0.632
 12     0.481    0.579
 14     0.445    0.538
 16     0.418    0.508
 18     0.397    0.484
 20     0.372    0.464
 25     0.343    0.428
 30     0.322    0.402

*Barnett, V., and T. Lewis. Outliers in Statistical Data, Table XIIIe,
p. 311. John Wiley and Sons, New York, 1978.
TABLE A-3. CRITICAL T VALUES FOR ONE-SIDED GRUBBS TEST
WHEN STANDARD DEVIATION IS CALCULATED FROM SAMPLE*

[Tabulated critical T values for n = 3 to 100 at the upper 0.1%,
0.5%, 1%, 2.5%, 5%, and 10% significance levels; for example, at
n = 3 the upper 5% value is 1.153, and at n = 10 the upper 5%
value is 2.176. The printout is not legibly reproduced here; see
the source below for the complete table.]

*Grubbs, F. E., and G. Beck. Extension of Sample Sizes and Percentage
Points for Significance Test of Outlying Observations. Technometrics,
Vol. 14, No. 4, November 1972.
Significance Significance Significant* Significance Significance Significance
Lcvtl Ltvtl L*v*l Ltvtl Ltvtl Ltvtl
798
608
816
825
834
842
851
858
867
874
882
869
3.896
903
910
917
923
930
3.936
942
948
954
960
965
971
.977
.982
.987
.992
.998
.002
.007
.012
.017
.021
.026
.031
.035
.039
.044
.049
.053
.057
.060
.064
.069
.073
.076
.080
.084
3.491
3.500
;.507
3.516
3.524
.531
.539
.546
.553
.560
.566
.573
3.579
3.586
592
598
605
610
617
622
627
3.633
636
643
646
654
656
663
669
673
677
682
687
691
3.6V5
.699
.704
.708
.712
.716
.720
.725
.728
.732
.736
3.739
3.744
3.747
3.750
3.754
.345
.353
.361
.366
.376
.383
.391
.397
.405
.411
.416
.424
.430
.437
.442
.449
.454
.460
.466
.471
.476
.482
.487
.492
.496
.502
.507
.511
.516
.521
.525
.529
.534
.539
.543
.547
.551
.555
.559
.563
.567
.570
.575
.579
.562
.586
.589
.593
.597
.600
3.136
3.143
3.151
3.156
.166
.172
.180
3.186
3.193
3.199
3.205
3.212
3.218
3.224
3.230
3.235
3.241
3.246
3.252
3.257
3.262
3.267
3.272
3.278
3.282
3.287
3.291
3.297
.301
.305
.309
.315
.319
.323
.327
.331
.335
.339
.343
.347
.350
.355
.358
.362
.365
.369
.372
.377
.330
.363
964
971
978
986
992
000
006
013
019
025
032
037
044
.049
.055
.061
.066
.071
.076
.082
.087
.092
.098
.102
.107
.111
.117
.121
.125
.130
.134
.139
.143
.147
.151
.155
.160
.163
.167
.171
.174
.179
.162
.186
.189
.193
.196
.201
.204
.207
775
783
790
798
804
811
818
624
631
837
84 2
2.849
854
860
866
671
877
883
688
893
897
903
908
912
917
922
927
931
935
940
2.945
949
953
957
961
966
970
973
2.977
2.981
2.984
989
993
996
000
003
006
Oil
014
3.017
(continued)
A-4
-------
TABLE A-3 (continued)
Nirtwr of Upper .11 Ctf*r .51 Uppxr IS
Observtttons Significance Significant* S1cm1f1c«nct
n Level Ltvcl Ltvtl
101
10Z
103
104
105
106
107
108
109
110
111
112
111
114
IIS
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
136
139
140
141
142
143
144
14S
146
147
.088
.092
.095
.098
.102
.105
.109
.112
.116
.119
.122
.125
.129
.132
.135
.13«
.141
.144
.146
.150
.153
.156
.159
.161
.164
.166
.169
.173
.175
.178
.ISO
.1B3
.185
.188
.190
.193
.196
.198
.200
.203
.205
.207
.209
.212
.214
.216
.219
757
760
,765
768
771
774
777
780
784
787
790
793
796
799
602
805
808
611
814
817
819
3.822
824
627
831
833
836
838
MO
3.84J
3.645
3.648
3.850
3.853
3.856
3.858
3.660
863
865
3.867
3.869
3.871
3.874
3.876
3.879
3.881
3.883
603
607
610
614
617
620
623
626
629
632
636
639
642
645
647
650
653
656
659
662
665
667
3.670
3.672
3.675
3.677
3.680
3.683
3.6B6
3.688
3.690
3.693
3.695
3.697
3.700
3.702
3.704
3.707
3.710
3.712
3.714
3.716
3.719
3.721
3.723
3.725
3.727
Upp«r 2.SX
Slgnl flcinc*
Ltvtl
3.386
3.390
3.393
3.397
3.400
3.403
3.406
3.409
3.412
3.415
3.418
3.422
3.424
3.427
3.430
3.433
3.435
3.438
3.441
3.444
3.447
3.450
3.452
3.455
3.457
3.460
3.462
3.465
3.467
3.470
3.473
3.475
3.478
3.480
3.482
3.484
3.487
3.489
3.491
3.493
3. '97
3.499
3.501
3.503
3.505
3.507
3.509
Upper 5t
Sign) f)c«nc«
Level
3.210
3.214
3.217
3.220
3.224
3.227
3.230
3.233
3.236
3.239
3.242
3.245
3.248
3.251
3.254
3.257
3.259
3.262
3.265
3.267
3.270
3.274
3.276
3.279
3.281
3.284
3.286
3.289
3.291
3.294
3.296
3.298
3.302
3.304
3.306
3.309
3.311
3.313
3.315
3.318
3. 320
3.322
3.324
3.326
3.328
3.331
3.3J4
Upper 101
S1jn1 f1c«nc»
Level
3.021
3.024
3.027
3.030
3.033
3.037
3.040
3.043
3.046
3.049
3.052
3.055
3.058
3,061
3.064
3.067
3.070
3.073
3.075
3.078
3.081
3.083
3.086
3.089
3.092
3.095
3.097
3.100
3.102
3.104
3.107
3.109
3.112
3.114
3.116
3.119
3.1?2
3.124
3.126
3.129
3.131
3.133
3.135
3.138
3.140
3.142
3.144
A-5
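The T statistic tabulated above is the distance of the suspect extreme value from the sample mean, measured in units of the sample standard deviation (computed with the n-1 denominator). A minimal sketch, not part of the report:

```python
import statistics

def grubbs_t(values, suspect="largest"):
    """One-sided Grubbs statistic T.

    T = (x_max - mean)/s for a suspect largest value, or
    T = (mean - x_min)/s for a suspect smallest value,
    where s is the sample standard deviation (n-1 denominator).
    """
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    if suspect == "smallest":
        return (mean - min(values)) / s
    return (max(values) - mean) / s
```

For example, for the sample 1, 2, 3, 4, 10 the statistic is T = (10 - 4)/3.536 ≈ 1.697, which exceeds the 5% critical value of 1.672 for n = 5 given by Grubbs and Beck, so the value 10 would be flagged.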
-------
TABLE A-4. WILCOXON SIGNED-RANK TEST*
n = number of pairs
Number                          Critical values
of pairs,
   n      α = 0.10     α = 0.05     α = 0.02     α = 0.01
   1         --           --           --           --
   2         --           --           --           --
   3         --           --           --           --
   4         --           --           --           --
   5       0, 15          --           --           --
   6       2, 19        0, 21          --           --
   7       3, 25        2, 26        0, 28          --
   8       5, 31        3, 33        1, 35        0, 36
   9       8, 37        5, 40        3, 42        1, 44
  10      10, 45        8, 47        5, 50        3, 52
  11      13, 53       10, 56        7, 59        5, 61
  12      17, 61       13, 65        9, 69        7, 71
  13      21, 70       17, 74       12, 79        9, 82
  14      25, 80       21, 84       15, 90       12, 93
  15      30, 90       25, 95       19, 101      15, 105
  16      35, 101      29, 107      23, 113      19, 117
  17      41, 112      34, 119      28, 125      23, 130
  18      47, 124      40, 131      32, 139      27, 144
  19      53, 137      46, 144      37, 153      32, 158
  20      60, 150      52, 158      43, 167      37, 173
  21      67, 164      58, 173      49, 182      42, 189
  22      75, 178      66, 187      55, 198      48, 205
  23      83, 193      73, 203      62, 214      54, 222
  24      91, 209      81, 219      69, 231      61, 239
  25     100, 225      89, 236      76, 249      68, 257
*The data of this table are reproduced from Documenta GEIGY, Scientific
Tables, 7th edition, with kind permission of CIBA-GEIGY Ltd., Basle
(Switzerland).
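The rank sums compared against the critical values above are computed from the paired differences: zero differences are dropped, the absolute differences are ranked (tied values receiving average ranks), and the ranks of the positive and negative differences are summed separately. A plain-Python sketch of that computation (illustrative, not part of the report):

```python
def wilcoxon_rank_sums(x, y):
    """Rank sums for the Wilcoxon signed-rank test on paired samples x, y.

    Zero differences are dropped; tied absolute differences receive
    average ranks. Returns (T_neg, T_pos), the rank sums of the negative
    and positive differences.
    """
    d = [b - a for a, b in zip(x, y) if b != a]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(order):
        j = i
        # extend j over the block of tied absolute differences
        while j + 1 < len(order) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1          # average rank for the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    t_pos = sum(r for r, v in zip(ranks, d) if v > 0)
    t_neg = sum(r for r, v in zip(ranks, d) if v < 0)
    return t_neg, t_pos
```

The smaller of the two sums is compared with the lower critical value in Table A-4 for the chosen significance level; a sum at or beyond the tabulated limits indicates a significant difference between the paired samples.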
A-6
-------
TABLE A-5. RANK SUM TEST, α = P[H0 is true]*
In performing the test, begin with the α = 0.10 table and if T1 does not
fall between Tl and Tu, repeat the test using α = 0.05, α = 0.02, and
α = 0.01, until the inequality is satisfied.
[Lower and upper critical values (Tl, Tu) tabulated for sample sizes N1
and N2 from 4 through 25, in separate sections for α = 0.01, α = 0.02,
α = 0.05, and α = 0.10; the complete tables are given in the source
below.]
*The data of this table are extracted with kind permission from
DOCUMENTA GEIGY SCIENTIFIC TABLES, 6th Ed., pp. 124-127, Geigy
Pharmaceuticals, Division of Geigy Chemical Corporation, Ardsley, N.Y.
A-8
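The statistic T1 compared against the tabulated (Tl, Tu) limits is the sum of the joint ranks of one sample (conventionally the smaller) when both samples are ranked together. A sketch of that computation, with average ranks assigned to ties (illustrative, not part of the report):

```python
def rank_sum_t1(sample1, sample2):
    """Rank-sum statistic T1 for two independent samples.

    Both samples are ranked jointly (average ranks for tied values) and
    the ranks belonging to sample1 are summed. T1 is then compared with
    the (Tl, Tu) limits of Table A-5 for the chosen significance level.
    """
    combined = sorted(sample1 + sample2)

    def rank(v):
        # average rank of value v in the combined ordered sample
        lo = combined.index(v) + 1
        hi = lo + combined.count(v) - 1
        return (lo + hi) / 2

    return sum(rank(v) for v in sample1)
```

For example, with sample1 = [1, 3] and sample2 = [2, 4, 5] the combined ranks of sample1 are 1 and 3, so T1 = 4; the test then checks whether 4 falls between the tabulated Tl and Tu for N1 = 2, N2 = 3.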
-------
APPENDIX B
FITTING DISTRIBUTIONS TO DATA
-------
FITTING DISTRIBUTIONS TO DATA
Many of the outlier tests in Section 3 assume that the un-
derlying distribution of the data is known. Four distributions
which often provide good fits to air pollution and meteorological
data are the normal, lognormal, exponential, and Weibull. The
mathematical equations of these four distributions are:
Normal:      G(x) = (1/√(2π)) ∫[w,∞] exp(-t²/2) dt             B-1
             w = (x - μ)/σ                                      B-2
Lognormal:   G(x) = (1/√(2π)) ∫[w,∞] exp(-t²/2) dt             B-3
             w = (ln x - μ)/σ                                   B-4
Exponential: G(x) = exp[-λ(x - δ)]                              B-5
Weibull:     G(x) = exp[-(x/δ)^k].                              B-6
G(x) is defined as the fraction of a population having values
greater than x.
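The four exceedance functions of Equations B-1 through B-6 can be evaluated directly. This sketch (not part of the report) uses the identity that the upper-tail area of the standard normal curve from w to infinity equals erfc(w/√2)/2; parameter names follow the equations above.

```python
import math

def g_exceedance(x, dist, p1, p2):
    """Fraction of the population exceeding x (Equations B-1 to B-6).

    dist and parameters:
      'normal'      p1 = mu, p2 = sigma
      'lognormal'   p1 = mu of ln x, p2 = sigma of ln x
      'exponential' p1 = lambda, p2 = delta (location)
      'weibull'     p1 = delta (scale), p2 = k (shape)
    """
    if dist == "normal":
        w = (x - p1) / p2
        return 0.5 * math.erfc(w / math.sqrt(2))   # area from w to infinity
    if dist == "lognormal":
        w = (math.log(x) - p1) / p2
        return 0.5 * math.erfc(w / math.sqrt(2))
    if dist == "exponential":
        return math.exp(-p1 * (x - p2))
    if dist == "weibull":
        return math.exp(-((x / p1) ** p2))
    raise ValueError("unknown distribution: " + dist)
```

As a check, a standard normal variate exceeds 0 half the time, and exceeds 1.282 about 10% of the time, matching the z table given later in this appendix.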
Each of the four distributions described above can be com-
pletely described by specifying values for two parameters. These
parameters relate to the shape, scale, or location of the dis-
tribution when plotted on graph paper. Accurate estimates of the
maximum values of a data set require good estimates of the param-
eters of the distribution which most closely fit the data. Two
methods of fitting these distributions are discussed in this sec-
tion: the traditional method of maximum likelihood which uses
all the data and a least squares approach, which emphasizes the
upper tail of the data.
B-l
-------
THE METHOD OF MAXIMUM LIKELIHOOD
The principal advantage of using the method of maximum
likelihood to estimate distribution parameters is that it pro-
duces estimates which have minimum variance and distributions
which asymptotically approach the normal distribution. Table
B-1 lists maximum likelihood estimators (MLE's) for the param-
eters of the four distributions discussed above. Unfortunately,
the MLE's for the shape and scale parameters of the Weibull dis-
tribution cannot be calculated directly. The equations
k = [(Σ x_i^k ln x_i)/(Σ x_i^k) - (1/n) Σ ln x_i]^-1            B-7

δ = [(1/n) Σ x_i^k]^(1/k)                                       B-8

must be solved simultaneously using an iterative procedure. A
computer program which uses a "golden section" iterative proce-
dure has been developed by Mage et al.²
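A minimal sketch of the iterative solution (not the golden-section program cited above; plain bisection on the root of Equation B-7 is assumed here for illustration, and any one-dimensional root finder would serve), with Equation B-8 then giving the scale parameter:

```python
import math

def weibull_mle(xs, k_lo=0.01, k_hi=20.0, tol=1e-10):
    """Solve Equations B-7 and B-8 for the Weibull shape k and scale delta.

    Assumes the root of B-7 lies in [k_lo, k_hi]; the bracket can be
    widened for strongly skewed data.
    """
    n = len(xs)
    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / n

    def g(k):
        # Equation B-7 rearranged so that g(k) = 0 at the MLE of k
        xk = [x ** k for x in xs]
        return sum(v * L for v, L in zip(xk, logs)) / sum(xk) - mean_log - 1.0 / k

    # bisection: g is negative for small k and positive for large k
    while k_hi - k_lo > tol:
        mid = (k_lo + k_hi) / 2
        if g(mid) > 0:
            k_hi = mid
        else:
            k_lo = mid
    k = (k_lo + k_hi) / 2
    delta = (sum(x ** k for x in xs) / n) ** (1.0 / k)   # Equation B-8
    return k, delta
```

The returned pair satisfies both equations to within the bisection tolerance, which can be verified by substituting back into Equation B-7.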
THE METHOD OF LEAST SQUARES
The parameters estimated by the maximum likelihood method
determine distributions which fit the whole data set. However,
the maximum value of a given data set is often better estimated
using the parameters of a distribution fit to the upper tail of
the data distribution by the method of least squares. This
method requires that the equation defining the distribution under
consideration be expressed as a linear relationship of the form
z = ay + b. Equations B-l, B-3, B-5, and B-6 can be rewritten in
linear form if the following identities are used.
Distribution    z                a         y        b
normal          z                1/σ       x        -μ/σ
lognormal       z                1/σ       ln x     -μ/σ
exponential     ln G(x)          -λ        x        λδ
Weibull         ln[-ln G(x)]     k         ln x     -k ln δ
B-2
-------
The values of z for the normal and lognormal distributions are
determined from the standard normal distribution such that the
area under the standard normal curve from z to °° is equal to
G(x). The following table lists z values for selected G(x)
values in the upper tail of the data distribution.
G(x)        z          G(x)         z
0.50       0           0.02       2.054
0.40       0.253       0.01       2.326
0.30       0.524       0.005      2.576
0.20       0.842       0.002      2.880
0.10       1.282       0.001      3.090
0.05       1.645       0.0005     3.291
A linear regression analysis of the data which have been
transformed from x and G(x) values to y and z by the appropriate
identities listed above will determine a regression line which
has an equation in the form of z = ay + b. Parameters of the
corresponding distribution can be determined from the values of
a and b using the equations listed in Table B-1 under least
squares estimators (LSE's).
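As a sketch of this procedure for one case (illustrative, not part of the report), the exponential identities above (z = ln G(x), y = x, a = -λ, b = λδ) reduce the fit to an ordinary least-squares line; the observed (x, G(x)) pairs from the upper tail are assumed given.

```python
import math

def fit_exponential_lse(points):
    """Least-squares fit of the exponential distribution (Equation B-5).

    points: (x, G(x)) pairs, G(x) being the observed exceedance fraction.
    Transforms to z = ln G(x), y = x, fits z = a*y + b by ordinary least
    squares, and recovers lambda = -a and delta = -b/a (Table B-1).
    """
    ys = [x for x, g in points]
    zs = [math.log(g) for x, g in points]
    n = len(points)
    ybar = sum(ys) / n
    zbar = sum(zs) / n
    sxy = sum((y - ybar) * (z - zbar) for y, z in zip(ys, zs))
    sxx = sum((y - ybar) ** 2 for y in ys)
    a = sxy / sxx
    b = zbar - a * ybar
    return -a, -b / a          # (lambda, delta)
```

On exact exponential exceedance fractions the original parameters are recovered, e.g. λ = 2 and δ = 0.5 from the pairs (1, e⁻¹), (2, e⁻³), (3, e⁻⁵).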
One measure of the goodness of fit of the distribution
under consideration is the coefficient of determination (r²),
which is determined as part of the linear regression analysis.
The closer the r² value is to unity, the better the distribution
fits the data.
There are numerous other statistics which have been sug-
gested for quantifying goodness of fit. EPA has developed a
program which calculates six goodness of fit statistics: abso-
lute deviations, weighted absolute deviations, chi-square,
Kolmogorov-Smirnov, Cramer-von Mises-Smirnov, and the maximum
value of the log-likelihood function.² Four other goodness of
fit statistics in common use are the Kuiper, Watson, Anderson-
Darling, and Shapiro-Wilk statistics.³ Of these statistics, the
r² and chi-square are the easiest to calculate, though not neces-
sarily the best choices for distinguishing which of the distribu-
tions under investigation provide the best fit to the data.
B-3
-------
However, for routine data validation procedures requiring the
selection of a distribution to characterize the data, the r²
statistic is recommended since it can be easily calculated as
part of the linear regression procedure. In general, the distri-
bution which yields the highest r² value should be used to
characterize the data. Parameter values are then estimated using
the LSE equations in Table B-1.
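The selection rule can be sketched as: compute r² for each linearized fit and keep the distribution with the highest value. The dictionary layout below is an assumption for illustration, holding the transformed (y, z) lists produced by the identities above.

```python
def r_squared(ys, zs):
    """Coefficient of determination of the least-squares line z = a*y + b."""
    n = len(ys)
    ybar, zbar = sum(ys) / n, sum(zs) / n
    sxy = sum((y - ybar) * (z - zbar) for y, z in zip(ys, zs))
    sxx = sum((y - ybar) ** 2 for y in ys)
    szz = sum((z - zbar) ** 2 for z in zs)
    return sxy * sxy / (sxx * szz)

def best_distribution(transformed):
    """transformed: dict mapping distribution name -> (y list, z list)
    after the linearizing identities. Returns the name whose linear fit
    has the highest r**2."""
    return max(transformed, key=lambda name: r_squared(*transformed[name]))
```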
TABLE B-1. ESTIMATION OF DISTRIBUTION PARAMETERS

                    Parameter                         Estimator
Distribution   Location  Scale  Shape       MLE                        LSE
Normal            μ        σ     --    μ = x̄                      μ = -b/a
                                       σ = s                      σ = 1/a
Lognormal         --       μ     σ     μ = mean of ln x           μ = -b/a
                                       σ = std. dev. of ln x      σ = 1/a
Exponential       δ        λ     --    δ = min x_i                δ = -b/a
                                       λ = n/Σ(x_i - min x_i)     λ = -a
Weibull           --       δ     k     iterative solutions of     δ = exp(-b/a)
                                       simultaneous equations     k = a
                                       B-7 and B-8
REFERENCES
1. Hastings, N. A. J., and J. B. Peacock. Statistical
Distributions. John Wiley and Sons, New York.
2. Mage, David T., et al. "Techniques for Fitting Proba-
bility Models to Experimental Data," In: Proc. of
Specialty Conference on Quality Assurance in Air Pollu-
tion Measurement, New Orleans, Louisiana, March 1979.
3. Stephens, M. A. EDF Statistics for Goodness of Fit and
Some Comparisons. Journal of the American Statistical
Association, Vol. 69, No. 347, September 1974.
B-4
-------
APPENDIX C
CALCULATION OF LIMITS FOR SHEWHART CONTROL CHART
-------
CALCULATION OF LIMITS FOR SHEWHART CONTROL CHART
The Shewhart control chart enables the data analyst to test
the hypothesis H0 that the mean or range value of the sample
under evaluation comes from a population having the same distri-
bution as the historical data sets A1, A2, ..., AN.
The first step of the procedure is the selection of a suit-
able sample size for the test, as indicated in Section 3.3.3.
Each sample should contain between 4 and 15 values and should
represent a well-defined time period (day, month, quarter, etc.)
for which there is a large body of historical data. Where pos-
sible these time periods should relate to the NAAQS of interest.
Months or quarters would be appropriate time periods for tests of
24-hour TSP, SO2, and NO2 data collected at 6- or 12-day inter-
vals.
The second step is the selection of historical data for
determining the limits on the Shewhart control chart. To the
extent possible, these data sets should contain data collected
under the same conditions (averaging time, site location, season,
weather, local emissions, etc.) as the data set under investiga-
tion. For convenience, these data sets will be labeled A1, A2,
..., AN. Data set A1 contains n1 values and has a mean x̄1 and a
range R1. Similarly, data set A2 contains n2 values and has a
mean x̄2 and a range R2. Continuing in this manner, the data
analyst can develop the following table.

Data set   Sample size   Mean   Range    d2      R/d2
  A1           n1         x̄1     R1      d21    R1/d21
  A2           n2         x̄2     R2      d22    R2/d22
  ...          ...        ...    ...     ...     ...
  AN           nN         x̄N     RN      d2N    RN/d2N

C-1
-------
TABLE C-1. FACTORS FOR ESTIMATING CONTROL LIMITS OF SHEWHART CHART

Number of observations
     in subgroup                    Factors^a
          n               d2        c2        c2/d2
          2             1.128     0.5642     0.5002
          3             1.693     0.7236     0.4274
          4             2.059     0.7979     0.3875
          5             2.326     0.8407     0.3614
          6             2.534     0.8686     0.3428
          7             2.704     0.8882     0.3285
          8             2.847     0.9027     0.3171
          9             2.970     0.9139     0.3077
         10             3.078     0.9227     0.2998
         11             3.173     0.9300     0.2931
         12             3.258     0.9359     0.2873
         13             3.336     0.9410     0.2821
         14             3.407     0.9453     0.2775
         15             3.472     0.9490     0.2733
         16             3.532     0.9523     0.2696
         17             3.588     0.9551     0.2662
         18             3.640     0.9576     0.2631
         19             3.689     0.9599     0.2602
         20             3.735     0.9619     0.2575
         21             3.778     0.9638     0.2551
         22             3.819     0.9655     0.2528
         23             3.858     0.9670     0.2506
         24             3.895     0.9684     0.2486
         25             3.931     0.9696     0.2467
         30             4.086     0.9748     0.2386
         35             4.213     0.9784     0.2322
         40             4.322     0.9811     0.2270
         45             4.415     0.9832     0.2227
         50             4.498     0.9849     0.2190
         55             4.572     0.9863     0.2157
         60             4.639     0.9874     0.2128
         65             4.699     0.9884     0.2103
         70             4.755     0.9892     0.2080
         75             4.806     0.9900     0.2060
         80             4.854     0.9906     0.2041
         85             4.898     0.9912     0.2024
         90             4.939     0.9916     0.2008
         95             4.978     0.9921     0.1993
        100             5.015     0.9925     0.1979

^a These factors assume sampling from a normal universe.
C-2
-------
Figure 3-7 of Section 3 shows the general form of a control
chart for testing sample means. Values of x̄ are indicated on the
vertical axis and units of time (or sample number) are indicated
on the horizontal axis. The central line is a solid horizontal
line drawn at

    x̄ = (1/N) Σ x̄_i.                                         C-1

The upper and lower control limits are dashed horizontal lines
drawn at

    UCL_x̄ = x̄ + z σ_x̄                                        C-2
and
    LCL_x̄ = x̄ - z σ_x̄                                        C-3
where
    σ_x̄ = (1/(N √n_A)) Σ (R_i/d_2i)                           C-4

and n_A is the size of the sample being tested. Computations are
considerably reduced if samples are selected such that n_A = n1 =
n2 = ... = nN = n. In this case, the second, fifth, and sixth
columns in the table above can be omitted and

    σ_x̄ = (1/(d2 N √n)) Σ R_i.                                C-5
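For the equal-sample-size case, Equations C-1, C-2, C-3, and C-5 can be sketched as follows (illustrative, not part of the report); d2 is looked up in Table C-1 for the common sample size n, and z = 2 is used as a default, as suggested later in this appendix.

```python
def xbar_limits(means, ranges, d2, n, z=2.0):
    """Shewhart x-bar chart limits for equal-size historical samples.

    means, ranges: the x-bar and R values of the N historical data sets.
    d2: factor from Table C-1 for the common sample size n.
    Returns (LCL, central line, UCL) per Equations C-1, C-2, C-3, C-5.
    """
    N = len(means)
    center = sum(means) / N                        # Equation C-1
    sigma = sum(ranges) / (d2 * N * n ** 0.5)      # Equation C-5
    return center - z * sigma, center, center + z * sigma
```

For example, three historical monthly samples of n = 5 values with means 10, 12, 11 and ranges 4, 5, 3 give a central line at 11.0 and control limits of roughly 9.46 and 12.54 (d2 = 2.326 for n = 5).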
Figure 3-8 of Section 3 shows the general form of a control
chart for testing sample ranges. Values of R are indicated on
the vertical axis, and units of time (or sample number) are indi-
cated on the horizontal axis. The central line is a solid hori-
zontal line drawn at

    R̄ = (1/N) Σ R_i.                                          C-6

The upper and lower control limits are dashed horizontal lines
drawn at

    UCL_R = R̄ + z σ_R                                          C-7
    LCL_R = R̄ - z σ_R                                          C-8
C-3
-------
where

    \hat{\sigma}_R = \frac{c_2}{N} \sum_{i=1}^{N} \frac{R_i}{d_{2i}}      (C-9)

and c_2 is determined from Table C-1 according to the size n_t of
the sample being tested. Negative LCL's should be treated as
zeros. If samples are selected such that n_t = n_1 = n_2 = ... =
n_N, then

    \hat{\sigma}_R = \frac{c_2}{d_2 N} \sum_{i=1}^{N} R_i = \frac{c_2\bar{R}}{d_2}    (C-10)

Values of c_2/d_2 for 2 \le n \le 100 are listed in Table C-1.
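The range-chart limits of Equations C-6 through C-10 follow the same
pattern. The sketch below assumes equal sample sizes, uses an
illustrative subset of the d2 and c2 factors from Table C-1, and clamps
a negative lower limit to zero as the text directs.

```python
# (d2, c2) factors from Table C-1 for a few sample sizes (illustrative subset).
FACTORS = {2: (1.128, 0.5642), 4: (2.059, 0.7979), 5: (2.326, 0.8407)}

def r_chart_limits(sample_ranges, n, z=2.0):
    """Central line and control limits for a chart of sample ranges.

    Assumes all historical samples share the same size n, so that
    Equation C-10 applies: sigma_R = c2 * R_bar / d2.
    """
    d2, c2 = FACTORS[n]
    r_bar = sum(sample_ranges) / len(sample_ranges)  # Equation C-6
    sigma_r = c2 * r_bar / d2                        # Equation C-10
    lcl = max(0.0, r_bar - z * sigma_r)              # Equation C-8, clamped at 0
    ucl = r_bar + z * sigma_r                        # Equation C-7
    return lcl, r_bar, ucl
```

With historical ranges 2, 3, 4 from samples of size 4, the central line
falls at 3.0; at z = 3 the lower limit would go negative and is
therefore reported as zero.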
The data analyst is testing the hypothesis H_0 that the mean
or range value of the sample under evaluation comes from a popu-
lation having the same distribution as the historical data sets
A_1, A_2, ..., A_N. If H_0 is true, the probability of the mean or
range value falling outside the control limits is assumed to be
equal to twice the area under the standard normal curve to the
right of z. Consequently, the z value contained in Equations
C-2, C-3, C-7, and C-8 is selected according to the desired
sensitivity of the test. The following is a list of probabili-
ties corresponding to selected z values.

    z        Probability outside limits
    1.282    0.20
    1.645    0.10
    1.96     0.05
    2.00     0.0455
    2.326    0.02
    2.576    0.01
    3.00     0.0027
The value z = 3, corresponding to a probability of 0.0027, is com-
monly used for materials testing, but it may be too large for
testing air quality data. The data analyst may wish to use z = 2
to determine the initial control limits and to later increase z
if the original limits flag too many valid data sets.
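The tabulated probabilities are simply twice the upper-tail area of the
standard normal distribution and can be checked with a few lines of
Python (the function name is illustrative):

```python
import math

def prob_outside_limits(z):
    """Two-tailed probability that an in-control value falls outside
    control limits set at +/- z standard deviations: 2 * (1 - Phi(z)),
    where Phi is the standard normal cumulative distribution function."""
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)
```

For instance, prob_outside_limits(3.0) returns approximately 0.0027,
matching the last entry in the list above.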
C-4
-------
TECHNICAL REPORT DATA
(Please read instructions on the reverse before completing)

1. REPORT NO.: EPA-600/4-80-030
3. RECIPIENT'S ACCESSION NO.:
4. TITLE AND SUBTITLE: Validation of Air Monitoring Data
5. REPORT DATE: June 1980
6. PERFORMING ORGANIZATION CODE:
7. AUTHOR(S): A. Carl Nelson, Jr., Dave W. Armentrout,
   and Ted R. Johnson
8. PERFORMING ORGANIZATION REPORT NO.: 3320-N
9. PERFORMING ORGANIZATION NAME AND ADDRESS: PEDCo Environmental, Inc.,
   505 South Duke Street, Suite 503, Durham, North Carolina 27701
10. PROGRAM ELEMENT NO.: Assignment No. 14
11. CONTRACT/GRANT NO.: 68-02-2722
12. SPONSORING AGENCY NAME AND ADDRESS: Office of Research and Development,
    Environmental Monitoring Systems Laboratory,
    Research Triangle Park, N.C. 27711
13. TYPE OF REPORT AND PERIOD COVERED:
14. SPONSORING AGENCY CODE: EPA 600/08
15. SUPPLEMENTARY NOTES:
16. ABSTRACT: Data validation refers to those activities performed after the
    data have been obtained and thus serves as a final screening of the data
    before they are used in a decision-making process. This report provides
    organizations that are monitoring ambient air levels and stationary source
    emissions with a collection of data validation procedures and with criteria
    for selection of the appropriate procedures for the particular application.
    Both hypothetical and case studies, as well as several examples, are given
    to illustrate the use of the procedures. Statistical procedures and tables
    are in the appendices.
17. KEY WORDS AND DOCUMENT ANALYSIS
    a. DESCRIPTORS: Data Validation; Data Screening; Data Editing;
       Quality Assurance; Outliers; Statistics; Environmental Data
    b. IDENTIFIERS/OPEN ENDED TERMS: Environmental Monitoring; Data Management
    c. COSATI Field/Group: 43F; 68A
18. DISTRIBUTION STATEMENT: Release to public
19. SECURITY CLASS (This Report): Unclassified
20. SECURITY CLASS (This page): Unclassified
21. NO. OF PAGES:
22. PRICE:

EPA Form 2220-1 (1-73)
C-5
------- |