United States
                Environmental Protection
                Agency
                Environmental Monitoring and Support
                Laboratory
                Research Triangle Park NC 27711
                Research and Development
EPA
Validation of Air
Monitoring Data

-------
                RESEARCH REPORTING SERIES

Research reports of the Office of Research and Development, U S Environmental
Protection Agency nave been grouped into nine series These nine broad cate-
gories were established to facilitate further development and application of en-
vironmental technology  Elimination of traditional grouping  was  consciously
planned to foster technology transfer and a maximum interface in related fields.
The nine series are

      1   Environmental  Health  Effects Research
      2   Environmental  Protection Technology
      3   Ecological Research
      4   Environmental  Monitoring
      5   Socioeconomic Environmental Studies
      6   Scientific and Technical Assessment Reports (STAR)
      7   Interagency Energy-Environment Research and Development
      8   "Special" Reports
      9   Miscellaneous Reports

This report has been assigned to the ENVIRONMENTAL MONITORING series.
This series describes research conducted to develop new or improved methods
and instrumentation for the identification and quantification of  environmental
pollutants at the lowest conceivably significant concentrations. It also includes
studies to determine the ambient concentrations of pollutants in the environment
and/or the variance of pollutants as a function of time or meteorological factors
This document is available to the public through the National Technical Informa-
tion Service, Springfield, Virginia  22161.

-------
                              EPA-600/4-80-030

                                        June 1980
  Validation  of Air
  Monitoring  Data
               by
  A. Carl Nelson, Jr., Dave W. Armentrout
         and Ted R. Johnson

       PEDCo Environmental, Inc.
     Durham, North Carolina 27701
       Contract No. 68-02-2722
            Task No. 14
  EPA Project Officer: Raymond C. Rhodes
  Data Management and Analysis Division
            Prepared for

U.S. ENVIRONMENTAL PROTECTION AGENCY
Environmental Monitoring Systems Laboratory
Research Triangle Park, North Carolina 27711

-------
                              DISCLAIMER





     This report has been reviewed by the Environmental Monitoring Systems
Laboratory, U.S. Environmental Protection Agency, and approved for publication.
Mention of trade names or commercial products does not constitute endorsement
or recommendation for use.
                                      ii

-------
                         ACKNOWLEDGEMENT

     This  report  was prepared  for the  Environmental  Monitoring
Systems Laboratory  of the U.S.  Environmental Protection Agency.
Mr. Raymond  C.  Rhodes was  the  Project  Officer.   PEDCo Environ-
mental, Inc., appreciates the direction and extensive review pro-
vided.  Appreciation  is  also expressed  to Ms. Debora  Pizer and
Mr. Jose  Sune  for  the  information  on  CHAMP,  to Mr.  Robert B.
Jurgens  for  information  on  RAMS, to  Mr.  E.  Gardner  Evans for
examples of  data  validation tests and to Mr. Seymour Hochheiser
for discussions and  reviews.   In the preparation of this report,
materials  from  the  Data  Validation lectures of the Air Pollution
Training Institute  (APTI), 470  course  (Quality Assurance for Air
Pollution  Measurement Systems)  and  examples  from the  Office of
Air Quality  Planning and  Standards  (OAQPS)  guideline  have been
used.    Mr.  A.  Carl  Nelson  served  as  PEDCo's Project Manager.
Mr. Nelson,  Mr. Dave W.  Armentrout,  and  Mr.  Ted  R.  Johnson were
the principal authors.

-------
                                 FOREWORD
     Measurement and monitoring research  efforts  are  designed  to  anticipate
potential  environmental  problems,  to support regulatory actions by developing
an in-depth understanding of the nature and processes that impact health  and
the ecology, to provide  innovative means  of monitoring compliance with
regulations and to evaluate the effectiveness of  health and environmental
protection efforts through the monitoring of long-term trends.  The
Environmental  Monitoring Systems Laboratory, Research Triangle Park,
North Carolina, has the  responsibility for:  assessment of environmental
monitoring technology and systems; implementation of  agency-wide  quality
assurance programs for air pollution measurement  systems;  and  supplying
technical  support to other groups in the Agency including  the  Office  of
Air, Noise and Radiation, the Office of Toxic Substances,  and  the Office
of Enforcement.

     Data validation, an element of quality assurance, is  necessary to
provide accurate and reliable environmental data.  Data of known  and
acceptable quality are needed for measuring compliance with regulations,
assessing health effects, and developing optimum strategies to cope with
environmental  pollution situations.  A unified treatment of validation
of particular types of data bases is needed to support broad-scale uses
of these data.  This report presents a systematic review of procedures and
techniques which have proven useful in performing the data validation
function.  Recommendations are given for the selection and implementation
of procedures and techniques appropriate for organizations performing
air monitoring, depending upon their particular mode of operations,
computational  and statistical capability, and monitoring objectives.
                                     iv

-------
                                    ABSTRACT
     Data validation refers to those activities performed after the data
have been obtained and thus serves as a final  screening of the data before
they are used in a decision making process.   This report provides organi-
zations that are monitoring ambient air levels and stationary source
emissions with a collection of data validation procedures and with
criteria for selection of the appropriate procedures for the particular
application.  Both hypothetical and case studies, and several examples
are given to illustrate the use of the procedures.  Statistical procedures
and tables are in the appendices.

-------
                                  CONTENTS
                                                                      Page

Acknowledgment                                                         iii
Abstract                                                                 v
Figures                                                                 ix
Tables                                                                   x
Executive Summary                                                       xi

1.0  Introduction                                                      1-1

     1.1  Purpose                                                      1-1
     1.2  Scope and Organization of the Document                       1-1
     1.3  References                                                   1-2

2.0  Background                                                        2-1

     2.1  Definition and Scope of Data Validation                      2-1
     2.2  Data Validation Procedures                                   2-2
     2.3  Selection of Data Validation Procedures                      2-2
     2.4  Implementation of Data Validation                            2-3
     2.5  Brief Literature Review                                      2-3
     2.6  References                                                   2-4

3.0  Data Validation Procedures                                        3-1

     3.1  Routine Procedures                                           3-1

          3.1.1  Data identification checks                            3-2
          3.1.2  Unusual event review                                  3-2
          3.1.3  Deterministic relationship checks                     3-2
          3.1.4  Data processing procedures                            3-3

     3.2  Tests for Internal Consistency                               3-5

          3.2.1  Data plots                                            3-5
          3.2.2  Dixon ratio test                                      3-6
          3.2.3  Grubbs test                                           3-12
          3.2.4  Gap test                                              3-14
          3.2.5  "Johnson" p test                                      3-16
          3.2.6  Multivariate tests                                    3-17

     3.3  Tests for Historical Consistency                             3-19

          3.3.1  Gross limit checks                                    3-19
          3.3.2  Pattern and successive value  tests                    3-21
          3.3.3  Parameter relationship test                           3-21
          3.3.4  Shewhart control chart                                3-23
                                     VI

-------
                      CONTENTS (continued)

                                                             Page

     3.4  Tests for Consistency of Parallel Data Sets        3-29

          3.4.1  The sign test                               3-29
          3.4.2  Wilcoxon signed-rank test                   3-32
          3.4.3  Rank sum test                               3-34
          3.4.4  Intersite correlation test                  3-38

     3.5  References                                         3-44

4.0  Selection and Implementation of Procedures               4-1

     4.1  Organizational Criteria                             4-2

          4.1.1  Number of data sets                          4-2
          4.1.2  Historical data requirements                 4-3
          4.1.3  Nature of data anomalies                     4-3
          4.1.4  Strip chart data                             4-3
          4.1.5  Size of data set                             4-4
          4.1.6  Magnetic tape data                           4-4
          4.1.7  Data transmitted by telephone lines          4-4
          4.1.8  Timeliness of procedure                      4-5

     4.2  Analytical Criteria                                 4-5

          4.2.1  Statistical sophistication                   4-5
          4.2.2  Computational requirements                   4-5
          4.2.3  Expense of analysis                          4-6
          4.2.4  Sensitivity of the procedure                 4-6
          4.2.5  Use of data                                  4-7

     4.3  Implementation of Data Validation Process           4-7

     4.4  References                                          4-9

5.0  Hypothetical Examples and Case Studies                   5-1

     5.1  Hypothetical Example for Ambient Air Monitoring     5-1

          5.1.1  Without computerized support                 5-1
          5.1.2  With computerized support                    5-4

     5.2  Hypothetical Example for Source Testing             5-4

     5.3  Case Study of a Manual Data Validation System       5-5

          5.3.1  Quality control                              5-6
          5.3.2  Data validation                              5-8
                               vii

-------
                      CONTENTS (continued)

                                                             Page

     5.4  Case Study of the CHAMP Automated Data Validation
            System                                           5-10

          5.4.1  Quality control functions                    5-12
          5.4.2  Data validation                             5-14

     5.5  Case Study of a Regional Air Monitoring System
            (RAMS) Data Validation                           5-23

          5.5.1  Quality control                             5-23
          5.5.2  Data validation                             5-27

     5.6  References                                         5-35

6.0  Bibliography                                             6-1

Appendix A - Statistical Tables                               A-1

Appendix B - Fitting Distributions to Data                    B-1

Appendix C - Calculation of Limits for Shewhart Control Chart C-1

-------
                             FIGURES

Number                                                       Page

3-1       Computer plot of data with single anomalous value  3-7
3-2       Computer plot of data with single anomalous value  3-8
3-3       Illustration of Dixon ratio test for two TSP
            monthly data sets                                3-10
3-4       Illustration of Grubbs test for two TSP monthly
            data sets                                        3-13
3-5       Illustration of gap test for hourly ozone data
            (Newport, KY, 1976)                              3-15
3-6       P(x) versus concentration for Weibull distribu-
            tion fitted to 1976 N02 data from Essex, MD.
            Acceptance range defined as 0.05 < P(x) < 0.95   3-18
3-7       Shewhart control chart for mean values with 1978
            data plotted                                     3-26
3-8       Shewhart control chart for range values with 1978
            data plotted                                     3-27
3-9       Intersite correlation test data                    3-42
5-1       Automated quality control tests                    5-13
5-2       Example CHAMP data validation report (partial
            printout)                                        5-16
5-3       High/low critical values for CHAMP secondary
            parameters (partial printout)                    5-16
5-4       CHAMP validation system, invalidity causes by
            hour                                             5-17
5-5       Five-minute values of invalid secondary param-
            eters                                            5-17
5-6       Example CHAMP journal entries for data valida-
            tion                                             5-18
5-7       Example CHAMP validation data review               5-19
5-8       Example ozone daily data curve from CHAMP
            data validation                                  5-21
5-9       Example NO  curve from CHAMP data validation       5-22
5-10      Location of RAMS stations                          5-25
5-11      RAMS data flow:  RAMS station                      5-26
5-12      RAMS data flow, central facility                   5-28
5-13      RAMS hourly average temperature data,
           January 21, to February 9, 1975                   5-31
5-14      Minute temperature data from Station 103, from
            1200 to 2359 hours, February 1, 1975             5-32
5-15      Reference weather data used in RAMS data
            validation                                       5-34
5-16      Examples of instrument responses that can be
            detected through minute successive differences   5-35
                               IX

-------
                             TABLES

Number                                                       Page
3-1       Examples of Hourly and Daily Gross Limit Checks
            for Ambient Air Monitoring                       3-20
3-2       Summary of Limit Values used in EPA Region V
            for Pattern Tests                                3-22
3-3       TSP Data from Site 397140014H01 Selected as
            Historical Data Base for Shewhart Control Chart  3-25
3-4       TSP Data from Site 397140014H01 for Control Chart
            (1978)                                           3-25
3-5       Ozone Data (ppb) Recorded August 1, 1978, at
            Monitoring Sites 909929911101 (Site A) and
            090020013101 (Site B)                            3-31
3-6       Ozone Data (ppb) Incorporating Simulated
            +5 ppb Calibration Error at Site A               3-33
3-7       Procedure for Wilcoxon Signed-Rank Test            3-34
3-8       Application of Wilcoxon Signed-Rank Test to
            Ozone Data (ppb) in Table 3-4                    3-35
3-9       Procedure for Wilcoxon Rank Sum Test               3-36
3-10      Application of Rank Sum Test to Ozone Data
            in Table 3-5                                     3-37
3-11      TSP Data from Sites 397140014H01 and
            397140020H01 for Intersite Correlation
            Test, 1978                                       3-40
3-12      χ² Values for Two Degrees of Freedom, for
            Various Probability Levels                       3-43
4-1       Factors to Consider in the Selection of Data
            Validation Procedures                            4-1
4-2       Selection of Data Validation Procedures            4-2
5-1       CHAMP Environmental Parameters                     5-11
5-2       Parameters Monitored in the Regional Air
            Monitoring System                                5-24
5-3       Typical Gross Limit Values used in the Regional
            Air Monitoring System Data Validation Pro-
            cedure                                           5-29
A-1       Dixon Criteria for Testing of Extreme Observation
             (Single Sample)                                  A-1
A-2       Critical Values for 5% and 1% Tests of Dis-
            cordancy for Two Outliers in a Normal Sample      A-2
A-3       Critical T Values for One-Sided Grubbs Test when
            Standard Deviation is Calculated from Sample      A-3
A-4       Wilcoxon Signed-Rank Test                           A-6
A-5       Rank Sum Test α = P[H0 is true]                     A-7
B-1       Estimation of Distribution Parameters               B-4
C-1       Factors for Estimating Control Limits of
            Shewhart Chart                                    C-2

-------
                        EXECUTIVE SUMMARY

     An essential element of  quality  assurance is the validation
of data.  The  data  validation procedures in this report span the
range of application  from the collectors of data (e.g., State or
local agencies) to the users of the data (researchers of regional
or  Federal   agencies,   consultants,  and  industrial  personnel).
     It is stressed that data validation is most effectively per-
formed  where  the data  are  collected;  questionable data  can be
most easily checked at this  level.  The validation procedures are
described  in  increasing order  of  complexity  and  statistical
sophistication.  They  are  grouped into  four  categories for ease
in understanding  their  role  in  applications:   procedures which
should  be  applied  routinely,  those which  are used to  test the
internal consistency within a given  data set, and procedures for
comparing data sets  with historical  data and  with  other data
sets.   The  procedures  applicable at  the  originating  level  are
described first.  The more  complex statistical procedures may be
used if the  level of computer  capability,  training,  and support
is adequate.
     The proper implementation of data validation will ensure its
effectiveness.  An individual should be assigned the responsibil-
ity of  data validation,  even  though it may be a part-time activ-
ity.   One responsibility will be  to develop  the data validation
plan as an integral part of the agency quality assurance plan.  A
data flow diagram  should be  developed to  identify  each step of
data handling process which  may result in an error  or  in lost
data.   The data validation  plan can be  developed  from the tech-
niques available in this report or in the literature (many refer-
ences are included  in this  report).  A  review and evaluation of
the effectiveness of  the data validation process  should be made
periodically 1) to optimize  the sensitivity of the checks, and 2)

-------
to summarize and evaluate the basic reasons and causes of invali-
dation.  Validation limits may  be  changed to alter the sensitiv-
ity.  A  particular  procedure may  not  be adequate or  may be too
costly (relative  to its effectiveness)  to  be used  on a routine
basis.
     In summary,  data validation is  an effective means of ensur-
ing the integrity of the data for use in decision making.
                                 xii

-------
                        1.0  INTRODUCTION
1.1  PURPOSE
     The purpose of this  report is to provide organizations that
are monitoring ambient air levels and stationary source emissions
with  (1)  a collection of  useful data validation  procedures  and
with  (2)  criteria for  selecting the appropriate  procedures  for
the application.  Data validation procedures are an integral part
of a complete quality assurance program.
     In  this report,  data validation  will  refer to  those  ac-
tivities performed  after  the fact—that is,  after the data have
been obtained.  Quality control has the purpose of minimizing the
amount of  bad data  collected  or obtained.   Data validation is to
prevent the remaining bad data from getting through the data col-
lection  and  storage  system.   Data  validation thus serves  as  a
final screen before the data are used in the decision making pro-
cess.  Whether  data validation  is performed  by a  data validator
with this  specific  assignment,  by a researcher using an existing
data bank, or by  a  member of a field team or local agency, it is
preferable that data  validation be performed as soon as possible
after the  data  have been  obtained.  At this stage, the question-
able data  can be  checked  by recalling information concerning un-
usual events  and  meteorological conditions which  can  aid in the
validation process.   Also, timely corrective actions may be taken
when  indicated  to  minimize  further generation of questionable
data.
1.2  SCOPE AND ORGANIZATION OF THE DOCUMENT
     Although  this  report includes  discussion of the  general
theory of  data validation, the examples described here  concern
air pollution data.   In particular, the validation procedures can
be applied to ambient air monitoring data,  source test data,  and
meteorological data.  This document presents general data valida-
tion  procedures  in  Section  3,  guidelines  for  selection and

                              1-1

-------
implementation of data validation procedures in Section 4, and
case studies and hypothetical examples in Section 5.  The refer-
ences are given at the end of each section.  Appendices A, B, and
C contain tables and supplemental mathematical and statistical
background.  The reader will not need a mathematical or statisti-
cal background to follow the guidelines in this document; he or
she needs only an interest in developing a data validation pro-
cess.
     This document is not intended to replace previous publica-
tions.  Some recent publications are recommended to the reader
for additional information.1-5  Section 2 contains additional
background information.

1.3  REFERENCES

     1.   U.S.  Environmental Protection Agency.   Screening Proce-
          dures for Ambient Air  Quality Data.   EPA-450/2-78-037,
          July 1978.

     2.   Rhodes, R. C., and S. Hochheiser.  Data Validation Con-
          ference Proceedings.   Presented by Office  of Research
          and Development, U.S.  Environmental  Protection Agency,
          Research  Triangle  Park,  North Carolina,  EPA 600/9-79-
          042,  September 1979.

     3.   U.S.   Department  of Commerce.   Computer  Science  and
          Technology:   Performance  Assurance and  Data Integrity
          Practices.   National Bureau  of  Standards,  Washington,
          D.C., January 1978.

     4.   Barnett,  V., and  T.  Lewis.   Outliers  in Statistical
          Data.  John Wiley and Sons,  New York,  1978.

     5.   Naus, J.  I.   Data  Quality Control and Editing.  Marcel
          Dekker, Inc., New York, 1975.
                              1-2

-------
                          2.0  BACKGROUND

2.1  DEFINITION AND SCOPE OF DATA VALIDATION
     Data validation  is  "a  systematic process  for  reviewing  a
body of data  against  a set of criteria to provide assurance that
the  data  are  adequate for  their  intended use."1   In technical
literature,  data  validation may be referred  to  as data editing,
data screening, data  checking,  data auditing, data verification,
data certification,  and  technical  data  review.   Each  of  these
terms refers  to a set of procedures  for  detecting,  evaluating,
and  correcting erroneous or questionable  data.   Data validation
is  one  of the primary activities for  ensuring the reporting and
use of good data,  whereas quality control (QC) activities are de-
signed for the purpose of acquiring good data.
     The  procedures  included  in data  validation vary  with the
method  of obtaining  data.   In  one  application the  data  may be
recorded  on  a field  data form and returned  to  a laboratory for
further handling prior to storage in a data bank.  In another ap-
plication,  such  as  CHAMP  (Section 5),  the  data are  obtained
directly  from the monitor;  several data validation  checks  (in-
cluding plotting  the   data)  are  made  using the  computer system,
and  questionable  data  are  identified  along  with information on
ancillary data needed by the validator to aid  in the data valida-
tion.
     It becomes more difficult to distinguish  between QC and data
validation  activities when  using  a computerized  system for ac-
quiring,  processing,  checking,  and finally  storing the data for
retrieval.   However,   all procedures  applied  after the  data are
first obtained and which result in the  identification  of ques-
tionable  data and which result  in subsequent  investigation of
these data,  will  be considered  as  data validation.   The QC pro-
cess will apply to the techniques applied prior to obtaining the
data,  such   as calibration  checks  of the  instruments and the
use  of  control samples.   Although these QC techniques may result
                              2-1

-------
in data being questioned  and  eliminated,  these data are screened
by the QC system of checks and not by the data validation system.
Recommended QC checks and guidelines  for their use are contained
in the Quality Assurance  Handbook.  This  handbook discusses data
validation briefly in Volume  I  and in the introduction to Volume
II.1
     The data validation  process  includes not only the identifi-
cation or  flagging of questionable data  but  also the investiga-
tion  of  apparent anomalies.  The  latter  step  is  often performed
by a person other than the one performing the first step, partic-
ularly when the  data  validation is being performed at an organi-
zational level removed from the source of the data.
2.2  DATA VALIDATION PROCEDURES
     Specific data validation procedures are described in Section
3.  Although the examples and terminology used are primarily from
air pollution, these  procedures are not limited to this one data
category.  A  given procedure  may be used for air monitoring data
(e.g.,  hi-vol data),  for meteorological  data,  for  source test
data,  and  in  fact for almost any  data obtained by a similar mea-
surement system.
2.3   SELECTION OF DATA VALIDATION  PROCEDURES
      In  addition to general  descriptions of the data validation
procedures, this document discusses how to select the most  appro-
priate procedure  for  a particular application.  The procedure(s)
most  appropriate for a local  agency with  five monitoring stations
and  one  station reporting  meteorological  data may not be  appro-
priate for the performance  of a Reference Method  6 source test  at
a utility plant.   A State  or Federal  agency with  a data bank
storage  and  retrieval system may  require still other procedures.
In  each application,  the  data validation procedures  should  be
selected  with respect  to  considerations  such as  the volume  of
data,  type of data output/transmission,  computational capability,
graphics  capability,  and the nature of expected errors.  These
factors  are considered in Section  4.
                               2-2

-------
2.4  IMPLEMENTATION OF DATA VALIDATION
     The selection of  the  data validation procedures is followed
by the implementation  of  the data validation process.  Decisions
must be made concerning:
     1.   Personnel performing the procedures,
     2.   Frequency of data checks,
     3.   Methods of flagging data,
     4.   Procedures for investigating anomalies, and
     5.   Summarization of data validation, including a summary of
          flagged data and the subsequent decisions concerning them.
Criteria for making these decisions are in Section 4.3.
2.5  BRIEF LITERATURE REVIEW
     There  are  several  good  references  on  the  application  of
validation procedures to specific types of data.2-12  Reference 2
contains  a  discussion  of  the Dixon  ratio  test,   the  Shewhart
quality  control  test,  pattern  tests  for  four pollutants,  copies
of three technical papers,  a comparison of these tests, and com-
puter programs for three  test procedures  (gap test, pattern test
and Shewhart control  chart).   This development work done in EPA,
Office of  Air  Quality Planning and  Standards (OAQPS) became the
basis  for  the  Air  Data   Screening  System,  which is  now being
implemented  in  27 states  through the Air Quality  Data Handling
System (AQDHS).  References 3 through 7 contain background infor-
mation on  the statistical test  procedures  in  the  OAQPS  Guide-
line.2   These  references  demonstrate and  compare the effective-
ness of  the  screening procedures  of reference 2.  In particular,
Reference 3 describes the application and evaluation of the Dixon
ratio test.  Reference 4 compares seven screening test procedures
using continuous 1-hour measurements for three pollutants and two
tests (Dixon ratio and Shewhart control chart) using 24-hour data
for three  pollutants.   Reference  5 compares the Dixon ratio test
and the  Shewhart control  chart; good agreement was indicated and
use  of  both procedures  was  recommended;  however,  the  control
chart  procedure  was  preferred if  only one  procedure  was used.
Reference 6 describes the application of the control chart proce-
                              2-3

-------
dure to data  from  monitoring  sites in Region V.  Reference 7 de-

scribes an  automated pattern test procedure  and  includes  a com-

puter program  for  same.  The  limit  values used  in this pattern

test are  in Section 3.3.   Reference  8 contains  a  collection of

papers on a variety  of applications  in air pollution.  Reference

9  is  a  recent publication on data validation  as  applied to com-

puter systems.  A major portion of the literature on data valida-

tion would be classified as statistical in content.   Furthermore,

many  of  the procedures  for identifying possible data anomalies

are included throughout the statistical literature under the sub-

ject of  outliers.   Reference  10,  a recent  text on  this subject,

summarizes  the  results  dispersed  throughout  the literature.   A

monograph on  statistical applications to data QC and editing is

Reference 11.   Reference  12  contains several  statistical tests

for  outliers  and  the  corresponding tables.   Several additional

references  will be  given  at  the ends  of respective sections.

2.6  REFERENCES

     1.   U.S.  Environmental  Protection Agency.  Quality  Assur-
          ance  Handbook:   Vol. I,  Principles; Vol.  II,  Ambient
          Air Specific  Methods;  and Vol.  III, Stationary Source
          Specific Methods.  EPA-600/9-76-005, 1976.

     2.   U.S. Environmental Protection Agency.  Screening Proce-
          dures for  Ambient Air Quality  Data.  EPA-450/2-78-037,
          July 1978.

     3.   W. F. Hunt,  Jr.,  T.  C.  Curran, N. H.  Frank,  and R. B.
          Faoro, "Use  of Statistical  Quality  Control Procedures
          in Achieving  and Maintaining Clean  Air," Transactions
          of the Joint European Organization for Quality Control/
          International  Academy  for Quality  Conference,  Venice
          Lido, Italy, September 1975.

     4.   W. F. Hunt,  Jr.,  R.  B.  Faoro,  T. C.  Curran,  and W. M.
          Cox,  "The  Application of Quality Control  Procedures to
          the Ambient  Air Pollution Problem  in the USA,"  Trans-
          actions  of the European Organization for Quality Con-
          trol, Copenhagen, Denmark,  June 1976.

     5.   W.  F.  Hunt,  Jr., R.  B.  Faoro,  and  S.  K. Goranson, "A
          Comparison of the Dixon Ratio Test and Shewhart Control
          Test  Applied  to  the National  Aerometric  Data  Bank,"
          Transactions  of the  American  Society for Quality Con-
          trol, Toronto, Canada,  June 1976.
                              2-4

-------
 6.    W.  F. Hunt, Jr.,  J.  B.  Clark,  and S.  K.  Goranson,  "The
      Shewhart Control  Chart  Test:   A  Recommended  Procedure
      for Screening  24-Hour  Air Pollution  Measurements,"  J.
      Air Poll.  Control Assoc. 28:508 (1979).

 7.    R.  B. Faoro, T.  C.  Curran,  and W. F.  Hunt,  Jr.,  "Auto-
      mated Screening of  Hourly Air  Quality Data,"  Trans-
      actions  of  the  American  Society for Quality Control,
          Chicago, Ill.,  May 1978.

 8.    Rhodes,  R.  C.,  and S. Hochheiser.   Data Validation Con-
      ference  Proceedings.  Presented  by Office  of Research
      and Development, U.S. Environmental Protection Agency,
      Research Triangle  Park,  North Carolina, EPA  600/9-79-
      042,  September 1979.

 9.    U.S.   Department  of Commerce.    Computer  Science  and
      Technology:   Performance  Assurance and  Data  Integrity
      Practices.   National  Bureau of Standards,  Washington,
      D.C., January 1978.

10.    Barnett, V.   and T.  Lewis.    Outliers   in  Statistical
      Data.  John Wiley and Sons,  New York,  1978.

11.    Naus, J. I.  Data  Quality Control and Editing.  Marcel
      Dekker,  Inc.,  New York,  1975.

12.    1978 Annual Book of ASTM  Standards,  Part 41.   Standard
      Recommended Practice for Dealing with Outlying Observa-
      tions, ASTM Designation:  E 178-75.   pp.  212-240.
                          2-5

-------
                 3.0  DATA VALIDATION PROCEDURES

     This section contains descriptions,  along  with examples,  of
recommended data  validation procedures.   These procedures  fall
into four categories:
     1.    Check and  review procedures  which should  be used  to
some extent in every validation process,
     2.    Procedures  for  testing the  internal  consistency  of a
single data set,
     3.    Procedures  for  testing  the  consistency  of  data  sets
with  previous  data   (historical  or  temporal  consistency),  and
     4.    Procedures  for  testing the consistency of  two  or more
data sets collected  at  the  same time or under similar conditions
(consistency of parallel data sets).
These four  categories are described in Sections 3.1  to 3.4.  In
each section,  the procedures  are arranged in increasing order of
statistical complexity.   Hence the user of this  report wishing to
use the  simplest procedures  should  refer  to the first one or two
procedures  in  each  subsection.  In particular,  a local or State
agency,   with  a small staff  but without computer facilities and
statistical support,  would  probably use  only the procedures de-
scribed  in  Sections  3.1.1,  3.1.2,  3.1.3,  3.2.1,   3.3.1,  3.3.2,
3.3.3, and  3.3.4.   The selection of  appropriate  data validation
procedures  for  a particular  application is in Section 4.   Imple-
mentation of the data validation process  is described in Section
4.3.
3.1  ROUTINE PROCEDURES
     Validation checks which should be  made routinely during the
processing  of data include  checks  for proper data identification
codes,  review  of  unusual  events,  deterministic   relationship
checks,   and performance  checks  of the data processing  system.
                              3-1

-------
3.1.1  Data Identification Checks
     Data with improper identification codes  are  useless.   Iden-
tification fields which must be  correct  are:   1)  time (start and
stop time  and  date),  2)  location, 3)  sampling/analytical  method
code,  4)  pollutant method  interval  unit,  5) parameter,  and  6)
decimal.   Examples of data identification problems that have been
noted  by  the EPA  regional offices  include:   1)  improper State
identification codes;  2)  data identified  for a  nonexistent  day
(e.g.,  October  35);  and  3) duplicate  data  from  one  monitoring
site but  no data  from another.   Since  most of these  types  of
problems are the  result of human error,  an individual other than
the  original  person  preparing  the  forms  should scan  the  data
coding forms prior to using the data  for  computer entry or manual
summary.   The data  listings should  also  be  checked  after entry
into a computer system or  data bank.
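
     Where data are keyed into a computer system, the identification
checks described above can also be automated.  The short Python sketch
below illustrates one possible form of such a check; the field names,
the master site list, and the date format are hypothetical and are not
taken from any EPA data bank.

```python
# Illustrative sketch of routine data identification checks (Section 3.1.1).
# Field names, the site list, and the date format are hypothetical.
from datetime import datetime

REQUIRED_FIELDS = ["site_id", "parameter_code", "method_code", "start_time", "units"]
VALID_SITE_IDS = {"058440101001", "397140014H01"}   # example master list

def check_identification(record):
    """Return a list of identification problems found in one data record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append("missing field: " + field)
    if record.get("site_id") and record["site_id"] not in VALID_SITE_IDS:
        problems.append("unknown site id: " + record["site_id"])
    try:
        # rejects impossible dates such as "October 35"
        datetime.strptime(record.get("start_time", ""), "%Y-%m-%d %H:%M")
    except ValueError:
        problems.append("bad date/time: " + repr(record.get("start_time")))
    return problems

print(check_identification(
    {"site_id": "058440101001", "parameter_code": "42603",
     "method_code": "14", "start_time": "1977-10-35 04:00", "units": "ppm"}))
```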
3.1.2  Unusual Event Review
     A log should be maintained  by each agency to record extrin-
sic  events  (e.g.,  construction activity,  duststorms,  unusual
traffic  volume,  and  traffic  jams)  that  could  explain unusual
data.  Depending on the purpose of data collection, this informa-
tion could also be used to explain why no data are reported for a
specified time  interval,  or it  could  be the  basis  for deleting
data from a file for specific analytical  purposes.
3.1.3  Deterministic Relationship Checks
     Data sets which contain two or more physically or chemically
related parameters should be routinely checked to ensure that the
measured values on an individual parameter do not exceed the mea-
sured  values  of an aggregate parameter which includes the indi-
vidual parameter.  For  example,  NO2  values should not exceed NOx
values  recorded  at the  same  time  and location.   The  following
table  lists some,  but not all,  of the  possible deterministic
relationship  checks  involving  air  quality   and meteorological
parameters.   The measured  values of  the  individual  parameters
(first  column  of  table)   should  not  exceed the  corresponding
measured values of the aggregate parameter  (second column).

                              3-2

-------
     Individual parameter              Aggregate parameter
     NO (nitric oxide)             NOx (total nitrogen oxides)
     NO2 (nitrogen dioxide)        NOx (total nitrogen oxides)
     CH4 (methane)                 THC (total hydrocarbons)
     SO2 (sulfur dioxide)          total sulfur
     H2S (hydrogen sulfide)        total sulfur
     Pb (lead)                     TSP (total suspended
                                         particulate)
     dewpoint                      temperature
Data sets  in which individual  parameter values exceed  the cor-
responding aggregate  values  should be flagged for further inves-
tigation.   Minor exceptions to allow for measurement system noise
may be  permitted  in cases where the  individual  value  is a large
percentage of the aggregate value.
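
     For agencies with computer support, checks of this kind are easy to
automate.  The following Python sketch is one illustrative way to flag
records in which an individual parameter exceeds its aggregate parameter;
the parameter names and the noise tolerance are assumptions for the
example, not prescribed values.

```python
# Illustrative deterministic relationship check (Section 3.1.3).  The parameter
# pairs follow the table above; the fractional noise tolerance is an assumption.
CHECKS = [("NO", "NOx"), ("NO2", "NOx"), ("CH4", "THC"),
          ("SO2", "total_sulfur"), ("H2S", "total_sulfur"),
          ("Pb", "TSP"), ("dewpoint", "temperature")]

def deterministic_flags(record, tolerance=0.05):
    """Flag any individual parameter exceeding its aggregate parameter by more
    than the stated fractional tolerance (a rough allowance for noise)."""
    flags = []
    for individual, aggregate in CHECKS:
        ind, agg = record.get(individual), record.get(aggregate)
        if ind is None or agg is None:
            continue                      # cannot check an incomplete pair
        if ind > agg * (1.0 + tolerance):
            flags.append("%s=%g exceeds %s=%g" % (individual, ind, aggregate, agg))
    return flags

hourly = {"NO2": 0.062, "NOx": 0.057, "CH4": 1.4, "THC": 1.9}
print(deterministic_flags(hourly))        # -> ['NO2=0.062 exceeds NOx=0.057']
```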
     The deterministic checks listed above are based on theoreti-
cal  relationships between  the  parameters.   Empirical  relation-
ships  can  often  be developed  by  reviewing  historical  data and
noting  parameter  behavior which seldom  or never occurs.  Param-
eter relationship checks based on historical data  of  this kind
are discussed in  Section 3.3.3.
3.1.4  Data Processing Procedures
     Reference  1  identifies  67  procedures currently in use for
detecting  and,  when possible, correcting errors as they occur in
computer systems.   A  review of reference  1  reveals that several
procedures  fall within  the categories  of internal, historical,
and parallel data consistency checks while others are peculiar to
data processing.1  Some of the latter techniques are:
     1.   Context and staged edits (e.g.,  a field edit for check-
ing the data values against the field specifications for length,
character set,  and value range).
     2.   Addition  of quality flags  to  items in  a data base to
condition processing to avoid a mismatch between the data quality
and its use.
     3.   Redundancy  in  batches,  files,  and inputs  to improve
reliability.
                              3-3

-------
     4.    Checks on  data  sequence (e.g., input data  are checked
for correct time sequence).
     5.    Editing by  classification of  category,  class  limits,
normal limits,  and trend limits.   For example,  the  behavior of an
individual data item can  be  compared to its previous  behavior or
to the aggregate of individuals in its group.  Procedures of this
type are included in Sections 3.2 and 3.3.
     6.    Parallel check  calculations,  useful  when the  same re-
sults can be obtained  by  two independent calculation procedures.
     7.    Built-in test data, verification tests,  and diagnostics
to provide a test environment without the risk of allowing an un-
checked program access to real files.
     8.    On-line testing to  exercise  the  fault detection logic.
     9.    Clearly  defined   organizational   responsibilities  to
ensure  that  the correct  data  validation procedures  are  continu-
ally employed.
     No  single  grand or  ideal  solution to  the problem  of error
control within  automated  data processing exists.   Each organiza-
tion  must analyze  its data  processing system and the  possible
errors peculiar to the system.  However, all organizations should
develop  a data  flow diagram  which  indicates  the steps  in data
handling at which an error can occur.  In general these steps can
be classified as  (1)  data input, (2) data transmission,  (3) data
processing and  (4) data output.
     Principal  sources of error  in  the  data input stage include
key-punch  errors  and  the use  of mislabeled computer  files.  One
should  always review the  input for errors of this kind.  This can
be done conveniently if  the input  data are included in the computer
output  in a format for easy review.
     The  principal  result of  transmission  error is  the loss or
alteration  of  data.   A  simple  way to  check for  transmission
error is to transmit the  data  a  second time  and then pair the two
data  streams.   Gaps  and alterations will be immediately apparent
unless  the transmission error  is  systematic.
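
     The double-transmission check can be reduced to a simple pairing of
the two received streams, as in the minimal Python sketch below; the
example values are invented.

```python
# Minimal sketch of the double-transmission check: the same block of data is
# transmitted twice and the two received streams are paired value by value.
from itertools import zip_longest

def compare_transmissions(first, second):
    """Report positions where the two streams disagree or differ in length."""
    return [(i, a, b)
            for i, (a, b) in enumerate(zip_longest(first, second))
            if a != b]

stream_1 = [0.041, 0.038, 0.052, 0.047]
stream_2 = [0.041, 0.038, 0.052]          # one value lost in transmission
print(compare_transmissions(stream_1, stream_2))   # -> [(3, 0.047, None)]
```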
     Processing errors are  more difficult to characterize  and to
detect.   They are usually caused by deficiencies in the computer
                              3-4

-------
programs which manipulate the data files, perform mathematical cal-
culations, and  format the output results.  A  standard method of
checking  for  processing  errors  is to  make  up a  small,  typical
data set,  perform the appropriate data manipulations  and calcu-
lations by hand,  and  compare with the results from the data pro-
cessing system.   This procedure should provide  a good  check if
the  data  processing  errors  are  not related  to  the size  of the
data set.   This procedure is  also  appropriate for  checking the
part of the data processing system which outputs the data.
     A detailed discussion of possible data validation procedures
for computer applications would require a report of size at least
equal  to  this  entire  report and would  be  repetitive of informa-
tion  available  in  the  pertinent  literature,  some  of  which is
listed  in reference 1.   Because reference  1  contains  sufficient
detail  for  computer trained personnel  to  develop  a  data valida-
tion system peculiar to their own system, no attempt is made here
to duplicate its material.
3.2  TESTS FOR INTERNAL CONSISTENCY
     Internal consistency tests  check  for values  in  a  data set
which  appear  atypical when compared to values of  the  whole data
set.  Common anomalies of this type include unusually high or low
values  (outliers) and large  differences in adjacent values.  The
following tests  for internal consistency are  listed in order of
increasing statistical sophistication.   These  tests  will not de-
tect errors which alter  all  values  of  the data set  by either an
additive  or  multiplicative factor (e.g., an error in  the use of
the scale of a meter or recorder).
3.2.1  Data Plots
     Data plotting  (including  strip chart  records) is  one of the
most  effective  means of identifying  possible  data  anomalies.
However, plotting all  data points may require considerable manual
effort  or computer time.   Nevertheless,  data  plots  will  often
identify  unusual data  that would not  ordinarily be identified by
other internal consistency tests.
                              3-5

-------
     Figures 3-1  and 3-2  show computer plots  of data  with  two
types of  unusual  values.   In  Figure  3-1,  there  is  an unusually
high value  which  would be identified  by almost  all  of  the test
procedures  described in  Section 3.2.   On the  other hand,  the
anomalous 4AM  value  in  Figure 3-2  is  similar  in magnitude  to
several other values  recorded  for  August 25 and 26 and would not
be detected by most of these  tests.   The large difference between
the 4AM value and  the neighboring  values in the time sequence is
immediately apparent from the data  plot.
     Although data plots  are particularly  appropriate for check-
ing  the  internal  consistency  of data,  they may  also  be  used for
checking historical consistency (e.g.,  the Shewhart control chart
in  Section  3.3.4)  and parallel consistency  (e.g.,  the intersite
correlation test in Section 3.4.4).
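
     For organizations with computer plotting capability, even a very
simple time-series plot will reveal the kind of isolated spike shown in
Figure 3-2.  The sketch below, which assumes the matplotlib plotting
library is available, uses invented hourly ozone values; it is an
illustration only, not the plotting routine used to produce the figures
in this report.

```python
# Illustrative time-series plot (Section 3.2.1); hourly values are invented.
# An isolated 4 AM spike stands out even though its magnitude is not extreme
# for the day as a whole.
import matplotlib.pyplot as plt

hours = list(range(24))
ozone_pphm = [2, 2, 1, 1, 9, 2, 3, 4, 5, 6, 7, 8,
              9, 10, 9, 8, 7, 6, 5, 4, 3, 3, 2, 2]

plt.plot(hours, ozone_pphm, marker="o")
plt.xlabel("Hour of day")
plt.ylabel("Ozone, pphm")
plt.title("Hourly ozone with a suspect 4 AM value")
plt.savefig("hourly_ozone_plot.png")
```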
3.2.2  Dixon Ratio Test
     The Dixon ratio tests3,4 are the simplest of the statistical
tests  recommended  for evaluating  the  internal  consistency  of
data.   This section  describes  Dixon ratio tests for evaluating
(1)  the  largest value in a data set and (2)  the largest pair of
values  in  a  data  set.   Both  procedures test the assumption H0
that  the highest  value(s)  in  a  sample are  consistent  with the
spread  of  the  data.   Other Dixon ratio  tests  which may be of
interest  to the data  analyst are described in references 3 and 4.
3.2.2.1   Testing a Single High Value -  The only  data preparation
required  for testing  a single high value is the  identification of
the  lowest  and  the highest two or three (depending on n) values,
where x_1, x_2, x_3, . . ., x_(n-2), x_(n-1), x_n is the arrangement
of the lowest  three  and  highest three values  in ascending order
of magnitude.   The  statistic of interest for n ≤ 7 is the ratio of
the difference  between the highest and second highest values to the
difference  between the highest and  lowest values  (the  range of
the  values  in the data set); thus,

            R = (x_n - x_(n-1))/(x_n - x_1).         Equation 3-1
                               3-6

-------
[Figure 3-1 plot:  hourly oxides of nitrogen concentrations (ppm),
Site 058440101001, 2 Sep 77, parameter code 42603, method 14 (instrumental
chemiluminescence), with a single anomalous value marked.  Both the
studentized t test and the modified Chauvenet test on the highest value
are significant at the 1% level.]
                            Figure  3-1.    Computer  plot  of  data with  single  anomalous  value.

-------
[Figure 3-2 plot:  hourly concentrations (ppm), Site 058440101001,
26 Aug 77, parameter code 42503, method 14, with the flag criterion based
on the standard deviation between adjacent hours of the day.]
                              Figure  3-2.    Computer plot  of  data with single  anomalous  value.

-------
The ratio  R is a  fraction between 0  and  1.   As  this  ratio in-
creases, the  probability  that  the highest  value  is  consistent
with  the  rest  of  the  data—that  is,  P[H0  is  true]—decreases.
Table A-1 (Appendix A) lists critical values of R associated with
P[H0  is  true]  = 5% and P[H0 is true]  =  1%  for  normally distrib-
uted  data  sets.  Note that critical values are  not listed for n
values  exceeding 25  and that  the  test  procedure  changes  with n
(Equation 3-1  is  used  for  n ≤ 7).  The Dixon  ratio test  is not
recommended for large data sets.
     The above  procedure assumes  the  data  are normally and inde-
pendently distributed.  Since normal  distributions are symmetri-
cal in  shape,  they have equal mean and  median  values.   However,
air pollution  data  are  not usually normally distributed.  Data
which are  nonnormal  are usually positively  skewed (mean > medi-
an).  In these cases  a  Dixon  ratio test based on the lognormal
distribution may be  more appropriate; thus  calculate  R by using
      R = (ln x_n - ln x_(n-1))/(ln x_n - ln x_1).    Equation 3-2
Because Equation 3-2 is less  sensitive  to outliers of a normal
distribution than  Equation  3-1,  it should  be used only for test-
ing data which are known to be adequately fitted by  the lognormal
distribution.   If  the data  are  not normally distributed  and if
one is  unsure  of the adequacy of the lognormal distribution, the
gap test described in Section 3.2.4 should be considered.
      Figure  3-3  illustrates the use  of  the Dixon ratio test in
evaluating  2 months  of 24-hour TSP data.  The  two data sets are
identical except for the extreme values.   Using the  normality as-
sumption (Equation 3-1), the Dixon ratio of data set A is
           R_A = (154 - 117)/(154 - 42) = 0.33,
and the Dixon ratio of data set B is
           R_B = (420 - 154)/(420 - 56) = 0.73.
We  accept  the  assumption H0 for data set A since  0.33 is  smaller
than  0.642,  the 5%  critical  value for  n =  5 in Table A-1.  The
value of 0.73  is  larger than the 5% critical value  but less than
                              3-9

-------
[Figure 3-3 plot:  data set A (TSP values 42, 56, 87, 117, and 154 µg/m3)
and data set B (TSP values 56, 87, 117, 154, and 420 µg/m3) shown on a
concentration scale, with R_A = (154 - 117)/(154 - 42) = 0.33 and
R_B = (420 - 154)/(420 - 56) = 0.73.]
Figure 3-3.   Illustration of Dixon ratio test for two
             TSP  monthly data sets.
                             3-10

-------
the 1% critical value;  consequently,  0.01 < P[H0 is true] < 0.05
and the value 420 appears to be inconsistent with the rest of the
data set.   Table  A-1 contains  several  forms of  the Dixon ratio
test procedure  as a  function of sample  size  (n = 3 to  25)  and
whether the  largest  or smallest value  is  suspect.   The applica-
tion of the  Dixon ratio test to air pollution data  and the com-
parison of  the  test  with other test procedures  is  given in ref-
erences 5, 6, and 7.
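
     The arithmetic of the Dixon ratio test is easily programmed.  The
Python sketch below reproduces the data set A and data set B calculations
above; it implements only Equations 3-1 and 3-2, and critical values must
still be taken from Table A-1.

```python
# Sketch of the Dixon ratio test for a single high value (Equations 3-1, 3-2).
# Critical values still come from Table A-1; 0.642 is the 5% value for n = 5
# quoted in the text.
import math

def dixon_ratio(values, lognormal=False):
    """Ratio of the top gap to the range for a sample of size 3 <= n <= 7."""
    x = sorted(math.log(v) for v in values) if lognormal else sorted(values)
    return (x[-1] - x[-2]) / (x[-1] - x[0])

data_set_a = [42, 56, 87, 117, 154]
data_set_b = [56, 87, 117, 154, 420]

print(round(dixon_ratio(data_set_a), 2))   # 0.33 < 0.642: accept the highest value
print(round(dixon_ratio(data_set_b), 2))   # 0.73 > 0.642: flag 420 for investigation
```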
3.2.2.2   Testing a Pair of High Values  -  The Dixon ratio  test
described  above   (Section  3.2.2.1)  will  not identify  data sets
which have  two  or more  outliers of similar magnitude.   To test
for a pair of high values, calculate

            R = (x_n - x_(n-2))/(x_n - x_1)          Equation 3-3

for normal data sets and

            R = (ln x_n - ln x_(n-2))/(ln x_n - ln x_1)   Equation 3-4

for  lognormal  data  sets.   Decisions  concerning acceptance  or
rejection of the assumption H0—that the  two  highest values are
consistent  with  the  rest of the data set—are made  according to
the  procedure  described  in Section  3.2.2.1 using Table  A-2.
Reference 3  describes related procedures  appropriate for testing
three or more outliers.   Note:  The Dixon ratio tests and similar
statistical  procedures  for  identifying outliers  in a  data  set
are not recommended  for repeated use on  the same data set.  The
error  risks specified  in tables were  theoretically  derived  on
the assumption that  no  extreme values have been removed from the
data prior  to  the test.   In practice,  however,  the  tests  are
often applied successively.   The user should be very cautious in
doing this, that  is,  never discarding the data until a just cause
has been determined.   The user might also limit the rejected data
to a small  percentage  of the data set,  say 5%.   In all applica-
tions,   the  treatment  of  the outliers  should  be  documented  in
order that  a subsequent user of the results will know the impact
                              3-11

-------
of  the  outliers.  For  example,  analyses  can  be given  with and
without the discarded data.
3.2.3  Grubbs Test
     Like the Dixon  ratio  test  described in Section 3.2.2.1, the
Grubbs  test3,8  can  be  used to  determine  whether the  largest
observation x_n  in a sample  from  a normal  distribution is too
large with  respect to the internal consistency  of  the  data set.
The  Grubbs  test differs  from the Dixon  ratio  test in  that its
test statistic,

            T = (x_n - x̄)/s,                         Equation 3-5

is calculated using all of the values in the data set.   In addi-
tion to x_n,  the data analyst must determine the arithmetic mean,

            x̄ = (1/n) Σ x_i,                         Equation 3-6

and the standard deviation

            s = [Σ(x_i - x̄)²/(n - 1)]^(1/2).         Equation 3-7

The  following  equations  can be  used for positively skewed data
which approximate the lognormal distribution:

            x̄' = (1/n) Σ ln x_i                      Equation 3-8

            s' = [Σ(ln x_i - x̄')²/(n - 1)]^(1/2).    Equation 3-9

High values  of  T indicate the likelihood  that  the  maximum value
is too high.  Table  A-3  gives upper probability levels for T for
3 ≤ n ≤ 147.
     Figure 3-4  illustrates  the Grubbs test in  evaluating  the 2
months of TSP data used  in Section 3.2.2 to illustrate the Dixon
ratio test.   The T statistic for data set A is

            T_A = (154 - 91.2)/45.5 = 1.380.

Examining the n  =  5  row  in Table A-3,  we find that P[H0  is true]
> 10%  since 1.380 <  1.602.   It follows that  a  maximum value of
154  is not  unexpected for this particular data  set.   The T sta-
tistic for data set B is

            T_B = (420 - 166.8)/146.1 = 1.733.

                              3-12

-------
[Figure 3-4 plot:  data sets A and B of Figure 3-3 shown on a TSP
concentration scale (µg/m3), with T_A = (154 - 91.2)/45.5 = 1.380 and
T_B = (420 - 166.8)/146.1 = 1.733.]
        Figure 3-4.   Illustration  of  Grubbs test  for two
                     TSP monthly data sets.
Table A-3 indicates that P[H_0 is true] < 0.025, since 1.733 >
1.715.  The  value  420  appears  to  be inconsistent with the data
set and should be flagged for further investigation.
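As an illustrative sketch (not the report's own software), the T statistic of
Equation 3-5 can be computed as follows; the probability levels must still be
read from Table A-3.

    import math

    def grubbs_t(values, lognormal=False):
        """Grubbs T statistic for the largest observation (Equation 3-5),
        using Equations 3-6 and 3-7 (or 3-8 and 3-9 if lognormal=True)."""
        x = [math.log(v) for v in values] if lognormal else list(values)
        n = len(x)
        mean = sum(x) / n
        s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
        return (max(x) - mean) / s

    # Hypothetical 5-value monthly TSP sample:
    print(round(grubbs_t([42, 56, 87, 117, 154]), 2))   # 1.38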
3.2.4  Gap Test
     The gap test (References 4 and 9) identifies spurious outliers by examining
the frequency distribution for large gaps.  The length of the gap
between the largest value x_n and the next largest value x_{n-1}
is x_n - x_{n-1}; the gap between the second and third largest
values is x_{n-1} - x_{n-2};
and similarly  for other  gaps.  The  occurrence  of  a  gap length
larger than  a  predetermined critical value  indicates a possible
data anomaly.
     The test assumes that the upper percentiles of the frequency
distribution can be closely  fit  by a  two-parameter exponential
curve defined as
           F(x) = 1 - exp[-λ(x - δ)]*                    Equation 3-10
where F(x) is the fraction of the total observations less than or
equal to x; λ is the slope parameter; and δ is the location
parameter.  Only the λ value of the fitted curve is used in the
gap test.  Values of λ can be estimated from the expression

           λ = {ln [1 - F(x_1)] - ln [1 - F(x_2)]} / (x_2 - x_1)    Equation 3-11

where F(x_1) and F(x_2) are the quantile values corresponding to
the concentration values x_1 and x_2.  For fits using three or
more quantiles, use the least squares procedure for the 2-parameter
exponential distribution described in Appendix B.
     The probability of  a  gap of at least k units occurring in  a
data set free from erroneous values is
           P[gap ≥ k] = e^(-λk).                         Equation 3-12
A  small P value indicates  that the gap  size  is  in question and
that the corresponding data value(s) should be flagged.
     Figure  3-5  illustrates the use  of  the  gap  test for hourly
ozone data  having a suspect value of  23 pphm.  Substituting the
*exp[a] is a convenient means of writing (typing) e^a, where e is
 the base of natural logarithms.
     (Figure 3-5:  frequency histogram of the hourly ozone data, pphm,
     used to illustrate the gap test.)
90th  and  99th  percentile  values, 4.5  pphm  and  9.0 pphm,  into
Equation 3-11 yields
           λ = (ln 0.10 - ln 0.01)/(9.0 - 4.5) = 0.512.
The gap is eight units in length; consequently
           P[gap ≥ 8] = e^(-(0.512)(8)) = 0.017.
There is less  than  a 2% probability that a gap of this length or
larger would  occur  in the data  set under  the assumption stated.
The  23-pphm value should  be  flagged for  further investigation.
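The slope estimate of Equation 3-11 and the gap probability of Equation 3-12
can be computed as in the sketch below (illustrative only, not from the
report); the two percentile points are those of the worked example above.

    import math

    def gap_test_p(x1, f1, x2, f2, gap):
        """Two-point estimate of the exponential slope (Equation 3-11)
        and the probability of a gap at least `gap` units long
        (Equation 3-12)."""
        lam = (math.log(1 - f1) - math.log(1 - f2)) / (x2 - x1)
        return lam, math.exp(-lam * gap)

    # 90th percentile = 4.5 pphm, 99th percentile = 9.0 pphm, gap of 8 pphm
    lam, p = gap_test_p(4.5, 0.90, 9.0, 0.99, 8)
    print(round(lam, 3), round(p, 3))    # 0.512  0.017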
3.2.5  "Johnson" p Test
     The "Johnson" p  test  assumes that the sampling distribution
of n observations can be approximated by the cumulative distribu-
tion  F(x),  where F(x) is  an  identifiable  function which defines
the  fraction of  individuals in  the sampled population having a
value less than or equal to x.  Let x_n be the largest recorded
value in a sample of n observations from the population distribu-
tion F(x).   The probability that all n observations from F(x) are
less than x_n is [F(x_n)]^n; that is, the chance that a single ob-
servation is less than x_n, F(x_n), raised to the power n.  Thus
the probability P(x_n) that at least one value exceeds x_n (i.e.,
the largest value is at least x_n) is
           P(x_n) = 1 - [F(x_n)]^n.                      Equation 3-13
We can now define an acceptance region for x_n by requiring that
P(x_n) fall between 0.05 and 0.95; that is,
           0.05 < P(x_n) < 0.95.                         Equation 3-14
Data sets with maximum values which cause P(x_n) to fall outside
of the acceptance region should be flagged for further investiga-
tion.
     The selection of an appropriate cumulative distribution F(x)
to characterize the data is important in determining a reasonable
acceptance  region.   Two distributions which often provide close
fits to ambient  air  quality data are the Weibull and the lognor-
mal (References 10 and 11).  Appendix B contains procedures for fitting these dis-
tributions  to  data  and for selecting the  appropriate F(x)  func-
tion.
     Figure 3-6 shows the probability function of Equation 3-13
corresponding to a Weibull distribution fitted to 1976 NO2 data
from Essex, MD.  The limits of the acceptance region correspond-
ing to 0.05 < P(x_n) < 0.95 can be determined directly from the
graph by noting that P(165) = 0.05 and P(125) = 0.95.  The value
of 140 ppb corresponds to P(x_n) = 0.50 and is considered a likely
maximum value for this  data  set.  If the data set has a recorded
maximum value less than 125 ppb or higher than 165 ppb, it should
be investigated further.
     If the observed  (recorded)  maximum value is consistent with
the  overall distribution  of  the data set, there  is  a 10% proba-
bility that the  maximum value of a valid data set will fall out-
side this  acceptance  region.   If, in the  general  case, the data
analyst decides to check all sets with x_n values outside the ac-
ceptance region, he or she would expect to check 1 out of 10 data
sets unnecessarily.  Where data validation requirements are more
stringent, the acceptance range could be narrowed to 0.10 < P(x_n)
< 0.90.  On the other hand, if valid data are being checked too
frequently, the acceptance range can be widened to 0.01 < P(x_n) <
0.99.  The acceptable range of x_n values should not be too
narrow, since  the P values are  calculated from fitted distribu-
tions that only approximate the data set.
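A minimal sketch of Equations 3-13 and 3-14 follows (not from the report).  It
assumes a Weibull distribution has already been fitted; the shape and scale
parameters used below are placeholders for illustration and are not the fitted
Essex, MD values.

    import math

    def weibull_cdf(x, shape, scale):
        """CDF of a two-parameter Weibull distribution."""
        return 1.0 - math.exp(-((x / scale) ** shape))

    def p_max(x_n, n, shape, scale):
        """Probability that the largest of n observations is at least x_n
        (Equation 3-13)."""
        return 1.0 - weibull_cdf(x_n, shape, scale) ** n

    # Placeholder parameters only (assumed, not the fitted Essex, MD values)
    n, shape, scale = 8760, 1.5, 30.0
    for x in (100, 125, 150):
        p = p_max(x, n, shape, scale)
        flag = "" if 0.05 < p < 0.95 else "  <-- outside Equation 3-14 region"
        print(f"x_n = {x} ppb, P(x_n) = {p:.3f}{flag}")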
3.2.6  Multivariate Tests
     The procedures given in Sections 3.2.2 through 3.2.5 can be
used for  testing data  sets  involving more  than  one variable by
applying  them  independently  to  each  variable;  however,  this
approach may be  inefficient,  particularly when the variables are
statistically correlated.  In this case,  the data analyst should
consider the test procedure given in Section 3.4.4 for correlated
data.  Although  this  test is included under parallel  data sets
because of the particular application to correlated data from two
sets,  it  can  be applied to correlated  variables  within the same
data set;  for example,  to the  concentrations  of two pollutants,
such as TSP and Pb.   In some cases,  a  multivariate  test of this
kind will show that a value of one variable that appears to be an
     (Figure 3-6:  P(x_n) plotted against concentration, ppb, from 0 to
     about 250, with the acceptance range for x_n marked between
     P = 0.95 and P = 0.05.)

        Figure 3-6.  P(x_n) versus concentration for Weibull distribution
                     fitted to 1976 NO2 data from Essex, MD.  Acceptance
                     range defined as 0.05 < P(x_n) < 0.95.
outlier using  a  single variable test procedure  will  be consist-
ent with the data  set when one or more other  variables are con-
sidered.  Conversely,  there  may be a value  of one  variable that
is consistent with the  other data in the data  set  when only one
variable is considered, but is definitely a possible outlier when
two variables are considered.
     Multivariate test procedures which have been successfully
used to perform data validation checks include cluster analysis
(Reference 12), principal component analysis (Reference 13), and
correlation analyses.  Applications of these usually require com-
puterized procedures.  For example, cluster analysis can be ap-
plied using a program called NORMIX (Reference 14).
3.3  TESTS FOR HISTORICAL CONSISTENCY
     Tests for historical consistency check the data set with re-
spect to similar data recorded in the past.   It is important to
note that some of  the data validation procedures to be described
in  this section will  detect relatively  small  changes  in  the
average  value  and/or the  dispersion of  the  data values.   In
particular,  these procedures  will  detect  changes where each item
is  increased  (decreased)  by a constant  or by  a multiplicative
factor.  This  is not the case for the procedures  in  Section 3.2
which yield  the  same value for the test  statistic  when all data
are changed by the same constant or multiplicative factor.
3.3.1  Gross Limit Checks
     Gross limit checks  are  useful in detecting data values that
are  either  highly unlikely  or generally  considered  impossible.
Upper and lower  limits are developed by examining historical data
for a site  (or  for other sites in the area).  Whenever possible,
the limits should  be  specific to each monitoring site and should
consider both  the parameter and  instrument/method characteris-
tics.   Table  3-1 shows examples of gross  limit checks that have
been  used  for  ambient  air  monitoring  data  in  the  St.  Louis
area (References 15 and 16).  Although this technique can be easily adapted to com-
puter application, it is particularly appropriate for technicians
who reduce data  manually or who scan strip  charts  to detect un-
usual events.
  TABLE 3-1.  EXAMPLES OF HOURLY AND DAILY GROSS LIMIT CHECKS FOR AMBIENT
                    AIR MONITORING (References 15 and 16)

                                                  Limits
  Parameter                         Lower              Upper
  O3                                0 ppm              1 ppm (a)
  NO2                               0 ppm              2 ppm
  NO                                0 ppm              3 ppm
  NOx                               0 ppm              5 ppm
  Total suspended particulates      0 µg/m³            2000 µg/m³ (b)
  CO                                0 ppm              100 ppm
  Total hydrocarbons                0 ppm              25 ppm (a)
  Methane                           0 ppm              25 ppm (a)
  Total sulfur                      0 ppm              1 ppm
  SO2                               0 ppm              1 ppm
  H2S                               0 ppm              1 ppm
  Aerosol scatter                   0.000001 m⁻¹       0.0040 m⁻¹
  Windspeed                         0 m/s              22.2 m/s
  Wind direction                    0°                 360° (540° for some
                                                         wind systems)
  Temperature                       -20°C              45°C
  Dewpoint                          -30°C              45°C
  Temperature gradient              -5°C               5°C
  Barometric pressure               950 mb             1050 mb

  (a) These limits have been changed from the original in References 15 and
      16 based on after-the-fact considerations.
  (b) Upper limit for a 24-hour average.
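Such a table lends itself to a simple automated screen.  The sketch below is
illustrative only; the limits are hard-coded from Table 3-1 for a few
parameters, and a real implementation would carry site-specific limits for
every monitored parameter.

    # Gross limits (lower, upper) for a few parameters from Table 3-1.
    GROSS_LIMITS = {
        "O3":  (0.0, 1.0),       # ppm
        "NO2": (0.0, 2.0),       # ppm
        "CO":  (0.0, 100.0),     # ppm
        "TSP": (0.0, 2000.0),    # ug/m3, 24-hour average
    }

    def gross_limit_check(parameter, value):
        """Return None if the value passes, otherwise a flag message."""
        low, high = GROSS_LIMITS[parameter]
        if value < low or value > high:
            return f"{parameter} value {value} outside gross limits [{low}, {high}]"
        return None

    for param, value in [("O3", 0.08), ("CO", 123.0), ("TSP", -5.0)]:
        flag = gross_limit_check(param, value)
        print(flag or f"{param} value {value} passes")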
3.3.2  Pattern and Successive Value Tests
     The  pattern  and successive  value  tests check  the  data for
pollutant behavior which has never or very rarely occurred.  Like
the  gross  limit  checks,  they  require  that  a set  of  boundary
values or  limits  be  determined empirically  from prescreened his-
torical data.  Values  representing pollutant behavior outside of
these predetermined  limits  are then flagged for further investi-
gation.
     EPA has recommended the use  of the pattern tests which place
upper limits on:
     1.   The individual concentration value (maximum hour test),
     2.   The difference in adjacent  concentration values (adja-
cent hour test),
     3.   The difference or percentage difference between a value
and both of its adjacent values  (spike test), and
     4.   The average of four or more consecutive values (consec-
utive value test) (Reference 4).
The maximum hour  test  (a form of gross limit  check)  can be used
with  both  continuous  and  intermittent data.   The  other  three
tests should be used only with continuous data.
     Table 3-2 is  a  summary of limit values developed by EPA for
hourly average data.   These values were selected on the basis of
empirical tests on actual  data sets.   Note that the limit values
vary with different data stratifications (e.g., day/night).
     These  limit  values  will  usually be  inappropriate  for other
pollutants,  data  stratifications,   averaging  times,   or   EPA
regions.    In  these  cases,  the  data  analyst  should  develop the
required limit values by examining historical data similar to the
data being  tested.   These  limit values can  be  later modified if
they flag too many values  that are later proven to be correct or
if they  flag  too few  errors.   Pattern tests  should continue to
evolve to  meet  the needs of the  analyst  and the characteristics
of the data.
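As an illustration (not the EPA screening software itself), the four pattern
tests can be sketched as follows.  The limit values passed in would in
practice come from Table 3-2 or from locally developed history, and the spike
criterion shown (the value exceeds both neighbors by more than the absolute
limit and by more than the stated percentage) is an assumed reading of the
table.

    def pattern_test_flags(hourly, max_hour, adj_hour, spike_abs, spike_pct, consec4):
        """Apply the four pattern tests to a list of hourly values.
        Returns a list of (hour index, test name) flags."""
        flags = []
        for i, v in enumerate(hourly):
            if v > max_hour:                                   # maximum hour test
                flags.append((i, "maximum hour"))
            if i > 0 and abs(v - hourly[i - 1]) > adj_hour:    # adjacent hour test
                flags.append((i, "adjacent hour"))
            if 0 < i < len(hourly) - 1:                        # spike test (one reading of Table 3-2)
                higher_neighbor = max(hourly[i - 1], hourly[i + 1])
                excess = v - higher_neighbor
                if (excess > spike_abs and higher_neighbor > 0
                        and excess > spike_pct / 100.0 * higher_neighbor):
                    flags.append((i, "spike"))
            if i >= 3 and sum(hourly[i - 3:i + 1]) / 4.0 > consec4:
                flags.append((i, "consecutive 4-hour"))        # consecutive value test
        return flags

    # Hypothetical SO2 hourly values (ug/m3) with an isolated spike at hour 4,
    # screened with the SO2 limits of Table 3-2:
    data = [20, 25, 30, 28, 400, 31, 27, 22]
    print(pattern_test_flags(data, max_hour=2600, adj_hour=500,
                             spike_abs=200, spike_pct=500, consec4=1000))
    # [(4, 'spike')]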
3.3.3  Parameter Relationship Test
     Parameter relationship  tests can be  divided into  two  main
categories:   deterministic tests  involving  the theoretical  rela-
          TABLE 3-2.  SUMMARY OF LIMIT VALUES USED IN EPA REGION V
                        FOR PATTERN TESTS (Reference 4)

                                                                       Consec-
                               Data              Maximum  Adjacent      utive
  Pollutant (units)            stratification      hour     hour    Spike      4-hour
  Ozone-total oxidant (µg/m³)  summer day           1000     300   200(300%)     500
                               summer night          750     200   100(300%)     500
                               winter day            500     250   200(300%)     500
                               winter night          300     200   100(300%)     300
  Carbon monoxide (mg/m³)      rush traffic hours     75      25    20(500%)      40
                               nonrush traffic
                                 hours                50      25    20(500%)      40
  Sulfur dioxide (µg/m³)       EPA region           2600     500   200(500%)    1000
  Nitrogen dioxide (µg/m³)     None                 1200     500   200(300%)    1000
tionships between parameters (e.g., NO < NOx) and empirical tests
which check whether or not a parameter is behaving normally in
relation to the observed behavior of one or more other param-
eters.  Deterministic parameter checks are discussed in Section
3.1.3;  empirical  parameter checks  are  discussed in this  section
since  determining  the  "normal"  behavior of  related parameters
requires the detailed review of historical data.
     The  following  area-specific  example illustrates the  testing
of  meteorological data using  a combination  of successive value
tests,  gross limit  tests,  and parameter relationship tests.  One
should consult with the local National Weather  Service office for
relationships for a specific area.  The validation protocol calls
for the following procedures to be applied to ambient temperature
data based on  the  availability  of hourly  averages reported  in
monthly formats:
     1.   Check  the  hourly  average  temperature.    The  minimum
should occur  between 04-09  hours,  and the  maximum should occur
between 12-17 hours.
     2.   Inspect the  hourly data for  each  day.   Hourly changes
should not  exceed 10°F.   If a decrease of  10°F  or more occurs,
check  the  wind direction  and the precipitation  summaries.   The
wind direction should have changed to a more northerly direction,
and/or rainfall of  0.15 in.  or more per hour should have fallen.
     3.   Hourly  values should not exceed predetermined maximum
or minimum  values based on  month of the year.   For example,  in
November the maximum  allowable temperature  is 85°F and the mini-
mum allowable temperature is 10°F.
If any of  the  above criteria are not met, then  the data for the
appropriate  time  period  are flagged for  anomaly investigation.
     In this example,  relationship checks have been developed for
temperature and wind direction as well as temperature and precip-
itation.  Other pairs of parameters for which relationship checks
could  be  developed include  solar insolation  and  cloud cover;
windspeed aloft and ground windspeed;  ozone and NO; and tempera-
ture and humidity.
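The temperature protocol above can be automated.  The following sketch is
illustrative and simplifies the wind and precipitation checks to single
comparisons; it assumes one day of hourly records carrying temperature (°F),
wind direction (degrees), and hourly precipitation (inches).

    def check_temperature_day(temps_f, wind_dir_deg, precip_in,
                              month_max_f=85.0, month_min_f=10.0):
        """Apply the three example temperature checks to one day of
        hourly data (lists of 24 values).  Returns a list of flag strings."""
        flags = []
        # 1. Timing of the daily minimum and maximum (hours numbered 0-23 here).
        t_min, t_max = temps_f.index(min(temps_f)), temps_f.index(max(temps_f))
        if not 4 <= t_min <= 9:
            flags.append(f"minimum at hour {t_min}, outside 04-09")
        if not 12 <= t_max <= 17:
            flags.append(f"maximum at hour {t_max}, outside 12-17")
        # 2. Hour-to-hour changes greater than 10 deg F.
        for h in range(1, 24):
            change = temps_f[h] - temps_f[h - 1]
            if abs(change) > 10.0:
                # A large drop should be accompanied by a shift toward a more
                # northerly wind and/or >= 0.15 in. of rain (simplified check).
                northerly = wind_dir_deg[h] <= 45 or wind_dir_deg[h] >= 315
                if not (change < 0 and (northerly or precip_in[h] >= 0.15)):
                    flags.append(f"change of {change:+.0f} deg F at hour {h} unexplained")
        # 3. Monthly gross limits (the November limits are used as defaults).
        for h, t in enumerate(temps_f):
            if t > month_max_f or t < month_min_f:
                flags.append(f"hour {h} temperature {t} deg F outside monthly limits")
        return flags

    # Example: a flat 50 deg F day with calm winds flags only the timing checks.
    print(check_temperature_day([50.0] * 24, [180] * 24, [0.0] * 24))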
3.3.4  Shewhart Control Chart
     The gross  limit  checks  and  the  pattern tests  described in
Sections 3.3.1  and  3.3.2,  respectively,  use  critical  values
("control limits")  based  on historical  data to identify possible
data  anomalies  involving single values  or  small  numbers of con-
secutive values.   The  Shewhart control  chart  is  a valuable sup-
plement  to  these  two  tests in that it identifies data sets which
have  mean  or range values  that are inconsistent with past data
sets.
     The classical use  of a  quality control chart is to determine
the  limits on the  basis of  historical  data and  to apply these
limits to  future  data to determine the state of control.  In the
data  validation   process  the control chart  can be  used in this
classical  sense   (particularly at  the   local  agency)  or in  an
after-the-fact  sense.   In  the  latter   case,  the  control  chart
technique  may  be applied  to data already recorded/documented by
using  a  portion   of the data to determine the  limits  to be used
for the remaining data.  The discussion of quality control charts
herein applies to either use  (classical or after-the-fact).

     The  first step  in the  development of  a  Shewhart control
chart is  the selection of  a  suitable sample  size  for the test.
Each sample should contain between 4 and 15 values and should re-
present  a well-defined time  period (day, month, quarter,  etc.)
for  which there is  at least 10  to 15 historical  data samples.
Where possible,  these  time  periods  should  relate to  the NAAQS
(National Ambient  Air Quality Standards) of  interest.  Months or
quarters  would be  appropriate time periods for  tests of 24-hour
TSP, SO2, and N02 data collected at 6- or 12-day intervals.
     The  second  step is to calculate the limits on the Shewhart
control  chart following the  procedures  in  Appendix  C or  in a
standard reference text (Reference 18).  The details of the calculation will
be illustrated by the following example.
     Average  24-hour  TSP concentrations  at Philadelphia monitor-
ing site  397140014H01 are recorded every sixth day.   We desire to
use  a Shewhart control chart  to check 1978 data as they are re-
ported.    A  sample  size of five is chosen so that  incoming data
will be checked 12  times  a year if  all  values  are recorded.  A
review of the TSP data for 1975-77 reveals 25 months which con-
tain exactly five  TSP values.  Table 3-3 lists  the mean and the
range of  each of these data sets.  No seasonal  trends are appar-
ent  over the 3-year period.  We  apply  the  Grubbs test  (Section
3.2.3)  to the data sets with suspicious range values (>100).  In
each case, T < 1.602 and P[H_0 is true] > 10%, so we decide not to re-
ject the data set.   These 25 data sets  now  form the historical
data base for  determining the limits for the control chart.
     Since all data  sets,  including the one to be tested, contain
five observations, Equations C-l, C-5, C-6, and  C-10  (Appendix C)
are used  with  d2 = 2.326 and c2 = 0.8407.  If z  = 2, then
           x̄ (mean of the sample means) = 56.5
           σ_x̄ (standard deviation of the mean) = 9.0
           UL_x̄ (upper 2σ limit for the mean) = 74.5
           LL_x̄ (lower 2σ limit for the mean) = 38.5
           R̄ (mean range) = 47.0
           UL_R (upper 2σ limit for R) = 80.9
           LL_R (lower 2σ limit for R) = 13.0
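A sketch of the mean-chart portion of this calculation follows (illustrative
only).  The Appendix C equations are not reproduced here; sigma is estimated
as R-bar/d2, and the range-chart limits would additionally require the d3
factor from a standard control chart table.

    import math

    # Control chart factor d2 for samples of n = 5 (from the text above).
    D2_N5 = 2.326

    def shewhart_mean_limits(means, ranges, n, z=2.0, d2=D2_N5):
        """Grand mean, mean range, and z-sigma control limits for the
        sample mean, estimating sigma as R-bar / d2 (samples of size n)."""
        grand_mean = sum(means) / len(means)
        mean_range = sum(ranges) / len(ranges)
        sigma_xbar = (mean_range / d2) / math.sqrt(n)
        return grand_mean, mean_range, (grand_mean - z * sigma_xbar,
                                        grand_mean + z * sigma_xbar)

    # Feeding in the 25 monthly means and ranges of Table 3-3 (n = 5) gives
    # approximately x-bar = 56.5, R-bar = 47.0, and mean limits (38.5, 74.5).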

TABLE 3-3.  TSP DATA FROM SITE 397140014H01 SELECTED AS HISTORICAL DATA BASE
                    FOR SHEWHART CONTROL CHART (1975-1977)

             Mean (x̄),  Range (R),                Mean (x̄),  Range (R),
 Month-year    µg/m³      µg/m³      Month-year     µg/m³      µg/m³
   1-75         54.6        67         10-76         34.6        50
   5-75         63.8        39         11-76         53.4        29
   6-75         59.0        25         12-76         52.2        44
   7-75         63.0        23          3-77         40.4        28
   8-75         68.2        54          4-77         63.6        57
  10-75         41.8        26          6-77         45.4        31
  11-75         68.4        81          7-77         53.4        19
  12-75         57.6        39          8-77         58.6        26
   1-76         82.4        87          9-77         46.0        12
   4-76         90.2       117         10-77         45.6        33
   5-76         43.8        48         11-77         49.8        54
   7-76         72.6        80         12-77         30.4        22
   9-76         73.4        83

  Figures 3-7  and 3-8 have been constructed using these values,  and
  the mean  and  range  values  from eleven  1978 data  sets have  been
  plotted.  The  raw data  for 1978 are  in Table  3-4.
      TABLE 3-4.  TSP DATA FROM SITE 397140014H01 FOR CONTROL CHART (1978)

          Data set    Month    Mean    Range
              1         1      30.6      27
              2         2      47.4      60
              3         3      54.4      39
              4         4      31.8      29
              5         5      53.6      46
              6         6      64.8      46
              7         8      68.8      87
              8         9      43.2      31
              9        10      52.4      59
             10        11      60.8      71
             11        12      31.6      22
     (Figure 3-7:  control chart for the 1978 monthly TSP means,
     concentration in µg/m³, plotted against the central line x̄ and
     the upper and lower 2σ limits computed above.)
     (Figure 3-8:  control chart for the 1978 monthly TSP ranges,
     concentration in µg/m³, plotted against the central line R̄ and
     the upper and lower 2σ limits computed above.)
     Figure 3-7 shows three mean values below the LL_x̄, but no
mean values above the UL_x̄.  The overall distribution of plotted
points suggests a possible explanation for the anomalous low
values.  Of the 11 points, eight are below the x̄ line while only
three  are  above  it.  We should  investigate  two hypotheses:  (1)
air quality  has  improved  and (2) the TSP monitor has developed a
negative bias.   The  first hypothesis can be checked by seeing if
TSP data  from nearby monitors show  similar trends.  Measurement
bias  may  be  revealed  through  calibration  checks and  careful
inspection  of quality assurance records.   In  particular,  check
for changes  in the location of the sampling mechanism.
     The 1978 range values also tend to be smaller than expected;
seven are less than R̄ while only four are greater than R̄ (Figure
3-8).  None of the range values are less than the LL_R, however.
The sole outlier is slightly above the UL_R.  A good preliminary
check of data sets with anomalous range values is the Grubbs test.
Following the procedure described in Section 3.2.3, we find that
x_n = 129, x̄ = 68.8, s = 34.6, and T = 1.740.  Since inspection of
Table A-3 reveals that 0.010 < P[H_0 is true] < 0.025 when
T = 1.740, we are suspicious of the highest value, x_n = 129.
Further investigation should  focus   on determining the validity
of this measurement.
     Assuming  that  we have  determined the cause(s) of the  anoma-
lies  noted on Figures 3-7 and 3-8, we must now  decide on the ap-
propriate  control  limits  for plotting 1979 data.   If an improve-
ment  in air quality has indeed  occurred,  we  should calculate new
control limits which  reflect  the lower x values  expected.   A  rec-
ommended  procedure in this case  is  to  omit the earliest year of
data  and  to  add the most recent.   If the problem is traced to
measurement  bias,  either the  original  control limits can  be re-
tained or new control limits  can  be  calculated  by  adding any
valid 1978 data sets to the  25  data sets in the  historical  data
base.
      It should  be  apparent from this  example  that  a physical
control chart need  not be constructed to  test individual mean and
range values; a  computer  program can be easily  developed which
will list all data sets which fail the test.  The main benefit of
the visible  control  chart is  its capability to  indicate trends
and other data patterns.  If new data are consistent with histor-
ical data,  the points  plotted  on the control  chart should fall
within the limits, randomly  above and below the central line.  A
long run of  points on one side of the line (even though none of
the points  lies outside the control  limits)  may  indicate a sys-
tematic bias  in the  data or an actual trend  in air quality that
requires investigation.   Common practice is  to check these pos-
sibilities whenever  the run  exceeds  6 points.  Used in this way,
the control  chart can warn  the  data analyst  of  data problems
before they become serious.
3.4  TESTS FOR  CONSISTENCY OF PARALLEL DATA SETS
     The tests  for internal  consistency  described in Section 3.2
implicitly assume that most  of the values in a data set are cor-
rect.   Consequently,   if all  of the values in a data set incorpo-
rate a  small positive  bias,  tests such  as the Dixon ratio test
would not  identify the data set  as  inconsistent.   One method of
identifying a systematic bias of this type is to compare the data
set with other data  sets which presumably have been sampled from
the same population  (i.e.,  same air mass and time period) and to
check for differences  in the average value or the overall distri-
bution of values.  This section describes four such procedures—
the sign  test, the  Wilcoxon  signed-rank  test,  the Wilcoxon rank
sum test,  and  the intersite correlation test—which  are recom-
mended for comparisons  involving two  "parallel" data sets.  These
four tests  are presented  in  order of increasing  sensitivity to
differences  between  the  data sets  and  increasing computational
complexity.   The first  three  tests   are  nonparametric;  that is,
they do not assume that the data have a particular distribu-
tion (Reference 19).  Consequently, these tests can be used for the nonnormal
data sets which frequently occur in air quality analysis.
3.4.1  The Sign Test
     The sign test (Reference 19) is a relatively simple procedure for testing
the assumption H_0--that two related (paired) samples, such as

data  sets  from  adjacent  monitoring  instruments,  have  the  same
median.  The data  analyst  simply determines the sign ( + or -) of
the  algebraic  difference  between  each  of  the  pairs of  data
points, and then counts the total number of positive signs (n+)
and negative signs (n-); differences of zero are ignored.  The
probability that both samples have the same median is

     P[H_0 is true] = 2 N!(1/2)^N Σ_{j=0}^{n} 1/[j!(N - j)!],       Equation 3-15

where n is the smaller of the two numbers n+ and n-, and N =
n+ + n-.  For large N (say >25), one can use the normal approxi-
mation to the above probability by calculating

                    z = (2n - N)/√N.                     Equation 3-16

In this case, P[H_0 is true] is equal to twice the area to the
left of  z under the standard normal curve.   The following table
lists P values corresponding to selected values of z.
                z          P                 z          P
             -1.282      0.20             -2.326      0.020
             -1.645      0.10             -2.576      0.010
             -1.960      0.05             -2.807      0.005
A z value  of  -1.85 would imply that the probability  of the sam-
ples representing  populations  with the same  distribution is be-
tween 5% and 10%.
     Table 3-5  lists ozone data recorded August  1,  1978,  at two
monitoring stations in Washington, D.C.  The difference column
contains 13 positive values and 6 negative values.  Consequently,
n+ = 13, n- = 6, n = 6, and N = 19.  Substituting these values in
Equation 3-15 yields

    P[H_0 is true] = (2)(19!)(1/2)^19 Σ_{j=0}^{6} 1/[j!(19 - j)!] = 0.167.

If  the normal  approximation  (Equation 3-16)  is used,  we have
                z = [2(6) - 19]/√19 = -1.606,
and 0.10 < P[H_0 is true] < 0.20.  Generally the data analyst will
not reject H_0 if P[H_0 is true] > 0.05.

     TABLE 3-5.  OZONE DATA (ppb) RECORDED AUGUST 1, 1978, AT
           MONITORING SITES 090020011101 (SITE A) AND
                      090020013101 (SITE B)

          Hour    Site A    Site B    Difference
            1       65        50         +15
            2       40        50         -10
            3       35        45         -10
            4       30        35          -5
            5       30        25          +5
            6       15        15           0
            7       15        10          +5
            8        5         5           0
            9       10         5          +5
           10       10        10           0
           11       35        50         -15
           12       65        55         +10
           13       65        60          +5
           14      100        90         +10
           15      130       110         +20
           16       90        65         +25
           17       70        65          +5
           18       70        70           0
           19       85        65         +20
           20       55        50          +5
           21       45        35         +10
           22       25        40         -15
           23       20        20           0
           24       20        30         -10
     Table 3-6  was  developed to simulate a  calibration error in
the site A data.  Each  reading in the site  A column in Table 3-5
was increased by  5  ppb.   Readings in the site B  column were not
changed.  There are  now 18 positive values and 5  negative values.
It follows that n+ = 18, n- = 5, n = 5, and N = 23.  Using Equa-
tion 3-15, the probability that H_0 is true is

     P[H_0 is true] = (2)(23!)(1/2)^23 Σ_{j=0}^{5} 1/[j!(23 - j)!] = 0.0106.
The normal approximation (Equation 3-16) yields
     z = [2(5) - 23]/√23 = -2.711,

corresponding to P[H_0 is true] = 0.007.  In either case, P[H_0 is
true] < 0.05, and there is good reason to believe that a funda-
mental inconsistency exists between the two data sets.
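A short sketch of the sign test (Equations 3-15 and 3-16) is given below for
illustration; it is not the report's own software.

    import math

    def sign_test_p(pairs):
        """Exact two-sided sign-test probability (Equation 3-15) with the
        normal approximation of Equation 3-16 returned alongside it."""
        diffs = [a - b for a, b in pairs if a != b]        # zero differences ignored
        N = len(diffs)
        n = min(sum(d > 0 for d in diffs), sum(d < 0 for d in diffs))
        p_exact = min(1.0, 2 * sum(math.comb(N, j) for j in range(n + 1)) * 0.5 ** N)
        z = (2 * n - N) / math.sqrt(N)
        return p_exact, z

    # The hour-by-hour ozone pairs of Table 3-5 give p_exact = 0.167, z = -1.61.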
3.4.2  Wilcoxon Signed-Rank Test
     Like the sign test, the signed-rank test can be used to test
the assumption (H_0) that two samples come from populations having
the  same medians.  The Wilcoxon  test  is generally more powerful
than the  sign  test since  it considers both the sign and the mag-
nitude  of the  difference  between paired data.  Table  3-7 lists
the steps in the procedure.  The large sample approximation given
in step  5 is appropriate when N > 20.
     Table  3-8  illustrates  the  application of the  signed-rank
test  to the ozone  data listed in  Table 3-5.   The absolute dif-
ferences  were  first listed according to increasing magnitude and
then  assigned  ranks.  Note that  differences  of zero are ignored;
there  are 19 nonzero differences.  The  sum  of the ranks associ-
ated with the minus sign (T_1) is 65.5 and the sum of the ranks
associated with the plus sign (T_2) is 124.5.  Table A-4 indicates
that when N = 19 the probability of H_0 being true would be less
than 10% if T_2 ≤ 53 or T_2 ≥ 137.  Since 53 < 124.5 < 137, we can
say P[H_0 is true] > 0.10.  Using the large sample approximation
(Equation 3-17),
           z = [T_1 - N(N + 1)/4] / √[N(N + 1)(2N + 1)/24],        Equation 3-17
we find that
           z = [65.5 - 19(19 + 1)/4] / √[19(19 + 1)(2(19) + 1)/24] = -1.187,
which corresponds to P[H_0 is true] = 0.235.  In neither case is
P[H_0 is true] < 0.10; consequently, there is not good reason to
reject H_0.
           TABLE 3-6.  OZONE DATA (ppb) INCORPORATING SIMULATED
                   +5 ppb CALIBRATION ERROR AT SITE A

          Hour    Site A    Site B    Difference
            1       70        50         +20
            2       45        50          -5
            3       40        45          -5
            4       35        35           0
            5       35        25         +10
            6       20        15          +5
            7       20        10         +10
            8       10         5          +5
            9       15         5         +10
           10       15        10          +5
           11       40        50         -10
           12       70        55         +15
           13       70        60         +10
           14      105        90         +15
           15      135       110         +25
           16       95        65         +30
           17       75        65         +10
           18       75        70          +5
           19       90        65         +25
           20       60        50         +10
           21       50        35         +15
           22       30        40         -10
           23       25        20          +5
           24       25        30          -5
          TABLE 3-7.  PROCEDURE FOR WILCOXON SIGNED-RANK TEST (Reference 19)

1.   Determine the sign and magnitude of the algebraic difference between each
     of the pairs of data points; assume that N scores remain after throwing
     out all zeros.
2.   Assign ranks to the absolute values of these N differences; use the average
     rank in case of ties.  Make sure that the ranks increase as the absolute
     values increase.
3.   Assign to each rank the sign which it represents.
4.   Determine the sum of the ranks associated with a minus sign (T_1) and the
     sum of the ranks associated with a plus sign (T_2).
5.   To test H_0, either use published tables (see Appendix A, Table A-4) based
     on T_2 and N or use a large sample normal approximation with

           z = [T_1 - N(N + 1)/4] / √[N(N + 1)(2N + 1)/24].        Equation 3-17

     The probability that H_0 is true--that is, P[H_0 is true]--is equal to twice
     the area to the left of z under the standard normal curve.

     If we repeat the signed-rank test with the biased ozone data
in Table 3-6, we find that T_1 = 38.5, T_2 = 237.5, and N = 23.
Table A-4 indicates that if T_2 ≤ 54 or if T_2 ≥ 222 when N = 23,
then P[H_0 is true] < 0.01.  Since 237.5 > 222 we reject H_0; there
is good reason to believe that samples A and B represent popula-
tions with different medians.  The normal approximation supports
this conclusion since

           z = [38.5 - 23(23 + 1)/4] / √[23(23 + 1)(2(23) + 1)/24] = -3.026

and a z value of -3.026 corresponds to P[H_0 is true] = 0.0025.
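For illustration, the signed-rank procedure of Table 3-7 can be sketched as
follows (not the report's own software); the exact test still requires Table
A-4 for small N.

    import math

    def wilcoxon_signed_rank(pairs):
        """Wilcoxon signed-rank statistic (Table 3-7 procedure) with the
        normal approximation of Equation 3-17.  Returns (T1, T2, z)."""
        diffs = [a - b for a, b in pairs if a != b]          # step 1: drop zeros
        n = len(diffs)
        order = sorted(range(n), key=lambda i: abs(diffs[i]))
        ranks = [0.0] * n
        i = 0
        while i < n:                                         # step 2: average tied ranks
            j = i
            while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
                j += 1
            avg = (i + j) / 2.0 + 1.0
            for k in range(i, j + 1):
                ranks[order[k]] = avg
            i = j + 1
        t1 = sum(r for r, d in zip(ranks, diffs) if d < 0)   # step 4: rank sums
        t2 = sum(r for r, d in zip(ranks, diffs) if d > 0)
        z = (t1 - n * (n + 1) / 4.0) / math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
        return t1, t2, z

    # For the Table 3-5 pairs this returns T1 = 65.5, T2 = 124.5, z = -1.187.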
  3.4.3   Rank Sum Test
     The rank sum procedure is useful in testing the assumption
(H_0) that two samples represent populations with the same distri-
bution.  Unlike the sign test and the signed-rank test, the rank
   TABLE 3-8.  APPLICATION OF WILCOXON SIGNED-RANK TEST TO OZONE
                      DATA (ppb) IN TABLE 3-5

          Absolute difference      (Signed) Rank
                   0                     *
                   0                     *
                   0                     *
                   0                     *
                   0                     *
                   5                    -4
                   5                    +4
                   5                    +4
                   5                    +4
                   5                    +4
                   5                    +4
                   5                    +4
                  10                 -10.5
                  10                 -10.5
                  10                 +10.5
                  10                 +10.5
                  10                 +10.5
                  10                 -10.5
                  15                   +15
                  15                   -15
                  15                   -15
                  20                 +17.5
                  20                 +17.5
                  25                   +19
sum test is applicable to independent (unrelated)  samples.   Table
3-9 lists the steps  in the procedure appropriate  for tests  of 10
or more data  pairs.

             TABLE 3-9.  PROCEDURE FOR WILCOXON RANK SUM TEST

1.  Combine the n_1 observations from population 1 and the n_2 observations
    from population 2, arrange them in ascending order of size, and then
    assign ranks from 1 to (n_1 + n_2), where n_1 ≤ n_2.  In case of ties, use
    the average rank.
2.  Calculate T_1, the sum of the ranks assigned to observations from popu-
    lation 1.
3.  Compare T_1 with the critical values in Table A-5, α = 0.10 (Reference 20).
    If T_ℓ ≤ T_1 ≤ T_r for the values of T_ℓ and T_r listed for samples of n_1
    and n_2 in the table, then P[H_0 is true] ≥ 0.10.  If T_1 is outside the
    indicated range, go to the α = 0.05 table and repeat the test.  If
    T_ℓ ≤ T_1 ≤ T_r, then 0.05 ≤ P[H_0 is true] < 0.10.  If not, continue the
    search until a value of α is found for which T_ℓ ≤ T_1 ≤ T_r.  If α < 0.05
    there is good reason to reject H_0.
4.  To test H_0 for n_2 > 20, use the large sample normal approximation,
    Equation 3-18.  The probability that H_0 is true, i.e., P[H_0 is true], is
    equal to twice the area to the right of |z| under the standard normal
    curve.
      Table  3-10 illustrates  the  application of  the rank  sum test
to  the ozone  data listed in Table 3-5.   Note that tied values are
assigned an average rank value.  The sum T_1 of the ranks from the
first sample is equal to 598.  In Table A-5, α = 0.10, T_ℓ = 507
and T_r = 669 for n_1 = n_2 = 24 (Reference 20).  Consequently T_ℓ ≤ T_1 ≤ T_r,
P[H_0 is true] ≥ 0.10; thus H_0 should not be rejected.  Using the large
sample normal approximation (Equation 3-18),

     z = [T_1 - n_1(n_1 + n_2 + 1)/2] / √[n_1 n_2 (n_1 + n_2 + 1)/12],     Equation 3-18
and substituting the data,
     z = [598 - 24(24 + 24 + 1)/2] / √[(24)(24)(24 + 24 + 1)/12] = 0.206.

A z value of 0.206 corresponds to P[H_0 is true] = 0.837.
              TABLE 3-10.  APPLICATION OF RANK SUM TEST TO OZONE
                             DATA IN TABLE 3-5

    Value   Data set   Rank        Value   Data set   Rank
      5        A         2           45       B       25.5
      5        B         2           45       A       25.5
      5        B         2           50       B       28.5
     10        B         5.5         50       B       28.5
     10        A         5.5         50       B       28.5
     10        A         5.5         50       B       28.5
     10        B         5.5         55       B       31.5
     15        A         9           55       A       31.5
     15        B         9           60       B       33
     15        A         9           65       A       36.5
     20        A        12           65       A       36.5
     20        B        12           65       A       36.5
     20        A        12           65       B       36.5
     25        B        14.5         65       B       36.5
     25        A        14.5         65       B       36.5
     30        A        17           70       A       41
     30        A        17           70       A       41
     30        B        17           70       B       41
     35        A        20.5         85       A       43
     35        B        20.5         90       B       44.5
     35        A        20.5         90       A       44.5
     35        B        20.5        100       A       46
     40        A        23.5        110       B       47
     40        B        23.5        130       A       48
     Application of the rank sum test to the biased data in Table
3-6 yields T_1 = 626.5.  Table A-5 shows that T_ℓ ≤ 626.5 ≤ T_r for
α = 0.10; it follows that P[H_0 is true] > 0.10.  The large sample
normal approximation is consistent with this result since z =
0.794 and P[H_0 is true] = 0.427.  In either case, the rank sum
test does not indicate that H_0 should be rejected.
     Note that the rank sum test accepts H_0 with respect to the
data in Table 3-6, but the sign test and the signed-rank test re-
ject H_0.  This apparent contradiction is the result of a funda-
mental difference in the construction of the tests.  The sign and
signed-rank tests are paired-value tests; consequently, they are
particularly sensitive to differences between related observa-
tions.  Both of these tests changed from accepting H_0 to reject-
ing H_0 when a small bias (5 ppb) was introduced into data set A.
Because  the  rank sum  test  compares  sample  distributions without
actually pairing  the  data,  it is relatively insensitive to small
data  shifts.   The rank sum test  is  more appropriate for identi-
fying censored samples.
     The data listed in Table 3-10 are similar to those in Table
3-5, except that all values from site A exceeding 60 ppb have
been omitted.  In this case n_1 = 15, n_2 = 24, and T_1 = 225.  The
largest α value in Table A-5 for which T_ℓ ≤ T_1 ≤ T_r is 0.02.
Consequently, 0.02 < P[H_0 is true] ≤ 0.05, and there is good rea-
son to reject H_0.  Substitution of the appropriate values into
the equation for the large sample approximation yields z = -2.165
and P[H_0 is true] = 0.030.
     The sign and signed-rank tests are not as sensitive to cen-
sored  data as the rank sum  test, mainly because they  ignore un-
paired  data.  In  the example  just described, the  sign and signed-
rank  tests would not  use the nine  values  from site B for which
there   are  no  corresponding  values  listed  for  site  A.   Un-
like  the rank sum test, neither  of  these paired  data tests would
have rejected H_0.
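A sketch of the rank sum computation (Table 3-9 and Equation 3-18) follows,
again for illustration only; the tabled critical values of Table A-5 are
still needed for the exact test.

    import math

    def rank_sum_z(sample1, sample2):
        """Wilcoxon rank sum statistic T1 for sample1 and the large-sample
        normal approximation of Equation 3-18."""
        combined = sorted((v, which) for which, s in ((1, sample1), (2, sample2)) for v in s)
        n = len(combined)
        ranks = [0.0] * n
        i = 0
        while i < n:                          # average ranks for tied values
            j = i
            while j + 1 < n and combined[j + 1][0] == combined[i][0]:
                j += 1
            avg = (i + j) / 2.0 + 1.0
            for k in range(i, j + 1):
                ranks[k] = avg
            i = j + 1
        t1 = sum(r for r, (v, which) in zip(ranks, combined) if which == 1)
        n1, n2 = len(sample1), len(sample2)
        z = (t1 - n1 * (n1 + n2 + 1) / 2.0) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
        return t1, z

    # For the site A and site B values of Table 3-5 this gives T1 = 598, z = 0.206.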
3.4.4   Intersite  Correlation  Test
     The intersite correlation  test is suggested  as a means  of
comparing  two  correlated parameters being measured at the  same

site or at neighboring  sites.   An example is given to illustrate
the procedure for TSP measurements  on every sixth day at each of
the neighboring sites over a period of 1 year.
     It would be possible  to treat each site independently; how-
ever,   this  approach does  not consider the  relationship between
the measured  TSP  concentrations at the two  sites,  and hence may
err either  by not  identifying  a potential  data  anomaly  or  by
identifying a possible  outlier  and later finding that it is con-
sistent with that measured at a neighboring site.
     The data for the example are in Table 3-11.  Denote by x and
y the measurements of µg TSP/m³ at sites 397140020H01 and
397140014H01, respectively.  The original data and the logarithms
are both given because TSP data are almost always better approxi-
mated by using the lognormal distribution.
     Because the lognormal distribution is preferable, these data
are plotted in Figure 3-9 on log-log paper (2 cycles).  Note that
there  is  a relatively high  correlation (about  0.90)  between the
two measurements.
     The calculation  procedure  is given  for  two  cases:   (1) as-
suming the data to  be normally distributed (actually a bivariate
normal)  and  (2) assuming  the data are lognormally distributed.
(In the latter case, only the changes in the calculations from
case 1 are given; see Reference 21.)
Calculations (Assuming Bivariate Normal)
     1.   Calculate the mean and the  standard deviation for each
variable.
            x̄ = 56.5                 ȳ = 49.0
           s_x = 26.3               s_y = 22.9
     2.   Calculate the correlation coefficient, r, for the
two measurements,

           r = [Σ x_i y_i - (Σ x_i)(Σ y_i)/n] / [(n - 1) s_x s_y] = 0.91    Equation 3-19

where n is the number of paired observations.
              TABLE 3-11.  TSP DATA FROM SITES 397140014H01 AND
                    397140020H01 FOR INTERSITE CORRELATION
                                  TEST, 1978

      Site 20 (x),   Site 14 (y),
         µg/m³          µg/m³         ln x      ln y
           43             31          3.76      3.43
           40             34          3.69      3.53
           24             13          3.18      2.56
           31             40          3.43      3.69
           50             49          3.91      3.89
           13             19          2.56      2.94
           65             79          4.17      4.37
           54             39          3.99      3.66
           58             51          4.06      3.93
           53             46          3.97      3.83
           77             72          4.34      4.28
           59             49          4.08      3.89
           75             72          4.32      4.28
           36             33          3.58      3.50
           28             18          3.33      2.89
           30             24          3.40      3.18
           31             24          3.43      3.18
           57             47          4.04      3.85
           41             32          3.71      3.47
           65             66          4.17      4.19
           31             28          3.43      3.33
           69             68          4.23      4.22
           60             74          4.09      4.30
           87             83          4.47      4.42
           76             80          4.33      4.38
           33             37          3.50      3.61
           73             69          4.29      4.23
           57             55          4.04      4.01
           36             28          3.58      3.33
           65             51          4.17      3.93
           40             42          3.69      3.74
           88             53          4.48      3.97
           71             56          4.26      4.03
          175            129          5.16      4.86
           85             64          4.44      4.16
           56             46          4.03      3.83
           88             46          4.48      3.83
           75             57          4.32      4.04
           33             26          3.50      3.26
           50             41          3.91      3.71
           57             41          4.04      3.71
          108             90          4.68      4.50
           46             31          3.83      3.43
           69             63          4.23      4.14
           42             37          3.74      3.61
           76             69          4.33      4.23
          101            108          4.62      4.68
           38             37          3.64      3.61
           57             47          4.04      3.85
           45             43          3.81      3.76
           28             26          3.33      3.26
           33             25          3.50      3.22
           52             45          3.95      3.81
           38             39          3.64      3.66
           43             23          3.76      3.14
     3.   Obtain  a  probability ellipse,  as  described in steps 4
through 7 and shown in Figure 3-9.
     4.   Assume  that the  means,  standard deviations, and corre-
lation coefficient  are  known values describing all data from the
sites.  There should be considerable representative data used in
the determination of these statistics, at least 50 days  (measure-
ments) for each  site.   This  is similar to the assumption made in
the development of a quality control chart.
     5.   Assume  a  probability level  such  as 95%; that is,  the
ellipse to be constructed  should contain 95% of the data points.
     6.   Determine a critical value of χ² (chi-square) for two
degrees of freedom for 95% probability; see Table 3-12 for a list
of values of χ² for various probability levels from 50% to
99.95%.  The value is 5.99 for 95% probability.
     7.  Plot the ellipse with  coordinates  (x,  y) which satisfy
the equation,

     [1/(1 - r²)]{[(x - x̄)/s_x]² - 2r[(x - x̄)/s_x][(y - ȳ)/s_y]
                  + [(y - ȳ)/s_y]²} = χ²                 Equation 3-20

     (Figure 3-9:  log-log scatter plot of the paired TSP concentrations,
     µg TSP/m³, at site 397140020H01 (abscissa) and site 397140014H01
     (ordinate), with the 95% probability ellipse superimposed.)

              Figure 3-9.  Intersite correlation test data.
             TABLE 3-12.  χ² VALUES FOR TWO DEGREES OF FREEDOM,
                     FOR VARIOUS PROBABILITY LEVELS

                    Confidence level, %        χ²
                           50                 1.39
                           60                 1.83
                           70                 2.41
                           80                 3.22
                           90                 4.61
                           95                 5.99
                           97.5               7.38
                           99                 9.21
                           99.5              10.6
                           99.9              13.8
                           99.95             15.2
or on substitution,

     [1/(1 - (0.91)²)]{[(x - 56.5)/26.3]² - 2(0.91)[(x - 56.5)/26.3][(y - 49.0)/22.9]
                       + [(y - 49.0)/22.9]²} = 5.99.

This ellipse falls within (inscribed within) a rectangle with
center at (x̄, ȳ) and with two sides of lengths 2s_x√5.99 and
2s_y√5.99, respectively.  In this example,

           2s_x√5.99 = 128.9
           2s_y√5.99 = 112.0.

Note that √5.99 is the square root of the χ² value corre-
sponding to the confidence level selected.
    The  computations  are  tedious  and  should be  programmed  for
computerized  solution on  repeated use.  Another  approach is  to
use the ellipse plotting procedure adapted for the NORMIX
cluster analysis (Reference 14).
Calculations (Assuming Lognormal Distribution)
     If  the  analysis  is performed using the logarithms of x  and
y  (say x'  and y')  one can  substitute  x'  and  y'  throughout  the
above  steps.   After all of  the  computations have been  completed
using the  logarithms,  transform the results  back  to the original
data by using either e^x or 10^x, depending on whether natural or
common logarithms have been used.
     The computations for the logarithms are:
           x̄' = 3.94                ȳ' = 3.79
          s_x' = 0.44              s_y' = 0.47
            r  = 0.90
The lengths of the sides of the rectangle are:
           2s_x'√5.99 = 2.15
           2s_y'√5.99 = 2.30.
     If a data point falls outside of the ellipse then the values
for both x and y should be flagged for checking.
     There  are  four  data  points  outside  the  95%  confidence
ellipse.   If  the data had  been studied one variable  at  a time,
the point (175,129) would be subject to question, as indicated in
Section 3.3.4.   Even  though this  point falls  outside the ellipse
it  does not  appear  to be  inconsistent with  the  remaining data
based on  data  from both of the sites;  however,  further  study of
these relatively high values would be advisable.   The other three
data points  have at least one  coordinate value  within the range
of the  other  data,  and studying one variable  at a time would not
necessarily  suggest  these  coordinate  values  as  possible anoma-
lies.   Two  of these points  should  be  studied  further,  based on
the correlated data.
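A sketch of the ellipse check of Equation 3-20 follows.  It is illustrative
only: it simply evaluates the quadratic form for each (x, y) pair and flags
pairs exceeding the chosen χ² value (5.99 for the 95% level).

    def ellipse_statistic(x, y, mean_x, mean_y, s_x, s_y, r):
        """Quadratic form on the left-hand side of Equation 3-20."""
        u = (x - mean_x) / s_x
        v = (y - mean_y) / s_y
        return (u * u - 2.0 * r * u * v + v * v) / (1.0 - r * r)

    def flag_pairs(pairs, mean_x, mean_y, s_x, s_y, r, chi_square=5.99):
        """Return the (x, y) pairs falling outside the probability ellipse."""
        return [(x, y) for x, y in pairs
                if ellipse_statistic(x, y, mean_x, mean_y, s_x, s_y, r) > chi_square]

    # Using the untransformed statistics from the example (x-bar = 56.5,
    # y-bar = 49.0, s_x = 26.3, s_y = 22.9, r = 0.91), the pair (88, 46)
    # yields a large statistic and is flagged, while (57, 47) is not.
    print(flag_pairs([(88, 46), (57, 47)], 56.5, 49.0, 26.3, 22.9, 0.91))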
3.5  REFERENCES
     1.   U.S.  Department  of  Commerce.   Computer  Science  and
          Technology:   Performance  Assurance  and  Data Integrity
          Practices.   National  Bureau  of  Standards,  Washington,
          D.C.  January 1978.
     2.   Data Validation Program for  SAROAD,  Northrup Services,
          EST-TN-78-09, December 1978,  (also see Program Documen-
          tation Manual, EMSL).
     3.   Barnett,  V.,  and T. Lewis.   Outliers  in  Statistical
          Data.  John Wiley and Sons,  New York, 1978.
     4.   U.S. Environmental Protection Agency.  Screening Proce-
          dures  for Ambient Air Quality Data.  EPA-450/2-78-037,
          July 1978.
 5.    W.  F. Hunt, Jr., T.  C.  Curran,  N. H. Frank,  and  R.  B.
      Faoro,  "Use of  Statistical Quality  Control  Procedures
      in Achieving  and  Maintaining Clean  Air,"  Transactions
      of the Joint European Organization for Quality Control/
       International  Academy  for Quality Conference,  Venice
       Lido, Italy,  September 1975.

 6.    W.  F. Hunt, Jr., R.  B.  Faoro,  T. C.  Curran,  and  W.  M.
      Cox,  "The Application of  Quality Control Procedures to
      the Ambient Air  Pollution Problem in the  USA," Trans-
      actions of  the  European Organization for  Quality Con-
      trol, Copenhagen,  Denmark, June 1976.

 7.    W.  F.  Hunt,  Jr.,  R.  B.  Faoro,   and  S.  K.  Goranson,  "A
      Comparison of the Dixon Ratio Test and Shewhart Control
      Test Applied  to  the National  Aerometric Data Bank,"
      Transactions of the  American Society for  Quality Con-
      trol, Toronto, Canada, June 1976.

 8.    Grubbs, F. E., and G.  Beck.   Extension of Sample Sizes
      and Percentage Points for Significance Test of Outlying
      Observations.   Technometrics, Vol. 14,  No.  4, November
      1972.

 9.    Curran,  T.  C.,  W.  F.  Hunt,  Jr.,  and R.   B.  Faoro.
      Quality Control  for Hourly  Air Pollution Data.   Pre-
      sented at  the 31st Annual Technical  Conference of the
      American  Society   for Quality   Control,  Philadelphia,
      May 16-18, 1977.

10.    Johnson,  T.  A Comparison  of the Two-Parameter Weibull
       and  Lognormal Distributions  Fitted  to Ambient  Ozone
       Data.   PEDCo  Environmental,  Inc.,  Durham,  North
       Carolina.   Presented at the Air Pollution Control
       Association meeting on Quality Assurance in Air
       Pollution Measurement, New Orleans, March 11-14, 1979.

11.    Larsen, R.  I.  A  Mathematical  Model for  Relating Air
      Quality Measurements  to  Air Quality  Standards,  Publi-
      cation No. AP-89,  U.S. Environmental Protection Agency,
      1971.

12.    Marriott,   F.  H.  C.   The  Interpretation  of Multiple
      Observations.   Academic Press,  New York, 1974.

13.    Hawkins,  D. M.  The Detection of Errors  in Multivariate
      Data  Using  Principal  Components.   Journal  of  the
      American  Statistical Association,  Vol. 69,  No.  346.
      1974.

14.    Wolfe,  J.  H.   NORMIX:   Computation  Methods  for  Esti-
      mating the  Parameters  of Multivariate  Normal Mixtures
      of Distributions,  Research Memorandum  SRM 68-2.   U.S.
      Naval  Personnel  Research  Activity,   San Diego,  1967.

15.    U.S.  Environmental  Protection Agency.  Guidelines  for
      Air Quality  Maintenance Planning  and Analysis.   Vol.
      11.  Air  Quality Monitoring  and Data Analysis.   EPA-
      450/4-74-012, 1974.

16.    U.S.  Environmental  Protection Agency.  Quality  Assur-
      ance and Data Validation for  the Regional Air Monitor-
      ing System   of  the   St.  Louis  Regional  Air  Pollution
      Study.   EPA-600/4-76-016,  1976.

17.    W.  F.  Hunt,  Jr.,  J.  B.  Clark, and S.  K.  Goranson,  "The
      Shewhart Control  Chart  Test:   A Recommended  Procedure
      for Screening 24-Hour  Air  Pollution  Measurements,"  J.
      Air Poll.  Control Assoc. 28:508  (1979).

18.    Grant,   E.   L.,   and R.  S.  Leavenworth.    Statistical
      Quality Control.  McGraw-Hill Book Company,  New York.

19.    Siegel,   S.   Nonparametric   Statistics   for   the   Be-
      havioral Sciences, McGraw-Hill,  1956.

20.    Remington,  R.  D.,  and  M.  A.  Schork.  Statistics  with
      Applications  to   the  Biological and  Health  Sciences.
      Prentice-Hall,  Inc.,  Englewood  Cliffs,   New  Jersey,
      1970.

21.    Hald,  A.  Statistical Theory  with  Engineering Applica-
      tions.   New York, 1952.
         4.0   SELECTION AND  IMPLEMENTATION OF  PROCEDURES


     There  are  several  factors  to  be evaluated before  one  can

select  the most  appropriate  data validation  procedure  for  the

specific application.   These factors can be categorized into  two

sets  of decision criteria:  (1) those based  on  an  organizational

perspective  and  (2)  those  based on analytical  characteristics.

Table  4-1  gives  a breakdown of  the  factors to  be  considered in

the selection of data validation procedures.


             TABLE 4-1.  FACTORS TO CONSIDER IN THE SELECTION
                      OF DATA VALIDATION PROCEDURES
                         Organizational Criteria
                  1.   Number of data sets
                  2.   Historical data requirements
                  3.   Nature of data anomalies
                  4.   Manual methods
                  5.   Continuous methods
                  6.   Strip chart data
                  7.   Magnetic tape data
                  8.   Data transmitted by telephone lines
                  9.   Timeliness of the procedure

                 	Analytical Criteria	

                  1.   Statistical sophistication
                  2.   Computational requirements
                  3.   Expense of analysis
                  4.   Sensitivity of the procedure
                  5.   Use of data
     The  organizational criteria  can be used  as an initial  screen-

ing  of the procedures  based on  the data  needs,  the analytical

capabilities of  the agency,  and  staff  training.   For example,  a

local  agency with limited staff  and without  computer facilities

and  statistical  support would  select  from  those  procedures  in

Sections  3.1.1,  3.1.2,  3.1.3,   3.2.1,  3.3.1,  and 3.3.2--that  is,
data  ID  checks,  unusual  event review,  deterministic relationship
checks,  data plots,  gross limit  checks, and  pattern checks.   On
the other  hand,  a  Federal agency with extensive capabilities  can
use any  of the validation procedures  with heavy emphasis  on com-
puterized  procedures,  computer graphics,  Shewhart control  charts,
the gap  test,  and/or the Johnson p  test.
     The factors  are  described  in Sections  4.1 and 4.2.   Table
4-2 gives  a  selection procedure  using  three scenarios based  on
whether  or not a local, State, or Federal agency has computer  and
statistical  resources  available.   Section  4.3 contains  a  discus-
sion of  the  implementation of the data validation process.

             TABLE 4-2.  SELECTION OF DATA VALIDATION PROCEDURES

       Scenario of                         Select procedures from those in
      agency/resources                         subsections listed below

 State or local agency
 1.  Without both computer and             3.1.1, 3.1.2, 3.1.3,
     statistical support                   3.2.1 (limited to manual effort),
                                           3.3.1, 3.3.2
 2.  With computer but limited             The above plus 3.2.1
     statistical support                   (computerized graphics), 3.2.2,
                                           3.3.3, 3.3.4, 3.4.1
 Federal, State, or local agency
     With both computer and statistical    Any procedure in Section 3.
     support
4.1  ORGANIZATIONAL CRITERIA
     The  selection criteria  which  can be  used  in  preliminary
screening of the validation procedures are given in  this section.
4.1.1   Number of Data Sets
     The  test  procedures  for  internal consistency are designed
for  the validation of a single  data set,  without  the  use of his-
torical data.  If there is an instrument error which consistently
alters  all of  the values in  the data  set,  these  test procedures
will not  be  of any value.  Hence it is necessary  to use at least
one  test  which  identifies  or  flags  data based  on previous  or
historical  data  or  on  comparison  with  other  (parallel)  data
sets.  If an extensive analysis of past data is available or can be
readily performed, the test procedures of Section 3.3 should be
considered; otherwise, the data should be compared with other
(parallel) data sets as described in Section 3.4.
4.1.2  Historical Data Requirements
     The major  test procedures  which require historical data for
setting limits  are  the  gross limit checks and pattern tests,  the
Shewhart control  chart,  and the intersite correlation test.   In
all  of these,  data  are  required over an extensive time period so
that the results can be applied to new data.  Limits can be altered
as appropriate to take into account both the additional data and the
experience gained in using the procedures.
4.1.3  Nature of Data Anomalies
     Some tests are designed for a single outlier (extreme value),
whereas other tests are designed to detect shifts in the variation
of the data and/or in the mean (or median).
For  example, there  are Dixon ratio tests for a single outlier and
an outlier pair;  the  sign test checks for shifts  in the mean or
median.  The  Shewhart test detects shifts in either the mean or
the variance  (range or standard deviation) and anomalous outliers
by means of the range check.  The graphical techniques are likely
to identify any one of the data  anomaly types if the graph is not
too complex.
4.1.4  Strip Chart Data
     The analysis  of  strip  chart data  can be  limited to simple
visual scans  for gross  anomalies.   However,  it may also be  de-
sirable to  establish  limits within  which successive values must
fall for the  data to  be valid.  Hence,  the successive difference
test would be particularly useful.  Also, gross limit and pattern
tests can be easily applied to strip chart data.
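     A minimal sketch of such a check, in Python, is given below; the
hourly readings and the allowable step of 0.05 ppm are hypothetical
and would in practice be derived from historical data.

     # Illustrative sketch of a successive difference check for hourly
     # values read from a strip chart.  The allowable step of 0.05 ppm
     # and the readings are hypothetical.

     def successive_difference_check(values, max_step):
         """Return indices of values differing from the preceding value
         by more than max_step (flagged for investigation)."""
         flagged = []
         for i in range(1, len(values)):
             if abs(values[i] - values[i - 1]) > max_step:
                 flagged.append(i)
         return flagged

     hourly_o3 = [0.03, 0.04, 0.05, 0.19, 0.06, 0.05]     # ppm
     print(successive_difference_check(hourly_o3, max_step=0.05))  # [3, 4]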
4.1.5  Size of Data Set  (Manual  and  Continuous  Methods,  Table
       4-1)
     Manual methods are most appropriate for small data sets, typically
daily average values recorded either every day or every sixth day.
Procedures  which can  be  performed  without  a  computer  are  the
Dixon  ratio,  the  Shewhart control  chart,  data  plots,  and  the
routine procedures described in Sections 3.1.1,  3.1.2,  and 3.1.3.
     Computer  methods   are desirable for validating  large data
sets (e.g., from continuous monitoring  of hourly average concen-
trations)  since  data handling  and calculations by  manual  means
can  become tedious and can result in errors.   However,  the data
validation  procedure  can be added  to  the  standard  analysis  and
data  storage  procedures   to minimize  the  total  computer  time.
4.1.6  Magnetic  Tape Data
     Data  recorded on magnetic tape can be easily validated using
computerized  methods.   Almost  all of the procedures can be con-
sidered.  The procedures that are most sensitive and economical for
the expected types of anomalies should be used.  Turnaround time
should be minimal.  Section 3.1.4 describes several procedures which
should be routinely used to maintain data integrity.1  For example,
the data can be blocked into logical periods (days, weeks, months,
etc.) and tests performed on each block using internal consistency
checks (e.g., the Dixon ratio) or comparisons with historical data
(e.g., control charts).  In addition, routine checks can be made of
the descriptive identification codes.
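     A minimal sketch of this kind of routine blocking and checking,
in Python, is given below; the station code, the gross limit, and the
records are hypothetical.

     # Illustrative sketch: block records into daily sets, verify the
     # station identification code, and apply a gross limit check to
     # each block.  The station code "0841" and the limit are hypothetical.

     def validate_blocks(records, expected_station, gross_limit):
         """records: list of (station_id, day, hour, value) tuples.
         Returns (record, reason) pairs flagged for investigation."""
         flags = []
         blocks = {}                        # day -> list of (hour, value)
         for rec in records:
             station, day, hour, value = rec
             if station != expected_station:
                 flags.append((rec, "unexpected station ID"))
                 continue
             blocks.setdefault(day, []).append((hour, value))
         for day, block in blocks.items():
             for hour, value in block:
                 if value > gross_limit:
                     flags.append(((expected_station, day, hour, value),
                                   "exceeds gross limit"))
         return flags

     data = [("0841", 265, 12, 0.13), ("0841", 265, 13, 0.52),
             ("0842", 265, 14, 0.11)]
     print(validate_blocks(data, expected_station="0841", gross_limit=0.30))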
4.1.7  Data Transmitted by Telephone Lines
     The  recommendations  for data transmitted by telephone lines
are  the  same  as those for data  on magnetic  tape.  Experience
gained  through  CHAMP suggests it  is not always advisable  to use
the  telemetry data when the magnetic tape  data are  available and
when the  latter  can  be used in  a timely  manner.   In any case,
there  should  be  data validation checks  to  ensure that the  telem-
etry data  are the  same as the  data on magnetic tape and/or the
raw data.   Section 3.1.4  describes several procedures to be con-
sidered  for routine use.1
     The principal  result of transmission  error  is the  loss  or
alteration of  data.   One way to check  for  transmission error  is
to transmit  the  data a  second  time and  then pair the  two  data
streams.  Gaps and alterations  will be immediately apparent un-
less the transmission error is systematic.
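     A minimal sketch of such a comparison is given below (in Python,
with hypothetical values); note that a systematic error affecting
both copies identically would escape it.

     # Illustrative sketch: compare two transmissions of the same hourly
     # data stream.  None represents an hour lost in transmission.

     def compare_transmissions(first, second):
         problems = []
         for hour, (a, b) in enumerate(zip(first, second)):
             if a is None or b is None:
                 problems.append((hour, "missing"))
             elif a != b:
                 problems.append((hour, "altered"))
         return problems

     copy1 = [0.04, 0.05, None, 0.07]
     copy2 = [0.04, 0.06, 0.06, 0.07]
     print(compare_transmissions(copy1, copy2))  # [(1, 'altered'), (2, 'missing')]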
4.1.8  Timeliness of Procedure
     One of  the  key  aspects  of  a good data validation program is
the  timely  identification of data  anomalies  and  the  feedback  of
this information to the data source for corrective action and for
review of the raw data and background information needed to check
the  validity of  the  suspect  data.   The  shortest  possible turn-
around time is desired so that the information loop can be closed
before vital information  is  lost and/or before more questionable
data are generated.   The  combination of procedures selected must
satisfy  these  time  constraints.   For  example,  some  procedures
with very quick response times can be used jointly with more com-
plex procedures  in  order to catch gross errors  quickly  and  to
catch less obvious errors on a slower schedule that is still timely
enough for the validation results to be useful.
4.2  ANALYTICAL CRITERIA
     The  selection factors which  are  analytical  in  nature  are
described in this section.
4.2.1  Statistical Sophistication
     The test  procedures are arranged  within each subsection  of
Section  3  from those that are  least sophisticated to  those that
are most complex.  Although some test procedures may appear to be
statistically  complex,  these same  test  procedures may  be  very
convenient to  use.   On the other hand,  some  of those  with least
sophistication can require considerable effort  (e.g.,  plotting of
all of the data).
4.2.2  Computational Requirements
     Although a computerized approach is desirable for several of
the procedures,  all  of  them  can be accomplished by manual means.

The only procedures requiring computer help for routine use with
large volumes of data are data plots, the gap and Johnson p tests,
intersite correlation, and rank  sum tests.   For small data sets,
many  of  these tests  can  be  performed easily  by manual  means.
Some  procedures  require considerable  computation to  derive the
limits from  historical  data,  but  the  test is  easily performed
after this  initial step,  which need not be repeated  until one
suspects  that the  limits need  to  be changed  to reflect real
changes in air quality.
4.2.3  Expense of Analysis
     The expense  of the  analysis in manual/computer time is rel-
atively  small  compared to data costs and/or the cost of provid-
ing  invalid  data  to  a  data  bank  for  use  by a decision maker.
However,  these costs  must be  considered  from the  standpoint of
using an efficient data validation procedure.   If  a procedure
ignores  too  many  data anomalies  or flags too  many  good  data,
either the limits need to be adjusted or the procedure eliminated
from  further  use.  The  expensive  procedures  are not  always the
best  (e.g., the  Shewhart test will not cost much on a continuing
basis,  and it will  detect  many  types  of invalid  data).  The
graphical plots can be very time consuming,  by manual means or by
computer; however,  the results  will  also prove  very  useful for
detecting invalid data that  would be missed by  other procedures.
4.2.4  Sensitivity of the Procedure
     The capability of the data validation procedure to correctly
identify  invalid  data is  critical  in the selection.   One  should
also  refer  to Section 4.1.3 above  which  discusses  the nature of
the  data anomaly.  Some procedures will  identify  a  single out-
lier; others  will detect  shifts in the  median  (or mean)  level,
while still others  will detect changes in the variability  of the
data.
      Some procedures,  such as the  Dixon  ratio  test for a  single
large value  (Section  3.2.2.1),  may be  very insensitive if  there
is  a second  large value.   In these cases, another  Dixon  ratio
test (Section 3.2.2.2) can be used which is sensitive to pairs of
outliers.  It is desirable that the user of the procedure determine
the types of invalid data likely to occur and then design the data
validation program to detect these specific types.  Thus it is not
possible to say that one procedure is always best; rather, the
appropriateness of a procedure varies with the type of data anomaly.
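     A minimal sketch of two commonly used forms of the Dixon ratio
statistic is given below; the exact variants and the critical values
to be applied are those of Section 3.2.2, and the data shown are
hypothetical.

     # Illustrative sketch of two commonly used Dixon-type ratio
     # statistics:  one for a single high value, and one that is less
     # easily masked by a second high value.  Data are hypothetical.

     def dixon_r10(x):
         """(x[n] - x[n-1]) / (x[n] - x[1]) for ascending-sorted data."""
         return (x[-1] - x[-2]) / (x[-1] - x[0])

     def dixon_r20(x):
         """(x[n] - x[n-2]) / (x[n] - x[1]); less easily masked by a
         second large value."""
         return (x[-1] - x[-3]) / (x[-1] - x[0])

     x = sorted([38.0, 42.0, 45.0, 47.0, 51.0, 96.0, 98.0])
     print(round(dixon_r10(x), 3))   # 0.033: the largest value is masked
     print(round(dixon_r20(x), 3))   # 0.783: the pair of high values shows
     # Each ratio is compared against a tabulated critical value for the
     # sample size and the significance level chosen.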
     Decisions  concerning sensitivity  requirements  can  be  made
best after  some experience has  been  gained in  using the proce-
dure.  The data analyst should continually note the percentage of
flagged  data  which upon  investigation  are  found to  be erroneous
values.  A control  chart  may be used to maintain a record of the
performance  of  the  data  validation procedure.   The sensitivity
and cost of implementation of different procedures should be com-
pared.    In  comparing  the sensitivities one  must be  careful to
obtain an independent  assessment  of each procedure.   This may be
a  problem when  the procedures  are applied successively (or in
series).  An exponential identifier (e.g., see Section 5.5.1 for the
use of 10^34 for invalid data) can be used in computer appli-
cations  to  indicate the  procedure(s)  which identified  the  data
as  questionable and  to   aid in  statistical analysis  of  the ef-
ficiency of the procedures.
4.2.5  Use of Data
     The intended use of  data should be considered in determining
the time  and  level  of effort to be put into data validation.  If
the  data are to be used in making important  policy decisions,
then they should be more  carefully screened than if the data are
being used in a less critical role.  If in doubt, err on the side
of more  careful validation; data are often used in  a manner un-
known to the generator/originator of the data.
4.3  IMPLEMENTATION OF DATA VALIDATION PROCESS
     To  make  the data validation  process  effective  requires the
consideration of  several  important factors in both the selection
and implementation  of  the procedures.   Some of the factors to be
considered in  implementation  were identified in  Section 2.4 and
are discussed in this section.
     Initially,  a  person  needs  to  be identified  with  the re-
sponsibility for the data validation activities.   In  a local or
State agency,  this may  be  a part-time function;  whereas in a re-
gional or  Federal  agency,  one or more persons may  be  given this
responsibility on a full-time basis.
     The data validation procedures need to be selected by taking
into consideration the several factors identified in Sections 4.1
and  4.2.   The  process must be consistent  with,  for example, the
use of the data, the network size, data volume, and staff size
and training.
     After the selection of the procedures, the limits, patterns,
and  confidence  levels  must be determined  (using historical data)
for  a particular  application to obtain an  efficient  procedure.
The limits may need to be adjusted after some experience with the
procedures.  The tighter the limits, the greater the quantity of
data that are flagged and hence the greater the effort of checking
the flagged data.  There is a tradeoff between the
stringency of  the  data validation and  the costs  of flagging too
many  data values  (and the completeness  of  the  remaining data,
particularly if the data identified by the validation process are
deleted or flagged relative to further use in some computations).
     A data  validation plan should be prepared and documented as
a part of the  quality assurance plan (as described in Volume  I of
the  Quality  Assurance Handbook).2   This  plan  should  include:
      1.   The  data flow (reporting) hierarchy;
      2.   The  data validation check points;
      3.   The  methods for checking the data;
     4.   The  documentation of the process  (the data values  flag-
ged  for  investigation, the values  inferred  to be  anomalies and
those for which no clear decision could be made);
      5.   The  techniques used for anomaly  investigation;
      6.   The  reporting schedules;
      7.   The  mechanism for  feedback of  validation  information
to the data  collection, to the quality control procedures,  and to
the  validation process;
     8.   The  resources   (including  costs)  for performing  data
validation;3
     9.   Listing of QC checks used to reject data outright and
without further investigation (e.g., the use of zero and span checks
to reject data when the span drift exceeds a specified limit);
    10.   The  routine  handling  of  data  in  blocks  (or  time
periods) with  a corresponding sign-off  form indicating the data
have been subjected to the data validation process; and
    11.   The procedures for tracking the causes of invalid data.
     A  periodic reporting  procedure  must  be  used  to summarize
the  performance of  the  data  validation process  with regard to
this plan.   Finally,  the  validation process can be made most ef-
fective  by adjusting, if necessary,  the limits (or criteria) in
order to minimize the number or percentage of incorrect inferences
with regard to possible data anomalies.  Each investigation
costs  time and money, so the  incorrect  inferences must be mini-
mized.   A  good documentation  system  should enable the validator
to  solve  this problem.3   An effective reporting  and feedback
system  will  also contribute to  the motivation  of the personnel
involved with  the process and to timely  corrective action.
4.4  REFERENCES
     1.   U.S.  Department  of  Commerce.   Computer   Science  and
          Technology:   Performance Assurance  and  Data Integrity
          Practices.   National Bureau of Standards,  Washington,
          D.C., January 1978.
     2.   U.S. Environmental Protection Agency.  Quality Assurance
          Handbook.  Vol. I.  Principles.  EPA-600/9-76-005, 1976.
     3.    Smith,  F.   Guideline for the  Development  and Implemen-
           tation  of  a Quality Cost System  for  Air Pollution Mea-
           surement   Programs.    Research   Triangle   Institute.
           RTI/1507/01F, November 1979.
           5.0  HYPOTHETICAL EXAMPLES AND CASE STUDIES
     Two hypothetical  examples and  three case  studies are  de-
scribed in  this  section.   The two examples are  very simple ones
involving ambient air monitoring and source testing and are given
first in Sections  5.1  and 5.2.  The three case studies are given
last  and  in  increasing  order  of  sophistication  (Sections  5.3,
5.4, and 5.5).  Few simple examples of the kind described in Sections
5.1 and 5.2 are documented in the quality assurance or pollution
control literature, probably because the individuals involved do not
feel their programs reveal any innovative procedures.  Hence these
examples are hypothetical, though intended to be realistic
representations of the needs of a local agency performing ambient air
monitoring and source tests.

5.1  HYPOTHETICAL EXAMPLE FOR AMBIENT AIR MONITORING
     Suppose  that  a  local  agency  has  four hi-vol  monitoring
stations for TSP, operating every sixth day,  and two stations for
ozone, operating 5 months per year.  At one station, temperature,
rainfall,  and wind data  are  being obtained.   Two  cases will be
considered:    one without and  one  with computerized methods.   In
each case the preferred procedures will be selected for the data
validation process.
5.1.1   Without  Computerized Support -  Several  very  simple  and
useful manual procedures which can be  used to validate the data
will be suggested for each pollutant/parameter.
     TSP - There  are approximately  60  measurements  per year at
each  site.   If   all  four  sites have different monitoring sched-
ules, the data  from  each site should be validated independently.
A  gross  limit check should  be established for  TSP, perhaps the
air quality  standard value if this value has  a small probability
of  occurrence,  say less than 5%.   A gross limit check value can
be  obtained  by  analyzing the historical data  for the most recent
year and determining a  single  value which has a small likelihood
of  exceedance.    Gross  limit  checks  are  discussed in  Section
3.3.1.  In some cases, multiple limits or a pattern test should be
used, depending on the results of the following analysis for each of
the four stations:
     1.   Analyze 1 year of data (about 60 values) and examine the
data for a seasonal/day-of-week pattern (e.g., four seasons, weekday
versus weekend day, eight categories);
     2.   If there are no significant patterns, then use a single
gross limit check; otherwise, determine a limit for the indicated
pattern, say weekend versus weekday values.
     Typical TSP data can be adequately approximated by the lognormal
distribution; hence it is usually necessary to make the lognormal
transformation of the data and then to statistically
analyze  the transformed  data  (including  the  analysis  described
above)  and ultimately  to  transform back (if  desired)  to the
actual data values  to obtain the appropriate limits.  The limits
determined  in this  way  are usually larger but more statistically
accurate than if no transformation were used.
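     A minimal sketch of this calculation is given below; the year of
TSP data is hypothetical, and the use of the 95th percentile
(z = 1.645) as the gross limit is only an example.

     # Illustrative sketch: derive a TSP gross limit as an upper
     # percentile of a lognormal distribution fitted to (hypothetical)
     # historical data.
     import math

     def lognormal_limit(tsp_values, z=1.645):
         """Upper percentile of a lognormal fit; z = 1.645 corresponds
         roughly to the 95th percentile."""
         logs = [math.log(v) for v in tsp_values]
         n = len(logs)
         mean = sum(logs) / n
         sdev = math.sqrt(sum((x - mean) ** 2 for x in logs) / (n - 1))
         return math.exp(mean + z * sdev)      # back-transformed, ug/m3

     # One year of hypothetical every-sixth-day TSP values (ug/m3)
     historical_tsp = [42, 55, 61, 70, 38, 95, 120, 66, 58, 49, 88, 104,
                       51, 47, 73, 62, 39, 84, 57, 68]
     print(round(lognormal_limit(historical_tsp), 1))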
     A second procedure, which is usually beneficial in detecting
changes in the data from one or more of the four hi-vol stations,
is to develop a quality control chart for the following:
     1.   The average and range values for each set of five
consecutive hi-vol readings (about 1 month of data) at each of the
four stations, and
     2.   The  average  and  range  for the  four  station averages.
The  four charts  developed in step  1  monitor  the individual sta-
tions for suspicious changes in either the  average  or range.  The
last  chart monitors  the   group  of four  stations  for suspicious
changes  in the overall average  and for  the  variation among the
four  station  averages.   To  develop  the  control limits for  these
charts,  use the  data for 1 year  (60 values,  or 12 sets of five
each)  and  the methodology described in Appendix  H,  Volume I of
the Quality Assurance Handbook and in Appendix C of this report.1
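     A minimal sketch of the control limit calculation is given below,
using the standard control chart factors for subgroups of five; the
full methodology is that of the appendixes cited above, and the
readings are hypothetical.

     # Illustrative sketch of X-bar/R control limits for subgroups of
     # five TSP readings, using the standard factors for n = 5.
     A2, D3, D4 = 0.577, 0.0, 2.114     # control chart factors for n = 5

     def xbar_r_limits(subgroups):
         xbars = [sum(g) / len(g) for g in subgroups]
         ranges = [max(g) - min(g) for g in subgroups]
         xbarbar = sum(xbars) / len(xbars)          # grand average
         rbar = sum(ranges) / len(ranges)           # average range
         return {"xbar_UCL": xbarbar + A2 * rbar,
                 "xbar_LCL": xbarbar - A2 * rbar,
                 "R_UCL": D4 * rbar,
                 "R_LCL": D3 * rbar}

     # Twelve subgroups of five consecutive hi-vol readings (hypothetical,
     # about one year of every-sixth-day TSP data, ug/m3).
     subgroups = [[52, 61, 47, 58, 66], [49, 72, 55, 60, 41],
                  [63, 58, 70, 44, 51], [57, 48, 62, 53, 69],
                  [45, 59, 61, 50, 64], [68, 54, 46, 60, 57],
                  [50, 65, 58, 43, 62], [61, 47, 55, 66, 52],
                  [59, 63, 48, 57, 70], [44, 58, 62, 51, 67],
                  [56, 49, 64, 60, 53], [47, 61, 58, 66, 54]]
     print(xbar_r_limits(subgroups))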
     Finally, a year  of  data can be plotted in a few hours since
there  are only  240  measurements per  year  (60  values  at  four
sites).   The  data  for all four hi-vols should be  plotted on the
same graph or on  graphs  placed one below the other to facilitate
comparisons.
     03 - The two  03  sites are assumed to be  in  operation for 5
months  of the year  and  to  record  hourly  average values.   For
these data, the preferred data validation procedures would be the
gross limit and pattern checks of Section 3.3.  Limits such as those
in Table 3-2 of Section 3.3 should be derived from histor-
ical data  for at  least 3 years if available.   The data should be
statistically analyzed for day-night patterns and perhaps for two
groupings within the  5 months—that is,  the 3 months with higher
03  levels and 2  months  with lower  03  levels.    Each of the  5
months  also  could  be  treated separately in the analysis and then
grouped as the data and the statistical analysis indicate.
     The  Shewhart  quality control chart  may  be used  to monitor
both stations.   The historical data would need to be  grouped in
the  same  manner  as they  are to  be  used  in the  quality control
chart  and then  analyzed for  averages  and  ranges  to  obtain the
quality control limits.   Note:   The data  do not have to be plot-
ted on charts to apply this procedure; however, the visible chart
may  be beneficial  in  detecting  trends/patterns in the data that
would not be evident on review of a data record.
     Meteorological Data - Assume  that hourly average  tempera-
ture,  rainfall,  windspeed,  and  wind direction are available for
the data validation.
     The following pattern tests should be used, with modifications
as necessary to be consistent with the meteorological conditions in
the agency's area (an illustrative sketch follows the list):
     1.   The minimum hourly average should  occur between 04-09
hours,  and the maximum between 12-17 hours.
     2.   Hourly changes  should  not exceed  6°C (10°F).  If a de-
crease  of 6°C (10°F)  or  more occurs, the  wind  direction should
have changed to  a more  northerly direction,   and/or  rainfall  of
0.15 in. or more should have occurred.
     3.   Hourly values should not  fall  outside of the specified
maximum and  minimum values for  each  month.   These  limit  values
need to be determined from historical  data.
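     A minimal sketch of the three temperature checks, with
hypothetical limits and inputs, follows:

     # Illustrative sketch of the three temperature pattern tests listed
     # above.  temps is a list of 24 hourly average temperatures (deg C);
     # the monthly limits, wind shift indicators, and rainfall amounts
     # are hypothetical placeholders.

     def temperature_pattern_checks(temps, monthly_min, monthly_max,
                                    wind_shift_north, rainfall_in):
         flags = []
         if not 4 <= temps.index(min(temps)) <= 9:
             flags.append("minimum outside 04-09 hours")
         if not 12 <= temps.index(max(temps)) <= 17:
             flags.append("maximum outside 12-17 hours")
         for h in range(1, 24):
             change = temps[h] - temps[h - 1]
             if abs(change) > 6.0:
                 flags.append("hourly change exceeds 6 C at hour %d" % h)
             if change <= -6.0 and not (wind_shift_north[h]
                                        or rainfall_in[h] >= 0.15):
                 flags.append("unexplained temperature drop at hour %d" % h)
         for h, t in enumerate(temps):
             if not monthly_min <= t <= monthly_max:
                 flags.append("hour %d outside monthly limits" % h)
         return flags

     temps = [8, 7, 6, 5, 4, 4, 5, 7, 9, 12, 15, 17, 19, 20, 21, 20, 18,
              16, 13, 11, 10, 9, 9, 8]
     print(temperature_pattern_checks(temps, monthly_min=-5, monthly_max=35,
                                      wind_shift_north=[False] * 24,
                                      rainfall_in=[0.0] * 24))   # []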
5.1.2  With Computerized Support -
     TSP - The  routine manual  procedures suggested in  Section
5.1.1  may  be computerized if desired.   No additional  procedures
are  recommended  because the data are not voluminous and because
the recommended procedures can be performed manually.
     03 - In the case  of  03  the volume of the data makes the use
of a computer  worthwhile.   For  example,  a plot of the O3 data by
hour can reveal  any unusual  values  and/or changes in consecutive
values from the typical diurnal and seasonal  patterns.
     In  addition,  a  check of  the  extreme values  for  5  months
using  the  gap  or  "Johnson" p  test  (Sections  3.2.4  and  3.2.5)
could  be  beneficial.   Particularly in  the  Johnson p  test,  the
distribution of  the  upper portion of  the data is approximated in
the  process  of checking the  extreme values.   Hence the identifi-
cation  of  a possible  data anomaly  may  actually  be of secondary
interest in  the  use of the Johnson p test.   The computations can
be easily  performed on a programmable  calculator  or  a computer
given  the  data  values  and procedure(s)  in  Section  3.2.5  and
Appendix B.
     Meteorological Data  - The  availability  of a computer should
not  materially affect  the  selection of the procedure; however, it
may  result  in  the  use of  a computerized method of performing the
routine manual analyses.
5.2  HYPOTHETICAL EXAMPLE  FOR SOURCE TESTING
     Suppose that  a Method 6 source test is performed  at a util-
ity  plant with a flue  gas desulfurization  (FGD)  system.  One of
the  obvious  data validation checks  to perform  is a  report review
using  gross  limit   and/or  parameter  relationship  checks.   Some
data checks  to be performed are:
     1.   Barometric pressure - At sea level, the value should be
approximately 760 mm (30 in.) Hg and between 736 and 787 mm (29 and
31 in.) Hg; at other elevations, the value should decrease about
2.8 cm (1.1 in.) Hg per 305 m (1000 ft) above sea level (see the
sketch following this list).

     2.   Moisture data - Nomographs provide  moisture content at
saturation as a function of stack absolute pressure and stack gas
temperature.   If  the  reported value  is  higher than  the maximum
that was read from the nomograph, the data are suspect.
     3.   Volumetric flow rate data - These data are difficult to
cross-check  accurately,  but gross limits  can be  determined from
engineering  considerations.2   In  addition,  the  data  should  be
compared with data obtained under similar process conditions from
previous tests on the same source,  similar sources, or tests per-
formed at the inlet to the control device.
     4.   Emission results  -  These  are  the  most  difficult but
also  the most important  data   to  validate.    A   sulfur  balance
should yield  a good check of the S02 emission results.  However,
there is considerable variability in the sulfur in coal (relative
standard deviation of 15% is not unusual).  The limited amount of
data for a  series of three runs does not make the use of a Dixon
ratio or Grubbs test practical.2  That is, applying the Dixon test
could lead to eliminating one extreme value from a group of three
when the variation actually results from process changes of the
order of two to one, caused by factors beyond the control of the
plant.  Hence, decisions concerning the validity of the data must
also be based on other experience with this or similar plants.
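     A minimal sketch of the barometric pressure check of item 1, with
hypothetical readings, follows:

     # Illustrative sketch of the barometric pressure check in item 1
     # above:  the sea level acceptance band of 736 to 787 mm Hg is
     # shifted downward by about 28 mm Hg per 305 m of site elevation.
     # The example readings are hypothetical.

     def barometric_pressure_ok(reported_mm_hg, elevation_m):
         drop = 28.0 * (elevation_m / 305.0)     # mm Hg decrease
         return (736.0 - drop) <= reported_mm_hg <= (787.0 - drop)

     print(barometric_pressure_ok(755.0, elevation_m=0))      # True
     print(barometric_pressure_ok(755.0, elevation_m=1000))   # False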
5.3  CASE STUDY OF A MANUAL DATA VALIDATION SYSTEM
     This  section  describes  an automated monitoring  and  data
acquisition  system  for which the validation  procedures are per-
formed  manually.   The   monitoring   system  includes  26  sites
which monitor 124 environmental parameters.   In  the  case of one
parameter,  SO2, a comprehensive  formal quality assurance program
has  been developed.   In the case  of the other  parameters,  no
formal  quality  assurance  procedures  have been   documented.   A
brief description of  the  validation  procedures developed for the
parameters  without  the quality  assurance programs will  be fol-
lowed by a  more detailed description of the validation procedure
for  S02.  The case  study illustrates the complementary nature of
the quality  control  and the data validation procedures  within a
monitoring system, and  it  shows  clearly that,  as more comprehen-
sive quality  control  procedures are applied during  data genera-
tion, the need for in-depth data validation procedures can be
limited to less frequent review or to spot checks of the items
listed under QC below.  The network to which this case refers is not
identified for proprietary reasons.
5.3.1  Quality Control
     The major  QC procedures for the instruments  other  than the
continuous  S02  monitors  are:    daily  zero-span checks,  6-month
multipoint  calibration,  and 6-month  scheduled  preventive main-
tenance.  Instruments  are  rotated in  the field.  Each instrument
is brought to the central laboratory for maintenance and calibra-
tion every  6  months  and is replaced in  the  field  by another in-
strument (a better practice would be to calibrate the instruments
in  the  field).   Instruments are checked on site  by technicians
two or  three  times  a week on the average.   Data are transmitted
automatically to  a central computer.   Strip charts are generated
as backup to the computer data base.
     A  total  quality assurance  program  was  designed for the S02
monitors in this  network.   Field level QC procedures include the
following:
     1.   Operator checklists -  Upon entering  a monitoring shel-
ter, the field  technician  completes a checklist of key parameter
observations:   pressure  gauge  readings,  critical  temperature
readings, weather or site-specific conditions,  and so forth, that
may  significantly influence  SO2 data  quality.   Checklists  are
designed to instruct the operator to take specific action under
specific parameter conditions; for example, if a parameter is out
of tolerance, the action may include phone consultation with the
chief technician.  Checklists are forwarded to the laboratory
weekly.
     2.   Zero-span control charts  - The field  operator checks
the S02  strip charts and manually plots  the zero-span results  for
each day on  a  control chart.    Predetermined  control limits  are
established for each zero-span control chart.  Control charts are
forwarded  to  the  central  laboratory  for  review  and  approval
monthly.  If an  out-of-limits  condition occurs during the month,
the field technician consults by phone with the chief technician.
     3.   Maintenance reports  -  Formal  reports   of unscheduled
maintenance performed in  the  field are completed as required and
forwarded weekly to the chief technician for review.  This proce-
dure facilitates  identification of chronic  instrument problems.
     4.   Strip chart review  -  The  field  operator  looks  for
unusual traces  on the strip  chart that  indicate  equipment mal-
functions.   If  malfunctions are  suspected,  the  operator makes a
note  on  the  strip  chart and  flags  the  problem  for the  chief
technician.   Strip charts are forwarded to the central laboratory
for review approximately monthly.
     5.   Calibration and preventive maintenance   -   Instruments
are  calibrated  and  maintained on  a  rotating 6-month schedule.
(These schedules were selected prior  to recent quality assurance
standards which recommend a quarterly schedule.)
     6.   Laboratory strip chart scan - The quality assurance co-
ordinator scans  S02 strip charts upon  receipt from  the  field to
identify uncharacteristic traces that may have been missed by the
field technician.  For time periods when the field operator, the
chief technician, or the quality assurance coordinator knows that
the data are unacceptable, a form is completed to inform a data
validator to delete the data from the data base.  A data delete form
is prepared only when there are backup data or information (e.g.,
zero/span data, a remark in the daily log, etc.) to justify the
action.
     7.   Quality assurance filing system  -  All maintenance  and
calibration records, zero-span  control  charts,  field checklists,
strip charts, and  data delete  or change request forms are marked
with the site and instrument identification.  These documents are
filed permanently by site, instrument, and date to provide future
backup for answering data validation questions.
     8.   Data review and audit - The quality assurance coordinator
performs an annual audit of the files and traces the occurrence of
events from causes through subsequent corrective actions.  Also, the
laboratory supervisor reviews the data delete forms on a monthly
basis to suggest corrective actions.
     In addition to the QC procedures above, the central air quality
monitoring laboratory has a teletype capability to poll any
monitoring site for specific parameters.
5.3.2  Data Validation
     The actual data  validation  in this  system is performed by a
contractor  not  associated with  the  routine  operation  of  the
system.   The validation  contractor's responsibilities  include:
     1.   Strip chart review,
     2.   Interparameter relationship review,
     3.   Recapture of  valid data  lost during data transmission,
     4.   Confirmation  of excursions  above  the  ambient air qual-
ity standards,  and
     5.   Deletion  from the data  file of  all  data invalidated
through the QC  or validation procedures.
The following two exceptions apply:
     1.   Strip chart review  is not performed  since the  QC pro-
cedures  for S02  specify that this  technique be  applied  at the
field  level.   Although this is  the  only  real point of departure
between validation of S02 data and validation of the other param-
eters,  it  is  significant because an in-depth, after-the-fact re-
view  of strip charts is time consuming and  expensive for a net-
work of this  size.   For example,  after-the-fact chart review for
S02 would consume, on the average, 1 to 2 hours of technician time
per site-parameter-month combination, depending on the kinds
of problems  encountered and the volume of communication required
between the data validator  and the monitoring laboratory.
     2.    Interparameter relationship  review is applied to mete-
orological  data only,  except  in the case of  excursions above the
ambient air quality standards.
5.3.2.1   Strip Chart Review - Strip charts  are scanned manually
to  identify questionable data periods.   The purpose of the scan
is to identify:
     1.   Unusual spikes,
     2.   Invalid data periods identified  by field technicians,
and
     3.   Uncharacteristic traces, for example, a relatively flat
or constant windspeed trace over several hours.
5.3.2.2   Interparameter Relationship Review - As  indicated pre-
viously,  the  interparameter relationship review is performed for
meteorological  parameters  only.   Hourly  averages  are  reviewed
from  formatted computer reports that display data for each hour
of each month.  One review compares dewpoint  and ambient tempera-
tures  and  then flags  hourly dewpoint values  that  exceed  the
hourly  temperature.   Daily  average dewpoint  values  that exceed
the daily average temperature are also  indicated.  Another review
considers  windspeeds  measured  at  two   tower levels.   Data  are
flagged whenever the average windspeed  at the  lower level exceeds
that at the  upper  level.  Procedures like these must be based on
judgments by  an experienced  meteorologist.   Validation instruc-
tions in  many cases direct the validator  to backup strip charts
for questionable  time periods identified  during the interparam-
eter relationship review.
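     A minimal sketch of these two relational checks, with
hypothetical hourly records, follows:

     # Illustrative sketch of the two relational checks described above,
     # applied to hypothetical hourly records.

     def interparameter_flags(hours):
         """hours: list of dicts with keys 'temp', 'dewpoint',
         'ws_lower', and 'ws_upper'."""
         flags = []
         for h, rec in enumerate(hours):
             if rec["dewpoint"] > rec["temp"]:
                 flags.append((h, "dewpoint exceeds temperature"))
             if rec["ws_lower"] > rec["ws_upper"]:
                 flags.append((h, "lower-level windspeed exceeds upper"))
         return flags

     hours = [{"temp": 21.0, "dewpoint": 14.5, "ws_lower": 3.2, "ws_upper": 4.1},
              {"temp": 20.5, "dewpoint": 22.0, "ws_lower": 5.0, "ws_upper": 4.4}]
     print(interparameter_flags(hours))          # both checks flag hour 1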
5.3.2.3   Recapture of Valid Data Lost During Data Transmission -
The data  validation contractor receives a  magnetic tape of data
collected through  the data transmission system  at the same time
the strip charts, data change requests, and data listings needed
for validation are  received.   The  objective of  the  monitoring
system  is to  reach a  level  of at  least 90  percent  valid data
collection.    The  data  validator  reviews  the data  listings  for
time periods  when data capture  is  below 90 percent.  The valida-
tor then  checks  the  strip  charts  for  those time periods  and
reduces data  for specific  hours  that appear to be valid but were
not picked  up by the data  transmission system.   The data values
are picked  up from the  strip charts through an electronic data
digitizer and then  used to update  the  data  file  on the magnetic
tape.
5.3.2.4  Confirmation of Excursions Above the Ambient Air Quality
Standards - The data validator receives a listing showing specific
hour, 3-hour, or  24-hour  values  that constitute excursions above
the ambient air quality standards.  The strip charts are examined
for  these  specific time periods  to  verify the  reported values.
If the reported  data differ from the strip chart  values and the
strip chart  values are valid, then  the data are  changed in the
file.  The  standards  report  is  annotated and  returned  to  the
client's  quality assurance coordinator for filing.
     The  confirmation  procedure  requires the data validators to
check meteorological  data records and strip charts  for the time
periods when  standards  excursions were  reported.  Unusual condi-
tions  or  invalid  meteorological  data are  also recorded on the
standards report.
5.3.2.5  Deletion of Invalid Data - The final step in the valida-
tion procedure  is the  update  of the data tape.   Data picked up
during data  validation,  as  explained in  Section  5.3.2.3,  are
merged into the file.  During the update procedure, data  found to
be invalid by either the  data validators, by the field operators
through the  QC procedures,  or by the quality assurance coordina-
tor  (in the case of S02) are deleted from the file.  A listing of
each value added to or deleted from the permanent file is pro-
duced for filing with each monitoring station's records.
5.4  CASE STUDY OF THE CHAMP AUTOMATED DATA VALIDATION SYSTEM
     The U.S.  EPA and  the  Rockwell  International  Science Center
designed and  implemented an automated data acquisition system for
the  Community Health Air Monitoring Program (CHAMP).   CHAMP was
implemented  to  provide reliable  air quality  data  to support the
Community  Health  Environmental   Surveillance  System  (CHESS),   a
program  to  study the  effects  of  air  pollutants  on community
health.
     The  CHAMP  network  included  23  monitoring   sites  in  six
cities.   The  network  provided   data for  the   19 environmental
parameters  listed  in Table  5-1.   Monitors  at  each  site were
polled by minicomputers at the sites, and the central  computer at
Research Triangle Park (RTP), North Carolina, polled the sites
every 2 hours.  These data were used to review system status at
each remote monitoring site and to provide feedback for field
personnel to perform maintenance as needed.

              TABLE 5-1.  CHAMP ENVIRONMENTAL PARAMETERS

     Computer code    Parameter
     NOx              nitrogen oxides
     NO               nitric oxide
     NO2              nitrogen dioxide (calculated)
     SNO              sampled nitrogen oxide
     O3               ozone
     SO2              sulfur dioxide
     CH4              methane
     THC              total hydrocarbons
     NMHC             nonmethane hydrocarbons
     WS               wind speed
     VWM              vector wind magnitude
     VWD              vector wind direction
     TOUT             temperature outside
     TIN              temperature inside
     DEWP             dewpoint
     BP               barometric pressure
     HVFL             hi-vol flow
     RSPI             respirable suspended particulate flow rate, inlet
     RSPF             respirable suspended particulate flow rate, final

     Magnetic tapes  were generated at  each site, and  the  tapes
were forwarded to RTP every 2 weeks.  The data validation
procedures were  applied to the  data  on tape rather  than to the
data  transmitted via telephone  lines  to the central facility be-
cause the data on the tapes were generally more complete and more
reliable than the telemetry data.
      Strip charts for gaseous pollutants were generated at remote
sites.   The  strip  charts  provided  a  backup, and  data that were
lost  due to  computer  failure  were  recaptured  from  the  strip
charts.   The  strip  charts also provided  a visual QC  check for
station operators.
5.4.1  Quality Control Functions
      Several  automated QC  checks  were  performed  by  the on-site
computers as  the data  were generated.  Manual QC procedures were
used  also.
      The automated QC procedures were based on limits set for the
analog  values  of  the  critical operating  parameters.   Critical
limits  were  entered into  the computer;  if these limits were ex-
ceeded,   then the  associated 5-minute  averages  were  flagged as
possibly invalid.  Direct  environmental measurements, referred to
as  "primary  channels," were  checked  for nonzero voltage, normal
setting  of all  valves,  power  to  the  instrument,  and digitally
measured  tolerances  for  proper  ambient   sampling.    Instrument
operations  (e.g.,  hydrogen flow in  an  S02 monitor),  referred to
as  "secondary channels,"  were  also  tested to verify that values
were  within  tolerances;  and status conditions necessary  for  cor-
rect  sampling were machine checked every  five minutes.   The  same
control  procedure  was  applied  to  the   calibration  operation to
verify  proper flows of  the calibration gases  and correct valve
settings on the flow system.  Figure 5-1 illustrates the QC
procedure.
          [Figure 5-1.  Automated quality control tests.  Level 1 data
           are converted from volts to engineering units using the
           calibration constants; flow rates and valve settings are
           checked (reject if out of tolerance or not set correctly);
           if a zero and span has not been performed within the last
           24 hours, the data are flagged as possibly invalid;
           otherwise the data are accepted as Level II data.]
     During calibration and instrument  tests,  the sampling indi-
cator was  in  the off  position.   During sampling,  the  indicator
was in the on position.  During QC checking,  the status  bits were
read to determine the on or off condition.
     Error  conditions  flagged by  the  automated system  were  re-
viewed daily  at the central  computer  facility.  As  a  result of
the QC  review of data  received,  either personnel  could be dis-
patched  to  remote   sites  for  instrument adjustments   or  field
operators  could be  advised  of needed  changes.  This  real-time
turnaround  afforded by  automated  QC  checks  was  important  for
maintaining  continuous  operation  throughout  the  system.   It
allowed needed  adjustments to be made quickly  so  that  a minimum
quantity of data was lost or invalidated.
     Manual  QC  checks  included   visual  inspection  of  control
charts  for unusual  values or large  changes  in  reported  data
values.   If the  field operator noted  unusual  events at  a site
which might affect  subsequent data  validation  or  review,  he or
she was  instructed  to enter  explanatory  entries  into  the data
file.
     Manual QC  calibration audits  were performed  on the entire
system  quarterly.   Zero-span  calibration  was  performed every 3
days.    A  tickler file was maintained  at  each site  to  indicate
when preventive maintenance was required.
5.4.2  Data Validation
     The  data  validation procedures were performed at  the CHAMP
central computer facility at RTP.  Computer listings of the data
that had been subjected to QC checks were reviewed.
     A  report  listing the hourly  average values  for  each of the
19 parameters  (primary channels)  measured  at the remote monitor-
ing stations  was inspected.  For  each  station, the time (hour),
the parameter  name,  the parameter  hourly average value, and the
number  of  valid 5-minute values used  to  compute  the  hourly
average were  listed as shown in Figure 5-2.  The data  shown for
ozone for  hours 13  through 15 illustrate the application of this
technique.  For hour 13,  twelve valid  5-minute values  were used
to compute  the  hourly value  of 0.1356.  For hour  14,  only four
valid 5-minute values were used in calculating the hourly average
of 0.1532.  The  "B"  indicates that data were missing and invalid
for the  remaining eight 5-minute  periods  during the  hour.   In-
valid data  would be the result of primary  or secondary channels
being  outside acceptable  limits  during  the QC  tests  on  site.
During hour  15,  only one  valid 5-minute average was  available.
The  "M"  indicates  that  the  remaining  eleven  5-minute  averages
were missing.
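     A minimal sketch of this hourly averaging and flagging scheme is
given below; it follows the spirit of the listing described above but
is not the actual CHAMP software, and the 5-minute values are
hypothetical.

     # Illustrative sketch (not the CHAMP software): compute an hourly
     # average from twelve 5-minute values, carrying the count of valid
     # values and a flag in the spirit of the "B" and "M" codes described
     # above.  None marks a missing 5-minute value; the string "invalid"
     # marks a value rejected by the on-site QC tests.

     def hourly_average(five_min):
         valid = [v for v in five_min if v is not None and v != "invalid"]
         n_missing = five_min.count(None)
         n_invalid = five_min.count("invalid")
         if n_invalid:
             flag = "B"      # some 5-minute values failed QC or were lost
         elif n_missing:
             flag = "M"      # some 5-minute values were missing
         else:
             flag = " "
         avg = sum(valid) / len(valid) if valid else None
         return avg, len(valid), flag

     hour14 = [0.151, 0.154, "invalid", "invalid", 0.152, "invalid",
               "invalid", "invalid", "invalid", "invalid", "invalid", 0.156]
     print(hourly_average(hour14))       # about (0.153, 4, 'B')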
     As  indicated  in Figure  5-3,  a  summary of  secondary param-
eters  associated with  each  of the  primary parameters in  the
previous listing  was produced.  The  summary shows  the  high  and
low critical limits for each parameter.   Note that in the case of
ozone, the  critical value range for the flow of ethylene (FETH)
was set at 20.000 to 30.000.
     A third  listing,  Figure 5-4, shows the reasons  for invalid
data  flagged  in  the hourly  data  listing  illustrated  in Figure
5-2.   Again using the ozone parameter,  the listing indicates that
FETH  was  outside of the critical limits during  hour  14.  Figure
5-5 shows the invalid  5-minute values  for ozone and the 5-minute
values of  each  of the associated  secondary parameters that were
determined to be  invalid  during the  QC tests.  No 5-minute value
for a primary parameter was  valid unless the following criteria
were met:
     1.   The values of all  the 5-minute averages for all of the
associated secondary parameters had to be available.
     2.   The 5-minute values  of  all  of the associated secondary
parameters had to be valid.
     Figure 5-6  lists  journal entries  for  the time  period corre-
sponding to  the  preceding data  listings.   The  journal entries
were  useful  in  arriving at decisions on the validity  of flagged
data.
     A data review,  illustrated in Figure  5-7 was produced for a
quick overview  of 5-minute average status  for  each data period.
Each hour in  this  data listing is represented by 12 status bits,
     [Figure 5-2.  Example CHAMP data validation report (partial
      printout):  hourly averages and counts of valid 5-minute values
      for station 0841, day 265, 1977.]
     [Figure 5-3.  High/low critical values for CHAMP secondary
      parameters (partial printout), station 0841, day 265, 1977.]
     [Figure 5-4.  CHAMP validation system, invalidity causes by hour
      (station 0841, day 265, 1977).]
     [Figure 5-5.  Five-minute values of invalid secondary parameters
      (station 0841, day 265, 1977).]
     [Figure 5-6.  Example CHAMP journal entries for data validation
      (station 0841, day 265, 1977).]
     [Figure 5-7.  Example CHAMP validation data review (station 0841,
      day 265, 1977).]
each representing a  5-minute  time  interval.   The 5-minute inter-
val flags are represented symbolically as follows:
          "0" - acceptable data
          "I" - invalid data
          "-" - missing data
Note that the  flags  for ozone data during hours 14 and 15 can be
traced  through the  previous  illustrations.   For  the cases  in
which  5-minute averages  were missing,   operator  logs  and  strip
charts were  reviewed to determine if data could be  captured;  in
cases where acceptable data were available,  the data were entered
into the data base.
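     A minimal sketch of how such a status string can be built from
twelve 5-minute values is given below; it is illustrative only and
not the CHAMP code.

     # Illustrative sketch: build the 12-character hourly status string
     # described above ("0" = acceptable, "I" = invalid, "-" = missing)
     # from twelve hypothetical 5-minute values.  None marks a missing
     # value; the string "invalid" marks a value rejected by QC.

     def hourly_status_bits(five_min):
         bits = []
         for v in five_min:
             if v is None:
                 bits.append("-")
             elif v == "invalid":
                 bits.append("I")
             else:
                 bits.append("0")
         return "".join(bits)

     hour15 = [0.157] + [None] * 11
     print(hourly_status_bits(hour15))   # 0-----------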
     Another feature of the CHAMP data validation routine was the
application  of a  graphic technique.   Figure 5-8 shows  a  plot of
ozone as a  function  of time  for day 265, the same period used in
the  preceding  illustrations.   The valid 5-minute  averages  are
represented  as O's;  invalid  5-minute   averages  are entered  as
"I's".   (The  stacking  of 5-minute  values is merely a function of
printer  limitations).    Better  resolution was  obtained  in some
cases  by  using  a   continuous  line  plotter.   The  illustration
shows, however,  that a  useful  graphic  technique can be  applied
(relatively  inexpensively)  that flows  well  with other data re-
ports produced  for a specific time period.
     The graphic display technique was useful in two ways:  first,
for quickly spotting extreme data values that warranted further
investigation and, second, for tracking hourly pollutant trends
daily.   In the case  of ozone, the trend  is  clear  in Figure 5-8.
     Figure  5-9 is  a  complement  to  the  ozone  pattern,  and it
confirms that NOx trends during the same day behaved in the way
that  would  be  expected  from  the   photochemical  relationships
involved.
     After the  validation procedure was completed for CHAMP data,
the  data were approved for incorporation into the reporting data
base.  The automated and manual QC and data validation  procedures
in  the  CHAMP  system  have lent  significant credibility  to the
final reported  data.
     [Figure 5-8.  Plot of 5-minute average ozone concentration versus
      hour of day for day 265 (valid values plotted as O's, invalid
      values as I's).]
     [Figure 5-9.  Plot of 5-minute average NOx concentration versus
      hour of day for day 265.]
5.5  CASE STUDY OF A REGIONAL AIR MONITORING SYSTEM (RAMS) DATA
     VALIDATION
     The U.S.  EPA  operated  the Regional  Air Monitoring System
(RAMS) in  and around the  St.  Louis metropolitan  area from 1975
through mid-1977.  The system was part of the Regional Air Pollu-
tion  Study  (RAPS),  undertaken  to collect  aerometric  data  for
urban  and  rural  scale  dispersion  model development  and evalua-
tion.  The RAMS  included  25  monitoring sites at which the param-
eters  shown  in Table 5-2 were  monitored.   Data  were transmitted
to  a  central  computer facility from  each  monitoring  site  via
leased lines.  Figure  5-10  shows the relative locations of moni-
toring stations  in  the  RAMS  network.   Figure 5-11 shows the flow
of  data  within each station.   In  the  RAMS system, minicomputers
served  QC  and  data handling functions  similar  to those in  the
CHAMP  network.   Unlike the CHAMP  data  base,  which was developed
with backup  data  tapes  from each station,  the RAMS data base was
produced from  data  transmitted  directly to the central computer.
The  station  tapes in RAMS served as backup in instances when the
telemetry system  or central computer malfunctioned.  Station data
buffers were polled by the central computer every minute.
5.5.1  Quality Control
     Several  automated  QC checks  (based on the  operating status
of  each  instrument) were  incorporated  in  RAMS  to prescreen  the
data.  The automated QC checks  included:
     o    System  status checks,
     o    Analog  checks, and
     o    Zero/span checks.
The  system  status  checks  included approximately  35  electrical
checks  of  critical parameters (e.g.,  flame-out, valve on-off
status) to determine the capability of each instrument to produce
valid  data.   When  status  checks indicated malfunctions,  a flag
was  appended to  the value of the associated environmental param-
eter to indicate  that the 1-minute value was invalid.a
aSystem programmers developing programs to summarize data must be
 aware of the extent to which flags are used.  In RAMS, the sample
 value was multiplied by a large negative power of 10 when a status
 condition was invalid.  Problems arose when data analysis programs
 recognized such values as zero.
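     A brief sketch makes the footnote concrete.  The Python fragment
 below is illustrative only (it is not the RAMS software); the explicit
 flag array and the 45-minute completeness rule are assumptions intro-
 duced here to show one way a summary program can honor status flags
 instead of treating flagged values as zeros.

    def hourly_average(minute_values, minute_flags, min_valid=45):
        """Average the one-minute values whose status flag marks them valid.

        minute_values : one reading per minute of the hour
        minute_flags  : parallel list of booleans, True = instrument status OK
        min_valid     : assumed completeness criterion (45 of 60 minutes)
        """
        valid = [v for v, ok in zip(minute_values, minute_flags) if ok]
        if len(valid) < min_valid:
            return None          # report the hourly average as missing
        return sum(valid) / len(valid)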
                             5-23

-------
          TABLE 5-2.  PARAMETERS MONITORED IN THE REGIONAL
                        AIR MONITORING SYSTEM

                                         Measurement       Number of
                 Parameter               interval (min)    stations

 Air quality:     Sulfur dioxide              3.75             13
                  Total sulfur                1                12
                  Hydrogen sulfide            3.75             13
                  Ozone                       1                25
                  Nitric oxide                1                25
                  Oxides of nitrogen          1                25
                  Nitrogen dioxide            1                25
                  Carbon monoxide             5                25
                  Methane                     5                25
                  Total hydrocarbons          5                25

 Meteorological:  Windspeed                   1                25
                  Wind direction              1                25
                  Temperature                 1                25
                  Temperature gradient        1                 7
                  Pressure                    1                 7
                  Dewpoint                    1                25
                  Aerosol scatter             1                25

 Solar radiation: Pyranometer                 1                 6
                  Pyrheliometer               1                 4
                  Pyrgeometer                 1                 4
                    5-24

-------
        [Figure 5-10.  Location of RAMS stations.  (Map of the Regional Air
        Monitoring Station (RAMS) network with station numbers and a 10- and
        20-km scale; graphic not reproducible.)]
                            5-25

-------
        [Figure 5-11.  RAMS data flow:  RAMS station.  (Block diagram:
        meteorological, solar radiation, and air quality instruments sampled
        at 0.5-s intervals; 1-min arithmetic averages; multiplexer and A/D
        converter; daily calibration; instrument and system status; station
        log; and a telecommunications link to the central facility.  Graphic
        not reproducible.)]

-------
     Analog checks determined the status of several key parameter
conditions such as permeation tube bath temperature and reference
voltages.   If  acceptable  limits  for  these  parameters were  ex-
ceeded,  then  the  corresponding  environmental data  were  invali-
dated.   In this case,  a value of  1034  was  substituted  for  the
1-minute reading of the environmental parameter in the data file.
     Zero/span  checks  were  performed  daily, with  the zero/span
commands  coming  from  the  central  computer.  If  predetermined
instrument drift limits were exceeded,  the data from that instru-
ment  for the  previous  day  were  flagged  as  invalid,  and  a field
operator was notified.
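     The daily zero/span screen can be sketched as follows.  The function
is illustrative only; the drift limits are hypothetical parameters, since
the actual RAMS limits were instrument-specific and are not reproduced
here.

    def zero_span_ok(zero_reading, span_reading, span_target,
                     zero_limit, span_fraction_limit):
        """Return True if instrument drift is within limits; False means the
        previous day's data should be flagged invalid and a field operator
        notified."""
        zero_drift = abs(zero_reading)                   # departure from zero
        span_drift = abs(span_reading - span_target) / span_target
        return zero_drift <= zero_limit and span_drift <= span_fraction_limit

    # Example with hypothetical numbers for an SO2 analyzer:
    # zero_span_ok(0.003, 0.46, 0.50, zero_limit=0.01, span_fraction_limit=0.10)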
5.5.2  Data Validation
     Figure  5-12  shows  the  data  flow  within  the  RAMS  central
facility.  As  the  figure indicates,  RAMS validation included two
levels of data checks:
     o    Intrastation checks of minute data, and
     o    Interstation checks of hourly data.
     The  RAMS  minute values that were not  invalidated by the QC
checks mentioned previously were converted  to  engineering units
and  then subjected  to the  intrastation checks.    The automated
intrastation  checks  included  value  limit  checks,  relational
checks,  and 10-minute  time  continuity checks.   The  gross limit
checks were based  on the  operating ranges  of  each instrument.
Examples of typical limits are in Table 5-3.  Using the operating
ranges avoided the problem of setting statistical limits for each
parameter.
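     A gross limit check of this kind reduces to a simple range comparison.
The sketch below is illustrative only; the two entries shown are taken from
Table 5-3, and the remaining parameters would be added in the same way.

    # Operating-range (gross) limits, lower and upper, from Table 5-3.
    GROSS_LIMITS = {
        "ozone":     (0.0, 1.0),     # ppm
        "windspeed": (0.0, 22.2),    # m/s
    }

    def passes_gross_limit(parameter, value):
        lower, upper = GROSS_LIMITS[parameter]
        return lower <= value <= upper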
     For  some  parameters,  interparameter conditions  were set as
validation  criteria;  for  example,  total  sulfur  had  to  be less
than  the sum  of SO2  and  hydrogen  sulfide   (when  compared on a
sulfur basis  in ppm).   Instrument noise bands precluded strict
interpretation of some of the interparameter  checks.
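     The total sulfur condition quoted above can be written as a one-line
relational check.  The noise allowance shown is a hypothetical value added
to reflect the noise-band caveat; it is not a documented RAMS limit.

    def total_sulfur_consistent(total_s, so2, h2s, noise_band=0.01):
        # Condition from the text: total sulfur (sulfur basis, ppm) should not
        # exceed SO2 + H2S by more than an instrument noise allowance.
        return total_s <= so2 + h2s + noise_band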
     Ten-minute  continuity checks were used to  see if a variable
changed  over  a period  of  time.   It was  recognized early in the
RAMS  program  that a constant voltage  output from  a sensor indi-
cated mechanical or  electrical  failures in the sensor instrumen-
tation.   A  daily report was  generated to show  questionable time

                             5-27

-------
        [Figure 5-12.  RAMS data flow, central facility.  (Block diagram:
        data from the 25 sites enter a PDP 11/40 at the central facility;
        minute values are converted to engineering units; intrastation
        validation and constant-value checks are applied to minute data;
        hour averages are computed and subjected to interstation validation;
        a 24-hour calibration and drift summary is produced; minute data
        invalidated by interstation checks are flagged; and a RAMS system
        tape is created with calibrations, minute data, and hour averages.
        Graphic not reproducible.)]

-------
       TABLE 5-3.  TYPICAL GROSS LIMIT VALUES USED IN THE REGIONAL
             AIR MONITORING SYSTEM DATA VALIDATION PROCEDURE

                              Instrumental or natural limits
 Parameter                 Lower            Upper

 Ozone                     0 ppm            1 ppm
 Nitric oxide              0 ppm            5 ppm
 Oxides of nitrogen        0 ppm            5 ppm
 Carbon monoxide           0 ppm            100 ppm
 Methane                   0 ppm            25 ppm
 Total hydrocarbons        0 ppm            25 ppm
 Sulfur dioxide            0 ppm            1 ppm
 Total sulfur              0 ppm            1 ppm
 Hydrogen sulfide          0 ppm            1 ppm
 Aerosol scatter           0.000001 m-1     0.0040 m-1
 Windspeed                 0 m/s            22.2 m/s
 Wind direction            0°               360° (540° for some wind systems)
 Temperature               -20°C            45°C
 Dewpoint                  -30°C            45°C
 Temperature gradient      -5°C             5°C
 Barometric pressure       950 mb           1050 mb
 Pyranometers              -0.50            2.50 Langleys/min
 Pyrgeometers              0.30             0.75 Langleys/min
 Pyrheliometers            -0.50            2.50 Langleys/min

 Interparameter conditions:  NO x O3 < 0.04;  NOx - NO ...
-------
periods so that  field  personnel  could be dispatched to a site if
instrument malfunctions  were  suspected.   These  checks could not
be applied to  some  parameters;  for example,  barometric pressure,
since  it  can remain constant (to the number  of  digits recorded)
much longer than ten minutes.
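     A minimal sketch of the 10-minute continuity check follows; it simply
looks for a run of identical one-minute values.  The use of exact equality
is an assumption, and parameters such as barometric pressure would be
exempted as noted above.

    def stuck_sensor(minute_values, run_length=10):
        """Return True if the last `run_length` one-minute values are identical."""
        recent = minute_values[-run_length:]
        return len(recent) == run_length and len(set(recent)) == 1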
     The interstation checks were applied to hourly averages com-
piled  from the minute  data that passed both the on-site QC tests
and  the after-the-fact  intrastation  checks.   Interstation con-
tinuity checks were used to look for consistency in parameter be-
havior  throughout the  network  during  specific time periods.  The
interstation checks in RAMS were applied to meteorological param-
eters  only.   Initial  application to  pollutant parameters showed
that too  many false data  anomalies were  flagged.   The intersta-
tion checks  consisted  of  plotting  curves of  hourly  averages of
each parameter.   Several curves representing  data from multiple
sites  were plotted  on  the same page for ease of comparison among
stations.  The  mean and  difference of  values that  fell outside
the 90th percentile of the combined network distribution were re-
ported.  Data  curves  were reviewed over  periods of several days
to  see trend  relationships  to  the  rest  of  the  network.   When
problems were  detected through the review of the hourly average
curves, then plots  of  the minute data for the corresponding time
periods were produced.   Figure 5-13 is a continuous plot of hour-
ly average temperature data for several stations in the RAMS net-
work.  The hourly data were relatively uniform among the nine
stations represented; note, however, the small negative depression
in the circled portion of the curve for Station 103 on the thirty-
data for the same station  and time period, reveals that a problem
occurred during  most of  hour  14 of the day  in question and that
the data reported for that hour are invalid.
     This  example  shows  that,  although hourly  average data may
not have indicated  instrument  problems  over the long term, prob-
lems such  as temporary voltage  surges may indeed have occurred.
In  the example,  statistical  analysis  of hourly  average values
would  not have detected the problem.  In the short term this kind
                             5-30

-------
        [Figure 5-13.  Continuous plot of hourly average temperature data
        (temperature, °C, versus Julian day) for several RAMS stations, with
        station numbers identified; graphic not reproducible.]

-------
 [Figure 5-14.  Minute temperature data from Station 103, from 1200 to 2359
  hours, February 1, 1975 (temperature, °C, versus hour); graphic not
  reproducible.]

-------
of  information  may not  be  important  for hourly  or  daily  data
fluctuations, but it  can be  important  as  a  quality assurance
feedback mechanism.   Even though  the  QC checks were  extensive,
they were not sufficient  to  detect this problem.  The validation
procedure was a supplement to the QC procedures; it may be useful
both in improving those procedures and in identifying additional
instrument function problems.
     The use of data plotting techniques for hourly and minute
data, coupled with followup through the field operations loop, was
helpful in answering questions regarding data values flagged during
the intrastation validation tests.  A special data listing was pro-
duced to show the average, the standard deviation, and the number
of excursions beyond statistical limits.  The largest deviations
were flagged for followup.  This technique was extremely time con-
suming, since the followup was manual.
     Meteorological data  were  used in  RAMS to provide insight to
questionable data periods.   Weather  summaries  like the one  in
Figure 5-15 provided convenient references for determining condi-
tions that  might  have  significantly  altered  pollutant patterns
during specific periods of time.
     In retrospect,  two  specific  changes  in approach might have
improved the validation effort in RAMS:
     1.    Graphic techniques such as those described herein could
have  been  more useful  if they had been applied closer  to  real
time.  The  major  problem  is  that (as  mentioned previously)  the
graphical techniques  are  manpower intensive,  so  condensing  the
time frame (especially for a network approaching RAMS size)  could
markedly increase the validation cost.
     2.    Successive difference checks could have been applied to
minute  (or  hourly average)  data  to provide feedback  to the QC
system.   These  checks could  solve the problem  of  an instrument
malfunction  going undetected for  a long period of time.   Figure
5-16  illustrates  the kinds  of instrument-response problems  that
might be  detected using  successive  difference  checks  of minute
(or  hourly  average)  data.  Optimum limits need to be determined
for the successive difference checks by a statistical analysis of
                             5-33

-------
        [Figure 5-15.  Reference weather data used in RAMS data validation:
        Lambert Field, St. Louis weather summary for January 1977, with
        panels for weather, precipitation (in.), ceiling (ft), sky cover (%),
        visibility (mi), pressure (mb), temperature (deg C), relative
        humidity (%), wind direction (deg), and windspeed (m/s); graphic not
        reproducible.]
                              5-34

-------
historical data.  For some applications it will be necessary to
determine successive differences on a percentage basis (i.e., above
a specified level).3
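     A successive difference screen of the kind suggested above can be
sketched as follows.  The code is illustrative only; the absolute and
percentage limits are placeholder parameters that, as noted, would be set
from a statistical analysis of historical data.

    def successive_difference_flags(values, abs_limit, pct_limit=None, level=None):
        """Flag indices where the change from the preceding value exceeds the
        limit.  If pct_limit and level are given, values above `level` are
        screened on a percentage basis instead of an absolute basis."""
        flags = []
        for i in range(1, len(values)):
            diff = abs(values[i] - values[i - 1])
            if pct_limit is not None and level is not None and values[i - 1] > level:
                if diff / values[i - 1] > pct_limit:
                    flags.append(i)
            elif diff > abs_limit:
                flags.append(i)
        return flags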
      [Figure 5-16.  Examples of instrument responses that can be detected
       through minute successive differences.  Panels include a) single
       outlier (spike), b) step function, d) stuck value, e) missing data,
       f) calibration, and g) drift; graphic not reproducible.]
5.6  REFERENCES

     1.   U.S.  Environmental  Protection Agency.   Quality Assur-
          ance Handbook.   Vol.  I.   Principles.  EPA-600/9-76-005,
          1976.

     2.   Rhodes, R. C.,  and S.  Hochheiser.  Data Validation  Con-
          ference Proceedings.  Presented  by Office  of Research
          and Development, U.S. Environmental Protection Agency,
          Research  Triangle Park,  North Carolina,  EPA 600/9-79-
          042, September  1979.

     3.   Hartwell,   T.,  Use  of Successive  Time  Difference  and
          Dixon  Ratio  Test  for  Data Validation,  Data Validation
          Conference   Proceedings,   EPA-600/9-79-042,   September
          1979.
                                5-35

-------
                        6.0  BIBLIOGRAPHY
Barnett, V.,  and T. Lewis.  Outliers  in Statistical Data.   John
     Wiley and Sons, New York,  1978.

Curran, T.  C., W. F. Hunt, Jr., and R. B. Faoro.  Quality Control
     for Hourly Air Pollution Data.  Presented at the 31st Annual
     Technical  Conference of  the  American  Society  for  Quality
     Control, Philadelphia, May 16-18, 1977.

Data Validation Program  for  SAROAD,  Northrup  Services,  EST-TN-
     78-09,  December   1978,   (also  see  Program  Documentation
     Manual, EMSL).

Faoro,  R. B., T. C. Curran, and W. F. Hunt, Jr., "Automated
     Screening  of  Hourly Air  Quality  Data,"  Transactions of the
     American  Society  for  Quality  Control,  Chicago,  Ill.,  May
     1978.

Grant,   E.  L.,  and  R.  S. Leavenworth.   Statistical  Quality Con-
     trol.   McGraw-Hill Book Company, New York.

Grubbs,  F.  E.,  and G. Beck.   Extension  of Sample  Sizes and Per-
     centage  Points for  Significance  Test of  Outlying Observa-
     tions.  Technometrics, Vol. 14, No. 4, November 1972.

Hald, A.   Statistical  Theory with Engineering Applications.  New
     York,  1952.

Hartwell,  T.,  Use  of  Successive  Time  Difference  and Dixon Ratio
     Test  for Data Validation,  Data  Validation  Conference Pro-
     ceedings, EPA-600/9-79-042, September 1979.

Hawkins, D.  M.   The  Detection of  Errors  in  Multivariate Data
     Using Principal Components.  Journal of the American Statis-
     tical  Association, Vol.  69, No. 346.  1974.

Hunt, Jr.,  W.  F.,  J.  B. Clark, and S. K. Goranson,  "The Shewhart
     Control  Chart  Test:  A Recommended Procedure for Screening
     24-Hour  Air Pollution Measurements,"  J.  Air  Poll.  Control
     Assoc. 28:508  (1979).

Hunt, Jr.,   W.  F.,   T.  C. Curran, N.  H.  Frank, and  R.  B. Faoro,
     "Use of  Statistical  Quality Control Procedures in Achieving
     and  Maintaining  Clean  Air,"  Transactions  of  the  Joint
     European   Organization  for  Quality  Control/International
     Academy   for   Quality  Conference,  Venice   Lido,   Italy,
     September 1975.
                                6-1

-------
Hunt, Jr., W.  F.,  R.  B.  Faoro,  T.  C.  Curran,  and W.  M.  Cox,  "The
     Application of Quality Control Procedures to the Ambient Air
     Pollution Problem in  the USA," Transactions of the European
     Organization for Quality Control,  Copenhagen,  Denmark,  June
     1976.

Hunt, Jr., W.  F.,  R.  B.  Faoro,  and S. K.  Goranson,  "A Comparison
     of the Dixon Ratio Test and Shewhart Control Test Applied to
     the  National  Aerometric  Data  Bank," Transactions  of  the
     American  Society for  Quality  Control,  Toronto,  Canada,  June
     1976.

Johnson,  T.   A Comparison of the  Two-Parameter  Weibull and  Log-
     normal  Distributions  Fitted  to  Ambient Ozone  Data.   PEDCo
     Environmental,  Inc.,  Durham,  North  Carolina,  and The  Air
     Pollution  Control  Association.   Quality Assurance  in  Air
     Pollution  Measurement.   Presented  at  the  Air  Pollution
     Control Association,  New Orleans, March 11-14,  1979.

Larsen,  R.   I.   A  Mathematical  Model  for Relating  Air  Quality
     Measurements  to  Air  Quality  Standards,   Publication  No.
     AP-89, U.S. Environmental Protection Agency, 1971.

Marriott, F.  H.  C.   The  Interpretation of Multiple Observations.
     Academic Press, New York,  1974.

Naus,  J.  I.   Data  Quality Control and Editing.  Marcel  Dekker,
     Inc., New York, 1975.

Remington, R. D., and M.  A. Schork.  Statistics with Applications
     to the Biological and Health Sciences.  Prentice-Hall,  Inc.,
     Englewood Cliffs, New Jersey, 1970.

Rhodes,  R.  C.,  and S. Hochheiser.   Data  Validation Conference
     Proceedings.   Presented  by Office of Research and Develop-
     ment,  U.S.  Environmental  Protection  Agency,  Research  Tri-
     angle  Park,  North   Carolina,  EPA 600/9-79-042,  September
     1979.

Siegel, S.  Nonparametric  Statistics for the Behavioral Sciences,
     McGraw-Hill, 1956.

Smith,  F.   Guideline  for  the Development  and Implementation of a
     Quality  Cost  System  for Air Pollution Measurement Programs.
     Research  Triangle  Institute.  RTI/1507/01F,  November  1979.

U.S.  Department of  Commerce.   Computer Science and Technology:
     Performance   Assurance   and  Data    Integrity  Practices.
     National  Bureau  of  Standards,   Washington,  D.C.,  January
     1978.

U.S. Environmental Protection Agency.    Guidelines for Air Quality
     Maintenance  Planning and  Analysis.   Vol.  11.   Air Quality
     Monitoring and Data Analysis.  EPA-450/4-74-012, 1974.


                                6-2

-------
U.S. Environmental Protection Agency.  Quality Assurance and Data
     Validation for the Regional Air Monitoring System of the St.
     Louis Regional Air Pollution Study.  EPA-600/4-76-016,  1976.

U.S. Environmental  Protection Agency.  Quality  Assurance  Hand-
     book.  Vol. I. Principles.  EPA-600/9-76-005, 1976.

U.S. Environmental  Protection  Agency.   Quality  Assurance  Hand-
     book:   Vol.  I,  Principles;  Vol.  II,  Ambient Air Specific
     Methods;  and  Vol.  III,  Stationary Source Specific Methods.
     EPA-600/9-76-005, 1976.

U.S. Environmental  Protection Agency.  Screening Procedures for
     Ambient  Air  Quality  Data.   EPA-450/2-78-037,  July  1978.

Wolfe,   J.  H.   NORMIX:   Computation  Methods  for  Estimating the
     Parameters of Multivariate Normal Mixtures of Distributions,
     Research Memorandum SRM 68-2.  U.S. Naval Personnel Research
     Activity, San Diego, 1967.

1978 Annual  Book  of ASTM  Standards,  Part 41.   Standard Recom-
     mended Practice for Dealing with Outlying Observations, ASTM
     Designation:  E 178-75.  pp. 212-240.
                                6-3

-------
    APPENDIX A






STATISTICAL TABLES

-------
               TABLE A-1.  DIXON CRITERIA FOR TESTING OF EXTREME
                         OBSERVATION (SINGLE SAMPLE)*

 Criteria (x1 ≤ x2 ≤ ... ≤ xn-2 ≤ xn-1 ≤ xn):

   r10 = (x2 - x1)/(xn - x1)      if the smallest value is suspected;
   r10 = (xn - xn-1)/(xn - x1)    if the largest value is suspected.

   r11 = (x2 - x1)/(xn-1 - x1)    if the smallest value is suspected;
   r11 = (xn - xn-1)/(xn - x2)    if the largest value is suspected.

   r21 = (x3 - x1)/(xn-1 - x1)    if the smallest value is suspected;
   r21 = (xn - xn-2)/(xn - x2)    if the largest value is suspected.

   r22 = (x3 - x1)/(xn-2 - x1)    if the smallest value is suspected;
   r22 = (xn - xn-2)/(xn - x3)    if the largest value is suspected.

                                       Significance level
       n      Criterion         10%         5%         1%

       3        r10            .886       .941       .988
       4        r10            .679       .765       .889
       5        r10            .557       .642       .780
       6        r10            .482       .560       .698
       7        r10            .434       .507       .637

       8        r11            .479       .554       .683
       9        r11            .441       .512       .635
      10        r11            .409       .447       .597

      11        r21            .517       .576       .679
      12        r21            .490       .546       .642
      13        r21            .467       .521       .615

      14        r22            .492       .546       .641
      15        r22            .472       .525       .616
      16        r22            .454       .507       .595
      17        r22            .438       .490       .577
      18        r22            .424       .475       .561
      19        r22            .412       .462       .547
      20        r22            .401       .450       .535
      21        r22            .391       .440       .524
      22        r22            .382       .430       .514
      23        r22            .374       .421       .505
      24        r22            .367       .413       .497
      25        r22            .360       .406       .489

*Reproduced with permission from W. J. Dixon, "Processing Data for Outliers,"
 Biometrics, March 1953, Vol. 9, No. 1, Appendix, Page 89.
     Criterion r10 applies for 3 ≤ n ≤ 7
     Criterion r11 applies for 8 ≤ n ≤ 10
     Criterion r21 applies for 11 ≤ n ≤ 13
     Criterion r22 applies for 14 ≤ n ≤ 25
                                      A-1

-------
             TABLE A-2.  CRITICAL VALUES FOR 5% AND 1% TESTS OF
              DISCORDANCY FOR TWO OUTLIERS IN A NORMAL SAMPLE*

                    n          5%          1%

                    4        0.967       0.992
                    5        0.845       0.929
                    6        0.736       0.836
                    7        0.661       0.778
                    8        0.607       0.710
                    9        0.565       0.667
                   10        0.531       0.632
                   12        0.481       0.579
                   14        0.445       0.538
                   16        0.418       0.508
                   18        0.397       0.484
                   20        0.372       0.464
                   25        0.343       0.428
                   30        0.322       0.402

*Barnett, V., and T. Lewis.  Outliers in Statistical Data, Table XIIIe,
 p. 311.  John Wiley and Sons, New York, 1978.
                                   A-2

-------
            TABLE A-3.  CRITICAL T VALUES FOR ONE-SIDED GRUBBS TEST
               WHEN STANDARD DEVIATION IS CALCULATED FROM SAMPLE*

 [Table A-3 tabulates the critical T values of the one-sided Grubbs test
 for numbers of observations n = 3 through 147 at the upper 0.1%, 0.5%,
 1%, 2.5%, 5%, and 10% significance levels.]

*Grubbs, F. E., and G. Beck.  Extension of Sample Sizes and Percentage
 Points for Significance Test of Outlying Observations.  Technometrics,
 Vol. 14, No. 4, November 1972.

-------
                   TABLE A-4.  WILCOXON SIGNED-RANK TEST*
                             n = number of pairs

     Number                          Critical values
     of pairs,
        n          α<0.10        α<0.05        α<0.02        α<0.01

        1            --            --            --            --
        2            --            --            --            --
        3            --            --            --            --
        4            --            --            --            --
        5          0, 15           --            --            --
        6          2, 19         0, 21           --            --
        7          3, 25         2, 26         0, 28           --
        8          5, 31         3, 33         1, 35         0, 36
        9          8, 37         5, 40         3, 42         1, 44
       10         10, 45         8, 47         5, 50         3, 52
       11         13, 53        10, 56         7, 59         5, 61
       12         17, 61        13, 65         9, 69         7, 71
       13         21, 70        17, 74        12, 79         9, 82
       14         25, 80        21, 84        15, 90        12, 93
       15         30, 90        25, 95        19, 101       15, 105
       16         35, 101       29, 107       23, 113       19, 117
       17         41, 112       34, 119       28, 125       23, 130
       18         47, 124       40, 131       32, 139       27, 144
       19         53, 137       46, 144       37, 153       33, 158
       20         60, 150       52, 158       43, 167       37, 173
       21         67, 164       58, 173       49, 182       42, 189
       22         75, 178       66, 187       55, 198       48, 205
       23         83, 193       73, 203       62, 214       54, 222
       24         91, 209       81, 210       69, 231       61, 239
       25        100, 225       89, 236       76, 249       68, 257

*The data of this table are reproduced from Documenta GEIGY, Scientific
 Tables, 7th edition.  With kind permission of CIBA-GEIGY Ltd., Basle
 (Switzerland).
                                    A-6

-------
                 TABLE A-5.  RANK SUM TEST, α = P[H0 is true]*

 In performing the test, begin with the α = 0.10 table and, if Tx does not
 fall between Tl and Tu, repeat the test using α = 0.05, α = 0.02, and
 α = 0.01, until the inequality is satisfied.

 [Table A-5 gives the lower and upper critical values (Tl, Tu) of the rank
 sum for sample sizes N1 = 4 to 25 and N2 = 4 to 25, in four sub-tables for
 α = 0.01, 0.02, 0.05, and 0.10.]

*The data of this table are extracted with kind permission from DOCUMENTA
 GEIGY SCIENTIFIC TABLES, 6th Ed., pp. 124-127, Geigy Pharmaceuticals,
 Division of Geigy Chemical Corporation, Ardsley, N. Y.

-------
         APPENDIX B






FITTING DISTRIBUTIONS TO DATA

-------
                  FITTING DISTRIBUTIONS TO DATA

     Many of  the  outlier tests in Section 3  assume  that the un-
derlying distribution  of the  data  is  known.   Four distributions
which often provide good fits to air pollution and meteorological
data  are  the normal,  lognormal,  exponential, and  Weibull.   The
mathematical equations of these four distributions are:
     Normal:       $G(x) = 1 - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{w} \exp(-t^2/2)\,dt$        B-1

                   $w = (x - \mu)/\sigma$                                                        B-2

  Lognormal:       $G(x) = 1 - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{w} \exp(-t^2/2)\,dt$        B-3

                   $w = (\ln x - \mu)/\sigma$                                                    B-4

Exponential:       $G(x) = \exp[-\lambda(x - \delta)]$                                           B-5

    Weibull:       $G(x) = \exp[-(x/\delta)^k]$.                                                 B-6
G(x)  is  defined  as  the fraction of a population  having values
greater than x.
     Each  of  the four distributions  described  above can be com-
pletely described by specifying values for two parameters.  These
parameters  relate  to the shape,  scale,  or  location of the dis-
tribution when plotted on graph paper.  Accurate estimates of the
maximum values of a data set require good estimates  of the param-
eters of  the  distribution which  most closely fit the data.  Two
methods of  fitting these distributions are discussed in this sec-
tion:  the traditional method  of maximum likelihood  which uses
all  the  data  and a least squares approach,  which emphasizes the
upper tail  of the data.
                              B-l

-------
THE METHOD OF MAXIMUM LIKELIHOOD
     The  principal  advantage   of  using  the  method of  maximum
likelihood  to  estimate distribution  parameters  is  that  it pro-
duces  estimates  which  have minimum  variance and  distributions
which  asymptotically approach  the normal  distribution.   Table
B-l  lists  maximum likelihood  estimators  (MLE's) for  the param-
eters of the  four  distributions discussed above.  Unfortunately,
the MLE's  for  the  shape and scale parameters of the Weibull dis-
tribution cannot be calculated directly.  The equations
          $k = \left[\frac{\sum_{i=1}^{n} x_i^k \ln x_i}{\sum_{i=1}^{n} x_i^k} - \frac{1}{n}\sum_{i=1}^{n} \ln x_i\right]^{-1}$          B-7

          $\delta = \left[\frac{1}{n}\sum_{i=1}^{n} x_i^k\right]^{1/k}$                                                                  B-8

must be solved simultaneously using an iterative procedure.  A
computer program which uses a "golden section" iterative procedure
has been developed by Mage et al.2
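     For illustration, equations B-7 and B-8 can be solved with any one-
dimensional root-finding scheme.  The sketch below (in Python) is not the
Mage et al. program; it uses simple bisection on k rather than a golden
section search, and it assumes the data are positive and that the root
lies in the bracket supplied.

    import math

    def weibull_mle(x, k_lo=0.05, k_hi=20.0, tol=1e-6):
        """Solve equations B-7 and B-8 for the Weibull shape k and scale delta."""
        n = len(x)
        logs = [math.log(v) for v in x]
        mean_log = sum(logs) / n

        def g(k):
            # g(k) = 0 at the maximum likelihood estimate (rearranged from B-7)
            xk = [v ** k for v in x]
            return sum(a * b for a, b in zip(xk, logs)) / sum(xk) - mean_log - 1.0 / k

        while k_hi - k_lo > tol:          # g(k) is increasing in k
            k_mid = 0.5 * (k_lo + k_hi)
            if g(k_mid) > 0.0:
                k_hi = k_mid
            else:
                k_lo = k_mid
        k = 0.5 * (k_lo + k_hi)
        delta = (sum(v ** k for v in x) / n) ** (1.0 / k)   # equation B-8
        return k, delta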
THE METHOD OF LEAST SQUARES
     The parameters estimated  by the maximum  likelihood method
determine  distributions  which fit the whole data set.  However,
the maximum  value of a given data set is  often better estimated
using  the  parameters of a distribution  fit  to  the upper tail of
the  data  distribution by  the method of  least  squares.   This
method requires that the equation defining the distribution  under
consideration  be  expressed as a  linear  relationship of the form
z = ay + b.  Equations B-l, B-3, B-5,  and B-6 can be rewritten in
linear form  if the  following identities are used.
     Distribution            z             a           y            b

     normal                  z            1/σ          x          -μ/σ
     lognormal               z            1/σ         ln x        -μ/σ
     exponential          ln G(x)         -λ           x           λδ
     Weibull           ln[-ln G(x)]        k          ln x       -k ln δ
                              B-2

-------
The values of z for the normal and lognormal distributions are
determined from the standard normal distribution such that the
area under the standard normal curve from z to ∞ is equal to
G(x).  The following table lists z values for selected G(x)
values in the upper tail of the data distribution.

          G(x)       z              G(x)        z

          0.50      0              0.02       2.054
          0.40      0.253          0.01       2.326
          0.30      0.524          0.005      2.576
          0.20      0.842          0.002      2.880
          0.10      1.282          0.001      3.090
          0.05      1.645          0.0005     3.291
     A linear regression analysis of the data which have been
transformed from x and G(x) values to y and z by the appropriate
identities listed above will determine a regression line which
has an equation in the form of z = ay + b.  Parameters of the
corresponding distribution can be determined from the values of
a and b using the equations listed in Table B-1 under least
squares estimators (LSE's).
     One measure of the goodness of fit of the distribution
under consideration is the coefficient of determination (r2),
which is determined as part of the linear regression analysis.
The closer the r2 value is to unity, the better the distribution
fits the data.
     There are numerous other statistics which have been sug-
gested for quantifying goodness of fit.  EPA has developed a
program which calculates six goodness of fit statistics:  abso-
lute deviations, weighted absolute deviations, Chi-square,
Kolmogorov-Smirnov, Cramer-von Mises-Smirnov, and the maximum
value of the log-likelihood function.2  Four other goodness of
fit statistics in common use are the Kuiper, Watson, Anderson-
Darling, and Shapiro-Wilk statistics.3  Of these statistics, the
r2 and Chi-square are the easiest to calculate, though not neces-
sarily the best choices for distinguishing which of the distribu-
tions under investigation provide the best fit to the data.
                              B-3

-------
However,  for routine  data  validation procedures  requiring the
selection  of a  distribution  to characterize  the data,  the r2
statistic  is recommended  since it  can be  easily calculated as
part of the linear regression procedure.  In general, the distri-
bution  which  yields  the  highest  r2  value should  be  used to
characterize the data.  Parameter values are then  estimated  using
the LSE equations in Table B-l.
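     The least squares procedure can be sketched for the Weibull case as
follows.  The fragment is illustrative only; the choice of the upper 20
percent of the data and the plotting-position formula used for the
empirical exceedance fraction G(x) are assumptions introduced here.

    import math

    def fit_weibull_tail(sorted_x, tail_fraction=0.20):
        """Fit z = a*y + b with z = ln[-ln G(x)] and y = ln x, then recover
        k = a and delta = exp(-b/a) per Table B-1; also return r-squared."""
        n = len(sorted_x)                     # data sorted in increasing order, all > 0
        start = int(n * (1.0 - tail_fraction))
        pts = []
        for i in range(start, n):
            g = (n - i - 0.5) / n             # assumed empirical exceedance fraction
            pts.append((math.log(sorted_x[i]), math.log(-math.log(g))))
        m = len(pts)
        sy = sum(y for y, _ in pts); sz = sum(z for _, z in pts)
        syy = sum(y * y for y, _ in pts); szz = sum(z * z for _, z in pts)
        syz = sum(y * z for y, z in pts)
        a = (m * syz - sy * sz) / (m * syy - sy * sy)
        b = (sz - a * sy) / m
        r2 = (m * syz - sy * sz) ** 2 / ((m * syy - sy * sy) * (m * szz - sz * sz))
        return a, math.exp(-b / a), r2        # k, delta, coefficient of determination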

             TABLE B-1.  ESTIMATION OF DISTRIBUTION PARAMETERS

                        Parameter                        Estimator
 Distribution   Location  Scale  Shape          MLE                       LSE

 Normal            μ        σ     --    μ = x̄ (sample mean)         μ = -b/a
                                        σ = s (sample std. dev.)    σ = 1/a

 Lognormal         --       μ      σ    μ = mean of ln x            μ = -b/a
                                        σ = std. dev. of ln x       σ = 1/a

 Exponential       δ        λ     --    δ = min xi                  δ = -b/a
                                        λ = n/Σ(xi - min xi)        λ = -a

 Weibull           --       δ      k    Iterative solution of       δ = exp(-b/a)
                                        simultaneous equations      k = a
                                        B-7 and B-8

REFERENCES
      1.   Hastings,  N.  A.  J.,  and J.  B.  Peacock.   Statistical
          Distributions.   John Wiley and Sons,  New York.
      2.   Mage,  David T.,  et  al.   "Techniques  for Fitting Proba-
          bility Models  to  Experimental  Data,"   In:   Proc.  of
          Specialty  Conference on Quality Assurance in Air Pollu-
          tion  Measurement,  New  Orleans,  Louisiana,  March 1979.
      3.   Stephens,  M.  A.  EDF Statistics for Goodness of Fit and
          Some  Comparisons.   Journal of  the  American Statistical
          Association,  Vol.  69,  No. 347,  September 1974.
                               B-4

-------
                   APPENDIX C






CALCULATION OF LIMITS FOR SHEWHART CONTROL CHART

-------
        CALCULATION OF LIMITS FOR SHEWHART CONTROL CHART

     The Shewhart control chart enables the data analyst to test
the hypothesis H0 that the mean or range value of the sample
under evaluation comes from a population having the same distri-
bution as the historical data sets A1, A2, ..., AN.
     The first step  of  the procedure is the selection of a suit-
able  sample  size for  the  test,  as indicated  in  Section 3.3.3.
Each sample should contain between 4 and 15 values and should re-
present  a  well-defined time  period (day, month,  quarter,  etc.)
for  which  there  is  a large body  of historical  data.   Where pos-
sible these time periods  should relate to the NAAQS of interest.
Months or quarters would be appropriate time periods for tests of
24-hour TSP, SO2, and NO2 data collected at 6- or 12-day inter-
vals.
     The second step is the selection of historical data for
determining the limits on the Shewhart control chart.  To the
extent possible, these data sets should contain data collected
under the same conditions (averaging time, site location, season,
weather, local emissions, etc.) as the data set under investiga-
tion.  For convenience, these data sets will be labeled A_1, A_2,
..., A_N.  Data set A_1 contains n_1 values and has a mean x̄_1
and a range R_1.  Similarly, data set A_2 contains n_2 values and
has a mean x̄_2 and a range R_2.  Continuing in this manner, the
data analyst can develop the following table.

  Data set     Sample size     Mean     Range      d_2        R/d_2
    A_1            n_1         x̄_1       R_1       d_21       R_1/d_21
    A_2            n_2         x̄_2       R_2       d_22       R_2/d_22
     .              .           .         .          .           .
     .              .           .         .          .           .
    A_N            n_N         x̄_N       R_N       d_2N       R_N/d_2N

                               C-1
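     As an illustration, the tabulation above could be built as in
the following sketch.  The d_2 values are those of Table C-1 (only
a few sample sizes are reproduced here), and the function and
variable names are illustrative assumptions.

    # Illustrative construction of the summary rows (n_i, mean, range, d_2, R/d_2)
    # for the historical data sets A_1, ..., A_N.
    D2 = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326, 6: 2.534, 7: 2.704,
          8: 2.847, 9: 2.970, 10: 3.078, 11: 3.173, 12: 3.258, 13: 3.336,
          14: 3.407, 15: 3.472}    # d_2 factors from Table C-1 (partial)

    def summary_table(historical_sets):
        """historical_sets: a list of lists of values, one list per data set A_i."""
        rows = []
        for values in historical_sets:
            n = len(values)
            mean = sum(values) / n
            rng = max(values) - min(values)
            rows.append((n, mean, rng, D2[n], rng / D2[n]))
        return rows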

-------
     TABLE C-1.  FACTORS FOR ESTIMATING CONTROL LIMITS OF SHEWHART CHART

     Number of observations                Factorsᵃ
        in subgroup, n            d_2         c_2        c_2/d_2
               2                 1.128       0.5642      0.5002
               3                 1.693       0.7236      0.4274
               4                 2.059       0.7979      0.3875
               5                 2.326       0.8407      0.3614
               6                 2.534       0.8686      0.3428
               7                 2.704       0.8882      0.3285
               8                 2.847       0.9027      0.3171
               9                 2.970       0.9139      0.3077
              10                 3.078       0.9227      0.2998
              11                 3.173       0.9300      0.2931
              12                 3.258       0.9359      0.2873
              13                 3.336       0.9410      0.2821
              14                 3.407       0.9453      0.2775
              15                 3.472       0.9490      0.2733
              16                 3.532       0.9523      0.2696
              17                 3.588       0.9551      0.2662
              18                 3.640       0.9576      0.2631
              19                 3.689       0.9599      0.2602
              20                 3.735       0.9619      0.2575
              21                 3.778       0.9638      0.2551
              22                 3.819       0.9655      0.2528
              23                 3.858       0.9670      0.2506
              24                 3.895       0.9684      0.2486
              25                 3.931       0.9696      0.2467
              30                 4.086       0.9748      0.2386
              35                 4.213       0.9784      0.2322
              40                 4.322       0.9811      0.2270
              45                 4.415       0.9832      0.2227
              50                 4.498       0.9849      0.2190
              55                 4.572       0.9863      0.2157
              60                 4.639       0.9874      0.2128
              65                 4.699       0.9884      0.2103
              70                 4.755       0.9892      0.2080
              75                 4.806       0.9900      0.2060
              80                 4.854       0.9906      0.2041
              85                 4.898       0.9912      0.2024
              90                 4.939       0.9916      0.2008
              95                 4.978       0.9921      0.1993
             100                 5.015       0.9925      0.1979

     ᵃ These factors assume sampling from a normal universe.



                                   C-2

-------
     Figure 3-7 of Section 3 shows the general form of a control
chart for testing sample means.  Values of x̄ are indicated on the
vertical axis and units of time (or sample number) are indicated
on the horizontal axis.  The central line is a solid horizontal
line drawn at

          \bar{\bar{x}} = \frac{1}{N} \sum_{i=1}^{N} \bar{x}_i                      C-1

The upper and lower control limits are dashed horizontal lines
drawn at

          UCL_{\bar{x}} = \bar{\bar{x}} + z \hat{\sigma}_{\bar{x}}                  C-2

and

          LCL_{\bar{x}} = \bar{\bar{x}} - z \hat{\sigma}_{\bar{x}}                  C-3

where

          \hat{\sigma}_{\bar{x}} = \frac{1}{N \sqrt{n_A}} \sum_{i=1}^{N} \frac{R_i}{d_{2i}}     C-4

and n_A is the size of the sample being tested.  Computations are
considerably reduced if samples are selected such that n_A = n_1 =
n_2 = ... = n_N = n.  In this case, the second, fifth, and sixth
columns in the table above can be omitted and

          \hat{\sigma}_{\bar{x}} = \frac{1}{d_2 N \sqrt{n}} \sum_{i=1}^{N} R_i                  C-5
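     A minimal sketch of Equations C-1 through C-4 is given below,
assuming summary rows of the form (n_i, mean, range, d_2i, R_i/d_2i)
such as those produced by the illustrative summary_table sketch
earlier in this appendix; the function name and the default z = 2
are assumptions for illustration.

    from math import sqrt

    def xbar_chart_limits(rows, n_A, z=2.0):
        """Control limits for the chart of sample means (Equations C-1 to C-4)."""
        N = len(rows)
        grand_mean = sum(mean for _, mean, _, _, _ in rows) / N                  # C-1
        sigma_xbar = sum(r_over_d2 for *_, r_over_d2 in rows) / (N * sqrt(n_A))  # C-4
        return (grand_mean - z * sigma_xbar,    # C-3 (LCL)
                grand_mean,                     # central line
                grand_mean + z * sigma_xbar)    # C-2 (UCL)

When all sample sizes equal n, the same function reproduces Equation
C-5, since each R_i/d_2i term reduces to R_i/d_2.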

     Figure 3-8 of Section 3 shows the general form of a control
chart for testing sample ranges.  Values of R are indicated on
the vertical axis, and units of time (or sample number) are indi-
cated on the horizontal axis.  The central line is a solid hori-
zontal line drawn at

          \bar{R} = \frac{1}{N} \sum_{i=1}^{N} R_i                                  C-6

The upper and lower control limits are dashed horizontal lines
drawn at

          UCL_R = \bar{R} + z \hat{\sigma}_R                                        C-7

          LCL_R = \bar{R} - z \hat{\sigma}_R                                        C-8
                              C-3

-------
where

          \hat{\sigma}_R = \frac{c_2}{N} \sum_{i=1}^{N} \frac{R_i}{d_{2i}}          C-9

and c_2 is determined from Table C-1 according to the size n_A of
the sample being tested.  Negative LCL's should be treated as
zeros.  If samples are selected such that n_A = n_1 = n_2 = ... =
n_N = n, then

          \hat{\sigma}_R = \frac{c_2}{d_2 N} \sum_{i=1}^{N} R_i = \frac{c_2 \bar{R}}{d_2}       C-10

Values of c_2/d_2 for 2 ≤ n ≤ 100 are listed in Table C-1.
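     A companion sketch for Equations C-6 through C-9 follows, using
the same assumed row format; c2_nA is the c_2 factor read from Table
C-1 for the size n_A of the sample under test, and the negative-LCL
rule stated above is applied.

    def range_chart_limits(rows, c2_nA, z=2.0):
        """Control limits for the chart of sample ranges (Equations C-6 to C-9)."""
        N = len(rows)
        r_bar = sum(rng for _, _, rng, _, _ in rows) / N                   # C-6
        sigma_r = c2_nA * sum(r_over_d2 for *_, r_over_d2 in rows) / N     # C-9
        lcl = max(0.0, r_bar - z * sigma_r)    # C-8; negative LCLs treated as zero
        ucl = r_bar + z * sigma_r              # C-7
        return lcl, r_bar, ucl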
           2

     The data analyst is testing the hypothesis H_0 that the mean
or range value of the sample under evaluation comes from a popu-
lation having the same distribution as the historical data sets
A_1, A_2, ..., A_N.  If H_0 is true, the probability of the mean
or range value falling outside the control limits is assumed to
be equal to twice the area under the standard normal curve to the
right of z.  Consequently, the z value contained in Equations
C-2, C-3, C-7, and C-8 is selected according to the desired
sensitivity of the test.  The following is a list of probabili-
ties corresponding to selected z values.

               z                   P[H_0 is true]
             1.282                      0.20
             1.645                      0.10
             1.96                       0.05
             2.00                       0.0455
             2.326                      0.02
             2.576                      0.01
             3.00                       0.0027

The value z = 3 corresponding to P[H_0 is true] ≤ 0.0027 is com-
monly used for materials testing, but it may be too large for
testing air quality data.  The data analyst may wish to use z = 2
to determine the initial control limits and to later increase z
if the original limits flag too many valid data sets.
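     The probabilities tabulated above are simply twice the upper-
tail area of the standard normal distribution; a quick illustrative
check (the function name is an assumption):

    from statistics import NormalDist

    def outside_limit_probability(z):
        # Two-sided tail area: the chance of falling outside the
        # control limits when H_0 is true.
        return 2.0 * (1.0 - NormalDist().cdf(z))

    # outside_limit_probability(2.0) -> approximately 0.0455
    # outside_limit_probability(3.0) -> approximately 0.0027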
                               C-4

-------
                             TECHNICAL REPORT DATA
               (Please read Instructions on the reverse before completing)

 1. REPORT NO.                             EPA-600/4-80-030
 3. RECIPIENT'S ACCESSION NO.
 4. TITLE AND SUBTITLE                     VALIDATION OF AIR MONITORING DATA
 5. REPORT DATE                            June 1980
 6. PERFORMING ORGANIZATION CODE
 7. AUTHOR(S)                              A. Carl Nelson, Jr., Dave W. Armentrout,
                                           and Ted R. Johnson
 8. PERFORMING ORGANIZATION REPORT NO.     3320-N
 9. PERFORMING ORGANIZATION NAME AND ADDRESS
                                           PEDCo Environmental, Inc.
                                           505 South Duke Street, Suite 503
                                           Durham, North Carolina  27701
10. PROGRAM ELEMENT NO.                    Assignment No. 14
11. CONTRACT/GRANT NO.                     68-02-2722
12. SPONSORING AGENCY NAME AND ADDRESS
                                           Office of Research and Development
                                           Environmental Monitoring Systems Laboratory
                                           Research Triangle Park, N.C.  27711
13. TYPE OF REPORT AND PERIOD COVERED
14. SPONSORING AGENCY CODE                 EPA 600/08
15. SUPPLEMENTARY NOTES
16. ABSTRACT
         Data validation refers to those activities performed after the data have
    been obtained and thus serves as a final screening of the data before they are
    used in a decision-making process.  This report provides organizations that are
    monitoring ambient air levels and stationary source emissions with a collection
    of data validation procedures and with criteria for selection of the appropriate
    procedures for the particular application.  Both hypothetical studies and case
    studies, as well as several examples, are given to illustrate the use of the
    procedures.  Statistical procedures and tables are in the appendices.
17. KEY WORDS AND DOCUMENT ANALYSIS
    a. DESCRIPTORS:  Data Validation, Data Screening, Data Editing, Quality
       Assurance, Outliers, Statistics, Environmental Data
    b. IDENTIFIERS/OPEN ENDED TERMS:  Environmental Monitoring, Data Management
    c. COSATI Field/Group:  43F, 68A
18. DISTRIBUTION STATEMENT                 Release to public
19. SECURITY CLASS (This Report)           Unclassified
20. SECURITY CLASS (This page)             Unclassified
21. NO. OF PAGES
22. PRICE
EPA Form 2220-1 (1-73)
                                                C-5

-------