EPA
United States
Environmental Protection
Agency
Environmental Monitoring and Support Laboratory
Research Triangle Park NC 27711
EPA-600/4-79-044
July 1979
Research and Development
The Maximum
Likelihood Approach
to Probabilistic
Modeling of Air
Quality Data
-------
RESEARCH REPORTING SERIES
Research reports of the Office of Research and Development, U.S. Environmental
Protection Agency, have been grouped into nine series. These nine broad cate-
gories were established to facilitate further development and application of en-
vironmental technology. Elimination of traditional grouping was consciously
planned to foster technology transfer and a maximum interface in related fields.
The nine series are:
1. Environmental Health Effects Research
2. Environmental Protection Technology
3. Ecological Research
4. Environmental Monitoring
5. Socioeconomic Environmental Studies
6. Scientific and Technical Assessment Reports (STAR)
7. Interagency Energy-Environment Research and Development
8. "Special" Reports
9. Miscellaneous Reports
This report has been assigned to the ENVIRONMENTAL MONITORING series.
This series describes research conducted to develop new or improved methods
and instrumentation for the identification and quantification of environmental
pollutants at the lowest conceivably significant concentrations. It also includes
studies to determine the ambient concentrations of pollutants in the environment
and/or the variance of pollutants as a function of time or meteorological factors.
This document is available to the public through the National Technical Informa-
tion Service, Springfield, Virginia 22161.
-------
THE MAXIMUM LIKELIHOOD APPROACH TO
PROBABILISTIC MODELING OF AIR QUALITY DATA
by
Terence Fitz-Simons and David M. Holland
Environmental Monitoring and Research Laboratory
Environmental Protection Agency
Research Triangle Park, N. C. 27711
ENVIRONMENTAL MONITORING AND SUPPORT LABORATORY
OFFICE OF RESEARCH AND DEVELOPMENT
U.S. ENVIRONMENTAL PROTECTION AGENCY
RESEARCH TRIANGLE PARK, N. C. 27711
-------
DISCLAIMER
This report has been reviewed by the Environmental Monitoring and
Support Laboratory, U.S. Environmental Protection Agency, and approved
for publication. Mention of trade names or commercial products does
not constitute endorsement or recommendation for use.
-------
FOREWORD
Measurement and monitoring research efforts are designed to anticipate
potential environmental problems, to support regulatory actions by developing
an in-depth understanding of the nature and processes that impact health and
the ecology, to provide innovative means of monitoring compliance with regu-
lations and to evaluate the effectiveness of health and environmental pro-
tection efforts through the monitoring of long-term trends. The Environmental
Monitoring and Support Laboratory, Research Triangle Park, North Carolina, is
responsible for development of: environmental monitoring technology and
systems; agency-wide quality assurance programs for air pollution measurement
systems; and technical support to the Agency's operating functions, including
the Office of Air, Noise and Radiation, the Office of Toxic Substances and
the Office of Enforcement.
In order for the laboratory to effectively analyze the data generated
by these activities, statistical assumptions must be made regarding the
underlying distribution of these data. This report presents a tool to aid
in evaluating these assumptions. Any changes in the material presented in
this report will be presented in future reports.
Thomas R. Hauser
Director
Environmental Monitoring and
Support Laboratory
-------
ABSTRACT
Software developed by the authors using maximum likelihood estimation
to fit six probabilistic models is presented. The software is designed
as a tool for the air pollution researcher to determine what assumptions
are valid in the statistical analysis of air pollution data for the
purposes of standard setting, roll-back calculations, estimation of
maximum concentrations, threshold approximations, and handling missing
observations. The program fits the user's data to the normal distribution,
the 3-parameter lognormal distribution, the 3-parameter Weibull distribution,
the 3-parameter gamma distribution, the Johnson SB distribution
(a 4-parameter lognormal distribution), and the 4-parameter beta distribution.
The parameters are estimated using standard closed solutions
to maximizing equations, Gaussian elimination to solve non-linear maxi-
mizing equations where possible, and a golden section search for all
other parameters. Graphical output contains a histogram of the data
superimposed by the fitted density for each model. Six goodness-of-fit
criteria are supplied and ranked by the program to aid in the selection
of the most appropriate choice among the six models. These criteria are
absolute deviations (AD statistic), weighted absolute deviations (WAD
statistic), Kolmogorov-Smirnov statistic, Cramer-von Mises-Smirnov
statistic, the log-likelihood function, and the observed significance
level of the Chi-square goodness-of-fit test. The results of applying
the program to several subsets of the Los Angeles Catalyst Study data
base are presented.
-------
CONTENTS
Foreword
Abstract
Figures
1. Introduction
2. Methodology
3. Comparing Distributions
4. Application of MAXFIT
5. Future Developments
6. Summary
References
Appendix A
Appendix B
-------
FIGURES
Number
1 MAXFIT Output: Fit of normal distribution for data in Example 1
2 MAXFIT Output: Fit of 3-parameter lognormal distribution for data in Example 1
3 MAXFIT Output: Fit of Gamma distribution for data in Example 1
4 MAXFIT Output: Fit of Weibull distribution for data in Example 1
5 MAXFIT Output: Fit of Johnson SB distribution for data in Example 1
6 MAXFIT Output: Fit of Beta distribution for data in Example 1
7 MAXFIT Output: Fit of the normal distribution for data in Example 2
8 MAXFIT Output: Fit of the 3-parameter lognormal distribution for data in Example 2
9 MAXFIT Output: Fit of the Gamma distribution for data in Example 2
10 MAXFIT Output: Fit of the Weibull distribution for data in Example 2
11 MAXFIT Output: Fit of the Johnson SB distribution for data in Example 2
12 MAXFIT Output: Fit of the Beta distribution for data in Example 2
TABLES
Number
1 Available Distributions and Identifying Numbers
2 MAXFIT Output: Comparison Statistics for Data in Example 1
3 MAXFIT Output: Comparison Statistics for Data in Example 2
-------
SECTION 1
INTRODUCTION
There is a growing interest in the application of probability models to
air quality data in areas such as:
standard setting,
emission roll-back calculations,
estimation of maximum concentrations,
threshold approximations, and
handling missing observations.
Selection of a valid probability model to describe air quality data is a
difficult problem involving many factors. Larsen (1,2,3,4,5) proposed the
two-parameter and three-parameter univariate log-normal models for describing
air quality data collected in urban areas. Mage and Ott (6,7) introduced
censorship to the three-parameter lognormal probability model whenever the
third parameter was negative. These authors concluded that this model was a
superior, general purpose model suitable for a large variety of environmental
phenomena. Standard logarithmic probability paper was used to fit their
proposed probability model. Curran and Frank (8) used semi-log graph paper
to plot 1-F(x) vs. pollutant concentration as a technique for examining the
behavior of the upper tails of air quality data. Curran and Frank pointed
out that fitting techniques based upon two selected percentiles may be sensi-
tive to the choice of an underlying distribution, and that the stability of
these estimates for varying choices of percentiles needs investigation.
Computer software providing efficient estimation of parameters and
evaluation of goodness-of-fit for several probability models for ambient air
quality data has not been available. A computer program, MAXFIT, is presented
here for the purpose of identifying a suitable probability model to describe
ambient air quality data using maximum likelihood techniques.
-------
SECTION 2
METHODOLOGY
The approach employed is maximum likelihood estimation. This is a
simple method, yielding estimates for almost every model imaginable, and its
estimators have several statistically desirable properties. The various
distribution models are compared by goodness-of-fit statistics providing
an objective basis for choosing one of the six models for a particular
application.
The main idea behind maximum likelihood estimation may be stated very
simply. Given a random sample from a population distributed according to a
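density f(x) with parameters $\alpha$, $\beta$, and $\lambda$, written $f(x;\alpha,\beta,\lambda)$, the maximum likelihood
estimates ($\hat{\alpha}$, $\hat{\beta}$, and $\hat{\lambda}$) are found such that the likelihood that the random
sample came from $f(x;\hat{\alpha},\hat{\beta},\hat{\lambda})$ is greater than or equal to the likelihood that
the sample came from $f(x;\alpha^{*},\beta^{*},\lambda^{*})$, where $\alpha^{*}$, $\beta^{*}$, and $\lambda^{*}$ are any other estimates of
$\alpha$, $\beta$, and $\lambda$.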
Under certain assumptions discussed in Rao (9), maximum likelihood
estimators have statistically desirable characteristics. First, they are
consistent. Consistent estimators become "better" as the sample size in-
creases. If biased, the bias decreases as the sample size gets larger. The
variance of the estimators lessens as the sample size increases. Second,
they converge to the true value of the parameters with a probability equal
to one. If the entire population were included, the estimators would equal
the parameters with a probability equal to one. Third, as n (sample size)
approaches infinity, the sampling distribution of the estimators approaches
a normal distribution.
The method of fitting using maximum likelihood is described below:
- The likelihood function is defined as the joint density of the
  random sample.
- The logarithm of the likelihood function is determined. This
  facilitates differentiation techniques and, since the logarithm of
  X increases whenever X does, the likelihood and the log-likelihood
  reach a maximum at the same point.
- Differentiation techniques are employed to maximize the log-
  likelihood. If these techniques fail to yield equations which can
  be solved algebraically, the correct values of the parameters must
  be found by an iterative search. This usually requires a computer.
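The three steps above can be illustrated for the normal model, where the maximizing equations have a closed solution. This is a minimal Python sketch, not MAXFIT code; the function names and the sample values are hypothetical.

```python
import math

def normal_loglik(xs, mu, sigma2):
    """Logarithm of the joint normal density of the sample xs."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - ss / (2 * sigma2)

def normal_mle(xs):
    """Closed-form solution of the maximizing equations for the normal model."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n  # divisor n, not n - 1
    return mu, sigma2

# Made-up concentrations (ppm), for illustration only.
sample = [0.012, 0.018, 0.021, 0.025, 0.031, 0.044]
mu_hat, s2_hat = normal_mle(sample)
best = normal_loglik(sample, mu_hat, s2_hat)

# Any other parameter values give a log-likelihood no larger than `best`.
other = normal_loglik(sample, mu_hat * 1.1, s2_hat * 1.2)
```

For models without a closed solution, the same log-likelihood function would be handed to an iterative search instead of being solved algebraically.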
-------
FORTRAN Software utilizing these concepts was developed based on a
program written by Schreuder, et al (10) making use of several IMSL (Inter-
national Mathematical and Statistical Library, 1975) subroutines. While
modifying this program to run on the EPA UNIVAC 1110 situated at Research
Triangle Park, North Carolina, several new capabilities were incorporated.
These include the estimation of the lower boundary parameter in the
3-parameter lognormal (Johnson SL), gamma, and Weibull distributions and in
the 4-parameter beta and 4-parameter lognormal (Johnson SB) distributions.
Boundary parameters are estimated by
a golden section search (11). The golden section search is a fairly effi-
cient technique, accomplishing a 38 percent reduction in the interval of
uncertainty for each iteration and requiring one evaluation of the function
per iteration. Graphic output was also included to provide a visual display
of the fit.
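The behavior described for the golden section search (one new function evaluation and a reduction of about 38 percent in the interval of uncertainty per iteration) can be sketched as follows. This is an illustrative Python version, not the MAXFIT subroutine; the function name, the test function, and the stopping rule on interval width are assumptions.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # about 0.618; each step keeps this fraction

def golden_section_max(f, lo, hi, tol=1e-4):
    """Return x in [lo, hi] that approximately maximizes a unimodal f."""
    x1 = hi - INV_PHI * (hi - lo)
    x2 = lo + INV_PHI * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > tol:
        if f1 < f2:
            # Maximum lies in [x1, hi]; the old x2 becomes the new x1.
            lo, x1, f1 = x1, x2, f2
            x2 = lo + INV_PHI * (hi - lo)
            f2 = f(x2)  # the single new evaluation for this iteration
        else:
            # Maximum lies in [lo, x2]; the old x1 becomes the new x2.
            hi, x2, f2 = x2, x1, f1
            x1 = hi - INV_PHI * (hi - lo)
            f1 = f(x1)
    return (lo + hi) / 2

# Hypothetical use: a concave function with its maximum at 0.3.
peak = golden_section_max(lambda b: -(b - 0.3) ** 2, 0.0, 1.0)
```

In a fitting program the function passed in would be the log-likelihood viewed as a function of the boundary parameter alone, with the remaining parameters re-estimated at each trial value.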
A list of the distributions that were fitted and the estimates used to
fit them is shown in Appendix A. All summations are from 1 to the number of
observations in the data set (n).
-------
SECTION 3
COMPARING DISTRIBUTIONS
The program MAXFIT supplies the user with six goodness-of-fit statistics
as a basis for comparing fitted distributions. The statistics supplied are:
absolute deviations, weighted absolute deviations, the observed significance
level of the Chi-square ($\chi^2$) test, Kolmogorov-Smirnov statistic,
Cramer-von Mises-Smirnov statistic, and the maximum value of the log-likelihood
function. All these statistics are computed upon grouped data. However, the
log-likelihood function is also computed using ungrouped data when the
original data are available.
It should be noted that model validity tests are designed to reject the
null hypothesis that the data follow a certain distribution as few times as
possible when the hypothesis is true. Therefore, these statistics are designed
to indicate with some certainty when data do not follow the distribution
stated in the null hypothesis. However, the literature discloses no other
method to reveal how closely the data fit a distribution. The six
statistics are defined and discussed briefly below.
ABSOLUTE DEVIATIONS (AD)
This statistic puts less emphasis on large deviations between expected
and observed frequencies than does the classic $\chi^2$ statistic.
$$AD = n \sum_{i=1}^{N} \left| P_{oi} - P_{ei} \right|$$
where $P_{oi}$ = observed proportion of data falling in the i-th interval
$P_{ei}$ = expected proportion of data falling in the i-th interval
n = total number of observations in the data set
N = total number of intervals.
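WEIGHTED ABSOLUTE DEVIATIONS (WAD)
The WAD statistic puts less emphasis upon the fit in the tails of the
distribution.
$$WAD = n \sum_{i=1}^{N} P_{ei} \left| P_{oi} - P_{ei} \right|$$
where $P_{oi}$, $P_{ei}$, n, and N are as defined for the AD statistic.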
-------
CHI-SQUARE STATISTIC ($\chi^2$)
The $\chi^2$ statistic is very popular and widely used; however, it has poor
power characteristics and is considered unreliable unless intervals have
expected frequencies equal to or greater than five.
$$\chi^2 = n \sum_{i=1}^{N} \frac{\left( P_{oi} - P_{ei} \right)^2}{P_{ei}}$$
KOLMOGOROV-SMIRNOV STATISTIC (d)
The Kolmogorov-Smirnov statistic is an empirical distribution function
(EDF) statistic. The EDF is the observed cumulative distribution function.
The statistic is the maximum distance between the observed and the expected
cumulative frequency distribution.
$$d = \max_{i} \left| S_{ni} - F_{i} \right|$$
where $S_{ni}$ = observed cumulative probability at the i-th observed value
$F_{i}$ = expected cumulative probability at the i-th observed value
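CRAMER-VON MISES-SMIRNOV STATISTIC ($W^2$)
The Cramer-von Mises-Smirnov statistic is the sum of squares of the
distance between the observed cumulative distribution and the expected cumulative
distribution function weighted by the expected probabilities. This test puts
less emphasis upon the tails of the distribution when measuring the fit.
$$W^2 = n \sum_{i=1}^{N} \left( S_{ni} - F_{i} \right)^2 P_{ei}$$
where $S_{ni}$ and $F_{i}$ are the observed and expected cumulative probabilities
and $P_{ei}$ is the expected proportion in the i-th interval.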
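The grouped-data statistics described in this section can be computed from the observed and expected class proportions. The Python sketch below uses forms that are consistent with the values printed in the MAXFIT output tables of this report (for example, AD equals the sum of the absolute residual frequencies); the function name and the three-class example are hypothetical, and no regrouping of sparse classes is attempted here.

```python
from itertools import accumulate

def fit_statistics(p_obs, p_exp, n):
    """Grouped-data fit statistics from class proportions and sample size n."""
    cum_obs = list(accumulate(p_obs))  # observed cumulative probabilities
    cum_exp = list(accumulate(p_exp))  # expected cumulative probabilities
    pairs = list(zip(p_obs, p_exp))
    return {
        "AD": n * sum(abs(o - e) for o, e in pairs),
        "WAD": n * sum(e * abs(o - e) for o, e in pairs),
        "CHI2": n * sum((o - e) ** 2 / e for o, e in pairs),
        "KS": max(abs(s - f) for s, f in zip(cum_obs, cum_exp)),
        "CVM": n * sum((s - f) ** 2 * e
                       for s, f, e in zip(cum_obs, cum_exp, p_exp)),
    }

# Made-up three-class example with 100 observations.
stats = fit_statistics([0.20, 0.50, 0.30], [0.25, 0.50, 0.25], 100)
```

Because WAD and the Cramer-von Mises-Smirnov statistic weight each class by its expected probability, sparsely populated tail classes contribute little to either.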
LIKELIHOOD
The likelihood is also presented as a goodness-of-fit statistic. If we
are committed to fitting by the maximum likelihood criterion, it is
consistent to use the larger likelihood as a criterion for choosing
between two or more distributions having the same number of parameters. Such
a comparison is also the basis of the Neyman-Pearson theorem (9) dealing with
the construction of a powerful test of a simple hypothesis involving two
distributions.
-------
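The formulas of the log-likelihood functions of the various distributions,
evaluated at the maximum likelihood estimates, are shown as follows, where the
superscript ^ represents the MLE estimates of the parameters and the
summations are from 1 to n. Here $\lambda$ denotes the lower bound and
$\epsilon$ the upper bound of a distribution.
Normal distribution:
$$-\frac{n}{2}\left(1+\log[2\pi]+\log[\hat{\sigma}^2]\right)$$
Johnson SL distribution:
$$-n\left(\frac{1}{2}+\frac{\log[2\pi]}{2}+\log[\hat{\sigma}]+\frac{\sum\log[x_i-\hat{\lambda}]}{n}\right)$$
Johnson SB distribution:
$$-n\left(\frac{1}{2}-\log\left[\frac{\hat{\epsilon}-\hat{\lambda}}{\hat{\sigma}}\right]+\frac{\log[2\pi]}{2}+\frac{\sum\log[\hat{\epsilon}-x_i]}{n}+\frac{\sum\log[x_i-\hat{\lambda}]}{n}\right)$$
Gamma distribution:
$$-n\left(\hat{\alpha}\log[\hat{\beta}]+\log[\Gamma\{\hat{\alpha}\}]-[\hat{\alpha}-1]\frac{\sum\log[x_i-\hat{\lambda}]}{n}+\frac{\sum[x_i-\hat{\lambda}]}{n\hat{\beta}}\right)$$
Weibull distribution:
$$-n\left(1-[\hat{\alpha}-1]\frac{\sum\log[x_i-\hat{\lambda}]}{n}+[\hat{\alpha}-1]\log[\hat{\beta}]-\log\left[\frac{\hat{\alpha}}{\hat{\beta}}\right]\right)$$
Beta distribution:
$$n\left(\log\left[\frac{\Gamma\{\hat{\alpha}+\hat{\beta}\}}{\Gamma(\hat{\alpha})\Gamma(\hat{\beta})}\right]-[\hat{\alpha}+\hat{\beta}-1]\log[\hat{\epsilon}-\hat{\lambda}]+[\hat{\alpha}-1]\frac{\sum\log[x_i-\hat{\lambda}]}{n}+[\hat{\beta}-1]\frac{\sum\log[\hat{\epsilon}-x_i]}{n}\right)$$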
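As a numerical check on the 3-parameter lognormal (Johnson SL) case, the maximized log-likelihood can be evaluated from the fitted values printed by MAXFIT for Example 1 (92 observations, MEAN = -3.5020, VARIANCE = .1213). The sketch below assumes the standard 3-parameter lognormal density, for which the fitted mean equals the average of log(x - lower bound) at the maximum; the function name is hypothetical.

```python
import math

def lognormal3_max_loglik(n, mu_hat, sigma2_hat):
    """Maximized log-likelihood of the 3-parameter lognormal model.

    mu_hat is the mean of log(x_i - lower bound) and sigma2_hat its
    variance (divisor n), both evaluated at the fitted lower bound.
    """
    return -n * (0.5
                 + 0.5 * math.log(2 * math.pi)
                 + 0.5 * math.log(sigma2_hat)   # equals log(sigma_hat)
                 + mu_hat)

# Fitted values printed for the NO2 data of Example 1.
value = lognormal3_max_loglik(92, -3.5020, 0.1213)
```

The result is about 288.7, in agreement with the log-likelihood of .28868319+003 listed for the lognormal model in the comparison statistics for Example 1, up to the rounding of the printed parameters.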
-------
SECTION 4
APPLICATION OF MAXFIT
INPUTS
Subroutine USER must be supplied by the user to pass certain control
variables to MAXFIT. An example of this subroutine is given in Appendix B.
The subroutine argument list must be as follows:
SUBROUTINE USER (X,P,XB,NGRP,TF,N,NC,BEGIN,STEP,TOL,TITLE,XMIN,XMAX,
NDIST).
The variables that must be provided by the user are defined to be:
X = an array of length m ≤ 2000 of class midpoints if the data are grouped
into classes or individual observations if the data are not grouped.
P = an array of length n ≤ 100 of class probabilities for data grouped
into classes. This array is not defined for ungrouped data.
XB = an array of length p = n+1, consisting of the lower limits of each
class with the last element being the upper limit of the last
class. It is only needed if NGRP=1. Two additional classes will
be created by MAXFIT to cover the entire range of the distribution
being fitted.
NGRP = 1 if data are grouped, = 0 if data are individual observations.
This variable indicates whether the statistics calculated by MAXFIT
are to be based on individual observations or on class frequencies.
NDIST = an array of length 6, containing the identifying numbers for the
desired distributions.
NC = number of classes if class limits, XB, are specified by the user.
These limits must be specified if NGRP=1. If NC=0, MAXFIT calcu-
lates class limits based on value of BEGIN and STEP. The last
class calculated by MAXFIT contains the maximum observed data
point.
BEGIN = The lower boundary of the first class for which predictions are
desired. Default value is slightly less than the observed data
minimum value.
-------
STEP = The class width. Default value is the (data range)/12.
TOL = primary cutoff point used in the searching subroutines.
Optional; default value is .0001.
TITLE = an array of length 10 which contains a 60-character title in elements
of length 6. Default value is blanks.
N = number of classes for which the user wants predictions if NGRP=1,
or the number of observations if NGRP=0.
If the user specifies the class limits, he gets predicted frequencies
for the range of classes he specifies and the variable N should indicate
the number of classes covered by that range. If the user does not specify
classes, the program creates class limits based on the values of BEGIN and
STEP. The variable TOL is used as the measure of precision desired in the
maximum likelihood estimation processes where iteration is required. The
available distributions and their identifying numbers are in Table 1.
TABLE 1. AVAILABLE DISTRIBUTIONS AND IDENTIFYING NUMBERS
Distribution Name     Distribution Number
Normal                1
Lognormal             2
Gamma                 3
Weibull               4
SB                    5
Beta                  6
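The default class construction described for BEGIN and STEP can be sketched in Python as follows. This is an illustration of the stated defaults only, not the MAXFIT code; the function name is hypothetical, and the size of the offset that makes BEGIN "slightly less than" the data minimum is an assumption.

```python
def class_limits(data, begin=None, step=None):
    """Return class limits so that the last class contains the data maximum."""
    lo, hi = min(data), max(data)
    if step is None:
        step = (hi - lo) / 12            # documented default: (data range)/12
    if begin is None:
        begin = lo - 1e-6 * (hi - lo)    # assumed offset below the minimum
    limits = [begin]
    # Keep adding limits until the newest upper limit lies above the maximum.
    while limits[-1] <= hi:
        limits.append(begin + len(limits) * step)
    return limits

# Hypothetical data spanning the range reported for the examples (ppm).
limits = class_limits([0.007, 0.020, 0.059])
n_classes = len(limits) - 1
```

With the defaults this gives 13 classes, which matches the 13 finite classes between .007 and .063 in the example printouts; MAXFIT then adds the two outer classes that cover the rest of the range of the fitted distribution.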
The program provides the following output for each data set:
1. Observed minimum and maximum data value, number of observations, mean,
variance, and standard deviation, index of skewness ($\sqrt{\beta_1}$), skewness
squared ($\beta_1$) and index of kurtosis ($\beta_2$).
2. For each distribution fitted:
   - Estimated parameters for the distribution (notations and descriptions
     of the distributions are given in Appendix A).
   - By classes of the variable, the observed and predicted probabilities,
     the difference between them (residual), the cumulative
     observed and predicted probabilities, and the observed and predicted
     frequencies and their difference (residual frequency).
   - A CALCOMP plot with the density of the fitted distribution
     superimposed upon the histogram of the data.
-------
3. A comparison of the distributions and their ranking from best (1) to
worst (6) in terms of absolute deviations, weighted absolute deviations,
Chi-square, Kolmogorov-Smirnov, Cramer-von Mises-Smirnov, and log
likelihood statistics.
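The ranking in item 3 can be reproduced from the printed statistics. In the Python sketch below, smaller values rank better for the deviation and distance statistics, while larger values rank better for the log-likelihood and the observed significance level; the function name is hypothetical, and the values are those printed for Example 1.

```python
def rank(values, larger_is_better=False):
    """Map each distribution name to its rank, 1 being the best fit."""
    order = sorted(values, key=values.get, reverse=larger_is_better)
    return {name: position + 1 for position, name in enumerate(order)}

# Log-likelihood and Kolmogorov-Smirnov values printed for Example 1.
loglik = {"normal": 283.06338, "lognormal": 288.68319, "gamma": 289.23631,
          "weibull": 290.17593, "sb": 289.74790, "beta": 290.12675}
ks = {"normal": 0.087066924, "lognormal": 0.065347441, "gamma": 0.061095604,
      "weibull": 0.051969778, "sb": 0.048101939, "beta": 0.044699905}

loglik_rank = rank(loglik, larger_is_better=True)
ks_rank = rank(ks)
```

For these data the Weibull model ranks first on log-likelihood and the beta model ranks first on the Kolmogorov-Smirnov statistic, as in the comparison table for Example 1.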
These statistics are ranked not as goodness-of-fit statistics, but as
absolute measurements of fit. They may be used to calculate valid goodness-
of-fit statistics when degrees of freedom are taken into account. The value
of the observed significance level is printed out for the Chi-square goodness-
of-fit test defined as the probability that a Chi-square random variable
will be greater than the observed value. The observed value is calculated
after regrouping classes to ensure that the predicted frequency is greater than 5.
If N is the number of classes used to compute the statistic, the degrees of
freedom for each model are as follows:
normal                  N-3
3-parameter lognormal   N-4
3-parameter Gamma       N-4
3-parameter Weibull     N-4
Johnson SB              N-5
4-parameter Beta        N-5.
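The regrouping and the degrees of freedom listed above can be illustrated with a short Python sketch. The pooling rule used here (merge adjacent classes from the left until the pooled predicted frequency exceeds 5, then fold any leftover tail into the last pooled class) is an assumption, since the text does not state how MAXFIT regroups; the function names are hypothetical. The frequencies are those of the normal fit for Example 1, with the second predicted frequency taken as 5.480, the value consistent with its residual of 2.520.

```python
def regroup(obs, exp, min_exp=5.0):
    """Pool adjacent classes until each pooled predicted frequency > min_exp."""
    g_obs, g_exp = [], []
    acc_o = acc_e = 0.0
    for o, e in zip(obs, exp):
        acc_o += o
        acc_e += e
        if acc_e > min_exp:
            g_obs.append(acc_o)
            g_exp.append(acc_e)
            acc_o = acc_e = 0.0
    if acc_e > 0 and g_exp:  # leftover tail joins the last pooled class
        g_obs[-1] += acc_o
        g_exp[-1] += acc_e
    return g_obs, g_exp

def chi_square(obs, exp, n_params):
    """Return the statistic and N - 1 - (number of fitted parameters)."""
    g_obs, g_exp = regroup(obs, exp)
    stat = sum((o - e) ** 2 / e for o, e in zip(g_obs, g_exp))
    return stat, len(g_obs) - 1 - n_params

obs = [0, 8, 11, 18, 16, 5, 12, 12, 4, 2, 1, 1, 1, 1, 0]
exp = [5.237, 5.480, 8.659, 11.788, 13.827, 13.973, 12.167, 9.128,
       5.900, 3.286, 1.577, 0.652, 0.232, 0.071, 0.024]
stat, dof = chi_square(obs, exp, n_params=2)  # normal: N - 3
```

Under this pooling rule the normal fit is left with N = 10 classes and therefore 7 degrees of freedom.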
EXAMPLES OF OUTPUT
Description of Data Sets
A short-term study was conducted by the Environmental Monitoring
and Support Laboratory (EMSL) at the Los Angeles Catalyst Study (LACS) site
from 7/20/78 to 8/30/78. Its purpose was to quantify the influence of NOx
emissions from automobiles on ozone (O3) and nitrogen dioxide (NO2) measurements
made in close proximity to the freeway (12). The data base was stratified
into low, medium and high O3 categories by the background O3 level at
the upwind site.
Oxides of nitrogen (NOx) and NO2 measurements taken at the upwind LACS
site 3 were used to fit the probability models. The ranges of the two data
sets are equivalent due to the low nitrogen oxide (NO) measurements at the
upwind site. However, the results of fitting probability models to these two
data sets are very different.
The first example is NO2 measurements recorded at site 3. The 92
observations range from 0.007 to 0.059 ppm. The output of MAXFIT is shown in
Figures 1-6 and Table 2. The Weibull distribution appears to be the "best"
model for these data. It has the highest log-likelihood value, the highest
observed significance level of the Chi-square test and does fairly well when
compared to other models by the remaining fitting criteria. The Weibull is
also a useful model since data can be transformed to an exponential distri-
bution and analyses may be performed assuming an exponential distribution.
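The transformation mentioned above can be sketched in Python. The sketch assumes the 3-parameter Weibull distribution function F(x) = 1 - exp(-((x - lower)/scale)^shape), under which the transformed values follow a unit exponential distribution; the function names are hypothetical, and the parameter values are the fitted values printed for Example 1 (lower bound .005694, BETA = .0212, shape 1.7325).

```python
import math

def weibull_to_exponential(xs, lower, scale, shape):
    """Transform Weibull observations to unit-exponential scores."""
    return [((x - lower) / scale) ** shape for x in xs]

def weibull_cdf(x, lower, scale, shape):
    """Distribution function of the assumed 3-parameter Weibull model."""
    if x <= lower:
        return 0.0
    return 1.0 - math.exp(-(((x - lower) / scale) ** shape))

# Fitted values from the Weibull output for Example 1.
LOWER, SCALE, SHAPE = 0.005694, 0.0212, 1.7325
xs = [0.007, 0.016, 0.024, 0.037, 0.059]   # illustrative concentrations (ppm)
ys = weibull_to_exponential(xs, LOWER, SCALE, SHAPE)
```

Each transformed value y satisfies 1 - exp(-y) = F(x), so methods that assume an exponential distribution can be applied to the transformed data.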
The second example is NOx measurements recorded at site 3. Again, the
92 observations range from 0.007 to 0.059 ppm. The output of MAXFIT for this
-------
LACS SITE 3 NO2 HIGH O3 LEVEL
OBS.MIN.X = .007000       MEAN = .0246             INDEX OF SKEWNESS = .7315
OBS.MAX.X = .059000       VARIANCE = .0001         INDEX OF KURTOSIS = 3.3555
NO. OF OBS. = 92.0000     STANDARD DEV. = .0112    SKEWNESS SQUARED = .5351
NORMAL DISTRIBUTION
MEAN = .0246
VARIANCE = .0001
STANDARD DEVIATION = .0112
       X1 - X2       OBSERVED  PREDICTED   RESIDUAL  CUMULATIVE  CUMULATIVE   OBSERVED  PREDICTED   RESIDUAL
                                                       OBSERVED   PREDICTED  FREQUENCY  FREQUENCY  FREQUENCY
-INFINITY -  .007     .000000    .056922   -.056922     .000000     .056922       .000      5.237     -5.237
     .007 -  .011     .086957    .059563    .027394     .086957     .116485      8.000      5.480      2.520
     .011 -  .016     .119565    .094116    .025449     .206522     .210601     11.000      8.659      2.341
     .016 -  .020     .195652    .128129    .067523     .402174     .338730     18.000     11.788      6.212
     .020 -  .024     .173913    .150290    .023623     .576087     .489020     16.000     13.827      2.173
     .024 -  .029     .054348    .151885   -.097538     .630435     .640905      5.000     13.973     -8.973
     .029 -  .033     .130435    .132252   -.001817     .760870     .773157     12.000     12.167      -.167
     .033 -  .037     .130435    .099218    .031217     .891304     .872375     12.000      9.128      2.872
     .037 -  .042     .043478    .064132   -.020654     .934783     .936507      4.000      5.900     -1.900
     .042 -  .046     .021739    .035715   -.013976     .956522     .972222      2.000      3.286     -1.286
     .046 -  .050     .010870    .017136   -.006266     .967391     .989358      1.000      1.577      -.577
     .050 -  .055     .010870    .007083    .003786     .978261     .996441      1.000       .652       .348
     .055 -  .059     .010870    .002523    .008347     .989130     .998964      1.000       .232       .768
     .059 -  .063     .010870    .000774    .010096    1.000000     .999738      1.000       .071       .929
     .063 - INFINITY  .000000    .000262   -.000262    1.000000    1.000000       .000       .024      -.024
[Plot: histogram of the data with the fitted normal density superimposed, F(X) versus X; MU = 0.0246, SIGMA2 = 0.0001]
FIGURE 1 MAXFIT OUTPUT: FIT OF NORMAL DISTRIBUTION FOR DATA IN EXAMPLE 1
-------
LACS SITE 3 NO2 HIGH O3 LEVEL
LOG NORMAL DISTRIBUTION
MEAN = -3.5020
VARIANCE = .1213
STANDARD DEVIATION = .3483
SEARCHED LOWER BOUND = -.007367
       X1 - X2       OBSERVED  PREDICTED   RESIDUAL  CUMULATIVE  CUMULATIVE   OBSERVED  PREDICTED   RESIDUAL
                                                       OBSERVED   PREDICTED  FREQUENCY  FREQUENCY  FREQUENCY
    -.007 -  .007     .000000    .016618   -.016618     .000000     .016618       .000      1.529     -1.529
     .007 -  .011     .086957    .068441    .018516     .086957     .085059      8.000      6.297      1.703
     .011 -  .016     .119565    .134676   -.015111     .206522     .219735     11.000     12.390     -1.390
     .016 -  .020     .195652    .170804    .024848     .402174     .390539     18.000     15.714      2.286
     .020 -  .024     .173913    .166827    .007086     .576087     .557366     16.000     15.348       .652
     .024 -  .029     .054348    .138416   -.084068     .630435     .695782      5.000     12.734     -7.734
     .029 -  .033     .130435    .103334    .027101     .760870     .799116     12.000      9.507      2.493
     .033 -  .037     .130435    .071926    .058509     .891304     .871042     12.000      6.617      5.383
     .037 -  .042     .043478    .047761   -.004283     .934783     .918803      4.000      4.394      -.394
     .042 -  .046     .021739    .030723   -.008984     .956522     .949526      2.000      2.827      -.827
     .046 -  .050     .010870    .019347   -.008478     .967391     .968874      1.000      1.780      -.780
     .050 -  .055     .010870    .012015   -.001145     .978261     .980889      1.000      1.105      -.105
     .055 -  .059     .010870    .007397    .003472     .989130     .988286      1.000       .681       .319
     .059 -  .063     .010870    .004532    .006338    1.000000     .992818      1.000       .417       .583
     .063 - INFINITY  .000000    .007182   -.007182    1.000000    1.000000       .000       .661      -.661
[Plot: histogram of the data with the fitted 3-parameter lognormal (LNORM3) density superimposed, F(X) versus X; MU = -3.5020, SIGMA2 = 0.1213, LOWER BOUND = -0.0074]
FIGURE 2 MAXFIT OUTPUT: FIT OF 3-PARAMETER LOG NORMAL DISTRIBUTION FOR DATA IN EXAMPLE 1
-------
LACS SITE 3 NO2 HIGH O3 LEVEL
GAMMA DISTRIBUTION
ALPHA = 3.1864
BETA = .0066
SEARCHED LOWER BOUND = .003606
       X1 - X2       OBSERVED  PREDICTED   RESIDUAL  CUMULATIVE  CUMULATIVE   OBSERVED  PREDICTED   RESIDUAL
                                                       OBSERVED   PREDICTED  FREQUENCY  FREQUENCY  FREQUENCY
     .004 -  .007     .000000    .010529   -.010529     .000000     .010529       .000       .969      -.969
     .007 -  .011     .086957    .079766    .007190     .086957     .090295      8.000      7.338       .662
     .011 -  .016     .119565    .145624   -.026059     .206522     .235919     11.000     13.397     -2.397
     .016 -  .020     .195652    .168444    .027208     .402174     .404363     18.000     15.497      2.503
     .020 -  .024     .173913    .157326    .016588     .576087     .561688     16.000     14.474      1.526
     .024 -  .029     .054348    .129842   -.075494     .630435     .691530      5.000     11.945     -6.945
     .029 -  .033     .130435    .098862    .031573     .760870     .790393     12.000      9.095      2.905
     .033 -  .037     .130435    .071127    .059308     .891304     .861520     12.000      6.544      5.456
     .037 -  .042     .043478    .049068   -.005590     .934783     .910588      4.000      4.514      -.514
     .042 -  .046     .021739    .032771   -.011032     .956522     .943359      2.000      3.015     -1.015
     .046 -  .050     .010870    .021331   -.010461     .967391     .964690      1.000      1.962      -.962
     .050 -  .055     .010870    .013597   -.002727     .978261     .978286      1.000      1.251      -.251
     .055 -  .059     .010870    .008518    .002352     .989130     .986804      1.000       .784       .216
     .059 -  .063     .010870    .005258    .005611    1.000000     .992062      1.000       .484       .516
     .063 - INFINITY  .000000    .007938   -.007938    1.000000    1.000000       .000       .730      -.730
[Plot: histogram of the data with the fitted gamma density superimposed, F(X) versus X; LOWER BOUND = 0.0036, ALPHA = 3.1864, BETA = 0.0066]
FIGURE 3 MAXFIT OUTPUT: FIT OF GAMMA DISTRIBUTION FOR DATA IN EXAMPLE 1
-------
LACS SITE 3 NO2 HIGH O3 LEVEL
WEIBULL DISTRIBUTION
BETA = .0212
C = 1.7325
SEARCHED LOWER BOUND = .005694
       X1 - X2       OBSERVED  PREDICTED   RESIDUAL  CUMULATIVE  CUMULATIVE   OBSERVED  PREDICTED   RESIDUAL
                                                       OBSERVED   PREDICTED  FREQUENCY  FREQUENCY  FREQUENCY
     .006 -  .007     .000000    .007848   -.007848     .000000     .007848       .000       .722      -.722
     .007 -  .011     .086957    .087676   -.000719     .086957     .095524      8.000      8.066      -.066
     .011 -  .016     .119565    .141050   -.021485     .206522     .236574     11.000     12.977     -1.977
     .016 -  .020     .195652    .159711    .035941     .402174     .396285     18.000     14.693      3.307
     .020 -  .024     .173913    .153650    .020263     .576087     .549935     16.000     14.136      1.864
     .024 -  .029     .054348    .132469   -.078121     .630435     .682405      5.000     12.187     -7.187
     .029 -  .033     .130435    .104811    .025624     .760870     .787216     12.000      9.643      2.357
     .033 -  .037     .130435    .077111    .053324     .891304     .864326     12.000      7.094      4.906
     .037 -  .042     .043478    .053188   -.009710     .934783     .917515      4.000      4.893      -.893
     .042 -  .046     .021739    .034591   -.012852     .956522     .952105      2.000      3.182     -1.182
     .046 -  .050     .010870    .021298   -.010429     .967391     .973404      1.000      1.959      -.959
     .050 -  .055     .010870    .012454   -.001585     .978261     .985858      1.000      1.146      -.146
     .055 -  .059     .010870    .006934    .003936     .989130     .992792      1.000       .638       .362
     .059 -  .063     .010870    .003683    .007186    1.000000     .996475      1.000       .339       .661
     .063 - INFINITY  .000000    .003525   -.003525    1.000000    1.000000       .000       .324      -.324
[Plot: histogram of the data with the fitted Weibull density superimposed, F(X) versus X; LOWER BOUND = 0.0057, ALPHA = 1.7325, BETA = 0.0212]
FIGURE 4 MAXFIT OUTPUT: FIT OF WEIBULL DISTRIBUTION FOR DATA IN EXAMPLE 1
-------
LACS SITE 3 NO2 HIGH O3 LEVEL
SB DISTRIBUTION
MEAN = -1.0364
VARIANCE = .6181
MAX = .079366
MIN = .002743
       X1 - X2       OBSERVED  PREDICTED   RESIDUAL  CUMULATIVE  CUMULATIVE   OBSERVED  PREDICTED   RESIDUAL
                                                       OBSERVED   PREDICTED  FREQUENCY  FREQUENCY  FREQUENCY
     .003 -  .007     .000000    .011040   -.011040     .000000     .011040       .000      1.016     -1.016
     .007 -  .011     .086957    .083043    .003913     .086957     .094084      8.000      7.640       .360
     .011 -  .016     .119565    .144029   -.024464     .206522     .238112     11.000     13.251     -2.251
     .016 -  .020     .195652    .161326    .034327     .402174     .399437     18.000     14.842      3.158
     .020 -  .024     .173913    .150945    .022968     .576087     .550382     16.000     13.887      2.113
     .024 -  .029     .054348    .128155   -.073807     .630435     .678537      5.000     11.790     -6.790
     .029 -  .033     .130435    .101948    .028486     .760870     .780485     12.000      9.379      2.621
     .033 -  .037     .130435    .076939    .053496     .891304     .857424     12.000      7.078      4.922
     .037 -  .042     .043478    .055223   -.011744     .934783     .912646      4.000      5.080     -1.080
     .042 -  .046     .021739    .037540   -.015800     .956522     .950186      2.000      3.454     -1.454
     .046 -  .050     .010870    .023917   -.013047     .967391     .974103      1.000      2.200     -1.200
     .050 -  .055     .010870    .014018   -.003148     .978261     .988120      1.000      1.290      -.290
     .055 -  .059     .010870    .007325    .003545     .989130     .995445      1.000       .674       .326
     .059 -  .063     .010870    .003231    .007638    1.000000     .998676      1.000       .297       .703
     .063 -  .079     .000000    .001324   -.001324    1.000000    1.000000       .000       .122      -.122
[Plot: histogram of the data with the fitted SB density superimposed, F(X) versus X; MU = -1.0364, SIGMA2 = 0.6181, LOWER BOUND = 0.0027, UPPER BOUND = 0.0794]
FIGURE 5 MAXFIT OUTPUT: FIT OF JOHNSON SB DISTRIBUTION FOR DATA IN EXAMPLE 1
-------
LACS SITE 3 NO2 HIGH O3 LEVEL
BETA DISTRIBUTION
ALPHA = 1.8673909
BETA = 5.5897199
MAX = .081378
MIN = .005647
       X1 - X2       OBSERVED  PREDICTED   RESIDUAL  CUMULATIVE  CUMULATIVE   OBSERVED  PREDICTED   RESIDUAL
                                                       OBSERVED   PREDICTED  FREQUENCY  FREQUENCY  FREQUENCY
     .006 -  .007     .000000    .008148   -.008148     .000000     .008148       .000       .750      -.750
     .007 -  .011     .086957    .092974   -.006018     .086957     .101122      8.000      8.554      -.554
     .011 -  .016     .119565    .143431   -.023866     .206522     .244553     11.000     13.196     -2.196
     .016 -  .020     .195652    .155967    .039685     .402174     .400520     18.000     14.349      3.651
     .020 -  .024     .173913    .147240    .026673     .576087     .547760     16.000     13.546      2.454
     .024 -  .029     .054348    .127375   -.073027     .630435     .675135      5.000     11.718     -6.718
     .029 -  .033     .130435    .103017    .027417     .760870     .778152     12.000      9.478      2.522
     .033 -  .037     .130435    .078469    .051966     .891304     .856621     12.000      7.219      4.781
     .037 -  .042     .043478    .056330   -.012852     .934783     .912951      4.000      5.182     -1.182
     .042 -  .046     .021739    .037954   -.016215     .956522     .950906      2.000      3.492     -1.492
     .046 -  .050     .010870    .023793   -.012923     .967391     .974699      1.000      2.189     -1.189
     .050 -  .055     .010870    .013676   -.002806     .978261     .988375      1.000      1.258      -.258
     .055 -  .059     .010870    .007042    .003827     .989130     .995417      1.000       .648       .352
     .059 -  .063     .010870    .003129    .007741    1.000000     .998546      1.000       .288       .712
     .063 -  .081     .000000    .001454   -.001454    1.000000    1.000000       .000       .134      -.134
[Plot: histogram of the data with the fitted beta density superimposed, F(X) versus X; LOWER BOUND = 0.0056, UPPER BOUND = 0.0814, ALPHA = 1.8674, BETA = 5.5897]
FIGURE 6 MAXFIT OUTPUT: FIT OF BETA DISTRIBUTION FOR DATA IN EXAMPLE 1
-------
LACS SITE 3 NO2 HIGH O3 LEVEL
                             NORMAL         LOG NORMAL     GAMMA          WEIBULL        SB             BETA
ABSOLUTE DEVIATION           .36327934+002  .26840004+002  .27568650+002  .26914336+002  .28404869+002  .28944865+002
                             (6)            (1)            (3)            (2)            (4)            (5)
WEIGHTED ABS. DEVIATION      .36439217+001  .26141019+001  .27436755+001  .27989979+001  .28615729+001  .29539632+001
                             (6)            (1)            (2)            (3)            (4)            (5)
CHI-SQUARE (*)               .12294319-001  .77316610-001  .49182300-001  .97144187-001  .36629757-001  .34430578-001
                             (6)            (2)            (3)            (1)            (4)            (5)
KOLMOGOROV-SMIRNOV           .87066924-001  .65347441-001  .61095604-001  .51969778-001  .48101939-001  .44699905-001
                             (6)            (5)            (4)            (3)            (2)            (1)
CRAMER-VON MISES-SMIRNOV     .18261896+000  .82409978-001  .76377309-001  .68976456-001  .64916543-001  .69541976-001
                             (6)            (5)            (4)            (2)            (1)            (3)
LOG LIKELIHOOD               .28306338+003  .28868319+003  .28923631+003  .29017593+003  .28974790+003  .29012675+003
                             (6)            (5)            (4)            (1)            (3)            (2)
* OBSERVED SIGNIFICANCE LEVEL
TABLE 2 MAXFIT OUTPUT: COMPARISON STATISTICS FOR DATA IN EXAMPLE 1
-------
data set is shown in Figures 7-12 and Table 3. For this data set there is
no obvious "best" model. The beta distribution has the highest log-likelihood
value, but the gamma distribution has the lowest Kolmogorov-Smirnov statistic,
and the 3-parameter lognormal ranks first on both the chi-square test and the
weighted absolute deviation. The 3-parameter lognormal distribution would
therefore probably serve as the "best" model in this example, since it is a
simple model to work with and scored reasonably well on several fit criteria.
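The comparison above can be retraced from the figures in Table 3. The following is a minimal Python sketch, not part of MAXFIT; the names `CRITERIA`, `rank`, and `best` are illustrative. It assumes that smaller is better for every criterion except the chi-square significance level and the log likelihood, which is the direction that reproduces the ranks printed in the table.

```python
# Criterion values transcribed from Table 3 (Example 2).
DISTRIBUTIONS = ["normal", "lognormal", "gamma", "weibull", "sb", "beta"]

# name -> (values in DISTRIBUTIONS order, True if a larger value is better)
CRITERIA = {
    "absolute deviation": (
        [39.876353, 37.706056, 38.209408, 36.539779, 37.849178, 37.050406], False),
    "weighted absolute deviation": (
        [4.0290285, 3.9116101, 4.0558162, 4.0038714, 4.0795364, 4.0345331], False),
    "chi-square significance": (
        [0.00080831527, 0.027928927, 0.023569159,
         0.0091304910, 0.011633393, 0.0092297671], True),
    "kolmogorov-smirnov": (
        [0.10548994, 0.061872242, 0.057881671,
         0.059540748, 0.060575323, 0.065328118], False),
    "cramer-von mises-smirnov": (
        [0.25241871, 0.11858179, 0.11282250,
         0.11458265, 0.10825762, 0.11509788], False),
    "log likelihood": (
        [281.21652, 287.41674, 287.92753, 288.59950, 288.46644, 288.60871], True),
}


def rank(values, higher_is_better):
    """Return 1-based ranks, with rank 1 going to the best value."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


ranks = {name: dict(zip(DISTRIBUTIONS, rank(values, higher)))
         for name, (values, higher) in CRITERIA.items()}
best = {name: min(r, key=r.get) for name, r in ranks.items()}

for name, winner in best.items():
    print(f"{name:30s} best: {winner}")
```

No single distribution wins on every criterion, which is why the choice in the text also weighs how simple the model is to work with.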
-------
LACS SITE 3 NOX HIGH O3 LEVEL
OBS.MIN.X = .007000
OBS.MAX.X = .059000
NO. OF OBS. = 92.0000
MEAN = .0252
VARIANCE = .0001
STANDARD DEV. = .0114
INDEX OF SKEWNESS = .7463
INDEX OF KURTOSIS = 3.2460
SKEWNESS SQUARED = .5570

NORMAL DISTRIBUTION
MEAN = .0252
VARIANCE = .0001
STANDARD DEVIATION = .0114

X1 - X2            OBSERVED  PREDICTED   RESIDUAL CUMULATIVE CUMULATIVE   OBSERVED  PREDICTED   RESIDUAL
                                                    OBSERVED  PREDICTED  FREQUENCY  FREQUENCY  FREQUENCY
-INFINITY- .007     .000000    .055187   -.055187    .000000    .055187       .000      5.077     -5.077
.007- .011          .086957    .056838    .030118    .086957    .112025      8.000      5.229      2.771
.011- .016          .076087    .089794   -.013707    .163043    .201819      7.000      8.261     -1.261
.016- .020          .239130    .122932    .116199    .402174    .324750     22.000     11.310     10.690
.020- .024          .173913    .145847    .028066    .576087    .470597     16.000     13.418      2.582
.024- .029          .043478    .149950   -.106472    .619565    .620547      4.000     13.795     -9.795
.029- .033          .130435    .133602   -.003167    .750000    .754149     12.000     12.291      -.291
.033- .037          .097826    .103156   -.005330    .847826    .857305      9.000      9.490      -.490
.037- .042          .086957    .069023    .017934    .934783    .926328      8.000      6.350      1.650
.042- .046          .010870    .040022   -.029152    .945652    .966350      1.000      3.682     -2.682
.046- .050          .021739    .020110    .001629    .967391    .986460      2.000      1.850       .150
.050- .055          .010870    .008756    .002114    .978261    .995216      1.000       .806       .194
.055- .059          .000000    .003304   -.003304    .978261    .998519       .000       .304      -.304
.059- .063          .021739    .001080    .020659   1.000000    .999599      2.000       .099      1.901
.063-INFINITY       .000000    .000401   -.000401   1.000000   1.000000       .000       .037      -.037

LACS SITE 3 NOX HIGH O3 LEVEL
NORMAL DISTRIBUTION
[Plot: histogram of the data with the fitted normal density superimposed. Horizontal axis X, .007 to .063.]
MU = 0.0252
SIGMA^2 = 0.0001
FIGURE 7 MAXFIT OUTPUT: FIT OF NORMAL DISTRIBUTION TO DATA IN EXAMPLE 2.
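The PREDICTED and PREDICTED FREQUENCY columns of the normal fit can be approximated from the printed parameters alone. The Python sketch below is an illustration, not the MAXFIT code. It assumes the class edges start at the observed minimum less 1.D-5 and advance in steps of one twelfth of the observed range, as in the Appendix B routine. Because the printed mean and standard deviation are rounded to four decimals, the results agree with the table only to about three decimals.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of the normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def class_probabilities(mu, sigma, begin, step, n_inner):
    """Probability of each class: one open class below `begin`,
    n_inner classes of width `step`, and one open class above."""
    edges = [begin + k * step for k in range(n_inner + 1)]
    cdf = [0.0] + [normal_cdf(e, mu, sigma) for e in edges] + [1.0]
    return [cdf[i + 1] - cdf[i] for i in range(len(cdf) - 1)]


# Rounded values printed in the Figure 7 output.
mu, sigma = 0.0252, 0.0114
x_min, x_max, n_obs = 0.007, 0.059, 92

step = (x_max - x_min) / 12.0
probs = class_probabilities(mu, sigma, x_min - 1e-5, step, 13)
predicted_freq = [n_obs * p for p in probs]

for p, f in zip(probs, predicted_freq):
    print(f"{p:9.6f} {f:8.3f}")
```

Each RESIDUAL entry is then the observed proportion minus the predicted proportion for the class, and each RESIDUAL FREQUENCY entry is the observed count minus the predicted count.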
-------
LACS SITE 3 NOX HIGH O3 LEVEL
LOG NORMAL DISTRIBUTION
MEAN = -3.5370
VARIANCE = .1337
STANDARD DEVIATION = .3657
SEARCHED LOWER BOUND = -.005922

X1 - X2            OBSERVED  PREDICTED   RESIDUAL CUMULATIVE CUMULATIVE   OBSERVED  PREDICTED   RESIDUAL
                                                    OBSERVED  PREDICTED  FREQUENCY  FREQUENCY  FREQUENCY
-.006- .007         .000000    .013136   -.013136    .000000    .013136       .000      1.209     -1.209
.007- .011          .086957    .063109    .023847    .086957    .076246      8.000      5.806      2.194
.011- .016          .076087    .130498   -.054411    .163043    .206744      7.000     12.006     -5.006
.016- .020          .239130    .168756    .070374    .402174    .375500     22.000     15.526      6.474
.020- .024          .173913    .166534    .007379    .576087    .542034     16.000     15.321       .679
.024- .029          .043478    .139403   -.095925    .619565    .681437      4.000     12.825     -8.825
.029- .033          .130435    .105139    .025296    .750000    .786576     12.000      9.673      2.327
.033- .037          .097826    .074096    .023730    .847826    .860672      9.000      6.817      2.183
.037- .042          .086957    .049934    .037023    .934783    .910606      8.000      4.594      3.406
.042- .046          .010870    .032668   -.021799    .945652    .943274      1.000      3.005     -2.005
.046- .050          .021739    .020962    .000777    .967391    .964237      2.000      1.929       .071
.050- .055          .010870    .013287   -.002417    .978261    .977523      1.000      1.222      -.222
.055- .059          .000000    .008360   -.008360    .978261    .985883       .000       .769      -.769
.059- .063          .021739    .005241    .016498   1.000000    .991124      2.000       .482      1.518
.063-INFINITY       .000000    .008876   -.008876   1.000000   1.000000       .000       .817      -.817

LACS SITE 3 NOX HIGH O3 LEVEL
LNORM3 DISTRIBUTION
[Plot: histogram of the data with the fitted 3-parameter lognormal density superimposed. Horizontal axis X, .007 to .063.]
MU = -3.5370
SIGMA^2 = 0.1337
LOWER BOUND = -0.0059
FIGURE 8 MAXFIT OUTPUT: FIT OF 3 PARAMETER LOG NORMAL DISTRIBUTION FOR DATA IN EXAMPLE 2.
-------
LACS SITE 3 NOX HIGH O3 LEVEL
GAMMA DISTRIBUTION
ALPHA = 3.1819
BETA = .0067
SEARCHED LOWER BOUND = .003853

X1 - X2            OBSERVED  PREDICTED   RESIDUAL CUMULATIVE CUMULATIVE   OBSERVED  PREDICTED   RESIDUAL
                                                    OBSERVED  PREDICTED  FREQUENCY  FREQUENCY  FREQUENCY
.004- .007          .000000    .008224   -.008224    .000000    .008224       .000       .757      -.757
.007- .011          .086957    .072925    .014031    .086957    .081149      8.000      6.709      1.291
.011- .016          .076087    .139776   -.063689    .163043    .220925      7.000     12.859     -5.859
.016- .020          .239130    .165745    .073385    .402174    .386670     22.000     15.249      6.751
.020- .024          .173913    .157521    .016392    .576087    .544192     16.000     14.492      1.508
.024- .029          .043478    .131863   -.088385    .619565    .676055      4.000     12.131     -8.131
.029- .033          .130435    .101669    .028766    .750000    .777724     12.000      9.354      2.646
.033- .037          .097826    .074000    .023826    .847826    .851724      9.000      6.808      2.192
.037- .042          .086957    .051615    .035342    .934783    .903339      8.000      4.749      3.251
.042- .046          .010870    .034840   -.023970    .945652    .938179      1.000      3.205     -2.205
.046- .050          .021739    .022913   -.001174    .967391    .961092      2.000      2.108      -.108
.050- .055          .010870    .014754   -.003884    .978261    .975846      1.000      1.357      -.357
.055- .059          .000000    .009335   -.009335    .978261    .985181       .000       .859      -.859
.059- .063          .021739    .005820    .015919   1.000000    .991002      2.000       .535      1.465
.063-INFINITY       .000000    .008998   -.008998   1.000000   1.000000       .000       .828      -.828

LACS SITE 3 NOX HIGH O3 LEVEL
GAMMA DISTRIBUTION
[Plot: histogram of the data with the fitted gamma density superimposed. Horizontal axis X, .007 to .063.]
LOWER BOUND = 0.0039
ALPHA = 3.1819
BETA = 0.0067
FIGURE 9 MAXFIT OUTPUT: FIT OF GAMMA DISTRIBUTION FOR DATA IN EXAMPLE 2.
-------
LACS SITE 3 NOX HIGH O3 LEVEL
WEIBULL DISTRIBUTION
BETA = .0218
C = 1.7517
SEARCHED LOWER BOUND = .005742

X1 - X2            OBSERVED  PREDICTED   RESIDUAL CUMULATIVE CUMULATIVE   OBSERVED  PREDICTED   RESIDUAL
                                                    OBSERVED  PREDICTED  FREQUENCY  FREQUENCY  FREQUENCY
.006- .007          .000000    .006655   -.006655    .000000    .006655       .000       .612      -.612
.007- .011          .086957    .081266    .005691    .086957    .087921      8.000      7.476       .524
.011- .016          .076087    .134664   -.058577    .163043    .222584      7.000     12.389     -5.389
.016- .020          .239130    .155648    .083482    .402174    .378232     22.000     14.320      7.680
.020- .024          .173913    .152516    .021397    .576087    .530748     16.000     14.031      1.969
.024- .029          .043478    .133809   -.090331    .619565    .664557      4.000     12.310     -8.310
.029- .033          .130435    .107686    .022749    .750000    .772243     12.000      9.907      2.093
.033- .037          .097826    .080556    .017271    .847826    .852799      9.000      7.411      1.589
.037- .042          .086957    .056479    .030477    .934783    .909278      8.000      5.196      2.804
.042- .046          .010870    .037324   -.026454    .945652    .946602      1.000      3.434     -2.434
.046- .050          .021739    .023344   -.001604    .967391    .969945      2.000      2.148      -.148
.050- .055          .010870    .013861   -.002991    .978261    .983806      1.000      1.275      -.275
.055- .059          .000000    .007832   -.007832    .978261    .991638       .000       .721      -.721
.059- .063          .021739    .004221    .017519   1.000000    .995859      2.000       .388      1.612
.063-INFINITY       .000000    .004141   -.004141   1.000000   1.000000       .000       .381      -.381

LACS SITE 3 NOX HIGH O3 LEVEL
WEIBULL DISTRIBUTION
[Plot: histogram of the data with the fitted Weibull density superimposed. Horizontal axis X, .007 to .063; vertical axis 0.00 to 45.15.]
LOWER BOUND = 0.0057
ALPHA = 1.7517
BETA = 0.0218
FIGURE 10 MAXFIT OUTPUT: FIT OF WEIBULL DISTRIBUTION FOR DATA IN EXAMPLE 2
-------
LACS SITE 3 NOX HIGH O3 LEVEL
SB DISTRIBUTION
MEAN = -1.0339
VARIANCE = .6214
MAX = .080503
MIN = .002990

X1 - X2            OBSERVED  PREDICTED   RESIDUAL CUMULATIVE CUMULATIVE   OBSERVED  PREDICTED   RESIDUAL
                                                    OBSERVED  PREDICTED  FREQUENCY  FREQUENCY  FREQUENCY
.003- .007          .000000    .008625   -.008625    .000000    .008625       .000       .794      -.794
.007- .011          .086957    .076215    .010741    .086957    .084841      8.000      7.012       .988
.011- .016          .076087    .138778   -.062691    .163043    .223619      7.000     12.768     -5.768
.016- .020          .239130    .158836    .080294    .402174    .382455     22.000     14.613      7.387
.020- .024          .173913    .150719    .023194    .576087    .533174     16.000     13.866      2.134
.024- .029          .043478    .129460   -.085982    .619565    .662634      4.000     11.910     -7.910
.029- .033          .130435    .104146    .026289    .750000    .766780     12.000      9.581      2.419
.033- .037          .097826    .079535    .018291    .847826    .846315      9.000      7.317      1.683
.037- .042          .086957    .057857    .029099    .934783    .904172      8.000      5.323      2.677
.042- .046          .010870    .039966   -.029096    .945652    .944138      1.000      3.677     -2.677
.046- .050          .021739    .025978   -.004238    .967391    .970116      2.000      2.390      -.390
.050- .055          .010870    .015631   -.004762    .978261    .985747      1.000      1.438      -.438
.055- .059          .000000    .008472   -.008472    .978261    .994219       .000       .779      -.779
.059- .063          .021739    .003946    .017793   1.000000    .998164      2.000       .363      1.637
.063- .081          .000000    .001836   -.001836   1.000000   1.000000       .000       .169      -.169

LACS SITE 3 NOX HIGH O3 LEVEL
SB DISTRIBUTION
[Plot: histogram of the data with the fitted Johnson SB density superimposed. Horizontal axis X, .007 to .063.]
MU = -1.0339
SIGMA^2 = 0.6214
LOWER BOUND = 0.0030
UPPER BOUND = 0.0805
FIGURE 11 MAXFIT OUTPUT: FIT OF JOHNSON SB DISTRIBUTION FOR DATA IN EXAMPLE 2
-------
LACS SITE 3 NOX HIGH O3 LEVEL
BETA DISTRIBUTION
ALPHA = 1.9339036
BETA = 6.7444366
MAX = .083143
MIN = .005647

X1 - X2            OBSERVED  PREDICTED   RESIDUAL CUMULATIVE CUMULATIVE   OBSERVED  PREDICTED   RESIDUAL
                                                    OBSERVED  PREDICTED  FREQUENCY  FREQUENCY  FREQUENCY
.006- .007          .000000    .006710   -.006710    .000000    .006710       .000       .617      -.617
.007- .011          .086957    .084586    .002371    .086957    .091296      8.000      7.782       .218
.011- .016          .076087    .137076   -.060989    .163043    .228372      7.000     12.611     -5.611
.016- .020          .239130    .153161    .085970    .402174    .381532     22.000     14.091      7.909
.020- .024          .173913    .147323    .026590    .576087    .528856     16.000     13.554      2.446
.024- .029          .043478    .129321   -.085842    .619565    .658176      4.000     11.897     -7.897
.029- .033          .130435    .105908    .024527    .750000    .764084     12.000      9.743      2.257
.033- .037          .097826    .081618    .016208    .847826    .845702      9.000      7.509      1.491
.037- .042          .086957    .059291    .027666    .934783    .904992      8.000      5.455      2.545
.042- .046          .010870    .040477   -.029607    .945652    .945469      1.000      3.724     -2.724
.046- .050          .021739    .025777   -.004038    .967391    .971246      2.000      2.371      -.371
.050- .055          .010870    .015120   -.004251    .978261    .986366      1.000      1.391      -.391
.055- .059          .000000    .008008   -.008008    .978261    .994375       .000       .737      -.737
.059- .063          .021739    .003710    .018030   1.000000    .998084      2.000       .341      1.659
.063- .083          .000000    .001916   -.001916   1.000000   1.000000       .000       .176      -.176

LACS SITE 3 NOX HIGH O3 LEVEL
BETA DISTRIBUTION
[Plot: histogram of the data with the fitted beta density superimposed. Horizontal axis X, .007 to .063; vertical axis 0.00 to 45.15.]
LOWER BOUND = 0.0056
UPPER BOUND = 0.0831
ALPHA = 1.9339
BETA = 6.7444
FIGURE 12 MAXFIT OUTPUT: FIT OF BETA DISTRIBUTION FOR DATA IN EXAMPLE 2
-------
LACS SITE 3 NOX HIGH O3 LEVEL

                                       NORMAL         LOG NORMAL              GAMMA            WEIBULL                 SB               BETA
ABSOLUTE DEVIATION          .39876353+002 (6)  .37706056+002 (3)  .38209408+002 (5)  .36539779+002 (1)  .37849178+002 (4)  .37050406+002 (2)
WEIGHTED ABS. DEVIATION     .40290285+001 (3)  .39116101+001 (1)  .40558162+001 (5)  .40038714+001 (2)  .40795364+001 (6)  .40345331+001 (4)
CHI-SQUARE (*)              .80831527-003 (6)  .27928927-001 (1)  .23569159-001 (2)  .91304910-002 (5)  .11633393-001 (3)  .92297671-002 (4)
KOLMOGOROV-SMIRNOV          .10548994+000 (6)  .61872242-001 (4)  .57881671-001 (1)  .59540748-001 (2)  .60575323-001 (3)  .65328118-001 (5)
CRAMER-VON MISES-SMIRNOV    .25241871+000 (6)  .11858179+000 (5)  .11282250+000 (2)  .11458265+000 (3)  .10825762+000 (1)  .11509788+000 (4)
LOG LIKELIHOOD              .28121652+003 (6)  .28741674+003 (5)  .28792753+003 (4)  .28859950+003 (2)  .28846644+003 (3)  .28860871+003 (1)

* OBSERVED SIGNIFICANCE LEVEL
TABLE 3 MAXFIT OUTPUT: COMPARISON STATISTICS FOR DATA IN EXAMPLE 2
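Two of the Table 3 entries can be checked by hand from the gamma output in Figure 9. The Python sketch below is an illustration only. It assumes that the absolute deviation is the sum of the absolute residual frequencies and that the Kolmogorov-Smirnov statistic is the largest absolute difference between the cumulative observed and cumulative predicted columns. Both assumptions reproduce the printed gamma values to the precision of the rounded inputs.

```python
# Columns transcribed from the gamma fit (Figure 9, Example 2).
observed_freq = [0, 8, 7, 22, 16, 4, 12, 9, 8, 1, 2, 1, 0, 2, 0]
predicted_freq = [0.757, 6.709, 12.859, 15.249, 14.492, 12.131, 9.354,
                  6.808, 4.749, 3.205, 2.108, 1.357, 0.859, 0.535, 0.828]
cum_predicted = [0.008224, 0.081149, 0.220925, 0.386670, 0.544192,
                 0.676055, 0.777724, 0.851724, 0.903339, 0.938179,
                 0.961092, 0.975846, 0.985181, 0.991002, 1.000000]

n_obs = sum(observed_freq)

# Cumulative observed proportions at the upper edge of each class.
cum_observed = []
running = 0
for count in observed_freq:
    running += count
    cum_observed.append(running / n_obs)

# Absolute deviation: sum of |observed - predicted| class frequencies.
abs_deviation = sum(abs(o - p) for o, p in zip(observed_freq, predicted_freq))

# Kolmogorov-Smirnov: largest gap between the two cumulative columns.
ks_statistic = max(abs(o - p) for o, p in zip(cum_observed, cum_predicted))

print(f"absolute deviation = {abs_deviation:.3f}")
print(f"Kolmogorov-Smirnov = {ks_statistic:.6f}")
```

The same two sums can be formed for the other five fits by swapping in their PREDICTED FREQUENCY and CUMULATIVE PREDICTED columns.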
-------
SECTION 5
FUTURE DEVELOPMENTS
Obviously MAXFIT is not the answer to all air quality data distribution
problems. Continuing efforts are planned to improve the program in several
areas.
IMPROVEMENT OF THE SOFTWARE'S GROUPED DATA HANDLING CAPACITY
At present, the software treats grouped data as repeated observations
at the class midpoints. This simplistic method causes the estimates to become
unstable when a large proportion of the data falls in one class. The problem
can be avoided by maximizing the probability that the sample values fall in
their observed classes, instead of maximizing a likelihood function evaluated
at the class midpoints.
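The proposed grouped-data approach amounts to maximizing a multinomial likelihood: each class contributes its count times the log of the probability the model assigns to that class. The Python sketch below illustrates this for the normal distribution using the class counts of Example 2. It is an assumption-laden illustration, not the planned MAXFIT code: the function names are invented here, and a coarse grid search stands in for the golden section search the program uses.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of the normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def grouped_log_likelihood(mu, sigma, edges, counts):
    """Sum of n_k * log(P(class k)). `counts` has len(edges) + 1 entries:
    below the first edge, between consecutive edges, above the last edge."""
    cdf = [0.0] + [normal_cdf(e, mu, sigma) for e in edges] + [1.0]
    total = 0.0
    for k, n_k in enumerate(counts):
        if n_k == 0:
            continue  # empty classes contribute nothing
        p_k = cdf[k + 1] - cdf[k]
        if p_k <= 0.0:
            return float("-inf")
        total += n_k * math.log(p_k)
    return total


# Class edges and counts for Example 2 (12 steps across the observed range).
begin = 0.007 - 1e-5
step = (0.059 - 0.007) / 12.0
edges = [begin + k * step for k in range(14)]
counts = [0, 8, 7, 22, 16, 4, 12, 9, 8, 1, 2, 1, 0, 2, 0]

# Coarse grid search over mu and sigma in steps of 0.0001.
loglik_hat, mu_hat, sigma_hat = max(
    (grouped_log_likelihood(m * 1e-4, s * 1e-4, edges, counts), m * 1e-4, s * 1e-4)
    for m in range(200, 301)
    for s in range(80, 161)
)

print(f"mu = {mu_hat:.4f}, sigma = {sigma_hat:.4f}, log likelihood = {loglik_hat:.3f}")
```

Because each class enters through its whole probability rather than through a single midpoint, a class holding most of the data no longer acts like a spike of repeated identical observations.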
TRUNCATION AND CENSORING
If the underlying variate X cannot be observed in part or parts of its
range, the distribution of X is usually termed truncated. In particular, if
observations cannot be recorded below a level R, the variate X is truncated
on the left. Similarly, the distribution of X can be truncated on the right,
or doubly truncated. Observations from any type of truncated distribution
are in actuality drawn from an incomplete distribution. A truncated variate
does not differ from other variates, but is treated in a special way because
its distribution is generated by an underlying untruncated variable.
There are times when an experimenter is forced to put little faith in
sample values when they occur above or below certain values. In these cases,
such an experimenter might not wish to use the information contained in the
data values in estimating parameters. This situation occurs when a value is
below the minimum detectable limits of an instrument or measurement method.
As in truncation, we may have censoring on the left when a certain minimum
response is necessary to make a valid measurement, censoring on the right,
and double censoring. Censoring can be more completely defined by
distinguishing Type I and Type II censoring. Type I censoring is said to occur
when the number of censored observations is a random variable, whereas in
Type II censoring the number of censored observations is fixed. Censoring is
a property of the sample, whereas truncation is a property of the distribution
(14,15).
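The distinction between truncation and censoring shows up directly in the log-likelihood. The Python sketch below uses the textbook forms for a normal variate and is not taken from MAXFIT; the function names and the sample values are made up for illustration. Under double truncation at known points a and b, every density is rescaled by the probability mass between a and b. Under Type I left censoring at a limit R, each censored observation contributes the probability of falling below R.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of the normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def loglik_complete(xs, mu, sigma):
    """Ordinary log-likelihood of fully observed values."""
    return sum(math.log(normal_pdf(x, mu, sigma)) for x in xs)


def loglik_truncated(xs, mu, sigma, a, b):
    """Doubly truncated case: divide each density by the mass on (a, b)."""
    mass = normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)
    return loglik_complete(xs, mu, sigma) - len(xs) * math.log(mass)


def loglik_left_censored(xs, n_censored, limit, mu, sigma):
    """Type I left censoring: n_censored values are known only to lie
    below `limit`, for example a minimum detectable concentration."""
    below = normal_cdf(limit, mu, sigma)
    return loglik_complete(xs, mu, sigma) + n_censored * math.log(below)


# Made-up concentrations, for illustration only.
xs = [0.012, 0.018, 0.021, 0.027, 0.035]
mu, sigma = 0.025, 0.010

full = loglik_complete(xs, mu, sigma)
truncated = loglik_truncated(xs, mu, sigma, 0.005, 0.060)
censored = loglik_left_censored(xs, 2, 0.010, mu, sigma)
print(full, truncated, censored)
```

In the truncated case the sample size counts only the values that could be observed, whereas in the censored case the number of values below the limit is itself part of the data.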
-------
In the truncated and censored cases, parameters of frequency functions
can be estimated by maximum likelihood estimation. For a continuous variate,
with frequency function f(x;θ), the following likelihood function can be
maximized when f(x;θ) is doubly truncated at known points a and b, with a
-------
SECTION 6
SUMMARY
MAXFIT is a vehicle for applying a statistically valid method of fitting
distributions. The software makes use of the maximum likelihood method of
estimation to fit the normal distribution, the 3-parameter lognormal
distribution, the 3-parameter gamma distribution, the 3-parameter Weibull
distribution, the Johnson SB distribution, and the 4-parameter beta distribution. With
the increased availability of high speed computers, we feel that maximum
likelihood is a better method to fit probability models than outdated,
inefficient graphical techniques.
There are many managers within EPA who must make statistical assumptions
regarding the distribution of air pollution data. These assumptions may
affect agency policy. Such assumptions affect standard setting, emission
roll-back calculations, estimation of maximum concentrations, threshold
approximations and the handling of missing observations. It is hoped MAXFIT
could be used as a tool to aid in making such assumptions.
In fitting air quality data, we have found large variations in the
shape of the distributions suggested by the data. We consider fitting one
distribution to all air quality data to be inadvisable. Even the two highly
similar data sets presented as examples in this report led to the selection
of different models as "best." With this software, several models can be
fit and the goodness-of-fit of each can be compared in several ways. Thus, a
rational decision can be made as to which model would be adequate for a given
data base and purpose.
-------
REFERENCES
1. Larsen, R. I. A Mathematical Model for Relating Air Quality
Measurements to Air Quality Standards. Publication No. AP-89, U.S.
Environmental Protection Agency, Research Triangle Park, North Carolina,
1971.
2. Larsen, R. I. An Air Quality Data Analysis System for Interrelating
Effects, Standards, and Needed Source Reductions. J. Air Poll. Control
Assoc. 23:993, 1973.
3. Larsen, R. I. An Air Quality Data Analysis System for Interrelating
Effects, Standards, and Needed Source Reductions. Part 2, J. Air Poll.
Control Assoc. 24:551, 1974.
4. Larsen, R. I. An Air Quality Data Analysis System for Interrelating
Effects, Standards, and Needed Source Reductions. Part 3, J. Air Poll.
Control Assoc. 26:325, 1976.
5. Larsen, R. I. An Air Quality Data Analysis System for Interrelating
Effects, Standards, and Needed Source Reductions. Part 4, J. Air Poll.
Control Assoc. 27:454, 1977.
6. Mage, D. T., and W. R. Ott. Refinements of the Lognormal Probability
Model for Analysis of Aerometric Data. J. Air Poll. Control Assoc.
28(8):796-798, 1978.
7. Ott, W. R., and D. T. Mage. A General Purpose Univariate Probability
Model for Environmental Data Analysis. Comput. & Ops. Res.
3:209-216, 1976.
8. Curran, T. C., and N. H. Frank. Assessing the Validity of the Lognormal
Model When Predicting Maximum Air Pollution Concentrations. Annual
Meeting of the Air Pollution Control Association, Boston, Massachusetts,
1975.
9. Rao, C. R. Linear Statistical Inference and Its Applications. 2nd Ed.,
John Wiley and Sons, New York City, New York, 1973.
10. Schreuder, H. T., W. L. Hafley, E. W. Whitehorne, and B. T. Dare.
Maximum Likelihood Estimation for Selected Distributions (MLESD).
Tech. Report No. 61, School of Forest Resources, North Carolina State
University, Raleigh, North Carolina, 1978.
11. Box, M. J., D. Davies, and W. H. Swann. Non-linear Optimization
Techniques. Oliver & Boyd, Edinburgh, Scotland, 1969.
-------
12. Rodes, C. R., and D. M. Holland. NO2/O3 Sampler Siting Study. (To be
published as an EPA Environmental Monitoring Series Report).
13. IMSL Subroutine Library, Vol. 1 & 2, International Mathematical and
Statistical Library, Inc., Houston, Texas, 1975.
14. Hald, A. Maximum Likelihood Estimation of the Parameters of a Normal
Distribution which is Truncated at a Known Point. Skandinavisk
Aktuarietidskrift, Vol. 32:119, 1949.
15. Kendall, M. G., and A. Stuart. The Advanced Theory of Statistics.
Vol. 2, 2nd Ed. Hafner Publishing Co., New York City, New York, 1967.
16. Johnson, N. L., and S. Kotz. Continuous Univariate Distributions - 1.
Houghton Mifflin Co., Boston, Massachusetts, 1970.
-------
APPENDIX A. DISTRIBUTION DESCRIPTIONS
Normal Distribution: $-\infty < x < \infty$
-------
APPENDIX A (continued)
Estimates:
grouped data: x and f are searched for, y = Ilog
f x. -t }
- x
-
-n.
- xi J
ungrouped data: X and I are searched for, y = £log
xi~X )
Gamma Distribution:
Density: $f(x) = \dfrac{(x-\lambda)^{\alpha-1}\exp[-(x-\lambda)/\beta]}{\beta^{\alpha}\,\Gamma(\alpha)}$
x.ni
' 1
TT
a
)g(xrx).ni _
n
ungrouped data: x is searched for,
Z(xrx)Slog(xrx) . Ilog(xrx) ^ =Q
a must satisfy
I (xrX)«
n
(continued)
$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1} e^{-x}\,dx.$
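As a numerical check on the three-parameter gamma density, the Python sketch below evaluates it with the rounded parameters printed in Figure 9 (ALPHA = 3.1819, BETA = 0.0067, LOWER BOUND = 0.0039). The sketch assumes BETA is a scale parameter and writes the lower bound as `lower`; both are readings of the garbled source rather than confirmed notation. Under that reading the density integrates to one and its mean, lower + alpha * beta, is about 0.0252, which agrees with the sample mean reported for Example 2.

```python
import math


def gamma3_pdf(x, alpha, beta, lower):
    """Three-parameter gamma density with shape alpha, scale beta,
    and lower bound `lower` (zero at and below the bound)."""
    if x <= lower:
        return 0.0
    z = (x - lower) / beta
    return z ** (alpha - 1.0) * math.exp(-z) / (beta * math.gamma(alpha))


# Rounded parameters printed in the Figure 9 output.
alpha, beta, lower = 3.1819, 0.0067, 0.0039

# Trapezoidal rule from the lower bound out to 60 scale units.
steps = 20000
upper = lower + 60.0 * beta
h = (upper - lower) / steps
xs = [lower + i * h for i in range(steps + 1)]
weights = [0.5 if i in (0, steps) else 1.0 for i in range(steps + 1)]

area = h * sum(w * gamma3_pdf(x, alpha, beta, lower) for w, x in zip(weights, xs))
mean = h * sum(w * x * gamma3_pdf(x, alpha, beta, lower) for w, x in zip(weights, xs))

print(f"area = {area:.6f}, mean = {mean:.6f}")
```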
-------
APPENDIX A (continued)
Beta Distribution:
Density: f(x) =
Estimates:
ungrouped data: α, β, λ, and ξ are searched for.
grouped data: α, β, λ, and ξ are searched for.
-------
APPENDIX B
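C     USER-SUPPLIED INPUT ROUTINE FOR EXAMPLE 2 (SITE 3, HIGH OZONE).
C     THE SOURCE LISTING IS PARTLY ILLEGIBLE.  ASSUMED WHERE MARKED:
C     THE POSITION OF X AND P IN THE ARGUMENT LIST, THE F6.2 AND T45
C     FORMAT FIELDS, AND THE STATEMENTS THAT STORE EACH VALUE IN X.
      SUBROUTINE USER(X,P,XB,NGRP,TF,N,NC,BEGIN,STEP,TOL,
     *TITLE,XMIN,XMAX,NDIST)
      IMPLICIT DOUBLE PRECISION (A-H,O-Y)
      REAL TITLE(1)
      DATA IFL/'H'/
      DIMENSION X(1),XB(1),NDIST(1),P(1)
      I=0
      XMAX=-9.D6
      XMIN=9.D6
      TOL=.0001D0
      TITLE(1)='LACS S'
      TITLE(2)='ITE 3 '
      TITLE(3)=' NOX '
      TITLE(4)=' HIGH '
      TITLE(5)='O3 LEV'
      TITLE(6)='EL'
C     READ ONE RECORD; AT END OF FILE GO TO STATEMENT 3.
    2 READ(5,10,END=3) IOZL,TC,AVGSP,ISITE,RNOX
   10 FORMAT(A1,T17,F7.0,F6.2,I3,T45,F5.3)
C     KEEP ONLY SITE 3 RECORDS FLAGGED HIGH OZONE WITH A POSITIVE VALUE.
      IF(ISITE.NE.3.OR.IOZL.NE.IFL) GO TO 5
      IF(RNOX.LE.1.D-6) GO TO 5
      XI=RNOX
C     ASSUMED: COUNT THE OBSERVATION AND STORE IT.
      I=I+1
      X(I)=XI
      IF(XI.LT.XMIN) XMIN=XI
      IF(XI.GT.XMAX) XMAX=XI
    5 GO TO 2
C     REQUEST ALL SIX DISTRIBUTIONS.
    3 DO 4 J=1,6
    4 NDIST(J)=1
      NGRP=0
      N=I
      TF=N
      NC=0
C     TWELVE CLASSES SPAN THE OBSERVED RANGE.
      BEGIN=XMIN-1.D-5
      STEP=(XMAX-XMIN)/12.D0
      RETURN
      END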
-------
TECHNICAL REPORT DATA
(Please read Instructions on the reverse before completing)
1. REPORT NO.
EPA 600/4-79-044
2.
3. RECIPIENT'S ACCESSION NO.
4. TITLE AND SUBTITLE
THE MAXIMUM LIKELIHOOD APPROACH TO PROBABILISTIC
MODELING OF AIR QUALITY DATA
5. REPORT DATE
July 1979
6. PERFORMING ORGANIZATION CODE
7. AUTHOR(S)
Terence Fitz-Simons
David M. Holland
8. PERFORMING ORGANIZATION REPORT NO.
9. PERFORMING ORGANIZATION NAME AND ADDRESS
10. PROGRAM ELEMENT NO.
IAD 883
11. CONTRACT/GRANT NO.
12. SPONSORING AGENCY NAME AND ADDRESS
Environmental Monitoring and Support Laboratory
Office of Research and Development
U.S. Environmental Protection Agency
Research Triangle Park, N.C. 27711
13. TYPE OF REPORT AND PERIOD COVERED
Final
14. SPONSORING AGENCY CODE
EPA-600/08
15. SUPPLEMENTARY NOTES
16. ABSTRACT
Software using maximum likelihood estimation to fit six probabilistic
models is presented. The software is designed as a tool for the air pollution
researcher to determine what assumptions are valid in the statistical analysis
of air pollution data for the purposes of standard setting, roll-back calculations,
estimation of maximum concentrations, threshold approximations, and handling
missing observations. The program fits the user's data to the normal distribution,
the 3-parameter lognormal distribution, the 3-parameter Weibull distribution,
the 3-parameter gamma distribution, the Johnson SB distribution (a 4-parameter
lognormal distribution), and the 4-parameter beta distribution. The parameters
are estimated using standard closed solutions to maximizing equations, and a
golden section search for all other parameters. Graphical output contains a
histogram of the data superimposed by the fitted density for each model. Six
goodness-of-fit criteria are supplied and ranked by the program to aid in the
selection of the most appropriate choice among the six models. These criteria
are absolute deviations (AD statistic), weighted absolute deviations (WAD
statistic), Kolmogorov-Smirnov statistic, Cramer-von Mises-Smirnov statistic,
the log-likelihood function, and the observed significance level of the
chi-square goodness-of-fit test. The results of applying the program to several
subsets of the Los Angeles Catalyst Study data base are presented.
17.
KEY WORDS AND DOCUMENT ANALYSIS
a. DESCRIPTORS
b. IDENTIFIERS/OPEN ENDED TERMS   c. COSATI Field/Group
Maximum likelihood estimation,
Lognormal, gamma, Weibull, beta,
Johnson SB, Johnson SL distribution
Statistics
Statistical Modeling
43F
68A
18. DISTRIBUTION STATEMENT
Release to Public
19. SECURITY CLASS (ThisReport)
Unclassified
21. NO. OF PAGES
33
20. SECURITY CLASS (Thispage)
Unclassified
22. PRICE
EPA Form 2220-1 (9-73)
-------