Environmental Protection Technology Series
SOURCE ASSESSMENT:
SEVERITY OF STATIONARY
AIR POLLUTION SOURCES-
A SIMULATION APPROACH
Industrial Environmental Research Laboratory
Office of Research and Development
U.S. Environmental Protection Agency
Research Triangle Park, North Carolina 27711
RESEARCH REPORTING SERIES
Research reports of the Office of Research and Development, U.S. Environmental
Protection Agency, have been grouped into five series. These five broad
categories were established to facilitate further development and application of
environmental technology. Elimination of traditional grouping was consciously
planned to foster technology transfer and a maximum interface in related fields.
The five series are:
1. Environmental Health Effects Research
2. Environmental Protection Technology
3. Ecological Research
4. Environmental Monitoring
5. Socioeconomic Environmental Studies
This report has been assigned to the ENVIRONMENTAL PROTECTION
TECHNOLOGY series. This series describes research performed to develop and
demonstrate instrumentation, equipment, and methodology to repair or prevent
environmental degradation from point and non-point sources of pollution. This
work provides the new or improved technology required for the control and
treatment of pollution sources to meet environmental quality standards.
EPA REVIEW NOTICE
This report has been reviewed by the U.S. Environmental
Protection Agency, and approved for publication. Approval
does not signify that the contents necessarily reflect the
views and policy of the Agency, nor does mention of trade
names or commercial products constitute endorsement or
recommendation for use.
This document is available to the public through the National Technical
Information Service, Springfield, Virginia 22161.
EPA-600/2-76-032e
July 1976
SOURCE ASSESSMENT:
SEVERITY OF STATIONARY AIR POLLUTION
SOURCES--A SIMULATION APPROACH
by
E.G. Eimutis, B.J. Holmes, and L. B. Mote
Monsanto Research Corporation
1515 Nicholas Road
Dayton, Ohio 45407
Contract No. 68-02-1874
ROAP No. 21AXM-071
Program Element No. 1AB015
EPA Project Officer: Dale A. Denny
Industrial Environmental Research Laboratory
Office of Energy, Minerals, and Industry
Research Triangle Park, NC 27711
Prepared for
U.S. ENVIRONMENTAL PROTECTION AGENCY
Office of Research and Development
Washington, DC 20460
PREFACE
The Industrial Environmental Research Laboratory (IERL) of EPA has the
responsibility for insuring that air pollution control technology is
available for stationary sources. If control technology is unavailable,
inadequate, uneconomical or socially unacceptable, then development of
the needed control techniques is conducted by IERL. Approaches con-
sidered include process modifications, feedstock modifications, add-on
control devices, and complete process substitution. The scale of control
technology programs ranges from bench to full scale demonstration plants.
The Chemical Processes Branch of IERL has the responsibility for devel-
oping control technology for a large number (>500) of operations in the
chemical and related industries. As in any technical program the first
step is to identify the unsolved problems.
Each of the industries is to be examined in detail to determine if there
is sufficient potential environmental risk to justify the development of
control technology by IERL. Monsanto Research Corporation (MRC) has
contracted with EPA to investigate the environmental impact of various
industries which represent sources of emissions in accordance with EPA's
responsibility as outlined above. Dr. Robert C. Binning serves as
Program Manager in this overall program entitled, "Source Assessment."
As a first step, MRC has developed a priority listing of the industries
in each of four categories: combustion, organic materials, inorganic
materials, and open sources. The purpose and intended use of this
listing is that it serve as one of several guides to the selection of
those sources for which MRC will perform detailed source assessments.
Source assessment documents will be produced by MRC and used by EPA to
make decisions regarding the need for developing additional control
technology for each specific source.
In order to analyze the severity of those sources in which the emission
points number in the thousands or hundreds of thousands, a statistical
approach is required such as the Monte Carlo simulation technique
described in this report. An example of this approach for analyzing
coal-fired steam electric utilities is included.
CONTENTS
Section Page
I Introduction 1
II Source Severity 3
A. Mathematical Structure 5
B. Derivation of x 6
C. Pollutant Severity Equations 7
III Simulation Methodology 9
A. Introduction 9
B. Theory and Methodology 12
1. All Input Variables Independent 13
2. Dependent Input Variables 15
IV Example of Use of Simulation Approach With 19
Coal-Fired Electric Utilities
V Discussion of Non-Normality and Chi-Square 25
Goodness-Of-Fit Test
A. Central Limit Theorem and T-Test 26
B. Coefficient of Skewness and Kurtosis 34
as Analytical Measures of Non-Normal
Distributions
C. Weibull Distribution 37
D. Gamma Distribution 41
E. Log-Normal Distribution 44
F. Sample Skewness and Kurtosis 45
G. The Chi-Square Goodness-Of-Fit Test 49
VI Appendixes 57
A. Standard Statistical Formulae For Finite 53
Populations
B. Detailed Derivations of the Criteria 61
Pollutant Severity Equations
C. The Simulation Programs 66
D. The Goodness-Of-Fit Program 94
E. Treatment of Correlated Data by Linear 111
Transformation
VII References
LIST OF TABLES
Table
1 Criteria Pollutant Severity Equations
2 Results of Deterministic Calculations
3 Notation Used for Statistical Parameters
4 Comparison of Random Sample Values and
Population Mean for Coal-Fired Electric
Utilities
5 Values for Various Points of Interest in 40
the Weibull Distribution for Various
Values of Parameter b (a = 1)
6 Values for Various Points of Interest in 41
the Weibull Distribution for Various
Values of Parameter b (a = 1.0 × 10⁻⁵)
7 0.05 and 0.01 Points of the Distribution 47
of γ1, the Coefficient of Skewness
8 Percentage Points of the Distribution of 48
γ2, the Measure of Kurtosis
9 Values of g1, g2, and (g2 - 3) for Power 49
Plant Example
10 Theoretical and Actual Frequencies for 53
Nine Class Frequencies
11 Theoretical and Actual Frequencies for 54
Nine Class Intervals
A-1 First Square Root Factor in Equation A-3 59
as a Function of Sample Size
C-l Variables, Distributions, Parameters and 78
Clips for Coal-Fired Electric Utilities
Example
C-2 Summary of Input Data by Groups for Coal- 79
Fired Electric Utilities Example
C-3 Computer Listings of the Simulation 82
Programs
D-l Theoretical and Actual Frequencies for 99
Various Class Intervals
D-2 Computer Listing of the Goodness-Of-Fit 101
Program
E-l Comparison of Simulation Results 113
LIST OF FIGURES
Figure
1 Frequency Histograms for the Severity of
SO2 Emissions from Coal-Fired Electric
Utilities Comparing the Simulation and
Deterministic Methods
2 Comparison of Cumulative Frequency for the 22
Severity of SO2 Emissions from Coal-Fired
Electric Utilities Using the Simulation
and Deterministic Methods
3 Unimodal Continuous Distribution 35
4 Examples of Kurtosis in Distributions 36
5 Probability Density Function for the 37
Exponential Distribution
6 Probability Density Function of the Weibull 38
Distribution for Various Values of
Parameter b
7 Probability Density Function for the Gamma 43
Distribution for Various Values of
Parameter a
8 Probability Density Function for the Log- 45
Normal Distribution
LIST OF SYMBOLS

Symbol           Definition

A                Y-intercept for regression line [= YB - B(XB)]
A,B,C,D,E,...    Simulated variable values, same as IVAL, where
                 A = variable 1, B = variable 2, etc.
a,b,α,β          Arbitrary parameters
B                Slope of regression line [= R(SY/SX)]
B'               Average breathing rate
CC               Coal consumed
c.d.f.           Cumulative distribution function
C1,C2,CLIP(I,J)  Two-dimensional array; maximum/minimum values
                 for associated stochastic variables
E                Emission factor
e                2.72
F                Hazard potential factor (= TLV·K) for severity
                 calculations, or statistical ratio of variances
                 (Appendix E)
F1,...,Fn        Actual frequencies of the sample data in n
                 class intervals
f1,...,fn        Theoretical frequencies that would be expected
                 for a sample of the same size from the given
                 distribution
g1               Coefficient of skewness of sample
g2               Measure of kurtosis of sample
H                Relative hazard
h                Emission height
H0               Hypothesis that data are distributed according
                 to some given distribution
ICODE            Two-dimensional array containing information on
                 which dependent variable is correlated and
                 which distribution it fits
ICODE(1)         Code which identifies the type of distribution
                 for independent variable 1
ITIL             Title for x-axis on plot and title of report
                 output
IVAL             Variable value, selected from VAR(I,J)
LIST OF SYMBOLS (Continued)

Symbol           Definition

K                Safety factor
K1,K2,...        Conversion factors
LD50             Lethal dose for 50% of the people exposed
LT               Input integer used to indicate specific option
                 to be used in program
m3,m4,mk         Third, fourth, and k-th central moment about
                 X̄, respectively
N                Population exposed to a specific source,
                 persons
n                Number of samples or class intervals (for
                 simulation) or number of point sources emitting
                 the i-th material (otherwise)
NCFLAG           Input integer which provides the option of
                 using Sturge's rule to set up the number of
                 class intervals or using values of XMIN and
                 XMAX to establish W and the resulting class
                 intervals
NDVAR            An integer which gives the number of dependent
                 input variables
NFLAG            Flag to identify whether or not entered data
                 are the last data set to be analyzed
NGROUP           An integer which tells the simulation program
                 the sample size for each pass
NINT             The number of class intervals to be used in
                 program
NIVAR            An integer which tells the simulation program
                 how many input variables are independent
NPASS            An integer which tells the simulation program
                 how many passes to make
NSAMP            Sample size
P                Probability
PAR(I,J)         Parameter matrix with values for each
                 statistical distribution
p.d.f.           Probability density function
PNUM(I)          Same as SNUM(I) but for population prediction
LIST OF SYMBOLS (Continued)

Symbol           Definition

Q                Emission rate
R                Linear correlation coefficient or random number
                 (Appendix C)
R'               Lung retention factor for the pollutant of
                 interest (dimensionless factor, 0 < R' ≤ 1)
S                Source severity or standard deviation corrected
                 for small sample size
SD               Standard deviation of Y for a given X
(SD)²            Sample variance
SE               Standard error (= SY·√(1 - R²))
SF               Subprogram to calculate severity
Sij              Source severity of the i-th material in the
                 region around the j-th source
SNUM(I)          Resultant values of severity, where I = 1
                 indicates the probability that S lies in the
                 range <0.1, I = 2 indicates 0.1 to 1, etc.
s                Sample standard deviation
SX               Standard deviation of X
sx               Standard deviation of X
SY               Standard deviation of Y
S1,S2            Sample standard deviations of X1 and X2 data,
                 respectively
t                The "Student t" random variable with n degrees
                 of freedom (= Z/√(χ²/n))
TLV              Threshold limit value
T'               Total time (= t2 - t1)
t                Time
t1               Start time
t2               Finish time
u                Wind speed
VAR2,VAR3        Independent variables
W                Width of class intervals for a histogram
X                Stochastic variable
X̄                Sample mean
LIST OF SYMBOLS (Continued)

Symbol           Definition

X̄n               Mean of a sample of size n
XB               Mean of X
XMAX             Maximum value
XMIN             Minimum value
XPOP             Population size
XSAMP            Actual sample size
X1,...,Xn        A random sample from the population
X1,X2            Arrays of correlated data
YB               Mean of Y
Y1,Y2            Arrays of transformed data
Z                A standard normal random variable
z                Random variable [z = f(x1,...,xn)]
α                Angle of rotation in radians to make
                 correlation zero
Γ                Gamma function
γ1               Coefficient of skewness of population
γ2               Measure of kurtosis of population
ε                Any positive member of population
μ                Population mean
μ̂                Estimate of the total population mean
μ3,μ4            Third and fourth central moments about μ,
                 respectively
μ1               Mean of Y1 (= X̄1)
μ2               Mean of Y2 (= X̄2)
π                3.14159
ρ                Correlation coefficient
σ                Standard deviation
σ̂                Estimate of population standard deviation
σ²               Population variance
σX̄               Estimate of standard error of the mean
σy               Horizontal dispersion coefficient
σz               Vertical dispersion coefficient
χ̄max             Average maximum ground level concentration
LIST OF SYMBOLS (Continued)

Symbol           Definition

χ²               Any chi-square random variable with n degrees
                 of freedom
χmax             Maximum ground level concentration
χ(t)             Concentration time history
Ψ                Delivered dose = B'·R'·∫χ(t)dt
ΨA               Actual pollutant dose delivered from a given
                 point source (= N·B'·R'·T'·χ̄)
Ψ̄                Potentially hazardous dose (= N·B'·R'·T'·F)
SECTION I
INTRODUCTION
A prioritization listing of air pollution sources was
developed earlier to serve as a first step in selecting
industries for detailed study to determine whether a suffi-
cient potential environmental risk exists to justify the
development of control technology. Preparation of the
listing or relative ranking of emission sources involved the
use of an impact factor which is a worst case, integral
quantity. In current assessment studies, one of the potential
environmental impacts of a source is determined from the
source severity, defined as the ground level concentration
contribution of pollutants relative to some potentially
hazardous concentration of the same species. The frequency
distribution of the severity of well-documented source types
can be examined deterministically. However, source types
which are complex or involve a large number of emission
points require a statistical approach to simulate the fre-
quency distribution of the severity and ultimately permit
the assessment of the source.
The basic premise of the simulation approach is that detailed
information on all required factors and all emission points
for certain source types will not be obtained because of
time or cost limitations. When such detailed information
cannot be collected, the inputs can be described in terms of
a distribution of values. In many cases (e.g., electric
utility boiler capacities), these distributions are readily
available; for some sources, special approximating techniques
must be used to develop them. Having developed frequency
distributions for the inputs, a distribution of severity can
be calculated by means of a simulation technique. When
several input variables are treated as frequency distribu-
tions, the computation is extremely tedious; however, with a
high-speed digital computer, computation times are on the
order of a few seconds.
This document presents a methodology for describing the
severity distributions of various source types. A Monte
Carlo simulation technique is described together with effi-
cient algorithms for fitting the inverse Weibull, gamma,
normal, and log-normal cumulative distribution functions. Using
coal-fired steam electric utilities as an example, signifi-
cant correlation is demonstrated between deterministic and
simulated severity results.
SECTION II
SOURCE SEVERITY
The air pollution severity, S, of a given source should in
some way be proportional to the degree of potential hazard
it imposes upon individuals in its environment. The relative
hazard, H, from a specific emission can be defined as being
directly proportional to the delivered dose, the probability
of dose delivery, and the number of people who would receive it,
and inversely proportional to the toxicity of the material
as follows:
S ∝ H = (N·P·Ψ)/LD50    (1)

where S = source severity
H = relative hazard
N = number of persons
LD50 = lethal dose for 50% of the people exposed
P = probability of dose delivery
Ψ = delivered dose = B'·R'·∫χ(t)dt
B' = average breathing rate
R' = lung retention factor
χ(t) = concentration time history
The source severity is herein defined as the ratio of the
dose of a pollutant delivered to a population, relative to
some potentially hazardous dose. Since LD50 data are not
available for human beings, another measure of potentially
hazardous dosage was used.
The potentially hazardous dose for a given pollutant from a
specific point source in a given region is defined as
follows:

Ψ̄ = N·B'·R'·∫[t1,t2] TLV·K dt    (2)

where Ψ̄ = potentially hazardous dose, g
N = population exposed to a specific source, persons
B' = average breathing rate, m³/s-person
R' = lung retention factor for the pollutant of
interest (dimensionless factor, 0 < R' ≤ 1)

The actual dose delivered is:

ΨA = N·B'·R'·∫ χ(t) dt    (6)

where χ(t) = the actual ground level concentration time
history of a pollutant of interest emitted by
a specific point source, g/m³
The value of χ(t) is very difficult to obtain and was there-
fore approximated by an average value, χ̄. The total actual
dose delivered for a specific pollutant from a specific
source is then:

ΨA = N·B'·R'·T'·χ̄    (7)

Since our measure of source severity was defined as the
ratio of the two dosages, then:

S = ΨA/Ψ̄ = (N·B'·R'·T'·χ̄)/(N·B'·R'·T'·F)    (8)

or S = χ̄/F    (9)
A. MATHEMATICAL STRUCTURE

The source severity, S, of the i-th material in the region
around the j-th source is expressed as the ratio of χ̄ij to
Fi; i.e.,

Sij = χ̄ij/Fi    (10)

For the i-th emitted material, the severity vector, Si, is
defined by:
Si = [Si1, ..., Sin]ᵀ    (11)

where n = number of point sources emitting the i-th material
The mean and median severity for the i-th material may be
obtained from a cumulative frequency plot of Si. In addition,
n-fractiles, the standard deviation, and confidence limits
about the mean severity can be calculated. Since all of the
populations under consideration are finite, alternate
forms of the standard statistical equations were used (as
presented in Appendix A).
B. DERIVATION OF χ̄

Since source to receptor distances were not compiled, the
maximum ground level concentration for elevated point
sources, χmax,¹ was used:

χmax = 2Qσz/(πeuh²σy)    (12)

where χmax = maximum ground level concentration, g/m³
π = 3.14159
e = 2.72
u = wind speed, m/s

¹Slade, D. H. (ed.). Meteorology and Atomic Energy.
Environmental Science Services Administration, Air
Resources Labs. Silver Spring. AEC Publication No.
TID-24190. July 1968. 445 p.
h = emission height, m
σz = vertical dispersion coefficient, m
σy = horizontal dispersion coefficient, m
Q = emission rate, g/s

The average concentration, χ̄, is a function of sampling time,
t, and it can be related to the maximum concentration χmax
as follows:²

χ̄ = χmax·(t1/t2)^p    (13)

where t1 = 3 min
t2 = reference time period, min
p = 0.17
C. POLLUTANT SEVERITY EQUATIONS

For any material with a given TLV, the severity equation
becomes:

S = χ̄/F = χmax·(t1/t2)^p / F    (14)

Assuming a national average wind speed of approximately
10 mph (4.5 m/s), an averaging period of 24 hours, and
substituting the appropriate values, Equation 14 becomes:

S = χ̄/F = χmax·(3/1440)^0.17 / [(TLV)(3.33 × 10⁻³)]

  = 0.35·χmax / [(TLV)(3.33 × 10⁻³)]

  = 105·χmax / TLV

²Turner, D. B. Workbook of Atmospheric Dispersion Estimates,
1970 Revision. U.S. Department of Health, Education, and
Welfare. Cincinnati. Public Health Service Publication
No. 999-AP-26. May 1970. 84 p.
or

S = (2)(105)·Q·σz / [πeuh²σy·(TLV)]    (15)

The national average atmospheric stability is approximately
neutral. Hence, σz ≈ σy, and

σz/σy = 1.0    (16)

Thus:

S = 210Q / [38.43h²·(TLV)]

or S = 5.5Q / [(TLV)h²]    (17)
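In modern notation, the numerical constants in this derivation can be cross-checked with a short sketch. The Q, h, and TLV values below are illustrative only, not data from this report.

```python
import math

# Hedged sketch: numerical check that Equation 17, S = 5.5*Q/(TLV*h^2),
# follows from Equation 15, S = 2*(105)*Q / (pi*e*u*h^2*TLV), with
# u = 4.5 m/s and sigma_z/sigma_y = 1. Input values are illustrative.
def severity(Q, h, TLV, u=4.5):
    """Source severity per Equation 15 with neutral stability."""
    return 2 * 105 * Q / (math.pi * math.e * u * h**2 * TLV)

Q = 100.0    # emission rate, g/s (illustrative)
h = 50.0     # emission height, m (illustrative)
TLV = 0.013  # threshold limit value, g/m^3 (illustrative)

exact = severity(Q, h, TLV)
approx = 5.5 * Q / (TLV * h**2)   # Equation 17's rounded form
print(exact, approx)              # the two forms agree to within ~1%
```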
Since the criteria pollutants (particulates, SOx, NOx, CO,
and hydrocarbons) have established ambient air standards,
the appropriate standard (in g/m³) can be substituted for
the potential hazard factor, F. The severity equations for
the five criteria pollutants are shown in Table 1. (Detailed
derivations are shown in Appendix B.)
Table 1. CRITERIA POLLUTANT SEVERITY EQUATIONS

Pollutant       Severity equation
Particulate     S = 70Q/h²         (18)
SOx             S = 50Q/h²         (19)
NOx             S = 315Q/h^2.1     (20)
Hydrocarbons    S = 162.5Q/h²      (21)
CO              S = 0.78Q/h²       (22)
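The five Table 1 equations can be collected into a small sketch for evaluation; the Q and h arguments below are illustrative values.

```python
# Hedged sketch: the five criteria-pollutant severity equations of
# Table 1 as functions of emission rate Q (g/s) and stack height h (m).
severity_eqs = {
    "particulate":  lambda Q, h: 70.0 * Q * h**-2,     # Eq. 18
    "SOx":          lambda Q, h: 50.0 * Q * h**-2,     # Eq. 19
    "NOx":          lambda Q, h: 315.0 * Q * h**-2.1,  # Eq. 20
    "hydrocarbons": lambda Q, h: 162.5 * Q * h**-2,    # Eq. 21
    "CO":           lambda Q, h: 0.78 * Q * h**-2,     # Eq. 22
}

for name, eq in severity_eqs.items():
    print(name, eq(100.0, 50.0))  # illustrative Q = 100 g/s, h = 50 m
```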
SECTION III
SIMULATION METHODOLOGY
A. INTRODUCTION
In statistical analyses of data, it is frequently necessary
to consider a random variable which is a function
of other random variables. An example pertinent to air
pollution studies is given by the severity equations for
ground level concentrations of air pollutants. For example,
the severity equation for SO2 emissions from the stacks of
coal-fired electric utility plants is given by:
S = 50Q/h²    (19)
where Q = emission rate, g/s
h = emission height, m
The emission rate can be calculated from:

Q = (CC)(E)(% sulfur)(K1)    (23)

where CC = coal consumed, g/yr
E = emission factor = 0.019 g SO2 per g of coal consumed
(for 1% sulfur coal)
% sulfur = percent of sulfur in the coal
K1 = 3.171 × 10⁻⁸ (to convert from g/yr to g/s)
or

S = (K2)(CC)(% sulfur)/h²    (24)

where K2 = 3 × 10⁻⁹
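Equation 24 can be sketched directly; the coal consumption, sulfur content, and stack height below are illustrative values, not data from this report.

```python
# Hedged sketch of Equation 24: SO2 severity of a coal-fired plant from
# annual coal consumption CC (g/yr), coal sulfur content (%), and stack
# height h (m), using the combined constant K2 given in the text.
K2 = 3e-9  # as stated for Equation 24

def so2_severity(coal_g_per_yr, pct_sulfur, stack_height_m):
    """S = K2 * CC * (% sulfur) / h^2 (Equation 24)."""
    return K2 * coal_g_per_yr * pct_sulfur / stack_height_m**2

# e.g., 2e12 g/yr (~2 million metric tons) of 2% sulfur coal, 100-m stack:
print(so2_severity(2e12, 2.0, 100.0))
```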
Next, consider a general setting where the random variable z
is a function of the random variables x1, ..., xn given by
z = f(x1, ..., xn) for some function f. Suppose the actual
distributions of the input random variables x1, ..., xn are
known, including their probability density functions (p.d.f.)
and the corresponding cumulative distribution functions
(c.d.f.). Then it seems reasonable to assume that the distri-
bution of the random variable z can be obtained. In a sense
this is true in that integral formulae have been developed
which give the probability density function and the cumula-
tive distribution function for z as a function of the same
functions for the xi.³ These formulae, however, are very
complicated even for the case of the simple sum, difference,
product, or quotient of two random variables. Also, even if
the integrals are successfully evaluated, the resulting
probability density function for z will in general not be
exactly one of the standard distributions and as a result
may be difficult to handle. There are certain special cases
in which the resulting p.d.f. will be known.3 In these in-
stances, the analytical approach to finding z explicitly is
by far the best approach. In other instances certain
simplifying assumptions about the distribution of z can be made
provided certain things are true about the coefficient of
variability or equivalently the coefficient of skewness of
the input variables.4 However, in cases where there are more
3Parzen, E. Modern Probability Theory and Its Applications.
New York, John Wiley & Sons, 1960.
^Springer, C. H., et al. Probabilistic Models. Homewood,
Richard D. Irwin, Inc., 1968.
than two input variables or there is considerable skewness
exhibited by the variables or the function f becomes compli-
cated, then the strict analytical approach to finding the
distribution of z explicitly will in general not be applicable.
In these cases, where the general approach of finding the
explicit distribution function for z is not applicable, an
alternate approach is to calculate "many" values of z for
explicit values of the input variables x1, ..., xn and use
these values to estimate (rather closely if enough values of
z are known) such things as the mean, standard deviation,
etc., for z. Also, class intervals can be formed and a
frequency histogram and cumulative distribution plot can be
developed for the "many" calculated values of z. This will
yield a distribution of z without any knowledge of an
analytical formula for its p.d.f. or even without knowing
whether any of the known standard distributions of statistics
exist for the distribution. This approach is called the
deterministic approach because in this technique it is
possible to determine explicit values for z from explicit
values of the input variables x1, ..., xn. This approach is
an approximation to the strictly analytical approach described
earlier. The deterministic approach works well whenever it
is possible to actually calculate the "many" values of z
deterministically from given values of the input variables.
The term "many" means at least 30 values for the purpose of
estimating the mean, standard deviation, and a 95% confidence
interval for the mean. (The t-test for finding confidence
intervals is discussed in Section V of this report.) However,
to obtain a better frequency histogram for z, 100 or more values
of z should be available. (A histogram can be constructed
with fewer values, but it will tend to be less meaningful
because the number of class intervals will have to be smaller.)
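The estimates named above can be sketched as follows. The z values are stand-in data, and t = 2.045 is the tabulated two-sided 95% critical value for 29 degrees of freedom (n = 30).

```python
import math
import random
import statistics

# Hedged sketch of the deterministic approach: given "many" (>= 30)
# calculated values of z, estimate the mean, standard deviation, and a
# 95% confidence interval for the mean using the Student t distribution.
random.seed(1)
z_values = [random.lognormvariate(1.0, 0.5) for _ in range(30)]  # stand-ins

n = len(z_values)
mean = statistics.fmean(z_values)
s = statistics.stdev(z_values)         # sample standard deviation
half_width = 2.045 * s / math.sqrt(n)  # t_{0.025,29} * s / sqrt(n)

print(f"mean = {mean:.3f}, s = {s:.3f}, "
      f"95% CI = ({mean - half_width:.3f}, {mean + half_width:.3f})")
```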
Finally consider the situation when either no explicit values
of the input variable are available from which values of z
can be calculated or the number of such values is too small to
permit calculation of enough values of z to determine useful
information regarding its distribution. In this situation a
tool commonly used is the probabilistic approach which uses a
computer simulation to obtain values for z. For example,
instead of knowing many values for the input variables
x1, ..., xn, only limited information may be available, such
as an estimate of the mean and possibly the range and symmetry
or skewness properties. Such situations are not amenable to
either the analytical or deterministic approach. In this case,
the input variables are fitted to some theoretical distribu-
tion and the small amount of available information about the
variables is used to determine the parameters of the distri-
butions. The computer is then used to sample a large number
of times from each input variable's distribution function and
to subsequently use these data to calculate a large number of
values of z from which the mean, standard deviation, etc.,
can be estimated and frequency histograms and cumulative
distribution plots for z can be prepared. Some of the
techniques and procedures used in such a computer simulation
are described below.
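In outline, such a simulation might look like the following sketch. The distributions, parameters, and function f here are illustrative stand-ins, not this report's inputs.

```python
import random
import statistics

# Hedged sketch of the probabilistic (Monte Carlo) approach: fit each
# input variable to a theoretical distribution, draw a large sample from
# each, and compute the output z for every draw.
random.seed(7)
N = 10_000

def f(x1, x2, x3):
    return x1 * x2 / x3**2  # stand-in for z = f(x1, ..., xn)

z = [f(random.weibullvariate(2.0, 1.5),    # x1 ~ Weibull (illustrative)
       random.gauss(5.0, 1.0),             # x2 ~ normal (illustrative)
       random.lognormvariate(0.5, 0.3))    # x3 ~ log-normal (illustrative)
     for _ in range(N)]

print(statistics.fmean(z), statistics.stdev(z))
# From the z values one can also form class intervals and prepare a
# frequency histogram and cumulative distribution plot.
```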
B. THEORY AND METHODOLOGY
The equation for the severity (S) of ground level concentra-
tions of SO2 emissions from the stacks of coal-fired electric
utilities will be used to illustrate the methodology utilized
in the simulation approach. The severity equation is:
S = (K2)(CC)(% sulfur)/h²    (24)

where K2, % sulfur, CC, and h have the meanings defined earli-
er. Thus, the input random variables are % sulfur, CC, and h.
1. All Input Variables Independent
When all of the input random variables are independent, the
methodology is relatively simple. A large sample (e.g., of
size n) is drawn from the distribution of each of the input
variables. These data are then used one by one to calculate
n values of S. From these n values the mean, standard
deviation, etc., can be calculated and a frequency histogram
and cumulative distribution can be plotted.
Some comments are in order regarding the method by which
samples are drawn from the distribution of the input varia-
bles. First, it should be noted that the input variables
are restricted to one of four types of continuous distribu-
tions: the Weibull, normal, gamma, or log-normal distribu-
tion. The type of each input variable and the corresponding
parameters for its distribution function must thus be speci-
fied. The method of obtaining the "best" type for each
variable and the corresponding parameters is described in
Section V and the parameters for the SO2 example are given in
Appendix C. In these goodness-of-fit procedures, it is
necessary to have a random sample of data points for the input
variable in order to be able to fit it to the proper distribu-
tion. However, certain situations may arise when that much
information about the input variable is not available. For
example, two extreme points on the distribution and either the
mean or mode may be known, or some information may be available
to determine whether the distribution is symmetric or skewed.
In such situations where the goodness-of-fit program is inoper-
able, it may still be possible to fit the variable to one of
the four distributions above and to obtain its parameters.
As a demonstration of the above procedure, consider the
following example. Suppose that for an input variable, x,
it is known with 95% confidence that the values of x will be
between e and e⁵ (where e ≈ 2.72). Suppose also that the
mode of the distribution is known to be between e and e² and
that the mean is approximately equal to e³. These points then
indicate that x has a rather heavily right-skewed distribution.
The graph for the p.d.f. of x may resemble the one shown below:
Since it is known that the 0.025 point on the cumulative
graph is approximately equal to e and the 0.975 point is
approximately equal to e⁵, this information alone can be used
to calculate A and B in a Weibull fit (described later). Thus,
one finds that A = 1.25 and B = 7.29 × 10⁻³. These values of
A and B yield a theoretical mean μ = 47.7, which is larger than
the estimated e³ value for the mean. The theoretical mode is
14.2, which again is larger than the estimated mode. Thus, the
Weibull fit could be used as an approximation to the "true"
distribution of x, although it is not a very good fit, as our
observations about the mean and mode indicate.
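Assuming the Weibull c.d.f. has the form F(x) = 1 - exp(-B·x^A), which is consistent with the A and B values quoted above, the fit can be reproduced as a sketch:

```python
import math

# Hedged sketch: fit the assumed Weibull c.d.f. F(x) = 1 - exp(-B*x**A)
# to the two known percentile points (0.025 point at x = e, 0.975 point
# at x = e**5), then compute the implied mean and mode.
x1, p1 = math.e, 0.025
x2, p2 = math.e**5, 0.975

# -ln(1 - F) = B * x**A, which is linear in ln x after taking logs:
u1, u2 = -math.log(1 - p1), -math.log(1 - p2)
A = (math.log(u2) - math.log(u1)) / (math.log(x2) - math.log(x1))
B = u1 / x1**A

mean = B ** (-1 / A) * math.gamma(1 + 1 / A)     # Weibull mean
mode = ((A - 1) / (A * B)) ** (1 / A)            # Weibull mode (A > 1)
print(A, B, mean, mode)  # A ~ 1.25, B ~ 7.3e-3, mean ~ 48, mode ~ 14
```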
Another way of obtaining a distribution for x is to assume
that it is log-normally distributed since the log-normal
distribution is a right-skewed distribution. If x is assumed
to be a log-normal distribution, then log x must be normal.
Hence, by taking the logarithm of the 0.025 point and 0.975
point of x, the same points on the cumulative graph of log x,
which was assumed to be normal, are obtained. These points
are 1 and 5, respectively. Thus, the mean μ of log x should
be taken to be 3 and, since 1 and 5 are the 0.025 and 0.975
points, respectively, it is found that σ = 1.02. The values
μ = 3 and σ = 1.02 can thus be used as parameters to sample
from the normal for values of log x. By taking antilogarithms
of the sample, a sample for x can be obtained.
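This log-normal fit can be reproduced in a few lines; 1.96 is the two-sided 95% normal critical value, and the sample size is illustrative.

```python
import math
import random
import statistics

# Hedged sketch of the log-normal alternative: take logs of the 0.025
# and 0.975 points (ln e = 1, ln e**5 = 5), treat log x as normal, and
# solve for mu and sigma.
lo, hi = 1.0, 5.0                 # ln of the 0.025 and 0.975 points of x
mu = (lo + hi) / 2                # = 3
sigma = (hi - lo) / (2 * 1.96)    # ~ 1.02

# Sampling x: draw log x from N(mu, sigma), then take the antilogarithm.
random.seed(3)
sample = [math.exp(random.gauss(mu, sigma)) for _ in range(5000)]
print(mu, sigma, statistics.median(sample))  # median of x ~ e**3 ~ 20
```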
In view of the above discussion, it is evident that several
avenues are available for obtaining a distribution to fit the
given data or information about each input variable. The
simulation program (for the case of independent input varia-
bles) simply takes the parameters for the given type of dis-
tribution for an input variable and samples from this distri-
bution to obtain a random sample for that input variable.
The method by which the program selects the sample from a
given distribution varies with the distribution. The direct
approach is used for the Weibull and normal distributions.
The rejection method is used for the gamma distribution.
Finally, for the log-normal distribution, the direct approach
is used to sample from the normal with the mean equal to the
mean of log x and the standard deviation equal to the stan-
dard deviation of log x. These are sometimes obtained by
taking the log of the geometric mean and geometric standard
deviation of x. After obtaining a sample for log x, the
antilogarithm of these values gives a sample for x. These
methods of sampling are further discussed later in this
report.
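As one concrete instance, the direct approach for the Weibull amounts to inverting its c.d.f. The sketch below again assumes the parameterization F(x) = 1 - exp(-B·x^A); the parameter values A = 1.25 and B = 7.29 × 10⁻³ are taken from the fitting example above.

```python
import math
import random

# Hedged sketch of the "direct approach" for Weibull sampling: invert
# the assumed c.d.f. F(x) = 1 - exp(-B*x**A), so a uniform random
# number r maps to x = (-ln(1 - r) / B) ** (1/A).
def weibull_sample(A, B, rng):
    r = rng.random()
    return (-math.log(1.0 - r) / B) ** (1.0 / A)

rng = random.Random(11)
sample = [weibull_sample(1.25, 7.29e-3, rng) for _ in range(20_000)]

mean = sum(sample) / len(sample)
theoretical = (7.29e-3) ** (-1 / 1.25) * math.gamma(1 + 1 / 1.25)
print(mean, theoretical)  # the empirical mean tracks the theoretical one
```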
2. Dependent Input Variables
Consider what happens when two (or more) of the input varia-
bles are correlated to some degree. If this situation occurs,
and a sampling procedure is used such as the one discussed
above, which assumes independent variables, it will tend to
distort the mean as well as the distribution of the output
variable. For example, in the severity example it was found
that the variables CC and h had a sample correlation coeffi-
cient of about 0.55. Thus, whenever CC and h were sampled
independently, large values of CC were allowed with small
values of h and vice versa. This procedure tended to distort
the "true" distribution of S by allowing unrealistically low
values for S and, more importantly, unrealistically high
values. These high values caused the simulated mean to be
16.0 whereas the true mean was 8.9 for the deterministic pop-
ulation calculation. Thus, some way was needed to account
for the correlation that existed between CC and h.
Consider two input variables, X and Y, which are correlated
with linear correlation coefficient R. This value of R can
either be estimated and supplied directly to the program or
it can be obtained by a simple regression on the sample data
for X and Y. Once R is obtained, the slope, B, and the Y-
intercept, A, of the regression line are given by:

B = R(SY/SX)   (25)

A = YB - B(XB)   (26)
where XB and SX are the mean and standard deviation of X,
and YB and SY are the corresponding parameters for Y. (This
assumes that X is taken as the independent variable in the
regression line.) If R had been supplied directly to the
program (without any sample data), then XB, SX, YB, and SY
would also be required in order to calculate A and B.
After A and B are obtained, the following relationship exists:
Y = A + BX = YB + B(X - XB) (27)
with an error term to account for the fact that the regres-
sion line is not an exact relation between X and Y. The
next item needed is the standard deviation (or error), SE,
of Y due to the regression line estimate of Y. This is given
by:
SE = SY √(1 - R²)   (28)
Whenever R = 0 (indicating that X and Y are independent),
then SE = SY as would be expected (i.e., the standard devia-
tion of Y due to the regression line estimate is simply SY).
Also, if R = ±1, then SE = 0 as would be expected since
there is no deviation from the regression line in this case.
SE shown above is the standard deviation for Y when the
regression line estimate of Y is used and the value of X is
XB, which is not likely to occur often. Hence, a way to
compute the standard deviation (SD) of Y is needed using the
regression estimate of Y and any value of X. Intuition
suggests that the larger the deviation of X from XB, the
larger the error in estimates of Y. A formula for computing
SD is given by:
SD = SE √(1 + 1/n + (X - XB)²/(n(SX)²))   (29)
where n is the sample size to be drawn from Y. Since these
simulations usually have large values of n, then for our
purposes SD is approximately equal to SE. However, the
correct formula is still used in the program for calculating
SD (the standard deviation of Y for a given X).
The method described below pertains to sampling from the
pair of variables X and Y where X and Y are correlated as
above with X independent and Y dependent. First, a value of
X is selected at random from the population describing X.
This value is then substituted into the regression equation
to obtain an estimate of Y which is taken as the expected
value or mean of Y for this given value of X. Finally, the
standard deviation of Y, SD, is calculated for this value of
X. Then a sample is taken from the distribution of Y with
the mean given by the estimate of Y from the regression
equation and the standard deviation given by SD. This pro-
vides a "correlated" random value of Y associated with the
given value of X. This procedure is continued until the
desired sample size for X and Y is attained.
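Under the same assumptions (a normal X and a normal conditional distribution for Y), the sampling loop just described can be sketched as follows. This is an illustrative Python reconstruction, not the report's program; the parameter values are hypothetical:

```python
import math
import random

def sample_correlated(xb, sx, yb, sy, r, n, rng=random.Random(7)):
    """Sample (X, Y) pairs where Y is conditioned on X through the
    regression line, as described in the text."""
    b = r * sy / sx                  # slope B = R(SY/SX), equation 25
    a = yb - b * xb                  # intercept A = YB - B(XB), equation 26
    se = sy * math.sqrt(1.0 - r * r)  # standard error, equation 28
    pairs = []
    for _ in range(n):
        x = rng.normalvariate(xb, sx)   # sample the independent variable
        y_hat = a + b * x               # regression estimate of Y (eq. 27)
        # standard deviation of Y for this given X (equation 29)
        sd = se * math.sqrt(1.0 + 1.0 / n + (x - xb) ** 2 / (n * sx * sx))
        y = rng.normalvariate(y_hat, sd)  # "correlated" random value of Y
        pairs.append((x, y))
    return pairs

pairs = sample_correlated(xb=10.0, sx=2.0, yb=50.0, sy=5.0, r=0.55, n=5000)
```

Because SD is close to SE for large n, the extra terms in equation 29 matter mainly when the number of samples drawn is small.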
The following discussion pertains to the type of distribution
from which sampling is allowed for Y, the dependent variable
in the regression equation. Upon changing from one value of
X to the next, the above procedure will simply provide a new
mean and standard deviation for Y to be used for sampling
purposes. Since the normal and log-normal distributions are
the easiest distributions from which to sample when the mean
and standard deviation are known, the program allows the user
to select only one of these two distributions for sampling
from Y. However, the program is designed for expanding this
choice to the Weibull and gamma distributions if more sub-
routines are written and added. Whenever the log-normal
distribution is used for Y, the program performs the regres-
sion analysis for log Y as a function of log X, so that the
sample value for log Y is first calculated and then converted
to Y internally by taking antilogarithms. It must be remem-
bered that if one chooses to use the log-normal distribution
for Y and to supply R, XB, YB, SX, and SY instead of supply-
ing the raw sample data, then the values of XB, SX, YB, and
SY must be given in terms of log X and log Y. That is, XB
must be the mean of log X, etc. If the raw data values are
supplied for X and Y respectively, the program automatically
converts to logs if Y is assigned a log-normal distribution.
SECTION IV
EXAMPLE OF USE OF SIMULATION APPROACH
WITH COAL-FIRED ELECTRIC UTILITIES
In order to obtain an indication of how well the simulation
procedure approximates the "true" population, SO2 emissions
from the stacks of coal-fired electric utilities were examined.
Data were available on % sulfur, coal consumed (CC), and stack
height (h) for 224 power plants in the U.S. This was consid-
ered to be the total population which was to be simulated by
using only a small number (24) of plants in order to obtain
information about the distributions of % sulfur, CC, and h.
To obtain a "random" sample, the first 24 plants on the list
were selected. Percent sulfur, CC, and h for these 24 plants
were then fitted to the four distributions considered in the
simulation program. The distributions were then selected
which appeared to fit the data better on an overall basis
considering the SE, chi-square value, actual class interval
comparisons, and coefficient of skewness and measure of kurtosis
calculations. (These techniques of choosing the best fit are
discussed in Sections V and VI of this report.) For % sulfur,
the Weibull Maximum Likelihood Fit was selected and clipped at
the 5% and 99% points. For CC, the Weibull Least Squares Fit
was selected and clipped at the 5% and 99% points. Also, h
was found not to be independent of CC. Hence, it was decided
to treat h as a dependent variable correlated with the inde-
pendent variable CC by using the raw data on the 24 plants to
obtain R, the correlation coefficient. The coefficient of
skewness indicated that h was not normal but skewed to the
right. Furthermore, the coefficient of skewness and measure of
kurtosis for log h indicated "near-normality." Hence, it
was decided to use the log-normal distribution for h.
The severity equation for SO2 emissions from the stacks of
coal-fired electric utilities was discussed earlier and is
repeated below:
S = (K2)(% sulfur)(CC)/h²   (24)
Using this function for S, the data, as indicated above, were
entered into the simulation program and 5,000 values were
calculated for S. Subsequently, the mean, standard devia-
tion, percentage with S > 1, maximum value, and minimum
value were calculated. A deterministic calculation of these
values was performed for all 224 plants in the population
and the results are compiled in the table below:
Table 2. RESULTS OF DETERMINISTIC CALCULATIONS

Parameter               Simulated value   Deterministic value
Mean                          9.25                8.9
Standard deviation           12.5                12.4
Maximum value               154.5               136.0
Minimum value                 0.08                0.36
Percent having S > 1         91                  95
Frequency histograms and cumulative frequency plots were
also drawn for both the simulated values and the determinis-
tic values of S and these are shown in Figures 1 and 2.
The large-sample Z test was performed to determine whether
there was a significant difference in the simulated and
deterministic mean values obtained above.

Figure 1. Frequency histograms for the severity of SO2 emissions
from coal-fired electric utilities comparing the simulation and
deterministic methods (simulated: sample size 5,000, mean 9.25,
std. dev. 12.48, minimum 0.08, maximum 154.45; deterministic:
sample size 224, mean 8.88, std. dev. 12.37, minimum 0.36,
maximum 135.96)

Figure 2. Comparison of cumulative frequency for the severity of
SO2 emissions from coal-fired electric utilities using the
simulation and deterministic methods

The test, as might be expected, showed no significance in the difference
at the 0.01 or 0.05 levels. Furthermore, the F test for
significant difference in the variances was also negative,
indicating no significant difference.
As can be seen from the frequency histogram of both the
simulated and deterministic plots, the severity appears to
be log-normally distributed. Furthermore, the cumulative
plots for the simulated and deterministic values of S indi-
cate an extreme-value distribution, such as the log-normal
distribution. As can be seen by comparing the cumulative
plots for the simulated and deterministic values of S, parts
of the distributions agree very well whereas other parts do
not agree as well. This is considered acceptable since the
total population in the deterministic calculation has only
224 points and this is very much a discrete population,
whereas the simulation plot assumes a continuous distribution
for S. Moreover, the sample correlation coefficient, which
was calculated from the 24 plants in the sample, probably
does not completely represent the true picture of the actual
correlation between CC and h. For these reasons, the chi-
square goodness-of-fit test indicates an overall poor fit of
the simulated cumulative distribution to the deterministic
cumulative distribution.
However, keeping in mind the purpose behind a simulation
(viz., to obtain information about an output variable when
very little is known about the input variables) and the con-
ditions under which it is generally performed (viz., little
or no information about the true distributions of the input
variables), the simulated values of the mean, etc., and the
cumulative plots are actually remarkably close to the "true"
values.
SECTION V
DISCUSSION OF NON-NORMALITY AND CHI-SQUARE
GOODNESS-OF-FIT TEST
There are many real-world situations in which massive amounts
of data are collected and analyzed. In many of these cases
the data come from a population which is finite. Thus, when
statistical methods are employed to analyze the data, the
underlying distributions are of the discrete type. However,
it is common practice to approximate these discrete distri-
butions with one or more of a large number of continuous
distributions. The reason, of course, is that continuous
distributions are much easier to handle.
After deciding to use continuous distributions in the
analysis of the data, the experimenter must decide which of
the many types of distributions available will most closely
approximate his data and yield the best results for his
purposes. In some situations, past experience leads the
experimenter to choose a particular type of distribution.
For example, the exponential distribution and one of its
generalizations, the gamma distribution, are often used to
describe the distribution of arrival and/or departure times
in a queueing theory problem. In reliability theory, the
exponential distribution and another of its generalizations,
the Weibull distribution, are often used to describe the
distribution of time-to-first-failure or mean time between
failures. In measuring air pollution, the log-normal dis-
tribution is often used to describe the distribution of the
concentrations of pollutants in the air. However, in many
instances the experimenter may have no previous knowledge of
how his data "should" be distributed. In these instances it
is sometimes desirable simply to use the well-known normal
distribution to describe the data. With the assumption of
normality, many statistical tests and procedures become
trivial to apply. However, when normality is assumed and the
"true" distribution is non-normal, error is introduced into
the problem. Thus, when an experimenter assumes his data are
normally distributed in order to be able to apply statistical
tests that require the normal distribution, he is naturally
concerned with the extent to which his assumption distorts
the desired results. One of the purposes of this section is
to shed some light on questions of the above type.
This discussion is mainly concerned with data collected in
connection with air pollution analysis and it is frequently
the case that these data exhibit extreme value characteris-
tics. This means that the data are not symmetrically located
around the mode but instead have extreme values to one side
or the other of the mode which causes the mean to be shifted
to the right or left of the mode. Thus, the data cannot
generally be described by the normal distribution. Analytical
techniques that can be used to detect and measure certain
"degrees of non-normality" of data will be discussed. Some
of the well-known extreme value distributions, e.g., the
Weibull, gamma, and log-normal distributions, will also be
reviewed and compared with the normal distribution for
different values of their parameters. Finally, the chi-square
goodness-of-fit test will be discussed to show how it can be
used to test the fit of the data to a given distribution.
A. CENTRAL LIMIT THEOREM AND T-TEST
Consider a given population which has an unknown distribu-
tion with a finite mean and variance. Let X1, ..., Xn denote
a random sample from this population so that each Xi is a
random variable with the same distribution as the underlying
population distribution. Table 3 gives the notation for
certain statistical parameters for both the sample and the
population.
Table 3. NOTATION USED FOR STATISTICAL PARAMETERS

Parameter                 Sample notation    Population notation
Mean                      X                  μ
Variance                  (SD)²              σ²
Third central moment      m3                 μ3
Fourth central moment     m4                 μ4
kth central moment        mk                 μk
In general, English letters are used to denote a sample
parameter and Greek letters are used to denote the corre-
sponding population parameter. The first result to be dis-
cussed is the so-called "law of large numbers." This law
states that each of the sample characteristics in the table
above (as well as some others to be discussed later)
converges in probability to the corresponding population
parameter. It is instructive to see what this means for the
case of X, the sample mean. Let ε > 0 be any positive
number. Then:

P(|X - μ| ≥ ε) → 0 as n → +∞   (30)
where X denotes the mean of a sample of size n. This indi-
cates that the probability that the mean of a sample of size
n will differ from the true population mean by as much as
ε > 0 (for any ε > 0) approaches 0 as the sample size n
approaches +∞. The same is true of the other sample
characteristics.
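The law of large numbers is easy to see numerically. A small Python sketch (the exponential population with mean μ = 1 is an arbitrary illustrative choice, not from the report):

```python
import random

rng = random.Random(42)

def sample_mean(n):
    """Mean of a sample of size n from an exponential population
    whose true mean is 1."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

# As n grows, the sample mean settles toward the population mean of 1.
for n in (10, 1000, 100000):
    print(n, sample_mean(n))
```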
Practically speaking, the law of large numbers is of little
benefit unless one can determine how good the approximation
is as a function of sample size or unless confidence intervals
can be found for certain samples of size n > 30 where the
population parameters are "reasonably approximated" by their
sample counterparts. Instead of simply taking the sample
parameter [e.g., X or (SD)²] as the value of the population
parameter (μ or σ²), it is usually desirable to construct
confidence intervals around the sample parameter within
which the target population parameter is assumed to
be with a certain degree of confidence. Since an approxima-
tion to the population mean and a confidence interval about
it seem to be of central importance in most statistical
analyses of data, procedures (and their limitations) for
accomplishing this task are described below.
First, it is necessary to consider some preliminary defini-
tions and results. A random variable is said to have a chi-
square distribution with n degrees of freedom if its p.d.f.
is given by:

f(X) = [1/(2^(n/2) Γ(n/2))] X^((n/2)-1) e^(-X/2)   for X > 0
     = 0 elsewhere                                  (32)
Suppose a random sample X1, ..., Xn is available from a
normal population with mean, μ, and variance, σ². Then the
random variable defined by:

Σ(Xi - X)²/σ² = (n-1)(SD)²/σ²   (33)

has a chi-square distribution with n-1 degrees of freedom.
Next, suppose that Z is a standard normal random variable
and χ² is any chi-square random variable with n degrees of
freedom. Then the random variable:

T = Z/√(χ²/n)   (34)
is defined to be the "Student t" random variable with n
degrees of freedom. The T variable has a symmetric distribu-
tion that looks very much like the normal except that its
tails are somewhat wider.
The T-test is used to find confidence intervals about the
mean of a sample. Let X1, ..., Xn denote a sample of any
size from a normal distribution with unknown mean, μ, and
variance, σ². Let X = ΣXi/n be the sample mean and let (SD)²
be the (unbiased) sample variance. By standard statistical
theorems, X is a normal random variable with mean, μ, and
variance, σ²/n. Also, by one of the results stated above,
(n-1)(SD)²/σ² is found to be a chi-square random variable
with n-1 degrees of freedom. Thus:

T = (X - μ)/(SD/√n)   (35)
is a "Student t" random variable with n-1 degrees of freedom.
Thus, by using a table for the T variable, one can find
t(α/2) such that:

P(|T| ≤ t(α/2)) = 1 - α   (36)

for any given α. For example, if α = 0.05, then t(0.025) = 2.20
for 11 degrees of freedom or t(0.025) = 2.18 for 12 degrees of
freedom, etc. Hence, given α and the number of degrees of
freedom, a value t(α/2) can be found such that

-t(α/2) ≤ T ≤ t(α/2)

with probability 1 - α. Thus, with α = 0.05, one finds a 95%
confidence interval for T. For the T defined above, it is
found that |(X - μ)/(SD/√n)| ≤ t(0.025) with 95% confidence,
i.e.,

-t(0.025) ≤ (X - μ)/(SD/√n) ≤ t(0.025)

X - t(0.025)(SD/√n) ≤ μ ≤ X + t(0.025)(SD/√n)

Thus, an interval is obtained in which the population mean
lies with 95% certainty.
The above procedure will now be examined with reference to
the assumption of normality. It is useful to establish the
reason why it was necessary for the sample to be from a
normal population. This question can be answered by reversing
the steps used above. In defining the T random variable, the
numerator had to be a standard normal random variable and the
denominator had to be √(χ²/n), where χ² had a chi-square
distribution with n degrees of freedom. Thus, if the sample
were taken from a non-normal population, the sample mean,
X = ΣXi/n, will not (in general) be a normal random variable
and the variable (n-1)(SD)²/σ² will not in general have a
chi-square
distribution. Thus, for non-normal populations, the ratio
used in defining the T variable will not produce a T variable
and, hence, the T-test is not strictly valid. In such cases,
the central limit theorem is used. This theorem states that
if a sample of size n is taken from any population with finite
mean, μ, and variance, σ², the sample mean, X, approaches a
normal random variable in distribution as n increases. In
fact, for samples of size n ≥ 30, the sample mean is so close
to a normal distribution that for all practical purposes it
can be considered normal. Thus, if a sample of size n ≥ 30
is taken from any population and its mean is calculated, one
will obtain approximately a normal random variable with mean,
μ (the unknown population mean), and variance, σ²/n (where σ²
is the unknown population variance). Hence, letting

Z = (X - μ)/(σ/√n)   (37)
a standard normal random variable is obtained. Since σ is
unknown in the above equation and the sample size is rela-
tively large (≥ 30), σ can be approximated by the sample
standard deviation, SD, to obtain

Z = (X - μ)/(SD/√n)   (38)

for the standard normal random variable. Hence, the normal
probability table may be used to find a value Z(α/2) such that

P(|Z| ≤ Z(α/2)) = 1 - α   (39)

and, hence, obtain (as in the T-test)

X - Z(α/2)(SD/√n) ≤ μ ≤ X + Z(α/2)(SD/√n)

with 100(1 - α)% confidence.
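As a concrete check, the large-sample interval in (37) through (39) can be computed with a few lines of Python using only the standard library; here it is applied, for illustration, to the simulated severity statistics of Section IV (mean 9.25, standard deviation 12.48, n = 5,000):

```python
import math
from statistics import NormalDist

def z_confidence_interval(xbar, sd, n, alpha=0.05):
    """100(1 - alpha)% large-sample confidence interval for the mean,
    valid for n >= 30 by the central limit theorem (equations 37-39)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # Z(alpha/2); 1.96 for alpha = 0.05
    half_width = z * sd / math.sqrt(n)
    return xbar - half_width, xbar + half_width

lo, hi = z_confidence_interval(xbar=9.25, sd=12.48, n=5000)
```

With 5,000 simulated values the interval is quite narrow, roughly 8.90 to 9.60, reflecting the large sample size rather than the spread of the severity distribution itself.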
If the above two procedures are compared, it is found that
the ratio under consideration for finding the values t(α/2)
or Z(α/2) is the same in both cases, namely, (X - μ)/(SD/√n),
and the only difference is the specific reference table used.
If the underlying population is normal and n is arbitrary,
the T table is used; if not, but n ≥ 30, the normal table is
used. Actually, the T tables in most statistics books stop
at n = 30 (29 degrees of freedom) because for n ≥ 30 the
variable defined by the above ratio can be considered to be
normal or T regardless of the underlying population distri-
bution. Hence, it would be useless to construct tables
beyond approximately n = 30 for the T variable since they
would simply duplicate the values that could be obtained
from a normal table.
It is sometimes desired to find a confidence interval when n
is less than 30 and the data are non-normal. Unfortunately,
there is no completely satisfactory analytical answer that
can be given in this case. However, statisticians have found
from experience that the normal approximations guaranteed by
the central limit theorem for samples of size n ≥ 30 are still
very close to normal for most mound-shaped probability distri-
butions. In some cases, the approximation is valid for samples
as small as n = 5. Hence, in the case of n < 30, one can
simply calculate the confidence interval in the same way as
described above, assuming normality, and subsequently point
out the possible pitfalls that could occur if the underlying
distribution strays too far from normality (viz., the confi-
dence interval would no longer be valid and more sampling
would be required). An alternate approach for n < 30, when
there is reason to believe that the data are log-normally
distributed, is to apply the T-test to the logs of the data
and obtain a confidence interval.
It is interesting to apply the above procedure to the data
for coal-fired electric utilities. Data are available for
224 power plants on coal consumption, stack heights, percent
sulfur in the coal used, and percent ash in the coal used.
The first 24 plants on the list were then selected and con-
sidered to be a random sample from the population of 224
plants. Using the 24 plants in the sample, the sample mean,
standard deviation, and 95% confidence limits on the mean
were calculated for each of the above types of data and these
were compared with the population parameters for all 224
plants. The results are provided in Table 4.
Table 4. COMPARISON OF RANDOM SAMPLE VALUES AND POPULATION MEAN
FOR COAL-FIRED ELECTRIC UTILITIES

                              Random sample values
Type of data           Mean     Standard     95% confidence   Population
                                deviation    interval         mean
Coal consumed, kg/yr   1,326    1,349        756; 1,896       1,089
Stack height, m           93.6     36        78.4; 108.8        101.3
Percent sulfur             1.82     1.15     1.33; 2.31           2.48
Percent ash               12.25     4.1      10.52; 14.0         13.03
As shown in Table 4, the 95% confidence interval contains the
population mean for each of the data types studied except
percent sulfur. Using various statistical goodness-of-
fit tests (to be discussed later) on the data, it was found
that coal consumption was exponentially distributed almost
perfectly, the percent ash was log-normally distributed, and
stack height could not be fitted with much success to any of
the distributions that were tried. However, even with these
wide-ranging distributions (which are very much non-normal)
and a sample size of only 24, the T-test still provided
satisfactory results in every case except the percent sulfur
used. Upon investigating the percent sulfur situation more
closely, it was found that the range on the data in the sample
of the first 24 plants was from 0.4% to 4% whereas the popu-
lation data ranged from 0.4% to 6.2%. This indicates that,
in the case of percent sulfur, the random sample is probably
not very good. However, the 95% confidence interval for
percent sulfur was still relatively close to containing the
actual population mean.
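The percent-sulfur interval in Table 4 can be reproduced from the sample statistics alone; a Python sketch (the critical value t(0.025) = 2.069 for 23 degrees of freedom is taken from a standard T table, since the report does not list it):

```python
import math

def t_confidence_interval(xbar, sd, n, t_crit):
    """95% confidence interval for the mean using the T-test:
    X +/- t(0.025) * SD / sqrt(n)."""
    half_width = t_crit * sd / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Percent sulfur: sample of 24 plants, mean 1.82, standard deviation 1.15.
lo, hi = t_confidence_interval(1.82, 1.15, 24, t_crit=2.069)
# The interval (about 1.33 to 2.31) misses the population mean of 2.48.
```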
B. COEFFICIENT OF SKEWNESS AND KURTOSIS AS ANALYTICAL
MEASURES OF NON-NORMAL DISTRIBUTIONS
In a symmetric distribution, every moment of odd order about
the mean (if existent) must be zero. Any such moment which
is not zero may thus be considered as a measure of the asym-
metry or skewness of the distribution. The simplest of these
measures is μ3, which is of the third dimension in units of
the variable. In order to reduce this to zero dimension and
so construct an absolute measure, division by σ³ is performed
and the ratio

γ1 = μ3/σ³   (40)

is regarded as a measure of the skewness. γ1 is called the
coefficient of skewness.
In statistical applications, unimodal continuous distributions
of the type shown in Figure 3 are often encountered. The
frequency curve forms a long tail to the right of the mode
and a short tail to the left. Thus, in calculating μ3, the
cubes of the positive deviations will generally outweigh the
negative cubes and, hence, μ3 will be positive, as will
γ1 = μ3/σ³. Thus, the above distribution is said to be skewed
right or have positive skewness. Similarly, negative skew-
ness occurs when γ1 < 0.
Figure 3. Unimodal continuous distribution
Reducing the fourth moment μ4 to zero dimension in the same
way as above, the measure of kurtosis of a distribution is
defined as:

γ2 = μ4/σ⁴   (41)

The measure of kurtosis is used as a measure of the degree of
flattening of a frequency curve near its center. The normal
distribution has a constant measure of kurtosis, γ2 = 3.
Thus γ2 > 3 means that the distribution has a sharper peak,
thinner shoulders, and fatter tails than the normal distribu-
tion. Likewise, γ2 < 3 means that the distribution has a
flatter peak, fatter shoulders, and thinner tails than the
normal distribution. Figure 4 exhibits these features.
None of these distributions is skewed, although curve 1
and curve 2 in Figure 4 would fail to be normal because of
deviations in their measures of kurtosis.
Figure 4. Examples of kurtosis in distributions:
(1) kurtosis > 3, (2) kurtosis < 3, (3) normal

It can be shown that γ1 = 0 and γ2 = 3 for the normal distri-
bution. Hence the values of skewness and kurtosis for a
given distribution can be used to compare it with the normal.
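Both measures are easy to estimate from data by replacing the population moments with sample central moments; a Python sketch, checked against a seeded normal sample, for which γ1 should be near 0 and γ2 near 3 (the sample itself is illustrative, not from the report):

```python
import random

def skewness_and_kurtosis(data):
    """Sample coefficient of skewness g1 = m3/s^3 and measure of
    kurtosis g2 = m4/s^4, computed from central moments of the data."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # sample variance
    m3 = sum((x - mean) ** 3 for x in data) / n  # 3rd central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # 4th central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2

rng = random.Random(0)
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(50000)]
g1, g2 = skewness_and_kurtosis(normal_sample)   # near 0 and 3
```

Applied to heavily skewed data such as exponential samples, the same function gives estimates near the constants γ1 = 2 and γ2 = 9 derived below.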
To obtain another reference point for comparisons, consider
the exponential distribution, whose probability
density is given by:

f(X) = (1/β) e^(-X/β)   for X > 0
     = 0 elsewhere       (42)

where β > 0 is an arbitrary parameter. By calculating the
necessary moments, etc., it is found that the exponential
distribution also has constant values for γ1 and γ2, indepen-
dent of the parameter β. These are given by:

γ1 = coefficient of skewness = 2
γ2 = measure of kurtosis = 9    (43)
Hence, another reference point is available for comparison
purposes. The exponential distribution is a one-tailed dis-
tribution which (as indicated above) is heavily skewed right
with a sharper peak, thinner shoulders and a fatter tail
than that of the corresponding normal distribution. The
density function for the exponential distribution is graphed
below.
Figure 5. Probability density function for
the exponential distribution
In the next sections, the formulae will be investigated for
various points of interest on the Weibull and gamma distribu-
tions as a function of their parameters. The mean, median,
mode, value of the p.d.f. at the mode (i.e., the maximum
value of the p.d.f.) and the coefficient of skewness and
measure of kurtosis will be considered.
C. WEIBULL DISTRIBUTION
The two-parameter family of Weibull density functions is
given by:
f(X) = abX^(b-1) e^(-aX^b)   for X > 0
     = 0 otherwise            (44)

where a, b > 0 are arbitrary parameters. The cumulative
distribution can be found in closed form and is given by:

F(X) = 1 - e^(-aX^b)   for X > 0
     = 0 elsewhere      (45)
Graphs of the p.d.f. of the Weibull distribution for various
values of b are given below (for a = 1):
Figure 6. Probability density function of the Weibull
distribution for various values of parameter b (for a = 1)
From Figure 6, it can be seen that for b = 3.26 the curve is
almost normal; for b = 2, it is skewed right with kurtosis
greater than 3; and, for b = 1, it is an exponential curve. For
b = 1, it can be seen that the Weibull distribution reduces
precisely to the exponential distribution. In order to
observe how the curve changes for different values of a and
b, some of its points of interest are analytically calculated
below as a function of a and b. The following formulae are
used:
μ = mean = a^(-1/b) Γ(1 + 1/b)   (46)

mode = a^(-1/b) ((b-1)/b)^(1/b)   (47)
(for b ≥ 1; if b < 1, the mode = 0)

median = a^(-1/b) (log 2)^(1/b)   (48)

f(mode) = maximum value of p.d.f.
        = a^(1/b) b ((b-1)/b)^((b-1)/b) e^(-(b-1)/b)   (49)
(for b > 1; if b < 1 the max. value = +∞)
σ² = variance = a^(-2/b)[Γ(1 + 2/b) - Γ²(1 + 1/b)]   (50)

σ = standard deviation = √(σ²)   (51)

μ3 = third central moment
   = a^(-3/b)[Γ(1 + 3/b) - 3Γ(1 + 1/b)Γ(1 + 2/b) + 2Γ³(1 + 1/b)]
   = a^(-3/b)Γ(1 + 3/b) - 3μσ² - μ³   (52)

μ4 = fourth central moment
   = a^(-4/b)[Γ(1 + 4/b) - 4Γ(1 + 1/b)Γ(1 + 3/b)
     + 6Γ²(1 + 1/b)Γ(1 + 2/b) - 3Γ⁴(1 + 1/b)]
   = a^(-4/b)Γ(1 + 4/b) - 4μμ3 - 6μ²σ² - μ⁴   (53)

γ1 = coefficient of skewness = μ3/σ³   (54)

γ2 = measure of kurtosis = μ4/σ⁴   (55)
Letting a = 1, the above formulae are used to calculate the
points of interest for various values of b as given in
Table 5.
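Formulae (46) through (55) are straightforward to evaluate with the gamma function; the following Python sketch reproduces the b = 3.26 column of Table 5 for a = 1 (the function name is illustrative):

```python
from math import gamma, log, sqrt

def weibull_points(a, b):
    """Mean, median, mode, variance, skewness, and kurtosis of the
    Weibull distribution with p.d.f. a*b*x^(b-1)*exp(-a*x^b),
    using equations (46)-(55)."""
    s = a ** (-1.0 / b)                       # common scale factor a^(-1/b)
    mean = s * gamma(1 + 1.0 / b)             # equation (46)
    mode = s * ((b - 1) / b) ** (1.0 / b) if b >= 1 else 0.0  # (47)
    median = s * log(2) ** (1.0 / b)          # equation (48)
    var = s ** 2 * (gamma(1 + 2.0 / b) - gamma(1 + 1.0 / b) ** 2)  # (50)
    sd = sqrt(var)
    mu3 = s ** 3 * gamma(1 + 3.0 / b) - 3 * mean * var - mean ** 3  # (52)
    mu4 = (s ** 4 * gamma(1 + 4.0 / b)
           - 4 * mean * mu3 - 6 * mean ** 2 * var - mean ** 4)      # (53)
    return mean, median, mode, var, mu3 / sd ** 3, mu4 / sd ** 4

mean, median, mode, var, g1, g2 = weibull_points(1.0, 3.26)
# Table 5 lists 0.896, 0.894, 0.894, 0.091, 0.089, and 2.73 for this column.
```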
As can be seen from Table 5 at b = 3.26, the Weibull distri-
bution function is "almost" normal. For example, its mean,
mode, and median all occur at approximately the same point.
Furthermore, its coefficient of skewness is near 0 and its
measure of kurtosis is near 3 at that value of b. As b
decreases from 3.26 to the values of 2.0, 1.5 and lower, the
curve becomes more and more non-normal. Its mode begins to
shift farther to the left of its mean (and median) indicating
a skewed right distribution. The coefficient of skewness
(as expected) begins to increase from 0.089 to 0.631, 1.08,
etc. Finally, the measure of kurtosis begins to increase
also, indicating sharper peaks, thinner shoulders and fatter
Table 5. VALUES FOR VARIOUS POINTS OF INTEREST IN THE
WEIBULL DISTRIBUTION FOR VARIOUS VALUES
OF PARAMETER b (WHEN a = 1)

Point of interest            b = 3.26   b = 2.0   b = 1.5   b = 1.4   b = 1.1
Mode                         0.894      0.707     0.481     0.409     0.113
Value of f(mode)             1.26       0.858     0.745     0.736     0.808
Median                       0.894      0.833     0.783     0.770     0.716
Mean (μ)                     0.896      0.886     0.903     0.911     0.965
Variance                     0.091      0.215     0.375     0.436     0.771
3rd moment about μ           0.002      0.0628    0.248     0.343     1.174
Standard deviation (σ)       0.303      0.463     0.613     0.660     0.878
F(μ+σ)                       0.836      0.838     0.845     0.848     0.859
F(μ-σ)                       0.166      0.163     0.145     0.134     0.081
% between μ-σ and μ+σ        67%        67.5%     70%       71.4%     77.8%
Coefficient of skewness
  (= 0 for normal)           0.089      0.631     1.08      1.19      1.73
Measure of kurtosis or
  excess (= 3 for normal)    2.73       3.244     4.384     4.838     7.296
4th moment about μ           0.023      0.150     0.619     0.918     4.336
tails than the normal. At b = 1, the Weibull distribution
reduces to the exponential distribution with γ1 = 2 and
γ2 = 9. It can be seen that γ1 and γ2 are converging to
these values, for when b = 1.1, γ1 = 1.73 and γ2 = 7.3. If b
is allowed to attain values < 1, then as indicated earlier the
mode becomes 0 and the maximum value of f becomes +∞. Further,
the parameters γ1 and γ2 continue to increase, giving larger
values than those corresponding to the exponential distribu-
tion. When b > 3.26, the Weibull distribution becomes skewed
left and its kurtosis drops below 3. From the above discus-
sion it can be seen that the parameter b controls the shape
of the Weibull distribution, i.e., whether it is "almost"
normal or skewed right or left, etc.
Consider what happens whenever the parameter a changes for
given values of b. Table 6 is similar to Table 5 except that
the value of a has been changed from 1.0 to 1.0 x 10⁻⁵. This
change in the value of a "stretches" the p.d.f. to cover
data that are wide ranging. It also causes the maximum value
of the p.d.f. to decrease to the point where the curve is
almost flat. The values of γ1 and γ2 do not change since
they are unitless measures and should not be affected.
D. GAMMA DISTRIBUTION
The gamma distribution is another of the extreme value dis-
tributions. The two-parameter family of the gamma p.d.f. is
given by:
f(X) = [1/(Γ(α) β^α)] X^(α-1) e^(-X/β)   for X > 0
     = 0 otherwise                        (56)
where α, β > 0 are arbitrary parameters. It can be noted
that when α = 1, the gamma distribution reduces to the
exponential distribution. When α = n/2 and β = 2 for a
positive integer n, the gamma distribution reduces to the
chi-square distribution with n degrees of freedom.
The cumulative distribution function of the gamma distribution
does not in general exist in closed form, and in order to look
up its values one should consult a table of the incomplete
gamma function. Of course, if α = 1, the closed form exists
and, if α = n/2 and β = 2 for a positive integer n, the
appropriate chi-square table can be used.
The graph of the p.d.f. of the gamma distribution for various
values of α with β = 1 (fixed) is shown in Figure 7.
As can be seen from Figure 7, the gamma distribution is a
skewed right distribution with relatively fat tails. The
41
-------
Table 6. VALUES FOR VARIOUS POINTS OF INTEREST IN THE WEIBULL DISTRIBUTION
FOR VARIOUS VALUES OF PARAMETER b (WHEN a = 1.0 × 10⁻⁵)ᵃ

Point of interest          b = 3.26    b = 2.0      b = 1.5      b = 1.4      b = 1.1
Mode                       30.6        223.6        1,036        1,525        3,968
Value of f(mode)           0.037       0.003        0.0003       0.0002       0.00002
Median                     30.6        263.4        1,687        2,870        25,140
Mean (μ)                   30.6        280.2        1,946        3,396        33,880
Variance                   107.5       2.15 × 10⁴   1.74 × 10⁶   6.06 × 10⁶   9.50 × 10⁸
3rd moment about μ         79.8        1.97 × 10⁶   2.98 × 10⁹   1.78 × 10¹⁰  5.08 × 10¹³
Standard deviation (σ)     10.4        146.6        1,319        2,461        30,831
F(μ+σ)                     0.836       0.838        0.845        0.848        0.859
F(μ-σ)                     0.165       0.163        0.145        0.134        0.066
% between μ-σ and μ+σ      67.1%       67.5%        70%          71.4%        79.3%
Coefficient of skewness    0.089       0.631        1.08         1.19         1.73
Measure of kurtosis        2.73        3.248        4.39         4.82         7.294
4th moment about μ         3.41 × 10⁴  1.5 × 10⁹    1.33 × 10¹³  1.77 × 10¹⁴  6.59 × 10¹⁸

ᵃThe slight variations between the coefficient of skewness and the measure of
kurtosis values shown in Tables 5 and 6 are due to calculator round-off, and
actually should be identical.
-------
Figure 7. Probability density function for the gamma
distribution for various values of parameter α (for β = 1)
The parameter α is the shape parameter and the parameter β is
the stretch parameter. To get an analytical picture of skewness,
kurtosis, etc., for the gamma distribution, some formulae are
provided below for calculating the mean, variance, etc.
    μ = mean = αβ                                             (57)

    σ² = variance = αβ²                                       (58)

    mode = (α-1)β                                             (59)

    f(mode) = maximum value of p.d.f.
            = (1/(Γ(α)β)) (α-1)^(α-1) e^(-(α-1))              (60)

    σ = standard deviation = √σ² = β√α                        (61)

    μ₃ = 3rd moment about μ = 2αβ³                            (62)

    μ₄ = 4th moment about μ = 3α²β⁴ + 6αβ⁴                    (63)

    γ₁ = μ₃/σ³ = 2αβ³/(β³α^(3/2)) = 2/√α                      (64)
43
-------
    γ₂ = μ₄/σ⁴ = (3α²β⁴ + 6αβ⁴)/(α²β⁴) = 3 + 6/α              (65)
From the above formulae for γ₁ and γ₂, it can be seen that
the gamma distribution always has a positive coefficient of
skewness and always has a measure of kurtosis > 3. Thus, it
will tend to be right skewed and have sharper peaks and
fatter tails than the corresponding normal distribution.
Only by changing the shape parameter to very large values of
α can the gamma distribution be made to approach the normal
distribution's values for γ₁ and γ₂. When this is done, the
mean and the mode tend to become closer together, as they
should for the normal distribution. For α = 1, the skewness
is 2 and the kurtosis is 9, as they should be for the
exponential distribution, of which the gamma is a
generalization. Finally, it should be noted that for values
of α < 1, the skewness and kurtosis increase sharply to very
large values.
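Equations 64 and 65 are easy to evaluate directly; a minimal sketch:

```python
import math

# Coefficient of skewness and measure of kurtosis for the gamma
# distribution, per Equations 64 and 65.
def gamma_skewness(alpha):
    return 2.0 / math.sqrt(alpha)

def gamma_kurtosis(alpha):
    return 3.0 + 6.0 / alpha

# alpha = 1 recovers the exponential values (2 and 9); large alpha
# approaches the normal values (0 and 3); alpha < 1 grows sharply.
print(gamma_skewness(1.0), gamma_kurtosis(1.0))     # 2.0 9.0
print(gamma_skewness(100.0), gamma_kurtosis(100.0))
print(gamma_skewness(0.1), gamma_kurtosis(0.1))
```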
E. LOG-NORMAL DISTRIBUTION
The two-parameter family of log-normal p.d.f. is given by:
           (1/(Xβ√(2π))) e^(-(ln X - α)²/(2β²))   for X > 0
    f(X) =
           0                                      elsewhere

where α is any real number and β > 0. A random variable with
this p.d.f. has the property that its logarithm to base e is
a normal random variable. A graph of the p.d.f. for the
log-normal distribution looks similar to the one shown in
Figure 8.
As can be seen, the log-normal distribution is another right-
skewed distribution. Further, it tends to have "heavier
tails" (i.e., larger kurtosis) than even the exponential
44
-------
Figure 8. Probability density function
for the log-normal distribution
distribution. (This distribution is shown graphically by
Curran and Frank.5)
F. SAMPLE SKEWNESS AND KURTOSIS
Previous discussions were concerned with theoretical distri-
butions and their analytical skewness and kurtosis. It is
also instructive to consider samples drawn at random from
some population. Consider the sample analogy of skewness
and kurtosis and observe how they can be used to predict
deviations from normality and toward some other distribution,
e.g., the exponential distribution, etc. Formulae are pre-
sented below which are used to produce unbiased estimates of
the variance and some other higher central moments about the
mean:
    m₃ = third (unbiased) central moment about X̄

       = [n/((n-1)(n-2))] Σᵢ₌₁ⁿ (Xᵢ - X̄)³                     (66)
5Curran, T. C., and N. H. Frank. Assessing the Validity of
the Lognormal Model when Predicting Maximum Air Pollution
Concentrations. (Presented at the 68th Annual Meeting of
the Air Pollution Control Association, Boston. June 15-20,
1975.)
45
-------
    m₄ = fourth (unbiased) central moment about X̄

       = [(n²-2n+3)/((n-1)(n-2)(n-3))] Σᵢ₌₁ⁿ (Xᵢ - X̄)⁴

         - [3(2n-3)/(n(n-1)(n-2)(n-3))] [Σᵢ₌₁ⁿ (Xᵢ - X̄)²]²    (67)

    (SD)² = unbiased estimate of variance

          = [1/(n-1)] Σᵢ₌₁ⁿ (Xᵢ - X̄)²                         (68)
From the above, the sample coefficient of skewness, g₁, and
measure of kurtosis, g₂, can be constructed as follows:

    g₁ = m₃/(SD)³                                             (69)

    g₂ = m₄/(SD)⁴                                             (70)
The law of large numbers discussed earlier can now be en-
larged to include the sample parameters g₁ and g₂; i.e., g₁
and g₂ also converge in probability to their population
counterparts, γ₁ and γ₂, as the sample size, n, becomes large.
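Equations 66 through 70 can be sketched directly; the five-point data set below is only an illustration:

```python
def unbiased_moments(xs):
    # Unbiased central moments and variance, per Equations 66 to 68.
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs)
    s3 = sum((x - xbar) ** 3 for x in xs)
    s4 = sum((x - xbar) ** 4 for x in xs)
    m3 = n * s3 / ((n - 1) * (n - 2))
    m4 = ((n * n - 2 * n + 3) * s4 / ((n - 1) * (n - 2) * (n - 3))
          - 3 * (2 * n - 3) * s2 ** 2 / (n * (n - 1) * (n - 2) * (n - 3)))
    sd2 = s2 / (n - 1)
    return m3, m4, sd2

def g1_g2(xs):
    # Sample skewness and kurtosis, per Equations 69 and 70.
    m3, m4, sd2 = unbiased_moments(xs)
    return m3 / sd2 ** 1.5, m4 / sd2 ** 2

# A symmetric data set gives g1 = 0 exactly.
print(g1_g2([1.0, 2.0, 3.0, 4.0, 5.0]))
```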
Tables 7 and 8 provide the 0.05 and 0.01 points of the
distribution of the coefficient of skewness and the measure
of kurtosis, respectively, which can be used to evaluate the
sampling distributions for parameters g₁ and (g₂-3). If a
given sample has a value of g₁ or (g₂-3) beyond the 0.05
point, it may be deemed to be non-normal. In a stricter
test, the 0.01 point would be used. The tables for both g₁
and (g₂-3) are not complete and do not give values for small
numbers of samples in the case of (g₂-3). More extensive
tables can probably be obtained from the original source
shown in the tables. If the calculated values of g₁ and
(g₂-3) fall below the 0.05 value, this does not mean that
one can categorically accept normality. However, it is a
46
-------
Table 7. 0.05 AND 0.01 POINTS OF THE DISTRIBUTION
OF γ₁, THE COEFFICIENT OF SKEWNESS (NORMAL UNIVERSE)⁶,ᵃ

                  Probability that γ₁ will exceed listed
                  value in positive direction is:
Sample size, n    0.05 point    0.01 point
    25            0.711         1.061
    30            0.661         0.982
    35            0.621         0.921
    40            0.587         0.869
    45            0.558         0.825
    50            0.533         0.787
    60            0.492         0.723
    70            0.459         0.673
    80            0.432         0.631
    90            0.409         0.596
   100            0.389         0.567
   125            0.350         0.508
   150            0.321         0.464
   175            0.298         0.430
   200            0.280         0.403
   250            0.251         0.360
   300            0.230         0.329
   400            0.200         0.285
   500            0.179         0.255
   750            0.146         0.208
 1,000            0.127         0.180

ᵃPoints listed are on the positive tail of the distribution;
with a minus sign attached, they are equally valid for the
negative tail.

⁶Geary, R. C., and E. S. Pearson. Tests of Normality.
London, University College, 1938.
47
-------
Table 8. PERCENTAGE POINTS OF THE DISTRIBUTION
OF γ₂, THE MEASURE OF KURTOSIS (NORMAL UNIVERSE)⁶

                  Probability that γ₂       Probability that γ₂
                  falls below listed        falls above listed
                  value is:                 value is:
Sample size, n    0.01 point   0.05 point   0.05 point   0.01 point
   200            -0.63        -0.49        0.57         0.98
   250            -0.58        -0.45        0.52         0.87
   300            -0.54        -0.41        0.47         0.79
   500            -0.43        -0.33        0.37         0.60
 1,000            -0.32        -0.24        0.26         0.41
 2,000            -0.23        -0.17        0.18         0.28
 5,000            -0.15        -0.11        0.12         0.17
good indication that the distribution does not stray far
from normality (with respect to skewness and kurtosis) and
would not produce serious error in most statistical
applications if it is assumed to be a normal distribution.
As an example of the above procedure for using g₁ and g₂ and
the tables, consider the 224 power plants described earlier
and their data on coal consumption, plant capacity, percent
sulfur, percent ash, and stack height. Table 9 gives the
values of g₁, g₂, and (g₂-3) for these data.
From the conclusions shown in Table 9, it can be seen that,
with the exception of sulfur, normality of the sample data
can be denied on the basis of skewness and kurtosis alone.
Percent ash is denied because of skewness alone, and the
others fail both tests. The natural question to ask at this
point is what distribution (if any) the data fit if they are
definitely non-normal. Since g₁ = 2.05 and g₂ = 8.03 for
coal consumed, and γ₁ = 2 and γ₂ = 9 for the exponential dis-
tribution, one might guess that the coal consumption data are
48
-------
Table 9. VALUES OF g₁, g₂ AND (g₂-3) FOR POWER PLANT EXAMPLE

Type of data      g₁      g₂      (g₂-3)   Conclusions
Coal consumed     2.05    8.03     5.03    Non-normal
Plant capacity    1.5     4.8      1.8     Non-normal
Stack height      1.46    5.14     2.14    Non-normal
Percent sulfur    0.25    2.88    -0.12    Looks normal as far as
                                           skewness and kurtosis
                                           are concerned
Percent ash       0.48    3.04     0.04    Non-normal because of
                                           slight right skewness;
                                           kurtosis is satisfactory
almost exponentially distributed. In the next section,
another goodness-of-fit test, called the chi-square goodness-
of-fit test, will be discussed. This technique can be used to
test how well sample data fit any of a variety of distribu-
tions, and it will be used to analyze the distributions of
coal consumed and percent sulfur in the power plants data.
G. THE CHI-SQUARE GOODNESS-OF-FIT TEST
The chi-square test for goodness-of-fit is a general non-
parametric statistical test used to test a hypothesis, H₀,
of the form: the sample data are distributed according to
some given probability distribution. The chi-square test
compares a set of actual sample frequencies with a set of
frequencies that would be expected on the basis of H₀. If
the two sets compare well, the hypothesis is accepted and
the data are assumed to be distributed in the way claimed.
If the two sets of frequencies compare poorly, the hypothesis
is rejected. Since the sampling distribution that is used
tends to form a chi-square distribution under the given
hypothesis, the test is called a chi-square test.
49
-------
In order to formulate the chi-square test, let F₁, ..., Fₙ
represent the actual frequencies of the sample data in n
class intervals. Under the hypothesis (H₀) that the data
are distributed according to some given distribution
function, let f₁, ..., fₙ be the theoretical frequencies
that would be expected for a sample of the same size from the
given distribution. If H₀ is true, sample values of
the quantity:

    χ² = Σᵢ₌₁ⁿ (Fᵢ - fᵢ)²/fᵢ                                  (71)
will tend to form a chi-square distribution. Hence, given a
sample and its actual frequencies, the theoretical frequencies
that would result from H₀ can be calculated and substituted
into the above formula to obtain a value of χ². Then, by
consulting a chi-square table, one can determine whether the
sample value of χ² is significant enough (at whatever level
is desired, usually 0.01 or 0.05) to be rejected as a value
from the chi-square distribution. If it is, the hypothesis
H₀ is rejected, and if it is not, H₀ is accepted. In the case
of rejection, the probability of error can be stated and one
can be as confident as desired in the rejection. However, if
rejection cannot be made at a given level of confidence, then
H₀ is accepted, but this cannot be done with any statement of
the error involved.
Before applying the chi-square test, one must consider the
number of degrees of freedom to be used for the sampling
distribution of χ². Since the sum for χ² extends over n
terms (where n = the number of class intervals), it may
appear that the sampling distribution has n degrees of
freedom. However, some of these degrees of freedom have been
utilized in the construction of the test. The given sample
size was used to determine the theoretical frequency of that
sample size for each class interval. This reduces the number
50
-------
of degrees of freedom by one, to n-1. The "given" distribu-
tion in H₀ (the hypothesis to which the test is being applied)
is generally one of the many distributions available in
statistics, and these are generally parametric families of
distributions with one, two, or more parameters. In calcu-
lating the theoretical frequencies for each class interval
from the given distribution, these parameters must be speci-
fied in some way. The usual methods of obtaining these
parameters for a given sample involve using the data in the
sample in some way (e.g., the method of moments, the method
of maximum likelihood, or the method of least squares). If
the sample data are used to determine the parameters, the
number of degrees of freedom must be further reduced by one
for each parameter so estimated. Thus, for a two-parameter
distribution, the total number of degrees of freedom finally
realized for the χ² test is n-3 (where n = the number of
class intervals used). If the distribution in the hypothesis
H₀ has its parameters specified independently of the sample,
or if the theoretical frequencies are given without needing
to be calculated from a given distribution, it is not neces-
sary to reduce the number of degrees of freedom by any more
than one (i.e., to n-1).
Another consideration that arises with the chi-square test
which needs some discussion is the problem of the number of
theoretical frequencies for each class interval. It is a
conservative rule that the theoretical frequencies in any
class interval be five or more. If this is not the case,
adjacent intervals should be pooled so as to accomplish
this task. When two class intervals are pooled, however, the
number (n) of class intervals is reduced by one, and, hence,
so is the number of degrees of freedom of χ². Thus, it is
of no benefit to try to use a large number of class intervals
to start the test, for in the end these will have to be pooled
to obey "the rule of 5." Another point to observe is that
51
-------
whenever the number of class intervals has to be reduced to
three or less in order to obey the "rule of 5," the number of
degrees of freedom for χ² drops to zero and, hence, the χ²
test is not applicable. Thus, for very small sample sizes
(approximately 15 or less), the χ² test will not be very work-
able and other measures must be used to get fits to the data.
Actually, the "rule of 5" is conservative and need not be
strictly obeyed. For example, if a theoretical frequency for
some class interval is around four and the actual frequency
is close to that value, it would not be necessary to pool the
class intervals and thus reduce the number of degrees of
freedom. Furthermore, it has been observed that if an error
of 1% can be tolerated in the probabilities read from the
chi-square table at the 0.05 level, or a 0.2% error at the
0.01 level, then the smallest theoretical frequency can be
as low as two if there are at least six degrees of freedom,
as low as one if there are 10 degrees of freedom, and as low
as 0.5 if there are 25 degrees of freedom. However, each of
these low frequencies should occur only once to be allowed;
otherwise, pooling must be used according to the "rule of 5"
described above.
The data from the power plants will now be used to demonstrate
the χ² test. Consider coal consumed, which on the basis of
skewness and kurtosis was guessed to follow an exponential dis-
tribution. The method of least squares is used to arrive at
values of the parameters a and b in the Weibull distribution
from the given data. These values were a = 9.7 × 10⁻⁴
and b = 0.9974. Since b ≈ 1, the exponential distribution is
represented here. A chi-square test will now be performed to
see if the data do indeed fit the exponential distribution.
Table 10 gives the actual and theoretical frequencies for nine
class intervals (determined by Sturge's rule) beginning with
the smallest value and proceeding through the largest value.
52
-------
Table 10. THEORETICAL AND ACTUAL FREQUENCIES FOR
NINE CLASS INTERVALS (COAL CONSUMED)

Class       Theoretical    Actual
interval    frequency      frequency
1           115.2          116
2            52.0           55
3            27.1           22
4            14.2           14
5             7.4            8
6             3.9            4
7             2.0            3
8             1.1            0
9             1.2            2
In order to obey "the rule of 5," the last four class inter-
vals are pooled into one with a theoretical frequency of 8.2
and an actual frequency of 9. Thus, six class intervals and
three degrees of freedom result. The value of χ² for this
table is 1.3. Referring to a χ² table, this value of χ² is
not significant at the 0.01 level or at the 0.05 level. In
fact, it becomes significant only at the 0.8 level. This
indicates a very good fit and, hence, one can conclude that
the data are indeed exponentially distributed.
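This result can be reproduced from the six pooled frequencies of Table 10 using Equation 71; 7.81 is the standard 0.05 critical value of chi-square with three degrees of freedom:

```python
# Chi-square per Equation 71 on the six pooled intervals of Table 10
# (the last four original intervals combined into one).
theoretical = [115.2, 52.0, 27.1, 14.2, 7.4, 8.2]
actual = [116, 55, 22, 14, 8, 9]

chi2 = sum((F - f) ** 2 / f for F, f in zip(actual, theoretical))
df = len(theoretical) - 3  # two fitted parameters plus the sample size

print(round(chi2, 1), df)  # 1.3 with 3 degrees of freedom; well below
                           # the 0.05 critical value of 7.81
```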
Recalling that the gamma distribution reduces to the
exponential distribution for α = 1, a method-of-moments fit
of the gamma distribution to the sample data was completed
to obtain the parameters α = 1.06 and β = 1024.8. Upon
doing a χ² test, it was found that the number of degrees of
freedom was three and χ² = 1.9. This value of χ² was not
significant until the 0.6 level of significance. Hence, a
very good fit to the data was also obtained using the gamma
distribution.
53
-------
In view of the above discussion, one would certainly believe
that the χ² test should reject normality of these data. Using
the maximum likelihood fit to the data (i.e., simply using
μ = sample mean and σ = sample standard deviation) for the
normal, one obtains three degrees of freedom and a χ² value
of 31.6, which can be rejected at any level for three degrees
of freedom.
Since the log-normal distribution is a right-skewed distribu-
tion, it is reasonable to believe that it may fit these data.
Indeed, the χ² test yields six degrees of freedom with
χ² = 3.4. This value is not significant with this number of
degrees of freedom until the 0.75 or 0.8 level. Hence, the
log-normal distribution also yields a very good fit to these
data.
Next, consider the % sulfur in the coal consumed. Since the
skewness and kurtosis tests earlier indicated a possible nor-
mal distribution for these data, the normal distribution was
fitted to the data by maximum likelihood and a χ² test was
performed. The table for nine class intervals is shown below:
Table 11. THEORETICAL AND ACTUAL FREQUENCIES FOR
NINE CLASS INTERVALS (% SULFUR IN COAL)

Class       Theoretical    Actual
interval    frequency      frequency
1            22.3           32
2            31.4           28
3            46.5           37
4            50.0           51
5            39.0           42
6            22.0           26
7             9.1            4
8             2.7            1
9             1.0            3
54
-------
A clarification is in order regarding the manner in which the
last few class intervals should be pooled. Since pooling the
last two intervals yields a theoretical frequency of 3.7 with
an actual frequency of 4, this will be used to obtain five
degrees of freedom and a χ² value of 10.3. This value of χ²
is not significant at the 0.05 level (it becomes significant
at about 0.07 or 0.08). Hence, normality for % sulfur is
accepted, although with some reservations. If the last three
intervals had been pooled, four degrees of freedom would have
been obtained and χ² would be 9.2, which is not significant at
0.05 (it also becomes significant at about 0.07).
Since the fit to the normal distribution for % sulfur is
rather poor by χ², various fits to other distributions may be
evaluated, viz., the Weibull (by both maximum likelihood and
least squares), the gamma, and the log-normal distributions.
All of these distributions fail, although the Weibull maximum
likelihood fit comes closest, with five degrees of freedom and
a χ² of 18.3. However, this value is significant at 0.01 and
consequently the fit is rejected with "99% confidence."
To perform chi-square, skewness, and kurtosis tests on sample
data, a large amount of computation is required. Hence, a
computer program was written to perform the computations
involved in obtaining these measures for various types of
fits to the four distributions which have been considered in
this report: the Weibull, gamma, log-normal, and normal
distributions. This program and its method of operation are
described in Appendix D.
55
-------
SECTION VI
APPENDIXES
A. Standard Statistical Formulas for Finite Populations
B. Detailed Derivations of the Criteria Pollutant Severity
Equations
C. The Simulation Programs
D. The Goodness-Of-Fit Program
E. Treatment of Correlated Data by Linear Transformation
57
-------
APPENDIX A
STANDARD STATISTICAL FORMULAS FOR FINITE POPULATIONS
In certain cases, when sampling from finite populations (e.g.,
brick kilns, cattle feedlots, etc.) where the sample size, n,
is small compared to the total population size, N, correc-
tions must be made to the standard deviation and confidence
limit calculations. It should be recognized that there is a
difference between the sample standard deviation and the
estimated population standard deviation. When dealing with
a population of 400 plants comprising a given source type,
N = 400. If 10 plants are surveyed for stack height, age,
capacity, etc., n = 10. Using the data from those ten plants,
one can compute the means and standard deviations of the
various parameters. But the computed values are the biased
mean and standard deviation of the sample of 10 points.
What is really desired is an unbiased estimate of the mean
and standard deviation of the population of 400 plants. The
symbol "^" will be used over another symbol to stand for an
estimate. Assume that a sample mean, X̄, and a sample standard
deviation, SD, have been computed. An estimate, μ̂, of the
total population mean, μ, is:

    μ̂ = X̄                                                    (A-1)

    where X̄ = ΣX/n                                           (A-2)

An estimate of the population standard deviation, σ, is
simply:

    σ̂ = SD √(n/(n-1)) √((N-1)/N)                             (A-3)

    where SD = sample standard deviation
             = (1/n)√(nΣX² - (ΣX)²)                           (A-4)
58
-------
Earlier it was indicated that the sample standard deviation
gave a biased estimate of the corresponding population param-
eter. The first square root factor in Equation A-3 corrects
for this bias. As the sample size, n, becomes larger, the
factor approaches unity; hence, the bias is less for larger
sample sizes than for small ones. The following table shows
this tendency:
Table A-1. FIRST SQUARE ROOT FACTOR IN EQUATION A-3
AS A FUNCTION OF SAMPLE SIZE

Sample size (n)    √(n/(n-1))
    2              1.4142
    5              1.1180
   10              1.0541
   50              1.0102
  100              1.0050
1,000              1.0001
The second factor, √((N-1)/N), in Equation A-3 adjusts for
finite population size, and, if n = N, the overall correction
factor is unity, or

    σ̂ = SD                                                   (A-5)

which is logical since the sample size is equal to the popu-
lation size. There is one more important parameter used in
calculating confidence limits, and that is the estimated
standard error of the mean, σ̂x̄:

    σ̂x̄ = (σ̂/√n) √((N-n)/(N-1))                               (A-6)

Confidence limits on μ are thus

    μ̂ ± K σ̂x̄                                                 (A-7)
59
-------
where K is the standard "Student t" variable with (n-1)
degrees of freedom, and unless otherwise specified should
be for the 95% (α = 0.05) confidence level.
The preceding formulae are summarized for reader convenience
below:

Quantity                              Formula
Sample mean, X̄:                       ΣX/n                       (A-2)
Sample standard deviation, SD:        (1/n)√(nΣX² - (ΣX)²)       (A-4)
Estimated population mean, μ̂:         X̄                          (A-1)
Estimated population
  standard deviation, σ̂:              SD √(n/(n-1)) √((N-1)/N)   (A-3)
Estimated standard
  error of mean, σ̂x̄:                  (σ̂/√n) √((N-n)/(N-1))      (A-6)
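These estimators can be sketched in a few lines; the placement of the two finite-population correction factors below is an assumption, chosen so that the estimate collapses to SD when n = N, consistent with Equation A-5:

```python
import math

def finite_population_estimates(xs, N):
    # Sample mean and (biased) sample standard deviation, then the
    # corrected population estimates; the correction factors assumed
    # here make sigma_hat equal SD when n = N.
    n = len(xs)
    xbar = sum(xs) / n
    sd = math.sqrt(n * sum(x * x for x in xs) - sum(xs) ** 2) / n
    sigma_hat = sd * math.sqrt(n / (n - 1.0)) * math.sqrt((N - 1.0) / N)
    se_mean = sigma_hat / math.sqrt(n) * math.sqrt((N - n) / (N - 1.0))
    return xbar, sd, sigma_hat, se_mean

# Sampling the whole population (n = N = 4): the corrections cancel,
# and the standard error of the mean is zero.
print(finite_population_estimates([3.0, 5.0, 7.0, 9.0], N=4))
```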
60
-------
APPENDIX B
DETAILED DERIVATIONS OF THE CRITERIA
POLLUTANT SEVERITY EQUATIONS
1. CO SEVERITY EQUATION
Since the primary standard for carbon monoxide, CO, is for a
1-hour averaging time, t = 60 minutes and t₀ = 3 minutes.

    Given χmax = 2Q/(πeuh²)                                   (B-1)

Correcting for averaging time:

    χ̄max = χmax (3/60)^0.17                                   (B-2)

         = [2Q/(πeuh²)] (3/60)^0.17

         = 2Q(0.6)/(πeuh²)

         = 1.2Q/[(3.14)(2.72)(4.5)h²]

    χ̄max = (3.12 × 10⁻²)Q/h²

    Given S = χ̄max/F                                          (B-3)

For the criteria pollutants, F is set equal to the primary
standard, which is 0.04 g/m³ for CO.

    S = χ̄max/F = (3.12 × 10⁻²)Qh⁻²/0.04                       (B-4)

    and S_CO = 0.78 Q/h²
61
-------
2. HYDROCARBON SEVERITY EQUATION
The primary standard for hydrocarbons is for a 3-hour
averaging time. Thus, t = 180 minutes and t₀ = 3 minutes.

    χ̄max = χmax (3/180)^0.17                                  (B-5)

         = 0.5 χmax

         = (0.5)(0.052)Q/h² = 0.0260 Q/h²

For hydrocarbons, F = 1.6 × 10⁻⁴ g/m³.

    Then S = χ̄max/F = 0.026 Qh⁻²/(1.6 × 10⁻⁴)                 (B-6)

    and S_HC = 162.5 Q/h²                                     (B-7)
3. PARTICULATE SEVERITY EQUATION
The primary standard for particulates is for a 24-hour
averaging time.

    χ̄max = χmax (3/1440)^0.17

         = (0.052)Q(0.35)/h² = (0.0182)Q/h²                   (B-8)

For particulates, F_P = 2.6 × 10⁻⁴ g/m³.

    S = χ̄max/F = 0.0182 Qh⁻²/(2.6 × 10⁻⁴)                     (B-9)

    and S_P = 70 Q/h²                                         (B-10)
62
-------
4. SOx SEVERITY EQUATION
The primary standard for SOx is for a 24-hour averaging time.

    χ̄max = (0.0182)Q/h²                                       (B-11)

The primary standard is 3.65 × 10⁻⁴ g/m³.

    F_SOx = 3.65 × 10⁻⁴

    and S = χ̄max/F = (0.0182)Qh⁻²/(3.65 × 10⁻⁴)               (B-12)

    and S_SOx = 50 Q/h²                                       (B-13)
5. SEVERITY EQUATION FOR NOx
Since NOx has a primary standard with a 1-year averaging
time, the χ̄max correction equation cannot be used. Instead
the following equation is used:ᵃ

    χ̄ = [2.03 Q/(σz u X)] exp[-½(h/σz)²]                      (B-14)

A difficulty arises, however, because a distance, X, from
emission point to receptor, is included. To overcome this,
the following rationale is proposed:

The equation χmax = 2Q/(πeuh²)                                (B-1)

is valid for neutral conditions or when σz = σy. This
maximum occurs when

    h = √2 σz                                                 (B-15)

and, since σz = aX^b                                          (B-16)

ᵃPersonal communication. Bruce Turner, Environmental Pro-
tection Agency.
63
-------
then the distance X_max where the maximum concentration occurs
is

    X_max = [h/(√2 a)]^(1/b)                                  (B-17)

For neutral conditions, a = 0.113 and b = 0.911.

The following sample calculations illustrate the concentra-
tion estimates.

Assume Q = 10 g/s and h = 50 m; then

    X_max = (50/0.16)^1.098 = 548.7 m                         (B-18)
and

    σz = (0.113)(548.7)^0.911 = 35.4 m

Assume u = 4.5 m/s; then

    χ̄ = [2.03 Q/(σz u X_max)] exp[-½(h/σz)²]

      = [(2.03)(10)/((35.4)(4.5)(548.7))] exp[-½(50/35.4)²]

      = (2.32 × 10⁻⁴)(0.369)

      = 8.57 × 10⁻⁵ g/m³                                      (B-19)
Simplifying Equation B-14, since σz = 0.113 X^0.911 m and
u = 4.5 m/s,

    χ̄ = [4 Q/X^1.911] exp[-½(h/σz)²]                          (B-20)

and, at the distance of maximum concentration,

    X_max = (h/0.16)^1.098 = 7.5 h^1.098                      (B-21)

so that

    4 Q/X_max^1.911 = 4 Q/(7.5 h^1.098)^1.911
64
-------
    χ̄ = (0.085 Q/h^2.1) exp[-½(h/σz)²]

with

    σz = 0.113 X_max^0.911 = 0.71 h

Therefore,

    χ̄ = (0.085 Q/h^2.1) exp[-½(h/(0.71 h))²]

      = (0.085 Q/h^2.1)(0.371)

      = (3.15 × 10⁻²) Q/h^2.1                                 (B-22)

Substituting Q = 10 g/s and h = 50 m:

    χ̄ = 8.5 × 10⁻⁵ g/m³

in agreement with Equation B-19. The NOx standard is
1.0 × 10⁻⁴ g/m³. Therefore,

    F = 1 × 10⁻⁴ g/m³                                         (B-23)

and the NOx severity equation is:

    S_NOx = (3.15 × 10⁻² Q/h^2.1)/(1 × 10⁻⁴)

          = 315 Q/h^2.1                                       (B-24)
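The severity coefficients derived in this appendix can be reproduced numerically; a sketch (small differences from the published constants come from the rounding used in the hand derivations):

```python
import math

# Short-term pollutants: chi_max = 2Q/(pi*e*u*h**2) with u = 4.5 m/s
# and the (t0/t)**0.17 averaging-time correction, per B-1 to B-13.
U = 4.5
chi_coef = 2.0 / (math.pi * math.e * U)  # the 0.052 coefficient

def severity_coef(t_avg_min, standard):
    return chi_coef * (3.0 / t_avg_min) ** 0.17 / standard

print(severity_coef(60.0, 0.04))       # CO, ~0.78
print(severity_coef(180.0, 1.6e-4))    # hydrocarbons, ~162.5
print(severity_coef(1440.0, 2.6e-4))   # particulate, ~70
print(severity_coef(1440.0, 3.65e-4))  # SOx, ~50

# NOx follows the B-14 route with sigma_z = a*X**b, neutral constants.
a, b = 0.113, 0.911
xmax_coef = (1.0 / (math.sqrt(2.0) * a)) ** (1.0 / b)   # the 7.5 in B-21
front = (2.03 / (a * U)) / xmax_coef ** (1.0 + b)       # the 0.085 in B-22
sigma_ratio = a * xmax_coef ** b                        # the 0.71 in B-22
s_nox = front * math.exp(-0.5 / sigma_ratio ** 2) / 1.0e-4
print(s_nox)  # ~315
```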
65
-------
APPENDIX C
THE SIMULATION PROGRAMS
1. THE INPUT TO THE PROGRAM
In this section, details regarding the input to the simula-
tion program are discussed. The input is divided into nine
groups, some of which are always required and some of which
are optional. Each of these groups is discussed in the order
in which it should appear in practice.
a. Input Groups
Group 1: This is a required group and should appear first.
It is a single card containing three pieces of data: a title
and two flags to indicate later options. The format is 20A2,
2I5.
(ITIL(I),1=1,20) - a title which will be printed as such on
output and which will appear below the x-axis on plots.
This is read in columns 1-40.
LT - an integer in cols 41-45. Suppose it has been decided
that there is a correlation between a pair of input
variables such that a simple regression must be performed
to obtain the regression line and SE for sampling pur-
poses as described in a previous section. Then, as men-
tioned earlier, the user has the option of either
supplying the program with the raw data and having the
regression analysis performed on these data, or supplying
R, XB, SX, YB, and SY and using these directly
to obtain the regression line information. By the use
of the flag LT, the user signals the program which option
Columns are subsequently abbreviated as: cols.
66
-------
should be exercised. If LT = 0 (i.e., cols 41-45 are
left blank), the program will expect raw data upon which
to perform the regression analysis. If LT = 1 (actually
any integer other than 0), the program will expect to
read in R etc. to perform the regression analysis. In
either case, data must be read into the program. However,
these data will not be placed as the second group but
rather will appear in Group 8 to be discussed below.
The option is available to run the program with no
dependent variables. In this case LT = 0 and the
number of dependent variables, NDVAR, = 0. NDVAR is
in Group 3.
NCFLAG - an integer in cols 46-50. This flag provides the
option of using Sturge's rule to set up the number of
class intervals and using a value of XMIN and XMAX ob-
tained from the first 50 or so values of the output
variables to establish W and the resulting class inter-
vals. If NCFLAG = 0 (cols 46-50 are blank), Sturge's
rule and XMIN and XMAX as described earlier will be used
as indicated above. If NCFLAG = 1 (anything ^ 0) the
user can read into the program the number, NINT, of
class intervals he wishes to use and the value XMIN of
the left endpoint of the first class interval and the
width, W, of each succeeding class interval. If the
user decides to read in NINT etc., this information will
be on a card in Group 9 of the input.
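Sturge's rule itself is not written out in this report; assuming the common form 1 + 3.3 log₁₀ n, rounded to the nearest integer, a sketch:

```python
import math

# Sturge's rule for the number of class intervals (assumed form:
# 1 + 3.3*log10(n), rounded to the nearest integer).
def sturges_intervals(n):
    return round(1.0 + 3.3 * math.log10(n))

# For the 224 power plants this gives the nine class intervals used
# in the goodness-of-fit tables of the previous section.
print(sturges_intervals(224))  # 9
```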
Group 2: This is a required group and consists of a single
card with three pieces of data: XSAMP, XP0P, and NSAMP. The
format is 2F10.3, I5. This program was originally designed
to simulate the severity, S, of air pollution concentrations.
In doing so, it is sometimes desired to know how many plants
from the sample and from the population have predicted
67
-------
severity below 0.1, between 0.1 and 1, and above 1. Thus, the
program automatically monitors the number of simulated values
of S that fall in these ranges. It then divides the number
in each range by the simulated sample size (usually 5,000 or
more) to obtain the relative frequency of each range. It
then multiplies these relative frequencies by the actual
sample size, XSAMP, and population size, XP0P, to give the
required predictions for each range. Thus, the program re-
quires XSAMP and XP0P in real format and this is the place
that it is provided. Also, the sample size, NSAMP, is entered
in integer form to be used (if required) for reading in the
raw data for the regression analysis later.
Of course, if the distributions from samples have not been
estimated, or if a person is simply not interested in the
values between 0.1 and 1, etc., then XSAMP and XP0P can be
left blank (and also NSAMP if a regression analysis on raw
data is not being performed). However, if these values are
left blank, the blank card must still be supplied as Group 2.
Group 3: This is a required group and consists of a single
card containing four pieces of information, all integers.
The format is 4I5. Due to restricted core size, samples can-
not be drawn as large as required (usually 5,000 or more) for
each input variable in order that all of these can be used at
one time to calculate values of the output variable. Instead,
small samples must be drawn out for each input variable and
that many values of the output variable must be calculated.
This is followed by the statistical manipulations desired
[for example, keeping a running ΣS and Σ(S)² for later use in
finding the mean and standard deviation of output variables]
and a repetition of the process as many times as needed to
generate the desired sample size. The parameters on this
card dictate the manner in which the program does this.
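The group-and-pass bookkeeping can be sketched as follows; the input distributions and the severity expression here are illustrative stand-ins, not the program's actual function card:

```python
import random

# Sketch of the NGROUP/NPASS scheme: draw small groups of input
# samples, compute the output variable for each, and keep only running
# sums so the full simulated sample never has to be held in core.
def simulate(ngroup, npass, seed=7):
    random.seed(seed)
    count = n_below = 0
    sum_s = sum_s2 = 0.0
    for _ in range(npass):
        for _ in range(ngroup):              # one pass: NGROUP values
            q = random.expovariate(1.0)      # stand-in emission rate
            h = random.uniform(20.0, 100.0)  # stand-in stack height
            s = 50.0 * q / h ** 2            # stand-in severity value
            count += 1
            sum_s += s
            sum_s2 += s * s
            if s < 0.1:                      # tally the S < 0.1 range
                n_below += 1
    mean = sum_s / count
    var = (sum_s2 - count * mean * mean) / (count - 1)
    return count, mean, var, n_below / count

print(simulate(ngroup=50, npass=100))  # 5,000 simulated severities
```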
68
-------
NGR0UP - an integer in cols 1-5, which tells the program the
sample size for each pass. This number must be ≤ 50.
NPASS - an integer in cols 6-10 which tells the program how
many passes to make, drawing out samples of size NGR0UP
at each pass. Clearly, the final sample size will be
NGR0UP*NPASS.
NIVAR - an integer in cols 11-15. This parameter tells the
program how many of the input variables are independent.
For example, in a given equation there may be five input
variables. The user may decide that one of the variables
(e.g., Y) is dependent on (or correlated with) another
variable (e.g., X) and that the other three variables
are independent. Then NIVAR in this case would be taken
to be four: the three that are given independent and
the independent variable X in the correlation or regres-
sion equation. Thus, there would be one dependent
variable, Y, the variable that will be dependent in the
regression equation.
NDVAR - an integer in cols 16-20 which gives the number of
dependent input variables. In the example above, NDVAR
would be one. However, in general, two or more pairs of
correlated variables may be present in the equation so
that NDVAR could be two or more. The program does not
allow a variable to be independent in one pair for
correlation purposes and dependent in another. However,
the same independent variable can be used for two or
more dependent variables and by proper choice this is
all that one should ever need.
A word of caution is in order at this point. If NDVAR
is ≥ 2, a regression analysis must be performed for each
correlated pair in order to obtain the regression equa-
tion to be used for sampling purposes. It was pointed
69
-------
out earlier that the flag LT in the first data group
gave the user the option of either reading in raw data
or supplying R etc., directly. The user is cautioned
here that if two or more regression fits are necessary,
they must both be performed in the same manner, either
both with raw data or both with user supplied values of
R etc. The program uses LT to get into one of two loops:
either read raw data for X1 and Y1, perform a regression
analysis and then repeat for X2 and Y2, etc., until the
number of dependent variables is exhausted; or, read R
etc. for X1 and Y1, perform a regression analysis and
then read R etc. for X2 and Y2, etc., until the number of
dependent variables is exhausted. Finally, it should be
noted that NIVAR + NDVAR (= the total number of variables)
must be ≤ 10.
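The correlated-pair mechanism described above amounts to fitting a regression line to each (X, Y) pair and then drawing Y about that line. The Python sketch below illustrates the idea; `fit_line` and `sample_dependent` are our names (stand-ins for the program's regression subroutine), and the constant-scatter model is a simplification, not necessarily the exact prediction-error formula the program uses.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit y = a + b*x, returning the
    intercept, slope, and residual standard error."""
    n = len(xs)
    xb = sum(xs) / n
    yb = sum(ys) / n
    sxx = sum((x - xb) ** 2 for x in xs)
    sxy = sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = yb - b * xb
    resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se = (resid / (n - 2)) ** 0.5
    return a, b, se


def sample_dependent(a, b, se, x, rng=random):
    """Draw the dependent variable Y given an already-sampled X:
    normal about the regression line, with scatter set by se."""
    return rng.gauss(a + b * x, se)
```

Because the same independent variable may serve several dependent variables, one fitted line per correlated pair is all that is stored.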
Group 4: This is a required group and consists of a single
card containing codes which tell what type of distribution is
to be used for sampling from the independent variables. The
values on the card are integers punched in 16I5 format (actu-
ally all 16 will not be used). The values on the card are
read into an integer array IC0DE(I) from I = 1 to NIVAR. Thus,
IC0DE(1) is the code which tells what type of distribution
independent variable 1 has, etc. The distributions and their
corresponding codes are listed below:
code 1 = Weibull distribution
code 2 = normal distribution
code 3 = gamma distribution
code 4 = log-normal distribution
For example, suppose three independent variables exist and
they are ordered as VAR1, VAR2, VAR3. Suppose VAR1 has a
Weibull distribution while VAR2 has a normal and VAR3 a log-
normal distribution. Then the data card would look like the
line below:
70
-------
cols    1-5     6-10    11-15
         1        2        4
In setting up the function card (to be discussed later), one
must exercise caution to be certain that the variables so
ordered above will appear in their proper places in the
function.
Group 5: This group of data may consist of more than one
card depending on the number of independent variables con-
tained in the function. This is the place where the program
is given the parameters that go along with the distribution
selected for each input (independent) variable. For each
distribution, the parameters are listed which the program
needs for it as shown below:
Distribution        Program parameters
Weibull        →    A and B
normal         →    μ and σ
gamma          →    α and β
log-normal     →    μ and σ for log X or, equivalently,
                    the log of the geometric μ and the
                    geometric σ for X.
These parameters are read into a matrix PAR(I,J) which is
dimensioned 10 x 2. Thus, each row of the matrix has two
components and the rows correspond to the given variables.
The parameters for independent variable 1 should be read into
the first row; independent variable 2 into the second row,
and so on until the independent variables are exhausted.
Thus, the program reads values into PAR(I,J) row-wise, i.e.,
it reads parameters for independent variable 1 into row 1 as
the first two values on the data card, etc. The parameters
are expected in the same order as listed in the table above,
i.e., for the Weibull distribution, A first then B etc. The
format is 8E10.3.
71
-------
As an example, consider the situation described above where
three independent variables (VAR1, VAR2, and VAR3) were
present with Weibull, normal, and log-normal distributions,
respectively. The data card for the parameters
should contain A then B for the Weibull as the first two
entries; then μ then σ for the normal as the next two entries;
and, finally, μ then σ for log VAR3 (or the log of the geometric
μ then the geometric σ for VAR3) as the last two entries. Thus,
there would be six fields (of the eight total) taken up on
the card and it might look as shown below:
cols   1-10    11-20    21-30    31-40    41-50    51-60
        A        B        μ        σ        μ        σ
where the last μ and σ are for log VAR3.
Group 6: This group of data may consist of more than one
card depending on the number of independent variables. Some-
times in fitting continuous distributions to raw data or real-
world situations, the continuous distributions tend to take
on extremely high or extremely low values which are unrealis-
tic for the given data. This is a particular problem with
extreme-value distributions like the Weibull, gamma, or log-
normal distributions. Hence, it is desired to "clip" the
continuous distributions at appropriate points to keep these
unusually large or small values from occurring. This is the
data group in which the program is instructed at which points
to clip the (independent) distributions. The lower and upper
clip for each variable is read into a two-dimensional array
CLIP(I,J) which is dimensioned 10 x 2. The procedure of
reading one row containing a low clip and high clip, respec-
tively for each independent variable in the order in which
they are coded in IC0DE, is the same as that used for the
parameters.
72
-------
The following discussion pertains to the means by which the
distributions are clipped. The direct approach to sampling
from the Weibull distribution uses the cumulative distribu-
tion function to obtain a sample value for X from a given
random number, R. Hence, to clip the distribution so that
it does not allow values of X below some given value or above
some other given value, all that is needed is to find at
which points, C1 and C2, between 0 and 1, these low and high
values occur on the distribution. The program is then
supplied with these points, C1 and C2, and it will automat-
ically clip the random number generator so that the random
numbers generated will be between C1 and C2; hence, the
corresponding X-value will be in the correct range also.
For example, suppose it is desired to exclude values of X
below 10 or above 6,000 and that these points occur at
C1 = 0.02 and C2 = 0.96, respectively, on the cumulative
distribution for the Weibull. The program is then supplied
with these parameters (viz., 0.02 and 0.96) as the values
for the array CLIP(I,J) and the random number generator is
clipped according to the equation below:
R = (0.96 - 0.02)R + 0.02 = 0.94R + 0.02
This equation will automatically be set up internally.
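The direct (inverse-CDF) approach with a clipped uniform deviate can be sketched as follows. This is a Python illustration assuming the Weibull CDF F(x) = 1 - exp(-a*x**b), which matches the report's (A, B) parameterization; it is not the FORTRAN routine itself.

```python
import math
import random


def weibull_sample_clipped(a, b, c1, c2, rng=random):
    """Draw one Weibull value by inverse transform, with the
    uniform deviate clipped into (c1, c2) exactly as in the
    equation above, so the returned X stays in range."""
    r = rng.random()
    r = (c2 - c1) * r + c1                  # clip the random number
    return (math.log(1.0 / (1.0 - r)) / a) ** (1.0 / b)
```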
Next, the problem of clipping the normal and log-normal
distributions is discussed. The direct approach to sampling
from these distributions uses the probability density func-
tion and not the cumulative distribution function. Hence,
the clips in these cases cannot be applied in the same manner.
Thus, a very direct approach was used for clipping the values
of the variable X in this case. The program was supplied
with the actual lowest value that X was allowed to assume and
likewise the highest value that X was allowed to assume. The
program then samples from normal with no restrictions. If
the sample value obtained is between the low and high value
73
-------
supplied, it is retained. Otherwise, the program samples
again and continues to do so until it obtains a value in the
proper range. Then the program continues the above procedure
until the desired sample size is attained. A word of caution
is in order. In supplying clips for the normal, low and high
values for X are supplied directly. However, in supplying
clips for the log-normal, low and high values for log X,
not X, should be supplied.
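The rejection scheme just described can be sketched in Python as follows (illustrative names; for the log-normal the bounds would be given on log X, as the caution above notes).

```python
import random


def truncated_normal(mu, sigma, low, high, rng=random):
    """Sample the normal with no restriction and discard values
    outside [low, high] until one lands in range -- the clipping
    method described in the text."""
    while True:
        x = rng.gauss(mu, sigma)
        if low <= x <= high:
            return x
```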
If any or all of the variables are to be unclipped, this
card(s) must still be supplied. To leave the Weibull distri-
bution unclipped, values of 0.0 and 1.0, respectively, are
supplied for the low and high clip. To leave the normal or
log-normal distribution unclipped, extremely low and extremely
high values (for example, 1.0 x 10^-18 and 1.0 x 10^18) are
supplied for the low and high clip.
Group 7: This data group may consist of more than one card
depending on the number, NDVAR, of dependent variables. If
NDVAR is ≥ 1, the program in this data group is given two
pieces of information about each dependent variable:
(1) which independent variable it is correlated with (i.e.,
the number of the independent variable, e.g., 1 for VAR1,
etc.); and (2) the type of distribution to be used in sam-
pling for values of the dependent variable (either normal = 2
or log-normal = 4, in the present structure). These codes for
the dependent variables are read into a two-dimensional
integer array IDC0DE(I,J) which is a 10 x 2 matrix. Hence,
each row of IDC0DE corresponds to a dependent variable in
the same order as the dependent variables are entered, i.e.,
row 1 for variable 1, etc.
For example, suppose there are five input variables, four of
which are independent and one of which is dependent. Suppose
further that for the numbering used on the independent vari-
ables, the dependent variable is correlated with the third
74
-------
independent variable and the log-normal distribution is
desired for sampling from the dependent variable. The data
card for this group would then appear as follows:
cols    1-5     6-10
         3        4
Group 8: This data group is optional and must appear only
if NDVAR is ≥ 1. It may consist of more than one card. This
data group will contain the information necessary to perform
the regression analysis between the independent variables and
dependent variables. Depending on the value of the flag LT
in Group 1, this data group will contain either the raw data
for the pair(s) of correlated variables (independent variable
then dependent variable) or the values of R, XB, SX, YB, SY
(in that order) for the independent variable X and the
dependent variable Y. The format on all cards is 8E10.3.
Note that if NDVAR is ≥ 2, the raw data must appear as
follows:
(Independent VAR1)
(Dependent VAR1)
(Independent VAR2)   (even if Independent VAR2 = Independent VAR1)
(Dependent VAR2)
etc.
Further, if NDVAR is ≥ 2 and it is desired to enter the
values of R, etc., they must appear as below:
R1, XB1, SX1, YB1, SY1
R2, XB2, SX2, YB2, SY2   etc.
75
-------
Group 9: This data group is optional and will appear only
if NCFLAG ≠ 0 on the first data card. In this group, specific
values of NINT (number of class intervals), XMIN (beginning
left endpoint) and W (width of class intervals) are read into
the program. As mentioned previously, if NCFLAG = 0, the
program will establish class intervals of its own and this
data group is omitted. The format used for this group is
I5,2F10.3.
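The two class-interval paths (user-supplied versus program-built) can be sketched as follows; when the parameters are not given, this illustration falls back on Sturges' rule, which is the rule the program itself uses. Names are ours, not the program's.

```python
import math


def class_intervals(values, nint=None, xmin=None, width=None):
    """Return histogram class-interval endpoints. If nint, xmin,
    and width are not supplied (the NCFLAG = 0 case), derive them
    from the data via Sturges' rule."""
    if nint is None:
        n = len(values)
        nint = int(1.5 + 3.3 * math.log10(n))   # Sturges' rule
        xmin = min(values)
        width = (max(values) - xmin) / nint
        width *= 1.0001    # keep the maximum inside the last interval
    return [xmin + i * width for i in range(nint + 1)]
```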
b. The Function Card
Whenever changes are made from one run of this program to
another run, they are generally made from one function
(describing an output random variable as a function of input
random variables) to another function (with different input
and output random variables). Hence, it is necessary to
change one card in the program itself, which will be referred
to as the function card. It appears as a card in the function
subprogram SF with calling argument IVAL and common arguments
VAR(I,J) (as well as some other dummy arguments).
VAR(I,J) is a 51 x 10 matrix in which the samples drawn from
the distributions of the input variables are stored for each
pass from 1 to NPASS. These samples are stored columnwise,
i.e., column 1 contains the sample from VAR1, etc., until all
of the independent variables are sampled. Beginning in the
first column after column NIVAR, the dependent variables are
stored until they are exhausted. After storing the values in
VAR(I,J) in this
way, the NGR0UP values of the output variable for this pass
are calculated. In this calculation, the function card,
defined above, is used.
On a single pass of the program, the values of the output
variable from I = 1 to NGR0UP are calculated. For a given value
of I, this information is transferred to the subprogram SF as
76
-------
IVAL. For this value of IVAL, a value SF of the output
variable is calculated according to the rule specified on the
function card. Thus, caution should be exercised in construc-
ting the function card so that it reflects the true function
of the input variables in the proper order in which they
appear in VAR(I,J).
As an example, consider the function
Z = (3.1*A*B**2)/(C*D + E)
(where * denotes multiplication). Suppose variables A, C, D,
and E are taken to be independent and B is dependent and
correlated with variable E. If the independent variables are
numbered as below:
A = 1
C = 2
D = 3
E = 4
then automatically, variable B will be numbered 5 in VAR(I,J),
i.e., column 5 will contain the values of B. Thus the SF
card would be:
SF = (3.1*VAR(IVAL,1)*(VAR(IVAL,5)**2))/
     (VAR(IVAL,2)*VAR(IVAL,3)+VAR(IVAL,4))
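The same function card rendered in modern Python (our sketch, not part of the program) makes the column ordering explicit: one row of sampled values, in the order A, C, D, E, B established above.

```python
def sf(var_row):
    """Evaluate Z = 3.1*A*B**2 / (C*D + E) for one row of sampled
    input values, stored in the column order A, C, D, E, B."""
    a, c, d, e, b = var_row
    return (3.1 * a * b ** 2) / (c * d + e)
```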
c. Example of Overall Input Data
Suppose it is desired to simulate the severity, S, of SO2
emissions from coal-fired electric utilities. The severity
equation is given by:
S = (30.1)(% sulfur)(CC)/h^2
77
-------
where % sulfur = percent of sulfur in coal used
CC = coal consumed in 10^6 kg/yr
h = stack height in meters
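For a single plant the equation above is a one-line computation; a small Python sketch (names ours) for reference:

```python
def severity(pct_sulfur, cc, h):
    """Severity S = 30.1 * (% sulfur) * CC / h**2, with CC in
    10^6 kg/yr and h the stack height in meters."""
    return 30.1 * pct_sulfur * cc / h ** 2
```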
Assume that % sulfur and CC are to be independent and h is to
be correlated with coal consumed. Suppose the type of dis-
tribution for each input variable has been determined along
with the corresponding parameters and clips according to the
information in the table below.
Table C-1. VARIABLES, DISTRIBUTIONS, PARAMETERS AND CLIPS
           FOR COAL-FIRED ELECTRIC UTILITIES EXAMPLE

             Type of variable
Variable       distribution      Parameters               Clips
% sulfur-1   2 (normal)          μ = 2.5, σ = 1.1         0.025 and 1.0
CC-2         1 (Weibull)         A = 9.7x10^-4, B = 1.0   0.05 and 0.99
h-3          4 (log-normal)      --a                      --a

a Not needed since this variable is dependent and, hence,
  correlated.
Assume that the sample size from which the parameters were
obtained is 224 out of a population of 600. (This is not a
situation where a simulation is required but it can serve as
an example.) Suppose that the program is instructed to set
up its own class intervals and to perform a regression
analysis on the raw data for CC and h. Finally, assume that
a simulated sample size of 5,000 is desired. Then the input
data would appear as shown in Table C-2. In Table C-2, the
values that should appear on the card(s) are underlined and
the columns in which they should appear are shown in paren-
theses beside them.
78
-------
Table C-2. SUMMARY OF INPUT DATA BY GROUPS FOR COAL-FIRED
           ELECTRIC UTILITIES EXAMPLE

Group
number   Input data(a)
  I      Titles (1-40)   (blank) (41-45)   (blank) (46-50)
  II     224.0 (1-10)    600.0 (11-20)     224 (21-25)
  III    50 (1-5)   100 (6-10)   2 (11-15)   1 (16-20)
  IV     2 (1-5)   1 (6-10)
  V      2.5 (1-10)   1.1 (11-20)   9.7x10^-4 (21-30)   1.0 (31-40)
  VI     0.025 (1-10)   1.0 (11-20)   0.05 (21-30)   0.99 (31-40)
  VII    2 (1-5)   4 (6-10)
  VIII   Raw data for CC(b)
         Raw data for h(b)
  IX     Omitted

(a) Values shown are the input data, placed in the card columns
    designated by the parentheses beside them.
(b) Eight per card in format 8E10.3.
-------
2. DESCRIPTION OF OUTPUT
There are two forms of output from the simulation program:
(1) printed output, which is divided into three parts; and
(2) the frequency histogram and cumulative frequency function
of the simulated sample of output values.
a. The Printed Output
The first item that is printed out is the title read in on
the first card of the input. The next item printed is the
mean and standard deviation of the output variable. The
probabilities that the output variable lies in the range
< 0.1, between 0.1 and 1, and > 1, respectively, are subse-
quently printed. These are followed by the number of plants
(outcomes) from the sample which are predicted to fall in
each of these three ranges, respectively, and then by the
same numbers for the population. These are printed as
SNUM(l), SNUM(2), SNUM(3), PNUM(l), PNUM(2), and PNUM(3) for
sample number in range 1, etc. Finally, a table is printed
out which gives the class intervals, the actual frequency of
the sample in each class interval, and the cumulative fre-
quency function for the right endpoint of the class interval.
A few special comments are provided regarding the first and
last class intervals. The first class interval printed has as
its left endpoint the minimum value of the output variable in
the whole sample of size NGR0UP*NPASS. Its right endpoint is
the minimum value found in the first NGR0UP values of the
output variables, for it is after the first pass that the
program automatically sets up class intervals. The last
class interval printed has its left endpoint equal to the
maximum value obtained on the first pass from the first
NGR0UP values and its right endpoint is the overall maximum
value for the whole sample. If the user supplies his own
80
-------
class intervals and the lowest value supplied is lower than
the actual minimum value calculated by the program, the first
class interval will then look "backwards." For example,
suppose the user supplied 0 as the beginning class interval
value and the lowest observed value was 0.03 during the
simulation. The first two lines for the first two class
intervals would then appear something like the following:
Class interval Actual frequency Cumulative frequency
From 0.03 to 0.0 0.0 0.0
From 0.0 to 0+W ? ?
Likewise, the last class interval could appear "strange" in
a similar situation. If one wishes to know XMIN and XMAX for
the whole sample, these can be found as indicated above.
b. The Plots
The output plots from the simulation program are self-
explanatory and will not be further discussed.
81
-------
Table C-3. COMPUTER LISTINGS OF THE SIMULATION PROGRAMS
      BIT TRAN(16)
      DIMENSION AF(35),CF(35),ITIL(20),EMP(3),PROB(3),
     1SNUM(3),PNUM(3),XE(35),CLIP(10,2),KAF(35)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51),IDCODE(10,2),
     1COR(10,5),XSR(250),YSR(250)
C     THIS PROGRAM IS DESIGNED TO PERFORM A COMPUTER
C     SIMULATION TO OBTAIN VALUES OF ONE RANDOM VARIABLE
C     DEFINED AS A FUNCTION OF OTHER RANDOM VARIABLES.
C     ALL THE RANDOM VARIABLES ARE ASSUMED TO BE
C     CONTINUOUS AND CAN BE ANY ONE OF FOUR DIFFERENT
C     DISTRIBUTIONS: THE WEIBULL, THE NORMAL, THE GAMMA,
C     OR THE LOG-NORMAL. FURTHERMORE, THE PROGRAM WILL
C     ACCOUNT FOR CORRELATION BETWEEN CERTAIN PAIRS OF
C     THE INPUT VARIABLES.
      READ(1,100)TRAN,ICOL1,CY1,CY2,XI1,XI2,ANGL
      WRITE(5,500) TRAN,ICOL1,CY1,CY2,XI1,XI2,ANGL
      READ(1,12) (ITIL(I),I=1,20),LT,NCFLAG
      READ(1,13) XSAMP,XPOP,NSAMP
      READ(1,10) NGROUP,NPASS,NIVAR,NDVAR
      SMEAN=0.0
      SSQ=0.0
      XMIN=1.0E18
      XMAX=0.0
      DO 25 I=1,35
   25 AF(I)=0.0
      DO 26 I=1,3
   26 EMP(I)=0.0
      CALL RANDU(9,IY,YFL)
      IX=IY
      READ(1,10) (ICODE(I),I=1,NIVAR)
      READ(1,11)((PAR(I,J),J=1,2),I=1,NIVAR)
   11 FORMAT(8E10.3)
      READ(1,11)((CLIP(I,J),J=1,2),I=1,NIVAR)
      IF(NDVAR.EQ.0) GO TO 8
      READ(1,10)((IDCODE(I,J),J=1,2),I=1,NDVAR)
      IF(LT.EQ.0) GO TO 60
      DO 70 I=1,NDVAR
      IROW=I
      CALL SR(NSAMP,IROW,LT)
   70 CONTINUE
      GO TO 8
   60 DO 40 I=1,NDVAR
      IROW=I
      READ(1,11)(XSR(K),K=1,NSAMP)
      READ(1,11)(YSR(K),K=1,NSAMP)
      KD=IDCODE(I,2)
      IF(KD.EQ.4) GO TO 52
      GO TO 50
   52 DO 51 J=1,NSAMP
      XSR(J)=ALOG(XSR(J))
      YSR(J)=ALOG(YSR(J))
   51 CONTINUE
   50 CALL SR(NSAMP,IROW,LT)
   40 CONTINUE
    8 IPASS=1
   22 DO 3 I=1,NIVAR
      ICOL=I
      KK=ICODE(I)
      GO TO (4,5,6,7), KK
    4 C1=CLIP(I,1)
82
-------
Table C-3 (continued). COMPUTER LISTINGS
OF THE SIMULATION PROGRAMS
      C2=CLIP(I,2)
      CALL WSAMP(NGROUP,IX,ICOL,C1,C2)
      GO TO 3
    5 C1=CLIP(I,1)
      C2=CLIP(I,2)
      CALL NORSAM(NGROUP,IX,ICOL,C1,C2)
      GO TO 3
    6 CALL GAMSAM(NGROUP,IX,ICOL)
      GO TO 3
    7 C1=CLIP(I,1)
      CALL FDPLOT(NCINT,SMEAN,SDEV,NINT,W,XE,XMIN,XMAX,KAF,ITIL,XT)
   10 FORMAT(16I5)
   12 FORMAT(20A2,2I5)
   13 FORMAT(2F10.3,I5)
  100 FORMAT(16L1,I2,1X,5E10.3)
  500 FORMAT(1H0,16L1,I2,1X,5E15.7)
      END
83
-------
Table C-3 (continued). COMPUTER LISTINGS
OF THE SIMULATION PROGRAMS
      SUBROUTINE SR(MM,IROW,LT)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51),IDCODE(10,2),
     1COR(10,5),XSR(250),YSR(250)
C     THIS SUBROUTINE DOES EITHER A SIMPLE REGRESSION ON
C     THE INPUT RAW DATA FOR TWO CORRELATED VARIABLES OR
C     READS IN THE VALUES OF R, XB, SX, YB, AND SY AND
C     CALCULATES THE REGRESSION LINE FROM THESE VALUES.
C     IN EITHER EVENT, THE VALUES OF THE INTERCEPT A,
C     SLOPE B, STANDARD ERROR IN REGRESSION LINE SE, AND XB
C     AND SX ARE STORED IN THE IROW ROW OF AN ARRAY
C     COR(I,J) TO BE USED LATER IN THE SUBROUTINES WHICH
C     DRAW SAMPLES FROM THE DEPENDENT VARIABLES.
      IF(LT.NE.0) GO TO 5
      FN=FLOAT(MM)
      S1=0.0
      S2=0.0
      DO 1 I=1,MM
      S1=S1+XSR(I)
      S2=S2+YSR(I)
    1 CONTINUE
      XB=S1/FN
      YB=S2/FN
      S1=0.0
      S2=0.0
      S=0.0
      DO 2 I=1,MM
      S1=S1+(XSR(I)-XB)**2
      S2=S2+(YSR(I)-YB)**2
      S=S+(XSR(I)-XB)*(YSR(I)-YB)
    2 CONTINUE
      COR(IROW,1)=A
      COR(IROW,2)=B
      COR(IROW,3)=SE
      COR(IROW,4)=XB
      COR(IROW,5)=SX
      RETURN
      END
      SUBROUTINE WSAMP(NGROUP,IX,ICOL,C1,C2)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51)
C     THIS SUBROUTINE DRAWS A SAMPLE OF SIZE NGROUP FROM
C     THE WEIBULL DISTRIBUTION FUNCTION. THE PARAMETERS
C     A AND B TO BE USED FOR THE WEIBULL ARE OBTAINED
C     FROM THE ARRAY PAR(I,J). THE DISTRIBUTION HAS
C     LOWER AND UPPER CLIPS C1 AND C2 RESPECTIVELY
C     (SUPPLIED FROM THE MAINLINE). THE SAMPLE IS
C     STORED IN THE ICOL COLUMN OF THE ARRAY VAR(I,J).
C     THE METHOD USED IN SAMPLING IS THE DIRECT APPROACH.
      A=PAR(ICOL,1)
      B=PAR(ICOL,2)
      P=1.0/B
      DO 1 I=1,NGROUP
    2 CALL RANDU(IX,IY,R)
      IX=IY
      IF(R.EQ.1.0) GO TO 2
      R=(C2-C1)*R+C1
      ARG=1.0/(1.0-R)
      VAR(I,ICOL)=(ALOG(ARG)/A)**P
    1 CONTINUE
      RETURN
      END
84
-------
Table C-3 (continued). COMPUTER LISTINGS
OF THE SIMULATION PROGRAMS
      SUBROUTINE NORSAM(NGROUP,IX,ICOL,C1,C2)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51)
C     THIS SUBROUTINE DOES PRECISELY THE SAME THING AS
C     THE WSAMP SUBROUTINE EXCEPT THAT IT DRAWS THE
C     GIVEN SAMPLE FROM THE NORMAL DISTRIBUTION INSTEAD
C     OF THE WEIBULL. THE DIRECT APPROACH IS ALSO USED
C     HERE.
      XBAR=PAR(ICOL,1)
      SD=PAR(ICOL,2)
      TPI=2.0*3.14159
      DO 1 I=1,NGROUP
    2 CALL RANDU(IX,IY,R1)
      IX=IY
      IF(R1.EQ.0.0) GO TO 2
      ARG=1.0/(R1*R1)
      ARG=ALOG(ARG)
      TX=SQRT(ARG)
      CALL RANDU(IX,IY,R2)
      IX=IY
      R2=R2*TPI
      VAR(I,ICOL)=XBAR+SD*TX*SIN(R2)
      IF((VAR(I,ICOL).LT.C1).OR.(VAR(I,ICOL).GT.C2)) GO TO 2
      IF(ICODE(ICOL).NE.2) GO TO 1
      IF(VAR(I,ICOL).LT.0.0) GO TO 2
    1 CONTINUE
      RETURN
      END
      SUBROUTINE LOGSAM(NGROUP,IX,ICOL,C1,C2)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51)
C     THIS SUBROUTINE SAMPLES FROM THE LOG-NORMAL
C     DISTRIBUTION IN PRECISELY THE SAME MANNER AS
C     NORSAM AND WSAMP DO FROM THE NORMAL AND WEIBULL
C     RESPECTIVELY. IT ALSO USES THE DIRECT APPROACH.
      CALL NORSAM(NGROUP,IX,ICOL,C1,C2)
      DO 1 I=1,NGROUP
      VAR(I,ICOL)=EXP(VAR(I,ICOL))
    1 CONTINUE
      RETURN
      END
      SUBROUTINE GAMSAM(NGROUP,IX,ICOL)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51)
C     THIS SUBROUTINE SAMPLES FROM THE GAMMA DISTRIBUTION
C     IN THE SAME MANNER AS THE ABOVE SUBROUTINES SAMPLE
C     FROM THEIR RESPECTIVE DISTRIBUTIONS. THE ONLY
C     DIFFERENCE IS THAT THIS SUBROUTINE USES THE
C     REJECTION METHOD FOR SAMPLING FROM THE GAMMA.
      ALPHA=PAR(ICOL,1)
      BETA=PAR(ICOL,2)
      XU=BETA*(ALPHA+5.0*SQRT(ALPHA))
      AM1=ALPHA-1.0
      CALL GAMMA(ALPHA,GA,I)
      CON=GA*(BETA**ALPHA)
      YU=((AM1)**AM1)/(GA*BETA*EXP(AM1))
      DO 1 I=1,NGROUP
    2 CALL RANDU(IX,IY,XVAL)
      IX=IY
      CALL RANDU(IX,IY,YVAL)
      IX=IY
      XVAL=XVAL*XU
      YVAL=YVAL*YU
      FXVAL=GDF(XVAL,ALPHA,BETA,CON)
-------
Table C-3 (continued). COMPUTER LISTINGS
OF THE SIMULATION PROGRAMS
      FUNCTION GDF(X,ALPHA,BETA,CON)
C     THIS SUBPROGRAM EVALUATES THE PROBABILITY DENSITY
C     FUNCTION FOR THE GAMMA DISTRIBUTION AT THE POINT
C     X. THE GAMMA PARAMETERS ARE ALPHA AND BETA AND
C     THE CONSTANT CON IS GIVEN BY CON=GA*(BETA**ALPHA).
-------
Table C-3 (continued). COMPUTER LISTINGS
OF THE SIMULATION PROGRAMS
      SUBROUTINE LCSAMP(NGROUP,NPASS,IX,ICOL,IVAR,IDVAR)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51),IDCODE(10,2),
     1COR(10,5)
C     THIS SUBROUTINE DRAWS A SAMPLE OF SIZE NGROUP FOR
C     A DEPENDENT INPUT VARIABLE FROM THE LOG-NORMAL
C     DISTRIBUTION. IVAR IS THE SUBSCRIPT OF THE
C     INDEPENDENT VARIABLE WITH WHICH THE GIVEN DEPENDENT
C     VARIABLE IS CORRELATED. IDVAR IS THE NUMBER OF
C     THE DEPENDENT VARIABLE. ICOL IS THE COLUMN OF THE
C     STORAGE ARRAY VAR(I,J) IN WHICH THE SAMPLE IS
C     STORED. THE METHOD USED IS THE ONE DISCUSSED IN
C     THE DESCRIPTION OF THIS PROGRAM.
      YINT=COR(IDVAR,1)
      SLOPE=COR(IDVAR,2)
      SE=COR(IDVAR,3)
      XBARS=COR(IDVAR,4)
      SDS=COR(IDVAR,5)
      FN=NGROUP*NPASS
      TPI=2.0*3.14159
      DO 30 I=1,NGROUP
      AVAR=ALOG(VAR(I,IVAR))
      VAR(I,ICOL)=EXP(VAR(I,ICOL))
   30 CONTINUE
      RETURN
      END
      SUBROUTINE NCSAMP(NGROUP,NPASS,IX,ICOL,IVAR,IDVAR)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51),IDCODE(10,2),
     1COR(10,5)
C     THIS SUBROUTINE DOES THE SAME THING AS LCSAMP
C     ABOVE EXCEPT THAT IT DRAWS THE SAMPLE FOR THE
C     DEPENDENT VARIABLE FROM A NORMAL DISTRIBUTION.
      YINT=COR(IDVAR,1)
      SLOPE=COR(IDVAR,2)
      SE=COR(IDVAR,3)
      XBARS=COR(IDVAR,4)
      SDS=COR(IDVAR,5)
      FN=NGROUP*NPASS
      TPI=2.0*3.14159
      DO 30 I=1,NGROUP
      XBAR=YINT+(SLOPE*VAR(I,IVAR))
      T1=((VAR(I,IVAR)-XBARS)*(VAR(I,IVAR)-XBARS))/(FN*SDS*SDS)
      SD=SE*(1.0+1./FN+T1)
   31 CALL RANDU(IX,IY,R1)
      IX=IY
      IF(R1.EQ.0.0) GO TO 31
      ARG=1.0/(R1*R1)
      ARG=ALOG(ARG)
      TX=SQRT(ARG)
      CALL RANDU(IX,IY,R2)
      IX=IY
      R2=R2*TPI
      VAR(I,ICOL)=XBAR+(SD*TX*SIN(R2))
      IF(VAR(I,ICOL).LT.0.0) GO TO 31
   30 CONTINUE
      RETURN
      END
87
-------
Table C-3 (continued). COMPUTER LISTINGS
OF THE SIMULATION PROGRAMS
      SUBROUTINE RANDU(IX,IY,YFL)
C     THIS SUBROUTINE GENERATES A PSEUDO-RANDOM NUMBER
C     BETWEEN 0 AND 1 AND RETURNS IT IN THE PARAMETER
C     YFL. IT ALSO RETURNS A VALUE OF IY WHICH IS TO BE
C     USED AS IX IN THE NEXT PASS THROUGH THIS SUBROUTINE.
      IY=IX*899
      IF(IY) 5,6,6
    5 IY=IY+32767+1
    6 YFL=IY
      YFL=YFL/32767.
      RETURN
      END
      SUBROUTINE BACKT(NGROUP,ICOL1,CY1,CY2,TRAN,XI1,XI2,ANGL)
C     THIS SUBROUTINE PERFORMS THE
C     TRANSFORMATION BACK TO ORIGINAL
C     DATA VALUES AFTER LINEAR
C     TRANSFORMATION PER PARA 19.8 OF
C     STATISTICAL THEORY WITH ENGINEERING
C     APPLICATIONS, HALD, WILEY, 1967.
C     THE TRANSFORMATION MAKES THE
C     CORRELATION COEFFICIENT ZERO FOR
C     DATA PAIRS OF PARTIALLY CORRELATED
C     DATA WITH MEANS OF ZERO. A SECOND
C     TRANSFORMATION IS REQUIRED TO
C     ADD A CONSTANT TO THE DATA TO
C     MAKE IT POSITIVE SO THAT LOGS
C     CAN BE TAKEN.
C
C     WRITTEN BY LEE MOTE - 3/10/76
C
C     XI1 IS MEAN OF X
C     XI2 IS MEAN OF Y
C     ANGL IS ANGLE OF ROTATION IN
C     RADIANS
C     Y1 IS TRANSFORMED ARRAY FOR
C     INDEPENDENT VARIABLE
C     Y2 IS TRANSFORMED ARRAY FOR
C     DEPENDENT VARIABLE
C     X1 IS RESTORED VALUES OF
C     INDEPENDENT VARIABLE
C     X2 IS RESTORED VALUES OF
C     DEPENDENT VARIABLE
      BIT TRAN(16)
      DIMENSION X1(51),X2(51),Y1(51),Y2(51)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51),IDCODE(10,2),
     1COR(10,5),XSR(250),YSR(250)
      N=NGROUP
      ICOL2=ICOL1+1
      DO 1 I=1,N
      Y1(I)=VAR(I,ICOL1)-CY1
      Y2(I)=VAR(I,ICOL2)-CY2
      IF(TRAN(2)) Y1(I)=ALOG(VAR(I,ICOL1))-CY1
      IF(TRAN(3)) Y2(I)=ALOG(VAR(I,ICOL2))-CY2
    1 CONTINUE
      DO 2 I=1,N
      X1(I)=XI1+Y1(I)*COS(ANGL)-Y2(I)*SIN(ANGL)
      X2(I)=XI2+Y1(I)*SIN(ANGL)+Y2(I)*COS(ANGL)
      IF(TRAN(2)) X1(I)=EXP(X1(I))
      IF(TRAN(3)) X2(I)=EXP(X2(I))
    2 CONTINUE
C
      DO 3 I=1,N
      VAR(I,ICOL1)=X1(I)
    3 VAR(I,ICOL2)=X2(I)
-------
Table C-3 (continued). COMPUTER LISTINGS
OF THE SIMULATION PROGRAMS
      FUNCTION SF(IVAL)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51)
C     THIS SUBPROGRAM EVALUATES THE OUTPUT VARIABLE SF
C     AT A PARTICULAR VALUE IVAL ASSOCIATED WITH THE
C     INPUT VARIABLE ARRAY VAR(I,J).
      SF=(30.1*VAR(IVAL,1)*VAR(IVAL,2))/(VAR(IVAL,3)*VAR(IVAL,3))
      RETURN
      END
      SUBROUTINE SEVCAL(NGROUP,XMIN,XMAX,SMEAN,SSQ)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51)
C     THIS SUBROUTINE CALCULATES THE NGROUP VALUES OF
C     SEVERITY (OR OUTPUT VARIABLE) AND STORES THEM IN
C     THE ARRAY SEV(I) FOR LATER USE. IT ALSO KEEPS
C     TRACK OF A RUNNING SUM OF SEVERITIES, SUM OF THE
C     SQUARES OF SEVERITIES, AND AN OVERALL XMIN AND
C     XMAX VALUE FOR SEVERITY.
      DO 1 I=1,NGROUP
      SEV(I)=SF(I)
      SMEAN=SMEAN+SEV(I)
      SSQ=SSQ+(SEV(I)**2)
      IF(SEV(I).LT.XMIN) XMIN=SEV(I)
      IF(SEV(I).GT.XMAX) XMAX=SEV(I)
    1 CONTINUE
      RETURN
      END
89
-------
Table C-3 (continued). COMPUTER LISTINGS
OF THE SIMULATION PROGRAMS
      SUBROUTINE CINT(NCINT,XMIN,XMAX,NINT,XE,W,NCFLAG)
      DIMENSION XE(35)
C     THIS SUBROUTINE ESTABLISHES THE CLASS INTERVALS TO
C     BE USED IN THE PROGRAM. THIS SUBROUTINE IS
C     BRANCHED TO AFTER THE FIRST PASS OF NGROUP VALUES
C     OF THE OUTPUT VARIABLE SEV ARE CALCULATED. IT
C     THEN USES THESE VALUES AND STURGES' RULE FOR
C     SETTING UP THE CLASS INTERVALS OR IT READS THE
C     NECESSARY DATA AND SETS UP THE CLASS INTERVALS
C     THAT THE USER DESIRES TO HAVE.
      IF(NCFLAG.NE.0) GO TO 2
      XN=FLOAT(NCINT)
      NINT=1.5+3.3*ALOG10(XN)
      IF(XMAX.GT.50.0) XMAX=50.0
      W=(XMAX-XMIN)/FLOAT(NINT)
      W=W+.0001*W
      GO TO 3
    2 READ(1,5)NINT,XMIN,W
    5 FORMAT(I5,2F10.3)
    3 XE(1)=XMIN
      K=NINT+1
      DO 1 I=2,K
      XE(I)=XE(I-1)+W
    1 CONTINUE
      RETURN
      END
      SUBROUTINE AFREQ(NGROUP,NINT,XE,AF,EMP)
      DIMENSION XE(35),AF(35),EMP(3)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51)
C     THIS SUBROUTINE ACCEPTS NGROUP VALUES OF SEVERITY
C     SEV (OR OUTPUT VARIABLE) AND SEPARATES THEM INTO
C     THE NINT CLASS INTERVALS WITH ENDPOINTS IN THE
C     ARRAY XE. THE ACTUAL FREQUENCIES ARE COUNTED IN
C     THE REAL ARRAY AF. FURTHER, THE NUMBER OF
C     OBSERVED VALUES OF SEV BELOW .1, BETWEEN .1 AND 1,
C     AND ABOVE 1 ARE COUNTED AND STORED IN THE ARRAY
C     EMP(I), I=1,2,3.
      DO 1 J=1,NGROUP
      IF(SEV(J).GE.XE(1)) GO TO 2
      AF(1)=AF(1)+1.0
      GO TO 1
    2 DO 3 I=1,NINT
      IF((SEV(J).GE.XE(I)).AND.(SEV(J).LT.XE(I+1))) GO TO 4
      GO TO 3
    4 AF(I+1)=AF(I+1)+1.0
      GO TO 1
    3 CONTINUE
      AF(NINT+2)=AF(NINT+2)+1.0
-------
Table C-3 (continued). COMPUTER LISTINGS
OF THE SIMULATION PROGRAMS
      SUBROUTINE OUTPUT(SMEAN,SDEV,PROB,SNUM,PNUM,AF,CF,XSAMP,XPOP,
     1NINT,XE,ITIL,XMIN,XMAX)
      DIMENSION PROB(3),SNUM(3),PNUM(3),AF(35),CF(35),XE(35),ITIL(20)
C     THIS SUBROUTINE PRINTS THE OUTPUT OF THE PROGRAM
C     INCLUDING MEAN, STANDARD DEVIATION, ETC.
      WRITE(5,10)(ITIL(I),I=1,20)
   10 FORMAT(1H1,40X,20A2)
      WRITE(5,1) SMEAN,SDEV
      WRITE(5,2) PROB(1),PROB(2),PROB(3)
      WRITE(5,3) XSAMP,SNUM(1),SNUM(2),SNUM(3)
      WRITE(5,4) XPOP,PNUM(1),PNUM(2),PNUM(3)
      WRITE(5,5)
      K=NINT+1
      XLE=XMIN
      WRITE(5,7) XLE,XE(1),AF(1),CF(1)
      DO 6 I=2,K
    6 WRITE(5,7) XE(I-1),XE(I),AF(I),CF(I)
      XRE=XMAX
      WRITE(5,7) XE(NINT+1),XRE,AF(NINT+2),CF(NINT+2)
    1 FORMAT(
-------
Table C-3 (continued). COMPUTER LISTINGS
OF THE SIMULATION PROGRAMS
      SUBROUTINE CFPLOT
      CFREQ(I)=CFREQ(I)*FN
      CALL PLOTS(IDUM,IDUM,IPLOT)
      CALL PLOT(1.0,0.0,-2)
      XINT=FLOAT(NINT)+1.0
      CALL AXIS(0.0,0.0,ITIL,-LT,XINT,0.0,XE(1),W)
      FACT=.9
      CALL FACTOR(FACT)
      DELTAV=FLOAT(N/10)+1.
      CALL AXIS(0.,0.,'NUMBER OF PLANTS',16,10.,90.,0.,DELTAV)
      CALL FACTOR(1.0)
      CALL PLOT(0.0,10.0*FACT,3)
      CALL PLOT(XINT,10.0*FACT,2)
      XE(NINT+2)=XE(1)
      XE(NINT+3)=W
      CFREQ(NINT+2)=0.0
      CFREQ(NINT+3)=DELTAV/FACT
      CALL LINE(XE,CFREQ,NINT+1,1,1,11)
      CALL SYMBOL(XINT,4.0,.21,'SAMPLE SIZE = ',0.0,14)
      XN=FLOAT(N)
      CALL NUMBER(999.0,999.0,.21,XN,0.0,-1)
      CALL SYMBOL
      NINT=NINT-1
      DO 3 I=1,20
    3 XE(I)=X(I)
      CALL PLOTS(IDUM,IDUM,IPLOT)
      CALL PLOT(1.0,0.0,-2)
      XINT=FLOAT(NINT)+1.0
      CALL AXIS(0.0,0.0,ITIL,-LT,XINT,0.0,XE(1),W)
      KK=NINT+2
      DO 1 K=1,KK
      M=KOUNT(K)
    1 YK(K)=FLOAT(M)
      YK(NINT+3)=0.0
      CALL SCALE(YK,8.0,NINT+2,1)
      YK(NINT+4)=YK(NINT+5)
      CALL AXIS(-1.0,0.0,'NUMBER OF OBSERVATIONS',22,8.0,90.0,
     1YK(NINT+3),YK(NINT+4))
      DO 2 I=1,KK
    2 YK(I)=YK(I+1)
      XE(NINT+2)=XE(1)-W/2.0
      YK(NINT+2)=0.0
      YVAL=YK(1)/YK(NINT+3)
      CALL PLOT(.5,YVAL,3)
      CALL LINE(XE,YK,NINT+1,1,1,11)
      CALL SYMBOL(XINT,4.0,.21,'SAMPLE SIZE = ',0.0,14)
      XN=FLOAT(N)
      CALL NUMBER(999.0,999.0,.21,XN,0.0,-1)
      CALL SYMBOL(XINT,3.5,.21,'MIN. VALUE = ',0.0,13)
      CALL NUMBER(999.0,999.0,.21,XMIN,0.0,2)
      CALL SYMBOL(XINT,3.0,.21,'MAX. VALUE = ',0.0,13)
      CALL NUMBER(999.0,999.0,.21,XMAX,0.0,2)
      CALL SYMBOL(XINT,2.5,.21,'MEAN = ',0.0,7)
      CALL NUMBER(999.0,999.0,.21,XBAR,0.0,2)
      CALL SYMBOL(XINT,2.0,.21,'STD. DEV. = ',0.0,12)
      CALL NUMBER(999.0,999.0,.21,SD,0.0,2)
      CALL PLOT(0.0,0.0,-3)
      CALL PLOT(30.0,0.0,-3)
      NINT=NINT+1
      RETURN
      END
93
-------
APPENDIX D
THE GOODNESS-OF-FIT PROGRAM (refs. 5, 7-12)
The goodness-of-fit program is designed to take sample data
from some population with an unknown distribution and "fit"
the data to various continuous distributions by one of the
standard procedures of statistics. The program will print
out various sample parameters, such as the mean, standard
deviation, coefficient of skewness, and measure of kurtosis.
It will also print out the corresponding parameters for the
theoretical continuous distributions which are fitted to the
data. Finally, using Sturges' rule to set up class intervals,
the program will calculate the chi-square value to be used in
a chi-square test for goodness-of-fit as discussed earlier.
In addition, using the right endpoints of the class intervals
as comparison points, the program calculates the residual sum
of the squares of the difference in the theoretical cumula-
tive distribution and the actual cumulative distribution.
All of the calculations indicated above can be used to deter-
mine how well the given theoretical distribution fits the
actual data.
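The procedure just described can be restated compactly. The following is an illustrative sketch only (modern Python rather than the report's Fortran; the function names are ours): class intervals are built by Sturges' rule, and the chi-square statistic is accumulated over the class frequencies.

```python
import math

def sturges_intervals(data):
    # Sturges' rule: number of classes k = 1 + 3.322*log10(n)
    n = len(data)
    k = int(round(1.0 + 3.322 * math.log10(n)))
    lo, hi = min(data), max(data)
    w = (hi - lo) / k                     # class width
    return [lo + i * w for i in range(k + 1)]

def chi_square(actual, theoretical):
    # Chi-square statistic over (already pooled) class frequencies
    return sum((a - t) ** 2 / t for a, t in zip(actual, theoretical))

edges = sturges_intervals([float(i) for i in range(100)])   # 100 points -> 8 classes
stat = chi_square([12, 8], [10.0, 10.0])                    # (4 + 4)/10 = 0.8
```

The program's residual sum of squares is computed analogously, from the theoretical and actual cumulative distributions evaluated at the right endpoints of the class intervals.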
1. THEORETICAL DISTRIBUTIONS USED AND THE METHODOLOGY
UTILIZED FOR FITTING THEM TO THE DATA
There are four distributions which are used in the goodness-of-
fit program: the normal distribution, the Weibull distribu-
tion, the gamma distribution, and the log-normal distribution.
These represent a wide range of continuous distributions and
are, therefore, deemed sufficient to cover most (if not all)
of the data encountered.
Three methods are routinely used in statistics for fitting
data to a continuous distribution: the method of maximum
likelihood, the method of moments, and the method of least
squares. The theoretical aspects of these methods are dis-
cussed in the statistics texts listed in the references at
the end of this report. The applications of these methods
to the specific distributions under consideration are dis-
cussed below.
First, the data are fitted to the Weibull distribution by
two methods: the method of maximum likelihood and the method
of least squares. For the method of maximum likelihood, the
following system of equations (in the parameters a and b)
need to be solved:
n/a - Σ_{i=1}^{n} X_i^b = 0          (D-1)

n/b + Σ_{i=1}^{n} log X_i - a Σ_{i=1}^{n} X_i^b log X_i = 0          (D-2)
where n = sample size and {X_i}, i = 1, ..., n, are the sample
data values.  Eliminating "a" from the system, we obtain the
following equation in b alone:
[Σ_{i=1}^{n} X_i^b log X_i^b] / [Σ_{i=1}^{n} X_i^b] - (1/n) Σ_{i=1}^{n} log X_i^b - 1 = 0          (D-3)
The program uses a numerical scheme called the method of
false position to solve the above equation for b. Then it
obtains "a" by a substitution of b into Equation D-1.  Thus,
the necessary parameters, a and b, in the Weibull distribu-
tion function are obtained.
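The same solution scheme can be sketched in Python (an illustrative restatement of Equations D-1 and D-3, not the original Fortran; the bracketing interval 1 ≤ b ≤ 10 mirrors the SOLVE subroutine's assumption, and the sample data are hypothetical):

```python
import math

def d3(b, xs):
    # Left side of Equation D-3 (note log X_i^b = b log X_i)
    n = len(xs)
    xb = [x ** b for x in xs]
    s1 = sum(v * b * math.log(x) for v, x in zip(xb, xs))
    s2 = sum(xb)
    s3 = sum(b * math.log(x) for x in xs)
    return s1 / s2 - s3 / n - 1.0

def weibull_mle(xs, b_lo=1.0, b_hi=10.0, tol=1e-10, itmax=200):
    # Method of false position on d3(b) = 0 (root assumed bracketed),
    # then a = n / sum(x_i**b) from Equation D-1.
    y_lo, y_hi = d3(b_lo, xs), d3(b_hi, xs)
    b = b_lo
    for _ in range(itmax):
        b = b_hi - y_hi * (b_hi - b_lo) / (y_hi - y_lo)
        y = d3(b, xs)
        if abs(y) < tol:
            break
        if y * y_lo > 0.0:      # root lies in (b, b_hi)
            b_lo, y_lo = b, y
        else:                   # root lies in (b_lo, b)
            b_hi, y_hi = b, y
    a = len(xs) / sum(x ** b for x in xs)
    return a, b

# Hypothetical sample: evenly spaced quantiles of a Weibull with
# a = 1, b = 2, so the fitted b should land near 2.
xs = [(-math.log(1.0 - k / 101.0)) ** 0.5 for k in range(1, 101)]
a_hat, b_hat = weibull_mle(xs)
```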
For the method of least squares, the right endpoints of the
class intervals are taken as the x-values and their cumulative
distribution values are taken as the y-values to be trans-
formed by the usual double-log transformation and used in the
method of least squares for obtaining "a" and b.  Since the
cumulative distribution value for the right endpoint of the
last class interval is 1.0, and this value cannot be allowed
in the double-log transformation, log(log(1/(1-y))), the y-
value for the last class interval is set at 0.999.
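In outline (an illustrative Python sketch of the same transformation; the original computation is done by the SETUP and SR subroutines in Fortran), the double-log fit is an ordinary least squares line in log x versus log(log(1/(1-y))):

```python
import math

def weibull_least_squares(endpoints, cum_freq):
    # Since F(x) = 1 - exp(-a*x**b), the transform gives
    # log(log(1/(1 - y))) = log a + b log x, a straight line.
    ys = [min(y, 0.999) for y in cum_freq]    # last class capped at 0.999
    X = [math.log(x) for x in endpoints]
    Y = [math.log(math.log(1.0 / (1.0 - y))) for y in ys]
    n = len(X)
    xb, yb = sum(X) / n, sum(Y) / n
    sxy = sum((u - xb) * (v - yb) for u, v in zip(X, Y))
    sxx = sum((u - xb) ** 2 for u in X)
    b = sxy / sxx
    a = math.exp(yb - b * xb)
    return a, b

# Hypothetical check: exact Weibull cumulative values (a = 0.5, b = 1.5)
# lie on the transformed line, so the fit recovers them exactly.
endpoints = [1.0, 2.0, 3.0, 4.0]
cum = [1.0 - math.exp(-0.5 * x ** 1.5) for x in endpoints]
a_fit, b_fit = weibull_least_squares(endpoints, cum)
```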
The next fit considered is the normal distribution.  The
sample mean, X, and the (unbiased) sample standard deviation
are taken as the values for μ and σ in the normal distribution
function.  This is "almost" a maximum likelihood fit to the
normal.  Actually, the maximum likelihood estimators of μ and
σ are the sample mean and the biased sample standard devia-
tion.  However, the unbiased standard deviation should work
as well or better.
The next distribution used is the gamma distribution. The
data are fitted to this-distribution by the method of moments
(since this method is easy to apply to the gamma). The method
of moments yields the system of equations below (in the param-
eters α and β) to be solved:
αβ = (1/n) Σ_{i=1}^{n} X_i = X̄          (D-4)

αβ² = (1/n) Σ_{i=1}^{n} (X_i - X̄)²   (biased variance used in the method of moments)          (D-5)

Squaring the first equation and dividing it by the second,
one obtains:

α = X̄² / [(1/n) Σ_{i=1}^{n} (X_i - X̄)²] = n X̄² / Σ_{i=1}^{n} (X_i - X̄)²          (D-6)

The program uses the above equation to obtain α directly and
subsequently substitutes this value into the first equation to
obtain β = X̄/α directly.
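As a quick numerical restatement (illustrative Python, not the original GSOLVE subroutine), Equations D-4 through D-6 reduce to two lines:

```python
def gamma_moments(xs):
    # alpha = xbar**2 / (biased variance), beta = xbar / alpha
    n = len(xs)
    xbar = sum(xs) / n
    var_b = sum((x - xbar) ** 2 for x in xs) / n
    alpha = xbar * xbar / var_b
    beta = xbar / alpha
    return alpha, beta

alpha, beta = gamma_moments([1.0, 2.0, 3.0, 4.0, 5.0])
# xbar = 3 and biased variance = 2, so alpha = 4.5 and beta = 2/3;
# note alpha*beta = 3 (the mean) and alpha*beta**2 = 2 (the variance).
```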
Finally, consider fitting the log-normal distribution to the
data. If the data are to be log-normally distributed, the
natural logarithms of the data points should be approximately
normal. Hence, to perform a log-normal fit to the data, the
logarithm of each data point is obtained and used as data for
which to perform a maximum likelihood fit to the normal as
described earlier. That is, the mean of the sample of log
values is used as μ and the unbiased sample standard deviation
of the log values is used as σ for the normal distribution.
The above remarks conclude the discussion of the theoretical
distributions used and the methods for fitting the data to
each of them individually.
2. FORM AND DESCRIPTION OF OUTPUT
The first thing printed out is a title describing the data
which are to be analyzed. The sample statistics are then
printed out, including X, standard deviation, m3 (third
central moment), g1 (coefficient of skewness), etc.  The
various distributions are subsequently fitted one by one in
the manner described above and the information described in
the introduction to this section is printed out for each of
the fits. The first fit is the Weibull Maximum Likelihood
Fit and the last one is the Log-Normal Fit. In the section
for each fit, the class intervals and the corresponding
theoretical frequency and actual frequency of the data in
these class intervals are printed out. In calculating the
value of chi-square, the program automatically pools frequency
classes on the upper and lower tails until they obey "the rule
of 5" for the chi-square test. In calculating the number of
degrees of freedom, the program automatically reduces the
number of class intervals to take into account the pooling of
the class intervals described above.
One should keep in mind, however, that the program does
nothing about a class interval with a theoretical frequency
less than five unless it occurs as the first or last class
interval. In that case, the program simply pools it with
the class interval below or above it until "the rule of 5" is
obeyed. Since most distributions (including the theoretical
ones being discussed) will generally not have small frequency
class intervals in their center (except for bimodal ones),
this procedure will for the most part take care of any neces-
sary reduction to obey "the rule of 5." However, if more
reduction is deemed necessary, this can easily be done by
hand, since the class intervals and the frequencies will be
printed out in the output. As an example of the above, con-
sider the following table of class intervals and frequencies:
Table D-l. THEORETICAL AND ACTUAL FREQUENCIES
           FOR VARIOUS CLASS INTERVALS

     Class        Theoretical      Actual
     interval     frequency        frequency
        1            82.1             84
        2            53.8             57
        3            35.6             34
        4            22.2             14
        5            13.2             14
        6             7.7              6
        7             4.3              7
        8             2.4              5
        9             2.8              3
The program will pool the last two class intervals producing
eight class intervals and, therefore, 8-3 = 5 degrees of
freedom. Class interval 7 does not obey "the rule of 5"
either. However, since it is close to five, the error
produced in leaving it alone and adding one more degree of
freedom to the test is small. Furthermore, if it were pooled
with class interval 6 above it, the resulting value of chi-
square would be less but so also would the number of degrees
of freedom and the final conclusion would be nearly the same.
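The tail-pooling logic can be sketched as follows (illustrative Python; the original CHITST subroutine does this in Fortran). Applied to the frequencies of Table D-1, it merges only the last two classes, leaving eight classes and 8 - 3 = 5 degrees of freedom:

```python
def pool_tails(theo, actual, min_exp=5.0):
    # Pool end classes until the first and last theoretical
    # frequencies each satisfy "the rule of 5".
    theo, actual = list(theo), list(actual)
    while len(theo) > 1 and theo[-1] < min_exp:       # upper tail
        t, a = theo.pop(), actual.pop()
        theo[-1] += t
        actual[-1] += a
    while len(theo) > 1 and theo[0] < min_exp:        # lower tail
        t, a = theo.pop(0), actual.pop(0)
        theo[0] += t
        actual[0] += a
    return theo, actual

theo = [82.1, 53.8, 35.6, 22.2, 13.2, 7.7, 4.3, 2.4, 2.8]
act = [84, 57, 34, 14, 14, 6, 7, 5, 3]
t2, a2 = pool_tails(theo, act)      # last two classes pooled: 2.4+2.8 = 5.2
dof = len(t2) - 3                   # classes minus 1 minus two fitted parameters
```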
3. FORM OF INPUT
The input to the program is divided into two parts. The first
part consists of a single card containing three pieces of
information about the data to be analyzed, namely:
N = sample size (in cols 1-5, integer and right justified)
ITIL(I), I=1,30 = title of the data (in cols 6-65, anywhere)
NFLAG = variable telling whether this is the last data set to
be analyzed by the program on this run. If NFLAG = 0
(i.e., the card is blank in col 80), then the program
expects to analyze another data set after completing
this one. If NFLAG = 1 (i.e., the card has a 1 in col
80), this signals the last data set and the program will
terminate after analyzing the current data.
The second part of the input contains the values of the data
punched eight per card under an F10.3 format until the data
are exhausted. The format for reading the data into the
program could easily be changed, if desired.
A listing of the program and output pertaining to the coal-
fired electric utility data is given in Appendix C for
reference purposes.
Table D-2. COMPUTER LISTINGS OF THE
SIMULATION PROGRAMS - GOODNESS-OF-FIT PROGRAM
      DIMENSION XE(20),AF(20),XSR(20),YSR(20),XES(20),AFS(20),RF(20)
      DIMENSION ITIL(30)
      COMMON DATA(500),XMIN,XMAX,XBAR,SDEV
C     THIS PROGRAM IS DESIGNED TO TAKE SAMPLE DATA FROM
C     SOME POPULATION WITH AN UNKNOWN DISTRIBUTION AND
C     "FIT" THIS DATA TO VARIOUS CONTINUOUS DISTRIBUTIONS
C     BY ONE OF THE STANDARD PROCEDURES OF STATISTICS.
C     THE PROGRAM WILL PRINT OUT VARIOUS SAMPLE PARAMETERS
C     SUCH AS MEAN, STANDARD DEVIATION, COEFFICIENT OF
C     SKEWNESS, AND MEASURE OF KURTOSIS.  IT WILL ALSO
C     PRINT OUT THE CORRESPONDING PARAMETERS FOR THE
C     THEORETICAL CONTINUOUS DISTRIBUTIONS WHICH ARE
C     FITTED TO THE DATA.  FINALLY, USING STURGES' RULE
C     TO SET UP CLASS INTERVALS, THE PROGRAM WILL
C     CALCULATE THE CHI-SQUARE VALUE TO BE USED IN A
C     CHI-SQUARE TEST FOR GOODNESS-OF-FIT.  IN ADDITION,
C     USING THE RIGHT ENDPOINTS OF THE CLASS INTERVALS
C     AS COMPARISON POINTS, THE PROGRAM CALCULATES THE
C     RESIDUAL SUM OF THE SQUARES OF THE DIFFERENCE IN
C     THE THEORETICAL CUMULATIVE DISTRIBUTION AND THE
C     ACTUAL CUMULATIVE DISTRIBUTION.
    5 READ(1,1) N,(ITIL(I),I=1,30),NFLAG
    1 FORMAT(I5,30A2,10X,I5)
      READ(1,2)(DATA(I),I=1,N)
    2 FORMAT(8F10.3)
      WRITE(5,10)(ITIL(I),I=1,30)
   10 FORMAT(1H1,30X,30A2)
      CALL SSTAT(N)
      CALL CINT(N,XMIN,XMAX,NINT,W,XES,AFS)
C     WEIBULL MAXIMUM LIKELIHOOD FIT
      CALL RELOAD(XES,AFS,XE,AF,NINT)
      CALL SOLVE(N,A,B)
      IF(B.LT.0.0)GO TO 3
      CALL TSKEW(A,B)
      CALL RFWEIB(NINT,XE,A,B,RF)
      CALL RELOAD(XES,AFS,XE,AF,NINT)
      CALL CHITST(N,NINT,RF,AF,XE)
C     WEIBULL LEAST SQUARES FIT
    3 CALL RELOAD(XES,AFS,XE,AF,NINT)
      CALL SETUP(N,NINT,XE,AF,XSR,YSR)
      CALL SR(NINT,XSR,YSR,A,B)
      CALL RFWEIB(NINT,XE,A,B,RF)
      CALL RELOAD(XES,AFS,XE,AF,NINT)
      CALL CHITST(N,NINT,RF,AF,XE)
C     GAMMA METHOD OF MOMENTS FIT
      CALL RELOAD(XES,AFS,XE,AF,NINT)
      CALL GSOLVE(XBAR,SDEV,ALPHA,BETA,N)
      IF(ALPHA.LT.1.0) GO TO 50
      CALL RFGAM(NINT,XE,ALPHA,BETA,RF)
      CALL RELOAD(XES,AFS,XE,AF,NINT)
      CALL CHITST(N,NINT,RF,AF,XE)
      GO TO 51
   50 WRITE(5,52)
   52 FORMAT(1H0,'THE GAMMA DISTRIBUTION WILL NOT FIT THIS DATA')
C     LOG NORMAL FIT
   51 WRITE(5,100)
  100 FORMAT(1H1/////53X,'LOG NORMAL FIT')
      DO 101 I=1,N
      DATA(I) = ALOG(DATA(I))
  101 CONTINUE
      CALL SSTAT(N)
      CALL CINT(N,XMIN,XMAX,NINT,W,XES,AFS)
      CALL RELOAD(XES,AFS,XE,AF,NINT)
      CALL RFNORM(NINT,XE,XBAR,SDEV,RF)
      CALL RELOAD(XES,AFS,XE,AF,NINT)
      CALL CHITST(N,NINT,RF,AF,XE)
      IF(NFLAG.EQ.0) GO TO 5
      STOP
      END
      SUBROUTINE RFNORM(NINT,XE,XBAR,SDEV,RF)
      DIMENSION XE(20),RF(20)
C     THIS SUBROUTINE IS DESIGNED TO CALCULATE THE
C     THEORETICAL RELATIVE FREQUENCY RF OF THE NORMAL
C     DISTRIBUTION WITH MEAN XBAR AND STANDARD DEVIATION
C     SDEV OVER THE CLASS INTERVALS WHOSE ENDPOINTS ARE
C     GIVEN IN THE ARRAY XE.  NINT IS THE NUMBER OF
C     CLASS INTERVALS REPRESENTED.
      DO 1 I=1,NINT
    1 RF(I)=0.0
      IF((XBAR-3.*SDEV).LT.XE(2)) GO TO 2
      XE(1) = 0.0
      GO TO 3
    2 XE(1) = XBAR-3.*SDEV
    3 M = NINT-1
      S = 0.0
      DO 6 I=1,M
      DX = (XE(I+1)-XE(I))/1000.
      CALL INTEG(XE(I),XE(I+1),DX,XBAR,SDEV,V)
      RF(I) = V
      S = S+RF(I)
    6 CONTINUE
      RF(NINT) = 1.0-S
      RETURN
      END
      SUBROUTINE INTEG(C,D,DX,XBAR,SDEV,V)
C     THIS SUBROUTINE CALCULATES THE INTEGRAL OF THE
C     NORMAL PROBABILITY DENSITY FUNCTION PDF WITH MEAN
C     XBAR AND STANDARD DEVIATION SDEV OVER THE INTERVAL
C     C TO D IN INCREMENTS OF DX BY THE TRAPEZOIDAL
C     RULE.  1000 ITERATIONS ARE USED.
      V = 0.0
      A2 = SQRT(2.*3.14159)*SDEV
      Y1 = PDF(C,XBAR,SDEV,A2)
      Y2 = Y1
      DO 1 I = 1,1000
      X1 = FLOAT(I)
      X2 = C + X1*DX
      Y1 = Y2
      Y2 = PDF(X2,XBAR,SDEV,A2)
    1 V = V + 0.5*(Y1+Y2)*DX
      RETURN
      END
      SUBROUTINE SETUP(N,NINT,XE,AF,XSR,YSR)
      DIMENSION XE(20),AF(20),XSR(20),YSR(20)
C     THIS SUBROUTINE ACCEPTS THE DATA AFTER IT IS
C     ARRANGED INTO CLASS INTERVALS AND ACTUAL
C     FREQUENCIES HAVE BEEN CALCULATED AND SETS UP THE
C     RIGHT ENDPOINTS OF THE CLASS INTERVALS WITH THEIR
C     RESPECTIVE CUMULATIVE FREQUENCIES FOR THE LEAST
C     SQUARES FIT TO THE WEIBULL.
      S = 0.0
      XN = FLOAT(N)
      DO 1 I=1,NINT
      XSR(I) = XE(I+1)
      S = S+AF(I)
    1 YSR(I) = S/XN
      YSR(NINT)=.999
      DO 2 I=1,NINT
      XSR(I) = ALOG(XSR(I))
      ARG = 1.0/(1.0-YSR(I))
      ARG = ALOG(ARG)
      YSR(I) = ALOG(ARG)
    2 CONTINUE
      RETURN
      END
      FUNCTION F(N,B)
C     THIS FUNCTION SUBPROGRAM CALCULATES THE EQUATION
C     VALUE AT A POINT B FOR THE EQUATION USED IN
C     SOLVING FOR B IN THE WEIBULL MAXIMUM LIKELIHOOD
C     FIT.  N IS THE TOTAL NUMBER OF DATA POINTS.
      DIMENSION XB(500),ALXB(500)
      COMMON X(500)
      S1=0.0
      S2=0.0
      S3=0.0
      DO 1 I=1,N
      XB(I)=X(I)**B
      ALXB(I)=ALOG(XB(I))
      S1=S1+XB(I)*ALXB(I)
      S2=S2+XB(I)
    1 S3=S3+ALXB(I)
      FN=FLOAT(N)
      FN=1/FN
      F=(S1/S2)-FN*S3-1.0
      RETURN
      END
      SUBROUTINE TSKEW(A,B)
C     THIS SUBROUTINE CALCULATES AND PRINTS THE
C     THEORETICAL VALUES OF THE MEAN, VARIANCE, STANDARD
C     DEVIATION, THIRD AND FOURTH CENTRAL MOMENTS,
C     COEFFICIENT OF SKEWNESS, AND MEASURE OF KURTOSIS
C     FOR THE WEIBULL DISTRIBUTION FUNCTION WITH
C     PARAMETERS A AND B GIVEN.  IT ALSO EVALUATES AND
C     PRINTS THE I AND J POINTS OF THE FUNCTION FOR I=1,
C     ---, 5 AND J=99, ---, 95.
      Y1=1.0+(1.0/B)
      Y2=1.0+(2.0/B)
      Y3=1.0+(3.0/B)
      Y4=1.0+(4.0/B)
      CALL GAMMA(Y1,G1,I)
      CALL GAMMA(Y2,G2,J)
      CALL GAMMA(Y3,G3,K)
      CALL GAMMA(Y4,G4,L)
      A1=A**(-1.0/B)
      A2=A**(-2.0/B)
      A3=A**(-3.0/B)
      A4=A**(-4.0/B)
      XMU = A1*G1
      SIGMA2=(G2-G1**2)*A2
      SIGMA=SQRT(SIGMA2)
      SUBROUTINE RFWEIB(NINT,XE,A,B,RF)
      DIMENSION XE(20),RF(20),CF(20)
C     THIS SUBROUTINE CALCULATES THE RELATIVE FREQUENCY
C     RF OVER THE CLASS INTERVALS WHOSE ENDPOINTS ARE IN
C     THE ARRAY XE FOR THE WEIBULL DISTRIBUTION FUNCTION
C     WHOSE PARAMETERS ARE A AND B.  NINT IS THE NUMBER
C     OF CLASS INTERVALS.
      MM = NINT-1
      DO 5 I = 1,MM
      XVAL = XE(I+1)
      CF(I) = WF(XVAL,A,B)
    5 CONTINUE
      CF(NINT) = 1.0
      RF(1) = CF(1)
      DO 6 I = 2,NINT
    6 RF(I) = CF(I)-CF(I-1)
      RETURN
      END
      SUBROUTINE SR(NINT,XSR,YSR,A,B)
C     THIS SUBROUTINE CALCULATES THE PARAMETERS A AND B
C     FOR THE WEIBULL DISTRIBUTION FUNCTION BY THE LEAST
C     SQUARES METHOD.  THE ALREADY SET UP ARRAYS XSR AND
C     YSR ARE SUPPLIED TO THE SUBROUTINE.  NINT IS THE
C     NUMBER OF CLASS INTERVALS.
      DIMENSION XSR(20),YSR(20)
      FN=FLOAT(NINT)
      S1 = 0.0
      S2 = 0.0
      DO 1 I = 1,NINT
      S1 = S1+XSR(I)
      S2 = S2+YSR(I)
    1 CONTINUE
      XB = S1/FN
      YB = S2/FN
      S1 = 0.0
      S2 = 0.0
      S = 0.0
      DO 2 I = 1,NINT
      S1 = S1+(XSR(I)-XB)**2
      S2 = S2+(YSR(I)-YB)**2
      S = S+(XSR(I)-XB)*(YSR(I)-YB)
    2 CONTINUE
      VX = S1/(FN-1.0)
      VY = S2/(FN-1.0)
      SX = SQRT(VX)
      SY = SQRT(VY)
      B = S/S1
      AL = YB-XB*B
      A = EXP(AL)
      R = (B*SX)/SY
      R2 = R**2
      ARG = (B*(1.-R2))/(FN-2.0)
      SE = SQRT(ARG)
      SB = SE/SQRT(S1)
      TV = B/SB
      WRITE(5,400)
  400 FORMAT(1H0,/////53X,'WEIBULL LEAST SQUARES FIT')
      WRITE(5,10)A,B
   10 FORMAT(1H0,'WEIBULL PARAMETERS ARE G = 0.0, A = ',E14.4,
     15X,'B = ',E14.4)
      WRITE(5,11)R,R2,SE,SB,TV
   11 FORMAT(1H0,'R = ',E12.4,5X,'RSQ = ',E12.4,5X,'SE = ',E12.4,
     15X,'SB = ',E12.4,5X,'TV = ',E12.4)
      RETURN
      END
      SUBROUTINE RELOAD(XES,AFS,XE,AF,NINT)
      DIMENSION XES(20),AFS(20),XE(20),AF(20)
C     THE ARRAYS XE OF CLASS INTERVAL ENDPOINTS AND AF
C     OF ACTUAL FREQUENCIES ARE ALTERED IN CERTAIN PARTS
C     OF THIS PROGRAM TO FIT THE NEEDS AT THE TIME.
C     THUS, RATHER THAN RECALCULATE THEM EVERY TIME THEY
C     ARE NEEDED, THEY WERE STORED IN SAVED ARRAYS XES
C     AND AFS UPON INITIAL CALCULATION.  THIS SUBROUTINE
C     SIMPLY RELOADS THE WORKING ARRAYS XE AND AF WITH
C     THEIR SAVED VALUES.
      DO 1 I = 1,NINT
      XE(I) = XES(I)
      AF(I) = AFS(I)
    1 CONTINUE
      XE(NINT+1) = XES(NINT+1)
      RETURN
      END
      SUBROUTINE GSOLVE(XBAR,SDEV,ALPHA,BETA,N)
C     THIS SUBROUTINE FITS THE GAMMA DENSITY FUNCTION TO
C     THE GIVEN DATA BY THE METHOD OF MOMENTS.  XBAR AND
C     SDEV ARE THE MEAN AND STANDARD DEVIATION OF THE
C     GIVEN DATA WHILE ALPHA AND BETA ARE THE CALCULATED
C     PARAMETERS FOR THE GAMMA DENSITY FUNCTION.  N IS
C     THE TOTAL NUMBER OF DATA POINTS IN OUR SAMPLE.  IN
C     ADDITION, THIS SUBROUTINE CALCULATES AND PRINTS
C     THE THEORETICAL VALUES OF THE MEAN, STANDARD
C     DEVIATION, COEFFICIENT OF SKEWNESS, ETC. FOR THE
C     FITTED GAMMA FUNCTION.
      FN=FLOAT(N)
      ALPHA=(FN*XBAR*XBAR)/((FN-1.0)*SDEV*SDEV)
      BETA=XBAR/ALPHA
      X3=2.0*ALPHA*BETA*BETA*BETA
      G1=2.0/SQRT(ALPHA)
      X4=((3.0*ALPHA**2)+(6.0*ALPHA))*BETA**4
      G2=3.0+6./ALPHA
      VAR=SDEV*SDEV
      WRITE(5,1)
    1 FORMAT(1H1/////53X,'GAMMA METHOD OF MOMENTS FIT')
      WRITE(5,2) ALPHA,BETA
    2 FORMAT(1H0,'GAMMA PARAMETERS ARE: ','ALPHA = ',E14.7,5X,'BETA = '
     1,E14.7)
      WRITE(5,3)XBAR,VAR,SDEV,X3,X4,G1,G2
    3 FORMAT(1H0,'XBAR = ',E14.7,5X,'VAR = ',E14.7,5X,'SDEV = ',
     1E14.7,5X//'THIRD CENTRAL MOMENT = ',E14.7,5X,'FOURTH CENTRAL MOMEN
     2T = ',E14.7,5X//'COEFFICIENT OF SKEWNESS = ',E14.7,5X,
     3'MEASURE OF KURTOSIS = ',E14.7)
      RETURN
      END
      SUBROUTINE RFGAM(NINT,XE,ALPHA,BETA,RF)
      DIMENSION XE(20),RF(20)
C     THIS SUBROUTINE SETS UP THE ARRAY OF RELATIVE
C     FREQUENCIES RF FOR THE GAMMA DENSITY FUNCTION WITH
C     PARAMETERS ALPHA AND BETA.  XE IS THE ARRAY OF
C     CLASS INTERVAL ENDPOINTS AND NINT IS THE NUMBER OF
C     CLASS INTERVALS.
      DO 1 I=1,NINT
    1 RF(I)=0.0
      XE(1) = 1.0E-05
      M = NINT-1
      S = 0.0
      DO 4 I = 1,M
      DX=(XE(I+1)-XE(I))/1000.
      SUBROUTINE SSTAT(N)
      COMMON X(500), XMIN, XMAX, XBAR, SDEV
C     THIS SUBROUTINE CALCULATES AND PRINTS THE SAMPLE
C     STATISTICS FOR THE GIVEN DATA.
      S=0.0
      DO 1 I=1,N
    1 S=S+X(I)
      XBAR=S/FLOAT(N)
      S2=0.0
      S3=0.0
      S4=0.0
      DO 2 I=1,N
      S2=S2+(X(I)-XBAR)**2
      S3=S3+(X(I)-XBAR)**3
    2 S4=S4+(X(I)-XBAR)**4
      VAR=S2/FLOAT(N-1)
      SDEV=SQRT(VAR)
      FN=FLOAT(N)
      XM4=((FN*(FN+1.0))/((FN-1.0)*(FN-2.0)*(FN-3.0)))*S4
      XT2=((3.0*(FN-1.0)**2)/((FN-2.0)*(FN-3.0)))*VAR**2
      SUBROUTINE CHITST(N,NINT,RF,AF,XE)
      DIMENSION AF(20),RF(20),TF(20),XE(20)
C     THIS SUBROUTINE CALCULATES THE THEORETICAL
C     FREQUENCIES FOR THE CLASS INTERVALS AND PRINTS THE
C     TABLE OF CLASS INTERVALS WITH THEIR RESPECTIVE
C     THEORETICAL AND ACTUAL FREQUENCIES.  IT ALSO
C     CALCULATES AND PRINTS THE VALUE OF SSE, THE SUM OF
C     THE SQUARES OF THE ERROR.  FINALLY, IT CALCULATES
C     AND PRINTS THE VALUE OF CHI-SQUARE AND ITS
C     ASSOCIATED DEGREES OF FREEDOM FOR THIS DATA.
      FN = FLOAT(N)
      DO 7 I=1,NINT
    7 TF(I)=RF(I)*FN
      WRITE(5,10)
   10 FORMAT(1H0,' CLASS INTERVALS',20X,'THEORETICAL FREQUENCY',20X,
     1'ACTUAL FREQUENCY')
      DO 11 I=1,NINT
   11 WRITE(5,12) XE(I),XE(I+1),TF(I),AF(I)
   12 FORMAT(1H0,'FROM ',F10.2,' TO ',F10.2,15X,F10.3,30X,F10.3)
      S1 = 0.0
      S2 = 0.0
      SSE = 0.0
      DO 30 I = 1,NINT
      S1 = S1+RF(I)
      S2 = S2+(AF(I)/FN)
      SSE = SSE + (S1-S2)**2
   30 CONTINUE
      WRITE(5,31) SSE
   31 FORMAT(1H0,'SSE = ',F16.7)
      K=NINT
      L=1
      DO 22 I=1,NINT
      IF(TF(NINT-I+1).GE.5.0) GO TO 23
      TF(NINT-I)=TF(NINT-I)+TF(NINT-I+1)
      AF(NINT-I)=AF(NINT-I)+AF(NINT-I+1)
      K=K-1
   22 CONTINUE
   23 S=0.0
      DO 21 I=L,K
   21 S=S+((TF(I)-AF(I))**2.0)/TF(I)
      NDF=K-L-2
      WRITE(5,50) S,NDF
   50 FORMAT(1H0,'CHI-SQUARE = ',F16.7,5X,'DEGREES OF FREEDOM = ',I5)
      RETURN
      END
      SUBROUTINE SOLVE(N,A,B)
      COMMON XX(500)
C     THIS SUBROUTINE SOLVES FOR THE PARAMETERS A AND B
C     IN THE WEIBULL DISTRIBUTION FUNCTION BY THE METHOD
C     OF MAXIMUM LIKELIHOOD.  IT USES THE METHOD OF
C     FALSE POSITION IN THE SOLUTION PROCEDURE.  N IS
C     THE TOTAL NUMBER OF DATA POINTS AND XX IS THE
C     ARRAY OF DATA POINTS.
      X1 = 1.0
      X2 = 2.0
      Y1 = F(N,X1)
      Y2 = F(N,X2)
   71 IF((Y1.LT.0.0.AND.Y2.GT.0.0).OR.(Y1.GT.0.0.AND.Y2.LT.0.0)) GO TO
     170
      X1 = X2
      Y1 = Y2
      X2 = X2 + 1.0
      Y2 = F(N,X2)
      IF(X1.GT.10.0) GO TO 2
      GO TO 71
    2 WRITE(5,11)
   11 FORMAT(' BETA IS > 10.0 OR < 1.0 OR THERE IS AN EVEN NUMBER OF
     1ROOTS BETWEEN TWO INTEGER VALUES')
      B=-1.0
      RETURN
   70 K = 0
    3 X = X2-Y2*(X2-X1)/(Y2-Y1)
      Y = F(N,X)
      IF(ABS(Y).LT.1.0E-04) GO TO 100
      IF((Y.GT.0.0.AND.Y1.GT.0.0).OR.(Y.LT.0.0.AND.Y1.LT.0.0)) GO TO 5
      X2 = X
      Y2 = Y
   20 K = K + 1
      IF(K.GE.5000) GO TO 101
      GO TO 3
    5 X1 = X
      Y1 = Y
      GO TO 20
  101 WRITE(5,12)
   12 FORMAT(' THE NUMBER OF ITERATIONS EXCEEDED 5000')
  100 B=X
      S=0.0
      DO 50 I=1,N
   50 S=S+XX(I)**B
      A=FLOAT(N)/S
      WRITE(5,400)
  400 FORMAT(1H0/////50X,'WEIBULL MAXIMUM LIKELIHOOD FIT')
      WRITE(5,500) A,B
  500 FORMAT(1H0,'WEIBULL PARAMETERS ARE G = 0.0, A = ',E14.4,5X,
     1'B = ',E14.4)
      RETURN
      END
APPENDIX E
TREATMENT OF CORRELATED DATA BY LINEAR TRANSFORMATION
1. THEORETICAL DISCUSSION OF LINEAR TRANSFORMATION
Dependent (correlated) input variables are discussed in
Section IV. B. 2. If the original variables are binormally
distributed, it is possible, by a linear transformation of
variables, to find two normally distributed and stochastically
independent linear functions of these variables.13  Assume
that X1 and X2 are correlated variables.  With the intro-
duction of variables Y1 and Y2, we can transform the (X1, X2)
variables to the (Y1, Y2) coordinate system, which has its
origin at the point (X1, X2) = (X̄1, X̄2); the Y-axis
forms angle α with the X-axis.
Y1 = (X1 - X̄1) cos α + (X2 - X̄2) sin α          (E-1)

Y2 = -(X1 - X̄1) sin α + (X2 - X̄2) cos α          (E-2)
The correlation coefficient of (Y1, Y2) depends on the angle
α.  The following equation is used to find a value of α for
which the correlation coefficient is zero:

tan 2α = 2ρS1S2/(S1² - S2²)  for S1 ≠ S2,  and  α = π/4  for S1 = S2          (E-3)

where ρ = correlation coefficient of (X1, X2) data pairs
      S1 = sample standard deviation of the X1 data
      S2 = sample standard deviation of the X2 data
The data values (Yl, Y2) may be restored to the original
values (XI, X2) by the following inverse transformations:
X1 = X̄1 + Y1 cos α - Y2 sin α          (E-4)

X2 = X̄2 + Y1 sin α + Y2 cos α          (E-5)
The above equations were implemented using APL language on a
time-sharing terminal. Using the population data for coal
consumption and stack heights, the (XI, X2) data pairs were
correlated with ρ = 0.559.  After transformation, the correla-
tion coefficient was ρ = 2.978E-17, which is essentially zero.
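Equations E-1 through E-5 translate directly into code. The sketch below is illustrative Python (the report's implementation was in APL), with hypothetical data standing in for the coal-consumption and stack-height pairs:

```python
import math

def decorrelate(x1, x2):
    # Rotate (X1, X2) about their means through the angle of
    # Equation E-3 so that (Y1, Y2) are uncorrelated (E-1, E-2).
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    v1 = sum((u - m1) ** 2 for u in x1) / (n - 1)
    v2 = sum((v - m2) ** 2 for v in x2) / (n - 1)
    sxy = sum((u - m1) * (v - m2) for u, v in zip(x1, x2)) / (n - 1)
    if abs(v1 - v2) < 1e-15:
        alpha = math.pi / 4.0
    else:
        # tan 2a = 2*rho*S1*S2/(S1**2 - S2**2) = 2*cov/(v1 - v2)
        alpha = 0.5 * math.atan2(2.0 * sxy, v1 - v2)
    c, s = math.cos(alpha), math.sin(alpha)
    y1 = [(u - m1) * c + (v - m2) * s for u, v in zip(x1, x2)]
    y2 = [-(u - m1) * s + (v - m2) * c for u, v in zip(x1, x2)]
    return y1, y2, alpha, (m1, m2)

def restore(y1, y2, alpha, means):
    # Inverse transformation, Equations E-4 and E-5
    m1, m2 = means
    c, s = math.cos(alpha), math.sin(alpha)
    x1 = [m1 + u * c - v * s for u, v in zip(y1, y2)]
    x2 = [m2 + u * s + v * c for u, v in zip(y1, y2)]
    return x1, x2

# Hypothetical correlated pairs standing in for (coal consumed, stack height)
cc = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
h = [1.2, 2.1, 2.9, 4.3, 4.8, 6.1]
y1, y2, alpha, means = decorrelate(cc, h)
x1r, x2r = restore(y1, y2, alpha, means)   # round-trips to cc, h
```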
2. SIMULATION IMPLEMENTATION
The variables coal consumed (CC) and stack height (H) were
treated as log-normal distributions.  The transformation of
the 24 sample values for CC and H was performed on the APL
terminal as previously discussed.  The correlation coefficient
for CC with H was 0.3705. After transformation, the correla-
tion coefficient was 9.9E-17, which is essentially zero.
Table E-l lists the inputs and results of various simulation
runs.
Equations E-4 and E-5 were implemented in a subroutine in the
simulation program to restore the transformed data to actual
values (see Table C-3 for a listing of the program), with
the result that the mean severity was 9.31. The simulation
was repeated with stack height considered as an independent
variable. The mean value of severity was 11.25. The t-test
was made to see if these values were significantly different
from the mean severity of 9.25 arrived at earlier in this
report.14
Table E-l. COMPARISON OF SIMULATION RESULTS(a)

Method               Parameter             Mean     Standard    Comment
                                                    deviation
Section IV.B.2       Coal consumed         1326     1329        A = 0.9487E-3,
                                                                B = 0.9856
                     Stack heights(b)      93.6     36          Dependent variable
                     Percent sulfur(c)     1.82     1.15        A = 0.3039,
                                                                B = 1.667
                     Resultant severity    9.25     12.48

Variables treated    Stack heights(d)      4.472    0.3751      Log-normal
as independent                                                  distribution
                     Resultant severity    11.25    19.17

Transformation of    Coal consumed(e)      0.0      0.9731      Transformed
variables            Stack heights         0.0      0.3442      Transformed
                     Resultant severity    9.31     12.0

(a) Coal consumption in kg/yr; stack height in m; severity is dimensionless.
(b) Stack height treated as a dependent variable correlated with coal
    consumed.
(c) Same value used for each simulation.
(d) Stack height treated as an independent variable.
(e) Coal consumed and stack heights linearly transformed to make the
    correlation coefficient zero; both were treated as log-normal
    distributions.
The t-test assumes that the variate X is normally distributed
with mean μ.  If all possible samples of n variates are
taken from this population and their means are denoted by X̄,
then

t = (X̄ - μ)/S_X̄          (E-6)

Ŝ = S √(n/(n-1))          (E-7)

S_X̄ = Ŝ/√n          (E-8)

where t = value from a table of the Student t-distribution
          with n-1 degrees of freedom
      X̄ = mean of X
      μ = population mean
      S_X̄ = standard deviation of X̄
      S = sample standard deviation
      Ŝ = standard deviation corrected for small sample size
      n = sample size
In the sample calculation shown below, the population mean
is taken to be 9.25, from the simulation results of Section
IV.B.2.  From a reference table, t(23) = 2.807 at the 99%
confidence level.

t = (11.25 - 9.25)/3.998 = 0.50
Since the above t value is less than 2.807, it is concluded
that the means are from the same population. This is also
true for the mean of the simulation using the linear trans-
formation.
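The arithmetic of Equations E-6 through E-8, applied to the stack-height-independent case above, can be checked with a few lines (illustrative Python; symbols as defined in the text):

```python
import math

def small_sample_t(xbar, mu, s, n):
    # Equations E-7 and E-8, then E-6
    s_hat = s * math.sqrt(n / (n - 1))    # corrected for small sample size
    s_xbar = s_hat / math.sqrt(n)         # standard deviation of the mean
    return (xbar - mu) / s_xbar

t = small_sample_t(11.25, 9.25, 19.17, 24)   # about 0.50, as in the text
```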
The F-distribution allows one to test whether the variances
of two samples are from the same population.15  A confidence
level such as 99% (significance level of 1%) is selected.
A null hypothesis of σ1² = σ2² is made.  The alternative
is σ1² ≠ σ2², in which case the samples are not from the same
populations with the same variance. Usually, the larger
variance is divided by the smaller:
F = S1²/S2²  where S1 > S2
For the case of treating stack height as an independent
variable,
F = (19.17)²/(12.48)² = 367/156 = 2.35
and for the linear transformation case,
F = (12.48)²/(12.0)² = 156/144 = 1.08
For a sample size of 24, the critical region is F > 2.72
(found from a table).  Thus, it is concluded that the
variances are not significantly different at the 1%
significance level.  It thus appears that linear
transformation is a valid method to use with correlated data.
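The variance-ratio computation above reduces to a one-line comparison (illustrative Python):

```python
def f_ratio(s1, s2):
    # F statistic: the larger sample variance over the smaller
    v1, v2 = s1 ** 2, s2 ** 2
    return max(v1, v2) / min(v1, v2)

f_indep = f_ratio(19.17, 12.48)   # stack height independent: about 2.36
f_trans = f_ratio(12.48, 12.0)    # linear transformation case: about 1.08
# Both fall below the critical value of 2.72 for a sample size of 24.
```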
SECTION VII
REFERENCES
1. Slade, D. H. (ed.). Meteorology and Atomic Energy.
Environmental Science Services Administration, Air
Resources Labs. Silver Spring. AEC Publication No.
TID-24190. July 1968. 445 p.
2. Turner, D. B. Workbook of Atmospheric Dispersion
Estimates, 1970 Revision. U.S. Department of Health,
Education, and Welfare. Cincinnati. Public Health
Service Publication No. 999-AP-26. May 1970. 84 p.
3. Parzen, E. Modern Probability Theory and Its
Applications. New York, John Wiley & Sons, 1960.
4. Springer, C. H., et al. Probabilistic Models.
Homewood, Richard D. Irwin, Inc., 1968.
5. Curran, T. C., and N. H. Frank. Assessing the Validity
of the Lognormal Model when Predicting Maximum Air
Pollution Concentrations. (Presented at the 68th
Annual Meeting of the Air Pollution Control Association,
Boston. June 15-20, 1975.)
6. Geary, R. C., and E. S. Pearson. Tests of Normality.
London, University College, 1938.
7. Mendenhall, W., and R. L. Scheaffer. Mathematical
Statistics with Applications. North Scituate,
Duxbury Press, 1973.
8. Walpole, R. E., and R. H. Myers. Probability and
Statistics for Engineers and Scientists. New York,
The MacMillan Co., 1972.
9. Siegel, S. Nonparametric Statistics. New York,
McGraw-Hill Book Co., 1956.
10. Duncan, A. J. Quality Control and Industrial
Statistics. Chicago, Richard D. Irwin, Inc., 1952.
11. Cramer, H. Mathematical Methods of Statistics.
Princeton, Princeton University Press, 1946. 575 p.
12. Ott, W. R., and D. T. Mage. Random Sampling as an
Inexpensive Means for Measuring Average Annual Air
Pollutant Concentrations in Urban Areas. (Presented
at the 68th Annual Meeting of the Air Pollution Control
Association, Boston. June 15-20, 1975.)
13. Hald, A. Statistical Theory with Engineering
Applications. New York, John Wiley & Sons, Inc.,
1952. p. 596-599.
14. Alder, H. L., and Roessler, E. B. Introduction to
Probability and Statistics. San Francisco, W. H. Freeman
and Company, 1968. p. 136-140.
15. Koosis, D. J. Statistics. New York, John Wiley and Sons,
Inc., 1972. p. 155-160.
TECHNICAL REPORT DATA
(Please read Instructions on the reverse before completing)
1. REPORT NO.: EPA-600/2-76-032e
3. RECIPIENT'S ACCESSION NO.:
4. TITLE AND SUBTITLE: Source Assessment: Severity of Stationary Air
   Pollution Sources--A Simulation Approach
5. REPORT DATE: July 1976
6. PERFORMING ORGANIZATION CODE:
7. AUTHOR(S): E. C. Eimutis, B. J. Holmes, and L. B. Mote
8. PERFORMING ORGANIZATION REPORT NO.: MRC-DA-543
9. PERFORMING ORGANIZATION NAME AND ADDRESS: Monsanto Research
   Corporation, 1515 Nicholas Road, Dayton, Ohio 45407
10. PROGRAM ELEMENT NO.: 1AB015; ROAP 21AXM-071
11. CONTRACT/GRANT NO.: 68-02-1874
12. SPONSORING AGENCY NAME AND ADDRESS: EPA, Office of Research and
    Development, Industrial Environmental Research Laboratory,
    Research Triangle Park, NC 27711
13. TYPE OF REPORT AND PERIOD COVERED: Task Final; 6/75-5
14. SPONSORING AGENCY CODE: EPA-ORD
15. SUPPLEMENTARY NOTES: Project officer for this report is D. A. Denny,
    mail drop 62, 919/549-8411, ext 2547.
16. ABSTRACT: This report gives results of a study simulating the
    establishment of the severity of stationary air pollution sources.
    The potential environmental impact of an emission source can be
    determined from the source severity (the ground level concentration
    contribution of pollutants relative to some potentially hazardous
    concentration of the same species).  The frequency distribution of
    the severity of well-documented source types can be examined
    deterministically.  A statistical approach is required to simulate
    the frequency distribution of the severity of source types that are
    complex or involve a large number of emission points in order to
    ultimately assess such sources.  A Monte Carlo simulation technique
    is described in this report, together with efficient algorithms for
    fitting the inverse Weibull, gamma, normal, and log-normal cumulative
    density functions.  Significant correlation is demonstrated between
    deterministic and simulated severity results using coal-fired
    steam/electric utilities as an example.
17. KEY WORDS AND DOCUMENT ANALYSIS
    a. DESCRIPTORS: Air Pollution; Ranking; Environmental Biology;
       Simulation; Monte Carlo Method; Estimating; Electric Utilities
    b. IDENTIFIERS/OPEN ENDED TERMS: Air Pollution Control; Stationary
       Sources; Source Assessment; Environmental Impact
    c. COSATI Field/Group: 13B, 12B, 06F, 12A
18. DISTRIBUTION STATEMENT: Unlimited
19. SECURITY CLASS (This Report): Unclassified
20. SECURITY CLASS (This page): Unclassified
21. NO. OF PAGES: 133
22. PRICE:
EPA Form 2220-1 (9-73)