Environmental Protection Technology Series
 SOURCE  ASSESSMENT:
    SEVERITY OF STATIONARY
   AIR POLLUTION SOURCES-
   A SIMULATION  APPROACH

 Industrial Environmental Research Laboratory
       Office of Research and Development
      U.S. Environmental Protection Agency
 Research Triangle Park, North Carolina  27711

-------
               RESEARCH REPORTING SERIES

Research reports of the Office of Research and Development, U.S. Environmental
Protection Agency, have been grouped into five series. These five broad
categories were established to facilitate further development and application of
environmental technology. Elimination of traditional grouping was consciously
planned to foster technology transfer and a maximum interface in related fields.
The five series are:

     1.   Environmental Health Effects Research
     2.   Environmental Protection Technology
     3.   Ecological Research
     4.   Environmental Monitoring
     5.   Socioeconomic  Environmental Studies

This report has been assigned  to the ENVIRONMENTAL PROTECTION
TECHNOLOGY series. This series describes research performed to develop and
demonstrate instrumentation, equipment, and methodology to repair or prevent
environmental degradation  from point and non-point sources  of pollution. This
work provides the new or  improved technology required for the control and
treatment of pollution sources to meet environmental quality standards.
                    EPA REVIEW NOTICE

This report has been reviewed by  the U.S.  Environmental
Protection Agency, and approved for publication.  Approval
does not signify that the contents necessarily reflect the
views and policy of the Agency, nor does mention of trade
names or commercial products constitute endorsement or
recommendation for use.
This document is available to the public through the National
Technical Information Service, Springfield, Virginia 22161.

-------
                                       EPA-600/2-76-032e

                                       July 1976
        SOURCE ASSESSMENT:

 SEVERITY OF STATIONARY AIR POLLUTION

    SOURCES--A SIMULATION APPROACH
                    by

  E.G. Eimutis, B.J. Holmes, and L. B.  Mote

       Monsanto Research Corporation
             1515 Nicholas Road
             Dayton, Ohio 45407


           Contract No.  68-02-1874
            ROAP No. 21AXM-071
        Program Element No. 1AB015


     EPA Project Officer:  Dale A. Denny

 Industrial Environmental Research Laboratory
   Office of Energy, Minerals, and Industry
      Research Triangle  Park, NC 27711


               Prepared for

U.S. ENVIRONMENTAL PROTECTION AGENCY
      Office of Research and Development
            Washington, DC  20460

-------
                               PREFACE

The Industrial Environmental Research Laboratory  (IERL) of EPA has the
responsibility for insuring that air pollution control technology is
available for stationary sources.  If control technology is unavailable,
inadequate, uneconomical or socially unacceptable, then development of
the needed control techniques is conducted by IERL.  Approaches con-
sidered include process modifications, feedstock modifications, add-on
control devices, and complete process substitution.  The scale of control
technology programs ranges from bench to full scale demonstration plants.

The Chemical Processes Branch of IERL has the responsibility for devel-
oping control technology for a large number  (>500) of operations in the
chemical and related industries.  As in any technical program,
the first step is to identify the unsolved problems.

Each of the industries is to be examined in detail to determine if there
is sufficient potential environmental risk to justify the development of
control technology by IERL.  Monsanto Research Corporation (MRC) has
contracted with EPA to investigate the environmental impact of various
industries which represent sources of emissions in accordance with EPA's
responsibility as outlined above.  Dr. Robert C. Binning serves as
Program Manager in this overall program entitled, "Source Assessment."
As a first step, MRC has developed a priority listing of the industries
in each of four categories:  combustion, organic materials, inorganic
materials, and open sources.  The purpose and intended use of this
listing is that it serve as one of several guides to the selection of
those sources for which MRC will perform detailed source assessments.
Source assessment documents will be produced by MRC and used by EPA to
make decisions regarding the need for developing additional control
technology for each specific source.

In order to analyze the severity of those sources in which the
emission points number in the thousands or hundreds of thousands,
approach is required such as the Monte Carlo simulation technique
described in this report.   An example of this approach for analyzing
coal-fired steam electric utilities is included.
                                   iii

-------
                          CONTENTS
Section                                                Page
   I      Introduction                                   1
  II      Source Severity                                3
          A.    Mathematical Structure                    5
          B.    Derivation of x                           6
          C.    Pollutant Severity Equations              7
  III     Simulation Methodology                         9
          A.    Introduction                              9
          B.    Theory and Methodology                   12
               1.    All Input Variables Independent     13
               2.    Dependent Input Variables           15
  IV      Example  of Use of Simulation Approach With    19
          Coal-Fired Electric Utilities
   V      Discussion of Non-Normality and Chi-Square    25
          Goodness-Of-Fit Test
          A.    Central Limit Theorem and T-Test         26
          B.    Coefficient of Skewness and Kurtosis     34
               as  Analytical Measures of Non-Normal
               Distributions
          C.    Weibull Distribution                     37
          D.    Gamma Distribution                       41
          E.    Log-Normal Distribution                  44
          F.    Sample Skewness and Kurtosis             45
          G.    The Chi-Square Goodness-Of-Fit Test       49
  VI      Appendixes                                    57
          A.    Standard Statistical Formulae For Finite  53
               Populations
          B.    Detailed Derivations of the Criteria     61
               Pollutant Severity Equations
          C.    The Simulation Programs                  66
          D.    The Goodness-Of-Fit Program              94
          E.    Treatment of Correlated Data by Linear  111
               Transformation
  VII     References
                               v

-------
                       LIST OF TABLES
Table

  1       Criteria Pollutant Severity Equations

  2       Results of Deterministic Calculations

  3       Notation Used for Statistical Parameters
  4       Comparison of Random Sample Values and
          Population Mean for Coal-Fired Electric
          Utilities

  5       Values for Various Points of Interest in      40
          the Weibull Distribution for Various
          Values of Parameter b (a=1)

  6       Values for Various Points of Interest in      41
          the Weibull Distribution for Various
          Values of Parameter b (a=1.0 x 10⁻⁵)

  7       0.05 and 0.01 Points of the Distribution      47
          of γ1, the Coefficient of Skewness

  8       Percentage Points of the Distribution of      48
          γ2, the Measure of Kurtosis

  9       Values of g1, g2 and (g2-3) for Power         49
          Plant Example

 10       Theoretical and Actual Frequencies for        53
          Nine Class Frequencies
 11       Theoretical and Actual Frequencies for        54
          Nine Class Intervals

A-1       First Square Root Factor in Equation A-3      59
          as a Function of Sample Size

C-1       Variables, Distributions, Parameters and      78
          Clips for Coal-Fired Electric Utilities
          Example
C-2       Summary of Input Data by Groups for Coal-     79
          Fired Electric Utilities Example

C-3       Computer Listings of the Simulation           82
          Programs
D-1       Theoretical and Actual Frequencies for        99
          Various Class Intervals

D-2       Computer Listings of the Simulation          101
          Programs - Goodness-Of-Fit Program

E-1       Comparison of Simulation Results             113
                               vi

-------
                       LIST OF FIGURES

Figure

   1      Frequency Histograms for the Severity of
          SO2 Emissions from Coal-Fired Electric
          Utilities Comparing the Simulation and
          Deterministic Methods

   2      Comparison of Cumulative Frequency for the    22
          Severity of SO2 Emissions from Coal-Fired
          Electric Utilities Using the Simulation
          and Deterministic Methods

   3      Unimodal Continuous Distribution              35

   4      Examples of Kurtosis in Distributions         36

   5      Probability Density Function for the          37
          Exponential Distribution

   6      Probability Density Function of the Weibull   38
          Distribution for Various Values of
          Parameter b

   7      Probability Density Function for the Gamma    43
          Distribution for Various Values of
          Parameter a

   8      Probability Density Function for the Log-     45
          Normal Distribution
                              vii

-------
                     LIST OF SYMBOLS

Symbol                        Definition
A                  Y-intercept for regression line
                   [= YB - B(XB)]
A,B,C,D,E...       Simulated variable values, same as IVAL,
                   where A = variable 1, B = variable 2, etc.
a,b,α,β            Arbitrary parameters
B                  Slope of regression line
B'                 Average breathing rate
CC                 Coal consumed
c.d.f.             Cumulative distribution function
C1,C2,CLIP(I,J)    Two dimensional array; maximum/minimum
                   values for associated stochastic variables
E                  Emission factor
e                  2.72
F                  Hazard potential factor (=TLV·K) for
                   severity calculations, or statistical
                   ratio of variances (Appendix E)
F1,...,Fn          Actual frequencies of the sample data
                   in n class intervals
f1,...,fn          Theoretical frequencies that would be
                   expected for a sample of the same size
                   from the given distribution
g1                 Coefficient of skewness of sample
g2                 Measure of kurtosis of sample
H                  Relative hazard
h                  Emission height
H0                 Hypothesis that data are distributed
                   according to some given distribution
ICODE              Two dimensional array containing infor-
                   mation on which dependent variable
                   is correlated and which distribution
                   it fits
ICODE(1)           Code which identifies the type of distribu-
                   tion for independent variable 1
ITIL               Title for x-axis on plot and title of
                   report output
IVAL               Variable value, selected from VAR(I,J)
                              ix

-------
  Symbol

    K

K } , K2 ...
   LT


in 3 , mit ,mk


    N


    n




 NCFLAG
  NDVAR


  NFLAG


 NGR0UP


  MINT


  NIVAR



  NPASS


  NSAMP

    P

PAR(I,J)


 p.d.f.

 PNUM(I)
LIST OF SYMBOLS (Continued)

              Definition

  Safety factor = (~-r-
  Conversion factors

  Lethal dose for 50% of the people exposed

  Input integer used to indicate specific
  option to be used in program

  Third, fourth, k— central moment about
  X, respectively
  Population exposed to a specific source,
  persons
  Number of samples or class intervals
  (for simulation) or number of point
  sources emitting the i— material
  (otherwise)
  Input integer which provides the option
  of using Sturge's rule to set up the
  number of class intervals and using a
  value of XMIN and XMAX to establish W
  and the resulting class intervals

  An integer which gives the number of
  dependent input variables

  Flag to identify whether or not entered
  data are last data set to be analyzed

  An integer which tells the simulation
  program the sample size for each pass

  The number of class intervals to be
  used in program

  An integer which tell the simulation
  program how many input variables are
  independent

  An integer which tells the simulation
  program how many passes to make
  Sample size

  Probability
  Paramter matrix with values for each
  statistical distribution

  Probability density function

  Same as SNUM(l) but for population
  prediction

-------
             LIST OF SYMBOLS (Continued)

Symbol                        Definition
Q                  Emission rate
R                  Linear correlation coefficient or
                   random number (Appendix C)
R'                 Lung retention factor for the pollutant
                   of interest (dimensionless factor)
S                  Source severity or standard deviation
                   corrected for small sample size
SD                 Standard deviation of Y for a given X
(SD)²              Sample variance
SE                 Standard error (= SY√(1 - R²))
SF                 Subprogram to calculate severity
Sij                Source severity of the i-th material in
                   the region around the j-th source
SNUM(I)            Resultant values of severity, where I=1
                   indicates the probability that S lies in
                   the range <0.1, I=2 indicates 0.1 to 1
s, SD              Sample standard deviation
SX                 Standard deviation of X
sx                 Standard deviation of X
SY                 Standard deviation of Y
S1, S2             Sample standard deviation of X1 and X2
                   data, respectively
t                  The "Student t" random variable with
                   n degrees of freedom (= Z/√(χ²/n))
TLV                Threshold limit value
T'                 Total time (= t2 - t1)
t                  Time
t1                 Start time
t2                 Finish time
U                  Wind speed
VAR2, VAR3         Independent variables
W                  Width of class intervals for a histogram
X                  Stochastic variable
X̄                  Sample mean
                              xi
-------
             LIST OF SYMBOLS (Continued)

Symbol                        Definition
X̄n                 Mean of a sample of size n
XB                 Mean deviation of X
XMAX               Maximum value
XMIN               Minimum value
XPOP               Population size
XSAMP              Actual sample size
X1,...,Xn          A random sample from the population
X1, X2             Arrays of correlated data
YB                 Mean deviation of Y
Y1, Y2             Arrays of transformed data
Z                  A standard normal random variable
z                  Random variable [z = f(x1,...,xn)]
α                  Angle of rotation in radians to make
                   correlation zero
Γ                  Gamma function
γ1                 Coefficient of skewness of population
γ2                 Measure of kurtosis of population
ε                  Any positive member of population
μ                  Population mean
μ̂                  Estimate of the total population mean
μ3, μ4             Third and fourth central moment about
                   μ, respectively
μ1                 Mean of Y1 (= X̄1)
μ2                 Mean of Y2 (= X̄2)
π                  3.14159
ρ                  Correlation coefficient
σ                  Standard deviation
σ̂                  Estimate of population standard deviation
σ²                 Population variance
σx̄                 Estimate of standard error of mean
σy                 Horizontal dispersion coefficient
σz                 Vertical dispersion coefficient
χ̄                  Average maximum ground level concentration
                              xii

-------
             LIST OF SYMBOLS (Continued)

Symbol                        Definition
χ²                 Any chi-square random variable with n
                   degrees of freedom
χmax               Maximum ground level concentration
χ(t)               Concentration time history
Ψ                  Delivered dose = B'·R'·∫χ(t)dt
ΨA                 Actual pollutant dose delivered from a
                   given point source (= N·B'·R'·T'·χ̄)
Ψ*                 Potentially hazardous dose (= N·B'·R'·T'·F)
                             xiii

-------
                          SECTION I
                        INTRODUCTION
A prioritization listing of air pollution sources was
developed earlier to serve as a first step in selecting
industries for detailed study to determine whether a suffi-
cient potential environmental risk exists to justify the
development of control technology.  Preparation of the
listing or relative ranking of emission sources involved the
use of an impact factor which is a worst case, integral
quantity.  In current assessment studies, one of the potential
environmental impacts of a source is determined from the
source severity,  defined as the ground level concentration
contribution of pollutants relative to some potentially
hazardous concentration of the same species.  The frequency
distribution of the severity of well-documented source types
can be examined deterministically.  However, source types
which are complex or involve a large number of emission
points require a statistical approach to simulate the fre-
quency distribution of the severity and ultimately permit
the assessment of the source.

The basic premise of the simulation approach is that detailed
information on all required factors and all emission points
for certain source types will not be obtained because of
time or cost limitations.  When such detailed information
cannot be collected, the inputs can be described in terms of
a distribution of values.  In many cases (e.g., electric

-------
utility boiler capacities), these distributions are readily
available; for some sources, special approximating techniques
must be used to develop them.  Having developed frequency
distributions for the inputs, a distribution of severity can
be calculated by means of a simulation technique.  When
several input variables are treated as frequency distribu-
tions, the computation is extremely tedious; however, with a
high-speed digital computer, computation times are on the
order of a few seconds.

This document presents a methodology for describing the
severity distributions of various source types.  A Monte
Carlo simulation technique is described together with effi-
cient algorithms for fitting the inverse Weibull, gamma,
normal, and log-normal cumulative distribution functions.  Using
coal-fired steam electric utilities as an example, signifi-
cant correlation is demonstrated between deterministic and
simulated severity results.

-------
                         SECTION II
                       SOURCE SEVERITY

The air pollution severity, S, of a given source should in
some way be proportional to the degree of potential hazard
it imposes upon individuals in its environment.  The relative
hazard, H, from a specific emission can be defined as being
directly proportional to the delivered dose, the probability
of dose delivery, and number of people who would receive it,
and inversely proportional to the toxicity of the material
as follows:
                    S ∝ H = N·P·Ψ/LD50                     (1)

where     S = source severity
          H = relative hazard
          N = number of persons
       LD50 = lethal dose for 50% of the people exposed
          P = probability of dose delivery
          Ψ = delivered dose = B'·R'·∫χ(t)dt
         B' = average breathing rate
         R' = lung retention factor
       χ(t) = concentration time history

The source severity is herein defined as the ratio of the
dose of a pollutant delivered to a population, relative to
some potentially hazardous dose.  Since LD50 data are not
available for human beings, another measure of potentially
hazardous dosage was used.

-------
The potentially hazardous dose for a given pollutant  from a
specific point source in a given region is defined as
follows:                            ^\ \j -_ f-w &\ 1  ,
                            -t 2 ^  ,-^L
                 V  = NB'R'      'TLV(t) K dt             (2)
where    H*  = potentially hazardous dose, g
          N = population exposed to a specific source, persons
         B1  = average breathing rate, m3/s-person
         R'  = lung retention factor for the pollutant of
              interest  (dimensionless factor, 0
-------
               ΨA = N·B'·R'·∫χ(t) dt                       (6)

where  χ(t) = the actual ground level concentration time
              history of a pollutant of interest emitted by
              a specific point source, g/m³

The value of χ(t) is very difficult to obtain and was there-
fore approximated by an average value, χ̄.  The total actual
dose delivered for a specific pollutant from a specific
source is then:

               ΨA = N·B'·R'·T'·χ̄                           (7)

Since our measure of source severity was defined as the
ratio of the two dosages, then:

               S = ΨA/Ψ* = (N·B'·R'·T'·χ̄)/(N·B'·R'·T'·F)   (8)

or
               S = χ̄/F                                     (9)
A.   MATHEMATICAL STRUCTURE

The source severity, S, of the i-th material in the region
around the j-th source is expressed as the ratio of χ̄ij to
Fi; i.e.,

               Sij = χ̄ij/Fi                               (10)

For the i-th emitted material, the severity vector, Si, is
defined by:

               Si = [Si1, ..., Sin]                       (11)

where  n = number of point sources emitting the i-th material

The mean and median severity for the i-th material may be
obtained from a cumulative frequency plot of Si.  In addition,
n-fractiles, the standard deviation, and confidence limits
about the mean severity can be calculated.  Since all of the
populations under consideration are finite, alternate forms
of the standard statistical equations were used (as presented
in Appendix A).

B.   DERIVATION OF χ̄

Since source to receptor distances were not compiled, the
maximum ground level concentration for elevated point
sources, χmax,¹ was used:

               χmax = 2·Q·σz/(π·e·u·h²·σy)                (12)

where  χmax = maximum ground level concentration, g/m³
          π = 3.14159
          e = 2.72
          u = wind speed, m/s

¹Slade, D. H. (ed.).  Meteorology and Atomic Energy.
 Environmental Science Services Administration, Air
 Resources Labs., Silver Spring.  AEC Publication No.
 TID-24190.  July 1968.  445 p.

-------
          h = emission height, m
         σz = vertical dispersion coefficient, m
         σy = horizontal dispersion coefficient, m
          Q = emission rate, g/s


The average concentration, χ̄, is a function of sampling time,
t, and it can be related to the maximum concentration, χmax,
as follows:²

               χ̄ = χmax·(t1/t2)^p                         (13)

where  t1 = 3 min
       t2 = reference time period, min
        p = 0.17
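Equations 12 and 13 can be sketched in code as follows (a minimal illustration; the function names are ours, and math.e stands in for the report's rounded e = 2.72):

```python
import math

def chi_max(Q, u, h, sigma_z=1.0, sigma_y=1.0):
    """Maximum ground level concentration, g/m3 (Equation 12).

    Q: emission rate, g/s; u: wind speed, m/s; h: emission height, m;
    sigma_z, sigma_y: vertical/horizontal dispersion coefficients, m.
    """
    return (2.0 * Q * sigma_z) / (math.pi * math.e * u * h**2 * sigma_y)

def chi_bar(chi_max_value, t2, t1=3.0, p=0.17):
    """Time-averaged concentration (Equation 13): chi_max * (t1/t2)**p."""
    return chi_max_value * (t1 / t2) ** p
```

For a 24-hour averaging period (t2 = 1440 min) the correction factor (3/1440)^0.17 works out to about 0.35, the value that appears in Equation 14.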


C.   POLLUTANT SEVERITY EQUATIONS


For any material with a given TLV, the severity equation
becomes:

               S = χ̄/F = χmax·(t1/t2)^p / F               (14)

Assuming a national average wind speed of approximately
10 mph (4.5 m/s), an averaging period of 24 hours, and
substituting the appropriate values, Equation 14 becomes:

               S = χ̄/F = χmax·(3/1440)^0.17 / [(TLV)(3.33 x 10⁻³)]

                 = 0.35·χmax / [(TLV)(3.33 x 10⁻³)]

                 = 105·χmax / TLV
2Turner, D. B.  Workbook of Atmospheric Dispersion Estimates,
 1970 Revision.  U.S. Department of Health, Education, and
 Welfare.  Cincinnati.  Public Health Service Publication
 No. 999-AP-26.  May 1970.  84 p.

-------
or

               S = (2)(105)·Q·σz / [π·e·u·h²·σy·(TLV)]    (15)

The national average atmospheric stability is approximately
neutral.  Hence, σz ≈ σy, and

               σz/σy = 1.0                                (16)

Thus:

               S = 210·Q / [38.43·h²·(TLV)]

               S = 5.5·Q / [(TLV)·h²]                     (17)
Since the criteria pollutants (particulates, SOx, NOx, CO,
and hydrocarbons) have established ambient air standards,
the appropriate standard (in g/m³) can be substituted for
the potential hazard factor, F.  The severity equations for
the five criteria pollutants are shown in Table 1.  (Detailed
derivations are shown in Appendix B.)
       Table 1.  CRITERIA POLLUTANT SEVERITY EQUATIONS

       Pollutant        Severity equation
       Particulate      S = 70Q/h²               (18)
       SOx              S = 50Q/h²               (19)
       NOx              S = 315Q/h^2.1           (20)
       Hydrocarbons     S = 162.5Q/h²            (21)
       CO               S = 0.78Q/h²             (22)
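The five severity equations can be sketched as a single function (a hedged illustration; the function and key names are ours, with Q in g/s and h in m per Equations 18 through 22):

```python
def severity(pollutant, Q, h):
    """Source severity from the criteria pollutant equations.

    Q: emission rate, g/s; h: emission height, m.  NOx uses the
    exponent 2.1 on h; all other criteria pollutants use 2.
    """
    if pollutant == "nox":            # Equation 20
        return 315.0 * Q / h**2.1
    coefficients = {
        "particulate": 70.0,          # Equation 18
        "sox": 50.0,                  # Equation 19
        "hydrocarbons": 162.5,        # Equation 21
        "co": 0.78,                   # Equation 22
    }
    return coefficients[pollutant] * Q / h**2
```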
                              8

-------
                         SECTION III
                   SIMULATION METHODOLOGY

A.   INTRODUCTION

In many statistical analyses of data, it is frequently
desired to consider a random variable which is a function
of other random variables.  An example pertinent to air
pollution studies is given by the severity equations for
ground level concentrations of air pollutants.  For example,
the severity equation for SO2 emissions from the stacks of
coal-fired electric utility plants is given by:

               S = 50Q/h²                                 (19)
where  Q = emission rate, g/s
       h = emission height, m

The emission rate can be calculated from:

               Q = (CC)(E)(% sulfur)(K1)                  (23)

where        CC = coal consumed, g/yr
              E = emission factor = 0.019 g SO2 per g of
                  coal consumed (for 1% sulfur coal)
       % sulfur = percent of sulfur in the coal
             K1 = 3.171 x 10⁻⁸ (to convert from g/yr to g/s)

-------
or             S = (K2)(CC)(% sulfur)/h²                  (24)

where  K2 = 3 x 10⁻⁹
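Chaining Equations 23 and 19 gives a deterministic severity for a single plant; a minimal sketch (the function names are ours, and the inputs are purely illustrative):

```python
def so2_emission_rate(coal_gpy, pct_sulfur):
    """Equation 23: Q = (CC)(E)(% sulfur)(K1), in g/s.

    coal_gpy: coal consumed, g/yr; E = 0.019 g SO2 per g of coal
    per 1% sulfur; K1 = 3.171e-8 converts g/yr to g/s.
    """
    E = 0.019
    K1 = 3.171e-8
    return coal_gpy * E * pct_sulfur * K1

def so2_severity(coal_gpy, pct_sulfur, h):
    """Equation 19: S = 50*Q/h**2 for a stack of height h, m."""
    return 50.0 * so2_emission_rate(coal_gpy, pct_sulfur) / h**2
```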

Next, consider a general setting where the random variable z
is a function of the random variables x1, ..., xn given by
z = f(x1, ..., xn) for some function f.  Suppose the actual
distributions of the input random variables x1, ..., xn are
known, including their probability density functions (p.d.f.)
and the corresponding cumulative distribution functions
(c.d.f.).  Then it seems reasonable to assume that the distri-
bution of the random variable z can be obtained.  In a sense
this is true, in that integral formulae have been developed
which give the probability density function and the cumula-
tive distribution function for z as a function of the same
functions for the xi.³  These formulae, however, are very
complicated even for the case of the simple sum, difference,
product, or quotient of two random variables.  Also, even if
the integrals are successfully evaluated, the resulting
probability density function for z will in general not be
exactly one of the standard distributions and as a result
may be difficult to handle.  There are certain special cases
in which the resulting p.d.f. will be known.3  In these in-
stances, the analytical approach to finding z explicitly is
by far the best approach.  In other instances certain sim-
plifying assumptions about the distribution of z can be made
provided certain things are true about the coefficient of
variability or equivalently the coefficient of skewness of
the input variables.4  However, in cases where there are more
³Parzen, E.  Modern Probability Theory and Its Applications.
 New York, John Wiley & Sons, 1960.
⁴Springer, C. H., et al.  Probabilistic Models.  Homewood,
 Richard D. Irwin, Inc., 1968.
                              10

-------
than two input variables or there is considerable skewness
exhibited by the variables or the function f becomes compli-
cated, then the strict analytical approach to finding the
distribution of z explicitly will in general not be applicable.

In these cases, where the general approach of finding the
explicit distribution function for z is not applicable, an
alternate approach is to calculate "many" values of z for
explicit values of the input variables x1, ..., xn and use
these values to estimate (rather closely if enough values of
z are known) such things as the mean, standard deviation,
etc., for z.  Also, class intervals can be formed and a
frequency histogram and cumulative distribution plot can be
developed for the "many" calculated values of z.  This will
yield a distribution of z without any knowledge of an
analytical formula for its p.d.f. or even without knowing
whether any of the known standard distributions of statistics
exist for the distribution.  This approach is called the
deterministic approach because in this technique it is
possible to determine explicit values for z from explicit
values of the input variables x1, ..., xn.  This approach is
an approximation to the strictly analytical approach described
earlier.  The deterministic approach works well whenever it
is possible to actually calculate the "many" values of z
deterministically from given values of the input variables.
The term "many" means at least 30 values for the purpose of
estimating the mean, standard deviation, and a 95% confidence
interval for the mean.  (The t-test for finding confidence
intervals is discussed in Section V of this report.)   However,
to obtain a better frequency histogram for z, 100 or more values
of z should be available.  (A histogram can be constructed
with fewer values, but it will tend to be less meaningful because
the number of class intervals will have to be smaller.)
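The summary statistics described above (mean, sample standard deviation, and a 95% confidence interval for the mean via the t distribution) can be sketched as follows; the function name and the default t value are ours, with 2.045 being the two-sided 95% point for 29 degrees of freedom (n = 30):

```python
import math
import statistics

def summarize(z_values, t_crit=2.045):
    """Mean, sample standard deviation, and a 95% confidence
    interval for the mean of the calculated z values.

    t_crit is the two-sided Student-t critical point; 2.045
    corresponds to 29 degrees of freedom (n = 30).
    """
    n = len(z_values)
    mean = statistics.mean(z_values)
    sd = statistics.stdev(z_values)           # n - 1 in the denominator
    half_width = t_crit * sd / math.sqrt(n)   # CI half-width for the mean
    return mean, sd, (mean - half_width, mean + half_width)
```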
                             11

-------
Finally consider the situation when either no explicit values
of the input variable are available from which values of z
can be calculated or the number of such values is too small to
permit calculation of enough values of z to determine useful
information regarding its distribution.  In this situation a
tool commonly used is the probabilistic approach which uses a
computer simulation to obtain values for z.  For example,
instead of knowing many values for the input variables
x1, ..., xn, only limited information may be available, such
as an estimate of the mean and possibly the range and symmetry
or skewness properties.  Such situations are not amenable to
either the analytical or deterministic approach.  In this case,
the input variables are fitted to some theoretical distribu-
tion and the small amount of available information about the
variables is used to determine the parameters of the distri-
butions.  The computer is then used to sample a large number
of times from each input variable's distribution function and
to subsequently use these data to calculate a large number of
values of z from which the mean, standard deviation, etc.,
can be estimated and frequency histograms and cumulative
distribution plots for z can be prepared.  Some of the
techniques and procedures used in such a computer simulation
are described below.
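Sampling a fitted input distribution by pushing uniform random numbers through its inverse c.d.f. can be sketched for a Weibull variable as follows; this assumes the common two-parameter form F(x) = 1 - exp(-(x/a)^b), which may differ from the parameterization used by the report's programs:

```python
import math
import random

def sample_weibull(a, b, n, seed=0):
    """Inverse-transform sampling for a Weibull input variable.

    Assumes the c.d.f. F(x) = 1 - exp(-(x/a)**b).  A uniform random
    number r in [0, 1) is pushed through the inverse c.d.f.:
    x = a * (-ln(1 - r))**(1/b).
    """
    rng = random.Random(seed)
    return [a * (-math.log(1.0 - rng.random())) ** (1.0 / b)
            for _ in range(n)]
```

With b = 1 this reduces to an exponential distribution with mean a, which gives a quick sanity check on the sampler.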

B.   THEORY AND METHODOLOGY

The equation for the severity  (S) of ground level concentra-
tions of SO2 emissions from the stacks of coal-fired electric
utilities will be used to illustrate the methodology utilized
in the simulation approach.  The severity equation is:
               S = (K2)(CC)(% sulfur)/h²                  (24)

where K2, % sulfur, CC, and h have the meanings defined
earlier.  Thus, the input random variables are % sulfur, CC,
and h.
                            12

-------
1.   All Input Variables Independent

When all of the input random variables are independent, the
methodology is relatively simple.  A large sample  (e.g., of
size n) is drawn from the distribution of each of the input
variables.  These data are then used one by one to calculate
n values of S.  From these n values the mean, standard
deviation, etc., can be calculated and a frequency histogram
and cumulative distribution can be plotted.
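As a rough sketch of this independent-sampling procedure, the loop below draws n values from an assumed distribution for each input and computes n values of S. The constant K2 and all distribution parameters here are hypothetical placeholders, not the values used in the report.

```python
import random
import statistics

K2 = 70.0  # hypothetical constant; the report defines K2 elsewhere

def simulate_severity(n=5000, seed=1):
    """Draw n independent samples of each input and compute n values of
    S = K2 * CC * (% sulfur) / h**2, assuming all inputs are independent."""
    random.seed(seed)
    s_values = []
    for _ in range(n):
        pct_sulfur = random.weibullvariate(2.0, 1.5)  # Weibull(scale, shape); illustrative
        cc = random.weibullvariate(1000.0, 1.2)       # coal consumed; illustrative
        h = random.lognormvariate(4.5, 0.35)          # stack height, log-normal; illustrative
        s_values.append(K2 * cc * pct_sulfur / h ** 2)
    return s_values

s = simulate_severity()
mean_s, sd_s = statistics.mean(s), statistics.stdev(s)
```

From the same list of S values a frequency histogram and a cumulative distribution can be tabulated, as is done in Section IV.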

Some comments are in order regarding the method by which
samples are drawn from the distribution of the input varia-
bles.  First, it should be noted that the input variables
are restricted to one of four types of continuous distribu-
tions:  the Weibull, normal, gamma, or log-normal distribu-
tion.  The type of each input variable and the corresponding
parameters for its distribution function must thus be speci-
fied.  The method of obtaining the "best" type for each
variable and the corresponding parameters is described in
Section V and the parameters for the SO2 example are given in
Appendix C.  In these goodness-of-fit procedures, it is
necessary to have a random sample of data points for the input
variable in order to be able to fit it to the proper distribu-
tion.  However, certain situations may arise when that much
information about the input variable is not available.  For
example, two extreme points on the distribution and either the
mean or mode may be known,  or some information may be available
to determine whether the distribution is symmetric or skewed.
In such situations where the goodness-of-fit program is inoper-
able, it may still be possible to fit the variable to one of
the four distributions above and to obtain its parameters.

As a demonstration of the above procedure, consider the
following example.  Suppose that for an input variable, x,
it is known with 95% confidence that the values of x will be
between e and e⁵ (where e ≈ 2.718).  Suppose also that the
mode of the distribution is known to be between e and e², and
that the mean is approximately equal to e³.  These points
indicate that x has a rather heavily right-skewed distribution,
with the p.d.f. forming a long tail to the right of the mode.

Since it is known that the 0.025 point on the cumulative
graph is approximately equal to e and the 0.975 point is
approximately equal to e⁵, this information alone can be used
to calculate A and B in a Weibull fit (described later).  Thus,
one finds that A = 1.25 and B = 7.29 × 10⁻³.  These values of
A and B yield a theoretical mean μ = 47.7, which is larger
than the estimated e³ value for the mean.  The theoretical
mode is 14.2, which again is larger than the estimated mode.
Thus, the Weibull fit could be used as an approximation to
the "true" distribution of x, although it is not a very good
fit, as our observations about the mean and mode indicate.
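The two-point Weibull fit just described can be reproduced numerically. The sketch below assumes the Weibull cumulative form F(x) = 1 − exp(−B·x^A), which is consistent with the A and B values quoted above, and solves for A and B from the 0.025 and 0.975 points.

```python
import math

def weibull_from_quantiles(x1, p1, x2, p2):
    """Solve F(x) = 1 - exp(-B * x**A) for A and B given two points
    (x1, p1) and (x2, p2) on the cumulative curve."""
    k1 = -math.log(1.0 - p1)   # equals B * x1**A
    k2 = -math.log(1.0 - p2)   # equals B * x2**A
    a = math.log(k2 / k1) / math.log(x2 / x1)
    b = k1 / x1 ** a
    return a, b

e = math.e
a, b = weibull_from_quantiles(e, 0.025, e ** 5, 0.975)
mean = b ** (-1.0 / a) * math.gamma(1.0 + 1.0 / a)  # theoretical Weibull mean
# a rounds to 1.25 and b to 7.29e-3, as in the text; the mean lands near 48
```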
Another way of obtaining a distribution for x is to assume
that it is log-normally distributed, since the log-normal
distribution is right-skewed.  If x has a log-normal distri-
bution, then log x must be normal.  Hence, by taking the
logarithms of the 0.025 point and 0.975 point of x, the same
points on the cumulative graph of log x (which was assumed
to be normal) are obtained.  These points are 1 and 5,
respectively.  Thus, the mean μ of log x should be taken to
be 3 and, since 1 and 5 are the 0.025 and 0.975 points,
respectively, it is found that σ = 1.2.  The values μ = 3
and σ = 1.2 can thus be used as parameters to sample from
the normal for values of log x.  By taking antilogarithms
of the sample, a sample for x can be obtained.
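A minimal sketch of this log-normal route, using the μ = 3 and σ = 1.2 from the example (natural logarithms assumed):

```python
import math
import random

random.seed(0)
mu, sigma = 3.0, 1.2   # parameters of log x from the worked example

log_x = [random.gauss(mu, sigma) for _ in range(10_000)]  # sample log x from the normal
x = [math.exp(v) for v in log_x]                          # antilogarithms give a sample for x
# the log-space sample mean should recover mu near 3
```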

In view of the above discussion, it is evident that several
avenues are available for obtaining a distribution to fit the
given data or information about each input variable.  The
simulation program  (for the case of independent input varia-
bles) simply takes the parameters for the given type of dis-
tribution for an input variable and samples from this distri-
bution to obtain a random sample for that input variable.

The method by which the program selects the sample from a
given distribution varies with the distribution.  The direct
approach is used for the Weibull and normal distributions.
The rejection method is used for the gamma distribution.
Finally, for the log-normal distribution, the direct approach
is used to sample from the normal with the mean equal to the
mean of log x and the standard deviation equal to the stan-
dard deviation of log x.  These are sometimes obtained by
taking the log of the geometric mean and geometric standard
deviation of x.   After obtaining a sample for log x, the
antilogarithm of these values gives a sample for x.  These
methods of sampling are further discussed later in this
report.
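For the Weibull, the "direct approach" mentioned above amounts to inverse-CDF sampling: set the cumulative F(x) = 1 − exp(−B·x^A) equal to a uniform random number u and solve for x. A sketch (the A and B here are whatever parameters the fit produced):

```python
import math
import random

def weibull_direct(a, b, n, seed=3):
    """Direct (inverse-CDF) sampling from F(x) = 1 - exp(-b * x**a):
    solving F(x) = u gives x = (-ln(1 - u) / b) ** (1/a)."""
    random.seed(seed)
    return [(-math.log(1.0 - random.random()) / b) ** (1.0 / a)
            for _ in range(n)]

sample = weibull_direct(1.25, 7.29e-3, 20_000)
```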

2.   Dependent Input Variables

Consider what happens when two (or more)  of the input varia-
bles are correlated to some degree.   If this situation occurs,
and a sampling procedure is used such as the one discussed
above, which assumes independent variables, it will tend to
distort the mean as well as the distribution of the output
variable.   For example, in the severity example it was found
that the variables CC and h had a sample correlation coeffi-
cient of about 0.55.  Thus, whenever CC and h were sampled
independently, large values of CC were allowed with small
values of h and vice versa.  This procedure tended to distort
the "true" distribution of S by allowing unrealistically low
values for S and, more importantly, unrealistically high
values.  These high values caused the simulated mean to be
16.0 whereas the true mean was 8.9 for the deterministic pop-
ulation calculation.  Thus, some way was needed to account
for the correlation that existed between CC and h.

Consider two input variables, X and Y, which are correlated
with linear correlation coefficient R.  This value of R can
either be estimated and supplied directly to the program or
it can be obtained by a simple regression on the sample data
for X and Y.  Once R is obtained, the slope, B, and the Y-
intercept, A, of the regression line are given by:

                       B = R(SY/SX)                     (25)

                       A = YB - B(XB)                   (26)

where XB and SX are the mean and standard deviation of X,
and YB and SY are the corresponding parameters for Y.   (This
assumes that X is taken as the independent variable in the
regression line.)  If R had been supplied directly to the
program (without any sample data) , then XB, SX, YB, and SY
would also be required in order to calculate A and B.

After A and B are obtained, the following relationship exists:

                 Y = A + BX = YB + B(X - XB)            (27)

with an error term to account for the fact that the regres-
sion line is not an exact relation between X and Y.  The
next item needed is the standard deviation (or error), SE,
of Y due to the regression line estimate of Y.  This is given
by:

                       SE = SY √(1 - R²)                (28)

Whenever R = 0 (indicating that X and Y are independent),
then SE = SY, as would be expected (i.e., the standard devia-
tion of Y due to the regression line estimate is simply SY).
Also, if R = ±1, then SE = 0, as would be expected, since
there is no deviation from the regression line in this case.

SE shown above is the standard deviation for Y when the
regression line estimate of Y is used and the value of X is
XB, which is not likely to occur often.  Hence, a way to
compute the standard deviation  (SD) of Y is needed using the
regression estimate of Y and any value of X.  Intuition
suggests that the larger the deviation of X from XB, the
larger the error in estimates of Y.  A formula for computing
SD is given by:

             SD = SE √(1 + 1/n + (X - XB)²/(n(SX)²))    (29)
where n is the sample size to be drawn from Y.  Since these
simulations usually have large values of n, then for our
purposes SD is approximately equal to SE.  However, the
correct formula is still used in the program for calculating
SD (the standard deviation of Y for a given X).

The method described below pertains to sampling from the
pair of variables X and Y where X and Y are correlated as
above with X independent and Y dependent.  First, a value of
X is selected at random from the population describing X.
This value is then substituted into the regression equation
to obtain an estimate of Y which is taken as the expected
value or mean of Y for this given value of X.  Finally, the
standard deviation of Y, SD, is calculated for this value of
X.  Then a sample is taken from the distribution of Y with
the mean given by the estimate of Y from the regression
equation and the standard deviation given by SD.  This pro-
vides a "correlated" random value of Y associated with the
given value of X.  This procedure is continued until the
desired sample size for X and Y is attained.
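A sketch of this correlated-sampling loop, taking both X and Y as normal for simplicity (the program also allows a log-normal Y, handled in log space):

```python
import math
import random

def sample_correlated(n, xb, sx, yb, sy, r, seed=2):
    """For each sampled X, use the regression line for the conditional mean
    of Y and the prediction-error formula for its conditional standard
    deviation, then sample Y from a normal with that mean and SD."""
    random.seed(seed)
    b = r * sy / sx                           # regression slope
    se = sy * math.sqrt(1.0 - r * r)          # standard error about the line
    pairs = []
    for _ in range(n):
        x = random.gauss(xb, sx)
        y_mean = yb + b * (x - xb)            # regression estimate of Y
        sd = se * math.sqrt(1.0 + 1.0 / n + (x - xb) ** 2 / (n * sx * sx))
        pairs.append((x, random.gauss(y_mean, sd)))
    return pairs

pairs = sample_correlated(5_000, xb=0.0, sx=1.0, yb=0.0, sy=1.0, r=0.55)
```

The resulting pairs should exhibit a sample correlation close to the supplied R (0.55 here, the value found for CC and h in the text).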

The following discussion pertains to the type of distribution
from which sampling is allowed for Y, the dependent variable
in the regression equation.  Upon changing from one value of
X to the next, the above procedure will simply provide a new
mean and standard deviation for Y to be used for sampling
purposes.   Since the normal and log-normal distributions are
the easiest distributions from which to sample when the mean
and standard deviation are known, the program allows the user
to select only one of these two distributions for sampling
from Y.  However, the program is designed for expanding this
choice to the Weibull and gamma distributions if more sub-
routines are written and added.  Whenever the log-normal
distribution is used for Y, the program performs the regres-
sion analysis for log Y as a function of log X, so that the
sample value for log Y is first calculated and then converted
to Y internally by taking antilogarithms.  It must be remem-
bered that if one chooses to use the log-normal distribution
for Y and to supply R, XB, YB, SX, and SY instead of supply-
ing the raw sample data, then the values of XB, SX, YB, and
SY must be given in terms of log X and log Y.  That is, XB
must be the mean of log X, etc.  If the raw data values are
supplied for X and Y respectively, the program automatically
converts to logs if Y is assigned a log-normal distribution.
                         SECTION IV

            EXAMPLE OF USE OF SIMULATION APPROACH
             WITH COAL-FIRED ELECTRIC UTILITIES
In order to obtain an indication of how well the simulation
procedure approximates the "true" population, SO2 emissions
from the stacks of coal-fired electric utilities were examined.
Data were available on % sulfur, coal consumed  (CC), and stack
height  (h) for 224 power plants in the U.S.  This was consid-
ered to be the total population which was to be simulated by
using only a small number (24) of plants in order to obtain
information about the distributions of % sulfur, CC, and h.

To obtain a "random" sample, the first 24 plants on the list
were selected.  Percent sulfur, CC, and h for these 24 plants
were then fitted to the four distributions considered in the
simulation program.  The distributions were then selected
which appeared to fit the data better on an overall basis
considering the SE, χ²-value, actual class interval compari-
sons, and coefficient of skewness and measure of kurtosis
calculations.   (These techniques of choosing the best fit are
discussed in Sections V and VI of this report.)   For % sulfur,
the Weibull Maximum Likelihood Fit was selected and clipped at
the 5% and 99% points.  For CC, the Weibull Least Squares Fit
was selected and clipped at the 5% and 99% points.   Also, h
was found not to be independent of CC.  Hence, it was decided
to treat h as a dependent variable correlated with the inde-
pendent variable CC by using the raw data on the 24 plants to
obtain R, the correlation coefficient.  The coefficient of
skewness indicated that h was not normal but skewed to the
right.  Furthermore, the coefficient of skewness and measure of
kurtosis for log h indicated  "near-normality."  Hence, it
was decided to use the log-normal distribution for h.

The severity equation for SO2 emissions from the stacks of
coal-fired electric utilities was discussed earlier and is
repeated below:
                       (K2)(% sulfur)(CC)
                   S = ------------------              (24)
                               h²
Using this function for S, the data, as indicated above, were
entered into the simulation program and 5,000 values were
calculated for S.  Subsequently, the mean, standard devia-
tion, percentage with S > 1, maximum value, and minimum
value were calculated.  A deterministic calculation of these
values was performed for all 224 plants in the population
and the results are compiled in the table below:

       Table 2.  RESULTS OF DETERMINISTIC CALCULATIONS

    Parameter               Simulated value    Deterministic value
    Mean                         9.25                 8.9
    Standard deviation          12.5                 12.4
    Maximum value              154.5                136.0
    Minimum value                0.08                 0.36
    Percent having S > 1        91                   95
Frequency histograms and cumulative frequency plots were
also drawn for both the simulated values and the determinis-
tic values of S and these are shown in Figures 1 and 2.

The large-sample Z test was performed to determine whether
there was a significant difference in the simulated and
deterministic mean values obtained above.  The test, as
[Figure 1.  Frequency histograms for the severity of SO2 emissions from coal-fired
 electric utilities comparing the simulation and deterministic methods.
 Simulated:      sample size = 5,000; min. value = 0.08; max. value = 154.45;
                 mean = 9.25; std. dev. = 12.48.
 Deterministic:  sample size = 224; min. value = 0.36; max. value = 135.96;
                 mean = 8.88; std. dev. = 12.37.]

[Figure 2.  Comparison of cumulative frequency for the severity of SO2 emissions from
 coal-fired electric utilities using the simulation and deterministic methods.
 Summary statistics are the same as in Figure 1.]

might be expected, showed that the difference was not signifi-
cant at the 0.01 or 0.05 levels.  Furthermore, the F test for
a significant difference in the variances was also negative.
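As a sanity check, the Z statistic for the difference of the two means can be recomputed from the summary values in Table 2 and Figure 1, assuming the usual two-sample form of the large-sample test (the report does not show its exact computation):

```python
import math

n1, mean1, sd1 = 5000, 9.25, 12.48   # simulated
n2, mean2, sd2 = 224, 8.88, 12.37    # deterministic population

# two-sample large-sample Z statistic for the difference of means
z = (mean1 - mean2) / math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
# |z| is about 0.44, far below the 1.96 cutoff at the 0.05 level
```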

As can be seen from the frequency histogram of both the
simulated and deterministic plots, the severity appears to
be log-normally distributed.  Furthermore, the cumulative
plots for the simulated and deterministic values of S indi-
cate an extreme-value distribution, such as the log-normal
distribution.  As can be seen by comparing the cumulative
plots for the simulated and deterministic values of S, parts
of the distributions agree very well whereas other parts do
not agree as well.  This is considered acceptable since the
total population in the deterministic calculation has only
224 points and this is very much a discrete population,
whereas the simulation plot assumes a continuous distribution
for S.  Moreover, the sample correlation coefficient, which
was calculated from the 24 plants in the sample, probably
does not completely represent the true picture of the actual
correlation between CC and h.  For these reasons, the χ²
goodness-of-fit test indicates an overall poor fit of the simu-
lated cumulative to the deterministic cumulative distribution.

However,  keeping in mind the purpose behind a simulation
(viz., to obtain information about an output variable when
very little is known about the input variables)  and the con-
ditions under which it is generally performed (viz., little
or no information about the true distributions of the input
variables), the simulated values of the mean,  etc., and the
cumulative plots are actually remarkably close to the "true"
values.
                          SECTION V

         DISCUSSION OF NON-NORMALITY AND CHI-SQUARE
                    GOODNESS-OF-FIT TEST
There are many real-world situations in which massive amounts
of data are collected and analyzed.  In many of these cases
the data come from a population which is finite.  Thus, when
statistical methods are employed to analyze the data, the
underlying distributions are of the discrete type.  However,
it is common practice to approximate these discrete distri-
butions with one or more of a large number of continuous
distributions. The reason, of course, is that continuous
distributions are much easier to handle.

After deciding to use continuous distributions in the
analysis of the data, the experimenter must decide which of
the many types of distributions available will most closely
approximate his data and yield the best results for his
purposes.  In some situations, past experience leads the
experimenter to choose a particular type of distribution.
For example, the exponential distribution and one of its
generalizations, the gamma distribution, are often used to
describe the distribution of arrival and/or departure times
in a queueing theory problem.  In reliability theory, the
exponential distribution and another of its generalizations,
the Weibull distribution, are often used to describe the
distribution of time-to-first-failure or mean time between
failures.  In measuring air pollution, the log-normal dis-
tribution is often used to describe the distribution of the
concentrations of pollutants in the air.  However, in many
instances the experimenter may have no previous knowledge of
how his data "should" be distributed.  In these instances it
is sometimes desirable simply to use the well-known normal
distribution to describe the data.  With the assumption of
normality, many statistical tests and procedures become
trivial to apply.  However, when normality is assumed and the
"true" distribution  is non-normal, error is introduced into
the problem.  Thus,  when an experimenter assumes his data are
normally distributed in order to be able to apply statistical
tests that require the normal distribution, he is naturally
concerned with the extent to which his assumption distorts
the desired results.  One of the purposes of this section is
to shed some light on questions of the above type.

This discussion is mainly concerned with data collected in
connection with air  pollution analysis and it is frequently
the case that these  data exhibit extreme value characteris-
tics.  This means that the data are not symmetrically located
around the mode but  instead have extreme values to one side
or the other of the  mode which causes the mean to be shifted
to the right or left of the mode.  Thus, the data cannot
generally be described by the normal distribution.  Analytical
techniques that can  be used to detect and measure certain
"degrees of non-normality" of data will be discussed.  Some
of the well-known extreme value distributions, e.g., the
Weibull, gamma, and  log-normal distributions, will also be
reviewed and compared with the normal distribution for
different values of their parameters.  Finally, the χ²
goodness-of-fit test will be discussed to show how it can be
used to test the fit of the data to a given distribution.

A.   CENTRAL LIMIT THEOREM AND T-TEST

Consider a given population which has an unknown distribu-
tion with a finite mean and variance.  Let X₁, ..., Xn denote
a random sample from this population, so that each Xᵢ is a
random variable with the same distribution as the underlying
population distribution.  Table 3 gives the notation for
certain statistical parameters for both the sample and the
population.

     Table 3.  NOTATION USED FOR STATISTICAL PARAMETERS

    Parameter                Sample notation    Population notation
    Mean                     X̄                  μ
    Variance                 (SD)²              σ²
    Third central moment     m₃                 μ₃
    Fourth central moment    m₄                 μ₄
    kth central moment       mₖ                 μₖ
In general, English letters are used to denote a sample
parameter and Greek letters are used to denote the corre-
sponding population parameter.  The first result to be dis-
cussed is the so-called "law of large numbers."  This law
states that each of the sample characteristics in the table
above (as well as some others to be discussed later)
converges in probability to the corresponding population
parameter.  It is instructive to see what this means for the
case of X̄, the sample mean.  Let ε > 0 be any positive
number.  Then:

              lim  P(|X̄n - μ| ≥ ε) = 0                  (30)
              n→∞

where X̄n denotes the mean of a sample of size n.  This indi-
cates that the probability that the mean of a sample of size
n will differ from the true population mean by as much as
ε > 0 (for any ε > 0) approaches 0 as the sample size n
approaches +∞.  The same is true of the other sample
characteristics.

Practically speaking, the law of large numbers is of little
benefit unless one can determine how good the approximation
is as a function of sample size or unless confidence intervals
can be found for certain samples of size n > 30 where the
population parameters are "reasonably approximated" by their
sample counterparts.  Instead of simply taking the sample
parameter [e.g., X̄ or (SD)²] as the value of the population
parameter (μ or σ²), it is usually desirable instead to
construct confidence intervals around the sample parameter,
within which the target population parameter is assumed to
lie with a certain degree of confidence.  Since an approxima-
tion to the population mean and a confidence interval about
it seem to be of central importance in most statistical
analyses of data, procedures (and their limitations) for
accomplishing this task are described below.

First, it is necessary to consider some preliminary defini-
tions and results.  A random variable is said to have a chi-
square distribution with n degrees of freedom if its p.d.f.
is given by:

         f(x) = [2^(n/2) Γ(n/2)]⁻¹ x^(n/2 - 1) e^(-x/2)   for x > 0
                                                                      (32)
              = 0                                         elsewhere

Suppose a random sample X₁, ..., Xn is available from a
normal population with mean, μ, and variance, σ².  Then the
random variable defined by:

               n
         χ² =  Σ  (Xᵢ - X̄)²/σ² = (n-1)(SD)²/σ²          (33)
              i=1

has a chi-square distribution with n-1 degrees of freedom.

Next, suppose that Z is a standard normal random variable
and χ² is any chi-square random variable with n degrees of
freedom.  Then the random variable:

                       T = Z/√(χ²/n)                    (34)

is defined to be the "Student t" random variable with n
degrees of freedom.  The T variable has a symmetric distribu-
tion that looks very much like the normal except that its
tails are somewhat wider.

The T-test is used to find confidence intervals about the
mean of a sample.  Let X₁, ..., Xn denote a sample of any
size from a normal distribution with unknown mean, μ, and
variance, σ².  Let X̄ = (ΣXᵢ)/n be the sample mean and let
(SD)² be the (unbiased) sample variance.  By standard statis-
tical theorems, X̄ is a normal random variable with mean, μ,
and variance, σ²/n.  Also, by one of the results stated
above, (n-1)(SD)²/σ² is a chi-square random variable with
n-1 degrees of freedom.  Thus:

         T = [(X̄-μ)/(σ/√n)] / √[((n-1)(SD)²/σ²)/(n-1)]
           = (X̄-μ)/(SD/√n)                              (35)
is a "Student t" random variable with n-1 degrees of freedom.
Thus, by using a table for the T variable, one can find t_{α/2}
such that:

                  P(|T| ≤ t_{α/2}) = 1 - α              (36)

for any given α.  For example, if α = 0.05, then t₀.₀₂₅ = 2.20
for 11 degrees of freedom, t₀.₀₂₅ = 2.18 for 12 degrees of
freedom, etc.  Hence, given α and the number of degrees of
freedom, a value t_{α/2} can be found such that

                  -t_{α/2} ≤ T ≤ t_{α/2}

with probability 1 - α.  Thus, with α = 0.05, one finds a 95%
confidence interval for T.  For the T defined above, it is
found that |(X̄-μ)/(SD/√n)| ≤ t₀.₀₂₅ with 95% confidence, i.e.,

              -t₀.₀₂₅ ≤ (μ-X̄)/(SD/√n) ≤ t₀.₀₂₅

         X̄ - t₀.₀₂₅ (SD/√n) ≤ μ ≤ X̄ + t₀.₀₂₅ (SD/√n)

Thus, an interval is obtained in which the population mean
lies with 95% confidence.
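The interval above is straightforward to compute. A small sketch, with made-up data and the t₀.₀₂₅ = 2.20 value quoted in the text for 11 degrees of freedom:

```python
import math
import statistics

def t_confidence_interval(data, t_crit):
    """95% interval for the mean: X-bar +/- t * SD / sqrt(n), with t read
    from a T table at n-1 degrees of freedom."""
    n = len(data)
    xbar = statistics.mean(data)
    sd = statistics.stdev(data)   # unbiased sample standard deviation
    half = t_crit * sd / math.sqrt(n)
    return xbar - half, xbar + half

# 12 hypothetical observations -> 11 degrees of freedom -> t_crit = 2.20
data = [9.1, 8.7, 10.2, 9.8, 8.9, 9.5, 10.1, 9.3, 8.6, 9.9, 9.4, 9.7]
lo, hi = t_confidence_interval(data, 2.20)
```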

The above procedure will now be examined with reference to
the assumption of normality.  It is useful to establish the
reason why it was necessary for the sample to be from a
normal population.  This question can be answered by reversing
the steps used above.  In defining the T random variable, the
numerator had to be a standard normal random variable and the
denominator had to be √(χ²/n), where χ² had a chi-square
distribution with n degrees of freedom.  Thus, if the sample
were taken from a non-normal population, the sample mean,
X̄ = (ΣXᵢ)/n, will not (in general) be a normal random variable
and the variable (n-1)(SD)²/σ² will not in general have a chi-square
distribution.  Thus, for non-normal populations, the ratio
used in defining the T variable will not produce a T variable
and, hence, the T-test is not strictly valid.  In such cases,
the central limit theorem is used.  This theorem states that
if a sample of size n is taken from any population with finite
mean, μ, and variance, σ², the sample mean, X̄n, approaches a
normal random variable in distribution as n increases.  In
fact, for samples of size n ≥ 30, the sample mean is so close
to a normal distribution that for all practical purposes it
can be considered normal.  Thus, if a sample of size n ≥ 30
is taken from any population and its mean is calculated, one
will obtain approximately a normal random variable with mean,
μ (the unknown population mean), and variance σ²/n (where σ²
is the unknown population variance).  Hence, letting

                  Z = (X̄n - μ)/(σ/√n)                   (37)

a standard normal random variable is obtained.  Since σ is
unknown in the above equation and the sample size is rela-
tively large (≥ 30), σ can be approximated by the sample
standard deviation, SD, to obtain

                  Z = (X̄n - μ)/(SD/√n)                  (38)

for the standard normal random variable.  Hence, the normal
probability table may be used to find a value Z_{α/2} such that

                  P(|Z| ≤ Z_{α/2}) = 1 - α              (39)

and, hence, obtain (as in the T-test)

         X̄n - Z_{α/2} (SD/√n) ≤ μ ≤ X̄n + Z_{α/2} (SD/√n)

with 100(1 - α)% confidence.

If the above two procedures are compared, it is found that
the ratio under consideration for finding the values t_{α/2}
or Z_{α/2} is the same in both cases, namely, (X̄-μ)/(SD/√n),
and the only difference is the specific reference table used.
If the underlying population is normal and n is arbitrary,
the T table is used; if not, but n ≥ 30, the normal table is
used.  Actually, the T tables in most statistics books stop
at n = 30 (29 degrees of freedom) because for n ≥ 30 the
variable defined by the above ratio can be considered to be
normal or T regardless of the underlying population distri-
bution.  Hence, it would be useless to construct tables
beyond approximately n = 30 for the T variable since they
would simply duplicate the values that could be obtained
from a normal table.

It is sometimes desired to find a confidence interval when n
is less than 30 and the data are non-normal.  Unfortunately,
there is no completely satisfactory analytical answer that
can be given in this case.  However, statisticians have found
from experience that the normal approximations guaranteed by
the central limit theorem for samples of size n ≥ 30 are still
very close to normal for most mound-shaped probability distri-
butions.  In some cases, the approximation is valid for samples
as small as n = 5.  Hence, in the case of n < 30, one can
simply calculate the confidence interval in the same way as
described above, assuming normality,  and subsequently point
out the possible pitfalls that could  occur if the underlying
distribution strays too far from normality (viz., the confi-
dence interval would no longer be valid and more sampling
would be required).  An alternate approach for n < 30, when
there is reason to believe that the data are log-normally
distributed, is to apply the T-test to the logs of the data
and obtain a confidence interval.
It is interesting to  apply  the  above  procedure to the data
for coal-fired electric utilities.  Data  are  available for
224 power plants on coal consumption,  stack heights,  percent
sulfur in the coal used, and percent  ash  in the coal  used.
The first 24 plants on the  list were  then selected and con-
sidered to be a random sample from  the population of  224
plants.  Using the 24 plants in the sample, the sample mean,
standard deviation, and 95% confidence limits on the  mean
were calculated for each of the above types of data and these
were compared with the population parameters  for all  224
plants.  The results  are provided in  Table 4.
    Table 4.  COMPARISON OF RANDOM SAMPLE VALUES AND POPULATION MEAN
                 FOR COAL-FIRED ELECTRIC UTILITIES

                                      Random sample values
                            -------------------------------------------
    Type of data            Mean     Standard    95% confidence   Population
                                     deviation   interval         mean
    Coal consumed, kg/yr    1,326    1,349       756; 1,896       1,089
    Stack height, m         93.6     36          78.4; 108.8      101.3
    Percent sulfur          1.82     1.15        1.33; 2.31       2.48
    Percent ash             12.25    4.1         10.52; 14.0      13.03
As Table 4 shows, the 95% confidence interval contains the
population mean for each of the data types studied except
percent sulfur.  Using various statistical goodness-of-
fit tests  (to be discussed later) on the  data,  it was found
that coal consumption was exponentially distributed almost
perfectly, the percent ash was log-normally  distributed, and
stack height could not be fitted with much success to any of
the distributions that were tried.  However,  even with these
wide-ranging distributions  (which are very much non-normal)
and a sample size of only 24, the T-test  still  provided
satisfactory results in every case  except the percent sulfur
used.  Upon investigating the percent sulfur situation more
closely, it was found that the range on the data in the sample
of the first 24 plants was from 0.4% to 4% whereas the popu-
lation data ranged from 0.4% to 6.2%.  This indicates that,
in the case of percent sulfur, the random sample is probably
not very good.  However, the 95% confidence interval for
percent sulfur was still relatively close to containing the
actual population mean.
B.   COEFFICIENT OF SKEWNESS AND KURTOSIS AS ANALYTICAL
     MEASURES OF NON-NORMAL DISTRIBUTIONS
In a symmetric distribution, every moment of odd order about
the mean (if existent) must be zero.  Any such moment which
is not zero may thus be considered as a measure of the asym-
metry or skewness of the distribution.  The simplest of these
measures is μ₃, which is of the third dimension in units of
the variable.  In order to reduce this to zero dimension and
so construct an absolute measure, division by σ³ is performed
and the ratio

                          γ₁ = μ₃/σ³                    (40)

is regarded as a measure of the skewness.  γ₁ is called the
coefficient of skewness.

In statistical applications, unimodal continuous distributions
of the type shown in Figure 3 are often encountered.  The
frequency curve forms a long tail to the right of the mode
and a short tail to the left.  Thus, in calculating μ₃, the
cubes of the positive deviations will generally outweigh the
negative cubes and, hence, μ₃ will be positive, as will
γ₁ = μ₃/σ³.  Thus, the above distribution is said to be skewed
right or to have positive skewness.  Similarly, negative skew-
ness occurs when γ₁ < 0.
         Figure 3.  Unimodal continuous distribution

Reducing the fourth moment μ₄ to zero dimension in the same
way as above, the measure of kurtosis of a distribution is
defined as:

                          γ₂ = μ₄/σ⁴                    (41)

The measure of kurtosis is used as a measure of the degree of
flattening of a frequency curve near its center.  The normal
distribution has a constant measure of kurtosis, γ₂ = 3.
Thus γ₂ > 3 means that the distribution has a sharper peak,
thinner shoulders, and fatter tails than the normal distribu-
tion.  Likewise, γ₂ < 3 means that the distribution has a
flatter peak, fatter shoulders, and thinner tails than the
normal distribution.  Figure 4 exhibits these features.

None of the distributions in Figure 4 is skewed, although
curve 1 and curve 2 would fail to be normal because of
deviations in their measures of kurtosis.

       Figure 4.  Examples of kurtosis in distributions:
         (1) kurtosis > 3;  (2) kurtosis < 3;  (3) normal

It can be shown that γ₁ = 0 and γ₂ = 3 for the normal distri-
bution.  Hence the values of skewness and kurtosis for a
given distribution can be used to compare it with the normal.

To obtain another reference point for comparisons, consider
the exponential distribution function, whose probability
density is given by:

            f(X) = (1/β) e^(-X/β)  for X > 0
                   0               elsewhere            (42)

where β > 0 is an arbitrary parameter.  By calculating the
necessary moments, etc., it is found that the exponential
distribution also has constant values for γ₁ and γ₂, indepen-
dent of the parameter β.  These are given by:

            γ₁ = coefficient of skewness = 2
                                                        (43)
            γ₂ = measure of kurtosis = 9

Hence, another reference point is available for comparison
purposes.  The exponential distribution is a one-tailed dis-
tribution which (as indicated above) is heavily skewed right,
with a sharper peak, thinner shoulders, and a fatter tail
than that of the corresponding normal distribution.  The
density function for the exponential distribution is graphed
below.
         Figure 5.  Probability density function for
                the exponential distribution
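These constant values can be verified numerically; the following sketch (illustrative Python, not part of the report) builds the central moments of the exponential distribution from its raw moments E[X^j] = j!β^j and recovers γ₁ = 2 and γ₂ = 9 for any β:

```python
import math

def exp_central_moment(k, beta):
    """k-th central moment of the exponential distribution, from the
    raw moments E[X**j] = j! * beta**j via the binomial expansion
    E[(X - mu)**k] = sum_j C(k,j) E[X**j] (-mu)**(k-j), with mu = beta."""
    return sum(math.comb(k, j) * math.factorial(j) * beta**j
               * (-beta)**(k - j) for j in range(k + 1))

beta = 2.0
var = exp_central_moment(2, beta)                 # equals beta**2
gamma1 = exp_central_moment(3, beta) / var**1.5   # -> 2 for any beta
gamma2 = exp_central_moment(4, beta) / var**2     # -> 9 for any beta
```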
In the next sections,  the formulae will  be  investigated  for
various points of interest on the Weibull and  gamma distribu-
tions as a function of their parameters.  The  mean, median,
mode, value of the p.d.f. at the mode  (i.e., the maximum
value of the p.d.f.)  and the coefficient of skewness  and
measure of kurtosis will be considered.
C.   WEIBULL DISTRIBUTION
The two-parameter family of Weibull density functions is
given by:

            f(X) = abX^(b-1) e^(-aX^b)  for X > 0
                   0                    otherwise       (44)

where a, b > 0 are arbitrary parameters.  The cumulative
distribution can be found in closed form and is given by:

            F(X) = 1 - e^(-aX^b)  for X > 0
                   0              elsewhere             (45)

Graphs of the p.d.f. of the Weibull distribution for various
values of b are given in Figure 6 (for a = 1).

   Figure 6.  Probability density function of the Weibull
 distribution for various values of parameter b (for a = 1)
From Figure 6, it can be seen that for b = 3.26 the curve is
almost normal; for b = 2, it is skewed right with a measure
of kurtosis above 3; and, for b = 1, it is an exponential
curve.  For b = 1, the Weibull distribution reduces precisely
to the exponential distribution.  In order to observe how the
curve changes for different values of a and b, some of its
points of interest are analytically calculated below as a
function of a and b.  The following formulae are used:
        μ = mean = a^(-1/b) Γ(1 + 1/b)                  (46)

        median = a^(-1/b) (log 2)^(1/b)                 (47)

        mode = a^(-1/b) [(b-1)/b]^(1/b)
               (for b ≥ 1; if b < 1, the mode = 0)      (48)

        f(mode) = maximum value of p.d.f.
                = ab[(b-1)/(ab)]^((b-1)/b) e^(-(b-1)/b)
                  (for b > 1; if b < 1 the max. value = +∞)
                                                        (49)
        σ² = variance = a^(-2/b)[Γ(1+2/b) - Γ²(1+1/b)]           (50)

        σ = standard deviation = √σ²                             (51)

        μ₃ = third central moment
           = a^(-3/b)[Γ(1+3/b) - 3Γ(1+1/b)Γ(1+2/b) + 2Γ³(1+1/b)]
           = a^(-3/b)Γ(1+3/b) - 3μσ² - μ³                        (52)

        μ₄ = fourth central moment
           = a^(-4/b)[Γ(1+4/b) - 4Γ(1+1/b)Γ(1+3/b)
                      + 6Γ²(1+1/b)Γ(1+2/b) - 3Γ⁴(1+1/b)]
           = a^(-4/b)Γ(1+4/b) - 4μμ₃ - 6μ²σ² - μ⁴                (53)

        γ₁ = coefficient of skewness = μ₃/σ³                     (54)

        γ₂ = measure of kurtosis = μ₄/σ⁴                         (55)

Letting a = 1,  the above formulae are used to calculate the
points of interest for various values of b as given  in
Table 5.
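As an illustrative check (Python sketch, not part of the report), the formulae above reproduce the Table 5 entries; math.gamma supplies the Γ function:

```python
import math

def weibull_points(a, b):
    """Selected points of interest for the Weibull p.d.f.
    f(x) = a*b*x**(b-1)*exp(-a*x**b), following equations (46)-(55)."""
    g = lambda k: math.gamma(1 + k / b)         # Gamma(1 + k/b)
    s = a ** (-1.0 / b)                         # scale factor a**(-1/b)
    mean = s * g(1)                                             # (46)
    median = s * math.log(2) ** (1 / b)                         # (47)
    mode = s * ((b - 1) / b) ** (1 / b) if b >= 1 else 0.0      # (48)
    var = s**2 * (g(2) - g(1)**2)                               # (50)
    mu3 = s**3 * (g(3) - 3*g(1)*g(2) + 2*g(1)**3)               # (52)
    mu4 = s**4 * (g(4) - 4*g(1)*g(3)
                  + 6*g(1)**2*g(2) - 3*g(1)**4)                 # (53)
    skew = mu3 / var**1.5                                       # (54)
    kurt = mu4 / var**2                                         # (55)
    return mean, median, mode, var, skew, kurt

# reproduces the b = 2.0 column of Table 5 (a = 1)
mean, median, mode, var, skew, kurt = weibull_points(1.0, 2.0)
```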

As can be seen from Table 5 at b = 3.26, the  Weibull distri-
bution function is "almost" normal.  For example, its mean,
mode, and median all occur at approximately the same point.
Furthermore, its coefficient of skewness is near 0 and  its
measure of kurtosis is near 3 at that value of b.  As b
decreases from 3.26 to the values of 2.0, 1.5 and lower,  the
curve becomes more and more non-normal.  Its  mode begins  to
shift farther to the left of its mean  (and median) indicating
a skewed right distribution.  The coefficient of skewness
(as expected) begins to increase from 0.089 to 0.631, 1.08,
etc.  Finally,  the measure of kurtosis begins to increase
also, indicating sharper peaks, thinner shoulders and fatter
   Table 5.  VALUES FOR VARIOUS POINTS OF INTEREST IN THE
           WEIBULL DISTRIBUTION FOR VARIOUS VALUES
                 OF PARAMETER b (WHEN a = 1)

  Point of interest        b=3.26   b=2.0    b=1.5    b=1.4    b=1.1
  Mode                     0.894    0.707    0.481    0.409    0.113
  Value of f(mode)         1.26     0.858    0.745    0.736    0.808
  Median                   0.894    0.833    0.783    0.770    0.716
  Mean (μ)                 0.896    0.886    0.903    0.911    0.965
  Variance                 0.091    0.215    0.375    0.436    0.771
  3rd moment about μ       0.002    0.0628   0.248    0.343    1.174
  Standard deviation (σ)   0.303    0.463    0.613    0.660    0.878
  F(μ+σ)                   0.836    0.838    0.845    0.848    0.859
  F(μ-σ)                   0.166    0.163    0.145    0.134    0.081
  % between μ-σ and μ+σ    67%      67.5%    70%      71.4%    77.8%
  Coefficient of skewness
    (= 0 for normal)       0.089    0.631    1.08     1.19     1.73
  Measure of kurtosis
    (= 3 for normal)       2.73     3.244    4.384    4.838    7.296
  4th moment about μ       0.023    0.150    0.619    0.918    4.336
tails than the normal.  At b = 1, the Weibull distribution
reduces to the exponential distribution with γ₁ = 2 and
γ₂ = 9.  It can be seen that γ₁ and γ₂ are converging to
these values: when b = 1.1, γ₁ = 1.73 and γ₂ = 7.3.  If b
is allowed to attain values < 1, then as indicated earlier
the mode becomes 0 and the maximum value of f becomes +∞.
Further, the parameters γ₁ and γ₂ continue to increase,
giving larger values than those corresponding to the expo-
nential distribution.  When b > 3.26, the Weibull distribu-
tion becomes skewed left and its kurtosis drops below 3.
From the above discussion it can be seen that the parameter
b controls the shape of the Weibull distribution, i.e.,
whether it is "almost" normal or skewed right or left, etc.
Consider what happens whenever the parameter, a, changes for
given values of b.  Table 6 is similar to Table 5 except that
the value of a has been changed from 1.0 to 1.0 x 10⁻⁵.  This
change in the value of a "stretches" the p.d.f. to cover
data that are wide ranging.  It also causes the maximum value
of the p.d.f. to decrease to the point where the curve is
almost flat.  The values of γ₁ and γ₂ do not change since
they are unitless measures and should not be affected.

D.   GAMMA DISTRIBUTION

The gamma distribution is another of the extreme value dis-
tributions.  The two-parameter family of the gamma p.d.f. is
given by:

            f(X) = [1/(Γ(α)β^α)] X^(α-1) e^(-X/β)  for X > 0
                   0                               otherwise
                                                        (56)

where α, β > 0 are arbitrary parameters.  It can be noted
that when α = 1, the gamma distribution reduces to the
exponential distribution.  When α = n/2 and β = 2 for a
positive integer n, the gamma distribution reduces to the
chi-square distribution with n degrees of freedom.

The closed form for the cumulative distribution function of
the gamma distribution does not in general exist, and in
order to look up its values one should consult a table of
the incomplete gamma function.  Of course, if α = 1, the
closed form exists and, if α = n/2 and β = 2 for a positive
integer n, the appropriate chi-square table can be used.
The graph of the p.d.f. of the gamma distribution for various
values of α with β = 1 (fixed) is shown in Figure 7.

As can be seen from Figure 7, the gamma distribution is a
skewed right distribution with relatively fat tails.  The
parameter α is the shape parameter and the parameter β is
the stretch parameter.

      Table 6.  VALUES FOR VARIOUS POINTS OF INTEREST IN THE WEIBULL DISTRIBUTION
                FOR VARIOUS VALUES OF PARAMETER b (WHEN a = 1.0 x 10⁻⁵)

  Point of interest        b=3.26      b=2.0        b=1.5        b=1.4        b=1.1
  Mode                     30.6        223.6        1,036        1,525        3,968
  Value of f(mode)         0.037       0.003        0.0003       0.0002       0.00002
  Median                   30.6        263.4        1,687        2,870        25,140
  Mean (μ)                 30.6        280.2        1,946        3,396        33,880
  Variance                 107.5       2.15 x 10⁴   1.74 x 10⁶   6.06 x 10⁶   9.50 x 10⁸
  3rd moment about μ       79.8        1.97 x 10⁶   2.98 x 10⁹   1.78 x 10¹⁰  5.08 x 10¹³
  Standard deviation (σ)   10.4        146.6        1,319        2,461        30,831
  F(μ+σ)                   0.836       0.838        0.845        0.848        0.859
  F(μ-σ)                   0.165       0.163        0.145        0.134        0.066
  % between μ-σ and μ+σ    67.1%       67.5%        70%          71.4%        79.3%
  Coefficient of skewness  0.089       0.631        1.08         1.19         1.73
  Measure of kurtosis      2.73        3.248        4.39         4.82         7.294
  4th moment about μ       3.41 x 10⁴  1.5 x 10⁹    1.33 x 10¹³  1.77 x 10¹⁴  6.59 x 10¹⁸

  The slight variations between the coefficient of skewness and the measure of
  kurtosis values shown in Tables 5 and 6 are due to calculator round-off and
  actually should be identical.

     Figure 7.  Probability density function for the gamma
  distribution for various values of parameter α (for β = 1)

To get an analytical picture of skewness, kurtosis, etc.,
for the gamma distribution, some formulae are provided below
for calculating the mean, variance, etc.
            μ = mean = αβ                                (57)

            σ² = variance = αβ²                          (58)

            mode = (α-1)β                                (59)

            f(mode) = maximum value of p.d.f.
                    = [1/(Γ(α)β)] (α-1)^(α-1) e^(-(α-1)) (60)

            σ = standard deviation = √σ² = β√α           (61)

            μ₃ = third central moment = 2αβ³             (62)

            μ₄ = fourth central moment = (3α² + 6α)β⁴    (63)

            γ₁ = μ₃/σ³ = 2αβ³/(β³α^(3/2)) = 2/√α         (64)
            γ₂ = μ₄/σ⁴ = (3α² + 6α)β⁴/(α²β⁴) = 3 + 6/α   (65)

From the above formulae for γ₁ and γ₂, it can be seen that
the gamma distribution always has a positive coefficient of
skewness and always has a measure of kurtosis > 3.  Thus, it
will tend to be right skewed and have a sharper peak and
fatter tails than the corresponding normal distribution.
Only by changing the shape parameter α to very large values
can the gamma distribution be made to approach the normal
distribution's values for γ₁ and γ₂.  When this is done, the
mean and the mode tend to become closer together, as they
should for the normal distribution.  For α = 1, the skewness
is 2 and the kurtosis is 9, as they should be for the expo-
nential distribution, of which the gamma is a generalization.
Finally, it should be noted that for values of α < 1, the
skewness and kurtosis increase sharply to very large values.
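These relationships are compact enough to state directly in code (illustrative Python sketch based on equations (57), (58), (64), and (65)):

```python
import math

def gamma_dist_stats(alpha, beta):
    """Mean, variance, skewness and kurtosis of the gamma distribution,
    from equations (57), (58), (64) and (65)."""
    return (alpha * beta,            # mean
            alpha * beta**2,         # variance
            2 / math.sqrt(alpha),    # coefficient of skewness
            3 + 6 / alpha)           # measure of kurtosis

# alpha = 1 recovers the exponential values (skewness 2, kurtosis 9);
# a large alpha approaches the normal values (0 and 3)
exp_case = gamma_dist_stats(1.0, 1.0)
near_normal = gamma_dist_stats(400.0, 1.0)
```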

E.   LOG-NORMAL DISTRIBUTION

The two-parameter family of log-normal p.d.f.'s is given by:

       f(X) = [1/(Xβ√2π)] e^(-(log X - α)²/(2β²))  for X > 0
              0                                    elsewhere

where α is any real number and β > 0.  A random variable with
this p.d.f. has the property that its logarithm to base e is
a normal random variable.  A graph of the p.d.f. for the
log-normal distribution looks similar to the one shown in
Figure 8.

As can be seen, the log-normal distribution is another right
skewed distribution.  Further, it tends to have "heavier
tails" (i.e., larger kurtosis) than even the exponential
distribution.  (This distribution is shown graphically by
Curran and Frank.⁵)

           Figure 8.  Probability density function
               for the log-normal distribution

F.   SAMPLE SKEWNESS AND KURTOSIS
Previous discussions were concerned with theoretical distri-
butions and their analytical skewness and kurtosis.  It is
also instructive to consider samples drawn at random from
some population.  Consider the sample analogs of skewness
and kurtosis and observe how they can be used to detect
deviations from normality toward some other distribution,
e.g., the exponential distribution.  Formulae are presented
below which are used to produce unbiased estimates of the
variance and some other higher central moments about the
mean:

     m₃ = third (unbiased) central moment about X̄

        = [n/((n-1)(n-2))] Σ (Xᵢ - X̄)³                  (66)
5Curran, T. C.,  and N. H. Frank.  Assessing the Validity of
 the Lognormal Model when Predicting Maximum Air Pollution
 Concentrations.  (Presented at the 68th Annual Meeting of
 the Air Pollution Control Association, Boston.  June 15-20,
 1975.)
     m₄ = fourth (unbiased) central moment about X̄

        = [(n²-2n+3)/((n-1)(n-2)(n-3))] Σ (Xᵢ - X̄)⁴
          - [3(2n-3)/(n(n-1)(n-2)(n-3))] [Σ (Xᵢ - X̄)²]²
                                                        (67)

     (SD)² = unbiased estimate of variance

           = [1/(n-1)] Σ (Xᵢ - X̄)²                      (68)

From the above, the sample coefficient of skewness, g₁, and
measure of kurtosis, g₂, can be constructed as follows:

                    g₁ = m₃/(SD)³                       (69)

                    g₂ = m₄/(SD)⁴                       (70)
The law of large numbers discussed earlier can now be en-
larged to include the sample parameters g₁ and g₂; i.e., g₁
and g₂ also converge in probability to their population
counterparts, γ₁ and γ₂, as the sample size, n, becomes
large.
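An illustrative Python sketch of the computation (the formulas are those of equations (66) through (70)):

```python
def sample_g1_g2(xs):
    """Sample coefficient of skewness g1 and measure of kurtosis g2,
    built from the unbiased central moment estimates of eqs. (66)-(68)."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs)
    s3 = sum((x - xbar) ** 3 for x in xs)
    s4 = sum((x - xbar) ** 4 for x in xs)
    var = s2 / (n - 1)                                            # (68)
    m3 = n * s3 / ((n - 1) * (n - 2))                             # (66)
    m4 = ((n*n - 2*n + 3) * s4 / ((n - 1) * (n - 2) * (n - 3))
          - 3 * (2*n - 3) * s2**2
            / (n * (n - 1) * (n - 2) * (n - 3)))                  # (67)
    return m3 / var**1.5, m4 / var**2                             # (69), (70)

# a perfectly symmetric sample has g1 = 0
g1, g2 = sample_g1_g2([1.0, 2.0, 3.0, 4.0, 5.0])
```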

Tables 7 and 8 provide the 0.05 and 0.01 points of the
distribution of the coefficient of skewness and the measure
of kurtosis, respectively, which can be used to evaluate the
sampling distributions of g₁ and (g₂-3).  If a given sample
has a value of g₁ or (g₂-3) beyond the 0.05 point, it may be
deemed non-normal.  In a stricter test, the 0.01 point would
be used.  The tables for g₁ and (g₂-3) are not complete and
do not give values for small numbers of samples in the case
of (g₂-3).  More extensive tables can probably be obtained
from the original source cited in the tables.  If the calcu-
lated values of g₁ and (g₂-3) fall below the 0.05 value, this
does not mean that one can categorically accept normality.
However, it is a
     Table 7.  0.05 AND 0.01 POINTS OF THE DISTRIBUTION
   OF γ₁, THE COEFFICIENT OF SKEWNESS (NORMAL UNIVERSE)⁶,ᵃ

                   Probability that γ₁ will exceed listed
                   value in positive direction is:
  Sample size, n       0.05 point       0.01 point
       25                0.711            1.061
       30                0.661            0.982
       35                0.621            0.921
       40                0.587            0.869
       45                0.558            0.825
       50                0.533            0.787
       60                0.492            0.723
       70                0.459            0.673
       80                0.432            0.631
       90                0.409            0.596
      100                0.389            0.567
      125                0.350            0.508
      150                0.321            0.464
      175                0.298            0.430
      200                0.280            0.403
      250                0.251            0.360
      300                0.230            0.329
      400                0.200            0.285
      500                0.179            0.255
      750                0.146            0.208
    1,000                0.127            0.180

 ᵃPoints listed are on the positive tail of the distribution;
  with a minus sign attached, they are equally valid for the
  negative tail.

 ⁶Geary, R. C., and E. S. Pearson.  Tests of Normality.
  London, University College, 1938.
        Table 8.  PERCENTAGE POINTS OF THE DISTRIBUTION
       OF γ₂, THE MEASURE OF KURTOSIS (NORMAL UNIVERSE)⁶

                   Probability that γ₂     Probability that γ₂
                   falls below listed      falls above listed
                   value is:               value is:
  Sample size, n   0.01 point  0.05 point  0.05 point  0.01 point
       200           -0.63       -0.49        0.57        0.98
       250           -0.58       -0.45        0.52        0.87
       300           -0.54       -0.41        0.47        0.79
       500           -0.43       -0.33        0.37        0.60
     1,000           -0.32       -0.24        0.26        0.41
     2,000           -0.23       -0.17        0.18        0.28
     5,000           -0.15       -0.11        0.12        0.17
 good  indication  that  the  distribution  does  not  stray  far
 from  normality (with  respect  to  skewness  and  kurtosis)  and
 would not  produce  serious error  in most statistical appli-
 cations, assuming  that it is  a normal  distribution.

As an example of the above procedure for using g₁ and g₂ and
the tables, consider the 224 power plants described earlier
and their data on coal consumption, plant capacity, percent
sulfur, percent ash, and stack height.  Table 9 gives the
values of g₁, g₂, and (g₂-3) for these data.

From the conclusions shown in Table 9, it can be seen that,
with the exception of sulfur, normality of the sample data
can be denied on the basis of skewness and kurtosis alone.
Percent ash is denied because of skewness alone and the
others fail both tests.  The natural question to ask at this
point is what distribution (if any) do the data fit if they
are definitely non-normal?  Since g₁ = 2.05 and g₂ = 8.03 for
coal consumed, and γ₁ = 2 and γ₂ = 9 for the exponential dis-
tribution, one might guess that the coal consumption data are
  Table 9.  VALUES OF g₁, g₂ AND (g₂-3) FOR POWER PLANT EXAMPLE

  Type of data      g₁     g₂     (g₂-3)   Conclusions
  Coal consumed     2.05   8.03    5.03    Non-normal
  Plant capacity    1.5    4.8     1.8     Non-normal
  Stack height      1.46   5.14    2.14    Non-normal
  Percent sulfur    0.25   2.88   -0.12    Looks normal as far as
                                           skewness and kurtosis
                                           are concerned
  Percent ash       0.48   3.04    0.04    Non-normal because of
                                           slight right skewness;
                                           kurtosis is satisfactory
almost exponentially distributed.  In the next section,
another goodness-of-fit test, called the chi-square goodness-
of-fit test, will be discussed.  This technique can be used
to test how well sample data fit any of a variety of distri-
butions, and it will be used to analyze the distributions of
coal consumed and percent sulfur in the power plant data.

G.   THE CHI-SQUARE GOODNESS-OF-FIT TEST

The chi-square test for goodness-of-fit is a general non-
parametric statistical test used to test a hypothesis, H₀,
of the form:  the sample data are distributed according to
some given probability distribution.  The chi-square test
compares a set of actual sample frequencies with a set of
frequencies that would be expected on the basis of H₀.  If
the two sets compare well, the hypothesis is accepted and
the data are assumed to be distributed in the way claimed.
If the two sets of frequencies compare poorly, the hypothesis
is rejected.  Since the sampling distribution that is used
tends to form a chi-square distribution under the given
hypothesis, the test is called a chi-square test.
In order to formulate the chi-square test, let F₁, ..., Fₙ
represent the actual frequencies of the sample data in n
class intervals.  Under the hypothesis (H₀) that the data
are distributed according to some given distribution
function, let f₁, ..., fₙ be the theoretical frequencies
that would be expected for a sample of the same size from
the given distribution.  If H₀ is true, sample values of the
quantity:

                          n
                    χ² =  Σ  (Fᵢ - fᵢ)²/fᵢ              (71)
                         i=1

will tend to form a chi-square distribution.  Hence, given a
sample and its actual frequencies, the theoretical frequen-
cies that would result from H₀ can be calculated and substi-
tuted into the above formula to obtain a value of χ².  Then,
by looking at a chi-square table, one can determine whether
the sample value of χ² is significant enough (at whatever
level desired, usually 0.01 or 0.05) to be rejected as a
value from the chi-square distribution.  If it is, the hypo-
thesis H₀ is rejected and if it is not, H₀ is accepted.  In
the case of rejection, the probability of error can be stated
and one can be as confident as desired in the rejection.
However, if rejection cannot be made at a given level of
confidence, then H₀ is accepted, but this acceptance cannot
be accompanied by any statement of error.
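The statistic itself is a short computation.  In the sketch below (illustrative Python; the frequencies are invented, and the 0.05 critical value of 7.81 for three degrees of freedom is read from a standard chi-square table, not taken from this report):

```python
def chi_square_stat(actual, theoretical):
    """Chi-square statistic of equation (71) for matched lists of
    actual and theoretical class-interval frequencies."""
    return sum((F - f) ** 2 / f for F, f in zip(actual, theoretical))

# toy frequencies for 4 class intervals; with no parameters estimated
# from the sample, degrees of freedom = 4 - 1 = 3, whose 0.05 critical
# value from a chi-square table is 7.81
x2 = chi_square_stat([18, 30, 28, 24], [25.0, 25.0, 25.0, 25.0])
reject_h0 = x2 > 7.81
```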

Before applying the chi-square test, one must consider the
number of degrees of freedom to be used for the sampling
distribution of χ².  Since the sum for χ² extends over n
terms (where n = the number of class intervals), it may
appear that the sampling distribution has n degrees of
freedom.  However, some of these degrees of freedom have
been utilized in the construction of the test.  The given
sample size was used to determine the theoretical frequency
of that sample size for each class interval.  This reduces
the number of degrees of freedom by one, to n-1.  The "given"
distribution in H₀ (the hypothesis to which the test is being
applied) is generally one of the many distributions available
in statistics, and these are generally parametric families of
distributions with one, two, or more parameters.  In calcu-
lating the theoretical frequencies for each class interval
from the given distribution, these parameters must be speci-
fied in some way.  The usual methods of obtaining these
parameters for a given sample involve using the data in the
sample in some way (e.g., the method of moments, the method
of maximum likelihood, or the method of least squares).  If
the sample data are used to determine the parameters, the
number of degrees of freedom must be further reduced by one
for each parameter so estimated.  Thus, for a two-parameter
distribution, the total number of degrees of freedom finally
realized for the χ² test is n-3 (where n = the number of
class intervals used).  If the distribution in the hypothesis
H₀ has its parameters specified independently of the sample,
or if the theoretical frequencies are given without needing
to be calculated from a given distribution, it is not neces-
sary to reduce the number of degrees of freedom by any more
than one (i.e., to n-1).

Another consideration that arises with the chi-square test
is the number of theoretical frequencies in each class
interval.  It is a conservative rule that the theoretical
frequencies in any class interval be five or more.  If this
is not the case, adjacent intervals should be pooled so as
to accomplish this.  When two class intervals are pooled,
however, the number (n) of class intervals is reduced by one
and, hence, so is the number of degrees of freedom of χ².
Thus, it is of no benefit to start the test with a large
number of class intervals, for in the end these will have to
be pooled to obey "the rule of 5."  Another point to observe
is that
whenever the number of class intervals has to be reduced to
three or less in order to obey the "rule of 5," the number
of degrees of freedom for χ² drops to zero and, hence, the
χ² test is not applicable.  Thus, for very small sample sizes
(approximately 15 or less), the χ² test will not be very
workable and other measures must be used to fit the data.

Actually, the "rule of 5" is conservative and need not be
strictly obeyed.  For example, if a theoretical frequency for
some class interval is around four  and the actual frequency
is close to that value, it would not be necessary to pool the
class intervals and thus reduce the number of degrees of
freedom.  Furthermore, it has been  observed that if an error
of 1% can be tolerated in the probabilities read from the
chi-square table at the 0.05 level  or a 0.2% error at the
0.01 level, then the smallest theoretical frequency can be
as low as two if there are at least six degrees of freedom,
as low as one if there are 10 degrees of freedom, and as low
as 0.5 if there are 25 degrees of freedom.  However, all of
these low frequencies should occur  only once to be allowed;
otherwise, pooling must be used according to the "rule of 5"
described above.

The data from the power plants will now be used to demon-
strate the χ² test.  Consider coal consumed, which on the
basis of skewness and kurtosis was guessed to follow an
exponential distribution.  The method of least squares is
used to arrive at values of the parameters a and b in the
Weibull distribution from the given data.  These values were
a = 9.7 x 10⁻⁴ and b = 0.9974.  Since b ≈ 1, the exponential
distribution is represented here.  A chi-square test will now
be performed to see if the data do indeed fit the exponential
distribution.  Table 10 gives the actual and theoretical
frequencies for nine class intervals (determined by Sturges'
rule), beginning with the smallest value and proceeding
through the largest value.
      Table 10.  THEORETICAL AND ACTUAL FREQUENCIES FOR
            NINE CLASS INTERVALS (COAL CONSUMED)

  Class       Theoretical    Actual
  interval    frequency      frequency
     1          115.2          116
     2           52.0           55
     3           27.1           22
     4           14.2           14
     5            7.4            8
     6            3.9            4
     7            2.0            3
     8            1.1            0
     9            1.2            2
In order to obey "the rule of 5," the last four class inter-
vals are pooled into one with a theoretical frequency of 8.2
and an actual frequency of 9.  Thus, six class intervals and
three degrees of freedom result.  The value of χ² for this
table is 1.3.  Referring to a χ² table, this value of χ² is
not significant at the 0.01 level or at the 0.05 level.  In
fact, it becomes significant only at the 0.8 level.  This
indicates a very good fit and, hence, one can conclude that
the data are indeed exponentially distributed.
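The pooled computation can be checked directly from the Table 10 frequencies (illustrative Python sketch):

```python
def chi_square_stat(actual, theoretical):
    """Chi-square statistic of equation (71)."""
    return sum((F - f) ** 2 / f for F, f in zip(actual, theoretical))

theoretical = [115.2, 52.0, 27.1, 14.2, 7.4, 3.9, 2.0, 1.1, 1.2]
actual = [116, 55, 22, 14, 8, 4, 3, 0, 2]

# pool the last four class intervals to satisfy "the rule of 5":
# six intervals remain, giving 6 - 1 - 2 = 3 degrees of freedom
# (two Weibull parameters were estimated from the sample)
pooled_t = theoretical[:5] + [sum(theoretical[5:])]   # last entry 8.2
pooled_a = actual[:5] + [sum(actual[5:])]             # last entry 9
x2 = chi_square_stat(pooled_a, pooled_t)              # about 1.3
```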

Recalling that the gamma distribution should also fit the
exponential distribution for α = 1, a method-of-moments fit
of the gamma distribution to the sample data was completed
to obtain the parameters α = 1.06 and β = 1,024.8.  Upon
doing a χ² test, it was found that the number of degrees of
freedom was three and χ² = 1.9.  This value of χ² was not
significant until the 0.6 level of significance.  Hence, a
very good fit to the data was also obtained using the gamma
distribution.
In view of the above discussion, one would certainly believe
that the χ² test should reject normality of these data.
Using the maximum likelihood fit to the data (i.e., simply
using μ = sample mean and σ = sample standard deviation) for
the normal distribution, one obtains three degrees of freedom
and a χ² value of 31.6, which can be rejected at any level
for three degrees of freedom.

Since the log-normal distribution is a right skewed distribu-
tion, it is reasonable to believe that it may fit these data.
Indeed, the χ² test yields six degrees of freedom with
χ² = 3.4.  This value is not significant with this number of
degrees of freedom until the 0.75 or 0.8 level.  Hence, the
log-normal distribution also yields a very good fit to these
data.

Next, consider the % sulfur in the coal consumed.  Since the
skewness and kurtosis tests earlier indicated a possible nor-
mal distribution for these data, the normal distribution was
fitted to the data by maximum likelihood and a χ² test was
performed.  The table for nine class intervals is shown below:

     Table 11.  THEORETICAL AND ACTUAL FREQUENCIES FOR
           NINE CLASS INTERVALS (% SULFUR IN COAL)

         Class        Theoretical      Actual
         interval     frequency        frequency
            1             22.3             32
            2             31.4             28
            3             46.5             37
            4             50.0             51
            5             39.0             42
            6             22.0             26
            7              9.1              4
            8              2.7              1
            9              1.0              3
                             54

-------
A clarification is in order regarding the manner in which the
last few class intervals should be pooled.  Since pooling the
last two intervals yields a theoretical frequency of 3.7 with
an actual frequency of 4, this pooling is used to obtain five
degrees of freedom and a χ² value of 10.3.  This value of χ²
is not significant at the 0.05 level (it becomes significant
at about 0.07 or 0.08).  Hence, normality for % sulfur is
accepted, although with some reservations.  If the last three
intervals had been pooled, four degrees of freedom would have
been obtained and χ² would be 9.2, which is also not signifi-
cant at 0.05 (it likewise becomes significant at about 0.07).

Since the fit to the normal distribution for % sulfur is
rather poor by χ², various fits to other distributions may be
evaluated, viz., the Weibull (by both maximum likelihood and
least squares), the gamma, and the log-normal distributions.
All of these distributions fail, although the Weibull maximum
likelihood fit comes closest, with five degrees of freedom and
a χ² of 18.3.  However, this value is significant at 0.01 and
consequently the fit is rejected with "99% confidence."

The amount of computation required to perform chi-square,
skewness, and kurtosis tests on sample data becomes very
large.  Hence, a computer program was written to perform the
computations involved in obtaining these measures for various
types of fits to the four distributions which have been
considered in this report:  the Weibull, gamma, log-normal,
and normal distributions.  This program and its method of
operation are described in Appendix D.
                             55

-------
                         SECTION VI

                         APPENDIXES


A.   Standard Statistical Formulas for Finite Populations
B.   Detailed Derivations of the Criteria Pollutant Severity
     Equations
C.   The Simulation Programs


D.   The Goodness-Of-Fit Program


E.   Treatment of Correlated Data by Linear Transformation
                             57

-------
                         APPENDIX A
    STANDARD STATISTICAL FORMULAS FOR FINITE POPULATIONS

In certain cases, when sampling from finite populations (e.g.,
brick kilns, cattle feedlots, etc.), particularly when the
sample size, n, is an appreciable fraction of the total popula-
tion size, N, corrections must be made to the standard deviation
and confidence limit calculations.  It should be recognized that there is a
difference between the  sample standard deviation and the
estimated population standard deviation.  When dealing with
a population of 400  plants comprising a given source type,
N = 400.  If 10 plants are surveyed for stack height, age,
capacity, etc., n =  10.  Using the data from those  ten plants,
one can compute the  means and standard deviations of the
various parameters.  But the computed values are the biased
mean and standard deviation of the sample of 10 points.
What is really desired is an unbiased estimate of the mean
and standard deviation of the population of 400 plants.  The
symbol "^" will be used over another symbol to stand for an
estimate.  Assume that a sample mean, X̄, and a sample standard
deviation, SD, have been computed.  An estimate, μ̂, of the
total population mean, μ, is:

                           μ̂ = X̄                        (A-1)

                 where X̄ = Σx/n                         (A-2)

An estimate of the population standard deviation, σ, is
simply:

          σ̂ = SD·sqrt(n/(n-1))·sqrt((N-1)/N)            (A-3)

where SD = sample standard deviation
         = (1/n)·sqrt(nΣx² - (Σx)²)                     (A-4)
                             58

-------
 Earlier  it  was  indicated that the sample standard deviation
 gave  a biased estimate  of the corresponding population param-
 eter.  The  first  square root factor in Equation A-3  corrects
 for this bias.  As  the  sample size, n, becomes  larger, the
 factor approaches unity;  hence,  the bias is less for larger
 sample sizes than for small  ones.   The following table shows
 this  tendency:
    Table A-1.  FIRST SQUARE ROOT FACTOR IN EQUATION A-3
                AS A FUNCTION OF SAMPLE SIZE

         Sample size         sqrt(n/(n-1))
             (n)
              2                 1.4142
              5                 1.1180
             10                 1.0541
             50                 1.0102
            100                 1.0050
          1,000                 1.0001
The second factor, sqrt((N-1)/N), in Equation A-3 adjusts
for finite population size and, if n = N, the overall
correction factor is unity or,

                           σ̂ = SD                       (A-5)

which is logical since the sample size is equal to the popu-
lation size.  There is one more important parameter used in
calculating confidence limits and that is the estimated
standard error of the mean, σ̂_X̄:

             σ̂_X̄ = (σ̂/sqrt(n))·sqrt((N-n)/(N-1))        (A-6)

Confidence limits on μ are thus

                        μ̂ ± t·σ̂_X̄                       (A-7)
                             59

-------
where t is the standard "Student t" variable with (n-1)
degrees of freedom, and unless otherwise specified should
be for the 95% (α = 0.05) confidence level.

The preceding formulas are summarized below for reader
convenience:

     Quantity                           Formula

Sample mean, X̄:                X̄ = Σx/n                             (A-2)

Sample standard
  deviation, SD:               SD = (1/n)·sqrt(nΣx² - (Σx)²)        (A-4)

Estimated population
  mean, μ̂:                     μ̂ = X̄                                (A-1)

Estimated population
  standard deviation, σ̂:       σ̂ = SD·sqrt(n/(n-1))·sqrt((N-1)/N)   (A-3)

Estimated standard
  error of mean, σ̂_X̄:          σ̂_X̄ = (σ̂/sqrt(n))·sqrt((N-n)/(N-1))  (A-6)
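
As a numerical illustration (a sketch; the function name and
sample values are mine, and the square-root factors follow the
finite-population forms discussed in the text):

```python
import math

def finite_pop_stats(x, N):
    """Estimated population mean, standard deviation, and standard
    error of the mean for a sample x drawn from a population of size
    N, per Equations A-1 through A-6."""
    n = len(x)
    xbar = sum(x) / n                                             # (A-2)
    sd = math.sqrt(n * sum(v * v for v in x) - sum(x) ** 2) / n   # (A-4)
    sigma = sd * math.sqrt(n / (n - 1)) * math.sqrt((N - 1) / N)  # (A-3)
    se = sigma / math.sqrt(n) * math.sqrt((N - n) / (N - 1))      # (A-6)
    return xbar, sigma, se

# When n = N (the whole population is surveyed) the corrections
# collapse: sigma reduces to SD (Equation A-5) and se is zero.
```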
                             60

-------
                         APPENDIX  B




            DETAILED DERIVATIONS OF  THE  CRITERIA

                POLLUTANT SEVERITY EQUATIONS
1.   CO SEVERITY EQUATION

Since the primary standard for carbon monoxide, CO, is for a
1-hour averaging time, t = 60 minutes and t₀ = 3 minutes.

Given                χmax = 2Q/(πeuh²)                  (B-1)

Correcting for averaging time:

                χ̄max = χmax(t₀/t)^0.17                  (B-2)
                     = [2Q/(πeuh²)](3/60)^0.17
                     = 2Q(0.6)/(πeuh²)
                     = 1.2Q/[(3.14)(2.72)(4.5)h²]

                χ̄max = (3.12 x 10⁻²)Q/h²

Given                S = χ̄max/F                         (B-3)

For the criteria pollutants, F is set equal to the primary
standard, which is 0.04 g/m³ for CO.

           S = χ̄max/F = (3.12 x 10⁻²)Qh⁻²/0.04          (B-4)

and                  S_CO = 0.78 Q/h²
                             61

-------
2.   HYDROCARBON SEVERITY EQUATION

The primary standard for hydrocarbons is for a 3-hour
averaging time.  Thus, t = 180 minutes and t₀ = 3 minutes.

                χ̄max = χmax(3/180)^0.17                 (B-5)
                     = 0.5 χmax
                     = (0.5)(0.052)Q/h² = 0.026 Q/h²

For hydrocarbons,    F = 1.6 x 10⁻⁴ g/m³

Then       S = χ̄max/F = 0.026 Qh⁻²/(1.6 x 10⁻⁴)         (B-6)

and                  S_HC = 162.5 Q/h²                  (B-7)





3.   PARTICULATE SEVERITY EQUATION

The primary standard for particulates is for a 24-hour
averaging time.

                χ̄max = χmax(3/1440)^0.17
                     = (0.052)Q(0.35)/h² = (0.0182)Q/h²  (B-8)

For particulates,    Fp = 2.6 x 10⁻⁴ g/m³

           S = χ̄max/Fp = 0.0182 Qh⁻²/(2.6 x 10⁻⁴)       (B-9)

and                  Sp = 70 Q/h²                       (B-10)
                             62

-------
4.   SOx SEVERITY EQUATION

The primary standard for SOx is for a 24-hour averaging time.
The primary standard is 3.65 x 10⁻⁴ g/m³.

                     F_SOx = 3.65 x 10⁻⁴ g/m³

and        S = χ̄max/F_SOx = (0.0182)Qh⁻²/(3.65 x 10⁻⁴)  (B-12)

and                  S_SOx = 50 Q/h²                    (B-13)
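
The four severity coefficients derived above (Equations B-4,
B-7, B-10, and B-13) follow from the same base formula and
averaging-time correction, and can be checked numerically.
The sketch below is illustrative (names are mine); small
differences from the report's rounded intermediates are
expected.

```python
import math

# Base coefficient of chi_max = 2Q/(pi*e*u*h^2) with u = 4.5 m/s,
# i.e., chi_max = C*Q/h^2 (Equation B-1).
C = 2.0 / (math.pi * math.e * 4.5)              # ~0.052

def avg_factor(t_minutes, t0=3.0):
    """Averaging-time correction (t0/t)**0.17 from Equation B-2."""
    return (t0 / t_minutes) ** 0.17

# Severity coefficient k in S = k*Q/h^2 for each pollutant,
# dividing by the primary standard F (g/m^3).
severity_coeff = {
    "CO":          C * avg_factor(60)   / 0.04,     # 1-hour standard
    "HC":          C * avg_factor(180)  / 1.6e-4,   # 3-hour standard
    "particulate": C * avg_factor(1440) / 2.6e-4,   # 24-hour standard
    "SOx":         C * avg_factor(1440) / 3.65e-4,  # 24-hour standard
}
for name, k in severity_coeff.items():
    print(f"S_{name} = {k:.4g} Q/h^2")
```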





5.   SEVERITY EQUATION FOR NOx

Since NOx has a primary standard with a 1-year averaging
time, the χ̄max correction equation cannot be used.  Instead
the following equation is used:ᵃ

           χ̄ = [2.03 Q/(σz u X)] exp[-½(h/σz)²]         (B-14)

A difficulty arises, however, because a distance, X, from
emission point to receptor, is included.  To overcome this,
the following rationale is proposed:

The equation            χmax = 2Q/(πeuh²)               (B-1)

is valid for neutral conditions or when σz = σy.  This
maximum occurs when

                          h = √2 σz                     (B-15)

and, since                σz = aX^b                     (B-16)

ᵃPersonal communication.  Bruce Turner, Environmental Pro-
 tection Agency.

                             63

-------
then the distance X_max where the maximum concentration
occurs is

                X_max = [h/(a√2)]^(1/b)                 (B-17)

For neutral conditions, a = 0.113 and b = 0.911.  The
following sample calculations illustrate the concentration
estimates.

Assume Q = 10 g/s and h = 50 m; then

        X_max = (50/0.16)^1.098 = 548.7 m               (B-18)

and
        σz = (0.113)(548.7)^0.911 = 35.4 m

Assume u = 4.5 m/s; then

        χ̄ = [2.03 Q/(σz u X_max)] exp[-½(h/σz)²]
          = [(2.03)(10)/((35.4)(4.5)(548.7))] exp[-½(50/35.4)²]

or
        χ̄ = (2.32 x 10⁻⁴)(0.369)
          = 8.57 x 10⁻⁵ g/m³                            (B-19)

Simplifying Equation B-14, since σz = 0.113 X^0.911 m and
u = 4.5 m/s,

        χ̄ = (4 Q/X^1.911) exp[-½(h/σz)²]                (B-20)

and
        X_max = (h/0.16)^1.098 = 7.5 h^1.098            (B-21)

so that
        4 Q/X_max^1.911 = 4 Q/(7.5 h^1.098)^1.911
                        = 0.085 Q/h^2.1
                              64

-------
At X_max,
        σz = 0.113 X_max^0.911 = 0.71 h

Therefore,
        χ̄ = (0.085 Q/h^2.1) exp[-½(h/0.71 h)²]
          = (0.085 Q/h^2.1)(0.371)
          = (3.15 x 10⁻²)Q/h^2.1                        (B-22)

Substituting Q and h:
                χ̄ = 8.5 x 10⁻⁵ g/m³

in good agreement with Equation B-19.  The NOx standard is
1.0 x 10⁻⁴ g/m³.  Therefore,

                F = 1 x 10⁻⁴ g/m³                       (B-23)

and the NOx severity equation is:

        S_NOx = (3.15 x 10⁻²)(Q/h^2.1)/(1 x 10⁻⁴)
              = 315 Q/h^2.1                             (B-24)
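
The numerical steps of this derivation (Equations B-17
through B-22) can likewise be checked; a sketch (variable
names mine) for the sample values Q = 10 g/s, h = 50 m,
u = 4.5 m/s used in the text:

```python
import math

a, b = 0.113, 0.911      # sigma_z = a*X**b for neutral conditions (B-16)
u = 4.5                  # wind speed, m/s
Q, h = 10.0, 50.0        # emission rate (g/s) and stack height (m)

# Distance of maximum concentration, from h = sqrt(2)*sigma_z (B-15, B-17).
X_max = (h / (a * math.sqrt(2))) ** (1.0 / b)
sigma_z = a * X_max ** b            # equals h/sqrt(2) by construction

# Annual-average concentration at X_max (Equation B-14).
chi = (2.03 * Q / (sigma_z * u * X_max)) * math.exp(-0.5 * (h / sigma_z) ** 2)

# Coefficient of the simplified form chi = coeff*Q/h**2.1 (Equation B-22).
coeff = (2.03 / (a * u)) / 7.5 ** 1.911 * math.exp(-0.5 / 0.71 ** 2)

print(round(X_max, 1), round(sigma_z, 1), chi, coeff)
```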
                             65

-------
                         APPENDIX C
                   THE SIMULATION PROGRAMS

1.   THE INPUT TO THE PROGRAM

In this section, details regarding the input to the simula-
tion program are discussed.  The input is divided into nine
groups, some of which are always required and some of which
are optional.  Each of these groups is discussed in the order
in which it should appear in practice.

a.   Input Groups

Group 1:  This is a required group and should appear first.
It is a single card containing three pieces of data:  a title
and two flags to indicate later options.  The format is 20A2,
2I5.

(ITIL(I),1=1,20) - a title which will be printed as such on
     output and which will appear below the x-axis on plots.
     This is read in columns  1-40.

LT - an integer in cols 41-45.  Suppose it has been decided
     that there is a correlation between a pair of input
     variables such that a simple regression must be performed
     to obtain the regression line and SE for sampling pur-
     poses as described in a previous section.  Then as men-
     tioned earlier, the user has the option of either
     supplying the program with the raw data and having the
     regression analysis performed on these data or supplying
     R,  XB, SX, YB, and SY directly and using these directly
     to obtain the regression line information.  By the use
     of the flag LT, the user signals the program which option
 Columns  are  subsequently abbreviated as:   cols.
                             66

-------
     should be exercised.  If LT = 0  (i.e., cols 41-45 are
     left blank), the program will expect raw data upon which
     to perform the regression analysis.  If LT = 1  (actually
     any integer other than 0), the program will expect to
     read in R etc. to perform the regression analysis.  In
     either case, data must be read into the program.  However,
     these data will not be placed as the second group but
     rather will appear in Group 8 to be discussed below.

     The option is available to run the program with no
     dependent variables.  In this case LT = 0 and the
     number of dependent variables NDVAR = 0.  NDVAR is
     in Group 3.

NCFLAG - an integer in cols 46-50.  This flag provides the
     option of using Sturges' rule to set up the number of
     class intervals and using values of XMIN and XMAX ob-
     tained from the first 50 or so values of the output
     variables to establish W and the resulting class inter-
     vals.  If NCFLAG = 0 (cols 46-50 are blank), Sturges'
     rule and XMIN and XMAX as described earlier will be
     used.  If NCFLAG = 1 (anything ≠ 0), the user can read
     into the program the number, NINT, of class intervals
     he wishes to use, the value XMIN of the left endpoint
     of the first class interval, and the width, W, of each
     succeeding class interval.  If the user decides to read
     in NINT etc., this information will be on a card in
     Group 9 of the input.
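
Sturges' rule chooses the number of class intervals from the
sample size.  A minimal sketch of the interval setup (the
exact rounding the program used is not stated in the text, so
the ceil() here is an assumption):

```python
import math

def class_intervals(values, nint=None, xmin=None, width=None):
    """Set up histogram class intervals.  With no user input
    (NCFLAG = 0) use Sturges' rule and the observed range; otherwise
    (NCFLAG != 0) honor the user-supplied NINT, XMIN, and W."""
    if nint is None:
        nint = math.ceil(1 + math.log2(len(values)))   # Sturges' rule
        xmin = min(values)
        width = (max(values) - xmin) / nint
    return [(xmin + i * width, xmin + (i + 1) * width) for i in range(nint)]
```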
Group 2:  This is a required group and consists of a single
card with three pieces of data:  XSAMP, XP0P, and NSAMP.  The
format is 2F10.3, I5.  This program was originally designed
to simulate the severity, S, of air pollution concentrations.
In doing so, it is sometimes desired to know how many plants
from the sample and from the population have predicted
                             67

-------
severity below  0.1, between  0.1  and  1,  and above  1.  Thus, the
program automatically monitors the number of  simulated values
of S that  fall  in these ranges.   It  then divides  the number
in each range by the simulated sample size  (usually 5,000 or
more) to obtain the relative  frequency  of each range.  It
then multiplies these relative frequencies by the actual
sample size, XSAMP, and population size, XP0P, to give the
required predictions for each range.  Thus, the program re-
quires XSAMP and XP0P in real format, and this is the place
where they are provided.  Also, the sample size, NSAMP, is entered
in integer form to be used  (if required) for  reading in the
raw data for the regression analysis later.

Of course, if the distributions  from samples  have not been
estimated, or if a person is  simply  not interested in the
values between  0.1 and 1, etc.,  then XSAMP and XP0P can be
left blank  (and also NSAMP if a  regression analysis on raw
data is not being performed).  However, if these  values are
left blank, the blank card must  still be supplied as Group 2.

Group 3:  This is a required group and consists of a single
card containing four pieces of information, all integers.
The format is 4I5.  Due to restricted core size, samples can-
not be drawn as large as required  (usually 5,000  or more) for
each input variable in order  that all of these can be used at
one time to calculate values  of  the  output variable.  Instead,
small samples must be drawn out  for  each input variable and
that many values of the output variable must  be calculated.
This is followed by the statistical manipulations desired
[for example, keeping a running ΣS and Σ(S)² for later use in
finding the mean and standard deviation of the output variable]
and a repetition of the process  as many times as  needed to
generate the desired sample size.  The  parameters on this
card dictate the manner in which the program  does this.
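
The pass structure and running sums just described can be
sketched as follows (a modern illustration; the names are
mine, and draw_group stands in for the sampling and function-
evaluation machinery):

```python
def simulate_in_passes(draw_group, npass, ngroup):
    """Accumulate running sums of S and S**2 over NPASS passes of
    NGROUP output values each, then recover the mean and standard
    deviation of the full NGROUP*NPASS sample (Equation A-4 form)."""
    n = npass * ngroup
    sum_s = sum_s2 = 0.0
    for _ in range(npass):
        for s in draw_group(ngroup):   # one pass of NGROUP values
            sum_s += s
            sum_s2 += s * s
    mean = sum_s / n
    sd = (n * sum_s2 - sum_s ** 2) ** 0.5 / n
    return mean, sd

# Example with a deterministic "sampler" so the result is checkable.
mean, sd = simulate_in_passes(lambda k: [2.0] * k, npass=100, ngroup=50)
```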
                             68

-------
NGR0UP - an integer in cols 1-5, which tells the program the
     sample size for each pass.  This number must be ≤ 50.

NPASS - an integer in cols 6-10 which tells the program how
     many passes to make, drawing out samples of size NGR0UP
     at each pass.  Clearly, the final sample size will be
     NGR0UP*NPASS.

NIVAR - an integer in cols 11-15.  This parameter tells the
     program how many of the input variables are independent.
     For example, in a given equation there may be five input
     variables.  The user may decide that one of the variables
     (e.g., Y) is dependent on  (or correlated with) another
     variable  (e.g., X) and that the other three variables
     are independent.  Then NIVAR in this case would be taken
     to be four:  the three that are given as independent and
     the independent variable X in the correlation or regres-
     sion equation.  Thus, there would be one dependent
     variable, Y, the variable that will be dependent in the
     regression equation.

NDVAR - an integer in cols 16-20 which gives the number of
     dependent input variables.  In the example above, NDVAR
     would be one.  However, in general, two or more pairs of
     correlated variables may be present in the equation so
     that NDVAR could be two or more.  The program does not
     allow a variable to be independent in one pair for
     correlation purposes and dependent in another.  However,
     the same independent variable can be used for two or
     more dependent variables and by proper choice this is
     all that one should ever need.

     A word of caution is in order at this point.  If NDVAR
     is ≥ 2, a regression analysis must be performed for each
     correlated pair in order to obtain the regression equa-
     tion to be used for sampling purposes.  It was pointed
                             69

-------
     out earlier that the flag LT in the first data group
     gave the user the option of either reading in raw data
     or supplying R etc. directly.  The user is cautioned
     here that if two or more regression fits are necessary,
     they must all be performed in the same manner:  either
     all with raw data or all with user-supplied values of
     R etc.  The program uses LT to get into one of two loops:
     either read raw data for X1 and Y1, perform a regression
     analysis, and then repeat for X2 and Y2, etc., until the
     number of dependent variables is exhausted; or, read R
     etc. for X1 and Y1, perform a regression analysis, and
     then read R etc. for X2 and Y2, etc., until the number
     of dependent variables is exhausted.  Finally, it should
     be noted that NIVAR + NDVAR (= the total number of
     variables) must be ≤ 10.

Group 4:   This is  a required group and consists of a single
card containing codes which tell what type of distribution is
to be used for  sampling from the independent variables.  The
values on the card are integers punched in 16I5 format (actu-
ally all 16 will not be used).  The values on the card are
read into an integer array IC0DE(I) from  I = 1 to NIVAR.  Thus,
IC0DE(1)  is the code which tells what type of distribution
independent variable 1 has, etc.   The distributions and their
corresponding codes are listed below:

                code 1 = Weibull distribution
                code 2 = normal distribution
                code 3 = gamma distribution
                code 4 = log-normal distribution

For example, suppose three independent variables exist and
they are ordered as VAR1, VAR2, VAR3.  Suppose VAR1 has a
Weibull distribution while VAR2 has a normal and VAR3 a log-
normal distribution.  Then the data card would look like the
line below:
                             70

-------
          cols      1-5       6-10      11-15
                      1         2          4
In setting up the function card  (to be discussed later), one
must exercise caution to be certain that the variables so
ordered above will appear in their proper places in the
function.

Group 5:   This group of data may consist of more than one
card depending on the number of independent variables con-
tained in the function.  This is the place where the program
is given the parameters that go along with the distribution
selected for each input (independent) variable.  For each
distribution, the parameters are listed which the program
needs for it as shown below:

     Distribution      Program parameters
          Weibull  →  A and B
           normal  →  μ and σ
            gamma  →  α and β
       log-normal  →  μ and σ for log X or, equivalently,
                      the log of the geometric μ and the
                      geometric σ for X.

These parameters are read into a matrix PAR(I,J) which is
dimensioned 10 x 2.  Thus, each row of the matrix has two
components and the rows correspond to the given variables.
The parameters for independent variable 1 should be read into
the first row; independent variable 2 into the second row,
and so on until the independent variables are exhausted.
Thus, the program reads values into PAR(I,J) row-wise, i.e.,
it reads parameters for independent variable 1 into row 1 as
the first two values on the data card, etc.  The parameters
are expected in the same order as listed in the table above,
i.e., for the Weibull distribution, A first then B etc.  The
format is 8E10.3.
                             71

-------
As an example, consider the situation described above where
three independent variables (VAR1, VAR2, and VAR3) were
present and these had Weibull, normal, and log-normal dis-
tributions, respectively.  The data card for the parameters
should contain A then B for the Weibull as the first two
entries; then μ then σ for the normal as the next two entries;
and, finally, μ then σ for log VAR3 (or the log of the
geometric μ then the geometric σ for VAR3) as the last two
entries.  Thus, there would be six fields (of the eight total)
taken up on the card and it might look like that shown below:

cols   1-10    11-20     21-30     31-40     41-50     51-60

          A        B         μ         σ         μ         σ

where the last μ and σ are for log VAR3.

Group 6:   This group of data may consist of more than one
card depending on the number of independent variables.  Some-
times in fitting continuous distributions to raw data or real-
world situations, the continuous distributions tend to take
on extremely high or extremely low values which are unrealis-
tic for the given data.  This is a particular problem with
long-tailed distributions such as the Weibull, gamma, and log-
normal distributions.  Hence, it is desired to "clip" the
continuous distributions at appropriate points to keep these
unusually large or small values from occurring.  This is the
data group in which the program is instructed at which points
to clip the (independent)  distributions.  The lower and upper
clip for each variable is read into a two-dimensional array
CLIP(I,J)  which is dimensioned 10 x 2.  The procedure of
reading one row containing a low clip and high clip, respec-
tively for each independent variable in the order in which
they are coded in IC0DE, is the same as that used for the
parameters.
                             72

-------
The following discussion pertains to the means by which the
distributions are clipped.  The direct approach to sampling
from the Weibull distribution uses the cumulative distribu-
tion function to obtain a sample value for X from a given
random number, R.  Hence, to clip the distribution so that
it does not allow values of X below some given value or above
some other given value, all that is needed is to find at
which points, C1 and C2, between 0 and 1, these low and high
values occur on the distribution.  The program is then
supplied with these points, C1 and C2, and it will automat-
ically clip the random number generator so that the random
numbers generated will be between C1 and C2; hence, the
corresponding X-value will be in the correct range also.
For example, suppose it is desired to exclude values of X
below 10 or above 6,000 and that these points occur at
C1 = 0.02 and C2 = 0.96, respectively, on the cumulative
distribution for the Weibull.  The program is then supplied
with these parameters  (viz., 0.02 and 0.96) as the values
for the array CLIP(I,J) and the random number generator is
clipped according to the equation below:

          R = (0.96 - 0.02)R + 0.02 = 0.94R + 0.02

This equation will automatically be set up internally.
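
A sketch of this inverse-CDF clipping for the Weibull (the
report's exact A, B parameterization is not given, so the CDF
form below is an assumption):

```python
import math, random

def sample_clipped_weibull(A, B, c1, c2, rng=random.random):
    """Draw a Weibull variate by the inverse-CDF method, with the
    uniform random number clipped to (c1, c2) as described above.
    Assumes the CDF F(x) = 1 - exp(-(x/A)**B)."""
    r = (c2 - c1) * rng() + c1          # e.g., R = 0.94R + 0.02
    return A * (-math.log(1.0 - r)) ** (1.0 / B)

# c1 and c2 are the CDF values of the desired low/high X cutoffs,
# e.g., 0.02 and 0.96 in the example above.
```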

Next, the problem of clipping the normal and log-normal
distributions is discussed.  The direct approach to sampling
from these distributions uses the probability density func-
tion and not the cumulative distribution function.  Hence,
the clips in these cases cannot be applied in the same manner.
Thus, a very direct approach was used for clipping the values
of the variable X in this case.  The program was supplied
with the actual lowest value that X was allowed to assume and
likewise the highest value that X was allowed to assume.  The
program then samples from normal with no restrictions.  If
the sample value obtained is between the low and high value
                             73

-------
 supplied,  it  is  retained.  Otherwise,  the program  samples
 again  and  continues  to do  so  until  it  obtains a value in the
 proper range.  Then  the program continues the above procedure
 until  the  desired  sample size is attained.  A word of caution
 is  in  order.   In supplying clips for the normal, low and high
 values for X  are supplied  directly.  However, in supplying
 clips  for  the log-normal,  low and high values for  log X,
 not X, should be supplied.
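
The rejection loop just described can be sketched directly
(names mine; shown here for the normal, with the same loop
running on log X for the log-normal, per the caution above):

```python
import random

def sample_clipped_normal(mu, sigma, low, high, rng=random):
    """Sample from a normal distribution, rejecting and redrawing
    any value outside [low, high], exactly as the clipping
    procedure describes."""
    while True:
        x = rng.gauss(mu, sigma)
        if low <= x <= high:
            return x
```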

If any or all of the variables are to be unclipped, this
card(s) must still be supplied.  To leave the Weibull distri-
bution unclipped, values of 0.0 and 1.0, respectively, are
supplied for the low and high clip.  To leave the normal or
log-normal distribution unclipped, extremely low or extremely
high values (for example, 1.0 x 10⁻¹⁸ and 1.0 x 10¹⁸) are
supplied for the low and high clip.

Group 7:  This data group may consist of more than one card
depending on the number, NDVAR, of dependent variables.  If
NDVAR is ≥ 1, the program in this data group is given two
pieces of information about each dependent variable:
(1) which independent variable it is correlated with (i.e.,
the number of the independent variable, e.g., 1 for VAR1,
etc.); and (2) the type of distribution to be used in sam-
pling for values of the dependent variable (either normal = 2
or log-normal = 4 in the present structure).  These codes for
 the dependent  variables are read into  a two-dimensional
 integer array  IDC0DE(I,J)  which is  a 10 x 2 matrix.  Hence,
 each row of IDC0DE corresponds to a dependent variable in
 the same order as the dependent variables are entered, i.e.,
 row 1  for  variable 1, etc.

For example,  suppose there are five input variables, four of
which are  independent and  one of which is dependent.  Suppose
 further that for the numbering used on the independent vari-
 ables, the dependent variable is correlated with the third

                              74

-------
independent variable and the log-normal distribution is
desired for sampling from the dependent variable.  The data
card for this group would then appear as follows:

               cols      1-5       6-10
                           3          4

Group 8:  This data group is optional and must appear only
if NDVAR is ≥ 1.  It may consist of more than one card.  This
data group will contain the information necessary to perform
the regression analysis between the independent variables and
dependent variables.  Depending on the value of the flag LT
in Group 1, this data group will contain either the raw data
for the pair(s) of correlated variables (independent variable
then dependent variable) or the values of R, XB, SX, YB, SY
(in that order) for the independent variable X and the
dependent variable Y.  The format on all cards is 8E10.3.

Note that if NDVAR is ≥ 2, the raw data must appear as
follows:

     (Independent VAR1)
     (Dependent   VAR1)

     (Independent VAR2)
     (Dependent   VAR2)   (even if Independent VAR2 = Independent VAR1)

      etc.

Further, if NDVAR is ≥ 2 and it is desired to enter the
values of R, etc., they must appear as below:

     (R1, XB1, SX1, YB1, SY1)

     (R2, XB2, SX2, YB2, SY2)   etc.
                             75

-------
Group  9:   This data group  is optional and will appear only
if NCFLAG ≠ 0 on the first data card.  In this group, specific
values of NINT  (number of class intervals), XMIN  (beginning
left endpoint) and W  (width of class  intervals) are read into
the program.  As mentioned  previously, if NCFLAG  = 0, the
program will establish class intervals of its own and this
data group is omitted.  The format used  for this  group is
I5,2F10.3.

b.   The Function Card

Whenever changes are made from one run of this program to
another run, they are generally made  from one function
(describing  an output random variable as a function of input
random variables) to another function (with different input
and output random variables).  Hence, it is necessary to
change one card in the program itself, which will be referred
to as the function card.  It appears  as a card in the function
subprogram SF with calling  argument IVAL and common arguments
VAR(I,J) (as well as some other dummy arguments).

VAR(I,J) is a 51 x 10 matrix in which the samples drawn from
the distributions of the input variables are stored for each
pass from 1 to NPASS.  These samples  are stored columnwise,
i.e., column 1 contains the sample from VAR1, etc., until all
of the independent variables are sampled.  Beginning in the
first column after column NIVAR, the dependent variables are
stored until they are exhausted.  After storing the values in
VAR(I,J) in this
way, the NGR0UP values of the output  variable for this pass
are calculated.  In this calculation, the function card,
defined above, is used.

On a single pass of the program, the  values of the output
variable from I = 1 to NGR0UP are calculated.  For a given value
of I, this information is transferred to the subprogram SF as

IVAL.  For this value of IVAL, a value SF of the output
variable is calculated according to the rule specified on the
function card.  Thus, caution should be exercised in construct-
ing the function card so that it reflects the true function
of the input variables in the proper order in which they
appear in VAR(I,J).

 As an example, consider the function

                            3.1*A*B**2
                        Z = ----------
                              C*D+E
 (where * denotes multiplication).  Suppose variables A, C, D,
and E are taken to be independent and B is dependent and
correlated with variable E.  If  the independent variables are
numbered as below:

                            A =  1
                            C =  2
                            D =  3
                            E =  4

then automatically, variable B will be numbered 5 in VAR(I,J),
i.e., column 5 will contain the values of B.  Thus the SF
card would be:

     SF = (3.1*VAR(IVAL,1)*(VAR(IVAL,5)**2))/
          (VAR(IVAL,2)*VAR(IVAL,3)+VAR(IVAL,4))

c.   Example of Overall Input Data

Suppose it is desired to simulate the severity, S, of SO2
emissions from coal-fired electric utilities.  The severity
equation is given by:

                   S = (30.1)(% sulfur)(CC)/h²
where   %  sulfur  =  percent  of  sulfur  in  coal  used
              CC  =  coal consumed in 10⁶ kg/yr
               h  =  stack height in meters

Assume  that  % sulfur  and CC are  to be independent and h  is to
be correlated with coal consumed.  Suppose the type of dis-
tribution for each input variable has been determined along
with the  corresponding  parameters and clips  according to the
information  in the table below.
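Before turning to the table, a quick numerical check of the severity equation may help.  The Python sketch below (not part of the report's FORTRAN; the plant values are invented for illustration) evaluates S = (30.1)(% sulfur)(CC)/h² for one hypothetical plant:

```python
def severity(pct_sulfur, cc, h):
    """SO2 severity: S = 30.1 * (% sulfur) * CC / h**2,
    with CC in 10**6 kg/yr and h (stack height) in meters."""
    return 30.1 * pct_sulfur * cc / (h * h)

# Invented example plant: 2.5% sulfur coal, 500e6 kg/yr, 150 m stack.
s = severity(2.5, 500.0, 150.0)
```

A value of S above 1 would place such a plant in the highest of the three severity ranges tallied by the program.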
 Table C-1.  VARIABLES, DISTRIBUTIONS, PARAMETERS AND CLIPS
          FOR COAL-FIRED ELECTRIC UTILITIES EXAMPLE

  Variable       Type of distribution   Parameters          Clips
  % sulfur - 1   2 (normal)             μ=2.5, σ=1.1        0.025 and 1.0
  CC - 2         1 (Weibull)            A=9.7x10⁻⁴, B=1.0   0.05 and 0.99
  h - 3          4 (log-normal)         ᵃ                   ᵃ

 ᵃNot needed since this variable is dependent and, hence,
  correlated.
Assume that the sample size from which the parameters were
obtained is 224 out of a population of 600.   (This is not a
situation where a simulation is required but  it can serve as
an example.)  Suppose that the program is instructed to set
up its own class intervals and to perform a regression
analysis on the raw data for CC and h.  Finally, assume that
a simulated sample size of 5,000 is desired.  Then the input
data would appear as shown in Table C-2.  In  Table C-2, the
values that should appear on the card(s) are  underlined and
the columns in which they should appear are shown in paren-
theses beside them.
        Table C-2.  SUMMARY OF INPUT DATA BY GROUPS FOR COAL-FIRED
                    ELECTRIC UTILITIES EXAMPLE

  Group
  number   Input dataᵃ
    I      Titles (1-40)   (blank) (41-45)   (blank) (46-50)
    II     224.0 (1-10)   600.0 (11-20)   224 (21-25)
    III    50 (1-5)   100 (6-10)   2 (11-15)   1 (16-20)
    IV     2 (1-5)   1 (6-10)
    V      2.5 (1-10)   1.1 (11-20)   9.7x10⁻⁴ (21-30)   1.0 (31-40)
    VI     0.025 (1-10)   1.0 (11-20)   0.05 (21-30)   0.99 (31-40)
    VII    2 (1-5)   4 (6-10)
    VIII   Raw data for CC, then raw data for hᵇ
    IX     Omitted

 ᵃUnderlined information represents input data which is to be placed
  in columns designated by parentheses.

 ᵇEight per card in format 8E10.3.

2.   DESCRIPTION OF OUTPUT

There are two forms of output from the simulation program:
 (1) printed output, which is divided into three parts; and
 (2) the frequency histogram and cumulative frequency function
of the simulated sample of output values.

a.   The Printed Output

The first item that is printed out is the title read in on
the first card of the input.  The next item printed is the
mean and standard deviation of the output variable.  The
probabilities that the output variable lies in the range
< 0.1, between 0.1 and 1, and > 1, respectively, are subse-
quently printed.  These are followed by the number of plants
 (outcomes) from the sample which are predicted to fall in
each of these three ranges, respectively, and then by the
same numbers for the population.  These are printed as
SNUM(l), SNUM(2), SNUM(3), PNUM(l), PNUM(2), and PNUM(3) for
sample number in range 1, etc.  Finally, a table is printed
out which gives the class intervals, the actual frequency of
the sample in each class interval, and the cumulative fre-
quency function for the right endpoint of the class interval.
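The range bookkeeping described above can be paraphrased as follows (a Python sketch, not the report's FORTRAN; the function name and the sample values are ours):

```python
def range_counts(sev, n_sample, n_population):
    """Probabilities of the output variable falling below 0.1,
    between 0.1 and 1, and above 1, plus the corresponding
    predicted counts for the sample and the population."""
    counts = [0, 0, 0]
    for s in sev:
        if s < 0.1:
            counts[0] += 1
        elif s <= 1.0:
            counts[1] += 1
        else:
            counts[2] += 1
    n = len(sev)
    probs = [c / n for c in counts]
    snum = [p * n_sample for p in probs]       # SNUM(1..3)
    pnum = [p * n_population for p in probs]   # PNUM(1..3)
    return probs, snum, pnum

probs, snum, pnum = range_counts([0.05, 0.5, 0.7, 2.0], 224, 600)
```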

A few special comments are provided regarding the first and
last class intervals.  The first class interval printed has as
its left endpoint the minimum value of the output variable in
the whole sample of size NGR0UP*NPASS.  Its right endpoint is
the minimum value found in the first NGR0UP values of the
output variables, for it is after the first pass that the
program automatically sets up class intervals.  The last
class interval printed has its left endpoint equal to the
maximum value obtained on the first pass from the first
NGR0UP values and its right endpoint is the overall maximum
value for the whole sample.  If the user supplies his own
class intervals and the lowest value supplied is lower than
the actual minimum value calculated by the program, the first
class interval will then look "backwards."  For example,
suppose the user supplied 0 as the beginning class interval
value and the lowest observed value was 0.03 during the
simulation.  The first two lines for the first two class
intervals would then appear something like the following:

 Class interval     Actual frequency    Cumulative frequency
From 0.03 to 0.0           0.0                   0.0
From 0.0  to 0+W            ?                     ?

Likewise, the last class interval could appear "strange" in
a similar situation.  If one wishes to know XMIN and XMAX for
the whole sample, these can be found as indicated above.

b.   The Plots

The output plots from the simulation program are self-
explanatory and will not be further discussed.
Table  C-3.    COMPUTER  LISTINGS  OF  THE SIMULATION  PROGRAMS
      BIT TRAN(16)
      DIMENSION AF(35),CF(35),ITIL(20),EMP(3),PROB(3),
     1SNUM(3),PNUM(3),XE(35),CLIP(10,2),KAF(35)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51),IDCODE(10,2),
     1COR(10,5),XSR(250),YSR(250)
C         THIS PROGRAM IS DESIGNED TO PERFORM A COMPUTER
C         SIMULATION TO OBTAIN VALUES OF ONE RANDOM VARIABLE
C         DEFINED AS A FUNCTION OF OTHER RANDOM VARIABLES.
C         ALL THE RANDOM VARIABLES ARE ASSUMED TO BE
C         CONTINUOUS AND CAN BE ANY ONE OF FOUR DIFFERENT
C         DISTRIBUTIONS:  THE WEIBULL, THE NORMAL, THE GAMMA,
C         OR THE LOG-NORMAL.  FURTHERMORE, THE PROGRAM WILL
C         ACCOUNT FOR CORRELATION BETWEEN CERTAIN PAIRS OF
C         THE INPUT VARIABLES.
      READ(1,100) TRAN,ICOL1,CY1,CY2,XI1,XI2,ANGL
      WRITE(5,500) TRAN,ICOL1,CY1,CY2,XI1,XI2,ANGL
      READ(1,12) (ITIL(I),I=1,20),LT,NCFLAG
      READ(1,13) XSAMP,XPOP,NSAMP
      READ(1,10) NGROUP,NPASS,NIVAR,NDVAR
      SMEAN=0.0
      SSQ=0.0
      XMIN=1.0E18
      XMAX=0.0
      DO 25 I=1,35
   25 AF(I)=0.0
      DO 26 I=1,3
   26 EMP(I)=0.0
      CALL RANDU(9,IY,YFL)
      IX=IY
      READ(1,10) (ICODE(I),I=1,NIVAR)
      READ(1,11) ((PAR(I,J),J=1,2),I=1,NIVAR)
   11 FORMAT(8E10.3)
      READ(1,11) ((CLIP(I,J),J=1,2),I=1,NIVAR)
      IF(NDVAR.EQ.0) GO TO 8
      READ(1,10) ((IDCODE(I,J),J=1,2),I=1,NDVAR)
      IF(LT.EQ.0) GO TO 60
      DO 70 I=1,NDVAR
      IROW=I
      CALL SR(NSAMP,IROW,LT)
   70 CONTINUE
      GO TO 8
   60 DO 40 I=1,NDVAR
      IROW=I
      READ(1,11) (XSR(K),K=1,NSAMP)
      READ(1,11) (YSR(K),K=1,NSAMP)
      KO=IDCODE(I,2)
      IF(KO.EQ.4) GO TO 52
      GO TO 50
   52 DO 51 J=1,NSAMP
      XSR(J)=ALOG(XSR(J))
      YSR(J)=ALOG(YSR(J))
   51 CONTINUE
   50 CALL SR(NSAMP,IROW,LT)
   40 CONTINUE
    8 IPASS=1
   22 DO 3 I=1,NIVAR
      ICOL=I
      KK=ICODE(I)
      GO TO (4,5,6,7),KK
    4 C1=CLIP(I,1)
      C2=CLIP(I,2)
      CALL WSAMP(NGROUP,IX,ICOL,C1,C2)
      GO TO 3
    5 C1=CLIP(I,1)
      C2=CLIP(I,2)
      CALL NORSAM(NGROUP,IX,ICOL,C1,C2)
      GO TO 3
    6 CALL GAMSAM(NGROUP,IX,ICOL)
      GO TO 3
    7 C1=CLIP
      CALL FDPLOT(NCINT,SMEAN,SDEV,NINT,W,XE,XMIN,XMAX,KAF,ITIL,XT)
   10 FORMAT(16I5)
   12 FORMAT(20A2,2I5)
   13 FORMAT(2F10.3,I5)
  100 FORMAT(16L1,I2,1X,5E10.3)
  500 FORMAT(1H0,16L1,I2,1X,5E15.7)
      END
      SUBROUTINE SR(MM,IROW,LT)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51),IDCODE(10,2),
     1COR(10,5),XSR(250),YSR(250)
C         THIS SUBROUTINE DOES EITHER A SIMPLE REGRESSION ON
C         THE INPUT RAW DATA FOR TWO CORRELATED VARIABLES OR
C         READS IN THE VALUES OF R, XB, SX, YB, AND SY AND
C         CALCULATES THE REGRESSION LINE FROM THESE VALUES.
C         IN EITHER EVENT, THE VALUES OF THE INTERCEPT A,
C         SLOPE B, STANDARD ERROR IN REGRESSION LINE SE, AND XB
C         AND SX ARE STORED IN THE IROW ROW OF AN ARRAY
C         COR(I,J) TO BE USED LATER IN THE SUBROUTINES WHICH
C         DRAW SAMPLES FROM THE DEPENDENT VARIABLES.
      IF(LT.NE.0) GO TO 5
      FN=FLOAT(MM)
      S1=0.0
      S2=0.0
      DO 1 I=1,MM
      S1=S1+XSR(I)
      S2=S2+YSR(I)
    1 CONTINUE
      XB=S1/FN
      YB=S2/FN
      S1=0.0
      S2=0.0
      S=0.0
      DO 2 I=1,MM
      S1=S1+(XSR(I)-XB)**2
      S2=S2+(YSR(I)-YB)**2
      S=S+(XSR(I)-XB)*(YSR(I)-YB)
    2 CONTINUE
      COR(IROW,1)=A
      COR(IROW,2)=B
      COR(IROW,3)=SE
      COR(IROW,4)=XB
      COR(IROW,5)=SX
      RETURN
      END
      SUBROUTINE WSAMP(NGROUP,IX,ICOL,C1,C2)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51)
C         THIS SUBROUTINE DRAWS A SAMPLE OF SIZE NGROUP FROM
C         THE WEIBULL DISTRIBUTION FUNCTION.  THE PARAMETERS
C         A AND B TO BE USED FOR THE WEIBULL ARE OBTAINED
C         FROM THE ARRAY PAR(I,J).  THE DISTRIBUTION HAS
C         LOWER AND UPPER CLIPS C1 AND C2 RESPECTIVELY
C         (SUPPLIED FROM THE MAINLINE).  THE SAMPLE IS
C         STORED IN THE ICOL COLUMN OF THE ARRAY VAR(I,J).
C         THE METHOD USED IN SAMPLING IS THE DIRECT APPROACH.
      A=PAR(ICOL,1)
      B=PAR(ICOL,2)
      P=1.0/B
      DO 1 I=1,NGROUP
    2 CALL RANDU(IX,IY,R)
      IX=IY
      IF(R.EQ.1.0) GO TO 2
      R=(C2-C1)*R+C1
      ARG=1.0/(1.0-R)
      VAR(I,ICOL)=(ALOG(ARG)/A)**P
    1 CONTINUE
      RETURN
      END
      SUBROUTINE NORSAM(NGROUP,IX,ICOL,C1,C2)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51)
C         THIS SUBROUTINE DOES PRECISELY THE SAME THING AS
C         THE WSAMP SUBROUTINE EXCEPT THAT IT DRAWS THE
C         GIVEN SAMPLE FROM THE NORMAL DISTRIBUTION INSTEAD
C         OF THE WEIBULL.  THE DIRECT APPROACH IS ALSO USED
C         HERE.
      XBAR=PAR(ICOL,1)
      SD=PAR(ICOL,2)
      TPI=2.0*3.14159
      DO 1 I=1,NGROUP
    2 CALL RANDU(IX,IY,R1)
      IX=IY
      IF(R1.EQ.0.0) GO TO 2
      ARG=1.0/(R1*R1)
      ARG=ALOG(ARG)
      TX=SQRT(ARG)
      CALL RANDU(IX,IY,R2)
      IX=IY
      R2=R2*TPI
      VAR(I,ICOL)=XBAR+SD*TX*SIN(R2)
      IF((VAR(I,ICOL).LT.C1).OR.(VAR(I,ICOL).GT.C2)) GO TO 2
      IF(ICODE(ICOL).NE.2) GO TO 1
      IF(VAR(I,ICOL).LT.0.0) GO TO 2
    1 CONTINUE
      RETURN
      END
      SUBROUTINE LOGSAM(NGROUP,IX,ICOL,C1,C2)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51)
C         THIS SUBROUTINE SAMPLES FROM THE LOG-NORMAL
C         DISTRIBUTION IN PRECISELY THE SAME MANNER AS
C         NORSAM AND WSAMP DO FROM THE NORMAL AND WEIBULL
C         RESPECTIVELY.  IT ALSO USES THE DIRECT APPROACH.
      CALL NORSAM(NGROUP,IX,ICOL,C1,C2)
      DO 1 I=1,NGROUP
      VAR(I,ICOL)=EXP(VAR(I,ICOL))
    1 CONTINUE
      RETURN
      END
      SUBROUTINE GAMSAM(NGROUP,IX,ICOL)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51)
C         THIS SUBROUTINE SAMPLES FROM THE GAMMA DISTRIBUTION
C         IN THE SAME MANNER AS THE ABOVE SUBROUTINES SAMPLE
C         FROM THEIR RESPECTIVE DISTRIBUTIONS.  THE ONLY
C         DIFFERENCE IS THAT THIS SUBROUTINE USES THE
C         REJECTION METHOD FOR SAMPLING FROM THE GAMMA.
      ALPHA=PAR(ICOL,1)
      BETA=PAR(ICOL,2)
      XU=BETA*(ALPHA+5.0*SQRT(ALPHA))
      AM1=ALPHA-1.0
      CALL GAMMA(ALPHA,GA,I)
      CON=GA*(BETA**ALPHA)
      YU=((AM1)**AM1)/(GA*BETA*EXP(AM1))
      DO 1 I=1,NGROUP
    2 CALL RANDU(IX,IY,XVAL)
      IX=IY
      CALL RANDU(IX,IY,YVAL)
      IX=IY
      XVAL=XVAL*XU
      YVAL=YVAL*YU
      FXVAL=GDF(XVAL
      FUNCTION GDF(X,ALPHA,BETA,CON)
C         THIS SUBPROGRAM EVALUATES THE PROBABILITY DENSITY
C         FUNCTION FOR THE GAMMA DISTRIBUTION AT THE POINT
C         X.  THE GAMMA PARAMETERS ARE ALPHA AND BETA AND
C         THE CONSTANT CON IS GIVEN BY CON=GA*(BETA**ALPHA)
      SUBROUTINE LCSAMP(NGROUP,NPASS,IX,ICOL,IVAR,IDVAR)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51),IDCODE(10,2),
     1COR(10,5)
C         THIS SUBROUTINE DRAWS A SAMPLE OF SIZE NGROUP FOR
C         A DEPENDENT INPUT VARIABLE FROM THE LOG-NORMAL
C         DISTRIBUTION.  IVAR IS THE SUBSCRIPT OF THE
C         INDEPENDENT VARIABLE WITH WHICH THE GIVEN DEPENDENT
C         VARIABLE IS CORRELATED.  IDVAR IS THE NUMBER OF
C         THE DEPENDENT VARIABLE.  ICOL IS THE COLUMN OF THE
C         STORAGE ARRAY VAR(I,J) IN WHICH THE SAMPLE IS
C         STORED.  THE METHOD USED IS THE ONE DISCUSSED IN
C         THE DESCRIPTION OF THIS PROGRAM.
      YINT=COR(IDVAR,1)
      SLOPE=COR(IDVAR,2)
      SE=COR(IDVAR,3)
      XBARS=COR(IDVAR,4)
      SDS=COR(IDVAR,5)
      FN=NGROUP*NPASS
      TPI=2.0*3.14159
      DO 30 I=1,NGROUP
      AVAR=ALOG(VAR(I=EXP( VAR d. ICOL))
   30 CONTINUE
      RETURN
      END
      SUBROUTINE NCSAMP(NGROUP,NPASS,IX,ICOL,IVAR,IDVAR)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51),IDCODE(10,2),
     1COR(10,5)
C         THIS SUBROUTINE DOES THE SAME THING AS LCSAMP
C         ABOVE EXCEPT THAT IT DRAWS THE SAMPLE FOR THE
C         DEPENDENT VARIABLE FROM A NORMAL DISTRIBUTION.
      YINT=COR(IDVAR,1)
      SLOPE=COR(IDVAR,2)
      SE=COR(IDVAR,3)
      XBARS=COR(IDVAR,4)
      SDS=COR(IDVAR,5)
      FN=NGROUP*NPASS
      TPI=2.0*3.14159
      DO 30 I=1,NGROUP
      XBAR=YINT+(SLOPE*VAR(I,IVAR))
      T1=((VAR(I,IVAR)-XBARS)*(VAR(I,IVAR)-XBARS))/(FN*SDS*SDS)
      SD=SE*(1.0+1./FN+T1)
   31 CALL RANDU(IX,IY,R1)
      IX=IY
      IF(R1.EQ.0.0) GO TO 31
      ARG=1.0/(R1*R1)
      ARG=ALOG(ARG)
      TX=SQRT(ARG)
      CALL RANDU(IX,IY,R2)
      IX=IY
      R2=R2*TPI
      VAR(I,ICOL)=XBAR+(SD*TX*SIN(R2))
      IF(VAR(I,ICOL).LT.0.0) GO TO 31
   30 CONTINUE
      RETURN
      END
      SUBROUTINE RANDU(IX,IY,YFL)
C         THIS SUBROUTINE GENERATES A PSEUDO-RANDOM NUMBER
C         BETWEEN 0 AND 1 AND RETURNS IT IN THE PARAMETER
C         YFL.  IT ALSO RETURNS A VALUE OF IY WHICH IS TO BE
C         USED AS IX IN THE NEXT PASS THROUGH THIS SUBROUTINE.
      IY=IX*899
      IF(IY) 5,6,6
    5 IY=IY+32767+1
    6 YFL=IY
      YFL=YFL/32767.
      RETURN
      END
      SUBROUTINE BACKT(NGROUP,ICOL1,CY1,CY2,TRAN,XI1,XI2,ANGL)
C         THIS SUBROUTINE PERFORMS THE
C         TRANSFORMATION BACK TO ORIGINAL
C         DATA VALUES AFTER LINEAR
C         TRANSFORMATION PER PARA 19.8 OF
C         STATISTICAL THEORY WITH ENGINEERING
C         APPLICATIONS, HALD, WILEY, 1967.
C         THE TRANSFORMATION MAKES THE
C         CORRELATION COEFFICIENT ZERO FOR
C         DATA PAIRS OF PARTIALLY CORRELATED
C         DATA WITH MEANS OF ZERO.  A SECOND
C         TRANSFORMATION IS REQUIRED TO
C         ADD A CONSTANT TO THE DATA TO
C         MAKE IT POSITIVE SO THAT LOGS
C         CAN BE TAKEN.
C
C         WRITTEN BY LEE MOTE - 3/10/76
C
C         XI1 IS MEAN OF X
C         XI2 IS MEAN OF Y
C         ANGL IS ANGLE OF ROTATION IN
C         RADIANS
C         Y1 IS TRANSFORMED ARRAY FOR
C         INDEPENDENT VARIABLE
C         Y2 IS TRANSFORMED ARRAY FOR
C         DEPENDENT VARIABLE
C         X1 IS RESTORED VALUES OF
C         INDEPENDENT VARIABLE
C         X2 IS RESTORED VALUES OF
C         DEPENDENT VARIABLE
      BIT TRAN(16)
      DIMENSION X1(51),X2(51),Y1(51),Y2(51)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51),IDCODE(10,2),
     1COR(10,5),XSR(250),YSR(250)
      N=NGROUP
      ICOL2=ICOL1+1
      DO 1 I=1,N
      Y1(I)=VAR(I,ICOL1)-CY1
      Y2(I)=VAR(I,ICOL2)-CY2
      IF(TRAN(2)) Y1(I)=ALOG(VAR(I,ICOL1))-CY1
      IF(TRAN(3)) Y2(I)=ALOG(VAR(I,ICOL2))-CY2
    1 CONTINUE
      DO 2 I=1,N
      X1(I)=XI1+Y1(I)*COS(ANGL)-Y2(I)*SIN(ANGL)
      X2(I)=XI2+Y1(I)*SIN(ANGL)+Y2(I)*COS(ANGL)
      IF(TRAN(2)) X1(I)=EXP(X1(I))
      IF(TRAN(3)) X2(I)=EXP(X2(I))
    2 CONTINUE
C
      DO 3 I=1,N
      VAR(I,ICOL1)=X1(I)
3     VAR
      FUNCTION SF(IVAL)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51)
C         THIS SUBPROGRAM EVALUATES THE OUTPUT VARIABLE SF
C         AT A PARTICULAR VALUE IVAL ASSOCIATED WITH THE
C         INPUT VARIABLE ARRAY VAR(I,J).
      SF=(30.1*VAR(IVAL,1)*VAR(IVAL,2))/(VAR(IVAL,3)*VAR(IVAL,3))
      RETURN
      END
      SUBROUTINE SEVCAL(NGROUP,XMIN,XMAX,SMEAN,SSQ)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51)
C        THIS SUBROUTINE CALCULATES THE NGROUP VALUES OF
C        SEVERITY (OR OUTPUT VARIABLE) AND STORES THEM IN
C        THE ARRAY SEV(I) FOR LATER USE.  IT ALSO KEEPS
C        TRACK OF A RUNNING SUM OF SEVERITIES, SUM OF THE
C        SQUARES OF SEVERITIES, AND AN OVERALL XMIN AND
C        XMAX VALUE FOR SEVERITY.
      DO 1 I=1,NGROUP
      SEV(I)=SF(I)
      SMEAN=SMEAN+SEV(I)
      SSQ=SSQ+(SEV(I)**2)
      IF(SEV(I).LT.XMIN) XMIN=SEV(I)
      IF(SEV(I).GT.XMAX) XMAX=SEV(I)
    1 CONTINUE
      RETURN
      END
      SUBROUTINE CINT(NCINT,XMIN,XMAX,NINT,XE,W,NCFLAG)
      DIMENSION XE(35)
C         THIS SUBROUTINE ESTABLISHES THE CLASS INTERVALS TO
C         BE USED IN THE PROGRAM.  THIS SUBROUTINE IS
C         BRANCHED TO AFTER THE FIRST PASS OF NGROUP VALUES
C         OF THE OUTPUT VARIABLE SEV ARE CALCULATED.  IT
C         THEN USES THESE VALUES AND STURGES' RULE FOR
C         SETTING UP THE CLASS INTERVALS OR IT READS THE
C         NECESSARY DATA AND SETS UP THE CLASS INTERVALS
C         THAT THE USER DESIRES TO HAVE.
      IF(NCFLAG.NE.0) GO TO 2
      XN=FLOAT(NCINT)
      NINT=1.5+3.3*ALOG10(XN)
      IF(XMAX.GT.50.0) XMAX=50.0
      W=(XMAX-XMIN)/FLOAT(NINT)
      W=W+.0001*W
      GO TO 3
    2 READ(1,5) NINT,XMIN,W
    5 FORMAT(I5,2F10.3)
    3 XE(1)=XMIN
      K=NINT+1
      DO 1 I=2,K
      XE(I)=XE(I-1)+W
    1 CONTINUE
      RETURN
      END
      SUBROUTINE AFREQ(NGROUP,NINT,XE,AF,EMP)
      DIMENSION XE(35),AF(35),EMP(3)
      COMMON ICODE(10),VAR(51,10),PAR(10,2),SEV(51)
C         THIS SUBROUTINE ACCEPTS NGROUP VALUES OF SEVERITY
C         SEV (OR OUTPUT VARIABLE) AND SEPARATES THEM INTO
C         THE NINT CLASS INTERVALS WITH ENDPOINTS IN THE
C         ARRAY XE.  THE ACTUAL FREQUENCIES ARE COUNTED IN
C         THE REAL ARRAY AF.  FURTHER, THE NUMBER OF
C         OBSERVED VALUES OF SEV BELOW .1, BETWEEN .1 AND 1,
C         AND ABOVE 1 ARE COUNTED AND STORED IN THE ARRAY
C         EMP(I), I=1,2,3.
      DO 1 J=1,NGROUP
      IF(SEV(J).GE.XE(1)) GO TO 2
      AF(1)=AF(1)+1.0
      GO TO 1
    2 DO 3 I=1,NINT
      IF((SEV(J).GE.XE(I)).AND.(SEV(J).LT.XE(I+1))) GO TO 4
      GO TO 3
    4 AF(I+1)=AF(I+1)+1.0
      GO TO 1
    3 CONTINUE
      AF(NINT+2)=AF(NINT+2)+1.0
      SUBROUTINE OUTPUT(SMEAN,SDEV,PROB,SNUM,PNUM,AF,CF,XSAMP,XPOP,
     1NINT,XE,ITIL,XMIN,XMAX)
      DIMENSION PROB(3),SNUM(3),PNUM(3),AF(35),CF(35),XE(35),ITIL(20)
C         THIS SUBROUTINE PRINTS THE OUTPUT OF THE PROGRAM
C         INCLUDING MEAN, STANDARD DEVIATION, ETC.
      WRITE(5,10) (ITIL(I),I=1,20)
   10 FORMAT(1H1,40X,20A2)
      WRITE(5,1) SMEAN,SDEV
      WRITE(5,2) PROB(1),PROB(2),PROB(3)
      WRITE(5,3) XSAMP,SNUM(1),SNUM(2),SNUM(3)
      WRITE(5,4) XPOP,PNUM(1),PNUM(2),PNUM(3)
      WRITE(5,5)
      K=NINT+1
      XLE=XMIN
      WRITE(5,7) XLE,XE(1),AF(1),CF(1)
      DO 6 I=2,K
    6 WRITE(5,7) XE(I-1),XE(I),AF(I),CF(I)
      XRE=XMAX
      WRITE(5,7) XE(NINT+1),XRE,AF(NINT+2),CF(NINT+2)
    1 FORMAT(
      SUBROUTINE  C1PLOT =CFREQ(I)*FN
      CALL PLOTS(IDUM,IDUM,IPLOT)
      CALL PLOT(1.0,0.0,-2)
      XINT=FLOAT(NINT)+1.0
      CALL AXIS(0.0,0.0,ITIL,-LT,XINT,0.0,XE(1),W)
      FACT=.9
      CALL FACTOR(FACT)
      DELTAV=FLOAT(N/10)+1.
      CALL AXIS(0.,0.,'NUMBER OF PLANTS',16,10.,90.,0.,DELTAV)
      CALL FACTOR(1.0)
      CALL PLOT(0.0,10.0*FACT,3)
      CALL PLOT(XINT,10.0*FACT,2)
      XE(NINT+2)=XE(1)
      XE(NINT+3)=W
      CFREQ(NINT+2)=0.0
      CFREQ(NINT+3)=DELTAV/FACT
      CALL LINE(XE,CFREQ,NINT+1,1,1,11)
      CALL SYMBOL(XINT,4.0,.21,'SAMPLE SIZE = ',0.0,14)
      XN=FLOAT(N)
      CALL NUMBER(999.0,999.0,.21,XN,0.0,-1)
      CALL SYMBOL
      NINT=NINT-1
      00 3 1=1.20
    3 XE(I)=X(I)
      CALL PLOTS(IDUM,IDUM,IPLOT)
      CALL PLOT(1.0,0.0.-2)
      XINT=FLOAT( NINT) +1.0
      CALL AXIS (0.0. 0.0, ITIL. -LT.XINT,O.O.XE(1),W)
      KK=MNT+2
      DO 1 Ksl.KK
      M=KOUNT(K)
    1 YK(K)=FI.OAT(M)
      rK(NINT+3)=0.0
      CALL SCALE(YKiB.Ol)
      YK(NINT+H)=YK(NINT+5)
      CALL AXIS(-1.0.0.U, 'NUMBER OF OBSERVATIONS' t 22, 8.0 . 90 . 0 t
     1YKININT+3) .YK YK(I)=YK(I + 1)
      X£(NINr+2)=XE(l>-w/2.0
      YK(MlNT+2'
      YKIMNT*
      YVAL=YH(l)/YK(NINT+3)
      CALL PLOTC.S.VVAL^)
      CALL LINEIXEiYKtNINT+l.l.l.ll)
      CALL SYMBOL(XINT,4.0,.21,'SAMPLE SIZE =  ',0.0,14)
      XN=FLOAT(N)
      CALL NUMBER(999.0,999.0,.21,XN,0.0,-1)
      CALL SYMBOL(XINT,3.5,.21,'MIN. VALUE =  ',0.0,13)
      CALL NUMBER(999.0,999.0,.21,XMIN,0.0,2)
      CALL SYMBOL(XINT,3.0,.21,'MAX. VALUE =  ',0.0,13)
      CALL NUMBER(999.0,999.0,.21,XMAX,0.0,2)
      CALL SYMBOL(XINT,2.5,.21,'MEAN = ',0.0,7)
      CALL NUMBER(999.0,999.0,.21,XBAR,0.0,2)
      CALL SYMBOL(XINT,2.0,.21,'STD. DEV. = ',0.0,12)
      CALL NUMBER(999.0,999.0,.21,SD,0.0,2)
      CALL PLOT(0.0,0.0,-3)
      CALL PLOT(30.0,0.0,-3)
      NINT=NINT+1
      RETURN
      END
                         APPENDIX D
              THE GOODNESS-OF-FIT PROGRAM (Refs. 5, 7-12)

The goodness-of-fit program is designed to take sample data
from some population with an unknown distribution and "fit"
the data to various continuous distributions by one of the
standard procedures of statistics.  The program will print
out various sample parameters, such as the mean, standard
deviation, coefficient of skewness, and measure of kurtosis.
It will also print out the corresponding parameters for the
theoretical continuous distributions which are fitted to the
data.  Finally, using Sturges' rule to set up class intervals,
the program will calculate the chi-square value to be used in
a chi-square test for goodness-of-fit as discussed earlier.
In addition, using the right endpoints of the class intervals
as comparison points, the program calculates the residual sum
of the squares of the difference in the theoretical cumula-
tive distribution and the actual cumulative distribution.
All of the calculations indicated above can be used to deter-
mine how well the given theoretical distribution fits the
actual data.
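The two fit measures just described can be sketched in Python (an illustration under our own naming, not the program itself; the rounding in sturges_intervals follows the 1.5 + 3.3 log₁₀ n form used in the listings):

```python
import math

def sturges_intervals(n):
    """Number of class intervals by Sturges' rule."""
    return int(1.5 + 3.3 * math.log10(n))

def chi_square(observed, expected):
    """Chi-square statistic over paired class-interval frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def residual_ss(theoretical_cdf, empirical_cdf):
    """Residual sum of squares between the theoretical and actual
    cumulative distributions at the class-interval right endpoints."""
    return sum((t - a) ** 2 for t, a in zip(theoretical_cdf, empirical_cdf))
```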
⁷Mendenhall, W., and R. L. Scheaffer.  Mathematical Statistics
 with Applications.  North Scituate, Duxbury Press, 1973.
⁸Walpole, R. E., and R. H. Myers.  Probability and Statistics
 for Engineers and Scientists.  New York, The Macmillan Co.,
 1972.
⁹Siegel, S.  Nonparametric Statistics.  New York, McGraw-Hill
 Book Co., 1956.
¹⁰Duncan, A. J.  Quality Control and Industrial Statistics.
  Chicago, Richard D. Irwin, Inc., 1952.
¹¹Cramer, H.  Mathematical Methods of Statistics.  Princeton,
  Princeton University Press, 1946.  575 p.
¹²Ott, W. R., and D. T. Mage.  Random Sampling as an
  Inexpensive Means for Measuring Average Annual Air
  Pollutant Concentrations in Urban Areas.  (Presented at
  the 68th Annual Meeting of the Air Pollution Control
  Association, Boston.  June 15-20, 1975.)
1.   THEORETICAL DISTRIBUTIONS USED AND THE METHODOLOGY
     UTILIZED FOR FITTING THEM TO THE DATA
There are four distributions which are used in the goodness-of-
fit program:  the normal distribution, the Weibull distribu-
tion, the gamma distribution, and the log-normal distribution.
These represent a wide range of continuous distributions and
are, therefore, deemed sufficient to cover most  (if not all)
of the data encountered.

Three methods are routinely used in statistics for fitting
data to a continuous distribution:  the method of maximum
likelihood, the method of moments, and the method of least
squares.  The theoretical aspects of these methods are dis-
cussed in the statistics texts listed in the references at
the end of this report.  The applications of these methods
to the specific distributions under consideration are dis-
cussed below.

First, the data are fitted to the Weibull distribution by
two methods:  the method of maximum likelihood and the method
of least squares.  For the method of maximum likelihood, the
following system of equations (in the parameters a and b)
needs to be solved:
          n/a - Σ Xᵢᵇ = 0                                       (D-1)

          n/b + Σ log Xᵢ - a Σ Xᵢᵇ log Xᵢ = 0                   (D-2)
where n = sample size, the sums run from i = 1 to n, and the
{Xᵢ} are the sample data values.  Eliminating "a" from the
system we obtain the following equation in b alone:

     b [ (Σ Xᵢᵇ log Xᵢ)/(Σ Xᵢᵇ) - (1/n) Σ log Xᵢ ] - 1 = 0      (D-3)
The program uses  a  numerical  scheme called  the method of
false position  to solve the above equation  for b.  Then it
obtains  "a" by  a  substitution of b  into Equation  D-l.  Thus,
the necessary parameters, a and b,  in the Weibull distribu-
tion function are obtained.
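The root-finding step can be illustrated as below (a Python sketch with our own naming; the sample data are invented, and g(b) is the left side of Equation D-3):

```python
import math

def false_position(g, lo, hi, tol=1e-10, max_iter=200):
    """Method of false position (regula falsi) for a root of g
    bracketed by [lo, hi]."""
    glo, ghi = g(lo), g(hi)
    assert glo * ghi < 0.0, "root must be bracketed"
    for _ in range(max_iter):
        # Intersection of the secant through the bracket with the axis.
        mid = hi - ghi * (hi - lo) / (ghi - glo)
        gmid = g(mid)
        if abs(gmid) < tol:
            return mid
        if glo * gmid < 0.0:
            hi, ghi = mid, gmid
        else:
            lo, glo = mid, gmid
    return mid

def weibull_g(data):
    """Left side of the Weibull likelihood equation D-3, as a
    function of the shape parameter b."""
    n = len(data)
    logs = [math.log(x) for x in data]
    def g(b):
        xb = [x ** b for x in data]
        return b * (sum(w * l for w, l in zip(xb, logs)) / sum(xb)
                    - sum(logs) / n) - 1.0
    return g

b = false_position(weibull_g([1.0, 2.0, 3.0]), 1.0, 10.0)
```

Once b is found this way, "a" follows directly from Equation D-1 as n divided by the sum of the Xᵢᵇ.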

For the  method  of least squares, the right  endpoints of the
class intervals are taken as  the x-values and their cumulative
distribution  values are taken as the y-values to  be trans-
formed by the usual double-log  transformation and used in the
method of least squares for obtaining "a" and b.  Since the
cumulative distribution value for the right endpoint of the
last class interval is 1.0 and this value cannot be allowed
in the double-log transformation, log(log(1/(1 - y))), the
y-value for the last class interval is set at 0.999.
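This least-squares fit can be sketched as follows (Python, with our own function name; the x-values and CDF values below are generated for illustration, not taken from the report):

```python
import math

def weibull_least_squares(xs, ys):
    """Fit a and b of the Weibull CDF F(x) = 1 - exp(-a*x**b) by
    least squares on the double-log transform
    log(log(1/(1 - y))) = log a + b*log x."""
    ys = [min(y, 0.999) for y in ys]   # cap y = 1.0 at 0.999 as in the text
    u = [math.log(x) for x in xs]
    v = [math.log(math.log(1.0 / (1.0 - y))) for y in ys]
    n = len(xs)
    ub, vb = sum(u) / n, sum(v) / n
    b = (sum((ui - ub) * (vi - vb) for ui, vi in zip(u, v))
         / sum((ui - ub) ** 2 for ui in u))
    a = math.exp(vb - b * ub)
    return a, b

# Points lying exactly on a Weibull CDF with a = 2, b = 1.5.
xs = [0.2, 0.4, 0.6]
ys = [1.0 - math.exp(-2.0 * x ** 1.5) for x in xs]
a, b = weibull_least_squares(xs, ys)
```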

The next fit  considered is the  normal distribution.  The
sample mean, X̄, and the (unbiased) sample standard deviation
are taken as the values for μ and σ in the normal distribution
function.  This is "almost" a maximum likelihood fit to the
normal.  Actually, the maximum likelihood estimators of μ and
σ are the sample mean and the biased sample standard devia-
tion.  However, the unbiased standard deviation should work
as well or better.

The next distribution used is the gamma distribution.  The
data are fitted to this distribution by the method of moments
(since this method is easy to apply to the gamma).  The method
of moments yields the system of equations below (in the param-
eters α and β) to be solved:

               αβ = X̄ = (Σ Xᵢ)/n                           (D-4)

               αβ² = Σ (Xᵢ - X̄)²/n                         (D-5)
               (the biased variance is used in the
               method of moments)

Squaring the first equation and dividing it by the second,
one obtains:

          α = X̄²/(Σ (Xᵢ - X̄)²/n) = n X̄²/Σ (Xᵢ - X̄)²        (D-6)

The program uses the above equation to obtain α directly and
subsequently substitutes this value into the first equation to
obtain β = X̄/α directly.
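Equations D-4 through D-6 amount to only a few lines of code.  A Python sketch (our own naming; the sample is invented for illustration):

```python
def gamma_moments(data):
    """Method-of-moments fit of the gamma parameters alpha and beta
    (Equations D-4 to D-6), using the biased sample variance."""
    n = len(data)
    xbar = sum(data) / n
    var_b = sum((x - xbar) ** 2 for x in data) / n   # biased variance (D-5)
    alpha = xbar * xbar / var_b                      # Equation D-6
    beta = xbar / alpha                              # from D-4
    return alpha, beta

alpha, beta = gamma_moments([1.0, 2.0, 3.0, 4.0])
```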
Finally, consider fitting the log-normal distribution to the
data.  If the data are to be log-normally distributed, the
natural logarithms of the data points should be approximately
normal.  Hence, to perform a log-normal fit to the data, the
logarithm of each data point is obtained and used as data for
which to perform a maximum likelihood fit to the normal as
described earlier.  That is, the mean of the sample of log
values is used as μ and the unbiased sample standard deviation
of the log values is used as σ for the normal distribution.

The above remarks conclude the discussion of the theoretical
distributions used and the methods for fitting the data to
each of them individually.

2.   FORM AND DESCRIPTION OF OUTPUT
The first thing printed out is a title describing the data
which are to be analyzed.  The sample statistics are then
printed out, including X̄, standard deviation, m₃ (third
central moment), g₁ (coefficient of skewness), etc.  The
various distributions are  subsequently  fitted one by one in
the manner described above and the information described in
the introduction to this section is printed out for each of
the fits.  The  first fit is the Weibull Maximum Likelihood
Fit and the last one is the Log-Normal Fit.  In the section
for each fit, the class intervals and the corresponding
theoretical frequency and  actual frequency of the data in
these class intervals are  printed out.  In calculating the
value of chi-square, the program automatically pools frequency
classes on the  upper and lower tails until they obey "the rule
of 5" for the chi-square test.  In calculating the number of
degrees of freedom, the program automatically reduces the
number of class intervals  to take into account the pooling of
the class intervals described above.
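The tail pooling can be paraphrased as follows (a Python sketch, not the program's own code; as described above, only the first and last intervals are pooled inward):

```python
def pool_tails(theoretical, actual, min_freq=5.0):
    """Pool first/last class intervals inward until each tail's
    theoretical frequency satisfies "the rule of 5"."""
    th, ac = list(theoretical), list(actual)
    while len(th) > 1 and th[0] < min_freq:    # pool the lower tail
        th[1] += th[0]; ac[1] += ac[0]
        th.pop(0); ac.pop(0)
    while len(th) > 1 and th[-1] < min_freq:   # pool the upper tail
        th[-2] += th[-1]; ac[-2] += ac[-1]
        th.pop(); ac.pop()
    return th, ac
```

Applied to the frequencies of Table D-1 below, this merges only the last two intervals (2.4 + 2.8 = 5.2) and leaves interval 7 (theoretical frequency 4.3) alone, just as the text notes.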
One should keep in mind, however, that the program does
nothing about a class interval with a theoretical frequency
less than five  unless it occurs as the first or last class
interval.  In that case, the program simply pools it with
the class interval below or above it until "the rule of 5" is
obeyed.  Since  most distributions (including the theoretical
ones being discussed) will  generally not have small frequency
class intervals in their center (except for bimodal ones),
this procedure  will for the most part take care of any neces-
sary reduction  to obey "the rule of 5."  However, if more
reduction is deemed necessary, this can easily be done by
hand,  since the class intervals and the frequencies will be
printed out in  the output.  As an example of the above, con-
sider the following table of class intervals and frequencies:

-------
       Table D-1.  THEORETICAL AND ACTUAL FREQUENCIES
                 FOR VARIOUS CLASS INTERVALS

      Class       Theoretical       Actual
      interval    frequency         frequency
         1           82.1              84
         2           53.8              57
         3           35.6              34
         4           22.2              14
         5           13.2              14
         6            7.7               6
         7            4.3               7
         8            2.4               5
         9            2.8               3
The program will pool the last two class intervals producing
eight class intervals and, therefore, 8-3 = 5 degrees of
freedom.  Class interval 7 does not obey "the rule of 5"
either.  However, since it is close to five, the error
produced in leaving it alone and adding one more degree of
freedom to the test is small.  Furthermore, if it were pooled
with class interval 6 above it, the resulting value of chi-
square would be less but so also would the number of degrees
of freedom and the final conclusion would be nearly the same.
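
The tail-pooling behavior described above can be sketched as follows
(a Python sketch of the logic, not the report's FORTRAN program, using
the frequencies of Table D-1):

```python
def pool_tails(theoretical, actual, minimum=5.0):
    """Pool class intervals at the upper and lower tails until the
    pooled tail intervals have theoretical frequency >= minimum
    ("the rule of 5"); interior intervals are left alone."""
    tf, af = list(theoretical), list(actual)
    # Upper tail: merge the last interval into its neighbor while small.
    while len(tf) > 1 and tf[-1] < minimum:
        tf[-2] += tf[-1]; af[-2] += af[-1]
        tf.pop(); af.pop()
    # Lower tail: same procedure, from the front.
    while len(tf) > 1 and tf[0] < minimum:
        tf[1] += tf[0]; af[1] += af[0]
        tf.pop(0); af.pop(0)
    return tf, af

def chi_square(tf, af):
    # Chi-square statistic over the (pooled) class intervals.
    return sum((t - a) ** 2 / t for t, a in zip(tf, af))

# The frequencies of Table D-1: classes 8 and 9 pool to 5.2, leaving
# eight classes; class 7 (4.3) is interior and is left alone.
theo = [82.1, 53.8, 35.6, 22.2, 13.2, 7.7, 4.3, 2.4, 2.8]
act  = [84, 57, 34, 14, 14, 6, 7, 5, 3]
tf, af = pool_tails(theo, act)
ndf = len(tf) - 3   # pooled classes, minus one, minus two fitted parameters
```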
3.   FORM OF INPUT
The input to the program is divided into two parts.  The first
part consists of a single card containing three pieces of
information about the data to be analyzed, namely:

N = sample size (in cols 1-5, integer and right justified)

ITIL(I), I=1,30 = title of the data (in cols 6-65, anywhere)

-------
NFLAG = variable telling whether this is the last data set to
     be analyzed by the program on this run.  If NFLAG = 0
     (i.e., the card is blank in col 80), then the program
     expects to analyze another data set after completing
     this one.  If NFLAG = 1 (i.e., the card has a 1 in col
     80), this signals the last data set and the program will
     terminate after analyzing the current data.

The second part of the input contains the values of the data
punched eight per card under an F10.3 format until the data
are exhausted.  The format for reading the data into the
program could easily be changed, if desired.

A listing of the program and output pertaining to the coal-
fired electric utility data is given in Appendix C for
reference purposes.
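
The card layout above can be sketched as a small reader (a Python
illustration; the helper names and the sample card are invented for the
example, while the program itself reads the same fields with its FORMAT
statements):

```python
def parse_header(card):
    """First card: N in cols 1-5 (right justified), the title in
    cols 6-65, and NFLAG in col 80 (blank means 0)."""
    n = int(card[0:5])
    title = card[5:65].strip()
    flag_col = card[79:80] if len(card) >= 80 else ""
    nflag = 1 if flag_col == "1" else 0
    return n, title, nflag

def parse_data_cards(cards, n):
    """Data cards: eight values per card in F10.3-style ten-column
    fields, read until n values have been collected."""
    values = []
    for card in cards:
        for i in range(0, 80, 10):
            field = card[i:i + 10]
            if field.strip():
                values.append(float(field))
            if len(values) == n:
                return values
    return values

# A hypothetical header card: N = 16, a title, and a 1 in col 80.
header = "   16 COAL-FIRED UTILITY EMISSIONS".ljust(79) + "1"
n, title, nflag = parse_header(header)
```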

-------
       Table  D-2.    COMPUTER LISTINGS  OF  THE

 SIMULATION  PROGRAMS  - GOODNESS-OF-FIT PROGRAM


      DIMENSION XE(20),AF(20),XSR(20),YSR(20),XES(20),AFS(20),RF(20)
      DIMENSION ITIL(30)
      COMMON DATA(500),XMIN,XMAX,XBAR,SDEV
C         THIS PROGRAM IS DESIGNED TO TAKE SAMPLE DATA FROM
C         SOME POPULATION WITH AN UNKNOWN DISTRIBUTION AND
C         "FIT" THIS DATA TO VARIOUS CONTINUOUS DISTRIBUTIONS
C         BY ONE OF THE STANDARD PROCEDURES OF STATISTICS.
C         THE PROGRAM WILL PRINT OUT VARIOUS SAMPLE PARAMETERS
C         SUCH AS MEAN, STANDARD DEVIATION, COEFFICIENT OF
C         SKEWNESS, AND MEASURE OF KURTOSIS.  IT WILL ALSO
C         PRINT OUT THE CORRESPONDING PARAMETERS FOR THE
C         THEORETICAL CONTINUOUS DISTRIBUTIONS WHICH ARE
C         FITTED TO THE DATA.  FINALLY, USING STURGES' RULE
C         TO SET UP CLASS INTERVALS, THE PROGRAM WILL
C         CALCULATE THE CHI-SQUARE VALUE TO BE USED IN A
C         CHI-SQUARE TEST FOR GOODNESS-OF-FIT.  IN ADDITION,
C         USING THE RIGHT ENDPOINTS OF THE CLASS INTERVALS
C         AS COMPARISON POINTS, THE PROGRAM CALCULATES THE
C         RESIDUAL SUM OF THE SQUARES OF THE DIFFERENCE IN
C         THE THEORETICAL CUMULATIVE DISTRIBUTION AND THE
C         ACTUAL CUMULATIVE DISTRIBUTION.
    5 READ(1,1) N,(ITIL(I),I=1,30),NFLAG
    1 FORMAT(I5,30A2,10X,I5)
      READ(1,2)(DATA(I),I=1,N)
    2 FORMAT(8F10.3)
      WRITE(5,10)(ITIL(I),I=1,30)
   10 FORMAT(1H1,30X,30A2)
      CALL SSTAT(N)
      CALL CINT(N,XMIN,XMAX,NINT,W,XES,AFS)
C         WEIBULL MAXIMUM LIKELIHOOD FIT
      CALL RELOAD(XES,AFS,XE,AF,NINT)
      CALL SOLVE(N,A,B)
      IF(B.LT.0.0) GO TO 3
      CALL TSKEW(A,B)
      CALL RFWEIB(NINT,XE,A,B,RF)
      CALL RELOAD(XES,AFS,XE,AF,NINT)
      CALL CHITST(N,NINT,RF,AF,XE)
C         WEIBULL LEAST SQUARES FIT
    3 CALL RELOAD(XES,AFS,XE,AF,NINT)
      CALL SETUP(N,NINT,XE,AF,XSR,YSR)
      CALL SR(NINT,XSR,YSR,A,B)
      CALL RFWEIB(NINT,XE,A,B,RF)
      CALL RELOAD(XES,AFS,XE,AF,NINT)
      CALL CHITST(N,NINT,RF,AF,XE)
C         GAMMA METHOD OF MOMENTS FIT
      CALL RELOAD(XES,AFS,XE,AF,NINT)
      CALL GSOLVE(XBAR,SDEV,ALPHA,BETA,N)
      IF(ALPHA.LT.1.0) GO TO 50
      CALL RFGAM(NINT,XE,ALPHA,BETA,RF)
      CALL RELOAD(XES,AFS,XE,AF,NINT)
      CALL CHITST(N,NINT,RF,AF,XE)
      GO TO 51
   50 WRITE(5,52)
   52 FORMAT(1H0,'THE GAMMA DISTRIBUTION WILL NOT FIT THIS DATA')
C         LOG NORMAL FIT
   51 WRITE(5,100)
  100 FORMAT(1H1/////53X,'LOG NORMAL FIT')
      DO 101 I=1,N
      DATA(I) = ALOG(DATA(I))
  101 CONTINUE
      CALL SSTAT(N)
      CALL CINT(N,XMIN,XMAX,NINT,W,XES,AFS)
      CALL RELOAD(XES,AFS,XE,AF,NINT)
      CALL RFNORM(NINT,XE,XBAR,SDEV,RF)
      CALL RELOAD(XES,AFS,XE,AF,NINT)
      CALL CHITST(N,NINT,RF,AF,XE)
      IF(NFLAG.EQ.0) GO TO 5
      STOP
      END

-------
      SUBROUTINE RFNORM(NINT,XE,XBAR,SDEV,RF)
      DIMENSION XE(20),RF(20)
C        THIS SUBROUTINE IS DESIGNED TO CALCULATE THE
C        THEORETICAL RELATIVE FREQUENCY RF OF THE NORMAL
C        DISTRIBUTION WITH MEAN XBAR AND STANDARD DEVIATION
C        SDEV OVER THE CLASS INTERVALS WHOSE ENDPOINTS ARE
C        GIVEN IN THE ARRAY XE.  NINT IS THE NUMBER OF
C        CLASS INTERVALS REPRESENTED.
      DO 1 I=1,NINT
    1 RF(I)=0.0
      IF((XBAR-3.*SDEV).LT.XE(2)) GO TO 2
      XE(1) = 0.0
      GO TO 3
    2 XE(1) = XBAR-3.*SDEV
    3 M = NINT-1
      S = 0.0
      DO 6 I=1,M
      DX = (XE(I+1)-XE(I))/1000.0
      CALL INTEG(XE(I),XE(I+1),DX,XBAR,SDEV,V)
      RF(I) = V
      S = S+RF(I)
    6 CONTINUE
      RF(NINT) = 1.0-S
      RETURN
      END
      SUBROUTINE INTEG(C,D,DX,XBAR,SDEV,V)
C         THIS SUBROUTINE CALCULATES THE INTEGRAL OF THE
C         NORMAL PROBABILITY DENSITY FUNCTION PDF WITH MEAN
C         XBAR AND STANDARD DEVIATION SDEV OVER THE INTERVAL
C         C TO D IN INCREMENTS OF DX BY THE TRAPEZOIDAL
C         RULE.  1000 ITERATIONS ARE USED.
      V = 0.0
      A2 = SQRT(2.*3.14159)*SDEV
      Y1 = PDF(C,XBAR,SDEV,A2)
      Y2 = Y1
      DO 1 I = 1,1000
      XI = FLOAT(I)
      X2 = C + XI*DX
      Y1 = Y2
      Y2 = PDF(X2,XBAR,SDEV,A2)
    1 V = V + 0.5*DX*(Y1+Y2)
      RETURN
      END
-------
      SUBROUTINE SETUP(N,NINT,XE,AF,XSR,YSR)
      DIMENSION XE(20),AF(20),XSR(20),YSR(20)
C        THIS SUBROUTINE ACCEPTS THE DATA AFTER IT IS
C        ARRANGED INTO CLASS INTERVALS AND ACTUAL
C        FREQUENCIES HAVE BEEN CALCULATED AND SETS UP THE
C        RIGHT ENDPOINTS OF THE CLASS INTERVALS WITH THEIR
C        RESPECTIVE CUMULATIVE FREQUENCIES FOR THE LEAST
C        SQUARES FIT TO THE WEIBULL.
      S = 0.0
      XN = FLOAT(N)
      DO 1 I=1,NINT
      XSR(I) = XE(I+1)
      S = S+AF(I)
    1 YSR(I) = S/XN
      YSR(NINT)=.999
      DO 2 I=1,NINT
      XSR(I) = ALOG(XSR(I))
      ARG = 1.0/(1.0-YSR(I))
      ARG = ALOG(ARG)
      YSR(I) = ALOG(ARG)
    2 CONTINUE
      RETURN
      END
      FUNCTION F(N,B)
C         THIS FUNCTION SUBPROGRAM CALCULATES THE EQUATION
C         VALUE AT A POINT B FOR THE EQUATION USED IN
C         SOLVING FOR B IN THE WEIBULL MAXIMUM LIKELIHOOD
C         FIT.  N IS THE TOTAL NUMBER OF DATA POINTS.
      DIMENSION XB(500),ALXB(500)
      COMMON X(500)
      S1=0.0
      S2=0.0
      S3=0.0
      DO 1 I=1,N
      XB(I)=X(I)**B
      ALXB(I)=ALOG(XB(I))
      S1=S1+XB(I)*ALXB(I)
      S2=S2+XB(I)
    1 S3=S3+ALXB(I)
      FN=FLOAT(N)
      FN=1.0/FN
      F=(S1/S2)-FN*S3-1.0
      RETURN
      END

-------
      SUBROUTINE TSKEW(A,B)
C         THIS SUBROUTINE CALCULATES AND PRINTS THE
C         THEORETICAL VALUES OF THE MEAN, VARIANCE, STANDARD
C         DEVIATION, THIRD AND FOURTH CENTRAL MOMENTS,
C         COEFFICIENT OF SKEWNESS, AND MEASURE OF KURTOSIS
C         FOR THE WEIBULL DISTRIBUTION FUNCTION WITH
C         PARAMETERS A AND B GIVEN.  IT ALSO EVALUATES AND
C         PRINTS THE I AND J POINTS OF THE FUNCTION FOR I=1,
C         ---, 5 AND J=99, ---, 95.
      Y1=1.0+(1.0/B)
      Y2=1.0+(2.0/B)
      Y3=1.0+(3.0/B)
      Y4=1.0+(4.0/B)
      CALL GAMMA(Y1,G1,I)
      CALL GAMMA(Y2,G2,J)
      CALL GAMMA(Y3,G3,K)
      CALL GAMMA(Y4,G4,L)
      A1=A**(-1.0/B)
      A2=A**(-2.0/B)
      A3=A**(-3.0/B)
      A4=A**(-4.0/B)
      XMU = A1*G1
      SIGMA2=(G2-G1**2)*A2
      SIGMA=SQRT(SIGMA2)
-------
      SUBROUTINE RFWEIB(NINT,XE,A,B,RF)
      DIMENSION XE(20),RF(20),CF(20)
C         THIS SUBROUTINE CALCULATES THE RELATIVE FREQUENCY
C         RF OVER THE CLASS INTERVALS WHOSE ENDPOINTS ARE IN
C         THE ARRAY XE FOR THE WEIBULL DISTRIBUTION FUNCTION
C         WHOSE PARAMETERS ARE A AND B.  NINT IS THE NUMBER
C         OF CLASS INTERVALS.
      MM = NINT-1
      DO 5 I = 1,MM
      XVAL = XE(I+1)
      CF(I) = WF(XVAL,A,B)
    5 CONTINUE
      CF(NINT) = 1.0
      RF(1) = CF(1)
      DO 6 I = 2,NINT
    6 RF(I) = CF(I)-CF(I-1)
      RETURN
      END
      SUBROUTINE SR(NINT,XSR,YSR,A,B)
C         THIS SUBROUTINE CALCULATES THE PARAMETERS A AND B
C         FOR THE WEIBULL DISTRIBUTION FUNCTION BY THE LEAST
C         SQUARES METHOD.  THE ALREADY SETUP ARRAYS XSR AND
C         YSR ARE SUPPLIED TO THE SUBROUTINE.  NINT IS THE
C         NUMBER OF CLASS INTERVALS.
      DIMENSION XSR(20),YSR(20)
      FN=FLOAT(NINT)
      S1 = 0.0
      S2 = 0.0
      DO 1 I=1,NINT
      S1 = S1+XSR(I)
      S2 = S2+YSR(I)
    1 CONTINUE
      XB = S1/FN
      YB = S2/FN
      S1 = 0.0
      S2 = 0.0
      S = 0.0
      DO 2 I=1,NINT
      S1 = S1+(XSR(I)-XB)**2
      S2 = S2+(YSR(I)-YB)**2
      S = S+(XSR(I)-XB)*(YSR(I)-YB)
    2 CONTINUE
      VX = S1/(FN-1.0)
      VY = S2/(FN-1.0)
      SX = SQRT(VX)
      SY = SQRT(VY)
      B = S/S1
      AL = YB-XB*B
      A = EXP(AL)
      R = (B*SX)/SY
      R2 = R**2
      ARG = (B*(1.-R2))/(FN-2.0)
      SE = SQRT(ARG)
      SB = SE/SQRT(S1)
      T = B/SB
      WRITE(5,400)
  400 FORMAT(1H0,/////53X,'WEIBULL LEAST SQUARES FIT')
      WRITE(5,10)A,B
   10 FORMAT(1H0,'WEIBULL PARAMETERS ARE G = 0.0,   A = ',E14.4,
     15X,'B = ',E14.4)
      WRITE(5,11)R,R2,SE,SB,T
-------
      SUBROUTINE RELOAD(XES,AFS,XE,AF,NINT)
      DIMENSION XES(20),AFS(20),XE(20),AF(20)
C         THE ARRAYS XE OF CLASS INTERVAL ENDPOINTS AND AF
C         OF ACTUAL FREQUENCIES ARE ALTERED IN CERTAIN PARTS
C         OF THIS PROGRAM TO FIT THE NEEDS AT THE TIME.
C         THUS, RATHER THAN RECALCULATE THEM EVERY TIME THEY
C         ARE NEEDED, THEY WERE STORED IN SAVED ARRAYS XES
C         AND AFS UPON INITIAL CALCULATION.  THIS SUBROUTINE
C         SIMPLY RELOADS THE WORKING ARRAYS XE AND AF WITH
C         THEIR SAVED VALUES.
      DO 1 I=1,NINT
      XE(I) = XES(I)
      AF(I) = AFS(I)
    1 CONTINUE
      XE(NINT+1) = XES(NINT+1)
      RETURN
      END
      SUBROUTINE GSOLVE(XBAR,SDEV,ALPHA,BETA,N)
C         THIS SUBROUTINE FITS THE GAMMA DENSITY FUNCTION TO
C         THE GIVEN DATA BY THE METHOD OF MOMENTS.  XBAR AND
C         SDEV ARE THE MEAN AND STANDARD DEVIATION OF THE
C         GIVEN DATA WHILE ALPHA AND BETA ARE THE CALCULATED
C         PARAMETERS FOR THE GAMMA DENSITY FUNCTION.  N IS
C         THE TOTAL NUMBER OF DATA POINTS IN OUR SAMPLE.  IN
C         ADDITION, THIS SUBROUTINE CALCULATES AND PRINTS
C         THE THEORETICAL VALUES OF THE MEAN, STANDARD
C         DEVIATION, COEFFICIENT OF SKEWNESS, ETC. FOR THE
C         FITTED GAMMA FUNCTION.
      FN=FLOAT(N)
      ALPHA=(FN*XBAR*XBAR)/((FN-1.0)*SDEV*SDEV)
      BETA=XBAR/ALPHA
      X3=2.0*ALPHA*BETA*BETA*BETA
      G1=2.0/SQRT(ALPHA)
      X4=((3.0*ALPHA**2)+(6.0*ALPHA))*BETA**4
      G2=3.0+6./ALPHA
      VAR=SDEV*SDEV
      WRITE(5,1)
    1 FORMAT(1H1/////53X,'GAMMA METHOD OF MOMENTS FIT')
      WRITE(5,2) ALPHA,BETA
    2 FORMAT(1H0,'GAMMA PARAMETERS ARE:  ','ALPHA = ',E14.7,5X,'BETA = '
     1,E14.7)
      WRITE(5,3)XBAR,VAR,SDEV,X3,X4,G1,G2
    3 FORMAT(1H0,'XBAR = ',E14.7,5X,'VAR = ',E14.7,5X,'SDEV = ',
     1E14.7,5X//'THIRD CENTRAL MOMENT = ',E14.7,5X,'FOURTH CENTRAL MOMEN
     2T = ',E14.7,5X//'COEFFICIENT OF SKEWNESS = ',E14.7,5X,
     3'MEASURE OF KURTOSIS = ',E14.7)
      RETURN
      END

-------
      SUBROUTINE RFGAM(NINT,XE,ALPHA,BETA,RF)
      DIMENSION XE(20),RF(20)
C         THIS SUBROUTINE SETS UP THE ARRAY OF RELATIVE
C         FREQUENCIES RF FOR THE GAMMA DENSITY FUNCTION WITH
C         PARAMETERS ALPHA AND BETA.  XE IS THE ARRAY OF
C         CLASS INTERVAL ENDPOINTS AND NINT IS THE NUMBER OF
C         CLASS INTERVALS.
      DO 1 I=1,NINT
    1 RF(I)=0.0
      XE(1) = 1.0E-05
      M = NINT-1
      S = 0.0
      DO 4 I = 1,M
      DX=(XE(I+1)-XE(I))/1000.0
-------
      SUBROUTINE SSTAT(N)
      COMMON X(500), XMIN, XMAX, XBAR, SDEV
C         THIS SUBROUTINE CALCULATES AND PRINTS THE SAMPLE
C         STATISTICS FOR THE GIVEN DATA.
      S=0.0
      DO 1 I=1,N
    1 S=S+X(I)
      XBAR=S/FLOAT(N)
      S2=0.0
      S3=0.0
      S4=0.0
      DO 2 I=1,N
      S2=S2+(X(I)-XBAR)**2
      S3=S3+(X(I)-XBAR)**3
    2 S4=S4+(X(I)-XBAR)**4
      VAR=S2/FLOAT(N-1)
      SDEV=SQRT(VAR)
      FN=FLOAT(N)
      XM3=(FN/((FN-1.0)*(FN-2.0)))*S3
      XM4=((FN*(FN+1.0))/((FN-1.0)*(FN-2.0)*(FN-3.0)))*S4
      XT2=((3.0*(FN-1.0)**2)/((FN-2.0)*(FN-3.0)))*VAR**2
      XM4=XM4-XT2
-------
      SUBROUTINE CHITST(N,NINT,RF,AF,XE)
      DIMENSION AF(20),RF(20),TF(20),XE(20)
C         THIS SUBROUTINE CALCULATES THE THEORETICAL
C         FREQUENCIES FOR THE CLASS INTERVALS AND PRINTS THE
C         TABLE OF CLASS INTERVALS WITH THEIR RESPECTIVE
C         THEORETICAL AND ACTUAL FREQUENCIES.  IT ALSO
C         CALCULATES AND PRINTS THE VALUE OF SSE, THE SUM OF
C         THE SQUARES OF THE ERROR.  FINALLY, IT CALCULATES
C         AND PRINTS THE VALUE OF CHI-SQUARE AND ITS
C         ASSOCIATED DEGREES OF FREEDOM FOR THIS DATA.
      FN = FLOAT(N)
      DO 7 I=1,NINT
    7 TF(I)=RF(I)*FN
      WRITE(5,10)
   10 FORMAT(1H0,'   CLASS INTERVALS',20X,'THEORETICAL FREQUENCY',20X,
     1'ACTUAL FREQUENCY')
      DO 11 I=1,NINT
   11 WRITE(5,12) XE(I),XE(I+1),TF(I),AF(I)
   12 FORMAT(1H0,'FROM ',F10.2,' TO ',F10.2,15X,F10.3,30X,F10.3)
      S1 = 0.0
      S2 = 0.0
      SSE = 0.0
      DO 30 I = 1,NINT
      S1 = S1 + RF(I)
      S2 = S2 + (AF(I)/FN)
      SSE = SSE + (S1-S2)**2
   30 CONTINUE
      WRITE(5,31) SSE
   31 FORMAT(1H0,'SSE = ',F16.7)
      K=NINT
      L=1
      DO 22 I=1,NINT
      IF(TF(K).GE.5.0) GO TO 23
      TF(NINT-I)=TF(NINT-I)+TF(NINT-I+1)
      AF(NINT-I)=AF(NINT-I)+AF(NINT-I+1)
      K=K-1
   22 CONTINUE
   23 S=0.0
      DO 21 I=L,K
   21 S=S+((TF(I)-AF(I))**2.0)/TF(I)
      NDF=K-L-2
      WRITE(5,50) S,NDF
   50 FORMAT(1H0,'CHI-SQUARE = ',F16.7,5X,'DEGREES OF FREEDOM = ',I5)
-------
      SUBROUTINE SOLVE(N,A,B)
      COMMON XX(500)
C         THIS SUBROUTINE SOLVES FOR THE PARAMETERS A AND B
C         IN THE WEIBULL DISTRIBUTION FUNCTION BY THE METHOD
C         OF MAXIMUM LIKELIHOOD.  IT USES THE METHOD OF
C         FALSE POSITION IN THE SOLUTION PROCEDURE.  N IS
C         THE TOTAL NUMBER OF DATA POINTS AND XX IS THE
C         ARRAY OF DATA POINTS.
      X1 = 1.0
      X2 = 2.0
      Y1 = F(N,X1)
      Y2 = F(N,X2)
   71 IF((Y1.LT.0.0.AND.Y2.GT.0.0).OR.(Y1.GT.0.0.AND.Y2.LT.0.0)) GO TO
     170
      X1 = X2
      Y1 = Y2
      X2 = X2 + 1.0
      Y2 = F(N,X2)
      IF(X1.GT.10.0) GO TO 2
      GO TO 71
    2 WRITE(5,11)
   11 FORMAT(' BETA IS > 10.0 OR < 1.0 OR THERE IS AN EVEN NUMBER OF
     1ROOTS BETWEEN TWO INTEGER VALUES')
      B=-1.0
      RETURN
   70 K = 0
    3 X = X1-Y1*((X2-X1)/(Y2-Y1))
      Y = F(N,X)
      IF(ABS(Y).LT.1.0E-06) GO TO 100
      IF((Y.LT.0.0.AND.Y1.LT.0.0).OR.(Y.GT.0.0.AND.Y1.GT.0.0)) GO TO 5
      X2 = X
      Y2 = Y
   20 K = K + 1
      IF(K.GE.5000) GO TO 101
      GO TO 3
    5 X1 = X
      Y1 = Y
      GO TO 20
  101 WRITE(5,12)
   12 FORMAT(' THE NUMBER OF ITERATIONS EXCEEDED 5000')
  100 B=X
      S=0.0
      DO 50 I=1,N
   50 S=S+XX(I)**B
      A=FLOAT(N)/S
      WRITE(5,400)
  400 FORMAT(1H0/////50X,'WEIBULL MAXIMUM LIKELIHOOD FIT')
      WRITE(5,500) A,B
  500 FORMAT(1H0,'WEIBULL PARAMETERS ARE G = 0.0,   A = ',E14.4,
     1'B = ',E14.4)
      RETURN
      END
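
The maximum likelihood procedure in SOLVE and F amounts to a
one-dimensional root solve; the same steps can be sketched compactly
in modern form (a Python sketch rather than the listed FORTRAN; the
sample data are invented):

```python
import math

def f(b, xs):
    # Equation value for the Weibull maximum likelihood fit, written as
    # in the F subprogram: the usual likelihood equation multiplied by b.
    xb = [x ** b for x in xs]
    alxb = [math.log(v) for v in xb]        # = b * ln(x)
    s1 = sum(v * w for v, w in zip(xb, alxb))
    s2 = sum(xb)
    s3 = sum(alxb)
    return s1 / s2 - s3 / len(xs) - 1.0

def weibull_mle(xs):
    # Bracket the root of f between successive integers starting at
    # (1, 2), as SOLVE does, then refine by the method of false position.
    x1, x2 = 1.0, 2.0
    y1, y2 = f(x1, xs), f(x2, xs)
    while y1 * y2 > 0.0:
        x1, y1 = x2, y2
        x2 += 1.0
        y2 = f(x2, xs)
        if x1 > 10.0:
            raise ValueError("no root for b between 1 and 10")
    x = x1
    for _ in range(5000):
        x = x1 - y1 * (x2 - x1) / (y2 - y1)
        y = f(x, xs)
        if abs(y) < 1.0e-8:
            break
        if y * y1 > 0.0:
            x1, y1 = x, y
        else:
            x2, y2 = x, y
    b = x
    a = len(xs) / sum(v ** b for v in xs)   # A = N / sum(x**B), as in SOLVE
    return a, b
```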

-------
                         APPENDIX E
    TREATMENT OF CORRELATED DATA BY LINEAR TRANSFORMATION

1.   THEORETICAL DISCUSSION OF LINEAR TRANSFORMATION

Dependent  (correlated) input variables are discussed  in
Section IV.B.2.  If the original variables are binormally
distributed, it is possible, by a linear transformation of
variables, to find two normally distributed and stochastically
independent linear functions of these variables.13  Assume
that X1 and X2 are correlated variables.  With the intro-
duction of variables Y1 and Y2, we can transform the (X1, X2)
variables to the (Y1, Y2) coordinate system, which has its
origin at the point (X1, X2) = (X̄1, X̄2) and whose Y-axis
forms an angle α with the X-axis.

           Y1 = (X1 - X̄1) cos α + (X2 - X̄2) sin α       (E-1)

           Y2 = -(X1 - X̄1) sin α + (X2 - X̄2) cos α      (E-2)
The correlation coefficient of (Y1, Y2) depends on the angle
α.  The following equation is used to find a value of α for
which the correlation coefficient is zero.
   tan 2α = 2ρS1S2/(S1² - S2²) for S1 ≠ S2, and α = π/4 for S1 = S2  (E-3)

where    ρ = correlation coefficient of (X1, X2) data pairs
        S1 = sample standard deviation of X1 data
        S2 = sample standard deviation of X2 data

The data values  (Yl, Y2) may be restored to the original
values (XI, X2) by the following inverse transformations:
13Hald, A.  Statistical Theory with Engineering Applications.
  New York, John Wiley & Sons, Inc., 1952.  p. 596-599.

-------
               X1 = X̄1 + Y1 cos α - Y2 sin α            (E-4)

               X2 = X̄2 + Y1 sin α + Y2 cos α            (E-5)

The above equations were implemented in the APL language on a
time-sharing terminal.  Using the population data for coal
consumption and stack heights, the (X1, X2) data pairs were
correlated with ρ = 0.559.  After transformation the correla-
tion coefficient was ρ = 2.978E-17, which is essentially zero.

2.  SIMULATION IMPLEMENTATION

The variables coal consumed  (CC) and stack  height  (H) were
treated as log-normal distributions.  The transformation of
the 24 sample values for CC and H were performed on the APL
terminal as previously discussed.  The correlation coefficient
for CC with H was 0.3705.  After transformation, the correla-
tion coefficient was 9.9E-17, which is essentially zero.
Table E-l lists the  inputs and results of various  simulation
runs.
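
Equations E-1 through E-5 can be sketched as follows (a Python sketch;
the original work used APL, and the sample pairs below are invented for
illustration):

```python
import math

def decorrelate(x1, x2):
    """Rotate (X1, X2) about their means by the angle of Eq. E-3 so that
    the transformed pairs (Y1, Y2) have zero sample correlation."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s1 = math.sqrt(sum((v - m1) ** 2 for v in x1) / (n - 1))
    s2 = math.sqrt(sum((v - m2) ** 2 for v in x2) / (n - 1))
    rho = sum((a - m1) * (b - m2)
              for a, b in zip(x1, x2)) / ((n - 1) * s1 * s2)
    if s1 != s2:
        alpha = 0.5 * math.atan2(2.0 * rho * s1 * s2, s1 ** 2 - s2 ** 2)
    else:
        alpha = math.pi / 4.0
    c, s = math.cos(alpha), math.sin(alpha)
    y1 = [(a - m1) * c + (b - m2) * s for a, b in zip(x1, x2)]    # E-1
    y2 = [-(a - m1) * s + (b - m2) * c for a, b in zip(x1, x2)]   # E-2
    return y1, y2, m1, m2, alpha

def restore(y1, y2, m1, m2, alpha):
    """Inverse transformation, Eqs. E-4 and E-5."""
    c, s = math.cos(alpha), math.sin(alpha)
    x1 = [m1 + a * c - b * s for a, b in zip(y1, y2)]
    x2 = [m2 + a * s + b * c for a, b in zip(y1, y2)]
    return x1, x2
```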

Equations E-4 and E-5 were implemented in a subroutine in the
simulation program to restore the transformed data to actual
values (see Table C-3 for a listing of the  program), with
the result that the  mean severity was 9.31.  The simulation
was repeated with stack height considered as an independent
variable.  The mean  value of  severity was 11.25.   The t-test
was made to see if these values were significantly different
from the mean severity of 9.25 arrived at earlier  in this
report. 1If
14Alder, H.  L., and Roessler, E. B.  Introduction to Proba-
  bility and Statistics.  San Francisco, W. H. Freeman and
  Company,  1968.  p. 136-140.


-------
              Table E-1.  COMPARISON OF SIMULATION RESULTS(a)

                                            Standard
Method           Parameter        Mean      deviation  Comment
Section IV.B.2   Coal consumed    1326      1329       A = 0.9487E-3
                                                       B = 0.9856
                 Stack            93.6      36         Dependent variable
                 heights(b)
                 Percent          1.82      1.15       A = 0.3039
                 sulfur(c)                             B = 1.667
                 Resultant        9.25      12.48
                 severity

Variables        Stack            4.472     0.3751     Log-normal
treated as       heights(d)                            distribution
independent      Resultant        11.25     19.17
                 severity

Transformation   Coal             0.0       0.9731     Transformed
of variables     consumed(e)
                 Stack heights    0.0       0.3442     Transformed
                 Resultant        9.31      12.0
                 severity

(a) Coal consumption in kg/yr; stack height in m; severity is
    dimensionless.
(b) Stack height treated as a dependent variable correlated with coal
    consumed.
(c) Same value used for each simulation.
(d) Stack height treated as an independent variable.
(e) Coal consumed and stack heights linearly transformed to make the
    correlation coefficient zero; both were treated as log-normal
    distributions.




The t-test assumes that the variate X is normally distributed
with mean μ.  If all possible samples of n variates are
taken from this population and their means are denoted by X̄,
then

                  t = (X̄ - μ)/S_X̄                      (E-6)

                  Ŝ = S·√(n/(n - 1))                    (E-7)

                  S_X̄ = Ŝ/√n                           (E-8)

-------
where     t = value from table of the Student t-distribution
              with n-1 degrees of freedom
          X̄ = mean of X
          μ = population mean
        S_X̄ = standard deviation of X̄
          S = sample standard deviation
          Ŝ = standard deviation corrected for small sample
              size
          n = sample size

In the sample calculation shown below, it is assumed that
the population mean of 9.25 is from the simulation results
of Section IV.B.2.  From a reference table, t(23) = 2.807 at
the 99% confidence level.

               t = (11.25 - 9.25)/3.998 = 0.50

Since the above t value is less than 2.807,  it is concluded
that the means are from the same population.  This is also
true for the mean of the simulation using the linear trans-
formation.
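
The comparison above can be restated numerically (a Python sketch
using the report's own values; 2.807 is the tabulated t at 23 degrees
of freedom):

```python
import math

# Values from the report: simulation with stack height independent,
# compared with the Section IV.B.2 mean severity.
n = 24          # sample size
xbar = 11.25    # simulated mean severity
mu = 9.25       # population mean assumed from Section IV.B.2
s = 19.17       # sample standard deviation of severity

s_hat = s * math.sqrt(n / (n - 1))   # corrected for small sample size (E-7)
s_xbar = s_hat / math.sqrt(n)        # standard deviation of the mean (E-8)
t = (xbar - mu) / s_xbar             # Student t statistic (E-6)
```

The computed t is about 0.50, below the tabulated t(23) = 2.807, which
matches the report's conclusion that the means are from the same
population.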

The F-distribution allows one to test whether the variances
of two means are from the same population.15  A confidence
level such as 99% (significance level of 1%) is selected.
A null hypothesis of σ1² = σ2² is made.  The alternative
is σ1² ≠ σ2² and the samples are not from the same
15Koosis, D. J.  Statistics.  New York, John Wiley and Sons,
  Inc., 1972.  p. 155-160.


-------
populations with the same variance.  Usually, the larger
variance is divided by the smaller:
                   F = (S1)²/(S2)²  where S1 > S2
For the case of treating stack height as an independent
variable,

               F = (19.17)²/(12.48)² = 367/156 = 2.35

and for the linear transformation case,

               F = (12.0)²/(12.48)² = 144/156 = 0.92

For a sample size of 24, the critical region is F > 2.72
(found from a table).  Thus, it is concluded that the vari-
ances are not significantly different at the 1% significance
level.  It thus appears that linear transformation is a
valid method to use with correlated data.
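
The two F ratios can be restated numerically (a Python sketch using
values from Table E-1):

```python
# F ratios for the variance comparisons above (values from Table E-1).
# The report's convention is usually to divide the larger variance by
# the smaller; a ratio below 1 trivially lies inside the acceptance
# region.
f1 = 19.17 ** 2 / 12.48 ** 2   # stack height independent: 367/156
f2 = 12.0 ** 2 / 12.48 ** 2    # linear transformation: 144/156

# Both fall below the critical value F = 2.72 for a sample size of 24,
# so the variances are not judged significantly different.
```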

-------
                         SECTION VII

                         REFERENCES
1.   Slade, D. H. (ed.).  Meteorology and Atomic Energy.
     Environmental Science Services Administration, Air
     Resources Labs.  Silver Spring.  AEC Publication No.
     TID-24190.  July 1968.  445 p.

2.   Turner, D. B.  Workbook of Atmospheric Dispersion
     Estimates, 1970 Revision.  U.S. Department of Health,
     Education, and Welfare.  Cincinnati.  Public Health
     Service Publication No. 999-AP-26.  May 1970.  84 p.

3.   Parzen, E.  Modern Probability Theory and Its
     Applications.  New York, John Wiley & Sons, 1960.

4.   Springer, C. H., et al.  Probabilistic Models.
     Homewood, Richard D. Irwin, Inc., 1968.

5.   Curran, T. C.,  and N. H. Frank.  Assessing the Validity
     of the Lognormal Model when Predicting Maximum Air
     Pollution Concentrations.  (Presented at the 68th
     Annual Meeting of the Air Pollution Control Association,
     Boston.  June 15-20, 1975.)

6.   Geary, R. C., and E. S. Pearson.  Tests of Normality.
     London, University College, 1938.

7.   Mendenhall, W., and R. L. Scheaffer.  Mathematical
     Statistics with Applications.  North Scituate,
     Duxbury Press,  1973.

8.   Walpole, R. E., and R. H. Myers.  Probability and
     Statistics for Engineers and Scientists.  New York,
     The Macmillan Co., 1972.

9.   Siegel, S.  Nonparametric Statistics.  New York,
     McGraw-Hill Book Co., 1956.

10.  Duncan, A. J.  Quality Control and Industrial
     Statistics.  Chicago, Richard D. Irwin, Inc., 1952.


-------
11.  Cramer, H.  Mathematical Methods of Statistics.
     Princeton, Princeton University Press, 1946.  575 p.

12.  Ott, W. R., and D. T. Magee.  Random Sampling as an
     Inexpensive Means for Measuring Average Annual Air
     Pollutant Concentrations in Urban Areas.  (Presented
     at the 68th Annual Meeting of the Air Pollution Control
     Association, Boston.  June 15-20, 1975.)

13.  Hald, A.  Statistical Theory with Engineering
     Applications.  New York, John Wiley & Sons,  Inc.,
     1952.  p. 596-599.

14.  Alder, H. L., and Roessler, E. B.  Introduction to
     Probability and Statistics.  San Francisco,  W. H. Freeman
     and Company, 1968.  p. 136-140.

15.  Koosis, D. J.  Statistics.  New York, John Wiley and Sons,
     Inc., 1972.  p. 155-160.

-------
                         TECHNICAL REPORT DATA
          (Please read instructions on the reverse before completing)

1.  REPORT NO.:  EPA-600/2-76-032e
3.  RECIPIENT'S ACCESSION NO.:
4.  TITLE AND SUBTITLE:  Source Assessment:  Severity of Stationary
    Air Pollution Sources--A Simulation Approach
5.  REPORT DATE:  July 1976
7.  AUTHOR(S):  E. C. Eimutis, B. J. Holmes, and L. B. Mote
8.  PERFORMING ORGANIZATION REPORT NO.:  MRC-DA-543
9.  PERFORMING ORGANIZATION NAME AND ADDRESS:  Monsanto Research
    Corporation, 1515 Nicholas Road, Dayton, Ohio  45407
10. PROGRAM ELEMENT NO.:  1AB015; ROAP 21AXM-071
11. CONTRACT/GRANT NO.:  68-02-1874
12. SPONSORING AGENCY NAME AND ADDRESS:  EPA, Office of Research and
    Development, Industrial Environmental Research Laboratory,
    Research Triangle Park, NC  27711
13. TYPE OF REPORT AND PERIOD COVERED:  Task Final; 6/75-5
14. SPONSORING AGENCY CODE:  EPA-ORD
15. SUPPLEMENTARY NOTES:  Project officer for this report is
    D. A. Denny, mail drop 62, 919/549-8411, ext 2547.
16. ABSTRACT:  This report gives results of a study simulating the
    establishment of the severity of stationary air pollution sources.
    The potential environmental impact of an emission source can be
    determined from the source severity (the ground level concentration
    contribution of pollutants relative to some potentially hazardous
    concentration of the same species).  The frequency distribution of
    the severity of well-documented source types can be examined
    deterministically.  A statistical approach is required to simulate
    the frequency distribution of the severity of source types that are
    complex or involve a large number of emission points in order to
    ultimately assess such sources.  A Monte Carlo simulation technique
    is described in this report, together with efficient algorithms for
    fitting the inverse Weibull, gamma, normal, and log-normal
    cumulative density functions.  Significant correlation is
    demonstrated between deterministic and simulated severity results
    using coal-fired steam/electric utilities as an example.
17. KEY WORDS AND DOCUMENT ANALYSIS
    a.  DESCRIPTORS:  Air Pollution; Ranking; Environmental Biology;
        Simulation; Monte Carlo Method; Estimating; Electric Utilities
    b.  IDENTIFIERS/OPEN ENDED TERMS:  Air Pollution Control;
        Stationary Sources; Source Assessment; Environmental Impact
    c.  COSATI Field/Group:  13B, 12B, 06F, 12A
18. DISTRIBUTION STATEMENT:  Unlimited
19. SECURITY CLASS (This Report):  Unclassified
20. SECURITY CLASS (This page):  Unclassified
21. NO. OF PAGES:  133
22. PRICE:

EPA Form 2220-1 (9-73)

-------