EPA
              United States
              Environmental Protection
              Agency
              Air Pollution Training Institute
              MD 20
              Environmental Research Center
              Research Triangle Park, NC 27711
EPA 460/2-81-19
June 1981
              Air
APTI
Course 426
Statistical Evaluation
Methods for
Air Pollution Data
              Student Workbook

-------
                                 HANDOUT #1

I.  An Introduction

1.  Illustrative Statistical Problems

    A.  Determination of the Precision and Accuracy of an Air Pollution
        Monitoring System.

    B.  Assessment of a Long-Term Trend in Air Pollution Levels.

    C.  Measuring the Effectiveness of an Air Pollution Abatement Strategy.

    D.  Quantification of the Effect of Air Pollution on Morbidity.

        The following data represent air pollution concentration readings
        taken in two cities, A and B.  Is there evidence that city B has a
        higher true mean level of pollution than city A?

                              A              B

                             760            770
                             790            840           Ȳ_A = 785,
                             820            780
                             805            810           Ȳ_B = 806.
                             750            830
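
The two sample means quoted beside the table can be verified directly; a
minimal Python sketch (the variable names are ours, not the workbook's):

```python
# Air pollution concentration readings for the two cities (table above).
city_a = [760, 790, 820, 805, 750]
city_b = [770, 840, 780, 810, 830]

mean_a = sum(city_a) / len(city_a)   # 785.0
mean_b = sum(city_b) / len(city_b)   # 806.0
print(mean_a, mean_b, mean_b - mean_a)
```

Whether the observed difference of 21 reflects a real difference in true mean
levels is exactly the kind of question the methods below are meant to answer.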

2.  Objectives of the Statistical Method

    A.  Essentially, statistics is concerned with the extraction of infor-

        mation from observed data.  Features common to each of the statis-

        tical problems described are:

          1)  Design of an experiment or sampling procedure.

          2)  Collection and analysis of data.

          3)  The formulation of an inference about a population from

              information contained in a sample.

    B.  Distinction between a sample and the population  (totality of

        measurements from which the sample was drawn).

    C.  The objective of the statistical procedure is:

        1)  To provide an experimental design or sampling procedure which

-------
             will yield the greatest amount of information for a given cost.

         2)  To utilize an analysis which will extract the maximum informa-

             tion from the given data.

         3)  To utilize an inferential procedure where the degree of uncer-

             tainty, which always arises  in any inferential  process, is

             known.  Specifically,  we wish to give a measure  of the goodness

             of the inferential procedure.

 3.  Distributions of Random Variables

      Data:  A set of n = 25 air pollution concentration measurements

                 1060    1000    1200    1050    1240
                 1290     970    1290    1160    1070
                 1020     890    1210    1390    1130
                 1120     830    1030    1180     910
                  900    1170    1150     990     950

     A.  A Graphical Method of Presenting Data - The Frequency Histogram.

         1)  Frequency Tabulation

                                                      Relative
             Interval      Tally      Number         Frequency
              825- 925     ||||          4              .16
              925-1025     |||||         5              .20
             1025-1125     |||||         5              .20
             1125-1225     |||||||       7              .28
             1225-1325     |||           3              .12
             1325-1425     |             1              .04
         2)  The frequency histogram

             [Histogram of the frequencies tabulated above, plotted over
             the concentration intervals 825-925 through 1325-1425.]
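
The tabulation can be reproduced in a few lines of Python (a sketch; the
interval edges follow the handout):

```python
# The 25 air pollution concentration measurements from Section 3.
data = [1060, 1000, 1200, 1050, 1240,
        1290,  970, 1290, 1160, 1070,
        1020,  890, 1210, 1390, 1130,
        1120,  830, 1030, 1180,  910,
         900, 1170, 1150,  990,  950]

edges = list(range(825, 1426, 100))            # 825, 925, ..., 1425
counts = [sum(lo <= y < hi for y in data)      # frequency per interval
          for lo, hi in zip(edges, edges[1:])]
rel_freq = [c / len(data) for c in counts]     # relative frequency

for (lo, hi), c, f in zip(zip(edges, edges[1:]), counts, rel_freq):
    print(f"{lo:4d}-{hi:4d}: {c}  {f:.2f}")
```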

     B.  Numerical Descriptive Measures.

         1)  For the Sample

             a.  The sample mean, Ȳ.

                       n
                  Ȳ =  Σ Y_i / n
                      i=1

                  is a measure of the center of the distribution
                  of data.  Other measures of central tendency are the median,
                  mode, and geometric mean Ȳ_G = (Y_1 Y_2 ... Y_n)^(1/n).

             b.  The sample variance, S².

                        n
                  S² =  Σ (Y_i - Ȳ)² / (n - 1).
                       i=1

             c.  The sample standard deviation S is defined to be the positive
                 square root of S².


         2)  For the population

             a)  The population mean is usually denoted by the symbol μ.

             b)  The population standard deviation is usually denoted by
                 the symbol σ, the variance by the symbol σ².


         3)  Calculation of Ȳ and S

             Example:  Consider a sample consisting of the five measurements
             2, 3, 3, 6, 1.

                  Y_i    (Y_i - Ȳ)    (Y_i - Ȳ)²    Y_i²
                   2        -1             1           4
                   3         0             0           9
                   3         0             0           9
                   6         3             9          36
                   1        -2             4           1
                  --        --            --          --
                  15         0            14          59

                   n
                   Σ Y_i = 15,      Ȳ = 15/5 = 3,
                  i=1

                        5
                   S² = Σ (Y_i - Ȳ)² / (n - 1) = 14/4 = 3.5,
                       i=1

                   S = √3.5 = 1.87.

-------
         Note the following short-cut formula:

              Σ (Y_i - Ȳ)²  =  Σ Y_i² - (Σ Y_i)²/n  =  59 - (15)²/5  =  14.
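
The hand computation, including the short-cut formula, can be checked with a
short Python sketch:

```python
# The five measurements from the worked example.
y = [2, 3, 3, 6, 1]
n = len(y)

ybar = sum(y) / n                                         # 3.0
ss = sum((yi - ybar) ** 2 for yi in y)                    # 14.0
ss_shortcut = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n  # 59 - 225/5 = 14.0
s2 = ss / (n - 1)                                         # 3.5
s = s2 ** 0.5                                             # about 1.87
print(ybar, ss, ss_shortcut, s2, s)
```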

4.  The significance of the standard deviation.

    A.  Tchebysheff's Theorem

        At least (1 - 1/K²) of the observations will lie within K standard

        deviations of the mean.

          For example, if K = 2, at least 1 - 1/(2)² = 3/4 of the observations

        will lie within two standard deviations of the mean.

    B.  An Empirical Rule applicable to normal (bell-shaped) distributions:

        Approximately 68% of the observations will lie within one standard

        deviation of the mean, about 95% within two, and about 99.7% within

        three.

    C.  Example:  For the 25 air pollution concentration measurements,

        Ȳ = 1088 and S = 142.


K    (Ȳ ± KS)    No. of observations   Observed   Predicted:     Normal
                    in interval        fraction   Tchebysheff     Rule

1    946-1230           17               .68     at least none    .68

2    804-1372           24               .96     at least 3/4     .95

3    662-1514           25              1.00     at least 8/9     .997
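
The coverage counts in the table can be checked directly (a Python sketch
using the 25 readings from Section 3):

```python
data = [1060, 1000, 1200, 1050, 1240,
        1290,  970, 1290, 1160, 1070,
        1020,  890, 1210, 1390, 1130,
        1120,  830, 1030, 1180,  910,
         900, 1170, 1150,  990,  950]
n = len(data)
ybar = sum(data) / n                                        # 1088.0
s = (sum((y - ybar) ** 2 for y in data) / (n - 1)) ** 0.5   # about 142

observed = []
for k in (1, 2, 3):
    inside = sum(ybar - k * s <= y <= ybar + k * s for y in data)
    observed.append(inside)
    print(k, inside, inside / n)
```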

-------
                                 HANDOUT #2




II.   Statistical Methods of Inference.


     1.   Types


         A.   Estimation of Population Parameters.


         B.   Tests of Hypotheses about  Population  Parameters.


         C.   Decision-Making Procedures.


     2.   Estimation


         A.   Objective is to estimate a parameter  or  characteristic  of  the


             population.


         B.   An estimator is a rule (formula) for combining the observations

              to obtain an estimate.  For example,

                                      n
                                      Σ  Y_i / n
                                     i=1

              is an estimator for μ.


         C.   If an estimation process  was  repeated  many  times  based  on  differ-


             ent samples  from the same population,  a  distribution  of estimates


             would be obtained.


         D.   A good estimator will have a  distribution centered  about the  para-


             meter estimated and will  possess  a  small standard deviation.


              Example:  Let Ȳ be the mean of a random sample of n observations

              drawn from a population with mean μ and standard deviation σ.  It can be

              shown that Ȳ will have a distribution with mean equal to μ and

              standard deviation equal to σ/√n.  For the air pollution data,

              Ȳ was equal to 1088.

              The value of σ is unknown, but S = 142 is an estimate of σ.  Hence,

              the standard deviation of Ȳ is approximately:

-------
           σ_Ȳ  ≈  S/√n  =  142/√25  =  28.4.

Utilizing Tchebysheff's Theorem and the Empirical Rule, we would con-

clude that the chances are good (at least .75 and perhaps as high as

.95) that our estimate is within two σ_Ȳ = 2(28.4) = 56.8 of the true

mean concentration.

    (Note:  when n is large, say n ≥ 30, Ȳ has a distribution which is

nearly normal.)

   Example:  A sample of n items is drawn from a large lot of manu-

factured items and inspected for defectives.  Let the number in the

sample equal n, the number of defectives equal Y, and the true (and

unknown) fraction defective equal p.  Then a good estimator of p is

                      p̂ = Y/n.

   It can be shown that in repeated samples of size n, p̂ will have a

distribution with mean equal to p and standard deviation equal to

                      σ_p̂ = √(pq/n),  where q = 1 - p.
   Suppose that a sample of n = 100 items is inspected and Y = 10

defectives are observed.  Then the estimate of p is

          p̂ = Y/n = 10/100 = .10.

   The standard deviation of p̂ is σ_p̂ = √(pq/n).  Although p is unknown,

an approximate value of σ_p̂ may be obtained by using p̂ as an approxima-

tion for p.  Thus, the standard deviation of p̂ is approximately:

                 σ_p̂ ≈ √(p̂q̂/n) = √((.10)(.90)/100) = .03.
   Utilizing Tchebysheff's Theorem and the Empirical Rule, we would

be fairly confident (probability at least .75 and actually near .95)

that our estimate  p̂ = .10  is within 2 σ_p̂ = .06 of the true fraction

defective p.

-------
    E.   How does cost enter into the estimation problem?  One must pay for
         precision -- the larger the sample, the smaller (in general)
         the standard deviation of the estimator and the greater the preci-
         sion.  If the sample size n is not large enough to gain the desired
         precision, the experiment will produce ineffective results.  On
         the other hand, if the sample size is larger than needed, money
         is wasted.
         Example:  Suppose we wish to estimate the true fraction defective
         p to within .02 of the true value with a probability of .95.
            Then,     2σ_p̂ = .02,

              or      σ_p̂² = pq/n = (.01)²,

                 or   n = pq/(.01)² = pq/.0001.
         If p is unknown, a conservative guess is p = .5.
         Using this value,
                    n = (.5)(.5)/.0001 = 2500.
         Actually, a more realistic guess for the value of p might be p = .10.
         Then,
                    n = (.1)(.9)/.0001 = 900.
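
The sample-size arithmetic can be wrapped in a small function (a sketch; the
function name and default half-width are ours):

```python
def n_needed(p, half_width=0.02):
    """Sample size so that two standard deviations of p-hat span half_width."""
    sigma = half_width / 2           # one standard deviation of p-hat
    return p * (1 - p) / sigma ** 2

print(n_needed(0.5))   # conservative guess, about 2500
print(n_needed(0.1))   # more realistic guess, about 900
```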
3.  Tests of Hypotheses
    A.  Example:  Test to determine  the effectiveness  of  a new cold vaccine.
        Ten people were injected  with the  vaccine and  8 out of 10  survived
        the winter without acquiring a cold.  Was the  vaccine effective?
        Assume that the probability  of surviving the winter without a cold
        and without vaccine is  p  = .5.
        (fictitious example)

-------
B.  Tests of Hypotheses and the Scientific Method.



C.  Elements of a Test



    1)  Null hypothesis (hypothesis to be tested), H_0.



    2)  Test statistic or "decision maker"



    3)  Rejection region and non-rejection region for the test



        statistic



    4)  Alternative hypothesis, H_a

        (aids in the selection of the rejection region)



D.  For the cold vaccine example, we would use



    1)  H_0:  the vaccine is ineffective,

              i.e., p = .5.



    2)  Test statistic:  Y = number of survivors



    3)  Rejection Region:



        Reject H_0 if Y is greater than or equal to some preassigned



        number, say 9.



    4)  Alternative Hypothesis



        H_a:  The vaccine is effective, i.e., p > .5.
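
For this rejection region, the probability of a Type I error (rejecting a
true H_0) can be computed exactly from the binomial distribution; a Python
sketch:

```python
from math import comb

# P(Y >= 9) when H0 is true: n = 10 people, p = .5 per person.
n, p = 10, 0.5
alpha = sum(comb(n, y) * p ** y * (1 - p) ** (n - y) for y in range(9, n + 1))
print(alpha)   # 11/1024, about .0107
```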



E.  The goodness of this inferential procedure is measured by the



    probability that an error will be made by the "decision maker."



    The errors are:



       1)  Type I Error:  Reject the null hypothesis when, in fact,



           the null hypothesis is true.



       2)  Type II Error:  Do not reject the null hypothesis when it



           is false and some alternative, H_a, is true.


F.  A test is chosen for a specific situation in such a way as to



    minimize the total risk (cost) associated with the two types of



    errors.

-------
4.  Decision-Making Procedures.


    A.  Note that estimation and tests of hypotheses may be viewed as


        decision-making procedures.


    B.  Example:  Lot acceptance sampling for defectives (large lots.)


           Sampling plan:  Select n items at random and reject the lot


        if the number of defectives Y in the sample is greater than some


        specified number.


           If the true fraction defective is equal to p, then it can be


        shown that the probability of observing a particular value of Y


        is given by the binomial distribution





                          n!
                P(Y) = ---------- p^Y (1 - p)^(n-Y),    Y = 0, 1, ..., n.
                       Y!(n - Y)!


C.  Features of a Decision-Making Procedure.


    1)  Two alternatives (could be more than two), say A and B.


    2)  Decision Maker based upon information contained in the sample.


        If the decision maker takes certain values, decision A is made;


        otherwise, one makes decision B.


D.  The goodness of a decision-making process is measured by the pro-


    bability of making an incorrect decision.


       Errors :


       1)  Making decision B when A is the correct decision.


       2)  Making decision A when B is the correct decision.


E.  The goodness of a lot acceptance sampling plan is measured by the

    operating characteristic curve for the plan.  This is a plot of the


    probability of acceptance of the lot versus the true fraction de-


    fective p.
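
A few points on an operating characteristic curve can be computed directly;
the plan below (n = 100, accept the lot if Y <= 2) is a hypothetical
illustration, not one from the workbook:

```python
from math import comb

def p_accept(p, n=100, c=2):
    """Probability of lot acceptance: P(Y <= c) for Y ~ binomial(n, p)."""
    return sum(comb(n, y) * p ** y * (1 - p) ** (n - y) for y in range(c + 1))

for p in (0.01, 0.02, 0.05, 0.10):
    print(p, round(p_accept(p), 3))
```

As the true fraction defective p rises, the acceptance probability falls;
plotting p_accept against p traces out the OC curve.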

-------
                                    HANDOUT #3





III.  Tests of Hypotheses and Confidence Intervals





      I.   Inferences Concerning the Mean of a Population.



          1.  Sample size, n, is large (say, n > 30).



              Null Hypothesis:  μ = μ_0



              Alternative Hypothesis:  μ ≠ μ_0  (two-tailed test)



              Test Statistic:

                                     Ȳ - μ_0
                                Z = ---------
                                      σ/√n

                 If σ² is unknown (as is usually the case), use

                                n
                           S² = Σ (Y_i - Ȳ)²/(n - 1)
                               i=1

                 as an estimate of σ².



              Rejection Region:  for α = 0.05, say,

                 Reject H_0 if |Z| is greater than 1.96.

              95% Confidence Interval:  Ȳ ± 1.96 S/√n.
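
A sketch of this procedure applied to the 25 concentration readings from
Handout 1 (n = 25 is a little below the usual large-sample cutoff, and the
null value mu_0 = 1000 is our hypothetical choice):

```python
data = [1060, 1000, 1200, 1050, 1240,
        1290,  970, 1290, 1160, 1070,
        1020,  890, 1210, 1390, 1130,
        1120,  830, 1030, 1180,  910,
         900, 1170, 1150,  990,  950]
n = len(data)
ybar = sum(data) / n
s = (sum((y - ybar) ** 2 for y in data) / (n - 1)) ** 0.5
mu0 = 1000.0

se = s / n ** 0.5
z = (ybar - mu0) / se                        # about 3.10, so reject H0
ci = (ybar - 1.96 * se, ybar + 1.96 * se)    # roughly (1032, 1144)
print(z, ci)
```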
          2.   Small sample (say, n < 30), and the sampled population  is approximately



              normally distributed.



              Null Hypothesis:  μ = μ_0



              Alternative Hypothesis:  μ ≠ μ_0  (two-tailed test)



              Test Statistic:

                                     Ȳ - μ_0
                                t = ---------
                                      S/√n

              Rejection Region:  for α = 0.05, say, reject H_0 if |t| is greater

                                 than t_.975, n-1.

              95% Confidence Interval:  Ȳ ± t_.975, n-1 S/√n.
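
The same test sketched on a small sample: the five measurements 2, 3, 3, 6, 1
from Handout 1, against a hypothetical null value mu_0 = 2:

```python
y = [2, 3, 3, 6, 1]
n = len(y)
ybar = sum(y) / n                                         # 3.0
s = (sum((yi - ybar) ** 2 for yi in y) / (n - 1)) ** 0.5  # sqrt(3.5)
mu0 = 2.0

t = (ybar - mu0) / (s / n ** 0.5)
t_crit = 2.776        # t_.975 with n - 1 = 4 degrees of freedom
print(t, abs(t) > t_crit)   # about 1.20, so do not reject H0
```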


     II.   Inferences Concerning the Difference Between the Means of Two Populations.



          1.   Large Samples



              A.  Assumptions:



                  a)  Population I has mean equal to μ_1 and variance equal to σ_1².

-------
    b)  Population II has mean equal to μ_2 and variance equal to σ_2².


B.  Some results:



       Let Ȳ_1 be the mean of a random sample of n_1 observations from

    population I and Ȳ_2 be the mean of a random sample of n_2 observa-

    tions from population II.  Consider the difference, (Ȳ_1 - Ȳ_2).

       It can be shown that the mean of (Ȳ_1 - Ȳ_2) is (μ_1 - μ_2) and its

    variance is

                  σ_1²    σ_2²
                  ----- + -----  .
                   n_1     n_2

    Furthermore, for large samples, (Ȳ_1 - Ȳ_2)

    will be approximately normally distributed.



C.  Testing Procedure



    Null Hypothesis:  μ_1 - μ_2 = D_0



            (Note:  We are usually testing the hypothesis that

             μ_1 - μ_2 = 0, i.e., μ_1 = μ_2)



    Alternative Hypothesis:  μ_1 - μ_2 ≠ D_0  (two-tailed test)



    Test Statistic:

                          (Ȳ_1 - Ȳ_2) - D_0
                    Z = ----------------------
                        √(σ_1²/n_1 + σ_2²/n_2)

    If the null hypothesis is that μ_1 = μ_2, then D_0 = 0.

    If σ_1² and σ_2² are unknown, use S_1² and S_2² as estimates of σ_1² and

    σ_2², respectively.



    Rejection Region:  for a = 0.05,  say, reject H  if |z|  is greater



    than 1.96.
    95% Confidence Interval:  (Ȳ_1 - Ȳ_2) ± 1.96 √(S_1²/n_1 + S_2²/n_2).
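
A sketch applying this statistic to the city A / city B readings from
Handout 1 (illustrative only: with n = 5 per city, the large-sample normal
approximation is rough):

```python
city_a = [760, 790, 820, 805, 750]
city_b = [770, 840, 780, 810, 830]

def mean_var(y):
    """Sample mean and sample variance (divisor n - 1)."""
    m = sum(y) / len(y)
    return m, sum((yi - m) ** 2 for yi in y) / (len(y) - 1)

m1, v1 = mean_var(city_a)   # 785.0, 875.0
m2, v2 = mean_var(city_b)   # 806.0, 930.0

z = (m2 - m1) / (v1 / len(city_a) + v2 / len(city_b)) ** 0.5
print(z)   # 21/19, about 1.11 -- well under 1.96
```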

-------
      2.   Small samples

           A.   Further Assumptions:

                  Both populations approximately normally distributed and σ_1 = σ_2.
-------
III.  Inferences Concerning a Binomial Probability, p.

     2.   Results:

             The estimator of p is p̂ = Y/n.

             The average value of p̂ is p.

             The variance of p̂ is equal to pq/n, where q = (1 - p).
     3.   Testing Procedure (large n):

          Null Hypothesis:  p = p_0

          Alternative Hypothesis:  p ≠ p_0  (two-tailed test)

          Test Statistic:
                                 p̂ - p_0
                           Z = -----------
                               √(p_0q_0/n)

          Rejection Region:  for α = 0.05, say, reject H_0 if |Z| > 1.96.

          95% Confidence Interval:  p̂ ± 1.96 √(p̂q̂/n),  where q̂ = 1 - p̂.
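
A sketch of the procedure with hypothetical numbers: Y = 10 defectives in
n = 100 items, testing H_0: p = .05:

```python
n, y = 100, 10
p0 = 0.05
phat = y / n

z = (phat - p0) / (p0 * (1 - p0) / n) ** 0.5    # about 2.29, reject at .05
half = 1.96 * (phat * (1 - phat) / n) ** 0.5    # 1.96 times .03
ci = (phat - half, phat + half)                 # about (.041, .159)
print(z, ci)
```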
 IV.  Inferences Concerning the Difference Between Two Binomial Probabilities,

          p_1 and p_2.

     1.   Assumption:  Random samples of sizes n_1 and n_2 from each of two

                       binomial populations with parameters p_1 and p_2.

     2.   Results:  The estimated difference between p_1 and p_2 is

                         (p̂_1 - p̂_2) = Y_1/n_1 - Y_2/n_2.

          The average value of (p̂_1 - p̂_2) is (p_1 - p_2).

          The variance of (p̂_1 - p̂_2) is

                         p_1q_1/n_1 + p_2q_2/n_2.

     3.   Testing Procedure (n_1 and n_2 large):

          Null Hypothesis:  p_1 = p_2  (= p, say)

          Alternative:  p_1 ≠ p_2  (two-tailed test)

-------
                                    (p̂_1 - p̂_2)                       Y_1 + Y_2
          Test Statistic:  Z = ----------------------,  where  p̂ = -----------
                               √(p̂q̂(1/n_1 + 1/n_2))                 n_1 + n_2

          Rejection Region:  for α = 0.05, say, reject H_0 if |Z| > 1.96.

          95% Confidence Interval:  (p̂_1 - p̂_2) ± 1.96 √(p̂_1q̂_1/n_1 + p̂_2q̂_2/n_2).
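
A sketch of the pooled statistic with hypothetical counts: Y_1 = 10 of
n_1 = 100 versus Y_2 = 25 of n_2 = 100:

```python
y1, n1 = 10, 100
y2, n2 = 25, 100
p1, p2 = y1 / n1, y2 / n2

pbar = (y1 + y2) / (n1 + n2)     # pooled estimate, 0.175
se = (pbar * (1 - pbar) * (1 / n1 + 1 / n2)) ** 0.5
z = (p1 - p2) / se               # about -2.79, reject at .05
print(z)
```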
 V.  Tests Concerning the Variance of a Population.
     1.  Assumption:  Random sample from a normally distributed population.
     2.  Testing Procedure
          Null Hypothesis:  σ² = σ_0²

          Alternative Hypothesis:  σ² > σ_0²  (one-tailed test)

          Test Statistic:
                                   (n - 1)S²
                              χ² = ----------
                                      σ_0²

          Rejection Region:  for α = 0.05, say, reject H_0 if χ² is greater than

                             χ²_.95, n-1.
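
A sketch of the chi-square statistic for the five measurements 2, 3, 3, 6, 1,
against a hypothetical null value sigma_0² = 1:

```python
y = [2, 3, 3, 6, 1]
n = len(y)
ybar = sum(y) / n
s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)   # 3.5
sigma0_sq = 1.0

chi2 = (n - 1) * s2 / sigma0_sq                    # 14.0
chi2_crit = 9.488       # chi-square .95 point, 4 degrees of freedom
print(chi2, chi2 > chi2_crit)    # reject H0
```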

VI.  Tests for Comparing Two Population Variances.
     1.  Assumptions:
          A.  Population I has a normal distribution with mean μ_1 and variance σ_1².

          B.  Population II has a normal distribution with mean μ_2 and variance σ_2².

          C.  Random samples of sizes n_1 and n_2 are selected from the two

              populations.

      2.  Testing Procedure

          Null Hypothesis:  σ_1² = σ_2²

          Alternative Hypothesis:  σ_1² > σ_2²  (one-tailed test)

          Test Statistic:  F = S_1²/S_2²

-------
Rejection Region:   for α = 0.05, say, reject H_0 if F is greater than

                    F_.95, n_1-1, n_2-1.
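
A sketch of the F statistic using the city A / city B samples from Handout 1
(with the city B variance in the numerator, since the one-tailed alternative
puts the suspected larger variance first):

```python
city_a = [760, 790, 820, 805, 750]
city_b = [770, 840, 780, 810, 830]

def var(y):
    """Sample variance with divisor n - 1."""
    m = sum(y) / len(y)
    return sum((yi - m) ** 2 for yi in y) / (len(y) - 1)

f = var(city_b) / var(city_a)    # 930/875, about 1.06
f_crit = 6.39     # F .95 point with (4, 4) degrees of freedom
print(f, f > f_crit)   # do not reject H0
```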

-------
TABLE A-1  Standard normal cumulative probabilities

   z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09

 -3.8  .0001  .0001  .0001  .0001  .0001  .0001  .0001  .0001  .0001  .0001
 -3.7  .0001  .0001  .0001  .0001  .0001  .0001  .0001  .0001  .0001  .0001
 -3.6  .0002  .0002  .0001  .0001  .0001  .0001  .0001  .0001  .0001  .0001
 -3.5  .0002  .0002  .0002  .0002  .0002  .0002  .0002  .0002  .0002  .0002
 -3.4  .0003  .0003  .0003  .0003  .0003  .0003  .0003  .0003  .0003  .0002
 -3.3  .0005  .0005  .0005  .0004  .0004  .0004  .0004  .0004  .0004  .0003
 -3.2  .0007  .0007  .0006  .0006  .0006  .0006  .0006  .0005  .0005  .0005
 -3.1  .0010  .0009  .0009  .0009  .0008  .0008  .0008  .0008  .0007  .0007
 -3.0  .0013  .0013  .0013  .0012  .0012  .0011  .0011  .0011  .0010  .0010
 -2.9  .0019  .0018  .0018  .0017  .0016  .0016  .0015  .0015  .0014  .0014
 -2.8  .0026  .0025  .0024  .0023  .0023  .0022  .0021  .0021  .0020  .0019
 -2.7  .0035  .0034  .0033  .0032  .0031  .0030  .0029  .0028  .0027  .0026
 -2.6  .0047  .0045  .0044  .0043  .0041  .0040  .0039  .0038  .0037  .0036
 -2.5  .0062  .0060  .0059  .0057  .0055  .0054  .0052  .0051  .0049  .0048
 -2.4  .0082  .0080  .0078  .0075  .0073  .0071  .0069  .0068  .0066  .0064
 -2.3  .0107  .0104  .0102  .0099  .0096  .0094  .0091  .0089  .0087  .0084
 -2.2  .0139  .0136  .0132  .0129  .0125  .0122  .0119  .0116  .0113  .0110
 -2.1  .0179  .0174  .0170  .0166  .0162  .0158  .0154  .0150  .0146  .0143
 -2.0  .0228  .0222  .0217  .0212  .0207  .0202  .0197  .0192  .0188  .0183
 -1.9  .0287  .0281  .0274  .0268  .0262  .0256  .0250  .0244  .0239  .0233
 -1.8  .0359  .0351  .0344  .0336  .0329  .0322  .0314  .0307  .0301  .0294
 -1.7  .0446  .0436  .0427  .0418  .0409  .0401  .0392  .0384  .0375  .0367
 -1.6  .0548  .0537  .0526  .0516  .0505  .0495  .0485  .0475  .0465  .0455
 -1.5  .0668  .0655  .0643  .0630  .0618  .0606  .0594  .0582  .0571  .0559
 -1.4  .0808  .0793  .0778  .0764  .0749  .0735  .0721  .0708  .0694  .0681
 -1.3  .0968  .0951  .0934  .0918  .0901  .0885  .0869  .0853  .0838  .0823
 -1.2  .1151  .1131  .1112  .1093  .1075  .1056  .1038  .1020  .1003  .0985
 -1.1  .1357  .1335  .1314  .1292  .1271  .1251  .1230  .1210  .1190  .1170
 -1.0  .1587  .1562  .1539  .1515  .1492  .1469  .1446  .1423  .1401  .1379
 -0.9  .1841  .1814  .1788  .1762  .1736  .1711  .1685  .1660  .1635  .1611
 -0.8  .2119  .2090  .2061  .2033  .2005  .1977  .1949  .1922  .1894  .1867
 -0.7  .2420  .2389  .2358  .2327  .2296  .2266  .2236  .2206  .2177  .2148
 -0.6  .2743  .2709  .2676  .2643  .2611  .2578  .2546  .2514  .2483  .2451
 -0.5  .3085  .3050  .3015  .2981  .2946  .2912  .2877  .2843  .2810  .2776
 -0.4  .3446  .3409  .3372  .3336  .3300  .3264  .3228  .3192  .3156  .3121
 -0.3  .3821  .3783  .3745  .3707  .3669  .3632  .3594  .3557  .3520  .3483
 -0.2  .4207  .4168  .4129  .4090  .4052  .4013  .3974  .3936  .3897  .3859
 -0.1  .4602  .4562  .4522  .4483  .4443  .4404  .4364  .4325  .4286  .4247
 -0.0  .5000  .4960  .4920  .4880  .4840  .4801  .4761  .4721  .4681  .4641

-------
TABLE A-1  Standard normal cumulative probabilities (continued)

   z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09

  0.0  .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
  0.1  .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
  0.2  .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
  0.3  .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
  0.4  .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
  0.5  .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
  0.6  .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
  0.7  .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
  0.8  .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
  0.9  .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
  1.0  .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
  1.1  .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
  1.2  .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
  1.3  .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
  1.4  .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319
  1.5  .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
  1.6  .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
  1.7  .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
  1.8  .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
  1.9  .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
  2.0  .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
  2.1  .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
  2.2  .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
  2.3  .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
  2.4  .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
  2.5  .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
  2.6  .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
  2.7  .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
  2.8  .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
  2.9  .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986
  3.0  .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
  3.1  .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993
  3.2  .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995
  3.3  .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997
  3.4  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998
  3.5  .9998  .9998  .9998  .9998  .9998  .9998  .9998  .9998  .9998  .9998
  3.6  .9998  .9998  .9999  .9999  .9999  .9999  .9999  .9999  .9999  .9999
  3.7  .9999  .9999  .9999  .9999  .9999  .9999  .9999  .9999  .9999  .9999
  3.8  .9999  .9999  .9999  .9999  .9999  .9999  .9999  .9999  .9999  .9999
  3.9 1.0000

-------
 TABLE A-1  Standard normal cumulative probabilities (continued)

     P(Z < z)       z           P(Z < z)       z

     0.00001     -4.265          0.50        0.000
     0.00005     -3.891          0.55        0.126
     0.0001      -3.719          0.60        0.253
     0.0005      -3.291          0.65        0.385
     0.001       -3.090          0.70        0.524
     0.005       -2.576          0.75        0.674
     0.01        -2.326          0.80        0.842
     0.02        -2.054          0.85        1.036
     0.025       -1.960          0.90        1.282
     0.03        -1.881          0.91        1.341
     0.04        -1.751          0.92        1.405
     0.05        -1.645          0.93        1.476
     0.06        -1.555          0.94        1.555
     0.07        -1.476          0.95        1.645
     0.08        -1.405          0.96        1.751
     0.09        -1.341          0.97        1.881
     0.10        -1.282          0.975       1.960
     0.15        -1.036          0.98        2.054
     0.20        -0.842          0.99        2.326
     0.25        -0.674          0.995       2.576
     0.30        -0.524          0.999       3.090
     0.35        -0.385          0.9995      3.291
     0.40        -0.253          0.9999      3.719
     0.45        -0.126          0.99995     3.891
     0.50         0.000          0.99999     4.265

The table entry is the area under the standard normal curve to the left of
the indicated z value, thus giving P(Z < z).

-------
TABLE A-2  Percentiles of the t distribution

  df      55     65     75     85     90     95    97.5     99    99.5   99.95

   1    0.158  0.510  1.000  1.963  3.078  6.314 12.706  31.821  63.657 636.619
   2    0.142  0.445  0.816  1.386  1.886  2.920  4.303   6.965   9.925  31.599
   3    0.137  0.424  0.765  1.250  1.638  2.353  3.182   4.541   5.841  12.924
   4    0.134  0.414  0.741  1.190  1.533  2.132  2.776   3.747   4.604   8.610
   5    0.132  0.408  0.727  1.156  1.476  2.015  2.571   3.365   4.032   6.869
   6    0.131  0.404  0.718  1.134  1.440  1.943  2.447   3.143   3.707   5.959
   7    0.130  0.402  0.711  1.119  1.415  1.895  2.365   2.998   3.499   5.408
   8    0.130  0.399  0.706  1.108  1.397  1.860  2.306   2.896   3.355   5.041
   9    0.129  0.398  0.703  1.100  1.383  1.833  2.262   2.821   3.250   4.781
  10    0.129  0.397  0.700  1.093  1.372  1.812  2.228   2.764   3.169   4.587
  11    0.129  0.396  0.697  1.088  1.363  1.796  2.201   2.718   3.106   4.437
  12    0.128  0.395  0.695  1.083  1.356  1.782  2.179   2.681   3.055   4.318
  13    0.128  0.394  0.694  1.079  1.350  1.771  2.160   2.650   3.012   4.221
  14    0.128  0.393  0.692  1.076  1.345  1.761  2.145   2.624   2.977   4.140
  15    0.128  0.393  0.691  1.074  1.341  1.753  2.131   2.602   2.947   4.073
  16    0.128  0.392  0.690  1.071  1.337  1.746  2.120   2.583   2.921   4.015
  17    0.128  0.392  0.689  1.069  1.333  1.740  2.110   2.567   2.898   3.965
  18    0.127  0.392  0.688  1.067  1.330  1.734  2.101   2.552   2.878   3.922
  19    0.127  0.391  0.688  1.066  1.328  1.729  2.093   2.539   2.861   3.883
  20    0.127  0.391  0.687  1.064  1.325  1.725  2.086   2.528   2.845   3.850
  21    0.127  0.391  0.686  1.063  1.323  1.721  2.080   2.518   2.831   3.819
  22    0.127  0.390  0.686  1.061  1.321  1.717  2.074   2.508   2.819   3.792
  23    0.127  0.390  0.685  1.060  1.319  1.714  2.069   2.500   2.807   3.768
  24    0.127  0.390  0.685  1.059  1.318  1.711  2.064   2.492   2.797   3.745
  25    0.127  0.390  0.684  1.058  1.316  1.708  2.060   2.485   2.787   3.725
  26    0.127  0.390  0.684  1.058  1.315  1.706  2.056   2.479   2.779   3.707
  27    0.127  0.389  0.684  1.057  1.314  1.703  2.052   2.473   2.771   3.690
  28    0.127  0.389  0.683  1.056  1.313  1.701  2.048   2.467   2.763   3.674
  29    0.127  0.389  0.683  1.055  1.311  1.699  2.045   2.462   2.756   3.659
  30    0.127  0.389  0.683  1.055  1.310  1.697  2.042   2.457   2.750   3.646
  35    0.127  0.388  0.682  1.052  1.306  1.690  2.030   2.438   2.724   3.591
  40    0.126  0.388  0.681  1.050  1.303  1.684  2.021   2.423   2.704   3.551
  45    0.126  0.388  0.680  1.049  1.301  1.679  2.014   2.412   2.690   3.520
  50    0.126  0.388  0.679  1.047  1.299  1.676  2.009   2.403   2.678   3.496
  60    0.126  0.387  0.679  1.045  1.296  1.671  2.000   2.390   2.660   3.460
  70    0.126  0.387  0.678  1.044  1.294  1.667  1.994   2.381   2.648   3.435
  80    0.126  0.387  0.678  1.043  1.292  1.664  1.990   2.374   2.639   3.416
  90    0.126  0.387  0.677  1.042  1.291  1.662  1.987   2.368   2.632   3.402
 100    0.126  0.386  0.677  1.042  1.290  1.660  1.984   2.364   2.626   3.390
 120    0.126  0.386  0.677  1.041  1.289  1.658  1.980   2.358   2.617   3.373
 140    0.126  0.386  0.676  1.040  1.288  1.656  1.977   2.353   2.611   3.361
 160    0.126  0.386  0.676  1.040  1.287  1.654  1.975   2.350   2.607   3.352
 180    0.126  0.386  0.676  1.039  1.286  1.653  1.973   2.347   2.603   3.345
 200    0.126  0.386  0.676  1.039  1.286  1.653  1.972   2.345   2.601   3.340
   ∞    0.126  0.385  0.674  1.036  1.282  1.645  1.960   2.326   2.576   3.291

-------
TABLE A-3   Percentiles of the chi-square distribution
                                                           (b) Chi-square distribution

[Table body not reproduced: chi-square percentiles at selected percentage points from 0.5 to 99.9, for degrees of freedom 1 through 200.]

-------
TABLE A-4   Percentiles of the F distribution
Upper 25% point of the F distribution
                                                     (c) F distribution

[Table body not reproduced: upper 25% points of F for a range of numerator and denominator degrees of freedom.]

-------
TABLE A-4   Percentiles of the F distribution (continued)
Upper 10% point of the F distribution

[Table body not reproduced: upper 10% points of F for a range of numerator and denominator degrees of freedom.]
-------
TABLE A-4   Percentiles of the F distribution (continued)
Upper 5% point of the F distribution

[Table body not reproduced: upper 5% points of F for a range of numerator and denominator degrees of freedom.]
-------
TABLE A-4   Percentiles of the F distribution (continued)
Upper 2.5% point of the F distribution

[Table body not reproduced: upper 2.5% points of F for a range of numerator and denominator degrees of freedom.]

-------
TABLE A-4    Percentiles of the F distribution (continued)
Upper 1% point of the F distribution
                          DEGREES OF FREEDOM FOR NUMERATOR
[Tabulated values illegible in this reproduction; omitted.]

-------
TABLE A-4    Percentiles of the F distribution (continued)
Upper 0.5% point of the F distribution
                          DEGREES OF FREEDOM FOR NUMERATOR
[Tabulated values illegible in this reproduction; omitted.]

-------
TABLE A-4    Percentiles of the F distribution (continued)
Upper 0.1% point of the F distribution
                          DEGREES OF FREEDOM FOR NUMERATOR
[Tabulated values illegible in this reproduction; omitted.]

-------
                              HANDOUT #4

IV.   Fundamentals of Experimentation.

     1.   Information Available in a Sample:

          Generally speaking, the information available in a sample is dependent
          upon:

          A.   The sample size - information increases as n increases.

          B.   The amount of variability in the measurements - experimental
               error.  Information increases as the experimental error decreases.

          Suppose that Ȳ is used as an estimator of the population mean.  It is
          known that, regardless of the population distribution, the standard
          deviation of the mean Ȳ of a random sample of n measurements drawn from
          that population is equal to σ/√n.  We would consider σ_Ȳ = σ/√n to be
          related to the amount of information that would be available in Ȳ.
          Note that σ_Ȳ decreases as σ decreases and as n increases.
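The relation σ_Ȳ = σ/√n is easy to check numerically.  A minimal sketch (Python; the values of σ and n below are illustrative assumptions, not taken from the handout):

```python
import math

def std_error_of_mean(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Illustrative population standard deviation (an assumed value).
sigma = 3.0

# The standard error shrinks as n grows, so the information in
# Y-bar about the population mean increases with sample size.
for n in (1, 4, 9, 25, 100):
    print(n, std_error_of_mean(sigma, n))
```

Note that quadrupling n only halves σ_Ȳ, which is why each additional gain in precision costs progressively more observations.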

     2.   Statistically Designed Experiments:

          Purpose:   To design the experiment (not the equipment, but the manner
          in which the data will be selected) so as to minimize the experimental
          error.



          Example:   Consider the following data representing comparative readings
          of two samplers, A and B, taken on five consecutive days.

               Day      Sampler A      Sampler B      Difference
                1          10.2           10.6            .4
                2           9.4            9.8            .4
                3          11.8           12.3            .5
                4           9.1            9.7            .6
                5           8.3            8.8            .5

               Means:   Ȳ_A = 9.76 ,   Ȳ_B = 10.24 ,   d̄ = Ȳ_B - Ȳ_A = 0.48

-------
    We note that the variability from day to day is large compared with
    the difference between the means.  This variability can be reduced by
    making a comparison between A and B on each day (i.e., by looking at
    the paired differences).  Variability can be reduced by blocking so
    as to eliminate unnecessary experimental error.

         a)  Comparisons are made within blocks.

         b)  The blocks in the above example are days.
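The effect of blocking can be verified directly from the sampler data above.  A short Python sketch (the computation is mine, not part of the original handout):

```python
import statistics as st

sampler_a = [10.2, 9.4, 11.8, 9.1, 8.3]
sampler_b = [10.6, 9.8, 12.3, 9.7, 8.8]

# Day-to-day variability within each sampler is large...
sd_a = st.stdev(sampler_a)          # about 1.33
sd_b = st.stdev(sampler_b)          # about 1.32

# ...but the within-day (paired) differences are nearly constant,
# because pairing removes the common day effect (the "block").
diffs = [b - a for a, b in zip(sampler_a, sampler_b)]
sd_d = st.stdev(diffs)              # about 0.08

print(round(st.mean(diffs), 2))     # mean difference d-bar = 0.48
```

The spread of the paired differences is more than an order of magnitude smaller than the day-to-day spread, which is exactly the experimental error that blocking eliminates.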



3.  Linear Models for Experimental Data.

    a)  Fitting a straight line to a set of points.

        Model:  Y_i = B_0 + B_1·x_i + e_i

        Y_i = observed value of Y at x_i, i = 1, 2, ..., n.

        e_i = random error with average equal to zero and variance equal
              to σ².

    Average Value of Y at x_i is E(Y_i) = B_0 + B_1·x_i.
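A minimal least-squares fit of the straight-line model (Python; the data points are hypothetical, chosen to lie exactly on a line so the fitted coefficients are obvious):

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the model Y_i = B_0 + B_1*x_i + e_i."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx              # slope estimate
    b0 = ybar - b1 * xbar       # intercept estimate
    return b0, b1

# Hypothetical points lying exactly on Y = 2 + 3x; the fit recovers them.
x = [0, 1, 2, 3, 4]
y = [2, 5, 8, 11, 14]
b0, b1 = fit_line(x, y)
print(b0, b1)   # 2.0 3.0
```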



    b)  Fitting a cubic curve.

        Model:  Y_i = B_0 + B_1·x_i + B_2·x_i² + B_3·x_i³ + e_i


    c)  A comparative test, two treatments, no blocking.

        Model:  Y_ij = μ + T_i + e_ij ,  i = 1, 2 and j = 1, 2, ..., n_i.

        Y_ij = j-th observation on i-th treatment.

        μ = overall mean.

        T_1 = additive effect due to Treatment 1.

        T_2 = additive effect due to Treatment 2.

        e_ij = random error.

-------
    d)  Comparative test, two treatments, with blocking.

        Model:  Y_ij = μ + T_i + B_j + e_ij ,  i = 1, 2 and j = 1, 2, ..., b.

        Y_ij = observation associated with i-th treatment in j-th block.

        μ = overall mean.

        T_i = additive effect due to Treatment i.

        B_j = additive effect due to Block j.

        e_ij = random error.


4.  A statistical problem associated with experimental models is to
    estimate or test hypotheses concerning the parameters that appear in
    the model.

-------
                                     HANDOUT #5

                                   EXERCISE SET 1

1.  After the implementation of an air pollution abatement strategy designed to
    lower the average concentration of a certain air pollutant to below 5 ppm,
    a random sample of six concentration measurements gave the values 4.9,  5.1,
    4.9, 5.0, 5.0, and 4.7 ppm.  Do these data provide sufficient evidence  that
    the true mean concentration is now less than 5 ppm?  (Use α = .05)

2.  The EPA suspects that the average concentration of a certain air pollutant  is
    significantly higher in community A than it is in community B.   To check this,
    five randomly selected air samples were taken at strategically located  sites
    in each city.  The data obtained are presented in tabular form below:
              COMMUNITY A:   4.8, 5.2, 5.0, 4.9, 5.1
              COMMUNITY B:   5.0, 4.7, 4.9, 4.8, 4.9
    (a)  Do the data provide sufficient evidence to indicate a significant difference
         in mean concentration between communities A and B?  (Use α = .05).

    (b)  Find a 95% confidence interval for the true difference in mean concentration
         between communities A and B.

3.  The following data pertain to measurements of air quality in two cities:

                                            CITY A       CITY B

                  Number of observations:      40          60

                             Sample mean:      15          13

               Sample standard deviation:       2           3

    Is there evidence (1% level) that city A has a higher mean air quality level
    than city B?

4.  The Environmental Protection Agency conducted an experiment to assess the
    characteristics of sampling procedures designed to measure the concentration
    of a certain air pollutant in a particular city.  At a particular location,
    two identical sampling units (denoted  Sampler 1 and Sampler 2) were set up
    side-by-side and then readings were taken on each of ten days.  The data  are
    as follows:

-------
                                  - 2 -
               DAY:        1   2   3   4   5   6   7   8   9   10

         SAMPLER 1:        6   3   1  10  17   4   8  12  10   20

         SAMPLER 2:        5   1   2  12  17   7   6  12   9   19
One particular question of interest to EPA is  whether there  is  any  evidence
of a difference between the readings of the adjacent sampling units.  Test
the appropriate hypothesis at the 5% level of  significance.

-------
                          HANDOUT #6

                   SOLUTIONS TO EXERCISE SET 1

1.  Ȳ = 29.6/6 = 4.93

    S² = [ΣY² - (ΣY)²/n]/(n-1) = [146.12 - (29.6)²/6]/(6-1) = .018 ,  S = .134

    H0:  μ = 5
    Ha:  μ < 5

    test statistic:  t0 = (Ȳ - 5)/(S/√n)

    rejection region:  reject H0 if t0 < -t.95,5 = -2.015

    t0 = (4.93 - 5)/(.134/√6) = -1.28  =>  DO NOT REJECT H0.
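Solution 1 can be reproduced in a few lines (Python, standard library only; the critical value -2.015 is taken from the solution's t table, and full precision is carried rather than the rounded intermediates, so the statistic comes out near -1.2 rather than exactly -1.28):

```python
import math
import statistics as st

data = [4.9, 5.1, 4.9, 5.0, 5.0, 4.7]
n = len(data)
ybar = st.mean(data)
s = st.stdev(data)

# t statistic for H0: mu = 5 against Ha: mu < 5
t0 = (ybar - 5) / (s / math.sqrt(n))

# One-sided rejection region; -t(.95, 5) = -2.015 from the solution.
reject = t0 < -2.015
print(round(t0, 2), reject)
```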
2.  a)  Ȳ_A = 25.0/5 = 5.00 ,  Ȳ_B = 24.3/5 = 4.86

        Σ(Y_Ai - Ȳ_A)² = 125.10 - (25.0)²/5 = 0.10

        Σ(Y_Bi - Ȳ_B)² = 118.15 - (24.3)²/5 = 0.05

        S_p² = (0.10 + 0.05)/(5 + 5 - 2) = .019 ,  S_p = .137

        H0:  μ_A = μ_B
        Ha:  μ_A > μ_B

        test statistic:  t0 = [(Ȳ_A - Ȳ_B) - 0]/(S_p·√(1/n_A + 1/n_B))

        rejection region:  reject when t0 > t.95,8 = 1.860

        t0 = (5.00 - 4.86)/(.137·√(1/5 + 1/5)) = 1.62  =>  DO NOT REJECT H0.

    b)  95% confidence interval for (μ_A - μ_B) is of the form:

        (Ȳ_A - Ȳ_B) ± t.975,8 · S_p·√(1/n_A + 1/n_B)

        or  (5.00 - 4.86) ± (2.306)(.137)·√(1/5 + 1/5)

        or  0.14 ± 0.20, or (-0.06, +0.34).
3'   V   ^A^B'V
                      > yB> test statistic:
                                                 Z .
                                                  (YA -  V  -
rejection region:   reject when Z > 2.33.
Z =
    (15 "
     O. + .1
   /X/40   60
4.
     DAY
                  =  4  =>  REJECT H
                      23456789    10
di-Yl
i'Y2i

+1 •»
2 -1
2 0
-3
> 2 0
•f 1 +1
* - Yl - Y2 ' 15
    t-\
        25 -
             or
              10
                      = 2.77  .
Ho:  Pd
                   «  0
test statistic:   tg =
                      d  - 0
                     ^
rejection region:   reject when  |tg| > t g75 g = 2.262.
t _
     °'10 " °
                      => DO NOT REJECT H .   This is an example  of a PAIRED



                                            DIFFERENCE EXPERIMENT.
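These computations can be re-traced with a short Python sketch (Python is not part of the workbook; the numbers are the ones used in the solutions above):

```python
import math

# Problem 2(a): pooled two-sample t statistic (values from the solution).
ybar_a, ybar_b = 5.00, 4.86
ss_a, ss_b = 0.10, 0.05              # within-sample sums of squares
n_a = n_b = 5
sp = math.sqrt((ss_a + ss_b) / (n_a + n_b - 2))
t_2a = (ybar_a - ybar_b) / (sp * math.sqrt(1 / n_a + 1 / n_b))
print(round(sp, 3), round(t_2a, 2))  # 0.137 1.62  (1.62 < 1.860: do not reject)

# Problem 3: large-sample Z for two independent means.
z_3 = (15 - 13) / math.sqrt(2 ** 2 / 40 + 3 ** 2 / 60)
print(z_3)                           # 4.0  (> 2.33: reject Ho)

# Problem 4: paired-difference t test on the sampler data.
s1 = [6, 3, 1, 10, 17, 4, 8, 12, 10, 20]
s2 = [5, 1, 2, 12, 17, 7, 6, 12, 9, 19]
d = [a - b for a, b in zip(s1, s2)]
n = len(d)
dbar = sum(d) / n
sd2 = (sum(x * x for x in d) - sum(d) ** 2 / n) / (n - 1)
t_4 = dbar / math.sqrt(sd2 / n)
print(round(dbar, 2), round(sd2, 2), round(t_4, 2))  # 0.1 2.77 0.19
```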

-------
                                 HANDOUT #7



              SOME ASPECTS OF THE ANALYSIS OF AIR QUALITY DATA



I.  Some Common Statistical Problems Associated With Air Quality Data

    1.  COLLABORATIVE TESTING

        a)  Design of an Interlaboratory Study (number of labs, number of
            materials, number of replicates).

        b)  Measures of precision and accuracy to pinpoint sources of error.

    2.  ANALYSIS OF POLLUTANT CONCENTRATION DATA

        a)  Distribution of concentration data (e.g., lognormal).

        b)  Summary statistics (e.g., geometric or arithmetic mean, percentiles,
            maximum values; time interval:  1-hour, 8-hour, 24-hour, weekly,
            monthly, annual).

        c)  Comparing air quality to standards (What is a violation?  Sample
            size dependency:  the higher the sampling frequency, the higher the
            chance of recording a violation).

        d)  Sampling plans (how often, how long, and where to sample).

        e)  Trends analysis (dependency on time period chosen; seasonal
            patterns; combining data from different sampling sites).

II.  Some Statistical Techniques Useful in Analyzing Air Quality Data

    1.  TRANSFORMATIONS

        a)  Purposes:  to achieve approximate normality; to stabilize variance;
                       to make a non-linear model linear.

        b)  Examples:  log transformation; square root transformation.

    2.  PARAMETRIC STATISTICAL PROCEDURES:  dependent on certain assumptions
        which may not be valid in certain situations.

    3.  NONPARAMETRIC STATISTICAL PROCEDURES:  When the assumptions made for
        certain parametric procedures do not hold (and transformations or other
        manipulations do not help), then the application of various
        nonparametric approaches is warranted.  See Hollander & Wolfe,
        "Nonparametric Statistical Methods", Wiley and Sons, 1973, New York.
        Also, see Conover, "Practical Nonparametric Statistics", Wiley and
        Sons, 1971, New York.
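As an aside on the transformations listed under II.1, a brief Python sketch (not part of the workbook; the lognormal parameters are invented for illustration) shows why the log scale and the geometric mean suit lognormal concentration data:

```python
import math
import random
import statistics

# Simulated pollutant concentrations: lognormal, so log(conc) is normal.
# (mu = 3.0 and sigma = 0.8 are illustrative values, not from the workbook.)
random.seed(1)
conc = [math.exp(random.gauss(3.0, 0.8)) for _ in range(500)]

logs = [math.log(c) for c in conc]
geo_mean = math.exp(statistics.fmean(logs))   # geometric mean
arith_mean = statistics.fmean(conc)

# For right-skewed data the arithmetic mean is pulled above the
# geometric mean by the occasional very high reading.
print(arith_mean > geo_mean)                  # True
```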

         1.  SIGN TEST (a non-parametric alternative to the paired
                       t-test)

Uses:  Applicable to the case of two related samples (e.g., comparison of

       two adjacent instruments in order to check for biases, or for

       checking for bias relative to a known standard.)

Limitations:  Considers only signs (and ignores magnitudes) of differences,

              which suggests a possible loss in sensitivity.

Ho:  median difference is zero; i.e., P(X > Y) = P(Y > X) = 1/2

Procedure:

a)  Determine the sign of the algebraic difference between each of the
    pairs of data points; assume that N signs remain after throwing out
    all ties.

b)  Count the number of "+" signs and "-" signs, and let x denote the
    smaller of these two numbers.

c)  Then, from the assumptions of the binomial distribution,

                                                                x
    Pr{observing a value no greater than x given Ho is true} =  Σ  C(N,j) (1/2)^N ,
                                                               j=0

    where C(N,j) = N! / [j!(N-j)!] .

    (For a two-tailed test, multiply this number by 2.)

    If N is fairly large (say, > 25), then one can use a normal approximation
    to the above probability by calculating

              x - N/2
         Z = ---------
              (√N)/2

    and then using standard normal curve area tables.

    EXAMPLE:  Two air pollution monitoring instruments, A and B, operating

              side-by-side for 25 hours, yield the following data:

-------
     HOUR    A    B   DIFFERENCE        HOUR    A    B   DIFFERENCE

       1     9    7      +2              14     8   12      -4
       2    10    6      +4              15     5    4      +1
       3     5    5       0              16     9    9       0
       4     8    3      +5              17     8    6      +2
       5     3    2      +1              18    14   11      +3
       6     5    7      -2              19     4    3      +1
       7     6    5      +1              20    12   12       0
       8    10    8      +2              21     2    1      +1
       9    13    8      +5              22     6    1      +5
      10    11   10      +1              23     4    1      +3
      11     2    1      +1              24    14   13      +1
      12     7    7       0              25     9    8      +1
      13     1    1       0
For these data, we have observed 18 plus signs, 2 minus signs, and 5 ties, so
that N = 20 and x = 2.

                                                              2
So, Pr{no more than 2 of one sign for N = 20 under Ho} = 2    Σ  C(20,j) (1/2)^20
                                                             j=0

                               2(1 + 20 + 190)       422
                            = ----------------- = ----------- = .0004
                                    2^20           1,048,576

=> reject Ho at any reasonable significance level.

If the normal approximation is used, we have

          2 - 10
     Z = -------- = -3.58 , which leads us to the same conclusion as above
         (√20)/2
                    concerning Ho.
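The exact binomial probability and the normal approximation above can be verified with a few lines of Python (not part of the workbook):

```python
from math import comb, sqrt

N, x = 20, 2                  # 18 "+" and 2 "-" signs after dropping 5 ties

# Exact two-tailed binomial probability under Ho (p = 1/2)
p = 2 * sum(comb(N, j) for j in range(x + 1)) / 2 ** N
print(p)                      # 422/1048576 = 0.000402...

# Large-sample normal approximation
z = (x - N / 2) / (sqrt(N) / 2)
print(round(z, 2))            # -3.58
```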
                  2.   WILCOXON SIGNED-RANK TEST (a non-parametric  alternative

                                                 to the paired t-test)



    Uses:  Applicable to situations for which the sign test is used; considera-
           tion is given to the magnitudes of the differences between paired
           observations as well as to the directions of the differences.

-------
Limitations:  In general, more powerful than the sign test; not as powerful
              as the paired t-test when parametric assumptions hold.

Ho:  median difference is zero.

Procedure:

a)  Determine the sign and magnitude of the algebraic difference between
    each of the pairs of data points; assume that N scores remain after
    throwing out all zeros.

b)  Assign ranks to the absolute values of these N differences; use the
    midrank method in case of ties.  Make sure that ranks increase as
    absolute values increase.

c)  Then, assign to each rank the sign which it represents.

d)  Determine the sum of the ranks associated with a minus sign (T1) and
    the sum of the ranks associated with a plus sign (T2).

e)  To test Ho, one can use published tables based on min (T1, T2), or one
    can use a large sample normal approximation with

              T1 - N(N+1)/4
    Z = ---------------------- .
         √[N(N+1)(2N+1) / 24]

    EXAMPLE:  Refer to the data set used to illustrate the sign test.  For
              these data, the absolute differences and associated signed
              ranks are given in the following table.

-------
    ABSOLUTE DIFFERENCE   (SIGNED) RANK     ABSOLUTE DIFFERENCE   (SIGNED) RANK

             2               (+) 11.5                4               (-) 16.5
             4               (+) 16.5                1               (+)  5
             0               IGNORED                 0               IGNORED
             5               (+) 19                  2               (+) 11.5
             1               (+)  5                  3               (+) 14.5
             2               (-) 11.5                1               (+)  5
             1               (+)  5                  0               IGNORED
             2               (+) 11.5                1               (+)  5
             5               (+) 19                  5               (+) 19
             1               (+)  5                  3               (+) 14.5
             1               (+)  5                  1               (+)  5
             0               IGNORED                 1               (+)  5
             0               IGNORED

        Thus, T1 = 11.5 + 16.5 = 28; and, from Table 9 at the back, it is clear
        that this observed value of T1 is highly significant.  Similarly, the
        large sample test statistic is

                  28 - 20(21)/4        28 - 105     -77
             Z = ------------------ = ---------- = ------- = -2.87 , which is
                 √[20(21)(41)/24]       √717.5      26.79
                                                             highly significant.
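The signed ranks and the large-sample statistic can be re-computed from the raw hourly readings with a short Python sketch (not part of the workbook; `midrank` is a small helper written for this check):

```python
from math import sqrt

# Hourly readings from the sign-test example (instruments A and B).
a = [9, 10, 5, 8, 3, 5, 6, 10, 13, 11, 2, 7, 1,
     8, 5, 9, 8, 14, 4, 12, 2, 6, 4, 14, 9]
b = [7, 6, 5, 3, 2, 7, 5, 8, 8, 10, 1, 7, 1,
     12, 4, 9, 6, 11, 3, 12, 1, 1, 1, 13, 8]
d = [x - y for x, y in zip(a, b) if x != y]   # drop zero differences
n = len(d)                                    # N = 20

abs_sorted = sorted(abs(x) for x in d)
def midrank(v):
    # average of the 1-based positions occupied by v among the sorted |d|
    first = abs_sorted.index(v)
    return first + (abs_sorted.count(v) + 1) / 2

t1 = sum(midrank(-x) for x in d if x < 0)     # sum of "minus" ranks
z = (t1 - n * (n + 1) / 4) / sqrt(n * (n + 1) * (2 * n + 1) / 24)
print(n, t1, round(z, 2))                     # 20 28.0 -2.87
```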
              3.  MANN-WHITNEY "U" TEST  (a non-parametric alternative to the
                                         two-sample t-test)

    Uses:  To test whether two independent samples have been selected from the

           same population.

    Limitations:  Not as powerful as the two-sample t-test when the parametric
                  assumptions hold.

    Ho:  The two populations have the same distribution (versus the alternative
         that one population is "stochastically larger" than the other).

    Procedure:

    a)  Combine the n1 observations from Pop'n 1 and the n2 observations from
        Pop'n 2, arrange them in ascending order of size, and then assign ranks
        from 1 to (n1 + n2); in case of ties, use the midrank method.

-------
b)  For the observations from Pop'n 1, calculate

                 n1 (n1 + 1)
    U1 = n1n2 + ------------- - R1 ,
                      2

    where R1 is the sum of the ranks assigned to the observations from
    Pop'n 1; it can be shown that U2 = n1n2 - U1 .

c)  To test Ho, one can use published tables based on min (U1, U2) for small
    n1 and n2, or one can use a large sample normal approximation by
    calculating

                 U - n1n2/2
    Z = ---------------------------- .
         √[n1n2 (n1 + n2 + 1) / 12]

    EXAMPLE:  Test Ho that the set of pollutant concentration measurements
              obtained at site 1 could have come from the same population
              as the set obtained at site 2 (use a two-tailed test with
              α = .05).  (SEE DATA SET ON NEXT PAGE)

    For these data, n1 = n2 = 25, R1 = 540, R2 = 735.

                            25(26)
    Since U1 = (25)(25) + -------- - 540 = 410,
                              2

               410 - (25)(25)/2           410 - 312.5     97.5
    Z = ------------------------------ = ------------- = ------- = 1.89 .
         √[(25)(25)(25 + 25 + 1)/12]        √2656.25      51.54

    Since Z.975 = 1.96, we cannot reject Ho.

-------
                    SITE 1                                     SITE 2

   OBS. CONC.  RANK   OBS. CONC.  RANK       OBS. CONC.  RANK   OBS. CONC.  RANK

       12       1         36      17             16       3         68      31
       14       2         46      20.5           22       5.5       72      33
       17       4         58      24.5           26       9.5       76      34
       22       5.5       61      26.5           38      18         81      35
       23       7.5       70      32             41      19         95      38
       23       7.5       83      36             46      20.5      101      39.5
       26       9.5       94      37             51      22        108      41
       27      11.5      101      39.5           54      23        115      43
       27      11.5      112      42             58      24.5      119      44
       28      13        143      48             61      26.5      121      45
       32      14        152      49             65      28        136      46
       33      15.5      175      50             66      29        140      47
       33      15.5                              67      30

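The ranks, U1, and Z for this example can be re-computed in Python (not part of the workbook; `midrank` is a small helper written for this check):

```python
from math import sqrt

site1 = [12, 14, 17, 22, 23, 23, 26, 27, 27, 28, 32, 33, 33, 36, 46,
         58, 61, 70, 83, 94, 101, 112, 143, 152, 175]
site2 = [16, 22, 26, 38, 41, 46, 51, 54, 58, 61, 65, 66, 67, 68, 72,
         76, 81, 95, 101, 108, 115, 119, 121, 136, 140]
n1, n2 = len(site1), len(site2)

pooled = sorted(site1 + site2)
def midrank(v):
    # average of the 1-based positions occupied by v in the pooled ordering
    first = pooled.index(v)
    return first + (pooled.count(v) + 1) / 2

r1 = sum(midrank(v) for v in site1)
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
z = (u1 - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
print(r1, u1, round(z, 2))        # 540.0 410.0 1.89
```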
TIES:  Actually, the above test statistic we calculated for these data is
       not strictly correct because of the presence of tied observations,
       although the error is generally negligible.  When ties occur between
       two or more observations involving both populations, a correction
       is necessary; ties just among observations from the same population
       require no adjustment.  The correction tends to increase the Z value,
       so that ignoring the correction provides a conservative test.  In
       general, one should worry about correcting only if the percentage of
       tied observations is quite large.




              4.  ANALYSIS OF TRENDS

A trend will be generally considered as a broad long-term movement in the
time sequence of air quality measurements.  Before doing any statistical
tests concerning the presence or absence of a trend, various plots of the
data should be made.

     The classification of a trend as being up, down, or neither up nor
down can depend significantly on the following considerations:

-------
a)  CHOICE OF AGGREGATE MEASURE:  hourly, weekly, monthly, quarterly, or
    yearly maximum or average; the larger the interval over which the
    aggregate is taken, the more precision and stability contained in the
    estimate, but correspondingly the fewer time-sequenced estimates available.

b)  TIME FRAME OF INTEREST:  the particular interval of time over which
    the data are being considered, if altered, can influence the outcome of
    a trend analysis.

c)  SEASONAL EFFECTS:  seasonal patterns, if ignored, can result in
    misleading interpretations of trends.

d)  MISSING DATA:  missing data can introduce bias into the determination
    of a trend, especially if there is any systematic loss.

e)  POOLING DATA FROM DIFFERENT SOURCES:  Strictly speaking, the data should
    result from the same analytical (chemical and instrumental) methodology
    at the same site location for the entire time period under consideration.
    If one is willing (as is often done) to relax this rule in order to
    create the only possible complete record of data for the analysis, then
    one must be willing to accept the possibility of creating an apparent
    trend when none exists or indicating no trend when one actually exists.

    When combining data from different sites for purposes of assessing
    national or regional trends, the recommended approach is to determine
    the trend at each individual site and then to summarize the results
    for the group of sites; this summary would be in the form of a number
    of upward trends, downward trends, and "no changes", which could then
    be analyzed statistically.  The alternative approach of considering
    a composite index based on pooling data can in some instances be quite
    misleading because of a few aberrant sites or because the rate of
    change of the composite may not represent the typical rate of change
    of the individual sites (e.g., a zero rate of change for the composite
    may be the average effect of an equal number of increasing and
    decreasing patterns).

NOTE:  It is, in general, dangerous to extrapolate a trend line for
       the purpose of predicting future levels or continued direction
       of change.



STATISTICAL PROCEDURES:  NONPARAMETRIC TREND ANALYSIS

(parametric procedures will be discussed under the heading of regression
 analysis)

SPEARMAN RANK CORRELATION COEFFICIENT (rs)

Given observations X1, X2, ..., XN (taken over time, usually) and their
corresponding ranks R(X1), R(X2), ..., R(XN), then

                  N
              6   Σ  di²
                 i=1
    rs = 1 - ------------ ,   -1 ≤ rs ≤ +1,
              N(N² - 1)

where di = R(Xi) - i; that is, di is the difference between the rank
of Xi and its sequential index i in the series of N observations (in
our case, i is usually an index of time).  It should be noted that the
computational formula given for rs is mathematically equivalent to
the usual Pearson product moment correlation coefficient of the ranks
of the observations with the order in which the observations were taken.

     To test for the significance of an observed rs, one can use published
tables or, for large N, the approximation

                    Z = rs √(N-1) .

EXAMPLE:  The following table provides annual geometric means, their
ranks and the time index for Tucson TSP Data, 1964-1971.  Investigate
the possibility of a trend.

-------
      Xi:   128   118    80    89    70    78    96    88

   R(Xi):     8     7     3     5     1     2     6     4

       i:     1     2     3     4     5     6     7     8

   If ties had occurred, the midrank method would have been used; an
   adjustment is necessary if the proportion of ties is large (see Siegel,
   pp. 206-210).

   For the above data,

      8
      Σ  di² = (8-1)² + (7-2)² + (3-3)² + (5-4)² + (1-5)² + (2-6)² + (6-7)²
     i=1
               + (4-8)² = 124,

                      6(124)
   so that rs = 1 - ---------- = -0.476 .
                     8(8² - 1)

   From Table 11 at the back, it is seen that the observed value of rs is
   not significant; also, note that

        Z = -.476 √(8-1) = -1.26, which is not significant.

   The value of rs for the data for the years 1968-1971 is +0.30; however,
   this positive value is not statistically significant because of the
   small sample size involved.
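A short Python sketch (not part of the workbook) re-traces the Spearman computation for the Tucson series:

```python
from math import sqrt

x = [128, 118, 80, 89, 70, 78, 96, 88]   # Tucson TSP annual geometric means
n = len(x)
order = sorted(x)
ranks = [order.index(v) + 1 for v in x]  # no ties in this series
d2 = sum((r - (i + 1)) ** 2 for i, r in enumerate(ranks))
rs = 1 - 6 * d2 / (n * (n ** 2 - 1))
z = rs * sqrt(n - 1)
print(d2, round(rs, 3), round(z, 2))     # 124 -0.476 -1.26
```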



   TEST FOR TREND BASED ON THE PROPORTION OF OBSERVATIONS ABOVE A GIVEN
     STANDARD VALUE

   This procedure tests for a change in trend by comparing the percentage
   of observations above a specified threshold value for two different
   time periods.  The assumption of independent observations suggests,
   for example, that for hourly data one should consider at most one
   observation per day (e.g., the maximum hourly value for a 24-hour
   period or the value for a particular hour).  For fixed samples of sizes
   n1 and n2 from two time periods, the table of cell frequencies can be
   constructed as follows:

-------
                          NUMBER OF OBSERVATIONS   NUMBER OF OBSERVATIONS
                          NOT EXCEEDING STANDARD   EXCEEDING STANDARD

        TIME PERIOD I               a                       b             n1 = (a + b)

        TIME PERIOD II              c                       d             n2 = (c + d)

        The test statistic to be used is

               (n1 + n2) (ad - bc)²
        X² = ------------------------ ,
              n1 n2 (a + c) (b + d)

        which has a Chi-square distribution with 1 df under Ho of no difference
        between the time periods with regard to the true probability of exceeding
        the standard.  This test should be used only if there are at least 5
        observations in each of the four cells.

        EXAMPLE:  The following table represents the number of days whose
        maximum 1-hour oxidant concentration exceeded the 1-hour standard at a
        particular location during 1964-1967 and 1968-1971.

                        <= STANDARD     > STANDARD

        1964-1967           662            154         n1 = 816
        1968-1971           714            111         n2 = 825

              (816 + 825) [(662)(111) - (154)(714)]²
        X² = ----------------------------------------- = 8.9.
              (816)(825)(662 + 714)(154 + 111)

        Since Pr{X² > 3.84 given Ho} = .05, it can therefore be concluded
        that short-term oxidant levels have significantly decreased (5% level)
        in recent years at this particular location.
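The chi-square statistic can be checked directly (Python, not part of the workbook):

```python
# 2x2 exceedance comparison from the oxidant example above.
a, b = 662, 154      # 1964-1967: days <= standard, days > standard
c, d = 714, 111      # 1968-1971
n1, n2 = a + b, c + d
chi2 = (n1 + n2) * (a * d - b * c) ** 2 / (n1 * n2 * (a + c) * (b + d))
print(round(chi2, 1))    # 8.9 (> 3.84, so significant at the 5% level)
```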

-------
EXERCISE:  The following data represent 24-hour maximum concentration
values recorded over a period of one month at two different sites A
and B in a particular city.

a)  Is there evidence that one site tends to record higher maximum
    concentration levels than the other?  Use the sign test, Mann-Whitney
    U test, and Wilcoxon signed-rank test, and comment on your findings.

b)  Is there evidence of a daily trend in the differences between the
    readings taken at the two sites?  Use the Spearman rank correlation
    approach and comment on your findings.
     DAY     A     B        DAY     A     B        DAY     A     B

      1    349   171         11    93    79         21    50    48
      2    286   153         12    90    77         22    40    41
      3    174   150         13    85    77         23    29    29
      4    139   100         14    81    75         24    25    28
      5    135   100         15    77    72         25    22    28
      6    116   100         16    74    72         26    20    24
      7    105    98         17    70    68         27    18    23
      8    100    96         18    60    62         28    10    22
      9     96    95         19    55    59         29     5    21
     10     96    86         20    52    54         30     5    21

-------
                     Table 8:  Distribution function of U

P(U <= U0); U0 is the argument; n1 <= n2; 3 <= n2 <= 10.

[The tabulated probabilities for n2 = 3 through 10 are not legible in this
scan and are omitted here.]

     Computed by M. Pagano at the Department of Statistics, University of
Florida.


-------
                                  Table 9

             Critical values of T in the Wilcoxon matched-pairs
                       signed ranks test, n = 5(1)50

One-sided levels a = .05, .025, .01, .005 (two-sided a = .10, .05, .02, .01).

[The tabulated critical values for n = 5 through 50 are not legible in this
scan and are omitted here.]

      From "Some Rapid Approximate Statistical Procedures" (1964), 28.
  F. Wilcoxon and R. A. Wilcox.  Reproduced with the kind permission of
  R. A. Wilcox and the Lederle Laboratories.

-------

                                  Table 11

          Critical values of Spearman's rank correlation coefficient

          n     a = 0.05    a = 0.025    a = 0.01    a = 0.005

          5      0.900         -            -            -
          6      0.829       0.886        0.943          -
          7      0.714       0.786        0.893          -
          8      0.643       0.738        0.833        0.881
          9      0.600       0.683        0.783        0.833
         10      0.564       0.648        0.745        0.794
         11      0.523       0.623        0.736        0.818
         12      0.497       0.591        0.703        0.780
         13      0.475       0.566        0.673        0.745
         14      0.457       0.545        0.646        0.716
         15      0.441       0.525        0.623        0.689
         16      0.425       0.507        0.601        0.666
         17      0.412       0.490        0.582        0.645
         18      0.399       0.476        0.564        0.625
         19      0.388       0.462        0.549        0.608
         20      0.377       0.450        0.534        0.591
         21      0.368       0.438        0.521        0.576
         22      0.359       0.428        0.508        0.562
         23      0.351       0.418        0.496        0.549
         24      0.343       0.409        0.485        0.537
         25      0.336       0.400        0.475        0.526
         26      0.329       0.392        0.465        0.515
         27      0.323       0.385        0.456        0.505
         28      0.317       0.377        0.448        0.496
         29      0.311       0.370        0.440        0.487
         30      0.305       0.364        0.432        0.478

                    From "Distribution of Sums of Squares of Rank Differences for Small
               Samples," E. G. Olds, Annals of Mathematical Statistics, Volume 9 (1938).
               Reproduced with the kind permission of the Editor, Annals of Mathematical
               Statistics.

-------
                                 HANDOUT #8

                    SOLUTION TO EXERCISE IN HANDOUT #7

                 OVERALL           OVERALL                 RANK OF    (SIGNED)
  DAY      A      RANK       B      RANK     DIFFERENCE  DIFFERENCE    RANK

   1     349      60       171      57          +178         30       (+) 29
   2     286      59       153      56          +133         29       (+) 28
   3     174      58       150      55           +24         26       (+) 25
   4     139      54       100      48.5         +39         28       (+) 27
   5     135      53       100      48.5         +35         27       (+) 26
   6     116      52       100      48.5         +16         25       (+) 23
   7     105      51        98      46            +7         20       (+) 16
   8     100      48.5      96      44            +4         17       (+) 10
   9      96      44        95      42            +1         13       (+)  1.5
  10      96      44        86      39           +10         22       (+) 18
  11      93      41        79      36           +14         24       (+) 21
  12      90      40        77      34           +13         23       (+) 20
  13      85      38        77      34            +8         21       (+) 17
  14      81      37        75      32            +6         19       (+) 14.5
  15      77      34        72      29.5          +5         18       (+) 12.5
  16      74      31        72      29.5          +2         15       (+)  5
  17      70      28        68      27            +2         15       (+)  5
  18      60      25        62      26            -2          9.5     (-)  5
  19      55      23        59      24            -4          6.5     (-) 10
  20      52      21        54      22            -2          9.5     (-)  5
  21      50      20        48      19            +2         15       (+)  5
  22      40      17        41      18            -1         11       (-)  1.5
  23      29      15.5      29      15.5           0         12       IGNORED
  24      25      12        28      13.5          -3          8       (-)  8
  25      22       8.5      28      13.5          -6          4       (-) 14.5
  26      20       5        24      11            -4          6.5     (-) 10
  27      18       4        23      10            -5          5       (-) 12.5
  28      10       3        22       8.5         -12          3       (-) 19
  29       5       1.5      21       6.5         -16          1.5     (-) 23
  30       5       1.5      21       6.5         -16          1.5     (-) 23
-------
a)  SIGN TEST:  For these data, we have observed  18 plus  signs,  11  minus signs,

    and 1 tie, so that N = 29 and x =  11.

    Then,

                    11   29
                       " ~~2        -3.5
             Z  = — = -  = - =  -1.30, which  indicates  no evidence
                    I  /29~         2'693     -                  --
                    2                              of  a significant difference.

    MANN-WHITNEY "U" TEST:  For these  data,  n.  =  nD =  30,  RA  =  929.5,  and R_ =
                                             ABA                B
                            900.5.
    Then, UA =  (30) (30) +
             = 900 + 465 - 929.5 = 435.5.

                       (30) (30)  f
    c   .           ~    2             435.5  - 450
    oO, L =
         J
            -14.5
30 (30)(50 +30+1)      /4,575
         12
          - >-.,--£-,, < 1> which indicates no evidence  of  a  difference.
            O/.04                      	

    WILCOXON SIGN'ED-RANK TEST:  For these data, K = 29,  T   =  131.5,  and

    T2 » 303.5.  Then,

           ixi c _ r	
     ,               4         131.5 - 217.5   -86.0    .  ..     ,.  ,  ,.
     Z  =               '- -— = —•        	= •.; . = -1.86  ,  which (for a two-tailc
            29 (30) (58 + 1)     /2138.75       ao>/i>           test) is  significant s
                  24                                            the  10% level but not
                                                                at the 5% level.

b)  SPEARMAN RANK CORRELATION:


    Here, f  d2 = (30 - I)2 -r  (29 - 2)2  + (26 - 3)2 +  ...  + (1.5 -  30)2  = 3,694.5
                 6(8.694.
                 30(900-1)

-------
From Table 11, it is clear that this observed value of r_s is highly

significant.  This can also be demonstrated by calculating

    Z = -.934 √(30-1) = -5.03,  which is highly significant.
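These two quantities can be checked with a short computation (a Python sketch using the handout's values n = 30 and Σd² = 8,694.5):

```python
import math

def spearman_from_d2(n, sum_d2):
    """Spearman rank correlation from the sum of squared rank differences."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

r_s = spearman_from_d2(30, 8694.5)
z = r_s * math.sqrt(30 - 1)        # large-sample significance check
print(round(r_s, 3), round(z, 2))  # -0.934 -5.03
```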

     Note that the analyses in parts (a) and  (b) are addressing different

issues.  While there is no evidence that one site tends to record higher

maximum concentration levels than the other, there is strong evidence of

a daily trend with respect to the maximum concentrations observed at the

two sites.  A graph of the data would look something like this:
[Graph: 24-hour maximum concentration plotted against day for the two sites, with the pattern changing at about day 15 (approx.).]

-------
                              HANDOUT #9
                         ANALYSIS OF VARIANCE

 The  "analysis  of variance" procedure attempts to analyze the variation in a

 response  (Y) and to  assign portions of this variation to each of a set of

 independent variables.  The object is to  locate important independent variables

 in a study and to  determine how they interact and affect the response.

[Diagram: the total sum of squares, Σ_{i=1}^{n} (Y_i - Ȳ)², computed from the data Y_1, Y_2, ..., Y_n, is partitioned into pieces: a sum of squares for each source of variation (first, second, third, ...) and a sum of squares for experimental error.  Each source of variation is measured by a "sum of squares".]
 When the independent variables are unrelated to the response, it will be noted

 that each of the pieces of the total sum of squares, divided by an appropriate

 constant, provides an independent and unbiased estimator of σ², the variance of

 the experimental error.  When a variable is highly related to the response, its

 portion (called the "sum of squares" for the variable) will be inflated.  This

 condition can be detected by comparing the estimate of σ² for a particular in-

 dependent variable with that based on the sum of squares due to experimental

 error using an F test.  If the estimate for the independent variable is signi-

 ficantly larger, the F test will reject the null hypothesis of "no effect for

 the independent variable" and provide evidence to indicate a relation to the

 response.

-------
   EXAMPLE.  Assume that independent random samples have been selected from k

   normal populations with means μ_1, μ_2, ..., μ_k, respectively, and common

   variance σ².  Let n_i, i = 1, 2, ..., k, be the number of observations selected

   from the i-th population, and let N = Σ_{i=1}^{k} n_i.  This is a COMPLETELY

   RANDOMIZED EXPERIMENTAL DESIGN (i.e., one-way ANOVA).
   ADDITIVE MODEL:  Let Y_ij = μ_i + ε_ij = μ + τ_i + ε_ij, where Y_ij is the observed

   response of the j-th experimental unit in the i-th sample, μ is the over-all

   mean, τ_i is the effect of the i-th population (or treatment), and ε_ij is the

   experimental error component, where ε_ij ~ N(0, σ²) and the {ε_ij} are mutually

   independent.  The null hypothesis we shall want to test is that the means of the

   k populations are all equal (i.e., μ_1 = μ_2 = ... = μ_k), or, equivalently, that

   τ_i = 0 for every i.
        Let Ȳ_i. = (1/n_i) Σ_{j=1}^{n_i} Y_ij  and  Ȳ = (1/N) Σ_{i=1}^{k} Σ_{j=1}^{n_i} Y_ij.

   Then, the total sum of squares (TSS) = Σ_{i=1}^{k} Σ_{j=1}^{n_i} (Y_ij - Ȳ)², which can be parti-
   tioned into the following components:

        Σ_{i=1}^{k} Σ_{j=1}^{n_i} (Y_ij - Ȳ)²  =  Σ_{i=1}^{k} Σ_{j=1}^{n_i} [(Y_ij - Ȳ_i.) + (Ȳ_i. - Ȳ)]²

                                  =  Σ_{i=1}^{k} Σ_{j=1}^{n_i} (Y_ij - Ȳ_i.)²  +  Σ_{i=1}^{k} n_i (Ȳ_i. - Ȳ)² ,

                                     SSE, sum of squares       SST, sum of squares
                                     due to error              due to treatments

   since the cross-product term vanishes:  Σ_{j=1}^{n_i} (Y_ij - Ȳ_i.) = 0 for each i.

-------
Thus, TSS = SSE + SST.

     Note that the treatment sum of squares (SST) will be zero only when all

of the observed treatment means (i.e., the {Ȳ_i.}) are identical (in which case

Ȳ_1. = Ȳ_2. = ... = Ȳ_k. = Ȳ).  As the treatment means get further apart in value,

the deviations {(Ȳ_i. - Ȳ)} will increase in absolute value and SST will increase

in magnitude.  Consequently, the larger the value of SST, the greater the weight

of evidence favoring a rejection of the null hypothesis of no treatment differ-

ences.  THIS SAME LINE OF REASONING WILL APPLY TO THE SIGNIFICANCE TESTS EMPLOYED

IN THE ANALYSIS OF VARIANCE FOR ALL DESIGNED EXPERIMENTS.
     For the case we are considering, it can be shown that

        E[SSE/(N-k)] = σ²   and   E[SST/(k-1)] = σ² + [1/(k-1)] Σ_{i=1}^{k} n_i τ_i² .

Under H_0: τ_1 = τ_2 = ... = τ_k = 0, both quantities therefore estimate σ².

Also, SSE and SST are independent.  Thus, under H_0,

        [SST/(k-1)] / [SSE/(N-k)]  ~  F_(k-1),(N-k) .

Consequently, we would reject the null hypothesis if this F-statistic is greater

than the appropriate upper percentage point of F_(k-1),(N-k); that is, we reject

the null hypothesis for values falling into the right-hand tail of the F-distri-

bution, inasmuch as this rejects H_0 when SST is too large.  MODERATE DEPARTURES

FROM THE ASSUMPTIONS WILL NOT SERIOUSLY AFFECT THE PROPERTIES OF THE PROPOSED

TEST PROCEDURE; THIS IS PARTICULARLY TRUE OF THE NORMALITY ASSUMPTION.
                        Computational Formulas

    TSS = Σ_{i=1}^{k} Σ_{j=1}^{n_i} Y_ij²  -  (1/N) [Σ_{i=1}^{k} Σ_{j=1}^{n_i} Y_ij]² ,

    SST = Σ_{i=1}^{k} (1/n_i) [Σ_{j=1}^{n_i} Y_ij]²  -  (1/N) [Σ_{i=1}^{k} Σ_{j=1}^{n_i} Y_ij]² ,

SSE = TSS - SST.
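The computational formulas can be sketched as a short routine; the three small samples below are hypothetical, purely for illustration:

```python
def one_way_anova(groups):
    """One-way ANOVA sums of squares via the computational formulas."""
    N = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups)
    correction = grand ** 2 / N                      # (grand total)^2 / N
    tss = sum(y * y for g in groups for y in g) - correction
    sst = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    sse = tss - sst
    return tss, sst, sse

# Hypothetical data: three small samples
tss, sst, sse = one_way_anova([[5, 7, 6], [9, 8, 10], [4, 3, 5]])
print(tss, sst, sse)  # 44.0 38.0 6.0
```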

-------
                         ANALYSIS OF VARIANCE TABLE

    SOURCE OF VARIATION   DEGREES OF FREEDOM   SUM OF SQUARES   MEAN SQUARES
    BETWEEN POPULATIONS         (k-1)               SST          SST/(k-1)
    WITHIN POPULATIONS          (N-k)               SSE          SSE/(N-k)
         TOTAL                  (N-1)               TSS

    F RATIO:  [SST/(k-1)] / [SSE/(N-k)], referred under H_0 to the F distribution
    with (k-1) and (N-k) degrees of freedom.

                        TWO-WAY ANOVA (RANDOMIZED BLOCK DESIGN)
        Model:  Y_ij = μ + τ_i + β_j + ε_ij,  i = 1, 2, ..., k,  j = 1, 2, ..., b,

where μ is the over-all mean, τ_i is the treatment effect, β_j is the block effect,

and ε_ij is the experimental error, with ε_ij ~ N(0, σ²) and the {ε_ij} independent.

This additive model assumes no interaction between treatments and blocks.

        With Ȳ = (1/kb) Σ_{i=1}^{k} Σ_{j=1}^{b} Y_ij :

        TSS = Σ_{i=1}^{k} Σ_{j=1}^{b} (Y_ij - Ȳ)²  = Σ_{i=1}^{k} Σ_{j=1}^{b} Y_ij² - (1/kb) [Σ_i Σ_j Y_ij]² ,

        SST = b Σ_{i=1}^{k} (Ȳ_i. - Ȳ)²  = (1/b) Σ_{i=1}^{k} [Σ_{j=1}^{b} Y_ij]² - (1/kb) [Σ_i Σ_j Y_ij]² ,

        SSB = k Σ_{j=1}^{b} (Ȳ_.j - Ȳ)²  = (1/k) Σ_{j=1}^{b} [Σ_{i=1}^{k} Y_ij]² - (1/kb) [Σ_i Σ_j Y_ij]² ,

        SSE = TSS - SST - SSB = Σ_{i=1}^{k} Σ_{j=1}^{b} (Y_ij - Ȳ_i. - Ȳ_.j + Ȳ)².

-------
                                    ANOVA TABLE

    SOURCE OF VARIATION      D.F.         SUM OF SQUARES     MEAN SQUARES
        TREATMENTS           (k-1)             SST           MST = SST/(k-1)
        BLOCKS               (b-1)             SSB           MSB = SSB/(b-1)
        ERROR              (k-1)(b-1)          SSE           MSE = SSE/[(k-1)(b-1)]
        TOTAL               (kb-1)             TSS

    F RATIOS:  F_(k-1),(k-1)(b-1) = MST/MSE   and   F_(b-1),(k-1)(b-1) = MSB/MSE.

    EXAMPLE:  The following data represent the results of an interlaboratory study

    involving 4 laboratories (i), each of which analyzed the same three air samples (j).

                   LAB (i)
    SAMPLE (j)    1    2    3    4   TOTALS
        1         5    8    8    4     25
        2         4    6    8    5     23
        3         8   10   10    7     35
    TOTALS       17   24   26   16     83
    TSS = Σ_{i=1}^{4} Σ_{j=1}^{3} Y_ij² - (83)²/12 = 623 - 574.08 = 48.92,

    SST = (1/3) Σ_{i=1}^{4} [Σ_{j=1}^{3} Y_ij]² - 574.08
        = (1/3) [(17)² + (24)² + (26)² + (16)²] - 574.08
        = 599.00 - 574.08 = 24.92,

    SSB = (1/4) Σ_{j=1}^{3} [Σ_{i=1}^{4} Y_ij]² - 574.08
        = (1/4) [(25)² + (23)² + (35)²] - 574.08
        = 594.75 - 574.08 = 20.67,

    SSE = 48.92 - 24.92 - 20.67 = 3.33.

                                 ANOVA TABLE

    SOURCE OF VARIATION   D.F.   SUM OF SQUARES   MEAN SQUARES      F RATIOS
    LABORATORIES            3        24.92            8.31       8.31/.56 = 14.84
    SAMPLES                 2        20.67           10.34      10.34/.56 = 18.46
    ERROR                  _6         3.33             .56
    TOTAL                  11        48.92
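The sums of squares for this example can be verified directly; the following Python sketch uses the data from the table above:

```python
# Rows = samples (blocks, b = 3), columns = laboratories (treatments, k = 4)
data = [[5, 8, 8, 4],
        [4, 6, 8, 5],
        [8, 10, 10, 7]]

b, k = len(data), len(data[0])
grand = sum(sum(row) for row in data)    # 83
corr = grand ** 2 / (b * k)              # 574.08
tss = sum(y * y for row in data for y in row) - corr
ss_labs = sum(sum(row[i] for row in data) ** 2 for i in range(k)) / b - corr
ss_samples = sum(sum(row) ** 2 for row in data) / k - corr
sse = tss - ss_labs - ss_samples
print(round(tss, 2), round(ss_labs, 2), round(ss_samples, 2), round(sse, 2))
# 48.92 24.92 20.67 3.33
```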

-------
Since F_.995,3,6 = 12.9 and F_.995,2,6 = 14.5, it follows that there is strong evidence

of a difference among the laboratory means and among the sample means.






NOTE:  In evaluating laboratory performance in an interlaboratory study,  there



are two measures of interest:  PRECISION  (i.e., variability) and ACCURACY.



The above analysis of variance addresses  only questions concerning precision.



     To illustrate this point briefly, suppose that each of k laboratories has

been given a sample with true (known) concentration μ to analyze.  Then,

    Σ_{i=1}^{k} (Ȳ_i. - μ)²  has information on accuracy, while  Σ_{i=1}^{k} (Ȳ_i. - Ȳ)²  is a measure

of precision only.  In fact, since

    Σ_{i=1}^{k} (Ȳ_i. - μ)² = Σ_{i=1}^{k} (Ȳ_i. - Ȳ)² + k(Ȳ - μ)² ,

it follows that Σ_{i=1}^{k} (Ȳ_i. - μ)² = 0 implies Σ_{i=1}^{k} (Ȳ_i. - Ȳ)² = 0, but the converse is not necessarily true.
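The decomposition can be illustrated numerically; the laboratory means and true concentration below are hypothetical, chosen only to show the identity:

```python
# Hypothetical lab means Ybar_i for k = 4 laboratories, and a known true value mu
means = [9.8, 10.4, 10.1, 11.3]
mu = 10.0
k = len(means)
ybar = sum(means) / k

accuracy = sum((m - mu) ** 2 for m in means)     # reflects accuracy AND precision
precision = sum((m - ybar) ** 2 for m in means)  # reflects precision only

# The identity: accuracy term = precision term + k * (bias of the grand mean)^2
assert abs(accuracy - (precision + k * (ybar - mu) ** 2)) < 1e-9
```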

-------
                              HANDOUT #10

                     NONPARAMETRIC ANOVA PROCEDURES

                  KRUSKAL-WALLIS ONE-WAY ANOVA BY RANKS

The Kruskal-Wallis technique tests the null hypothesis that independent
samples come from the same population or from identical populations with
respect to mean values.

PROCEDURE:

a)  Suppose that there are k independent samples with n_i observations in the
    i-th sample; rank all N = Σ_{i=1}^{k} n_i observations combined, using the midrank
    method in case of ties.

b)  Calculate R_i, the sum of the ranks for the observations in the i-th group.

c)  As a test statistic, use

        H = [12/(N(N+1))] Σ_{i=1}^{k} (R_i²/n_i) - 3(N+1),

    which has a χ²_(k-1) distribution under H_0.

EXAMPLE:  The lead content (in gm/L) is measured in samples of three brands
of gasoline.  Is there evidence that the three brands differ with respect to
average lead content?  The data are as follows:

                BRAND A       BRAND B        BRAND C
               7.0  (2)       8.3  (9.5)     8.6  (12)
               7.1  (3)       7.9  (8)       7.4  (5)
               7.2  (4)       6.5  (1)       7.5  (6)
               7.6  (7)       8.8  (13)      8.3  (9.5)
               8.4  (11)                     9.0  (14)
                                             9.2  (15)
                         (Ranks are in parentheses)

-------
For these data, n_1 = 5, R_1 = 27; n_2 = 4, R_2 = 31.5; n_3 = 6, R_3 = 61.5.

Then, H = [12/(15(15+1))] [(27)²/5 + (31.5)²/4 + (61.5)²/6] - 3(15+1)

        = 51.21 - 48 = 3.21, which is not significant since χ²_.95,2 = 5.99.
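A minimal sketch of the H computation from the group rank sums (no tie correction, matching the handout's formula):

```python
def kruskal_wallis_H(rank_sums, ns):
    """Kruskal-Wallis H from group rank sums (no tie correction)."""
    N = sum(ns)
    return 12 / (N * (N + 1)) * sum(R * R / n for R, n in zip(rank_sums, ns)) - 3 * (N + 1)

H = kruskal_wallis_H([27, 31.5, 61.5], [5, 4, 6])
print(round(H, 2))  # 3.21
```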

-------
FRIEDMAN TWO-WAY ANOVA BY RANKS:  For b = 3 blocks, k = 4 treatments, and
treatment rank totals 5, 10, 11, and 4:

Then,

    K = [12/(3(4)(4+1))] [(5)² + (10)² + (11)² + (4)²] - 3(3)(4+1) = 7.4,

which is significant at the 10% level since χ²_.90,3 = 6.251.
EXERCISE:  A laboratory animal experiment was conducted by EPA to investigate the

sensitivity of skin to three chemicals related to reaction products of pollutants

in the atmosphere.  In this experiment, one-inch squares of skin on rats were treated

with the chemicals and then scored from 0 to 10, depending on the degree of irritation.

Three adjacent one-inch squares were marked on the back of each of the eight rats,

and then each of the three chemicals was applied to each rat.  The data are as follows:

                              RAT
    CHEMICAL    1   2   3   4   5   6   7   8   TOTALS
       I        6   9   6   5   7   5   6   6      50
       II       5   9   9   8   8   7   7   7      60
       III      3   4   3   6   8   5   5   6      40
    TOTALS     14  22  18  19  23  17  18  19     150

   a)  Perform one-way parametric and non-parametric ANOVA on these data, ignoring

       the column (i.e., rat) designation.  Comment on your findings.

   b)  Perform two-way parametric and non-parametric ANOVA on these data, and

       comment on your findings.

-------
                              HANDOUT #11

                  SOLUTIONS TO EXERCISE IN HANDOUT #10

a)  Model:  Y_ij = μ + τ_i + ε_ij,  i = 1, 2, 3 (chemicals) and j = 1, 2, ..., 8,
    where ε_ij is the error term.

    TSS = Σ_{i=1}^{3} Σ_{j=1}^{8} Y_ij² - (150)²/24 = 1,006 - 937.5 = 68.5,

    SST = (1/8) [(50)² + (60)² + (40)²] - 937.5 = 962.5 - 937.5 = 25.0,

    SSE = TSS - SST = 68.5 - 25.0 = 43.5.

    SOURCE OF VARIATION   D.F.   SUM OF SQUARES   MEAN SQUARES      F RATIOS
        CHEMICALS           2         25.0            12.5      12.5/2.07 = 6.04**
          ERROR            21         43.5             2.07
          TOTAL            23         68.5
                                               **Significant at .01 level.

    The Kruskal-Wallis procedure is based on the following table of ranks:

                              RAT
    CHEMICAL    1     2     3     4     5     6     7     8
       I      11.5   23   11.5    6   16.5    6   11.5  11.5    R_1 =  97.5
       II      6     23   23    20    20    16.5  16.5  16.5    R_2 = 141.5
       III     1.5    3    1.5  11.5  20     6     6   11.5     R_3 =  61.0

    Then,

    H = [12/(24(24+1))] [(97.5)²/8 + (141.5)²/8 + (61.0)²/8] - 3(24+1)

      = [12/(24)(25)] (33,249.50/8) - 75 = 83.12 - 75 = 8.12,

    which is significant at the 2.5% level since χ²_.975,2 = 7.38.
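The same H formula applied to these rank sums (a short Python sketch, no tie correction):

```python
rank_sums = [97.5, 141.5, 61.0]    # R_1, R_2, R_3 from the table above
N, n = 24, 8                       # 24 observations, 8 per chemical
H = 12 / (N * (N + 1)) * sum(R * R / n for R in rank_sums) - 3 * (N + 1)
print(round(H, 2))  # 8.12
```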
b)  Model:  Y_ij = μ + τ_i + β_j + ε_ij,  i = 1, 2, 3 and j = 1, 2, ..., 8.

    Now, TSS and SST are as given in part (a).

    SSB = (1/3) [(14)² + (22)² + ... + (19)²] - 937.5 = 956.0 - 937.5 = 18.5,

    So, SSE = TSS - SST - SSB = 68.5 - 25.0 - 18.5 = 25.0.

    SOURCE OF VARIATION      D.F.   SUM OF SQUARES   MEAN SQUARES   F RATIOS
    CHEMICALS (treatments)     2         25.0           12.5          7.0**
    RATS (blocks)              7         18.5            2.64         1.48
    ERROR                     14         25.0           25/14
    TOTAL                     23         68.5
                                               **Significant at .01 level.



    The Friedman procedure is based on the following table of ranks:

                            RAT
    CHEMICAL    1    2    3    4    5    6    7    8    TOTALS
       I        3   2.5   2    1    1   1.5   2   1.5   R_1 = 14.5
       II       2   2.5   3    3   2.5   3    3    3    R_2 = 22
       III      1    1    1    2   2.5  1.5   1   1.5   R_3 = 11.5

    Then,

    K = [12/(8(3)(3+1))] [(14.5)² + (22)² + (11.5)²] - 3(8)(3+1)

      = (1/8)(826.5) - 96 = 103.3 - 96 = 7.3, which is significant at

    the 5% level since χ²_.95,2 = 5.99.

      Note that the variation due to differences among rats was not

      significant.
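A sketch of the K computation from the treatment rank totals (the handout's formula, written as a small function):

```python
def friedman_K(rank_totals, b):
    """Friedman statistic from treatment rank totals over b blocks."""
    k = len(rank_totals)
    return 12 / (b * k * (k + 1)) * sum(R * R for R in rank_totals) - 3 * b * (k + 1)

K = friedman_K([14.5, 22, 11.5], b=8)
print(round(K, 1))  # 7.3
```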

-------
                                 HANDOUT #12

                          SIMPLE LINEAR REGRESSION

Model:  Y_i = β_0 + β_1 x_i + ε_i,  i = 1, 2, ..., n, where Y_i is the observed response

associated with x_i, β_0 and β_1 are parameters to be estimated, and ε_i denotes the

experimental error component.

USUAL ASSUMPTIONS (needed to make statistical inferences, but not for fitting):

  i)  ε_i ~ N(0, σ²) => we are assuming that the random variable Y_i is normal and

      that the variance of Y_i is σ² (independent of i); that is,

      Y_i ~ N(β_0 + β_1 x_i, σ²).

 ii)  {ε_i} are independent => {Y_i} are independent.

iii)  x_i is measured without error (i.e., x_i is not a random variable).

Let β̂_0 and β̂_1 denote estimates of β_0 and β_1, and let Ŷ_i = β̂_0 + β̂_1 x_i denote the

predicted value at x_i.  Then, the METHOD OF LEAST SQUARES chooses β̂_0 and β̂_1 to

MINIMIZE the sum of squares due to error:

    SSE = Σ_{i=1}^{n} (Y_i - Ŷ_i)² = Σ_{i=1}^{n} (Y_i - β̂_0 - β̂_1 x_i)².

This can be viewed as a problem in calculus; the values of β̂_0 and β̂_1 which minimize

SSE are found as the solutions of the two equations ∂SSE/∂β̂_0 = 0 and ∂SSE/∂β̂_1 = 0.

These solutions are:

    β̂_1 = Σ_{i=1}^{n} (x_i - x̄)(Y_i - Ȳ) / Σ_{i=1}^{n} (x_i - x̄)²

        = [Σ x_i Y_i - (1/n)(Σ x_i)(Σ Y_i)] / [Σ x_i² - (1/n)(Σ x_i)²]

and β̂_0 = Ȳ - β̂_1 x̄.

-------
Consider the following simple example:  Suppose Y represents a measure of

reproducibility and x represents concentration for a chemical procedure

designed to analyze air samples.  Is there evidence of a linear relation-

ship between x and Y?

     Y    x
    65   39
    78   43
    52   21
    82   64
    92   57
    89   47
    73   28
    98   75
    56   34
    75   52

A plot of the data reveals:

[Scatter plot: Y (50 to 100) against x (10 to 80), with the points scattered about an increasing straight line.]
                                COMPUTATIONS

     Y_i    x_i     x_i²    x_i Y_i     Y_i²
     65     39     1521      2535      4225
     78     43     1849      3354      6084
     52     21      441      1092      2704
     82     64     4096      5248      6724
     92     57     3249      5244      8464
     89     47     2209      4183      7921
     73     28      784      2044      5329
     98     75     5625      7350      9604
     56     34     1156      1904      3136
     75     52     2704      3900      5625

    SUMS:  760    460    23,634    36,854    59,816

-------
Then,

    β̂_1 = [36,854 - (1/10)(460)(760)] / [23,634 - (1/10)(460)²] = 1,894/2,474 = 0.77,

    β̂_0 = (1/10)(760) - (0.77)(1/10)(460) = 76 - 35.42 = 40.58.

Fitted line is:  Ŷ = 40.58 + 0.77x.

    SSE = Σ_{i=1}^{10} (Y_i - 40.58 - 0.77x_i)²

        = Σ_{i=1}^{10} (Y_i - Ȳ)² - β̂_1 Σ_{i=1}^{10} (x_i - x̄)(Y_i - Ȳ)

        = [59,816 - (1/10)(760)²] - 0.77 [36,854 - (1/10)(460)(760)]

        = 2,056 - 1,458.38 = 597.62.

Then, S² = SSE/(n - 2) = 597.62/8 = 74.703;  S² is an unbiased estimator of σ²

if all assumptions hold.
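The least squares computations can be reproduced with a short program (a Python sketch; note that the handout rounds β̂_1 to 0.77 before computing β̂_0 and SSE, so its 40.58 and 597.62 differ slightly from the full-precision values):

```python
Y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
n = len(Y)

Sxy = sum(a * b for a, b in zip(x, Y)) - sum(x) * sum(Y) / n   # 1,894
Sxx = sum(a * a for a in x) - sum(x) ** 2 / n                  # 2,474
Syy = sum(b * b for b in Y) - sum(Y) ** 2 / n                  # 2,056
b1 = Sxy / Sxx                      # ~0.766, quoted as 0.77
b0 = sum(Y) / n - b1 * sum(x) / n   # ~40.78 at full precision
sse = Syy - b1 * Sxy                # ~606.0 at full precision
print(round(b1, 2), round(b0, 2))   # 0.77 40.78
```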

Note:  One can get a so-called "pure" estimate of σ² by taking more than one

       observation at each x value; this is called replication.

The following results are useful:

    Var(β̂_1) = σ² / Σ_{i=1}^{n} (x_i - x̄)² ,

    Var(β̂_0) = σ² [1/n + x̄² / Σ_{i=1}^{n} (x_i - x̄)²] ,

    Variance of predicted value of E(Y) at (x = x_0)
             = σ² [1/n + (x_0 - x̄)² / Σ_{i=1}^{n} (x_i - x̄)²] .
If all assumptions hold, then 100(1 - α)% confidence intervals for β_1, β_0, and

E(Y) at (x = x_0) are, respectively:

    β̂_1 ± t_{1-α/2, (n-2)} S / √[Σ_{i=1}^{n} (x_i - x̄)²] ,

    β̂_0 ± t_{1-α/2, (n-2)} S √[1/n + x̄² / Σ_{i=1}^{n} (x_i - x̄)²] ,

    (β̂_0 + β̂_1 x_0) ± t_{1-α/2, (n-2)} S √[1/n + (x_0 - x̄)² / Σ_{i=1}^{n} (x_i - x̄)²] .

For our example, a 95% confidence interval for β_1 is

    0.77 ± 2.306 √74.703 / √2,474 ,  or  .77 ± .40.

And, a 95% confidence interval for E(Y) at x = 50 is:

    [40.58 + 0.77(50)] ± 2.306 √74.703 √[1/10 + (50 - 46)²/2,474] ,  or  79.08 ± 6.50.

This is INTERPOLATION, not EXTRAPOLATION.
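The two interval half-widths can be checked directly; the sketch below takes S², Σ(x_i - x̄)², and x̄ from the computations above, with t = 2.306:

```python
import math

n, t = 10, 2.306              # t for 95% confidence with n - 2 = 8 d.f.
S = math.sqrt(74.703)         # from S^2 = SSE/(n - 2)
Sxx, xbar = 2474, 46

# 95% CI half-width for beta_1
hw_b1 = t * S / math.sqrt(Sxx)

# 95% CI half-width for E(Y) at x0 = 50
x0 = 50
hw_EY = t * S * math.sqrt(1 / n + (x0 - xbar) ** 2 / Sxx)
print(round(hw_b1, 2), round(hw_EY, 2))  # 0.4 6.5
```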

An often-used indicator of the strength of the  linear relationship between  two

variables  X and Y, which will be independent of their respective scales  of  measure-

ment, is the CORRELATION COEFFICIENT

-------
    r = Σ_{i=1}^{n} (x_i - x̄)(Y_i - Ȳ) / √[Σ_{i=1}^{n} (x_i - x̄)² Σ_{i=1}^{n} (Y_i - Ȳ)²]

      = [Σ x_i Y_i - (1/n)(Σ x_i)(Σ Y_i)] / √{[Σ x_i² - (1/n)(Σ x_i)²] [Σ Y_i² - (1/n)(Σ Y_i)²]} .

The range of values of r is -1 ≤ r ≤ +1.

Since the numerator used in calculating r is identical to the numerator in the
formula for β̂_1, r will assume exactly the same sign as β̂_1 and will equal 0 only
if β̂_1 = 0.  In fact, r = +1 when β̂_1 > 0 and SSE = 0, and r = -1 when β̂_1 < 0 and
SSE = 0.
It can be shown that

    r² = [Σ_{i=1}^{n} (Y_i - Ȳ)² - SSE] / Σ_{i=1}^{n} (Y_i - Ȳ)² ,   0 ≤ r² ≤ 1.

Thus, r² is equal to the ratio of the reduction in the sum of squares of deviations
obtained using x to the total sum of squares of deviations about the sample mean,
Ȳ, which would be the predictor of Y if x were completely ignored.
     Most of these concepts can be extended to the so-called MULTIPLE REGRESSION
situation, where models like

                        Y = β_0 + β_1 x_1 + ... + β_k x_k + ε

can be fitted by the method of least squares.  In fact, almost all analysis of
variance procedures can be implemented using multiple regression techniques.
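For the worked example, r can be computed from the sums already obtained (a short Python sketch):

```python
import math

Sxy, Sxx, Syy = 1894, 2474, 2056   # from the regression computations above
r = Sxy / math.sqrt(Sxx * Syy)
print(round(r, 2), round(r * r, 2))  # 0.84 0.71
```

So roughly 71% of the variation in Y about its mean is accounted for by the fitted line.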

-------
                             SELF-INSTRUCTIONAL
                                  PACKAGE

Instructor's Name:  LARRY KUPPER
Topic:
         BASIC  MATRIX MANIPULATIONS
Estimated Working       Unit:  30 minutes
Time of Student:        Post Test:  20 minutes

PRE-TEST:

Consider the matrix  X =   2  -3   1
                          -3   4   2
                           1   2   0

1.  Is X square?      ___Yes   ___No

2.  Is X symmetric?   ___Yes   ___No

3.  Is X diagonal?    ___Yes   ___No

4.  Find the product X'X.

If you miss any of the pre-test questions, please continue through this unit.

ANSWERS TO PRE-TEST:

1.  YES     2.  YES     3.  NO

4.  X'X =   14  -16   -4
           -16   29    5
            -4    5    5

-------
                         BASIC MATRIX MANIPULATIONS


Introduction


      You are familiar with the methodology pertaining to the least squares

fitting of the simple linear regression model Y = β_0 + β_1 X + ε.

      You are also aware of the fact that a more common and more general

problem is that of relating a response not to just one but to several indepen-

dent variables.  For example, an environmental engineer might be concerned

with predicting suspended particulate concentration as a function of humidity,

temperature, number of point sources of pollution, etc.

      The purpose of this package is to introduce you to some elementary but

important matrix manipulations so that the general least squares procedures

for multiple linear regression can be discussed in a follow-up package using

convenient matrix notation and methodology.






Objectives


      In this package, the overall objective is:



	 to be able to determine the transpose of a given matrix and to be able


    to perform matrix multiplication.


Sub-objectives:


	 to be able to operate with the notions of


      i) a matrix and its dimensions,


     ii) the elements of a matrix,


    iii) the transpose of a matrix,


     iv) a symmetric matrix,


      v) a diagonal matrix,


     vi) a product of two matrices.

-------
It is also hoped that you will be willing to answer questions concerning the

way you felt about the material.

Activities

      A matrix may be simply defined as a rectangular array of numbers.  For
example,

    A =  2  3  1      B =  2  1
         1  1  2  ,        3  2  ,

and a third matrix C having three rows and one column are matrices.

      The dimensions of a matrix are the number of rows and the number of
columns that it has.  The matrix A above has two rows and three columns:

                              Column   Column   Column
                               One      Two     Three

                       Row
                       One      2        3        1

                       Row
                       Two      1        1        2

It is customary to say that A is a (2 x 3) matrix (number of rows x number of
columns).

See if you understand the above concepts by giving the dimensions of the

matrices B and C above and by giving an example of a (1 x 4) matrix.

If you said that B is a (2 x 2) matrix and that C has three rows and one

column, then you are right!  An example of a (1 x 4) matrix is any matrix

with one row and four columns, such as [-2, 3, 3, 0].  Incidentally, the

matrix B is a "square" matrix because it has the same number of rows and

columns.  Are A and C square?  No!

-------
      The numbers forming the "rectangular array" are called the elements of

the matrix.  If a_ij denotes the element in the i-th row and j-th column of the

matrix A on p. 3, then a_11 = 2, a_12 = 3, a_13 = 1, a_21 = 1, a_22 = 1, a_23 = 2.  It is often

informative to write A = ((a_ij)),

indicating that the matrix A (represented by a capital letter) with two rows

and three columns has typical element a_ij (represented by the corresponding

small letter).  Suppose that B = ((b_ij)), where b_11 = -2, b_21 = 3, b_31 = 2,

b_12 = 6, b_22 = 0, b_32 = 1.  Can you construct B using this information?

You are doing fine if you discovered that

    B =  -2   6
          3   0
          2   1

With the above notions firmly in mind, we are now in a position to introduce

-------
the concept of the transpose of a matrix.  The transpose A' of a matrix A is
defined to be that matrix whose i,j-th element a'_ij is equal to the j,i-th
element of A.  For example, if

    A =  2  3  1      then    A' =  2  1
         1  1  2 ,                  3  1
                                    1  2 ,

since a'_11 = a_11 = 2, a'_12 = a_21 = 1, a'_21 = a_12 = 3, a'_22 = a_22 = 1, a'_31 = a_13 = 1,
and a'_32 = a_23 = 2.  Another way of looking at it is that the first column of A
becomes the first row of A', the second column of A becomes the second row
of A', etc.  Thus, if A is (r x c), then A' is (c x r).
      For practice, find the transposes of

    A =  1  2      B =  1  0  2      C =  0
         3  1           0  4 -5           1
         0  1           2 -5  3           2
                                         -2

You should have found that

    A' =  1  3  0      B' =  1  0  2      C' = [0, 1, 2, -2]
          2  1  1            0  4 -5
                             2 -5  3
Note that in the above example the matrix B is such that B = B' (equality here

means that corresponding elements are equal).  A matrix satisfying this con-

dition is said to be a symmetric matrix.

      Does a symmetric matrix A have to be square?

The answer is YES, for otherwise A and A' would have different dimensions and

so could not possibly be equal.

Can you give a necessary and sufficient condition for the square matrix

A = ((a_ij)) to be symmetric?

The condition is that a_ij = a_ji for every i ≠ j.  If you have trouble seeing

this, try to construct some examples of symmetric matrices.  An important

special case where the above condition for symmetry is satisfied is when

a_ij = a_ji = 0 for every i ≠ j.  A square matrix having this property is said

to be a diagonal matrix.

Can you give an example of a (3 x 3) diagonal matrix?
-------
You are correct if your example is of the general form

    d_1   0    0
     0   d_2   0          (the elements d_1, d_2, d_3 form the "diagonal")
     0    0   d_3
The final new concept to be discussed is that of matrix multiplication.  Before
we give the mechanism by which we form the product AB of two matrices A and B,
let us hasten to point out that such a product exists if and only if the number
of columns of A is equal to the number of rows of B.  Thus, the product AB is
not necessarily equal to the product BA; for example, if A is (2 x 3) and B is
(3 x 1), then BA does not exist.





In general, if A = ((a_il)), i = 1,...,r, l = 1,...,k, and B = ((b_lj)),
l = 1,...,k, j = 1,...,c, then the (i,j)-th element c_ij of the product
C = AB is defined to be
                              k
                      c_ij =  Σ  a_il b_lj .
                             l=1
Note that AB will have the same number of rows as A and the same number of



columns as B.  For example, suppose
          [2  1  0]               [1  -2]
      A = [0  3  1]    and    B = [1   0] .
                                  [3   2]
Then,
      c11 = 2(1)  + 1(1) + 0(3) =  3,
      c12 = 2(-2) + 1(0) + 0(2) = -4,
      c21 = 0(1)  + 3(1) + 1(3) =  6,
      c22 = 0(-2) + 3(0) + 1(2) =  2.

-------
Finally,
                      [c11  c12]   [3  -4]
               AB  =  [c21  c22] = [6   2] .
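The row-by-column rule is easy to mechanize.  The following sketch (plain Python; `matmul` is our own helper name, not a library routine) reproduces the worked example above:

```python
def matmul(a, b):
    """c[i][j] = sum over l of a[i][l] * b[l][j]; needs cols(A) == rows(B)."""
    assert len(a[0]) == len(b), "columns of A must equal rows of B"
    return [[sum(a[i][l] * b[l][j] for l in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

A = [[2, 1, 0],
     [0, 3, 1]]
B = [[1, -2],
     [1,  0],
     [3,  2]]
print(matmul(A, B))      # [[3, -4], [6, 2]]
```

Swapping the arguments raises the assertion, mirroring the fact that BA does not exist when the dimensions do not match.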
Can you find AB when

      A = [      ]    and    B = [      ] ?

Can you find BA?

You should find that AB = [      ].  The product BA does not exist!






As a final practice exercise, if  X = [-2   1  -3]
                                      [ 1   0   1] ,

     a) Find X',

     b) Find X'X.

-------
The answers are:
                  [-2   1]
      a)  X'  =   [ 1   0]
                  [-3   1]

                  [ 5  -2    7]
      b)  X'X =   [-2   1   -3]
                  [ 7  -3   10]
Now, please take the post-test on the next page.   You will have 20 minutes


to complete the post-test.

-------
                                                                             10
                                 POST-TEST
                        "BASIC MATRIX MANIPULATIONS"
1.  Indicate whether each of the following statements is TRUE or FALSE.

                                                              TRUE      FALSE

    a)  A symmetric matrix is always square.

    b)  A diagonal matrix is always symmetric.

    c)  If A and B are square, then AB = BA.

    d)  If A is (2 x 4) and B is (3 x 4), then AB is (2 x 4).

    e)  X'X is always symmetric, even if X is not.
2.  For each of the following matrices, check the appropriate blank if  the matrix

    has the indicated property.

                                               SQUARE     SYMMETRIC     DIAGONAL

    a)  [3   0   0]
        [0   0   2]

    b)  [ 5  -1   2]
        [-1   3   1]
        [ 2   1   0]

    c)  [ 5 ]
3.  If  X = [3    7]
            [1   -1]
            [0    3] ,

    a)  Find X'

    b)  Find X'X
-------
                                                                           11
4.  If  A = [1   3]        and    B = [1   0]
            [     ]                   [     ] ,

    find AB and BA.

-------
                                                                             12
                         STUDENT ATTITUDE QUESTIONS

                                    for

                        "BASIC MATRIX MANIPULATIONS"
This package is:

      	 Better than an in-class lecture.

      	 Somewhat confusing.
      	 The repeated practice cycles are not worthwhile.

      	 Easy to follow.


I would like:

      	 Most of the course in this format.

      	 No more of this nonsense.

      	 Some topics in this format.




Please feel free to mention any strengths and  weaknesses of this unit which

occur to you:

-------
                              HANDOUT #13





A General Method of Fitting Multiple Linear Regression Models



1.  The Method of Least Squares



    A.  Consider the simple case where one wishes  to  fit  a  straight  line



        to a set of points.
    Model:  Yi = β0 + β1 xi + εi

    Problem:  Estimate β0 and β1.

    Definition:  A residual is the difference between an observation Yi
    and its predicted value.  For this example, a residual is Yi - (β̂0 + β̂1 xi).
                                             .
    Method of Least Squares:  Choose β̂0 and β̂1 so that the sum of squares
    of the residuals of the observations will be a minimum.  The solution
    will be:
                             n        _        _
                             Σ (xi - x)(Yi - Y)
                            i=1
                      β̂1 = ──────────────────────
                               n        _
                               Σ (xi - x)²
                              i=1

                           _        _
                      β̂0 = Y - β̂1 x
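The two formulas above can be applied mechanically.  Here is a minimal sketch in plain Python; the function name and the small data set are ours, for illustration only:

```python
def fit_line(x, y):
    """Least-squares line: b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),
    b0 = ybar - b1 * xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    return b0, b1

b0, b1 = fit_line([-2, -1, 0, 1, 2], [2, 3, 3, 4, 5])
print(round(b0, 2), round(b1, 2))      # 3.4 0.7
```

Because the x values here are centered (x̄ = 0), the intercept estimate is simply Ȳ.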



2.  A general model:  Consider a response Y which is a function of k factors

    (variables) x1, x2,..., xk.

    Model:
            Y = β0 + β1 x1 + β2 x2 + ... + βk xk + ε

    Assumptions:  For given levels of x1, x2,..., xk,

-------
    a)  The average value of the random error, ε, will equal zero.

    b)  The variance of ε is the same, regardless of the levels of the
        variables, and is equal to σ².

    c)  The errors are independent for the n measurements of response.

    Problem:  Estimate β0, β1, β2,..., βk, to form a prediction equation

    for the response Y.

    Method of Least Squares:  Choose β̂0, β̂1, β̂2,..., β̂k so as to minimize
    the sum of squares of the residuals,  Σ (Yi - Ŷi)².
-------
                       Let A = [      ] .          Then, A' = [      ] .
D.  The Addition of Matrices:  Matrices of the same dimension may be
    added by adding corresponding elements (same row-column position).

    Example:
                       [2  4]              [3  0]
               Let A = [1  0]    and   B = [2  1] .

                            [2  4]   [3  0]   [5  4]
               Then, A+B =  [1  0] + [2  1] = [3  1] .
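In code, matrix addition is a single pass over paired elements.  A plain-Python sketch (the helper name `madd` is ours):

```python
def madd(a, b):
    """Element-wise sum; A and B must have the same dimensions."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

print(madd([[2, 4], [1, 0]], [[3, 0], [2, 1]]))     # [[5, 4], [3, 1]]
```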
E.  Multiplication of Matrices
         Let  A   = [2  1]       and    B   = [2  1  3]
             (2x2)  [0  4]             (2x3)  [0  1  2] .

    The product is:

               AB = [2  1] [2  1  3]  =  [4  3  8]
                    [0  4] [0  1  2]     [0  4  8] .
    Note:  1)  AB is not generally equal to BA



           2)  AB exists only if the number of columns in A equals the



               number of rows in B.

-------
                3)  If A is of dimension mxn and B is nxp, then the product



                    AB will have dimensions mxp.



        Example:  Let A = [1, 2, 0]    and    B = [5]
                                                  [1]
                                                  [2] .

                                     [5]
        Then,  AB = [1, 2, 0]        [1]  =  [1(5) + 2(1) + 0(2)]  =  [7].
                                     [2]
    F.  The identity matrix.  The square matrix

                              [1  0  0 . . . 0]
                    I       = [0  1  0 . . . 0]
                     nxn      [. . . . . . . .]
                              [0  0  0 . . . 1]

        is said to be the identity matrix.  Given a square matrix A   ,
                                                                   nxn
        then AI = IA = A.
    G.  The inverse of a matrix:  A square matrix is said to have an inverse,

        denoted by the symbol A⁻¹, if

                             AA⁻¹ = A⁻¹A = I.

        Note:  Not every square matrix has an inverse.
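For the (2 x 2) case the inverse has a closed form, which also shows why some square matrices have no inverse: the determinant may be zero.  A sketch of our own (plain Python; `inverse2` is an invented name, not part of the handout):

```python
def inverse2(a):
    """Inverse of a 2 x 2 matrix, when it exists."""
    (p, q), (r, s) = a
    det = p * s - q * r
    if det == 0:
        return None                     # singular: no inverse exists
    return [[ s / det, -q / det],
            [-r / det,  p / det]]

print(inverse2([[2, 1], [0, 4]]))       # [[0.5, -0.125], [0.0, 0.25]]
print(inverse2([[1, 2], [2, 4]]))       # None -- the rows are proportional
```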



4.  The Least Squares Estimates

    Arrange the n responses in an nx1 vector Y and the corresponding values

    of x1, x2,..., xk in a second matrix X as follows:

              [Y1]                     [1   x11   x21  . . .  xk1]
              [Y2]                     [1   x12   x22  . . .  xk2]
        Y   = [Y3]         X         = [1   x13   x23  . . .  xk3]
       (nx1)  [. ]        (nx(k+1))    [.    .     .           . ]
              [Yn]                     [1   x1n   x2n  . . .  xkn]

-------
Let  B̂  be a ((k+1) x 1) matrix containing the desired estimates of

              the parameters

                        β0, β1, β2,..., βk.

      Then, it can be shown that the least squares estimates are given by the

      matrix equation

                        B̂ = (X'X)⁻¹ X'Y .
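One convenient way to compute B̂ is to solve the normal equations (X'X)B̂ = X'Y rather than invert X'X explicitly.  The self-contained sketch below is ours (plain Python; the function names are invented, and Gaussian elimination stands in for the library routines a real program would use):

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]      # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def least_squares(X, Y):
    """B-hat = (X'X)^-1 X'Y, computed from the normal equations."""
    Xt = [list(col) for col in zip(*X)]
    XtX = [[sum(u * v for u, v in zip(r, c)) for c in Xt] for r in Xt]
    XtY = [sum(u * y for u, y in zip(r, Y)) for r in Xt]
    return solve(XtX, XtY)

# straight line through (x, Y) = (-1, 1), (0, 2), (1, 3): B-hat = (2, 1)
X = [[1, -1], [1, 0], [1, 1]]
Y = [1, 2, 3]
print([round(b, 6) for b in least_squares(X, Y)])     # [2.0, 1.0]
```

The first column of ones in X corresponds to the intercept β0, exactly as in the layout of X shown above.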
5.  Advantage of the MATRIX approach in fitting multiple linear regression

    models:

    A.  It is general in approach.

    B.  It is especially suitable for high speed electronic computation.

        The major problem is the inversion of the matrix X'X.

6.  The expression for SSE in matrix notation is

          SSE = Y'Y - B̂'X'Y,

    so that S² = SSE/(n-k-1).

7.  If the (i,j)-th element of (X'X)⁻¹ is denoted as c_ij, then Var(β̂i) =

    c_ii σ²  and  Cov(β̂i, β̂j) = c_ij σ²  for i ≠ j.

8.  A 95% confidence interval for βi is  β̂i ± t.975,n-k-1 S √c_ii .

-------
                                HANDOUT #14
                         MULTIPLE REGRESSION EXAMPLE

Suppose that the amount of air pollution (measured as the number of particles

per cubic millimeter) is recorded for two different types of pollutants, say A

and B, at five equally spaced points in time.  The data is presented in tabular

form below.  For the sake of simplicity, it is assumed that the ten responses

are all mutually independent.

                                       (Coded Time)

                                  -2    -1     0     1     2

          TYPE OF        A         2     3     3     4     5
         POLLUTANT       B         1     3     6     8    10
a.  Using the method of least squares, fit the linear model

                   Y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε

    to the n = 10 data points.  Let x1 = 1 if the data point refers to pollutant

    type B and let x1 = 0 if the point refers to A.  Let x2 = (coded) time.  You

    may use the fact that

                          1  [ 1   -1     0      0  ]
                (X'X)⁻¹ = -  [-1    2     0      0  ]
                          5  [ 0    0    1/2   -1/2 ]
                             [ 0    0   -1/2    1   ]
    Y' = (2, 3, 3, 4, 5, 1, 3, 6, 8, 10), and

          [ 1   1   1   1   1   1   1   1   1   1 ]
     X' = [ 0   0   0   0   0   1   1   1   1   1 ]
          [-2  -1   0   1   2  -2  -1   0   1   2 ]
          [ 0   0   0   0   0  -2  -1   0   1   2 ] ,

    so that  Y'X = (45, 28, 30, 23).  Then,

          [β̂0]                  [3.40]
     B̂ =  [β̂1]  = (X'X)⁻¹X'Y =  [2.20]
          [β̂2]                  [0.70]
          [β̂3]                  [1.60]

-------
So, the fitted model is:

     Ŷ = 3.40 + 2.20 x1 + .70 x2 + 1.60 x1 x2 .

Now, SSE = Y'Y - B̂'X'Y

         = 273 - 272.40 = 0.60.

So, S² = 0.60/(10-4) = .10  =>  S ≈ .316.
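The arithmetic above can be checked in a few lines.  The sketch below (plain Python, ours) rebuilds X from the design described in part (a), forms X'Y, and applies the (X'X)⁻¹ given in the handout:

```python
times = [-2, -1, 0, 1, 2]
yA = [2, 3, 3, 4, 5]                       # pollutant A readings
yB = [1, 3, 6, 8, 10]                      # pollutant B readings
X = [[1, 0, t, 0] for t in times] + [[1, 1, t, t] for t in times]
Y = yA + yB

XtY = [sum(row[j] * y for row, y in zip(X, Y)) for j in range(4)]
print(XtY)                                 # [45, 28, 30, 23]

XtX_inv = [[ 1/5, -1/5,     0,     0],     # (X'X)^-1 from the handout
           [-1/5,  2/5,     0,     0],
           [   0,    0,  1/10, -1/10],
           [   0,    0, -1/10,   1/5]]
B = [sum(XtX_inv[i][j] * XtY[j] for j in range(4)) for i in range(4)]
print([round(b, 2) for b in B])            # [3.4, 2.2, 0.7, 1.6]

SSE = sum(y * y for y in Y) - sum(b * v for b, v in zip(B, XtY))
print(round(SSE, 2))                       # 0.6
```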
b.  Test H0: β3 = 0 versus Ha: β3 ≠ 0 at the 5% level of significance.  Inter-

    pret β3 as something more than just an "interaction effect."

    H0: β3 = 0 ;  Ha: β3 ≠ 0 (two-tailed test)

    rejection region:  reject H0 if |t| = |β̂3/(S√c33)| > t.975,6 = 2.447.

    In our case, t = 1.60/(.316 √(1/5)) ≈ 11.32, which means that there is strong

    evidence in favor of Ha.




    When x1 = 0 (pollutant A), E(Y) = β0 + β2 x2;

    when x1 = 1 (pollutant B), E(Y) = (β0 + β1) + (β2 + β3) x2.

    Thus, β3 represents the difference in the slopes of the straight lines

      giving the linear relationship between response and (coded) time for

      pollutants A and B.



c.  Find a 98% confidence interval for (β2 + β3) and also interpret the meaning

    of this parameter.

    The best point estimate of (β2 + β3) is (β̂2 + β̂3), where Var(β̂2 + β̂3) =

    Var β̂2 + Var β̂3 + 2Cov(β̂2, β̂3) = (1/10 + 1/5 - 2/10)σ² = σ²/10.  The

    98% confidence interval for (β2 + β3) is:

      (β̂2 + β̂3) ± t.99,6 S/√10 , or  (.70 + 1.60) ± 3.143(.316)/√10 , or

      2.30 ± .31 , i.e., (1.99, 2.61).

-------
    From part (b), it should be clear that (β2 + β3) is just the slope of the

    straight line expressing the linear relationship between the response and

    (coded) time for pollutant B.
d.  Calculate the value of R².  What does this value mean?

     10       _         (45)²
     Σ (Yi - Y)² = 273 - ───── = 273 - 202.5 = 70.5
    i=1                   10

                      SSE           0.60
    So, R² = 1 - ────────────  = 1 - ──── = 1 - .0085 = .9915
                  10       _         70.5
                  Σ (Yi - Y)²
                 i=1

    This value is the ratio of the reduction in the sum of squares of

    deviations obtained by using the linear model Ŷ = β̂0 + β̂1 x1 + β̂2 x2 + β̂3 x1 x2

    to the total sum of squares of deviations about the sample mean, Ȳ, which

    would be the predictor of Y if x1 and x2 were completely ignored.

-------
                                HANDOUT #15



            A GENERAL DISCUSSION OF THE CONCEPT OF INTERACTION



     It is the purpose of this section to provide a brief and general dis-

cussion of the concept of statistical "interaction".  More specifically, we

will be concerned with describing the way in which two independent variables

can "interact" to affect a dependent variable and with representing such an

interaction phenomenon in terms of an appropriate regression model.

     To help illustrate these concepts, we will consider the following simple

example.  Suppose it is of interest to determine how two independent varia-

bles, namely temperature (T) and ozone concentration (C), jointly affect

the reaction rate (Y) in a laboratory-simulated atmospheric system.  Further,

suppose that two particular temperature levels (namely, T0 and T1) and two

particular ozone concentration levels (namely, C0 and C1) are to be exam-

ined, and that an experiment is performed in which an observation on Y is

obtained for each of the four combinations of temperature - ozone concentra-

tion levels, namely (T0, C0), (T0, C1), (T1, C0), and (T1, C1).*

     Now, let us consider the following two graphs based on two different data

sets that could conceivably have arisen by using the experimentation scheme

described above.
     *Statistically speaking, the above experiment is called a complete factorial
      experiment, complete in the sense that observations on Y are obtained for
      all combinations of settings for the independent variables (or factors).
      The advantage in employing a factorial experiment is that of being able
      to detect and measure interaction effects when they exist.

-------
               FIGURE 1                                FIGURE 2
     Figure 1 suggests that the rate of change* in Y as a function of

temperature is the same regardless of the level of ozone concentration;

in other words, the relationship between Y and T does not in any way depend
on C.  It is important to point out that we are not saying that Y and C are

unrelated, but that the relationship between Y and T is independent of the

relationship between Y and C.  When this is the case, we say that T and C do

not interact, or, equivalently, that there is no T by C interaction effect.

Practically speaking, this means that we can investigate the effects of T

and C on Y independently of one another and that we can legitimately talk

about the separate effects (sometimes called the "main effects") of T and C

on Y.

     One way to quantify the relationship depicted in Figure 1 is with a

regression model of the form

                    μY|T,C = β0 + β1 T + β2 C                        (1)

    * For those readers having some familiarity with calculus, the phrase "rate
      of change" is related to the notion of a "derivative of a function."  In
      particular, Figure 1 is meant to portray a situation where the partial deri-
      vative with respect to T of the response function relating the mean of Y to
      T and C is independent of C.
                                       -2-

-------
Here, the change in the mean of Y for a one-unit change in T is equal to

β1, regardless of the level of C.  In fact, changing the level of C in (1)

has only the effect of shifting the straight line relating μY|T,C and T

either up or down without affecting the value of the slope β1, as is seen

in Figure 1.  In particular, μY|T,C1 = (β0 + β2 C1) + β1 T and μY|T,C0 = (β0 + β2 C0) + β1 T.
     In general, then, one might say that "no interaction" is synonymous with



"parallelism" in the sense that the response curves of Y versus T for various



fixed values of C are parallel; in other words, these response curves (which



may be linear or nonlinear) all have the same general shape, differing from



one another only by additive constants independent of T (e.g., see Figure 3



below).
                               FIGURE 3




     In contrast, Figure 2 depicts the situation where the relationship

between Y and T depends on C; in particular, Y appears to increase with

increasing T when C = C0, but appears to decrease with increasing T when C = C1.



In other words, the behavior of Y as  a function of  temperature cannot be con-



sidered independently of ozone concentration.  When this is  the case, we
                                    -3-

-------
say that T and C interact, or equivalently, that there is a T by C inter-



action effect.  Practically speaking, this means that it really doesn't make



much sense to talk about the separate (or main) effects of T and C on Y,



since T and C do not operate independently of one another with respect to



their effects on Y.



     One way to mathematically account for such an interaction effect is to

consider a regression model of the form

                    μY|T,C = β0 + β1 T + β2 C + β12 TC               (2)
Here, the change in the mean value of Y for a one-unit change in T is equal

to (β1 + β12 C), which clearly depends on the level of C.  In other words,

the introduction of a product term like β12 TC into a regression model like

Equation (2) is one way to account for the fact that two factors like

T and C do not operate independently of one another.  For our particular

example, when C = C0, model (2) can be written as

                    μY|T,C0 = (β0 + β2 C0) + (β1 + β12 C0) T ,
and, when C = C1, model (2) becomes

                    μY|T,C1 = (β0 + β2 C1) + (β1 + β12 C1) T .

And, Figure 2 suggests that the interaction effect β12 would be negative,

with the linear effect (β1 + β12 C0) of T at C0 being positive and the linear

effect (β1 + β12 C1) of T at C1 being negative.  A negative interaction effect

is to be expected here, since Figure 2 suggests that the linear relationship

between Y and T "reverses" from a direct one at C0 to an inverse one at C1.

Of course, it is possible for β12 to be positive, in which case the inter-

action effect would manifest itself in terms of a larger positive value for

the slope when C = C1 than when C = C0.
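The slope reversal suggested by Figure 2 can be imitated numerically.  In the sketch below the coefficient values are invented for illustration only (they are not estimated from any data in this handout); the point is that a negative β12 flips the sign of the slope of Y against T as C moves from C0 = 0 to C1 = 1.

```python
b0, b1, b2, b12 = 10.0, 2.0, 1.0, -5.0     # hypothetical coefficients

def slope_in_T(C):
    """d(mean Y)/dT under model (2): b1 + b12 * C."""
    return b1 + b12 * C

print(slope_in_T(0.0), slope_in_T(1.0))    # 2.0 -3.0
```

A direct (positive) relationship at C = 0 becomes an inverse (negative) one at C = 1, exactly the "non-parallel" picture that defines a T by C interaction.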




                                       -4-

-------
                              HANDOUT #16

                             EXERCISE SET


The data below are replicate measurements of the SO2 concentration (in ppm)

in each of three cities

               City I         City II        City III

                  2              4               2
                  1              6               5
                  3              8               2
a)  Fit by least squares the model

                        Y = β0 + β1 x1 + β2 x2 + ε,

    where x1 = 1 if the measurement pertains to city I and equals zero other-

    wise, and x2 = 1 if the measurement pertains to city II and equals zero

    otherwise.
-------
b)  Calculate S².

c)  Test H0: β1 = β2 versus the alternative Ha: β1 ≠ β2 at the 10% level.

-------
d)  For this same data set, state precisely the appropriate analysis of variance




    model, .construct the associated ANOVA table, and test (a=.05)  the null




    hypothesis of "no differences among cities".  (NOTE:   If you are proceeding



    correctly, the value of MSE in the ANOVA table should equal the value of

    S² obtained earlier.)

-------
e)  Find a 90% confidence interval for the true difference between the effects



    of cities I and II.
f)  What is the relationship between the $'s in the model of part (a) and the



    parameters in the ANOVA model of part (d)?

-------
                             HANDOUT #17
                 SOLUTION TO EXERCISE SET, HANDOUT #16

The data below are replicate measurements of the SO2 concentration (in ppm)
in each of three  cities
               City I         City II        City III

                  2              4               2
                  1              6               5
                  3              8               2
a)  Fit by least squares the model

                        Y = β0 + β1 x1 + β2 x2 + ε,

    where x1 = 1 if the measurement pertains to city I and equals zero other-

    wise, and x2 = 1 if the measurement pertains to city II and equals zero

    otherwise.  In this case,
                           1  [ 1  -1  -1]
                 (X'X)⁻¹ = -  [-1   2   1]
                           3  [-1   1   2]

    Let  Y'  = (2, 1, 3, 4, 6, 8, 2, 5, 2),  and let X be the (9x3) matrix
        (1x9)

          [1  1  0]
          [1  1  0]
          [1  1  0]
          [1  0  1]                                [9  3  3]
      X = [1  0  1] ,   in which case  X'X      =  [3  3  0]
          [1  0  1]                       (3x3)    [3  0  3]
          [1  0  0]
          [1  0  0]
          [1  0  0]

                   [33]
    and  X'Y    =  [ 6] .
        (3x1)      [18]

                                   1 [ 1  -1  -1] [33]   [ 3]
    Then,  B̂    = (X'X)⁻¹X'Y  =    - [-1   2   1] [ 6] = [-1] .
         (3x1)                     3 [-1   1   2] [18]   [ 3]

    So, Ŷ = 3 - x1 + 3 x2 .
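As a check, the sketch below (plain Python, ours) rebuilds X'Y for this design and applies the (X'X)⁻¹ given above:

```python
Y = [2, 1, 3, 4, 6, 8, 2, 5, 2]
X = [[1, 1, 0]] * 3 + [[1, 0, 1]] * 3 + [[1, 0, 0]] * 3

XtY = [sum(row[j] * y for row, y in zip(X, Y)) for j in range(3)]
print(XtY)                          # [33, 6, 18]

XtX_inv = [[ 1/3, -1/3, -1/3],      # (X'X)^-1 from above
           [-1/3,  2/3,  1/3],
           [-1/3,  1/3,  2/3]]
B = [sum(XtX_inv[i][j] * XtY[j] for j in range(3)) for i in range(3)]
print([round(b) for b in B])        # [3, -1, 3]
```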

-------
b)  Calculate S².
                              9          [33]
    SSE = Y'Y - B̂'X'Y  =      Σ Yi²  -  (3, -1, 3) [ 6]  = 163 - 147 = 16
                             i=1         [18]

             SSE       16
    S² = ─────────  =  ──  ≈ 2.67
         n - (k+1)     9-3
c)  Test H0: β1 = β2 versus the alternative Ha: β1 ≠ β2 at the 10% level.

                               β̂1 - β̂2
    test statistic:  t = ───────────────── ;
                         √Var(β̂1 - β̂2)

    rejection region:  reject when |t| > t.95,6 = 1.943.

    Now,  Var(β̂1 - β̂2) = Var(β̂1) + Var(β̂2) - 2 Cov(β̂1, β̂2)

                        = (1/3)(2 + 2 - 2) σ²  =  (2/3) σ².

                 (-1 - 3)
    So,  t = ───────────────  =  -3.00 ;  since |t| = 3 > 1.943, reject H0.
             √((2/3)(2.67))

-------
d)  For this same data set, state precisely the appropriate analysis of variance

    model, construct the associated ANOVA table, and test (α=.05) the null

    hypothesis of "no differences among cities".  (NOTE:  If you are proceeding

    correctly, the value of MSE in the ANOVA table should equal the value of

    S² obtained earlier.)

    ONE-WAY ANOVA MODEL:  Yij = μ + τi + εij ,  i = 1, 2, 3;  j = 1, 2, 3,

    where Yij is the j-th observation in the i-th city, μ is the over-all mean,

    τi is the additive effect of the i-th city, εij ~ N(0, σ²), and the {εij}

    are mutually uncorrelated.

          _    1  3              _    1  3  3
    Let   Yi = -  Σ Yij    and   Y  = -  Σ  Σ Yij ,   i = 1, 2, 3.
               3 j=1                  9 i=1 j=1

                                    3  3       _     3  3         1   3  3
    Total sum of squares = TSS  =   Σ  Σ (Yij -Y)² = Σ  Σ Yij² -  - ( Σ  Σ Yij)²
                                   i=1 j=1          i=1 j=1       9  i=1 j=1

                                = 163 - (33)²/9 = 163 - 121 = 42.

                                             3   _    _
    Treatment sum of squares = SST  =   3    Σ  (Yi - Y)²  =  (1/3)(441) - 121
                                            i=1
                                    =   147 - 121 = 26.

    Error sum of squares = SSE = TSS - SST = 42 - 26 = 16  =>  MSE = SSE/6 ≈ 2.67.

    SOURCE OF VARIATION   D.F.   SUM OF SQUARES   MEAN SQUARE   F RATIO

          CITIES            2          26             13        13/2.67 = 4.87

          ERROR             6          16            2.67

          TOTAL             8          42

    DO NOT REJECT H0 OF "NO DIFFERENCE AMONG CITIES" SINCE 4.87 < F.95,2,6 = 5.14.
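The ANOVA sums of squares can be verified directly.  A plain-Python sketch of our own:

```python
cities = [[2, 1, 3], [4, 6, 8], [2, 5, 2]]
obs = [y for city in cities for y in city]
n, k = len(obs), len(cities)

correction = sum(obs) ** 2 / n                    # (33)^2 / 9 = 121
TSS = sum(y * y for y in obs) - correction        # 163 - 121 = 42
SST = sum(sum(c) ** 2 / len(c) for c in cities) - correction   # 147 - 121 = 26
SSE = TSS - SST                                   # 16
F = (SST / (k - 1)) / (SSE / (n - k))             # F = 4.875, reported as 4.87 above
print(TSS, SST, SSE, F)
```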

-------
e)  Find a 90% confidence interval for the true difference between the effects

    of cities I and II.
           _                      _
        E(Yi) = μ + τi ,     Var(Yi) = σ²/3 ,   i = 1, 2, 3.

    The 90% confidence interval for (τ1 - τ2) is:
         _    _
        (Y1 - Y2) ± 1.943 √(2S²/3)

     => (2 - 6) ± 1.943 √(2(2.67)/3)

     => -4 ± 2.59  =>  (-6.59, -1.41)



f)  What is the relationship between the β's in the model of part (a) and the

    parameters in the ANOVA model of part (d)?

                             β0 = μ + τ3

                             β1 = τ1 - τ3

                             β2 = τ2 - τ3

    It can be seen that  β̂0 = Ȳ3 = 3,  β̂1 = Ȳ1 - Ȳ3 = 2 - 3 = -1,

    and β̂2 = Ȳ2 - Ȳ3 = 6 - 3 = 3.

    Thus, the hypothesis tested in part (c) is essentially equivalent to the

    confidence interval in part (e).

    This problem illustrates (at least in this simple example) how regression

    techniques can be used to solve ANOVA problems.

-------
                                HANDOUT #18




                               EXERCISE SET






The table below gives the concentration (Y) of particulate matter in the air




at a series of seven equally spaced distances (D) from zero to six miles down-




wind from a certain chemical plant.






                                  Coded Distance x  (= D - 3)

                        -3    -2    -1     0     1     2     3

         Conc. (Y)       5     6     8     6     5     5     2
a.  Fit the model Y = β0 + β1 x + β2 x² + ε to this data using the method of

    least squares.  You may have use for the following matrix:

                        1  [ 7/3    0   -1/3 ]
              (X'X)⁻¹ = -  [  0    1/4    0  ]
                        7  [-1/3    0   1/12 ]
-------
b.  Test whether there is a significant (α = .05) linear effect of distance

    on concentration (i.e., test H0: β1 = 0 versus Ha: β1 ≠ 0).
c.  Use the fitted equation to obtain an estimate of the distance from the

    plant at which the particulate concentration is at its highest level.
d.  Find a 90% confidence interval for the true average value of the particulate

    concentration at the estimated distance of maximum particulate concentration

    found in part (c).

e.  Calculate R̄² = R² - (k/(n-k-1))(1-R²), and comment.

-------
So, R̄² = .84 - (2/(7-2-1))(1-.84) = .84 - .08 = .76.  The "null" value of R² is

k/(n-1) = 2/(7-1) = .33, and R̄² = 0 when R² = .33.  The quantity R̄² represents

a correction to R² to allow for the fact that R² ≈ k/(n-1) when β1 = β2 = ··· =

βk = 0.

-------
                                HANDOUT #19
                   SOLUTION TO EXERCISE SET, HANDOUT #18

 The table below gives the concentration (Y)  of particulate matter in the air
 at a series of seven equally spaced distances (D)  from zero to six miles down-
 wind from a certain chemical plant.

                                  Coded Distance x  (= D - 3)

                        -3    -2    -1     0     1     2     3

         Conc. (Y)       5     6     8     6     5     5     2
a.  Fit the model Y = 8Q + 8,x •*• 3,*  + e to this data using  the method  of  leu?t
    squares.  You may have use for the following matrix:

(X'X)~] = 1/7

i

(7*1) =

	 1
5
6
8
6
5
5
2
7/3 0 -1/3
0 1/4 0
-1/3 0 1/12


X
(7x3) =

1 -3 9
1 -2 4
1-11
100
1 1 1
124
1 3 9

•



X'X
(3x3) '
— 1
7 0 23
0 28 0
28 0 196
_ 	

       (3x1)
 37
-14
120
        =  (X'X)~1X'Y
   (5x1)
        6.62
        -.50
"I
  1 .
Y = 6.62 -  .\ - ix
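The normal-equations calculation above can be checked numerically.  The sketch below (using numpy; not part of the original workbook) reproduces β̂:

```python
import numpy as np

# Exercise data: coded distance X = D - 3 and concentration Y
x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
Y = np.array([5, 6, 8, 6, 5, 5, 2], dtype=float)

# Design matrix for the quadratic model Y = b0 + b1*X + b2*X^2 + e
X = np.column_stack([np.ones_like(x), x, x**2])

# Normal equations: beta_hat = (X'X)^-1 X'Y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ (X.T @ Y)
print(np.round(beta_hat, 2))  # approximately [6.62, -0.50, -0.33]
```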

-------
b.  Test whether there is a significant (α = .05) linear effect of distance
    on concentration.

        SSE = Y'Y - β̂'X'Y = 215 - 211.90 = 3.10.

    S² = SSE/(n-3) = 3.10/4 = .775, S = .88.  H0: β1 = 0, Ha: β1 ≠ 0
    (two-tailed test), with test statistic

        t = β̂1 / (S/√28) = -.50 / (.88/√28) = -.50/.1663 = -3.00.

    Since |t| = 3 > t(.975, 4) = 2.776, we reject H0.





c.  Use the fitted equation to obtain an estimate of the distance from the
    plant at which the particulate concentration is at its highest level.

        Ŷ = 6.62 - (1/2)X - (1/3)X²

        dŶ/dX = -1/2 - (2/3)X = 0  =>  X0 = -3/4  (or D = 2.25)

        d²Ŷ/dX² = -2/3 < 0  =>  MAXIMUM





d.  Find a 90% confidence interval for the true average value of the particulate
    concentration at the estimated distance of maximum particulate concentration
    found in part (c).  The best estimate of E(Y) at X0 = -3/4 is

        Ŷ0 = 6.62 - (1/2)(-3/4) - (1/3)(9/16) = 6.81.

    With x0' = (1, -3/4, 9/16), the estimated Var(Ŷ0) at X0 = -3/4 is
    S² x0'(X'X)^-1 x0 = S²(.30), so the 90% CI is

        6.81 ± t(.95, 4) √(.30) S  =>  6.81 ± 2.132(.88)√(.30)  =>  6.81 ± 1.0.
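The interval can be verified numerically.  A sketch (numpy; the critical value t(.95, 4) = 2.132 is the table value quoted above), assuming the quadratic fit from part (a):

```python
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
Y = np.array([5, 6, 8, 6, 5, 5, 2], dtype=float)
X = np.column_stack([np.ones_like(x), x, x**2])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ (X.T @ Y)

# Residual variance S^2 = SSE/(n - 3), and maximizing distance x0 = -b1/(2*b2)
resid = Y - X @ beta_hat
S2 = resid @ resid / (len(Y) - 3)
x0 = -beta_hat[1] / (2 * beta_hat[2])          # -0.75, i.e., D = 2.25
x0_vec = np.array([1.0, x0, x0**2])

# 90% CI for E(Y) at x0:  Y0_hat +/- t(.95,4) * sqrt(S^2 * x0'(X'X)^-1 x0)
Y0_hat = x0_vec @ beta_hat
half_width = 2.132 * np.sqrt(S2 * (x0_vec @ XtX_inv @ x0_vec))
print(round(Y0_hat, 2), round(half_width, 2))  # 6.81 1.03
```

The workbook rounds the half-width to 1.0.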




e.  Calculate R̄² and comment.

        R² = [Σ(Yi - Ȳ)² - SSE] / Σ(Yi - Ȳ)²

           = (19.43 - 3.10)/19.43 = 16.33/19.43 = .84,

    where Σ(Yi - Ȳ)² = Y'Y - nȲ² = 215 - (37)²/7 = 19.43.

-------
                              HANDOUT #20





                SELECTING THE BEST REGRESSION EQUATION





1.  General Considerations


    The general problem to be discussed in this handout is as follows.  We


have one dependent variable Y and a set of k independent variables X1, X2, ..., Xk.


We want to determine the best (i.e., the most important) subset of these k


independent variables and the corresponding best fitting regression model.


     There are four basic statistical procedures currently in use for


selecting this so-called best regression equation.  These are:

     1)  The All Possible Regressions Procedure


     2)  The Backward Elimination Procedure


     3)  The Forward Selection Procedure


     4)  The Stepwise Regression Procedure


(There are also some slight variations on these basic procedures and a few


other algorithms that are sometimes used.)


     Before proceeding to describe each of these procedures in detail, some


general comments are in order.


     1)  Some of the k independent variables may consist of higher order

functions of a few basic variables (e.g., cross products like X1X2 or squares like X1²).  In



practice, the use of any of the above procedures gives more acceptable


results when there are not very many of such higher order terms under

-------
consideration,* one reason being that a model containing several such terms

is invariably quite difficult to interpret.

     2)  It is possible to arrive at different solutions by using the four

different methods.  When this happens, it is necessary to weigh all the

results and make a choice of best model based on practical considerations

regarding the variables under study, the nature of the data, and the inter-

pretations that can be made regarding the different  candidate models.

     3)  Sometimes, even a single procedure will provide a number of

reasonably good candidate models, from which a choice will have to be made.



2.  Description of the Methods

    We will illustrate the use of the four procedures in the context of

the following example, which involves the three independent variables

                   2
X., X-, and X_ « X_,  and the dependent variable Y.
        OBS. NO.     Y     X1     X2     X3 (=X2²)

            1       64     57      8       64
            2       71     59     10      100
            3       53     49      6       36
            4       67     62     11      121
            5       55     51      8       64
            6       58     50      7       49
            7       77     55     10      100
            8       57     48      9       81
            9       56     42     10      100
           10       51     42      6       36
           11       76     61     12      144
           12       68     57      9       81
    *Variable selection procedures like those discussed in this chapter some-
     times lead to unstable subset selection when the candidate independent vari-
     ables are highly correlated (e.g., when some independent variables are
     functions of other independent variables); see Marquardt and Snee (1975)
     for further discussion in this regard.
                                     -2-

-------
2.1  The All Possible Regressions Procedure


     Fit every possible regression equation associated with every possible


combination of the independent variables; in particular, for our example,


fit the seven models corresponding to the following seven combinations of


independent variables:  (i) X1   (ii) X2   (iii) X3   (iv) X1, X2

(v) X1, X3   (vi) X2, X3   and (vii) X1, X2, X3.  (For k

independent variables, the number of models to be fitted is 2^k - 1.)  Then,

partition the different equations obtained into sets involving 1, 2, 3, ...,

and k variables, and order the models within each set according to some

criterion (e.g., R²).  The leaders in each set are then selected


for further examination.


     For our data, a summary of the results of this "all possible regres-


sions" procedure is given in Table 1 below:
                                   -3-

-------
      TABLE 1.  SUMMARY OF RESULTS OF "ALL POSSIBLE REGRESSIONS" PROCEDURE

Number of                        Estimated Coefficients
Variables  Variables                                        Partial F      Overall
in Model   in Model          β̂0      β̂1     β̂2      β̂3     Statistics     F Statistic    R²
---------  ----------------  ------  -----  -----  ------  -------------  -----------  -----
    1      (i)   X1           6.190  1.072                  19.67**         19.67**    .6630
    1      (ii)  X2          30.571         3.643           14.55**         14.55**    .5926
    1      (iii) X3          45.998                0.206    14.25**         14.25**    .5876
    2      (iv)  X1, X2       6.553  0.722  2.050           7.665*  4.785   15.95**    .7800
    2      (v)   X1, X3      15.118  0.726         0.115    7.601*  4.565   15.63**    .7764
    2      (vi)  X2, X3      32.404         3.205  0.025    0.113   0.002    6.55*     .5927
    3      (vii) X1, X2, X3   3.438  0.723  2.777  -0.042   6.827*  0.140
                                                            0.010            9.47**    .7802
-------
     From the above table, the leaders (in terms of R² values) in each
of the sets involving one, two, and three variables are given by:

        One variable set:     (i)   X1 with R² = .6630

        Two variable set:     (iv)  X1, X2 with R² = .7800

      Three variable set:     (vii) X1, X2, X3 with R² = .7802

Of the above three models, clearly model (iv) involving X1 and X2 should
be our choice, since its R² value is essentially the same as the value
for model (vii) and is much higher than the value for model (i).

     Thus, our choice of "best" regression equation based on the "all possi-
ble regressions" procedure is given by:

                    Ŷ = 6.553 + 0.722 X1 + 2.050 X2
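The whole procedure can be reproduced by brute force.  A sketch (numpy; not part of the original handout) that fits all 2³ - 1 = 7 subsets of the example data and records each R²:

```python
import numpy as np
from itertools import combinations

# The 12-observation example data from the handout (X3 = X2 squared)
Y  = np.array([64, 71, 53, 67, 55, 58, 77, 57, 56, 51, 76, 68], dtype=float)
X1 = np.array([57, 59, 49, 62, 51, 50, 55, 48, 42, 42, 61, 57], dtype=float)
X2 = np.array([ 8, 10,  6, 11,  8,  7, 10,  9, 10,  6, 12,  9], dtype=float)
preds = {"X1": X1, "X2": X2, "X3": X2**2}

def r_squared(cols):
    """R^2 of the least-squares fit of Y on an intercept plus the named columns."""
    X = np.column_stack([np.ones_like(Y)] + [preds[c] for c in cols])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ b
    return 1 - (resid @ resid) / np.sum((Y - Y.mean()) ** 2)

# One fit per nonempty subset of {X1, X2, X3}
results = {cols: r_squared(cols)
           for r in (1, 2, 3) for cols in combinations(preds, r)}
best_two = max((c for c in results if len(c) == 2), key=results.get)
print(best_two)  # ('X1', 'X2'), with R^2 of about .7800
```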
2.2  The Backward Elimination Procedure

     Step 1:  Determine the fitted regression equation containing all
independent variables.

     For our example, we obtain

               Ŷ = 3.438 + 0.723 X1 + 2.777 X2 - 0.042 X3,

with ANOVA table

          Source      DF      SS       MS        F        R²

          Regression   3    693.06   231.02    9.47**    .7802
          Residual     8    195.19    24.40

          Total       11    888.25
                                    -5-

-------
     Step 2:  Calculate the partial F statistic for every variable in
the model as though it were the last variable to enter (see Table 1):

               Variable     Partial F (based on 1 and 8 df)

                  X1                   6.827*

                  X2                   0.140

                  X3                   0.010

(Recall that the partial F statistics above test whether the addition
of the last variable to the model significantly helps in predicting the
dependent variable, given that the other variables are already in the model.)
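These partial F statistics can also be computed directly from the R² values in Table 1, using the identity F = (R²_full - R²_reduced) / [(1 - R²_full)/(n - k - 1)] for a single added variable.  A minimal sketch (not part of the original handout; small discrepancies with Table 1 come from the rounded R² inputs):

```python
def partial_f(r2_full, r2_reduced, n, k_full):
    """Partial F (1 numerator df) for the last variable added to a k_full-variable model."""
    return (r2_full - r2_reduced) / ((1 - r2_full) / (n - k_full - 1))

# X1 entered last into the full three-variable model (n = 12, k = 3);
# Table 1 reports 6.827 for this quantity
print(round(partial_f(0.7802, 0.5927, n=12, k_full=3), 2))  # 6.82
```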



     Step 3:  Focus on the lowest observed partial F test value (FL, say).

In our example, FL = 0.010 for the variable X3.

     Step 4:  Compare this value FL with a preselected critical value
(FC, say) of the F distribution.

          a)  If FL < FC, remove from the model the variable under con-
sideration, recompute the regression equation for the remaining variables,
and repeat steps 2, 3, and 4.

          b)  If FL > FC, adopt the complete regression equation as cal-
culated.

In our example, if we work at the 10% level, then FL = .010 < F(1,8,.90) = FC = 3.46.

Therefore, we remove X3 from the model and recompute the equation using
only X1 and X2.  We then obtain:

                    Ŷ = 6.553 + 0.722 X1 + 2.050 X2






with ANOVA  table
                                     -6-

-------
          Source      DF      SS       MS        F        R²

          Regression   2    692.82   346.41   15.95**    .7800
          Residual     9    195.43    21.71

          Total       11    888.25






With X3 out of the picture, the partial F's become 7.665 for X1 and

4.785 for X2 (see Table 1).  Our new FC = F(1,9,.90) = 3.36, which is

less than 4.785.  Therefore, the partial F for X2 is significant and

we stop here with this model, which is the same model that we arrived

at using the "all possible regressions" procedure.
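The whole backward pass can be sketched as a loop.  The code below (numpy; an illustration, not part of the handout) uses the example data and the F(1, df, .90) table values quoted above:

```python
import numpy as np

Y  = np.array([64, 71, 53, 67, 55, 58, 77, 57, 56, 51, 76, 68], dtype=float)
X1 = np.array([57, 59, 49, 62, 51, 50, 55, 48, 42, 42, 61, 57], dtype=float)
X2 = np.array([ 8, 10,  6, 11,  8,  7, 10,  9, 10,  6, 12,  9], dtype=float)
preds = {"X1": X1, "X2": X2, "X3": X2**2}

def sse(cols):
    """Residual sum of squares for Y on an intercept plus the named columns."""
    X = np.column_stack([np.ones_like(Y)] + [preds[c] for c in cols])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    r = Y - X @ b
    return r @ r

def backward_eliminate(cols, f_crit):
    cols = list(cols)
    while cols:
        df_error = len(Y) - len(cols) - 1
        mse_full = sse(cols) / df_error
        # Partial F for each variable, treated as though it entered last
        partials = {c: (sse([d for d in cols if d != c]) - sse(cols)) / mse_full
                    for c in cols}
        weakest = min(partials, key=partials.get)
        if partials[weakest] >= f_crit(df_error):
            break                 # every remaining variable is significant: stop
        cols.remove(weakest)      # drop the weakest variable and refit
    return cols

# 10% level: F(1, df, .90) critical values from the F table
f_table = {8: 3.46, 9: 3.36, 10: 3.29}
print(backward_eliminate(["X1", "X2", "X3"], lambda df: f_table[df]))  # ['X1', 'X2']
```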






2.3  The Forward Selection Procedure



     Step 1:  Select as the first variable to enter the model that variable



most highly correlated with the dependent variable, and then fit the asso-



ciated straight line regression equation.



     For our example, we have the following correlations:  r(Y,X1) = .814,

r(Y,X2) = .770, and r(Y,X3) = .767.  Thus, the first variable to enter is

X1.

     The straight line regression equation relating Y and X1 is:

                         Ŷ = 6.188 + 1.072 X1
with ANOVA table:
          Source      DF      SS       MS        F        R²

          Regression   1    588.92   588.92   19.67**    .6630
          Residual    10    299.33    29.93

          Total       11    888.25
                                    -7-

-------
If the F statistic in the above table is not significant, we stop and

conclude that no independent variables are important predictors.  If the

F statistic is significant (as it is), we include this variable (in our

case, X1) in the model and proceed to Step 2.

     Step 2:  Calculate the partial F statistic associated with each

remaining variable, based on a regression equation containing that variable

and the variable initially selected.

     For our example,

                    Partial F(X2|X1) = 4.785,

                    Partial F(X3|X1) = 4.565.

     Step 3:  Focus on the variable with the largest partial F statistic,
and add to the model the
-------
variable having the  largest partial F value if it is statistically

significant.*    At any step, if the largest partial F is not significant,

then no more  variables are included in the model and the process is termi-

nated.

     For our example, we have already added X1 and X2 to the model.  We

now see if we should add X3.  The partial F for X3, controlling for

X1 and X2, is given by

                    Partial F(X3|X1,X2) = 0.010,

and this value is not significantly different from zero, since F(1,8,.90) = 3.46.
Again, we have  arrived at  the same two-variable model chosen by using the

previously  discussed methods.



2.4  The Stepwise Regression Procedure

     This is an improved version of the forward selection procedure, which

permits a reexamination at every step of the variables already incorporated

in the model in previous steps.  A variable that may have entered at an early

stage may, at a later stage, become superfluous because of its relation-

ship with other variables now also in the model.  To check on this possi-

bility, at each step a partial F test for each variable presently in the
     *The reader should be aware that the probability of finding at least one
      significant independent variable when actually there are none increases
      rapidly as the number k of candidate independent variables increases,
      an upper bound on this overall significance level being 1 - (1-α)^k, where
      α is the significance level of any one test.  To control this overall
      significance level so that it does not exceed α in value, a conservative
      but easily implemented approach is to conduct any one test at level α/k
      (see Pope and Webster (1972) for further discussion in this regard).
                                       -9-

-------
model  is made,  treating  it as  though  it were  the most recent variable



entered, irrespective of its actual entry point into the model.  That



variable with  the smallest nonsignificant partial F statistic  (if  there



is  such a variable)  is removed,  the model is  refitted with  the remaining



variables,  the partial F's are obtained and similarly examined,  etc.  The



whole  process  continues  until  no more variables can be  entered or  removed.



     For our example, the first  step  would be, as with  the  forward



selection procedure, to add the variable X1 to the model, since it

has the highest correlation with Y.  Next, as before, we would add

X2 to the model, since it has a higher partial correlation with Y

than does X3, controlling for X1.  Now, before testing to see if

X3 should also be added to the model, we look at the partial F of

X1 given that X2 is already in the model, to see if we should now

remove X1.  This partial is given by F(X1|X2) = 7.665 (see Table 1),

which exceeds F(1,9,.90) = 3.36.  Thus, we do not remove X1 from the model.

Next, we check to see if we need to add X3; of course, the answer is

no, since we have dealt with this situation before.



     The Analysis of Variance  Table that best summarizes the results



obtained for our example is thus given by
        Source             DF     SS        MS        F            R²

        Regression
           X1               1   588.92    588.92   19.67**
           X2|X1            1   103.90    103.90    4.79 (P<.10)   .7800

        Residual            9   195.43     21.71

        Total              11   888.25
                                     -10-

-------
The ANOVA table that considers all variables is given by:

        Source             DF     SS        MS        F            R²

        Regression
           X1               1   588.92    588.92   24.14**
           X2|X1            1   103.90    103.90    4.26           .7802
           X3|X1,X2         1     0.24      0.24    0.01

        Residual            8   195.19     24.40

        Total              11   888.25
-------
                                 HANDOUT #21

     Reprinted from the American Journal of Epidemiology
     Copyright © The Johns Hopkins University School of Hygiene and Public Health
     Printed in U.S.A.

        A NOTE ON CONTROLLING SIGNIFICANCE LEVELS IN STEPWISE
                             REGRESSION

               L. L. KUPPER, J. R. STEWART AND K. A. WILLIAMS
   Stepwise regression is being used more
 and more as an exploratory data analysis
 tool in the health and social sciences.
 While it may be generally recognized that
 the probability of finding at least one
 significant regressor variable when actually
 there are none increases as the number of
 possible regressor variables increases, the
 actual extent of this increase may not be
 fully appreciated.  This note investigates
 the relationship between this probability
 (the true significance level α*) and the
 so-called nominal significance level α
 which is typically used to control entry into
 the regression equation of one of k candi-
 date independent variables.
   Consider the situation where there are k
 independent variables and one dependent
 variable of interest, and suppose that it is
 desired to determine which (if any) of the k
 independent variables are useful predictors
 of the dependent variable.  The typical
 stepwise regression algorithm proceeds as
 follows to find the "best" regression equa-
 tion.  At the first step, each independent
 variable is regressed separately with the
 dependent variable and then the usual t² =
 F(1,v) statistic (based on, say, v degrees of
 freedom for error) is calculated for each of
 the k estimated slopes.  If none of the F(1,v)'s
 exceeds F(1,v,α) for some nominal significance
 level α (usually chosen such that .01 < α <
 .10), then it is concluded that none of the
 independent variables is important.  On the
 other hand, if the largest F(1,v) does exceed
 F(1,v,α), then that variable is the first to be
 included in the regression equation.
  Thus, the decision to introduce  a varia-
   From the Department of Biostatistics, University
of North Carolina, Chapel Hill, NC 27514.  (Address
for reprint requests, which should be sent to Dr.
Kupper.)
                                          ble into the regression equation at the first
                                          step is based on the value of the maximum
                                          of a set of k F(1,v) statistics, all of which have
                                          been calculated from the same data set and
                                          which typically are correlated to some
                                          extent.  The fact that the largest of the {F(1,v)}
                                          is compared to F(1,v,α) means that the actual
                                          (true) probability of a Type I error, α* say,
                                          is greater than α.
                                            In particular, under the null hypothesis
                                          H0 of no association of any of the independ-
                                          ent variables with the dependent variable,
                                          it follows that α* = Pr [max {F(1,v)} >
                                          F(1,v,α) | H0] = 1 - Pr [all {F(1,v)} ≤ F(1,v,α) | H0],
                                          which becomes

                                                   α* = 1 - (1 - α)^k              (1)

                                          if it is assumed that the {F(1,v)} are mutually
                                          independent.
                                            The assumption of independence used to
                                          develop expression 1 would rarely hold in
                                          actual practice.  However, the modification
                                          to expression 1 made necessary because of
                                          any dependency among the {F(1,v)} could
                                          effectively be mimicked by substituting for
                                          k some smaller numerical quantity, the
                                          actual (but unknown) value being in some
                                          sense a function of the "strength" of the
                                          dependency.  Since such a modification
                                          would decrease the value of α* given by
                                          expression 1, it follows that expression 1
                                          actually provides an upper bound on the
                                          value of α* to be expected in practical
                                          applications.
                                            An often-used alternative upper bound
                                          on α*, based on an application of the
                                          Bonferroni inequalities (see reference 1, p.
                                          110), is α* = Pr [at least one of the {F(1,v)} >
                                          F(1,v,α) | H0] ≤ k Pr [any one F(1,v) > F(1,v,α) | H0]
                                          = kα.  Firstly, note that this upper bound
                                          on α* is completely uninformative for k ≥
                                          1/α.  Secondly, expression 1 provides a

-------
more precise upper bound for α* than does
kα, since it can be shown by an inductive
argument that

     1 - (1 - α)^k < kα  for  k > 1.

  For various choices of the nominal signif-
icance level α, Table 1 illustrates how
expression 1 increases in value for increas-
ing values of k.  What this table displays so
dramatically is that a seemingly reasona-
ble choice for the nominal significance
level α does not in general guarantee an
acceptable value for the actual significance

                            TABLE 1

         Values of 1 - (1 - α)^k as a function of k

      k      α = .10     α = .05     α = .025    α = .01

      1      0.1000      0.0500      0.0250      0.0100
      2      0.1900      0.0975      0.0494      0.0199
      3      0.2710      0.1426      0.0731      0.0297
      4      0.3439      0.1855      0.0963      0.0394
      5      0.4095      0.2262      0.1189      0.0490
      6      0.4686      0.2649      0.1409      0.0585
      7      0.5217      0.3017      0.1624      0.0679
      8      0.5695      0.3366      0.1833      0.0773
      9      0.6126      0.3698      0.2038      0.0865
     10      0.6513      0.4013      0.2237      0.0956
     15      0.7941      0.5367      0.3160      0.1399
     20      0.8784      0.6415      0.3973      0.1821
     30      0.9576      0.7854      0.5321      0.2603
     40      0.9852      0.8715      0.6368      0.3310
     50      0.9948      0.9231      0.7180      0.3950
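Expression 1 and the entries of Table 1 can be reproduced in a line of code; a minimal sketch (not part of the original note):

```python
def true_alpha(alpha, k):
    """Expression 1: upper bound on the actual Type I error with k candidate variables."""
    return 1 - (1 - alpha) ** k

# A Table 1 entry, and a check of the inequality 1 - (1-a)^k < k*a for k > 1
print(round(true_alpha(0.10, 10), 4))                             # 0.6513
print(all(true_alpha(0.05, k) < k * 0.05 for k in range(2, 51)))  # True
```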
level α*.  For example, even in the extreme
case of having to choose between just k = 2
independent variables, the actual signifi-
cance level could be almost twice the
nominal value.
  From expression 1, it is a simple matter
to express α as a function of α*:

          α = 1 - (1 - α*)^(1/k)                 (2)

In other words, in order to control the
actual Type I error probability of including
one of the k independent variables in the
regression equation when none should be
included, it follows from expression 2 that
it is necessary to work at a nominal signifi-
cance level α often much less in value than
α*, this working value of α decreasing with
an increasing number k of candidate inde-
pendent variables.
  Table 2 illustrates the dramatic relation-
ship between α and k for some specified
values of α*.  In this regard, the Bonferroni-
based inequality α* ≤ kα immediately
suggests choosing the working value of α to
be α*/k for any specified value of α*.  It is
easy to see that setting α = α*/k provides
values slightly smaller than those in Table 2
(i.e., more conservative values), but never-
theless the approximation α*/k would ap-
pear to be entirely satisfactory for all
practical purposes.
                                  TABLE 2

            Values of α = 1 - (1 - α*)^(1/k) as a function of k

  k     α*=.25   α*=.20   α*=.15   α*=.10   α*=.05   α*=.025  α*=.01

  1     0.2500   0.2000   0.1500   0.1000   0.0500   0.0250   0.0100
  2     0.1340   0.1056   0.0780   0.0513   0.0253   0.0126   0.0050
  3     0.0914   0.0717   0.0527   0.0345   0.0170   0.0084   0.0033
  4     0.0694   0.0543   0.0398   0.0260   0.0127   0.0063   0.0025
  5     0.0559   0.0436   0.0320   0.0209   0.0102   0.0051   0.0020
  6     0.0468   0.0365   0.0267   0.0174   0.0085   0.0042   0.0017
  7     0.0403   0.0314   0.0229   0.0149   0.0073   0.0036   0.0014
  8     0.0353   0.0275   0.0201   0.0131   0.0064   0.0032   0.0013
  9     0.0315   0.0245   0.0179   0.0116   0.0057   0.0028   0.0011
 10     0.0284   0.0221   0.0161   0.0105   0.0051   0.0025   0.0010
 15     0.0190   0.0148   0.0108   0.0070   0.0034   0.0017   0.0007
 20     0.0143   0.0111   0.0081   0.0053   0.0026   0.0013   0.0005
 30     0.0095   0.0074   0.0054   0.0035   0.0017   0.0008   0.0003
 40     0.0072   0.0056   0.0041   0.0026   0.0013   0.0006   0.0003
 50     0.0057   0.0045   0.0032   0.0021   0.0010   0.0005   0.0002
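Expression 2 and the Table 2 entries can be reproduced the same way; the sketch below (not part of the original note) also checks that the Bonferroni value α*/k is slightly smaller, i.e., more conservative:

```python
def working_alpha(alpha_star, k):
    """Expression 2: nominal level giving actual level alpha_star with k candidates."""
    return 1 - (1 - alpha_star) ** (1.0 / k)

# Table 2 entry for alpha* = .05, k = 2, and the conservative Bonferroni value
print(round(working_alpha(0.05, 2), 4))    # 0.0253
print(0.05 / 2 <= working_alpha(0.05, 2))  # True: alpha*/k is slightly smaller
```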

-------
  One can similarly use the above proce-
dures to control the actual significance
level α* at any stage of the stepwise regres-
sion.  Since a t² = F(1,v) statistic is calcu-
lated for each of the remaining candidate
variables at any particular step after the
first, and since max {F(1,v)} is used as the test
criterion for entry into the regression equa-
tion, it follows that the appropriate value
of α to be used can be determined as
described above, using for k the actual
number of remaining candidate variables
under consideration.
  Pope and Webster (2) discuss the draw-
backs of the use of an F-statistic in step-
wise regression and the choice of signifi-
cance level.  As they point out, there are
two types of errors that can be made when
k variables are being considered for inclu-
sion in a regression model: 1) a variable is
included when none should be (Type I
error), and 2) no variable is added when at
least one of the k should be (Type II error).
Pope and Webster are of the opinion that
the Type II error is sometimes more serious
and so advocate the use of a fairly large
value of α* (say, .25).  This suggestion
seems reasonable when specific significant
relationships are anticipated a priori.
  However, in situations where stepwise
regression is used as an exploratory data
analysis tool, it is our feeling that primary
attention should be given to controlling the
Type I error probability.  We are specifi-
cally referring to the all-too-frequent appli-
cation of stepwise regression algorithms to
situations in which the experimenter col-
lects information on a large number of
variables with no firm a priori hypotheses
in mind and then attempts to let the
algorithm ferret out whatever relationships
(if any) exist within the data set.  It is in
just such circumstances as these that quite
rigid control of the actual significance level
α* is required, so that spurious associations
are not reported as real.
                 REFERENCES

1. Feller W: An Introduction to Probability Theory
   and Its Applications, Vol. 1, 3rd ed.  New York,
   John Wiley & Sons, 1968
2. Pope PT, Webster JT: The use of an F-statistic in
   stepwise regression procedures.  Technometrics
   14:327-340, 1972

-------
                                         HANDOUT  #22
The aim of this report is to determine whether changes in air
pollution have an effect on health. Climate and home heating
variables are included to see whether they may be involved. These
studies indicate a close association between mortality rates and air
pollution and lead to the conclusion that mortality rates could be
lowered by abating pollution. Estimates of economic benefits from
improved health are discussed.
         Air Pollution, Climate, and Home Heating:

             Their Effects on U.S. Mortality Rates
Introduction

      We have investigated the health effects of air pollu-
tion and reported several sets of results (Lave and Seskin,
1970, 1970a, 1970b, 1971; Lave, 1972).  The basic approach
is to explain differences in the mortality rates among U.S.
cities  by the level of air pollution and socioeconomic vari-
ables.  The aim of this work  is the estimation of the benefits
of pollution abatement.
      Two objections were raised to the prior  results. One
concerned the fact that weather is known  to affect health,
but no meteorological variables were included in the analy-
sis.1 The other concerned personal pollution arising from
home  heating sources.  Either of these factors might be the
"true" cause of the observed association between air pollu-
tion and ill health; if so, abating air pollution  would have
little effect on health.  Neither the literature relating ill
health to weather nor that relating it to home heating
equipment is developed sufficiently well to suggest a physiological mechanism
associating them to chronic disease. Thus,  we are confined
to a search for significant, plausible relationships.


The Statistical Model

      Our  goal is  the  determination of the effect  of
changes in air pollution on health. To answer this question,
confounding factors  must be accounted for or held con-
stant.  In explaining variations in the mortality rate across
cities, one  must  hold  constant many  socioeconomic  and
other  variables. We hypothesize that the mortality  rate in a
city can be written as in equation (1).

              (1)  MRi = MR(Pi, Si, Ci, Hi, ei)

where MRi is a mortality rate in city i, Pi is one or more
measures of air pollution in city i, Si is a vector of measures
of socioeconomic status in city i, Ci is a vector of measures
of the climate in city i, Hi is a vector of variables
representing the home heating characteristics in city i, and
ei is an error term for omitted variables.
      To estimate equation (1), we assume that the
complex relation can be approximated by a linear function,
i.e., the mortality rate  is a linear function  of air pollution,
socioeconomic variables, climate  factors, and home heating
characteristics. Other functional forms were examined and
it was found that the linear form is as good as any (Lave, 1972).

                              Lester B. Lave, Ph.D., and Eugene P. Seskin, M.S.

A discussion of the general model, along with
                                                         problems stemming  from errors in variables and omitted
                                                         variables, is given elsewhere (Lave and Seskin, 1970b).
                                                               Since there are no  data which would  allow us to
                                                         hold the confounding factors constant, we must account for
                                                         their effects statistically. To do this, we employ  multivariate
                                                         regression analysis. Given the linear specification, and a few
                                                         plausible assumptions, simple least-squares provides  best
                                                         linear unbiased estimates  of the effect of each variable.2


                                                         Method

                                                                In a previous study (Lave and Seskin, 1970a) we de-
                                                         termined the "best" set of regressions for a number of mor-
                                                         tality rates.  The specification was reestimated  with  data
                                                         from another year. It was concluded that air pollution had
                                                         important effects on mortality, even  when socioeconomic
                                                         variables were controlled.  In the present analysis we add
                                                         sets of "heating" variables in order to investigate the impor-
tance of the indoor environment on mortality and to exam-
                                                         ine the interactions of the heating variables with the pollu-
                                                         tion variables. We also add climatic  variables to examine
                                                         their influence on the observed relationships.
                                                               The heating variables are added in sets grouped ac-
                                                         cording to the type of heating equipment, type of heating
                                                         fuel,  type of water  heating fuel and  a measure of the
                                                         number of air-conditioned homes. More precisely we added
                                                         each group and tested the results to see if the explanatory
                                                         power (R2) of the regression was increased significantly.3
                                                         We did this first with the heating equipment variables since
                                                         they were thought to be  of primary concern and since they
                                                         made  generally the  greatest contribution to  explanatory
                                                         power. If the contribution of this category was statistically
significant, we continued adding the remaining sets of vari-
                                                         ables until there was no longer a significant increase in R2.
                                                         When heating equipment did not prove to be  a significant
                                                         factor in the first  instance, we  tried the other classes of
                                                         heating variables  and followed the same procedure.  This
                                                         constituted the first portion of the analysis for each mortali-
                                                         ty rate.
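The procedure described above — add a set of variables and ask whether the gain in R² is statistically significant — is the standard nested-model F test. A minimal Python sketch on synthetic data (the function name and toy data are ours, not the paper's):

```python
import numpy as np
from scipy import stats

def r2_increase_f(y, X_base, X_add):
    """F test for whether adding the columns X_add significantly raises R^2."""
    def rss(X):
        X1 = np.column_stack([np.ones(len(y)), X])
        b, *_ = np.linalg.lstsq(X1, y, rcond=None)
        e = y - X1 @ b
        return e @ e, X1.shape[1]
    rss0, k0 = rss(X_base)
    rss1, k1 = rss(np.column_stack([X_base, X_add]))
    q, df = k1 - k0, len(y) - k1
    F = ((rss0 - rss1) / q) / (rss1 / df)      # F statistic for the added set
    return F, 1 - stats.f.cdf(F, q, df)        # and its p value

rng = np.random.default_rng(1)
n = 117
base = rng.normal(size=(n, 2))                 # e.g. pollution + socioeconomic
extra = rng.normal(size=(n, 3))                # an added "set", e.g. heating equipment
y = base @ np.array([1.0, -0.5]) + extra @ np.array([0.8, 0.0, 0.0]) + rng.normal(size=n)
F, p = r2_increase_f(y, base, extra)
print(F, p)
```

Because one of the three added columns genuinely matters in this toy setup, the test rejects; when none matter, F hovers near 1 and the set would be dropped, mirroring the stopping rule in the text.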
                                                              AIR POLLUTION, CLIMATE AND HOME HEATING   909

-------
        In a parallel investigation, we added climatic vari-
 ables to the "best" 1960 regressions. Only those climatic
 variables which made a significant contribution to explana-
 tory power were added. Finally, we added heating variables
 in the same manner as described above to the 1960 regres-
 sions with the weather variables present.
        This sequential estimation  method  is intended  to
 bracket the effect of the two additional  sets of variables on
 the air pollution (and  socioeconomic) parameter  estimates.
 If either home heating characteristics or climatic factors are
 the "true cause" of ill health (and air pollution is merely a
 spurious  effect), this estimation  procedure is designed  to
 show it. Care must be taken  in  using  these  results to es-
 timate  the effect of either home  heating characteristics  or
 climate on mortality rates.4
 The Data5

       We  collected data  on 117  SMSAs. Air pollution
 data are reported by the U. S. Public Health Service. Sus-
 pended  particulates  and  total  sulfates are measured  for
 biweekly periods in micrograms per cubic meter (μg/m³).
 Observations are collected on the biweekly minimum and
 maximum readings and the annual arithmetic mean.6 There
 are a number of difficulties with these data.  The measuring
 instruments change over time and across cities and some in-
 struments  have  little reliability. In addition, the  data  are
 generally for a  single  point in  a vast geographical area.
 Since  pollution  concentration varies  greatly with the ter-
 rain, it is a heroic assumption to regard the figures as repre-
 sentative of an entire SMSA in making comparisons across
 cities.
       Climatological variables are reported by the U. S.
 Department of Commerce.7 While there  is  little difficulty
 with regard to the  actual measurement of most of these
 variables,  the observations are  for a  single  point  and may
 not be characteristic of an entire area.
        Mortality data are reported in Vital Statistics of the
 United  States.  These include the total  death  rate  and a
 breakdown of the total death rate into age specific catego-
 ries, including various categories  of infant death rates (as a
 ratio to  live births). One problem  with the infant death rates
 is  that  a  classification  such as  fetal deaths  may not  be
 reported uniformly well  across all areas.
       The "heating" variables are reported in  the Census
 of Housing.
        Finally, the socioeconomic data are  taken from  the
 1960 census as reported in the County and City Data Book.
        The variables which we use along with  their means
 and standard deviations are reported in the footnote to
 Table 1.

 An Overview of the Results

        In this paper we present  a detailed  analysis of the
 total  mortality  and  infant death  rates.  We  have  also
 analyzed disease specific mortality rates in a longer version
 of the paper (available from the authors). We summarize all
 results in what follows.
        In general neither climate nor home heating vari-
ables cause the air pollution variables to lose significance.
While there are individual pollution coefficients which do
lose significance, the coefficients are quite stable. An excep-
tion occurs when home heating fuels are added. These vari-
ables are associated closely with measured air pollution and
both pollution and heating fuel variables tend to become in-
significant. For example,  the  simple correlation between
minimum sulfates and "Coal"  fuel  is .41. Apparently, the
type of fuel used for home heating is a major contributor to
the air pollution level in the city. Note that this interpreta-
tion does not mean that the previous association between
air pollution  and mortality is disproved, but rather that it is
made more  specific by directing  the  association to  home
heating fuels, rather than at all sources of air pollution.
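The simple correlation quoted here (r = .41 between minimum sulfates and "Coal" fuel) is just the Pearson coefficient of two columns of city-level data. A sketch with hypothetical numbers (the eight values below are invented for illustration, not taken from the study):

```python
import numpy as np

# Hypothetical city-level data: minimum biweekly sulfate readings and the
# share of homes heated with coal. Values are illustrative only.
min_sulfates = np.array([47.0, 62.0, 31.0, 88.0, 55.0, 40.0, 71.0, 29.0])
coal_share   = np.array([0.05, 0.18, 0.02, 0.30, 0.12, 0.04, 0.22, 0.01])

r = np.corrcoef(min_sulfates, coal_share)[0, 1]   # Pearson correlation
print(round(r, 2))
```

A high r between a fuel variable and a pollution reading is exactly what makes the two sets of coefficients unstable when both enter the regression, as the paragraph above describes.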
       The  socioeconomic variables  are  correlated  with
climate and home  heating variables. There is an interaction
between  the home heating variables and population density
and percentage  of poor families; the latter two variables
have some tendency to lose significance when  the heating
variables are  added.
       Climate and home heating variables interact very
little in the regressions, although there is some indication
that the variables may act as surrogates for each other. Add-
ing the two  sets of variables  simultaneously  changes the
results little from adding them sequentially.
       For most of the  mortality rates, the set of heating
equipment variables adds  significantly to the  explanatory
power of the regression. Generally,  the types of equipment
are associated with decreased mortality rates. When heating
fuels are present, they tend to be related positively to the
mortality rates. The presence of water heating fuels (and ob-
viously hot water) tends to have a negative effect on the
various  mortality  rates. The air-conditioning  variable  is
seldom important.
Specific Results
       The regressions relating to total and infant mortality
are reported in Tables 1 and 2 respectively. Regression 1-1
is written out in equation (2).
      (2) MR = 19.607 + .041 Mean P + .071 Min S + .001 P/M² +
                        (2.53)        (3.18)       (1.67)

               + .041 %N-W + .687 % ≥65 + e
                 (5.81)      (18.94)
where "Mean P" is the arithmetic mean of the 26 biweekly
suspended particulate readings, "Min S" is the smallest of
the 26 biweekly sulfate readings, "P/M²" is the population
density in the SMSA, "%N-W" is the percentage of the
SMSA population who are nonwhite, "% ≥65" is the per-
centage of the SMSA population who are 65 and older, and
"e" is an error term. This regression explains variations in
the total mortality across 117 SMSAs extremely well, since
82.7 per cent of the variation is explained (R² = .827).
Each of the coefficients except population density is ex-
tremely significant (as shown by the t statistics in parenthe-
ses below the coefficients). As expected, increases in each of
the variables would lead to an increase in the total mortality
rate.
       The percentage of older people is the most impor-
tant variable in equation (2). A one percentage point
increase in the proportion (recorded ×10) of peo-
ple 65 and older (raising the mean from 83.93 to 93.93) is
 910   AJPH JULY, 1972, Vol. 62, No. 7

-------
Table 1—Total Mortality
(entries are regression coefficients, with t statistics in parentheses)

1-1:  R²* = .827; Constant 19.607
      Pollution: Mean P .041 (2.53)†; Min S .071 (3.18)
      Socioeconomic: P/M² .001 (1.67); % N-W .041 (5.81); % ≥65 .687 (18.94)

1-2:  R² = .868; Constant 21.439
      Pollution: Mean P .040 (2.65); Min S .066 (3.11)
      Socioeconomic: P/M² -.0001 (-.20); % N-W .038 (5.26); % ≥65 .610 (16.96)
      Heating equipment: Steam 17.825 (4.78); Floor -3.552 (-.63); Elec 11.816 (1.11);
        Flue 12.838 (2.37); N Flue 5.792 (1.32); None -16.663 (-.78)

1-3:  R² = .880; Constant 25.901
      Pollution: Mean P .041 (.94); Min S .025 (1.19)
      Socioeconomic: P/M² .001 (2.75); % N-W .045 (6.32); % ≥65 .646 (19.48)
      Heating fuel: Oil 1.365 (.64); Coal 23.827 (5.47); Elec 2.351 (.32);
        B. Gas -42.730 (-1.63); Other -4.289 (-.18); None 8.552 (.27)

1-4:  R² = .893; Constant 25.976
      Pollution: Mean P .029 (2.1?); Min S .057 (3.06)
      Socioeconomic: P/M² .0004 (.82); % N-W .031 (4.46); % ≥65 .619 (19.34)
      Water heating fuel: Elec -1.417 (-.57); Coal 39.003 (6.45); B. Gas -43.140 (-1.95);
        Oil 11.851 (3.07); Other 118.754 (1.68); None 29.984 (2.84)

1-5:  R² = .837; Constant 6.864
      Pollution: Mean P .041 (2.63); Min S .060 (2.67)
      Socioeconomic: P/M² .001 (1.75); % N-W .048 (6.49); % ≥65 .676 (18.95)
      No air conditioning: 15.771 (2.58)

1-6:  R² = .920; Constant 24.086
      Pollution: Mean P .015 (1.07); Min S .026 (1.35)
      Socioeconomic: P/M² .001 (1.30); % N-W .040 (5.08); % ≥65 .602 (16.18)
      Heating equipment: Steam 4.312 (.47); Floor -5.394 (-.98); Elec -45.735 (-.68);
        Flue 9.115 (.94); N Flue -.660 (-.11); None 266.641 (2.76)
      Heating fuel: Oil 2.978 (.40); Coal 12.028 (1.23); Elec 56.049 (1.12);
        B. Gas -11.813 (-.21); Other -47.982 (-1.41); None -308.204 (-2.78)
      Water heating fuel: Elec -12.347 (-1.59); Coal 20.522 (1.51); B. Gas -50.381 (-.94);
        Oil -1.693 (-.13); Other 164.095 (1.85); None 29.350 (1.42)
      No air conditioning: 6.298 (.90)

1-7:  R² = .909; Constant 26.182
      Pollution: Mean P .020 (1.44); Min S .023 (1.17)
      Socioeconomic: P/M² .001 (1.57); % N-W .038 (5.57); % ≥65 .612 (18.89)
      Heating equipment: Steam 15.024 (3.36); Floor -5.324 (-1.00); Elec -53.774 (-.81);
        Flue 19.307 (3.01); N Flue 6.795 (1.56); None 279.664 (2.86)
      Heating fuel: Oil -7.613 (-2.37); Coal 15.903 (3.46); Elec 41.914 (.84);
        B. Gas -45.093 (-2.28); Other -8.905 (-.36); None -294.875 (-2.67)

1-8:  R² = .849; Constant 44.447
      Pollution: Mean P .030 (1.86); Min S .056 (2.53)
      Socioeconomic: P/M² .0004 (.98); % N-W .054 (6.03); % ≥65 .664 (17.42)
      Climate: Rain .001 (1.20); H 1AM -.285 (-2.75); Max ≥90 -.082 (-3.81)

1-9:  R² = .887; Constant 37.362
      Pollution: Mean P .038 (2.52); Min S .055 (2.70)
      Socioeconomic: P/M² -.0001 (-.21); % N-W .054 (6.60); % ≥65 .632 (17.79)
      Climate: Rain -.001 (-1.89); H 1AM -.181 (-1.65); Max ≥90 -.102 (-3.44)
      Heating equipment: Steam 18.098 (4.49); Floor -4.475 (-.77); Elec 15.033 (1.47);
        Flue 22.870 (3.92); N Flue 22.040 (3.84); None -12.474 (-.60)

1-10: R² = .890; Constant 47.732
      Pollution: Mean P .006 (.35); Min S .023 (1.11)
      Socioeconomic: P/M² .001 (2.56); % N-W .053 (5.99); % ≥65 .641 (18.67)
      Climate: Rain .001 (1.08); H 1AM -.284 (-2.77); Max ≥90 -.063 (-2.28)
      Heating fuel: Oil -1.046 (-.41); Coal 23.455 (5.40); Elec 2.462 (.33);
        B. Gas -.266 (-.00); Other 2.693 (.11); None -29.936 (-.86)

1-11: R² = .906; Constant 41.895
      Pollution: Mean P .022 (1.53); Min S .038 (2.00)
      Socioeconomic: P/M² .001 (1.23); % N-W .044 (5.35); % ≥65 .629 (19.54)
      Climate: Rain -.0003 (-.50); H 1AM -.184 (-1.92); Max ≥90 -.073 (-3.51)
      Water heating fuel: Elec -1.795 (-.69); Coal 37.561 (6.41); B. Gas -17.592 (-.79);
        Oil 8.033 (1.80); Other 77.834 (1.14); None 43.634 (4.06)

1-12: R² = .850; Constant 37.975
      Pollution: Mean P .031 (1.89); Min S .055 (2.48)
      Socioeconomic: P/M² .0005 (1.05); % N-W .055 (6.03); % ≥65 .663 (17.32)
      Climate: Rain .001 (1.27); H 1AM -.272 (-2.58); Max ≥90 -.069 (-2.53)
      No air conditioning: 5.733 (.75)

1-13: R² = .923; Constant 41.192
      Pollution: Mean P .012 (.80); Min S .025 (1.28)
      Socioeconomic: P/M² .001 (1.22); % N-W .047 (5.08); % ≥65 .609 (17.70)
      Climate: Rain -.0001 (-.12); H 1AM -.185 (-1.69); Max ≥90 -.062 (-1.78)
      Heating equipment: Steam 2.356 (.25); Floor -5.162 (-.90); Elec -16.909 (-.23);
        Flue 12.806 (1.30); N Flue 6.064 (.83); None 180.038 (1.66)
      Heating fuel: Oil 2.426 (.32); Coal 13.665 (1.36); Elec 31.835 (.59);
        B. Gas 5.346 (.10); Other -40.864 (-1.19); None -220.998 (-1.80)
      Water heating fuel: Elec -10.964 (-1.30); Coal 20.085 (1.46); B. Gas -41.079 (-.76);
        Oil 1.285 (.10); Other 140.341 (1.58); None 25.963 (1.21)
      No air conditioning: 2.740 (.34)

1-14: R² = .916; Constant 45.331
      Pollution: Mean P .018 (1.24); Min S .037 (1.96)
      Socioeconomic: P/M² .0004 (.87); % N-W .046 (5.67); % ≥65 .621 (19.25)
      Climate: Rain -.001 (-.75); H 1AM -.216 (-2.13); Max ≥90 -.097 (-3.42)
      Heating equipment: Steam 6.342 (.70); Floor -6.940 (-1.25); Elec 27.261 (2.36);
        Flue 13.638 (1.58); N Flue 9.979 (1.62); None -7.525 (-.35)
      Water heating fuel: Elec -7.784 (-1.75); Coal 32.237 (3.42); B. Gas -8.576 (-.34);
        Oil 1.159 (.12); Other 81.184 (1.15); None 30.482 (2.03)

* The coefficient of determination; a value of .827 indicates a multiple correlation
  coefficient of .910 and that 83 per cent of the variation in the death rate is
  "explained" by the regression.
† The t statistic; a value of 1.66 indicates significance at the .05 level, using a
  one-tailed test.
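The footnote's one-tailed .05 criterion of 1.66 can be checked against the t distribution; with 117 SMSAs and roughly six regressors, the residual degrees of freedom are about 110 (the exact df is our assumption for illustration):

```python
from scipy import stats

# One-tailed .05 critical value of Student's t on ~110 degrees of freedom,
# roughly matching 117 observations less a handful of estimated coefficients.
t_crit = stats.t.ppf(0.95, df=110)
print(round(t_crit, 2))
```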
      Footnote to Table 1 (cont'd)
      Variables Used in the Analysis

                                                          Mean     Standard
                                                                   Deviation
      Air Pollution
        Suspended Particulates (μg/m³)
          Minimum reading for a biweekly period (1960)    45.47       18.57
          Maximum reading for a biweekly period          268.36      132.07
          Arithmetic Mean (annual)                       118.14       40.94
        Total Sulfates (μg/m³ ×10)
          Minimum reading for a biweekly period           47.24       31.28
          Maximum reading for a biweekly period          228.39      124.41
          Arithmetic Mean (annual)                        99.65       52.88
      Mortality
        Total death rate (per 10,000)                     91.26       15.33
        Infant death rate (per 10,000 live births)
          < 1 year                                       254.03       36.44
          < 28 days                                      187.29       24.52
          Fetal                                          153.?5       34.35
      Climate
        Average daily maximum temperature (×10)          654.99       79.79
        Average daily minimum temperature (×10)          459.73       75.71
        Degree Days                                     4682.53     1968.54
        Total Precipitation (inches ×100)               3710.45     1309.10
        Relative humidity 1:00a E.S.T.                    76.81        8.11
        Relative humidity 1:00p E.S.T.                    56.96        7.39
        Average hourly wind speed (×10)                   91.71       19.05
        Precipitation .01 inch or more (# of days)       109.89       26.74
        Snow, sleet 1.0 inch or more (# of days)           8.21        6.62
        Heavy fog (# of days)                             27.07       18.97
        Maximum temperature 90° and above (# of days)     38.23       39.15
        Maximum temperature 32° and below (# of days)     27.18       28.96
        Minimum temperature 32° and below (# of days)     94.30       49.57
        Minimum temperature 0° and below (# of days)       3.50        7.54
      Socioeconomic
        Persons per square mile                          756.15     1370.54
        % nonwhites in population (×10)                  125.06      103.98
        % population ≥ 65 (×10)                           83.93       21.21
        % families with incomes < $3,000 (×10)           180.85       65.53
      Heating (% /100)
        Heating equipment
          Steam or hot water                                .20         .22
          Warm air furnace                                  .35         .22
          Floor, wall, or pipeless furnace                  .13         .13
          Built-in electric units                           .02         .05
          Other means with flue                             .18         .12
          Other means without flue                          .12         .18
          None                                              .01         .03
        Heating fuel
          Utility gas                                       .49         .33
          Fuel oil, kerosene, etc.                          .31         .29
          Coal or coke                                      .11         .15
          Electricity                                       .02         .07
          Bottled, tank, or LP gas                          .03         .03
          Other fuel                                        .02         .02
          None                                              .01         .02
        Water heating fuel
          Utility gas                                       .54         .28
          Electricity                                       .22         .22
          Coal or coke                                      .03         .09
          Bottled, tank, or LP gas                          .04         .02
          Fuel oil, kerosene, etc.                          .09         .16
          None                                              .08         .06
        No Air-conditioning                                 .64         .12
Table 2—Infant Mortality
(entries are regression coefficients, with t statistics in parentheses)

< 1 year:
2-1:  R² = .537; Constant 185.802
      Pollution: Min P .365 (2.82)
      Socioeconomic: % N-W .186 (6.52); % Poor .157 (3.38)

2-2:  R² = .575; Constant 228.842
      Pollution: Min P .340 (2.61)
      Socioeconomic: % N-W .195 (6.72); % Poor .163 (3.43)
      Climate: Rain .003 (1.37); H 1AM -.958 (-2.78); Rain ≥ .01 .196 (1.54)

< 28 days:
2-3:  R² = .271; Constant 149.428
      Pollution: Mean P .083 (1.62); Min S .120 (1.82)
      Socioeconomic: % N-W .098 (4.04); % Poor .056 (1.45)

2-4:  R² = .322; Constant 167.274
      Pollution: Mean P .066 (1.19); Min S .121 (1.74)
      Socioeconomic: % N-W .088 (3.19); % Poor .075 (1.85)
      Climate: Rain .002 (1.08); H 1AM -.453 (-1.39); Rain ≥ .01 .145 (1.29);
        Max ≥90 -.164 (-1.34)

Fetal:
2-5:  R² = .426; Constant 93.852
      Pollution: Mean S .141 (2.67)
      Socioeconomic: P/M² .003 (1.61); % N-W .161 (5.33); % Poor .125 (2.49)

2-6:  R² = .512; Constant 84.566
      Pollution: Mean S .081 (1.46)
      Socioeconomic: P/M² .002 (.90); % N-W .192 (6.42); % Poor .211 (3.24)
      Heating equipment: Steam 30.029 (1.96); Floor 2.243 (.10); Elec -38.701 (-.86);
        Flue -18.443 (-.63); N Flue -53.699 (-2.34); None 104.299 (1.16)

2-7:  R² = .548; Constant 181.312
      Pollution: Mean S .050 (.94)
      Socioeconomic: P/M² .003 (1.61); % N-W .183 (5.60); % Poor .184 (3.35)
      Climate: H 1AM -.764 (-1.51); H 1PM -.804 (-1.65); Rain ≥ .01 .294 (2.35);
        Fog -.344 (-2.54); Max ≥90 -.263 (-2.51)
estimated to raise the total death rate 6.87 per 10,000 (from
a mean of 91.26 to 98.13). Increasing nonwhites in the pop-
ulation (×10) by 1 percentage point (raising the mean
from 125.06 to 135.06) is estimated to raise the total death
rate by .41 per 10,000. If air pollution worsened and either
the minimum sulfate level or mean particulate level rose by
1 microgram per cubic meter (μg/m³), the total death rate
would rise by either .71 or .041, respectively.
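The effects quoted above follow directly from the regression 1-1 coefficients once the recording scales are applied: the percentage variables are recorded ×10, so a one-point rise adds ten units, and sulfates are recorded in μg/m³ ×10 while particulates are unscaled. A quick Python check:

```python
# Verify the worked effects quoted from regression 1-1 (coefficients per Table 1).
coef = {"Mean P": .041, "Min S": .071, "% N-W": .041, "% >=65": .687}

d_old  = 10 * coef["% >=65"]   # +1 pct. point of people >= 65 (variable is x10)
d_nw   = 10 * coef["% N-W"]    # +1 pct. point nonwhite (variable is x10)
d_sulf = 10 * coef["Min S"]    # +1 ug/m3 of sulfates (variable is ug/m3 x10)
d_part =  1 * coef["Mean P"]   # +1 ug/m3 of particulates (unscaled)
print(d_old, d_nw, d_sulf, d_part)
```

This reproduces the text's 6.87, .41, .71, and .041 deaths per 10,000.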
       A difficulty arises in attempting to estimate the rela-
tion when a set of the home heating variables is to be added.
The variables  are defined as the percentage of all homes in
an area heated by a particular method, such as "Steam."
Since the sum of all  variables within a set is identically 100
per cent, adding all  variables would preclude inverting the
matrix of cross products and make it impossible to derive
estimates of the regression coefficients. A simple solution to
the difficulty  is to exclude one of the variables.  The es-
timated regression coefficients of the included variables are
then interpreted as the difference between the coefficient of
the variable and the coefficient of the excluded variable.8
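The singularity described here is easy to demonstrate numerically: with an intercept present, a set of shares summing to one makes the design matrix rank-deficient, and dropping one category restores full rank. A small Python sketch with invented shares:

```python
import numpy as np

# Why a full set of heating shares cannot enter with an intercept: the shares
# sum to 1 in every SMSA, so the columns are linearly dependent and the
# cross-product matrix cannot be inverted.
rng = np.random.default_rng(2)
n = 6
steam = rng.uniform(0, 0.5, n)
warm  = rng.uniform(0, 0.5, n)
other = 1 - steam - warm                      # shares of each city's homes sum to 1

X_full = np.column_stack([np.ones(n), steam, warm, other])
X_drop = np.column_stack([np.ones(n), steam, warm])   # exclude one category

r_full = np.linalg.matrix_rank(X_full)        # 3 < 4 columns: singular
r_drop = np.linalg.matrix_rank(X_drop)        # full column rank
print(r_full, r_drop)
```

With a category excluded, each estimated coefficient is interpreted relative to the excluded group, exactly as footnote 8 explains.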
       This difficulty can be clarified by examining regres-
sion 1-2, written out as equation (3),

      (3) MR = 21.439 + .040 Mean P + .066 Min S - .0001 P/M² +
                        (2.65)        (3.11)       (-.20)

               + .038 %N-W + .610 % ≥65 + 17.825 % Steam -
                 (5.26)      (16.96)      (4.78)

               - 3.552 % Floor + 11.816 % Elec + 12.888 % Flue +
                (-.63)           (1.11)          (2.37)

               + 5.792 % N Flue - 16.663 % None + e
                 (1.32)          (-.78)

where the first five variables are defined as above. "%
Steam" is the percentage of homes in the SMSA with steam
or hot water heating, "% Floor" is the percentage of hous-
ing units with a floor, wall or pipeless furnace, "% Elec" is
the percentage with built-in electric units, "% Flue" is the
percentage heating by other equipment with a flue, "% N
Flue" is the percentage heated by other equipment without
a flue, and "% None" is the percentage of homes in the
SMSA without heating equipment. The category "Warm
Air Furnace" is excluded and all heating effects are relative
to this category. If equations (2) and (3) are compared, one
notes that the magnitude and significance of the pollution
and socioeconomic variables are essentially the same, ex-
cept that population density becomes insignificant. Only
two of the heating equipment variables ("
-------
presence of the weather variables. This supports our conten-
tion that this variable  may be acting as a surrogate for the
climate variables in a  region, and when they are included
explicitly,  it  becomes  unimportant.  Regression  1-14  con-
tains the heating equipment and water heating fuel groups.
The result is as expected from  looking  at the two sets of
variables independently (regressions  1-9 and 1-11).
       Infant Deaths—The death rate for infants under one
year is examined  in  regression  2-1. The  minimum  par-
ticulate level is the important pollution variable, while the
percentage of nonwhites in the population and the percent-
age  of poor  families in the  population  are  the  important
socioeconomic  variables. Fifty-four per  cent of the varia-
tion in the mortality rate is explained across the SMSAs (R2
=  .537). No set  of heating variables contributed signifi-
cantly  to  the explanatory power of  the original regression.
In regression 2-2 weather variables were permitted to enter.
The only weather variable which was significant was the hu-
midity reading (1 AM). It indicates that the mortality rate is
lower in regions which have higher humidity.
       In regression 2-3 the mortality rate for infants under
28  days  is explained  in  terms of pollution and  socio-
economic variables. In this case, the mean level of par-
ticulate pollution and  the minimum level of sulfate pollu-
tion are  the  important pollution measures,  while the per-
centage of nonwhite and the percentage of poor are the rele-
vant socioeconomic variables. Only 27 percent of the varia-
tion across SMSAs is  explained. Again, no  set of heating
variables contributed significantly to the regression. In ad-
dition, no climatological variable was statistically signifi-
cant for the  under 28 day category (regression 2-4). The
only effects of the weather variables are to decrease the sig-
nificance of the mean level of particulate pollution and to
increase the significance of the "Poor" variable.
       Regression 2-5  explains the fetal death rate in terms
of mean sulfates, population density, nonwhites and poor
families. Forty-three per cent  of the variation across the
SMSAs is explained. The addition of the  heating equipment
variables in regression 2-6 increases the explanatory power
of the  basic regression significantly  (R2  rises from .426 to
.512).  These  variables tend to decrease  the  importance of
the  pollution variables  while  leaving  the socioeconomic
variables  unaffected. The heating equipment variables  in-
dicate  that the presence of any type is associated with  lower
fetal death rates. The effect of the climate variables on the
fetal death rate is seen in regression  2-7  and is  more
profound  than for the other infant categories. The pollution
variable  (mean  sulfates)  loses  significance, while  four
weather  variables  appear to be related significantly. Ap-
parently the fetal death rate is lowered by humidity, fog and
extreme heat, and is raised by heavy  rains.
 Summary and Conclusions

       To test whether previously estimated relations be-
tween U.S. mortality rates and air pollution were spurious,
we added variables for home heating characteristics and the
climatology of a region. The objective was to determine
whether these new variables would cause the estimated ef-
fect of air pollution to fall and become statistically insignifi-
cant. In general, the air pollution variables were quite
stable: there were a few instances when the variables lost
significance and a few instances when the new variables
increased the significance of air pollution.
       In addition to this investigation involving the addi-
tion of climatological variables and home heating variables,
the effect of air pollution on U.S. mortality has been cor-
roborated by different functional forms, by data from anoth-
er year, and by an investigation of age, sex, and race spe-
cific death rates. In general, the significance of the pollution
variables is enhanced by disaggregating the mortality rates.
       A rather consistent result which occurred in this in-
vestigation was that when home heating fuels were added to
the regression, air pollution variables tended to lose signifi-
cance. Elsewhere there is also a preliminary result that
when occupation variables were added, some of the air
pollution variables lose significance (Lave and Seskin,
1971). These two results do not contradict the association
between air pollution and mortality, but rather tend to
isolate the nature of the problem. Apparently, home heating
fuels can be a major source of air pollution; apparently
some occupations are closely associated with the level of air
pollution. This explanation is plausible if one notes that the
air pollution readings are for one site in an SMSA  and are
taken from 26 biweekly readings.  Other investigators have
relied on fuel consumption as a measure  of air pollution
when good measures of pollution were not available.
       These studies make it apparent that there is a close
association between  mortality  rates and air pollution. This
investigation strengthens the conclusions cited in a previous
work that mortality rates could be lowered substantially by
abating air pollution. For example, lowering the measured
levels of minimum sulfate readings and mean particulate
readings by 10 per cent is estimated to lead to a .897 per
cent decrease in the total death rate. A 50 per cent abate-
ment would lower the death rate by 4.485 per cent. Assum-
ing that those who are saved have the same life expectancy
as others in their cohort, a 50 per cent abatement in air
pollution (specifically in minimum sulfates, minimum par-
ticulates, and mean particulates) would result in an increase
in life expectancy of about one year for a newborn. As es-
timated elsewhere, such an abatement would reduce the
economic cost of morbidity and mortality by just under 5
per cent (Lave and Seskin, 1970). Thus, such an abatement
is probably the single most effective way of improving the
health of middle-class families. Note that this middle-class
family could do something about smoking, but is powerless
to lower its exposure to air pollution (except by leaving the
city). The importance of this improvement in health can be
assessed by noting that eradicating all cancer would result
in lowering the economic cost of morbidity and mortality
by 5.7 per cent (see Lave and Seskin, 1970).
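The .897 per cent figure can be reproduced from regression 1-1 and the sample means in the footnote to Table 1: value a 10 per cent cut in minimum sulfates and mean particulates at their coefficients and means, then express it relative to the mean death rate. A Python check (matching the quoted numbers to within rounding):

```python
# Reproduce the quoted abatement estimate from regression 1-1.
b_min_s, b_mean_p = .071, .041               # coefficients from Table 1
mean_min_s, mean_mean_p = 47.24, 118.14      # sample means (Table 1 footnote)
mean_mr = 91.26                              # mean total death rate per 10,000

drop = 0.10 * (b_min_s * mean_min_s + b_mean_p * mean_mean_p)
pct_10 = 100 * drop / mean_mr                # ~ .897 per cent for a 10% abatement
pct_50 = 5 * pct_10                          # ~ 4.485 per cent for 50%
print(round(pct_10, 3), round(pct_50, 3))
```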
       Even so, there are many reasons to believe that these
estimates are gross understatements of the health cost of air
pollution. Chronic diseases  generally involve long periods
of illness. The economic costs, calculated as the  sum of lost
work and  medical  expenditures,  grossly understate  the
amount that would be paid to achieve good health for such
a chronically  ill period.  In  addition,  death may not result
from the chronic illness itself, but rather from one or anoth-
er  complication. For example,  chronic  bronchitis or
emphysema is likely to result  in death due to heart disease
or pneumonia, rather than from the chronic disease.
       Perhaps the  only good way to estimate the health
costs of air pollution would  be to analyze morbidity, rather
 than mortality data. It seems certain that such an investiga-
…ed to test the difference between
  the effect of variable i and the excluded variable: a significant
  value of t indicates that the two variables have significantly
  different effects. In the above notation, testing the significance
  of ai is equivalent to testing whether bi is significantly dif-
  ferent from b3. If one chose to exclude an xi whose bi was
  quite different from the other b's, all of the ai would be "signif-
  icant." In transforming the estimated coefficients to exclude a
  different variable, the standard error of the coefficient will
  change and it is not simple to derive the new standard error of
  the coefficient. For our purposes, it is not as important to
  derive significance tests for individual coefficients as it is to
  determine whether a particular set of variables makes a signif-
  icant contribution to the explanatory power of the regression;
  this is done via an F test and the result is unaffected by which
  variable we decide to exclude.
Bibliography

Analysis of Suspended Particulates, 1957-61. U.S. Public Health
    Serv. Publ. No. 978 (196?).
Berke, J., and Wilson, V. Watch Out for the Weather. New York:
    Viking, 1951.
Census of Housing. U.S. Dep. Commerce Publ. 1, Pts. 2-8
    (1963).
Climate and Man. 1941 Yearbook of Agriculture. U.S. Dep.
    Agric. Publ. (1941).
Climatological Data, National Summary. U.S. Weather Bureau
    11 (1960).
County and City Data Book. U.S. Dep. Commerce Publ. (196?).
Lave, L. Air Pollution Damage. In: Environmental Quality An-
    alysis. A. Kneese and B. Bower, Eds. Baltimore: Johns
    Hopkins Press, 1972.
Lave, L., and Seskin, E. Air Pollution and Human Health.
    Science, 169:723 (1970).
Lave, L., and Seskin, E. An Analysis of the Association Between
    U.S. Mortality and Air Pollution, working paper (1970a).
Lave, L., and Seskin, E. Does Air Pollution Shorten Lives? Pro-
    ceedings of the Second Research Conference of Inter-
    University Committee on Urban Economics (1970b).
Lave, L., and Seskin, E. Health and Air Pollution: The Effect of
    Occupation Mix. Swedish Journal of Economics (Mar.,
    1971).
Vital Statistics of the United States (1960). DHEW Publ. (1963).
                          Dr. Lave is Head and Mr. Seskin is Research Associate, Department of Econom-
                          ics, Graduate School of Industrial Administration, Carnegie-Mellon University,
                          Pittsburgh, Pa. 15213. This research was supported by a grant from Resources for
                          the Future, Inc. and by Fellowship AP48992, Air Pollution Control Office, Envi-
                          ronmental Protection Agency. The views and any errors are those of the authors
                          and do not necessarily reflect the sponsoring agencies.

-------
 TECHNO.METHICS©                   VOL. IS, No. 3                      AUGUST 1973
    Instabilities of  Regression  Estimates  Relating
                   Air Pollution  to  Mortality

             GARY C. MCDONALD  AND RJCHAKD C.  SCHWINC
                        General Motors Research Laboratories
                                Warren, Michigan
       The instability of ordinary least squares estimates of linear regression coefficients
     is demonstrated  for mortality rates regressed around various socioeconomic, weather
     and pollution variables. A ridge regression technique presented by Hoerl and Kennard
     (Technomelrics 12 (1970) 69-82) is employed to arrive at "stable" regression coefficients
     which, ia some instances, differ considerably from the ordinary least squares estimates.
     In addition, two methods of variable elimination are compared—one based on total
     squared error and the other on a ridge trace analysis.

                                  KEY WORDS

                        Multiple Linear Regression
                        Mortality Rate
                        Pollution Potentials
                        Ridge Regression
                         Standardized Total Squared Error, C_p
                        Ridge Elimination
                                1. INTRODUCTION
     Recently, Lave and Seskin [16] and Hickey, et al. [9], using multiple regression
   analyses on large data banks of annual mortality rates and pollution measurements,
   have provided a link between long-term, high pollution levels (sulfates, particulates
   and heavy metals emanating primarily from stationary sources) and increased
   mortality. Comparable health studies, either time series or cross section, for the
   pollutants oxidant, NOx (oxides of nitrogen) and CO (carbon monoxide) are
   also available. Hexter and Goldsmith [8], in a recent multiple regression time
   series study on acute episodes, conclude that carbon monoxide contributes to
   excess deaths. They did not find an effect due to oxidant. Another study by Shy,
   et al. [22] compares illness rates in "clean" versus "polluted" (with NOx and
   particulates) neighborhoods in Chattanooga, Tennessee. Because pollution levels
   can be correlated with other variables which can influence health, it is often
   necessary to investigate several highly correlated (non-orthogonal) variables
   simultaneously. Recently the utility of relatively new statistical techniques
   for handling such systems has been demonstrated by Hoerl and Kennard
   [11, 12] and others.
     We have chosen to study the chronic effects, as measured by an overall mortality
   rate, of HC (hydrocarbons), NOx and SO2 (sulfur dioxide), employing a "ridge
   regression" analysis. Because many of the explanatory variables are highly
   correlated, techniques for estimating the true variable effects for non-orthogonal
   systems are emphasized. Statistical methods do not in themselves establish a

     Received Sept. 1971; revised Aug. 1972

-------
 cause-effect link; but assuming the link is present, methods are available to quantify
 relative contributions of the variables under investigation. In this paper we apply
 a method termed "ridge regression" to arrive at regression coefficients for the
 total mortality rate. In addition, for the total mortality rate we eliminate
 "superfluous" explanatory (or predicting) variables by two methods, one based
 on total squared error and the other on a ridge analysis, and compare the results.

                   2. DESCRIPTION OF VARIABLES CONSIDERED
    A multiple, linear additive model will be assumed throughout this paper, i.e.,
  the response variable will be expressed as a linear combination of many explanatory
  variables. In this section the variables are described and descriptive statistics
  of the sample used in this study are provided. The total age adjusted mortality
  rate, our response variable in each regression equation, can be obtained for the
  years 1959-1961 for 201 Standard Metropolitan Statistical Areas (SMSA) from
  Duffy and Carroll [4]. In Table 5 of [4], the age-adjusted death rates are given
  for the categories male white, female white, male non-white and female non-white.
  In addition, the number of deaths in each of these four categories is also provided.
  We define our total age adjusted mortality rate to be

                     (Σ D_i)(Σ D_i/R_i)^(-1),                             (2.1)

  where D_i and R_i are the deaths and age adjusted death rates of, say, the ith
  category respectively, i = 1, 2, 3, 4. The sums are then taken over the four categories.
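Equation (2.1) is a deaths-weighted combination of the four category rates. A minimal sketch, using hypothetical deaths and category rates rather than the published data:

```python
# Combined age-adjusted mortality rate per equation (2.1):
# R = (sum of D_i) / (sum of D_i / R_i).
# The deaths and rates below are illustrative, not from Duffy and Carroll [4].
def total_age_adjusted_rate(deaths, rates):
    """Return (sum D_i) / (sum D_i / R_i) over the categories."""
    return sum(deaths) / sum(d / r for d, r in zip(deaths, rates))

deaths = [9000, 7000, 1500, 1200]        # D_i for the four race/sex categories
rates  = [950.0, 820.0, 1100.0, 990.0]   # R_i, deaths per 100,000

combined = total_age_adjusted_rate(deaths, rates)
print(round(combined, 1))
```

The combined rate always lies between the smallest and largest category rates, since it is a weighted harmonic-style mean of the R_i.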
   We include  three of the explanatory  variable groups  which are considered
 important in an epidemiological study of this  type. The first fifteen  variables
 listed in Table  I can be grouped as follows:
                               1. Weather
                               2. Socioeconomic
                               3. Pollution

    Accurate ambient concentration measurements on the air breathed by all
  residents in an SMSA would be the preferred pollution variables; however, several
  problems exist with the data which have been published thus far.
      (i) Sampling methods and analytical techniques differ between communities.
     (ii) Sampling sites are often not representative of the community.
    (iii) The distribution of exposures cannot be characterized by a single measure.
     (iv) Only eight SMSA's have been studied for the pollutants of primary concern
          to this study.
  Because the above difficulties with available ambient air measurements are so
  great, we choose to use calculated relative pollution potentials in each SMSA
  based on emission and weather factors. The pollution potentials of three pollutants,
  namely HC, NOx, SO2, have been estimated by Benedict [1]. The pollution
  potential is determined as the product of the tons emitted per day per square
  kilometer of each pollutant and a dispersion factor which accounts for mixing
  height, wind speed, number of episode days and dimension of each SMSA. Since
  each SMSA has the same dispersion factor for each pollutant, this quantity is
  "confounded" with each pollution potential term. Benedict's pollution potentials
  are available for sixty SMSA's, for the year 1963, which are geographically
  consistent with the available mortality data. Note, however, that the time period
  for which the pollution potentials apply (1963) is slightly later than the time
  period applicable to the mortality data.

-------
   Though the pollution variables are labeled HC, NOx, and SO2, there are other
 variables, especially other pollutants, which are highly correlated with each of
 these indices. For example, SO2 is highly correlated with certain types of
 particulates and HC is closely tied to carbon monoxide and lead salts. Thus one
 cannot demonstrate a specific cause and effect even though the analysis quantifies
 the relationship.
   Previous workers, e.g., Glasser and Greenburg [6], Holland, et al. [14], and Oechsli
 and Buechley [20], have found climate or weather variables account for some of
 the variation in disease rates. Precipitation, mean January temperature, mean
 July temperature and mean annual humidity have been included in the present
 study. These variables, presented in Table I, are considered independently because
 of their possible effect on health, not because they affect the pollutants.
   Several socioeconomic variables are important to account for health differences
 between communities. Green [7] has suggested indices to optimize the prediction
 of family health actions from socioeconomic information. Table I includes socio-

                                      TABLE I
                               Description of Variables

  Variable Number         Description [Source]

        1                 Mean annual precipitation in inches, [30].

        2                 Mean January temperature in degrees Fahrenheit, [30].

        3                 Mean July temperature in degrees Fahrenheit, [30].

        4                 Percent of 1960 SMSA population which is 65 years of
                          age or over, [5].

        5                 Population per household, 1960 SMSA, [24,25].

        6                 Median school years completed for those over 25 in 1960
                          SMSA, [27].

        7                 Percent of housing units which are sound with all
                          facilities, [24].

        8                 Population per square mile in urbanized area in 1960,
                          [23,25].

        9                 Percent of 1960 urbanized area population which is
                          non-white, [26].

        10                Percent employment in white-collar occupations in 1960
                          urbanized area, [28].

        11                Percent of families with income under $3,000 in 1960
                          urbanized area, [28].

        12                Relative pollution potential of hydrocarbons, HC, [1].

        13                Relative pollution potential of oxides of nitrogen,
                          NOx, [1].

        14                Relative pollution potential of sulfur dioxide, SO2,
                          [1].

        15                Percent relative humidity, annual average at 1 p.m.,
                          [29].

        16                Total age adjusted mortality rate, all causes, [4] and
                          equation (2.1), expressed as deaths per 100,000
                          population.
-------

                                     TABLE II
 Means, Standard Deviations, Minimum and Maximum Values of Variables (60 Observations)

 Variable                   Mean     Std. Dev.    Minimum     Maximum
 Precipitation             37.37        9.98       10.00       60.00
 January Temperature       33.98       10.17       12.00       67.00
 July Temperature          74.58        4.76       63.00       85.00
 % 65 Years & Older         8.80        1.46        5.60       11.80
 Population/Household       3.26        0.14        2.92        3.53
 Education                 10.97        0.85        9.00       12.30
 % Sound Housing           80.92        5.15       66.80       90.70
 Population/Sq. Mile     3876.05     1454.10     1441.00     9699.00
 % Non-White               11.87        8.92        0.80       38.50
 % White Collar            46.08        4.61       33.80       59.70
 % Under $3000             14.37        4.16        9.40       26.40
 HC Potential              37.65       91.98        1.00      648.00
 NOx Potential             22.65       46.33        1.00      319.00
 SO2 Potential             53.77       63.39        1.00      278.00
 Relative Humidity         57.67        5.37       38.00       73.00
 Total Mortality          940.36       62.21      790.73     1113.20
economic terms which account for broad differences in occupation, population
density, education, income, housing, race and age.
  Table II gives the sample means, standard deviations, and the minimum and
maximum values for each of our variables. An examination of the data indicates
a "bunching" of the pollution potential variables at values below their means.
This is the result of including in our analysis several SMSA's which have relatively
high values of the pollution potential variables. In particular, Los Angeles and
San Francisco have the two largest HC pollution potential values, 648 and 311
respectively, while the third largest value is 144. Los Angeles and San Francisco
also have the two largest NOx values, 319 and 171 respectively, while the third
largest value is 66. The SO2 variable is more evenly distributed among our sample
of sixty SMSA's. Table III provides the correlations among the variables
considered. The largest correlation, .9838, occurs between the HC and NOx pollution
potentials; other large positive correlations exist between education and percent
white collar, and between percent non-white and percent under $3000.


                    3. A DESCRIPTION OF RIDGE REGRESSION
  Multiple linear regression techniques have played a prominent role in the
studies of associations between air pollution and mortality  (and/or morbidity)
rates. This technique may provide an adequate basis for overall prediction, but,
when the  explanatory variables are non-orthogonal, it frequently fails to give
proper  weight to the individual explanatory variables used as  predictors.  In
many problems where data are not obtained from a well designed or controlled
experiment, as is the case in air pollution studies involving socioeconomic, weather
and other uncontrolled variables, non-orthogonality requires that estimation of
individual effects be handled by techniques other than ordinary least squares
solutions. Reinke [21]  pointed out these difficulties in air pollution models several


-------
                                     TABLE III

                Correlations Among the Variables Considered (60 Observations)

 [The correlation matrix entries are not legible in this reproduction; per the
 text, the largest correlation, .9838, is between the HC and NOx pollution
 potentials.]
 years ago, and suggested that ridge analysis, as described by Hoerl [10] and Draper
 [3], provides a promising method for avoiding distortions such as described above.
 The recent papers of Hoerl and Kennard [11, 12] and Marquardt [18] give an
 excellent description of the theory and applications of what has now been termed
 "ridge regression."
   As has been shown in [11], the estimates of regression coefficients tend to
 become too large in absolute value, and some will even have the wrong sign.
 The chances of encountering such difficulties increase the more the prediction
 vectors deviate from orthogonality. Consider the standard model for multiple
 linear regression,

                                  y = xβ + ε,                              (3.1)

 where E(ε) = 0, E(εε') = σ²I_n and x is (n × p) and of full rank. The variables
 are assumed to be standardized so that x'x is in the form of a correlation matrix,
 and the vector x'y is the vector of correlation coefficients of the response
 variable with the explanatory variables. Let

                                  β̂ = (x'x)⁻¹x'y                           (3.2)

 be the least squares estimate of β. The difficulties in this standard estimation
 are a direct consequence of the average distance between β̂ and β. In particular,
 if L² is the squared distance between β̂ and β, then the following hold:

                            L² = (β̂ − β)'(β̂ − β),
                         E(L²) = σ² trace (x'x)⁻¹,                         (3.3)
                         E(β̂'β̂) = β'β + σ² trace (x'x)⁻¹.

 In terms of the eigenvalues λ₁ ≥ λ₂ ≥ ··· ≥ λ_p > 0 of x'x, we can write

                      E(L²) = σ² Σ λ_i⁻¹ ≥ σ² λ_p⁻¹,
                                                                           (3.4)
 and when the error is normally distributed

-------
                          Var (L²) = 2σ⁴ Σ λ_i⁻².                          (3.5)
As the vectors of x deviate further from orthogonality, λ_p becomes smaller and
β̂ can be expected to be farther from the true parameter vector β.
   Ridge regression is an estimation procedure based upon

                     β* ≡ β*(k) = [x'x + kI]⁻¹x'y,  k ≥ 0,                 (3.6)

and has two aspects. The first is the ridge trace, which is a two-dimensional plot
of the β*_i(k) and the residual sum of squares for values of k ≥ 0; of course, at
k = 0 these estimators reduce to those of ordinary least squares, which are unbiased.
   The vector β* for k > 0 is shorter than β̂, i.e., (β*)'(β*) < β̂'β̂. In fact,
(β*)'(β*) is a decreasing function in k > 0. For an estimate β* the residual sum
of squares is given by

                      y'y − (β*)'x'y − k(β*)'(β*).                         (3.7)

The y'y term is the sum of squares of the dependent variable and is equal to 1
when the data are transformed as indicated in this section; the (β*)'x'y term is
the sum of squares due to regression; and k(β*)'(β*) is an adjustment term
associated with the ridge analysis. The coefficient of determination is given by
the ratio of the regression sum of squares to y'y.
   Where x'x = I, i.e., the explanatory variables are uncorrelated, then β*(k) =
(k + 1)⁻¹x'y = (k + 1)⁻¹β̂. In other words, the least squares coefficients are
uniformly scaled by the quantity (k + 1)⁻¹. The relative values of the regression
coefficients are then independent of the choice of k; i.e., β*_i(k)/β*_j(k) = β̂_i/β̂_j,
1 ≤ i, j ≤ p, β̂_j ≠ 0, for all k ≥ 0.
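The estimator in (3.6) is a single linear solve. A minimal sketch (NumPy assumed; the data here are random and illustrative, not the paper's) that also checks the orthogonal-case property above, β*(k) = β̂/(k + 1) when x'x = I:

```python
# Ridge estimator beta*(k) = (X'X + kI)^{-1} X'y, equation (3.6).
import numpy as np

def ridge_coefficients(X, y, k):
    """Solve (X'X + kI) b = X'y rather than inverting explicitly."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.normal(size=(30, 3)))   # orthonormal columns: X'X = I
y = rng.normal(size=30)

b0 = ridge_coefficients(X, y, 0.0)              # ordinary least squares (k = 0)
b2 = ridge_coefficients(X, y, 0.2)              # ridge estimate at k = 0.2

print(np.allclose(b2, b0 / 1.2))                # uniform scaling by (k+1)^{-1} → True
```

With correlated columns the scaling is no longer uniform, which is exactly what the ridge trace is designed to expose.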


          4.  A RIDGE REGRESSION EXAMPLE: TOTAL MORTALITY RATE
   The x'x and x'y in correlation form are given in Table III. These values are
 based on a total of 60 observations. There are several large interfactor correlations,
 the most notable being that between the pollutant potentials of hydrocarbon
 and oxides of nitrogen. This is also reflected in the eigenvalues of x'x, which are:
                 λ₁  = 4.5272        λ₆  = .9605        λ₁₁ = .1665
                 λ₂  = 2.7547        λ₇  = .6124        λ₁₂ = .1275
                 λ₃  = 2.0545        λ₈  = .4729        λ₁₃ = .1142
                 λ₄  = 1.3487        λ₉  = .3708        λ₁₄ = .0400
                 λ₅  = 1.2227        λ₁₀ = .2163        λ₁₅ = .0049

 The sum of the reciprocals of the eigenvalues is Σ λ_i⁻¹ = 263.06. Thus, from

-------


equation (3.4), the expected squared distance of the least squares coefficient
estimate β̂ from β is 263.06σ².
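The sum of reciprocal eigenvalues can be checked directly from the list above. A short sketch (note the eigenvalue digits are read from a scanned copy, so the total only approximately matches the 263.06 reported in the text):

```python
# Recomputing trace (x'x)^{-1} and the lambda_p^{-1} lower bound of (3.4)
# from the eigenvalues listed above (as legible in this reproduction).
eigenvalues = [4.5272, 2.7547, 2.0545, 1.3487, 1.2227, .9605, .6124,
               .4729, .3708, .2163, .1665, .1275, .1142, .0400, .0049]

recip_sum = sum(1 / lam for lam in eigenvalues)   # multiplies sigma^2 in E(L^2)
lower_bound = 1 / min(eigenvalues)                # the lambda_p^{-1} term

print(round(recip_sum, 2), round(lower_bound, 2))
```

The smallest eigenvalue alone contributes roughly 200 of the total, which is why near-singularity of x'x inflates the expected squared error so badly.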


 [FIGURE 1. Ridge trace: total mortality. Graphic not legible in this reproduction.]
-------

The ridge trace (Figure 1) makes possible assessments that are usually not made
even if all possible regressions are computed. For example:
     (i) The coefficients from the ordinary least squares are very likely to be
        overestimated. At least, they are collectively not stable. Moving a short
        distance from the least squares point k = 0 shows a rapid decrease in
        absolute value of at least two variables, namely, variables 12 and 13,
        which are the hydrocarbon and oxides of nitrogen variables respectively.
        This is not unexpected because these two quantities have a sample
        correlation coefficient of .98. Both of these coefficients are quickly
        driven towards zero and are almost mirror images of each other about
        the zero line.
    (ii) The effect of variable 14, the sulfur dioxide term, is likely to be
        originally underestimated. The coefficient increases as k increases while
        the magnitudes of the coefficients of the other two pollution variables
        decrease.
   (iii) The effects of variable 2, the mean January temperature, and variable 9,
        the percent non-white population, also appear to be overestimated in
        absolute value. Both variables decrease in absolute value as k increases,
        and level off at non-zero values.
    (iv) Variable number 8, the population density factor, is quite stable; the
        coefficient of this variable moves very little as k ranges between 0 and 1.
     (v) The coefficients, with the exception of that corresponding to variable
        number 5, appear to stabilize in the neighborhood of k = .2. We would
        expect coefficients chosen at this point to be closer to β and more suitable
        for estimation of individual effects than the least squares coefficients.
        The residual sum of squares at k = .2 has increased about 17% from the
        corresponding value at k = 0.
    (vi) The squared length of the coefficient vector decreases rapidly as k
        increases, as shown in Figure 2. At k = .05, it is only 23% of its original
        value; whereas if the least squares coefficients were computed from an
        orthogonal system, i.e., x'x = I, it would be 91% of its original value.


                         5. ELIMINATION OF VARIABLES

 Total Squared Error
   In order to adequately represent the mortality rate data as a linear function
 in fewer than fifteen explanatory variables, it is essential that some simple
 criterion of goodness of fit be chosen to characterize each equation. The measure
 recommended by Daniel and Wood [2] is the "standardized total squared error"
 given by Mallows [17]. This statistic, called C_p (p is the number of variables
 in the regression including a constant term if needed), estimates the sum of the
 squared biases plus the squared random errors in the response variable at all n
 data points. It is a simple function of the residual sum of squares from each
 fitting equation. Mallows has shown that regressions with small bias will have
 C_p's nearly equal to p and so this, as well as the magnitude of C_p, is used to
 judge a particular subset regression. The quantity C_p is given by

                          C_p = (RSS_p/σ̂²) − (n − 2p),                     (5.1)

 where p is the number of variables in the regression, RSS_p is the residual sum of
 squares for the particular p-variate regression being considered and σ̂² is an
 estimate of σ², the variance of the error term in the regression model. Frequently σ̂²
 is taken to be the residual mean square from the complete regression.
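Equation (5.1) is a one-line computation once RSS_p and σ̂² are in hand. A minimal sketch with toy numbers (the function name and values are illustrative, not from the paper); note that when σ̂² is the residual mean square of the full model, the full model's C_p equals p exactly:

```python
# Mallows' C_p per equation (5.1): C_p = RSS_p / sigma2_hat - (n - 2p).
def mallows_cp(rss_p, sigma2_hat, n, p):
    """Standardized total squared error for a p-parameter subset regression."""
    return rss_p / sigma2_hat - (n - 2 * p)

# Toy setup: n = 60 observations, full model with p = 16 parameters.
n, p_full = 60, 16
rss_full = 44.0
sigma2_hat = rss_full / (n - p_full)     # residual mean square of full model

print(mallows_cp(rss_full, sigma2_hat, n, p_full))   # → 16.0, i.e. C_p = p
```

Subset models with C_p near p are then read off a C_p-versus-p plot like Figure 3; large C_p signals substantial bias from the omitted variables.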

-------

 [FIGURE 2. Squared length of coefficient vector: total mortality. Graphic not
 legible in this reproduction.]
              If the total number of input variables is not large (say < 12) then the residual
            sum of squares can be computed for each regression equation and compared via a
            graph of C_p versus p. However, it is not always practical or feasible to compute
            all possible regressions. Procedures for determining regressions which have small
            C_p values for each allowable value of p, and in fact to determine which regression
            minimizes C_p, without actually computing all regressions, are given by Hocking
            and Leslie [13] and by LaMotte and Hocking [15]. We have applied the algorithm
            and computer program described in [15] to arrive at a "best" regression in the
            sense of minimizing C_p, and to isolate subsets of the fifteen explanatory variables
            which yield "almost best" regressions. To arrive at the "best" equation using
            the above method necessitated computation of 1,465 sets of regression estimates,
            which is about 4.5% of the 32,768 total possible regressions.
              Figure 3 is a C_p graph using total mortality rate as the response variable. In
            our computations, we used as input the raw data, i.e., our regression model for
            this part of the analysis was not standardized, and so a constant term is counted
            as a parameter. Thus p may take values up to 16, and equals 16 when all
            explanatory variables are included. It then follows that C_p ≥ 2p − 16 for all
            p = 1, ···, 16. The regression equation with variables 1, 2, 3, 6, 9 and 14 as the
            explanatory variables yields the overall minimum value of C_p, which is 3.55.
            This equation can be written as

            Mortality Rate = 1180.4 + 1.797 (Precipitation) − 1.484 (Jan. Temp.)

                             − 2.355 (July Temp.) − 13.619 (Education)

                             + 4.585 (% Non-White) + 0.260 (SO2 Potential) + ε.     (5.2)

            The coefficient of determination (R²) for this equation is 0.735. The corresponding
-------
      [FIGURE 3. C_p versus p: Total Mortality. Filled points mark the minimum C_p;
      open points mark selected low values of C_p, with deleted variables in
      parentheses; the line C_p = 2p − 16 is shown. Horizontal axis: number of
      variables remaining, p.]

      value for the full regression with all variables entered is 0.764. Other "almost
      best" subsets of variables are given in Figure 3. Variables 1, 2, 9 and 14 are
      contained in almost all of the subsets with small C_p values. The "best" set of
      five variables is 1, 2, 6, 9 and 14, which yields C_p = 4.90.
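A least squares fit passes through the sample means, so equation (5.2) can be sanity-checked against Table II: plugging the mean of each explanatory variable into (5.2) should roughly reproduce the mean total mortality rate, 940.36. A sketch (coefficients and means as printed; the scan's digits may carry small errors):

```python
# Evaluate equation (5.2) at the Table II sample means. The variable names
# are illustrative labels for the six explanatory variables in (5.2).
coefficients = [1.797,    # Precipitation
                -1.484,   # Jan. Temp.
                -2.355,   # July Temp.
                -13.619,  # Education
                4.585,    # % Non-White
                0.260]    # SO2 Potential
intercept = 1180.4
means = [37.37, 33.98, 74.58, 10.97, 11.87, 53.77]   # from Table II

predicted = intercept + sum(b * m for b, m in zip(coefficients, means))
print(round(predicted, 1))   # close to the sample mean mortality, 940.36
```

That the result lands within a fraction of a death per 100,000 of the Table II mean is a useful cross-check that the printed coefficients are internally consistent.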

-------
Elimination Using the Ridge Trace
  Hoerl  and Kcnnard [12] suggest a method of variable  elimination which  is
based on the ridge trace. Their procedure is:
   (i)  Examine the stable coefficients and eliminate the factors with the least
       predicting power.  From our ridge trace in Figure 1, variables which appear
       stable with coefficients small in absolute value are  4, 7, 10, 11 and 15;
       hence eliminate these variables.
   (ii)  Examine the unstable  coefficients and eliminate those factors that cannot
       hold their predicting power. It is obvious from the trace that variables
        12 and 13 can be eliminated with this criterion.
   (iii) Delete one or more of the remaining unstable coefficients. In our example
        variables 3 and 5 are eliminated at this step.
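Steps (i) and (ii) above lend themselves to a rough mechanical sketch. The data, the k grid, and the numerical thresholds below are illustrative assumptions made for this sketch, not part of Hoerl and Kennard's published procedure:

```python
# A rough automation of steps (i)-(ii) of the ridge-trace elimination
# procedure, on synthetic data.  Data, k grid, and thresholds are invented.
import numpy as np

def ridge_trace(X, y, ks):
    """Standardized ridge coefficients beta*(k), one row per value of k."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    w = (y - y.mean()) / y.std()
    G = Z.T @ Z / len(y)                  # correlation matrix of predictors
    c = Z.T @ w / len(y)                  # correlations of predictors with y
    return np.array([np.linalg.solve(G + k * np.eye(X.shape[1]), c)
                     for k in ks])

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)       # nearly collinear with x1
x3 = rng.normal(size=n)                   # unrelated to the response
y = x1 + x2 + rng.normal(scale=0.5, size=n)

B = ridge_trace(np.column_stack([x1, x2, x3]), y, np.linspace(0.0, 1.0, 21))
spread = B.max(axis=0) - B.min(axis=0)    # instability of each trace over k
final = np.abs(B[-1])                     # magnitude after stabilization
# Step (i): a stable trace with a small coefficient suggests elimination.
stable_small = (spread < 0.2) & (final < 0.15)
```

In this synthetic run only the third, irrelevant variable should be flagged; the two collinear variables have unstable traces at small k (the situation of step (ii)) but retain sizable coefficients once k grows.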
  Based on this ridge analysis, the explanatory variables now remaining are
1, 2, 6, 8, 9, 14, which agrees with the subset of variables chosen using the total
squared error measure with the exception that variable 3, the mean July tem-
perature, has been replaced with variable 8, the population density. The Cp value
associated with this particular subset of variables is 5.52, which is about 55%
larger than the minimum value, but closer to the value p = 7. The regression
(or least squares) equation with variables 1, 2, 6, 8, 9, 14 as the explanatory var-
iables can be written as

Mortality Rate = 988.4 + 1.487 (Precipitation) - 1.633 (Jan. Temp.)
               - 11.533 (Education) + 0.004 (Pop. Density)
               + 4.145 (% Non-White) + 0.245 (SO2 Potential) + e.        (5.3)
The coefficient of determination for this equation is 0.724.
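A fit of the form (5.3), together with its coefficient of determination, is mechanical once the data matrix is assembled. The data below are synthetic stand-ins (the mortality data set itself is not reproduced here), and `fit_ols` is a helper written for this sketch:

```python
# Least-squares fit with intercept, and its coefficient of determination
# R^2 = 1 - SSE/SST, computed as for equation (5.3); data are synthetic.
import numpy as np

def fit_ols(X, y):
    """Return (coefficient vector with intercept first, R^2)."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    return beta, 1.0 - sse / sst

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 3))
y = 988.4 + 1.5 * X[:, 0] - 1.6 * X[:, 1] + rng.normal(scale=0.5, size=60)
beta, r2 = fit_ols(X, y)
```

With a strong signal and small error variance, the recovered intercept sits close to the true 988.4 and R² is near 1; on noisy observational data like the mortality set, values such as 0.724 are typical.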
  Let S1 = {1, 2, 3, 6, 9, 14} be the subset of variables determined by the minimum
Cp criterion and S2 = {1, 2, 6, 8, 9, 14} the subset determined by the ridge
elimination procedure.
-------

FIGURE 4. RIDGE TRACE: TOTAL MORTALITY (Min. Cp Variables).
[Standardized coefficient estimates plotted against k, 0 <= k <= 1.0.]
than the other variables being considered. Using the Cp criterion this variable
would be eliminated in arriving at a "best" subset with five explanatory variables.
As noted before, population density has a stable increasing effect.
   Figures 6 and 7 are the squared lengths of the coefficient vectors of the variables

-------
FIGURE 5. RIDGE TRACE: TOTAL MORTALITY (Ridge Elimination Variables).
[Standardized coefficient estimates plotted against k, 0 <= k <= 1.0.]
specified by S1 and S2 respectively. For the S1 variables, the system does not act
unlike an orthogonal system. The decrease in the length of the coefficient vector
for this reduced set of variables is almost identical to what it would be in the case
of orthogonality.
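The orthogonal benchmark is easy to verify directly: when X'X is proportional to the identity, the ridge solution is the least squares solution scaled by 1/(1 + k), so its squared length decays as 1/(1 + k)². A small numerical check on a synthetic, exactly orthogonal design (all data invented for illustration):

```python
# Benchmark behind the "orthogonal" curves of Figures 6 and 7: for an
# orthogonal design the ridge estimate is beta_hat/(1 + k), so its squared
# length falls as 1/(1 + k)^2.  The design below is built to be orthogonal.
import numpy as np

rng = np.random.default_rng(2)
n, m = 100, 4
Q, _ = np.linalg.qr(rng.normal(size=(n, m)))
X = Q * np.sqrt(n)                        # X'X = n * I by construction
y = X @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(size=n)

XtX = X.T @ X / n                         # identity matrix
Xty = X.T @ y / n
beta_ols = np.linalg.solve(XtX, Xty)

for k in (0.0, 0.2, 1.0):
    beta_k = np.linalg.solve(XtX + k * np.eye(m), Xty)
    # every coordinate shrinks by the common factor 1/(1 + k) ...
    assert np.allclose(beta_k, beta_ols / (1 + k))
    # ... so the squared length shrinks by 1/(1 + k)^2
    assert np.isclose(beta_k @ beta_k, (beta_ols @ beta_ols) / (1 + k) ** 2)
```

For a correlated design the decay is no longer a single common factor, which is why the "actual" and "orthogonal" curves in the figures can separate.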

-------

FIGURE 6. SQUARED LENGTH OF COEFFICIENT VECTOR: MIN Cp VARIABLES.
[Actual and orthogonal curves plotted against k, 0 <= k <= 1.0.]

6. CONCLUSIONS AND SUMMARY

  We have discussed the total mortality data in detail with respect to ridge regres-
sion and eliminating explanatory variables employing two different criteria. Table
IV provides a summary of our results contrasting the ordinary least squares
estimates with those obtained from a ridge trace at the value k = .2. This table
exemplifies the instability of least squares estimates in this problem; namely,
the coefficients which achieve a residual sum of squares slightly larger than the
minimum value can differ by more than an order of magnitude, and even in sign,
from the corresponding least squares estimates. Entries are given for the regression
equation using all fifteen explanatory variables, the six variables selected by the
Cp criterion, and the six chosen by the ridge elimination method. The standard
deviations of the coefficient estimates are given below the estimates and are
enclosed in parentheses. The sum of squares of the estimated coefficients, the

-------

residual sum of squares and the coefficient of determination are provided for each
of the derived equations. All coefficients apply to the standardized model described
in Section 3. In our analysis, emphasis was placed on a technique which did not
eliminate variables, since we were specifically interested in the effects of all the
pollution potential variables. Hoerl and Kennard [12] suggest the best strategy
is retaining all variables in the analysis and choosing a "good" value of k. Variables
with small effects will then have small coefficients. As can be noted, the coefficients
of the variables remaining after elimination (by either method) agree rather
closely with the corresponding values with all variables included at k = .2. This
agreement is not as good at k = 0, i.e., when considering least squares solutions.
  Our choice of the value k = .2 is reasonable in the sense that: (i) all major
changing of order in the coefficient estimates has already occurred, (ii) the residual
sum of squares and coefficient of determination have values consistent with prob-
lems of this type, and (iii) assuming normally distributed errors, the vector B*(.2)
lies interior to the 95% confidence ellipsoid for the unknown vector B. However,
this particular value of k is not known to be "optimal" in the sense of minimizing
FIGURE 7. SQUARED LENGTH OF COEFFICIENT VECTOR: RIDGE ELIMINATION VARIABLES.
[Actual and orthogonal curves plotted against k, 0 <= k <= 1.0.]
-------


TABLE IV

Total Mortality: Coefficients of Standardized Variables for Two Ridge Solutions,
k = 0 (the Least Squares Solution) and k = .2. The standard deviation of an
estimate is given directly below it in parentheses. Summary statistics are given
at the bottom of the table.

                                15 Variables      Minimum Cp     Ridge Elimination
    Variable                    k=0     k=.2      k=0     k=.2      k=0     k=.2

 1  Precipitation              .306    .243      .288    .247      .239    .230
                              (.148)  (.069)    (.096)  (.065)    (.094)  (.065)
 2  January Temperature       -.318   -.168     -.242   -.164     -.267   -.172
                              (.181)  (.055)    (.084)  (.063)    (.085)  (.063)
 3  July Temperature          -.237   -.084     -.180   -.073
                              (.146)  (.071)    (.095)  (.066)
 4  % 65 Years and Older      -.213   -.055
                              (.200)  (.063)
 5  Population per Household  -.232   -.007
                              (.152)  (.068)
 6  Education                 -.233   -.114     -.185   -.190     -.157   -.171
                              (.161)  (.070)    (.087)  (.063)    (.090)  (.065)
 7  % Sound Housing Units     -.052   -.094
                              (.146)  (.069)
 8  Population per Mile²       .084    .123                        .097    .091
                              (.094)  (.065)                      (.082)  (.063)
 9  % Non-White                .640    .423      .657    .481      .594    .462
                              (.190)  (.068)    (.100)  (.066)    (.094)  (.065)
10  % White Collar            -.014   -.034
                              (.123)  (.068)
11  % Under $3000             -.009    .044
                              (.216)  (.066)
12  HC Pollution Potential    -.979   -.046
                              (.724)  (.045)
13  NOx Pollution Potential    .983    .043
                              (.747)  (.043)
14  SO2 Pollution Potential    .090    .243      .265    .255      .249    .232
                              (.150)  (.066)    (.080)  (.061)    (.037)  (.064)
15  Relative Humidity          .009    .033
                              (.101)  (.063)

    Sum of squared
    coefficients              2.758    .380      .711    .426      .578    .388
    Residual sum of squares    .236    .276      .265    .289      .276    .292
    R²                         .764    .572      .735    .541      .724    .553
the expected squared distance between B*(k) and B (or at least to have this distance
less than the corresponding distance between B*(0) and B). Hoerl and Kennard
[11] have established the existence of a ridge estimator (i.e., a k-value) which achieves
a smaller expected squared distance than the least squares estimator. Newhouse
and Oman [19] propose several methods for choosing a k value to use in ridge
regression and investigate their properties using Monte Carlo experiments with
two explanatory variables. It appears that an optimal choice of k (or interval

-------

of k values) is an open question at this time unless one has prior knowledge about
the length and/or direction of the unknown coefficient vector.
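The Hoerl-Kennard existence result, and the Monte Carlo flavor of the Newhouse and Oman experiments, can be illustrated with a small simulation. The two-variable collinear design, the value k = .2, and the replication count are all assumptions made for this sketch:

```python
# Monte Carlo sketch: with two nearly collinear predictors, ridge at k = .2
# attains a smaller average squared distance to the true coefficient vector
# than least squares (k = 0).  Design, k, and replication count are invented.
import numpy as np

rng = np.random.default_rng(3)
n = 30
beta = np.array([1.0, 1.0])               # "true" coefficients, assumed known
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)        # strong collinearity, fixed design
X = np.column_stack([x1, x2])
XtX = X.T @ X

def mean_sq_distance(k, reps=2000):
    """Average ||beta*(k) - beta||^2 over freshly simulated responses."""
    A = np.linalg.inv(XtX + k * np.eye(2)) @ X.T
    total = 0.0
    for _ in range(reps):
        y = X @ beta + rng.normal(size=n) # unit-variance normal errors
        b = A @ y                         # ridge estimate for this sample
        total += float((b - beta) @ (b - beta))
    return total / reps

d_ols = mean_sq_distance(0.0)
d_ridge = mean_sq_distance(0.2)
```

On a design this collinear, d_ridge comes out well below d_ols, consistent with the existence result; the gap shrinks as the predictors approach orthogonality, and the best k depends on the unknown coefficient vector, which is exactly the open question noted above.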
  We have done similar analyses with the three pollution potential variables
replaced in the linear model by the natural logarithms of the pollution potentials.
The instability of the least squares estimates is still present using these transformed
variables. The values of the coefficients of the three pollution variables at k = 0
and k = .2 respectively are: (i) ln (HC), -.669 and .003; (ii) ln (NOx), 1.028
and .206; (iii) ln (SO2), -.196 and .126. The other variable coefficients in this
model have values at k = .2 which are reasonably close to the corresponding values
in Table IV with the exception of the July temperature variable, whose value is
now .017. It is interesting to note that the sign of this coefficient is now positive,
whereas in the linear model it is negative. This further substantiates a remark
made at the end of Section 5 in connection with Figure 4; namely, the relative
effect of mean July temperature is questionable. In this respect, the subset of
variables chosen by the ridge elimination method appears to be a better choice
than the subset resulting from the minimum Cp criterion.
  The summary statistics for the model with fifteen variables including logarithmic
transformed pollution potentials are
-------

Los Angeles and San Francisco data provide information on a rather unusual
combination of factors and should motivate further investigation rather than
rejection of the two observations. In regard to the other two pollution potentials,
the HC term consistently exhibits no detrimental effect on mortality, while the
positive association of the SO2 potential is substantial. The effects of precipitation,
population density, the percent non-white, the percent under $3000, and relative
humidity are all increasing, with the first three of these variables having relatively
large effects. The other weather and socioeconomic variables investigated possess
negative coefficients and so have a decreasing effect on the total mortality rate,
with relatively large contributions coming from the January temperature and
education terms.


                                      7. ACKNOWLEDGMENTS

  The authors are particularly grateful to Miss D. Galarneau and Mr. H. Guge
for their computational aid, and to Mr. H. Ury, Mr. A. Hexter and the referees
for their many helpful comments on an earlier version of this paper.

REFERENCES
 [1] BENEDICT, H. M. (1971). "Plant Damage by Air Pollutants: CRC-APRAC Project No.
     CAPA-2-68," presented at the Automotive Air Pollution Res. Symp., Chicago.
 [2] DANIEL, C., and WOOD, F. S. (1971). Fitting Equations to Data (Computer Analysis of Multi-
     factor Data for Scientists and Engineers), John Wiley.
 [3] DRAPER, N. R. (1963). "'Ridge Analysis' of Response Surfaces," Technometrics 5, 469-479.
 [4] DUFFY, E. A., and CARROLL, R. E. (1967). United States Metropolitan Mortality, 1959-1961,
     PHS Publication No. 999-AP-39, U. S. Public Health Service, National Center for Air Pollu-
     tion Control.
 [5] GANZ, ALEXANDER (1968). Department of City and Regional Planning, Massachusetts
     Institute of Technology. Unpublished data.
 [6] GLASSER, M., and GREENBURG, L. (1971). "Air Pollution Mortality and Weather," Archives
     Environmental Health 22, 334-343.
 [7] GREEN, L. W. (1970). "Manual for Scoring Socioeconomic Status for Research on Health
     Behavior," Public Health Reports 85, 815-827.
 [8] HEXTER, A. C., and GOLDSMITH, J. R. (1971). "Carbon Monoxide: Association of Community
     Air Pollution and Mortality," Science 172, 265-267.
 [9] HICKEY, R. J., BOYCE, D. E., HARNER, E. B., and CLELLAND, R. C. (1970). "Ecological
     Statistical Studies Concerning Environmental Pollution and Chronic Disease," IEEE
     Transactions on Geoscience Electronics GE-8, 186-202.
[10] HOERL, A. E. (1962). "Application of Ridge Analysis to Regression Problems," Chemical
     Engineering Progress 58, 54-59.
[11] HOERL, A. E. and KENNARD, R. W. (1970). "Ridge Regression: Biased Estimation for Non-
     orthogonal Problems," Technometrics 12, 55-67.
[12] HOERL, A. E. and KENNARD, R. W. (1970). "Ridge Regression: Applications to Nonortho-
     gonal Problems," Technometrics 12, 69-82.
[13] HOCKING, R. R. and LESLIE, R. N. (1967). "Selection of the Best Subset in Regression
     Analysis," Technometrics 9, 531-540.
[14] HOLLAND, W. W., SPICER, C. C., and WILSON, J. M. G. (1961). "Influence of the Weather on
     Respiratory and Heart Disease," The Lancet, 338-341.
[15] LAMOTTE, L. R. and HOCKING, R. R. (1970). "Computational Efficiency in the Selection
     of Regression Variables," Technometrics 12, 83-93.
[16] LAVE, L. B. and SESKIN, E. P. (1970). "Air Pollution and Human Health," Science 169,
     723-733.
[17] MALLOWS, C. L. (1964). "Choosing Variables in a Linear Regression: A Graphical Aid,"
     presented at the Central Regional Meeting of the Institute of Mathematical Statistics,
     Manhattan, Kansas.
[18] MARQUARDT, D. W. (1970). "Generalized Inverses, Ridge Regression, Biased Linear Estima-
     tion, and Nonlinear Estimation," Technometrics 12, 591-612.

-------
[19] NEWHOUSE, J. P. and OMAN, S. D. (1971). "An Evaluation of Ridge Estimators," Report
     No. R-716-PR, Rand Corp., Santa Monica, Calif.
[20] OECHSLI, F. W. and BUECHLEY, R. W. (1970). "Excess Mortality Associated with Three
     Los Angeles September Hot Spells," Environmental Research 3, 277-284.
[21] REINKE, W. A. (1969). "Multivariate and Dynamic Air Pollution Models," Archives En-
     vironmental Health 18, 481-484.
[22] SHY, C. M., CREASON, J. P., PEARLMAN, M. D., McCLAIN, K. E., BENSON, F. B., and YOUNG,
     M. M. (1970). "The Chattanooga School Children Study: Effects of Community Exposure
     to Nitrogen Dioxide. I. Methods, Description of Pollutant Exposure, and Results of Ventila-
     tory Function Testing," J. Air Pollution Control Assoc. 20, 539-545.
[23] U. S. Department of Commerce (1970). Bureau of Census. Area Measurement Reports. Series
     GE-20 and Records.
[24] U. S. Department of Commerce (1966). Bureau of Census. U. S. Census of Housing: State
     and Small Areas.
[25] U. S. Department of Commerce (1960). Bureau of Census. U. S. Census of Population:
     U. S. Summary. Part A (Number of Inhabitants). Table 22.
[26] U. S. Department of Commerce (1960). Bureau of Census. U. S. Census of Population:
     U. S. Summary. Part B (General Population Characteristics). Table 63.
[27] U. S. Department of Commerce (1960). Bureau of Census. U. S. Census of Population:
     U. S. Summary. Part B (General Population Characteristics). Table 151.
[28] U. S. Department of Commerce (1960). Bureau of Census. U. S. Census of Population:
     U. S. Summary. Part B (General Population Characteristics). Table 152.
[29] U. S. Department of Commerce (1963). Environmental Data Service. Climatic Atlas of the
     United States, U. S. Govt. Printing Office.
[30] U. S. Department of Commerce (1962). Weather Bureau. Decennial Census of United States
     Climate: Monthly Normals of Temperature, Precipitation, and Heating Degree Days
     (1931-1960). Climatological Data, Monthly and Annual; Local Climatological Data, Monthly
     and Annual.

-------
the nation's health, march 1978  3

this month's book
The Danger in Statistics

Air Pollution and Human Health, by Lester B. Lave and Eugene P. Seskin.
Published by Johns Hopkins University Press, Baltimore, MD 21218.
Reviewed by Emanuel Landau, PhD
    This book represents the product of a
   decade of statistical research by two
   economists into an important issue, the
   benefits and costs associated with air
   pollution and its control. The authors
   are  to  be  commended for their
   willingness  to make  available so much
   of the data  used  to  generate the
   analyses.
  A year ago, in the April 1976 issue of The Nation's Health, the senior author
was quoted as stating that a projected reduction in pollution from stationary
sources (88 percent in sulfur oxide emissions and 58 percent in particulate
emissions) implied "a 7.7 percent reduction in the mortality rate." The
lessened pollution would result in the subtraction of more than 100,000
deaths from the 1.9 million annual deaths and would increase life expectancy
by about a year. Their current "most conservative" estimate is "a 7.0 percent
(6.97) reduction in the unadjusted total mortality rate." More than 100,000
lives would still be saved annually, a truly impressive figure, if there were
strong enough evidence to support it.
     It  is regrettable, therefore, that the
   book by  Lave and Seskin has such
   severe  limitations as  to seriously
   circumscribe its usefulness for  policy
   purposes.  I am particularly struck by
   the  selectivity  of  the  authors  in
   choosing those bits and pieces of data
   which support their conclusions.
  On the one hand, the authors state on page 6: "existing pollution data do
not necessarily provide good measurements for our purposes...Consequently,
the pollution measurements are, at best, remote approximations to an
individual's exposure to a specific pollutant." Nonetheless, the authors ignore
this and other of their published qualifications.
  The aerometric data used for 1960 swarm with problems which are not even
referenced in the book. Sulfate data from one station in only three cities in
1960 were used and then combined with data for 1957-1959 for the remaining
114 cities. This combination alone could produce statistics that have little to
do with reality. The data also include composites of different lengths of time
for different cities, a potentially serious source of error.
  Then, these single-station central city data are stated to represent the
sulfate values for the 117 Standard Metropolitan Statistical Areas (SMSA's)
in 1960. The findings of Goldstein and her co-workers on the limitation of
single monitoring stations for pollution measurement are clearly relevant here.
  The suspended particulate data are better; they rely on only two years'
data, 1960 and 1961, to represent 1960. Again, recognizing that mortality
from chronic diseases is assumed to be due to long-term exposure to "air
pollution" rather than to current levels, the authors state that the current
biweekly samples of selected pollutants are deemed to be "characteristics of
those past levels, which are in fact, related to current deaths." Many
epidemiologists are forced to make the same assumption. But because they
are trained to do so, they draw much more tentative conclusions about
public policy.
  The demographic problems in analyzing differential mortality appear to
have been resolved with flawed knowledge. The problem of migration is
superficially treated. The non-uniformity of mortality patterns for nonwhites
by area is apparently not recognized. The use of the population interval
"65 plus" as a surrogate for age is beset with difficulties.
  One overwhelming deficiency is the absence of significant information. The
absence of smoking characteristics is a critical deficiency in mortality analysis
by area. There are other demographic and medical care characteristics which
clearly have been excluded, thereby producing possibly spurious relationships.
The authors say, again, that the low correlation between supposed causative
factors and deaths (R2) "also indicate that omitted factors are very important
in explaining variations in the mortality rate across SMSA's." It is regrettable
that this observation was not heeded.
  There are numerous other technical deficiencies of the study which this
review cannot encompass.
  In summary, Lave and Seskin have demonstrated once again that even
sophisticated and innovative analysis cannot compensate for intrinsically
poor data. They have confirmed that the role of a causative agent(s) in the
pattern of differential mortality, by area in the U.S., is a difficult one to
unravel.
  The authors recognize the limitations
of  the data  for the benefit of the
academic  audience,  but do  not
recognize  those qualifications when
they make  their recommendations on
policy.
  Congress and other policy-making bodies cannot afford to depend on such
data; we might well wind up spending billions of dollars controlling the
wrong things.

-------

Environmental Health Perspectives
Vol. 20, pp. 149-157, 1977
Statistical Methods for Hazards

by Yvonne M. M. Bishop*
               The objective of this article is to document the need for further development of statistical methodology,
             training of more statisticians and improved communication between statisticians and the many other
             disciplines engaged in environmental research. Discussion of adequacy of the current statistical methodol-
             ogy requires the use of examples, which will hopefully not be offensive to the authors. Reference is made
             to recent developments and areas of unsolved problems delineated in three broad areas: enumeration data
             and adjusted rates; time series; and multiple regression.
               A brief outline of the ideas behind current methods of analyzing discrete data is followed by a demon-
             stration of their utility using an example of the effects of exposure, sex, and education on bronchitis rates.
               Examples are listed of the ubiquity of the time component when relating pollution effects to each other
             and to health effects. An artificial example is used to emphasize the effects of time-dependent autocorrela-
             tions, trends, and cycles; References are given to a variety of new developments in time-series analysis.
               Discussion of the pitfalls in multiple regression analysis, and possible alternative approaches is largely
             based on two recent review s and includes references to recent developments of robust techniques.
 Introduction

   Dramatic episodes of fog or smog accompanied
 by notably increased mortality and morbidity have
 convinced us that polluted air affects health (1-3).
 Now we must determine more precisely how much
 pollution and  what type of pollution causes disabil-
 ity. Both  the  exposure variable "air quality" and
 the outcome variable "health effects"  are hard to
 define and  measure. Much discussion centers  on
 the reliability and validity of specific  measures;  in-
 creasingly, attention is being paid to  numerous an-
 cillary factors or covariates that influence pos-
 tulated relationships. All these issues  are of crucial
 importance in designing good studies and point to
 the need for interdisciplinary input when studies are
 being designed.  If a study is poorly designed  no
 amount of subsequent statistical  legerdemain will
 produce meaningful results.  Conversely, even the
 best designed studies can lead to misleading conclu-
 sions if the data are  inadequately analyzed. We
 need both good design and good analysis.
   This paper addresses only  the issue of data
 analysis and ignores study design, except insofar as
 improvements of analytic techniques  will reflect on
  •Harvard School of Public Health, Boston, Massachusetts
 02115.


October 1977
design  requirements. As the  need  for  better
methodology cannot be appreciated unless the de-
ficiencies of the present  state-of-the-art  are  con-
sidered,  examples will be given where the infor-
mation obtained from the available data is not op-
timum. Examples for this purpose have been taken
from a Chess monograph (4). In some instances the
state of the art has improved since this work was
done; in other areas many deficiencies still exist.
The purpose of using these examples is not to criti-
cize but to demonstrate the importance of improv-
ing our analytic techniques.
  The introductory overview to  the Chess mono-
graph cites two statistical methodologies, general
linear regression for quantitative variables and gen-
eral linear models for  categorical responses (4-6).
The similarity  of the two methods  is stressed.
Below we show how the emphasis on this similarity
has led  the authors to report their  analyses of
categorical  models inappropriately and generally
inadequately exploit the  strengths of  the analytic
technique. We discuss the problems of time series
and why linear regression techniques are inappro-
priate for their analysis. Some of the  modern ad-
vances  in  fitting linear and nonlinear, models to
quantitative variables.are mentioned  briefly. We
conclude that the 1970  task force recommendations
should be stressed once again.


-------
Enumeration Data and Adjusted Rates

What Is a Log-Linear Model?

   In recent years there has been much development
 in the  handling  of  discrete data that have  many
 categorical variables.  Most authors agree that the
 interactions between the variables can best be de-
 termined by fitting models  that are linear in the
 logarithmic scale.
   Suppose we are  interested in the effect of the
 three variables sex, age, and exposure area on the
 prevalence of bronchitis. The most complex model
 states that each of the three variables has a propor-
 tional effect on the bronchitis rate, and that each
 pair of variables may modify the effect of the other,
 and  indeed that all three  variables may have a
 joint effect. This is equivalent to saying that the ef-
 fect of age on the bronchitis rate is not the same for
 each sex, and that the magnitude of this interaction
 varies  between exposure areas. We say that this
 model  includes  the  four-factor interaction
 bronchitis-age-sex-area. At the other extreme, the
 simplest model states that the bronchitis rate is con-
 stant for every sex-age-area combination. Between
 the most complex and the simplest model we  can
 choose from a large variety of intermediate models,
 each postulating different combinations of simple
 proportional  main  effects  and interaction effects.
 Each main or interaction effect is represented by a
 term in the log-linear model. Analysis consists of
 determining which intermediate model fits the data
 well and is not  appreciably improved by adding
 more terms.
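The span of models described above can be made concrete with a small sketch. Using hypothetical counts (not the CHESS data), the code below fits the simplest interesting log-linear model for a 2 × 2 area-by-bronchitis table, main effects only with no interaction term, and measures the fit with a Pearson chi-square statistic:

```python
# Sketch with hypothetical counts: expected cell counts under the
# log-linear model that contains only the area and bronchitis main
# effects (no area-bronchitis interaction term).

def fit_independence(table):
    """Expected counts under the no-interaction (independence) model."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi_square(observed, expected):
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

# rows: exposure area (low, high); columns: bronchitis (no, yes)
observed = [[950, 50], [900, 100]]
expected = fit_independence(observed)
x2 = pearson_chi_square(observed, expected)  # 1 degree of freedom here
```

Adding the interaction term would reproduce the observed table exactly; comparing the fit of the two models, term by term, is exactly the analysis described above.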
 How Do We Choose a Model?

    Although most authors are agreed upon the gen-
 eral utility of the log-linear model approach, there is
 some disagreement over the methods of obtaining
 estimates under a specific model and determining
 how well these estimates fit the observed data.
 Most of the  proposed  methods such as maximum
 likelihood, least squares,  or minimum chi-square
 usually yield comparable if not identical estimates,
 and the probability  levels  associated with the
 goodness-of-fit statistics are  in general very close.
 Thus although we can choose from a variety of tech-
 niques for fitting models to a  particular data set, the
 final selection of a suitable model is not dependent
on the choice of technique. Further discussion of
comparisons between  techniques has been given
elsewhere (7, 8).
  A well-fitting model  is selected by a process of
trial and error, and it includes those main effects
and interactions which are  large. The main effects
and interactions that do not improve the goodness-
of-fit are discarded.  We often declare that the ef-
fects that are included  are "significant" and those
that are discarded are "not significant." Indeed, we
may finish up with a table resembling an analysis of
variance table.  Such a table will  list effects of
importance, and give an indication of how the over-
all goodness-of-fit would be changed if each effect were
excluded from the model. The degrees of freedom
associated  with  these measure-of-fit statistics are
determined  from the number of categories in the
relevant variables. The  most commonly used mea-
sures are asymptotically distributed according to
the chi-square distribution and so the probability of
observing a value as large as or larger than the value
tabulated may be readily obtained.
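That tail probability can be computed in closed form for one degree of freedom, since a chi-square(1) variable is the square of a standard normal deviate. A minimal sketch:

```python
# P(X >= x) for X ~ chi-square with 1 degree of freedom:
# X = Z^2 for standard normal Z, so the tail is erfc(sqrt(x/2)).
import math

def chi2_sf_1df(x):
    """Upper tail probability of the chi-square(1) distribution."""
    return math.erfc(math.sqrt(x / 2.0))

p = chi2_sf_1df(3.84)   # ~0.05, the conventional cutoff
```

For higher degrees of freedom the regularized incomplete gamma function is needed; standard statistical libraries provide it.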
How Does This Help Us?

   Fitting models may be helpful in two ways: (a) we
can determine which effects are of importance, and
(b) we can use the fitted estimates obtained under
the model in order to obtain meaningful summary
statistics.  In  our example above, meaningful sum-
mary statistics might be bronchitis rates for  each
exposure  area  adjusted for differences in the sex
and age distributions  in the areas.
   The models can be extended to include many var-
iables. As an example of the type of situation where
they are of value we include Tables 1-3 which are
taken from the Rocky Mountain studies (4). Inspec-
tion of Tables 1 and 2 indicates that we
have  the following five variables: bronchitis, two
categories, yes or no; sex, two categories; educa-
tion, three categories; age, four categories; expo-
sure area, two categories.
   Multiplying together  the  number of categories
tells us that each person is distributed into one  of 96
cells. It is difficult to interpret Table 3 because suf-
ficient information on which model was fitted is not
given. If we assume (a) that sex, education and age
are related to bronchitis rates, (b) exposure area has
no effect  on bronchitis rates, (c)  the numbers of
persons in each sex-education-age category differs
by exposure area, and (d) that no multifactor effects
are present, then the model fitted would have the
terms shown in Table 4, each with their associated
degrees of freedom, one for each parameter.
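The 96-cell count quoted above follows directly from the category counts; a minimal check:

```python
# 2 (bronchitis) x 2 (sex) x 3 (education) x 4 (age) x 2 (area) = 96 cells
categories = {"bronchitis": 2, "sex": 2, "education": 3,
              "age": 4, "area": 2}
cells = 1
for k in categories.values():
    cells *= k
```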

-------
                  Table 1. Smoking- and sex-specific prevalence rates (percent) for chronic bronchitis by education and age (a, b)

                                     Nonsmokers          Ex-smokers          Smokers
         Category                    Mothers  Fathers    Mothers  Fathers    Mothers  Fathers
         Education:
           High school                2.06     3.23       5.83     4.81      14.50    21.18
           (illegible)                2.20     3.66       1.43     3.49      11.75    15.70
           (illegible)                1.36     1.95       2.51     2.13      10.85    19.46
         Age:
           ≤29                        1.09     0.00       2.63     0.00      13.07    14.55
           30-39                      1.31     0.68       3.86     2.72      11.17    15.59
           40-49                      2.63     4.10       2.38     3.24      14.95    21.00
           ≥50                        2.61     6.25       0.00     5.06       7.69    28.41

          (a) Data from Chess monograph (4).
          (b) Chronic bronchitis rates are equivalent to crude rates for symptom severities 6 and 7.
            Table 2. Prevalence of chronic bronchitis in nonindustrially exposed parents: individual and pooled community rates (percent)
                                                 by sex and smoking status

                                     Nonsmokers          Ex-smokers          Smokers
         Community                   Mothers  Fathers    Mothers  Fathers    Mothers  Fathers
         Pooled low                   1.08     1.25       3.12     1.45      11.78    17.05
         Low I                        1.36     1.10       3.05     0.00       8.72    12.44
         Low II                       0.48     0.00       4.55     5.31      14.15    (cut off)
         Low III                      1.06     4.88       0.00     0.00      13.68    (cut off)
         Pooled high                  2.54     3.47       2.80     4.82      12.88    (cut off)
         High I                       3.56     4.90       1.79     4.72      13.83    (cut off)
         High II                      1.50     2.00       3.92     4.95      11.75    (cut off)

         [Smokers/Fathers column truncated at a page break in the source.]
-------
to two. With this reduction we would have 32 cells
and be fitting 20 parameters, giving 12 degrees of
freedom. The addition of the effect of exposure on
bronchitis would bring us to 11 degrees of freedom
as given in Table 3. This example has been cited
laboriously to illustrate the importance of specify-
ing which model was fitted.
  There  were further problems in understanding
Table 3.  Apparently two  separate models were fit-
ted, one to smokers and the other to nonsmokers. If
we look at the first line of the table we see χ²
values  for sex and education are larger for smokers
than for nonsmokers. We might suspect that smok-
ing had a synergistic influence and enhanced  the
effects of age  and education. Such  a suspicion
would  be unjustified if the sample of smokers was
larger  than the  sample of nonsmokers. We cannot
make the assumption because χ² values increase
with larger sample sizes,  even  when the interaction
effect they reflect remains constant. We could read-
ily evaluate the  possibility of smoking  affecting
other interactions by the simple procedure of ad-
ding smoking as a sixth  variable to the other five
variables already in the model. Then we could de-
termine  the magnitude of possible three-factor ef-
fects—one relating smoking-sex-bronchitis and the
other relating smoking-education-bronchitis.
   If we  turn to the second purpose of model
fitting—to enable us to adjust rates for several  un-
derlying  variables simultaneously—we find that  this
strength of the procedure has been ignored. All the
rates given are either crude rates, or adjusted for at
most two variables using crude specific rates.

What Improvements Are Needed?

   In conclusion, the full strengths of the methodol-
ogy were not used: (1) variables were reduced to
two categories thus losing information, (2) smoking
was not included as a variable, thus its effect cannot
be assessed from the results given, (3) the particu-
lar model fitted could only be inferred, thus its
goodness-of-fit statistics are of no value, (4) the fit-
ted values were not used to compute adjusted rates.
Some  of the difficulties noted above stem from the
attempt to present the results in a table format that
resembles analysis of variance for continuous data.
Although there are similarities in that models  are
being  fitted, it is important to distinguish between
the strengths of the different methodologies appro-
priate  for different  types of data (9). Thus the in-
adequacies were largely due to a lack of understand-
ing of  the methodology. This indicates a need  for
better training and communication.
   Since 1970, further advances in technology have
been made, notably methods for dealing with or-
dered categories (10-14) and methods for comput-
ing variances for certain types of estimates. There
is still need  for further development of methods
suitable for a mixture of  discrete  and continuous
variables.

Time Series

Why Do We Need to Look at Them?

  The following are examples of situations where
the relationships between two or more series of data
collected  over time are of current  interest: (1) as-
sessing  the  performance  of a new pollution-
measuring device compared with that of a standard
device in the  field; (2) determining whether adja-
cent stations monitoring the  air in a city are giving
comparable data or whether there are real differ-
ences  in air quality in  neighboring regions;
(3)  determining whether central monitoring stations
give a true picture of individual exposure by com-
paring their readings with personal  dosimeter read-
ings; (4)  relating fluctuations in indices of disease
such as deaths, hospital visits or exacerbation  of
symptoms to measures of air quality; (5) assessing
the extent to which different pollutants increase and
decrease simultaneously or with a consistent lag be-
tween peaks; (6) prediction of the future levels of a
given series so that the effects of intervention may
be assessed.
  Thus the relationship of various time series is
central to relating environmental and health effects.


Why Is a Simple Correlation
Not Informative?

  In each of  the situations  cited  above attempts
have been made to use simple correlations as mea-
sures of the association between two time series.
This approach can be criticized on several levels.
  Range of  Observations.  If each serial
measurement could be regarded as independent  of
all preceding measurements  (which is usually un-
true) and was taken from a normal distribution then
correlation would be a reasonable approach. How-
ever, when observing natural phenomena, the
strength of the association will depend on the range
of values that occurred  during the observation
period.
  As an illustration, consider Figures 1a and 1b. In
Figure 1a, two lines, marked A and B, connect a
series of points. The points were obtained from a
table of random normal deviates (15). Thus
the  points are independent  observations  from a
normal distribution with mean of zero and variance

-------
  FIGURE 1. (a) Two series of independent normal deviates, r =
     0.37; (b) same series as Fig. 1a with different trends added to
     each series, r = 0.43; (c) same series as Fig. 1a with autocor-
     relation within each series, r = 0.66.
of one unit. Theoretically  the two series of inde-
pendent observations have a correlation of zero. By
chance we have an observed value of r = 0.37. In
Figure 1b we have introduced linear trends by
adding to these random deviates a difference of 0.2
between successive  measurements on line A,  and
differences of 0.1  for  line  B. The  correlation  we
now compute is increased to r = 0.43. If we were to
introduce steeper trends by adding larger constants,
we would get larger  values of r.
  Clearly, in periods of relative stability of the un-
derlying phenomena the values we obtain represent
noise about the constant true value, as in Figure 1a,
and the correlation between the two series will not
differ significantly from zero. If we measure both
phenomena during a period when both are subject
to a seasonal trend, as in Figure 1b, we will increase
our apparent correlation. If we measure during a
period that includes both an interval of stability and
an interval when both phenomena have a trend, we
will obtain an intermediate value for r. Before comput-
ing a correlation coefficient, it is necessary to con-
sider whether the series have common large  shifts,
whether we need to distinguish short-term associa-
tion from general seasonal trends and in fact to con-
sider carefully the  hypothetical  model we  are
evaluating.
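The construction behind Figures 1a and 1b can be sketched directly. With synthetic data (the seed and trend slopes here are illustrative, not the values behind the published figure), two independent noise series show only a small correlation, while adding a linear trend to each drives r toward one:

```python
# Sketch: trend-inflated correlation between two independent series.
import math
import random

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

random.seed(0)
n = 200
a = [random.gauss(0, 1) for _ in range(n)]
b = [random.gauss(0, 1) for _ in range(n)]
r_noise = pearson_r(a, b)               # near zero in expectation

# add linear trends: +0.2 per step to series A, +0.1 per step to B
a_trend = [v + 0.2 * t for t, v in enumerate(a)]
b_trend = [v + 0.1 * t for t, v in enumerate(b)]
r_trend = pearson_r(a_trend, b_trend)   # driven toward 1 by the trends
```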
  Successive Values Not Independent.  Most of
the time series data of interest cannot be regarded
as independent observations as we did in the pre-
ceding section. We have only to consider a familiar
measure such as minimum 24-hr temperature to ap-
preciate that the possible values for a particular day
fall within a range determined by knowing the time
of year, and can be defined even more closely by
knowing the values for immediately preceding days.
Thus the series are autocorrelated: the values for
day t are related to those for day t − 1, and so on.
This autocorrelation invalidates the use of regres-
sion or multiple regression techniques designed for
independent observations. The effect of autocor-
relation is shown in Figure 1c. The random value
for each day in Figure 1a has been added to the
value for the previous day, to provide a new series.
We note that the new series looks smoother, as
each day's values in a given series are related. The
two series are, however, still unrelated to each
other, except insofar as they have the same internal
relationship. The observed correlation has, however,
changed to r = 0.66.
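The smoothing just described, each day's random value added to the previous day's total, is a running sum, and its lag-one autocorrelation is high even though the underlying shocks are independent. A sketch with synthetic data:

```python
# Sketch of the Figure 1c construction: a running sum of independent
# shocks is strongly autocorrelated, although the shocks are not.
import random

def lag1_autocorr(x):
    n = len(x)
    m = sum(x) / n
    num = sum((x[t] - m) * (x[t - 1] - m) for t in range(1, n))
    den = sum((v - m) ** 2 for v in x)
    return num / den

random.seed(1)
noise = [random.gauss(0, 1) for _ in range(200)]
walk = []
total = 0.0
for v in noise:          # walk[t] = walk[t-1] + noise[t]
    total += v
    walk.append(total)
```

Applying regression methods that assume independent observations to a series like `walk` is exactly the error the text warns against.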

Advances and Needs in Time Series Analysis

   The foregoing simple examples illustrate some of
the characteristics of time  series that must be han-
dled. Almost any series will exhibit noise and au-
tocorrelation, and most will have cyclic patterns of
varying length.
  Bloomfield (16, 17) has investigated the use of
spectrum analysis as a tool for determining whether
the aggravation of asthma symptoms is related to
daily minimum temperature or to atmospheric SOx
levels. He explains: "The spectrum may be re-
garded as a decomposition of the variance of the
 data into components associated with different fre-
 quencies." Frequencies in  this context means
 number of cycles per day; thus an annual effect
 would theoretically be at the frequency  of  1/365
 cycle per day, but in fact the smoothing of the data
 (which was a necessary preliminary step) spreads
 the effect over a wider band. Bloomfield also com-
 putes the coherence  between series, which he ex-
 plains as "the frequency-dependent measure of cor-
 relation between series." Thus he has a  series of
correlations that show the extent to which the cy-
 clic  patterns of the series correspond.  He con-
 cludes, "the series are essentially unrelated at fre-
quencies above 0.25  cycles per day, which corre-
spond to a period of four days. However, at lower
frequencies, which correspond to longer  periods,
there is substantial  coherence. This is a warning
that  the impact of these  two series on the health
series may be complex and hard to disentangle."

  He also investigates partial coherence, namely
the frequency-dependent partial correlation between
asthma aggravation and sulfur oxide after correction
for the effects of minimum temperature. Throughout his
paper he warns  us about assumptions underlying
the analysis, namely that the series are "station-
ary"  in the sense that  the covariances  between
time periods are constant throughout the series, and
that  the  relationships between the  variables are
 linear, and finally  that the tentative  conclusions
 reached  may be reversed following  subsequent
analysis. Thus we conclude that this is a very prom-
ising approach but that care must be taken to recog-
nize  the importance of the underlying assumptions.

  Stressing the  limitations of a particular  model is
not  intended  to indicate  that the approach is
poor—rather it is to stress that analysis of time
series is  not simply a matter of running  the data
through a computer program. The situation  is de-
scribed by Box et al. (18): "The obtaining of sample
estimates of the autocorrelation function and the
spectrum are non-structural approaches, analogous
to the representation of an empirical distribution
function by a histogram . . . They provide a first
step . . . pointing the way to some parametric
model on which subsequent analyses will be based."
   Box and other authors (19-21) have been
 developing such specific models for carbon monox-
 ide in Los Angeles to study the effect of changes in
 methods of instrument calibration and the effect of
 various control measures.
   The noise inherent in any system together with
 the limitations of the lengths of the series,  usually
 requires that some form of smoothing is carried out
during the analysis. Researchers at Princeton have
 been  making rapid  advances  in development of
 these  techniques and are conducting Monte Carlo
 simulations to evaluate different approaches. Thus
 again  the research is in progress but much needs to
 be done before the relative advantages of different
 strategies are fully understood (22-24).
 Multiple Regression

 When Are Least-Squares Fits
 a Poor Choice?

   Pitfalls in the interpretation of linear least-
squares regression relating two variables are well
 known; they include nonnormality of the distribu-
 tion of variables, nonlinearity of the relationship be-
 tween the variables, lack of independence between
 observations and the presence of outliers. When the
 number of variables increases so do the problems:
 the list must be enlarged to include multicollinearity
 of the variables, and it is  no longer possible to de-
 tect these problems by simple  plots  of the  data.
 Even when  the problems are detected, the optimum
 method of analyzing data with one or more types of
 departure from the assumptions underlying least-
 squares regression is not  readily apparent. Recent
 developments deal with both methods of detecting
 particular types of departure and with data-analysis
 in the presence  of such  departures.  Increasingly
 these methods are being applied to analysis of en-
 vironmental data but are apparently not well known
 to all investigators.
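One of the pitfalls listed above, the presence of outliers, is easy to demonstrate with synthetic data: a single gross outlier can shift the least-squares slope substantially, which is the motivation for the robust and resistant techniques the review goes on to discuss.

```python
# Sketch with synthetic data: effect of one gross outlier on the
# ordinary least-squares slope.
def ols_slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

x = list(range(20))
y = [2.0 * v for v in x]          # points on an exact line, slope 2
slope_clean = ols_slope(x, y)

y_out = y[:]
y_out[19] = 200.0                 # one gross outlier (true value was 38)
slope_out = ols_slope(x, y_out)   # pulled well away from 2
```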
Directions of Current Development

   In a recent review, Hocking (25) suggests that
"the role of the developers of regression methodol-
ogy is to provide  the less skilled user with tech-
niques that are robust while easy  to  use and
understand."  Much effort has gone into the  de-
velopment of techniques that are "robust,"  or, in
other words, are relatively insensitive to departures
from the  usual assumptions underlying least-
-------
squares regression. Gnanadesikan et al. (26) have
been  particularly concerned with the detection of
outliers. Andrews (27, 28)  has re-analyzed data
originally analyzed by Daniel and Wood (29),
using newer techniques that he believes are resis-
tant to a small number of gross outliers. He warns
that his iterative technique is more expensive than
least-squares but in addition to  producing stable es-
timates it will detect outliers. Andrews reaches the
same  conclusions regarding this sample data set as
Daniel and Wood, and this has led  Hocking (25) to
observe that these skilled analysts using repeated
inspection  of residual plots were in fact using  a
robust procedure. Diaconis (30) has applied resis-
tant analysis of variance techniques to air pollution
data. Brown et al. (31) observed a reduction in mor-
tality rates in two California counties and suggested
that this might be a reflection of reduced air pollu-
tion consequent upon the 1974 fuel crisis. Diaconis
was unable to find parallel reduction in CO or NO2.
Thus the question remains open whether the ob-
served reduction in mortality was  due to other
causes, or to chance fluctuations, or to interactions
among air pollutants that  have  not yet been investi-
gated.
  The problem of multicollinearity has been tackled
by a variety of approaches. Schwing and McDonald
(32) have compared least-squares and ridge regres-
sion,  and have applied both ridge regression and a
sign-restricted least-squares method to the analysis
of the association between mortality  rates, natural
ionizing radiation,  and  some air pollutants. They
show that the latter two approaches yield compara-
ble results that differ from those obtained by using
least-squares (32, 33). The implications  of order re-
strictions have also been investigated (34). In the
conclusion of his review Hocking (25)  states that
"the  multicollinearity problem seems to have been
given too little attention in the statistics  literature."
He recommends that eigenvalues should always be
inspected to determine possible redundancies, but
that  when  near-singularities exist the  method of
handling them is not clear.
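Hocking's recommendation to inspect eigenvalues can be followed with elementary code. For two predictors the eigenvalues of the centered cross-product matrix are available in closed form; a sketch with two nearly collinear hypothetical predictors:

```python
# Sketch: a large ratio of largest to smallest eigenvalue of X'X
# (the condition number) flags near-singularity / multicollinearity.
import math

def eig2x2_sym(a, b, c):
    """Eigenvalues (largest, smallest) of the symmetric matrix [[a, b], [b, c]]."""
    d = math.sqrt((a - c) ** 2 + 4 * b ** 2)
    return (a + c + d) / 2, (a + c - d) / 2

x1 = [float(i) for i in range(10)]
x2 = [v + 0.01 * ((-1) ** i) for i, v in enumerate(x1)]  # x2 is almost x1

# center the predictors, then form the cross-product matrix X'X
m1, m2 = sum(x1) / 10, sum(x2) / 10
c1 = [v - m1 for v in x1]
c2 = [v - m2 for v in x2]
s11 = sum(v * v for v in c1)
s22 = sum(v * v for v in c2)
s12 = sum(p * q for p, q in zip(c1, c2))

lam_max, lam_min = eig2x2_sym(s11, s12, s22)
condition = lam_max / lam_min     # huge ratio: a redundancy is present
```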
  The problem of more complex relationships be-
tween variables has received  much  attention. In
a recent review,  Gallant (35) concentrates  on
methods of fitting nonlinear functions  rather than
on the detection of such functional relationships in
the data. Other authors such as Anscombe (36),
Wilk (37), and Cleveland and Kleiner (38) have de-
veloped sophisticated plotting techniques for detec-
tion  of characteristics  of the  data. Gnanadesikan
and Kettenring (26) review many of these.
 " All of these endeavors point to the complexities
that may be encountered in multivariate  data. In
view  of these complexities, it is  unlikely that a
least-squares fit of a simple "hockey stick" func-
tion will prove to be an adequate method of deter-
mining "threshold" levels of pollutants as has been
done (Fig. 2). This  method may be useful in  an
experimental  situation such as that described  by
McNeil (39), because other sources of variation are
controlled. Certainly it is misleading to present
point estimates obtained by this method without in-
dicating their variability, and without reporting any
attempt to investigate alternate models.
 FIG. 2. Examples of the use of a hockey-stick function where no
    attempt is made to indicate reliability or to assess the interac-
    tion effects of different pollutants (4). The plots show
    temperature-specific threshold estimates for symptom aggra-
    vation by sulfur dioxide, total suspended particulates (TSP),
    and suspended sulfates (SS).

-------
   In the example reproduced in Figure 2 the effect
of temperature was held constant, but three differ-
ent pollutants were each treated separately with no
attempt being made to consider how they would
affect symptom aggravation when present in differ-
ent combinations. Similar observations were made
by the discussants of a paper by Nelson et al. (40).


Conclusions

   The report of the task force on research planning
in Environmental Health  Sciences (41) recom-
mended in 1970  that further  development of effi-
cient  statistical techniques be undertaken. In at
least three of the five areas of concern (contingency
tables,  time series, and multivariate methods),
theoretical advances  have been made.  In some
areas these advances have been  well documented,
in others progress has  only reached the stage of
verbal reporting and  unpublished manuscripts.
Much needs  to be done, both in  terms of develop-
ment of theory and making readily accessible com-
puter programs with adequate documentation for
carrying out the techniques proposed.
   In spite of this developmental activity, review of
recent literature reveals  relatively few instances
where the newer techniques  are  being employed.
Partly this is because the stage of development is
such that they are not readily available, partly be-
cause of lack of communication. Thus the need for
training recommended in 1970 still exists.
  A satellite symposium on statistical aspects of
pollution problems was sponsored by IASPS in 1971
(42).  In  the published report,  Van Belle noted the
dangers that "producers" of statistical analyses will
base their product on arguments of dubious valid-
ity.  He  cites  four areas: the  first two  were:
(1) "The use of a linear regression model to ap-
proximate a cause-effect link  is questionable"  and
(2) "The use of elasticity coefficients is misleading
when the variables are measured  in arbitrary
units."
   He also cautions about the indiscriminate
accumulation of large bodies of data and on the ten-
dency to place too much faith in "indices." These
problems are still with  us.

  The author was supported in part by grant ES 01108 from the
U.S. Public Health Service. Many thanks go to Drs. B. Ferris
and F. Speizer for introduction to these problems.
 "This material is drawn from a Background Document pre-
pared by the author for the NIEHS  Second Task Force for
Research Planning in Environmental Health Science. The Re-
port of the Task Force is an independent and collective report
which has been published by the Government Printing Office
under the title, "Human Health and Environment—Some Re-
                                                        search Needs." Copies of the original material for this Back-
                                                        ground Document, as well as others prepared for the report can
                                                        be secured from the National Technical Information Service.
                                                        U.S. Department of Commerce, 5285 Port Royal Road, Spring-
                                                        field, Virginia 22161.
                    REFERENCES

  1. Glaser, M., Greenberg, L., and Field, F. Mortality and
     morbidity during a period of high levels of air pollution,
     Nov. 23-25, 1966. Arch. Environ. Health 15: 684 (1967).
 2. Schrenk, H. H., et al. Air pollution in Donora, Pa.:
     Epidemiology of an unusual smog episode of October, 1948.
     P. H. Bulletin 306, Fed. Sci. Agency, Div. Ind. Hyg., PHS,
     U.S. Dept. HEW, Washington, D.C., 1949.
 3. Scott, J. A. The London fog of December 1966. Med. Officer
     109: 250 (1966).
 4. Health  Consequences of Sulfur Oxides:  A Report from
    Chess, 1970-1971. U. S. Environmental Protection Agency,
    Research Triangle Park. N. C.. 1974.
 5. Graybill, F. An Introduction to Linear Statistical Models,
    Vol. I. McGraw-Hill, New York, 1961.
 6. Grizzle, J. E., Starmer, C. F., and Koch, G. G. Analysis of
     categorical data by linear models. Biometrics 25: 489 (1969).
 7. Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W.
     Discrete Multivariate Analysis: Theory and Practice. MIT
     Press, Cambridge, 1975.
 8. Johnson, W. D., and Koch, G. G. A note on the weighted
     least-squares analysis of the Ries-Smith contingency table
     data. Technometrics 13: 438 (1971).
 9. Wermuth, N. Analogies between multiplicative models in
     contingency tables and covariance selection. Biometrics 32:
     95 (1976).
10. Bock, R. D. Multivariate analysis of qualitative data. In:
     Multivariate Statistical Methods in Behavioral Research.
     McGraw-Hill, New York, 1975.
11. Clayton, D. G. Some odds ratio statistics for the analysis
     of ordered categorical data. Biometrika 61: 525 (1974).
12. Haberman, S. J. Log-linear models for frequency tables with
     ordered classifications. Biometrics 30: 589 (1974).
13. Simon, G. Alternative analyses for the singly-ordered con-
     tingency table. J. Amer. Statist. Assoc. 69: 971 (1974).
14. Williams, O. C., and Grizzle, J. E. Analysis of contingency
     tables having ordered response categories. J. Amer. Statist.
     Assoc. 67: 55 (1972).
15. Rand Corporation. A Million Random Digits with 100,000
     Normal Deviates. Free Press, Glencoe, 1955.
16. Bloomfield,  P. Spectrum analysis of epidemiological data.
    Paper presented at Fourth Symposium on Statistics and the
    Environment, Washington, D.C.,  1976.
17. Bloomfield,  P. Fourier Analysis of Time Series: An  Intro-
    duction. Wiley, New York, 1976.
18. Box, G. E.  P., and Jenkins,  G. M.  Time Series Analysis
    Forecasting and Control. Holden-Day, San Francisco,
    1970.
19. Box, G. E.  P., and Tiao, G. C. Intervention analysis and
    applications to economic and environmental problems. J.
    Amer. Statist.  Assoc. 70: 70(1975).         	
20. Tiao, G. C., Box, G. E. P., and Hamming. W. J. Analysis
    of Los Angeles photochemical smog data: a statistical over-
    view. J. Air Pollut. Control Assoc. 25: 260 (1975).     .  .
21. Tiao, G. C.. Box, G. E. P., and Hamming. W. J. A statisti-
    cal analysis  of the Los Angeles ambient carbon monoxide
    data 1955-1972. J.  Air  Pollut. Control Assoc., 25: 1129
    (1975).
156
                                                                         Environmental Health Perspectives


-------
                              HANDOUT

                 Methodological Problems Arising from
                      the Choice of an Independent
                  Variable in Linear Regression, with
                    Application to an Air Pollution
                        Epidemiological Study (1)

             Inge F. Goldstein (2), Martin Goldstein (3),
              Joseph L. Fleiss (4) and Leon Landovitz (5)

  1 Supported by the Study of Statistics and Environmental Factors in Health
    under the auspices of the SIAM Institute for Mathematics and Society, and
    by grant 00899 from the National Institute of Environmental Health Sciences.
  2 Division of Epidemiology, Columbia University School of Public Health,
    600 West 168 Street, New York, N.Y. 10032
  3 Department of Chemistry, Yeshiva University
  4 Division of Biostatistics, Columbia University
  5 Computer Center, Graduate Center of the City University of New York

-------
                               Abstract

       In epidemiological studies using linear regression, it is often
necessary, for reasons of economy or unavailability of data, to use as the
independent variable not the variable ideally demanded by the hypothesis
under study but some convenient practical approximation to it.  We show
that if the correlation coefficient between the "practical" and "ideal"
variables can be obtained, then a range of uncertainty can be obtained
within which the desired regression coefficient of the dependent on the
"ideal" variable may lie.  This range can be quite wide, even if the
practical and ideal variables are fairly well correlated, especially if
the regression of the dependent on the practical variable explains only a
small part of the variance in the dependent variable.  These points are
illustrated with data and observed regression coefficients from an air
pollution epidemiological study, in which pollution measured at one
station in a large metropolitan area (containing 40 aerometric stations)
was used as the practical approximation to the city-wide average
pollution.  The uncertainties in the regression coefficients were found
to exceed the regression coefficients themselves by factors of 15 to 150.
The problem is one that may afflict applications of linear regression in
general, and suggests caution when selecting independent variables for
regression analysis on the basis of convenience rather than relevance to
the hypotheses tested.

Keywords:  linear regression, selection of independent variables, air
pollution, acute health effects.

-------
           METHODOLOGICAL PROBLEMS ARISING FROM THE CHOICE
            OF AN INDEPENDENT VARIABLE IN LINEAR REGRESSION,
        WITH APPLICATION TO AN AIR POLLUTION EPIDEMIOLOGICAL STUDY

        I. F. Goldstein, M. Goldstein, J. L. Fleiss, L. Landovitz



Introduction

       In the course of reviewing the literature on the acute health
effects of air pollution (1-5) we became aware that too little attention
has been paid to the problem of incomplete concordance between the popu-
lation described by the health effects data and the population described
by the air pollution data.  We believe that this problem might have a
much wider relevance than to the specific studies which suggested it to
us.

       The problem may be stated in general terms as follows.  It is
often necessary, for reasons of economy or availability of data, to use
as an independent variable not the "ideal" variable demanded by the
hypothesis under study but some more easily accessible "practical"
variable which is believed to be an adequate approximation to the ideal
one.  If the ideal variable is really inaccessible, so that we have no
quantitative information about the relationship between it and the
practical variable, we have no choice but to rely on our intuitive judg-
ment that the second is an adequate approximation to the first.  However,
it occasionally happens that some quantitative information, even if
incomplete, can be obtained about their relationship -- for example, their
correlation.  The question we consider in this paper is: given such infor-
mation, can we estimate what kinds of errors exist in the regression or
correlation coefficients between the dependent variable and the "practical"
independent variable if these coefficients are regarded as estimators
of the regression or correlation coefficients between the dependent
and "ideal" independent variables?




       The problem may be made more concrete by consideration of a
particular air pollution epidemiological study (3).  In this study a
multiple regression analysis was performed of daily city-wide mor-
tality in New York City on daily values of sulfur dioxide and smoke-
shade (coefficients of haze) measured at a single centrally located
monitoring station over a 10-year period.  Excess deaths attributable
to pollution were calculated by multiplying the regression coefficients
by the average pollutant concentrations observed at the central station.
It was found that the excess deaths attributable to sulfur dioxide,
which greatly exceeded the share attributable to particulates, remained
constant over the ten-year period, even though the average annual sulfur
dioxide concentration decreased to one-third of its initial value over
this time period in response to regulatory efforts.  The conclusion
drawn was that sulfur dioxide is not really the agent responsible for
the excess mortality, but instead the agent is some component of urban
pollution whose daily fluctuations are highly correlated with sulfur
dioxide, but which has not changed over the ten-year period in which
sulfur dioxide has decreased significantly.  The conclusions of this
study have been cited in the course of administrative and legislative
hearings on air quality standards (6).

       During the course of the ten-year study, a number of additional
air pollution monitoring stations were established, and from 1967 on,
data from a network of 40 such stations have been available.  On the
basis of an extended analysis of the data provided by this network for
the three-year period from 1969 to 1971 inclusive (7,8,9), we have found
that correlation coefficients for daily pollution concentrations between
pairs of stations, among them the central station used in the 10-year
study, were quite low, averaging 0.4 for smokeshade and 0.5 for sulfur
dioxide.  We were led therefore to examine how the conclusions of the
former study might be altered by this new information about the relation
between pollution measured at the central station, regarded as the prac-
tical variable, and the city-wide daily average, regarded as the ideal
variable.


       The problem can be stated concisely in statistical terms.  Given
the regression coefficient beta_yx of a dependent variable y on an indepen-
dent variable x, and the correlation coefficient rho_xx' between x and a
second variable x', what can be inferred about the regression coefficient
beta_yx'?

Calculation of upper and lower bounds on the regression coefficient.

       Using the symbols sigma_x, sigma_x', and sigma_y for the standard
deviations of the variables x, x' and y respectively, we have the well-
known relation between regression and correlation coefficients

               beta_yx = (sigma_y / sigma_x) rho_yx                    (1)

where rho_yx is the correlation coefficient between y and x.

       The basis of our estimate of the relation between the observed
(beta_yx) and desired (beta_yx') regression coefficients will be a formula
relating the correlation coefficients between three variables taken in
pairs.  Consider the partial correlation coefficient

                             rho_yx' - rho_yx rho_xx'
       rho_yx'.x = -------------------------------------------        (2)
                   (1 - rho_yx^2)^1/2 (1 - rho_xx'^2)^1/2

Because a partial correlation coefficient must lie between the limits -1
and +1, the desired correlation coefficient rho_yx' must lie in the interval

       rho_yx rho_xx' +/- (1 - rho_yx^2)^1/2 (1 - rho_xx'^2)^1/2.

Substituting equation (1) into this result, we obtain an interval for all
mathematically possible values of the desired regression coefficient beta_yx':

   (sigma_x / sigma_x') rho_xx' beta_yx
       +/- (sigma_y / sigma_x') (1 - rho_yx^2)^1/2 (1 - rho_xx'^2)^1/2.    (3)


        The leading term of expression (3) has been proposed as a "corrected"
regression coefficient, with the additional term following the +/- sign
omitted (3).  Such use is clearly inappropriate unless certain assumptions
are made which require empirical test (see below).  In general, it is
necessary to bear in mind that the desired coefficient cannot be uniquely
determined by the additional information about the relation between the
practical and ideal variables, but can lie anywhere within a possibly
wide interval about the "corrected" coefficient.

        The width of the interval depends on the magnitudes of the correla-
tion coefficients rho_xx' and rho_yx.  As rho_xx' approaches unity, the
multiplicative factor (1 - rho_xx'^2)^1/2 approaches zero, and the single
"corrected" coefficient indeed becomes the correct one.  If, however,
rho_xx' is 0.8, a value generally
taken to represent excellent correlation, the multiplicative factor is
equal to 0.6, which is quite large.  When rho_xx' = 0.9, the factor is
equal to 0.44, which is still appreciable.

       When rho_xx' is as large as 0.8 or 0.9, the interval of uncertainty
would still be narrow if rho_yx were close to unity.  If, however, rho_yx
itself is small in magnitude, we can see that even an "excellent" correlation
between practical and ideal independent variables can be consistent with
considerable uncertainty in the regression coefficient beta_yx', as estimated
from the "corrected" value.

       The quantity rho_yx^2 represents the fraction of the variance in the
dependent variable explained by the regression.  We may sum up the above
discussion by stating that the uncertainty in the "corrected" regression
coefficient can be large either (a) if the practical and ideal indepen-
dent variables are imperfectly correlated, or (b) if the regression of
the dependent on the practical independent variable accounts for only a
small part of the variance in the dependent variable, or, of course,
both.
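The size of the multiplicative factor for a few values of rho_xx' can be tabulated directly (a quick numerical check of the point above; 0.652 and 0.776 are the inter-station correlations used later in Table 1, the others are round reference values):

```python
import math

# Half-width multiplier (1 - rho_xx'^2)**0.5 from expression (3):
# it stays substantial even at correlations usually called "excellent".
factors = {rho: math.sqrt(1.0 - rho ** 2) for rho in (0.652, 0.776, 0.8, 0.9)}
for rho, f in sorted(factors.items()):
    print(f"rho_xx' = {rho:5.3f}  ->  factor = {f:.2f}")
```

At the paper's own values the factor is about 0.76 for smokeshade (rho_xx' = 0.652) and 0.63 for sulfur dioxide (rho_xx' = 0.776).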




Application




       As an example of the degree of uncertainty that can be introduced
into an estimated regression coefficient, we give the results of apply-
ing the above formulae to the regression coefficients obtained in the
air pollution study referred to earlier (3) (see Table 1).  In this table
we give the "corrected" regression coefficients for each of three mor-
tality variates on the two pollution variates, sulfur dioxide and smoke-
shade, and in addition, the upper bounds (+ sign from expression 3) and
lower bounds (- sign from expression 3) on the desired coefficient.  As an
estimator of the uncertainty that exists in the "corrected" coefficient
we give the ratio of the range of uncertainty (upper bound minus lower
bound) to the regression coefficient itself.  It can be seen that the
range of uncertainty of the regression coefficients greatly exceeds the
regression coefficients themselves, by factors ranging from 15 to nearly
150.  This should not be surprising in view of the above discussion.
The correlation coefficients between practical and ideal variables are
appreciable, but are far from perfect, and the regressions found in the
study explain only a very small amount of the variance in daily mortal-
ity, although by the usual tests they are statistically significant.

       The latter point is demonstrated in Table 2, where the six re-
gression coefficients for the three mortality variates on each of the two
pollution variates are tabulated; the squares of the six correlation co-
efficients are calculated from the regression coefficients and the stan-
dard deviations.  These squared correlation coefficients, equal to the
fractions of the variances explained by the regressions, range from
.00013 to .0175, and are thus quite small in all cases.
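The internal consistency of the tabulated quantities can be checked directly; for instance, for total mortality (M1) on sulfur dioxide, with values taken from Tables 1 and 2:

```python
# Fraction of variance explained (Table 2):
#   rho_yx^2 = beta_yx^2 * sigma_x^2 / sigma_y^2
beta_sq, var_x, var_y = 0.0476, 12.53, 329.42
rho_sq = beta_sq * var_x / var_y           # ~ .00181, as tabulated

# Ratio of interval length to "corrected" coefficient (Table 1):
upper, lower, corrected = 6.55, -5.89, 0.326
ratio = (upper - lower) / corrected        # ~ 38.2, as tabulated
```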






The Appropriateness of the "Corrected" Coefficient

       In an unpublished appendix to their air pollution - acute health
effects study (3), available on request from them, its authors responded
to our earlier criticisms (10) of the use of a single monitoring station
to represent the whole metropolitan area.  They made an estimate, using
our published correlation coefficients for pairs of stations, of the
effect their procedure had on their conclusions.  As they have not yet
published their analysis in the literature, we will not criticize it in
detail.  Their correction effectively amounts to the use of the first
term on the right-hand side of expression (3), i.e., of the "corrected"
coefficient; the implication is that there is no range of uncertainty
in the "corrected" regression coefficient.  In our view their impli-
cation is valid only if the assumption is made that the differences
between the practical and ideal variables are due solely to random errors
in the practical variable.  The validity of this assumption is germane
to the subject of this paper.  The assumption that the errors in the
practical variable are random ones is a very restrictive one, and should
not be made without direct evidence that they are.
       In the case of the relation between pollution at the central
station and city-wide pollution there are a priori reasons for believing
that at least some of the factors giving rise to differences are struc-
tured rather than random.  The central station is located in a specific
geographical area (Harlem) of the City, and has its own relation to geo-
graphical features such as hilly terrain, distance from the rivers sur-
rounding Manhattan Island, prevailing wind directions, and so on.  These
in turn interact with the location of major pollution sources in such a
way as to make it likely that there is a distinct pattern to pollution
in the vicinity of this station that makes it something other than merely
an unbiased estimator of city-wide pollution.  This is made additionally
plausible by the facts that pollution at the central station tends to be
greater than the city-wide average, and has changed over time in a differ-
ent way from the average.
       This a priori argument for the existence of a non-random component
in the relation between the two variables is confirmed by our own studies
of the data of the New York City aerometric network, from which we have
concluded that poor correlation between stations is not solely due to
random errors of measurement (8,9).  On the other hand, our attempts to
find an inner structure to the data, in terms of meteorological patterns,
inter-station distances, proximity to pollution sources, etc., have met
with only modest success (11).  For example, principal component analyses
of the covariance matrix of daily pollution readings did not reveal clear-
cut systematic patterns in the data.
       Our failure to discover such patterns may reflect a large component
of random errors, or it may reflect our own lack of ingenuity in the search.
In any event, our efforts in this direction are continuing.

How ideal is the ideal variable?

       In the above analysis we have assumed that the city-wide average
level of pollution is the ideal variable, to which pollution measured at
a central station is a practical approximation.  However, we must acknow-
ledge that epidemiological considerations raise questions about this
assumption.
       The pollutants measured by the New York City aerometric network -
sulfur dioxide and smokeshade - are only two of a great number of different
pollutants in urban air that might have adverse health effects.  Further,
the average as we have defined it does not weight the individual stations
according to the size of the populations in the areas in which they are
located, nor does it take into account the demographic characteristics of
these populations.

-------
       A population-weighted city average of, say, sulfur dioxide need
not necessarily be perfectly correlated with the exposure to sulfur
dioxide of the individual inhabitants of the city, some of whom spend
most of their time indoors and others not, and some of whom spend their
days in a different area of the city from where they spend their nights,
while others stay in one area all the time.  Still further, it is not
clear whether the focus of the health study should be on the population
as a whole or on susceptible subgroups, nor whether mortality is a
better indicator of health effects of pollution than morbidity.

       It should be clear from this discussion that what we have designated
as the "ideal" variable is far from ideal.  We do not have a solid em-
pirical basis for deciding what the ideal independent variable in air
pollution studies should be, nor, inevitably, any knowledge of how well
the city-wide average of sulfur dioxide or smokeshade correlates with it.
We must acknowledge therefore that as wide as the intervals of uncer-
tainty are in our estimates of the regression coefficients of health
outcomes on the city-wide averages, they are probably gross under-
estimates of the real uncertainty in our knowledge of the health effects
of air pollution.







Conclusions



       While we have discussed this problem using one particular air



pollution study as an example, it should be obvious that it is a problem



of much wider relevance in all areas of epidemiology, and, for that



matter, whenever linear regression is used to provide clues to causal




relationships.




-------
       We have calculated, using a well-known relation among pairwise
correlation coefficients, both a "corrected" regression coefficient and
its range of uncertainty, applicable to the situation where practical
considerations dictate a choice of independent variable other than the
independent variable ideally demanded by the hypothesis under test, and
where the correlation coefficient between the two independent variables
is known.

       We have found that the range of uncertainty may greatly exceed
the coefficient itself, even if the two independent variables are fairly
well but not perfectly correlated.  We have shown further that the un-
certainty is particularly enhanced when the fraction of the variance in
the dependent variable explained by the regression is small.  In the
example we have considered in this paper, a study of the effect of air
pollution on health, the uncertainty in the regression coefficient due
to the imperfect correlation between the practical and ideal independent
variables makes it quite unreliable as an estimator of health effects,
in spite of the fact that it is statistically significant by the usual
tests.

       It is more commonly the case that the practical independent
variable is recognized to be an uncertain measure of the ideal variable,
but no quantitative information about their relationship is available.
An awareness of how uncertain the observed regression coefficient can
actually be under such conditions should lead to greater caution in
interpretation of the results of a regression analysis.

-------
                                 TABLE 1

    Limits of Uncertainty About Six "Corrected" Regression Coefficients

                             "CORRECTED"  UPPER       LOWER      RATIO OF INTERVAL
POLLUTION       MORTALITY    REGRESSION   BOUND       BOUND      LENGTH TO "CORRECTED"
VARIATE (x)     VARIATE (y)* COEFFICIENT  ON beta_yx' ON beta_yx' COEFFICIENT

Sulfur              M1          0.326       6.55       -5.89           38.2
Dioxide             M2          0.0185      1.34       -1.30          143.
(rho_xx'=0.776)     M3         -0.158       3.76       -3.44           45.6

Smokeshade          M1          0.427       4.21       -3.35           17.7
(rho_xx'=0.652)     M2          0.0142      0.823      -0.794         114.
                    M3          0.251       2.44       -1.93           17.4

*  Three separate city-wide daily mortality variables were examined (3):
   M1:  Total Mortality,  M2:  Respiratory Disease Mortality,  and M3:
   Heart Disease Mortality.

-------
                                TABLE 2

    Squared Regression and Correlation Coefficients of City-wide
 Mortality Variates (y) on Centrally Measured Pollution Variates (x)

POLLUTION            MORTALITY
VARIATE (x)          VARIATE (y)   sigma_y^2   beta_yx^2    rho_yx^2

SO2                      M1         329.42      .0476       .00181
(sigma_x^2=12.53)        M2          14.82      .0001534    .00013
                         M3         110.2       .01119      .00127

Smokeshade               M1         329.42      .09922      .0169
(sigma_x^2=56.13)        M2          14.82      .000110     .00042
                         M3         110.2       .03437      .0175

-------
                                References

 1.  Glasser, M., Greenburg, L.:  Air pollution, mortality and weather.  Arch.
         Environ. Health 22: 334-345, 1971
 2.  Schimmel, H., Murawski, R.:  SO2 - harmful pollutant or air quality indicator?
         J. Air Pollut. Cont. Assoc. 25: No. 7, 739-744, 1975
 3.  Schimmel, H., Murawski, R.:  The relation of air pollution to mortality.
         J. Occup. Med. 19: 316-333, 1976
 4.  Hodgson, R.A.:  The effect of air pollution on mortality in New York City.
         Report No. 1012.  Presented before the statistics section at the 96th
         annual meeting of the American Public Health Association.
 5.  Buechley, R.W., Riggan, W.B., Hasselblad, V., Vanbruggen, J.S.:  SO2 levels
         and perturbations in mortality:  a study in the New York - New Jersey
         metropolis.  Arch. Environ. Health 27: 134-137, 1973
 6.  Hearing before the New York State Department of Environmental Conservation
         and the New York City Environmental Protection Administration on the
         health effects of air pollution.  1976/1977.
 7.  Goldstein, I.F., Landovitz, L., Block, G.:  Air pollution patterns in New
         York City.  J. Air Pollut. Cont. Assoc. 24: No. 2, 148-152, 1974
 8.  Goldstein, I.F., Landovitz, L.:  Analysis of air pollution patterns in New
         York City: I.  Can one station represent the large metropolitan area?
         Atmos. Environ. 11: 47-52, 1977
 9.  Goldstein, I.F., Landovitz, L.:  Analysis of air pollution patterns in New
         York City: II.  Can one aerometric station represent the area surround-
         ing it?  Atmos. Environ. 11: 53-57, 1977
10.  Goldstein, I.F., Landovitz, L.:  Sulfur dioxide: Harmful pollutant or
         air quality indicator?  J. Air Pollut. Cont. Assoc. 24: No. 12, 1195, 1975
11.  Glasser, M., Greenburg, L.:  Air pollution, mortality and weather.  Arch.
         Environ. Health 22: 334-343, 1971
12.  Goldstein, I.F., Landovitz, L., Fleiss, J.L.:  Analysis of air pollution
         patterns in New York City:  III.  Distributions of air pollutants
         over the city.  (Submitted to Atmos. Environ.)

-------
 Ann. Rev. Public Health. 1981. 2:397-429
 Copyright © 1981 by Annual Reviews Inc. All rights reserved

 AIR POLLUTION AND
 RESPIRATORY DISEASE

 Alice S. Whittemore
 Department of Family, Community, and Preventive Medicine, Stanford
 University School of Medicine, Stanford, California 94305
   If you visit American city,
   You will find it very pretty.
   Just two things of which you must beware:
   Don't drink the water and don't breathe the air.

   Pollution, pollution,
   Wear a gas mask and a veil.
   Then you can breathe, long as you don't inhale.*

 INTRODUCTION

 Concern about polluted air in our urban and industrial areas began gather-
 ing momentum shortly after World War II. At that time it seemed obvious
 that clean air, like clean water, clean food, and a clean body, was a worth-
 while goal in itself, requiring no further justification. But it soon became
 evident that this goal is expensive to attain, and that  rigid adherence to
 stringent standards of cleanliness diverts limited human resources away
 from other pressing and critical problems. Awareness of such facts has
 reoriented the goal to one of protecting public health. This emphasis is
 clearly stated in the US Clean Air Act of 1963: "The Congress... finds that
 the growth in the amount and complexity of air pollution brought about by
 urbanization, industrial development, and the increasing use of motor vehi-
 cles, has resulted in mounting dangers to the public health  and welfare" (1).
 Two decades later we ask: What do we know of these dangers and what
 must we do to improve our knowledge?

   * From "That Was the Year that Was," music and lyrics by Tom Lehrer, copyright © 1965
 by Tom Lehrer. Reprinted by permission.

-------
   We know from "killer fogs" in the period from  1930 to 1960 that very
 high levels of air pollution for several days can be fatal. These episodes,
 during which hospital admissions and deaths increased dramatically, pro-
 vide the clearest evidence of human hazard due to smoggy, dirty air. They
 have motivated clean air legislation in many of the industrialized nations
 of the world.
   The  Clean Air Acts of Great Britain in 1956 and of the US from 1963
 to 1970 differ considerably in their underlying control strategies. The for-
 mer has reduced ambient levels of soot and smog simply by restricting the
 grossly inefficient burning of coal in domestic open fires. The US Clean Air
 Amendments of 1970, on the other hand, mandate the promulgation of
 national air quality standards for each of several distinct pollutant species.
 Those currently regulated and their standards are shown in Table 1. Unlike
 the British approach, the US standards are based on the assumptions that
 the culpable constituents of pollution can be identified,  and that "safe"
 levels of these agents can be determined. As we shall see, these assumptions
 lead to considerable regulatory difficulty.
   On both sides of the Atlantic, legislation and its enforcement have led to
 sizable reductions in ambient levels of virtually all air contaminants. These
 trends are evident in Figure 1. Ironically, they present a dilemma: while the
 evidence of adverse effects due to high levels of air pollution is  unques-
 tioned, at the current lower levels effects are extremely difficult to discern.
 The three strategies for detecting effects are animal experiments, controlled
 human studies, and epidemiological studies of populations exposed to vary-
 ing types of ambient air.

      Table 1  US primary air quality standards

 Pollutant           Standard               Averaging time
 SO2                 (0.03 ppm)             Annual arithmetic mean
                     365 µg/m3 (0.14 ppm)   Maximal 24-hr average
 Total suspended     75 µg/m3               Annual geometric mean
 particulates        260 µg/m3              Maximal 24-hr average
 Photochemical       0.12 ppm               Maximal 1-hr average
 oxidants
 NO2                 100 µg/m3 (0.05 ppm)   Annual arithmetic mean
 CO                  10 mg/m3 (9 ppm)       Maximal 8-hr average
                     40 mg/m3 (35 ppm)      Maximal 1-hr average

-------
 [Figure 1 appears here in the original: a time-series plot of annual
 average pollutant level versus year.]

 Figure 1  Average annual levels of total suspended particulates and sulfur dioxide in US
 during 1960 to 1978. Particulate levels represent annual geometric means of daily values
 averaged over 95 sites during 1960 to 1971 and over 3000 sites during 1972 to 1978. Sulfur
 dioxide levels represent annual arithmetic means of daily values averaged over 32 sites during
 1965 to 1972 and over 1322 sites during 1972 to 1978. Reprinted with permission from William
 F. Hunt, Environmental Protection Agency, personal communication.

   Because the investigator working with animals has relatively close con-
 trol over experimental conditions, causal relationships are more easily es-
 tablished in animal than in human studies. Further, animal experiments can
 delineate, biochemical and pathophysiological mechanisms for damage due
 to one or more air pollutants. Such mechanisms can motivate human stud-
 ies. They can also strengthen a causal basis for associations observed in
 epidemiological studies. On the other hand,  there are problems in ex-
 trapolating animal results to man in the face of large interspecies variability
 in susceptibility to toxic effects. Another limitation of animal studies, shared
 by controlled human experiments, is their inability to simulate the complex
 and changing pollutant mix in ambient air.
   Controlled laboratory studies of human exposures can  provide chemi-
 cally  specific information about which pollutants at which levels  produce
 acute effects. But they cannot identify the effects of long-term exposures to
 moderate pollutant levels. Moreover, findings obtained with healthy volun-
 teers  may not pertain to sensitive subgroups of the general population.
 These limitations, together with those noted above for animal studies, force
 us to turn to epidemiology for detailed information on which to base air
 quality standards.
   Although epidemiological studies can examine the effects of realistic
 exposures among  various population subgroups, they have limited ability
 to specify which pollutants at which levels produce risk. In  addition, they

-------

are insensitive to small effects. All observational studies have at best limited
control of intervening factors that can bias relationships of interest. How-
ever, those concerning the effects of air pollution are fraught with special
difficulties, as discussed in the following sections.
  I limit this review to epidemiological studies of the role of ambient air
pollution in respiratory disease. I evaluate what is known and what needs
to be known in this area in order to determine pollutant levels leading to
tolerable health risks. Because animal studies, controlled human studies,
and nonrespiratory hazards are not considered, I make no general assess-
ment of the current US standards. Instead, I argue that the epidemiological
studies considered do not provide an adequate basis for establishing the
dose-response relationships needed to set standards. The two sections below
contain a brief description of air pollution, and a critical survey of some
recent  evidence relating air quality  to respiratory mortality and morbidity,
especially morbidity. This is followed by a discussion of the obstacles to be
overcome and the outlook for the  future.

POLLUTANTS

Contaminants present  in ambient  air can be classified loosely  into four
categories:  (a) sulfur oxides and particulates, (b) photochemical oxidants
and oxides of nitrogen, (c) carbon monoxide, and (d) metallic and other
pollutants,  such as lead, cadmium,  asbestos, radon, etc, that frequently are
emitted by  industrial point sources.  This article deals only with the respira-
tory effects of pollutants in the first two categories. The reader is referred
to references (2-11) for reviews of health hazards associated with the re-
maining pollutants.

Sulfur Oxides and Particulates
Sulfur oxides and particulates are grouped together because they commonly
occur simultaneously as the result of fossil fuel combustion. Their common
occurrence makes it extremely difficult to indict individual components, e.g.
sulfur dioxide, sulfuric acid, sulfate aerosols, hydrocarbons, and  inorganic
ash, as hazards to health. In addition  to  fossil fuel combustion, other
sources for  sulfur oxides are smelters and sulfuric acid manufacturers, while
particulates are also generated by such industrial sources as grain  elevators,
cement works, and lumber mills. These pollutants do not form a chemically
or physically specific mixture, but one that varies both geographically and
temporally in chemical character, oxidation state, and particle size distribu-
tion. Part of this  variation is due to the rate and source  of emissions, as
well as to the meteorology and local topography of the area. Thus associa-
tions with respiratory damage found at a particular time and place cannot

-------

be generalized readily. When inhaled, the larger particles are apt to lodge
in the nasal and pharyngeal passages and are less likely to be deposited in
the lower respiratory tract. For this reason it is the finer particles (<2 to
3 μm in diameter) that are believed to pose the greater threat to the lower
respiratory tract.
  The problems in learning the effects of particulates are further compli-
cated by nonuniform methods of measurement. The two most widely used
techniques, the high volume (HV) method and a class  of smoke filter
methods, measure different properties of the particulate complex.
  The HV method, which is generally used in the US, employs a device that
circulates large volumes of air through a filter. The weight of the particu-
lates trapped on the filter is divided by the volume of the air sample and
is reported in μg/m³. The larger particles, of less importance to the lower
respiratory tract, provide a disproportionately large contribution to the
measurement.
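The HV arithmetic is simply trapped mass over sampled volume. A minimal sketch (the function name and the sampler values below are invented for illustration, not part of any standard method description):

```python
# A minimal sketch of the HV arithmetic described above: concentration =
# trapped particulate mass / sampled air volume, reported in ug/m3. The
# function name and the sample values are invented for illustration.

def hv_concentration_ug_m3(filter_gain_g, flow_m3_per_min, minutes):
    """Particulate concentration in ug/m3 from a high-volume sample."""
    mass_ug = filter_gain_g * 1e6          # grams -> micrograms
    volume_m3 = flow_m3_per_min * minutes  # total air drawn through filter
    return mass_ug / volume_m3

# e.g. a 24-hr sample at 1.5 m3/min that adds 0.324 g to the filter:
print(hv_concentration_ug_m3(0.324, 1.5, 24 * 60))  # about 150 ug/m3
```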
  When smoke filter methods are used, air is drawn through a filter paper
and the density of the resulting stain is assessed by various techniques.
These methods include the smoke shade method, commonly used in Great
Britain, and the coefficient of haze (COH) method. The British smoke shade
(BS) measurements are converted to an equivalent mass basis and expressed
as μg/m³, while the COH measurements are expressed in terms of optical
density as COH units per 1000 linear feet of air.
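The COH bookkeeping can be sketched the same way. A hedged illustration, assuming the usual conventions that one COH unit corresponds to an optical density of 0.01 and that "linear feet of air" is sampled volume divided by the filter spot area; the numbers are invented:

```python
# A hedged sketch of the COH bookkeeping described above, assuming the
# usual conventions (one COH unit = optical density of 0.01; "linear feet
# of air" = sampled volume / filter spot area). All numbers are invented.

def coh_per_1000_ft(optical_density, sample_volume_ft3, spot_area_ft2):
    """COH units per 1000 linear feet of air drawn through the filter."""
    linear_feet = sample_volume_ft3 / spot_area_ft2  # length of air column
    coh_units = optical_density / 0.01               # 1 COH unit = OD 0.01
    return coh_units * 1000.0 / linear_feet

# e.g. a stain of optical density 0.05 after 20,000 linear feet of air:
print(coh_per_1000_ft(0.05, 2000.0, 0.1))  # about 0.25 COHs per 1000 ft
```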
  The HV and smoke filter methods measure different attributes of particu-
late pollution; the measurements are not  comparable. There have been
many studies  of the relationships between them, as summarized in Chapter
2 of (12). However, as is emphasized in (12), these relationships are empiri-
cal, and vary with the particular time and place of observation.
  The results of controlled  animal and human exposures to particulates,
sulfur dioxide, sulfuric acid mist, and other sulfates are reviewed in [(7),
Vol. 8, (8, 9,  13), and (14),  pp. 175-225].

Photochemical Oxidants and  Nitrogen  Oxides
Photochemical oxidants are produced by nitrogen oxides, hydrocarbons,
and solar radiation in a complex reaction.  Combustion of fossil fuels and
motor vehicle emissions provide major sources  for the reactants. Ozone is
generally the principal component and the one believed  to be the most
irritating to the respiratory tract. Also present but not usually measured are
peroxyacetyl nitrate, acrolein, peroxybenzoyl nitrates, and aldehydes.
Apart from NO2, little is known about the respiratory effects of nitrogen
oxides.
  The problems in evaluating the effects of oxidants and  nitrogen oxides
have been discussed elsewhere (5-9, 15-18). One problem is that hydrocar-

-------

bons from natural sources can react with ozone in the upper atmosphere
to produce high ambient levels of oxidant. Another is that oxidants often
occur simultaneously with very high temperatures, which can themselves
aggravate disease and hasten death.
  Like the particulate complex, oxidants and NO2 present measurement
difficulties. There has been concern (19) about the reliability and accuracy
of the potassium iodide method of measuring total oxidant, which  was
widely used prior to 1975.  In particular, reports using such measurements
probably underestimate the actual oxidant levels. Similarly, serious error
has been found in the Jacobs-Hochheiser method for measuring NO2 [(13),
pp. 317-44], a fact that compromises the utility of reports based on this
procedure. Detailed discussions of current measurement methods for oxi-
dants and NO2 are contained in (15, 17) and (13, 16, 18), respectively.
  There is no question that accidental exposures to high (greater than 150
to 200 ppm) levels of NO2 can be fatal. Nevertheless, epidemiological
studies have not shown a relationship between excess mortality and in-
creased ambient NO2 or oxidant levels, after adjusting for temperature.
Moreover, little is known  about the effects  of chronic exposure to these
pollutants. The difficulties in  trying to provide such information are de-
scribed in the next section.
THE  PRESENT: WHAT IS THE EVIDENCE?

Detailed surveys of the health effects of selected air pollutants have been
presented  in US air quality criteria documents (2, 3, 14-16), which are
periodically revised by the US Environmental Protection Agency. Compre-
hensive reports on the effects of certain air pollutants have also been pub-
lished under the auspices of the National Academy of Sciences (5, 6,13, 17,
18) and by the World Health Organization  (7). In addition a number of
investigators (8-12, 20-27) have published condensed and informative re-
views of the field.
  I make  no attempt  to duplicate the detail and scope of these surveys.
Instead, I  examine the findings of a small and arbitrary sample of studies
concerning air pollution effects on respiratory health. Particular attention
is paid to the quality of the data, to the methods of analysis, and to the
implications of these factors for a study's utility in establishing standards.
I argue that although some investigations provide strong evidence for ad-
verse effects, many have serious weaknesses,  and very few have any utility
for estimating exposure-response relationships. Indeed, most of these stud-
ies were never intended to provide the detailed quantitative information that
is needed to establish a reliable basis for setting standards.

-------

 Respiratory Mortality

 SHORT-TERM EFFECTS  The strongest evidence against ambient air pol-
 lution as a threat to health is provided by a series of acute episodes in the
 Meuse Valley in 1930 (28), in Donora in 1948 (29), in London in 1952 and
 1962 (30, 31), and in New York in 1953 and 1962 (32, 33). These episodes,
 each involving several days of extremely high levels of sulfur oxide/particu-
 late pollution, were marked by substantial increases in daily mortality.
 Although healthy persons were affected, the greatest toll occurred among
 those with preexisting respiratory and cardiac disease. No single pollutant
 could be incriminated.
   Excess deaths  were  estimated as the difference between observed and
 expected as computed either from mortality during the same period in other
 years, from mortality  during the previous or subsequent week, or from
 moving averages. The greatest estimated excess of approximately 4000
 deaths occurred during the London fog of 1952. At this time the maximum
 24 hour averages for SO2 and particulates were estimated at 3830 μg/m³
 and 4460 μg/m³ (BS) (12). The subsequent London episode in December
 1962 was characterized by greatly reduced air pollution and mortality. Part
 of the decreased mortality could be due to greater protective measures taken
 by the sick and elderly.
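The excess-death arithmetic described above can be sketched directly. The version below uses a fixed pre-episode window as the source of expected deaths (the moving-average variant mentioned in the text works similarly); all counts are invented and are not the London or Donora data:

```python
# A minimal sketch of the excess-death estimate described above: expected
# daily deaths are taken from a window just before the episode, and excess
# is observed minus expected, summed over episode days. All counts are
# invented; these are not the London or Donora data.

def excess_deaths(daily_deaths, episode_start, episode_end, window=14):
    """Sum of (observed - expected) over the episode days."""
    baseline = daily_deaths[episode_start - window:episode_start]
    expected = sum(baseline) / len(baseline)   # pre-episode mean
    return sum(daily_deaths[d] - expected
               for d in range(episode_start, episode_end + 1))

# 14 quiet days near 300 deaths/day, then a 4-day episode:
series = [300] * 14 + [450, 900, 800, 500]
print(excess_deaths(series, 14, 17))  # 1450.0
```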
   Approximately 20 deaths  in a population of 13,000 were attributed to
 sulfur oxide/particulate pollution in Donora, Pennsylvania, with an  esti-
 mated 6000 people suffering less serious effects. An investigation of a sample
 of this community nine years later showed that those affected by the episode
 had higher mortality rates and greater prevalence of respiratory disease in
 the interim than  did those who were unaffected  (34). This suggests either
 that acute exposures initiate enduring harmful effects, or that people  who
 respond to acute exposures are already weakened and are thus more vulner-
 able to subsequent disease.
   The Clean Air  Acts of England in 1956 and of the US from 1963 to 1970
 have been  followed  by  the virtual elimination of dramatic episodes of high
 sulfur oxide/particulate pollution associated with increased mortality. This
 is indirect evidence in  favor of a causal role for  these pollutants.
   The excess mortality observed in past episodes has prompted a number
 of studies on possible associations between temporal fluctuations in mortal-
 ity and in air pollution levels. There have been several recent
 and comprehensive reviews (8, 9, 21-24) of these investigations. Here
 they are discussed only briefly.
   The sensitivity of such studies depends on a number of factors, including
 the size of the study population and the variability in pollutant levels to
 which the population is exposed. Thus, most of the investigations have

-------

 taken place in large industrialized cities such as London (35-38), New York
 (39-44), and Tokyo (45), although there have been others in smaller cities
 (e.g.  14, pp. 263-309) and (46, 47). As a group, these studies persistently
 show a pattern of increased mortality on days or weeks of increased sulfur
 oxide/particulate pollution; however, there are several  reasons why the
 results  must be interpreted cautiously.
   1.  Influenza  epidemics and temperature extremes can have  a much
 greater effect on mortality than  do moderate variations in air pollution.
 Other factors such as season and holiday weekends also influence mortality.
 If some of these factors are correlated with pollutant levels, then failure to
 control them adequately can either mask a real pollutant association or
 produce a spurious one. It is unlikely that air pollution is consistently
 correlated with these factors in all of the circumstances studied. Thus, the
 consistency of the findings relating air pollution peaks to mortality provides
 some evidence for causation. Nevertheless, the data do not support detailed
 quantitative interpretation. Regression  coefficients and other measures of
 association are sensitive to the level of adjustment of confounding factors.
 Consequently, unless a study controls the above factors, it has little utility
 for estimating quantitative exposure-response relationships.
   2.  The accuracy of the measured pollutant levels as estimates of exposure
 is questionable. The pollution monitoring  stations used in these studies
 typically were located on a very coarse  grid over the area inhabited by the
 study population. The concentrations of SO2 and particulates are dependent
 upon wind speed and direction and can vary appreciably over small areas.
 Further, those who by virtue of age or preexisting disease are at risk of death
 are apt to spend most of their time indoors. Thus density of cigarette smoke,
 type  of heating and cooking fuel, and other determinants of indoor air
 quality may have more relevance to mortality than do moderate variations
 in outdoor levels as measured several kilometers away.
   3.  There is undoubtedly a variable lag time between a stressful event,
 such  as an extreme of heat  or air pollution, and the resulting deaths. This
 problem has been addressed by some investigators (42,  45); but without
 prior knowledge of the distribution of lag times, a satisfactory solution is
 elusive. The inability to deal adequately with this variability makes a study
 less likely to detect a statistical  association, if it exists,  between given
 pollution levels and mortality.
   4.  Findings at one  time  and place are not easily generalized to other
 circumstances. The geographical and temporal variability of sulfur oxide/
 particulate pollution in such characteristics as chemical  composition and
 particle size distribution has already been stressed.  Another barrier to
 generalization is the variability in characteristics such as size, susceptibility,
 exposure, etc of the sick and elderly subpopulation at risk.
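The confounding concern in point 1 can be made concrete with a small simulation: synthetic data in which mortality depends only on temperature, yet a regression that omits temperature attributes a spuriously positive coefficient to pollution. This is a pure-Python least-squares sketch with invented data and names, not a reanalysis of any study above:

```python
# A small simulation of the confounding problem in point 1. Mortality here
# depends only on temperature, yet when temperature is omitted the pollution
# coefficient is spuriously positive. Pure-Python least squares; all data
# and names are invented.

def ols(X, y):
    """Least-squares coefficients via normal equations and elimination."""
    k = len(X[0])
    A = [[sum(row[r] * row[c] for row in X) for c in range(k)] for r in range(k)]
    b = [sum(row[r] * yi for row, yi in zip(X, y)) for r in range(k)]
    for col in range(k):                       # elimination with pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            A[r] = [a - f * p for a, p in zip(A[r], A[col])]
            b[r] -= f * b[col]
    coef = [0.0] * k
    for r in range(k - 1, -1, -1):
        coef[r] = (b[r] - sum(A[r][c] * coef[c] for c in range(r + 1, k))) / A[r][r]
    return coef

temp = list(range(8))                                      # temperature index
pollution = [t + (1 if t % 2 == 0 else -1) for t in temp]  # tracks temperature
deaths = [5 + 2 * t for t in temp]                         # temperature-driven

crude = ols([[1.0, p] for p in pollution], deaths)[1]
adjusted = ols([[1.0, p, t] for p, t in zip(pollution, temp)], deaths)[1]
print(round(crude, 3), round(adjusted, 3))  # crude ~ 1.81, adjusted ~ 0
```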

-------

   The above difficulties make it impossible to determine from existing data
 general minimal levels of sulfur oxides and particulates at which excess
 deaths can be expected to occur. However, some order of magnitude esti-
 mates are necessary for control purposes.  Based on data from New York
 and London (e.g. 36-38) it has been estimated  [(7) Vol.  8, p. 90; (12), p.
 656] that  deaths are likely to increase when 24 hr averages of SO2 and
 particulates exceed 500-700 μg/m³ and 500 to 750 μg/m³ (BS), respec-
 tively. The uncertainty of these estimates should be emphasized whenever
 they are quoted.
   Perhaps the greatest limitation to further study of concomitant daily
 fluctuations in mortality and air pollution is the sizable decrease in pollution
 levels in the US and Western Europe following Clean Air legislation. This
 decrease means that a relatively weak  signal emitted by current milder
 pollutant fluctuations is likely to be swamped by random and nonrandom
 variations in daily deaths.

 LONG-TERM EFFECTS  In evaluating the effects of air pollution, an im-
 portant question is whether chronic exposure to elevated pollutant levels
 increases the chance of respiratory disease. Many workers have investigated
 this question by examining geographic differences in mortality rates due to
 bronchitis, emphysema, lung cancer, and other chronic respiratory diseases
 (cf, for example, 48-56). I do not discuss these studies individually. Instead,
 I summarize their general findings and point out some difficulties in inter-
 preting these findings. The studies differ in the sophistication and care  with
 which the statistical analysis is conducted (e.g., some control for smoking
 and occupational mix); however, all are subject to the inherent difficulties
 of not being able to control for all factors hypothesized to affect mortality
 across different areas. The reader interested in more details than are pre-
 sented here will find them in (7-9, 12-14, 21-24).
  The studies cited above have typically compared the mortality experience
 of residents in heavily polluted areas with that of residents in less polluted
 areas. They have shown a fairly consistent pattern  of association between
 residence in areas of heavy air pollution and respiratory mortality. How-
 ever, the assignment of an etiologic role for air pollution on the basis of these
 data is fraught with difficulties.  Some of these difficulties are succinctly
 described in the following excerpt (8):

  Differences in mortality rates among geographic units cannot be attributed to differences
  in degrees of air pollution unless data on other determinants of geographic differences
  in mortality are taken into account. In many studies, information on these contributors
  to mortality is lacking or difficult to obtain. Thus, important factors such as smoking

-------

  habits, socioeconomic status, length of residence in the geographic unit, occupation,
  pre-existing disease, ethnic background, and diagnostic accuracy in filling out death
  certificates may significantly influence the reported mortality rate in each area.

  In general, the designs of the studies cited above permitted little or no
control for these important determinants of respiratory-related mortality.
Hence, it is quite plausible that consistent confounding due to these factors
may explain the consistency of the associations found. For example, people
with lower socioeconomic status (SES) tend to smoke more cigarettes (56a)
and live in more polluted  areas than do those with higher status. Thus,
cigarette smoking and SES could follow an air pollution gradient through-
out most of the geographical areas studied. Further, because heavily indus-
trialized areas are often heavily polluted as well, occupation among males
may well be consistently correlated with residence in a dirty area.
  Because air pollution measurements were not available during the life-
times of those  who  died, these studies cannot provide exposure-response
information. They have served a useful purpose in demonstrating geograph-
ical differences  in mortality, but their limitations, together with those of the
short-term mortality studies, imply that we must look beyond mortality to
more sensitive measures, such as morbidity and decreased pulmonary func-
tion, for evidence on which to base sound air pollution control strategies.

Respiratory Morbidity
When the end point of interest is respiratory damage among the living, it
is possible to study the effects of air quality on an individual and not on an
aggregate or population level. This  feature of morbidity studies permits
more accurate estimates of exposure and better control of intervening vari-
ables than is usually possible when examining mortality.

SHORT-TERM EFFECTS  Studies comparing temporal changes in respira-
tory status with concomitant pollution fluctuations are more sensitive
indicators than are the corresponding mortality studies. Nevertheless, the
four problems associated with studies of acute mortality effects—confound-
ing by weather and season, inaccuracy of exposure estimates, unknown lags,
and specialized pollution or population at risk—must also be addressed in
acute morbidity studies. Reviews and critiques of selected short-term mor-
bidity studies are presented in Tables 2 to 4. Those in Table 2 compare daily
or weekly symptom  reports with pollution variation.  In addition to the
above problems these studies are vulnerable to the vagaries of self-assessed
symptom reporting and, unless the subjects are blind to the study's goals,
to the possibility of reporting bias.
  Table 3 describes  studies of various measures of pulmonary function, as
related to temporal changes in pollution. There are relatively few of these

-------

studies because repeated pulmonary function tests are costly to perform and
are subject to large errors due to variations in the performances of subject,
machine, and technician.
  The studies in Table 4 examine hospital or clinic visits vs  temporal
changes in air quality. The population served by the  health facility is as-
sumed to remain stable in size over the study period. Since influenza epi-
demics, season, and temperature are apt to be strongly  correlated both with
visits and pollution, these factors must be controlled if reliable information
is to be gained.
  All short-term morbidity studies share an important  advantage: the study
population can serve as its own control. The studies in Tables 2 and 3 have
the additional asset that each individual can serve as his own control, with
his own days or  weeks of morbidity compared only to his own periods in
good health. This approach, which avoids the obfuscation of interpersonal
variability, has received too little  attention. It will become  increasingly
useful as personalized exposure estimates become feasible.
LONG-TERM EFFECTS  Some of the many studies attempting to deter-
mine the respiratory effects of chronic pollutant exposure are summarized
in Table 5. Chronic respiratory disease is usually assessed by a questionnaire
developed by the British Medical Research Council, administered either by
mail or by personal interview. Its use facilitates reproducibility and compa-
rability of results.
  Although some investigators have tried to control for smoking and occu-
pation, the cross-sectional studies of geographical differences in morbidity
shown in Table 5 suffer from many of the severe limitations affecting the
chronic mortality studies. There is little data on past  exposures,  and little
or no control for selective migration in and out of polluted areas, or for
important intervening variables. The cohort studies of temporal changes in
respiratory morbidity within a single population, although expensive and
time-consuming, provide  the best  tool for elucidating chronic effects.
  The studies in Tables 2 to 5 were rated for their  utility in estimating
exposure-response relationships according to the following criteria:

1. Little or no utility describes those studies dealing only with qualitative
   or extremely  crude quantitative estimates of exposure, or which suffer
   from severe limitations in design or analysis.
2. Moderate utility describes those whose estimates of exposure vs effect are
   limited by difficulties in assessing either exposure or effect, or in control-
   ling for important intervening variables.
3. Substantial utility describes those with reliable exposure and effect esti-
   mates that are marred neither by major methodological flaws nor by
   inadequate control of important variables.

-------
Table 2 Selected studies of short-term effects: respiratory symptom diaries

Schoettlin & Landau (57).a
  Description: Daily asthma attacks among 137 asthmatics in Los Angeles
  area, Sept. to Dec. 1956.
  Pollutants measured: Oxidants; particulates (HV); carbon monoxide. No
  information on methods, location, or timing of pollution monitoring.
  Analysis and findings: Simple correlation coefficients calculated between
  daily attack rate and pollutants, temperature and relative humidity.
  Correlation coefficients were not significant for particulates and CO.
  Multiple correlation coefficient relating attacks to oxidant, temperature,
  and humidity was low. When oxidant exceeded .25 ppm, the mean number of
  subjects with attacks was significantly greater than when it did not,
  indicating that high levels may be hazardous.
  Comments: No information on missing attack data, medication use, pollen
  counts. No adjustment for temperature, humidity, or serial correlation of
  attacks. No stratification on age or race.c,d

Zeidberg et al (58).a
  Description: Daily asthma attacks among 49 adult and 34 juvenile
  asthmatics in Nashville, TN, July 14, 1958 to July 12, 1959.
  Pollutants measured: SO2, particulates (HV, COH) measured at several
  monitors on a fine grid over study area.
  Analysis and findings: Cross-sectional analysis. Crude annual attack
  rates calculated for each of three categories of asthmatics, classified
  according to annual geometric mean sulfation rate at monitoring station
  nearest to home. Attack rates follow a pollution gradient for adults but
  not for children; however, no statistical tests for trend were reported.
  Comparison of attack rates when SO2 levels were above and below their
  median yielded no significant differences. Pollen levels and wind
  velocity were uncorrelated with attacks. Relationship between
  particulates and attacks was not reported.
  Comments: No information on missing attack data, medication use, pollen
  counts. No adjustment for serial correlation of attacks.

Cohen et al (59).a
  Description: Daily asthma attacks among 20 New Cumberland, WV asthmatics
  residing within 1/2 mile of a coal-fueled power plant during December
  1969 to June 1970.
  Pollutants measured: Particulates (HV, COH); SO2, sulfates, nitrates as
  measured at 3 monitors. Distance of asthmatics from monitors unspecified.
  Analysis and findings: Multiple linear regression of daily attack rate
  against pollution and weather variables. Wind speed, humidity, and
  barometric pressure were each uncorrelated with attack rate. After
  adjustment for temperature, each pollutant was positively associated with
  attack rate. After adjustment for temperature and any one pollutant, none
  of the others explained a significant amount of attack rate variation.
  Comments: Results not easily generalized, as pollutant mix may be unique
  to area. Some reported attacks were confirmed by a physician.c

Whittemore & Horn (60).b
  Description: Daily asthma attacks among 443 asthmatics in the Los Angeles
  area during 34-week periods in the years 1972 to 1975.
  Pollutants measured: Oxidants, particulates (HV) monitored within
  approximately 3 mi of subjects' homes.
  Analysis and findings: A separate multiple logistic regression was used
  for each subject's attack data. Independent variables were oxidant level,
  particulate level, temperature, humidity, wind speed, day of week,
  presence or absence of attack on preceding day, and day of study. Average
  regression coefficients for both pollutants were significantly positive.
  There were no significant differences in response to pollution among
  subjects when classified according to age, sex, or self-assessed asthma
  type.
  Comments: Analysis did not use data on medication use. No data on pollen
  counts. Potential bias of missing attack data is minimized by individual
  regressions.c,d

Hammer et al (61).b
  Description: Daily symptom reporting by student nurses working and
  residing at two Los Angeles hospitals, Oct. 1961 to June 1964.
  Pollutants measured: Daily maximum of hourly measurements for oxidant,
  CO, NO2 as monitored at a single station within 2 mi of hospitals.
  Analysis and findings: Respiratory symptoms were cough and chest
  discomfort. The symptom rate for a day was defined as the number of
  students reporting the symptom divided by the number completing a diary
  on that day. For each symptom the authors assumed a threshold level below
  which the symptom rate is independent of oxidant. Estimates and
  confidence intervals for these levels were obtained. In addition, the
  average daily symptom rate was computed for 8 categories of oxidant
  level. Although not verified by statistical tests, the rates for both
  cough and chest discomfort were increased at levels greater than 0.30
  ppm.
  Comments: No adjustment for temperature, humidity, CO or NO2. The authors
  report that there is no relationship between symptoms and either
  temperature, CO or NO2, but give no supporting evidence. No adjustment
  for missing data. No examination of validity of threshold assumption.c,d

Lawther et al (62).b
  Description: Daily changes in condition of 19 to 250 bronchitic patients
  in London and other British cities during the winters of 1955-1956,
  1957-1958, 1959-1960, 1964-1965.
  Pollutants measured: 24-hr average levels of SO2 and particulates (BS).
  Number of stations per city varied from 1 to 7 depending on year and
  location of study.
  Analysis and findings: Only descriptive, graphical methods used. Four
  possible categories of daily condition were given arbitrary numerical
  scores. Daily scores were averaged over those reporting. The mean daily
  scores, as well as the fraction of those reporting their condition to be
  worse than the day before, were plotted against time. Plots of daily
  pollutant levels were shown on same graphs as the illness indices.
  Illness and pollution peaks tended to coincide. The authors judged that
  500 μg/m³ of SO2 and 250 μg/m³ of particulates (BS) are minimal levels
  leading to increased illness response.
  Comments: No information on subjects' knowledge of pollution-related
  study goals. No adjustment for temperature, humidity, missing data or
  serial correlation in morbidity indices. Assessment of minimal pollutant
  levels at which effects were seen is somewhat subjective in the absence
  of formal analysis of random variation in data.c

  a Little or no utility for estimating exposure-response relationships.
  b Limited utility for estimating exposure-response relationships.
  c Daily morbidity status was self-assessed.
  d To reduce reporting bias, subjects were not informed of pollution-related study goals.
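The symptom-rate bookkeeping used by Hammer et al in Table 2 is easy to state precisely. A minimal sketch of the daily rate and the above/below-threshold comparison; all daily counts below are invented, not the study's data:

```python
# A minimal sketch of the symptom-rate bookkeeping used by Hammer et al
# (Table 2): daily rate = students reporting the symptom / diaries
# completed, with mean rates compared above and below an assumed oxidant
# threshold (0.30 ppm, following the text). All daily counts are invented.

def symptom_rates(reports, diaries):
    """Daily symptom rate = reports / diaries completed that day."""
    return [r / d for r, d in zip(reports, diaries)]

def mean_rate_by_threshold(rates, oxidant_ppm, threshold=0.30):
    """Mean daily rate on days above vs. at-or-below the threshold."""
    hi = [r for r, o in zip(rates, oxidant_ppm) if o > threshold]
    lo = [r for r, o in zip(rates, oxidant_ppm) if o <= threshold]
    return sum(hi) / len(hi), sum(lo) / len(lo)

reports = [4, 6, 12, 5, 14, 3]                  # students reporting cough
diaries = [40, 40, 40, 50, 40, 30]              # diaries completed
oxidant = [0.10, 0.15, 0.35, 0.20, 0.40, 0.05]  # daily 1-hr max, ppm

rates = symptom_rates(reports, diaries)
high, low = mean_rate_by_threshold(rates, oxidant)
print(high, low)  # higher mean rate on high-oxidant days
```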

-------
Table 3  Selected studies of short-term effects: pulmonary function

Motley et al (63).a  Pulmonary function tests in 66 subjects (of whom 46
had emphysema) before and after stays in filtered rooms for periods of
2 to 4, 18 to 20, or 40 to 90 hr. Study took place in Los Angeles during
3-1/2 yr period beginning 1956.
  Pollutants measured: Air quality during the 3 to 4 days prior to
filtered exposures was classified as clear or smoggy on the basis of
pollution values at nearby station. Smog consisted of oxidant, NO2, CO,
SO2.
  Analysis and findings: Each subject served as his own control. No
consistent differences were noted when prior air was clear, when
filtered exposures were less than 40 hr, or among normal subjects.
Consistent differences were noted during smog when emphysematic subjects
breathed filtered air for at least 40 hr. These differences were
displayed graphically but not subjected to statistical testing.
  Comments: Although possible differences in temperature and humidity
were not controlled, this study provides evidence of transient decreased
pulmonary function in emphysematic subjects due to smog. No specific
pollutants could be incriminated.

Wayne et al (64).a  Performance of long distance runners in 21 student
meets at a Los Angeles high school from 1959 to 1964.
  Pollutants measured: Oxidants, nitrogen oxides, CO, particulates (HV),
measured at station 2 mi NW of high school.
  Analysis and findings: Univariate linear regressions of fraction of
runners who failed to improve their previous performance vs hourly
concentration of each pollutant. Strong associations found for oxidant
and particulates.
  Comments: No control for temperature, humidity, or correlation of
oxidants and particulates.

Lawther et al (65, 66).b  Pulmonary function measurements on 4 normal
male nonsmokers in London on each working day during 5 yr period from
1960 to 1965.
  Pollutants measured: Particulates (BS), SO2 measured at testing lab.
  Analysis and findings: I. For each individual, mean values of FEV1,c
FVC,d and MMFe were computed for 6 ranges of particulates and 6 ranges
of SO2; chi-square tests were used. In 3 of the 4 subjects MMF was
consistently related to both pollutants. In 1 of these 3 subjects, FEV1
also followed a gradient for both pollutants. There were no consistent
relationships between FVC and pollution. II. Multiple linear regression
of PEFf values averaged over all 4 subjects vs log SO2, log
particulates, temperature, and relative humidity. Mean PEF values
increased with time and temperature and decreased with SO2 levels.
  Comments: I. Large amount of missing pulmonary function data.
Univariate analyses of FEV1, FVC, and MMF vs pollutants did not control
for weather, respiratory infections, or other pollutants. Pollutant
levels were arbitrarily grouped into categories. I and II. It is not
clear why individual multiple linear regressions were not used for each
subject and for each measure of lung function.

Lebowitz et al (67).a  Pulmonary function tests on 6 to 12 yr old
children before and after lunch-exercise periods indoors in Tucson (30
boys); outdoors in Tucson (60 children); outdoors in a copper smelting
town in Arizona (17 children). Pulmonary function tests on 2 groups (9
adolescents and 10 adults) before and after walking or running outdoors
in Tucson. Testing performed on unspecified dates, approximately 1972.
  Pollutants measured: Particulates (HV), sulfates, temperature, and
humidity were combined into a single additive index. Monitoring stations
were within .25 mi of testing sites for children, but more than 3 mi
from testing sites for adolescents and adults.
  Analysis and findings: Children: no significant differences between
pre-exercise and postexercise pulmonary function measurements for indoor
group. Decreased postexercise measurements among those children in
outdoor Tucson group who exercised on 2 high pollution days, but not for
the remaining children who exercised on 2 low pollution days. Children
in copper smelting town showed a significantly greater postexercise
decrease in FEV/FVC and MMF on high pollution days than on low pollution
days. Adolescents: mean FEV1 and FVC were significantly decreased after
their 20 mi walk. Adults: no significant changes after exercise.
  Comments: Age, height, and weight were controlled by converting lung
function measurements to percentage of predicted as calculated by
specified formulae. No explanation of statistical methods. Effects of
temperature, humidity, and pollutants cannot be separated. Effects noted
among the outdoor Tucson group are confounded by the differences in
children exercising on high and low pollution days.

Stebbings et al (68).b  Daily pulmonary function measurements of 270
children in fourth to sixth grades of 6 schools during and after period
of high pollution (November 21-26, 1975) in Pittsburgh. Two schools
(approx. 80 children) were in control areas. The goal was to see if the
measurements of children in polluted areas improved over the study
period while those in the control areas remained the same.
  Pollutants measured: Particulates (COH), SO2 measured at stations
within 3 mi of schools.
  Analysis and findings: Mean FEV.75 and FVC were computed for each day
and each school, adjusting for the children untested on that day. There
were no temporal increases in adjusted mean measurements for children in
polluted areas.
  Comments: Method of adjustment for children not tested on a given day
is not clear.

  a Little or no utility for estimating exposure-response relationships.
  b Limited utility for estimating exposure-response relationships.
  c FEVx = forced expiratory volume in x seconds.
  d Forced vital capacity.
  e Maximum midexpiratory flow.
  f Peak expiratory flow.

-------
Table 4  Selected studies of short-term effects: hospital/clinic visits

Durham (69).a  Clinic visits for respiratory illness among students at 7
California universities during 1970 to 1971 academic year.
  Pollutants measured: Oxidant, NO2, SO2, NO, CO, particulates,
hydrocarbons monitored within approximately 5 mi of school.
  Analysis and findings: Simple correlation coefficients were computed
between specific upper respiratory symptoms and daily pollutant levels,
as measured 0 to 7 d before symptom was first noticed. Factor analysis
was used to control for weather. Visits for pharyngitis, bronchitis,
tonsillitis, and colds were associated with oxidant, SO2, and NO2 for
schools near Los Angeles, but not for schools near San Francisco.
Analysis of subpopulations indicated that male smokers were particularly
affected by pollution.
  Comments: Thorough study of associations between upper respiratory
symptoms and pollution. It is not clear how well seasonal and
day-of-week variations were controlled. As no pollutant levels were
specified, exposure-response relationships cannot be inferred.

Fishelson & Graves (70).b  Emergency room visits for respiratory disease
in Chicago's Cook County Hospital on 81 Tuesdays from mid-Sept. 1971 to
end of March 1973.
  Pollutants measured: SO2, particulates (COH). Measurements averaged
over 5 stations surrounding hospital.
  Analysis and findings: Multiple linear regressions of number of
respiratory admissions among males and females vs daily pollutant levels
(with lags of 0 and 1 d), temperature, humidity, and an indicator for
holidays. A small positive association was found for SO2 but not for
particulates.
  Comments: No control for influenza epidemics or season. A "lagged
dependent variable" was included in some regressions as a surrogate for
unspecified intervening factors; its interpretation is not clear.

Levy et al (71).a  Hospital admissions for respiratory disease in
Hamilton, Ontario, July 1, 1970 to June 30, 1971.
  Pollutants measured: Sulfur oxides, particulates (COH), oxidants, CO,
nitrogen oxides, hydrocarbons, monitored at one station in heavily
polluted part of city.
  Analysis and findings: Univariate linear regressions between total
weekly admissions at each of 4 hospitals and average monthly pollutant
levels. Associations found with SO2 and with particulates. Strength of
association varied inversely with distance of hospital from monitoring
station. No associations found with remaining pollutants, pollen count,
or humidity. Authors state that associations of admissions with SO2 and
with particulates persisted when temperature was included in
regressions, but do not present this analysis.
  Comments: This paper is missing important details on independent
variables used in the regressions. No adjustment for variations due to
season or holiday weekends. Authors report no influenza epidemics during
study. Aggregating data into weekly and monthly averages is insensitive.

Goldstein & Block (72).a  Emergency room visits for asthma in Harlem and
Brooklyn, New York, Sept. to Dec. 1970 and Sept. to Dec. 1971.
  Pollutants measured: SO2, smokeshade (COH) as monitored within
approximately 3 mi of hospitals.
  Analysis and findings: Multiple linear regression of daily asthma
visits vs daily average SO2 level and daily average temperature.
Positive association with SO2 noted for Brooklyn but not for Harlem.
  Comments: The authors believe that lack of adjustment for season,
weekdays, holidays is not likely to explain anomalous results. They
conjecture that SO2 may be a surrogate for an asthma-inducing agent
present in Brooklyn but not Harlem.

Chiaramonte et al (73).a  Daily emergency room visits for respiratory
disease by children to a Brooklyn hospital for 3 wk in Nov. 1966,
including 1 wk of very high pollution.
  Pollutants measured: SO2, as monitored approximately 5 mi from
hospital. 5 d of missing data.
  Analysis and findings: Week of high pollution characterized by marked
increase in visits over those observed during 2 control weeks.
Statistical methods not explained.
  Comments: No information on weather variables. Particulates and CO,
although high during episode, were not measured.

Ipsen et al (74).a  I. Weekly clinic visits among employees of two
companies in Philadelphia for 165 wk, Oct. 1960 to Sept. 1963. II. Daily
employee absenteeism in one company in Philadelphia for 850 days, Sept.
1961 to Dec. 1963.
  Pollutants measured: Particulates, smokeshade, sulfates, SO2, NO. No
information on number and location of stations.
  Analysis and findings: I. Multiple linear regression. Dependent
variable: total weekly visits divided by average size of quarterly
working force. Independent variables: average weekly levels of
pollutants (excluding SO2 and NO), temperature, humidity, wind speed and
direction, precipitation. No consistent associations with pollution
after adjustment for meteorology. II. Univariate linear regressions of
daily incidence (number of first-day absences) and of daily prevalence
(number of absences) vs particulates, temperature, relative humidity,
and wind speed. No consistent pollutant associations noted.
  Comments: I. High intercorrelation of pollutants could account for
lack of association noted in multiple regressions, but no information on
these intercorrelations given. Aggregation of data by weeks is
insensitive. II. Extensive use of descriptive statistics. Not all
statistical approaches are clearly explained. No apparent adjustment for
intercorrelation of pollution and weather variables.

  a Little or no utility for estimating exposure-response relationships.
  b Limited utility for estimating exposure-response relationships.

-------
414     WHITTEMORE

Few of the studies reviewed have moderate utility for estimating dose-
response, and none of them has substantial utility for this purpose. This is
not a sweeping denial of their value, because most were designed to produce
only qualitative information. They have yielded important insights; how-
ever, now more is needed.

THE  FUTURE: WHAT MUST BE DONE?

Are the US air quality standards adequate for protecting the public health?
Are they too stringent? Resources for the control of air pollution are not
infinite. At some point, a greater improvement in health and welfare can
be expected from a more viable economy than from a reduction in pollution.
Clearly, the evidence reviewed in Tables  2 through 5 does not provide an
adequate basis for determining that point. It is essential that we improve
the technical  data base used to set standards. Improvements are needed in
the assessment of exposure, of disease, and of the relationship between the
two.
Pollution Exposure  Assessment
Uncertainty about actual exposures to the respiratory tract is a great weak-
ness of existing  studies. This uncertainty is typically disregarded when
results are reported. For example, when disease is regressed on pollution
measurements, the analysis often acknowledges only the randomness of the
disease variable, and not the large uncertainties in the exposure estimates.
However, nonsystematic error in an exposure variable can bias the corre-
sponding estimated regression coefficient toward zero (98, p. 376). This bias,
whose magnitude depends upon the size of the error and on other variables
in the  regression, increases the likelihood of false negative studies.
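This attenuation can be illustrated with a short simulation. The code below is not from the review; the coefficient, variances, and sample size are all hypothetical. It shows that regressing a response on an error-laden exposure shrinks the expected slope by the factor var(X) / (var(X) + var(error)).

```python
# Sketch (illustrative values): nonsystematic measurement error in an
# exposure variable biases its estimated regression slope toward zero.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
true_exposure = rng.normal(50.0, 10.0, n)            # "true" pollutant dose
disease_index = 2.0 * true_exposure + rng.normal(0.0, 5.0, n)

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

# Regressing on the true exposure recovers the coefficient (about 2.0).
b_true = slope(true_exposure, disease_index)

# Adding random error to the exposure attenuates the slope by
# var(X) / (var(X) + var(error)) = 100 / (100 + 100) = 0.5 here.
measured = true_exposure + rng.normal(0.0, 10.0, n)  # error sd = exposure sd
b_measured = slope(measured, disease_index)

print(round(b_true, 2))      # near 2.0
print(round(b_measured, 2))  # near 1.0: attenuated toward zero
```

The attenuated slope understates the exposure effect, which is exactly why such error "increases the likelihood of false negative studies."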
  While exposure uncertainties can never be eliminated entirely, there are
several ways in which they can be reduced in studies of short-term effects
on  morbidity. Small, portable personal  monitors  may one day become
practicable on the scale needed  for epidemiological studies. Until then,
study subjects' time could be classified into various spatio-temporal "bins,"
e.g. resting indoors in the evening, exercising outdoors, driving a car during
rush-hour, etc. Concentrations of pollutants could be monitored over time
for each of the bins, and an individual's exposure on any given day could
be estimated as an average of such concentrations, weighted by the fraction
of the day spent in each bin. Thus, proper weight could be assigned  to
pollutant levels indoors where most people, particularly the more vulnera-
ble sick and elderly, spend the majority of their time.
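A minimal sketch of the binning scheme just described follows; the bin names, concentrations, and time fractions are hypothetical, not taken from any study.

```python
# Hypothetical spatio-temporal bins: monitored concentration of one
# pollutant in each bin (e.g. ug/m3 on a given day).
bin_concentration = {
    "resting indoors, evening": 20.0,
    "exercising outdoors":      90.0,
    "driving in rush hour":     60.0,
}

# Fraction of the day one subject spent in each bin (must sum to 1).
time_fraction = {
    "resting indoors, evening": 0.70,
    "exercising outdoors":      0.05,
    "driving in rush hour":     0.25,
}

def estimated_exposure(conc, frac):
    """Daily exposure estimate: bin concentrations weighted by the
    fraction of the day spent in each bin."""
    assert abs(sum(frac.values()) - 1.0) < 1e-9
    return sum(conc[b] * frac[b] for b in frac)

print(estimated_exposure(bin_concentration, time_fraction))
# 0.70*20 + 0.05*90 + 0.25*60 = 33.5
```

Because most of this subject's day is spent indoors, the indoor bin dominates the estimate, which is the point of the scheme: indoor levels get their proper weight.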
  Methods for differentiating particles chemically and for measuring those
of a given size distribution need to be standardized and incorporated into all

-------
Table 5  Selected studies of long-term effects

Ferris et al (75, 76).b,c  Survey of respiratory disease and pulmonary
function among random sample of 1167 residents of Berlin, NH, aged 25 to
74 in 1961, followed by repeat surveys of same sample in 1967 and 1973.
Disease symptoms obtained by interviews using BMRCe questionnaire.
  Pollutants measured: Particulates (HV), SO2, averaged over a 9-month
period in 1961, a 22-month period from 1966 to 1967, and an unspecified
period in 1973. The 3 mean particulate levels decreased with time, while
the mean SO2 levels increased.
  Analysis and findings: Slightly lower prevalence of chronic
respiratory disease and improved pulmonary function in 1967, when
compared with 1961, after standardization for age, sex, and smoking
habits. No statistical assessment of these changes. No differences in
respiratory symptoms, chronic respiratory disease prevalence, or
pulmonary function between 1967 and 1973.
  Comments: No adjustment for possible seasonal and weather differences
among the three surveys. Problems of selective loss to follow-up could
not be completely addressed. Decreases in particulates could reflect
reduction in particles of nonrespirable size. Generalization limited by
possible uniqueness of pollutant mix in a pulp mill community.

Winkelstein & Kantor (77).a,d  Prevalence of cough with phlegm as
reported by 842 white women in Buffalo, NY, 1961 to 1963, using mailed
questionnaire.
  Pollutants measured: SO2, particulates (HV) based on average levels
over period of 1961 to 1963. Distance of stations from subjects' homes
not stated.
  Analysis and findings: Subjects classified according to 4 categories
of SO2 and particulates. No consistent or statistically significant
association between cough prevalence and either pollutant.
  Comments: Arbitrary and inefficient grouping of pollution averages.
Smoking stratified only as smoked/never smoked. Incomplete control for
social status, migration.

Lambert & Reid (78).a,d  Respiratory symptoms among 9975 British
residents aged 35 to 64 during 1965, obtained by mailed BMRC
questionnaire.
  Pollutants measured: Index based on domestic coal consumption obtained
in 1952 for 88% of areas studied; particulates (BS) and SO2 obtained in
1965 for at most 30% of areas studied.
  Analysis and findings: Age and smoking adjusted prevalence rates of
persistent cough with phlegm follow a pollution gradient for each
pollution index. Gradient more pronounced for smokers than for
nonsmokers. No statistical analysis of trends.
  Comments: No adjustment for occupation, social class, or change of
residence.

Colley & Reid (79).a,d  Respiratory symptoms and pulmonary function
among 10,000 British and Welsh children from Sept. to Nov. 1966.
Symptoms obtained by mailed questionnaire.
  Pollutants measured: Crude estimates of winter mean SO2 levels during
1966, 1967.
  Analysis and findings: Chronic cough prevalence follows pollution
gradient only for children in working class families. No statistical
assessment of trends.
  Comments: Pollution measurements inadequate. No adjustment for
parental smoking or for cooking and heating fuel type.

Hrubec et al (80).a,d  Survey of respiratory symptoms among 4377 pairs
of US male twins and an additional 2364 unpaired twins aged 41 to 51,
obtained by BMRC questionnaire administered by mail in 1968.
  Pollutants measured: Emission estimates for 188 US metropolitan areas
during period from 1965 to 1967 were used in diffusion model to obtain
single measures of particulate, SO2, and CO levels. These measures were
combined into a single index for each ZIP code location. Pollution
exposure scores were computed for each subject based on residence and
employment history. An urbanization score was also assigned to each
subject.
  Analysis and findings: Two analyses were performed: 1. All subjects
were treated as individuals and only one twin of each pair was included.
Respiratory symptoms were associated with urbanization scores but not
with pollution scores. 2. Only twin pairs concordant for smoking and
discordant for pollution or urban scores were included. Respiratory
symptoms were associated with neither urbanization nor pollution
exposure scores.
  Comments: Occupation and indoor exposures were not controlled in the
analysis. Negative results may be due to low prevalence of chronic
respiratory symptoms among men aged 41 to 51, or to inefficient
statistical methods.

Cohen et al (86).b,d  Respiratory symptoms and lung function tests among
441 nonsmoking Seventh Day Adventists in Los Angeles and San Diego,
1970. Respiratory symptoms obtained from BMRC questionnaire administered
by a single interviewer.
  Pollutants measured: Particulates (HV), NO2, oxidant, SO2, nitrates,
and sulfates. Distance of stations from subjects' homes not stated.
Average annual pollutant levels for both cities were low and comparable,
but daily oxidant peaks were greater in Los Angeles.
  Analysis and findings: Analysis of variance of spirometric and flow
volume parameters vs city of residence, age, height, and social status
was performed for each sex. No significant differences between cities
for any of the pulmonary function parameters tested. No significant
intercity differences in symptom prevalence after control for age and
sex.
  Comments: Recent migrants were excluded from study. Low and similar
pollutant levels throughout the subjects' lifetimes could explain
negative results.

Aubry et al (87).b,d  Survey of respiratory symptoms and lung function
among 237 men and 244 women aged 45 to 64 in 3 communities in Montreal
area during 1972 to 1973. Respiratory symptoms obtained by BMRC
questionnaire. No details about number of interviewers or whether they
were rotated among communities.
  Pollutants measured: Particulates (HV, COH), dust fall, SO2, SO3,
averaged over period 1970 to 1974. High SO2 in community 1; high
particulates in community 2; low particulates and SO2 in community 3.
  Analysis and findings: Discriminant analyses performed for community 1
vs community 2, and for communities 1 and 2 vs community 3. Independent
variables were age, smoking habits, coded respiratory symptoms, and
pulmonary function. No significant intercommunity differences in male
respiratory symptoms or lung function. Prevalence of cough and phlegm
among women was significantly higher in communities 1 and 2 than in
community 3.
  Comments: No adjustment for social class, occupation, change of
residence, or indoor fuel type. Sensitivity of conclusions to departures
from assumptions of discriminant analysis is not addressed. Small sample
sizes might explain negative results for males.

Shy et al (81, 82, 83).a,d
I. Weekly pulmonary function measurements for 987 second grade children
in 4 areas of Chattanooga, TN during Nov. 1968 and March 1969.
II. Biweekly diaries of respiratory illness among families of above
children from Nov. 1968 through April 1969.
III. Survey of bronchitis, croup, and pneumonia among 1311 infants and
1906 first and second grade children in areas 1, 3, and 4 from July 1966
to June 1969, obtained by questionnaire mailed or brought home from
school.
  Pollutants measured: NO2, particulates (HV, COH), sulfates, nitrates.
High NO2 in area 1; high particulates in area 2; moderate levels of NO2
in area 3; low levels of all pollutants in area 4.
  Analysis and findings: I. Height and sex-adjusted FEV.75f was
significantly lower in high NO2 area 1 than in areas 3 and 4. However,
there was no NO2 gradient for individual schools within the high NO2
area. II. Illness rates significantly higher in high NO2 area than in
areas 3 and 4 for the children, their siblings, and parents. These
differences could not be explained by differences in family size,
economic level, or prevalence of chronic respiratory conditions. No
correlation between illness rates among children and parental smoking.
III. Ridit analysis was performed on the proportions of children
reporting one or more illness episodes. Bronchitis rates were
significantly elevated for school children exposed to high NO2 for 2 or
3 yr, but not for those exposed for 1 yr or for infants. Bronchitis
rates followed an NO2 gradient only for children exposed for 3 yr. No
significant area differences in rates for croup or pneumonia.
  Comments: I. The authors found no differences in parental smoking and
social class between areas 1, 3, and 4, and so did not control for these
variables. No adjustment for indoor fuel type. Lung function technicians
were rotated between areas. II. Parental illness rates were not adjusted
for differences in age and smoking status. No adjustment for home
heating and cooking fuels. III. No adjustment for indoor pollution
levels at home. I-III. High NO2 area also had higher sulfates and
nitrates than did other areas. Jacob-Hochheiser method for analyzing NO2
later found to be in error.

Speizer & Ferris (84, 85).a,d  Survey of respiratory symptoms and
pulmonary function in 128 central city and 140 suburban Boston policemen
with outdoor duties during winter of 1969 to 1970, from BMRC
questionnaire administered by a trained physician interviewer.
  Pollutants measured: SO2, NO2, CO, hydrocarbons, lead, measured during
study by portable monitors on the streets. Differences in pollutant
levels between suburban and central city areas are not specified.
  Analysis and findings: Nonsmokers and smokers (but not exsmokers) from
the central city group had slight but nonsignificant excess in symptom
prevalence when compared to suburban group. Methods used to analyze
symptom prevalence are not described. Multiple linear regression of
FEV1f against age, height, smoking, and years of traffic exposure
indicated that FEV1 is not associated with traffic exposure.
  Comments: No information on area differences in retirement or in
transfers to inside assignments because of respiratory problems. Results
not easily generalized because of possible self-selection by men more
tolerant of automobile exhaust.

Douglas et al (88, 89, 90).a,c
I. Respiratory symptoms in 3131 children born in Great Britain in 1946
who resided since birth in areas of high, medium, low, or very low
pollution. Illness information obtained by interview with mothers at
ages 2 and 4 yr, and during examinations by school doctors at ages 5, 6,
7, 11, and 15. Hospital admissions recorded. Special absence records
kept at school from ages 6.5 to 10.5 yr.
II. Survey of prevalence of cough in winter among 3899 20-yr-old members
of above cohort. Survey repeated at age 25.
  Pollutants measured: Annual domestic coal consumption in region of
residence during 1952 was used as an index of air pollution.
  Analysis and findings: I. Upper respiratory infections not related to
air pollution, but frequency and severity of lower respiratory tract
infections increased significantly with pollution. This association held
for both sexes, for all social classes, and at each age examined. II.
Multiple linear regression of winter cough prevalence at age 20 vs
smoking habit, childhood chest illness, father's social class, and air
pollution index yielded no significant association between cough and air
pollution. Multiple logistic regression of cough prevalence at age 25 vs
above factors yielded slightly stronger but still nonsignificant
association with pollution.
  Comments: I. No control for parental smoking or indoor fuel use. Study
provides qualitative evidence that air pollution is related to lower
respiratory disease in early childhood. Specific pollutants cannot be
identified. II. 17% of those still living in UK at age 18 were lost to
follow-up, and this group contained a disproportionate number of
individuals from areas of high pollution. No control for parental
smoking and indoor fuel use.

Lunn et al (91, 92).b,c  Respiratory symptoms and pulmonary function
among children in 4 areas of differing pollution in Sheffield, England.
819 children aged 5 and 1049 children aged 11 examined at school during
summers from 1963 to 1965; 558 of those aged 5 reexamined 4 yr later in
summers of 1967 to 1969.
  Pollutants measured: Particulates (BS) and SO2 measured from 1964 to
1968 at stations within 1 mi of schools. Particulates, and to a lesser
extent SO2, decreased over study period.
  Analysis and findings: Children aged 5: significantly decreased
prevalence of chronic upper respiratory tract infection, of history of 3
or more colds per year, and of history of chronic cough, pneumonia, or
bronchitis in clean vs polluted area.

Holland et al (93).a,d  Survey of peak expiratory flow rate (PEFR) in
10,971 school children aged 5, 11, and 14, in 4 areas of differing
pollution in Kent, England during 1964 and 1965.
  Pollutants measured: Mean particulate levels for each area during
winter of 1966 to 1967 are presented, but their source is unspecified.

Melia et al (94, 95, 96).a,c,d  Respiratory symptoms and lung function
among 4827 children aged 6 to 11 yr in England and Scotland from 1972 to
1977.
  Pollutants measured: NO2 measured in kitchens, in children's bedrooms,
and outside of homes for 1 wk in winters. Mean NO2 levels higher in
kitchens and bedrooms of homes using gas as compared to electric
cooking.
  Analysis and findings: ... each sex and for urban and rural areas.
According to this analysis, which did not control for parental smoking,
gas cooking was significantly (p < .05) associated with respiratory
illness only for boys in urban areas. These results also held after
adjustment for heating fuel type and for outdoor levels of particulates
and SO2. Effects of gas cooking smaller in 1977 than in 1973. Multiple
logistic regression of presence or absence of respiratory illness vs
father's social class, age, sex, familial smoking, and NO2 in kitchen or
child's bedroom yielded marginally significant association between
respiratory illness and NO2 levels in bedroom. No relation between NO2
levels in kitchen and respiratory illness in children, siblings, or
parents, or between lung function and NO2 levels in kitchen or bedroom.
  Comments: ... indoor NO2 levels are harmful to health.

Kerrebijn & Mourmans (97).a,d  Respiratory symptoms and lung function
obtained for 2380 fourth and fifth grade children in 2 areas of
Westland, Holland during spring 1973. Symptoms obtained by questionnaire
mailed to parents.
  Pollutants measured: Winter and summer mean levels of particulates
(BS) and SO2 measured at 7 stations in area 1 (high pollution) and 1
station in area 2 (low pollution). Average distance of stations from
schools not specified.
  Analysis and findings: Children cross-classified by area and presence
or absence of symptom for each of 13 symptoms. Only cough during day or
night was more prevalent (p = .05 by Fisher's exact test) in area 1 than
in area 2. Area 1 had no consistent excess of remaining symptoms. No
area differences in lung function. Children in areas 1 and 2 comparable
in social class and family size.
  Comments: In view of multiple comparisons of symptoms and
statistically weak results, study provides no convincing evidence for
pollution effect on respiratory status of fourth to fifth grade
children. No information on rotation of technicians between areas. No
control for parental smoking or indoor fuel type.

  a Little or no utility for estimating exposure-response relationships.
  b Limited utility for estimating exposure-response relationships.
  c Cohort study.
  d Cross-sectional study.
  e British Medical Research Council.
  f FEVx = forced expiratory volume in x seconds.

-------

monitoring systems. In addition, techniques for measuring other monitored
pollutants should be standardized. Information on measurement and cali-
bration methods, and on the distance between monitoring stations and
populations at risk, should be included in the reports of all studies.
  The difficulties of apportioning risk among highly correlated pollutants
suggest that control efforts should be directed toward profiles of several
pollutants.

Respiratory Disease Assessment
In their search for evidence against air pollution, epidemiologists are ham-
pered by ignorance of the precise pathophysiological mechanisms by which
airborne toxicants are likely to induce acute and chronic respiratory disease.
Dialogue with toxicologists can stimulate animal and controlled human
experiments, which in turn can provide information useful in the design of
observational studies.
  Apart from our need for more knowledge about disease mechanisms,
there is need for improved measurement of respiratory damage. Subjective
measures of acute morbidity, such as presence of cough, chest discomfort,
or asthma, should have some objective confirmation, such as recorded times
of medication use or  verification by a physician.
  The precision by which any lung function test can predict the develop-
ment of disabling disease is unknown. Commonly used spirometry measure-
ments, such as forced expiratory volume and peak expiratory flow rate,
probably reflect disturbances in the large airways. Unfortunately, they are
relatively insensitive to those changes in the small peripheral airways that
may mark the start of chronic disease (12). Longitudinal studies show great
variability in one person's pulmonary function measurements over time, as
well as in rates of decline between people. The prognostic importance of this
variability has not been adequately studied. Even in long-term studies the
calculated individual rates of decline are not highly reliable. Better methods
of analyzing this type of data are needed.
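One common approach, fitting a separate least-squares slope to each subject's serial measurements, can be sketched with a hypothetical simulation (the decline rate, noise level, and sample sizes below are illustrative, not data from any study cited here). It shows why individual rates of decline are unreliable when within-person variability is large relative to the true rate.

```python
# Hypothetical illustration: a per-subject least-squares slope fitted to
# a few noisy annual FEV1 readings scatters widely around the true
# decline, even though the average across subjects is close to it.
import numpy as np

rng = np.random.default_rng(1)
years = np.arange(8)               # 8 annual test sessions per subject
true_decline = -0.03               # liters of FEV1 lost per year

slopes = []
for _ in range(200):               # 200 simulated subjects
    # Baseline 3.5 L, linear decline, plus 0.15 L session-to-session noise.
    fev1 = 3.5 + true_decline * years + rng.normal(0.0, 0.15, years.size)
    b, _ = np.polyfit(years, fev1, 1)   # per-subject OLS slope
    slopes.append(b)

slopes = np.array(slopes)
print(round(slopes.mean(), 3))  # near the true -0.03 on average
print(round(slopes.std(), 3))   # spread comparable in size to the
                                # true rate itself: individual
                                # estimates are unreliable
```

With measurement noise of this size, an individual's estimated slope is about as likely to look like zero decline as like twice the true decline, which is the reliability problem the text describes.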
  The questions of how many pulmonary function tests a subject should
receive at one testing, and whether to take the mean or the maximum
reading, are important. The procedure should be standardized (see 99). To
insure the output of reliable data, studies must be designed to provide for
instrument calibration between testings, and to adjust for  measurement
differences between instruments and between technicians.

Exposure-Response Assessment
Implicit in the US Clean Air Act is the assumption that no-effect levels of
pollutants can be  determined. However, the concept of such levels is a

-------
              AIR POLLUTION AND RESPIRATORY DISEASE    423

politically attractive chimera. No-effect levels, be they positive or zero, must
depend upon the individual, the weather, the pollutant mix, the disease, and
many other variables. Even for particular values of all these variables, such
levels are impossible to estimate with useful precision. Rather than waste
effort debating the existence and values of no-effect levels, we must face the
possibility of some risk at any positive level, and decide which risks are
tolerable.
  Exposure-response assessment suffers from a  dearth of reliable evidence.
The paucity of epidemiological data on respiratory effects at levels near the
current standards has been stressed  elsewhere (e.g.  12). Although the ab-
sence of reliable data supporting the  relaxation of standards should receive
equal emphasis, it does not. The insensitivity of existing studies to the effects
of moderate levels of pollutants is poorly appreciated.
  There  are several sources of this insensitivity: (a) insufficient variability
in exposures to pollutants, (b) large  random errors in measuring exposure
to pollutants, (c) inadequate control of confounding factors, (d) insufficient
sample sizes, (e) the chance that reactions of sensitive subgroups will go
unnoticed. Temporal studies of acute effects are further weakened by un-
known lags between exposure and effect, while cross-sectional studies of
chronic effects are muddied by migration in and out of geographical units.
  Because of these limitations, negative results cannot be automatically
interpreted as evidence of safety. At best, such results provide crude upper
bounds for effects. At present the lack of documented evidence on adverse
effects at levels slightly above the standards cannot be construed as evidence
in favor of relaxing them. More sensitive studies are needed. What must be
done  to achieve  them?

Recommendations
1. We must reduce exposure uncertainty with  more accurate, better cali-
brated, standard measuring devices located close to the nose and mouth of
the population at risk. The notion of parceling subjects' time into bins and
monitoring the bins, described earlier in this section, deserves implementa-
tion in short-term morbidity studies.
  2. We should  abandon hope of elucidating chronic effects by studying
current or former smokers. Cigarette smoking  so dominates air pollution
as a respiratory  hazard that these people cannot provide information on
chronic effects without precise control of such myriad smoking modifiers
as duration, pattern, amount, depth of inhalation, and tar and nicotine
content. The impracticality of doing this suggests that future chronic effects
studies be restricted  to life-long nonsmokers. Similar comments apply to
strong occupational respiratory  hazards.

   For these reasons studies in children are uniquely valuable. Insults to the
 respiratory tract in childhood may be one of the important determinants of
 chronic respiratory disease in adulthood. To clarify this, cohorts of children
 whose  exposures  are being documented  and who remain  nonsmokers
 should be followed into adulthood for chronic disease. Unfortunately, the
 necessary restriction of study to nonsmokers precludes the investigation of
 possible interaction between cigarette smoke and air pollution in the induc-
 tion of respiratory disability.
   3. More thought must be given to study design. Due to the many limita-
 tions cited in the previous pages, further cross-sectional studies of chronic
 effects  are not  likely to be useful in determining standards. Long-term
 cohort studies of geographical  differences in pulmonary function changes
 share some of these limitations and thus should be undertaken only  with
 the greatest care. Current air quality monitoring and data storage and
 retrieval are sufficiently developed so that within the next few decades,
 national  data banks on average air pollution levels by time and place of
 residence may be available. It may thus be feasible to conduct retrospective
 studies of lifetime exposures among nonsmoking, nonoccupationally ex-
 posed victims of chronic lung disease as compared with exposures among
 suitably chosen controls. As with all chronic effects studies, the sensitivity
 of these  retrospective studies would be compromised by large  errors in
 exposure estimates. Further, they  would be subject to at  least as many
 potential biases as are corresponding cohort studies. Nevertheless, because
 the frequency of chronic lung disease is low, the retrospective studies would
 be substantially cheaper and more efficient than the cohort studies.
   Short-term studies of acute nonfatal events as they vary  with  pollution
 over time, while  unable to  answer the important questions concerning
 chronic effects, have the best prognosis for sensitivity, reliability, and feasi-
 bility. It is important that existing studies of this type be duplicated using
 improved design,  exposure assessment techniques, and analysis.
   4. Analysis of all studies can be improved. Future investigators should
 exploit recent statistical methods for controlling intervening variables and
 for increasing sensitivity. Tables 2 through 4 show that many studies failed
 to adjust for weather and other pollutants even when data on these variables
 were available. The sensitivity and validity of studies such as those in Tables
 2 and 3 can be enhanced by relating disease to exposure on an individual
 rather than on an aggregate level. Finally, study plans should include more
 careful sample size calculations.
   While implementation of the above recommendations will help to im-
 prove the epidemiological  evidence upon which regulatory decisions are
 based,  it is unrealistic to expect  that "perfect" studies will some  day be
possible. Exposure can never be known with complete accuracy, nor can
confounding be completely controlled. These basic limitations insure con-
tinued dispute over the extent to which observed relationships between air
pollution and respiratory damage are causal, as well as continued disagree-
ment over the interpretation of negative studies. To resolve such disputes
we must look for patterns and consistency across many studies, carefully
performed and replicated under different circumstances. Decisions about
the quality of the air we wish to breathe and the costs and risks we  are
willing to assume must ultimately be societal judgments, based on informa-
tion from a broad range of good epidemiological and laboratory studies, and
on consideration of competing demands for our resources.

SUMMARY

The existence of respiratory damage due to air pollution has been estab-
lished beyond reasonable doubt by a large body of work. Nevertheless, little
of the work that has been reviewed in this paper can be used to obtain
reliable estimates of effects at specified levels. None of the data discussed
in Tables 2 through 5 indicate that current or even moderately less stringent
standards are inadequate to protect the public health. On the other hand,
there is a dearth of reliable data indicating that standards can be relaxed.
   Respiratory morbidity studies are more sensitive and reliable than are the
corresponding mortality studies. Investigations comparing acute  respira-
tory events  with temporal pollutant fluctuations should  continue, using
better design, better analysis, and better exposure assessment techniques.
The prognosis is poor for reliable quantitative data relating chronic pollu-
tant exposure to respiratory damage. At present, cohort studies of changes
in respiratory status vs  careful measures of cumulative exposure, despite
their cost, offer the best hope of obtaining these data. Such studies should
be restricted to children and to adults who have never smoked.

ACKNOWLEDGMENTS

The author wishes to thank Yih-shih Yuh and Patricia Ford for advice and
assistance in preparing this review. This work was supported by grants to
the SIAM Institute for Mathematics and Society from the National Science
Foundation,  the Department  of  Energy,  the  Environmental Protection
Agency, and the Rockefeller Foundation, by a Research Career Develop-
ment Award from the National Institute of Environmental Health Sciences,
and by grant number CA 23214-01 from the National Cancer Institute.

  Literature Cited

   1. US Clean Air Act. 1963. 42 USC Sect.
      1857-1957
   2. US Dept. Health, Educ., Welfare. 1970.
      Air Quality Criteria for Carbon Monox-
      ide. Natl. Air Pollut. Control Admin.
      Publ. No. AP-62. Washington DC: GPO
   3. US Dept. Health, Educ., Welfare. 1970.
      Air Quality Criteria for Hydrocarbons.
      Natl. Air Pollut. Control Admin. Publ.
      No. AP-64. Washington DC: GPO
   4. US Environmental Protection Agency.
      1973. Background Information on De-
      velopment of National Emission Stan-
      dards for Hazardous Air Pollutants: As-
      bestos, Beryllium, and Mercury. Off.
      Air Quality Planning and Standards
      Publ. No. APTD-1503. Research Tri-
      angle Park, NC: USEPA
   5. Natl. Acad. Sci. 1974. Air Quality and
      Automobile Emission Control. US Sen.
      Comm. Publ. Works, Print Ser. No.
      93-24. Washington DC: GPO
   6. Natl. Acad. Sci. 1975. Air Quality and
      Stationary Source Emission Control. US
      Sen. Comm. Publ. Works, Print Ser.
      No. 94-4. Washington DC: GPO
   7. WHO. 1979. Environmental Health
      Criteria Series. Vol. 3, Lead; Vol. 4,
      Oxides of Nitrogen; Vol. 7, Photochemi-
      cal Oxidants; Vol. 8, Sulfur Oxides and
      Suspended Particulate Matter. Geneva:
      WHO
   8. Am. Thorac. Soc. 1978. Health effects
      of air pollution. ATS News, pp. 22-63
   9. Ferris, B. G. 1978. Health effects of ex-
      posure to low levels of regulated air pol-
      lutants. J. Air Pollut. Control Assoc.
      28:482-97
  10. Goldsmith, J. R. 1969. Epidemiological
      bases for possible air quality criteria for
      lead. J. Air Pollut. Control Assoc.
      19:714-21
  11. Goldsmith, J. R., Cohen, S. I. 1969.
      Epidemiological bases for air quality
      criteria for carbon monoxide. J. Air Pol-
      lut. Control Assoc. 19:704-13
  12. Holland, W. W., Bennett, A. E., Cam-
      eron, I. R., Florey, C. du V., Leeder, S.
      R., Schilling, R. S. F., Swan, A. V.,
      Waller, R. E. 1979. Health effects of
      particulate pollution: Reappraising the
      evidence. Am. J. Epidemiol. 110(5)
  13. Natl. Acad. Sci. 1973. Proc. Conf.
      Health Effects of Air Pollutants. US
      Sen. Comm. Publ. Works, Print Ser.
      No. 93-15. Washington DC: GPO
  14. US Dept. Health, Educ., Welfare. 1981.
      Air Quality Criteria for Particulate Mat-
      ter and Sulfur Oxides. Vols. 1, 5. In
      press
  15. US Dept. Health, Educ., Welfare. 1978.
      Air Quality Criteria for Ozone and Other
      Photochemical Oxidants. EPA-600/8-
      78-004. Research Triangle Park, NC:
      USEPA
  16. US Environmental Protection Agency.
      1971. Air Quality Criteria for Nitrogen
      Oxides. Air Pollut. Control Off. Publ.
      No. AP-84. Washington DC: GPO
  17. Natl. Acad. Sci. 1977. Ozone and Other
      Photochemical Oxidants. Washington
      DC: NAS
  18. Natl. Acad. Sci. 1977. Nitrogen Oxides.
      Washington DC: NAS
  19. Chambers, L. A. 1973. Proc. Conf.
      Health Effects of Air Pollutants, pp.
      567-69. US Sen. Comm. Publ. Works,
      Print Ser. No. 93-15. Washington DC:
      GPO
  20. Wright, G. W. 1969. An appraisal of
      epidemiological data concerning the
      effect of oxidants, nitrogen dioxide and
      hydrocarbons upon human popula-
      tions. J. Air Pollut. Control Assoc.
      19:679-82
  21. Lave, L. B., Seskin, E. P. 1977. Air Pol-
      lution and Human Health. Baltimore:
      Johns Hopkins Univ. Press
  22. Goldsmith, J. R., Friberg, L. T. 1977.
      Effects of air pollution on human
      health. In Air Pollution, ed. A. C. Stern,
      2:458-410. New York: Academic
  23. Ware, J., Ferris, B. G. 1981. Assess-
      ment of the health effects of sulfur ox-
      ides and particulate matter: Evidence
      from observational studies. Environ.
      Health Perspect. In press
  24. Speizer, F. E. 1969. An epidemiological
      appraisal of the effects of ambient air on
      health: Particulates and oxides of sul-
      fur. J. Air Pollut. Control Assoc.
      19:647-54
  25. Holland, W. W., ed. 1972. Air Pollution
      and Respiratory Disease. Westport,
      Conn: Technomic
  26. Finkel, A. J., Duel, W. C., eds. 1976.
      Clinical Implications of Air Pollution
      Research. Acton, Mass: Publishing
      Sciences Group
  27. Lee, D. H., Falk, H. L., Murphy, S. D.,
      Geiger, S. R., eds. 1977. Reactions to
      Environmental Agents. Bethesda, Md:
      Am. Physiol. Soc.
  28. Firket, J. 1931. The cause of the symp-
      toms found in the Meuse Valley during
      the fog of December 1930. Bull. Acad.
      R. Med. Belg. 11:683-741
  29. Schrenk, H. H., Heimann, H., Clayton,
      G. D., Gafafer, W., Wexler, H. 1949.
      Air pollution in Donora, Pennsylvania.
      Epidemiology of the unusual smog epi-
      sode of October 1948. Public Health
      Bull. No. 306. Washington DC: GPO
30. Logan, W. P. D. 1953. Mortality in
    London fog incident. Lancet 1:336-38
31. Scott, J. A. 1963. The London fog of
    December 1962. Med. Off. 109:250-52
32. Greenburg, L., Jacobs, M. D., Drolette,
    B. M., Field, F., Braverman, M. M.
    1963. Report of an air pollution inci-
    dent in New York City, November
    1953. Public Health Rep. 78:1061-64
33. McCarroll, J., Bradley, W. 1966. Excess
    mortality as an indicator of health
    effects of air pollution. Am. J. Public
    Health 56:1933-42
34. Ciocco, A., Thompson, D. C. 1961. A
    follow-up of Donora ten years later:
    Methodology and findings. Am. J. Pub-
    lic Health 51:155-64
35. Martin, A. E., Bradley, W. 1960. Mor-
    tality, fog and atmospheric pollution—
    An investigation during the winter of
    1958-59. Month. Bull. Minist. Health
    Public Health Lab. Serv. 19:56-72
36. Martin, A. E. 1964. Mortality and mor-
    bidity statistics and air pollution. Proc.
    R. Soc. Med. 57:969-75
37. Lawther, P. J. 1963. Compliance with
    the Clean Air Act: Medical aspects. J.
    Inst. Fuel 36:341
38. Gore, A. T., Shaddick, C. W. 1968. At-
    mospheric pollution and mortality in
    the County of London. Br. J. Prev. Med.
    12:104-13
39. Buechley, R. W. 1977. SO2 Levels,
    1967-1972, and Perturbations in Mor-
    tality. Contract NO1-ES-5-2101. Re-
    search Triangle Park, NC: Natl. Inst.
    Environ. Health Sci.
40. Schimmel, H., Greenburg, L. 1972. A
    study of the relation of pollution to
    mortality, New York City, 1963-1968.
    J. Air Pollut. Control Assoc. 22:607-16
41. Schimmel, H., Murawski, T. J. 1975.
    SO2—Harmful pollutant or air quality
    indicator? J. Air Pollut. Control Assoc.
    25:739-40
42. Lebowitz, M. D. 1973. A comparative
    analysis of the stimulus-response rela-
    tionship between mortality and air pol-
    lution—weather. Environ. Res. 6:
    106-18
43. Hodgson, A. Jr. 1970. Short-term
    effects of air pollution on mortality in
    New York City. Environ. Sci. Technol.
    5:589-97
44. Glasser, M., Greenburg, L. 1971. Air
    pollution and mortality and weather,
    New York City, 1960-64. Arch. Envi-
    ron. Health 22:334-43
45. Lebowitz, M. D., Toyama, T., McCar-
    roll, J. 1973. The relationship between
    air pollution and weather as stimuli and
    daily mortality as responses in Tokyo,
    Japan, with comparisons with other cit-
    ies. Environ. Res. 6:327-33
46. Kevany, J., Rooney, M., Kennedy, J.
    1975. Health effects of air pollution in
    Dublin. Ir. J. Med. Sci. 144:102-15
47. Lindberg, W. 1968. Air Pollution in
    Norway. Oslo: Smoke Damage Council
48. Buck, S. F., Brown, D. Z. 1964. Mortal-
    ity from Lung Cancer and Bronchitis in
    Relation to Smoke and Sulphur Dioxide
    Concentration, Pollution Density, and
    Social Index. Res. Paper No. 7. Lon-
    don: Tobacco Res. Counc.
49. Lipfert, F. W. 1978. The Association of
    Human Mortality with Air Pollution:
    Statistical Analyses by Region, by Age,
    and by Cause of Death. New York:
    Long Island Lighting
50. Zeidberg, L. D., Horton, R. J. M.,
    Landau, E. 1961. The Nashville Air
    Pollution Study. Mortality from dis-
    eases of the respiratory system in rela-
    tion to air pollution. Arch. Environ.
    Health 15:214-24
51. Winkelstein, W. Jr., Kantor, S., Davis,
    E. M., Maneri, C. S., Mosher, W. E.
    1967. The relationship of air pollution
    and economic status to total mortality
    and selected respiratory system mortal-
    ity in men. Arch. Environ. Health
    14:162-69
52. Winkelstein, W. Jr., Kantor, S., Davis,
    E. W., Maneri, C. S., Mosher, W. E.
    1968. The relationship of air pollution
    and economic status to total mortality
    and selected respiratory system mortal-
    ity in men (II. Oxides of Sulfur). Arch.
    Environ. Health 16:401-5
53. Crocker, T. D., Schulze, W., David, B.,
    Kneese, A. V. 1979. Methods Develop-
    ment for Assessing Air Pollution Control
    Benefits. Vol. I, Experiments in the Eco-
    nomics of Air Pollution Epidemiology.
    EPA-600/5-79-001a. Research Trian-
    gle Park, NC: US EPA
54. Lave, L. B., Seskin, E. P. 1970. Air pol-
    lution and human health. Science
    169:723-33
55. Watanabe, H., Kaneko, F. 1971. Excess
    death study of air pollution. In Proc.
    Int. Clean Air Congr., 2nd, ed. H. M.
    Englund, W. T. Beery, pp. 199-200.
    New York: Academic
56. Schwing, R. C., McDonald, G. C. 1976.
    Measures of association of some air pol-
    lutants, natural ionizing radiation, and
    cigarette smoking with mortality rates.
    Sci. Total Environ. 5:139-69
56a. Smoking and health: A report of the
    Surgeon General. 1979. DHEW Publ.
    No. (PHS) 79-50066. US Dept. Health,
    Educ., Welfare
 57. Schoettlin, C. E., Landau, E. 1961. Air
     pollution and asthmatic attacks in the
     Los Angeles area. Public Health Rep.
     76:545-48
 58. Zeidberg, L. D., Prindle, R. A.,
     Landau, E. 1961. The Nashville air pol-
     lution study. I. Sulfur dioxide and bron-
     chial asthma. A preliminary report.
     Am. Rev. Resp. Dis. 84:489-503
 59. Cohen, A. A., Bromberg, S., Buechley,
     R. W., Heiderscheit, L. T., Shy, C. M.
     1972. Asthma and air pollution from a
     coal-fueled power plant. Am. J. Public
     Health 62:1181-88
 60. Whittemore, A. S., Korn, E. L. 1980.
     Asthma and air pollution in the Los
     Angeles area. Am. J. Public Health
     70:687-96
 61. Hammer, D. L., Hasselblad, V., Port-
     noy, B., Wehrle, P. F. 1974. Los Ange-
     les student nurse study. Arch. Environ.
     Health 28:255-60
 62. Lawther, P. J., Waller, R. E., Hender-
     son, M. 1970. Air pollution and exacer-
     bations of bronchitis. Thorax 25:
     525-39
 63. Motley, H. L., Smart, R. H., Leftwich,
     C. I. 1959. Effect of polluted Los Ange-
     les air (smog) on lung volume measure-
     ments. J. Am. Med. Assoc. 171:1469-77
 64. Wayne, W. S., Wehrle, P. F., Carroll,
     R. E. 1967. Oxidant air pollution and
     athletic performance. J. Am. Med. As-
     soc. 199:901-4
 65. Lawther, P. J., Brooks, A. G. F., Lord,
     P. W., Waller, R. E. 1974. Day-to-day
     changes in ventilatory function in rela-
     tion to the environment. I. Spirometric
     values. Environ. Res. 7:24-40
 66. Lawther, P. J., Brooks, A. G. F., Lord,
     P. W., Waller, R. E. 1974. Day-to-day
     changes in ventilatory function in rela-
     tion to the environment. II. Peak ex-
     piratory flow values. Environ. Res.
     7:41-53
 67. Lebowitz, M. D., Bendheim, P., Cris-
     tea, G., Markovitz, D., Misiaszek, J.,
     Staniec, M., van Wyck, D. 1974. The
     effect of air pollution and weather on
     lung function in exercising children and
     adolescents. Am. Rev. Resp. Dis.
     109:262-73
 68. Stebbings, J. H., Fogleman, D. G.,
     McClain, K. E., Townsend, M. C. 1976.
     Effect of the Pittsburgh air pollution ep-
     isode upon pulmonary function in
     schoolchildren. J. Air Pollut. Control
     Assoc. 26:547-53
 69. Durham, W. H. 1974. Air pollution and
     student health. Arch. Environ. Health
     28:241-54
 70. Fishelson, G., Graves, P. 1978. Air pol-
     lution and morbidity: SO2 damages. J.
     Air Pollut. Control Assoc. 28:785-89
71. Levy, D., Gent, M., Newhouse, M. T.
    1977. Relationship between acute res-
    piratory illness and air pollution levels
    in an industrial city. Am. Rev. Resp. Dis.
    116:167-73
72. Goldstein, I. F., Block, G. 1974.
    Asthma and air pollution in two inner
    city areas in New York City. J. Air Pol-
    lut. Control Assoc. 24:665-70
73. Chiaramonte, L. T., Bonaomo, J. R.,
    Brown, R., Laano, M. E. 1970. Air pol-
    lution and obstructive respiratory dis-
    ease in children. NY State J. Med.
    70:394-98
74. Ipsen, J., Deane, M., Ingenito, F. E.
    1969. Relationships of acute respiratory
    disease to atmospheric pollution and
    meteorological conditions. Arch. Envi-
    ron. Health 18:462-72
75. Ferris, B. G. Jr., Higgins, I. T. T., Hig-
    gins, M. W., Peters, J. M. 1973. Chronic
    nonspecific respiratory disease in Ber-
    lin, New Hampshire, 1961 to 1967. A
    follow-up study. Am. Rev. Resp. Dis.
    107:110-22
76. Ferris, B. G. Jr., Chen, H., Puleo, S.,
    Murphy, R. L. H. Jr. 1976. Chronic
    nonspecific respiratory disease in Ber-
    lin, New Hampshire, 1967 to 1973. A
    further follow-up study. Am. Rev. Resp.
    Dis. 113:475-85
77. Winkelstein, W. Jr., Kantor, S. 1969.
    Respiratory symptoms and air pollution
    in an urban population of northeastern
    United States. Arch. Environ. Health
    18:760-67
78. Lambert, P. M., Reid, D. D. 1970.
    Smoking, air pollution and bronchitis in
    Britain. Lancet 1:853-57
79. Colley, J. R. T., Reid, D. D. 1970. Ur-
    ban and social origins of childhood
    bronchitis in England and Wales. Br.
    Med. J. 2:213-17
80. Hrubec, Z., Cederlof, R., Friberg, L.,
    Horton, R., Ozolins, G. 1973. Respira-
    tory symptoms in twins. Arch. Environ.
    Health 27:189-95
81. Shy, C. M., Creason, J. P., Pearlman,
    M. E., McClain, K. E., Benson, F. B.,
    Young, M. M. 1970. The Chattanooga
    school children study: Effects of com-
    munity exposure to nitrogen dioxide. I.
    Methods, description of pollutant expo-
    sure, and results of ventilatory function
    testing. J. Air Pollut. Control Assoc.
    20:539-45

82. Shy, C. M., Creason, J. P., Pearlman,
    M. E., McClain, K. E., Benson, F. B.,
    Young, M. M. 1970. The Chattanooga
    school children study: Effects of com-
    munity exposure to nitrogen dioxide. II.
    Incidence of acute respiratory illness. J.
    Air Pollut. Control Assoc. 20:582-88
83. Pearlman, M. E., Finklea, J. F., Crea-
    son, J. P., Shy, C. M., Young, M. M.,
    Horton, R. J. M. 1971. Nitrogen diox-
    ide and lower respiratory illness. Pediat-
    rics 47:391-98
84. Speizer, F. E., Ferris, B. G. Jr. 1973.
    Exposure to automobile exhaust. I.
    Prevalence of respiratory symptoms
    and disease. Arch. Environ. Health
    26:313-18
85. Speizer, F. E., Ferris, B. G. Jr. 1973.
    Exposure to automobile exhaust. II.
    Pulmonary function measurements.
    Arch. Environ. Health 26:319-24
86. Cohen, C. A., Hudson, A. R., Clausen,
    J. L., Knelson, J. H. 1972. Respiratory
    symptoms, spirometry, and oxidant air
    pollution in nonsmoking adults. Am.
    Rev. Resp. Dis. 105:251-61
87. Aubry, F., Gibbs, G. W., Becklake, M.
    R. 1979. Air pollution in three urban
    communities. Arch. Environ. Health
    34:360-48
88. Douglas, J. W. B., Waller, R. E. 1966.
    Air pollution and respiratory infection
    in children. Br. J. Prev. Soc. Med.
    20:1-8
89. Colley, J. R. T., Douglas, J. W. B.,
    Reid, D. D. 1973. Respiratory disease
    in young adults: influence of early child-
    hood lower respiratory tract illness, so-
    cial class, air pollution, and smoking.
    Br. Med. J. 3:195-98
90. Kiernan, K. E., Colley, J. R. T., Doug-
    las, J. W. B., Reid, D. D. 1976. Chronic
    cough in young adults in relation to
    smoking habits, childhood environment
    and chest illness. Respiration 33:236-44
91. Lunn, J. E., Knowelden, J., Handyside,
    A. J. 1967. Patterns of respiratory ill-
    ness in Sheffield infant schoolchildren.
    Br. J. Prev. Soc. Med. 21:7-16
92. Lunn, J. E., Knowelden, J., Roe, J. W.
    1970. Patterns of respiratory illness in
    Sheffield junior schoolchildren. A fol-
    low-up study. Br. J. Prev. Soc. Med.
    24:223-28
93. Holland, W. W., Halil, T., Bennett, A.
    E., Elliott, A. 1969. Factors influencing
    the onset of chronic respiratory disease.
    Br. Med. J. 2:205-8
94. Melia, R. J. W., Florey, C. duV., Chinn,
    S. 1979. The relation between respira-
    tory illness in primary schoolchildren
    and the use of gas for cooking. I. Results
    from a national survey. Int. J.
    Epidemiol. 8:333-38
95. Goldstein, B. D., Melia, R. J. W.,
    Chinn, S., Florey, C. duV., Clark, D.,
    John, H. H. 1979. The relation between
    respiratory illness in primary school-
    children and the use of gas for cooking.
    II. Factors affecting nitrogen dioxide
    levels in the home. Int. J. Epidemiol.
    8:339-46
96. Florey, C. duV., Melia, R. J. W., Chinn,
    S., Goldstein, B. D., Brooks, A. G. F.,
    John, H. H., Craighead, I. B., Webster,
    X. 1979. The relation between respira-
    tory illness in primary schoolchildren
    and the use of gas for cooking. III. Ni-
    trogen dioxide, respiratory illness and
    lung infection. Int. J. Epidemiol.
    8:347-54
97. Kerrebijn, K. F., Mourmans, A. R. M.,
    Biersteker, K. 1975. Study of the rela-
    tionship of air pollution to respiratory
    disease in schoolchildren. Environ. Res.
    10:14-28
98. Kendall, M. G., Stuart, A. 1961. The
    Advanced Theory of Statistics. Vol. 2.
    New York: Hafner
99. Ferris, B. G. Jr. 1978. Epidemiology
    standardization project. Am. Rev. Resp.
    Dis. 118:1-120

-------
            The treatment of missing data in multivariate analysis

            Jae-On Kim and James Curry
                     University of Iowa

   For any large data set it is unlikely that complete information
 will be present for all the cases. In surveys which rely on respon-
 dents' reports of behavior and attitudes, it is almost certain that
 some information is either  missing or in  an unusable form.
 Although statisticians have long appreciated that the existence of
 such missing information can change an ordinarily simple statis-
 tical analysis into a complex one (e.g., Orchard and Woodbury,
 1972) and responded to this challenge by producing enormous
 amounts of literature (see, for example, Afifi and Elashoff, 1966;
 Hartley and Hocking, 1971; Orchard and Woodbury, 1972; Press
 and Scott, 1974), there is little indication that survey researchers
 have paid much attention to the literature. When faced with such
 missing-data problems, most survey researchers are likely to
 choose either listwise or pairwise deletion, and then
 proceed to interpret the resulting statistics as usual.1
   The primary  objective of this paper is to review and organize
 the procedures for handling  missing data,  having in  mind the
 practical needs of survey researchers with a relatively complex
 analysis problem but with little statistical sophistication. To make
 the task manageable, we will confine our discussion mostly to the
 situation in which variables are measured at least on an interval
 scale. Other situations will be dealt with only when such excursion
 is simple and does not interrupt the flow of the presentation. For
 researchers with specific problems not discussed in this paper, a
 brief bibliographical note is included.

 Missing data problems: a practical overview

   There are many different ways to categorize the treatment of
 missing data problems. From the practical point of view,
 however, the most important question is when and under what
 conditions one can safely consider the problem of missing infor-
 mation to be trivial. This much is obvious: the smaller the relative
 proportion of missing information, the larger the sample, and the
 more random the missing information, the less troublesome the
 missing data problems.
  If the sample size is large, as is usually the case in survey
 research, and the proportion  of missing data relatively small,
 probably the first options to consider  for handling missing data
 are the simplest ones: listwise deletion or pairwise deletion. If the
 problem  can be handled safely by either of these options,  the
 missing-data problem may be considered trivial. But when is the
 proportion of missing  data considered relatively small? The
 overall loss of data for a given analysis depends on several fac-
 tors: the proportion of missing  observations on each variable, the
number of variables under consideration, the degree to which
missing observations are clustered, and the choice of proce-
dures for handling missing data.
  Although listwise deletion is the simplest, there is an inherent
conflict in the often recommended requirements for its use: for a
fixed number of variables and a given proportion of missing
cases on each variable, the number of deleted cases increases
as the pattern of missing information becomes more random. To
illustrate, if only 2% of the cases contain missing values on each
variable and the pattern of missing values is random, the listwise
procedure will delete 18.3% of the cases in an analysis using 10
variables. As the overlap of missing values increases, the loss
due to listwise deletion will decrease, in the extreme to 2% of the
 cases. Pairwise deletion is an attractive alternative when the num-
 ber of missing cases on each variable is small relative to the total
 sample size, the pattern of missing values  is random, and the
 number of variables involved  is large.
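The 18.3% figure can be checked directly: with a random missing pattern, a case survives listwise deletion only if it is complete on all k variables, so the expected loss is 1 - (1 - p)^k. A brief sketch (the function name is ours, for illustration only):

```python
# Expected fraction of cases lost to listwise deletion when each of k
# variables is missing independently at random with probability p.
def listwise_loss(p, k):
    return 1.0 - (1.0 - p) ** k

print(round(listwise_loss(0.02, 10), 3))  # 0.183 -- the 18.3% cited above
print(round(listwise_loss(0.02, 1), 2))   # 0.02 -- the full-overlap extreme
```

As the formula shows, the loss grows quickly with the number of variables even when each variable is only slightly incomplete.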
  The problem with the use of listwise deletion is the relatively
greater loss of data, whereas the problem with pairwise dele-
tion is the potential inconsistency of the covariance matrix in a
multivariate context.2 There is an approach which tries to over-
come the limitations of these two simple procedures. The basic
strategy of this approach is to estimate in the first step the missing
values (not parameters) from the available information and then
proceed to the estimation of parameters. The drawback of this
approach is that except for very simple situations it depends on
iterative numerical solutions, making it out of reach of most
survey researchers (at least until ready-made computer
programs are more widely available). But the advantages of this
approach are too compelling not to consider its adoption: (1) the
parameter estimation is more efficient because a greater amount
of available information is used, and (2) it allows at the same time
the use of estimated values for index construction, thereby pre-
venting a severe data loss in complex analyses.
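The covariance-matrix inconsistency from pairwise deletion can be made concrete. Each correlation is estimated from a different subset of cases, so the assembled matrix need not correspond to any possible population. The numbers below are our hypothetical example, chosen to show the failure:

```python
import numpy as np

# Hypothetical pairwise-deletion estimates: r12 and r13 come from one
# overlap of complete cases, r23 from another. Jointly impossible: no
# single population has r12 = r13 = 0.9 while r23 = -0.9.
R = np.array([[1.0,  0.9,  0.9],
              [0.9,  1.0, -0.9],
              [0.9, -0.9,  1.0]])

# A valid correlation matrix must be positive semi-definite. This one
# has a negative eigenvalue, so multivariate procedures built on it
# (e.g. regression using all three variables) can break down.
print(np.linalg.eigvalsh(R).min() < 0)  # True
```

With listwise deletion every statistic comes from the same cases, so this inconsistency cannot arise; the price is the larger data loss discussed above.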
  Keeping in mind, however, the varied needs of researchers, we
will organize our discussion around the following three practical
issues: (1) how to evaluate whether or not observations are ran-
domly missing and how to live with a potentially serious missing
data problem; (2) if missing data are trivial, what are the factors to
consider in choosing between listwise and pairwise deletion; and
(3) when these simpler methods are inappropriate, what are the
general techniques available in the literature.

 Assessing the nature of missing data

  Almost all the techniques suggested in the literature assume
that information is missing randomly (see Buck, 1960; Hartley
and Hocking, 1971; Orchard and Woodbury, 1972; Beale and Lit-
tle, 1974). But the simple dichotomy of random versus non-
random is often not sufficient. We begin therefore with a brief
examination of various patterns of missing data.

Types of missing data
   Because the most important aspect of missing data is whether
and to what  extent the missing information  may be considered
random, the  following categorization is chiefly based on the pat-
tern of randomness.

-------
The treatment of missing data in multivariate analysis          35
  1. Missing data is randomly produced. That is, whether infor-
mation is missing or not on a given variable is unrelated to the
values of that variable or to the values of other variables in the
data set. As the sample size increases, it is expected not only that
the mean and variance of each variable and the covariance be-
tween any two variables will be affected less by the existence of
missing data but also that the pattern of missing data will exhibit
such randomness. By the pattern of missing data we mean the
frequency distribution of different categories of missing patterns,
such as missing only on X1, missing on both X1 and X2, and so on.
  2. Missing data on a given variable X1 is dependent on the
value of another variable X2. The differentiating characteristic of
this pattern is that information missing on a given variable X1
is not dependent on the values of X1 itself but rather on its underlying
relationship with another variable X2. Therefore, whether or not
information is missing on X1 is independent of the values of X1
given the values of X2. This represents an important class of missing
data which is not completely random but allows simple factoriza-
tion of maximum likelihood (e.g., see Little, 1976a; 1976b). An ex-
ample of this type of missing data problem can be found in the
situation in which respondents are asked whether they voted in
previous elections but some were ineligible to vote because of
age.2 Or, in a panel study, a whole set of information may be miss-
ing from the later waves for some respondents who were not
available. The missing data pattern will be systematic, but the
cause of the missing data may be independent of the values of
the variables under consideration (Rubin, 1976).
  3. Underlying values not observed in a given data set may
determine whether information is missing or not. For example, in
an attitudinal survey, a respondent with low cognitive skills may
refuse or be unable to give usable answers to many questions. In
this case, the pattern of missing data will exhibit clustering. The
difference between this case and 2 above is that here the variable
which determines the pattern of missing data is not included in
the survey. The practical difference is that in case 2 the missing
information can be made conditionally independent of its un-
derlying values by controlling for the differences in the determin-
ing factor (e.g., age). Such a control is usually impossible in the
present example. But the difference between cases 2 and 3 is not
an absolute one; there may exist variables such as education and
other measures of cognitive ability with which the underlying syn-
drome may be approximated and factored out.
  4. Whether information is missing on a given variable X1 is
dependent on the values of that variable itself. For example, respon-
dents with excessively high income may be reluctant to reveal
their level of income. In this type of missing data problem, the
mean and variance of the variable are affected by the existence of
missing information. The degree to which the missing informa-
tion can be estimated from the existing data will depend on the
degree to which the variable is related to other variables in the
data set, but unless the determination is complete, completely
eliminating the effects of missing data is not possible. In addition,
an examination of the pattern of missing data alone may not
provide clues to the nature of this type of missing data.
  5. Missing data is a product of a particular combination of two
or more variables. For instance, people with high education may
be less inclined to reveal their low income, or people with high in-
come may be unwilling to reveal their low education. In this type
of situation, any effort to recover the missing information from the
relationship between the variables in the existing data would be
misleading.
Testing for a random pattern
  A simple test for the randomness of missing values can be
devised if missing values are relatively numerous and many variables
are involved in the analysis. The strategy is to consider the (K + 2)
patterns of missing data, where K is the number of variables with a
substantial number of missing cases (say, 20 or over). The pat-
terns to consider are: (M1), (M2), ..., (MK), (MM: missing on two or
more variables), and (NM: none missing), where Mi stands for
cases on which information is missing only on the variable Xi. To
illustrate, if there are three variables, X1, X2, and X3, the categories
involved and the expected frequencies under the assumption of
randomness are given as follows:

M1 =
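Under the assumption of random (independent) missingness, the expected frequency of each category is a simple product of the per-variable missing proportions. A sketch of the computation (the function name and the illustrative proportions are ours):

```python
from math import prod

def expected_pattern_counts(n_cases, p_missing):
    """Expected counts of the (K + 2) missing-data categories under
    independent (random) missingness.  p_missing[i] is the observed
    missing proportion for variable X_(i+1)."""
    none = prod(1 - p for p in p_missing)        # NM: complete cases
    only = [p * prod(1 - q for j, q in enumerate(p_missing) if j != i)
            for i, p in enumerate(p_missing)]    # Mi: missing on Xi only
    multi = 1.0 - none - sum(only)               # MM: missing on 2+ vars
    return {"NM": n_cases * none,
            **{f"M{i + 1}": n_cases * v for i, v in enumerate(only)},
            "MM": n_cases * multi}

# Three variables, each with 5% missing, 1000 cases:
counts = expected_pattern_counts(1000, [0.05, 0.05, 0.05])
```

These expected counts can then be compared to the observed pattern frequencies with an ordinary goodness-of-fit test.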
-------
a in the regression equation Y = a + bX• will represent the mean
of Y for the first m (complete) cases and the coefficient b will
represent the difference between the mean of the missing cases
and that of the nonmissing cases. The test for the significance of b
will serve as a test for the randomness of the missing data.
  Furthermore, if one inserts any arbitrary constant in the place
of the missing X values and regresses Y on both X and X•, then
the partial regression coefficient associated with X will be
equivalent to the simple b that would be obtained from the com-
plete data of the m cases. The data pattern would then be:

Y:  y1 y2 ... ym ym+1 ... yn
X:  x1 x2 ... xm C    ... C
X•: 0  0  ... 0  1    ... 1

The increment in R² due to X from the dummy regression is
equivalent to the simple r² that would exist for the first m cases.
The complete analysis calls for a hierarchical regression. If, on
the other hand, the missing values on X are replaced by the mean
of X (for the m cases), then the regression of Y on X and X• need
not be made in a hierarchical fashion; the partial b's associated
with X and X• are then equivalent to simple b's (see Cohen and
Cohen, 1975, for more illustrations and uses of different coding
methods).
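The first claim, that regressing Y on the indicator alone recovers the complete-case mean and the mean difference, can be verified with a toy example (the data are invented for illustration):

```python
import numpy as np

# Y is observed for all six cases; X is missing for the last two.
# x_dot is the missing-data indicator for X (0 = present, 1 = missing).
y = np.array([3., 4., 5., 6., 9., 11.])
x_dot = np.array([0., 0., 0., 0., 1., 1.])

# Regress Y on the indicator alone: Y = a + b * x_dot
A = np.column_stack([np.ones_like(y), x_dot])
a, b = np.linalg.lstsq(A, y, rcond=None)[0]
# a is the mean of Y over the four complete cases (4.5), and b is the
# missing-minus-nonmissing difference in means (10.0 - 4.5 = 5.5)
```

A significant b would indicate that the cases missing on X differ systematically in Y, evidence against randomness.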
  The use of dummy indicator variables can be easily extended
to the situation in which X is a categorical variable. The only ad-
justment necessary is to regress Y on the missing-data indicator
variable X• first and then on both X• and other dummy variables
representing the (K − 1) categories of X. Cohen and Cohen therefore
argue that it is not only unnecessary to assume randomness in
missing data but also unwise to do so when such easy means of
handling missing data are available (1975: 288).
  It must be noted, however, that their suggestion is of limited
value in most survey research where many variables are ex-
amined in a complex manner, as in path analysis. In path analysis,
a variable is often considered a dependent variable in one con-
text, but an independent variable in another context.
Nevertheless, the use of a dummy indicator variable is very con-
venient in testing whether the type 4 situation exists. If information
is missing on Y because of the underlying values of Y itself (type
4), one would look for another variable (Z) that is closely related to
Y. Then one would regress Z on Y• (the indicator variable) and
test whether the b associated with Y• is significant. One must be
cautious, however, in interpreting a result from such a bivariate
case. Even if the missing cases have different means on some
variables, it does not mean that the pattern belongs to case 4;
case 2 can produce significant differences if the bivariate
association between the two variables is strong. As will be shown
later, such situations can be exploited as means of obtaining ex-
tra information about the missing data.
  Another method suggested by Cohen and Cohen (1975: 286f)
is to examine the relationships among the missing-data indicator
variables to see if there is any significant clustering. Although ap-
plying factor analysis to dichotomous variables is not fully
justified, such an application may still provide means of
identifying important underlying dimensions of the clustering of
missing values when the correlations are moderate (Kim et al.,
1977). Such examination of clustering may help identify the un-
derlying causes of missing data. Even if the nature of missing
data remains obscure, one can at least control for the unknown
effects of missing data more efficiently by including missing-value
scales along with other independent variables in a multivariate
analysis (Cohen and Cohen, 1975: 286f.).
  An examination of the correlation matrix for the set of missing-
data indicator variables can also serve as a quick way of
ascertaining whether there is any unusual clustering between
missing values of two variables. Noting that the correlation be-
tween dichotomous variables (rφ) is equivalent to φ, and that φ²
= χ²/N, one may consult the χ² table with χ² = φ²(N) and
degrees of freedom = 1 (Cramér, 1946: 441-445).
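The relation χ² = φ²N can be checked numerically; the 2 × 2 counts below are invented for illustration:

```python
import math

# Cross-tabulation of two missing-data indicator variables
# (counts invented):        X2 present   X2 missing
#   X1 present                  70           10
#   X1 missing                  15            5
a, b, c, d = 70, 10, 15, 5
n = a + b + c + d

phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
chi_sq = n * phi ** 2   # chi-square with 1 degree of freedom; identical
                        # to Pearson's formula without continuity correction
```

The resulting χ² can be referred to the tabled value with one degree of freedom, as the text suggests.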
  As may have been obvious from the discussion in this section,
the researcher really does not have much recourse in assessing
the nature of the missing data. The final assessment should be
based on substantive knowledge about the content and the cir-
cumstances under which the data were collected.

Listwise and pairwise deletion

  If one is willing to assume (or the test for randomness shows)
that the pattern of missing data does not deviate significantly
from the random model, the easiest options to consider are
listwise and pairwise deletion of missing cases.
  As noted earlier, the disadvantage of listwise deletion is the
relatively greater loss of data. Its advantages are that it always
generates consistent covariance and correlation matrices and
that test statistics used with the complete data can be used
without modification. The advantages and disadvantages of
pairwise deletion are complementary to those of listwise deletion.
The matrix generated by pairwise deletion may not be consistent
(not positive-definite), especially when the missing data pattern is
not random or when the total sample size is small. Furthermore,
the sampling distribution of estimates based on pairwise deletion
usually contains nuisance parameters which are not readily com-
putable (Haitovsky, 1968).
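Pairwise deletion can be sketched directly: each covariance is computed from all cases observed on that particular pair of variables. This is our illustration, not a library routine, and as noted above the resulting matrix is not guaranteed to be positive-definite:

```python
import numpy as np

def pairwise_cov(data):
    """Covariance matrix under pairwise deletion: each entry uses all
    cases observed on both variables of the pair.  NaN marks a
    missing value."""
    k = data.shape[1]
    cov = np.full((k, k), np.nan)
    for i in range(k):
        for j in range(k):
            ok = ~np.isnan(data[:, i]) & ~np.isnan(data[:, j])
            if ok.sum() > 1:
                xi, xj = data[ok, i], data[ok, j]
                cov[i, j] = ((xi - xi.mean()) * (xj - xj.mean())).mean()
    return cov

data = np.array([[1., 2.], [2., 1.], [3., np.nan], [4., 4.]])
C = pairwise_cov(data)   # C[0, 0] uses all 4 cases; C[0, 1] only 3
```

Because different entries rest on different case bases, no single sample can be said to have generated the whole matrix, which is the source of the inconsistency.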
  The choice between the two methods, even in trivial situations,
is not clearly indicated in the literature. Using the matrix based on
pairwise deletion may be close to the spirit of maximum
likelihood solutions proposed for missing data problems (e.g.,
see Haitovsky, 1968; for a summary of earlier literature on max-
imum likelihood solutions, see Anderson, 1957). A specific com-
parison between the two methods was first attempted (as far as
we know) by Buck (1960) in his examination of several methods
for handling missing data. He compared the estimated
coefficients using various methods of handling missing data to
those based on the complete (72 cases) data (Buck, 1960: 305).
Among other things, he showed that listwise deletion produced
results closer to the complete data than pairwise deletion.
Because his conclusion is based on the examination of a single
data set (containing 72 cases and 4 variables from which he ran-
domly deleted a few cases from each variable, resulting in a total
loss of 34 cases) and a single simulation, his conclusion should
not be taken seriously.
  Glasser (1964) argued that the efficiency of pairwise deletion
over listwise deletion improves as the overall correlation among
the independent variables decreases, and may become better
when the sample is large and correlations are below a certain
level, given a fixed pattern of missing values. For example, for the
situation where there are two independent variables and the
-------
proportion of missing values is uniform, the efficiency of es-
timating partial b's from the covariance matrix based on
pairwise deletion is in general better than that based on
listwise deletion if the magnitude of correlation between the two
variables is less than .58 (Glasser, 1964: 839).
  As far as we know, Haitovsky (1968), through the use of com-
puter simulations, has performed the most systematic and exten-
sive comparison, and his conclusion is that listwise deletion is in
general superior to pairwise deletion in the estimation of partial
regression coefficients. He summarizes his findings as follows: "In
almost all the cases which were investigated the former method
(ordinary least squares applied only to complete observations) is
judged superior. However, when the proportion of incomplete
observations is high or when the pattern of the missing entries is
highly non-random, it [...] values to the missing entries should
be applied" (Haitovsky, 1968: 67). Later publications based on
simulations do not consider pairwise deletion, perhaps on the
basis of the evidence presented by Haitovsky (e.g., [...]; Beale
and Little, 1974).
  After carefully examining Haitovsky's simulation model,
however, we conclude that he does not have a model typical of
sociological data, which usually contain only moderate bivariate
correlations and a multiple R usually not beyond .7. Furthermore,
it is not clear at what point the proportion of missing observations
should be considered to be high. Partly for these reasons, we
have made our own simulations, using Blau and Duncan's
correlation matrix among status-attainment variables as the
model (Blau and Duncan, 1967: 169).
  More specifically, we have simulated sampling 1000 cases
from a multivariate-normal population with a correlation matrix
equal to Blau and Duncan's matrix, and we deleted randomly
about 10% of the cases from each variable. Such sampling was
repeated 10 times, and the resulting sample correlation and
covariance matrices were compared with the population model.
The results are presented in Table 1.
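The sampling design just described can be sketched as follows; the correlation matrix is our transcription of the model column of Table 1 (variables in the order Y, W, U, X, V), and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Blau and Duncan's correlations as listed in Table 1 (Y, W, U, X, V):
corr = np.array([
    [1.000, .541, .596, .405, .322],
    [.541, 1.000, .538, .417, .332],
    [.596, .538, 1.000, .438, .453],
    [.405, .417, .438, 1.000, .516],
    [.322, .332, .453, .516, 1.000],
])

# Draw 1000 cases, then delete about 10% of the values of each
# variable at random.
sample = rng.multivariate_normal(np.zeros(5), corr, size=1000)
sample[rng.random(sample.shape) < 0.10] = np.nan

# Listwise base: only the cases complete on all five variables
complete = sample[~np.isnan(sample).any(axis=1)]   # roughly 590 cases
```

Repeating this draw 10 times and averaging the deviations of the sample coefficients from the model values yields the indices reported in Table 1.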
  In contrast to Haitovsky's findings, our simulations indicate that
pairwise deletion performs better than listwise deletion, at
least for the present model. Pairwise deletion produces less
mean deviation from the model with respect to not only the whole
matrix but also every individual coefficient but one (the exception
is underlined in Table 1). The result is about the same whether we
consider the sample correlation matrices or covariance matrices.
To evaluate the fit between the model and the sample covariance
matrix, one may also include variance terms in the calculation of
the index. Since such an inclusion produces similar results, we
have not included them in Table 1.
  The deviation indices in Table 1 alone do not provide full infor-
mation about the effects of employing these two methods of han-
dling missing data. Table 2 contains information from samples
without missing data. Sample correlations and covariances from
samples without missing data deviate from the model to almost
the same extent as the sample coefficients based on pairwise
deletions. Compare, for instance, .0258 from the third column of
Table 1 to .0232 from Table 2. In other words, most of the
variability observed in Table 1 is due to the general sampling
variability and not to the problem of missing data.
  Since Haitovsky's findings were based on the examination of
the unstandardized regression coefficients, we thought it prudent
Table 1. Simulation results, using Blau-Duncan's correla-
tion matrix as the model, sample of 1000, replicated 10
times, from multivariate normal population, and 10% of
cases are made missing randomly from each variable

                      Mean Deviation from the Model

          Blau-        Correlations            Covariances
         Duncan   Listwise   Pairwise    Listwise   Pairwise
          Model   Deletions  Deletions   Deletions  Deletions

   YW     .541      .0285      .0248       .0453      .0331
   YU     .596      .0320      .0202       .0548      .0369
   YX     .405      .0311      .0272       .0418      .0366
   YV     .322      .0266      .0271       .0306      .0327
   WU     .538      .0259      .0223       .0436      .0379
   WX     .417      .0228      .0203       .0267      .0263
   WV     .332      .0354      .0334       .0526      .0452
   UX     .438      .0355      .0254       .0540      .0386
   UV     .453      .0287      .0279       .0447      .0378
   XV     .516      .0292      .0262       .0460      .0424

 Overall
 Deviation
   Index            .0296      .0258       .0449      .0371

Legends:  Y: 1962 Occupational Status
          W: First Job Status
          U: Education
          X: Father's Occupational Status
          V: Father's Education

a. Deviation index is given by: di = sqrt( Σj (Mi − Sij)² / 10 ),
where i refers to the bivariate relationships indicated in column one; Mi refers
to the underlying value from the Blau-Duncan model; Sij refers to the
corresponding sample estimate, where j stands for the 10 different samplings.

b. Overall Deviation Index = sqrt( Σi Σj (Mi − Sij)² / 100 )

Source of the model: (Blau and Duncan, 1967: 169.)
to examine such regression coefficients as well. In Table 3, we
present results based on the direct examination of regression
coefficients, representing the values of the presumed population
model. Then the sample path coefficients (unstandardized) are
compared to these underlying values. The conclusion is the
same; pairwise deletion still fares better than listwise deletion.
  In order to check whether our findings are due to the excessive
amount of missing data, we repeated the simulations while
decreasing the proportion of missing observations on each
variable. Note that listwise deletion would retain only about 590
cases while pairwise deletion would retain about 810 cases when
10% of the cases are randomly missing on each variable. As
shown in Table 4, however, pairwise deletion maintains its
superiority as the proportion of missing values on a variable is
decreased to about 1%.
  The sample size differences between a pairwise deletion and a
listwise deletion are greatly affected by the number of variables
-------

Table 2. Sampling variability as measured by the mean
deviation of sample coefficients from population values,
when there is no missing data a

                Mean Deviation from the Model

  Bivariate
  Relation     Correlations    Covariances

    YW            .0216           .0333
    YU            .0163           .0350
    YX            .0219           .0320
    YV            .0238           .0295
    WU            .0194           .0348
    WX            .0201           .0293
    WV            .0258           .0398
    UX            .0285           .0423
    UV            .0236           .0391
    XV            .0283           .0462

  Overall         .0232           .0361

a. Based on the same samples as presented in Table 1, except
   that coefficients are calculated before some information is
   made randomly missing.

Table 3. Simulation results using the Blau-Duncan correla-
tion matrix: 10 samples of 1000 cases each from a mul-
tivariate normal population where 10% of the cases are mis-
sing randomly from each variable: unstandardized path
coefficients a

                          Mean Deviation from the Model
  Path Coefficient b      Listwise Deletion  Pairwise Deletion

    YW     .2881               .0271              .0308
    YU     .3983               .0348              .0270
    YX     .1205               .0298              .0282
    YV    -.0139               .0378              .0277
    WU     .4326               .0333              .0256
    WX     .2144               .0440              .0398
    WV     .0254               .0467              .0416
    UX     .2784               .0364              .0351
    UV     .3094               .0393              .0369
    XV     .5160               .0365              .0315

  Overall Index                .0370              .0328

a. Where Y is regressed on W, U, X, and V; W is regressed on U, X,
   and V; U is regressed on X and V; X on V, for convenience.
b. See Table 1 for source and legends for the variables.

Table 4. Mean deviations of sample covariances from pop-
ulation covariances for 10 samples of 1000 cases, each
based on the Blau-Duncan correlation matrix, by percent
randomly missing a

  % of Missing
  from each        1%        2%        5%        10%
  Variable       Missing   Missing   Missing   Missing

  Listwise        .0383     .0394     .0426     .0449
  Pairwise        .0363     .0376     .0378     .0371

a. See Table 1 for source of data and description of deviation
   indexes.

Table 5. Overall mean deviation of sample covariances,
when the number of variables are reduced (10% missing
from each variable) a, b

                   2           3           4           5
               Variables   Variables   Variables   Variables

  Listwise
  Deletion      .0376 c      .0433       .0393       .0453

  Pairwise
  Deletion      .0376 c      .0373       .0370       .0375

a. See Table 1 for source of data and description of deviation
   indexes.
b. The variables used in these simulations are: Y and U; Y, U,
   and W; Y, X, U, W; and all five.
c. The two methods are equivalent.
involved. With 10% missing on each variable (randomly), listwise
deletion with a five-variable data set would lose about 410 cases
out of 1000, but would lose only 270 cases with a three-variable
data set. In contrast, pairwise deletion would retain a data base of
810 cases regardless of the number of variables involved. Table 5
illustrates the effects of such changes in the data base as we
decrease the number of variables. As expected from the result of
Table 4, pairwise deletion performs better.
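The case counts above follow directly from the 10% missing rate; a quick check of the arithmetic:

```python
# Expected complete cases under listwise deletion with 10% missing
# (randomly) on each of k variables, out of 1000 cases:
n = 1000
retained_listwise = {k: round(n * 0.9 ** k) for k in (2, 3, 5)}
# {2: 810, 3: 729, 5: 590} -> losses of about 190, 270, and 410 cases.
# Pairwise deletion works from the ~810 cases complete on each pair of
# variables, regardless of how many variables the analysis involves.
```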
  Although we cannot generalize our finding because it is based
on a fixed (large) sample size (1000 cases) and a fixed (moderate)
correlation matrix, it is clear that the use of pairwise deletion can
be better than the use of listwise deletion for a certain type of data
and that it is premature to close the issue on the basis of
Haitovsky's simulations.9 For survey researchers with a relatively
large data set, where the strengths of the bivariate associations
are moderate, pairwise deletion should remain a viable option,
provided the observations are missing randomly. But if one is in-
terested in retaining as many cases as possible, as in index con-
struction, the use of pairwise deletion in the stage of parameter
estimation (such as factor loadings) will not be of much help. One
must find ways of removing missing values before constructing a
-------
composite index; otherwise the data loss will be as severe as in
the case of listwise deletion.

Estimating missing values

  There are three main reasons why a researcher might con-
sider replacing missing information with some estimate: (1)
to simplify calculation of statistics, (2) to improve parameter
estimation, or (3) to retain as many cases as possible in con-
structing scales out of many variables.
  The first reason has become trivial as researchers rely in-
creasingly on computers for their calculations. But there are
situations in which ease of presentation and/or interpretation
may justify using such a method even with a computer
(Draper and Stoneman, 1964). An example in point has
already been given in the section on the use of dummy in-
dicator variables; when the mean values are used in the place
of the missing values of X, the partial regression coefficients
for the indicator variable X• and X′ (with the mean replaced for
missing values) become equivalent to simple regression
coefficients. Of course, it also implies that a simple regression
of Y on X′ will be equivalent to a regression based on the m
cases only (Wilks, 1932; also see Afifi and Elashoff, 1966;
Cohen and Cohen, 1975). But this does not mean that the
correlation between the two for the m cases will be the same,
because the variance of X′ will be less than the variance of X
for the first m cases.10
  Another example is found when the orthogonality of an ex-
perimental design is destroyed by the existence of some mis-
sing data and the use of neutral values can simplify the
calculation and presentation of the result (Cochran and Cox,
1957).

  Maximum utilization of information
  Reexamination of pattern A (below) will reveal that neither
listwise deletion nor pairwise deletion uses all the available infor-
mation. For the estimation of a covariance, both procedures use
the same data base and therefore are equivalent in this case. If
there is some relationship between Y and X,

X: x1, x2, ..., xm, ..., xN
Y: y1, y2, ..., ym              (Pattern A)

and missing values are independent of the values of Y given
the values of X (therefore, the observed covariation between
Y and X is unaffected in the long run by the existence of miss-
ing values in Y), then the X values for the last (N − m) cases
contain some information about the possible values of the
missing Ys.11
  Estimates based on the utilization of all the available data
are given by (Wilks, 1932; Anderson, 1957):

m̂x = Σ Xi / N             (i = 1, 2, ..., N)
m̂y = m′y + byx (m̂x − m′x)
ŝx² = Σ (Xi − m̂x)² / N    (i = 1, 2, ..., N)

where m′x, m′y, and the regression coefficient byx are computed
from the first m (complete) cases.
                                                           In other words, estimates of the mean and variance of Y can be
                                                           improved if information on X and the relationships between Y and
                                                           X are utilized. The solutions above are derived using the least-
                                                           squares principle, but they will be maximum-likelihood solutions
                                                           if the underlying  population distribution is bivariate normal and
                                                           these estimates  are in general more efficient  than estimates
                                                           based on only the first m cases (e.g., Wilks, 1932; Anderson
                                                           1957).
   It must be noted that if there is no underlying association between Y and X, no information would be gained by following this type of strategy; the stronger the association between the two variables, the greater the gain in efficiency (see Little, 1976b).
   In arriving at the more efficient estimates of parameters, the values of missing Ys are not actually replaced by the estimated values. Such estimation of parameters is possible only because the missing data pattern A is extremely simple. We turn now to a more complex pattern of missing data.
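As a numerical sketch of this pattern-A strategy, the improved estimate of the mean of Y can be computed as follows; the data and variable names below are hypothetical illustrations, not figures from the paper:

```python
def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance (divisor n - 1)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

# Hypothetical pattern-A data: X observed for all N cases,
# Y observed only for the first m cases.
x = [2.0, 3.0, 5.0, 6.0, 4.0, 8.0, 1.0, 7.0]   # N = 8
y = [2.1, 2.9, 5.2, 6.1]                        # m = 4; the rest are missing

m, N = len(y), len(x)

# Regression slope of Y on X from the m complete cases.
b_yx = cov(x[:m], y) / cov(x[:m], x[:m])

# Improved estimate of the mean of Y, using all N values of X.
my_hat = mean(y) + b_yx * (mean(x) - mean(x[:m]))
```

Because the mean of X over all N cases (4.5) exceeds its mean over the complete cases (4.0), the estimate of the mean of Y is pulled upward from 4.075 accordingly.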

                                                           Estimation of missing  values and iterative solutions
                                                              Consider now  two relatively more complex missing data pat-
                                                           terns, B and C. In pattern B:
X: x_1, x_2, ..., x_m, x_{m+1}, ..., x_{m+n}, *, ..., *
Y: y_1, y_2, ..., y_m, *, ..., *, y_{m+n+1}, ..., y_N          (Pattern B)

(* represents missing data)
missing values are present on both X and Y. We would normally
delete the cases with no valid information and consider only the
first N cases. In pattern C, we present only abbreviated data:

  Types:  1  2  3  4  5  6  7  8
      X:  /  /  /  /  *  *  *  *
      Y:  /  /  *  *  /  /  *  *
      Z:  /  *  /  *  /  *  /  *     (Pattern C)

(/ indicates nonmissing data; * indicates missing data)

There are 2^p (where p = the number of variables; here 2^3 = 8) types of
data. Of these potential types, only the last one does not contain
any information. Therefore, we would normally consider only the
first (8 - 1) types, deleting the last type from the analysis. In
particular, note that type 7 would have been deleted in a bivariate
context, and that in general, the greater the number of variables
under consideration (provided they are all associated with each
other to some extent), the greater the utilization of existing
information.
  Returning to pattern B, and extending the strategies used in
dealing with the pattern A data, we may try to estimate underlying
parameters from the examination of two different combinations
of data: (1) use the first (m + n) cases to retrieve the lost in-
formation due to n missing values on Y, and (2) use the (m + w)
cases with valid information on Y to retrieve the lost information
due to missing values on X. But the

-------
  resulting estimates are likely to disagree. A general solution to
situations like B, as well as C, is as follows (Beale and Little,
1974):
  1. Use available information to estimate missing values;
  2. after replacing the missing values with estimated values, es-
  timate the parameters from the data containing estimated values
  (with  proper adjustment);
  3. re-estimate  the  missing  values, using  the estimated
  parameters given in step 2;
  4. repeat the process 1 through 3 until the estimated values con-
  verge.

  More specifically, the steps are:

  1. Estimate regression coefficients for each type of missing data,
while using all the available information. Then insert the predicted
values based on the regression in the place of missing data (for
convenience, one can use the covariance matrix based on
listwise deletion for the initial estimation of missing values).
  2. Then estimate the parameters, mean (m_j), variance, and
covariance (V_jk), by:

   m_j = Σ X̂_ij / N,   (i = 1, 2, ..., N)
   V_jk = Σ [(X̂_ij - m_j)(X̂_ik - m_k) + V_jk·Pi] / N,   (i = 1, 2, ..., N)

where V_jk·Pi refers to the partial covariance between variables j and
k, while Pi represents the other variables in the set with valid infor-
mation on case i (X̂ is the observed value if not missing, and the
estimated value otherwise).
  3. Repeat the process until convergence. The only aspect requir-
ing comment is the adjustment term V_jk·Pi in step 2. This is the
term that vanishes unless information is missing on both
variables j and k (which implies, of course, that in calculating a variance
V_jj the term must appear whenever information is missing on
the one variable). This adjustment term is necessary because the es-
timated values reduce the variance of a variable and reduce or in-
crease the covariance depending upon the direction of the partial
relationship between j and k (see, e.g., Buck, 1960; Beale and Lit-
tle, 1974).
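The fill-and-re-estimate cycle of steps 1 through 3 can be sketched for the bivariate case as follows; the data are hypothetical, and for brevity the sketch omits the V_jk·Pi variance adjustment discussed above, so it illustrates only the iteration itself:

```python
def mean(v):
    return sum(v) / len(v)

def regress(u, v):
    """Least-squares slope and intercept of v on u."""
    mu, mv = mean(u), mean(v)
    b = (sum((a - mu) * (c - mv) for a, c in zip(u, v))
         / sum((a - mu) ** 2 for a in u))
    return b, mv - b * mu

# Hypothetical bivariate data; None marks a missing entry.  Cases with
# no valid information on either variable are assumed already deleted.
x = [1.0, 2.0, 3.0, 4.0, None, 6.0, None, 8.0]
y = [1.2, 1.9, None, 4.2, 5.1, None, 7.1, 7.9]

def mean_fill(v):
    m = mean([a for a in v if a is not None])
    return [a if a is not None else m for a in v]

# Initial fill, then: re-estimate both regressions from the filled data
# and re-predict the missing entries until the fills settle.
fx, fy = mean_fill(x), mean_fill(y)
for _ in range(100):
    bx, ax = regress(fy, fx)          # X on Y
    by, ay = regress(fx, fy)          # Y on X
    nfx = [a if a is not None else ax + bx * fy[i] for i, a in enumerate(x)]
    nfy = [a if a is not None else ay + by * fx[i] for i, a in enumerate(y)]
    shift = max(abs(p - q) for p, q in zip(nfx + nfy, fx + fy))
    fx, fy = nfx, nfy
    if shift < 1e-10:
        break
```

After convergence, each filled value lies on the current regression line through the observed value of the other variable; a full implementation would also carry the adjustment term when estimating variances and covariances from the filled data.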
  For the demonstration that such a solution leads to the max-
imum likelihood solution proposed by Orchard and Woodbury
(1972) when multivariate normality is assumed, and that this
iterated solution is in general superior to other methods, Beale
and Little (1974) should be consulted. Among the methods Beale
and Little (1974) examine are (1) ordinary regression solutions
based on listwise deletion and (2) the method of using regression
estimates of missing values by Buck (1960). Unfortunately, they
did not consider the regression solution based on pairwise dele-
tion, relying partly on the evidence presented by Haitovsky (1968).
On the basis of our simulation presented earlier, we must con-
sider for now that the relative merits of pairwise deletion and the
iterative solution are unknown for a large sample with moderate
correlations.
 Estimation of missing values without iteration

  For those who have no access to custom computer program-
ming, we will mention a few other procedures of estimating miss-
ing values and parameters:
1. assigning regression estimates to missing values and es-
timating parameters with some adjustment (Buck, 1960);
2. assigning regression estimates plus a random component with
expected variance equal to the residual variance (V_jj·Pi), then es-
timating parameters using ordinary regression routines;
3. estimating the missing values by using principal component
transformation, then estimating parameters (by Dear, cited in
Afifi and Elashoff, 1966; and Timm, 1970).

  The first procedure above is to use regression estimates as
one would do in the first step of the iterative solution but without itera-
tions. The accuracy of the first estimate is more critical in this
method than in the iterative solution. Therefore, one may not use
the matrix based on listwise deletion but should use a different
matrix for every type of missing value pattern. This method was
found to be generally superior to assigning means to missing
values or to Dear's method, but not in every case (see Timm,
1970). One point needing special attention is that variances and
covariances calculated from the variables (with estimated values
replacing the missing values) have to be corrected for bias (as
was the case in iterative solutions).
  The second procedure is a modification of Buck's. Instead of
replacing the missing values with regression estimates, this
method replaces them with regression estimates plus a random
component in order to simulate the degree of residual variance
existing in the data:

      Value to use in the place of the missing value on X =
      X̂ + (error of estimate) × (random normal number)

Because the random component reintroduces the expected
residual variation in the predicted data, one can use ordinary
regression algorithms to estimate various parameters. That is, it
obviates the need for adjustment in the estimation of parameters.
At least one popular statistical package, SPSS (Nie et al., 1975),
allows the generation of random numbers to be used in such
situations along with the variables under consideration.
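A minimal sketch of this second procedure, with hypothetical data: the regression is fit to the complete pairs, and each missing Y receives its prediction plus a random normal draw scaled by the error of estimate.

```python
import random

random.seed(0)

def mean(v):
    return sum(v) / len(v)

# Hypothetical data: X complete, Y missing (None) for some cases.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.3, 1.8, None, 4.1, 5.2, None, 6.8, 8.1]

pairs = [(u, v) for u, v in zip(x, y) if v is not None]
xs, ys = zip(*pairs)
mx, my = mean(xs), mean(ys)
b = (sum((u - mx) * (v - my) for u, v in pairs)
     / sum((u - mx) ** 2 for u in xs))
a = my - b * mx

# Error of estimate (residual standard deviation, n - 2 d.f.).
see = (sum((v - (a + b * u)) ** 2 for u, v in pairs) / (len(pairs) - 2)) ** 0.5

# Missing Y -> regression estimate plus a scaled random normal component.
y_filled = [v if v is not None else a + b * u + see * random.gauss(0, 1)
            for u, v in zip(x, y)]
```

The random draws restore the residual scatter that plain regression estimates would suppress, which is why the filled data can then be fed to ordinary regression routines without a variance adjustment.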
  The third procedure applies principal-components analysis to
the correlation matrix based on listwise deletion. The principal
components for the missing values are then estimated from the
existing data, using the weights given by the principal-
components analysis. Next, the principal components are
transformed back to raw data. Then the raw data (with the es-
timated values included) are used for parameter estimation (see
Afifi and Elashoff, 1966; and Timm, 1970). Timm's (1970) simula-
tions support this solution as preferable to Buck's only when
the number of variables is relatively large in relation to the sample
size, correlations are not moderate, and the missing information
is substantial. Overall, however, he finds that this procedure is in-
ferior to Buck's.
  Finally, although all the procedures for estimating missing
values involve considerably more complex computations, they
have one definite advantage over simple solutions such as
pairwise and listwise deletion: in constructing scales out of many
variables, the estimated values can be used in the place of miss-
ing values. Furthermore, the estimated values are based on a
more refined use of the available information than simply replac-
ing missing values with the mean, as is found as an option in at
least one packaged program for scale construction (Nie et al.,
1975).

-------
                                                                The treatment of missing data in multivariate analysis
    To recapitulate some salient points from various computer
simulations: Buck (1960) and Haitovsky (1968) find listwise dele-
tion preferable to pairwise deletion. Timm (1970) finds that
Buck's (1960) regression solution is superior in general to the
regression solution based on listwise deletion; Beale and Little
(1974) find the iterative maximum likelihood solution superior to
Buck's regression solution; assigning means to missing values
does not fare well in comparison with any of the other solutions
mentioned above; and Dear's principal components solution ap-
pears to lag behind Buck's regression solution. Against this
background, we find that for a large data set with moderate
correlations, pairwise deletion performs better than
listwise deletion. It is obvious that we need additional studies
comparing these various methods using sociological data.
  Some preliminary results from our own simulations are
presented in Table 6. Most of the data presented in Table 6 are
from earlier tables. The only new information is the performance
of the regression method with a random component compared
with other standard procedures. We find that replacing missing
values by this method does not do as well as pairwise deletion but
is superior to listwise deletion. More importantly, parameter es-
timates based on pairwise deletion and the regression-random
component method are fairly close in their efficiency to the es-
timates based on the complete samples with no missing data.
 Table 6.  Overall comparison of various methods, using the
 Blau-Duncan matrix as model:  10 samples of 1000 cases,
 10% missing on each variable
Method              Mean Deviations from the Model
                 (1)      (2)      (3)      (4)      (5)
Correlation
Matrix         .0497    .0296    .0253    .0286    .0232
Covariance
Matrix         .0891    .0449    .0371    .0388    .0361

Legend:
(1) Means assigned to missing values
(2) Listwise deletion of missing data
(3) Pairwise deletion of missing data
(4) Regression estimate and random component in the place of missing values
(5) Complete sample
Conclusion

  We have tried to abstain from making general assertions as to
the superiority of one method over another because convenience
and feasibility must also be considered in making the final deci-
sion. Furthermore, when the sample size is very large (say 1000
or more), the choice may not make very much difference (Beale
and Little, 1974). It is our hope that the preceding discussion
alerts survey researchers to possible complications arising from
missing data and to the fact that they may be using a less than op-
timal solution to the problem.
  Since we have confined our discussion largely to the mul-
tivariate case where variables are  measured on at least an
interval scale and are considered to be random variables, a few
words are in order concerning other situations not covered in this
paper.
   First, in least-squares regression and analysis of variance, it is
customary to consider the independent variables as fixed con-
stants. When one deals with survey data, however, there is no
compelling reason to consider the independent variables as fixed
constants other than the fact that such an assumption (although
not very realistic) can simplify the derivation of the sampling dis-
tributions of parameter estimates (Johnston, 1972). On the other
hand, if one has a data set with genuinely  fixed independent
variables as in an experimental design, the  researcher should
consult specialized sources (e.g.,  Hartley and Hocking, 1971).
   For situations in which some variables are categorical, see
Hartley and Hocking (1971), Jackson (1968), and Chan and Dunn
(1974). See also Hertel (1976) and the report of the U.S. Bureau of
the Census for a description of the "hot-deck" procedure, which is
especially suited for large data sets such as the national census.
Either because of their reviews of the literature or their generality
of scope, the following sources are especially valuable: Afifi and
Elashoff (1966, 1967, 1969); Hartley and Hocking (1971);
Orchard and Woodbury (1972); Beale and Little (1974); Cohen
and Cohen (1975); and Rubin (1976). Orchard and Woodbury
(1972) contains a classified bibliography. See also Press and
Scott (1974, 1976) for an introduction to the growing literature on
the Bayesian approach. A few recent materials dealing with
specialized topics are included in the bibliography without com-
ment.

Notes
  1. Computer packages, such as SPSS (Nie et al., 1975), have
made it easy for a researcher to choose either a listwise deletion
or a pairwise deletion of missing data by making them standard
options in various multivariate analyses. A pairwise deletion es-
timates bivariate relationships on the basis of cases for which in-
formation is complete for the two variables only, and then con-
structs a synthetic multivariate data matrix from the bivariate
matrix.
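The pairwise-deletion idea in this note can be sketched as follows, with hypothetical data; each correlation is computed from only the cases complete on that particular pair of variables:

```python
def correlation(u, v):
    """Pearson correlation of two complete vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

# Hypothetical three-variable data; None marks a missing entry.
data = {
    "X": [1.0, 2.0, 3.0, None, 5.0, 6.0],
    "Y": [2.1, 1.9, None, 4.0, 5.2, None],
    "Z": [0.9, None, 3.1, 3.8, 5.0, 6.2],
}

def pairwise_r(u, v):
    """Correlation from the cases complete on this pair only."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    us, vs = zip(*pairs)
    return correlation(us, vs)

names = list(data)
corr = {(i, j): pairwise_r(data[i], data[j])
        for i in names for j in names if i != j}
```

Because each entry of the synthetic matrix rests on a different subset of cases, the assembled matrix need not be internally consistent, which is the problem note 2 describes.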
 2. When the correlation or covariance matrix is not consistent,
one may get regression results in which the multiple correlation is
greater than one or less than zero. The statistical results from
such a matrix would be meaningless (see Cohen and Cohen,
1975, for an illustration of the problem).
 3. In this case,  the existence of missing data on the variable of
interest, voting, is due simply to the fact that the respondent was
too young to vote. We may thus view the legal-age restriction on
voting  as  essentially  unrelated   to  the  respondent's overall
propensity for political participation. Yet, at the same time, a
prime research objective may be to construct a  general index of
political participation such that complete information on voting
behavior is necessary.
 4. We do not examine all the possible patterns of missing data
because, in most instances, the expected frequencies for some
of the categories, such as (M1M2M3), will be too small to be used
for χ² testing. For the same reason, variables with too small a
number of cases, producing fewer than 5 cases, may be drop-
ped from the test.
  5. One may further refine χ² tests by introducing Yates's correc-
tion for continuity if the expected frequencies are relatively small
(Cramér, 1946: 445), or may use Fisher's exact test.

-------
  6. See note 2 above.
  7. For the generation of random numbers see Newman and
Odell (1971), and for the generation of population correlation
matrices see Kaiser and Dickman (1962). We have generated
random-normal numbers using the "Super-Duper" program
(from Duke University). After creating samples with a given pop-
ulation correlation matrix according to Kaiser and Dickman's
method, we augmented the data set with an equal number of
variables using the random-uniform number generator provided
by SPSS. The proportion of cases made "missing" on a given
variable was determined by the range of the associated random-
uniform variable. For this reason, the proportion of missing
observations is not fixed but fluctuates slightly from sample to
sample.
  8. We assume the basic model in the population to be mul-
tivariate normal with mean = 0 and variance = 1. However, sam-
ples from such a population would not necessarily have a mean
of 0 and a variance of 1. Therefore, the sample variance and
covariance would be different, although they should be
equivalent in our population model.
  9. We hope to report on further comparisons of these two op-
 tions as well as  other procedures which we are currently in-
 vestigating with the support of  NIMH Grant No. 30407-01.
 10. Hertel (1976) illustrates this point and advocates the re-
placement of regression estimates instead of the mean for the
missing values. But regression estimates also introduce biases
into the calculation of variances and covariances. These points
are discussed in a later section of the text.
 11. Cases 1 and 2 of the missing data patterns would qualify. On
the other hand, if the missing data are produced by the process
described in 5, an attempt to retrieve information would be
erroneous. Other patterns, such as 3 and 4, would allow one to
retrieve some information from the available data but would not
allow unbiased estimation (see Little, 1976b; Rubin, 1976).
 12. This and other problems of missing data are now under in-
vestigation (NIMH Grant No. 30407-01); we will report the results in
the near future.
 13.  Due to special programming chores required to evaluate
 Buck's regression method and the iterative method, we do not yet
 have results comparing the two methods but we hope to report
them soon (see note 12).

 References

 Afifi, A. A. and R. M. Elashoff (1969) "Missing observations in
   multivariate statistics IV: a note on simple linear regres-
   sion." J. of the Amer. Statistical Association 64
   (March):359-365.
 —(1969) "Missing observations in multivariate statistics III:
   large sample analysis of simple linear regression." J. of the
   Amer. Statistical Association 64 (March):337-358.
 —(1967) "Missing observations in multivariate statistics II.
   Point estimation in simple linear regression." J. of the
   Amer. Statistical Association 62 (March):10-29.
 —(1966) "Missing observations in multivariate statistics I.
   Review of the literature." J. of the Amer. Statistical Associa-
   tion 61:595-604.
 Anderson, T. W. (1957) "Maximum likelihood estimates for a
   multivariate normal distribution when some observations
   are missing." J. of the Amer. Statistical Association 52:200-
   203.
 Beale, E. M. and R.J.A. Little (1974) "Missing values in mul-
   tivariate analysis." J. of the Royal Statistical Society, Lon-
   don B 37.
 Blau, P. M. and O. D. Duncan (1967) The American Oc-
   cupational Structure. New York: John Wiley.
 Bloomfield, P. (1970) "Spectral analysis with randomly miss-
   ing observations." Royal Statistical Society, London B:
   369-380.
 Box, M. J., N. R. Draper, and W. G. Hunter (1970) "Missing
   values in multi-response non-linear model fitting." Tech-
   nometrics 12 (August):613-620.
 Buck, S. F. (1960) "A method of estimation of missing values
   in multivariate data suitable for use with an electronic com-
   puter." Royal Statistical Society, London B 22:302-306.
 Chan, L. S. and O. J. Dunn (1974) "A note on the asymptotic
   aspect of the treatment of missing values in discriminant
   analysis." J. of the Amer. Statistical Association 69 (Sep-
   tember):672-673.
 Chow, G. C. and An-Loh Lin (1976) "Best linear unbiased es-
   timation of missing observations in an economic time
   series." J. of the Amer. Statistical Association 71
   (September):719-721.
 Cochran, W. G. and G. M. Cox (1957) Experimental Designs.
   New York: John Wiley.
 Cohen, J. and P. Cohen (1975) Applied Multiple Regres-
   sion/Correlation Analysis. New York: Erlbaum.
 Cramér, H. (1946) Mathematical Methods of Statistics. Prin-
   ceton: Princeton Univ. Press.
 Dagenais, M. G. (1974) "Multiple regression analysis with in-
   complete observations, from a Bayesian viewpoint." Stud.
   in Bayesian Econometrics and Statistics.
 —(1971) "Utilization of incomplete observations in regression
   analysis." J. of the Amer. Statistical Association 66
   (March):93-98.
 Draper, N. R. and D. M. Stoneman (1964) "Estimating missing
   values in unreplicated two-level factorial and fractional
   factorial designs." Biometrics 20 (September):443-458.
 Glasser, M. (1964) "Linear regression analysis with missing
   observations among the independent variables." J. of the
   Amer. Statistical Association 59:834-844.
 Goodman, L. A. (1968) "The analysis of cross-classified data:
   independence, quasi-independence and interactions in
   contingency tables with or without missing entries." J. of
   the Amer. Statistical Association 63 (December):1091-
   1131.
 Fienberg, S. E. (1970) "Quasi-independence and maximum
   likelihood estimation in incomplete contingency tables." J.
   of the Amer. Statistical Association 65 (December):1610-
   1616.
 Haitovsky, Y. (1968) "Missing data in regression analysis."
   Royal Statistical Society, London B 30:67-82.
 Hartley, H. O. and R. R. Hocking (1971) "The analysis of in-
   complete data." Biometrics 27 (December):783-823.
 Hartwell, T. D. and D. W. Gaylor (1973) "Estimating variance
   components for two-way disproportionate data with miss-
   ing cells by the method of unweighted means." J. of the Amer.
   Statistical Association 68 (June):379-383.

-------
 Hertel, B. R. (1976) "Minimizing error variance introduced by
   missing data routines in survey analysis." Soc. Methods
   and Research 4 (May):459-474.
 Hocking, R. R. and H. H. Oxspring (1971) "Maximum
   likelihood estimation with incomplete observations in
   regression analysis." J. of the Amer. Statistical Association
   66 (March):65-70.
 Hocking, R. R. and W. B. Smith (1968) "Estimation of
   parameters in the multivariate normal distribution with
   missing observations." J. of the Amer. Statistical Associa-
   tion 63 (March):159-173.
 Jackson, E. C. (1968) "Missing values in linear multiple dis-
   criminant analysis." Biometrics 24 (December):835-844.
 Johnston, J. (1972) Econometric Methods. New York:
   McGraw-Hill.
 Kaiser, H. F. and K. Dickman (1962) "Sample and population
   score matrices and sample correlation matrices from an
   arbitrary population correlation matrix." Psychometrika
   27:179-182.
 Kelejian, H. H. (1969) "Missing observations in multivariate
   regression: efficiency of a first-order method." J. of the
   Amer. Statistical Association 64:1609-1616.
 Kim, Jae-On, N. H. Nie, and S. Verba (1977) "A note on factor
   analyzing dichotomous variables: the case of political
   participation." Pol. Methodology (Spring):39-62.
 Lin, P. E. (1973) "Procedures for testing the difference of
   means with incomplete data." J. of the Amer. Statistical
   Association 68 (September):699-703.
 —(1971) "Estimation procedures for difference of means with
   missing data." J. of the Amer. Statistical Association 66
   (September):634-635.
 —and L. E. Stivers (1975) "Testing for equality of means with
   incomplete data on one variable: a Monte Carlo study." J.
   of the Amer. Statistical Association 70 (March):190-193.
 Little, R.J.A. (1976a) "Comments on paper by D. B. Rubin."
   Biometrika 63, 3:590-591.
 —(1976b) "Inference about means from incomplete mul-
   tivariate data." Biometrika 63:593-604.
 McDonald, L. (1971) "On the estimation of missing data in the
   multivariate linear model." Biometrics 27 (September):
   535-543.
 Mehta, J. S. and P.A.V.B. Swamy (1973) "Bayesian analysis of
   a bivariate normal distribution with incomplete ob-
   servations." J. of the Amer. Statistical Association 68 (De-
   cember):922-927.
 Morrison, D. F. (1971) "Expectations and variances of maximum
   likelihood estimates of multivariate normal distribution
   parameters with missing data." J. of the Amer. Statistical
   Association 66 (September):602-604.
 Newman, J. G. and P. L. Odell (1971) The Generation of Random
   Variates. New York: Hafner Publishing.
 Nie, N. H., C. H. Hull, J. G. Jenkins, K. Steinbrenner, and D. H.
   Bent (1975) Statistical Package for the Social Sciences. New
   York: McGraw-Hill.
 Orchard, T. and M. A. Woodbury (1972) "A missing information
   principle: theory and applications." Proceedings of the Sixth
   Berkeley Symposium on Mathematical Statistics and
   Probability, Theory of Statistics. Univ. of California Press.
 Press, S. J. and A. J. Scott (1976) "Missing variables in Bayesian
   regression, II." J. of the Amer. Statistical Association 71
   (June):366-369.
 —(1974) "Missing variables in Bayesian regression." Studies in
   Bayesian Econometrics and Statistics. Amsterdam: North
   Holland:259-272.
 Rubin, D. B. (1976) "Comparing regressions when some predic-
   tor values are missing." Technometrics 23 (May):201-205.
 —(1976) "Inference and missing data." Biometrika 63, 3:581-592.
 —(1974) "Characterizing the estimation of parameters in
   incomplete-data problems." J. of the Amer. Statistical
   Association 69, 346:467-474.
 Timm, N. H. (1970) "The estimation of variance-covariance and
   correlation matrices from incomplete data." Psychometrika
   35 (December):417-437.
 U.S. Bureau of the Census (1970) 1970 Census User's Guide.
   1:26-28.
 Wilks, S. S. (1932) "Moments and distributions of estimates of
   population parameters from fragmentary samples." Annals of
   Mathematical Statistics 3 (August):163-195.
 Woodbury, M. A. (1971) "Discussion of paper by Hartley and
   Hocking." Biometrics 27 (December):808-823.

Jae-On Kim is Associate Professor of Sociology at the
University of Iowa. His areas of interest are political sociology,
social stratification, and quantitative methodology.

James Curry is a Ph.D. candidate in Sociology at the University
of Iowa. He is interested in social stratification and quantitative
methodology.

Reprinted from SOCIOLOGICAL METHODS & RESEARCH,
Vol. 6, No. 2, November 1977, by permission of the authors.
                   Supplement:  estimating missing data with SPSS
                     Jean Jenkins
                       SPSS, Inc.

   Professor Kim discussed various methods for estimating
missing data. One which is effective and also relatively inexpen-
sive to use is assigning regression estimates plus a random nor-
mal component. (You will find this in Professor Kim's paper as
method (2) under "Estimation of missing values without itera-
tion.")
   In the output that follows we show you how to perform this

-------

                                 HANDOUT
      NOTES ON LARSEN'S PROCEDURES BASED ON THE LOGNORMAL DISTRIBUTION
Normal Distribution:

    f(y) = [1/(√(2π) σ)] exp[-(y - μ)²/(2σ²)],   -∞ < y < ∞

Lognormal Distribution:  X = e^Y, or ln X = Y

    f(x) = [1/(√(2π) σ x)] exp[-(ln x - μ)²/(2σ²)],   0 < x < ∞

                         1.  Parameter Relationships

    E(X^r) = E(e^{rY}) = exp(rμ + r²σ²/2)

    => E(X) = exp(μ + σ²/2),                                 (1)

    Var(X) = E(X²) - [E(X)]²
           = exp(2μ + 2σ²) - exp(2μ + σ²),                   (2)

    med(X) = e^μ.                                            (3)

Note that (1), (3) => med(X) < E(X) if σ > 0.
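Relationships (1) through (3) can be checked numerically. The sketch below computes the implied moments for the assumed illustrative values μ = 0, σ = 0.5 (not values from the handout) and cross-checks them by Monte Carlo simulation:

```python
import math
import random

random.seed(42)
mu, sigma = 0.0, 0.5   # assumed illustrative parameters

# Moments implied by relationships (1)-(3).
mean_x = math.exp(mu + sigma ** 2 / 2)                                    # (1)
var_x = math.exp(2 * mu + 2 * sigma ** 2) - math.exp(2 * mu + sigma ** 2)  # (2)
med_x = math.exp(mu)                                                      # (3)

# Monte Carlo cross-check with a large lognormal sample.
xs = [random.lognormvariate(mu, sigma) for _ in range(200_000)]
m = sum(xs) / len(xs)
v = sum((x - m) ** 2 for x in xs) / len(xs)
med = sorted(xs)[len(xs) // 2]
```

For any σ > 0 the computed median e^μ falls below the mean exp(μ + σ²/2), as the note following (3) asserts.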

-------
                         2.  Estimation of Parameters

Let c_1, c_2, ..., c_n be a sample of n observations from the given lognormal dis-
tribution.  Then, let

    Ê(X) = m = Σ c_i / n,        V̂ar(X) = S² = (1/n) Σ (c_i - m)²,

    μ̂ = (1/n) Σ ln c_i,          σ̂² = (1/n) Σ (ln c_i - μ̂)².
We define:

    geometric mean = m_g = (c_1 c_2 ··· c_n)^{1/n},   so that μ̂ = ln m_g;

    standard geometric deviation = S_g = e^σ̂,   so that σ̂ = ln S_g.



          From (1),

    m̂ = exp(μ̂ + ½σ̂²) = exp[ln m_g + ½(ln S_g)²],

or

    m̂ = m_g exp[0.5(ln S_g)²].                               (4)
          From (2),

    S² = exp(2μ̂ + 2σ̂²) - m² = exp[2(ln m_g) + 2(ln S_g)²] - m²
       = m_g² exp[2(ln S_g)²] - m² = m² exp[(ln S_g)²] - m²,

or

    (ln S_g)² = ln(S²/m² + 1).                                (5)
          From (3),

    med(X) = e^μ̂ = m_g,

so that the estimate of the median of a lognormal distribution is the geometric
mean of the data.

From (4), it follows that m_g < m̂ if S_g > 1 (i.e., if σ̂ > 0).
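These estimators can be sketched as follows for a small hypothetical sample of concentrations (the values are illustrative, not from the handout):

```python
import math

# Hypothetical sample of pollutant concentrations (ppm).
c = [0.03, 0.05, 0.08, 0.04, 0.06, 0.10, 0.05, 0.07]

n = len(c)
logs = [math.log(v) for v in c]
mu_hat = sum(logs) / n                                          # = ln m_g
sigma_hat = (sum((l - mu_hat) ** 2 for l in logs) / n) ** 0.5   # = ln S_g

m_g = math.exp(mu_hat)     # geometric mean (estimated median)
S_g = math.exp(sigma_hat)  # standard geometric deviation

# Estimated arithmetic mean via equation (4).
m_hat = m_g * math.exp(0.5 * math.log(S_g) ** 2)
```

Note that exp of the mean log reproduces the product-form definition of the geometric mean, and that m_g falls below the estimated arithmetic mean whenever S_g exceeds 1.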

-------
                          3.  Graphical Procedures


               From Figure 3, we see that:


1)  m_g is the ordinate value corresponding to an abscissa value of 50% cumula-
    tive frequency (i.e., the 50th percentile point), or Z = 0.


2)  S_g is the ratio of the 16th percentile concentration value to the 50th
    percentile ordinate value.  This follows from properties of the normal
    distribution:

        (μ + σ) - μ = ln(m_g S_g) - ln m_g = ln(m_g S_g / m_g) = ln S_g.

    Note that ln S_g is the slope of the line.

         In general, the slope (ln S_g) can be calculated using any two pairs of
    points, say (Z_i, ln c_i) and (Z_h, ln c_h), as

        ln S_g = (ln c_h - ln c_i) / (Z_h - Z_i),

    so that

        S_g = exp[(ln c_h - ln c_i) / (Z_h - Z_i)].           (6)

    For example, for 1-hour averages, an SO2 concentration of 0.34 ppm was equaled
    or exceeded at the Washington CAMP site for 0.10% of the measurements during
    the period 12/1/61 - 12/1/68, and a concentration of 0.06 ppm was equaled or
    exceeded for 30%.  Substituting this information into equation (6), we have

        S_g,1-hr = exp[ln(0.34/0.06) / (3.09 - 0.52)] = 1.96;
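Equation (6) and this Washington example can be reproduced directly:

```python
import math

def s_g(c_h, c_i, z_h, z_i):
    """Standard geometric deviation from two percentile points, eq. (6)."""
    return math.exp(math.log(c_h / c_i) / (z_h - z_i))

# Washington CAMP example: 0.34 ppm exceeded for 0.10% of measurements
# (Z = 3.09) and 0.06 ppm exceeded for 30% (Z = 0.52).
sg_1hr = s_g(0.34, 0.06, 3.09, 0.52)
print(round(sg_1hr, 2))   # 1.96
```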

-------
    Table 11 gives the plotting positions of various percentiles in terms of
    the number of standard deviations between these percentiles and the median
    (top abscissa of Figure 3).

3)  m is approximately equal to the ordinate value corresponding to the 30th
    percentile point, or Z = 0.84.


4)  For standard setting purposes, Larsen recommends estimating the annual
    maximum 1-hour concentration as the "ordinate value corresponding to the
    0.10 percentile point."  If f is the percentile value associated with a
    particular observation (i.e., the percent of all the n observations which
    are larger than that particular observation) and r is the rank of that obser-
    vation (with 1 being the rank of the largest observation and n the rank of
    the smallest), then

        r = fn/100 + 1

    expresses r as a function of f, with truncating (i.e., rounding down) to the
    nearest integer when necessary.



    For a complete year's data on 1-hour averages (so that n = 365 x 24 = 8,760),
    the 0.10 percentile point corresponds to the ninth largest observation, since

        r = (0.10)(8,760)/100 + 1 = 9.76 ,

    which truncates to 9.
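The rank formula can be sketched in Python (illustrative only; not from the workbook):

```python
import math

def rank_from_percentile(f, n):
    """Rank r of the observation at percentile f (the percent of the n
    observations larger than it): r = f*n/100 + 1, truncated."""
    return math.floor(f * n / 100 + 1)

# 0.10 percentile of a full year of 1-hour averages (n = 8,760):
print(rank_from_percentile(0.10, 8760))  # 9
```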

                 4.  Calculating the Expected Annual Maximum

                     Concentration for a Given Averaging Time.


     The general equation for the type of line plotted in Figure 3 is

        ln c = ln m_g + Z ln S_g ;                                      (7)

note that the values of m_g and S_g will vary with changes in averaging time,
and hence a different straight line will be obtained for each averaging time.
In general, as the averaging time increases, m_g increases in value and S_g
decreases in value.

-------
In order to use equation (7) to determine the value of the expected annual
maximum concentration (say, c_max) for some particular averaging time of
interest, we need to specify the appropriate Z value to use.  A simple
empirical equation (based on examining tables of mean positions of ranked
normal deviates) for locating the percentile point f (and hence the Z value)
of the observation with rank r is

        f = 100(r - 0.40)/n ;                                           (8)

for maximum values, r = 1, so that f = 60/n.  Table 11 gives values of n,
f = 60/n, and Z for various averaging times.




As an example, the plotting position for the highest annual mean concentration
for SO2 based on a 1-hour averaging time is

        f = 60/8,760 = 0.00685 ,

which (from Table 11 or from normal curve area tables) gives Z = 3.81.  Then,
for m_g = 0.042 ppm and S_g = 1.96, we get from (7) that

        c_max,hr = m_g S_g^Z = (0.042)(1.96)^3.81 = 0.55 ppm .
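The whole calculation can be sketched in Python (not from the workbook).  The bisection helper `z_upper` is an assumed stand-in for the normal curve area tables:

```python
import math

def z_upper(f_percent):
    """Standard-normal deviate whose upper-tail area is f_percent/100,
    found by bisection on Phi(z) = 0.5*(1 + erf(z/sqrt(2)))."""
    p = 1.0 - f_percent / 100.0
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def expected_annual_max(m_g, s_g, n):
    """Expected annual maximum via eq. (7): c_max = m_g * S_g**Z,
    with Z taken at the plotting position f = 60/n percent, eq. (8)."""
    z = z_upper(60.0 / n)
    return m_g * s_g ** z

# 1-hour SO2 averages: m_g = 0.042 ppm, S_g = 1.96, n = 8,760
print(round(expected_annual_max(0.042, 1.96, 8760), 2))  # 0.55
```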






                     5.  "Non-Continuous" Data  (small  n)





       For small n, Larsen suggests possibly using the arithmetic mean (m)
and the observed maximum concentration to obtain values for S_g, m_g, and
other indices of interest (e.g., an expected maximum).

       Using (4) and (7), we have

        ln m - 0.5(ln S_g)^2 = ln c_max - Z ln S_g ,

and the resulting quadratic equation in ln S_g, namely

        (ln S_g)^2 - 2Z(ln S_g) + 2 ln(c_max/m) = 0 ,

-------
can be solved for the appropriate root, which is

        S_g = exp{ Z - [Z^2 - 2 ln(c_max/m)]^(1/2) } .                  (9)
For example, suppose that n = 109 24-hour averages are available, that the
observed maximum is c_max = 0.15 ppm, and that m = 0.05 ppm.  Then, from (8),
the percentile point f for c_max is

        f = 100(1 - 0.40)/109 = 0.5505 ,

which (from tables of the standard normal distribution) corresponds to a Z
of 2.54.  Thus, from (9),

        S_g,day = exp{ 2.54 - [(2.54)^2 - 2 ln(0.15/0.05)]^(1/2) } = 1.61 .
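Equation (9) can be checked with a short Python sketch (illustrative; the function name is mine):

```python
import math

def sg_from_mean_and_max(m, c_max, z):
    """Eq. (9): S_g = exp{ Z - sqrt(Z**2 - 2*ln(c_max/m)) },
    the appropriate root of the quadratic in ln S_g."""
    return math.exp(z - math.sqrt(z * z - 2.0 * math.log(c_max / m)))

# n = 109 daily averages, c_max = 0.15 ppm, m = 0.05 ppm, Z = 2.54:
print(round(sg_from_mean_and_max(0.05, 0.15, 2.54), 2))  # 1.61
```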
-------
As an example, frequency distributions for 19 averaging times have been plotted
by computer as the "+" points in Figure 4.  Measured maximum and minimum
values have been plotted as triangles.  Each averaging time is about twice the
previous one; thus, when plotted on logarithmic paper, they are approximately
equally spaced along the abscissa.
       Examination of these patterns and other results leads to the following
model characteristics:
1.   Concentrations are (approximately) lognormally distributed for all averag-
     ing times.
2.   The median concentration (namely, m_g) is proportional to averaging time
     (t) raised to an exponent, and thus plots as a straight line on logarithmic
     paper; i.e.,

        m_g ∝ t^p .

     When t = 1 hr, m_g = m_g,hr, so that m_g = m_g,hr t^p.
3.   The arithmetic mean concentration (m) is the same for all averaging times.
4.   For the largest averaging time calculated (usually 1 year), m, m_g, c_max,
     and c_min are all equal (since n = 1).
5.   The maximum concentration c_max is approximately inversely proportional
     to averaging time raised to an exponent for averaging times of less than
     1 month; i.e.,

        c_max ∝ t^q ,  q < 0 .

     When t = 1 hr, c_max = c_max,hr, so that c_max = c_max,hr t^q.

-------
                   6a.  Calculating S_g for one averaging
                     time when it is known for another.

     From model characteristic #2, the slope p of the straight line can be
determined for averaging time t_a as

        p = (ln m - ln m_ga)/(ln t_tot - ln t_a) = ln(m/m_ga)/ln(t_tot/t_a) ,

where t_tot = total averaging time (usually 1 year) so that m = m_g,tot, and
m_ga is the geometric mean for averaging time t_a.  (Note that t_tot and t_a
must be expressed in the same units.)  Similarly, for averaging time t_b,
p = ln(m/m_gb)/ln(t_tot/t_b).  Using equation (4), it then follows that

        0.5(ln S_ga)^2 / ln(t_tot/t_a) = 0.5(ln S_gb)^2 / ln(t_tot/t_b) ,

or

        S_gb = S_ga^(v^(1/2)) , where v = ln(t_tot/t_b)/ln(t_tot/t_a) .  (11)

Thus, expression (11) can be used to obtain S_gb, given values for S_ga, t_a,
and t_b.
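Equation (11) translates directly into code.  The sketch below (not from the workbook; the function name is mine) assumes all times are expressed in hours, with t_tot = 8760:

```python
import math

def sg_convert(s_ga, t_a, t_b, t_tot=8760.0):
    """Eq. (11): S_gb = S_ga**sqrt(v), where
    v = ln(t_tot/t_b)/ln(t_tot/t_a); all times in the same units."""
    v = math.log(t_tot / t_b) / math.log(t_tot / t_a)
    return s_ga ** math.sqrt(v)

# 1-hr standard geometric deviation of 1.96 converted to 24-hr averages:
print(round(sg_convert(1.96, 1.0, 24.0), 2))  # 1.72
```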
                   6b.  Calculating S_g based on concentration data
                        at two different averaging times.

     Using (11) and appealing to model characteristic #3, we can develop an
equation to calculate S_ga, given concentration values c_a and c_b based on
averaging times t_a and t_b, respectively.  In particular, from (4), we have

        ln m = ln m_g + 0.5(ln S_g)^2 ;

-------
and, using (7), we obtain

        ln m = ln c - Z ln S_g + 0.5(ln S_g)^2 .

Since m is the same for all averaging times (model characteristic #3), we have

        ln m = ln c_a - Z_a ln S_ga + 0.5(ln S_ga)^2

and

        ln m = ln c_b - Z_b ln S_gb + 0.5(ln S_gb)^2 .

If we now equate the right-hand sides of these two equations, and use (11) to
express S_gb in terms of S_ga, we obtain the following quadratic equation in
ln S_ga, namely

        0.5(1 - v)(ln S_ga)^2 - (Z_a - v^(1/2) Z_b) ln S_ga + ln(c_a/c_b) = 0 .

The two roots of this equation are

        ln S_ga = { w ± [w^2 - 2(1 - v) ln(c_a/c_b)]^(1/2) } / (1 - v) ,

where w = Z_a - v^(1/2) Z_b.  Taking the antilogarithm, we obtain the
computational formula

        S_ga = exp( { w ± [w^2 - 2(1 - v) ln(c_a/c_b)]^(1/2) } / (1 - v) ) ; (12)

when c_b > c_a, use the "+" solution, and when c_b < c_a, use the "-" solution.



     As an example, let us look at the three national secondary standards for
SO2.  These three standards have been plotted on Figure 5.  The annual maximum
concentration line has been drawn through the 3- and 24-hour standards.  This
line thus depicts the air quality expected at an air sampling site where these
two concentrations are expected to occur once a year.
     Let us now calculate the value of S_g for such a site for 24-hour averages
using the above results.  Let c_a be the 24-hour standard and c_b be the 3-hour
standard.  Then, c_a = 260 µg/m³ and c_b = 1300 µg/m³.  Then, from (11),

-------
        v = ln(8760/3)/ln(8760/24) = 1.35 .

For c_a, f = 100(1 - 0.40)/365 = 0.1644, which gives Z_a = 2.94.
For c_b, f = 100(1 - 0.40)/2,920 = 0.02055, which gives Z_b = 3.53.
Thus, w = Z_a - v^(1/2) Z_b = 2.94 - (1.35)^(1/2) (3.53) = -1.16.
Then, from (12),

        S_g,day = exp{ [-1.16 + ((-1.16)^2 - 2(1 - 1.35) ln(260/1300))^(1/2)]
                       / (1 - 1.35) } = 7.22 .

From (4) and (7), we have

        m = m_g exp[0.5(ln S_g)^2] = c_a S_g^(-Z_a) exp[0.5(ln S_g)^2]
          = 260(7.22)^(-2.94) exp[0.5(ln 7.22)^2] = 5.5 µg/m³ .

Thus, if the 3- and 24-hour standard concentrations each occurred at a given
site once a year, then the annual arithmetic mean concentration expected at
that site would be 5.5 µg/m³, which is far below the allowable standard of
60 µg/m³.
     Any two air quality standards for a given pollutant can be inserted into
the mathematical model to determine the standard geometric deviation that
occurs when both standards are achieved simultaneously.  This deviation can
be compared with the deviation determined by analyzing aerometric data for a
particular air sampling site.  If the measured deviation is less than that
calculated from the two standards, then the longer averaging time standard
will require the greater source reduction at that site and could be called
the CONTROLLING STANDARD.  If the measured deviation is greater than the one
calculated from the two standards, the shorter averaging time will control.
     If one of the standards is an arithmetic mean, then equation (9) can be
used to determine when the controlling standard switches from the 24-hour SO2
standard of 260 µg/m³ to the annual arithmetic mean standard of 60 µg/m³.

-------
In particular, from (9),

        S_g,day = exp{ 2.94 - [(2.94)^2 - 2 ln(260/60)]^(1/2) } = 1.73 .

Thus, if the standard geometric deviation for 24-hr averages is less than 1.73,
the annual mean standard is the "controlling standard"; if the standard geome-
tric deviation is more than 1.73, the 24-hour standard controls.
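Equation (12) and the example above can be verified with a Python sketch (the function name is mine; carrying full precision through w gives 7.18 rather than the 7.22 obtained in the text with the rounded intermediate w = -1.16):

```python
import math

def sg_from_two_standards(c_a, z_a, c_b, z_b, v):
    """Eq. (12): the S_ga implied when concentrations c_a and c_b
    (with normal deviates z_a, z_b) each occur once a year;
    v = ln(t_tot/t_b)/ln(t_tot/t_a) as in eq. (11)."""
    w = z_a - math.sqrt(v) * z_b
    disc = math.sqrt(w * w - 2.0 * (1.0 - v) * math.log(c_a / c_b))
    root = disc if c_b > c_a else -disc   # "+" root when c_b > c_a
    return math.exp((w + root) / (1.0 - v))

# Secondary SO2 standards: c_a = 260 (24-hr), c_b = 1300 (3-hr), v = 1.35
print(round(sg_from_two_standards(260.0, 2.94, 1300.0, 3.53, 1.35), 2))
```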



                   6c.  Calculating the geometric mean

     Using, as before, the fact that

        (ln m - ln m_ga)/ln(t_tot/t_a) = (ln m - ln m_gb)/ln(t_tot/t_b) ,

we can express m_gb as the following function of m_ga:

        ln m_gb = v ln m_ga + (1 - v) ln m ,

or equivalently

        m_gb = m_ga^v m^(1-v) .

And, from (4), an equivalent expression is

        m_gb = m_ga exp[0.5(1 - v)(ln S_ga)^2] .                        (13)

Equations (6), (7), (11), and (13) have been used, together with the 0.10- and
30-percentile concentrations measured for 1-hour averages, to plot all of the
lines in Figure 4.
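Equation (13) in code form (a sketch, not from the workbook; times in hours, t_tot = 8760 assumed):

```python
import math

def mg_convert(m_ga, s_ga, t_a, t_b, t_tot=8760.0):
    """Eq. (13): m_gb = m_ga * exp[0.5*(1 - v)*(ln S_ga)**2],
    with v = ln(t_tot/t_b)/ln(t_tot/t_a)."""
    v = math.log(t_tot / t_b) / math.log(t_tot / t_a)
    return m_ga * math.exp(0.5 * (1.0 - v) * math.log(s_ga) ** 2)

# Convert the 1-hr geometric mean (0.042 ppm, S_g = 1.96) to 24-hr averages:
print(round(mg_convert(0.042, 1.96, 1.0, 24.0), 3))  # 0.045
```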



                   6d.  Calculating expected annual maximum concentrations

     Finally, from model characteristic #5, the equation

        c_max = (c_max,hr) t^q

can be used to calculate the expected annual maximum concentration for any
averaging time (of less than 1 month), given values for S_g,hr and c_max,hr.
The value of q to use in the above equation can be determined from Table 13.
     For example, the expected annual maximum concentration of SO2 for a
1-month averaging time (730 hrs) can be calculated using Table 13 and the
previously calculated maximum 1-hr concentration of 0.55 ppm.  In particular,

-------
since S_g,hr = 1.96, Table 13 gives q = -0.279, so that

        c_max,mo = (0.55)(730)^(-0.279) = 0.087 ppm ,

which is quite close to the 0.089 ppm value read from the top abscissa of
Figure 4.
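The averaging-time scaling of model characteristic #5 is one line of code (a sketch; q must still be looked up in Table 13 for the 1-hr S_g):

```python
def max_for_averaging_time(c_max_hr, t_hours, q):
    """Model characteristic #5: c_max = c_max_hr * t**q, with t in hours
    and q (negative) read from Table 13 for the 1-hr S_g."""
    return c_max_hr * t_hours ** q

# S_g,hr = 1.96 gives q = -0.279 (Table 13); 1 month = 730 hours:
print(round(max_for_averaging_time(0.55, 730.0, -0.279), 3))  # 0.087
```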

-------
                                 REFERENCE

Larsen, Ralph I. (November 1971).  "A Mathematical Model for Relating Air
Quality Measurements to Air Quality Standards."  EPA, Office of Air Programs,
Research Triangle Park, NC.

-------
 [Figure 3.  Frequency of 1-hour-average sulfur dioxide concentrations equal
 to or in excess of stated values, Washington, D.C., December 1, 1961, to
 December 1, 1968.  (Log-probability plot; bottom abscissa gives frequency in
 percent, top abscissa the number of standard deviations (Z) from the median.)]

-------
      Table 11.  PLOTTING POSITION OF EXTREME CONCENTRATIONS AND PERCENTILES
                             FOR SELECTED AVERAGING TIMES

                                       Plotting position
  Averaging            No. of samples  Frequency (60%/n),  No. of standard
  time        hr         in year       percent of time     deviations (Z)
                                                           from median
  1 sec     0.000278    31,500,000       0.0000019            5.50
  1 min     0.0166         525,000       0.0001142            4.73
  5 min     0.0833         105,000       0.000571             4.39
  8.8 min   0.146           60,000       0.001                4.27
  10 min    0.166           52,500       0.001142             4.24
  15 min    0.25            35,000       0.001715             4.14
  30 min    0.5             17,500       0.00343              3.98
  1 hr      1                8,760       0.00685              3.81
  1.46 hr   1.46             6,000       0.01                 3.72
  2 hr      2                4,380       0.0137               3.63
  3 hr      3                2,920       0.02055              3.53
  8 hr      8                1,095       0.0548               3.26
  12 hr     12                 730       0.0822               3.14
  14.6 hr   14.6               600       0.1                  3.09
  1 day     24                 365       0.1644               2.94
  2 day     48                 183       0.328                2.72
  4 day     96                  91       0.657                2.48
  5.9 day   146                 60       1                    2.33
  7 day     168                 52       1.153                2.27
  14 day    336                 26       2.31                 1.99
  1 mo      730                 12       5                    1.64
  2 mo      1460                 6       10                   1.28
  3 mo      2190                 4       15                   1.04
  6 mo      4380                 2       30                   0.52
  1 yr      8760                 1       50                   0.00
 [Figure 5.  Expected annual maximum sulfur dioxide concentrations at a
 sampling site where the national secondary 3- and 24-hour-standard
 concentrations occur once a year.  (Concentration versus averaging time,
 seconds through 1 year, both axes logarithmic.)]

-------
 [Figure 4.  Computer plot of concentration versus averaging time and
 frequency for sulfur dioxide at site 256, Washington, D.C., December 1,
 1961, to December 1, 1968.  (Plot annotations: geometric mean for 1-hr =
 110 µg/m³ = 0.042 ppm; standard geometric deviation = 1.96; 70 percent of
 hours have data available.)]

-------
    Table 13.  SLOPE OF MAXIMUM CONCENTRATION LINE FOR 1-HR AVERAGE STANDARD
                     GEOMETRIC DEVIATIONS FROM 1.0 THROUGH 4.99

 SGD*   0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
 1.00   0.000  -0.004  -0.008  -0.012  -0.017  -0.021  -0.025  -0.029  -0.034  -0.038
 1.10  -0.042  -0.046  -0.050  -0.054  -0.057  -0.061  -0.065  -0.069  -0.072  -0.076
 1.20  -0.080  -0.083  -0.087  -0.090  -0.094  -0.097  -0.101  -0.104  -0.107  -0.111
 1.30  -0.114  -0.117  -0.121  -0.124  -0.127  -0.130  -0.133  -0.136  -0.139  -0.142
 1.40  -0.145  -0.148  -0.151  -0.154  -0.157  -0.160  -0.163  -0.165  -0.168  -0.171
 1.50  -0.174  -0.176  -0.179  -0.182  -0.184  -0.187  -0.190  -0.192  -0.195  -0.197
 1.60  -0.200  -0.202  -0.205  -0.207  -0.210  -0.212  -0.214  -0.217  -0.219  -0.222
 1.70  -0.224  -0.226  -0.229  -0.231  -0.233  -0.235  -0.238  -0.240  -0.242  -0.244
 1.80  -0.246  -0.248  -0.251  -0.253  -0.255  -0.257  -0.259  -0.261  -0.263  -0.265
 1.90  -0.267  -0.269  -0.271  -0.273  -0.275  -0.277  -0.279  -0.281  -0.283  -0.285
 2.00  -0.287  -0.288  -0.290  -0.292  -0.294  -0.296  -0.298  -0.299  -0.301  -0.303
 2.10  -0.305  -0.307  -0.308  -0.310  -0.312  -0.314  -0.315  -0.317  -0.319  -0.320
 2.20  -0.322  -0.324  -0.325  -0.327  -0.329  -0.330  -0.332  -0.333  -0.335  -0.337
 2.30  -0.338  -0.340  -0.341  -0.343  -0.344  -0.346  -0.347  -0.349  -0.350  -0.352
 2.40  -0.353  -0.355  -0.356  -0.358  -0.359  -0.361  -0.362  -0.364  -0.365  -0.366
 2.50  -0.368  -0.369  -0.371  -0.372  -0.373  -0.375  -0.376  -0.378  -0.379  -0.380
 2.60  -0.382  -0.383  -0.384  -0.386  -0.387  -0.388  -0.390  -0.391  -0.392  -0.393
 2.70  -0.395  -0.396  -0.397  -0.398  -0.400  -0.401  -0.402  -0.403  -0.405  -0.405
 2.80  -0.407  -0.408  -0.410  -0.411  -0.412  -0.413  -0.414  -0.415  -0.417  -0.418
 2.90  -0.419  -0.420  -0.421  -0.422  -0.424  -0.425  -0.426  -0.427  -0.428  -0.429
 3.00  -0.430  -0.431  -0.432  -0.434  -0.435  -0.436  -0.437  -0.438  -0.439  -0.440
 3.10  -0.441  -0.442  -0.443  -0.444  -0.445  -0.446  -0.447  -0.448  -0.449  -0.450
 3.20  -0.451  -0.452  -0.453  -0.454  -0.455  -0.456  -0.457  -0.458  -0.459  -0.460
 3.30  -0.461  -0.462  -0.463  -0.464  -0.465  -0.466  -0.467  -0.468  -0.469  -0.470
 3.40  -0.471  -0.472  -0.473  -0.474  -0.475  -0.476  -0.477  -0.477  -0.478  -0.479
 3.50  -0.480  -0.481  -0.482  -0.483  -0.484  -0.485  -0.485  -0.486  -0.487  -0.488
 3.60  -0.489  -0.490  -0.491  -0.492  -0.492  -0.493  -0.494  -0.495  -0.496  -0.497
 3.70  -0.497  -0.498  -0.499  -0.500  -0.501  -0.502  -0.502  -0.503  -0.504  -0.505
 3.80  -0.505  -0.506  -0.507  -0.508  -0.509  -0.510  -0.510  -0.511  -0.512  -0.513
 3.90  -0.514  -0.514  -0.515  -0.516  -0.517  -0.517  -0.518  -0.519  -0.520  -0.520
 4.00  -0.521  -0.522  -0.523  -0.523  -0.524  -0.525  -0.526  -0.526  -0.527  -0.528
 4.10  -0.529  -0.529  -0.530  -0.531  -0.531  -0.532  -0.533  -0.533  -0.534  -0.535
 4.20  -0.536  -0.536  -0.537  -0.538  -0.539  -0.539  -0.540  -0.541  -0.541  -0.542
 4.30  -0.543  -0.543  -0.544  -0.545  -0.545  -0.546  -0.547  -0.547  -0.548  -0.549
 4.40  -0.549  -0.550  -0.551  -0.551  -0.552  -0.553  -0.553  -0.554  -0.555  -0.555
 4.50  -0.556  -0.556  -0.557  -0.558  -0.558  -0.559  -0.560  -0.560  -0.561  -0.562
 4.60  -0.562  -0.563  -0.563  -0.564  -0.565  -0.565  -0.566  -0.566  -0.567  -0.568
 4.70  -0.568  -0.569  -0.569  -0.570  -0.571  -0.571  -0.572  -0.572  -0.573  -0.574
 4.80  -0.574  -0.575  -0.575  -0.576  -0.576  -0.577  -0.578  -0.578  -0.579  -0.579
 4.90  -0.580  -0.580  -0.581  -0.582  -0.582  -0.583  -0.583  -0.584  -0.584  -0.585

 *The standard geometric deviation for a particular slope is the sum of the
  left and top margin values.

-------
                               EXERCISES

     The following exercises are recommended for increasing proficiency
with the techniques discussed in this paper.  The example used involves
only 11 samples in order to reduce the amount of computation needed.  It
has been assumed that 1-month sulfur dioxide samples have been taken in
front of the U.S. Capitol for each month of 1 year and that the concen-
trations, in µg/m³, from January through December are 300, 250, 180, missing,
150, 60, 120, 100, 140, 160, 190, and 220, respectively.

     1.  What is the arithmetic mean concentration for the year?
     2.  Calculate the geometric mean by using all values.
     3.  What is the maximum 1-month observed concentration, and at what
         frequency should it be plotted on log-probability paper?
     4.  What is the minimum 1-month observed concentration, and at what
         frequency should it be plotted on log-probability paper?
     5.  At what frequency should each of the observed values, from Janu-
         ary through December, be plotted on log-probability paper?
     6.  Plot the 11 values on log-probability paper.
     7.  Use the arithmetic mean and the maximum to calculate the standard
         geometric deviation (a frequency of 5.5 percent is a Z-value of
         1.60).
     8.  Calculate the geometric mean from the calculated standard geometric
         deviation and the maximum.
     9.  Use the answers from 7 and 8 to calculate the standard geometric
         deviation and geometric mean for 1-hour average concentrations.
    10.  Use the answers from 9 to calculate the expected annual maximum
         1-hour concentration.
                                ANSWERS

     1.  m = 170 µg/m³.
     2.  m_g,month = 156 µg/m³.
     3.  c_max,month = 300 µg/m³; f = 5.5%.
     4.  c_min,month = 60 µg/m³; f = 94.5%.
     5.  5.5%, 14.5%, 41.8%, missing, 58.2%, 94.5%, 76.4%, 85.5%, 67.3%,
         50.0%, 32.7%, 23.6%.
     7.  S_g,month = 1.50.
     8.  m_g,month = 157 µg/m³.
     9.  S_g,hr = 2.17; m_g,hr = 126 µg/m³.
    10.  c_max,hr = (126)(2.17)^3.81 ≈ 2,400 µg/m³.
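Answers 1 and 2 can be checked with a short Python sketch (not part of the workbook; `None` marks the missing April value):

```python
import math

# Monthly SO2 concentrations (µg/m³), January through December:
conc = [300, 250, 180, None, 150, 60, 120, 100, 140, 160, 190, 220]
vals = [c for c in conc if c is not None]   # the 11 non-missing months

m = sum(vals) / len(vals)                                   # arithmetic mean
m_g = math.exp(sum(math.log(c) for c in vals) / len(vals))  # geometric mean

print(round(m))    # 170
print(round(m_g))  # 156
```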

-------
STATISTICAL PACKAGE  FOR  THE  SOCIAL  SCIENCES
                                                          SECOND EDITION
                                                                 NORMAN H. NIE
                                                      Department of Political Science
                                                                          and
                                                    National Opinion Research Center
                                                               University of Chicago
                                                                 C. HADLAI HULL
                                                              Computation Center
                                                             University of Chicago
     McGRAW-HILL
    BOOK COMPANY

        New York
         St. Louis
      San Francisco
        Auckland
         Düsseldorf
     Johannesburg
      Kuala Lumpur
          London
          Mexico
        Montreal
        New Delhi
          Panama
           Paris
         São Paulo
        Singapore
          Sydney
          Tokyo
          Toronto
            JEAN G. JENKINS
 National Opinion Research Center
          University of Chicago
        KARIN STEINBRENNER
 National Opinion Research Center
          University of Chicago
               DALE H. BENT
Faculty of Business Administration
                       and
           Computing Services
       The University of Alberta

-------
Library of Congress Cataloging in Publication Data
Nie, Norman H
     SPSS: statistical package for the social sciences.
     1.  SPSS (Electronic  computer  system)  2.  Electronic data
processing—Social sciences. I. Title. II. Title: Statistical package for
the social sciences.
HA33.N48  1975                029.7                74-16223
ISBN 0-07-046531-2
STATISTICAL PACKAGE FOR THE SOCIAL SCIENCES

Copyright ©  1970, 1975 by McGraw-Hill,  Inc. All rights reserved.
Printed in the United States of America. No part of this publication
may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying,
recording, or otherwise, without the prior written permission of the
publisher.

1 234567890 WC WC7 987 6 5

This book was set in Times Roman by Creative Book Services,
division of McGregor & Werner, Inc. The editors were Kenneth J.
Bowman and Matthew Cahill; the designer was Joseph Gillians; the
production supervisor was Thomas J. LoPinto. New drawings were
done by J & R Services, Inc.
Webcrafter, Inc., was printer and binder.

-------
xii  CONTENTS
                16
           CONTINGENCY TABLES AND RELATED
           MEASURES OF ASSOCIATION:
           SUBPROGRAM CROSSTABS                                 218

           16.1  AN INTRODUCTION TO CROSSTABULATION                     218

                16.1.1 Crosstabulation Tables              •                  219
                16.1.2 Summary Statistics for Crosstabulations                  222

            16.2  SUBPROGRAM CROSSTABS: TWO-WAY TO N-WAY CROSS-
                 TABULATION TABLES AND RELATED STATISTICS                230

                16.2.1 Components of the CROSSTABS Procedure Card            231
                16.2.2 TABLES= List                                      231
                16.2.3 VARIABLES= List for Integer Mode        •              234

           16.3  FORMAT OF THE CROSSTABS PROCEDURE CARD                236
           16.4  PRINTED OUTPUT FROM SUBPROGRAM CROSSTABS             237

                 16.4.1 A Note on Value Labels for CROSSTABS                  239
                16.4.2 Including Missing Values Only in Tables                  240

            16.5 OPTIONS AVAILABLE FOR SUBPROGRAM CROSSTABS: THE
                 OPTIONS CARD                                            241
           16.6 STATISTICS AVAILABLE FOR SUBPROGRAM CROSSTABS: THE
                STATISTICS CARD                                         242
           16.7 PROGRAM LIMITATIONS FOR SUBPROGRAM CROSSTABS          243

                16.7.1 Program Limitations for Subprogram CROSSTABS. Gen-
                      eral Mode                                         243
                16.7.2 Program  Limitations for  Subprogram CROSSTABS,  In-
                      teger Mode                                        244

           16.8 EXAMPLE DECK SETUPS FOR SUBPROGRAM CROSSTABS          245

                17
           DESCRIPTION  OF SUBPOPULATIONS
           AND MEAN DIFFERENCE TESTING:
           SUBPROGRAMS BREAKDOWN AND T-TEST                  249

           17.1  SUBPROGRAM BREAKDOWN                                249

                 17.1.1 BREAKDOWN Operating  Modes: Integer and General        250
                17.1.2 TABLES= List                                      252
                17.1.3 VARIABLES= List for Integer Mode                      254
                17.1.4 Format of the BREAKDOWN Procedure Card               255
                17.1.5 Options Available for Subprogram BREAKDOWN: The
                      OPTIONS Card                                      257
                17.1.6 Statistics  for Subprogram BREAKDOWN: One-Way
                      Analysis of Variance and Test of Linearity.                257
                  17.1.7  Program Limitations for Subprogram BREAKDOWN         261
                 17.1.8  Example Deck Setup for Subprogram BREAKDOWN         262
                 17.1.9  BREAKDOWN Tables Printed in Crosstabular Form: The
                       CROSSBREAK Facility                                264

            17.2 SUBPROGRAM T-TEST: COMPARISON OF SAMPLE MEANS         267
                 17.2.1 Introduction to the T-TEST of Significance                267
                17.2.2 The T-TEST Procedure Card                            271
                17.2.3 Options and Statistics for Subprogram T-TEST             273
                17.2.4 Program Limitations for Subprogram T-TEST              273

-------
     17.2.5  Example Deck Setups  and Output  for Subprogram
           T-TEST                                           274

    18
BIVARIATE CORRELATION ANALYSIS:
PEARSON CORRELATION, RANK-ORDER CORRELATION,
AND SCATTER DIAGRAMS                                 276

18.1  INTRODUCTION TO CORRELATION ANALYSIS                   276
18.2  SUBPROGRAM PEARSON CORR: PEARSON
     PRODUCT-MOMENT CORRELATION COEFFICIENTS               280

      18.2.1  PEARSON CORR Procedure Card                        281
     18.2.2  Options Available for Subprogram PEARSON CORR        283
     18.2.3  Statistics Available for Subprogram PEARSON CORR       285
     18.2.4  Program Limitations for Subprogram PEARSON CORR      285
     18.2.5  Sample Deck Setup  and Output for Subprogram PEAR-
           SON CORR                                    -286

18.3  SUBPROGRAM NONPAR CORR: SPEARMAN AND/OR
     KENDALL RANK-ORDER CORRELATION COEFFICIENTS            288

     18.3.1  NONPAR CORR Procedure Card                  •     290
     18.3.2  Options Available for Subprogram NONPAR CORR         291
     18.3.3  Statistics Available for Subprogram NONPAR CORR        291
      18.3.4  Program Limitations for Subprogram NONPAR CORR       291
      18.3.5  Sample Deck Setup and Output for Subprogram
            NONPAR CORR                                     292

18.4  SUBPROGRAM SCATTERGRAM: SCATTER DIAGRAM OF DATA
     POINTS AND SIMPLE REGRESSION            .               293

     18.4.1  SCATTERGRAM Procedure Card                        294
     18.4.2  Options Available for Subprogram SCATTERGRAM         296
     18.4.3  Statistics Available for Subprogram SCATTERGRAM        297
     18.4.4  Program Limitations for Subprogram SCATTERGRAM      297
     18.4.5  Sample Deck Setup  and Output for Subprogram SCAT-
           TERGRAM                                         298

    19
PARTIAL CORRELATION:
SUBPROGRAM PARTIAL CORR                               301
19.1  INTRODUCTION TO PARTIAL CORRELATION ANALYSIS           302
19.2  PARTIAL CORR PROCEDURE CARD                            305

     19.2.1  Correlation List                                     306
     19.2.2  Control List                                        306
     19.2.3  Order Value(s)                                      307

19.3  SPECIAL CONVENTIONS FOR MATRIX INPUT FOR
     SUBPROGRAM PARTIAL CORR                               308
     19.3.1  Requirements on the Form and Format for Correlation
           Matrices                                          308
     19.3.2  Methods for Specifying the Order of Variables on Input
           Matrices                                          309
     19.3.3  Specifying the Order of Variables on Input Matrices by
           the Partials List: The Default Method                    309
     19.3.4  Specifying the Order of Variables on Input Matrices by
            the Variable List Card: Option 6                       310

-------

xiv   CONTENTS


                19.3.5  SPSS Control Cards Required for Matrix Input              311

            19.4 OPTIONS AVAILABLE FOR SUBPROGRAM PARTIAL CORR          312
            19.5 STATISTICS AVAILABLE FOR SUBPROGRAM PARTIAL CORR        315
            19.6 PROGRAM LIMITATIONS FOR SUBPROGRAM PARTIAL CORR       315
            19.7 EXAMPLE DECK SETUPS FOR SUBPROGRAM PARTIAL CORR       316

                20
            MULTIPLE REGRESSION ANALYSIS:
            SUBPROGRAM REGRESSION                                 320

            20.1 INTRODUCTION TO MULTIPLE REGRESSION                     321

                20.1.1  Simple Bivariate Regression                            323
                20.1.2  Extension to Multiple Regression                       328

            20.2 REGRESSION PROCEDURE CARD                              342
                20.2.1  VARIABLES List                                      343
                20.2.2  REGRESSION Design Statement                         343

            20.3 SUMMARY OF PROCEDURE CARDS                            348
            20.4 SPECIAL CONVENTIONS FOR MATRIX INPUT WITH
                SUBPROGRAM REGRESSION                                 349

                20.4.1  Format of Input Correlation Matrices                     349
                20.4.2  Format and Position of Input Means and Standard Devia-
                       tions                                              350
                20.4.3  Methods for Specifying the Number and  Order of Vari-
                       ables on Input Correlation Matrices                     350
                20.4.4  Control Cards Required to Enter Matrices                  351

            20.5 SPECIAL CONVENTIONS FOR HANDLING RESIDUAL OUTPUT       351
            20.6 OPTIONS AVAILABLE FOR SUBPROGRAM REGRESSION           352
            20.7 STATISTICS AVAILABLE FOR SUBPROGRAM REGRESSION         355
            20.8 PROGRAM LIMITATIONS FOR  SUBPROGRAM REGRESSION        356
            20.9 PRINTED OUTPUT FROM SUBPROGRAM REGRESSION             358
            20.10 EXAMPLE DECK SETUPS AND OUTPUT  FOR SUBPROGRAM
                 REGRESSION                                             360

                21
            SPECIAL TOPICS IN GENERAL LINEAR MODELS              368

            21.1 NONLINEAR RELATIONSHIPS                                 368

                21.1.1  Data Transformation                                  369
                21.1.2  Examining Polynomial Trends                          371
                21.1.3  Interaction Terms                                    372
            21.2 REGRESSION WITH DUMMY VARIABLES                        373
                21.2.1  Dummy Variables: Coding and Interpretation              374
                21.2.2  Dummy Variable Regression with Two or More Categori-
                       cal Variables                                        377

            21.3  PATH ANALYSIS AND CAUSAL INTERPRETATION                383

                21.3.1  Principles of Path Analysis                             384
                21.3.2  Interpretation of Path Analysis Results                   387
                21.3.3  Statistical Inference                                  392
                21.3.4  Standardized versus Unstandardized Coefficients           394

-------
    22
ANALYSIS OF VARIANCE AND COVARIANCE:
SUBPROGRAMS ANOVA AND ONEWAY                          398

22.1  INTRODUCTION TO ANALYSIS OF VARIANCE AND
     COVARIANCE                                               399

     22.1.1 Variation and Its Decomposition: Basic Ideas               400
     22.1.2 Factorial  Design with Equal Cell Frequency                 401
     22.1.3 Factorial Designs with Unequal Cell Frequency              405
     22.1.4 Covariance Analysis                                    408
     22.1.5 Multiple Classification Analysis                          409

22.2  SUBPROGRAM ANOVA                                       410

     22.2.1 ANOVA Procedure Card                                 411
     22.2.2 Specifying the Form of the Analysis with the OPTIONS
           Card                                                413
     22.2.3 Multiple Classification Analysis (MCA)                     416
     22.2.4 Options and Statistics Available for ANOVA                418
     22.2.5 Special Limitations for ANOVA                          419
     22.2.6  Example  Deck  Setups  and  Output  for Subprogram
            ANOVA                                              421
22.3  SUBPROGRAM ONEWAY                                      422

     22.3.1 Tests for Trends                                       425
     22.3.2 A Priori Contrasts                                      425
     22.3.3 A Posteriori Contrasts                                  426
     22.3.4  More Complex ONEWAY Deck Setups                     428
     22.3.5  Options Available for Subprogram ONEWAY               429
     22.3.6 Statistics Available for Subprogram ONEWAY              430
     22.3.7 Program Limitations for Subprogram ONEWAY             430
     22.3.8  Example Deck Setup and Output for Subprogram ONE-
            WAY                                               430

    23
DISCRIMINANT ANALYSIS                                      434
23.1  INTRODUCTION  TO DISCRIMINANT ANALYSIS                   435

     23.1.1  A Two-Group Example                                 436
     23.1.2  A Multigroup Example                                 439
23.2  ANALYTIC FEATURES OF SUBPROGRAM DISCRIMINANT          441

     23.2.1  Determining the Number of Discriminant Functions          442
     23.2.2  Interpretation of the Discriminant Function Coefficients       443
     23.2.3  Plots of Discriminant Scores                             444
     23.2.4  Rotation of the Discriminant Function Axes                444
     23.2.5  Classification of Cases                                  445
     23.2.6  Selection Methods: Direct and Stepwise                   446
23.3  DISCRIMINANT PROCEDURE CARD                              448
     23.3.1  GROUPS Specification                                  449
     23.3.2 VARIABLES Specification                               450
     23.3.3  Specifying Subanalyses with the ANALYSIS Keyword        450
     23.3.4  METHOD                                            452
     23.3.5  TOLERANCE                                          453

-------
                 23.3.6  MAXSTEPS             .                              453
                 23.3.7  Setting Minimum Criteria for the Stepwise Procedure        453
                 23.3.8  Controlling the Number of Discriminant Functions           454
                 23.3.9  Establishing Prior Probabilities  for Classification Pur-
                        poses                                               455
                  23.3.10 Summary of the DISCRIMINANT Procedure Card Specifi-
                         cations                                            456
            23.4 OPTIONS AVAILABLE IN SUBPROGRAM DISCRIMINANT            456
            23.5 STATISTICS AVAILABLE IN SUBPROGRAM DISCRIMINANT         459
            23.6 SPECIAL CONVENTIONS FOR MATRIX OUTPUT AND INPUT FOR
                 SUBPROGRAM DISCRIMINANT                                 460
            23.7 PROGRAM LIMITATIONS FOR SUBPROGRAM DISCRIMINANT       461
            23.8 SAMPLE DECK SETUP AND OUTPUT FOR SUBPROGRAM
                 DISCRIMINANT                                              462

                 24
            FACTOR ANALYSIS                                           468
             24.1 INTRODUCTION TO FACTOR ANALYSIS                         469
                  24.1.1  Types of Factor Analysis                              469
                 24.1.2  Meaning of Essential Tables and Statistics in Factor
                        Analysis Output                                       473
            24.2 METHODS OF FACTORING AVAILABLE IN
                 SUBPROGRAM FACTOR                           .            478
                 24.2.1  Principal Factoring without Iteration: PA1                  479
                 24.2.2  Principal Factoring with Iteration: PA2                     480
                 24.2.3  Remaining Methods  of Factoring                         481
            24.3 METHODS OF ROTATION AVAILABLE IN
                 SUBPROGRAM FACTOR            •                           482
                 24.3.1  Orthogonal Rotation: QUARTIMAX                       484
                 24.3.2  Orthogonal Rotation: VARIMAX              *            485
                 24.3.3  Orthogonal Rotation: EQUIMAX                         485
                 24.3.4  Oblique Rotation: OBLIQUE                             485
                 24.3.5  Graphical Presentation of Rotated Orthogonal Factors       486
            24.4 BUILDING COMPOSITE INDICES  (FACTOR SCORES) FROM THE
                 FACTOR-SCORE COEFFICIENT (OR FACTOR-ESTIMATE) MATRIX     487

                 24.4.1  Values of Factor Scores Output by Subprogram FACTOR     489
                 24.4.2  Calculation of Factor Scores When Missing Data Are to
                        Be Replaced by the Mean                              489
            24.5  FACTOR PROCEDURE CARD                                   490
                  24.5.1  VARIABLES= List                                     490
                 24.5.2  Selection of Factoring Methods by the TYPE= Keyword     490
                  24.5.3  Altering the Diagonal of the Correlation Matrix by Means
                         of the DIAGONAL Value List                            491
                  24.5.4  Controlling the Factoring Process: NFACTORS,
                         MINEIGEN, ITERATE, and STOPFACT Parameters           492
                  24.5.5  Selecting the Method of Rotation With the ROTATE
                        Parameter                                           494
                  24.5.6  Writing Factor Scores on a Raw-Output-Data File: The
                         FACSCORE Keyword                                  496

-------
     24.5.7  Performing Multiple Factor Analyses                    496
     24.5.8  Summary of the Format of the FACTOR Procedure
           Card                                            497

24.6  SPECIAL CONVENTIONS FOR MATRIX INPUT AND OUTPUT
     FOR SUBPROGRAM FACTOR                               499

     24.6.1  Format of Input Correlation Matrices                    500
     24.6.2  Methods for Specifying the Number  and Order of Vari-
           ables on Input Correlation Matrices                    500
     24.6.3  Input of the Factor Matrix                            501
     24.6.4  Control Cards Required to Enter Matrices                501
     24.6.5  Output of Correlation and Factor Matrices               502

24.7  FACTOR-SCORES OUTPUT FORMAT                         502

     24.7.1  Factor Scores Produced from Selected Data              503
     24.7.2  Printed Output Generated for Factor Scores              503
24.8  OPTIONS AVAILABLE FOR SUBPROGRAM FACTOR              503
24.9  STATISTICS AVAILABLE FOR SUBPROGRAM FACTOR            506
24.10 PROGRAM LIMITATIONS FOR SUBPROGRAM FACTOR           507
24.11 EXAMPLE DECK SETUPS FOR SUBPROGRAM FACTOR           508

    25
CANONICAL CORRELATION ANALYSIS:
SUBPROGRAM CANCORR                                  515
25.1  INTRODUCTION TO CANONICAL CORRELATION ANALYSIS        515
25.2  BUILDING COMPOSITE INDICES (CANONICAL VARIATE
     SCORES) FROM THE CANONICAL VARIATE
     COEFFICIENT MATRICES                                  519
25.3  THE CANCORR PROCEDURE CARD                          520
25.4  SPECIAL CONVENTIONS FOR MATRIX INPUT AND OUTPUT       521
25.5  OUTPUT FROM SUBPROGRAM CANCORR                       522
25.6  OPTIONS AVAILABLE IN SUBPROGRAM CANCORR              523
25.7  STATISTICS AVAILABLE FOR SUBPROGRAM CANCORR           525
25.8  LIMITATIONS FOR SUBPROGRAM CANCORR                  525
25.9  EXAMPLE DECK SETUP AND OUTPUT                       526

    26
SCALOGRAM ANALYSIS:
SUBPROGRAM GUTTMAN SCALE                          528
26.1  INTRODUCTION TO GUTTMAN SCALE ANALYSIS               529
     26.1.1  Evaluating Guttman Scales                           531
     26.1.2  Building Guttman Scales                            533

26.2  GUTTMAN SCALE PROCEDURE CARD                        535
26.3  OPTIONS AVAILABLE FOR SUBPROGRAM GUTTMAN SCALE       536
26.4  STATISTICS AVAILABLE FOR SUBPROGRAM  GUTTMAN SCALE    537
26.5  LIMITATIONS FOR SUBPROGRAM GUTTMAN SCALE            537
26.6  SAMPLE DECK SETUP AND OUTPUT FOR SUBPROGRAM
     GUTTMAN SCALE                                       538

-------
   HEALTH SCIENCES COMPUTING FACILITY
   DEPARTMENT OF BIOMATHEMATICS
   SCHOOL OF MEDICINE
   UNIVERSITY OF CALIFORNIA, LOS ANGELES

                                             UNIVERSITY OF CALIFORNIA PRESS

-------
This publication reports work sponsored under grant RR-3 of the Bio-
technology Resources Branch of the National Institutes of Health.
Reproduction in whole or in part is permitted for any purpose  of the
United States Government.
              UNIVERSITY OF CALIFORNIA PRESS,
               Berkeley and Los Angeles, California

           UNIVERSITY OF CALIFORNIA PRESS, LTD.
                       London, England
   Copyright © 1975 by The Regents of the University of California
                      ISBN: 0-520-02917-8
        Library of Congress Catalog Card Number: 74-24702
Orders for this publication should be directed to one of the above ad-
dresses. Comments on programs or orders for tape copies of the pro-
grams should be addressed to:

               Health Sciences Computing Facility
                       CHS Bldg., AV-111
                     University of California
                  Los Angeles, California 90024
           Manufactured in the United States of America

-------
                                  CONTENTS

                                                                            Page

 PREFACE                                                                      ix
 INTRODUCTION

 I. GENERAL CLASSIFICATION AND  DESCRIPTION OF  PROGRAMS                         1

    A.  ANALYTICAL TECHNIQUES FOR DATA ANALYSIS USING  BMDP  PROGRAMS             1
       (a comparative discussion giving  a  brief explanation and background
       for the statistical techniques used)
       1. Data Screening and Description:  initial screening,  listing op-      1
          tions,  stratifying the data, grouping variables  to recode data,
          histograms, plots, multivariate  outliers, summary statistics,
          correlations,  cross tabulations, transformations
        2. Analysis of Variance:   basic models, one-way analysis, covari-      5
           ates, two or more factorial models, repeated measures, non-
           parametric tests

        3. Regression:  basic regression models -- simple, multiple, step-      7
           wise, polynomial, nonlinear; ill-conditioned problems; regres-
           sion on groups; regression on principal components; partial
           correlations; scatter plots of observed and predicted values,
           plots of residuals
       4. Multivariate Analysis                                               10

          Cluster analysis:  distance measures and amalgamation pro-          10
          cedures for three types of  cluster analysis

          Factor analysis:  objectives of factor analysis  related to model   12
          and to the analyses done by the  program; simple  example showing
          ease with which a factor analysis is obtained
          Canonical correlation                       '                        14

          Discriminant analysis:   classification functions, variable          14
          selection, canonical variables,  jackknife classification, con-
          trasts
    B.  PROGRAM FEATURES  TABLE                                                17
       (illustrates features not obvious from  the program  titles, and also
       multiple options  available in  programs)
    C.  SHORT DESCRIPTIONS OF PROGRAMS                                        20
       (brief description of individual programs)

II. NEW USERS                                                                 31
    A.  HOW TO  SET UP A PROGRAM                                                31
       (system cards, program control cards)
    B.  TRY IT, YOU'LL LIKE IT                                                32
       (a simple first example of how to run a program)
    C.  USING A BMDP PROGRAM                                                  34
       (setting up a program by  using the  P4D  writeup;  annotated input/
       output  examples for two programs)
    D.  COMMON  ERRORS                                                         35
       (errors and the error messages generated by the  system  and by
       the program)

-------
                                                                             Page
III.  PREPARATION OF INPUT DATA                                               43
     A.  STANDARD DATA INPUT                                                  43
     B.  KEYPUNCHING AND CODING                                               44
     C.  DATA PROCESSING EQUIPMENT                                            44
     D.  CODING SHEETS                                                        44
     E.  DESIGN OF RESEARCH FORMS                                             47
     F.  INPUT FORMAT                                                         47
         (F-type format and A-type format)
 IV.  BMDP PROGRAM INFORMATION                                                51
     A.  CONTROL LANGUAGE                                      '              51
        (parameters and instructions are stated in Program Control
        Language)
        1. Language Definition                                               52
         2. Convenience Features                                             54
      B.  PROGRAM PARAMETERS                                                  56
        (parameters common to all problems)
         1. Problem Definition:  PROBLEM paragraph:  TITLE                   56
        2. Input Data:  INPUT paragraph:  VARIABLE,  FORMAT, CASE,  UNIT,       57
           REWIND; input from Save File — CODES, CONTENT, LABELS, TYPE
         3. Variable Description:  VARIABLE paragraph:  ADD, NAMES, USE,       59
            LABELS, MISSING, MAXIMUMS, MINIMUM, BEFORE/AFTER TRANSFORMA-
            TIONS, GROUPING, USE
         4. Grouping or Category Description:  GROUP or CATEGORY paragraph:    61
            CUTPOINTS, CODES, NAMES
         5. Transformation of Input Data:                                     63
           Control language transformations:   TRANSFormation paragraph:       63
           form, temporary, USE,  KASE,  XMIS,  X
           Table of Program Control  Language transformations                 67
           FORTRAN transformations:   BIMEDT procedure — USE,  KASE,  XMIS,     68
           KPROB,  X
           Adding variables:  ADD                                            70
           Data checking:  BEFORE/AFTER transformations                      70
        6. Output Options                                                    71
           Numerical  results:   PRINT and PLOT paragraphs                     71
           Save Files:   SAVE paragraph: UNIT,  CODES,  CONTENTS,  LABELS       73
           Chart of assumed values                                           76
     C.  DECK SETUP                                                           75
        (arrangement of cards:  system  control cards  and Program Control
       Language cards)
        1. Basic Setup                                                        75
        2. Save File  Setup                                                   78

                                      vi

-------
                                                                          Page
       3.  BIMEDT Setup                                                      79
         FORTRAN coded transformations
         Increasing  computer  memory  space
PROGRAM WRITEUPS
   D-SERIES -- DATA DESCRIPTION PROGRAMS
      P1D:  Simple Data Description                                          81
      P2D:  Frequency Count Routine                                          95
      P3D:  t Test and T² Routine                                           115
      P4D:  Alphanumeric Frequency Count Routine                            137
      P5D:  Univariate Plotting                                             147
      P6D:  Bivariate Plotting                                              163
      P7D:  Description of Strata with Histograms and Analysis of Variance  189
      P8D:  Missing Value Correlation                                       219
      P9D:  Multidimensional Data Description                               237
   F-SERIES — FREQUENCY TABLE PROGRAMS
      P1F:   Two-way  Contingency Tables                                     259
   M-SERIES -- MULTIVARIATE PROGRAMS
      P1M:  Cluster Analysis on Variables                                   307
      P2M:  Cluster Analysis on Cases                                       323
      P3M:  Block Clustering                                                339
      P4M:  Factor Analysis                                                 357
      P6M:  Canonical Correlation Analysis                                  393
      P7M:  Stepwise Discriminant Analysis                                  411
   R-SERIES -- REGRESSION PROGRAMS
      P1R:  Multiple Linear Regression                                      453
      P2R:  Stepwise Regression                                             491
      P3R:  Nonlinear Regression                                            541
      P4R:  Regression on Principal Components                              573
      P5R:  Polynomial Regression                                           593
      P6R:  Partial Correlation and Multivariate Regression                 621
   S-SERIES - SPECIAL PROGRAMS
      P1S:  Multipass Transformation                                        637
      P3S:   Nonparametric Statistics                                       657
   V-SERIES - ANALYSIS OF VARIANCE PROGRAMS
      P1V:   One-way Analysis of Variance and Covariance                    683
      P2V:  Analysis of Variance and Covariance, Including Repeated         711
            Measures

                                     vii

-------
        SAS USER'S GUIDE:
                    Statistics
               1982
SAS INSTITUTE INC.
BOX 8000
CARY, NORTH CAROLINA 27511

-------
                                    Contents
REGRESSION

    1   Introduction to SAS Regression Procedures	   3
    2   NLIN	   15
    3   REG	   39
    4   RSQUARE	   85
    5   RSREG	   91
    6   STEPWISE	  101

ANALYSIS OF VARIANCE                   '

    7   Introduction to SAS Analysis-of-Variance Procedures	  113
    8   ANOVA	  119
    9   GLM                                                                139
   10   NESTED	  201
   11   NPAR1WAY	  205
   12   PLAN	  213
   13   TTEST	  217
   14   VARCOMP	  223
   15   The Four Types of Estimable Functions	  229

CATEGORICAL DATA ANALYSIS
   16   Introduction to SAS Categorical Procedures	  245
   17   FUNCAT	  257
   18   PROBIT	  287

MULTIVARIATE ANALYSIS
   19   Introduction to SAS Multivariate Procedures	  295
   20   CANCORR                                                            297
   21   FACTOR	  309
   22   PRINCOMP	  347

DISCRIMINANT ANALYSIS
   23   Introduction to SAS Discriminant Procedures	  365
   24   CANDISC	  369
   25   DISCRIM	  381
   26   NEIGHBOR	  397
   27   STEPDISC	  405

CLUSTERING
   28   Introduction to SAS Clustering Procedures	  417
   29   CLUSTER	  423
   30   FASTCLUS	  433
   31   TREE	  449
   32   VARCLUS	  461

-------
SCORING
   33   Introduction to SAS Scoring Procedures	  477
   34   RANK	  479
   35   SCORE	  485
   36   STANDARD	  493

MATRIX

   37   Introduction to the Matrix Language	  499
   38   MATRIX	  503

INDEX	  569

-------
                                         PRE-TEST




          1.  Four sets of replicate observations of the concentration of a certain
          pollutant were made by sampling the air at each of four locations.  Location 1
          was 30 miles from a certain industrial plant, and Locations 2, 3, and 4 were
          selected at points 20 miles away, 10 miles away, and adjacent to the plant.
          The data appear in Table One below:

                                         TABLE ONE

               LOCATION   CONCENTRATION OF POLLUTANT (Y_ij)   MEAN Ȳ_i = (1/3) Σ_j Y_ij
                  1                  4, 5, 6                             5
                  2                  6, 6, 6                             6
                  3                  7, 8, 9                             8
                  4                  8, 9, 10                            9

          Here, Y_ij denotes the value of the concentration for the j-th replicate at
          Location i (j = 1,2,3; i = 1,2,3,4), and Ȳ_i denotes the mean of the three
          replicates taken at Location i.

25 pts.   (a) Do the data provide sufficient evidence to suggest a difference in
              mean pollutant concentration among the four locations?  (Use α = .05.)
              MAKE SURE TO WRITE DOWN THE MODEL YOU ARE USING AND TO STATE ALL
              ASSUMPTIONS NECESSARY FOR YOU TO PERFORM YOUR ANALYSIS.  PERFORM BOTH
              A PARAMETRIC AND A NON-PARAMETRIC ANALYSIS.

-------
25 pts.   (b) If μ_i is the true mean pollutant concentration at Location i,
              i = 1,2,3,4, test the null hypothesis

                   H0:  -3μ_1 - μ_2 + μ_3 + 3μ_4 = 0
              versus
                   Ha:  -3μ_1 - μ_2 + μ_3 + 3μ_4 ≠ 0

              at the 1% level; the quantity (-3μ_1 - μ_2 + μ_3 + 3μ_4) can be shown to
              be a measure of the linear relationship between "locations" (as thought
              of in terms of four equally spaced distances of 30, 20, 10, and 0 miles
              from the plant) and "pollutant concentration".

          (c) Another way to quantify the strength of this linear relationship would
              be to fit by least squares the model

                   Y_ij = β_0 + β_1 x_i + e_ij,

              where
                        x_i = -3 for Location 1,
                              -1 for Location 2,
                               1 for Location 3,
                               3 for Location 4.

              Fit such a model to the n = 12 data points in Table One to obtain
              least squares estimates β̂_0 and β̂_1 of β_0 and β_1, respectively.

25 pts.   (d) Test the null hypothesis H0: β_1 = 0 versus the alternative Ha: β_1 ≠ 0
              at the 1% level.  STATE ALL ASSUMPTIONS NECESSARY FOR YOU TO PERFORM
              THIS TEST.

-------
                         SOLUTIONS TO PRE-TEST

1) a) Model:  Y_ij = μ + τ_i + e_ij,  i = 1,2,3,4 and j = 1,2,3;  Σ_i τ_i = 0.

      Here,

      Y_ij = the observed value of the pollutant concentration for the j-th
             replicate at location i,

         μ = over-all mean effect,

       τ_i = additive effect due to the i-th treatment (location),

      e_ij = experimental error component.

   For a parametric analysis, one assumes that e_ij ~ N(0, σ²) and that the
   {e_ij} are mutually independent.

   PARAMETRIC ANALYSIS (one-way ANOVA):

   TSS = Σ_i Σ_j Y_ij² - (Σ_i Σ_j Y_ij)²/12 = 624 - (84)²/12 = 624 - 588 = 36.

   SST = (1/3) Σ_i (Σ_j Y_ij)² - (84)²/12 = 618 - 588 = 30.

   SSE = TSS - SST = 36 - 30 = 6.

   F_3,8 = (SST/3)/(SSE/8) = MST/MSE = (30/3)/(6/8) = 40/3 = 13.33.

   Reject H0: τ_1 = τ_2 = τ_3 = τ_4 = 0 since F_3,8,.95 = 4.07.

   Thus, there is evidence of a significant (5% level, at least) difference
   among the four locations with respect to mean pollutant concentration level.
-------
NON-PARAMETRIC ANALYSIS (Kruskal-Wallis procedure)

The table of ranks is:

   LOCATION        RANKS             RANK TOTALS
      1         1, 2, 4.5            R_1 =  7.5
      2         4.5, 4.5, 4.5        R_2 = 13.5
      3         7, 8.5, 10.5         R_3 = 26.0
      4         8.5, 10.5, 12        R_4 = 31.0

H = [12/(12(12+1))] Σ_i (R_i²/3) - 3(12+1) = 9.09, which is significant at
the 5% level since χ²_3,.95 = 7.815.
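
The Kruskal-Wallis H can likewise be verified in a few lines of Python. This
sketch (ours, not from the workbook) uses the same midrank convention for ties
as the rank table above and, like the hand calculation, applies no tie
correction:

```python
# Kruskal-Wallis H for the Table One data, using midranks for tied
# observations (no tie correction, matching the hand calculation).
groups = [[4, 5, 6], [6, 6, 6], [7, 8, 9], [8, 9, 10]]
pooled = sorted(v for g in groups for v in g)
n = len(pooled)                                    # 12

def midrank(v):
    # Average of the ranks occupied by the tie group containing v.
    first = pooled.index(v) + 1                    # lowest rank in the tie group
    return first + (pooled.count(v) - 1) / 2

rank_totals = [sum(midrank(v) for v in g) for g in groups]
h = 12 / (n * (n + 1)) * sum(r ** 2 / 3 for r in rank_totals) - 3 * (n + 1)
print(rank_totals, round(h, 2))    # [7.5, 13.5, 26.0, 31.0] 9.09
```

Since 9.09 > 7.815 = χ²_3,.95, the conclusion matches the parametric analysis.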

b) The best estimate of (−3μ₁ − μ₂ + μ₃ + 3μ₄) is (−3Ȳ₁ − Ȳ₂ + Ȳ₃ + 3Ȳ₄),
   with variance

   [(−3)² + (−1)² + (1)² + (3)²] σ²/3 = (20/3) σ² .

   H₀: −3μ₁ − μ₂ + μ₃ + 3μ₄ = 0;   Hₐ: −3μ₁ − μ₂ + μ₃ + 3μ₄ ≠ 0.

   Test statistic:  t₈ = [(−3Ȳ₁ − Ȳ₂ + Ȳ₃ + 3Ȳ₄) − 0] / √[(20/3)(MSE)] .

   Rejection region: reject when |t₈| > t_{8,.995} = 3.355.

   So,

   t₈ = {[−3(5) − 6 + 8 + 3(9)] − 0} / √[(20/3)(6/8)] = 14/√5 = 6.26,

   so reject H₀.
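   The contrast computation can be sketched with numpy, using the group means
   Ȳ = (5, 6, 8, 9) and MSE = 6/8 from the ANOVA above:

```python
import numpy as np

# Group means and MSE from the one-way ANOVA; the contrast coefficients
# (-3, -1, 1, 3) are those tested in part (b).
means = np.array([5.0, 6.0, 8.0, 9.0])
coeffs = np.array([-3.0, -1.0, 1.0, 3.0])
mse, n_per_group = 6 / 8, 3

estimate = coeffs @ means                                # 14.0
std_err = np.sqrt(coeffs @ coeffs / n_per_group * mse)   # sqrt((20/3) * MSE)
t_stat = estimate / std_err
print(round(t_stat, 2))  # 6.26, which exceeds t_{8,.995} = 3.355
```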

-------
c) Y′ = (4, 5, 6, 6, 6, 6, 7, 8, 9, 8, 9, 10)   (Y is 12×1),

   and X (12×2) has a first column of 1s and second column

   x′ = (−3, −3, −3, −1, −1, −1, 1, 1, 1, 3, 3, 3).  Then

   X′X   = ⎡12   0⎤ ,     X′Y   = ⎡84⎤ ,
   (2×2)   ⎣ 0  60⎦       (2×1)   ⎣42⎦

   β̂    = (X′X)⁻¹ X′Y = ⎡1/12    0 ⎤ ⎡84⎤ = ⎡ 7 ⎤ .
  (2×1)                  ⎣  0   1/60⎦ ⎣42⎦   ⎣0.7⎦

So, β̂₀ = 7, β̂₁ = 0.7, and Ŷ = 7 + 0.7x.
d) SSE = Y′Y − β̂′X′Y = 624 − [7(84) + 0.7(42)] = 624 − 588 − 29.4 = 6.6;

   s² = SSE/(12 − 2) = 6.6/10 = 0.66, and s = 0.8124.

   ASSUMPTIONS: For the model Y_ij = β₀ + β₁x_i + ε_ij, we assume that
   ε_ij ~ N(0, σ²), the {ε_ij} are independent, the {x_i} are known without
   error, and the model is correct.

   H₀: β₁ = 0;   Hₐ: β₁ ≠ 0.

-------
   Test statistic:  t₁₀ = (β̂₁ − 0) / √(s²/60) .

   Rejection region: reject when |t₁₀| > t_{10,.995} = 3.169.

   So,

   t₁₀ = (0.70 − 0) / √(0.66/60) = 6.674  ⟹  Reject H₀.
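   The fit in parts (c) and (d) can be verified with numpy (a modern sketch,
   not part of the original workbook):

```python
import numpy as np

# Design and response from part (c).
x = np.repeat([-3, -1, 1, 3], 3).astype(float)
y = np.array([4, 5, 6, 6, 6, 6, 7, 8, 9, 8, 9, 10], float)

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)     # intercept 7, slope 0.7

resid = y - X @ beta
sse = resid @ resid                          # 6.6
s2 = sse / (len(y) - 2)                      # 0.66
t10 = beta[1] / np.sqrt(s2 / 60.0)           # 6.674, so reject H0
print(beta, sse, round(t10, 3))
```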

-------
                               POST-TEST
1.  The concentration  (Y) of a certain pollutant based on analysis of air samples




using two newly developed chemical methods (I and II) is known to be affected by




the relative humidity  (x) present at the time the air was sampled.  An experiment




conducted to see whether the two methods give comparable results involved analyz-




ing   4 randomly selected air samples by each method.  The true concentration of




the pollutant in each of the 8 samples can be assumed to be the same (because of




the time frame in which the air monitoring was done), but the relative humidity




readings varied from sample-to-sample.  However, the relative humidity values are



available (see the table below), so that a comparison of the two methods can be




made after adjusting for the effect of the relative humidity on the observed



response (Y).  The relative humidity variable (x) is called a COVARIATE, and




the comparison of the two methods, adjusted for the covariate, is called an



ANALYSIS OF COVARIANCE.  The data are as follows.

     RELATIVE HUMIDITY (x):    3    1    4    4    5    2    3    2
     METHOD (z):               I    I    I    I   II   II   II   II
     RESPONSE (Y):            14   13   14   15   16   15   15   14
 The  linear model  to  be  fit  by  the method  of  least  squares  is




                     Y = β₀ + β₁x + β₂z + ε,



 where x = relative humidity value, and z = −1 if method I is used and +1 if




 method II is  used.

-------
15 pts    (a)  Fit the above model using the method of least squares.   You should be

               able to use the matrix

                              ⎡ 7/8   −1/4    0  ⎤
                              ⎢−1/4   1/12    0  ⎥
                              ⎣  0      0    1/8 ⎦

               and the following quantities should help you in your calculations:

                    ΣᵢΣⱼ Y_ij = 116,    ΣᵢΣⱼ Y²_ij = 1688,    ΣᵢΣⱼ x_ij = 24,

                    ΣᵢΣⱼ x²_ij = 84,   and   ΣᵢΣⱼ x_ij Y_ij = 354,

               where Y_ij is the observed response for the j-th sample analyzed by the

               i-th method and x_ij is the corresponding relative humidity value, i =

               I, II, and j = 1,2,3,4.

15 pts    (b)  Ignoring the covariate completely and considering only the re-

               sponse, test the null hypothesis that there is no difference between

               the two methods with respect  to their true mean response  levels  (use

               a two-tailed test at the 5% level).   Use both a parametric and a non-

               parametric procedure.

15 pts    (c)  Test H₀: β₂ = 0 versus Hₐ: β₂ ≠ 0 at the 5% level.

 5 pts    (d)  In your own words, what question are the tests performed  in parts

               (b)  and (c) trying to answer?  Which of the two  tests do you prefer

               and why?

          2.   The Environmental Protection  Agency is interested in comparing four types of

               air pollution monitoring devices with regard to precision.  The  data in the

               accompanying table are the maximum 24-hour concentrations of a certain air

               pollutant as recorded by the  four devices at a specific location on three

               randomly selected days.

-------



                         MAXIMUM 24-HOUR CONCENTRATION (Y)

                               Type of Monitoring Device
                  Days        1      2      3      4
                   I          3      5      5      4
                   II         4      4      3      5
                   III        8     12     10      9
10 pts    (a)   Write down an appropriate statistical model for this experiment,




               defining all terms and stating all assumptions.




20 pts    (b)   At the .05 significance level, test whether there is any difference




               in mean response among the four types of devices.  Use both a parametric




               and a non-parametric testing procedure.




15 pts    (c)   Find a 95% confidence interval for the true difference in response




               between devices 1 and 2.




 5 pts    (d)   Which method of analysis (i.e., parametric or nonparametric) do you




               think is the most appropriate for these  data, and why?

-------
    Since F_{3,6,.95} = 4.76, we do not reject H₀: τ₁ = τ₂ = τ₃ = τ₄ = 0, and so

    there is no evidence of a significant difference (5% level) among the

    four types of air monitoring devices.



    NON-PARAMETRIC PROCEDURE (Friedman's procedure):

    The table of ranks is:

                        TYPE OF DEVICE
           DAY       1      2      3      4
            I        1     3.5    3.5    2
            II      2.5    2.5     1     4
            III      1      4      3     2

    So, R₁ = 4.5, R₂ = 10.0, R₃ = 7.5, R₄ = 8.0.

    Thus,

    χ²ᵣ = [12/((3)(4)(4+1))] [(4.5)² + (10.0)² + (7.5)² + (8.0)²] − 3(3)(4+1)

        = 3.1, which is not significant since χ²_{3,.95} = 7.815.
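    A sketch with scipy follows, using the problem 2 data (each argument lists
    one device's readings across the three days). scipy applies a correction
    for tied ranks, so its statistic is slightly above the hand-computed 3.1,
    with the same conclusion:

```python
from scipy import stats

# Readings per device across the three days (problem 2 data).
device1, device2, device3, device4 = [3, 4, 8], [5, 4, 12], [5, 3, 10], [4, 5, 9]

stat, p_value = stats.friedmanchisquare(device1, device2, device3, device4)
# Still well below the chi-square critical value 7.815: do not reject H0.
print(stat, p_value)
```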


c)  The 100(1−α)% confidence interval is of the general form

    (Ȳ₁ − Ȳ₂) ± t_{6,1−α/2} √[MSE (1/n₁ + 1/n₂)] .

    In our example, we obtain:

    (5 − 7) ± 2.447 √(7.5/6) √(1/3 + 1/3),

    or −2 ± 2.234, or (−4.234, +0.234).


d)  Since extreme-value statistics are not normally distributed  (e.g.  see


    Roberts, Journal of the Air Pollution Control Association, July,  1979,


    Vol. 29, Nos. 6 and 7), the nonparametric analysis is to be  preferred.

-------
                          SOLUTIONS TO  POST-TEST
1. a)  Y′ = (14, 13, 14, 15, 16, 15, 15, 14)   (Y is 8×1),

       and X (8×3) has rows (1, x_ij, z_ij):

                 ⎡1  3  −1⎤
                 ⎢1  1  −1⎥
                 ⎢1  4  −1⎥
       X      =  ⎢1  4  −1⎥
      (8×3)      ⎢1  5  +1⎥
                 ⎢1  2  +1⎥
                 ⎢1  3  +1⎥
                 ⎣1  2  +1⎦

                 ⎡ 8  24  0⎤              ⎡116⎤
       X′X   =   ⎢24  84  0⎥ ,   X′Y  =   ⎢354⎥ ,
      (3×3)      ⎣ 0   0  8⎦    (3×1)     ⎣  4⎦

                     ⎡ 7/8  −1/4    0 ⎤
       (X′X)⁻¹  =    ⎢−1/4  1/12    0 ⎥ ,
        (3×3)        ⎣  0     0   1/8 ⎦

                                  ⎡ 13⎤
       β̂   =  (X′X)⁻¹ X′Y   =    ⎢1/2⎥ .
      (3×1)                       ⎣1/2⎦

       So, Ŷ = 13 + (1/2)x + (1/2)z.
b)  PARAMETRIC PROCEDURE (two-sample t-test):

    H₀: μ_I = μ_II;   Hₐ: μ_I ≠ μ_II.

    Test statistic:

    t₆ = [(Ȳ₁ − Ȳ₂) − 0] / √[s²_p (1/n₁ + 1/n₂)],

    where s²_p = [(n₁−1)s₁² + (n₂−1)s₂²]/(n₁ + n₂ − 2) = (2 + 2)/(4 + 4 − 2) = 2/3.

    Rejection region: reject when |t₆| > t_{6,.975} = 2.447.

    So,

    t₆ = [(14 − 15) − 0] / √[(2/3)(1/4 + 1/4)] = −1.732  ⟹  Do not reject H₀.

-------
   NON-PARAMETRIC PROCEDURE (Mann-Whitney "U" test):

          (I)  (I)  (I)  (II)  (I)  (II)  (II)  (II)

   OBS:   13,  14,  14,  14,   15,  15,   15,   16

   RANK:   1,   3,   3,   3,    6,   6,    6,    8

          R_I = 13,    R_II = 23.

   U_II = (4)(4) + (4)(4+1)/2 − 23 = 3,   U_I = 13.

   From published tables, Pr(U_II ≤ 3 | H₀ is true and n₁ = n₂ = 4) = .1000,

   so we do not reject H₀ at the 5% level.
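   Both tests in part (b) can be reproduced with scipy (a modern sketch; note
   that scipy labels the Mann-Whitney statistic by a different convention, so
   the smaller of U and n₁n₂ − U is the value compared with the tables):

```python
from scipy import stats

method1 = [14, 13, 14, 15]   # method I responses
method2 = [16, 15, 15, 14]   # method II responses

# Pooled-variance two-sample t-test.
t_stat, p_t = stats.ttest_ind(method1, method2)
print(round(t_stat, 3))   # -1.732; |t| < 2.447, so do not reject H0

# Mann-Whitney U; reduce to the smaller U regardless of labeling convention.
u_stat = stats.mannwhitneyu(method1, method2, alternative='two-sided').statistic
u_min = min(u_stat, len(method1) * len(method2) - u_stat)
print(u_min)              # 3, the smaller of U_I = 13 and U_II = 3
```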
c)  SSE = Y′Y − β̂′X′Y = 1688 − 13(116) − (1/2)(354) − (1/2)(4) = 1688 − 1687 = 1;

    so, s² = SSE/(8 − 3) = 1/5 = 0.200.     H₀: β₂ = 0;   Hₐ: β₂ ≠ 0.

    Test statistic:  t₅ = (β̂₂ − 0) / √(s²/8) .

    Rejection region: reject when |t₅| > t_{5,.975} = 2.571.

    So,  t₅ = 0.5 / √(0.2/8) = +3.162  ⟹  Reject H₀.
 d)  The test in (c) is to be preferred because it takes into account the effect



     of the covariate.  In this example, even though x̄_I = x̄_II = 3, test (c)



     should be used since it utilizes the appropriate estimate of experimental



     error (i.e., one taking into account the significant linear relationship



     between X and Y).
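     The covariance-adjusted fit and the test of β₂ can be verified with numpy
     (a modern sketch, not part of the original workbook):

```python
import numpy as np

x = np.array([3, 1, 4, 4, 5, 2, 3, 2], float)            # relative humidity
z = np.array([-1, -1, -1, -1, 1, 1, 1, 1], float)        # method code
y = np.array([14, 13, 14, 15, 16, 15, 15, 14], float)    # response

X = np.column_stack([np.ones(8), x, z])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                 # [13, 0.5, 0.5]

sse = y @ y - beta @ X.T @ y             # 1.0
s2 = sse / (8 - 3)                       # 0.2
t5 = beta[2] / np.sqrt(s2 * XtX_inv[2, 2])
print(beta, round(t5, 3))                # t5 = 3.162 > 2.571, reject H0
```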

-------
2.   a)  This is a RANDOMIZED BLOCK DESIGN.

        Model:  Y_ij = μ + τ_i + β_j + ε_ij,  i = 1,2,3,4 and j = 1,2,3;

        Σ_{i=1}^{4} τ_i = Σ_{j=1}^{3} β_j = 0.  Here,

     Y_ij = the observed concentration as recorded by the i-th type of

            device on the j-th day,

        μ = over-all mean effect,

      τ_i = additive effect due to the i-th device,

      β_j = additive effect due to the j-th day,

     ε_ij = experimental error component.

     For a parametric analysis, one assumes that ε_ij ~ N(0, σ²) and that the

     {ε_ij} are mutually independent.

     b)  PARAMETRIC PROCEDURE (two-way ANOVA):

         Let T_i = Σ_{j=1}^{3} Y_ij  and  B_j = Σ_{i=1}^{4} Y_ij.

         Let CM = (Σ_{i=1}^{4} Σ_{j=1}^{3} Y_ij)²/12 = (72)²/12 = 432.

         TSS = Σ_{i=1}^{4} Σ_{j=1}^{3} Y²_ij − CM = 530 − 432 = 98.

         SST = (1/3) Σ_{i=1}^{4} T²_i − CM = 438 − 432 = 6.

         SSB = (1/4) Σ_{j=1}^{3} B²_j − CM = 516.5 − 432 = 84.5.

         SSE = TSS − SST − SSB = 7.5.

         F_{3,6} = (SST/3)/(SSE/6) = (6/3)/(7.5/6) = 1.6.
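         The sums of squares above can be checked with numpy (a modern sketch,
         not part of the original workbook):

```python
import numpy as np

# Rows = days I-III, columns = devices 1-4.
Y = np.array([[3, 5, 5, 4],
              [4, 4, 3, 5],
              [8, 12, 10, 9]], float)

cm = Y.sum() ** 2 / Y.size                    # 432
tss = (Y ** 2).sum() - cm                     # 98
sst = (Y.sum(axis=0) ** 2).sum() / 3 - cm     # 6    (devices)
ssb = (Y.sum(axis=1) ** 2).sum() / 4 - cm     # 84.5 (days)
sse = tss - sst - ssb                         # 7.5

f = (sst / 3) / (sse / 6)
print(f)   # 1.6 < F_{3,6,.95} = 4.76, so do not reject H0
```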

-------
SAS USER'S GUIDE: STATISTICS

SAS Institute Inc., Box 8000, Cary, NC 27511
                                Contents
REGRESSION

    1  Introduction to SAS Regression Procedures	   3
    2  NLIN	   15
    3  REG	   39
    4  RSQUARE	   85
    5  RSREG	   91
    6  STEPWISE	  101

ANALYSIS OF VARIANCE

    7  Introduction to SAS Analysis-of-Variance Procedures	  113
    8  ANOVA	  119
    9  GLM	  139
   10  NESTED	  201
   11  NPAR1WAY	  205
   12  PLAN	  213
   13  TTEST	  217
   14  VARCOMP	  223
   15  The Four Types of Estimable Functions	  229

CATEGORICAL DATA ANALYSIS

   16  Introduction to SAS Categorical Procedures	  245
   17  FUNCAT	  257
   18  PROBIT	  287

MULTIVARIATE ANALYSIS

   19  Introduction to SAS Multivariate Procedures	  295
   20  CANCORR	  297
   21  FACTOR	  309
   22  PRINCOMP	  347

DISCRIMINANT ANALYSIS

   23  Introduction to SAS Discriminant Procedures	  365
   24  CANDISC	  369
   25  DISCRIM	  381
   26  NEIGHBOR	  397
   27  STEPDISC	  405

CLUSTERING
   28  Introduction to SAS Clustering Procedures	  417
   29  CLUSTER	  423
   30  FASTCLUS	  433
   31  TREE	  449
   32  VARCLUS	  461

-------
SCORING
    33   Introduction to SAS Scoring Procedures	  477
    34   RANK	  479
    35   SCORE	  485
    36   STANDARD	  493

MATRIX
    37   Introduction to the Matrix Language	  499
    38   MATRIX	  503

INDEX	  569

-------
REGRESSION
            Introduction
               NLIN
                REG
             RSQUARE
              RSREG
             STEPWISE

-------
         Introduction  to   SAS
 Regression    Procedures
This chapter reviews the SAS procedures that are used for regression analysis: REG,
RSQUARE, STEPWISE, NLIN, and RSREG.
  Many procedures in SAS perform regression analysis, each with special features.
The following SAS procedures have similar specifications and computations:

             REG  performs general-purpose regression with many
                  diagnostic and input/output capabilities
         RSQUARE  builds models and shows fitness measures for all possi-
                  ble models
         STEPWISE  implements several stepping methods for selecting
                  models
            NLIN  builds nonlinear regression models
           RSREG  builds quadratic response-surface regression models.

  Several other procedures also perform regression:
            GLM  performs an analysis of general linear models including
                  models containing categorical terms and polynomials
                  (documented with other analysis-of-variance pro-
                  cedures)
        AUTOREG  implements regression models using time-series data
                  where the errors are autocorrelated (documented in
                  the SAS/ETS User's Guide)
           SYSREG  handles linear simultaneous systems of equations, such
                  as econometric models (documented in the SAS/ETS
                  User's Guide)
          SYSNLIN  handles nonlinear simultaneous systems of equations,
                  such as econometric models (documented in the
                  SAS/ETS User's Guide)

  Other regression procedures contributed by SAS users are documented in the
SAS Supplemental Library User's Guide.
  These procedures perform regression analysis, which is the fitting of an equation
to a set of values. The equation predicts a response variable  from a function of
regressor variables and parameters, adjusting the parameters such that a measure of
fit is optimized. For example, the equation for the i-th observation might be:

      y_i = β₀ + β₁x_i + ε_i

where y_i is the response variable, x_i is a regressor variable, β₀ and β₁ are unknown
parameters to be estimated, and ε_i is an error term.

-------
4  Chapter 1

     For example, you might use regression analysis to find out how well you can
   predict a person's weight if you know his height. Suppose you collect your data by
   measuring heights and weights of 20 school children. You need to estimate the in-
   tercept β₀ and the slope β₁ of a line of fit described by the equation:

      WEIGHT = β₀ + β₁·HEIGHT + ε

   where

            WEIGHT  is the response variable (also called the dependent
                    variable)
             β₀, β₁ are the unknown parameters
            HEIGHT  is the regressor variable (also called the independent
                    variable, predictor, explanatory variable, factor, carrier)
                  ε is the unknown error.
     A plot of your data is:

        [scatterplot of WEIGHT versus HEIGHT: HEIGHT runs from about 50 to 70
        on the horizontal axis and WEIGHT from about 40 to 140 on the vertical]
     Regression estimates are b₀ = −143 and b₁ = 3.9, so the line of fit is:

        WEIGHT = −143 + 3.9·HEIGHT

     Regression is often used in an exploratory fashion to look for empirical relation-
   ships like the relationship between  HEIGHT and WEIGHT.  In  this example
   HEIGHT is not the cause of WEIGHT. We do not even  have evidence that the two
   variables  change together over time, since these data are across subjects (cross-
   sectional) rather than across time (longitudinal). (We would need a controlled ex-
   periment  to confirm the relationship scientifically.)
    The method used to estimate the parameters is to minimize the sum of squares of
   the differences between the actual response value and the value predicted by the

-------
                                     Introduction to SAS Regression Procedures  5

equation. The estimates are called least-squares estimates and the criterion value is
called the sum-of-squares error:

      SSE = Σ(y_i − b₀ − b₁x_i)²

where b₀ and b₁ are the values for β₀ and β₁ that minimize SSE.
  For a general discussion of the theory of least-squares estimation of linear models
and  its application to  regression  and analysis of variance, see one of the applied
regression texts including Draper and Smith (1981), Daniel and Wood (1980), and
Johnston (1972).
  SAS regression procedures produce the following information for a typical regres-
sion analysis:

     •  parameter estimates using the least-squares criterion
     •  estimates of the variance of the error term
     •  estimates of the variance or standard deviation of the parameter
       estimates
     •  hypotheses tests about the parameters
     •  predicted values and residuals using the estimates
     •  evaluation of the fit or lack of fit.

  Besides the usual statistics of fit produced for a regression, SAS regression pro-
cedures  can produce many  other specialized diagnostic statistics,  including:

     • collinearity diagnostics to  measure how much  regressors are related to
      other regressors and how this affects the stability and variance of the
      estimates. (REG)
     • influence diagnostics to measure how each individual observation con-
      tributes to determining the parameter estimates, the SSE, and the fitted
      values. (REG, RSREG)
     •  lack-of-fit diagnostics that  measure the lack of  fit of the regression model
      by comparing the error variance estimate to another pure error  variance
      that is not dependent on the form of the model. (RSREG)
     • time-series diagnostics for equally spaced time-series data that measure
      how much errors may be  related across neighboring observations. These
      diagnostics can also  measure functional goodness of fit for data  sorted
      by regressor or response. (REG)

  Other diagnostic statistics can be produced by programming a sequence of runs.
For example, tests to measure structural change in a model over time can be per-
formed by calculating  items from several regressions or by writing a program with
PROC MATRIX.
Comparison of Procedures

The REG Procedure  PROC REG is a general-purpose procedure for regression
with these capabilities:

     •  handles multiple MODEL statements
     •  can use correlations or crossproducts for input
     •  prints predicted values, residuals, studentized residuals, and confidence
       limits and can output these items to an output SAS data set
     •  prints special influence statistics
     •  produces partial  regression leverage plots
     •  estimates parameters subject to linear restrictions
     •  tests  linear hypotheses

-------

       •  tests multivariate hypotheses
       •  writes estimates to an output SAS data set
       •  writes the crossproducts matrix to an output SAS data set
       •  computes special collinearity diagnostics.

   The RSQUARE Procedure   PROC RSQUARE fits all possible combinations of a list
    of variables specified in a MODEL statement. The procedure prints R² and C_p
   statistics and reports no estimates. PROC  RSQUARE is useful when you want to
    look at alternative models.  Since the number of possible models (2ᵏ) gets large
   quickly, you should use the  RSQUARE procedure only when you have fewer than
   12 regressors to consider.

   The STEPWISE Procedure   PROC STEPWISE selects  regressors for a model by
   various stepping strategies; you can request five different methods to search for
   good models. The FORWARD method starts with an empty model and at each step
   selects the variable that would maximize the fit. The BACKWARD method starts
   with a full model and at each step removes the variable that contributes least to the
   fit. There are three other  variations:  STEPWISE,  MAXR, and MINR. PROC STEP-
   WISE gives an analysis of variance and parameter estimates, but cannot produce
   predicted and residual values.

   The NLIN Procedure  PROC NLIN implements iterative methods that attempt to
   find least-squares estimates  for nonlinear  models. The default method is Gauss-
   Newton, although several   other methods are available.  You must specify
   parameter names and starting values, expressions for the model, and expressions
   for  derivatives  of the  model  with  respect to  the parameters  (except  for
   METHOD=DUD). A grid search  is also available to select  starting values of the
   parameters.  Since nonlinear models are often tricky to estimate, NLIN may not
   always find the least-squares estimates.

    The RSREG Procedure   PROC RSREG fits a quadratic response-surface model,
   which  is useful in searching for factor values that optimize a response. The three
   features in RSREG that make  it preferable to other regression procedures for analyz-
   ing response surfaces are:

       •  automatic generation of quadratic effects
       •  a lack-of-fit test
       •  solutions for critical values of the surface.
   The  GLM Procedure  PROC GLM  for linear  models can  handle  regression,
   analysis-of-variance, and analysis  of covariance.  (GLM is documented  with  the
   analysis-of-variance procedures.) The features for regression that distinguish GLM
   from other regression procedures  are:

       •  ease of specifying categorical effects (GLM automatically generates dum-
          my variables for class variables)
       •  direct specification of polynomial effects.


   Statistical  Background
   The rest of this chapter outlines the way many SAS regression procedures calculate
   various regression quantities. Exceptions and further details are documented with
   individual procedures.
      In matrix algebra notation, a linear model is

      y = Xβ + ε

-------

where X is the n×k design matrix (rows are observations and columns are the
regressors), β is the k×1 vector of unknown parameters, and ε is the n×1 vector of
unknown errors. The first column of X is usually a vector of 1s used in estimating
the intercept term.
  The statistical theory of linear models is based on some strict classical assump-
tions. Ideally, the response is  measured with all the factors controlled  in an  ex-
perimentally determined environment. Or, if you  cannot control the factors  ex-
perimentally,  you  must assume that  the factors are fixed with  respect to the
response variable.
  Other assumptions are that:

     • the form of the  model is correct
     • regressor variables are measured without error
     • the expected value of the errors is zero
      • the variance of the errors (and thus the dependent variable) is a con-
        stant across observations, called σ²
     • the errors are uncorrelated across  observations.

  When hypotheses are tested, the additional assumption is made that:

     • the errors are normally distributed.


Statistical Model  If the model satisfies all the necessary assumptions,  the least-
squares estimates are the best linear unbiased estimates (BLUE); in other words, the
estimates  have minimum variance among the class of estimators that are a linear
function of the responses. If the additional assumption that the error term is nor-
mally distributed is also satisfied, then:

     • the statistics that are computed have the proper sampling distributions
       for hypothesis testing
     • parameter estimates will be normally distributed
     • various sums of squares are distributed proportional to chi-square, at
       least  under proper hypotheses
      • ratios of estimates to standard errors are distributed as Student's t under
       certain hypotheses
     • appropriate ratios of sums of squares are distributed as F under certain
       hypotheses.

  When regression analysis is  used to model data that  do  not meet the assump-
tions, the  results should be interpreted in a cautious, exploratory fashion, with dis-
counted credence in the significance probabilities.
  Box (1966) and Mosteller and Tukey (1977, Chapters 12-13) discuss the problems
that are encountered with regression data, especially when the data are not under
experimental control.
Parameter  Estimates and Associated  Statistics

Parameter estimates are formed using least-squares criteria by solving the normal
equations:

      (X'X) b  = X'y  ,

yielding

      b = (X′X)⁻¹X′y .
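  As an illustrative sketch (with invented data, not from the text), the normal
equations can be solved directly in Python:

```python
import numpy as np

# Invented example: n = 5 observations, k = 2 columns (intercept, one regressor).
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Solve (X'X) b = X'y rather than forming the inverse explicitly.
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)  # essentially [1, 2]: the data lie exactly on y = 1 + 2x

# lstsq solves the same least-squares problem without the normal equations.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(b, b_lstsq)
```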

-------

     Assume for the present that (X′X) is full rank (we relax this later). The variance of
    the error σ² is estimated by the mean square error:

        s² = MSE = SSE/(n − k) .

    The parameter estimates are unbiased:

        E(b) = β

        E(s²) = σ² .

    The estimates  have the variance-covariance matrix:

        Var(b) = (X′X)⁻¹σ² .

     The estimate of the variance matrix replaces σ² with s² in the formula above:

        COVB = (X′X)⁻¹s² .

    The correlations of the estimates are derived by scaling to 1s on the diagonal.
     Let:

        S = diag((X′X)⁻¹)^(−1/2)

        CORRB = S(X′X)⁻¹S .

     Standard errors of the estimates are computed using the equation:

        STDERR(b_i) = √((X′X)⁻¹_ii s²)

    where (X′X)⁻¹_ii is the i-th diagonal element of (X′X)⁻¹. The ratio

        t = b_i / STDERR(b_i)

    is distributed as Student's t under the hypothesis that β_i is zero. Regression pro-
    cedures print the t ratio and the significance probability, the probability under the
    hypothesis β_i = 0 of a larger absolute t value than was actually obtained. When the
    probability is less than some small level, the event is considered so unlikely that the
    hypothesis is rejected.
     Type I SS and Type II  SS measure the contribution of a variable to the reduction
   in SSE.  Type I SS measure the reduction in SSE as that variable is entered into the
   model in sequence. Type II SS are the increment in SSE  that results from removing
   the variable from the full model. Type II SS are equivalent to the Type III and Type
    IV SS reported in the GLM procedure. If Type II SS are used in the numerator of an
   F test, the test  is equivalent to the t test for the hypothesis that the parameter is
   zero.  In polynomial models,  Type  I  SS measure the  contribution  of each
   polynomial term  as if it  is orthogonal to the previous terms in the model. The four
   types of SS are described in a more general context for the GLM procedure and  in
   Chapter 15, "The Four  Types of Estimable Functions."
     Standardized estimates are defined as the estimates that result when all variables
   are standardized to a mean  of 0 and a variance of 1. Standardized estimates are
   computed by multiplying the original estimates by the standard deviation of. the
   regressor variable and dividing by the sample standard deviation of the dependent
   variable.

-------

  Tolerances and variance inflation factors measure the strength of interrelation-
ships among the regressor variables in the model. If all variables are orthogonal to
each other, both tolerance and variance inflation are 1. If a variable is very closely
related to other variables, the tolerance goes to 0 and the variance inflation gets
very large. Tolerance (TOL) is 1 minus the R² that results from the regression of the
other variables in the model on that regressor. Variance inflation (VIF) is the
diagonal of (X′X)⁻¹ if (X′X) is scaled to correlation form. The statistics are related as
shown below:

      VIF = 1/TOL .
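The TOL/VIF relationship can be illustrated with numpy (invented regressor data;
the cross-check regresses one column on the others to confirm that the diagonal
of the inverse correlation matrix equals 1/(1 − R²)):

```python
import numpy as np

# Invented regressor matrix with two strongly related columns.
X = np.array([[1.0, 2.1, 0.5],
              [2.0, 3.9, 1.0],
              [3.0, 6.2, 0.4],
              [4.0, 8.1, 0.2],
              [5.0, 9.8, 0.9]])

R = np.corrcoef(X, rowvar=False)       # X'X scaled to correlation form
vif = np.diag(np.linalg.inv(R))        # VIF = diagonal of the inverse
tol = 1.0 / vif                        # TOL = 1/VIF

# Cross-check: TOL for column 0 is 1 - R^2 from regressing it on the others.
others = np.column_stack([np.ones(5), X[:, 1], X[:, 2]])
fitted = others @ np.linalg.lstsq(others, X[:, 0], rcond=None)[0]
r2 = 1 - ((X[:, 0] - fitted) ** 2).sum() / ((X[:, 0] - X[:, 0].mean()) ** 2).sum()
assert np.isclose(tol[0], 1 - r2)
```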

Models not of full rank  If the  model is not full rank, then  a generalized inverse
can be used to solve the normal equations to minimize the SSE:

      b = (X′X)⁻X′y .

  However, these estimates are not unique, since there are  an infinite number of
solutions  using different generalized inverses. REG and other regression  pro-
cedures choose a nonzero solution  for all variables that are linearly independent of
previous variables and a zero solution for other variables. This corresponds to using
a generalized  inverse in the normal equations, and the expected values  of the
"estimates" are the Hermite normal  form of X'X times the true parameters:

      E(b) = (X′X)⁻(X′X)β .

  Degrees  of  freedom for  the zeroed  estimates  are  reported  as  zero.  The
hypotheses that are not testable have t tests printed as missing. The message that
the model is not full rank includes a printout of the relations that exist in the matrix.
 Predicted Values and Residuals
 After the model has been fit, predicted values and residuals are usually calculated
 and output. The  predicted values are  calculated from the estimated regression
 equation; the residuals are calculated as actual minus predicted. Some procedures
 can calculate standard errors.
   Consider the i-th observation, where x_i is the row of regressors, b is the vector of
  parameter estimates, and s² is the mean squared error.
   Let:

        h_i = x_i(X′X)⁻¹x_i′  (the leverage)

  Then

        ŷ_i = x_i b  (the predicted value)

        STDERR(ŷ_i) = √(h_i s²)  (the standard error of the predicted value)

        resid_i = y_i − x_i b  (the residual)

        STDERR(resid_i) = √((1 − h_i)s²)  (the standard error of the residual).

  The ratio of the residual to its standard error, called the studentized residual, is
sometimes shown as:

-------

         Student_i = resid_i / STDERR(resid_i) .

      There are two kinds of confidence intervals for predicted values. One type of
    confidence interval is an interval for the expected value of the response. The other
    type of confidence interval is an interval for the actual value of a response, which is
    the expected value plus error.
       For example, construct for the i-th observation a confidence interval that contains
     the true expected value of the response with probability 1 − α. The upper and
     lower limits of the confidence interval for the expected value are:

          LowerM = x_i b − t_{α/2} √(h_i s²)

          UpperM = x_i b + t_{α/2} √(h_i s²) .

       The limits for the confidence interval for an actual individual response
     (forecasting interval) are:

          LowerI = x_i b − t_{α/2} √(h_i s² + s²)

          UpperI = x_i b + t_{α/2} √(h_i s² + s²) .

       One measure of influence, Cook's D, measures the change to the estimates that
     results from deleting each observation:

          COOKD = Student² (STDERR(ŷ)/STDERR(resid))² / k .

       For more information, see Cook (1977, 1979).
       The predicted residual for observation i is defined as the residual for the i-th obser-
     vation that results from dropping the i-th observation from the parameter estimates.
     The sum of squares of predicted residual errors is called the PRESS statistic:

          presid_i = resid_i / (1 − h_i)

          press = Σ presid_i² .
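     The leverage and PRESS formulas can be illustrated with numpy and
     cross-checked against explicit leave-one-out refits (invented data):

```python
import numpy as np

# Small invented regression to illustrate leverage and the PRESS statistic.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([2.0, 2.5, 4.0, 4.5, 6.5])

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
resid = y - X @ b
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverages h_i = x_i (X'X)^-1 x_i'
presid = resid / (1 - h)                      # predicted (leave-one-out) residuals
press = (presid ** 2).sum()

# Cross-check: presid_i equals the residual at i after refitting without row i.
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    assert np.isclose(presid[i], y[i] - X[i] @ b_i)
```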
    Testing  Linear Hypotheses

    The general form of a linear hypothesis for the parameters is:

         H₀: Lβ = c ,

    where L is q×k, β is k×1, and c is q×1. To test this hypothesis, the linear function
    is evaluated at the parameter estimates:

         (Lb − c) .

    This has variance:

         Var(Lb − c) = L Var(b) L′ = L(X′X)⁻L′ σ² .

    A quadratic form called the sum of squares due to the hypothesis is calculated:

         SS(Lb − c) = (Lb − c)′ (L(X′X)⁻L′)⁻¹ (Lb − c) .


Assuming that this is testable, the SS can be used as the numerator of the F test:

      F = (SS(Lb − c)/q) / s² .

This is referred to an F distribution with q and dfe degrees of freedom, where dfe is
the degrees of freedom for residual error.
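A numerical sketch of this test (simulated data, illustrative names). For q = 1 the F statistic reduces to the square of the usual t statistic for the contrast, which makes a convenient check:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 0.5, 0.5]) + rng.normal(size=30)

n, k = X.shape
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
resid = y - X @ b
dfe = n - k
s2 = resid @ resid / dfe

# H0: b1 - b2 = 0 (the two slopes are equal); L is q x k, c is q x 1
L = np.array([[0.0, 1.0, -1.0]])
c = np.array([0.0])
q = L.shape[0]

diff = L @ b - c
ss = diff @ np.linalg.inv(L @ XtX_inv @ L.T) @ diff  # SS due to the hypothesis
F = (ss / q) / s2                                    # referred to F(q, dfe)
```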
Multivariate Tests
Multivariate hypotheses involve several dependent variables in the form:

      H0: LβM = d ,

where L is a linear function on the regressor side, β is a matrix of parameters, M is a
linear function on the dependent side, and d is a matrix of constants.
  The special case (handled by REG) where the constants are the same for each
dependent variable is written:

      (Lβ − cj)M = 0 ,

where c is a column vector of constants, and j is a row vector of 1s. The special case
where the constants are 0 is

      LβM = 0 .

These multivariate tests are covered in detail in Morrison (1976); Timm (1975);
Mardia, Kent, and Bibby (1979); Bock (1975); and other works cited in Chapter 20,
"Introduction to SAS Multivariate Procedures."
  To test this hypothesis, construct two matrices, H and E, that correspond to the
numerator and denominator of a univariate F test:

      H = M'(LB − cj)'(L(X'X)⁻¹L')⁻¹(LB − cj)M

      E = M'(Y'Y − B'(X'X)B)M .

Four test statistics, based on the eigenvalues of E⁻¹H or (E + H)⁻¹H, are formed. Let
λ_i be the ordered eigenvalues of E⁻¹H (if the inverse exists), and let ξ_i be the ordered
eigenvalues of (E + H)⁻¹H. It happens that ξ_i = λ_i/(1 + λ_i) and λ_i = ξ_i/(1 − ξ_i), and it turns
out that ρ_i = √ξ_i is the i-th canonical correlation.
  Let p be the rank of (H + E), which is less than or equal to the number of columns
of M. Let q be the rank of L(X'X)⁻¹L'. Let v be the degrees of freedom for error. Let
s = min(p,q). Let m = .5(|p − q| − 1), and let n = .5(v − p − 1). Then the statistics below
have the approximate F statistics as shown:

     • Wilk's Lambda

            Λ = det(E)/det(H + E) = Π 1/(1 + λ_i) = Π (1 − ξ_i) .

       F = ((1 − Λ^(1/t))/Λ^(1/t)) (rt − 2u)/(pq) is approximately F, where
       r = v − (p + q + 1)/2, u = (pq − 2)/4, and t = √((p²q² − 4)/(p² + q² − 5)) if
       (p² + q² − 5) > 0 or 1 otherwise. The degrees of freedom are pq and
       rt − 2u. This approximation is exact if min(p,q) ≤ 2. (See Rao, 1973,
       p. 556.)


     • Pillai's Trace

            V = trace(H(H + E)⁻¹) = Σ λ_i/(1 + λ_i) = Σ ξ_i .

       F = ((2n + s + 1)/(2m + s + 1)) V/(s − V) is approximately F with s(2m + s + 1) and
       s(2n + s + 1) degrees of freedom.

     • Hotelling-Lawley Trace

            U = trace(E⁻¹H) = Σ λ_i = Σ ξ_i/(1 − ξ_i) .

       F = 2(sn + 1)U/(s²(2m + s + 1)) is approximately F with s(2m + s + 1) and
       2(sn + 1) degrees of freedom.

     • Roy's Maximum Root

            Θ = λ_1 .

       F = Θ(v − r + q)/r, where r = max(p,q), is an upper bound on F that yields a
       lower bound on the significance level.
       Tables of critical values for these statistics are found in Pillai (1960).
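The four statistics can be sketched from the eigenvalues of E⁻¹H with numpy (simulated two-response data, M taken as the identity, illustrative names throughout):

```python
import numpy as np

rng = np.random.default_rng(4)
n_obs = 40
X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, 2))])
Y = rng.normal(size=(n_obs, 2))           # two dependent variables

B = np.linalg.solve(X.T @ X, X.T @ Y)     # parameter estimates
E = Y.T @ Y - B.T @ (X.T @ X) @ B         # error matrix (M = identity)

# H for H0: both slope rows of B are zero (L picks rows 1 and 2, c = 0)
L = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
XtX_inv = np.linalg.inv(X.T @ X)
LB = L @ B
H = LB.T @ np.linalg.inv(L @ XtX_inv @ L.T) @ LB

lam = np.sort(np.linalg.eigvals(np.linalg.solve(E, H)).real)[::-1]
wilks = np.prod(1.0 / (1.0 + lam))        # det(E)/det(H+E)
pillai = np.sum(lam / (1.0 + lam))        # trace(H(H+E)^-1)
hotelling = np.sum(lam)                   # trace(E^-1 H)
roy = lam[0]                              # largest eigenvalue
```

The eigenvalue forms agree with the determinant and trace forms, which makes a direct numerical check possible.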
    REFERENCES

Allen, D.M. (1971), "Mean Square Error of Prediction as a Criterion for Selecting
  Variables," Technometrics, 13, 469-475.
Allen, D.M. and Cady, F.B. (1982), Analyzing Experimental Data by Regression,
  Belmont, CA: Lifetime Learning Publications.
Belsley, D.A., Kuh, E., and Welsch, R.E. (1980), Regression Diagnostics, New York:
  John Wiley & Sons.
Bock, R.D. (1975), Multivariate Statistical Methods in Behavioral Research, New
  York: McGraw-Hill.
Box, G.E.P. (1966), "The Use and Abuse of Regression," Technometrics, 8,
  625-629.
Cook, R.D. (1977), "Detection of Influential Observations in Linear Regression,"
  Technometrics, 19, 15-18.
Cook, R.D. (1979), "Influential Observations in Linear Regression," Journal of the
  American Statistical Association, 74, 169-174.
Daniel, C. and Wood, F. (1980), Fitting Equations to Data, Revised Edition, New
  York: John Wiley & Sons.
Draper, N. and Smith, H. (1981), Applied Regression Analysis, Second Edition,
  New York: John Wiley & Sons.
Durbin, J. and Watson, G.S. (1951), "Testing for Serial Correlation in Least Squares
  Regression," Biometrika, 37, 409-428.
Freund, R.J. and Littell, R.C. (1981), SAS for Linear Models, Cary, NC: SAS Institute.
Goodnight, J.H. (1979), "A Tutorial on the SWEEP Operator," The American
  Statistician, 33, 149-158.
Johnston, J. (1972), Econometric Methods, New York: McGraw-Hill.
Kennedy, W.J. and Gentle, J.E. (1980), Statistical Computing, New York: Marcel
  Dekker.
Mallows, C.L. (1973), "Some Comments on Cp," Technometrics, 15, 661-675.
Mardia, K.V., Kent, J.T., and Bibby, J.M. (1979), Multivariate Analysis, London:
  Academic Press.
Morrison, D.F. (1976), Multivariate Statistical Methods, Second Edition, New York:
  McGraw-Hill.
Mosteller, F. and Tukey, J.W. (1977), Data Analysis and Regression, Reading,
  MA: Addison-Wesley.
Neter, J. and Wasserman, W. (1974), Applied Linear Statistical Models, Home-
  wood, IL: Irwin.
Pillai, K.C.S. (1960), Statistical Table for Tests of Multivariate Hypotheses, Manila:
  The Statistical Center, University of Philippines.
Pindyck, R.S. and Rubinfeld, D.L. (1981), Econometric Models and Econometric
  Forecasts, Second Edition, New York: McGraw-Hill.
Rao, C.R. (1973), Linear Statistical Inference and Its Applications, Second Edition,
  New York: John Wiley & Sons.
Sall, J.P. (1981), "SAS Regression Applications," Revised Edition, SAS Technical
  Report A-102, Cary, NC: SAS Institute.
Timm, N.H. (1975), Multivariate Analysis with Applications in Education and
  Psychology, Monterey, CA: Brooks/Cole.
Weisberg, S. (1980), Applied Linear Regression, New York: John Wiley & Sons.

The STEPWISE Procedure
ABSTRACT

The STEPWISE procedure provides five methods for stepwise regression. STEPWISE
is useful when you have many independent variables and want to find which of the
variables should be included in a regression model.
INTRODUCTION

STEPWISE is most helpful for exploratory analysis, because it can give you insight
into the relationships between the independent variables and the dependent or
response variable. However, STEPWISE is not guaranteed to give you the "best"
model for your data, or even the model  with the largest R2. And no model
developed by these means can be guaranteed to represent real-world processes ac-
curately.


STEPWISE and Other Model-Building Procedures

STEPWISE differs from RSQUARE, another procedure used for exploratory model
analysis. RSQUARE finds the R2 value for all possible combinations of the indepen-
dent variables. Therefore, RSQUARE always identifies the model with the largest R2
for each number of variables considered. STEPWISE uses the selection strategies
described below in choosing the variables for the models it considers. Also, when
STEPWISE evaluates a model, it prints a complete report on the regression, while
RSQUARE prints only the R2 value for each model. RSQUARE requires much more
computer time than STEPWISE.


Model-Selection  Methods

The five methods of model selection implemented in PROC STEPWISE are:

        FORWARD forward selection
       BACKWARD backward elimination
         STEPWISE stepwise regression, forward and backward
            MAXR forward selection with pair switching
            MINR forward selection with pair searching

A survey article by Hocking (1976) describes these and other variable-selection
methods. The five methods are described below with the keyword specified in the
MODEL statement to request each method.

Forward selection (FORWARD)  The forward-selection technique begins with no
variables in the model. For each of the independent variables, FORWARD
calculates F statistics reflecting the variable's contribution to the model if it is in-
cluded. These F statistics are compared to the SLENTRY= value that is specified in
the MODEL statement (or to .50 if SLENTRY= is omitted). If no F statistic is
significant at the SLENTRY= level, FORWARD stops. Otherwise, FORWARD adds
the variable that has the largest F statistic to the model. FORWARD then
calculates F statistics again for the variables still remaining outside the model, and
the evaluation process is repeated. Variables are thus added one by one to the
model until no remaining variable produces a significant F statistic. Once a
variable is in the model, it stays.
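The selection loop above can be sketched in Python (this is an illustration, not the SAS implementation; for simplicity it uses a fixed F-to-enter threshold in place of the SLENTRY= significance level, and the data are simulated):

```python
import numpy as np

def partial_f(X, y, cols, new):
    """F statistic for adding column `new` to the model with columns `cols`."""
    def sse(c):
        b = np.linalg.lstsq(X[:, c], y, rcond=None)[0]
        r = y - X[:, c] @ b
        return r @ r
    sse_red = sse(cols)
    sse_full = sse(cols + [new])
    dfe = len(y) - len(cols) - 1
    return (sse_red - sse_full) / (sse_full / dfe)

rng = np.random.default_rng(5)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y = 1.0 + 3.0 * X[:, 1] + rng.normal(size=n)   # only variable 1 matters

model = [0]                 # start with the intercept only
candidates = [1, 2, 3, 4]
F_ENTER = 4.0               # fixed F-to-enter threshold (stands in for SLENTRY=)

while candidates:
    fs = {v: partial_f(X, y, model, v) for v in candidates}
    best = max(fs, key=fs.get)
    if fs[best] < F_ENTER:
        break
    model.append(best)      # once in, a variable stays (forward selection)
    candidates.remove(best)
```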


Backward elimination (BACKWARD)  The backward-elimination technique
begins by calculating statistics for a model including all of the independent
variables. Then the variables are deleted from the model one by one until all the
variables remaining in the model produce F statistics significant at the SLSTAY=
level specified in the MODEL statement (or at the .10 level, if SLSTAY= is omitted).
At each step, the variable showing the smallest contribution to the model is
deleted.

Stepwise (STEPWISE)  The stepwise method is a modification of the forward-
selection technique and differs in that variables already in the model do not
necessarily stay there. As in the forward-selection method, variables are added one
by one to the model, and the F statistic for a variable to be added must be signifi-
cant at the SLENTRY= level. After a variable is added, however, the stepwise
method looks at all the variables already included in the model and deletes any
variable that does not produce an F statistic significant at the SLSTAY= level. Only
after this check is made and the necessary deletions accomplished can another
variable be added to the model. The stepwise process ends when none of the
variables outside the model has an F statistic significant at the SLENTRY= level and
every variable in the model is significant at the SLSTAY= level, or when the
variable to be added to the model is one just deleted from it.
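A sketch of this entry-then-removal cycle (again illustrative Python, not the SAS implementation; fixed F thresholds stand in for the SLENTRY= and SLSTAY= significance levels, and an iteration cap stands in for the cycling stop rule):

```python
import numpy as np

def sse(X, y, cols):
    b = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
    r = y - X[:, cols] @ b
    return r @ r

def f_change(X, y, small, big):
    """F statistic comparing nested models with columns `small` within `big`."""
    dfe = len(y) - len(big)
    return (sse(X, y, small) - sse(X, y, big)) / (sse(X, y, big) / dfe)

rng = np.random.default_rng(6)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y = 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(size=n)

F_ENTER, F_STAY = 4.0, 4.0
model, outside = [0], [1, 2, 3, 4]
for _ in range(10):
    if not outside:
        break
    fs = {v: f_change(X, y, model, model + [v]) for v in outside}
    best = max(fs, key=fs.get)
    if fs[best] < F_ENTER:
        break
    model.append(best)
    outside.remove(best)
    # removal check: drop any variable (other than the intercept) that no
    # longer produces a significant F statistic given the rest of the model
    for v in model[1:]:
        others = [c for c in model if c != v]
        if f_change(X, y, others, model) < F_STAY:
            model.remove(v)
            outside.append(v)
            break
```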


     Maximum R2 improvement  (MAXR)   The maximum R2 improvement technique
     developed  by James Goodnight is considered  superior to the stepwise technique
     and almost as good as all possible regressions.  Unlike the three techniques above,
this method does not settle on a single model. Instead, it tries to find the best one-
variable model, the best two-variable model, and so forth, although it is not
guaranteed to find the model with the largest R2 for each size.
       The MAXR method  begins by finding the one-variable model  producing the
     highest R2. Then another variable, the one that yields the greatest increase in R2, is
     added.  Once  the two-variable model  is obtained, each of the variables in the
     model is compared to each variable not in the model. For each comparison, MAXR
     determines if removing one variable and replacing it with the other variable in-
     creases R2.  After comparing all possible switches, the one that produces the largest
     increase in  R2 is made.  Comparisons begin again, and the process continues until
     MAXR finds that no  switch could increase  R2. The  two-variable model  thus
     achieved is considered the "best"  two-variable model the  technique can  find.
     Another variable is then added to the model, and the comparing-and-switching
     process is repeated to find the "best" three-variable model, and so forth.
       The difference between the stepwise technique and the maximum R2 improve-
     ment  method is that all switches are evaluated before any switch  is made in the
     MAXR method. In the stepwise method, the  "worst" variable may be removed


without considering what adding the "best" remaining variable might accomplish.
The MAXR method may require much more computer time than the STEPWISE
method.


Minimum R2 improvement (MINR)  The MINR method closely resembles MAXR,
but the switch chosen is the one that produces the smallest increase in R2. For a
given  number of variables in the model, MAXR and MINR usually produce the
same "best" model, but MINR considers more models of each size.
Significance Levels
When many significance tests are performed, each at a level of, say, 5%, the overall
probability of rejecting at least one true null hypothesis is much larger than 5%. If
you want to guard against including any variables that do not contribute to the
predictive power of the model in the population, you should specify a very small
significance level.  In  most applications many variables considered  have some
predictive power, however small. If you want to choose the model that provides
the best prediction using the sample estimates, you need only guard  against
estimating more parameters than can be reliably estimated with the given  sample
size, so you should use a moderate significance level, perhaps in the range of 10%
to 25%.
Cp Statistic
Cp was proposed by Mallows as a criterion for selecting a model. Cp is a measure of
total squared error defined:

     Cp = (SSEp/s2) - (N - 2p)  ,
where
       s2 is the MSE for the full model,
     SSEp is the error sum of squares for a model with p parameters,
          including the intercept.

If Cp is graphed against p, Mallows recommends the model where Cp first approaches
p. When the right model is chosen, the parameter estimates are unbiased, and this
is reflected in Cp near p. For further discussion, see Daniel and Wood (1980).
SPECIFICATIONS

The statements used to control PROC STEPWISE are:

  PROC STEPWISE options;
    MODEL dependents=independents / options;
    WEIGHT variable;
    BY variables;

STEPWISE needs at least one MODEL statement. The BY and WEIGHT statements
can be placed anywhere.


     PROC STEPWISE Statement
      PROC STEPWISE options;
     Only one option is used on the PROC statement:
   DATA=SASdataset  names the SAS data set containing the data for the
                    regression. If it is omitted, the most recently created
                    data set is used.
     MODEL Statement
  MODEL dependents=independents / options;
     In the MODEL statement, list the dependent variables on the left side of the equal
     sign and the independent variables on the right  side of the equal sign.
      For each dependent variable given, STEPWISE goes through the model-building
     process using the independent variables listed. Any number of MODEL statements
     may be included. The options below may be specified in the MODEL statement
     after a slash (/).
            NOINT  prevents the procedure from automatically including
                   an intercept term in the model.

         FORWARD
                F  requests the forward-selection technique.

        BACKWARD
                B  requests the backward-elimination technique.

        STEPWISE  requests the stepwise technique, the default.

            MAXR  requests the maximum R2 improvement technique.

            MINR  requests the minimum R2 improvement technique.

   SLENTRY=value
       SLE=value  specifies the significance level for entry into the model
                  used in the forward-selection and stepwise techniques.
                  If SLENTRY= is omitted, STEPWISE uses the
                  SLENTRY= value .50 for forward selection, .15 for
                  stepwise.

    SLSTAY=value
       SLS=value  specifies the significance level for staying in the model
                  for the backward-elimination and stepwise techniques.
                  If it is omitted, STEPWISE uses the SLSTAY= value .10
                  for backward elimination, .15 for stepwise.

       INCLUDE=n  forces the first n independent variables always to be
                  included in the model. The selection techniques are
                  performed on the other variables in the MODEL state-
                  ment.

         START=s  begins the comparing-and-switching process with a
                  model containing the first s independent variables
                  in the MODEL statement, where s is the START value.
                  Consequently, no model is evaluated that contains
                  fewer than s variables. This applies only to the MAXR
                  or MINR methods.

          STOP=s  causes STEPWISE to stop when it has found the "best"
                  s-variable model, where s is the STOP value. This
                  applies only to the MAXR or MINR methods.


WEIGHT Statement
  WEIGHT variable;
The WEIGHT statement is used to specify a variable in the data set containing
weights for the observations. Only observations with positive values of the
WEIGHT variable are used in the analysis.

BY Statement

  BY variables;
A BY statement may be used with PROC STEPWISE to obtain separate analyses on
observations in groups defined by the BY variables. When a BY statement appears,
the procedure expects the input data set to be sorted in order of the BY variables. If
your input data set is not sorted in ascending order, use the SORT procedure with a
similar BY statement to sort the data, or, if appropriate, use the BY statement op-
tions NOTSORTED or DESCENDING. For more information, see the discussion of
the BY statement in Chapter 8, "Statements Used in the PROC Step," in SAS User's
Guide: Basics, 1982 Edition.
DETAILS

Missing Values
STEPWISE omits observations from the calculations for a given model if the obser-
vation has missing values for any of the variables in the model. The observation is
included for any models that do not include the variables with missing values.


Limitations
Any number of dependent variables may be included in a  MODEL statement.
Although there is no built-in limit on the number of independent variables,  the
calculations for a model with many variables are lengthy. For the MAXR or MINR
technique, a reasonable maximum for the number of independent variables in a
single MODEL statement is about 20.


Printed  Output
For each model  of a given size, STEPWISE prints an analysis-of-variance table,  the
regression coefficients, and related statistics.

  The analysis-of-variance table includes:
   1. the source of variation REGRESSION,  which is the variation that is at-
      tributed to the independent variables  in the model
   2. the source of variation ERROR, which is the residual variation that is not
      accounted for by the  model
   3. the source of variation TOTAL, which is corrected for the mean of y if
      an intercept is included in the model, uncorrected if an intercept is not
      included
   4. DF, degrees of freedom
   5. SUMS OF SQUARES for REGRESSION, ERROR, and TOTAL
   6. MEAN SQUARES for  REGRESSION and ERROR

        7. the F value, which is the ratio of the REGRESSION mean square to the
           ERROR mean square
        8. PROB > F, the significance probability of the F value
        9. R SQUARE or R2, the square of the multiple correlation coefficient
        10. C(P) statistic proposed by Mallows.
     Below the analysis-of-variance table are printed:
        11. the names of the independent variables included in the model
        12. B VALUES, the corresponding estimated regression coefficients
        13. STD ERROR of the estimates
        14. TYPE II SS (sum of squares) for each variable, which is the SS that is
           added to the error SS if that one variable is removed from the model
        15. F values and PROB > F associated with the Type II sums of squares.
     EXAMPLE

     The example below asks for the FORWARD, BACKWARD, and MAXR methods.
   *----------------DATA ON PHYSICAL FITNESS-----------------*
   | THESE MEASUREMENTS WERE MADE ON MEN INVOLVED IN A       |
   | PHYSICAL FITNESS COURSE AT N.C. STATE UNIV. THE         |
   | VARIABLES ARE AGE(YEARS), WEIGHT(KG), OXYGEN UPTAKE     |
   | RATE(ML PER KG BODY WEIGHT PER MINUTE), TIME TO RUN     |
   | 1.5 MILES(MINUTES), HEART RATE WHILE RESTING, HEART     |
   | RATE WHILE RUNNING (SAME TIME OXYGEN RATE MEASURED),    |
   | AND MAXIMUM HEART RATE RECORDED WHILE RUNNING. CERTAIN  |
   | VALUES OF MAXPULSE WERE MODIFIED FOR CONSISTENCY.       |
   | DATA COURTESY DR. A. C. LINNERUD                        |
   *---------------------------------------------------------*;

   DATA FITNESS;
     INPUT AGE WEIGHT OXY RUNTIME RSTPULSE RUNPULSE MAXPULSE @@;
     CARDS;
   44 89.47 44.609 11.37 62 178 182   40 75.07 45.313 10.07 62 185 185
   44 85.84 54.297  8.65 45 156 168   42 68.15 59.571  8.17 40 166 172
   38 89.02 49.874  9.22 55 178 180   47 77.45 44.811 11.63 58 176 176
   40 75.98 45.681 11.95 70 176 180   43 81.19 49.091 10.85 64 162 170
   44 81.42 39.442 13.08 63 174 176   38 81.87 60.055  8.63 48 170 186
   44 73.03 50.541 10.13 45 168 168   45 87.66 37.388 14.03 56 186 192
   45 66.45 44.754 11.12 51 176 176   47 79.15 47.273 10.60 47 162 164
   54 83.12 51.855 10.33 50 166 170   49 81.42 49.156  8.95 44 180 185
   51 69.63 40.836 10.95 57 168 172   51 77.91 46.672 10.00 48 162 168
   48 91.63 46.774 10.25 48 162 164   49 73.37 50.388 10.08 67 168 168
   57 73.37 39.407 12.63 58 174 176   54 79.38 46.080 11.17 62 156 165
   52 76.32 45.441  9.63 48 164 166   50 70.87 54.625  8.92 48 146 155
   51 67.25 45.118 11.08 48 172 172   54 91.63 39.203 12.88 44 168 172
   51 73.71 45.790 10.47 59 186 188   57 59.08 50.545  9.93 49 148 155
   49 76.32 48.673  9.40 56 186 188   48 61.24 47.920 11.50 52 170 176
   52 82.78 47.467 10.50 53 170 172
   ;
   PROC STEPWISE;
     MODEL OXY=AGE WEIGHT RUNTIME RUNPULSE RSTPULSE MAXPULSE
               / FORWARD BACKWARD MAXR;

[Output from PROC STEPWISE for dependent variable OXY. The forward-selection
procedure enters RUNTIME (R SQUARE = 0.7434, C(P) = 13.70), then AGE (0.7642,
12.39), RUNPULSE (0.8111, 6.96), MAXPULSE (0.8368, 4.88), and WEIGHT (0.8480,
5.11); no other variables meet the 0.5000 significance level for entry. Backward
elimination begins with all variables (R SQUARE = 0.8487, C(P) = 7.00), removes
RSTPULSE and then WEIGHT, and stops when all variables remaining in the model
are significant at the 0.1000 level. Maximum R-square improvement reports the
best model of each size, from the one-variable model containing RUNTIME up
through the full six-variable model. For each step the procedure prints the
analysis-of-variance table (DF, sum of squares, mean square, F, PROB>F) and, for
each variable, the B value, standard error, Type II sum of squares, F, and PROB>F.]

     REFERENCES

     Daniel, Cuthbert and Wood, Fred S. (1980), Fitting Equations to Data, Second
       Edition, New York: John Wiley and Sons, Inc.
     Draper, N.R. and Smith, H. (1981), Applied Regression Analysis, Second Edition,
       New York: John Wiley and Sons, Inc.
     Hocking, R.R. (1976), "The Analysis and Selection of Variables in Linear Regres-
       sion," Biometrics, 32, 1-50.
     Mallows, C.L. (1973), "Some Comments on Cp," Technometrics, 15, 661-675.
     Sall, J. (1981), SAS Regression Applications, Technical Report A-102, Cary, N.C.:
       SAS Institute.

ANALYSIS OF VARIANCE

              Introduction
                ANOVA
                  GLM
                NESTED
              NPAR1WAY
                 PLAN
                 TTEST
               VARCOMP
The Four Types of Estimable Functions

Introduction to SAS Analysis-of-Variance Procedures
This chapter reviews the SAS procedures that are used for analysis of variance:
GLM, ANOVA, NESTED, VARCOMP, NPAR1WAY, TTEST, and PLAN.
  The most general analysis-of-variance procedure is GLM, which can handle most
problems. Other procedures are used for special cases as described below:

            GLM  performs analysis of variance, regression, analysis of
                  covariance, and multivariate analysis of variance
          ANOVA  handles analysis of variance for balanced designs
          NESTED  performs analysis of variance for purely nested random
                  models
        VARCOMP  estimates variance components
       NPAR1WAY  performs nonparametric one-way analysis of rank
                  scores
            TTEST  compares the means of two groups of observations
            PLAN  generates random permutations for experimental
                  plans.

Introduction

These procedures perform analysis of variance, which is a technique for analyzing
experimental data. A continuous response variable is measured under various ex-
perimental conditions identified by classification variables. The variation in the
response is "explained" as due to effects in the classification with random error ac-
counting for the remaining variation.
  For each observation, the ANOVA model predicts the response, often by a sample
mean. The difference between the actual and the predicted response is the
residual error. Analysis-of-variance procedures fit parameters to minimize the sum
of squares of residual errors; thus the method is called least squares. The variance
of the random error, σ², is estimated by the mean squared error (MSE or s²).
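The least-squares fit described above can be sketched numerically. The following is an illustrative Python sketch, not SAS code, using the two-city readings from the handout as example data: each response is predicted by its cell mean, and σ² is estimated by MSE = SSE/(n − k).

```python
# Illustrative sketch (Python, not SAS): the one-way ANOVA model predicts
# each response by its cell (group) mean; residuals are observed minus
# predicted, and sigma^2 is estimated by MSE = SSE / (n - k).
from statistics import mean

# the two-city readings from the handout, treated as two groups
groups = {"A": [760.0, 790.0, 820.0, 805.0, 750.0],
          "B": [770.0, 840.0, 780.0, 810.0, 830.0]}

cell_means = {g: mean(ys) for g, ys in groups.items()}     # predictions
residuals = [y - cell_means[g]
             for g, ys in groups.items() for y in ys]

sse = sum(r * r for r in residuals)     # sum of squared residual errors
n = sum(len(ys) for ys in groups.values())
k = len(groups)                         # number of cells fit
mse = sse / (n - k)                     # MSE, the estimate of sigma^2
```

With these data the cell means are 785 and 806, SSE = 7220, and MSE = 7220/8 = 902.5.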
  Analysis of variance was pioneered by R.A. Fisher (1925). For a general
introduction to analysis of variance, see an intermediate statistical methods
textbook such as Steel and Torrie (1980), Snedecor and Cochran (1980),
Mendenhall (1968), John (1971), Ott (1977), or Kirk (1968). A classic source is
Scheffé (1959). Freund and Littell (1981) bring together a treatment of these
statistical methods and SAS procedures. Linear models texts include Searle (1971),
Graybill (1961), and Bock (1975). Kennedy and Gentle (1980) survey the
computing aspects.

-------
114  Chapter?

     ANOVA for Balanced Designs

     One of the factors that determines which procedure to use is whether your data are
     balanced or unbalanced. When you design an experiment,  you choose how many
     experimental units to assign to each combination of levels (or cells) in the classifica-
     tion.  In order to achieve good statistical properties and  simplify the  statistical
     arithmetic, you typically attempt to assign the same number of units to every cell in
     the design. These designs are called balanced.
      If you have balanced data, the arithmetic for calculating sums of squares can be
     greatly simplified. In SAS, you can use the ANOVA procedure rather than the more
     expensive GLM procedure for balanced data. Generalizations of the balanced con-
     cept can be made to use the  arithmetic for balanced designs even  though the
     design does not contain an equal number of observations per cell. You can use
     balanced arithmetic for all one-way models regardless of how unbalanced the cell
     counts are. You can even use the balanced arithmetic for Latin squares that do not
     always have data in all cells.
      However, if you  use the  ANOVA procedure to analyze a design that is not
     balanced, you may get  incorrect results, including negative  values reported for the
     sums of squares.
      Analysis-of-variance procedures construct ANOVA tests by comparing mean
     squares relative to their expected values under the null hypothesis.  Each mean
     square in a  fixed analysis-of-variance model has an expected value that is com-
     posed of two components: quadratic functions of fixed parameters and random
     variation. For a fixed effect called A, the expected value of  its mean square is writ-
     ten:

          E(MS(A)) = Q(β) + σ² .
       The mean square is constructed so that under the hypothesis to be tested (null
     hypothesis) the fixed portion Q(β) of the expected value is zero. This mean square
     is then compared to another mean square, say MS(E), that is independent of the
     first, yet has the expected value σ². The ratio of the two mean squares is an F
     statistic that has the F distribution under the null hypothesis:

          F = MS(A)/MS(E) .

       When the null hypothesis is false, the numerator term has a larger expected
     value, but the expected value of the denominator remains the same. Thus large F
     values lead to rejection of the null hypothesis. The test controls the Type 1 error
     rate, the probability of rejecting a true null hypothesis. You look at the significance
     probability, the probability of getting an even larger F value if the null hypothesis
     is true. If this probability is small, say below .05 or .01, then rejecting the null
     hypothesis is wrong less than 5% or 1% of the time, respectively. If you are unable
     to reject the hypothesis, you conclude either that the null hypothesis is true or
     that you do not have enough data to detect the differences being tested.
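The construction of F = MS(A)/MS(E) for a one-way layout can be sketched as follows. This is an illustrative Python sketch with hypothetical data and group names, not SAS output; the PROB>F lookup that SAS prints is omitted.

```python
# Illustrative sketch (Python, not SAS): forming F = MS(A)/MS(E) for a
# one-way layout with hypothetical data; the PROB>F lookup is omitted.
from statistics import mean

data = {"a1": [10.0, 12.0, 11.0],
        "a2": [14.0, 15.0, 16.0],
        "a3": [20.0, 19.0, 21.0]}

all_y = [y for ys in data.values() for y in ys]
grand = mean(all_y)
n, k = len(all_y), len(data)

# between-group (effect A) and within-group (error) sums of squares
ss_a = sum(len(ys) * (mean(ys) - grand) ** 2 for ys in data.values())
ss_e = sum((y - mean(ys)) ** 2 for ys in data.values() for y in ys)

ms_a = ss_a / (k - 1)     # numerator mean square, E = Q(beta) + sigma^2
ms_e = ss_e / (n - k)     # denominator mean square, E = sigma^2
f_stat = ms_a / ms_e      # referred to an F(k-1, n-k) distribution
```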
     General Linear Models

     If your data do  not fit into a balanced design,  then  you probably need the
     framework of linear models in the GLM procedure.
      An analysis-of-variance model can be written as a linear model, an equation to
     predict the  response as a  linear function of parameters and design variables. In
     general we write:

-------
                            Introduction to SAS Analysis-of-Variance Procedures  115

      yᵢ = β₀x₀ᵢ + β₁x₁ᵢ + ... + βₖxₖᵢ + εᵢ ,    i = 1,...,n

where yᵢ is the response for the ith observation, the βⱼ are unknown parameters to
be estimated, and the xⱼᵢ are design variables. Design variables for analysis of
variance are indicator variables, that is, they are always either 0 or 1.
  The simplest model is to fit a single mean to all observations. In this case there is
only one parameter, β₀, and one design variable, x₀ᵢ, which always has the value 1:

      yᵢ = β₀x₀ᵢ + εᵢ
         = β₀ + εᵢ .

  The least-squares estimator of β₀ is the mean of the yᵢ. This simple model
underlies all more complex models, and all larger models are compared to this
simple mean model.
  A one-way model is written by introducing an indicator variable for each level of
the classification variable. Suppose that a variable A has four levels, with two
observations per level. The indicators are created as shown below:
      Intercept   a1   a2   a3   a4
          1        1    0    0    0
          1        1    0    0    0
          1        0    1    0    0
          1        0    1    0    0
          1        0    0    1    0
          1        0    0    1    0
          1        0    0    0    1
          1        0    0    0    1
  The linear model can be written:

      yᵢ = β₀ + a1ᵢβ₁ + a2ᵢβ₂ + a3ᵢβ₃ + a4ᵢβ₄ .

  To construct crossed and nested effects, you can simply multiply out all combina-
tions  of the main-effect  columns. This is described in detail  in the  section
Parameterization in the GLM procedure.
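The indicator columns above, and the crossed-effect columns formed by multiplying main-effect columns, can be sketched as follows. This is an illustration of the idea in Python, not GLM's actual parameterization code, and the factor levels and names are hypothetical.

```python
# Illustrative sketch (Python, not GLM internals): indicator columns for
# a four-level CLASS variable, plus crossed-effect columns formed as
# elementwise products of main-effect columns. Names are hypothetical.
levels_a = ["a1", "a1", "a2", "a2", "a3", "a3", "a4", "a4"]
levels_b = ["b1", "b2"] * 4          # a second factor crossed with A

intercept = [1] * len(levels_a)
cols_a = {lev: [int(obs == lev) for obs in levels_a]
          for lev in ("a1", "a2", "a3", "a4")}
cols_b = {lev: [int(obs == lev) for obs in levels_b]
          for lev in ("b1", "b2")}

# A*B interaction: multiply every A column by every B column
cols_ab = {a + "*" + b: [x * y for x, y in zip(cols_a[a], cols_b[b])]
           for a in cols_a for b in cols_b}
```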
Linear hypotheses   When models are expressed in the framework of linear
models, hypothesis tests are expressed in terms of a linear function of the
parameters. For example, you may want to test that β₂ - β₃ = 0. In general, the
coefficients for linear hypotheses are some set of Ls:

      H₀: L₀β₀ + L₁β₁ + ... + Lₖβₖ = 0 .

  Several of these linear functions can be combined to make one joint test. Tests
can also be expressed in one matrix equation:

      H₀: Lβ = 0 .

  For each linear hypothesis, a sum of squares due to that hypothesis can be
constructed. This sum of squares can be calculated either as a quadratic form of
the estimates:

      SS(Lβ = 0) = (Lb)′ (L(X′X)⁻L′)⁻¹ (Lb)

-------

     or as the increase in SSE for the model constrained by the hypothesis

           SS(L/? = 0)= SSE(constrained) - SSE(full).

     This SS is  then divided by degrees of freedom and used as a numerator of an F
     statistic.
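The increase-in-SSE form lends itself to a short sketch. Here, with hypothetical data, the hypothesis is that two of three group means are equal, so the constrained model pools those two cells. This is an illustrative Python sketch, not SAS code.

```python
# Illustrative sketch (Python, not SAS) of the increase-in-SSE form:
# SS(L*beta = 0) = SSE(constrained) - SSE(full), for the hypothetical
# hypothesis that groups a2 and a3 share a common mean.
from statistics import mean

data = {"a1": [3.0, 5.0], "a2": [8.0, 10.0], "a3": [9.0, 13.0]}

def sse(groups):
    """Residual sum of squares when each cell is fit by its own mean."""
    return sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)

sse_full = sse(data)

# constrained model: pool a2 and a3 into one cell with a common mean
pooled = {"a1": data["a1"], "a2+a3": data["a2"] + data["a3"]}
sse_constrained = sse(pooled)

ss_hyp = sse_constrained - sse_full   # hypothesis SS, 1 df here
```

Dividing ss_hyp by its degrees of freedom (1 here) gives the numerator mean square for the F statistic.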

     Random effects  To estimate the variances of random effects, use the VARCOMP
     or NESTED procedures; PROC GLM does not estimate variance components but
     can produce expected mean squares.
       A random effect is  an  effect  whose parameters are drawn  from a normally
     distributed random  process with mean zero and common variance.  Effects are
     declared random when the levels are randomly selected from a large population of
     possible levels. The inferences concern fixed  effects but can be generalized across
     the whole population of random effects levels rather than only those levels in your
     sample.
       In agricultural experiments, it is common to declare location or plot  as random
     since these levels are chosen randomly from a large population and you assume
     fertility to  vary normally across locations. In  repeated-measures experiments with
     people or animals as subjects, subjects are declared random since they are selected
     from the larger population to which you want to generalize.
       When effects are declared  random in GLM, the expected mean square of each
     effect is calculated. Each expected mean square is a function of variances of ran-
     dom effects  and quadratic functions of parameters of fixed effects. To test a given
     effect, you must search for a term that has the same expectation as your numerator
     term, except for the portion of the expectation that you want to test to be zero. If
     the two mean squares are  independent, then the F test is  valid. Sometimes,
     however, you will not  be able to find a proper denominator term.

     Comparison of means When  you have more than two means to compare, an
     ANOVA F  test tells you if the means are significantly different from each other, but
     it does not tell you which  means differ from which other means.
       If you have specific comparisons in mind, you can use the CONTRAST statement
     in GLM to make these comparisons. However, if you make many  comparisons
     using some alpha level to judge significance, you are more likely to make a Type 1
     error  (rejecting incorrectly a hypothesis that means are equal) simply because you
     have more chances to  make the error.
       Multiple comparison methods give you more detailed information about the dif-
     ferences among the  means and  allow you to  control error rates for a multitude of
     comparisons. A variety of multiple comparison methods are available with the
     MEANS statement in the ANOVA and GLM  procedures. These are  described in
     detail in the  section Comparison of Means in GLM.
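As an illustration of controlling the error rate over many comparisons, the following sketches the Bonferroni adjustment, one common approach: with m comparisons, each is judged at level alpha/m. This is an illustrative Python sketch, not the MEANS statement itself.

```python
# Illustrative sketch (Python, not the MEANS statement): the Bonferroni
# adjustment keeps the experimentwise Type 1 error rate at most alpha
# by judging each of m comparisons at level alpha/m.
from itertools import combinations

means_to_compare = ["a1", "a2", "a3", "a4"]
pairs = list(combinations(means_to_compare, 2))  # all pairwise tests
m = len(pairs)                                   # 6 pairs for 4 means

alpha = 0.05                       # desired experimentwise error rate
per_comparison_alpha = alpha / m   # level applied to each comparison
```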

     Nonparametric analysis  Analysis of variance is sensitive to the distribution of the
     error term. If the error  term is not normally distributed, the statistics based on nor-
     mality may be misleading. The traditional test statistics are called parametric tests
     because they depend on the specification of a certain probability distribution up to
     a set of free parameters. Nonparametric methods perform the tests without
     making distributional assumptions. Even when the data are normally distributed,
     nonparametric methods are often almost as powerful as parametric methods.
       Most nonparametric methods are based on taking the ranks of a  variable and
     analyzing these ranks  (or transformations of  them) instead of the original values.
     The NPAR1WAY procedure  is  available  to  perform  a  nonparametric one-way
     analysis of variance.  Other nonparametric tests can be performed by taking ranks

-------

of the data (using PROC RANK) and using a regular parametric procedure to per-
form the analysis. Some of these techniques are outlined in the description of
PROC RANK and Conover and Iman (1981) cited below.
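The rank-transform idea can be sketched as follows: replace the response by its midranks, then analyze the ranks with the usual parametric procedure. This is an illustrative Python sketch with hypothetical data, not PROC RANK itself.

```python
# Illustrative sketch (Python, not PROC RANK): replace the response by
# midranks (average rank for ties); the ranks would then be analyzed
# with the usual parametric one-way procedure.
data = {"g1": [1.2, 3.4, 2.2], "g2": [5.6, 4.1, 9.9]}

pooled = sorted(v for vs in data.values() for v in vs)

def midrank(v):
    # average of the 1-based positions of v in the pooled sorted list
    positions = [i + 1 for i, x in enumerate(pooled) if x == v]
    return sum(positions) / len(positions)

ranked = {g: [midrank(v) for v in vs] for g, vs in data.items()}
```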


REFERENCES

Bock, M.E. (1975), "Minimax Estimators of the Mean of a Multivariate Normal
  Distribution," Annals of Statistics, 3, 209-218.
Conover, W.J. and Iman, R.L. (1981), "Rank Transformations as a Bridge Between
  Parametric and Nonparametric Statistics," The American Statistician, 35, 124-129.
Fisher, R.A. (1925), Statistical Methods for Research Workers, Edinburgh: Oliver
  & Boyd.
Freund, R.J. and Littell, R.C. (1981), SAS for Linear Models, Cary, NC: SAS Institute.
Graybill, F.A. (1961), An Introduction to Linear Statistical Models, New York:
  McGraw-Hill.
John, P. (1971), Statistical Design and Analysis of Experiments, New York: Macmillan.
Kennedy, W.J., Jr. and Gentle, J.E. (1980), Statistical Computing, New York:
  Marcel Dekker.
Kirk, R.E. (1968), Experimental Design: Procedures for the Behavioral Sciences,
  Monterey, CA: Brooks/Cole.
Mendenhall, W. (1968), Introduction to Linear Models and the Design and Analysis
  of Experiments, Belmont, CA: Duxbury.
Ott, L. (1977), An Introduction to Statistical Methods and Data Analysis, Belmont,
  CA: Duxbury.
Scheffé, H. (1959), The Analysis of Variance, New York: John Wiley & Sons.
Searle, S.R. (1971), Linear Models, New York: John Wiley & Sons.
Snedecor, G.W. and Cochran, W.G. (1980), Statistical Methods, Seventh Edition,
  Ames, IA: The Iowa State University Press.
Steel, R.G.D. and Torrie, J.H. (1980), Principles and Procedures of Statistics, Second
  Edition, New York: McGraw-Hill.

-------
SAS Institute Inc.
Box 8000
Cary, NC

TABLE  OF  CONTENTS
                         Chapter 1   INTRODUCTION  1
                                      1.1 SOME BASIC STATISTICS: A REVIEW  1
                                         1.1.1  Statistical Inference  1
                                         1.1.2  Linear Models 3
                                         1.1.3  Experimental Design  5
                                      1.2 ELEMENTS OF A SAS PROGRAM  6

                         Chapter 2   REGRESSION   9
                                      2.1 STATISTICAL BACKGROUND  9
                                         2.1.1  Terminology and Notation 9
                                         2.1.2  Partitioning the Sums of Squares 10
                                         2.1.3  Hypothesis Testing  12
                                         2.1.4  Using the Generalized Inverse  14
                                      2.2 IMPLEMENTING GLM FOR REGRESSION  16
                                         2.2.1  A Model with One Independent Variable 16
                                         2.2.2  A Model with Several Independent Variables 19
                                      2.3 OTHER TOPICS 22
                                         2.3.1  Missing Data  22
                                         2.3.2  Other Variable Specifications   22
                                         2.3.3  A Polynomial Model  23
                                         2.3.4  Optional MODEL Specifications  25
                                         2.3.5  Modifying the Procedure 27
                                         2.3.6  Weighted Regression  30
                                         2.3.7  Test for a Subset of Coefficients 35
                                      2.4 CREATING DATA  35
                                         2.4.1  Plotting Residuals  35
                                         2.4.2  Predicting to a Different Set of Data  37
                                         2.4.3  Transformations on the Dependent Variable  38
                                         2.4.4  A Polynomial Plot  40

-------
               2.5 MULTICOLLINEARITY  42
                  2.5.1  Variance Inflation (Tolerance)  42
                  2.5.2  Roundoff Error  43
                  2.5.3  Exact Multicollinearity: Linear Dependencies  43

Chapter 3    ANALYSIS OF MEANS  47
               3.1 INTRODUCTION  47
               3.2 ONE- AND TWO-SAMPLE TESTS AND STATISTICS  47
                  3.2.1  One-Sample Statistics  47
                  3.2.2  Two Related Samples  48
                  3.2.3  Two Independent Samples  50
               3.3 COMPARISON OF SEVERAL MEANS:
                  THE ANALYSIS OF VARIANCE   52
                  3.3.1  Terminology and Notation  52
                  3.3.2  Using PROC ANOVA 55
                  3.3.3  Multiple Comparisons  56
                  3.3.4  Completely Randomized Design  57
                  3.3.5  Randomized Blocks Design  60
                  3.3.6  Latin Square Design   63
                  3.3.7  Factorial Experiment  67
                  3.3.8  Split-Plot  Design  74
                  3.3.9  Nested Design  78
                  3.3.10 Repeated-Measures Design  82
                  3.3.11 Split-Split-Plot Experiment  83


Chapter 4    ANALYSIS-OF-VARIANCE MODELS
               OF LESS THAN  FULL RANK   85

               4.1 INTRODUCTION  85
               4.2 THE DUMMY-VARIABLE MODEL 85
                  4.2.1  The Simplest Case: One-Way Structure  86
                  4.2.2  Getting Useful Estimates  88
                  4.2.3  Using PROC GLM for Analysis of Variance  91
                  4.2.4  Estimable  Functions  95
               4.3 TWO-WAY STRUCTURE  100
                  4.3.1  General Considerations  100
                  4.3.2  Sums of Squares Computed by GLM  103
                  4.3.3  Interpreting Sums of Squares in Reduction Notation  104
                   4.3.4  Interpreting Sums of Squares in μ-Model Notation  107
                  4.3.5  Analyzing a  Two-Factor Layout  109
                  4.3.6  MEANS, LSMEANS, CONTRAST, and
                        ESTIMATE Statements in the Two-Way Layout  112

-------
           4.4 HIGHER-ORDER STRUCTURES  116
               4.4.1  Special Notation  116
               4.4.2  Lack-of-Fit Analysis  117

           4.5 NESTED STRUCTURE  120
               4.5.1  Strictly Nested Structure  120
               4.5.2  Nested and Crossed Structure  123
               4.5.3  The ABSORB Statement  123

           4.6 PROPER ERROR TERMS  132
               4.6.1  Alternate Error Specification  133
               4.6.2  Expected Mean Squares  134

           4.7 ESTIMABLE FUNCTIONS 143
               4.7.1  The General Form of Estimable Functions  143
               4.7.2  Interpreting Sums of Squares Using Estimable Functions  145
               4.7.3  Estimating Estimable Functions  151
               4.7.4  Interpreting LSMEANS, CONTRAST, and
                     ESTIMATE Results Using Estimable Functions  151

           4.8 EXAMPLES OF SPECIAL APPLICATIONS  153
               4.8.1  Confounding in a Factorial Experiment  153
               4.8.2  A Balanced Incomplete Blocks Design  157
               4.8.3  Designs to Estimate Residual Effects  159
               4.8.4  Empty Cells  163
               4.8.5  Diagnosing the NON-EST Message: A Case Study  176
               4.8.6  Experiments with Qualitative and Quantitative Variables  180

Chapter 5    COVARIANCE AND
               THE HETEROGENEITY OF SLOPES   187

               5.1 INTRODUCTION  187

               5.2 A ONE-WAY STRUCTURE  188
                  5.2.1  Covariance Model   188
                  5.2.2  Means and Least-Squares Means  191
                  5.2.3  Contrasts   192
                  5.2.4  Multiple Covariates  193

               5.3 TWO-WAY STRUCTURE WITHOUT INTERACTION  194

               5.4 TWO-WAY STRUCTURE WITH INTERACTION  197

               5.5 HETEROGENEITY OF SLOPES  200

-------
Chapter 6   MULTIVARIATE LINEAR MODELS 207
            6.1 INTRODUCTION 207
               6.1.1  Statistical Background  208
               6.1.2  Using PROC ANOVA and PROC GLM  210
            6.2 A ONE-WAY STRUCTURE  210
            6.3 A TWO-FACTOR FACTORIAL  214
            6.4 MULTIVARIATE ANALYSIS OF COVARIANCE  220

            BIBLIOGRAPHY  227

            INDEX  229

-------