Receptor Model Technical Series VI: a Guide to the Use of Factor Analysis and Multiple Regression (FA/MR) Techniques in Source Apportionment


          United States
          Environmental Protection
          Agency
          Office of Air Quality
          Planning and Standards
          Research Triangle Park NC 27711
EPA-450/4-85-007
July 1985
          Air
Receptor Model
Technical Series VI:
A Guide To The Use
Of Factor Analysis
And Multiple
Regression
(FA/MR)
Techniques In
Source
Apportionment

-------
                                  EPA-450/4-85-007
   Receptor Model Technical Series VI:
A Guide  To The  Use  Of  Factor Analysis
    And Multiple Regression (FA/MR)
   Techniques In Source Apportionment
                         By

              Paul J. Lioy, Theo. J. Kneip, And Joan M. Daisey
                 Institute of Environmental Medicine
                 New York University Medical Center
                   Contract No. 4D2975NASA
                     EPA Project Officer:
                     Thompson G. Pace
              U.S. ENVIRONMENTAL PROTECTION AGENCY
                   Office Of Air And Radiation
              Office Of Air Quality Planning And Standards
                 Research Triangle Park, NC 27711

                       July 1985

-------
This report has been reviewed by the Office Of Air Quality Planning And Standards, U.S. Environmental
Protection Agency, and approved for publication as received from the contractor. Approval does not signify
that the contents necessarily reflect the views and policies of the Agency, neither does mention of trade
names or commercial products constitute endorsement or recommendation for use.

-------
                                PREFACE

                    Receptor Model Technical Series

                               Volume VI

     A Guide to the Use of Factor Analysis and Multiple Regression
                   Techniques in Source Apportionment

     In order to meet the requirements of the 1977 Clean Air Act
regarding attainment of the National Ambient Air Quality Standards for
particulate matter, EPA has been preparing guidelines for use in
identifying and quantifying source contributions to measured ambient
particulate matter concentrations.  Many analysis techniques and models
have been developed for the purpose of source apportionment.  Receptor
models are those that are based primarily on ambient concentration data
gathered at the receptor, and are used to determine the sources
contributing mass at the site.






     Guidance for using source apportionment techniques has been
compiled by EPA into the Receptor Model Technical Series.  The first
four volumes in the Technical Series have primarily addressed receptor
model source apportionment techniques.  Volume I (EPA-450/4-81-016a),
entitled "Overview of Receptor Model Application to Particulate Source
Apportionment", introduces the concept of receptor models and briefly
discusses the various types of receptor models and their applications.
Volume II (EPA-450/4-81-016b) pertains to the "Chemical Mass Balance"
model and provides information on model theory, data requirements and
case studies of the application of the model to emission control
strategy development.  Volume III (EPA-450/4-83-014), the "User's
Manual for (a) Chemical Mass Balance Model", documents a computer
program that performs source apportionment using the weighted least
squares and other optional forms of the mass balance equations.  The
user's guide provides a complete program listing, an example set of
input and output data, and further discussion of model theory and use.






     Volume IV of the series (EPA-450/4-83-018), "Summary of Particle
Identification Techniques", gives an overview of the methods and
equipment generally used in particle characterization for source
apportionment studies.  The discussion includes sampling and analytical
methods, choice of filter media, particle properties and source
fingerprints, costing and method selection criteria.






     Volume V (EPA-450/4-84-020), "Source Apportionment Techniques and
Considerations in Combining Their Use", provides guidance for the
coordinated use of the various receptor and source model techniques in
source apportionment activities.  Summary discussions of the available
receptor and source models are presented.  The use of the models is
discussed in a phased approach starting with analyses of low complexity
and cost and proceeding to analyses of greater complexity and cost
which produce more quantitative results.  Input data requirements for
each phase and example case histories are provided.

-------
     The present volume (VI), "A Guide to the Use of Factor Analysis
and Multiple Regression (FA/MR) Techniques in Source Apportionment",
provides an informative analysis of the theory and application of the
combined technique of factor analysis and multiple regression to
identify sources and estimate their contributions.  Features of the
volume are a thorough discussion of the methods for applying these
statistical techniques to a large data base, interpretation of the
results and techniques for validation of the tracers and regression
coefficients.  The types of information and tests available to
determine the stability of factor analytical solutions, and the
completeness of a regression analysis, are also described in the text.

-------
                                Abstract

                    Receptor Model Technical Series

                               Volume VI

          A Guide to the Use of Factor Analysis and Multiple
         Regression (FA/MR) Techniques in Source Apportionment

     One of the major requirements of the Clean Air Act is to attain
the National Ambient Air Quality Standard for particulate matter.  In
addition, with the anticipated change in the form of the standard from
total suspended particulate matter to a standard for matter with an
aerodynamic diameter of ≤ 10 µm (PM-10), more sophisticated approaches
to identifying the primary sources of PM-10 will be required in some
instances.  The present document is the sixth in the user-oriented
receptor modeling series.  Over the past twelve years, a number of
multivariate methods have been used to determine the sources of mass
emitted in a number of cities.  The present document focuses primarily
on the FA/MR technique; however, the procedures required to identify
potential tracers or source profiles, and validate the results, are
applicable to all.






     The specific items covered in the text are 1) the definition and
use of factor analytical modeling techniques; 2) discussion and
analysis of the results of applications of the factor analytical
technique to real data sets, including attempts at validation; 3) the
definition of the regression model; and 4) the application of the
regression technique to apportion the particulate mass based upon
tracer selection criteria established for FA.  Finally, the important
task of validating the source contributions obtained from a stepwise
regression is examined by examples, and weaknesses are discussed.  To
assist the reader, the results from a complete FA/MR analysis are
followed in the application sections of both the Factor Analysis and
Multiple Regression chapters.






     Other multivariate techniques used in source apportionment are
identified in Section 4.0 of the report.  No critical review or
detailed discussion is provided for these.  Further examination of the
previous applications to ambient data sets is left to the reader.

-------
Acknowledgements






     The authors wish to thank the Project Officer, Thompson G. Pace,
for providing comments and criticism throughout the development of this
document; Ms. Laraine Wittrup and Mrs. Betty McCarthy for typing the
manuscript; and Mrs. Mary Jean Yonone Lioy for editing the final
version.  Further appreciation is extended to the technical reviewers
of the document: Dr. Thomas Dzubay, Dr. Charles Pratt and Mr. William
Cox of the U.S. EPA, and Mr. Stuart Dattner of the Texas Air Control
Board.

-------
                           Table of Contents

1.0 Introduction
   1.1 Background
2.0 Factor Analysis
   2.1 The Factor Analytical Model
   2.2 Selection of Variables
   2.3 Commonly Available Statistical Software
   2.4 Application of Factor Analysis to Air Pollution Data and
       Interpretation of Results
     2.4.1 General Considerations
     2.4.2 Preliminary Examination of Data
     2.4.3 Factor Analysis Solutions
     2.4.4 Type of Factors
   2.5 An Example of Application to Air Pollution Data
   2.6 Validation of Factor Analysis
     2.6.1 Introduction
     2.6.2 Source Composition Profiles
     2.6.3 Factor Validation
     2.6.4 Evaluation of Source Inventories
     2.6.5 Factor Stability
   2.7 References
3.0 Multiple Regression Analysis
   3.1 Simple Bivariate Linear Regression Model
   3.2 Multiple Regression Analysis
   3.3 An Example of an Application of Multiple Regression
       Analysis to Air Pollution Data
   3.4 Regression Model Validation
     3.4.1 Regression Coefficient Validation
     3.4.2 Meteorological Relationships
     3.4.3 General Model Evaluation
     3.4.4 Summary
   3.5 Interpretation
     3.5.1 Summary
   3.6 References
4.0 Appendix:  Alternative Approaches to Regression Analysis
   A.1 Target Transformation Factor Analysis (TTFA)
   A.2 Multiple Linear Regression on Tracers/Factor Analysis
       [MCR(T)/FA]
   A.3 Regression of Absolute Principal Component Scores
   A.4 References

-------
 1.0  INTRODUCTION




     A primary purpose of the receptor modeling series, Volumes I
through V, is the transfer of information and techniques that are
extremely useful to the research community, to environmental
specialists and to scientists in air pollution regulatory agencies
(1-5).  Subsequently, these individuals can use the techniques to
examine air pollution in locations with high particulate matter
concentrations, toxic pollutant emissions from multiple sources,
visibility degradation, acid deposition, and hazardous organic
pollutants.  No single receptor modeling technique is always applicable
or desirable for all situations.  Therefore, a judgement must be made
on the utility of a particular technique or group of techniques for the
problem to be studied.  In this regard, a preliminary evaluation must
be conducted to determine the quality and quantity of available,
retrievable input data, and ancillary information proposed for use in a
model.




     In Volume VI, the application of multivariate techniques to
receptor modeling will be examined.  There will be an emphasis on
common factor analysis and stepwise regression analysis and on
application to and interpretation of problems associated with
particulate matter pollution (irrespective of any particle cut size).
The Factor Analysis/Multiple Regression (FA/MR) Receptor Model, the
topic of this volume, first uses factor analysis to identify sources of
particulate matter and to select source emission tracers.  Stepwise
multiple regression analysis is then used to obtain a quantitative
relationship between the source tracers and particle mass
concentration.  However, both are independent mathematical techniques.
The purpose of Volume VI is to familiarize air pollution managers with
the framework and approach necessary to use this receptor modeling
technique.  It is not intended as a treatise on the statistical and
mathematical aspects of each topic.  The volume will provide such
individuals with enough information to decide if the technique will be
useful in solving a given problem.






     1.1 Background






     Because of the nature of atmospheric processes and meteorology,
the concentrations of individual air pollutants will often vary
simultaneously.  This occurs irrespective of the sources; therefore, it
is difficult to resolve tracers or markers emitted from individual
sources.  Frequently, the difficulties in differentiating individual
sources are associated with the simultaneous increase and decrease
(intercorrelation) of the selected variables, which are usually
observable on time series plots (Figures 1.1 and 1.2).  In most
applications of FA/MR to particulate matter, the variables are the
measured concentrations of trace elements; for example, vanadium (V),
nickel (Ni), lead (Pb), and cadmium (Cd).






     In Figure 1.1, the two elements portrayed (A and B) increase and
decrease simultaneously in almost all cases, indicating an almost
perfect correlation coefficient of r = 1 (an actual value of r = 0.99
in this case).  In Figure 1.2, real data for nickel and vanadium are
shown which have a very high correlation of r = 0.84, with the most
dramatic variations occurring during pollution episode periods.  These
are identified by the shaded area.  The other variables shown, Pb and
Cd, are not as firmly coupled to the previous two, having r = 0.62 and
0.39, and 0.74 and 0.54, with V and Ni respectively.  Both Pb and Cd
still show major concentration excursions during an episode.

Figure 1.1   Highly Correlated Variables (A & B) for Day to Day
             Measurements of Air Pollutants
             (horizontal axis: Date of Sampling)

Figure 1.2   Data Collected for Lead, Cadmium, Vanadium, and Nickel on
             Thirty-nine Sampling Days in Elizabeth, N.J.
             (horizontal axis: Date of Sampling; episode periods shaded)
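
     The correlation coefficients quoted above are ordinary Pearson
coefficients between two daily concentration series.  The following is
a minimal sketch of that calculation, not the authors' program; the
function name pearson_r and the concentration values are hypothetical
and are not taken from Figures 1.1 or 1.2.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross products and squared deviations about the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return sxy / sqrt(sxx * syy)


# Hypothetical daily concentrations for two co-varying elements
# (illustrative values only, not data from the report).
element_a = [10.0, 12.0, 18.0, 30.0, 25.0, 14.0, 11.0]
element_b = [21.0, 25.0, 35.0, 61.0, 49.0, 29.0, 22.0]

r_ab = pearson_r(element_a, element_b)
print(f"r = {r_ab:.3f}")
```

A value of r close to 1 here only says that the two series rise and
fall together; as the text goes on to explain, it does not by itself
identify a common source.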






     This example, however, does not immediately suggest that
individual sources or source categories contributing to the particulate
mass can be identified from the data, and it suggests the need for
statistical techniques to unravel sources.  In the past, attempts have
been made to use co-varying materials as tracers for the sources of the
particulate mass in linear regression models.  Errors in source
assignments are inevitable in such a case.  Without incorporation of
independent selection procedures to verify the acceptability of an
element as a tracer, the models were usually inadequate.  For V and Pb
in the example given in Figure 1.2, this is not readily apparent, even
though these elements are usually tracers for different sources, i.e.,
fuel oil and automobiles, respectively.






     One of the major problems with air pollution data is that the
underlying reason for the simultaneous variation of some of the tracers
is the meteorology.  This can affect the accumulation of individual or
all particulate species and vapor phase species.  Frequently, the
effect is identified by sudden increases in particulate mass and
species from one day to another, a peak day or a series of high
concentration days, and then a decline to much lower concentrations for
a day or a protracted period of time (6).  At this point, it could be
suggested that it is inappropriate to attempt a mass apportionment for
a receptor site by multivariate techniques.  However, within the data,
patterns are present, and in some cases these can be inferred by the
lack of uniform rates of change in concentration from day to day for
individual species (such as for V and Pb in Figure 1.2).
Unfortunately, such qualitative features alone are not usually
sufficient to begin to devise source-receptor relationships and to
allocate mass contributions from sources.  The latter requires the use
of specific multivariate techniques to disentangle the co-variation of
a subgroup of potentially useful elemental source tracers.






     Use of multivariate techniques requires a priori 1) measurement of
a large number of tracers (elements, etc.), and 2) large numbers of
data observations or samples.  The latter are usually referred to as
cases, and represent a specific time period of sampling (1 hr, 4 hr,
24 hr, etc.).  This point is extremely important since the financial
and personnel resources must be allocated to collect a sufficient
number of samples for analysis, or a sufficient number of archived
samples must be available for chemical composition analyses.  In any
case, the samples to be used must be obtained from a single site in the
area under consideration in the receptor modeling study.  Two sites
located near each other can be used most effectively in source
apportionment validation studies and in detection of differences due to
localized emission sources (e.g., minor emission sources, < 50 T/yr).
The latter would be especially useful for toxic substance
investigations since many of these sources will not necessarily appear
on an emissions inventory.






     A further caveat is that multivariate techniques are only
mathematical tools which help interpret the data, and are only as good
as the data entered (qualitatively and quantitatively).  The latter
subject will be addressed in another section; however, it cannot be
overstated, since solutions to a poorly constructed data set would
ultimately be of no value to an environmental specialist and eventually
to the regulator.

-------






2.0 FACTOR ANALYSIS






     Factor analysis (7) comes from the field of social science and has
been described by Rummel (8) as the "calculus of the social sciences".
It constructs a model which mathematically describes any behavioral
relationships that can be deduced or predicted from the specific
phenomena under investigation.  In the case of air pollution, the model
describes source emissions relationships from the characteristic
changes in pollutant species and chemical element concentrations.  The
mathematics required include correlation matrices, diagonalization of
axes, eigenvalues, eigenvectors and principal axis rotations, with the
computations normally performed by computer.  The details of the
mathematical formulations will not be the subject of this document.
However, the basic concepts will be reviewed.






     Over the past ten years, factor analysis has been a useful tool in
the source apportionment of particulate matter (9-16).  The factor
models developed for source apportionment have been used to infer
emission source tracer relationships and subsequently in the selection
of individual elements or chemical species for use in the regression
source apportionment analyses (9-13,17).  Recently, Thurston (16) has
used the results of principal component analysis to obtain estimates of
the actual source emissions composition profiles.  However, this method
requires more a priori information about the completeness of the source
tracer list than is required by the more general FA/MR technique.






     The technique of factor analysis groups the selected variables
according to their common variations.  These groupings are called
factors.  In air pollution, the variables are normally the source
emission tracers and these tracers are normally elements or ions.
Experience has shown that in order to conduct a successful factor
analysis to identify the major sources of particulate pollutants, the
investigator should attempt to obtain at least two distinct potential
tracers per source type.  Since two may not be available in all cases,
one marker could be used nominally to represent a possible source.  The
investigator may wish to use as many potential markers as are
available.  It should be emphasized, however, that the use of many
elements to characterize a particular source type will not necessarily
yield a better result.  This is because factor analysis attempts to
group the variation of the elements provided.  In the case where there
are many elements for one source type, this can lead to indications
that that source type dominates the variation of the data (which is a
bias).  A balance should be sought in the number of tracers to be used
as input to the factor analysis study and the number to be used to
characterize an individual source type, as is explained in a subsequent
section (17).






     A major advantage of using common factor analysis is that it takes
"n" tracers that are presumed to be related to the air pollution
receptor site under investigation and groups these "n" tracers in such
a way that "m" new independent tracers are constructed as factors.
These describe the variation of the data based upon the clustering of
the original variations.  To be an effective tool, "m" needs to be less
than "n", which indicates that the dimension of the air pollution
problem has been reduced into linear combinations of the tracers called
factors.  This procedure is common to all scientific theory and is
known as the principle of parsimony (8).

-------


     For factor analysis, the reduction in the number of tracers is
associated with the size (rank) of the variable correlation matrix
(degree of association among each of the variables under
consideration).  This reduction in size is related to mathematical
constraints, which are found under the heading of communality
considerations* (7).  Factors constructed in any factor analysis model
will have "m" dimensions, but can be represented by geometric plots in
two or three dimensions.  Computer programs allow for the examination
of the factors as a series of two dimensional plots.  In this way, the
investigator can construct a visual interpretation of the model from
the analytical solutions, obtain qualitative information on clusters of
variables, and gain further ideas for factor analyses of the data.

     As a preliminary example, in Figure 2.1 we see the grouping of
elemental tracers (E1 through E8) along the axes of what for now will
be called two factors (S1, S2).  These groupings are identified by
values of loadings (correlations with factors) greater than 0.90 on
either factor.  It can be seen that there is a definite separation of
the elements, with two groups forming along the individual new rotated
factor axes.  The values along the axes are the correlation
coefficients of the original tracers with the factors; these are always
defined as the loadings of the element with the factor represented by
that axis.  For this rather straightforward example, the split among
the elemental tracers is quite explicit, and the elements E1 to E4 and
E5 to E8 are related to factors 2 and 1 respectively.  This result
indicates that two new independent factors are adequate to represent
the original eight elemental tracers.  In a factor analytical solution
of pollution data, the eight tracers should be individual source tracer
elements and the two factors would represent individual source types.
In general terms, the above example can be classified as a two factor
model which has combined the variation of eight tracers into the two
new distinct factors.

*The communality of a given variable is the sum of the squares of
the loadings (correlations) of a given variable with each factor,
and is determined by an iterative process in classical factor
analysis.  The estimated communality for any variable must not
exceed 1.0 (7).

Figure 2.1   Factor Pattern for 8 Tracers, Two Factor Example Solution
             after Rotation of Approximately 45°
             (axes: Factor 1 Loading, Factor 2 Loading)
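
     The grouping rule used in this example (a tracer belongs to a
factor when its loading on that factor is greater than 0.90) can be
sketched in a few lines.  This is an illustration only: the function
name group_by_loading and the loading values are hypothetical, chosen
to mimic the split described in the text rather than read from
Figure 2.1.

```python
def group_by_loading(loadings, threshold=0.90):
    """Map each factor number to the tracers loading above the threshold.

    loadings: dict of tracer name -> tuple of loadings, one per factor.
    """
    groups = {}
    for tracer, row in loadings.items():
        for factor_number, loading in enumerate(row, start=1):
            # Use the magnitude, since a loading is a correlation
            # and can be negative.
            if abs(loading) > threshold:
                groups.setdefault(factor_number, []).append(tracer)
    return groups


# Hypothetical rotated loadings as (Factor 1, Factor 2) pairs.
example_loadings = {
    "E1": (0.05, 0.97), "E2": (0.10, 0.95),
    "E3": (-0.02, 0.93), "E4": (0.08, 0.92),
    "E5": (0.96, 0.04), "E6": (0.94, 0.11),
    "E7": (0.93, -0.03), "E8": (0.91, 0.07),
}

groups = group_by_loading(example_loadings)
for factor_number in sorted(groups):
    print(f"Factor {factor_number}: {', '.join(groups[factor_number])}")
```

With real data the split is rarely this clean, and a tracer may exceed
the threshold on no factor at all; such a tracer simply does not appear
in any group.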




     A number of steps must be completed to obtain a factor analytical
solution and yield sufficient information from the data to eventually
interpret the factors.  An outline of the general framework of the
mathematics used in factor analysis is presented in the next section,
with the main focus being an understanding of the model design.  The
details are found in an excellent book by Harman (7) which includes
numerous problems.








2.1 THE FACTOR ANALYTICAL MODEL






     A very concise flow diagram of the factor model design is shown in
Figure 2.2 as adapted from Rummel (8).  Frequent reference to this
figure will aid in understanding the remainder of this Chapter.

     Basically there are two different factor analytical models:
1) Common Factor Analysis; 2) Principal Component Analysis.

     Variants of the latter exist and are used in computer program
applications.  Although the emphasis will be on Common Factor Analysis,
the Principal Component Analytical Model will be discussed first.  It
involves a number of simplifying assumptions with respect to the
variable set, but the solutions are exact.

-------
                              Figure 2.2
              FACTOR ANALYSIS RESEARCH DESIGN FLOW DIAGRAM

Boxes in the diagram:  Theory, Design Goals; The Factor Analysis
Question; The Factor Model; Data Transformation; Number of Factors;
Factor Techniques; Unrotated Factors; Orthogonal Rotation; Oblique
Rotation; Higher-Order Factors; Factor Scores; Factor Comparison;
Distances; Decision.

KEY:  one arrow style marks the usual flow of factor analysis design;
a second arrow style marks an alternative flow.

Adapted from Rummel (8)

-------
     The principal component model is defined by the linear equation:

     $z_i = a_{i1}F_1 + a_{i2}F_2 + \cdots + a_{in}F_n$   (i = 1 to n)      Eq. 1.

where the original n variables (tracers) are distributed as linear
combinations to produce the new uncorrelated variables called principal
components $F_1, F_2, \ldots, F_n$ (i.e., the number of factors equals
the number of variables), and the $a_{ij}$'s are the loadings on the
factors.  $z_i$ is called the standardized variable of observed
variable $X_i$, such as the elements Vanadium or Zinc, and for
individual cases is of the form:

     $z_{ip} = (X_{ip} - \bar{X}_i) / \sigma_i$                          Eq. 2.

where:  $X_{ip}$    = pth observation of the ith variable
        $\bar{X}_i$ = variable mean
        $\sigma_i$  = variable standard deviation
The main advantage of using $z_i$, which is called a z-score, is that
it has a mean of zero and a standard deviation of 1, which simply means
that only the relative deviations of the original variables are
retained.  The obvious advantages of having all variables in this form
are the elimination of problems associated with 1) different variable
scales, and 2) the differing ranges of the measurements observed for
each variable (7).
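
     The standardization described above can be sketched as follows.
This is a minimal illustration, not part of any statistical package
named in this guide: the function name z_scores and the vanadium
values are hypothetical, and the use of the sample standard deviation
(n - 1 denominator) is an assumption, since some programs divide by n
instead.

```python
from math import sqrt


def z_scores(values):
    """Standardize one variable: subtract its mean, divide by its std. dev."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    # Assumption: sample standard deviation with an n - 1 denominator.
    std_dev = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if std_dev == 0:
        raise ValueError("a constant variable cannot be standardized")
    return [(v - mean) / std_dev for v in values]


# Hypothetical vanadium concentrations on three sampling days.
vanadium = [2.0, 4.0, 6.0]
z_vanadium = z_scores(vanadium)
print(z_vanadium)
```

Because every variable ends up with the same mean and spread, an
element measured in large concentrations no longer outweighs one
measured in small concentrations when the correlation matrix is formed.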

     For principal component analysis, an important property is that the
individual factors (or components) attain a maximum contribution from
the variances of all the selected variables (tracers). In the present
analysis this involves reproducing the original correlation matrix for
the (n) tracers used to develop the factor models, assuming the tracers
selected describe the air pollution problem completely, assuming that
there are no errors, and assuming the tracers selected are the most
representative of an individual source profile. This feature can prove
to be beneficial in cases where the emission sources are known (16)
since the factor solution can be scaled to give new linear combinations
called factor scores. These scaled or normalized equations can then be
adjusted to estimate source contributions. The equation for the factor
score is of the form:

     f_j = S_{j1} z_1 + S_{j2} z_2 + \cdots + S_{jn} z_n           Eq. 3.
where: f_j is the factor score,
       S_{ji} is the factor score coefficient,
       z_i is the normalized or standardized variable (as defined in
       Eq. 2), i = 1 to n.
     In contrast, common factor analysis attempts to maximize the common
variation   among  the  available variables and produce correlations which
approximate the original correlation matrix of the original variables (a
reproduced  correlation  matrix).   More simply, it maximizes the shared
variation of  the available variables.  The equation describing the  com-
mon factor model is:

     z_i = a_{i1} F_1 + a_{i2} F_2 + \cdots + a_{im} F_m + u_i Y_i,   i = 1 to n     Eq. 4.
where each of the n observed variables used in the model is described
by m common factors (m < n) and a unique factor (Y_i). The a_{ij} and
u_i, which are called loadings, are the correlations of the ith variable
with the jth factor. The unique factor (Y_i) will include the remaining
variance of the variable z_i. This latter portion of the equation
describes the part of a variable which is uncorrelated with the other
variables or the derived m factors, and may suggest the need for more
tracers if the u_i is large. A generalized print-out from this model is
as shown in Table 2.1.

     The common factor analysis, therefore, provides a check of the
completeness of the tracer set used to develop the model. In fact, one
test for the applicability of common factor analysis was developed by
Guttman (18). He indicated that the original correlation matrix (R)
should be inverted (R^{-1}, of dimension n x n) and the off-diagonal
elements of the inverse matrix should approach zero. If these are not
near zero, more variables (tracers) may be required to describe the
features of the phenomenon (air pollution) under investigation. It
should be noted, however, that a number of other conditions must also
be satisfied (7).
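
A rough sketch of the inversion check is given below. The function name
and the two-variable correlation matrix are hypothetical, NumPy is an
assumed tool, and the text does not state how small "near zero" must
be, so the sketch only reports the largest off-diagonal element of the
inverse rather than applying a threshold.

```python
import numpy as np

def max_offdiag_of_inverse(R):
    """Largest absolute off-diagonal element of the inverse of R."""
    R_inv = np.linalg.inv(R)
    off_diagonal = R_inv - np.diag(np.diag(R_inv))
    return float(np.abs(off_diagonal).max())

# Hypothetical correlation matrix for two tracers (illustration only).
R = np.array([[1.0, 0.5],
              [0.5, 1.0]])

print(max_offdiag_of_inverse(R))
```

In practice R would be the full tracer correlation matrix, and the
other conditions mentioned in reference (7) would still need checking.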





     To summarize, a basic difference between common factor analysis and
principal component analysis is that the common factor analysis attempts
to separate common, specific (not associated with included variables)
and random error variances of the selected variable list, whereas the
latter assumes the availability of a complete set of variables, which is
most often not the case for air pollution problems. The results of the
two analyses will become nearly equivalent as u_i^2 approaches zero.
However, because the initial conditions used to extract the factors from
PCA and FA are different, the results will not achieve exact
convergence.

                                TABLE 2.1

               INFORMATION ARRAY FROM A COMMON FIVE FACTOR
                  ANALYSIS WITH SEVEN ELEMENTAL TRACERS

Tracers   Factor 1   Factor 2   Factor 3   Factor 4   Factor 5   Factor Y*
E1        a_11       a_12       a_13       a_14       a_15       u_1
E2        a_21       a_22       a_23       a_24       a_25       u_2
E3        a_31       a_32       a_33       a_34       a_35       u_3
E4        a_41       a_42       a_43       a_44       a_45       u_4
E5        a_51       a_52       a_53       a_54       a_55       u_5
E6        a_61       a_62       a_63       a_64       a_65       u_6
E7        a_71       a_72       a_73       a_74       a_75       u_7

*calculated by u_1^2 = 1 - a_11^2 - a_12^2 - a_13^2 - a_14^2 - a_15^2

                                TABLE 2.2

             ROTATED FACTOR LOADING (PATTERN) FOR SUMMERTIME
             PARTICULATE MATTER MEASUREMENTS IN CAMDEN, N.J.

Variable   Factor 1   Factor 2   Factor 3   Factor 4   Factor 5
Pb          -0.050      0.192      0.184      0.719      0.242
Mn           0.107      0.825      0.094      0.051      0.000
Cu           0.118      0.097      0.164      0.143      0.700
V            0.802      0.151      0.092     -0.039      0.050
Cd          -0.029     -0.013      0.674      0.557     -0.080
Zn           0.050      0.118      0.661      0.083      0.286
Fe           0.274      0.608      0.017      0.231      0.303
Ni           0.879      0.342      0.097      0.032      0.073
IPSO4        0.598     -0.023     -0.083      0.010      0.076

     An example of the matrix that can be obtained as a result of a
factor analysis is shown in Table 2.2. This could be for any city in the
U.S.A., but it happens to be the result of a common factor analysis for
summertime particulate matter data in Camden, N.J. (19). The intent here
is not to interpret the data, but to explain basic information available
from the analysis. The table shows the factors from a common factor
analysis of n = 9 tracers and the resultant m = 5 factors. Each factor,
F_1 to F_5, is composed of a linear combination of loadings which are
related to the individual tracers. When placed in the form of the
original equation, the variation of lead is described by the rows of
Table 2.2:

   Pb = a_{p1} F_1 + a_{p2} F_2 + \cdots + a_{p5} F_5,  or more specifically:

   Pb = -0.050 F_1 + 0.192 F_2 + 0.184 F_3 + 0.719 F_4 + 0.242 F_5     Eq. 5.

If this set of factors accounted for all the variation in the Pb then
the sum of the a_{pj}^2 = 1 (j = 1 to 5). In the case of lead,
\sum_j a_{pj}^2 only achieves a value of 0.67; the remainder is u_p^2,
suggesting other sources or other atmospheric processes affect the Pb
concentrations. For Ni, the sum of the a_{ij}^2 was 0.91, indicating
that the sources of nickel have, for the most part, been identified.
The final form for equation 5 would be:

   Pb = -0.050 F_1 + 0.192 F_2 + 0.184 F_3 + 0.719 F_4 + 0.242 F_5
        + 0.593 Y_{Pb}                                              Eq. 6.
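
The sums of squared loadings quoted above can be checked directly from
the Pb and Ni rows of Table 2.2. The sketch below is illustrative only;
NumPy and the helper names are assumptions. With the loadings as
printed, the Pb sum comes to about 0.65 (slightly below the 0.67 quoted,
possibly a rounding or transcription difference), the Ni sum rounds to
0.91, and the unique loading for Pb rounds to the 0.593 coefficient on
the unique factor.

```python
import numpy as np

# Rotated loadings for Pb and Ni as printed in Table 2.2 (Factors 1-5).
loadings = {
    "Pb": np.array([-0.050, 0.192, 0.184, 0.719, 0.242]),
    "Ni": np.array([0.879, 0.342, 0.097, 0.032, 0.073]),
}

def communality(a):
    """Sum of squared loadings of one tracer over the common factors."""
    return float(np.sum(a ** 2))

def unique_loading(a):
    """Loading on the unique factor: sqrt(1 - communality)."""
    return float(np.sqrt(1.0 - communality(a)))

for name, a in loadings.items():
    print(name, round(communality(a), 3), round(unique_loading(a), 3))
```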

     The factor equations are defined by the columns in Table 2.2, and
should be examined independently of the results sought from Eq. 5. For
example, Factor 1 is defined by the following:

   F_1 = -0.050 Pb + 0.107 Mn + 0.118 Cu + 0.802 V
         - 0.029 Cd + 0.050 Zn + 0.274 Fe + 0.879 Ni
         + 0.598 IPSO4                                              Eq. 7.

     Geometrically, Factor 1 and Factor 2 are illustrated by Figure 2.3.
It can be seen that the new factors are fairly well defined by two
groupings of the tracers, indicating that a more interpretable form of
the tracers has been identified, and a structure of the data can be
explained by fewer dimensions. The ancillary data requirements necessary
for interpretation of the results are discussed later in this chapter.

     The word rotation should be mentioned here since it is commonly
used to simplify the structure of any factor analytical solution. A
rotation is the simultaneous movement of the groupings of data around
the origin of two factor axes. For the simple example in Figure 2.1,
there were Factors 1 and 2. In the case of these data, an orthogonal
rotation of approximately 45 degrees was necessary to achieve the axial
alignments shown. If properly used, a rotation can quite frequently
make the sources represented by the factors more readily interpretable.
Going back to our hypothetical example in Figure 2.1, the factors
presented were the rotated solution. In Figure 2.4, their unrotated form
is shown and it is clear that the coordinate system of the original
solution was not in its simplest form, although there is approximately
90° separating the clusters. Rotating the coordinates about 45° with a
standard mathematical procedure yielded new factor axes that were
perpendicular or orthogonal, but were more easily identified. This
result was obtained using what is termed in the computer program as a
varimax rotation. Rotations are of tremendous value since, if the
factors are independent of each other, the attainment of a simple
structure yields well defined and interpretable solutions.
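
The effect of an orthogonal rotation can be sketched with a small,
invented two-factor loading matrix. The numbers, the function name and
the use of NumPy are assumptions for illustration; the angle is supplied
by hand here, whereas a varimax routine would search for it.

```python
import numpy as np

def rotate(loadings, degrees):
    """Apply an orthogonal rotation to a two-factor loading matrix."""
    t = np.radians(degrees)
    T = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])
    return loadings @ T

# Hypothetical unrotated loadings: two clusters of tracers lying about
# 45 degrees away from the factor axes and about 90 degrees apart.
unrotated = np.array([[0.60,  0.60],
                      [0.55,  0.50],
                      [0.60, -0.60],
                      [0.50, -0.55]])

rotated = rotate(unrotated, 45.0)

# Communalities (row sums of squared loadings) are unchanged by an
# orthogonal rotation; only the distribution across factors changes.
h2_before = (unrotated ** 2).sum(axis=1)
h2_after = (rotated ** 2).sum(axis=1)

print(np.round(rotated, 3))
print(np.allclose(h2_before, h2_after))
```

After rotation each tracer loads almost entirely on one factor, which
is the simple structure described above.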

[Figure 2.3. Rotated Factor Patterns for Camden, N.J., using Factors 1
and 2. Axis label: Factor 1 Loading.]

[Figure 2.4. Unrotated Factor Pattern for 8 Tracers, Two Factor Example
Solution. Key: tracers are labeled by number.]

     Before leaving this topic, it should be stated that an infinite
number of rotations are possible which may or may not have perpendicular
axes. In some cases, an "oblique rotation" can yield more interpretable
solutions when the factors still remain correlated, i.e., the factors or
sources are not completely independent of each other.

     The above discussion has presented the essentials of the factor
analytical technique. Some of the more important concepts will be
described further in the form of applications to air pollution problems
and selection of variables. Others will be left to the interested
reader in anticipation of his/her application to particular problems.

2.2  SELECTION OF VARIABLES

     Once the above mathematical basis is conceptually understood, the
next important step in the application of factor analytical techniques
is the identification of the input tracers. It would be rather naive to
suggest that this is a simple task. In fact, if done properly, the
investigator is required to have both qualitative and quantitative
information on the meteorology, source types, and traffic patterns at
his/her disposal to assist in interpreting the derived factors (5). This
is essential for allowing an individual to judge the completeness of the
tracer set and the size of the sample data set needed to identify
sources. It cannot be overemphasized that tracer selection is necessary
in order for the results to be applicable to the overall questions being
asked by the regulatory body.

     Major questions that should be asked include:

1)   Can the potential major and some minor sources be represented by
     the input variables?

     If you are missing trace elements or other marker species that
     would be related to sources such as soil, oil burning, etc., there
     is a possibility of ignoring an important source that affects the
     variance in the ambient data set.

2)   Are a sufficient number of cases (sampling periods) available to
     complete a factor analysis for the tracers used?

     A rigorous solution is only achieved when there are greater than
     a minimum number of degrees of freedom (d) available in the data
     set. Previous work has been done that indicates the number of
     cases (n) and variables (v) necessary to satisfy boundary
     conditions (limits) for the analysis (7). Recently, Henry (20) has
     suggested that the formula:

              d/v = n - [(v+3)/2] > 30                         Eq. 8

     is needed as a minimum precondition in order to attempt a factor
     model. However, other considerations within this heading include
     the distribution and transformation of data for each tracer. This
     would determine if further samples are required to have sufficient
     data above a detection limit or to minimize the effects of extreme
     values (21).

3)   Are there a sufficient number of distinct tracers for different
     sources within the list available to avoid biasing the solution to
     a given type of source or a single point source?

     The issue of balance among the number of variables was previously
     mentioned as being important since the dominance of elements or
     species associated with a particular source type could bias the
     solution of the factor model. Since the factor solution is based
     upon the input data in the correlation matrix, those variables
     correlated with one source will produce a factor which accounts for
     a large part of the variation in the chosen set of tracers.
     Similarly, inclusion of extra tracers related to previously
     unsuspected sources can yield the identity of a particular source
     if there is a sufficient number of cases available to decipher its
     variability. However, an individual source type cannot be
     identified without a tracer. Henry (20) has suggested that, in
     practice, multivariate techniques can usually only identify 5 to 8
     sources. Therefore, inclusion of large numbers of tracers (> 20)
     may be fruitless when attempting to interpret the results unless
     there is a significant increase in the number of samples available
     for use in the Factor Analyses (see Eq. 8).

4)   Are there meteorological data available for the area surrounding
     the receptor site?

     In the initial discussion, it was indicated that factor analytical
     models apportion the variability of a set of data and meteorology
     can have a significant effect on those results (5). Therefore,
     the availability of meteorological components for inclusion in
     initial attempts at model development may yield improved resolution
     of individual sources or a major area source. With enough data sets
     (> 100), it may be possible for the investigator to complete a
     factor analysis that is stratified by wind direction for individual
     wind sectors or by heating and cooling degree days.

5)   Are the facilities and resources available to conduct further
     analyses on archived portions of the samples or the analysis of
     other samples?

     Supplemental sample analyses can increase the number of variables
     by adding elemental or species tracers which could be associated
     with other source types. In addition, it could provide a means of
     resolving the sources which emit some of the same tracers. Of
     course, the stability of the samples being considered for
     supplemental analyses, and the appropriateness of the technique to
     the type of filter used for sample collection must be fully
     considered.
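
The precondition in Eq. 8 (question 2 above) is easy to evaluate before
any factor analysis is attempted. The sketch below simply codes the
formula as printed; the function names and the example sample sizes are
hypothetical.

```python
def dof_per_variable(n_cases, n_variables):
    """Left-hand side of Eq. 8: n - (v + 3) / 2."""
    return n_cases - (n_variables + 3) / 2.0

def meets_minimum(n_cases, n_variables, threshold=30.0):
    """True when Eq. 8 is satisfied (value greater than the threshold)."""
    return dof_per_variable(n_cases, n_variables) > threshold

# Hypothetical data sets: (number of cases, number of tracers).
for n, v in [(50, 9), (40, 20)]:
    print(n, v, dof_per_variable(n, v), meets_minimum(n, v))
```

For these invented examples, 50 cases with 9 tracers satisfies the
condition while 40 cases with 20 tracers does not, which is consistent
with the caution in question 3 about large tracer lists.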






     The variables of interest for inclusion in source receptor modeling
studies include a number of inorganic elements. In addition, Daisey, et
al. (22) have suggested that particulate organic species may be used in
the future. Similarly, microscopic analyses can yield information on
other properties of the sampled particles which can be used to identify
sources (4). The techniques available for determining potential tracer
elements and species are summarized in many papers on the topic
(5,22,24,25). Examples of some source tracers that should be included
in a factor analysis variable list for particulate matter are shown in
Table 2.3.

     The reader is referred to Volume V (5) for a list of references on
source emission profiles for sources shown in Table 2.3 as well as a
number of other source types.

Table 2.3. Sources of Particulate Matter and Tracers.

Tracers                      Potential Source Type
V, Ni                        Oil burning
Al, Si, Ca, Mn, Ti, Fe       Crustal (soil)
SO4^-2                       Secondary particles
Zn, Pb, Cd, As               Primary and secondary smelting
Pb*, Br                      Motor vehicles
C14, K                       Wood burning
Na, Cl                       Marine (ocean)
As, Se, Sb                   Coal

*Note: Pb is being phased out as an additive in automobile gasoline,
which means that a new tracer must be found. Bromine is a scavenger for
lead and will also be eliminated as the lead in gasoline decreases.

2.3  COMMONLY AVAILABLE STATISTICAL SOFTWARE

1.   BMDP Statistical Software
     Manual available in 1983 printing, University of California Press,
     Berkeley, 1983

     Factor Analytical Programs Include:
     Principal Component Analysis
     Principal Factor (Classical or Common) Analysis
     Maximum Likelihood Factor Analysis
     Kaiser's Second Generation Little Jiffy

     Rotations Available: Seven rotations listed, including Varimax and
     Oblique

2.   SAS Software
     Manual available: SAS User's Guide, Statistics: 1982 Edition, SAS
     Institute, Box 8000, Cary, North Carolina 27511

     Factor Analytical Programs Include:
     Principal Component Analysis
     Principal Factor Analysis
     Alpha
     Maximum Likelihood
     Image
     Harris Component Analysis
     Unweighted Least Squares Factor Analysis

     Rotations Available: Seven options listed, including Varimax and
     Oblique

     Special Options:
     Communalities greater than 1 set to 1 and iteration proceeds
     Target pattern rotation

3.   SPSS Statistical Software
     Manual available: Statistical Package for Social Sciences, 2nd Ed.,
     McGraw Hill Book Company, New York, 1975. Supplements have been
     published.

     Factor Analytical Programs Include:
     Principal Component Analysis
     Principal Factor Analysis
     Alpha
     Image
     Rao

     Rotations Available: Four options listed, including Varimax and
     Oblique

     For the above statistical packages the following default values
     are used:

                                        SPSS       SAS       BMDP
     Number of iterations for initial    25         30         25
     (unrotated) factor extraction
     Convergence criteria for           0.001      0.001      0.001
     iterations and communality

     These parameters can be respecified by the operator.

     The communalities (h^2) for the first three statistical packages
     are estimated or defaulted by a number of options.

     Package                             Communality Estimates (h^2)

     BMDP
       Principal Component Analysis
       Principal Factor Analysis         Squared Multiple Correlation (SMC)
                                         Maximum Row Absolute Values (MRAV)
                                         User Specified (US)
     SAS
       Principal Component Analysis
       Principal Factor Analysis         SMC; MRAV; US; Random
     SPSS
       Principal Component Analysis
       Principal Component Analysis      SMC; MRAV; US
         (modified)
       Principal Factor Analysis         SMC

4.   IMSL Statistical Libraries, Inc., Instructional Manual, 1979
     Edition, 7500 Bellaire Boulevard, Houston, Texas 77036

     Factor Analytical Programs Include:
     Principal Component Analysis
     Principal Factor Analysis

     Rotations Available: Varimax and Oblique

5.   Microcomputer Software
     Programs developed by individual users
     Programs by software corporations

     These vary in options and rotation schemes. However, programs
     have been written for Principal Component and Factor Analysis with
     iterative capabilities and rotation available for Apple, IBM, and
     others. BMDP, SAS and SPSS are also available for specific
     microcomputer systems.

2.4  APPLICATION OF FACTOR ANALYSIS TO AIR POLLUTION DATA AND
     INTERPRETATION OF RESULTS

     2.4.1 General Considerations

     In developing a Factor Analysis/Multiple Regression (FA/MR) Source
Apportionment model for a given location, common factor analysis is
first used as an exploratory tool to identify sources, to examine
relationships between tracer elements or chemical species, and to select
statistically independent source emissions tracers. A typical set of
air pollution data for a FA/MR model will consist of more than 50
measurements of many source tracer species made at a single site over
several seasons. Trace elements are the source tracer species most
frequently used, but other source tracers such as organic species (22)
or microscopic measurements can also be used (4). Ideally, the source
tracer species to be used are pre-selected on the basis of what is known
about the types of sources in a given area. Existing emissions
inventories, the knowledge of local air pollution officials, and
microinventories (surveys of the area around a given site) (17) should
all be used to identify probable source types affecting the air quality
in a given location. Table 2.3 presented an example of some types of
sources and typical tracers used to identify those source types.
Ideally, measurements of tracers in both the coarse (particles > 2.5 µm
in diameter) and fine (particles < 2.5 µm in diameter) mode of the
aerosol should be made, as such measurements can be helpful in
discriminating sources. For example, resuspended soil and its tracers
are found largely in the coarse mode particles while tracers of
combustion aerosol are found in fine mode particles. As stated
previously, measurement of more than one tracer for a given source type
is also advisable and can be useful in resolving sources in those
instances in which a preselected tracer is found to originate from
other source types than expected. For example, Pb might originate from
industrial as well as motor vehicle sources; by including measurements
of the second tracer, Br, it is usually possible to identify the
presence of both source types.

     In many instances, an existing set of trace elemental data can be
used to identify major sources impacting a given site. In this case, it
is advisable to select subsets of variables which are appropriate
tracers for source types of interest, with an emphasis on having balance
in the number of tracers per source.

     2.4.2 Preliminary Examination of the Data

     Prior to attempting a factor analysis, the data should be carefully
reviewed. Limits of detection, uncertainties in measurements, and
extreme outliers in the data set should be examined. Measured variables
close to the limits of detection of a particular measurement technique
will have greater variability due to the analytical uncertainties than
due to variations of elemental composition in the sources. Consequently,
data of this nature will be of little use in the model development.
Extreme outliers in a data set often indicate errors in data entry or
unusual events, such as episodes. Since the correlation coefficients,
which are used as the first step in the development of a factor
analytical model, can be strongly biased by a single extreme value, it
is advisable to examine the distribution of the measurements of each
variable and to eliminate such outliers from the data set prior to
factor analysis. Figure 2.5 presents some examples of the distributions
of ambient measurements for several tracers. The distribution of the Ni
(Figure 2.5.1) shows that a substantial fraction of the data are below
or near the analytical lower limit of determination (LLD). The tail of
the distribution, to the right of Figure 2.5.1, indicates that there are
also a large number of high values; this distribution is obviously far
from normal. The distribution of Mo values in Figure 2.5.2 appears more
normally distributed; however, the midpoint of the distribution marks
the analytical limit of determination for Mo. In this instance, the
uncertainties in the analytical determinations can be expected to
account for most of the variation in the data.
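
The sensitivity of the correlation coefficient to a single extreme
value, noted above, can be demonstrated with synthetic data. Everything
in the sketch is invented for illustration: two unrelated "tracers" are
drawn at random, and one episode-like case with very high values of
both is then appended. NumPy is an assumed tool.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic, unrelated tracers measured over 50 sampling periods.
x = rng.normal(size=50)
y = rng.normal(size=50)
r_clean = np.corrcoef(x, y)[0, 1]

# Append one extreme case in which both tracers are unusually high.
x_out = np.append(x, 25.0)
y_out = np.append(y, 25.0)
r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(round(r_clean, 2), round(r_outlier, 2))
```

A single case is enough to turn a weak correlation into a strong one,
which would then propagate into the factor solution.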





     Figure 2.5.3 presents a distribution of measured Fe data which more
closely approximates a normal distribution. The LLD for Fe was 30 ng/m3
and all of the measurements were above this value. The reason for the
extremely high concentrations should be investigated separately since
they may identify a strong point source or unusual meteorological event.
The examples given here represent only part of a print out from a BMDP
program which can be used to examine the distributions of measurements
prior to attempting a factor analysis. In the preliminary examination of
data, the entire output of the distribution program would be carefully
examined.

[Figure 2.5. Distributions of ambient measurements (units: ng/m3); an
arrow on each axis marks the LLD (Lower Limit of Determination).
  Figure 2.5.1  Nickel: each symbol = 6 counts; LLD = 8 ng/m3; axis 0 to 300.
  Figure 2.5.2  Molybdenum: each symbol = 2 counts; LLD = 50 ng/m3; axis 0 to 100.
  Figure 2.5.3  Iron: each symbol = 3 counts; LLD = 30 ng/m3; axis 0 to 1000.]

     2.4.3 Factor Analysis Solutions

     For the FA/MR model, classical factor analysis with rotation, in
most cases Varimax, is used to identify sources and source tracers.
Common factor analysis assumes that the variations in a given data set
are of two types:

     a) common or shared determinants, such as sources; and

     b) unique determinants.

     The unique part of the variation of a given tracer does not
contribute to relationships among tracers available for the analysis and
can be due to uncertainties in the analytical measurements, unidentified
sources, or changes in emissions composition. Because of common or
shared determinants among tracers, it follows that the common
determinants will account for the observed relations between the source
tracers. As previously indicated, these will be smaller in number, i.e.,
some tracers will come from the same type of source. Consequently, these
can be expressed as a reduced set of variables called the factors or
common determinants. Each factor would be defined in an equation similar
to Equation 4 as a mathematical combination of the original tracers.

     A key problem in factor analysis is determining how many factors
should be retained, i.e., how many are meaningful. The number of factors
obtained is determined by the number and type of variables included
and by the minimum eigenvalue selected for the analysis. The eigenvalue
is a characteristic number associated with each factor; it is also the
sum of the squares of the factor loadings (a_{ij}) of each variable for
a given factor (the columns in Table 2.1).
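
The link between an eigenvalue and the column of loadings can be
verified numerically. The sketch below assumes unrotated principal
component loadings (eigenvector times the square root of the
eigenvalue) and an invented three-tracer correlation matrix; for rotated
or common-factor solutions the column sums give the variance associated
with each factor rather than the original eigenvalues. NumPy is an
assumed tool.

```python
import numpy as np

# Hypothetical correlation matrix for three tracers (illustration only).
R = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])

eigvals, eigvecs = np.linalg.eigh(R)        # ascending order
order = np.argsort(eigvals)[::-1]           # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Unrotated principal component loadings: one column per factor.
loadings = eigvecs * np.sqrt(eigvals)

# Column sums of squared loadings reproduce the eigenvalues.
column_sums = (loadings ** 2).sum(axis=0)
print(np.round(eigvals, 3))
print(np.round(column_sums, 3))
```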






     The default minimum eigenvalue for most computer programs is 1.0;
this, however, has often been found to be too high a value to obtain all
of the sources of significance. In general, the best approach is to
view the factor analysis as an exploratory technique and, using
different criteria, to factor analyze various subsets and combinations
of variables and cases.

     As a first step, the entire set of variables can be analyzed using
the computer program default criteria. These include a minimum
eigenvalue, the number of iterations to reach the final estimated
communalities and the value for convergence of the communalities (i.e.,
the communalities are iteratively estimated until the values change by
no more than a specified value, and, as shown in Section 2.3, the
default convergence criterion is usually 0.001).

     The communalities of the variables and eigenvalues for the first
set of factors calculated (one per variable) as well as the factor
loadings of the rotated factors and factor score coefficients of each
case should be examined in the computer print out. If the estimated
communality of any variable has exceeded 1.0, the iterative procedure of
the program will be terminated before the convergence criterion is
reached and the rotated factors obtained may be misleading.
Alternatively, the default value for number of iterations may not be
high enough to reach the convergence point. The default eigenvalue of
1.0 may be too high to identify all factors of importance and thereby
underestimate the rank of the correlation matrix. Subsequently, it may
be appropriate to select a lower value for the next factor analysis. An
eigenvalue of 0.7 or 0.8 or even as low as 0.5 may then be selected to
determine the number of factors to be retained.
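
The effect of the minimum eigenvalue on the number of factors retained
can be sketched as follows. The correlation matrix is invented (two
independent pairs of tracers, giving eigenvalues of 1.6, 1.2, 0.8 and
0.4), and the function name and the use of NumPy are assumptions.

```python
import numpy as np

def n_factors_retained(R, min_eigenvalue=1.0):
    """Count eigenvalues of R at or above the chosen minimum."""
    eigvals = np.linalg.eigvalsh(R)
    return int(np.sum(eigvals >= min_eigenvalue))

# Hypothetical correlation matrix for four tracers (illustration only).
R = np.array([[1.0, 0.6, 0.0, 0.0],
              [0.6, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.2],
              [0.0, 0.0, 0.2, 1.0]])

for cutoff in (1.0, 0.7, 0.5):
    print(cutoff, n_factors_retained(R, cutoff))
```

With this matrix the default cutoff of 1.0 keeps two factors, while
lowering it to 0.7 admits a third; whether the extra factor is
meaningful still has to be judged from its loadings.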

     2.4.4 Types of Factors

     The factors and factor loadings must also be carefully examined to
identify the type of factor obtained. There are a number of types of
factors which are commonly obtained with air pollution data, many of
which have been discussed by Roscoe, et al. (26) for principal component
factor analysis. In general, the discussion is applicable to common
factor analysis.

1.   Single source-type factor (area source) - due to emissions from a
     single type of source, such as motor vehicles, and readily
     identified by high factor loadings (0.70 - 0.99) of characteristic
     tracer elements.

2.   Coincident multiple source factor - due to two or more sources or
     source types having similar temporal variations in emissions
     patterns, having similar impacts on the receptor site due to their
     locations and prevailing meteorology, or having similar emissions
     compositions. Factor loadings tend to be somewhat lower for this
     type of factor and source tracers are mixed, i.e., the tracers for
     both source types are found on a single factor.

3.   Anti-coincident multiple source  factor -  usually  due  to   sources




     requiring mutually exclusive meteorological  conditions  to impact on




     the receptor  site.  These are  characterized  by  high  negative  as




     well  as  positive loadings for  certain elements on the same  factor.







4.    Single  source  factor  (point source) - due  to emissions  from  a   sin-




      gle  source with  a  unique  tracer(s).  Such  a factor usually accounts




      for  a small  percentage  of  the total variance in the data and can  be




      recognized   by  factor  scores  which are  generally low, but with a




      number  of very high scores for certain days or  samples.   Morandi,




      et   al   (17)   have observed such a phenomenon for a zinc smelter  in




      Newark  and noted that the concentrations   of  zinc  were  unusually




      high on certain  days  when  the  prevailing  winds were from the




      northeast.






5.    Regional Transported Aerosol Factor - due to aerosol transported




      from an area  upwind  of  the site.   This type of factor has been




      reported by  Thurston  (16)  and was characterized by the  high  load-




      ings  for the  tracer elements Se,  S and Mn and the high correlation




      between the  factor loadings for this factor and  certain  types   of




      meteorological   conditions associated with the transport of sulfate




      aerosol  into the northeastern U.S.






6.    Error Factors - due to random errors,  individual data point errors,




      bias  errors or  errors in the use  of factor analysis.   Random error




      and  individual data point errors  can  usually  be  identified  and




     prevented  through the preliminary  data screening.  Bias errors can




      be due  to  the  analysis  method  or  may  occur  due  to  sampling




      artifacts,   e.g.,  volatilization,   condensation or chemical  change




      during  sampling.   Errors in the use of factor  analysis can be  more




      difficult to identify but generally occur as a consequence of inex-




     perience with  the technique.



2.5. AN EXAMPLE OF THE APPLICATION OF FACTOR ANALYSIS TO AIR POLLUTION DATA.

     Tables 2.4 and 2.5 present the results of a factor analysis on a
set of data as an example.  This data set was not subjected to the
recommended preliminary data examination, but the results have been
included to point out certain features and pitfalls of factor analysis.
In this instance, an eigenvalue of 0.7 was used to determine the number
of retained factors.  Inspection of Table 2.4 reveals that more than 25
iterations are required for convergence of the estimated communalities
to 0.001.  It also reveals that the factors beyond factor 6 each account
for less than 5% of the total variance in the data set.  The high factor
loadings of the rotated factor matrix, Table 2.5, help identify the
following types of sources:  Factor 1 - Pb, Br; motor vehicles;
Factor 2 - Fe, Ti; soil-related; Factor 3 - V, P; oil burning;
Factor 4 - S; sulfate aerosol.  Factor 5, with high loadings for V and
Ni, also appears to be related to oil burning, while Factor 6 has a high
loading only for molybdenum.  These last two factors account for 6.3%
and 3.3%, respectively, of the total variance.  The communalities for
both Mo and Ni are less than 0.5.  Examination of the analytical limits
of determination** for these variables and their distributions reveals
that most of the Mo measurements and about 20% of the Ni measurements
are less than or close to the analytical limits of determination.  In
addition, three extremely high values of Ni (> 100 ng/m3) were in the
data set.  Thus, it would appear that factors 5 and 6 are of the type
related to errors in the data, as discussed by Roscoe et al. (26).

**The analytical limit of determination is that value below which
the concentration of a species cannot be reliably determined.

Table 2.4.  Results of Initial Factoring.

            Estimate of                             Percent of
Variable    Communality     Factor    Eigenvalue    Variance
SO4=          0.511           1         2.18          24.2
Fe            0.582           2         1.94          21.5
Pb            0.806           3         1.64          18.2
Mo            0.198           4         1.06          11.7
P             0.354           5         0.79           8.8
Br            0.826           6         0.70           7.8
V             0.557           7         0.36           4.0
Ti            0.571           8         0.24           2.7
Ni            0.242           9         0.08           0.9

More than 25 iterations required

Table 2.5.  Varimax Rotated Factor Matrix (a)

                                   Factors and Factor Loadings
Variable  Communality      1         2         3          4         5         6
S            0.632      -0.014     0.106     0.126      0.747     0.099     0.194
Fe           0.814       0.114     0.874     0.099      0.055     0.001     0.159
Pb           0.980       0.951     0.054    -0.102      0.247     0.019    -0.025
Mo           0.451      -0.005     0.207    -0.064      0.179    -0.042     0.608
P            0.592      -0.060     0.187     0.686      0.164     0.157    -0.180
Br           0.969       0.919    -0.106     0.103     -0.315     0.052     0.028
V            0.877       0.119    -0.023     0.677     -0.034     0.588     0.239
Ti           0.692      -0.147     0.800     0.072      0.093    -0.072     0.103
Ni           0.414       0.016    -0.045     0.148      0.084     0.615    -0.075

Probable       -        Motor      Soil      Oil        Sulfate   Oil       Moly-
Source                  Vehicles             Burning/   Aerosol   Burning   bdenum
                                             Phosphorus

a.  n = 93 cases.
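
     One consistency check that can be made on a rotated matrix such as
Table 2.5 is sketched below.  For an orthogonal rotation such as
varimax, the communality of a variable equals the sum of its squared
loadings over the retained factors.  The loadings are copied from
Table 2.5 for three variables; the function name and the 0.005
tolerance are assumptions chosen for illustration only.

```python
# Rotated loadings on factors 1-6 for three variables, copied from Table 2.5.
LOADINGS = {
    "Pb": [0.951, 0.054, -0.102, 0.247, 0.019, -0.025],
    "Mo": [-0.005, 0.207, -0.064, 0.179, -0.042, 0.608],
    "Ni": [0.016, -0.045, 0.148, 0.084, 0.615, -0.075],
}

# Communalities reported in the same table.
REPORTED = {"Pb": 0.980, "Mo": 0.451, "Ni": 0.414}


def communality(loadings):
    """Sum of squared loadings across the retained (orthogonal) factors."""
    return sum(value * value for value in loadings)


if __name__ == "__main__":
    for name, row in LOADINGS.items():
        computed = communality(row)
        # The tolerance is an illustrative assumption that allows for the
        # three-decimal rounding of the tabulated loadings.
        agrees = abs(computed - REPORTED[name]) < 0.005
        print(name, round(computed, 3), REPORTED[name], agrees)
```

The same sum also shows why Mo and Ni are poorly described: less than
half of the variance of each is accounted for by the six factors.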







     The relatively high negative loading of Br (-0.315) on factor 4
should also be noted.  This is suggestive of a sampling artifact due to
volatilization losses of Br (probably as HBr) in the presence of H2SO4
aerosol (27).






     After the results shown in Tables 2.4 and 2.5 were examined,
several other factor analyses were performed on different subsets of
variables, with Mo and Ni omitted, and on sets of cases (sampling days)
which had more normally distributed data (all data were within three
standard deviations of the mean).  An example of these is presented in
Table 2.6.  Five factors were obtained in this instance and are clearly
identifiable as being related to resuspended soil (Fe, Ti), motor
vehicles (high factor loadings for Pb and Br), sulfate-related aerosol
(SO4=), oil burning (V), and incineration (Cu).  These results were
consistent with what was known about the major sources of airborne
particles at the site where the samples were collected for analysis.
The tracers Fe, Pb, S, V and Cu each have a high loading on one of the
factors, and the factors are orthogonal (statistically independent of
one another).  These independent tracers were then selected as
predictors for the next step in the development of the FA/MR model, that
is, the development of a stepwise multiple regression model.  The
results of this exercise are shown in Section 3.3.  The quantitative
regression model will then be used to estimate the contributions of each
source type to the ambient aerosol concentration.
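
     The case screening described above (retaining only sampling days
on which every variable lies within three standard deviations of its
mean) might be sketched as follows.  The concentrations are invented for
illustration and are not data from this study; the function name and the
use of the sample standard deviation are likewise assumptions.

```python
from statistics import mean, stdev


def cases_within_n_sd(data, n_sd=3.0):
    """Return indices of cases for which every variable lies within
    `n_sd` sample standard deviations of that variable's mean.

    `data` maps a variable name to a list of values, one per case.
    """
    n_cases = len(next(iter(data.values())))
    stats = {name: (mean(vals), stdev(vals)) for name, vals in data.items()}
    keep = []
    for i in range(n_cases):
        if all(abs(vals[i] - stats[name][0]) <= n_sd * stats[name][1]
               for name, vals in data.items()):
            keep.append(i)
    return keep


# Hypothetical data: 20 sampling days, with one extreme Ni value on the last day.
HYPOTHETICAL = {
    "Ni": [5.0 + (i % 3) for i in range(19)] + [150.0],   # ng/m3
    "Pb": [0.40 + 0.05 * (i % 4) for i in range(20)],     # ug/m3
}

if __name__ == "__main__":
    kept = cases_within_n_sd(HYPOTHETICAL)
    print(len(kept), "of 20 cases retained")
```

A screen of this kind needs a reasonable number of cases, because in a
very small sample a single extreme value inflates the standard deviation
enough that it can never fall outside three standard deviations.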

-------
Table 2.6.  Factor Loadings of Rotated Factor Matrix (a)

                                           Factors
Variable       Communality       1           2          3         4         5
SO4=              0.921       -0.024      -0.006      0.951    -0.089     0.090
Fe                0.917        0.949       0.068      0.011     0.101     0.038
Pb                0.767       -0.042       0.852      0.113    -0.001     0.165
Br                0.724       -0.090       0.832     -0.139     0.063    -0.007
V                 0.549       -0.058       0.113     -0.360     0.614    -0.162
Ti                0.684        0.792      -0.216     -0.034     0.070     0.065
Mn                0.647        0.437      -0.042      0.138     0.653     0.094
Cu                0.366        0.055       0.085      0.066    -0.032     0.592

% of Variance       -          35.3        26.6       22.3      10.3       5.6

Probable Source     -       Resuspended   Motor      Sulfate   Oil       Incineration
Type                        Soil          Vehicles             Burning

a.  n = 78 cases; eigenvalue cutoff = 0.8; convergence required 51 iterations.

-------
2.6 VALIDATION OF FACTOR ANALYSIS






     2.6.1 Introduction







     The models developed by  the existing methods for source  apportion-




ment  require   validation  if  any  meaningful  use is to be made of  the




results.   Some models permit relatively  simple  comparisons  to  actual




data.   Measurement of  the  growth of a tree, enumeration of organisms in




a lake vs. time, or measurement of crop yields may all serve to confirm




model outputs.






     In receptor modeling,  the alternative means of evaluating the truth




of  a  model's  output   are  generally  the results of other models, the




internal consistency of  the model and the consistency of the model  with




what is known about the  severity of the air pollution problem in a given




area.  Thus, some  care  is  required  when  attempting  to  validate  the




model.






     Various estimates  can be made for the expected concentration  of  a




pollutant.   Emission   inventories  can  indicate the importance of each




source or  source type evaluated,  and dispersion modeling of  the  inven-




tory emissions provides estimates of the pollutant  concentrations.   When




such estimates are compared to measured data,  the accuracy of the inven-




tory  and  dispersion  modeling  also  can  be evaluated.   In some cases




increasing divergences have occurred between dispersion model  estimates




and measured TSP values as TSP levels declined because of effective pol-




lution control measures. Efforts to add non-point  sources  and  improve



the  dispersion  model  have been partially successful.   However,  source




apportionment  by  the  multivariate  statistical  approach  offers   an

attractive  and  potentially less expensive alternative and can identify




contributing source types without the need for extensive source  testing




in a potential impact area.






     The various source apportionment-receptor models, such as the FA/MR




or  chemical  mass  balance  (CMB) models  (2), may be intercompared where




data sets support the use of both mathematical approaches.  The  regres-




sion  models from FA/MR analysis yield estimates of source contributions




and regression coefficients relating the  source tracer  to  the  emitted




mass.  These coefficients are subject to verification.






     Finally, all models are subject to re-evaluation where  changes  in




source   types  or  quantities  of   emissions  are observed or predicted.




Model results, if accurate,  can be  used to  predict  trends.   This  has




been  successful  in  the  past for simple emission inventory estimates.




Where trends in  concentrations of the predictor tracers, the TSP or  any




other particulate mass measure, are observed, the model derived from any




source apportionment method  should  accurately  predict  or  reflect  the




observed changes.






     2.6.2 Source Composition Profiles






     The composition of particles emitted by  a  given  source  can  be




obtained by   analysis of  a  sample  of the stack emissions collected on a




filter in a standard EPA  train (28).  More recently it  has  been  noted




that  several   toxic  elements and many  organic compounds are condensed




onto  the particles as the  stack gases cool (29).  This may  continue  to




occur for some  distance in the plume after leaving the stack  (30).

     Composition profiles useful  at  receptor  sampling  locations  are




likely  to be  somewhat  different from the emissions data obtained by sam-




pling heated  stack  gases.  As plume sampling is very difficult and  very




expensive,  few useful  plume samples have been obtained.  Emission compo-




sition  profiles previously cataloged are possibly biased and of somewhat




doubtful  use.   Development  of a suitable set of profiles in a complex




urban-industrial area would be very difficult and  extremely  expensive.




Thus the application of factor analysis methods to ambient sample
composition data has provided an alternative means of obtaining data
which relate ambient compositions to the probable sources of the
emissions.






     2.6.3 Factor Validation






     Factor analysis has been used as an exploratory tool to select pos-




sible   source tracers for New York City (9, 31)  and for Boston (32).  In




each of these early applications of the method,   comparisons  were  made




among  different source emission profiles to confirm that the factor was




associated with one source type.  Kleinman et al. (9) found factors




with elemental loadings which indicated automobiles,  oil burning,  soils,




incinerators and a  source of a sulfate  compound were the types affecting




New  York  City.   However,   it should  be noted that the factors did not




reproduce the source profile,  since z-scores were  used  in  the  factor




analysis.






     Emission sample composition analysis gives a source composition




profile  with  a concentration for each element  in mass per  unit mass of




emitted particles.  Derived factors provide loadings or scores which




indicate  the  degree  to which the variability  (variance)  of the tracer




element is related to the factor.   The  most strongly loaded element  may

provide an independent predictor  tracer variable, but may  not  be  a major




mass contributor in the  source emissions.  For  instance, selenium may  be




a useful tracer for coal  fly  ash, while the  concentrations of  major  ele-




ments Al, Si, Fe, Ca, Mg,  etc. are  so  similar  to  those  in soils   that




they  are  useless as predictor variables.   Moreover, any  trace elements




which are loaded at equal  levels  on several  factors  are  unlikely   to  be




useful as independent predictor variables.






     From existing emission profiles,  certain  elements are  expected  to




load significantly on separate factors.  For instance, Pb, V, and SO4=




are typically assigned to   autos,   oil  burning  and secondary   aerosol




sources.  Lead may be emitted by  primary  and secondary smelters yielding




factors with Pb-Br for autos and some combination of Pb-Zn-Cu-As
loadings when smelters are involved.  Should steel industries be present, a




strongly  loaded Fe-Mn factor  may  occur.   However, Fe and Mn should   also




load  with  Si, Al, Ca, Ti on  a  soil factor unless the industrial  contri-




butions  are  overwhelming.   Should more complex  situations  occur,   more




exact   comparisons  of   emission  and   ambient   profiles  may  be  made  by




conversion  of  the  factor scores to  relative  concentration  profiles  as




has been done in the methods described by Thurston (33), Dattner (in
Currie et al., 34) and Hopke et al. (32, 35).






     2.6.4 Evaluation of Source Inventories






     Where  the  factors are not  readily related  to   previously reported




source   profiles,  a careful  macro  and micro source  inspection may be  in




order.   This serves both to identify sources or source  types  previously




omitted from past  studies and to  confirm  the presence of sources  identi-




 fied by the factor analysis approach but  not listed  in  the  category  of

 expected sources for the sampling  site.






     In a study of Newark, N.J., Morandi et al. (12) observed four to
seven factors (Table 2.7), depending upon the variables included in the
analysis.  The table is provided for illustrative purposes, and a
detailed discussion is given in the reference.  Factors and predictor
tracers were easily identified for oil burners, soil, industrial
sources, the sulfate-related mass and autos in the final analysis
(Row 6).  Two factors were found with unusual loading patterns.  One was
loaded with zinc and benzo(e)pyrene and a second contained only a
seasonal component.  The regression model coefficient for zinc indicated
the composition to be ZnO, a likely emission from a smelter.  This was
verified from state emissions data which showed that a single zinc
smelter existed nearby and was in operation.






     2.6.5 Factor Stability






     Emission tracers established by the factor loadings may match with
published source sample data for carefully selected particle size ranges
and stable, nonvolatile elements or substances.  Factor analysis selections




of  independent  predictor  tracers may differ from predictions based on




source  profiles, where  source and ambient particle size ranges  are  not




matched,   or volatile,  condensible and reactive substances are involved.




In these  cases  the variations may be examined by compiling "source" pro-




files  from the  factors with the methods used by Dattner (34)  or Thurston




(33).   Calculation of the ratio of a potentially variable substance to a




stable   element  or  substance  normalizes for dilution or other effects




which  do not cause  relative  profile  shifts.   Equivalent  ratios  for




source  and  ambient  data  indicate  no change in relative composition.

-------
Table 2.7  Factor Analysis of the Newark, N.J. Data Set.

Variables entered: Pb, Mn, Cu, V, Cd, Zn, Fe, Ni, SO4-2          N = 77
  Factor   Loadings                                  Probable Source
  F1       V(0.94), Ni(0.92), Pb(0.47), Mn(0.46)     oil burning/space heating
  F2       Cu(0.76), Pb(0.75)                        motor vehicles
  F3       Cd(0.73), Fe(0.52)                        unknown
  F4       Fe(0.56), Mn(0.54)                        soil

Variables entered: Pb, Mn, Cu, V, Cd, Zn, Fe, Ni, SO4-2, COmax   N = 77
  Factor   Loadings                                  Probable Source
  F1       Mn(0.90), Fe(0.77)                        soil
  F2       V(0.83), Ni(0.82)                         oil burning/space heating
  F3       Pb(0.79), COmax(0.69)                     motor vehicles
  F4       Cu(0.77), Fe(0.51)                        smelter/metal related indus.
  F5       SO4-2(0.52)                               SO4-2/secondary aerosol
  F6       Zn(0.54)                                  zinc related source

Variables entered: Pb, Mn, Cu, V, Cd, Zn, Fe, Ni, SO4-2,
                   COmax, SO2Ave                                 N = 66
  Factor   Loadings                                  Probable Source
  F1       V(0.83), Ni(0.82)                         oil burning/space heating
  F2       COmax(0.78), Pb(0.66), SO2Ave(0.59)       motor vehicles
  F3       Mn(0.83), Fe(0.67)                        soil
  F4       Cu(0.81), Fe(0.57), Pb(0.42)              smelter/metal related indus.
  F5       SO4-2(0.50)                               SO4-2/secondary aerosol
  F6       Zn(0.50), SO2Ave(0.43)                    zinc related source

Variables entered: Pb, Mn, Cu, V, Cd, Zn, Fe, Ni, SO4-2,
                   BEP, BAP, COmax                               N = 57
  Factor   Loadings                                  Probable Source
  F1       Fe(0.93), Mn(0.83), Cu(0.75), Cd(0.63)    soil
  F2       V(0.87), Ni(0.87)                         oil burning/space heating
  F3       COmax(0.71), Pb(0.68), BAP(0.64)          motor vehicles
  F4       BEP(0.87), Zn(0.83)                       zinc related source
  F5       SO4-2(0.54)                               SO4-2/secondary aerosol

COmax  = maximum hourly carbon monoxide concentration.
SO2Ave = 24 hour average SO2 concentration.
BAP    = Benzo(a)pyrene.
BEP    = Benzo(e)pyrene.
Headd  = heating degree days.
Coodd  = cooling degree days.

Based on the fact that larger particles have higher settling velocities,




a  ratio  of  the  airborne concentrations of an element  concentrated in




large particles to that of a second element concentrated  in small  parti-




cles  will  decrease  with  time  of  residence  in the  atmosphere.   This




assumes no fresh infusion of new particles.   Volatile materials may  con-




dense  as  a  plume cools or shift from particle to vapor phase and  back




again as ambient temperatures increase and decrease.  Reactive materials




can decrease in concentration with time.






     Careful examination of composition profiles and ratios of nonstable




to  stable  tracers should show shifts between source emission test  pro-




files and ambient profiles which compare   well  to  expected deposition




rates  and  volatility or reaction characteristics.  Shifts which  do not




agree qualitatively are indicative of  possible difficulties in selection




of independent predictor variables by  the  factor analysis method.
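
     The ratio comparison described in this section can be sketched as
follows.  The profiles are hypothetical numbers chosen only to show the
arithmetic, with Br standing in for a potentially variable substance and
Pb for a stable element; the function names are also assumptions.

```python
def ratio(profile, variable, stable):
    """Ratio of a potentially variable substance to a stable element."""
    return profile[variable] / profile[stable]


def relative_shift(source_profile, ambient_profile, variable, stable):
    """Ambient ratio divided by source ratio.

    A value near 1 indicates no change in relative composition; a value
    below 1 indicates a relative loss of the variable substance.
    """
    return (ratio(ambient_profile, variable, stable)
            / ratio(source_profile, variable, stable))


# Hypothetical profiles (arbitrary but consistent units).
SOURCE = {"Br": 0.30, "Pb": 1.00}
AMBIENT = {"Br": 0.020, "Pb": 0.100}

# Pure dilution scales every species equally, so the ratio is unchanged.
DILUTED = {name: 0.1 * value for name, value in SOURCE.items()}

if __name__ == "__main__":
    print(round(relative_shift(SOURCE, DILUTED, "Br", "Pb"), 3))
    print(round(relative_shift(SOURCE, AMBIENT, "Br", "Pb"), 3))
```

In this invented case the diluted profile keeps the source ratio, while
the ambient profile shows a relative loss of the variable substance,
which would then be compared with its expected volatility or reactivity.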

2.7. References






1.   U.S. EPA, 1981a: Receptor Model Technical Series Volume I: Overview
     of Receptor Model Application to Particulate Source Apportionment.
     EPA-450/4-81-016a.

2.   U.S. EPA, 1981b: Receptor Model Technical Series Volume II:
     Chemical Mass Balance. EPA-450/4-81-016b.

3.   U.S. EPA, 1983a: Receptor Model Technical Series Volume III: User's
     Manual for Chemical Mass Balance Model.  EPA-450/4-83-014.

4.   U.S. EPA, 1983b: Receptor Model Technical Series Volume IV: Summary
     of Particle Identification Techniques.  EPA-450/4-83-018.

5.   U.S. EPA, 1984: Receptor Model Technical Series Volume V: Methods
     for Combining the Various Source Apportionment Approaches (in
     press).

6.   Wolff, G.T., 1980, Mesoscale and Synoptic Scale Transport of
     Aerosols.  Ann. N.Y. Acad. Sci., T.J. Kneip and P.J. Lioy, Eds.,
     338: 379-388.

7.   Harman, H.H., 1976, Modern Factor Analysis, 3rd Ed., University of
     Chicago Press, 1-487.

8.   Rummel, R.J., 1970, Applied Factor Analysis, Northwestern
     University Press.

9.   Kleinman, M.T., B.S. Pasternack, M. Eisenbud, and T.J. Kneip, 1980,
     Identifying and estimating the relative importance of sources of
     airborne particulates.  Environ. Sci. Technol., 14: 62-65.

10.  Dzubay, T.G., R.K. Stevens, W.D. Balfour, H.J. Williamson, J.A.
     Cooper, J.E. Core, R.T. Decesar, E.R. Crutcher, S.L. Dattner, B.L.
     Davis, S.L. Heisler, J.J. Shah, P.K. Hopke and D.L. Johnson, 1984,
     Interlaboratory comparison of receptor model results from Houston
     aerosol.  Atmos. Environ. (in press).

11.  Stevens, R.K., T.G. Dzubay, C.W. Lewis and R.W. Shaw, Jr., 1984,
     Source apportionment methods applied to the determination of the
     origin of ambient aerosols that affect visibility in forested
     areas.  Atmos. Environ., 18: 261.

12.  Morandi, M., J.M. Daisey and P.J. Lioy, 1983, A receptor source
     apportionment model for inhalable particulate matter in Newark,
     N.J.  Paper No. 83-14.2 presented at the 76th Annual Meeting of the
     Air Pollution Control Association, Atlanta, GA, June 19-24.

13.  Pace, T.G.  The Role of Receptor Models for Revised Particulate
     Matter Standards, IN: Receptor Models Applied to Contemporary
     Pollution Problems, S.L. Dattner and P.K. Hopke, Eds., Air Pollution
     Control Association, Pittsburgh, PA, pp. 18-28.

14.  Henry, R.C. and G.M. Hidy, 1979, Multivariate analysis of
     particulate sulfate and other air quality variables by principal
     components - Part I.  Annual data from Los Angeles and New York.
     Atmos. Environ. 13: 1581-1596.

15.  Lioy, P.J., R.P. Mallon, M. Lippmann, T.J. Kneip and P.J. Samson,
     1982, Factors affecting the variability of summertime sulfate in a
     rural area using principal component analysis.  J. Air Poll. Contr.
     Assoc. 32: 943-1047.

16.  Thurston, G.D. and J.D. Spengler, 1982, Source contributions to
     inhalable particulate matter in metropolitan Boston, Massachusetts.
     Paper No. 82-21.5 presented at the 75th Annual Meeting of the Air
     Pollution Control Association, New Orleans, Louisiana, June 20-25.

17.  Morandi, M.T., 1985, Development of Receptor Oriented Source
     Apportionment Models for Inhalable Particulate Matter and
     Particulate Organic Matter in New Jersey.  Ph.D. Dissertation, New
     York University Medical Center, February.

18.  Guttman, L., 1954, Some necessary conditions for common factor
     analysis.  Psych. 18, 277-296.

19.  Lioy, P.J. and J.M. Daisey, 1984, ATEOS Project. Unpublished Data.

20.  Henry, R.C., C.W. Lewis, P.K. Hopke and H.J. Williamson, 1984,
     Review of receptor model fundamentals. Atmos. Environ. (in press).

21.  Draper, N. and H. Smith, 1981, Applied Regression Analysis, 2nd
     Ed., Wiley Interscience, New York, 1-709.

22.  Daisey, J.M., 1983, Receptor Source Apportionment Models for Two
     Polycyclic Aromatic Hydrocarbons.  IN: Receptor Models Applied to
     Contemporary Pollution Problems, S.L. Dattner and P.K. Hopke, Eds.,
     Air Pollution Control Association, Pittsburgh, PA, pp. 348-357.

23.  Watson, J.G., 1979, Chemical element balance receptor model
     methodology for assessing the source of fine and total suspended
     particulate matter in Portland, Oregon. Ph.D. Dissertation, Oregon
     Graduate Center, February.

24.  Miller, M.S., S.K. Friedlander and G.M. Hidy, 1972, A chemical
     element balance for the Pasadena aerosol.  J. Colloid Interface
     Sci., 39: 165-176.

25.  Gordon, G.E., W.R. Pierson, J.M. Daisey, P.J. Lioy, J.A. Cooper,
     J.G. Watson, Jr. and G.R. Cass, 1984, Considerations for design of
     source apportionment studies.  Atmos. Environ. (in press).

26.  Roscoe, B.A., P.K. Hopke, S.L. Dattner, and J.M. Jenks, 1982.  The
     use of principal component analysis to interpret particulate
     compositional data sets.  J. Air Pollut. Control Assoc., 32: 637-642.

27.  Pierson, W.R. and W.W. Brachaczek, 1983.  Particulate matter
     associated with vehicles on the road. II.  Aerosol Sci. Technol., 2:
     1014.

28.  U.S. EPA, 1971, Method 5, Federal Register, 36(247), p. 24880.

29.  Natusch, D.F.S., 1978. Potentially carcinogenic species emitted to
     the atmosphere by fossil-fueled power plants.  Environ. Health
     Pers., 22: 79-90.

30.  Natusch, D.F.S. and Tomkins, B.A., 1978.  Theoretical consideration
     of the adsorption of polynuclear aromatic hydrocarbon vapor onto
     fly ash in a coal-fired power plant.  Carcinogenesis, Vol. 3:
     Polynuclear Aromatic Hydrocarbons, P.W. Jones and R.I.
     Freudenthal, Eds., Raven Press, New York, 145-153.

31.  Kneip, T.J., M.T. Kleinman and M. Eisenbud, 1973. Relative
     contribution of emission sources to the total airborne particulates
     in New York City.  Proc. Third Int. Clean Air Congress, Dusseldorf.

32.  Hopke, P.K., 1980. Source identification and resolution through
     application of factor and cluster analysis.  Ann. N.Y. Acad. Sci.,
     338: 103-115.

33.  Thurston, G.D. and J.D. Spengler, 1985. A quantitative assessment
     of source contributions to inhalable particulate matter pollution
     in metropolitan Boston.  Atmos. Environ., 19 (in press).

34.  Hopke, P.K., D.J. Alpert and B.A. Roscoe, 1983.  Fantasia - A
     program for target transformation factor analysis to apportion
     sources in environmental samples.  Computers and Chemistry, 7:
     149-155.

35.  Currie, L.A., et al., 1984, Intercomparison of source apportionment
     procedures: results for simulated data sets. Atmos. Environ. 18:
     1517-1537.

3.0 MULTIPLE REGRESSION ANALYSIS



     After identifying the major sources of particulate pollutants and


selecting  independent  source tracers or predictors using common factor


analysis, the next step is to obtain a quantitative relationship between


particle  mass  concentration  and  the  concentrations  of  the  source


tracers.  Such a relationship will provide estimates  of  the  contribu-


tions  of  each  source  or source type to atmospheric concentrations of


particulate matter and can assist  in  determining  control  strategies.


This   relationship   is  determined  by  stepwise  multiple  regression


analysis.  Although the mathematical steps in obtaining such a relation-


ship  and  the  application  of  such analysis to real world data can be


quite complex,  the conceptual basis is fairly straightforward.   In order


to  acquaint  the  reader  with the fundamental concepts and some of the


terms used in multiple regression analysis,  an example of a simple  (one


dependent  and  one  independent  variable)   bivariate linear regression


model is presented and discussed first.  In  the  particulate  mass   air


pollution example,  the independent variable would be an elemental tracer


and the dependent variable would be the  total  particulate   mass  or  a


fraction  of  the  particulate  mass.   This  is then extended to a simple


multiple linear regression model where more  than one independent   tracer


is used.  Finally,  an example of the actual  application of  stepwise  mul-


tiple linear regression to air pollutant data is given using the  results


from  the  factor  analysis presented in Table 2.6 to select tracers and


the source attributions identified.


     A more thorough understanding of multiple regression is required
for those who actually develop multiple regression models.  Standard
references on this technique, such as that by Draper and Smith (1),
should be consulted.








3.1 SIMPLE BIVARIATE LINEAR REGRESSION MODEL






     In simple regression analysis, which is familiar to many in the
field of air pollution, the values of a dependent variable, Y, are
predicted from a linear relationship to an independent predictor
variable, X.  This relationship is of the form:

     Y = A + BX                                             Eq. 1.

where B is a constant by which all values of X are multiplied, and A is a
second constant which is added in each case to predict Y.






     As an example, assume that for a given site, only motor vehicles
using leaded gasoline contribute to airborne concentrations of particle
mass less than 10 µm, PM10, and that Pb can be used to trace these
emissions.  Thus, when Pb concentrations increase, concentrations of PM10
increase proportionately at this site.  The model, in this instance,
would be:

     [PM] = A + B[Pb],                                       Eq. 2.

where [PM] and [Pb] are the concentrations of PM10 and Pb, respectively,
B is the slope of the line and A is the intercept.  If there were no
errors in the measurements of PM10 and Pb, and the model perfectly
described the sources of PM10 for this site, then A would be equal to
zero.  When the intercept is zero, the result implies that all of the
PM10 is proportional to Pb, and that there is no other "unexplained
source".  This is shown graphically in Figure 3.1.  The coefficient B
would then be determined simply by dividing each measured value of [PM]
by its paired measured Pb value; B would be constant for each case and
would describe the ratio of PM10 to Pb in motor vehicle emissions.  The
value of B is 10 in this example, i.e., Pb is 0.10 of the particle mass
emitted by these ideal motor vehicles.

FIGURE 3.1 Idealized Linear Relationship Between PM-10 and Pb
[Figure: straight line through the origin with slope B = 10; horizontal
axis [Pb], µg/m3 (0 to 1.0); vertical axis [PM10], µg/m3 (0 to 10.0).]

     PM10, µg/m3     Pb, µg/m3
        2.0            0.20
        3.0            0.30
        4.0            0.40
        6.0            0.60
        8.0            0.80
       10.0            1.00


     Since measurements always involve some uncertainties and the real
world is far from ideal, the values of B determined would vary somewhat
from case to case.  The value of A will also differ from zero.  If
average values of B and of A were selected to predict PM10 from the
concentration of Pb, the individual predicted values of PM10,
[\widehat{PM}], would differ from the actual values, i.e.,

     [\widehat{PM}] - [PM] \neq 0                              Eq. 3.


The value of this difference between actual and predicted PM10 for each
case is termed the residual.  Analysis of residuals often gives useful
information about the goodness of fit of the model, and about patterns
or correlations which indicate a need for additional source terms.  A
thorough discussion of residual analysis can be found in Draper and
Smith, 1981.


     Regression analysis involves determining A and B in such a way that

the sum of the squares of these residuals is a minimum;

     \sum ([PM] - [\widehat{PM}])^2 = minimum                  Eq. 4.


Thus, the values of A and B which define the linear relationship between
Pb and PM10 are selected to minimize the deviation of the actual values
of PM10 from the line shown in Figure 3.2.  These values can be
calculated according to the formulas indicated in the same figure.

FIGURE 3.2 Linear Relationship Between PM-10 and Pb With Uncertainties
in Measured Concentration of PM-10.

     [Plot of [PM10], µg/m3 (vertical axis, 0 to 10.0), against [Pb],
     µg/m3 (horizontal axis, 0 to 1.0), showing the measured points and
     the least squares regression line.  Annotations on the plot:
     \Delta PM = 2.04 µg/m3 for \Delta Pb = 0.2 µg/m3; A = -0.04.]

            PM10, µg/m3
        Actual    Predicted (a)      Pb, µg/m3

          2.0        2.21              0.220
          3.0        2.72              0.270
          4.0        4.45              0.440
          6.0        5.47              0.540
          8.0        8.93              0.880
         10.0        9.24              0.910

     a. Predicted from the least squares line, where

          B = \sum ([PM] - \overline{PM}) ([Pb] - \overline{Pb})
              / \sum ([Pb] - \overline{Pb})^2 = 10.19

          A = \overline{PM} - B \overline{Pb}

        and \overline{Pb} and \overline{PM} are the average
        concentrations of the species given.
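
The formulas of Figure 3.2 can be checked with a short calculation.  The
following is a minimal sketch in Python (a language chosen here for
illustration; it is not part of the original report).  The function and
variable names are assumptions; the concentrations are those listed with
Figure 3.2.

```python
# Concentrations transcribed from Figure 3.2 (ug/m3).
pb = [0.220, 0.270, 0.440, 0.540, 0.880, 0.910]
pm = [2.0, 3.0, 4.0, 6.0, 8.0, 10.0]


def least_squares_line(x, y):
    """Return (A, B) that minimize the sum of squared residuals of y = A + B*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross-products and squares about the means (Figure 3.2 formulas).
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


A, B = least_squares_line(pb, pm)
predicted = [A + B * x for x in pb]
residuals = [actual - fit for actual, fit in zip(pm, predicted)]

print(f"A = {A:.2f}, B = {B:.2f}")
print("predicted:", [round(p, 2) for p in predicted])
print("residuals:", [round(r, 2) for r in residuals])
```

With these data the sketch gives B = 10.19 and A = -0.04, and the
predicted values agree with those listed in Figure 3.2.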



     In this model, it is assumed that the measurement of Pb, or any X,
does not involve any uncertainties, and that only the measurements of
PM10, or Y, involve uncertainties.  The uncertainties associated with
the model obtained by regression analysis (sometimes termed least
squares analysis) can be evaluated by one or more of the statistics that
describe the average size of the residuals.  The value of r^2 (the
square of the correlation coefficient) indicates the proportion of the
variation in PM10 explained by Pb.  This is 1.0 for the ideal example of
Figure 3.1, and is 0.957 for the data for Figure 3.2.  The Standard
Error of the Estimate (S.E.E.) is the standard deviation of actual PM10
values from those predicted and can be thought of as an "average"
residual:
     S.E.E. = \sqrt{ \sum ([PM] - [\widehat{PM}])^2 / N }      Eq. 5.





If the PM10 measurements are normally distributed about the least
squares line, then approximately 68% of the measured values of PM10 for
a given Pb concentration will fall within 1 S.E.E. of the predicted
value.  The uncertainty associated with the estimated value of B can
also be estimated and is termed the standard error of B.  The
statistical significance of the estimated coefficient B can also be
tested, usually by evaluating the F ratio.  The F ratio is defined as
the ratio of the variability in Y explained by the regression line to
the unexplained variability in Y, corrected for degrees of freedom.  The
greater the value of F, the greater the ability of the equation to
explain the variability in Y (or in this case PM10).  The statistical
significance of the calculated value of F can be determined by referring
to appropriate statistical tables; however, most computer programs for
regression analysis automatically provide this value.
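
The statistics described above can be illustrated with the Figure 3.2
data.  The sketch below is an assumption-laden illustration rather than
the procedure of any particular regression program: the function name is
hypothetical, the S.E.E. uses the divisor N of Eq. 5 (many texts use
N - 2 instead), and the F ratio is computed with 1 and N - 2 degrees of
freedom, the usual choice for a single predictor.

```python
import math

# Concentrations transcribed from Figure 3.2 (ug/m3).
pb = [0.220, 0.270, 0.440, 0.540, 0.880, 0.910]
pm = [2.0, 3.0, 4.0, 6.0, 8.0, 10.0]


def regression_statistics(x, y):
    """Return r^2, S.E.E. and F for the least squares line y = A + B*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)      # total variability in Y
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    ssr = syy - sse                                              # explained
    return {
        "r2": ssr / syy,
        "see": math.sqrt(sse / n),             # Eq. 5, divisor N
        "F": (ssr / 1) / (sse / (n - 2)),      # corrected for degrees of freedom
    }


stats = regression_statistics(pb, pm)
print(f"r^2 = {stats['r2']:.3f}")
print(f"S.E.E. = {stats['see']:.2f} ug/m3")
print(f"F = {stats['F']:.1f}")
```

The value of r^2 obtained this way is 0.957, matching the value quoted
in the text for Figure 3.2.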









3.2 MULTIPLE REGRESSION ANALYSIS






     The simple bivariate linear relationship can be readily extended to
the multivariate case in which two or more independent variables, X_i,
determine the predicted value of Y:






     Y = A + B_1 X_1 + B_2 X_2 + ... + B_k X_k                 Eq. 6.






For the example discussed in Chapter 2, a model for Pb and V would be  of




the form:






     [PM] = A + B_1 [Pb] + B_2 [V]                             Eq. 7.






This equation implies that both motor vehicle emissions and emissions
from oil burning, traced independently by Pb and V, respectively,
contribute to the variation in the concentration of PM10.  When
multiplied by the ambient concentration of Pb, the regression
coefficient B_1 will give the concentration of PM10 due to motor
vehicles alone.  Similarly, B_2 [V] gives the concentration of PM10 due
to oil burning, while A is the concentration of PM10 which cannot be
explained by variation in Pb or V.  The value of A can be due to
uncertainties in the measurements of the tracers in the model, or may
reflect the presence of unidentified sources of PM10 or the inadequacy
of the linear model itself.  This model, as do other receptor models,
assumes that the contributions of PM10 from the two sources are
linearly additive, a key assumption.  Although this assumption appears
to be adequate in most instances, further research may ultimately lead
to refinements in these models through development of nonlinear
equations.
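
A two-tracer model of the form of Eq. 7 can be fitted by ordinary least
squares.  The sketch below solves the normal equations directly; it is
one simple way to do this, not the stepwise program used in the report.
All function names are hypothetical, and the concentrations are
noise-free values constructed for illustration so that
[PM] = 1.0 + 10 [Pb] + 100 [V]; they are not measurements.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_multiple_regression(tracer_columns, y):
    """Return [A, B_1, ..., B_k] minimizing the sum of squared residuals."""
    # Each row is one sample; the leading 1.0 carries the intercept A.
    rows = [[1.0] + list(values) for values in zip(*tracer_columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical, noise-free concentrations (ug/m3), not data from the report.
pb = [0.2, 0.5, 0.3, 0.9, 0.6, 0.4]
v = [0.03, 0.01, 0.05, 0.02, 0.04, 0.06]
pm = [1.0 + 10.0 * p + 100.0 * q for p, q in zip(pb, v)]

A, B1, B2 = fit_multiple_regression([pb, v], pm)
mean_pb = sum(pb) / len(pb)
mean_v = sum(v) / len(v)
print(f"A = {A:.2f}, B1 = {B1:.2f}, B2 = {B2:.2f}")
print(f"average PM due to motor vehicles = {B1 * mean_pb:.2f} ug/m3")
print(f"average PM due to oil burning    = {B2 * mean_v:.2f} ug/m3")
```

Because the constructed data contain no error, the fit recovers the
coefficients used to build them, and the intercept A equals the part of
[PM] that neither tracer explains.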






     The regression coefficients in the stepwise multiple regression
model are partial regression coefficients.  That is, the partial
coefficient B_1 in equation (7), for example, is equivalent to a simple
regression coefficient between PM10 and the residuals of [Pb] from
which the effect of [V] is removed.  That is, if Residual (Pb) = [Pb] -
[\widehat{Pb}], where [\widehat{Pb}] = A' + B'[V], the partial B_1 of
equation (7) is equivalent to the simple regression coefficient B in:




     [PM] = A + B [Residual (Pb)]                              Eq. 8.

     [PM] = A + B_1 [[Pb] - (A' + B'[V])]                      Eq. 9.






Thus, the partial B_1 is the simple regression coefficient between PM
and the residuals of [Pb].  The effect of [V] is thus removed from both
PM and Pb, and the resulting residuals are correlated by simple least
squares analysis.  As a consequence of this, the coefficient of a given
tracer fitted in a stepwise regression will vary somewhat from step to
step as each succeeding variable coefficient is fitted.
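
The equivalence between the partial coefficient and the simple
regression on residuals can be verified numerically.  In the sketch
below the concentrations are hypothetical values chosen for
illustration, the function names are assumptions, and the two-predictor
coefficient is obtained from the closed-form solution of the normal
equations rather than from a stepwise program.

```python
def mean(values):
    return sum(values) / len(values)


def simple_fit(x, y):
    """Return (intercept, slope) of the least squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def partial_b1(x1, x2, y):
    """Coefficient of x1 in y = A + B1*x1 + B2*x2 (two-predictor closed form)."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    return (s1y * s22 - s2y * s12) / (s11 * s22 - s12 ** 2)


# Hypothetical concentrations (ug/m3), not data from the report.
pb = [0.20, 0.50, 0.30, 0.90, 0.60, 0.40, 0.75, 0.35]
v = [0.030, 0.012, 0.050, 0.024, 0.041, 0.058, 0.018, 0.033]
pm = [5.9, 7.4, 9.2, 12.1, 11.3, 10.6, 10.0, 8.1]

# Remove the effect of [V] from [Pb]: Residual(Pb) = [Pb] - (A' + B'[V]).
a_prime, b_prime = simple_fit(v, pb)
residual_pb = [p - (a_prime + b_prime * q) for p, q in zip(pb, v)]

# Simple regression of PM on the Pb residuals (Eq. 8).
_, b_from_residuals = simple_fit(residual_pb, pm)
b1_multiple = partial_b1(pb, v, pm)

print(f"B from residual regression = {b_from_residuals:.4f}")
print(f"partial B_1 from Eq. 7     = {b1_multiple:.4f}")
```

The two printed coefficients agree to within rounding error, which is
the property the text describes.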






     If there is a great deal of multicollinearity or correlation
between the independent tracers, estimates of the regression
coefficients may vary considerably from data set to data set.  If there
is a high degree of correlation between the X_i's, their coefficients
may not be uniquely determined or it may not be possible to invert the
correlation matrix of these tracer variables.  The application of
common factor analysis to a data set prior to multiple regression
analysis can minimize this problem by identifying statistically
independent source tracers to be used as predictor variables.  If the
source tracers are highly correlated for a given data set, additional
field measurements or additional tracers will be required to separate
the contributions of the sources.  It should be noted that in some
instances, an estimate of the combined contributions of two source
types may be adequate for the purpose of a study.
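
The inversion problem can be seen in the simplest case of two tracers,
for which the determinant of the correlation matrix is 1 - r^2.  The
sketch below is illustrative only: the function names are assumptions
and the tracer concentrations are hypothetical.  As r approaches 1 the
determinant approaches zero and the matrix can no longer be inverted
reliably.

```python
import math


def pearson_r(x, y):
    """Correlation coefficient between two series of equal length."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_determinant(x1, x2):
    """Determinant of the 2 x 2 correlation matrix [[1, r], [r, 1]]."""
    r = pearson_r(x1, x2)
    return 1.0 - r * r


# Hypothetical tracer concentrations (ug/m3): the second tracer follows
# the first closely, as two tracers from co-located sources might.
tracer_1 = [0.20, 0.35, 0.50, 0.65, 0.80, 0.95]
tracer_2 = [0.041, 0.069, 0.102, 0.128, 0.163, 0.189]

r = pearson_r(tracer_1, tracer_2)
det = correlation_determinant(tracer_1, tracer_2)
print(f"r = {r:.3f}, determinant of correlation matrix = {det:.4f}")
```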
     The regression coefficients, B_j, relate the source emissions to
ambient particulate concentrations according to the relationship:

     B_j = \alpha_j [PM]_j / [T]_j,          j = 1 to n          Eq. 10.

where [PM]_j/[T]_j is the ratio of particle mass to tracer mass in
emissions from source j and \alpha_j is a coefficient or more complex
functional relationship which describes changes in the ratio
[PM]_j/[T]_j.  Such changes may occur between the source and receptor
as a result of physical and chemical processes.  For the simple example
given for Figure 2,

     B = 1 \times [PM]_{mv} / [Pb]_{mv}                          Eq. 11.

where the "mv" subscript denotes motor vehicle emissions and a value of
1 is assumed for \alpha_j.  With actual air pollution data, the value
of \alpha_j need not be 1.  If the ratio [PM]_j/[T]_j for source j
emissions has been measured, a value of \alpha_j can be estimated using
the regression coefficient from the model for that source and the
emissions measurements.  In principle, more complex relationships could
also be determined, although this has not been done to date.
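
Rearranging Eq. 10 gives the coefficient as the regression coefficient
divided by the measured source ratio.  The sketch below illustrates the
arithmetic with the Pb values quoted later in this chapter (a regression
coefficient of 12 and a tunnel-derived mass-to-lead ratio of 11.2);
treating those two numbers as a matched pair for this purpose is an
assumption made here, and the function name is hypothetical.

```python
def estimate_alpha(regression_coefficient, source_mass_to_tracer_ratio):
    """Rearranged Eq. 10: alpha_j = B_j / ([PM]_j / [T]_j)."""
    return regression_coefficient / source_mass_to_tracer_ratio


# B for Pb from the ambient regression; [PM]/[Pb] from tunnel measurements.
alpha_pb = estimate_alpha(12.0, 11.2)
print(f"alpha for the Pb tracer = {alpha_pb:.2f}")
```

A value near 1, as obtained here, would suggest little change in the
mass-to-tracer ratio between source and receptor.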



3.3 AN EXAMPLE OF AN APPLICATION OF MULTIPLE REGRESSION ANALYSIS TO AIR
POLLUTION DATA


     Trace element and sulfate concentrations determined for samples
collected over a period of two years at an urban site were analyzed by
common factor analysis.  This solution was shown in Table 2.6.  Based on
the results, the species Pb, V, Fe, Cu* and SO4= were initially selected
as predictors for emissions from motor vehicles, oil burning,
resuspended soil, incineration and sulfate-related aerosol,
respectively, and regression coefficients were fitted for a model of the
form:
     [PM] = A + \sum_{i=1}^{n} B_i [T_i],     i = 1 to n sources     Eq. 12.


Further  work indicated that  the equation could be improved if Ti  rather

than  Fe was used as a soil tracer.  The percent of PM mass contributed

by  soil  resuspension was not significantly changed by this substitution,

but the  significance of the  overall equation was improved somewhat.


     Table 3.1 presents the regression coefficients for each of the
predictor tracers as well as the estimated standard errors of the
coefficients, the values of F and their level of significance.  The
tracers are listed in the order in which they were entered into the
equation by the stepwise regression analysis, i.e., SO4= was entered
first since it accounted for the greatest proportion of the variability
in PM; V was the second predictor fitted, and so on.  The coefficients
for the first four tracers are all statistically significant at the
p ≤ 0.05 level or better (i.e., there is only a 0.05 probability (out
of 1.0) or less that the relationship between PM and each of the
variables is due to chance alone).  The coefficient for Pb is of only
marginal significance (p = 0.19) and would usually be omitted from the
equation.  However, it can provide an estimate of the motor vehicle
contributions at this site, although the estimate will have relatively
large uncertainties.  The intercept or constant A for this equation is
negative but has a large uncertainty and represents only a small
fraction of the total PM.

*Cu could be used as a tracer in this data set as induction motors
were used in the air samplers, thus eliminating the usual Cu
contamination from the brushes of the Hi-Vol samplers.

Table 3.1.  Regression Coefficients for a Multiple Regression Equation
            for PM (a, b)

                                                          Statistics for
                                           Standard Error  Coefficients
Variable   Source Type                B_i     of B_i         F        p

SO4=       Sulfate-related aerosol    2.10    0.18         130.0    <0.001
V          Oil burning                110     34            10.0     0.002
Ti         Resuspended soil           329     102           10.3     0.002
Cu         Incineration               126     64             3.8     0.05
Pb         Motor vehicles             5.8     4.4            1.7     0.19
Constant   -                          -1.7    3.7            0.2     0.6

a. Coefficients given for concentrations of tracers expressed in µg/m3:
   [PM] = 2.10 [SO4=] + 110 [V] + 329 [Ti] + 126 [Cu] + 5.8 [Pb] - 1.7
b. Overall statistics for entire equation: F = 37.3, p < 0.001,
   r^2 = 0.72.



     The overall statistics for the entire equation are given at the
bottom of Table 3.1.  The value of F is statistically significant and
the value of r^2 (the square of the correlation coefficient) is 0.72,
i.e., 72% of the variability in PM is explained by the equation.



     At each step in this multiple regression  analysis,  the  tolerance


for  the  variables  not yet in the equation was calculated and printed.


This can be considered a measure of the independence  of  the  remaining


tracers  relative  to  those  already entered.  A high value (0.8 - 1.0)


indicates little association among the tracers  (independence)  while  a


low  value  indicates collinearity with the tracers already in the equa-


tion and may result  in  computational  difficulties.   In  the  example


given,  the  tolerances were always greater than 0.8 for tracers not yet


in the equation.
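
The report does not give the formula used by the regression program for
the tolerance.  The sketch below assumes the definition commonly used in
statistical packages: one minus the squared multiple correlation between
the candidate tracer and the tracers already in the equation.  It
handles the case of two tracers already entered, using the standard
expression for R^2 in terms of pairwise correlations.  Function names
and concentrations are hypothetical.

```python
import math


def pearson_r(x, y):
    """Correlation coefficient between two series of equal length."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def tolerance(candidate, entered_1, entered_2):
    """1 - R^2 of the candidate regressed on two tracers already entered."""
    r12 = pearson_r(entered_1, entered_2)
    r1c = pearson_r(entered_1, candidate)
    r2c = pearson_r(entered_2, candidate)
    r_squared = (r1c ** 2 + r2c ** 2 - 2.0 * r1c * r2c * r12) / (1.0 - r12 ** 2)
    return 1.0 - r_squared


# Hypothetical tracer concentrations (ug/m3), not data from the report.
sulfate = [6.1, 9.4, 7.7, 11.2, 8.3, 5.9, 10.1, 8.8]
vanadium = [0.031, 0.044, 0.052, 0.029, 0.047, 0.038, 0.035, 0.041]
titanium = [0.012, 0.019, 0.015, 0.021, 0.011, 0.018, 0.014, 0.017]

print(f"tolerance of Ti given SO4 and V = {tolerance(titanium, sulfate, vanadium):.2f}")
```

A candidate that is an exact linear combination of the entered tracers
has a tolerance of zero; a candidate uncorrelated with both has a
tolerance of 1.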



     Table 3.2 presents the estimates of the average contributions of
each type of source to atmospheric concentrations of PM based on the
multiple regression coefficients and A of Table 3.1.  The average
concentration of each of the tracers was multiplied by its
corresponding coefficient to obtain these estimates.  Sulfate-related
aerosol [(NH4)2SO4, (NH4)HSO4, H2SO4 and any related organic or
inorganic species] contributed 17.6 µg/m3 of the PM on average, while
oil burning and resuspended soil contributed lower concentrations of 4.3
and 5.3 µg/m3, respectively.  Incineration and motor vehicles
contributed only a few µg/m3 at the rooftop (15 stories) sampling site
where these measurements were made.  The uncertainty in the motor
vehicle contribution was almost as large as the estimate itself;
however, the contribution from this source at the site studied was less
than 10% of the total PM.

Table 3.2.  Source Contributions to PM Estimated by Regression Equation
            of Table 3.1.

                                    Average                     Estimated
                                    Concentration               Source
                                    of Tracer,    Regression    Contribution
                                    µg/m3         Coefficient   to PM, µg/m3 (a)
Variable  Source Type               (T_i)         (B_i)         (B_i T_i)

SO4=      Sulfate-related aerosol   8.38          2.1 ± 0.2     17.6 ± 1.5
V         Oil burning               0.039         110 ± 34       4.3 ± 1.4
Ti        Resuspended soil          0.016         329 ± 102      5.3 ± 1.6
Cu        Incineration              0.023         126 ± 64       2.8 ± 1.5
Pb        Motor vehicles            0.506         5.8 ± 4.4      2.9 ± 2.2
A         Constant                  -             -1.7 ± 3.7    -1.7 ± 3.7

a. Uncertainties estimated as products of standard error of coefficient
   times average tracer concentration.
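
The arithmetic behind Table 3.2 can be reproduced from the coefficients
of Table 3.1 and the average tracer concentrations.  The sketch below is
illustrative; the dictionary layout and function name are assumptions.
Because the published averages and coefficients are rounded, the
products agree with the tabulated contributions only to within about
0.1 µg/m3.

```python
# name: (source type, average tracer concentration in ug/m3,
#        regression coefficient B_i, standard error of B_i)
tracers = {
    "SO4": ("Sulfate-related aerosol", 8.38, 2.10, 0.18),
    "V": ("Oil burning", 0.039, 110.0, 34.0),
    "Ti": ("Resuspended soil", 0.016, 329.0, 102.0),
    "Cu": ("Incineration", 0.023, 126.0, 64.0),
    "Pb": ("Motor vehicles", 0.506, 5.8, 4.4),
}
intercept = -1.7  # constant A of Table 3.1, ug/m3


def source_contributions(table):
    """Return {tracer: (B_i * T_i, standard error of B_i * T_i)} in ug/m3."""
    return {
        name: (coef * conc, std_err * conc)
        for name, (_, conc, coef, std_err) in table.items()
    }


contributions = source_contributions(tracers)
total_pm = intercept + sum(value for value, _ in contributions.values())

for name, (value, uncertainty) in contributions.items():
    share = 100.0 * value / total_pm
    print(f"{name:4s} {value:5.1f} +/- {uncertainty:3.1f} ug/m3 ({share:4.1f}% of PM)")
print(f"predicted average PM = {total_pm:.1f} ug/m3")
```

The motor vehicle share computed this way is a little over 9% of the
predicted average PM, consistent with the statement in the text that it
was less than 10%.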
     Once the regression equation has been obtained and examined, it is
advisable to examine both the intercept A (PM which cannot be
attributed to known sources) and the residuals of the individual cases.
If the value of A is fairly large, it usually indicates that there are
unidentified sources of PM.  Morandi (2) has shown that the correlation
of the residuals of such a model with other variables in a data set can
be useful in identifying additional sources of PM.  The pattern of the
residuals should be examined as well (1).  The principal ways of
examining such patterns are plots of the distribution of the residuals
in a time sequence, and plots against both the particulate mass
(dependent variable) and the tracers (independent variables) of the
model.  These techniques, and others, have been discussed in detail by
Draper and Smith (1) and can be helpful in arriving at the best
possible final regression model for estimating the contributions of
various sources.

3.4 REGRESSION MODEL VALIDATION






     The regression equation obtained in the final  stage  of   the  mul-




tivariate  analysis is a model of the relationship between the  dependent




variable (in the previous case PM) and a set of independent predictor




tracers.  The value of such a model lies in its ability to represent the




real physical-chemical properties of the system under  study,   and  more




particularly  of  predicting  changes  or  trends likely on the basis of




expected changes in source strength.  The use of a regression model  for




these  purposes  requires validation of the coefficients derived as well




as validation of the overall model.






     A factor analytical model when satisfactorily validated  will  pro-




vide  tracers  from which independent predictor tracers will  be selected




for use in the development of the regression model.  Morandi (2) has
developed regression subroutines for identification of independent
predictor tracers where multicollinearity or variable intercorrelation




prevents predictor selection through the factor analysis alone.   Regard-




less of the methodology, the validity of each selected  predictor  vari-




able  must  be  qualitatively  confirmed from published source profiles,




known fuel compositions, or well established source emission   or  source




location  vs.   ambient composition relationships.   Entry of  unexplained




predictor tracers in stepwise multiple regression  calculations   renders




outputs of doubtful validity.






     During the data analyses the statistical parameters  available  for




evaluation  of program outputs should be thoroughly understood and care-




fully examined at each step in the process.   No  equation  can  be  con-




sidered valid if all statistical criteria have not been met.




     The ability of a model to predict the concentration of particulate


matter or other air pollutant is a major test of its validity.   Adequate


tests of predictions are rare in  the  field  of  atmospheric  pollutant


source apportionment.  Several approaches to model validation are possi-


ble.  The Quail Roost II workshop provided a  complete  set  of  artifi-


cially  generated source and receptor composition data for analysis (3).


As the true results were known, outputs of  the  modeling  systems  were


readily  validated.   Unfortunately, only 40 ambient data sets were gen-


erated, which  is insufficient for validation  of  multivariate  analysis


methods (d/v = 28, see eq. 8, Chapter 2).



     Currie et al. (3) reported the results of the application of
several receptor modeling methods to a simulated set of 40 ambient
samples.  Factor analysis and multiple regression correctly identified
and quantitated a source for which no source profile was provided.  One
of two chemical mass balance approaches and the Target Transformation
Factor Analysis (TTFA) method also identified the "missing" source.
Another source, which produced a simulated 0.05 µg/m3 concentration,
was missed by three methods, and a 20-fold overestimate was reported by
a fourth.  Two other sources were not found by the regression method;
however, one of these sources was identified by TTFA.  It appears that
the sample set of forty is inadequate to produce accurate FA/MR
results.






     Individual regression coefficients are the best estimates of the
ratio of mass (dependent variable) to the tracer (predictor variable)
for a particular source type.  For validation these coefficients are
readily compared to available source emission profiles.  However, care
must be exercised in such comparisons regarding the assumption of the
conservative behavior of the emitted particle mass and composition.
Alterations of elemental concentrations due to changes in the particle
size distribution, condensation, or evaporation en route to the receptor
will affect comparisons of the coefficients with elemental
concentrations obtained from stack tests on a particular source.






     There are two other approaches that can be used for  validation  of




the  regression  model.   The first is to split the data into a training




set and one or more test sets.  The training set is analyzed by stepwise



multiple regression to obtain the coefficients for the model.   The coef-




ficients are then applied to average predictor tracer concentrations  to




obtain  average  source  contributions.   The latter step is performed on




both the training and test  sets.   In  each  instance,  the  calculated




masses are summed and compared to the measured average for  the dependent



variable.   The comparison affords a measure of the accuracy of the model




calculation.  This requires the availability of large numbers  of  samples



to produce valid solutions for both data sets.
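
The comparison step of the split-sample approach can be sketched as
follows.  The coefficients are assumed to have been fitted already on
the training set; the function names, the coefficient values and the
test-set concentrations are all hypothetical and serve only to show the
calculation of the predicted average mass and its ratio to the measured
average.

```python
def average(values):
    return sum(values) / len(values)


def predicted_average_mass(intercept, coefficients, tracer_data):
    """Sum of the average source contributions B_k * mean(T_k), plus A."""
    return intercept + sum(
        coefficients[name] * average(tracer_data[name]) for name in coefficients
    )


def validation_ratio(intercept, coefficients, tracer_data, measured_mass):
    """Predicted average mass divided by the measured average mass."""
    predicted = predicted_average_mass(intercept, coefficients, tracer_data)
    return predicted / average(measured_mass)


# Hypothetical coefficients from a training set (ug mass per ug tracer).
intercept = 1.5
coefficients = {"Pb": 9.8, "V": 105.0}

# Hypothetical test-set concentrations (ug/m3).
test_tracers = {
    "Pb": [0.31, 0.48, 0.26, 0.55, 0.40],
    "V": [0.022, 0.035, 0.041, 0.018, 0.030],
}
test_pm = [7.2, 10.4, 8.9, 8.1, 9.0]

ratio = validation_ratio(intercept, coefficients, test_tracers, test_pm)
print(f"predicted / measured average PM on the test set = {ratio:.2f}")
```

A ratio close to 1 on both the training and test sets indicates that
the summed source contributions account for the measured average mass.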






     The ultimate evaluation of a model  lies in comparisons of predicted




variations  with  time.   Where data bases are collected over  sufficient




time periods,  or ambient studies are repeated, data may be  available  to




compare  predictions  from early studies to actual  results  of  changes in




source strengths or other regulatory activities.






     3.4.1 Regression Coefficient Validation






     The qualitative validation of the predictor variables has been
performed in the examination of factors, factor loadings, clusters of
variables, etc.  The stepwise regression produces coefficients (B),
which are estimates of the quantitative relationship between each
predictor variable and the dependent variable as associated with a
particular source (as shown in equation 10).  The dependent variable is
generally a mass of particles or some substance per cubic meter of air,
e.g. B = [Mass in source/Y in source].






     The coefficients will have error terms which depend on the variance



inherent in the data set, but may also have significant biases.  This




possibility can be examined by validation of the coefficients.  Kleinman




(4,  5)  evaluated  pertinent literature data to suggest source profiles




for  the sources of particles in the New York City area and  to  validate




the  derived regression coefficients for several of the major sources. As




shown in Table 3.3, the coefficients  for  1972-1973  were  compared  to




source  emissions  composition  data  to  validate the coefficients. The




regression coefficients had relative error terms of 25  to  40%.   Error




terms  for  the  emission  ratios cannot be estimated from the available




reports.






     The case for the automobile, prior to control of the lead content
of gasoline, illustrates the ideal since the regression coefficient
indicates 12 µg mass/µg Pb.  Data for samples taken of intake and
exhaust air for a vehicular tunnel were used to calculate a ratio of
mass to lead of 11.2 for the automotive emissions added to the exhaust
air (7).  Automobile exhaust data reported by Mueller et al. (11) would
have indicated a factor of 2.5.  The difference can be attributed to
loss of lead by large particle deposition, if any, and mass added to the
primary exhaust particles by condensation in the exhaust plume.  The
tunnel study measured the total motor vehicle particulate mass added
after dilution and cooling, and gave a coefficient very close to that
for the ambient regression coefficient for a period prior to reductions
of lead in gasoline or the use of catalyst-equipped cars.

Table 3.3.  Comparison of Regression Coefficients and Source Emission
            Factors (µg Particle Mass/µg Tracer)

                                  Literature Emission Ratios
         Regression    Oil Fly                                    Sulfate
Tracer   Coefficient   Ash       Auto    Soil       Incinerators  Aerosols
         (a)           (b)       (c)     (d)        (e)           (f)

V        103           8-91
Pb       12                      11.2
Mn       418                             670-4200
Cu       54                                         >500
SO4=     1.66                                                     1.0 to 1.4

a. Ambient data from 1972-1973, Kleinman (4, 5).
b. Watson (6).
c. From Larsen, 1966 (7).
d. Watson - for resuspended soil (6).
e. References (8-10).
f. Factors for H2SO4, NH4HSO4, and (NH4)2SO4.






     The case for vanadium offers a second insight.   The  concentration




of  this element in crude oil is highly variable, ranging from less than




a few, to hundreds of micrograms per gram of oil, depending on  the  oil




field  studied.   The  concentration  in fly ash will vary by source and




burner system as well as being dependent on the sampling approach.   The




agreement for this coefficient was worse, but acceptable.






     Profiles for soils and secondary aerosols such as sulfate  are  not




readily obtainable.  Both have area or regional sources.  The agreements




between the regression coefficients for these sources and the  available




literature  data are acceptable, as urban and roadside dusts may contain




higher organic contents than reference  rocks,  lowering  any  elemental




concentrations.   Secondary  aerosols  would typically contain water and




organic matter as well as NH4+.






     The copper value falls outside the range  of  reported  ratios  for




incinerators;  major industrial sources such as smelters do not exist in




New York City.  Either emissions from  the  residential  and  commercial




incinerators  in  this  area differed from those measured at a few large




municipal incinerators studied by others, or an  unexpected  source  was




involved.
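
The comparisons discussed above for Table 3.3 amount to checking each
regression coefficient against the literature range for its source.
The sketch below is one illustrative way to tabulate that check; the
function name and the convention of reporting the ratio to the nearest
bound are assumptions, not a procedure from the report.

```python
# tracer: (regression coefficient, literature low, literature high)
# Values from Table 3.3, ug particle mass per ug tracer.  A high bound of
# None represents the open-ended ">500" entry for incinerators.
table_3_3 = {
    "V": (103.0, 8.0, 91.0),
    "Pb": (12.0, 11.2, 11.2),
    "Mn": (418.0, 670.0, 4200.0),
    "Cu": (54.0, 500.0, None),
    "SO4": (1.66, 1.0, 1.4),
}


def compare_to_literature(coefficient, low, high):
    """Return (position, ratio of coefficient to the nearest literature bound)."""
    if coefficient < low:
        return "below", coefficient / low
    if high is not None and coefficient > high:
        return "above", coefficient / high
    return "within", 1.0


for tracer, (coefficient, low, high) in table_3_3.items():
    position, ratio = compare_to_literature(coefficient, low, high)
    print(f"{tracer:4s} coefficient {coefficient:7.2f} is {position} "
          f"the literature range (ratio to nearest bound {ratio:.2f})")
```

Under this convention the Pb, V and SO4= coefficients lie within about
20% of the nearest literature value, while the Cu coefficient is roughly
a tenth of the lowest reported incinerator ratio.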






     Daisey and Hopke (12) have compared coefficients from a  model  for




extractable  organic  matter  with  organic/elemental masses from source

studies on automotive  (Pb), oil burning  (V) and resuspended dust  sources




(Ti).   The  results   showed satisfactory agreement (Table 3.4) particu-




larly in view of the very limited data available in the literature.  The




variability of the V ratio to the reference values is in part due to the




variable vanadium content of oils, which was noted previously.






     As part of the Airborne Toxic Element and Organic Substances
(ATEOS) project (13), a study of sources of inhalable particulate matter
in Newark, N.J. by Morandi et al. (14) developed coefficients for sulfate,




soil,  autos,   oil  burning  and  an  industrial source.  Comparisons to




source profile data are shown in Table 3.5.






     The ratio of the mass of  any  emitted  substance  to  a  predictor




tracer  should  offer an opportunity for similar comparisons to sources.




Daisey (15) and Daisey  and  Kneip  (16)   have  made   such  comparisons.




Ratios of both extractable organic mass and specific organic compounds
to tracers have been evaluated.  For the former, a ratio of non-polar
extractable organic matter to lead was 0.65 for motor vehicle sources on
an annual basis and 2.2 with no space heating.  This compares to an
estimate of 1.6 for Allegheny Tunnel data.  Daisey (15) also found that a




regression coefficient for chrysene/Pb of 0.0017 compared well  with  an




emission estimate of 0.0022.






     Morandi (2)  has continued the evaluation and  validation of predic-




tor  variable coefficients by methods which emphasize  the examination of




the intercept  and the residuals  of a regression equation  for   inhalable




particulate matter.    A  pattern  in the time sequence  of the inhalable




particulate matter residuals indicated a  missing predictor tracer.

Table 3.4.  Comparison of Source Tracer Coefficients and Ratios of Non-Polar
            Cyclohexane-Soluble Organics (CX) or Volatilizable Carbon to
            Tracer in Source Emissions
Models

  Model 1(1979,1980)

  Previous NYC Model(1978)

Sources

  Allegheny Tunnel,1979*'
   Spark ignition engines
                              Organic
                              Species
CX

CX



CX
              yg Extracted Mass/yg Tracer
              Coefficient of or Ratio to:
   Pb_          V



1.1 ±0.4   52  ± 3

0.65±0.35  29  ± 4



2   ±7
  Residual Oil(fine particle) Volatilizable
                               Carbon
  Street Dust(fine particle)  Volatilizable
                               Carbon
                                     Ti
20112
Ref.



 12

 16
                          5.7±34.0
                          0.6- 7.8
                          0.1-10.3
                                   11-13
   Samples of particulate matter  collected  in the Allegheny Tunnel during  the
   summer of 1979 were provided by Dr. William Pierson of the Ford Motor
   Company.  Portions of each  sample were extracted sequentially with
   cyclohexane and dichloromethane to determine extractable organic masses.
   A second aliquot of each was analyzed for lead.

Table 3.5.  Comparisons of Regression Coefficients of Particulate Matter
            Models to Source Emissions Compositions (µg Mass/µg Tracer)

             Regression
             Coefficients            Literature Emission Ratios (a)
Predictor                New         Sulfate
Variable     Newark      York (a)    Aerosols    Soils       Autos      Oil

SO4=         1.6         1.66        1.0-1.4
Mn           718         418                     670-4200
CO (b)       714                                             350-700
V            106         103                                            8-91

a. See Table 3.3.
b. Substituted for Pb as multiple lead sources were observed in the
   factor analysis.  Emissions estimates from Allegheny Tunnel Studies
   (17, 18).

     The analysis of the regression equation residuals has enabled
Morandi (2) to develop predictor tracers based on tracer regressions and,
through analysis of residuals, to identify a tracer for an unsuspected
source.  Meteorological relationships, to be discussed, and micro
inventory processes, discussed earlier, have aided in the validation of
the residuals from the tracer regression equations.






     3.4.2  Meteorological Relationships






     Studies of point sources have, for obvious reasons, often
emphasized the changes in concentrations associated with samples taken
simultaneously both upwind and downwind of such a source.  The data
obtained in a source apportionment study are often from a temporal (time
series) study design rather than a spatial design.  Spatial information
is, however, embedded in the sample data because of changes in wind
direction.  Morandi et al. (14) have found that specific time periods
or single days may yield extremely high concentration values for one or
more tracers which are not characteristic of the data distribution found
for the bulk of the data.  In such cases careful use of data truncating
methods may be required to obtain a useful set of factor analysis
results or to reduce apparent error in the regression coefficients (19).






     Several common sources  of error or causes of tracer  intercorrela-




tion  include  meteorological relationships, and  sampling and analytical




influences.  The latter two  problem types  should  be  examined  so  that




invalid  data  are removed before multivariate statistical analysis pro-




cedures are applied.  As this is not always  done effectively,  reviews




are necessary where unusual  outcomes are observed (20, 21).

     More commonly, specific data sets with extreme values may relate  to




unusual  meteorological  conditions  such as stagnations, repeated daily




inversions,  or  continuous  stable  wind  directions.   These  periods,




whether  single  days or sets of days often termed "episodes",  should be




studied to determine the relationship of the  unusual  predictor  tracer




concentrations  to  meteorological factors.   The regression coefficients




obtained with reduced data sets must also be compared to those for total




data  sets,  and,  if possible,  with the excluded data sets.   This latter




step forms a further basis for  coefficient validation.






     Examination of the extreme value sets was actually used in the pre-




viously cited example for Newark (2).  High zinc concentrations occurred




during an episode  and were  strongly  related  to  northeasterly  winds.




This wind direction occurred ~20% of the time for the Newark area, and




proved to be the  direction  which  put  the  sampler  downwind  of  the




smelter.






     3.4.3  General Model Evaluation






     The regression equation is a numerical  model of  the source—receptor




relationships  which were the focus  of the program.   The overall  valida-




tion of the equation requires more than  validation  of  the   individual




coefficients or examination of  source—receptor  geography and meteorolog-




ical parameters.






     The model must be examined as a whole, and the residual term must be
satisfactorily small.  An objective means of determining this requires
setting a limit prior to the study.  A residual term of more than 10% of
the total measured dependent variable may be an indication that a
significant source has been missed.  The presence of predictor variables
expected for local sources, and general tracer relationships compatible
with known sources, provide further confidence in the overall model.
Should an expected major source fail to have a significant coefficient
for an acceptable tracer, serious doubt is cast on the model.  Overall
knowledge of the area source inventory and catalogs of source emission
composition profiles are useful in these types of general model
evaluations.  The model must have an overall sound, logical relation to
sources, meteorology and atmospheric chemistry and physics.
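
     The residual check described above can be sketched in a few lines of
Python.  This is only an illustration: the function names and the numerical
values are hypothetical, and the 10% limit is simply the figure quoted in
the text, which a study would fix in advance.

```python
def residual_fraction(residual_term, measured_total):
    """Residual (intercept) term as a fraction of the mean measured
    dependent variable, e.g. TSP; both arguments in the same units."""
    if measured_total <= 0:
        raise ValueError("measured_total must be positive")
    return residual_term / measured_total


def residual_is_acceptable(residual_term, measured_total, limit=0.10):
    """True when the residual does not exceed the limit set before the study."""
    return abs(residual_fraction(residual_term, measured_total)) <= limit


# Hypothetical values in ug/m3, chosen only for illustration.
print(residual_is_acceptable(4.0, 80.0))   # residual is 5% of the total
print(residual_is_acceptable(20.0, 60.0))  # residual is one third of the total
```

A failed check does not identify the missing source; it only signals that
the factor analysis and tracer selection should be revisited.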






     Additional objective evaluations are possible in model validations
where sufficient data are available.  For instance, a model may be based
on data called a "training set" and then used with additional data
called a "test set."  In the process of the development of the Factor
Analysis/Multiple Regression approach a long term data base was
collected for several locations in New York City.  Kleinman (4) and
Kleinman et al. (22) demonstrated the validity of the approach by several
evaluations of "test sets," i.e., data sets unused in the regression
analysis, using coefficients for the 1972-1973 data for samples taken at
the New York University Medical Center, i.e., the "training set."






     The coefficients  obtained for the 1972-1973 data were first used to




calculate  particle  mass for  each source and the  total mass.  The resi-




dual term was also recorded for that period.   The  calculated  sum  was




compared to the average  measured TSP as a means of validating the model.




As would be expected,  the percentage difference between   calculated  and




observed TSP was only -3.7% for 1972 and +5%  for 1973.

     The regression coefficients were adjusted for known changes in fuel
composition and the adjusted coefficients were used with the respective
yearly average predictor variable concentrations for 1969, 1974 and 1975
for the same sampling site.  The differences were +9.7%, +7% and +28% for
measured TSP values of 134 µg/m³ (1969), 71 µg/m³ (1974) and 52 µg/m³
(1975).  While the model appears to accurately identify the major
sources, it apparently failed to indicate correctly the decreased
emissions when annual TSP levels dropped from 70 to 80 µg/m³ in the 1972
to 1974 period to 52 µg/m³ in 1975.  The residual term accounted for a
third or more of the calculated mass in the 1975 calculation.  Therefore,
possible changes in the regression coefficients, i.e., emission
composition changes, and variation in the residual term may have
seriously affected the accuracy of the estimated TSP values for the test
set for data collected in 1975.
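
     The test-set comparison can be sketched as follows.  The sign
convention (calculated minus observed, relative to observed) is an
assumption inferred from the discussion above.  The observed TSP values are
those quoted in the text; the "calculated" values and the 10% tolerance are
hypothetical and are included only so that the example runs.

```python
def percent_difference(calculated, observed):
    """Percentage difference of calculated from observed mass (assumed
    convention: positive when the model overestimates)."""
    return 100.0 * (calculated - observed) / observed


def evaluate_test_sets(calculated, observed, tolerance_pct=10.0):
    """Return {key: (difference in %, within tolerance?)} for each test set."""
    report = {}
    for key, obs in observed.items():
        diff = percent_difference(calculated[key], obs)
        report[key] = (round(diff, 1), abs(diff) <= tolerance_pct)
    return report


observed = {1969: 134.0, 1974: 71.0, 1975: 52.0}    # measured TSP, ug/m3
calculated = {1969: 147.0, 1974: 76.0, 1975: 66.6}  # hypothetical model output

for year, (diff, ok) in evaluate_test_sets(calculated, observed).items():
    print(year, diff, "acceptable" if ok else "review model")
```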



     The coefficients derived from the 1972-1973 results for the Medical


Center  site were also applied to data for sites in lower Manhattan, the


Bronx, Queens and a rural site in Tuxedo,  N.Y.   The differences  between


calculated  and  observed  values ranged from 0 to 13% for the first two


 sites which are affected by sources similar to those around the  central


Manhattan  Medical  Center  site.    For  the  Queens  and the rural  area


differences of -32 to 57% were observed indicating that the model   could


not  be  used  at  these  two sites;  this  is probably due to substantial


differences in the nature of the sources at these  two sites compared  to


the Manhattan site.




     In a subsequent study Kneip, Mallon and Kleinman (23) observed
year-to-year variations in the coefficients over the period of 1977 to
1980.  The shifts were probably related to changing fuel compositions
and changes in sources.  Satisfactory differences between calculated and
observed values were obtained for the 1978 to 1980 data when respirable
particles (D50 ≤ 3.5 µm) and coarse particles (D50 ≥ 3.5 µm) were
sampled and the tracers were determined for each size fraction.  The
coefficients for each size range gave measured differences of -2% for the
respirable fraction and 0% for the coarse fraction.






     They concluded  that regression coefficients  were  valid  for  the




periods during  which data was  collected,  but not valid for other periods




where unknown or undefined changes in fuels or sources had affected  the




source  emission compositions. For instance, a  shift in the coefficient




for vanadium was apparently related to a sharp increase in the airborne




vanadium  levels.    A change in the location of  the  sources of crude oil




caused  by the Iranian revolution  probably affected the  source  emission




compositions  and ambient concentrations.  Copper concentrations dropped




sharply from  1976  to 1980 as small residential   incinerators  were  shut




down   in  New York.   The copper coefficient was  no longer significant  in




the 1978-1980 data.






     3.4.4  Summary






     The validation process is undertaken in three stages: 1) validation
of the source related factors obtained and predictor tracers selected; 2)
validation of the regression coefficients; and 3) overall model
validation.  While these have been successfully performed in a number of
cited cases, several principles must be kept in mind.  Sufficient data
sets must be available to obtain a reasonable number of significant
factors.  Under or over specification in the selection of probable
predictor tracers can be a problem.  The presence of only a few factors
or a large residual term in the regression equation usually signifies a
lack of sufficient information on potential sources and the related
predictor tracers.  Careful experimental designs and validation efforts
not only will provide greater confidence in the factor-source and
quantitative regression relationships, but also will aid in defining
sources not expected when the original design was established.








3.5  INTERPRETATION






     The final interpretation of a source  apportionment/receptor  model




is dependent on all preceding steps.  Clearly stated objectives, effective
experimental design, accurate and precise operations from procurement
to data base maintenance, thorough multivariate analyses, and complete
model validation are all essential.  Since the FA/MR method is




applicable  in  situations of standard compliance problems or toxic sub-




stance investigations,  it is assumed  that  a  major  objective  is  the




definition  of  those sources which may be controllable,  and may also be




contributing significant fractions of the pollutant  under  study.   The




final  product of such a study is the concentration of pollutant assign-




able to each source at the receptor during  the  period  of  the  study.



Fractional  contributions  are  normally calculated and the magnitude of




each source contribution is then apparent.
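
     The calculation of fractional contributions can be illustrated with
a short Python sketch.  It assumes the usual FA/MR form in which each
source contribution is the product of a regression coefficient and the
mean tracer concentration; the source names, coefficients and
concentrations below are hypothetical.

```python
def source_contributions(coefficients, tracer_means):
    """Mass assigned to each source: coefficient times mean tracer level."""
    return {src: coefficients[src] * tracer_means[src] for src in coefficients}


def fractional_contributions(contributions, residual_term):
    """Fractions of the calculated total, with the residual listed separately."""
    total = sum(contributions.values()) + residual_term
    fractions = {src: mass / total for src, mass in contributions.items()}
    fractions["residual"] = residual_term / total
    return fractions


# Hypothetical coefficients (ug mass/ug tracer) and tracer means (ug/m3).
coefficients = {"motor vehicles (Pb)": 10.0, "oil burning (V)": 100.0}
tracer_means = {"motor vehicles (Pb)": 1.0, "oil burning (V)": 0.2}

contribs = source_contributions(coefficients, tracer_means)
for name, frac in fractional_contributions(contribs, residual_term=10.0).items():
    print(f"{name}: {100 * frac:.0f}%")
```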






     The first step in interpretation is a  comparison of   all   data  for




the pollutant in question to a standard or  to exposure limits.   An esti-




mate of overall reductions needed is made on  the  basis   of  this  com-




parison.   The  second step is to examine the source assignments and the

residual term from the regression equation.  If the latter  is  a  large




fraction of the total pollutant  concentration, a major source or sources




may remain unidentified.  When this occurs, one returns  to  the  factor




analysis  step  and  reexamines  the  assumptions concerning convergence,




communality and significance of the apparently less important factors




(2).






     Assuming this is not the  case,   the   sources  identified  must  be




categorized as controllable, potentially controllable or uncontrollable.




The latter might  be  a case  for the  problem  of regional secondary sulfate




pollution.  The  situation may remain  virtually unchanged, since any con-




trols  implemented to reduce local contributions to zero will not  reduce




the concentrations at the receptor  site from the regional sources.






     Controllable or potentially controllable sources are the categories




desired when major reductions in ambient concentrations are needed.  The




previous  sections have  illustrated  cases of  industrial  point  sources,




area   sources,   automotive  emissions,  and general industrial contribu-




tions.  Each  of  these or any other  source discovered  must  be  reviewed




prior  to  establishing final conclusions.






     The criteria are, for example:

     1.  Type of source

           Point - major

           Point - numerous, small

           Area - resuspended soil

           Mobile - motor vehicles

     2.  Fraction of mass attributable to each source

     3.  Error range of mass

     4.  Validity of factor and predictor variable for each source

     5.  Validity of each regression coefficient

     6.  Meteorological relationships

           Wind direction

           Stability

           Stagnation episodes

     7.  Micro and Macro Inventories

     8.  Observation of source operations.

     Several decisions can be made on the basis of these criteria.  A
source (point or area) may be judged to be a major contributor and
controllable; in that case a control program should be established.  A
source or sources could be judged insignificant or uncontrollable, and no
control program implemented.  A source or sources could be judged
significant, but more information may be needed to clearly prove the case.

     Such a case might occur where the FA/MR method identifies a
previously unknown, unevaluated or unexpected source.  These may include
such cases as: groups of small commercial/industrial operations with poor
emissions controls or frequent outdoor operations not controlled in any
way; a large point source impacting the receptor via infrequent wind
directions or only during extended stagnation episodes; or impacts of
resuspended soil from vacant lands.  These types of problems may require
special studies where the data are judged insufficient for action.








3.5.1  Summary






     Interpretation of the FA/MR results is based  on  a  well  designed




program, operated and evaluated by well informed staff members.






     It is important to maintain an  appreciation  of  the  relationship




between  objectives and interpretation, to carefully document every step




of  the process and to use all  information from the validation  steps  in




classifying the results of the overall interpretation.

3.6  References









  1.  Draper, N., and Smith, H., 1981, Applied Regression Analysis, 2nd
      ed., John Wiley and Sons, New York.

  2.  Morandi, M.T., 1985, Development of receptor oriented source
      apportionment models for inhalable particulate matter and
      particulate organic matter in New Jersey.  Ph.D. Dissertation, New
      York University Medical Center, February.

  3.  Currie, L., Gerlach, R.W., Lewis, C., et al., 1984,
      Intercomparison of source apportionment procedures: Results for
      simulated data sets.  Atmos. Environ., 18: 1517-1537.

  4.  Kleinman, M.T., 1977, The apportionment of sources of airborne
      particulate matter.  Ph.D. Dissertation, New York University
      Medical Center, June.

  5.  Kleinman, M.T., Eisenbud, M., Lippmann, M., Kneip, T.J., 1980, The
      use of tracers to identify sources of airborne particles.  Environ.
      Int., 4: 53-62.

  6.  Watson, J.G., 1979, Chemical element balance receptor model
      methodology for assessing the source of fine and total suspended
      particulate matter in Portland, Oregon.  Ph.D. Dissertation, Oregon
      Graduate Center, February.


  7.  Larson, R.J., 1966, Air pollution from motor vehicles.  Ann. N.Y.
      Acad. Sci., 136: 275.

  8.  Greenberg, R.R., Gordon, G.E., Zoller, W.H., Jacko, R.B.,
      Neuendorf, D.W., and Yost, K.J., 1978, Composition of particles
      emitted from the Nicosia municipal incinerator.  Env. Sci. Tech.,
      12(12): 1329-1332.

  9.  Law, S.L., and Gordon, G.E., 1979, Sources of metals in municipal
      incinerator emissions.  Env. Sci. Tech., 13(4): 432-438.

 10.  Jacko, R.B., and Neuendorf, D.W., 1977, Trace metal particulate
      emission test results from a number of industrial and municipal
      point sources.  J. APCA, 27(10): 989-994.

 11.  Mueller, P.K., Helwig, H.L., Alcocer, A.E., Gong, W.E., and Jones,
      E.E., 1964, Concentration of fine particles and lead in car
      exhaust.  Symposium on Air Pollution Measurement Methods, Special
      Technical Publication, No. 352, American Society for Testing
      Materials.

 12.  Daisey, J.M., and Hopke, P.K., 1984, Receptor source apportionment
      models for three fractions of respirable particulate organic
      matter.  Paper No. 84-16.4 presented at the 77th Annual Meeting of
      the Air Pollution Control Association, San Francisco, CA, June
      24-29.


 13.  Lioy, P.J., Daisey, J.M., Atherholt, T., Bozzelli, J., Darack, F.,
      Fisher, R., Greenberg, A., Harkov, R., Kebbekus, B., Kneip, T.J.,
      Louis, J., McGarrity, G., McGeorge, L., and Reiss, N.M., 1983, The
      New Jersey Project on airborne toxic elements and organic
      substances (ATEOS): A summary of the 1981 summer and 1982 winter
      studies.  J. APCA, 33(7): 649-657.

 14.  Morandi, M., Daisey, J.M., and Lioy, P.J., 1983, A receptor source
      apportionment model for inhalable particulate matter in Newark,
      N.J.  Paper No. 83-14.2 presented at the 76th Annual Meeting of the
      Air Pollution Control Association, Atlanta, GA, June 19-24.

 15.  Daisey, J.M., In Press, A new approach to the identification of
      sources of airborne mutagens and carcinogens: Receptor source
      apportionment modeling.  Env. Int'l.

 16.  Daisey, J.M., and Kneip, T.J., 1981, Atmospheric particulate
      organic matter - Multivariate models for identifying sources and
      estimating their contributions to the ambient aerosol.  ACS
      Symposium Series, No. 167, Atmospheric Aerosol: Source/Air Quality
      Relationships, E.S. Macias and P.K. Hopke, Eds., American Chemical
      Society, 197-221.

 17.  Pierson, W.R., and Brachaczek, W.W., 1983, Particulate matter
      associated with vehicles on the road, II.  Environ. Sci. Tech., 2:
      1.


 18.  Gorse, R.A., and Norbeck, J.M., 1981, CO emission rates for in use
      gasoline and diesel vehicles.  J. APCA, 31: 1094.

 19.  Shah, J.J., Huntzicker, J.J., Kneip, T.J., and Daisey, J.M., 1981,
      Investigation of the sources of carbonaceous aerosol in New York
      City by multiple linear regression.  Paper No. 81-64.3 presented
      at the 74th Annual Meeting of the Air Pollution Control
      Association, Philadelphia, PA, June 21-26.

 20.  Gaarenstroom, P.D., Perone, S.P., Moyers, J.L., 1977, Application
      of pattern recognition and factor analysis for characterization of
      atmospheric particulate composition in southwest desert
      atmosphere.  Env. Sci. Tech., 11: 795-800.

 21.  Daisey, J.M., Lioy, P.J., and Kneip, T.J., 1984, Receptor models
      for airborne organic species.  EPA report submitted January,
      Contract No. CR810300-01-0.

 22.  Kleinman, M.T., Pasternack, Eisenbud, M., and Kneip, T.J., 1980,
      Identifying and estimating the relative importance of sources of
      airborne particulates.  Env. Sci. Tech., 14: 62-65.

 23.  Kneip, T.J., Mallon, R.P., Kleinman, M.T., 1983, The impact of
      changing air quality on multiple regression models for coarse and
      fine particle fractions.  Atmos. Env., 17(2): 299-304.






                                APPENDIX









4.0 ALTERNATIVE APPROACHES TO REGRESSION ANALYSIS






     Several authors have used variations in the factor  analysis  steps




to  aid  in  selection  of  the  independent  predictor  variables,  i.e.




tracers, used  in  the  stepwise  multiple  regression.   Three  methods




reported to date are described briefly in this appendix.









A.1  TARGET TRANSFORMATION FACTOR ANALYSIS (TTFA)






     Both principal component and  common  (classical)  factor  analysis




apportion  the  variance  rather  than the absolute concentration of the




elemental tracers of a data set among different factors.   As  a  conse-




quence,  the  source-related  factors which are obtained do not give the




relative concentrations of the  elements  in  the  emissions  from  each




source nor the contributions of each source factor to the ambient parti-




cle concentrations.  In order to overcome  these  inherent   limitations,




Hopke and co-workers (1-4) have developed the target transformation fac-




tor analysis (TTFA) for source apportionment  of  the  ambient  aerosol.




The  TTFA  method  differs  from  PCA  and common factor analyses in two




important ways:




1)  the use of Q rather than R-mode analyses;




2)  rotation of the obtained factors to align with a target which  is  a




source  emission profile rather than rotation according to  abstract sta-




tistical criteria.






     As the first step in TTFA, Q-mode factor analysis is used to screen
the data and to determine the upper limit on the number of factors to be
retained for the actual TTFA.  In Q-mode analysis, correlations between
samples rather than between variables (i.e., elements) are used and the
relative elemental concentrations are thus preserved in the analysis
(2).  For example, ambient Pb concentrations are typically one to two
orders of magnitude greater than those of Cd and, in Q-mode analysis,
this feature of the data is retained.  In the more usual R-mode
analysis, the data for each element are normalized in calculating the
correlations between elements and the information about relative
elemental concentrations is thus lost.  In Q-mode analysis, the source
profile matrix is calculated first from the data and eigenvalue matrix;
in R-mode, the mass contribution matrix is calculated first from the data
and eigenvectors.  In principle, the results of the two analyses should
be comparable and this is generally observed in practice.
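
     The distinction between the two modes can be illustrated with a
small numerical sketch.  It uses ordinary Pearson correlations from NumPy
on randomly generated, purely hypothetical concentrations; it is not the
FANTASIA implementation.  Rescaling one element (for example, making Pb one
hundred times more abundant) leaves the R-mode matrix unchanged but alters
the Q-mode matrix, which is the sense in which Q-mode retains relative
elemental concentrations.

```python
import numpy as np

rng = np.random.default_rng(0)
# Rows are samples, columns are elements (hypothetical Pb, Cd, V, Mn).
X = rng.lognormal(mean=0.0, sigma=0.5, size=(6, 4))


def r_mode(data):
    """Correlations between variables (elements): n_elements x n_elements."""
    return np.corrcoef(data, rowvar=False)


def q_mode(data):
    """Correlations between samples: n_samples x n_samples."""
    return np.corrcoef(data, rowvar=True)


X_scaled = X.copy()
X_scaled[:, 0] *= 100.0  # make the first element far more abundant

R_before, R_after = r_mode(X), r_mode(X_scaled)
Q_before, Q_after = q_mode(X), q_mode(X_scaled)

print("R-mode unchanged by rescaling:", np.allclose(R_before, R_after))
print("Q-mode unchanged by rescaling:", np.allclose(Q_before, Q_after))
```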






     The second difference between TTFA and other types of factor
analyses lies in the method of rotating the factors in order to obtain
interpretable factors.  In TTFA, each factor is rotated to maximize (by
least squares analysis) the overlap of the factor and a test vector.
The test vector can be an actual source emissions profile containing
values for all of the elements (normalized relative to one element in
the vector) or a unique test vector in which a tracer element is
assigned an initial value of 1.0 and all other elements are assigned a
value of zero.  The latter is the most common approach.  When unique
test vectors are used, there is a vector for each element or species.
The test vector is iteratively refined by replacing the input value for
a given element with that predicted by the target transformation and the
factor is again rotated against the refined test vector.  This should be
equivalent to refining source emissions profiles measured at a source to
the form in which they exist in the atmosphere.
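
     A single least-squares target rotation against a unique test vector
can be sketched as below.  This is a simplified illustration under stated
assumptions, not the published TTFA algorithm: the loading matrix is
hypothetical, only one rotation step is shown, and the iterative refinement
of the test vector described above is omitted.

```python
import numpy as np


def unique_test_vector(n_elements, tracer_index):
    """Test vector with 1.0 for the tracer element and zero elsewhere."""
    vector = np.zeros(n_elements)
    vector[tracer_index] = 1.0
    return vector


def target_rotate(loadings, test_vector):
    """Least-squares rotation of the retained factors toward a test vector.

    Returns the rotation vector r minimizing ||loadings @ r - test_vector||
    and the predicted profile loadings @ r.
    """
    r, *_ = np.linalg.lstsq(loadings, test_vector, rcond=None)
    return r, loadings @ r


# Hypothetical loadings: 4 elements (rows) on 2 retained factors (columns).
loadings = np.array([[0.9, 0.1],
                     [0.8, 0.2],
                     [0.1, 0.9],
                     [0.2, 0.7]])

test = unique_test_vector(4, tracer_index=0)
rotation, predicted = target_rotate(loadings, test)
print("predicted profile:", np.round(predicted, 3))
```

The predicted profile is the closest vector to the test vector that the
retained factors can reproduce; a large mismatch suggests that the tracer
does not correspond to a source represented in the factor space.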






     After each  of the  refined  source profiles has   been  obtained  they




 are  normalized to  a sum of 1,000,000 and  they are grouped p  vectors at a




 time to reproduce  the original  ambient data.  Cluster  analysis   is  used




 to   determine  which  of  the  iterated profiles are  similar and  these are




 combined  to form a matrix of  source  profiles.   These   source   emissions




 profiles  are  then scaled   to  reflect the actual  concentration of each




 element in the emissions from that source (µg element per gram of
 particles emitted) by regression of the mass concentration of each sample




 against each element.   Finally  the contribution of  each  source  to  total




 particle  concentration  is   determined based on the scaled  source  emis-




 sions profiles.






     The  FANTASIA  program written by Hopke and co-workers  (1-4)  to   per-




 form TTFA  is available  through the courtesy of Dr. Hopke.  The  program




 has  been written for the  CDC Cyber 175 Computer and utilizes a number of




 subroutines  from   the  IMSL  subroutine library (5)  for matrix manipula-




 tions.  The principal advantage  to TTFA is that  complete  source   emis-




 sions  profiles  can be obtained from ambient measurements (based on use




 of unique vectors)  and  can be compared  to  actual  emissions  profiles,




 thus  providing  evidence  of  the  validity of the source apportionment




model.








A.2  MULTIPLE LINEAR REGRESSION ON TRACERS/FACTOR ANALYSIS  [MLR(T)/FA]






     Dattner (in Currie, 6) has applied the FA/MR method to the data
sets analyzed in the Quail Roost II exercises.  This version of the
method was designated "Multiple Linear Regression by Tracers/Factor
Analysis" [MLR(T)/FA].  The steps described were: classical factor
analysis for data screening and for determination of the number and
characteristics of major sources; selection of the element with highest
loading as a tracer for each factor; "backward stepwise unweighted
multiple linear regression" to obtain the equation for mass in terms of
the tracer elements with significant regression coefficients
(coefficients $B_j > 1.96\,S_{B_j}$); and calculation of the source
contributions from the products of the coefficients, $B_j$, and
concentrations, $C_j$.  The source contributions were calculated for each
sampling period (case) and averaged over cases of interest.
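
     The backward stepwise step can be sketched in Python with NumPy.
This is a minimal illustration under stated assumptions, not Dattner's
program: the data are synthetic, the tracer names are hypothetical, and the
elimination rule (drop the tracer with the smallest ratio of coefficient to
standard error until every remaining ratio exceeds 1.96) is one plausible
reading of the significance criterion quoted above.

```python
import numpy as np


def ols_with_se(X, y):
    """Unweighted least squares with intercept; returns (beta, std errors)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(A.T @ A)
    return beta, np.sqrt(np.diag(cov))


def backward_stepwise(X, y, names, z=1.96):
    """Drop tracers until every retained coefficient B_j exceeds z * S_Bj."""
    keep = list(range(X.shape[1]))
    while keep:
        beta, se = ols_with_se(X[:, keep], y)
        ratios = beta[1:] / se[1:]
        worst = int(np.argmin(ratios))
        if ratios[worst] > z:
            return beta[0], {names[j]: b for j, b in zip(keep, beta[1:])}
        del keep[worst]
    return float(np.mean(y)), {}


# Synthetic example: mass depends on Pb and V but not on Zn.
rng = np.random.default_rng(1)
n = 200
names = ["Pb", "V", "Zn"]
X = np.column_stack([rng.uniform(0.5, 2.0, n),
                     rng.uniform(0.01, 0.1, n),
                     rng.uniform(0.1, 1.0, n)])
mass = 10.0 + 2.0 * X[:, 0] + 50.0 * X[:, 1] + rng.normal(0.0, 0.5, n)

intercept, coefs = backward_stepwise(X, mass, names)
# Source contribution per case is B_j * C_j; here averaged over all cases.
mean_contrib = {k: b * X[:, names.index(k)].mean() for k, b in coefs.items()}
print(intercept, coefs, mean_contrib)
```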




     Dattner extended the method by repeating the backward stepwise
regression for each tracer as dependent variable, and by normalizing the
regression coefficient matrix obtained to a total mass coefficient for
each source (column) equal to unity.  The columns are then concentration
equivalent profiles which can be compared to available source profile
data sets.






A.3  REGRESSION OF ABSOLUTE PRINCIPAL COMPONENT SCORES (1)




     Principal Component Analysis is applied to elemental and other
composition variables for a large data set (332 in the case of Thurston
and Spengler (2)), and the results are used to identify the sources
affecting the site.  Then the mass contribution from each source is
identified using a new empirical technique that involves computation of
an Absolute Principal Component Score (APCS) for each sample, and
regression of the sample mass on the APCS to obtain a mass contribution.
Pollution source elemental profiles suggested by the source impacts of
the final regression analyses are then compared with the literature.






     The  principal  component  analysis  used  is  the  standard   type




explained  in  Chapter  2, with a varimax rotation applied to obtain the




best orthogonal representation of the factors.   After  examination  of




the  original  composition  variable  loadings  on  a  given factor, the




results are interpreted and identifiable sources noted.






     The next step in the analysis is the examination of  the  principal




component  scores (eq. 3, Chapter 2) to estimate the quantitative source




impacts.  Thurston and Spengler (2)  note that  the  principal  component




scores  are  correlated with a source impact but are not proportional to




these impacts in the usual Z-score of PCA computer printout.  Regression
of the dependent variable on these scores ($P_{jk}$) would be of the form

     $$M_k = \bar{Y} + \sum_{j=1}^{p} C_j P_{jk}$$

where $\bar{Y}$ is the mean, and $C_j$ are the conversion coefficients of the




non-dimensional  score  deviations  into  mass  deviations from the mean




impact of a source (7).  Since these results  are  not  deviations  from




zero, there is no apportionment possible of the dependent variable (e.g.




TSP, IPM, PM).






     The technique developed by Thurston and Spengler (2) estimates  the




absolute  zero  principal component score by separately  scoring an extra




"day" in  which the  tracer  concentrations  were  zero.   This  is  accom-




plished  by obtaining  a Z-score (Chapter 2, eq. 2)  for its absolute zero




concentrations, and then calculating the rotated absolute zero PC
scores.  These estimated absolute zero scores are subtracted from the
original components for each day to obtain an APCS.



     The final step in the solution to obtain the mass contribution from
a given source is

     $$M_k = Y_0 + \sum_{j=1}^{p} Y_j \,\mathrm{APCS}^*_{jk}$$

where $M_k$ is the mass associated with observation k,
$\mathrm{APCS}^*_{jk}$ is the rotated absolute component score for source j
and observation k, and $Y_j \,\mathrm{APCS}^*_{jk}$ is the mass contribution
from source j.
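
     The APCS steps can be sketched with NumPy as follows.  This is a
simplified illustration, not the Thurston and Spengler code: the varimax
rotation is omitted, so the columns correspond to unrotated principal
components rather than to identifiable sources, and the data are synthetic.
All function and variable names are hypothetical.

```python
import numpy as np


def apcs_apportionment(X, mass, n_components):
    """Regress sample mass on absolute principal component scores.

    X is samples x elements. Returns the intercept and a samples x
    components array of mass contributions (coefficient times APCS).
    """
    mean, std = X.mean(axis=0), X.std(axis=0)
    Z = (X - mean) / std                               # Z-scores per element
    eigval, eigvec = np.linalg.eigh(np.corrcoef(X, rowvar=False))
    top = np.argsort(eigval)[::-1][:n_components]
    W = eigvec[:, top] / np.sqrt(eigval[top])          # scoring weights
    scores = Z @ W                                     # mean-zero PC scores
    zero_scores = ((0.0 - mean) / std) @ W             # extra "day" of zeros
    apcs = scores - zero_scores                        # absolute scores
    A = np.column_stack([np.ones(len(mass)), apcs])
    coef, *_ = np.linalg.lstsq(A, mass, rcond=None)
    return coef[0], apcs * coef[1:]


# Synthetic data: two sources with fixed element profiles and no noise.
rng = np.random.default_rng(2)
G = rng.uniform(1.0, 10.0, size=(50, 2))               # source masses
F = np.array([[0.50, 0.10, 0.02, 0.30],
              [0.05, 0.40, 0.20, 0.01]])               # element profiles
X = G @ F
mass = G.sum(axis=1)

intercept, contributions = apcs_apportionment(X, mass, n_components=2)
print("intercept:", round(float(intercept), 6))
print("mean contributions:", contributions.mean(axis=0))
```

Because the absolute scores are referenced to a true zero, the fitted
intercept estimates mass not associated with the retained components, and
in this noise-free example it is essentially zero.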

1.  Hopke, P. K., R. E. Lamb and D. F. S. Natusch, 1980.  Multielemental
    characterization of urban street dust.  Environ. Sci. Technol.,
    14: 164-172.

2.  Alpert, D. J. and P. K. Hopke, 1980.  A quantitative determination
    of sources in the Boston urban aerosol, Atmos. Environ., 14:
    1137-1146.

3.  Alpert, D. J. and P. K. Hopke, 1981.  A determination of the
    sources of airborne particles collected during the regional air
    pollution study, Atmos. Environ., 15: 675-687.

4.  Hopke, P. K., D. J. Alpert and B. A. Roscoe, 1983.  FANTASIA - A
    program for target transformation factor analysis to apportion
    sources in environmental samples.  Computers and Chemistry, 7:
    149-155.

5.  IMSL Library Reference Manual, IMSL LIB-0008, IMSL, Inc., Houston,
    Texas.

6.  Currie, L. A. et al., 1984.  Intercomparison of source
    apportionment procedures: results for simulated data sets, Atmos.
    Environ., 18: 1517-1537.

7.  Thurston, G. D., 1983.  A source apportionment of particulate air
    pollution in metropolitan Boston.  Doctoral Dissertation.
    Department of Environmental Health Sciences, School of Public
    Health, Harvard University, Boston, MA.

8.  Thurston, G. D. and J. D. Spengler, 1985.  A quantitative
    assessment of source contributions to inhalable particulate matter
    in metropolitan Boston, Atmos. Environ., 19:

-------
TECHNICAL REPORT DATA
(Please read Instructions on the reverse before completing)

1.  REPORT NO.: EPA-450/4-85-007
2.
3.  RECIPIENT'S ACCESSION NO.:
4.  TITLE AND SUBTITLE: Receptor Model Technical Series, Vol. VI: Factor
    Analysis And Multiple Regression (FA/MR) Techniques In Source
    Apportionment
5.  REPORT DATE: July 1985
6.  PERFORMING ORGANIZATION CODE:
7.  AUTHOR(S): Paul J. Lioy, Theo J. Kneip and Joan M. Daisey
8.  PERFORMING ORGANIZATION REPORT NO.:
9.  PERFORMING ORGANIZATION NAME AND ADDRESS:
    Institute Of Environmental Medicine
    NYU Medical Center
    550 First Avenue
    New York, NY 10016
10. PROGRAM ELEMENT NO.:
11. CONTRACT/GRANT NO.: 4D2975NASA
12. SPONSORING AGENCY NAME AND ADDRESS:
    U. S. Environmental Protection Agency
    OAQPS, MDAD, MD 14
    Research Triangle Park, NC 27711
13. TYPE OF REPORT AND PERIOD COVERED:
14. SPONSORING AGENCY CODE:
15. SUPPLEMENTARY NOTES:
    EPA Project Officer: Thompson G. Pace, III
16. ABSTRACT

The anticipated change in the form of the particulate matter standard from total
suspended particulate to matter with aerodynamic diameter equal to or less than 10
micrometers (PM10) will require, in some instances, more sophisticated approaches to
identifying primary sources of PM10. This is the sixth document in a series of
user-oriented receptor modeling guidance documents.

Over the past twelve years, a number of multivariate methods have been used to
determine the sources of mass emitted in a number of cities. This document focuses
primarily on the FA/MR technique. However, the procedures required to identify
potential tracers or source profiles and to validate the results are applicable to all

-------