United States Environmental Protection Agency
Environmental Monitoring Systems Laboratory
Research Triangle Park NC 27711

Research and Development
EPA-600/S4-80-048
August 1982

Project Summary

Application of Cluster Analysis to Aerometric Data

Harold L. Crutcher, Raymond C. Rhodes, Maurice E. Graves, Beth Fairbairn, A. Carl Nelson, and Michael Symons

The NORMIX data analysis program, which incorporates cluster analysis and multivariate statistical analysis routines, was modified and revised for use on a UNIVAC 1110 computer. The revised program was tested on three sample data sets and produced results in agreement with those from the original program. The NORMIX program was then used to evaluate and analyze eight sets of aerometric data from various sources. Comparison of the performance of NORMIX with two other cluster analysis algorithms, MIKCA and SAS CLUSTER, revealed that all three programs produce similar results in terms of hierarchical clustering, but NORMIX gives the user considerably more statistical evaluation and information. Thus NORMIX is recommended as the most useful cluster analysis program of these three.

This Project Summary was developed by EPA's Environmental Monitoring Support Laboratory, Research Triangle Park, NC, to announce key findings of the research project that is fully documented in a separate report of the same title (see Project Report ordering information at back).

Introduction

Pollutants in the environment pose an enduring threat to all living organisms and inanimate structures. It is continually necessary to assess this threat and to take remedial action. For such assessment, it is essential to monitor environmental conditions in order to provide information bases about the possible threats and their variation over time. Such comparisons permit reassessment of the average conditions, their expected variabilities, and any significant changes in those average conditions and variabilities.
Producing credible models and assessments of atmospheric pollution requires extensive, reliable aerometric data. Production of valid data bases requires adequate instrumentation and maintenance, representative exposure, competent personnel, excellent communications, and sufficient quality assurance and control systems in an ongoing, updated observational program.

To aid in the development of valid data bases and to extend methods of data analysis, five specific goals of this study on aerometric data clustering were defined:

1. Develop and document a validated and calibrated digital computer program for cluster analysis.
2. Extend the theory for clustering data.
3. Validate the data obtained.
4. Classify the data.
5. Demonstrate the application of the computer program to various types of data.

As a result of this project, an extensively developed computer program for cluster analysis is available to all users of the UNIVAC 1110 computer at the National Computer Center at the Environmental Protection Agency, Research Triangle Park (NCC-EPA/RTP), North Carolina.

Several clustering techniques that separate heterogeneous aerometric data sets into homogeneous groups were reviewed. It is sometimes difficult to state categorically that a datum belongs to a specified group or cluster; hence, assignment or classification to a group of related data is made in a probabilistic sense.

This report series illustrates the use of these clustering algorithms to group data from a large data base of observed aerometric and meteorological information. These grouped data can be interpreted by researchers and presented to decision-makers. For example, conditions accompanying pollutant episodes can be identified, and oftentimes specific pollutant sources can be identified.

Data from the Community Health Air Monitoring Program (CHAMP) were analyzed by the NORMIX clustering algorithm¹, as described in the three-volume Project Report.
Volume I presents a detailed examination of the historical development of the NORMIX algorithm and its application to the CHAMP data set, plus descriptions of cluster analyses of data from the Los Angeles Catalyst Study (LACS) by the SAS² and MIKCA³ programs. Volume II contains the modified NORMIX program with complete documentation, and Volume III contains more detailed examples and discussion of the application of the NORMIX algorithm to the CHAMP data.

Until the advent of clustering techniques and their algorithms, analyses of multivariate data generally relied on multiple regression, factor analysis, or principal component analysis. Clustering analysis involves a hierarchical grouping of data. Some cluster analysis programs (notably NORMIX) also provide statistical estimates of the multimodal, multivariate distribution. Cluster analysis has been used on air pollution data to reveal cyclic patterns (over days, weeks, or seasons) and to identify local source effects. All of these relations are reflected in different combinations of values of the variables that cluster together. Because of the clustering in aerometric data, multivariate cluster analyses are useful as preliminary analytical and data validation techniques.

The results of the report support the well-known inverse concentration relationship between ozone and the oxides of nitrogen, presumably due to their production timing and to their chemical interaction. These and many other relationships are indicated in tabular form and illustrated in some examples by diagrams in which data are plotted with overlying distribution ellipses in bivariate form, or by profile models of means and three-sigma limits for given measurements.

Data from monitoring activities cannot be considered random samples from a single universe; rather, they result from sampling mixtures of distributions, usually internally correlated within each group.
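The idea of a mixture of internally correlated groups can be written compactly. The notation below is an assumption introduced here for illustration (it is not taken from the Project Report): K is the number of clusters, the weights pi_k are the cluster proportions, and each cluster is modeled as a multivariate normal with its own mean vector and covariance matrix. The second line gives the probability that an observation x belongs to cluster k, which is the probabilistic sense of assignment mentioned in the Introduction.

```latex
% Sketch of a K-component multivariate normal mixture (notation assumed for illustration)
f(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \,
  \mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1

% Probability that observation x belongs to cluster k
P(k \mid \mathbf{x}) =
  \frac{\pi_k \, \mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right)}
       {\sum_{j=1}^{K} \pi_j \, \mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j\right)}
```

The off-diagonal terms of each covariance matrix carry the within-group correlation; a datum lying where two components overlap receives comparable membership probabilities for both.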
As expected, these studies show that pollutant data depend on, or are highly correlated with, meteorological conditions. At a given site and with a given set of pollutant sources, the pollutant concentration at the site is heavily dependent upon the meteorological regime. Thus, the pollutant distributions will be a mixture of the distributions that result from the mixture of the meteorological regimes and the interaction of the pollutants over the time of the monitoring. Solar radiation is an effective agent in the meteorological regime, but these data were not available for inclusion in this study.

This discussion suggests that the pollutant data distributions first be separated into meteorological regimes by cluster analysis and that these subsets then be evaluated individually by other appropriate techniques. Moreover, analysis involving prediction of future pollutant concentration distributions for each meteorological regime should also consider the probabilities of occurrence of each of the meteorological regimes. This will enhance the clustering and the classification of data.

Normality of distribution is not required for simple hierarchical clustering, but if statistical significance statements are to be made, or if statistical characteristics of the clusters are to be used, the normal distribution is quite useful. It is not necessary that exact normality be achieved, as the techniques used are sufficiently robust to accommodate considerable departure from normality. The results are more reliable if individual element (variate) distributions are assumed to be normal or near normal during the application of clustering techniques. If the distributions are distinctly non-normal in appearance, various mathematical operations are available to transform the individual data so that the distributions of the transformed data may be described by the normal distribution prior to their entry into the hierarchical clustering scheme.
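The effect of such a transformation can be checked with a small sketch. The example below is illustrative only: the data values are made up, the function name `skewness` is hypothetical, and the natural logarithm is just one of the possible transformations. It compares the sample skewness (zero for a symmetric distribution) before and after the transform.

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up, strongly right-skewed concentrations (not data from the report).
raw = [1.0, 2.0, 2.0, 3.0, 4.0, 8.0, 30.0]

# Natural-log transform, one common choice for right-skewed variates.
logged = [math.log(v) for v in raw]

print(f"skewness before transform: {skewness(raw):.2f}")
print(f"skewness after transform:  {skewness(logged):.2f}")
```

A skewness closer to zero after the transform suggests the variate is nearer to normal, although a formal test would be needed before making significance statements.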
If no prior information is available, the hierarchical clustering may be the only product of the operation, or it may be a prelude to further information extraction.

Because the NORMIX program has a substantial statistical basis with corresponding statistical assumptions and tests of significance, it was selected for further development in this study. The program, originally written in IBM FORTRAN IV language, was converted to the ASCII FORTRAN language requirement of the UNIVAC 1110 at NCC-EPA/RTP.

Figure 1 shows the general configuration of the expanded NORMIX program, detailing the NORMIX pre-processing options, the central NORMIX core algorithm, and the post-processing flow. Figure 2 presents the NORMIX flow chart. Documentation is available in supplementary and complementary reports, Volumes II and III, which are discussed later.

Figure 1. NORMIX flow and options. [Flow diagram not reproduced. Legible stage labels: NORMIX-PREP, NORMIX, CALCOMP. Legible items: input data displays; transformed data displays; chi-square tests; histograms; probability charts; size grouping; preliminary estimates; clustering; mapping of subsample; grouping pattern; minimum number of points in cluster (3rd iteration); merge details; number of hypothesized types (in sequence); probability level for chi-square tests; override of chi-square test abort; random numbers; scaling; eigenanalysis printout; printing at each iteration; time limit on computations; wind components; maximum number of iterations; covariance matrices, same or different; time printouts; input of preliminary estimates; ellipse plotting and mapping; profiles of cluster variables.]

Figure 2. NORMIX flow chart. [Flow chart not reproduced. Legible labels: NORMIX-PREP, data, parameter estimates, INITLE, MOMENT, SYMINV, INFORM, ELLIPSE, PROFILE.]

Calibration and Validation of the Expanded Normix Program

The ability of the present revised version of the NORMIX program to produce results equivalent to those of Wolfe¹ (for the same data in the older program version and with a different computer) indicated that the revised version has been adequately calibrated. Program validation consisted of application over several types of data sets, not necessarily all aerometric. Data validation consisted of the isolation of outliers, if any, for examination and treatment. Both single and clustered outliers are identifiable in the hierarchical clustering as well as in the NORMIX processing. Since the NORMIX program uses the same hierarchical clustering algorithms as several of the other programs, it is not necessarily more useful for this purpose.

In order to demonstrate that the present version of the NORMIX program is available and works properly, 11 sets of data were used. These were:

1. A classical data set composed of measurements of petal and sepal lengths and widths (four variates) made by Anderson⁴ and used by Fisher⁵ to illustrate clustering and classification.
2. A synthetic bivariate data set from Wolfe¹.
3. A set of synthetic data consisting of three predetermined three-element data sets that could be expanded both in variances and distances between the centroids.
4. A bivariate set of maximum and minimum temperatures for June and July at the Raleigh-Durham, North Carolina, airport, supplied by Professor Jerry Davis of the North Carolina State University at Raleigh.
5. A five-variate set of health-related data for trace metals.
6. A univariate set of river discharge data supplied by the U.S. Geological Survey.
7. A 12-variate set of data on new filters, 12 impurities, and trace metals.
8. A six-variate set of data on the physical and chemical characteristics of new filters.
9. A univariate set of acid precipitation data for each of several locations within the United States.
10. The LACS data set.
11. The CHAMP data set.

The first three sets of data mentioned above were processed by the NORMIX algorithm to calibrate the program conversion. The output of Sets 1 and 2 agreed with those of Wolfe¹.
The output of Set 3 returned the original stipulated clusters prior to their being mixed.

The NORMIX program produces hierarchical mapping of the data, as do most of the other programs discussed in this report, although the metrics used may differ. Tree (branching or dendritic) diagrams, which show the coalescence of data into clusters, may be prepared from the maps. The report contains such tree diagrams, which are not reproduced here. From such tree diagrams, outliers (either singletons or small groups) can be identified easily, as they are the last to enter a larger cluster or the last to join the total group. A lengthy discussion on the reading and interpretation of the diagrams is also available.

The presence of extreme outliers, as singletons or as minimum-sized clusters, creates near-singularities in the data matrices, which can halt a running computer program, slow convergence, or prevent the program from converging to a solution. This phenomenon is characteristic of any program that uses convergence routines involving matrix calculations.

The extensive environmental data bases, LACS and CHAMP, were processed by means of cluster algorithms; SAS CLUSTER and MIKCA were used for the LACS data and NORMIX was used for the CHAMP data.

In order to study the effect of the use of automobile catalytic converters on aerometric parameters, the period of the LACS data necessarily had to include the periods before and after the 1975 introduction of these devices; the period selected was 1974 through 1978. The pollutant elements (variates) observed were suspended particulates (SP), ozone (O3), nitrogen oxide (NO), nitrogen dioxide (NO2), sulfur dioxide (SO2), carbon monoxide (CO), and lead (Pb). The meteorological variates were temperature, wind speed, wind direction and traffic count. For this analysis, measurements of the variates SP, CO, Pb, wind speed, wind direction, and traffic count from two selected sites were used.
These sites were on either side of the San Diego Freeway between the intersection of the freeway with the two boulevards, Sunset and Wilshire.

When more than one element is being observed and recorded, one of the elements may not be obtained for a particular observation time for various reasons. Effectively, in the multivariate sense, the omission of a single element requires the rejection of the entire observation from the data set under consideration. In some cases, it may be reasonable to merge one incomplete multivariate vector (observation) with another complementary incomplete vector from a nearby locale to obtain a usable complete vector. If this is done, this factor must be considered when interpreting the results.

Here is an example of the two effects mentioned above: using only two sites from the LACS data bank, the number of available and useful hourly observations for the 1977 through 1978 period is about 3031, as compared to about 25,000 observations originally available from all eight sites of the LACS. The reader should consult Part 2 of Volume I of the Project Report for further details.

The periods of record at the CHAMP stations were relatively short, i.e., from September 1975 through November 1976 for Angwin, California, and Loma Linda, California, and from August 1974 through September 1976 for Magna, Utah. The data selected for use are for certain hours of the day, days of the week, weeks, and seasons. The pollutant and meteorological data consisted of oxides of nitrogen (NOx), calculated nitrogen oxide (NO), nitrogen dioxide (NO2), sample nitrogen oxide (SNO), ozone (O3), sulfur dioxide (SO2), total hydrocarbons (THC), non-methane hydrocarbons (NMHC), temperature (T), dew point, winds, and atmospheric pressure (P). For this study, all winds were transformed to east-west and north-south components, positive from the west and south.
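The component transformation just described can be sketched as follows. This is a minimal illustration, not the NORMIX routine: the function name `wind_components` is hypothetical, the sample observations are made up, and the direction is assumed to follow the usual meteorological convention (the bearing the wind blows from, in degrees clockwise from north).

```python
import math

def wind_components(speed, direction_deg):
    """Return (west, south) components of a wind observation.

    Assumption: direction_deg is the bearing the wind blows FROM, in degrees
    clockwise from north. Components are positive for wind from the west and
    from the south, matching the sign convention stated in the text.
    """
    theta = math.radians(direction_deg)
    west = -speed * math.sin(theta)
    south = -speed * math.cos(theta)
    return west, south

# Hypothetical observations: (speed in km/h, direction in degrees).
for speed, direction in [(10.0, 270.0), (10.0, 180.0), (4.0, 135.0)]:
    west, south = wind_components(speed, direction)
    print(f"{speed:5.1f} km/h from {direction:5.1f} deg: "
          f"west {west:6.2f}, south {south:6.2f}")
```

Working in rectangular components avoids the discontinuity at 360 degrees, so that winds from 359 and 1 degrees are treated as neighbors when distances between observations are computed.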
An option in the program permits transformation of polar coordinates (wind direction and speed) to rectangular coordinates, along and at right angles to any preselected direction. The default option is the east-west and north-south configuration.

Comparison of Three Clustering Programs

Table 1 compares three clustering algorithms, SAS CLUSTER, MIKCA, and NORMIX. All three algorithms select clusters by an agglomerative rather than a divisive procedure, and the number of clusters to be examined must be stipulated. For each program, the recommended maximum number of cluster configurations is seven.

Readers of the Project Report will find an extensive discussion of clustering and data validation problems and of the uses of clustering programs. No one program satisfies all users. Some of the limitations of each program are discussed. The NORMIX program, being the most complex, produces much more informational output than do the SAS CLUSTER and MIKCA programs.

Examples of Processing Output

Table 2 shows an intercomparison of selected pollutant data throughout the year for NO, NOx, and O3 at Angwin, California, Loma Linda, California, and Magna, Utah. The information in Table 2 reveals that Angwin, California, probably has the lowest levels of oxides of nitrogen, the highest level of ozone, and the lowest variability of these three variates at the three stations. This is, of course, one reason why the Angwin, California, site was selected for monitoring and for this study. The large standard deviation and negative mean value for the Loma Linda, California, NOx data reflect that the number added to low observed values (to ensure a minimum value of two before logarithms are taken) was insufficiently large.
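The offset problem described above can be reproduced with a small sketch. Everything here is illustrative: the readings are made up, both function names are hypothetical, and the report does not state how the added constant was chosen. With a sufficient offset every shifted value is at least two, so every logarithm is at least ln 2 (about 0.693), which is consistent with the Table 2 means near 0.70; an insufficient offset leaves some shifted values below one and produces negative logarithms.

```python
import math

def log_with_fixed_offset(values, offset):
    """Add a fixed constant (a hypothetical choice), then take natural logs."""
    return [math.log(v + offset) for v in values]

def log_with_data_offset(values, floor=2.0):
    """Shift so the smallest value equals `floor`, then take natural logs."""
    shift = floor - min(values)
    return [math.log(v + shift) for v in values]

# Made-up readings in ppm; the last one is strongly negative, as might occur
# with instrument baseline drift (an assumption for illustration).
readings = [0.012, 0.020, 0.008, -1.950]

fixed = log_with_fixed_offset(readings, offset=2.0)
shifted = log_with_data_offset(readings)

print("fixed offset, smallest log:  ", round(min(fixed), 4))
print("data-based offset, smallest: ", round(min(shifted), 4))
```

The single low reading dominates the mean and standard deviation under the fixed offset, which is the pattern seen in the Loma Linda NOx entry.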
Table 1. Comparison of Capabilities of the SAS CLUSTER, MIKCA, and NORMIX Algorithms

Property                        SAS             MIKCA       NORMIX
Complexity                      Low             Medium      High
Output Quantity                 Minimum         Moderate    Extensive
Number of Input Data            250             500         2000*
Limit to Number of Variables    10              20          20*
Maximum Number of Clusters      250             15          150
Distance Options                1               3           1
Criteria Options                1               9           1
Hierarchical Clustering         Yes with Maps   No          Yes with Maps

*May be increased if computer memory space permits. Computation time increases exponentially with the numbers of variates and observations.

Table 2. Intercomparison of Selected Pollutant Data Throughout the Year for NO, NOx, and O3 at Three Locations

Transformed variates*
                          NO                  NOx                 O3                  Number of
Location                  Mean     Std dev.   Mean     Std dev.   Mean     Std dev.   observations
Angwin, California        0.6981   0.0017     0.7002   0.0045     0.7127   0.0087     288
Loma Linda, California    0.7036   0.0130     -3.0442  0.9576     0.7081   0.0198     122
Magna, Utah               0.7055   0.0164     0.7123   0.0216     0.7081   0.0090     324

*Values are logarithmically (base e) transformed data originally in ppm.

Figure 3, developed from NORMIX output information from Magna, Utah, data, illustrates the 0.50 probability ellipses for wind speed and direction and the associated pollutant means, standard deviations, proportions, and the number of observations. Cluster numbers for each point are included to help assess the clustering efficacy. Note that the pollutant variables have been logarithmically transformed and that the numbers refer to the transformed data.

The 0.50 probability ellipses are centered on the centroids of the plots of east-west wind components versus north-south wind components. The ellipses are for the wind components, but the cluster classifications are in terms of the eight variates. As previously noted, it is in the overlapping regions that errors of misclassification may occur for an individual datum. However, the statistical estimates are generally expected to provide the best estimates of the cluster configurations. This type of presentation is a projection of the total multidimensional ellipsoid onto the plane of the two selected variates. Any two variates can be selected by options provided in the program.

The program option that developed Figure 4 arranged the variate output in terms of the largest cluster, with variate means in increasing sequence. Again, as in all presentations such as this, the values of the other variates follow the sequence established by the largest cluster. The numbers below the minimum three-sigma limit, ranging from one through eight, identify the variates in order of their entry into each observational vector. The other numbers identify the three-sigma levels.

In Figures 3 and 4, it may be noted that weak southeast winds with a mean speed of approximately 4 km/h are associated with ozone readings lower than average and oxides of nitrogen and sulfur readings higher than average. Strong winds from the northwest, approximately 9 km/h, and from the south-southeast, approximately 12 km/h, again show the inverse relationship of ozone with oxides of nitrogen and sulfur. Clusters 1 and 3, with southeast and south-southeast winds, respectively, show the greatest temperature difference between the means, namely, 17.54°C. Further investigation might yield the reason for this temperature difference. Speculatively, this feature might be associated with seasonal characteristics or synoptic episodes.

Conclusions

The applications of three clustering algorithms to aerometric data bases were compared in order of investigation: SAS CLUSTER, MIKCA, and NORMIX. The three routines produce similar results through the processing steps of hierarchical clustering and output of cluster means. Beyond that point, MIKCA appears to provide slightly more information than SAS CLUSTER. NORMIX, as modified, produces considerably more information and guidance than either MIKCA or SAS CLUSTER.
NORMIX is the recommended clustering program: a calibrated and tested program with full documentation, available as Volume II of this three-volume report series.

Many other clustering programs may be used, but only the aforementioned three have been examined in this study, and only NORMIX has been examined in detail. Of the three, only NORMIX provided complete statistical estimates of the multimodal, multivariate distributions. SAS CLUSTER is strictly hierarchical in grouping and mapping and uses this information as initial statistical estimates for further iterations to achieve maximum likelihood estimates.

Los Angeles Catalyst Study (LACS) data were analyzed by use of the two algorithms, SAS CLUSTER and MIKCA. The results were similar. Community Health Air Monitoring Program (CHAMP) data also were analyzed by means of the NORMIX program.

References

1. Wolfe, John J. (1971) NORMIX 360 Computer Program. Naval Personnel and Training Research Laboratory, San Diego, California. Research Memorandum SRM 72-4. 125 pp.
2. Barr, A.J., J.H. Goodnight, J.P. Sall, and J.T. Helwig (1976) A User's Guide to SAS. Sparks Press.
3. McRae, D.J. (1973) MIKCA. A FORTRAN IV Iterative K-Means Cluster Analysis Program. CTB/McGraw Hill, Del Monte Research Park, Monterey, California. Revised by M.J. Symons, October 1973.
4. Anderson, Edgar (1935) The Irises of the Gaspe Peninsula. Bull. Amer. Iris Soc. 59:2-5.
5. Fisher, R.A. (1936) The Use of Multiple Measurements in Taxonomic Problems. Ann. Eugen. VII:II:179-188.
Figure 3. Magna, Utah, Day 3, 0.50 probability ellipses of the west-east and south-north wind components for three cluster types. [Plot not reproduced. Plot title: "West vs South wind - probability level = 0.50"; legible axis label: "5 - West wind comp"; points are labeled by cluster number.] Values tabulated with the figure:

                        Cluster 1 (P = .354)   Cluster 2 (P = .333)   Cluster 3 (P = .313)
Variable                Mean      Std dev.     Mean      Std dev.     Mean      Std dev.
1 - NO                  0.72      0.03         0.70      0.00         0.70      0.00
2 - NOx                 0.74      0.03         0.70      0.00         0.70      0.01
3 - Ozone               0.70      0.01         0.71      0.00         0.71      0.00
4 - TS                  0.78      0.07         0.72      0.03         0.71      0.07
5 - West wind comp      -1.70     3.41         3.13      4.42         -2.86     3.86
6 - South wind comp     3.40      4.12         -8.32     5.56         12.12     6.00
7 - Temperature         0.47      3.49         14.85     5.59         18.01     6.22
8 - Dew point           -5.60     3.36         -1.31     2.95         -3.83     3.02

Figure 4. Profile plot. The means and three-sigma limits for each variate of the three data clusters of Figure 3 are represented by triangles (cluster 1), diamonds (cluster 2), and circles (cluster 3), respectively. The variate numbers along the abscissa refer respectively to: 1, NO; 2, NOx; 3, O3; 4, TS; 5, W component of wind; 6, S component of wind; 7, temperature; and 8, dew point. The sequence of variates is determined by the value of their logarithms for cluster 1 (lowest to highest). The other numbers refer to the three-sigma limit values for each variate. [Plot not reproduced.]
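For readers who want to see how a 0.50 probability ellipse such as those in Figure 3 relates to the tabulated standard deviations, a minimal sketch follows. It assumes a bivariate normal distribution, for which the squared Mahalanobis distance follows a chi-square distribution with two degrees of freedom, so the contour enclosing probability p satisfies d² = -2 ln(1 - p). The function name `ellipse_semi_axes` is hypothetical, and the covariance between the two wind components is set to zero only because the summary does not report it; the standard deviations are the cluster 1 wind values from Figure 3.

```python
import math

def ellipse_semi_axes(var_x, var_y, cov_xy, prob=0.50):
    """Semi-axes (major, minor) of the bivariate-normal probability ellipse.

    The ellipse is the contour d^2 = -2 ln(1 - prob), where d is the
    Mahalanobis distance; the axes come from the covariance eigenvalues.
    """
    scale = -2.0 * math.log(1.0 - prob)
    half_trace = 0.5 * (var_x + var_y)
    radius = math.hypot(0.5 * (var_x - var_y), cov_xy)
    lam_major = half_trace + radius  # larger eigenvalue of the 2x2 covariance
    lam_minor = half_trace - radius  # smaller eigenvalue
    return math.sqrt(scale * lam_major), math.sqrt(scale * lam_minor)

# Cluster 1 standard deviations of the west and south wind components (Figure 3).
sd_west, sd_south = 3.41, 4.12

# Zero covariance is an assumption; the summary does not give the correlation.
major, minor = ellipse_semi_axes(sd_west ** 2, sd_south ** 2, cov_xy=0.0)
print(f"semi-major axis: {major:.2f}, semi-minor axis: {minor:.2f}")
```

At the 0.50 level the semi-axes are about 1.18 standard deviations long, so roughly half of a cluster's points are expected to fall outside its ellipse; this is why substantial overlap between neighboring ellipses is normal.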
Harold L. Crutcher is a private consultant at 35 Westell Avenue, Asheville, NC 28804; the EPA author Raymond C. Rhodes (also the EPA Project Officer, see below) is with the Environmental Monitoring Systems Laboratory, Research Triangle Park, NC 27711; Maurice E. Graves is with Northrop Services, Inc., Research Triangle Park, NC 27709; Beth Fairbairn and A. Carl Nelson are with PEDCo Environmental, Inc., Durham, NC 27701; Michael J. Symons is with the University of North Carolina, Chapel Hill, NC 27514.

The complete report consists of three volumes, entitled "Application of Cluster Analysis to Aerometric Data":

"Volume I. Part 1 - Clustering, Validation, and Classification of Data; Part 2 - Investigation and Report of Cluster Analysis" (Order No. PB 82-226 432; Cost: $13.50, subject to change)
"Volume II. Part 3 - Modifications and Options Applied to Wolfe's NORMIX 360 Cluster Analysis Program" (Order No. PB 82-226 440; Cost: $16.50, subject to change)
"Volume III. Part 4 - Separation of Environmental Data Into Clusters by the NORMIX Program" (Order No. PB 82-226 457; Cost: $10.50, subject to change)

The above reports will be available only from:
National Technical Information Service
5285 Port Royal Road
Springfield, VA 22161
Telephone: 703-487-4650

The EPA Project Officer can be contacted at:
Environmental Monitoring Systems Laboratory
U.S. Environmental Protection Agency
Research Triangle Park, NC 27711

United States Environmental Protection Agency
Center for Environmental Research Information
Cincinnati OH 45268