EPA/600/A-96/089
Reprinted from Journal of Climate, Vol. 8, No. 4, April 1995
American Meteorological Society
On the Application of Cluster Analysis to Growing Season Precipitation Data
in North America East of the Rockies
Xiaofeng Gong and Michael B. Richman
Cooperative Institute for Mesoscale Meteorological Studies, University of Oklahoma, Norman, Oklahoma
(Manuscript received 23 February 1993, in final form 24 August 1994)
ABSTRACT
Cluster analysis (CA) has been applied to geophysical research for over two decades although its popularity
has increased dramatically over the past few years. To date, systematic methodological reviews have not appeared
in geophysical literature. In this paper, after a review of a large number of applications of cluster analysis, an
intercomparison of various cluster techniques was carried out on a well-studied dataset (7-day precipitation
data from 1949 to 1987 in central and eastern North America). The cluster methods tested were single linkage,
complete linkage, average linkage between groups, average linkage within a new group, Ward's method, k
means, the nucleated agglomerative method, and the rotated principal component analysis. Three different
dissimilarity measures (Euclidean distance, inverse correlation, and theta angle) and three initial partition
methods were also tested on the hierarchical and nonhierarchical methods, respectively. Twenty-two of the 23
cluster algorithms yielded natural grouping solutions. Monte Carlo simulations were undertaken to examine the
reliability of the cluster solutions. This was done by bootstrap resampling from the full dataset with four different
sample sizes, then testing significance by the t test and the minimum significant difference test.
Results showed that nonhierarchical methods outperformed hierarchical methods. The rotated principal component
methods were found to be the most accurate methods, the nucleated agglomerative method was found to be superior
to all other hard cluster methods, and Ward's method performed best among the hierarchical methods. Single linkage
always yielded "chaining" solutions and, therefore, had poor matches to the input data. Of the three distance measures
tested, Euclidean distance appeared to generate slightly more accurate solutions compared with the inverse correlation.
The theta angle was quite variable in its accuracy. Tests of the initial partition method revealed a sensitivity of k-
means CA to the selection of the seed points. The spatial patterns of cluster analysis applied to the full dataset were
found to differ for various CA methods, thereby creating some questions on how to interpret the resulting spatial
regionalizations. Several methods were shown to incorrectly place geographically separated portions of the domain
into a single cluster. The authors termed this type of result "aggregation error." It was found to be most problematic
at small sample sizes and more severe for specific distance measures. The choice of clustering technique and dissimilarity
measure/initial partition may indeed significantly affect the results of cluster analysis. Cluster analysis accuracy was
also found to be linearly to logarithmically related to the sample size. This relationship was statistically significant.
Several methods, such as Ward's, k means, and the nucleated agglomerative were found to reach a higher level of
accuracy at a lower sample size compared with other CA methods tested. The level of accuracy reached by the rotated
principal component clustering compared with the other methods tested suggests that application of a hard and
nonoverlapping clustering methodology to fuzzy and overlapping geophysical data results in a substantial degradation
in the regionalizations presented.
1. Introduction
In geophysical research, investigators often need to
classify variables into homogeneous groups and identify
their features in order to enhance understanding of
phenomena and predict potential future changes. For
example, in synoptic weather analysis or climate re-
gionalization, observations or variables are separated
into different groups to reveal underlying structure
embedded in the datasets on which further investiga-
tion can be based. Cluster analysis (CA) is one of a number of multivariate techniques that can be used to accomplish such classifications.

Corresponding author address: Xiaofeng Gong, Cooperative Institute for Mesoscale Meteorological Studies, The University of Oklahoma, 100 E. Boyd, Room 1110, Norman, OK 73019-0628.
Clusters have been defined as relative constellations
of contiguous points in space (Punj and Stewart 1983).
Clustering therefore involves the grouping of similar
entities (variables) or observations that exhibit two
properties—external isolation and internal cohesion
(Cormack 1971). External isolation requires that en-
tities (or observations) within one cluster be separated
from entities in another cluster by fairly empty space.
Internal cohesion requires that entities within the same
cluster be similar to one another. The objective of clus-
ter analysis is to uncover the complex nature of mul-
tivariate relationships (by searching for natural group-
ings or types) among the data under investigation, so
as to foster further hypothesis development about the
phenomena being studied. Cluster analysis imposes a
characteristic structure on the data analysis for ex-
ploratory purposes.
Although CA was originally proposed in the 1930s
(e.g., Tryon 1939), it did not attract much interest
until the early 1960s when high-speed computers made
the use of its procedures less tedious (Blashfield 1976;
Hartigan 1975). Clustering techniques were first de-
veloped in an applied scientific context (e.g., biological
taxonomy) rather than by mathematical statisticians.
Their development has been fostered by the research
designs in applied scientific fields and by the enhance-
ment of computational capabilities. During the 1970s,
CA became a common classification tool in many dif-
ferent research fields, including biology, zoology, psy-
chology, psychiatry, sociology, medicine, geography,
and botany.
Cluster analysis has come to be recognized as an
effective statistical tool to deal with such tasks as
grouping stations into climatologically homogeneous
regions (regionalization) based upon some given me-
teorological parameters, or grouping time periods
(days, years, etc.) into clusters that reflect the occur-
rence of weather events (or patterns). The use of CA
in these contexts has proliferated in the past decade.
A detailed survey of the major methodological contri-
butions to CA by other atmospheric scientists, as well
as a comprehensive list of the applications involved, is
presented in appendix A. Those applications can be
summarized roughly into five categories: 1) climato-
logical regionalization for a specific weather parameter;
2) more general climate regionalization through the
use of multiple weather parameters; 3) large-scale at-
mospheric circulation and synoptic pattern analysis;
4) atmospheric elemental composition research; and
5) hydrometeorological studies, especially for watersheds.
In atmospheric science research, there have been
several discussions and evaluations of a range of meth-
odological CA issues. These limited investigations into
the utility of different clustering techniques do not ap-
pear to have yielded any unequivocal consensus. White
and Perry (1989) chose the complete linkage and
Ward's methods to perform their climate classification
on the basis of subjective knowledge and statistical
testing. Kalkstein et al. (1987) applied three clustering
techniques (average linkage, centroid, and Ward's) for
the purposes of synoptic climatological classification.
They concluded that average linkage was superior to
both the centroid and Ward's techniques in the above
context. Fernau and Samson (1990a,b) used CA to
delineate periods of similar atmospheric transport and
patterns of precipitation and pollutant distribution in
eastern North America. After they examined the av-
erage linkage, centroid, and Ward's methods to inves-
tigate the sensitivity of each clustering algorithm, they
concluded that Ward's method yielded a cluster size
that was subjectively thought to be interpretable for
their research problem. Fovell and Fovell (1993) pre-
sent a unique examination of CA preprocessing and
various types of bias. For less traditional methods of
clustering, Richman (1986) presents extensive discus-
sions and comparison of rotation algorithms of prin-
cipal components in a Monte Carlo framework.
After comprehensively reviewing the CA method-
ological research performed in the atmospheric sci-
ences, as summarized above and in appendix A, it be-
came apparent that all previous studies on traditional
methods of CA have justified their methodological de-
cisions (i.e., choice of dissimilarity measure, type of
CA) on the basis of subjective criteria (appendix A).
The main purpose of this paper is to review the large
range of available clustering techniques used in pre-
vious atmospheric science research and shed light on
the sensitivities of the different CA methods to various
aspects of the analysis, so as to provide potential users
with objective general guidelines for selecting a rela-
tively appropriate clustering algorithm in their research.
When this issue has been rigorously examined in past
methodological papers outside of the atmospheric sci-
ences, synthetic data has been used (Cunningham and
Ogilvie 1972; Kuiper and Fisher 1975; Blashfield 1976;
Mojena 1977; Edelbrock 1979; Edelbrock and Mc-
Laughlin 1980; Milligan 1980). Synthetic data have
the advantage of being sampled from known distri-
butions and, therefore, can easily be tested for statistical
significance. However, the key findings of these syn-
thetically based studies are not always in agreement.
Kuiper and Fisher (1975), Blashfield (1976), and Mo-
jena (1977) have found that Ward's method outper-
formed all other hierarchical methods after the appli-
cation of bivariate normal data mixtures, multinomial
data mixtures, and multivariate gamma distribution
data mixtures, respectively. On the other hand, Edel-
brock (1979) and Edelbrock and McLaughlin (1980)
concluded that both Ward's method and average link-
age were most accurate with the multivariate normal
mixture (standardized and unstandardized) and mul-
tivariate gamma mixture datasets. Alternately, Cun-
ningham and Ogilvie (1972) found that average linkage
performed best among the hierarchical methods. Mil-
ligan (1980) found that the k-means procedure with
seed points based on a priori knowledge generally per-
formed better than other methods and that the hier-
archical methods rankings were not robust, changing
with each type of error he introduced. The concept of
ranking CA methods to identify a "best" one has been
controversial among these studies, and no one method
has been accepted to be the most accurate. Further-
more, the use of synthetic data precludes any firm con-
clusions on how these methods might perform on ac-
tual atmospheric or other geophysical datasets.
We feel that, following the general philosophy de-
veloped in Richman (1986), data can be divided into
two broad categories—single-sided (correlations of all
one sign) and two-sided (correlations of bipolar mag-
nitudes). This paper focuses solely on the former, using
a precipitation dataset that has been extensively quality
controlled and analyzed. We hope to extend our anal-
yses to two-sided data in future work.
2. The operationalization of cluster analysis
methodology
a. Basic notations and definitions
Generally, in cluster analysis, the initial raw data collected by the researcher consist of a p × n matrix of measurements, X, where

\[
\mathbf{X} = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{p1} & x_{p2} & \cdots & x_{pn}
\end{pmatrix}. \qquad (2.1)
\]

The matrix X can be thought of as n points in a p-dimensional space. In the rest of this paper, the term "entity" is used to denote the column vectors (i.e., variables) in the matrix X, which are to be subjected to CA. The term "observation" is used to denote the row vectors in X. Hence, x_{ij} represents the value for the jth entity for the ith observation, with n the number of entities to be clustered and p the number of observations.
The matrix X can be transposed to an n × p matrix X^T so that when X^T is used as input data, it is the observations that will be clustered. It should also be mentioned that in other types of analyses, which are not spatial/temporal, the rows and columns of X may take on different meanings.
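To make the entity/observation convention concrete, here is a small hypothetical example (our own toy numbers, not from the study data) in which p = 2 observations of n = 3 entities are laid out as in Eq. (2.1):

```python
import numpy as np

# p = 2 observations (rows) of n = 3 entities (columns), as in Eq. (2.1)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

entities = [X[:, j] for j in range(X.shape[1])]  # column vectors: what CA groups
observations = X.T                               # transpose X to cluster rows instead
```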
b. Data types and transformations
The types of data in matrix X can be binary (only
two possible values), discrete (finite, small number of
values), and continuous (values are real numbers)
since clustering algorithms generally do not restrict
the input dataset to particular statistical distributions.
Furthermore, most researchers prefer datasets with
multinormal mixtures when they work on the theo-
retical statistical evaluation of various clustering meth-
ods (Punj and Stewart 1983).
c. Similarity/dissimilarity indices
The most fundamental stage before applying a clus-
tering algorithm is to establish a numerical similarity
or dissimilarity measurement to characterize the re-
lationships among the data. Hierarchical clustering
methods require specific measurements of dissimilarity
when searching for the most similar data pairs to merge.
This involves the calculation of a dissimilarity matrix
of order n × n. Different measures of dissimilarity and
similarity have been proposed in clustering method-
ology studies. In geophysical research, the Euclidean
distance and correlation coefficient are the two most
frequently employed. The choice of dissimilarity mea-
sure may affect the final clustering solutions and some-
times even the success or failure of the clustering al-
gorithms in specific studies (Anderberg 1973). Less
attention has been focused on this issue, whereas most
previous methodological effort has been expended to
find an appropriate clustering method. This work will
examine both methodological issues in the clustering
context.
To gain a fuller understanding of the effects of dif-
ferent similarity measures, we chose to incorporate a
range of available distancelike (dissimilarity) measures,
ones that are applicable to most types of climatological
data and relatively interpretable. We hope this decision
to test a range of techniques will enhance the utility of
the work for other users of CA.
1) Euclidean distance (ED)
Euclidean distance is the most commonly used dissimilarity measure in CA (see appendix A), although many other distance measures exist. A detailed discussion of various distancelike dissimilarity measures is presented in Cormack (1971) and Duran and Odell (1974).

Consider a p × n data matrix X in p-dimensional space. The Euclidean distance between entities x_i and x_j is given by

\[ d_{ij} = \left[ (\mathbf{x}_i - \mathbf{x}_j)^{\mathrm{T}} (\mathbf{x}_i - \mathbf{x}_j) \right]^{1/2}, \qquad (2.2) \]
assuming that the p observations are independent. En-
tities are usually standardized before calculating Eu-
clidean distance to eliminate the scale effect since ob-
servations having different scales (or mean levels) may
unequally contribute to the calculated distance. Also,
if correlations exist between the observations, a trans-
formation is usually applied to rebuild the data matrix
X into a new dataset. This is because some of the observations that have high correlations with each other contribute more to the distance than the remaining observations that are uncorrelated with each other (Takeuchi et al. 1982); that is, repeated information exists in computing ED. Such repeated information may directly affect final cluster solutions. One way to do the transformation is to use truncated principal component analysis (Stone and Auliciems 1992; Fovell and Fovell 1993). Euclidean distance has been commonly applied in geophysical studies. The range of Euclidean distance is from 0 (for identical vectors of variables) to +∞ (for vectors with no relationships).
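To illustrate, Eq. (2.2) together with the standardization step described above might be sketched as follows (a minimal example of our own, not the authors' code):

```python
import numpy as np

def euclidean_dissimilarity(X):
    """Pairwise Euclidean distances between the n column entities of a
    p x n data matrix X, after standardizing each entity (zero mean,
    unit variance) to remove scale effects."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize each column
    p, n = Z.shape
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            diff = Z[:, i] - Z[:, j]
            D[i, j] = np.sqrt(diff @ diff)      # Eq. (2.2)
    return D
```

Note that standardization can collapse distances between entities that differ only in scale, which is exactly the intended removal of the scale effect.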
2) The theta angle between entities (T)

The theta angle (θ) between entities is another dissimilarity coefficient that measures the "closeness" of the entities (Anderberg 1973). Sausy et al. (1987) employed this type of dissimilarity measure in their CA that classified atmospheric particulates. The angle between entity vectors x_i and x_j is given by

\[ \theta = \arccos\!\left( \frac{\mathbf{x}_i^{\mathrm{T}} \mathbf{x}_j}{|\mathbf{x}_i|\,|\mathbf{x}_j|} \right). \qquad (2.3) \]

The |x_i| and |x_j| terms, which can also be written as \((\sum_{k=1}^{p} x_{ki}^2)^{1/2}\) and \((\sum_{k=1}^{p} x_{kj}^2)^{1/2}\), respectively, are the Euclidean lengths of entities i and j. The range of values for this measure is from 0° (for identical vectors) to 90° (for vectors with no common elements).
3) Inverse correlation coefficient (1C)
Historically, the correlation coefficient has been used
as a measure of linear similarity in certain types of CA.
Because our clustering methodology employs dissim-
ilarity-based input, the correlation coefficient is con-
verted to a distancelike measure by taking the reciprocal
of its absolute value. The IC value is given by

\[ r_{ij}^{-1} = \left[ \frac{\tilde{\mathbf{x}}_i^{\mathrm{T}} \tilde{\mathbf{x}}_j}{(\tilde{\mathbf{x}}_i^{\mathrm{T}} \tilde{\mathbf{x}}_i)^{1/2} (\tilde{\mathbf{x}}_j^{\mathrm{T}} \tilde{\mathbf{x}}_j)^{1/2}} \right]^{-1}, \qquad (2.4) \]

where \(\tilde{\mathbf{x}}\) is the centered vector form \([(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_p - \bar{x})]\).
Although we have found no prior application of r_{ij}^{-1} in previous clustering studies, we believe the reasoning given above makes it a valid dissimilarity measure. The range of r_{ij}^{-1} is from 1 (for identical entities) to +∞ (no similarity). It should be kept in mind that r_{ij}^{-1} tends to exaggerate the distinction between entities when their correlation is close to zero, compared with other dissimilarity measures. Another way to transform the correlation coefficient to a dissimilarity measure would be to convert it to some complementary form to correspond to distances.
However, since the correlation coefficient is generally
a nonmetric function, it does not obey the triangle in-
equality during such a transformation, and it can be
shown that perfect correlations could occur between
nonidentical vectors (Sneath and Sokal 1973). Also,
r_{ij}^{-1} can only be used for a single-sided dataset, as the absolute value of the correlation may eliminate negative correlation effects if the data are in the two-sided category.
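The two correlation-based measures, Eqs. (2.3) and (2.4), can be sketched together (our own function names; the IC form assumes single-sided data, as discussed above):

```python
import numpy as np

def theta_angle(xi, xj):
    """Theta angle (in degrees) between two entity vectors, Eq. (2.3)."""
    cos_t = (xi @ xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))
    return np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))

def inverse_correlation(xi, xj):
    """Inverse absolute correlation between two entities, Eq. (2.4).
    Ranges from 1 (identical entities) to +inf (no similarity)."""
    ci, cj = xi - xi.mean(), xj - xj.mean()      # center each entity
    r = (ci @ cj) / np.sqrt((ci @ ci) * (cj @ cj))
    return 1.0 / abs(r)
```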
d. Seed points
The a priori choice of a partition is a prerequisite
by which nonhierarchical clustering algorithms are in-
itialized. The most often applied method by which the
initial partition is made uses estimates of cluster cen-
troids. These estimates are known as "seed points" and
it is around them that clusters may be formed. Hence,
in certain cluster methods, the researcher is required
to choose some initial values. No widely accepted
"best" method exists, and the choice of these seed
points affects the clustering solutions. Anderberg
(1973) listed eight ways to make the choice of the seed
points, which are somewhat interchangeable with each
other. Since the aim of this paper is to assess the utility
of a large range of methodology, we apply here three
varied approaches, tailored to the spatial analysis of
geophysical data, to specify the seed points. These are
now outlined individually.
1) Sequentially (S): The first k data entities in the
matrix X are sequentially chosen as the seed points of
k clusters. This requires no prior knowledge or input
from the researcher and it is also computationally sim-
ple. Anderberg (1973) felt that the method was one
reasonable approach.
2) Randomly (R): The k seed points are chosen via
pseudorandom numbers. This may remove any biases
imposed by sequential collection when the data matrix
consists of regional distribution elements; however, it
does require more computational resources than the
first method.
3) Gridded (G): The seed points are chosen by sub-
jective knowledge. This is quite appropriate when in-
vestigators have an idea about the data structure and
character. In the present analysis an equal-area grid
was superimposed on the domain (section 4) and seed
points were then chosen at the center of each grid box.
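The first two strategies are simple enough to sketch (a hypothetical helper of our own devising; the gridded option depends on the study-domain geometry and is omitted):

```python
import numpy as np

def seed_points(X, k, method="S", rng=None):
    """Return k seed columns (initial centroid estimates) from a
    p x n data matrix X, chosen sequentially ('S') or randomly ('R')."""
    n = X.shape[1]
    if method == "S":                       # first k entities in matrix order
        idx = np.arange(k)
    elif method == "R":                     # k entities via pseudorandom draw
        rng = np.random.default_rng(rng)
        idx = rng.choice(n, size=k, replace=False)
    else:
        raise ValueError("method must be 'S' or 'R'")
    return X[:, idx]
```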
3. Clustering methodology
a. Hierarchical clustering algorithms
An hierarchical technique is a procedure for creating
a nested sequence of partitions of the patterns from a
dissimilarity matrix. It proceeds by a series of either
successive mergers or successive divisions. Agglomer-
ative procedures begin with a set of individual entities
and then build up groups by a process of accumulation.
The final outcome of these mergers is a single group
of n entities. Divisive procedures work in the opposite
direction, dividing the initial single group into
subgroups. The splitting process is repeated until all
entities have been separated into n groups. Agglom-
erative methods consider all characteristics of the entity
simultaneously, whereas in divisive methods the whole
is split into subgroups in stages, a different criterion
being used at each stage. A disadvantage of divisive
methods is that their computational speed tends to be
slow when the number of the entities is greater than
100 (Mather 1976). Agglomerative methods usually
require much less computing time, though this may
still be considerable for large numbers of entities
(Mather 1976). However, in the present study, which
makes extensive use of both large numbers of entities
and resampling, the computational efficiency was an
important factor.
Agglomerative hierarchical methods are considered
to be the most popular cluster analysis techniques
(Blashfield 1976). Our literature survey (appendix A)
confirmed this and we restrict our hierarchical solutions
solely to agglomerative procedures. Hierarchical
methods require the formation of a dissimilarity ma-
trix, and therefore, start with as many clusters as en-
tities. From the dissimilarity matrix, the most similar
entities are grouped first and then the dissimilarity ma-
trix is recalculated to get the relationship of the new
cluster with the remaining entities. These methods
generate solutions that can be graphically presented as
hierarchical trees or dendrograms (e.g., Alsop 1989,
Figs. 2-4). We now outline the agglomerative hierar-
chical clustering algorithms used here. All hierarchical
methods tested herein yield hard clusters [those that
assign a binary value of 1 (in the cluster) or 0 (out of
the cluster) to each variable] and no overlap between
clusters.
1) Single linkage (SL)
Single linkage is also known by the names "mini-
mum method" (Johnson 1967), "linkage analysis"
(McQuitty 1967), and "nearest neighbor cluster anal-
ysis" (Lance and Williams 1967a,b). In single linkage
analysis, a cluster is defined as a group of entities such
that every member of the cluster is more similar to at
least one member of the same cluster than it is to any
member of another cluster. Let d_{ij} be a distancelike dissimilarity measure between clusters i and j. If these clusters are to be merged, the distance between the new cluster k and any other cluster m will be

\[ d_{km} = \min(d_{im}, d_{jm}). \qquad (3.1) \]
Since single linkage joins clusters by the shortest link
between them, this procedure can have an effect that
distorts the multidimensional space. The initial matrix of D = {d_{ij}} defines a space containing all the entities.
As grouping goes on, the updated D matrix may not
define a space with the original properties. Lance and
Williams (1967a,b) called this a space-conserving or
space-distorting effect. In single linkage, such a distor-
tion effect makes the space adjacent to a new cluster
"contracted," (i.e., the cluster will seem to move closer
to some remaining points). Therefore, single entities
are more likely to join an existing cluster than to act
as the nucleus of a new cluster, and the technique may
not discern poorly separated clusters. This tendency of
single linkage to pick out long stringlike clusters is
known as "chaining" (Sokal and Sneath 1963; Johnson
and Wichern 1982).
2) Complete linkage (CL)
Complete linkage cluster analysis is often known by
the names "maximum method" (Johnson 1967) and
"farthest neighbor cluster analysis" (Lance and Wil-
liams 1967a,b). This method proceeds in much the
same manner as single linkage, with one important
exception. At each stage, the distance between clusters
is determined by the distance between the two entities,
one from each cluster, that are most distant. Thus,
complete linkage ensures that all entities in a cluster
are within some maximum distance of each cluster.
The distance between cluster k, merged from i and j, and another cluster m is computed by

\[ d_{km} = \max(d_{im}, d_{jm}). \qquad (3.2) \]

Here, d_{im} and d_{jm} are the distances between the most distant members of clusters i and m and of clusters j and m, respectively. In contrast to single linkage CA, the
space-distorting effect in this technique may make a
newly formed cluster more distant from remaining
points or clusters in the multidimensional space (i.e.,
as the size of a cluster increases in the process of clus-
tering, the updated distance between this cluster and
some other points or clusters becomes too distant).
Thus, the chance of a cluster obtaining a new member
becomes smaller as the size of the cluster increases.
Hence, complete linkage cluster analysis may some-
times "dilute" the multidimensional space (resulting
in highly separated compact groups).
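Both linkage rules can be realized with the same agglomerative loop, differing only in the min/max update of Eqs. (3.1) and (3.2). A minimal sketch (our own code, not the authors' implementation):

```python
import numpy as np

def linkage_merge(D, linkage="single"):
    """Agglomerate n entities from a symmetric dissimilarity matrix D,
    recording each merge as (i, j, distance). Uses the min (single
    linkage) or max (complete linkage) update at every merger."""
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)             # never merge a cluster with itself
    active = list(range(len(D)))
    combine = np.minimum if linkage == "single" else np.maximum
    merges = []
    while len(active) > 1:
        # find the closest pair among the active clusters
        i, j = min(((a, b) for a in active for b in active if a < b),
                   key=lambda ab: D[ab[0], ab[1]])
        merges.append((i, j, D[i, j]))
        # update distances from the new cluster (kept at index i),
        # per Eq. (3.1) or Eq. (3.2)
        for m in active:
            if m not in (i, j):
                D[i, m] = D[m, i] = combine(D[i, m], D[j, m])
        active.remove(j)
    return merges
```

On a toy configuration of three entities, the two rules diverge exactly as the text describes: the single-linkage cluster "contracts" toward the remaining point, while complete linkage pushes it away.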
3) AVERAGE LINKAGE BETWEEN MERGED
GROUPS (AL1)
This method is commonly known as "average link-
age" (Blashfield 1976) but is sometimes referred to by
the names "group average" (Lance and Williams
1967a,b; Mather 1976) and "unweighted pairwise
group average linkage" (Sneath and Sokal 1973). The
general method has been described and used often in
the literature (Duran and Odell 1974; Johnson and
Wichern 1982; Wolter 1987; Kalkstein et al. 1987;
Fovell and Fovell 1993, among others). This technique
treats the distance between two clusters as the average
distance between all pairs of entities where one member
of a pair belongs to each cluster. The distance between cluster k, merged from clusters i and j, and another cluster m is determined by

\[ d_{km} = \frac{1}{N_k N_m} \sum_{i \in k} \sum_{j \in m} d_{ij}, \qquad (3.3) \]

where d_{ij} is the distance between entity i in cluster k and entity j in cluster m, and N_k and N_m are the numbers of entities in clusters k and m, respectively. In AL1 cluster analysis, every entity in one cluster has a greater average similarity with all members of the same cluster than it does with all members of any other
cluster. As this method evaluates all the possible pairs
in the process of clustering, it has a property known
as "space conserving."
There are a family of three hierarchical techniques
related to, but not identical to, this average linkage
method (Blashfield 1976; Anderberg 1973). They are
"centroid" (Gower 1967), "median" (Lance and Wil-
liams 1967a,b), and "weighted average" (McQuitty
1967). The formula of distance between clusters k and
m for these methods are presented below.
Centroid method:

\[ d_{km} = \frac{N_i}{N_i + N_j}\, d_{im} + \frac{N_j}{N_i + N_j}\, d_{jm} - \frac{N_i N_j}{(N_i + N_j)^2}\, d_{ij}. \qquad (3.4) \]

Median method:

\[ d_{km} = \frac{1}{2}\, d_{im} + \frac{1}{2}\, d_{jm} - \frac{1}{4}\, d_{ij}. \qquad (3.5) \]

Weighted/simple average:

\[ d_{km} = \frac{1}{2}\, d_{im} + \frac{1}{2}\, d_{jm}. \qquad (3.6) \]

The distance Eqs. (3.4), (3.5), and (3.6) can be directly compared with the average linkage between the merged groups equation

\[ d_{km} = \frac{N_i}{N_i + N_j}\, d_{im} + \frac{N_j}{N_i + N_j}\, d_{jm}. \qquad (3.7) \]
In all the four methods above, an entity is joined to
a cluster if it has a certain average degree of similarity
with all current members of the cluster. However, the
average degree of similarity is defined differently among
the four methods. Since these methods are all based
upon a similar mathematical principle and the large
majority of applications apply Eq. (3.7), we will con-
centrate on the average linkage between merged groups
and one different average linkage technique outlined
below.
4) AVERAGE LINKAGE WITHIN THE NEW GROUP
(AL2)
Anderberg (1973) proposed this average clustering
technique that evaluates not only the similarity between
all pairs of entities that belong to two different clusters,
like the average linkage techniques outlined in the
above section, but also the similarity of all pairs of
entities that belong to the same cluster. The basic idea
of this is to characterize a cluster by the average of all
links within it. The sum of pairwise distances between cluster k, merged from clusters i and j, and another cluster m is updated as

\[ d_{km} = d_{im} + d_{jm} \qquad (3.8) \]

to maintain this form for the distance matrix.
Let ds_i and ds_j be the sums of all pairwise distances among entities within clusters i and j, and N_i and N_j be the numbers of entities in clusters i and j. When clusters i and j are merged to become cluster k, the sum of pairwise distances within cluster k is updated:

\[ ds_k = ds_i + ds_j + d_{ij}. \qquad (3.9) \]
There are two related matrices in the process of clustering. When searching for the most similar clusters or pairs, the average distance within the group for the cluster formed by merging cluster k and another cluster m is

\[ \frac{ds_k + ds_m + d_{km}}{(N_k + N_m)(N_k + N_m - 1)/2}. \qquad (3.10) \]
The principal difference between this AL2 method and the AL1 method is that the sums of within-merged-group pairwise distances, ds_k and ds_m, are ignored under AL1 [compare Eq. (3.3) with Eq. (3.10)]. This
method is more computationally intensive than average
linkage between merged groups. Anderberg (1973) ex-
pected these two methods to yield results that were not
radically different. However, we were unable to find
geophysical applications of this method to test Ander-
berg's (1973) hypothesis. Therefore, we present AL2
to introduce a new methodology and to widen the scope
of this work.
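The AL2 criterion amounts to a one-line computation once the within-cluster sums are tracked; a minimal sketch under our own naming, with ds holding the within-cluster pairwise sums:

```python
def al2_average_within(ds_k, ds_m, d_km, n_k, n_m):
    """Average pairwise distance within the cluster that would result
    from merging clusters k and m, per Eq. (3.10). d_km is the sum of
    pairwise distances between the two clusters, Eq. (3.8)."""
    n = n_k + n_m
    return (ds_k + ds_m + d_km) / (n * (n - 1) / 2)
```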
5) WARD'S METHOD (W)
Ward (1963) proposed a very general hierarchical
cluster method known as "Ward's method" or the
"minimum variance method." It was designed to gen-
erate clusters in such a way that mergers at each stage
were chosen so as to minimize the within-group sum
of squares (WGSS). Unlike other hierarchical cluster-
ing techniques, Ward's method optimizes an objective
statistic. At each stage it combines those two clusters
that result in the least increase in the WGSS objective
function. The increase in WGSS when merging clusters k and m is given by

\[ W_{km} = \frac{N_k N_m}{N_k + N_m} (\bar{X}_k - \bar{X}_m)^{\mathrm{T}} (\bar{X}_k - \bar{X}_m), \qquad (3.11) \]

where \(\bar{X}_k\) and \(\bar{X}_m\) denote the centroids of clusters k and
m. We can use the dissimilarity matrix form to rep-
resent Wkm in Eq. (3.11) as the increase in WGSS de-
pends on the Euclidean distance between the centroids
of the two merged clusters. If cluster k is a combination
of clusters i and j, then Eq. (3.11) is given by
N, +Nm J , Nj + Nm J
Okm ~ ~ . <*im "t" ~—:—7T" U,m ~
N„.
Nk + N„
Nk + N„
Nk + N„
d,r
(3.12)
Thus, Ward's algorithm can be implemented
through updating a stored Euclidean distance between
cluster centroids. In Eq. (3.11) Wkm is the increase in
WGSS derived from the Euclidean distance between
the centroids of two clusters. Ward's method can be
quite a versatile technique for cluster analysis, even
though it has been limited to Euclidean distance (An-
derberg 1973). This was our rationale for applying two
other distancelike measures mentioned (theta and in-
verse correlation) to Ward's technique (although Wkm
may no longer denote a one to one increase in WGSS).
Blashfield (1976, p. 385) found that the Ward's method
"clearly obtained the most accurate solutions" among
the four hierarchical methods he tested and recommended it to the researchers who wish to use a hierarchical method. However, Cormack (1971), in his
review work, criticized Ward's method as being biased
toward choosing spherical clusters. This may hold true
for average linkage methods too. Since Ward's method
has been applied frequently to geophysical datasets
(Willmott 1978; French and LeDuc 1983; Winkler
1985; Fernau and Samson 1990a; Bonell and Sumner
1992; Cheng and Wallace 1993) with interpretable re-
sults, we feel it is desirable to test the method under
controlled conditions.
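As an implementation aside, the merge cost of Eq. (3.11) and the stored-distance update of Eq. (3.12) can be sketched in a few lines. The sketch below is illustrative Python/NumPy only; the function names are ours, not from any cited package:

```python
import numpy as np

def ward_increase(nk, nm, ck, cm):
    # Eq. (3.11): increase in WGSS from merging clusters k and m,
    # W_km = Nk*Nm/(Nk+Nm) * ||Xbar_k - Xbar_m||^2.
    diff = np.asarray(ck, float) - np.asarray(cm, float)
    return nk * nm / (nk + nm) * (diff @ diff)

def lance_williams_ward(ni, nj, nm, d_im, d_jm, d_ij):
    # Eq. (3.12): merge cost between the new cluster k = i U j and a
    # third cluster m, updated from the stored pairwise merge costs.
    nk = ni + nj
    return ((ni + nm) * d_im + (nj + nm) * d_jm - nm * d_ij) / (nk + nm)
```

For singleton starting clusters the direct formula and the update route agree exactly, which is a convenient correctness check when implementing Ward's method via stored distances.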
b. Nonhierarchical clustering methods
Unlike the agglomerative hierarchical clustering
methods outlined in section 3a, which give classifica-
tions beginning with clusters having one member each,
the central idea in most nonhierarchical algorithms is
to choose some initial partitions of the data and then
alter cluster memberships so as to obtain a new par-
tition that reveals a natural structure of the input data.
The nonhierarchical methods are designed to cluster
data entities into a single classification of k clusters,
where k is specified a priori or is determined as part of
the clustering procedure. In the present study, two kinds
of the nonhierarchical methods, "k means" and "agglomerative
nucleated centroid sorting," are performed.
The two methods are rooted in the same algorithm,
"nearest centroid sorting." A set of seed points are
computed as the centroids of clusters, and a set of clus-
ters can then be constructed by assigning each entity
to the cluster with nearest centroid. These two processes
are alternately applied until they converge to a stable
configuration. Such a process implicitly minimizes the
variance within each cluster (Anderberg 1973). The
nonhierarchical methods identified in our literature
review (appendix A) are limited to these two techniques,
particularly to the k-means method.
1) k-means method (KM)
The basic algorithm introduced in the present study
is the convergent k-means cluster method, which is a
variant of MacQueen's k-means algorithm (Anderberg
1973). The detailed procedure of the clustering com-
putation is as follows:
Step 1. Specify k seed points as a set of centroids of
k clusters.
Step 2. Compute the Euclidean distances (d_ij) between
the entities and the centroids; then assign each
entity to the cluster having the nearest centroid to attain
an initial partition.
Step 3. Recompute the centroids of the clusters by
the formula
y_{mj} = \frac{1}{N_m}\sum_{i=1}^{N_m} x_{ij} \quad (j = 1, 2, \ldots, p;\; m = 1, 2, \ldots, k),   (3.13)
where Nm represents the number of entities assigned
to cluster m, and ymj are the coordinates of the mth
cluster center on the jth axis (observation).
Step 4. Take each data entity (x_i) in sequence and
compute the distances to all k cluster centroids. If the
nearest centroid (y_m) is not that of the data entity's
(x_i) parent cluster (y_j) and the following condition is
satisfied,

\frac{N_j}{N_j - 1}\sum_{l=1}^{p}(x_{il} - y_{jl})^2 - \frac{N_m}{N_m + 1}\sum_{l=1}^{p}(x_{il} - y_{ml})^2 > 0,   (3.14)

then reassign the data entity to the nearest cluster (y_m)
and update the centroids of the clusters that have lost
or gained variables.
Step 5. Repeat step 4 until stability or convergence
is achieved; that is, continue until a full cycle through
the dataset fails to cause any change in cluster mem-
berships.
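Steps 1-5 can be sketched as follows. This is a minimal illustrative Python/NumPy implementation of convergent nearest centroid sorting (the function name is ours), with the transfer test of Eq. (3.14) applied one entity at a time:

```python
import numpy as np

def convergent_kmeans(X, seeds, max_cycles=100):
    """Convergent k-means: X is (n, p) data, seeds is (k, p) centroids."""
    X = np.asarray(X, float)
    cent = np.asarray(seeds, float).copy()
    k = len(cent)
    # Steps 1-2: initial partition by nearest centroid.
    labels = np.argmin(((X[:, None, :] - cent[None]) ** 2).sum(-1), axis=1)
    # Step 3: recompute centroids, Eq. (3.13).
    counts = np.bincount(labels, minlength=k)
    for m in range(k):
        if counts[m]:
            cent[m] = X[labels == m].mean(axis=0)
    # Steps 4-5: cycle until a full pass makes no membership change.
    for _ in range(max_cycles):
        changed = False
        for i, x in enumerate(X):
            j = labels[i]
            if counts[j] <= 1:
                continue  # never empty a cluster
            d2 = ((cent - x) ** 2).sum(axis=1)
            # Eq. (3.14): WGSS released by leaving j vs. cost of joining m.
            gain = counts[j] / (counts[j] - 1) * d2[j]
            cost = counts / (counts + 1) * d2
            cost[j] = np.inf
            m = int(np.argmin(cost))
            if gain - cost[m] > 0:
                # Incremental centroid updates for the two affected clusters.
                cent[j] = (cent[j] * counts[j] - x) / (counts[j] - 1)
                cent[m] = (cent[m] * counts[m] + x) / (counts[m] + 1)
                counts[j] -= 1
                counts[m] += 1
                labels[i] = m
                changed = True
        if not changed:
            break
    return labels, cent
```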
2) Nucleated agglomerative method (NA)
Based on the same basic principle as nearest centroid
sorting, this method performs an agglomerative procedure
that generates clusters over a range from kmax,
the number of initial clusters, down to a predetermined
cluster number kmin (Mather 1976). This nucleated
technique provides a convenient way to examine
a range of groupings from nearest centroid sorting,
as the hierarchical methods do. Key and Crane (1986)
and Ronberg and Wang (1987) applied this technique
to geophysical data. The nucleated method operates by
the following procedure:
Step 1. Specify the maximum and minimum numbers
of clusters, kmax and kmin, and choose kmax seed points
as a set of centroids of the kmax initial clusters.
Step 2. Assign each data entity to its proper cluster
as in step 2 of the k-means method and recompute the
centroids of clusters by formula (3.13).
Step 3. Attain kmax clusters in the same manner as
the k-means method (steps 4 and 5 of the section
above).
Step 4. The number of clusters k is reduced by 1
from kmax. The pair of clusters to be merged is found
by locating the combination of two clusters that min-
imizes the increase in the squared deviations of the
entities from their cluster centroids, which is defined
as
\frac{N_i N_j}{N_i + N_j}\sum_{l=1}^{p}(y_{il} - y_{jl})^2 \quad (i = 1, \ldots, k - 1;\; j = i + 1, \ldots, k).   (3.15)
The centroid of the resulting cluster is calculated by
the procedure in step 2.
Step 5. Repeat step 4 until k = kmin.
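The agglomerative stage (steps 4 and 5) can be sketched as follows, starting from any kmax-cluster label vector (e.g., the k-means output). The Python/NumPy sketch and its names are ours; the merge criterion is the Ward-type cost of Eq. (3.15):

```python
import numpy as np

def nucleated_agglomerative(X, labels, k_min):
    """Merge the pair of clusters least increasing the squared deviations
    (Eq. 3.15), repeatedly, from the current count down to k_min."""
    X = np.asarray(X, float)
    labels = np.asarray(labels).copy()
    solutions = {}
    while True:
        ids = np.unique(labels)
        solutions[len(ids)] = labels.copy()  # record each level's partition
        if len(ids) <= k_min:
            break
        cents = {c: X[labels == c].mean(axis=0) for c in ids}
        ns = {c: (labels == c).sum() for c in ids}
        best, pair = np.inf, None
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                i, j = ids[a], ids[b]
                diff = cents[i] - cents[j]
                # Eq. (3.15): increase in squared deviations if i and j merge.
                w = ns[i] * ns[j] / (ns[i] + ns[j]) * (diff @ diff)
                if w < best:
                    best, pair = w, (i, j)
        labels[labels == pair[1]] = pair[0]  # merge, then recompute next pass
    return solutions  # maps cluster count -> label vector
```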
c. Rotated principal component method
Everitt (1974, p. 37) stated "a method of cluster
analysis which has been widely used, especially in the
field of psychology and the behavioral sciences, is the
'inverse' or 'Q' factor analysis. . . . The usual methods
of factor analysis, including rotation are applied to this
matrix of correlations. Individuals are then assigned to
groups on the basis of their factor loadings. . . ."
Richman (1986) illustrated how rotation can be ap-
plied to principal component (PC) analysis as validly
as for factor analysis. It is the simple structure rotation
stage of the analysis that separates a purely statistical
algorithm, seeking to maximize variance, from one that
seeks to find highly related clusters of correlation vec-
tors in n space. Thurstone (1947, pp. 319-346) exten-
sively discussed the principles of simple structure,
which he considered "fundamental in making factor
analysis a scientific method rather than merely a
method of statistical condensation." Thurstone (1947)
claimed that examination of the constellations of correlation
vectors is important. He stated that the configurations
of such vectors typically consist of groupings
of correlations in n space; the angular separations
within the groups are relatively small, while
the separations between constellations are marked.
Simple structure rotation attempts to isolate constel-
lations from a large number of variables. The simple
structure, as Thurstone envisioned it, attempted to
achieve two goals: (i) create a reference frame where
the eigenvectors speared clusters of highly related vari-
ables, and (ii) create a solution where the number of
near-zero loadings (in the hyperplane) was maximized.
Thus, goal (i) is concerned with finding which
variables lie "in the cluster." If the reference frame of
PCs has a matrix that orthogonally transforms the un-
rotated PC solution to simple structure, this is known
as an orthogonal rotation (PCOR). A simple structure
solution which does not transform the unrotated so-
lution under the orthogonality constraint is known as
an oblique rotation (PCOB). We see virtually identical
goals between simple structure rotation of PCs and tra-
ditional methods of cluster analysis [i.e., see introduc-
tion for definitions of CA by Punj and Stewart (1983)
and Cormack (1971)]. Since these goals are essentially
indistinguishable and Everitt claims rotated factor
analysis is a type of cluster analysis, we chose to com-
pare PCOR and PCOB with other forms of cluster
analysis. We do so since we consider PCOR and PCOB
to be cluster analysis.
The RPC (rotated principal component) algorithm
was processed in the following steps.
Step 1. The data matrix was intercorrelated by calculating
the correlations between all variables or entities (R).
Step 2. All eigenvectors (V) corresponding to non-
zero eigenvalues (D) were drawn (662) to explain 100%
of the total variance.
Step 3. PC loadings were calculated, A = VD^{1/2}.
Step 4. The matrix A was rotated to simple structure
B = AT^{-1} under the constraint that T was orthogonal
(Varimax rotation) and without the orthogonality
constraint on T (Harris-Kaiser B'B rotation) (Richman
1986).
Step 5. Elements of each column vector of B (each
PC) were analyzed after hardening [see section 5a(3)
for details on hardening].
Therefore, all hierarchical, nonhierarchical, and
RPC clustering were hard clusters. The difference be-
tween the RPC solutions and the other types of CA
was that RPC solutions had overlapping solutions (i.e.,
some entities did not have to be entered into any cluster,
some entities could be entered into more than one
cluster).
4. Data
a. Dataset
We created a new dataset for use in the research on
which this paper is based. It contains the individual
daily precipitation totals for 766 locations in the eastern
two-thirds of the United States and southern Canada
for the May-August periods of 1949-87, which total
more than 3.67 million observations. The gridlike sta-
tion network is displayed in Fig. 1. It is bounded to
the west and east by the Rocky Mountains and Atlantic
Ocean, to the south by the Gulf of Mexico, and to the
north by the limit of agriculture and/or the availability
of precipitation stations with at least moderate density.
Full details of the methodology behind the con-
struction and characteristics of much of this dataset
were presented in Richman and Lamb (1985, 1987);
hence, only a brief overview to explain the dataset
expansion is given here.

Fig. 1. Station network (766 locations) for central and eastern
North American precipitation for May-August periods of 1949-87.

The first step in the development
of the dataset was the acquisition of all daily
precipitation totals archived by the U.S. National Cli-
matic Data Center and the Canadian Climate Centre
for the study region and period. This involved 11 634
stations, 98% of which were cooperative stations. After
identifying the stations with the longest records and
carefully scrutinizing and quality controlling these data,
it was concluded that a network resolution of 80-110
km was the finest practicable over most of the domain
from the standpoints of spatial and temporal unifor-
mity and dataset completeness. A modified 1 ° lat-Iong
"target" grid (the meridional separation north of 49°N
was held constant at the 49°N value) was therefore
established, and for each intersection the nearest station
with the most complete and seemingly most reliable
record was identified. These stations were located, on
average, 10 km (U.S.) and 24 km (Canada) from the
grid intersection concerned, and were chosen to form
the present dataset (Fig. 1). For the days these stations
did not possess observations (15.1% of the total), the
value from one of several adjacent "alternate" stations
was substituted without adjustment. The alternate sta-
tions were usually situated within 30 km of their parent
network station and were selected because their records,
although often incomplete, covered the gaps in the rec-
ords of the network station. The resulting dataset thus
contains actual daily precipitation totals for a regularly
spaced array of 766 stations with no missing observa-
tions. For this application we use 633 7-day rainfall
totals.
Richman and Lamb ( 1985, 1987, 1988, 1989),
Richman et al. (1991), and Richman and Gong (1993)
have reported the major early results of a comprehen-
sive set of analyses of this dataset that remains ongoing.
In general, these studies have applied a range of eigen-
vectorial techniques (unrotated and rotated principal
components analysis, common factor analysis, and
procrustes target analysis) and correlation analysis to
various temporal amalgamations of the data (station
totals for 3-, 7-, 15-, 30-, 41-, and 61-day periods).
Detailed information on the statistical methods em-
ployed appears in the aforementioned papers, Richman
(1986, 1987, 1993) and Richman and Easterling
(1988). A key thrust of the work to date has been the
use of those methods to delineate areas of spatial pre-
cipitation coherence.
b. Data preprocessing
In this application, we use 7-day rainfall totals for
the period 1 May-27 August for the 1949-87 period.
The mean and variance changes from one adjacent
location to another are relatively small because of the
short seasonal window and relatively smooth topog-
raphy in the domain. Therefore, the coherent scales
are relatively small with no remote teleconnection.
The raw data matrix X contains 766 entities and 663
observations (663 X 766). This matrix is first square
root transformed to reduce skewness (Richman and
Lamb 1985, p. 1328). A principal component analysis
(PCA) is performed based on the correlation matrix
R (766 X 766) to obtain the unrotated PC loadings
matrix A (662 X 766). The calculations of ED are based
on this new matrix where variables are now grid points
and observations are PCs. All rotated principal com-
ponent analysis calculations in this study also start with
this PC loading matrix A. The other two dissimilarity
measures, IC and theta angle, are computed from the
matrix X.
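The preprocessing chain can be sketched as follows (illustrative Python/NumPy; the function name is ours). Because all PCs with nonzero eigenvalues are retained, the squared ED between two rows of the loading matrix reduces to 2(1 − r_ij) for correlation-based loadings:

```python
import numpy as np

def ed_from_loadings(X):
    """Square-root transform, correlation-based PCA, then Euclidean
    distances between entities computed from the full loading matrix
    A = V D^(1/2) (rows = grid points, columns = PCs)."""
    Z = np.sqrt(np.asarray(X, float))           # reduce skewness
    R = np.corrcoef(Z, rowvar=False)            # entities intercorrelated
    evals, evecs = np.linalg.eigh(R)
    keep = evals > 1e-10                        # all nonzero eigenvalues
    A = evecs[:, keep] * np.sqrt(evals[keep])   # full loading matrix
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))         # ED between loading rows
```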
If all the PCs with nonzero eigenvalues are retained
in the ED computations as we do in this work, the
resultant ED matrix contains the same variance infor-
mation as that computed from the standardized raw
data matrix. In applied research, truncated PCs are
usually used in the calculations of ED to help eliminate
noise in the raw data. We keep all PCs with
nonzero eigenvalues in this study because we want to
retain all the information in the data in order to make
as fair a comparison among ED-, IC-, and theta-based
CAs as possible. This leads to a potential information
problem as mentioned in section 2c(l), as our ED-
based CA solutions were computed from the full PCs
with nonzero eigenvalues. This may influence the CA
solutions to a certain (unknown) extent. Moreover,
this problem may exist when any substantial portion
of PCs are retained. At present there is no known
method to precisely identify the extent to which this
potential problem may impact CA solutions. However,
this study concentrated on the application and evalu-
ation of various CA techniques. The performances of
all the CA algorithms contained the same amount of
repeated information. Hence, we feel that such a po-
tential information problem does not influence our in-
tercomparisons substantially due to the nature of our
experimental design.
5. Results
Results will be presented for the full dataset first and
then for the various Monte Carlo resamplings. While
we have not applied every possible CA algorithm, we
have chosen a broad range of different ones that have
been applied previously in climate research. These have
been chosen to span most of the key concepts behind
traditional CA and enhance utility of our findings.
Several technical issues and an explanation and de-
scription of our graphical devices are discussed below.
a. Descriptions of several methodological points
1) Patterns retained
To objectively evaluate the aforementioned methods,
we needed to keep as many parameters fixed
as possible so that we could isolate the effects of various
CA algorithms. One decision that must be made con-
cerns the number of clusters to be retained for each
method. In this paper we intentionally sidestepped this
issue as there is no universally accepted objective technique
by which to accomplish this (Everitt 1979). The
seminal work on this topic, by Milligan and Cooper
(1985), tested 30 procedures for determination of the
proper number of clusters, with results pointing to sev-
eral indices with minimum bias. However, without ex-
tensive testing on our dataset, we decided against ac-
cepting Milligan and Cooper's (1985) findings as their
data were well separated with clearly defined, non-
overlapping clusters with a very small "correct" num-
ber of clusters (2-5). In contrast, we know that the
precipitation data have overlapping clusters (from the
correlation fields) and a large dimensionality (approx-
imately 16) from the previous work of Richman and
Lamb (1985, 1987). Moreover, Milligan and Cooper
(1985, pp. 162, 167, 175) repeatedly warn readers that
their design, using synthetic simplified data with no
error, may not hold for "real life research." They issue
the caveat (p. 162), "it should be recognized that the
ordering must not be construed as valid for all possible
data structures . . ." and (p. 175) ". . . the findings
are likely to be somewhat data dependent. It would
not surprise the present authors to find that the ordering
of the indices would change if different data structures
were used." In this work, when the number of clusters
was less than 15, aggregation error (which we define
as the inappropriate placement of variables that have
little in common into a single cluster) was evident in
over half of the CA algorithms, suggesting more than
this number of clusters was probably required for a
robust analysis of precipitation over our domain.
For the RPCs, difficulty in determining the correct
number of dimensions is analogous to the "number of
clusters" problem. After a survey of the state of the
science for separating signal from noise in eigenanalysis,
Richman et al. (1992) concluded none of the methods
surveyed was free of bias.
Due to the uncertainties of objectively determining
the correct number of clusters for the precipitation data,
we relied on a potential range of statistically significant
7-day growing season rainfall patterns. The previous
related studies indicated that retaining 8-25 patterns
would be a more than sufficient range to reflect the
rainfall variability without losing any important syn-
optic-scale signal (Richman and Lamb 1985, 1987).
The optimal number of patterns to retain may be
somewhat different; therefore, a range of 18 cluster
solutions (8-25) is used in the discussion
of this work. All analyses were
repeated 18 times, stepping the number of clusters up
by one each analysis until statistics on the full range
were collected.
Hierarchical methods start with 766 clusters, each
cluster containing one single rainfall station. The most
similar two clusters are then merged at each level. All
stations will be in one cluster if the process is left to
continue to a full hierarchy. In this work, we let the
solutions range from 25 to 8 clusters. For the k-means
method, 8-25 seed points were chosen for clustering
separately; hence, 18 solutions of k-means clustering
were obtained. For the nucleated method, 25 seed
points were first chosen, and kmax and kmin were 25
and 8, respectively. This process of agglomerative clustering
yields solutions from 25 to 8 clusters. As men-
tioned in section 2, each clustering algorithm was car-
ried out in three different dissimilarity measures
(hierarchical methods) or three initial partitions
(nonhierarchical methods). For the two types of ro-
tated PCs, the number of components was varied from
8 to 25. In total, the nine clustering algorithms yield
23 clustering methods, each of which contained 18 dif-
ferent clustering solutions from 8 to 25 clusters. This
was a total of 414 analyses.
2) Boxplots
A graphical display method, known as the boxplot,
is used to summarize the large number of matching
values generated in this work. The technique was orig-
inated by Tukey (1977), who called it a "schematic
plot;" however, it is now widely known as a boxplot.
Benjamin (1988) describes the development of boxplot
technology and illustrates its inherent usefulness for
resampling research. The boxplot is a rectangular dis-
play with the top and bottom drawn at the upper and
lower quartiles of the data. The box is cut by a hori-
zontal line at the median. Above and below the rec-
tangular box are "steps" or values set at 1.5 times the
interquartile range. The steps have a vertical line ex-
tending from the middle of the top of the box to the
largest observation within a step from the top. Simi-
larly, there is a lower step. Observations that lie outside
the step are individually displayed as asterisks. We
adopt this "standard" definition of a boxplot and ac-
knowledge that there are variations that define the steps
and the display of outliers differently (Hoaglin et al.
1986).
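The boxplot statistics described above can be computed directly; a small Python/NumPy sketch (the function name is ours) of the median, quartile box, 1.5-times-interquartile-range steps, whiskers, and outliers:

```python
import numpy as np

def boxplot_stats(x):
    """Quartile box cut by the median, 'steps' at 1.5 x IQR beyond the
    quartiles, whiskers at the most extreme values within a step, and
    individually flagged outliers beyond the steps."""
    x = np.sort(np.asarray(x, float))
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    step = 1.5 * (q3 - q1)
    lo_whisk = x[x >= q1 - step].min()
    hi_whisk = x[x <= q3 + step].max()
    outliers = x[(x < q1 - step) | (x > q3 + step)]
    return {"median": med, "q1": q1, "q3": q3,
            "whiskers": (lo_whisk, hi_whisk), "outliers": outliers}
```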
The boxplot was chosen for this work as it has five
properties that make it attractive for presenting large
summaries of data in a compact manner. These are 1)
visual information is available on the data location
(cutline), spread (box length, step length), skewness
(distance between the end of the step and the end of
the box, and between the median and the end of the
box), and the longtailedness (marked outliers); 2) the
display gives details about outliers in the tails; 3) dis-
tributions between numerous boxes can be compared;
4) computational efficiency; and 5) ease of interpre-
tation.
3) Matching between CA/RPC patterns and their corresponding correlation fields
Evaluation of the clustering solutions employed in
this work is performed by matching (or comparing)
[Fig. 2 panel title: PC without Gradient vs. PC with Gradient]
Fig. 2. Box and whisker plots of the correlation of the PC loadings with the corresponding
interstation correlation field for the highest PC loadings. PCs with gradients intact (PCG) are
compared with binarized PCs for Varimax orthogonal (OR) and Harris-Kaiser oblique (OB) ro-
tations. Boxes represent the median and interquartile range for each algorithm for 18 solutions
with varied numbers of clusters from 8 to 25. Whiskers represent 1.5 times the interquartile range
and any outliers.
the spatial pattern of any CA solution against its corresponding
interstation correlation field. Experimen-
tation with two matching coefficients (see appendix C)
led us to this particular coefficient. The correlation
coefficient is used to match the two fields, since it is a
common measure to compare the "closeness" between
two vectors. Selecting a pattern's corresponding interstation
correlation field (or teleconnection pattern)
follows this procedure: 1) The spatial pattern of each
cluster is obtained by setting a value of 1.0 for the sta-
tions with the membership inside the cluster, and of
0.0 for the stations outside that cluster. Thus, the spatial
pattern field for any cluster is in a binary form. 2) The
geometric center of a specific clustering pattern is iden-
tified. 3) The vector of 7-day square-root-transformed
rainfall totals of this center (station) is correlated with
those vectors at the other 765 stations to get the inter-
station correlation pattern for the specific cluster. Since
the rainfall interstation correlations are continuous and
the cluster fields are binary, the form of matching cor-
relation is known as the biserial correlation. Milligan
(1980) successfully used this type of correlation to
match in his synthetic data analysis. For an eight-cluster
solution, eight correlation-matching coefficients will be
obtained and the average of the eight coefficients is
used to represent the results of eight-cluster matching.
So, for each algorithm, if 8-25 clustering patterns are
retained, we have 18 resultant matching coefficients
from which to generate statistics. These are summa-
rized for the overall mean and median matches. The
method by which the results are presented makes ex-
tensive use of the boxplot.
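The matching procedure can be sketched as follows in Python/NumPy (names are ours). One loud assumption: since station coordinates are not carried here, the geometric center of a cluster is approximated by the member station closest to the cluster-mean rainfall series, not the spatial centroid used in the paper:

```python
import numpy as np

def match_cluster(rain, members):
    """Binarize a cluster (1 inside, 0 outside), correlate the center
    station's series with all stations, then take the point-biserial
    correlation between the binary pattern and the correlation field."""
    rain = np.asarray(rain, float)   # (weeks, stations), sqrt-transformed
    binary = np.zeros(rain.shape[1])
    binary[list(members)] = 1.0
    # ASSUMPTION: stand-in for the geometric center station.
    mean_series = rain[:, list(members)].mean(axis=1)
    center = min(members,
                 key=lambda s: ((rain[:, s] - mean_series) ** 2).sum())
    corr_field = np.array([np.corrcoef(rain[:, center], rain[:, s])[0, 1]
                           for s in range(rain.shape[1])])
    # Pearson correlation with a binary vector = point-biserial correlation.
    return np.corrcoef(binary, corr_field)[0, 1]
```

For a k-cluster solution, averaging this coefficient over the k clusters gives the single matching value summarized in the boxplots.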
The RPCs were matched to their corresponding cor-
relation pattern in the same manner as described for
CA, with one important exception. Since RPC loadings
were not in a binary form, they were binarized or hard-
ened to facilitate a more meaningful intercomparison
to the CA methods. This involved selecting a "cutoff
loading" to define the loadings to be set to 1.0 and
those set at 0.0. Richman and Lamb (1985, 1987) used
the +0.4 loading isopleth to achieve their regionalizations,
balancing between including as many stations
as possible in at least one region and minimizing
overlap. Experimentation with this cutoff loading re-
vealed highest biserial correlations from 0.2 to 0.4 (de-
pending upon the sample size) (Richman and Gong
1993). We chose the highest match as the cutoff. Removing
the gradients from the RPCs (hardening)
had a negative effect on their accuracy, as quantitatively
measured in the boxplots of Fig. 2, where the
unaltered fuzzy RPC loadings matched their correla-
tions (as described in Richman and Lamb 1985) at a
value of between 0.85 and 0.90. The hardened RPCs
biserial correlations dropped to approximately 0.75.
Additionally, an experiment, using a biserial Eu-
clidean distance matching the clustering patterns
against their corresponding "interstation Euclidean
distance field," was performed in this study to gain
more insight into the various clustering algorithms. The
procedures achieving such matching and the results
are summarized in appendix C. We undertook such
an investigation as we could find no applications in
the literature. The results of appendix C show disagreements
with the biserial correlation matching results.
4) Bootstrap resampling
The recently developed statistical method known as
the "bootstrap" can be used to investigate the reliability
of CA methods. A detailed statistical description of the
bootstrap is given by Wu (1986). The basic idea of the
bootstrap involves inferring the variability in an un-
known distribution from which one's data is drawn by
resampling from the data. It suggests that users resam-
ple their data to construct a series of fictional sets of
data. Each of these is constructed by randomly sam-
pling m observations (m ≤ the full sample size of the
dataset) with replacement. It is quite likely that, in their
resampling process, some of the original data obser-
vations are represented more than once and others are
omitted (Felsenstein 1985). This is considered an at-
tractive aspect of the bootstrap.
In our research the resampling process is done by
drawing four sets of sample sizes (50, 100, 250, and
500 observations) in order to examine both the sen-
sitivity and reliability of CA and RPC methods to the
number of observations. We performed 10 replications
for each of the four sample sizes [i.e., resampling of
bootstrap procedure was done 40 times so as to con-
struct 40 matrices in the form of Eq. (2.1)]. These 40
matrices were subjected to CA and PCA, the results of
which are used to evaluate the various methods.
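The resampling design can be sketched as follows (illustrative Python/NumPy; the function name is ours): ten replications at each of the four sample sizes, each drawn with replacement over the observations (rows):

```python
import numpy as np

def bootstrap_samples(X, sizes=(50, 100, 250, 500), replications=10, seed=0):
    """For each sample size m, draw `replications` fictional datasets of
    m observations, sampled with replacement from the full data matrix."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    out = {}
    for m in sizes:
        # Row indices drawn with replacement; duplicates and omissions
        # of original observations are expected and intended.
        out[m] = [X[rng.integers(0, len(X), size=m)]
                  for _ in range(replications)]
    return out
```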
5) Information consistency problem in comparisons of CA with RPC patterns
The primary goal of incorporating RPC clustering into
the intercomparison of CA algorithms in this study is
to show that RPC is an alternative to traditional
CA methods in achieving homogeneous groups by
simple structure rotations. However, there exist two
confounding factors in direct comparison between tra-
ditional CA and RPC. One is that RPC generates fuzzy
solutions although we have binarized the solutions. The
other problem involves possible inconsistent input
variance information to these two clustering algo-
rithms. RPC uses the variance information given by
PC truncation at a selected root number to obtain clus-
ter solutions. This implies that noise in original data
is excluded if the dimensionality reduction stage of PCA
is successful (Richman et al. 1992). Conversely, tra-
ditional CA includes the full original raw variance in-
formation or uses the fixed number truncated PCs as
the input data to compute the dissimilarity matrix. Such
differences in algorithm operations may complicate the
comparison between these two types of clustering tech-
niques.
In this study, the effects of variance differences were
investigated. The first 8-25 unrotated PCs were used
separately as input data to compute ED matrices.
Ward's and AL1 methods were applied to these ED
matrices to obtain a cluster solution that corresponds
to the dimensionality of the number of PCs retained
and used in ED computation. That is, an 8-cluster so-
lution was obtained when the input ED matrix involved
the first 8 PCs and so forth. This experimental design
ensured that the same variance information entered
into both the CA and RPC regionalizations. The re-
sultant CA patterns exhibited different regionalizations
from those using the full original data variance, since
CA algorithms are sensitive to the input dissimilarity
matrix (this will be demonstrated in the next section).
However, point biserial correlation-matching accuracy
levels of these tested regionalizations were approxi-
mately at the same levels as their counterparts using
the full original data information. For instance, the 16-
cluster solution using the first 16 PCs for Ward's and
AL1 methods both had a 0.66 biserial correlation
match, whereas the 16-cluster solution with full data
information from Ward's and AL1 methods yielded
0.68 and 0.67 in matching. Therefore, we consider that
the differential in variance information is not a sub-
stantial factor in the matching accuracy between tra-
ditional CA and RPC in the present study. However,
investigators should be cognizant that direct compar-
ison of traditional CA with RPC clustering may not
be straightforward in other studies unless this is inves-
tigated.
b. Results of the full analysis for CA and RPC
1) Mean matches
Analysis of the May-August 1949-87 7-day precip-
itation totals is presented in this section. The mean
values of the 18 analyses (number of clusters varied
from 8 to 25) were summarized in the boxplots of Fig.
3. Single linkage techniques (SL) had the worst matches
of any techniques. The biserial correlations were all
less than 0.4, indicating little correspondence to the
input correlations. Examples of one corresponding so-
lution show pathological chaining. The complete link-
age (CL) techniques were all well above the accuracy
of SL. The inverse correlation-based CL patterns
(CLIC) had some aggregation errors for the 8-12 clus-
ter solutions (Table 1). An example of this can be seen
in Fig. 4a for the 12-cluster CLIC analysis, where cluster
11 is split between the northern Great Plains and the
southeastern United States. Not only is the climatology
of precipitation widely different in these two re-
gions, but Richman and Lamb (1985, p. 1338) list the
interstation correlation patterns (their Figs. 8 and 9)
for this region, which depict an elliptical pattern of
isocorrelations centered in only one of the regions iso-
lated by cluster 11. Taken collectively, this information
strongly counters any CA regionalization that aggre-
[Fig. 3 panel title: Full Data Set]
Fig. 3. Box and whisker plots of the biserial correlation between cluster algorithm regional-
izations/PC regionalizations and the corresponding interstation correlation fields based at the
geometric centroid of each spatial pattern for the full observation period (663 weeks). Boxes
represent the median and interquartile range for each algorithm for 18 solutions that varied the
number of clusters from 8 to 25. Whiskers represent 1.5 times the interquartile range and any
outliers.
gates a coherent area of precipitation into two or more
subareas for our data. The Euclidean distance dissim-
ilarity coefficient (CLED) solutions had a similar me-
dian to CLIC but with a smaller range. For CLED,
aggregation error for the full dataset is also apparent
(Table 1). The results of the aggregation are shown in
Fig. 4b for the 14-cluster solution. Here, western Texas,
eastern New Mexico, and southeastern Colorado are
grouped with the markedly different climate of the
southeastern United States. Such an aggregation would
seem counterintuitive, as the aforementioned Richman
and Lamb (1985) interstation correlations show no
relationship between west Texas and the southern states
along the Gulf Coast. Apparently, more than 14 clusters
are needed to fully resolve the precipitation variability,
as from 15 clusters onward the aggregation ceases.
Conversely, the theta angle-based patterns (CLT) had
a distribution with no overlap with other CL techniques
due to chaining. Both average linkage (AL1IC and
AL1ED) methods had similar medians and ranges. The
AL1T boxplot is not listed since no coherent groupings
of stations were found for this method, which prevented
any quantitative matching or regionalization. AL2IC
and AL2ED had similar medians and ranges to each
other and to the AL1 counterparts. AL2T had a dis-
tribution not overlapping the other AL2 methods.
Chaining degraded all of the AL2T analyses. Ward's
techniques (W) all proved to be robust, with WIC hav-
ing the smallest range, although there were some minor
aggregating errors for 8- and 9-cluster solutions (Table
1). This can be seen in Fig. 4c, where the 9-cluster
WIC solution has two cases of aggregation on the same
map. Cluster 6 attempts to pull together Florida with
the central Midwest. The isocorrelations presented in
Richman and Lamb (1985, p. 1338) indicate that there
is no significant (physically or statistically) relationship
between the central Midwest and any stations south of
northern Mississippi, Alabama, and Georgia. The sec-
Table 1. Aggregation statistics. Columns give the clustering methods (SLIC, SLED, SLT, CLIC, CLED, CLT, AL1IC, AL1ED, AL2IC, AL2ED, AL2T, WIC, WED, WT, KMS, KMR, KMG, NAS, NAR, NAG); rows give the full dataset and the bootstrap sample sizes 50, 100, 250, and 500. [Cell values are not legible in this reproduction.]
-------
910
JOURNAL OF CLIMATE
Volume 8
Fig. 4. Spatial regionalization examples of May-August precipi-
tation given by cluster analysis algorithms that suffered "aggregation
errors" (see text and Table 1). Panel (a) represents CLIC for the 12
clusters retained. Panel (b) represents CLED for the 14 clusters re-
tained. Panel (c) represents WIC for the 9 clusters retained.
ond aggregation, for cluster 8, is evident in the north-
western corner of the domain (Saskatchewan and Al-
berta) with Wisconsin and much of Minnesota. Com-
parison of this cluster to the isocorrelations in Richman
and Lamb (1987, p. 151) suggests that the relationship
between the Canadian part of cluster 8 and areas in
the United States is unlikely, as the correlation pattern
exhibits a single maximum with a tight gradient. Sim-
ilar results are presented in Richman and Lamb (1985)
for Wisconsin or Minnesota. Again, these results sug-
gest that the aggregation is a pathological manifestation
of too few clusters for this dataset. Notably, the clustering
algorithms using inverse correlation (CLIC, AL1IC,
AL2IC, WIC) performed relatively poorly owing to
aggregation error.
The nonhierarchical methods tended to have slightly
higher medians than the hierarchical ones. The k-
means methods (KM) showed notable variability as
different partition methods were considered. The se-
quential partition (KMS) had the lowest biserial cor-
relations as the seed points affected the final position
of the clusters. Specifically, the sequential partition
method seeded the first 8-25 clusters all in the southern
states (Florida and Texas), and the resulting KMS
clusters had many small regions in the south. For the
random (R) and gridded (G) seeding methods, the
KM results were superior to KMS (Fig. 3) and more
consistent spatially. The nucleated agglomerative
methods (NA) had similar medians and ranges for all
partitioning techniques used. The lack of seed point
dependence is a consequence of the algorithm always
starting with 25 seed points and sorting these down to
the number of clusters retained.
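The seed point sensitivity described above can be illustrated with a minimal k-means sketch. Plain Lloyd's iteration is used on invented 2-D station coordinates; the `kmeans` function and the three seeding choices are illustrative stand-ins, not the implementation used in this study:

```python
import random

def kmeans(points, seeds, iters=50):
    """Plain Lloyd's k-means on 2-D points from the given seed points."""
    centers = [list(s) for s in seeds]
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest center (squared Euclidean distance)
        for i, p in enumerate(points):
            labels[i] = min(range(len(centers)),
                            key=lambda c: (p[0] - centers[c][0]) ** 2
                                        + (p[1] - centers[c][1]) ** 2)
        # update step: move each center to the mean of its members
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members)]
    return labels

random.seed(0)
stations = [(random.random() * 10, random.random() * 10) for _ in range(100)]

k = 4
seq_seeds = stations[:k]                # "sequential": the first k stations
rand_seeds = random.sample(stations, k) # random seeding
grid_seeds = [(2.5, 2.5), (2.5, 7.5), (7.5, 2.5), (7.5, 7.5)]  # gridded seeding

for name, seeds in [("KMS", seq_seeds), ("KMR", rand_seeds), ("KMG", grid_seeds)]:
    labels = kmeans(stations, seeds)
    sizes = sorted(labels.count(c) for c in range(k))
    print(name, sizes)
```

With sequential seeding the initial centers sit in one corner of the domain, which is the analog of the KMS pathology noted above: the final clusters depend on where the seeds began.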
The orthogonally and obliquely rotated PC
(PCOR, PCOB) boxplots (Fig. 3) had similar medians
and ranges. The medians had no overlap with any of
the traditional CA methods suggesting that, even after
binarization, rotated PC better matched the input cor-
relations.
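The matching statistic used throughout, a point biserial correlation between a binary regionalization pattern and a continuous interstation correlation field, can be sketched as follows. The function and the toy data are illustrative only, not the authors' code:

```python
import math

def point_biserial(membership, field):
    """Point biserial correlation between a binary cluster-membership
    vector (1 = station in cluster, 0 = not) and a continuous field
    (e.g., interstation correlations based at the cluster centroid)."""
    n = len(membership)
    in_grp = [f for m, f in zip(membership, field) if m == 1]
    out_grp = [f for m, f in zip(membership, field) if m == 0]
    n1, n0 = len(in_grp), len(out_grp)
    mean1 = sum(in_grp) / n1
    mean0 = sum(out_grp) / n0
    mean_all = sum(field) / n
    # population standard deviation of the continuous field
    s = math.sqrt(sum((f - mean_all) ** 2 for f in field) / n)
    return (mean1 - mean0) / s * math.sqrt(n1 * n0 / n ** 2)

# A cluster whose members coincide with the high-correlation stations
# matches almost perfectly:
membership = [1, 1, 1, 0, 0, 0]
field = [0.9, 0.8, 0.7, 0.2, 0.1, 0.0]
print(round(point_biserial(membership, field), 3))  # → 0.974
```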
2) Minimum matches among the techniques investigated
To assess the worst matches associated with each
method, the single lowest match for each of the 18
solutions was used to generate statistics that are shown
in the boxplots of Fig. 5. This may be useful in con-
junction with Fig. 3 as an indicator of the potential
risk for problematic solutions with poorly defined clus-
ters. Most of the methods with low median matches
in Fig. 2 also appear as low in Fig. 5. The SL methods
remained worst with median lowest matches of 0.2 or
less, reflecting the severe chaining inherent to these
techniques. The IC dissimilarity coefficient led to the
relatively best matches, but even these were quite poor.
The CL methods all improved over SL, with CLIC and
CLED having median matches between 0.5 and 0.6.
CLT is notably lower, caused by pathologically small
Fig. 5. Same as Fig. 3 but for the single worst matching spatial pattern
of each of the number of clusters retained (8-25).
clusters over isolated portions of the domain. This will
be illustrated in the next section. The matches for av-
erage linkage methods were comparable with complete
linkage. AL1IC had the highest median of all AL tech-
niques, lending credence to using the IC dissimilarity
coefficient for AL algorithms. AL1ED had a surpris-
ingly large range, indicating that on rare occasions it
resulted in a few poorly defined clusters. This tended
to occur when over 20 clusters were retained. The AL2
methods generally followed the mean results of Fig. 3.
AL2T had lower matches due to chaining, which were
worst when lower numbers of clusters were retained.
A notable lack of variation and relatively high median
accompanied the worst matches for various Ward's
techniques. This was also consistent with Fig. 3. WED
had a slight edge over the remaining dissimilarity coef-
ficients.
The nonhierarchical solutions performed better than
hierarchical ones in general. Both KMR and KMG
had median minimum matches in excess of 0.6. KMS
was markedly lower with a large range due to the sen-
sitivity of the method to the seed points. This will be
illustrated in the next section. In contrast, all NA al-
gorithms were fairly uniform with median minimum
matches in the 0.6-0.65 range. The RPC results were
somewhat surprising as the mean matches of both
PCOR and PCOB (Fig. 3) were high with fairly small
interquartile ranges. In Fig. 5, PCOR had a notably
lower median and larger range as compared with PCOB
or most of the traditional CA methods. Investigation
of the low matches showed several instances of severe
fragmenting when excess PCs were orthogonally ro-
tated. Richman and Lamb (1987, their Fig. 11) show
precisely this fragmenting phenomenon. In contrast,
PCOB (Fig. 5) was much more robust to the effects of
"overfactoring," having the highest median minimum
matches (approximately 0.7) of any technique, con-
sistent with Fig. 3. Again, we stress that Fig. 3 is more
representative of the expected value of the matches,
whereas Fig. 5 illustrates the single worst match per
map. Hence, Fig. 5 needs to be interpreted cautiously.
If there were several poor matches for each solution,
this would obviously lower the median in Fig. 3.
3) Spatial patterns from various methods for the 16-cluster solution
The primary goal of examining various clustering
algorithms was to find appropriate methods that yielded
desirable solutions having close correspondence to the
input data. Although solutions retaining 8-25 patterns
were inspected for accuracy and reliability, we considered
the 16-cluster solution, the midpoint of that range, to be
representative of most of the results. Spatial patterns with
16 clusters retained are presented to view the capability
of each method to accomplish regionalization (Fig. 6).
Single linkage algorithms yielded very poor region-
alizations (Figs. 6a,b,c) as chaining effects were very
significant. All three algorithms had spatial patterns
with one large cluster, which occupied nearly the full
domain, and 15 small clusters. In SLIC, the small clus-
ters were mostly located in Florida and Texas. All the
small clusters contained only one station, except cluster
7, which was slightly larger. SLED had a spatial pattern
that had more scattered chains in Florida, Texas, North
Carolina, Virginia, Colorado, and Sable Island (Can-
ada). For SLT, all small clusters were distributed in
the southwest part of our domain, mostly in Texas.
Fig. 6. Regionalizations of central and eastern North America for weekly May-August precipitation on the basis of seven clustering
algorithms and three dissimilarity/seed point methods as well as for orthogonally and obliquely rotated binanzed PCs. The number of
clusters was held fixed at 16. See text for cluster technique mnemonics.
(Appendix C, Table C1 contains the 16-cluster mean
statistics.)
The spatial patterns of complete linkage illustrated
much less chaining (Figs. 6d,e,f), yet CLT (Fig. 6f)
still had two rather big clusters (4, 13). Pattern 4 cov-
ered a large area including the Midwest and certain
portions of some eastern and southern states. Pattern
13 stretched across the whole East Coast. Eight small
clusters were found in Texas. In CLED (Fig. 6e), clus-
ters tended to be of a more similar size. However, pat-
terns 3, 4, 8, 10, and 16 were larger, whereas patterns
5, 7, and 15 were relatively small. CLIC (Fig. 6d) had
Fig. 6. (Continued)
four relatively large patterns (11, 13, 15, and 16),
mostly in the north part of the domain. All other clus-
ters were similar in size. AL1ED (Fig. 6h) produced
fairly equally sized clusters, except patterns 8, 12, and
15, which were a little larger than the others. In AL1IC's
spatial patterns (Fig. 6g), all the clusters in the west
part of the domain were larger than those near the east
coast. AL1T is missing since it does not yield natural
clusters. Like AL1IC, AL2IC (Fig. 6i) also generated
some larger patterns (8, 10, 14, and 15) in the western
portion of the domain and relatively small clusters in
the east, with the exception of pattern 2. The sizes of
AL2ED's patterns were also relatively equal; only pat-
terns 3, 5, and 13 were small (Fig. 6j). A chaining
cluster appeared in AL2T (Fig. 6k) as pattern 4 covered
the eastern domain while patterns 3 and 10 occupied
-------
914
JOURNAL OF CLIMATE
Volume 8
two-thirds of the western part of the domain, with about
10 very small clusters assembled in the southwest and
another (cluster 11) in southern Florida.
WIC (Fig. 6l) gave distributions similar to AL1IC and
AL2IC, with the exception of the Great Lakes area.
The patterns (10, 15, and 16) on the west edge were
large, whereas those to the east (5, 6, 12, 14, and 17)
were relatively small. WED (Fig. 6m) yielded patterns
where two clusters (4 and 14) were a little larger and
one (1) was small. All others were roughly similar in
size. In WT's spatial pattern (Fig. 6n), pattern 16 in
the southeast was the largest one, and patterns 1, 2,
and 14 were somewhat smaller in the southwest. The
specific matching values are shown in Table C1.
In KMS (Fig. 6o) there were big clusters in the
northern and western part of the domain (1, 5, 7, 10,
11, and 13), whereas small clusters (2, 3, 4, 8, 9, 14,
and 15) were in the central south and southeast. As
opposed to some of the aforementioned distributions,
KMR (Fig. 6p) produced larger clusters (3, 10, 11, and
15) in the east. Other patterns in the central and west
were approximately equally sized. In KMG (Fig. 6q)
the spatial patterns (1, 2, 5, 6, 9, and 13) in the north
and west were larger, whereas others had relatively
uniform sizes. NAS (Fig. 6r) had three larger patterns
(2, 8, and 12). NAR's spatial patterns (Fig. 6s) had a
spatial distribution in reasonable agreement with that
of NAS. Minor differences were with the bigger patterns
(1, 8, 12) in the southeast and the north, with four
patterns along the eastern edge of the domain. How-
ever, many of the clusters in the central domain were
subtly different. NAG (Fig. 6t) also yielded a big cluster
in the southeast as did the other two nucleated algo-
rithms, but all other patterns were rather uniformly
distributed, showing less congruence with the other two
nucleated methods. PCOR and PCOB yielded very
similar patterns (Figs. 6u,v). The patterns for both
simple structure rotational algorithms were in excellent
agreement with those presented in Richman and Lamb
(1985, 1987). Severe overlapping among the patterns
existed in PCOR and, to a lesser degree, PCOB because
the loading cutoff was chosen to maximize the biserial
correlation (+ 0.2). The increased overlap of the or-
thogonal patterns was consistent with the findings of
White (1988) and White et al. (1991, p. 10). Table
CI contains the specific details of the regionalization
statistics.
Hardness of CA solutions is problematic for regionalization
of most geophysical data. In particular, the 7-day
square root rainfall totals are fuzzy in nature. Causal
mechanisms are multiple and overlapping. This is partially
the reason that the hard clustering algorithms achieve
lower accuracy than that of RPC clustering solutions
(Fig. 3). The rigidity of the clusters likely leads to much
of the variation in regionalizations seen in Fig. 6. This
is quite alarming as compared with RPC clustering.
Also, hard clusters omit useful information on prob-
ability of group membership. Hence, to get the best
physical insight, a CA method that allows natural
overlap to be depicted is worthwhile.
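The information discarded by hard assignment can be seen in a small sketch. The "loadings" below are invented for illustration; hardening each station to its single strongest region throws away the graded membership that the loadings carry:

```python
# Illustrative rotated-PC "loadings" for five stations on three regional
# patterns (values invented for this sketch; higher = stronger membership).
loadings = [
    [0.81, 0.10, 0.05],
    [0.62, 0.45, 0.02],   # a genuinely intermediate station
    [0.15, 0.77, 0.08],
    [0.05, 0.50, 0.49],   # another overlap case
    [0.02, 0.11, 0.88],
]

# Hard clustering keeps only the winning region per station...
hard = [row.index(max(row)) for row in loadings]

# ...whereas the loadings themselves retain graded membership shares
# that the hard labels throw away.
for row, label in zip(loadings, hard):
    shares = [round(v / sum(row), 2) for v in row]
    print(label, shares)
```

The second and fourth stations illustrate the point: their hard labels hide the fact that they belong almost equally to two overlapping regions.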
c. Results of the resampling of CA
Statistical test techniques were needed to distinguish
accuracy levels that various CA methods achieve based
on the results of resampling. Several different approaches
have been proposed, and the determination of a
proper method remains a problem (Sokal
and Rohlf 1981). A traditional t test was applied (see
appendix B). Pairwise t statistics were calculated for
each pair separately. Alternatively, the newer MSD
(minimum significant difference) test was also em-
ployed to further detect the differences among the
methods tested (see appendix B). MSD is an "un-
planned comparison" (all pairs tested, a posteriori)
statistic dealing with multiple comparisons of means
and testing potentially significant differences among
all possible pairs of means (Sokal and Rohlf 1981). In
the MSD test, used herein, the studentized range (Qα[k,
v]) is used as the critical value, whereas the t test is
based upon the t distribution. MSD also has another
advantage over the t test; it is computationally more
efficient (as all pairwise comparisons are done only
once). Our testing results (shown below) indicated that
the MSD test was able to detect certain differences that
the t test failed to detect. However, both MSD and t
test results are included as they are based on different
distributions.
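The MSD criterion amounts to a Tukey-style threshold: a pair of means differs significantly when their absolute difference exceeds Qα[k, v] × sqrt(MSE/n). A sketch with invented numbers follows; the critical value of about 3.31 for k = 3 groups and large v is taken from a standard studentized-range table, and the means and MSE are hypothetical, not values from this study:

```python
import math

def msd(q_crit, mse, n_per_group):
    """Minimum significant difference: any pair of group means differing
    by more than this threshold is declared significant."""
    return q_crit * math.sqrt(mse / n_per_group)

# Hypothetical mean biserial matches for three methods over n = 100
# bootstrap resamples each; Q_0.05[k=3, v large] ≈ 3.31 is an assumed
# table value, and the means/MSE are invented for illustration.
means = {"KMS": 0.52, "NAG": 0.61, "PCOB": 0.66}
threshold = msd(q_crit=3.31, mse=0.04, n_per_group=100)

for a, b in [("KMS", "NAG"), ("KMS", "PCOB"), ("NAG", "PCOB")]:
    diff = abs(means[a] - means[b])
    print(a, b, "significant" if diff > threshold else "not significant")
```

Because one threshold serves every pair, all k(k-1)/2 comparisons are done in a single pass, which is the computational advantage over pairwise t statistics noted above.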
At the smallest sample size tested (50) SL was very
poor. When CL was applied, it was clearly superior to
SL (appendix B, Tables B1, B2). AL1 showed further
improvement over CL, but not enough to pass the MSD
or t tests. Most interestingly, AL1T had no coherent
clustering, thereby precluding regionalization. With the
other two dissimilarity measures, AL1 showed less
aggregating than CL methods. AL2 results showed some
decrease in accuracy from AL1, but not enough to be
statistically significant. Ward's methods showed lower
interdissimilarity coefficient variability than AL2.
Nevertheless, WIC was significantly less accurate than
WED for both the MSD and t tests, and less than WT
for the MSD test only due to aggregating. Overall the
results were statistically indistinguishable from most
AL1 and AL2 methods. The k-means methods were
found to be slightly superior to Ward's, but these dif-
ferences were not statistically significant. Seed point
methodology was of some importance with k-means
as the randomly chosen and gridded methods were
slightly more accurate than the sequentially chosen
seeds. Although this was not a statistically significant
difference, we were able to track down the reason to
pathologically smaller groups near the seed points (not
shown for resampling portion but an example is pro-
vided in Fig. 6). The nucleated agglomerative method
had the highest matches, significantly higher than
KMS, but not KMR or KMG. However, the NAS
[Fig. 7: (a) Sample Size 50, (b) Sample Size 100, (c) Sample Size 250, (d) Sample Size 500.]
Aggregation error began to decrease (Table 1). The SL
techniques were still the worst. The CL and AL1 techniques
showed a large elevation in accuracy, although CLIC still had some
aggregation errors (Table 1). CLT was statistically worse
than either CLIC or CLED. AL2IC and AL2ED had
similar means according to our testing, whereas AL2T
was significantly less accurate. Once again, Ward's meth-
ods were all very similar and relatively accurate, exhibiting
small interquartile ranges. The k-means methods showed
less range than any hierarchical method. KMG was sta-
tistically superior to KMS according to the MSD test
(appendix B, Table B2) but not to KMR. The NA meth-
ods all had higher biserial matches than any other CA
technique. Interestingly, results of the significance testing
showed that all three NA methods exceeded KMS for
both the MSD and t tests but were not significantly dif-
ferent from KMG or KMR. The two RPC methods
yielded similar results; both significantly exceeded all three
NA methods according to both the MSD and t tests.
The final sample size tested was 500 bootstrapped ob-
servations. Generally, the interquartile ranges compressed
even further and the increase in accuracy began to level
off (Fig. 7d), with the most notable changes in the CL
and AL techniques. Aggregation errors continued to de-
cline (Table 1), being serious only for CLIC. The SL
techniques showed a pattern similar to the previous panels
of Fig. 7 and remained the worst. CLT was
significantly less accurate than CLIC and CLED accord-
ing to the MSD test (Table B2). The AL1 techniques
had slightly higher medians than the CL counterparts but
the differences were not significant. AL2 methods were
indistinguishable from AL1 with the exception of AL2T
being significantly inferior. All three Ward's solutions
were at the same level statistically. They were about as
accurate as AL1, AL2IC, and AL2ED but better than
AL2T and most of the CL methods. Both KMG and
KMR were statistically superior to KMS according to
the t test. KMG was also superior to WIC but not to the
other two Ward's solutions. The three NA methods were
indistinguishable and quite accurate. They were better
than KMS according to both the MSD and t tests. The
RPC solutions showed no significant differences between
PCOR and PCOB. Both had statistically higher means
than other CA methods according to the MSD and t
tests.
6. Summary and discussion
The intercomparison of various clustering tech-
niques is of interest to geophysical researchers. In past
methodological studies different types of synthetic data,
which contained cluster structure, were used to ex-
amine cluster methods. However, all the data were
theoretical and their distribution and cluster structure
were known before carrying out clustering. Whereas
this is useful for methodological debugging of proce-
dures, it does not provide much insight into how var-
ious techniques perform on geophysical data. The
present work used a well-studied dataset of North
American growing season precipitation to investigate
the capability of various clustering methods to recover
the input data structure and to test the applicability of
a large range of clustering algorithms to real geophysical
problems such as precipitation regionalization. This
dataset was chosen due to the extensive quality control
it had received and our prior knowledge of both its
interstation correlation fields and numerous statistical
investigations of regionalizations (Richman and Lamb
1985, 1987, 1988, 1989).
Monte Carlo bootstrap resampling from the full data
was also carried out to examine the reliability of the
solution for each clustering algorithm. The raw input
data dissimilarity matrix and the clustering solutions
were matched by the point biserial correlation and Eu-
clidean distance (appendix C). To the best of our
knowledge, this type of testing methodology is new
to the geophysical literature and datasets.
Many of the findings of this research were consistent
with conclusions from simulated artificial data in past
methodological studies with a few notable exceptions.
Based upon point biserial correlation matching, single
linkage methods performed poorly, confirming what
numerous previous studies (Kuiper and Fisher 1975;
Blashfield 1976; Mojena 1977; White and Perry 1989)
have found. Theta dissimilarity was found to degrade
the patterns for single, complete, and average linkage
techniques, whereas inverse correlation seemed to be
acceptable for the above techniques and particularly
promising for average linkage within the new group
(although it may have contributed to some aggregat-
ing). Euclidean distance dissimilarity was found to be
adequate too for all methods tested. The most accurate
hierarchical methods for one-sided data were Ward's
methods. However, the nonhierarchical methods tested
(k means and nucleated agglomerative) both generally
outperformed Ward's. Nucleated agglomerative clus-
tering was clearly the best hard cluster method tested
herein, being relatively insensitive to the initial partition
methods. Conversely, k means exhibited a strong de-
pendence upon the partition method, manifested by
numerous smaller patterns near the initial seeds. Ro-
tated principal component clustering was quite accu-
rate, though the orthogonal technique had problems
with occasional fragmenting.
Everitt (1979) identified six major unresolved
"problems" in cluster analysis that might offer potential
dividends in future research: 1) determin-
ing the number of clusters, 2) choosing a "best" clus-
tering method, 3) applying resampling to clustering,
4) interaction between cluster methods and informal
graphical techniques, 5) computer packages for cluster
analysis, and 6) improved algorithms for particular
techniques. In this research, we have addressed four of
the six issues mentioned. Determination of optimal
clustering algorithms was the major goal of our re-
search. Although the full sample results of validity
present a general evaluation of each clustering tech-
nique, it is difficult to gauge the differences without
resampling. The bootstrap was applied for this purpose
(Wu 1986). This allows statistical testing using the
point biserial correlations to be applied. Researchers
(e.g., Blashfield 1976; Mojena 1977; Milligan 1980)
usually used randomly generated data to do their re-
sampling to evaluate the cluster methods. In geophys-
ical work, Wolter (1987) used a Monte Carlo method
to conduct significance testing for a specific cluster so-
lution. In this research, we resampled directly from the
real data in four different sample sizes (50, 100, 250,
and 500 samples). The range chosen is wide enough
to permit us to uncover the differences in accuracy as
a function of sample size. Also, the range is useful as
most research designs fall within these sample sizes. It
also shows the penalty incurred when small samples
are used.
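The resampling scheme, drawing whole observations with replacement at each target sample size, can be sketched as follows. Week indices stand in for the actual weekly precipitation maps; the function is an illustrative sketch, not the authors' code:

```python
import random

def bootstrap_samples(data, sample_size, n_samples, seed=0):
    """Yield bootstrap resamples (drawn with replacement) of whole
    observations (here, weeks) from the full dataset."""
    rng = random.Random(seed)
    n = len(data)
    for _ in range(n_samples):
        yield [data[rng.randrange(n)] for _ in range(sample_size)]

# Hypothetical stand-in for the 663 weekly precipitation observations:
weeks = list(range(663))
for size in (50, 100, 250, 500):
    sample = next(bootstrap_samples(weeks, size, 1))
    # With replacement, some weeks repeat and others never appear:
    print(size, len(sample), len(set(sample)))
```

Each resample is then clustered and matched against the full-data dissimilarity structure, so the spread of matches across resamples measures the reliability of each algorithm at that sample size.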
Asymptotic results show that the better cluster
methods start out at around 0.5 biserial correlation at
a sample size of 50 and increase to about 0.65 at 500
observations (e.g., Ward's technique has an approxi-
mate logarithmic increase as a function of sample size).
Although this is not unexpected, the documentation
of the increase in accuracy provides new insight into
the differential among methods. The traditional t test
and newer MSD test were used to distinguish which
methods differ statistically. Those results point to many
statistically significant differences among both dissim-
ilarity measures and cluster analysis methods. Overall,
the best technique for small sample sizes is the nu-
cleated agglomerative method. For larger (> 100)
samples, the RPC clustering is most accurate. We rec-
ommend these methods for one-sided data when non-
overlapping and overlapping solutions are acceptable,
respectively. For those researchers wishing to use a
hierarchical method, Ward's and average linkage (except
based on theta) are indistinguishable in terms of their
statistical accuracy. While these findings narrow the
list of clustering algorithms under consideration for
regionalization, one should not yet prefer one of the
relatively better-performing algorithms (e.g., AL2ED,
WED, WT, KMR, KMG, NAS, NAR, NAG, PCOR,
PCOB) over another without additional information
about their performances with respect to other mea-
sures of matching quality and to other dataset char-
acteristics.
A potential information problem for CA solutions
may have been incurred because all nonzero PCs were
retained in this study. We kept all the PCs in order to
make possible a fair intercomparison among all ED-, IC-, and
theta-based clustering algorithms. This problem can
occur when a substantial number of PCs are retained.
However, if the PC solutions are excessively truncated,
problems with information loss arise. Hence, it is dif-
ficult, if not impossible, to fully assess this problem in
our current comparison framework. We feel such a
repeated information problem does not essentially af-
fect our type of intercomparison since the same data
(with the unknown amount of repeated information)
were equally applied to all the tested CA algorithms.
Additionally, the tested examples in section 5a(5)
showed that the biserial correlation matches of Ward's
and AL1 methods using full PCs (662) and truncated
PCs (16) were at virtually the same accuracy level
(0.66-0.68), suggesting that any repeated information
in the computations did not adversely affect our
matching results significantly. Therefore, this potential
problem of repeated information should have little ef-
fect on our intercomparison conclusions.
To isolate the effects of cluster hardness, we binarized
the RPCs in order to help make direct comparisons
with the traditional CA methods. This had the effect
of lowering their accuracy (Fig. 2). The reason for this
is that hard CA methods have irregularly shaped regions
due to the constraint of clusters that assign each entity
to only one area with no intercluster overlap, whereas
RPC regions are "fuzzy" (have loadings suggesting
probability of entities having group membership) and,
hence, very smooth. The minimum matches (single
worst match for each of the 18 solutions) for all meth-
ods (Fig. 5) show a more complicated picture, with
the orthogonal PC solution clearly being more sensitive
to the effects of selection of too many components than
either the oblique RPC or most of the traditional CA
methods. This means that one or two areas of the or-
thogonal RPC solution will fragment if too many or-
thogonal RPCs are retained, as shown in Richman and
Lamb (1987, their appendix). Similar degradation un-
der other cluster analysis methods occurs with chaining
(at any number of clusters retained) and, to a lesser
extent, aggregating (with a lower number of clusters
retained). However, these effects are infrequent, so they
do not drag the mean matches down noticeably
under orthogonal RPC or most other CA methods (Fig.
3). Interestingly, obliquely rotated PCs do not show
this degradation, exhibiting the highest point biserial
correlations of any technique tested.
What this suggests is that, although we have patho-
logically altered RPC, it can still accomplish region-
alization with an accuracy level equivalent to or ex-
ceeding the best of the CA methods when a sufficient
sample size is available. However, the nature of an RPC
regionalization is quite different from that given by the
nonoverlapping cluster analysis methods tested herein.
Hence, there are two conclusions arising from this
comparative exercise. 1) Both nonoverlapping cluster
analysis methods and RPC can yield relatively accurate
results. With hard cluster analysis, the regions must be
irregularly shaped. With RPC, they are usually ellip-
tical. 2) The goals of the research designs may be a
leading factor in deciding which techniques are applied.
This also points to the potential for fuzzy clustering
(Rousseeuw et al. 1989), which results in clusters with
gradients (not hard clusters), to theoretically achieve
the level of accuracy seen in the RPC methods.
There exist two confounding factors in directly
comparing the CA and RPC regionalizations. The first
issue was solution consistency (overlapping versus
nonoverlapping clusters). We binarized the RPC so-
lutions to harden them to mitigate the inconsistency
but it did not eliminate the difference. The second fac-
tor was the difference in input variance information
for CA and RPC regionalizations. The RPC procedure
discards some input variance information in achieving
a final regionalization, whereas CA uses all input in-
formation in our design. This second difference occurs
when CA and RPC are conventionally applied. In this
study, we found that this input information differential
did not affect our matching accuracy. Our findings also
suggested that the CA regionalizations change based
upon slightly differing amounts of input variance;
therefore, researchers have to be cognizant that direct
comparison of CA and RPC may be rather compli-
cated.
This research conducted the hierarchical cluster
analysis in such a way that used distancelike measures
as the input data dissimilarity matrix. In an effort to
distinguish if the input dissimilarity matrix had dis-
cernible effects on the outcome of cluster analysis, we
applied the most common measure, Euclidean dis-
tance, as well as two other lesser-used coefficients. The
widely used correlation coefficient was used for RPC
and was inverted to serve as a distance measure for the other
hierarchical CA methods. Such a measure is also new to the
literature. Results showed that inverse correlation coef-
ficient had adequate matches, although cluster analysis
based on it sometimes suffered from aggregating errors.
A second experimental dissimilarity measure, the theta
angle between vectors, was applied only once in the
literature (Sausy et al. 1987). It was also applied herein
with moderate degradation (compared with either Eu-
clidean distance or inverse correlation) in the matches
for several of the hierarchical methods. Two algorithms
of Ward's method, WIC and WT, are experimental as
this technique was derived on the basis of using Eu-
clidean distance measure between the entities. Whereas
Euclidean distance makes the term "minimum vari-
ance" mathematically exact, there is no reason not to
extend the technique to other dissimilarity measures.
Our results have indicated that incorporating the in-
verse correlation and theta angle into Ward's method
resulted in statistically equivalent accuracy to the orig-
inal formulation if the sample size is sufficient and en-
hances the potential for further exploration using this
popular method. However, one should be careful about
the aggregation error when WIC is used.
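The three dissimilarity measures can be sketched as follows. The 1 − r form of the inverse correlation and the raw-vector angle for theta are assumed forms chosen for illustration, not necessarily the exact definitions used in this paper:

```python
import math

def euclidean(x, y):
    """Euclidean distance (ED) between two series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Ordinary Pearson correlation between two series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def inverse_correlation(x, y):
    # One plausible "inverse correlation" (IC) dissimilarity: 1 - r, so
    # perfectly correlated series have zero distance (an assumed form).
    return 1.0 - pearson(x, y)

def theta(x, y):
    # Angle (radians) between the raw series viewed as vectors.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / (nx * ny))))

a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]   # same direction, different magnitude
print(euclidean(a, b), inverse_correlation(a, b), theta(a, b))
```

The example shows why the choice matters: two series with identical shape but different amplitude are far apart under ED yet at zero dissimilarity under either IC or theta.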
The formation of clusters that group two or more
geographically diverse portions of the spatial domain
was noted for various CA methods and sample sizes.
We termed this type of incorrect clustering "aggregation
error." It can hinder the interpretation of the cluster
patterns. This is easy to detect in spatial regionalization
research (Fig. 4) but may be more insidious in other
modes. These kinds of problems have been difficult, if
not impossible, to detect in theoretical studies using
synthetic data. Nevertheless, they affect applied re-
search. Aggregating does not necessarily imply that the
algorithm is inappropriate when fewer clusters are re-
tained (this is when it is most serious). The aggregating
clusters may be split into two coherent ones as more
clusters are retained. Sometimes it is plausible to have
an aggregated pattern over different portions of the re-
search domain. However, for our data, aggregating re-
sults in an unrealistic regionalization as the clustering
algorithm chosen does not fit the problem and fails to
identify the cluster patterns (or possibly the signal-to-
noise level is too low). Easterling (1989, his Fig. 3)
encountered this problem in thunderstorm precipita-
tion research as did White and Perry (1989) in their
climate regionalization research. Aggregating was de-
tected to a serious degree in Kikuchihara (1984) in the
precipitation regionalization of Japan.
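A crude screen for this kind of error in spatial regionalization is to check whether each cluster's stations form a single connected geographic component. The sketch below is a hypothetical helper (not the procedure used in this paper): it links member stations closer than a chosen threshold `link_dist` and flags clusters that break into several pieces.

```python
import math

def aggregation_flags(coords, labels, link_dist):
    """Flag clusters whose member stations do not form one connected
    geographic component when stations within `link_dist` are linked.
    Returns {cluster_label: True if the cluster is geographically split}."""
    def dist(p, q):  # planar surrogate; adequate for a local screen
        return math.hypot(p[0] - q[0], p[1] - q[1])

    flags = {}
    for lab in set(labels):
        idx = [i for i, l in enumerate(labels) if l == lab]
        parent = {i: i for i in idx}           # union-find over members

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path halving
                i = parent[i]
            return i

        for a in idx:
            for b in idx:
                if a < b and dist(coords[a], coords[b]) <= link_dist:
                    parent[find(a)] = find(b)
        flags[lab] = len({find(i) for i in idx}) > 1
    return flags
```

A flagged cluster is only a candidate for aggregation error; as noted above, it may simply split into coherent pieces once more clusters are retained.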
Unlike the RPC results (Richman and Lamb 1985), there are moderate disagreements among the specific spatial CA patterns when 16 clusters are retained for the methods examined (Fig. 6). This is at least partly
attributable to both the different dissimilarity matrices
and the methods. Some techniques were unable to yield
anything resembling a realistic regionalization (e.g.,
single linkage). The instability among methods based
upon the same dissimilarity measure does suggest that
differences in precipitation regionalizations derived by
cluster analysis methods exist. Other investigators have
also noted similar problems, such as White and Perry
(1989), who found different spatial patterns given by
six cluster methods. After they rejected the four worst of the six methods, the two accepted methods (Ward's and complete linkage) agreed on 87% of the areas falling into the groups that were thought to be the same.
Moreover, Winkler (1985) noted that the results from Ward's, centroid, and k-means methods showed significant differences based on one of her datasets. This
finding is further supported by Bonell and Sumner's
(1992, p. 97) regionalizations of precipitation areas in
Wales using centroid, median, and Ward's CA meth-
ods. They found different cluster boundaries for each
method and with different PC preprocessing. Therefore,
researchers need to be cautious in physically explaining
the clustering solutions. We strongly urge investigators
to apply either a posteriori tests to the results or several
CA techniques to verify the validity of the solutions.
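One simple a posteriori agreement check between two CA solutions is the Rand index: the fraction of entity pairs on which the two partitions agree (both grouped together or both apart). A minimal sketch, assuming integer label vectors of equal length:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index between two partitions of the same entities.
    1.0 means the partitions are identical (up to relabeling)."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b   # counts pairs on which they agree
        total += 1
    return agree / total
```

A value near 1 across several CA methods lends credibility to the solution; strong disagreement, as in the regionalizations cited above, is a warning against physical interpretation of any single solution.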
Determination of the optimal number of clusters to
retain is one of the major unresolved issues in cluster
analysis (Everitt 1979) and principal component anal-
ysis (Richman et al. 1992; O'Lenic and Livezey 1988).
The use of various ad hoc statistics or tests is recommended to detect where the clustering procedure should be stopped. Jolliffe et al. (1986) described eight criteria for choosing the optimal number of clusters for nonhierarchical cluster methods. Mather (1976, p.
329) presented an F-ratio test technique involving the
APRIL 1995          GONG AND RICHMAN          919
comparison of the within-cluster sums of squares at
two different clustering levels for nucleated agglom-
erative clustering and pointed out that "this test is an
approximation and involves no explicit distributional
assumptions." However, both Mather (1976) and
Ronberg and Wang (1987) reported problems with the
F test. Mojena (1977) gave two possible stopping rules
for hierarchical clustering methods. These methods may be useful when incorporated into cluster analysis by geophysicists. In climate ap-
plications, pseudo F and pseudo T2 statistics were used
by Stooksbury and Michaels (1990) and Davis and
Kalkstein (1990). Kalkstein et al. (1987) applied an
R2 statistic to investigate the stop point for hierarchical
methods. Sanchez and Ramos (1987) used a method
that plots the error function "e" versus numbers of
clusters to look for a better stop point. None of these
techniques is commonly accepted or recognized as a
secure or statistically justifiable method. More research
on stopping techniques for CA is needed to help ensure
objectively reaching optimal clustering solutions. The
motivation of this research was to supply information
to enable investigators to objectively assess which
methods offer accurate solutions and minimize the risk
of interpreting spurious results. Hence, we have avoided
making specific decisions on the stopping point by de-
riving all statistics for a wide range of clusters (8-25).
A useful approach would be to re-create an analysis
framework similar to that in Milligan and Cooper's
(1985) Monte Carlo work for various fields of clima-
tological data. Such work might point to procedures
that may help to determine the "correct" number of
clusters for more complicated, less separated data of high dimensionality.
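For reference, the pseudo-F statistic mentioned above (the Calinski-Harabasz ratio of between-cluster to within-cluster variance) can be evaluated across candidate solutions; peaks suggest a stopping point. This is a minimal sketch, not the implementation used in the cited studies:

```python
def pseudo_f(data, labels):
    """Pseudo-F (Calinski-Harabasz) statistic for a labeled partition of
    points: (SSB / (k - 1)) / (SSW / (n - k)). Larger is better separated."""
    n, k = len(data), len(set(labels))
    dim = len(data[0])
    grand = [sum(p[d] for p in data) / n for d in range(dim)]
    cent = {}
    for lab in set(labels):
        pts = [p for p, l in zip(data, labels) if l == lab]
        cent[lab] = [sum(p[d] for p in pts) / len(pts) for d in range(dim)]
    # between-cluster sum of squares (each point's centroid vs grand mean)
    ssb = sum(sum((cent[l][d] - grand[d]) ** 2 for d in range(dim))
              for l in labels)
    # within-cluster sum of squares
    ssw = sum(sum((p[d] - cent[l][d]) ** 2 for d in range(dim))
              for p, l in zip(data, labels))
    return (ssb / (k - 1)) / (ssw / (n - k))
```

As the text cautions, no such statistic is a secure stopping rule on its own; it is one diagnostic to be read alongside others.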
Our work suggests the penalty imposed by hardening
the RPC cluster solution was a drop in accuracy of
approximately 16% variance. Moreover, all of the nonoverlapping methods appear to have a further decrease in accuracy of ~8% variance. This is alarming, and researchers should consider whether the geophysical research question and the real-world data are truly hard (nonoverlapping) in nature before turning to a canned CA program. In our
experiment, the causal mechanisms of precipitation are
fuzzy and overlapping. For an applied research problem
we suggest examination of the physical mechanisms
prior to embarking on specific methodology and then
choosing techniques that best preserve the underlying
physics.
Unfortunately, through our extensive review of the
literature on CA (and PC), we find little evidence this
is routinely performed. Researchers rarely state why a particular algorithm was chosen from among the many options to enhance the opportunity of achieving optimal results. In-
deed, the norm was to choose a specific method a priori,
run the data through the algorithm(s), and then hope
something useful appeared. We hope this work will
encourage others to carefully choose a course of action
most likely to preserve the physics.
A final point regards the nature of the data tested,
one-sided correlations. We feel that the results based
upon these data should be transferable to other one-
sided datasets (Richman 1986). Further work is needed
on two-sided correlations to investigate whether the
most accurate CA and PC methods identified herein
appear again.
Acknowledgments. This study was supported by NSF
Grant 89-08545 and EPA Cooperative Agreement CR-
816318-02-0. Computational support was provided by
the National Center for Supercomputing Applications
at Urbana, Illinois. We thank Peter Lamb, Robert
Livezey, and an anonymous reviewer for providing
useful comments on preliminary manuscripts, Robert
Fovell for his exhaustive reviews, Ginger Rollins for
typing the manuscript, and Joan Kimpel and Linda
Riggin for their drafting expertise.
APPENDIX A
A Survey of Cluster Analysis Applications
Cluster analysis has been applied to geophysical re-
search since the 1960s. During the last decade, the application of clustering techniques has proliferated. A detailed survey of the literature on cluster anal-
ysis applications in geophysical fields has been
conducted, mainly focusing on the works in the past
decade. Thorough literature reviews of rotated principal
component analysis are contained in Richman and
Lamb (1985) and Richman (1986). A comprehensive
list (Table Al) is presented to review these traditional
CA applications and point out methodological contri-
butions. Table Al shows that various cluster analysis
methods have been widely used in different types of
problems in atmospheric research. The most common
CA methods applied were Ward's and k means (20
and 20 references, respectively) and average linkage
(18 references). The large majority (85%) of investi-
gators applied the Euclidean distance. Previous (sub-
jective) knowledge of the research problem was cited
as the major basis on which an initial partition was
decided upon for nonhierarchical methods.
APPENDIX B
Significance Testing
Results of biserial correlation matches from resam-
pling present insight into the reliability and accuracy
of each of the 22 algorithms. While the averages of the
biserial correlations can be used to assess a general level
of accuracy, significance testing can be applied to
quantitatively distinguish the performance of each
method. This facilitates precise documentation of the
statistical reliability of each method. Two tests are now
outlined.
-------
Table A1. Literature review of clustering applications.
Author (year)
Dataset
Methods
Similarity measure or initial partition
Methodological contributions of multiple techniques tested
Skaggs (1969)
McBoyle (1971)
Ayoade (1977)
Willmott (1978)
Crecelius et al. (1980)
Gadgil and Iyengar (1980)
Leach (1980)
Mosely (1981)
White (1981)
Crotchet et al. (1982)
French and LeDuc (1983)
Gadgil and Joshi (1983)
Ledrew (1983)
Ambrozy et al. (1984)
Gorham et al. (1984)
Kikuchihara (1984)
Maheras (1984)
Bartholy and Kaba (1984)
10 variance values and phase angles in 157 location cells (24 X 157) in central and eastern United States
3 factor loadings of 20 climate variables by 66 stations in Australia
10 climatic variables or indices by 32 weather stations in Nigeria
4 PC loadings of 120 monthly rainfall by 90 stations in California
32 particulate elements by 48 samples (48 X 32) near Colstrip, Montana
a. 2 PCs by 53 stations (2 X 53) from a dataset with 73 climate parameters; b. 2 PCs by 37 stations (2 X 37) in an area of India
13 elements in water by 14 stations (13 X 14) at St. Clair Lake
90 (north) and 84 (south) stations by 2 flood parameters [(2 X 90) and (2 X 84)] in New Zealand
6 PCs of 24 climate variables by 68 stations in Great Britain
979 cyclones, 24-h forecast error with latitude and longitude (2 X 979) in the Gulf of Mexico, the Caribbean, and the North Atlantic
a. A number of locations over continental United States by 12 monthly temperatures; b. A number of locations over continental United States by monthly and 2-monthly temperatures
First 2 PCs of 119 stations from rainfall, moisture, and temperature (6 X 119) in India
182 daily geopotential heights by 225 grid points in part of the Arctic basin
8767 daily geopotential heights by 80 grid points (80 X 8767)
3 major pollutants at 49 sites (3 X 49) in the eastern United States
8 districts within which there are 292, 280, 289, 298, 310, 273, 304, and 356 stations, respectively, in Japan by 12 monthly precipitation
6 factor scores from 8 weather variables by 2122 days
Broke the 8767 daily height fields into 4 seasons in the same region
Complete linkage
Complete linkage
Ward's
Ward's
k means
k means (nearest center sorting)
Average linkage
Average linkage
Average linkage
Single linkage
Centroid
Ward's, k means
k means (nearest center sorting)
Ward's
k means
k means
Complete linkage
A hierarchical method
k means
Correlation coefficient
A distance
Euclidean distance
Euclidean distance
Subjectively
Subjectively
Euclidean distance
Euclidean distance
Euclidean distance
Euclidean distance
Subjectively
Euclidean distance
Randomly and
subjectively
Starting one single
cluster and splitting
Euclidean distance
Randomly and
subjectively
Early user of common factor scores (or loadings)
as the input data of cluster analysis
Different grouping may be obtained by using
different algorithms or different weather
variables
Galliani and Filippini (1985)
Goossens (1985)
Ledrew (1985)
Maryon and Storey (1985)
Steinhorst and Williams (1985)
Tagami (1985)
Winkler (1985)
Capaldo et al. (1985)
Acreman and Sinclair (1986)
Key and Crane (1986)
Lagarec (1986)
Reich (1986)
Anyadike (1987)
Kalkstein et al. (1987)
Ronberg and Wang (1987)
42 stations by one of five climate elements of 17 years (17 X 42) in a small area of Italy. Five elements: daily precipitation, wet days, annual precipitation, and maximum and minimum daily temperature.
5 PCs from annual rainfall by 90 stations (5 X 90) in the Mediterranean area
181 winter daily geopotential heights by 225 grid points in part of the Arctic basin
320 half-monthly anomalies by grid point, keeping the first 8 eigenvector coefficients (8 X 320) for every two months, in Europe and the North Atlantic Ocean
a. 7 groundwater variables by 154 samples (7 X 154) in west Colorado; b. 10 variables with 75 observations (10 X 75) in south-central Washington
212 days of wind u, v components in 1973 by 137 stations (137 X 212) in the Japan Islands
a. Phase and normalized amplitude of four classes of heavy rainfall by 240 grid points (8 X 240) in central and eastern United States; b. First 2 PCs of the data (2 X 240)
Maximum and minimum monthly temperature of 20 years by 62 stations (20 X 62) in Italy
11 basin characteristics by 168 stations (11 X 168) in Scotland and England
5, 11, and 17 PC scores from 93-gridpoint 700-mb geopotential height by 310 days in July of 1973-82
First 3 PCs of temperature by 27 stations (3 X 27) in an area of Canada
180 stations' precipitation in East Germany
4 factor scores of 17 weather variables by 109 stations (4 X 109) in West Africa
First 4 PCs of 28 weather variables by 620 days (4 X 620) in Mobile, Alabama
13 PCs of precipitation by 93, 102, and 118 stations in China
Sanchez and Ramos (1987)
10 particle components by 231 samples (10 X 231) at Valladolid, Spain
Average linkage
Ward's
Ward's
A nonhierarchical method
a. Correlation coefficient; b. Difference between stations
Euclidean distance
Euclidean distance
Arbitrarily
Complete linkage
Euclidean distance
Average linkage (group average)
Ward's, centroid, and k-means
Average linkage
Ward's
Relocate (nucleated agglomerative method and norm)
Centroid
Single, complete, median, centroid, weighted average, average linkage, and Ward's
Ward's
Average linkage, centroid, and Ward's method
Nucleated agglomerative method
k means
Euclidean distance
Euclidean distance
Euclidean distance
Euclidean distance
Randomly
Euclidean distance
Correlation coefficient
Euclidean distance
Euclidean distance
Randomly
Randomly and
subjectively
Compared the solutions from each method: significantly different patterns found, but after a PC transformation, the results from all methods were almost identical.
Compared the two clustering procedures: agglomerative and divisive.
A review of cluster results by aid of variance analysis and other methods showed that Ward's method is an optimal approach.
Evaluated three algorithms by both the clustering procedure and results. Average linkage was best for the clustering application in synoptic climate analysis.
Value of error functions vs numbers of clusters for getting optimal solutions.
Author (year)
Dataset
Sausy et al. (1987)
19 particle elements by 7438 and 5384 samples (19 X 7438 and 19 X 5384) at a site in Norway
Wolter (1987)
36-year monthly SLP, total cloudiness, surface wind, and SST by 391 squares over the tropical Atlantic, the eastern Pacific, and the Indian Ocean
Mo and Ghil (1988)
a. EOFs of model 500-mb streamfunction; b. EOFs of real 500-mb height data
Alsop (1989)
Changnon (1989)
Easterling (1989)
Stone (1989)
White and Perry (1989)
Davis and Kalkstein (1990)
Weekly maximum, minimum, and mean temperature of the year by 150 stations (150 X 52) in west Oregon and Washington
12 thunder event flash characteristics by 25 stations (12 X 25) in the United States
Gamma and beta parameters of storm rainfall by 450 stations (2 X 450) in the United States
10-13 PC scores of 27 variables for each of 15 years by the days in the year
2 PCs of 16 agriclimatic elements by 69 cell areas (2 X 69) in England and Wales
Each day's matrix of 141 stations by m PC scores from 24 variables
Fernau and Samson (1990a,b)
Puvaneswaran (1990)
Sanchez et al. (1990)
Stooksbury and Michaels (1990)
Transport components (x, y) of 3 days in 10 sites of eastern North America by 1093 3-day period observations (60 X 1093)
3 factor scores from 28 climatic variables by 113 stations
a. 12 meteorological variables by 244-day observations at a place in Spain; b. Pollutant sources by 244 days
72 meteorological variables by 449 stations (72 X 449) in southeastern United States
Table A1. (Continued)
Methods
Similarity measure or initial partition
Methodological contributions of multiple techniques tested
k means
Average linkage
k means
A hierarchical method
Try each particle as a potential seed point
θ angle, chi-square, and modified chi-square distance were used as similarity measures.
Correlation coefficient
Unfixed number of clusters to initiate clustering; correlation used as similarity measure.
Dissimilarity
θ angle and chi square incorporated into clustering procedures
Monte Carlo simulation and statistical tests were used in cluster analysis.
Unfixed number of clusters algorithm used to search for optimal numbers based on the criteria. The concept of fuzzy cluster membership was involved.
Compact method (complete linkage)
Ward's
Ward's
Single, complete, average linkage, centroid, Ward's, and k-means
Average linkage, k-means
Ward's, centroid, median, and average linkage
One of the hierarchical linkage methods
k means
Euclidean distance
Euclidean distance
Euclidean distance
Euclidean distance
Euclidean distance
Initial partition using
the results of
average linkage.
Euclidean distance
Euclidean distance
Randomly
Combined subjective knowledge and a statistical
test to choose proper clustering methods.
Two-step clustering was carried out. Used hierarchical results as the initial partition for the nonhierarchical method. F, T2, R2, and a "clustering" analog of the scree test for determining the number of clusters.
Ward's method found to yield well-grouped solutions. The other 3 methods had "chaining" and identified outlier clusters.
Average linkage, k means
Euclidean distance; initial partition using results of average linkage
Two-step clustering was carried out. Used hierarchical results as the initial partition for the nonhierarchical method. F and T2 were used for detecting an optimal solution of the hierarchical methods.
Davis (1991)
PC scores by N days from a matrix (24 X N) for each of the 8 U.S. cities
Bartholy (1992)
905 precipitation stations in Hungary
Bonell and Sumner (1992)
Davis and Walker (1992)
5 unrotated and orthogonally rotated PCs from 765 daily precipitation by 121 sites in Wales
6 PC scores from 798 synoptic weather variables by 3620 days
Dorling et al. (1992)
Guttman (1992)
Kim (1992)
LeDuc (1992)
Stone and Auliciems (1992)
Cheng and Wallace (1993)
Fovell and Fovell (1993)
Clustering of isobaric air trajectories to assess precipitation, gas, and aerosol concentrations
7 variables including location and variability of the annual cycle of rainfall by 1119 stations in the United States
11 variables from various infrared channels at 459 cases
PC scores of two-month periods of 700-mb height for 14 variables and approximately 2300 observations
2 PC scores and SOI indices for the period from Jan 1882 to Dec 1989
702 maps by 445 grid points for Northern Hemisphere wintertime 10-day low-pass-filtered 500-hPa height fields and anomaly fields
24 variables including monthly temperature and precipitation by 344 spatial points in the United States. Also PC scores from the 24 variables.
Average linkage, k means
Single linkage, Ward's, centroid, and k-means
Centroid, median, and Ward's
Average linkage, k means
k means
Average linkage. Ward's
Euclidean distance
Initial partition using
the results of
average linkage
Euclidean distance
Euclidean distance
Euclidean distance
Initial partition using
the results of
average linkage
Euclidean distance
Extensive discussion on types of bias associated with different preprocessing options.
Compared the groupings from Ward's method by using unrotated and obliquely rotated PCs.
Complete linkage
Nearest centroid sorting
k-means, average linkage, and Ward's
Ward's
Euclidean distance
Euclidean distance
Unknown
Euclidean distance
Reproducibility of a cluster was tested
Average linkage
Euclidean distance. Data preprocessing before computing ED; methodological, latent, and information bias influences are addressed.
Table B1a. The t statistic values for four sample sizes: test for sample size 50. (Upper-triangular matrix; each row lists t values against the methods that follow it, in the column order SLIC, SLED, SLT, CLIC, CLED, CLT, AL1IC, AL1ED, AL2IC, AL2ED, AL2T, WIC, WED, WT, KMS, KMR, KMG, NAS, NAR, NAG, PCOR, PCOB.)

SLIC:   0.12  0.21  8.83  9.59  9.09 11.08 11.34  9.09 11.45  3.31  9.07 14.42 11.96 12.38 13.32 13.30 13.42 15.26 14.42 14.93 13.75
SLED:   0.04  8.74  9.51  8.99 10.94 11.22  9.03 11.30  3.34  9.00 14.12 11.81 12.19 13.10 13.09 13.22 14.95 14.16 14.62 13.52
SLT:   10.45 10.90 10.76 13.08 13.05 10.20 13.62  3.57 10.43 18.10 13.96 14.87 16.06 15.99 15.89 19.06 17.57 18.58 16.51
CLIC:   1.53  0.20  1.89  2.71  1.47  2.04  2.01  0.82  3.44  2.89  2.59  3.29  3.35  3.76  4.20  3.98  3.98  3.72
CLED:   1.35  0.20  1.04  0.00  0.31  3.02  0.69  1.43  1.15  0.78  1.40  1.47  1.89  2.13  2.00  1.95  1.81
CLT:    1.69  2.52  1.29  1.84  2.15  0.63  3.22  2.70  2.39  3.09  3.14  3.57  3.99  3.78  3.78  3.52
AL1IC:  0.93  0.19  0.13  3.30  0.95  1.36  1.04  0.64  1.32  1.39  1.86  2.14  1.99  1.93  1.77
AL1ED:  1.00  0.82  3.86  1.78  0.26  0.08  0.35  0.29  0.36  0.82  1.00  0.89  0.81  0.71
AL2IC:  0.30  2.95  0.67  1.36  1.10  0.75  1.34  1.40  1.81  2.03  1.91  1.85  1.73
AL2ED:  3.40  1.08  1.24  0.93  0.52  1.21  1.28  1.76  2.04  1.88  1.82  1.67
AL2T:   2.53  4.31  3.98  3.77  4.23  4.27  4.56  4.81  4.69  4.67  4.53
WIC:    2.29  1.91  1.58  2.22  2.28  2.70  3.01  2.85  2.82  2.64
WED:    0.18  0.70  0.04  0.13  0.68  0.89  0.76  0.66  0.55
WT:     0.45  0.21  0.28  0.76  0.94  0.83  0.74  0.65
KMS:    0.70  0.78  1.28  1.52  1.38  1.30  1.17
KMR:    0.08  0.60  0.78  0.67  0.57  0.47
KMG:    0.52  0.69  0.58  0.48  0.39
NAS:    0.12  0.04  0.08  0.14
NAR:    0.09  0.23  0.28
NAG:    0.13  0.18
PCOR:   0.07
Table B1b. The t statistic values for four sample sizes: test for sample size 100. (Upper-triangular matrix; each row lists t values against the methods that follow it, in the same column order as Table B1a.)

SLIC:   0.00  1.20 13.49 14.93  4.93 10.91 18.40 16.70 16.40  2.54 15.39 18.44 15.33 16.40 21.21 20.25 21.25 18.17 21.07 20.39 22.52
SLED:   1.20 13.49 14.93  4.93 10.91 18.40 16.70 16.40  2.54 15.39 18.44 15.33 16.40 21.21 20.25 21.25 18.17 21.07 20.39 22.52
SLT:   14.73 16.23  5.40 11.73 20.12 18.22 17.88  2.92 16.86 20.09 16.57 17.89 23.33 22.18 23.23 19.66 23.02 22.08 24.54
CLIC:   0.97  1.51  0.78  1.93  1.39  1.35  2.89  0.43  2.31  1.84  1.30  2.68  2.62  3.36  3.02  3.36  4.04  4.36
CLED:   2.04  0.02  0.87  0.37  0.34  3.34  0.60  1.29  0.92  0.28  1.58  1.54  2.30  2.05  2.30  3.05  3.32
CLT:    1.89  2.52  2.26  2.24  1.29  1.76  2.73  2.52  2.21  2.86  2.85  3.21  3.12  3.21  3.61  3.70
AL1IC:  0.63  0.26  0.24  3.15  0.48  0.95  0.71  0.20  1.12  1.11  1.66  1.56  1.67  2.29  2.41
AL1ED:  0.52  0.55  3.78  1.60  0.49  0.17  0.61  0.75  0.74  1.60  1.39  1.61  2.50  2.78
AL2IC:  0.03  3.55  1.04  0.97  0.61  0.09  1.26  1.23  2.05  1.80  2.06  2.87  3.16
AL2ED:  3.53  0.99  0.99  0.64  0.06  1.28  1.25  2.06  1.81  2.06  2.87  3.15
AL2T:   3.12  3.95  3.75  3.50  4.07  4.06  4.37  4.28  4.37  4.70  4.78
WIC:    2.03  1.54  0.93  2.43  2.36  3.17  2.80  3.17  3.91  4.27
WED:    0.24  1.06  0.20  0.20  1.05  0.91  1.06  1.97  2.20
WT:     0.69  0.43  0.43  1.15  1.03  1.16  1.95  2.13
KMS:    1.35  1.32  2.13  1.88  2.14  2.94  3.23
KMR:    0.02  0.97  0.81  0.99  1.99  2.27
KMG:    0.91  0.77  0.93  1.90  2.16
NAS:    0.02  0.02  1.07  1.25
NAR:    0.03  0.96  1.09
NAG:    1.04  1.22
PCOR:   0.07
Table B1c. The t statistic values for four sample sizes: test for sample size 250. (Upper-triangular matrix; each row lists t values against the methods that follow it, in the same column order as Table B1a.)

SLIC:   0.00  1.98 26.70 27.51  8.20 31.03 25.26 35.25 35.92  4.31 26.31 40.66 35.62 28.51 40.64 47.49 41.73 46.71 36.79 43.52 47.70
SLED:   1.98 26.70 27.51  8.20 31.03 25.26 35.25 35.92  4.31 26.31 40.66 35.62 28.51 40.64 47.49 41.73 46.71 36.79 43.52 47.70
SLT:   22.52 23.20  8.61 25.39 22.38 27.24 27.83  4.87 22.84 29.85 27.73 24.16 29.98 32.23 31.05 32.71 29.10 33.36 34.97
CLIC:   0*0   3.09  2.14  2.19  2.30  3.04  5.44  1.70  3.78  3.09  2.12  4.06  4.60  5.47  5.80  5.12  8.85  9.38
CLED:   3.49  1.30  1.46  1.39  2.13  5.80  0.94  2.81  2.19  1.32  3.09  3.55  4.50  4.77  4.23  7.91  8.38
CLT:    4.17  4.19  4.24  4.58  2.17  3.93  4.88  4.61  4.16  5.01  5.17  5.65  5.72  5.58  7.28  7.39
AL1IC:  0.36  0.01  0.78  6.40  0.24  1.42  0.85  0.10  1.72  2.12  3.24  3.46  3.03  6.87  7.38
AL1ED:  0.39  0.27  6.37  0.54  0.76  0.33  0.25  1.02  1.27  2.27  2.37  2.18  5.34  5.61
AL2IC:  0.87  6.49  0.25  1.60  0.94  0.12  1.94  2.44  3.63  3.94  3.33  7.60  8.23
AL2ED:  6.79  0.94  0.65  0.08  0.62  1.00  1.39  2.68  2.91  2.48  6.69  7.24
AL2T:   6.16  7.06  6.81  6.38  7.17  7.33  7.73  7.80  7.65  9.14  9.25
WIC:    1.50  1.00  0.32  1.76  2.08  3.09  3.24  2.93  6.31  6.65
WED:    0.56  1.20  0.38  0.75  2.23  2.46  2.04  6.57  7.21
WT:     0.68  0.90  1.28  2.58  2.79  2.38  6.55  7.09
KMS:    1.48  1.81  2.88  3.05  2.73  6.29  6.67
KMR:    0.31  1.84  2.03  1.69  6.17  6.77
KMG:    1.77  2.01  1.59  6.56  7.34
NAS:    0.00  0.04  4.37  4.81
NAR:    0.04  4.73  5.28
NAG:    3.93  4.27
PCOR:   0.10
Table B1d. The t statistic values for four sample sizes: test for sample size 500. (Upper-triangular matrix; each row lists t values against the methods that follow it, in the same column order as Table B1a.)

SLIC:   0.09  4.06 28.63  8.98  4.52 36.52 10.59 11.30 25.22  3.39 30.68 34.34 25.40 29.54 40.19 38.40 40.43 35.12 39.02 51.46 48.14
SLED:   4.17 28.83  8.96  4.51 36.94 10.57 11.29 25.32  3.37 30.91 34.66 25.50 29.75 40.71 38.84 40.93 35.42 39.45 52.23 48.76
SLT:   30.19 10.18  5.45 37.04 11.93 12.69 27.04  4.68 32.07 35.32 27.20 31.04 40.13 38.74 40.46 36.10 39.36 49.85 47.33
CLIC:   0.07  2.09  2.05  0.24  0.48  1.53  5.87  1.53  3.01  1.66 10.20  3.79  3.91  4.91  4.66  5.04 12.45 11.89
CLED:   1.68  0.67  0.21  0.38  0.67  4.16  0.58  1.02  0.72  0.41  1.15  1.22  1.49  1.58  1.58  3.50  3.51
CLT:    2.55  1.93  2.08  2.51  1.74  2.47  2.80  2.55  2.34  2.89  2.95  3.14  3.20  3.21  4.61  4.62
AL1IC:  0.45  0.23  0.05  6.58  0.30  1.34  0.19  0.85  2.11  2.32  3.54  3.33  3.73 13.33 12.31
AL1ED:  0.17  0.45  4.67  0.35  0.85  0.51  0.16  0.99  1.08  1.39  1.49  1.49  3.69  3.71
AL2IC:  0.24  4.93  0.13  0.65  0.30  0.07  0.80  0.89  1.21  1.32  1.32  3.62  3.64
AL2ED:  6.35  0.26  0.88  0.11  0.67  1.26  1.45  2.20  2.30  2.40  7.96  7.77
AL2T:   6.40  6.90  6.40  6.22  7.09  7.15  7.43  7.45  7.50  9.53  9.52
WIC:    1.41  0.39  0.49  2.00  2.19  3.17  3.12  3.37 10.77 10.27
WED:    0.75  1.90  0.44  0.72  1.75  1.87  2.01 10.03  9.48
WT:     0.79  1.12  1.31  2.06  2.17  2.26  7.83  7.65
KMS:    2.53  2.70  3.67  3.57  3.84 11.08 10.60
KMR:    0.37  1.61  1.72  1.91 12.02 10.97
KMG:    1.14  1.34  1.46 10.63  9.85
NAS:    0.40  0.39  9.79  9.03
NAR:    0.06  7.36  7.02
NAG:    8.69  8.12
PCOR:   0.15
926          JOURNAL OF CLIMATE          Volume 8
a. The t test

Results of the biserial correlation matching from the resampling are in the form of Y, a 10 X 22 matrix, where 10 refers to the number of replications and the 22 columns denote the various CA methods. Therefore, this significance test is used to distinguish the differences of the means among the column vectors. Pairwise t tests are applied under the assumption that each column variable is normally distributed. The t statistic between any two column vectors, i and j, is calculated by

$$t_{ij} = \frac{\bar{Y}_i - \bar{Y}_j}{\left[\dfrac{(n_i - 1)s_i^2 + (n_j - 1)s_j^2}{n_i + n_j - 2}\right]^{1/2}\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)^{1/2}}, \tag{B.1}$$

where $\bar{Y}_i$ and $\bar{Y}_j$ are the means of the two vectors, $s_i^2$ and $s_j^2$ are their sample variances, and $n_i$ and $n_j$ denote the sample sizes (10 in this research). Results of this calculation are shown in Table B1 for all resamplings for sample sizes of 50, 100, 250, and 500. The significance value at the $\alpha = 0.05$ level, $t_\alpha$, equals 2.101. If $t_{ij}$ in Table B1 is greater than $t_\alpha$, the difference between methods i and j is statistically significant.

b. MSD test

Minimum significant difference (MSD) is a newer statistical test, popular in biological research, that compares the means of vectors. We follow the procedure outlined in Sokal and Rohlf (1981). The general formula of MSD can be expressed by

$$\mathrm{MSD} = (\text{critical value}) \times \mathrm{SE}, \tag{B.2}$$

where the critical value refers to the statistical distribution of the MSD test (studentized range), and SE is the appropriate standard error of the sample data. The two means of column vectors (representing different CA/PC methods) are significantly different when their difference is greater than MSD.

The studentized range $Q_\alpha[k, \nu]$, a special statistic for significance testing, is used as the critical value; $\alpha$ is the significance level, k refers to the number of column vectors, and $\nu$ is the degrees of freedom of the standard deviation. The $Q_\alpha[k, \nu]$ for k normally distributed variables is taken from Rohlf and Sokal (1981, Table 18), where k = 22, $\nu$ = 20, and $\alpha$ = 0.05. The standard error is

$$\mathrm{SE} = \left(\frac{\mathrm{MS}_{\text{within}}}{n}\right)^{1/2},$$

where $\mathrm{MS}_{\text{within}}$ is the within-group mean square and n (10) is the sample size. Therefore, MSD becomes

$$\mathrm{MSD} = Q_\alpha[k, \nu]\left(\frac{\mathrm{MS}_{\text{within}}}{n}\right)^{1/2}. \tag{B.3}$$

A convenient method for systematically testing all pairs of means is to compute upper- and lower-comparison limits for each mean, such that two means are significantly different if, and only if, their limits do not overlap (Sokal and Rohlf 1981). The upper and lower limits, respectively, are given by $u_i = \bar{Y}_i + \mathrm{MSD}/2$ and $l_i = \bar{Y}_i - \mathrm{MSD}/2$.
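The pairwise t test of Eq. (B.1) and the MSD comparison limits can be sketched as follows. This is illustrative only; the studentized-range critical value must still be taken from tables, and the within-group mean square is supplied by the caller.

```python
import math

def pooled_t(y_i, y_j):
    """Two-sample pooled t statistic, as in Eq. (B.1), for two columns of
    resampled biserial-correlation matches."""
    ni, nj = len(y_i), len(y_j)
    mi, mj = sum(y_i) / ni, sum(y_j) / nj
    ssi = sum((y - mi) ** 2 for y in y_i)       # sums of squared deviations
    ssj = sum((y - mj) ** 2 for y in y_j)
    sp2 = (ssi + ssj) / (ni + nj - 2)           # pooled variance
    return abs(mi - mj) / math.sqrt(sp2 * (1 / ni + 1 / nj))

def msd_limits(y, q_crit, ms_within):
    """Comparison limits mean +/- MSD/2, with MSD = q_crit * sqrt(MSw/n);
    two methods differ significantly iff their intervals do not overlap."""
    n = len(y)
    msd = q_crit * math.sqrt(ms_within / n)
    m = sum(y) / n
    return (m - msd / 2, m + msd / 2)
```

With 10 replications per method, comparing a computed `pooled_t` value against 2.101 (the two-tailed 5% point for 18 degrees of freedom) reproduces the decision rule used with Table B1.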
Table B2. MSD test results at the α = 0.05 level. Columns refer to means and the lower and upper limits of a significant difference.

          Sample size 50           Sample size 100          Sample size 250          Sample size 500
Method    mean    l      u        mean    l      u         mean    l      u         mean    l      u
SLIC     .2118  .1953  .2283     .2659  .2504  .2813      .3102  .3013  .3192      .3489  .3356  .3621
SLED     .2094  .1929  .2259     .2659  .2504  .2813      .3102  .3013  .3192      .3496  .3364  .3628
SLT      .2088  .1923  .2253     .2485  .2330  .2640      .2886  .2796  .2976      .3096  .2964  .3228
CLIC     .4182  .4017  .4347     .5264  .5109  .5419      .5875  .5786  .5965      .6229  .6097  .6361
CLED     .4641  .4476  .4806     .5481  .5326  .5636      .5979  .5889  .6069      .6208  .6076  .6340
CLT      .4236  .4071  .4401     .4633  .4478  .4788      .5092  .5002  .5181      .5360  .5228  .5492
AL1IC    .4699  .4534  .4864     .5487  .5332  .5642      .6140  .6050  .6229      .6410  .6278  .6543
AL1ED    .4967  .4802  .5132     .5655  .5500  .5810      .6189  .6099  .6279      .6292  .6160  .6424
AL2IC    .4641  .4476  .4806     .5559  .5404  .5714      .6138  .6049  .6228      .6352  .6220  .6485
AL2ED    .4734  .4569  .4899     .5552  .5398  .5707      .6225  .6135  .6315      .6416  .6284  .6548
AL2T     .3367  .3202  .3532     .3858  .3703  .4013      .4304  .4214  .4393      .4487  .4355  .4620
WIC      .4476  .4311  .4641     .5356  .5201  .5511      .6108  .6018  .6198      .6385  .6252  .6517
WED      .5037  .4872  .5202     .5743  .5588  .5897      .6285  .6196  .6375      .6518  .6386  .6650
WT       .4992  .4827  .5157     .5691  .5537  .5846      .6233  .6143  .6323      .6431  .6299  .6563
KMS      .5204  .5039  .5369     .5541  .5386  .5695      .6152  .6062  .6242      .6335  .6202  .6467
KMR      .5234  .5069  .5399     .5775  .5620  .5930      .6318  .6228  .6407      .6551  .6419  .6684
KMG      .5213  .5048  .5378     .5778  .5623  .5932      .6341  .6252  .6431      .6576  .6444  .6708
NAS      .4869  .4704  .5034     .5922  .5767  .6077      .6479  .6389  .6568      .6655  .6523  .6788
NAR      .5048  .4883  .5213     .5919  .5764  .6074      .6479  .6390  .6569      .6688  .6556  .6820
NAG      .5068  .4903  .5233     .5925  .5771  .6080      .6483  .6393  .6573      .6683  .6551  .6815
PCOR     .5184  .5019  .5349     .6106  .5951  .6261      .6893  .6803  .6983      .7253  .7121  .7385
PCOB     .5168  .5003  .5333     .6118  .5963  .6273      .6902  .6812  .6991      .7262  .7130  .7395
[Fig. C1]
Table C1. Point biserial correlation and Euclidean distance matches for the means of the 8-25 cluster solutions and the 16-cluster solution. Methods are ranked from best match (1) to worst match (22).

            Mean PBC match    PBC match for        Mean PBED match    PBED match for
                              16-cluster pattern                      16-cluster pattern
Algorithm   Value   Rank      Value   Rank         Value   Rank       Value   Rank
SLIC        .392    20        .404    20            981     3          969     3
SLED        .369    21        .356    21           1031    16         1049    17
SLT         .326    22        .335    22           1264    22         1259    22
CLIC        .650    16        .682     8            988     4          980     8
CLED        .645    17        .652    17           1010    14         1013    13
CLT         .548    18        .537    18           1116    19         1100    19
AL1IC       .668    14        .686     7            968     2          971     4
AL1ED       .672    12        .671    15            999     9          994    11
AL2IC       .670    13        .672    14            992     5          977     7
AL2ED       .676    10        .677    12            999     9          976     6
AL2T        .485    19        .495    19           1147    21         1132    21
WIC         .680     8        .681    10           1001    12          967     2
WED         .675    11        .675    13            999     9          990    10
WT          .680     8        .682     8           1036    17         1031    16
KMS         .664    15        .663    16            960     1          926     1
KMR         .686     7        .680    11            996     7         1024    14
KMG         .689     6        .694     4            997     8          974     5
NAS         .698     3        .694     4           1016    15         1029    15
NAR         .692     5        .695     3            995     6          998    12
NAG         .695     4        .690     6           1004    13          981     9
PCOR        .759     2        .782     1           1134    20         1106    20
PCOB        .761     1        .775     2           1097    18         1066    18
PBED (point biserial Euclidean distance) field (small close to the cluster, larger farther away).
4) The PBED between the data field (e.g., Fig. C1a) and the readjusted cluster pattern (e.g., Fig. C1c) is calculated. The statistics for each cluster within a so-
lution (i.e., 16 clusters) are collected. The statistics for
all solutions (i.e., 8-25 clusters retained) are also cal-
culated. These will now be presented for the 16-cluster
solution to match to Fig. 6 and for the mean after col-
lapsing all cluster solutions (8-25).
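The distance-based matching in step 4 can be sketched in a few lines. This is only an illustration of the idea, not the authors' code: in particular, the rescaling of the binary cluster pattern to the units of the data field (the "readjustment") is an assumption here, with in-cluster stations set to the field minimum and the remaining stations to the field maximum.

```python
import numpy as np

def pbed_match(field, membership):
    """Euclidean-distance match between a station field and a cluster pattern.

    field      : continuous values at each station (e.g., interstation ED).
    membership : 1 for stations assigned to the cluster, 0 otherwise.
    Smaller return values indicate a better match, as in the appendix.
    """
    field = np.asarray(field, dtype=float)
    membership = np.asarray(membership, dtype=float)
    # Hypothetical "readjustment": map the 0/1 pattern onto the field's range,
    # so in-cluster stations take the minimum value and the rest the maximum.
    pattern = np.where(membership == 1, field.min(), field.max())
    return float(np.sqrt(np.sum((field - pattern) ** 2)))
```

A cluster whose members coincide with the low-distance stations scores near zero; a misplaced cluster scores large, which is the sense in which smaller PBED means a better match.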
In a similar fashion, we calculated the point biserial correlation (PBC) matches. This was accomplished by matching the input interstation correlation field for the cluster centroid (i.e., Fig. C1d) to the appropriate cluster (i.e., Fig. C1b). These results are presented both for the 16-cluster solution, to match Fig. 6, and for the mean after collapsing all cluster solutions (8-25).
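The point biserial correlation itself is simply a Pearson correlation in which one of the two variables is dichotomous (here, cluster membership). A minimal sketch, using population standard deviations; the function below is illustrative, not the authors' implementation:

```python
import numpy as np

def point_biserial(membership, field):
    """Point biserial correlation between 0/1 cluster membership and a field.

    Algebraically identical to the Pearson correlation of the two series
    when one of them is dichotomous.
    """
    membership = np.asarray(membership, dtype=float)
    field = np.asarray(field, dtype=float)
    m1 = field[membership == 1].mean()  # mean field value inside the cluster
    m0 = field[membership == 0].mean()  # mean field value outside the cluster
    p = membership.mean()               # proportion of stations in the cluster
    return (m1 - m0) / field.std() * np.sqrt(p * (1.0 - p))
```

High values mean the cluster's stations sit where the correlation field is high, which is why PBC rewards spatially coherent regions.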
Results of both PBED and PBC matching are reported in Table C1 for each cluster and dissimilarity algorithm. The PBC matches ranged from a low of 0.326 (SLT) to a highest (best) match of 0.761 (PCOB) for the mean matches across all 18 cluster solutions. Interpretation of the PBED match is not as straightforward, as the values can range from 0 to +oo. Smaller values of PBED indicate a better match. For the mean values (Table C1), the PBED ranges from a worst match of 1264 (SLT) to a best match of 960 (KMS). The top five ranked solutions are KMS, AL1IC, SLIC, CLIC, and AL2IC. Another finding was that PBED tended to select the IC dissimilarity measure over ED or T (Table C1). In the case of SL clustering, the rankings were striking: SLIC (3), SLED (16), SLT (22). For CL, AL1, and AL2 the same held true. In the only method (W) for which IC was not best, there was a virtual tie (WIC = 1001, WED = 999). Hence, we feel PBED may be biased. We found such an array questionable and gathered statistics on the 16-cluster solutions to cross reference to Fig. 6. Although there are some minor changes in the rankings, the top five algorithms matched by PBED are KMS, WIC, SLIC, AL1IC, and KMG. Investigation of the performance in PBED matching revealed little in common with Fig. 6 or Table 1, as SLIC appears totally unreasonable for precipitation regionalization, and WIC and AL1IC have problems with fragmenting. Moreover, the best matches were for solutions that tended to have geographically compact clusters. This seems to be tied to the biserial nature of the clusters and the nearly step-function character of the interstation ED field (i.e., Fig. C1a), where only approximately 100 km away from the basepoint (closest station) the ED averages over 50% of the maximum ED. Moreover, at a distance of two stations away from the base station (cluster centroid) the ED was about as large as that of remote locations (i.e., Fig. C1a has an ED value of 85 in northern Ontario and 81 two stations to the north). Such a highly nonlinear response of precipitation ED with geographical distance, and its sharp gradient, is not consistent with our previous view of coherent regions of precipitation. Comparison of ED to the interstation correlation field (Fig. C1d) shows a different characteristic, where the correlation
drops from 1.0 (base station-cluster centroid) to ap-
proximately 0.4 at a distance of two stations. At remote
locations (i.e., northern Ontario) the correlations are
effectively zero.
We next examined the PBC matches (Table C1) for the 16-cluster solution to compare with Table 1 and Fig. 6. The top five PBC solutions are PCOR, PCOB, NAR, NAS, and KMG. These show little or no problem with fragmenting and appear to be relatively consistent with respect to cluster size, exhibiting no gross pathological tendencies (e.g., SL chains badly). It can be seen that the clustering algorithms using inverse correlation (CLIC, AL1IC, AL2IC, and WIC) perform relatively worse due to aggregation error (Table 1). Hence, the biserial correlation matching does not bias toward the correlation dissimilarity measure; the effect of the dissimilarity coefficient under PBC does not appear biased, as it does in the case of PBED. For PBC matches (Table C1), the ranking appears relatively uniform around the cluster method (i.e., nonhierarchical best, Ward's best hierarchical). Moreover, solutions with chaining problems (i.e., SL) or highly unequal areas due to algorithms (i.e., KMS) are penalized under PBC matching, whereas PBED chose KMS best, despite problems in the seed locations. When considering all of these factors, and in light of Milligan's (1980) experimentation in choosing PBC as a matching coefficient, we feel that PBC is superior to PBED as the arbiter of success for this study.
REFERENCES
Acreman, M. C., and C. D. Sinclair, 1986: Classification of drainage basins according to their physical characteristics: An application for flood frequency analysis in Scotland. J. Hydrol., 84, 365-380.
Alsop, T. J., 1989: The natural seasons of western Oregon and Washington. J. Climate, 2, 888-896.
Ambrozy, P., J. Bartholy, and O. Gulyas, 1984: A system of seasonal macrocirculation patterns for the Atlantic-European region. Idojaras, 88, 121-133.
Anderberg, M. R., 1973: Cluster Analysis for Applications. Academic Press, 359 pp.
Anyadike, R. N. C., 1987: A multivariate classification and regionalization of west African climates. J. Climatol., 7, 157-164.
Ayoade, J. O., 1977: On the use of multivariate techniques in climatic classification and regionalization. Arch. Meteor. Geophys. Bioklimatol., Ser. B, 24, 257-267.
Bartholy, J., 1992: Meteorological choices to clustering precipitation data series and a case study for Hungary. Preprints, 12th Conf. on Probability and Statistics in the Atmospheric Sciences, Toronto, ON, Canada, Amer. Meteor. Soc., J123-J124.
Bartholy, J., and M. Kaba, 1985: Further development of seasonal climate forecasts for the territory of Hungary. Idojaras, 89, 185-193.
Benjamini, Y., 1988: Opening the box of a boxplot. Amer. Stat., 42, 257-262.
Blashfield, R. K., 1976: Mixture model tests of cluster analysis: Accuracy of four agglomerative hierarchical methods. Psychol. Bull., 83, 377-388.
Bonell, M., and G. Sumner, 1992: Autumn and winter daily precipitation areas in Wales, 1982-1983 to 1986-1987. Int. J. Climatol., 12, 77-102.
Capaldo, M., C. De Simone, and C. Finizio, 1985: Application of cluster analysis to monthly mean temperature data over Italy. Rev. Meteor. Aeronaut., 45, 237-246.
Changnon, S., 1989: Relations of thunderstorms and cloud-to-ground lightning frequencies. J. Climate, 2, 897-921.
Cheng, X., and J. M. Wallace, 1993: Cluster analysis of the Northern Hemisphere wintertime 500-hPa height field: Spatial patterns. J. Atmos. Sci., 50, 2674-2696.
Cormack, R. M., 1971: A review of classification. J. Roy. Stat. Soc., Ser. A, 134, 321-367.
Crecelius, E. A., E. A. Lepel, J. C. Laul, L. A. Rancitelli, and R. L. McKeever, 1980: Background air particulate chemistry near Colstrip, Montana. Environ. Sci. Technol., 14, 422-428.
Crutcher, H. L., C. J. Neumann, and J. M. Pelissier, 1982: Tropical cyclone forecast errors and the multimodal bivariate normal distribution. J. Appl. Meteor., 21, 978-987.
Cunningham, K. M., and J. C. Ogilvie, 1972: Evaluation of hierarchical grouping techniques: A preliminary study. Comput. J., 15, 209-213.
Davis, R. E., 1991: A synoptic climatological analysis of winter visibility trends in the mideastern United States. Atmos. Environ., 25B, 165-175.
Davis, R. E., and L. S. Kalkstein, 1990: Development of an automated spatial synoptic climatological classification. Int. J. Climatol., 10, 769-794.
Davis, R. E., and D. R. Walker, 1992: An upper-air synoptic climatology of the western United States. J. Climate, 5, 1449-1467.
Dorling, S. R., T. D. Davies, and C. E. Pierce, 1992: Cluster analysis: A technique for estimating the synoptic meteorological controls on air and precipitation chemistry: Method and applications. Atmos. Environ., 26A, 2575-2581.
Duran, B. S., and P. L. Odell, 1974: Cluster Analysis: A Survey. Springer-Verlag, 137 pp.
Easterling, D. R., 1989: Regionalization of thunderstorm rainfall in the contiguous United States. Int. J. Climatol., 9, 567-579.
Edelbrock, C., 1979: Comparing the accuracy of hierarchical clustering algorithms: The problem of classifying everybody. Multivariate Behavioral Res., 14, 367-384.
Edelbrock, C., and B. McLaughlin, 1980: Hierarchical cluster analysis using intraclass correlations: A mixture model study. Multivariate Behavioral Res., 15, 299-318.
Everitt, B., 1974: Cluster Analysis. John Wiley and Sons, Inc., 122 pp.
Everitt, B., 1979: Unsolved problems in cluster analysis. Biometrics, 35, 169-181.
Felsenstein, J., 1985: Confidence limits on phylogenies: An approach using the bootstrap. Evolution, 39, 783-791.
Fernau, M. E., and P. J. Samson, 1990a: Use of cluster analysis to define periods of similar meteorology and precipitation chemistry in eastern North America. Part I: Transport patterns. J. Appl. Meteor., 29, 735-750.
Fernau, M. E., and P. J. Samson, 1990b: Use of cluster analysis to define periods of similar meteorology and precipitation chemistry in eastern North America. Part II: Precipitation patterns and pollutant deposition. J. Appl. Meteor., 29, 751-761.
Fovell, R. G., and M.-Y. C. Fovell, 1993: Climate zones of the conterminous United States defined using cluster analysis. J. Climate, 6, 2103-2135.
French, V., and S. LeDuc, 1983: Strategies for determining climatic groups. Preprints, Eighth Conf. on Probability and Statistics in Atmospheric Science, Hot Springs, AR, Amer. Meteor. Soc., 111-114.
Gadgil, S., and N. Iyengar, 1980: Cluster analysis of rainfall stations of the Indian Peninsula. Quart. J. Roy. Meteor. Soc., 106, 873-886.
Gadgil, S., and N. V. Joshi, 1983: Climatic clusters of the Indian region. J. Climatol., 3, 47-63.
Galliani, G., and F. Filippini, 1985: Climatic clusters in a small area. J. Climatol., 5, 487-501.
Goosens, C., 1985: Principal component analysis of Mediterranean rainfall. J. Climatol., 5, 379-388.
Gorham, E., F. B. Martin, and J. T. Litzau, 1984: Acid rain: Ionic correlations in the eastern United States. Science, 225, 407-409.
Gower, J. A., 1967: A comparison of some methods of cluster analysis. Biometrics, 23, 623-627.
Guttman, N. B., 1992: The use of L-moments in the determination of regional precipitation climates. Preprints, 12th Conf. on Probability and Statistics in the Atmospheric Sciences, Toronto, ON, Canada, Amer. Meteor. Soc., 231-236.
Hartigan, J. A., 1975: Clustering Algorithms. John Wiley & Sons, Inc., 351 pp.
Hoaglin, D. C., B. Iglewicz, and J. W. Tukey, 1986: Performance of some resistant rules in outlier labeling. J. Amer. Stat. Assoc., 81, 991-999.
Johnson, R. A., and D. W. Wichern, 1982: Applied Multivariate Statistical Analysis. Prentice-Hall, 594 pp.
Johnson, S. C., 1967: Hierarchical clustering schemes. Psychometrika, 32, 261-274.
Jolliffe, I. T., B. Jones, and B. J. T. Morgan, 1986: Comparison of cluster analyses of the English personal social services. J. Roy. Stat. Soc. A, 149, 253-270.
Kalkstein, L. S., G. Tan, and J. A. Skindlov, 1987: An evaluation of three clustering procedures for use in synoptic climatological classification. J. Climate Appl. Meteor., 26, 717-730.
Key, J., and R. G. Crane, 1986: A comparison of synoptic classification schemes based on 'objective' procedures. J. Climatol., 6, 375-388.
Kikuchihara, H., 1984: The regional division of Japan by precipitation. Geophys. Mag., 41, 61-157.
Kim, D., 1992: Cluster analysis of radiance data measured by satellite and computed from forecast profiles. Preprints, Fifth Int. Meeting on Statistical Climatology, Toronto, ON, Canada, Environment Canada, Atmospheric Environment Service, J109-J112.
Kuiper, F. K., and L. A. Fisher, 1975: A Monte Carlo comparison of six clustering procedures. Biometrics, 31, 777-783.
Lagarec, D., 1986: La régionalisation des climats thermiques du sud Yukon. Climatol. Bull., 20, 3-20.
Lance, G. N., and W. T. Williams, 1967a: A general theory of classificatory sorting strategies. I. Hierarchical systems. Comput. J., 9, 373-380.
Lance, G. N., and W. T. Williams, 1967b: A general theory of classificatory sorting strategies. II. Clustering systems. Comput. J., 10, 271-277.
Leach, J. H., 1980: Limnological sampling intensity in Lake St. Clair in relation to distribution of water masses. J. Great Lakes Res., 6, 141-145.
LeDrew, E. L., 1983: The dynamic climatology of the Beaufort to Laptev sea sector of the polar basin for summers of 1975 and 1976. J. Climatol., 3, 335-359.
LeDrew, E. L., 1985: The dynamic climatology of the Beaufort to Laptev sea sector of the polar basin for winters of 1975 and 1976. J. Climatol., 5, 253-272.
LeDuc, S., 1992: Northern Hemisphere geopotential heights: A Markov chain. Preprints, 12th Conf. on Probability and Statistics in the Atmospheric Sciences, Toronto, ON, Canada, Amer. Meteor. Soc., 167-171.
Maheras, P., 1984: Weather-type classification by factor analysis in the Thessaloniki area. J. Climatol., 4, 437-443.
Mather, P. M., 1976: Computational Methods of Multivariate Analysis in Physical Geography. John Wiley and Sons, Inc., 532 pp.
Maryon, R. H., and A. M. Storey, 1985: A multivariate statistical model for forecasting anomalies of half-monthly mean surface pressure. J. Climatol., 5, 561-578.
McBoyle, G. R., 1971: Climatic classification of Australia by computer. Aust. Geogr. Studies, IX, 1-14.
McQuitty, L. L., 1967: A mutual development of some typological theories and pattern-analysis methods. Educ. Psychol. Meas., 17, 21-46.
Milligan, G. W., 1980: An examination of the effect of six types of error perturbation on 15 clustering algorithms. Psychometrika, 45, 325-342.
Milligan, G. W., and M. C. Cooper, 1985: An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159-179.
Mo, K., and M. Ghil, 1988: Cluster analysis of multiple planetary flow regimes. J. Geophys. Res., 93(D9), 10 927-10 952.
Mojena, R., 1977: Hierarchical grouping methods and stopping rules: An evaluation. Comput. J., 20, 359-363.
Mosley, M. P., 1981: Delimitation of New Zealand hydrologic regions. J. Hydrol., 49, 173-192.
O'Lenic, E. A., and R. E. Livezey, 1988: Practical considerations in the use of rotated principal component analysis (RPCA) in diagnostic studies of upper-air height fields. Mon. Wea. Rev., 116, 1682-1689.
Punj, G., and D. W. Stewart, 1983: Cluster analysis in marketing research: Review and suggestions for application. J. Marketing Res., 20, 134-148.
Puvaneswaran, M., 1990: Climatic classification for Queensland using multivariate statistical techniques. Int. J. Climatol., 10, 591-608.
Reich, V. T., 1986: Die räumliche Struktur des Niederschlagsfeldes: hierarchische Gruppierungsverfahren und Dendrogramme. Z. Meteor., 36, 38-53.
Richman, M. B., 1986: Rotation of principal components. J. Climatol., 6, 293-333.
Richman, M. B., 1987: Rotation of principal components: A reply. J. Climatol., 7, 511-520.
Richman, M. B., 1993: Comments on "The effect of domain shape on principal components analyses." Int. J. Climatol., 13, 203-218.
Richman, M. B., and P. J. Lamb, 1985: Climatic pattern analysis of 3- and 7-day summer rainfall in the central United States: Some methodological considerations and a regionalization. J. Climate Appl. Meteor., 24, 1325-1343.
Richman, M. B., and P. J. Lamb, 1987: Pattern analysis of growing season precipitation in southern Canada. Atmos.-Ocean, 25, 137-158.
Richman, M. B., and W. E. Easterling, 1988: Procrustes target analysis: A multivariate tool for identification of climate fluctuations. J. Geophys. Res., 93(D9), 10 989-11 003.
Richman, M. B., and P. J. Lamb, 1988: On the modes of growing season precipitation over central North America. Proc. of the 12th Annual Climate Diagnostics Workshop, Salt Lake City, UT, U.S. Dept. of Commerce, NOAA, 284-291.
Richman, M. B., and P. J. Lamb, 1989: On the interannual variability of growing season precipitation in central North America. Proc. of the 13th Annual Climate Diagnostics Workshop, Cambridge, MA, U.S. Dept. of Commerce, NOAA, 133-142.
Richman, M. B., and X.-F. Gong, 1993: Relationships between the definition of the hyperplane width and the fidelity of the resulting PC loading patterns. Proc. of the 17th Annual Climate Diagnostics Workshop, Norman, OK, U.S. Dept. of Commerce, NOAA, 372-379.
Richman, M. B., P. J. Lamb, and J. R. Angel, 1991: Relationships between monthly precipitation over central and eastern North America and the Southern Oscillation. Proc. of the 15th Annual Climate Diagnostics Workshop, U.S. Dept. of Commerce, NOAA, 373-383.
Richman, M. B., J. R. Angel, and X.-F. Gong, 1992: Determination of dimensionality in eigenanalysis. Preprints, Fifth Int. Meeting on Statistical Climatology, Toronto, ON, Canada, Environment Canada, Atmospheric Environment Service, 229-235.
Rohlf, F. J., and R. R. Sokal, 1981: Statistical Tables. Freeman, 219 pp.
Ronberg, B., and W. C. Wang, 1987: Climate patterns derived from Chinese proxy precipitation records: An evaluation of the station networks and statistical techniques. J. Climatol., 7, 391-416.
Rousseeuw, P. J., M.-P. Derde, and L. Kaufman, 1989: Principal components of a fuzzy clustering. Trends Analyt. Chem., 8, 249-250.
Sanchez, M. L., and M. C. Ramos, 1987: Application of cluster analysis to identify sources of airborne particles. Atmos. Environ., 21, 1521-1527.
Sanchez, M. L., D. Pascual, C. Ramos, and T. Perez, 1990: Forecasting particulate pollutant concentrations in a city from meteorological variables and regional weather patterns. Atmos. Environ., 24A, 1509-1519.
Saucy, D. A., J. R. Anderson, and P. R. Buseck, 1987: Cluster analysis of samples from the Norwegian Arctic. Atmos. Environ., 21, 1649-1657.
Skaggs, R. H., 1969: Analysis and regionalization of the diurnal distribution of tornadoes in the United States. Mon. Wea. Rev., 97, 103-115.
Sneath, P. H. A., and R. R. Sokal, 1973: Numerical Taxonomy. Freeman, 573 pp.
Sokal, R. R., and P. H. A. Sneath, 1963: Principles of Numerical Taxonomy. Freeman, 359 pp.
Sokal, R. R., and F. J. Rohlf, 1981: Biometry. 2d ed. W. H. Freeman and Company, 859 pp.
Steinhorst, R. K., and R. E. Williams, 1985: Discrimination of groundwater sources using cluster analysis, MANOVA, canonical and discriminant analysis. Water Resour. Res., 21, 1149-1156.
Stone, R. C., 1989: Weather types at Brisbane, Queensland: An example of the use of principal components and cluster analysis. Int. J. Climatol., 9, 3-32.
Stone, R., and A. Auliciems, 1992: SOI phase relationships with rainfall in eastern Australia. Int. J. Climatol., 12, 625-636.
Stooksbury, D. E., and P. J. Michaels, 1990: Cluster analysis of southeastern climate stations. Res. Rep. No. 1, Southeast Regional Climate Center, South Carolina Water Resources Commission, Columbia, SC, 21 pp.
Tagami, Y., 1985: Studies of Japanese climate based on macro-scale airflow patterns. Geogr. Rep. Tokyo Metropolitan Univ., 20, 45-72.
Takeuchi, K., H. Yanai, and B. N. Mukherjee, 1982: The Foundations of Multivariate Analysis. John Wiley and Sons, Inc., 458 pp.
Thurstone, L. L., 1947: Multiple Factor Analysis. University of Chicago Press, 535 pp.
Tryon, R. C., 1939: Cluster Analysis. Edwards Brothers, 347 pp.
Tukey, J. W., 1977: Exploratory Data Analysis. Addison-Wesley, 304 pp.
Ward, J. H., 1963: Hierarchical grouping to optimize an objective function. J. Amer. Stat. Assoc., 58, 236-244.
White, D., M. Richman, and B. Yarnal, 1991: Climate regionalization and rotation of principal components. Int. J. Climatol., 11, 1-25.
White, D. A., 1988: Climate regionalization: A comparison of principal component analysis rotation schemes. Ph.D. thesis, The Pennsylvania State University, 519 pp.
White, E. J., 1981: Classification of climate in Great Britain. J. Environ. Manage., 13, 241-257.
White, E. J., and A. H. Perry, 1989: Classification of the climate of England and Wales based on agroclimatic data. Int. J. Climatol., 9, 271-291.
Willmott, C. J., 1978: P-mode principal components analysis, grouping and precipitation regions in California. Arch. Meteor. Geophys. Bioklimatol., Ser. B, 26, 277-295.
Winkler, J. A., 1985: Regionalization of the diurnal distribution of summertime heavy precipitation. Preprints, Sixth Conf. on Hydrometeorology, Indianapolis, IN, Amer. Meteor. Soc., 9-16.
Wolter, K., 1987: The Southern Oscillation in surface circulation and climate over the tropical Atlantic, eastern Pacific, and Indian Oceans as captured by cluster analysis. J. Climate Appl. Meteor., 26, 540-558.
Wu, C. F. J., 1986: Jackknife, bootstrap and other resampling methods in regression analysis. Ann. Stat., 14, 1261-1350.
TECHNICAL REPORT DATA
1. Report No.: EPA/600/A-96/089
4. Title and Subtitle: On the Application of Cluster Analysis to Growing Season Precipitation Data in North America East of the Rockies
7. Author(s): Xiaofeng Gong and Michael B. Richman
9. Performing Organization Name and Address: Cooperative Institute for Mesoscale Meteorological Studies, University of Oklahoma, Norman, Oklahoma
11. Contract/Grant No.: CR-816318-02-0
12. Sponsoring Agency Name and Address: National Exposure Research Laboratory, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC 27711
13. Type of Report and Period Covered: Journal Article
14. Sponsoring Agency Code: EPA/600/9
15. Supplementary Notes: 1995 American Meteorological Society
16. Abstract: Cluster analysis (CA) has been applied to geophysical research for over two decades, although its popularity has increased dramatically over the past few years. To date, systematic methodological reviews have not appeared in the geophysical literature. In this paper, after a review of a large number of applications of cluster analysis, an intercomparison of various cluster techniques was carried out on a well-studied dataset (7-day precipitation data from 1949 to 1987 in central and eastern North America). The cluster methods tested were single linkage, complete linkage, average linkage between groups, average linkage within a new group, Ward's method, k-means, the nucleated agglomerative method, and rotated principal component analysis. Three different dissimilarity measures (Euclidean distance, inverse correlation, and theta angle) and three initial partition methods were also tested on the hierarchical and nonhierarchical methods, respectively. Twenty-two of the 23 cluster algorithms yielded natural grouping solutions. Monte Carlo simulations were undertaken to examine the reliability of the cluster solutions. This was done by bootstrap resampling from the full dataset with four different sample sizes, then testing significance by the t test and the minimum significant difference test.
18. Distribution Statement: Release to public
19. Security Class (This Report): Unclassified
20. Security Class (This Page): Unclassified