United States
Environmental Protection
Agency
Atmospheric Research and Exposure
Assessment Laboratory
Research Triangle Park, NC 27711
Research and Development
EPA/600/SR-94/081
June 1994
EPA Project Summary
On the Feasibility of Using
Satellite Derived Data to Infer
Surface-Layer Ozone
Concentration Patterns
Brian K. Eder
Principal Component Analysis (PCA)
was applied to six years (1985-1990) of
surface and satellite ozone (O3) data
collected over the eastern United States
to determine whether O3 measurements
derived from satellites could be used
to infer surface-layer concentrations.
Examination of the spatial and tempo-
ral characteristics associated with the
first nonrotated principal components
(which are the dominant components,
explaining 37.95% and 41.25% of the
total variance of the surface and satel-
lite data sets, respectively) revealed
considerable coherence between the
data sets, suggesting that on continental
scales, seasonal O3 patterns derived
from the satellite data replicate quite
well those of the surface. This coher-
ence diminishes, however, when daily
patterns are compared. Upon orthogo-
nal rotation, the PCA delineated four
contiguous and statistically unique sub-
regions with each data set (the North-
west, Northeast, Southwest, and South-
east) that were very similar, suggest-
ing that the satellite data may be able
to discern O3 patterns on spatial scales
as small as 1000 km. The temporal char-
acteristics associated with the South-
west and Southeast subregions exhib-
ited cross-data set similarities; how-
ever, those associated with the North-
west and Northeast subregions were
somewhat dissimilar.
This Project Summary was developed
by EPA's Atmospheric Research and
Exposure Assessment Laboratory, Re-
search Triangle Park, NC, to announce
key findings of the research project
that is fully documented in a separate
report of the same title (see Project
Report ordering Information at back).
Introduction
Traditionally, ozone (O3) has been characterized
as an urban-scale pollutant. Increasingly,
however, it has been recognized by scientists
as a regional and even global-scale phenomenon,
as high concentrations are routinely observed
over vast, non-urban areas of most industrialized
countries, where forest retardation and crop
injury are growing environmental concerns.
Daily maximum O3 concentrations in these areas
are often comparable to those found in urban
areas, and daily average concentrations can
even exceed urban concentrations due to
a lack of nitric oxide (NO) scavenging.
Coinciding with these realizations has been
the advent of satellite-derived O3 measurements.
Using data derived from the Total
Ozone Mapping Spectrometer (TOMS),
which measures total column O3 concentrations,
and the Stratospheric Aerosol and
Gas Experiment (SAGE), which measures
the stratospheric O3 concentration, scientists
have been able to estimate the tropospheric
(residual) concentration and, under
certain meteorological conditions,
estimate the surface-layer O3 concentration.
The appropriateness of this remote
sensing approach will be evaluated using
Principal Component Analysis (PCA) as
applied to surface data obtained from
EPA's Aerometric Information and Retrieval
System (AIRS) and residual O3 data
from the National Space Science Data
Center (NSSDC) at NASA's Goddard
Space Flight Center. This analysis, which
employs data from the six-year period
1985-1990, will enable us to determine to
what extent, if any, the major modes of
spatial and temporal surface O3 variability
are being captured by the satellite data.
The advantages of employing a technique
such as PCA are numerous. First, we are
dealing with widely varying spatial scales
which prohibit point-by-point comparisons.
The surface data obtained from the AIRS
network is representative of meso-scales
(~ 100 km), while the satellite data is
representative of macro-scales (> 100 km).
PCA circumvents this impediment by providing
similar spatial scale results that will
allow for pattern recognition and comparison
of O3 concentrations. Second, because
of the copious amount of data resulting
from such a large-scale study (nearly
65,000 surface observations and over
45,000 satellite observations) and because
the individual data tend to be erratic or
noisy, it is advantageous to employ an
analysis technique that identifies, through
a reduction of data, the recurring and independent
modes of variation within the
larger data sets. And finally, the analysis
of O3 characteristics and trends between
the data sets is based on an aggregation
of data from many stations (grid cells), as
opposed to individual stations (grid cells),
minimizing the effects of anomalous or
even erroneous data often associated with
a single observation.
Data
The surface O3 concentration (ppb) data
employed in this analysis were obtained
from AIRS, which operates under strict
monitoring criteria including multipoint calibrations,
independent audits, and data validation
based upon frequent zero, span,
and precision checks. The O3 measurements
were made during the "ozone season"
(June 15th through October 31st for
this study) using either chemiluminescence
analyzers, which are sensitive to light emitted
by the reaction between O3 and ethylene,
or ultraviolet photometers, which measure
the absorption of light by O3.
A major goal of this study was to establish
a complete, regionally-representative
surface O3 data base, unencumbered by
missing data or local-scale variability.
This goal was attained by applying several
selection criteria. First, the
daily 1-hour maximum concentration was
used to help minimize local-scale variability
because at the time of maximum surface
concentration (typically between 1
and 3 pm LST), the boundary layer is
generally uniformly mixed and the surface
concentration is therefore most representative
of the boundary layer concentration.
Additionally, both the primary standard, designed
to protect human health, and the
secondary standard, designed to protect
human welfare, established by EPA as
part of the NAAQS, are based on the
daily 1-hour maximum concentration. Sec-
ond, to avoid NO scavenging effects found
in close proximity to urban areas, only
those stations classified as either rural or
suburban and reporting a land use of ei-
ther forest, agriculture, or residential were
employed in the analysis. Rural stations
received highest priority; however, to meet
our third criterion, spatial completeness,
several suburban stations were also in-
cluded in the data base. Finally, only those
stations reporting a capture rate of 90.0%
or better for the study period were consid-
ered. These criteria resulted in the inclu-
sion of 77 stations across the eastern half
of the United States, the majority of which
(55) are classified as rural. Several com-
binations of "station-seasons" were exam-
ined before the optimum period of 1985-
1990 was selected. The total capture rate
for the period was > 95.0%. All missing
data were replaced using a linear interpo-
lation scheme, across time.
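
The capture-rate screen and the gap filling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in the study: the function names and the sample values are hypothetical, missing days are assumed to be coded as NaN, and NumPy's `np.interp` stands in for the unspecified "linear interpolation scheme."

```python
import numpy as np

def capture_rate(series):
    """Fraction of days with a valid (non-NaN) observation."""
    series = np.asarray(series, dtype=float)
    return np.count_nonzero(~np.isnan(series)) / series.size

def fill_missing_linear(series):
    """Fill NaN gaps by linear interpolation across time (day index).

    Note: np.interp holds the nearest valid value for gaps at either end.
    """
    series = np.asarray(series, dtype=float)
    days = np.arange(series.size)
    valid = ~np.isnan(series)
    filled = series.copy()
    filled[~valid] = np.interp(days[~valid], days[valid], series[valid])
    return filled

# Hypothetical daily 1-hour maximum O3 values (ppb) with two missing days.
daily_max = np.array([60.0, np.nan, 80.0, 90.0, np.nan,
                      70.0, 65.0, 62.0, 58.0, 55.0])
rate = capture_rate(daily_max)          # 8 of 10 days reported
filled = fill_missing_linear(daily_max)  # gaps replaced by neighbors' average
```

In this toy series the capture rate is 0.8, so a station like this would fail the 90.0% criterion; the interpolation step is shown only to illustrate how the remaining gaps could be filled.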
The TOMS and SAGE O3 data, measured
in Dobson Units (DU) (1 DU = 2.69
x 10^16 molecules of O3 cm^-2), were obtained
from the archived data sets available
at the NSSDC located at the Goddard
Space Flight Center. The column of O3 in
the troposphere, or the "residual O3," is determined
by subtracting the integrated
amount of O3 above the tropopause derived
from the SAGE profiles from the concurrent
amount of total O3 observed from the TOMS
measurements. These data, which cover
the eastern half of the United States for
this study, are gridded with a resolution of
5° longitude by 2.5° latitude, resulting in a
total of 54 grid cells.
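
The residual calculation reduces to a subtraction of two column amounts, which the short sketch below illustrates. The function names and the sample column amounts are hypothetical; only the conversion factor (1 DU = 2.69 x 10^16 molecules cm^-2) is taken from the text.

```python
DU_TO_MOLECULES_PER_CM2 = 2.69e16  # 1 Dobson Unit, as stated in the text

def tropospheric_residual(toms_total_du, sage_stratospheric_du):
    """Residual (tropospheric) O3 column in DU: total minus stratospheric."""
    return toms_total_du - sage_stratospheric_du

def du_to_column_density(du):
    """Convert a column amount in DU to molecules of O3 per cm^2."""
    return du * DU_TO_MOLECULES_PER_CM2

# Hypothetical column amounts for one grid cell on one day (not study data).
total_column_du = 310.0    # TOMS total column O3
stratospheric_du = 270.0   # SAGE O3 integrated above the tropopause
residual_du = tropospheric_residual(total_column_du, stratospheric_du)
residual_density = du_to_column_density(residual_du)
```

Because the residual is a small difference between two large numbers, modest errors in either instrument can translate into large relative errors in the residual, which is one reason an aggregated technique such as PCA is attractive here.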
Methodology
One of the main objectives of PCA is to
identify, through a reduction of data, the
recurring and independent modes of varia-
tion (signals) within large, noisy data sets,
thereby summarizing the essential infor-
mation of the data sets so that meaningful
and descriptive conclusions can be made.
The analysis sorts initially correlated data
into a hierarchy of statistically independent
modes of variation which explain
successively less and less of the total
variance.
The PCA of both the AIRS and TOMS-SAGE
residual data sets began with the
extraction of square, symmetrical correlation
matrices (R), having dimensions of
77 rows and 77 columns for the AIRS
data and 54 rows and 54 columns
for the satellite data, from their original
data matrices having dimensions of 77
[stations] x 834 [days] (or 64,218 observations)
and 54 [grid cells] x 834 [days]
(or 45,036 observations), respectively. By
using R and the identity matrix (I), of the
same dimensions, n = 77 (54) eigenvec-
tors can be derived that represent the
mutually orthogonal linear combinations
(modes of variation) of the matrix. Their
associated eigenvalues represent the
amount of total variance that is explained
by each of the eigenvectors. By retaining
only the first few eigenvector-eigenvalue
pairs, or principal components, a substan-
tial amount of the total variance can be
explained while ignoring the higher order
principal components that explain minimal
amounts of the total variance and can
therefore be viewed as noise. The number
of components to retain was determined
with the Scree Test, which suggested
four-component solutions for both the AIRS
and satellite data sets; accordingly, the first
four principal components were retained
for analysis and comparison.
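
The eigenvector extraction described above can be sketched with NumPy as follows. This is an illustrative sketch, not the study's code: the function name `correlation_pca` is hypothetical, the data are synthetic random numbers shaped like the AIRS matrix (77 stations x 834 days), and the Scree Test itself (a visual inspection of the eigenvalue curve) is not implemented; the choice of four retained components is simply taken from the text.

```python
import numpy as np

def correlation_pca(data):
    """PCA of a (n_sites, n_days) matrix via its correlation matrix R.

    Returns eigenvalues in descending order, the matching eigenvectors
    (as columns), and the fraction of total variance each one explains.
    """
    corr = np.corrcoef(data)                 # R: n_sites x n_sites
    eigvals, eigvecs = np.linalg.eigh(corr)  # solves (R - lambda*I) e = 0
    order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    explained = eigvals / eigvals.sum()      # trace of R equals n_sites
    return eigvals, eigvecs, explained

# Synthetic stand-in for the surface data: a shared signal plus noise.
rng = np.random.default_rng(0)
n_sites, n_days = 77, 834
common = rng.standard_normal(n_days)
data = 0.6 * common + 0.8 * rng.standard_normal((n_sites, n_days))

eigvals, eigvecs, explained = correlation_pca(data)
n_retained = 4                               # from the text, not computed here
retained_fraction = explained[:n_retained].sum()
```

`np.linalg.eigh` is used because R is symmetric; it returns mutually orthogonal eigenvectors, matching the "mutually orthogonal linear combinations" in the description.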
When the elements of each eigenvector
are multiplied by the square root of the
associated eigenvalue, one obtains the
principal component loading (L), which rep-
resents the correlation between the com-
ponent and the station or grid cell. When
these principal component loadings are
spatially mapped onto their respective sta-
tions (grid cells) for each component, isop-
leths of component loadings can be drawn,
which identify the major modes of spatial
variability.
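
As a rough check on the statement that a loading is the correlation between a component and a station, the sketch below computes loadings as eigenvector elements times the square root of the eigenvalue and compares one of them with a directly computed correlation. The function name and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def component_loadings(data, n_components):
    """Loadings L = eigenvector * sqrt(eigenvalue) for the leading components."""
    corr = np.corrcoef(data)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(eigvals[order])

# Synthetic data: 5 hypothetical stations sharing one regional signal.
rng = np.random.default_rng(1)
data = rng.standard_normal((5, 200)) + rng.standard_normal(200)
loadings = component_loadings(data, 2)

# Rebuild the first component's time series from standardized data ...
z = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)
first_vec = loadings[:, 0] / np.linalg.norm(loadings[:, 0])  # unit eigenvector
component_series = first_vec @ z

# ... and confirm that station 0's loading equals its correlation with it.
station0_corr = np.corrcoef(data[0], component_series)[0, 1]
```

Since each loading is a correlation, it lies between -1 and 1, which is what makes isopleth maps of loadings comparable between the surface and satellite data sets.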
Initially, the principal component analysis
replaced 77 stations (54 grid cells),
measured over 834 days, with four principal
components having no temporal measure.
Introducing the principal component
scores (PCs) provides a comparable
temporal measure for the principal
components over the same 834 days.
The principal components
are identified in terms of the original stations
(grid cells): the larger the loading,
the more important the station is in the
interpretation of the component. Therefore,
if a day has high values for the
stations with large loadings, then it should
have a large value on that component.
The scores have been standardized; therefore,
they have a mean of zero and a
standard deviation of one.
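
One common way to obtain standardized scores, shown here as a sketch under assumptions rather than as the study's exact computation, is to project the standardized data onto each retained eigenvector and divide by the square root of its eigenvalue. The function name and the synthetic input are hypothetical.

```python
import numpy as np

def standardized_scores(data, n_components):
    """Return (n_components, n_days) scores with zero mean and unit variance."""
    data = np.asarray(data, dtype=float)
    # Standardize each station (row) over time.
    z = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(data))
    order = np.argsort(eigvals)[::-1][:n_components]
    raw = eigvecs[:, order].T @ z            # variance of each row = eigenvalue
    return raw / np.sqrt(eigvals[order])[:, None]

# Synthetic O3-like data (ppb): 8 hypothetical stations over 300 days.
rng = np.random.default_rng(3)
regional = rng.standard_normal(300)
data = 50.0 + 10.0 * (regional + 0.5 * rng.standard_normal((8, 300)))
scores = standardized_scores(data, 2)
```

The sign of an eigenvector returned by a numerical solver is arbitrary, so a "large value" on a component may appear as large positive or large negative until the sign is fixed by convention.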
Results
Examination of the spatial characteris-
tics associated with the first nonrotated
principal components (which are the domi-
nant components, explaining 37.95% and
41.25% of the total variance of the sur-
face and satellite data sets, respectively)
revealed considerable coherence between
the data sets. With only one minor exception
(that being southern Florida in the
surface data set), each spatial pattern re-
vealed an in-phase oscillation with the area
of greatest variance centered around the
Ohio River Valley (the centers of maxi-
mum variance are within 200 km of each
other). This would suggest that on conti-
nental scales, spatial patterns derived from
the satellite data mirror those of the surface.
Inspection of the seasonal time series
(as defined by the smoothed median of
the six years of daily principal component
scores) associated with these dominant
components also revealed considerable
coherence. With both data sets, the highest
O3 concentrations consistently occur
during the period from June 15 through
the middle of August. The transition to low
concentrations occurs slightly later (by
roughly a week) in the satellite time series
and is not quite as sharp as the transition
of the surface data set. The correlation
coefficient between these seasonal data
sets is high (0.75), indicating that on a
seasonal scale at least, the satellite data
could be used to infer surface-layer con-
centrations.
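
A seasonal time series of this kind could be built as sketched below. Several details are assumptions, not statements from the report: the 834 daily scores are assumed to be ordered year by year (6 seasons x 139 days, June 15 through October 31), the smoother is taken to be a 7-day running mean because the summary does not name one, and the input series are synthetic, so the resulting correlations are not the study's values.

```python
import numpy as np

N_YEARS, SEASON_DAYS = 6, 139   # 6 x 139 = 834 days (June 15 - October 31)

def seasonal_series(daily_scores, window=7):
    """Median across years for each day of the season, then a running mean."""
    by_year = np.asarray(daily_scores, dtype=float).reshape(N_YEARS, SEASON_DAYS)
    median = np.median(by_year, axis=0)
    kernel = np.ones(window) / window
    return np.convolve(median, kernel, mode="valid")

# Synthetic scores: a shared seasonal decline plus independent daily noise.
rng = np.random.default_rng(2)
day = np.arange(SEASON_DAYS)
cycle = np.cos(np.pi * day / SEASON_DAYS)   # high early, low late in season
surface = np.tile(cycle, N_YEARS) + 0.5 * rng.standard_normal(N_YEARS * SEASON_DAYS)
satellite = np.tile(cycle, N_YEARS) + 0.5 * rng.standard_normal(N_YEARS * SEASON_DAYS)

r_seasonal = np.corrcoef(seasonal_series(surface), seasonal_series(satellite))[0, 1]
r_daily = np.corrcoef(surface, satellite)[0, 1]
```

Taking the median across years and smoothing both suppress day-to-day noise, so a seasonal correlation computed this way will generally be higher than the correlation of the unsmoothed daily scores.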
Examination of the daily principal com-
ponent scores associated with these two
dominant components was less encour-
aging, however, as marked differences
were revealed. Surface concentrations
during 1988 were by far the highest of
any year during the study; however, this
year does not stand out in the satellite
data as being unusually high. In fact, the
satellite data indicates that 1990, not 1988,
was the year experiencing the highest con-
centrations. This and other discrepancies
are reflected in the correlation coefficient
between the daily principal component
scores of the two data sets, which is quite
low (0.33). Although statistically significant
(
-------
The EPA author, Brian K. Eder (also the EPA Project Officer, see below), is on
assignment to the Atmospheric Research and Exposure Assessment Laboratory,
Research Triangle Park, NC 27711, from the National Oceanic and Atmospheric
Administration.
The complete report, entitled "On the Feasibility of Using Satellite Derived Data to
Infer Surface-Layer Ozone Concentration Patterns," (Order No. PB94-170263;
Cost: $17.50, subject to change) will be available only from:
National Technical Information Service
5285 Port Royal Road
Springfield, VA 22161
Telephone: 703-487-4650
The EPA Project Officer can be contacted at:
Atmospheric Research and Exposure Assessment Laboratory
U.S. Environmental Protection Agency
Research Triangle Park, NC 27711
United States
Environmental Protection Agency
Center for Environmental Research Information
Cincinnati, OH 45268
Official Business
Penalty for Private Use $300