United States Environmental Protection Agency
Atmospheric Research and Exposure Assessment Laboratory
Research Triangle Park, NC 27711
Research and Development
EPA/600/SR-94/081
June 1994

EPA Project Summary

On the Feasibility of Using Satellite Derived Data to Infer
Surface-Layer Ozone Concentration Patterns

Brian K. Eder
  Principal Component Analysis (PCA) was applied to six years
(1985-1990) of surface and satellite ozone (O3) data collected over
the eastern United States to determine whether O3 measurements
derived from satellites could be used to infer surface-layer
concentrations. Examination of the spatial and temporal
characteristics associated with the first nonrotated principal
components (which are the dominant components, explaining 37.95% and
41.25% of the total variance of the surface and satellite data sets,
respectively) revealed considerable coherence between the data sets,
suggesting that on continental scales, seasonal O3 patterns derived
from the satellite data replicate quite well those of the surface.
This coherence diminishes, however, when daily patterns are compared.
Upon orthogonal rotation, the PCA delineated four contiguous and
statistically unique subregions with each data set (the Northwest,
Northeast, Southwest, and Southeast) that were very similar,
suggesting that the satellite data may be able to discern O3 patterns
on spatial scales as small as 1000 km. The temporal characteristics
associated with the Southwest and Southeast subregions exhibited
cross-data set similarities; however, those associated with the
Northwest and Northeast subregions were somewhat dissimilar.
  This Project Summary was developed by EPA's Atmospheric Research and
Exposure Assessment Laboratory, Research Triangle Park, NC, to
announce key findings of the research project that is fully documented
in a separate report of the same title (see Project Report ordering
information at back).

 Introduction
  Traditionally, ozone (O3) has been characterized as an urban-scale
pollutant. Increasingly, however, it has been recognized by scientists
as a regional and even global-scale phenomenon, as high concentrations
are routinely observed over vast, non-urban areas of most
industrialized countries, where forest retardation and crop injury are
growing environmental concerns. Daily maximum O3 concentrations in
these areas are often comparable to those found in urban areas, and
daily average concentrations can even exceed urban concentrations due
to a lack of nitric oxide (NO) scavenging. Coinciding with these
realizations has been the advent of satellite-derived O3 measurements.
Using data derived from the Total Ozone Mapping Spectrometer (TOMS),
which measures total column O3 concentrations, and the Stratospheric
Aerosol and Gas Experiment (SAGE), which measures the stratospheric O3
concentration, scientists have been able to estimate the tropospheric
(residual) concentration and, under certain meteorological conditions,
estimate the surface-layer O3 concentration.
  The appropriateness of this remote sensing approach will be
evaluated using Principal Component Analysis (PCA) as applied to
surface data obtained from EPA's Aerometric Information and Retrieval
System (AIRS) and residual O3 data from the National Space Science
Data Center (NSSDC) at NASA's Goddard Space Flight Center. This
analysis, which employs data from the six-year period
1985-1990, will enable us to determine to what extent, if any, the
major modes of spatial and temporal surface O3 variability are being
captured by the satellite data. The advantages of employing a
technique such as PCA are numerous. First, we are dealing with widely
varying spatial scales which prohibit point-by-point comparisons. The
surface data obtained from the AIRS network are representative of
meso-scales
(… - 100 km), while the satellite data are representative of
macro-scales (> 100 km).
PCA circumvents this impediment by providing similar spatial scale
results that will allow for pattern recognition and comparison of O3
concentrations. Second, because of the copious amount of data
resulting from such a large-scale study (nearly
65,000 surface observations and over 45,000 satellite observations)
and because
the individual data tend to be erratic or noisy, it is advantageous to
employ an analysis technique that identifies, through a reduction of
data, the recurring and independent modes of variation within the
larger data sets. And finally, the analysis of O3 characteristics and
trends between the data sets is based on an aggregation of data from
many stations (grid cells), as opposed to individual stations (grid
cells), minimizing the effects of anomalous or even erroneous data
often associated with a single observation.

Data
  The surface O3 concentration (ppb) data employed in this analysis
were obtained from AIRS, which operates under strict monitoring
criteria including multipoint calibrations, independent audits, and
data validation based upon frequent zero, span, and precision checks.
The O3 measurements were made during the "ozone season" (June 15th
through October 31st for this study) using either chemiluminescence
analyzers, which are sensitive to light emitted by the reaction
between O3 and ethylene, or ultraviolet photometers, which measure the
absorption of light by O3.
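
As a quick consistency check on the season definition, the sketch below counts the days from June 15 through October 31 for each study year. It assumes both end dates are inclusive; the helper name `season_days` is illustrative and does not come from the report. Under that assumption each season has 139 days, and the six seasons together give the 834 days quoted in the Methodology.

```python
from datetime import date


def season_days(year: int) -> int:
    """Days from June 15 through October 31 of `year`, both ends inclusive (assumed)."""
    return (date(year, 10, 31) - date(year, 6, 15)).days + 1


# The six study years, 1985 through 1990.
per_year = [season_days(y) for y in range(1985, 1991)]
total_days = sum(per_year)

print(per_year)     # one entry per ozone season
print(total_days)   # total number of days in the study period
```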
  A major goal of this study was to establish a complete,
regionally-representative surface O3 data base, unencumbered by
missing data or local-scale variability. This goal was attained by
applying several selection criteria. First, the daily 1-hour maximum
concentration was used to help minimize local-scale variability
because at the time of maximum surface concentration (typically
between 1 and 3 pm LST), the boundary layer is generally uniformly
mixed and the surface concentration is therefore most representative
of the boundary layer concentration. Additionally, both the primary
standard, designed to protect human health, and the secondary
standard, designed to protect human welfare, established by EPA as
part of the NAAQS, are based on the daily 1-hour maximum
concentration. Second, to avoid NO scavenging effects found in close
proximity to urban areas, only those stations classified as either
rural or suburban and reporting a land use of either forest,
agriculture, or residential were employed in the analysis. Rural
stations received highest priority; however, to meet our third
criterion, spatial completeness, several suburban stations were also
included in the data base. Finally, only those stations reporting a
capture rate of 90.0% or better for the study period were considered.
These criteria resulted in the inclusion of 77 stations across the
eastern half of the United States, the majority of which (55) are
classified as rural. Several combinations of "station-seasons" were
examined before the optimum period of 1985-1990 was selected. The
total capture rate for the period was > 95.0%. All missing data were
replaced by linear interpolation across time.
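
The gap-filling step can be illustrated with a minimal sketch. It assumes one station's daily maxima are held in a NumPy array with NaN marking missing days; the function names and the sample values are hypothetical, and the report does not say how gaps at the start or end of a season were treated (here `np.interp` simply holds the nearest valid value).

```python
import numpy as np


def capture_rate(series):
    """Percentage of days with a valid (non-NaN) observation."""
    series = np.asarray(series, dtype=float)
    return 100.0 * np.count_nonzero(~np.isnan(series)) / series.size


def fill_missing_linear(series):
    """Replace NaN entries by linear interpolation across time (day index)."""
    series = np.asarray(series, dtype=float)
    t = np.arange(series.size)
    valid = ~np.isnan(series)
    filled = series.copy()
    # Interpolate only at the missing days, using the valid days as knots.
    filled[~valid] = np.interp(t[~valid], t[valid], series[valid])
    return filled


# Hypothetical daily 1-hour maxima (ppb) with three missing days.
daily_max = np.array([62.0, np.nan, 70.0, 74.0, np.nan, np.nan, 80.0])
filled = fill_missing_linear(daily_max)
print(capture_rate(daily_max))
print(filled)
```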
  The TOMS and SAGE O3 data, measured in Dobson Units (DU) (1 DU =
2.69 x 10^16 molecules of O3 cm^-2), were obtained from the archived
data sets available at the NSSDC located at the Goddard Space Flight
Center. The column of O3 in the troposphere, or the "residual O3," is
determined by subtracting the integrated amount of O3 above the
tropopause derived from the SAGE profiles from the concurrent amount
of total O3 observed from the TOMS measurements. These data, which
cover the eastern half of the United States for this study, are
gridded with a resolution of 5° longitude by 2.5° latitude, resulting
in a total of 54 grid cells.
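
The residual calculation described above amounts to a subtraction followed, if needed, by a unit conversion. The sketch below uses the conversion factor quoted in the text; the function names and the two column amounts are made up for illustration and are not values from the report.

```python
# Conversion factor quoted in the text: 1 DU = 2.69 x 10^16 molecules of O3 per cm^2.
DU_TO_MOLECULES_PER_CM2 = 2.69e16


def tropospheric_residual(total_du, stratospheric_du):
    """Residual O3 column (DU): TOMS total column minus SAGE column above the tropopause."""
    return total_du - stratospheric_du


def du_to_column_density(du):
    """Convert a column amount in Dobson Units to molecules of O3 per cm^2."""
    return du * DU_TO_MOLECULES_PER_CM2


# Hypothetical column amounts for one grid cell (illustration only).
total = 310.0          # total column O3, DU
stratospheric = 272.0  # integrated O3 above the tropopause, DU

residual = tropospheric_residual(total, stratospheric)
column = du_to_column_density(residual)
print(residual, column)
```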

Methodology
   One of the main objectives of PCA is to
identify, through  a reduction of data, the
recurring and independent modes of varia-
tion (signals) within large, noisy data sets,
thereby summarizing  the essential  infor-
mation of the data sets so that meaningful
and descriptive conclusions can be made.
The analysis sorts initially correlated data into a hierarchy of
statistically independent modes of variation, each of which explains
successively less of the total variance.
   The PCA of both the AIRS and TOMS-SAGE residual data sets began
with the extraction of square, symmetrical correlation matrices (R),
having dimensions 77 x 77 (77 rows and 77 columns) for the AIRS data
and 54 x 54 (54 rows and 54 columns) for the satellite data, from
their original data matrices having dimensions of 77 [stations] x 834
[days] (or 64,218 observations) and 54 [grid cells] x 834 [days] (or
45,036 observations), respectively. By
using R and the identity matrix  (I), of the
same dimensions, n = 77 (54)  eigenvec-
tors  can be  derived that represent the
mutually orthogonal linear combinations
(modes of variation) of the matrix.  Their
associated  eigenvalues  represent the
amount of total variance that is  explained
by each of the eigenvectors. By retaining
only the first few eigenvector-eigenvalue
pairs, or principal components, a substan-
tial  amount of the total variance  can be
explained while ignoring the higher order
principal components that explain minimal
amounts of  the total  variance and can
therefore be viewed as noise. The exact number of components to retain
was determined with the Scree Test, which suggested four-component
solutions for both the AIRS and satellite data sets; accordingly, the
first four principal components were retained for analysis and
comparison.
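
The extraction step can be sketched as an eigen-decomposition of the correlation matrix. The example below runs on synthetic data shaped like the surface data set (77 stations by 834 days); the data, the random seed, and the function name are assumptions for illustration, so the variance fractions it prints are unrelated to those in the report. The Scree Test itself is a visual judgement on how the eigenvalues level off, so the sketch only prints the leading fractions that such a plot would show.

```python
import numpy as np


def pca_from_correlation(data):
    """PCA of a (n_sites, n_days) array via its n_sites x n_sites correlation matrix."""
    R = np.corrcoef(data)                       # each row is one station or grid cell
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    order = np.argsort(eigenvalues)[::-1]       # largest eigenvalue first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    explained = eigenvalues / eigenvalues.sum() # fraction of total variance per component
    return R, eigenvalues, eigenvectors, explained


# Synthetic stand-in for the surface data: a shared signal plus station noise.
rng = np.random.default_rng(0)
n_sites, n_days = 77, 834
shared = rng.standard_normal(n_days)
data = 0.6 * shared + rng.standard_normal((n_sites, n_days))

R, eigenvalues, eigenvectors, explained = pca_from_correlation(data)
print(data.size)                  # number of observations in the data matrix
print(np.round(explained[:6], 3)) # leading variance fractions, as read off a scree plot
```

Because the correlation matrix has ones on its diagonal, the eigenvalues sum to the number of stations (or grid cells), which is why each eigenvalue divided by that sum can be read as a share of the total variance.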
  When the elements of each eigenvector
are multiplied by the square root of the
associated eigenvalue, one  obtains the
principal component loading (L), which rep-
resents the correlation between the com-
ponent and the station or grid cell. When
these principal component loadings are
spatially mapped onto their respective sta-
tions (grid cells) for each component, isop-
leths of component loadings can be drawn,
which identify the major modes of spatial
variability.
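
The statement that a loading is the correlation between a component and a station can be verified numerically. The following sketch builds synthetic data shaped like the satellite data set (54 grid cells by 834 days), scales each eigenvector by the square root of its eigenvalue, and compares the first component's loadings with directly computed correlations. All data and variable names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sites, n_days = 54, 834
data = 0.7 * rng.standard_normal(n_days) + rng.standard_normal((n_sites, n_days))

# Standardize each grid cell's series, then form the correlation matrix.
z = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)
R = (z @ z.T) / n_days

eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Loadings: column k of the eigenvector matrix times sqrt(eigenvalue k).
loadings = eigenvectors * np.sqrt(eigenvalues)

# Time series of the first component, and its correlation with every grid cell.
pc1 = eigenvectors[:, 0] @ z
corr_with_sites = np.array([np.corrcoef(pc1, z[i])[0, 1] for i in range(n_sites)])

print(np.max(np.abs(loadings[:, 0] - corr_with_sites)))  # should be near zero
```

Mapping `loadings[:, k]` back onto station or grid-cell coordinates is what produces the isopleth patterns described above.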
  Initially, the principal component analysis replaced 77 stations (54
grid cells), measured over 834 days, with four principal components
having no temporal measure. Introducing the principal component score
(PCs) provides a comparable temporal measure for the principal
components over the same 834 days. The principal components are
identified in terms of the original stations (grid cells): the larger
the loading, the more important the station is in the interpretation
of the component. Therefore, if a day has high values for the stations
with large loadings, then it should have a large value on that
component. The scores have been standardized; therefore, they have a
mean of zero and a standard deviation of one.
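
One common way to obtain standardized scores is to project the standardized data onto each retained eigenvector and divide by the square root of its eigenvalue. The summary does not give the exact formula used, so the sketch below is an assumption, run on synthetic data, with a hypothetical function name.

```python
import numpy as np


def standardized_scores(data, n_components=4):
    """Daily scores for the leading components, scaled to zero mean and unit variance."""
    z = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)
    R = (z @ z.T) / data.shape[1]
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    lam, vec = eigenvalues[order], eigenvectors[:, order]
    raw = vec.T @ z                       # shape (n_components, n_days)
    return raw / np.sqrt(lam)[:, None]    # variance of raw row k equals lam[k]


# Synthetic stand-in for the 77-station, 834-day surface data set.
rng = np.random.default_rng(2)
data = 0.6 * rng.standard_normal(834) + rng.standard_normal((77, 834))

scores = standardized_scores(data)
print(scores.shape)
print(np.round(scores.mean(axis=1), 6), np.round(scores.std(axis=1), 6))
```

A day on which the heavily loaded stations all report high concentrations yields a large positive score on that component, which is the behavior described in the paragraph above.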

Results
  Examination of the spatial  characteris-
tics  associated with the first  nonrotated
principal components (which are the domi-
nant components, explaining 37.95% and
41.25% of the total variance of the sur-
face and satellite data sets, respectively)
revealed considerable coherence between
the data sets. With only one minor exception (that being southern
Florida in the
 surface data set), each spatial pattern re-
 vealed an in-phase oscillation with the area
 of greatest variance centered around  the
 Ohio River  Valley  (the  centers  of  maxi-
 mum variance are within 200 km of each
 other). This would suggest that on continental scales, spatial
 patterns derived from the satellite data mirror those of the surface.
   Inspection of the seasonal time series
 (as defined by the smoothed median of
 the six years of daily principal component
 scores) associated with  these  dominant
 components  also  revealed considerable
 coherence. With both data sets, the highest O3 concentrations
 consistently occur
 during the period from June 15 through
 the middle of August. The transition to  low
 concentrations  occurs slightly   later  (by
 roughly a week) in the satellite time series
 and is not quite as sharp as the transition
 of the surface data set. The correlation
 coefficient between these seasonal data
 sets is  high  (0.75), indicating that  on a
 seasonal scale at least, the satellite data
 could be used to infer surface-layer con-
 centrations.
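
The seasonal comparison can be sketched as follows. The sketch assumes the 834 daily scores are ordered year by year with 139 days per season, takes the median across the six years for each day of the season, and smooths it with a 7-day running mean. The summary does not specify the smoother, so the window length, the function name, and the synthetic scores are all assumptions.

```python
import numpy as np


def seasonal_series(daily_scores, n_years=6, window=7):
    """Smoothed median seasonal cycle from year-ordered daily scores."""
    by_year = np.asarray(daily_scores, dtype=float).reshape(n_years, -1)
    median = np.median(by_year, axis=0)          # one value per day of the season
    kernel = np.ones(window) / window
    return np.convolve(median, kernel, mode="valid")  # running mean, edges dropped


# Synthetic scores: a shared seasonal cycle plus independent daily noise.
rng = np.random.default_rng(3)
days = np.arange(139)
cycle = np.cos(np.pi * days / 138)               # high early in the season, low late
surface = np.tile(cycle, 6) + 0.8 * rng.standard_normal(834)
satellite = np.tile(cycle, 6) + 0.8 * rng.standard_normal(834)

r_seasonal = np.corrcoef(seasonal_series(surface), seasonal_series(satellite))[0, 1]
r_daily = np.corrcoef(surface, satellite)[0, 1]
print(round(r_seasonal, 2), round(r_daily, 2))
```

In this synthetic setting the seasonal correlation comes out higher than the daily one, because taking the median over years and smoothing both suppress day-to-day noise while preserving the shared cycle.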
   Examination of the daily principal com-
 ponent scores associated with these two
 dominant  components was less  encour-
 aging, however,  as marked  differences
 were revealed.  Surface  concentrations
 during 1988  were by far the highest  of
 any year during the study; however, this
 year does not stand out  in the  satellite
 data as being unusually high. In fact, the
 satellite data indicate that 1990, not 1988,
 was the year experiencing the highest con-
 centrations. This and other discrepancies
 are reflected  in the  correlation coefficient
 between the daily  principal  component
 scores of the two data sets, which is quite
 low (0.33). Although statistically significant
 (
-------
 The EPA author, Brian K. Eder (also the EPA Project Officer, see below), is on
   assignment to the Atmospheric Research and Exposure Assessment Laboratory,
   Research Triangle Park, NC 27711, from the National Oceanic and Atmospheric
   Administration.
 The complete report, entitled "On the Feasibility of Using Satellite Derived Data to
   Infer Surface-Layer Ozone Concentration Patterns," (Order No. PB94-170263;
   Cost: $17.50, subject to change) will be available only from:
         National Technical Information Service
         5285 Port Royal Road
         Springfield, VA 22161
         Telephone: 703-487-4650
 The EPA Project Officer can be contacted at:
         Atmospheric Research and Exposure Assessment Laboratory
         U.S. Environmental Protection Agency
         Research Triangle Park, NC 27711
United States
Environmental Protection Agency
Center for Environmental Research Information
Cincinnati, OH 45268
