-------
[Table (garbled in source): statistical test results for Chesapeake CEMS subsets; content not recoverable.]
-------
[Table (garbled in source): statistical test results by data subset for Chesapeake CEMS pairs b-e, e-b, c-d, and d-c; numeric content not recoverable.]
-------
[Table (garbled in source): statistical test results (computed values) by data subset for Chesapeake CEMS pairs d-c and c-e; numeric content not recoverable.]
-------
[Table (garbled in source): statistical test results (computed values) by data subset for Chesapeake CEMS pairs c-e, e-c, and d-e; numeric content not recoverable.]
-------
[Table (garbled in source): statistical test results (computed values) by data subset for Chesapeake CEMS c, d, and e; numeric content not recoverable.]
-------
6. Conclusions

The analysis performed in this report indicates that the statistical tests for alternative
monitoring systems are stringent but not preclusive, particularly if augmented by a procedure that
compensates for variance underestimation due to autocorrelation. Despite the absence of strict
QA/QC procedures in the field tests at the Chesapeake unit, a substantial number of subsets of
the paired CEMS/CEMS data passed the three prescribed statistical tests, whether or not a variance
inflation estimate was used (Table 5, Table 6, Table 9, and Table 10). Applying variance
inflation estimates to the available CSA/CEMS data, one subset passed all three statistical tests
(Table 6). Two OSA/CEMS subsets passed all three tests when the data were analyzed at the
level of refinement of the alternative monitoring system (Table 7 and Table 8). The latter results
suggest that under-performance on the correlation test may have been due to limitations in the
data rather than to the stringency of the test. Having to pair hourly CEMS measurements with
daily AMS values for the Northern States Power Co. database and with weekly AMS values for
the Niagara Mohawk database is likely to have had a detrimental impact on correlation test
results. Under the proposed regulations, which require hourly measurements for both the CEMS
and AMS, this confounding factor should not be present.
65
-------
References

Box, George E.P. and Gwilym M. Jenkins. 1976. Time Series Analysis: Forecasting and Control.
Revised Edition. Holden-Day, San Francisco, CA.

Box, George E.P., William G. Hunter, and J. Stuart Hunter. 1978. Statistics for Experimenters.
John Wiley & Sons, New York, NY.

Clean Air Act Amendments, 1990. Public Law 101-549, 101st Congress, November 15, 1990.

Cochran, William G. 1977. Sampling Techniques. John Wiley and Sons, New York, NY.

40 CFR, Part 60. Code of Federal Regulations, Title 40--Protection of Environment, Part 60--
Standards of Performance for New Stationary Sources. Revised as of July 1, 1991.

40 CFR, Part 75. Code of Federal Regulations, Title 40--Protection of Environment, Part 75--
Continuous Emissions Monitoring: Proposed Rule. Federal Register, vol. 56, no. 232
(December 3, 1991), pp. 63291-63335.

Gujarati, Damodar N. 1988. Basic Econometrics. 2nd Edition. McGraw-Hill Book Company, New
York, NY.

Magee, Lonnie. 1989. Bias approximations for covariance parameter estimators in the linear
model with AR(1) errors. Commun. Statist. Theory Meth., 18(2):395-422.

Rawlings, John O. 1988. Applied Regression Analysis: A Research Tool. Wadsworth &
Brooks/Cole Statistics/Probability Series. Pacific Grove, CA.

Steel, Robert G.D. and James H. Torrie. 1980. Principles and Procedures of Statistics: A
Biometrical Approach. 2nd Edition. McGraw-Hill, New York, NY.

Wolter, Kirk M. 1984. An investigation of some estimators of variance for systematic sampling.
JASA, 79(388):781-790.
66
-------
Appendices
Three appendices supplement this report.
Appendix A summarizes the results of screening each of the databases used in this study
for normality and autocorrelation.
Appendix B is a paper by Dr. David A. Dickey, Professor of Statistics at North Carolina
State University, entitled "Effects of Autocorrelation on Statistical Analyses." It provides a
theoretical background for the discussion in Section 4 ("Autocorrelation Analysis") of this report.
Appendix C provides documentation on the data subsets analyzed in this report.
-------
Appendix A
-------
THE CADMUS GROUP, INC.
Executive Park, Suite 220
1920 Highway 54
Durham, NC 27713
Telephone: (919) 554-9454    Telefax: (919) 544-9453
May 14, 1992
To:       Elliot Lieberman
          Emissions Monitoring Section, ARD

From:     William Warren-Hicks
          Susan E. Spruill
          Jane E. Mudano
          The Cadmus Group, Inc.

Subject:  Statistical Analysis of Alternative Monitoring Systems
Please find enclosed analyses and summaries for parts 2(a-f), 3, and 4(a-e) of your memo
dated April 21, 1992, requesting testing of alternative monitoring (AM) systems. The
following data were analyzed:
o UARG data from Attachment E of Public Comments (Table 1, page 4). There are 24
hours of CEM (A) and AM (B) data, but two hours are missing. All data were
recorded in ppm.
o Chesapeake data from Entropy (Section 75.21; EPA Contract No. 68-02-4462;
Work assignment No. 91-156). One reference CEM (A) and four alternative CEMs
(B-E) were monitored hourly for approximately 63 days. All SO2 data were recorded
in ppm.
o Homer City Unit #1 (from KEA), recorded as daily CEM (lbs/MMBtu) and using daily
coal sampling (CSA). Sampling covered a 730 day period.

o Homer City Unit #3 (from KEA), recorded as daily CEM (lbs/MMBtu) and using daily
coal sampling (CSA). Sampling covered a period of approximately 730 days.

o Niagara Mohawk (from KEA), recorded as hourly CEM (lbs/MMBtu) and using weekly
oil sampling (OSA). Sampling covered a period of approximately 455 days.

o Northern States Power Company (from KEA), recorded as hourly CEM (lbs/MMBtu)
and using daily coal sampling (CSA). Sampling covered a period of approximately
730 days.
-------
The following summaries are labeled to correspond to the memorandum dated April 21,
1992:
2. Screen data to determine whether they are normally distributed.

The SAS procedure UNIVARIATE was applied to all CEMs and AMs (see the sketch after
this list) in order to:
a) determine the mean
b) determine the standard deviation of the mean
c) compute the Shapiro-Wilks test (or Kolmogorov test) for normality
d) graph normality (Q-Q) plots, and
e) graph frequency distribution histograms of the data.
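
A minimal sketch of such a call (the dataset name CEMS and the variable name CEM_A
are illustrative only, not references to the databases above):

   proc univariate data=cems normal plot;  /* NORMAL requests the normality test; */
      var cem_a;                           /* PLOT requests stem-leaf, box, and   */
   run;                                    /* normal probability plots            */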
Table 1 summarizes the univariate results: 2a, 2b, and 2c, above. For the test of
normality, note that the UNIVARIATE procedure will use the Shapiro-Wilks test whenever
there are fewer than 2000 observations, and will automatically use the Kolmogorov test
whenever there are 2000 or more observations. Please also note that there are a few
problems with these tests, which will be discussed in part 3.
Normality plots (2d) and frequency histograms (2e) for each CEM and AM are also
enclosed. In addition, time-series plots were produced for the CEMs, AMs, and their
differences (CEM-AM).
3. Screen data that are not normally distributed to determine whether they are
lognormally distributed.

It should be noted that nearly all variables failed the test for normality, based on 95%
probability (alpha = 0.05). This is because the normality test is quite sensitive to large sample
size. Due to this sensitivity, the test for normality will generally reject the hypothesis that
the data are normally distributed. In addition, there are a few "outlier" observations in the
data sets we used, which may "skew" their distributions. Classical statistical theory
assumes large sample populations to be normally distributed. Therefore, we do not
recommend the use of these tests for determining normality.
Instead, we recommend that you observe only the values of the normality statistics
(Shapiro-Wilk's W and Kolmogorov's D), ignoring the associated probability, and visually
inspect the frequency histograms and Q-Q plots. Both statistics (W and D) have a range
between 0 and 1: a W = 1 (or D = 0) would result if the data were perfectly normally
distributed; values approaching W = 0 (or D = 1) increase the probability that the data are
not normally distributed. Note that most statistics are very close to the extreme of the
range which denotes normality. In addition, nearly all frequency distributions demonstrate
symmetric curves with the mean approximately equal to the median. The Q-Q plots
demonstrate the straight diagonal alignment of the residuals which is typical of the normal
distribution. Based on these observations, we determined that all data are actually
-------
normally distributed, except for the AM data (CEM B) from the UARG table.
The Shapiro-Wilks test is appropriate for the UARG dataset because of its small sample
size (N = 22). The CEM (CEM A) data were found to be normally distributed. Both UARG
CEM A and CEM B were transformed using the natural log and univariate analyses were
rerun to determine if the transformation normalized their distributions. This transformation
was unsuccessful. Because CEM A and CEM B do not appear to come from the same
distribution, it is not appropriate to compute the difference between them. However, for
consistency, statistical summaries of these differences were reported.
There was no justification for testing for normality of the differences (CEM-AM).
Differences between normally distributed variables are also normally distributed, and all
variables analyzed are assumed to be normally distributed.
4. Autoregression analysis using the SAS AUTOREG procedure.

The AUTOREG procedure is not available on our SAS contract. However, the
autocorrelation of the CEMs and AMs could be determined by a number of other methods
(sketched in SAS after this list):
a) Pearson correlation of CEM (or AM) values and the first order lag of those
values (i.e., the correlation between CEM(t) and CEM(t-1)).

b) Regression of the CEM (or AM) values on the first order lag of those values
and observing the slope (beta).
c) Time series regression of CEM (or AM) over time and computing the Durbin-
Watson statistic (D) for first order autocorrelation. This test is available in
the AUTOREG procedure. As D approaches zero, the probability of
significant autocorrelation increases. A table of critical values for D can be
found in most statistics texts. For large samples (N > 100), the critical value
is usually around D = 1.5 at alpha = 0.05.
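
The following is a minimal SAS sketch of methods (a) through (c); the dataset CEMS, the
monitor variable CEM_A, and the time index HOUR are illustrative names only:

   data lagged;
      set cems;
      hour + 1;                     /* running time index for method (c)     */
      lag_cem = lag(cem_a);         /* first order lag of the monitor series */
   run;

   proc corr data=lagged;           /* (a) Pearson correlation with the lag  */
      var cem_a lag_cem;
   run;

   proc reg data=lagged;
      model cem_a = lag_cem;        /* (b) slope of the value on its own lag */
      model cem_a = hour / dw;      /* (c) regression over time; the DW      */
   run;                             /* option prints the Durbin-Watson D     */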
Table 2 summarizes the results of the above tests for CEM and AM data. Due to the high
autocorrelation which existed in nearly all CEM and AM data sets, differences between
CEM and AM were computed from the residuals of the regressions of these values on their
Lag1 (part b, above). Residuals from such analyses are independent; therefore, differences
between the residuals of the CEM and the residuals of the AM are corrected for the
autocorrelation of the CEM and AM data. As a check, we ran a time series analysis of
these residual differences and computed the Durbin-Watson autocorrelation statistic
(given in Table 2). All residual differences were uncorrelated.
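
A sketch of this residual-differencing step, under the same illustrative names (RES_B and
its variable RES_AM are assumed to be built the same way for the alternative monitor):

   proc reg data=lagged noprint;
      model cem_a = lag_cem;
      output out=res_a r=res_cem;   /* residuals of the CEM on its Lag1       */
   run;
   data resdiff;                    /* assumes the hourly records of the two  */
      merge res_a res_b;            /* residual datasets are aligned          */
      diff = res_cem - res_am;      /* residual difference, corrected for the */
   run;                             /* Lag1 autocorrelation                   */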
-------
Table 1. Summary of univariate analysis results.

                                       Standard    Normal
Data Source        N(1)      Mean     Deviation    Statistic(2)    Normal
UARG
  CEM A              22     484.00       36.77     W = 0.9356       Yes
  CEM B(3)           22     468.64       46.13     W = 0.8825       No
Chesapeake
  CEM A            1617     649.81       95.08     W = 0.9707       Yes
  CEM B(3)         1342     641.92      114.91     W = 0.8915       Yes
  CEM C(3)         1560     593.36      152.13     W = 0.7617       Yes
  CEM D(3)         1448     642.13       89.04     W = 0.9548       Yes
  CEM E(3)         1521     644.38       89.00     W = 0.9606       Yes
Homer City
  Unit 1 CEM        497       2.42        0.19     W = 0.9716       Yes
  Unit 1 CSA(3)     572       2.53        0.26     W = 0.9944       Yes
  Unit 3 CEM        496       1.49        0.15     W = 0.9583       Yes
  Unit 3 CSA(3)     578       1.44        0.12     W = 0.9534       Yes
Niagara Mohawk
  CEM              6801       0.61        0.08     D = 0.0637       Yes
  OSA(3)             62       0.73        0.04     W = 0.9390       Yes
Northern States
  CEM             16081       1.28        0.22     D = 0.1644       Yes
  CSA(3)            667       1.46        0.14     W = 0.9785       Yes

(1) Homer City alternative monitoring (AM) measured daily, Niagara Mohawk AM measured
    weekly, Northern States AM measured daily; all others measured hourly.
(2) W = Shapiro-Wilks test, range: 0 <= W <= 1
    D = Kolmogorov test, range: 0 <= D <= 1
    As W approaches 0 (D approaches 1) the probability of rejecting N(mu, sigma**2)
    increases.
(3) Alternative monitor.
-------
Table 2. Summary of autoregression analysis results.

                       Pearson(1)    Regression(2)                       Durbin-Watson
Data Source            Correlation   Coefficient (beta)  Autocorrelation(3)  D Statistic(4)
UARG
  CEM A                  0.6841         0.6358               0.268             1.385
  CEM B(5)               0.6554         0.6659               0.023             1.925
  CEM A - CEM B(6)                                          -0.376             2.694
Chesapeake
  CEM A                  0.9347         0.9233               0.885             0.229
  CEM B(5)               0.9449         0.9401               0.892             0.214
  CEM C(5)               0.8631         0.8668               0.838             0.316
  CEM D(5)               0.9739         0.9759               0.960             0.071
  CEM E(5)               0.9099         0.9095               0.884             0.224
  CEM A - CEM B(6)                                          -0.391             2.783
  CEM A - CEM C(6)                                          -0.218             2.436
  CEM A - CEM D(6)                                          -0.304             2.608
  CEM A - CEM E(6)                                          -0.400             2.789
Homer City Unit 1
  CEM                    0.7431         0.7348               0.735             0.529
  CSA(5)                 0.7376         0.7348               0.737             0.524
  CEM - CSA(6)                                              -0.108             2.212
Homer City Unit 3
  CEM                    0.8507         0.8421               0.817             0.363
  CSA(5)                 0.7991         0.7998               0.792             0.412
  CEM - CSA(6)                                              -0.155             2.305
Niagara Mohawk
  CEM                    0.7834         0.7782               0.760             0.479
  OSA(5)                 0.9964         0.9964               0.996             0.007
  CEM - OSA(6)                                              -0.217             2.434
Northern States
  CEM                    0.8129         0.8088               0.754             0.492
  CSA(5)                 0.9754         0.9757               0.975             0.050
  CEM - CSA(6)                                              -0.091             2.181

(1) Pearson correlation of value (CEM or AM) to its Lag 1 value.
(2) Simple regression of original value (CEM or AM) on its Lag 1 value.
(3) First-order autocorrelation from time-series regression.
(4) Durbin-Watson statistic (for N > 100, critical D = 1.5 at alpha = 0.05).
(5) Alternative monitor.
(6) Differences computed from residuals in order to remove autocorrelation
    within CEMS and AMS.
-------
[SAS UNIVARIATE output for UARG CEM A (UARG attachment E data): stem-and-leaf plot, boxplot, and normal probability plot; graphics not reproduced.]
-------
[SAS UNIVARIATE output for UARG CEM B: stem-and-leaf plot, boxplot, and normal probability plot; graphics not reproduced.]
-------
[Time-series plots of the UARG CEM, AM, and difference series (ppm); graphics not recoverable.]
-------
UARG attachment E data                  17:27 Wednesday, May 13, 1992

Model: MODEL1
Dependent Variable: CEM_A

                        Analysis of Variance

                         Sum of          Mean
Source         DF       Squares         Square     F Value    Prob>F
Model           1   19302.18052    19302.18052      42.451    0.0001
Error          20    9093.81948      454.69097
C Total        21   28396.00000

    Root MSE     21.32348     R-square    0.6797
    Dep Mean    484.00000     Adj R-sq    0.6637
    C.V.          4.40568

                        Parameter Estimates

               Parameter       Standard    T for H0:
Variable  DF    Estimate          Error    Parameter=0   Prob > |T|
INTERCEP   1   176.008727    47.48895471        3.706        0.0014
CEM_B      1     0.657207     0.10086893        6.515        0.0001
-------
[SAS UNIVARIATE output for Chesapeake CEM A: frequency histogram, boxplot, and normal probability plot; graphics not reproduced.]
-------
[SAS UNIVARIATE output for Chesapeake CEM B: frequency histogram, boxplot, and normal probability plot; graphics not reproduced.]
-------
[SAS UNIVARIATE output for Chesapeake CEM C: frequency histogram, boxplot, and normal probability plot; graphics not reproduced.]
-------
[SAS UNIVARIATE output for Chesapeake CEM D: frequency histogram, boxplot, and normal probability plot; graphics not reproduced.]
-------
[SAS UNIVARIATE output for Chesapeake CEM E: frequency histogram, boxplot, and normal probability plot; graphics not reproduced.]
-------
[Time-series plots of the Chesapeake CEM and difference series (ppm); graphics not recoverable.]
-------
[SAS UNIVARIATE output for Homer City Unit 1, variable SO2CEM: frequency histogram, boxplot, and normal probability plot; graphics not reproduced.]
-------
[SAS UNIVARIATE output for Homer City Unit 1, variable AACS (coal sampling): frequency histogram, boxplot, and normal probability plot; graphics not reproduced.]
-------
[Time-series plots of the Homer City CEM and CSA series (lbs/MMBtu); graphics not recoverable.]
-------
[SAS UNIVARIATE output for Niagara Mohawk, variable SO2CEM: frequency histogram, boxplot, and normal probability plot; graphics not reproduced.]
-------
[SAS UNIVARIATE output for Niagara Mohawk, variable SO2OSA (weekly oil sampling): stem-and-leaf plot, boxplot, and normal probability plot; graphics not reproduced.]
-------
[Time-series plots of the Niagara Mohawk CEM and OSA series; graphics not recoverable.]
-------
[SAS UNIVARIATE output for Northern States Power, variable SO2CEM: frequency histogram, boxplot, and normal probability plot; graphics not reproduced.]
-------
[SAS UNIVARIATE output for Northern States Power, variable SO2CSA (daily coal sampling): frequency histogram, boxplot, and normal probability plot; graphics not reproduced.]
-------
[Time-series plots of the Northern States Power CEM and CSA series; graphics not recoverable.]
-------
Appendix B
-------
EFFECTS OF AUTOCORRELATION ON STATISTICAL ANALYSES
D. A. DICKEY
Prepared July 28, 1992 for the Cadmus Group
INTRODUCTION
In this paper, we review the concept of autocorrelation, explain how
to look for it, and explain how to adjust for it in standard statistical
formulas. Formula (7) shows the effect on the variance of a sample mean
and formula (10) shows the effect on the estimate of individual variance.
The square root of a variance is called a standard deviation and is needed
to decide if data points or means are unusual. A data point more than 1.96
standard deviations from the mean will occur by chance only 5% of the time
and hence is considered unusual.
Likewise, a sample mean more than 1.96 standard deviations from a
hypothesized long run mean casts doubt on that long run mean. Here, of
course, standard deviation refers to the standard deviation of a sample
mean. When the standard deviation of the mean is estimated from the data,
it is referred to as the standard error of the mean. We will consider a
sample mean more than 2 standard errors from a hypothesized long term mean
as significant evidence against that long term mean but the normal or t
tables could be used to provide a slightly more accurate number than 2,
depending on the sample size.
1. AUTOCORRELATION
We use statistics to deal with variation. For example, a certain type
of automobile may get on average 28 MPG (miles per gallon), but individual
mileages will vary around this mean, some particular cars doing better and
some worse. If automobile types are to be compared by sampling, this
variability must somehow be accounted for.
Most statistical texts concern independent data. For example, if I
take a random sample of automobiles, the fact that car 7 is over the mean
MPG does not lead me to expect anything in particular about the MPG of car
8 or car 6. When one deviation from the mean tells us nothing about any
other, the data are said to be independent. To look at an example where
this independence obviously would not hold, consider measurements of flow
rate in a stream taken every hour. If the stream is flowing much faster
than average now, we would expect it to be flowing faster than average one
hour from now, that is, the stream is high now and an hour is not enough
time to clear the excess water from the stream.
Pollutants in a stream, in the air, etc. may also exhibit this failure
of independence. When data taken over time fail to be independent, we say
they are autocorrelated. The most common type of autocorrelation is
positive, that is, positive deviations from the mean tend to be followed by
positive and negative deviations by negative. In order to adjust standard
statistical formulas to deal with autocorrelation, it becomes necessary to
pin down the nature of the autocorrelation more precisely. This is the
role of time series modeling.
In 1976, a book Time Series Analysis: Forecasting and Control by G. E.
P. Box and G. M. Jenkins (Holden-Day publishers) popularized time series
modeling. The authors stressed models called AutoRegressive Integrated
-------
Moving Averages, or ARIMA models. A subset of these models, autoregressive
or AR models, is discussed below. This subset forms a relatively simple
yet powerful class of models.
The autoregressive model of order 1, AR(1), is written as

    Y(t) = M + r( Y(t-1) - M ) + e(t) ,   t = 1,2,3,...        (1)

where Y(t) is the value of the data at time t, for example the flow rate of
a river at hour t. M is the process long term mean and r is a number
strictly between -1 and 1. We interpret r as a proportion when r > 0.
Finally e(t) is an unanticipated error or "shock" to the system as it is
sometimes called. This e(t) series is assumed to be an independent
sequence with mean 0 and constant variance.
Model (1) expresses the deviation of Y(t) from the mean M as a
proportion r of the previous deviation plus an unanticipated shock e(t)
and hence is quite realistic for many economic and physical situations.
The AR(1) model can be extended to a general AR(k) model in which Y(t)
depends on k previous values as

    Y(t) = M + P1( Y(t-1) - M ) + P2( Y(t-2) - M ) + ...
             + Pk( Y(t-k) - M ) + e(t) ,   t = 1,2,3,...       (2)

where P1, P2, ..., Pk are numbers called autoregressive coefficients and
the previous values Y(t-1), Y(t-2), etc. are referred to as lags of Y.
2. REGRESSION ESTIMATES
In section 1, the class of ARIMA models popularized by Box and Jenkins
was introduced and one model, the autoregressive order 1 or AR(1) was
singled out. In this and the next section, we look at how we can tell if
AR(1) is appropriate for our data. The main tools here will be least
squares regression and the autocorrelation function.
Least squares regression is a topic covered in standard statistical
textbooks. The application of regression to time series is covered in
detail in chapter 8 of the book Introduction to Statistical Time Series by
Wayne Fuller (Wiley 1976). In an example on pages 341 and 342, Fuller shows
how to determine the necessary number of lags in a model by running a
regression on many lags, then using standard test statistics, t and F,
produced by most regression programs, to decide how many lags can be
omitted. If we can omit all but Y(t-1) from our model, then the AR(1) is
appropriate.
Alternatively, we could look at the partial autocorrelation function
which is computed by most time series packages, such as PROC ARIMA in the
SAS computer package (SAS is the registered trademark of SAS Institute,
Cary, N. Carolina). The jth partial autocorrelation coefficient is
essentially the lag j coefficient in the multiple regression of Y(t) on
Y(t-1), ..., Y(t-j), as explained in The SAS System for Forecasting Time
Series by J. C. Brocklebank and D. A. Dickey (SAS Institute publishers).
Only the lag 1 partial autocorrelation would estimate a nonzero value for
an AR(1).
-------
3. AUTOCORRELATION FUNCTION
Another way to determine if an AR(1) is appropriate is to look at the
autocorrelation function R(j). R(j) is the correlation between Y(t) and
Y(t-j), where j is called the lag number. This function is produced by
most time series computer programs, for example PROC ARIMA and PROC AUTOREG
in the SAS computer package.
Specifically, R(j) = G(j)/G(0) where G(j) is called the autocovariance
function and is defined as the covariance between Y(t) and Y(t-j). Letting
the variance of e be denoted V(e), we find from Fuller (page 37 equation
2.3.5) that the covariance at lag j for an AR(1) model is

    Autocovariance = G(j) = r**j V(e)/(1-r**2)                 (3)

where ** denotes exponentiation and * denotes multiplication, e.g. 3*5=15,
3**2 = 3*3 = 9, 2**3 = 2*2*2 = 8. Now the variance of Y is G(0), that is, the
variance of Y is, for an AR(1) model,

    variance of Y = G(0) = V(e)/(1-r**2)

so we see that the variance of the shocks, V(e), can be quite different
than the variance, G(0), of the data if r is near 1.

Using R(j) = G(j)/G(0) and equation (3) it is easily seen that

    R(j) = r**j                                                (4)

for the AR(1) model. It is important to note that these formulas are only
appropriate for autoregressive order 1 models, not for moving average
models, general ARIMA models, or general autoregressive order k models.
Equation (4) shows that the autocorrelation function of an AR(1) decays
exponentially.
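
As a quick numerical check of equation (4) (a sketch, using the r = .7 of the
example in section 4), a short data step tabulates the theoretical
autocorrelations:

   data theoacf;
      r = .7;
      do j = 0 to 10;
         acf = r**j;            /* equation (4): R(j) = r**j */
         output;
      end;
   run;
   proc print data=theoacf; run;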
4. EXAMPLE
A dataset of 60 observations using model (1) with r=.7 and M=100 is
generated in SAS and analyzed with PROC ARIMA. Here is the program and
part of the output:

PROGRAM

data epa;
   y = 100 + 10*normal(1827651)/sqrt(1-.7**2); output;   /* stationary start */
   do i = 2 to 60;
      y = 100 + .7*(y-100) + 10*normal(1827631); output; /* AR(1) recursion  */
   end;
run;
proc arima data=epa; identify var=y nlag=10; run;

OUTPUT

Name of variable       = Y
Mean of working series = 91.87989
Standard deviation     = 9.979986
Number of observations = 60
Notice that the autocorrelations die off in approximately an
exponential manner at least for the first few lags. The dots represent two
standard errors so lines of asterisks extending beyond the two standard
-------
errors indicate statistically significant (nonzero) autocorrelations. We
are saying that if the autocorrelation were truly 0, an estimated
autocorrelation more than 2 standard errors from 0 would be unusual and
hence we reject the idea of 0 autocorrelation based on our estimate. Here,
of course, standard error refers to a standard error appropriate for
autocorrelation estimates.
                         Autocorrelations

    Lag    Covariance    Correlation
      0     99.600         1.00000
      1     49.884765      0.50085
      2     25.278038      0.25380
      3      7.336610      0.07366
      4      7.443015      0.07473
      5     11.986170      0.12034
      6     -1.278828     -0.01284
      7      1.552965      0.01559
      8      0.862817      0.00866
      9     -6.552627     -0.06579
     10    -11.546358     -0.11593

    (asterisk plot not reproduced; "." marks two standard errors)
The partial autocorrelations show one nonzero lag value, .50085, and
the rest are insignificant, being within the two standard error bounds:
                     Partial Autocorrelations

    Lag    Correlation
      1        0.50085
      2        0.00393
      3       -0.07331
      4        0.08643
      5        0.08939
      6       -0.16725
      7        0.08753
      8        0.01601
      9       -0.14884
     10       -0.05415

    (asterisk plot not reproduced)
The estimated autocorrelations and partial autocorrelations seem to be
in line with what would be expected for an autoregressive order 1 series.
5. VARIANCE OF THE MEAN, INDEPENDENT SAMPLES

This section deals with sample means. Returning to the example of MPG
in cars, suppose a particular brand of car has mean 28 MPG (for the entire
fleet of all such cars ever to be produced). If I take a random sample of
10 cars from the production line and measure MPG, I might get a sample
average 25.0 MPG. Another sample of 10 might have a sample average 27.2
MPG and another 28.3. Is this much variation in means of samples of 10 cars
reasonable? It depends on the individual car-to-car variation, V, in MPG.
-------
If I know the variance V of MPG from car to car, I can compute the variance
among sample means from samples of size 10 to see if 25.0, 27.2, and 28.3
are reasonable numbers. The formula when the data are independent is

    variance of means = V/n                                    (5)

where V is the individual car-to-car variance for these independent cars,
and n is the number of cars in each sample. This formula is very well
known; for example, see Snedecor and Cochran's Statistical Methods, eighth
edition, page 43 (Iowa State University Press, publisher). For a time
series, of course, formula (5) becomes

    variance of means = G(0)/n                                 (5a)
If the estimated variance in MPG is 14.4 then the estimated variance
associated with a mean of 10 cars would be 14.4/10 = 1.44 and the corresponding
standard error of the mean (square root of this variance) is 1.2, so none of
our sample means is more than 2 standard deviations from the fleet average
28 MPG. If, instead of 1.2, the standard error were 0.5, then the sample
mean 25 would be quite unusual since it would now be 6 standard errors away
from the fleet average. Clearly, the decision of whether a sample mean is
statistically significantly far from any stated value depends on having the
correct standard error available. Recall that the cutoff value of 2
standard errors can be refined by referring to tables of the t
distribution.
6. EFFECT OF AUTOCORRELATION ON THE VARIANCE OF THE MEAN.
For autocorrelated data, the variance of a sample mean is no longer
given by formula (5). A formula giving an approximation to the variance is
on page 194, (6.3.17) of Box and Jenkins. This same formula in a different
form is given in Corollary 6.1.1.2, page 232 of Fuller. The formula, while
approximate, holds for a large subset of ARIMA models including the AR(1)
we are discussing here.

We can do better than an approximation if we restrict ourselves to
the AR(1) case. In particular, the Fuller text page 232 line 10 shows that
the variance of a sample mean of n consecutive values of a time series Y(t)
is exactly given by
    [n G(0) + 2(n-1) G(1) + 2(n-2) G(2) + 2(n-3) G(3) + ...
        + 2 G(n-1)]/(n*n)                                      (6)
where G(j) is the autocovariance of the time series in question. In the
AR(1) case, G(j) is given by (3) and is seen to be G(j) = G(0) r**j, so we
can plug G(j) = G(0) r**j into expression (6) and do some algebraic
simplification to get the desired variance. Note that formulas (3) and (6)
are all we really need. One could write a computer program to evaluate (3)
and use it in (6) for every case; however, algebraic reduction will give us
a nice formula, (7), for the AR(1) case that will be easy to use.

I now show the algebra just to provide a technical reference. Let

    S = [n + 2(n-1) r + 2(n-2) r**2 + ... + 2 r**(n-1)]

so that expression (6) is S*G(0)/(n*n). Notice that, for independent data,
r = 0 so S = n and expression (6) reduces to expression (5a). All we need to do
-------
is algebraically reduce S then calculate S*G(0)/(n*n). Now write S and
then multiply S by r, getting

     S = [n + 2(n-1) r + 2(n-2) r**2 + ... + 2 r**(n-1)]
    rS = [      n r    + 2(n-1) r**2 + 2(n-2) r**3 + ... + 2 r**n]

and subtracting,

    (1-r)S = n + (n-2)r - 2[r**2 + r**3 + r**4 + ... + r**n]
           = n(1+r) - 2[r + r**2 + r**3 + r**4 + ... + r**n]
           = n(1+r) - 2r[1 + r + r**2 + r**3 + ... + r**(n-1)]
           = n(1+r) - 2rD

where D = [1 + r + r**2 + r**3 + ... + r**(n-1)]. Computing D and rD we
have

     D = [1 + r + r**2 + r**3 + ... + r**(n-1)]
    rD = [    r + r**2 + r**3 + ... + r**(n-1) + r**n]

and subtracting

    (1-r)D = 1 - r**n   so   D = (1-r**n)/(1-r).

Now that we have D and (1-r)S = n(1+r) - 2rD we solve for S as

    S = n(1+r)/(1-r) - 2r[(1-r**n)/(1-r)]/(1-r)
      = n(1+r)/(1-r) - 2r(1-r**n)/(1-r)**2.

Remember that expression (6), the variance of the mean in the AR(1) case,
is S*G(0)/(n*n). Thus expression (6) becomes, for the AR(1) case,

    ( G(0)/n ) [ (1+r)/(1-r) - 2(r/n)(1-r**n)/(1-r)**2 ]       (7)
This is our target formula. Notice that if the data are independent, r = 0
and the expression in square brackets becomes 1. Since G(0) is the
variance of Y this gives the well known formula (5a) for the variance of
the mean of n independent observations. The expression in square brackets
is thus seen to be an inflator for the usual variance formula. This
expression does not approach 1 when n gets large and hence is an important
adjustment even in very large samples.

As an example, with 50 observations, autocorrelation r = .8 (first order
autocorrelation to be exact), and process variance G(0) = 128, the usual
formula for the variance of the mean in independent data would give
G(0)/n = 128/50 = 2.56 and the standard error would be 1.6. This is
incorrect, however, since formula (7) shows that 2.56 should be multiplied
by

    [ (1+r)/(1-r) - 2(r/n)(1-r**n)/(1-r)**2 ]
      = (1.8)/.2 - 2(.016)(1-.8**50)/(.04)
      = 9 - 0.032(1-0.0000143)/(.04) = 8.2

giving the proper variance 20.99 and standard error 4.58. Notice that this
is quite an adjustment.
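
As a cross-check (a sketch, not part of the original derivation), the
following data step evaluates both the exact sum (6), using G(j) = G(0) r**j
from (3), and the closed form (7) for the example just given:

   data varmean;
      n = 50; r = .8; g0 = 128;
      s = n*g0;                                /* exact sum, expression (6) */
      do j = 1 to n-1;
         s = s + 2*(n-j)*g0*r**j;
      end;
      var6 = s/(n*n);
      var7 = (g0/n)*((1+r)/(1-r) - 2*(r/n)*(1-r**n)/(1-r)**2);  /* (7) */
      se = sqrt(var7);
      put var6= var7= se=;                     /* both 20.99; se = 4.58 */
   run;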
-------
7. EFFECT ON THE ESTIMATE OF INDIVIDUAL VARIANCE

In the above example, the variance of the data, G(0), was assumed to
be the known value 128. Of course, the true variance is never known and it
must be estimated from the data, using the sample variance. The usual
formula for the sample variance is

    S**2 = [(Y(1) - Y)**2 + (Y(2) - Y)**2 + ... + (Y(n) - Y)**2]/(n-1)
         = [Y(1)**2 + Y(2)**2 + ... + Y(n)**2 - n Y**2]/(n-1)  (8)

where Y = [ Y(1) + Y(2) + ... + Y(n) ]/n is the sample mean. For the
numbers 10, 12, 13, 7, 8 we get Y = 10 and

    S**2 = [0 + 4 + 9 + 9 + 4]/4 = [100+144+169+49+64 - 2500/5]/4
         = 26/4 = 6.5.
Why is n-1 instead of n used in the denominator of S**2? To answer this,
use lower case letters and let y(t) = Y(t) - M where, as before, M is the
theoretical mean. The first part of expression (8) shows that y(t) can be
substituted for Y(t) and we get

    (n-1)S**2 = [(y(1) - y)**2 + (y(2) - y)**2 + ... + (y(n) - y)**2]
              = [y(1)**2 + y(2)**2 + ... + y(n)**2 - n y**2]   (9)

where y = Y - M is the mean of the y(t) values
(y(t) - y = (Y(t) - M) - (Y - M) = Y(t) - Y).
Now the last part of expression (9) is the numerator of S**2 and if we
take its expected value (the mean value in repeated sampling) assuming
independent observations, we get

    G(0) + G(0) + ... + G(0) - n (G(0)/n) = (n-1) G(0)

because the expected value of each y(t)**2 is by definition the variance
of Y (this being G(0)) and the last term is by definition n times the
variance of the sample mean. This shows that division by n-1 is required
to get an unbiased estimator, that is, one whose expected value is the
quantity to be estimated. Notice that if the data are from an AR(1)
process, the n (G(0)/n) term above would need to be replaced by

    n (G(0)/n) [ (1+r)/(1-r) - 2(r/n)(1-r**n)/(1-r)**2 ]

so that the numerator of S**2 would now estimate

    G(0)[ n - (1+r)/(1-r) + 2(r/n)(1-r**n)/(1-r)**2 ]

and S**2 would estimate this divided by (n-1), namely

    G(0)[ 1 + (1/(n-1))[-2r/(1-r)] + 2(r/n)(1-r**n)/(1-r)**2/(n-1) ]

so that if we divided S**2 by

    [ 1 - 2r/[(n-1)(1-r)] + 2(r/n)(1-r**n)/(1-r)**2/(n-1) ]    (10)

we would have an unbiased estimate of G(0). In the example from section 6,
the sample variance estimates [(50)(128) - (50)(20.99)]/49 = 0.853 G(0); in
other words we would multiply the sample variance by 1/0.853 = 1.172 to get
an unbiased estimate.
Expression (10) approaches 1 as n gets large and hence becomes less
important as the sample size increases although for fixed n, the value of r
also plays an important role.
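
A companion sketch (again not in the original paper) evaluates the multiple
(10) and the resulting correction for the section 6 example:

   data varbias;
      n = 50; r = .8;
      mult = 1 - 2*r/((n-1)*(1-r)) + 2*(r/n)*(1-r**n)/(1-r)**2/(n-1); /* (10) */
      correction = 1/mult;          /* dividing S**2 by MULT (about 0.853),   */
      put mult= correction=;        /* i.e. multiplying by about 1.172, gives */
   run;                             /* an unbiased estimate of G(0)           */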
8. SUMMARY
In this paper two formulas were developed for dealing with
autocorrelation. Autocorrelation is assumed to arise from a first order
autoregressive process with autocorrelation parameter r and process
variance G(0).

    ( G(0)/n ) [ (1+r)/(1-r) - 2(r/n)(1-r**n)/(1-r)**2 ]       (7)

    [ 1 - 2r/[(n-1)(1-r)] + 2(r/n)(1-r**n)/(1-r)**2/(n-1) ]    (10)

Formula (7) shows a multiplicative adjustment to the usual formula, G(0)/n,
for the variance of a sample mean. For r > 0 we conclude that
autocorrelation increases the variability of sample means around the long
run mean. The sample variance estimates a multiple of the true variance of
individuals. Formula (10) gives that multiple. If r = 0, the multiple is 1
so the sample variance is unbiased. If r > 0, the multiple is less than 1
and we say the sample variance is biased downward.

Since the sample variance is defined in terms of deviations from the
sample mean, the interpretation is that the points vary less around the
sample mean under autocorrelation than they do for independent data.
Dividing the sample variance by expression (10) provides an unbiased
estimator. As n gets large, expression (10) approaches 1 for any fixed r
and hence is less important for large samples.

The formulas involve r which is an unknown, but estimable, quantity.
Plugging in an estimated r obviously produces an approximate adjustment.
REFERENCES
Box, G. E. P. and G. M. Jenkins (1976). Time Series Analysis: Forecasting
and Control. Holden-Day.

Brocklebank, J. C. and D. A. Dickey (1986). The SAS System for Forecasting
Time Series. SAS Institute, Cary, N.C.

Fuller, Wayne (1976). Introduction to Statistical Time Series. Wiley.

Snedecor, G. W. and W. G. Cochran (1989). Statistical Methods, eighth
edition. Iowa State University Press.
SYMBOLS
-------
    M       theoretical, or long term, mean
    V       theoretical variance
    G(0)    theoretical variance of time series
    V(e)    theoretical variance of shocks in time series
    G(j)    covariance at lag j
    R(j)    autocorrelation at lag j; R(j) = G(j)/G(0)
    AR(1)   autoregressive order 1 model
    r       lag 1 autocorrelation in an AR(1)
    n       sample size, number of observations
    Y       sample mean
    S**2    standard formula for sample variance
-------
Appendix C
-------
Virginia Power Co., Chesapeake Energy Center, Unit #4
Alternative Monitoring System Study Subsets Summary

                 Setno         Subset      Start   Start     End    End          n
 OBS   LABEL  (in dataset)  (in tables)     Date    Time    Date   Time  (before lagging)
   1    a-c        20            1         10992      7    20892      7        683
   2    a-c        21            2         20892      8    30992      8        682
   3    a-c        23            3         10992     12    20892     12        683
   4    a-c        24            4         20892     13    30992     13        682
   5    a-c        26            5         10992     17    20892     17        683
   6    a-c        27            6         20892     18    30992     18        682
   7    a-c        29            7         10992     22    20892     22        683
   8    a-c        30            8         20892     23    30992     23        682
   9    a-c        32            9         11092      3    20992      3        683
  10    a-c        33           10         20992      4    31092      4        682
  11    a-c        35           11         11092      8    20992      8        683
  12    a-c        36           12         20992      9    31092      9        682
  13    a-d        40            1         10992      7    20892      7        653
  14    a-d        41            2         20892      8    30992      8        670
  15    a-d        43            3         10992     12    20892     12        653
  16    a-d        44            4         20892     13    30992     13        670
  17    a-d        46            5         10992     17    20892     17        653
  18    a-d        47            6         20892     18    30992     18        670
  19    a-d        49            7         10992     22    20892     22        653
  20    a-d        50            8         20892     23    30992     23        670
  21    a-d        52            9         11092      3    20992      3        653
  22    a-d        53           10         20992      4    31092      4        670
  23    a-d        55           11         11092      8    20992      8        653
  24    a-d        56           12         20992      9    31092      9        670
  25    a-e        60            1         10992      7    20892      7        672
  26    a-e        61            2         20892      8    30992      8        663
  27    a-e        63            3         10992     12    20892     12        672
  28    a-e        64            4         20892     13    30992     13        663
  29    a-e        66            5         10992     17    20892     17        672
  30    a-e        67            6         20892     18    30992     18        663
  31    a-e        69            7         10992     22    20892     22        672
  32    a-e        70            8         20892     23    30992     23        663
  33    a-e        72            9         11092      3    20992      3        672
  34    a-e        73           10         20992      4    31092      4        663
  35    a-e        75           11         11092      8    20992      8        672
  36    a-e        76           12         20992      9    31092      9        663
  37    c-d       140            1         10992      7    20892      7        661
  38    c-d       141            2         20892      8    30992      8        676
  39    c-d       143            3         10992     12    20892     12        661
  40    c-d       144            4         20892     13    30992     13        676
  41    c-d       146            5         10992     17    20892     17        661
  42    c-d       147            6         20892     18    30992     18        676
  43    c-d       149            7         10992     22    20892     22        661
  44    c-d       150            8         20892     23    30992     23        676
  45    c-d       152            9         11092      3    20992      3        661
  46    c-d       153           10         20992      4    31092      4        676
  47    c-d       155           11         11092      8    20992      8        661
  48    c-d       156           12         20992      9    31092      9        676
  49    c-e       160            1         10992      7    20892      7        707
  50    c-e       161            2         20892      8    30992      8        701
  51    c-e       163            3         10992     12    20892     12        707
  52    c-e       164            4         20892     13    30992     13        701
  53    c-e       166            5         10992     17    20892     17        707
  54    c-e       167            6         20892     18    30992     18        701
  55    c-e       169            7         10992     22    20892     22        707
  56    c-e       170            8         20892     23    30992     23        701
  57    c-e       172            9         11092      3    20992      3        707
  58    c-e       173           10         20992      4    31092      4        701
  59    c-e       175           11         11092      8    20992      8        707
  60    c-e       176           12         20992      9    31092      9        701
  61    d-e       180            1         10992      7    20892      7        647
  62    d-e       181            2         20892      8    30992      8        658
  63    d-e       183            3         10992     12    20892     12        647
  64    d-e       184            4         20892     13    30992     13        658
  65    d-e       186            5         10992     17    20892     17        647
  66    d-e       187            6         20892     18    30992     18        658
  67    d-e       189            7         10992     22    20892     22        647
  68    d-e       190            8         20892     23    30992     23        658
  69    d-e       192            9         11092      3    20992      3        647
  70    d-e       193           10         20992      4    31092      4        658
  71    d-e       195           11         11092      8    20992      8        647
  72    d-e       196           12         20992      9    31092      9        658
-------
Pennsylvania Electric Co., Homer City Unit #1
Alternative Monitoring System Study Subsets Summary

                     Start      End            n
 OBS   Unit   Setno   Date     Date   (before lagging)
   1     1       6    53185    62985         30
   2     1      10    92885   102785         28
   3     1      21    82486    92286         28
   4     1      24   112286   122186         30
-------
Pennsylvania Electric Co., Homer City Unit #3
Alternative Monitoring System Study Subsets Summary

                     Start      End            n
 OBS   Unit   Setno   Date     Date   (before lagging)
   1     3       3    30285    33185         28
   2     3       5    50185    53085         30
   3     3       7    63085    72985         28
   4     3       8    73085    82885         28
   5     3      16    32786    42586         30
   6     3      17    42686    52586         30
   7     3      20    72586    82386         29
   8     3      21    82486    92286         29
-------
Northern States Power Co., Sherburne County Unit #3
Alternative Monitoring System Study Subsets Summary

               Start   Start      End    End          n
 OBS   Setno    Date    Time     Date   Time  (before lagging)
   1      1    10189      1     13089     24        715
   2      4    40189      1     43089     24        672
   3      6    53189      1     62989     24        720
   4      7    63089      1     72989     24        720
   5      8    73089      1     82889     24        680
   6      9    82989      1     92789     24        720
   7     10    92889      1    102789     24        720
   8     11   102889      1    112689     24        696
   9     12   112789      1    122689     24        720
  10     13   122789      1     12590     24        720
  11     14    12690      1     22490     24        720
  12     15    22590      1     32690     24        720
  13     16    32790      1     42590     24        698
  14     18    52690      1     62490     24        720
  15     21    82490      1     92290     24        704
  16     22    92390      1    102290     24        720
  17     23   102390      1    112190     24        720
-------
Niagara Mohawk, Oswego Unit #6
Alternative Monitoring System Study Subsets Summary

               Start   Start      End    End          n
 OBS   Setno    Date    Time     Date   Time  (before lagging)
   1      2    13190      0     30190     23        690
   2      5    50190      0     53090     23        671
-------