EPA-420-R-79-101
Technology Service Corporation
2811 WILSHIRE BOULEVARD • SANTA MONICA, CALIFORNIA 90403 • PH. (213) 829-7411
TSC-PD-A158-5
FINAL REPORT
APPROACHES TO ANALYZING DATA FROM THE
PORTLAND STUDY
TASK 4
Contract No. 68-03-2392
May 1979
by
Larry S.L. Lai
John D. Gins
Prepared for:
Ms. Janet Becker
Characterization and Application Branch
Office of Air and Water Programs
Submitted to:
U.S. Environmental Protection Agency
2929 Plymouth Road
Ann Arbor, Michigan 48105
ACKNOWLEDGMENT
During the preparation of this report, valuable guidance and
assistance were provided by Ms. Janet Becker of the Environmental
Protection Agency's Office of Air and Water Programs, who served as
EPA Project Officer for the Portland Study.
Mr. John Gins was the Principal Investigator for the project,
and Dr. Larry Lai, the principal author. Dr. Leo Breiman, Senior
Consultant, Dr. John Eldon, Senior Scientist, and Ms. Nancy Chang,
Research Statistician, made numerous valuable suggestions and
comments on the early drafts of the report. Ms. Marian Branch
reviewed and assembled the many drafts of the manuscript.
CONTENTS
1.0 INTRODUCTION
    1.1 BACKGROUND
    1.2 GENERAL DESCRIPTION
    1.3 USER'S GUIDE
2.0 SHORT TEST/FTP ASSOCIATION
    2.1 PROCEDURES FOR ANALYZING THE ST/FTP ASSOCIATION
        2.1.1 Stage 1. Graphical Exploration
        2.1.2 Stage 2. Assessment of the Distributional Properties of the Emission Data
        2.1.3 Stage 3. Implementation of Classification Algorithms
        2.1.4 Stage 4. Evaluation of the ST/FTP Association
        2.1.5 Stage 5. Testing Hypotheses
3.0 RESPONSES TO SPECIFIC QUESTIONS
    3.1 RESPONSES TO SPECIFIC QUESTIONS RELATED TO THE ST/FTP ASSOCIATION
    3.2 RESPONSES TO QUESTIONS RELATED TO IMMEDIATE EFFECTIVENESS OF AN INSPECTION/MAINTENANCE PROGRAM
    3.3 RESPONSES TO QUESTIONS RELATED TO LONG-TERM EFFECTIVENESS OF INSPECTION/MAINTENANCE
APPENDIX A. STATISTICAL TABLES
APPENDIX B. MASTER-AID CLASSIFICATION ALGORITHM USER'S MANUAL
APPENDIX C. AS-51 LOG-LINEAR MODEL FIT ALGORITHM
REFERENCES
FIGURES
2.1  Example of ideal situation in which short-test results determine FTP categorization
2.2  Example of common situation in which two point sets overlap
2.3a Example of short-test results with poor correlation, E*, with FTP
2.3b Example of desired probability of misclassification computed with the formula E* = (1-P)E0 + PE1
2.3c Example of case where one might need a nonparametric technique for generating a nonlinear rank score
2.4  Series of block-stages for ST/FTP association problem
2.5  Examples of histogram plots
2.6  Scatterplot, with points from Categories I and II identified by distinct letters
2.7a Simulated normal data plotted on normal paper
2.7b Simulated non-normal data plotted on normal paper
2.8  Linear regression line fit for assessing the joint normality of NOx and CO
2.9a Distance representation of bivariate normal data on a semilog probability plot
2.9b Angle representation of bivariate normal data on a uniform probability plot
2.10 Hypothetical decision tree
2.11 Tree structure of the Output Example
TABLES
2.1 Bivariate HC and NOx Values with Associated Z1, Z2, θ, r², and r²loge r²
2.2 Sorted Values of U = 0.5 + θ_k/2π, the Associated Cumulative Frequency F = k/(n+1), and the KS Statistic
2.3 Output Example (Friedman's Algorithm)
2.4 Complete Models in Three Dimensions
1.0 INTRODUCTION
1.1 BACKGROUND
Under EPA Contract No. 68-03-2513, the Environmental Protection Agency
is currently conducting the Short Test Correlation and Effectiveness Study
(Portland Study), a large-scale emission testing program, in Portland, Oregon.
For all the vehicles involved in the study, vehicle emissions are measured by
three short tests (STs) at a State inspection station and by the three tests
plus the Federal Emission Certification Test Procedure (FTP) at an emission
testing laboratory.
The two basic areas of interest in the Portland Study are: (1) the
issue of association between short-test emissions (as measured in a real-world
inspection/maintenance environment) and FTP emissions; and (2) theoretical
and practical questions concerning the process of inspection and maintenance.
1.2 GENERAL DESCRIPTION
The purpose of this report is to provide various sound statistical
procedures that could be used in analyzing data from the Portland Study.
In particular, the report is written in the form of a user's manual, with
two types of reader in mind:
1. The reader involved in the Portland Study who has a limited
knowledge of statistics. He or she should be able to follow the
text, build up his/her own statistical capability with regard to
the Portland Study, and make sound statistical judgments.
2. The reader who is an experienced programmer. He or she can
easily implement the procedures, following the instructions and
programs contained in the report.
Because of the many possible salient features of the data collected
from the Portland Study, the report provides the user with a sequential
approach to data analysis. Typically, the approach consists of a series
of block-stages. Several subtasks are to be performed within each
block-stage, and the exact content of the subtasks depends on the
outcome of the analysis in the previous stages. For the performance of
each subtask, several computational procedures are presented. Whenever the
computation can be done with a hand calculator, the procedure will be
displayed step by step, illustrated with numerical examples to make the
computations more transparent. Whenever the computation calls for a
packaged statistical subroutine, the source of the program (in some
cases, even the details of the program), the essentials of card prepar-
ation, the typical computer printout, and the interpretation of the
printout, as well as the kind of statistical inference, are provided
in great detail. Although the user could do a good analysis by blindly
following the complete cycle of block-stages, human judgment and insight
are also required in performing the analysis, because data analysis is
an art as well as a science.
The organization of the report is as follows: Chapter 2 deals
exclusively with the main issue of the association between short-test
emissions and FTP emissions. Following a description of the general
philosophy of the method for analyzing the short-test association, the
sequential approach to the analysis is presented by dividing Section 2.1
into five subsections. Section 2.1.1 describes various graphical and
descriptive techniques for understanding the data and for suggesting
possible ways of analyzing the data. Section 2.1.2 describes various
procedures for testing the distributional properties of the emission
data. The outcome will guide the next phase of the analysis. Section
2.1.3 describes the technique of multivariate classification. This
technique allows the user to extract the maximum amount of information
out of the multivariate data in evaluating the ST/FTP association.
Section 2.1.4 describes various methods for evaluating the ST/FTP
association. Section 2.1.5 describes ways of detecting those factors
(e.g., engine type, model year) that affect the ST/FTP association.
Chapter 3 gives a detailed account of the responses to the theoret-
ical and practical questions concerning the ST/FTP association and the
process of inspection and maintenance. For each question, the answer
contains either a detailed computational procedure or a reference to
procedures previously mentioned.
Appendix A gives all the needed statistical tables; Appendix B,
a user's manual and program description of Friedman's tree-structured
algorithm; and Appendix C, the iterative proportional fit algorithm to
be used in Section 2.1.5, "Stage 5. Testing Hypotheses."
1.3 USER'S GUIDE
For the evaluation of the ST/FTP association, the user should first
read Section 2.0 to get a general idea of the sequential approach. If
the user wants to get a quick answer to a specific question related to the
ST/FTP association, he or she should read Section 3.1, "Responses to
Specific Questions Related to the ST/FTP Association." For actually doing
the analysis, the user should read the description in Section 2.1.5,
"Testing Hypotheses," in which guidelines are given on which part of the
data will be analyzed, and then proceed to the analysis according to
Sections 2.1.1 through 2.1.5, sequentially.
As to the process of inspection and maintenance, the user should
consult Sections 3.2 and 3.3, in which specific answers to specific
questions are given.
2.0 SHORT TEST/FTP ASSOCIATION
The following paragraphs set forth the general philosophy of the
method for analyzing the short-test/FTP association.
Each car participating in the Portland Study undergoes a series of
short tests at a State inspection station: the idle test, the Federal
Three Mode, and the Federal Short Cycle. These short tests are repeated
at the emission control laboratory, and a Federal Emission Certification
Test Procedure (FTP) is also performed. Each car has sets of measured HC,
CO, and/or NOx values obtained from the short tests administered by State
personnel at the State inspection station. For the idle test, an indica-
tion of whether the vehicle passed or failed a given set of idle HC and
CO standards is given. For the FTP, an indication of passage or failure
is also provided.
With this background in mind, we designate the aggregate of cars
passing the FTP as Category I, and those failing the FTP, Category II.
And, conceptually, let us assume that an observation of HC, CO, and NOx
values can be thought of as a single point in a three-dimensional space.
In the event that the short-test (ST) results agree perfectly with the
FTP results, there should be a clear separation between the totality of
points associated with cars in Category I and the totality of points
associated with cars in Category II. And, therefore, solely on the basis
of its short-test results, a car could be classified unambiguously into
either Category I or II. Figure 2.1 illustrates such an ideal situation,
with the short-test results on HC and NOx completely determining the FTP
categorization. In this idealized situation, the line a(NOx) + b(HC) = C
[Figure 2.1: scatterplot of HC against NOx showing complete separation
of the two categories.]
Figure 2.1 Example of ideal situation in which short-test results
determine FTP categorization. The term C is the cutoff point for the
linear rank score a(NOx) + b(HC) = C.
would separate the two categories and provide a convenient way of estab-
lishing FTP passage or failure. For a given car, then, computing
a(NOx) + b(HC) = C*, its short-test linear rank score, would classify the
car into the "pass" category if C* < C.

[Figure 2.2: scatterplot of overlapping Category I (o) and Category II
(x) points against NOx.]
Figure 2.2 Example of common situation in which two point sets overlap.
The minimal expected probability of misclassification is (E0 + E1)/2.

More commonly, the two point sets overlap, as in Figure 2.2, and no line
separates the categories perfectly; any linear rank score then
misclassifies some cars from each category. Let E0 and E1 be the
probabilities of a false positive error and of a false negative error,
respectively; for equal-sized categories, the best attainable rule yields
the minimal expected probability of misclassification, E* = (E0 + E1)/2.
E* can be used as a measure of the degree of correlation between the
short test and FTP. Figure 2.3a illustrates a situation in which the
short-test results are poorly associated with the FTP, based on HC and
NOx measurements (E* = .4).
The preceding discussion can be generalized to situations in which
the observations are three-dimensional and each observation comes from
Category I with probability P and from Category II with probability 1-P
(i.e., the size of Category I may be P times the total number of cars in
the Portland Study). Thus, a hyperplane may separate the two sets in
space, and, accordingly, a linear rank score of the form
a(NOx) + b(CO) + c(HC) can be used to perform the classification. The
probability of misclassification can be computed with the formula
E* = (1-P)E0 + PE1. (See Figure 2.3b.) For equal-sized groups, P = .5,
and E* = .5(E0 + E1), as before.
For real data, however, the two batches of points considered before
may not follow normal distributions, and a linear rank score may then be
inappropriate for classification because of its large error rates. In
this situation one may invoke a nonparametric technique to generate a
"nonlinear" rank score to perform the classification. Figure 2.3c
illustrates one such situation.
In view of the many possible salient features of the data, a
sequential approach to the analysis is recommended. For this problem,
the approach consists of a series of block-stages (see Figure 2.4).
Within each block-stage, a number of trial runs will be performed to
achieve a specific goal for each subtask within the block-stage.
Figure 2.3a Example of short-test results with poor correlation, E*,
with FTP. Here E0 = 3/10, E1 = 5/10, and
E* = (1/2)(3/10) + (1/2)(5/10) = .4.
Figure 2.3b Example of desired probability of misclassification computed
with the formula E* = (1-P)E0 + PE1; in the example shown, E* = 0.27.
Figure 2.3c Example of case where one might need a nonparametric
technique for generating a nonlinear rank score.
The exact content of the subtask depends on the outcome of the analysis
in the preceding stages. A feedback loop may be necessary between two
consecutive stages until a satisfactory answer has been reached.
2.1 PROCEDURES FOR ANALYZING THE ST/FTP ASSOCIATION
In this section, a sequential approach to analyzing the short-test/FTP
association will be described. The input to the procedure consists of an
observation of at least five dimensions on each vehicle (two, three, or
nine ST measurements, plus three FTP measurements). For a given
set of FTP standards, the FTP "pass" or "fail" status of a vehicle can be
determined through comparison of the three FTP emission values with
the three standards, the violation of any one emission standard resulting
in a "fail." As before, let I and II denote the sets of all cars which
pass and fail the FTP, respectively. The following stages for analyzing
the problem of association are recommended:
Stage 1. Graphically explore the emission data.
Stage 2. Assess the distributional properties of the emission data.
Results from Stage 1 may guide the analysis in this stage.
Stage 3. Implement the classification algorithm, based on the
conclusions from Stage 2.
Stage 4. Evaluate the ST/FTP association in terms of results
obtained from Stage 3.
Stage 5. Test various hypotheses, and repeat the whole process for
further adjustment.
Figure 2.4 shows a summary flow chart of the block-stages involved in
the analysis.
[Figure 2.4: flow chart of the block-stages. (Stage 1) Try to understand
the data -- histogram plots, scatterplots, summary statistics. (Stage 2)
Test for normality -- methods for evaluating similarity of marginal/joint
distributions; if normality is rejected, transform the data (log, etc.)
and retest for normality. (Stage 3) If normality holds, perform linear
discriminant analysis -- determine the Fisher direction; otherwise,
perform nonparametric classification analysis -- Friedman's algorithm.
(Stage 4) Evaluate the ST/FTP association. (Stage 5) Perform various
hypothesis tests.]
Figure 2.4 Series of block-stages for ST/FTP association problem.
2.1.1 Stage 1. Graphical Exploration
Data analysis starts with an attempt to "feel what the data are like."
It is instinctive for a statistician to ask the following questions:
• What does the distribution for a particular variable look like?
Does it look like a normal or a lognormal distribution?
• Are there any indications of a possible relationship between a
specified pair of variables?
• What is the data quality? Are there many missing values, errors,
or outliers? ("Outliers" are observations which are different
enough in magnitude from the rest of the observations that they are
regarded as having come from another population.)
The answers to these questions can best be obtained through various
plotting schemes. To be able to "see" things is the first step towards
understanding. Three commonly used plotting routines are available in
packaged statistical computer programs. These routines are:
• Histogram plotting routine
• Bivariate plotting routine
• Descriptive routine.
Figure 2.5 shows some examples of histogram plots. From these plots,
the statistician may spot some salient features of distribution and thus
gain insights for a thorough analysis in the next stage.
Using the bivariate plotting routine, scatterplots, together with
regression equations and their residual plots, can be printed for various
pairs of emission variables, e.g., scatterplots for short-test NOx versus
short-test CO, or short-test NOx versus FTP-NOx. The scatterplots can
Figure 2.5a Histogram plot for FTP-CO. Suggests the distribution is
normal.
Figure 2.5b Histogram for ST-CO. Highly skewed with a long tail to the
right, which suggests a lognormal or χ² distribution, so that a
logarithmic or square-root transformation might be appropriate.
Figure 2.5 Examples of histogram plots.
[Figure 2.5c: histograms of NOx frequency for Category I, for Category
II, and for the combined group (Categories I and II).]
Figure 2.5c Further examples of histogram plots.
reveal any possible functional relationship between the two variables;
the residual plots (obtained from fitting the linear equation) can detect
outliers or strange points. Also requested from such a routine are plots
for a pair of ST emission variables with points from either Category I or
II, identified by distinct letters. (See Figure 2.6). Unless some of
these plots indicate a good separation between the two sets of letters,
there is likely to be a poor association between the ST and FTP, regardless
of the various formal procedures to be pursued in the later course of the
analysis.
A descriptive routine computes summary statistics—mean, covariance,
largest and smallest values—and lists all data and all cases containing
missing values. The quality of the data can be appraised by scanning the
output, particularly for missing values and errors.
Special attention should also be paid to the covariance matrix for
the six pollutant variables—namely, three ST emission variables and three
FTP emission variables; small values in the off-diagonal terms indicate
that the variables are mutually uncorrelated.
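
For illustration, a minimal present-day sketch of such a descriptive
scan is given below in Python. The six-column layout (three ST and three
FTP pollutant variables) is an assumption for the sketch, not the actual
Portland Study record format.

    import numpy as np

    def describe_emissions(data, names):
        """data: (n_cars, 6) array; names: the six pollutant labels."""
        for j, name in enumerate(names):
            col = data[:, j]
            ok = col[~np.isnan(col)]
            print(f"{name}: mean={ok.mean():.3f}  "
                  f"min={ok.min():.3f}  max={ok.max():.3f}")
        # Cases with missing values, for the data-quality scan.
        print("cases with missing values:",
              np.where(np.isnan(data).any(axis=1))[0])
        # Covariance of the six pollutant variables; small off-diagonal
        # entries indicate nearly uncorrelated variables.
        complete = data[~np.isnan(data).any(axis=1)]
        return np.cov(complete, rowvar=False)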
2.1.2 Stage 2. Assessment of the Distributional Properties of the
Emission Data
An assumption of multivariate normality underlies much of the
standard, classical multivariate statistical methodology. But in real
data, departures from normality are common. Thus,
it is essential to check the normality assumption before applying the
classical theory. Such a check will be helpful not only in guiding the
subsequent analysis, but also in suggesting the need for a transformation
of the data to make them more nearly normally distributed (and therefore
capable of having the classical theory applied).

[Figure 2.6: scatterplot of a pair of ST emission variables against NOx,
with Category I points plotted as A and Category II points as B.]
Figure 2.6 Scatterplot, with points from Categories I and II identified
by distinct letters.
The joint and marginal distributions of the emissions measured by
the short test for in-use vehicles have not been checked; such a check
will be necessary once the Portland data are made available.
The following steps for assessing joint normality are recommended:
Step 1. Test for marginal normality (FTP and ST separately).
Step 2. If the hypothesis of normality is rejected in Step 1,
perform a transformation of the data and redo Step 1.
(NOTE: If it is obvious that the data cannot be
normalized, go to Stage 3.)
Step 3. Test for joint normality.
Step 1. Test for Marginal Normality. Although marginal normality
does not imply joint normality, the presence of many types of non-normality
is often reflected in the marginal distributions as well. Hence, a natural,
simple, and preliminary step in evaluating the normality of multivariate
data is to study the reasonableness of marginal normality for the obser-
vations on each of the variables.
Two commonly used techniques for assessing marginal normalities are:
• normal probability plot
• skewness and kurtosis coefficients and the Kolmogorov-Smirnov
(K-S) test.
A normal probability plot is a graphical tool for assessing univariate
normality. The technique consists of
1. Ordering the n observations from smallest to largest, where X(i)
is the i-th smallest observation in the sample, i = 1,...,n, and
X(1) ≤ X(2) ≤ ... ≤ X(n).
2. Letting P_i = (i - 1/2)/n.
3. Plotting X(i) against P_i on normal probability paper.
A linear configuration on such a plot would correspond to adequate
normality of the observations (Figure 2.7a), while systematic and subtle
departure from normality would be indicated by deviations from linearity
(Figure 2.7b).
Computing Support: TSC computer library, BMDP5D (the histogram program
mentioned in Stage 1).
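
For users working outside the packaged programs, the plotting positions
can be computed directly. The following Python sketch (a modern stand-in
for BMDP5D; the array x is assumed to hold one emission variable, e.g.,
ST-CO readings) reproduces the three steps above:

    import numpy as np
    from scipy.stats import norm

    def normal_plot_coordinates(x):
        x_sorted = np.sort(np.asarray(x, dtype=float))  # X(1) <= ... <= X(n)
        n = x_sorted.size
        p = (np.arange(1, n + 1) - 0.5) / n             # P_i = (i - 1/2)/n
        z = norm.ppf(p)                                 # normal-paper abscissa
        return z, x_sorted

Plotting x_sorted against z on ordinary axes reproduces normal
probability paper; a roughly linear configuration indicates adequate
normality.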
Skewness and kurtosis coefficients and the K-S test provide a
statistics-based formal test of univariate normality. Let X1,...,Xn be n
univariate emission observations. The skewness and kurtosis coefficients
are defined as

    √b1 = m3/m2^(3/2)   and   b2 = m4/m2²,

where m_r = (1/n) Σ_i (Xi - X̄)^r is the r-th sample moment about the
mean.
Figure 2.7a Simulated normal data plotted on normal paper.
Figure 2.7b Simulated non-normal data plotted on normal paper.
The following are characteristics of √b1 and b2:
1. When the population density is symmetric, then √b1 = 0. When
the population density has a long tail to the right, then √b1 > 0; and
when it has a long tail to the left, √b1 < 0.
2. The value of b2 for any normal distribution is b2 = 3. But if
the distribution is concentrated toward the mean, then b2 < 3;
otherwise, b2 > 3.
For real data, the values of these coefficients are subject to random
fluctuations; therefore, a statistical table has to be used to determine
the critical value. Tables of approximate 5% and 1% points of these two
statistics are given in Appendix A (Table A-1). The procedures for
testing skewness and kurtosis are given below.
SKEWNESS & KURTOSIS TEST (√b1 = 0 and b2 = 3)

PROCEDURE
(1) Select a variable to be tested, say variable X. (This may be the
original variable or the transformed variable.)
(2) Set a significance level, α. Following convention, set α equal to
either 0.01 or 0.05.
(3) Compute the skewness coefficient, √b1, and the kurtosis coefficient,
b2, for the variable X. These two numbers can be easily obtained from the
output of a frequency count subroutine such as BMDP2D or the
SPSS-FREQUENCY subprogram.
(4) Let N be the number of sample points. Look up the one-tailed
percentage point, p1, of the distribution of √b1, determined by N and α,
from the table for testing skewness (see Table A-1a). Also, look up the
percentage point, p2, of the distribution of b2, determined by N and α,
from the table for testing kurtosis (see Table A-1b).
(5) If √b1 obtained from (3) is smaller than p1, accept the hypothesis
that √b1 = 0 at significance level α; X thus passes the skewness test for
normality. Otherwise, reject the hypothesis that √b1 = 0; X, therefore,
is not normal. Similarly, if b2 obtained from (3) is smaller than p2,
accept the hypothesis that b2 = 3 at significance level α; X thus passes
the kurtosis test for normality. Otherwise, reject the hypothesis that
b2 = 3; X, therefore, is not normal.

EXAMPLE
(1) Suppose, from a histogram plot in Stage 1, that the CO measured from
the ST for all the cars passing the FTP looks normal. One then decides to
test the ST-CO variable for normality.
(2) Set α = 0.05.
(3) Using the BMDP2D subroutine, one obtains √b1 = 0.2 and b2 = 3.5.
(4) Suppose there are 112 sample points. From the table for testing
skewness, one obtains the one-tailed percentage point of the distribution
of √b1 as p1 = 0.350 (determined by 112 and 0.05). Similarly, from the
table for testing kurtosis, one obtains the percentage point of the
distribution of b2 as p2 = 3.71.
(5) Since √b1 = 0.2 < 0.35, the hypothesis √b1 = 0 is accepted at the
0.05 significance level. Also, since b2 = 3.5 < 3.71, the hypothesis
b2 = 3 is accepted at the 0.05 significance level. Therefore, X passes
the first test of normality and can proceed to a further check for
normality, as shown in the next procedure.
In the event that √b1 = 0 and b2 = 3 have been statistically accepted,
which is an indication of normality, the K-S test can be utilized
to further test the hypothesis that the observations are normally
distributed. The procedure for the K-S test can be described as
follows:
KOLMOGOROV-SMIRNOV TEST

PROCEDURE
(1) Select the variable to be tested, say X. (This may be the original
variable or the transformed variable.)
(2) Set a significance level α. Following convention, set α = 0.01 or
0.05.
(3) Calculate the sample mean, X̄, and sample variance, S², of variable
X. These two numbers can be easily obtained from the output of a
descriptive subroutine such as BMDP5D or the SPSS-DESCRIPTIVE subprogram.
(4) Set the null hypothesis, H0: X follows a normal distribution with
mean X̄ and variance S², written N(X̄, S²). Implement the subroutine
KOLMO of the SSP-1 (available through MIDAS and the TSC computing
library) on the X data with the hypothesized distribution as stated
in H0.
(5) From the printout of the KOLMO program, look for the K-S distance, D.
For the significance level α, look up the critical value P for the K-S
test of goodness-of-fit from Table A-2b (see Appendix A). (NOTE: If the
number of sample points is less than 100, find the critical value from
the table under the α-level column. If the number of sample points is
greater than 100, compute the critical value from the asymptotic
formula.) If D < P, accept H0; otherwise, reject H0.

EXAMPLE
(1) Suppose, in Stage 1, that the histogram of CO measured from the ST
for cars passing the FTP looks normal. Then one decides to formally test
the normality of the ST-CO variable for cars passing the FTP.
(2) Set α = 0.05.
(3) Using the SPSS-DESCRIPTIVE subprogram, one obtains the sample mean,
X̄, and sample variance, S², of the ST-CO variable as X̄ = 34.5 and
S² = 2.0.
(4) Set the null hypothesis as H0: ST-CO follows a normal distribution
with mean 34.5 and variance 2.0. Implement the subroutine KOLMO of the
SSP-1 on the ST-CO data with the hypothesized distribution as stated
in H0.
(5) From the printout of the KOLMO program, look for the K-S distance, D,
and find D = 0.035. For the 0.05 significance level, look up the critical
value P from Table A-2b. Since there are 112 sample points, the
asymptotic formula for obtaining P is used, and we get
P = 0.886/√112 = 0.083. Since P > D, accept the hypothesis; i.e., X
follows N(34.5, 2.0).
Step 3. Test for Joint Normality. Since there are only three emission
variables, it usually suffices
to test the bivariate normality for each pair of variables.
Two methods for assessing the joint normality are:
• A test based on linear regressions
• A test based on radius-and-angle representation of multivariate
data.
A property of the multivariate normal distribution is that the
regression of each variable on all others is linear. Thus, a simple test
for detecting joint non-normality would be to perform a linear regression
line fit and decide from the residual plot whether the regression shows
any sign of curvature. Figure 2.8 illustrates two possible situations.
If the residual plot indicates obvious curvature, reject the hypothesis
of joint normality; otherwise, accept bivariate normality.
Computing Support: Any linear regression program with a residual plot
will do.
The second procedure is based on two things: 1) representing a two-
dimensional data point in terms of radius and angle; and 2) under the joint
normal assumption, the radius's following a chi-squared distribution with
2 degrees of freedom and the angle's following a uniform distribution over
(-π, π). Specifically, the operations of this procedure can be described
as follows:
[Figure 2.8a: residual plot without curvature -- case for the acceptance
of possible joint normality of NOx and CO.]
[Figure 2.8b: residual plot indicating curvature -- the hypothesis of
joint normality is rejected.]
Figure 2.8 Linear regression line fit for assessing the joint normality
of NOx and CO, where the fitted value is N̂Ox = a + b(CO) and the
residual is NOx - N̂Ox.
PROCEDURE
(1) Let X1,...,Xn be n two-dimensional observations, Xk = (x_1k, x_2k).
Compute the covariance matrix

    S = | s11  s12 |,   s_ij = [1/(n-1)] Σ_k (x_ik - x̄i)(x_jk - x̄j),
        | s12  s22 |

and the mean vector X̄ = (x̄1, x̄2), where x̄i = (1/n) Σ_k x_ik.
(2) Find θ_k and r_k² for each two-dimensional observation, where
θ_k = tan⁻¹(z_2k/z_1k) (remember to account for the correct quadrant) and
r_k² = z_1k² + z_2k², with

    z_1k = [(x_1k - x̄1)s22 - (x_2k - x̄2)s12] / √[s22(s11 s22 - s12²)]
    z_2k = (x_2k - x̄2) / √s22.

(3) Compute SR and SLR,

    SR = Σ_k r_k²   and   SLR = Σ_k r_k² loge(r_k²),

and calculate the goodness-of-fit statistic

    G = (√n/0.887)[SLR/SR - loge(SR) + loge(n) - 0.4228].

G has a standard normal distribution.
(4) Sort the θ values and transform them to the (0,1) interval by

    U_k = 0.5 + θ_k/2π,

and compute the cumulative frequency of the k-th ordered value,
F_k = k/(n+1). Calculate the Kolmogorov-Smirnov distance, |U_k - F_k|,
for each value, and let KS be the maximum, i.e., KS = max_k |U_k - F_k|.
(5) Check G and KS to see if there is a significant difference.
(6) Plot r²(i) against 1 - P_i on semilog probability paper (the actual
paper is presented as Table A-7), Figure 2.9a, and plot U(i) against P_i
on uniform probability paper, Figure 2.9b. If the data conform
statistically to the null hypothesis of bivariate normality, the
configuration on these two probability plots should be reasonably linear.
Departure from linearity on either or both of the plots would indicate
specific types of departure from the null hypothesis.

Computing Support: TSC computer library.

EXAMPLE
(1) From Table 2.1, which lists HC and NOx data:
    n = 47, s11 = 1.94570, s22 = 8.52141, s12 = 0.6301,
    x̄1 = 2.84383, x̄2 = 3.76447.
(2) Transform the data:
    z_1k = [(x_1k - 2.84383)(8.52141) - (x_2k - 3.76447)(0.6301)] / 11.74320
    z_2k = (x_2k - 3.76447) / 2.91915.
The values appear in Table 2.1.
(3) Compute SR, SLR, and G:
    SR = 243.786, SLR = 874.178, G = 1.517.
(4) Table 2.2 gives the sorted 0.5 + θ_k/2π values, and KS = 0.34862.
(5) G = 1.517 and KS = 0.34862. From the normal table (Table A-3), one
finds that at the 0.05 significance level G is insignificant; i.e., the
distribution of r² is χ². However, from the K-S two-sample table
(Table A-2a), one finds that at the 0.05 significance level KS is
significant; i.e., the hypothesis that θ is uniformly distributed is
rejected. Therefore, the data are not bivariate normal. Transformation of
the variables may be needed.
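
The whole check is mechanical enough to sketch in Python. The sketch
below assumes paired arrays x1, x2 (here, HC and NOx) and writes G as the
bracketed term alone, which is the quantity that reproduces the worked
example (G = 1.517):

    import numpy as np

    def radius_angle_check(x1, x2):
        x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
        n = x1.size
        s11, s22 = x1.var(ddof=1), x2.var(ddof=1)
        s12 = np.cov(x1, x2)[0, 1]
        # Step (2): standardize to uncorrelated unit-variance coordinates.
        z1 = ((x1 - x1.mean()) * s22 - (x2 - x2.mean()) * s12) \
             / np.sqrt(s22 * (s11 * s22 - s12**2))
        z2 = (x2 - x2.mean()) / np.sqrt(s22)
        theta = np.arctan2(z2, z1)        # angle, with the correct quadrant
        r2 = z1**2 + z2**2                # squared radius ~ chi-square(2)
        sr, slr = r2.sum(), (r2 * np.log(r2)).sum()
        g = slr / sr - np.log(sr) + np.log(n) - 0.4228
        u = np.sort(0.5 + theta / (2 * np.pi))   # angles mapped into (0,1)
        f = np.arange(1, n + 1) / (n + 1)        # cumulative freq. k/(n+1)
        ks = np.abs(u - f).max()                 # K-S distance for uniformity
        return g, ks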
TABLE 2.1 BIVARIATE HC AND NOX VALUES WITH ASSOCIATED Z1, Z2, θ, r²,
AND r²loge r²

  HC     NOX      Z1         Z2          θ           r²         r²loge r²
  4.8    4.44    1.38324    0.23141    0.16576     1.96691      1.33054
  2.64   7.00   -0.32152    1.10838    1.85312     1.33188      0.38171
  6.35   1.84    2.64750   -0.65926   -0.24405     7.44388     14.94279
  2.53   2.57   -0.16364   -0.40918   -1.95123     0.19421     -0.31827
  2.02   2.20   -0.51387   -0.53593   -2.33517     0.55120     -0.32829
  2.68   1.82   -0.01455   -0.66611   -1.59263     0.44391     -0.36051
 15.85   1.02    9.58514   -0.94016   -0.09777    92.75883    420.19779
  2.52   1.49   -0.11295   -0.77916   -1.71475     0.61984     -0.29647
  1.82   1.12   -0.60105   -0.90591   -2.15658     1.18192      0.19755
 11.66   4.57    6.35420    0.27595    0.04340    40.45202    149.67719
 14.15   4.84    8.14657    0.36844    0.04520    66.50241    279.12646
  3.45   1.69    0.55117   -0.71064   -0.91111     0.80881     -0.17163
  3.31   1.57    0.45602   -0.75175   -1.02553     0.77309     -0.19897
  3.30   3.01    0.37150   -0.25846   -0.60784     0.20481     -0.32476
  2.45   3.64   -0.27910   -0.04264   -2.98999     0.07972     -0.20163
  4.50   5.30    1.11940    0.52602    0.43929     1.52976      0.65032
  3.43   1.44    0.55008   -0.79628   -0.96627     0.93665     -0.06130
  2.76   2.51    0.00648   -0.42974   -1.55572     0.18472     -0.31197
  1.15   2.25   -1.14786   -0.51881   -2.71709     1.58674      0.73257
  3.48   3.21    0.49139   -0.18994   -0.36885     0.27754     -0.35575
  3.57   6.30    0.39089    0.86859    1.14791     0.90724     -0.08832
  3.46   5.55    0.35132    0.61166    1.04944     0.49755     -0.34732
  1.26   2.33   -1.07233   -0.49140   -2.71189     1.39137      0.45955
  3.01   2.70    0.17770   -0.36465   -1.11736     0.16455     -0.29693
  1.97   3.46   -0.61775   -0.10430   -2.97433     0.39250     -0.36707
  3.26   2.33    0.37896   -0.49140   -0.91387     0.38509     -0.36748
  3.62   1.77    0.67024   -0.68324   -0.79500     0.91604     -0.08033
  1.05   1.80   -1.19628   -0.67296   -2.62917     1.88396      1.19326
  2.20   1.47   -0.34408   -0.78601   -1.98342     0.73620     -0.22546
  1.63   1.61   -0.76521   -0.73805   -2.37426     1.13026      0.13840
  1.61   1.75   -0.78723   -0.69009   -2.42186     1.09596      0.10043
  2.45   3.55   -0.27427   -0.07347   -2.87986     0.08062     -0.20301
  2.51   2.89   -0.19532   -0.29956   -2.14859     0.12789     -0.26302
  4.70   3.48    1.36219   -0.09745   -0.07142     1.86505      1.16246
  3.31   2.60    0.40076   -0.39891   -0.78309     0.31973     -0.36458
  2.73   2.99   -0.04104   -0.26531   -1.72400     0.07207     -0.18956
  3.34   2.81    0.41126   -0.32697   -0.67171     0.27604     -0.35532
  6.07   2.04    2.43359   -0.59074   -0.23814     6.27133     11.51409
  3.03   2.62    0.19650   -0.39206   -1.10618     0.19232     -0.31706
  4.64   3.54    1.31543   -0.07690   -0.05839     1.73627      0.95796
  3.53   3.53    0.51050   -0.08032   -0.15606     0.26706     -0.35259
  3.25   2.07    0.38566   -0.58047   -0.98437     0.48567     -0.35076
  3.33   2.11    0.44156   -0.55677   -0.90894     0.51620     -0.34134
  3.30   1.51    0.45199   -0.77230   -1.04130     0.80075     -0.17794
  3.77   5.21    0.59451    0.49519    0.69450     0.59865     -0.30715
  3.09   2.49    0.24702   -0.43659   -1.05591     0.25163     -0.34720
  2.39   1.62   -0.21426   -0.73462   -1.85458     0.58558     -0.31338
TABLE 2.2 SORTED VALUES OF U = 0.5 + θ_k/2π, THE ASSOCIATED CUMULATIVE
FREQUENCY F = k/(n+1), AND THE KS STATISTIC

     θ           U          F          KS
 -2.98999    0.02413    0.02083    0.00329
 -2.97433    0.02662    0.04167    0.01505
 -2.87986    0.04166    0.06250    0.02084
 -2.71709    0.06756    0.08333    0.01577
 -2.71189    0.06839    0.10417    0.03578
 -2.62917    0.08155    0.12500    0.04345
 -2.42186    0.11455    0.14583    0.03128
 -2.37426    0.12212    0.16667    0.04454
 -2.33517    0.12835    0.18750    0.05915
 -2.15658    0.15677    0.20833    0.05156
 -2.14859    0.15804    0.22917    0.07113
 -1.98342    0.18433    0.25000    0.06567
 -1.95123    0.18945    0.27083    0.08138
 -1.85458    0.20483    0.29167    0.08683
 -1.72400    0.22562    0.31250    0.08688
 -1.71475    0.22709    0.33333    0.10624
 -1.59263    0.24653    0.35417    0.10764
 -1.55572    0.25240    0.37500    0.12260
 -1.11736    0.32217    0.39583    0.07367
 -1.10618    0.32395    0.41667    0.09272
 -1.05591    0.33195    0.43750    0.10555
 -1.04130    0.33427    0.45833    0.12406
 -1.02553    0.33678    0.47917    0.14238
 -0.98437    0.34333    0.50000    0.15667
 -0.96627    0.34621    0.52083    0.17462
 -0.91387    0.35455    0.54167    0.18711
 -0.91111    0.35499    0.56250    0.20751
 -0.90894    0.35534    0.58333    0.22800
 -0.79500    0.37347    0.60417    0.23069
 -0.78309    0.37537    0.62500    0.24963
 -0.67171    0.39309    0.64583    0.25274
 -0.60784    0.40326    0.66667    0.26341
 -0.36885    0.44130    0.68750    0.24620
 -0.24405    0.46116    0.70833    0.24718
 -0.23814    0.46210    0.72917    0.26707
 -0.15606    0.47516    0.75000    0.27484
 -0.09777    0.48444    0.77083    0.28639
 -0.07142    0.48863    0.79167    0.30303
 -0.05839    0.49071    0.81250    0.32179
  0.04340    0.50691    0.83333    0.32643
  0.04520    0.50719    0.85417    0.34697
  0.16576    0.52638    0.87500    0.34862
  0.43929    0.56992    0.89583    0.32592
  0.69450    0.61053    0.91667    0.30613
  1.04944    0.66702    0.93750    0.27048
  1.14791    0.68270    0.95833    0.27564
  1.85312    0.79493    0.97917    0.18423
Figure 2.9a Distance representation of bivariate normal data on a
semilog probability plot.
Figure 2.9b Angle representation of bivariate normal data on a uniform
probability plot.
2.1.3 Stage 3. Implementation of Classification Algorithms
Analysis of the ST/FTP association involves the use of classification
techniques to separate the short-test emission data according to known FTP
categorizations. Since classical classification schemes are highly sensi-
tive to underlying distributions, the conclusion from Stage 2 is a crucial
factor in deciding which particular classification technique to adopt.
Specifically, if Stage 2 concludes that the underlying distributions are
multivariate normal, then the classical discriminant analysis algorithm
is used. Otherwise, Friedman's recursive partitioning nonparametric
classification is used.

Discriminant Analysis Algorithm. This algorithm produces a linear rank
score, the Fisher discriminant function, which separates the emission
data space into two parts, "pass" and "fail". If, as concluded from
Stage 2, each of the two batches of points follows a multivariate normal
distribution, then the Fisher discriminant function computed from the
algorithm is optimal; that is, no other linear rank procedure will
generate a lower expected probability of misclassification
[Anderson 1958, pp. 129-136].

Specifically, assume that the observations of short-test emission vectors
Y1,...,YN1 for cars in Category I are from a multivariate normal
distribution N(MY, S) and Z1,...,ZN2 for cars in Category II are from
N(MZ, S), respectively, where the true means MY and MZ and the common
covariance matrix S are unknown (and thus have to be estimated from the
data). The algorithm computes the Fisher discriminant function according
to the following steps:
1. Compute the sample mean vectors, Ȳ and Z̄, as well as the sample
covariance matrices, SY and SZ.
2. Compute the pooled covariance matrix as

    S = [(N1 - 1)SY + (N2 - 1)SZ] / (N1 + N2 - 2).

3. Form the Fisher discriminant function (or the Fisher linear rank
score) as

    r(x) = x S⁻¹(Ȳ - Z̄)ᵀ - (1/2)(Ȳ + Z̄)S⁻¹(Ȳ - Z̄)ᵀ,

where x = (X1, X2, X3) is an emission testing vector. NOTE: r(x) is a
linear combination of the three individual emission variables, X1, X2,
and X3; ᵀ designates the transpose.
4. The desired cutoff point is given by the number

    C* = loge(N2/N1).

Thus, the best way to classify the car with emission vector x is to
assign it to Category I if r(x) > C* and to Category II if r(x) < C*.
In case costs C1 and C2 are attached to the false negative and false
positive errors, respectively, the cutoff point becomes

    C* = loge(N2 C1 / N1 C2).
EXAMPLE: Assume that the raw data contain N1 = 70 cars from Category I
and N2 = 43 cars from Category II. Suppose that Stage 2 confirmed that,
for each category, the short-test emission variables X1 = log(NOx),
X2 = CO, and X3 = HC have a multivariate normal distribution. Then the
Fisher discriminant function can be computed, using linear discriminant
analysis, according to the following computational steps:
1. Compute the sample mean and sample covariance of Category I and of
Category II. Assume, for this example, that these values are

    Ȳ = (1, 8, 2), i.e., X̄1 = 1, X̄2 = 8, X̄3 = 2;    SY = diag(1, 2, 3);
    Z̄ = (2, 10, 1);                                   SZ = diag(3, 2, 1).

2. Compute the pooled covariance matrix. In this example,
    S = [(70 - 1)SY + (43 - 1)SZ] / (70 + 43 - 2)
      = [diag(69, 138, 207) + diag(126, 84, 42)] / 111
      = diag(1.76, 2.00, 2.24).

3. Compute the matrix inverse of S:

    S⁻¹ = diag(0.57, 0.50, 0.45).

4. Compute the Fisher discriminant function (the Fisher linear rank
score). In this example, with Ȳ - Z̄ = (-1, -2, 1) and
Ȳ + Z̄ = (3, 18, 3), it is

    r(x) = [X1 X2 X3] S⁻¹ (-1, -2, 1)ᵀ - (1/2)(3, 18, 3) S⁻¹ (-1, -2, 1)ᵀ
         = -0.57X1 - 1.0X2 + 0.45X3 + 9.18,

where x = (X1, X2, X3) is a testing emission vector.
5. Find the cutoff point for the Fisher linear rank score. Here, it is

    C* = loge(43/70) = -0.49.

Thus, a car with short-test emission vector x is classified into
Category I (passing the FTP) if r(x) > -0.49, or equivalently,

    -0.57[log(NOx)] - 1.0(CO) + 0.45(HC) ≥ -9.67,

and into Category II if r(x) < -0.49.
If a false negative error were twice as crucial as a false positive
error, the derived cutoff point for the Fisher linear rank score would be
C* = loge(2 × 43/70) = 0.21.
In practice, some of the variables least crucial to the classification
can be deleted totally from the discriminant function. When such deletion
occurs, a stepwise discriminant analysis algorithm can be used to identify
those variables which "best" discriminate between the two categories and to
produce the Fisher discriminant function based on those variables.
Most computer statistical packages (e.g., BMDP, SPSS) contain a step-
wise discriminant analysis procedure similar to that outlined above. The
following example illustrates the use of such a procedure. By following the
instructions listed in the example, a user with a BMDP or SPSS manual should
be able to perform the analysis and interpret the results.
EXAMPLE: Assume that the short-test emission variables X1 = log(NOx),
X2 = CO, and X3 = HC have a multivariate normal distribution and that the
raw data contain N1 = 70 cars from Category I and N2 = 43 cars from
Category II. The subroutine BMDP7M or SPSS-DISCRIMINANT can be used to
perform the stepwise discriminant analysis.
For both subroutines, a description of the card preparation, details
of the printout, and interpretation of the printout are given below.
SPECIAL PROGRAM CARD PREPARATION

BMDP7M -- The program card deck consists of a number of paragraphs,
including PROBLEM, INPUT, VARIABLE, and GROUP. Within each paragraph is a
set of (required or optional) statements to be specified by the user. The
details of the card preparation are given in the BMDP manual. Statements
which require special instructions are listed below.

STATEMENT
(1) In the GROUP paragraph, set PRIOR = N1/(N1+N2), N2/(N1+N2). In case
costs C1 and C2 are associated with the false negative error and false
positive error, respectively, set PRIOR = C1N1/(N1+N2), C2N2/(N1+N2).
(2) In the DISCRIMINANT paragraph, ENTER = limiting value of F-to-enter;
REMOVE = limiting value of F-to-remove.
(3) In the DISCRIMINANT paragraph, FORCE = a number indicating the level
through which variables are forced into the equation; a zero indicates no
forcing. The assumed value is zero.

EXAMPLE
(1) Compute 70/(70+43) = 0.62 and 43/(70+43) = 0.38, and set
PRIOR = 0.62, 0.38. (1)' If the false negative error is twice as crucial
as the false positive error, weight the PRIOR values by the costs as in
statement (1).
(2) ENTER = 0.01, REMOVE = 0.005.
(3) For this example, no forcing is specified. But if the user decides
that, e.g., variable X2 must be included in the equation, then write
FORCE = 0, 1, 0.
SPSS-DISCRIMINANT -- The program cards consist of a number of
specification phrases followed by a number of option numbers. Details of
the program are given in the SPSS manual. Statements which require
special instructions are listed below.

STATEMENT
(1) ANALYSIS = lists the discriminating variables to be used in the
analysis. An inclusion level may be specified; the default inclusion
level is 1.
(2) METHOD = DIRECT or METHOD = WILKS.
(3) FIN = limiting value of F-to-enter; FOUT = limiting value of
F-to-remove.
(4) OPTION = 12 prints the classification function coefficients.

EXAMPLE
(1) ANALYSIS = X1, X2, X3/ In the case where X2 is forced to be included
in the equation, write ANALYSIS = X1, X2(1), X3/
(2) Write METHOD = WILKS/ If the Fisher discriminant function is to be
computed, forcing all three variables into the equation, set
METHOD = DIRECT/
(3) FIN = 0.01/ FOUT = 0.005/
(4) OPTION = 12/ is required.
PRINTOUTS

OUTPUT EXAMPLE (BMDP7M)

Step number 1
F-matrix degrees of freedom = 1 111
Classification functions
             Group = 1         2
Variable
X1               0.262      0.141
Constant       -11.627     -3.383

Step number 2
F-matrix degrees of freedom = 2 110
Classification functions
             Group = 1         2
Variable
X1               0.690      0.544
X3             211.621    199.016
Constant      -213.956   -182.326

Step number 3
F-matrix degrees of freedom = 3 109
Classification functions
             Group = 1         2
Variable
X1               0.686      0.535
X3             211.481    198.654
X2               0.126      0.324
Constant      -214.141   -183.557

Classification matrix
                                 Number of Cases
Group     Percentage Correct     Classified into Group
                                    1       2
1               95.7               67       3
2               83.7                7      36
TOTAL           89.7               74      39

Summary table
Step     Variable             F-Value to         U-          Approximate
Number   Entered   Removed    Enter or Remove    Statistic   F-Statistic
1        X1                   131.41             0.458       131.41
2        X3                    17.35             0.396        84.06
3        X2                     2.55             0.340        63.58
OUTPUT EXAMPLE (SPSS-DISCRIMINANT)

Step 1
F-matrix - degrees of freedom: 1 111
Classification function coefficients
             Group 1     Group 2
X1             0.262       0.141
Constant     -11.627      -3.383

Step 2
F-matrix - degrees of freedom: 2 110
Classification function coefficients
             Group 1     Group 2
X1             0.690       0.544
X3           211.621     199.016
Constant    -213.956    -182.326

Step 3
F-matrix - degrees of freedom: 3 109
Classification function coefficients
             Group 1     Group 2
X1             0.686       0.535
X3           211.481     198.654
X2             0.126       0.324
Constant    -214.141    -183.557

Prediction Results:
Actual     No. of     Predicted Group Membership
Group      Cases      Gp. 1           Gp. 2
Group 1    70         67 (95.7%)       3 (4.3%)
Group 2    43          7 (16.3%)      36 (83.7%)

Summary Table:
Step     Variable             F to Enter    Wilks'
Number   Entered   Removed    or Remove     Lambda    Rao's V
1        X1                   131.41        0.458     131.41
2        X3                    17.35        0.396      84.06
3        X2                     2.55        0.340      63.58
INTERPRETATION OF COMPUTER PRINTOUT

Procedure for choosing the best subset and the accompanying Fisher
discriminant function.

PROCEDURE
(1) Choose α, the significance level for choosing the best subset of the
variables.
(2) Look up the F-matrix degrees of freedom, d1 and d2, at each step, and
copy them.
(3) Look up F(1-α)(1, d2-1) in the F-table (see Table A-5) for each step
number. Here, the d2 degrees of freedom are found in the table given
in (2).
(4) Look up, for the variable entered at each step, the F-to-enter or
-remove value from the summary table. Compare the F-to-enter or -remove
value with the F(1-α)(1, d2-1) value obtained in (3) at each step: if the
F-to-enter or -remove value is greater than F(1-α)(1, d2-1), then the
variable entered at that step should be included in the discriminant
function; otherwise, the variable is excluded from the discriminant
function.
(5) Look up the coefficient table for the classification functions at
step k in the printout, where k is the number of variables in the best
subset obtained in (4):

Step number k
Classification functions
             Group = 1     2
Variable
XK1              a11      a12
XK2              a21      a22
...
XKK              aK1      aK2
Constant         C1       C2

(6) Compute the Fisher discriminant function from the coefficient table
in (5) according to the formula

    Z = (a11 - a12)XK1 + (a21 - a22)XK2 + ... + (aK1 - aK2)XKK + C1 - C2.

(7) Classify an x into group A if

    Z = (a11 - a12)XK1 + ... + (aK1 - aK2)XKK + C1 - C2 > 0.

Here, XK1,...,XKK significantly contribute to the discrimination of the
two groups at the α significance level.

EXAMPLE
(1) Set α = 0.05, 1 - α = 0.95.
(2) The degrees of freedom appearing in the F-matrix following each step
number are recorded as follows:

Step number     d1     d2
1               1      111
2               2      110
3               3      109

(3) Look up F.95(1, d2-1) in the F-table for each step number and record
them. Here, the d2 degrees of freedom are found in the table given
in (2):

Step number     F.95(1, d2-1)
1               3.96
2               3.96
3               3.98

(4) Look up, for the variable entered at each step, the F-to-enter or
-remove value from the summary table:

Step Number     Variable Entered     F-to-Enter or -Remove
1               X1                   131.41
2               X3                    17.35
3               X2                     2.55

Compare with the table in (3). At step 1, the F-to-enter for X1 is
greater than 3.96; thus, X1 is included in the discriminant function. At
step 2, the F-to-enter for X3 is greater than 3.96; thus, X3 is included.
At step 3, the F-to-enter for X2 is less than 3.98; thus, X2 is excluded.
Therefore, the best subset consists of X1 and X3.
(5) Look up the coefficient table for the classification functions at
step 2 in the printout:

Step number 2
Classification functions
             Group = 1         2
Variable
X1               0.690      0.544
X3             211.621    199.016
Constant      -213.956   -182.326

(6) Compute the Fisher discriminant function from the coefficient table
in (5):

    Z = (0.690 - 0.544)X1 + (211.621 - 199.016)X3
        + (-213.956 + 182.326)
      = 0.146X1 + 12.605X3 - 31.63.

(7) Classify a car with emission vector (NOx, CO, HC) into Category I if

    0.146 log(NOx) + 12.605(HC) - 31.63 > 0.

Here, log(NOx) and HC significantly contribute to the discrimination of
the two categories at the α = 0.05 level.
Procedure for computing the estimated Mahalanobis distance D²q for each
step q (required in Stage 4).

PROCEDURE
(1) For each step q, look up the value of the approximate F-statistic
(or, equivalently, Rao's V in SPSS) from the summary table and copy it.
(2) For each step q, compute the Mahalanobis distance D²q by using the
relationship

    D²q = [q(N1 + N2)(N1 + N2 - 2)] / [N1 N2 (N1 + N2 - q - 1)] × F,

where N1 is the sample size of group A, N2 is the sample size of group B,
and F is the value found in (1).

EXAMPLE
(1) Look up the value of the approximate F-statistic from the summary
table:

Step Number     Approximate F-Statistic
1               131.41
2                84.06
3                63.58

(2) For step 2, compute D²2 by

    D²2 = [2(70 + 43)(70 + 43 - 2)] / [70 × 43 × (70 + 43 - 2 - 1)] × 84.06
        = 6.37.
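
The relationship is easy to verify numerically; a one-function Python
sketch, checked against the worked numbers, follows:

    # D-squared from the approximate F-statistic, step (2) above.
    def mahalanobis_sq(q, n1, n2, f_stat):
        return (q * (n1 + n2) * (n1 + n2 - 2) * f_stat
                / (n1 * n2 * (n1 + n2 - q - 1)))

    print(mahalanobis_sq(2, 70, 43, 84.06))   # step 2 of the example: ~6.37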
Friedman's Recursive Partitioning Algorithm for Nonparametric
Classification. If the underlying distributions for the short-test
emission data are not normal, the classification will be performed using
Friedman's recursive partitioning algorithm. This algorithm is the newest
development in the state of the art. A complete program write-up and the
user's manual are available at TSC. Typically, the use of the algorithm
will result in a tree-structured decision rule: starting at the root, the
test vector is directed down the tree until it arrives at a terminal
node, and is assigned to the class with the highest representation at
that node.

Figure 2.10 illustrates a hypothetical decision tree, where X1, X2, and
X3 denote the short-test NOx, CO, and HC values, respectively, and the
α's are the cutoff points. Starting at the top node, if a car's
short-test NOx value is less than α1, descend to the right node and
examine its short-test HC value. If this value is greater than α3,
descend to the left terminal node and classify the car into Category II,
as dictated by the class labels at that node. If, on the other hand, a
car's short-test NOx value is greater than α1, descend to the left node;
and if the short-test CO value is less than α2, classify the car into
Category I, according to the class label.

Figure 2.10 Hypothetical decision tree.

Structural relationships among the three variables, insofar as
classification is concerned, can be assessed immediately from the tree
structure. For example, at the top root, the short-test NOx clearly is
the most influential variable in the classification. However, the
short-test CO and HC values are independent of each other in the decision
process; i.e., when a car's NOx exceeds α1, its CO value is then the
dominant factor in determining the class. When the car's NOx is lower
than α1, its HC value will be the dominant factor in discriminating the
two classes.
In general, the advantages of a tree-structured method include:
1. Being able to fully exploit variables that are important
only under limited conditions, because variable use is
determined by previous structure values.
2. Not having to calculate a variable if classification is possible
without that variable.
3. An algorithm that readily permits interpretation.
4. An efficient decision rule (the tree): Few calculations are
required to classify a sample.
The construction of such a tree proceeds as follows: At the top, put
the root of the tree, representing the entire sample. Then form splits,
the two descendants of the root or of each nonterminal node representing
the two subsamples defined by the truth or falsity of the inequality

    xj < a.

Choosing the particular variable xj and the cutoff point a is done by
searching through all the allowable splits for all the variables and
picking the split of that particular variable which results in the
greatest decrease in uncertainty insofar as classification is concerned.
The degree of uncertainty is measured by the Kolmogorov-Smirnov distance
between the empirical marginal distributions of the two classes
[Friedman 1977]. Terminal conditions are needed to keep the program from
splitting down to a point of insignificance. A typical terminal condition
has the form: Do not split a node if either a) the node contains fewer
than m points, where m is a threshold value chosen by the user, or b) the
percentage of the majority class at the node is ≥ 90%.
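
The split search can be illustrated with a short Python sketch. This is
one illustrative reading of the published algorithm, not the TSC
implementation documented in Appendix B; x_pass and x_fail are assumed
(n, p) arrays of the two classes at a node.

    import numpy as np

    def best_split(x_pass, x_fail):
        best = (None, None, -1.0)            # (variable j, cutoff a, KS)
        for j in range(x_pass.shape[1]):
            a, b = x_pass[:, j], x_fail[:, j]
            for cut in np.unique(np.concatenate([a, b])):
                # K-S distance between the two classes' empirical
                # marginal distributions, evaluated at this cutoff.
                ks = abs(np.mean(a <= cut) - np.mean(b <= cut))
                if ks > best[2]:
                    best = (j, cut, ks)
        return best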
EXAMPLE: Friedman's algorithm is applied to a real data set from the
FTP malfunction study. The objective of the study is to predict the pass
or fail of the FTP, based on various short-test emission measurements.
There are 619 car tests involved in the study. For each car test, the
data consist of a record of its pass or fail of the FTP, along with
measurements of seven variables: idle CO, idle speed, choke adjustment,
vacuum leak, timing, EGR, and manufacturer.

Before applying Friedman's algorithm, the data are randomly split into
two groups. One group of 397 car tests, called the Learn Mode, is used to
design the classification tree; the other group of 222 car tests, called
the Test Mode, is used to assess the accuracy, or the probability of
error, of the tree designed.

The details of the computer subroutine, along with the user's manual,
are given in Appendix B. Output examples are given in Table 2.3, and an
interpretation of the output follows.
TABLE 2.3 OUTPUT EXAMPLE (Friedman's Algorithm)

LEARN MODE

Node  Variable  Cutpoint  From  To left   To right  PASS  FAIL  Rule  Class  P(OR)%
  1   IDLE CO     0.300     0      2          3      199   198    0   FAIL    50.00
  2   EGR         0.00      1      4          5      181   112    0   PASS    61.66
  3   IDLE CO     4.20      1      6          7       18    86    0   FAIL    82.76
  4                         2   Terminal             179    93    3   PASS    65.70
  5                         2   Terminal               2    19    3   FAIL    90.52
  6   IDLE SPD    0.00      3     12         13       16    42    0   FAIL    72.51
  7                         3   Terminal               2    44    2   FAIL    95.67
 12                         6   Terminal              15    24    3   FAIL    61.66
 13   CHOKE AD   -3.00      6     20         21        1    18    0   FAIL    94.76
 20                        13   Terminal               1     0    1   PASS   100.00
 21                        13   Terminal               0    18    2   FAIL   100.00

Classification was made using the prior probabilities of 0.5012 for
PASS FTP and 0.4987 for FAIL FTP.
TABLE 2.3--Cont'd

MISCLASSIFICATION Output for LEARN MODE

                            Classified
                      PASS FTP    FAIL FTP
True      PASS FTP      180          19
Classes   FAIL FTP       93         105

Percentage correct = 71.7884
Percentage wrong = 28.2116

MISCLASSIFICATION Output for TEST MODE

                            Classified
                      PASS FTP    FAIL FTP
True      PASS FTP       96          17
Classes   FAIL FTP       50          59

Percentage correct = 69.8198
Percentage wrong = 30.1801
Expected percentage correct = 72.0787
INTERPRETATION OF THE OUTPUT

PROCEDURE
(1) Construct a tree diagram from the summary table under Learn Mode.
Starting at node 1, the variable name, xj, appears in the second column.
The cutoff point, a, given in the third column, represents a split of the
node into a left node (5th column) and a right node (6th column),
according to the truth or falsity of the inequality xj ≤ a. For each of
the descendant nodes, look for the corresponding splitting variable, the
cutpoint, and the left and right nodes. If "Terminal" appears in the To
left and To right columns, the splitting stops and that node is
considered a terminal node.

EXAMPLE
(1) Construct a tree diagram from the summary table shown in the Learn
Mode of the Output Example. At node 1, the split occurs on IDLE CO, with
cutpoint 0.300. Thus, a sample point with IDLE CO less than or equal to
0.300 will be directed to the left node (2), while any point with IDLE CO
greater than 0.300 will be assigned to the right node (3). At node 2, the
split occurs on EGR, with cutpoint 0.00; the two descendants are nodes 4
and 5. Since "Terminal" occurs at nodes 4 and 5, the splitting stops.
Similarly, at node 3, the split occurs on IDLE CO with cutpoint 4.20, and
the two descendants are nodes 6 and 7. A complete tree structure
generated from the summary table is shown in Figure 2.11.
Figure 2.11 Tree structure of the Output Example.
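
The Learn Mode table can also be read directly as an executable decision
rule; the following Python sketch transcribes the nodes of Table 2.3:

    # node: (variable, cutpoint, left, right); a string marks a terminal
    # node and gives its class label.
    TREE = {
        1: ("IDLE CO", 0.300, 2, 3),
        2: ("EGR", 0.00, 4, 5),
        3: ("IDLE CO", 4.20, 6, 7),
        6: ("IDLE SPD", 0.00, 12, 13),
        13: ("CHOKE AD", -3.00, 20, 21),
        4: "PASS", 5: "FAIL", 7: "FAIL",
        12: "FAIL", 20: "PASS", 21: "FAIL",
    }

    def classify(car, node=1):
        entry = TREE[node]
        if isinstance(entry, str):          # terminal node: class label
            return entry
        var, cut, left, right = entry
        return classify(car, left if car[var] <= cut else right)

    # IDLE CO <= 0.30 and EGR > 0.00 lands in node 5 (FAIL, P(OR) = 90.52%).
    print(classify({"IDLE CO": 0.25, "EGR": 1.0}))   # FAIL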
(2) At each terminal node, the CLASS
column gives the decision label; a
point which follows the tree and falls
into a terminal node with class
labeling FAIL will be classified
as failing the FTP. The PASS
column gives the number of sample
points at a node that pass the
FTP. The FAIL column gives the
number of sample points that fail.
The P(OR)% column from Table 2.3
gives the probability that a point
falling into the node will be cor-
rectly classified. In other words,
P(OR)% provides a confidence level
for the classification at a given
node.
(3) The table in the MISCLASSIFI-
CATION Output in the Learn Mode
provides one with an overall
impression of the performance of
the decision rule.
(2) The class label at node 5 is
FAIL. Thus, any point which follows
the tree branches IDLE CO ≤ 0.30
and EGR > 0.00 will be classified
as failing the FTP. The number
shown in the P(0R)% column is
90.52%. This means that, at node 5,
there is a 90% confidence level
in making the decision as FAIL
category. At node 4, this number
is 65.70%, and therefore one is
less confident in making the decision
as PASS category.
(3) In the MISCLASSIFICATION Output,
the following table provides one with
an overall impression of the perfor-
mance of the decision rule:
Classified
PASS FTP FAIL FTP
PASS FTP 180 19
FAIL FTP 93 105
Percentage correct = 71.7884
Percentage wrong = 28.21159
-------
2-61
(4) The MISCLASSIFICATION output for
the TEST MODE provides one with an
impression of the accuracy of the
decision rule, based on data in the
LEARN MODE. An estimate of the prob-
ability of misclassification is given
by the PERCENTAGE WRONG. A much more
accurate estimate can be obtained
by using the "leave-m-out" technique
on the sample; a small number m of
vectors (usually 10% to 20% of
the total sample size) are placed
in the TEST MODE and the remainder
are put in the LEARN MODE to design
the tree. Record the PERCENTAGE
WRONG in the TEST MODE. This
procedure is repeated 10 to 20
times, each time with a different
set of vectors in the TEST MODE.
The averaged PERCENTAGE WRONG
value over all of these trials is
an accurate estimate of the
probability of misclassification
for the decision tree.
(4) In the MISCLASSIFICATION output
for the TEST MODE, find the
PERCENTAGE WRONG = 30.1801. This is
one estimate of the probability of
misclassification. The "leave-m-out"
method will provide one with a much
more accurate estimate.
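The leave-m-out procedure is mechanical enough to sketch in code. The
sketch below is illustrative only: train_tree and classify stand in
for the tree-design and tree-descent steps of the classification
algorithm, and the data layout is hypothetical.

    import random

    def leave_m_out_error(points, labels, train_tree, classify,
                          fraction=0.10, trials=10, seed=1):
        # Averages the TEST MODE percentage wrong over repeated random
        # splits; train_tree and classify are caller-supplied stand-ins
        # for the LEARN MODE tree design and for classifying one point.
        rng = random.Random(seed)
        n = len(points)
        m = max(1, int(fraction * n))       # leave out 10% to 20%
        rates = []
        for _ in range(trials):             # repeat 10 to 20 times
            test = set(rng.sample(range(n), m))
            learn_set = [(points[i], labels[i])
                         for i in range(n) if i not in test]
            tree = train_tree(learn_set)    # design the tree on LEARN MODE data
            wrong = sum(classify(tree, points[i]) != labels[i] for i in test)
            rates.append(100.0 * wrong / m) # PERCENTAGE WRONG for this trial
        return sum(rates) / len(rates)      # averaged PERCENTAGE WRONG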
-------
2-62
2.1.4 Stage 4. Evaluation of the ST/FTP Association

The classification method of Stage 3 is employed to establish a
prediction scheme for the FTP categorization based on a car's short-test
emission values. Such a scheme generally produces two kinds of errors,
false positive errors and false negative errors. Let E0 and E1 be the
probabilities of producing a false positive error and of producing a
false negative error, respectively, and let P be the probability that an
observation comes from Category I. (Thus 1-P is the probability that
an observation comes from Category II.) Then an optimal classification
scheme is the one which achieves the minimum expected probability of
misclassification, E*, where E* = (1-P)E0 + PE1. Under the joint
normality assumption, the discriminant analysis scheme is optimal; under
the non-normality assumption, Friedman's scheme tends to be "more"
optimal than the discriminant scheme. The degree of ST/FTP association
can now be defined as that value of E* achieved by using the optimal
classification scheme. The smaller the value of E*, the higher the ST/FTP
association, and vice versa. E* is best utilized in comparing the
classification results for various portions of the data, which will
be the main subject of Stage 5 of the analysis.

In practice, however, the true magnitudes of P, E0 and E1 are
unknown, and therefore have to be estimated from the data. A common
estimate of P is the number N1/(N1+N2), where N1 and N2 are the sample sizes
for Category I and Category II, respectively. The estimations of E0 and
E1 are much more involved, since the methods depend on the particular
-------
2-63
classification algorithm being used. The details of estimating the errors,
along with the details of obtaining the 95% confidence intervals, which
provide one with an impression of the reliability of the estimates of the
errors, are given below.

Estimating the error rates and 95% confidence intervals when
the discriminant analysis algorithm is used. There are two methods for
estimating the error rates and one method for constructing the 95%
confidence interval, as follows:
Procedure 1 (biased estimates)--If M1 is the number
of observations from Category I that are classified
into Category II, and M2 is the number of observations
from Category II that are classified into Category I,
then estimate E0 by M2/N2 and E1 by M1/N1. These two
estimates are readily available in the printout of the
stepwise discriminant analysis subroutine described in
Stage 3 (p.2-44 and p.2-47).

PROCEDURE

(1) Look up the classification matrix in the printout of the stepwise
discriminant analysis (p.2-44).

        Classification Matrix
Group   Percentage   Number of cases
        Correct      classified into group
                         A     B
A          p             a     b
B          q             c     d

EXAMPLE

(1) Look up the classification matrix in the output example
(BMDP7M, p.2-44).

        Classification Matrix
Group   Percentage   Number of cases
        Correct      classified into group
                         1     2
1         95.7          67     3
2         83.7           7    36
TOTAL     89.7          74    39
-------
2-64
(2) Estimate E0 by (100-q)% and E1 by (100-p)%.

(2) Estimate E0 by 100-83.7 = 16.3% and E1 by 100-95.7 = 4.3%.

COMMENT: Since these two estimates are known to be biased (i.e., they
underestimate the errors), they can only be used as rough estimates.
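In code, Procedure 1 is no more than reading off the classification
matrix; a minimal sketch, with the counts of the example above:

    def biased_error_rates(M1, N1, M2, N2):
        # M1 of the N1 Category I cars are classified into Category II;
        # M2 of the N2 Category II cars are classified into Category I.
        E1 = M1 / N1     # biased estimate of the false negative rate
        E0 = M2 / N2     # biased estimate of the false positive rate
        return E0, E1

    # From the example: 3 of 70 Category I cars and 7 of 43 Category II
    # cars are misclassified, so E1 = 3/70 = 4.3% and E0 = 7/43 = 16.3%.
    print(biased_error_rates(3, 70, 7, 43))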
Procedure 2 (unbiased estimates)--A satisfactory method
of estimating the probabilities of errors has been provided
by Okamoto [1963]. The method involves the computation
of the estimated sample Mahalanobis distance, which is a
byproduct of every stepwise discriminant subroutine de-
scribed in Stage 3. The details of the procedure, along
with the numerical example of Stage 3 (p.2-51), are given
below.
PROCEDURE

(1) Compute the estimated Mahalanobis distance D² for step k in the
printout of the stepwise discriminant analysis, where k is the size
of the best subset of variables chosen by the program. (See Stage 3,
p.2-51.)

(2) Estimate P by N1/(N1+N2).

EXAMPLE

(1) Assume that in the printout of the stepwise discriminant
analysis, the size for the best subset is 3. Thus, according to the
procedure for computing the Mahalanobis distance (p.2-51), the D² for
Step (3), following, is D² = 1.2 (or D = 1.09).

(2) Estimate P by 70/(70+43) = 2/3.
-------
2-65
(3) Compute

    c = (1/D) log[(1-P)/P] - D/2 .

(4) Compute

    a1 = 3c - c³ + k(D-c)
    a2 = 2D + 3c - c(D+c)² - k(D+c)
    a3 = 2(D+3c) - c(D+2c)² - 2k(D+3c) .

(5) Let Φ and φ denote the cumulative distribution function and the
density function, respectively, of the standard normal distribution.
Look up Φ(c) from the normal table (Table A-3), and compute

    φ(c) = (1/√(2π)) e^(-c²/2) .

(6) Let N = N1+N2. Set

    E1(c,D) = Φ(c) + (φ(c)/N)[a1/(2PD) + a2/(2(1-P)D) + a3/(16D²)]
    E0(c,D) = Φ(c) - (φ(c)/N)[a1/(2(1-P)D) + a2/(2PD) + a3/(16D²)] .

(7) Set

    E* = P·E1(c,D) + (1-P)·E0(c,D) .

This is an estimate of the expected probability of misclassification.

EXAMPLE

(3) Compute c = [log(.5) - 0.6]/1.09 = -1.186.

(4) Compute a1 = 4.938, a2 = 0.297, a3 = 5.351.

(5) Look up Φ(-1.186) from the normal table and get

    Φ(-1.186) = 1 - Φ(1.186) = 1 - 0.8810 = 0.1190
    φ(-1.186) = 0.2792 .

(6) N = 43+70 = 113. Set E1 = .2412 and E0 = .0601.

(7) Set E* = (2/3)(.2412) + (1/3)(.0601) = 0.18. This is an estimate
of the expected probability of misclassification.
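The normal-table steps translate directly into code. The sketch below
computes the cutpoint c of step (3) and the final combination of step
(7); math.erf replaces Table A-3, and the error-rate values of step
(6) are taken from the example rather than recomputed, since the
bracketed series in step (6) is only partially legible in the
original printout.

    import math

    def normal_cdf(x):
        # Phi(x), in place of the normal table (Table A-3)
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def normal_pdf(x):
        # phi(x) = (1/sqrt(2*pi)) * exp(-x**2/2)
        return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

    P, D2 = 2.0 / 3.0, 1.2
    D = math.sqrt(D2)                            # the example rounds D to 1.09
    c = (math.log((1.0 - P) / P) - D2 / 2.0) / D # step (3): about -1.18
                                                 # (-1.186 in the example)
    E1, E0 = 0.2412, 0.0601                      # step (6) values from the example
    E_star = P * E1 + (1.0 - P) * E0             # step (7): E* = 0.18
    print(round(c, 3), round(E_star, 2))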
-------
2-66
Procedure 3 (95% confidence interval)--A method for
constructing an accurate 95% confidence interval
around the individual probabilities of error has
been developed by McLachlan [1975]. The
details of the computational procedures are
described below, with the notations and numerical examples
of Procedure 2 carried over.
PROCEDURE

(1) Select a fixed 100(1-α)% level-of-confidence interval.

(2) Copy the values of k, P, D², c, N, E1(c,D) and E0(c,D) obtained
from the aforementioned Procedure 2 (unbiased estimates of error
rates).

(3) Compute

    b1 = D² + 4(3k-4) + (k²-4k+5)(16/D²)
    b2 = (D²-2k)/8
    b3 = D⁴ + 2(11k-16)D² + 8(5k-4)
    b4 = D⁴ - 2(k+4)D² + 8k
    b5 = 2D⁶ + 16(2k-5)D⁴ - 32(4k-13)D² .

(4) Compute

    φ(c) = (1/√(2π)) e^(-c²/2) .

(5) Set

    σ1² = (φ(c))²[(1/(4P) + c²/2)/N + b1/(4PN)² + b2/(P(1-P)N²)
          + b3/(64PN²) + b4/(64(1-P)N²) + b5/(32N)²]
    σ0² = (φ(c))²[(1/(4(1-P)) + c²/2)/N + b1/(4(1-P)N)² + b2/(P(1-P)N²)
          + b3/(64(1-P)N²) + b4/(64PN²) + b5/(32N)²] .

(6) Let Zα be the 100α percentile of the standard normal
distribution. For the α given in (1), look up Zα/2 from the normal
table (Table A-3). The approximate 100(1-α)% confidence interval for
the probability of false negative error is

    E1 ± σ1·Zα/2 ,

while the approximate 100(1-α)% confidence interval for the
probability of false positive error is

    E0 ± σ0·Zα/2 .

-------
2-67

EXAMPLE

(1) Select a 95% level-of-confidence interval, i.e., set α = 0.05.

(2) From the previous example (see Procedure 2), one obtains k = 3,
P = 2/3, D² = 1.2, c = -1.186, N = 113, E1 = 0.2412, and E0 = 0.0601.

(3) Compute bi, i = 1,...,5. According to the formulas given above,
one gets

    b1 = 47.86
    b2 = -0.6
    b3 = 130.8736
    b4 = 9.2736
    b5 = 77.5495 .

(4) φ(-1.186) = 0.2792.

(5) Compute σ1², σ0². According to the formulas given above, one gets
σ1² = 0.0003 and σ0² = 0.021.

(6) Since α = .05, look up Z.025 from the normal table and get
Z.025 = 1.96. The approximate 95% confidence interval for the
probability of false negative error is

    (0.2411, 0.2413) ,

while the approximate 95% confidence interval for the probability of
false positive error is

    (0.0523, 0.0679) .
Interpretation of the 95% confidence interval. The preceding
method constructs 95% confidence intervals for the individual
probabilities of error. The interpretation of those two confidence
intervals can be explained as follows: Let Ii be the 95%
confidence interval for estimating Ei, where i = 0 or 1. If the experiment
can be repeated a large number of times, then, each time the classifi-
cation algorithm is performed and the estimated E0 and E1 are computed,
roughly 95 out of 100 times the estimated E0 will fall into the
interval I0 and, similarly, 95 out of 100 times the estimated E1 will
fall into the interval I1. But this is not the same as saying that 95
out of 100 times E0 will lie in I0 and E1 will lie in I1 simultaneously.
In fact, a joint 95% confidence region for E1 and E0 can be easily
derived from an inequality found by Bonferroni,* which roughly states
that the 97.5% confidence intervals for each of the two probabilities
of errors ensure a "joint" 95% confidence region for both probabilities.
Thus, using the previous method, construct 97.5% confidence intervals
I0 and I1 for E0 and E1, respectively. Then E0 will lie in I0 and
E1 will lie in I1 simultaneously with at least 95% confidence.

*Bonferroni inequality: Let Ii be the 100(1-α)% confidence interval
of Ei, i = 0,1; then

    Prob(E0 is in I0 and E1 is in I1) ≥ 1 - Prob(E0 is not in I0)
    - Prob(E1 is not in I1) = 100% - 100α% - 100α% = 100(1-2α)%.
-------
2-69
2.1.5 Stage 5. Testing Hypotheses

The in-use vehicles tested in the Portland Study were split into two
fleets, Elements I and II [Hamilton Test Systems 1977]. Element I vehicles
were recruited exclusively from the Portland region; Element II vehicles
were recruited jointly from Portland and Eugene, Oregon. Each fleet is
defined by several vehicle characteristics. The following questions
relating to the fleets are of immediate concern:

• How would one evaluate the ST/FTP association, given the two
test fleets' emission data?

• Does the ST/FTP association depend on the vehicle characteristics
used to define the fleets? If so, can the degree of association
be improved by taking these characteristics into account?

These questions can be answered through statistical hypothesis-testing
procedures. Owing to the nature of the problem and the particular
sampling designs of the test fleets, a multi-contingency table approach
(an approach entirely different from the one mentioned in "Short Test
Correlation Analysis" [Aerospace Corporation 1976]) is appropriate.
Since the Element I and Element II vehicle fleets were recruited by using
different sampling methods, it is more convenient to discuss the
hypothesis testing problem for each fleet separately.
2.1.5.1 Element I Fleet. The Element I vehicle fleet for the
Portland Study comprises 2,400 1975 through 1977 model-year cars,
recruited exclusively from the Portland region. A stratified sampling
method was used to recruit the cars. The Portland vehicle population
was divided into 60 vehicle groups, according to manufacturer, model year,
-------
2-70
engine family, transmission, and test inertia weight in an attempt to
identify different emission control technologies. A sample of 40 vehicles
was taken for each vehicle group, for a total of 2,400 cars.
As demonstrated above, the ST/FTP association can be evaluated using
a classification algorithm (Stages 3 and 4) for a given set of FTP standards.
Since the FTP emission standards for 1975-76 model-year cars are not as
stringent as those for 1977 cars, it is best to apply the classifica-
tion scheme separately to 1975-76 cars and 1977 cars and then to study the
resulting classifications.
Using 1975-76 cars as examples, the procedures for testing whether
the classification results depend on various vehicle characteristics and
for fine-tuning the classification schemes are described below.
1.A) Combining Vehicle Group Cells.
Among the 60 cells for the Element I fleet sampling design, 44 cells
are occupied completely or partly by the 1975-76 model-year cars.
Following the procedures described above in Stages 3 and 4, a classifi-
cation rule and expected probability of misclassification, E*, based on
the ST/FTP emission data of these 1975-76 cars, can be obtained. Here,
E* measures the overall performance of the classification rule. The
misclassification rate of the individual cells indicates how well the
rule works for each particular class of cars. An unusually large
misclassification rate in one cell indicates the inappropriateness of
the rule for that cell.
The following procedures can be used to form blocks of cells, so
that each block comprises one or more cells with similar classification
-------
2-71
characteristics. Thus, the appropriateness of the rule when applied to
a particular block can be judged from the common misclassification rate.
Method

(a) Construct a two-way table (table of counts).

There are two variables: classification (correctly classified and
misclassified) and the vehicle-group cell number (44 categories for
75-76 model-year cars).

Let {X1i} and {X0i} be two 1x44 tables, where X1i and X0i are the
number of misclassified cars and the number of correctly classified
cars, respectively, for the ith cell. Let ni = X1i + X0i. Then
ri = X1i/ni gives the misclassification rate for the ith cell.

NOTE: For 75-76 model cars, there are 35 cells with sample sizes of 40.

(b) Order the 44 rates from smallest to largest, denoting by
r(i) the ith smallest rate (i = 1,...,44):

    r(1) ≤ r(2) ≤ ... ≤ r(44) .

(c) Group those rates which are most alike in magnitude into one
class, either by eyeball inspection or by the following more rigorous method:
First, test the hypothesis H0: r(1) = r(2), using the chi-square test to be
described below. If the test result is to accept the hypothesis, combine
the two cells into one block and continue to test H0: r(1) = r(2) = r(3).
Otherwise, consider the first cell as a single block and proceed to test
H0: r(2) = r(3). Continue the process until r(44) is reached.

The chi-square test used in combining cells is as follows:
-------
2-72
(i) Suppose one wants to test whether r1,...,rk are the same
in magnitude, i.e., test the hypothesis

    H0: r1 = r2 = ... = rk .

(ii) Compute

    X1+ = Σi X1i ,  X0+ = Σi X0i ,  and  n+ = X1+ + X0+ .

(iii) Compute the statistic

    χ² = Σi [(X1i - ni·X1+/n+)²/(ni·X1+/n+) + (X0i - ni·X0+/n+)²/(ni·X0+/n+)] .

(iv) Under the hypothesis, χ² has an approximate chi-square
distribution with k-1 degrees of freedom. Look up the p value of χ² from
the χ²(k-1) frequency table, i.e., the area to the right of the χ² value under
the χ²(k-1) curve (Table A-6, Appendix A). If p is greater than 0.05, then
accept the hypothesis; otherwise, reject the hypothesis.
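A sketch of this test in code, written for the standard Pearson
statistic that the formula in (iii) amounts to:

    def chi_square_equal_rates(X1, X0):
        # H0: the misclassification rates r_i are equal across the k cells.
        # X1[i], X0[i] = misclassified / correctly classified counts in cell i.
        k = len(X1)
        X1_tot, X0_tot = sum(X1), sum(X0)   # marginal totals X1+ and X0+
        n_tot = X1_tot + X0_tot             # grand total n+
        chi2 = 0.0
        for x1, x0 in zip(X1, X0):
            ni = x1 + x0
            e1 = ni * X1_tot / n_tot        # expected misclassified count
            e0 = ni * X0_tot / n_tot        # expected correct count
            chi2 += (x1 - e1) ** 2 / e1 + (x0 - e0) ** 2 / e0
        return chi2    # compare with the chi-square(k-1) table (Table A-6)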
l.B) Testing Whether the ST/FTP Association Depends on Various Vehicle
Characteristics.

The classification rules designed according to Stage 3 or 1.A enable
one to assess the ST/FTP association. But this association may depend on
various engine parameters, e.g., manufacturer, model year, engine family,
and CID. In other words, some significant interaction effect between
the classification and a specific engine parameter may exist. A valid
hypothesis-testing procedure can be used to identify such interactions.
-------
2-73
In principle, the procedure will test the null hypothesis by fitting a
model with the selected interaction effect set equal to zero, then will fit a
second model in which this interaction effect is included. The difference
between the goodness-of-fit of the two models becomes the test statistic
of the null hypothesis, provided that the second model fits the data well.
The details of the hypothesis testing are described below.
l.B.l Select the primary variables.

In deciding what variables should be included in the testing of
interaction effects, at most three variables are considered each time,
for ease of interpretation.

EXAMPLE: Suppose there are I categories for variable 1, J categories
for variable 2, and K categories for variable 3. The primary variables
are classification (2 categories), MAKE (combined into 5 categories for
75-76 model-year cars), and CID (combined into 8 categories).
1.B.2 Construct the contingency table (table of counts).

Let {X1jk} and {X0jk} be two JxK tables (j = 1,...,J; k = 1,...,K), where
X1jk and X0jk are the number of misclassified cars and the number of
correctly classified cars, respectively, in the jk-cell. Thus,
njk = X1jk + X0jk is the total number of sample cars in the jk-cell.

EXAMPLE: Let {X1jk} and {X0jk} be two 5x8 tables, where X1jk and
X0jk are the number of misclassified cars and the number of correctly
classified cars, respectively, in the jth MAKE and kth CID cells. These
two tables can be combined into a 2x5x8 multicontingency table, i.e., into
the table {Xijk}, where i = 1,0.
-------
2-74
1.B.3 Construct a linear model in the natural logarithms of the cell counts;*
the linear model is analogous to an analysis of variance (ANOVA) model. The
advantages of such a model have been pointed out in Bishop et al. [1975].

Let {mijk} be the expected values of the observed frequencies {Xijk}.
Write the full model

    log(mijk) = U + U1(i) + U2(j) + U3(k) + U12(ij) + U13(ik) + U23(jk) + U123(ijk) ,

where i = 0,1; j = 1,...,J; k = 1,...,K.

The interpretation of the U-terms is as follows:

U is the overall mean
U1 measures the main effect of the classification variable
U2 measures the main effect of the MAKE variable
U3 measures the main effect of the CID variable
U12 measures the interaction between classification and MAKE
U13 measures the interaction between classification and CID
U23 measures the interaction between MAKE and CID
U123 measures the difference between MAKE-CID interaction effects
attributable to the classification variable.

Various models can be derived from the above full model by setting
particular U-terms equal to zero:

*Grizzle et al. [1969] have proposed the use of a weighted least-squares
technique for analyzing categorical data. This technique has the advantage
of giving the user more latitude in choosing the model and testing hypotheses.
But the disadvantages are that 1) it requires the user to have a more sophis-
ticated background in statistics to actually carry out the analysis, and
2) technical difficulties arise when there are sporadic zero cells.
-------
2-75
Model (a): Test U123 = 0. Thus, the model is

    log(mijk) = U + U1(i) + U2(j) + U3(k) + U12(ij) + U13(ik) + U23(jk) ,

which states that the table can be described with constant two-factor
effects; for example, constant classification-MAKE, classification-CID,
and MAKE-CID effects.

Model (b): Test U123 = 0 and one of the two-factor effects is missing.
There are three versions of this model:

Model (b.1) Selecting U13 = 0 gives

    log(mijk) = U + U1(i) + U2(j) + U3(k) + U12(ij) + U23(jk) .

This model states that the classification and CID are mutually independent
for every MAKE. A direct consequence of the model is that CID can be
completely deleted from the analysis.

Model (b.2) Selecting U12 = 0 gives

    log(mijk) = U + U1(i) + U2(j) + U3(k) + U13(ik) + U23(jk) .

This model states that classification and MAKE are mutually independent
for every CID level. Thus, MAKE can be completely deleted from the analysis.

Model (b.3) Selecting U23 = 0 gives

    log(mijk) = U + U1(i) + U2(j) + U3(k) + U12(ij) + U13(ik) .

This model states that MAKE and CID are mutually independent for each of
the two classification levels.

Model (c): Test U123 = 0, and two of the two-factor effects are zero. There
are also three versions of this model:
-------
2-76
Model (c.1) Selecting U123 = U13 = U23 = 0 gives

    log(mijk) = U + U1(i) + U2(j) + U3(k) + U12(ij) .

CID is now completely independent of classification and MAKE; classification
status and MAKE are associated.

Model (c.2) Selecting U123 = U13 = U12 = 0 gives

    log(mijk) = U + U1(i) + U2(j) + U3(k) + U23(jk) .

Classification is completely independent of CID and MAKE.

Model (c.3) Selecting U123 = U12 = U23 = 0 gives

    log(mijk) = U + U1(i) + U2(j) + U3(k) + U13(ik) .

MAKE is completely independent of classification and CID; classification
and CID are associated.

Model (d): Test U123 = U12 = U13 = U23 = 0. This gives the model

    log(mijk) = U + U1(i) + U2(j) + U3(k) ,

in which the three variables are mutually independent.
-------
2-77
Table 2.4, which will be used later, gives a complete account of the
models in three dimensions.

1.B.4 Procedure for fitting a model.

Given the observed frequencies of counts {Xijk}, the estimates
m̂ijk of the expected values under the various model assumptions listed
in 1.B.3 can be calculated with the use of algorithm AS-51, commonly
called the "Iterative Proportional Fitting Algorithm" and developed by
Haberman [1972]. Complete details of the algorithm, including program
card preparation and standard FORTRAN statements, are given in Appendix C.
An experienced programmer should have no difficulty in implementing the
program independently.
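For readers without the FORTRAN listing at hand, the core of the
iterative proportional fitting algorithm is small enough to sketch
directly. The function below is a simplified stand-in for AS-51, not
a transcription of it: it handles three-way tables only and assumes
numpy; the table is a numpy array of observed counts.

    import numpy as np

    def ipf_3way(table, configs, max_iter=10, max_dev=0.25):
        # table   : observed counts, shape (I, J, K)
        # configs : sufficient configurations as axis tuples, e.g.
        #           [(0, 1), (0, 2), (1, 2)] for C12, C13, C23
        fit = np.ones_like(table, dtype=float)   # the FIT array, initialized to 1
        for _ in range(max_iter):
            worst = 0.0
            for axes in configs:
                others = tuple(ax for ax in range(3) if ax not in axes)
                obs = table.sum(axis=others, keepdims=True)  # observed margin
                cur = fit.sum(axis=others, keepdims=True)    # fitted margin
                worst = max(worst, float(np.abs(obs - cur).max()))
                fit = fit * np.where(cur > 0, obs / np.where(cur > 0, cur, 1.0), 0.0)
            if worst < max_dev:                  # the MAXDEV criterion
                break
        return fit                               # maximum likelihood estimates m-hat

Fitting Model (a), for example, corresponds to
configs = [(0, 1), (0, 2), (1, 2)], i.e., the sufficient
configurations C12, C13, C23 of Table 2.4.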
-------
2-78
TABLE 2.4 COMPLETE MODELS IN THREE DIMENSIONS

Model                              Sufficient        Degrees of
Type    Absent Terms               Configurations    Freedom

(a)     U123                       C12 C13 C23       (I-1)(J-1)(K-1)

(b)     U12 U123                   C13 C23           (I-1)(J-1)K
        U13 U123                   C12 C23           (I-1)J(K-1)
        U23 U123                   C12 C13           I(J-1)(K-1)

(c)     U12 U13 U123               C23 C1            (I-1)(JK-1)
        U12 U23 U123               C13 C2            (J-1)(IK-1)
        U13 U23 U123               C12 C3            (K-1)(IJ-1)

(d)     U12 U13 U23 U123           C1 C2 C3          IJK-(I+J+K)+2
-------
2-79
SPECIAL PROGRAM CARD PREPARATION

AS-51 -- The details of card preparation are given in Appendix C.
Those input statements which require special instructions and those
output statements which deserve special attention are listed below.

INPUT STATEMENT                                EXAMPLE

(1) NVAR = the number of variables in          (1) NVAR = 3
    the table.

(2) DIM Integer array (NVAR), the number       (2) DIM(1) = 2, DIM(2) = 5,
    of categories in each variable of              DIM(3) = 8
    the table.

(3) NTAB = the number of cells in the          (3) NTAB = 80
    table.

(4) TABLE Real array (NTAB), the table         (4) TABLE(1)=X111, TABLE(2)=X211,
    to be fit.                                     TABLE(3)=X121, TABLE(4)=X221,
                                                   ..., TABLE(80)=X258

(5) FIT Real array (NTAB), the fitted          (5) FIT(1)=1, FIT(2)=1,
    table.                                         ..., FIT(80)=1

(6) MAXDEV = 0.25 (or 0.05)                    (6) MAXDEV = 0.25 (or 0.05)

(The following statements depend on the model type to be fitted.)

(7) NCON = the number of sufficient            (7) MODEL:   1         2         3            4
    configurations for a given model.                     (U123=0) (U13=0) (U12=U13=0) (U12=U13=U23=0)
    Here, the sufficient configurations            NCON =   3         2         2            3
    can be found in the table of three-
    dimensional models, Table 2.4.
-------
2-80
(8) CONFIG Integer array (NVAR,NCON),          (8) CONFIG(1,1) = 1 1 0 1
    the sets indicating the sufficient             CONFIG(2,1) = 2 2 2 0
    configurations to be fit. For a                CONFIG(3,1) = 0 0 3 0
    given model, the sufficient config-            CONFIG(1,2) = 1 0 1 0
    urations can be found in Table 2.4.            CONFIG(2,2) = 0 2 0 2
                                                   CONFIG(3,2) = 3 3 0 0
                                                   CONFIG(1,3) = 0 0 0 0
                                                   CONFIG(2,3) = 2 0 0 0
                                                   CONFIG(3,3) = 3 0 0 3

(9) MAXIT Integer, the maximum permis-         (9) MAXIT = 10 1 1 1
    sible number of iterations.

(10) NMAR Integer, the maximal dimension       (10) NMAR = 40 40 40 40
     of the sufficient configurations for
     a given model. This dimension can
     be found from the configurations
     listed in Table 2.4.

(11) NU Integer, the dimension of a work       (11) NU = 40 40 40 40
     area, U, used to store fitted
     configurations.
OUTPUT STATEMENT

(1) The maximum likelihood estimates for the cell counts are given in
the following FIT table:

    FIT(1) = m̂111, FIT(2) = m̂211, FIT(3) = m̂121, FIT(4) = m̂221, ..., FIT(80) = m̂258

-------
2-81

(2) The number of the last iteration will be shown in NLAST.

(3) In case there is any error in the program, an error message is
given by one of the following IFAULT statements:

    If IFAULT=0, then no error was detected
    If IFAULT=1, then CONFIG contains an error
    If IFAULT=2, then MARG or U is too small
    If IFAULT=3, then the algorithm did not converge to the desired
                 accuracy within MAXIT iterations (change MAXIT to a
                 larger value)
    If IFAULT=4, then DIM, TABLE, FIT or NVAR is erroneously specified.
-------
2-82
1.B.5 Goodness-of-fit for models computed by using the AS-51 algorithm.

In order to measure how well the selected model fits the data, the
likelihood ratio statistic G² or the classical Pearson chi-square statistic
X² can be used. Here, we use only the G² statistic.

The computational procedures for G² are given below.

(a) Compute

    G² = 2 Σijk Xijk log(Xijk/m̂ijk) ,

where Xijk are the observed counts, and m̂ijk are the fitted values
obtained from the FIT table in 1.B.4, AS-51 algorithm.

(b) Look up the appropriate degrees of freedom under the model assumption
in Table 2.4. For example, under Model (a), the degrees of freedom are
(I-1)(J-1)(K-1).

(c) G² is distributed as a chi-square distribution with degrees of freedom
appropriate for the model selected: For the G² value computed from (a),
look up the probability of a value greater than the computed G² value in
the chi-square table, Table A-6 in Appendix A, with the degrees of freedom
obtained from (b).
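The statistic itself is one line of code; a minimal sketch, with zero
observed cells contributing nothing by the usual 0·log 0 = 0
convention:

    import numpy as np

    def g_squared(observed, fitted):
        # G2 = 2 * sum over cells of X * log(X / m-hat)
        X = np.asarray(observed, dtype=float)
        m = np.asarray(fitted, dtype=float)
        pos = X > 0                     # 0 * log(0) is taken as 0
        return 2.0 * np.sum(X[pos] * np.log(X[pos] / m[pos]))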
1.B.6 A strategy for testing hypotheses.
The appropriate analysis of a multicontingency table usually
involves several runs of the AS-51 algorithm, described in 1.B.4, unless
one is interested in testing some particular hypothesis. Here, we
perform two runs, as follows:
-------
2-83
(a) First Run - Model Screening

The purpose of this run is to suggest interesting hypotheses for
further analysis. The details of the procedure are given below.

PROCEDURE

(1) Compute the estimated cell counts {m̂ijk(1)} under Model (a), using
the AS-51 algorithm described in 1.B.4. Compute the G² statistic and
look up its chi-square probability value according to 1.B.5. Record
the values obtained.

(2) Compute the estimated cell counts {m̂ijk(2)} under Models (b.1),
(b.2), and (b.3), using the AS-51 algorithm described in 1.B.4. For
each table obtained, compute the corresponding

    G²(2/1) = -2 Σijk Xijk log[m̂ijk(2)/m̂ijk(1)] ,

where {m̂ijk(1)} is the contingency table obtained in (1). For each
G²(2/1) obtained, look up its chi-square probability value with the
appropriate degrees of freedom. Under Model (b.1), this degree of
freedom is (I-1)(K-1); under Model (b.2), it is (I-1)(J-1); under
Model (b.3), it is (J-1)(K-1). Record, for each table computed,
G²(2/1), the degrees of freedom, and the chi-square probability,
including the table for U123 obtained in (1).

(3) Those effects whose probability values are less than 0.05 are
considered to be significant (a value of 0.0000 indicates a highly
significant effect), while those with probability values greater than
0.05 are considered to be nonsignificant and thus can be ignored for
further analysis.

-------
2-84

EXAMPLE

(1) Suppose, from the emission data of 1975-76 model-year cars in
Element I, the G² statistic and its chi-square probability value
under Model (a) are obtained according to the procedures in 1.B.4 and
1.B.5 and recorded as follows:

    Effect    D.F.    L.R.CHISQ    PROB
    U123       28        9.01      0.7020

(2) Under Model (b.1) (U13 = 0), computing the cell estimates and
G²(2/1) gives G²(2/1) = 7.79. From the chi-square table (Appendix A,
Table A-6) with degrees of freedom (2-1)(8-1) = 7, one gets the
probability value 0.0204. Similarly, for Models (b.2) and (b.3), the
G²(2/1)'s and their chi-square probability values with appropriate
degrees of freedom are obtained and recorded below (with the absent
effect representing the model):

    Effect    D.F.    L.R.CHISQ    PROB
    U13         7        7.79      0.0204
    U12         4       66.81      0.0000
    U23        28       27.23      0.1200
    U123       28        9.01      0.7020

(3) From the above summary table, one sees that the third-order
interaction U123 and the two-factor effect U23 (the MAKE-CID
interaction) are not significant and therefore are not needed in the
final analysis. The two-factor interaction term U12 is highly
significant, and thus belongs in the model. U13 is moderately
significant, and so should probably be included in the model.
-------
2-85
(b) Second Run - Testing Specified Models

Using the guidelines provided by the results of the first run,
several models can be selected to fit the data. The models selected
for testing include the highly significant terms found in the first
run. These models follow the "hierarchy principle": if any U-term is
set equal to zero, all its higher-order relatives must also be set
equal to zero.

PROCEDURE

(1) Select several models which include the highly significant
U-terms found in the first run, (a), and which follow the hierarchy
principle.

(2) Apply the procedures in 1.B.4 and 1.B.5 separately to each of the
models selected in (1): estimate the expected cell counts according
to the procedures in 1.B.4, compute G² and its chi-square probability
according to the procedures in 1.B.5, and summarize the values in a
table, with the top row representing the model with the fewest
U-terms, the second row representing the model with the next fewest
U-terms, etc.

(3) For each model, check the significance of the fit by comparing
the corresponding Prob value with 0.05. If Prob > 0.05, the fit of
the model is nonsignificant; therefore, the model is sufficient to
explain the relationships between the factors. If Prob < 0.05, then
the fit is significant, and the model is not accepted as a good fit.

-------
2-86

(4) Suppose Model (a) and Model (b) are both nonsignificant as
concluded from (3), and Model (a) has fewer U-terms than Model (b).
A way to decide which one of the two should be included in the final
model can be described as follows: Let Ga² and Gb² be the G² values
of Model (a) and Model (b), respectively. Let da and db be the
corresponding degrees of freedom. Compute

    D = Ga² - Gb² ,

which, under the hypothesis that the extra U-terms are zero, has an
approximate chi-square distribution with da - db degrees of freedom.
If D is significant, i.e., p < 0.05, then Model (b) fits the data
significantly better than Model (a) and

-------
2-87

therefore shall be included in the final model. On the other hand, if
D is nonsignificant, i.e., p > 0.05, then use Model (a).

EXAMPLE

(1) Select Models [U12] and [U12,U13] for testing, where the terms
inside the brackets represent the highest-order interaction terms
present in the model.

(2) For the selected models, estimate the expected cell counts and
compute G² and its chi-square probability, producing the following
table:

    Model        G²     Degrees of Freedom   Prob.
    [U12]       76.71           63           0.29
    [U12,U13]   55.83           56           0.9

(3) Comparing Prob with 0.05, one finds that both [U12] and
[U12,U13] are nonsignificant; therefore, both models are sufficient
to explain the relationships between the factors.
(5) The magnitudes of these U-terms can be computed from the FIT
table of the selected model according to the following formulas. Let
ℓijk = log(m̂ijk), and define the marginal sums

    ℓi++ = Σjk ℓijk ,   ℓ+j+ = Σik ℓijk ,   ℓ++k = Σij ℓijk ,

with ℓij+, ℓi+k, and ℓ+jk defined similarly.

The grand mean is given by

    U = (Σijk ℓijk)/IJK .

The main effects are given by

    U1(i) = ℓi++/JK - U
    U2(j) = ℓ+j+/IK - U
    U3(k) = ℓ++k/IJ - U .

The interaction terms are given by

    U12(ij) = ℓij+/K - ℓi++/JK - ℓ+j+/IK + U
    U13(ik) = ℓi+k/J - ℓi++/JK - ℓ++k/IJ + U
    U23(jk) = ℓ+jk/I - ℓ+j+/IK - ℓ++k/IJ + U .

-------
2-88

NOTE: {ℓij+} is the table of sums C12. Similarly, {ℓi+k} is C13,
{ℓ+jk} is C23, {ℓi++} is C1, {ℓ+j+} is C2, and {ℓ++k} is C3.

EXAMPLE

(5) Compute the U-terms from the FIT table of [U12,U13], {m̂ijk},
according to the formulas given above. One obtains, for example,
U12(11) = 2.5, U12(12) = -0.5, ..., etc.

The interpretation of these U-terms can be illustrated as follows:
From U12(11) = 2.5, which is positive, one concludes that the first
MAKE level has a positive interaction with the number of
misclassified cars at "every" CID level. From U12(12) = -0.5, which
is negative, one concludes that the second MAKE level has a negative
interaction with the number of misclassified cars at every CID level.
In the foregoing analysis, we used raw data only; we did not take
into account a sampling factor. Below is a description of an adjusted
table to account for the sampling differences.

Let X1jkℓ and X0jkℓ be the number of misclassified and correctly
classified cars, respectively, in the jkℓ cell, where ℓ indexes the
44 vehicle groups. Let njkℓ = X1jkℓ + X0jkℓ. Thus X+jk = Σℓ njkℓ.
From the MVD file, find the population sizes, Nℓ, of the 44 vehicle
groups, ℓ = 1,...,44. The adjusted table {yijk} is obtained by
computing

    yijk = Σℓ (Xijkℓ/njkℓ)·Nℓ ,   ℓ = 1,...,44,

where yijk is the expected number of ith-classified cars in the jth
MAKE and kth CID cell.

Using the adjusted table {yijk}, go through 1.B.1 to 1.B.6 for the
hypothesis testing. In case the classification does depend on MAKE
and/or CID, redesign the classification rule according to MAKE and/or
CID levels. Compute the E* from Stage 4 for each level and compare it
with the E* of the current design, thus deciding the relative merit
of the rules.
-------
2-89
2.1.5.2 Element II Fleet. The Element II vehicle fleet for the
Portland Study consists of 660 cars: 220 vehicles (110 1972-74 and 110
1975-77) recruited from Eugene/Springfield; 440 recruited from Portland.
Of the 440 cars from Portland, 220 were of the 1972-1974 group and 220 were
of the 1975-1977 group. In each of these two groups, 110 cars passed, and
110 failed, the Portland idle inspection test. Vehicles failing the idle
inspection test are required to be reinspected following maintenance. All
Element II vehicles are required to be tested once every three months for
one year.
For the 1972-74 vehicle group, the classification algorithm of Stages 3
and 4 can be employed. The hypothesis of whether the error rates depend on
the passing or failing of the idle test can be tested according to the
1.B procedures described in Section 2.1.5.1.

For the 1975-77 vehicle group, the classification rule designed for
the Element I fleet can be applied, and the resulting error rates can be
tested using the 1.B procedures to see whether the idle test or the
three-month recall affects the error rates.

For Eugene cars, the classification rule of the Element I fleet can
be applied directly, and the resulting misclassification rates can be
compared with those of the Element I fleet in order to judge the
appropriateness of extending the rule to the Eugene area. The Eugene cars
were recruited from a stratified sampling scheme based on the distribution
of vehicles registered in the Portland area [Hamilton Test Systems 1977];
but the Element I fleet was recruited from a stratified sampling scheme
based on emission technology. Therefore, the comparison mentioned above
would make
-------
2-90
sense only if the misclassification rates are properly adjusted according
to sampling factors. The procedure for adjusting the rates, along with
the comparison, is given below.

PROCEDURE

(1) Let the misclassification rate at the jth line item be rj = xj/nj,
where xj and nj are the number of misclassified cars and the number of
total sample cars in the jth line item, respectively (j = 1,...,J).

NOTE: If nj is too small, there is a need to combine several line items
into one group, based on technology and model-year similarities.

(2) From the State MVD file, find Nj, the total number of registrations
in the jth line item, j = 1,...,J, and N, the total number of
registrations in the Eugene region.

(3) Given the sampling plan described in "Plan of Performance" [Hamilton
Test Systems 1977], for each vehicle group of the Element I fleet,
identify those line items which form the group. For example, if in the
ith vehicle group there are K line items, numbered j1,...,jK, then the
adjusted rate of Eugene vehicles with respect to the ith vehicle group is

    r̂i = rj1(Nj1/N) + ... + rjK(NjK/N) .

(4) For the Element I fleet, if ri is the raw misclassification rate for
the ith vehicle group, the adjusted rate is

    r̃i = ri(Mi/M) ,

-------
2-91

where Mi and M are the number of registrations in the ith vehicle group
and in the Portland study area, respectively.

Let {r̂i} and {r̃i} be the two tables obtained from (3) and (4). Compare
the two tables by testing the hypothesis H0: r̂i = r̃i for each i,
using the following test: Let ŝi = 1 - r̂i and s̃i = 1 - r̃i. Compute

    zi = (r̂i - r̃i)/√(r̂i·ŝi/N + r̃i·s̃i/M) .

From the normal table, Table A-3 in Appendix A, find the p value,
ignoring the sign of zi. If this value is less than .025, reject the
hypothesis, i.e., r̂i ≠ r̃i; otherwise, accept the hypothesis.

If a marked difference is shown in pattern and magnitude, then
the Element I rule has limitations for being extended to the Eugene
region. Therefore, a separate rule for Eugene may be necessary.
This difference may be due to confounding factors, such as
meteorological, geographical, and population variables. On the
other hand, if no difference is shown, then Portland and Eugene are
similar as far as the classification rule is concerned.
-------
3-1
3.0 RESPONSES TO SPECIFIC QUESTIONS

The major objective of the Portland Study is to investigate the
following three problems:

• ST/FTP association
• Immediate effectiveness of an Inspection/Maintenance program
• Long-term effectiveness of an Inspection/Maintenance program.

Within each problem, a series of specific questions or issues is raised
by the investigators. The purpose of this chapter is to resolve, or at
least to clarify, these questions or issues by providing either
computational procedures or comments.
-------
3-2
3.1 RESPONSES TO SPECIFIC QUESTIONS RELATED TO THE ST/FTP ASSOCIATION
3.1.1 Determination of Cutpoints
Methods presented in Stage 3, Section 2.1.3, produce a cutpoint (cutoff
point), or cutpoints, for the prediction of FTP categorization. Under normal
assumptions, a linear rank score for the short-test (ST) emission vector is
computed, and the classification is done by means of a single cutpoint for the
linear score. Under a non-normality assumption, a tree-structured decision
rule is constructed and the classification is done by means of a series of in-
dividual pollutant cutpoints. In general, determination of these cutpoints is
optimal; that is, the determined cutpoints achieve the minimum expected proba-
bility of misclassification for a given data base.
Specific questions related to cutpoints are the following:
1. Can the nine measurements obtained from the Federal three-mode
test be fully utilized to calculate cutoff points?
2. Can pollutant-specific cutpoints be determined?
3. How should one determine if cutoff points are dependent on
various engine parameters?
The answers to these questions are given below.
3.1.1.1 Federal three-mode test, nine measurements. For the Federal
three-mode tests, three emission measurements are taken per test for each pol-
lutant. Since the sequential approach to analysis described in Chapter 2 is
readily applicable to high-dimensional data, these nine measurements can be
fully utilized in the analysis of the ST/FTP association. Thus, under the
joint normality condition, the stepwise discriminant analysis algorithm in
Stage 3 will produce a linear rank score involving, at most, nine variables;
under the joint non-normality condition, Friedman's algorithm will produce a
tree-structured decision rule based on nine variables. The Friedman algorithm,
-------
3-3
incidentally, has two advantages for classifying high-dimensional data: ef-
ficiency of computation and ease of interpretation. Therefore, the user should
try Friedman's algorithm regardless of the outcome of the tests for
joint normality in Stage 2 (pp. 2-24 through 2-35).
3.1.1.2 Pollutant-specific cutoff points. Individual pollutant-specific
cutpoints can be obtained from the classification cutpoints in a straight-
forward fashion. Let X = (X1,...,Xt) be the ST emission variables, where t
varies, depending on the type of emission test (number of pollutants tested
and number of modes used to test each pollutant).

A. When the stepwise discriminant analysis algorithm is implemented,
a best subset, say, {X1,...,Xk}, is selected and the Fisher discriminant
function based on these k variables is computed. Typically, the
classification rule will have the form: "Classify a car with emission
vector X = (X1,...,Xt) into Category I if

    a1X1 + ... + akXk > dk ,"

where ai, i = 1,...,k, and dk are obtained from the printout (for details of
the discussion, see Section 2.1.3, pp. 2-40 through 2-50). The individual
cutoff points for X1,...,Xk can now be defined as dk/a1,...,dk/ak,
respectively. For the rest of the variables, Xi, i = k+1,...,t, the
determination of individual cutoff points is not important but, for the
sake of completeness, they can be obtained from the printout, as follows:

First, search for the classification function in which all the variables
are included (p. 2-44); e.g., classify an X as belonging to Category I if

    b1X1 + ... + btXt > dt ;

then define the individual cutoff points for Xk+1,...,Xt as
dt/bk+1,...,dt/bt.
-------
3-4
The above individual pollutant cutoff points can be best utilized
to screen test cars; i.e., a car should be assigned to the fail-FTP
category whenever one or more of the ST emission values, X1,...,Xk,
exceeds the corresponding cutoff point. Only if all the X1,...,Xk
emission values are below the cutoff points will the car's linear rank
score be computed and a score-based classification be made. A sketch of
this two-stage rule in code follows.
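The sketch below is minimal and assumes positive discriminant
coefficients, with Category I taken to be the fail-FTP group
(consistent with the screening step above); all names are
illustrative.

    def classify_with_screening(x, a, d_k):
        # x   : ST emission values (x_1, ..., x_k) for one car
        # a   : discriminant coefficients a_1, ..., a_k from the printout
        # d_k : cutoff for the linear rank score
        cutoffs = [d_k / a_i for a_i in a]   # individual cutoff points d_k/a_i
        # Screening: one or more exceedances assigns the car to fail FTP
        if any(x_i > c_i for x_i, c_i in zip(x, cutoffs)):
            return "FAIL FTP"
        # Otherwise compute the linear rank score and apply the cutpoint
        score = sum(a_i * x_i for a_i, x_i in zip(a, x))
        return "FAIL FTP" if score > d_k else "PASS FTP"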
B. When Friedman's algorithm is implemented, a series of
pollutant-specific cutoff points is given explicitly in the printout
(see Friedman's algorithm, Section 2.1.3).

3.1.1.3 Cutpoints and engine parameters. For the purpose of testing
whether the cutoff points discussed above depend on various engine param-
eters, the user should consult Stage 5, Section 2.1.5, which deals with hy-
pothesis testing.
3.1.2 Determination of Error Rates and 95% Confidence Interval
Questions related to error rates and 95% confidence interval are the
following:
1. Provide a means of estimating the expected probability of mis-
classification and its 95% confidence interval.
2. How could a 95% confidence interval around the individual
pollutant cutpoints be calculated?
The answers to these questions are given below.
3.1.2.1 Probability of misclassification and its 95% confidence interval.
Stage 4 gives a complete account of the estimation of the probabilities of
errors and their 95% confidence intervals.
3.1.2.2 Individual pollutant cutoff points mentioned in 3.1.1. The 95%
confidence interval around the individual pollutant cutoff points should be
-------
3-5
derived from the distributions of such cutoff points. But it is not at all
clear why the points computed from the classification function should have a
reasonable distribution, much less a normal one. In fact, the central con-
cept here is the 95% confidence interval for the error rate, which gives an
estimate of the "reliability" of the method being used. With the following
ad hoc method, one can roughly estimate the 95% confidence interval around
the pollutant cutoff points:
Systematically leave out 10% of the data points and use the rest of the
points to construct the classification scheme according to the methods in
Stage 3, Section 2.1.3, and compute the cutoff point as in Section 3.1.1.1.
A. For the discriminant analysis algorithm, three cutoff points are
produced on each trial:

    TRIAL         CUTOFF POINTS
      1        d1/a1   d1/b1   d1/c1
      .          .       .       .
      .          .       .       .
      k        dk/ak   dk/bk   dk/ck

where k denotes the number of trials. For the ith column, i = 1,2,3,
compute the mean and standard deviation of the k numbers, say, μi and σi.
Then the 95% confidence interval for each of the three cutoff points is
given by

    μi ± t.95·σi   for i = 1,2,3,

where t.95 is the 95th percentile from the t distribution with k-1 degrees
of freedom (see Table A-4, Appendix A).

B. The same method can be applied to Friedman's algorithm, but the re-
sults are meaningless.
3.1.3 Determination of Sample Sizes
Questions related to determination of sample sizes are the following:
-------
3-6
1. What is the minimum number of vehicles/tests required for a
fixed probability of misclassification in the range of 5% to 20%?

2. What is the minimum number of vehicles/tests required for
estimating the probability of misclassification with 95% confidence?

3. Can a minimum number of vehicles/tests for the simultaneous
achievement of 1 and 2 be determined?
The answers to these questions are given below.
3.1.3.1 For fixed probability of misclassification. The purpose of this
subsection is to estimate the minimum number of vehicles/tests required for
a fixed probability of misclassification, E*, in the range of 5% to 20%.
Here, E* = pE1 + (1-p)E0, where p is the probability that an observation
comes from Category I, and E1 and E0 are the probabilities of false
negative error and false positive error, respectively. The calculations
are broken into two steps.

Step 1. Run a pilot study to estimate p and the sample Mahalanobis
distance D², as follows:

Estimation of p

Suppose that, in the pilot study, n cars undergo a series of ST and FTP.
For a given set of FTP standards, if n1 cars pass the FTP and n2 = n - n1
cars fail the FTP, estimate p by n1/n.

Estimation of D²

Now let Yi (i = 1,...,n1) be the observations of ST emissions for cars
passing the FTP and Zi (i = 1,...,n2) be the observations of ST emissions
for cars failing the FTP. Then compute

-------
3-7

    D² = (Ȳ - Z̄)'W⁻¹(Ȳ - Z̄) ,

where Ȳ and Z̄ are the sample mean vectors of the Yi and the Zi, and W is
the pooled within-group covariance matrix. (A numerical example has been
given in Section 2.1.3, pp. 2-38 through 2-40.)
Step 2. Let N be the minimum number of vehicles to be sampled in the formal
study. An estimate of N can be obtained through the following computational
procedure, which is derived from a theoretical expression of the probability
of misclassification given by Okamoto [1963]:

PROCEDURE

(1) Set e = the fixed level of probability of misclassification.

(2) Estimate p and D² according to Step 1.

(3) Compute

    c = (1/D) log[(1-p)/p] - D/2 .

(4) Let k be the number of emission variables in the study. Compute

    a1 = 3c - c³ + k(D-c)
    a2 = 2D + 3c - c(D+c)² - k(D+c)
    a3 = 2(D+3c) - c(D+2c)² - 2k(D+3c) .

-------
3-8

(5) Let Φ and φ denote the cumulative distribution function and the
density function, respectively, of the standard normal distribution.
Look up Φ(c) from the normal table (Table A-3); compute

    φ(c) = (1/√(2π)) e^(-c²/2) .

(6) Set

    P1(c,D) = Φ(c) + (φ(c)/N)[a1/(2pD) + a2/(2(1-p)D) + a3/(16D²)]
    P0(c,D) = Φ(c) - (φ(c)/N)[a1/(2(1-p)D) + a2/(2pD) + a3/(16D²)] .

(7) Set the equation

    e = p·P1(c,D) + (1-p)·P0(c,D)

and solve for N, where e is given in (1).

EXAMPLE

(5) Look up Φ(-1.186) from the normal table and get
Φ(-1.186) = 1 - Φ(1.186) = 1 - 0.8810 = 0.1190; φ(-1.186) = 0.2792.

(6) P1 = 0.1190 + (13.81)/N and P0 = 0.1190 - (6.647)/N.

(7) Set 0.15 = 0.1190 + (6.99)/N. Therefore, N ≈ 225, which is the
minimum number of cars required, given the level of probability of
misclassification as 15%.
CAUTION: The precision of this estimate depends on 1) the assumption that
the emission variables behave like a joint normal distribution, and 2) the
assumption that the estimated D² and p in Step 1 are close to their true
values.
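Since both P1 and P0 are linear in 1/N, step (7) reduces to a
one-line solution; using the numbers of the example:

    # e = p*P1 + (1-p)*P0, with P1 = Phi(c) + A/N and P0 = Phi(c) - B/N,
    # collapses to e = Phi(c) + (p*A - (1-p)*B)/N, so that
    # N = (p*A - (1-p)*B) / (e - Phi(c)).
    p, Phi_c = 2.0 / 3.0, 0.1190
    A, B = 13.81, 6.647      # the 1/N coefficients from step (6)
    e = 0.15                 # fixed level of probability of misclassification
    N = (p * A - (1 - p) * B) / (e - Phi_c)
    print(N)                 # about 225 cars, as in the example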
-------
3-9
3.1.3.2 For a given length of 95% confidence interval. The purpose of
this section is to estimate the minimum number of vehicles/tests required
for the estimates of the probability of misclassification to fall within
±X% of the true probability of misclassification with 95% confidence. The
procedure for obtaining this minimum sample size is based on, first, a
method developed by McLachlan [1975] for constructing an accurate
confidence interval for the individual probabilities of errors, E1 and E0,
and, second, the Bonferroni inequality, which relates the two individual
confidence intervals to a joint confidence region for the two probabilities
of errors (discussed in Stage 4, Section 2.1.4, p. 2-68). The details of
the computational procedures are described below (the notations and the
example used in the previous subsection are carried over here).

PROCEDURE

(1) Set a fixed X% level, and a fixed 100(1-α)% level of confidence.

(2) Copy the values of p and D² estimated from Step 1, and c from
Step 2(3), in Section 3.1.3.1.

(3) Compute

    b1 = D² + 4(3k-4) + (k²-4k+5)(16/D²)
    b2 = (D²-2k)/8
    b3 = D⁴ + 2(11k-16)D² + 8(5k-4)
    b4 = D⁴ - 2(k+4)D² + 8k
    b5 = 2D⁶ + 16(2k-5)D⁴ - 32(4k-13)D² .

EXAMPLE

(1) Set X% = 5%, α = .05.

(2) From the pilot study, it has been estimated that p = 2/3 and
D² = 1.2. Thus, c = -1.186 from Step 2(3).

(3) k = 3; therefore

    b1 = 47.86
    b2 = -0.6
    b3 = 130.8736
    b4 = 9.2736
    b5 = 77.5495 .
-------
3-10
PROCEDURE

(4) Compute

    φ(c) = (1/√(2π)) e^(-c²/2) .

(5) Set

    σ1²(N) = (φ(c))²[(1/(4p) + c²/2)/N + b1/(4pN)² + b2/(p(1-p)N²)
             + b3/(64pN²) + b4/(64(1-p)N²) + b5/(32N)²]
    σ0²(N) = (φ(c))²[(1/(4(1-p)) + c²/2)/N + b1/(4(1-p)N)² + b2/(p(1-p)N²)
             + b3/(64(1-p)N²) + b4/(64pN²) + b5/(32N)²] .

(6) Let Zα be the 100α percentile of the standard normal distribution
(Table A-3, Appendix A). Find the minimum N which satisfies the
inequalities

    σ1(N)·Zα/4 ≤ X%   and   σ0(N)·Zα/4 ≤ X%

simultaneously; this N will be the minimum number of vehicles required
for the estimates of the probability of misclassification to fall within
±X% of the true probability of misclassification.

EXAMPLE

(4) φ(-1.186) = 0.2792.

(5) σ1²(N) = 0.037/N + 1.260/N²
    σ0²(N) = 0.1425/N + 2.47/N² .

(6) α/4 = 0.05/4 = 0.012. From Table A-3, Z.012 = 2.26. Find the
minimum number N which satisfies the inequalities

    0.001/N + 0.037/N² ≤ 0.05   and   0.072/N + 0.002/N² ≤ 0.05

simultaneously. Equivalently, the number N can be obtained by, first,
finding the solutions of the equations

    (0.001)N + 0.037 - (0.05)N² = 0   and
    (0.072)N + 0.002 - (0.05)N² = 0 ;
-------
3-11
and, second, by comparing the two sets of solutions. The maximum
number obtained is the answer. The solutions of the above two
equations are N1 = 2.7 and N2 = 8.2, respectively; therefore N = 9 is
the minimum number of vehicles required for the estimates of the
probability of misclassification to fall within ±5% of the true
probability of misclassification.
COMMENT: For the individual cutoff points defined in 3.1.1, the minimum
sample sizes required for an accurate estimation cannot be determined,
mainly because the distributional properties of such cutoff points are
unknown or are too complex.
3.1.3.3 Combination of Sections 3.1.3.1 and 3.1.3.2. A minimum number
of vehicles/tests for the simultaneous achievement of Sections 3.1.3.1 and
3.1.3.2 can be determined by simply taking the maximum of the two individual
numbers obtained in those sections.
3.1.4
Develop a method for comparing the ST(laboratory)/FTP association with the
ST(real-world)/FTP association, considering possible places for human/instrument
error in both situations.
Assume that the ST(laboratory)/FTP data are available. Then perform a
classification according to Stage 3 (Section 2.1.3) and compute the

-------
3-12

expected probability of misclassification, E*, according to Stage 4
(Section 2.1.4). Do the same for the ST(real-world)/FTP data and obtain
its expected probability of misclassification. Compare these two numbers.
The lower the number, the better the association.
3.2 RESPONSES TO QUESTIONS RELATED TO IMMEDIATE EFFECTIVENESS OF AN
INSPECTION/MAINTENANCE PROGRAM
Background
The Element II fleet in the Portland Study includes 660 vehicles of
model years 1972-1977, 440 of which are registered in the Portland area
(and thus are subject to inspection/maintenance, I/M), and 220 of which
are operated in Eugene, Oregon, where no I/M program exists. The set of 440
vehicles consists of 110 1975-1977 model-year cars which failed the Oregon
state inspection test and 110 which passed; and 110 1972-1974 model-year
cars which passed and 110 which failed. The 220 Eugene cars were selected
to be representative of the Portland population of registered 1972-1977 cars.
Each of the 660 cars undergoes an initial series of short tests at a
Portland State inspection station (the same station is used for all vehicles).
Following this series of tests, each vehicle undergoes another set of short
tests, plus additional emission and fuel economy tests at the contractor's
emission testing laboratory. Each vehicle that failed Portland's Idle
emission test at the State facility undergoes maintenance and is retested
on the idle test at the State inspection station; and 1s retested over a
series of emission/fuel economy tests at the contractor's laboratory.
3.2.1
Suggest ways to determine before-I/M emission levels and confidence
limits for the Portland area vehicles, for each model-year vehicle from
1972-1974, using the sample of data available; also determine these values
for the Eugene area and/or the parts of Oregon not requiring I/M.
-------
3-13
Assume that the before-I/M emission levels for the Portland area 1972-
1974 cars are not available directly from some other source. Then the
levels can be estimated from the Eugene emission data for 1972-1974 cars,
using the weighted-average technique described below.

PROCEDURE

(1) Choose a specific pollutant emission variable, say, X.

(2) Suppose there are K MAKE-YEAR combination groups [Hamilton Test
Systems 1977], e.g., DODGE-72. From the DMV registration file, find the
total number of cars in the ith MAKE-YEAR group, say, Ni, i = 1,...,K.
Let N be the total number of 1972-74 cars in Eugene.

(3) For the sample of Eugene cars in the ith MAKE-YEAR group, compute the
mean and variance of the variable X, say, Ȳi and Si², i = 1,...,K,
respectively.

(4) Compute the weighted average

    ȲX = Σi (Ni/N)·Ȳi

and its variance

    SX² = Σi (Ni/N)²·Si² .

(5) Use ȲX as an estimate of the before-I/M emission level for the
Portland area for each model-year vehicle from 1972-1974. The
confidence limits would be set as

    ȲX ± SX·t.95(n-1) ,

where t.95 is the 95th percentile from the t distribution with n-1 degrees
of freedom, and n is the total number of sample points. For n > 120, use
t.95 = 1.96 (which follows from the normal approximation).
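A sketch of the weighted-average computation, with the variance
combined as reconstructed in step (4) above (the group counts would
come from the DMV registration file):

    def before_im_estimate(N_groups, means, variances):
        # N_groups  : registration counts N_i per MAKE-YEAR group
        # means     : sample means Ybar_i of the pollutant variable X
        # variances : sample variances S_i**2 per group
        N = sum(N_groups)
        w = [n_i / N for n_i in N_groups]
        y_bar = sum(w_i * m for w_i, m in zip(w, means))       # weighted average
        s2 = sum(w_i ** 2 * v for w_i, v in zip(w, variances)) # its variance
        return y_bar, s2 ** 0.5

    # 95% limits: y_bar +/- s * t_.95(n-1); use 1.96 when n > 120.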
-------
3-14
The validity of using ȲX to estimate the before-I/M X emission
levels for the Portland area can be further checked by comparing the
value with the national emission value, if the latter is available.
Thus, if the national emission level falls within the confidence limits
of ȲX, the extrapolation from Eugene to Portland is valid.
Otherwise, the extrapolation is less justifiable and one should be
conservative in drawing conclusions.
3.2.2

3.2.2.1 Discuss ways to determine if the State inspection test failure
rates for similar groups of vehicles are different in the statistical sense.
Do failure rates differ from Eugene to Portland? See Section 2.1.5, in
which hypotheses about the failure rates are tested.

3.2.2.2 Even if the failure rates for Eugene and Portland are similar,
it is possible that the emission levels of failed/passed vehicles are
different. Suggest ways to look at this possibility. Analysis of variance
can be used. For details of the procedure, see Section 3.2.7, pp. 3-22
and 3-23.
3.2.3

3.2.3.1 Suggest methods to determine if the change in emission levels
and fuel economy following maintenance is significant for the Portland
failed cars and for the Eugene failed cars. The standard way of testing
whether the mean is equal to zero will be used. The details of the
procedure are given below.

PROCEDURE

(1) Select a variable of interest.

(2) For the ith car, compute Yi = Xi(a) - Xi(b), where Xi(a) and Xi(b)
are the emission levels after and before maintenance, respectively.

(3) Use the SPSS CONDESCRIPTIVE or BMDP1D subroutine to compute the
sample mean, μ̂, and the sample standard deviation, σ̂, for the Yi's.

(4) Test H0: μ = 0. Find the 95% confidence interval for μ, which has
the form

    μ̂ - 1.96σ̂ < μ < μ̂ + 1.96σ̂ .

The significance of the mean, μ, can be tested by examining the
confidence interval to see whether the zero value is contained in the
interval. If zero is not contained in the interval, then μ is
significant; otherwise, μ is nonsignificant.

-------
3-15

EXAMPLE

(1) For the Portland failed cars, select the short-test NOx variable.

(2) For each Portland failed car, compute Yi = Xi(a) - Xi(b), where
Xi(a) and Xi(b) are the ST-NOx values after and before maintenance,
respectively.

(3) Use the SPSS CONDESCRIPTIVE subroutine for the Yi's. From the
printout, one can see that the mean is 2.3 and the standard deviation
is 0.4.

(4) Test H0: μ = 0. The 95% confidence interval for μ would be

    2.3 - 1.96(.4) < μ < 2.3 + 1.96(.4) .

Since 0 is not contained in the interval, one concludes that μ is
significant; i.e., the change in NOx levels following maintenance is
significant for the Portland failed cars.
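A sketch of this test in code, taking σ̂ to be the standard deviation
of the mean difference (the standard error), which is what the 1.96
limits presuppose:

    def mean_change_significant(before, after):
        # before, after : paired emission values X_i(b), X_i(a), one
        # pair per car
        n = len(before)
        y = [a_i - b_i for a_i, b_i in zip(after, before)]  # Y_i
        mean = sum(y) / n
        var = sum((y_i - mean) ** 2 for y_i in y) / (n - 1)
        se = (var / n) ** 0.5              # standard deviation of the mean
        lo, hi = mean - 1.96 * se, mean + 1.96 * se
        return mean, (lo, hi), not (lo <= 0.0 <= hi)  # significant if 0 outside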
-------
3-16
3.2.3.2 Can the change in FTP be expressed as a function of the change
in ST? Let Y denote the change in FTP and X denote the change in ST.
Perform a regression analysis according to the procedure in Section
3.2.5, with Y as the dependent variable and X as the independent
variable.

3.2.3.3 Do highway, FSC, HFET, and FTP fuel economy all change sig-
nificantly? For any of the selected variables, let Y = Y(b) - Y(a),
where Y(b) and Y(a) are the values obtained before and after the
maintenance, respectively. Perform the test for the significance
of the mean according to the procedure described in Section 3.2.3.1.
If the test is significant, then one can conclude that the change is
significant.
3.2.3.4 Suggest a method for putting confidence limits around the
before-I/M levels, the after-I/M levels, and the change in levels. Let Y
be the level of interest: either the before- or after-I/M level, or the
change in levels. Using the descriptive subroutine (Section 2.1.1,
Stage 1), compute the mean, μ̂, and standard deviation, σ̂, from the data.
Then the 95% confidence limits around the level of interest are given by

    μ̂ - σ̂·t97.5(v) < Y < μ̂ + σ̂·t97.5(v) ,

where t97.5(v) is the 97.5th percentile of the t distribution (Table A-4,
Appendix A) with degrees of freedom v = (number of sample points) - 2.
3.2.3.5 Discuss ways to determine whether cost of maintenance is de-
pendent on a vehicle's amount, or percentage, above the short-test
cutpoint(s). This problem can be solved by using the linear regression
subroutine mentioned in Section 3.2.5. In this case, the dependent
variable is cost of

-------
3-17

maintenance and the independent variable is the amount or percentage above
the short-test cutpoint(s). The goodness-of-fit of the regression line
shall be judged by the magnitude of R². If the value of R² is greater than
0.6, the fit is considered significant: the higher the R², the better the
fit.
3.2.4
Suggest ways to determine if maintenance performed because of ST failure
on one pollutant affects ST and FTP emission levels on other pollutants.
The approach to this problem depends on the type of maintenance per-
formed and the ST cutoff points.
Assume that the maintenance protocol is well established, and the cutoff
point for short-test emission value on one pollutant, for example, HC, is
available from Section 3.1.1; then, for cars which fail the short test, per-
form the following steps:
1. Compute the difference in idle CO level before and after maintenance,
Y - v(a) Y(b)
TC0 TC0 *C0 *
2. Perform the test for the mean of YCq equal to zero according to procedures
described in Section 3.2.3.1.
The same method can be applied for other pollutants.
3.2.5
Discuss ways to determine whether cost of maintenance 1s dependent on
a vehicle's amount or percentage above the short-test cutpoint(s).
A linear regression analysis can be applied to the problem. The details
of the analysis are given below.
-------
3-18
MULTIPLE LINEAR REGRESSION
(1) Select variables of Interest, one as the dependent variable and the rest
as independent variables. Let Y be the dependent variable and Xj,..,,X be
the Independent variables. Then the multiple linear regression model is
Y = bQ + b-|X-| + ... +,JpXp ,
where bQ)b^ bp are the regression coefficients to be determined.
EXAMPLE: Select cost of maintenance, C» as the dependent variable and
select a vehicle's amounts, X-|, X2 and X^, above the three
pollutant-specific short-test cutpoints as independent
variables. The model is C = bQ + b^ + bgX2 + b3X3 .
(2) Perform a multiple linear regression analysis through the use of BMDP1R
or SPSS MULTIPLE REGRESSION ANALYSIS: SUBPROGRAM REGRESSION subroutines,
with C as the dependent variable and X-p X2 and X3 as Independent variables.
A typical output example, along with Its interpretation, 1s given below
OUTPUT EXAMPLE FOR MULTIPLE LINEAR REGRESSION (SMDPlRl
MULTIPLE R 0.9741 STD. ERROR OF EST. 2.5692
MULTIPLE R-SQUARE 0.9488
ANALYSIS OF VARIANCE
SUM OF SQUARES DF MEAN SQUARE F RATIO P{TAIL)
REGRESSION X XX 98.822 0.0000
RESIDUAL X XX
-------
3-19
SUMMARY TABLE:
VARIABLE
Constant
XI
X2
X3
COEFFICIENT
-43.7040
0.889
0.817
-0.107
STD ERROR
STD. REG.
COEFFICIENT
0.119 0.779
0.325 0.253
0.125 -0.055
T P(2 TAIL)
7.481 0.000
2.512 0.023
¦0.860 0.402
INTERPRETATION OF THE OUTPUT EXAMPLE
PROCEDURE
(1) From the printout, find the regression (1)
coefficients summary table, which
includes regression coefficient, b..,
standard error, Se(b..), t-statistics
(in SPSS, this is F-statisties) and
1L
the p value for the i variable
Xi, where i=l,...,p.
(2) The linear regression equation
is given by
Y-b0+b1X1+...+bpXp
where b.'s are the regression
coefficients found in (1).
EXAMPLE
From the OUTPUT EXAMPLE, one
finds that the regression
coefficients summary table
is as follows:
VARIA- COEFFI- STD.
BLE CIENT ERROR p
CON-
STANT -43.7040
X
V1
(2)
0.889 0.119 0.000
X2 0.817 0.325 0.023
X3 -0.107 0.125 0.402
The linear regression equa-
tion is given by
Y = -43.7040 + 0.889^)
+ 0.817(X2) - 0.107{X3)
-------
3-20
PROCEDURE
For each coefficient b., test the
hypothesis HQ:bi=0. The accep-
tance of this hypothesis can be
interpreted as "the variable X,.
does not significantly improve
the prediction of Y over that
obtained by regressing Y on
the other p-1 variables." The
test results can be obtained
immediately by comparing the
p value found in the last
column of the table in (1) with
the value 0.05. If the p
value is greater than 0.05, the
hypothesis that b.=0 is accepted
and thus the coefficient is sig-
nificant; otherwise, the coeffi-
cient is significant at the 0.05
level. In SPSS-MULTIPLE REGRES-
SION ANALYSIS, the p value 1s not
available. From the F-statistic,
the p value is the area to the
EXAMPLE
(3) By observing the p values
shown in the task in (1),
one concludes that b-j is
highly significant, b^ is
moderately significant, and
b^ is insignificant. The
interpretation is that, in
the prediction of the cost
of maintenance, X^ contributes
significantly to the equation.
Xg contributes moderately,
but X3 does not contribute
to the equation in a signifi-
cant way.
-------
3-21
EXAMPLE
PROCEDURE
right of F under the F(l,i^)
frequency curve (see Table A-5),
where = n-p-1 degrees of free-
dom.
Look for the MULTIPLE R-SQUARE
value for goodness-of-fit. A
2
large multiple R (greater than
0.6) indicates that the fit is
2
good. The larger the R , the
•better the fit.
(4) R2 = 0.9488. Therefore, one
concludes that the linear fit is
quite good.
-------
3-22
3.2.6
Using the Eugene and Portland data, determine a way to analyze the ef-
fect of an I/M program's existence on the extent to which voluntary owner
maintenance will occur.
In the Portland Study, car owners responded to questionnaires concerning
maintenance. A scoring system must be designed for the questionnaire, so
that a score for maintenance is associated with each vehicle. Once all
scores are figured, the average score for vehicles in Portland and the
average score for vehicles in Eugene can be computed. These two means
(averages) can then be compared.
3.2.7
Suggest ways for determining whether the change in fuel economy/emissions
following maintenance is dependent on vehicle characteristics such as CID,
inertia weight, cutpoints to which the vehicle was subjected, engine family,
and manufacturer.
When all the categorical variables are considered in the analysis
simultaneously, classical analysis of variance is not suitable simply because
it is not designed for effectively extracting information from large high-
dimensional data sets. The situation, however, can be easily tackled by the
Automatic Interaction Detection (AID) algorithm. AID is a generalized data
analysis program which uses analysis-of-variance techniques to explain as
much of the variance of a given dependent variable as possible. In fact, it
makes successive dichotomous partitions on the sample, using independent
variables to "predict" the dependent variable in such a way as to maximize
differences among the split groups. Thus, a tree-structured output will be
produced.
-------
3-23
The MIDAS and TSC computing library contain an AID subroutine. For a
complete program card setup and output examples, the user should read
Sonquist, et al_. [1975], "Searching for Structure," Institute for Social
Research, University of Michigan, Ann Arbor, Michigan.
3.2.8
Determine estimates and confidence limits for before-maintenance emis-
sion levels (FTP and ST) and fuel economy separately for passed and failed
Portland vehicles by model-year group (1972-74 and 1975-77), and for passed
and failed Eugene vehicles by the same model-year groupings. Determine simi-
lar estimates for failed cars after maintenance.
PROCEDURE
(1) From each sample-car owner, obtain information on the time of the last
maintenance, along with the type of maintenance performed before en-
tering the study. Let T be the difference between the time of entering
the study and the time of last maintenance before the study.
(2) Select a variable of interest, say, X, and a group of cars in the
study, e.g., ST-NO for passed Portland vehicles, by model-year 1972-
A
74 group.
(3) From those cars with almost the same T (as defined in (1)), compute the
A A
mean, y, and the standard deviation, asof the X variable. The confi-
dence limit is given by y±t gg(v) a, where t 95^) equals the 95th
percentile of t distribution with degrees of freedom [v = (number of sample
points)-1]. Thus, for various before-maintenance time levels, the estimates
and confidence limits of emission levels are obtained.
Since each vehicle that failed Portland's idle emission test undergoes
maintenance and is retested over a series of STs and FTPs, the estimates
and confidence limits for failed cars after maintenance can be computed
directly from the data.
-------
3-24
3.2.9
Does the initial effectiveness of I/M on emissions and fuel economy de-
pend on seasonal variations 1n temperature?
The solutions to this problem rely on the results of 3.2.1 and the data
gathered from Element II fleet vehicles which were recruited from the Portland
region (220 cars 1n the 1972-74 model-year group). The procedure 1s described
below.
PROCEDURE
(1) Choose a variable of interest, say, X.
(2) Notice that the Element II cars are recalled every three months for
one whole year. Let x!? and X? be the average before-I/M and after-I/M
levels at time i, respectively, where i=l,2,3, and 4. Let s!? and S* be
the corresponding standard deviation of the estimates.
(3) Let T.j be the average temperature at time i.
(4) Compute r¦ = X® - x!?, the effect of I/M on emissions.
(5) Plot r. versus T., i=l,2,3, and 4, and look for a trend.
(6) Let and be the number of sample points used to determine x|? and
X®, respectively. Perform the two-sample Welch t test for H0:r,.=0.
The test statistics are
Find t>95(v), the 95th percentile of t distribution with v degrees of
freedom, where
-------
3-25
-AfiL a M: t 120,
use 1.96 instead of t g5.) Otherwise,accept it. Here,reject that r.=0 means
there is a significant I/M effect.
3.2.10
3.2.10.1 Does the probability of failing the idle test change with queue time?
The following procedures are recommended:
PROCEDURE
(1) Subdivide the queue time period into a finite number of equal sub-
intervals, say, I-j,...,It.
(2) The probability of failing the idle test when queue time falls into
the interval 1^ can be estimated by x^/n^, where x^ is the number of
cars which fail the idle test and which have queue times falling in
I.j, and n^ is the total number of cars with queue times falling in 1^.
(3) Do a regression analysis with p. as the dependent variable, and test
for slope = 0. (See Section 3.2.5 for details of the procedure.)
3.2.10.2 Does correlation between the idle test and FTP depend on queue time?
Owing to the complexity of the distribution for correlation, the easiest
way is to plot the correlation according to the following procedure:
PROCEDURE
(1) Subdivide the queue time period into a finite number of equal subin-
tervals, say, I^,...,!^..
-------
3-26
(2) Let r.j be the correlation between the idle test and the pollutant-
specific FTP for cars with queue times falling in the interval I..
(3) Plot r^ versus the midpoint pi of 1^.
(4) Do a regression analysis with r.. as dependent variable and p. as inde-
pendent variable, and test for slope = 0. (See Section 3.2.5 for details
of the procedure.)
3.2.11
What percentage of the reduction in FTP emissions could have been
achieved by a diagnositc inspection as opposed to an emission test?
Vehicle engine diagnostics are required for vehicles in the Element II
program. There are twelve specific measurements to be made for diagnostic
checks [Hamilton Test Systems 1977]. Therefore, apply Friedman's classifi-
cation algorithm as described in Stage 3 and evaluate the classification re-
sult by computing the probability of misclassification according to Stage 4.
Finally, compare this result with the ST/FTP classification result based on
emission tests to get the percentage of the reduction in FTP emission that
could have been achieved by a diagnostic inspection.
3.2.12
3.2.12.1 Discuss methods of evaluating the ST/FTP association with
respect to fuel economy. The determination of fuel economy is based on
measured C02. Thus, for each C02 level, say, level i, record the number
of misclassified cars, n,., and the total number of cars, m^, based on a
given classification scheme. Define the misclassified rate at level i as
p. = n-/m^. Do a regression analysis with as the dependent variable and
-------
3-27
level 1 as the independent variable, and test for slope = 0. (See Section
3.2.5 for details of the procedure.)
3.2.12.2 Suggest methods for determining statistically significant
relationship between pollutant-specific emission levels and fuel economy.
Do a regression analysis with pollutant-specific emission level as the
dependent variable and COg as the independent variable, and test for
slope = 0. (See Section 3.2.5 for details of the procedure.)
-------
3-28
3.3 RESPONSES TO QUESTIONS RELATED TO LONG-TERM EFFECTIVENESS
OF INSPECTION/MAINTENANCE
Background
Each of the 660 vehicles in Element II fleet will be returned to
the contractor's laboratory facility 4 times, at 3-month intervals, and
will be retested on the State inspection station test, the FTP, the hot-
start FTP, and the fuel economy test.
3.3.1
3.3.1.1 Estimate short-test, FTP, and fuel economy deterioration
over time, for Portland and Eugene cars. Suggest overall
analysis for each model-year grouping (1972-74, and 1975-77), and also
for passed vehicles and failed vehicles separately.
The following method can be used for estimating deterioration over
time:
PROCEDURE
(1) Select a variable of interest,
say, X, for a particular vehicle
group.
(2) Do a linear regression analysis
with X as the dependent variable
and time T., where i=l,2,3,4, as
the independent variable. (The
details of the procedure are
given in Section 3.2.5). Record
the regression coefficient, b, and
EXAMPLE
(1) Select the short-test N0Y
A
variable for Portland cars.
(2) Do a linear regression
analysis with ST-NO as the
A
dependent variable and time,
Tj where i=l,2,3 and 4, as
the independent variable.
From the printout (see Section
3.2.5), one finds that b=2.3
-------
3-29
its standard error, Se(b), from and Se(b) = 0.4. Thus, the
the output. The interpretation of change per unit time of the
b is "the change per unit time." ST-NO^ is 2.3.
(3) Compute the 95% confidence inter- (3) The 95% confidence interval
val according to the formula is 2.3-1.96(.4) (The number of sample points
where tg7>5(v) can be is larger than 120). Since
found from the percentile of the the interval does not contain
t distribution table with degrees zero, the coefficient b is
of freedom v=n-2, (n is the num- significant at 0.05 level,
ber of sample points). For n
larger than 120, use tg^ g = 1.96.
3.3.1.2 Is there a significant difference between pollutant-specific
Portland and Eugene slopes? Is there a significant difference between
slopes for passed versus failed vehicles? Two methods can be used to
solve the problem.
Method 1: Let the observations of a specific pollutant for N cars in
Portland at time i be X-j1" \... .X^1' \ distributed according to a normal
distribution with mean yf1'^ and standard deviation a^1^, where i=l,2,3,
X A
and 4. Similarly, let the observation of the specific pollutant for M
cars in Eugene at time i be distributed according to a
normal distribution with mean y^ and standard deviation i=1.2,3,
J *
and 4. The problem of testing the difference between Portland and Eugene
emission slopes reduces to testing the following hypotheses:
-------
3-30
H0). u(l) . u(2) . (1) . (2)
H0 " yx Mx y y
„(2); „(2) . ut3) = (2) (3)
0 Mx Mx ^y Ky
h(3); (3) . (4) . (3) _ (4)
0 Mx ^x My ^y
Using the test for as an illustration, the details of the procedure
are given below.
PROCEDURE
(1) For each car j, j=l,...,N, in
Portland, compute U. = -
J J
x(2^. Similarly, for each car K,
J
Eugene, compute
Vk=Yk1,"Yk2)' ComPute the
sample means and sample vari-
ances of U's and V's, respec-
2
tively, obtaining U and Sy for
U's, V and sj for V's.
2 2
(2) Let o£j and be the variances of
U and V, respectively. Test Hq:
2 2
= cr^ by the variance ratio
test: If sjj > Sy, form
s2
II 2 2
Fq = -j . where S£j and Sy are
SV
obtained in (1). Under Hq, Fq
EXAMPLE
(1} Suppose N=41 and M=70.
After computing the U's and
V's, and their sample means
and variances, one gets
U = 0.695, S(j = 0.146
V = 0.399, and Sy = 0.106
(2) Test Hq: cr|j = a^. Form
lCh383li = 1380_
0 (0.326)
From the table for per-
centile of the F distri-
bution (Table A-5, Appendix
A)t find
p=F>g5(69,40) = 1.6.
Since p > Fq»Hq is accepted.
-------
3-31
has an F distribution with N-l
and M-l degrees of freedom.
Find the value p=F g5(N-l,M-l)
from the table for percentiles
of the F distribution (Table A-5,
Appendix A). If Fq is smaller
than p, accept Hgi otherwise,
reject Hq.
If Hq in (2) is accepted, then
use the two-sample t test for
testing Hq^. Otherwise, go
to (4). The test statistic is
t = u - V
*0 s /I + LU2
Vn M
where Sp = [(N-l)Sy
+ (M-1)sJ]/N+M-2.
Under ^, tQ has a
student's t distribution with
N+M-2 degrees of freedom.
Find p, the 95th percentile
of the student's t distribution
with N+M-2 degrees of freedom
(Table A-4). If tQ is smaller
than p, accept Hq1^ at the a=0.05
level. Otherwise, reject Hq^.
(3) Since Hq is accepted in (2),
the two-sample t test can be
used for testing H^.
Compute
c2 . (41-1)(0.146)+(70-l)(0.106)
ap * 41+70-2
= 0.132
The statistic is
t . 0.695 - 0.399 = 4>u
0
with 109 degrees of freedom.
Since the p value is 1.658,
which is less than tg, Hq^
is rejected at a=G.05 level.
In other words, at time 1
and time 2, the difference
between Portland and Eugene
-------
3-32
If Hq in (2) is rejected, then
use the Welch two-sample test
for testing Hg1^. The test
statistic is
V
TT - V
S2 + s2
3u V
IT M~
Under H^, tQ has an approximate
student's t distribution with
degrees of freedom of
VJ
JL
H2(N-1)
MZ(M-U
emission slopes is signifi-
cant at a=0.05 level.
2 2
(4) In the case where au and are
shown to be quite different according
to (2), the Welch two-sample test
is used. Thus, compute
0.695 - 0.399
Find p, the value of the 95 per -
centile of the student's t
distribution with v degrees of
freedom. If tg is smaller than p,
accept at a=0.Q5 level; other-
wise, reject
(1)
t =
0 /0.146 ^ 0.106
= 0.125
4T
70
with degrees of freedom
u
/0.146 , 0
Kir-* -
.106V
ATV
(o. 1 46)2 ^ (0.IO6)2
y* ¦ * 1 ' "*• A 1
41T-40
70?~-69
= 40
Since the p value is 1.684,
which is greater than tg,
is accepted at as0.05 level.
In other words, at time 1 and
time 2, the difference between
Portland and Eugene emission
slopes 1s not significant at
a=0.05 level.
-------
3-33
The same procedure can be used for testing the difference between
slopes for passed versus failed vehicles.
Method 2. Perform an analysis of covariance according to the technique
described in Section 3.3.3 with any one pollutant-specific variable as
the dependent variable, region (Portland or Eugene) as independent
variable, and time as the covariate. Test for equality of slopes among
the two regions according to (2) in the Interpretation of Output, Section
3.3.3. The acceptance of the equality of slopes means that there is no
significant difference between the Portland and Eugene slopes.
3.3.1.3 Can idle/hat-start FTP arid FTP deterioration be related by
equations? The answer relies on regression analysis. The analysis is
broken into several steps:
Step 1. Choose a pollutant-specific FTP variable, say, Y, and one of
the idle or hot-start FTP variables, say, X.
Step 2. Suppose there are N vehicles. Let V.. be the pollutant value
' J
for the ith car measured at jth time, t., where i=l,...,N and j=l,2,3,4.
J
^ L
For the i vehicle, compute the regression coefficient,&., as
4 4 4 / 4 9 4 2
j=l J J=1 1 0=1 1J / j-1 J j=l 3
fJ. can be considered as the "pollutant-specific FTP deterioration" for
the ith vehicle, where i=l,...,N.
Step 3. Let B be the pollutant-specific FTP deterioration. Perform a
linear regression analysis with $ as dependent variable and X as indepen-
dent variable according to the procedure given in Section 3.2.5. Let b1
-------
3-34
be the regression coefficient found in the output.
Step 4. Perform a linear regression analysis with X as the dependent
variable and g as the independent variable according to the same procedure
used in Step 2. Let b2 be the regression coefficient found in the output.
2
Step 5. Compute p = the correlation coefficient, where b-j and b^
2
are from Step 3 and Step 4, respectively, p always lies between 0 and 1.
2
p achieves 1 if and only if 0 and X are in strict linear functional
relationship. If and only if p=Q, where X and p are said to be uncorre-
cted, are the two regression lines obtained in Steps 3 and 4 at right
angles to each other.
3.3.2 Does fuel economy seem to deteriorate with C02 deterioration?
Regression analysis can be used following the steps shown below.
Step 1. Suppose there are N vehicles. Let be the fuel economy of
the i vehicle measured at j time, t., where i=1s.«.,N» and j=l,2,3,4.
J
t h
Compute the fuel economy deterioration for the i vehicle, , according
to the formula given in Section 3.3.1.3, Step 2, where i=l,...,N.
i, L
Similarly, let Z.. be the CO- value of the i vehicle measured at time
IJ
t-, where i=l,...,N, j=l,2,3,4, and compute the CCL deterioration for
4" h
the i car, , according to the formula in Section 3.3.1.3, Step 2,
where i=l N.
Step 2. Let 0 be the fuel economy deterioration and a be the COg de-
terioration. Perform a linear regression analysis with g as dependent
variable and a as independent variable, according to the procedure
given in Section 3.2.5. Let b^ be the regression coefficient found in
the output.
-------
3-35
Step 3. Perform a linear regression analysis with a as dependent variable
and e as independent variable according to the procedure given in Section
3.2.5. Let b2 be the regression coefficient found in the output.
2
Step 4. Compute p = b-j-bg, the correlation coefficient, where b-j and bg
2
are from Step 2 and Step 3, respectively, p always lies between 0 and 1.
2
p achieves the value 1 if and only if a and 3 are in a strict linear
functional relationship. If and only if p = 0,where a and 8 are said to
be uncorrected, are the two regression lines obtained in Steps 3 and 4
at right angles to each other.
3.3.3 Is there a seasonal trend, in temperature, to deterioration on
the FTP? on the hot-start FTP? on the idle test? on the HFET?
The one-way analysis-of-covariance technique can be used in this
problem. The details of the technique, together with the computer sub-
routine using FTP as an example, are given below.
ONE-WAY ANALYSIS OF CQVARIANCE
(1) Select primary variables, including dependent, independent and
covariate.
EXAMPLE: The dependent variable is the pollutant-specific FTP
emission variable (say, CO). The independent variable is TIME (3 levels).
The covariate is TEMP (continuous variable).
(2) Form the table where y..^, x^j are the observed values of
j,L XL
the j -experiment unit in the i cell.
EXAMPLE: Form the table {y.., x..}, where y.. is the observed FTP
ij iJ ij
j. l
CO emission value for the j car measured at time i, and x.. is the
' 3
corresponding temperature.
-------
3-36
(3) Consider the analysis-of-covariance model
vTj = " + °i + e(xij> + s-j •
Interpretation of the model:
V is the overall mean
a., is the main effect of time i
0 is the regression coefficient for temperature
is the error term
' J
Here, we are interested in testing the hypothesis Hq:3=0. To reject the
hypothesis means there is a seasonal trend, in temperature, to deterioration
on FTP.
(4) Most of the statistical package programs contain a one-way analysis-
of-covariance subroutine, e.g., SPSS-ANOVA AND ONE WAY, and BMDP1V. In the
following, the output of BMDP1V is used as an illustration.
ONE-WAY ANALYSIS OF COVARIANCE (BMDP1V)
For program card preparation, one should consult the BMDP manual. Special in-
structions:
(i) GROUPING IS TIME
(ii) DEPENDENT IS CO
(iii) INDEPENDENT IS TEMP.
INTERPRETATION OF THE OUTPUT EXAMPLE
PROCEDURE EXAMPLE
(1) Look for the Analysis of Variance (1) Look for the Analysis of Variance
table in the output of BMDP1V. table in the OUTPUT example,
(2) Look for the TAIL AREA PROB. of (2) The TAIL AREA PROB. of the
the Equality of Slopes. If the Equality of Slopes is 0.907, which
-------
3-37
OUTPUT EXAMPLE (BMDP1V)
Estimates of Means
TEMP 2
CO 3
COVARIATE
TEMP
TIME
1
2.000
9.250
REG. COEFF.
0.833
2
3.750
11.750
TOTAL
3
1.750
12.500
4
2.500
9.166
STD. ERR.
0.433
T-VALUE
1.920
GROUP
1
2
3
GRP. MEAN
9.250
11.750
6.500
Analysis of Variance
Source of Variance
D.F.
ADJ.GRP.MEAN
9.666
10.708
7.125
STD.ERR.
0.632
0.804
0.677
SUM OF SQ. MEAN SQ. F-VALUE TAIL AREA
PROB.
Equality of Adj.Cell Means 2 17.492 8.746 6.196 0.237
Zero Slope 1 5.208 5.208 3.690 0.091
Error 8 11.291 1.411
Equality of Slopes
Error
2
6
0.359
10.931
0.179
1.822
0.098
0.907
Slope within Each Group
TEMP
1
1.00
TIME
2
1.00
3
0.54
-------
3-38
INTERPRETATION OF THE OUTPUT EXAMPLE (Cont'd)
value is greater than 0.05, then
the analysis of covariance model
(3) (above) holds. Otherwise, the
three slopes are different and the
model cannot be assumed.
If the analysis of covariance (3)
model is confirmed in (2), look
for the TAIL AREA PROB. in the
Zero Slope (source of variance).
If the value is greater than 0.05,
then the covariance regression
coefficient is significant, and
there is a significant seasonal
trend, in temperature, to the FTP
CO-values. If the value is less
than 0.05, then the regression
coefficient is nonsignificant and
there exists no seasonal trend.
If the analysis of covariance model
is not valid as shown in (2), look
at the Slope within Each Group table,
which gives an idea of the relative
magnitude of the slopes in each time
cell. No further conclusion can be
drawn here.
is greater than 0.05. Thus the
malysis-of-covariance model of
(3) (above) holds.
From (2), the analysis of co-
variance model is confirmed.
Thus look for the TAIL AREA PROB.
in the Zero Slope (source of
variance) and find the value is
0.091. This value is greater than
0.05; therefore, there is a sig-
nificant seasonal trend to the
FTP-CO value.
-------
3-39
INTERPRETATION OF THE OUTPUT EXAMPLE (Cont'd)
(4) If, from (3), it has been confirmed (4) From (3), it has been con-
that the slope is not zero, look for firmed that the slope is
the table of Estimates of Means and not zero. Look for the table
find the regression coefficient and of Estimates of Mean and find
standard error of the covariate, which REG. COEFF * 0.833, STD.ERR=
will give one an idea about the mag- 0.433, which gives one an im-
nitude of the slope. pression of the magnitude of
the seasonal trend.
(5) The CO deterioration can be seen from (5) The CO deterioration can be
the Estimates of Mean table. seen from the Estimates of
Mean table:
TIME
1 2 3
CO 9.250 11.750 12.500
One concludes that CO
deterioration does exist and
has a seasonal trend.
3.3.4 Is there a relationship between maintenance performed (a categorical
variable) and rate of ST, FTP, and fuel economy deterioration?
Apply the AID algorithm described in Section 3.2.7 with emission
value as the dependent variable and CATEGORY and TIME as the indepen-
dent variables.
3.3.5 How long does it take a failed and subsequently maintained vehicle
to reach its pre-maintenance ST/FTP/fuel economy levels?
-------
3-40
The answer to this question relies on a properly designed study.
One method has the following procedure:
PROCEDURE
Take a random sample of size n vehicles from among the set of failed
and subsequently maintained vehicles.
Once every fixed-time interval, say, once every three months, check
the vehicle's ST/FTP/fuel economy levels. Let Tg c T.| < ... < T^-
(here, M can be set equal to 3) be the scheduled checking times. Then
follow each car throughout the time interval from Tq to TM.
Let X be an emission/fuel economy variable. Let X.. be the observed X
* w
A, L
value for the i car at time T., where i=l,...,N, and j=l,...,M. For
J
each car i,compute
where i=l..,N.
NOTE: a. and {J^ are the coefficients of the linear regression of X on T.
-------
3-41
(4) Let v.. be the pre-maintenance level for the car, where i=l,...,N.
Compute
v. - a.
t. =-^—3—- , where i=l,...,N
¦ p.;
NOTE: t.. is the period for the ith tailed and subsequently maintained ve-
hicle to reach its pre-maintenance level.
(5) Compute the sample mean, t", and sample standard deviation, S^(tJ, of t.,
where i=l,...,N. Then t gives the average time to take a failed and
subsequently maintained vehicle to reach its pre-maintenance ST/FTP/
fuel economy levels. The 95% confidence interval is given by the two
limits
t + 1.96 Sd(t) .
3.3.6 Is the average amount of money spent on maintenance between the initial
retest and the final test different from Portland to Eugene by mode1-
year group? By pass/fail group?
The two-sample t test or Welsh test mentioned in 3.3.1.2, Method 1,
can be used. Specifically, suppose one wants to test whether the
average amount of money spent on maintenance differs from Portland to
Eugene. Let U. be the amount of money spent on maintenance for the
J
jth car in the 1972-1974 Portland vehicles, j=l,...,N. Let be that
money spent on maintenance for the car in the 1972-1974 Eugene cars.
Compute the sample means and sample variances of U's and V's, respec-
tively, and then follow the procedure in 3.3.1.2, Method 1.
-------
A-l
APPENDIX A
STATISTICAL TABLES
-------
A-2
Table A-la Table for Testing Skewness
(One-tailed percentage points of the distribution of %/A, = = mi m;' -)*
Size of
Sample
n
Percentage Points
Standard
Deviation
1"
; Size of
j Sample
1 n
Percentage Points
StanJ^rd
Deviation
5%
1%
5%
1%
25
0.711
1.061
0.4354
100
0.389
0.567 ;
0.2377
30
0.662
0.986
.4052
125
0.350
0.508 :
213">
35
0.621
0.923
.3804
150
0.321
0.464
.1961
40
0.587
0.870
.3596
175
0.298
0.430 '
.1820
45
0.558
0.825
.3418
200
0.2S0
0.403
.1706
50
0.534
0.787
.3264
\
! 250
0.251
0.360
.1531
60
0.492
0.723
.3009
1 300
0.230
0.329
.1-00
70
0.459
0.673
.2806
1 350
0.213
0.305
.12SS
80
0.432
0.63!
.2638
| 400
0.200
0.2*5
.1216
90
0.409
0.596
.2498
; 450
0.188
0.269
.1147
100
0.389
0.567
.2377
| 500
0.179
0 255
J0S9
* Since the distribution of is symmetrical about zero, the percental points [¦••.•pre-
sent 10% and 2% two-tailed values.
Table A-lb Table for Testing Kurtosis
(Percentage points of the distribution of b, =
Percentage Points
1
Percentage Points
Size of
Size of
Sample
Upper
Upper [
Lower
Lower '
Sample
Upper
Upper
Lower
Lower
n
1%
5°/
3/. ;
<0/
1% j
10/
5%
5%
1°
1 O
50
4.88
3.99
2.15
1.95
600
3.54
3.34
fJ
*--1
o
2.60
IS
4.59
3.87
2.27
2.0S 1
650
3.52
3.33
2.71
2.61
100
4.39
3.77 1
2.35
2.18 j
700
3.50
3.31
2 72
2.62
125
4.24
3.71 1
2.40
2.24 !
750
3.48
3.30
2.73
2.64
150
4.13
3.65 j
I
2.45
2.29 =
800
850
3,46
3.45
3.29
3.28
2.74
2.74
2.65
2.66
200
3.98
.1.57 j
2.51
2.37 |
900
3.43
3.28
2.75
2.66
250
3.87
3.52 |
2.55
2.42 j
950
3.42
3.27
2.76
2.67
300
3.79
3.47 |
2.59
2.46
1000
3.41
3 26
2.76
2.68
350
3.72
3.44 ;
2.62
2.50 ¦
400
3.67
3.41 1
2.64
2.52
1200
3.37
3.24
2.7S
2.71
450
3.63
3.39
2.66
2.55 !
1400
3.34
3.22
; 2.80
2.72
500
3.60
3.37
2.67
2.57 j
1600
3.32
3.21
2.81
2.74
550
3.57
3.35 !
2.69
2.58
1800
3.30
3.20
; 2.S2
2.76
600
3.54
3.34 |
2.70
2.60 !
2000
3.28
3.18
, 2.83
2.77
-------
A-3
Table A-2a Critical Values for the Kolmogorov-Smirnov
Test of Goodness-of-Fit
Sample
Size (n)
.20
Signif
.15
cance Le
.10
vel
.05
.01
1
.900
.925
.950
.975
.995
2
.684
.726
.776
.842
.929
3
.565
.597
.642
.708
.829
4
.494
.525
.564
.624
.734
5
.446
.474
.510
.563
.669
6
.410
.436
.470
.521
.618
7
.381
.405
.438
.486
.577
8
.358
.381
.411
.457
.543
9
.339
.360
.388
.432
.514
10
.322
342
.368
.409
.486
11
.307
.326
.352
.391
.468
12
.295
.313
.338
.375
.450
13
.284
.302
.325
.361
433
14
.274
.292
' .314
.349
.418
IS
.266
.283
.304
.338
.404
16
.258
.274
.295
.323
.391
17
.250
.266
.286
.313
.380
18.
.244
.259
.278
.309
.370
19
.237
.252
.272
.301
.361
20
.231
.246
.264
.294
.352
25
.21
.22
.24
.264
.32
30
.19
.20
.22
.242
.29
35
.18
.19
.21
.23
.27
40
.21
.25
50
.19
.23
60
.17
.21
70
.16
.19
80
.15
.18
90
.14
100
.14
Asymptotic Formula:
1.07
\/»
1.1-4
1.22
V«
1.36
1.63
V*
-------
A-4
Table A-2b Critical Values for the Kolmogorov-Snnrnov
Test of Goodness-of-Fit (use sample mean
and standard deviation)
Sample
Level of Significance far D = Max
|F*(.Y)-S.v(A')|
Size
N
.20
.15
.05
.01
4
,300
.319
.352
.381
.417
5
,285
.299
.315
.337
.405
6
.203
.277
.294
.319
.364
7
.247
.25S
.270
.300
.348
8
,233
.214
.261
.2 So
.331
9
,223
.233
.249
.271
.311
10
.215
.224
.230
.258
.294
11
.206
.217
.230
.249
.284
12
.199
.212
.223
.242
.275
13
.100
.202
.214
.234
.268
14
.183
.194
.207
.227
.261
15
.177
.187
.201
.220
,257
10
.173
.182
.195
.213
.250
17
.109
.177
.189
.206
.245
18
.100
.173
.184
.200
.239
19
. 1C3
.169
.179
.195
.235
20
.ISO
.ICS
.174
:i90
.231
25
.149
.153
.165
.180
.203
30
.131
.136
.144
.161
.187
Over 30
.736
.768
.805
.SSG
1.031
VA'
v'A'
v'A'
v'A'
viv
-------
A-5
Table A-3 Cumulative Normal Distribution—Values of P
(Values of P corresponding to Zp for the normal curve,
z is the standard normal variable. The value for P
for -zp equals one minus the value of P for +zD» e.g.,
the P for -1.62 equals 1 - .9474 = .0526.)
ZP
.00
.01
.02
.03
.04
.05
.06
.07
.08
.09
.0
,5000
.5040
.5080
.5120
.5160
.5199
.5239
.5279
.5319
.5359
.1
.5398
.5438
.5478
.5517
.5557
.5596
.5636
.5675
.5714
.5753
.2
.5793
.5832
.5871
.5910
.5948
.5987
.6026
.6064
.6103
.6141
.3
.6179
.6217
.6255
.6293
.6331
.6368
.6406
.6443
.6480
.6517
.4
.6554
.6591
.6628
.6664
.6700
.6736
.6772
.6808
.6844
.6879
.5
.6915
.6950
.6985
.7019
.7054
.7088
.7123
.7157
.7190
.7224
.6
.7257
.7291
.7324
.7357
.7389
.7422
.7454
.7486
.7517
.7549
.7
.7580
.7611
.7642
.7673
.7704
.7734
.7764
.7794
.7823
.7852
.8.
.7881
.7910
.7939
.7967
.7995
.8023
.8051
.8078
.8106
.8133
.9
.8159
.8186
.8212
.8238
.8264
.8289
.8315
.8340
.8365
.8389
1.0
.8413
.8438
.8461
.8485
.8508
.8531
.8554
.8577
.8599
.8621
1.1
.8543
.8665
.86B6
.8708
.8729
.8749
.8770
.8790
.8810
.8830
1.2
.B849
.8869
.8883
.8907
.8925
.8944
.8962
.8980
.8997
.9015
\.Z
.9032
.9049
.9066
.9982
.9099
.9115
.9131
.9147
.9162
.9177
1.4
.9192
.9207
.9222
.9236
.9251
.9265
.9279
.9292
.9306
.9319
1.5
.9332
.9345
.9357
.9370
.9382
.9394
.9406
.9418
.9429
.9441
1.6
.9452
.9463
.9474
.9484
.9495
.9505
.9515
.9525
.9535
.9545
1.7
.9554
.9564
.9573
.9582
.9591
.9599
.9608
.9616
.9625
.9633
i.a
.9641
.9649
.9656
.9664
.9671
.9678
.9686
.9693
.9699
.9706
1.9
.9713
.9719
.9726
.9732
.9738
.9744
.9750
.9756
.9761
.9767
2.0
.9772
.9778
.9783
.9788
.9793
.9798
.9803
.9808
.9812
.9817
2.1
.9821
.9826
.9830
.9834
.9838
.9842
.9846
.9850
.9854
.9857
2.2
.9861
.9864
.9868
.9871
.9875
.9878
.9881
.9884
.9887
.9890
2.3
.9893
.9896
.9898
.9901
.9904
.9906
.9909
.9911
.9913
.9916
2.4
.9918
.9920
.9922
.9925
.9927
.9929
.9931
.9932
.9934
.9936
2.5
.9938
.9940
.9941
.9943
.9945
.9946
.9948
.9949
.9951
.9952
2.6
.9953
.9955
.9956
.9957
.9959
.9960
.9961
.9962
.9963
.9964
2.7
.9965
.9966
.9967
.9968
.9969
.9970
.9971
.9972
.9973
.9974
2.8
.9974
.9975
.9976
.9977
.9977
.9978
.9979
.9979
.9980
.9981
2.9
.9981
.9982
.9982
.9983
.9984
.9984
.9985
.9985
.9986
.9986
3.0
.9987
.9987
.9987
.9988
.9988
.9989
.9989
.9989
.9990
.9990
3.1
.9990
.9991
.9991
.9991
.9992
.9992
.9992
.9992
.9993
.9993
3.2
.9993
.9993
.9994
.9994
.9994
.9994
.9994
.9995
.9995
.9995
3.3
.9995
.9995
.9995
.9996
.9996
.9996
.9996
.9996
.9996
.9997
3.4
.9997
.9997
.9997
.9997
.9997
.9997
.9997
.9997
.9997
.9998
-------
A-6
Table A-4 Percentiles of the Student's t
Distribution
X
60
75
90
95
97.5
99
99.5
99.95
1
.325
1.000
3.078
6.314
12.706
31.821
63 657
636 619
2
.289
.816
1.886
2.920
4 303
6 965
9.925
31.598
3
.277
.765
t .633
2.353
3 IS2
4 541
5 841
12 941
4
.271
.741
1.533
2 132
2 776
3 747
4.6114
8.610
5
.267
.727
1.476
2.015
2.371
3 365
4.032
6.859
6
.265
.718
1.440
1.943
2 447
3 143
3.70<
5.959
7
.263
.711
1.415
1 895
2.365
2.90S
3 499
5.405
8
.262
.706
2.397
1 860
2 306
2.896
3.355
5.041
9
.261
.703
1.383
1 833
2 262
2.821
3 250
4.781
10
.260
.700
1.372
1.812
2.228
2.764
3 169
4.5S7
11
.260
.697
1 363
1.796
2 201
2.718
3.106
4.437
12
259
.695
1 356
1.7S2
2 179
2.681
3.055
4.318
13
.259
.694
1 350
1.771
2.160
2.650
3.012
4.221
14
.258
692
1 345
1 761
2 145
2 624
2.977
4.140
15
.258
691
1.341
1.753
2 131
2 602
2.947
4.073
16
.258
690
1.337
1.746
2 120
2 583
2 921
4.015
37
.257
1.333
i.nu
2. 111
2 56!
2 fas
3.965
13
.257
-¦6€S
1,3-SD
i
2. ic-a
2.5SI
2 S7S
3.?22
10
,257
-6&S
IMS
I-I2S
2-C9.1.
2 3M
2 Ml
3.833
2D
.257
1.3-25
i iji.
2 - L&3
2.528-
2 845
3-350
21
.257
.6£&
1-323
i .;jj
2.CP!)
2 ^l&
2 811
3.319
22
.258
1.3-23
1.(37
2.C74
2 505
2 t'A
3-W3
23
.258
685
1.319
1.714
2 069
2.500
2.807
3.767
24
.256
685-
1.318
1.711
2.VJ64
2 492
2.797
3.745
25
.258
684
1.316
1.708
2 060
2.485
2.787
3.725
26
.256
.684
1.315
1.706
2 056
2.479
2.779
3.707
27
.256
684
1.314
1.703
2.052
2.473
2.771
3.69(J
23
.256
683
t .313
1.701
2.048
2.467
2.763
3.674
29
.156
ass
1.311
1.699
2.045
2.462
2.7M
3.659
30
.256
.683
1.310
1.697
2.042
2.457
2.750
3.646
40
.255
.681
1.303
1.684
2.021
2.423
2.704
3.551
60
.254
.679
1.296
1.671
2.000
2.390
2.660
3.460
120
.254
.677
1.289
1.658
1.680
2.358
2.617
3.373
«
.253
.674
1 282
1.645
1 960
2.326
2.576
3.291
-------
Table A-5 Percentiles of the F-Distribution
95th Percentile
Vs
1
2
3
4
5
6
7
8
9
10
12
15
20
24
30
40
60
120
oo
1
161-4
190-8
215 7
224 0
230 2
234 0
2368
23BB
240 6
241 9
243-9
245 9
248 0
249-1
250-1
251 I
252-2
253 3
254 3
2
18-61
19 00
19 16
19 24
19 30
19 33
19 35
19 37
19 38
ID 10
19 41
1943
19 <5
19 45
19-46
1947
1948
19 49
19 50
3
10-13
9-65
9 28
9 12
9 01
8-94
8 89
8 85
8HI
8 79
8 74
8 70
8 80
8 04
8 02
8-69
857
8-55
8 53
4
7-71
6 94
6 69
639
6 26
0 10
6 09
601
0 00
6 96
£91
6 80
6-80
6 77
6 75
5-72
669
6-60
6 03
5
6-6!
6 79
6 41
6-19
6 05
4-96
4-88
4 82
4 77
4 74
4 68
4 02
4 50
4 53
4-50
4 46
4 43
4 40
4 36
6
599
6 14
4 76
4 S3
4 39
4-28
4 21
4 15
4 10
4 08
4 00
394
3 81
3-84
3 HI
3-77
3 74
3-70
3 07
7
6 59
4-74
4 35
4-12
3 97
3 87
3 79
3 73
3 68
:i 64
3 57
3 51
3 41
3 41
3 38
3 34
3-30
3 27
3-23
8
6 32
4-46
4 07
3 84
3 69
3 58
3 60
3-4-1
3 39
3 35
3 28
3 22
3 16
3 12
3 08
304
3 01
2 97
2 93
9
6-12
4-28
3 86
3 63
3 4B
337
3-29
3 23
3 18
3 14
3 07
301
2-91
2 90
2 SO
2 83
2 79
2 75
2 71
10
4-06
4-10
3-71
3 48
3-33
3-22
3 14
3 07
3 02
2 98
2 91
2 85
2-77
2 74
2-70
2 06
2 62
2 68
2 51
11
4 84
3-98
3 59
3 36
3 20
309
301
2 95
2-eo
2-85
2 79
2-72
2 05
2 61
257
2 53
2 49
2 45
2 40
12
4 76
3-89
3 49
3-28
3-11
3 00
2 91
285
2 80
2 75
2 69
262
2 54
2 51
2 47
2 43
2 38
2 34
2 30
13
4 07
3 81
3 41
3 18
3-03
2-92
283
2-77
2 71
2 67
2 CO
2 53
2-46
2 42
2 38
234
2 30
2-25
2 21
14
4 60
3 74
3-34
3 11
2 90
2-86
2-76
270
2 65
200
2 53
2 48
2 39
2 35
2 31
227
2-22
2 18
2 13
IS
4 54
3-88
3 29
3 06
2-90
2 79
2-71
2 64
2 59
2-54
248
240
2 33
2 29
2 25
2 20
2 10
2 11
2 07
16
4 49
3-83
3 24
3 01
2-85
2-74
264
2 59
2 64
2 49
2 42
235
2-28
2 24
2 19
2 15
2 11
2 00
2 01
17
4 45
3-69
3-20
2-99
2 81
2 70
2 61
2 55
2 49
2-4.1
2 38
2 31
2-23
2 19
2 15
2 10
2 00
2 01
1-911
18
4 41
3 53
3-18
2 93
2-77
208
2-58
2 51
2-48
2 41
2 34
2 27
2-19
2 IS
2 11
206
202
1-97
1-02
19
4 38
3 62
3-13
2-90
2-74
263
2 64
2-48
2-42
2 38
2 31
2 23
2 18
2-11
207
2-03
1 98
1-U3
IBS
20
4-35
349
3-10
2 87
2 71
260
2-61
246
239
2 35
2 28
220
2 II
2 08
2 04
1 99
1-95
1 00
1-84
21
4 32
3 47
307
2-84
2 68
257
249
242
237
2 32
2 25
2 18
2 10
2 05
2 01
1 96
1 92
1-87
1-SI
22
4 30
3 44
305
2 82
2 06
2 65
2 48
2 40
2 34
2 30
2 23
2-15
2 07
2 03
1 98
1 94
1 89
1-84
178
23
4 2S
3 42
303
2 80
2 64
2-63
2-44
2 37
2 32
2 27
2 20
2 13
205
2 01
1 90
1-91
1*86
1 81
1 76
24
4 26
3 40
3-01
2-78
2 62
2 51
2-42
2 36
2 30
2-25
2 IB
2 11
2 03
1 98
191
1 89
1-84
1-79
1-73
25
4 24
3-39
2-99
2 76
2 60
2 49
2 40
2 34
2 28
2 24
2 10
2 09
2 01
1-06
1 92
1-87
1-82
1-77
1-71
24
4-23
3-37
2-98
2 74
2 69
247
239
2-32
2-27
2 *22
2-15
2 07
1 99
1 95
1 90
1 85
1-80
1-75
109
27
4 21
3 35
2-06
2-73
2 61
2 40
2-37
2 31
2-25
2 20
2-13
2 06
1-37
1 93
1-68
1-81
1-79
1 73
1-07
28
4 20
3 34
2 95
2 71
2 56
245
2 38
2-29
2 24
2 19
2 12
2 04
1 90
1 91
1-87
1 82
in
1-71
105
29
4 18
3 33
2 93
2 70
2-65
243
2 35
2 28
2-22
2-18
2-10
203
1 94
1 90
1 85
1 81
1-75
1-70
1-64
30
4 17
3-32
2-92
2-69
2 63
2-42
2 33
2 27
221
2 10
2 09
2-01
1 93
1 89
1 84
1 79
1 74
168
1 62
40
4-08
3 23
2 84
261
2 45
2 34
2-25
2-18
2-12
2-08
2-00
192
1 84
1 79
174
1 <19
1 64
1-68
151
60
4 00
3-15
2 70
253
2 37
2 23
2 17
2-10
2 04
1 99
1-02
1 84
1-75
1 70
1-65
169
1 fi3
1 47
139
120
3-02
3 07
2 08
245
2-29
2 17
2 09
2 02
1 00
1-91
1-83
1 75
ICO
1 61
1 55
1 50
I 43
1-35
1-26
CO
3-64
3 00
2 00
237
2 21
2-10
2 01
1 94
1-88
1 t>3
1 75
1 67
1-07
1 52
1 46
1 39
1 32
1 22
1 00
-------
Table A-5 - Cont'd
99th Percentile
1
2
3
4
5
6
7
8
9
10
13
IS
20
24
t
4032
48994
6403
6623
6704
6869
6928
6981
0022
6050
«10fl
•167
6209
6236
3
#3 60
99-00
99-17
99-25
99-30
9933
09 36
9937
99-39
99 40
99-42
99-43
9945
99-46
3
34-12
30 82
2940
38 71
28-24
27-91
27-07
27-49
27-36
27-23
27-00
2087
2669
26-60
4
ai-20
1800
14 69
16(18
16 62
1621
1498
1480
14-06
14-65
14-37
14-20
14 02
13 93
5
USB
13-27
12-00
11-39
10-07
1067
1040
10-20
10-16
1006
9 89
9-72
955
» 47
6
13 TC
J 0-92
9-78
9-16
8-76
4 47
B-se
B-1Q
T-D8
7'87
7-7:1
7-6U
7-40
7-31
7
1225
9 65
6 45
7-85
7-46
7 19
693
6 84
0 72
862
0 47
0 31
6 10
0 07
8
11-26
866
7-59
7-01
6 63
637
6-18
6 03
6 91
6 81
0 67
5-62
5-36
5-28
9
10-56
8-02
699
6 42
606
6 80
6 01
647
6 36
6 26
5 11
4-96
4 81
4 73
10
10-04
7-66
6-65
6-99
6-64
639
6-20
6 00
4-94
4-85
4-71
4 66
4 41
4 33
u
9 85
7-21
622
6 67
632
6 07
4 89
4 74
4 63
464
4 40
4-25
4-10
4 02
11
9J1
6M
5-95
6 41
506
4-82
4-04
4-DO
4 39
4-30
4H6
4 01
3 80
3-78
13
#07
8 70
6 74
6-21
4-80
4-62
4 44
4-30
4 19
410
3-96
3 62
3-00
3-69
M
tea
8-61
6-66
604
4-09
446
4-28
4-14
403
3-94
S-80
3 68
3 61
3-43
15
8-68
fl 36
5 42
4 89
4-66
4-32
4-14
4-00
3-89
3 BO
367
3-52
3-37
3-29
It
hi
0 23
6-29
4-77
4 44
420
4-03
3-89
3-78
3-69
3-55
3 41
3 20
3-18
17
ft-40
All
8 14
i-67
4-34
4 10
393
3-79
16S
3 GB
3 40
3 31
3 16
JOB
18
t 20
6 01
6-09
4 68
4-25
401
3-84
3 71
3 60
3 51
3-37
3-23
3 08
3-00
19
8 18
6-93
6 01
4 CO
4-17
3 94
3-77
3-03
3 62
3 43
3 30
3 10
3 00
2 92
20
8-10
6 85
4 94
4 43
4-10
3 87
3 70
3-66
346
3 37
3 23
3 09
2 94
2 86
21
8 02
6-78
4-87
4-37
404
3 81
3-64
3-51
3-40
3-31
3 17
3-03
2 88
2-80
22
7 95
6-72
4-82
4-31
3-99
3-78
3-50
3-46
336
3-20
3 12
2-98
2 83
2-75
23
IBS
6-68
4 70
4 26
394
3 71
364
3 41
330
3 21
3 07
2113
2-78
2-70
24
7-82
5 61
4-72
4 22
3-90
307
5 50
338
3-26
3 17
3 03
2 89
2-74
2 00
23
7-71
6 67
488
4 18
3-B5
363
340
3-32
322
3 13
2 99
2-B5
2-70
2-02
26
7-72
6-63
4 04
4-14
3-B2
369
342
3 29
3-18
3 09
2-90
2 81
206
2-58
27
7-68
5 49
4-80
4-11
3-78
368
3 39
326
3-IS
3 06
2-03
2-78
283
2-55
28
784
6-45
4 57
407
3-76
363
3 36
3 23
3-12
3-03
2-00
2 75
280
2-62
29
7 eo
642
4 64
4-04
3-73
360
3 33
3 20
3-09
3-00
2-81
2 73
2 67
2-49
30
7-60
fi-39
4 61
4-02
3-70
347
330
I 17
3 07
2-98
2-84
2 70
2 66
2 47
40
7-31
518
4 3!
383
3 51
320
3-12
2 99
2 89
2 80
2C6
2-52
2 37
2-29
40
7-08
498
4 13
3-6S
3 34
312
2-0S
282
272
263
260
2-35
2 20
2 12
110
6 85
479
395
3 48
3 17
£96
2-7#
2 150
2 50
2-47
2-34
2 19
2 03
1-95
CO
0 63
481
3 78
332
3-02
2-80
264
2-51
241
2 32
2 18
2-04
188
1 79
30
40
60
120
oo
G261
6!B7
6313
0339
6366
99-47
90-47
9948
99-49
99 50
20 50
20-41
20 32
20-22
20-11
13-84
13-75
13-6S
13 66
13-40
0 33
9-29
0-20
9-11
9 02
7-23
7-H
7 00
0-07
688
5-99
6-91
0-B2
6 74
6 65
6-20
6-12
603
4-95
4-88
4 05
4-67
4 48
4-40
4-31
4 25
4-17
4 OS
4-00
3 61
3-94
3 68
3-78
3-09
3-00
3-70
3-62
3-64
3 45
3-38
3 61
3 43
3 34
3 25
3-17
3-35
3-27
3-18
3 09
3-00
3-21
3-13
3-05
2-00
2-87
3 10
3-02
2-93
2 84
2-70
300
2-02
283
2-76
2-05
2 92
2-84
2-76
2-00
2-57
2 84
2-76
2 07
2-68
2-49
2-78
2-69
20!
2 62
2-42
2-72
2 64
2-66
2-46
2 38
2-67
2-68
2-60
240
2 31
2 02
2-64
2-44
2-35
2 20
2-58
2-49
2 40
2-31
2 21
3-54
2*46
2-38
2-21
217
2-50
2 42
2-33
2-23
2 13
2 47
2-38
2-29
2-20
2-10
2 44
2 35
2 26
2-17
2-05
2 41
2 33
2-23
2 14
2-03
2-39
2 30
2 21
2-11
2-01
2-20
3-11
2-02
1-92
1-80
2-03
1-94
1 84
1 73
1-80
] 86
1-70
1-G&
1 53
1-38
1 70
1-69
1-47
1-32
1-00
-------
A-9
2
Table A-6 Percentiles of the x Distribution
X
0.5
t
2.5
5
10
20
30
40
1
0.0'393
0.0J157
0.03982
0.0*393
0.0158
0.0642
0.148
0.275
2
0.0 IOO
0.0201
0.0506
0.103
0.211
0.446
0.713
1.02
3
0.0717
0.115
0.216
0.352
0.584
1.00
1.42
1.87
4
0.207
0.297
0.484
0.711
1.06
1.65
2.19
2.75
5
0.412
0.554
0.831
1.15
1.61
2.34
3.00
3.66
6
0.676
0.872
1.24
1.64
2.20
3.07
3.83
4.57
7
0.989
1.24
1.69
2.17
2.83
3.82
4.67
5.49
8
1.34
1.65
2.18
2.73
3.49
4.59
5.53
6.42
9
1.73
2.09
2.70
3.33
4.17
5.38
6.39
7.36
10
2.16
2.56
3.25
3.94
4.87
6.18
7.27
8.30
II
2.60
3.05
3.82
4.57
5.58
6.99
8.15
9.24
12
3.07
3.57
4.40
5.23
6.30
7.81
9.03
10.2
13
3.57
4.11
5.01
5.89
7.04
8.63
9.93
11.1
14
4.07
4.66
5.63
6.57
7.79
9.47
10.8
12.1
15
4.60
J.23
6.26
7.26
8.55
10.3
11.7
13.0
16
5.14
5.81
6.91
7.96
9.31
11.2
12.6
14.0
17
5.70
6.41
7.56
8.67
10.1
12.0
13.5
14.9
IS
6.26
7.01
8.23
9.39
10.9
12.9
14.4
15.9
19
6.84
7.63
8.91
10.1
11.7
13.7
15.4
16.9
20
7.43
8.26
9.59
10.9
12.4
14.6
16.3
17.8
21
3.03
8.90
10.3
11.6
13.2
IJ.4
17.2
18.8
22
8.64
9.54
11.0
12.3
14.0
16.3
18.1
19.7
23
9.26
10.2
11.7
<3.1
14.8
17.2
19.0
20.7
24
9.89
10.9
12.4
!3.a
15.7
18.1
19.9
21.7
25
10.5
11.5
13.1
14.6
16.3
ia.9
20.9
22.6
26
11.2
12.2
13.8
15.4
17.3
19.8
21.8
23.6
27
1 1.8
12.9
14.6
16.2
18.1
20.7
22.7
24.5
28
12.5
13.6
15.3
16.9
18.9
21.6
23.6
25.5
29
13.1
14.3
16.0
17.7
19.8
22.5
24.6
26.5
30
13.8
15.0
16.8
18.5
20.6
23.4
25.5
27.4
3S
17.2
18.5
20.6
22.5
24.8
27.8
30.2
32.3
40
20.7
22.2
24.4
26.5
29.1
32.3
34.9
37.1
45
24.3
2J.9
28.4
30.6
33.4
36.9
39.6
42.0
JO
28.0
29.7
32.4
34.8
37.7
41.4
44.3
46.9
75
47.2
49.5
52.9
56.1
59,8
64.5
68.1
71.3
too
67.3
70.1
74.2
77.9
82.4
87.9
92.1
95.8
-------
A-10
Table A-6 - Cont'd
50
SO
70
80
90
95
97.5
99
99.5
S9.9
1
0.455
0.708
1.07
1.64
2.71
3.84
5.02
6.63
7.88
10.8
2
1.39
1 83
2.41
3.22
4.61
5.99
7.38
9.21
10.6
13.8
3
2.37
2.9J
3.67
4.64
6.25
7.8!
9.35
11.3
12.8
16 3
4
3.36
4.04
4.88
5.99
7.78
9.49
111
13.3
14.9
18.5
5
4.35
5.13
6.06
7.29
•9.24
11.1
12.8
15.1
16.7
20.5
6
5.35
ft 21
7.23
8.56
10.6
12.6
14.4
16.8
18.5
22.5
7
6.35
7.28
8.38
9.80
12.0
14.1
16.0
18.5
20.3
24.3
8
7.34
8.35
9.52
11.0
13.4
15.5
17.5
20.1
22.0
26.1
9
8.34
9.41
10.7
12.2
14.7
16.9
19.0
21.7
23.6
27.9
10
9.34
10.5
11.8
13.4
16.0
18.3
20.5
23.2
25.2
29.6
11
10.3
11.5
12.9
14.6
17.3
19.7
21.9
24.7
26.8
31.3
12
11.3
12.6
14.0
15.8
18.5
21.0
23.3
26.2
28.3
32.9
13
12.3
11 6
15.1
17.0
19.8
22.4
24.7
27.7
29.8
34.5
(4
13.3
14."
16.2
18.2
21.1
23.7
26.1
29.1
31.3
36.1
IS
14.3
15.7
17.3
19.3
22.3
25.0
27.5
30.6
32.S
37.7
16
15.3
16.8
18.4
20.5
23 5
26.3
28.8
32.0
34.3
39.3
17
16.3
17.S
19.5
21.6
24.8
27.6
30.2
33.4
35.7
40.8
IS
17,3
18.9
20.6
22.8
26.0
28.9
31.5
34.8
37.2
42.3
19
IS.3
19.9
21.7
23.9
27.2
30.1
32.9
36.2
38.6
43.8
20
19.3
21.0
22.8
25.0
28.4
31.4
34.2
37.6
40.0
4S.J
21
10.3
22.0
23.9
26.9
29.6
32.7
35.5
3S.9
41.4
46.8
22
21.3
23.0
24.9
27.3
30.8
33.9
36.8
40.3
42.8
4S.3
23
22.3
24 1
26.0
28.4
32.0
35.2
38.1
41.6
44.2
49.7
24
23.3
25.1
27.1
29.6
33.2
36.4
39.4
43.0
45.6
51.2
25
24.3
26.1
28.2
30.7
34 4
37.7
40.6
44.3
46.9
52.6
26
25.3
*>7 ">
29.2
31.8
35.6
38.9
41.9
45.6
48.3
54.1
27
26.3
2S.2
30.3
32.9
36.7
40.1
43.2
47.0
49.6
55.5
28
27.3
"N 2
31.4
34.0
37.9
41.3
44.5
48.3
5 t.O
56.9
29
28.3
30J
32.5
35.1
39.1
42.6
45.7
49.6
52.3
58.3
30
29.3
3!.3
33.5
36.3
40.3
43.8
47.0
50.9
53.7
59.7
35
34.)
36.5
38.9
41.8
46.1
49.8
53.2
57.3
60.3
66.6
40
39.3
41.fi
44.2
47.3
J1.8
55.8
59.3
63.7
66.8
73.4
45
44.3
46.8
49.5
52.7
57.5
61.7
65.4
70.0
73.2
80.1
50
49.3
51.9
54.7
58.2
63.2
67.5
71.4
76.2
79.5
86.7
7J
74.3
77.5
80.9
85.1
91.1
96.2
100.8
106.4
) 10.3
118.6
100
99.1
102.9
106.9
111.7
118.5
124.3
129.6
135.6
140.2
149.4
-------
0.003
0.1ES
0.203
0.303
0.«t00
0.500
0.B00
0.700
0.BBB
0.900
0.910
0.920
0.930
8.9W
0.950
0.950
0.970
0.9B0
0.990
0.991
0.992
0.993
0.994
0.995
8.996
0.997
0.938
0.339
-------
B-l
APPENDIX B
MASTER-AID CLASSIFICATION ALGORITHM
USER'S MANUAL
-------
B-2
MANUAL FOR MASTER-AID PROGRAM
by John Gins
1.0 INTRODUCTION
This manual describes the uses and access rules for two very similar
programs that form decision trees. The first program, Master-Aid-
Classification,performs non-parametric discriminant analysis on data sets
that can be classified into two classes (an n-class version may be
implemented in the future). The second program, Master-Aid-Splitting,
performs the standard AID algorithm on continuous data sets.
Since the second program is still under development, the first draft
of the manual will describe the first program only.
-------
B-3
2.0 MASTER-AID-CLASSIFICATION PROGRAM
This program implements the Friedman algorithm, the weighted Friedman
algorithm with the Breiman cost function, the Stone algorithm, and the
Aid Classification algorithm. Since these are non-parametric statistics,
no prior knowledge of the distribution of the various variables is
necessary. The problem of unequal class sizes observed with the Friedman
algorithm is handled by the other three algorithms. Section 2.1 describes
the input and output of this program, and Section 2.2 describes the way
the program works.
2.1 INPUT AND OUTPUT OF THE MASTER-AID-CLASSIFICATION PROGRAM
The program requires two general types of input files: data files
and an information file. The information file is described in Section
2.1.1, the data file in Section 2.1.2, and the output in Section 2.1.3.
2.1.1 Information File
The first two records of the information file are in free format.
All values should be separated by either a comma or a space. Alphabetic
values should have single quotes around them. Values in exponential
format should have no space between the fractional part and the exponent
(i.e., 1.030E17).
The first record contains information about the data input files,
the location of the format for a test file, and the number of variables
in the data vector. Table 2.1.1 describes this record.
The second record contains further information about the data input
files and the algorithms to be used. Printing information and decision
-------
B-4
Table 2.1.1 First Information Record
VARIABLE
NAME
TYPE
DESCRIPTION
INFILE
INTEST
I SAME
LREC
INTEGER
INTEGER
INTEGER
INTEGER
The FORTRAN number of the learning data file to be used.
This should not be 6, 7, or 5. If this value is 0 then
the learn mode will be skipped (refer to ITSAVE),
The FORTRAN number of the testing data file to be used.
This should not be 6, 7, 5 or the same as INFILE. If
this value is 0 then the test mode will be skipped.
This is a flag indicating if further information is to
be read for the test mode. 1 means read new information;
0 means use old information.
This is the number of variables that can be used for
splitting. 1
-------
8-5
rules are also specified. Table 2.1.2 describes this record. A few of the
options are further clarified in the following paragraphs.
Only one number is allowed as the missing value indicator, VMIS;
therefore, the input data sets will have to be internally consistent with
respect to missing value indicators (i.e., if -9999.0 is used as a missing
value for one variable, it should be used for all). Use of blanks for
missing values is unwise because^ when read in by FORTRAN, a blank is given
a value zero (0). However, zero is a legal missing value.
The choice of algorithm (NALG) is important to the user of this
program. The Friedman algorithm (NALS=1) was originally intended to handle
data having a balanced design model (i.e., 50$ of the data in class 1 and
50% in class 2). A balanced design model is not always the case, however.
It has been suggested that, unless a balanced model can be assured, the
weighted Friedman algorithm be used (MALG=0). Friedman and weighted
Friedman algorithms are the same if the learning data set contains an
equal number of vectors assigned to each class and RLAMDA=0. RLAMDA is
a cost parameter used to weight splits that create pure nodes more heavily.
The Stone splitting algorithm (NALS=2) and the Aid Classification algorithm
(NALG=3) can also be accessed in the current version of the program.
The header vector, which can contain up to 100 real or A4 values, is
used only when the record is read in the first time. The only important
piece of information that must be in the header vector is the class label
for the record. This value is converted to either a 1, for class 1, or a
2, for class 2.
-------
B-6
Table 2.1.2 Second Information Record
VARIABLE
NAME
TYPE
CLASS1
CLASS2
VMIS
ITSAVE
NALG
REAL or
ALPHA
REAL or
ALPHA
REAL
INTEGER
RLAMDA
NHEAD
INDCL
NFORM
NRULE1
RULE2
INTEGER
REAL
INTEGER
INTEGER
INTEGER
INTEGER
REAL
DESCRIPTION
The value of the class label (see INDCL) that corresponds
to the first of two classes.
The value of the class label (see INDCL) that corresponds
to the second of two classes. If a blank value (i.e., 1 ')
is input,then any class label not equal to CLASS1 will be
assigned to the second of the two classes.
This is the missing value code for all variables of the
splitting vector.
The FORTRAN unit number of the data set where the tree
data should be saved if INFILE/0. If INFILE=0,then tree
data should be read from ITSAVE. If ITSAVE=0,then do
not save the tree. ITSAVE?«7,6,5JNFILE, or INTEST.
This is a flag to indicate choice of algorithm,If NALG=1, use
the unweighted Friedman algorithm for splitting.If NALG=0, the a
weighted Friedman algorithm is used.If NALG=2,the Stone algoritf^
is used. If NALG=3,the Aid Classification algorithm is used.
The cost for incorrect classification using the weighted
Friedman algorithm RLAMDA > 0. If NALG^O, RLAMDA should
be 0.
The number of variables in the header of each data record
(l
-------
B-7
Table
2.1.2 Second Information Record (Cont'd)
VARIABLE
NAME
TYPE
DESCRIPTION
RULE3
REAL
Smallest allowable Beta value of the distance term between
O.OQOlandl.O. This rule can be suppressed (see INODEC
below).
NCOICE
INTEGER
Classification algorithm flag. NC0ICE=1 means use majority
rule to determine the classification of a node.
NC0ICE=2 means that the class with the largest probability
of correct classification value 1s us£d to classify a node.
NODEC
INTEGER
This flag governs the type of nodes to be printed by the
program. Use of this parameter actually can provide a
limited amount of look-ahead capability.
N0DEC=0 Print all terminal nodes and those split nodes that
are valid with respect to the four stopping rules.
N0DEC=1 Print all terminal nodes and all split nodes that
are valid with respect to stopping rules 1, 2, and 4. If a
node split violates rule 3 but has a descendant that 1s
valid with respect to rule 3,then a split 1s allowed.
N0DEC=2 Print all terminal nodes and all split nodes that
are valid with respect to stopping rules 1, 2, and 4.
NTREE
INTEGER
Is a flag which,1f 1,writes data to FORTRAN unit 7,which is
compatible with Dennis Bicker's tree-drawing program.
NCNAME
INTEGER
Is a flag which,if 1,means read in class names after the
format cards. (Default names are CLASS! and CLASS2.)
NVNAME
INTEGER
Is a flag which,if 1,means read in the LREC variable names
after the class names.
NINOX
INTEGER
Is the total number of variables used in the classification
process during this run. V4UNDX_
-------
B-8
There are four termination rules used in this program. A termination
is a decision not to split a node. Modes which are not split are called
terminal nodes and are assigned a class. The first rule, NRULE1, specifies
the smallest number of values for which a split can be determined. NRULE1
is user-determined and is usually equal to the square root of the total
number of cases. RULE2 stops further splitting by checking the probability
of correct classification once the node is reached. If the probability
is very high (i.e., 97.5%), then further splitting may not be necessary.
RULE3, the Beta rule, stops further splitting if a split will not explain
very much about the process (Beta is too small); it can be suspended either
temporarily or permanently (see NODEC) for a given run. RULE4 sets an
absolute bound on the minimum amount of explanation (Beta) required before
splitting (RULE4=0.0001).
The array INDSV contains the indexes of the variables in each record
that are to be processed and in what order they should be processed. The
order is important only when ties occur in the parameters used for
splitting. It a tie occurs, the first variable is used for splitting.
The third through sixth records are read in, using 10A8 format. These
contain information on format and names. These cards are described in
Table 2.1.3.
2.1.2 The Data Files
Two types of files can be used. The learning data file is on FORTRAN
unit INFILE, if INFILE ^0. This data is described with the format data on
the information records. If the test data file (INTEST^O, INTESTj'INFILE)
has a different file structure, then ISAME=0 and another set of information
cards 2 through 6 must be read in. Table 2.1.4 describes a typical record.
-------
B-9
Table 2.1.3 Information Records 3 through 6
RECORD
VARIABLE
NAME
TYPE
DESCRIPTION
3
4
AFORM
CLNAME
VNAME
ATITLE
ALPHA
ALPHA
ALPHA
ALPHA
Contains NF0RM*80 columns of valid FORTRAN format
describing the data input.
Two names of 8 columns each. The first is the
name given class l,and the second name,class 2.
(This 1s only used 1f NCNAME=1.)
Contains the LREC names for each variable in input
order. The name length is 8 columns and there are
ten to a card. (This is only used if NVNAME=1.)
Contains an 80-column title. This is optional but
must be at least a blank line if ISAME=1. The title
is put on each output sheet and on the tree.
-------
B-10
Table 2.1.4 Typical Data Record

NAME   TYPE            DESCRIPTION
HEAD   REAL or ALPHA   This is the header information read off the
                       record. There are NHEAD values read in, and the
                       INDCL-th value is the class indicator of the
                       record.
ALL    REAL            This is a real vector containing the LREC values
                       read off the record. These LREC values are used
                       to define the decision tree.
-------
B-11
2.1.3 Output for the Master-Aid Classification Program
Three types of output records are generated from this program. Two
are data files and one is printout.
If NTREE = 1, then a file containing node and split data is written
on FORTRAN unit 7. This information is in a form that can be read by
Dennis Bicker's tree plot program.
If ITSAVE is a valid FORTRAN unit number, then the matrix containing
all node information is written. This data, described in Table 2.1.5, is
used as input to the program if only the test mode is used.
Printout for this program is of three different types in learning mode
and two different types in test mode, as well as a printout of the input
information used in both modes.
For each node examined in learning mode, a detailed breakdown of the
values for distance, the cutpoint theta, and the Beta value for each
variable is given.
A node summary is given which tabulates the contents of the array CUT.
Part of the array description is given in Table 2.1.5, with the following
exceptions:
• CUT(2) is the number of vectors in class 1 with respect to CUT(1)
• CUT(3) is the number of vectors in class 2 with respect to CUT(2)
• CUT(11) is the number of vectors in class 1 split from CUT(7)
• CUT(12) is the number of vectors in class 2 split from CUT(7)
• The variable name, distance, Beta, and cutpoint are left blank
for terminal nodes.
-------
B-12
Table 2.1.5 Data Tree Written to (or Read From) a Save File

RECORD   VARIABLE NAME   FORMAT   COLUMNS   DESCRIPTION

Record 1 occurs once:
1        Q1              F10.8    1-10      Probability of misclassifying
                                            class 1.
1        Q2              F10.8    11-20     Probability of misclassifying
                                            class 2.

Record 2 is repeated as many times as there are nodes:
2        I               I5       1-5       Node number of this record.
2        CUT(1)          F10.0    6-15      The variable number for which
                                            this node was split.
2        CUT(2)          F10.0    16-25     The number of learning vectors
                                            in class 1 which descended from
                                            the previous node (CUT(7)).
2        CUT(3)          F10.0    26-35     The number of learning vectors
                                            in class 2 which descended from
                                            the previous node (CUT(7)).
2        CUT(4)          F10.5    36-45     Kolmogorov-Smirnov distance for
                                            the split with respect to
                                            CUT(1).
2        CUT(5)          F10.5    46-55     Beta value for CUT(4)
                                            (corresponds to the amount of
                                            explanation of the
                                            classification CUT(4) makes).
2        CUT(6)          F10.5    56-65     The splitting point for CUT(1):
                                            everything less than or equal
                                            goes to the left descendant;
                                            greater than, to the right.
2        CUT(7)          F5.0     66-70     Node number this node descended
                                            from.
2        CUT(8)          F5.0     71-75     Node number of left descendant
                                            node.
2        CUT(9)          F5.0     76-80     Node number of right descendant
                                            node.
2        CUT(10)         F5.0     81-85     Type of descendant: 0 means top
                                            node, -1 means left descendant
                                            of CUT(7), 1 means right
                                            descendant of CUT(7).
2        CUT(13)         F5.0     86-90     Stopping rule applied: 1, 2, 3,
                                            4, or 0.
-------
B-13
Table 2.1.5 Data Tree Written to (or Read From) a Save File (Cont'd)

RECORD   VARIABLE NAME   FORMAT   COLUMNS   DESCRIPTION
2        CUT(14)         F5.0     91-95     Class of node: 1 = class 1,
                                            2 = class 2.
2        CUT(15)         F5.0     96-100    0 if class was selected
                                            normally; 1, randomly.
2        CUT(16)         F5.0     101-105   0 if CUT(1) was the unique
                                            "best"; 1, ambiguous.
2        CUT(17)         F5.0     106-110   Percent probability of correct
                                            classification given that this
                                            node was reached.
-------
B-14
• "TERMINAL NODE" replaces descendant node numbers for terminal nodes
• Descendant type CUT(10) is printed as "LEFT", "TOP", or "RIGHT"
instead of -1, 0, 1, respectively
• Where possible, the class name is used instead of a number
• CUT(15) = 1 and CUT(16) = 1 are each replaced by a flag, and
footnotes are given explaining these abnormal situations
• If any prior probabilities are used in classification, they are
listed at the bottom of the page.
A misclassification summary is given on a separate page. This summary
tabulates the number of correct and incorrect classifications made by the tree
described above. The percentages of correct and incorrect classifications are
also given.
In the test mode, a set of data is run through a previously constructed
decision tree. A node summary and a misclassification summary are given for
the test. The printout is the same as for the learn mode, with the following
exceptions:
• The vectors for classes 1 and 2 given in CUT(2) and CUT(3) are the
same as those in CUT(11) and CUT(12) in the learning mode
• The numbers of vectors listed in CUT(11) and CUT(12) are the numbers
of test vectors in classes 1 and 2, respectively, that were split
from the previous node (CUT(7))
• All the rest of the information in this node summary was taken from
the learn mode
• The misclassification summary reflects only the vectors that were
classified during the test
-------
B-15
• An estimate is made of the probability of correct and wrong classi-
fication, given the test set and the probability of misclassification
from the learn set.
2.2 PROGRAM DESCRIPTION
The next two sections briefly describe how the program works
and its limitations. The weighted and unweighted Friedman algorithms will be
described more completely in a paper by Leo Breiman.
2.2.1 Learning Mode
The learning mode is turned on when INFILE is given a valid unit number.
The current limitations are as follows: 1 < LREC < 48. The number of valid
records used in learning, NT, must be greater than zero and 2NT(LREC + 2)
≤ 80000; with LREC = 38, for example, 2NT(40) ≤ 80000 permits up to
NT = 1000 learning records. The largest number of nodes that can be
calculated is 1000.
The main program reads in the first information card from FORTRAN unit 5.
It determines the dimensions of the array ALL, which is used to store and
process the learning vectors.
The main subroutine MAINP is used to do most of the data processing and
I/O. After the rest of the information is read in, the class of the record
from HEAD(INDCL) is read and checked against the valid class codes. If it
fails, then the next card is read; if it passes, then all LREC vector
variables (independent variables) are placed in the array ALL.
Unless all the data is in one class, a split is always made on the first
node. For all other vectors, termination rules 1 and 2 are applied before
splitting is attempted. Rules 3 and 4 are applied after a splitting criterion
has been determined. Classification of a terminal node is always made before
splitting is attempted.
-------
B-16
In general, a split (cut) point is determined as follows:
• The variables are treated individually
• They are sorted without missing values, and the class is kept with
each value
• The maximum Kolmogorov-Smirnov distance is determined for this
variable, and the associated Beta and split point are saved.
(The value of the variable at which the distance is maximum
is the split point; a sketch follows this list.)
• Notice that this method might result in each variable set using a
different number of vectors, since missing values are only taken into
account during the sorting process. Thus, CUT(2) ≤ CUT(11) and
CUT(3) ≤ CUT(12).
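The sketch below is modern Fortran written for this manual, not an excerpt
from the program; the eight data values and all names are invented. It scans
one sorted variable and reports the cut point at which the Kolmogorov-Smirnov
distance between the two class distributions is largest.

      program ks_cut
        implicit none
        real    :: x(8)   = (/0.1, 0.4, 0.5, 0.9, 1.2, 1.7, 2.0, 2.5/)
        integer :: cls(8) = (/1, 1, 2, 1, 2, 2, 1, 2/)
        integer :: n1, n2, c1, c2, i
        real    :: d, dmax, cut

        n1 = count(cls .eq. 1)
        n2 = count(cls .eq. 2)
        c1 = 0
        c2 = 0
        dmax = -1.0
        cut  = x(1)
        do i = 1, size(x) - 1
           if (cls(i) .eq. 1) c1 = c1 + 1
           if (cls(i) .eq. 2) c2 = c2 + 1
           ! KS distance = gap between the two empirical CDFs at x(i)
           d = abs(real(c1) / n1 - real(c2) / n2)
           if (d .gt. dmax) then
              dmax = d
              cut  = x(i)        ! everything .le. cut goes left
           end if
        end do
        print *, 'cut point =', cut, '  KS distance =', dmax
      end program ks_cut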
After a split is determined, and if all the stopping rules are satisfied,
the data set is split into two parts, left and right descendants, and these
parts are moved to different parts of the array ALL. Processing continues
until all terminal nodes are reached, and summary tables are printed.
2.2.2 Testing Mode
The test mode is turned on when INTEST is given a valid value. If INFILE=0,
then a decision tree is read in from FORTRAN unit ITSAVE. Currently,
the data is read and processed exactly as in the learning mode, except that no
distance is calculated, since the split point from the learning mode
is used at each node.
-------
C-l
APPENDIX C
AS-51 LOG-LINEAR MODEL
FIT ALGORITHM
-------
C-2
Algorithm AS 51
Log-linear Fit for Contingency Tables
By S. J. Haberman
Department of Statistics, University of Chicago
Language
ANSI Standard Fortran
Description and Purpose
This algorithm performs an iterative proportional fit of the marginal totals of a
contingency table. The method used has been described by Deming and Stephan
(1940), Fienberg (1970) and Goodman (1970). The algorithm may be used to obtain
maximum likelihood estimates which correspond to hierarchical log-linear models
for both complete and incomplete contingency tables.
In hierarchical log-linear models for complete tables, the logarithms
\{\mu_{i_1 \cdots i_d}\} of the expected values \{m_{i_1 \cdots i_d}\} of the
frequencies of a factorial table, 1 \le i_j \le r_j for j = 1, \ldots, d, are
assumed to satisfy an additive model similar to those encountered in the
analysis of variance. In such a model, if d = 3 and

    \mu_{ijk} = \log m_{ijk},                                        (1)

then

    \mu_{ijk} = \mu + \mu_i^1 + \mu_j^2 + \mu_k^3
              + \mu_{ij}^{12} + \mu_{ik}^{13} + \mu_{jk}^{23},       (2)

and

    \sum_i \mu_i^1 = \sum_j \mu_j^2 = \sum_k \mu_k^3 = 0,
    \sum_i \mu_{ij}^{12} = \sum_j \mu_{ij}^{12}
      = \sum_i \mu_{ik}^{13} = \sum_k \mu_{ik}^{13}
      = \sum_j \mu_{jk}^{23} = \sum_k \mu_{jk}^{23} = 0.             (3)

In this model, d = 3, s = 3, C_1 = \{1,2\}, C_2 = \{1,3\}, and
C_3 = \{2,3\}.
-------
C-3
The maximum likelihood estimate \{\hat m_{i_1 \cdots i_d}\} has the same
marginal totals \{\hat m_{C_k}\} as the observed \{n_{C_k}\}, for
k = 1, \ldots, s. In the example,

    \hat m_{ij+} = n_{ij+},                                          (4)
    \hat m_{i+k} = n_{i+k},                                          (5)
    \hat m_{+jk} = n_{+jk}.                                          (6)

Thus a hierarchical model may be described in terms of the marginal totals to be fit.
The Deming-Stephan algorithm is an iterative method which finds
\{\hat m_{i_1 \cdots i_d}\}. The basic iteration cycle begins with an initial
estimate \{m^{(0)}\} of \{\hat m\} such that \{m^{(0)}\} satisfies the
model (2). At the first cycle, one may set each m^{(0)}_{i_1 \cdots i_d} = 1.
An iteration cycle consists of s steps, one for each marginal configuration
C_k. Each step is defined by the equation

    m^{(k)} = m^{(k-1)} \, n_{C_k} / m^{(k-1)}_{C_k}.                (7)

At the end of a cycle, one may let \{m^{(0)}\} for the next cycle be
\{m^{(s)}\}. In the example,

    m^{(1)}_{ijk} = m^{(0)}_{ijk} \, n_{ij+} / m^{(0)}_{ij+},
    m^{(2)}_{ijk} = m^{(1)}_{ijk} \, n_{i+k} / m^{(1)}_{i+k},
    m^{(3)}_{ijk} = m^{(2)}_{ijk} \, n_{+jk} / m^{(2)}_{+jk}.

Deming and Stephan (1940) use this algorithm to adjust tables so that they have
specified marginal totals.
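The cycle defined by (7) is easy to run in miniature. The sketch below is
illustrative modern Fortran, not part of AS 51; it fits the row and column
totals of a 2 by 2 table, which is the special case s = 2, C_1 = {1},
C_2 = {2}. LOGLIN below implements the same cycle for arbitrary marginal
configurations.

      program ipf_demo
        implicit none
        real :: obs(2,2), fit(2,2), rowm(2), colm(2)
        integer :: i, j, it

        obs  = reshape((/10., 20., 30., 40./), (/2,2/))
        fit  = 1.0                   ! m(0): the all-ones table
        rowm = sum(obs, dim=2)       ! observed row totals n(i+)
        colm = sum(obs, dim=1)       ! observed column totals n(+j)
        do it = 1, 10                ! a few cycles usually suffice
           do i = 1, 2               ! step 1 of (7): match row totals
              fit(i,:) = fit(i,:) * rowm(i) / sum(fit(i,:))
           end do
           do j = 1, 2               ! step 2 of (7): match column totals
              fit(:,j) = fit(:,j) * colm(j) / sum(fit(:,j))
           end do
        end do
        print '(2f8.2)', (fit(i,:), i = 1, 2)
      end program ipf_demo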
Structure
SUBROUTINE LOGLIN (NVAR, DIM, NCON, CONFIG, NTAB, TABLE, FIT,
LOCMAR, NMAR, MARG, NU, U, MAXDEV, MAXIT, DEV, NLAST, IFAULT)
Formal parameters
NVAR     Integer                      input:  the number of variables d in
                                              the table.
DIM      Integer array (NVAR)         input:  the number of categories r_j
                                              in each variable of the table.
NCON     Integer                      input:  the number s of marginal
                                              totals to be fit.
CONFIG   Integer array (NVAR, NCON)   input:  the sets C_k, k = 1, ..., s,
                                              indicating the marginal totals
                                              to be fit.
-------
C-4
NTAB     Integer                      input:  the number of elements in the
                                              table.
TABLE    Real array (NTAB)            input:  the table to be fit.
FIT      Real array (NTAB)            input
                                      and
                                      output: the fitted table.
LOCMAR   Integer array (NCON)         output: pointers to the tables in
                                              MARG.
NMAR     Integer                      input:  the dimension of MARG.
MARG     Real array (NMAR)            output: the marginal tables to be fit.
NU       Integer                      input:  the dimension of U.
U        Real array (NU)              output: a work area used to store
                                              fitted marginal tables.
MAXDEV   Real                         input:  the maximum permissible dif-
                                              ference between an observed
                                              and fitted marginal total.
MAXIT    Integer                      input:  the maximum permissible
                                              number of iterations.
DEV      Real array (MAXIT)           output: DEV(I) is the maximum ob-
                                              served difference encountered
                                              in iteration cycle I between an
                                              observed and fitted marginal
                                              total.
NLAST    Integer                      output: the number of the last
                                              iteration.
IFAULT   Integer                      output: an error indicator. See failure
                                              indications below.
Comments
The tables TABLE and FIT are arranged in. standard Fortran fashion. If the
observations to be fit are represented in an array {»,¦,•}, where 1 ^ < 2, 1 < U ^ 3 and
then NVAR - 3, DIM{\) ** 2, DIM(2) = 3, DIM(3) = 4, NTAB ~ 24,
7ri3LE{\) — vni, TABLEt'Z) = TABLE(3) = f!m> TABLE(4) = etc. If the
S:red table is an array {»/,•_;*.}, then FIT(Y) — >fiul, FIT{2) = rii.m, FIT(3) — etc.
Marginal totals to be fitted are described by the array CONFIG. If Ck has Jk
elements, then these elements are given by CONFlG(J,K), where 1 ^.J£db. If dk fit, then one may Set F1T\I) = 1 if the corre-
sponding fitted value can be positive and FIT{1) = 0 if the corresponding fitted value
mis: be zero. If in the three-way table under discussion, «lu cannot be observ ed, one
would set FIT[D = 0 and FIT(i) = I for 2 il^NTAB.
An iterative procedure is used to transform FIT into the desired fitted table. As
many as MAXIT iterations may be performed; MAXIT may be set at 10 or
15 in most problems. Iteration I is the last iteration if DEV(I) < MAXDEV. A
reasonable value for MAXDEV is often 0.25.
-------
C-5
Under some circumstances, convergence occurs after one cycle. In such cases,
MAXIT may be set to 1 to avoid an unnecessary second cycle. No error condition
for failure of convergence will be observed in this case.
The marginals to be fit are placed in MARG by the subroutine. They are arranged
in the order specified by CONFIG. In the example under discussion, NMAR must be
at least 26. The array MARG is constructed so that MARG(1) = n_{11+},
MARG(2) = n_{21+}, MARG(3) = n_{12+}, ..., MARG(7) = n_{1+1},
MARG(8) = n_{2+1}, etc. To facilitate reference to marginal tables, the
locations of the first elements of the marginal tables are stored in LOCMAR.
Thus LOCMAR(1) = 1, LOCMAR(2) = 7 and LOCMAR(3) = 15.
During computations, a work area U is required. This array should be as large
as the largest marginal table used.
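The offsets quoted above follow directly from the marginal table sizes: the
n_{ij+}, n_{i+k} and n_{+jk} tables hold 2x3 = 6, 2x4 = 8 and 3x4 = 12
elements, respectively. A short sketch of the arithmetic (illustrative code,
not part of AS 51):

      program locs
        implicit none
        integer :: sizes(3) = (/6, 8, 12/), locmar(3), k
        locmar(1) = 1
        do k = 2, 3
           locmar(k) = locmar(k-1) + sizes(k-1)
        end do
        print *, locmar              ! prints 1, 7, 15, matching the text
      end program locs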
Failure indications
If IFAULT = 0, then no error was detected.
If IFAULT = 1, then CONFIG contains an error.
If IFAULT = 2, then MARG or U is too small.
If IFAULT = 3, then the algorithm did not converge to the desired accuracy
within MAXIT iterations.
If IFAULT = 4, then DIM, TABLE, FIT or NVAR is erroneously specified.
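As a concrete illustration of the calling sequence, a minimal driver for the
2 by 3 by 4 example discussed above might look as follows. The table entries
here are invented; the argument values (NTAB = 24, NMAR = 26, NU = 12,
MAXDEV = 0.25, MAXIT = 15) follow the description and comments above.

      PROGRAM DEMO
C
C     FIT ALL THREE TWO-WAY MARGINAL TOTALS OF A 2X3X4 TABLE.
C     THE TABLE ENTRIES ARE ILLUSTRATIVE ONLY.
C
      INTEGER DIM(3), CONFIG(3,3), LOCMAR(3), NLAST, IFAULT
      REAL TABLE(24), FIT(24), MARG(26), U(12), DEV(15)
      DATA DIM /2, 3, 4/
      DATA CONFIG /1,2,0, 1,3,0, 2,3,0/
      DATA TABLE /24*2.0/
      DATA FIT /24*1.0/
      CALL LOGLIN(3, DIM, 3, CONFIG, 24, TABLE, FIT, LOCMAR, 26, MARG,
     *  12, U, 0.25, 15, DEV, NLAST, IFAULT)
      WRITE(6,10) IFAULT, NLAST
   10 FORMAT(' IFAULT =', I2, '   LAST ITERATION =', I3)
      END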
Restrictions
The number of variables NVAR must not exceed 7, unless alterations are made as
specified in the algorithms.
Acknowledgement
This algorithm was written as part of research sponsored in part by National
Science Foundation research grant No. NSF GS 2S18.
References
Deming, W. E. and Stephan, F. F. (1940). On a least squares adjustment of a sampled frequency
table when the expected marginal totals are known. Ann. Math. Statist., 11, 427-444.
Fienberg, S. E. (1970). An iterative procedure for estimation in contingency tables. Ann. Math.
Statist., 41, 907-917.
Goodman, L. A. (1970). The multivariate analysis of qualitative data: interactions among multiple
classifications. J. Amer. Statist. Ass., 65, 226-256.
      SUBROUTINE LOGLIN (NVAR, DIM, NCON, CONFIG, NTAB, TABLE, FIT,
     *  LOCMAR, NMAR, MARG, NU, U, MAXDEV, MAXIT, DEV, NLAST, IFAULT)
C
C     ALGORITHM AS 51  APPL. STATIST. (1972) VOL.21, NO.2
C
C     PERFORMS AN ITERATIVE PROPORTIONAL FIT OF THE MARGINAL TOTALS
C     OF A CONTINGENCY TABLE
C
C     THIS VERSION PERMITS UP TO SEVEN VARIABLES.  IF THIS LIMIT IS
C     TOO SMALL, CHANGE THE VALUE OF MAXVAR, AND OF THE DIMENSIONS
C     IN THE DECLARATIONS OF CHECK AND ICON - SEE ALSO THE CHANGES
C     NEEDED IN AS 51.1 AND AS 51.2
C
      REAL MARG(NMAR), TABLE(NTAB), FIT(NTAB), U(NU), DEV(MAXIT),
     *  MAXDEV
      INTEGER DIM(NVAR), CONFIG(NVAR,NCON), LOCMAR(NCON), CHECK(7),
     *  ICON(7), SIZE, POINT
      DATA MAXVAR /7/
      IFAULT=0
-------
C-6
C
C     CHECK VALIDITY OF NVAR, THE NUMBER OF VARIABLES,
C     AND OF MAXIT, THE MAXIMUM NUMBER OF ITERATIONS
C
      IF(NVAR.GT.0 .AND. NVAR.LE.MAXVAR .AND. MAXIT.GT.0) GOTO 10
    5 IFAULT=4
      RETURN
C
C     LOOK AT TABLE AND FIT CONSTANTS
C
   10 SIZE=1
      DO 30 J=1,NVAR
      IF(DIM(J).LE.0) GOTO 5
   20 SIZE=SIZE*DIM(J)
   30 CONTINUE
      IF(SIZE.LE.NTAB) GOTO 40
   35 IFAULT=2
      RETURN
   40 X=0.0
      Y=0.0
      DO 60 I=1,SIZE
      IF(TABLE(I).LT.0.0 .OR. FIT(I).LT.0.0) GOTO 5
      X=X+TABLE(I)
      Y=Y+FIT(I)
   60 CONTINUE
C
C     MAKE A PRELIMINARY ADJUSTMENT TO OBTAIN THE FIT TO AN
C     EMPTY CONFIGURATION LIST
C
      IF(Y.EQ.0.0) GOTO 5
      X=X/Y
      DO 80 I=1,SIZE
   80 FIT(I)=X*FIT(I)
C
C     ALLOCATE MARGINAL TABLES
C
      POINT=1
      DO 150 I=1,NCON
C
C     A ZERO BEGINNING A CONFIGURATION INDICATES THAT THE LIST IS
C     COMPLETED
C
      IF(CONFIG(1,I).EQ.0) GOTO 160
C
C     GET MARGINAL TABLE SIZE.  WHILE DOING THIS TASK, SEE IF THE
C     CONFIGURATION LIST CONTAINS DUPLICATIONS OR ELEMENTS OUT OF
C     RANGE
C
      SIZE=1
      DO 90 J=1,NVAR
   90 CHECK(J)=0
      DO 120 J=1,NVAR
      K=CONFIG(J,I)
      IF(K.EQ.0) GOTO 130
      IF(K.LT.0 .OR. K.GT.NVAR) GOTO 100
      IF(CHECK(K).NE.0) GOTO 100
      CHECK(K)=1
-------
C-7
C
C     GET SIZE
C
      SIZE=SIZE*DIM(K)
  120 CONTINUE
C
C     SINCE U IS USED TO STORE FITTED MARGINALS, SIZE MUST NOT
C     EXCEED NU
C
  130 IF(SIZE.GT.NU) GOTO 35
C
C     LOCMAR POINTS TO MARGINAL TABLES TO BE PLACED IN MARG
C
  140 LOCMAR(I)=POINT
      POINT=POINT+SIZE
  150 CONTINUE
C
C     GET NUMBER OF VALID CONFIGURATIONS
C
      I=NCON+1
  160 N=I-1
C
C     SEE IF MARG CAN HOLD ALL MARGINAL TABLES
C
      IF(POINT.GT.NMAR+1) GOTO 35
C
C     OBTAIN MARGINAL TABLES
C
  170 DO 190 I=1,N
      DO 180 J=1,NVAR
  180 ICON(J)=CONFIG(J,I)
      CALL COLLAP(NVAR, TABLE, MARG, 1, LOCMAR(I), NTAB, NMAR, DIM,
     *  ICON, .TRUE.)
  190 CONTINUE
C
C     PERFORM ITERATIONS
C
      DO 220 K=1,MAXIT
C
C     XMAX IS THE MAXIMUM DEVIATION OBSERVED BETWEEN A FITTED AND A
C     TRUE MARGINAL DURING A CYCLE
C
      XMAX=0.0
      DO 210 I=1,N
      DO 200 J=1,NVAR
  200 ICON(J)=CONFIG(J,I)
      CALL COLLAP(NVAR, FIT, U, 1, 1, NTAB, NU, DIM, ICON, .TRUE.)
      CALL ADJUST(NVAR, FIT, U, MARG, 1, 1, LOCMAR(I), NTAB, NU, NMAR,
     *  DIM, ICON, XMAX)
  210 CONTINUE
C
C     TEST CONVERGENCE
C
      DEV(K)=XMAX
      IF(XMAX.LT.MAXDEV) GOTO 240
  220 CONTINUE
      IF(MAXIT.GT.1) GOTO 230
      NLAST=1
      RETURN
C
C     ERROR IN CONFIG
C
  100 IFAULT=1
      RETURN
C
C     NO CONVERGENCE
C
  230 IFAULT=3
      NLAST=MAXIT
      RETURN
C
C     NORMAL TERMINATION
C
  240 NLAST=K
      RETURN
      END
-------
C-8
      SUBROUTINE COLLAP (NVAR, X, Y, LOCX, LOCY, NX, NY, DIM, CONFIG,
     *  OPTION)
C
C     ALGORITHM AS 51.1  APPL. STATIST. (1972) VOL.21, NO.2
C
C     IF OPTION IS .TRUE., COMPUTES A MARGINAL TABLE FROM A COMPLETE
C     TABLE.  IF OPTION IS .FALSE., REMOVES AN EFFECT FROM A TABLE.
C     ALL PARAMETERS ARE ASSUMED VALID - THERE IS NO TEST FOR ERRORS
C     IN INPUT
C
C     IF THE VALUE OF NVAR IS TO BE GREATER THAN 7, THE DIMENSIONS IN
C     THE DECLARATIONS OF SIZE AND COORD MUST BE INCREASED TO THE
C     VALUES OF NVAR+1 AND NVAR, RESPECTIVELY
C
      INTEGER SIZE(8), DIM(NVAR), CONFIG(NVAR), COORD(7)
      LOGICAL OPTION
C
C     THE LARGER TABLE IS X AND THE SMALLER ONE IS Y
C
      REAL X(NX), Y(NY)
C
C     INITIALISE ARRAYS
C
      SIZE(1)=1
      DO 10 K=1,NVAR
      L=CONFIG(K)
      IF(L.EQ.0) GOTO 20
   10 SIZE(K+1)=SIZE(K)*DIM(L)
C
C     FIND NUMBER OF VARIABLES IN CONFIGURATION
C
      K=NVAR+1
   20 N=K-1
C
C     INITIALISE TABLE Y UNLESS AN EFFECT IS TO BE REMOVED
C
      IF(.NOT.OPTION) GOTO 40
      LOCU=LOCY+SIZE(K)-1
      DO 30 J=LOCY,LOCU
   30 Y(J)=0.0
C
C     INITIALISE COORDINATES
C
   40 DO 50 K=1,NVAR
   50 COORD(K)=0
      I=LOCX
C
C     FIND LOCATIONS IN TABLES
C
   60 J=LOCY
      DO 70 K=1,N
      L=CONFIG(K)
   70 J=J+COORD(L)*SIZE(K)
      IF(OPTION) Y(J)=Y(J)+X(I)
      IF(.NOT.OPTION) X(I)=X(I)-Y(J)
C
C     UPDATE COORDINATES
C
      I=I+1
      DO 80 K=1,NVAR
      COORD(K)=COORD(K)+1
      IF(COORD(K).LT.DIM(K)) GOTO 60
   80 COORD(K)=0
      RETURN
      END
-------
C-9
      SUBROUTINE ADJUST (NVAR, X, Y, Z, LOCX, LOCY, LOCZ, NX, NY, NZ,
     *  DIM, CONFIG, D)
C
C     ALGORITHM AS 51.2  APPL. STATIST. (1972) VOL.21, NO.2
C
C     SCALES THE TABLE X SO THAT ITS FITTED MARGINAL TOTALS Y AGREE
C     WITH THE OBSERVED MARGINAL TOTALS Z, AND RECORDS IN D THE
C     LARGEST DEVIATION ENCOUNTERED BETWEEN Y AND Z
C
      INTEGER SIZE(8), DIM(NVAR), CONFIG(NVAR), COORD(7)
      REAL X(NX), Y(NY), Z(NZ)
C
C     INITIALISE ARRAYS
C
      SIZE(1)=1
      DO 10 K=1,NVAR
      L=CONFIG(K)
      IF(L.EQ.0) GOTO 20
      SIZE(K+1)=SIZE(K)*DIM(L)
   10 CONTINUE
C
C     FIND NUMBER OF VARIABLES IN CONFIGURATION
C
      K=NVAR+1
   20 N=K-1
C
C     TEST SIZE OF DEVIATION
C
      L=SIZE(K)
      J=LOCY
      K=LOCZ
      DO 30 I=1,L
      E=ABS(Z(K)-Y(J))
      IF(E.GT.D) D=E
      J=J+1
   30 K=K+1
C
C     INITIALISE COORDINATES
C
      DO 40 K=1,NVAR
   40 COORD(K)=0
      I=LOCX
C
C     PERFORM ADJUSTMENT
C
   50 J=LOCY
      DO 60 K=1,N
      L=CONFIG(K)
   60 J=J+COORD(L)*SIZE(K)
      K=J-LOCY+LOCZ
C
C     NOTE THAT Y(J) SHOULD BE NON-NEGATIVE
C
      IF(Y(J).LE.0.0) X(I)=0.0
      IF(Y(J).GT.0.0) X(I)=X(I)*Z(K)/Y(J)
C
C     UPDATE COORDINATES
C
      I=I+1
      DO 70 K=1,NVAR
      COORD(K)=COORD(K)+1
      IF(COORD(K).LT.DIM(K)) GOTO 50
   70 COORD(K)=0
      RETURN
      END
-------
R-l
REFERENCES
Aerospace Corporation, Mobile Systems Group, 1975. "Short Test Correlation
Analyses on 300 1975 Model Year Cars, Volumes I and II." Los Angeles,
California. (Prepared for Contract No. 68-01-0417, U.S. Environmental
Protection Agency, Office of Air and Waste Management, Ann Arbor, Michigan,
Publ. Nos. EPA-460/3-76-010a and EPA-460/3-76-010b.)
Afifi, A. A., and S. P. Azen, 1974. Statistical Analysis: A Computer
Oriented Approach. Academic Press, Inc.
Anderson, T. W., 1958. An Introduction to Multivariate Statistical Analysis.
John Wiley & Sons, Inc.
"Biomedical Computer Programs (BMDP}," 1975. University of California Press.
Bishop, Y. M., S. E. Fienberg, and P. W. Holland, 1975. Discrete Multivariate
Analysis: Theory and Practice. MIT Press.
Fox, D. J., and K. E. Guire, 1976. "MIDAS." Statistical Research Labora-
tory, University of Michigan (3rd edition).
Friedman, J. H., 1977. "A Recursive Partitioning Decision Rule for Non-
parametric Classification." IEEE Trans. Computers, Vol. C-26, pp. 404-408.
Gnanadesikan, R., 1977. Methods for Statistical Data Analysis of Multivariate
Observations. John Wiley & Sons.
Grizzle, J. E., C. F. Starmer, and G. G. Koch, 1969. "Analysis of
Categorical Data by Linear Models." Biometrics, 25, pp. 489-504.
Haberman, S. J., 1972. "Log-linear Fit for Contingency Tables (Algorithm
AS-51)." Appl. Statist., 21, pp. 218-225.
Hamilton Test Systems, 1977. "Plan of Performance for a Short Test Correlation
and Effectiveness Study to the United States Environmental Protection Agency."
(Prepared for Contract No. 68-03-2513, U.S. Environmental Protection Agency.)
Lachenbruch, P. A. and Mickey, M. R., 1968. "Estimation of Error Rates in
Discriminant Analysis." Technometrics, 10, pp. 1-11.
McLachlan, G. J., 1975. "Confidence Intervals for the Conditional Pro-
bability of Misclassification in Discriminant Analysis." Biometrics, 31,
pp. 161-167.
Okamoto, M., 1963. "An Asymptotic Expansion for the Distribution of the
Linear Discriminant Function." Ann. Math. Statist., 34, pp. 1286-1293.
Sonquist, J. A., E. L. Baker, and J. N. Morgan, 1971. "Searching for
Structure." Institute for Social Research, University of Michigan, Ann
Arbor, Michigan.
-------
R-2
REFERENCES (Cont'd)
"Statistical Package for the Social Sciences (SPSS)," 1977. McGraw-Hill
(2nd edition).
------- |