Evaluation of Statistical Procedures
for Grading Fuel Efficient Engine Oils
Report 3520-2/BUF-40
-------
FALCON RESEARCH
Falcon Research & Development Co.
A Subsidiary of Whittaker Corporation
One American Drive
Buffalo, New York 14225
716/632-4932
Evaluation of Statistical Procedures
for Grading Fuel Efficient Engine Oils
Report 3520-2/BUF-40
Prepared Under
Contract 68-03-2835
Task Order #2
Prepared for
Environmental Protection Agency
Ann Arbor, MI 48105
Prepared by:
S. Kaufman
H. T. McAdams
N. Morse
Date: November 1980
-------
TABLE OF CONTENTS
Section Title Page
1.0 INTRODUCTION 1
2.0 SUMMARY AND CONCLUSIONS 2
3.0 FUEL EFFICIENCY RATING ERROR ANALYSIS 5
3.1 Non-Carryover Procedure 5
3.2 Carryover Procedure 8
3.3 Comparative Evaluation 9
4.0 MODEL-TO-MODEL VARIATION IN FUEL EFFICIENCY OF OILS 12
4.1 Grading Policies Under Model Differences 12
4.2 Statistical Test for Model Differences 14
4.3 Rationale for Selection of Test Models 14
5.0 REQUIRED NUMBER OF TEST VEHICLES 16
5.1 Grading Accuracy Requirements 16
5.2 Statistical Test of Model Homogeneity 19
6.0 STATISTICAL STRATEGIES IN FUEL ECONOMY MEASUREMENT 22
6.1 Non-Carryover Procedure 22
6.1.1 Error Reduction Through Replication 22
6.1.2 Robust Estimation 23
6.2 Carryover Procedure 25
6.2.1 Dominance of Extended Mileage
Accumulation Variability 25
6.2.2 Proposed Alternative Experimental
Design 26
6.3 Concluding Remarks 27
7.0 SCALING ASSUMPTIONS IMPLICIT IN SPECIFIC FUEL
EFFICIENCY RATING MEASURES 29
-------
TABLE OF CONTENTS (Continued)
Section Title Page
Appendix A—ENGINE OIL FUEL EFFICIENCY GRADING UNDER MEASUREMENT A-1
ERROR AND BETWEEN-MODEL DIFFERENCES
Appendix B—ACCOMMODATING SPURIOUS OBSERVATIONS IN FUEL B-1
ECONOMY MEASUREMENTS
Appendix C—"BEST TWO OUT OF THREE" PROCEDURE C-1
Appendix D—BEHAVIOR OF EPA REPEATABILITY TEST D-1
-------
1.0 INTRODUCTION
In March 1980, the Environmental Protection Agency (EPA) distributed
for industry comment a draft "EPA Recommended Practice for Evaluating,
Grading, and Labeling the Fuel Efficiency of Motor Vehicle Engine Oils."
Following receipt of responses, EPA requested Falcon Research and
Development Company to "Analyze comments received...with regard to the
recommended statistical procedure." This analysis was completed and
documented in June 1980.*
As a follow-on effort, Falcon was requested to provide an independent
assessment of certain aspects of the proposed statistical procedure,
extending beyond evaluation of just those issues which evolved from the
industry review. Particular areas identified for consideration were:
(1) comparative accuracy of carryover and noncarryover effect procedures,
(2) means for encouraging (rewarding) accurate testing, (3) dealing with
outliers in fuel economy test data, and (4) the impact of variability in
oil effects across car models on the meaningful grading of oils. An
investigation of these areas, and cursory examination of some additional
topics, form the subject of this report.
* H. T. McAdams, S. Kaufman, and N. Morse, "Analysis of Industry
Comments on EPA Recommended Practice for Fuel Efficient Oils,"
Falcon Research and Development Company, Report 3520-2/BUF-36,
June 13, 1980. For convenience this report will henceforth be
referred to as Falcon's "Analysis of Industry Comments."
-------
2.0 SUMMARY AND CONCLUSIONS
Section 3 recapitulates prior ASTM and Falcon error analyses of
fuel efficiency rating estimation by non-carryover (NCO) and carryover
(CO) procedures. The possibility of a carryover effect bias error
occurring in the NCO procedure is explicitly considered. Experimental
evidence regarding the effectiveness of special flush procedures to
minimize such bias in carryover effect oils is not currently available,
so that relative evaluation of NCO and CO procedures is not yet possible.
However, the potential feasibility of the former, assuming demonstration
of a special flush capability, is reaffirmed, whereas the CO procedure,
in light of the best available component error estimates, appears doomed
to require an exorbitantly large number of vehicle replications to achieve
acceptable probabilities of correct grading.
Section 4 introduces the additional problem of potential model-to-model
variability in fuel-efficient oil effects. It is observed that if such
variations are significant in an oil, i.e., comparable to the grade
spacings or greater, then there seems to be no completely satisfactory
rule for determining a grade designation for that oil. In any event,
in the absence of sufficient experimental evidence that such variabilities
are insignificant, it is necessary to determine by statistical test for
each candidate oil whether it could be characterized by a single fuel
efficiency rating over all models or alternatively whether it is necessary
to separately estimate individual model-specific ratings.
The rationale for the five particular models selected for testing is
questioned. The point is then made that if it were possible to stratify
all vehicle models into a relatively small number of homogeneous classes
with respect to oil fuel efficiency effect (on the basis of fundamental
physico-chemical properties), then test model selection could be made
with less arbitrariness.
-------
In Section 5 the full implications of the issues discussed in the
previous two sections are carried through with respect to requirements on
the numbers of vehicles for testing a given candidate oil. It is shown
that the statistical test for homogeneity across models imposes the severest
requirements. On bracketing with two reasonable alternative levels of
performance of this test, the determination is made that for the NCO
procedure from 65 to 250 vehicles are required, whereas for the CO
procedure the comparable requirements range from 1160 to 4650. Even if
the problem of model-to-model variability were completely resolved,
in order to achieve a probability of mtsgrading no higher than
10% requires 60 vehicles under the NCO procedure and over 1000 vehicles
under the CO procedure. If the 10% standard is relaxed to 20%, the
requirements reduce to 20 and 300 vehicles, respectively. It must be
strongly emphasized that these numerical results rest on the error
component estimates stated in the ASTM response. In particular, the
dominant error term which very markedly penalizes the CO procedure is the
2.34% estimate for the standard deviation in car mpg variations about a
linear trend with extended mileage accumulation. Downward revision of
these errors could significantly alter our conclusion. With this caveat
in mind it is concluded that the CO procedure is impractical and that
even the NCO procedure imposes a substantial test load for adequately
sharp results. The basic problem is seen to be the smallness of the
effects to be discerned relative to uncertainties inherent in vehicle-
oil performance and the measurement process.
Section 6 is primarily concerned with statistical strategies of
measurement replication for purposes of error reduction and protection
against possible spurious (outlier) readings. The performance of the
multistage procedure specified in the draft EPA Recommended Practice was
analyzed. Alternative methods of robust estimation published in the
literature and developed for special application to the problem at hand
-------
were reviewed and recommendations made which will improve performance.
With respect to the CO procedure, specifically, it is shown that
substantial accuracy improvement can be achieved with no increase in
numbers of FTP tests if the individual measurements used to establish
the reference oil trend line are uniformly spread out over the
2000-mile interval instead of being clumped together as currently
specified.
Section 7 presents a qualitative discussion of various formulations
of fuel efficiency rating measures and the scaling assumptions implicit
in them. For example, the form adopted in the draft procedure ratios the
increase in mpg due to an oil to the base mpg. This suggests the belief
that the same oil should provide twice the mpg increase in a car with
twice the fuel economy (if ratings are to be invariant over models). The
relationship of scaling assumptions to the basic mechanisms of friction
reduction is discussed briefly.
-------
3.0 FUEL EFFICIENCY RATING ERROR ANALYSIS
3.1 Non-Carryover Procedure
The most direct method of assessing the fuel efficiency performance
of a candidate engine oil on a test vehicle would be to "age" the oil in
that vehicle over a given mileage accumulation interval (specified at
2000 miles by EPA), measure fuel economy (Fc), flush the oil, replace
with the high reference oil, measure fuel economy again (FR), and
ratio the two results. Thus, fuel efficiency rating of the candidate
oil is expressed as
    ε = F_C / F_R

with ε > 1 indicating some fuel efficiency effect (relative to the
high reference oil) and ε ≤ 1 indicating no effect or a relatively
adverse effect of the oil. The validity of this method critically rests
on the assumption that, following normal flushing of the candidate oil,
there is no carryover of a residual efficiency effect due to the former
oil. Hence, this procedure is referred to as the "non-carryover procedure."
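As a concrete illustration of this ratio (the mpg values below are invented for illustration and are not EPA or ASTM test data), the rating computation can be sketched as:

```python
# Hypothetical sketch of the non-carryover rating; the mpg figures are
# invented for illustration, not actual test data.
def fuel_efficiency_rating(f_candidate_mpg, f_reference_mpg):
    """Rating e = F_C / F_R; values above 1 indicate a fuel efficiency
    effect relative to the high reference oil."""
    return f_candidate_mpg / f_reference_mpg

e = fuel_efficiency_rating(20.5, 20.0)  # candidate oil vs. reference oil
print(round(e, 3))  # 1.025, i.e., a 2.5% fuel economy improvement
```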
As previously discussed in Falcon's "Analysis of Industry Comments,"
the variance of the error in ε from non-carryover (NCO) measurements
on a single test vehicle is resolvable into three components, given
approximately by:*

    σ_ε²|NCO ≈ σ_e² + σ_oc² + σ_om²

σ_e² is the measurement error variance of a single fuel economy test
result. Replicate tests are assumed carried out for each oil as specified
* Strictly speaking, batch-to-batch variability in oil characteristics,
both candidate and reference, also contributes to total error.
Industry responses, however, suggest this effect can be neglected.
-------
in the EPA procedure or as modified for improved robustness against
occasional spurious test results (see Section 6). In either case the
variance per oil estimate is reduced to approximately one-half of σ_e²,
but combining the candidate and reference oil estimates doubles the
total result back to σ_e². ASTM asserts in its comments to EPA that
σ_e ≈ 0.75% for volumetric or gravimetric techniques, while (by implication)
σ_e ≈ 1.9% for the carbon balance method of fuel economy measurement.
The second component σ_oc² refers to the variance in true fuel
efficiency rating ε among sampled cars in the population of each test
model (specified nameplate, model year, engine, etc.). Apparently, there
are no direct experimental assessments of this variability, although the
ASTM comments include an estimated upper bound of 1% for σ_oc.
The final component σ_om² refers to the variance in mean model
fuel efficiency rating over the population of vehicle models. The last
phrase needs to be made more definite. Ideally, this refers to all in-use
models, weighted by their respective proportions in the fleet. In the
interest of practicality, EPA has selected five specific models, equally
weighted, to serve as proxies for the entire fleet. It is hence only
feasible to measure variability over the five selected models, and
σ_om² will then refer to this measure. It should be noted that this
restriction can result in either underestimation or overestimation of
the fleetwide σ_om², depending on the models selected. There exist no
comprehensive data on the magnitude of σ_om. Union Oil Company included
test results on a single oil in its comments which suggest that σ_om
might be well below 0.25%. However, other respondents, notably Witco
Chemical, suggest that large model-to-model differences do occur. The
whole question of model-to-model variability, its impact on the significance
of oil grading, and how one might deal with it are considered in more
detail in Section 4.
-------
If K cars of each of the five models are tested and the mean
taken over the 5K individual fuel efficiency ratings, then the
resulting error variance is reduced to:

    σ_ε̄²|NCO = (σ_e² + σ_oc²) / (5K) + σ_om²
Note that, in contrast to the first two error components, model variability
error is not further reduced by replication over additional copies of the 5
designated models, i.e., increasing K. This is to be expected inasmuch
as the same models continue to be tested. On the other hand, if σ_om
is large, then, as will be discussed in the next section, one could
question the meaningfulness of any single mean estimate of fuel
efficiency for the candidate oil.
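The behavior of this expression can be sketched numerically; σ_e and σ_oc below follow the ASTM estimates quoted in this section, while the σ_om value is invented purely for illustration:

```python
import math

# Mean-rating error over 5K vehicles (K cars of each of 5 test models):
#   sigma_mean^2 = (sigma_e^2 + sigma_oc^2) / (5K) + sigma_om^2
# sigma_e and sigma_oc are the ASTM estimates quoted in the text (in %);
# sigma_om = 0.5% is an invented illustrative value.
def sigma_mean_pct(K, sigma_e=0.75, sigma_oc=1.0, sigma_om=0.5):
    return math.sqrt((sigma_e**2 + sigma_oc**2) / (5 * K) + sigma_om**2)

for K in (1, 4, 16, 64):
    print(K, round(sigma_mean_pct(K), 3))
# The first two components shrink like 1/(5K), but sigma_om sets a floor
# that no amount of replication over the same five models can remove.
```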
In addition to the above described random errors, if a given candidate
oil does have a carryover effect and is tested by the "back-to-back"
non-carryover procedure, it will induce a systematic error: a shift in
measured fuel efficiency rating by amount Δε = -ε · Δτ, where
0 ≤ Δτ ≤ 1. The limits indicate the possible range of carryover fraction:
from full carryover, Δτ = 1 (under which condition the car temporarily
retains its candidate oil fuel economy level immediately after changeover
to the reference oil), to zero carryover, Δτ = 0. Fuel efficient oils
are generally dichotomized on the basis of whether or not they exhibit a
carryover effect. However, it is more likely that there is a continuum of
carryover fractions.
For those oils with appreciable carryover fraction, the non-carryover
procedure is clearly inadequate since no amount of replication will
reduce the systematic error Ae. A special flush procedure has been
developed which is claimed to eliminate all or most of the carryover
effect. If experimental investigations corroborate this claim, then, of
course, all oils could be tested under the non-carryover procedure.
-------
Concern has been expressed, however, that the neutralizing action of
the special flush may be hard to control, leading to undercompensation
in some cases and overcompensation in others—that is, the final level of
fuel economy reached after replacement with reference oil could be higher
or lower than would have been achieved had the candidate oil never been
added.
3.2 Carryover Procedure
Under the supposition that there is no reliable way to eliminate
carryover effects in some oils, and invoking the principle that the
basic testing procedure should be the same for all oils, EPA has
recommended adoption of a "carryover procedure." The basic concept is to
measure reference oil fuel economy at several mileage points before aging of
the candidate oil and linearly extrapolate to the candidate oil test
mileage to provide a predicted reference oil fuel economy for comparison
with candidate oil fuel economy. An underlying assumption of this
procedure is that automobile fuel economy generally varies with accumu-
lated mileage and that over relatively short intervals, say < 5000 miles,
with cars that are sufficiently broken in (starting mileages > 10,000)
this variation is sufficiently close to linear. Thus, as before, fuel
efficiency rating is defined as a ratio:
    ε = F_C / F_R*

where F_R* is now predicted instead of directly measured and consequently
has additional error contributions. The variance of the error in this
case (see Falcon's "Analysis of Industry Comments") is approximately:

    σ_ε²|CO ≈ (σ_e²/2)(1 + p) + σ_oc² + p σ_m² + σ_om²,   p = 1/n + (X - x̄)² / Σ(x_i - x̄)²
-------
where: σ_m² = variance of unpredictable (small-scale) deviations in
fuel economy from a linear trend as a function of mileage
(under reference oil operation)
n = number of mileage points tested with reference oil
x_i = individual reference oil mileage points
x̄ = (1/n) Σ x_i
X = candidate-oil mileage test point (at which the F_R*
prediction is made).
If we restrict consideration to the basic mileage interval configuration
in the EPA procedure of 1000-1000-2000, corresponding to adoption of a
trend line with three mileage points, then the prediction propagation
factor takes on the numerical value of 4.83 and we can then write:

    σ_ε²|CO = 2.92 σ_e² + σ_oc² + 4.83 σ_m² + σ_om²
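The 4.83 value can be reproduced from the standard linear-regression prediction form, 1/n + (X - x̄)²/Σ(x_i - x̄)², under our reading of the 1000-1000-2000 configuration, namely reference-oil points at 0, 1000, and 2000 miles with the candidate test at 4000 miles (the exact mileage placement is an assumption):

```python
# Prediction propagation factor for extrapolating the reference-oil trend
# line to the candidate-oil test mileage:
#   factor = 1/n + (X - xbar)^2 / sum((x_i - xbar)^2)
# The mileage placement below (reference points at 0, 1000, 2000 miles;
# candidate test at 4000) is our reading of the 1000-1000-2000 intervals.
def propagation_factor(x_ref, x_cand):
    n = len(x_ref)
    xbar = sum(x_ref) / n
    ss = sum((x - xbar) ** 2 for x in x_ref)
    return 1 / n + (x_cand - xbar) ** 2 / ss

print(round(propagation_factor([0, 1000, 2000], 4000), 2))  # 4.83
```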
3.3 Comparative Evaluation
Applying the previously noted estimates for σ_e and σ_oc, together with
an estimate of 2.34% for σ_m provided by ASTM, we find:

    σ_ε²|NCO = (1.25%)² + σ_om²  ; metered mpg
    σ_ε²|NCO = (2.15%)² + σ_om²  ; carbon balance mpg

    NCO bias = -ε̄ · Δτ

    σ_ε²|CO = (5.39%)² + σ_om²  ; metered mpg
    σ_ε²|CO = (6.16%)² + σ_om²  ; carbon balance mpg
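Setting aside the open σ_om term, these standard deviations follow directly from the component estimates; a minimal check:

```python
import math

# Reproduce the single-vehicle error standard deviations (in %) from the
# ASTM component estimates quoted in the text, excluding the sigma_om term.
SIGMA_OC = 1.0   # car-to-car variability
SIGMA_M = 2.34   # deviation about the mileage trend line

def sigma_nco(sigma_e):
    return math.sqrt(sigma_e**2 + SIGMA_OC**2)

def sigma_co(sigma_e):
    return math.sqrt(2.92 * sigma_e**2 + SIGMA_OC**2 + 4.83 * SIGMA_M**2)

print(round(sigma_nco(0.75), 2))  # 1.25 (metered)
print(round(sigma_nco(1.9), 2))   # 2.15 (carbon balance)
print(round(sigma_co(0.75), 2))   # 5.39 (metered)
print(round(sigma_co(1.9), 2))    # 6.16 (carbon balance)
```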
Then, the corresponding errors in the mean estimates using 5K test
vehicles are:
-------
    σ_ε̄²|NCO = (0.56%)²/K + σ_om²  ; metered mpg
    σ_ε̄²|NCO = (0.96%)²/K + σ_om²  ; carbon balance mpg

    NCO bias = -ε̄ · Δτ

    σ_ε̄²|CO = (2.41%)²/K + σ_om²  ; metered mpg
    σ_ε̄²|CO = (2.75%)²/K + σ_om²  ; carbon balance mpg
The σ_om term is kept separate since no estimate for it has been provided.
However, one should keep in mind that this contribution should be small
to justify a single grade determination for the candidate oil. Bearing
in mind the postulated grade widths of the order of 2-3%, it is seen
that the CO procedure, which was introduced to overcome the potential
systematic error of -ε̄ · Δτ in carryover oils, achieved this objective
at the expense of a considerable increase in random error standard deviation.
For example, it exhibits a magnitude comparable to grade width when K=l.
To minimize classification errors in grading would require a fairly large
K which, in turn, implies considerable testing expense and utilization of
limited experimental resources.
The NCO procedure, on the other hand, may be able to yield acceptable
grading accuracy for K's in the vicinity of 2 to 6, assuming, of course,
that the carryover effect is negligible or effectively eliminated by a
special flush technique.
10
-------
In the final analysis, it is essential to obtain reliable
determinations of the various random error contributions and of the
performance of special flushing techniques before reaching conclusions
as to relative effectiveness or absolute feasibility of the CO and NCO
procedures. The numerical estimates used in the foregoing analyses are
admittedly tentative and are intended only to provide ballpark figures and
suggest the likely direction of more accurately based conclusions.
11
-------
4.0 MODEL-TO-MODEL VARIATIONS IN FUEL EFFICIENCY OF OILS
4.1 Grading Policies Under Model Differences
If a given candidate oil is accurately determined to have mean fuel
efficiency ratings of 1.06 in model X cars, 1.02 in model Y cars and 0.97
in model Z cars, what grade should be assigned to that oil? Some industry
comments suggest that this kind of a situation could very well occur,
so it is far from a hypothetical question.
One policy approach might be to grade on the basis of a grand mean
over all cars/models in the in-use fleet. The grade designation would
then represent an average performance and the public would be educated
to interpret it that way, allowing for the possibility that individual
car performance could be considerably better or worse. An objection to
this approach is that performance results on specific identifiable
subsets of vehicles would be withheld from the public. If the oil in
the above illustration did in fact receive an "average" grade of B,
wouldn't the owners of X- and Z-model cars be misled by the label? A
second objection is that averaging over the five selected test models
may not necessarily provide a good estimate of the true fleet mean since
the model sample is small and the criteria by which these test models
were selected may have resulted in substantial bias. Considerable
differences observed among the five tested models suggest comparable
variability among the untested models with resulting high degree of
uncertainty in the mean estimate.
An alternative policy might be to assign a grade based on worst model
performance (or on the low end of the total range of uncertainty in
calculated fuel efficiency ratings—including other error contributions).
This in effect is what the EPA recommended procedure would accomplish
with its LSD95 term. However, similar objections could be raised with
respect to the withholding of useful information from the public and the
possible unrepresentativeness of the five selected test models to the
12
-------
total in-use fleet.
A further comment on the potential impact of the LSD95 term in the
draft EPA Recommended Practice is perhaps in order. As noted above, it
would tend to yield a conservative grade determination that is unlikely to
exceed the poorest model-specific grade. An additional motivation
for its inclusion, apparently, is to encourage the tester to achieve
maximum feasible accuracy by control of experimental errors or by
increased vehicle replication. If, however, the major contribution
to the LSD95 term were to derive from between-model variability, then
even substantial accuracy improvements would not appreciably reduce
LSD95, possibly much to the frustration of the tester.
A more complicated alternative, in those cases where it has been
established that significant differences exist among the five test
models, would be to assign multiple grades or a range of grades corres-
ponding to the observed spread in performance. This approach partially
meets some of the above objections, but the complexity of such labeling
could conceivably cause increased confusion for some people.
Finally, it has been suggested that oils which exhibit significant
differences among test models not be labeled at all. This approach, too,
has its shortcomings. An oil manufacturer could devote considerable
resources to a testing program which in the end yields no tangible
result. Consider the case wherein a candidate oil tests grade B on all
models except one on which it tests grade A. Such an oil would be left
unlabeled despite the intuitive reasonableness of assigning it grade B.
Another example is an oil that tests grade A on all models except one on
which it tests grade B. The intuitive resolution of the problem in
this case, however, is not as clear.
13
-------
4.2 Statistical Test for Model Differences
We conclude that there is no completely satisfactory policy for
grading oils that exhibit significant between-model variability. A
policy will, of course, have to be established based on subjective
evaluation of the various issues discussed above. In any event, one
point is clear—it is important to determine whether or not significant
between-model variability exists, and to estimate its magnitude. A
proposed statistical procedure for accomplishing this is described in
Appendix A. One conclusion stemming from the analysis in Appendix A is
that the need for adequate statistical power to evaluate between-model
variability will impose the severest requirement on number of vehicles to
be tested. This aspect is discussed in more detail in Section 5.
4.3 Rationale for Selection of Test Models
The EPA recommended procedure is deficient in not indicating any
rationale for actual model selection, other than the general phrase, "to
represent a significant cross section of high volume production cars."
This aspect of the procedure needs to be clarified because one should
have a sound basis for generalizing from the tested models to the total
in-use fleet, comprising mainly untested models.
The questions may be raised: why was it decided to test oils
directly in automotive vehicles rather than by standard laboratory
instrumentation, and then how was it determined that tests on five
different types of vehicles would suffice? Apparently, an early
decision was made that the realism and concrete interpretability of
direct improvements in fuel economy observable in actual vehicles were
important attributes that should not be given up in favor of
more precise grading based on well-controlled laboratory measurement
14
-------
of fundamental physico-chemical oil properties. On the other hand,
practicality dictated that only a small sample of the very large number
of operational vehicle types could be tested. One approach is to sample
from the models randomly (with each model perhaps sales-weighted), but
for sample sizes of the order of five, sampling errors can lead to
highly unrepresentative selections. An alternative is to stratify the
total population of models into a small number of distinct classes on
the basis of some design parameter(s) known or believed to be predictive
of fuel efficiency effects. Random sampling within each stratum could
then assure a more representative mix of test models.
15
-------
5.0 REQUIRED NUMBERS OF TEST VEHICLES
The overall objective of the procedures under consideration is
assignment of a grade to each candidate engine oil tested that accurately
characterizes its fuel efficiency effects in light-duty motor vehicles.
An ancillary objective, as discussed in Section 4, is to determine whether
or not there are significant differences in mean fuel efficiency rating
among the five test models and, if so, to estimate model-specific efficiency
ratings. In this section the number of test vehicles required to achieve
these objectives at selected levels of confidence/precision will be
analyzed for both the carryover and non-carryover procedures.
5.1 Grading Accuracy Requirements
We consider, first, achievement of accurate grading under the condition
of model homogeneity, that is, no between-model variability in fuel
efficiency rating. A reasonable negative measure of performance is the
probability of incorrectly grading the oil. Drawing on an analysis
previously reported in Falcon's "Analysis of Industry Comments," we note
two alternative precise formulations of this measure:
P: the probability of misgrading an oil of a given true grade,
assuming the fuel efficiency ratings of all oils of that grade
to be uniformly distributed within the grade limits
Q: the probability of misgrading an oil with a true fuel efficiency
rating at the center of its grade interval (most favorable
case)
We shall restrict attention to only interior grades since the highest and
lowest grades perform better. If an interior grade has width R and a
is the standard deviation in estimated fuel efficiency rating, then it
16
-------
is deduced from the previous analysis that:

    P ≈ 2 √(2/π) · σ_ε̄ / R

    Q = 2 [1 - Φ(R / (2 σ_ε̄))]

where Φ is the standard normal cumulative distribution function.
Fixing R at 2.5% (the average of the 2% and 3% interior widths in the
EPA proposed grade structure), and utilizing the estimates for σ_ε̄|NCO and
σ_ε̄|CO derived in Section 3 (specifically selecting the more favorable
experimental situation of metered mpg determination, assuming no carryover
effect bias in the non-carryover procedure,* and setting σ_om = 0
consistent with our assumption of model homogeneity), we arrive at:

    P_NCO = 0.36 / √K
    P_CO  = 1.54 / √K   (K > 9)
    Q_NCO = 2 [1 - Φ(2.23 √K)]
    Q_CO  = 2 [1 - Φ(0.52 √K)]
where K is the number of balanced model replications, i.e., total
number of vehicles tested = 5K. Plots of P and Q as functions of
K are shown in Figure 1. Imposition of a 10% average misgrading level,
i.e., P = 0.1, yields K = 12 and K ≈ 210 for the NCO and CO procedures,
which implies 60 and over 1000 vehicles to be tested, respectively, per
candidate oil. Note that satisfaction of P = 0.1 leads to vanishingly
small Q. What this means is that misgrading probability is highly
sensitive to the location of the true fuel efficiency rating relative to
grade boundaries, and oils closest to a boundary contribute most of the errors.
* Carryover effect bias cannot be reduced by replication and should
therefore not be considered in setting vehicle number requirements.
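These curves can be sketched directly; the sigmas below are the metered-mpg single-vehicle values from Section 3 (1.25% NCO, 5.39% CO), and the linear P expression is only valid while the mean-estimate sigma is small relative to R:

```python
import math

# Misgrading probabilities vs. number of balanced replications K,
# using R = 2.5% and sigma_om = 0 (model homogeneity assumed).
R = 2.5
SIGMA1 = {"NCO": 1.25, "CO": 5.39}  # single-vehicle sigma, metered mpg, %

def sigma_mean(proc, K):
    # mean-estimate sigma over 5K vehicles
    return SIGMA1[proc] / math.sqrt(5 * K)

def P(proc, K):
    # average misgrading probability, ratings uniform within the grade
    # (linear approximation, valid while sigma_mean << R)
    return 2 * math.sqrt(2 / math.pi) * sigma_mean(proc, K) / R

def Q(proc, K):
    # misgrading probability for a rating centered in its grade
    z = R / (2 * sigma_mean(proc, K))
    phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)

print(round(P("NCO", 12), 2))  # prints 0.1: K = 12 meets the 10% level
```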
17
-------
[Plot not reproduced: misgrading probabilities P and Q versus number of
balanced replications K, for K = 1 to 256.]
FIGURE 1. Misgrading Probabilities as a Function of
Number of 5-Model Replications
18
-------
If the 10% misgrading level criterion is accepted as reasonable
and the various error component estimates given in the ASTM response
are verified, then one would be forced to conclude that the carryover
procedure requires an impractically large number of vehicles to be tested.
Even the non-carryover procedure imposes a substantial requirement
on numbers. Relaxation of the 10% misgrading level to 20% reduces
the vehicle requirement for NCO and CO procedures to 20 and 300,
respectively.
5.2 Statistical Test of Model Homogeneity
Now let us determine what requirements are imposed on vehicle numbers
in order to test for significant model-to-model differences in fuel
efficiency rating. Appendix A develops such a statistical test and
establishes a precision parameter λ = √(K/2) · (R/σ_ε) which determines
the power of the test for a given level of significance. Here R may
again be taken as the (interior) grade width and σ_ε refers to the
standard deviation in estimation of mean model fuel efficiency rating
from a single test vehicle. Expressions for σ_ε are given in Section 3,
but with the σ_om² term deleted since our focus here is on model-specific
ratings.
We propose that our test have the following error properties: If
the true maximum difference among the five model-specific fuel efficiency
ratings exceeds grade width R, then we fail to declare the models as
significantly different in at most 10% of such cases; conversely, if
the true maximum difference among the five ratings is less than
0.75 R, then we fail to declare the models as effectively equivalent
in at most 10% of such cases. It is shown that this performance is
achievable with λ = 10. If the above performance specification is
relaxed by substituting 0.50 R in place of 0.75 R, then the corresponding
λ is reduced to 5. Other error or statistical power properties can
19
-------
be specified, but it is believed the above two examples encompass
reasonable minimum requirements. What are the implications for
number of test vehicles? Again using ASTM error estimates with metered
mpg experimental setup, as provided in Section 3, we deduce:

    K = 2 λ² (σ_ε/R)²  =   50 ; NCO, λ = 10
                          930 ; CO,  λ = 10
                           13 ; NCO, λ = 5
                          232 ; CO,  λ = 5

Hence, for total test vehicles:

    5K =   250 ; NCO, λ = 10
          4650 ; CO,  λ = 10
            65 ; NCO, λ = 5
          1160 ; CO,  λ = 5
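These figures follow from K = 2λ²(σ_ε/R)² with the metered-mpg sigmas of Section 3; a sketch (the fractional K values round to the figures above, apart from a small rounding difference at the margin for the CO, λ = 5 case):

```python
import math

# Replications needed for the model-homogeneity test:
#   lambda = sqrt(K/2) * (R / sigma)  =>  K = 2 * lambda^2 * (sigma/R)^2
# sigma is the single-vehicle rating standard deviation (metered mpg,
# sigma_om term deleted): 1.25% for NCO, 5.39% for CO; R = 2.5%.
R = 2.5
SIGMA = {"NCO": 1.25, "CO": 5.39}

def replications_required(proc, lam):
    return 2 * lam**2 * (SIGMA[proc] / R) ** 2

for proc in ("NCO", "CO"):
    for lam in (10, 5):
        K = replications_required(proc, lam)
        print(proc, lam, round(K, 1), "-> vehicles ~", 5 * math.ceil(K))
```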
Comparison with requirements based on control of misgrading errors
shows the present requirements to dominate. In particular, the CO
procedure is basically impractical if one phase of the statistical data
analysis requires separate determination for each candidate oil as to
whether or not its fuel efficiency effects are substantively different
among the models tested. The NCO procedure could be construed, perhaps,
as marginally feasible depending on tradeoff of testing costs with
incremental value to the oil manufacturer of establishing a favorable
grade label.
What is the basic problem? The analysis makes it clear that measure-
ment/sampling errors must be roughly an order of magnitude smaller than
the oil effect differences to be resolved. The inherent precision of
fuel economy measurement and uniformity among copies of the same vehicle
20
-------
model class are simply not sharp enough, and this lack of precision can
only be overcome by a very large number of replications.
Finally, we must emphasize that the conclusions reached here are
based on error component estimates that are not all firmly established.
Substantial reduction, for example, in the estimate for fuel economy
variability under extended mileage accumulation, σ_m², could put the CO
procedure in a very much more favorable light. Also, if an experimental
study involving carefully selected samples of "fuel efficient" oil
brands and vehicle models were to establish the absence of any sub-
stantive between-model variabilities, then there would be no necessity
for such a statistical test to be performed as part of the procedure for
each candidate oil.
21
-------
6.0 STATISTICAL STRATEGIES IN FUEL ECONOMY MEASUREMENT
6.1 Non-Carryover Procedure
6.1.1 Error Reduction Through Replication
We first consider the non-carryover procedure as described in
Section 3. No mileage accumulation effects or extrapolation errors are
involved. Assume no distinction among test models so that the between-
model variability issue does not arise. If a single fuel economy measure-
ment is made with candidate oil in a sampled vehicle and a single
measurement again after replacement with reference oil, then the error
variance of the ratio (fuel efficiency rating) is approximately

    σ_ε² = 2 σ_e² + σ_oc²

where σ_e and σ_oc are the standard deviations of test replication
error and car-to-car differences in true rating, respectively. ASTM
estimates σ_e = 0.75% and σ_oc = 1%. It then makes sense to achieve
some reduction in total error variance by retesting the same vehicle
which is a relatively inexpensive means of replication. Thus, with
n replicate tests,

    σ_ε² = (2/n) σ_e² + σ_oc²

On the other hand, it is also clear (assuming ASTM's error estimates) that
relatively little will be gained for the effort expended with n beyond
2 or 3, because of the increasing dominance of σ_oc². Further replication
is best done by sampling more vehicles. An additional motivation for test
replication on the same vehicle is to provide protection against spurious
observations, and this purpose, too, can be accomplished to some level of
confidence with n = 2 or 3.
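The diminishing return from same-vehicle replication can be tabulated directly. The sketch below evaluates the error expression above for n = 1 through 4, using the ASTM estimates quoted in the text (σ_e = 0.75%, σ_oc = 1%):

```python
# Rating error standard deviation vs. number of same-vehicle replicates n,
# using the ASTM estimates quoted in the text: sigma_e = 0.75%, sigma_oc = 1%.

def rating_error_sd(n, sigma_e=0.75, sigma_oc=1.0):
    """Approximate std. dev. (in %) of the candidate/reference rating when
    each oil is measured n times back-to-back on the same vehicle."""
    return (2.0 * sigma_e**2 / n + sigma_oc**2) ** 0.5

for n in (1, 2, 3, 4):
    print(n, round(rating_error_sd(n), 3))
```

The drop from n = 1 to n = 2 is substantial; beyond n = 3 the car-to-car component σ_oc² dominates and further same-vehicle replication buys little.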
22
-------
EPA has in fact incorporated test replication at variable levels
of n = 2, 3, or 4 in its recommended practice. Admittedly, it was
done in the context of the carryover procedure involving trend line
extrapolation, but the basic principles of test error reduction and
protection against spurious readings still apply.
6.1.2 Robust Estimation
Being troubled by the apparent inefficiency of the EPA procedure,
which bears some resemblance to a "best two out of three" rule (see our
discussion on this point in Falcon's "Analysis of Industry Comments" and
Appendix C), we decided to look more deeply into the problem of small
sample robust estimation. The resulting analysis is included as Appendix B.
Also, some earlier simulation results on performance of the EPA procedure,
previously alluded to but not explicitly documented, are provided in
Appendix D.
Among various estimators considered in Appendix B, two which were
determined to have particularly good robustness properties in the face
of possible spurious contamination of a single reading are the Veale-
Huntsberger estimator based on three replicate measurements and the K-
estimator which is two-staged and may take either two or three
observations. The Veale-Huntsberger estimator is a weighted mean of
three observations in which the weight corresponding to the observation
most deviant from the unweighted sample mean decreases gradually with
increasing deviation until a criterion deviation (in standard deviation
units) is reached, beyond which the assigned weight drops to zero, i.e.,
the deviant observation is totally rejected. The K-estimator takes the
mean of the first two observations if they are closer than a preset
criterion value (in standard deviation units); otherwise it requires
a third observation to be made and then reverts to a Veale-Huntsberger
estimator.
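The structure of the two estimators can be sketched as follows. The exact Veale-Huntsberger weight function is developed in Appendix B; the linear taper used here is a stand-in chosen only to show the mechanics (down-weighting of the reading most deviant from the sample mean, with outright rejection beyond a criterion c), and the criterion values are illustrative:

```python
def vh_estimate(x, sigma, c=2.4):
    """Weighted mean of three observations: the one most deviant from the
    unweighted mean gets a weight tapering to zero at c*sigma (an
    illustrative taper, not the published Veale-Huntsberger weights)."""
    m = sum(x) / 3.0
    devs = [abs(v - m) for v in x]
    i = devs.index(max(devs))            # index of most deviant reading
    w = max(0.0, 1.0 - devs[i] / (c * sigma))
    others = sum(x[j] for j in range(3) if j != i)
    return (others + w * x[i]) / (2.0 + w)

def k_estimate(draw, sigma, d=2.0, c=2.4):
    """Two-stage K-estimator: accept the mean of two readings if they agree
    within d*sigma; otherwise draw a third reading and revert to
    vh_estimate. `draw` is a callable producing successive measurements."""
    x1, x2 = draw(), draw()
    if abs(x1 - x2) <= d * sigma:
        return (x1 + x2) / 2.0
    return vh_estimate([x1, x2, draw()], sigma, c)

# A reading far beyond the criterion is rejected entirely:
print(vh_estimate([20.0, 20.2, 27.0], sigma=0.3))   # 20.1
```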
23
-------
If the non-carryover procedure can accept the cost of three back-to-
back FTP measurements per fuel economy determination, then the Veale-
Huntsberger estimator is recommended. For a particular criterion value
(c = 2.4042), the root mean square errors of estimation (RMSE) under
conditions of no contamination and worst-case single spurious reading are
0.589 σ and 1.091 σ, respectively, where σ is the single-
measurement standard deviation. This compares with the best possible
RMSE of σ/√3 = 0.577 σ, achieved by the sample mean only under
no contamination. The penalty paid under no contamination for using
this estimator rather than the sample mean is a (0.589 - 0.577)/0.577 = 2%
increase in standard deviation. On the other hand, the gain in performance
under contamination is dramatic since the RMSE of the sample mean increases
without bound as a function of the magnitude of the contamination.
If, on the other hand, measurement costs make it prudent to minimize
the number of replications, while at the same time maintaining some
robustness property, then the K-estimator is recommended as having
superior performance to the method proposed in the draft EPA Recommended
Practice. Although further exploration of the K-estimator as a function
of criterion values is needed for detailed optimization, an indication
of its capability is given by simulation results for selected criteria,
shown in Appendix B. In one particular case (c = d = 2), the mean
number of replications per fuel economy determination is bounded by 2.22
(under the assumption of 0.05 probability of contamination), while
the RMSE achieved under conditions of no contamination and worst case
single spurious reading are 0.704 σ and 1.078 σ, respectively. This
is superior to the procedure in the draft EPA Recommended Practice (closely
approximated by the E-estimator in the Appendix with d = 2) which takes
somewhat fewer observations on the average (bounded by 2.06) and yields no-
contamination RMSE and worst-case RMSE of 0.723 σ and 1.180 σ,
respectively. The noted K-estimator also compares favorably with the
24
-------
best possible RMSE (for N = 2.22) of 0.681 σ achieved by a random combina-
tion of 2-sample and 3-sample means under no contamination. The penalties
paid under no contamination, in terms of percent increase in error
standard deviation, are hence: (.704 - .681)/.681 = 3.4% for the K-
estimator and (.723 - .681)/.681 = 6.2% for the EPA procedure.
6.2 Carryover Procedure
6.2.1 Dominance of Extended Mileage Accumulation Variability
We next consider the carryover procedure as described in Section 2,
which is the procedure incorporated in the draft EPA Recommended
Practice. Again, based on unreplicated* fuel economy testing on a single
sampled vehicle (and between-model variability assumed zero), the error
variance of the calculated fuel efficiency rating would be approximately
(see Section 3):
    σ_ε² = σ_e² + σ_oc² + 4.83 (σ_m² + σ_e²)
         = 5.83 σ_e² + 4.83 σ_m² + σ_oc²
This result involves three single reference oil mpg determinations
separated by successive 1000-mile intervals and a single candidate oil
mpg determination following a 2000-mile aging interval. Introducing the
component error estimates given by ASTM, we may write:
    σ_ε² = 5.83 (0.56) + 4.83 (5.48) + 1.0 = 30.7 (%)²

    σ_ε = 5.54%
* That is, wherever an FTP fuel economy measurement is called for it is
done only once instead of at least twice, back-to-back, with possible
third or fourth measurements, as called for in the draft Recommended
Practice.
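As an arithmetic check, the variance quoted above can be reproduced directly from the ASTM component estimates (σ_e² = 0.56, σ_m² = 5.48, σ_oc² = 1.0, all in %²):

```python
# Carryover-procedure rating error, from the ASTM component estimates.
sigma_e2, sigma_m2, sigma_oc2 = 0.56, 5.48, 1.0   # all in (%)^2

var = 5.83 * sigma_e2 + 4.83 * sigma_m2 + sigma_oc2
print(round(var, 1), round(var ** 0.5, 2))        # 30.7 and 5.54

# With measurement error eliminated entirely (sigma_e = 0):
print(round((4.83 * sigma_m2 + sigma_oc2) ** 0.5, 2))   # 5.24
```

The second figure makes the dominance of the σ_m² term, discussed next, explicit.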
25
-------
We observe at once that the dominant error contribution is due to σ_m²,
which, it will be recalled, is the variance of the deviation of fuel economy
of a car from a linear trend under extended mileage accumulation. In fact,
if measurement error itself were totally eliminated, σ_ε would decrease
only slightly, to 5.24%!
6.2.2 Proposed Alternative Experimental Design
As previously described, EPA's draft Recommended Practice
requires replicated mpg determinations at each of the test mileage
points. Clearly this would yield little improvement in accuracy under
normal circumstances. The sole justification for replication in this
situation must therefore be to achieve robustness in the event of
spurious observations. That would be accomplished to a significant
degree as indicated in our discussion of the non-carryover procedure.
However, as long as additional FTP tests are to be made, we propose
a different placement of these tests in order to achieve a more sub-
stantial gain in precision along with the protection against spurious
observations. Specifically we suggest that the six FTP tests with the
reference oil (which is approximately the expected number of such tests
under the presently formulated procedure) be performed singly at 400-mile
spacings to cover the same total range of 2000 miles. It is believed
that the cost impact of such a modification would be relatively small.
The FTP test with the candidate oil could continue to be replicated in
the present manner or could apply a K-estimator as discussed earlier.
Under this modified design, the error variance of ε would reduce
approximately to:

    σ_ε² = 3.88 σ_e² + 3.38 σ_m² + σ_oc²
         = 21.7 (%)²

    σ_ε = 4.66%
26
-------
The major reduction comes from the reduced extrapolation coefficient
of the dominant σ_m² term. This result assumes that the frequency
spectrum of unpredictable mpg variations with mileage is mostly in a
high enough region so that variations separated by 400 mile increments
remain essentially uncorrelated. If appreciable "low frequency components"
do exist then the accuracy gain would not be as great. This issue needs
to be experimentally resolved.
Accurate estimation in the face of possible spurious observations
can continue to be achieved through robust procedures for linear
regression. Detailed discussion of specific methods is not herein
provided, since an adequate literature on this subject exists.
It is interesting to observe that the coefficient of the σ_e²
term, 3.88, in the proposed evenly spaced design is in fact larger
than the comparable coefficient, 2.92, for the present design of three
paired measurements separated by successive 1000-mile increments (see
Section 3). This is explained by the fact that measurement errors are
uncorrelated in either design, so the present EPA design, by its more
extreme placement of tests, yields lower extrapolation error due to
that component. In the case of the mileage accumulation variation
component, the decorrelation achieved in the modified design more than
makes up for its reduced extrapolation capability.
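The extrapolation coefficients quoted above (4.83 and 3.38) are ordinary least-squares prediction leverages, 1/n + (x₀ − x̄)²/Σ(x_i − x̄)². The sketch below reproduces both; the placement of the candidate-oil determination at 4000 miles is our assumption, chosen because it reproduces the coefficients quoted in the text:

```python
# Least-squares extrapolation leverage for the two reference-oil designs.

def leverage(xs, x0):
    """Variance multiplier for predicting the fitted line at mileage x0."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return 1.0 / n + (x0 - xbar) ** 2 / sxx

present  = [0, 1000, 2000]                   # three 1000-mile test points
modified = [0, 400, 800, 1200, 1600, 2000]   # six points at 400-mile spacing

print(round(leverage(present, 4000), 2))     # 4.83
print(round(leverage(modified, 4000), 2))    # 3.38
```

The wider spread of mileage points in the modified design reduces the leverage at the extrapolated candidate-oil point, which is the source of the accuracy gain claimed for the σ_m² component.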
6.3 Concluding Remarks
In conclusion, if EPA were to finalize a recommended practice based
on a non-carryover procedure, we recommend consideration of either the
Veale-Huntsberger or K-estimators for robust and reasonably efficient
estimation of FTP fuel economy utilizing small numbers (≤ 3) of replicated
27
-------
measurements. It appears unlikely in the light of industry comments
and the analyses presented herein that the carryover procedure in the
draft Recommended Practice will be found feasible. However, if it is,
we recommend consideration of a modified experimental design which
spreads out individual FTP measurements during reference oil opera-
tion, along with applications of a suitable robust (outlier-resistant)
linear regression technique.
28
-------
7.0 SCALING ASSUMPTIONS IMPLICIT IN SPECIFIC FUEL EFFICIENCY RATING
MEASURES
The definition of "grades" for candidate fuel-efficient oils must
be based on some measure of "performance" of these oils in the sense of
fuel saving. What constitutes an appropriate "figure of merit" is a
basic question which needs to be addressed before any statistical pro-
cedure is invoked for the establishing of grades.
This section of the report contains some observations pertinent
to the parameterization of the "benefit" to be derived from an alleged
fuel-efficient oil. It considers both physical and statistical impli-
cations of the choice of a performance measure and concludes with some
further observations pertinent to the LSD95 aspects of the EPA Recommended
Standard Practice.
The performance measure proposed in the EPA Recommended Practice is
based, in essence, on percentage improvement in fuel economy as deter-
mined by comparative tests of candidate oil (CO) and high reference oil
(HR) in five species of vehicles. Specifically, one computes

    e = (C - LSD95) / H

where C is the mean fuel economy as experienced with the candidate
oil and H is the mean fuel economy as experienced with the reference
oil. Fuel economy has its usual connotation of miles per gallon (MPG),
and the means are computed in MPG space for K vehicles of each of the
five species. Were it not for the LSD95 term, the performance measure
would translate into a simple ratio of candidate-oil mean MPG to
reference-oil mean MPG.
29
-------
Questions pertaining to the LSD95 aspects of the measure are dis-
cussed later in this section. Here we are concerned with the more
basic question of parameterization of "improvement" in some sense. We
assert that other bases of comparison of the two oils
exist and that the use of e, as defined in the EPA procedure, needs
to be defended on more than its intuitive appeal. In particular, the
notion of e seems to impose an arbitrary "scaling law" which may not
be realistic. It implies that the more fuel efficient a car is, the
more will its fuel economy (MPG) be improved by the use of the fuel-
efficient oil.
The realism of such an assumption is by no means obvious, nor is it
necessarily consistent with physical theories of lubrication and the
friction process. Though it might be argued that energy losses due to
friction are proportional to distance traveled, this argument does not
necessarily hold in the aggregate of all vehicles. Different regimes
of lubrication are known to exist in different automobiles, and the
greater fuel economy of a particularly fuel-efficient vehicle could
stem from causes unrelated to its lubrication characteristics. The
assumption that the benefit to be derived from a fuel efficient oil
is proportional to the base-level fuel efficiency of the vehicle may
therefore be untenable.
The importance of scaling can be appreciated by examining its effect
on the computations involved in the determination of e. Among the test
vehicles, observed improvements in fuel economy can be expected to vary.
These variations are due, in part, to errors of sampling and errors of
measurement. Another possible source of variation, however, stems from
the fact that the mean response of vehicles to the candidate oil may be
vehicle-species dependent. If this species-related variation is in
accordance with the proportional scaling implied in the EPA Recommended
30
-------
Standard Practice, then the expected relative improvement in fuel efficiency
attributable to the candidate oil is constant across vehicle species,
even though the differential improvement in MPG is not. Under such
conditions, the choice of vehicle species to be included in the test
sample could be arbitrary, because the mean relative improvement in fuel
economy would be independent of species selection.
If the scaling assumption does not hold, then the value of e is an
artifact of the choice of vehicle species to be included in the test.
This fact would not preclude the possibility, however, of some other
scaling law and an associated figure of merit. What one would hope for
is some basis of comparison which would be invariant with regard to
vehicle species—in short, one hopes to find a "scaling parameter"
applicable to all vehicles. Undoubtedly, absolute invariance does not
exist, but the figure of merit should be "robust"--that is, as insensitive
to vehicle characteristics as possible.
One might well ask whether evaluation of fuel efficiency improve-
ment should be based on fuel consumption (gallons per mile) rather than
MPG. For small perturbations, of the order of a few percent, the bases
do not differ substantially. In other words, a 5% increase in miles
per gallon translates to good approximation into a 5% reduction in fuel
consumption. The gallons-per-mile point of view is not without merit,
however, in view of the fact that the matter of concern is fuel con-
servation for a given distance traveled. For a test based on several
vehicle species with a range of fuel economy ratings, the MPG and GPM
bases could still lead to different evaluations. An illustration of
this fact follows.
31
-------
Let R1, R2, ..., R5 denote the reference-oil fuel economies in
miles per gallon of the five vehicle species tested. Similarly, let
k_i denote the factor of improvement which scales the reference MPG for
the ith species into the fuel economy C_i experienced with the candidate
oil. Thus,

    C_i = k_i R_i

and we can define a quantity e' as

    e' = Σ k_i R_i / Σ R_i

Note that e' is analogous to the quantity e defined in the EPA
Recommended Practice except for the deletion of the LSD95 term.
Now if k_i = k for all i, then e' = k, where k is expected
to have a value quite close to unity.
But, suppose that the k_i assume distinct values which vary over
an appreciable range for the five species. Then it is clear that the
R_i serve as weighting factors and that most weight is given to those
vehicles which normally achieve the highest MPG. For example, if
R1 = 15 MPG and R5 = 30 MPG, then k5 is given twice as much weight
as k1.
Now one can envision various scenarios, two of the most interesting
being (a) the case in which k values monotonically increase with MPG
and (b) the case in which k values monotonically decrease with MPG. In
case (a), the candidate oil would receive a relatively high e' because
of the heavy weighting accorded the high-MPG vehicles. In case (b),
the candidate oil would receive a relatively low e' because of the
32
-------
heavy weighting accorded the low-MPG vehicles. It is a moot point,
however, as to which value would most realistically represent the fuel
efficiency of the oil, in view of the fact that the low-MPG cars have
greatest fuel consumption. Accordingly, there may be merit in
comparing candidate and reference oils on the basis of consumption, in
GPM, rather than on the basis of MPG.
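The weighting effect described above can be made concrete with hypothetical numbers. In the sketch below, e' weights the improvement factors k_i by reference MPG, while a consumption-based analog (our construction, weighting each k_i by reference GPM = 1/R_i) reverses the preference between the two monotone scenarios:

```python
# Illustration (with hypothetical numbers) of how the MPG-based measure
# e' = sum(k_i * R_i) / sum(R_i) weights the improvement factors by
# reference fuel economy, while a consumption (GPM) based analog weights
# them by reference consumption 1/R_i instead.

R = [15.0, 18.0, 21.0, 25.0, 30.0]         # hypothetical reference MPGs

def e_mpg(k, R):
    return sum(ki * Ri for ki, Ri in zip(k, R)) / sum(R)

def e_gpm(k, R):
    w = [1.0 / Ri for Ri in R]              # weight by reference consumption
    return sum(ki * wi for ki, wi in zip(k, w)) / sum(w)

k_up   = [1.01, 1.02, 1.03, 1.04, 1.05]     # k increasing with MPG: case (a)
k_down = list(reversed(k_up))               # k decreasing with MPG: case (b)

print(e_mpg(k_up, R) > e_mpg(k_down, R))    # True: MPG basis favors (a)
print(e_gpm(k_up, R) < e_gpm(k_down, R))    # True: GPM basis favors (b)
```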
The question of what constitutes proper parameterization of fuel-
efficiency improvement is made more complex by introduction of the
LSD95 term. From a statistical point of view, the use of the LSD95 term
is illogical, particularly so if the true k_i differ significantly from species
to species. If the observed values vary only because of measurement
error, then each observation is an estimate of the common, population
value k for the candidate oil in question. Only then can averaging
and other statistical procedures be defended on the basis of improved
estimation of e. Even in that case, however, the proposed use of
LSD95 confuses population parameters with estimates based on realizations
of random variables. This point needs further explication.
The procedure recommended by EPA comes under the heading of what
is often referred to in the statistical literature as a paired
comparison. Two random variables X and Y with realizations x_i
and y_i (i = 1, 2, ..., n) are arrayed as in the following table.
33
-------
A PAIRED COMPARISON

VARIABLE         OBSERVATIONS                             MEAN
X                x_1, x_2, ..., x_n                       x̄
Y                y_1, y_2, ..., y_n                       ȳ
DIFFERENCE, D    d_1 = x_1 - y_1, ..., d_n = x_n - y_n    d̄ = x̄ - ȳ
One can then calculate the standard error of the d_i and test the
hypothesis that the true mean difference m_D is zero. The level of
significance can be as desired, and one can choose either a one-sided
or two-sided alternative hypothesis. If the 95% level is selected,
then the LSD95 is simply the value of d̄ required to reject the
hypothesis H₀: m_D = 0.
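The paired-comparison arithmetic can be sketched as follows; the data are hypothetical, and the t value is hardcoded for n = 5 (4 degrees of freedom, one-sided 95%), matching the example used later in this section:

```python
# Sketch of the paired-comparison computation underlying the LSD95:
# paired differences d_i, their mean, standard error, and the least
# significant difference t * SE.

import math

def lsd95(x, y, t_crit=2.1318):
    """Return (mean difference, standard error, LSD95) for paired data."""
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    dbar = sum(d) / n
    s2 = sum((di - dbar) ** 2 for di in d) / (n - 1)
    se = math.sqrt(s2 / n)                  # standard error of dbar
    return dbar, se, t_crit * se

# Hypothetical candidate-oil and reference-oil mpg readings on 5 cars:
x = [21.2, 20.8, 21.5, 20.9, 21.6]
y = [20.1, 20.0, 20.3, 19.9, 20.2]
dbar, se, lsd = lsd95(x, y)
print(dbar > lsd)    # True: the mean difference is declared significant
```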
The important point to note here is that the LSD95
is a sample-derived quantity which serves as a criterion for
rejection of the null hypothesis. The EPA Recommended Practice
proceeds to compute the amount by which d̄ exceeds this criterion
and to use this excess as the basis for defining oil grade. It is our
belief, however, that the intent of the procedure is not served by
adjusting the observed d̄ by a sample quantity. Rather, it should
be adjusted by a fixed quantity, that fixed quantity being one of the
grade boundaries.
34
-------
For example, suppose that the mean difference d̄ between the
candidate oil and the reference oil is 1 mpg with a standard error
of 0.3 mpg. Suppose, further, that the sample mean fuel economy for
the reference oil is H = 20 mpg. For 4 degrees of freedom,
t = 2.1318 and

    e = [21.0 - 2.1318(0.30)] / 20.0 = 1.018
Accordingly, by the EPA Recommended Practice, the oil would qualify as
Grade B.
Note that the use of the LSD95 procedure is equivalent to computa-
tion of a one-sided confidence interval for the population mean. The
lower confidence bound is provided by

    21.0 - 2.132(0.3) = 20.36,

or, in ratio terms relative to H = 20.0, the 1.018 seen above. In
other words, the statement that the true mean exceeds 1.018 has 95%
probability of being a correct statement. It might well be asked if
there is not a "reasonably high" probability that the true mean exceeds
1.03.
This prospect can be examined as follows. Suppose the true mean
is 1.03. The computation

    [21.0 - 1.03(20)] / 0.3 = 1.333

provides a t value corresponding to nearly a 90% confidence interval
35
-------
(t_0.10 = 1.533). Similarly, if the true mean is 1.01, the computation

    [21.0 - 1.01(20)] / 0.3 = 2.667

provides a t value corresponding to nearly a 97.5% confidence interval
(t_0.025 = 2.776). In other words, it is "almost as likely" that the true
mean exceeds 1.03 as that it exceeds 1.018 or even 1.01. Indeed, it can
even be argued that the grade class is more likely to be A than B.
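The arithmetic behind this example can be sketched directly (4 degrees of freedom, H = 20 mpg, mean difference 1 mpg, standard error 0.3 mpg):

```python
# The computations of the example above (a sketch).
se, H, mean_co = 0.3, 20.0, 21.0

# One-sided 95% lower confidence bound (t = 2.132 at 4 d.f.):
lower = mean_co - 2.132 * se
print(round(lower, 2), round(lower / H, 3))     # 20.36 and 1.018

# t values for hypothesized true means of 1.03 and 1.01 (in ratio terms):
t_103 = (mean_co - 1.03 * H) / se
t_101 = (mean_co - 1.01 * H) / se
print(round(t_103, 3), round(t_101, 3))         # 1.333 and 2.667
```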
The difficulty is further illuminated when viewed in a different light.
The LSD95 procedure, as employed by EPA, addresses the question of bounding
the true mean improvement given a sample measurement. An equally important
question concerns bounding the sample measurement, given the true mean
improvement. For example, suppose that the true value for a candidate oil
is really (1.01)(20) = 20.2—that is, at the low end of the B-grade interval.
With what probability would such oils yield a sample value exceeding the
LSD95 criterion? Clearly, one is concerned with the distribution of the
quantity

    (x̄ - 20.2) / s

where x̄ is the observed value of the sample mean for the test vehicles
and s is the estimated standard error. By a previous argument, the event

    (x̄ - 20.2) / s > 2.132
would occur with probability 0.05, so that one would seldom "recognize"
this oil to be a B-grade oil by the EPA procedure.
36
-------
Admittedly e = 1.01 barely qualifies the oil as Grade B, but a
minimal A-grade oil, i.e., an oil with e > 1.03, would fare only slightly
better, as will be shown by the following argument. With e = 1.03 and
reference value of 20 MPG, the expected value for the candidate oil is
(1.03)(20) = 20.6 MPG. Therefore each observed value would be 0.4 MPG
greater than in the previously postulated case in which e = 1.01 was
assumed. Accordingly, one is concerned with the distribution of the
quantity

    [(x̄ + 0.4) - 20.2] / s = [(x̄ - 20.2) + 0.4] / s

and with the probability of occurrence of the event

    [(x̄ - 20.2) + 0.4] / s > 2.132

A bit of algebra shows that the probability of this event is the same as
that of the event

    (x̄ - 20.2) / s > 0.799
By reference to tables of the t distribution, it is evident that a
Grade B assignment or better would result "less than 25% of the time"
(t_0.25 = 0.741) even if the true improvement were at the upper limit of
the B range.
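The reduction above can be checked with a line of arithmetic (s = 0.3 as in the running example):

```python
# Sketch of the Type II argument: an oil at the top of the B range must
# clear the criterion 20.2 + 2.132*s, which with s = 0.3 reduces to a
# t variate exceeding 2.132 - 0.4/0.3.

s = 0.3
threshold = 2.132 - 0.4 / s
print(round(threshold, 3))    # 0.799

# Since t_0.25 = 0.741 < 0.799, a Grade B assignment or better occurs
# less than 25% of the time even at the top of the B range.
print(threshold > 0.741)      # True
```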
In summary, we have examined the procedure whereby the EPA Recommended
Standard Practice combines measurements from several test vehicles into a
single oil grade rating. It is concluded that, from a physical point of
view, the procedure is tantamount to an arbitrarily assumed scaling law when
37
-------
the vehicles tested differ significantly in their response to the fuel-
efficient oil. It is further concluded that the use of statistical concepts
such as the LSD95 is appropriate only if the observed differences among
vehicles are due to sampling and measurement errors.
When examined in this light, it is concluded that the LSD95 procedure,
as employed in the EPA recommended standard practice, is structured so as to
guard against overgrading a candidate oil. On the other hand, when viewed
from the standpoint of both Type I and Type II errors, it tends to undergrade
oils, perhaps seriously. It must be concluded, therefore, that the LSD95
approach does not afford the supposed 95% level of significance in the sense
desired and should be replaced by a hierarchy or sequence of hypothesis tests
involving the postulated grade boundaries. These tests should be constructed
in such a way as to balance the two types of error, overgrading and undergrading.
38
-------
APPENDIX A
ENGINE OIL FUEL EFFICIENCY GRADING
UNDER MEASUREMENT ERROR AND
BETWEEN-MODEL DIFFERENCES
If C and H are the true FTP fuel economies achieved by a vehicle
with candidate oil and reference oil, respectively, then e = C/H is
defined as the candidate oil's fuel efficiency rating with respect to the
given vehicle.
Let G_1 < G_2 < ... < G_r be a grading structure such that an oil is
designated as Grade A if e > G_r, Grade B if G_{r-1} < e ≤ G_r, and so
forth. This grading structure will be characterized by a minimum grade
width, R = min_j (G_j - G_{j-1}), j = 2, ..., r.
The EPA proposed procedure for grading a candidate oil involves
testing the oil in a balanced sample from five preselected vehicle models.
Specifically, designate E_ik as the oil's measured fuel efficiency rating
in the kth car of the ith model, where i = 1, ..., 5 and k = 1, ..., K.
The basic issue addressed here is: under what conditions of measurement
error, car-to-car variability, and model-to-model differences can a
reasonable statistical procedure be applied to the data {E_ik} to obtain an
accurate and meaningful fuel efficiency grade? A subsidiary question, of
A-l
-------
course, is what constraints need to be imposed on K, the number of test
cars per model?
We assume an error/effects model for the data as follows:
    E_ik = μ_i + e_ik

where μ_i is the true mean value of candidate oil fuel efficiency rating
for cars of the model i population, and e_ik is an additive error term
which includes both experimental (measurement) errors and sampling effects
(vehicle-to-vehicle variability within model i). It is also believed
reasonable to assume that the e_ik are independent, normally distributed
variates with zero mean and common variance σ². One may, further, introduce
an external estimate S_o² of σ² based on prior experience/data. Previous
analysis by ASTM¹ provides estimates for experimental error standard
deviation of 1.6% under the non-carryover procedure and 6% under the carryover
procedure. These are therefore lower bounds for S_o, but if it could be
established that sampling effects are relatively small, then they would
serve directly as appropriate values of S_o for the two distinct experi-
mental procedures.
Between-model variability is represented in the above formulation by
the individual model means. Unfortunately, we know very little about the
A-2
-------
magnitude of this effect at the present time, yet it could conceivably
negate the basic validity of the grading program. If model-to-model
differences for a given oil are large compared to grade width measure R,
then one could quite justifiably question the meaningfulness of a specific
grade assigned to that oil. Our recommendation in such an event is that
it is better to leave the oil ungraded than to assign a fictitious "average"
or "worst" grade. An "average" grade could give the owner of one of the
poorly performing models a false sense of value. A "worst" grade is likely
to be so low that the oil company would rather opt for no grade.
For purposes of the present analysis we regard the five models pre-
selected by EPA as the universe of models. The rationale for the selections
made and their representativeness are clearly relevant, but will not be
pursued here.
The basic approach which we propose is to test the hypothesis that be-
tween-model differences are substantial relative to grade width. If the data
permit us to reject this hypothesis, then a fortiori the measurement/
sampling error contribution is acceptably small and we would then estimate
the oil's fuel efficiency rating by the total sample mean:
    Ē = (1/5K) Σ_{i=1}^5 Σ_{k=1}^K E_ik
A-3
-------
We would then proceed to determine the grade interval which contains Ē
and assign the oil that grade.
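The grade assignment step itself is a simple interval lookup. A sketch, with hypothetical boundary values (loosely based on the 1.005 and 1.03 figures mentioned elsewhere in this report):

```python
# Grade assignment by interval lookup; the boundaries are illustrative.
import bisect

GRADE_BOUNDS = [1.005, 1.03]       # hypothetical B/C and A/B boundaries
GRADES = ["C", "B", "A"]           # below first bound ... above last bound

def assign_grade(e_bar):
    """Return the grade whose interval contains the estimated rating."""
    return GRADES[bisect.bisect_right(GRADE_BOUNDS, e_bar)]

print(assign_grade(1.018))   # B
print(assign_grade(1.04))    # A
```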
If, conversely, the data cause us to accept the hypothesis, that is
equivalent to saying we have insufficient confidence that the between-model
variability is small enough to permit a meaningful grade to be assigned.
The oil company (tester) may decide at this point to take additional
measurements (increase K) and repeat the above test with the augmented
data. It could also decide not to continue, in which case the oil may be
left ungraded, or some prescribed algorithm based on estimated model-
specific ratings may be applied to determine a single or multiple grade
outcome. Testing costs will, of course, put practical constraints on K.
However, consideration should also be given to the advisability of EPA
imposing a maximum limit on K.
A Proposed Test for Between-Model Variability
The form of the hypothesis we wish to test is

    H₀: max_{i,j} |μ_i - μ_j| ≥ ρR

where ρ > 0 is a fixed number which establishes the severity of the test.
Define Δμ = max_{i,j} |μ_i - μ_j| and R' = ρR, and restate the above as
H₀: Δμ ≥ R'.
A-4
-------
The hypothesis states that there exists a pair of car models whose difference
in true mean fuel efficiency rating is at least ρ times the smallest grade
width. In other words, H₀ affirms that model-to-model variability in
mean fuel efficiency rating is too large relative to grade width to permit
meaningful determination of a specific grade. A reasonable value for ρ
is 1, in which case rejection of H₀ would imply that the model variability
contribution to rating error is likely to be less than half a grade width.
Other ρ values could be specified. As will be seen later, specification
of ρ and of level of significance are interchangeable decisions.
What is a reasonable statistic for testing this hypothesis? Define
the within-model sample means Ē_i (estimates of the μ_i) by:

    Ē_i = (1/K) Σ_{k=1}^K E_ik
The common variance of these sample means is σ²/K. Some alternative
estimates of σ² are: (1) the external estimate S_o²; (2) the internal
estimate

    S² = [1 / 5(K - 1)] Σ_{i=1}^5 Σ_{k=1}^K (E_ik - Ē_i)²

and (3) the pooled estimate
A-5
-------
    [ν S_o² + 5(K - 1) S²] / [ν + 5(K - 1)]
where S_o², as previously noted, is an external estimate of σ² and is assumed
to be chi-square with ν degrees of freedom. Note that "internal" estimation
of σ² is possible only for K ≥ 2, i.e., some within-model replication
is required. The third alternative, which uses all available information,
is clearly best under the assumption that σ² has not changed. For purposes
of the present analyses we shall assume that an S_o² is available with
sufficiently large ν to be effectively equivalent to known σ².
Now, let r_1 ≤ r_2 ≤ ... ≤ r_5 be the order statistics of {Ē_i}.
Define the statistic

    T = √(K/2) (r_5 - r_1 - R') / σ

Under the assumption of relatively precise estimation (σ << R') and
neglectability of intermediate μ_i,* T will be approximately normal with
mean equal to √(K/2) (Δμ - R')/σ and unit variance. Consequently, we

* That is, there exists a unique pair of extreme μ_i and the other
  three μ_i have intermediate values which are relatively far from
  the extremes in σ units.
A-6
-------
may make the following probability statement:

    Pr{T ≤ -z_{1-α} | H₀} ≤ α

Therefore, an acceptance region for H₀ defined by T > -z_{1-α} constitutes
a test with level of significance α under the above stated assumptions.
It is interesting to observe that if, contrary to the assumption, more than
two μ_i are extreme or close to extreme, then T will be biased in a
positive direction and we will be more apt to accept H₀ (which affirms
that model-to-model differences are too great). This effect nicely matches
our inclination, since for fixed Δμ polarization of the μ_i corresponds to
an increase in variability.
Power of the Test
Suppose that Δμ is truly smaller than the criterion value of R'.
What is the probability of making the correct decision to reject H₀?
This is the power, P, of the test and will, of course, be a function of
Δμ. Again, invoke the previously stated assumptions of σ << R' and
neglectability of intermediate μ_i. Define:

    T' = T - √(K/2) (Δμ - R') / σ
A-7
-------
which is approximately standard normal. We may then write

    P = Pr{T ≤ -z_{1-α}}
      = Pr{T' ≤ -z_{1-α} - √(K/2) (Δμ - R')/σ}
      = Φ(-z_{1-α} - √(K/2) (Δμ - R')/σ)

where Φ is the normal cumulative distribution function. In terms of the
dimensionless parameters

    λ = √(K/2) R'/σ,    δ = Δμ/R'

this expression becomes

    P = Φ(λ(1 - δ) - z_{1-α})

Figure 1 shows plots of P vs. δ for various values of significance
level α and precision parameter λ. We focus on the interval
0 ≤ δ < 1, which corresponds to the alternative hypothesis H₁: Δμ < R'.
At δ = 1, power equals level of significance, but increases with
decreasing δ.
A-8
-------
FIGURE 1 (plot not reproduced). Power of Test: T ≤ -z_{1-α} ⇒ Reject H₀
(That Is, Probability of Rejecting H₀) as a Function of the Ratio of
Maximum Difference in Model Means to Grade Width, δ, for Selected Values
of α (Level of Significance) and λ = √(K/2) R'/σ (Precision Parameter).
A-9
-------
Fixing, for the moment, on the case α = 10% (which is perhaps as
high as one might want to go as an acceptable level of significance), we
ask the question: what level of precision is required to achieve at least
90% probability of acceptance of H_1 if Δμ is no greater than 75% of R′?
Examination of Figure 1 yields λ ≈ 10. This numerical illustration is
not at all unreasonable as a set of practical performance requirements to
impose on the oil grading procedure. Recapitulating: if Δμ ≥ R′ we want
to reach the decision, with probability at least 90%, that model variability
is unacceptably large, and if Δμ ≤ 0.75 R′ we want to reach the decision,
with probability at least 90%, that model variability is within acceptable limits.

Now then, how severe is the requirement λ = 10? If we set R′ = 2.5%
(which would be the case if the current draft procedure made a slight
alteration in the Grade B-C boundary from 1.01 to 1.005) and μ = 1, then
σ/√K must equal 0.18% to satisfy λ = 10. The following combinations of
K and σ would meet this requirement:
    K         σ
    1        0.18%
    4        0.35%
    16       0.71%
    64       1.41%
    1024     5.66%
A-10
-------
Suppose we arbitrarily relaxed the precision requirement to
λ = 5. This implies at least 90% power only for Δμ ≤ 0.49 R′, which is of
course appreciably poorer performance, but perhaps acceptable. (The
impact of such a relaxation is that more real situations of truly acceptable
model-to-model variability are likely to end up with the oil left ungraded.)
Now the requirement on σ/√K is 0.35%, and a comparable set of K, σ
pairs is

    K         σ
    1        0.35%
    4        0.71%
    16       1.41%
    256      5.66%
Taking the ASTM estimates of σ = 1.6% and 6.0% for the non-carryover
and carryover procedures, we conclude that 320 cars and over 5000 cars,
respectively, should be tested under λ = 10 for each candidate oil.
These numbers reduce to 80 and nearly 1300 if λ is relaxed to 5. Even
with these large numbers there is no assurance, particularly so for the
smaller λ, that a final grade would be achieved when model variability is
small.
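The power and fleet-size arithmetic above can be sketched numerically. This is a minimal sketch, assuming the reconstructed relations P = Φ(-z_{1-α} + λ(1 - δ)) and λ = √(K/2) R′/σ; the scaling of K test models to test vehicles (4 cars per model) is an assumption introduced here only to illustrate orders of magnitude.

```python
from statistics import NormalDist

ND = NormalDist()

def power(delta, lam, alpha=0.10):
    """Probability of rejecting H_0 when the true ratio of maximum
    model-mean difference to grade width is delta = (Delta mu)/R'."""
    z = ND.inv_cdf(1 - alpha)
    return ND.cdf(-z + lam * (1 - delta))

def models_required(sigma_pct, lam, r_prime_pct=2.5):
    """Number of test models K implied by lam = sqrt(K/2) * R'/sigma."""
    return 2 * (lam * sigma_pct / r_prime_pct) ** 2

# At delta = 1 the power equals the significance level, as stated above.
# With the ASTM non-carryover estimate sigma = 1.6% and lam = 10,
# K comes out near 80 models -- roughly 320 cars at 4 cars per model.
k = models_required(1.6, 10)
```

Note that power(1.0, 10) returns exactly the significance level 0.10, matching the statement that at δ = 1 power equals the level of significance.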
A-ll
-------
APPENDIX B
ACCOMMODATING SPURIOUS OBSERVATIONS
IN FUEL ECONOMY MEASUREMENTS
Replicated fuel economy measurements at a particular mileage point
offer the potential of reducing random errors through averaging and
also provide some degree of protection against errors caused by spurious
observations. Since measurement costs are considerable, a limit is imposed
on the maximum permitted number of replications. For purposes of the
present discussion, we set this limit at 3. If higher limits are of interest,
then the analysis presented below would have to be appropriately modified.
The Measurement Model
We denote the (unknown) true fuel economy of a vehicle at a given
mileage point by μ, and three successive fuel economy measurements at
that mileage by X_1, X_2, X_3.* When none of these measurements is spurious,
the X_i are assumed statistically independent and each normally distributed
about μ with variance σ². It is assumed that either the standard
deviation σ itself is known or σ = μτ, where τ (the coefficient of
variation) is known. These alternative models are designated I and II
respectively for future discussion. The assumption of a known measure of
dispersion makes sense in the context of a standard experimental procedure
for which a large body of prior experience is available.

* Mileage accumulation caused by the experimental procedure itself is
neglected and the vehicle is assumed to be in the same state at initiation
of each measurement.
If one could be certain that all observations are "good," i.e., none
are spurious, then a best estimator of μ is the sample mean X̄ =
(X_1 + ... + X_n)/n, and the choice of one, two, or three (or more)
measurements is a relatively straightforward trade-off between cost and
estimation error. Specifically, the mean square error (MSE) of this
estimate (which is equal to its variance because of unbiasedness) is
inversely proportional to n, while cost increases approximately linearly
with n. The actual choice of n would be determined by absolute con-
straints on acceptable costs and acceptable errors and/or conceivably by
minimization of a total "cost" which incorporates a dollar-equivalent error
component.
Modelling the Spurious Observation
The real problem, however, is compounded by the possible presence of
spurious observations. By the very nature of the concept "spurious," we
should not presume to be able to give it a probabilistic characterization.
However, in conformity with common practice, we assume that a spurious
observation is also normally distributed with variance σ², but with mean
shifted away from μ by an arbitrary number of standard deviation units
to μ + bσ. (A deterministic reading of X = μ + bσ, i.e., all
probability mass concentrated at μ + bσ, but with μ and bσ separately
unknown, would probably best capture the essence of the "spurious" observa-
tion; the introduction of the random normal spread about μ + bσ has only
a secondary averaging effect over some limited range of severity of
spuriousness.)
An additional aspect of spurious observations is their frequency of
occurrence. Under good experimental controls, which we assume apply to
the problem at hand, spurious observations should be relatively rare events.
More specifically, what is really desired is that the likelihood of more
than one spurious reading among three measurements be negligibly small. Such
a property appears as a natural consequence of independence in the occurrence
of relatively rare spurious readings. For example, if each measurement has
an independent probability, λ ≤ 0.05, of giving rise to a spurious reading,
then the probabilities of zero (P_0), one (P_1), and more than one (P_{>1})
spurious observations among three measurements are bounded by:
    P_0 > 1 - 3λ ≥ 0.85
    P_1 = 3λ(1 - λ)² < 0.14
    P_{>1} = 3λ²(1 - (2/3)λ) ≤ 0.0075
Thus, on the average, fewer than one out of every 130 measurement triples
would harbor multiple spurious observations. Our concern with limiting
multiple spurious occurrences stems from the fact that among three observa-
tions they would be essentially uncorrectable. We therefore assume for
purposes of our analyses that multiple spurious observations will not occur.
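The bounds just quoted can be checked directly; the sketch below restates them for the λ = 0.05 case discussed in the text.

```python
def spurious_count_probs(lam):
    """Probabilities of 0, 1, and more than 1 spurious readings among
    three independent measurements, each spurious with probability lam."""
    p0 = (1 - lam) ** 3
    p1 = 3 * lam * (1 - lam) ** 2
    p_multi = 1 - p0 - p1          # algebraically 3*lam**2 * (1 - 2*lam/3)
    return p0, p1, p_multi

p0, p1, pm = spurious_count_probs(0.05)   # the lam = 0.05 case in the text
```

At λ = 0.05 the three values respect the stated bounds (P_0 > 0.85, P_1 < 0.14, P_{>1} ≤ 0.0075), and P_{>1} = 0.00725 corresponds to the "fewer than one out of every 130 triples" remark.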
General Strategies for Dealing with Spurious Observations
A group of measurements is to be used to estimate a parameter. How
would a spurious observation be distinguishable in the measurement set?
Basically, it is likely to appear as an "outlier," that is, an unexpectedly
extreme observation or as an unexpectedly large discrepancy between two
measurements.
A traditional approach for dealing with spurious observations is to
explicitly test for and identify outliers by a prescribed procedure, and
then to eliminate them as "bad" data points. The problem with this approach
is that good (non-spurious) observations occasionally manifest as outliers
while, conversely, a small to moderate spurious shift, bσ, may frequently
not be distinguishable.
B-4
-------
A more sophisticated approach is to devise estimators, based on all
the data, with the property of being relatively insensitive to contamina-
tion from spurious observations, yet without explicitly having to decide
which, if any, observation is spurious. Since the ultimate objective is
accurate parameter estimation (in our case the mean of a normal distri-
bution), there is no particular need to reach a yes/no decision as to
whether a given observation is good or bad. Rather, all observations can
be utilized, with the weight each carries dependent on some measure of its
concordance with the total data set. Such an approach is generally termed
"accommodation" to spurious effects, or robust estimation.
Specific Mean Estimators
A classic paper treating mean estimation from n normally distributed
observations, in which at most one may have a spurious shift, is by Anscombe.¹
The estimator proposed for the case of known σ is:

    μ̂_A = X̄                   if |Z_m| ≤ cσ
    μ̂_A = X̄ - Z_m/(n-1)       if |Z_m| > cσ

where X̄ = (1/n) Σ X_i (the sample mean), Z_m = X_m - X̄ is the deviation
from the sample mean with the property of being largest in absolute value,
and c > 0 is a criterion to be selected.
B-5
-------
Thus, the sample mean and all deviations are computed. If the maximum
absolute deviation |Z_m| is sufficiently small, the sample mean is accepted
as the estimate; if |Z_m| exceeds the specified criterion, then X_m is
excluded and the reduced sample mean is used. This approach is tantamount
to identifying and rejecting X_m as an outlier when it is too extreme.
An analogous estimator applicable to the case in which the coefficient of
variation τ is known (Model II) could be formulated as:

    μ̂_A,II = X̄                   if |Z_m|/X̄ ≤ cτ
    μ̂_A,II = X̄ - Z_m/(n-1)       if |Z_m|/X̄ > cτ

and makes sense if τ is reasonably small, say < 10%.
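A minimal sketch of the two Anscombe-type estimators just described, written for any n ≥ 3:

```python
def anscombe_mean(xs, sigma, c):
    """Anscombe's estimator (Model I, known sigma): keep the full sample
    mean unless the largest absolute deviation exceeds c*sigma, in which
    case drop the extreme point and use the reduced sample mean."""
    n = len(xs)
    xbar = sum(xs) / n
    zm = max((x - xbar for x in xs), key=abs)   # deviation largest in |.|
    if abs(zm) <= c * sigma:
        return xbar
    return xbar - zm / (n - 1)                  # = reduced sample mean

def anscombe_mean_cv(xs, tau, c):
    """Model II analogue: sigma is replaced by tau times the sample mean."""
    xbar = sum(xs) / len(xs)
    return anscombe_mean(xs, tau * xbar, c)
```

For example, with readings [10.0, 10.1, 15.0], σ = 0.2 and c = 2.46, the third reading is rejected and the estimate is the reduced mean 10.05; with three mutually close readings the full sample mean is returned.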
Veale and Huntsberger² developed a modification of Anscombe's estimator
which they have shown to be at least as good, and in some situations to have
superior performance. It is given by:

    μ̂_VH = X̄                                                   if |Z_m| ≤ cσ
    μ̂_VH = X̄ - (Z_m/(n-1)) · Z_m²/(Z_m² + ((n-1)/n) c²σ²)      if |Z_m| > cσ
Note that for |Z_m| ≤ cσ, μ̂_VH is identical to μ̂_A, and that for very
large |Z_m| it is approximately equal to μ̂_A, i.e., the reduced sample
mean. However, for intermediate values of |Z_m| an intermediate weight
is given to X_m (the observation with largest absolute deviation), the
extent of which increases with decreasing |Z_m| until the condition
|Z_m| ≤ cσ is reached, at which point an abrupt transition occurs
and "full" weight is given to X_m as an element of the sample mean.
with Anscombe, an analogous Veale-Huntsberger estimator for Model II
is constructed by replacing o by XT, i.e.,
\
;HV,II = ;
/
!v 1 •tJ|1 1
x
X " n-1
Zm
1 2
^ Y -r2 + 7m2
ATT /JT1
L- n -I
IZm!
< CT
V f+
X
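The Veale-Huntsberger form can be sketched the same way. The ((n-1)/n) c²σ² term in the shrinkage denominator follows the reconstruction of the garbled formula in this appendix and should be verified against Veale and Huntsberger (1969); the qualitative behavior (full mean inside the criterion, smooth down-weighting outside, reduced mean in the limit) matches the text.

```python
def vh_mean(xs, sigma, c):
    """Veale-Huntsberger-type estimator: beyond the criterion the extreme
    observation is smoothly down-weighted rather than dropped outright.
    Denominator constant is a reconstruction -- see the note above."""
    n = len(xs)
    xbar = sum(xs) / n
    zm = max((x - xbar for x in xs), key=abs)   # deviation largest in |.|
    if abs(zm) <= c * sigma:
        return xbar                             # identical to Anscombe here
    w = zm**2 / (zm**2 + ((n - 1) / n) * (c * sigma)**2)
    return xbar - (zm / (n - 1)) * w            # w -> 1 gives reduced mean
```

As |Z_m| grows the weight w approaches 1 and the estimate approaches the reduced sample mean, as stated above.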
Numerous other robust estimators have been proposed based on, e.g., the
maximum likelihood principle, order statistics, and rank order statistics.³
Generally the focus has been on achieving good asymptotic performance for
very large n. Since our concern is with n ≤ 3 and there is no evidence
of real advantage of these approaches in our context, we shall not consider
them.
The estimators so far presented are clearly only meaningful for n ≥ 3,
and, with our restriction, only for n = 3. What if there is a strong
incentive, because of high measurement cost, to try to reduce the number
of observations to 2*, at least to achieve such a reduction in a signi-
ficant proportion of cases? Are any effective approaches available?

* There is no hope of dealing objectively with spurious observations
if only a single measurement is taken; hence the alternative of
n = 1 is not considered.
B-7
-------
Desu, Gehan, and Severo⁴ have considered such a problem and propose
a two-stage estimation procedure wherein only two measurements are taken
at the first stage. If these are sufficiently close to each other, their
mean is accepted as the estimate. Otherwise, a third measurement is
taken. Their procedure in the latter case is not relevant to our problem
since they then assume the third observation to be non-spurious. However,
we can incorporate the concept of a two-stage procedure into the Veale-
Huntsberger estimator as follows, for n = 3:

    μ̂_R = (X_1 + X_2)/2                              if |X_1 - X_2| ≤ dσ
    μ̂_R = X̄                                          if |X_1 - X_2| > dσ and |Z_m| ≤ cσ
    μ̂_R = X̄ - (Z_m/2) · Z_m²/(Z_m² + (2/3) c²σ²)     if |X_1 - X_2| > dσ and |Z_m| > cσ
Again, as before, for Model II:

    μ̂_R,II = (X_1 + X_2)/2                                if |X_1 - X_2|/X̄_2 ≤ dτ
    μ̂_R,II = X̄                                            if |X_1 - X_2|/X̄_2 > dτ and |Z_m|/X̄ ≤ cτ
    μ̂_R,II = X̄ - (Z_m/2) · Z_m²/(Z_m² + (2/3) c²X̄²τ²)     if |X_1 - X_2|/X̄_2 > dτ and |Z_m|/X̄ > cτ

where X̄_2 = (X_1 + X_2)/2.
B-8
-------
The promise of this two-stage estimator is that it may be able to
retain a good deal of the robustness of performance of the Veale-Huntsberger
estimator under possible spurious contamination while at the same time
achieving an appreciable reduction in the average number of measurements
conducted per estimation.
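The two-stage logic can be sketched as follows (Model I case). The shrinkage term again follows the reconstruction of the formula in this appendix; `draw` is a hypothetical callable supplying successive measurements, used here only so the staging is explicit.

```python
def two_stage_mean(draw, sigma, d, c):
    """Two-stage estimator sketched above: take two measurements; if they
    agree within d*sigma use their mean, otherwise take a third and apply
    the Veale-Huntsberger-type rule.  Returns (estimate, n_used)."""
    x1, x2 = draw(), draw()
    if abs(x1 - x2) <= d * sigma:           # first stage suffices
        return (x1 + x2) / 2, 2
    xs = [x1, x2, draw()]                   # take a third measurement
    xbar = sum(xs) / 3
    zm = max((x - xbar for x in xs), key=abs)
    if abs(zm) <= c * sigma:
        return xbar, 3
    w = zm**2 / (zm**2 + (2 / 3) * (c * sigma)**2)
    return xbar - (zm / 2) * w, 3
```

When the first two readings agree within dσ the cost is only two measurements, which is the source of the saving described above.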
For comparison purposes, we consider two additional estimators: the
sample mean of three observations, μ̂_S, and the staged procedure μ̂_E described
in a draft EPA Recommended Practice relating to engine oil fuel efficiency
grading⁵, except that it is modified to be terminated at three rather
than four measurements. These are defined as follows:

    μ̂_S = X̄ = (1/3) Σ X_i

    μ̂_E = (X_1 + X_2)/2                 if |X_1 - X_2| ≤ dσ
    μ̂_E = (X_1 + X_2 + X_3 - X_m)/2     if |X_1 - X_2| > dσ

    μ̂_E,II = (X_1 + X_2)/2                 if |X_1 - X_2|/X̄_2 ≤ dτ
    μ̂_E,II = (X_1 + X_2 + X_3 - X_m)/2     if |X_1 - X_2|/X̄_2 > dτ
where X_m is the measurement with largest absolute deviation from the
sample mean. Note, in the case of μ̂_E and μ̂_E,II, that when a third
observation is taken, the estimate corresponds to the "best two out of
three" rule.
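The staged EPA rule can be sketched in the same style. For three points, dropping the reading farthest from the three-point mean is equivalent to averaging the closest pair, so this is exactly the "best two out of three" rule when the third stage is reached; `draw` is again a hypothetical measurement source used for illustration.

```python
def epa_staged_mean(draw, sigma, d):
    """Modified (three-measurement) EPA staged procedure: accept the mean
    of the first two readings if they agree within d*sigma; otherwise take
    a third and average the best two out of three, i.e. drop the reading
    farthest from the three-point mean."""
    x1, x2 = draw(), draw()
    if abs(x1 - x2) <= d * sigma:
        return (x1 + x2) / 2
    xs = [x1, x2, draw()]
    xbar = sum(xs) / 3
    xm = max(xs, key=lambda x: abs(x - xbar))   # most extreme reading
    return (sum(xs) - xm) / 2                   # mean of the closest pair
```

For example, readings 20.0 and 21.0 (with dσ = 0.4) trigger a third reading; if it is 20.1, the extreme 21.0 is discarded and the estimate is 20.05.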
We proceed next to consider, in detail, the performance of the
estimators that have been defined.
Estimator Performance
An individual estimate μ̂ of μ (the true fuel efficiency of the
test car) will be in error by an amount μ̂ - μ. A useful, as well as
mathematically convenient, statistical measure of such errors is the
expected value of the square of the error, also called the mean square
error, MSE:

    MSE(μ̂) = E[(μ̂ - μ)²].

The practical utility of MSE stems from the fact that the weight given
to large errors is enhanced by squaring, corresponding to the subjective
notion that the seriousness or disutility ascribable to an error grows
with increasing rapidity as its magnitude increases.
B-10
-------
In general, one can relate MSE to the bias and variance of an
estimator:

    MSE(μ̂) = (E[μ̂] - μ)² + Var[μ̂]

The first term is the square of the bias. If an estimator is unbiased,
i.e., if its expected value is equal to the parameter μ being
estimated, then MSE(μ̂) = Var[μ̂]. However, in the face of
contamination by spurious observations, estimators will generally be
biased (in a manner depending on the direction and magnitude of the
mean shift), so it is inappropriate to use variance alone as the measure
of performance.
In comparing the MSE of X̄, μ̂_A, and μ̂_VH, we normalize by MSE(X̄) under
the condition of no spurious observation, equal to σ²/3 for n = 3. This
represents optimum performance in that X̄ is then an unbiased and
efficient estimator. Figure 1 graphs 3·MSE(X̄)/σ², 3·MSE(μ̂_A)/σ²
(for c = 2.460), and 3·MSE(μ̂_VH)/σ² (for c = 2.404) as functions of b,
the magnitude of spurious shift from the mean.* With the normalization

* The numerical values are taken from calculations by Veale and
Huntsberger.² However, it should be remarked that the calculations
were based on the unjustified assumption that the spurious con-
taminant may be identified with the measurement having the largest
absolute deviation from the sample mean. This leads to erroneously
low values for MSE at intermediate b-values. For example, whereas
they compute 3·MSE/σ² = 2.94 for b = 4, c = 2, a direct Monte Carlo
simulation of this case yielded 3.41 (with simulation standard
deviation ≈ 0.05). Nevertheless, the results plotted should be
reasonably valid for relative comparison of estimators and for
discussion of qualitative features.
B-ll
-------
SOURCE: References 1 and 2

FIGURE 1. Comparison of Estimators of Mean of Normal Distribution Using
Three Observations, When One Observation Has Spurious Mean Shift
(abscissa: normalized magnitude of spurious mean shift, b; curves shown
for X̄, μ̂_A with c = 2.460, and μ̂_VH with c = 2.404 and c = 1.5)
B-12
-------
defined above, the ordinate value of 1 is the ideal limit, achievable
only if no spurious observation is present (b = 0). Note that while
MSE(X̄) is, by definition, equal to 1 for b = 0, it increases rapidly
and without bound with increasing b, demonstrating lack of robustness.
MSE(μ̂_A) and MSE(μ̂_VH), on the other hand, while paying a modest
premium of 4% relative to ideal at b = 0, increase to maximum values
of 3.5 and 4.25, respectively, times the ideal at b in the vicinity
of 3-4 and then diminish again, both approaching a limiting value of
1.5 as b → ∞. The latter limit corresponds to the reduced sample
mean using the two "good" observations and its associated variance of
σ²/2, compared to σ²/3 for X̄. The particular c-values for μ̂_A and
μ̂_VH were selected to yield a 4% premium at b = 0. As evident from the
additional plot of μ̂_VH for c = 1.5, lower c-values can achieve better
protection against intermediate values of b, but at the expense of an
increased premium at b = 0. Conversely, larger c-values would lower the
premium but at the expense of reduced protection for intermediate values
of b: the curve maxima would increase and shift to higher b. Note
finally the modestly superior performance of μ̂_VH relative to μ̂_A.
In order to determine the performance of the μ̂_R and μ̂_E estimators,
Monte Carlo computer simulations were performed for selected criterion
values of d and c and for selected spurious shift magnitudes b
occurring in the 1st, 2nd, or (potentially) 3rd measurement with equal
likelihood. Results are depicted in Figures 2, 3, and 4, with straight
line segments connecting the different b-values for each specified
estimator. Clearly the fine-scale dependence on b is only roughly
approximated by linear interpolation, but the same qualitative behavior
of MSE peaking at some intermediate b-value, as exhibited in Figure 1,
occurs also with these estimators.
Also shown is the mean number of measurements per estimate, N̄,
which depends on d and b and ranges between 2 and 3. N̄ clearly
decreases with increasing d. On the other hand, for fixed d, N̄
increases with increasing b. If we denote the values of N̄ at b = 0
and b → ∞ by N̄_0 and N̄_∞, respectively, then it can be shown that

    N̄_∞ = 2 + N̄_0/3.
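Under the normal model the no-contamination value N̄_0 can be computed exactly, since X_1 - X_2 is distributed N(0, 2σ²); the limit relation then follows from the reasoning that a remote spurious reading in position 1 or 2 (probability 2/3) always forces the third measurement, while a spurious third reading leaves the first stage at its b = 0 rate. A quick numerical check:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def n_bar_0(d):
    """Mean measurements per estimate at b = 0: a third reading is taken
    when |X1 - X2| > d*sigma, and (X1 - X2)/sigma is N(0, 2)."""
    return 2 + 2 * (1 - norm_cdf(d / sqrt(2)))

def n_bar_inf(d):
    """b -> infinity limit: positions 1 or 2 spurious (prob 2/3) always
    trigger the third reading; a spurious third reading does not affect
    the first stage."""
    return 2 + 2 / 3 + (1 / 3) * (n_bar_0(d) - 2)
```

Algebraically n_bar_inf(d) equals 2 + n_bar_0(d)/3, the relation stated above; for d = 2, N̄_0 is about 2.16.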
Because of the reduced number of observations, the normalizing
factor applied to MSE is σ²/N̄ rather than σ²/3. This corresponds
approximately to the variance of a randomized mix of the 2-sample
and 3-sample means, as follows:

    (X_1 + X_2)/2          with probability 3 - N̄
    (X_1 + X_2 + X_3)/3    with probability N̄ - 2

where the choice is independent of X_1, X_2, X_3. When no contamination
is present, and under the constraint that the mean number of observations
per estimate is N̄ (2 ≤ N̄ ≤ 3), μ̂_R is an unbiased
B-14
-------
FIGURE 2. Performance of 2-Stage Estimators of Mean of Normal Distribution
Using Two or Three Observations, When One Observation Has Spurious Mean
Shift (abscissa: normalized magnitude of spurious mean shift, b)
B-15
-------
CASE B: d = 2 (Criterion for Taking Third Observation)

FIGURE 3. Performance of 2-Stage Estimators of Mean of Normal Distribution
Using Two or Three Observations, When One Observation Has Spurious Mean
Shift (abscissa: normalized magnitude of spurious mean shift, b)
B-16
-------
CASE C: d = 3 (Criterion for Taking Third Observation)

FIGURE 4. Performance of 2-Stage Estimators of Mean of Normal Distribution
Using Two or Three Observations, When One Observation Has Spurious Mean
Shift (abscissa: normalized magnitude of spurious mean shift, b)
B-17
-------
efficient estimator with variance approximately equal to σ²/N̄.*

Examination of Figures 2, 3, and 4 suggests that it should be
possible to determine values of c and d such that MSE(μ̂_R) will be
below MSE(μ̂_E) for all possible magnitudes of spurious mean shift b.
In particular, the condition b = 0, which corresponds to no contamination,
is crucial since that is expected to be the case most of the time. Hence,
great emphasis should be given to minimizing MSE at that point. Note
that μ̂_E performance in that region begins to become acceptable only for
d ≥ 3, but then the associated peak MSE at intermediate b-values becomes
relatively large.
Optimization of μ̂_R with respect to the choice of c, d requires
specification of further information and a definition of optimality,
since MSE depends on b. One possibility is to estimate a prior
distribution for b and then to minimize the expected value of MSE
(Bayesian approach). An alternate approach, which depends only on
knowing the probability λ that any given measurement is contaminated,

* Strictly speaking, the variance of μ̂_R is (5 - N̄)σ²/6; hence
that is the proper normalizing factor. However, the maximum
discrepancy between (5 - N̄)/6 and 1/N̄ within the range
[2,3] is 4% (when N̄ = √6 ≈ 2.45), so 1/N̄ is a quite acceptable
approximation.
B-18
-------
would minimize

    (1 - 3λ) · MSE(b=0) + 3λ · max over b of MSE(b)

This is a hybrid of minimax and Bayesian approaches. The first term is
the contribution to MSE when there is no contamination. The second
term is the maximum possible contribution to MSE when a spurious
observation of unknown magnitude has been included. Although additional
Monte Carlo simulation would be required to perform such an optimization
with precision, the results on hand suggest that with λ of the order of
.02 to .05 the solution would not be too far from c = d = 2.
Notwithstanding the defect in the numerical results for μ̂_VH, we
can attempt a rough bounding comparison of μ̂_VH and μ̂_R. Specifically,
compare these two with the former specified by c = 2.404 and the latter by
d = 1, c = 2. Note that the MSEs at b = 0 are comparable but that μ̂_R has a
lower peak MSE. While the true peak MSE of μ̂_R may be higher at b-values
slightly different from 4, it is likely to remain below the MSE peak
of μ̂_VH indicated in Figure 1. Furthermore, the latter is probably
significantly below its true value because of the previously noted incorrect
assumption.
B-19
-------
REFERENCES
1. F. J. Anscombe, "Rejection of Outliers," Technometrics, 2:123, 1960.

2. J. R. Veale and D. V. Huntsberger, "Estimation of a Mean When One
   Observation May Be Spurious," Technometrics, 11:331, 1969.

3. V. Barnett and T. Lewis, Outliers in Statistical Data, John Wiley
   & Sons, New York, pages 144-145, 1978.

4. M. M. Desu, E. A. Gehan, and N. C. Severo, "A Two-Stage Estimation
   Procedure When There May Be Spurious Observations," Biometrika,
   61:593, 1974.

5. Environmental Protection Agency, "EPA Recommended Practice for
   Evaluating, Grading, and Labeling the Fuel Efficiency of Motor
   Vehicle Engine Oils," No Number, No Date.
B-20
-------
APPENDIX C
"BEST TWO OUT OF THREE" PROCEDURES
An issue related to the repeatability test is "the fallacy of the best
two out of three," which was documented in a series of papers by Youden¹,²
and Lieblein.³ "The best two out of three" refers to a common practice
in the chemical laboratory of taking a third determination "to indicate
which of the other two is more likely to be off the mark." If two of the
three measurements are in close agreement, the experimenter discards the
remaining one as representing some gross error which renders it invalid.
The two authors showed: (1) that the spacing between the most discrepant
observation and its nearest neighbor is often many times the spacing
between the two closest observations, even when all represent valid
estimates of the same (population) value; (2) that use of the closest
pair of observations causes the experimental error (the underlying
standard deviation) to be underestimated; and (3) that the mean of the
closest pair out of three is subject to larger variability than the mean
of a fixed sample of two determinations. (It is clear that its variance
is larger than that of the mean of a fixed sample of three observations.)

It is not difficult to illustrate the "fallacy." Monte Carlo
experiments with three normal populations, all with the same mean,
showed⁴ that the standard error of the mean of two readings chosen
according to "the best two out of three" is approximately 0.8σ, where
σ is the error standard deviation common to all populations. This is to
be compared with σ/√2 = 0.71σ for two observations chosen completely at
random, and with σ/√3 = 0.58σ for three randomly chosen observations.
The reason for this is the asymmetry frequently shown among the three
observations.
1. W. J. Youden, "The Fallacy of the Best Two Out of Three," NBS Technical
   Bulletin, 33: 77-78 (1949).
2. W. J. Youden, "Sets of Three Measurements," The Scientific Monthly,
   77: 143-147 (1953).
3. J. Lieblein, "Properties of Certain Statistics Involving the Closest
   Pair in a Sample of Three Observations," Journal of Research of the
   National Bureau of Standards, 48: 255-268 (1952).
4. The exact value, based on a mathematical derivation given in Lieblein's
   paper, is 0.7986σ. Our results, based on 1000 trials, yielded the
   estimate 0.7945σ.
C-l
-------
Suppose that the values are placed in order, so that X_1, X_2, X_3
implies X_1 ≤ X_2 ≤ X_3. Let D represent the larger of the two
spacings (X_2 - X_1, X_3 - X_2) and let d represent the smaller. Then
the ratio Q = D/d is 1 when (X_1, X_2, X_3) are symmetrically
arranged, and is large when two of the observations are close together
compared with the distance to the third. Whenever Q is large, there
is a tendency to compromise the averaging effect associated with random
sampling, whereby large negative errors "cancel out" positive errors of
the same magnitude. Instead, there is more of a tendency to base the
estimate on pairs of observations "off in one corner." It is this
tendency that inflates the standard error of the sample mean. But note
that large Q's are not only associated with increased sampling errors;
a sample with a large Q is most likely to convince an unsuspecting
experimenter that one of the observations is an outlier.
The probability of obtaining a large Q is surprisingly high, even
when there are no outliers. In our Monte Carlo experiments, it was
shown that P(Q ≥ 5) ≈ .30. Indeed, the same experiments showed that
P(Q ≥ 10) ≈ .15. In other words, it is not uncommon to see two readings
close together, and a third some distance away (in fact, five or ten
times farther away), due to chance alone.
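These frequencies are easy to reproduce. A small Monte Carlo sketch of the δ = 0 case (all three readings from the same normal population) follows; the spacing-ratio Q and the closest-pair mean U are as defined in the surrounding text.

```python
import random

def closest_pair_trial(rng):
    """Draw three N(0,1) readings; return the mean of the closest pair
    and the ratio Q of the larger spacing to the smaller."""
    x1, x2, x3 = sorted(rng.gauss(0.0, 1.0) for _ in range(3))
    g1, g2 = x2 - x1, x3 - x2                 # the two spacings
    u = (x1 + x2) / 2 if g1 < g2 else (x2 + x3) / 2
    return u, max(g1, g2) / min(g1, g2)

rng = random.Random(12345)
results = [closest_pair_trial(rng) for _ in range(100_000)]
se_u = (sum(u * u for u, _ in results) / len(results)) ** 0.5  # true mean is 0
p_q5 = sum(q >= 5 for _, q in results) / len(results)
p_q10 = sum(q >= 10 for _, q in results) / len(results)
```

With this many trials, se_u comes out near Lieblein's exact value of 0.7986σ, and the two tail probabilities come out near the .30 and .15 quoted above.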
Knowing this fact, it might be expected that Q would be a
poor discriminator of outliers. Monte Carlo experimentation showed that
this was indeed the case. Suppose that one of the three populations
(say population 1) is shifted away from the others. Explicitly, let
μ_i and σ_i be the mean and standard deviation of the i-th population
(i = 1, 2, 3) and suppose

    (1) σ_i = σ (i = 1, 2, 3)
    (2) μ_2 = μ_3 = μ
    (3) μ_1 = μ + δσ

Since the justification of "the best two out of three" is its ability to
screen out outliers, it is of interest to know how sensitive the distri-
bution of Q is to the size of δ, and how often the procedure would
succeed in eliminating the observation from population 1 from the pair
which goes into the sample estimate. Table C-1 shows the results of our
experiments for various sizes of δ.
Let U be the mean of the two observations that are closest together,
i.e., the pair that determines the denominator of Q. The value Q_1
of Table C-1 is by definition
C-2
-------
Table C-1

BEHAVIOR⁵ OF "BEST TWO OUT OF THREE" SAMPLES
FOR SELECTED VALUES OF δ

    δ      E(U)        SD(U)    P(5 ≤ Q < 10)   P(Q ≥ 10)    Q_1
    0      μ           0.80σ        .147           .149      0.67
    0.5    μ + .14σ    0.82σ        .149           .157      0.64
    1.0    μ + .29σ    0.86σ        .138           .158      0.59
    2.0    μ + .40σ    0.98σ        .157           .150      0.39
    2.5    μ + .37σ    1.02σ        .156           .161      0.30
    3.0    μ + .31σ    1.03σ        .177           .175      0.21
    5.0    μ + .06σ    0.86σ        .212           .272      0.03

⁵ Based on Monte Carlo experiments; N = 1000 trials.
C-3
-------
    Q_1 = P(U contains the population 1 observation)

Note that Q_1 = 2/3 when δ = 0. Table C-1 also shows the behavior of
E(U) and SD(U) with δ.

The table shows that there is a substantial probability of including
the outlier unless its distribution is centered 5σ or more away from
the other two population means. This is associated with the behavior of
the distribution of Q; the latter is insensitive to δ until δ is at
least 5. Because of the positive probability of including the offset
population, U becomes a biased estimator. The amount of bias is
determined by the probability of including the observation from population
1, and the distance δ between population 1 and the others. Up to about
δ = 2, the size of the distance dominates, and the amount of bias
increases even though Q_1 is getting smaller. For larger δ, the
decreasing size of Q_1 takes over, and the amount of bias decreases.
A similar effect is observed with respect to SD(U), although the decrease
doesn't begin until δ ≈ 3. As δ increases and Q_1 approaches zero,
SD(U) will decrease and will asymptotically approach σ/√2.
The application to fuel economy testing is straightforward. If a
spurious observation is included among three, there is little chance of
weeding it out by taking the closest pair unless the outlier is about
5σ away. Thus judgment on the efficacy of the procedure depends to
some extent on the size of σ. Generally, we have been concerned with
the range 0.75% ≤ σ ≤ 2%, i.e., .15 mpg ≤ σ ≤ .4 mpg for a 20 mpg model.
This indicates that "the best two out of three" could be effective if an
outlier is over 0.75 mpg away from the others, and that fine a discrimina-
tion would be possible only if σ = .15. If the σ = .4 mpg estimate
were realistic, then there would be little chance of eliminating the out-
lier from a triple of observations unless it were 2 mpg removed from the
others.
Of course, the EPA proposed procedure is not "the best two out of
three." It differs in that one is not committed beforehand to taking
the third observation, but only does so when the difference between the
first two exceeds a preassigned criterion. Perhaps more important, "the
best two out of three" forces a discard of one of the observations,
rather than allowing all three to be used if (as in the repeatability test)
they are mutually close. Nevertheless, analysis of "the best two out of
three" sheds light on the frequency with which widely divergent, non-spurious
observations can appear in the same small sample. Since such occurrences
will tend to cause discards of good observations even when a preassigned
criterion is used, these conclusions are pertinent to analysis of the
repeatability test.
C-4
-------
APPENDIX D
BEHAVIOR OF THE EPA REPEATABILITY TEST
The repeatability test of the EPA proposed procedure was investigated,
by means of Monte Carlo experimentation, to estimate the variability of
fuel economy estimates yielded by the procedure, and to estimate the
probabilities of requiring the second and third stages.
A preliminary approximation was available for the probability of
advancing to the second stage. Note that

    Δ_12 = R/X̄,

where R is the sample range. Taking X̄ = 20 mpg, for example,

    P(Δ_12 > .021) = P(R > .42).

Table D, p. 614 of Duncan¹ yields E(R) = 1.128σ, SD(R) = 0.853σ. When
σ = 0.15 mpg (= .0075 × 20), the procedure is designed so that P(R > .42) = .05.
Suppose σ has a different value, say 1% of the mean (σ = 0.2 mpg). Then
μ_R ≈ .226, σ_R ≈ .085, and .42 ≈ μ_R + 2.3σ_R. Then² P(R > .42) could
not be large. On the other hand, if σ = 2% of the mean, i.e.,
σ = .4 mpg, then .42 ≈ μ_R, and thus P(R > .42) would be near 0.5
for that σ.
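Under the normal model these probabilities can also be computed exactly, since X_1 - X_2 is distributed N(0, 2σ²). The sketch below does so for the three σ values of interest; the resulting values line up with the Monte Carlo frequencies of third observations reported in Table D-1.

```python
from math import erf, sqrt

def p_range_exceeds(sigma, crit=0.42):
    """P(|X1 - X2| > crit) for two independent N(mu, sigma^2) readings;
    crit = 0.42 mpg is the 2.1% criterion applied to a 20-mpg mean."""
    z = crit / (sigma * sqrt(2))                 # X1 - X2 ~ N(0, 2 sigma^2)
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# sigma = 0.75%, 1%, and 2% of a 20-mpg mean
probs = {s: p_range_exceeds(s) for s in (0.15, 0.20, 0.40)}
```

The values come out approximately .048, .138, and .458 respectively, matching the design point (about 5%) and, roughly, the 13.7% and 47.6% of Monte Carlo trials in Table D-1 that went beyond two observations.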
A summary of the results of the Monte Carlo experiments is given
in Table D-1. No spurious observations were present in these runs. It
can be seen that the rough conjectures above were borne out. With
σ = 1% of the mean, it was found that the estimate was based on the
first two observations in 88% of the trials when there were no spurious
observations; i.e., the third observation was employed 12% of the time.
Under the same conditions, but with σ = 2% of the mean, the third
observation was required in 48% of the trials. Thus only in those cases
where σ is larger than anticipated is there substantial probability
of going beyond the first stage. When σ/μ = .75%, the design
condition, the probability of going to the second stage is about 5%,
as specified.
1. A. J. Duncan, Quality Control and Industrial Statistics, R. D. Irwin,
   Inc., 663 pp., 1952.
2. Chebychev's Inequality indicates P(R > μ_R + 2.3σ_R) < 0.19.
D-l
-------
Table D-1

MONTE CARLO EXPERIMENTS WITH THE REPEATABILITY TEST
(2.1% CRITERION)

    σ          NUMBER OF   AVERAGE        SD(μ̂)/σ    % TRIALS ENDING AFTER
    (% OF μ)   TRIALS      (μ̂-μ)/μ (%)               2 OBS    3 OBS    4 OBS
    0.75       7500        +.001          .7625       95.28    4.72     0.
    1.00       2500        +.027          .7667       86.32    13.60    0.08
    1.25       2500        +.024          .7685       77.08    22.24    0.68
    2.00       2500        -.021          .7806       52.36    39.80    7.84
When σ is in the range anticipated in establishing the 2.1% criterion,
there are almost no cases in which the fourth observation is required.
Indeed, even if employed in an "off-design" situation, such as σ = .02μ,
the fourth observation is required only 8% of the time. Thus the inclusion
of the third stage in the procedure has very little effect on its results.

The estimates of μ obtained under the various sizes of σ were
all substantially "on target." The deviations of those estimates from
the true value are given in Table D-1, expressed as a percent of the
true value. The differences shown represent random variability, since
the estimator is unbiased when no spurious observations are present.

The procedure does share with "the best two out of three" the
property of yielding estimators that are more variable than fixed
samples of size two. The standard deviations obtained, all within the
range (.75σ, .79σ), are to be compared with .7071σ for the fixed sample
of size two. Note, in comparison, that the standard deviation among
estimates yielded by "the best two out of three" is approximately 0.80σ
when no spurious observations are present. Some tendency is apparent
for the standard deviation to grow in disproportion to the growth in
σ, but it is not clear whether or not this is a chance phenomenon.
D-2
------- |