EPA
United States
Environmental Protection
Agency
Environmental Research
Laboratory
Corvallis OR 97330
EPA-600/5-78-016b
July 1978
Research and Development
Human Health Damages
From Mobile Source
Air Pollution
Additional Delphi
Data Analysis
Volume I
-------
RESEARCH REPORTING SERIES
Research reports of the Office of Research and Development, U.S. Environmental
Protection Agency, have been grouped into nine series. These nine broad cate-
gories were established to facilitate further development and application of en-
vironmental technology. Elimination of traditional grouping was consciously
planned to foster technology transfer and a maximum interface in related fields.
The nine series are:
1. Environmental Health Effects Research
2. Environmental Protection Technology
3. Ecological Research
4. Environmental Monitoring
5. Socioeconomic Environmental Studies
6. Scientific and Technical Assessment Reports (STAR)
7. Interagency Energy-Environment Research and Development
8. "Special" Reports
9. Miscellaneous Reports
This report has been assigned to the SOCIOECONOMIC ENVIRONMENTAL
STUDIES series. This series includes research on environmental management,
economic analysis, ecological impacts, comprehensive planning and fore-
casting, and analysis methodologies. Included are tools for determining varying
impacts of alternative policies; analyses of environmental planning techniques
at the regional, state, and local levels, and approaches to measuring environ-
mental quality perceptions, as well as analysis of ecological and economic im-
pacts of environmental protection measures. Such topics as urban form, industrial
mix, growth policies, control, and organizational structure are discussed in terms
of optimal environmental performance. These interdisciplinary studies and sys-
tems analyses are presented in forms varying from quantitative relational analyses
to management and policy-oriented reports.
This document is available to the public through the National Technical Information
Service, Springfield, Virginia 22161.
-------
EPA-600/5-78-016b
July 1978
HUMAN HEALTH DAMAGES FROM MOBILE SOURCE AIR POLLUTION:
ADDITIONAL DELPHI DATA ANALYSIS - VOLUME II
by
Steve Leung
Eureka Laboratories, Inc.
401 N. 16th Street
Sacramento, California 95814
Norman Dalkey
University of California
Los Angeles, California 90025
Contract No. 68-01-1889
Project Officer
John Jaksch
Criteria and Assessment Branch
Corvallis Environmental Research Laboratory
Corvallis, Oregon 97330
CORVALLIS ENVIRONMENTAL RESEARCH LABORATORY
OFFICE OF RESEARCH AND DEVELOPMENT
U.S. ENVIRONMENTAL PROTECTION AGENCY
CORVALLIS, OREGON 97330
-------
DISCLAIMER
This report has been reviewed by the Corvallis Environmental Research
Laboratory, Environmental Protection Agency, and approved for publication.
Approval does not signify that the contents necessarily reflect the views
and policies of the U.S. Environmental Protection Agency and the California
Air Resources Board, nor does mention of trade names or commercial products
constitute endorsement or recommendation for use.
-------
FOREWORD
Effective regulatory and enforcement actions by the Environmental
Protection Agency would be virtually impossible without sound scientific
data on pollutants and their impact on environmental stability and human
health. Responsibility for building this data base has been assigned to
EPA's Office of Research and Development and its 15 major field installations,
one of which is the Corvallis Environmental Research Laboratory (CERL).
The primary mission of the Corvallis Laboratory is research on the
effects of environmental pollutants on terrestrial, freshwater, and marine
ecosystems; the behavior, effects and control of pollutants in lake systems;
and the development of predictive models on the movement of pollutants in
the biosphere.
A. F. Bartsch
Director, CERL
-------
ABSTRACT
This report contains the results of additional analyses of the data
generated by a panel of medical experts for a study of Human Health Damages
from Mobile Source Air Pollution (hereafter referred to as HHD) conducted by
the California Air Resources Board in 1973-75 for the U.S. Environmental
Protection Agency (Contract No. 68-01-1380, Phase 1).
The analysis focussed on two topics: (1) assessment of the accuracy of
group estimates and (2) generation of a model of the group estimate as a
function of percent of population affected and degree of impairment.
Investigation of the first topic required a more thorough formulation
of statistical theory of errors as applied to group judgment than has been
available up to now. This formulation is presented in Section 5 of the report.
A major new feature of this theory is the postulation of a non-linear response
with estimated numbers similar to the non-linear relationship observed by
psychologists between physical magnitudes and subjective estimates.
The investigation of the second topic and the application of the theory
of errors to the data from the HHD studies are presented in Section 7. The
hypothesis for this model is that the level of impairment scales on the
logarithm of the concentration. The fit of the model to the data is surprisingly
good. This model can be used to simplify additional Delphi studies of other
pollutants. Thus, with the model, it is necessary to obtain estimates from
panelists only of the base case, and the remaining cases can be predicted by
the model.
This report was submitted by the California Air Resources Board in the
fulfillment of Contract No. 68-01-1889 under the sponsorship of the Environmental
Protection Agency. Work was completed as of September 30, 1976.
-------
CONTENTS
Foreword iii
Abstract iv
List of Figures vi
List of Tables viii
Abbreviations and Symbols ix
Acknowledgments xi
1. Executive Summary 1
2. Conclusions 8
3. Recommendations 10
4. Introduction 11
5. Conceptual Background 12
A. Theory of Errors and Group Judgment 12
B. Expected Error 18
C. The Psychonumeric Hypothesis 21
D. Self-Evaluation 32
E. Estimated Confidence Ranges 34
6. Research Methods 35
7. Results 38
A. Theory of Errors 38
B. Lognormality 41
C. Logarithmic Scaling 45
D. Correlation of Indices of Uncertainty 53
E. An Estimation Model 64
8. Discussion 74
References and Notes 77
-------
FIGURES
Number Page
1. Illustrative Distribution of Individual Responses ... 13
2. Illustration of Bias and Random Error 14
3. Distribution of Initial Answers 17
4. Invariance of E/σ 19
5. Cumulative Frequencies of z_m on Probability Scale 22
6. Density Distribution of e m 23
7. Relative Frequency of Digits Occurring as Second Digits
in Almanac Tables (3114 numbers) 26
8. Distribution of First Digits, Subject Responses
(5,037 Responses) 27
9. Average Standard Deviation as a Function of Log True 29
10. Average Error as a Function of Log True 30
11. Group Self-Rating 33
12. Frequency Distribution of z Scores for all
Best Estimates 43
13. Observed Standard Deviation S as a Function of the Log
Geometric Mean m for the Normal Population 46
14. Average Standard Deviation vs Percent Population 48
15. Average Standard Deviation s vs Average Mean m
for Oxidant 49
16. Average Standard Deviation s vs Average Mean m
for Nitrogen Dioxide 50
17. Average Standard Deviation s vs Average Mean m
for Carbon Monoxide 51
18. Illustration of Reduction in Correlation with
Pooled Populations 56
-------
Number Page
19. Average Estimated Interval vs Observed
Standard Deviation 60
20. Average Estimated Confidence Interval vs
Average Log Mean 61
21. Normalized Estimate ft as a Function of
the Parameter P 68
22. Illustration of Cumulative Distribution 71
23. Variation of x~ with Population 73
-------
TABLES
Number Page
1. The Ratios of S*/S and GM/Md for the Normal Population 40
2. Average Error and Confidence Limits from Theory of Errors 41
3. Average s and Average m of all Population Groups 52
4. Correlation Between Self-Rating and Confidence Range 54
5. Correlation Between R and Δy, for NO2, Disability,
Children 57
6. Correlations Between (Δy, s), (Δy, R), (s, R) 58
7. Average R by Percent Population, Degree of Impairment
and Pollutant Type 62
8. Overall Averages for Δy, R and s 63
9. Δy/3.28s for Three Pollutants 63
10. Normalized Estimates and Standard Deviation by Percent
Population, Degree of Impairment and Pollutant Type 67
-------
ABBREVIATIONS AND SYMBOLS
X_i -- individual response
x_i -- log X_i
i -- individual members of group ranging from 1 to n
T -- true answer
B_i -- a bias term
b_i -- log B_i
M_i -- mean of individual's distribution of responses
m_i -- mean of distribution of logarithm of individual's responses
μ -- theoretical mean of a distribution of log quantities
σ -- theoretical standard deviation
S -- observed standard deviation
s -- observed standard deviation of the logs
n -- total number of respondents or sample size
D_i(x) -- density distribution of individual's responses
Ex -- expectation
R_i -- random error of the individual's response
r_i -- random error of the logarithm of the individual's responses
z -- normalized form of the mean of the logarithm of responses
y -- psychological magnitude
HHD -- Human Health Damages
S* -- best estimator of standard deviation
-------
OX -- oxidant
CO -- carbon monoxide
NO2 -- nitrogen dioxide
AE -- theoretical average error
AE1 -- empirical average error
CL -- theoretical confidence limits
CL1 -- empirical confidence limits
χ² -- chi square
q_i -- observed frequency in cell i
p_i -- expected frequency in cell i
Y_U -- mean of upper limit
Y_L -- mean of lower limit
Δy -- difference between means of upper and lower limits, Y_U - Y_L
R -- average self-rating
r -- correlation for pooled population
CV(x,y) -- a generalized covariance of the means of subpopulation
F -- fraction of the population affected
c -- dosage
(z) -- normal density function
z -- normalized variate
DC -- Discomfort
Da -- Disability
I -- Incapacity
-------
ACKNOWLEDGEMENTS
This study was an extension to Contract No. 68-01-1889 from the U.S.
Environmental Protection Agency. Assistance in planning program objectives
and program direction was given by John A. Jaksch of the Environmental
Protection Agency's Corvallis Environmental Research Laboratory, Criteria
and Assessment Branch.
The authors are indebted to the staff members of the Air Resources Board.
Appreciation is extended to Jack Suder, Laura Storey Dick and Kingsley Macomber
for their assistance in administrative and contractual matters.
Grateful acknowledgement is made to John R. Kinosian, Chief of the
Technical Services Division, and John Holmes, Chief of the Research Division
for valuable advice and encouragement.
-------
SECTION 1
EXECUTIVE SUMMARY
This report contains additional analyses of data obtained in a Delphi
study of dose-response relationships for air pollutants conducted by the
California Air Resources Board for the Environmental Protection Agency*. The
study involved collection of estimates from a panel of 14 medical experts of
the dosage (concentration of a given pollutant experienced for one hour)
required to produce a given level of impairment in a given fraction of a
specific population at risk. The panel judgments generated in the study
constitute one of the best and most complete sets of data that have been
collected in a Delphi study of health effects. The additional analysis reported
in the present supplement has two aims: (a) formulating a more complete and
rigorous treatment of the reliability of the data, and (b) a more complete
investigation of the underlying model determining the relationships between
dose, type of disability, and fraction of population affected.
I. RELIABILITY
Data generated by Delphi studies are not equivalent to data generated by
statistical sampling procedures. Panels are not selected for representativeness,
but for expertise. Responses by panel members are taken to be their best
judgments on a question, and should be scored for correctness, not for what they
convey concerning the panelists. As a result, standard statistical analyses
are not directly applicable to evaluating the excellence of Delphi results.
Applicable standards are "quasi-statistical" in that indices are defined in
statistical language, but the criteria are derived from empirical studies.
To date the most extensive and relevant empirical studies have involved
experiments with university upper-class and graduate students, with general
* Referred to in the following as the HHD (Human Health Damages) Study.
-------
information and short-range prediction questions. The material was selected
so that the students would have some information on which to base an estimate,
but they were not expected to know the precise answers. The essential property
of the questions is that the accuracy of the responses could be measured
objectively.
The basic issue with regard to the applicability of the results of these
experiments to the HHD study is the closeness of the analogy between the
estimation process exhibited by the student subjects and the estimation process
exhibited by the panel of medical experts. The best evidence would consist of
assessing the panel estimates against extensive field studies with actual
pollution situations. Lacking this field data, several less definitive
considerations can be examined with the present data.
To provide a firm mathematical basis for the analyses, the standard
theory of errors has been extended to include the case of log normal distri-
butions, with the panel distribution of estimates assumed to arise from
independent individual distributions. In addition to random error, provision
is made for systematic error or bias. A general conclusion from the empirical
studies is that bias is a larger contributor to the total error than random
variability.
A. Log-normality of Distributions. The criteria for evaluating Delphi
estimates that have been deduced from the experimental data are based on
the assumption that the distributions of responses are log-normal. In the
experiments, the distributions were log-normal to a high degree of
approximation. The data from the HHD study are compatible with the
assumption that the medical panel's estimates are log-normally distributed,
providing that allowance is made for observed dependencies among individual
panelists' estimates for closely related cases.
-------
B. Observed and Estimated Standard Deviations and Medians. The
critical statistic derived from the laboratory studies is the ratio of
the sample standard deviation to the median. For a log-normal distribution,
the median coincides with the geometric mean. The maximum likelihood estimate
for the geometric mean is the anti-logarithm of the mean of the logarithms
of the responses. The maximum likelihood estimate of the sample standard
deviation is obtained by first computing the standard deviation of the
logarithms of the estimates, and then computing the standard deviation of
the original estimates from that of the logs.
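The two-step computation described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: natural logarithms, the divide-by-n variance, the function name `lognormal_summary`, and the sample panel values are all assumptions introduced here.

```python
import math

def lognormal_summary(estimates):
    """Return (geometric_mean, sd_from_logs, sd_over_median) for positive estimates."""
    logs = [math.log(x) for x in estimates]        # natural logs (an assumption)
    n = len(logs)
    m = sum(logs) / n                              # mean of the logarithms
    s2 = sum((v - m) ** 2 for v in logs) / n       # maximum likelihood variance of the logs
    gm = math.exp(m)                               # geometric mean = antilog of the mean log
    # Standard deviation of a lognormal variable with log-mean m and log-variance s2
    sd = math.sqrt((math.exp(s2) - 1.0) * math.exp(2.0 * m + s2))
    return gm, sd, sd / gm

# Hypothetical panel estimates in ppm; these are not data from the HHD study.
panel = [0.10, 0.15, 0.20, 0.25, 0.40, 0.80]
gm, sd, ratio = lognormal_summary(panel)
print(f"geometric mean = {gm:.3f}, sd = {sd:.3f}, sd/median = {ratio:.3f}")
```

Under these assumptions the ratio of standard deviation to median depends only on the variance of the logs, which is one reason to compute it from the log transform rather than from the raw responses.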
Examination of the HHD data indicates that there is a large enough
discrepancy between the statistic standard deviation/median computed from
the raw data and the statistic computed from the log transform data to
recommend that evaluations of panel estimates be based on the statistic
computed from the log transform, not on the statistic derived from the raw
data.
C. Conclusions from Pure Sampling Theory. Although the panel estimates
cannot be considered as sampling data, some conclusions can be drawn treating
the data as if it were the results of a sampling procedure. In particular,
some upper bounds can be set on the reliability of the data.
Assuming that all the variance in the estimates arises from random
variability, average expected errors for the three pollutants are 10% for
Oxidant, 21% for Nitrogen Dioxide, and 18% for Carbon Monoxide, with
confidence ranges of ± two standard deviations of 128%, 163%, and 153% for
the three pollutants respectively. These are gross averages intended to
show the order of magnitude of expected errors based on random sampling
theory alone. They are weak lower bounds to the expected error that would
be derived using the empirical relationships between standard deviation and
error.
-------
D. Correlations Between Indices of Uncertainty. Three different
indices of the panelists' confidence in their answers can be derived from
the data. The panelists were asked to rate their confidence in each estimate
on a scale from 1 to 10, where 1 meant "sheer guess" and 10 meant "I know
the answer". In addition, each panelist estimated high and low bounds on
their estimates where the high bound was defined as the concentration such
that no more than 5% of cases would exceed the bound; and the low bound was
defined as the concentration such that no more than 5% of cases would be
lower. The self-rating and the high and low bounds are explicit judgments by
the panelists of the reliability of their responses. In addition, the standard
deviation is an implicit measure of the degree of certainty. In the experimental
studies, the correlation between the average self-rating and the standard
deviation was .67.
For the HHD data, correlations between these three indices were rather
low, and in some cases were of the opposite sign from what would be expected.
For example, it would be expected that the correlation between self ratings
and estimated confidence ranges (the difference between the upper and lower
bounds) would be negative since a wide range indicates low confidence. In
some cases, the correlation between these two indices was positive.
Part of the explanation for the anomalous findings is that the panelists
were estimating the upper and lower bounds as fractions of their original
estimates, whereas in general, they were determining their original estimates
as if the concentrations were scaled in a logarithmic manner.
The analysis, then, lends some support to the assumption that both the
self-ratings and the upper and lower bounds are related to the degree of
credence that can be placed in the estimates. However, the data does not
support the assumption that the upper and lower bound estimates can be used to
establish firm confidence limits for purposes of policy formulation.
-------
E. Relation of Indices of Uncertainty to Problem Characteristics.
One question of interest is whether the panel showed systematic relationships
between the indices of confidence and the major variables -- type of pollutant,
level of impairment, percentage of population affected, and population type.
One clear result is that the panel is less certain in their estimates for
Nitrogen Dioxide than they are for Oxidant or Carbon Monoxide. This holds true
for all three indices: self-rating, confidence bounds, and standard deviation.
There also appears to be an interesting positive relationship between self-
ratings and percentage population. The panel expressed greatest confidence in
their estimates for the 90% population, less in estimates for 50%, and least in
estimates for 10%. This effect could indicate an interaction between the
degree of confidence and the percentage estimate; percent population may
represent another scale of "certainty".
The major effect observed with the confidence estimates (difference
between upper and lower bounds) is an increase roughly proportional to the size
of the concentration being estimated. Thus, to a first approximation, the panel
is expressing a similar degree of uncertainty for the basic cases.
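A short sketch may help show why a range that grows in proportion to the estimate corresponds to a constant degree of uncertainty on a logarithmic scale. The fractions used for the bounds (0.5 and 2.0) and the function names are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical assumption: each bound is a fixed fraction of the best estimate.
LOW_FRACTION = 0.5
HIGH_FRACTION = 2.0

def bounds(estimate):
    """Lower and upper bounds expressed as fixed fractions of the estimate."""
    return LOW_FRACTION * estimate, HIGH_FRACTION * estimate

def arithmetic_width(estimate):
    """Upper bound minus lower bound on the ordinary concentration scale."""
    low, high = bounds(estimate)
    return high - low

def log_width(estimate):
    """Upper bound minus lower bound on a base-10 logarithmic scale."""
    low, high = bounds(estimate)
    return math.log10(high) - math.log10(low)

for estimate in (0.1, 1.0, 10.0, 100.0):
    print(estimate, arithmetic_width(estimate), round(log_width(estimate), 3))
```

The arithmetic width scales with the estimate, while the logarithmic width is the same for every estimate, which is the sense in which the panel can be said to express a similar degree of uncertainty across cases.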
With regard to the relationship between self-rated confidence and population
type, the clearest conclusion is that the panel felt most confident about their
estimates concerning the normal population. For most of the other population
types the average self-ratings were relatively uniform except for what appear
to be special cases. For Oxidant, the special cases are asthma, chronic lung
obstruction, and viral bronchitis. For the first two, the self ratings were
high, which might indicate an affirmation that the cases were "serious" for
oxidant effects. The average rating for viral bronchitis was distinctly low.
II. SUBSTANTIVE ANALYSES
The purpose of the substantive analysis was to test the hypothesis that
the panelists were implicitly using a relatively simple model to formulate
-------
their estimates for the various cases associated with a given pollutant and
population type. Specifically, the hypothesis is that each panelist treats
one of the cases (degree of impairment and percent population) as a "base"
case and derives his estimates for the other cases by a systematic modification
of the base case.
A beginning for such a model is furnished by the assumption that the
panelists view the effects of air pollution as a threshold phenomenon, where
within a given population there is a distribution of concentration levels for
the onset of a given symptom. This submodel is rounded out by the assumption
that the distribution of thresholds is normal on the logarithm of the concentra-
tion. This model was used to analyse the data in the original study, and found
to fit the data rather well. However, the test was somewhat weak in that only
three points were used to fit the cumulative dose-response curves for a given
level of impairment.
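The threshold submodel described above can be sketched as a cumulative normal curve on the logarithm of concentration. The parameter values below (a median threshold of 0.3 ppm and a log standard deviation of 0.25) are hypothetical, as are the function names; base-10 logarithms are an assumption.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def fraction_affected(concentration, log_median, log_sd):
    """Fraction of the population whose threshold lies at or below the concentration."""
    z = (math.log10(concentration) - log_median) / log_sd
    return STANDARD_NORMAL.cdf(z)

def concentration_for_fraction(fraction, log_median, log_sd):
    """Concentration at which the given fraction of the population is affected."""
    z = STANDARD_NORMAL.inv_cdf(fraction)
    return 10 ** (log_median + log_sd * z)

# Hypothetical parameters, not values fitted to the HHD data.
log_median = math.log10(0.3)
log_sd = 0.25

# The three population fractions for which the panel gave estimates.
for fraction in (0.10, 0.50, 0.90):
    c = concentration_for_fraction(fraction, log_median, log_sd)
    print(f"{fraction:.0%} affected at about {c:.3f} ppm")
```

With only three fractions per curve, two free parameters leave a single degree of freedom for checking the fit, which illustrates why the original test was described as somewhat weak.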
In the present analysis, the hypothesis is extended to include both the
dose-response curve and the level of impairment in a single model. The
hypothesis is that the level of impairment (Incapacity, Disability, Discomfort)
also scales on the logarithm of the concentration. The fit of the model to
the data is surprisingly good, considering the large amount of random variation
to be expected in estimates with the degree of uncertainty expressed by the
panel.
The model can be used to simplify additional Delphi studies of other
pollutants. Thus, with the model, it is necessary to obtain estimates from
panelists only of the base case, and the remaining cases can be predicted by
the model. Similarly, if empirical (experimental) data can be obtained for
one of the cases, the model can be employed to furnish a rough extrapolation
technique to extend the data to other conditions.
The degree of impairment scale was defined independently for the different
-------
pollutants and different populations. What the model indicates is that the
respondents translated the specific symptoms into a more general scale which
applied to all cases. This suggests that the scale can be extended to include
a much wider range of degrees of impairment than the three points included in
the present study. The formulation of a more extensive scale would allow ex-
tension of the dose-response model to include a wider range of phenomena. One
immediate benefit would be the possibility of using data (e.g., lethality or
life-threatening cases) not applicable to the present scale.
III. PSYCHONUMERIC HYPOTHESIS
One new outcome of the present study is an hypothesis concerning the way
in which individuals make estimates of uncertain quantities. The data suggest
strongly that individuals treat uncertain numbers as if they were on a logarithmic
scale, rather than on the ordinary arithmetic scale. This assumption, which can
be called the psychonumeric hypothesis in analogy with the better known psycho-
physical phenomena, rationalizes the observed lognormality of the distributions
of estimates, and explains the experimentally observed exponential increase in
error with the size of the number being estimated.
However, the hypothesis poses significant questions concerning the
appropriate measure of error in formulating pollution criteria. Roughly speaking,
the significance of a given error for potential health effects is perceived by
the panel of experts as being related to the logarithm of the error, not to the
error in terms of concentration levels. Again, roughly speaking, at concentration
levels appropriate for CO, an error of 100 ppm is not perceived as 10 times
as serious as an error of 10 ppm, but only as twice as serious (using logarithms
to the base 10).
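The arithmetic behind this illustration can be checked directly. The sketch below takes the example literally, comparing the base-10 logarithms of the two errors expressed in ppm; the function names are hypothetical, and the result depends on the unit chosen.

```python
import math

def arithmetic_ratio(error_a, error_b):
    """How many times larger error_a is than error_b on the ordinary scale."""
    return error_a / error_b

def perceived_ratio(error_a, error_b):
    """Ratio of the base-10 logarithms of the two errors (both in ppm).

    Assumes error_b is not 1 ppm, since log10(1) is zero.
    """
    return math.log10(error_a) / math.log10(error_b)

print(arithmetic_ratio(100.0, 10.0))  # ordinary scale: 10 times as large
print(perceived_ratio(100.0, 10.0))   # logarithmic scale: about twice as large
```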
The question whether the psychonumeric phenomenon should be taken into
account in formulating policy is an issue that exceeds the terms of reference
of the present study.
-------
SECTION 2
CONCLUSIONS
The following conclusions are based on the additional analysis of Delphi
data presented in the 1975 EPA report on "Human Health Damages from Mobile
Source Air Pollution".
1. Distributions of individual estimates for air pollution concentrations
which may cause damages to human health are approximately log-normal.
2. The relationship between variance and geometric mean of individual
estimates is compatible with the psychonumeric hypothesis.
3. Insufficient data exist in the HHD study to add to the assessment of
self-ratings.
4. Estimated confidence ranges (upper and lower limits) are
probably large underestimates of the actual confidence limits.
5. A relatively simple model of the concentration as a function of
percent of population affected and degree of impairment fits the data
rather well.
Some implications of these results for Delphi studies in the air pollution
area are: (1) The conclusion of previous studies that the standard deviation
of the distributions of estimates is a valid indicator of the accuracy of the
geometric mean is compatible with the HHD data; however, some care must be
taken in applying this rule to estimates which are systematic variants of the
same estimate. (2) The value of obtaining estimates of expected range (upper
and lower limits) appears marginal. (3) Several possibilities for extending
the scope of estimation models by subjective scales (e.g., of degree of impair-
-------
ment, or of relative severity of an illness) look promising. (4) The
psychonumeric phenomenon poses serious issues concerning the role of judgment
and the notion of error in decision problems.
-------
SECTION 3
RECOMMENDATIONS
1. A simple dose-response model has been developed by this study. This
model opens the possibility of formulating a scale of degree of impairment
which would cover a much wider range of symptom states than those embodied
in the three categories of conditions used in this study, Incapacity,
Disability and Discomfort. A similar study should be undertaken by policy-
making organizations such as EPA to further refine this model. Generalization
of this model to fit in more comprehensive scales should not be a large step.
2. There are numerous decisions made by governmental agencies based on
judgments. The Delphi techniques used in this study have essentially provided
a statistical basis for using judgments in decision-making. Resolving the
question of how a policy organization such as EPA should use or react
to the Delphi data may involve a detailed analysis of the organization's
decision-making procedures.
3. There are some fundamental areas in the Delphi methodology requiring
some in-depth exploration. These areas include calibration of standard
deviations, self-ratings, estimated confidence ranges, and the psychonumeric
hypothesis. Investigation of these fundamental areas should be undertaken by
agencies such as the National Science Foundation.
-------
SECTION 4
INTRODUCTION
This project was first initiated in June, 1973 and completed in August,
1975 [1], as an in-house effort by the California Air Resources Board (CARB) with
funding support from the U. S. Environmental Protection Agency (EPA). At the
request of Dr. John Jaksch, the EPA contract officer, the project was extended
in order to include more comprehensive data analysis.
The initial analysis of the data generated by this study of dose-response
relationships for air pollutants was incomplete. The panel judgments obtained
in the study constitute one of the best and most complete sets of data that
have been collected in a Delphi study of health effects, and it fully warrants
the additional analysis presented in this report. The data analysis reported
in Section 7 is focused on two issues: (a) a more complete and rigorous
treatment of the reliability of the data; (b) a more complete examination of the
information contained in the data, e.g., concerning the interrelationships
between type of pollutant, type of impairments and population type.
The conceptual framework upon which the analysis of the HHD data is
based is presented in Section 5. The concepts discussed in Section 5
include theory of errors and group judgment, expected error, the psychonumeric
hypothesis, self-evaluation and estimated confidence ranges.
-------
SECTION 5
CONCEPTUAL BACKGROUND
A. Theory of Errors and Group Judgment. The basic approach to estimation
that is applied in the following analyses could be called the theory of errors.
There are other theoretical approaches to numerical estimation that could be
taken, e.g., treating the judgments as probabilistic statements, or treating
them as expressions of an underlying model in the minds of the respondents [2].
However, the theory of errors is the simplest theoretical structure that can
be applied to data of the type generated in the HHD study, and introduces the
fewest assumptions.
Given a set of responses X_i to a numerical question, where i = 1,...,n
indexes the individual members of a group, the theory of errors assumes that
each individual response X_i is a function of three components: the true
answer T, a bias term B_i, and a random error term R_i. Ordinarily, the function
is assumed to be additive, i.e.,
X_i = T + B_i + R_i (1)
The bias term is presumed to be a function of the specific question and of the
information which the individual has concerning the question. The precise
relationship between the bias term and amount of information is obscured by
the difficulty in defining amount of information for the case of an expert
answering an uncertain technical question. The assumption that the size of
the bias term is inversely related to the amount of information appears
plausible in a qualitative way.
The random error term is also considered to be a function of the specific
question and of the individual's information. But in addition, it is a
-------
random variable. If the question is repeated, T and B_i are constants, but R_i
will vary in a random way. In practice, direct observation of this variation
on repetition of a specific question is inhibited by memory effects, changes
of information (if the repetitions are separated by significant periods of
time), and by other more obscure factors such as degree of attention, motivation,
and the like. In order to measure the random error, it is usually necessary
to examine the responses to a large set of questions, and assume that a
common mechanism is generating the deviations from the true value. For this
reason, R_i is often called the "residual error" or "unexplained variation."
Despite the fact that individual distributions of responses to specific
questions are usually not available, it is possible to make inferences
concerning the nature of these distributions from data concerning sets of
responses generated by a group of individuals, providing the responses are
independent. This point will be discussed further in sub section 3.
It is usually assumed that the random error is sufficiently characterized
by its mean, which by definition is zero, and by its dispersion or standard
deviation. However, for many investigations, it is further assumed that the
random error is normally distributed. This more restricted assumption will
be made for most of the analyses which follow. The elementary theory of errors
model, then, is equivalent to asserting that the individual "selects" his
response out of a distribution that is normal around some mean that is displaced
by the bias, B_i, from the true response T, as illustrated in Figure 1.
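The elementary model can be illustrated with a small simulation of one individual answering the same question many times. The true value, bias, and error standard deviation below are hypothetical, and the idealized repetition ignores the memory effects noted earlier.

```python
import random
import statistics

def simulate_responses(true_value, bias, error_sd, repetitions, seed=0):
    """Draw repeated responses X = T + B + R, with R normal around zero."""
    rng = random.Random(seed)
    return [true_value + bias + rng.gauss(0.0, error_sd) for _ in range(repetitions)]

# Hypothetical values; the bias is negative, as in Figure 1.
TRUE_VALUE = 100.0
responses = simulate_responses(TRUE_VALUE, bias=-15.0, error_sd=10.0, repetitions=10_000)

mean_response = statistics.fmean(responses)   # estimates M = T + B
estimated_bias = mean_response - TRUE_VALUE   # recoverable only when T is known
spread = statistics.stdev(responses)          # estimates the dispersion of R

print(f"mean = {mean_response:.2f}, bias = {estimated_bias:.2f}, sd = {spread:.2f}")
```

The sketch also shows why bias cannot be separated from the true answer without an external check: the responses reveal only the mean M and the spread around it.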
FIG. 1 ILLUSTRATIVE DISTRIBUTION OF INDIVIDUAL RESPONSES
-------
The notion of bias has not received as much attention in psychological
literature as the notion of random error (they are often lumped together as
"error"); a simple illustration from a physical situation may make the idea
clearer. Suppose there is a marksman firing at a target who has not
compensated adequately for windage or distance. His pattern of shots might
look like the dots in Figure 2, which are clustered about a point displaced
from the center of the target. The displacement illustrated by the solid
line in the figure is the bias of the pattern; the offset from the center of
the pattern, illustrated by the dashed line, is the random error of the specific
shot labelled X. It should be clear from this illustration that the notions
of bias and random error are idealizations. The "biassing influences", such
as wind and adjustment for distance, are assumed to be constant throughout
the trial.
FIG. 2 ILLUSTRATION OF BIAS AND RANDOM ERROR
Referring back to Figure 1, if the bias is unknown, the process appears
to be a random selection of a response $X_i$ out of a distribution around a mean
$M_i$, where $M_i = T + B_i$. In Figure 1, $B_i$ is negative.
An additional consideration arises in dealing with free responses to
numerical questions, i.e., responses which are not limited to a few categories.
The distribution of random errors is usually not normal when plotted against
the quantity being estimated, but is skewed in the direction of increase of
the quantity. In part this arises from the fact that the quantities being
estimated have a natural zero but no natural upper bound. However, a more
basic psychological process appears to be operative here that will be discussed
at greater length on page 21 in the section on psychonumeric scaling. For the
moment, it appears to be the case empirically that over a wide variety of
types of questions, the distribution of responses is not normal; rather, the
distribution of the logarithms of the responses is normal. In technical terms,
the responses are lognormally distributed.
Lower case letters will be used to designate the logarithms of the
corresponding capitalized terms; thus $x_i = \log X_i$, $b_i = \log B_i$, $r_i = \log R_i$,
etc. Lower case letters will designate the observed (sometimes called the
sample) statistics of logarithmic quantities, m for the observed mean, s for
the observed standard deviation. Thus $m = \frac{1}{n} \sum_i x_i = \frac{1}{n} \sum_i \log X_i$, and
$s^2 = \frac{1}{n} \sum_i (x_i - m)^2$, where the $X_i$ are the observed responses of a group.
Small Greek letters will be used to designate the theoretical parameters of
distributions of logarithmic quantities; in particular $\mu$ designates the mean
of a distribution of log quantities, and $\sigma$ its standard deviation.
Rewriting (1) in logarithmic form, we have
    $$x_i = t + b_i + r_i \qquad (2)$$
Note that (2) is equivalent to asserting
    $$X_i = T \, B_i \, R_i \qquad (3)$$
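The multiplicative model (3) and the sample statistics m and s can be illustrated with a short simulation. This is only a sketch under stated assumptions: the function names, the true value of 1000, and the common log bias and log standard deviation are hypothetical values chosen for illustration, not figures from the Rand or HHD data.

```python
import math
import random


def simulate_responses(true_value, log_bias, log_sd, n, seed=0):
    """Draw n responses X_i = T * B_i * R_i.

    Assumption for illustration: every respondent shares the same log bias
    and the same log standard deviation.
    """
    rng = random.Random(seed)
    t = math.log(true_value)
    # Equation (2): x_i = t + b_i + r_i, with r_i normal around zero.
    logs = [t + log_bias + rng.gauss(0.0, log_sd) for _ in range(n)]
    return [math.exp(x) for x in logs]


def log_stats(responses):
    """Return (m, s): mean and standard deviation (1/n form) of the log responses."""
    logs = [math.log(x) for x in responses]
    n = len(logs)
    m = sum(logs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in logs) / n)
    return m, s


responses = simulate_responses(true_value=1000.0, log_bias=-0.3, log_sd=0.5, n=5000)
m, s = log_stats(responses)
print("observed m - t:", round(m - math.log(1000.0), 2))
print("observed s:", round(s, 2))
```

With a large sample the observed m - t should sit near the assumed log bias and s near the assumed log standard deviation.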
Figure 3 shows the distribution of responses of upper-class and graduate
student subjects to several hundred general information type questions, about
5000 responses in all*. To construct Figure 3, the responses to each question
were normalized by setting $z_i = (x_i - m)/s$. The figure shows the density
distribution of $e^{z_i}$. As is evident from inspection, the fit of the lognormal
approximation is very good.
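If we set $\mu_i = t + b_i$, the assumption of lognormality implies that the
density distribution of individual i's responses, $D_i(x)$, has the form
    $$D_i(x) = \frac{1}{\sigma_i \sqrt{2\pi}} \, e^{-\frac{(x - \mu_i)^2}{2\sigma_i^2}} \qquad (4)$$
The corresponding distribution for the non-transformed estimate $X_i$ is
    $$D_i(X) = \frac{1}{X \sigma_i \sqrt{2\pi}} \, e^{-\frac{(\log X - \mu_i)^2}{2\sigma_i^2}} \qquad (5)$$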
Given (4) or (5) and the assumption that individual responses are independent,
the distribution of the geometric mean G of a set of responses can be computed.
The assumption of independence of responses, at least on the first round, is
one of the basic features of the Delphi process, and is the justification for
the rule of anonymity.
Setting $G = (\prod_i X_i)^{1/n}$, then $m = \log G$. The distribution $D(m)$ is then
    $$D(m) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(m - \mu)^2}{2\sigma^2}} \qquad (6)$$
where $\mu = \frac{1}{n} \sum_i \mu_i$ and $\sigma^2 = \frac{1}{n^2} \sum_i \sigma_i^2$. Formula (6) is derived using theorem
2.3 in Aitcheson and Brown.
* The set of experiments included studies at the Rand Corporation and at the
Center for Computer-Based Studies at UCLA. They will be referred to
collectively hereafter as the Rand studies and the data generated by the
experiments will be called the Rand data [4,5].
FIG. 3 DISTRIBUTION OF INITIAL ANSWERS (density of observed responses and of the lognormal curve versus normalized estimate)
Formula (6) states that the geometric mean of a set of independently lognormally
distributed responses is itself lognormally distributed with a mean equal to
the average of the individual means, and variance $\sigma^2$ equal to 1/n of the
average of the individual variances. A specific set of responses, then, is a
sample out of the joint distribution of the individual responses, and the
sample mean m is the maximum likelihood estimator of the group mean.
B. Expected Error. Formula (6) displays one of the advantages of
group over individual estimation; namely, the standard deviation of the group
mean is smaller by a factor of $1/\sqrt{n}$ than the average of the standard
deviations of the individuals. This feature of group estimation--well known
in statistical sampling theory--is a precise way of stating that the process
of combining many estimates "washes out" the random variability of the
individual estimates.
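The reduction in variability described by formula (6) can be checked numerically. The sketch below assumes a hypothetical 16-member panel with unequal individual log standard deviations and zero bias; it compares the simulated standard deviation of m with the value $\sigma = \frac{1}{n} (\sum_i \sigma_i^2)^{1/2}$ implied by (6). The panel values and function name are illustrative assumptions only.

```python
import math
import random


def group_log_mean_sd(sigmas, trials=4000, seed=1):
    """Empirical sd of m = log(geometric mean) over repeated independent panels.

    sigmas[i] is the assumed log standard deviation of respondent i.
    """
    rng = random.Random(seed)
    n = len(sigmas)
    means = []
    for _ in range(trials):
        # Independent log responses, centred on zero (no bias assumed).
        logs = [rng.gauss(0.0, sd) for sd in sigmas]
        means.append(sum(logs) / n)
    centre = sum(means) / trials
    return math.sqrt(sum((v - centre) ** 2 for v in means) / trials)


sigmas = [0.4, 0.6, 0.8, 1.0] * 4  # hypothetical 16-member panel
empirical = group_log_mean_sd(sigmas)
theoretical = math.sqrt(sum(sd * sd for sd in sigmas)) / len(sigmas)  # sigma in (6)
print("simulated sd of m:", round(empirical, 3))
print("sd from formula (6):", round(theoretical, 3))
```

The two values should agree closely, and both are far smaller than any individual standard deviation in the assumed panel.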
If the distribution of the group mean were unbiassed, $\mu = t$, then the
only error to be expected would arise from the residual variability of the mean;
that is, the expectation of the error $|m - t|$ can be computed as
    $$Ex\,|m - t| = \sigma \sqrt{\frac{2}{\pi}} = \frac{\sqrt{2}}{\sqrt{\pi}\,\sqrt{n}} \left( \frac{1}{n} \sum_i \sigma_i^2 \right)^{1/2} \qquad (7)$$
where Ex designates expectation.
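The expected absolute error of an unbiassed, normally distributed group mean is $\sigma \sqrt{2/\pi}$, a standard property of the normal distribution. A minimal sketch comparing that closed form with a simulation follows; the value 0.2 for the standard deviation of the mean is an arbitrary illustrative assumption.

```python
import math
import random


def expected_abs_error(sigma_mean):
    """Closed-form expectation of |m - t| for a normal m centred on t."""
    return math.sqrt(2.0 / math.pi) * sigma_mean


def simulated_abs_error(sigma_mean, trials=20000, seed=2):
    """Monte Carlo estimate of the same expectation."""
    rng = random.Random(seed)
    return sum(abs(rng.gauss(0.0, sigma_mean)) for _ in range(trials)) / trials


sigma_mean = 0.2  # hypothetical standard deviation of the group mean
closed_form = expected_abs_error(sigma_mean)
simulated = simulated_abs_error(sigma_mean)
print("closed form:", round(closed_form, 3))
print("simulated:", round(simulated, 3))
```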
The assumption $\mu = t$ is equivalent to the assumption that the average
bias $\frac{1}{n} \sum_i b_i = 0$. This assumption is generally not fulfilled in experimental
studies of estimation. Figure 4 displays the relationship between observed
average error and observed standard deviation for the Rand experiments. The
upper line is the observed relationship (least squares fit) while the lower
line is the relationship that would be expected for $\mu = t$. Random error
accounts for only about 1/3 of the total error. Thus
    $$\text{Total expected squared error} = b^2 + Ex(r^2) \qquad (8)$$
FIG. 4 INVARIANCE OF $E/\sigma$ (mean group error versus standard deviation)
where $b = \frac{1}{n} \sum_i b_i$ = bias of the mean, and $r = \frac{1}{n} \sum_i r_i$ = random error of the
mean. Equation (7) states that
    $$Ex\,|r| = \sigma \sqrt{\frac{2}{\pi}}$$
which decreases as $1/\sqrt{n}$.
Thus, as n becomes large, the random error term in (8) approaches zero.
There is no such guarantee for the bias term. However, there is a weaker
guarantee for the bias term which displays a second advantage of group over
individual estimation. It is easy to show that
b2
-------
error are coupled, that is, as the standard deviation increases, the bias
increases. This is a plausible coupling on the assumption that both random
variation and bias have a single cause, namely incomplete information.
However, there is no theory at the present time that specifies the nature
of this coupling. Hence, the only way to estimate the bias in the HHD data
is to assume that the empirical relationship holds for estimates generated
by a panel of doctors on medical topics as well as for the estimates generated
by upper level college students in the Rand studies [4,5] with topics of general
information.
C. The Psychonumeric Hypothesis. The formulas in the two preceding
sections were derived on the assumption that the individual distributions are
lognormal. As remarked above, these individual distributions are not usually
observable directly. However, it is possible to give a fairly substantial
grounding to the assumption. Aitcheson and Brown have shown that if the
distribution of m (the observed mean) is lognormal, then the underlying
individual distributions are also lognormal. If there were no bias, i.e.,
if $\mu = t$, then the assumption that the observed set of means all come from a
common family of distributions could be tested by forming the statistic
$z_m = (m - t)/s$ and testing the distribution of $z_m$ for normality.
Figure 5 shows the cumulative distribution of $z_m$ from the Rand data
in a log probability diagram. The straight line is the cumulative distribution
predicted for a lognormal with $\mu = 1/3$ and $\sigma = 1/3$. The fit is relatively
good. Figure 6 shows the same curve plotted as a density distribution. The
curve from Figure 3 is plotted on the same scale for comparison. The existence
of bias is evident. Rather than $\mu = 0$ (no bias), $\mu = 1/3$. However, the
lognormality is well demonstrated, and the inference that the underlying
individual distributions are lognormal but biassed appears reasonable.
The lognormal distribution for individual estimates can be derived from
the central limit theorem and the assumption that the residual error can be
decomposed into many small errors that combine multiplicatively. However,
without some justification for the multiplicative combination, that route
appears highly ad hoc.
FIG. 5 CUMULATIVE FREQUENCIES OF $z_m$ ON PROBABILITY SCALE (line corresponding to $N(1/3, \sqrt{1/3})$; abscissa in percentiles)
FIG. 6 DENSITY DISTRIBUTION OF
A more general approach that seems to be more in line with a large body
of psychological work is to postulate a non-linear transformation between
physical magnitudes and the psychological scales on which these magnitudes
are estimated. Formula (2) above would be interpreted as asserting that for
the individual making an estimate, it is not the usual real number metric
which is "even" (additive), but rather the logarithmic transform of the real
number metric.
Non-linear transformations between physical quantities and psychological
magnitudes are well-known in many perceptual modalities, including the
logarithmic relationship between the physical intensity of sound and psychological
intensity that has given rise to the decibel scale of sound intensity. There
has been a long-standing debate as to the "general" form of such psychophysical
transformations, whether they are basically logarithmic as postulated by Weber-Fechner,
a power law as claimed by S. S. Stevens, or some more general form.
To my knowledge, the question whether a similar type of transformation exists
for numbers generated by more general types of estimation such as those
generated in non-perceptual estimates of uncertain quantities has not been
treated in the psychological literature. However, there is now a fairly large
amount of data which suggests that the numbers which "come to mind" in attempting
to answer uncertain questions also have a non-linear relationship to the
corresponding physical quantities. These data are summarized in figures listed
below:
1. The lognormality of observed distributions in Figure 3 and Figure 12*
on page 43.
2. The lognormality of the distribution of the means of group estimates
in Figure 5 and Figure 6.
3. The linear relationship between log error and standard deviation of
log estimates in Figure 4.
4. The non-linear scaling of standard deviations with size of the
estimate in Figure 9.
5. The non-linear scaling of error with size of estimate in Figure 10.
6. The logarithmic distribution of first digits of responses to numerical
questions.
* In Fig. 3 the distribution is plotted against $e^{z_m}$ whereas in Fig. 12 it is plotted
against $z_m$. Both graphs indicate a normal distribution on the logarithm of the
responses.
The sixth item is somewhat out of the context of the others, and needs
some background. One of the intriguing features of statistical tables such
as are found in almanacs and similar reference works, is the distribution of
the first digits of the numbers. There is a tendency for the first digits to
be distributed in a logarithmic pattern; specifically the frequency of digit
d (d = 1,...,9) is roughly proportional to log(d + 1) - log d. A somewhat
more general hypothesis would be that the numbers x are themselves distributed
as 1/x. This would imply that not only the first digits but the second and
subsequent digits would also have the appropriate distribution. For example,
the frequency of d as a second digit would be proportional to
    $$\sum_{i=1}^{9} \{ \log(10i + d + 1) - \log(10i + d) \}$$
This hypothesis has not been verified in detail. A relatively large body of
data would be needed to generate stable statistics, but a quick try of a few
thousand numbers selected more or less at random out of an almanac looks
favorable. The distribution of second digits is shown in Figure 7. The
dotted line shows the theoretically expected frequencies.
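The theoretical frequencies for first and second digits follow directly from the two expressions above. The sketch below tabulates them with base 10 logarithms; the function names are illustrative, and the second-digit sum is taken over first digits i = 1 to 9.

```python
import math


def first_digit_freq(d):
    """Expected frequency of d (1..9) as a first digit: log(d + 1) - log d."""
    return math.log10(d + 1) - math.log10(d)


def second_digit_freq(d):
    """Expected frequency of d (0..9) as a second digit, summed over first digits 1..9."""
    return sum(
        math.log10(10 * i + d + 1) - math.log10(10 * i + d) for i in range(1, 10)
    )


first = {d: first_digit_freq(d) for d in range(1, 10)}
second = {d: second_digit_freq(d) for d in range(0, 10)}

for d in range(1, 10):
    print(d, round(first[d], 3), round(second[d], 3))
```

Both sets of frequencies sum to one, and the second-digit frequencies are much flatter than the first-digit frequencies, which is why a larger body of data is needed to test them.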
More relevant to estimation, when several thousand responses to estimation
questions were analyzed in terms of the distribution of first digits, they
exhibited precisely the same logarithmic distribution as the data from the
almanac tables. The distribution of the first digits of the responses is
given in Figure 8. The only major departure from the theoretical distribution
is an evident preference for the digit 5.
One might be tempted to believe that the distribution of first digits
in the responses is being "driven" by the corresponding distribution in the
true answers which the subjects are trying to approximate. Indeed, the true
answers exhibited the logarithmic distribution. However, the two distributions
were completely independent. Whatever psychological mechanism generated the
distribution of responses, it was independent of the mechanism that generates
a logarithmic distribution of first digits in the almanac tables.
FIG. 7 RELATIVE FREQUENCY OF DIGITS OCCURRING AS SECOND
DIGITS IN ALMANAC TABLES (3114 NUMBERS)
FIG. 8 DISTRIBUTION OF FIRST DIGITS, SUBJECT RESPONSES
(5,037 RESPONSES)
The distribution of first digits, then, is partial confirmation of the
hypothesis that the real number system in the minds of the respondents, at
least for the task of estimating uncertain numbers, is distributed like 1/x.
Further confirmation of the hypothesis comes from an examination of the scaling
of the standard deviation with the size of the number being estimated.
Figure 9 is a plot of the average observed standard deviation against the
logarithm of the true answer to the question. Figure 10 is a corresponding
plot of average error against log true. Both figures show an increase in
the respective indices with increasing size of the estimates. It is interesting
to note that the peculiar dip in the vicinity of $10^6$ in both figures appears
to be an artifact introduced by the particular choice of questions for these
experiments, but no specific property of the questions has been identified
that would explain the anomaly.
Both the log error and the standard deviation of the log estimates are
invariant under a multiplicative scaling of the estimated quantity. Thus,
if $\{X_i\}$ is the set of estimates for one question, with a true answer T, and
$\{aX_i\}$ and aT are the corresponding responses and true answer for another
question, then the standard deviations of the log responses are the same for
the two sets of estimates, and the error $|m - t|$ is the same. This is
easily seen for the error since
    $$\frac{1}{n} \sum_i |\log aX_i - \log aT| = \frac{1}{n} \sum_i |x_i + \log a - \log T - \log a| = \frac{1}{n} \sum_i |\log X_i - \log T|$$
Figures 9 and 10 show immediately
that the subjects were not scaling their estimates in this simple multiplicative
manner. Roughly speaking, the data indicate that scaling is being carried
out at the level of the logs, not at the level of the original estimates. Put
in other terms, if estimates are generated for two questions, one of which has
an answer T and the other an answer $T' = aT$, then the responses $x_i' = \log X_i'$
will be scaled like $x_i \log a$, not as $x_i + \log a$. A theoretical scaling of error
based on this hypothesis is plotted as the dashed line in Figure 10. The fit
is obscured somewhat by the erratic nature of the data in the vicinity of
$10^6$, but otherwise the hypothesis that the subjects are scaling their responses
on the log appears to fit the data quite well.
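The invariance argument can be verified directly: multiplying every estimate and the true answer by the same factor a leaves both the log error and the standard deviation of the log estimates unchanged. The estimates, true value, and scale factor below are hypothetical numbers chosen only to demonstrate the identity.

```python
import math


def log_error_and_sd(estimates, true_value):
    """Return (|m - t|, s) computed on the logarithms of the estimates."""
    logs = [math.log(x) for x in estimates]
    n = len(logs)
    m = sum(logs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in logs) / n)
    return abs(m - math.log(true_value)), s


estimates = [120.0, 300.0, 80.0, 500.0, 150.0]  # hypothetical panel responses
true_value = 200.0                               # hypothetical true answer
a = 1000.0                                       # multiplicative scale factor

base = log_error_and_sd(estimates, true_value)
scaled = log_error_and_sd([a * x for x in estimates], a * true_value)
print("unscaled:", base)
print("scaled:  ", scaled)
```

Since purely multiplicative scaling would leave both indices flat, the rising curves in Figures 9 and 10 are what point to a different scaling mechanism.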
FIG. 9 AVERAGE STANDARD DEVIATION AS A FUNCTION OF LOG TRUE (abscissa: log T, base 10)
FIG. 10 AVERAGE ERROR AS A FUNCTION OF LOG TRUE (abscissa: log T, base 10; dashed line: psychonumeric hypothesis)
The psychonumeric hypothesis applied to dose-response estimates, for
example, is equivalent to stating that the relative significance of an additional
increase in dosage depends on the percentage increase, not on the absolute
increase. At the level of significant effects for Oxidant, about 0.8 ppm for
50% disability for a normal population according to the panel, an increase in
concentration of 0.2 ppm raises the expected percent of population affected to
90%. 0.2 ppm is a 25% increase over 0.8 ppm. For CO, the estimated dosage for
Normal, 50% population, disability, is 170 ppm. An increase of 0.2 ppm is an
increase of 0.12%, which the panel clearly finds inconsequential.
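The percentage comparison for Oxidant and CO is simple arithmetic, sketched here with the dosage figures quoted in the paragraph above; the function name is illustrative.

```python
def percent_increase(base_ppm, increment_ppm):
    """Increment expressed as a percentage of the base concentration."""
    return 100.0 * increment_ppm / base_ppm


ox_increase = percent_increase(0.8, 0.2)    # Oxidant: 0.2 ppm on a base of 0.8 ppm
co_increase = percent_increase(170.0, 0.2)  # CO: 0.2 ppm on a base of 170 ppm
print("OX:", round(ox_increase, 2), "%")
print("CO:", round(co_increase, 2), "%")
```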
Thus from several different directions, the lognormality of group and
individual distributions, the distribution of first digits in estimates, and the
scaling of distributions, with size of estimates, the same conclusion is arrived
at, namely, the psychonumeric hypothesis that subjects think of the real numbers
in terms of the log transformation.
One reason for emphasizing the psychonumeric relationship is the
consequences of such a non-linear transformation for the notion of error.
For example, if an individual is asked to estimate the intensity of a sound
in terms of psychological units (technically, in sones) and makes an error
of, say, 10% at the 100 decibel level (he should have said 100 sones, but
said 90) he has made an error of a factor of 10 in the physical intensity
of the sound. On the other hand, if he makes an error of 10% at the 30
decibel level (he says 27 instead of 30) the error in the physical magnitude
is "only" a factor of 2. More generally, if we express the percent error in
the psychological magnitude
-------
the case of the intensity of sound scale, for purposes of estimating noise
pollution, where this is defined in terms of psychological responses, the
decibel scale is quite appropriate. If there were reason for believing that
physiological responses to air pollution were related to the logarithm of the
concentration, rather than to the concentration directly, then there would be
a happy match between the psychological estimation process and the relevant
physiological processes.
As it turns out, the first approximation model of the HHD data relating
percent population affected and degree of impairment to dosage is logarithmic.
Whether this is a sufficient reason to relax criteria for error concerning health
damage estimates is a decision which the medical community may want to take
seriously.
D. Self-Evaluation. An additional indicator of accuracy can be obtained
by asking each respondent to rate his estimates. Such self-ratings have been
used routinely in Delphi exercises, and respondents do not appear to have
any particular difficulty in making the judgments. In the past these ratings
have not been tied to any particular theory of estimation and their use for
assessment of the solidity of group judgments has been largely informal.
In practice, each respondent has been left a good deal of freedom in
interpreting what the self-rating entails. In the Rand experiments, each
respondent rated each of his answers (on the first round) on a scale of 1
to 5, where 1 meant "sheer guess" and 5 meant "I know the answer". The
correlation between individual error and individual self-ratings was about
-.25, large enough to indicate an association, but not large enough to use
the self-rating as a figure of merit for individual answers. On the other
hand, the correlation between average self-ratings and error of the median
was about -.60, a respectable degree of association.
The correlation between average self-rating and observed standard
deviation was also quite high, about -.67. The correlation between standard
deviation and error was about .63. A combined index of average self-rating
and standard deviation gave a multiple correlation of .67 with error. The
improvement of the combination over the standard deviation alone is small,
a result of the high correlation between the two components. Figure 11 shows
a plot of average group error (average error of the median) against average
self-ratings for a large subset of the Rand data.
FIG. 11 GROUP SELF-RATING (average group error versus average self-rating)
It is difficult to integrate the self-rating with the theory of errors
outlined above. If the self-rating expresses mainly the uncertainty that is
exemplified in the variability of individual responses, then it is of limited
value in assessing group responses, since the standard deviation is a more
reliable measure of the variability. The high correlation between standard
deviation and self-ratings in the Rand experiments would tend to suggest
that the self-ratings are expressing roughly the same thing as the variability.
However, a number of studies have suggested that both bias and variability
are a function of the amount of knowledge that the individual has concerning
the given question.
E. Estimated Confidence Ranges. One additional index of certainty
was included in the HHD study. Each respondent was asked to give a high
and a low estimate, where these were interpreted roughly as the 95% and 5%
confidence limits on the respondent's "best estimate". To include these
estimates in the theoretical approach would require treating the responses
as probability judgments, with the best estimate interpreted, e.g., as the
individual's geometric mean for his probability distribution.
The foundations for this more extensive theory of estimation have been
developed. However, it did not appear worthwhile including it in the present
analysis of the HHD data. In other studies where such limits have been
elicited, and where objective answers to the questions were available, the
bounds have proved to be almost uniformly too narrow. Capen concludes
from his studies that, however the bounds are characterized, what is obtained
is something more like a 40% to 50% range, rather than a 90% range*.
* The terms "limit" and "range" can be confusing. The 95% confidence limit
refers to the value, $X_{95}$, of the quantity where 95% of the cases are expected
to be lower than this value. The 5% confidence limit, then, is the value of
X for which 5% of the cases are expected to be lower. The 90% confidence
range refers to the interval within which 90% of the cases are expected to
fall. The 90% range is generally taken to be the interval between the 5% and
95% confidence limits.
SECTION 6
RESEARCH METHODS
The previous section contains the conceptual framework within which
the analysis of the HHD data will be pursued. The very general form of the
theory of errors can, of course, be applied to practically any replicated
data. However, to make the results of previous studies applicable to the HHD
data, it is necessary to show a sufficiently close analogy so that it is
plausible to assume that the same type of estimation processes were involved
in the previous studies and in the HHD study. In effect this comes down to
determining whether the distributions are roughly lognormal, and whether
the estimates exhibit the psychonumeric form of scaling.
The HHD study interrogated a panel of 14 medical experts concerning the
dosage (concentration with one hour exposure) of three different air pollutants
required to produce three levels of severity of symptoms in 14 population
groups. These population groups are:
Normal
Children
Old Age
Heart Condition, Mild
Heart Condition, Severe
Hay Fever, Sinusitis
Influenza
Upper Respiratory Infection
Asthma
Acute Viral Bronchitis
Acute Bacterial Pneumonia
Chronic Respiratory Diseases
Mild Chronic Obstructive Lung Disease
Severe Chronic Obstructive Lung Disease
The dosage estimates were obtained for 0%, 10%, 50% and 90% of the
respective populations. This led to 504 separate estimates by each panelist.
In addition, each respondent was asked to estimate the upper and lower limits
for each concentration estimate, to estimate the duration (in hours) of
the resulting symptoms, and to express a self rating on a scale of 1 to 10.
Thus each panelist generated 2520 judgments, 35,280 estimates in all.
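The counts quoted above follow from the study design. The short tally below reproduces them; the variable names are illustrative, and the five judgments per case are the best estimate, the upper limit, the lower limit, the duration, and the self rating named in the text.

```python
pollutants = 3             # OX, NO2, CO
severity_levels = 3        # levels of severity of symptoms
population_groups = 14     # groups listed above
population_fractions = 4   # 0%, 10%, 50%, 90%
panelists = 14

# One best-estimate dosage per combination of the four design factors.
estimates_per_panelist = (
    pollutants * severity_levels * population_groups * population_fractions
)

# Best estimate, upper limit, lower limit, duration, self rating.
judgments_per_case = 5
judgments_per_panelist = estimates_per_panelist * judgments_per_case
total_judgments = judgments_per_panelist * panelists

print(estimates_per_panelist, judgments_per_panelist, total_judgments)
```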
The exercise was iterated one additional round. Estimates which did not
fulfill a criterion of "good agreement" (interquartile range < 1/2 median)
were resubmitted to the panel, along with a report of the median and inter-
quartile range from the first round. The estimates from this second round,
along with the carry-over of the estimates that were not iterated, constituted
the final outputs of the study.
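The "good agreement" rule can be expressed as a small check. This is a sketch, not the procedure actually used in the HHD study: the quartile convention (the default of Python's `statistics.quantiles`) is an assumption, and both example panels are hypothetical.

```python
import statistics


def needs_iteration(estimates):
    """True when the interquartile range is NOT smaller than half the median."""
    q1, _, q3 = statistics.quantiles(estimates, n=4)  # assumed quartile convention
    median = statistics.median(estimates)
    good_agreement = (q3 - q1) < 0.5 * median
    return not good_agreement


tight_panel = [0.9, 0.95, 1.0, 1.0, 1.0, 1.05, 1.1]   # hypothetical, close agreement
wide_panel = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0]    # hypothetical, wide spread

print("tight panel resubmitted:", needs_iteration(tight_panel))
print("wide panel resubmitted:", needs_iteration(wide_panel))
```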
The HHD study is one of the most thorough Delphi studies involving
professional respondents and dealing with a significant substantive topic
that has been conducted to date. During the study there was insufficient
time to examine the data for the light it could shed on a number of basic
issues relevant to the assessment of the validity of the panel estimates.
Validity assessments were based on the results of previous theoretical and
experimental studies involving different types of respondents and different
kinds of subject matter.
The following analyses are an attempt to investigate the analogy
between the HHD data and the experimental data more completely. The analysis
is based on the first round estimates. Primarily this is because the theory
of errors assumes that the estimates are independent, and the feedback of
first round results during iteration leads to non-independent estimates on
the second round. The feedback step leads to convergence (reduction in
standard deviation) and also leads to a small but significant increase in
the accuracy of the estimates. However, the convergence is relatively
much greater than the error reduction. The ratio of error to standard
deviation roughly doubled between round one and round two for the experimental
study cited. Thus the first round standard deviation is a more diagnostic
measure of the variability than the second round standard deviation. It
should be noted that this is true only if no additional information beyond
the results of the first round is introduced at the second round. This
condition was fulfilled in the HHD study.
The validity assessments for the HHD study were based on (1) the
experimentally observed relationship between standard deviation of the log
estimates and the log error as shown in Figure 4 and (2) an experimentally
observed relationship between self-ratings and error, Figure 11. These
experimental results were obtained using university upper-class and graduate
students as subjects and questions of general information such as might be
contained in an almanac or statistical abstracts as subject matter. A
recent study of professional petroleum engineers and analogous general
information questions shows that there is no significant difference in types
of results obtained as a function of the type of respondent [12]. In addition,
analysis of one type of professional estimate, namely bids on oil leases,
shows a close similarity between such judgments and the student estimates
for almanac questions, namely, lognormality of responses and variances
appropriate to the assumption of log scaling [13]. However, these results
were not examined in terms of group judgment.
SECTION 7
RESULTS
The first subsection below takes up some of the conclusions that can
be drawn directly from the theory of errors. This means basically what can
be concluded omitting the possibility of bias. The second subsection
examines the lognormality of the responses in the HHD study. The third
looks at the scaling of standard deviation as a function of the size of
the estimates. Subsection four deals with the correlation between the
three indices of uncertainty, standard deviation, self-rating, and estimated
confidence range. The final subsection takes up the possibility of formulating
a model of the estimates relating dosage to degree of impairment and percent
population.
A. Theory of Errors. As noted above, the average error AE computed
from the standard deviation of the mean is a lower bound to the expected
error. Similarly, confidence limits computed from the standard deviation of
the mean furnish a lower bound for the expected confidence range. The
criterion for acceptability (on the first round) imposed in the original
HHD report, namely S
-------
Turning to the HHD data, the majority of estimates for oxidant (OX) fit
the criterion of s < .5. None of the nitrogen dioxide (NO2) or carbon monoxide
(CO) estimates fit the criterion. The fact that a small percentage of the
NO2 and CO estimates fit the criterion in the form S < 1/2 median can be
ascribed to discrepancies between the median and the geometric mean, and to
discrepancies between S as observed and S as it would be computed from s.
The second type of discrepancy perhaps requires some explanation. Assuming
that the distributions are lognormal, the appropriate maximum likelihood
estimators for the distribution parameters are m for the logarithm of the
geometric mean and s for the standard deviation of the log estimates. The
geometric mean is then estimated by $e^m$ and the standard deviation by the
formula
    $$S^{*2} = e^{2m + s^2} (e^{s^2} - 1) \qquad (11)$$
Let S designate the standard deviation computed directly from the estimates,
$S^2 = \frac{1}{n} \sum_i (X_i - M)^2$, $M = \frac{1}{n} \sum_i X_i$. M, of course, is not the geometric
mean. In the HHD study, the median was used as a surrogate for the geometric
mean and the observed standard deviation as a surrogate for $S^*$, the former
because earlier studies had indicated that the median is slightly more
accurate than the geometric mean, and the latter because, on the basis of
examining only the OX data, S appeared to be a sufficiently good approximation
to $S^*$. That was a too hasty conclusion, as indicated by Table 1, which
displays the ratios $S^*/S$ and GM/Md for the Normal population. I, Da, DC
are abbreviations for Incapacity, Disability, Discomfort, respectively. The
very large discrepancy $S^*/S$ = 2.94 for OX, 0%, DC should be disregarded. One
of the estimates for this case is 0, and $S^*$ is highly sensitive to the
approximation for log 0 (which theoretically = $-\infty$). Otherwise, for OX,
the median is a relatively good approximation to the geometric mean, and $S^*$
is relatively close to S. However, for NO2 and CO, this is not the case.
Table 1 suggests that for the HHD data, assessments should be based on s
and m, not on the median and S. That is, the best estimate should be defined
as the geometric mean $e^m$, and the estimated standard deviation should be $S^*$,
the standard deviation computed by (11).
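The two ratios tabulated in Table 1 can be computed for any set of positive estimates as sketched below. The panel values are hypothetical and the function names are illustrative; estimates of zero must be excluded or approximated, since log 0 is undefined.

```python
import math
import statistics


def lognormal_summary(estimates):
    """Return (GM, S*): geometric mean e^m and the standard deviation from formula (11)."""
    logs = [math.log(x) for x in estimates]  # requires strictly positive estimates
    n = len(logs)
    m = sum(logs) / n
    s2 = sum((x - m) ** 2 for x in logs) / n
    gm = math.exp(m)
    s_star = math.sqrt(math.exp(2 * m + s2) * (math.exp(s2) - 1))
    return gm, s_star


def direct_sd(estimates):
    """S: standard deviation (1/n form) computed directly from the estimates."""
    n = len(estimates)
    mean = sum(estimates) / n
    return math.sqrt(sum((x - mean) ** 2 for x in estimates) / n)


panel = [0.2, 0.3, 0.5, 0.5, 0.8, 1.0, 1.5, 3.0]  # hypothetical dosage estimates (ppm)
gm, s_star = lognormal_summary(panel)
print("S*/S: ", round(s_star / direct_sd(panel), 2))
print("GM/Md:", round(gm / statistics.median(panel), 2))
```

Ratios near 1 indicate that the median and the directly computed S are adequate surrogates; ratios far from 1 indicate they are not.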
TABLE 1. The Ratios of S*/S and GM/Md for the Normal Population

                                    S*/S                    GM/Md
                                 Impairment              Impairment
Pollutant   % Population      I      Da     DC        I      Da     DC
OX               0           .98    1.10   2.94      .97    1.10    .90
                10          1.01    1.07   1.01     1.05    1.14    .97
                50           .89     .89   1.04     1.07    1.10    .90
                90           .73     .81   1.13     1.07    1.12    .89
NO2              0           .97    1.32   1.65     1.09     .82    .89
                10           .76     .88    .91     1.31    1.12   1.03
                50           .95     .92   1.19     1.32    1.04    .78
                90          1.10     .77   1.35     1.69    1.26    .88
CO               0           .84     .83    .77     1.30    1.11    .95
                10          1.11     .90    .94     1.45    1.17    .97
                50          1.10     .83    .88     1.45    1.20    .82
                90          1.05     .87   1.02     1.61    1.23    .94

I = Incapacity
Da = Disability
DC = Discomfort
This conclusion has some consequences for the evaluations presented in the
original HHD study. For example, as noted above, none of the NO2 and CO
estimates fulfill the criterion s < .5. The criterion s < .5 has a certain
amount of arbitrariness associated with it. It was selected in part because
of the fact that estimates which fulfill the criterion do not (on the average)
improve with iteration. The question whether this criterion needs revision in
the light of HHD data will be taken up in the conclusion section.
As an example of the application of the "pure" (assumption of no bias)
theory of errors to the HHD data, Table 2 displays the average s (averaged
over all cases) for the three pollutants, and the average error AE and associated
confidence limits CL computed from s: AE computed from (7), CL = ± 2s.

TABLE 2. Average Error and Confidence Limits from Theory of Errors

Pollutant   Average s    AE      CL      AE'     CL'
OX            .489       10%    ±28%     37%    ±6?%
NO2           .908       21%    ±63%     80%    ±238%
CO            .795       18%    ±53%     68%    ±215%

Table 2 also shows, for comparison, the AE and CL values (primed entries)
based on the empirical relationship between error and s (Figure 4). In
this case, CL' = ± 2s + b, and b = AE' - AE.
Again, whether these average AE and CL figures are acceptable (keeping
in mind they are lower bounds) would presumably depend upon the decision problem.
B. Lognormality. Figure 12 shows the frequency distribution of z
scores for all "best estimate" responses of the panel. The z score is defined as
$$z_i = \frac{x_i - m}{s}$$
where x_i is an individual log estimate, m is the mean of the log estimates,
and s is the observed standard deviation of the log estimates. Computations of
z scores were made for each set of 14 responses to a given pollutant, degree
of Impairment, percent population case; hence there were 504 separate
distributions treated. The average of these 504 distributions is shown in
Figure 12, where the ordinate indicates the relative frequency with which
the z score falls within the intervals indicated on the abscissa. If the
distributions tend to be log-normal, then the distribution in Figure 12
should be approximately normal. The solid line indicates the observed
distribution; the dotted lines show the expected normal distribution.
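
The z-score computation described above can be sketched as follows. This is a minimal illustration, assuming natural logarithms and the sample (n - 1) standard deviation, neither of which the text specifies; the function name and the sample values are hypothetical and are not data from the study.

```python
import math
import statistics

def log_z_scores(estimates):
    """Return z_i = (x_i - m) / s for the logs of a set of positive estimates."""
    logs = [math.log(v) for v in estimates]  # x_i: individual log estimates
    m = statistics.mean(logs)                # m: mean of the log estimates
    s = statistics.stdev(logs)               # s: standard deviation (n - 1 assumed)
    return [(x - m) / s for x in logs]

# Hypothetical set of panel estimates for one case (illustrative only).
example = [0.2, 0.3, 0.5, 0.5, 0.8, 1.0, 1.5]
z = log_z_scores(example)
```

By construction the z scores of each set have mean 0 and standard deviation 1, which is what allows the 504 separate distributions to be averaged on a common abscissa.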
Altogether, there are 7056 estimates represented in Figure 12. If we
test the hypothesis that each of these estimates was independently selected
from a normal distribution, the observed distribution fails a test of goodness
of fit at the .001 level, χ² = 230.8 with 10 degrees of freedom. However,
there is little justification for assuming that each estimate was independently
selected. A distinction has to be drawn here between independence for estimates
by different individuals, and independence for estimates by the same individual.
Some care was taken to assure that estimates by different individuals were
independent on the first round, and there is no evidence that the procedure was
not effective. An entirely different issue is involved in the question whether
individuals made independent estimates for each of the many cases they dealt
with, or whether in effect estimates for closely related cases were generated
by systematic modification of a single estimate.
We can reformulate the χ² statistic in the form
$$\chi^2 = n\left(\sum_{i=1}^{m} \frac{q_i^2}{p_i} - 1\right) \qquad (12)$$
where n is the number of independent estimates, q_i is the observed relative
frequency in cell i, p_i is the expected frequency in cell i, m is the number
of cells, and m-1 is the number of degrees of freedom. The expression in the
brackets can be interpreted as a measure of the degree of similarity between
the observed distribution and the expected distribution, where 0 indicates
perfect similarity. Define Γ² = Σ q_i²/p_i - 1. In the case of Figure 12, Γ² = .0327,
which indicates a high degree of similarity. However, the statistical significance
of this measure depends on n. For Γ² = .0327, an n of 560 would be required
to reject the hypothesis of normality at the .05 level.

FIG. 12 FREQUENCY DISTRIBUTION OF z SCORES FOR ALL BEST ESTIMATES
(Ordinate: relative frequency of z; abscissa: z, in intervals from -2.75 to 2.75; observed and theoretical curves.)

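
The arithmetic behind Equation 12 can be checked with a short sketch. The helper names and the four-cell frequencies below are hypothetical; the critical value 18.307 (χ² at the .05 level with 10 degrees of freedom) is taken from standard tables rather than from the report.

```python
import math

def gamma_squared(observed_rel, expected_rel):
    """Similarity measure: sum of q_i^2 / p_i, minus 1 (0 means identical)."""
    return sum(q * q / p for q, p in zip(observed_rel, expected_rel)) - 1.0

def chi_square(n, observed_rel, expected_rel):
    """Equation 12: chi-square as n times the similarity measure."""
    return n * gamma_squared(observed_rel, expected_rel)

def classical_chi_square(n, observed_rel, expected_rel):
    """Textbook form, sum of (O - E)^2 / E on counts, for comparison."""
    return sum((n * q - n * p) ** 2 / (n * p)
               for q, p in zip(observed_rel, expected_rel))

# Hypothetical relative frequencies (each list sums to 1).
obs = [0.10, 0.25, 0.40, 0.25]
exp = [0.15, 0.35, 0.35, 0.15]

GAMMA2_FIG12 = 0.0327        # value reported for Figure 12
CHI2_CRIT_05_10DF = 18.307   # tabulated critical value (assumption: standard tables)

chi2_fig12 = 7056 * GAMMA2_FIG12                          # all estimates treated as independent
n_required = math.ceil(CHI2_CRIT_05_10DF / GAMMA2_FIG12)  # smallest n that rejects at .05
```

With these inputs `chi2_fig12` is about 230.7, in line with the 230.8 quoted above, and `n_required` is 560.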
It is difficult to specify a reasonable figure for n, the number of
independent estimates. It is clear from the investigation of the model
discussed below that there are highly systematic interrelations between the
estimates for different percent population and degree of impairment cases. In
addition there are strong correlations between individual estimates across
pollutants and across populations, i.e., individuals who tend to give relatively
low estimates in one case also tend to give relatively low estimates in others,
etc.
A frequently employed measure of the degree of agreement within a set of
judgments is Kendall's coefficient of concordance, W.* It is defined for
rankings of a set of objects by a group of individuals. If R_ij is individual
i's ranking of the j'th object, define
$$Q = \sum_j \Big(\sum_i R_{ij} - \bar{R}\Big)^2$$
where R̄ is the average sum of ranks. Q is thus the sum of squared deviations
of the sum of ranks from the average. The coefficient of concordance is
defined as
$$W = \frac{Q}{Q_{\max}}$$
where Q_max is Q computed for the case where all the respondents are in complete
agreement on their rankings. W measures the divergence of the ranking from
perfect agreement (W = 1 for perfect agreement).
An approximate χ² can be derived for W by multiplying W by n(m-1) where
n is the number of respondents and m the number of objects. In using the χ²
tables, the number of degrees of freedom is taken to be m-1.
For example, consider the degree of agreement in the panel across populations
with an otherwise fixed case. For this computation, the role of
"individual" and "object" is reversed. Each population generates a ranking of
the individuals (the ranking of the individual estimates for the given population
and case). There are thus 14 rankings of 14 "objects". For the case CO,
50%, Disability, W is .884, which gives a χ² of 160.9. With 13 degrees of
freedom, W is significant well beyond the .001 level.
* Kendall, M. G., Rank Correlation Methods. Charles Griffin & Co., London, 4th Ed.,
1970.
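
A minimal sketch of the concordance computation follows, assuming untied rankings so that Q_max has the closed form n²(m³ - m)/12. The function names are illustrative only.

```python
def kendall_w(rankings):
    """Kendall's coefficient of concordance for n rankings of m objects.

    rankings: list of n lists, each holding the ranks 1..m given to the
    m objects by one respondent (no ties assumed).
    """
    n = len(rankings)
    m = len(rankings[0])
    rank_sums = [sum(r[j] for r in rankings) for j in range(m)]
    mean_sum = n * (m + 1) / 2                       # average sum of ranks
    q = sum((t - mean_sum) ** 2 for t in rank_sums)  # squared deviations
    q_max = n ** 2 * (m ** 3 - m) / 12               # Q under complete agreement
    return q / q_max

def w_chi_square(w, n, m):
    """Approximate chi-square for W, to be referred to m - 1 degrees of freedom."""
    return n * (m - 1) * w

# The case quoted in the text: 14 rankings of 14 "objects", W = .884.
chi2_co = w_chi_square(0.884, 14, 14)
```

For the quoted case `chi2_co` evaluates to about 160.9, matching the figure in the text.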
Another measure of agreement is the correlation coefficient. The product-
moment correlation between estimates for Normal and Children populations for
the case OX, 50%, Disability is .784, significant at well beyond the .01 level
(two-tailed test).
The high degree of agreement between individual estimates across cases
indicates that the estimates are not being selected at random out of a
common distribution -- there is a high degree of dependence among the responses.
With this in mind, it does not appear unreasonable to conclude that the data
do not reject the hypothesis that the distributions are approximately
lognormal.
On the other hand, it is also true that other distributions, such as a
beta distribution, would fit the data equally well. Here we have to rely on
the presumption borrowed from previous studies that the lognormal distribution
is a likely candidate.
C. Logarithmic Scaling. As was stated in Section A.4, log-normal
distributions are to be expected if the estimates are scaled on a logarithmic
transform of the physical quantity being estimated. To that extent, Figure 12
gives some weight to the psychonumeric hypothesis. More direct evidence is
furnished by Figure 13 which displays the observed standard deviation s as
a function of the mean m for the Normal population. The dashed line is a
continuation of the empirically derived relation between standard deviation
and true answer displayed in Figure 9.
The data tends to lie along the empirical curve; however Figure 13 cannot
be construed to establish the hypothesis of logarithmic scaling for the standard
deviation. The data is somewhat sparse for this purpose. Each of the different
pollutant cases forms a separate block. In addition, there are rather mysterious
within-block uniformities. In particular, the Incapacity cases for NO2 and the
Disability cases for CO seem "out of line".

FIG. 13 OBSERVED STANDARD DEVIATION s AS A FUNCTION OF THE
LOG GEOMETRIC MEAN m FOR THE NORMAL POPULATION
(Ordinate: s; abscissa: m. Symbols distinguish percent population (10%, 50%, 90%), degree of impairment (Incapacity, Disability, Discomfort), and pollutant (OX, NO2, CO).)

The average standard deviations for the various percent population and
degree of impairment combinations are displayed in Table 3. In addition,
the corresponding averages of the means are listed. Inspection of Table 3
shows a rather anomalous pattern of variation of s with percent population
and degree of impairment. Only NO2 exhibits the pattern that would be
expected from scaling, namely increase of s with percent population, and
increase of s from DC to I. Figures 14 (a) - (c) show s plotted against
percent population, and Figures 15 - 17 show s plotted against m, for the
various pollutants. If the three pollutants showed similar anomalies it
might be tempting to ask what generated them. As it stands, the anomalies
are puzzling.
Summing up the picture on logarithmic scaling: Figure 13 is compatible
with the hypothesis of logarithmic scaling, and taken in conjunction with
lognormal distributions, is some evidence for assuming that the same general
estimation processes are operative in professional judgments concerning
dosage as in student responses to almanac questions. However, the data for
variations within pollutant categories presents a rather muddled picture.
The theory of errors in log form, and concomitantly the hypothesis of log
scaling, is based on the assumption of independent estimates. The estimates
concerning special cases within pollutants are clearly not independent. The
implications of this lack of independence will be taken up after the investi-
gation of models below.

FIG. 14 AVERAGE STANDARD DEVIATION vs PERCENT POPULATION
(Panels: A. Oxidant; B. Nitrogen Dioxide; C. Carbon Monoxide. Curves for I = Incapacity, Da = Disability, DC = Discomfort; abscissa: percent population, 10%, 50%, 90%.)

FIG. 15 AVERAGE STANDARD DEVIATION s vs AVERAGE MEAN m FOR OXIDANT

FIG. 16 AVERAGE STANDARD DEVIATION s vs AVERAGE MEAN m FOR NITROGEN DIOXIDE

FIG. 17 AVERAGE STANDARD DEVIATION s vs AVERAGE MEAN m FOR CARBON MONOXIDE

TABLE 3: Average s and Average m of All Population Groups

                                 OX                NO2                CO
% Population   Impairment     s       m         s       m         s       m
    10             I         .447   -.539      .910   1.480      .912   5.399
                   Da        .454   -.751      .836   1.056      .699   4.874
                   DC        .515  -1.269      .809    .641      .757   4.192
    50             I         .456   -.048      .958   1.755      .808   5.672
                   Da        .446   -.395      .888   1.398      .678   5.151
                   DC        .525   -.824      .850   1.016      .800   4.511
    90             I         .449    .236     1.011   2.089      .899   5.825
                   Da        .430   -.189      .965   1.687      .682   5.314
                   DC        .496   -.531      .942   1.287      .819   4.741

D. Correlation of Indices of Uncertainty. The three measures, standard
deviation, estimated confidence range, and self-rating, are all related to the
amount of uncertainty the respondent has concerning a given estimate, and
hence are related indirectly to the accuracy of the estimate. The standard
deviation is related to accuracy by the theory of errors; the estimated
confidence range and the self-rating are related by psychological assumptions con-
cerning the perception on the part of the individual of the relative "solidity"
of the evidence he has for his estimate. This psychological theory is not
sufficiently advanced to make quantitative predictions concerning the relation
between self-ratings or estimated confidence ranges and error. The empirical
relationship for self-rating and error observed in the Rand studies was
described above.
Since error could not be measured directly for the HHD data, the only
analysis available was investigating the correlation among the three indices.
The only pair out of the three for which correlations for individuals could be
computed is confidence range and self-rating. Estimates were pooled across
percent population to give 56 data points for each correlation; otherwise
correlations were computed separately for each pollutant, degree of impairment,
and population type. Thus 126 correlations were computed. These are displayed
in Table 4.
Inspection of Table 4 shows that the correlations are uniformly rather
small; the largest in absolute value is -.244. Since R increases with certainty and Δy decreases,
the correlations should be negative. About 1/3 of the entries in Table 4 have
the opposite sign from what is expected. None of them reach statistical
significance assuming 56 independent cases. At first sight, this appears to be
dubious support for the hypothesis that both R and Δy measure the degree of certainty
that the respondent has in his estimate. However, in pooling the responses
across percent population, a significant reduction in the size of the correlations
was introduced. This results from the fact that the average R is almost constant
across the four subsets of data, but the average Δy declines sharply from 0% to
90%.

TABLE 4: Correlation Between Self-Rating and Confidence Range

Population            OX                        NO2                        CO
   Type         I       Da      DC        I       Da      DC        I       Da      DC
    1         .096    .099   -.0628    -.038   -.108    .108     -.093   -.024   -.040
    2         .111    .090    .128     -.051   -.244   -.037     -.117   -.051   -.083
    3        -.077   -.028   -.091     -.052    .062   -.059     -.066    .022   -.043
    4        -.001    .022    .012     -.034   -.063   -.110     -.050   -.021   -.036
    5        -.079   -.051   -.026     -.071   -.024    .047     -.063   -.075   -.199
    6        -.012   -.194   -.191     -.103   -.066   -.067     -.071   -.004   -.056
    7        -.151   -.111   -.092     -.073    .125   -.066     -.051    .013   -.004
    8        -.164   -.140    .012     -.006    .001    .040     -.053   -.027   -.034
    9        -.018   -.037   -.084     -.134   -.094    .028     -.006    .058   -.004
   10        -.085   -.041   -.002     -.105    .040    .134     -.058   -.012   -.040
   11        -.028   -.030   -.042     -.102    .168    .243     -.058   -.010   -.059
   12        -.117    .032   -.009     -.169    .177    .066     -.053    .012   -.023
   13        -.123    .025   -.032     -.093   -.055    .026     -.031   -.010   -.055
   14        -.080   -.092   -.078     -.073   -.078    .180     -.094   -.033    .049

I = Incapacity
Da = Disability
DC = Discomfort

The correlation for pooled populations, r_xy, is related to the correlations
within the subpopulations r_{x_i y_i} (where i indexes the subpopulations) by the
formula
$$r_{xy} = \sum_i \frac{n_i}{n}\,\frac{s_{x_i} s_{y_i}}{s_x s_y}\, r_{x_i y_i} + \frac{CV(x,y)}{s_x s_y} \qquad (13)$$
s_x and s_y are the standard deviations of x and y in the total population,
s_{x_i} and s_{y_i} are the standard deviations in the subpopulations. CV(x,y) is a
generalized covariance of the means of the subpopulations where averaging is
accomplished by the weights n_i/n. The total number of cases is designated as
n, and n_i is the number of cases in subpopulation i. Because R is virtually constant,
CV(R, Δy) is essentially 0. On the other hand, s_Δy is uniformly larger than the
s_{Δy_i}. Hence r_{RΔy} will generally be much smaller than the r_{R_i Δy_i}. This effect
is illustrated graphically in Figure 18. Here three subpopulations are shown,
in each of which r_{R_i Δy_i} = 1. Obviously r_{RΔy} ≠ 1. In fact it is .577 for the
case illustrated in Figure 18.
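
The pooling effect expressed by Equation 13 can be reproduced with a small numerical sketch. The three subpopulations below are hypothetical (they are not the configuration of Figure 18): x has the same mean in every subpopulation, as R nearly does, while the mean of y shifts from one subpopulation to the next. Population (divide-by-n) moments are assumed so that the decomposition holds exactly.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    """Population covariance (divides by n)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def sd(v):
    return math.sqrt(cov(v, v))

def corr(x, y):
    return cov(x, y) / (sd(x) * sd(y))

def pooled_corr_from_parts(groups):
    """Right-hand side of Equation 13 for a list of (x, y) subpopulations."""
    all_x = [a for x, _ in groups for a in x]
    all_y = [b for _, y in groups for b in y]
    n = len(all_x)
    sx, sy = sd(all_x), sd(all_y)
    within = sum(len(x) / n * sd(x) * sd(y) / (sx * sy) * corr(x, y)
                 for x, y in groups)
    gx, gy = mean(all_x), mean(all_y)
    # Generalized covariance of the subpopulation means, weights n_i / n.
    cv = sum(len(x) / n * (mean(x) - gx) * (mean(y) - gy) for x, y in groups)
    return within + cv / (sx * sy)

# Hypothetical data: perfect correlation inside each subpopulation.
groups = [([0, 1, 2], [0, 1, 2]),
          ([0, 1, 2], [3, 4, 5]),
          ([0, 1, 2], [6, 7, 8])]
pooled_x = [a for x, _ in groups for a in x]
pooled_y = [b for _, y in groups for b in y]
r_direct = corr(pooled_x, pooled_y)
r_formula = pooled_corr_from_parts(groups)
```

Each subpopulation has correlation 1, yet the pooled correlation falls to about .32, because the spread of y in the pooled data is much larger than within any subpopulation while the covariance of the means is zero.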
This effect was checked by computing correlations separately for the
0%, 10%, 50%, and 90% subpopulations for the case NO2, Children, Disability.
The correlation with these four subpopulations pooled is .24. Table 5 lists
the four subpopulation correlations separately. The correlations for 50%
and 90% are significant at the .05 level (two-tailed test) for n = 14.
Whether the stringent two-tailed test should be used here is unclear,
since the sign of the correlation is also a part of the hypothesis being tested.
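
The significance statement can be checked with the usual t transformation of a correlation coefficient, t = r·sqrt(n - 2)/sqrt(1 - r²). The sketch below applies it to the four Table 5 values; the critical value 2.179 (two-tailed, .05 level, 12 degrees of freedom) comes from standard t tables, and the report does not say which test was actually used.

```python
import math

T_CRIT_TWO_TAILED_05_DF12 = 2.179  # tabulated Student t (assumption: standard tables)

def t_from_r(r, n):
    """t statistic for testing a correlation r from n cases against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Subpopulation correlations from Table 5 (NO2, Disability, Children), n = 14 each.
table5 = {"0%": -0.507, "10%": -0.445, "50%": -0.612, "90%": -0.626}

significant = {label: abs(t_from_r(r, 14)) > T_CRIT_TWO_TAILED_05_DF12
               for label, r in table5.items()}
```

Under these assumptions only the 50% and 90% correlations exceed the critical value, in agreement with the text.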
There was insufficient time to rerun the correlations separately for all
cases. There is no reason to suspect that all cases would turn out as favorably
as the results in Table 5. About the strongest conclusion that can be reached
with the presently analyzed data is that the results favor the hypothesis of
negative correlation between estimated confidence range and self-rating.

FIG. 18 ILLUSTRATION OF REDUCTION IN CORRELATION
WITH POOLED POPULATIONS
(Both axes run from 0 to 1.0; abscissa: X.)

TABLE 5: Correlation Between R and Δy, for NO2, Disability, Children

Subpopulation    r(R,Δy)
     0%           -.507
    10%           -.445
    50%           -.612
    90%           -.626

Correlations were also run for the three group indices, Δy = Y_u - Y_l
(difference between the mean of the upper limit and mean of the lower limit),
s, and R (average self-rating). Cases for these correlations were pooled across
percent population and population type, giving 56 data points per correlation.
This produced 9 different correlations for each pair type. These are displayed
in Table 6.
As might be expected with aggregated data, correlations in Table 6
are generally higher than those in Table 4. This is contrary to the effect of
pooling data, which, because of the virtual constancy of R across populations,
decreases correlation; aggregating data by taking averages tends to increase
the correlations. However, a slightly higher percentage have a different sign
from the expected; self-rating should be negatively correlated with both estimated
confidence range and standard deviation; estimated confidence range should be
positively correlated with standard deviation. Thirteen of the twenty-seven
correlations are significant at the .05 level (one-tailed test); five of these
have the wrong sign.

TABLE 6: Correlations Between (Δy,s), (Δy,R), (s,R)

Pollutant   Impairment    (Δy,s)     (Δy,R)     (s,R)
OX              I          .197       .025       .284*
                Da         .158       .016       .214
                DC         .162       .069      -.237*
NO2             I         -.247*     -.142       .317*
                Da        -.274*     -.022      -.309*
                DC        -.274*      .109      -.366*
CO              I          .369*     -.280*     -.142
                Da         .534*     -.242*     -.259*
                DC         .215      -.216      -.010

* Significant at .05 level (one-tailed test)
I = Incapacity
Da = Disability
DC = Discomfort

Figure 19 graphs Δy against s; Figure 20 graphs Δy against m. From
Figure 20 it is evident that the panelists are scaling their confidence
ranges roughly in proportion to their X estimates. The correlation between
Median X and Δy (Median Y_u - Median Y_l) for all Oxidant cases except 0%
population (126 cases) is .61. There is thus a discrepancy between the
logarithmic scaling of the concentration estimates, and the arithmetic scaling
of the confidence ranges. In addition, from Figure 20 it can be observed
that the slope of the relationship between m and log (Y_u - Y_l) is different
(lower) within pollutants than it is across pollutants. This explains, in
part, the peculiar within-pollutant behavior of Δy in Figure 19. Since s
also scales on the log transform, it appears likely that the negative
correlations between Δy and s for the NO2 cases in Table 6 arise from the
same discrepancy between log scaling on s and arithmetic scaling on Y_u - Y_l.
Table 7 is a display of average R broken out in percent population and
degree of impairment categories. There is an evident increase of self-ratings
between estimates for 10% population and 90% population. This is an interesting
trend which might be paraphrased by the statement that the respondents link
their assurance in their estimates with the percentage of the population they
expect to be affected. One hypothesis might be that this is a form of
synaesthesia -- percent population is one scale for "certainty". Otherwise
Table 7 shows no clear pattern that can be related to Δy or s. There is one
feature on which all three indices agree, namely that the estimates for NO2
are less certain than those for OX and CO. The overall averages for the
three indices for each of the three pollutants are shown in Table 8. The
comparison of OX and CO is a virtual standoff: essential equality on R,
greater uncertainty for OX on Δy, and greater uncertainty for CO on s.

FIG. 19 AVERAGE ESTIMATED INTERVAL vs OBSERVED STANDARD DEVIATION
(Ordinate: Δy; abscissa: s. Symbols distinguish OX, NO2, and CO.)

FIG. 20 AVERAGE ESTIMATED CONFIDENCE INTERVAL vs AVERAGE LOG MEAN
(Ordinate: Δy; abscissa: m. Symbols distinguish OX, NO2, and CO.)

TABLE 7: Average R by Percent Population, Degree of
Impairment and Pollutant Type.

                                         Nitrogen     Carbon
% Population   Impairment    Oxidant     Dioxide     Monoxide
    10             I           4.07        3.91        4.36
                   Da          4.48        3.84        4.21
                   DC          4.65        4.44        4.26
    50             I           4.26        3.97        4.79
                   Da          4.47        3.90        4.61
                   DC          4.64        4.16        4.59
    90             I           4.44        4.23        4.96
                   Da          4.64        4.01        4.76
                   DC          5.17        4.14        4.73

I = Incapacity
Da = Disability
DC = Discomfort

TABLE 8: Overall Averages for Δy, R and s

          OX      NO2      CO
R        4.54     4.07    4.59
Δy        .86     1.03     .64
s         .47      .91     .80

This is a difficult subsection to summarize. The theory of errors gives
a central role to the standard deviation, and to this extent, it might be
considered the most reliable of the three indices. Despite some anomalous
behavior within pollutants, it exhibits a fairly clear scaling property with
m (Figure 13) which Δy does not (Figure 20). If we assume that s is roughly
the average of the individual standard deviations then Δy should be about
3.28s. For a normal distribution, the 95th percentile occurs at m + 1.64σ,
and the 5th percentile at m - 1.64σ. Thus the 90% confidence range is 3.28σ.
Since Y_u is defined as the dosage level which the respondent thought no more
than 5% of cases would exceed, with a corresponding definition for Y_l, Y_u - Y_l
should correspond to the 90% confidence range. The ratio Δy/3.28s for the
overall averages of these two indices is displayed in Table 9.
TABLE 9: Δy/3.28s For Three Pollutants

 OX     NO2     CO
.52     .35    .24

These ratios are well in line with Capen's conclusion that estimated confidence
ranges are underestimates by a factor of two or more [13].
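
As a rough check, the ratio Δy/3.28s can be recomputed from the rounded overall averages in Table 8. The dictionary layout and function name below are illustrative; only the numbers come from the tables.

```python
Z_95 = 1.64  # normal deviate for the 95th percentile, as used in the text

# Overall averages from Table 8 (rounded as printed).
table8 = {
    "OX":  {"dy": 0.86, "s": 0.47},
    "NO2": {"dy": 1.03, "s": 0.91},
    "CO":  {"dy": 0.64, "s": 0.80},
}

def range_ratio(dy, s):
    """Estimated 90% range divided by the range implied by s, i.e. dy / (3.28 s)."""
    return dy / (2 * Z_95 * s)

ratios = {name: round(range_ratio(v["dy"], v["s"]), 2) for name, v in table8.items()}
```

The NO2 and CO ratios reproduce the Table 9 entries (.35 and .24). The OX ratio computed from the rounded Table 8 entries comes out near .56 rather than the tabulated .52; the tables alone do not show where the difference arises.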
E. An Estimation Model. Inspection of the panel responses shows, as
one might expect, systematic variations from case to case. Simple logic
dictates that estimates for the various population percentages should increase
with the percentage. Similarly, the estimates should increase with the degree
of impairment; for the same percentage population, x(I) > x(Da) > x(Dc). A somewhat
more subtle issue is whether estimates for the various populations consist
simply of scaled versions of each other, or whether each population is a
special case.
There are two points of view from which models of the estimated data can
be approached. One is the standard view that the panel members are trying to
approximate an implicit and informal model of the underlying phenomena. The
other is to view the data as expressing a model of the estimation process, that
is, as resulting from psychological scaling operations that are loosely tied
to the actual phenomena. Without extensive objective data for comparison,
there is no decisive way to distinguish between these two possibilities.
Observed regularities in the data may result from common perceptions of the
physiological reactions to inhaled pollutants, or they may result from common
modes of judgment.
Delphi practitioners in the past have not addressed this issue. By
and large, estimates for complex, interrelated quantities have been treated
as independent (or at least as separate) judgments, and the data analyzed
estimate by estimate. This is clearly unsatisfactory where a sequence of
estimates are closely related, as in the HHD study. There is no "canonical"
treatment for such data.
The present investigation is in a sense a pioneering effort in this area.
As it turns out, it appears useful to borrow elements from both points of
view to impose a meaningful structure on the data.
In the HHD study, one elementary model was proposed to rationalize the
percent population estimates, namely, the threshold model for the onset of a
given complex of symptoms. The model assumes that for a specified set of
symptoms, and a specified pollutant, each member of the relevant population
can be characterized by a critical value (threshold) of the dosage, beyond
which he will exhibit the symptoms. The threshold has a distribution within
the population which determines the fraction of the population that will
exhibit the symptoms at a given dosage. On a priori grounds it was suggested
that the distribution was likely to be lognormal. This led to the model
$$\frac{dF}{dc} = \frac{\phi(z)}{c}, \qquad z = A + B \log c \qquad (14)$$
where F is the fraction of the population affected, c is the dosage (concentration
experienced for one hour), φ(z) is the normal density function, and A
and B are constants which depend on the type of pollutant, population, and
degree of impairment. This is often called a probit model.
For computational convenience, (14) was approximated by the logistic model
$$\log \frac{F}{1-F} = C + D \log c \qquad (15)$$
Since both the probit model and the logit model have natural zeros at 0,
and since the panel estimates for 0% population were somewhat erratic and
exhibited large standard deviations, only the 10%, 50% and 90% estimates were
used to fit the model (least squares fit to the constants). As might be
expected, it was not difficult to find relatively good fits in most cases to
these three points, and no persuasive confirmation of the model was attempted.
It was treated more as a convenient method of analyzing the data and extrapolating
the given estimates to other population percentages.
Actually, both the logit and probit models have an elementary consequence
that can be used to formulate a useful test of fit. The consequence is
$$\log c(50\%) = \tfrac{1}{2}\{\log c(10\%) + \log c(90\%)\} \qquad (16)$$
For the probit model, this follows from the symmetry of φ(z); i.e.,
z(90%) = -z(10%), z(50%) = 0. The same symmetry holds for the logit model,
since log(.1/.9) = -log(.9/.1) and log(.5/.5) = log 1 = 0. The difference between
the left hand side of (16) and the right hand side is thus a good measure of
error for either model. Its application to the HHD data will be discussed below.
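
The logit fit and the midpoint test of Equation 16 can be sketched as below. The fit regresses log F/(1-F) on log c by ordinary least squares, which is one reading of "least squares fit to the constants"; the report does not give the exact procedure. The function names are illustrative, and the Table 3 averages for NO2, Disability are used only as sample input.

```python
import math

def logit(f):
    return math.log(f / (1.0 - f))

def fit_logit(points):
    """Least-squares C, D in logit(F) = C + D log c; points are (F, log c) pairs."""
    xs = [log_c for _, log_c in points]
    ys = [logit(f) for f, _ in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    d = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - d * mx, d

def midpoint_error(log_c10, log_c50, log_c90):
    """Left side minus right side of Equation 16."""
    return log_c50 - 0.5 * (log_c10 + log_c90)

# Synthetic points lying exactly on a logit line with C = -2, D = 3.
exact = [(f, (logit(f) + 2.0) / 3.0) for f in (0.1, 0.5, 0.9)]
c_hat, d_hat = fit_logit(exact)

# Average m for NO2, Disability at 10%, 50%, 90% (Table 3), as sample input.
err_no2_da = midpoint_error(1.056, 1.398, 1.687)
```

For points that satisfy the model exactly the midpoint error is zero; for the sample Table 3 averages it is about .03 in log units.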
It is apparent on inspection of the tabulated data, that regularities
exist among the estimates for different levels of impairment. However, the
symptoms defining the levels of impairment were specified separately for each
pollutant and each population type. Thus, there is no obvious physical scale
to which the levels of impairment can be attached. In fact, it is not obvious
that it is physically meaningful to refer to such a scale. On the other hand,
the terms "Incapacity", "Disability", and "Discomfort" were employed as common
descriptors across populations. As it turns out, the respondents appear to
have interpreted these three labels as defining three points on a scale which
is very similar to the percent population scale; i.e., log c(Da) is roughly
halfway between log c(I) and log c(Dc). In this instance, a relatively stable
and consistent psychological scale appears to have determined the estimates
for different degrees of impairment.
The values of
$$\tilde{x} = \frac{x - x(Dc,10\%)}{x(I,90\%) - x(Dc,10\%)},$$
averaged over populations for each
pollutant, are displayed in Table 10. The standard deviation of x̃ is s̃, computed
across populations. As can be seen, the s̃ are generally small, indicating that
the average values tabulated are a good approximation to the separate population
cases.
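
The normalization can be illustrated with a short sketch. The three populations and their log estimates below are hypothetical, and the sample (n - 1) standard deviation is an assumption; the report does not state which form was used for s̃.

```python
import statistics

def normalize(x, x_dc10, x_i90):
    """Rescale x so that x(Dc,10%) maps to 0 and x(I,90%) maps to 1."""
    return (x - x_dc10) / (x_i90 - x_dc10)

# Hypothetical log estimates for three population types (illustrative only).
populations = {
    "A": {"dc10": 0.60, "da50": 1.40, "i90": 2.10},
    "B": {"dc10": 0.70, "da50": 1.35, "i90": 2.00},
    "C": {"dc10": 0.50, "da50": 1.45, "i90": 2.20},
}

# Normalized (Da, 50%) estimate in each population, then its average and spread.
norm_da50 = [normalize(p["da50"], p["dc10"], p["i90"]) for p in populations.values()]
x_tilde = statistics.mean(norm_da50)
s_tilde = statistics.stdev(norm_da50)  # n - 1 form assumed
```

A small `s_tilde` relative to `x_tilde` is what the text means by the averages being a good approximation to the separate population cases.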
The same data is displayed in graphical form in Figures 21 (A)-(C), with
x̃ plotted against a rescaling of the percent population labelled p. Since x̃
has been normalized by dividing by x(I,90%) - x(Dc,10%), p(10%) = 0 and
p(90%) = 1. From the discussion of Equation 16 above, one would suspect that
a best fit would be obtained by setting p(50%) = .5. However, a slightly
better fit is obtained by setting p(50%) = .6 for NO2 and CO. To this extent,
the condition prescribed by Equation 16 is not strictly met.

TABLE 10: Normalized Estimates and Standard Deviation
by Percent Population, Degree of Impairment,
and Pollutant Type.

                                 OX               NO2               CO
% Population   Impairment     x̃       s̃        x̃       s̃        x̃       s̃
    10             I         .589    .040      .563    .033      .767    .028
                   Da        .343    .050      .282    .052      .427    .004
                   DC        0       0         0       0         0       0
    50             I         .807    .040      .789    .008      .909    .016
                   Da        .581    .070      .516    .047      .580    .022
                   DC        .296    .038      .253    .036      .176    .017
    90             I        1.0      0        1.0      0        1.0      0
                   Da        .762    .065      .715    .046      .667    .009
                   DC        .491    .038      .431    .035      .314    .024

FIG. 21 NORMALIZED ESTIMATE x̃ AS A FUNCTION OF THE PARAMETER p
(Panels: A. Oxidant, α = .55; B. Nitrogen Dioxide, α = .55; C. Carbon Monoxide, α = .73. Abscissa: p, from 0 to 1.0.)

The straight lines in Figures 21 (A) to (C) are obtained by the relationship
$$\tilde{x} = \alpha p + (1 - \alpha) q \qquad (17)$$
where q is a scaling on degree of impairment; q(I) = 1, q(Dc) = 0, and q(Da) =
.6 for NO2 and CO. The analogy with the percent population scale is quite
striking. α is a parameter which expresses the relative weight given to
percent population and to degree of impairment in estimating x. The normalized
estimates x̃ for different pollutants expressed as a function of the parameter
p are shown in Figures 21 (A) to (C), where α(OX) = α(NO2) = .55 and α(CO) = .73.
No attempt was made to optimize any of the parameters in the construction
of Figure 21. Simplicity of both form and content have been the major criteria.
In this spirit, Figure 21 (A) was constructed setting p(50%) = q(Da) = .5. A
surprisingly good fit to the OX data is obtained with this simple scaling.
Since there are only three general cases (the three pollutants) for
establishing the model, whether the simpler scaling p(50%) =
q(Da) = α = .5 would fit other pollutants to an acceptable degree of
approximation remains an open question. It seems probable that for many
purposes the simple model would be adequate.
Perhaps the model, as developed to this point, would be more easily
understood described in terms of the method of implementation. For each
pollutant, and each population type, panelists would be asked to estimate three
numbers: X(Dc,10%), X(I,90%) and α. It will probably be clear that α is a
constant for each pollutant; in which case, it need not be estimated for every
population type. From these three numbers, the constants A and B in Equation
14 can be determined, and the X for any other degree of impairment, percent
population combination can be calculated. Auxiliary estimates, such as
self-ratings or confidence ranges, would be elicited only for the three
estimated numbers. Notice that the model replaces the nine X estimates (for
a given pollutant and population type) in the HHD study with two, or at most
three, numbers.
An interesting area for further exploration is the possible extension of
the psychological scale of level of impairment to a continuum, rather than the
three discrete levels specified in the present study. The data indicate that
an interval scale, with reference points specified by defining Discomfort and
Incapacity in terms of particular symptoms, should be relatively easy to
construct. Some modification of the present model might be needed if the
extension included conditions outside the Discomfort-Incapacity interval. One
useful application of this scale would be to generate cumulative distributions
of degrees of impairment in a given population for a designated dosage.
Figure 22 (a) and (b) are speculative illustrations of this application, based
loosely on the present data within the I-Dc interval. The ordinate expresses
the percent of the given population exhibiting at least the degree of impair-
ment indicated on the abscissa, for the specified dosage. For example, in
Figure 22 (b), for c = 1.9, the graph indicates that 70% of the population
will exhibit Disability or worse.
One additional question relating to modelling was investigated. There is
the possibility that a rough scale of "severity of illness" for the various
subpopulations determined to a large extent the required dosage estimates.
This exploration ran into a thorny problem. The differences between estimates
across populations are in most cases small compared with differences within
populations (e.g., between different degrees of impairment). As a result,
small errors can seriously influence across-population scaling. Errors can
be expected from the variability of panelists' judgments; but in addition the
published "raw" data contains a number of recording errors arising either from
transcription or from carelessness on the part of respondents. Some of these
can be identified by noting, e.g., that the X estimate is outside the Y_l, Y_u
range, or that X(50%) is less than X(0%) or that X(Normal) is less than
X(Ill), etc. A pleasant irony of the development of the model described
above is that in examining some egregious discrepancies with the model, a number
of errors of this sort were uncovered.
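The consistency checks just described can be sketched in a few lines of Python. This is only an illustration: the record layout and the names Estimate, y_low, y_high, range_errors and ordering_errors are hypothetical and do not come from the HHD data files.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    # Hypothetical record layout; field names are assumptions for illustration.
    x: float       # point estimate of the required dosage
    y_low: float   # lower end of the estimated confidence range
    y_high: float  # upper end of the estimated confidence range


def range_errors(e):
    """Return a list of problems found in a single estimate."""
    problems = []
    if e.y_low > e.y_high:
        problems.append("confidence limits reversed")
    elif not (e.y_low <= e.x <= e.y_high):
        problems.append("X outside confidence range")
    return problems


def ordering_errors(x_by_percent):
    """Return (lower %, higher %) pairs where the dosage for the higher
    percent of population is below the dosage for the lower percent,
    e.g. X(50%) < X(0%)."""
    items = sorted(x_by_percent.items())
    return [
        (p_lo, p_hi)
        for (p_lo, x_lo), (p_hi, x_hi) in zip(items, items[1:])
        if x_hi < x_lo
    ]
```

The same pairwise comparison could be applied across populations, for example to flag cases where X(Normal) is less than X(Ill).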
[Figure 22. Illustration of cumulative distribution. Panel (a): Oxidant, Normal. Panel (b): Nitrogen Dioxide. Ordinate: Percent Population (0 to 100); abscissa: Degree of Impairment (tick labels Dead, I, Da, Dc).]
To ameliorate the problem of variability with small differences, and
hopefully to average out some of the recording errors, the x estimates (group
averages) were averaged across all percent population, degree of impairment
cases except 0 percent population for each population type. This generated
what could be labelled x. x is thus a kind of index of the degree of severity
(of illness) associated with each population. Figure 23 is a graph of x (Normal)
- x for the 14 populations and the three pollutant types. The connecting lines
are not intended to show functional relationships, but merely to keep track of
the three pollutants.
It is clear from Figure 23 that there is a fair amount of similarity in
the way in which "severity" is judged for the three pollutants, after some special
cases are discounted. These are primarily the two heart conditions, 4 and 5,
for CO, and Hay Fever and Asthma, 6 and 9, for OX. Otherwise the qualitative
behavior of the index is similar for the different pollutants. Kendall's
coefficient of concordance, W, across the populations is .70, $\chi^2 = 27.3$,
p<.01. However, the numerical relative severity does not appear to be sufficiently
stable to justify trying to introduce population type into the model on the basis
of the present data. Figure 23 does justify some optimism concerning the
possibility of formulating a general severity scale that could account for a large
part of the variance over populations.
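For readers who want to retrace the concordance figures, the following Python sketch uses the standard textbook formulas for Kendall's W (without tie correction) and its chi-square approximation. It assumes that the three pollutants play the role of the rankers (m = 3) and the 14 populations are the ranked items (n = 14); that reading is an assumption, although it does reproduce the reported chi-square value from W = .70. The function names are illustrative.

```python
def kendalls_w(ranks):
    """Kendall's coefficient of concordance, no ties.

    ranks[j][i] is the rank that ranker j assigns to item i (1..n).
    """
    m = len(ranks)         # number of rankers
    n = len(ranks[0])      # number of ranked items
    totals = [sum(r[i] for r in ranks) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def concordance_chi2(w, m, n):
    """Chi-square approximation for W, with n - 1 degrees of freedom."""
    return m * (n - 1) * w


# Assumed reading of the text: 3 pollutants ranking 14 populations.
chi2 = concordance_chi2(0.70, m=3, n=14)  # 3 * 13 * 0.70
```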
[Figure 23. Variation of x with population. Ordinate: x (Normal) - x; abscissa: Population Type (1 to 14).]
SECTION 8
DISCUSSION
Despite the large number of estimates elicited during the HHD study, the
high degree of dependence among the estimates reduces the amount of statistical
information by a large but difficult to specify factor. It probably would be
desirable in future studies of this sort to introduce estimates of two
sorts not included in HHD, namely, (1) cases (e.g., other types of pollutants)
where it is likely beforehand that the panel members are relatively poorly
informed and (2) cases (if available) where the panel members probably can
make very good estimates. The purpose of these additional estimates, of
course, is not to obtain information about the subject matter, but rather to
"calibrate" standard deviations, self-ratings and estimated confidence ranges.
If at all possible, the additional calibration estimates should cover a
relatively wide range of sizes of estimates.
The overriding issue, if the HHD study is to be considered relevant to
practical decisions, is whether the analogy between the HHD data and the
data from the Rand studies is sufficiently close so that the large bias
observed in the latter can be imputed to the HHD estimates. Since the analogy
looks reasonably close, the hypothesis of lognormality of distributions appears
plausible, and the scaling of standard deviation with mean is compatible with
the psychonumeric hypothesis, the prudent conclusion is probably the cautious
one, namely that the HHD estimates contain about the same proportion of bias
as the Rand estimates. This entails the assumption that the bias is about
twice the expected error computed from the observed standard deviation.
How this assumption is to be translated into an operational assessment
of a given estimate depends in part on the role accorded to the psychonumeric
hypothesis in defining error. If respondents scale their estimates on the
logarithm, then both standard deviation and expected error will increase
exponentially with the size of the estimate. This effect will not be invidious
if the relevant phenomena fit the same sort of scaling law. As an example, the
psychologically harmful effects of noise increase as the logarithm of the
physical intensity. Presumably, the relevant errors are those in the psychological
scale, not in the physical scale. Similarly, the relevant physiological
effects of air pollutants may increase as the logarithm of the concentration.
The model derived from the HHD estimates appears to be saying just this. Both
the susceptibility (threshold) and the degree of impairment are proportional
to the logarithm of the dosage. The relevant error would appear to be in
terms of these effects, not in terms of the concentration.
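The scaling argument above can be made concrete with the standard moment formulas for the lognormal distribution. The sketch below assumes natural logarithms and an exactly lognormal distribution of estimates with a fixed standard deviation sigma on the log scale; the value 0.5 is an arbitrary illustration, not a figure from the HHD data. Under these assumptions the standard deviation on the physical scale grows exponentially with the log-scale location mu, which is the same as saying it stays a fixed fraction of the mean estimate.

```python
import math


def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation on the physical scale when
    ln(X) is normal with mean mu and standard deviation sigma."""
    mean = math.exp(mu + sigma ** 2 / 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1)
    return mean, sd


# Illustrative values only: constant log-scale spread, increasing location.
for mu in (0.0, 1.0, 2.0):
    mean, sd = lognormal_mean_sd(mu, sigma=0.5)
    print(f"mu={mu:.1f}  mean={mean:.3f}  sd={sd:.3f}  sd/mean={sd / mean:.3f}")
```

The ratio sd/mean printed in the last column does not change from row to row, while sd itself is multiplied by e for each unit increase in mu.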
On the other hand, it is clear that at the present time policy is defined
in terms of the physical scales (e.g., in attaching various kinds of alerts
to various concentration levels). If decisions are focussed on concentration
levels, then a difficult problem of "correcting for" the psychonumeric
phenomenon is posed. Rescaling is not a serious problem; what is serious is
correcting for the exponential increase in expected error with size of the
estimate.
The issue posed by these considerations clearly goes well beyond the
application of the theory of errors to group estimates, and involves both the
substantive nature of the effects of air pollution, and relevant policy
variables. Neither of these is within the competence of this report. If the
scale of primary interest is determined to be the concentration, then, rather
than the wholesale rejection of estimates that would result from imposing the
criterion s < .5 (as pointed out above, none of the NO2 or CO estimates meet
this criterion), a more sagacious procedure would be to publish the entire
set of estimates with their attendant indices (s, R, Δy) and a brief explanation
of their significance.
The picture concerning the usefulness of the estimated confidence range
and the self-rating does not emerge sharply in the HHD data. In part this
results from the small variation of the average self-rating across populations.
The correlations between Δy, R, and s weakly support the hypothesis that they
are measuring something in common, but this is highly obscured by anomalous
and at present unexplained variations within related sets of estimates. With-
out some explanation for these anomalies, it is probably unreasonable to base
a decision concerning the trustworthiness of a given estimate on such
variations.
One rather firm conclusion would appear to be that the estimated
confidence range cannot be used to specify absolute confidence limits (e.g.,
confidence limits intended to guide policy). However, the results of the analysis
do not rule out the possibility that they have some utility in measuring
relative degree of certainty. That is, it seems unlikely that 90% of all actual
cases, if they could be established by experiment or field trial, would lie
between the estimated lower and upper Y limits. However, for two different estimates
for the same pollutant, but, e.g., for different populations, if the panel
estimated a larger ΔY for one estimate than for the other, then on the average
the estimate with the larger ΔY can be expected to be less reliable than the
other. This statement is expressed in terms of the untransformed estimates,
rather than in terms of the log transform Δy, because of the linear scaling on
the confidence estimates discussed in Section 7,D.
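The distinction between absolute and relative use of the confidence range can be illustrated with a simple coverage check. The sketch below is hypothetical: it presumes that true values are available for a set of calibration questions, which is not the case for the HHD estimates, and the numbers are toy values chosen only to show the computation.

```python
def coverage(intervals, truths):
    """Fraction of true values falling inside the stated (low, high) ranges."""
    hits = sum(1 for (lo, hi), t in zip(intervals, truths) if lo <= t <= hi)
    return hits / len(truths)


# Toy values, not data from the study.
stated_ranges = [(1.0, 3.0), (2.0, 5.0), (0.0, 1.0), (4.0, 6.0)]
true_values = [2.0, 6.0, 0.5, 7.0]
hit_rate = coverage(stated_ranges, true_values)
```

If ranges elicited as 90% limits were well calibrated in the absolute sense, the hit rate over many such questions would be near 0.9. A much lower rate would rule out the absolute interpretation without ruling out the use of the range width as a relative index.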
Perhaps the most positive outcome of the present analysis is the identifi-
cation of a relatively simple model which appears to account for most of the
variation dependent on the percentage population and degree of impairment
variables. Although a number of interesting questions are raised by the model --
e.g., whether some distribution other than the lognormal or the logistic would
give a better fit -- these appear to be issues of fine-tuning. Considering the large
amount of randomness in the estimates that one would expect from the theory of
errors, given the size of the standard deviations, the model does surprisingly
well. The model opens the possibility of formulating a scale of degree of
impairment which would cover a much wider range of symptom states than those
embodied in the terms Incapacity, Disability, and Discomfort, and from what
can be gathered from the present data, generalization of the model to fit this
more comprehensive scale should not be a large step.
REFERENCES AND NOTES
1. Leung, S. K., E. Goldstein, N. Dalkey. Draft Final Report: Human Health
Damages from Mobile Source Air Pollution. EPA Contract No. 68-01-1889,
by California Air Resources Board, March 1975.
2. The theory of uncertain estimates as probabilistic judgments has been
elaborated under the terms Bayesian estimates, subjective probability,
personal probability. Relevant names are F. P. Ramsey, B. de Finetti,
L. J. Savage, W. Edwards. A good exposition of the approach can be
found in L. J. Savage, Foundations of Statistics, John Wiley & Sons,
1954. The theory of estimates as a model has been employed by many
cognitive psychologists, including E. Brunswik, P. Hoffman, K. Hammond,
L. Goldberg, Robyn Dawes. A convenient reference is P. Hoffman, "Cue-
Consistency and Configurality in Human Judgment", in Formal Representation
of Human Judgment, B. Kleinmuntz, ed., John Wiley and Sons, 1968.
3. Aitchison, J. and J. A. C. Brown. The Lognormal Distribution, Cambridge
Univ. Press, 1957.
4. Aitchison, J. and J. A. C. Brown, ibid., Chap. 5.
5. Dalkey, N. and Bernice Brown. Comparison of Group Judgment Techniques
with Short-Range Prediction and Almanac Questions. The Rand Corporation,
R-678-ARPA, May 1971.
6. Dalkey, N. An Experimental Study of Group Opinion. Futures, Sept., 1969,
pp 408-426.
7. Dalkey, N. Group Decision Analysis. To be published, winter 1976.
8. Vide Stevens, S. S. Ratio Scales of Opinion. In D. K. Whitla, ed.,
Handbook of Measurement and Assessment in Behavioral Science, Addison-
Wesley, 1968.
9. Raimi, R. A. The Peculiar Distribution of First Digits. Scientific
American, Dec., 1969, pp 109-120.
10. A number of relevant studies on this topic concerning weather forecasting
and almanac type judgments were presented at the Conference on Bayesian
Research, Los Angeles, Cal., 1976. Especially pertinent was the report,
"Do Those Who Know More Also Know More About How Much They Know?", by
Sarah Lichtenstein and Baruch Fischhoff of the Oregon Research Institute.
11. This topic has received extensive investigation. The issues were first
presented in an unpublished paper by H. Raiffa and M. Alpert, Harvard,
1967, and have been followed up by R. Winkler, J. J. Selvidge, T. Brown,
and others. Of particular interest is a recent study by E. C. Capen,
"The Difficulty of Assessing Uncertainty", presented at the 50th Annual
Fall Meeting of the Society of Petroleum Engineers of AIME, Dallas, Texas,
Sept. 28 - Oct. 1, 1975.
12. Capen, E. C. The Difficulty of Assessing Uncertainty. Presented at the
50th Annual Fall Meeting of the Society of Petroleum Engineers of AIME,
Dallas, Texas, Sept. 28 - Oct. 1, 1975.
13. Capen, E. C., R. V. Clapp, W. M. Campbell. Competitive Bidding in High-
Risk Situations. Jour. Petroleum Technology, June 1971, pp. 641-653.
TECHNICAL REPORT DATA
(Please read Instructions on the reverse before completing)
1. REPORT NO.: EPA-600/5-78-016b
3. RECIPIENT'S ACCESSION NO.:
4. TITLE AND SUBTITLE: Human Health Damages from Mobile Source Air Pollution,
Additional Delphi Data Analysis - Volume II
5. REPORT DATE: July 1978
6. PERFORMING ORGANIZATION CODE:
7. AUTHOR(S):
Steve Leung, Eureka Lab., Inc., 401 N. 16th St., Sacramento, CA 95814
Norman Dalkey, Univ. of Calif., L.A., CA 90024
8. PERFORMING ORGANIZATION REPORT NO.:
9. PERFORMING ORGANIZATION NAME AND ADDRESS:
Contractor: California Air Resources Board, 1709 11th Street,
Sacramento, CA 95814
Subcontractor: Eureka Laboratories, Inc.
10. PROGRAM ELEMENT NO.:
11. CONTRACT/GRANT NO.: EPA Contract No. 68-01-1889
12. SPONSORING AGENCY NAME AND ADDRESS:
Corvallis Environmental Research Laboratory
Office of Research and Development
U.S. Environmental Protection Agency
Corvallis, Oregon 97330
13. TYPE OF REPORT AND PERIOD COVERED: Final Report
14. SPONSORING AGENCY CODE: EPA/600/2
15. SUPPLEMENTARY NOTES: Volume I of this report is EPA-600/5-78-016a.
16. ABSTRACT
This report contains the results of additional analyses of the data generated by a
panel of medical experts for a study of Human Health Damages from Mobile Source Air
Pollution (hereafter referred to as HHD) conducted by the California Air Resources
Board in 1973-75 for the U.S. Environmental Protection Agency (Contract No. 68-01-1889,
Phase I).
The analysis focused on two topics: (1) assessment of the accuracy of group esti-
mates and (2) generation of a model of the group estimate as a function of percent of
population affected and degree of impairment.
Investigation of the first topic required a more thorough formulation of the statis-
tical theory of errors as applied to group judgment than has been available up to now.
This formulation is presented in Section 5 of the report. A major new feature of this
theory is the postulation of a psychonumeric scaling on estimated numbers analogous
to the psychophysical scaling of sensory magnitudes.
The investigation of the second topic and the application of the theory of errors
to the data from the HHD studies are presented in Section 7.
This report was submitted by the California Air Resources Board in fulfillment
of Contract No. 68-01-1889 under the sponsorship of the Environmental Protection Agency.
Work was completed as of September 30, 1976.
17. KEY WORDS AND DOCUMENT ANALYSIS
a. DESCRIPTORS: Air Pollution Health Effects; Delphi Study; Dose-Response;
Decision Theory
b. IDENTIFIERS/OPEN ENDED TERMS:
c. COSATI Field/Group:
18. DISTRIBUTION STATEMENT: Unlimited
19. SECURITY CLASS (This Report): Unclassified
20. SECURITY CLASS (This page): Unclassified
21. NO. OF PAGES: [illegible]
22. PRICE:
EPA Form 2220-1 (Rev. 4-77)