EPA
United States
Environmental Protection
Agency

Office of Policy, Planning
and Evaluation
Washington, DC 20460

EPA-230-12
Statistical Policy Branch

ASA/EPA Conferences
Interpretation of
Environmental Data

II. Statistical Issues in
Combining Environmental
Studies

October 1-2, 1986

                           DISCLAIMER

This document has not undergone final review within EPA and should not
be used to infer EPA approval of the views expressed.

                                     PREFACE


    This volume is a compendium of the papers and commentaries that were presented at
the second of a series of conferences on interpretation of environmental data conducted by
the American  Statistical Association and the  U.S.  Environmental  Protection  Agency's
Statistical Policy Branch of  the  Office of Standards and Regulations/Office of Policy,
Planning, and Evaluation.
    The purpose  of these conferences is to provide a forum in which professionals from the
academic, private,  and public sectors  can exchange ideas  on statistical  problems that
confront EPA in its charge to protect the public and the environment through regulation of
toxic exposures. They provide a unique opportunity for Agency statisticians and scientists
to interact with their counterparts in the private sector.
    The theme of the conference, "Statistical Issues in Combining Environmental Studies,"
is particularly  appropriate because policy formulation rarely depends upon  a single study.
Moreover, the conclusions from various studies are often seemingly contradictory, or the
evidence from any single study is not clear-cut. No matter how inconclusive the evidence
may be, it is still  necessary to formulate policies.  Recently, great strides have been made
in  the   formal   statistical  combination   of  research information.   A  new  term,
"meta-analysis," has appeared, most often in  medicine and social science research reports.
This ASA/EPA research conference was  held  to make environmental statisticians  and
scientists aware  of  these  new  techniques  and  to examine  the   applicability  of  the
methodology to environmental studies.
    The holding  of a research conference  and  preparation of papers for  publication
requires the  efforts of many people.  Gratitude is expressed to the ASA  Committee on
Statistics and  the Environment which was instrumental in developing  this series of
conferences.   In  addition,  appreciation is  given  to Dr. Kinley  Larntz,  University of
Minnesota, for his  work  in  assembling  and  coordinating  the presentations  for  this
conference.  Thanks are also owed  to  members of the ASA staff and, particularly,  Ede
Denenberg and Mary Esther Barnes,  who supported the  entire effort. Although there  was
no provision  for  a formal peer review, thanks are also due to the reviewers who  assessed
the articles for their scientific merit and raised  questions which were submitted to the
authors for their consideration.
    The views presented in this conference are those of individual writers  and should not
be construed as reflecting the official position of any agency or organization.
    Following  the  first conference  on  "Current  Assessment of  Combined  Toxicant
Effects,"  in May  1986, the  second  conference  on  "Statistical  Issues  in Combining
Environmental  Studies" was held in October  1986.  Two additional conferences were held:
"Sampling and Site  Selection  for Environmental  Studies" in May 1987 and "Compliance
Sampling" in October  1987. From these two conferences, proceedings volumes will also be
published.

                              Emanuel Landau, Editor
                              American Public Health Association

                              Dorothy G. Wellington, Co-Editor
                              U.S. Environmental Protection Agency

                                     SUMMARY


    The first set of papers and discussion in the volume begins with an exciting paper by
David M. Eddy introducing a Bayesian method for  evaluating and summarizing evidence
from various sources.  The sources  can include empirical studies and expert judgments.
Graphical methods are provided to display the conclusions from the research synthesis.
One of the highlights of the conference was an interactive computing  demonstration of the
techniques given by Eddy and Vic  Hasselblad.  Robert L. Wolpert's paper discusses the
critical issue of selecting the appropriate scale of measurement for combining evidence.
Following the two papers  is discussion by David A. Lane.  Lane offers strong support for
the Bayesian viewpoints of Eddy and Wolpert.  He also reminds us of potential difficulties
in implementation of the methods.
    Larry Hedges, a major contributor to the meta-analysis literature, presents  a  clear
picture  of the issues in combining studies. In contrast to Eddy, Wolpert, and Lane, Hedges
adopts the frequentist viewpoint, presenting combined significance tests and confidence
limits.  Hedges  carefully presents methods and points out  their possible limitations.   In
discussion of the Hedges paper, James M. Landwehr points out that combining studies should
be considered within the usual framework of statistical applications.  Landwehr then  takes
us through the steps of standard analysis, illustrating the special aspects of meta-analysis.
    In  the   next  paper,  Thomas  B.  Feagans  presents  his  viewpoint on  probabilistic
assessments. Feagans bases his methods  on the fundamental axioms  of  probability.
Interesting discussions  are given by Harvey M. Richmond, Anthony D. Thrall, and Miley W.
Merkhofer.
    The final paper, given by G.P. Patil, G.J. Babu, M.T. Boswell, K. Chatterjee, E. Linder,
and C. Taillie, presents several case studies of combining data in the environmental  area,
specifically  in marine  fisheries management.  Lloyd L. Lininger gives a discussion raising
fundamental questions important to any problem of combining studies.

                               Kinley Larntz
                               University of Minnesota

                 INDEX OF AUTHORS
Babu, G.J. ...........................  70
Boswell, M.T. ........................  70
Chatterjee, K. .......................  70
Chen, Chao W. ........................  45
Eddy, David M. .......................   1
Feagans, Thomas B. ...................  50
Hedges, Larry ........................  30
Landwehr, James M. ...................  47
Lane, David A. .......................  26
Larntz, Kinley .......................  90
Linder, E. ...........................  70
Lininger, Lloyd L. ...................  89
Merkhofer, Miley W. ..................  66
Patil, G.P. ..........................  70
Richmond, Harvey M. ..................  61
Taillie, C. ..........................  70
Thrall, Anthony D. ...................  63
Wolpert, Robert L. ...................  19

               TABLE  OF CONTENTS
Preface                                                                   iii

Summary.  KINLEY LARNTZ, University of Minnesota                            iv

Index of Authors                                                             v

Confidence Profiles:  A Bayesian Method for Assessing Health Technologies.
DAVID M. EDDY, Duke University                                              1

Choosing a Measure of Treatment Effect.  ROBERT L. WOLPERT, Duke University   19

Comment on Eddy's Confidence Profile Method. DAVID A. LANE, University
of Minnesota                                                               26

Statistical Issues in the Meta-Analysis of Environmental Studies.
LARRY HEDGES, University of Chicago                                       30

Discussion. CHAO W. CHEN, U.S. Environmental Protection Agency               45

Discussion. JAMES M. LANDWEHR, AT&T Bell Laboratories                      47

Integration of Empirical Research:  The Role of Probabilistic Assessment.
THOMAS B. FEAGANS, Decisions in Complex Environments                       50

Discussion. HARVEY M. RICHMOND, U.S. Environmental Protection Agency        61

Discussion. ANTHONY D. THRALL, Electric Power Research Institute             63

Discussion. MILEY W. MERKHOFER, Applied Decision Analysis, Incorporated       66

Statistical Issues in Combining Ecological and Environmental Studies
with Examples in Marine Fisheries Research and Management. G.P.  PATIL,
G.J. BABU, M.T. BOSWELL, K. CHATTERJEE, E. LINDER, C. TAILLIE,
Pennsylvania State University                                                 70

Discussion. LLOYD L. LININGER, U.S.  Environmental Protection Agency           89

Appendix A:  ASA/EPA Conference on Statistical Issues in Combining
Environmental Studies Program                                               90

Appendix B:  Conference Participants                                          92

                                CONFIDENCE PROFILES: A BAYESIAN METHOD

                                   FOR ASSESSING HEALTH TECHNOLOGIES


                                           David M. Eddy, M.D., Ph.D.*
                                       J. Alexander McMahon Professor of
                                         Health Policy and Management,
                                                   Director
                                Center for Health Policy Research and Education
                                                Duke University
                                            Durham, North Carolina
INTRODUCTION

The  first  step  in  the  assessment  of  a  health
technology  is to evaluate  the  existing evidence  to
estimate  how the technology affects the magnitude
or  probability of  important  health outcomes—its
benefits and harms.  These  estimates form the basis
for  the subsequent steps  of an assessment:   com-
parison of benefits and harms, estimation of overall
benefit, calculation of marginal returns,  and  design
of a policy.

At  present, for the great  majority  of health  tech-
nologies there is no explicit quantitative estimation
of  the  technology's  effects  on  health  outcomes.
Current clinical  and  administrative  decisions are
typically based on a qualitative  subjective judgment
that  a  technology's  benefits  outweigh  its harms.
However, the rising cost of health  care, increasing
competition, concern over wide variations in practice
patterns,  increasing  malpractice   claims,  and  a
variety  of   other  forces  all  create pressure  for
quantitative estimates of a technology's effects, and
therefore for quantitative assessment methods.

Estimating  the effects of a  technology on  health
outcomes is complicated by several factors. Ideally,
for each  technology and  each outcome, there would
be several well designed controlled  trials that provide
direct evidence of how the technology affects each
outcome.    Unfortunately,   this   ideal   is   rarely
achieved.  The  empirical evidence  is rarely  com-
plete.  What evidence is available usually comes from
many   different   sources,   including   randomized
controlled trials (RCTs),  nonrandomized  controlled
trials,  uncontrolled   clinical   series, case-control
studies,   cross-sectional   studies,   case  reports,
longitudinal  studies, and  animal experiments.  Even
anecdotes, theories, testimonies, and analogies  play a
role in many assessments. Each  piece of evidence
can  be subject to  a variety of  biases  and  other
factors   that  affect   their   internal   validity,
comparability, and  applicability  to  a  particular
assessment (external validity).  Much of the available
evidence  does not  deal  with outcomes  that are
important   to  patients   (e.g.,  death),   but  with
intermediate  outcomes (e.g., cholesterol level),  or
performance  indicators  (e.g., the sensitivity  of a
diagnostic test).  Finally,  even  pieces of evidence
that are complete and have the same design can be
inconclusive (e.g., not statistically significant) or
give   inconsistent   results.    Because   of   these
   *I thank Vic Hasselblad, Greg Critchfield, Dick
Smallwood,  and  Bob  Winkler  for  many  helpful
suggestions.  I  especially thank  Robert Wolpert for
his  contributions to this work  and collaboration in
extending it.
complexities, the process for synthesizing evidence
tends to be highly subjective, which leaves it
vulnerable to oversimplification, errors in reasoning,
wishful thinking, and self-interest.

This  paper   introduces  a  Bayesian   method  for
synthesizing  the available evidence—from  both em-
pirical studies and expert judgments—to estimate the
effect of  a  health technology on  health outcomes.
Called the Confidence Profile Method, it can be used
to   evaluate  evidence  from  different  types  of
empirical studies, adjust individual pieces of evidence
for biases to internal and external validity, combine
evidence from different studies (not necessarily with
the same designs),  and  incorporate focused subjective
judgments, to derive a probability distribution for the
effect of a  health  technology on health outcomes.1
Because  the  probability  distribution   explicitly
incorporates   subjective  judgments, it   is   called  a
Confidence Profile.  The Profiles for each  outcome
can  then  form  the basis  for adjustments for risk
aversion,  comparison of  benefits  and  harms, and
other steps of a technology assessment.

This paper gives the basic formulas of the method,
and illustrates its use with an  analysis  of the effect
of  a  thrombolytic agent—tissue-type plasminogen
activator—on one-year survival from heart attacks.

DEFINITIONS

The  term health technology is used very broadly  to
include any  intervention that  might affect  a health
outcome.   Examples  include  health  education,
diagnostic tests, treatments, rehabilitation programs,
pain control,  and psychotherapy. A health outcome is
an outcome  of a disease or injury that people can
experience and care about.  Examples  are  life and
death, pain,   disfigurement,  disability,  anxiety, and
range  of  motion of a  limb.  It   is  important  to
distinguish   health   outcomes   from   intermediate
outcomes, which are markers  of biological changes
that  might  indicate  or affect the  probability  or
magnitude of health outcomes. Examples are  blood
pressure, serum cholesterol, intraocular pressure, and
the reperfusion of a coronary artery after treatment
of a heart attack.2

The  objective of an assessment is to estimate the
technology's    effect   on  health   outcomes.    To
accomplish this, we use "chains"  that  connect the
performance   of  the  technology  to   the  health
outcome.  If there is direct evidence relating the
performance of the technology to the occurrence of
the health outcome, the chain has a
single link.  In other cases the evidence is  indirect,
with one body of evidence relating the performance
of the  technology  to  one or  more  intermediate

outcomes (or followup actions—see below), and other
evidence relating the intermediate outcome(s) to the
health  outcome.   For  example  (see  illustration
below),  to assess the  effect  of  changing dietary
cholesterol on  heart  attack rates, a two-link chain
might be used; the first link would evaluate evidence
about the  effect of  diet (the  technology) on serum
cholesterol (the intermediate outcome);  the  second
link  would evaluate  evidence that reducing serum
cholesterol reduces heart  attack rates (the  health
outcome).  When multiple-link  chains are used, care
must  be taken  to examine  the accuracy of  the
intermediate  outcome as an indicator for the health
outcome,   and   any  independent  effects  of   the
technology on   the  health outcome (not  mediated
through the intermediate outcome). These issues will
be discussed in detail below.

Change diet—> lower serum cholesterol—>
prevent heart attacks

Followup actions are  important in  the  evaluation of
diagnostic or screening technologies, where the
technology's purpose is to provide information, which
in turn  can affect health outcomes only if it changes
a  followup action  (e.g., changes  treatment).   For
example,   to   evaluate   screening  for   ocular
hypertension, a three-link chain would be constructed
to (1)  relate the  use of  the  screening test (e.g.,
tonometry) to detection  of high  intraocular pressure
(an intermediate outcome), (2) relate detection of
elevated pressure to  a decision to treat  (a followup
action), and (3) relate the treatment to a decrease in
the chance of blindness (a health outcome).

Frequently there  are features  of the  population,
disease, technology, provider or setting that can alter
the effect of   a  technology  on health  outcomes.
Examples are  the  relative  risk  of  a disease in a
population  to   be   screened,   the  sensitivity  or
specificity of a  diagnostic  test, the dose or frequency
of a drug,  the experience of a  practitioner, and the
adherence  of   a  patient   to   a  treatment.   The
Confidence Profile Method treats these  features as
parameters.  By  performing  an  assessment  as  a
function of  various  parameters,  the assessment's
results  can be tailored to a variety of circumstances.

MEASURING A  TECHNOLOGY'S EFFECT

To  estimate   a   technology's   effect  on   health
outcomes,  a suitable measure  must be  chosen  for
each outcome,  and  an estimate made of how the
technology (compared  with  a  designated  control)
changes  the outcome,  according  to  the  chosen
measure.  Quantitative  measures provide  the  least
ambiguous way to describe,  and the most powerful
way to calculate, a technology's effect.

Given a quantitative measure for a health outcome,
the effect of a technology can be defined  in several
different ways.  For dichotomous health outcomes, an
obvious measure of effect is the change in  the
probability of the  health outcome.  For  example, if
the chance of the health outcome (e.g., death within
one year)  without the  technology is  0.8,  and  the
chance of the health outcome with the technology is
0.4, the effect  of the technology by this measure is
-0.4.  (The technology lowered the chance of death by
40 percentage points relative to no technology.)  For health
outcomes  that  can take one of several values (e.g.,
 mild, moderate or severe  pain,  or a discrete-valued
health status  measure),  the technology's effect can
be  defined as  the  shift  in  probabilities  of  the
different outcomes.  For continuous-valued outcomes
(e.g., weight, IQ, or a continuous-valued health status
measure), the technology's effect can be measured as
the change in the magnitude of the health outcome.
Other measures of a  technology's effect are possible,
such as a change  in  the  odds-ratio,  or the percent
change in probability or  magnitude of an outcome. In
each case, uncertainty about the technology's  effect
can be described  as  a distribution for the effect, on
the chosen measure.

OVERVIEW OF THE CONFIDENCE PROFILE
METHOD

Steps. The Confidence Profile  Method is applied in
five basic steps.  This section outlines the steps and
the general form of some of the formulas. Examples
of specific formulas are given in a later section.

1.   Define the technology, the  control with which it
     will  be compared (the "designated control"),  the
     circumstances in which it  will  be applied  (the
     "circumstances   of   interest"),  and  the  health
     outcomes  it  affects.   A  separate  assessment
     should be performed for each health outcome.

2.   For  each health outcome, describe chain(s) that
     relate  the   performance  of  the  technology
     (compared with  the designated  control)  to  the
     occurrence of the health outcome. These chains
     should  be based on  the available evidence and
     knowledge    of    the   pathophysiology   and
     management of the health problem.  The chains
     should be created so  that each piece of evidence
     applies to one and only one  chain.  Each chain
     will  be analyzed  separately,  and  the   results
     combined in a later step (step 4).

3.   For  each chain,  derive a probability distribution
     for  the effect of the  technology on the  health
     outcome, as  indicated by the evidence  for that
     chain.  This  is accomplished by examining  the
     evidence for  each link of the chain, one link at  a
     time,  deriving a probability distribution for  the
     link (step  3a), and then combining the probability
     distributions across the chain (step 3c).

     a.   The derivation of a probability distribution for a link, describing our
          knowledge about the true effect of the action on the outcome for that
          link, is accomplished by first deriving for each link a likelihood
          function L(·) for the likelihood of the observed results of the evidence
          for the link as a function of the possible values of the true effect.3
          To simplify the discussion, consider a single-link chain (direct
          evidence) and denote as e the true (but unknown) effect of the
          technology on the health outcome.  Denote the observed results of an
          experiment or other source of evidence as Xij, where the subscripts
          denote the ith piece of evidence for the jth chain.  Where there is no
          ambiguity about which chain is being considered, the second subscript
          will be dropped.  (Below, the collection of evidence for the jth chain
          will be denoted X.j, and the total body of evidence for all chains will
          be denoted X.. .)  The likelihood function we want to derive for the
          link, based on, say, n pieces of evidence, is L(e|X1, X2, ..., Xn).

          i.    To derive this likelihood function for the link, examine each
                independent piece of evidence for the link one by one and derive
                a function for the likelihood of the observed result (Xi) as a
                function of the possible values of the true effect of the
                technology (e).  Denote this likelihood function for the ith
                piece of evidence as Li(e|Xi).  The form of the likelihood
                function will depend on the type of evidence.  The likelihood
                function for an RCT will be given in the next section.

          ii.   Sometimes the result of a particular piece of evidence is
                influenced by factors that affect internal validity (e.g.,
                patient selection bias, errors in measurement of outcomes,
                crossover of patients between treated and control groups) or
                external validity (e.g., differences between the circumstances
                of a trial relating to the patients, technology, or providers,
                compared with the circumstances of interest in a particular
                assessment).  Because of this, the formulas for deriving
                likelihood functions for individual pieces of evidence contain
                variables to adjust for these factors.  The requirement is that,
                when the appropriate adjustments are made, the likelihood
                function for each piece of evidence should describe the
                likelihood of the observed results of the study in the
                circumstances of the study, as a function of the true effect of
                the technology in the circumstances of interest.  Specific
                examples of likelihood functions that contain adjustments will
                be given in the next section.

          iii.  Calculate a joint likelihood function for the observed results
                of all pieces of evidence, as a function of the (unknown) true
                effect of the technology in the circumstances of interest, by
                multiplying the likelihood functions of the individual pieces of
                evidence (possibly adjusted for biases to internal and external
                validity):4

                    L(e \mid X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} L_i(e \mid X_i)        (1)

          iv.   Derive a probability distribution for the effect of the action
                on the outcome (for that link), using the continuous form of
                Bayes' formula.  Denote this (posterior) probability
                distribution by π(e|X.).  Thus

                    \pi(e \mid X_1, X_2, \ldots, X_n) = k \, L(e \mid X_1, X_2, \ldots, X_n) \, \pi(e)        (2)

                where π(e) is a noninformative prior distribution for e, and k
                is a normalizing constant.  The choice of a suitable
                noninformative prior distribution is discussed by Jeffreys
                (1961) and Bernardo (1979) in a general setting, and by Wolpert
                and Eddy (1986) in the context of Confidence Profiles.  (A small
                numerical sketch of steps iii and iv appears after this list of
                steps.)

     b.   If the evidence is direct (a single-link chain), the posterior
          probability distribution derived in the previous step is the
          Confidence Profile for the effect of the technology (skip to step 4).
          If the evidence is indirect (a multiple-link chain), repeat step 3a to
          derive probability distributions for each link and proceed to step 3c.

     c.   Combine the probability distributions for each link to derive a
          probability distribution for the entire chain, that is, for the effect
          of the technology on the health outcome.  If the occurrence of an
          outcome for a particular link is determined solely by the action for
          that link, and not affected by any preceding actions (e.g., if the
          technology has no independent effect on the health outcome not
          mediated through the intermediate outcome),5 then the probability
          distributions for the links are combined by an operation analogous to
          multiplication of two random variables.  Specifically, let πtu(etu) be
          the distribution for the effect of the technology (t) on a dichotomous
          intermediate outcome (u), and let πuh(euh) be the distribution for the
          effect of the intermediate outcome on a dichotomous health outcome
          (h), where in both cases the effect is measured as the difference in
          the probability of the outcome caused by the action (see footnote 3).
          Then the distribution for the effect of the technology on the health
          outcome (πth(eth)) is given by

              \pi_{th}(e_{th}) = \int \pi_{tu}(x) \, \pi_{uh}\!\left(\frac{e_{th}}{x}\right) \frac{1}{|x|} \, dx        (3)

          If the occurrence of an outcome is not determined solely by the action
          for that link, but is influenced by an action in a preceding link, the
          formula for combining probability distributions across links of a
          chain must include correction factors.  The equation with correction
          factors is given below (Eq. [25]).

          The result of this step is the probability distribution, or Confidence
          Profile, for the effect of the technology (compared with the
          designated control) on the health outcome, for a particular chain.

4.   If there are two or more chains, combine their separate probability
     distributions to derive a single probability distribution that
     incorporates the evidence in all the chains.  Let X.i be the evidence for
     the ith chain, let πi(e|X.i) be the posterior distribution for the effect
     of the technology, based on the evidence in the ith chain, and let π(e) be
     the (noninformative) prior for the technology's effect.  The formula is

         \pi(e \mid X_{.1}, X_{.2}, \ldots, X_{.n}) = k \, \pi(e) \prod_{i=1}^{n} \frac{\pi_i(e \mid X_{.i})}{\pi(e)}        (4)

     where k is a normalizing constant and n is the number of chains.  This
     equation assumes the posterior distributions for each chain are independent
     in the sense that no piece of evidence is used in more than one chain.
     Equation (4) will be derived below (Eq. [33]).


5.   Sometimes a body of evidence will compare the technology (T) with a
     control (C*) that is different from the designated control (C).  When this
     occurs, the effect of the technology compared with the designated control
     (call this eTC) can be found as follows:

     a.   use steps 1-4 to derive a probability distribution for the effect of
          T compared with C* (call this eTC*),

     b.   similarly, derive a probability distribution for the effect of C*
          versus the designated control C (call this eC*C),6 and

     c.   calculate the probability distribution for the effect of the
          technology compared with the designated control by convolving the
          probability distributions derived in the previous two steps.
          Specifically, let πTC*(eTC*) be the distribution for the effect of the
          technology compared with the control C*, let πTC(eTC) be the
          distribution for the effect of the technology compared with the
          designated control C, and let πC*C(eC*C) be the distribution for the
          effect on the health outcome of the control C* compared with the
          designated control C.  Then

              \pi_{TC}(e_{TC}) = \int \pi_{TC^*}(e_{TC^*}) \, \pi_{C^*C}(e_{TC} - e_{TC^*}) \, de_{TC^*}        (5)
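The grid arithmetic behind steps 3a(iii) and 3a(iv) is compact enough to show directly.
The sketch below is a minimal illustration only, not software from the Confidence Profile
Method itself: the grid range, the flat prior, and the three normal-shaped "pieces of
evidence" are assumptions chosen purely for demonstration.

```python
import numpy as np
from scipy.stats import norm

# Grid of possible values of the true effect e (difference in probabilities).
e = np.linspace(-1, 1, 2001)

# Hypothetical likelihood functions for three independent pieces of evidence
# bearing on the same link (shapes assumed only for this illustration).
likelihoods = [norm.pdf(e, loc=m, scale=s) for m, s in [(0.10, 0.05),
                                                        (0.18, 0.08),
                                                        (0.12, 0.10)]]

# Step 3a(iii), Eq. (1): joint likelihood = product of the individual likelihoods.
L = np.prod(likelihoods, axis=0)

# Step 3a(iv), Eq. (2): posterior = k * L(e | X1..Xn) * pi(e), with a flat prior.
prior = np.ones_like(e)
posterior = L * prior
posterior /= np.trapz(posterior, e)           # normalization supplies the constant k

print("posterior mean effect:", np.trapz(e * posterior, e).round(3))
```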
Use of Subjective  Judgments.  When applying  the
formulas of the Confidence Profile Method, empirical
evidence  is used  to the  greatest extent possible to
estimate   the   necessary   variables.    However,
whenever  the   available   empirical  evidence  is
incomplete, focused  subjective  judgments  must be
used to complete an assessment.7  An important
feature of the  Confidence  Profile Method  is  that
whenever subjective judgments are used, uncertainty
about   any   variable  being   estimated   can  be
incorporated  in   the   analysis  by   describing  a
probability distribution for the  variable  (instead of
using a  point estimate),  and  integrating over  the
variable.   No  additional  subjective  "weighing"  of
individual pieces of evidence is required; the "weight"
of  each   piece  of  evidence  (along   with   any
adjustments)  is  automatically  captured   in   the
likelihood functions and therefore in the Confidence
Profiles calculated from  them.  This  feature will be
illustrated below.   Because  the Confidence Profile
automatically encodes this uncertainty  about  the
variables in the  formulas, as well as  the  uncertainty
due to  the random sampling  that affects empirical
observations, there is no need  to perform sensitivity
analysis for such factors.8
ILLUSTRATION

The  Confidence Profile Method will be  illustrated
with  formulas  for  evidence  involving  RCTs with
dichotomous   intermediate  outcomes  and   health
outcomes, where the technology's effect is measured
as the difference it causes in  the probability  of  the
health  outcome.   Use  of  the  formulas  will  be
illustrated  with  an assessment of the  effect  on
one-year   survival   of   a    thrombolytic   agent
(tissue-type  plasminogen  activator) used to  treat
heart  attacks.   The  Method  currently  includes
formulas for other types  of  experimental  designs
(e.g., nonrandomized controlled trials, clinical  series,
case-control   studies,  and   some  cross-sectional
designs);    categorical    and    continuous-valued
intermediate  outcomes  and  health outcomes;   and
other measures of a technology's effect (e.g., change
in odds-ratio, and percent change  in a rate).  These
are described elsewhere (Eddy and Wolpert  1986).

Background.    Tissue-type  plasminogen   activator
(t-PA) is one of several thrombolytic agents used to
dissolve  (lyse)  blood  clots  (thrombi)  in coronary
arteries  after heart  attacks,  with the intention of
restoring  blood  flow  through the  coronary  artery
(reperfusion), and  thereby increasing the  chance of
survival. There are conflicting policies about the use
of t-PA (and  about  payment  for it by  third-party
payers).  The conflicts  reflect the complexity of the
available evidence. The main problem is that there is
no single RCT that compares the effect of t-PA with
conventional  care or any  other  thrombolytic  agents
on  long-term  (one-year)  survival.  The available
studies of  t-PA (see  Table I) involve  intermediate
outcomes    (e.g.,   perfusion   and   reperfusion),
short-term   outcomes  (in-hospital  mortality),   and
different  controls  (placebo, conventional  care,  and
intravenous streptokinase).  In  addition to  the  studies
described in  Table I, a large number of studies have
examined  other  thrombolytic  agents—intravenous
streptokinase (IV SK), intracoronary streptokinase (IC
SK), and urokinase (UK) (Yusuf et al 1985).  While
they do not provide direct evidence about t-PA, they
contain information to compare  the  various controls
used in studies involving t-PA.

Likelihood  Function  for  an  RCT.   The  likelihood
function for  an RCT  is  derived  from the binomial
distribution.  Designate the occurrence of the health
outcome a "success" (s), the nonoccurrence of the
outcome a "failure" (f), and the true probabilities of a
success in the treated and control groups as p1 and
p0, respectively.  Let the numbers of people in the
treated and control groups be n1 and n0, the observed
numbers of successes in each group be s1 and s0, and
the observed numbers of failures in each group be f1
and f0 (si + fi = ni).  The joint likelihood function for
p0 and p1, given observed values of s0, f0, s1, and f1, is

    L(p_0, p_1 \mid s_0, f_0, s_1, f_1) \propto p_0^{s_0} (1-p_0)^{f_0} \, p_1^{s_1} (1-p_1)^{f_1}        (6)

Designate the "effect" of a technology as e, which in
this case is defined as e = p1 - p0.  A joint likelihood
function for e and p0 can be derived by substituting
p0 + e for p1 in Equation (6):

    L(e, p_0 \mid s_0, f_0, s_1, f_1) \propto p_0^{s_0} (1-p_0)^{f_0} \, (p_0+e)^{s_1} (1-p_0-e)^{f_1}        (7)
If it is reasonable to behave as though there were no
prior knowledge linking p0 and e, a marginal
likelihood function for e can be obtained by
integrating the function L(e, p0) with respect to a
marginal noninformative prior distribution for p0.
Because p1 is a probability, 0 < p0 + e < 1, and
the assumption of independence is an approximation;
it is a very close approximation, however, for a wide
range of possible values of p0 and e.  The
reasonableness of the assumption is only threatened
when p0 and e approach 0 or 1 and the sample sizes
are small.  Furthermore, other measures of effect
(e.g., change in odds ratio, percent change in rate,
and relative risk) can be used to help achieve
independence between p0 and e (Wolpert and Eddy
1986).  If g(p0) is the (possibly noninformative) prior
for p0, the marginal likelihood function for e is

    L(e \mid s_0, f_0, s_1, f_1) = \int_0^1 L(e, p_0 \mid s_0, f_0, s_1, f_1) \, g(p_0) \, dp_0        (8)
There are several possible choices of noninformative
priors for the parameter p0.  Use of a uniform prior
and the normal approximation to the binomial
likelihood function leads to the approximation

    L(e \mid s_0, f_0, s_1, f_1) \approx N(\mu, \sigma^2)        (9)

where N(\mu, \sigma^2) is the normal distribution with
mean \mu = s_1/n_1 - s_0/n_0 and variance
\sigma^2 = s_0 f_0 / n_0^3 + s_1 f_1 / n_1^3.  It can be
shown that the parameters \mu and \sigma^2 differ only by
terms of order (1/n_0 + 1/n_1) from those that would be
obtained with other reasonable candidates for noninformative priors.

Equation (9) can be illustrated with the TIMI study,
the results of which are given in Table 1 (TIMI 1985).
Figure 1 shows the likelihood function for the effect
on reperfusion of t-PA compared with IV SK,
calculated from Equation (9) with s0 = 44, f0 = 78,
s1 = 78, f1 = 40.  The horizontal axis of this figure is
e, the true effect of the technology (compared with
IV SK) on reperfusion.  There is no vertical scale
because a likelihood function is defined only up to an
arbitrary constant.  The likelihood function can be
multiplied by a noninformative prior and normalized
to derive a posterior distribution that has a virtually
identical appearance.  It would show that t-PA
(compared with IV SK) can be expected to increase
the probability of reperfusion by about 30%, with a
95% range of confidence9 from 18% to 42%.
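The arithmetic behind Equation (9) is easy to reproduce.  The sketch below is a minimal
check using the TIMI reperfusion counts quoted above; the use of numpy/scipy and the
normal-quantile 95% range are implementation choices, not part of the original formulas.

```python
import numpy as np
from scipy.stats import norm

# TIMI reperfusion counts quoted in the text: control (IV SK) and treated (t-PA).
s0, f0 = 44, 78          # successes / failures, control group
s1, f1 = 78, 40          # successes / failures, treated group
n0, n1 = s0 + f0, s1 + f1

# Equation (9): normal approximation to the marginal likelihood for e = p1 - p0.
mu = s1 / n1 - s0 / n0
var = s0 * f0 / n0**3 + s1 * f1 / n1**3
sd = np.sqrt(var)

lo, hi = norm.ppf([0.025, 0.975], loc=mu, scale=sd)
print(f"effect on reperfusion: {mu:.2f}  (95% range {lo:.2f} to {hi:.2f})")
# Matches the figures quoted in the text: about 0.30, range roughly 0.18 to 0.42.
```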

Adjustment  of  Likelihood  Functions  for  Biases.
Adjustment of evidence for biases that affect its
internal validity, or for factors that affect its
applicability  to a  particular  assessment  (external
validity) will be illustrated with formulas that adjust
likelihood functions  for RCTs to correct for errors in
outcome  measurement, crossover of patients,  and
two types of confounding  factors.   Formulas  have
also been  written to adjust studies for differences in
length of followup.

Adjustment for Errors in Measurement of Outcomes.
In an RCT, the joint likelihood function for p0 and p1
is a function of the observed numbers of successes and
failures in the treated and control groups.
Unfortunately, there can be errors in the method
used to determine whether an outcome is a success or
a failure, due to such problems as an imperfect
measurement instrument, errors in reporting, clerical
errors, and so forth.  If the true outcomes were
known, the counts of successes and failures could
simply be corrected and used in Equation (8).  More
frequently, it is only possible to estimate the chance
of error.  Let a0 be the probability that, in the
control group, a true success will be incorrectly
labeled a failure, and let b0 be the probability that,
in the control group, a true failure will be labeled a
success.  Let a1 and b1 be the corresponding
probabilities for the treated group.  It is easy to show
that

    L(p_0, p_1 \mid s_0, f_0, s_1, f_1, a_0, b_0, a_1, b_1) \propto
        \prod_{i=0}^{1} \left[ p_i (1-a_i) + (1-p_i) b_i \right]^{s_i}
                        \left[ p_i a_i + (1-p_i)(1-b_i) \right]^{f_i}        (10)

Each of the expressions in brackets represents the
probability of an observed success or failure, as a
function of the values of pi, ai, and bi.

The joint likelihood function for e and p0, L(e, p0),10
can again be obtained by substituting p0 + e for p1 in
Equation (10), and a marginal likelihood function
for e can be obtained by integrating L(e, p0) with
respect to a (possibly noninformative) prior
distribution for p0.

Any uncertainty about a0, b0, a1, and/or b1 can be
described with probability distributions, and the
expression integrated with respect to the uncertain
variable(s), causing any uncertainty about any of
these parameters to be encoded in the likelihood
function.  Thus, if uncertainty about a0 is described
by h(a0), then

    L(p_0, p_1 \mid s_0, f_0, s_1, f_1, b_0, a_1, b_1) = \int_0^1 L(p_0, p_1 \mid s_0, f_0, s_1, f_1, a_0, b_0, a_1, b_1) \, h(a_0) \, da_0        (11)

Incorporating uncertainty about all four parameters
(a0, a1, b0, b1) would involve quadruple integration.

This feature of the Confidence Profile Method can be
illustrated by deriving a likelihood function for the
reperfusion rates observed in the TIMI study,
adjusting for the possibility that the observed
reperfusion rates might have been distorted by the
performance of the coronary angiography used to
measure reperfusion.11  Suppose we estimate b0 =
0.05 and b1 = 0.05 (based on estimates of the
proportion of patients with occluded arteries that can
be opened by angiography alone), and a0 = a1 = 0
(assuming that angiography will not close an open
artery).  Using these estimates of a0, a1, b0, b1 and
Equation (10), the adjusted likelihood function for the
effect of t-PA versus IV SK on reperfusion, derived
from the TIMI study, is shown in Figure 2.
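The adjustment in Equation (10) can be carried out numerically.  The sketch below is a
minimal illustration only: the grid integration over p0, the flat prior g(p0), and the grid
ranges are implementation choices for the example, not part of the original formulas.

```python
import numpy as np

# TIMI reperfusion counts and the misclassification probabilities assumed in the text.
s0, f0, s1, f1 = 44, 78, 78, 40
a0 = a1 = 0.0          # P(true success recorded as a failure)
b0 = b1 = 0.05         # P(true failure recorded as a success)

e_grid  = np.linspace(-0.2, 0.7, 451)          # candidate true effects e = p1 - p0
p0_grid = np.linspace(0.001, 0.999, 999)       # candidate control-group rates p0
E, P0 = np.meshgrid(e_grid, p0_grid, indexing="ij")
valid = (P0 + E > 0.0) & (P0 + E < 1.0)        # p1 must stay inside (0, 1)
P1 = np.clip(P0 + E, 1e-9, 1 - 1e-9)

def recorded_success(p, a, b):
    # Probability that an outcome is *recorded* as a success (bracketed terms of Eq. 10).
    return p * (1 - a) + (1 - p) * b

r0 = recorded_success(P0, a0, b0)
r1 = recorded_success(P1, a1, b1)

logL = (s0 * np.log(r0) + f0 * np.log(1 - r0) +
        s1 * np.log(r1) + f1 * np.log(1 - r1))
logL = np.where(valid, logL, -np.inf)

# Marginal likelihood for e: integrate over p0 with a flat prior g(p0), as in Eq. (8).
L_e = np.trapz(np.exp(logL - logL.max()), p0_grid, axis=1)
print("adjusted likelihood for e peaks near", round(float(e_grid[np.argmax(L_e)]), 3))
```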

 Intensity and Additive Bias.  Often the circumstances
 of a study do not precisely match the circumstances
 of interest, making it difficult to compare studies or
 to  apply  them  directly  (without  adjustment)  to a
 particular assessment.  Examples are differences in
 the  population,  the technology,  the providers,  or
other factors that can modify  a technology's  effect.
 These  differences can affect the interpretation of a
study in two basic ways, to create what will be called
an intensity bias and/or an additive bias.

A bias is an intensity bias if it has a proportional
effect on the effectiveness of the technology.
Specifically, a factor is said to affect the intensity of
a technology if, whatever the effect of the
technology in the absence of the factor (e), the
presence or modification of the factor causes the
effect (call this e') to be e' = τe, where τ is the
measure of the magnitude of the bias caused by the
factor.  Thus, τ = 1 implies a factor causes no
intensity bias.  Examples of factors that affect the
intensity of a technology are the dose of a drug,
frequency of an examination, skill of a provider, type
of equipment, or susceptibility of a patient to a
treatment.  The notion of intensity is described in
statements such as "this technology has improved
20% since the study was completed" and "the effect
of this technology in community hospitals will be only
80% of that observed in a research setting."  Notice
that if a technology has no effect (e = 0), presence or
modification of the factor will leave the technology
with no effect, and if the technology is harmful,
increasing its intensity will indicate it is more
harmful.

An additive bias shifts the probability or magnitude
of an outcome in the treated and/or control groups by
a constant amount.  Specifically, a factor is said to
cause an additive bias if, whatever the true effect of
the technology in the circumstances of interest (e),
the bias causes the observed effect (e') to be e' = β + e,
where β is the amount of the additive bias.
Thus, β = 0 implies no additive bias, and β can be
positive or negative.  An example of a factor that
causes additive bias is any difference between the
treated and control groups of a controlled trial that
can modify the probability of the health outcome,
even in the absence of the technology (e.g., age of
the patient, severity of the health problem).

If a particular study is thought to be affected by one
or more factors that create an intensity or additive
bias, with estimates of the intensity bias (τ) and
additive bias (β), the likelihood function for p0 and p1,
and therefore for e, can be found by substituting
β + τe for e in Equation (8).  For example, in a
randomized controlled trial the adjusted joint
likelihood function for e and p0 would be given by

    L(e, p_0 \mid s_0, f_0, s_1, f_1, \beta, \tau) \propto p_0^{s_0} (1-p_0)^{f_0} \, (p_0 + \beta + \tau e)^{s_1} (1 - p_0 - \beta - \tau e)^{f_1}        (12)

Uncertainty about β or τ can be described with
probability distributions and the likelihood function
integrated over those variables.  If there are several
independent factors affecting additive bias and/or
intensity bias, β and τ can be vectors.

 Additional issues must  be considered when  adjusting
 for a  bias  that is uncertain or variable,  and that
 affects more than a single experiment.  One example
 is  that several experiments can be affected  by the
 same bias.   Another example is that a  factor  can
 modify the  effect of a  technology  on  a  health
 outcome, but  the  effect must  be  estimated  from
 indirect evidence.  In such a case it is not possible to
 adjust the bias  simply by  adjusting  the likelihood
 function for an individual experiment (as in Eq. [12]).

In the first example the appropriate approach
depends on whether the biases affecting the separate
experiments are independent.  Suppose there are n
studies affected by the same biases β and τ, which
have distributions φ(β) and λ(τ).  If the biases are
independent, the likelihood function for the combined
evidence in the n studies is given by

    L(p_0, p_1 \mid X_{.1}, X_{.2}, \ldots, X_{.n}) = \prod_{i=1}^{n} \int\!\!\int L_i(p_0, p_1 \mid X_{.i}, \beta, \tau) \, \phi(\beta) \, \lambda(\tau) \, d\beta \, d\tau        (13)

If the biases are completely dependent, such that
their magnitudes are the same in all n experiments,
then

    L(p_0, p_1 \mid X_{.1}, X_{.2}, \ldots, X_{.n}) = \int\!\!\int \left[ \prod_{i=1}^{n} L_i(p_0, p_1 \mid X_{.i}, \beta, \tau) \right] \phi(\beta) \, \lambda(\tau) \, d\beta \, d\tau        (14)
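The difference between Equations (13) and (14) is only the order of integration and
multiplication.  The sketch below is a minimal illustration: the two normal "study"
likelihoods, the discrete grid over τ, and the choice to set the additive bias β to zero
are all assumptions made for the example, not values from the paper.

```python
import numpy as np
from scipy.stats import norm

e = np.linspace(-0.5, 1.0, 1501)
tau = np.linspace(0.6, 1.4, 81)                 # candidate intensity biases
w = norm.pdf(tau, loc=1.0, scale=0.15)          # lambda(tau), discretized
w /= w.sum()

def L_study(e_grid, t, mean, sd):
    # Likelihood for e from one study whose observed effect e' = tau * e is
    # approximately normal with the given mean and sd (Eq. 12 with beta = 0).
    return norm.pdf(t * e_grid, loc=mean, scale=sd)

studies = [(0.30, 0.06), (0.22, 0.08)]          # assumed (mean, sd) for two studies

# Eq. (13): biases independent across studies -> integrate each study, then multiply.
L_indep = np.ones_like(e)
for mean, sd in studies:
    L_indep *= sum(wk * L_study(e, tk, mean, sd) for tk, wk in zip(tau, w))

# Eq. (14): one common bias -> multiply the studies first, then integrate once.
L_dep = sum(wk * np.prod([L_study(e, tk, m, s) for m, s in studies], axis=0)
            for tk, wk in zip(tau, w))

for name, L in [("independent biases", L_indep), ("common bias", L_dep)]:
    L = L / np.trapz(L, e)
    print(f"{name}: mean of normalized likelihood for e = {np.trapz(e * L, e):.3f}")
```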
The second example can arise if all the evidence is
gathered in experimental circumstances that are
different from the circumstances of interest.  In such
cases, the approach is first to analyze all the
evidence to derive a posterior distribution for the
technology's effect in the experimental setting, call
this π(e'|X..), and then derive a posterior
distribution for the technology's effect in the
circumstances of interest by substituting e = β + τe'
for e'.  If there is uncertainty about β or τ, it
can be incorporated by convolution.  That is,

    \pi(e \mid X_{..}) = \phi(\beta) * \left[ \lambda(\tau) \bullet \pi(e' \mid X_{..}) \right]        (15)

where * is the convolution operator and • is the
multiplication operator for two distributions.12  Use
of Equation (15) will be illustrated below.
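With samples, Equation (15) reduces to drawing e', β, and τ independently and forming
e = β + τe'.  The Monte Carlo sketch below is an illustration only; the three input
distributions (a normal posterior for e' and normal distributions for β and τ) are
assumptions chosen for the example, not quantities estimated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed posterior for the effect in the experimental setting, pi(e' | X..).
e_exp = rng.normal(loc=0.05, scale=0.02, size=n)

# Assumed distributions for the uncertain additive and intensity biases.
beta = rng.normal(loc=0.0, scale=0.01, size=n)     # phi(beta)
tau  = rng.normal(loc=1.25, scale=0.10, size=n)    # lambda(tau)

# Equation (15): e = beta + tau * e', i.e. convolution with a product distribution.
e_interest = beta + tau * e_exp

print("mean effect in circumstances of interest:", e_interest.mean().round(4))
print("95% interval:", np.percentile(e_interest, [2.5, 97.5]).round(4))
```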

 Dilution and Contamination.  A  frequent problem in
 controlled trials is that  some subjects in the "treated
 group" might not receive the designated technology,
 which  "dilutes"  the  observed effectiveness  of  the
 technology,  and some  subjects in  the control group
 might  inappropriately  receive  the   technology,
 thereby  "contaminating"  the   control  group   and
 distorting   the   observed   effectiveness   of   the
 technology.  In some cases,  the number of subjects in
 each  group who "cross over"  and  the  number  of
 successes and  failures in each group  are known.  If it
 is reasonable to assume they are similar to the other
 subjects  with  respect  to  factors  that  can  affect
 outcomes,  the counts  could be corrected  and  an
 appropriate likelihood function could be derived along
 the lines of Equation (8).

Let q0 be the number of subjects in the control group
who cross over to receive the technology, and let q1
be the number of subjects in the group offered the
technology (the "treated" group) who cross over and
do not receive the technology.  Further, let j and k be
the numbers of successes in the crossover control
group and the crossover treated group, respectively.
Then

    L(p_0, p_1 \mid s_0, f_0, s_1, f_1, q_0, q_1, j, k) \propto
        p_0^{\,s_0 - j + k} (1-p_0)^{\,f_0 - (q_0 - j) + (q_1 - k)} \,
        p_1^{\,s_1 - k + j} (1-p_1)^{\,f_1 - (q_1 - k) + (q_0 - j)}        (16)

As before, the marginal likelihood function for e
can be calculated by substituting p0 + e for p1 to
derive a joint likelihood function for e and p0, and
then integrating that with respect to a (possibly
noninformative) prior distribution for p0.

A more complicated formula is required if there is
reason to believe that subjects who cross over are not
similar to the other subjects with respect to the
expected effect of the technology.  Let β0 and τ0
be additive and intensity biases, respectively, that
cause the subjects in the control group who cross
over to respond differently to the technology than
subjects chosen randomly from the "treated" group.
Let p0' designate the probability of a success in this
subset of the control group that crosses over to
receive treatment.  Then in these subjects the effect
of the technology is e0' = p0' - p0 = β0 + τ0 e =
β0 + τ0(p1 - p0), and the probability of a success is
p0' = p0 + β0 + τ0(p1 - p0).  Similarly, let β1 and τ1 be
additive and intensity biases that cause subjects in
the treated group who cross over to respond
differently to the technology than subjects chosen
randomly from the control group.  Let p1' be the
probability of a success in this subset of the treated
group.  Then in these subjects the effect of the
technology is e1' = p1 - p1' = β1 + τ1 e = β1 + τ1(p1 - p0),
and the probability of a success is
p1' = p1 - β1 - τ1(p1 - p0).  Using this notation,

    L(p_0, p_1, p_0', p_1' \mid s_0, f_0, s_1, f_1, q_0, q_1, j, k) \propto
        p_0^{\,s_0 - j} (1-p_0)^{\,f_0 - (q_0 - j)} (p_0')^{\,j} (1-p_0')^{\,q_0 - j} \,
        p_1^{\,s_1 - k} (1-p_1)^{\,f_1 - (q_1 - k)} (p_1')^{\,k} (1-p_1')^{\,q_1 - k}        (17)

An expression in terms of p0, p1, β0, τ0, β1, and τ1 is
easily obtained by substituting p0 + β0 + τ0(p1 - p0) for
p0' and p1 - β1 - τ1(p1 - p0) for p1' in Equation (17).
As before, a likelihood function for e can be
obtained by substituting p0 + e for p1 and
integrating over a (possibly noninformative) prior
distribution for p0.  Uncertainty about the βs or τs
can be described with probability distributions and
the likelihood function integrated over those
variables.

Equations (16) and (17) will not be illustrated here.

Multiple-Link Chains.   Derivation of a  Confidence
Profile from indirect evidence involving multiple-link
chains  is  performed  in two  steps:   first   evaluate
evidence   for each link in the chain  to  derive  a
probability distribution  for  that  link,  and  then
calculate  across  the links  to  obtain  a probability
distribution for the entire chain.  Formulas for the
first  step have been  described.  Formulas for the
second step  will  be  given for  a  two-link  chain
involving  dichotomous  intermediate  outcomes  and
health outcomes, taking into account the possibility
that  the  intermediate  outcome  is not  a perfect
indicator for the health outcome.

Derivation of the formula is illustrated in Figure 3.
Let T0 represent the event that the technology is not
done (the control group); T1 the event that the
technology is done (the treated group); I0 the event
that the intermediate outcome does not occur; I1 the
event that the intermediate outcome does occur; and
H the event that the health outcome occurs.  Thus,
the circles represent various events and combinations
of events (e.g., [I1, T1] is the event that the
technology is done [T1] and the intermediate outcome
occurs [I1]).

Define q as the probability of the intermediate
outcome occurring in the absence of the technology
(q = Prob(I1|T0)), and p as the probability of the
health outcome in the absence of the intermediate
outcome and the absence of the technology
(p = Prob(H|I0,T0)).  Let e_tu be the difference in the
probability of the intermediate outcome caused by
the technology, and e_uh the difference in the
probability of the health outcome caused by the
intermediate outcome.  The "inaccuracy" of the
intermediate outcome as an indicator of the health
outcome is described by two parameters, λ0 and λ1
(defined in Eq. [18]); let f0(·) be the distribution for
λ0 and f1(·) be the distribution for λ1.  Let
γi = Prob(I1|Ti), and let φi(·) be the distributions for
γi, for i = 0, 1.  Then the distribution for the effect
of the technology on the health outcome is obtained
by combining these distributions across the chain
(Eq. [25]), where * is the convolution operator and •
is the multiplication operator.13  Distributions for
fi(·) and φi(·) can be derived from empirical data by
deriving a likelihood function for the parameter and
multiplying by a noninformative prior for the
parameter.  When deriving distributions for the
elements of Equation (25), care must be taken to
ensure that they are independent; the same piece of
evidence cannot contribute to more than one
distribution.

Example of Calculating Multiple-Link Chains.
Equation (25) can be used to derive a probability
distribution for the effect of t-PA compared with IV
SK on one-year survival, using reperfusion as an
intermediate outcome.  A distribution for e_tu is
obtained from Equations (8) and (2), with s0, f0, s1, f1
estimated from the TIMI study (s0 = 44, f0 = 78,
s1 = 78, f1 = 40) and a noninformative prior.  Similarly,
a distribution for e_uh is obtained from Equations (8)
and (2) using data reported by Kennedy (summarized
in Fig. 4); s0 = 85, f0 = 17, s1 = 14, f1 = 0, where the
subscript 0 refers to patients who do not reperfuse
and the subscript 1 refers to patients who reperfuse.
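Setting aside the correction factors of Equation (25), the simple product rule of
Equation (3) can be evaluated by simulation.  The sketch below is a rough illustration
only: it uses the normal approximation of Equation (9) for each link (a poor
approximation for the Kennedy arm with zero failures) and omits the inaccuracy
corrections, so it will not reproduce the corrected Profile in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def link_samples(s0, f0, s1, f1, size):
    """Draw effect samples from the normal approximation of Eq. (9)."""
    n0, n1 = s0 + f0, s1 + f1
    mu = s1 / n1 - s0 / n0
    sd = np.sqrt(s0 * f0 / n0**3 + s1 * f1 / n1**3)
    return rng.normal(mu, sd, size)

# Link 1 (TIMI counts): effect of t-PA vs IV SK on reperfusion.
e_tu = link_samples(44, 78, 78, 40, n)
# Link 2 (Kennedy counts): effect of reperfusion on one-year survival.
e_uh = link_samples(85, 17, 14, 0, n)

# Uncorrected chain combination (Eq. 3): effects multiply along the chain.
e_th = e_tu * e_uh

print("uncorrected effect of t-PA vs IV SK on one-year survival")
print("mean:", e_th.mean().round(3),
      " 95% interval:", np.percentile(e_th, [2.5, 97.5]).round(3))
```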


Distributions for the "inaccuracy" of the
intermediate outcome can be estimated from data
that relate the intermediate outcome to the health
outcome, with and without thrombolysis (as
defined in Eq. [18]).  Kennedy's data (Fig. 4) again
can be used to derive the needed distributions for λ0
and λ1.  The distribution is obtained by deriving a
likelihood function for the difference in rates (see
Eq. [8]) and multiplying by a noninformative prior
(see Eq. [2]).14

Distributions for Prob(I1|T0) and Prob(I1|T1) are
obtained in a similar fashion: first derive a
likelihood function for the true rate of the event of
interest (e.g., I0, the nonoccurrence of the
intermediate outcome) as a function of the observed
rates, and then multiply by a noninformative prior.

Combining the Evidence from More than One Chain.
Each chain yields its own distribution for the
technology's effect.  Specifically, let π1(e|X.1) be the
distribution for the technology's effect derived from
the first chain, π2(e|X.2) the distribution derived
from the second chain, and so forth for an arbitrary
number of chains (n).  We seek the distribution
π(e|X.1, X.2, ..., X.n) based on all n chains.
For each chain, by Bayes' formula,

    \pi_i(e \mid X_{.i}) = k_i \, L_i(e \mid X_{.i}) \, \pi(e)        (30)

and therefore

    L_i(e \mid X_{.i}) = \frac{\pi_i(e \mid X_{.i})}{k_i \, \pi(e)}        (31)

where π(e) is the (noninformative) prior distribution
for e, and ki is a normalizing constant.

Furthermore, because no piece of evidence is used in
more than one chain,

    \pi(e \mid X_{.1}, X_{.2}, \ldots, X_{.n}) = k \, L_1(e \mid X_{.1}) \, L_2(e \mid X_{.2}) \cdots L_n(e \mid X_{.n}) \, \pi(e)        (32)

Substituting Equation (31) into (32),

    \pi(e \mid X_{.1}, X_{.2}, \ldots, X_{.n}) = k' \, \pi(e) \prod_{i=1}^{n} \frac{\pi_i(e \mid X_{.i})}{\pi(e)}        (33)

where k and k' are normalizing constants.
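On a grid, Equation (33) amounts to dividing each chain's posterior by the shared
noninformative prior before multiplying, so the prior is not counted twice.  The sketch
below is a minimal illustration; the two normal chain posteriors are assumptions made
for the example, not the Profiles computed in the paper.

```python
import numpy as np
from scipy.stats import norm

e = np.linspace(-0.1, 0.3, 2001)

prior = np.ones_like(e)
prior /= np.trapz(prior, e)                    # flat (noninformative) prior pi(e)

# Assumed posteriors from two chains built on disjoint evidence.
pi1 = norm.pdf(e, loc=0.06, scale=0.03)        # pi_1(e | X.1)
pi2 = norm.pdf(e, loc=0.04, scale=0.05)        # pi_2(e | X.2)

# Equation (33): pi(e | X.1, X.2) = k' * [pi1/pi] * [pi2/pi] * pi.
combined = (pi1 / prior) * (pi2 / prior) * prior
combined /= np.trapz(combined, e)

print("combined posterior mean:", np.trapz(e * combined, e).round(4))
```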

 ASSESSMENT  OF  THE  EFFECT  OF  t-PA  ON
 ONE-YEAR SURVIVAL

The evidence summarized in Table 1 and the formulas
just given can be used to derive a Confidence Profile
for the effect on one-year survival of t-PA versus
conventional care (Eddy 1986).  The result is shown in
Figure 6, marked π(e|X..).
Notice that none of the existing controlled trials
examines this question directly; the Profile must be
derived from the indirect evidence in Table 1, using
the methods just described.  The specific steps are (1)
construct a two-link chain that relates t-PA (versus
IV SK) to reperfusion, and reperfusion to one-year
survival; (2) use the results of the TIMI study and
Equation (8) to derive a probability distribution for
the first link, the effect of t-PA (versus IV SK) on
reperfusion; (3) use the results of the Kennedy study
to derive a probability distribution for the second link
of the chain (Eq. [19]), and to estimate the
inaccuracy of reperfusion as a predictor of one-year
survival (Eq. [21]); (4) combine the evidence about the
links (and the connection between links) to derive a
probability distribution for the effect of t-PA versus
IV SK on one-year survival (Eq. [25]); (5) combine
the results of 20 other studies (summarized in Yusuf
et al 1985) to derive a probability distribution that
relates IV SK to conventional care (Eq. [8]) (see Eddy
1986 for details); (6) use Equation (28) to derive a
probability distribution for the effect on one-year
survival of t-PA versus conventional care (call this
π1).  This distribution is illustrated in Figure 6,
marked π1(e|X.1).

Then (7) construct a new chain that relates t-PA
versus conventional care (instead of IV SK) to
reperfusion, and reperfusion to one-year survival; (8)
use the results of Collen's study to derive a
probability distribution for the first link (Eq. [8]); (9)
using probability distributions for the second link and
the inaccuracy of reperfusion, calculate across the
chain (with corrections for the inaccuracy of
reperfusion) to derive a probability distribution for
the effect of t-PA versus conventional care on
one-year survival (Eq. [25]) (marked π2(e|X.2) in Fig. 6);
and (10) combine π1 and π2 by Equation (33).  The
result is the Confidence Profile in Figure 6
marked π(e|X..).  This Profile combines the evidence in
the TIMI, Collen, and Kennedy studies, as well as 20
RCTs that compare IV SK with conventional care.
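Equation (28), used in step (6), is not reproduced in this excerpt, but its logic follows
step 5 of the overview (Eq. [5]): when effects are measured as differences in
probabilities, the effect of t-PA versus conventional care is the sum of the effect of
t-PA versus IV SK and the effect of IV SK versus conventional care, i.e., a convolution
of the two distributions.  The Monte Carlo sketch below is an illustration only; the two
normal input distributions are assumptions, not the Profiles computed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Assumed (illustrative) distributions for the two component effects on one-year
# survival, both measured as differences in probabilities.
e_tpa_vs_ivsk  = rng.normal(loc=0.020, scale=0.015, size=n)   # t-PA vs IV SK
e_ivsk_vs_care = rng.normal(loc=0.025, scale=0.010, size=n)   # IV SK vs conventional care

# Convolution (as in Eq. 5): differences in probabilities add across the comparison.
e_tpa_vs_care = e_tpa_vs_ivsk + e_ivsk_vs_care

print("illustrative effect of t-PA vs conventional care on one-year survival")
print("mean:", e_tpa_vs_care.mean().round(3),
      " 95% interval:", np.percentile(e_tpa_vs_care, [2.5, 97.5]).round(3))
```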

The  Method can  also be used to derive a Profile for
the effect of IV SK (compared with  conventional
care).   That  Profile, the result of combining direct
evidence from 20 RCTs in  step  5  above, is shown
beside the Profile for t-PA in Figure 7.  (Because  of
the high degree of certainty about the effectiveness
of IV SK, the scale for Fig. 7 is twice as high as the
scale for the other figures.)

Adjustment for  Intensity  Bias.  The  use  of the
Confidence Profile  Method  to  adjust   for  possible
biases can be illustrated with a problem that arises in
the  interpretation of  the  evidence on t-PA.  The
effect  of  t-PA observed in published  RCTs  might
underestimate the true effect of t-PA in realistic
settings,  because  in   the  trials  patients  were
catheterized   before   administration   of   t-PA
(to observe perfusion), which delayed administration
of the  drug,  which  in turn might have  decreased  its
effectiveness. The impact of this possibility can  be
included in the assessment by estimating how much
more effective  t-PA  might  be in actual  clinical
settings.   (This  estimate can  be based on  animal
studies,   knowledge    of    clotting    mechanisms,
knowledge of the mechanism of action  of the drug,
and  review of human studies that recorded outcomes
as a function  of time.)  For  example, if the  trials are
believed to understate  the  true  effect  of  t-PA  by
about  20% (implying an intensity bias  in the trials
of τ = 0.8), the Profile marked τ = 0.8 in Figure 8
would be obtained by applying Equation (12). If there
were uncertainty about the estimated intensity bias
described, say, by a beta distribution with mean =
0.8 and variance = 0.2, the Profile marked τ ~ 0.8
in Figure 9 would be obtained. Figure 9 also includes
for comparison the original, unadjusted Profile
(marked τ = 1).
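
One simple way to picture such an adjustment (a hedged sketch only, not the paper's Equation (12)) is to treat the effect observed in the trials as τ times the effect in actual practice and to divide Monte Carlo draws from the trial-based profile by draws of τ; the distributions below are placeholders.

# A sketch of folding an uncertain intensity bias tau into a profile by
# dividing draws of the trial-based effect by draws of tau.  The profile and
# the beta parameters below are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

effect_trials = rng.normal(0.04, 0.02, n)     # placeholder profile from the trials

a, b = 8.0, 2.0                               # beta(8, 2): mean a/(a+b) = 0.8
tau = rng.beta(a, b, n)                       # variance chosen only for illustration

effect_practice = effect_trials / tau         # adjusted (larger) effect in practice
print("unadjusted mean effect:", effect_trials.mean())
print("adjusted mean effect:  ", effect_practice.mean())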

Comparing Generations of Technologies. Use of  the
Confidence Profile  Method to  compare  different
variations of  a technology is illustrated in  Figures 6
and  10. Figure 6 showed the effects of two individual
versions  (t-PA  and  IV SK),  both compared with
conventional  care.  Figure  10  shows the  effect  of
t-PA compared with IV SK,  calculated from Equation
(29).16
 RESEARCH PLANNING

 Once  Confidence Profiles  for  the  effects  of  a
 technology on various outcomes have been derived,
 they can be used for a variety of purposes, such as
 adjustment  for  risk  aversion,  comparison  of  a
 technology's  benefits  and  harms,  derivation of  a
 measure  of overall  benefit, and  research planning.
 This section  uses the  results  of  the assessment of
 t-PA  compared  with conventional care  to illustrate
 one of these uses—research planning.

-------
Research planning  is a complicated  activity,  ideally
involving  estimating   the   probability   a   new
experiment   will   yield  particular  results,   the
probabilities   those particular results  will change
behavior,  and  the  change   in   health   outcomes
expected  from   the   change  in  behavior.   The
Confidence Profile itself provides an estimate of the
third element—how use of the technology is expected
to change health outcomes.  The Confidence  Profile
Method  can also  estimate  the  first element—the
probability a particular experiment will yield a result
that will change behavior.  For convenience we will
call this a "Delta Result" or Δ Result (drawing on the
common use of the Greek letter "Δ" to denote a
difference or change). Notice that a single
experiment can produce several different Δ Results,
depending on the type of action the result triggers,
and the force with which it triggers it (i.e., the
proportion of people who will change behavior if the
result occurs). For example, if an experiment
indicates a technology causes a 60% increase in
survival, 99% of physicians might adopt it, whereas if
the experiment indicates a 5% increase in survival,
only 10% of physicians might adopt it. When the
chance of a Δ Result and the Profile for the effect of
a change in use of the technology are combined with
estimates of the second element—how a Δ Result will
change use of the technology—the impact of the
experiment on health outcomes can be calculated,
different experiments compared, and priorities set.

Use of the Confidence Profile Method to calculate
the probability a particular experiment will yield a Δ
Result is illustrated for an RCT, using a particular Δ
Result—a statistically significant result. The
principles behind the calculations are as follows.
Let δ be the observed difference between the rates in
the treated and control groups of an RCT (thus δ =
s₁/n₁ − s₀/n₀), let ε be the true difference, and
let f(δ|ε) be the distribution for the observed
difference, given a true difference of ε. Define u
to be the threshold for determining statistical
significance of an RCT under the null hypothesis of
no effect.

u will depend on the level of statistical significance
chosen, and whether a one-tail or two-tail test is
being performed. For example, if a level of
significance of p = 0.05 is chosen, and if we are
performing a one-tail test for a positive difference in
rates of a dichotomous outcome, then u is found by
solving the following equation for u:

     0.05 = ∫ᵤ^∞ f(δ | ε = 0) dδ                                (34)
To calculate the probability that an RCT of
particular size n₀ and n₁ will yield a statistically
significant result, first derive a distribution for the
outcome of the trial, based on the current
distribution for the true effect of the technology (the
Confidence Profile π(ε | χ₁, χ₂)). This distribution is

     f(δ) = ∫ f(δ | ε) π(ε | χ₁, χ₂) dε                          (35)

For the one-tail test just described, the probability a
trial will yield a statistically significant result is

     Prob(δ > u) = ∫ᵤ^∞ f(δ) dδ                                  (36)

To apply Equation (36) we need a distribution
for f(δ|ε). This will depend on the experimental
design and the available empirical data. For
example, if the contemplated trial is an RCT, it is
appropriate to expand f(δ|ε) over p₀ to obtain

     f(δ | ε) = ∫ f(δ | p₀, ε) g(p₀) dp₀                          (37)

where g(p₀) is a prior distribution for p₀. For
sufficiently large sample sizes, f(δ | p₀, ε) is well
approximated by a normal distribution with mean μ
= ε and variance σ² = p₀(1 − p₀)/n₀ + (p₀ + ε)(1 − p₀ − ε)/n₁,
where n₀ and n₁ are the numbers of
observations in the control and treated groups of the
contemplated experiment. A distribution for p₀ can
be derived from the currently existing evidence using
the methods described in previous sections (e.g., Eqs.
[26] and [27]).

Application of Equation (36)  is illustrated  with  an
analysis of the probability that RCTs of various sizes
will yield statistically significant results (one-tail, p
= 0.05), using the Profile for the effect of the
technology illustrated in Figure 6.  The  results are
shown  in  Figure  11.  For example, given the existing
evidence   about   the  effect   of   t-PA  versus
conventional care, the probability a  new RCT with a
total  of  1200 patients would  show a  statistically
significant  increase in  one-year  survival  is  about
80%.  If the trial  were to simulate "actual  practice"
and not involve catheterization, the probability of a
statistically   significant result  would   be  higher
because   the   Profile  for  the technology in  this
circumstance  shows a greater effect (see Figs. 8 and
9). This will not be illustrated here.
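
A minimal simulation sketch of the calculation in Equations (34)-(37) follows; it assumes placeholder distributions for p₀ and for the true effect (not the Figure 6 Profile) and uses a standard one-tail z test in place of the exact threshold u.

# Monte Carlo sketch of the probability that a contemplated RCT of size
# n0, n1 yields a one-tail statistically significant difference in rates,
# given current uncertainty about p0 and about the true effect.  The
# distributions and sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n_sim, n0, n1 = 200_000, 600, 600

p0 = rng.beta(30, 70, n_sim)                     # placeholder posterior for p0
eff = rng.normal(0.05, 0.03, n_sim)              # placeholder profile for the true effect
p1 = np.clip(p0 + eff, 0.0, 1.0)

# simulate the trial outcome delta = s1/n1 - s0/n0
s0 = rng.binomial(n0, p0)
s1 = rng.binomial(n1, p1)
delta = s1 / n1 - s0 / n0

# one-tail z test of "no effect"; 1.645 is the one-tail 5% critical value
pbar = (s0 + s1) / (n0 + n1)
se0 = np.sqrt(pbar * (1 - pbar) * (1 / n0 + 1 / n1))
significant = delta / se0 > 1.645

print("probability of a statistically significant result:", significant.mean())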

This section has focused on calculating the
probability of a statistically significant result.
Recall that the method is more general, and can be
applied to a wide variety of Δ Results. Examples of
other Δ Results are that the experiment will show
the technology has a "positive" effect (δ > 0), the
experiment will show an effect between, say, 0 and
10% (0 < δ ≤ 0.10), the experiment will show an
effect greater than 10% (δ > 0.10), and so forth.


DISCUSSION

The health of millions of people and the  expenditure
of billions of  dollars depend on the decisions of health
care   practitioners  and  policymakers   about  the
appropriate use of  medical  technologies.   If  these
decisions  are  not  to be  arbitrary, they should be
based  on  estimates of the effects or outcomes of the
technologies—what  good  or  harm they  can  be
expected    to  cause.   Making   these   estimates
accurately, however, can be extremely difficult.  The
traditional  approach  involves  collecting  individual
pieces of  evidence, and "synthesizing" their results
into  a  conclusion  by  a single  global  subjective
judgment.  The result of this process  is  usually  a
statement such as "the  technology should be used for
the  following indications..."   Rarely  is  there  an
explicit description of the expected magnitude of the
technology's effect, much  less  a description of how
that   magnitude  was   estimated  or  the  range of
uncertainty.

For complicated  assessment  problems  that  involve
many pieces of evidence, evidence from studies with
different   designs,   indirect   evidence   involving
intermediate outcomes, biases, or other complicating
factors,     this   approach   is   vulnerable   to
oversimplification,  errors  in  reasoning, and  wishful
thinking.

The Confidence  Profile  Method was  developed to
provide  a  formal   framework   and  formulas   for
adjusting and combining evidence, and incorporating
subjective  judgments,  to  estimate  a  technology's
effect on outcomes.  The Method breaks the  process
of evaluating evidence into parts—down to the level
of  individual  chains, individual  pieces of  evidence.
and individual biases—and then  combines the parts.
The result is a quantitative (and visual)  description of
a  technology's  effect, both  the  magnitude of  the
effect and the range of uncertainty about the effect.

Depending on the available evidence, other
techniques are available for making quantitative
estimates of the effect of a technology on a health outcome. If
the evidence  consists of a single RCT that compares
the  designated  technology  with  the   designated
control,   in   circumstances   that   match    the
circumstances  of  interest,  standard   statistical
methods can  be  used to  estimate the  effect of  the
technology and confidence limits.17  If the evidence
consists of  several RCTs that all compare the same
technology   with   the   same   control   in    the
circumstances of interest, their results can be pooled, again
yielding  an  estimate of  the  effect  and  confidence
limits.  If there  are several  RCTs, but some differ
with respect  to the recipients or other confounding
factors, these differences can sometimes be adjusted
for   by  stratification   and   related   statistical
techniques   (Kleinbaum et  al  1984,  Anderson et al
1980).  Meta-analysis18  can be  used  to  analyze a
collection of RCTs and  calculate an  estimate  and
confidence  limits   for   the   "effect  size"   of  a
technology19 (Glass 1977; Hedges and Olkin 1985). A
collection of  RCTs  involving  dichotomous  outcomes
can be analyzed to  calculate a  combined odds-ratio
and confidence limits for the  odds ratio (Mantel and
Haenszel 1959; Mantel 1966; Peto et al 1977).


All these methods require direct evidence from
controlled trials.  When this type of  evidence exists,
the  Confidence  Profile   Method  can  also  derive
probability   distributions   for   the   effects   of
technologies,  measured in a variety of  ways such as
the difference in probabilities,  the  odds  ratio,  the
or the percent change in outcome rate. However, there is a
large class  of  technologies for which the existing
evidence  is  not  suitable for  analysis  by  other
techniques,  but  that  can  be  analyzed   with  the
Confidence  Profile Method.  Some features  of  this
class of technologies are: (1) there are no RCTs—the
assessment  must  be based on one  body of evidence
relating  the  technology  to intermediate outcomes
and/or followup actions, and other evidence  relates
the intermediate outcomes (and  followup actions) to
health outcomes; (2) there are RCTs, but they differ
from each other with respect  to the technology being
assessed, the  control,  the  population, the  providers,
or other important  features;  (3) there are multiple
studies   with   different   designs  (e.g.,   RCTs,
nonrandomized  controlled  studies,  clinical  series,
case-control studies, cross-sectional studies);  and (4)
interpretation  of individual  pieces  of evidence is
complicated  by  errors  in outcome  measurement,
crossover of patients between "treated" and control
groups, differences  in length of followup,  and other
important   factors.   The   assessment   of   t-PA
illustrates many of these features.

The  Confidence  Profile Method  also  differs from
other  methods  by  its  formal  incorporation   of
subjective  judgments. It is  important to understand
the role of subjective judgment in the Confidence
Profile Method.  Specifically, the Confidence  Profile
Method  does  not   require  subjective  judgments.
Rather  it  enables  decisionmakers  to incorporate
subjective  judgments should they feel a need  to  do
so.  If the  evidence is  clearcut and decisionmakers
are  content  to  accept  it  at  face  value,  the
Confidence Profile  Method  can  be used to derive a
posterior distribution for   the  technology's  effect
without  using  any  subjective  judgments (other than
the initial judgment that the evidence  is "clearcut,"
and other than the choice of a noninformative prior).
If, on the other hand, a decisionmaker identifies
factors  that  influence  the  interpretation  of  the
evidence,   and   if   the  decisionmaker   wants   to
incorporate subjective judgments about  these  factors
in the interpretation of the evidence, the Confidence
Profile  Method   provides  a  formal  language  for
accomplishing that.

It will never be possible to completely eliminate  the
need for subjective judgments in the  evaluation of
health technologies.  The Confidence Profile Method
attempts to improve the use of subjective judgment
in  several   ways.  First,  by  providing  a   formal
framework  for breaking the assessment problem into
parts,  the  Method  decreases   the   demands   on
subjective    judgments.    Instead   of   requiring
policymakers and experts to make  global  judgments
about dozens of  factors all at once (e.g.,  "Should
t-PA  be  used for  patients  with  acute  MI?"),  the
Method allows  them to focus on  one factor at a time
(e.g.,  "By  how   much  (what  proportion) does   a
60-minute delay in administration of t-PA  reduce its
effectiveness  in  increasing  survival after  an  acute
MI?").  Second,   the judgments  are  targeted   at
elements  that  are  intuitively  accessible,  and  for
which there  is  usually some supporting  empirical
evidence or practical experience.  Third, the  Method
allows anyone who is uncertain about a parameter  to
express   that   uncertainty   as   a   probability
distribution.  The uncertainty thus expressed  about
any parameter  will be  carried by the  formulas
through the entire analysis,  automatically (according
to  the axioms of  probability theory) combined with
any uncertainty  about  any other  parameters, and
included in the final Confidence Profile.  Fourth, the
Method makes assumptions  and  judgments explicit,
allowing for review.  Fifth, it  provides  a  formal
language for combining subjective judgments; experts
can think together about a  parameter  and describe
their collective beliefs in a probability distribution.
Disagreements   can  be  explored   by  performing
separate assessments, or  resolved by  describing  a
bimodal distribution for  the parameter in question.
Last, the Method  can be used to estimate the value
of  additional  information about  a  parameter that
must be estimated subjectively.


Once derived, the Confidence Profiles  have  several
uses  in the design  of health  policies.  The Profiles
themselves provide explicit, visual descriptions of the
effect  of   the  technology—including the  range  of
uncertainty   about   the   effect—for   use   by
decisionmakers    (patients,    practitioners,    and
policymakers).   They  can  be  revised  to  test  the
impact  of different  assumptions  or beliefs about a
variable, or to tailor  an  assessment to a particular
set of circumstances (defined  by  a particular set of
parameters).    Because   the  Profiles  provide  a
quantitative description of  the uncertainty about a
technology's effect, they enable  the use of formal
methods   for   incorporating  risk   aversion  (e.g.,
calculating certainty  equivalents   and  expected
utilities).   The  Profiles  (or certainty  equivalents)
provide a  basis for comparing a technology's benefits
and harms (using multidimensional utility theory), and
enable the derivation of  a quantitative measure of
"overall"   benefit  (or   harm).   The  quantitative
measure  of   benefit   also  provides  a  basis  for
estimating  a   technology's marginal   returns,  and
therefore for  setting priorities.  Profiles  based on
existing evidence can be used to estimate the value
of conducting additional empirical  research to collect
more empirical evidence.  Finally,  a Profile can be
continually revised  to  incorporate  new  evidence
about a particular aspect of the assessment (e.g., new
direct evidence,  new indirect   evidence,  or  new
evidence  about  a particular parameter [e.g., bias]  in
the assessment).

The Confidence Profile Method (like all methods of
technology assessment) depends on the quality of the
available evidence. While the Confidence Profile
Method can combine evidence from many sources,
and can adjust evidence for a wide variety of biases,
it cannot create evidence where it does not exist.
As with all methods of technology assessment, the value
of the Confidence Profile Method increases as the
volume and quality of empirical research improve.
                   FOOTNOTES

   1  The  method  itself  is  more  general,  being
applicable to the assessment  of  evidence about the
effect of a wide variety of interventions on a  wide
variety of outcomes. However, the development of
the  Confidence   Profile   Method   was  initially
stimulated by a need to assess health technologies.
  2 Some outcomes can  be  both  health outcomes
(people care about them) and  intermediate outcomes
(they are physiological  variables that indicate  the
probability of other health  outcomes.)  An example is
obesity.
  3 In general, for each link, the antecedent event (on
the left)  will be  called an "action" and the subsequent
event (on the right) will  be  called  an "outcome."
Thus for a two-link  chain involving one intermediate
outcome:  for the   first  link  the  "action" is  the
performance of the technology, and the "outcome" is
the intermediate outcome; for  the second link the
"action"  is   the  occurrence  of the intermediate
outcome, and the "outcome" is the health outcome.
    4 This  assumes   the   pieces of  evidence  are
independent.   If   pieces   of   evidence   are  not
independent,  a  joint  likelihood  function  for  the
dependent pieces of evidence  must  be derived before
use of Equation (1). Equation (1) also assumes that
there is a particular true value of ε that all the
pieces of evidence are trying to estimate. If there is
reason to believe this is not the case—that the  effect
being estimated by one experiment is different than
the effect being estimated by another—hierarchical
Bayesian  methods   can  be  used  to  estimate  a
distribution  for the true effects (Wolpert  and Eddy
1986).
  5 For  example, for  dichotomous health  outcomes
and intermediate outcomes,  this condition requires
that Prob(H|I,T) = Prob(H|I,T₀) = Prob(H|I), where H is
the occurrence of the health outcome, I is the
occurrence of the intermediate outcome, T is the
performance of the designated technology, and T₀ is
the designated control.
  6 The two distributions should be independent in the
sense that each is  derived from  different  pieces of
evidence.
  7 These judgments  are  called "focused" because
they are made one  at  a time about specific elements
of an assessment.
  8 As with standard  statistical methods,  sensitivity
analyses  might still  be  required  to  examine  the
importance  of structural  assumptions, such  as  the
choice  of  a  statistical  model  for  a   particular
experiment.
  9 The 95% "range  of  confidence" is defined here as
the range that has a 95% posterior  probability of
containing the true value.  It is  not the same as  a
confidence  interval  or confidence limit (see footnote
17).
   10  Where  there  is  no  danger of  ambiguity,  the
parameters   on  which the  likelihood  function is
conditioned will not be listed. Thus in this case L(χ, ε) is
written for the likelihood function conditioned on ε and on all
of the remaining parameters.
  11 Coronary angiography involves placing a catheter
at the opening of the coronary artery and injecting a
dye.  Either the catheter  or the dye could possibly
open an occluded artery.
   12 Let X and Y be random variables with
distributions f_X(x) and f_Y(y), respectively. The
distribution for the random variable W = X + Y,
denoted f_W(w), is calculated as

     f_W(w) = ∫ f_X(x) f_Y(w − x) dx.

The distribution for the random variable Z = XY,
denoted f_Z(z), is calculated as

     f_Z(z) = ∫ (1/|x|) f_X(x) f_Y(z/x) dx.
 13 See footnote 12.
   14 The first distribution is calculated from
 Equation (8) using s₀ = 35, f₀ = 6, s₁ = 85, f₁ = 17;
 the second is calculated from Equation (8) using
 s₀ = 14, f₀ = 5, s₁ = 14, f₁ = 0.
   15 In this case the distribution for Prob(I₀|T₁) is
 calculated from Equations (26) and (27) with s₀ = 41,
 f₀ = 93, and the distribution for P(I₁|T₁) is calculated
 from Equations (26) and (27) with s₁ = 93 and f₁ = 41.
   16 The Profile in Figure 10 is different from the
 Profile in Figure 5 (indicating greater certainty and a
 slightly greater effect), because the former is based
 on the results of the TIMI study, while the latter
 incorporates information from both the TIMI study
 and Collen's study.
   17 "Confidence limits" (and "confidence intervals")
 do  not   define  a probability  distribution  for a
 parameter.  Confidence limits can be thought of as
 defining the set of null hypotheses that will cause the
observed results  of  a trial  to  be not  statistically
significant at a specified significance level.
  18 The term "meta-analysis"  is often used in two
senses.  It  is  the name of  a specific technique for
combining  evidence to estimate the effect size.  It
has also been used  as a general term for the entire
class of techniques used to combine evidence from
many  sources.   Here  the  term  is  used  in  the
restricted sense.
  19 The effect  size is defined as the difference  in
magnitude  or rate  of the outcome with and  without
the technology, divided by the standard deviation of a
parameter in the control group.
                   REFERENCES
Eddy,  D.M. 1986.  The Use of Confidence Profiles to
Assess Tissue-Type Plasminogen Activator.  Chapter
in Acute Coronary  Care 1987  G.S.  Wagner and R.
Califf (Eds). Martinus Nijhoff Publishing Company.

Eddy,  D.M.  and  R.  Wolpert.   Extensions  of  The
Confidence   Profile    Method   for    Technology
Assessment I,  Center for Health Policy Research and
Education Working Paper, 1986 (in preparation).

Wolpert,  R.   and D.M.  Eddy.   Extensions  of  The
Confidence   Profile    Method   for    Technology
Assessment II, Center for Health Policy Research and
Education Working Paper, 1986 (in preparation).

Jeffreys, H. Theory of Probability (3rd Edn.).  Oxford
University Press, London, 1961.

Bernardo, J.M. Reference Posterior Distributions for
Bayesian Inference (with discussion). J. Royal Statist.
Soc. B 41, 113-147, 1979.

Collen, D., E.J. Topol, A.J. Tiefenbrunn et al. 1984.
Coronary  Thrombolysis  with  Recombinant  Human
Tissue-Type Plasminogen Activator:  A Prospective,
Randomized,   Placebo-Controlled  Trial.  Circulation
70, 1012-1017.

Hedges, L.V. and I. Olkin. 1985. Statistical Methods for
Meta-Analysis. Academic Press, London.

Kennedy, J.W., J.L. Ritchie, K.B. Davis, et  al. 1985.
The   Western  Washington   Randomized  Trial  of
Intracoronary  Streptokinase in Acute  Myocardial
 Infarction.   A  12-Month  Follow-up  Report.  NEJM
 312, 1073-1078.

 Kleinbaum, D.G., L.L. Kupper, H. Morgenstern.
 1984.   Epidemiologic  Research,   Principles   and
 Quantitative    Methods.      Lifetime     Learning
 Publications, Belmont, CA. pp 343-351.

 Anderson, S., A. Auquier, W.W. Hauck et al. 1980.
 Statistical Methods for Comparative Studies:
 Techniques for Bias Reduction. John Wiley & Sons,
 New York.

 Mantel, N.  1966.  Evaluation of Survival  Data  and
 Two  New   Rank  Order  Statistics  Arising  in   Its
 Consideration. Cancer Chemother Rep 50, 163.

 Mantel, N.   and W.  Haenszel.   1959.   Statistical
 Aspects of the Analysis of  Data from Retrospective
 Studies of Disease. JNCI 22, 719-748.

 Peto, R., M.C. Pike, P. Armitage et al. 1977. Design
 and Analysis of Randomized Clinical Trials Requiring
 Prolonged Observation of Each Patient. Br. J. Cancer.

 Glass, G.V. 1977. Integrating Findings: The
 Meta-Analysis of Research, in L. Shulman (Ed),
 Review of Research in Education, Vol. 5, Itasca, IL,
 Peacock.

 TIMI  Study   Group.   1985.   The   Thrombolysis  in
 Myocardial Infarction (TIMI) Trial.  Phase I Findings.
 NEJM 312, 932-936.

 Verstraete,  M.,  R.  Bernard,  M. Bory  et al.  1985a.
 Randomised   Trial   of  Intravenous  Recombinant
 Tissue-Type Plasminogen Activator versus
 Intravenous Streptokinase in Acute Myocardial
 Infarction. Lancet 1, 842-847.

Verstraete,  M.,  W.  Bleifeld,  R. W.  Brower  et al.
 1985b.    Double-Blind    Randomised    Trial   of
Intravenous    Tissue-Type   Plasminogen   Activator
versus  Placebo  in   Acute  Myocardial  Infarction.
Lancet 2, 965-969.

Yusuf,  S.,   R.  Collins,  R.   Peto  et   al.  1985.
Intravenous and Intracoronary  Fibrinolytic Therapy in
Acute Myocardial Infarction:  Overview of Results on
Mortality,  Reinfarction  and  Side-Effects  from  33
Randomized Controlled Trials.  European Heart J. 6,
556-585.

-------
              TABLE 1: EVIDENCE FOR t-PA ANALYSIS

  Study          Treatment   Control              Outcomes (treatment and control rates,
                                                  p-value, and statistical significance
                                                  are reported for each outcome)
  TIMI           t-PA        IV SK                reperfusion*; in-hospital mortality
  Collen         t-PA        placebo              reperfusion*; mortality†
  Verstraete a   t-PA        IV SK                perfusion*; in-hospital mortality
  Verstraete b   t-PA        placebo              perfusion; in-hospital mortality
  Kennedy        IC SK       conventional care    reperfusion; 12-month mortality

    * Rates are given for patients with partial or total occlusion.
    † Rates are given for
-------
FIGURE 3

FIGURE 4. Tree display of trial outcomes: thrombolytic agent (134), divided
into reperfuse (93) and not reperfuse (41), each further divided into live and
die; conventional care (116), divided into reperfuse and not reperfuse, each
further divided into live and die.
-------

-------
FIGURE 9                                                   FIGURE 10

-------
                     FIGURE 11. Probability of a Statistically Significant Result
-------
                              Choosing a Measure of Treatment Effect

                                     Robert L. Wolpert, Duke University
1. Introduction
   Any new drug, surgical procedure, or other medi-
cal  treatment must be shown  to  be effective  before
many insurance companies will reimburse for its use,
and hence before it can become part of general medi-
cal  practice.  Before it can displace existing alterna-
tive  treatments,  it must  be  shown  to  have  some
advantage—to be more effective, less expensive,
safer, or more convenient. Showing that such a treat-
ment is effective at all, or more effective than existing
treatments, requires evidence.
   Evidence about the effectiveness of an experimen-
tal treatment can take many different forms, depend-
ing  upon  the design  of  the experiment  intended to
measure or detect  treatment effect;  the simplest evi-
dence to analyze is that from a well-designed random-
ized controlled trial (RCT).
   In such a trial  subjects are randomly assigned to
one of two or more groups. Usually one group (called
the  control group)  receives  "conventional care" (the
standard and expected treatment  at the time  of the
trial) while another (called the treated group) receives
the  experimental treatment, but in  all other respects
the  patient protocols are identical  for the two groups.
In more complicated designs several groups might be
given different treatments or different variations of a
single treatment, all to be compared simultaneously.
   The evidence from the trial consists of recorded
measurements of experimental quantities for each sub-
ject.  From this  evidence the  investigator can try to
detect and quantify any systematic difference between
the  treated group  and the control  group;  since the
patient  protocols  were  otherwise  identical,  such  a
difference must be  attributed to either chance variation
(from  the random assignment  of  subjects  to the
groups) or to the treatment.
   A systematic difference between the groups could
be caused by improvement (or harm) caused  by the
treatment, by chance  variations in  the study popula-
tions, by side effects of the treatment or control proto-
cols, or even by differing sample sizes in the treated
and  control groups.  The  investigator must choose a
measure of treatment  effect which is sensitive to the
expected improvement or harm caused  by  the treat-
ment, and relatively insensitive to  unimportant side-
effects and to chance variations.
   The large  sample sizes  necessary  to  minimize
chance fluctuation are  not always attainable in medical
trials. This forces  us  to  pay  careful attention  to the
probability distribution of the chance variations due to
random sampling and may lead us  to consider com-
bining  the  evidence  from multiple  trials.  When  the
evidence from a single trial is inconclusive, or when
the evidence from several  sources seems  to be con-
tradictory, we might like  to pool the evidence from
more than one trial and make inferences on the basis
of a synthesis of all available evidence.
   The investigator has more  freedom in the  choice
of an effect measure when inference about treatment
effect is to be made using the evidence from a single
trial than when evidence from several studies must be
combined.  We  will see  below  that it is sometimes
possible for an  investigator to  choose a measure of
effect which is comparable across studies,  despite  the
inevitable variations  in patient  population and treat-
ment detail, and to pool the evidence  using objective
Bayesian methods.
   The problem of synthesizing evidence from multi-
ple trials is a statistical minefield where at every step
we  are   tempted   to   make   assumptions   and
simplifications which can threaten the validity  of  our
analysis. Yet without making some of these assump-
tions we can  make no progress at all.  We have to
assume that the treatments studied in the several trials
are in  some  way comparable, for  example, and  that
the effect of treatment can  be compared meaningfully
despite differences among the experimental conditions
(such as patient populations) of the trials. We usually
make broad assumptions about the stochastic indepen-
dence of evidence from separate  trials, and also of the
conditional  independence  of  study  results  within
different arms of each single trial.  This note illus-
trates a simple fact— that the "technical" assumptions
made to simplify a statistical analysis  often have  real
consequences. Sometimes we must adapt our method
of  analysis  to assure  that the  assumptions are  not
flagrantly  false,  in order to assure that our findings
will accurately reflect what was observed.
    In Section 2 the  ideas are  introduced in a  simple
example, the binomial RCT, in which only a single bit
of data is taken  for each  subject. The formalities  and
notation necessary for the case of evidence from more
general designs  of clinical trials is considered else-
where (Wolpert and Berger (1986), Wolpert and Eddy
(1986)).  A summary follows in Section 3.  I would
like to  thank David Eddy  for introducing me to  the
idea of using  Bayesian methodology to combine  evi-
dence from clinical trials.  The present work grew out
of my efforts to understand and to extend the scope of
his method of Confidence Profiles (see Eddy 1986).

-------
2. The Binomial Trial
   A binomial  trial is an  RCT in which the only
experimental quantity  measured  for each subject  is
whether the subject did or did not experience a partic-
ular favorable event, such as one-year survival  follow-
ing a surgical procedure.  The  random assignment of
subjects to  treatment groups and the (assumed) sto-
chastic  independence  of outcomes among  subjects
together guarantee that the total number X^t of subjects
in the treatment group who do experience the favor-
able outcome (which we call a "success") will have
the binomial probability distribution with known sam-
ple size n^t but unknown success probability p^t; simi-
larly the number X^c of successes in the control group
will have the binomial distribution with possibly
different parameters n^c and p^c.

2.1.  Measures of Treatment Effect
   Since the measured outcome was described as
"favorable", the treatment will be regarded as effective
if p^t exceeds p^c, and its effect will be quantified as

                     ε = g(p^t, p^c)                (2.1)

for some function g(p^t, p^c) which vanishes when p^t = p^c
and is positive when p^t > p^c. The investigator can
choose among many such functions g(·,·), and so
among many measures ε of treatment effect.
   One such measure of treatment effect is the
change in probability of success

                     CP := p^t − p^c                (2.2a)

With this measure (and the law of large numbers) it is
especially easy to predict the increased number of
successes if the treatment is given to some number N
of subjects: it is just N × CP. An individual patient or
physician considering the treatment might be more
interested in the relative-risk of failure

                     RR := (1 − p^t)/(1 − p^c),     (2.2b)

which represents the fractional decrease in the proba-
bility (1 − p) of the unfavorable outcome. It would be
especially meaningful for cases in which the pre-
treatment success probability p^c is close to one. In
case p^c is close to 0, the fractional-increase in success
probability

                     FI := p^t/p^c                  (2.2c)

would be more meaningful. Another choice, which
(for small effects) is nearly equal to RR when p^c is
near 1, to FI when p^t is near 0, and to the exponen-
tial e^(4 × CP) for moderate p^c, is the odds ratio

                     OR := [p^t/(1 − p^t)] / [p^c/(1 − p^c)]     (2.2d)
 Estimates of the odds ratio are often reported in pub-
 lished accounts of clinical trials or retrospective stu-
 dies of treatment effect.
   It is more convenient to work with a measure of
effect which is positive when the treatment is helpful
(i.e. when p^t > p^c) and vanishes when p^t = p^c, so we
make simple transformations using logarithms where
necessary to find:
                                               (2.2)
Change in Probability:      ε_CP := p^t − p^c

Log Relative-Risk:          ε_RR := −log[(1 − p^t)/(1 − p^c)]

Log Fractional-Increase:    ε_FI := log(p^t/p^c)

Log Odds-Ratio:             ε_OR := log{[p^t/(1 − p^t)] / [p^c/(1 − p^c)]}

The untransformed measures can be recovered as
CP = ε_CP, RR = e^(−ε_RR), FI = e^(ε_FI), and OR = e^(ε_OR).
    Any of these four choices might be an appropriate
way to measure or to report treatment effect; in each
case ε = 0 if p^t = p^c (whatever the values of n^t and n^c)
and ε > 0 if and only if the treatment improves suc-
cess probability, i.e. p^t > p^c. From each measure ε
and the value of p^c, any of the other measures can be
computed. How is the experimenter to choose among
them?
    An answer emerges when we consider the problem
of  combining  the  evidence  about e from  several
independent  trials, and consider carefully  the assump-
tions we  will want  to make.  For simplicity we will
assume that  the conditions  of the trials were substan-
tially  identical except  possibly for some inevitable
variation  in  the patient  populations  (and  thus  in the
success probabilities p_i^c and p_i^t across trials), so that
the true treatment effect

                     ε = g(p_i^t, p_i^c)            (2.3)

is the same in every trial. For each fixed p^c the measure
g(·, p^c) is assumed to be invertible in p^t, so that p_i^t can
be recovered from ε and p_i^c by
some function G(·,·) satisfying:

                     p_i^t = G(ε, p_i^c).           (2.5)
The binomial probability distribution function for X^t
and X^c can now be rewritten as a function of ε and p^c,
yielding a joint likelihood function for these two
parameters:

   L(ε, p^c) ∝ G(ε, p^c)^(X^t) [1 − G(ε, p^c)]^(n^t − X^t) (p^c)^(X^c) (1 − p^c)^(n^c − X^c)     (2.6)

Note that the Jacobian
of the transformation (p^t, p^c) → (ε, p^c) does not enter the
formula for the transformed likelihood function,
though it does enter the formula for the transformed
prior density (and therefore the formula for the
transformed posterior density). If we have a prior
density function expressing joint uncertainty about p^t
and p^c, and wish to transform to one expressing joint
uncertainty about ε and p^c, we would calculate it as
the product

   π_{ε,p^c}(ε, p^c) = π_{p^t,p^c}(G(ε, p^c), p^c) · |∂G(ε, p^c)/∂ε|     (2.7)
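
For the log-odds-ratio measure, the inversion G(ε, p^c) and the joint likelihood of Equation (2.6) can be written out directly; the short Python sketch below (with invented trial counts) is one way to do so under that choice of measure.

# Sketch of Equations (2.5)-(2.6) for the log-odds-ratio measure: G(eps, pc)
# gives the treated-group success probability implied by eps and pc, and
# loglik evaluates the joint log likelihood for one binomial trial.
import numpy as np

def G(eps, pc):
    """Treated-group probability implied by log odds ratio eps and control pc."""
    odds_t = np.exp(eps) * pc / (1.0 - pc)
    return odds_t / (1.0 + odds_t)

def loglik(eps, pc, xt, nt, xc, nc):
    """Joint log likelihood of (eps, pc) for one trial, Equation (2.6) up to a constant."""
    pt = G(eps, pc)
    return (xt * np.log(pt) + (nt - xt) * np.log(1 - pt)
            + xc * np.log(pc) + (nc - xc) * np.log(1 - pc))

# example (invented counts): 60/100 successes in the treated arm, 45/100 in the control arm
print(loglik(eps=0.6, pc=0.45, xt=60, nt=100, xc=45, nc=100))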
Of course, it is sometimes more natural to specify the
joint prior density function π_{ε,p^c}(ε, p^c) directly. If it is
proper and nondegenerate, it can be written as a pro-
duct of a marginal and a conditional density function
in either of two ways:

   π_{ε,p^c}(ε, p^c) = π_ε(ε) π_{p^c|ε}(p^c | ε)              (2.8a)

   π_{ε,p^c}(ε, p^c) = π_{p^c}(p^c) π_{ε|p^c}(ε | p^c)        (2.8b)

If evidence about ε is available from two or more stu-
dies, all with the same p^c, we should multiply the pro-
duct of all the individual likelihood functions times
the joint prior density function in order to find a con-
sensus joint posterior, then integrate with respect to p^c
and multiply by a normalizing constant c to find a
marginal posterior density for ε:

   π_ε(ε | X_1, ..., X_k) = c ∫ { ∏_i L_i(ε, p^c) } π_{ε,p^c}(ε, p^c) dp^c     (2.9)
Indeed  this possibility  of combining  evidence from
multiple studies is one of the principal reasons Baye-
sians  have for  computing likelihood functions  (rather
than posterior densities) from experiments.  It is  the
product of likelihood functions, and not posterior den-
sities, which must appear between the braces { • • • }
in (2.9), to avoid including  the prior density function
(k+1) times instead of once.  For this same reason it is
important  to include the Jacobian  term  only  when
transforming   density   functions   and   not   when
transforming likelihood functions.
   Unfortunately, it is frequently the case that the
separate trials do not share the same p^c, so that (2.9)
cannot be used to find a posterior density function for
ε. Rather, p_i^c is a nuisance parameter which varies
from study to study. In that case it is appropriate to
find the posterior density function for ε in each study
individually by integrating the product of the indivi-
dual likelihood function (2.6) and a joint prior density:

   π_ε(ε | X_i^t, X_i^c) = c_i ∫ L_i(ε, p_i^c) π_{ε,p^c}(ε, p_i^c) dp_i^c     (2.10a)

This will give a marginal posterior density function
for ε on the basis of the observations (X_i^t, X_i^c) from the
i-th study, but how can we combine them?  What we
really need are likelihood functions for ε, not posterior
density functions.
   It is not obvious what a "marginal likelihood func-
tion for ε" is or how one ought to be computed in the
presence of nuisance parameters like p_i^c, but we do
know what we would like to be able to do with one —
multiply it times a (marginal) prior density function
π_ε(ε) to produce the posterior density function
π_ε(ε | X_i^t, X_i^c). Substituting (2.8a) into (2.10a) gives

   π_ε(ε | X_i^t, X_i^c) = c_i π_ε(ε) ∫ L_i(ε, p_i^c) π_{p^c|ε}(p_i^c | ε) dp_i^c     (2.10b)

and suggests that we define our marginal likelihood
function by the quotient π_ε(ε | X_i^t, X_i^c) / π_ε(ε):

   L_i(ε | X_i^t, X_i^c) := ∫ L_i(ε, p_i^c) π_{p^c|ε}(p_i^c | ε) dp_i^c     (2.11)

Now that we have a likelihood function for each trial,
we can combine the evidence from the several trials
to find a posterior density function for ε given all the
observed data:

   π_ε(ε | X_1, ..., X_k) = c π_ε(ε) ∏_i L_i(ε | X_i^t, X_i^c)     (2.12)

The use and properties of marginal likelihood func-
tions (2.11) and posterior measures (2.12) are
described elsewhere (Wolpert and Berger, 1986).
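
Numerically, the marginal likelihood (2.11) and the combination (2.12) can be approximated on a grid. The sketch below assumes the log-odds-ratio measure, a flat prior for p^c on (0, 1) (permissible when ε and p^c are taken to be a priori independent, as discussed next), and invented trial counts.

# Grid sketch of Equations (2.11)-(2.12): integrate each trial's joint
# likelihood over the nuisance parameter pc with a flat prior, then multiply
# the resulting marginal likelihoods and a flat prior for eps.
import numpy as np

def G(eps, pc):
    odds_t = np.exp(eps) * pc / (1.0 - pc)
    return odds_t / (1.0 + odds_t)

def marginal_lik(eps_grid, xt, nt, xc, nc, n_pc=400):
    """L_i(eps): integrate the joint likelihood over pc with a flat prior (Eq. 2.11)."""
    pc = np.linspace(0.001, 0.999, n_pc)
    E, P = np.meshgrid(eps_grid, pc, indexing="ij")
    pt = G(E, P)
    lik = pt**xt * (1 - pt)**(nt - xt) * P**xc * (1 - P)**(nc - xc)
    return lik.mean(axis=1)                       # average over pc ~ flat prior

trials = [(60, 100, 45, 100), (33, 50, 20, 50), (140, 250, 120, 250)]   # invented (xt, nt, xc, nc)
eps_grid = np.linspace(-1.0, 2.0, 601)

post = np.ones_like(eps_grid)                     # flat prior for eps
for xt, nt, xc, nc in trials:
    post *= marginal_lik(eps_grid, xt, nt, xc, nc)      # Equation (2.12)
post /= post.sum() * (eps_grid[1] - eps_grid[0])

print("posterior mean log odds ratio:", np.sum(eps_grid * post) * (eps_grid[1] - eps_grid[0]))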
   It is here that an opportunity arises to simplify the
analysis. If we can choose an effect measure

                     ε = g(p_i^t, p_i^c)

in such a way that ε and p_i^c are a priori independent,
i.e.

   π_{ε,p^c}(ε, p^c) = π_ε(ε) π_{p^c}(p^c),               (2.8c)
then the conditional prior π_{p^c|ε}(p_i^c | ε) in (2.11) and
(2.12) is just the marginal prior π_{p^c}(p_i^c), with no func-
tional dependence upon ε. This simplifies the
integrals and, moreover, allows us to use a noninfor-
mative prior for p_i^c in computing a marginal posterior
density function (2.12) for ε. Using a noninformative
prior minimizes the influence of any subjective opin-
ion on the analysis.
    We will return in Section 3 to the consequences
and advantages of using an effect measure indepen-
dent (under the prior) of p^c, i.e. one satisfying (2.8c);
we first consider what that independence means in
specific examples, and how to achieve it in general.

2.2. The Assumption of Prior Independence
    The assumption of prior independence of ε and p_i^c
requires the use of an effect measure ε = g(p^t, p^c) con-
sistent with that assumption, but simplifies the analysis
thereafter by allowing the investigator to ignore prior
beliefs and preliminary evidence about the value of p^c
in each study when searching for evidence from that
study about the treatment effect.
    Consider once again the four effect measures
introduced in (2.2) for binomial trials:

 Change in Probability:      ε_CP

 Log Relative-Risk:          ε_RR

 Log Fractional-Increase:    ε_FI

 Log Odds-Ratio:             ε_OR
One way to investigate the prior dependence of ε and
p^c is to predict how the treatment would act on subpo-
pulations with differing pretreatment success probabil-
ities p^c. Suppose, for example, that the treatment
were known to improve the pretreatment success pro-
bability of a subpopulation with p^c = 0.50 to a post-
treatment probability of p^t = 0.60. What would be the
success probability following treatment for a different
subpopulation, one with a pretreatment success proba-
bility of only p^c = 0.25?  Or for a subpopulation with
a higher pretreatment success probability of p^c = 0.75?
The four proposed measures of treatment effect differ
in their predictions.
    A treatment whose effect is to change the success
probability by a fixed amount, and which improves
one subpopulation from p^c = 0.50 to p^t = 0.60, must
add 0.10 to p^c for each subpopulation. This would
increase p^c = 0.25 to p^t = 0.35, p^c = 0.75 to p^t = 0.85, and
lead to impossibly high predicted success probabilities
for p^c ≥ 0.90. Conversely a treatment whose effect is to
add a fixed constant to the log-odds, or maintain a
fixed odds ratio, and which improves one
subpopulation's success odds from 0.50/0.50 = 1.0 to
0.60/0.40 = 1.5, must generate an odds ratio of 1.5 in
each subpopulation. This would increase p^c = 0.25
(with odds 0.25/0.75 = 1/3) to p^t = 1/3 (with odds
(1/3)/(2/3) = 0.5 = 1.5 × 1/3), for a net increase of only
0.0833 in the success probability, and would increase
p^c = 0.75 (with odds 0.75/0.25 = 3) to p^t = 9/11 (with
odds (9/11)/(2/11) = 4.5 = 1.5 × 3) for a net increase of
0.0682. Such an improvement in odds ratio would
improve p^c = 0.01 or 0.99 only to p^t = 0.015 or 0.995,
respectively, for an increase of only about 0.005,
while p^c = 0.5 is increased twenty times as much, a
full 0.100. In general a treatment whose effect is
measured as an odds ratio would be expected to cause
a smaller increase in success probability near the
extremes of p^c ≈ 0 or p^c ≈ 1 than near intermediate
values such as p^c ≈ 0.5, while one whose effect is
measured as a shift in the probability of success
would be expected to cause the same size of increase
for any p^c. Table 2.1 summarizes some of the predic-
tions of each of these four treatment effect measures,
with asterisks (***) indicating an impossibly large
prediction.
                         Success probability p^t following treatment (change from p^c in parentheses)

  Effect Measure           p^c=.01     p^c=.25     p^c=.50     p^c=.75     p^c=.99
  Increased probability    0.110       0.350       0.600       0.850        ***
                           (+.100)     (+.100)     (+.100)     (+.100)
  Relative risk            0.208       0.400       0.600       0.800       0.992
                           (+.198)     (+.150)     (+.100)     (+.050)     (+.002)
  Fractional increase      0.012       0.300       0.600       0.900        ***
                           (+.002)     (+.050)     (+.100)     (+.150)
  Odds ratio               0.015       0.333       0.600       0.818       0.993
                           (+.005)     (+.083)     (+.100)     (+.068)     (+.003)

                        Table 2.1.  Measures of Treatment Effect for Binomial Experiments
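
The entries of Table 2.1 can be reproduced directly from the definitions in (2.2); the short script below fixes each measure at the value implied by the reference improvement from p^c = 0.50 to p^t = 0.60 and evaluates the implied post-treatment probability at the other values of p^c.

# Reproduce the Table 2.1 predictions: entries above 1 are impossible ("***").
def predict(pc):
    cp = pc + 0.10                                   # fixed change in probability (+0.10)
    rr = 1 - 0.8 * (1 - pc)                          # fixed relative risk of failure (0.8)
    fi = 1.2 * pc                                    # fixed fractional increase (1.2)
    odds = 1.5 * pc / (1 - pc)                       # fixed odds ratio (1.5)
    orr = odds / (1 + odds)
    return cp, rr, fi, orr

names = ["Increased probability", "Relative risk", "Fractional increase", "Odds ratio"]
cols = (0.01, 0.25, 0.50, 0.75, 0.99)
print("Effect measure".ljust(22) + "".join(f"pc={c}".ljust(10) for c in cols))
for i, name in enumerate(names):
    cells = []
    for pc in cols:
        v = predict(pc)[i]
        cells.append(("***" if v > 1 else f"{v:0.3f}").ljust(10))
    print(name.ljust(22) + "".join(cells))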

-------
   If we expect that the increase in success probabil-
ity due to treatment will be smaller for subpopulations
with very high or very low initial success probability
p^c, then necessarily our prior beliefs about p^c and ε
cannot be independent if we measure treatment effect
as increased success probability ε_CP := p^t − p^c. With
such a measure the conditional density π_{ε|p^c}(ε | p^c)
would have to be more concentrated near ε = 0 for p^c
close to 0 or 1 than for p^c close to 0.5, and in particu-
lar it must display functional dependence upon p^c.
Similarly in this situation p^c and ε could not be
independent under the prior distribution if the effect of
such a treatment were measured as a decrease in the
relative risk or a fractional increase in the success pro-
bability; of the four measures introduced in (2.2), only
the odds-ratio measure is consistent with a smaller
shift p^t − p^c at both extremes than for moderate p^c if p^c
and ε are to be independent.  Only the relative-risk
measure is consistent with a larger shift for small p^c
and a smaller shift for large p^c, and only the
fractional-increase measure is consistent with the
opposite pattern of a smaller shift for small p^c and a
larger one for large p^c, if p^c and ε are to be indepen-
dent.
   The example above illustrates how one can
approach the problem of choosing a measure of treat-
ment effect in general to assure that, under the prior
specification, ε and p^c are stochastically independent.
First imagine a subpopulation with moderate popula-
tion characteristics p_0^c. One at a time consider several
possible treatment outcomes p_0^t for the studied treat-
ment on that moderate subpopulation, from the most
pessimistic (p_0^t < p_0^c) to no-effect (p_0^t = p_0^c) to the most
optimistic (p_0^t > p_0^c). For each, predict the treatment
outcome p_1^t for other subpopulations with population
characteristics p_1^c varying over the gamut from 0 to 1
(or at least over a subinterval of high prior probabil-
ity), and try to identify a useful invariant ε = g(p^t, p^c)
(such as p^t − p^c, p^t/p^c, (1 − p^t)/(1 − p^c), etc.).
   The procedure just described leads to an effect
measure ε whose conditional distribution, given p^c,
does not depend on p^c — i.e. to an effect measure sto-
chastically independent of p^c.
   For convenience in later computations,
reparameterize if necessary so that ε = 0 denotes no-
effect and ε > 0 describes a treatment which improves
the success probability; in our case, that required tak-
ing logarithms or negative logarithms.  In the more
general setting (described in Wolpert and Berger
(1986)) it may also be necessary to introduce a nui-
sance variable η_i describing population attributes in
the i-th treated population unrelated to treatment effect,
so that the treated-group parameter (here p^t ∈ (0,1), but
more generally some parameter θ^t taking values in a
parameter space Θ^t) can be written as a function of ε,
η_i, and the control-group parameter θ^c:

                   θ^t = G(ε, η_i, θ^c).
2.3.  Benefits of an Independent Effect Measure
   Business leaders and policy makers  welcome  the
opportunity Bayesian  analysis  affords  of explicitly
incorporating prior knowledge and subjective  beliefs
into  their  analyses, but scientists  must scrupulously
avoid the appearance of subjectivity.  While the (sub-
jective)  choice of designs,  models,  and  methods
always influences  study results regardless  of whether
the statistical methods used are Bayesian,  frequentist,
or those  of some other school, Bayesian methods have
sometimes been criticized on  the  grounds that their
explicit use of prior information precludes  objectivity.
More recently  objective Bayesian methods have been
developed  which use prior density functions selected
on  some  basis  other than  subjective  beliefs.   A
number  of authors  (e.g.  Berger  (1986), Bernardo
(1979), Jeffreys (1961), and Box and Tiao (1973))
have recommended methods for selecting prior density
functions to meet  various objective criteria.  These
criteria include the preservation  of invariance under
some change of measurement origin or scale, the max-
imization  of  the   likelihood  function's contribution
(and minimization  of the prior's contribution)  to  the
Kullback-Leibler measure of information contained in
the posterior distribution, and the stationarity of the
prior measure under certain infinitesimal deformations.
The  prior density  functions these authors recommend
(which do not always coincide)  are  variously  called
"noninformative" or "reference" priors.
   In general equation (2.12) for the posterior density
function for ε calls for integration of each likelihood
function with respect to a conditional prior measure
on the possible values of p_i^c. If ε and p_i^c are stochasti-
cally dependent, i.e. if the conditional density function
for p_i^c given ε does in fact depend on ε, then a nonin-
formative prior density cannot be used in the integral.
Independence of ε and p^c opens up the possibility of
using a noninformative prior density function π*(dp^c)
in those integrals:

   π_ε(ε | X_1, ..., X_k) = c π_ε(ε) ∏_i ∫ L_i(ε, p_i^c) π*(dp_i^c)     (2.12b)

and hence of avoiding overt subjectivity. While it is
true that subjective prior opinion has still played a
role (in directing the choice of an effect measure
ε = g(·)), that role now enters as part of the unavoid-
able one of model selection.
   Often there are observable patient attributes which
might help a clinician make a more specific  prediction
for a particular patient or  subpopulation of patients
than is possible for all patients together.  For example,
in the binomial trial it might be possible to observe p^c
(or evidence bearing on p^c) in individual subjects or
subpopulations of subjects. In such situations one
would prefer to have a conditional posterior density
function π(p^t | p^c, X_1, ..., X_k) expressing the
uncertainty remaining about p^t after considering all the
evidence from the trials, given the value of p^c, rather
than the posterior density function (2.12b) for some
abstract measure of treatment effect. If an indepen-
dent treatment effect measure has been used, then it is
possible to derive from (2.12b) the desired conditional
posterior density function. For fixed p^c, (2.5)
expresses p^t as an explicit function of ε — thus we can
just change variables in (2.12b) to calculate this con-
ditional posterior density.
With this conditional posterior density for p^t a further
change of variables using (2.3) leads to formulas for
the conditional posterior densities of any of the four
effect measures ε_CP, ε_RR, ε_FI, or ε_OR following obser-
vation of the available study data. The prior indepen-
dence of ε and p^c together with the assumption that p^c
can vary from study to study (which implies that the
studies we have observed offer no evidence about the
value of p^c in a later study) allow us to compute and
report posterior density functions for any of the four
treatment effect measures (or any other which can be
expressed in the form (2.3) and (2.5)):
                                                          (2.15)
Here ε_XX, G_XX(·,·), and f_XX
-------
Box, G.E.P. and G.C. Tiao (1973) Bayesian Inference in
   Statistical Analysis. Addison-Wesley, Reading,
   Massachusetts.
Eddy,  D.M.  (1986)  Confidence profiles:  a Bayesian
   method   for  assessing   health   technologies.
   (appears elsewhere in this issue).
Jeffreys, H. (1961) Theory of Probability (3rd edn.).
   Oxford University Press, London.
Wolpert, R.L. and J.O. Berger (1986) Conditional pri-
   ors and partial likelihood functions,  (in prepara-
   tion).
Wolpert, R.L. and D.M. Eddy (1986) Extensions of
   the  confidence  profile  method for  technology
   assessment  (in preparation).

-------
                              Comment on Eddy's Confidence Profile Method
                                   David A. Lane, University of Minnesota
   David Eddy and his co-workers are in the process
of constructing an impressive and potentially very
useful technology, whose purpose it is to guide the
deliberations of a panel evaluating health practices.
They have chosen to build their technology on a
Bayesian foundation. The main purpose of this
comment is to offer several arguments in support of
that choice, which are presented in Section 1 below.
In Section 2, I raise some questions about the way in
which Eddy implements the Bayesian program and
point out some alternative approaches.

1. Bayesian Analysis and Integrative Inference
   I claim that Bayesian analysis is the right way to
attack the inferential problem that is the focus of
this Conference and of Eddy's paper, the problem of
integrative inference.  To make clear the content of
this claim, I will begin by defining the problem of
integrative inference and describing what Bayesian
analysis is. Then, I will outline the Bayesian
approach to integrative inference and discuss its
advantages.
   The problem of integrative inference.  A decision-
maker is trying to determine what course of action
to take.  Which action is appropriate depends on
what the consequences of the various possible
actions will be.  So the decision-maker asks a panel
of experts to predict these consequences. For
example, he might ask the experts to tell him what
risk of cancer is assumed by some particular
population if it is exposed to various levels of a
particular chemical. The decision-maker expects the
experts' answer to reflect all the evidence available
to them, and he needs them to use this evidence to
produce their best prediction about what will occur
as a function of the action he takes, along with some
measure of uncertainty about their prediction.
    Typically, there are many different streams of
evidence that affect the experts' judgement about
the consequences they want to predict. Some of this
evidence comes from formal studies that directly
relate the actions and outcomes of interest. But
there are other sources as well, involving different
 modes of knowing: theory from basic sciences like
chemistry and toxicology; data from laboratory
 studies using animal models or in vitro cell
 preparations; even hunches based on professional
lore and personal experience.  This evidence cannot
 be ignored; it even affects the way the experts think
 about the meaning of the data obtained from the
 best designed formal studies, as they try to
 generalize or modify them to apply to the special
 circumstances surrounding the actual predictions on
which the relevant decision hinges.
   The different streams of evidence may point in
different directions. The experts have to evaluate
the evidentiary significance of each of the streams,
and then they must integrate their evaluations to
come up with the predictions on which the choice of
action depends.  How are the experts to proceed?
This is the problem of integrative inference.
   Bayesian analysis:  The purpose of a Bayesian
analysis is to measure the  analyst's uncertainty
about some propositions that  are relevant to the
problem at hand, in the light of the available
evidence.  Two attributes distinguish Bayesian
analysis from other statistical methodologies.  The
first is the broad view Bayesian analysis takes of
just what constitutes "evidence".  All of the modes of
knowing -- theory, "hard" data from formal
experimentation, observation and experience -- can
yield evidence that is incorporated into a Bayesian
analysis. Second, in a Bayesian analysis, all
uncertainty, from whatever source, is measured in
the same scale, that of subjective probability.  This
makes it possible to use the laws of probability to
combine the uncertainty about particular
propositions arising from different sources and
ultimately to merge different streams of evidence to
obtain an overall judgement about the plausibility of
the proposition of primary inferential interest. The
use of these laws in this way is supported by  a
normative theory for reasoning in the face of
uncertainty,  the theory of coherent inference
developed by de Finetti (see de Finetti (1974)).
The Bayesian approach to integrative inference:
The Bayesian strategy for integrative inference can
be summarized as follows.  Expert judgement is used
to decompose the problem of primary inferential
interest into a series of component problems, each of
which is more accessible to the knowledge and
experience of the expert analysts. Then, the original
problem and each of the component problems is
formulated in terms of subjective probability
evaluations, and the relations between the problems
determine corresponding relations that must obtain
between the probability evaluations. Next, the
component evaluations are carried out, using
techniques of direct expert elicitation or model-
based Bayesian updating (when appropriate
quantitative data are available). Finally, the
solutions to the component evaluations are merged
according to the laws of probability to yield an
answer to the overall problem. Implementations of
this strategy can be found in Eddy (1980), where it
is applied to the evaluation of effectiveness of cancer
screening, and Lane (1987, 1988), where it is used to
develop a procedure for causality assessment for
                                                   26

-------
adverse drug reactions.
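    As a purely illustrative sketch of the last two steps
of this strategy -- model-based updating of a component
evaluation and merging of the evidence on the probability
scale -- the fragment below combines a hypothetical
expert-elicited prior with hypothetical study data on a
grid.  The Beta prior, the event counts, and the Python
formulation are assumptions made for illustration; they
are not part of Eddy's or Lane's procedures.

    import numpy as np
    from scipy.stats import beta, binom

    # Hypothetical expert opinion about an adverse-event probability p:
    # "around 0.10, very unlikely to exceed 0.25", encoded as a Beta(2, 18)
    # prior (the encoding itself is an assumption made for illustration).
    p_grid = np.linspace(0.001, 0.999, 999)
    prior = beta.pdf(p_grid, 2, 18)

    # Hypothetical study data: 7 events among 50 exposed subjects.
    likelihood = binom.pmf(7, 50, p_grid)

    # Bayes' rule on the grid: the posterior is proportional to prior * likelihood.
    posterior = prior * likelihood
    posterior /= posterior.sum()          # discrete normalization over the grid

    # The posterior mean is the merged "best prediction"; the spread of the
    # posterior expresses the uncertainty remaining after both streams of
    # evidence (opinion and data) are considered on the same scale.
    posterior_mean = float((p_grid * posterior).sum())
    print(round(posterior_mean, 3))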
   Advantages of the Bayesian approach: There are
three important advantages to the Bayesian
approach to integrative analysis, compared to
alternatives such as global introspection (where the
experts list all the relevant factors and sources of
information, and then do an implicit mental
integration to reach their overall conclusion),
qualitative decision algorithms, or frequentist
statistical techniques for combining evidence:
   (1) It answers the question that the decision-
maker asks.  Suppose the decision-maker is
concerned about the lifetime attributable incidence
of cancer, if the individuals in a specified population
are exposed to a chemical at a specified level. The
output of a Bayesian integrative analysis will be a
probability distribution for that incidence. This
distribution describes the experts' uncertainty about
what that incidence would be if the population were
actually exposed at the indicated level, based on all
the evidence available to them.  The mean of that
distribution is their best prediction for what the
attributable incidence will be if the decision-maker
adopts a course of action that results in the given
level of exposure. As such, it is the appropriate
quantity to measure the expected frequency of
cancer due to exposure for use  in a decision  analysis
to select the  best course of action, according  to the
normative theory of decision-making under
uncertainty developed in Savage (1972). The
"spread" of the experts  distribution gives the
decision-maker information about how sensitive his
choice of action is to the residual uncertainty the
experts have about their prediction, in the light of all
the available evidence.  In contrast, frequentist
estimates do not take into account information
derived from modes of knowing other than formal
studies, except through the imprecise process of
model specification, and frequentist measures of
uncertainty only  take into account  sampling
variability and do not regard uncertainty due to
model mis-specification. In addition, it is hard to see
how to combine the estimates of a unit-free effect
size" that are obtained in frequentist  meta-analysis
with measures of the value of the appropriate
consequences, as is required in formal analysis to
determine the best available course of action.
   (2)  All the available evidence can be
incorporated into a Bayesian analysis, in the most
appropriate form. In contrast to frequentist
statistical methods, expert opinion can enter
explicitly into a Bayesian integrative analysis.
Moreover, by skillfully decomposing the problem of
primary interest, the questions that elicit the
experts' opinions can be formulated in such  a way
that the experts can understand their meaning
unambiguously and actually bring the knowledge
and experience that constitute their expertise to
bear to answer them.  In addition, Bayesian analysis
can process quantitative data quantitatively, in
contrast to nonstatistical approaches to integrative
inference like global introspection or qualitative
decision algorithms.  That expert opinion and
quantitative data are both expressed in terms of
subjective  probability in a Bayesian analysis means
that it is straight-forward to merge evidence from
these two different sources.
   (3) The rules for combining evidence in Bayesian
analysis have a normative justification.  The
normative force of the combination rules means that
it is simply inconsistent to disagree with the global
conclusions of a Bayesian  analysis without finding an
appropriate source for the disagreement at a level
localized at those questions of theory and
observation where expertise actually resides.  In
essence,  anyone disagreeing with the conclusion
derived from a Bayesian analysis must believe
either that some relevant piece of evidence was not
included (and, if he says what it is, the omission can
be easily corrected), or that the experts who carried
out the analysis were wrong about some specific
conclusion they derived from their shared
knowledge base (and, if he gives a convincing reason
why, the analysis can be appropriately modified).

2. The Confidence Profile Method
      In practice, Bayesian analysis is only as good as
the  methodology that implements it.  If the
methodology does not ask the user  to produce some
relevant piece of evidence, or elicits it in such a way
that its actual evidentiary content is obscured, then
the analysis will not be based on all the available
evidence, theoretical possibilities notwithstanding.
And if assumptions are built into the methodology
that describe how the experts ought to feel about the
relation  between various propositions, then the
analysis will reflect those assumptions rather than
the experts' actual beliefs.
    To what extent does Eddy's Confidence Profile
Method succeed in fulfilling the promise of Bayesian
integrative analysis?  I will discuss three
reservations I have about it, listed in increasing
order of seriousness.  Two factors must be kept in
mind in mitigation of these critical remarks: first,
the method is still in the early stages of its
development, so it is likely  that later versions will
improve its performance; second, no technology will
ever achieve the normative status of de Finetti's or
Savage's theories.
     (1)  The role that expertise plays in the method
is too restrictive.  The method takes the formal
study as its primary unit of analysis. Expert opinion
                                                   27

-------
is brought to bear qualitatively to create the chain
structure that directs the analysis; from then on, its
only role is to adjust results obtained from each of
the formal studies that the experts regard as
relevant to the analysis. However, direct theoretical
arguments from basic science can affect expert
opinion about the strength of particular links quite
independently of any data obtained from a formal
study, and personal experience and professional lore,
if carefully expressed and evaluated, can
supplement or even substitute for formal studies as
evidentiary sources for certain kinds of propositions.
Neither the paper nor the demonstration of the
method presented at the Conference indicates that
these types of evidence would enter into the
"confidence profile" (or, using a more standard and
less confusing terminology, the posterior distribution
of the outcome of interest).
   (2) The method makes many seemingly
unwarranted assumptions of independence. For
example, consider equation (1), which asserts that
the likelihood function for epsilon, based on the
results of n "independent" studies, can be factored as
the product of n separate likelihood functions, one
for each of the studies. Recall what epsilon is: it is
the "true effect of the technology in the
circumstances of interest".  Now, as Eddy recognizes,
the "circumstances of interest" are not typically the
circumstances in which the n studies were carried
out: for example, patient populations can differ with
respect to key demographic variables (like age and
sex) and severity of the underlying clinical condition,
and the way in which  the treatment is administered
can change from study to study. As a result, it is
essential to adjust the results of each study to take
such differences into account, and so Eddy's method
requires such an adjustment for external validity.
   How do the experts carry out this adjustment?
Presumably, they have a mental model that
describes how they believe the  "true effect" depends
on the different variables for which they adjust.  The
experts' sampling distributions for the results of a
study depend on the study's circumstances and on the
experts' views on how circumstances affect outcome;
thus, these distributions do not depend just on
epsilon, but on the parameters of the adjustment
model as well. Conditioning on  the parameters of
this model and on epsilon, the experts can regard the
results of n "physically independent" studies as
independent random variables.  However, unless
they believe that study data cannot affect their
opinions about the relation between the "true effect"
of a study and the values of the adjustment variables,
the experts' marginal distribution for all the study
results, just given epsilon, will not factor as the
product of the marginal distributions for each
study result, just given epsilon.  This is the
factorization that is asserted by equation (1).  Since
it is nearly impossible to imagine that the experts
would not change their opinions about how the "true
effect" of a health technology depends on such
factors as patient age, sex, baseline clinical condition,
and mode of delivery of the technology, in the light
of data from studies that test the technology on
particular patient populations, equation (1) is  nearly
always false.
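    To locate the assumption concretely, the fragment below
computes the combined likelihood for epsilon exactly as
equation (1) asserts, as a product of per-study factors.
The two-arm binomial studies, the risk-difference effect
measure, and the fixed adjustment constants are all
illustrative choices, not Eddy's model; the factorization
is legitimate only because the adjustments are treated as
known constants that the study data cannot revise.

    import numpy as np
    from scipy.stats import binom

    # Hypothetical studies: (control events, control n, treated events, treated n).
    studies = [(12, 100, 6, 100), (30, 200, 18, 200), (9, 80, 5, 80)]
    # Hypothetical fixed "external validity" adjustments: the assumed shift between
    # each study's circumstances and the circumstances of interest.
    adjust = [0.00, 0.01, -0.01]

    def log_lik(eps):
        """Product-form log-likelihood for a risk difference eps, as in equation (1):
        each study contributes one factor, conditional on its fixed adjustment."""
        total = 0.0
        for (xc, nc, xt, nt), a in zip(studies, adjust):
            pc = xc / nc                              # plug-in control rate (a simplification)
            pt = float(np.clip(pc + eps + a, 1e-6, 1 - 1e-6))
            total += binom.logpmf(xt, nt, pt)
        return total

    # If the experts would revise the adjustments after seeing the study data,
    # this factorization over studies no longer holds; the adjustments would
    # have to be treated as unknown parameters and integrated out.
    eps_grid = np.linspace(-0.15, 0.05, 201)
    loglik = np.array([log_lik(e) for e in eps_grid])
    print(round(float(eps_grid[loglik.argmax()]), 3))   # crude maximum-likelihood value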
   The same kind of argument applies to many
other formulae in Eddy's paper.  What this argument
implies is that the experts will make their successive
adjustments incoherently,  since nothing in the
method will guarantee that they update their
(implicit) adjustment model parameters in the light
of the evidence in the studies they sequentially
examine.  Since, as Eddy argues, these adjustments
constitute an essential step in the process of
combining evidence (quantitatively as well as
qualitatively), this incoherence is a serious
deficiency in the method.  Making the adjustment
models explicit is the only way to correct this
deficiency; carrying it out will be a major
undertaking.
   Three problems have to be solved.  First, the
increased complexity of the underlying model
imposes knowledge representation difficulties: what
needs to be updated in the light of information of
which type, and how can this updating be carried
out in a computationally efficient way? David
Spiegelhalter has developed a very promising
approach to this question in his work on structures
for Bayesian expert systems: see Spiegelhalter
(1986) for an introduction to this important line of
research.  Second, the more parameters in a
Bayesian model, the higher the dimension of the
integration that needs to be carried out to achieve
the appropriate marginal distributions. Recent
research in approximate methods for high-
dimensional integration in the Bayesian context is
summarized in Kass, Tierney and Kadane (1988).
The third  problem  is probably the most difficult:
how are the experts' ideas about the form of the
adjustment model to be elicited and expressed? This
is part of the more general problem with Eddy's
method discussed below.
   (3)  The method provides no technology to deal
with the difficulties involved in eliciting expert
opinion and measuring expert  uncertainty. The
probabilities that appear in de Finetti's normative
theory represent an idealized construction: the price
at which the evaluator would be neutral between buying
and selling a ticket worth $1 if the proposition of
interest is true and otherwise valueless.  This price
exists, by an easy monotonicity argument, just as the
                                                   28

-------
price that you would just be willing to pay to
purchase a new house exists; but it can be
exceedingly difficult to determine this price exactly,
especially since the de Finetti transaction is entirely
metaphorical, whereas at least you may be required
to "put your money where your mouth is" with
respect to the house.  Thus, the advantages of de
Finetti's normative theory for a procedure based on
subjective probability depend to a large extent on
how the designers of the procedure solve the
problem of measuring probabilities with precision.
   That people encounter serious difficulties when
they try to measure their uncertainty about
propositions is well-documented  in the psychological
literature; see for example Kahneman, Slovic and
Tversky (1982). Whether or not an expert can
provide a  meaningful measure of his uncertainty
about a proposition depends crucially on the context
in which the proposition is to be interpreted and the
care with which it is formulated.  The most difficult
problem that confronts the designers of any
procedure based on subjective probability is to
frame assessment tasks that are accessible to the
knowledge and experience of the experts who must
use the procedure.  There is no discussion in Eddy's
paper about how he has dealt with this problem
with respect to the confidence profile method.
   Based on my own work with experts on the
causality assessment of adverse drug reactions, I am
quite skeptical about the ability of experts to assess
directly a  distribution for such complicated,  global
adjustment variables as Eddy's tau and beta in a
way that accurately and coherently incorporates all
their relevant beliefs and opinions. Exactly  what
questions  are the experts asked when they assess
these distributions?  What internal consistency
checks are made to show that they understand what
these questions mean and that their answers to
them are mutually consistent? What happens when
experts disagree?  Does the method provide any way
to probe for the sources of their disagreement and to
construct  models for the adjustment factors that are
based on a pooled knowledge base and command
general agreement among those with the relevant
expertise? Eddy s  method seems to ignore these
questions; its technique for handling subjective
probability evaluations appears to be just global
introspection, with all its faults and pitfalls (see Lane
(1984)), made quantitative.  For a different approach
that attempts to determine which assessment tasks
are accessible to the relevant experts and to use the
rules of probability to decompose inaccessible
problems  of interest into accessible components, see
Lane (1987, 1988).
                LITERATURE CITED

De Finetti B (1974). Theory of Probability. John
  Wiley, New York.

Eddy D (1980).  Screening for Cancer: Theory,
  Analysis, and Design.  Prentice Hall, Englewood
  Cliffs, N.J.

Kahneman D, Slovic P, Tversky A (1982).  Judgement
  under Uncertainty:  Heuristics and Biases.
  Cambridge University Press, Cambridge.

Kass R, Tierney L, Kadane J (1987). Asymptotics in
  Bayesian computation. To appear, Bayesian
  Statistics 3, ed. J Bernardo, M De Groot, D Lindley,
  A Smith.

Lane D (1984).  A probabilist's view of causality
  assessment. Drug Information Journal, 18, 323-
  330.

Lane D (1987). Causality assessment for adverse
  drug reactions: an application of subjective
  probability to medical decision making. To
  appear, Statistical Decision Theory and Related
  Topics, ed. J Berger, S Gupta.

Lane D (1988). Subjective probability and causality
  assessment (with discussion). To appear, Journal
  of Applied Stochastic Models and Data Analysis.

Savage L (1972). The Foundations of Statistics,
  second revised edition.  Dover, New York.

Spiegelhalter D (1986). Probabilistic reasoning in
  predictive expert systems. In Uncertainty in
  Artificial Intelligence, ed. L Kanal, J Lemmer.
  North-Holland, Amsterdam.
                                                  29

-------
                                STATISTICAL ISSUES IN THE META-ANALYSIS OF

                                           ENVIRONMENTAL STUDIES

                                     Larry Hedges, The University of Chicago
    The rapid growth of research literatures in many
areas of  scientific  interest  has led to  an almost
universal desire to find better ways to understand the
accumulated  research   evidence.    The   use   of
statistical   methods  for   combining   studies   or
"meta-analysis" has been one response to the problem
of  extracting  summary  evidence from a  body  of
related research  results.  Statistical methods  for
combining   the  results   of  different  studies  have
recently come into wide use in psychology (Smith &
Glass, 1977; Glass & Smith, 1979), sociology (see, e.g.,
Crain & Mahard, 1983), and the biomedical sciences
(see e.g., Stampfer,  1982). There is a longer  tradition
of such work in physical  sciences such as chemistry
(see e.g., Clarke,  1920)  and physics  (see e.g., Birge,
1932 or Rosenfeld, 1975).  Of course, there is a long
tradition of research in statistics and agricultural
science on combining the results of research studies
(see e.g., Tippett, 1931; Fisher,  1932; Pearson,  1932;
Cochran, 1937; Yates and  Cochran, 1938).
    Many different terms have been  used to describe
the process of  combining results  from  a series  of
experiments.    The    terms    meta-analysis    and
quantitative research synthesis are  used  in  social
science;  the terms  overview and  pooling of results
are used in the biomedical sciences, and the  terms
review  and critical  review  are often used in  the
physical  sciences.   I  prefer  the   terms   research
synthesis  or  research  review.  In  each case  the
general organization of the problem  is the same.  In
the simplest case, each study provides an estimate of
a parameter that  is believed to be  the same  across
studies, and these estimates from different studies
are combined to  yield an overall estimate of  the
parameter.  In  more complex  (and  more realistic)
cases the estimates  from individual  studies  are used
to  study  the   variation  of  the  parameter  across
studies.  For example, if the parameter of interest is
a treatment effect then  meta-analyses involve using
estimates  of  treatment  effects   from individual
studies to estimate an overall treatment effect or to
study its variation across experiments.
    It  is  important  to  recognize  that combining
experiments does  not mean pooling raw data.  Direct
pooling of raw  data  from different experiments may
produce misleading  results.   One example of  the
misleading  consequences of  pooling raw  data is
Simpson's (1954) paradox, which arises when two 2 X
2 tables both show an effect in a given direction, yet
a summary table based on pooling  the  cell counts of
the individual tables shows  the opposite effect. A
real  life   example  is  that  recent  demographic
statistics showed  that  the death rate has gone down
in every Illinois age cohort, but the overall death rate
is higher.   The substantive explanation, of course, is
that the population  is  now older.  The point of  this
example is that the overall pooled  statistics  do not
reflect the results  in  the individual  age  cohorts.
Similarly,   the  results of  analyses  using  raw  data
pooled across  studies  may  actually  conceal  some
aspects of the results of individual studies.
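    The reversal is easy to reproduce with stylized counts
(the numbers below are invented for illustration): within
each stratum the treatment looks better, while the table
formed by pooling the raw counts favors the control.

    # Two hypothetical strata (e.g., mild and severe cases); each entry is
    # (successes, patients) for the treatment and control groups in that stratum.
    strata = [
        {"treat": (81, 87),   "control": (234, 270)},   # mild cases
        {"treat": (192, 263), "control": (55, 80)},     # severe cases
    ]

    for i, s in enumerate(strata, 1):
        t_rate = s["treat"][0] / s["treat"][1]
        c_rate = s["control"][0] / s["control"][1]
        # Stratum 1: 0.93 vs 0.87; stratum 2: 0.73 vs 0.69 -- treatment better in both.
        print(f"stratum {i}: treatment {t_rate:.2f} vs control {c_rate:.2f}")

    # Pooling the raw counts reverses the comparison (Simpson's paradox).
    t_succ = sum(s["treat"][0] for s in strata)
    t_n = sum(s["treat"][1] for s in strata)
    c_succ = sum(s["control"][0] for s in strata)
    c_n = sum(s["control"][1] for s in strata)
    print(f"pooled: treatment {t_succ / t_n:.2f} vs control {c_succ / c_n:.2f}")  # 0.78 vs 0.83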
     It  is perhaps interesting to note that the  term
meta-analysis  arose as  an attempt to describe an
activity that was an analysis  of  the results of  the
analyses from  individual studies  (Glass, 1976).  The
assumption  is  that  the  relevant  result   of   the
statistical  analysis  of an  individual  study  is  a
parameter estimate.  The information from different
studies is  combined via their parameter estimates. If
these  estimates are sufficient  statistics  then  of
course  no  information   is   lost   in  combining
information via these statistics.

1.0 Why Is It Desirable to Combine the Results of
    Studies?
    One  of  the  first  questions   that  arises   in
connection with combining research results is why is
it  desirable to do so.  The answer depends  to  some
extent on  the use that  is to be made of the  research
review.   Policy  decisions  and  many   scientific
decisions  frequently  require  information  about  a
phenomenon under  a  wide range of  conditions.   For
example, these decisions may involve questions about
the probable  effect of a treatment under  a  specific,
even  eccentric  set of  conditions.  Research  reviews
are perhaps  most  helpful to  inform  decisions  that
involve  general  conclusions about a  typical  range of
situations.   They are  likely  to be  less  helpful  in
identifying  precisely  what  to expect  in  a  very
specific or  a very eccentric  situation.  There  are,
however,  several  general  reasons  to  expect  that
combining evidence will be useful.

1.1 Synthesis Provides  Robust  Evidence.
    Even  very  similar  studies  differ  in  their
experimental conditions and in the  details of their
execution. Obviously,  planned and plausibly relevant
differences in study design or procedure  may result in
differences in study results.
    There are also  many subtle, unplanned, and often
unrecognized differences between studies that often
lead  to  variation  in  study  results.  Seemingly
irrelevant  differences  in  experimental  conditions,
procedures,   or   measurement     methods   quite
frequently  lead to  substantial  variation  in  study
results.  Even large  single studies involve a limited
set of experimental conditions  and context  which
reduces the generalizability of their findings.
    Syntheses based on several studies draw evidence
from  across contexts to provide conclusions that are
more  robust  to variations in experimental context
and thus more useful  for broad policy decisions.

1.2 Syntheses May Utilize a Less Biased "Sample" of
     the Evidence.
    The selection  of evidence on which to base  a
policy  decision is  essentially a  sampling  process.
Haphazard or uncontrolled sampling  of evidence can
lead  to very substantial biases. Unfortunately there
are often sharply  conflicting  interests  in the policy
making context which may have a vested interest in
drawing attention  to  the particular subsets of  the
research  evidence  that  support  their  viewpoint.
Research results are often  part  of the  rhetoric of
competing   interests.   Thus   there  is  a  natural
tendency   for   competing  interests to  emphasize
                                                       30

-------
 different parts of the total body of research evidence
 (different studies) which most closely corresponds  to
 their own beliefs. That is, different  interests have a
 tendency to emphasize biased subsamples of the total
 body  of research evidence.   One  crucial  aspect  of
quantitative synthesis of research is arriving at a
minimally biased sample of research studies on which
to base inferences.  Carefully operationalized
 procedures  for  selecting  research evidence  can
 reduce both bias and the appearance  of  bias in study
 selection.

 1.3 Synthesis Formalizes What Policy  Makers  Must
    Do Anyway.
     Policy makers are frequently  required to make
 decisions  even  though  they  are  faced with  many
 studies yielding possibly  inconclusive or conflicting
findings.  That is, policy makers will derive general
 conclusions.  The only question is whether they do  so
 via formal, quantitative means or  informal, intuitive
 means.  One  of the difficulties  in  relying  on the
 intuitive procedures for combining research  results is
 that  the intuition of many people (even sophisticated
 people) is  horrendously  bad (see  Hedges  & Olkin,
 1985).  Procedures that seem  intuitively sensible are
 often highly  misleading.  For  example,  there  is  a
tendency to look for the preponderance of evidence
 by asking what proportion  of the studies actually
 found a statistically significant effect.  Examination
 of such a  procedure  will demonstrate  that  when
 effects of interest are small, this procedure not only
 has a very poor chance of detecting a real treatment
 effect, but  its  properties do not improve as  the
 amount of evidence (the number of studies)  increases
(see Hedges and Olkin, 1985, pages 48-52).  The
 fundamental  problem   is   that   research   results,
 whether  they   are  expressed  as   estimates,  the
outcomes of significance tests, or p-values, have a
substantial stochastic component.  It is simply
difficult to find the structure in a series of stochastic
 observations  without some sort of formal  methods.
 Informal  methods either  ignore  a  great  deal  of
 information  or use it suboptimally.  Moreover,  it  is
 difficult  to  deduce  the  properties  of  informal
 methods of combining research results.  Thus even  if
 they were adequate,  it would be difficult to produce
 a convincing demonstration that this was the  case.
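    A small numerical sketch of this point (the effect size,
the per-group sample size, and the normal approximation
below are illustrative assumptions): when each study has
power well below one half, the chance that a majority of
the studies reach significance actually shrinks as the
number of studies grows.

    from scipy.stats import norm, binom

    delta, n, alpha = 0.2, 20, 0.05          # hypothetical small effect, 20 per group
    z_crit = norm.ppf(1 - alpha / 2)
    ncp = delta * (n / 2) ** 0.5             # approximate noncentrality of the z-test
    power = norm.sf(z_crit - ncp) + norm.cdf(-z_crit - ncp)   # per-study power, about 0.10

    for k in (5, 10, 50, 200):               # number of available studies
        # chance that more than half of the k studies reach significance
        p_majority = binom.sf(k // 2, k, power)
        print(k, round(float(power), 3), round(float(p_majority), 4))
    # The chance of a "preponderance" of significant results falls toward zero
    # as studies accumulate, because each study's power is below one half.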

 1.4 Synthesis Increases Statistical Power.
    One  of  the  most   obvious    advantages   of
 quantitative research synthesis is that the pooling of
 information  from different  studies  increases   the
 statistical  power of hypothesis tests (e.g., for  the
treatment effect) and decreases the standard error of
 estimates  of  the experimental  effect. Thus  the
 evidence from several studies that  are  marginal  or
submarginal  in terms  of statistical  power but  are
otherwise well done  might be  combined to yield  a
relatively  powerful  test  for  the  existence of  a
treatment effect.
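    The gain is visible in a minimal fixed-effect,
inverse-variance sketch with hypothetical estimates, none
of which is statistically significant on its own; the pooled
standard error is smaller than any single-study standard
error, and the pooled estimate is clearly distinguishable
from zero.

    import numpy as np

    # Hypothetical effect estimates and standard errors from five studies,
    # each individually imprecise (every single-study z is below 1.96).
    est = np.array([0.30, 0.18, 0.42, 0.25, 0.10])
    se = np.array([0.20, 0.22, 0.25, 0.19, 0.24])

    w = 1 / se**2                              # fixed-effect inverse-variance weights
    pooled = np.sum(w * est) / np.sum(w)
    pooled_se = np.sqrt(1 / np.sum(w))         # smaller than any single-study SE
    z = pooled / pooled_se                     # about 2.6, comfortably significant
    print(round(float(pooled), 3), round(float(pooled_se), 3), round(float(z), 2))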

 1.5 Synthesis Provides Formal Standards of Rigor for
    the  Process of Accumulating   Evidence  from
    Different Research Studies.
    The generation  of  scientific  knowledge  begins
 with  the  conduct of the  individual experiments  to
generate  empirical  evidence.   A single experiment
seldom stands alone, however. The generation of new
scientific knowledge almost invariably involves  the
synthesis   of  evidence   from  several  replicated
studies.   Evidence  from   relevant   experiments
 becomes part of the scientific knowledge-base  only
 after   it   has  been  suitably  synthesized  and
 interpreted.   Every  scientist   learns  rules   of
 methodology or procedure that are designed to insure
 the validity of original research studies. Because the
 combination  of evidence  across studies  is  just  as
 important to the generation of scientific knowledge
as the combination of evidence (from different
observations) within studies, rigorous procedure is as
important in research synthesis as in original research.  Moreover, procedural rigor serves
 the same purpose in both  contexts:  It protects the
 validity  of  conclusions  from  potential sources  of
 bias.  In fact,  the  parallels  between procedure  in
 original research and research synthesis are usually
 used as  the  basis for examining  rigor in research
 synthesis.
     This paper is an examination of statistical issues
 in research synthesis. I  am using  a broad  definition
 of statistical issues which assumes  statisticians have
 a contribution to make  in every stage of a research
enterprise.  For convenience, the treatment that
follows is organized into a sequence of generic stages
 that apply equally well to any original  research study
 or to any research synthesis.

 2.0 Issues in Problem Formulation.
     Problem formulation is often  conceptualized  as
 the  first step  in  any  original research  study  or
 research review  (Cooper,  1984; Light &  Pillemer,
 1984).  It involves formulating the precise  questions
 to be   answered by  the  review.   One aspect   of
 formulating questions is deciding whether the purpose
 of the review is confirmatory (hypothesis testing)  or
 exploratory (hypothesis generating).
     A   second  aspect   of   problem  formulation
 concerns decisions  about  when studies  are  similar
 enough  to  combine.  That  is,  deciding   whether
 treatments,   controls, experimental procedure,  and
 study outcome measures are comparable enough to be
combined.  Philosophers of science have provided a
 conceptualization that is helpful in thinking  about
 this  problem.  They  distinguish the  theoretical or
 conceptual variables about which knowledge is sought
 (called  constructs) from the actual  examples of these
 variables that appear in studies (called operations).
    For  example, we  may  want  to  know   if  a
 particular method of teaching mathematics leads to
 better mathematical  problem solving.  To find out,  a
 comparative study is conducted in which students are
 randomly assigned to  teachers, some of  whom use the
 new method.  The students are then given a problem
 solving   test to determine  which group of students
 were  better  at  mathematical problem solving.  The
 exact  conceptualization  of  mathematical  problem
 solving  is a  construct.  The particular test used to
 measure  problem  solving  is  an  operation  cor-
 responding to that construct.
    Similarly, the  particular  teaching method  as
 defined   conceptually  is  a  construct,  while  the
 behavior of a particular teacher trying  to implement
 that teaching method is an operation.  The point here
 is that even when studies share the same constructs,
 they  are  sure  to  differ  in  the  operations  that
 correspond to those constructs.
    Thus defining questions precisely   in a  research
 review   involves  deciding   on  the  constructs of
 independent   variables,   study   characteristics   and
outcomes that  are  appropriate  for  the  questions
addressed  by  the   review  and  deciding  on  the
operations that will be regarded as corresponding to
 the constructs.  That is,  the reviewer  must develop
                                                       31

-------
both  construct  definitions and  a set  of  rules  for
deciding which concrete  instances  (of treatments,
controls, or measures) correspond to those constructs.
    Although the questions of interest might seem
completely   self-evident  in  a  review of  related
clinical trials, a little  reflection  may  convince you
that  there   are  subtleties  in  formulating  precise
questions.   For example,  consider clinical  trials  in
which the outcome  observed is the  death  rate.  At
first, the situation  seems completely  clearcut, but
there are subtleties. Should  deaths  from all  causes
be included in the death  rate or only deaths related
to  the disease  under  treatment?   If the  latter
approach  is used,  how are  deaths related to side
effects of the treatment to be counted? If there is
follow-up after different  intervals,  which  intervals
should be used? Should unanticipated or data defined
variables (such as "sudden death") be used?  Careful
thinking about the problem under review usually leads
to similar issues which require consideration.

2.1 Selecting Constructs and Operations.
    One of the potential  problems of meta-analysis
or quantitative research syntheses is that  they may
combine incommensurable evidence.  Some meta-
analyses have been criticized for combining "apples
and oranges."  This is essentially a criticism of the
breadth of constructs and operations chosen.  In one
sense the breadth of constructs and operations chosen
must reflect the breadth of the question addressed by
the review.  The issue is complicated by the fact that
constructs and operations are often distinguished more
narrowly by the reviewer than may be reflected in the
final presentation of results.  Thus the issues are what
constructs and operations are to be included in the
review, what constructs and operations are to be
distinguished in the data analysis of the review, and
finally which constructs and operations are presented
in the results of the review.
    Meta-analyses have tended  to use  rather broad
constructs  and operations in their presentation  of
results  (Cook  &  Leviton,  1980).   This  may have
resulted from the arguments of Glass and his
associates (Glass, McGaw, & Smith,  1981) who urged
meta-analysts to  seek  general conclusions.   It may
also  be a  consequence  of  the  ease  with  which
quantitative methods can analyze data from large
numbers of studies (Cooper & Arkin, 1981).  It is
important  to recognize however that while  broad
questions necessitate the inclusion of studies  with  a
range  of constructs and  operations, they need  not
inhibit   the   meta-analyst    from    distinguishing
variations of these constructs in the data analysis and
in presentation of results.

2.1.1  Broad Versus Narrow Constructs.
    The   advantage   of   broad   constructs  and
operations    is  that  they   may  support   broad
generalizations.   Because   they   are  maximally
inclusive, broad constructs and  operations obviate
most arguments about  studies that should  have been
included, but were not.
     However, the   uncritical  use  of very  broad
constructs in meta-analysis is problematic.  Analyses
based on operationalization of broad constructs  are
vulnerable to the criticism that overly  broad  choices
of construct obscure  important differences among
the   narrower  constructs subsumed   therein.   For
example  Presby  (1978)  argued  that  the   broad
categories of  therapies used  by Smith and  Glass
(1977) in their review of studies of the effectiveness
of  psychotherapy  obscured  important  differences
between therapies and their effectiveness.  A similar
argument may be made  about the breadth of outcome
constructs.  Moreover, empirical data from research
synthesis  sometimes  confirm   the  truth  of  these
arguments.
    Perhaps  the   most  successful  applications  of
broad or  multiple  constructs  in meta-analysis  are
those that may include broad constructs in the review
but  distinguish  narrower constructs  in  the   data
analysis and presentation of results.  This permits the
reviewer  to  examine  variations in  the pattern  of
results as a function of construct definition.  It also
permits separate analyses to be  carried out for each
narrow construct (see, e.g., Cooper, 1979; Linn &
Peterson, 1985; Eagly & Carli, 1981; Thomas &
French, 1985).  A combined analysis across constructs
may be carried out where  appropriate or  distinct
analyses  for  the  separate   constructs   may  be
presented.

2.1.2  Broad Versus Narrow Operations for Constructs.
    Another issue  of  breadth arises at  the level of
operationalization of constructs. The reviewer will
always  have  to admit several different operations for
any  given   construct.   Treatments  will  not   be
implemented identically in all studies and different
studies   will  measure   the  outcome  construct   in
different ways.  Thus, the reviewer must judge
whether each operation is a legitimate representation
of  the  corresponding  construct.   This  involves
obtaining as  much  information  as possible  about the
treatment actually implemented  and  the outcome
actually used in each study. This may involve the use
of  secondary  sources  such  as  technical  reports,
general descriptions of treatment  implementations,
test reviews, or published tests.
     In  spite of the difficulty they may present to the
reviewer,  multiple  operations  can  enhance  the
confidence  in relationships between  constructs if the
analogous  relationships  between   operations  hold
under   a  variety  of different  (and  each  imperfect)
operations  (Campbell,  1969).   However,   increased
confidence comes from multiple operations only when
the different operations are in  fact  more  related  to
the  desired  construct  than  to some other construct
(see Webb, Campbell,  Sechrest,  & Grove, 1981  for a
discussion of multiple  operationism).  Thus although
multiple operations can lead to  increased confidence
through    "triangulation"    of    evidence,    the
indiscriminate use of  broad   operations   can  also
contribute to invalidity of results via confoundings of
one construct with another (see Cooper, 1984).

2.2  Exploratory versus Confirmatory Reviews.
     A  crucial   aspect  of  problem formulation  is
distinguishing whether the purpose of the review is  to
test  a small  number  of  reasonably  well-defined
hypotheses,   or   to    generate  new  hypotheses.
Obviously new hypotheses (even new variables) arise
in  the  course  of meta-analyses,  just as  in any
scientific activity.  The critical issue is to distinguish
the  clearly a priori hypotheses from those that are
suggested  by   the  data.    This  distinction  has
implications for  the  choice  of statistical analysis
procedures  used  in   the  meta-analysis  and for
interpretation  of   results.   Most   statistical   tests
calculate levels of statistical significance assuming
                                                       32

-------
that the hypothesis is a priori and is tested in
isolation.  When statistical tests are suggested by the
data  the  usual  procedures for assessing statistical
significance are likely to be  misleading. Similarly,
when many statistical analyses are conducted on the
same data, the usual significance  levels  will  not
reflect the chance of making at least one Type I
error in  the  collection  of  tests (the simultaneous
significance level).  Thus when  conducting many tests
in  an exploratory mode there  is  a  tendency  to
"capitalize on chance."
    One  method of  dealing  with the  problem  of
exploratory  analysis  in research  reviews is to  use
statistical methods that are specifically designed for
exploratory  analysis   such as clustering   methods
(Hedges & Olkin, 1983, 1985).  Another alternative is
to adjust  the significance level to  reflect  the  fact
that  many  tests  are conducted  on the same  data
(Hedges & Olkin, 1985).  The problem  with  this and
all other simultaneous procedures  is that  they reduce
the power of  statistical  tests  and  the effect  is
dramatic  when  many  tests  are  conducted simul-
taneously.
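    The trade-off can be seen in a short sketch (assuming,
for simplicity, independent tests and a normally
distributed test statistic, both illustrative assumptions):
with twenty tests the unadjusted chance of at least one
Type I error is large, and a Bonferroni adjustment restores
the simultaneous level at a substantial cost in power for
any single true effect.

    from scipy.stats import norm

    alpha, m = 0.05, 20                     # twenty hypothetical tests on the same data
    familywise = 1 - (1 - alpha) ** m       # about 0.64: chance of at least one Type I error
    bonferroni_alpha = alpha / m            # adjusted per-test level, 0.0025

    # Power cost of the adjustment for one true effect whose z-statistic has mean 2.8:
    z_unadj = norm.ppf(1 - alpha / 2)               # 1.96
    z_adj = norm.ppf(1 - bonferroni_alpha / 2)      # about 3.02
    power_unadj = norm.sf(z_unadj - 2.8)            # about 0.80
    power_adj = norm.sf(z_adj - 2.8)                # about 0.41
    print(round(familywise, 2), round(float(power_unadj), 2), round(float(power_adj), 2))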
    Another alternative is the  use of procedures that
do  not involve statistical significance.  The simplest
procedures   are   simply   descriptive   statistics.
Graphical procedures such as Light  and Pillemer's
(1984) funnel diagrams, Hedges and Olkin's (1985)
confidence interval plots, or many of the graphical
ideas presented by Tukey (1977) may also be helpful.
    A third alternative  is  to randomly divide the
data  into two  subsets. The first subset is used to
generate  hypotheses  whose statistical significance is
then  evaluated (cross validated) on the second subset
(Light & Pillemer, 1984).
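    A minimal sketch of the splitting idea, with simulated
effect sizes standing in for a real collection of studies
(the numbers and the simple z-test are illustrative):
hypotheses are generated freely on one random half, and
significance statements are reserved for the other half.

    import numpy as np

    rng = np.random.default_rng(1)
    effects = rng.normal(0.3, 0.2, size=40)   # hypothetical effect sizes from 40 studies

    idx = rng.permutation(len(effects))
    explore, confirm = effects[idx[:20]], effects[idx[20:]]

    # Hypotheses are generated on the exploratory half ...
    print("exploratory mean:", round(float(explore.mean()), 3))
    # ... and only hypotheses fixed before looking at the second half are tested
    # there, so its significance levels retain their usual interpretation.
    se = confirm.std(ddof=1) / np.sqrt(len(confirm))
    print("confirmatory z:", round(float(confirm.mean() / se), 2))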

3.0 Issues in Data Collection.
    Data collection   in  meta-analysis  consists  of
assembling  a  collection of  research  studies  and
extracting    quantitative    indices     of     study
characteristics   and   of  effect   magnitude   (or
relationship  between  variables).  The   former  is
largely  a problem of  selecting studies that  may
contain information relevant to the specific questions
addressed in  the  review.  It  is  largely a  sampling
process.   The   latter  is  a  problem  of  obtaining
quantitative  representations   of   the  measures  of
effect magnitude and the other  characteristics  of
studies  that are relevant to  the specific questions
addressed  by  the review.   This  is  essentially  a
measurement process similar to other complex tasks
or  judgments   that  researchers  are   sometimes
required  to make  in  other research  contexts.   The
standard  psychological measurement  procedures  for
ensuring the reliability and validity of such ratings or
judgments are  as appropriate in  meta-analysis as in
original  research  (Rosenthal, 1984;  Stock,  Okun,
Haring, Miller,  Kinney, &  Ceurvorst, 1982).

3.1 Sampling in Meta-analysis.
    The  problem  of  assembling  a  collection  of
studies is often viewed as a sampling  problem:   The
problem  of  obtaining  a representative sample of all
studies that have actually been conducted.  Because
the adequacy  of samples necessarily determines the
range of valid  generalizations  that are possible, the
procedures used to locate studies  in  meta-analysis
have  been regarded as crucially important.  Much of
the  discussion   on  sampling in  meta-analysis (e.g.,
Cooper, 1984; Glass,  McGaw, & Smith, 1981; Hunter,
Schmidt, & Jackson, 1982; Rosenthal, 1984),
concentrates  on  the  problem  of   obtaining   a
representative or exhaustive sample  of the studies
that have actually been conducted.  However,  this is
not  the  only or even the most  crucial  aspect  of
sampling   in   meta-analysis.    Another,   equally
important sampling  question is whether the samples
of subjects  and treatments in the individual studies
are  representative  of the subject populations and
treatment populations of interest.
     The  importance of representative sampling  of
subjects  is  obvious.  For  example,  studies of  the
effects of psychotherapy on college students who  do
not have psychological problems  may  not be relevant
to the determination of the effects of psychotherapy
on  patients  who have real psychological  problems.
The   importance  of   representative   sampling   of
treatments  is perhaps  more subtle.  The question is
whether  the treatments which occur  in studies  are
representative of  the situations  about  which  the
reviewer seeks knowledge (Bracht & Glass, 1968).  A
representative sample of  studies, each  of  which
involves  a  nonrepresentative sample  of subjects  or
treatments,  brings us no closer to the truth about the
subject or treatments that we care about.
     Thus there  are two  levels  of sampling to  be
concerned   about   in  meta-analysis.   One   level
concerns the  mechanism for selecting the sample of
studies that  are used in  the  review.   The  other
concerns  the  mechanism   used  within  studies  for
selecting individual replications.  The situation is
much  like  that  of two-stage  samples  in sample
surveys.  The reviewer samples clusters or primary
sampling units first; then the individual subjects or
secondary sampling units are sampled from the clusters.
    Strategies for obtaining representative or
exhaustive samples of studies have been discussed by
Glass, McGaw, and  Smith  (1981) and  Cooper (1984).
The  problem of obtaining  representative samples  of
subjects  and  treatments  is  constrained by  the
sampling of  studies and consequently is not under the
complete control of the reviewer.  The reviewer can,
however,  present  descriptions  of  the  samples  of
subjects  and treatments and examine  the relationship
between  characteristics of these samples  and  study
outcomes.      Such    assessments      of     the
representativeness  of  treatments  and  subjects  are
obviously crucial in  evaluation  of  the  studies  on
which the review is based.

3.2  Missing Data in Meta-analysis.
     Missing data is  a  problem  that plagues  many
forms of applied research.  Survey researchers  are
well   aware  that  the  best  sampling   design  is
ineffective  if the  information  sought  cannot  be
extracted from the  units that are sampled.  Of course
missing data  is  not a substantial problem if it is
"missing  at  random,"  that  is,  if  the  missing
information is essentially a random sample of all the
information  available  (Rubin, 1976).   Unfortunately
there  is usually  very little reason to believe  that
missing data  in meta-analysis is missing at random.
On the contrary, it  is often easier to  argue that the
causes of the missing data are systematically related
to  effect size  or  to important  characteristics  of
studies.  When  this  is true,  missing data poses a
serious threat  to  the  validity  of  conclusions  in
meta-analysis. The specific cases of  missing data on
study   outcome  and  missing   data   on   study
characteristics are considered separately.
                                                      33

-------
3.2.1  Missing Data on Study Outcome.
    Studies (such as  single case  studies) that  do not
use statistical  analyses  are one source of missing
data on study outcome.  Other studies use statistics
but do not provide enough statistical information to
allow  the  calculation   of  an   estimate  of   the
appropriate outcome  parameter.  Sometimes this is a
consequence  of failure to report relevant  statistics.
More  often it is a consequence of  the  researcher's
use of  a  complex design that  makes  difficult  or
impossible the construction  of a parameter estimate
that is completely  comparable  to  those of  other
studies.  Unfortunately both  the sparse  reporting of
statistics   and . the  use  of  complex  designs   are
plausibly related to study outcomes.  Both result at
least  in part from  the  editorial  policies  of some
journals which  discourage reporting  of all but the
most   essential   statistics.   Perhaps   the    most
pernicious  sources of missing  data are studies which
selectively  report   statistical  information.    Such
studies  typically  report  only information  on  the
effects  that  are  statistically significant,  exhibiting
what has been called reporting bias (Hedges, 1984).
Missing effects  can  lead  to very  serious  biases,
identical  to  those caused by selective publication
which are discussed in the section on publication bias.
    One strategy for dealing with incomplete effect
size data  is  to  ignore the problem.  This is  almost
certainly  a bad  strategy.  If nothing else,  such  a
strategy reduces the credibility of the meta-analysis
because the presence  of at least some missing  data is
obvious to knowledgeable readers.   Another prob-
lematic strategy for handling missing  effect size data
is  to replace all of the missing  values by the same
imputed value (usually zero).   Although this strategy
usually  leads  to  a  conservative (often extremely
conservative) estimate of the  overall average effect
size, it creates serious problems in relating study
characteristics to effect size.  A better strategy is to
extract from  the study  any available  information
about the outcome of the study.  The direction (sign)
of  the effect  can often  be deduced  even when an
effect size cannot be calculated.   A tabulation  of
these directions  of effects can  therefore be used to
supplement the effect size analysis (e.g., Giaconia &
Hedges, 1982; Crain & Mahard, 1983).  Such a
tabulation can even  be  used to  derive a parametric
estimate of effect (Hedges & Olkin, 1980, 1985).
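    The following rough sketch conveys the idea behind such
a parametric estimate (the equal per-group sample size and
the normal approximation are simplifying assumptions; the
estimator of Hedges and Olkin handles the general case):
the proportion of studies whose effect points in the
positive direction is inverted to give a crude estimate of
a common standardized effect.

    from scipy.stats import norm

    # Hypothetical tabulation: 17 of 20 studies report an effect in the positive
    # direction, and each study had roughly 25 subjects per group.
    k, positives, n_per_group = 20, 17, 25

    p_hat = positives / k
    # Under a common standardized effect delta, the chance that a single study's
    # mean difference is positive is roughly Phi(delta * sqrt(n/2)); inverting the
    # observed proportion therefore gives a crude estimate of delta.
    delta_hat = norm.ppf(p_hat) / (n_per_group / 2) ** 0.5
    print(round(float(delta_hat), 2))        # about 0.29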
    Perhaps the best strategy to deal with  missing
data  on  study outcomes is  the  use of  the  many
analytic  strategies  that  have  been  developed for
handling  missing  data  in sample  surveys (Madow,
Nisselson, & Olkin  1983;  Madow,  Olkin, &  Rubin,
1983;  Madow  & Olkin,   1983).   Generally  these
strategies involve using  the available  information
(including  study  characteristics)  to estimate the
structure  of  the  study  outcome   data  and  the
relationships among  study characteristics and  study
outcome.  They  can also  be  used  to  study the
sensitivity of  conclusions to  the possible effects of
missing data.  Although  these strategies have  much
to recommend them they have only rarely been used
in meta-analysis.
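    As a toy illustration of the general idea, rather than of
any particular survey-sampling procedure, the sketch below
imputes two missing effect sizes from an observed study
characteristic by simple linear regression and compares the
complete-case and imputed summaries as a crude sensitivity
check; the data and the one-predictor model are invented.

    import numpy as np

    # Hypothetical studies: a coded characteristic (say, dose) and effect sizes,
    # two of which were not reported (np.nan).
    dose = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0])
    effect = np.array([0.10, 0.15, 0.22, np.nan, 0.35, 0.30, np.nan, 0.48])

    observed = ~np.isnan(effect)
    slope, intercept = np.polyfit(dose[observed], effect[observed], 1)

    imputed = effect.copy()
    imputed[~observed] = intercept + slope * dose[~observed]   # model-based fill-in

    # Compare the complete-case mean with the imputed-data mean as a rough check
    # on how much the missing outcomes could matter.
    print(round(float(effect[observed].mean()), 3), round(float(imputed.mean()), 3))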

3.2.2 Missing Data on Study Characteristics.
     Another  less obvious form of  missing  data  is
missing data on study characteristics  which results
from  incompletely  detailed  descriptions   of the
treatment, controls, experimental procedure, or the
outcome  measures.  In  fact, the generally  sketchy
descriptions  of  studies  in  the  published literature
often constrain the degree of specificity possible in
schemes used to code between study differences.
    The  problem  of   missing  data   about  study
characteristics is  related to the problem of breadth
of    constructs    and    operations    for    study
characteristics.  Coding schemes  that  use a high
degree of detail (and have higher  fidelity) generally
result   in  a  greater  degree  of  missing  data.
Consequently, vague  study  characteristics are  often
coded on  all studies  or  more specific characteristics
are coded on a relatively few  studies (see Orwin &
Cordray,  1985).  Neither procedure alone seems to
inspire  confidence   among  some  readers  of  the
meta-analysis.
    One  strategy for  dealing  with  missing  in-
formation about study characteristics is to have two
levels  of  specificity: a broad level  which  can  be
coded  for nearly  all studies  and  a  narrower level
which can be coded for  only a subset  of the studies.
This  strategy  may  be  useful  if  suitable  care is
exercised in describing  the differences between the
entire collection  of  studies and the smaller number
studies  permitting the  more  specific  analysis.  A
more elegant solution is the use of the more refined
methods   for  handling  missing   data   on  study
characteristics  are  little  used  but  deserve   more
attention.   One  is   the  collection  of   relevant
information  from  other sources  such  as  technical
reports, other  more general descriptive reports on a
program,   test reviews  or articles that describe a
program,  treatment, or measurement  method.  The
appropriate   references  are  often   published  in
research  reports.  A  second  and often neglected
source  of information is the direct collection of new
data.   For  example  in  a  meta-analysis  of  sex
differences in helping behaviors, Eagly and  Crowley
(1986)  surveyed  a  new  sample  of   subjects  to
determine the degree  of  perceived danger  in the
helping situations examined  in the  studies.   This
rating  of degree  of perceived danger  to the  helper
was a valuable factor in explaining the variability of
results  across studies.

3.3  Publication Bias.
     An important axiom of survey sample design is
that  an excellent sample design cannot guarantee a
representative  sample  if  it  is  drawn  from  an
incomplete  enumeration  of  the  population.  The
analogue  in  meta-analysis is that an apparently good
sampling  plan may be thwarted by applying the plan
to an incomplete  and unrepresentative  subset  of the
studies that  were  actually conducted.
     The  published  literature  is   particularly  sus-
ceptible to the claim that it is unrepresentative of all
studies that  may  have been conducted (the so-called
publication  bias  problem).  There  is  considerable
empirical evidence  that  the  published literature
contains  fewer statistically insignificant  results than
would be expected from the complete collection of
all studies actually  conducted  (Bozarth  &  Roberts,
 1972; Hedges, 1984b; Sterling, 1959).  There  is also
direct  evidence  that journal editors and reviewers
intentionally include statistical  significance  among
their   criteria   for   selecting   manuscripts   for
publication  (Bakan,  1966;  Greenwald,  1975; Melton,
 1962).  The  tendency  of the published literature  to
over-represent statistically significant  findings leads
 to biased overestimates of  effect magnitudes from
published literature  (Lane &  Dunlap,  1978; Hedges,
 1984b), a phenomenon that was confirmed empirically
 by Smith's (1980a) study of ten meta-analyses,  each
                                                       34

-------
of which presented average effect size estimates for
both published and unpublished sources.
     Reporting  bias  is  related to  publication bias
based  on  statistical  significance.   Reporting  bias
creates missing data when researchers fail to report
the  details of results  of some statistical analyses,
such   as   those   that   do  not  yield   statistically
significant results.  The  effect of  reporting bias is
identical  to that of publication  bias:  some effect
magnitude estimates are unavailable (e.g., those that
correspond to statistically insignificant results).
     Publication or reporting bias  may not always be
severe  enough  to  invalidate  meta-analyses  based
solely on  published articles (see Light and Pillemer,
1984;  Hedges,  1984b).   Theoretical  analysis of the
potential effects of publication bias showed that even
when nonsignificant results are never published (the
most severe form of publication bias), the effect on
estimation of effect size  may not be large unless
both the within study sample sizes and the underlying
effect size are small. However, if  either the sample
sizes in the studies or the underlying effect sizes are
small,  the effect on estimation can be substantial.
     The possibility that publication or reporting bias
may  inflate  effect size  estimates suggests  that
reviewers  may  want  to consider  investigating  its
possible  impact.   One  method is  to compare  the
effect size estimates derived  from  published  (e.g.,
books, journal articles) and unpublished sources (e.g.,
conference   presentations,   contract  reports,   or
doctoral dissertations).   Such  comparisons  however
are  often problematic  because  the source of  the
study  is  often  confounded  with  many  other  study
characteristics.  An alternative procedure  is to use
statistical  corrections  for estimation of effect size
under publication bias.  This corresponds  to modeling
the  sampling of  studies as involving a censoring or
truncation mechanism.   If these corrections  produce
a negligible effect, this  suggests that publication and
reporting bias are negligible.
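     A minimal numerical sketch of the first approach, written in Python
with purely hypothetical effect-size values, might simply contrast the
average effect size reported in published and unpublished sources:

    import numpy as np

    # Hypothetical standardized effect sizes, grouped by source type.
    published   = np.array([0.42, 0.55, 0.38, 0.61, 0.47])
    unpublished = np.array([0.12, 0.30, 0.05, 0.25])

    # A large gap between the two averages is one informal signal of
    # publication or reporting bias.
    print("mean (published):  ", published.mean())
    print("mean (unpublished):", unpublished.mean())
    print("difference:        ", published.mean() - unpublished.mean())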
     There  have   been  relatively  few  detailed
statistical analyses of the existence  and magnitude of
publication and reporting bias.  Such studies are badly
needed as are refinements of statistical analysis tools
to handle less extreme  and  more realistic censoring
models than those considered thus far.

4.0  Issues in Data Evaluation
     Data  evaluation  in  meta-analysis is the process
of critical examination  of the corpus of information
collected,  to  determine which  study  results  are
expected to yield reliable information. Judgments of
study  quality  are  the  principal  method  of   data
evaluation.  A second aspect of data evaluation is the
use  of empirical  methods  to detect  outliers  or
influential  data  points.  When  properly  applied,
empirical  methods  have  uses  in both meta-analysis
(Hedges &  Olkin,  1985) and  in  primary research
(Barnett and Lewis, 1978; Hawkins, 1980).
     Meta-analysts  and other reviewers  of  research
have sometimes used a single binary (high/low)
judgment of study quality, which may be useful for some purposes such
as  deciding  which  studies   to  exclude  from  the
review.   It  is  seldom  advisable  to  make  such
judgments directly.   The  reason is  that  different
researchers do not always agree on which studies are
of  high  quality.   Empirical  research suggests that
direct ratings  of  study  quality  have  very  low
reliability   (see   Orwin   &   Cordray,    1985).
Consequently,  most meta-analysts at least initially
characterize study quality by using multiple criteria.
One approach to criteria for  study quality  is  the
threats-to-validity approach, in which each study is
rated according to the presence or absence of some
general threats to validity such as those  presented by
Campbell and  Stanley  (1963) or Cook and  Campbell
(1979).      A     second     approach     is     the
methods-description   approach  (Cooper,   1984)   in
which the  reviewer exhaustively codes the stated
characteristics  of each  study's  method.  A third
approach to assessing  study quality  is a combination
of  the first  two approaches involving coding of  the
characteristics  of study  methodology and assessing
threats to validity that may not be  reflected in the
characteristics of study methods (Cooper, 1984).
    Another source of  information in data  evaluation
comes  from   data analyses  themselves.   It  often
happens  that one or more observations (estimates of
effect magnitude) fail  to fit the pattern of the other
observations.  That is, one or more of the data points
fail to conform to the same model  as do  the other
observations.  These deviant observations or outliers
may be the result of studies or situations in which the
treatment is exceptionally powerful  or exceptionally
weak.  In some cases a careful  examination of details
of  study design  or  procedures   suggests  plausible
reasons for the exceptional treatment effect.
    Although statistical  methods may  be used  to
detect outliers in meta-analysis  (Hedges  & Olkin,
1985), the question of what to do about them  cannot
always be resolved so easily. Outliers that result
from   detectable   (and  remediable)  errors   in
computation  should   of  course  be  replaced   by
estimates based on the correct calculations.  When
outliers are based on suspicious data, a cautious
data analyst might want to delete, or at least
consider separately, such observations.
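     One simple screening device, sketched below in Python with
hypothetical estimates and standard errors (and not to be confused with
the formal procedures of Hedges and Olkin, 1985), is to examine the
standardized deviation of each study from the precision-weighted mean:

    import numpy as np

    # Hypothetical effect estimates and their standard errors.
    t  = np.array([0.30, 0.25, 0.35, 1.40, 0.28])
    se = np.array([0.10, 0.12, 0.11, 0.15, 0.09])

    w     = 1.0 / se**2                    # inverse-variance weights
    t_bar = np.sum(w * t) / np.sum(w)      # weighted mean effect

    # Standardized deviation of each study from the combined estimate.
    z = (t - t_bar) / se
    print("weighted mean:", round(t_bar, 3))
    print("standardized deviations:", np.round(z, 2))
    print("flagged as possible outliers:", np.where(np.abs(z) > 2.5)[0])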
    The   most   difficult  problem   arises   when
examination  of  the   outlying studies  reveals  no
obvious reason why their effect sizes should differ
from the rest.  The analysis of data containing some
observations that are outliers (in the sense of not
conforming to the same model as the other studies) is
a complicated task.  It invariably requires the use of
good judgment and the making of decisions that are,
in some sense, compromises.  There are cases (Rocke,
Downs, & Rocke, 1982; Stigler, 1977; Tukey, 1977)
where setting  aside a  small  proportion  of the data
(certainly  less   than  15-20  percent)  has  some
advantages.  If nearly all the data can be modeled  in
a   simple,   straightforward   way   it   is   certainly
preferable to  do so,  even at  the risk  of requiring
elaborate descriptions of the studies that are set
aside.

-------
    Studies that are set aside should not be ignored;
often  these   studies  reveal  patterns   that   are
interesting in and of themselves.  Occasionally these
deviant studies  share  a common  characteristic  that
suggests an interesting direction for future research.
One of the reasons it is preferable to use a model
which includes most of the data is that the results of
studies  that are identified  statistically as  outliers
often do not deviate  enough  to disagree with  the
substantive result of the model.  That is,  an effect
size estimate may exhibit  a statistically significant
difference from those of other studies, yet fail to
differ from the rest to an extent that would make a
practical or substantive difference. However, it is
crucial that all  data be reported and that any deleted
data be clearly noted.

5.0  Data Analysis and Interpretation.
    Data analysis and interpretation are the heart of
the  meta-analysis  and  have a   long history  in
statistics and  the  physical sciences.   Two distinctly
different directions have been taken  for  combining
evidence from different studies in agriculture almost
from the very beginning of statistical analysis in that
area.  One approach relies on  testing  for  statistical
significance of  combined results across studies,  and
the  other  relies on estimating treatment  effects
across studies.  Both methods  date  from as early as
the  1930's (and  perhaps  earlier)  and continue to
generate interest  among  the statistical  research
community to the present day.
    Testing   for   the  statistical   significance   of
combined  data  from  agricultural  experiments is
perhaps  the older of the two traditions. One of  the
first  proposals  for   a   test  of  the   statistical
significance of  combined results  (now  called testing
the  minimum  p or  Tippett method)  was given by
L.H.C. Tippett in 1931. Soon afterwards, R.A. Fisher
(1932)  proposed a  method for combining  statistical
significance,  or  p-values,  across  studies.   Karl
Pearson   (1933) independently derived   the  same
method shortly  thereafter,  and the  method variously
called  Fisher's  method  or  Pearson's method  was
established.  Research on tests of the  significance of
combined results has  flourished since  that time, and
now well over 100  papers in the statistical literature
have  been devoted to such tests.  A  review of  this
literature with special reference to meta-analysis is
given in  Hedges and Olkin (1985).
    Tests of the significance of combined results are
sometimes  called  omnibus or nonparametric tests
because  these tests do not depend on the type of data
or the statistical distribution of those  data.  Instead,
tests of the  statistical  significance  of  combined
results rely only on the fact that p-values are
uniformly distributed between zero and unity under the null hypothesis.
Although omnibus  tests have  a strong appeal in  that
they can be applied universally, they suffer from an
inability to provide estimates of the magnitude of the
effects being considered.  Thus, omnibus tests do not
tell  the experimenter  how  much of an effect  a
treatment has.  Consequently  omnibus tests are of
limited utility in most research reviews.
     In  order to determine  the magnitude of  the
effect of an agricultural treatment, a second approach was
developed  which   involved  combining   numerical
estimates of treatment  effects.  One of the early
papers on the subject (Cochran, 1937) appeared a few
years  after  the first papers on omnibus procedures.
Additional  work in  this tradition  appeared  shortly
thereafter  (e.g., Yates & Cochran,  1938; Cochran,
1943).  It is also interesting to  note that work  on
statistical   methods  for  combining  estimates from
different experiments in physics dates from the same
era (Birge, 1932).

6.0 Combined Significance Tests.
    This section outlines some of the methods used
for combined significance testing.
    Consider a collection of k independent studies
characterized by parameters θ_1, ..., θ_k, such as
means, mean differences, or correlations.  Assume
further that the ith study produces a test statistic T_i
to be used to test the null hypothesis

         H_0i: θ_i = 0,   i = 1, ..., k,

where large values of the test statistic lead to
rejection of the null hypothesis.  The hypotheses
H_01, ..., H_0k need not have the same substantive
meaning, and similarly, the statistics T_1, ..., T_k need
not be of related form.  The omnibus null hypothesis
H_0 is that none of the effects is significant, that is,
that all the θ's are zero:

         H_0: θ_1 = θ_2 = ... = θ_k = 0.

Note that the composite hypothesis H_0 holds only if
each of the subhypotheses H_01, ..., H_0k holds.
    The one-tailed p-value for the ith study is

         p_i = Prob(T_i > t_i0),                            (1)

where t_i0 is the value of the statistic actually
obtained (the sample realization of T_i) in the ith
study.  If H_0i is true, then p_i is uniformly distributed
in the interval [0, 1].

6.1  The Minimum p Method.
     The first test of the significance of combined
results was proposed by Tippett (1931), who pointed
out that if p_1, ..., p_k are independent p-values (from
continuous test statistics), then each has a uniform
distribution under H_0.  Therefore, if p_[1] is the
minimum of p_1, ..., p_k, a test of H_0 at significance
level α is obtained by comparing p_[1] with 1 - (1 - α)^{1/k},
so that the test procedure is to

         reject H_0 if p_[1] < 1 - (1 - α)^{1/k}.
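     A minimal computational sketch of the minimum p procedure, assuming
a vector of independent one-tailed p-values, is:

    import numpy as np

    def tippett_test(p_values, alpha=0.05):
        """Reject H_0 if the smallest p-value is below 1 - (1 - alpha)**(1/k)."""
        p = np.asarray(p_values, dtype=float)
        k = p.size
        critical = 1.0 - (1.0 - alpha) ** (1.0 / k)
        return p.min(), critical, p.min() < critical

    p_min, crit, reject = tippett_test([0.04, 0.20, 0.65, 0.008])
    print(p_min, crit, reject)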
-------
6.2  The Fisher Method.
     Only the p-values p_1, ..., p_k from the individual studies are
needed for the Fisher method.  The test procedure
becomes

         reject H_0 if P = -2 Σ_{i=1}^{k} log p_i > C,
where the critical value C is obtained from the upper
tail of the chi-square distribution with 2k degrees of
freedom.
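     The same kind of sketch for the Fisher method, again assuming
independent one-tailed p-values, is:

    import numpy as np
    from scipy.stats import chi2

    def fisher_combined_test(p_values, alpha=0.05):
        """Fisher's method: P = -2*sum(log p_i) ~ chi-square with 2k df under H_0."""
        p = np.asarray(p_values, dtype=float)
        statistic = -2.0 * np.sum(np.log(p))
        df = 2 * p.size
        critical = chi2.ppf(1.0 - alpha, df)
        p_combined = chi2.sf(statistic, df)
        return statistic, critical, p_combined

    stat, crit, p_comb = fisher_combined_test([0.04, 0.20, 0.65, 0.008])
    print(stat, crit, p_comb)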

6.3 The Inverse Normal Method.
    Another procedure for combining p-values is the
inverse  normal  method  proposed   by  Stouffer,
Suchman, DeVinney, Star and Williams (1949).  This
procedure involves transforming  each p-value to the
corresponding normal  score, and then "averaging."
More specifically, define z_i by p_i = Φ(z_i), where Φ(x)
is the standard normal cumulative distribution
function.  When H_0 is true, the statistic

         Z = (z_1 + ... + z_k) / √k                         (4)

has the standard normal distribution.  Hence we
reject H_0 whenever Z exceeds the appropriate
critical value of the standard normal distribution.
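     A corresponding sketch of the inverse normal method follows; note
that it uses the sign convention z_i = Φ^{-1}(1 - p_i), the mirror image
of equation (4), so that small p-values yield a large positive Z:

    import numpy as np
    from scipy.stats import norm

    def stouffer_combined_z(p_values):
        """Inverse normal method: transform each p-value to a normal score
        and average.  Here small p-values give a large positive Z."""
        p = np.asarray(p_values, dtype=float)
        z = norm.ppf(1.0 - p)                 # normal scores
        return np.sum(z) / np.sqrt(p.size)    # standard normal under H_0

    print(stouffer_combined_z([0.04, 0.20, 0.65, 0.008]))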

6.4 The Logit Method.
    Yet another method for combining k independent
p-values p_1, ..., p_k was suggested by George (1977) and
investigated by Mudholkar and George (1979).
Transform each p-value into a logit, log(p/(1 - p)),
and then combine the logits via the statistic

         L = log(p_1/(1 - p_1)) + ... + log(p_k/(1 - p_k)).           (5)

The exact distribution of L is not simple, but when
H_0 is true, George and Mudholkar (1977) show that
the distribution of L (except for a constant) can be
closely approximated by Student's t-distribution with
5k + 4 degrees of freedom.  Therefore, the test
procedure using the logit statistic is

         reject H_0 if L* = |L| √(0.3(5k + 4)/[k(5k + 2)]) > C,       (6)

where the critical value C is obtained from the
t-distribution with 5k + 4 degrees of freedom.  [The
term 0.3 in (6) is more accurately given by 3/π².
For large values of k, √(3(5k + 4)/[π²(5k + 2)]) ≈ 0.55,
so that L* ≈ (0.55/√k)|L|.]
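     A sketch of the logit procedure, following the approximation in (6)
with 0.3 replaced by 3/π², is:

    import numpy as np
    from scipy.stats import t as t_dist

    def logit_combined_test(p_values, alpha=0.05):
        """Logit method: L = sum(log(p_i/(1-p_i))); scaled |L| is compared
        to Student's t with 5k + 4 degrees of freedom."""
        p = np.asarray(p_values, dtype=float)
        k = p.size
        L = np.sum(np.log(p / (1.0 - p)))
        # Scale factor from (6); 0.3 is approximately 3/pi**2.
        scale = np.sqrt(3.0 * (5 * k + 4) / (np.pi**2 * k * (5 * k + 2)))
        L_star = abs(L) * scale
        critical = t_dist.ppf(1.0 - alpha, 5 * k + 4)   # upper-tail critical value
        return L_star, critical, L_star > critical

    print(logit_combined_test([0.04, 0.20, 0.65, 0.008]))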
6.5 Limitations of Combined Significance Tests.
    In spite of the intuitive appeal of using combined
test   procedures  to  combine  tests  of  treatment
effects,  there  frequently  are  problems   in  the
interpretation  of results  of  such  a  test  of the
significance of combined results  (see  e.g.,  Adcock,
1960; or Wallis, 1942).  Just what can be concluded
from   the   results   of   an  omnibus   test   of  the
significance  of  combined results?  Recall that the
null hypothesis of the combined test procedure is

    H_0: θ_1 = θ_2 = ... = θ_k = 0;

that is, H_0 states that the treatment effect is zero in
every study.  If we reject H_0 using a combined test
procedure, we may safely conclude that H_0 is false.
However, H_0 is false if at least one of θ_1, ..., θ_k is
different from zero.  Therefore, H_0 could be false
when θ_1 > 0 and θ_2 = ... = θ_k = 0.  It is doubtful if a
researcher would regard such a situation as
persuasive evidence of the efficacy of a treatment.
     The difficulty  in the  interpretation of omnibus
tests of the significance of combined results stems
from  the  nonparametric   nature   of  the  tests.
Rejection of the combined null hypothesis allows the
investigator to conclude only  that  the  omnibus  null
hypothesis is false.  Errors of interpretation usually
involve attempts to attach a parametric
interpretation to the rejection of H_0.  For example,
an investigator might incorrectly conclude that
because H_0 is rejected, the treatment effects are
greater than zero (Adcock, 1960).  Alternatively, an
investigator might incorrectly conclude that the
average treatment effect θ̄ is positive.  Neither
parametric interpretation of the rejection of H_0 is
warranted without additional a priori assumptions.
     An  additional  assumption  that  is  sometimes
made is the assumption that  there is "no  qualitative
interaction."  This  assumption  is essentially that if
the treatment effect in  any study is positive,  then no
other  treatment  effect  is negative.  That is, all of
the treatment effects are of the same sign  but  not
necessarily  of  the  same   magnitude.  While   this
assumption  may seem innocuous, it is  important to
recognize that it must be made independent of the
data and that it may be false.  For example, if a drug
that actually has a positive effect on a specific
disease also has toxic side effects, it might actually
increase the death rate among some (e.g., older or
sicker) patients.  If some studies have more older or
sicker patients, it is plausible that they might obtain
negative treatment effects while other studies with
younger or healthier patients obtain positive
treatment effects.
     An important  application of omnibus test  pro-
cedures is to combine the results of  dissimilar studies
to  screen for any treatment  effect.   For example,
combined test procedures can be used to test whether
a  treatment  has an effect  on any of a series of
different    outcome   variables.    Combined    test
procedures can even be used to combine the results
of   related   analyses   computed   using  different
parameters such as correlation coefficients or effect
sizes.
     Omnibus  tests of the statistical significance of
combined results are  poorly suited to the  task  of
drawing  general conclusions about the  magnitude,
direction,  and  consistency   of treatment  effects
across studies. On the other hand, techniques based
on combination of estimates  of effect  magnitude do
support inferences about  direction, magnitude,  and
consistency  of   effects.   Therefore,  statistical
analyses based  on effect sizes are  preferable  for
most applications of meta-analysis.

7.0  Combined Estimation.
    When all  of  the studies  have similar designs  and
measure the outcome construct in a similar (but  not
necessarily  identical)  manner,  the combined  esti-
mation approach is probably  the preferred method of
meta-analysis (Hedges & Olkin, 1985).
    It is  difficult  to discuss  the problem of com-
bining the results of studies in complete  generality.
The purpose  and  procedures  of   research  studies

-------
obviously vary tremendously even within a discipline.
It is usually the case, however, that research studies
seek   to   estimate  one  or  more  substantively
meaningful parameters.  The results  of  a study can
therefore  often be  summarized via  an estimate of
that parameter and its standard error.  An important
special  case is  the  situation  where studies  examine
the effect of a  "treatment"  and the result of the
study is an estimate of the "effect" of this treatment
measured  in some relevant fashion.  In this case, the
first step in combined estimation is  the selection of
an index of effect magnitude.  Many different indices
of effect magnitude have been used in meta-analysis
including the raw mean difference between treatment
and control group means, the standardized difference
between treatment and control group means (e.g.,
Smith & Glass, 1977), the observed minus expected
frequency of some outcome like death (e.g., Yusuf,
Peto, Lewis, Collins, & Sleight, 1985), the risk ratio
between treatment and control groups (Canner, 1983),
or the simple difference between proportions of some
outcome in the treatment and control groups (e.g.,
Devine & Cook, 1983).
     Statistical analysis procedures for meta-analysis
using any of these indices of effect magnitude are
analogous (Elashoff, 1978; Fleiss, 1973; Gilbert,
McPeek, & Mosteller, 1977; Hedges, 1983; Hedges &
Olkin, 1985; Mantel and Haenszel, 1959; Sheele,
1966).  All involve large-sample theory and differ
mainly in the details of calculation of standard errors
and bias corrections.
     Before  discussing  examples  of modeling pro-
cedure  in  combined  estimation,   it  is useful   to
consider conceptual  issues that have  implications  for
that modeling.

7.1  The Nature of Between-Study Variation.
     Between-study variation is defined as variability
in  the study  outcome  parameters.  Three  natural
conceptualizations of between-study variation  treat
this variation  as  totally  systematic   (e.g.,  fixed),
totally  random   (e.g.,  nonsystematic),  or  mixed
(partially  systematic  and  partially  nonsystematic).
These  three conceptualizations  give rise to  three
different  types  of models for the results of a series
of studies.  In the fixed-effects conceptualization,
the true or population values of the treatment
effects in the study are an (unknown) function of
study characteristics.  By studying the relationship
between study characteristics and treatment effects
the data analyst tries to deduce stable relationships
that explain essentially all of the variability in study
results except for that attributable to within-study
sampling variability.  The evaluation of particular
explanatory models is part of this process.
     The  random-effects  conception arises  from  a
model  in  which  the  treatment  effects  are  not
functions of known study characteristics.   In  this
model, the  true  or population values of treatment
effects vary randomly from study to  study, as if they
were sampled  from a universe of possible treatment
effects (see Hedges &  Olkin, 1985).  The random
effects   conceptualization   is   consistent   with
Cronbach's (1980) proposal that evaluation  studies
should consider a model in which  each treatment site
(or study) is a  sample realization  from a  universe of
related treatments.  The primary difference between
the interpretation  of  fixed- and random-effects
 models is that  between-study variation in treatment
effects   is   conceived   to   be   unsystematic   in
random-effects  models  and consequently explanation
of this variance is  not possible.  Instead  the data
analyst usually  seeks to quantify this  variation by
estimating  a  (treatment   by  studies  interaction)
"variance component":  an index of the variability of
population treatment effects across studies.
    Mixed models involve a combination of the ideas
involved in fixed- and in random-effects models. In
these   models,   some  of  the  variation   between
treatment effects is  fixed (i.e., explainable)  and some
is random.  Consequently, the data analyst seeks to
explain some of  the variation between  study results
and quantify the remainder by estimating a variance
component (Raudenbush & Bryk, 1985, DerSimonian &
Laird,  1983). Such models have considerable promise
as data analytic tools  for situations in which it is
useful  to treat some of the variability between study
results as random.
    The most important difference in  the  outcomes
produced  by the three types  of statistical  analyses
lies  in  the  standard errors that they associate with
the overall (combined)  estimate of the  treatment
effect.    Fixed-effect   analyses   incorporate  only
within-study variability into  the  estimate  of  the
standard  error  of the  combined  treatment effect.
Fixed-effects analyses produce the smallest standard
error estimates  because they  are, in fact, conditional
on  the known  and  unknown  characteristics of  the
particular    studies    that    have    been    done.
Random-effects analyses include the between-study
variance component in estimates of the standard
error of the overall (combined) estimate of the
treatment effect, and hence produce the largest
standard errors; mixed-model analyses produce
standard errors between those of fixed- and
random-effects analyses.

7.2  Monitoring Models for Between-Study
     Differences.
     Although most statisticians have a great deal of
experience  which   aids  their intuition about  the
specification  and robustness   of statistical models
within  studies,  few  of us have broad experience or
accurate   intuition  about  statistical   models   for
between-study   variation.    Unlike    within-study
variation,  between-study  variation is  completely
uncontrolled by the  investigator (reviewer)  and even
retrospective information about  the sampling units
(studies)  may  be difficult to  obtain.   Consequently
modeling assumptions about between-study variation
are  likely to be wrong, often horrendously  wrong. In
such an environment,  it is very unwise  to depend on
statistical  analysis  procedures  that   are  strongly
dependent on specification of a particular  model for
between-study variation.
     Research synthesis requires procedures that are
either    robust    against     misspecification    of
between-study models or procedures that allow
monitoring  of  the  adequacy of  the  model  for
between-study variation.  Nonrobust procedures  that
do  not   permit  (indeed  compel)  comprehensive
monitoring are  a recipe for disaster. Two examples
may illustrate the point. The first example concerns
weighting of the results  of  different  studies when
combining estimates of treatment effects.   The most
efficient estimate  of  a common treatment  effect
uses a weight for each study that is inversely
proportional to the variance (squared standard error)
of the estimated treatment effect.  If you really believe the model,
you use it regardless of  the weights  it assigns.  A
skeptic might  believe  that  no  single study  should
receive too  much  weight and therefore  place an
 upper  bound on the weight  any  study may attain.
 This more robust solution  (called partial  weighting)

-------
 was suggested by  Yates and  Cochran  (1938) as  an
 alternative to excessive  belief  in  a  model.  The
techniques of modern robust estimation obviously
 have  a  wide  range  of  possible   application  in
 meta-analysis, but less robust  procedures may be just
 as useful if their application is  carefully monitored.
     A second example concerns the use of monolithic
 data  analysis  procedures versus   data   analysis
 procedures that are easy to monitor  in detail. In one
 sense it is natural to attack the problem  of combining
 information  from  different  studies by  utilizing  a
 comprehensive statistical  model  and performing  all
 aspects of the analysis simultaneously, for example
 by maximum likelihood estimation. This procedure is
 elegant and  may have certain technical advantages.
 Yet simultaneous estimation of all aspects  of a model
 may have the disadvantage  that it is difficult  to
 discover parts of the model that are not consistent
 with the  data.  Moreover,  failures of one aspect  of
 the  model may affect estimation  of  other aspects  of
 the  model.  The alternative of explicitly computing
 estimates  from each study  and then  combining those
 estimates   using   explicit  procedures   such   as
 generalized least squares is far less  elegant.  It may
 also have some  technical  disadvantages,  but it has
 the  advantage that  the  adequacy of the model  is
 much  easier to monitor.   Moreover, failures  of one
 aspect of the model are less likely to spill over and
 create  problems  in another part  of  the model.  This
 issue also  arises in econometric modeling where the
 same  tradeoffs  are  recognized  between so-called
 "full information" modeling and "limited information"
 modeling.  The point  here is not that  simpler methods
 of combining results are better. It is that  models for
 between-study differences should be tentative and
 therefore  require   serious   monitoring.    Simpler
 combination  procedures  are often easier to monitor
 and  therefore  have  an advantage that might  not  be
 obvious.   Methods  for   monitoring   models  or  for
 producing   robust    combination   procedures  are
 essential in research synthesis.

 7.2.2 Realistic Modeling of Between-Study
      Differences.
     Modeling  between-study  differences  is often
 fraught   with  complications   that   are   highly
 idiosyncratic to the particular data set under analysis.
 Research  syntheses can only be helpful if  they  make
 a serious   effort  to incorporate these idiosyncrasies
 into the data analysis model. Some of the sources  of
 these  idiosyncrasies  have  already been mentioned.
 For  example, publication or reporting bias leads  to
 censoring   or  truncation at the  sampling level  of
studies. Other censoring or truncation  effects  may
 exist  within  studies and both  may  need  to  be
explicitly   modeled.  Similarly,  missing data  that  is
not   missing at random may be important to consider
in the  model  for  the data  analysis.  Finally the
possibility   of   dependence    between    supposedly
independent studies  cannot be ignored.  If a given
team of investigators produces  several studies whose
results  are   more   alike   than   those  of  other
investigators, dependencies  are introduced which may
need to be incorporated into the model specification.

8.0   Some Statistical Methods for Combined
     Estimation.
     This section  presents  statistical methods  that
are   frequently used in  research   synthesis.   The
outline  of methods  that  follows  is  intended  to be
generic and  therefore it  does not  incorporate the
complexities  that  might  be  added  in  an  actual
research synthesis.

8.1 Statistical Methods for Fixed Effects
    Meta-analysis.
    Suppose that T_1, ..., T_k are independent esti-
mates of effect magnitude from k studies with
sample sizes n_1, ..., n_k and unknown population effect
magnitude parameters θ_1, ..., θ_k.  Assuming that the
standard error of T_i is a function of θ_i, denote the
standard errors of T_1, ..., T_k by S_1(θ_1), ..., S_k(θ_k) and
the estimated standard errors by S_1(T_1), ..., S_k(T_k).
Assume further that each T_i has a normal asymptotic
distribution leading to the large-sample normal
approximation

         T_i ≈ N(θ_i, S_i²(θ_i)).                           (7)
8.1.1 Estimating the Overall Average Treatment
      Effect.
      One of the first statistical questions that arises
is how to estimate the overall average treatment
effect when it is believed that θ_1, ..., θ_k are very
similar.  One way of combining the estimates is
obviously to take the simple average of T_1, ..., T_k.  The
most precise combination (i.e., the most efficient
estimator of θ when θ_1 = ... = θ_k = θ) is a weighted
average that takes the standard errors
S_1(T_1), ..., S_k(T_k) into account.  This weighted average
is

         T. = Σ_{i=1}^{k} w_i T_i / Σ_{i=1}^{k} w_i,        (8)

where w_i = 1/S_i²(T_i).  One slight refinement of (8) is
the iterated estimator T.^(j) defined by T.^(0) = T. and

         T.^(j) = Σ_{i=1}^{k} w_i^(j) T_i / Σ_{i=1}^{k} w_i^(j),      (9)

where w_i^(j) = 1/S_i²(T.^(j-1)).  When N = Σ_{i=1}^{k} n_i
with n_i/N fixed (that is, if each study has a large
sample size) and if each S_i(T) is a continuous function
of T, the estimators T. and T.^(j) have (the same)
asymptotic distribution leading to the large-sample
normal approximation

         T. ≈ N(θ, S.²(T.)),                               (10)

where

         S.^{-2}(T.) = Σ_{i=1}^{k} S_i^{-2}(T_i).          (11)

This result can be used to compute tests of
significance and confidence intervals for θ based on
T..  For example a 100(1 - α) percent confidence
interval for θ is given by

         T. - z_{α/2} S.(T.) < θ < T. + z_{α/2} S.(T.),    (12)

-------
where z_{α/2} is the two-tailed critical value of the
standard normal distribution.  Alternatively, a test of
the hypothesis that θ = 0 uses the statistic

         Z = T. / S.(T.),                                  (13)

which is compared to the critical values of the
standard normal distribution.  If the T_i, i = 1, ..., k, are
asymptotically efficient (have asymptotic variances
equal to the Cramer-Rao lower bound), then T. is
asymptotically efficient.
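     The calculations in (8) and (11)-(13) are easily carried out
directly; the following sketch, using hypothetical estimates and
standard errors, computes the weighted average, its standard error, a
95 percent confidence interval, and the Z statistic:

    import numpy as np
    from scipy.stats import norm

    # Hypothetical effect estimates T_i and their standard errors S_i(T_i).
    t  = np.array([0.30, 0.25, 0.35, 0.28])
    se = np.array([0.10, 0.12, 0.11, 0.09])

    w      = 1.0 / se**2                   # weights w_i = 1/S_i^2(T_i)
    t_dot  = np.sum(w * t) / np.sum(w)     # weighted average, equation (8)
    se_dot = np.sqrt(1.0 / np.sum(w))      # S.(T.), from equation (11)

    alpha  = 0.05
    z_crit = norm.ppf(1.0 - alpha / 2)
    ci     = (t_dot - z_crit * se_dot, t_dot + z_crit * se_dot)   # equation (12)
    z_stat = t_dot / se_dot                                       # equation (13)

    print("combined estimate:", round(t_dot, 3))
    print("95% CI:", np.round(ci, 3))
    print("Z:", round(z_stat, 2))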
    Note that weighted combinations of estimators
using estimated weights and their iterated coun-
terparts have identical large sample properties.  Thus
the decision about whether to iterate to obtain the
estimates of θ used in the weights depends on the
form of S_i²(θ) and on the small sample properties of
T..  If S_i²(θ) is almost independent of θ for the
plausible values of θ in the problem at hand, iteration
is unlikely to have much effect.  On the other hand,
if S_i²(θ) changes considerably as a function of
plausible values of θ, iteration may change the value
of the estimate by a considerable amount.  Note also
that even if the T_i are unbiased estimates of θ, T. is
usually biased.  The iterated estimators T.^(j) will
usually tend to be less biased than T..
     Another point is that the regularity condition
that the n_i/N remain fixed as N → ∞ is important.  If
we let N increase by increasing k and letting the
n_i/N → 0, then a variety of problems arise.  For
example, under this condition T. is not even
consistent if the T_i are biased estimators of θ (see
Neyman & Scott, 1948).

8.1.2 Testing Homogeneity of Treatment Effects.
     Combining estimates of effect magnitude across
studies is reasonable if the studies have a common
population effect magnitude θ.  In this case, much
depends on the small-sample properties of T. and on
the dependence of the variance of the T_i on θ_i.  If
the variance S_i(θ) is almost independent of θ, then
the iteration process will not change the estimates
very much.  If S_i(θ) is greatly influenced by θ, then
iteration could change the estimate of θ considerably.
The iterated estimators are also likely to be less
biased than the uniterated estimators.
     This result  is often useful in situations where the
model  for the data is well understood,  but the obvious
means of obtaining  estimates, such as the  method of
maximum  likelihood, require  complicated iterative
methods.   The  present  method yields  estimators that
are  simple  to  compute  but   have   the  same
large-sample  properties  as   maximum-likelihood
estimators.  In  addition the diagnostic  procedures
routinely  used  in  weighted least  squares  can be
applied in the usual manner.
     If the investigator proposes a linear model for θ,
it may be desirable to see if the model is reasonably
consistent with the estimates T_1, ..., T_k.  If the model
does not seem  reasonably consistent with the data,
then the entire analysis should be suspect.  Iterated
estimators, in particular, would not  be expected to
perform well if the model is misspecified.  Whenever
k > p, a natural  test of model specification arises in
connection with the estimator presented above.  The
test given below provides a way  to check that the
data are consistent with the proposed  model.

     If β = 0, then the statistic

         H_M = β̂' (X' V^{-1}(T) X) β̂,                      (20)

where V(T) = diag(S_1²(T_1), ..., S_k²(T_k)), is
distributed approximately as a chi-square with p
degrees of freedom.  The statistic H_M can be used
for a simultaneous test that β_1 = ... = β_p = 0, or
alternatively as a simultaneous test that θ_1 = ... = θ_k =
0.  More significantly, H_M is used in the calculation of
a test for goodness of fit or specification of the
linear model.  If k > p and the model θ = Xβ is
correctly specified, then the statistic

         H_E = T' V^{-1}(T) T - H_M

has an approximate distribution given by

         H_E ~ χ²(k - p).                                  (21)

     The statistic H_M is the (weighted) sum of
squares due to the regression model and the statistic
H_E is the (weighted) sum of squares about the
regression plane (the residual sum of squares).  Thus the test
for model misspecification is a test for larger than
expected residual variation.  If H_E is large or
significant, the investigator might use any of the
standard tools of regression analysis (e.g., the
examination of residuals, the search for influential
observations, etc.) to look for problems with the
model.
     Note that the linear model analyses described
above can all be performed with standard packaged
computer programs such as SAS PROC GLM.  The
weighted regression (or analysis of variance) is
performed by simply specifying the weight for each
case (each T_i) as w_i = 1/S_i²(T_i).
The regression coefficients are printed directly and
the variances of the estimated regression
coefficients are the diagonal elements of the inverse
of the (X'WX) matrix.  The statistics H_M and H_E are
printed as the weighted sum of squares due to the
regression model and the weighted error sum of
squares about the regression plane.
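     Equivalently, the weighted least-squares computations can be written
out directly; the sketch below, using hypothetical estimates, standard
errors, and a single study-level predictor, computes the coefficients,
their variances, and the statistics H_M and H_E:

    import numpy as np
    from scipy.stats import chi2

    # Hypothetical effect estimates, standard errors, and one study-level predictor.
    t  = np.array([0.10, 0.25, 0.40, 0.55, 0.32])
    se = np.array([0.12, 0.10, 0.11, 0.09, 0.13])
    x  = np.array([0.0, 1.0, 2.0, 3.0, 1.5])     # e.g., dose or another study characteristic

    X = np.column_stack([np.ones_like(x), x])    # design matrix (intercept + predictor)
    W = np.diag(1.0 / se**2)                     # weight matrix

    XtWX = X.T @ W @ X
    beta = np.linalg.solve(XtWX, X.T @ W @ t)    # weighted least-squares coefficients
    cov_beta = np.linalg.inv(XtWX)               # covariance of the coefficients

    H_M = beta @ XtWX @ beta                     # weighted SS due to the model
    H_E = t @ W @ t - H_M                        # weighted SS about the regression plane

    k, p = X.shape
    print("coefficients:", np.round(beta, 3))
    print("coefficient standard errors:", np.round(np.sqrt(np.diag(cov_beta)), 3))
    print("H_E:", round(H_E, 2), "  specification p-value:", round(chi2.sf(H_E, k - p), 3))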

8.2 Statistical Methods for Random-Effects
    Meta-analysis.
    Again suppose that T_1, ..., T_k are independent
estimates of treatment effects from k experiments
with (unknown) population treatment effects θ_1, ..., θ_k.
Again denote the standard error of T_i given θ_i by
S_i(T_i).  Assume as before that the T_i are
approximately normally distributed.  Now, however,
introduce the random-effects model that the θ_i
are sampled from a hyperpopulation of treatment
effects.  Often the θ_i are assumed to be normally
distributed.  The object of the analysis is to estimate
the mean and variance (the hyperparameters) of
the distribution of population treatment effects, and
to test the hypothesis that the mean is zero.
    A distribution-free approach to estimating the
variance component σ² is analogous to the procedure
used to estimate the variance component in the
one-factor random effects analysis of variance.  The
estimate σ̂² is given by

         σ̂² = S_T² - (1/k) Σ_{i=1}^{k} S_i²(T_i),          (22)

-------
where S_T² is the usual sample variance of T_1, ..., T_k
(see Hedges and Olkin, 1985).  More complex
procedures for estimating σ² under various
distributional assumptions on the θ_i are given in
Champney (1983), Raudenbush and Bryk (1985), and
Hedges and Olkin (1985).
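     A sketch of this distribution-free calculation, with hypothetical
estimates and standard errors, forms the moment-style variance-component
estimate, truncated at zero, and the corresponding random-effects
weighted mean:

    import numpy as np

    # Hypothetical effect estimates and their (within-study) standard errors.
    t  = np.array([0.05, 0.40, 0.22, 0.61, 0.18])
    se = np.array([0.12, 0.10, 0.15, 0.11, 0.13])

    # Sample variance of the T_i minus the average within-study variance,
    # truncated at zero (a sketch of the distribution-free approach above).
    var_between = max(0.0, np.var(t, ddof=1) - np.mean(se**2))

    # Random-effects weights and weighted mean of the treatment effects.
    w_star  = 1.0 / (se**2 + var_between)
    t_star  = np.sum(w_star * t) / np.sum(w_star)
    se_star = np.sqrt(1.0 / np.sum(w_star))

    print("between-study variance estimate:", round(var_between, 4))
    print("random-effects combined estimate:", round(t_star, 3), "+/-", round(se_star, 3))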
     The usual estimate of the mean treatment effect
is the weighted mean

         T*. = Σ_{i=1}^{k} ŵ_i T_i / Σ_{i=1}^{k} ŵ_i,      (23)

where ŵ_i = 1/(σ̂² + S_i²(T_i)).  The weighted mean T*.
is approximately normally distributed.
-------
generalizations  within  and  across  studies.   The
experimental  paradigm  predisposes  researchers  to
view  the  generalization across  subjects  as  natural
because,  by  definition,  the  differences  between
subjects are  "experimental  errors"  (see  Cronbach,
1957).  The  few  systematic  individual  difference
variables  recognized in  the experimental paradigm
are  incorporated  into  the  design  and all  other
differences are by definition nonsystematic.
    Differences between studies, on the other hand,
are   viewed  as   systematic  because   the   same
experimental  paradigm  stresses  the  importance  of
the design of  research studies.  A great deal  of the
training and  professional effort  of  researchers  is
devoted  to    learning   about,    planning,   and
implementing  systematic aspects of research studies
that  make  one  study  different  from  another.   I
emphasize that the differences  between  studies are
viewed by knowledgeable researchers as systematic,
because researchers strive to  make  their studies
systematically  different   from   those  of   other
researchers to obtain new information.  They  do so
because they believe these differences could have an
effect  on  the results of  the  study.  Moreover, the
differences   between   studies   are   not   usually
unidimensional.  The design and execution of research
studies  is so complex  that  even "similar" studies
often differ in many ways. For  this reason, it would
be  expected that researchers  would  have difficulty
with  any  method  of   generalizing  across  studies,
because such  methods implicitly relegate the many
complex and  important differences between  studies
to  the  status of  "error" or unsystematic variation.
Researchers are likely to find even more
problematic statistical methods for generalizing
across  studies   that  explicitly  define  all of the
unmodeled  variation between  study results   to  be
sampling  error.   Conventional  statistical  methods
(such as  t  tests, analysis of  variance and multiple
regression  analysis) applied  to  effect  sizes  are
examples of this type.  Statistical  methods developed
specifically   for  meta-analysis  separate  variation
between studies that is  due to sampling error  within
studies  from  that which  is  due  to  systematic
variation  between studies.
     The most persistent criticisms of meta-analysis
stem, in part, from the perspective of  researchers
who feel  that differences among  studies and among
their results  are  systematic and that meta-analysis
fails  in some way to  recognize  those differences.
Perhaps the most consistent of  these  criticisms of
meta-analysis (which  could be  criticisms of  any
review) have  come to  be  called the  "apples  and
oranges"  criticism and  the "garbage-in garbage-out"
criticism.
     The apples and oranges criticism maintains that
meta-analysis combines evidence from  studies which
do not have the  "same"  procedures,  independent
variables,    or     dependent    variables.     Thus
meta-analysis  is  combining  the incommensurable
because  studies  exhibit   systematic   differences.
Another statement of essentially  the same criticism
(Presby, 1978) is that combining research studies into
overly    broad    categories   obscures   important
differences between those studies and their  results.
In  each case, the fundamental issue is the breadth of
constructs  that are  the "same," and  the  critic's
position  is  that only  aggregation  across  a  rather
narrow range of  treatment,  control,  and outcome
constructs is sensible.
    The    "garbage-in    garbage-out"    criticism
(Eysenck,  1978)  is  that  by  abandoning  "critical
judgment"   about  the  quality  of  research  studies
reviewed, meta-analysis  placed too much emphasis
on studies of low quality.  Because studies of  low
quality  are presumably subject  to  many biases,  they
cannot  be  the foundation  of  reliable  knowledge.
Meta-analysis  therefore  becomes  another  case  of
garbage-in  garbage-out.   Although   the  criticism
concerns the question of methodological quality, it is
firmly  rooted  in the  conception  that  there  are
systematic  differences   (in  methodology)  between
studies  that influence study results.

9.2 Improving Meta-analysis in the Service of
    Scientific Explanation.
    The    improvement    of    meta-analysis    as
explanation depends  on  greater attention  to  both
methodological  detail  and to persistent criticisms of
meta-analysis.  Critics tell us  why they do not find
meta-analyses  to  be  convincing  as explanation.
Attempts to respond to those criticisms (where  they
do   not  conflict  with   other   requirements  of
methodology) are  likely  to  yield  more  persuasive
meta-analyses.  Many of these criticisms are among
the  issues  discussed  in  earlier   sections  of   this
chapter, but  two  general  issues emerge.  One is the
issue  of specificity versus generality  of constructs.
The  other  is the appropriate use of quantitative
methods.
    The   issues   of   specificity  arise   because
researchers  tend  to  think  of  studies in  terms  of
specific and rather narrow constructs.  This tendency
toward  specificity is reflected  in the usually narrow
choice of constructs in conventional reviews (Cook &
Leviton, 1980).  Meta-analyses  are  likely to be  more
credible as  explanation  if  they  use (or  at  least
distinguish)  constructs of treatment,  control,  and
outcome that are relatively narrow and relatively
specific  to   the  research   domain   at   hand.
Meta-analyses are also  likely to be more credible if
they use conceptions of study quality  that  recognize
the specific difficulties  associated with the domain
under study.  By  treating between-study differences
in rather specific ways,  meta-analyses  will offer a
richer variety of connections with researchers'
conceptualizations of the research domain.
    The  issues of  appropriate use of  quantitative
methods might  be interpreted to include all issues of
the   formal    (mathematical)   appropriateness  of
statistical methods in a given situation.  More
important is the question of when statistical methods
should be used, given that they are formally correct.
Researchers are not always comfortable
with the use of quantitative methods to empirically
"define" the  differences among studies that deserve
consideration. For example,  the argument  that  study
quality  can  be defined empirically  by determining
which groups of  studies give  different  answers has
not  always  been persuasive.   Critics seem  to be
saying  that quantitative  analyses  cannot  carry  the
whole  load.  Meta-analyses are  likely to be  more
persuasive   if  they  use  qualitative  methods  to
determine  interesting  differences among  studies.
 Researchers  know both  that  quantitative  methods
cannot resolve all questions and that these methods
must be guided  and  set  in context  by qualitative
analysis.   Qualitative   information  that   is   not
explicitly coded as between-study differences has an
important role in interpretation  and  should not be
neglected  even   if   it  requires  rather   lengthy

-------
descriptions  of  important   aspects   of   individual
studies (Light & Pillemer, 1984).
    The net effect of these suggestions would be to
make  meta-analyses  look more  like  conventional
narrative reviews, involving perhaps fewer studies,
distinguishing narrower constructs, and providing
more detailed  qualitative  and conceptual arguments.
In fact, earlier conventional reviews of an  area may
be  a  model  for   conceptualization  and  level  of
operational  detail  that are  appropriate.   The most
persuasive meta-analysis  is  likely to  be one that
combines the strengths of qualitative reviews  and
those of serious quantitative methodology.

                    References

Adcock, C. J. (1960). A note on combining probabil-
    ities. Psychometrika, 25, 303-305.
Bakan, D. (1966).  The  test of significance in psycho-
    logical research. Psychological Bulletin, 66,
    423-437.
Barnett, V., & Lewis, T. (1978).  Outliers in Statistical
    Data. New York: John Wiley.
Bozarth, H. D., & Roberts, Jr., R. R. (1972). Signi-
    fying    significant   significance.    American
    Psychologist. 22. 774-775.
Birge, R. T. (1932).  The calculation of errors by the
    method  of least squares.   Physical Review.  16,
    1-32.
Bracht, G., & Glass, G. V. (1968). The external
    validity  of experiments.   American Educational
    Research Journal. 5, 437-474.
Campbell, D. T. (1969). Definitional versus  multiple
    operationalism. et al., 2, 14-17.
Campbell, D. T., &  Stanley, J. C. (1963). Experi-
    mental   and   Quasiexperimental   Designs   for
    Research.  Chicago: Rand McNally.
Canner, P. L. (1983). Aspirin in coronary heart
    disease:   A  comparison  of  six  clinical trials.
    Israel Journal of Medical Sciences. 12, 413-423.
Chalmers, T. C. (1982). The randomized controlled
    trial as  a  basis for therapeutic decisions.  In J.
    M.  Lachin, N.   Tygstrup. & E. Juhl (Eds.).  The
    Randomized   Clinical  Trial   and  Therapeutic
    Decisions.  New York: Marcel Dekker.
Champney, T. F. (1983).  Adjustments for Selection:
    Publication  Bias  in  Quantitative   Research
    Synthesis.   Unpublished  doctoral  dissertation.
    The University of  Chicago.
Clarke, F. W. (1920). A redetermination of atomic
    weights.  Memoirs  of the  National Academy of
    Science, 16(3), 1-48.
Cochran, W. C. (1937).  Problems arising in the
    analysis  of  a  series  of  similar  experiments.
    Journal  of    the  Royal  Statistical  Society
    (Supplement). 4, 102-118.
Cochran, W. C. (1943).  The comparison of different
    scales of measurement for experimental results.
    Annals of Mathematical Statistics. 14.  205-216.
Cook, T. D., & Campbell, D. T. (1979).  Quasi-
    experimentation. Chicago:  Rand McNally.
Cook, T. D., & Leviton, L. C. (1980). Reviewing the
    literature:  A  comparison of traditional methods
    with meta-analysis.  Journal  of Personality. 48.,
    449-472.
Cooper, H. M. (1979).  Statistically combining
    independent  studies:  A  meta-analysis of  sex
    differences in  conformity  research.  Journal  of
    Personality and Social Psychology. 32, 131-146.
Cooper, H. M. (1984).  The  Integrative Research
    Review: A Systematic Approach.  Beverly Hills:
    Sage Publications.
Cooper, H. M. & Arkin, R. M. (1981). On quanti-
    tative reviewing.  Journal of  Personality.  42,
    225-230.
Crain, R. L. & Mahard, R. E. (1983).  The effect of
    research    methodology    on    desegregation-
    achievement    studies:    A    meta-analysis.
    American Journal of Sociology, 88, 839-854.
Cronbach, L. J. (1957). The two disciplines of
    scientific  psychology.   American  Psychologist.
    12, 671-684.
Cronbach, L. J. (1980). Toward Reform of Program
    Evaluation. San Francisco: Jossey-Bass.
DerSimonian, R., & Laird, N. (1983). Evaluating the
    effectiveness of coaching for SAT  exams:  A
    meta-analysis.  Harvard Educational Review. 53.
    1-15.
Eagly, A. H. & Carli, L. L. (1981). Sex of
    researchers and sex-typed communications as
    determinants of sex differences in
    influenceability: A meta-analysis of social
    influence studies. Psychological Bulletin, 90,
    1-20.
Eagly, A. H. & Crowley, M. (1986).  Gender and
    helping behavior: A meta-analytic review of the
    social  psychological  literature.    Psychological
    Bulletin. 99.
Elashoff, J. D.  (1978). Combining  the results of
    clinical trials. Gastroenterology, 28, 1170-1172.
Eysenck, H. J. (1978). An exercise in mega-silliness.
    American Psychologist. 33, 517.
Fisher, R. A. (1932).  Statistical Methods for
    Research  Workers  (4th ed.)  London:  Oliver &
    Boyd.
Fleiss, J. L. (1973). Statistical Methods for Rates and
    Proportions. New York: John Wiley.
George, E. O. (1977).  Combining Independent One-
    sided and Two-sided  Statistical  Tests  — Some
    Theory and Applications.  Unpublished doctoral
    dissertation, University of Rochester.
Giaconia, R. M. & Hedges, L. V. (1982). Identifying
    features of effective open education. Review of
    Educational Research. 52, 579-602.
Gilbert, J. P., McPeek, B., & Mosteller, F. (1977).
    Progress in surgery and anesthesia:  Benefits and
    risks of  innovation  therapy.  In J.  Bunker,  B.
    Barnes,  and F.  Mosteller  (Eds.).  Costs. Risks.
    and Benefits  of Surgery.  New  York:  Oxford
    University Press.
Glass, G. V. (1976).  Primary, secondary,  and
    meta-analysis   of    research.     Educational
    Researcher. 5, 3-8.
Glass, G. V. & Smith, M. L. (1979). Meta-analysis of
    the  relationship   between   class  size   and
    achievement.  Educational Evaluation and Policy
    Analysis. 1, 2-16.
Glass, G. V., McGaw, B., & Smith, M. L. (1981).
    Meta-analysis in Social Research. Beverly
    Hills:  Sage Publications.
Greenwald, A.  G. (1975).  Consequences of prejudice
    against   the  null   hypothesis.    Psychological
    Bulletin, 82, 1-20.
Hawkins, D. M. (1980).  Identification of Outliers.
    London: Chapman Hall.
Hedges, L. V. (1983).  Combining independent esti-
    mators  in   research   synthesis.   The  British
    Journal   of   Mathematical  and  Statistical
     Psychology. 36. 123-131.
Hedges, L. V. (1984). Estimation of effect size
    under normal  nonrandom sampling:  The effects
    of  censoring   studies   yielding   statistically
     insignificant   mean  differences.   Journal  of
     Educational Statistics. 9, 61-85.

-------
 Hedges, L. V. & Olkin, I. (1980). Vote counting
     methods in  research  synthesis.  Psychological
     Bulletin, 88, 359-369.
 Hedges, L. V. & Olkin, I. (1983). Clustering estimates
     of effect magnitude from independent  studies.
     Psychological Bulletin. 23, 563-573.
 Hedges, L. V. & Olkin, I. (1985). Statistical Methods
     for Meta-analysis. New York: Academic Press.
 Hunter, J. E., Schmidt, F. L., & Jackson, J. B. (1982).
     Meta-analysis: Cumulating findings across
     research. Beverly Hills: Sage.
 Lane, D. M. & Dunlap, W. P. (1978). Estimating
     effect sizes: Bias resulting from the
     significance criterion in editorial decisions.
     British Journal of Mathematical and Statistical
     Psychology, 31, 107-112.
 Light, R. J., & Pillemer, D. B. (1984). Summing Up:
     The Science of Reviewing Research. Cambridge,
     Massachusetts: Harvard University Press.
 Linn, M. C., & Peterson, A. C. (1985). Emergence
     and characterization of sex differences in spatial
     ability. Child Development, 56, 1479-1498.
 Madow, W. G., Nisselson, H., & Olkin, I. (1983).
     Incomplete data in sample surveys: Vol. 1,
     Report and case studies. New York: Academic
     Press.
 Madow, W. G., & Olkin, I. (1983). Incomplete data
     in sample surveys: Vol. 3, Proceedings of the
     symposium. New York: Academic Press.
 Madow, W. G., Olkin, I., & Rubin, D. B. (1983).
     Incomplete data in sample surveys: Vol. 2,
     Theories and bibliographies. New York:
     Academic Press.
Mantel, N. & Haenszel, W. (1959).  Statistical aspects
    of the analysis of data from retrospective
    studies. Journal of the National Cancer
    Institute, 22, 719-748.
Melton, A. W. (1962). Editorial. Journal of Experi-
    mental Psychology, 64, 553-557.
Miller, R. G. (1981). Simultaneous Statistical Infer-
    ence, (2nd Ed.). New York: Springer-Verlag.
Mudholkar, G. S. & George, E. O. (1979). The logit
    method for combining probabilities. In J.
    Rustagi (Ed.). Symposium on Optimizing
    Methods in Statistics (pp. 345-366). New York:
    Academic Press.
Neyman, J., & Scott, E. L. (1948). Consistent esti-
    mates based on partially consistent
    observations. Econometrica, 16, 1-32.
Orwin, R. G., & Cordray, D. S. (1985). Effects of
    deficient reporting on meta-analysis: A
    conceptual framework and reanalysis.
    Psychological Bulletin, 22, 134-147.
Pearson, K. (1933).  On a method of determining
     whether a  sample of given  size  n  supposed to
     have been drawn from a parent population having
     a  known probability integral has  probably been
     drawn at random.  Biometrika. 25, 379-410.
Presby, S. (1978). Overly broad categories obscure
     important differences. American Psychologist,
     33, 514-515.
Raudenbush, S. W., 4 Bryk, A.  S. (1985).  Empirical
     Bayes meta-analysis.  Journal  of  Educational
     Statistics. 10_, 75-98.
Rocke, D. M., Downs, G. W., & Rocke, A. J. (1982).
     Are robust estimators really necessary?
     Technometrics, 24, 95-101.
Rosenfeld, A. H. (1975).  The particle data group:
     Growth and operations. Annual Review of
     Nuclear Science, 555-599.
Rosenthal, R. (1984).  Meta-analytic Procedures for
    Social   Research.     Beverly   Hills:    Sage
    Publications.
Rubin, D. B. (1976). Inference and missing data.
    Biometrika. 63, 581-592.
Sheele, P. R. (1966).  Combination of log-relative
    risks  in  retrospective  studies  of  disease.
    American Journal of Public Health, 56,
    1745-1750.
Simpson, E. H. (1954).  The interpretation of inter-
    action  in  contingency  tables.  Journal of  the
    Royal Statistical Society, Series B, 13, 238-241.
Smith, M. L. (1980a).  Publication bias in meta-
    analysis.    Evaluation    in    Education:    An
    International Review Series. 4, 22-24.
Smith, M. L., & Glass, G. V. (1977). Meta-analysis of
    psychotherapy    outcome   studies.    American
    Psychologist. 32, 752-760.
Stampfer, M. J., Goldhaber, S. Z., Yusuf, S., Peto,  R.,
    & Hennekens, C. H. (1982). Effects of
    intravenous streptokinase on acute myocardial
    infarction.  New  England  Journal of  Medicine.
    307. 1180-1182.
Sterling, T. D. (1959).  Publications decisions and
    their possible effects on inferences  drawn from
    tests of  significance—or vice versa.  Journal of
    the American Statistical Association, 54, 30-34.
Stigler, S. M. (1977).  Do robust estimators work with
    real data?  Annals of Statistics. 5. 1055-1098.
Stock, W. A., Okun, M. A., Haring, M. J., Miller, W.,
    Kinney, C., & Cuervost, R. W. (1982).  Rigor in
    data  synthesis:   A case study  of reliability in
    meta-analysis. Educational Researcher, 11,
    10-14, 20.
Stouffer, S. A., Suchman, E. A., DeVinney, L. C., Star,
    S. A., & Williams, Jr., R. M. (1949).  The
    American Soldier. Volume 1. Adjustment  During
    Army  Life.   Princeton:   Princeton  University
    Press.
Thomas, J. R. & French, K. E. (1985). Gender
    differences across age in motor performance: A
    meta-analysis.     Psychological   Bulletin.   28.
    260-282.
Tippett, L. H. C. (1931).  The Method of Statistics.
    London:  Williams and Norgate.
Tukey, J. (1977).  Exploratory Data Analysis.
    Reading, MA: Addison-Wesley.
Wallis, W. A.  (1942).  Compounding probabilities from
    independent  significance  tests.  Econometrica.
    10, 229-248.
Webb, E., Campbell, D., Schwartz, R., Sechrest, L.,
    & Grove, J. (1981). Unobtrusive measures:
    Nonreactive research in the social sciences.
    Boston: Houghton Mifflin.
Woolf, B. (1955).  On estimating the relation between
    blood  group  and  disease.   Annals   of  Human
    Genetics. 19, 251-253-
Wortman, P. M. (1981).  Randomized clinical  trials.
    In P. M. Wortman (Ed.).  Methods for Evaluating
    Health   Services.     Beverly   Hills:    Sage
    Publications.
Yates, F., 4 Cochran. W. G. (1938). The analysis of
    groups of  experiments. Journal of Agricultural
    Research. 2£. 556-580.
Yusuf, S., Peto, R., Lewis, J., Collins, R., 4 Sleight,
    P.  (1985).   Beta  blockade  during  and  after
    myocardial   infarction:   An  overview  of  the
    randomized  trials.  Progress  in  Cardiovascular
    Diseases. 22, 335-371.

-------
                                                 DISCUSSION

                                                 Chao W. Chen,
                                      U.S. Environmental Protection Agency
     Dr. Hedges has given an excellent discussion of general issues that one should consider
when combining information from different studies.  However, the methodologies proposed in his
presentation do not appear to have much usefulness for most problems that the U.S. Environmental
Protection Agency (EPA) encounters in environmental management.  A major difficulty in
environmental management is that scientists often cannot provide precise estimates of the
environmental risk posed by a pollutant because of a lack of scientific knowledge and data.  As
is usually the case, very little is known about the potential risks of chemicals in the air,
water, or workplace.  Our knowledge is even weaker with respect to the mechanisms of
carcinogens.  Yet EPA is frequently called upon to make decisions on the management of
environmental risk in the face of such enormous uncertainty.  A simple example is the issue of
whether one should combine benign and malignant tumors in statistical evaluations of carcinogen
data.  This decision could make a big difference in classifying an agent as carcinogenic or
not.  For instance, suppose that three benign and four malignant neoplasms of the same cell
type are found in a group of 50 animals, versus none in a control group of 50 animals.  When
these incidence data are analyzed separately, neither is statistically significant (one-sided,
p > 0.05) with the use of the Fisher Exact Test.  However, when benign and malignant tumors are
combined (0/50 vs. 7/50), the incidence is highly significant (one-sided, p < 0.007).  Since
there are scientific reasons both for and against combining neoplasms, it would be more
appropriate to reflect this uncertainty in risk assessment and management.  A "meta-analysis"
which considers only statistical variability is not capable of taking into account this
dynamic nature of the problem.
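
    As an illustrative sketch (assuming Python with SciPy, and using only the counts stated
above), the separate and combined p-values of this example can be reproduced as follows:

        # A minimal sketch reproducing the Fisher Exact Test calculations for the
        # hypothetical bioassay above (3 benign, 4 malignant, 7 combined vs. 0 controls).
        from scipy.stats import fisher_exact

        n = 50                                    # animals per group
        for tumors in (3, 4, 7):                  # benign, malignant, combined
            table = [[tumors, n - tumors],        # treated group
                     [0, n]]                      # control group
            _, p = fisher_exact(table, alternative='greater')
            print(tumors, round(p, 4))            # roughly 0.121, 0.059, and 0.006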

    The methodologies presented by Hedges consist of two parts: combining significance tests,
and statistical analysis of combined effect size.  The procedure for analyzing combined effect
size is mainly an analysis-of-variance (ANOVA)-type approach, which may not be adequate for the
kind of problem usually encountered by the EPA in the area of risk assessment.  Some of the
procedures used to combine significance tests are multiple comparison tests in nature (e.g.,
the minimum p method), which is certainly not the objective of combining information from
different studies.  In the minimum p method, the null hypothesis H0 (that H01, H02, ..., H0k
all hold) is rejected if

               pm < 1 - (1 - α)^(1/k)

where pm is the minimum of p1, p2, ..., pk.  This procedure is simply a simultaneous inference
test for a family of k independent statements (Si, i = 1, 2, ..., k) with a family error rate
α.  This can be easily seen as follows:

               pi = Pr(Si is incorrect)

               1 - pi = Pr(Si is correct)

               1 - α = Pr(all Si are correct) = Π(1 - pi) ≤ (1 - pm)^k

Therefore, the null hypothesis H0 is rejected if pm < 1 - (1 - α)^(1/k).
Similarly, the use of Fisher's procedure for combining information from different studies may
not be appropriate, as the following example demonstrates.  Suppose two studies, summarized in
2x2 tables, are identical except for the sample size.  The p values for these two studies are,
respectively, 0.60 and 0.04, using the one-sided Fisher Exact Test.  The combined result by
Fisher's procedure is -2Σln(pi) = 7.5, which has a p-value of about 0.11 under H0 (chi-square
with 4 degrees of freedom).  Clearly, it is not appropriate to perform such a statistical
procedure when one of the two null hypotheses is already rejected.  This example demonstrates
that combining significance tests as proposed may not be meaningful.
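
    As an illustrative sketch (assuming Python with SciPy, and using only the two p-values of
the example), both combination rules can be checked numerically:

        # A minimal sketch applying the two combination rules above to p = 0.60 and 0.04.
        from scipy.stats import combine_pvalues

        p = [0.60, 0.04]
        alpha, k = 0.05, len(p)

        # Minimum p method: reject H0 if min(p) < 1 - (1 - alpha)**(1/k)
        print(min(p) < 1 - (1 - alpha) ** (1 / k))      # False: 0.04 exceeds 0.0253

        # Fisher's procedure: -2 * sum(ln p_i), chi-square with 2k degrees of freedom
        stat, p_combined = combine_pvalues(p, method='fisher')
        print(stat, p_combined)                          # about 7.5 and 0.11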

    The  last example I will present is the problem of
combining  cancer  risk  assessment  results.   Since
there are  many uncertainties associated with  each
step in risk assessment,  the problem of combining
these results is complex and obviously could not be
resolved   by  ANOVA-type  statistical  approaches.
Table  1  provides   hypothetical  information  on  a
suspect carcinogen  which induced liver tumors in
both male and female B6C3F1 mice via either the
gavage or inhalation routes of exposure. This suspect
carcinogen also induced leukemia in male and female
F344 rats by inhalation, but  failed to induce  tumors
in Osborne-Mendel  rats in a lifetime gavage bioassay
or in Sprague-Dawley rats in a one-year  inhalation
study.  There is a debate in the scientific community
with regard  to the significance  of  B6C3F1  mouse
liver tumors to humans because  of the  high spon-
taneous tumor rates in these animals.  The problem
facing EPA is how to use this and other information
to arrive  at  a conclusion or decision as to whether
this  suspect carcinogen  could   cause  cancer  in
humans, and if it is a human carcinogen, to determine
its risk to humans.

-------
    Assuming that the compound is a human carcinogen, the risk estimates (risk at 1 ug/m3) are
calculated with a dose-response model that is linear at low doses.  It is generally accepted
that low-dose linearity provides a plausible upper-bound estimate of risk.  However, some
fragmentary information indicates that the true shape of the dose-response curve may be
sublinear, but the degree of curvature is not known.  In combining statistical significance
tests or effect sizes (risk estimates), it is desirable to take into account all the available
scientific information, which is itself very uncertain.  Statistical procedures, such as
meta-analysis, that take into account only the sampling variability are clearly not
adequate.
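
    As an illustrative sketch of what "linear at low doses" means here (the slope value below
is an arbitrary assumption, not an estimate from any of the bioassays):

        # A minimal sketch of a dose-response model that is linear at low doses.
        # The slope q1 is an illustrative assumption only.
        import math

        def extra_risk(dose, q1=6e-5):
            # one-hit form: 1 - exp(-q1*dose), approximately q1*dose at low doses
            return 1.0 - math.exp(-q1 * dose)

        print(extra_risk(1.0))    # extra risk at 1 ug/m3, about 6 x 10^-5 for this slope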
                   TABLE 1.  SIGNIFICANCE LEVELS, p, OF TREND TEST FOR VARIOUS BIOASSAYS
                             AND ESTIMATES OF CANCER RISK AT 1 ug/m3,
                             CALCULATED ON THE BASIS OF THESE STUDIES

                                  Route of                          p-values                 Risk estimates
  Animals                Sex      exposure      Site          Malignant  Benign  Combined    at 1 ug/m3 (a)

  B6C3F1 mice             M       Gavage        Liver           0.018     N.S.     N.S.        6 x 10^-5
                          F       Gavage        Liver           0.003     N.S.     N.S.        7 x 10^-6
                          M       Inhalation    Liver           0.001     0.001    0.001       5 x 10^-7
                          F       Inhalation    Liver           0.001     N.S.     0.001       4 x 10^-7
  Osborne-Mendel rats     M       Gavage        No response       --       --       --            --
                          F       Gavage        No response       --       --       --            --
  F344 rats               M       Inhalation    Leukemia        0.004      --       --         6 x 10^-7
                          F       Inhalation    Leukemia        0.050      --       --         1 x 10^-7
  Sprague-Dawley rats     M       Inhalation    No response       --       --       --            --
                          F       Inhalation    No response       --       --       --            --

  (a) These estimates are calculated on the basis of malignant tumors alone.  In practical risk assessment,
      estimates could also be calculated on the basis of benign and/or benign and malignant tumors combined.
      For ease of presentation, they are not presented here.

  N.S. = Not significant (p > 0.05).

-------
                                           DISCUSSION
                            James M. Landwehr, AT&T Bell Laboratories
    Combining  the  results  from  several studies
through  performing  a meta-analysis of them is
clearly becoming both important and widely prac-
ticed,  especially in  the  social  sciences.  Prof.
Hedges has, in this paper as well as in other pub-
lications, carefully and  systematically  presented
the statistical  methodology  of meta-analysis.   In
this discussion  I will briefly give an overall
framework for statistical applications that I find
useful.  Then I will relate  meta-analysis to this
framework, identify  parts  of  meta-analysis that
need  further attention, and make  several sugges-
tions  concerning the  methodology.
    Before  proceeding, let  me briefly  state  my
main   point  and  general   conclusion.   Meta-
analysis  should  not  be thought of as some com-
pletely  new  and  different  kind of  statistical
methodology.    Its  steps  fit  nicely within  the
framework that  we use for statistical applications.
Doing a  meta-analysis  well,  however, requires
much care with the underlying assumptions and
the conclusions.
    I like  to think of a statistical application as
having five main parts. The first  is problem for-
mulation, including  the  design of the  study  or
experiment, and data collection. The second step
can be thought of as data analysis, or descriptive
statistics, or exploratory data  analysis;  this step
often  uses graphical displays  extensively.  Fol-
lowing the data analysis is  construction of more
formal models, which can be deterministic, stochastic, or both.  The fourth step involves
formal statistical inference, perhaps expressed in
terms of parameters of a model constructed previ-
ously, along  with  diagnostic  checking of  the
model.  Finally,  the results must be presented in a
way that is informative to the audience.
    Clearly there is  some  subjectivity in  the
definitions  of these  steps  and overlap  between
them, but it is not necessary to  be terribly pre-
cise.  I  will relate some of the terms, methods,
and issues of meta-analysis to these steps.
    Problem Formulation, Study Design,  Data
Collection.  Some  of the  important points   of
meta-analysis related to this step are the follow-
ing: the breadth  of questions asked in the separate
studies  and in  the  meta-analysis; having  clear
definitions of the variables; making sure  that it is
reasonable  to  treat the parameters as being  the
same  across studies; and considering whether or
not there is a "representative sample" of studies,
subjects, and treatments.  Judgments  of the qual-
ity of the individual studies are an important part
of the  data evaluation process.  Prof.  Hedges dis-
cussed these issues carefully  and extensively.
    Data Analysis, Descriptive Statistics, Graphi-
cal Displays.  This is an important part of statisti-
cal applications that  deserves  more  emphasis  in
meta-analysis.  Plotting the data in various ways
and studying the plots should be done at an early
stage of any application. This helps the analyst to
become  familiar with the data, to find possible
errors, to generate new  ideas and hypotheses, and
to get  initial  views on  whether or not  the  data
will  answer questions  of interest.   Sometimes
simple   plots   do  give  clear and   convincing
answers  to the important questions, so this is all
the analysis that is really necessary.
    Suppose we have an estimated parameter Ti, with its estimated standard deviation si(Ti),
from the ith study, for i from 1 to K.  We should display these values somehow.  For example,
construct a plot with the estimated parameter value on the ordinate and the study index on the
abscissa.  The ordinate has a * at Ti and a vertical line from Ti - 2si(Ti) to Ti + 2si(Ti),
and the abscissa is the index i.  Examining such a plot gives a rough idea of whether or not
the differences among the estimates Ti are consonant with the internal confidence intervals
from the individual studies.  That is, does it appear that the studies are consistently
estimating a common parameter or not?  The plot might indicate particular studies that seem
quite different from the rest, either in the estimate Ti or its estimated variability.
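
    As an illustrative sketch of such a display (assuming Python with matplotlib; the
estimates and standard deviations are made-up numbers, not results from any study):

        # A minimal sketch of the plot described above: each estimate Ti against its
        # study index i, with a vertical line from Ti - 2*si to Ti + 2*si.
        import matplotlib.pyplot as plt

        T = [0.42, 0.10, 0.55, 0.30, 0.48]      # estimated parameter from each study
        s = [0.12, 0.20, 0.15, 0.08, 0.25]      # its estimated standard deviation

        idx = range(1, len(T) + 1)
        plt.errorbar(idx, T, yerr=[2 * si for si in s], fmt='*', capsize=3)
        plt.xlabel('study index i')
        plt.ylabel('estimated parameter Ti')
        plt.show()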
    For any moderator variable (i.e., explanatory variable) Xi that measures some relevant
characteristic of the ith study, we should also construct the corresponding plot in which the
abscissa is Xi rather than the study index i.  For the ordinate plot the same vertical line as
before.  This plot shows the general relationship between the parameter being studied and the
moderator variable Xi.  Many such plots should be constructed and studied using whatever
variables are available for analysis.
    In their book,  Hedges and Olkin (1985) do
give  an example of a plot of  the type described
two paragraphs earlier.  However, it appears on
page 252 of the book!  It is  the first  plot of data

-------
that is shown and follows much discussion of sta-
tistical models and tests.   I suggest that  such
plots  should be constructed and used at the very
beginning of  any  meta-analysis,  before  getting
into   more complicated  modeling,  testing,  and
estimation.
    Models, Deterministic and/or Stochastic.  The canonical model in meta-analysis seems to be
the following.  From the ith study we have some parameter θi, its estimate Ti from sample size
ni, and a formula for the standard deviation of Ti, which is denoted si(θi).  Generally we
assume that Ti is approximately normally distributed.  From each study there is essentially one
data point, Ti.  The key difference between this situation and most other statistical problems
is that here we also have an estimated standard deviation, namely si(Ti), for this data point
that is calculated without using any information from other data points.  Typically in
statistical problems we must use the variability across data points in order to calculate a
standard deviation that applies to a particular data point.
    Here is a specific model illustrating these concepts, taken from Chapter 5 of Hedges and
Olkin (1985).  Suppose the ith study has experimental and control groups and response variable
Y.  Assume that in the experimental group the observation on the jth subject, Yij^E, is
distributed normally with mean μi^E and variance σi^2, and in the control group Yij^C is
distributed normally with mean μi^C and variance σi^2.  Define the parameter for the
meta-analysis to be the standardized mean difference θi = (μi^E - μi^C)/σi.
-------
    Statistical Inference.  Tests of whether the separate studies are estimating a common
parameter can be developed and are discussed by Prof. Hedges.
    The real problem, as  in the previous  step
concerning  model construction,  is what to con-
clude if the null hypothesis is rejected.  While the
alternative being considered might in fact hold, it
is also possible that one or more of the assumptions listed previously might not be valid.
The challenge for meta-analysis methodology, as
in other statistical applications, is to decide when
the framework of a specific mathematical statisti-
cal model is useful for analysis.  I would like to
see more discussion and work on this issue.
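
    One concrete version of such a test is the usual inverse-variance weighted homogeneity
statistic; the following sketch (assuming Python with SciPy, and using made-up numbers) refers
it to a chi-square distribution with K - 1 degrees of freedom:

        # A minimal sketch of a standard homogeneity test: the weighted sum of
        # squares Q, compared to a chi-square distribution with K - 1 d.f.
        from scipy.stats import chi2

        T = [0.42, 0.10, 0.55, 0.30, 0.48]      # study estimates (made up)
        s = [0.12, 0.20, 0.15, 0.08, 0.25]      # their estimated standard deviations

        w = [1.0 / si ** 2 for si in s]         # inverse-variance weights
        T_bar = sum(wi * Ti for wi, Ti in zip(w, T)) / sum(w)
        Q = sum(wi * (Ti - T_bar) ** 2 for wi, Ti in zip(w, T))
        p_value = chi2.sf(Q, df=len(T) - 1)
        print(T_bar, Q, p_value)   # a small p-value casts doubt on a common parameter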
    Presentation  of Results.  It is  obviously
important that the conclusions be  presented so
that   the  intended audience  understands   and
believes them. This is an important part  of any
statistical application.  I suggest that the presenta-
tion of results from a meta-analysis include plots
like those discussed earlier in the Data Analysis
section, but  supplemented with additional  infor-
mation from the Models and Statistical Inference
stages.  For example, if the between-study varia-
bility  can  be adequately modeled  by dividing the
studies into two groups with a common θ in each
group, this situation can be shown on the plot by
ordering  the  studies along  the  abscissa  so  that
those  in  the  same  group  are   adjacent  and
members of  the  group  are labeled,  say using
braces.  Similarly, we can supplement other plots
to show fitted relationships to moderator variables
or confidence intervals.
    The  advantage  of this  approach  is that  it
stays  close  to the data  and  is  not  likely to
overwhelm the audience with confusing technical
details.  Technical  discussion  does, of  course,
have  its place, but it is also important to present
the results so that the audience finds them plausi-
ble and  intuitively  reasonable,  given  the  data.
Appropriately  chosen  plots  are more  likely to
achieve this than are complicated tables and state-
ments about levels of statistical significance.
    In  summary,  in my  view the  problems of
meta-analysis  are  clearly important.   The  key
points of meta-analysis fit nicely within the stan-
dard stages of a statistical analysis.  But, as with
other  statistical   applications,  doing   a  meta-
analysis  well  requires  much care and thought at
each stage.

-------
   INTEGRATION OF EMPIRICAL RESEARCH:
  THE ROLE OF PROBABILISTIC ASSESSMENT

             Thomas B. Feagans
      Decisions in Complex Environments
 1.0 Introduction
    If risk assessments conducted to support
 important environmental decisions are to use all
 of the relevant information available, some means
 of combining or integrating empirical studies is
 needed.  Such means should be designed and
 discussed with a clear understanding  of the
 function being served by the integration. One
 such function is that served by a probabilistic
 assessment.  The primary purpose of this paper
 is to address probabilistic assessments and the
 function they serve within the decision-making
 process.
    Probabilistic assessments and their function
 will be clearly distinguished from two other types
 of integration which serve two other functions.
 Terms such as 'integration,' 'synthesize,' and 'combine' are ambiguous.  Meta-analyses and
 what will be called state of information assessments also integrate information, but in
 different ways for different purposes.  The
 functions  served by  these  two types  of
 integrations will be described briefly in sections
 2.0 and 3.0, respectively.  These descriptions
 serve  to distinguish the other two types  of
 integrations from probabilistic assessments, but
 are not comprehensive discussions of these two
 complex topics.
    Probabilistic assessments and their function
 are described in section 4.0. Being the focus of
 the paper, this  topic  is  discussed more
 thoroughly. The concept which underlies  various
 possible approaches to probabilistic assessment
 is the concept of probability.  In section 4.1
 probability is discussed both from the perspective
 of the producer and from the perspective of the
 user of probability assignments.
    In section 4.2 an approach to probabilistic
 assessment is  presented and advocated. The
 approach presented is advocated as a normative
 framework because it has greater generality than
 the alternatives.
    A distinction is made in section 4.3 between
 advocating that  an approach to  probabilistic
 assessment be   regarded  as  a normative
 framework  and prescribing that an approach be
applied in a particular circumstance.  Two less
general approaches that are special cases of the
general framework are mentioned as possibilities
for those circumstances where full generality is
not needed and/or feasible.
    There is a dual nature  to the  functions
provided by three types of integration discussed
below.  On the one  hand, each integrates
available knowledge; but on the other hand, each
deals with and represents uncertainty. Although
much has been written on uncertainty, the great
complexity of the topic has been underrated.  The
need for three types of integration  of knowledge
and concomitant means of dealing with and
representing uncertainty has  been overlooked.
As a result, the unique principles and mode of
thought that should guide probabilistic (risk)
assessments  do  not  seem to  have  been
thoroughly understood.  In section 5.0,  the
apparent reasons this situation has persisted are
analyzed, and  some  practical  reasons  for
improving the situation are identified.

2.0 Meta-Analysis
    One means of integrating knowledge under development within the discipline of statistics
is "meta-analysis."  The development of meta-analysis began in earnest when the size of the
research literature on various topics which needed to be integrated for the purposes of
education program evaluations, regulatory policy assessments, and other policy-related
analyses became so large that narrative integrations were deemed unsatisfactory.  "Although
scholars continued to integrate studies narratively, it was becoming clear that
chronologically arranged verbal descriptions of research failed to portray the accumulated
knowledge."
    There are by now some standard definitions.  The original analysis of data in research
studies is called primary analysis.  Typically, statistical methods are applied in such
analyses.  The reanalysis of such data for the purpose of answering new questions, or of the
original research question with better statistical techniques, is called secondary analysis.
    A  distinction is needed before addressing the
definition of meta-analysis.  Pooling the data
analyzed in a  set of primary  analyses  and
deriving new results from the pooled data needs
to be distinguished  from a statistical analysis of
the set of results of a  set of primary analyses.
Current definitions  tend to define meta-analysis

-------
   For many situations data will be sparse for
some aspects of the phenomena to be represented
no matter how the model is constructed.  For
example, many dose-response relationships of
interest extend beyond directly applicable data.
The importance of simulation modeling and the
generality of the probability theory are enhanced
in such situations.
   Model  construction  is  a joint  process
requiring cooperation between  those persons
expert at probabilistic  modeling  and those
persons with expertise concerning the substantive
phenomena being modeled. Lack of expertise of
both types can lead to avoidable mistakes. Those
untrained in probabilistic modeling can make
avoidable mistakes of one type and modelers
unfamiliar with the substantive phenomena being
represented can make another type.  Both types
of avoidable mistakes are possible despite the fact
that there is generally no right or correct model.

   4.2.2 Selection of Probability Assessors
   Probability judgments are made as inputs to
the probabilistic model for the most significant
uncertain factors. These judgments represent
both the knowledge  and uncertainty about that
factor. They are also produced by particular
individuals with particular histories, expertises,
and points of view.
   Probabilistic assessment is more subjective
than meta-analysis.  Two probability assessors
will generally  not  make exactly the same
judgments even based on the same evidence and
even if  they have  similar points  of view.
Furthermore, unless the available evidence is
strong in its implications for the judgments to be
made, equally well  informed individuals can
diverge significantly in the judgments they make.
    Since probability assignments can be so subjective, the choice of those individuals asked
to make the judgments is very important.  Thus, the subprocess of selecting who is to make the
probability judgments for significant uncertainties is very important.  For each uncertainty
this subprocess involves the steps of identifying a set of highly qualified candidates,
deciding how many assessors to have, and then selecting a "balanced" set.  In this context
balance would be
a matter of representing adequately the diverse
perspectives that exist within the set of highly
qualified candidates.  Lack of balance would be having all the judgments made by those who
share a similar perspective when diverse perspectives exist.
   Obviously, it is important that those with
diverse perspectives be identified in the initial
step of the subprocess.  There will not be perfect information about perspectives and their
implications for probability judgments a priori, but peers will be familiar with ways in which
perspectives diverge among their colleagues, so a
balanced set can be identified and chosen.  The
details of the participation of agencies, review
committees, etc. can vary, but in one way  or
another the involvement of peers in the selection
process is preferred.

   4.2.3 Elicitation of Probability Assignments
   A  probabilistic  assessment   derives
(probabilistic)  implications from all of the
available information and analysis relevant to the
connection between possible policy alternatives
and the consequences of concern. In making
probability assignments probability  assessors
integrate  diverse  studies,   background
information, and any other considerations they
deem relevant.  The flexibility needed to assure
that any considerations deemed relevant can  be
factored into the judgment goes hand in hand
with the fact that there are no hard and fast rules
for making probability assignments in general.
   An example of the kind of judgments health
experts may be asked to make is the probabilistic
representation  of the uncertain relationship
between doses of a (suspected) pollutant and the
resulting response (under specified conditions) in
a  specified  group of  people.   Probability
assignments  may involve two uncertainties:
uncertainty as to the fraction of people (if any) in
the population for whom a causal relationship
exists between exposure to realistic levels of the
pollutant and the occurrence of a given adverse
health effect; and, uncertainty as to the level that
would affect a  given fraction if  a  causal
relationship indeed does exist for that fraction.
In such cases the required set of probability
judgments may be decomposed into two sets  of
judgments  from which the required set may  be
derived mathematically:   a set of probability
judgments concerning existence of  a  causal
relationship below a specified upper bound level
for various fractions of the group; and a set  of
probability judgments concerning the level  at
which various fractions of the group would  be
affected if a causal relationship does indeed exist
for the fraction addressed below the specified
upper bound level.
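
   As an illustrative sketch of this decomposition for a single fraction of the group (the
probability values are arbitrary assumptions, not elicited judgments):

        # A minimal sketch of the two-part decomposition described above,
        # for one fraction f of the population.
        p_causal = 0.4               # judged probability that a causal relationship
                                     # exists for fraction f below the upper bound level
        p_level_given_causal = 0.7   # judged probability that fraction f is affected
                                     # at or below level L, given that relationship

        # probability that fraction f is affected at or below level L
        p_affected = p_causal * p_level_given_causal
        print(p_affected)            # 0.28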

-------
   How to integrate various studies in making
the probability assignments is a decision made by
the substantive expert.  Whether the probability
judgments  concerning   a  dose/response
relationship are decomposed into two sets is a
modeling decision to be made  by the modeler,
after consultation with the substantive experts.
But after all such modeling decisions are made it
is the substantive expert who  decides how to
incorporate both positive and negative studies,
and how to integrate studies of various types,
such as epidemiological, human clinical, and
animal toxicological studies.
   No matter what algorithmic analyses may
have been performed, each expert making such
judgments is the final arbiter for his or her own
judgments and for how to arrive at them. Each
expert uses and mentally integrates all information that he or she believes should have a
bearing on a given judgment.  Each expert
decides how much weight to give each piece of
information  and how to incorporate that
information.   No information is deliberately
thrown away.  No attempt is made to encourage
close agreement with other experts.
   Although those with substantive expertise
concerning the phenomena being analyzed are
generally  the  final  arbiters  of  probability
assignments, they can be aided  in various ways.
As with modeling, the  making of probability
assignments is best thought of  as a cooperative
process in which professional analysts elicit
probability assignments  from  substantive
experts.  The assignments are elicited in such a
way that they are coherent, cognitive biases are
minimized,  and  motivational   biases  are
discouraged.
    A formal process for  eliciting probability
assignments   has  been developed  by
psychologists  and decision  analysts.   This
process is called probability encoding.43  As well
as applying  various techniques for eliciting
coherent probability assignments that either are or
closely represent the assignor's judgments, the
process encourages assignments that are as free
of cognitive  and  motivational biases as is
feasible.  Research on such  biases has been
applied in developing the process.
    The type of coherence achieved is dependent
on the generality of the theory of probability
applied.  Under the deFinetti/Ramsey theory
sharp probability assignments are forced; under
the Koopman theory upper and lower probability
assignments are allowed.  Coherence under the
Koopman theory is simply a matter of the entire
set of probability assignments being logically
consistent.
   Allowing  upper and lower probability
assignments provides the flexibility sometimes
needed.    When  the  relevant information
supporting an assignment  is  sparse, sharp
assignments   may be  hard  to make  and
unnecessarily arbitrary.  The allowance of upper
and lower probability assignments assures that
the expert making the judgments truly discerns a
difference in making his or her comparisons, no
matter how weak the state of information. There
may be vagueness as to where this discernible
difference disappears, but this vagueness is not a
problem in practice.44  Experimental evidence
evaluated  by internal psychometric  criteria
suggests that experts perform well using  this
general theory.45

   4.2.4  Computation and Presentation of
Outputs
   Monte Carlo simulation is required except for
very simple  models such as  the simplest of
benchmark models.46  Statements exist in the
literature on applying the Monte Carlo method to
the effect that the  model calculation should be
iterated M times, where M varies some with the
statement.  The perspective being taken in such
statements is that at least M iterations are needed
to approximate well the distribution that would be
approached in the limit were the number of
iterations performed exceedingly  large.   But
computation can require significant resources, so
how many iterations should  be done requires a
judgment as to the marginal value of improving
the approximation versus the marginal value of
spending the required resources to improve the
assessment in some other way.
    In conventional Monte Carlo simulations each
input to the model on a given iteration has either
been a single assigned value or  a single value
arrived at by random selection from a probability
distribution.  Iterating the computation, with the
random selections  from all the input probability
distributions repeated on each iteration, has given
the desired probabilistic output.
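
    As an illustrative sketch of this conventional loop (the model and input distributions
below are arbitrary assumptions, not an EPA model):

        # A minimal sketch of the conventional Monte Carlo loop just described.
        import random

        def risk_model(slope, exposure):
            # hypothetical model: risk linear in exposure
            return slope * exposure

        M = 10000                                         # number of iterations
        outputs = []
        for _ in range(M):
            slope = random.lognormvariate(-10.0, 1.0)     # uncertain dose-response slope
            exposure = random.uniform(0.5, 2.0)           # uncertain exposure (ug/m3)
            outputs.append(risk_model(slope, exposure))

        outputs.sort()
        print(outputs[int(0.95 * M)])                     # e.g., the 95th percentile of the output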
    In  a probabilistic  assessment done in full
generality there are  two changes from  the
approach just described.   First,  in general the
input  probabilities  are  upper  and lower
probability  assignments rather than  sharp

-------
 assignments.  As a result, if at least one of the
 inputs is  in terms  of  upper  and  lower
 probabilities rather than sharp then the output is
 in terms of upper and lower probabilities rather
 than sharp.
       Second, since in general more than one
 individual is making probability assignments for
 each  significant  uncertain  factor  in  the
 probabilistic model, there is not a unique set of
 probabilistic representations of uncertainty on
 which to base the Monte Carlo simulation.  The
 multiple   representations  of  the  primary
 uncertainties   capture  the  phenomenon of
 secondary  uncertainty,  that  there is  no
 nonarbitrary best way of representing the primary uncertainties.  Ideally, the existence of
 secondary uncertainty is propagated to the output of the simulation.  An added dimension is
 used to enable an output to represent secondary uncertainty.47  Number the N uncertainties
 1, 2, ..., N; let rj be the number of representations for the jth uncertainty; then there are
 R = r1 x r2 x ... x rN combinations of the probabilistic inputs to the model.  Were we to
 iterate M times for each of the R combinations, then a distribution of R points could be
 plotted in the added dimension for each point of the corresponding output which did not
 represent secondary uncertainty.  When R gets large, a representative sample, say S, of the
 R combinations can be used.
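
    As an illustrative sketch of this extension (again with arbitrary, assumed inputs), the
run is repeated for every combination of representations, giving one point per combination in
the added dimension:

        # A minimal sketch: each uncertain input has several elicited representations,
        # and the Monte Carlo run is repeated for every combination of them.
        import itertools, random

        slope_reps = [lambda: random.lognormvariate(-10.0, 0.5),
                      lambda: random.lognormvariate(-9.5, 1.0)]
        exposure_reps = [lambda: random.uniform(0.5, 2.0),
                         lambda: random.triangular(0.5, 2.5, 1.0)]

        M = 5000
        for combo, (draw_slope, draw_exp) in enumerate(
                itertools.product(slope_reps, exposure_reps)):   # R = 2 x 2 combinations
            risks = sorted(draw_slope() * draw_exp() for _ in range(M))
            print(combo, risks[int(0.95 * M)])                   # one output per combination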
   In the resulting outputs rather than a single
 sharp probability  for  an event there  is  a
 probability (risk)  ribbon.   The graphical
 representation of a ribbon has width, height, and
 shape rather than being a single point.48  The
 width and shape  of the ribbon tend to reflect the
 state of information supporting the probabilistic
 measure for that event; the ribbon will tend to be
 straight and thin, and thus most like a single
 number, when the state of information for the
 event in question is strong and clear; the ribbon
 will  tend to lack  integrity in this sense when the
 state of information is  weak and amorphous.
 Thus, the more general output conveys useful
 information.

 4.3  Less General Approaches
   The  approach   to  probabilistic  (risk)
 assessment outlined in section 4.2 is the most
general approach but  not  the only possible
approach.   There  are  two other levels of
generality possible  corresponding to the two
 other levels  of generality for  probability
 discussed in section 4.1.  The  three levels of
 generality are nested in the sense that the least
 general is a special case of each of the other two
  and the middle level (Ramsey/deFinetti) is a
 special case of the most general.
    An important distinction concerns advocating
 an approach as a normative framework  and
 prescribing its implementation in particular
 circumstances.  There are various kinds of
 circumstances in which one of the  less general
 approaches can justifiably be prescribed despite
 the fact that from a normative point of view this
 introduces  bias  and  reduces  control.    For
 example, if those conducting the  assessment
 and/or those using the assessment are unprepared
 for the greater generality the potential benefits of
 that generality may not occur in practice.
    However, even if a less general approach is prescribed and implemented, it should be done
 for the right reasons and with an understanding of why and what has been lost.  Currently, less
 general approaches are sometimes prescribed and
 implemented for what appear  to be partially
 wrong reasons. In the next section the current
 situation in this regard is analyzed briefly.

 5.0 The Probabilistic Mode of Thought
    Probabilistic assessments are  based on a
 different set of principles and require a different
 mode of thought than other types of assessments.
 The process is not directed toward estimating a
 correct,  true,  or  actual probability or set of
 probabilities.  Rather, the process is directed
 toward generating probability assignments  that
 represent the  state  of information about the
 relationship of interest as well as  possible.
 Criteria for  how well this has been done in a
 given case are in terms of how well the process
 has been conducted.   How  well  the overall
 process has been conducted will be a matter of
 how well the subprocesses addressed in sections
 4.2.1-4.2.4 have been conducted.  Evaluation of
 particular performances of these subprocesses
 should be conditional on the resources it is
 reasonable to allocate for conducting them.
    When the wrong mode of thought is adopted in thinking about probabilistic assessments,
unwarranted conclusions about points of practical significance can be the result.  At present
this type of mistake is still prevalent.  The explanation
seems to be that the mode of thinking appropriate
for the scientific research on which probabilistic
assessments should be based is carried over to

-------
the probabilistic assessments.  Some aspects of
thinking about scientific inference should carry over but some should not.
   Perhaps the most frequent type of mistake is
the  procrustean  forcing  of  probability
assessments into molds apparently associated
with scientific objectivity. Encouraging  sharp
probability judgments and expert convergence
has as an objective single number probabilities
for specified events. Such outputs do not reflect
the states of information on which they are
based. They certainly are not more objective.
   The paradigm which assigns the  same
probability  (risk)  to individuals  who are
obviously at different risks not only suppresses
secondary uncertainty but primary uncertainty as
welL  This paradigm has the added problem that
in general important information is not integrated
into the formal probabilistic assessment process.
   The practical upshot of such  mistakes is
poorly informed decision making.  Even  when
public decision making is poorly informed for
understandable reasons, it tends to result in poor
public policy.

References
 1. Larry V. Hedges and Ingram Olkin (1985).  Statistical Methods for Meta-Analysis.  New
       York: Academic Press, Inc.
 2. Thomas B. Feagans and William F. Biller (1981a).  Risk assessment: Describing the
       protection provided by ambient air quality standards.  The Environmental Professional
       3(3/4): 235-247.
 3. Gene V. Glass, Barry McGaw, and Mary Lee Smith (1981).  Meta-Analysis in Social Research.
       Beverly Hills, CA: Sage Publications.
 4. Glass, et al. (1981).
 5. Glass, et al. (1981).
 6. R. J. Light and P. V. Smith (1971).  Accumulating evidence: Procedures for resolving
       contradictions among different research studies.  Harvard Educational Review 41:
       429-471.
 7. Glass, et al. (1981).
 8. Hedges and Olkin (1985).
 9. Glass, et al. (1981).
10. Feagans and Biller (1981a).
11. D. R. Cox and D. V. Hinkley (1974).  Theoretical Statistics.  London: Chapman and Hall Ltd.
12. L. L. Laudan (1973).  Induction and probability in the nineteenth century, in Logic,
       Methodology, and Philosophy of Science, IV; edited by P. Suppes, Leon Henkin, Athanase
       Joja, and Gr. C. Moisil.  New York: American Elsevier.
13. Augustus De Morgan.  Probabilities, in Lardner's Cabinet Cyclopaedia.
14. Stanley Jevons (1877).  The Principles of Science.  2nd edition.  London: Macmillan.
15. L. L. Laudan (1973).  Charles Sanders Peirce and the Trivialization of the Self-Corrective
       Thesis.  Bloomington, IN: Indiana University Press.
16. John M. Keynes (1921).  A Treatise on Probability.  London: Macmillan.  (Reprinted 1962.
       New York: Harper Torchbooks.)
17. Rudolf Carnap (1962).  Logical Foundations of Probability.  2nd edition.  The University
       of Chicago Press.
18. Rudolf Carnap (1955).  "Statistical and Inductive Probability," The Galois Institute of
       Mathematics and Art.  Reprinted in Baruch A. Brody, ed. (1970).  Readings in the
       Philosophy of Science.  Englewood Cliffs, NJ: Prentice-Hall, Inc.
19. Glenn Shafer (1976).  A Mathematical Theory of Evidence.  Princeton University Press.
20. David Krantz.  Presentation of unpublished manuscript; Thurstone Psychometric Laboratory,
       University of North Carolina; May 3, 1983.
21. L. Jonathan Cohen (1977).  The Probable and the Provable.  Oxford University Press.
22. Howard Raiffa (1968).  Decision Analysis.  Reading, MA: Addison-Wesley.
23. Ronald A. Howard and James E. Matheson, eds. (1983).  The Principles and Applications of
       Decision Analysis.  Menlo Park, CA: Strategic Decisions Group.
24. Feagans and Biller (1981a).
25. Rex V. Brown, Andrew S. Kahr, and Cameron Peterson (1974).  Decision Analysis for the
       Manager.  New York: Holt, Rinehart and Winston.
26. Emile Borel (1924).  Apropos of a Treatise on Probability.  Revue Philosophique.
       Reprinted in Henry E. Kyburg, Jr. and Howard E. Smokler, eds. (1964).  Studies in
       Subjective Probability.  New York: John Wiley & Sons, Inc.
27. Daniel Ellsberg (1961).  Risk, Ambiguity, and the Savage Axioms.  Quarterly Journal of
       Economics, vol. 75, 643-669.
28. Thomas B. Feagans (1986).  Resolution of the Ellsberg Paradox.  To be submitted to the
       Quarterly Journal of Economics.
29. Frank Ramsey (1926).  Truth and probability.  Reprinted in Kyburg and Smokler (1964).
30. Bruno de Finetti (1937).  Foresight: Its logical laws, its subjective sources.  Reprinted
       in Kyburg and Smokler (1964).
31. Kyburg and Smokler (1964).
32. Feagans and Biller (1981a).
33. B. O. Koopman (1940).  The bases of probability.  Bulletin of the American Mathematical
       Society, vol. 46, 763-774.
34. B. O. Koopman (1940).  The axioms and algebra of intuitive probability.  Annals of
       Mathematics, vol. 41, 269-292.
35. Paul Edwards, ed. (1967).  The Encyclopedia of Philosophy.  New York: Macmillan Publishing
       Co., Inc. and The Free Press.
36. Antony Flew (1979).  A Dictionary of Philosophy.  New York: St. Martin's Press.
37. T. B. Feagans and W. F. Biller (1980).  Fuzzy concepts in the analysis of public health
       risks.  Fuzzy Sets: Theory and Applications to Policy Analysis and Information Systems.
       Paul P. Wang and S. K. Chang (eds.).  New York: Plenum Press.
38. Thomas B. Feagans and William F. Biller (1981b).  A general method for assessing health
       risks associated with primary national ambient air quality standards.  Office of Air
       Quality Planning and Standards, U.S. EPA, Research Triangle Park, NC.
39. Thomas B. Feagans (1986).  Two types of exposure assessment.  Proceedings of APCA
       International Specialty Conference on Environmental Risk Management.  Pittsburgh: Air
       Pollution Control Association, 1986.
40. Feagans and Biller (1981b).
41. T. S. Wallsten and R. G. Whitfield (1986).  Assessing the risks to young children of three
       effects associated with elevated blood-lead levels.  Argonne National Laboratory report
       AA-32 submitted to U.S. EPA Office of Air Quality Planning and Standards.
42. Bruce C. Jordan, Harvey M. Richmond, and Thomas McCurdy (1983).  The use of scientific
       information in setting ambient air standards.  Environmental Health Perspectives 52:
       233-240.
43. Carl S. Spetzler and C. A. S. Stael von Holstein (1975).  Probability encoding in decision
       analysis.  Management Science, vol. 22, no. 3.
44. Thomas S. Wallsten, Barbara H. Forsyth, and David V. Budescu (1983).  Stability and
       coherence of health experts' upper and lower subjective probabilities about
       dose-response functions.  Organizational Behavior and Human Performance, vol. 31:
       277-302.
45. Wallsten, et al. (1983).
46. Feagans and Biller (1981a).
47. Feagans and Biller (1980).
48. Feagans and Biller (1981a).

-------
DISCUSSION
Harvey M. Richmond, U.S. Environmental  Protection
Agency

   My remarks will address briefly some of the
history and past reviews of the ideas put for-
ward by Thomas Feagans.  I will also briefly
describe the current status of risk assessment
efforts sponsored by the Office of Air Quality
Planning and Standards (OAQPS) involving de-
cision analytic approaches.
   As noted in Tom Feagans' paper, the approach
to probabilistic (risk) assessment described in
the paper was developed within the U.S. EPA's
OAQPS by Tom Feagans and William F. Biller.
The initial risk assessment work by Feagans
and Biller occurred during the review of the
ozone national ambient air quality standard
in 1977.  The ozone risk assessment employed
the traditional (Ramsey/deFinetti) decision
analytic approach to probability.  During the
ozone standard review, EPA requested formation
of a Science Advisory Board (SAB) Subcommittee
on Health Risk Assessment  to review the ozone
risk assessment.  The SAB  Subcommittee met in
April 1979 and raised a number of questions
about the initial decision analytic application to ozone.1  The Subcommittee encouraged OAQPS
to pursue development, but not application, of the Feagans/Biller
approach.  The Subcommittee also
recommended that OAQPS explore alternative
approaches to risk assessment  that
employed decision  analytic and Bayesian
techniques.
   During the period from  1979 to 1983, OAQPS
pursued development and review of several
alternative approaches to risk assessment to aid decision making on national ambient air
quality standards  (NAAQS).2   In parallel with
the  alternative  approaches projects, Feagans
and Biller further developed  the  approach
used  for the ozone risk assessment  to  the
more  general framework  involving  upper-
and  lower-probability  assignments dis-
cussed  in  the paper for this  conference.
    In  the Spring  of 1981,  OAQPS  asked  six
experts  in  a  variety of fields  including
statistics, decision  analysis, and
philosophy of  science  to  review  a report
by  Feagans  and Biller  describing  their
approach.3  The  reviews generally indicated
that  the  approach advocated by Feagans
and  Biller was promising  and  merited
further development,  although a  number of
the  reviewers  expressed reservations  about
the practicality  of  any near-term applica-
tion  of  the  approach  by EPA.   One of  the
six reviewers, Dr. Isaac Levi, Professor of Philosophy at Columbia University, stated the
following:
       Feagans  and Biller  are  exploring ways
    and means  of  avoiding  the  polarisation
    existing  in current theory between
    Bayesians  and anti-Bayesians.   They
    suggest specifying upper  and  lower
   probabilities  (i.e.,  intervals  of
   probabilities)  for  the purpose  of  probability
   and  risk  assessment.  This  old  idea,  going
   back  to Keynes  and  Koopman  and  advocated  by
   philosophers  and  statisticians  such as  I.J.
   Good,  H.  E. Kyburg  and C.A.B.   Smith over
   20 years  ago  has  not  been widely  appreciated
   either by statisticians  or  philosophers.   Yet,
   in my view, variants  on  such approaches hold
   the  most  promise  for  yielding  approaches  to
   risk  assessment which exhibit  the generality,
   flexibility,  neutrality, and lack of
   arbitrariness which standard "Bayesian" and
   "anti-Bayesian" approaches  lack.'
   The  concerns  of several  of  the reviewers  about
the possible near-term application of the
approach are captured  by the following comments
of another  reviewer, Dr. David Bell, a decision
analyst at  Harvard University:
      It is  well known that there are decreasing
   returns  to scale with the complexity of a
   model to the point where you can end up
   worse off than no model   at  all.  I think  the
   models here are overly  complex for the
   current state of applied art.  I agree with
   the approach but I  believe  it  is a little
   too much all  at once.   I would be  happier
   seeing more modest goals at this point.  If
   the report is only intended to be  a  look at
   the future or  as a research document  as
   opposed  to a draft of an EPA manual  then I'm
    content.  I don't believe it's realistic to expect a methodology such as this to be
    performed with much credence given to it, in the next 10 years.5
   In May 1981  the  SAB  Subcommittee  met  to
 review  the  report by Feagans  and  Biller and
 the six  reviews commissioned  by EPA.   In  a
  September 1981 report, the Subcommittee
 concluded that,  "While  the F/B approach may
 have commendable  aspects as a research effort,
 it is  not,  in  its present  form,  an  implementable
  tool for public policy decision making...."6
 The Subcommittee  also recommended that the
 authors  publish their works in peer-reviewed
 journals so that  others in the professional
 community could judge the  merits  of  their
 viewpoint.   The Subcommittee  suggested that all
 material relating to  upper- and  lower-
 probabilities  be considered basic research  and
 that OAQPS  should focus on standard decision
 analysis or Bayesian  methods, using single-
 valued probability  assignments,  for developing
  an implementable tool.
    OAQPS,  following the advice of the SAB
 Subcommittee,  has focused  on development of
 decision analytic approaches based on single-
 valued probability assignments since 1981.
 The OAQPS  risk  program moved from the develop-
 mental stage to a real  world  application with
 the initiation of the  lead NAAQS risk  assess-
 ment project in 1983.  The project, managed
 by Argonne National Laboratory under  an
 interagency agreement  with OAQPS, has been
 reviewed by EPA's Clean Air  Scientific
 Advisory Committee (CASAC) in May 1985  and

-------
 March 1986.  Probabilistic  dose-response
 relationships were  elicited from 10  nationally
 recognized experts  (4  experts  for one  end-
 point and 6 experts  for  the other) for two
 distinct health endpoints.   Probability
 encoding was  unnecessary  for a third
 endpoint for  which  a large  epidemiological
 data base existed and  a  Bayesian  statistical
 approach was  used to represent the uncertainty
 in the dose-response relationship.  The
 probability encoding and  dose-response
 aspects of the lead  risk  project  have
 received generally favorable reviews
 from CASAC members.  A final report
 describing the methods and  results from
 the lead risk  assessment  will  be  released
 shortly.7
    EPA  is pursuing similar  risk  assessment
 efforts  as  part  of its review  of  the ozone
 NAAQS.   Efforts  are  underway to  address
 both  health and  welfare effects  associated
 with  exposure  to ozone.8,9  Both  efforts
 employ  elicitation of expert judgment  to
 integrate the  results of different studies
 using standard decision analytic  approaches.
     While the concept of using upper and lower probabilities has proved to be too
  controversial for near-term use in the risk assessment work sponsored by OAQPS, many of the
  ideas, principles, and specific models developed by Feagans and Biller are being used in the
  current lead and ozone NAAQS risk assessment projects.  It is my
 hope  that this conference will  mark
 another step in the  constructive
review, discussion,  and under-
standing of the innovative
 ideas and concepts Tom Feagans and
William Biller have  put forth in
this  important area.
REFERENCES

1.  Science Advisory Board Subcommittee on Health
    Risk Assessment (1979).  Review of "A Method
    of Assessing the Health Risks Associated With
    Alternative Air Quality Standards for Ozone."
    Washington, D.C.:  U.S. Environmental  Pro-
    tection Agency.

2.  Thomas McCurdy and Harvey M. Richmond (1983).
    Description of the OAQPS Risk Program and
    the Ongoing Lead NAAQS Risk Assessment
    Project.  Proceedings of the 76th Annual
    Meeting of the Air Pollution Control
    Association.

3.  Thomas B. Feagans  and William F. Biller
    (1981).  A General Method for Assessing
    Health Risks Associated with Primary
    National Ambient Air Quality Standards.
    Research Triangle Park, NC: U.S.
    Environmental Protection Agency.

4.  Isaac Levi (1981).  Review in Six Reviews of "A General Method for Assessing the Health
    Risks Associated with Primary National Ambient Air Quality Standards".  Research Triangle
    Park, NC:
    U.S. Environmental Protection Agency.

5.  David Bell (1981).  Review in Six Reviews of "A General Method for Assessing the Health
    Risks Associated with Primary National Ambient Air Quality Standards".  Research Triangle
    Park, NC: U.S. Environmental Protection Agency.

6.  Science Advisory Board Subcommittee on Health Risk Assessment (1981).  Review of "A
    General Method for Assessing Health Risks Associated with Primary National Ambient Air
    Quality Standards".  Washington, D.C.: U.S. Environmental Protection Agency.

7.  Thomas S. Wallsten and Ronald G. Whitfield
    (1986).  Assessing the Risks to Young
    Children of Three Effects Associated
    with Elevated Blood-Lead Levels.
    Argonne, Illinois:  Argonne
    National Laboratory.

8.  S.R. Hayes, T. Wallsten, and R. Winkler
    (1986).  Design Document for a Study
    to Develop Health Risk Estimates for
    Alternative Ozone NAAQS.  San Rafael,
    CA: Systems Applications, Inc.

9.  Donald C. Peterson, Jr. (1986).
    Workplan to Develop Probabilistic
    Damage Functions Relating Ozone to
    Yield Reductions of Selected Forest Tree Species.  Boulder, CO: Energy and Resource
    Consultants, Inc.

-------
                                               DISCUSSION*
                          Anthony D.  Thrall,  Electric Power Research Institute
   First, I would like to thank Tom Feagans not
only for today's interesting presentation,  but
also for his efforts over the past years to alert
the rest of us to some fundamental problems.
These problems arise when we combine the findings
of various researchers in various disciplines in
an attempt to make informed decisions about man-
aging our environment.  I am also glad that we
have someone familiar with the workings of  the
EPA among our invited speakers because many
issues concerning the combining of studies  are
likely to be specific to the field of
application.
   For example, the environmental field seems to
differ in at least one important respect from the
field of application presented by our first
speaker, David Eddy.  That is, the range of con-
ditions and outcomes seems to be much more  re-
stricted in studies of a specific medical treat-
ment than is generally the case in environmental
studies.  Similarly, in the study of adverse drug
reactions described by David Lane, the observa-
tional condition of principal interest is the
taking of a specific drug, and the outcomes of
interest are whether an adverse reaction did or
did not occur.  In contrast, the review or
establishment of an environmental policy requires
us to formulate alternative policies and to con-
sider many different types of outcome.  On  the
other hand, the lessons learned in developing
meta-analysis for making decisions about educa-
tion, described by Larry Hedges, may be more
directly applicable to environmental studies,
since both education and the environment are
complex areas of public policy.
   I will not attempt to provide a technical
critique of Tom's presentation in the remaining
time.  Instead I will simply highlight what seem
to me to be Tom's main points, as an invitation
to others to join in the discussion.  But I would
first like to try to bring us back to the pres-
ent, everyday world in which we must decide envi-
ronmental issues, making what use we can of all
relevant information.

  PURPOSES OF SYNTHESIZING ENVIRONMENTAL STUDIES
   The EPA's mission is to promulgate environ-
mental regulations that are designed primarily to
protect public health, and secondarily to protect
the public welfare.  For the sake of discussion
let's focus on public health.  A suspected en-
vironmental problem goes through various stages
of scrutiny by the agency.  At several of these
stages, multiple studies must in some way be
synthesized, and therefore the methods of syn-
thesis that have been presented deserve consider-
ation by the agency.  Here is one characteriza-
tion of the various stages of scrutiny, as  I have
called them, with brief comments about how  the
methods of synthesis might be applied.

Identifying potential problems.  Oftentimes one
or more animal studies suggest that a compound,
present at some concentration in the environment,
may possibly have an adverse effect on human
health at some dose.  In these circumstances it
may be appropriate to conduct some form of test
of the statistical significance of the combined
studies, as discussed by Hedges.  However, rather
than cautiously championing some educational
innovation, we are cautiously evaluating whether
we can afford to dismiss this problem for lack of
evidence.  (This decision, of course, would not
be based on statistical significance alone.)  The
appropriate null hypothesis is therefore that for
at least one of the sets of experimental condi-
tions, there is a nonzero effect.  In Hedges'
notation, we would apply a Fisher or Tippett
test, say, to the set of (1-p) values.
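   (As a purely illustrative sketch, a combined test of this kind
might be carried out as follows; the p-values are hypothetical, and
the use of the (1-p) values simply follows the framing just
described.)

# Minimal sketch: Fisher- and Tippett-type combination of values from
# independent studies.  The p-values below are hypothetical.
import numpy as np
from scipy.stats import chi2

p_values = np.array([0.12, 0.04, 0.30, 0.08])   # hypothetical one-sided p-values
q_values = 1.0 - p_values                       # the (1 - p) values, as in the text

# Fisher: -2 * sum(log q) is chi-square with 2k degrees of freedom under the null.
statistic = -2.0 * np.sum(np.log(q_values))
fisher_p = chi2.sf(statistic, df=2 * len(q_values))

# Tippett: based on the smallest of the k values being combined.
tippett_p = 1.0 - (1.0 - q_values.min()) ** len(q_values)

print(f"Fisher combined p = {fisher_p:.3f}, Tippett p = {tippett_p:.3f}")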

Estimating the magnitude of a problem, i.e., the
toxicity of a compound.  This can be put in the
estimation framework discussed by Hedges.  In
education, the standardized test ("instrument")
seems to define what is being measured.  There
may be no special attachment to the scale of the
standardized test, so dividing the treatment-
control difference by some standard deviation, s,
to obtain an "effect size" presents no inconven-
ience.  Moreover, the division by s may be
necessary if we are trying to combine the results
of different standardized tests.  In contrast, if
we are to combine the treatment-control differ-
ences from several bioassays in a manner  that is
toxicologically meaningful, we may need to retain
the original units or convert to a common tox-
icological unit.  For example if the bioassays
are on different species, we may need to estimate
the human toxicity corresponding to  the results
of each study before combining studies.
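   (To make the contrast concrete, the following minimal sketch,
with entirely hypothetical numbers, shows the two routes side by
side: standardized effect sizes of the kind used in educational
meta-analysis, versus rescaling each bioassay to an assumed common
toxicological unit before combining.)

# Minimal sketch: two ways of putting bioassay results on a common footing.
# Means, standard deviations, and conversion factors are hypothetical.
import numpy as np

# (treatment mean, control mean, pooled SD) for three hypothetical bioassays
studies = [(4.2, 3.1, 1.5), (0.9, 0.6, 0.4), (12.0, 9.5, 3.0)]

# Route 1: standardized effect sizes (divide by s), as in education studies.
effect_sizes = [(t - c) / s for t, c, s in studies]

# Route 2: retain original units but rescale each study with a hypothetical
# species-to-human conversion factor, giving a common toxicological unit.
conversion = [0.10, 2.5, 0.03]
human_equivalent = [(t - c) * k for (t, c, _), k in zip(studies, conversion)]

print("standardized effect sizes:", np.round(effect_sizes, 2))
print("human-equivalent differences:", np.round(human_equivalent, 2))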

Estimating public exposure to a compound.  Again
we have a difficult estimation problem.   As
Harvey Richmond has pointed out in the case of
lead, we are interested in both the  total expo-
sure  and the exposure that is subject to  some
degree of control.  Total exposure might  be esti-
mated from environmental measurements, but  the
controllable portion of the exposure must typic-
ally  be estimated by computer simulations of  the
emission and dispersion of the compound.  Here,
we might consider using expert opinions about how
to combine the output from different simulation
models or different model runs.

Formulating alternative regulatory actions could
conceivably require us  to combine multiple
studies of, say,  the degree of disruption that
would result from various environmental con-
trols.   In practice,  the  formulation of alterna-
tives seems to be less of a synthesis of multiple
studies  than an iterative process involving  many
people  outside and  inside the agency, and at  many
levels  of  responsibility.  In any case,  it  should
be noted  that  the technical or managerial staff
who  must synthesize various studies  may be  the
same people who are defining  or helping  to  define
the  regulatory alternatives  to  be considered.

Estimating  the reduction  in  the  risk to public
health  for each alternative.  This  is an  ex-
tremely  important and difficult estimation  prob-
                                                    63

-------
lem that may involve, in David Eddy's termin-
ology, one or more chains each composed of one or
more links.  As discussed by Feagans, the studies
to be combined at this stage are not homogeneous
but rather bridge the causal span from regulatory
policy to benefits in public health.

Estimating the cost, in various forms, of each
regulatory alternative.  Again, this could be a
very complex estimation problem depending on the
desired degree of realism.  I do not know to what
extent current practice entails multiple studies,
nor am I familiar with the special problems of
combining such studies.  It would be illuminating
to hear more on this topic from the EPA's Office
of Policy Planning and Evaluation.

Making a decision.  The EPA Administrator reviews
and eventually approves the proposal or range of
proposals that have been developed within the
agency.  The Administrator, then, is the reader
for whom the agency's "decision package" has been
developed.  Undoubtedly over the course of sev-
eral proposals, the reader and the authors learn
how to make the decision package most useful,
that is, what level of summarization or detail
works best, the type of tables and graphics that
are helpful, and so on.  To the extent that suc-
cessive administrators agree on these matters, it
would seem that this is an opportunity to hone
this particular process of combining studies so
that the decision package fits usefully into the
broader context in which the decision must be
made.
   Once approved by the Administrator, the pro-
posal or range of proposals is published in the
Federal Register and comments are invited from
the public.  It sometimes happens that these
comments include studies sponsored by trade asso-
ciations or environmental organizations.  It can
also happen that the comments prompt the agency
to conduct further studies.  Thus the previous
integration of studies leading to the proposal
must be updated.  The Administrator then makes a
final decision, which may be reported in the
Federal Register and/or promulgated in the Code
of Federal Regulations.
   I wish to make two rather obvious but never-
theless important points about the role of com-
bining studies at the decision stage.  The first
point is that those of us who help to prepare
such "decision packages" should not forget that
the package or system is merely a technical aid
to the individual or group responsible for making
decisions.  (David Eddy's situation strikes me as
unusual in this regard, for David was both the
principal developer of the decision tools he
describes, and a member of an advisory board only
one step removed from the final decision.)
   The second, related point is that there is, I
believe, a natural tendency on the part of public
administrators (at all levels) to ascribe respon-
sibility for difficult highly technical decisions
to some disembodied decision-making process, if
it exists.  I think this should be resisted.
   Administrators deserve credit and sympathy for
having to make difficult decisions precisely
because their decisions must be based on human
judgment about a larger set of concerns than can
be "packaged" or "processed."  Moreover,  as pres-
sure is brought to bear for the process to yield
a decision that at the time appears desirable or
necessary (and this pressure may come more from
the middle managers who are familiar with the
daily evolution of technical information), the
technical synthesis becomes distorted and less
useful as a genuine source of information.
Difficult though it is, I believe we must recog-
nize and honor the distinct contributions of
technical synthesis on the one hand and indiv-
idual or group judgment on the other.  Both are
necessary for responsible decisions.

             HIGHLIGHTS FROM FEAGANS
   I would like to restate what I understand to
be Tom's key points, with occasional editorial
comments.  Again, my purpose is to invite discus-
sion by others.

Uncertainty and variability should not be
confused.  Uncertainty refers to our state of
knowledge, whereas variability refers to the
world around us.  We are uncertain about, say, a
dose-response function, i.e., the response to
different doses averaged, at least conceptually,
across a large population.  Even if this curve of
averages were known, however, it is reasonable to
expect that individuals would vary about the
average at each dose.  (If the response is dich-
otomous, i.e., an individual either does or does
not respond, then the average tells us the pro-
portion of the population that responds to the
given dose.)  Moreover, Feagans argues that the
calculus of probability, which is legitimately
applied to variability in the world, should be
replaced by a different calculus when it comes to
uncertainty.
   I agree that it is important to distinguish be-
tween uncertainty and variability, but I would
add that the distinction is sometimes subtle.  In
the example of the dose-response curve, we can
imagine that technical improvements allow a more
accurate determination of delivered dose and that
the individual variation about the new dose-
response curve is substantially reduced.  The
reduction in conditional variability (the var-
iability of the responses to a given dose) can be
interpreted as an increase in certainty, i.e., we
are more certain of the response of an individual
to the more accurately determined dose.  Thus the
technical advance increases our certainty by
explaining some part of the variability  (the part
due to poorly measured dose, in the example).  Of
course, the basic distinction between the state
of our knowledge and the variability of the world
remains intact, since the overall variability of
responses to the uncontrolled doses  (however
measured) "delivered" by the environment remains
unchanged.

The goal of combining studies is to reduce uncer-
tainty, not variability.  Variability might be
reduced by changing our management of the envi-
ronment, but this is another matter.

There are alternative ways to quantify
uncertainty.  Feagans proposes that we change
terms, from "uncertainty" to "degree of confirma-
                                                    64

-------
tion."  I also prefer the latter term because it
emphasizes the use of experimental or observa-
tional results as evidence for or against a
scientific proposition, and it is less likely to
be confused with some quantification of one's
feeling of uncertainty.  Feagans goes on to dis-
cuss three different schools of thought with
regard to quantifying degrees of confirmation:
classical statistics (standard errors, confidence
intervals, p-values, etc.), Bayesian statistics
(posterior distributions, or risk profiles to use
the term of David Eddy and co-workers), and
Koopman's probability intervals.
   With regard to eliciting the opinion of
experts on a matter subject to doubt, it seems to
me that we've moved away from "degree of confirm-
ation" back toward "uncertainty."  There are two
distinct questions concerning the elicitation and
quantification of expert opinion:  1) is the
logical foundation of the procedure sound, i.e.,
are the results meaningful, and 2) do decision-
makers find such results useful.
   As to the technical issues of what questions
to ask the experts, how to ask them, and how to
encode the responses, my only comment is on the
last issue.  That is, the simpler encodings would
seem to be more reliable, i.e., a single-valued
probability seems preferable to a probability
interval, which seems preferable to a prior dis-
tribution.  My general concern is that these
methods may give a false sense of specificity.
Whatever protocol is chosen, the method can only
organize and express knowledge; it cannot create
knowledge.  (Admittedly, this is a somewhat fuzzy
distinction.)  In cases where the evidence is
scant or contradictory the respective procedures
should yield wide confidence intervals, wide
probability intervals, or "vague" posterior dis-
tributions.  Even if the procedures do give such
readings, we can only hope that the decision-
maker is sufficiently sophisticated to discern
the simple message, "We don't know."

A coherent program of research and analysis is
needed.  In discussing the scientific work re-
quired to determine appropriate air quality
standards, Feagans identifies three major links
in the causal chain leading from environmental
standard to public health benefit:  the effect
that adherence to the proposed standard will have
on human exposure to the pollutant, the relation-
ship between exposure and effective dose, and the
consequences for public health as determined by
the dose-response relationship.  To ensure that
research funds are put to best use, Feagans makes
the very sensible suggestion that we start with
the last link and work backwards.  That is, we
should determine what reductions in dose would be
most beneficial so that the most beneficial re-
ductions in exposure and the most appropriate
environmental standards can be identified.  This
is tantamount to designing multiple studies, a
topic that seems as important as our current
topic, summarizing the evidence from multiple
studies.  Perhaps the design of multiple environ-
mental studies can be discussed in a  future
EPA/ASA Conference.
   IS QUANTITATIVE SYNTHESIS REALLY NECESSARY?
   In the interests of inviting discussion I
would like to close by asking whether the various
quantitative methods presented really improve our
understanding of environmental issues and thereby
guide environmental policy.  According to Feagans
(or my understanding of Feagans), the research
synthesis we are discussing is designed to reduce
uncertainty (or at least reduce confusion by
quantifying uncertainty).  But of course the
synthesis by itself cannot reduce variations in
environmental conditions or variations among
individual responses to a given set of condi-
tions.
   If the aim of research synthesis, then, is to
reduce uncertainty, how much effort of this kind
do the various aspects of an environmental issue
deserve?  It seems that on close examination any
issue is fraught with uncertainties.  These un-
certainties may not be important, however.
   Consider, for example, a different area of
public policy--deciding how much to spend on the
maintenance and improvement of roadways.  Here it
seems that reducing uncertainties is less impor-
tant than reaching a compromise between ade-
quately informed people who have different prior-
ities.  The costs of roadway maintenance must by
now be well established, as are the reasons for
roadway maintenance, e.g., safety, reduced wear
on vehicles, promotion of commerce, and the
pleasure of driving one's car on well-maintained
roads.  The last consideration, by the way, may
be the most decisive and the most difficult to
quantify.  But perhaps the quantification of this
factor through multiple surveys of drivers fol-
lowed by a synthesis of the multiple results is
unnecessary.  Are there analogous factors in
environmental policy?
   It seems to me to be worthwhile to  take stock
of how we are currently piecing  together  the
results of various environmental studies and the
manner and extent to which such syntheses guide
environmental policy.  How, for example, did we
decide to phase out leaded gasoline?   Case
studies of this kind would help  to clarify the
current and potential benefits of the  quantita-
tive methods under discussion.

1.  The opinions expressed by  the author do not
    necessarily reflect  the prevailing views of
    the Electric Power Research  Institute.
                                                    65

-------
                                               DISCUSSION
                           Miley W. Merkhofer, Applied Decision Analysis, Inc.
   In the course of his paper, Tom Feagans has
raised  and commented on  some  very  fundamental
issues.   Some  of  these  issues,  such  as   the
role  or function  of analysis   and  the  meaning
and use of probability, are highly complex  and
more than  a  little  controversial.
   Since there are aspects of Tom's paper with
which  I disagree,  it  is  only  fair  that   my
comments   be  prefaced  with   two   confessions.
First,  I  am a  decision  analyst.   Like most
decision   analysts,  when  confronted  with   a
problem  requiring   a   subjective   approach   to
probability     I      use     the      standard
Bernoulli  - Laplace - De  Morgan  definition   of
probability   that   Tom   criticizes   as   the
"level-two  generality."   It is only natural  to
expect that proposals requiring the relearning
of  basic   concepts   and  methods   of   analysis
would be  approached by decision  analysts with
a sense of skepticism.   Second, there  has been
a tremendous  amount written about  the  various
definitions of  probability.  As a  practitioner
rather  than  a  student of the  philosophy   of
science,   I  have   not  read   most   of  this
literature.   Thus,  you might  characterize  me
as  somewhat  biased  and  largely  uninformed
about many of  Tom's key arguments.   Those  of
you  who have  followed Tom's   paper  will,   of
course,  recognize   this  as   Tom's  "canonical
situation"!
   Having     appropriately     undermined     my
credibility   in   this  context,   I  will   now
proceed to my comments.  My discussion will  be
organized  into  three  segments.    First,  I'll
give you my  view  of the basic  problem  that  Tom
is  addressing.   Second,  I'll   indicate what  I
think are  some of the important  criteria  for
judging approaches  for integrating  information
for   environmental  decision   making.    Third,
I'll  indicate  some specific   aspects  of Tom's
paper with which  I agree  and  some with which I
disagree.
   As   I   see   it,   the  basic  problem  is   as
illustrated  in  Figure  1.   Health,  safety,  and
environmental decisions involving risk  must  be
based  on  scientific  knowledge,   but   the   gap
between   the   available    knowledge   and   the
information  that  would  make  decision  making
easier  is  great.

       [Figure 1.  The Problem of Health and Environmental
              Decision Making and the Potential Role
              of Risk and Probabilistic Assessment.
              The figure links scientific knowledge
              (primary analysis, meta-analysis,
              state of information) to decision making
              through risk and probabilistic assessment.]

   In  most  cases  what  is
offered  to  decision   makers   is   inconclusive
data,  unstructured  opinion,  and  debate.  Risk
assessment  (or,  more  generally,  probabilistic
assessment)  is  meant  to provide  an  efficient
link between scientific knowledge and decision
making—a  link  that  is  designed  to  lead   to
more  efficient  and  defensible  decisions.   It
is  in  this  sense that  probabilistic  assessment
may  be  regarded  as  a  means  for   integrating
empirical research.
   Figure 2   indicates   the   way    risk    and
probabilistic   assessment    work.     A    model
representing the  cause-effect  linkages  between
decision  alternatives  (such as  choices  among
regulatory  policies)   and   consequences   (like
numbers   of  deaths   and   various   types   of
morbidity)  is  constructed.   Because  of  lack of
knowledge,  some  of   the   parameters of  this
model  and  the  structure   of  the   model  are
uncertain.   This  uncertainty  is  described  and
quantified  using a  theory  of probability,  and
the  model   is   then  used   to   translate  the
parameter    and    model    probabilities    into
probability  distributions  over  risk  outcomes.
Finally,  these  distributions  are   summarized
using  various  statistics   (such   as expected
value) to provide quantitative  indices of  risk.
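   (A minimal sketch of this workflow, with a deliberately
artificial one-parameter risk model, is given below: uncertainty
about a model parameter is expressed as a probability distribution,
propagated through the model, and summarized by statistics such as
the expected value.  The model, distribution, and numbers are all
hypothetical.)

# Minimal sketch of probabilistic (risk) assessment:
# parameter uncertainty -> distribution of consequences -> summary statistics.
import numpy as np

rng = np.random.default_rng(0)

def risk_model(slope, exposure):
    """Hypothetical cause-effect link: expected deaths per year."""
    return slope * exposure

# Uncertainty about the dose-response slope (hypothetical lognormal) and a
# fixed population exposure level.
slope_samples = rng.lognormal(mean=np.log(0.002), sigma=0.5, size=10_000)
exposure = 5_000.0

outcomes = risk_model(slope_samples, exposure)

# Summaries of the risk distribution offered to the decision maker.
print("expected deaths per year:", round(outcomes.mean(), 2))
print("90% interval:", np.round(np.percentile(outcomes, [5, 95]), 2))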
   Figure 3   illustrates  an   important   point
about  the  nature  of  risk  and  probabilistic
assessment.    What   is   quantified   is    not
real-world  risk,  it  is  risk as represented by
a  model,  an abstraction  of reality.   It   is
clear  to  anyone who  has ever conducted a risk
assessment  that  the  process  involves  making
many      simplifying       assumptions      and
approximations.   What  is  sometimes  less  clear
is  that  the quantitative measures  produced by
a  risk assessment  must be  translated back into
the  real world.   The  measures  produced  by  a
risk   assessment  are  based  on  an  imperfect,
incomplete  approximation.   Making  a decision
requires  considering  the  risk  estimates  along
with   other   information   relevant   to    the
decision,   including  values   and   information
about  the  weaknesses  and   limitations  of  the
risk    assessment.     Thus,    a    risk     or
probabilistic  assessment can never be  a  means
for  making  decisions;   it   is  only   an  aid  to
decision making.
   Recognizing   that  risk   and  probabilistic
assessment  are aids  to  decision  making  makes
it    easier   to    identify   some   of    the
characteristics we  would like the methods used
in  such  assessments  to have.   Table 1  shows
some evaluation  criteria that would  seem  to be
important    when    comparing   methods.     The
criteria are categorized as either  internal or
external.   Internal criteria,  such  as  logical
soundness,   completeness,  and  accuracy,   lie
within  the  domain  of  analysis  and   relate to
the  quality  of  analysis.   External criteria
reflect  the  desires and  constraints  imposed by
users  of risk  assessment,   by  the  public,  and
by the limitations  of time  and  resources.
   Logical  soundness relates to  the  degree to
which  a method  can  be justified  in terms  of
                                                   66

-------
               [Figure 2.  Generating Risk Estimates with a Risk Model.
                Decision alternatives and parameter and model uncertainties
                enter a risk model, which produces quantitative measures of
                risk: probabilities and magnitudes of consequences, and
                summary statistics.]

               [Figure 3.  The Relationship Between Risk or Probabilistic
                Assessment and the Real World.  The real-world problem is
                abstracted and approximated by a model, and the model's
                implications must be interpreted back to the real-world
                decision.]
                     Table 1
        SOME CONSIDERATIONS FOR EVALUATING
        ASSESSMENTS AND ASSESSMENT METHODS

            Internal:   logical soundness
                        completeness
                        accuracy

            External:   acceptability
                        effectiveness
                        practicality
theory  and  whether   actual  applications  are
likely  to  violate   fundamental   assumptions.
Completeness   addresses   whether   the   method
accounts  for  all important  problem aspects and
whether,   due  to  difficulties  encountered  in
practice,  an  analyst  who  uses the  method  is
likely  to omit  certain considerations  because
they  are  difficult to  accommodate.   Accuracy
relates  to the  precision  and possible  biases
of  the   method   and  to  the   sensitivity  of
assumptions   that  have   not   or   cannot   be
tested.   Acceptability relates  largely  to  the
attitudes  of  and  perceptions  of  potential
users,  clients,   and  consumers  of  risk  and
probabilistic  assessment,  especially  decision
makers  and  the  public.    Effectiveness  deals
with  the  method's  ability  to  enable  risk  and
probabilistic   assessment  to  accomplish  its
intended    ends;    namely,    describing   and
quantifying the  level  of risk in a  way  that  is
useful   to   the   decision   making   process.
Practicality  reflects  the extent  to which  the
method  can  be  conducted  in  the   real-world,
problem-solving   environment  using  available
resources and  information.
   The  above  criteria  are  not   necessarily
complete.   Other  sets  may  be preferable.  The
above set is offered for two purposes.  One is
to justify  an opinion that  there is no "right"
or "wrong"  way to  define probabilities  and  to
perform probabilistic  assessments.   The reason
for   this   conclusion   is   that  it  would  be
unusual  for  any   single  method  to be  clearly
superior    according   to    every    evaluation
criterion.   Indeed,  strengths  in   some  areas,
such  as   logical  soundness  or  completeness,
typically  lead  to weaknesses  in others,  such
as   practicality.    The  second    reason   for
introducing  explicit evaluation criteria is  to
suggest   that   the  appropriate   way  to  judge
whether one method  is superior  to another  is
not   to  determine  whether  one  is  preferable
according  to  one  or   two  dimensions,   but  to
identify   all  of  the  dimensions   that  are
important,  estimate  the  performance  of each
method  along each dimension,  and  then consider
the   relative    importance   of   the   various
dimensions.
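   (Purely as an illustration of that last point, a comparison in
this spirit could be as crude as a weighted scoring of methods
across the criteria of Table 1; the scores and weights below are
hypothetical and carry no endorsement of any particular method.)

# Minimal sketch: weighted comparison of assessment methods across the
# criteria of Table 1.  Scores (0-10) and weights are hypothetical.
import numpy as np

criteria = ["logical soundness", "completeness", "accuracy",
            "acceptability", "effectiveness", "practicality"]
weights = np.array([0.25, 0.15, 0.15, 0.15, 0.15, 0.15])

scores = {
    "method_A": np.array([9, 7, 6, 5, 6, 4]),   # strong internally, less practical
    "method_B": np.array([6, 6, 6, 7, 7, 9]),   # weaker internally, very practical
}

for name, s in scores.items():
    print(name, "weighted score:", round(float(weights @ s), 2))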
   Although  we  have  used  different  words,   I
believe  that Tom shares most,  if  not  all,  of
the   views   expressed   above.    Tom's   paper
recognizes  the  value  of a  general  theory  of
probability    as    a   practical    means   for
                                                    67

-------
 integrating  empirical studies.  He places  this
 theory  within  a  decision  analysis   framework
 with  the  intent of  providing  an effective  aid
 for  decision  making.  Furthermore,  he  argues
 that  "less  general"  definitions of probability
 and methods  of analysis  should  be selected  if
 called  for  by the specifics of  the problem  at
 hand.   These  are  valid  and  important  points
 that  Tom contributes  in his  paper.
   In  order  to  explain  further the  logic  of
 approach selection  and to provide  a  basis  for
 discussing  some of  the  points  in  Tom's paper
 with  which  I  disagree,  it  is useful to apply
 some   of   the  explicit   evaluation    criteria
 introduced  above.   For example, the   criterion
 of  logical   soundness   might   be  applied   to
 evaluate  classical,  Bayesian,  and   intuitive
 probabilities,    the    theories   that    Tom
 respectively  refers  to  as level-one,  -two,  and
 -three generalities.
   The  classical   definition   of  probability
 would   seem    to   score    high  on    logical
 soundness.   The theory has  a  long history  and
 is  well  developed.   The   practice   of using
 relative frequency  is supported by the  law  of
 large   numbers   and  considerable    empirical
 evidence.   The principal  weaknesses  appear  to
 be  defining   and  justifying  the  conceivable
 outcomes  to  an  uncertainty  as  equally  likely
 and treating  a small number of  data  points  as
 if  they  were  a  large  number  of   identical
 trials.    All   in   all,   though,    classical
 probability   is   clearly   a    well-developed,
 internally consistent theory.
   What  about  Bayesian   probabilities?    Tom
 criticizes  the Bayesian  approach  because   the
 standard   method   of  elicitation   does   not
 necessarily   reveal   personal  probabilities.
 His argument  is that the  state  of information
 on which  probabilities  are  based  matters   and
that  the information underlying the  reference
lottery   used    for    eliciting    subjective
 probabilities  is  different  than that  for   the
 uncertain event.   In  particular, the  reference
 lottery is  a "known" probability,  whereas   the
 uncertainty   is   an  "unknown"  probability.
 According  to Tom, when  the  subject  says  that
 he   is   indifferent  between  betting,   for
 example, on  a probability wheel  and betting  on
 the  uncertain  event, he is  not  necessarily
 equating probabilities.   Like the  subjects  in
 Ellsberg's  paradox,   he  may  simply  prefer  to
 bet on known  probabilities.
   It  is,  of course, true  that the  state  of
 information      matters      with      Bayesian
probabilities.   This  is   the  essence   of   the
 subjective  approach.   It is  also   true  that
 elicitation   techniques   may  fail  to   elicit
 probabilities   that   are  consistent   with  a
 person's underlying  information  and  beliefs,
 due to cognitive biases  for  example.   However,
 this  is  a  practical  difficulty  rather  than a
 logical flaw.   The reasons  for  this  assertion
follow.
   For  probability  encoding,  analysts   use a
variety of  different elicitation  techniques.
They   deliberately   switch  among   frames    of
 reference  for the  purpose of  identifying   and
 alerting the  subject  to  any  inconsistencies  in
 reasoning.    Betting   is  only one  analogy  that
 is used.   The  analyst  will  also  ask whether
the  reference  and uncertain  events  are judged
equally probable.
   Furthermore,  one  of the  most  commonly used
probability  encoding techniques,  the  interval
technique,  does  not suffer  from  the problem
that   Tom   mentions.    With   the    interval
technique,  the  analyst  divides  the  range  of
uncertainty  into  regions  that  are  judged  by
the   subject  to   be  equally   likely.    For
example,  if   the  subject  thinks   it  equally
likely  that the uncertain  variable  will  lie
above  or  below  a given  value,  that  value  is
assumed to  be  the median  of the distribution.
Since  all  comparisons with  this  technique are
based  on  the  uncertain  event, the  subject  is
not  required  to  compare  known  and  unknown
probabilities.
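   (For concreteness, here is a minimal sketch of the interval
technique as just described: elicited values that split the range
into equally likely regions are used to build an approximate
cumulative distribution.  The fractiles below are hypothetical.)

# Minimal sketch of the interval (equal-likelihood) encoding technique.
import numpy as np

# Hypothetical elicited fractiles: 5th, 25th, 50th (median), 75th, 95th.
fractile_probs  = np.array([0.05, 0.25, 0.50, 0.75, 0.95])
fractile_values = np.array([ 2.0,  4.5,  6.0,  8.0, 13.0])

def encoded_cdf(x):
    """Piecewise-linear CDF interpolated through the elicited fractiles."""
    return np.interp(x, fractile_values, fractile_probs)

print("P(X <= 7) is about", round(float(encoded_cdf(7.0)), 2))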
   Since subjective  probabilities  are based on
a well-developed, internally consistent theory,
Bayesian  probabilities   would   also  seem  to
score  high  on  logical   soundness.    The  fact
that  the  basic  axioms of probability  calculus
apply  means that analyses based  on  subjective
probabilities  are  similarly  well   founded  in
theory.
   What      about     Koopman's       intuitive
probabilities?   Unfortunately, I  have not  as
yet   had    an    opportunity   to   explore   the
foundations      of     Tom's      "level-three
generality."    Although   Tom   offers  several
references,  Koopman's  theory  is  not  widely
known, as evidenced  by the  fact  that  it is not
mentioned  in  the  dozen  or  so  reference   texts
that  I keep on my  bookshelf.   The  fact  th
-------
        [Figure 4.  Cumulative Judgmental Probability Distributions for Cohassett Basalt Average Flow Top
                  Effective Porosity at Macroscale, Obtained Independently from Experts.  Letters A through E
                  represent panelists; the horizontal axis is effective porosity.]
techniques.    To  support  these   assessments,
there was,  in  effect,  only one data point  from
directly  applicable tests,  and that  test was
of   questionable   accuracy.     Therefore,   it
should not  be  surprising that the experts  felt
tremendous  uncertainty  (up   to  six  orders  of
magnitude)  and  disagreed  with  one  another.
Despite  the  lack   of  data,  the experts   were
quite  adamant  about  the  precise  location  of
their curves.  In  every  case, the  final twenty
to thirty minutes  of each probability encoding
exercise was spent  exploring  whether the  curve
should be moved  two to three  percentage points
in one direction or the  other, judgments  that
were  in   each  case  made  definitively  by the
expert.  The precision of these estimates  was,
evidently,   based    on   convictions   born   of
personal experience  with  tests on similar  rock
and differing  theories about  how  processes  of
formation  affect  the  parameter in  question.
Any uncertainty  bands  existing in  the minds  of
the   subjects   were   clearly   insignificant
relative  to the differences  of opinion.   In
this  instance  at   least,   the  added  work  of
assessing   uncertainty  bands   would  not   have
provided  much  additional   insight  to  decision
makers.   In  addition,  permitting indecision  in
the  encoding process  might be  less  effective
at  forcing the  sort  of hard  thinking that  is
so  important  to the  formulation of  scientific
judgments.
   In  conclusion,  although   I  disagree   with
some of  the  specific  points  of Tom's  paper,  I
endorse  his  central  theme   that   probability
applied  within  a  decision  analysis  framework
can  be   a  powerful  and   practical  way   of
integrating     empirical     research.      The
exploration  of  alternative  theories  for  the
foundation    of    risk    and   probabilistic
assessment  is  important,   for  advancements  in
this  area  are  most   likely  to  produce   major
improvements  in  our  ability to  analyze complex
problems.
   Risk  assessment  is  an   art  as  well  as  a
science.   The   real   challenge  is   to select
methods  that  illuminate  and  provide  insights
without  misleading.   Research that extends  the
useful  options  or provides   insights  for  the
choice of methods  is clearly of  high  value.
                                                   69

-------
STATISTICAL ISSUES IN COMBINING ECOLOGICAL AND ENVIRONMENTAL STUDIES
      WITH EXAMPLES IN MARINE FISHERIES RESEARCH AND MANAGEMENT
                  G. P. Patil, G. J. Babu, M. T. Boswell,
                 K. Chatterjee, B. Linder, and C. Taillie

                     The Pennsylvania State University
                 1.   INTRODUCTION

   When a substantive problem needs a solution,
the  information  needed is  invariably not available
as desired.  Encountered or historical data may
have to be used (Hennemuth, Patil, and Ross, 1986;
Hennemuth, Patil, and Taillie, 1985; Patil, 1984;
Patil, Rao, and Zelen, 1986).  Often, an ad hoc
decision is made by  the manager based on
incomplete or inadequate data involving similar
situations, perhaps  augmented by various experts
opinions.  Ecological studies in this manner  have
been done on a continual but informal basis.  This
points up the importance of developing systematic
methods to combine studies.  Three approaches that
have been generally  used are:  (1) combining
different data sets  to obtain a long-enough time
series or a large-enough data set to perform  the
desired analysis; (2) combining the results of
different studies; and (3) combining expert
opinions.

            1.1  Combining Data Sets

   Usually, pooling  occurs for the same type  of
data taken under different conditions, including
different locations, and different seasons or
different years.  Alternatively, entirely
different types  of data may be combined.  It
becomes necessary to assume that some underlying
common features  exist among various data sets; in
order to extract these features, it is necessary
to transform the data sets to make them
comparable.
   Section 2 is an example where individually
small recruitment data sets (giving the number of
fish of "catchable" size entering a fish stock)
for different species of fish and different stocks
from various oceans are combined to give a data
set  large enough to estimate a "universal"
recruitment distribution.  This may then be used
to estimate a recruitment distribution for an
individual fish stock.
   Another common situation arises when a change
in the instrumentation or  in the data collection
protocol occurs  during the course of an
investigation.   If there is only one such change,
then the two data sets need to be combined into
one.  Here the purpose of combining data is to
obtain a consistent  data set for use in testing
hypotheses, investigating trends, etc.  Section 3
is an example that involves a change from one ship
to another in a marine fisheries research trawl
survey.  A paired experiment is carried out to
compare the fishing power of the two ships.   The
results of the experiment are used to calibrate
the two data sets.

             1.2  Combining Results

   Extrapolation from one situation to another
is usually done by assuming a super model that
combines the results of the two different
situations.
   Section 7 describes a method for assessing
the risk of a toxicant to a species of fish by
utilizing results from laboratory tests.
Estimates of toxic concentrations obtained from
various bio-assay tests are combined to form new
data sets.  A pattern is established for each of
the data sets by curve fitting.   The curves are
used in turn for extrapolating the long-term toxic
effect as a function of the short-term effect on a
species for which the results on short-term
effects are available.
   Sections 4, 5 and 6 discuss an environmental
index effort for coastal and estuarine
degradation.   The overall approach combines the
results of a statistical analysis on data from a
control region with the data from a test region to
produce a single number indicating in some manner
the health of the environment.  Section 5 uses the
reproductive success of osprey as an example.  The
reproductive success varied in time with DDT
pollution.  A period of time with little pollution
effect is used as a control instead of a control
region.
   Section 6 is another example in which
dissolved oxygen measurements are used.  The
formulation is different since the measurements
are combined with the results of three statistical
studies on laboratory experiments.  The laboratory
studies produce three different dose-response
curves for three different species and for three
different responses.  The dose is exposure to low
dissolved oxygen episodes.  The three responses
are mortality, reduced growth and avoidance.  The
results of these studies are combined with
dissolved oxygen measurements to produce a single
number for each low dissolved oxygen episode.
   Section 8 describes the initial stages of an
analysis whose goal is to partition the causes of
early-life-mortality of fish among various
climatic and pollutant variables.  The data set
(Summers et al. 1984) is a historical data set
which combines several ecological studies.  The
fish data set is a stock index data set for each
species.  This index is the catch per unit effort
which was estimated from many sources.  A model
relating fishing in the Potomac river and fishing
in the Chesapeake Bay was constructed and the
results were combined with landings data to give
the stock index.  The environmental data consisted
of river flow and temperature.  Pollution data
include gross indicators of pollution such as
population size, employment levels, sewage
discharged, dredging and some loading variables
such as dissolved oxygen and nutrients.  All of
these data sets were combined to give a
multivariate time-series data set for analysis and
interpretation.
   The purpose of the analysis and interpretation
of this combined data set was to be able to
evaluate different statistical techniques.
Furthermore, different techniques use different
modeling assumptions and, therefore, may reveal
different pollution effects on different species.
It should be worthwhile to be able to combine the
results of these analyses by the methods of
combining expert opinions to provide a more
accurate picture of pollution effects.

             1.3  Combining Expert Opinions

   Included in this approach is the combining of
different models where specific p-values are
available for each model.  These correspond to
probabilities of various propositions as related
by experts.  There may be two or more propositions
of which exactly one is correct.  Each model or
expert gives probabilities of these propositions.
There are two cases.  Either the probabilities add
to one, or the probabilities add to a value less
than one.  The latter case allows the possibility
of an opinion to be held in 'reserve.'
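   A minimal sketch of one way such opinions might be pooled is
given below; the equal weights, the propositions, and the treatment
of unassigned probability as an explicit reserve are illustrative
assumptions, not the approach developed in Section 9.

# Minimal sketch: pooling experts' probabilities over mutually exclusive
# propositions, allowing probabilities that sum to less than one (reserve).
import numpy as np

propositions = ["A", "B", "C"]

# Hypothetical probabilities from each expert (one row per expert); any
# shortfall from 1 is opinion held in reserve.
expert_probs = np.array([
    [0.50, 0.30, 0.20],   # expert 1: fully allocated
    [0.40, 0.20, 0.20],   # expert 2: 0.20 in reserve
    [0.60, 0.10, 0.10],   # expert 3: 0.20 in reserve
])

weights = np.full(len(expert_probs), 1.0 / len(expert_probs))   # equal weights

pooled = weights @ expert_probs     # simple linear opinion pool
reserve = 1.0 - pooled.sum()        # pooled probability left unassigned

for name, p in zip(propositions, pooled):
    print(f"P({name}) = {p:.2f}")
print(f"reserve = {reserve:.2f}")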
   Section 9 gives a brief review of some of the
problems and possible approaches for the
combination of expert opinions.  Sections 4, 5, and
6 are also relevant in this connection.  To begin
with, opinions are solicited on potentially
informative variables.  Even after this is
accomplished, there remains a considerable need to
identify and utilize expert opinions on issues
such as what data sets are suitable, how to
organize the data, what to use as control and as
test regions, and how to combine and summarize the
data into indices that are comparable.  "The
choice of variables is not easy and usually
involves extensive exploratory data analysis"
(O'Connor and Dewling 1986).
                                   70

-------
   The candidate data sets result from many
different studies on different aspects of the
ecosystems.  These  are expected  to yield
information regarding the health of the ecosystem.
Examples include  benthic species composition and
abundance, fish and shellfish diseases, fecundity
in fish and shellfish, mortality in eggs and
larvae of fish and  shellfish in  the field and
reproductive success in marine birds.  Also
included are measures of pollution, such as
toxicants in marine foods,  pollutants in the
sediments and dissolved oxygen in  the water
column.  When the available indices are considered
over a region and through time,  a  picture of the
ecological health of the region  begins to emerge.
At this stage a meta analysis of the separate
index values over time and/or space should be  of
some help to the  managers in their task of
managing natural  resources.

      2.  RECRUITMENT DATA AND KERNEL APPROACH

                  2.1  Background

   For several years the Northeast Fisheries
Center (NEFC) has been assembling recruitment
series for a large  number of oceanic fish stocks.
Recruitment is defined by the number of fish of
'catchable' size entering a fish stock.
Estimation of recruitment distributions is
important for the assessment and prediction of
long term frequencies of good and poor year
classes.  In this connection,  several parametric
distributional models have been  fitted to each of
the available recruitment data sets (Hennemuth,
Palmer,  and Brown,  1980;  Patil and Taillie,  1981).
The small sample  sizes prevented reliable
assessment of goodness-of-fit.    It also proved
difficult to effectively discriminate between
competing models,  e.g., between  the gamma and the
lognormal distribution.
   In view of the preceding, Richard Hennemuth
of NEFC suggested that the recruitment data for
the various stocks be combined into a single large
data set and analyzed with the two-fold purpose:
   (i)  better assess the fitting performance of
the different methods and models,
  (ii)  arrive at a fairly precise estimate for
a "universal" recruitment distribution.

          2.2  Combining the Data Sets

  Recruitment series for 18 stocks were selected
for analysis.  The data and histograms for the
individual stocks appear in Table  2.1.   Sample
sizes range from  10 for North Sea mackerel to 43
for Georges Bank haddock.   On the whole,  the data
exhibit strong positive skewness with the
occasional occurrence of large positive values
corresponding to  the appearance  of a strong year
class.
   When combining data, the various data sets
must have some common features (or there would be
no reason to combine)  as well as some differences
(or the matter would be trivial).  The trick is to
model the common  features and to suitably adjust
the data for the  differences before combining.
The large combined data set is then used to draw
reliable inferences concerning the common
features.
   In our case, it is hypothesized that the pth
recruitment data set can be described as a random
sample from a scale-parameter family of
distributions

          F(x; θ_p) = F(x/θ_p).                    (2.1)

Here the scale parameter  θ_p  is allowed to vary
from stock to stock.  The functional form of the
cdf  F  is assumed to be the same for all stocks
and therefore represents a "universal" recruitment
distribution.  The pth data set is adjusted by
dividing through by a suitable scale statistic.
The arithmetic mean (divided by 5) was used in the
present analysis but it may be worthwhile to
mention some other possibilities:
                                        a)  As pointed out previously,  large positive
                                   values are sometimes encountered.   For a given
                                   stock, it may be entirely a matter  of chance
                                   whether such a value occurs in  the  available  data.
                                   The arithmetic mean is sensitive to the presence
of large values.  Thus, using the arithmetic mean
to descale introduces considerable extraneous
                                   variability into the combined data  set.  A  scale
                                   statistic such as the geometric mean may be
                                   preferred for this reason.
                                        b)  The assumption that the different  stocks
                                   differ only in the scaling is only  an
                                   approximation.  One might attempt to develop  data
                                   transformations that would adjust for differences
                                   in distributional shape as well.  For example,  the
                                   z-score of the logged data adjusts  for scaling  and
                                   also for certain types of shape parameters  (e.g.,
Weibull, lognormal).
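   The following sketch, using simulated recruitment-like data,
illustrates the descaling step just described: each stock's series
is divided by a scale statistic (the arithmetic mean, or the
geometric mean suggested in (a)) before the descaled values are
pooled.  It is only an illustration of the adjustment implied by
(2.1), not the analysis reported here.

# Minimal sketch of descaling stock-specific recruitment series before
# pooling, per the scale-parameter model (2.1).  Data are simulated.
import numpy as np

rng = np.random.default_rng(1)

# Simulated recruitment series for three hypothetical stocks on very
# different scales (lognormal, strongly right-skewed as in Table 2.1).
stocks = {
    "stock_1": rng.lognormal(mean=6.0, sigma=0.8, size=20),
    "stock_2": rng.lognormal(mean=3.0, sigma=0.8, size=15),
    "stock_3": rng.lognormal(mean=9.0, sigma=0.8, size=30),
}

def descale(x, statistic="geometric"):
    """Divide a series by a scale statistic so stocks become comparable."""
    if statistic == "geometric":
        scale = np.exp(np.mean(np.log(x)))   # less sensitive to huge year classes
    else:
        scale = np.mean(x)                   # arithmetic mean, as in the text
    return x / scale

combined = np.concatenate([descale(x) for x in stocks.values()])
print("combined sample size:", combined.size)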
     2.3  Estimating the Universal Recruitment Distribution

   Having combined the descaled recruitment
values, the next step is estimation of the common
cdf  F.  Here a nonparametric approach has been
adopted.  In passing, it may be noted that the
problem would be trivial if we were prepared to
assume a parametric form for  F.  In fact, if  F(·)
= G(·, φ)  where  G  is a known distribution and


-------
stock can be used to improve the estimate.  Here
the James-Stein (1961) paradigm may offer some
guidance.  Envision the separate (descaled)
recruitment distributions as forming a cloud of
points in the space of all probability
distributions.  The universal curve estimates the
center of this cloud.  Use the available data to
obtain, perhaps by the kernel method, a low
quality estimate  F̂  for a particular
distribution.  For the final estimate, use a
convex linear combination of the imprecise
estimate  F̂  and the precise but inaccurate
universal estimate.
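   A minimal sketch of such a convex combination, using empirical
cdfs and an arbitrary shrinkage weight, is given below; the data
and the weight are hypothetical, and choosing the weight is exactly
where a James-Stein type argument would enter.

# Minimal sketch: shrink one stock's (imprecise) cdf estimate toward the
# (precise but possibly biased) universal estimate.  Data and the weight
# are hypothetical.
import numpy as np

def empirical_cdf(sample):
    """Return a function that evaluates the empirical cdf of `sample`."""
    x = np.sort(sample)
    return lambda t: np.searchsorted(x, t, side="right") / x.size

rng = np.random.default_rng(2)
universal_sample = rng.lognormal(0.0, 0.7, size=500)   # pooled, descaled data
stock_sample     = rng.lognormal(0.1, 0.7, size=15)    # one stock, small n

F_universal = empirical_cdf(universal_sample)
F_stock     = empirical_cdf(stock_sample)

w = 0.3   # hypothetical weight on the stock-specific estimate
F_final = lambda t: w * F_stock(t) + (1.0 - w) * F_universal(t)

print("F_final(1.0) =", round(F_final(1.0), 3))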
   It may be of interest to close this section
with an interpretation of the kernel estimator.
The recruitment process is governed by many
factors, both environmental and biological.
Currently there is little understanding of what
these factors are, how they operate quantitatively
and how they interact.  The kernel method attempts
to account for the annual variability in
recruitment without developing a detailed
explanatory model.  Consider the multidimensional
space of all relevant factors and let this space
be partitioned into  N  subsets, one for each
available recruitment value; the subsets occur
with the same long term relative frequency of 1/N.
Conditional upon a particular partitioning set,
there is still residual environmental variability
within that set and a corresponding variability in
recruitment.  It is this variability that is
represented by the lognormal kernels.  Each kernel
is centered at the corresponding observation, in
effect treating each observation as typical for
its partition set.
   Also note that the bandwidth expresses the
within-partition-set variability.  In particular,
the bandwidth must decrease toward zero as the
partition becomes finer (i.e. as the number of
observations increases).  It follows that the
bandwidth cannot be treated as a universal
constant:  the bandwidth appropriate to a
particular stock is larger than the bandwidth
obtained for the combined data.
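   To make the kernel interpretation concrete, the sketch below
estimates a recruitment density from descaled values using
lognormal kernels, implemented as a Gaussian kernel density
estimate of the logged values; the simulated data and the default
bandwidth rule are illustrative only.

# Minimal sketch: lognormal-kernel density estimate for descaled recruitment,
# computed as a Gaussian KDE on the log scale.  Data are simulated.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
descaled = rng.lognormal(mean=0.0, sigma=0.8, size=200)   # pooled, descaled data

log_kde = gaussian_kde(np.log(descaled))   # bandwidth set by Scott's rule here

def density(x):
    """Density on the original scale implied by lognormal kernels."""
    x = np.asarray(x, dtype=float)
    return log_kde(np.log(x)) / x          # Jacobian of the log transform

grid = np.linspace(0.1, 5.0, 5)
print(np.round(density(grid), 3))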
                            TABLE 2.1
                        RECRUITMENT DATA

   [Table 2.1 gives, for each stock, the recruitment series,
summary statistics (N, mean, SD, CV, skewness, kurtosis, log-mean,
log-SD), and a histogram.  Stocks shown include Georges Bank cod,
Northeast Arctic cod, Georges Bank haddock, North Sea haddock,
Northeast Arctic haddock, Georges Bank herring, North Sea herring,
Norwegian spring spawning herring, Georges Bank mackerel, North Sea
mackerel, North Sea saithe, North Sea whiting, South African
pilchard, anchovy, round herring, Georges Bank silver hake, and
Peruvian anchovy.  A figure follows whose legend identifies the
lognormal, gamma, and kernel curves fitted to the data.]

-------
                3.  ESTIMATION OF
   RELATIVE FISHING POWER OF DIFFERENT VESSELS

                3.1  Introduction

   This section discusses standardization
and pooling of data from different parts of a
sample survey.  Trawl surveys carried out by NEFC
at Woods Hole are used to monitor the year to year
changes in the abundance of several marine fish
stocks.  The principal objective of these surveys
is to provide data necessary to assess the
production potential of traditional and
underutilized species (see Byrne and Fogarty
1985).  These surveys may also be of use in
assessing the long-term effects of pollution,
where a time series of data is necessary to
determine trends.
   A critical aspect of any long-term survey
program is the standardization of survey units.
Inherent differences in vessels, nets, etc., which
change from time to time, may introduce bias due
to differences in the resulting fishing power.
Two ships, the Albatross IV and the Delaware II,
have been used at different times in the last two
decades.  A conversion factor may be necessary to
make the various parts of the survey comparable.
The conversion factor may be different for
different species depending on their size, weight,
schooling behavior, etc.
   To see if there is any difference in the
fishing power, and to estimate the conversion
factor, if necessary, paired tows were made using
the Albatross IV and the Delaware II off southern
New England and on Georges Bank.  The station
locations were preselected using a stratified
sampling scheme.  A total of 142 successful pairs
of tows were performed with these vessels during
1982 over a large area that encompassed a variety
of depth and bottom types.
   Byrne and Fogarty (1985) carried out an
analysis of the data using non-parametric methods
by rank transforming the observations.  But, this
method loses much of the information contained in
the data.
   Due to the highly skewed nature of the
distribution of the catches, the difference in the
mean catches is not an efficient estimate of the
relative fishing power.  In multispecies fish
surveys, when large areas are sampled, any
particular species usually occupies only a part of
the total survey area.  In these circumstances,
the zero values can be taken to represent areas of
unsuitable or unoccupied habitat.  The proportion
of non-zero values in the sample estimates the
proportion of the total survey area that is
occupied by the species.
   The interpretation of the proportions of
non-zeros in a sample as an estimate of habitat
area may be vague in some situations, especially
for mobile populations.  A suitable habitat may
change from time to time due to many factors
including the timing of the survey, or the
non-occupancy of an area simply because of low
population level.  However, keeping the zeros
separate often enables one to fit a relatively
simple distribution like the lognormal to the
non-zero values.
   Transformations like  log(a+X)  have been
suggested to avoid the problem of zeros in the log
transformation, where  a > 0  is a constant.  The
problem here is the choice of  a.  The
transformation using  a = 1  has been studied at
Woods Hole to transform the data to normality.
Because of the large proportion of zeros,
log(1+X)  is far from normally distributed, which
makes it difficult to retransform and interpret
results expressed in the transformed scale.
Further, it has been observed that different values
of  a  near 1 lead to different conclusions, so
this class of transformations gives unreliable
results.

                   3.2  Method

   It is reasonable to assume that the population
mean of the catch per tow varies in proportion to
the relative abundance of fish over a region.
Further, a zero catch by both the vessels at a
station is non-informative with regard to the
relative "fishing power" of the ships.  Zero
catches may simply be due to lack of fish in the
area.  Consequently, it is enough to consider
those pairs of the data where at least one
component is non-zero.  This leads to the
consideration of the independent vectors
(X1,Y1),...,(Xn,Yn), where for each  i,  Xi >= 0,
Yi >= 0  and  Xi + Yi > 0.  It may be reasonable to
define  θ = E(X|X>0)/E(Y|Y>0)  as the relative
fishing power (or the conversion factor).  A
natural estimate of  θ  is

     θ̂ = ( Σ Xi / nX ) / ( Σ Yi / nY ),

where the sums extend over the non-zero observations and
nX  and  nY  are the number of non-zero
observations  Xi  and  Yi  respectively.  If the  Xi
and  Yi  are assumed to have lognormal
distributions, then some modification of this
formulation is required (see Babu 1986).  Further,
the bias can easily be estimated using the paired
data and shown to be practically negligible.   The
paired data set is also used in estimating the
standard errors.
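   As a rough illustration of this estimator (not part of
the original analysis), the following sketch computes the
ratio of the vessels' mean non-zero catches from a set of
paired tows; the catch numbers are hypothetical.

    # Sketch: ratio-of-means estimate of relative fishing power.
    # Paired catches per tow are hypothetical, for illustration only.
    albatross = [0, 12, 0,  5, 30, 0, 7, 18]   # vessel X (Albatross IV)
    delaware  = [0,  9, 4, 11, 22, 0, 0, 25]   # vessel Y (Delaware II)

    # Keep only stations where at least one vessel had a non-zero catch.
    pairs = [(x, y) for x, y in zip(albatross, delaware) if x + y > 0]

    x_nonzero = [x for x, _ in pairs if x > 0]
    y_nonzero = [y for _, y in pairs if y > 0]

    # theta-hat: mean of non-zero X catches over mean of non-zero Y catches.
    theta_hat = (sum(x_nonzero) / len(x_nonzero)) / (sum(y_nonzero) / len(y_nonzero))
    print(round(theta_hat, 3))

A value near 1 indicates comparable fishing power; the paired
structure of the data is used separately for the bias and
standard error estimates mentioned above.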

                  3.3  Results

   A total of 32 species were identified for the
analysis.  The non-zero values are approximately
lognormal ly distributed.  Overall  relative fishing
power was computed for catch in numbers and in
weight.  Table 3.1 gives the estimates.
                     Table 3.1

     Catch in | Log Fishing Power | Standard Error
     Weight   |      -.2780       |     0.0850
     Number   |      -.1401       |     0.1040
   Both  in terms of  total numbers  and total
weight,  Delaware II  appears  to have significantly
more fishing power than Albatross IV.
   The results  for the 32 species  are presented
in Tables 3.2 and 3.3.   For  additional
discussion, see Babu, Pennington,  and Patil
(1986).
                                                    79

-------
                                TABLE 3.2
                            CATCH IN NUMBERS

[Table 3.2 lists, for each of the 32 species below, the number of paired tows and the mean log catch ratio with its standard error for three cases (both vessels nonzero; Albatross nonzero, Delaware zero; Albatross zero, Delaware nonzero), the difference of the means with its standard error, and the combined estimate of log relative fishing power with its standard error.  The species are: 1 Smooth Dogfish, 2 Spiny Dogfish, 3 Winter Skate, 4 Little Skate, 5 Silver Hake, 6 Atlantic Cod, 7 Haddock, 8 White Hake, 9 Red Hake, 10 Spotted Hake, 11 American Plaice, 12 Summer Flounder, 13 Fourspot Flounder, 14 Yellowtail Flounder, 15 Winter Flounder, 16 Windowpane, 17 Butterfish, 18 Bluefish, 19 Scup, 20 Longhorn Sculpin, 21 Sea Raven, 22 Northern Sea Robin, 23 Amer. Sand Lance, 24 Ocean Pout, 25 Goose Fish, 26 Amer. Lobster, 27 Jonah Crab, 28 Rock Crab, 29 Sea Scallop, 30 Shortfin Squid, 31 Longfin Squid, 32 Bay and Striped Anchovy Combined.]

                                TABLE 3.3
                            CATCH IN WEIGHT

[Table 3.3 gives the same quantities as Table 3.2, for the same 32 species, with catch measured in weight rather than in numbers.]
                                                        80

-------
         4.
A CRYSTAL CUBE FOR COASTAL
AND ESTUARINE DEGRADATION
   Environmental  regulators  and  decision-makers
would  like  to have a crystal ball  that  could
predict how ecosystems would respond to factors
such as stress, pollution or over-fishing.  In
this way, information on important parameters such
as amounts  of contaminants entering an  estuary,
their  effect on the biota, the propagation  of
these  effects through the ecosystem,  and
subsequent  recovery after the removal of these
stresses, could all be properly  considered  in the
use and protection of important  natural resources.
In the real  world, however,  such predictions
cannot be made with certainty.
   This conceptual crystal cube  has a series of
faces, each of which represents  a  specific
parameter that can be directly related  to marine
environmental degradation.   At present,  ten
indices or  faces  of the cube are being  tested:
dietary risks from contaminants  in marine foods;
contaminant  stress in sediments; contaminant
stress in the water column;  human  pathogen  risks;
benthic species and composition; fish and
shellfish diseases; reproduction in fish and
shellfish; mortality of eggs and larvae of  fish
and shellfish; reproductive  success in  marine
birds; and  oxygen depletion.  For  details,  see
Boswell and Patil (1985, 1986), Patil (1984a,b),
Patil and Taillie (1985), and Pugh, Patil,
and Boswell  (1986).
   The emphasis in testing these indices is on
standardizing long term data sets  in order  to
construct a  single summary variable—termed an
index—to represent each.  This index is based on
a variable  that measures contamination  or,
ideally, contamination effects.  The choice of
such a variable is not easy and usually involves
extensive data analysis.  To be  useful,  the
index—summarizing data—must be sensitive  to
contamination and relatively insensitive to other
factors.
   The use  of this variable  index  in the crystal
cube analyses is  in the separation of "concern or
alarm" from  "no concern" conditions.  Here,
concern or  alarm  does not necessarily mean  only
the violation of  legislated  or regulated
standards, but indicates that the  scientific
community is not  able to certify that issues of
widespread public concern will not arise from
existing environmental stress.
   The index is calibrated so that when the
number falls in the range of 0 to  1,  there  is "no
cause  for concern."
   A flag is raised as soon  as the index
becomes 1.  The range from 1 to 10 indicates
"warning;"  something is happening  and should be
investigated.  A  value of the index in  the  range
from 1 to 10 indicates that  the  environment has
been adversely affected.
   The range above 10 indicates  "cause  for  alarm."
The index is designed to be  10 when there is
scientific  reason for grave  ecological  concern.
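   A minimal sketch of how a calibrated face of the cube
might be read (the thresholds 1 and 10 are those described
above; the function name is ours, not part of the published
index):

    def classify_index(value):
        # Interpret a crystal cube index: 0-1 no concern, 1-10 warning, above 10 alarm.
        if value < 1:
            return "no cause for concern"
        elif value <= 10:
            return "warning: investigate"
        else:
            return "cause for alarm"

    for v in (0.4, 3.2, 27.0):
        print(v, "->", classify_index(v))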
   The fundamental concept underlying the use of
a single summary  variable for each of the ten
measures of  environmental degradation is to
compare conditions in a stressed estuary or
coastal area with those from a clean  region.  The
crystal cube with ten faces, each  with  a single
summary variable  representing one  important
environmental pollution parameter,  will  flag cases
where  legal  or scientific benchmarks  for concern
or alarm are exceeded.
   Ultimately, this technique can  assist the
environmental manager or regulator, who must
evaluate large data sets, to narrow attention to
those  specific environmental parameters where
there  are serious problems.  The crystal cube is
not intended to be the "ideal" crystal  ball
desired by  environmental managers,  but  it does
provide a potential framework for  evaluating and
comparing different environmental  measures  that
must ultimately be weighed not only against each
other, but  also against other (e.g.,  economic,
aesthetic,  etc.)  considerations.   Thus,  the
crystal cube could develop into  a  valuable  tool to
help define or delineate "unreasonable
degradation" and make environmental
decision-making a more systematic science,
reflecting an integration of the complexity  in
ecosystems.
                                                 5.
          REPRODUCTIVE SUCCESS OF MARINE BIRDS
                ON THE EAST COAST
                                             Ospreys  are  large marine birds nesting and
                                           fishing on  the  east  coast of the United States.
                                           They  nest in  accessible areas using the same
                                           nesting sites from year to year, and they are
                                           tolerant of humans.   This has permitted the entire
                                           osprey population to be censused at regular
                                           intervals since 1974 (see Spitzer,  Poole, and
                                           Scheibel 1983).   In  common with other shore birds,
                                           reproductive  potential  of the osprey is sensitive
                                           to the presence in the  environment of toxicants
                                           such  as DDT.  At the same time,  osprey
                                           reproduction  is much affected by naturally
                                           occurring stresses such as wind, storms and food
                                           shortage.   Thus, any index of anthropogenic impact
                                           upon  osprey reproduction must carefully
                                           incorporate the effects of natural variation.
                                             Figure 5.1 gives  three-year moving-average
                                           plots of the  reproductive success of osprey
                                           at several  locations along the East Coast.   Here,
                                           reproductive  success is measured as the average
                                           number of young fledglings in the active nests.
                                           It is evident that there is both spatial and
temporal variability.  Much of the temporal
                                           variability is  attributed to the effects of DDT.
                                           In the early  1970's  there was extensive DDT
                                           pollution which gradually cleared from the
                                           environment after its use was banned.   The years
                                           1973  to 1979  are transition years when the effects
                                           of DDT were still present.   The 1980's appear to
                                           be essentially  free  from DDT effects upon osprey
                                           reproduction.
                                            Figure 5.1   AVERAGE OSPREY YOUNG PER ACTIVE NEST
                                                         (3 year running averages)
                                            [Plot of three-year running averages of young per
                                            active nest, 1970-1983, for the Surrounding Areas and
                                            for Gardiners Island; vertical scale 0.0 to 2.0.]
                                              For the reasons just described, the five
                                           years from 1980 to 1984 were selected as the
                                           reference or control period used to calculate a
                                           reference value,  R, using the effects of natural
                                           variability in the environment.  The index, in its
                                           basic form, is given by
                                                I = R/Y,
                                              (5.1)
where  R   is  the reference value  for reproductive
success, expressed  in young per active  nest,  and Y
is the reproductive success observed in the year
and the region being indexed.  The  reference value
R  is the  estimated 10th percentile of  the
distribution  of reproductive success during the
unstressed reference period from  1980 to 1984.
Thus, index values  greater than 1 indicate that
reproductive  success is so low that it  could occur
only one year in 10 under unstressed conditions.
                                                    81

-------
The index is constructed to flag cases where  the
reproductive success falls short of the
reference values.
   The index is calibrated in the range 0  to  1
using data from 1980 to 1984.  On the other hand,
expert opinion of knowledgeable biologists is
necessary to calibrate the index in alarming
situations.  When the reproductive rate of the
ospreys drops below about .8 young per active
nest, then the population tends to decrease
(Spitzer 1985).  The basic index, using a
reference value of  R = 1.7  young per active nest
calculated from the combined data from the East
Coast, was calibrated to take the value 10 when
the reproductive success is .8 young per active
nest.  The index then becomes

     I = (R/Y)^c = (1.7/Y)^c                         (5.2)

where  c  is the constant used for calibration.
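   The constant  c  is not stated numerically in the text;
as an illustrative sketch it can be recovered from the two
calibration conditions given above (index 1 at the reference
value R = 1.7 and index 10 at .8 young per active nest):

    import math

    R = 1.7          # reference value: young per active nest, 1980-1984 period
    Y_alarm = 0.8    # reproductive rate below which the population tends to decrease

    # Choose c so that (R / Y_alarm)**c = 10.
    c = math.log(10) / math.log(R / Y_alarm)

    def osprey_index(y):
        # Index (5.2); equals 1 at y = R and 10 at y = Y_alarm.
        return (R / y) ** c

    print(round(c, 2))                  # calibration exponent, about 3.05
    print(round(osprey_index(1.7), 1))  # 1.0
    print(round(osprey_index(0.8), 1))  # 10.0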
   The values of the index (5.2) are tabulated
below for the four regions (i) the Northeast Coast
from New York City to Boston, (ii) Massachusetts,
(iii) Gardiners Island located off the tip of Long
Island,  and (iv) area surrounding New York City.
                         Index of Osprey Reproduction

Region             1969   1970   1971    1972   1973   1974   1975   1976   1977   1978   1979
Northeast Coast    35.0   22.8   19.7    28.0   11.2    7.2    7.2    2.0    4.6    4.2    3.6
Massachusetts      26.6    4.4   11.7     4.4     .6    1.7    1.9     .8   18.8    3.7    1.5
Gardiners Island   17.9   17.9   35.0  1643.6   29.6   11.7   22.8    5.7    4.1   10.0   21.7
New York City      66.2   61.7   20.7    29.6   66.2   17.1    6.1    2.5    3.2    2.9    2.9
   The values in the above table were calculated
using the global reference value  R = 1.7 obtained
from the combined data for the entire Northeast
Coast.  The population of ospreys on Gardiners
Island is stressed by a limited food supply.
Using a global reference value for such
populations results in index values perpetually in
the warning or alarm range.  A local index based
upon a local reference value is to be preferred in
such cases because we are attempting to  index
stresses of anthropogenic origin.  For ospreys a
local index was calculated for each region using
the years 1980 to 1984 as a reference period.  The
resulting index  is shown in Figure 5.2  for each
of the three local regions.  Notice, in
particular, that the local Gardiners Island index
approaches the value 1 as the DDT passes from the
environment.  This was not the case when a global
reference value was used.
                 Figure 5.2   INDEX I FOR OSPREY DATA
   [Plot of the local index, 1969-1979, on a logarithmic
   scale from 0.1 to 1000, for Massachusetts, the Surrounding
   Areas, and Gardiners Island.]
         6.  LOW DISSOLVED OXYGEN:  AN INDEX

   When the amount of oxygen dissolved in the
water drops below 5 mg/l the ecosystem becomes
stressed.  The biological response is species
dependent and varies with the temperature, the
concentration level as well as the duration of
exposure.  Three responses have been identified
for incorporation into an index:  mortality (10%
of surf clams), avoidance (by 50% of red hake),
and reduced growth (15% for winter flounder over
the summer season).  These species were chosen as
the most sensitive from among important species
for which data were available.
   To calculate an index value, dissolved oxygen
data must be available for a given location on a
daily basis throughout the summer season.
Low-dissolved-oxygen (low DO) episodes, defined to
occur when the dissolved oxygen concentration
drops below 5 mg/l, occur in the summer months.
An index value for the season is the maximum of
index values calculated for the low DO episodes
throughout the summer.  The value calculated for a
given low DO episode is, in turn, the sum of three
values corresponding to the three responses.  To
calibrate the index (see discussion in Sections 4
and 5) to be 10 in an alarming situation the value
corresponding to mortality is multiplied by 10
before adding.
   Before an index can be calculated, it is
necessary to know the intensity of each response
to low DO concentrations; this is estimated by
laboratory studies.  The resulting dose-response
curves provide reference values to compare with
the observed low DO concentrations.  Three curves,
giving the days' exposure to produce the indexed
response as a function of the DO concentration,
must be determined.  Since the effect of low DO
varies with temperature, the average summertime
temperature in the region to be indexed is used.
   Let  ci  be the observed DO concentration on
the ith day of a low DO episode, i = 1,2,...,n.
Let  mi, ai  and  ri  stand for the days' exposure
to a DO concentration of  ci  needed to produce the
mortality, avoidance and reduced growth responses
respectively.  The index can be formulated as

     I = 10 Σi (1/mi) + Σi (1/ai) + Σi (1/ri) = Σi qi

where  qi  is the combination of the three
exposure curves.  The data is summarized into DO
categories and the above calculation is simplified
by using the frequencies.  The index becomes

     I = Σ(i=1 to k) fi qi                        (6.1)

where  fi  is the number of days that the low DO
                                                    82

-------
episode has concentrations in the i th category
(corresponding to the value  qi)  and where  k  is

the number of categories.  The calculation of an
index value is illustrated in Table 6.1.

                    TABLE 6.1

    Calculation of the dissolved oxygen  index
      for a low DO episode at the Narrows*,
          New York Harbor, Summer 1975
  DISSOLVED        FREQUENCY        COMBINED     CONTRIBUTION
  OXYGEN           OBSERVED AT DO   RESPONSE     TO THE
  CONCENTRATION    CONCENTRATION    CURVE        INDEX
      1.0                0             -             0.0
      1.2                0            4.55           0.0
      1.4                2            4.35           8.7
      1.6                0            4.17           0.0
      1.8                1            4.08           4.1
      2.0                3            3.91          11.7
      2.2                1            3.70           3.7
      2.4                2            3.55           7.1
      2.6                2            3.33           6.7
      2.8                2            3.11           6.2
      3.0                1            2.86           2.9
      3.2                2            2.63           5.7
      3.4                1            2.38           2.4
      3.6                3            2.08           6.2
      3.8                2            1.79           3.6
      4.0                2            1.52           3.0
      4.2                3            1.11           3.3
      4.4                4            0.67           2.7
      4.6                3            0.22           0.7
      4.8                1            0.16           0.2
      5.0                2            0.01           0.0

                                    TOTAL INDEX     78.9
*The Narrows  is  known  to be polluted.

   This index has undergone many changes from
the original formulation and is currently being
looked at.  The final form of the index has not
been fixed at this time.  The example given in
Table 6.1 is for illustration purposes; the data
is for a summer season which may not correspond to
one low-dissolved-oxygen episode as required for the
purposes of this index.
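   A minimal sketch of the calculation in equation (6.1),
using a few hypothetical DO categories (the frequency and
combined-response columns of Table 6.1 would be used in
exactly the same way):

    # Equation (6.1): I = sum over categories of f_i * q_i.
    frequencies = [2, 1, 3, 4]              # f_i: days of the episode in each DO category
    combined_q  = [4.35, 3.91, 2.08, 0.67]  # q_i: combined response curve value per category

    index = sum(f * q for f, q in zip(frequencies, combined_q))
    print(round(index, 1))   # the seasonal index is the maximum of such episode values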


         7.   COMBINING BIO-ASSAY RESULTS
       FOR EXTRAPOLATION OF CHRONIC EFFECT
         THRESHOLDS FOR RISK ASSESSMENT

                 7.1  Introduction

   Ecological effects of toxic chemicals are
commonly assessed by estimating a "safe" exposure
level, below which no effects will occur.  To
protect  organisms at their most sensitive stages,
life cycle tests or, in some cases, early life
stage tests are  necessary for estimation of
chronic  effect threshold levels.  It is not
feasible to conduct tests for every possible
toxicant and species of interest.  Instead, "safe"
levels are commonly extrapolated from laboratory
 test results of a few standard test species and
particular life stages by applying correction
 factors and subjective judgement.  Suter, et al.
(1986a) propose a more structured approach, called
"Analysis of Extrapolation Error" (AEE), for
 extrapolating chronic effect thresholds.  Its main
 features and advantages over traditional methods
 lie in the explicit quantification of  the
 consequence of exceeding  the estimated safe  level,
 of interspecies differences in sensitivity between
 tested and extrapolated species, and the variable
 relationship between acute and chronic effects or
 chemicals.  See also  Linder, Patil, Suter, and
 Taillie (1986).

         7.2  Acute-Chronic Extrapolation

    AEE  is based on statistical  analysis of acute
 and chronic  toxicity  test data  sets collected
 using uniform experimental protocols.   For each
species and chemical pair,  two  different studies
are conducted: one to determine the long-term low-level
effect, or maximal allowable toxic concentration
(MATC), and one to determine the 96-hour LC50, the
high-level effect concentration producing 50%
mortality in 96 hours.  If enough of
these results are available, a  functional
relationship, the so called acute-chronic
extrapolation can be estimated.   It is used in
turn to extrapolate from the LC50 to the MATC (see
Figure 7.1).  Note that each LC50-MATC pair is the
result of a different study reported in the
toxicological literature.   To ensure
comparability, only results obtained under similar
experimental conditions with uniform protocols are
used for establishing the acute-chronic
relationship.
   Estimated error (residuals)  about that
relationship and error of the parameter estimates
determine the error of an extrapolated MATC,
similar to prediction variances in  ordinary linear
regression analysis.  This  allows the calculation
of the risk that a given environmental
concentration of the chemical being assessed
exceeds the extrapolated MATC for the species of
interest.
          [Logarithms of MATC values from life-cycle or partial
          life-cycle tests plotted against logarithms of 96-h LC50
          values determined for the same species and chemical in the
          same laboratory.  The line is derived by an
          errors-in-variables regression.]

                 Figure 7.1  (from
                   Suter et al., 1986a)


    Suppose we want to  assess  the  chronic effect
 of a given chemical on species A  of fish.   If an
LC50 is available for this species-toxicant pair,
 the acute-chronic  extrapolation can be applied
 directly.  The variability of the data about any
 fitted curve is  quite  large.   As  a result,
 different classes  of curves might be considered
 for defining the extrapolation relationship.  In
 the applications examined  so far, logarithmic
 transformation in  both variables  produced a linear
 trend and homogeneity  (equal variances) about that
 trend.

      7.3  Taxonomic  and 2-step extrapolation

    In the case where no  test results on species A
 of interest are  available, another test species B
 has to  be chosen for the  purpose of extrapolation.
                                                      83

-------
The uncertainty due  to  extrapolating from  B  to A
depends on  the difference  of the sensitivities of
species A and species B to the  chemical.   This
difference  is assumed to be proportional  to the
"taxonomic  distance" between A  and B.   For this
reason, extrapolation relationships are estimated
between taxa having  the next higher taxon in
common.  This is done by pairing LC50's of common
chemicals to members of the two taxa,  whenever
there are enough such pairs to  allow curve
fitting.  The resulting curve is used for
extrapolating the  LC50  of  the species of  interest
from the LC50 of the test  species.   This  is called
taxonomic extrapolation.   Figure 7.2 depicts an
example of  a taxonomic  extrapolation between two
different genera that are  members of the  family
Salmonidae.  The LC50 data base is compiled

either from a single laboratory or from several
laboratories (Suter et al. 1986b).  The data base
is screened to ensure compatibility with respect
to testing  conditions.   As with the acute-chronic
extrapolation, the data are log transformed to
produce linearity  and homogeneity.
   [Figure 7.2  (from Suter et al., 1986b):  Logarithms of
   LC50 values for Salvelinus plotted against logarithms of
   LC50 values for Salmo (µg/L).  The line is determined by an
   errors-in-variables regression.]


   Taxonomic and  subsequent acute-chronic
extrapolation are combined in  order  to  extrapolate
to a chronic effect  threshold.  Thus, Z =  c+dY  =
c+d(a+bX), results  from  combining Y  = a+bX,  the
estimated line  for  taxonomic extrapolation and
Z = c+dY, the estimated  line for  acute-chronic
extrapolation.  The  variance of an extrapolated
MATC is calculated under the assumption of
statistical independence between  the set of
variables associated with  the  two extrapolations.
Estimated variance  of an extrapolated MATC is
quite large especially when the extrapolation
requires more than  one step.   Assuming  a normal
distribution for the extrapolated MATC (Suter et
al. 1986a) is therefore not likely to affect the
resulting risk  calculation too strongly.
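   A sketch of this two-step extrapolation, with invented
placeholder coefficients rather than fitted values from
Suter et al.:

    # Taxonomic extrapolation Y = a + b*X, then acute-chronic Z = c + d*Y,
    # all on the log scale.  Coefficients are hypothetical placeholders.
    a, b = 0.10, 0.95    # taxonomic:      log LC50 of species A from log LC50 of species B
    c, d = -0.80, 0.90   # acute-chronic:  log MATC from log LC50

    def extrapolated_log_matc(log_lc50_test_species):
        # Z = c + d*(a + b*X): chronic threshold extrapolated in two steps.
        y = a + b * log_lc50_test_species
        return c + d * y

    print(round(extrapolated_log_matc(2.0), 3))

The variance of the result is then obtained by combining the
prediction variances of the two fitted lines under the
independence assumption stated above.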
                                                           7.4  The Data

                                           We  focus in the following on the acute-chronic
                                         extrapolation.  Most of the problems and issues
                                         discussed arise equally in the context of the
                                         taxonomic extrapolations.   Let  (Xi,Yi)  be an

                                         LC50-MATC pair for a particular toxicant species
                                         combination (i = l,...,n).  The following are the
                                         features  of the acute-chronic "data set" of Fig.
                                         7.1:
     (i)   Each point (or pair) represents a
                                         reported  result from a bio-assay experiment.
                                             (ii)   Different points result from different
                                         studies.
                                            (iii)   The collection of points has been
                                         gathered  from the literature.  Hence the  (Xi,Yi)

                                         may  not constitute a random sample from the
                                         population of all possible LC50-MATC pairs.
                                             (iv)   Since  (X.Y)  are estimates of
threshold concentrations,  they are themselves
random quantities.  There is considerable
uncertainty about their  "true" values.

            7.5  Combining Estimates

   Traditional methods of  extrapolation  are often
based on the use of a single  test  species such as
fathead minnow for fresh water fish.  The ratio of
its MATC to LC50 is multiplied by  X0,  the LC50
for the species of interest; using the above
notation, this is

     Y = (Y1/X1) X0,   where  (X1,Y1)

denotes the (LC50, MATC) of the test species.

                                         This provides a point estimate of the chronic
                                         threshold concentration.  It is subsequently
                                         "scaled down" by correction factors accounting for
                                         the  uncertainties due to extrapolation and natural
                                         variabilities.   The final value for the
                                         extrapolated MATC depends strongly on the test
                                         species chosen as well as on the particular
                                         sources of uncertainties considered.
                                           An improved combined estimate can be obtained
                                         by using the test results of several toxicants and
several species.  Let  bi = Yi/Xi  be the "slope"
from the test on the ith toxicant-species pair.  A
combination estimate can be formed by means of a
linear combination of the individual estimates
using weights  wi:

     b = Σ(i=1 to n) wi bi .

In some cases, when the weights  are chosen
proportional  to the statistical  influence of the
corresponding observations,  the  resulting
combination estimate  turns  out to correspond to a
particular estimate of that  curve.   For linear
regression through the origin, the slope estimate
is obtained by choosing the  weights proportional
to the squared lengths of the  Xi:

     b = Σi XiYi / Σi Xi² = Σi wi bi ,   where  wi = Xi² / Σj Xj²
     (i = 1,...,n).

                                        Thus, combining extrapolation factors (estimates)

                                        bi   is equivalent  to estimating an extrapolation
                                                    84

-------
curve for the collection of points  (Xi,Yi).  This
motivates the procedures described below.  Notice
that we have not considered any of the
technicalities such as variable transformation or
correction for the intercept.
   The main advantage of estimating an
extrapolation curve over the traditional method lies
in the availability of standard error estimates.
Thus the method provides explicit quantification
of the sources of errors involved.  Combining all
available LC50-MATC pairs reduces the uncertainty
by using the largest possible data set.  On the
other hand, it increases the variability because a
wide variety of toxicants and species are "lumped
together."  Extrapolation by using results of one
particular class of chemicals reduces this
variability.  However the representativeness of
the remaining species might be in question after
the data have been partitioned by chemical class.
The use of the appropriate data set for a given
extrapolation problem needs to be carefully
examined.
                 7.6  The Model
   Logarithmic transformation of both  Xi  and  Yi
produces linearity and homogeneity to a
satisfactory degree given the natural
variabilities.  We propose an errors-in-variables
(EIV) model for estimation of the extrapolation
line for the data described in 7.4.  The
assumptions underlying ordinary least-squares
(OLS) regression analysis are clearly violated.
In the errors-in-variables model  (Xi,Yi)  are
assumed to have been recorded with error.  They
represent unknown mathematical quantities  (Ui,Vi).
Linearity is assumed between the  Ui  and  Vi,
resulting in the model:

     Xi = Ui + δi,   Yi = Vi + εi,   Vi = α + βUi,   i = 1,...,n.

Normal distributions with zero means are commonly
assumed for the errors  δi, εi,  with  var(δi) = σδ²,
var(εi) = σε²,  cov(δi,εi) = 0,  and  (δi,εi)
independent of  (δj,εj)  for  i ≠ j.
   Two EIV models have been studied extensively
(Kendall and Stuart, 1979; Gleser 1983):
     (i)  The structural model: the  Ui  are
assumed to be a random sample.
    (ii)  The functional model: no assumption on
the  Ui.
For identifiability in both models, one of the
variance parameters has to be assumed known a
priori.  In the classical case, this is the
error-variance ratio  λ = σε²/σδ².
    More complicated models would be more
 realistic for the extrapolation problem.  These
 are:
    (iii)  The ultrastructural model: the  Ui  are
random but with different locations for different
i  (Dolby, 1976).
     (iv)  The model with error in the equation
(Schneeweiss, 1976).
   Both (iii) and (iv) cannot be distinguished
from the functional model (ii) unless replicate
observations are available or additional a priori
assumptions about the error structure are made.

   Maximum likelihood estimators of the slope  β
and the intercept  α  for both the structural and
the functional model are:

     β̂ = h + sign(SXY)(h² + λ)^(1/2),   where  h = (SYY - λSXX)/(2SXY),   α̂ = Ȳ - β̂X̄,

where bars denote averages and  SXX, SXY, SYY  the
usual corrected sums of squares.
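   The following sketch implements these estimators for an
assumed error-variance ratio  λ; the data are synthetic and
λ = 1 (orthogonal regression) is used only as an example.

    import math

    def eiv_fit(x, y, lam=1.0):
        # ML slope and intercept: beta = h + sign(SXY)*sqrt(h^2 + lam),
        # h = (SYY - lam*SXX)/(2*SXY), alpha = ybar - beta*xbar.
        n = len(x)
        xbar, ybar = sum(x) / n, sum(y) / n
        sxx = sum((xi - xbar) ** 2 for xi in x)
        syy = sum((yi - ybar) ** 2 for yi in y)
        sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
        h = (syy - lam * sxx) / (2.0 * sxy)
        beta = h + math.copysign(math.sqrt(h * h + lam), sxy)
        alpha = ybar - beta * xbar
        return alpha, beta

    # Synthetic log LC50 (x) and log MATC (y) values, for illustration only.
    x = [0.5, 1.0, 1.8, 2.4, 3.1]
    y = [-0.2, 0.4, 1.1, 1.6, 2.3]
    print(eiv_fit(x, y, lam=1.0))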


   The corresponding OLS slope  b = SXY/SXX is

smaller than  β̂  in absolute value, and thus downward
biased.  β̂  can also be obtained by minimizing the
sum of squared distances from the points  (Xi,Yi)  to the
line, measured at an angle to the vertical determined by
λ  (Mandel, 1984).  Such a least-squares
interpretation of  β̂  allows for straightforward
generalizations to weighted EIV in the case of
                                                        unequal variances (Sprent, 1966).   Thus reporting
                                                        biases and inhomogeneities resulting from
                                                        combining the LC50-MATC estimates can be
                                                        incorporated in the model.  This produces the
                                                        estimates:


     β̂w = hw + sign(SXYw)(hw² + λ)^(1/2);   where  hw = (SYYw - λSXXw)/(2SXYw),

     with  SXXw = Σi wi(Xi - X̄w)²,   X̄w = Σi wiXi / Σi wi,   etc.


       7.7   Risk  Calculation and Evaluation

    The final risk calculation is determined by the
 statistical  distribution of the MATC that has been
 extrapolated from an input  LC50.  Exact
 distributions of  the EIV estimators are not
 tractable.   Given the large variabilities,
 estimates  of asymptotic  standard error combined
 with normality assumption are considered accurate
 enough  in  this context.   For future refinements,
 resampling methods such  as the bootstrap might be
 applied  to obtain small  sample distributions or
 standard error estimates.
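   Under that normality assumption the risk reduces to a
single normal probability; a sketch with hypothetical
numbers (all on the log scale):

    import math

    def normal_cdf(z):
        # Standard normal CDF via the error function.
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    log_matc_hat = 1.20   # extrapolated log MATC (hypothetical)
    se_matc      = 0.45   # its estimated standard error (hypothetical)
    log_env_conc = 0.90   # log environmental concentration being assessed (hypothetical)

    # Risk that the true MATC lies below the environmental concentration.
    risk = normal_cdf((log_env_conc - log_matc_hat) / se_matc)
    print(round(risk, 3))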
   For the approximate variance of an
extrapolated MATC, we add a variance term  (σε²)
to the variance of a fitted Y-value as given in
Mandel (1984).  For the weighted EIV model the
residual variance entering this formula is

     s² = (β̂²SXXw - 2β̂SXYw + SYYw) / (n - 2).
                                                            It has been frequently suggested  (Lindley,
                                                         1946; Kendall and Stuart, 1979) to use OLS
                                                         regression if the purpose of  the analysis  is
                                                         prediction even when the X's  have been measured
                                                         with error.  Prediction, or in general regression
                                                         analysis is based on the covariation of  the two
                                                         random variables X  and  Y.   Predictions by means
                                                         of the conditional properties of Y   given   X   are
                                                         possible for the structural EIV model.   The
                                                         situation is different in the functional model.
                                                         Strictly speaking, the (Xi,Yi)  constitute

different random variables for  i = 1,...,n.  The
                                                         only existing relationship lies in the proposed
                                                      85

-------
structure (in this case, the line) for the
location of the means.  Thus this is not a
classical prediction problem, since no conditional
means or variances are involved.  It is for this
reason that we propose to use EIV estimates for
the extrapolation problem.
   The method of combining estimates for
extrapolating MATC's is based on purely statistical
grounds.  Relative magnitudes of estimated
standard errors seem to be appropriate from a
biological point of view.  For the applicability
of the methods, the results (absolute magnitudes)
have to be evaluated in terms of their biological
meaning.  This has been done extensively (Suter et
al., 1986a) by comparing extrapolated MATC's with
measured MATC's for toxicant-species combinations,
where the results are available.  In addition the
method was compared to the traditional approaches
using only results on fathead minnow.  It has been
generally found to outperform the old methods.

      8.  COMBINING CONCLUSIONS ACROSS SPECIES

   The NOAA Chesapeake Bay Stock Assessment
Committee has been set up to help develop a plan
to establish a cooperative stock assessment
program.  The Center for Statistical Ecology and
Environmental Statistics has been studying various
continuous and categorical multiple time series
methods to partition the effects of fishing
mortality, natural mortality and the effects of
pollutant loadings on stock sizes (Boswell,
Linder, Ord, Patil, and Taillie, 1986).  The data
set used to evaluate these methods was compiled
from historical environmental data, pollution
data, and fishing data (see Summers et al. 1984).
As much as possible, the environmental and
pollution variables were chosen to be meaningful
in terms of the fish stocks to be investigated.
However, the pollution variables are
macro-pollution variables which give gross
indications of the corresponding pollution
loadings.
   Examples of environmental variables are
average monthly air temperature, river temperature
and flow, wind speed and direction, etc.  Examples
of pollution variables of regions in or near the
selected water systems are human population size,
employees in manufacturing industries, sewage
volume discharge, acreage in improved farmland,
total annual volume dredged, five-day biochemical
oxygen demand for loadings from treatment plants,
minimum 28 day average summertime dissolved
oxygen, etc.  The fish data, consisting of
information from various sources, was combined to
give a stock index in the form of catch per unit
effort for the dominant fishing gear used in the
region.  Different species were chosen for
different parts of the Chesapeake Bay.  For the
Potomac River system, the species chosen are
striped bass, American shad, American oyster and
blue crab.
   The results of a study of the pollution
effects on fish stocks would be of interest to
managers with the job of deciding what pollution
is in need of abatement programs.  It may turn out
that different fish stocks are affected
differently by the pollutants.  If the study could
provide some measure of impact for each pollutant
on each species and if the manager can provide
weights giving the importance of each species,
then the problem can be approached by the method
of combining expert opinions, described in the
next section.  With macropollution variables such
as those included in the study, clear-cut results
are not to be expected.  This study was mainly to
identify, adapt, and develop the statistical
techniques as needed.
   The first technique considered is that of
multivariate time-series regression.  This
requires the assumption of some functional form  to
incorporate the effect of the variables on the
stock size variable.  The usual assumption of a
linear model was used and all biologically
meaningful lags were incorporated.
   Categorical regression was used by Summers et
al  (1984) and was studied here as a starting point
for modifications to incorporate meaningful
biological concepts into the lag structure.  The
methods tried incorporate a distributed lag
structure and a combination of continuous and
categorical methods.
   Transfer function modeling was also tried.
   It is possible that various methods would
yield different results.  If the correct method to
use is unknown, then the combining of the results
of various studies using different methods may
provide better results than any one study by
itself.  This problem is analogous to the problem
of combining expert opinions as outlined in the
next section.

       9.  COMBINING CONCLUSIONS FROM EXPERTS

   The necessity for combining probabilities
could arise in two situations—described
respectively as the group decision problem and the
panel of experts problem.  In the first,
individuals with different probability judgements
(and different preferences) have to make a joint
decision.  Under certain circumstances, this joint
decision could be the result of maximizing a group
expected utility where the expectation is taken
with respect to a combined or group probability
distribution.  In general, the group decision
problem is intractable.  (References to various
aspects of this problem include Arrow (1951),
Hylland and Zeckhauser (1980), and Wilson (1968).)
   The second situation, the panel of experts
problem, is set in the context of a single
decision maker who wishes to combine information

obtained from various experts, rather than
opinions.  While, in the group decision problem,
agreement on probabilities would facilitate
solution of the problem, divergence of information
is beneficial for a decision maker seeking
independent sources of information as input  into
his or her judgement.
   Morris (1977), Winkler (1981), Lindley
(1983,1985) and others produce resolutions of the
panel of experts problem.  In Lindley (1983), the
decision maker has a diffuse prior on the quantity
of interest.  The conditional distribution of
expert assessments given the true value of the
parameter is multivariate normal with the
individual expert's assessment of the mean
allowing for bias and with the decision maker
having to specify the covariances between the
different expert assessments.  The decision
maker's posterior mean, given the expert
assessments, is then shown by Lindley to be a
weighted average of the experts' assessed means.
   Note that it is always at least as good to
include additional experts (if they are free) as
not to include them.  This is a different way of
stating the standard result on the non-negative
value of information.  However, this assumes that
all experts have appropriate incentives to gather
and to report information.
   In certain circumstances, the likelihood
function of expert assessments may be impossible
to specify.  In this case, insights obtained from
group decision theory and axiomatic approaches to
combination may help in aggregating assessments
without explicitly calculating posterior
distributions.  Axiomatic approaches are to be
found in Madansky (1964), Morris (1983) and the
extensive literature in Section 3 of Genest and
Zidek (1986).  Madansky (1964) considered the
linear opinion pool, a weighted arithmetic average
of probabilities, and showed that such a linear
opinion pool was not "externally Bayesian with a
fixed constitution."  That is, it was impossible
to find a set of non-negative weighting constants
such that a posterior based on a common likelihood
and a weighted prior would be the same as a
weighted posterior based on a common likelihood
and different priors.  Raiffa (1968) argues that,
in this case, the priors should be combined.
   The Wilson (1968) theory of syndicates yields
a geometric average of individual probability
distributions (or an arithmetic average of log
odds ratios) as the appropriate combined
distribution.  This avoids the difficulty found by
Madansky.
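   A small sketch contrasting the two pooling rules for two
experts' probabilities over the same three events; the
probabilities and weights are made up.

    # Two experts' distributions over three events (hypothetical).
    p1 = [0.6, 0.3, 0.1]
    p2 = [0.2, 0.5, 0.3]
    w1, w2 = 0.5, 0.5            # non-negative weights summing to one

    # Linear opinion pool: weighted arithmetic average of probabilities.
    linear = [w1 * a + w2 * b for a, b in zip(p1, p2)]

    # Logarithmic (geometric) pool: weighted geometric average, renormalized.
    unnormalized = [(a ** w1) * (b ** w2) for a, b in zip(p1, p2)]
    total = sum(unnormalized)
    geometric = [g / total for g in unnormalized]

    print([round(v, 3) for v in linear])     # [0.4, 0.4, 0.2]
    print([round(v, 3) for v in geometric])  # approximately [0.382, 0.427, 0.191]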
   An example due to John Pratt (cited in Raiffa
                                                    86

-------
(1968)) shows that linear opinion pools have an
additional deficiency of not preserving the
independence of events after combination.  Genest
and Zidek (1986) discuss recent investigations of
this independence preservation property.
   Another reaction to the lack of a well-
specified likelihood function is to dispense with
probability theory and rely on the Dempster-Shafer
theory of combining evidence.  (Shafer (1976),
Krantz and Miyamoto (1983).)

                ACKNOWLEDGEMENTS

   The research effort leading to this paper and
presentation has been partially supported by
research grants of the National Oceanic and
Atmospheric Administration under the auspices of
the Chesapeake Bay Stock Assessment Committee, the
Northeast Fisheries Center, Woods Hole,
Massachusetts and the Ocean Assessments Division,
Rockville, Maryland to the Center for Statistical
Ecology and Environmental Statistics of the
Department of Statistics at the Pennsylvania State
University, University Park.  The co-authors of
this paper are members of the Center; G. P. Patil
is the Director.  During this year, he is also a
Visiting Professor of Biostatistics at the Harvard
School of Public Health.

                   REFERENCES

 1.  Anandalingam, G.  and Chatterjee, K. (1986).
Personal communication.
 2.  Arrow, K. J. (1951).  Social Choice and
Individual  Values.  Yale University Press.
 3.  Babu, G. J. (1986).  A note on comparison of
conditional means.  Preprint.
 4.  Babu, G. J., Pennington, M., and Patil, G. P.
(1986).  Estimation of relative fishing power of
different vessels.  In Oceans 86 Proceedings:
Vol. 3: Monitoring Strategies Symposium, pp.
914-917.  Washington, D.C.
 5.  Boswell, M. T., Linder, E., Ord, J. K.,
Patil, G. P., and Taillie, C. (1986).  Time series
regression methods for the evaluation of the
causes of fluctuation in fishery stock sizes.  In
Oceans 86 Proceedings:  Vol. 3: Monitoring
Strategies Symposium, pp. 940-945.  Washington,
D.C.
 6.  Boswell, M. T., and Patil, G. P. (1985).
Marine Degradation and Indices for Coastal and
Estuarine Monitoring and Management.  A research
paper presented at the spring meetings of the
American Statistical Association and  the Biometric
Society, ENAR, North Carolina State University,
Raleigh, N.C.
 7.  Boswell, M. T., and Patil, G. P.  (1986).
Field based coastal and estuarine statistical
indices of marine degradation.  In Oceans 86
Proceedings:  Vol. 3: Monitoring Strategies
Symposium, pp. 929-933.  Washington,  D.C.
 8.  Byrne, C. J., and Fogarty, M. J.  (1985).
Comparison of fishing power of two fisheries
research vessels.  Preprint.
 9.  Dolby, G. R. (1976).  The ultrastructural
relation:  A synthesis of  the functional and
structural relations.  Biometrika, 63, 39-50.
10.  Genest, C., and Zidek, J. V. (1986).
Combining probability distributions:  A critique
and an annotated bibliography.  Statistical
Science, 1(1), 114-148.
11.  Gleser, L. J. (1983).  Functional, structural
and ultrastructural errors-in-variables models.
Proc. Bus. Econ. Statist. Sect., pp. 57-66,
Washington, D.C.: American Statistical
Association.
12.  Gleser, L. J. (1985).  A note on G. R.
Dolby's unreplicated ultrastructural model.
Biometrika, 72, 117-124.
13.  Hennemuth, R. C., Palmer, J. B., and Brown,
B. B. (1980).  A statistical description of
recruitment in eighteen selected fish stocks.
J.  Northwest Atlantic Fishery Science, 1,
101-111.
14.  Hennemuth, R. C. and Patil, G. P. (1983).
Implementing statistical ecology initiatives  to
cope with global resource impacts.  In Proc.  of
International Conference: Renewable Resource
Inventories for Monitoring Changes and Trends.
J.  F.  Bell and T. Atterbury, eds.  Corvallis,
Oregon, pp. 374-378.
15.  Hennemuth, R. C., Patil, G. P., and Ross, N.
P. (1986).  Encountered data analysis and
interpretation in ecological and environmental
work: Opening remarks.  In Oceans 86 Proceedings:
Vol. 3:  Monitoring Strategies Symposium.
Washington, D.C.
16.  Hennemuth, R. C., Patil, G. P., and
Simberloff, D. (1986).  Advanced Research
Conference on Frontiers of Statistical Ecology.
Intecol Newsletter, 16(1), 4.
17.  Hennemuth, R. C., Patil, G. P., and Taillie,
C. (1985).  Can we design our encounters?
CM1985/D:9, International Council for the
Exploration of the Sea, London.
18.  Hylland, A., and Zeckhauser, R. (1980).  The
impossibility of Bayesian group decision making
with separate aggregation of beliefs and values.
Harvard University.   (Mimeo).
19.  James, W., and Stein, C. (1961).  Estimation
with Quadratic loss.  Proceedings of  the Fourth
Berkeley Symposium  on Mathematical Statistics and
Probability, Vol. 1,  Berkeley: University of
California Press, pp. 361-379.
20.  Kendall, M. G., and Stuart, A. (1979).  The
Advanced Theory of Statistics, Vol. 2, 4th
edition.  Macmillan, New York.
21.  Krantz, D. H., and Miyamoto, J. (1983).
Priors and likelihood ratios as evidence.  J.
Amer. Statist. Assoc., 78, 418-423.
22.  Linder, E., Patil, G. P., Suter, G. W., II,
and Taillie, C.  (1986).  Effects of toxic
pollutants on aquatic resources using statistical
models and techniques to extrapolate acute  and
chronic effects benchmarks.  In Oceans 86
Proceedings:  Vol. 3:  Monitoring Strategies
Symposium, pp. 960-963.  Washington, D.C.
23.  Lindley, D. V. (1947).  Regression lines and
the linear functional relationship.  J. Royal
Statist.  Society, Series B, Vol.  9, 218-244.
24.  Lindley, D. V. (1983).  Reconciliation of
probability distributions.  Operations Research,
31, 866-880.
25.  Lindley, D. V. (1985).  Reconciliation of
discrete probability distributions.  In Bayesian
Statistics, Vol. 2, J. M. Bernardo, et al., eds.
North Holland, Amsterdam,  pp. 375-390.
26.  Madansky, A. (1964).  Externally Bayesian
groups.  Rand Corp. Memorandum RM-4141-PR,
December.
27.  Mandel, J. (1984).  Fitting straight lines
when both variables are subject to error.  J.
Qual. Technol., 16, 1-14.
28.  Morris, P. A.  (1977).  Combining expert
judgements.  Management Science, 23, 679-693.
29.  Morris, P. A.  (1983).  An axiomatic approach
to expert resolution.  Management Science,  29(1),
24-32.
30.  O'Connor, J. S.  and Dewling, R. T.  (1986).
Indices of marine degradation: Their utility.
Environmental Management. (In Press).
31.  Patil, G. P. (1984a).  On constructing a
crystal cube for environmental degradation.
Opening Technical Remarks at the Workshop on
Indices of Marine Degradation: An Overview  for
Managers.  November 15-16, 1984, Washington,  D.C.
32.  Patil, G. P. (1984b).  Some perspectives of
statistical ecology and environmental  statistics.
In Statistics  in Environmental  Sciences, ASTM STP
845, S. M. Gertz and  M. D. London, eds.  Amer.
Soc. Testing and Materials,  pp. 3-22.
33.  Patil, G. P.  (1985).  Fishery and forestry
management:  Preface.  Amer. Statist., 39(4),
361-362.
34.  Patil, G. P.,  Rao, C. R.,  and Zelen, M.
(1986).  A computerized bibliography of weighted
distributions and related weighted methods  for
statistical analysis  and  interpretation of
encountered data, observational studies,
representativeness  issues, and resulting
inferences.  Center for Statistical Ecology and
Environmental Statistics, The Pennsylvania State
University.  (Under preparation).
35.  Patil, G. P., and Taillie, C. (1981).
Statistical analysis of recruitment data for
eighteen marine fish stocks.  Invited Paper
Presented at the Annual Meetings of the American
Statistical Association, Detroit, MI.
36.  Patil, G. P. and Taillie, C. (1985). A
Conceptual Development of Quantitative Indices of
Marine Degradation  for Use in Coastal and
Estuarine Monitoring and Management.  A research
paper presented at  the spring meetings of the
American Statistical Association and the Biometric
Society, ENAR, North Carolina State University,
Raleigh, N.C.
37.  Pugh, W. L., Patil, G. P., and Boswell, M. T.
(1986).  The crystal cube for coastal and
estuarine degradation.  Sea Technology, September
1986, p. 33.
38.  Raiffa, H. (1968).  Decision Analysis.
Addison-Wesley.
39.  Schneeweiss, H. (1976).  Consistent
estimation of a regression with errors in the
variables.  Metrika, 23, 101-115.
40.  Shafer, G. (1976).  A Mathematical Theory of
Evidence.  Princeton University Press, Princeton,
New Jersey.
41.  Spitzer, P. R. (1985).  Private
communication.
42.  Spitzer, P. R., Poole, A. F., and Scheibel,
M. (1983).  Initial population recovery of
breeding ospreys in the region between New York
City and Boston.  In Biology and Management of
Bald Eagles and Ospreys, D. M. Bird, ed.
Harpell Press, Ste. Anne de Bellevue, Quebec.
43.  Sprent, P.  (1966).   A generalized
least-squares approach  to linear  functional
relationships.   (With discussion).  J.  Royal
Statist. Soc.,  Series B.  Vol.  28, 278-297.
44.  Summers, J. K., Polgar, T. T., Rose, K. A.,
Cummins, R. A., Koss, R. N., and Heimbuch, D. G.
(1984).  Assessment  of  the Relationships  among
Hydrographic Conditions,  Macropollution
Histories,  and  Fish  and Shellfish Stock in Major
Northeastern Estuaries.   Technical Report, Martin
Marietta Environmental  Systems.
45.  Suter, G. W., II, Rosen, A. E., and Linder, E.
(1986a).  Analysis of extrapolation error.  In
User's Manual for Ecological Risk Assessment, L.
W. Barnthouse and G. Suter, eds.  ORNL-6251, Oak
Ridge National Laboratory, Oak Ridge, TN.
46.  Suter, G. W., II, and Rosen, A. E. (1986b).
Comparative toxicology  of marine  fishes and
crustaceans.  ORNL-TM,  Oak Ridge  National
Laboratory,  Oak  Ridge,  TN.  (In press).
47.  Verner, J., Pastorok, R., O'Connor, J.,
Severinghaus, W., Glass, N., and Swindel, B.
(1985).  Ecology community structure analysis  in
the formulation, implementation,  and enforcement
of law and  policy.    Amer.   Statist., 39(4), Part
2, 393-402.
48.  Wertz.  W.,  and  Schneider, B. (1979).
Statistical  density  estimation: A bibliography.
International Statistical Review, 47, 155-175.
49.  Wilson, R.  (1968).   The theory of  syndicates.
Econometrica, 36, 119-132.
50.  Winkler, R. L.   (1981).  Combining  probability
distributions from dependent information sources.
Management  Science,   27, 479-488.
                                                   88

-------
                                      DISCUSSION
                Lloyd L. Lininger, U.S. Environmental Protection Agency
   The problems presented in this paper
give an idea of the diversity and
difficulty of the problems that are
routinely encountered at the U.S.
Environmental Protection Agency.  I wish
to draw attention to the diversity
because I think that it is currently not
practical to think of developing a
methodology for combining studies that is
simultaneously applicable to all problems.
I think the problems presented also point
out the basic reason we must work on the
problem of combining results of studies.
Namely, we are unable to do "the"
experiment that we believe is required to
make a decision.  We have to use the
information we have from other studies
that were designed for a different pur-
pose and possibly augment them with
further studies to make the decision.
This always requires the assumption of
some model, possibly with some error
structure included.
   I am confident that any attempt to
apply the methodologies advocated by the
previous speakers to these problems would
soon expose the difficulties and
uncertainties of those procedures.  I do
believe that those attempts should be
made.  They would result in systematic
approaches to modeling the underlying
problems and focus attention in the most
appropriate places.
   The recruitment data presented is use-
ful for emphasizing several points.  When
combining studies we must always keep in
mind the "question" we wish to address.
The authors state that one of the
objectives is to estimate a "universal"
recruitment distribution.  First, note
that the solution will be a "distribution"
which is a somewhat unusual outcome for
an experiment.  Details of the analyses
are lacking, but I would be interested in
the sequential aspect of each of the
individual samples, the possible
relationships between species and the way
the 18 species were selected from all
species.  Finally, I would ask why a
"universal" distribution is desired.  Too
frequently, one looks for a question that
combining data sets might answer, instead
of looking for ways to answer a question
by combining data sets and developing an
appropriate model.
   The "crystal cube" problem is also a
long standing problem in statistics.  How
do we reduce a multivariate model to a
one-dimensional model that will serve as an
index of the phenomenon in which we are
interested?  Unfortunately, the termino-
logy gives no help in deriving or under-
standing how such an index would be
derived.  The relationships between faces
and some "geometrical" concepts would be
helpful before this terminology is
accepted.
   The problems presented in this paper
tend to combine studies each of which
collected information on the same
problem.  Problems which are "linked
linearly" as in the exposition by Eddy
and Wolpert require different modeling
approaches.
   Finally, any methodology to combine
data across studies assumes some under-
lying model.  If extensive effort has
gone into developing a model to combine
studies, then it is relatively straight
forward to do simulation studies to
evaluate the characteristics of the model.
None of the presentations exploited this
useful technique.
                                           89

-------
APPENDIX A: ASA/EPA Conference on Statistical Issues in Combining Environmental Studies Program

-------
                          PROGRAM

                     WEDNESDAY, OCTOBER 1

                             9:00 a.m.
                          INTRODUCTION

               Kinley Larntz, Washington State University
          Dorothy Wellington, Environmental Protection Agency

                             9:10 a.m.
                      CONFIDENCE PROFILES:
     A BAYESIAN METHOD FOR ASSESSING HEALTH TECHNOLOGIES

             David Eddy & Robert Wolpert, Duke University

                             10:40 a.m.
                            BREAK

                             10:55 a.m.
                           DISCUSSANT

          David Lane, University of Minnesota/McGill University

                             11:30 a.m.
COMPUTER DEMONSTRATION OF CONFIDENCE PROFILES METHODOLOGY

                             12:15 p.m.
                            LUNCH

                             1:30 p.m.
           META-ANALYSIS AND ENVIRONMENTAL STUDIES

                 Larry V. Hedges, University of Chicago

                             3:00 p.m.
                            BREAK

                             3:15 p.m.
                           DISCUSSANTS

              Chao Chen, Environmental Protection Agency
              James M. Landwehr, AT&T Bell Laboratories

                              4:00 p.m.
                        FLOOR DISCUSSION

                             4:45 p.m.
                         RECEPTION

                     THURSDAY, OCTOBER 2

                          9:00 a.m.
          INTEGRATION OF EMPIRICAL RESEARCH:
          THE ROLE OF PROBABILISTIC ASSESSMENT

     Thomas Feagans, Decisions in a Complex Environment, Inc.

                          10:30 a.m.
                         BREAK

                          10:45 a.m.
                        DISCUSSANTS

         Harvey Richmond, Environmental Protection Agency
         Anthony D. Thrall, Electric Power Research Institute
           Lee Merkhofer, Applied Decision Analysis, Inc.

                          11:30 a.m.
                     FLOOR DISCUSSION

                          12:00 p.m.
                        LUNCH

                         1:15 p.m.
 STATISTICAL ANALYSIS OF POOLED DATA IN ECOLOGICAL AND
        ENVIRONMENTAL WORK WITH SOME EXAMPLES

          G. J. Babu, M. Boswell, K. Chatterjee, E. Linder,
                    G. P. Patil, & C. Taillie
                  Pennsylvania State University

                          2:30 p.m.
                        DISCUSSANT

         Lloyd Lininger, State University of New York, Albany

                          3:00 p.m.
                         BREAK

                         3:15 p.m.
              CONCLUDING PANEL DISCUSSION

            Kinley Larntz, Washington State University

-------
                     APPENDIX B: Conference Participants

          ASA/EPA CONFERENCE ON STATISTICAL ISSUES
             IN COMBINING ENVIRONMENTAL STUDIES
                         October 1-2, 1986

                     OMNI SHOREHAM HOTEL
                         WASHINGTON, DC
G.J. Babu
Pennsylvania State University
Department of Statistics
University Park, PA  16802

R. Clifton Bailey
U.S. EPA
6507 Divine Street
McLean, VA 22101

James C. Baker
U.S. EPA Region 8
999 18th Street, Suite 1300
Denver, CO 80202-2413

Ted O. Berner
Battelle Columbus Division
2030 M Street, N.W., Suite 700
Washington, DC 20036

M. Boswell
Pennsylvania State University
Department of Statistics
University Park, PA  16802

Robert N. Brown
Food and Drug Administration
200 C Street, S.W., MC-HFF-118
Washington, DC 20204

K. Chatterjee
Pennsylvania State University
Department of Statistics
University Park, PA  16802

Chanfu Chen
Lederle Laboratories
Building 60, Room 203
Pearl River, NY 01965

Chao Chen
U.S. EPA
401 M Street, S.W., RD-689
Washington, DC 20460

Jean Chesson
Battelle
2030 M Street, N.W.
Washington, DC 20036
Kee-whan Choi
Exxon Corporation
Four Bloomingdale Drive, #517
Somerville, NJ 08876

Vincent James Cogliano
U.S. EPA
ORD, ORE A,  CAG
401 M Street, S.W., MC-RD-689
Washington, DC 20460

Margaret Conomof
U.S. EPA
401 M Street, S.W.
Washington, DC 20460

Giles Crane
Department of Health
John Fitch Plaza, CN-360
Trenton, NJ  08625

John P. Creason
U.S. EPA
HERL/Biometry Division/MD-55
Research Triangle Park, NC 27711

J. Michael Davis
U.S. EPA
MD-52, ECAO
Research Triangle Park, NC 27711

Hari H. Dayal
Fox Chase Cancer Center
430b Rhawn Street
Philadelphia, PA  19111

Elizabeth A. Dutrow
U.S. EPA
401 M Street, S.W. (TS-798N)
Washington, DC  20460

David Eddy
Duke University
Center for Health Policy Analysis
Durham, NC 27706
                                      92

-------
Thomas B. Feagans
Decisions in a Complex Environment, Inc.
636 Wayland Place
State College, PA 16803

Bernice T. Fisher
U.S. EPA
1600 S. Eads Street, #5255
Arlington, VA 22202

Ruth E. Foster
U.S. EPA
401 M Street, N.W.
Washington, DC  20460

Michael E. Ginevan
9039 Sligo Creek Parkway, #1108
Silver Spring, MD 20901

John Goldsmith
U.S. EPA
Biometry Division, MD-55
Research Triangle Park, NC 27711

Noel P. Greis
Bell Communications Research
331 Newman Springs Road
Red Bank, NJ 07701

Gary F. Grindstaff
U.S. EPA
TS-798, 401 M Street, S.W.
Washington, DC  20460

Vic Hasselblad
Center for Health Policy Research
 and Education
Duke University
Durham, NC  27706

Larry V.  Hedges
University of Chicago
College of Education
Chicago, IL 60637

Robert W. Jernigan
U.S. EPA-SPB
American University
Washington, DC 20016

Woodruff B. Johnson
U.S. EPA
401 M Street, S.W., Room 223
Washington, DC 20460

Borko D. Jovanovic
University of Massachusetts
Department of Public Health
Amherst,MA 01003
Marvin A. Kastenbaum
The Tobacco Institute Inc.
1875 Eye Street, N.W.
Washington, DC 20006

Richard F. Kent
U.S. EPA
1 Scott Circle, N.W. #716
Washington, DC 20036

Kay T. Kimball
Oak Ridge National Laboratory
P.O. Box X, 4500S, MSF-260
Oak Ridge, TN 37831

Kathleen D. Knox
U.S. EPA
401 M Street, S.W., PM-223
Washington, DC 20460

Herbert Lacayo, Jr.
U.S. EPA
4520 King Street, #502
Alexandria, VA 22302

Emanuel Landau
American Public Health Assn.
1015 15th Street, N.W.
Washington, DC 20005

James M. Landwehr
AT&T Bell Laboratories
Statistical Models and Methods
  Research Department
Murray Hill, NJ 07974

David Lane
University of Minnesota
270 Vincent Hall
Minneapolis, MN 55455

Kinley Larntz
Washington State University
Program in Statistics
Pullman, WA  99164-6212

Walter S. Liggett, Jr.
Center for Applied Mathematics
National Bureau of Standards
Gaithersburg, MD  20899

E. Linder
Pennsylvania State University
Department of Statistics
 University Park, PA  16802
                                       93

-------
 Lloyd Lininger
 State University of New York-Albany
 (U.S. EPA)
 Albany, NY

 Bertram D. Litt
 U.S. EPA
 OPP/Statistics (TS-769)
 14502 Woodcrest Drive
 Rockville,MD  20853

 Rebecca A. Madison
 U.S. EPA
 401 M Street, S.W.
 Washington, DC 20460

 Sam Marcus
 National Center for Health
  Statistics
 13417 Keating Street
 Rockville,MD  20853    .

 Elizabeth H. Margosches
 U.S. EPA (TS-798)
 401 M Street, S.W.
 Washington, DC 20460

 Lee Merkhofer
 Applied Decision Analysis
 300 Sand Hill Road
 Menlo Park, CA 94025

 Barry I. Milcarek
 Mobil Oil Corporation
 150 East 42nd Street, Room 1324
New York, NY  10017

Patricia Murphy
 U.S. EPA-Cincinnati
26 West Saint Clair
 Cincinnati, OH 45268

 Tom M. Murray
 U.S. EPA
401 M Street, S.W.
 Washington, DC 20460

 CJ. Nelson
 U.S. EPA (TS-798)
401 M Street, S.W.
Washington, DC 20460

Barry D. Nussbaum
 U.S. EPA
EN-397F, 401 M Street, S.W.
 Washington, DC 20460
G.P. Patil
Pennsylvania State University
Department of Statistics
University Park, PA 16802

Susan A. Perlin
U.S. EPA
401 M Street, S.W.
Washington, DC 20460

Lorenz R. Rhomberg
U.S. EPA (TS-798)
401 M Street, S.W.
Washington, DC 20460

Harvey Richmond
U.S. EPA
OAQPS, MC-MD12
Research Triangle Park, NC  27711

Wilson B. Riggan
U.S. EPA
HERL/Biometry Division/MD-55
Research Triangle Park, NC  27711

Frederick H. Rueter
CONSAD Research Corporation
121 North Highland Avenue
Pittsburgh, PA  15217

Joel Schwartz
U.S. EPA
1207 Fourth Street, S.W.
Washington, DC 20024

Judy A. Stober
U.S. EPA
26 West St. Clair
Cincinnati, Ohio 45268

Miron L. Straf
Committee on National Statistics
National Academy of Sciences/NRC
2101 Constitution  Avenue, N.W.
Washington, DC 20418

David J. Svendsgaard
U.S. EPA
MD-55
Research Triangle Park, NC  27711

C. Taillie
Pennsylvania State University
Department of Statistics
University Park, PA  16802
                                        94

-------
Anthony D. Thrall
Electric Power Research Institute
P.O. Box 10412
Palo Alto, CA  94303

Harit Trivedi
Pennsylvania Department
 of Environmental Resources
Bureau of Information Systems
Harrisburg, PA  17110

Alta Turner
Ebasco Services Inc.
160 Chubb Avenue
Lyndhurst,NJ 07071

Paul G. Wakim
American Petroleum Institute
1220 L Street, N.W.
Washington, DC  20005

John Warren
U.S. EPA
03023 (PM-223)
401 M Street, S.W.
Washington, DC  20460
Dorothy Wellington
U.S. EPA
401 M Street, S.W.
Washington, DC 20460

Herbert L. Wiser
U.S. EPA
Office of Air and Radiation
ANR-443, USEPA
Washington, DC 20460

Robert Wolpert
Duke University
Center for Health Policy Analysis
Durham, NC 27706

You-yen Yang
U.S. EPA
401 M Street, S.W.
Washington, DC  20460
                                      95

-------
as the latter, but in discussions it is sometimes
applied to the former as well.6  In this paper the
usual definition of meta-analysis is both accepted
and adhered to, with reanalysis of pooled data
considered  to be  a third form of secondary
analysis and not a form of meta-analysis.  The
distinction between meta-analyses and other
analyses is then a clear-cut distinction between
analysis of empirical data in the case of the latter
and analyses of results of empirical studies in the
case of the former.
    Returning to the rationale for doing meta-
analysis, it has been cast along the following
lines: by using the deductive power of mathematics,
in the form of statistical techniques, the results
of a large set of studies can be integrated more
satisfactorily than by narration, just as the
analysis of data in one of the original studies is
done more satisfactorily by such techniques than
by narration.7  In the process of putting this
rationale into some perspective, we can work
toward specification of the function served by
meta-analysis.
    There  are problems for both meta-analysis
and secondary analysis of pooled data. For
secondary analysis of pooled data, the raw data is
not available in many cases.  For meta-analysis,
conventional  statistical  procedures   are
problematic for both statistical and conceptual
reasons.8  For both, study designs tend to differ
in significant respects.
    In the face of such problems, two extremes
are to be avoided.  On the one  hand, application
of statistical methods is useful even when the
conditions under which they are applied  are not
perfect in some sense.  Also, methods more
suitable for meta-analysis are being developed.
The idea of developing and applying meta-
analytic methods is unimpeachable, and not using
them due to inertia or purism is unwarranted.
    On the other hand, it is important to avoid
false dichotomies.  Various means of informing
policy decisions can be complementary rather
than viewed as competitors.  Statistical methods
are one powerful means of bringing the deductive
power of mathematics to bear,  means that serve
an important function.  But narration and other
uses of mathematics which serve other functions
need to be brought to bear as well.
   The function in the decision-making process
provided by meta-analyses is the application of
statistical algorithms to the results of primary or
secondary empirical studies.  The purpose is to
help deduce, infer, and consolidate implications
of sets of studies.  In so doing meta-analyses can
reduce the amount of narration needed in state of
information assessments.
   The choice of algorithms to be applied in
meta-analyses is  subjective, but not arbitrary.
Both substantive and statistical theoretical
principles are applied where possible in making
these choices.  The choices are generally both
judgmental and affected by substantive empirical
content.  Although two different persons might
choose two different algorithms and/or sets of
studies in a given case, any two persons would
get the same result from correct application of the
same algorithm to the same  set of results.9
Thus, although all three types of integration have
both subjective and objective  (intersubjective)
aspects, meta-analysis has more objective aspects
than the other two. For non-empirical Bayesian
analyses some judgment would  also enter as
input to the algorithm in the form of subjective
priors.

3.0 State of Information Assessments
   Before making important policy decisions, it
is useful to assess what is known and what is
uncertain about the relationship between options
and their consequences. Such assessments serve
the ultimate purpose of the society maintaining as
much control as possible over its future. Societal
decision-making agents should make important
decisions with what  the  society knows  as a
collective available to them. They should not
make such decisions under the assumption that
we know more about the connections between
decision alternatives and their consequences than
we do.10
   Such assessments have been done in various
forms and in all these forms, narration has played
an important role.  While meta-analyses can
reduce the need for  narration,  they cannot
eliminate this need.  Non-formal  exposition is
essential for the task of interpreting the formal
results  of  primary,   secondary,  and  meta-
analyses,  particularly in interpreting  their
implications for the policy decisions at hand.
Even for primary research, "often the statistical
analysis is just a preliminary to the discussion of
the underlying meaning of the data."11
   State  of  information  assessments  (or
scientific assessments) serve the function within
the decision-making process of assessing the state
of knowledge on which one or more important
decisions are to be based. It is not the purpose of
such  assessments  to  add  to the  state of
knowledge, either through empirical inquiry or
through statistical analyses of data.  Rather, the
purpose is to assess the knowledge accumulated
up to a point in time  that is  relevant to the
decisions to be made at that time.
   An important issue concerns whether there is a
quantitative measure of the degree to which a
given hypothesis, theory, or other proposition
has been confirmed at a given time. Were such a
measure to exist, it could play an important role
in state of information assessments.  The degree
to which various theories about the shape and
location of dose-response relationships were
confirmed could be addressed,  for example.
Another example would  be discussion of the
degree  to  which the  existence of  a  causal
relationship between a given  pollutant and a
specified effect was disconfirmed by one or more
negative studies.
   This issue has  received a great deal of
attention  from philosophers  of  science,
measurement theorists, and statisticians interested
in the foundations of their subject. Up until the
mid-nineteenth century, attempts were made to
construct theories of induction which guaranteed
the truth of  the conclusions  to which  their
applications led.  As soon as it became clear that
some uncertainty or doubt about the truth of
conclusions of inductive inferences is inevitable,
methodologists began to consider scientific
theories to be more or less probable, more or less
worthy of rational belief.12  Various attempts to
reduce inductive logic to probability theory have
followed.  Despite efforts by such outstanding
intellects as De Morgan,13 Jevons,14 Peirce,15
Keynes,16 and Carnap,17 all such attempts have
failed.
   These attempts have failed  for a reason.
Degree of confirmation or inductive support and
probability are distinct concepts. The difference
is subtle, but real and important.  The failure to
discern  and explicate  this  distinction  has
bedeviled the history of these two (historically)
conflicted topics.
   Most generally, the concept of probability has
to do with the balance  of  favorable  and
unfavorable evidence; the concept of degree of
confirmation has to do with the amount  (and
kind) of supporting evidence.  If there is little
evidence,  favorable or unfavorable,  for a
proposition, probability assignments concerning
the truth of the proposition can reasonably be
near 0.5.  In contrast, by any reasonable account
of confirmation, the degree of confirmation for
the proposition is near 0.0.
   Perhaps the most thorough attempt to develop
a confirmation theory based on inductive logic
was that of the philosopher of science, Rudolf
Carnap.  It is Carnap's terminology, "degree of
confirmation," that is being used to describe the
quantitative concept that is important for the
state  of  information  integrative function.
Carnap's work gave rise to thorough critiques of
objectivist confirmation theories, and in his later
years he began shifting toward a Bayesian point
of view. 18
   The overoptimism concerning the possibility
of a fully general and objective framework that
pervades the  historical attempts  to develop
inductive logic has  been  another  obstacle to
progress.    Measurement of degrees  of
confirmation and probability are, most generally,
more  subjective than  meta-analyses.    The
objectivity, in the sense of intersubjectivity, that
is inherent in the meta-analysis function is
unachievable for the other two integrative
functions.  We have used the appellation "meta-
analysis" to name an algorithmic function since
that term is gaining wide usage and the usage
seems to roughly correspond to that function. (It
should  be kept in  mind that from a  decision-
theoretic point of view, it is the functions, not the
semantics, that are important.) One of the major
keys to progress that has been made recently in
confirmation theory  is  relaxation of  the
requirements of objectivity. 19,20 Although
Shafer  and  Krantz have made  significant
progress in what we are calling confirmation
theory, they do not recognize the distinction
between probability and degree of confirmation.
Hence,  like so many before them, they refer to
what we are calling degree of confirmation as
"probability."  Likewise,  Cohen  makes  what
appears to be a similar distinction in terms of
"Pascalian  probability"  and  "BacJonian
probability" 2l

4.0 Probabilistic Assessment
    The role of probabilistic assessment in the
decision-making process  is to use  whatever
information and statistical  analysis  exists at the
time the decision in question is to be made and
relate the consequences the decision is designed
to affect back to decision alternatives in a  way
that will achieve as much control as is feasible
under  the  circumstances.  Thus, the third
integrative  function, probabilistic assessment,
uses the outputs of the first two integrative
functions,  meta-analysis  and the state  of
information assessment.  It is in turn an input to
valuation  analysis, decision  analysis, and
ultimately decision-making. Valuation analyses
and decision analyses accomplish two other
functions needed in support of decision making.
   Control is a meta-objective concerning the
relationship between decision alternatives and
primary objectives under uncertainty.  Control in
the sense meant here is analogous to the
tightness/slack aspect of a steering mechanism
and is not tied to any particular regulatory policy
direction.  The  ultimate justification of the
approach suggested to probabilistic assessment is
that (in general) its greater generality gives more
control.
   Risk  assessments  are   probabilistic
assessments in which the consequences are
adverse.24  Risk assessments also include
description of the seriousness of the adverse
effects.   The primary  objectives   in  risk
assessments are the avoidance of adverse health
effects.

4.1 Probability
   There are many ways the complex topic of
probability can be addressed. In this discussion,
we address the relationship between the levels of
generality possible in  making  probability
assignments and the perspective  of the user of
these assignments. This discussion will provide
the basis for the selection of an approach to
probabilistic (risk) assessment.
   There are various possible interpretations of
probability statements from the point of view of
how they came  to be made.  These various
interpretations have been much discussed  for a
long time. Three levels of generality fall out of
all these discussions. At the lowest level of
generality, probability assignments are made as
the ratio of two nonnegative integers; if one of
the integers is larger it is the denominator.  This
ratio may result from the application of a logical
or relative frequency mathematical  model.
   The user of probability assignments  only
cares about how the probabilities  were assigned
in so far as it sheds light on how well a set of
such assignments  can be expected to predict in
the probabilistic sense.  Measurement of how
well sets of probabilities predict involves two
criteria, the criteria of calibration and resolu-
tion.25  A canonical process of probability
assignments can be defined in terms of these two
criteria.26  A canonical process of probability
assignments is one in which the assignments are
distributed randomly over  the  [0,  1]  range
(canonical resolution) and  approach perfect
calibration as a limit (canonical calibration).
Many  discussions of  probability  seem  to
implicitly assume a canonical process.
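    To make the two criteria concrete, the sketch below scores a set of probability assignments against observed outcomes using the familiar decomposition of the Brier score into a reliability (calibration) term and a resolution term.  This is one common way of quantifying the two ideas and is offered only as an illustration; it is not necessarily the formulation intended in the references cited above.

    import numpy as np

    def calibration_resolution(forecasts, outcomes):
        # Murphy-style decomposition of the Brier score.
        # forecasts: probability assignments in [0, 1]; outcomes: 0/1 events.
        # Reliability near 0 indicates good calibration; larger resolution
        # indicates assignments that better separate occasions with different
        # outcome frequencies.
        f = np.asarray(forecasts, dtype=float)
        o = np.asarray(outcomes, dtype=float)
        base_rate = o.mean()
        reliability = 0.0
        resolution = 0.0
        for p in np.unique(f):              # group identical assignments
            idx = (f == p)
            obs_freq = o[idx].mean()
            reliability += idx.sum() * (p - obs_freq) ** 2
            resolution += idx.sum() * (obs_freq - base_rate) ** 2
        n = len(f)
        return reliability / n, resolution / n, base_rate * (1 - base_rate)

    # Hypothetical assignments and outcomes
    rel, res, unc = calibration_resolution([0.1, 0.1, 0.8, 0.8, 0.8], [0, 0, 1, 1, 0])
    print(rel, res, unc)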
    Canonicity does not necessarily obtain even
at the lowest level of generality in the making of
probability assignments.  The phenomenon of
ambiguity, illustrated by the Ellsberg Paradox
experiments, demonstrates  this  fact.27 The
resolution of that paradox revolves around the
theme of carefully distinguishing and analyzing
the diverse perspectives of the maker and user of
probability assignments. 28
    A second level of generality in making
probability assignments is the degree of belief
interpretation developed by the English philoso-
pher, Frank Ramsey,29 and the Italian statisti-
cian Bruno deFinetti.30  This interpretation has
been much used by decision analysts and
Bayesian statisticians.31  At this level of
generality, probability  assignments can be made
by  using a particular mathematical model, but
only if the situation is deemed to justify the use
of such a model. Many situations obviously do
not.  In such cases,  final integration of the
available  information is done mentally and
probability assignments are made judgmentally.
Algorithmic   devices,    such   as  the
mathematical/statistical models  used  in
mathematical  statistics  in general and meta-
analysis in particular, can be very useful aids in
arriving at these judgmental assignments. Also,
probabilistic  models can   be  built  which
decompose the relationship in question so that the
assignment can be derived from less difficult
assignments.   Thus,  the amount of  mental
integration required is reduced  to  more
manageable size.
    At this level of generality, the fact that from
the user's perspective canonicity may not obtain
becomes  critically important.   It has  been
considered a positive  characteristic  of the
  Ramsey/deFinetti theory that despite its greater
generality, and hence flexibility in application, its
  uninterpreted formal properties are equivalent to
  those of the narrower interpretations mentioned
  above; in other words, the usual probability
  mathematics   applies.    However,  when
  judgmental assignments are made using states of
  information  that will  not give  canonical
  probabilistic  prediction,  from  a  user's
  perspective, two probability assignments which
  are numerically  equal should be interpreted
  differently.
      In short, at the more general level, users
of probabilities should interpret them in terms of
  both their numerical value  and the  state of
  information on which they  are  based.  The
  divergency of  probability from canonicity in
  general  gives  rise  to  the  phenomenon of
  secondary risk.32 The effect of secondary risk is
  to give less control to the user than he or she
  would have with canonical prediction. However,
  the user will have less control than he or she
  could  have under the circumstances  of the
  existing state  of information if  he or  she
  misinterprets the probabilities to be canonical (in
  terms of probabilistic prediction) when they are
  not.
      It happens that since both the numerical value
  of a probability and the state of information on
  which  it is  based  are  important,   the
  Ramsey/deFinetti theory of  probability is
  conceptually flawed, even from the perspective
  of applying it in making probability assignments.
  According to the theory, personal probabilities
  are revealed by sets of choices between pairs of
  bets.  One bet in each pair is  canonical by
  definition, and the other bet is not. But since
  states  of information on which probabilities are
  based  matter and since the states of information
  are different for the  two bets within each pair,
  choices between  each of these pairs of bets do
  not necessarily reveal personal probabilities. In
  general, the choices are  affected by  both the
  person's belief and the person's attitudes toward
  secondary risks.
      Fortunately, there is  an even  more general
  theory of probability which does not have this
  problem.  It is no accident that it is a more
  general  theory that lacks this problem.  Even
  though the Ramsey/deFinetti theory is more
  general than the first theory discussed above, it is
  not general enough.  Both of the theories which
  are not general enough to serve as the basis of
a normative framework for probabilistic (risk)
  assessment are available, as special cases of the more
general theory, for the domains to which they are
general  enough to apply.  These domains are
large sets of problems to which the methods of
classical and Bayesian statistics, respectively,
appropriately apply.
   The more general  theory  of probability
needed for the integrative function we are calling
probabilistic assessment is the theory axiomatized
by Bernard Koopman.33,34  On this view,
probability is an intuitive comparative  relation
that in general is only partially ordered.
   The distinctive characteristic of an intuitive
concept is that a large range of statements that
employ a term signifying the concept can be
understood and the meaning of the term cannot
be  explained  in  terms  of more primitive
concepts.35  An intuitive concept can be applied
correctly without using an explicit set of rules of
application.36  An intuitive concept  is itself
primitive.
   Just as for the Ramsey/deFinetti theory, the
key to eliciting probability judgments under the
Koopman  theory  is to set up  a comparison
between the situation of interest and a canonical
situation.  The fact that under this theory
probability is only a partially ordered comparative
relation means that for some  comparisons
between the situation of interest and chosen
canonical probabilities, the person making the
probability judgments does not judge either to be
more likely than the other.  The fact that there is
a range of canonical probabilities for which this
is true in general means that in general, lower
probabilities and upper probabilities are elicited
rather than sharp probabilities.37
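    The following schematic, a hypothetical Python fragment rather than a procedure prescribed by Koopman, suggests how such comparisons leave an interval rather than a sharp value: the event of interest is compared with canonical events of known probability, and comparisons the assessor declines to make widen the gap between the lower and upper probabilities.

    def elicit_bounds(compare, grid):
        # compare(p) returns 'event' if the event of interest is judged more
        # likely than a canonical event of probability p, 'canonical' if it
        # is judged less likely, and None if the assessor cannot say.
        # Returns (lower, upper) probability bounds for the event.
        lower, upper = 0.0, 1.0
        for p in grid:
            judgment = compare(p)
            if judgment == 'event':        # event at least as likely as p
                lower = max(lower, p)
            elif judgment == 'canonical':  # event no more likely than p
                upper = min(upper, p)
        return lower, upper

    # Hypothetical assessor: sure the event beats 0.2, sure it loses to 0.6,
    # and unable to compare it with canonical probabilities in between.
    def assessor(p):
        if p <= 0.2:
            return 'event'
        if p >= 0.6:
            return 'canonical'
        return None

    print(elicit_bounds(assessor, [i / 20 for i in range(21)]))   # (0.2, 0.6)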
    Axiomatically, the fact  that in  general
probability assignments are not sharp means that
an axiom which holds in the less general versions
of probability (the additivity axiom)  does not
hold. The fact that in the Koopman theory the
axioms associated with canonical situations do
not apply is actually an advantage since there is
less chance for confusion; that is, there is less
chance that a user will interpret upper and lower
probabilities to be canonical probabilities. The
impression that it is  an advantage of  the
deFinetti/Ramsey axioms that they are equivalent
to the  canonical  (Kolmogorov) axioms is  a
misimpression  generated by  focusing on
convenience of calculation for the producer rather
than on the needs of the user.  Assuredly, users
have to learn how to interpret correctly the
different kinds of output which result.
                                                54

-------
4.2 Normative Framework
    The  approach  to probabilistic  (risk)
assessment described below  was developed
within the U.S. EPA Office  of Air Quality
Planning and Standards risk analysis program by
William F. Biller and the  author.38  Support
analyses  for  national ambient air quality
standards (NAAQS) must deal with enormous
complexity.  Thus, as general as possible an
approach was needed and developed. Because of
its  generality, the approach serves well as a
normative framework for probabilistic assess-
ments.  A normative framework should be as
general as possible because it is relatively easy to
reduce the generality of such a framework for the
purposes of a specific application  where the
generality is either not needed or not advisable,
but almost impossible to increase the generality
of a framework within the context of a specific
application.
    The process of conducting a probabilistic
(risk) assessment  can be  thought  of as
conducting a set of subprocesses that together
make up the whole assessment:  First, there is the
subprocess of constructing a probabilistic risk
model.  Second, there is  the subprocess of
selecting  those who  are to make probability
assignments, often substantive experts.  Third,
there is the subprocess of eliciting probability
assignments. Finally, there  is the subprocess of
computing and presenting outputs.

    4.2.1 Model Construction
    The process of constructing a probabilistic
risk  model proceeds  most  rationally  in
accordance with a "back logic."  The starting
point is a set of adverse effects that regulatory
policy  could  reduce.   Starting  with  the
consequences to be reduced, possible regulatory
alternatives are identified which could reduce
these consequences if adopted. Whether a causal
relationship exists is uncertain in some cases. In
all  cases, the exact quantitative relationship
between policy alternatives and consequences is
uncertain.  Both of these uncertainties can be
handled formally within the framework.40
    Assuming  the relationship between  policy
alternatives and consequences of concern is
decomposed for  the  purpose   of reducing
judgments to more manageable size, back logic is
used to  assure  that  the component models
interface appropriately. For example, suppose the
relationship between possible NAAQS's for a
pollutant and adverse health effects to which the
pollutant contributes is decomposed into three
models:
    1. standard-exposure model,
    2. exposure-dose (physiological) model,
    3. dose-response model.
It  is the input-output  structure of the dose-
response model that is chosen  first; then the
exposure-dose model is chosen so that the dose
given as an output is in the form of the input
needed by the dose-response model.  Similarly,
the standard-exposure model is chosen so that the
exposure  output is the input needed  by the
exposure-dose model.   The component  models
are chosen to give the most accurate output, with
the form of the output a given constraint.  This is
the way the form of standards is best chosen,
rather than relating standards directly to effects.
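    As a schematic of how the three component models interface, the following Python fragment chains entirely hypothetical standard-exposure, exposure-dose, and dose-response functions by simple Monte Carlo; the functions and numbers are invented solely to illustrate the input-output structure and are not taken from any NAAQS analysis.

    import math
    import random

    # Hypothetical component models; each model's output is the next model's input.
    def standard_to_exposure(standard):
        # Standard-exposure model: draw a personal exposure given the standard.
        return random.lognormvariate(0.5 * standard, 0.3)

    def exposure_to_dose(exposure):
        # Exposure-dose (physiological) model: draw a delivered dose.
        return exposure * random.uniform(0.4, 0.6)

    def dose_to_response(dose):
        # Dose-response model: probability of the adverse effect at this dose.
        return 1.0 - math.exp(-0.2 * dose)

    def expected_risk(standard, n_draws=10000):
        # Chain the three components and average over the sampled uncertainty.
        total = 0.0
        for _ in range(n_draws):
            dose = exposure_to_dose(standard_to_exposure(standard))
            total += dose_to_response(dose)
        return total / n_draws

    for s in (1.0, 2.0, 3.0):   # candidate standards in hypothetical units
        print(s, round(expected_risk(s), 3))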
    These component models are probabilistic
models, so the overall model that relates decision
options to the consequences  to be affected is a
probabilistic model.  There is  ordinarily  no
"correct" degree of fineness for the structure of
the representation.  Obviously  a factor is the
importance  of the problem  and the resources
available  for the assessment.   A finer and
therefore larger model costs  more to build and
implement.  This is one of several questions of
scale that must be decided before doing the
assessment.  The ideal would be to have
alternative  models of varying fineness  of
structure and then crosscheck and interrelate
them.  Such an approach would  make the most
use of indirect (background) information and
coherence in the sense of consistency.  Making
maximum use of  indirect  information and
coherence is most important for situations in
which direct data is sparse.
    Available data, statistical analyses of data,
and indirect information are all considered in the
process of constructing  and  choosing  the
probabilistic (risk) model.   If two possible
alternative models appear to be of equal merit as
representations of the  situation, but one has
better information available  to support probability
judgments and other inputs, then it is preferred.
If  there   is  better   data  for the  lesser
representation, there is a tradeoff. If the better
representation (model) is  a  refinement of the
lesser  representation,  then  the  better
representation is preferred since the data available
for the grosser (lesser) representation constrains
the inputs to the better representation.
                                              55

-------