Confounding and Selection Bias in Case Control Studies


United States Environmental Protection Agency
Office of Radiation Programs
Washington DC 20460
EPA 520/8-81-004
January 1981
Radiation

-------
          Roderick J. A. Little
           Paul R. Rosenbaum
              January 1981
  Division of Statistics and Applied Mathematics
        Office of Radiation Programs
     U.S. Environmental Protection Agency
          Washington, D.C. 20460

-------
Abstract
In case-control studies, the role of adjustments for bias, and
in particular the role of matching, has been extensively debated.
However, the absence of a formal statement of the problem has led to
disagreements, confusion, and occasionally to erroneous conclusions.
This paper formulates precisely and answers the following questions:
1) When is it necessary to adjust for a variable Z?
2) Given that the data analysis will adjust for the variable Z, is
matching on Z the most efficient method of selecting controls?
In answering these questions, we draw a sharp distinction
between bias caused by confounding in the population and bias caused
by the method used to select the sample.

-------
Acknowledgment
The authors wish to acknowledge valuable discussions
with Donald B. Rubin on the subject matter of this paper.

-------
CONTENTS
Abstract
1. Introduction
2. Conditions Under Which Adjustment Is Necessary
3. Does Matching Increase Power?
References

-------
1. Introduction
In case-control studies, the role of adjustments for bias, and in
particular the role of matching, has been extensively debated (1-7).
However, the absence of a formal statement of the problem has led to
disagreements, confusion, and occasionally to erroneous conclusions.
In this paper we formulate precisely and answer the following
questions:
a) When is it necessary to adjust for a variable Z?
b) Given that the data analysis will adjust for the variable Z, is
matching on Z the most efficient method of selecting controls?

-------
2. Conditions Under Which Adjustment Is Necessary
2.1. Introduction
For simplicity we first consider measurement of the relationship
between a disease (D) and an agent under study (E) in the presence of
a single confounding variable Z. Extensions to the more realistic
case where a set of variables are candidates for adjustment are
outlined in Section 2.4.
We first seek a valid measure of association in the population of
cases and controls, irrespective of the method of sampling. We then
ask whether the sample estimate of this measure of association is a
satisfactory estimate of the population quantity, that is, whether the
sample estimate is not subject to selection bias.

-------
2.2 Measures of Association in the Population
In a case-control study, the association between an agent (E) and
disease (D) in the absence of confounding factors is measured by the
population odds ratio
\[
r = \frac{p(d \mid e)\, p(\bar{d} \mid \bar{e})}{p(\bar{d} \mid e)\, p(d \mid \bar{e})}
  = \frac{p(e \mid d)\, p(\bar{e} \mid \bar{d})}{p(\bar{e} \mid d)\, p(e \mid \bar{d})},
\tag{1}
\]
where D = d denotes disease, D = \bar{d} denotes no disease, E = e denotes
exposure to the agent, E = \bar{e} denotes no exposure to the agent, and
p(a | b) denotes the conditional probability that A = a given B = b in the
population. The relative risk
\[
r^{*} = \frac{p(d \mid e)}{p(d \mid \bar{e})}
\tag{2}
\]
is in some ways a more satisfactory measure of the effect of the agent.
However, the odds ratio approximates the relative risk if the probability
of disease is low, and unlike the relative risk it can be estimated from a
case-control study (8).
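The rare-disease approximation can be checked numerically. The sketch below is illustrative only: the function names are ours, the rare-disease counts are the pooled counts of Table 1, and the common-disease counts are hypothetical values chosen to have the same relative risk.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a, b: exposed cases and exposed non-cases
    c, d: unexposed cases and unexposed non-cases
    """
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Risk of disease among the exposed divided by risk among the unexposed."""
    return (a / (a + b)) / (c / (c + d))


# Rare disease: risks of 2% and 0.75% (pooled counts of Table 1).
RARE = (200, 9_800, 300, 39_700)
# Common disease: risks of 20% and 7.5% (hypothetical counts).
COMMON = (2_000, 8_000, 3_000, 37_000)

for label, table in (("rare", RARE), ("common", COMMON)):
    print(
        f"{label}: odds ratio = {odds_ratio(*table):.2f}, "
        f"relative risk = {relative_risk(*table):.2f}"
    )
```

Both populations have the same relative risk, but the odds ratio stays close to it only when the disease is rare.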
We now introduce a confounding factor Z, and suppose that a more
appropriate measure of association is the adjusted odds ratio at Z = z,
\[
r(z) = \frac{p(d \mid e, z)\, p(\bar{d} \mid \bar{e}, z)}{p(\bar{d} \mid e, z)\, p(d \mid \bar{e}, z)}
     = \frac{p(e \mid d, z)\, p(\bar{e} \mid \bar{d}, z)}{p(\bar{e} \mid d, z)\, p(e \mid \bar{d}, z)},
\tag{3}
\]
which approximates the relative risk at Z = z if the risk of disease is low
in that subgroup of the population with Z = z. Note that in general r(z)
varies according to the value of z, and thus represents a set of measures
of association.

-------
If the population parameters r and r(z) are equal for all z, i.e.,
\[
r = r(z) \quad \text{for all } z,
\tag{*}
\]
then the population relationship between disease (D) and exposure (E) is
not confounded by Z; otherwise the population relationship is
confounded. The theorem below gives an expression for r in terms of
r(z), and the subsequent discussion gives conditions under which
confounding is absent, i.e., under which (*) holds. If confounding is
present, then r and r(z) may yield strikingly different impressions
concerning the effect of exposure on disease, and in this case, the
choice of parameter to be estimated must depend on either assumptions or
outside evidence concerning the biological mechanism that causes the
disease.
Theorem
The adjusted and unadjusted odds ratios are related by the expression
\[
r = (1 + b(z))\, r(z),
\]
where
\[
b(z) = \frac{p(z \mid \bar{d}, e)\, p(z \mid d, \bar{e})}{p(z \mid d, e)\, p(z \mid \bar{d}, \bar{e})} - 1.
\tag{4}
\]
(Note: If Z is continuous rather than discrete, p(z | d, e) is the
probability density function of Z given D = d, E = e.)
Proof of Theorem
By Bayes' Theorem,
\[
p(a \mid b, z) = \frac{p(a \mid b)\, p(z \mid a, b)}{p(z \mid b)},
\]
and applying this expression in the formula (3) for r(z) leads to
equation (4).
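The substitution step of the proof can be written out in full. The following is a sketch that uses only Bayes' Theorem in the form given in the proof and the definitions in equations (1), (3) and (4); the factors p(z | e) and p(z | \bar{e}) cancel between numerator and denominator.

```latex
\begin{align*}
r(z)
 &= \frac{p(d \mid e, z)\, p(\bar{d} \mid \bar{e}, z)}
         {p(\bar{d} \mid e, z)\, p(d \mid \bar{e}, z)} \\
 &= \frac{\dfrac{p(d \mid e)\, p(z \mid d, e)}{p(z \mid e)} \cdot
          \dfrac{p(\bar{d} \mid \bar{e})\, p(z \mid \bar{d}, \bar{e})}{p(z \mid \bar{e})}}
         {\dfrac{p(\bar{d} \mid e)\, p(z \mid \bar{d}, e)}{p(z \mid e)} \cdot
          \dfrac{p(d \mid \bar{e})\, p(z \mid d, \bar{e})}{p(z \mid \bar{e})}} \\
 &= r \cdot \frac{p(z \mid d, e)\, p(z \mid \bar{d}, \bar{e})}
                 {p(z \mid \bar{d}, e)\, p(z \mid d, \bar{e})},
\end{align*}
so that
\[
r = r(z) \cdot \frac{p(z \mid \bar{d}, e)\, p(z \mid d, \bar{e})}
                    {p(z \mid d, e)\, p(z \mid \bar{d}, \bar{e})}
  = (1 + b(z))\, r(z).
\]
```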

-------
In view of equation (4) we define b(z) to be the relative confounding
bias of r at Z = z. For example, if b(z) = 0.1 then the unadjusted odds
ratio r exceeds the adjusted odds ratio at z by ten percent. Two
situations where the confounding bias is zero are of particular interest.
By inspection of equation (4), b(z) = 0 if either
(C1) Z and D are conditionally independent given E, or
(C2) Z and E are conditionally independent given D.
In the case where Z is categorical, these conditions are a special case
of the well-known collapsing theorem for contingency tables. (See, for
example, Bishop, Fienberg and Holland (9), Section 2.4.)
Conditions (C1) and (C2) are not the same as the condition proposed
by Miettinen (3) under which adjustment is unnecessary, namely
(C2') Z and E are independent.
The following example illustrates the difference between (C2) and (C2').
Example 1. The U.S. Environmental Protection Agency received a proposal
to study the relationship between lung cancer (D) and radon-222 in well
water (E). Radon gas is released into the air when radon-bearing well
water is used in the home, for example, in showering. There is some
concern that as homes are made energy efficient and the rate of air
exchange decreases, the concentration of radon daughters may increase in
homes supplied with radon-bearing well water. The proposal contained a
plan for a pilot study to determine whether well water radon levels (E)
are independent of smoking history (Z), an important confounding
variable; i.e., to determine whether (C2') holds. If radon and smoking
appeared independent as a result of the survey, then the proposal would
ignore smoking history.

-------
However, our theorem shows that the relevant condition in deciding
whether to adjust for smoking is not independence of smoking and radon
but independence of smoking and radon within the diseased and
non-diseased groups. Table 1 shows a (strictly hypothetical) population
where radon level and smoking are marginally independent, so condition
(C2') holds, but confounding is present because the adjusted odds ratios
for radon and cancer are radically different in the smoking and
non-smoking groups. The unadjusted odds ratio lies between these values, but
is a poor summary of the relationship between radon and cancer for this
population. Table 2 gives another hypothetical population where radon and
smoking are unrelated within diseased and non-diseased groups (condition
(C2)), so confounding is absent, but condition (C2') does not hold.

-------
Table 1.
Distribution of Radon (E), Lung Cancer (D) and Smoking (Z) in a
Hypothetical Population with a) E and Z Independent and b) Unequal Odds Ratios.
(e = exposed to radon, not e = unexposed; d = lung cancer, not d = no lung cancer)

Nonsmokers
                      Lung Cancer (D)
Radon (E)           d      not d      total
  e                20      5,980      6,000
  not e           110     23,890     24,000
  odds ratio = 0.73

Smokers
                      Lung Cancer (D)
Radon (E)           d      not d      total
  e               180      3,820      4,000
  not e           190     15,810     16,000
  odds ratio = 3.92

Smokers and nonsmokers
                      Lung Cancer (D)
Radon (E)           d      not d      total
  e               200      9,800     10,000
  not e           300     39,700     40,000
  odds ratio = 2.70

Table 2.
Distribution of Radon (E), Lung Cancer (D) and Smoking (Z)
in a Hypothetical Population with a) E and Z Independent Given D,
and hence b) Equal Odds Ratios

First level of Z
                      Lung Cancer (D)
Radon (E)           d      not d      total
  e                40      8,000      8,040
  not e            60     32,000     32,060
  odds ratio = 2.67

Second level of Z
                      Lung Cancer (D)
Radon (E)           d      not d      total
  e               160      2,000      2,160
  not e           240      8,000      8,240
  odds ratio = 2.67

Both levels of Z
                      Lung Cancer (D)
Radon (E)           d      not d      total
  e               200     10,000     10,200
  not e           300     40,000     40,300
  odds ratio = 2.67
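The claims made about Tables 1 and 2 can be verified directly from the cell counts. The sketch below is a minimal check, not part of the original analysis: the function names and the stratum labels "first" and "second" are ours, and exact fractions are used so that the identity r = (1 + b(z)) r(z) of the theorem holds without rounding.

```python
from fractions import Fraction

# Cell keys are (d, e): 1 = disease / exposed, 0 = no disease / unexposed.
CELLS = [(1, 1), (0, 1), (1, 0), (0, 0)]

TABLE_1 = {
    "nonsmokers": {(1, 1): 20, (0, 1): 5_980, (1, 0): 110, (0, 0): 23_890},
    "smokers": {(1, 1): 180, (0, 1): 3_820, (1, 0): 190, (0, 0): 15_810},
}
TABLE_2 = {
    "first": {(1, 1): 40, (0, 1): 8_000, (1, 0): 60, (0, 0): 32_000},
    "second": {(1, 1): 160, (0, 1): 2_000, (1, 0): 240, (0, 0): 8_000},
}


def odds_ratio(cell):
    """Odds ratio of D and E in one 2x2 table."""
    return Fraction(cell[1, 1] * cell[0, 0], cell[0, 1] * cell[1, 0])


def pooled(table):
    """Collapse the table over the strata of Z."""
    return {key: sum(stratum[key] for stratum in table.values()) for key in CELLS}


def confounding_bias(table, z):
    """Relative confounding bias b(z) of equation (4)."""
    total = pooled(table)

    def p_z(d, e):  # p(z | d, e)
        return Fraction(table[z][d, e], total[d, e])

    return p_z(0, 1) * p_z(1, 0) / (p_z(1, 1) * p_z(0, 0)) - 1


def exposure_rate(cell):
    """Proportion exposed within one stratum of Z, i.e. p(e | z)."""
    return Fraction(cell[1, 1] + cell[0, 1], sum(cell.values()))


for name, table in (("Table 1", TABLE_1), ("Table 2", TABLE_2)):
    r = odds_ratio(pooled(table))
    print(f"{name}: unadjusted odds ratio = {float(r):.2f}")
    for z, cell in table.items():
        b = confounding_bias(table, z)
        assert (1 + b) * odds_ratio(cell) == r  # the theorem, exactly
        print(
            f"  {z}: r(z) = {float(odds_ratio(cell)):.2f}, "
            f"b(z) = {float(b):+.3f}, p(e | z) = {float(exposure_rate(cell)):.4f}"
        )
```

In Table 1 the exposure rate is the same in both strata, so (C2') holds, yet b(z) is far from zero. In Table 2 the exposure rates differ slightly, so (C2') fails, yet b(z) is exactly zero.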

-------
Example 2.
We have seen that either condition (C1) or (C2) implies that the
confounding bias is zero. If Z is binary, it is easily shown that the
converse holds, that is, a confounding bias of zero implies either
(C1) or (C2). However, if Z has more than two categories, then
populations can be constructed where neither (C1) nor (C2) is
satisfied and yet the confounding bias is still zero. An example for
trichotomous Z is given in Table 3. It is readily verified that the
adjusted odds ratios all equal the unadjusted odds ratio (to within
some rounding error), even though each pair of variables is neither
conditionally nor marginally independent. Such examples are
curiosities, and the two independence conditions (C1) and (C2) are
more useful than equation (4) in practice.

-------
Table 3.
Hypothetical Population where a) Adjusted and Unadjusted Odds Ratios
of D and E Are Equal, b) D and Z Are Not Independent Given E,
and c) E and Z Are Not Independent Given D

a) D and E given Z
                  Z = 1          Z = 2          Z = 3       Z = 1, 2 or 3
                d   not d      d   not d      d   not d      d   not d
E   e          30     20     150     59     120     21     300    100
    not e      20    120      28     99      52     81     100    300
odds ratio       9.0            9.0            8.9            9.0

b) D and Z given E
                   E = e                 E = not e
Z               1     2     3         1     2     3
D   d          30   150   120        20    28    52
    not d      20    59    21       120    99    81

c) E and Z given D
                   D = d                 D = not d
Z               1     2     3         1     2     3
E   e          30   150   120        20    59    21
    not e      20    28    52       120    99    81
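Table 3 can be checked in the same way. The sketch below (function and variable names are ours) recomputes the odds ratios of part a) and tests conditions (C1) and (C2) by comparing the distribution of Z across the four combinations of D and E.

```python
from fractions import Fraction

# counts[z] = (d & e, not-d & e, d & not-e, not-d & not-e), taken from Table 3.
TABLE_3 = {1: (30, 20, 20, 120), 2: (150, 59, 28, 99), 3: (120, 21, 52, 81)}


def odds_ratio(de, nde, dne, ndne):
    """Odds ratio of D and E from the four cell counts."""
    return Fraction(de * ndne, nde * dne)


def z_distribution(index):
    """Distribution of Z within one (D, E) cell; index 0 gives p(z | d, e)."""
    total = sum(cells[index] for cells in TABLE_3.values())
    return [Fraction(cells[index], total) for cells in TABLE_3.values()]


POOLED = tuple(sum(cells[i] for cells in TABLE_3.values()) for i in range(4))

# (C1): Z independent of D given E.
c1_holds = (
    z_distribution(0) == z_distribution(1)
    and z_distribution(2) == z_distribution(3)
)
# (C2): Z independent of E given D.
c2_holds = (
    z_distribution(0) == z_distribution(2)
    and z_distribution(1) == z_distribution(3)
)

for z, cells in TABLE_3.items():
    print(f"Z = {z}: odds ratio = {float(odds_ratio(*cells)):.1f}")
print(f"pooled: odds ratio = {float(odds_ratio(*POOLED)):.1f}")
print(f"(C1) holds: {c1_holds}, (C2) holds: {c2_holds}")
```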

-------
2.3. The Effects of Sample Selection
We have established conditions under which the unadjusted and
adjusted odds ratios in the population are equal and therefore
confounding is absent in the population. However, these conditions are
not sufficient for adjustment of the sample odds ratio to be
unnecessary. The method of selection of cases and controls may be
such that the unadjusted odds ratio for the sample is a biased
estimate of its population analog. Adjustment may be necessary to
eliminate (or at least to reduce) this bias.
To clarify conditions under which selection bias arises, it is
convenient to introduce a sample indicator variable S, defined for
each individual of the population, which takes the value one if an
individual is selected into the study and zero otherwise. The method
of sampling can be characterized in terms of assumptions about the
probability distribution of S given Z, D and E (cf. Rubin (10)). The
following conditions are of particular interest since they
characterize common methods of data collection:
(C3) S is independent of Z, D and E.
(C4) S is independent of D and E, given Z.
(C5) S is independent of D and Z, given E.
(C6) S is independent of D, given Z and E.
(C7) S is independent of E and Z, given D.
(C8) S is independent of E, given Z and D.
Conditions (C3) and (C4) correspond to randomized experiments where
individuals are selected at random from the population and values of D
and E are measured. Conditions (C5) and (C6) underlie cohort studies
if individuals are selected at random within exposed and non-exposed
groups, and values of D are measured. Conditions (C7) and (C8)
underlie case-control studies where individuals are selected at random
within diseased and non-diseased groups, and values of E are
recorded. The odd-numbered conditions (C3, C5, C7) correspond to
situations where Z is not used as a stratifying variable for data
collection; in particular, matching on Z has not taken place. The
variable Z is recorded for the analysis. The even-numbered conditions
(C4, C6, C8) correspond to situations where Z is used as a stratifying
variable, for example, by matching cases and controls on Z.
A key aspect of these conditions is that they imply random sampling
within the indicated groups. In observational studies this assumption
is subject to doubt since the sampling of cases and/or controls is not
entirely controlled by the researcher. We shall return to this point
later.
Since the sample adjusted odds ratio \hat{r}_s(z) is calculated from
the selected individuals, all of whom have S = 1, it estimates the
population adjusted odds ratio conditional on S = 1, that is,
\[
r_s(z) = \frac{p(d \mid e, z, s=1)\, p(\bar{d} \mid \bar{e}, z, s=1)}{p(\bar{d} \mid e, z, s=1)\, p(d \mid \bar{e}, z, s=1)}
       = \frac{p(e \mid d, z, s=1)\, p(\bar{e} \mid \bar{d}, z, s=1)}{p(\bar{e} \mid d, z, s=1)\, p(e \mid \bar{d}, z, s=1)}.
\]
Hence the sample adjusted odds ratio estimates the population adjusted
odds ratio if and only if r_s(z) = r(z) for all z. Applying the
argument in the proof of the theorem, we can write
\[
r_s(z) = r(z)\,(1 + b_s(z)),
\]
where
\[
b_s(z) = \frac{p(s=1 \mid d, e, z)\, p(s=1 \mid \bar{d}, \bar{e}, z)}{p(s=1 \mid \bar{d}, e, z)\, p(s=1 \mid d, \bar{e}, z)} - 1.
\]
Accordingly we define b_s(z) to be the relative selection bias* of the
sample adjusted odds ratio, \hat{r}_s(z). The relative selection bias is zero if
any of the conditions (C3) to (C8) for the selection process is
satisfied. Hence the sample adjusted odds ratio is not biased for
clinical trials, prospective or case-control studies,
provided the appropriate random sampling condition (C3), ..., or (C8) can
be justified.
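As a check on this claim, the weakest case-control condition (C8) can be substituted into the expression for b_s(z). The sketch below shows the cancellation; the same argument applies under (C6) with the roles of D and E exchanged, and the remaining conditions each imply (C6) or (C8).

```latex
\[
\text{(C8):}\quad
p(s=1 \mid d, e, z) = p(s=1 \mid d, \bar{e}, z) = p(s=1 \mid d, z),
\qquad
p(s=1 \mid \bar{d}, e, z) = p(s=1 \mid \bar{d}, \bar{e}, z) = p(s=1 \mid \bar{d}, z),
\]
\[
1 + b_s(z)
 = \frac{p(s=1 \mid d, z)\, p(s=1 \mid \bar{d}, z)}
        {p(s=1 \mid \bar{d}, z)\, p(s=1 \mid d, z)}
 = 1 .
\]
```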
Stronger conditions are required for the unadjusted sample odds ratio
\hat{r}_s to be free of selection bias. Let us suppose that confounding is
absent in the population so that r is an appropriate measure of
association between disease and exposure. The sample odds ratio \hat{r}_s
estimates the unadjusted odds ratio conditional on S = 1, that is,
\[
r_s = \frac{p(d \mid e, s=1)\, p(\bar{d} \mid \bar{e}, s=1)}{p(\bar{d} \mid e, s=1)\, p(d \mid \bar{e}, s=1)}
    = \frac{p(e \mid d, s=1)\, p(\bar{e} \mid \bar{d}, s=1)}{p(\bar{e} \mid d, s=1)\, p(e \mid \bar{d}, s=1)}.
\]
This parameter is related to r by the expression
\[
r_s = r\,(1 + b_s),
\]
where
\[
b_s = \frac{p(s=1 \mid d, e)\, p(s=1 \mid \bar{d}, \bar{e})}{p(s=1 \mid \bar{d}, e)\, p(s=1 \mid d, \bar{e})} - 1.
\]
Hence we define b_s to be the relative selection bias of \hat{r}_s. It is zero
if any one of the conditions (C3), (C5) or (C7) is satisfied, but is not
in general zero if Z is controlled at the design stage of the study, that
is, when condition (C4), (C6) or (C8) applies. Hence, for example,
matching at the design stage generally leads to a requirement to adjust
at the analysis stage, even when the confounding bias is zero. Of
greater importance is the fact that even when Z is not controlled in the
selection process, there may still be a need for adjustment in the
analysis, because the sample adjusted odds ratio estimates the population
adjusted odds ratio under weaker conditions (e.g. C8) on the selection
process than are required for the sample unadjusted odds ratio to
estimate the population unadjusted odds ratio.

* Note that the selection bias has a slightly different form than the
confounding bias, in that the values of D and E in the numerator and
denominator have been switched.
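The effect of matching on the unadjusted sample odds ratio can be illustrated with a small calculation. The sketch below rests on assumptions that are ours, not the report's: the population counts are hypothetical and were constructed so that Z is independent of D given E (condition C1, hence no confounding), all cases are selected, and controls are frequency matched one-to-one to cases within strata of Z, so that selection depends only on Z and D (condition C8). Expected sample counts are used in place of a random draw.

```python
from fractions import Fraction

# Hypothetical population counts keyed by (z, d, e); d, e: 1 = yes, 0 = no.
# p(z=1 | d, e) is 0.8 when e = 1 and 0.2 when e = 0, whatever d is (C1).
POPULATION = {
    (1, 1, 1): 160, (1, 0, 1): 8_000, (1, 1, 0): 60, (1, 0, 0): 8_000,
    (2, 1, 1): 40, (2, 0, 1): 2_000, (2, 1, 0): 240, (2, 0, 0): 32_000,
}
STRATA = (1, 2)


def odds_ratio(counts, strata):
    """Odds ratio of D and E after collapsing over the given strata of Z."""

    def cell(d, e):
        return sum(Fraction(counts[z, d, e]) for z in strata)

    return cell(1, 1) * cell(0, 0) / (cell(0, 1) * cell(1, 0))


def selection_probability(z, d):
    """p(s=1 | z, d): all cases, and one control per case within stratum z."""
    if d == 1:
        return Fraction(1)
    n_cases = POPULATION[z, 1, 1] + POPULATION[z, 1, 0]
    n_controls = POPULATION[z, 0, 1] + POPULATION[z, 0, 0]
    return Fraction(n_cases, n_controls)


# Expected sample counts under the matched design.
SAMPLE = {
    key: n * selection_probability(key[0], key[1]) for key, n in POPULATION.items()
}

print("population, unadjusted:", float(odds_ratio(POPULATION, STRATA)))
for z in STRATA:
    print(f"sample, adjusted at Z = {z}:", float(odds_ratio(SAMPLE, (z,))))
print("sample, unadjusted:", float(odds_ratio(SAMPLE, STRATA)))
```

In this constructed population every adjusted and unadjusted population odds ratio equals 8/3, and the matched sample reproduces 8/3 within each stratum, but the unadjusted sample odds ratio falls to about 1.97. The example therefore needs adjustment in the analysis only because of the matched design.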

-------
2.4 More than One Covariate
In practice, a number of confounding factors are usually present
in the design and analysis of a study, and thus a more realistic
problem is whether to adjust for a covariate Z in addition to a set of
other confounding variables U = (U_1, ..., U_k). The previous
arguments are easily extended to this case by conditioning throughout
on the variables U. The odds ratio r(z) adjusted for Z is replaced by the
odds ratio r(z,u) adjusted for Z and U. The sample version of the
adjusted odds ratio estimates
\[
r_s(z,u) = r(z,u)\,(1 + b_s(z,u))
\]
with relative selection bias
\[
b_s(z,u) = \frac{p(s=1 \mid z, u, d, e)\, p(s=1 \mid z, u, \bar{d}, \bar{e})}{p(s=1 \mid z, u, \bar{d}, e)\, p(s=1 \mid z, u, d, \bar{e})} - 1.
\]
In particular, this bias is zero when S is independent of D given Z, U, E
or S is independent of E given Z, U, D. The population odds ratio r(u)
adjusted for U is
\[
r(u) = r(z,u)\,(1 + b(z \mid u))
\]
with relative confounding bias
\[
b(z \mid u) = \frac{p(z \mid \bar{d}, e, u)\, p(z \mid d, \bar{e}, u)}{p(z \mid d, e, u)\, p(z \mid \bar{d}, \bar{e}, u)} - 1,
\]
the bias being zero when Z is independent of D given E, U, or when Z is
independent of E given D, U.
The sample odds ratio r_s(u) adjusted for U is
\[
r_s(u) = r(u)\,(1 + b_s(u))
\]
with relative selection bias
\[
b_s(u) = \frac{p(s=1 \mid u, d, e)\, p(s=1 \mid u, \bar{d}, \bar{e})}{p(s=1 \mid u, \bar{d}, e)\, p(s=1 \mid u, d, \bar{e})} - 1,
\]
and in particular the bias is zero when S is independent of D given U, E,
or when S is independent of E given U, D. The counterexample to
Miettinen's conditions described by Fisher and Patil (6) fails to satisfy
the condition that the relative confounding bias is zero, which explains
why adjustment is necessary in their case.

-------
3. Does Matching Increase Power?
Now we ask: Given that the analysis will adjust for a variable Z,
does matching on Z in the design increase power? That is, does matching
on Z increase the probability of detecting a real association between
disease D and exposure E, adjusting for Z?
We suppose the variable Z is categorized with I levels, and thus
divides the population into I strata. There are N_i (i = 1, ..., I) cases
available in the ith stratum. We plan to use all N = \sum_{i=1}^{I} N_i available
cases in the case-control study, and to select a total of M controls for
comparison. The question is how best to choose the number M_i of
controls in the ith stratum, subject to the condition \sum_{i=1}^{I} M_i = M.
Thus N_i, N and M are fixed; the M_i's are to be chosen.
By definition, frequency matching of cases and controls takes
\[
M_i = k N_i \quad \text{with} \quad k = \frac{M}{N}.
\]
Let
P_{1i} = population proportion of cases exposed in stratum i,
P_{2i} = population proportion of controls exposed in stratum i,
\delta_i = P_{1i} - P_{2i},
P_i = (P_{1i} + P_{2i})/2,
and let \hat{P}_{1i}, \hat{P}_{2i}, \hat{\delta}_i and \hat{P}_i denote the corresponding sample
quantities.
The null hypothesis H_0: P_{1i} = P_{2i} for i = 1, ..., I is equivalent
to the null hypothesis that the adjusted odds ratio of D and E given Z is
one for all values of Z.

-------
The statistic
\[
C = \frac{\displaystyle\sum_{i=1}^{I} \frac{N_i M_i}{N_i + M_i}\, \hat{\delta}_i}
         {\sqrt{\displaystyle\sum_{i=1}^{I} \frac{N_i M_i}{N_i + M_i}\, \hat{P}_i (1 - \hat{P}_i)}}
\]
may be used to test this hypothesis. In moderate to large samples, the
test based on C is nearly equivalent to those of Cochran (11),
Mantel-Haenszel (12) and Birch (13), but is easier to manipulate in the
current problem.
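A minimal sketch of how C might be computed from stratified counts follows. The function name, argument layout and the two-stratum counts are hypothetical; the formula is the one given above, with \hat{P}_i taken as the simple average of the two sample proportions as defined in the text.

```python
from math import sqrt


def c_statistic(cases_exposed, n_cases, controls_exposed, n_controls):
    """Statistic C for stratified case-control counts.

    Each argument is a sequence with one entry per stratum:
    exposed cases, total cases N_i, exposed controls, total controls M_i.
    """
    numerator = 0.0
    variance = 0.0
    for x1, n, x2, m in zip(cases_exposed, n_cases, controls_exposed, n_controls):
        weight = n * m / (n + m)      # N_i M_i / (N_i + M_i)
        p1_hat = x1 / n               # sample proportion of cases exposed
        p2_hat = x2 / m               # sample proportion of controls exposed
        delta_hat = p1_hat - p2_hat
        p_hat = (p1_hat + p2_hat) / 2
        numerator += weight * delta_hat
        variance += weight * p_hat * (1 - p_hat)
    return numerator / sqrt(variance)


# Hypothetical data: two strata, cases more often exposed in both.
c = c_statistic(
    cases_exposed=[30, 45],
    n_cases=[100, 150],
    controls_exposed=[20, 30],
    n_controls=[100, 150],
)
print(f"C = {c:.3f}")
```

Because C is asymptotically normal with variance close to one, a value of C can be referred to the standard normal distribution.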
The asymptotic expectation of C is
\[
E_A(C) = \frac{\displaystyle\sum_{i=1}^{I} \frac{N_i M_i}{N_i + M_i}\, \delta_i}
              {\sqrt{\displaystyle\sum_{i=1}^{I} \frac{N_i M_i}{N_i + M_i}\, P_i (1 - P_i)}}.
\tag{5}
\]
We find M_i to maximize (5) subject to the constraint M = \sum_i M_i. Since
the nonnull variance of C is nearly 1, and since C is asymptotically
normal, maximizing E_A(C) is nearly equivalent to maximizing the
asymptotic power.
Differentiating the log of (5) subject to the constraint \sum_i M_i = M
yields
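\begin{align*}
\frac{\partial \log_e C}{\partial M_i}
 &= \frac{\partial}{\partial M_i}\left[
      \log_e \sum_j \frac{N_j M_j}{N_j + M_j}\, \delta_j
      - \frac{1}{2} \log_e \sum_j \frac{N_j M_j}{N_j + M_j}\, P_j (1 - P_j)
      - \lambda \Big( \sum_j M_j - M \Big) \right] \\
 &= \left[ \frac{N_i}{N_i + M_i} \right]^2
    \left[ \frac{\delta_i}{\sum_j \frac{N_j M_j}{N_j + M_j}\, \delta_j}
         - \frac{P_i (1 - P_i)}{2 \sum_j \frac{N_j M_j}{N_j + M_j}\, P_j (1 - P_j)} \right]
    - \lambda .
\end{align*}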
Cochran (11) observed that if the odds ratio is constant over strata
then \delta_i / P_i(1 - P_i) is nearly constant. Assuming \delta_i / P_i(1 - P_i) is
constant, we find the optimal allocation M_i satisfies
\[
\left[ \frac{N_i}{N_i + M_i} \right]^2
 = \frac{2 \lambda \sum_j \frac{N_j M_j}{N_j + M_j}\, \delta_j}{\delta_i}
 \propto \frac{1}{P_i (1 - P_i)} .
\]
If P_{1i} = P_1 and P_{2i} = P_2 for all i, then both \delta_i = P_1 - P_2 and
P_i(1 - P_i) are constant, and frequency matching is optimal. Otherwise, still
assuming the odds ratio is constant, the optimal
allocation takes more controls (M_i larger) from strata with a larger
difference \delta_i in exposure proportions, or equivalently, with a larger
variance P_i(1 - P_i).
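The conclusion can be explored numerically by searching over allocations directly. The sketch below is an illustration under our own assumptions: two strata, hypothetical case counts and exposure proportions, and a brute-force search over integer allocations in place of the Lagrange multiplier argument. The function names are ours.

```python
from math import sqrt


def expected_c(n_cases, n_controls, p1, p2):
    """Asymptotic expectation of C for given N_i, M_i, P_1i and P_2i."""
    numerator = 0.0
    variance = 0.0
    for n, m, a, b in zip(n_cases, n_controls, p1, p2):
        weight = n * m / (n + m)
        p_bar = (a + b) / 2
        numerator += weight * (a - b)
        variance += weight * p_bar * (1 - p_bar)
    return numerator / sqrt(variance)


def best_two_stratum_allocation(n_cases, total_controls, p1, p2):
    """Number of controls in stratum 1 that maximizes E_A(C) (two strata only)."""
    return max(
        range(1, total_controls),
        key=lambda m1: expected_c(n_cases, (m1, total_controls - m1), p1, p2),
    )


N_CASES = (100, 300)   # hypothetical N_1, N_2
TOTAL_CONTROLS = 400   # M, so frequency matching gives M_1 = 100, M_2 = 300

# Same exposure proportions in both strata: frequency matching is optimal.
same = best_two_stratum_allocation(N_CASES, TOTAL_CONTROLS, (0.5, 0.5), (0.3, 0.3))

# Roughly constant odds ratio, but a larger difference in stratum 1.
different = best_two_stratum_allocation(
    N_CASES, TOTAL_CONTROLS, (0.5, 0.2), (0.3, 0.1)
)

print("controls in stratum 1, equal proportions:", same)
print("controls in stratum 1, unequal proportions:", different)
```

With equal proportions the search returns the frequency-matched allocation. With unequal proportions it assigns more than the frequency-matched number of controls to the stratum with the larger difference \delta_i, in line with the statement above.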

-------
References
1. Miettinen, O.S. The matched pairs design in the case of
all-or-none responses. Biometrics, 1968, 24:339-352.
2. Bross, I.D.J. How case-for-case matching can improve
design efficiency. Amer. J. Epid., 1969, 89:359-363.
3. Miettinen, O.S. Matching and design efficiency in
retrospective studies. Amer. J. Epid., 1970, 91:111-118.
4. Hardy, R.J., White, C. Matching in retrospective
studies. Amer. J. Epid., 1971, 93:75-76.
5. Seigel, D.G., Greenhouse, S.W. Validity in estimating
relative risk in case-control studies. J. Chron. Dis.,
1973, 26:219-225.
6. Fisher, L., Patil, K. Matching and unrelatedness.
Amer. J. Epid., 1974, 100:347-349.
7. Miettinen, O.S. Confounding and effect modification.
Amer. J. Epid., 1974, 100:350-353.
8. Cornfield, J. A method of estimating comparative rates
from clinical data. J. Natl. Cancer Inst., 1951,
11:1269-1275.
9. Bishop, Y.M.M., Fienberg, S.E., Holland, P.W. Discrete
Multivariate Analysis. Cambridge, Massachusetts: MIT
Press, 1975.
10. Rubin, D.B. Inference and missing data. Biometrika,
1976, 63:581-592.
11. Cochran, W.G. Some methods for strengthening the common
chi square tests. Biometrics, 1954, 10:417-451.
12. Mantel, N., Haenszel, W. Statistical aspects of the
analysis of data from retrospective studies of disease.
J. Natl. Cancer Inst., 1959, 22:719-748.
13. Birch, M.W. The detection of partial association, I:
the 2x2 case. J. Royal Statistical Society, 1964, series B,
26:313-324.

-------
TECHNICAL REPORT DATA
1. Report No.: EPA 520/8-81-004
4. Title and Subtitle: Confounding and Selection Bias in Case Control Studies
5. Report Date: January 1981
7. Author(s): Roderick J. A. Little, Paul R. Rosenbaum
9. Performing Organization Name and Address: Office of Radiation Programs,
U.S. Environmental Protection Agency, Washington, D.C. 20460
17. Key Words and Document Analysis, Descriptors: biometry; epidemiologic
methods; research design
18. Distribution Statement: Unlimited
19. Security Class (This Report): Unclassified
20. Security Class (This Page): Unclassified
21. No. of Pages: 27
EPA Form 2220-1 (Rev. 4-77)

-------