United States Environmental Protection Agency
Office of Radiation Programs
Washington, D.C. 20460

EPA 520/8-81-004
January 1981
Radiation

Confounding and Selection Bias in Case Control Studies

Roderick J. A. Little
Paul R. Rosenbaum

Division of Statistics and Applied Mathematics
Office of Radiation Programs
U.S. Environmental Protection Agency
Washington, D.C. 20460

Abstract

In case-control studies, the role of adjustments for bias, and in particular the role of matching, has been extensively debated. However, the absence of a formal statement of the problem has led to disagreements, confusion, and occasionally to erroneous conclusions. This paper formulates precisely and answers the following questions.

1) When is it necessary to adjust for a variable Z?
2) Given that the data analysis will adjust for the variable Z, is matching on Z the most efficient method of selecting controls?

In answering these questions, we draw a sharp distinction between bias caused by confounding in the population and bias caused by the method used to select the sample.

Acknowledgment

The authors wish to acknowledge valuable discussions with Donald B. Rubin on the subject matter of this paper.

CONTENTS

Abstract
1. Introduction
2. Conditions Under Which Adjustment Is Necessary
3. Does Matching Increase Power?
References

1. Introduction

In case-control studies, the role of adjustments for bias, and in particular the role of matching, has been extensively debated (1-7). However, the absence of a formal statement of the problem has led to disagreements, confusion, and occasionally to erroneous conclusions.
In this paper we formulate precisely and answer the following questions:

a) When is it necessary to adjust for a variable Z?
b) Given that the data analysis will adjust for the variable Z, is matching on Z the most efficient method of selecting controls?

2. Conditions Under Which Adjustment Is Necessary

2.1. Introduction

For simplicity we first consider measurement of the relationship between a disease (D) and an agent under study (E) in the presence of a single confounding variable Z. Extensions to the more realistic case where a set of variables are candidates for adjustment are outlined in Section 2.4.

We first seek a valid measure of association in the population of cases and controls, irrespective of the method of sampling. We then ask whether the sample estimate of this measure of association is a satisfactory estimate of the population quantity, that is, whether the sample estimate is not subject to selection bias.

2.2 Measures of Association in the Population

In a case-control study, the association between an agent (E) and disease (D) in the absence of confounding factors is measured by the population odds ratio

$$ r = \frac{p(d \mid e)\, p(\bar{d} \mid \bar{e})}{p(\bar{d} \mid e)\, p(d \mid \bar{e})} = \frac{p(e \mid d)\, p(\bar{e} \mid \bar{d})}{p(\bar{e} \mid d)\, p(e \mid \bar{d})}, \qquad (1) $$

where D = d denotes disease, D = d̄ denotes no disease, E = e denotes exposure to the agent, E = ē denotes no exposure to the agent, and p(a | b) denotes the conditional probability that A = a given B = b in the population. The relative risk

$$ r^{*} = \frac{p(d \mid e)}{p(d \mid \bar{e})} \qquad (2) $$

is in some ways a more satisfactory measure of the effect of the agent. However, the odds ratio approximates the relative risk if the probability of disease is low, and unlike the relative risk it can be estimated from a case-control study (8).
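As a numerical illustration of equations (1) and (2), the following minimal Python sketch computes the odds ratio in both of its equivalent forms, together with the relative risk, for a 2x2 table of counts. The counts are hypothetical (they match the pooled population that appears in Table 1), and the function names are illustrative assumptions rather than anything defined in this report.

```python
# Cell counts of a 2x2 table:
#   a = exposed & diseased,    b = exposed & not diseased,
#   c = unexposed & diseased,  d = unexposed & not diseased.

def odds_ratio_from_disease_given_exposure(a, b, c, d):
    """First form of equation (1): built from p(d|e) and p(d|not e)."""
    p_d_e = a / (a + b)
    p_d_ne = c / (c + d)
    return (p_d_e * (1 - p_d_ne)) / ((1 - p_d_e) * p_d_ne)

def odds_ratio_from_exposure_given_disease(a, b, c, d):
    """Second form of equation (1): built from p(e|d) and p(e|not d)."""
    p_e_d = a / (a + c)
    p_e_nd = b / (b + d)
    return (p_e_d * (1 - p_e_nd)) / ((1 - p_e_d) * p_e_nd)

def relative_risk(a, b, c, d):
    """Equation (2): p(d|e) / p(d|not e)."""
    return (a / (a + b)) / (c / (c + d))

# Hypothetical rare-disease population (illustrative counts only).
table = (200, 9_800, 300, 39_700)

r_prospective = odds_ratio_from_disease_given_exposure(*table)
r_retrospective = odds_ratio_from_exposure_given_disease(*table)
r_star = relative_risk(*table)

print(f"odds ratio (disease given exposure): {r_prospective:.2f}")
print(f"odds ratio (exposure given disease): {r_retrospective:.2f}")
print(f"relative risk:                       {r_star:.2f}")
```

Both forms of the odds ratio reduce to ad/(bc), which is why the quantity can be estimated from case-control data, where only the distribution of exposure given disease status is observed. With disease this rare, the odds ratio and the relative risk are close.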
We now introduce a confounding factor Z, and suppose that a more appropriate measure of association is the adjusted odds ratio at Z = z,

$$ r(z) = \frac{p(d \mid e, z)\, p(\bar{d} \mid \bar{e}, z)}{p(\bar{d} \mid e, z)\, p(d \mid \bar{e}, z)} = \frac{p(e \mid d, z)\, p(\bar{e} \mid \bar{d}, z)}{p(\bar{e} \mid d, z)\, p(e \mid \bar{d}, z)}, \qquad (3) $$

which approximates the relative risk at Z = z if the risk of disease is low in that subgroup of the population with Z = z. Note that in general r(z) varies according to the value of z, and thus represents a set of measures of association.

If the population parameters r and r(z) are equal for all z, i.e.,

(*) r = r(z) for all z,

then the population relationship between disease (D) and exposure (E) is not confounded by Z; otherwise the population relationship is confounded. The theorem below gives an expression for r in terms of r(z), and the subsequent discussion gives conditions under which confounding is absent, i.e. under which (*) holds. If confounding is present, then r and r(z) may yield strikingly different impressions concerning the effect of exposure on disease, and in this case, the choice of parameter to be estimated must depend on either assumptions or outside evidence concerning the biological mechanism that causes the disease.

Theorem

The adjusted and unadjusted odds ratios are related by the expression

$$ r = (1 + b(z))\, r(z), $$

where

$$ b(z) = \frac{p(z \mid \bar{d}, e)\, p(z \mid d, \bar{e})}{p(z \mid d, e)\, p(z \mid \bar{d}, \bar{e})} - 1. \qquad (4) $$

(Note: If Z is continuous rather than discrete, p(z | d, e) is the probability density function of Z given D = d, E = e.)

Proof of Theorem

By Bayes' Theorem,

$$ p(a \mid b, z) = \frac{p(a \mid b)\, p(z \mid a, b)}{p(z \mid b)}. $$

Applying this expression to each of the four factors in formula (3) for r(z), the terms p(z | e) and p(z | ē) cancel, which leads to equation (4).

In view of equation (4) we define b(z) to be the relative confounding bias of r at Z = z. For example, if b(z) = 0.1 then the unadjusted odds ratio r deviates from the adjusted odds ratio at z by ten percent. Two situations where the confounding bias is zero are of particular interest.
By inspection of equation (4), b(z) = 0 if either

(C1) Z and D are conditionally independent given E, or
(C2) Z and E are conditionally independent given D.

In the case where Z is categorical, these conditions are a special case of the well known collapsing theorem for contingency tables. (See, for example, Bishop, Fienberg and Holland (9), Section 2.4.) Conditions (C1) and (C2) are not the same as the condition proposed by Miettinen (3) under which adjustment is unnecessary, namely

(C2') Z and E are independent.

The following example illustrates the difference between (C2) and (C2').

Example 1. The U.S. Environmental Protection Agency received a proposal to study the relationship between lung cancer (D) and radon-222 in well water (E). Radon gas is released into the air when radon bearing well water is used in the home, for example, in showering. There is some concern that as homes are made energy efficient and the rate of air exchange decreases, the concentration of radon daughters may increase in homes supplied with radon bearing well water. The proposal contained a plan for a pilot study to determine whether well water radon levels (E) are independent of smoking history (Z), an important confounding variable; i.e. to determine whether (C2') holds. If radon and smoking appear independent as a result of the survey, then the proposal would ignore smoking history.

However, our theorem shows that the relevant condition in deciding whether to adjust for smoking is not independence of smoking and radon but independence of smoking and radon within the diseased and non-diseased groups. Table 1 shows a (strictly hypothetical) population where radon level and smoking are marginally independent, so condition (C2') holds, but confounding is present because the adjusted odds ratios for radon and cancer are radically different in the smoking and non-smoking groups.
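The hypothetical population of Table 1 can be used to check the theorem numerically. The sketch below, a minimal Python illustration whose variable and function names are assumptions made for this example, computes the adjusted odds ratios, the relative confounding bias b(z) of equation (4), and the exposure rate within each smoking group.

```python
# Table 1 counts, keyed by smoking status and then by (disease, exposure):
# "d"/"nd" = lung cancer / no lung cancer, "e"/"ne" = exposed / not exposed.
table1 = {
    "nonsmokers": {("d", "e"): 20, ("nd", "e"): 5_980,
                   ("d", "ne"): 110, ("nd", "ne"): 23_890},
    "smokers": {("d", "e"): 180, ("nd", "e"): 3_820,
                ("d", "ne"): 190, ("nd", "ne"): 15_810},
}
CELLS = [("d", "e"), ("nd", "e"), ("d", "ne"), ("nd", "ne")]

def odds_ratio(n):
    """Odds ratio of D and E from counts keyed by (disease, exposure)."""
    return n[("d", "e")] * n[("nd", "ne")] / (n[("nd", "e")] * n[("d", "ne")])

# Unadjusted odds ratio r: pool the counts over smoking status.
pooled = {c: sum(stratum[c] for stratum in table1.values()) for c in CELLS}
r = odds_ratio(pooled)

adjusted, bias, p_exposed = {}, {}, {}
for z, n in table1.items():
    p = {c: n[c] / pooled[c] for c in CELLS}  # p(z | D, E) for each cell
    # Relative confounding bias b(z), equation (4).
    bias[z] = p[("nd", "e")] * p[("d", "ne")] / (p[("d", "e")] * p[("nd", "ne")]) - 1
    adjusted[z] = odds_ratio(n)
    # Exposure rate within the stratum, p(e | z), relevant to condition (C2').
    p_exposed[z] = (n[("d", "e")] + n[("nd", "e")]) / sum(n.values())
    print(f"{z}: r(z) = {adjusted[z]:.2f}, b(z) = {bias[z]:+.3f}, "
          f"(1 + b(z)) r(z) = {(1 + bias[z]) * adjusted[z]:.2f}, "
          f"p(e | z) = {p_exposed[z]:.2f}")
print(f"unadjusted r = {r:.2f}")
```

The exposure rate is the same in both smoking groups, so (C2') holds, yet b(z) is far from zero in each group; in both groups (1 + b(z)) r(z) reproduces the unadjusted odds ratio, as the theorem requires.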
The unadjusted odds ratio lies between these values, but is a poor summary of the relationship between radon and cancer for this population. Table 2 gives another hypothetical population where radon and smoking are unrelated within diseased and non-diseased groups (condition (C2)), so confounding is absent, but condition (C2') does not hold.

Table 1. Distribution of Radon (E), Lung Cancer (D) and Smoking (Z) in a Hypothetical Population with a) E and Z Independent and b) Unequal Odds Ratios.

                                        Lung Cancer (D)
Smoking (Z)                Radon (E)        d         d̄      total   odds ratio
Nonsmokers                 e               20     5,980      6,000
                           ē              110    23,890     24,000      0.73
Smokers                    e              180     3,820      4,000
                           ē              190    15,810     16,000      3.92
Smokers and nonsmokers     e              200     9,800     10,000
                           ē              300    39,700     40,000      2.70

Table 2. Distribution of Radon (E), Lung Cancer (D) and Smoking (Z) in a Hypothetical Population with a) E and Z Independent Given D, and hence b) Equal Odds Ratios.

                                        Lung Cancer (D)
Smoking (Z)                Radon (E)        d         d̄      total   odds ratio
First level of Z           e               40     8,000      8,040
                           ē               60    32,000     32,060      2.67
Second level of Z          e              160     2,000      2,160
                           ē              240     8,000      8,240      2.67
Both levels of Z           e              200    10,000     10,200
                           ē              300    40,000     40,300      2.67

Example 2. We have seen that either condition (C1) or (C2) implies that the confounding bias is zero. If Z is binary, it is easily shown that the converse holds, that is, a confounding bias of zero implies either (C1) or (C2). However, if Z has more than two categories, then populations can be constructed where neither (C1) nor (C2) is satisfied and yet the confounding bias is still zero. An example for trichotomous Z is given in Table 3. It is readily verified that the adjusted odds ratios all equal the unadjusted odds ratio (to within some rounding error), even though each pair of variables is neither conditionally nor marginally independent.
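The verification for Table 3 can be sketched in a few lines of Python. This is only an illustrative check under the counts given in Table 3; the data layout and names are assumptions made for the example.

```python
# Table 3 counts: strata[z] = (n_de, n_nde, n_dne, n_ndne), where
#   n_de   = diseased & exposed,    n_nde  = not diseased & exposed,
#   n_dne  = diseased & unexposed,  n_ndne = not diseased & unexposed.
strata = {
    1: (30, 20, 20, 120),
    2: (150, 59, 28, 99),
    3: (120, 21, 52, 81),
}

def odds_ratio(n_de, n_nde, n_dne, n_ndne):
    return n_de * n_ndne / (n_nde * n_dne)

# Pooled table: sum each cell over the three levels of Z.
pooled = tuple(sum(cells) for cells in zip(*strata.values()))
r = odds_ratio(*pooled)

adjusted, bias = {}, {}
for z, cells in strata.items():
    # p(z | D, E) for each of the four (D, E) cells.
    p_de, p_nde, p_dne, p_ndne = (n / total for n, total in zip(cells, pooled))
    bias[z] = p_nde * p_dne / (p_de * p_ndne) - 1  # equation (4)
    adjusted[z] = odds_ratio(*cells)
    print(f"Z = {z}: r(z) = {adjusted[z]:.2f}, b(z) = {bias[z]:+.4f}")
    # If (C1) held, p(z|d,e) would equal p(z|nd,e); if (C2) held,
    # p(z|d,e) would equal p(z|d,ne). Neither equality holds here.
    print(f"        p(z|d,e) = {p_de:.2f}, p(z|nd,e) = {p_nde:.2f}, "
          f"p(z|d,ne) = {p_dne:.2f}, p(z|nd,ne) = {p_ndne:.2f}")
print(f"unadjusted r = {r:.2f}")
```

The bias b(z) is zero or close to zero at every level of Z, while the conditional distributions of Z differ across all four disease and exposure cells.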
Such examples are curiosities, and the two independence conditions (C1) and (C2) are more useful than equation (4) in practice.

Table 3. Hypothetical Population where a) Adjusted and Unadjusted Odds Ratios of D and E Are Equal, b) D and Z Are Not Independent Given E, and c) E and Z Are Not Independent Given D

a) D and E given Z

                    E        D = d    D = d̄    odds ratio
Z = 1               e           30       20
                    ē           20      120       9.0
Z = 2               e          150       59
                    ē           28       99       9.0
Z = 3               e          120       21
                    ē           52       81       8.9
Z = 1, 2 or 3       e          300      100
                    ē          100      300       9.0

b) D and Z given E

            D        Z = 1    Z = 2    Z = 3
E = e       d           30      150      120
            d̄           20       59       21
E = ē       d           20       28       52
            d̄          120       99       81

c) E and Z given D

            E        Z = 1    Z = 2    Z = 3
D = d       e           30      150      120
            ē           20       28       52
D = d̄       e           20       59       21
            ē          120       99       81

2.3. The Effects of Sample Selection

We have established conditions under which the unadjusted and adjusted odds ratios in the population are equal and therefore confounding is absent in the population. However, these conditions are not sufficient for adjustment of the sample odds ratio to be unnecessary. The method of selection of cases and controls may be such that the unadjusted odds ratio for the sample is a biased estimate of its population analog. Adjustment may be necessary to eliminate (or at least to reduce) this bias.

To clarify conditions under which selection bias arises, it is convenient to introduce a sample indicator variable S, defined for each individual of the population, which takes value one if an individual is selected into the study and zero otherwise. The method of sampling can be characterized in terms of assumptions about the probability distribution of S given Z, D and E (cf. Rubin (10)). The following conditions are of particular interest since they characterize common methods of data collection:

(C3) S is independent of Z, D and E.
(C4) S is independent of D and E, given Z.
(C5) S is independent of D and Z, given E.
(C6) S is independent of D, given Z and E.
(C7) S is independent of E and Z, given D.
(C8) S is independent of E, given Z and D.

Conditions (C3) and (C4) correspond to randomized experiments where individuals are selected at random from the population and values of D and E are measured. Conditions (C5) and (C6) underlie cohort studies if individuals are selected at random within exposed and non-exposed groups, and values of D are measured. Conditions (C7) and (C8) underlie case-control studies where individuals are selected at random within diseased and non-diseased groups, and values of E are recorded.

The odd numbered conditions (C3, C5, C7) correspond to situations where Z is not used as a stratifying variable for data collection; in particular, matching on Z has not taken place. The variable Z is recorded for the analysis. The even numbered conditions (C4, C6, C8) correspond to situations where Z is used as a stratifying variable, for example, by matching cases and controls on Z.

A key aspect of these conditions is that they imply random sampling within the indicated groups. In observational studies this assumption is subject to doubt since the sampling of cases and/or controls is not entirely controlled by the researcher. We shall return to this point later.

Since the sample adjusted odds ratio r̂(z) is calculated from the selected individuals, all of whom have S = 1, it estimates the population adjusted odds ratio conditional on S = 1, that is,

$$ r_s(z) = \frac{p(d \mid e, z, s=1)\, p(\bar{d} \mid \bar{e}, z, s=1)}{p(\bar{d} \mid e, z, s=1)\, p(d \mid \bar{e}, z, s=1)} = \frac{p(e \mid d, z, s=1)\, p(\bar{e} \mid \bar{d}, z, s=1)}{p(\bar{e} \mid d, z, s=1)\, p(e \mid \bar{d}, z, s=1)}. $$

Hence the sample adjusted odds ratio estimates the population adjusted odds ratio if and only if r_s(z) = r(z) for all z. Applying the argument in the proof of the theorem, we can write

$$ r_s(z) = r(z)\,(1 + b_s(z)), $$

where

$$ b_s(z) = \frac{p(s=1 \mid d, e, z)\, p(s=1 \mid \bar{d}, \bar{e}, z)}{p(s=1 \mid \bar{d}, e, z)\, p(s=1 \mid d, \bar{e}, z)} - 1. $$

Accordingly we define b_s(z) to be the relative selection bias* of the sample adjusted odds ratio r̂(z).
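The role of b_s(z) can be illustrated with a small numerical sketch in Python. The stratum counts and the selection probabilities below are hypothetical, chosen only for illustration, and the function names are assumptions of this example. Two selection schemes are compared within a single stratum Z = z: one in which selection depends on disease status only, and one in which exposed cases are also more likely to be selected than unexposed cases.

```python
# Keys are (disease, exposure): "d"/"nd" = diseased / not diseased,
# "e"/"ne" = exposed / not exposed.

def odds_ratio(n):
    """Odds ratio of D and E from counts keyed by (disease, exposure)."""
    return n[("d", "e")] * n[("nd", "ne")] / (n[("nd", "e")] * n[("d", "ne")])

def selection_bias(pi):
    """Relative selection bias b_s(z), where pi[(D, E)] = p(S = 1 | D, E, z)."""
    return (pi[("d", "e")] * pi[("nd", "ne")]
            / (pi[("nd", "e")] * pi[("d", "ne")]) - 1)

def expected_sample(n, pi):
    """Expected number of selected individuals in each (D, E) cell."""
    return {cell: n[cell] * pi[cell] for cell in n}

# Hypothetical population counts within one stratum Z = z.
population = {("d", "e"): 40, ("nd", "e"): 8_000,
              ("d", "ne"): 60, ("nd", "ne"): 32_000}

# Selection depends on disease status only: all cases, 1% of non-cases.
disease_only = {("d", "e"): 1.0, ("d", "ne"): 1.0,
                ("nd", "e"): 0.01, ("nd", "ne"): 0.01}

# Selection also depends on exposure among cases (assumed for illustration).
exposure_dependent = {("d", "e"): 1.0, ("d", "ne"): 0.5,
                      ("nd", "e"): 0.01, ("nd", "ne"): 0.01}

r_z = odds_ratio(population)
for name, pi in [("disease only", disease_only),
                 ("exposure dependent", exposure_dependent)]:
    b_s = selection_bias(pi)
    r_s = odds_ratio(expected_sample(population, pi))
    print(f"{name}: b_s(z) = {b_s:.2f}, r_s(z) = {r_s:.2f}, "
          f"r(z)(1 + b_s(z)) = {r_z * (1 + b_s):.2f}")
```

In the first scheme the selection probabilities cancel and the odds ratio among selected individuals equals the population value; in the second, the odds ratio among selected individuals is inflated by the factor 1 + b_s(z).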
The relative selection bias is zero if any of the conditions (C3) to (C8) for the selection process is satisfied. Hence the sample adjusted odds ratio is not biased for clinical trials, prospective or case-control studies, provided the appropriate random sampling condition (C3), ..., or (C8) can be justified.

* Note that the selection bias has a slightly different form than the confounding bias, in that the values of D and E in the numerator and denominator have been switched.

Stronger conditions are required for the unadjusted sample odds ratio r̂ to be free of selection bias. Let us suppose that confounding is absent in the population, so that r is an appropriate measure of association between disease and exposure. The sample odds ratio r̂ estimates the unadjusted odds ratio conditional on S = 1, that is,

$$ r_s = \frac{p(d \mid e, s=1)\, p(\bar{d} \mid \bar{e}, s=1)}{p(\bar{d} \mid e, s=1)\, p(d \mid \bar{e}, s=1)} = \frac{p(e \mid d, s=1)\, p(\bar{e} \mid \bar{d}, s=1)}{p(\bar{e} \mid d, s=1)\, p(e \mid \bar{d}, s=1)}. $$

This parameter is related to r by the expression

$$ r_s = r\,(1 + b_s), $$

where

$$ b_s = \frac{p(s=1 \mid d, e)\, p(s=1 \mid \bar{d}, \bar{e})}{p(s=1 \mid \bar{d}, e)\, p(s=1 \mid d, \bar{e})} - 1. $$

Hence we define b_s to be the relative selection bias of r̂. It is zero if any one of the conditions (C3), (C5) or (C7) is satisfied, but is not in general zero if Z is controlled at the design stage of the study, that is, when conditions (C4), (C6) or (C8) apply. Hence, for example, matching at the design stage generally leads to a requirement to adjust at the analysis stage, even when the confounding bias is zero.

Of greater importance is the fact that even when Z is not controlled in the selection process, there may still be a need for adjustment in the analysis, because the sample adjusted odds ratio estimates the population adjusted odds ratio under weaker conditions (e.g. (C8)) on the selection process than are required for the sample unadjusted odds ratio to estimate the population unadjusted odds ratio.

2.4 More than One Covariate
In practice, a number of confounding factors are usually present in the design and analysis of a study, and thus a more realistic problem is whether to adjust for a covariate Z in addition to a set of other confounding variables U = (U_1, ..., U_k). The previous arguments are easily extended to this case by conditioning throughout on the variables U. The odds ratio r(z) adjusted for Z is replaced by the odds ratio r(z,u) adjusted for Z and U. The sample version of the adjusted odds ratio estimates

$$ r_s(z,u) = r(z,u)\,(1 + b_s(z,u)) $$

with relative selection bias

$$ b_s(z,u) = \frac{p(s \mid z, u, d, e)\, p(s \mid z, u, \bar{d}, \bar{e})}{p(s \mid z, u, \bar{d}, e)\, p(s \mid z, u, d, \bar{e})} - 1, $$

where s denotes S = 1. In particular, this bias is zero when S is independent of D given Z, U, E or S is independent of E given Z, U, D. The population odds ratio r(u) adjusted for U is

$$ r(u) = r(u,z)\,(1 + b(z \mid u)) $$

with relative confounding bias

$$ b(z \mid u) = \frac{p(z \mid \bar{d}, e, u)\, p(z \mid d, \bar{e}, u)}{p(z \mid d, e, u)\, p(z \mid \bar{d}, \bar{e}, u)} - 1, $$

the bias being zero when Z is independent of D given E, U, or when Z is independent of E given D, U. The sample odds ratio r_s(u) adjusted for U is

$$ r_s(u) = r(u)\,(1 + b_s(u)) $$

with relative selection bias

$$ b_s(u) = \frac{p(s \mid u, d, e)\, p(s \mid u, \bar{d}, \bar{e})}{p(s \mid u, \bar{d}, e)\, p(s \mid u, d, \bar{e})} - 1, $$

and in particular the bias is zero when S is independent of D given U, E, or when S is independent of E given U, D.

The counterexample to Miettinen's conditions described by Fisher and Patil (6) fails to satisfy the condition that the relative confounding bias is zero, which explains why adjustment is necessary in their case.

3. Does Matching Increase Power?

Now we ask: Given that the analysis will adjust for a variable Z, does matching on Z in the design increase power? That is, does matching on Z increase the probability of detecting a real association between disease D and exposure E, adjusting for Z?

We suppose the variable Z is categorized with I levels, and thus divides the population into I strata. There are N_i (i = 1, ..., I) cases available in the ith stratum.
We plan to use all N = Σ N_i available cases (summing over i = 1, ..., I) in the case-control study, and to select a total of M controls for comparison. The question is how best to choose the number M_i of controls in the ith stratum, subject to the condition Σ M_i = M. Thus N_i, N and M are fixed; the M_i's are to be chosen. By definition, frequency matching of cases and controls takes M_i = k N_i with k = M/N.

Let

P_1i = population proportion of cases exposed in stratum i,
P_2i = population proportion of controls exposed in stratum i,
δ_i = P_1i - P_2i,
P_i = (P_1i + P_2i)/2,

and let P̂_1i, P̂_2i, δ̂_i and P̂_i denote the corresponding sample quantities.

The null hypothesis H_0: P_1i = P_2i for i = 1, ..., I is equivalent to the null hypothesis that the adjusted odds ratio of D and E given Z is one for all values of Z.

The statistic

$$ C = \frac{\sum_{i=1}^{I} \frac{N_i M_i}{N_i + M_i}\, \hat{\delta}_i}{\sqrt{\sum_{i=1}^{I} \frac{N_i M_i}{N_i + M_i}\, \hat{P}_i (1 - \hat{P}_i)}} $$

may be used to test this hypothesis. In moderate to large samples, the test based on C is nearly equivalent to those of Cochran (11), Mantel-Haenszel (12) and Birch (13), but is easier to manipulate in the current problem. The asymptotic expectation of C is

$$ E_A(C) = \frac{\sum_{i=1}^{I} \frac{N_i M_i}{N_i + M_i}\, \delta_i}{\sqrt{\sum_{i=1}^{I} \frac{N_i M_i}{N_i + M_i}\, P_i (1 - P_i)}}. \qquad (5) $$

We find M_i to maximize (5) subject to the constraint Σ M_i = M. Since the non-null variance of C is nearly 1, and since C is asymptotically normal, maximizing E_A(C) is nearly equivalent to maximizing the asymptotic power. Differentiating the log of (5) subject to the constraint Σ M_i = M, with Lagrange multiplier λ, yields

$$ \frac{\partial}{\partial M_i} \left[ \log_e \sum_{j} \frac{N_j M_j}{N_j + M_j}\, \delta_j - \frac{1}{2} \log_e \sum_{j} \frac{N_j M_j}{N_j + M_j}\, P_j (1 - P_j) - \lambda \Big( \sum_{j} M_j - M \Big) \right] $$
$$ = \left[ \frac{N_i}{N_i + M_i} \right]^2 \left[ \frac{\delta_i}{\sum_{j} \frac{N_j M_j}{N_j + M_j}\, \delta_j} - \frac{P_i (1 - P_i)}{2 \sum_{j} \frac{N_j M_j}{N_j + M_j}\, P_j (1 - P_j)} \right] - \lambda. $$

Cochran (11) observed that if the odds ratio is constant over strata then δ_i / P_i(1 - P_i) is nearly constant. Assuming δ_i / P_i(1 - P_i) is constant and setting the derivative to zero, we find the optimal allocation M_i satisfies

$$ \frac{N_i}{N_i + M_i} = \sqrt{\frac{2 \lambda \sum_{j} \frac{N_j M_j}{N_j + M_j}\, \delta_j}{\delta_i}} \;\propto\; \frac{1}{\sqrt{P_i (1 - P_i)}}. $$

If P_1i = P_1 and P_2i = P_2 for all i, then both δ_i = P_1 - P_2 and P_i(1 - P_i) are constant, and frequency matching is optimal. Otherwise, still assuming the odds ratio is constant, the optimal allocation takes more controls (M_i larger) from strata with a larger difference δ_i in exposure proportions, or equivalently, with a larger variance P_i(1 - P_i).

References

1. Miettinen, O.S. The matched pairs design in the case of all-or-none responses. Biometrics, 1968, 24:339-352.
2. Bross, I.D.J. How case-for-case matching can improve design efficiency. Amer. J. Epid., 1969, 89:359-363.
3. Miettinen, O.S. Matching and design efficiency in retrospective studies. Amer. J. Epid., 1970, 91:111-118.
4. Hardy, R.J., White, C. Matching in retrospective studies. Amer. J. Epid., 1971, 93:75-76.
5. Seigel, D.G., Greenhouse, S.W. Validity in estimating relative risk in case-control studies. J. Chron. Dis., 1973, 26:219-225.
6. Fisher, L., Patil, K. Matching and unrelatedness. Amer. J. Epid., 1974, 100:347-349.
7. Miettinen, O.S. Confounding and effect modification. Amer. J. Epid., 1974, 100:350-353.
8. Cornfield, J. A method of estimating comparative rates from clinical data. J. Natl. Cancer Inst., 1951, 11:1269-1275.
9. Bishop, Y.M.M., Fienberg, S.E., Holland, P.W. Discrete Multivariate Analysis. Cambridge, Massachusetts: MIT Press, 1975.
10. Rubin, D.B. Inference and missing data. Biometrika, 1976, 63:581-592.
11. Cochran, W.G. Some methods for strengthening the common chi square tests. Biometrics, 1954, 10:417-451.
12. Mantel, N., Haenszel, W. Statistical aspects of the analysis of data from retrospective studies of disease. J. Natl. Cancer Inst., 1959, 22:719-748.
13. Birch, M.W. The detection of partial association, I: the 2x2 case. J. Royal Statistical Society, 1964, series B, 26:313-324.

TECHNICAL REPORT DATA
Report No.: EPA 520/8-81-004
Title and subtitle: Confounding and Selection Bias in Case Control Studies
Report date: January 1981
Author(s): Roderick J. A. Little, Paul R. Rosenbaum
Performing organization name and address: Office of Radiation Programs, U.S. Environmental Protection Agency, Washington, D.C. 20460
Key words (descriptors): biometry; epidemiologic methods; research design
Distribution statement: Unlimited
Security class (this report): Unclassified
Security class (this page): Unclassified
No. of pages: 27
EPA Form 2220-1 (Rev. 4-77)