PROTOCOL: A COMPUTERIZED SOLID WASTE QUANTITY AND COMPOSITION ESTIMATION SYSTEM

by Albert J. Klee

Risk Reduction Engineering Laboratory
Office of Research and Development
U.S. Environmental Protection Agency
Cincinnati, Ohio 45268

DISCLAIMER

This report has been reviewed by the U.S. Environmental Protection Agency and approved for publication. Approval does not signify that the contents necessarily reflect the views and policies of the U.S. Environmental Protection Agency, nor does mention of trade names or commercial products constitute endorsement or recommendation of use.

FOREWORD

Today's rapidly developing and changing technologies and industrial products and practices frequently carry with them the increased generation of materials that, if improperly dealt with, can threaten both public health and the environment. The U.S. Environmental Protection Agency is charged by Congress with protecting the Nation's land, air, and water resources. Under a mandate of national environmental laws, the Agency strives to formulate and implement actions leading to a compatible balance between human activities and the ability of natural systems to support and nurture life. These laws direct the EPA to perform research to define our environmental problems, measure the impacts, and search for solutions.
The Risk Reduction Engineering Laboratory is responsible for planning, implementing, and managing research, development, and demonstration programs to provide an authoritative, defensible engineering basis in support of the policies, programs, and regulations of the EPA with respect to drinking water, wastewater, pesticides, toxic substances, solid and hazardous wastes, and Superfund-related activities. This publication is one of the products of that research and provides a vital communication link between the researcher and the user community.

This report describes a system of sampling protocols for estimating the quantity and composition of solid waste at a given location, such as a landfill site, or at a specific point in an industrial or commercial process, over a stated period of time. An adequate estimation of these elements is essential to the design of resource recovery systems and waste minimization programs, and to the estimation of the life of landfills and the pollution burden on the land posed by the generation of solid wastes. The theory developed in this report takes a significantly different approach from the more traditional sampling plans, resulting in lower cost and in more accurate and precise estimates of these critical quantities. Although the calculations dictated by these protocols are tedious, a computer program, called PROTOCOL, has also been developed to do these calculations, thus relieving a great burden from the analyst. The program is designed to be run on personal computers with modest capabilities.

For further information, please contact the Waste Minimization, Destruction and Disposal Research Division of the Risk Reduction Engineering Laboratory.

E.
Timothy Oppelt, Director
Risk Reduction Engineering Laboratory

ABSTRACT

The assumptions of traditional sampling theory often do not fit the circumstances when estimating the quantity and composition of solid waste arriving at a given location, such as a landfill site, or at a specific point in an industrial or commercial process. The investigator often has little leeway in the sampling observation process. Traditional unbiased random sampling will produce some intervals of little or no activity and others of frenzied activity, clearly an inefficient and error-prone procedure. In addition, there are no discrete entities of solid waste composition, such as a basic unit of paper or of textiles, comprising the population about which inferences are to be drawn. Finally, with respect to solid waste composition, the traditional assumptions of normality are not valid, thus precluding the rote application of the standard statistical formulas for the estimation of sample sizes or the construction of confidence intervals. This study describes the development of sampling protocols for estimating the quantity and composition of solid waste that deal with these problems. Since the methods developed, although not mathematically complex, are arithmetically tedious, a computer program (designed to be run on personal computers with modest capabilities) was written to carry out the calculations involved.

PREFACE

Traditional sampling theory generally follows this paradigm: SAMPLE SELECTION → SAMPLE OBSERVATION → SAMPLE ESTIMATION. Typically, the sample selection process is one in which the samples are chosen by an unbiased procedure, such as simple random sampling, or systematic sampling where it is assumed that the population is already in random order. Traditional sampling theory assumes that there are sampling elements, i.e., discrete entities comprising the population about which inferences are to be drawn.
In the sample observation (i.e., data recording) stage, it is further assumed that observation of the elements of the sample is an independent process, i.e., that there is no queue of sample elements building up, waiting to be observed while an observation on one sample element is being made. Finally, when it is desired to place confidence intervals about the estimates made in the sample estimation process, the distribution of either the population or of the population parameter estimated is assumed to follow a specific classical probability distribution; typically, the normal distribution is assumed. In the sample selection process, similar assumptions are made when determining the number of samples to be taken.

Unfortunately, these assumptions do not fit the circumstances when the problem is to estimate the quantity and composition of solid waste arriving at a given site, such as a landfill, transfer station, incinerator, or a specific point in an industrial or commercial process. For one thing, the sample comes to the investigator, which is the reverse of the situation commonly described in standard sampling textbooks. Since the investigator has no control over the arrival of the sample elements, the sample observation process often is far from independent. Consider a situation where it is desired to weigh a random sample of vehicles arriving at a landfill. Suppose that it is feasible to weigh up to 10 vehicles per hour and that the average interarrival time of vehicles is 3.2 minutes. On the average, then, with either random or systematic sampling of one vehicle in ten, one vehicle would be weighed every 32 minutes. If it takes an average of 10 minutes to weigh a vehicle, then this is well within the capability of the sampling system. Unfortunately, vehicles do not have uniformly distributed arrival times.
There may be peak arrival periods when the number of trucks arriving and to be sampled overwhelms the weighing capability, and one is forced to "default" on weighing some of the vehicles selected by the sampling plan. One result is that fewer vehicles are weighed than the sampling plan calls for, thus reducing the precision of the estimate of solid waste quantity. More important, however, is that if the load weights of the defaulted samples differ appreciably from those of the nondefaulted samples, bias will be introduced. For example, at many landfills vehicles arriving toward the end of the day tend to have smaller load weights than those arriving at other times. Since fewer vehicles arrive towards the end of the day, the tendency would be to oversample these lightly loaded vehicles and to undersample the normally loaded vehicles arriving at peak hours, thus introducing a bias. One remedy is to sample at the full capacity of the weighing system; this cannot be accomplished with unbiased samples, but it can be accomplished with biased samples, using the estimation process to unbias the estimate. (This technique, by the way, is not unknown in traditional sampling theory; it is used in making estimates in stratified sampling.)

Traditional sampling theory assumes that there are sampling elements, i.e., discrete entities comprising the population about which inferences are to be drawn. When it comes to sampling solid waste for composition, however, there are no such discrete entities. There is, for example, no such thing as a basic unit of paper or of textiles. Thus, sampling procedures based upon discrete distributions (such as the multinomial or binomial) are not valid. Nonetheless, some basic unit weight of sample must be defined. In traditional cluster sampling theory, a balance is achieved between the within-cluster and between-cluster components of the total variability of an estimate.
If the cluster (i.e., in this context, a sample of given weight) is too small, then the between-cluster variability will be greater than the within-cluster variability and will result in a large sample variability. If the cluster weight is too large, however, the time and expense of sampling become greater. Further complicating this situation, the optimal sample weight is related to the size of the particles in the sample.

Finally, although assumptions that solid waste quantities follow normal distributions are justifiable in the estimation of solid waste quantity, such is not the case in estimating solid waste composition. For one thing, composition fractions are bounded, i.e., there are no components in solid waste that are present in fractions less than zero or greater than one. These boundaries are generally located close to the means of their distributions. Thus, solid waste component distributions are, at the very least, positively skewed (i.e., skewed to the right) and, at worst, are J-shaped. Nor does reliance on the Central Limit Theorem of statistics help much, since even averages of component fractions do not approach normality quickly, at least not within an economically feasible number of samples. Distributions of component averages still tend to be positively skewed. This characteristic precludes the rote application of the traditional statistical formulas for either the estimation of sample size or the construction of confidence intervals. Although transformations can be used to construct asymmetric confidence intervals, these are of little help when estimating sample size. A knowledge of the effect of positive skewness on the actual level of significance of a confidence interval, however, can be of help in determining the number of samples to take.

The purpose of this study was to develop sampling protocols for estimating solid waste quantity and composition that solve the problems enumerated above.
This included both sampling and estimating procedures. Since these methods, although not mathematically complex, are arithmetically tedious, a computer program (designed to be run on personal computers with modest capabilities) was developed to carry out all of the calculations required by the protocols. The program, called PROTOCOL, also contains routines that check input data for errors in coding, and an editor for preparing and modifying input data files.

CHAPTER 1
QUANTITY ESTIMATION

1.1 INTRODUCTION

This study is concerned with innovative quantitative approaches to sampling solid waste streams for quantity and composition, and with methods for estimating these values for a given waste shed over a period of time. In addition, computer-based programs have been created for implementing the statistical models selected for the determination of sample sizes for quantity and composition, and for producing the estimates (along with measures of their uncertainties) of quantity and composition once the data are collected.

There are two basic paths to the estimation of solid waste quantity and composition: (1) direct measurement, and (2) predictive models. Within both approaches there are many variations on a theme. In direct measurement, for example, one can measure at the point of generation (at each house, commercial establishment, etc.) or at the point of destination (at a landfill, incinerator, community recycling center, etc.). The former, however, is much more costly and time consuming, and the sampling protocol or plan is more complex to design and implement than in point-of-destination methods. Therefore, point-of-generation methods are neither economically feasible nor of sufficient accuracy for waste shed predictions.

Predictive models rely on surrogate or ancillary measurements to estimate waste quantity or composition.
On the one hand, there are the Leontief-type input-output models in which the materials entering a waste shed are placed on one dimension of a matrix, with their waste products placed on the other dimension. Using suitable conversion rates for each cell in the matrix and applying certain mathematical operations (such as matrix inversion), it is possible to estimate the quantities of the wastes produced. Although of some value on a global or strategic level, such approaches are infeasible at the local level because (1) the data bases simply are not available, and (2) the degree of accuracy achieved is inadequate for local objectives. Other predictive models are those in the form of equations (usually of the regression type) that relate quantity or composition to selected independent or predictor variables. (The simplest example equates quantity to the product of waste generated per capita and total population.) However, many variables affect waste generation: geography, climate, income level, local ordinances, etc. To obtain models with any statistical validity, much data would have to be gathered all over the country to estimate the parameters (constants) of the model. Also, the significance of the individual predictor variables and the values of the parameters would be expected to change with time. Therefore, such models would have to be maintained, a costly procedure that has little chance for support, even by government agencies. It appears, therefore, that the most pragmatic approach to the estimation of solid waste quantity and composition is direct measurement at the point of destination.

Upon examination, ordinary sampling techniques are inappropriate for the estimation of the quantity or composition of waste arriving at a destination site. For one thing, the sample comes to the investigator, which is usually the reverse of the situation commonly described in standard statistical sampling texts.
Once a scale is rented and the labor hired to make the measurements, it makes good economic sense to use the equipment and labor to the fullest extent possible. This is totally unlike survey sampling, where the variable cost of an interview is usually greater than the fixed cost (before sampling). In statistical sampling, the elements of a population are defined jointly with the population itself, i.e., the elements are the basic units that comprise and define the population. In the solid waste composition case, however, a truckload is simply too large to serve as a single sampling unit. Furthermore, in sampling for composition the situation resembles (but is not identical to) multinomial sampling, for which there is not much discussion in the statistical texts. Solid waste quantity and composition exhibit strong seasonal effects that must be taken into consideration in the sampling protocols, another topic not commonly encountered in the textbooks. Furthermore, traditional survey sampling methodology usually attempts to select unbiased samples. It can be shown, however, that it is more efficient to select biased samples and then correct for the bias in the estimation formulas employed. For all of these reasons, it is appropriate to take a fresh look at the problems of estimating solid waste quantity and composition.

1.2 SOME BASIC STATISTICAL CONSIDERATIONS

In statistical sampling, the elements of a population are defined jointly with the population itself, i.e., the elements are the basic units that comprise and define the population. When sampling for solid waste quantity, the population elements usually can be taken as the individual vehicle-loads. ("Vehicle-load" is used here to stress the fact that, for example, one might observe 500 loads delivered in one day by only 100 different vehicles. For simplicity, however, vehicle and vehicle-load will be used interchangeably in this study.)
A "parameter" is a numer- ical quantity that, perhaps with other parameters, defines the population completely. Suppose that the weight of solid waste in a vehicle is distributed normally with mean /t and standard devia- ------- Page 3 tion a. /i and a are the parameters of the distribution and, taken together, completely define the distribution. Precision and accuracy are two terms that are relevant to the parameters of a population; more specifically, they are applied to estimates of the parameters. An singl.e estimate is accurate if it is close to the true value, and a collection of estimates is accurate if their average value is close to the true value. The difference between the estimate (or average estimate) and the true value is known as the "bias". Precision, on the other hand, refers to the closeness of multiple estimates of a parameter. If the estimates do not vary much among themselves, the method of estimation is said to be "precise". The concept of precision is inversely related to that of variance or standard deviation in that the greater the precision, the smaller the variance or standard deviation of the estimate. These concepts are illustrated by the distributions shown in Figure 1. Each the individual drawing represents the distribution of repeated esti- mates of some true value, T, and it is assumed that the estimate of T is taken as the mean of the distributions shown. It will be noted that the accurate distributions (A and C) are centered on the true value, T, i.e., the bias is zero. The precise distribu- tions (A and B) are those with less dispersion, i.e., smaller standard deviation. Obviously, estimates that are both unbiased and of high precision are preferred. Accuracy is affected mainly by the selection process, i.e., the way by which the sampling units are selected; precision, on the other hand, is affected mainly by the measurement process and the sample size. 
Since the most difficult aspect of any protocol for estimating solid waste quantity involves the selection process, the major problem perforce is one of accuracy.

1.3 THE SAMPLE SELECTION PROCESS

It should be clear that if a destination site has or can be fitted with scale facilities that permit all vehicles to be weighed, there is no sample selection (or estimation) problem. It is assumed here that this is not the case. For the moment, the problems of trend and cyclical or seasonal variation of the quantity of solid waste delivered to the site will be deferred, and it will be assumed that sampling is to take place over a period of one week. (If the site operates only x days a week, then a one-week sample involves sampling on each of the x days.) When sampling over this period, variations in quantity due to hour-to-hour and day-to-day differences are accounted for in the estimate. Week-to-week differences, however, are far less important than month-to-month differences. Therefore, it makes little statistical sense to sample for periods of two or more consecutive weeks.

[FIGURE 1: CONCEPTS OF ACCURACY AND PRECISION. Panels: (A) accurate and precise; (B) inaccurate but precise; (C) accurate but imprecise; (D) inaccurate and imprecise.]

In order to achieve maximum sampling efficiency in a statistical sense, the one-week sampling periods must be spread throughout the year. This will be discussed in the next section.

The problem of selection bias can be addressed in either of two ways:

1. Make no assumptions about the nature of the arrivals of the vehicles, and take a random sample, or

2. Assume that the vehicles arrive in random order and take a systematic sample, i.e., one in which every kth vehicle arriving at the site is weighed.
The second approach is attractive for two reasons: (1) the procedure by which the trucks are selected for the sample is relatively simple, and (2) if we are interested in separate estimates for different types of vehicle (different sizes of trucks, commercial versus residential, etc.), the systematic sample can easily yield a proportionate sample, which has statistical advantages that will be explained later.

The basic problem with systematic sampling has to do with any departure from the assumed randomness in the arrival of the vehicles. These departures are of two kinds:

1. A monotonic trend may exist in the weights of the loads, e.g., the loads may increase with time over the week. Since a systematic sample consists of a random start followed by sampling each kth truck afterwards, the estimate will depend on the random start within the first interval. In Figure 2A, the low random start (solid dots) will produce a lower estimate than the high random start (open circles). The estimates in these two cases will be biased either low (solid dots) or high (open circles).

2. A cyclical or periodic trend may exist in the weights of the loads. In Figure 2B, if the random start happens to fall at the top of a cycle (solid dots), the estimate will be high; if it falls at the bottom (open circles), it will be low. Again, in either case the estimates will be biased.

It does not appear, however, that either of these events poses a real problem. The interval between vehicles is so short that neither monotonic nor periodic effects would influence the estimate significantly. Furthermore, simply changing the random start each day would average out the effect of any single start.

As was mentioned, systematic sampling consists of sampling every kth vehicle after a random start. The random start, r, is the rth vehicle in the arriving vehicle sequence, where r is a number, chosen at random, between 1 and k. The succeeding vehicles to be sampled are k+r, 2k+r, 3k+r, etc. The random start from 1 to k imparts to each vehicle the selection probability 1/k = f, where f is known as the sampling frequency. If we know the total number of vehicles, N, arriving during the sampling period, the total sample size, n, is given as n = fN = N/k. (Note that n will be an integer only if N is an integral multiple of k.)

[FIGURE 2A: MONOTONIC TREND and FIGURE 2B: CYCLICAL TREND. Both plot weight of vehicle-load against increasing time with sampling interval k; solid dots and open circles mark the samples produced by a low and a high random start, respectively.]

Although the concept of systematic sampling is relatively simple, a problem arises when sampling is weighing-scale-limited. For example, suppose it is known that approximately 750 vehicles arrive at a site over a five-day period, and a sample size of 75 is desired. Since k = N/n, the sampling interval, k, is every 10 vehicles (after a random start between 1 and 10). Suppose further that it is practicable to weigh up to 10 vehicles per hour. The average interarrival time of vehicles is (5 days)(8 hr/day)(60 min/hr)/750 = 3.2 minutes. On the average, then, one vehicle would be weighed every 32 minutes, apparently well within the capability of the weighing system. Unfortunately, vehicles do not have uniformly distributed arrival times. There may be peak arrival periods when the number of trucks arriving and to be sampled overwhelms the weighing capability. One is then forced to "default" on weighing some of the vehicles selected by the sampling plan. This has two effects. First, fewer vehicles are weighed than the sampling plan called for, thus reducing the precision of the estimate of solid waste quantity. Second, if the load weights of the defaulted samples differ appreciably from those of the nondefaulted samples, bias will be introduced.
For example, at many landfills vehicles arriving toward the end of the day tend to have smaller load weights than those arriving at other times (often this is simply a policy not to have vehicles stand overnight with refuse in them). Since fewer vehicles arrive towards the end of the day, the tendency would be to oversample the lightly loaded vehicles and to undersample the normally loaded vehicles. Thus, a bias is introduced.

The selection of a random sample is more complicated than that of a systematic sample. Assuming a selection probability equal to that of a systematic sample, a random sample is sampled with probability f = 1/k. If we desire a 10% sample (i.e., f = 0.10), then we must consider each vehicle entering the site and, using a probability generator of some sort, decide whether to sample it. (For example, use a table of random numbers from 1 to 1000; if the random number falls below 101, sample the vehicle; otherwise let it pass.) Not only is this more complicated than systematic sampling, it may result in a greater number of defaults, since random samples "bunch up" more than systematic samples. Thus, the systematic sample has a number of advantages over random sampling and is the method of choice in this study.

1.4 ESTIMATION OF MEANS, TOTALS, AND VARIANCES

For a simple random sample, or for a systematic sample where it can be assumed that the population contains neither significant trend nor significant cyclical components, the mean, x̄, is estimated by Equation 1.1,

$$\bar{x} = \sum_{i=1}^{n} w_i X_i = \frac{1}{n} \sum_{i=1}^{n} X_i \qquad [1.1]$$

where X_i is the ith observation and the weight, w_i, is equal to 1/n, where n is the total number of observations. The total, X̂, is estimated by Equation 1.2,

$$\hat{X} = N\bar{x} \qquad [1.2]$$

where N is the total number of elements in the population for the time sampled. The variance of the individual observations is computed from Equation 1.3,

$$\mathrm{var}(X_i) = \frac{\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2 / n}{n-1} \qquad [1.3]$$

and the variance of the mean and of the total are, respectively,

$$\mathrm{var}(\bar{x}) = \frac{(1-f)}{n}\,\mathrm{var}(X_i) \qquad [1.4]$$

$$\mathrm{var}(\hat{X}) = N^2\,\mathrm{var}(\bar{x}) \qquad [1.5]$$

where (1-f) is the finite sampling correction factor and f = n/N. These equations are well known and comprise the basic relationships for simple random sampling within a finite population (see Kish, 1965).

We can restate Equation 1.1 for the estimation of a mean by grouping, and then summing, the observations on an hourly basis, i.e.,

$$\bar{x} = \sum_{i=1}^{h} \sum_{j=1}^{n_i} w_{ij} X_{ij} = \frac{1}{n} \sum_{i=1}^{h} \sum_{j=1}^{n_i} X_{ij} \qquad [1.6]$$

where X_ij is the jth observation in the ith hour, n_i is the number of observations in the ith hour, h is the number of hours, n is the sum of the h n_i's, and w_ij = 1/n. We note that, drawing an analogy with Equation 1.3, the variance of X_ij in interval i, for all intervals where n_i is not equal to 1, is:

$$\mathrm{var}_i(X_{ij}) = \frac{\sum_{j=1}^{n_i} X_{ij}^2 - \left(\sum_{j=1}^{n_i} X_{ij}\right)^2 / n_i}{n_i - 1} \qquad [1.7]$$

The variance over all X_ij is the weighted (by the number of vehicles sampled in each hour) average of the hourly variances, i.e.,

$$\mathrm{var}(X_{ij}) = \frac{1}{n} \sum_{i=1}^{h} n_i\,\mathrm{var}_i(X_{ij}) \qquad [1.8]$$

Thus, the estimate of the variance of an individual load now becomes:

$$\mathrm{var}(X_{ij}) = \frac{1}{n} \sum_{i=1}^{h} n_i \left[ \frac{\sum_{j=1}^{n_i} X_{ij}^2 - \left(\sum_{j=1}^{n_i} X_{ij}\right)^2 / n_i}{n_i - 1} \right] \qquad [1.9]$$

(for all intervals where n_i is not equal to 1), and Equation 1.4 is slightly altered to:

$$\mathrm{var}(\bar{x}) = \frac{(1-f)}{n}\,\mathrm{var}(X_{ij}) \qquad [1.10]$$

Equations 1.2 and 1.5 remain unchanged. Note that when using Equation 1.9, if n_i = 1, the datum for that hour cannot be used in the calculation. Also note that the "hour" interval used in these equations really can be any time unit, e.g., half-hours, etc. Furthermore, even if an hour is selected as the interval, it need not begin on the hour (e.g., the first hour could be 7:30 AM to 8:30 AM, with the second hour 8:30 AM to 9:30 AM, etc.).
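The hourly-grouped estimators above can be sketched as follows (function names and the toy data are our own; we also interpret the n in Equation 1.9 as counting only the hours actually used, since hours with a single observation must be skipped):

```python
def load_variance(hours):
    """Equation 1.9: estimated variance of an individual load, the
    n_i-weighted average of the within-hour variances (Equation 1.7).
    Hours with a single observation are skipped, and dropped from n."""
    num = 0.0
    n_used = 0
    for x in hours:
        ni = len(x)
        if ni < 2:
            continue                      # s_i^2 undefined when n_i = 1
        s2 = (sum(v * v for v in x) - sum(x) ** 2 / ni) / (ni - 1)
        num += ni * s2
        n_used += ni
    return num / n_used

def estimates(hours, N):
    """Equations 1.6, 1.10, 1.2, and 1.5 for hourly grouped data.
    `hours` is a list of lists of load weights; N is the total number
    of vehicles arriving during the sampled period."""
    n = sum(len(x) for x in hours)
    xbar = sum(sum(x) for x in hours) / n                  # Eq 1.6
    f = n / N                                              # sampling fraction
    var_xbar = (1 - f) / n * load_variance(hours)          # Eq 1.10
    return xbar, var_xbar, N * xbar, N ** 2 * var_xbar     # Eqs 1.2, 1.5

# Toy data: two hours with two weighings each, N = 40 vehicles in all.
xbar, var_xbar, total, var_total = estimates([[10, 12], [11, 13]], N=40)
```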
1.5 CORRECTION FOR DEFAULTING (UNDERSAMPLING)

We first consider the case where undersampling takes place because of burdens placed upon the weighing system, i.e., although the sampling plan calls for specific sample sizes to be obtained each hour, the weighing system cannot keep up with the requirements and we default on taking part of the sample. If there is lack of randomness in the vehicle-load data (as there would be, for example, if the loads tended to weigh less toward the end of the day or if certain-sized vehicles tended to arrive at a destination site at particular times of the day), the equal weighting (w_ij = 1/n) of the observations in Equation 1.6 would bias the estimate of the mean.

Where there are no constraints on the weighing system, random or systematic sampling can be considered equivalent to proportionate sampling, where the number of samples in a sampling stratum (the ith hour, in this case) is proportional to the size of the stratum. Thus,

$$n_i = f m_i \qquad [1.11]$$

where m_i is the total number of vehicles arriving in the ith hour. Note, however, that in this case,

$$w_{ij} = \frac{1}{n} = \frac{1}{fN} = \frac{m_i}{n_i N} \qquad [1.12]$$

since f = n_i/m_i for every hour under proportionate sampling. When defaulting occurs, n_i/m_i is no longer constant across hours, and it is the weights m_i/(n_i N), not 1/n, that remove the bias. Introducing these new weights of Equation 1.12 into Equation 1.6 we obtain:

$$\bar{x} = \sum_{i=1}^{h} w_i'\,\bar{x}_i \qquad [1.13]$$

where x̄_i is the mean in the ith hour and w_i' = m_i/N. The important observation to be made here is that it can be shown (Appendix A.1) that, even with defaulting, Equation 1.13 produces an unbiased estimate of the true mean. Since the weight of the ith hour is now m_i/N rather than n_i/n, Equation 1.9 now becomes:

$$\mathrm{var}(X_{ij}) = \sum_{i=1}^{h} \frac{m_i}{N} \left[ \frac{\sum_{j=1}^{n_i} X_{ij}^2 - \left(\sum_{j=1}^{n_i} X_{ij}\right)^2 / n_i}{n_i - 1} \right] \qquad [1.14]$$

(again for all intervals where n_i is not equal to 1). To obtain the variance of the estimate in Equation 1.13, we use the propagation of error formula: for a function f of random variables x_1, x_2, ..., x_n,

$$\mathrm{var}[f(x_1,\ldots,x_n)] = \sum_{i=1}^{n}\left(\frac{\partial f}{\partial x_i}\right)^2 \mathrm{var}(x_i) + \sum_{i=1}^{n}\sum_{j\neq i}\frac{\partial f}{\partial x_i}\frac{\partial f}{\partial x_j}\,\mathrm{cov}(x_i,x_j) \qquad [1.15]$$

Applying the propagation of error formula to Equation 1.13 (and assuming that the covariance terms are zero) we obtain:

$$\mathrm{var}(\bar{x}) = \frac{1}{N^2}\sum_{i=1}^{h}\frac{m_i^2\,(1-n_i/m_i)}{n_i}\left[\frac{\sum_{j=1}^{n_i} X_{ij}^2-\left(\sum_{j=1}^{n_i} X_{ij}\right)^2/n_i}{n_i-1}\right] \qquad [1.16]$$

for all intervals where n_i is not equal to 1. The rather complicated derivation of Equation 1.16 is given in Appendix A.2. Equations 1.2 and 1.5 still remain unchanged.
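Equations 1.13 and 1.16 can be sketched as follows (names and toy data are our own; the Appendix A derivations are not reproduced here):

```python
def unbiased_mean(hours, arrivals):
    """Equation 1.13: weight each hourly sample mean by m_i/N, where
    m_i is the number of vehicles arriving in hour i and N = sum(m_i).
    Unbiased even when peak hours were undersampled (defaulting)."""
    N = sum(arrivals)
    return sum(m / N * (sum(x) / len(x)) for x, m in zip(hours, arrivals))

def var_unbiased_mean(hours, arrivals):
    """Equation 1.16:
    var(xbar) = (1/N^2) sum_i m_i^2 (1 - n_i/m_i)/n_i * s_i^2,
    where s_i^2 is the within-hour variance; hours with n_i = 1 are
    skipped, as the text requires."""
    N = sum(arrivals)
    v = 0.0
    for x, m in zip(hours, arrivals):
        ni = len(x)
        if ni < 2:
            continue
        s2 = (sum(t * t for t in x) - sum(x) ** 2 / ni) / (ni - 1)
        v += m * m * (1 - ni / m) / ni * s2
    return v / N ** 2

# Toy day: a light hour fully sampled, a busy hour undersampled.
hours = [[10, 12], [20, 22]]         # weighed loads per hour
arrivals = [10, 30]                  # m_i: vehicles arriving per hour
xbar = unbiased_mean(hours, arrivals)
v = var_unbiased_mean(hours, arrivals)
```

Note the extra data requirement: unlike plain systematic sampling, these formulas need the hourly arrival counts m_i as well as the weighed loads.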
Note that this method of sampling requires another data set in addition to that required by ordinary systematic sampling, i.e., the total number of vehicles arriving each hour.

Another way of describing the variability of x̄ is in terms of its coefficient of variation, i.e., the standard deviation as a fraction of the mean. Thus the within-week coefficient of variation (expressed as a percentage), c_w, of x̄ is

$$c_w = \frac{100\,[\mathrm{var}(\bar{x})]^{1/2}}{\bar{x}} \qquad [1.17]$$

Typical values for this coefficient of variation range, depending upon the number of trucks sampled during a sampling week, from 2 to 5%. A confidence interval around the mean is

$$\bar{x} \pm t\,[\mathrm{var}(\bar{x})]^{1/2} \qquad [1.18]$$

where t is the Student t-value at some significance level, α, and degrees of freedom, df. Defining "error" as one-half of this confidence interval, i.e., t[var(x̄)]^{1/2}, the error of the estimate of x̄ expressed as a percentage of x̄ is given by:

$$E = \frac{100\,t\,[\mathrm{var}(\bar{x})]^{1/2}}{\bar{x}} \qquad [1.19]$$

According to Bennett and Franklin (1954), if MS is a mean square and

$$MS = a_1 MS_1 + a_2 MS_2 + a_3 MS_3 + \ldots$$

where MS_i is based upon d_i degrees of freedom, then the effective degrees of freedom, EDF, for MS is:

$$EDF = \frac{MS^2}{\sum_i \dfrac{(a_i MS_i)^2}{d_i}} \qquad [1.20]$$

Defining, for all i where n_i is not equal to 1,

$$MS_i = \frac{\sum_{j=1}^{n_i} X_{ij}^2 - \left(\sum_{j=1}^{n_i} X_{ij}\right)^2 / n_i}{n_i - 1} \qquad [1.21]$$

and applying the Bennett and Franklin relationship, Equation 1.20, to Equation 1.16, the effective degrees of freedom for constructing confidence intervals about means or totals is:

$$EDF = \frac{\left(\sum_i a_i MS_i\right)^2}{\sum_i \dfrac{(a_i MS_i)^2}{n_i - 1}} \qquad [1.22]$$

where a_i = m_i^2 (1 - n_i/m_i)/(n_i N^2), the sums are over all i where n_i is not equal to 1, and N = Σm_i where the sum is over all i where n_i is not equal to 1. This value of EDF should be used for the degrees of freedom when determining the t-value for Equations 1.18 or 1.19.

1.6 FULL SAMPLING

Typically, depending upon the method of readout and the number of axles involved, it may require up to 30-45 minutes to weigh one truck using wheel scales. However, platform scales are frequently 20 to 30 times as fast.
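A sketch of Equation 1.22 (the function name is our own; hours with n_i = 1 are dropped from the sums and from N, following the text), whose result would then supply the degrees of freedom for the t-value in Equations 1.18 and 1.19:

```python
def effective_df(hours, arrivals):
    """Equation 1.22 (a Satterthwaite-type formula): effective degrees
    of freedom for the variance estimate of Equation 1.16, with
    a_i = m_i^2 (1 - n_i/m_i) / (n_i N^2) and MS_i = s_i^2 on n_i - 1
    degrees of freedom.  Hours with n_i = 1 are excluded, as are their
    m_i from N."""
    used = [(x, m) for x, m in zip(hours, arrivals) if len(x) > 1]
    N = sum(m for _, m in used)
    num = 0.0    # sum of a_i * MS_i  (this is var(xbar), Eq 1.16)
    den = 0.0    # sum of (a_i * MS_i)^2 / (n_i - 1)
    for x, m in used:
        ni = len(x)
        s2 = (sum(t * t for t in x) - sum(x) ** 2 / ni) / (ni - 1)
        term = m * m * (1 - ni / m) / (ni * N * N) * s2
        num += term
        den += term * term / (ni - 1)
    return num * num / den

# With only one hour contributing, Equation 1.22 collapses to n_1 - 1:
edf = effective_df([[10, 12, 14], [5]], [10, 5])
```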
When using platform scales, then, it will be possible to oversample some intervals, if not all of them. As was mentioned previously, once a scale is rented and the labor hired to make the measurements, it makes good economic sense to use the equipment and labor to the fullest extent possible. Since some intervals may be oversampled and others undersampled, sampling to the fullest capacity of the weighing system is defined in this study as "full sampling". Full sampling means that, when finished weighing the current vehicle, we sample the next available vehicle. Within a reasonably short sampling interval (e.g., one-half or one hour) it can be assumed that arrivals are random; indeed, this is the assumption of traditional systematic sampling. Under such circumstances, sampling the next available vehicle is identical to systematic sampling with frequent, but random, starts. Two questions then arise: (1) How are the estimates in full sampling adjusted for bias?, and (2) Is there any advantage to full sampling? The answer to the first question is simply that we use the same equations as for defaulting or undersampling, i.e., Equations 1.13 and 1.16 (the equations for totals, 1.2 and 1.5, also apply). As for the second question, when we apply full sampling, the variance of the estimated mean or total decreases. (The proof of this is given in Appendix A.3.) Thus, assuming that the scales are rented for a given time period and that no extra people need be hired, it always pays to sample fully.

The use of these formulas is illustrated in Table 1, where data from a random or systematic sample are presented for an 8-hour sampling period from a population with true mean = 18605.55. (The population consisted of eight triangular distributions in which the mean started low at hour one, rose to a maximum at hours four and five, and then finished low at hour eight.)
For a desired sampling frequency, f, of 0.1, the number of trucks sampled each hour would have to be 1/1/2/8/10/3/3/2, respectively. However, it was assumed that it was not physically possible to sample more than six trucks in any one-hour period; therefore, the actual number of trucks sampled each hour was 1/1/2/6/6/3/3/2, respectively. Since the sample is biased, the estimate of 17560.46 for the mean obtained by using Equation 1.1 is also biased. Using Equation 1.13, however, the unbiased estimate of the mean is 18774.53, which is much closer to the true value of 18605.55. Note that 24 trucks were sampled (1+1+2+6+6+3+3+2 = 24). If this were simple random sampling, the degrees of freedom would be 24-1 or 23. Since the effective degrees of freedom was 14.64, the efficiency of the degrees of freedom is 14.64/23 = 0.6365 or 63.65%.

TABLE 1: CALCULATIONS FOR EXAMPLE OF SECTION 1.6

  i    mi    ni    Σj xij       Σj xij^2      SSi            MSi
  1    10     1      5734.19          -              -              -
  2    10     1     10155.29          -              -              -
  3    20     2     36321.28     659856700      239110.10      239110.10
  4    80     6    144738.20    3587637000    96114180.00    19222840.00
  5   100     6    140308.40    3305738000    24664160.00     4932832.00
  6    30     3     45900.44     728406800    26123340.00    13061670.00
  7    30     3     31825.14     338253500      640261.20      320130.60
  8    20     2      6468.04      21095340      177597.50      177597.50
Tot:  300    24    421450.97

SSi = Σj xij^2 - (Σj xij)^2/ni;  MSi = SSi/(ni - 1)

  i      ai          ai·MSi     (ai·MSi)^2/(ni-1)  (mi/280)MSi     wi      wi·Σj xij
  1       -              -                 -               -       .0333     191.13
  2       -              -                 -               -       .0333     338.50
  3   .002295918      548.98          301379.29        17079.29    .0333    1210.70
  4   .012585030   241920.00     11705057280.00      5492239.00    .0444    6432.80
  5   .019982990    98572.75      1943317409.00      1761726.00    .0556    7794.91
  6   .003443877    44982.79      1011725698.00      1399465.00    .0333    1530.01
  7   .003443877     1102.49          607742.71        34299.71    .0333    1060.83
  8   .002295918      407.75          166260.53        12685.53    .0333     215.60
Tot:               387534.80     14661175768.00      8717495.00             18774.53

ai = mi^2(1 - ni/mi)/(ni·280^2);  wi = mi/(300·ni)

σ = (8717495.00)^(1/2) = 2952.54
Using Equation 1.1,  x = 421450.97/24 = 17560.46
Using Equation 1.16, std(x) = (387534.80)^(1/2) = 622.52
Using Equation 1.13, x = 18774.53
Using Equation 1.4,  std(x) = (1 - 22/280)^(1/2)(2952.54)/(22)^(1/2) = 604.25
Using Equation 1.17, cw = 100(622.52)/17560.46 = 3.5%
Using Equation 1.22, EDF = (387534.80)^2/14661175768 = 10.24

Table 2 shows the averages of 1000 daily samples (i.e., 1000 days of Monte Carlo simulation) from the same population used for Table 1 for three different cases: (1) random or systematic sampling, no scale constraints; (2) random or systematic sampling with scale constraints (only six vehicles can be weighed in any hour); and (3) full sampling (again assuming that six vehicles can be weighed in any hour).

TABLE 2: AVERAGES FOR 1000 DAILY SAMPLES

                         SYSTEMATIC OR        SYSTEMATIC OR
                         RANDOM SAMPLING,     RANDOM SAMPLING,
                         NO SCALE             WITH SCALE           FULL
                         CONSTRAINTS          CONSTRAINTS          SAMPLING
SAMPLE                   {1/1/2/8/10/3/3/2}   {1/1/2/6/6/3/3/2}    {6/6/6/6/6/6/6/6}
SAMPLE SIZE                   30                   24                   48
POPULATION σ [1.14]*        3426.37              3409.62              3317.31
TRUE POPULATION σ                               3351.65
MEAN [1.13]*               18613.57             18630.06             18638.08
BIASED MEAN [1.1]*         18613.57             17545.41             14080.19
TRUE MEAN                                      18605.55
σ OF MEAN [1.16]*            614.29               715.59               595.47
TRUE σ OF MEAN               578.68               701.72               672.99

* Numbers in brackets refer to equations used.

Note that the estimate of the mean can be quite biased if the ordinary sampling formulas, e.g., Equation 1.1, are used when defaulting or undersampling occurs. The advantages of full sampling are also clear, since the standard deviation of the estimated mean has been reduced by approximately 17% from that obtained with systematic or random sampling.
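The arithmetic of the biased estimate (Equation 1.1) and the arrival-weighted, unbiased estimate (Equation 1.13) can be reproduced directly from the hourly arrival and sample counts of Table 1. A minimal sketch; variable names are illustrative:

```python
# Hourly data from Table 1: vehicles arriving (m_i), vehicles
# sampled (n_i), and the hourly sums of sampled load weights.
m = [10, 10, 20, 80, 100, 30, 30, 20]          # arrivals per hour
n = [1, 1, 2, 6, 6, 3, 3, 2]                   # sampled per hour
hour_sums = [5734.19, 10155.29, 36321.28, 144738.20,
             140308.40, 45900.44, 31825.14, 6468.04]

N = sum(m)                                     # 300 vehicles in the period
# Equation 1.1 (biased when hours are under-sampled): plain average.
biased = sum(hour_sums) / sum(n)
# Equation 1.13: weight each hour's mean by its share of arrivals,
# x = sum_i (m_i/N)(hourly mean) = sum_i m_i/(N*n_i) * sum_j x_ij.
unbiased = sum(mi / (N * ni) * s for mi, ni, s in zip(m, n, hour_sums))
print(round(biased, 1), round(unbiased, 1))    # approx. 17560.5 18774.5
```

The weighted estimate recovers the 18774.53 shown in Table 1, against the biased 17560.46 of the plain average.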
Since the within-week sampling coefficient of variation, cw, is important for sample size determinations, the model of Tables 1 and 2 was used to simulate (using 100 iterations) a one-week sample for different sample sizes. The results are shown in Table 3. In general, the cw is rather small, i.e., between 1-2%.

1.7 SEASONALITY

The sampling methodology described in the previous sections is based upon a continuous sampling period such as successive days or successive weeks. Thus, hour-to-hour and day-to-day effects are accounted for in the estimation process. It is well-known, however, that the quantity of solid waste generated frequently varies significantly from month to month. (Municipal solid waste generation, for example, is low during the months of January, February, November, and December, and peaks during June, July, and August, although there are some variations depending upon geographical location. See Table 4.) Thus it is not sufficient to sample one week out of the year to estimate generation for the complete year.

TABLE 3: AVERAGE WITHIN-WEEK SAMPLING COEFFICIENTS OF VARIATION

SAMPLE SIZE    SAMPLE SIZE    COEFFICIENT OF
(ONE WEEK)     (HOURLY)       VARIATION, %
    96              2              2.11
   144              3              1.78
   192              4              1.48
   240              5              1.31
   288              6              1.20
   336              7              1.11
   384              8              1.02
   432              9              0.98
   480             10              0.91
   528             11              0.87
   576             12              0.83
   624             13              0.81
   672             14              0.78
   720             15              0.74
   768             16              0.72
   816             17              0.69

Note: Number of iterations = 100 weeks. The model assumes sampling 8 hours/day, 6 days/week.

Theoretically, if one knew the weeks of the year in which the curve of weekly generation crossed the horizontal line representing the average weekly generation for the year, we could schedule a one-week sampling period for one of these intersection points and be confident that our estimate for that week, multiplied by the number of weeks in that year, would be identical to the quantity produced throughout the year. (This is termed the "critical point" approach.)
Information about these critical points is, unfortunately, not available before the fact. Even if it were available for the previous year, there is no guarantee that the critical points will be the same for the current year. Therefore, we are forced to sample additional weeks throughout the year. Suppose we sample r weeks out of the year and determine, for each of these r weeks, the total quantity of solid waste, Xk, arriving at the site in the kth week. Let y be the average total weekly quantity over the r weeks. Assuming that there are 365/7 = 52.1429 weeks per year, then an estimate of the total quantity for the year, Y, is obtained by:

Y = (52.1429)y = (52.1429/r)ΣXk    [1.23]

where the sum is over the r sampled weeks.

TABLE 4: TYPICAL SEASONAL VARIATIONS IN SOLID WASTE GENERATION

                 WASTE GENERATION AS % OF THE MEAN
LOCATION        LOW    MONTH      HIGH    MONTH
CONNECTICUT      85     NOV        111     MAY
ENGLAND          67     JUL        132     JAN
HAWAII           84     NOV        118     JUN
KENTUCKY         85     MAR        125     AUG
MISSOURI         79     FEB        113     JUL
OHIO             87     JAN        113     JUL
ONTARIO          90     MAR        106     JUN
VIRGINIA         80     JAN        125     MAY
WASHINGTON       86     FEB        108     MAY
WISCONSIN        81     FEB        131     JUN

Applying the propagation of error formula (and assuming that the covariance terms are zero),

var(Y) = (52.1429/r)^2 Σ var(Xk)    [1.24]

Applying the Bennett and Franklin relationship, Equation 1.20, to Equation 1.24, the effective degrees of freedom, EDFy, for use in calculating confidence intervals for the total amount of waste is given as:

EDFy = [(52.1429/r)^2 Σ var(Xk)]^2 / {(52.1429/r)^4 Σ [var(Xk)]^2/EDFk}    [1.25]

Since the coefficient of variation is equal to the standard deviation divided by the average, the coefficient of variation of the between-week differences, cb, is given as:

cb = sb/y    [1.26]

where sb is the standard deviation of the r weekly totals. To obtain some idea of typical values of cb, a Monte Carlo simulation was performed (based upon real data obtained from a Boston solid waste site), the results of which are shown in Table 5. A conservative value of 2% was used for cw, and the simulation involved 1000 iterations.
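Equations 1.23 and 1.24 can be sketched in a few lines. The weekly figures below are invented for illustration only:

```python
# Estimate the yearly total (Equation 1.23) and its variance
# (Equation 1.24) from r sampled weekly totals X_k and their variances.
WEEKS_PER_YEAR = 365 / 7                        # 52.1429

def yearly_total(X, var_X):
    r = len(X)
    scale = WEEKS_PER_YEAR / r
    Y = scale * sum(X)                          # Equation 1.23
    var_Y = scale ** 2 * sum(var_X)             # Equation 1.24
    return Y, var_Y

# Four hypothetical sampled weeks (tons) with their variances.
X = [1800.0, 2100.0, 2300.0, 1900.0]
vX = [900.0, 1100.0, 1000.0, 950.0]
Y, var_Y = yearly_total(X, vX)
print(round(Y, 1))                              # approx. 105589.3
```

The effective degrees of freedom of Equation 1.25 then follow from applying the Bennett and Franklin function of Equation 1.20 to the individual var(Xk) terms.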
Two sampling protocols were simulated: (1) random sampling, and (2) systematic sampling. Sampling frequencies of from two to ten times per year were investigated. The cb of random sampling should closely approximate the true cb (columns 2 and 3 in Table 5), which it does for all sampling times. However, except for a sampling frequency of two (where random sampling is identical to systematic sampling), systematic sampling is superior to random sampling for all sampling frequencies. At a sampling frequency of four times per year, for example, the cb of systematic sampling is only 60% that of random sampling. Since both sampling methods provide estimates of less than 1% deviation from the true total (columns 5 and 6), systematic sampling is clearly the preferred method for sampling throughout the year. Table 5 suggests that typical values of cb vary between 3 and 4% for sampling frequencies over the range of four to eight times per year.

TABLE 5: MONTE CARLO SIMULATION - SEASONALITY

  (1)        (2)      (3)       (4)       (5)        (6)
# WEEKS      cb       cb        cb       TOTAL      TOTAL
SAMPLED     TRUE    RANDOM    SYSTEM    RANDOM*    SYSTEM*
   2        10.7     10.0       9.9      +0.01      -0.27
   3         8.7      7.8       5.3      +0.10      +0.37
   4         7.6      7.3       4.1      +0.10      +0.00
   5         7.0      6.4       3.6      -0.01      +0.04
   6         6.2      5.9       3.2      +0.33      -0.03
   7         5.7      5.6       2.5      -0.28      +0.06
   8         5.3      5.2       3.0      +0.08      -0.06
   9         5.0      4.7       2.2      -0.10      +0.02
  10         4.8      4.6       2.4      -0.23      +0.02

* Estimated total quantity as percent deviation from true total.
Note: For all simulations in this table, cw = 0.02.

Using the data of Tables 3 and 5, Table 6 was prepared. This table shows the percentage error (assuming systematic full sampling and a 90% confidence level) of the estimate of total yearly quantity of solid waste for various sample sizes and sampling frequencies.

1.8 STRATIFIED SAMPLING

Stratified sampling involves dividing the population into distinct subpopulations called strata. The strata could be based upon vehicle size, load type (residential, commercial, industrial, etc.), or any other attribute.
Within each stratum a separate sample is selected and a separate estimate made. The stratum means and variances are then appropriately weighted to form a combined estimate for the entire population. Generally, stratified sampling is used to: (a) increase the precision of the estimate, (b) afford different sampling methods within the strata, or (c) provide separate estimates for different population elements. With regard to increasing the precision of the estimate, theory (see Kish, 1965) tells us that grouping like elements within a stratum (for example, one stratum might consist of small, private vehicles, and another might consist of municipal or commercial, rear-loading, packer-type vehicles) increases precision. With regard to affording different sampling methods within the strata, one might use entirely different scales for small, private vehicles than for municipal or commercial, rear-loading, packer-type vehicles. With regard to providing separate

TABLE 6: ERROR OF THE ESTIMATE OF TOTAL YEARLY QUANTITY OF SOLID WASTE (90% CONFIDENCE LEVEL & SYSTEMATIC FULL SAMPLING) AS A FUNCTION OF SAMPLE SIZE AND SAMPLING FREQUENCY

                      NUMBER OF TRUCKS SAMPLED PER WEEK
SAMPLING        800    492    341    244    189    157    127     97
FREQUENCY,
WEEKS/YEAR            NUMBER OF TRUCKS SAMPLED PER HOUR
               16.7   10.3    7.1    5.1    3.9    3.3    2.7    2.0
    3          5.1%   5.1%   5.2%   5.2%   5.2%   5.3%   5.4%   5.4%
    4          3.4    3.5    3.5    3.6    3.6    3.7    3.7    3.8
    5          2.7    2.7    2.8    2.8    2.9    2.9    3.0    3.1
    6          2.2    2.2    2.3    2.3    2.4    2.4    2.5    2.6
    7          1.6    1.7    1.7    1.8    1.8    1.9    2.0    2.0
    8          1.8    1.8    1.9    1.9    2.0    2.0    2.1    2.1
    9          1.3    1.3    1.4    1.4    1.5    1.5    1.6    1.7
   10          1.3    1.3    1.4    1.4    1.5    1.5    1.6    1.7

Notes: (1) Table percentages are the errors as a percentage of the true total yearly quantity of solid waste; (2) Table percentages can be converted to other confidence levels by the formula, new = (old*z)/1.645, where "old" is the old % error, "new" is the % error at the new confidence level, and z is the standard normal deviate at the new confidence level.
(3) Trucks sampled per week was converted to trucks sampled per hour by assuming 8 hours/day and 6 days/week sampling.

quantity estimates for different population elements, a "quantity" unit, such as a pound, might not be identical for all elements. For example, waste composition varies widely between municipal and industrial waste. If the standard deviation of a vehicle-load in a stratum is proportional to the average weight of the load in that stratum (a not unreasonable assumption), then a sampling plan that makes the sampling effort proportional to the total quantity contribution of each stratum is known as Neyman allocation (see Kish, 1965). An example of a comparison between sampling with and without stratification is shown in Table 7. The model used assumed that 10% of the vehicles had an average net weight of 306 lbs (representing small vehicles, such as pickup trucks, station wagons, etc.), and 90% had an average net weight of 15000 lbs (representing typical commercial vehicles). The load distributions of the two vehicle types were assumed to be normal, with the standard deviation equal to 10% of their means. The results in Table 7 represent a simulation involving a total of 5000 vehicles, and the stratification used was Neyman allocation. The estimated means are close to the theoretical mean for sampling both with and without stratification. The standard deviations (and coefficients of variation) of these estimates, however, are quite different. The superiority of the stratified estimate is quite evident.

TABLE 7: STRATIFIED VERSUS NON-STRATIFIED SAMPLING

                             MEAN       σ        cv
THEORETICAL                 13531     1423     10.5%
NON-STRATIFIED SAMPLING     13552     4645     34.3%
STRATIFIED SAMPLING         13547     1444     10.7%

cv = Coefficient of Variation.
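The precision gain from Neyman-type allocation in the two-stratum vehicle model above can also be seen analytically. The sketch below compares the standard error of an unstratified sample mean with that of a stratified mean whose sample is allocated in proportion to Wh·Sh (Neyman allocation, ignoring finite-population corrections); the total sample size of 100 is an arbitrary illustration, not a figure from the report:

```python
import math

# Two-stratum model from Table 7: 10% small vehicles (mean 306 lbs),
# 90% commercial vehicles (mean 15000 lbs), sd = 10% of each mean.
W = [0.10, 0.90]             # stratum weights
mu = [306.0, 15000.0]        # stratum mean loads, lbs
S = [30.6, 1500.0]           # stratum standard deviations, lbs

n = 100                      # illustrative total sample size
pop_mean = sum(w * m for w, m in zip(W, mu))
# Population variance = within-stratum part + between-stratum part.
pop_var = sum(w * (s**2 + (m - pop_mean)**2) for w, m, s in zip(W, mu, S))

se_srs = math.sqrt(pop_var / n)                              # unstratified
se_neyman = sum(w * s for w, s in zip(W, S)) / math.sqrt(n)  # Neyman
print(round(pop_mean, 1), round(se_srs, 1), round(se_neyman, 1))
```

Because the between-stratum spread dominates the population variance, stratification removes most of the sampling error, which is the effect Table 7 shows by simulation.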
Thus, if the population of vehicles consists of two or more subpopulations with significantly different means, it is highly advisable to (1) sample the subpopulations separately, making the samples proportional to the total quantity contribution of each stratum (i.e., Neyman allocation), (2) make separate total quantity estimates for these subpopulations, and (3) add them to arrive at an overall population quantity estimate. A population quantity standard deviation estimate, std(xc), can be made by combining the subpopulation standard deviations by the following formula:

std(xc) = [Σdi·si^2/Σdi]^(1/2)    [1.27]

where si is the standard deviation of the estimate of total quantity for subpopulation i, and di is the degrees of freedom for that estimate. Note that std(xc) has Σdi degrees of freedom.

REFERENCES

1. Bennett, C.A., and Franklin, N.L., Statistical Analysis in Chemistry and the Chemical Industry. John Wiley & Sons, New York, N.Y., 1954.

2. Deming, W.E., "Some Variances in Random Sampling", in Some Theory of Sampling. John Wiley & Sons, New York, N.Y., 1950, pp. 127-134.

3. Kish, L., Survey Sampling.
John Wiley & Sons, New York, N.Y., 1965.

NOTATION

cv = coefficient of variation, stratified sampling
cb = between-week coefficient of variation
cw = within-week coefficient of variation
di = degrees of freedom for ith subpopulation
E = error, i.e., one-half of a confidence interval
f = weekly sampling frequency, n/N
fi = sampling frequency in ith hour, ni/mi
h = total number of hours sampled during the week
i = index for hours
j = index for vehicles
k = index for weeks
mi = number of vehicles arriving in ith sampling hour
n = total number of vehicles sampled
N = total number of vehicles in week
ni = number of vehicles sampled in the ith hour
r = number of weeks sampled during the year
si = subpopulation quantity standard deviation estimate
std(xc) = population quantity standard deviation estimate
var(xij) = variance of an individual load measurement
wij = weighting factor for an individual observation, 1/n
X = total weight of all vehicle-loads for the week
x = average vehicle-load weight for the week
xi = average vehicle-load weight in the ith hour
xij = jth vehicle-load weight in ith hour
Xk = total quantity of solid waste in the kth week
y = average total weekly quantity during r sampling weeks
Y = total quantity for the year

CHAPTER 2

COMPOSITION ESTIMATION

2.1 INTRODUCTION

The estimation of waste composition is a more difficult task than the estimation of waste quantity for at least four reasons:

1. Complexity: The estimation of waste composition involves the measurement of more than one attribute.

2. Cost: Weighing a collection vehicle is a relatively low-cost procedure. Selecting a sample of waste and separating it into a number of components is both a more expensive and a more unpleasant procedure.

3. Statistical Problems: Unlike the estimation of waste quantity, reliance on the Central Limit Theorem of statistics in order to assume normality (and hence permitting simpler calculations) is not always justified.
Also, there are problems of what constitutes a sample unit and how to obtain random samples of such units.

4. Small Sample Size: Because of the time and expense required to sample for waste composition, there are fewer data available regarding this aspect of waste characterization than for waste quantity. Hence our estimates have less precision than those of waste quantity.

In quantity estimation, the sample unit is clearly the vehicle. One weighs the entire vehicle because it makes no sense to select and weigh just a portion of a vehicle-load. In composition sampling, however, we usually cannot separate an entire vehicle-load because of time and economic considerations. The usual commercial vehicle-load is between 10,000 and 20,000 lbs, and is obviously too large a sample unit to separate. It must be remembered that separation has to be done manually. As a rough approximation, typically one man can separate 65-300 lb of raw municipal solid waste in one hour, depending upon the number and type of components desired. Clearly, very small sample weights make no sense physically. A large piece of wood or metal, for example, could never physically be included in a sample of, say, 5 lbs. Furthermore, small sample weights tend to be more homogeneous than the population being sampled, i.e., the smaller the sample weight, the greater the likelihood that it consists entirely of wood or paper or glass, etc. Following a well-known principle of cluster sampling (see Kish, 1965), this homogeneity tends to increase the variance of the sample. Indeed, Klee and Carruth (1970) found that the smaller the sample weight, the greater the variance of raw municipal solid waste samples. However, the relationship was not linear: under 200 lbs the sample variance increased rapidly; over 300 lbs it increased much more slowly.
Accordingly, they recommended that a sample weight of 200-300 lbs be used for general municipal solid waste sampling, and this recommendation has been widely adopted by other investigators for raw refuse streams.

The sample weight recommendation of 200-300 lbs is appropriate for raw municipal refuse only. Clearly, optimal sample weight is related to the particle size of the material sampled, but the relationship is not linear. Other investigators (see Trezak, 1977) have found that processed waste stream particle size distributions are adequately described by functions based upon the exponential distribution, the Rosin-Rammler equation, for example. Thus, the author recommends the following model based upon the exponential function to determine optimal sample weights for other than raw municipal refuse:

Y = X·e^(βX)    [2.1]

where Y is the optimal sample weight in pounds, and X is the characteristic particle size of the material to be sampled, in inches. The characteristic particle size is the screen size at which 63.2% of the material passes through. The boundary condition for Equation 2.1 is that at Y = 250 lbs (the median of the 200-300 lbs recommended for raw municipal solid waste), the characteristic particle size is 18 inches. The value of β that satisfies this condition is 0.146. Thus,

Y = X·e^(0.146X)    [2.2]

To illustrate the use of this equation, the output of the Appleton West shredder was found to have a characteristic size of 2.2 in. (see Table 10 in Savage and Shiflett, 1979). To sample the output from this shredder we would need to take samples of weight Y = 2.2e^(0.146·2.2) = 3.0 lb.

Like quantity, the variation in composition of solid waste can be expected to be influenced by within-week and between-week differences. Therefore, it is important to sample within given weeks throughout the week. However, since the actual separation process is quite lengthy, a long time occurs between samples.
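The sample-weight model of Equation 2.2 above is simple to apply in code. This sketch reproduces the shredder-output example; the function name is illustrative:

```python
import math

def optimal_sample_weight(x_inches, beta=0.146):
    """Equation 2.2: Y = X * exp(beta * X), with X the characteristic
    particle size in inches and Y the optimal sample weight in pounds."""
    return x_inches * math.exp(beta * x_inches)

# Shredder output with a 2.2-inch characteristic size needs ~3 lb samples;
# raw refuse (18-inch characteristic size) gives ~250 lb, the boundary
# condition used to fit beta = 0.146.
print(round(optimal_sample_weight(2.2), 1))   # 3.0
print(round(optimal_sample_weight(18.0)))     # 249
```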
Furthermore, a sample can be taken from a selected vehicle and the vehicle then allowed to move on. For these reasons, unlike with quantity sampling, the taking of composition samples does not result in vehicle queues, regardless of the sampling method used. Thus it makes more sense to consider an unbiased sampling scheme for composition determination, e.g., either the random or systematic sampling schemes described in Chapter 1.

2.2 NORMALITY ASSUMPTIONS

The determination of the number of samples for the within-week estimation of waste composition is, unfortunately, a much more complex matter than for quantity estimation. Discrete distribution theory (such as multinomial or binomial) cannot be used because we are not dealing with identical items in the sampling unit. One piece of wood in a sample, for example, is different in size and shape from the next piece that might be found in the sample. One is tempted to reach for the Central Limit Theorem once again and assume that either the component in question is distributed normally or that averages taken from their distributions are distributed normally. Previous composition studies, however, have shown that no component is distributed normally (see Klee, 1980). The question is then, "How many samples must be taken so that averages of the samples are distributed normally?" For components with positively-skewed distributions (i.e., skewed to the right - see Figure 1) - and this includes most components, including newsprint, total paper, plastics/rubber/leather (when combined into one component), ferrous and other metals - averages of as few as n = 4 samples closely approximate normality and, by n = 10, normality is all but assured. However, for components with J-shaped distributions (see Figure 1) - and this includes components such as textiles, wood, and garden waste - reasonable normality is not approached until averages of n = 40 or greater are taken.
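The behavior described above - averages of skewed component data approaching normality as the size of the average grows - can be demonstrated with a small Monte Carlo sketch. The exponential distribution stands in here for a J-shaped component; all numbers are illustrative:

```python
import random
import statistics

random.seed(1)

def skewness(data):
    """Simple moment-based skewness estimate."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    n = len(data)
    return sum((x - mean) ** 3 for x in data) / (n * sd ** 3)

def skew_of_averages(size, reps=2000):
    """Skewness of the simulated distribution of averages of `size` draws."""
    means = [statistics.fmean(random.expovariate(1.0) for _ in range(size))
             for _ in range(reps)]
    return skewness(means)

# Skewness of the averages shrinks roughly as 1/sqrt(n), so larger
# averages are needed before normal-theory confidence intervals hold.
print(skew_of_averages(4), skew_of_averages(40))
```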
One indication of normality is the coefficient of skewness, g1, which is based on the third moment about the mean; i.e., define

k2 = Σ(xi - x)^2/(n - 1)    [2.3]

and

k3 = nΣ(xi - x)^3/(n - 1)(n - 2)    [2.4]

then

g1 = k3/(k2)^(3/2)    [2.5]

Given the coefficient of skewness, g1, of a parent population, the coefficient of skewness of the distribution of averages of size n taken from this distribution, g1(n), is:

g1(n) = g1/n^(1/2)    [2.6]

The coefficient of skewness is used in Table 1, which shows the results of simulations for averages of different sizes (number of iterations for each case = 5000) for two components, ferrous metals (a positively-skewed distribution) and textiles (a J-distribution). The distributions are taken from the data collected by Britton (1972).

FIGURE 1: DISTRIBUTION OF TEXTILES AND FERROUS METALS IN MUNICIPAL SOLID WASTE

Since the rationale behind using the Central Limit Theorem is to permit the use of t-statistics to construct confidence intervals about the estimate of the percentage of any component in the waste stream, an appropriate measure of the ability to meet the normality requirement is the fraction of confidence intervals that actually contain the true mean at a given level of significance. This is shown in Table 1 by the Actual versus Nominal α lines. For example, for textiles, given an average of size 10, if a confidence interval at a significance level of α = .05 were constructed about the mean, the actual significance level would be α = .104. In other words, instead of a 95% confidence interval, we would actually be constructing an 89.6% confidence interval. Note that as the size of the average gets larger, the discrepancy gets smaller.
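The k-statistics of Equations 2.3-2.5 can be computed directly from sample data. A small sketch, with invented data values:

```python
def g1_skewness(x):
    """Coefficient of skewness g1 = k3 / k2^(3/2) (Equations 2.3-2.5)."""
    n = len(x)
    mean = sum(x) / n
    k2 = sum((xi - mean) ** 2 for xi in x) / (n - 1)                  # Eq 2.3
    k3 = n * sum((xi - mean) ** 3 for xi in x) / ((n - 1) * (n - 2))  # Eq 2.4
    return k3 / k2 ** 1.5                                             # Eq 2.5

# A small right-skewed data set: the one large value pulls g1 positive.
print(round(g1_skewness([1, 2, 3, 4, 10]), 3))   # 1.697
```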
For example, at a significance level of α = .05 and an average of size 50, the true significance level is α = .058. As the selected significance level is increased, the discrepancy also gets smaller. For example, at a significance level of α = .10 and an average of size 10, the true significance level is α = .138. Note that for the moderately positively-skewed ferrous metals distribution, these discrepancies are very much smaller, even for very small sizes of the average.

TABLE 1: RESULTS OF SIMULATION STUDIES FOR TWO MUNICIPAL SOLID WASTE COMPONENTS

J-FUNCTION (TEXTILES)                            SIZE OF AVERAGE
                            1      5     10     15     20     25     30     35     40     45     50
MEAN                    1.6145 1.5924 1.6001 1.6058 1.5956 1.5957 1.5983 1.5920 1.5972 1.5994 1.5951
STD (POPULATION)        1.9037 1.9074 1.9277 1.9206 1.9226 1.9285 1.8979 1.9275 1.8790 1.9011 1.9290
STD (MEAN)              1.9037  .8530  .6096  .4959  .4299  .3857  .3465  .3258  .2971  .2834  .2728
COEFF. OF SKEWNESS        1.45    .66    .46    .33    .32    .30    .27    .29    .28    .21    .20
  (standard deviation)     .03    .03    .03    .03    .03    .03    .03    .03    .03    .03    .03
  (t-value)              41.81  19.14  13.32   9.66   9.12   8.76   7.69   8.47   8.13   5.95   5.70
  (significance level)    .001   .001   .001   .001   .001   .001   .001   .001   .001   .001   .001
ACTUAL α, NOMINAL α=.01      -   .075   .051   .041   .033   .031   .024   .026   .017   .018   .019
ACTUAL α, NOMINAL α=.02      -   .102   .065   .055   .046   .040   .035   .037   .028   .028   .031
ACTUAL α, NOMINAL α=.03      -   .119   .079   .065   .056   .051   .045   .046   .036   .038   .042
ACTUAL α, NOMINAL α=.04      -   .132   .089   .076   .065   .060   .055   .056   .044   .046   .049
ACTUAL α, NOMINAL α=.05      -   .148   .104   .090   .076   .067   .065   .064   .065   .060   .058
ACTUAL α, NOMINAL α=.10      -   .190   .138   .134   .128   .115   .114   .112   .104   .104   .099
ACTUAL α, NOMINAL α=.20      -   .268   .227   .215   .214   .211   .207   .210   .199   .200   .205

MODERATE SKEW (FERROUS METALS)                   SIZE OF AVERAGE
                            1      5     10     15     20     25     30     35     40     45     50
MEAN                    3.6712 3.6511 3.6603 3.6627 3.6533 3.6538 3.6583 3.6523 3.6587 3.6578 3.6564
STD (POPULATION)        1.7796 1.7750 1.8034 1.7920 1.7982 1.7955 1.8620 1.7937 1.7563 1.7700 1.7960
STD (MEAN)              1.7796  .7938  .5703  .4627  .4021  .3591  .3217  .3032  .2777  .2639  .2540
COEFF. OF SKEWNESS         .68    .32    .20    .15    .13    .17    .16    .19    .16    .12    .05
  (standard deviation)     .03    .03    .03    .03    .03    .03    .03    .03    .03    .03    .03
  (t-value)              19.74   9.26   5.88   4.32   3.67   4.91   4.67   5.54   4.53   3.36   1.51
  (significance level)    .001   .001   .001   .001   .001   .001   .001   .001   .001   .001   .131
ACTUAL α, NOMINAL α=.01      -   .017   .017   .018   .018   .015   .013   .014   .009   .010   .012
ACTUAL α, NOMINAL α=.02      -   .026   .033   .027   .026   .023   .019   .025   .025   .025   .024
ACTUAL α, NOMINAL α=.03      -   .043   .039   .039   .040   .035   .033   .035   .035   .035   .036
ACTUAL α, NOMINAL α=.04      -   .059   .053   .047   .043   .042   .048   .043   .041   .047   .042
ACTUAL α, NOMINAL α=.05      -   .065   .063   .057   .055   .053   .047   .052   .059   .051   .051
ACTUAL α, NOMINAL α=.10      -   .116   .115   .110   .098   .100   .104   .091   .097   .100   .110
ACTUAL α, NOMINAL α=.20      -   .207   .204   .215   .204   .204   .214   .211   .197   .208   .203

NOTE: NUMBER OF ITERATIONS = 5000

2.3 WITHIN-WEEK SAMPLE SIZE DETERMINATION

Assuming that the averages of the components are normally distributed, the sample size is given by (see Mace, 1964):

n = (ts/d)^2    [2.7]

where d is the precision required (i.e., one-half the confidence interval desired), t is the t-value at significance level α, and s is the population standard deviation. Because the t-value is not known until after n is determined, Equation 2.7 is actually a trial-and-error equation. However, for starting purposes, the t-value can be replaced by its corresponding z-value, i.e., the value of the standard normal deviate at 1 - α. Estimates of s are provided in Table 2.
(These estimates are based on various composition sampling studies throughout the country, and include within-day and between-day sampling variation. The Britton (1972) estimates are not appropriate here because they represent only the within-vehicle variation of one truckload.)

There is a problem, however, in applying Equation 2.7. The averages of the components are not normally distributed; they are positively-skewed. One might consider an appropriate transformation, such as the lognormal, but since d in Equation 2.7 is not constant over a lognormal scale (i.e., ln[a-b] is not equal to ln[a]-ln[b]) we must also have some knowledge of the mean of the distribution, the very quantity we are trying to estimate. A simpler approach is to take advantage of the fact that there is a strong correlation among the coefficient of skewness and the actual and nominal values of α. For example, using the data of Table 1, the following regression equation was found to have a coefficient of determination (R^2) of 0.98:

αn = .0206 + 1.00899αa - .141g1    [2.8]

Note: 1. If αn > αa, αn = αa
      2. If αn < 0, αn = .001

where αa is the actual level of significance, and αn is the nominal level of significance.

TABLE 2: SUGGESTED POPULATION STANDARD DEVIATIONS

COMPONENT                      STANDARD DEVIATION, s

PAPER COMPONENTS
  CORRUGATED PAPER                   .0744
  NEWSPRINT                          .0687
  TOTAL PAPER                        .1021
METAL COMPONENTS
  ALUMINUM                           .0069
  FERROUS METALS                     .0388
  TOTAL METALS                       .0358
ORGANIC COMPONENTS
  FOOD WASTE                         .0506
  GARDEN WASTE                       .1269
  WOOD                               .1376
  TOTAL ORGANICS                     .1121
MISCELLANEOUS COMPONENTS
  ASH/ROCKS/FINES                    .0572
  GLASS/CERAMICS                     .0502
  PLASTIC/RUBBER/LEATHER             .0252
  TEXTILES                           .0687

Equation 2.8 can be used to determine confidence intervals or significance levels even when the distribution is decidedly non-normal. The only inputs required are a knowledge of the coefficient of skewness (which is calculated from the data, using Equation 2.5) and the desired level of significance, αa.
For example, if αa = .10 and g1 = .33, then, using Equation 2.8,

αn = .0206 + 1.00899(.10) - .141(.33) = .075

Thus, a confidence interval constructed at an α of .075 will produce the required confidence interval at significance level α = .10. Since the value of g1 will generally not be known until after one has obtained a sample, Equation 2.8 is not particularly useful for sample size determination. Equations can be obtained, however, that relate αn and αa using n rather than g1. For the textile data in Table 1, we obtain the following equation (with a coefficient of determination, R^2, of .954):

αn = -.0633 + 1.0121αa + .00136n    [2.9]
              (37.3)     (11.5)

Note: 1. If αn > αa, αn = αa
      2. If αn < 0, αn = .001

The numbers in parentheses are the t-values of the estimated coefficients. For the ferrous metals data in Table 1, we obtain the following equation (with a coefficient of determination, R^2, of .995):

αn = -.0102 + .99087αa + .00019n    [2.10]
              (116.9)    (5.2)

Note: 1. If αn > αa, αn = αa
      2. If αn < 0, αn = .001

Thus, if one knows whether the component is distributed as a J-function (assume this for textiles, wood, and garden wastes), then Equation 2.9 is used; for all others, Equation 2.10 is used. To illustrate the use of Equations 2.7, 2.9, and 2.10, suppose we wished to estimate the concentration of ferrous metals in the waste stream to within ±2 percentage points of the mean, at a significance level of α = .05. Using Equation 2.7, n = (1.9623*.0388/.02)^2 = 14.49, or 15 rounded up. Using Equation 2.10, αn = -.0102 + .99087(.05) + .00019(15) = .0421. At n = 15 and α = .0421, t = 2.2367. At iteration #2, therefore, n = (2.2367*.0388/.02)^2 = 18.82, or 19 rounded up. Using Equation 2.10, αn = -.0102 + .99087(.05) + .00019(19) = .0429. At n = 19 and α = .0429, t = 2.1783. At iteration #3, therefore, n = (2.1783*.0388/.02)^2 = 17.86, or 18 rounded up.
Using Equation 2.10, αn = -.0102 + .99087(.05) + .00019(18) = .0427. At n = 18 and α = .0427, t = 2.1902. At iteration #4, therefore, n = (2.1902*.0388/.02)^2 = 18.05, or 19 rounded up. Since we are cycling between 18 and 19, n = 19 and we are finished. Finally, note that because the shape of the distribution and the standard deviation vary with the component, if there are m components the field sample size, nf, will be the largest sample size over all of the components, i.e., nf = max(n1, n2, ..., nm).

2.4 ESTIMATING THE STANDARD DEVIATION WHEN NO SAMPLE DATA ARE AVAILABLE

In almost any situation, one can get at least a very rough estimate of the standard deviation. The minimum information involves the form of the distribution and the spread of values. For example, if the values of the component fractions can be assumed to follow a normal distribution, then either of the following rules can be used to get an estimate of σ:

(a) Estimate two values, a low one, a1, and a high one, b1, between which you expect 99.7% (almost all) of the values to lie. Then estimate σ as:

σ = (b1 - a1)/6    [2.11]

(b) Estimate two values, a low one, a2, and a high one, b2, between which you expect 95% of the values to lie. Then estimate σ as:

σ = (b2 - a2)/4    [2.12]

If the values of the component fractions can be assumed to follow a positively-skewed distribution, then an alternative is to assume a triangular distribution and estimate σ as the square root of the triangular variance:

σ = {[a5(a5 - b5) + c5(c5 - a5) + b5(b5 - c5)]/18}^(1/2)    [2.13]

where a5, b5, and c5 are the assumed smallest, most likely, and largest values, respectively, that the distribution can take on. As an example, suppose we are going to sample for aluminum, and we estimate that the smallest value is 0.2%, the most likely is 1.5%, and the largest value is 4.1%. Assuming a positively-skewed distribution, the estimated standard deviation is:

σ = {[.2(.2 - 1.5) + 4.1(4.1 - .2) + 1.5(1.5 - 4.1)]/18}^(1/2) = (.66)^(1/2) = .81% or .0081

2.5 ESTIMATING SAMPLE SIZE IN A MULTI-STAGE PROCESS

Suppose we do not have a good estimate of the standard deviation of the distribution of a component. Since the cost of composition sampling is generally high, rather than take a larger sample than is really necessary we can take the sample in more than one stage. The method (sometimes called "Stein's Method" - see Natrella, 1966) is as follows:

(1) Make a first estimate of σ (using either Table 2 or the technique described in Section 2.4). From this, determine n, the full sample size (using the technique described in Section 2.3). Choose some fraction of n, n1, as the size of the first sample. (In Stein's Method, this fraction typically is 1/2.)

(2) This first sample of size n1 provides an estimate of σ. Use this value to determine how large the second sample should be.

As an example, suppose we wished to estimate the concentration of ferrous metals in the waste stream to within ±2 percentage points of the mean, at a significance level of α = .05. Assume that our best estimate of σ is .04. Following the procedure outlined in Section 2.3, our estimate of sample size is 20. Our first sample size (using a fraction of 1/2), n1, is 10. After taking this sample we find that the sample standard deviation is .03. Recalculating the sample size, we find n = 13. Since we have already taken 10 of these samples, only 3 more are required. Note that we can refine the method simply by making the first fraction small (say 1/3 or 1/4), and then recalculating the standard deviation (and hence, a new sample size) after every additional sample is obtained.

2.6 ESTIMATION OF COMPOSITION

As with waste quantity, it is well known that the composition of solid waste generated varies significantly from month to month. Thus it is not sufficient to sample one week out of the year to estimate composition for the complete year, and we are forced to sample additional weeks throughout the year. Suppose we sample r weeks out of the year and determine, for each of these r weeks, the average (as a fraction) of a particular component of solid waste, pk (where pk = Σj pjk/nk, with nk the number of samples taken in the kth week), and Xk, the total quantity arriving at the site in the kth week. Then an estimate of the fraction of the component over the year, P, is obtained by weighting the weekly fractions by the weekly totals:

P = (ΣpkXk)/ΣXk    [2.14]

where the sums are over the r weeks.
Thus it is not sufficient to sample one week out of the year to estimate composition for the complete year, and we are forced to sample additional weeks throughout the year. Suppose we sample r weeks out of the year and determine, for each of these r weeks, the average (as a fraction) of a particular component of solid waste, p̄_k (where p̄_k = Σ_j p_jk/n_k, the sum over j running over n_k, the number of samples taken in the kth week), and X_k, the total quantity arriving at the site in the kth week. Then an estimate of the fraction of the component over the year, P, is obtained by weighting the weekly fractions by the weekly totals:

   P = (Σ p̄_k X_k)/(Σ X_k)     [2.14]

where the sums run over the r weeks. Applying the propagation of error formula (and assuming that the covariance terms are zero),

   var(P) = Σ (X_k/ΣX_k)^2 var(p̄_k) + Σ [(p̄_k/ΣX_k)(1 - X_k/ΣX_k)]^2 var(X_k)     [2.15]

Applying the Bennett and Franklin relationship (Equation 1.20, Chapter 1) to Equation 2.15, the effective degrees of freedom, EDF_P, for this variance is given as:

   EDF_P = [var(P)]^2 / {Σ [(X_k/ΣX_k)^2]^2 [var(p̄_k)]^2/d_pk
           + Σ {[(p̄_k/ΣX_k)(1 - X_k/ΣX_k)]^2}^2 [var(X_k)]^2/d_xk}     [2.16]

where d_pk and d_xk are the degrees of freedom associated with var(p̄_k) and var(X_k), respectively. We find the total quantity for the year of the component, T, by:

   T = P*W     [2.17]

where W is the total quantity of waste for the year. Applying the propagation of error formula (and assuming that the covariance terms are zero),

   var(T) = W^2 var(P) + P^2 var(W)     [2.18]

Again applying the Bennett and Franklin relationship (Equation 1.20, Chapter 1) to Equation 2.18, the effective degrees of freedom, EDF_T, for this variance is given as:

   EDF_T = [var(T)]^2 / {[W^2 var(P)]^2/EDF_P + [P^2 var(W)]^2/EDF_W}     [2.19]

where EDF_W is the effective degrees of freedom associated with the total quantity of waste for the year, W. Unfortunately, one cannot construct confidence intervals about the total of the component, T, in the usual fashion, i.e., T ± t_α[var(T)]^1/2, since the distribution of P is positively skewed and thus the distribution of T is also positively skewed. The assumption of normality is not appropriate under these circumstances.
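Equations 2.14 and 2.15 translate directly into code. The sketch below (the function and variable names are our own, not PROTOCOL's) applies them to the four weeks of ferrous metals data given in Table 3:

```python
def yearly_fraction_and_variance(p_bar, X, var_p, var_X):
    """Equations 2.14-2.15: weight the weekly component fractions p_bar
    by the weekly totals X, and propagate the variances of both
    (covariance terms assumed zero, as in the text)."""
    SX = sum(X)
    # Equation 2.14: quantity-weighted yearly fraction
    P = sum(p * x for p, x in zip(p_bar, X)) / SX
    # Equation 2.15: contribution of the weekly fraction variances ...
    var_P = sum((x / SX) ** 2 * vp for x, vp in zip(X, var_p))
    # ... plus the contribution of the weekly quantity variances
    var_P += sum(((p / SX) * (1 - x / SX)) ** 2 * vx
                 for p, x, vx in zip(p_bar, X, var_X))
    return P, var_P

# Table 3 data: average ferrous metals fractions, weekly totals (lbs),
# and the squared standard deviations of each
p_bar = [0.0821, 0.0773, 0.0719, 0.0589]
X     = [102_100_000, 93_013_000, 97_385_000, 89_211_000]
var_p = [0.0105**2, 0.0077**2, 0.0072**2, 0.0095**2]
var_X = [1_743_900**2, 1_363_500**2, 1_261_900**2, 1_355_800**2]
P, var_P = yearly_fraction_and_variance(p_bar, X, var_p, var_X)
```

Running this gives a yearly ferrous metals fraction of about .073; as the text notes for var(T), the fraction-variance term dominates var(P) here as well, the quantity-variance term being several orders of magnitude smaller.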
However, we can use the logarithmic transformation, which is particularly effective in normalizing distributions that have positive skewness. If we assume that T' = ln[T] is normally distributed with mean μ and standard deviation σ, then (see Aitchison and Brown, 1957):

   T = e^(μ + ½σ^2)     [2.20]

and

   var(T) = e^(2μ + σ^2) (e^(σ^2) - 1)     [2.21]

Solving for μ and σ^2,

   σ^2 = ln[var(T)/T^2 + 1]     [2.22]

   μ = ln T - ½ ln[var(T)/T^2 + 1]     [2.23]

One can then construct the desired confidence interval by first computing:

   L = μ - t_α σ     [2.24]

   U = μ + t_α σ     [2.25]

The confidence interval around T then is given as:

   lower = e^L     [2.26]
   upper = e^U     [2.27]

An example of these calculations is shown in Table 4, using the data given in Table 3. Note that, taking the W^2 var(P) and P^2 var(W) terms as the contributions to var(T) by component and quantity respectively, the component term contributed 93% of the variability in this example while the quantity term contributed only 7%. Thus (and it should come as no great surprise), the precision of an estimate of the yearly quantity of a given component depends much more on the precision of the estimate of the yearly component standard deviation than on that of the yearly quantity standard deviation.

TABLE 3: DATA FOR SEASONALITY CALCULATIONS EXAMPLE

   Total, Week #1:                              102,100,000 lbs
   Standard Deviation of Total:                   1,743,900 lbs
   Effective Degrees of Freedom of Total, d:             45
   Number of Composition Samples, n:                     17
   Average Ferrous Metals (as a fraction):            .0821
   Standard Deviation of Average:                     .0105

   Total, Week #2:                               93,013,000 lbs
   Standard Deviation of Total:                   1,363,500 lbs
   Effective Degrees of Freedom of Total, d:             29
   Number of Composition Samples, n:                     17
   Average Ferrous Metals (as a fraction):            .0773
   Standard Deviation of Average:                     .0077

   Total, Week #3:                               97,385,000 lbs
   Standard Deviation of Total:                   1,261,900 lbs
   Effective Degrees of Freedom of Total, d:             32
   Number of Composition Samples, n:                     17
   Average Ferrous Metals (as a fraction):            .0719
   Standard Deviation of Average:                     .0072
   Total, Week #4:                               89,211,000 lbs
   Standard Deviation of Total:                   1,355,800 lbs
   Effective Degrees of Freedom of Total, d:             42
   Number of Composition Samples, n:                     17
   Average Ferrous Metals (as a fraction):            .0589
   Standard Deviation of Average:                     .0095

   Total, Year:                               4,975,800,000 lbs
   Standard Deviation of Total for Year:         81,829,000 lbs
   Effective Degrees of Freedom for Year:               143

TABLE 4: SEASONALITY CALCULATIONS EXAMPLE

   {1}     {2}     {3}        {4}      {5}    {6}        {7}
   p̄_k     X_k     p̄_k X_k    n_k-1    d_k    X_k/ΣX_k   var
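The confidence-interval construction of Equations 2.22 through 2.27 can be sketched as follows. This is a minimal illustration only: the values of T, var(T), and the t-value below are hypothetical figures chosen for demonstration, not the Table 4 results.

```python
import math

def lognormal_ci(T, var_T, t_alpha):
    """Equations 2.22-2.27: confidence interval for a positively-skewed
    yearly component total T, via the logarithmic transformation."""
    sigma2 = math.log(var_T / T**2 + 1.0)   # Eq. 2.22
    mu = math.log(T) - 0.5 * sigma2         # Eq. 2.23
    sigma = math.sqrt(sigma2)
    L = mu - t_alpha * sigma                # Eq. 2.24
    U = mu + t_alpha * sigma                # Eq. 2.25
    return math.exp(L), math.exp(U)         # Eqs. 2.26-2.27

# Hypothetical yearly total of 3.6e8 lbs with var(T) = 4.0e14 and a
# t-value of 1.98 (all illustrative assumptions):
lo, hi = lognormal_ci(3.6e8, 4.0e14, 1.98)
```

Because the interval is computed on the log scale and then exponentiated, it is asymmetric about T, with the longer tail on the upper side, which is exactly the behavior the positive skewness of T calls for.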