United States Environmental Protection Agency
Risk Reduction Engineering Laboratory
Cincinnati OH 45268

Research and Development
EPA/600/S8-89/102 June 1990

Project Summary

MOUSE (Modular Oriented Uncertainty SystEm): A Computerized Uncertainty Analysis System

Albert J. Klee

Environmental engineering calculations involving uncertainties, either in the model itself or in the data, are far beyond the capabilities of conventional analysis for any but the simplest of models. There exist a number of general-purpose computer simulation languages, using Monte Carlo methods, that are capable of such analysis, but these languages are difficult to learn and to implement quickly. MOUSE (an acronym for Modular Oriented Uncertainty SystEm) deals with the problem of uncertainties in models that consist of one or more algebraic equations. It was especially designed for use by those with little or no knowledge of computer languages, programming, or simulation. It runs on almost any personal computer, is easy and fast to learn, and has all of the features needed for substantive uncertainty analysis (built-in probability distributions, plotting and graphing capabilities, sensitivity analysis, interest functions for cost analyses, etc.). Moreover, a series of unique companion utility programs write much of the necessary computer code for the user, help in analyzing sample data to determine the probability distributions that best fit those data, check each program for errors in syntax, and assist in finding logical errors in the model that is subject to uncertainty.

Some typical examples of the use of MOUSE within the U.S. Environmental Protection Agency include: studying the migration of pollution plumes in streams, establishing regulations for hazardous wastes in landfills, and estimating pollution control costs.
This Project Summary was developed by EPA's Risk Reduction Engineering Laboratory, Cincinnati, OH, to announce key findings of the research project that is fully documented in a separate report of the same title (see Project Report ordering information at back).

Introduction

If we define a model as a physical or symbolic representation of reality, we find among the set of all models one called the mathematical model, one particular type of which consists of a series of one or more algebraic equations. Mathematical models of this type are extremely important for they are found almost everywhere, including the environmental sciences, economics, engineering, and science in general. The use of an equation is understood by almost everyone; in a somewhat "inelegant" sense, numbers "go into" the equation and an answer is obtained. For example, consider the following very simple equation,

    Y = AB    [1]

where Y might stand for an equipment purchase cost, given that A is the unit price and B is the number of units required. Alternatively, equation 1 might represent an engineering calculation where Y is the cross-sectional area of a heating duct, given that A is its height and B is its width. In any event, to "solve" equation 1, one has to know the values of A and B. If A is equal to 2, for example, and B is equal to 15, then Y = 2 x 15 = 30. Often, however, we are not sure of the values of A or of B. A might be 3 and B might be 20; in such event Y would be equal to 60, not 30. The greater our uncertainty about the input variables A and B, clearly the greater our uncertainty about the output variable, Y.

The most often encountered approaches to uncertainty in mathematical models are: (1) the best value approach, (2) the conservative approach, and (3) sensitivity analysis. The first two are single-value approaches. The "best" in "best value" is not precisely defined; generally it refers to some measure of central tendency such as an average or a mode.
In the equipment purchase problem, we might suppose that the values of 2 for A and 15 for B are average values. Presumably, the answer of 30 is also some sort of average value.

In the conservative approach, the input values selected are not the average or most likely ones but rather those that produce conservative results with regard to the consequences of over- or underestimating. For example, in the conservative approach for the duct example, the values of A and B selected to go into equation 1 would be greater than their average values of 2 and 15 since overestimation is probably better than underestimation in this case. If "best" values were used, there would be too great a chance that the duct area would be underestimated.

Traditional sensitivity analysis usually starts with a best value estimate, followed by a perturbation or change in one of the input variables (holding all other input variables at their previous values). The perturbation can be either an increase or a decrease in the value of the variable and hence can be either of a "conservative" nature or a "liberal" one. The perturbation is generally within the known or believed uncertainty range of the variable. The process is repeated for as many variables and for as great a change as is desired. For example, to examine the effects of a modest change in A in equation 1, we might increase the value of A by 10% over its "best" estimate value of 2. A 10% increase in A (to 2.2) results in an estimated Y-value of 33. If a value of A of 2.2 is "reasonably likely" to occur, then the traditional sensitivity analysis suggests that a value of Y of 33 is also "reasonably likely" to occur.

Problems with Traditional Approaches

Clearly, the best value approach does not address the problem of uncertainty at all. Furthermore, "best" input values do not necessarily have high probabilities of occurring.
Another difficulty arises when the algebraic model contains non-linear elements such as multiplications or divisions and the variables are correlated. If variables A and B of equation 1 were correlated, for example, the average of Y would not be equal to the average of A times the average of B. In point of fact, if A and B were positively correlated, then the average of Y would be greater than the product of the averages; conversely, if A and B were negatively correlated, the average of Y would be less than the product of the averages.

The conservative approach also has its deficiencies. First, in a complex calculation involving many equations and many input variables (some of which may be correlated), it may not be obvious what values of the input variables constitute "conservative" ones with respect to the output. Second, because conservative input values generally are those with a low probability of occurring, the estimates obtained by using such values perforce will not have a high probability of occurring. The worst thing, however, about both of these approaches is that the point estimates involved do not utilize all of the information that is usually available. One usually has at least some idea of the uncertainties in the data at hand.

Sensitivity analysis, being largely an amalgamation of elements of both the best value and conservative approaches, suffers the defects of both methods. An arbitrary change in the value of an input variable, even though the change falls within the expected range of the variable, tells us little about the likelihood of occurrence of the new estimate obtained. In other words, if we know little about the likelihood of such a change occurring, it follows that we know little about the likelihood of the calculated output occurring. Furthermore, in traditional sensitivity analysis, all other variables are held at their previous values, the so-called "all other things being equal" view of the world.
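The one-at-a-time perturbation that traditional sensitivity analysis performs can be sketched in a few lines. This is only an illustration in Python, assuming equation 1 with the best values A = 2 and B = 15 and the 10% perturbation used in the text; the names `model`, `BEST`, and `one_at_a_time` are hypothetical and are not part of MOUSE.

```python
def model(a, b):
    """Equation 1: Y = A x B."""
    return a * b


# "Best" (average) input values used in the text.
BEST = {"a": 2.0, "b": 15.0}


def one_at_a_time(best, fraction=0.10):
    """Perturb each input by +/- fraction, holding all others at best value."""
    base = model(**best)
    rows = []
    for name in best:
        for sign in (+1, -1):
            trial = dict(best)  # every other input stays at its best value
            trial[name] = best[name] * (1 + sign * fraction)
            y = model(**trial)
            rows.append((name, sign * fraction, y, y - base))
    return base, rows


base, rows = one_at_a_time(BEST)
print(f"best-value estimate: Y = {base:.1f}")
for name, frac, y, delta in rows:
    print(f"{name.upper()} {frac:+.0%}: Y = {y:.1f} (change {delta:+.1f})")
```

Each row reports a new value of Y, but none carries a probability, and no row varies A and B together.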
The problem is that "all other things" are seldom equal. In actuality, the change we introduce in a variable under traditional sensitivity analysis may well be either mitigated or intensified by what is happening to other variables. In short, traditional sensitivity analysis does not show the combined net effect of changes in all variables or the likelihood of various changes occurring together. Viewed in this manner, traditional sensitivity analysis can be misleading.

Alternative Solutions to the Uncertainty Problem

The first alternative, illustrated in Figure 1, is Direct or Complete Enumeration. The model of equation 1 is employed, and we assume the uncertainties of A and B as given in the two probability distributions for these variables shown in the upper left-hand corner of the figure. In other words, we suppose that there is a 25% chance that A is equal to 1, 50% that it is equal to 2, and 25% that it is equal to 3. For B, there is a 50-50 chance that it is equal to either 10 or 20. (For this simple example, we assume no correlation between the two input variables.) In complete enumeration we list all of the possible combinations of the input variables and then calculate the probabilities of these combinations occurring. In this example there are 3 choices for A and 2 for B, resulting in 6 possible outcomes for Y. The probabilities of these combinations are shown in the middle top of the figure. Since some of the combinations are duplications, the table of combinations of A and B may be simplified to the 5 entries shown at the upper right of the figure. The average value of Y is shown to be 30. At the bottom of Figure 1 is a graph of the frequency or probability distribution of Y.
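As a minimal sketch of complete enumeration, the distributions of Figure 1 can be enumerated directly. The example is written in Python for brevity and is not part of MOUSE; the names `enumerate_product` and `mean_and_variance` are illustrative assumptions, and independence of A and B is assumed, as in the text.

```python
from collections import defaultdict
from itertools import product

# Discrete input distributions from Figure 1: value -> probability.
P_A = {1: 0.25, 2: 0.50, 3: 0.25}
P_B = {10: 0.50, 20: 0.50}


def enumerate_product(p_a, p_b):
    """Return the exact distribution of Y = A * B for independent A and B."""
    p_y = defaultdict(float)
    for (a, pa), (b, pb) in product(p_a.items(), p_b.items()):
        # Independence: the joint probability is the product of the marginals.
        # Duplicate outcomes (e.g. 2 x 10 and 1 x 20) are merged by summing.
        p_y[a * b] += pa * pb
    return dict(sorted(p_y.items()))


def mean_and_variance(dist):
    """Mean and variance of a discrete distribution (value -> probability)."""
    mean = sum(y * p for y, p in dist.items())
    var = sum(p * (y - mean) ** 2 for y, p in dist.items())
    return mean, var


p_y = enumerate_product(P_A, P_B)
mean_y, var_y = mean_and_variance(p_y)
for y, p in p_y.items():
    print(f"Y = {y:2d}  p(Y) = {p:.3f}")
print(f"mean = {mean_y:.1f}, variance = {var_y:.1f}")
```

The six combinations collapse to five distinct values of Y. The number of combinations grows as the product of the number of values each input can take, which is why the approach does not scale.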
Note that the most likely value is not the average (or "best") value but rather values to either side of it. Furthermore, one of the extreme values (Y = 60) is just as probable as the average value (each has a probability of .125). As can be seen, the complete enumeration method tells us everything about the distribution of Y, including its mean, standard deviation, minimum, maximum, and the probability of occurrence of any value of Y. Furthermore, it is an exact method. Unfortunately, if the number of outcomes is large (or if the probability distributions are continuous), the method is computationally impractical.

Model: Y = A x B

Basic Data
     A    p(A)          B    p(B)
     1    .25          10    .50
     2    .50          20    .50
     3    .25

Outcomes
     A     B    AB    p(AB)
     1    10    10    .125
     2    10    20    .250
     3    10    30    .125
     1    20    20    .125
     2    20    40    .250
     3    20    60    .125
             Total   1.000

Ordered Outcomes
    AB    p(AB)
    10    .125
    20    .375
    30    .125
    40    .250
    60    .125
 Total   1.000

[Bar graph: probability distribution of Y, with f(Y) from .00 to .40 plotted against Y from 0 to 60.]

Figure 1. Direct (complete enumeration).

The second alternative is the Probability Calculus method. This method, as the name implies, requires some knowledge of the calculus of probabilities (sometimes known in engineering as the "propagation of error"). Using the model of equation 1 as before, the method is illustrated in Figure 2. The error formula is given at the top of the figure and involves three terms and knowledge of the variances of A and B. The latter are calculated, as is shown in the figure, from the probability distributions of A and B given previously in Figure 1. The error formula shows the variance of Y to be 225. The Probability Calculus method produces no more than the mean and the variance of the output (i.e., Y) distribution. That the variance alone is not sufficient to determine the nature of uncertainty, however, is clear from Figure 1.

Model: Y = A x B

Error Formula: $\operatorname{var}(Y) = \operatorname{var}(AB) = \bar{A}^2 \operatorname{var}(B) + \bar{B}^2 \operatorname{var}(A) + \operatorname{var}(A)\operatorname{var}(B)$

Let $\bar{A} = 2$ and $\bar{B} = 15$

    p(A)     A    $(A - \bar{A})^2$    $p(A)(A - \bar{A})^2$
    .25      1           1                   .25
    .50      2           0                   .00
    .25      3           1                   .25
                                    var(A) = .50

    p(B)     B    $(B - \bar{B})^2$    $p(B)(B - \bar{B})^2$
    .50     10          25                  12.5
    .50     20          25                  12.5
                                    var(B) = 25.0

Therefore, $\operatorname{var}(Y) = 2^2(25.0) + 15^2(0.50) + (0.50)(25.0) = 225$

Figure 2. Probability calculus.

The third approach to the problem of uncertainty is a form of Monte Carlo simulation known as Model Sampling. The idea of Model Sampling is relatively simple:

1. A value for each of the input variables is drawn at random from their respective probability distributions, and the model is computed using this particular set of values.

2. The above process is repeated many times. Since the results vary with each iteration, the outputs themselves (i.e., the Y's) are gathered in the form of a probability distribution.

Thus the uncertainties of the model's inputs are transferred to the output, which can then be studied and subsequently used in decision processes. The procedure is shown schematically in Figure 3. The output (and the accuracy) of the Monte Carlo simulation method approaches that of complete enumeration as the number of iterations becomes large. Unlike Direct Enumeration, however, large and/or complex problems are tractable and continuous uncertainty distributions are easily handled. The Monte Carlo simulation method forms the basis for MOUSE, the computerized uncertainty analysis system that is the subject of this summary.

MOUSE, A Computerized Uncertainty Analysis System

To conduct Monte Carlo simulation on an environmental, economic, engineering, or scientific model, you must communicate the model and any input information to the computer and instruct it as to what is required with regard to the nature and the output of the analysis.

Since models are specific to the problem at hand, the model must be coded in some sort of high-level (i.e., English-like) computer language. For environmental engineering uncertainty analysis, the most commonly available high-level languages for personal computers are Compiled Basic, FORTRAN, and Pascal.
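To make the Model Sampling procedure concrete, the following sketch applies it to equation 1 with the discrete distributions of Figure 1. It is written in Python purely for illustration and is not MOUSE or FORTRAN code; the function names, the seed, and the iteration count are assumptions chosen for the example.

```python
import random
import statistics
from collections import Counter

# Input distributions from Figure 1 (values and their probabilities).
A_VALUES, A_WEIGHTS = [1, 2, 3], [0.25, 0.50, 0.25]
B_VALUES, B_WEIGHTS = [10, 20], [0.50, 0.50]


def model(a, b):
    """Equation 1: Y = A x B."""
    return a * b


def model_sampling(n_iterations, seed=12345):
    """Draw A and B independently at random and evaluate the model each time."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    a_draws = rng.choices(A_VALUES, weights=A_WEIGHTS, k=n_iterations)
    b_draws = rng.choices(B_VALUES, weights=B_WEIGHTS, k=n_iterations)
    return [model(a, b) for a, b in zip(a_draws, b_draws)]


ys = model_sampling(50_000)

# Summary statistics gathered from the collection of Y's.
summary = {
    "mean": statistics.fmean(ys),
    "std_dev": statistics.pstdev(ys),
    "minimum": min(ys),
    "maximum": max(ys),
}
summary["coef_var_pct"] = 100 * summary["std_dev"] / summary["mean"]

# Relative frequency of each outcome approximates p(Y).
frequency = {y: n / len(ys) for y, n in sorted(Counter(ys).items())}

for name, value in summary.items():
    print(f"{name:>12}: {value:.3f}")
for y, f in frequency.items():
    print(f"Y = {y:2d}  relative frequency = {f:.3f}")
```

With enough iterations the sampled mean, standard deviation, and frequencies should lie close to the exact values from complete enumeration (mean 30, variance 225, hence standard deviation 15); the sampled values themselves depend on the seed and the number of iterations.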
With regard to coding a particular environmental model itself, there is little difference among them (e.g., the equation y = ab is written as Y = A*B in all three), although there are significant differences with regard to the sometimes arcane but necessary programming matters that are part of (a) the specification burdens associated with these languages (e.g., declaring and defining variables, arrays, and common storage), and (b) input/output. Skipping the arguments over the competing virtues of these languages, we note simply that FORTRAN was selected as the basis for MOUSE primarily because it is the most standardized of the three options. With any compiled language, the source text is generally created with an editor, saved to disk, and then processed by a compiler and linker before it is run. Thus, the basic requirements for MOUSE are simply a text editor and a FORTRAN compiler/linker.

Anyone knowing a few arithmetic symbols can write even the most complex of models in any of these languages. The overhead burden (as exemplified by FORTRAN statements such as DIMENSION and COMMON) and the input/output procedures (as exemplified by FORTRAN statements such as OPEN and FORMAT) are quite another thing, however. MOUSE avoids the former problem by providing a utility program (called TRAP, for TRAnsformation Preprocessor) that actually writes the required specifications; it solves the latter problem by doing all of the output and most of the input itself. The output power of MOUSE is shown in Figure 4, where a typical MOUSE histogram is presented.

Other than the model itself, the overhead burden, and the input/output procedures, a computer program must also contain instructions from the user as to what is required, e.g., what output variables are to be gathered and placed into histograms, what probability distributions (and their parameters) are to be used for the uncertain variables, what sensitivity analyses are to be performed, etc.
In MOUSE, these are communicated via single lines inserted into the program that "call" MOUSE functions or subroutines. These calls must be written precisely, since computers are notoriously unhelpful when it comes to second guessing exactly what you want. MOUSE, however, assists you in three ways. Firstly, while you are writing your program (using a word processor of your choice), there is an on-line (memory resident) program available at all times that immediately shows you the form of, and the arguments needed for, each call. Secondly, with a little prompting for the arguments, TRAP will write these calls for you. Thirdly, there is another utility program, called CHECKER, that will check your program for errors, line by line. When it finds an error, it will tell you where it is and what is wrong.

[Flowchart: Model Y = A x B. Start at i = 1; draw a random sample of A and a random sample of B; compute Y = A x B; record Y(i); repeat until finished.]

From the collection of Y's obtain:
1. Mean
2. Standard deviation
3. Coefficient of variation
4. Minimum
5. Maximum
6. Graph of frequency distribution
7. Graph of cumulative frequency distribution

Figure 3. Monte Carlo simulation.

******************************************
*     DISTRIBUTION FOR QUANTITY TEST     *
******************************************

NUMBER OF ITERATIONS         =      5000
MEAN                         =  20361.21000
MINIMUM                      =   4219.64600
MAXIMUM                      =  82200.32000
STANDARD DEVIATION           =  12154.16000
COEFFICIENT OF VARIATION, %  =     59.69275

   LOWER       NUMBER OF   PERCENT   CUMULATIVE   CUMULATIVE
   LIMIT        ENTRIES    ENTRIES   % ENTRIES    COMPLEMENT
 4200.0000        60.        1.20       1.20        98.80
 6100.0000       245.        4.90       6.10        93.90
 8000.0000       443.        8.86      14.96        85.04
 9900.0000       585.       11.70      26.66        73.34
11800.0000       486.        9.72      36.38        63.62
13700.0000       467.        9.34      45.72        54.28
15600.0000       381.        7.62      53.34        46.66
17500.0000       374.        7.48      60.82        39.18
19400.0000       248.        4.96      65.78        34.22
21300.0000       239.        4.78      70.56        29.44
23200.0000       194.        3.88      74.44        25.56
25100.0000       158.        3.16      77.60        22.40
27000.0000       127.        2.54      80.14        19.86
28900.0000       121.        2.42      82.56        17.44
30800.0000       107.        2.14      84.70        15.30
32700.0000        94.        1.88      86.58        13.42
34600.0000        95.        1.90      88.48        11.52
36500.0000        69.        1.38      89.86        10.14
38400.0000        75.        1.50      91.36         8.64
40300.0000        70.        1.40      92.76         7.24
42200.0000        56.        1.12      93.88         6.12
44100.0000        56.        1.12      95.00         5.00
46000.0000        55.        1.10      96.10         3.90
47900.0000        35.         .70      96.80         3.20
49800.0000        26.         .52      97.32         2.68
51700.0000        24.         .48      97.80         2.20
53600.0000        28.         .56      98.36         1.64
55500.0000        17.         .34      98.70         1.30
57400.0000        11.         .22      98.92         1.08
59300.0000        15.         .30      99.22          .78
61200.0000         5.         .10      99.32          .68
63100.0000         6.         .12      99.44          .56
65000.0000         9.         .18      99.62          .38
OVERFLOW          19.         .38     100.00          .00

[Character plot omitted: for each interval, a row of asterisks shows the frequency distribution and a marker shows the cumulative distribution.]

CUMULATIVE   CUMULATIVE     VALUE OF
% ENTRIES    COMPLEMENT       TEST
    5.0         95.0        5673.4690
   10.0         90.0        6936.3430
   25.0         75.0        9630.4280
   50.0         50.0       14767.1900
   75.0         25.0       23536.7100
   90.0         10.0       36677.3300
   95.0          5.0       44100.0000
   99.0          1.0       57906.6700

Figure 4. Typical example of statistics, histogram and graphs produced by MOUSE.

It is entirely possible to write a computer program that contains no syntactical errors whatsoever but is rife with logical errors. One way to detect logical errors is to examine the results of intermediate calculations for reasonableness.
Utilizing a device known as a "Trace Line," MOUSE will print out the value of any variable at 1, 20, 50, and 100 iterations of the Monte Carlo method. A utility program, called TRACER, will automatically insert these trace lines into your program and, when you are finished, remove them as well.

It is not always clear what probability distributions should be used for the uncertain inputs of an environmental engineering model, and the fitting of probability distributions to sample data is a statistical skill not possessed by all. A MOUSE utility program known as IMP (Interactive Modeler for Probabilities) not only will fit a classical probability distribution to sample data, it will also fit an empirical distribution hand-drawn on graph paper and analyze a data set for auto- and bivariate correlations.

For environmental engineering models involving algebraic equations, MOUSE is superior to either general purpose programming or simulation languages. It is concise, powerful, convenient, and easy to use. MOUSE can solve uncertainty problems of the algebraic kind faster and more easily than can other languages. With MOUSE, your attention is on problem-solving, rather than on the details of coding a program to compute a solution. Further, MOUSE programs are easier to understand, explain to others, and modify than are programs written in general purpose programming or simulation languages.

The EPA author, Albert J. Klee (also the EPA Project Officer, see below), is with the Risk Reduction Engineering Laboratory, Cincinnati, OH 45268.

The complete report consists of paper copy and diskette, entitled "MOUSE (Modular Oriented Uncertainty SystEm): A Computerized Uncertainty Analysis System:"
    Paper Copy (Order No. PB 90-172 560/AS; Cost: $31.00, subject to change)
    Diskette (Order No. PB 90-501370/AS; Cost: $80.00, subject to change)
    (Cost of diskette includes paper copy.)
The above items will be available only from:
    National Technical Information Service
    5285 Port Royal Road
    Springfield, VA 22161
    Telephone: 703-487-4650

The EPA Project Officer can be contacted at:
    Risk Reduction Engineering Laboratory
    U.S. Environmental Protection Agency
    Cincinnati, OH 45268