Computing Science and Statistics, Volume 27
Statistics and Manufacturing with Subthemes in Environmental Statistics, Graphics and Imaging
Proceedings of the 27th Symposium on the Interface, Pittsburgh, PA, June 21-24, 1995
Editors: Michael M. Meyer and James L. Rosenberger
Interface Foundation of North America

Perspectives on the Analysis of Massive Data Sets

Daniel B. Carr
Center for Computational Statistics
George Mason University
Fairfax, VA 22030

Abstract: This talk presents perspectives on the analysis of massive data sets. These perspectives derive from past efforts to develop tools for the analysis of large data sets and from speculations about the future. The perspectives touch on many topics: previous research, recurrent problems, standard tricks for addressing problems, new problems, and new opportunities. In looking forward, it seems that evolving notions about the man-machine interface, models, and model criticism will play an important role in massive data set analysis. The extent to which the statistical community responds to this evolution will have far-reaching consequences for the influence of our profession within the scientific community.

1. Introduction

The analysis of massive data sets (AMDS) is an important topic. Massive data sets present new challenges and new opportunities for learning about the world around us. In fact, massive data sets provide the opportunity to learn how the statistical community will respond to the challenges.

The progression from large to massive data sets has occurred steadily over the last two decades. It might be thought that the statistics community has followed this progression and is well-prepared to address massive data sets. However, the analysis of large data sets has drawn little attention. Much of the computational statistics community has focused its attention on computationally intensive methods for small data sets. We can attack small problems with portable PC's and crunch medium problems with workstations. Addressing massive data sets requires more finesse, new computational environments, and foreign (and sometimes simpler) ways of thinking.

In this paper I provide a sketch of the AMDS landscape. The approach uses a series of brief descriptions to provide a broad view of AMDS. A narrow view of AMDS might focus on a new algorithmic approach to complexity reduction or characterize the feasibility of applying a particular class of algorithms in terms of n, the number of cases, and p, the number of variables. A broader view also addresses complexity reduction issues for the analysis team that is pursuing answers to subject matter questions. As part of this broader view, I raise concerns about whether or not statisticians will be involved on such analysis teams. The sketch of the AMDS landscape provides a basis for developing an involvement strategy.

Behind my view of the research landscape lies experience from working on the Analysis of Large Data Sets (ALDS) project (for some early papers see Carr 1980 and Nicholson et al. 1980). In this paper I relate perceived patterns to my historical experience. The perceived patterns are my interpretation and undoubtedly reflect my biases. Hopefully the patterns also reflect the biases of other data analysts.

Why would anyone be interested in AMDS? After years of consideration I am not so sure that the exploratory analysis of large (massive) data sets is exactly the challenge that catches my fancy.
To me, the computing revolution is not so much that data sets are massive as it is that more and more things are recorded electronically and can be treated as data. The word massive hides at least four types of opportunities. The word massive does not immediately convey the opportunity to integrate data from multiple sources nor the opportunity to pay attention to details. The word massive does not convey the notion that some types of data are new and deserving of new analysis methodology. The word massive does not convey the increased capability to transform information into forms more suitable for human consumption; historically we have had to adapt just to be able to use computers. The word massive does not convey the opportunities provided by working in teams in the pursuit of scientific inquiry. Massive data sets should be considered in the full context of the computing revolution. It is likely that few are interested in data sets just because they are massive.

2. How Large is Your Large?

The Analysis of Large Data Sets (ALDS) project was a DOE project that flourished about fifteen years ago. The obvious motivation was the wide-spread existence of large data sets within the DOE community that were presumably worthy of analysis. Previous research such as the Bell Telephone Laboratories analysis of proton data from the Telstar I satellite (Gabbe, Wilk, and Brown 1967) and methodology developments such as projection pursuit (Friedman and Tukey 1974) provided inspiration. This led to a white paper entitled Exploratory Analysis of Large Data Sets (Nicholson and Hooper 1977) and subsequent DOE funding in 1978. By 1979 the ALDS project was under way with a Ramtek 9400 monitor (1280 x 1024 with bit-plane control) and a VAX to drive it.

Thinking about the analysis of large data sets was part of the spirit of the times. In 1977 the IMS devoted a special meeting to the topic. The following are some quotations from the 1979 DOE Statistics Symposium. Lou Gordon said, "No sufficient statistics, not reducible." Charlie Smith said, "File update and analysis histories must be maintained on the computer," and "greater than 10^n bytes, with more than half of that devoted to data base history." Leo Breiman noted, "Not large, but large and complex is the problem." Dick Beckman wondered, "Are large data sets more properly a question for computer science?"

Many at the symposium asked ALDS team members what they meant by large. The ALDS response for exploratory analysis consisted of two statements: (1) The quantity of data taxes our technical ability to organize, display and analyze. (2) Knowledge of the subject field is insufficient to define an analysis process that will summarize the information content of the data. This response got away from the game of "my data set is larger than your data set." However, a definition stated in relative terms proved problematic. When we learned to handle a data set with some ease, it was no longer large. Consequently we could never solve the problem of large. Recently Huber (1994) provided a static definition in terms of bytes (tiny = 10^2, small = 10^4, medium = 10^6, large = 10^8, and huge = 10^10). The labels seem reasonable by today's standards. If the labels don't change, perhaps we can master large data sets.

3. Early Issues

The ALDS project had a review panel to make sure that the project was aware of important issues and to steer us toward tasks we could address.
Over the years panel members and contributing observers included Doug Bates, Peter Bloomfield, Mike Feldman, James Foley, Jerry Friedman, Jack Heller, Richard Quanrud, Roland Johnson, Benjamin Topping, Marcello Pagano, Ari Shoshani, and David Wallace. John Tukey also participated in project review and Leo Breiman served as a consultant. Topics from the 1979 panel meeting included data quality problems; lack of homogeneity and data set dismemberment; sampling and cross-validation; integrating multiple related data bases and use of empirical Bayes methods; data base complexity and complexity reduction methods; rapid data access and self-describing binary files; graphical displays and methods of exploiting the hardware (display lists, pan and zoom, and the video lookup table); and extension of statistical packages such as Minitab® and P-STAT to handle at least a million observations. We discussed issues like common tape formats just as researchers today discuss the need for common file formats. With the input of distinguished review panel members it was pretty hard not to be aware of the important issues of the time. (Hall, 1980 and 1983, provides panel review comments.)

Hardware and software environments have changed dramatically since the late 70's. Some of the ALDS discussions are no longer relevant. However, many of the issues remain the same. The analysis of large data sets brought out one of the most recurrent and frustrating of issues: different disciplines have a hard time communicating. Still, a path to communication is clear: jointly funded research in AMDS.

4. Selected Research Areas and the ALDS Approach

Since the ALDS team did not have the database environment to address massive data sets, research turned to what is now called information visualization. In the early 1980's we worked with color anaglyph stereo, color encoding, rocking, 4-D rotation, and generalized glyph drawing. In the 1982 ASA graphics exposition, our detective work examples included anaglyph stereo and class-colored plots, from stem-and-leafs through generalized glyphs. Our large-n adaptation of the scatterplot was the binned-data plot. Publications soon followed that used binned data in smoothes, scatterplot matrices, density difference plots, animation, and graphical subset selection and display. Figure 1 shows a scatterplot matrix of binned plots (see also Carr 1991). Each little binned plot in Figure 1 represents roughly a quarter of a million points. This particular variation represents the log of counts in hexagon bins using the size of hexagon symbols. Carr et al. (1987) presented a color-based visualization approach that provides more detail for the counts. Note that the binning approach scales to much larger n and that the outliers, for example in Band 1 versus Band 2, are readily visible. Elementary sampling approaches may work for some massive data set problems but not for the task of learning about outliers. In later research we used binned data in image processing algorithms (gray-level erosion and thinning), in multivariate versions of box plots, and in cognostics algorithms.

Figure 1. A Hexagon Binned Scatterplot Matrix. The data set is a 512 x 512 Thematic Mapper image; the panels pair spectral Bands 1 through 7. Each plot represents more than 250,000 pixels. The spectral bands correspond to different wavelengths of light: Band 1 covers .45 to .52 um and Band 7 covers 2.08 to 2.35 um. The bands are in wavelength order except Band 6, which covers 10.4 to 12.5 um. Note the bivariate outliers that would not be found in typical univariate histograms.
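For readers who want to experiment with the binned-scatterplot idea, the sketch below counts points into hexagon bins using the familiar two-offset-lattice trick. It is only an illustration added here, not the ALDS code; the function name and the gridsize parameter are invented for the example. Plotting symbol size (or color) proportional to the log of the counts then yields a display in the spirit of Figure 1; matplotlib's hexbin function provides a packaged alternative.

```python
import numpy as np

def hexbin_counts(x, y, gridsize=40):
    """Count points into hexagon bins via two offset rectangular lattices (a sketch)."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    # scale the data so hexagon centers sit on unit lattices
    sx = (x - x.min()) / np.ptp(x) * gridsize
    sy = (y - y.min()) / np.ptp(y) * gridsize / np.sqrt(3)
    # lattice A: integer centers; lattice B: centers offset by (0.5, 0.5)
    ax, ay = np.round(sx), np.round(sy)
    bx, by = np.round(sx - 0.5) + 0.5, np.round(sy - 0.5) + 0.5
    # assign each point to whichever candidate center is closer (hexagon geometry)
    da = (sx - ax) ** 2 + 3.0 * (sy - ay) ** 2
    db = (sx - bx) ** 2 + 3.0 * (sy - by) ** 2
    use_a = da <= db
    cx = np.where(use_a, ax, bx)
    cy = np.where(use_a, ay, by)
    # accumulate counts per occupied hexagon center
    centers, counts = np.unique(np.column_stack([cx, cy]), axis=0, return_counts=True)
    return centers, counts
```

Because only occupied bins are kept, the storage and plotting cost depends on the number of distinct hexagons rather than on n, which is what lets the display scale to a quarter of a million points per panel and beyond.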
The ALDS cognostics research warrants some description because it is so little known and so close to what is now called intelligent agents. John Tukey coined the term cognostics; one definition is computer-guiding diagnostics. (Ultimately the guidance is for data analysts.) Given an overwhelming number of plots to review, we wanted algorithms that would decide which virtual plots to store for later review. Our application involved looking at computational fluid dynamics solution sets as data. For example, we would ignore the spatial index on vorticity vectors, bin the 3-D vectors into truncated octahedron cells, and then use algorithms to decide if the density pattern was unusual enough to warrant attention. Features of interest included clusters, arm-like appendages to the high density region, and holes. Unusual density patterns could be associated with low model resolution in turbulent areas or with distinctive flow patterns (such as the jet stream). Fuzzy ellipsoids constituted a class of uninteresting views. The algorithms ranked 2-D and 3-D bin-based density summaries. The variable pairs and triples included intermediate quantities such as derivatives of velocity as well as paired scalar quantities such as temperature and pressure. Our implementation ran on a parallel processor one time step behind the CFD model.

Some of the plots we found were strange and fascinating. Figure 2, from Carr (1991), shows an image flagged for its high degree of "armness". The variables are partial derivatives of velocity. Presumably a higher resolution CFD grid would produce values that filled in the gaps. The solid black center region in the plot contains the high density part of the plot and 95 percent of the total counts. The gray-level thinning algorithm operates on the remaining cells. The hexagon symbols that nearly touch indicate strings of two or longer that are left behind by thinning. Small filled hexagons are isolated cells. (Small open hexagons are eroded cells; these may appear filled in this small plot but are identifiable because they are not isolated.) The measure of armness was the number of cells left behind relative to those subject to thinning, and here it exceeded 60 percent. We had looked at thousands of "statistical data" scatterplots and had never seen patterns like this.

Figure 2. Gray-level Thinning of Hexagon Cells. The variables are partial derivatives of velocity components in a CFD solution set. The gaps between the high density region in the center and the loops suggest unusual phenomena or a lack of grid resolution in the model.
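As a rough illustration of the armness cognostic described above, and only an illustration: the ALDS version worked on hexagon cells with a gray-level thinning routine, while this sketch substitutes a square grid and scikit-image's skeletonize as a stand-in for the thinning step. The idea is to score a 2-D array of bin counts by how much stringy structure thinning leaves behind outside the 95-percent high-density core; the function name and the scoring details are invented for the example.

```python
import numpy as np
from skimage.morphology import skeletonize

def armness_score(counts, core_mass=0.95):
    """Crude cognostic: fraction of fringe cells that survive thinning (a sketch)."""
    occupied = counts > 0
    order = np.sort(counts[occupied])[::-1]          # counts in decreasing order
    # find the count cutoff that captures `core_mass` of the total mass
    idx = np.searchsorted(np.cumsum(order), core_mass * counts.sum())
    cutoff = order[min(idx, order.size - 1)]
    fringe = occupied & (counts < cutoff)            # cells outside the high-density core
    skeleton = skeletonize(fringe)                   # thinning leaves strings behind
    return skeleton.sum() / max(fringe.sum(), 1)
```

Blob-like fringes collapse to small skeletons, while long arms retain most of their cells, so a high score flags arm-like views for later human review, much as the ALDS ranking flagged Figure 2.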
So far ALDS has had little impact on the statistics community. At first glance one might think that software vendors would snap up the public domain ALDS modifications of Minitab. (The user manual of Littlefield and Carr 1986 indicates the many graphics capabilities.) However, the community with color workstations was small back then. Even today much of the statistics community does not yet have 1024 x 1280 resolution monitors. Tools for really utilizing bit planes will not appear in force until typical computing systems have more than eight bit planes. Software vendors design for the mass market. The technology transfer rate depends on how quickly the hardware and software become affordable to the majority of the community. With affordable gigabyte drives, the statistics community may soon give binned scatterplots a try.

The ALDS approach was to analyze data sets. Since looking at data sets was our business, at least four analysis steps were obvious. 1) An early step toward understanding the data is to get the background, sometimes both literally and figuratively. Tasks include reading about the metadata, talking to the people who collected the data or designed the sensor systems, and gathering additional related data. 2) Much early work involves getting the data into the standard shape (typically n cases by p variables) to apply the standard algorithms. 3) Managing the analysis process is a major task, especially in a quality assurance environment. For example, the volume of derived data can easily exceed the original. This will be true of the Earth Observing System (EOS) data even before the data is distributed around the world. 4) Content-rich and scientifically important data sets are often of recurrent interest. However, human memory fades, and both hardware and software evolve. Historically the storage media have been subject to deterioration. Unfortunately analysis archiving (AA) is usually a last minute thought that occurs when project funds and energies are exhausted. Human nature does not change rapidly, and many facets of the analysis process have changed little over the past two decades.

The ALDS project worked on managing the data analysis process. We used a tree-based representation of the analysis process, mousable icons for "aha's", and even a computer-controlled cassette tape for voice annotation. We envisioned a day when the logs would be analyzed to gain a deeper understanding of the analysis process. Our first papers were Carr et al. (1984) and Nicholson et al. (1984). Oldford and Peters (1985) and Becker and Chambers (1986) quickly noticed our beginning efforts and produced some really promising work. If the quality assurance movement for data analysis had caught on as we had expected, their work would have attracted much more attention. Now the field lies dormant; it needs some serious attention. Computer programmers use software management tools for multiple-person software engineering projects. AMDS teams need something similar.

Speed was a recurrent issue for ALDS. The notion of binning traded off resolution for speed. Reducing temporal resolution also increases speed. For example, weekly data can be aggregated to seasonal data. A 52-to-4 reduction is not a lot, but factors of 10 here and there can help. Of course, different patterns appear at different scales of resolution. That is why scientists are always pressing for higher resolution. Something is lost by summarization methods that reduce resolution. However, many data sets are never examined. The ALDS position was that it was better to reduce the resolution used in the analysis than to drop the whole data set. AMDS teams will benefit from software and hardware engineers who can directly address speed problems. Software engineers typically design for generality. Once specific tasks are identified, significant speedups can be gained by removing layers of code. At a recent massive data workshop, Albert Anderson of the Population Studies Center in Ann Arbor, Michigan reported PC speed-ups on the order of 1000 for the simple task of computing a mean.
One key was maximizing the size and use of fast processor cache. This is an engineering task, not a statistical one.

AMDS will involve compromises. By agreement we avoid thinking about certain classes of finite precision limitations. For all the computed images, has anyone ever seen a correct view of the Mandelbrot set? Consider an example closer to home. Suppose a researcher wants to split a 200-case data set into two equal subsets of size 100, a training set and a validation set. Ideally the researcher would like the training set to be an equally likely selection from the approximately 2^196 possible splits. Now, the traditional accept/reject selection method will typically generate a different subset if the random number seed changes. On a 32-bit machine there are no more than 2^32 such seeds. Clearly the majority of subsets are excluded from selection by the standard subset-generating process. Perhaps the process is sufficiently random for our conceptual purposes. If not, we are in deep trouble when it comes to selecting large equally-likely subsets from massive data sets. Perhaps we can simplify high-dimensional problems when variables seem sufficiently independent or sufficiently conditionally independent.
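The arithmetic behind the splitting example is easy to check. The short computation below is an illustration added here, not part of the original paper; it compares the number of possible half-splits of 200 cases with the number of states a 32-bit seed can reach.

```python
import math

n_splits = math.comb(200, 100)   # ways to choose a 100-case training set, about 9.0e58
seed_space = 2 ** 32             # distinct seeds on a 32-bit generator

print(f"possible splits : about 2^{math.log2(n_splits):.0f}")   # roughly 2^196
print(f"reachable splits: at most 2^32 = {seed_space:,}")
```

However the seed is chosen, at most 2^32 of the roughly 2^196 splits can ever be produced, which is the sense in which the vast majority of subsets are excluded from selection.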
5. Funding and Analysis Teams

DOE funding was known for its ebbs and flows. It became apparent that if ALDS was to survive, the project needed to attach itself to a well-funded application area. We began applying our methodology to computational fluid dynamics (CFD) models. We soon discovered that we had little chance of being heard in the CFD community without a professional CFD modeler on our team. By the time we got a CFD modeler on our team it was too late to salvage the project. Nonetheless, during the last year of the project the modeler provided a wealth of experience and new perspectives. The modeler became our translator, and we began getting opportunities to present our work in CFD forums.

For those making the transition from a pure statistics training into the world of applications, it is important to know how things work. The existing infrastructure often makes it difficult to engage in cross-disciplinary work. To publish in journals one needs to know the language, and often the key people to reference. Good ideas and solid research are not enough. Certain credentials can be crucial for funding. Common knowledge is that the letters "MD" are particularly important for NIH funding. There is little point in questioning this wisdom or lamenting the fact that major funding for data analysis research is provided without the requirement of having a professional statistician on the team. The point is that there are political as well as scientific advantages to working on an interdisciplinary team.

Little research occurs without funding. To be involved in AMDS requires serious computational resources and significant funding. Statisticians should be aware that our profession is a very small service profession, small in size and small in the amount of direct research funding. While the current interest in AMDS may bring some direct funding, most of the research opportunities will appear as spin-offs from well-funded application areas. Funding for medically related research is significant. This provides numerous AMDS opportunity areas, from molecular modeling to the study of health care costs. Bill Eddy's session on functional magnetic resonance imaging (fMRI) at this conference provided an outstanding example of statisticians collaborating on an interdisciplinary team in a medically related area. The immediacy of profits suggests AMDS opportunities in the area of market analysis. The study of manufacturing provides the opportunity to increase reliability and profits. Profitable uses of administrative data will be identified. The major funding for, and lack of statistician involvement with, EOSDIS (the Earth Observing System Data and Information System) suggests that it is yet another opportunity area. The reasons behind the lack of statistician involvement so far include competition and lack of background.

6. Competition and Credibility

Many disciplines are eager to provide analysis methodology. Many physicists and engineers do not perceive that they need help. If anyone needs help, the computer scientists will gladly provide it. They discovered DATA sometime during the last decade (and anthropomorphized it on Star Trek) and now can mine knowledge from data at will. As a service discipline we have major competition, not only from computer science but from the directly funded disciplines themselves. The computing revolution means that scientists do things themselves. Software tools have replaced the graphic designers, secretaries, and statisticians.

How many statisticians have the training to make them credible candidates for an AMDS team? Years ago, ALDS project staff visited Karl-Heinz Winkler at Los Alamos National Laboratory. He had a direct high-speed link to a CRAY, a large portion of a mobile home devoted to tapes, and special high-speed disks to drive his high resolution monitor. One of his papers suggested that if a visualization system could not display information at 3 gigabaud, the system was wasteful of his eye-brain resources. I heard that in one month during a particularly bad winter he managed to use two Cray-months of time by grabbing unused cycles. In contrast, our project had requested supercomputer time more on the order of one hour. Our visits to places such as NASA Ames, Lawrence Berkeley Laboratory, and NSA provided further cause for computational humility. Yes, we ALDS statisticians had developed nonlinear models to fit tens of thousands of observations at a crack, but was that large? Resources such as tape robots put our project resources (a VAX 11/780 and a Ramtek display monitor) to shame. It would take many of our data sets to match the amount of data in one satellite image, and then there was NASA's black hole of tapes to consider. The situation today is a bit better than in the past. More statisticians are working in supercomputer environments, but the number is small.

7. Positioning and Education

Bill Eddy reports that the fMRI project walked in the door as a question about t-tests. His opportunity to work on the project might be regarded as pure luck. However, Dr. Eddy was at Carnegie Mellon University, where computational activities flourish, and Dr. Eddy had done very interesting computational work (for example, see Eddy and Schervish 1988). He was well-positioned to have the fMRI opportunity and to recognize it as something more than t-tests. Writers like Hahn (1989) suggest changing the curriculum to prepare statisticians for the data-intensive environments in the work place.
Clearly some facets of statistical education need to change if graduating students are to be well-positioned to work on AMDS teams. AMDS statisticians will need to have experience with supercomputers, parallel processing, database systems, data structures, algorithms, and high-end graphics software. Traditional statistics programs offer little in most of these areas. When change occurs too slowly, alternatives emerge. Statistics students eager to respond to AMDS challenges may find their needs better served by emerging programs in computational science. Computational science programs provide experience in computationally advanced environments. Some programs even promote team research. This is different than short-term consulting on someone else's research project. The excitement of team research can be hard to forget. Contacts made in school lead to projects in later years.

As the teaching of statistics evolves, we should make efforts to include data analysis methodology developed in other disciplines. By history we lay claim to data analysis. The Webster definition of statistics is: the science that deals with the collection, classification, analysis, and interpretation of numerical facts or data, and that, by use of mathematical theories of probability, imposes order and regularity on aggregates of more or less disparate elements. We should teach methods that work (despite their origins), debunk wild claims, and contribute our statistical wisdom when others are working to develop new data-analytic methodology. Professionals benefit from the education opportunities provided by professional communities. The longer scientists work at data analysis, the more they find in common with statisticians. The statistical community should work even harder at providing opportunities for cross-disciplinary sharing.

Some facets of traditional statistical training serve quite well. Each generation of scientists needs to be weaned away from the one-at-a-time experimental approach and taught experimental design. Each generation needs to be taught that experimenter bias is not just a problem of other disciplines. Each generation needs to be taught to reexamine assumptions. The sophistication of people working with massive data sets usually guarantees thorough discipline-based grounding. However, single-discipline grounding frequently leaves blind spots. Often as not, the most important contribution of the statistician is obvious from the statistician's perspective and scarcely worth a footnote in the research literature.[1]

[1] Every statistician collects discovery stories. Ralph Kahn (JPL) and I co-chaired a session at the IGARSS '94 meeting. Afterwards we went to see the demonstrations at the IDL booth. At the booth David Stern selected a view of the Earth and showed us an animation of newly acquired sensor data. Ralph immediately understood and pointed out patterns in what we were seeing, such as swirling regions that built up behind continents and then broke off. In the session I had talked about the importance of showing differences (a statistical graphics thing), so I asked David to animate the difference between images. David quickly modified the demo program to show differences (impressive). After a brief period of watching differences, satellite trajectories appeared on the globe. Not all the data had been properly aligned. The staff responsible for the data just happened to be close at hand, so David and Ralph called them over to watch. I presume the staff fixed the problem at their first opportunity. It would have been fun to pretend that I, a mere statistician, had seen the problem in the original animation and had thought up the second animation to confirm my suspicions. Unfortunately the idea did not occur to me until later. Simple statistical graphics principles apply to some fairly large data sets.
Statisticians ask obnoxious questions like, "Can you verify that the sensor calibration is still applicable?" Focusing attention on the currently perceived critical path is a common scientific trait. Scientists often presume to know where the problems are. This allows strategic allocation of resources to study the problem. The narrowing of attention can be cost effective but sometimes misses the mark. For a long time it was known that the atmospheric action happened at the equator. While polar phenomena may be localized, little details like the hole in the ozone layer are worth discovering. In general, narrow focus limits the ability to make discoveries and inferences about larger domains. For example, in U.S. environmental monitoring, a common strategy is to monitor at locations where pollution is known to be high. This complicates making discoveries about new regions of high pollution and complicates characterizing the state of the nation.

Sampling methodology for AMDS needs to be carefully considered. Environmental monitoring often uses stratified sampling methods. This provides some improvement of estimates when the strata are correct. Incorrect assumptions about strata can lead to poor estimates. For example, the Washington/Oregon salmon population has been over-estimated for years due to incorrect stratification assumptions. Now it appears that it will be too expensive to save the salmon. Narrow focus and incorrect weighting of information are more common in science than scientists like to admit. Sometimes the consequences are massive. Skeptical statisticians can be very helpful when experts are ready to build their certain knowledge into the software that preprocesses massive data sets.

Summarizing incoming information and passing on the summary is a common technique for controlling the amount of stored data. For example, calculations using the Doppler shift in laser beam reflections can produce estimates of atmospheric wind velocities at various altitudes. The Doppler shift information may exist only during the brief period when it is used to calculate wind velocities. Parameter calculations, such as wind velocity calculations, are not necessarily linear and may incorporate the results of calibration experiments. Statistical adjustments such as background and interference correction are often necessary. Statisticians trained in the traditional areas of calibration, background and interference correction will be able to help. Kahn et al. (1991) raise the issue of validating parameter estimates. It would seem that many statisticians have the tools to contribute, although the size and complexity of the data may take some getting used to.

Model criticism is an area of particular importance. The world of models continues to grow. A short list includes differential equation models, regression models, Monte Carlo models, genetic algorithms, neural nets, fuzzy set models, tree models, and hierarchical Bayesian models. Statisticians can assess models for internal consistency, compare models against data (real or simulated), and compare models against other models.
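As a small, hedged illustration of the last two of these activities, and not a method from the paper, the sketch below scores competing models against the same held-out observations. The function and variable names are invented for the example, and mean squared error is only one of many reasonable discrepancy measures.

```python
import numpy as np

def rank_models(y_obs, predictions):
    """Rank candidate models by mean squared error against held-out observations."""
    y_obs = np.asarray(y_obs, float)
    scores = {name: float(np.mean((y_obs - np.asarray(y_hat, float)) ** 2))
              for name, y_hat in predictions.items()}
    return sorted(scores.items(), key=lambda item: item[1])  # smallest discrepancy first

# hypothetical use: y_obs from sensors, y_hat arrays produced by two competing models
# ranked = rank_models(y_obs, {"regression": y_hat_a, "neural net": y_hat_b})
```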
Statisticians may not make the best salespersons, but we can market our strengths once we have established computational credibility.

8. Types of Data and Challenge Areas

In the ALDS project we thought about practical cardinality for data sets. Our first distinction was storable versus flow-through data. For some data sets there is no intention of ever storing all the data. Flow-through data requires the development of models that update as the data flows by. In terms of graphics an obvious problem is scaling. We want good resolution within each time window and comparable plots over time. Having a large buffer helps to achieve a compromise between the two conflicting goals. What can be done with large buffers? Our second distinction was repeatable versus unique data. Computational simulations may generate too much data to store in any detail, but they are potentially repeatable. There can be two passes: one to determine global summarization scales and one to do the summarization. Only a little research is being done in these areas.
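A minimal sketch of a model that "updates as the data flows by" is a running mean and variance maintained with Welford's one-pass update. The class below is an illustration added here, not ALDS software; the same pattern extends to windowed summaries that keep plot scales comparable over time.

```python
class RunningMoments:
    """One-pass mean and variance for flow-through data; the raw stream is never stored."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # running sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)   # Welford's update

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")
```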
Statisticians will encounter increasingly complicated data objects. Notions dependent on Euclidean distance between objects may have to be forgone. For example, the application of molecular similarity analysis in drug discovery research involves binary vectors representing the presence or absence of up to 300 molecular fragments, labeled graphs representing the bonding structures of molecules, and scalar fields in R^3 representing the electrostatic fields of molecules (Johnson 1989; Johnson and Maggiora 1990). Such complex data types provide the opportunity to develop new approaches.

While remote sensing data is often nicely structured, many challenges remain. Consider a currently extreme case consisting of a 2^14 by 2^14 pixel image with intensities for 123 channels (bands). Our rule of thumb for multivariate analysis is that n should be greater than 3p. Can we reliably perform change detection with two images? Atmospheric conditions and changes in sensor condition induce variability. Sensor position calculations and atmospheric corrections are less than perfect. With massive data sets even computers are not so reliable. I once had a client who went to considerable effort to develop confidence bounds for temperature estimates computed from pixel values in the thermal band. He based his bounds on data from one image and tedious-to-obtain "ground truth" values. The client was not happy with me when I said that he had only one case and that the bounds should incorporate image-to-image variability. He carefully explained to me that each image was expensive and that his image contained millions of observations. While I still maintain my position, his question made me wonder: how do we borrow strength from the internal structure of such large data sets?

Sensor data provides challenges beyond modeling its internal structure. Sensor data is useful not only in relationship to similar data sets; it is also useful for refining computational science models. For example, an ocean CFD solution set may include surface temperatures. Sophisticated regression models can use the mismatch between the modeled and observed surface temperatures to adjust CFD model input values or to raise questions about the adequacy of the boundary conditions and grid resolution. While research on refining ocean models may be well under way, the task of using sensor data to improve computational science models provides many statistical opportunities.

I remember colleagues assessing a complicated Monte Carlo code. The clients needed simulation results for different sets of input values. For each set of input values they had to run the code. It turned out that a simple regression model characterized the relationship between inputs and outputs. Within the domain of the explored input parameter space there was no need to run the complicated Monte Carlo model; trivial calculations using the regression coefficients sufficed. The direct study of Monte Carlo simulation intermediate and final results can provide insights.
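The surrogate idea in that story is easy to sketch: fit a cheap response surface to a modest set of (input, output) runs and predict within the explored region. The code below is an illustration added here, a plain linear least-squares fit with invented function names, not the colleagues' actual model; a real application would add interaction terms where needed and check lack of fit before trusting the surrogate.

```python
import numpy as np

def fit_surrogate(X, y):
    """Fit a linear response surface to (input, output) pairs from an expensive code."""
    A = np.column_stack([np.ones(len(X)), X])       # intercept plus input columns
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)    # least-squares coefficients
    return coef

def predict_surrogate(coef, X_new):
    """Predict outputs for new inputs inside the explored parameter region."""
    A = np.column_stack([np.ones(len(X_new)), X_new])
    return A @ coef
```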
9. More Data, More Complexity and More Barriers

Key scientists from many disciplines responded to grand challenge queries concerning their information needs. Almost to a person, the answer was that crucial developments depended on having higher resolution, more detailed data. Science is competitive, and scientists understand the advantages of being the first to have access to the "best" data. It is hard to imagine scientists saying that they had all the data they could handle for the next few years. The pressure for obtaining higher resolution data will continue.

The mass of available data continues to grow. For years sensors have produced time series with several million observations per second. Computer clock speed indexes the ever-increasing collection rate. Spatial resolution is heading toward overwhelming detail, with higher spatial-resolution sensors always on the drawing boards. The resolution of measurements at space-time coordinates is increasing. It is scarcely a step from 7 spectral bands to 1728 energy channels. Soon we will have multiple versions of global AVHRR data sets, and we can accumulate such data sets faster by using more sensors. Atmospheric scientists will not be happy until there is complete global coverage at fixed time points. The mass of soon-to-be-available data defies imagination.

The complexity of data objects will increase. Consider the complete medical record for a human being, from the birth record, through medical exams that include CAT scans and blood workups, to the death certificate. How does one compare groups of such data objects? How does one use the evolving data object for patient care? If we consider the problem in the abstract it becomes overwhelming. Being pragmatic allows progress to be made (see Powsner and Tufte 1994). The challenge is to refine the pragmatic assumptions to more closely reflect the relationships in the data and the possibilities provided by new computing tools and new understandings of human cognition.

Unfortunately the barriers to analysis are numerous. These include the interdisciplinary barriers and the lack of training described earlier. A big change since the ALDS project has been the emphasis on software and data as intellectual property. The ALDS project could get source code. Today it may be easier to link special routines into commercial packages, but the source code is often unavailable. The source code is necessary for understanding exactly how things work, throwing out unneeded software layers, and optimizing. Access to data and algorithms is becoming increasingly difficult. Companies treat data as private property. Researchers exercise their rights to new data until it is well worn. Algorithms such as marching cubes receive patents. Researchers wanting algorithms to remain in the public domain need to copyright their work to keep it from being privatized. Data and data processing methodology translate into money and power. Politicians have noticed. They influence what will and will not be collected. Some information will not be collected.

In the climate of secrecy and security it becomes difficult to develop an AMDS community. At a recent massive data set workshop, it was not expected that NSA attendees would be among the featured speakers. A talk by a communications industry representative described the general nature of one massive data set but said little about data specifics and the analysis that had been done. A representative from the computer chip manufacturing industry indicated that the data sets were massive and that their engineers have specific ways of using the data. Again data and analysis details were missing. One can envision an AMDS conference to which everyone comes, not to share, but in the hope that someone else will reveal useful technology. This image may be an exaggeration, but it expresses a concern about the trend toward secrecy and security. The rate of progress in AMDS will be a function of the amount of shared learning. To promote scientific progress, agencies such as NSF will need to work to counter-balance the tendency of companies and universities to isolate AMDS researchers. Further, AMDS researchers need to quickly establish a forum for AMDS papers. At least two statistics journals would welcome AMDS papers. Unfortunately the statistics journals reach only a small segment of the AMDS community. AMDS researchers have not yet coalesced into a community and need encouragement.

In many scientific applications, the lack of analysis funding is a major barrier to AMDS research. It is particularly easy to be jealous of all the money spent on developing data collection systems and on gathering data. Sometimes the process of collecting data seems to have a life of its own. The purpose can be long forgotten. The data collection might be useful to someone sometime. Sometimes data collection serves as political proof that a problem is being studied. Studying the problem is a great delaying tactic. For example, the U.S. has a massive amount of nuclear material that still needs to be stored. If cursory analysis discovers that the data is lacking in resolution, there is justification for more data collection and the cycle is complete. The data collection industry often escapes notice. However, NASA's black hole of tapes has drawn attention. The EOSDIS is increasing the percentage of funds allocated for data analysis. It would seem that if analysis is really wanted, the percentage of funds for analysis would increase in other areas as well.

10. Data Analysis Environments and Graphics

The ultimate filter for massive data sets is the human mind. If the analyst is the weak link in the analysis process, it makes sense to devote resources toward improving analyst performance. For discussion purposes it might be helpful to imagine that a massive data set analyst were as important as a nuclear plant control room operator, an airplane pilot, or a commander in a command and control center. Training programs would be considered extremely important. The environment would be structured toward making important decisions. Comparing the imagined environment to the typical data analysis environment raises many questions. Do our graphics workstations optimize human performance?
Tufte (1990) calls attention to the low resolution of the CRT screen as compared to printed graphics. In recent times, readily available laser plotters have improved from 300 to 600 to 1800 line resolution, but the workstation screen remains basically the same. Further, I am not aware of any plans to extend the gamut of colors currently provided by RGB guns toward the perceptual limits of the human eye. While man-machine interface research is in progress for virtual reality environments, the financial driving force pushes the research toward entertainment applications. We know that virtual reality environments can provide loud sounds, bright flashes, and disorienting time-lagged motion. The design of environments for depth of thought is not in the mainstream. Current deficiencies in data analysis environments are literally glaring (see Tufte 1989).

Reexpression (transformation) of information is a key to human learning. As a simple psychological description, non-reflexive transformations of internal representations characterize learning. That is, learning has occurred when a new way of perceiving precludes going back to the old perception. Computers enhance our ability to make transformations. Computer transformations mediate our transformations between internal representations. Important computer transformations include mathematical symbols to visual representations and visual representations to other visual representations. Computers can help us take the steps from visual representations back to words, mathematical symbols, and our other forms of internal coding. Human assessments and insights are the targets. Toward this end various sensory representations can be useful, including the auditory, kinesthetic, and olfactory senses. The whisper of a voice seemingly located near one's head can change awareness. We learn through our senses. We learn by doing. We will learn about massive data sets by reexpressing them.

11. Closing Remarks

Statisticians live on the edge of scientific research, where purely deterministic models are inadequate and where stochastic approximations are needed to reduce complexity and speed processing. As the volume of scientific research increases, so should the corresponding statistical surface area. It is cause for concern that the profession is not growing more in the face of the current opportunity. I hope that the challenge provided by AMDS provides a rallying call to our profession.

Acknowledgments

Research related to this article was supported by EPA under cooperative agreement No. CR828G820-01-Q. The article has not been subjected to EPA review and thus does not necessarily reflect the views of the agency, and no official endorsement should be inferred. Minitab is a registered trademark of Minitab Inc.

References
Becker, R. A. and J. M. Chambers. 1986. "Auditing of Data Analyses." American Statistical Association 1986 Proceedings of the Statistical Computing Section, pp. 11-18.

Carr, D. B. 1980. "The Many Facets of Large." Proceedings of the 1979 DOE Statistical Symposium, pp. 201-204.

Carr, D. B. 1991. "Looking at Large Data Sets Using Binned Data Plots." Computing and Graphics in Statistics, eds. A. Buja and P. Tukey, pp. 7-39. Springer-Verlag, New York, NY.

Carr, D. B., R. J. Littlefield, W. L. Nicholson, and J. S. Littlefield. 1987. "Scatterplot Matrix Techniques for Large N." Journal of the American Statistical Association 82(398), pp. 424-436.

Carr, D. B., P. J. Cowley, M. A. Whiting, and W. L. Nicholson. 1984. "Organization Tools for Data Analysis Environments." American Statistical Association 1984 Proceedings of the Statistical Computing Section, pp. 214-218. American Statistical Association, Washington, DC.

Eddy, W. F. and M. J. Schervish. 1988. "Asynchronous Iteration." Computing Science and Statistics: Proceedings of the 20th Symposium on the Interface, pp. 165-173. Interface Foundation of North America, Fairfax Station, VA.

Friedman, J. H. and J. W. Tukey. 1974. "A Projection Pursuit Algorithm for Exploratory Data Analysis." IEEE Transactions on Computers C-23, pp. 881-890.

Gabbe, J. D., M. B. Wilk, and W. L. Brown. 1967. "Statistical Analysis and Modeling of the High-Energy Proton Data from the Telstar I Satellite." Bell System Technical Journal 46(7), pp. 1303-1450.

Hahn, G. J. 1989. "Statistics-Aided Manufacturing: A Look Into the Future." The American Statistician 43(2), pp. 74-79.

Hall, D. L. (Ed.) 1980. "ALDS 1979 Panel Review." PNL-SA-8781. Pacific Northwest Laboratory, Richland, WA.

Hall, D. L. (Ed.) 1983. "ALDS 1982 Panel Review." PNL-SA-11178. Pacific Northwest Laboratory, Richland, WA.

Huber, P. J. 1994. "Huge Data Sets." Compstat 1994: Proceedings, eds. R. Dutter and W. Grossmann. Physica-Verlag, Heidelberg.

Johnson, M. A. 1989. "A Review and Examination of the Mathematical Spaces Underlying Molecular Similarity Analysis." Journal of Mathematical Chemistry 3, pp. 117-145.

Johnson, M. A. and G. M. Maggiora. 1990. Concepts and Applications of Molecular Similarity. Wiley Interscience, New York.

Kahn, R., R. D. Haskins, J. Knighton, A. Pursch, and S. Granger-Gallegos. 1991. "Validating a Large Geophysical Data Set: Experiences with Satellite-Derived Cloud Parameters." Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pp. 133-140. Interface Foundation of North America, Fairfax Station, VA.

Littlefield, J. S. and D. B. Carr. 1986. "MINIGRAPH, a Device Independent Interactive Graphics Package." PNL-SA-14366. Pacific Northwest Laboratory, Richland, WA.

Nicholson, W. L., D. B. Carr, P. J. Cowley, and M. A. Whiting. 1984. "The Role of Environments in Managing Data Analysis." American Statistical Association 1984 Proceedings of the Statistical Computing Section, pp. 80-84. American Statistical Association, Washington, DC.

Nicholson, W. L., D. B. Carr, and D. L. Hall. 1980. "The Analysis of Large Data Sets." American Statistical Association 1980 Proceedings of the Statistical Computing Section, pp. 59-65.

Nicholson, W. L. and R. L. Hooper. 1977. "A Research Program in Exploratory Analysis of Large Data Sets." (Copies available from D. Carr.)

Powsner, S. M. and E. R. Tufte. 1994. "Graphical Summary of Patient Status." The Lancet 344(8919), pp. 386-389.

Tufte, E. R. 1989. Visual Design of the User Interface. IBM Corporation, Armonk, NY.

Tufte, E. R. 1990. Envisioning Information. Graphics Press, Cheshire, CT.