
Title: The MCDB System for Management and Analysis of Petabyte-Scale Uncertain Data

Analysts working with very large data sets often use statistical models to “guess” at unknown, inaccurate, or missing information associated with the data. For example, a distant object viewed through an optical lens will have its position slightly shifted by imperfections in the lens. Thus, rather than considering the object’s observed position to be absolutely correct, it makes sense to take into account the lens’s imperfections to obtain a probabilistic guess as to the object’s true position. For another example, it might be important to associate some sort of error distribution with each of the individual sensors in an array of magnetometers. This error distribution may be complex and include spatially-driven covariances, because errors in nearby sensors are likely to be correlated (caused, for example, by the presence of some nearby, fixed metal object). This project is concerned with the design and implementation of a prototype data management system called the Monte Carlo Database System, or MCDB for short. MCDB allows an expert-level analyst or statistician to attach arbitrary stochastic models to very large data sets in order to “guess” the values for unknown or inaccurate data, such as the actual position of the observed object in the lens example above. When the resulting data set is analyzed, the underlying stochastic models are used to generate hundreds or thousands of possible data set instances, and each of those possible instances is analyzed separately by MCDB. Thus, MCDB does not just give a single answer to the analysis, but it actually gives an empirical distribution of query results that embody the underlying uncertainty, and can in turn be analyzed using standard statistical techniques. The stochastic models in MCDB are implemented as user-defined, external C++ libraries called Variable Generation functions (VG functions for short).
Because the VG function interface is exceedingly general, it allows MCDB to be used in a very wide variety of application domains, in conjunction with virtually any stochastic model. Ultimately, we ended up calling the system that we developed over the course of the project SimSQL, to denote the fact that it allowed for stochastic simulation of data. The software developed during the lifetime of the project is available for download from the SimSQL project website.
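The Monte Carlo idea described in the abstract (generate many possible data set instances, run the same query on each, and study the resulting empirical distribution) can be illustrated with a minimal sketch. This is not MCDB or SimSQL code and does not use the VG function interface, which the abstract says is implemented as C++ libraries. The names `vg_normal_error` and `monte_carlo_query`, the Gaussian error model, and the sample positions are all hypothetical and chosen only for illustration.

```python
import random
import statistics


def vg_normal_error(rng, observed, sigma):
    """Hypothetical stand-in for a stochastic model: draw one plausible
    true value for an observed value, assuming Gaussian measurement error."""
    return rng.gauss(observed, sigma)


def monte_carlo_query(observed_rows, sigma, query, n_instances=1000, seed=0):
    """Generate n_instances possible data sets and run the query on each.

    Returns the list of query results, i.e. an empirical distribution."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    results = []
    for _ in range(n_instances):
        # One possible instance of the uncertain data set.
        instance = [vg_normal_error(rng, x, sigma) for x in observed_rows]
        results.append(query(instance))
    return results


# Assumed example data: observed positions with an assumed error of 0.5.
observed_positions = [10.2, 11.7, 9.8, 10.9]
results = monte_carlo_query(
    observed_positions,
    sigma=0.5,
    query=lambda rows: sum(rows) / len(rows),  # query: average position
)

# Standard statistics applied to the empirical distribution of answers.
mean_estimate = statistics.mean(results)
spread = statistics.stdev(results)
```

Here `results` plays the role of the empirical distribution of query answers, and `mean_estimate` and `spread` are examples of the standard statistical summaries one might compute from it.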
  1. Rice Univ., Houston, TX (United States)
Publication Date:
OSTI Identifier:
Report Number(s):
Final Report: Rice-DE-SC0001779
DOE Contract Number:
Resource Type:
Technical Report
Resource Relation:
Related Information:
R. Jampani, F. Xu, M. Wu, L. Perez, C. Jermaine, P. Haas: “The Monte Carlo database system: Stochastic analysis close to the data.” ACM Transactions on Database Systems (TODS) 36.3 (2011): 18.
Cai, Z., Vagena, Z., Perez, L., Arumugam, S., Haas, P. J., and Jermaine, C. (2013, June). “Simulation of database-valued Markov chains using SimSQL.” In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (pp. 637-648). ACM.
Arumugam, S., Xu, F., Jampani, R., Jermaine, C., Perez, L. L., and Haas, P. J. (2010). “MCDB-R: Risk analysis in the database.” In Proceedings of the VLDB Endowment, 3(1-2), 782-793.
Cai, Z., Gao, Z. J., Luo, S., Perez, L. L., Vagena, Z., and Jermaine, C. (2014, June). “A comparison of platforms for implementing and running very large scale machine learning algorithms.” In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (pp. 1371-1382). ACM.
Perez, Luis L., and Christopher M. Jermaine. “History-aware query optimization with materialized intermediate views.” In Data Engineering (ICDE), 2014 IEEE 30th International Conference on, pp. 520-531. IEEE, 2014.
Perez, L. L., Arumugam, S., and Jermaine, C. M. (2010, June). “Evaluation of probabilistic threshold queries in MCDB.” In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (pp. 687-698). ACM.
Research Org:
Rice Univ., Houston, TX (United States)
Sponsoring Org:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
Country of Publication:
United States