Automated metadata, provenance cataloging and navigable interfaces: ensuring the usefulness of extreme-scale data
- General Atomics, San Diego, CA (United States)
- Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States
The MPO (Metadata, Provenance, Ontology) Project successfully addressed the goal of improving the usefulness and traceability of scientific data by building a system that could capture and display all steps in the process of creating, analyzing and disseminating that data. Throughout history, scientists have generated handwritten logbooks to keep track of data, their hypotheses, assumptions, experimental setup, and computational processes as well as reflections on observations and issues encountered. Over the last several decades, with the growth of personal computers, handheld devices, and the World Wide Web, the handwritten logbook has begun to be replaced by electronic logbooks. This transition has brought increased capability such as supporting multi-media, hypertext, and fast searching. However, content creation and metadata (a set of data that describes and gives information about other data) capturing has for the most part remained a manual activity just as it was with handwritten logbooks. This has led to a fragmentation of data, processing, and annotation that has only accelerated as scientific workflows continue to increase in complexity. From a scientific perspective, it is very important to be able to understand the lineage of any piece of data: who, what, when, how, and why. This is typically referred to as data provenance. The fragmentation discussed previously often means that data provenance is lost. As scientific workflows move to powerful computers and become more complex, the ability to track all of the steps involved in creating a piece of data become even more difficult. It was the goal of the MPO (Metadata, Provenance, Ontology) Project to create a system (the MPO System) that allows for automatic provenance and metadata capturing in such a way to allow easy searching and browsing. This goal needed to be accomplished in a general way so that it may be used across a broad range of scientific domains, yet allow the addition of vocabulary (Ontology) that is domain specific as is required for intelligent searching and browsing in the scientific context. Through the creation and deployment of the MPO system, the goals of the project were achieved. An enhanced metadata, provenance, and ontology storage system was created. This was combined with innovative methodologies for navigating and exploring these data using a web browser for both experimental and simulation-based scientific research. In addition, a system to allow scientists to instrument their existing workflows for automatic metadata and provenance is part of the MPO system. In that way, a scientist can continue to use their existing methodology yet easily document their work. Workflows and data provenance can be displayed either graphically or in an electronic notebook format and support advanced search features including via ontology. The MPO system was successfully used in both Climate and Magnetic Fusion Energy Research. The software for the MPO system is located at https://github.com/MPO-Group/MPO and is open source distributed under the Revised BSD License. A demonstration site of the MPO system is open to the public and is available at https://mpo.psfc.mit.edu/. A Docker container release of the command line client is available for public download using the command docker pull jcwright/mpo-cli at https://hub.docker.com/r/jcwright/mpo-cli.
- Research Organization:
- Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Fusion Energy Sciences (FES); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- DOE Contract Number:
- SC0008736; AC02-05CH11231; SC0008697
- OSTI ID:
- 1335866
- Report Number(s):
- DOE-MIT-08736
- Country of Publication:
- United States
- Language:
- English
Similar Records
Dynamic Non-Hierarchical File Systems for Exascale Storage
National Computational Infrastructure for LatticeGauge Theory SciDAC-2 Closeout Report