skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Scientific Data Management Center for Enabling Technologies

Abstract

Managing scientific data has been identified by the scientific community as one of the most important emerging needs because of the sheer volume and increasing complexity of data being collected. Effectively generating, managing, and analyzing this information requires a comprehensive, end-to-end approach to data management that encompasses all of the stages from the initial data acquisition to the final analysis of the data. Fortunately, the data management problems encountered by most scientific domains are common enough to be addressed through shared technology solutions. Based on community input, we have identified three significant requirements. First, more efficient access to storage systems is needed. In particular, parallel file system and I/O system improvements are needed to write and read large volumes of data without slowing a simulation, analysis, or visualization engine. These processes are complicated by the fact that scientific data are structured differently for specific application domains, and are stored in specialized file formats. Second, scientists require technologies to facilitate better understanding of their data, in particular the ability to effectively perform complex data analysis and searches over extremely large data sets. Specialized feature discovery and statistical analysis techniques are needed before the data can be understood or visualized. Furthermore, interactivemore » analysis requires techniques for efficiently selecting subsets of the data. Finally, generating the data, collecting and storing the results, keeping track of data provenance, data post-processing, and analysis of results is a tedious, fragmented process. Tools for automation of this process in a robust, tractable, and recoverable fashion are required to enhance scientific exploration. The SDM center was established under the SciDAC program to address these issues. The SciDAC-1 Scientific Data Management (SDM) Center succeeded in bringing an initial set of advanced data management technologies to DOE application scientists in astrophysics, climate, fusion, and biology. Equally important, it established collaborations with these scientists to better understand their science as well as their forthcoming data management and data analytics challenges. Building on our early successes, we have greatly enhanced, robustified, and deployed our technology to these communities. In some cases, we identified new needs that have been addressed in order to simplify the use of our technology by scientists. This report summarizes our work so far in SciDAC-2. Our approach is to employ an evolutionary development and deployment process: from research through prototypes to deployment and infrastructure. Accordingly, we have organized our activities in three layers that abstract the end-to-end data flow described above. We labeled the layers (from bottom to top): a) Storage Efficient Access (SEA), b) Data Mining and Analysis (DMA), c) Scientific Process Automation (SPA). The SEA layer is immediately on top of hardware, operating systems, file systems, and mass storage systems, and provides parallel data access technology, and transparent access to archival storage. The DMA layer, which builds on the functionality of the SEA layer, consists of indexing, feature identification, and parallel statistical analysis technology. The SPA layer, which is on top of the DMA layer, provides the ability to compose scientific workflows from the components in the DMA layer as well as application specific modules. NCSU work performed under this contract was primarily at the SPA layer.« less

Authors:
Publication Date:
Research Org.:
North Carolina State University
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1095275
Report Number(s):
Final Technical Report
DOE Contract Number:  
FC02-07ER25809
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; 96 KNOWLEDGE MANAGEMENT AND PRESERVATION; scientific workflows, cloud computing, fault-tolerance, Kepler, SciDAC, provenance, scientific exploration, climate workflows, fusion workflows, biology workflows, astrophysics workflows

Citation Formats

Vouk, Mladen A. Scientific Data Management Center for Enabling Technologies. United States: N. p., 2013. Web. doi:10.2172/1095275.
Vouk, Mladen A. Scientific Data Management Center for Enabling Technologies. United States. doi:10.2172/1095275.
Vouk, Mladen A. Tue . "Scientific Data Management Center for Enabling Technologies". United States. doi:10.2172/1095275. https://www.osti.gov/servlets/purl/1095275.
@article{osti_1095275,
title = {Scientific Data Management Center for Enabling Technologies},
author = {Vouk, Mladen A.},
abstractNote = {Managing scientific data has been identified by the scientific community as one of the most important emerging needs because of the sheer volume and increasing complexity of data being collected. Effectively generating, managing, and analyzing this information requires a comprehensive, end-to-end approach to data management that encompasses all of the stages from the initial data acquisition to the final analysis of the data. Fortunately, the data management problems encountered by most scientific domains are common enough to be addressed through shared technology solutions. Based on community input, we have identified three significant requirements. First, more efficient access to storage systems is needed. In particular, parallel file system and I/O system improvements are needed to write and read large volumes of data without slowing a simulation, analysis, or visualization engine. These processes are complicated by the fact that scientific data are structured differently for specific application domains, and are stored in specialized file formats. Second, scientists require technologies to facilitate better understanding of their data, in particular the ability to effectively perform complex data analysis and searches over extremely large data sets. Specialized feature discovery and statistical analysis techniques are needed before the data can be understood or visualized. Furthermore, interactive analysis requires techniques for efficiently selecting subsets of the data. Finally, generating the data, collecting and storing the results, keeping track of data provenance, data post-processing, and analysis of results is a tedious, fragmented process. Tools for automation of this process in a robust, tractable, and recoverable fashion are required to enhance scientific exploration. The SDM center was established under the SciDAC program to address these issues. The SciDAC-1 Scientific Data Management (SDM) Center succeeded in bringing an initial set of advanced data management technologies to DOE application scientists in astrophysics, climate, fusion, and biology. Equally important, it established collaborations with these scientists to better understand their science as well as their forthcoming data management and data analytics challenges. Building on our early successes, we have greatly enhanced, robustified, and deployed our technology to these communities. In some cases, we identified new needs that have been addressed in order to simplify the use of our technology by scientists. This report summarizes our work so far in SciDAC-2. Our approach is to employ an evolutionary development and deployment process: from research through prototypes to deployment and infrastructure. Accordingly, we have organized our activities in three layers that abstract the end-to-end data flow described above. We labeled the layers (from bottom to top): a) Storage Efficient Access (SEA), b) Data Mining and Analysis (DMA), c) Scientific Process Automation (SPA). The SEA layer is immediately on top of hardware, operating systems, file systems, and mass storage systems, and provides parallel data access technology, and transparent access to archival storage. The DMA layer, which builds on the functionality of the SEA layer, consists of indexing, feature identification, and parallel statistical analysis technology. The SPA layer, which is on top of the DMA layer, provides the ability to compose scientific workflows from the components in the DMA layer as well as application specific modules. NCSU work performed under this contract was primarily at the SPA layer.},
doi = {10.2172/1095275},
journal = {},
number = ,
volume = ,
place = {United States},
year = {2013},
month = {1}
}