OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Scalable Metadata Management for a Large Multi-Source Seismic Data Repository

Abstract

In this work, we implemented the key metadata management components of a scalable seismic data ingestion framework to address limitations in our existing system and to position it for anticipated growth in volume and complexity. We began the effort with an assessment of open-source data flow tools from the Hadoop ecosystem. We then began constructing a layered architecture specifically designed to address many of the scalability and data quality issues we experience with our current pipeline. This included implementing basic functionality in each of the layers, such as establishing a data lake, designing a unified metadata schema, tracking provenance, and calculating data quality metrics. Our original intent was to test and validate the new ingestion framework with data from a large-scale field deployment in a temporary network. This delivered somewhat unsatisfying results, since the new system immediately identified fatal flaws in the data relatively early in the pipeline. Although this was a correct result, it did not allow us to sufficiently exercise the whole framework. We then widened our scope to process all available metadata from over a dozen online seismic data sources to further test the implementation and validate the design. This experiment also uncovered a higher-than-expected frequency of certain types of metadata issues, which challenged us to further tune our data management strategy to handle them. Our result from this project is a greatly improved understanding of real-world data issues, a validated design, and prototype implementations of major components of an eventual production framework. This work forms the basis of future development for the Geophysical Monitoring Program data pipeline, a critical asset supporting multiple programs. It also positions us well to deliver valuable metadata management expertise to our sponsors, and has already resulted in an NNSA Office of Defense Nuclear Nonproliferation commitment to a multi-year project for follow-on work.
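The abstract does not give implementation details, but the unified metadata schema, provenance tracking, and data quality metrics it mentions can be illustrated with a minimal sketch. The Python below is a hypothetical example, not code from the report: the ChannelEpoch record, its fields, and the flag_overlapping_epochs check are assumptions chosen to show how a normalized channel-epoch schema might carry provenance (source, retrieval time) and accumulate quality flags, such as overlapping epochs reported by different data centers.

from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ChannelEpoch:
    """One station-channel metadata epoch, normalized from a source format (illustrative only)."""
    network: str
    station: str
    channel: str
    start_time: datetime
    end_time: Optional[datetime]      # None means the epoch is still open
    latitude: float
    longitude: float
    sample_rate: float
    source: str                       # provenance: which data center supplied this record
    retrieved_at: datetime            # provenance: when the record was ingested
    quality_flags: List[str] = field(default_factory=list)


def flag_overlapping_epochs(epochs: List[ChannelEpoch]) -> None:
    """Flag epochs of the same channel whose time ranges overlap.

    Conflicting epochs from different sources are one hypothetical example of the
    kinds of multi-source metadata issues such a framework would need to detect.
    """
    epochs.sort(key=lambda e: (e.network, e.station, e.channel, e.start_time))
    for prev, cur in zip(epochs, epochs[1:]):
        same_channel = (prev.network, prev.station, prev.channel) == (
            cur.network, cur.station, cur.channel)
        prev_end = prev.end_time or datetime.max
        if same_channel and cur.start_time < prev_end:
            cur.quality_flags.append(f"overlaps epoch from {prev.source}")


if __name__ == "__main__":
    a = ChannelEpoch("IU", "ANMO", "BHZ", datetime(2000, 1, 1), datetime(2010, 1, 1),
                     34.95, -106.46, 40.0, source="center_a",
                     retrieved_at=datetime(2017, 4, 11))
    b = ChannelEpoch("IU", "ANMO", "BHZ", datetime(2009, 6, 1), None,
                     34.95, -106.46, 40.0, source="center_b",
                     retrieved_at=datetime(2017, 4, 11))
    records = [a, b]
    flag_overlapping_epochs(records)
    print([r.quality_flags for r in records])

In a production pipeline this kind of check would run inside one of the ingestion layers described above; it is kept to a single function here only for clarity.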

Authors:
 Gaylord, J. M. [1]; Dodge, D. A. [1]; Magana-Zook, S. A. [1]; Barno, J. G. [1]; Knapp, D. R. [1]
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Publication Date:
2017-04-11
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1357348
Report Number(s):
LLNL-TR-729885
DOE Contract Number:
AC52-07NA27344
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
58 GEOSCIENCES; 97 MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE

Citation Formats

Gaylord, J. M., Dodge, D. A., Magana-Zook, S. A., Barno, J. G., and Knapp, D. R. Scalable Metadata Management for a Large Multi-Source Seismic Data Repository. United States: N. p., 2017. Web. doi:10.2172/1357348.
Gaylord, J. M., Dodge, D. A., Magana-Zook, S. A., Barno, J. G., & Knapp, D. R. Scalable Metadata Management for a Large Multi-Source Seismic Data Repository. United States. doi:10.2172/1357348.
Gaylord, J. M., Dodge, D. A., Magana-Zook, S. A., Barno, J. G., and Knapp, D. R. 2017. "Scalable Metadata Management for a Large Multi-Source Seismic Data Repository". United States. doi:10.2172/1357348. https://www.osti.gov/servlets/purl/1357348.
@techreport{osti_1357348,
title = {Scalable Metadata Management for a Large Multi-Source Seismic Data Repository},
author = {Gaylord, J. M. and Dodge, D. A. and Magana-Zook, S. A. and Barno, J. G. and Knapp, D. R.},
doi = {10.2172/1357348},
institution = {Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)},
number = {LLNL-TR-729885},
address = {United States},
year = {2017},
month = {apr}
}

Technical Report:
https://www.osti.gov/servlets/purl/1357348

Similar Records:
  • The purpose of the SPE is to develop a more physics-based model for nuclear explosion identification, improving understanding of how S-waves develop from explosion sources, in order to enhance nuclear test ban treaty monitoring.
  • The purpose of the SDAV institute is to provide tools and expertise in scientific data management, analysis, and visualization to DOE’s application scientists. Our goal is to actively work with application teams to assist them in achieving breakthrough science, and to provide technical solutions in the data management, analysis, and visualization regimes that are broadly used by the computational science community. Over the last 5 years, members of our institute worked directly with application scientists and DOE leadership-class facilities to assist them by applying the best tools and technologies at our disposal. We also enhanced our tools based on input from scientists on their needs. Many of the applications we have been working with are based on connections with scientists established in previous years. However, we contacted additional scientists through our outreach activities, as well as engaging application teams running on leading DOE computing systems. Our approach is to employ an evolutionary development and deployment process: first considering the application of existing tools, followed by the customization necessary for each particular application, and then the deployment in real frameworks and infrastructures. The institute is organized into three areas, each with area leaders, who keep track of progress, engagement of application scientists, and results. The areas are: (1) Data Management, (2) Data Analysis, and (3) Visualization. Kitware has been involved in the Visualization area. This report covers Kitware’s contributions over the last 5 years (February 2012 – February 2017). For details on the work performed by the SDAV institute as a whole, please see the SDAV final report.
  • This report is the entire final report for the SciDAC project, authored by the whole team. OSU is one of the contributors to the report. This report is organized into sections and subsections, each covering an area of development and deployment of technologies applied to scientific applications of interest to the Department of Energy. Each sub-section includes: 1) a summary description of the research, development, and deployment carried out, the results, and the extent to which the stated project objectives were met; 2) significant results, including major findings, developments, or conclusions; 3) products, such as publications and presentations, software developed, project website(s), technologies or techniques, inventions, awards, etc.; and 4) conclusions of the projects and future directions for research, development, and deployment in this technology area.
  • Our ability to generate data far outstrips our ability to explore and understand it. The true value of this data lies not in its final size or complexity, but rather in our ability to exploit the data to achieve scientific goals. The data generated by programs such as ASCI have such a large scale that it is impractical to manually analyze, explore, and understand it. As a result, useful information is overlooked, and the potential benefits of increased computational and data gathering capabilities are only partially realized. The difficulties that will be faced by ASCI applications in the near future are foreshadowed by the challenges currently facing astrophysicists in making full use of the data they have collected over the years. For example, among other difficulties, astrophysicists have expressed concern that the sheer size of their data restricts them to looking at very small, narrow portions at any one time. This narrow focus has resulted in the loss of "serendipitous" discoveries which have been so vital to progress in the area in the past. To solve this problem, a new generation of computational tools and techniques is needed to help automate the exploration and management of large scientific data. This whitepaper proposes applying and extending ideas from the area of data mining, in particular pattern recognition, to improve the way in which scientists interact with large, multi-dimensional, time-varying data.