OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Scalable Metadata Management for a Large Multi-Source Seismic Data Repository

Abstract

In this work, we implemented the key metadata management components of a scalable seismic data ingestion framework to address limitations in our existing system and to position it for anticipated growth in volume and complexity. We began the effort with an assessment of open-source data flow tools from the Hadoop ecosystem. We then began constructing a layered architecture specifically designed to address many of the scalability and data quality issues we experience with our current pipeline. This included implementing basic functionality in each of the layers, such as establishing a data lake, designing a unified metadata schema, tracking provenance, and calculating data quality metrics. Our original intent was to test and validate the new ingestion framework with data from a large-scale field deployment in a temporary network. This delivered somewhat unsatisfying results, since the new system immediately identified fatal flaws in the data early in the pipeline. Although this was a correct result, it did not allow us to sufficiently exercise the whole framework. We therefore widened our scope to process all available metadata from over a dozen online seismic data sources to further test the implementation and validate the design. This experiment also uncovered a higher-than-expected frequency of certain types of metadata issues, which challenged us to further tune our data management strategy to handle them. The result of this project is a greatly improved understanding of real-world data issues, a validated design, and prototype implementations of the major components of an eventual production framework. This work forms the basis of future development for the Geophysical Monitoring Program data pipeline, which is a critical asset supporting multiple programs. It also positions us well to deliver valuable metadata management expertise to our sponsors, and has already resulted in an NNSA Office of Defense Nuclear Nonproliferation commitment to a multi-year project for follow-on work.
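
The report describes the architecture only at this summary level and contains no code. Purely as an illustrative sketch, the Python below shows one way a unified station-level metadata record with attached provenance and a simple completeness-based quality metric might be modeled. The class names (StationMetadata, ProvenanceRecord), the field list, the completeness_score function, and the example source are hypothetical and are not drawn from the report or any specific library.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ProvenanceRecord:
    """Records where a metadata item came from and when it was ingested."""
    source_name: str        # e.g. an online data center or a temporary deployment
    source_format: str      # e.g. "StationXML" or "dataless SEED"
    retrieved_at: datetime


@dataclass
class StationMetadata:
    """A unified station-level metadata record (hypothetical schema)."""
    network: str
    station: str
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    elevation_m: Optional[float] = None
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None
    provenance: List[ProvenanceRecord] = field(default_factory=list)


def completeness_score(record: StationMetadata) -> float:
    """Toy data-quality metric: fraction of core descriptive fields populated."""
    core = [record.latitude, record.longitude, record.elevation_m, record.start_time]
    return sum(v is not None for v in core) / len(core)


# Example: a record assembled from a single hypothetical online source.
rec = StationMetadata(
    network="IU",
    station="ANMO",
    latitude=34.946,
    longitude=-106.457,
    provenance=[ProvenanceRecord("example-data-center", "StationXML",
                                 datetime.now(timezone.utc))],
)
print(f"{rec.network}.{rec.station} completeness = {completeness_score(rec):.2f}")

A per-record provenance list and a scalar quality score of this kind are one plausible way to support the provenance tracking and data quality metrics mentioned in the abstract; the actual production schema is not specified in this record.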

Authors:
 Gaylord, J. M. [1]; Dodge, D. A. [1]; Magana-Zook, S. A. [1]; Barno, J. G. [1]; Knapp, D. R. [1]
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Publication Date:
April 11, 2017
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1357348
Report Number(s):
LLNL-TR-729885
DOE Contract Number:  
AC52-07NA27344
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
58 GEOSCIENCES; 97 MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE

Citation Formats

Gaylord, J. M., Dodge, D. A., Magana-Zook, S. A., Barno, J. G., and Knapp, D. R. Scalable Metadata Management for a Large Multi-Source Seismic Data Repository. United States: N. p., 2017. Web. doi:10.2172/1357348.
Gaylord, J. M., Dodge, D. A., Magana-Zook, S. A., Barno, J. G., & Knapp, D. R. Scalable Metadata Management for a Large Multi-Source Seismic Data Repository. United States. doi:10.2172/1357348.
Gaylord, J. M., Dodge, D. A., Magana-Zook, S. A., Barno, J. G., and Knapp, D. R. 2017. "Scalable Metadata Management for a Large Multi-Source Seismic Data Repository". United States. doi:10.2172/1357348. https://www.osti.gov/servlets/purl/1357348.
@article{osti_1357348,
title = {Scalable Metadata Management for a Large Multi-Source Seismic Data Repository},
author = {Gaylord, J. M. and Dodge, D. A. and Magana-Zook, S. A. and Barno, J. G. and Knapp, D. R.},
abstractNote = {In this work, we implemented the key metadata management components of a scalable seismic data ingestion framework to address limitations in our existing system, and to position it for anticipated growth in volume and complexity. We began the effort with an assessment of open source data flow tools from the Hadoop ecosystem. We then began the construction of a layered architecture that is specifically designed to address many of the scalability and data quality issues we experience with our current pipeline. This included implementing basic functionality in each of the layers, such as establishing a data lake, designing a unified metadata schema, tracking provenance, and calculating data quality metrics. Our original intent was to test and validate the new ingestion framework with data from a large-scale field deployment in a temporary network. This delivered somewhat unsatisfying results, since the new system immediately identified fatal flaws in the data relatively early in the pipeline. Although this is a correct result it did not allow us to sufficiently exercise the whole framework. We then widened our scope to process all available metadata from over a dozen online seismic data sources to further test the implementation and validate the design. This experiment also uncovered a higher than expected frequency of certain types of metadata issues that challenged us to further tune our data management strategy to handle them. Our result from this project is a greatly improved understanding of real world data issues, a validated design, and prototype implementations of major components of an eventual production framework. This successfully forms the basis of future development for the Geophysical Monitoring Program data pipeline, which is a critical asset supporting multiple programs. It also positions us very well to deliver valuable metadata management expertise to our sponsors, and has already resulted in an NNSA Office of Defense Nuclear Nonproliferation commitment to a multi-year project for follow-on work.},
doi = {10.2172/1357348},
place = {United States},
year = {2017},
month = {4}
}
