DOE PAGES
U.S. Department of Energy
Office of Scientific and Technical Information

Title: Large-scale seismic waveform quality metric calculation using Hadoop

Abstract

In this work, we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data, of which 5.1 TB were processed with the traditional architecture and the full 43 TB were processed using MapReduce and Spark. Maximum performance of ~0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/O performance deteriorated with the addition of the fifth node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. We conducted these experiments multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation. At a dataset size of 5.1 terabytes, both Spark and MapReduce were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster, because its I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still rapidly changing, requires a high degree of investment in personnel, and will likely require significant changes in other parts of our infrastructure. Nevertheless, we anticipate that as the technology matures and third-party tool vendors make it easier to manage and operate clusters, Hadoop (or a successor) will play a large role in our seismic data processing.
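The per-file metric computation the abstract describes maps naturally onto Spark's file-level parallelism. The sketch below is a minimal illustration, not the authors' implementation: the HDFS paths, the parse_waveform() helper, and the specific metrics (RMS, plus zero-sample fraction as a crude dropout indicator) are assumptions standing in for a real seismic-format reader and the paper's actual quality metrics.

    # Minimal PySpark sketch of a per-file waveform quality-metric job.
    # Illustrative only: paths, parse_waveform(), and the metric choices
    # are assumptions, not the implementation from the paper.
    import numpy as np
    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.appName("waveform-quality-metrics").getOrCreate()

    def parse_waveform(raw_bytes):
        # Hypothetical decoder: treat the file as raw little-endian float32
        # samples. A real job would use a miniSEED/SAC reader here instead.
        return np.frombuffer(raw_bytes, dtype="<f4")

    def quality_metrics(path, raw_bytes):
        samples = parse_waveform(raw_bytes).astype("f8")
        return Row(
            file=path,
            nsamples=int(samples.size),
            rms=float(np.sqrt(np.mean(samples ** 2))),
            zero_fraction=float(np.mean(samples == 0.0)),  # crude dropout indicator
        )

    # binaryFiles yields one (path, contents) pair per file, so the metric
    # computation parallelizes across the cluster with no shared state.
    metrics = (
        spark.sparkContext
        .binaryFiles("hdfs:///data/waveforms/*")        # assumed input layout
        .filter(lambda kv: len(kv[1]) >= 4)             # skip empty files
        .map(lambda kv: quality_metrics(kv[0], kv[1]))
    )
    spark.createDataFrame(metrics).write.parquet("hdfs:///data/waveform_metrics")

An equivalent Hadoop MapReduce job would emit the same record from a mapper with no reduce step; the work is embarrassingly parallel either way, which is consistent with the abstract's observation that throughput was governed largely by I/O.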
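The performance models mentioned in the abstract can be illustrated with a simple least-squares fit of runtime against dataset size, built from repeated runs on data subsets. In the sketch below the measurement points are placeholders, not the paper's data; only the ~0.56 TB/hour baseline throughput is taken from the abstract.

    # Sketch of the scaling-model step: fit runtime vs. dataset size from
    # runs on subsets, then extrapolate. Measurements are placeholders.
    import numpy as np

    sizes_tb  = np.array([1.0, 5.1, 10.0, 20.0, 43.0])   # subset sizes (TB)
    runtime_h = np.array([0.2, 0.9, 1.7, 3.4, 7.1])      # illustrative runtimes (h)

    # Least-squares linear model: runtime ~ a * size + b
    a, b = np.polyfit(sizes_tb, runtime_h, deg=1)

    def predicted_runtime_h(size_tb):
        return a * size_tb + b

    # The traditional implementation peaked at ~0.56 TB/hour (per the
    # abstract), so its projected runtime scales linearly with dataset size
    # and gains nothing from more nodes.
    baseline_h = 350.0 / 0.56
    print(f"cluster runtime at 350 TB: {predicted_runtime_h(350.0):.1f} h")
    print(f"speedup vs. baseline:      {baseline_h / predicted_runtime_h(350.0):.0f}x")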

Authors:
 Magana-Zook, Steven [1]; Gaylord, Jessie M. [1]; Knapp, Douglas R. [1]; Dodge, Douglas A. [1]; Ruppert, Stanley D. [1]
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Publication Date:
May 27, 2016
Research Org.:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1262167
Alternate Identifier(s):
OSTI ID: 1325365
Report Number(s):
LLNL-JRNL-683307
Journal ID: ISSN 0098-3004
Grant/Contract Number:  
AC52-07NA27344
Resource Type:
Accepted Manuscript
Journal Name:
Computers and Geosciences
Additional Journal Information:
Journal Volume: 94; Journal Issue: C; Journal ID: ISSN 0098-3004
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
58 GEOSCIENCES; 97 MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE

Citation Formats

Magana-Zook, Steven, Gaylord, Jessie M., Knapp, Douglas R., Dodge, Douglas A., and Ruppert, Stanley D. Large-scale seismic waveform quality metric calculation using Hadoop. United States: N. p., 2016. Web. doi:10.1016/j.cageo.2016.05.012.
Magana-Zook, Steven, Gaylord, Jessie M., Knapp, Douglas R., Dodge, Douglas A., & Ruppert, Stanley D. Large-scale seismic waveform quality metric calculation using Hadoop. United States. https://doi.org/10.1016/j.cageo.2016.05.012
Magana-Zook, Steven, Gaylord, Jessie M., Knapp, Douglas R., Dodge, Douglas A., and Ruppert, Stanley D. 2016. "Large-scale seismic waveform quality metric calculation using Hadoop". United States. https://doi.org/10.1016/j.cageo.2016.05.012. https://www.osti.gov/servlets/purl/1262167.
@article{osti_1262167,
title = {Large-scale seismic waveform quality metric calculation using Hadoop},
author = {Magana-Zook, Steven and Gaylord, Jessie M. and Knapp, Douglas R. and Dodge, Douglas A. and Ruppert, Stanley D.},
abstractNote = {In this work, we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data, of which 5.1 TB were processed with the traditional architecture and the full 43 TB were processed using MapReduce and Spark. Maximum performance of ~0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/O performance deteriorated with the addition of the fifth node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. We conducted these experiments multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation. At a dataset size of 5.1 terabytes, both Spark and MapReduce were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster, because its I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still rapidly changing, requires a high degree of investment in personnel, and will likely require significant changes in other parts of our infrastructure. Nevertheless, we anticipate that as the technology matures and third-party tool vendors make it easier to manage and operate clusters, Hadoop (or a successor) will play a large role in our seismic data processing.},
doi = {10.1016/j.cageo.2016.05.012},
journal = {Computers and Geosciences},
number = {C},
volume = {94},
place = {United States},
year = {2016},
month = {May}
}

Citation Metrics:
Cited by: 15 works
Citation information provided by
Web of Science
