Large-scale seismic waveform quality metric calculation using Hadoop

Magana-Zook, Steven; Gaylord, Jessie M.; Knapp, Douglas R.; Dodge, Douglas A.; Ruppert, Stanley D.

doi:10.1016/j.cageo.2016.05.012

Journal Article · May 27, 2016 · Computers and Geosciences

DOI: https://doi.org/10.1016/j.cageo.2016.05.012 · OSTI ID: 1262167

Magana-Zook, Steven [1]; Gaylord, Jessie M. [1]; Knapp, Douglas R. [1]; Dodge, Douglas A. [1]; Ruppert, Stanley D. [1]

[1] Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

In this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes (TB) of broadband waveform data, of which 5.1 TB were processed with the traditional architecture and the full 43 TB were processed using MapReduce and Spark. The traditional implementation reached its maximum performance of ~0.56 TB per hour using all 5 of its nodes. We noted that I/O dominated processing and that I/O performance deteriorated with the addition of the fifth node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. We conducted these experiments multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation: at a dataset size of 5.1 TB, both were about 15 times faster. Furthermore, our performance models predict that for a dataset of 350 TB, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster, because its I/O performance cannot be made to scale.
Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still changing rapidly, requires a high degree of investment in personnel, and will likely require significant changes in other parts of our infrastructure. Nevertheless, we anticipate that as the technology matures and third-party tool vendors make it easier to manage and operate clusters, Hadoop (or a successor) will play a large role in our seismic data processing.
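
To put the quoted figures in terms of wall-clock time, the arithmetic can be sketched as follows. This is only a rough illustration, not the authors' performance model: it assumes the reference implementation sustains its reported peak rate of ~0.56 TB per hour at every dataset size, which the abstract does not state, and the function names are hypothetical.

```python
# Back-of-the-envelope arithmetic using only figures quoted in the abstract.
# ASSUMPTION: the reference implementation runs at its reported peak rate
# (~0.56 TB/hour) for any dataset size. The abstract reports this rate only
# as a maximum, so the resulting times are optimistic for the reference.

REFERENCE_TB_PER_HOUR = 0.56  # peak rate on all 5 nodes, from the abstract


def reference_hours(dataset_tb: float) -> float:
    """Hours the reference implementation would need at its peak rate."""
    return dataset_tb / REFERENCE_TB_PER_HOUR


def implied_hours(dataset_tb: float, speedup: float) -> float:
    """Hours implied by a quoted speedup factor relative to the reference."""
    return reference_hours(dataset_tb) / speedup


if __name__ == "__main__":
    cases = [
        ("5.1 TB, measured, ~15x", 5.1, 15.0),
        ("350 TB, modeled, ~265x on 100 nodes", 350.0, 265.0),
    ]
    for label, size_tb, speedup in cases:
        ref = reference_hours(size_tb)
        fast = implied_hours(size_tb, speedup)
        print(f"{label}: reference ~{ref:.1f} h, Hadoop ~{fast:.2f} h")
```

Under that assumption, the 5.1 TB subset corresponds to roughly 9 hours on the reference implementation, and the 350 TB case to roughly 625 hours, so the quoted speedups imply run times on the order of an hour or a few hours on the Hadoop cluster.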

Research Organization: Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)

Sponsoring Organization: USDOE

Grant/Contract Number: AC52-07NA27344

OSTI ID: 1262167

Alternate ID(s): OSTI ID: 1325365

Report Number(s): LLNL-JRNL-683307

Journal Information: Computers and Geosciences, Vol. 94, Issue C; ISSN 0098-3004

Publisher: Elsevier

Country of Publication: United States

Language: English

Citation Metrics:

Cited by: 15 works

Citation information provided by Web of Science

Related Subjects

58 GEOSCIENCES
97 MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE
