Large-scale seismic waveform quality metric calculation using Hadoop
Abstract
In this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data, of which 5.1 TB were processed with the traditional architecture and the full 43 TB were processed using MapReduce and Spark. Maximum performance of ~0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/O performance deteriorated with the addition of the 5th node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. We conducted these experiments multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation: at a dataset size of 5.1 terabytes, both were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster, because the I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still rapidly changing, requires a high degree of investment in personnel, and will likely require significant changes in other parts of our infrastructure. Nevertheless, we anticipate that as the technology matures and third-party tool vendors make it easier to manage and operate clusters, Hadoop (or a successor) will play a large role in our seismic data processing.
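As a rough illustration of the scaling argument in the abstract (not the authors' actual performance model), the throughput figures can be turned into a small back-of-the-envelope calculation. The ~0.56 TB/h reference throughput and the ~15x speedup at 5.1 TB come from the abstract; the linear time-vs-size model itself is an assumption made here for illustration.

```python
# Hypothetical sketch of the throughput comparison described in the
# abstract. REF_THROUGHPUT_TB_PER_H and the 15x speedup are reported
# values; the linear model is an illustrative assumption.

REF_THROUGHPUT_TB_PER_H = 0.56  # reference impl is I/O-bound and does not scale


def reference_hours(dataset_tb):
    """Predicted wall time for the traditional (reference) implementation."""
    return dataset_tb / REF_THROUGHPUT_TB_PER_H


def hadoop_hours(dataset_tb, speedup):
    """Predicted wall time given a measured or modeled speedup factor."""
    return reference_hours(dataset_tb) / speedup


# At 5.1 TB the paper reports a ~15x speedup for Spark/MapReduce:
print(round(reference_hours(5.1), 1))   # ~9.1 h for the reference implementation
print(round(hadoop_hours(5.1, 15), 2))  # ~0.61 h for Spark/MapReduce
```

Under this simple model, the reference implementation's flat throughput is exactly why the authors expect no benefit from giving it more nodes: time grows linearly with dataset size no matter how large the cluster is.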
- Authors:
- Magana-Zook, Steven; Gaylord, Jessie M.; Knapp, Douglas R.; Dodge, Douglas A.; Ruppert, Stanley D.
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
- Publication Date:
- May 27, 2016
- Research Org.:
- Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1262167
- Alternate Identifier(s):
- OSTI ID: 1325365
- Report Number(s):
- LLNL-JRNL-683307
- Journal ID: ISSN 0098-3004
- Grant/Contract Number:
- AC52-07NA27344
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Computers and Geosciences
- Additional Journal Information:
- Journal Volume: 94; Journal Issue: C; Journal ID: ISSN 0098-3004
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 58 GEOSCIENCES; 97 MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE
Citation Formats
Magana-Zook, Steven, Gaylord, Jessie M., Knapp, Douglas R., Dodge, Douglas A., and Ruppert, Stanley D. Large-scale seismic waveform quality metric calculation using Hadoop. United States: N. p., 2016. Web. doi:10.1016/j.cageo.2016.05.012. https://www.osti.gov/servlets/purl/1262167.
@article{osti_1262167,
title = {Large-scale seismic waveform quality metric calculation using Hadoop},
author = {Magana-Zook, Steven and Gaylord, Jessie M. and Knapp, Douglas R. and Dodge, Douglas A. and Ruppert, Stanley D.},
abstractNote = {In this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data, of which 5.1 TB were processed with the traditional architecture and the full 43 TB were processed using MapReduce and Spark. Maximum performance of ~0.56 terabytes per hour was achieved using all 5 nodes of the traditional implementation. We noted that I/O dominated processing, and that I/O performance deteriorated with the addition of the 5th node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. We conducted these experiments multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation: at a dataset size of 5.1 terabytes, both were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster, because the I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still rapidly changing, requires a high degree of investment in personnel, and will likely require significant changes in other parts of our infrastructure. Nevertheless, we anticipate that as the technology matures and third-party tool vendors make it easier to manage and operate clusters, Hadoop (or a successor) will play a large role in our seismic data processing.},
doi = {10.1016/j.cageo.2016.05.012},
journal = {Computers and Geosciences},
number = {C},
volume = {94},
place = {United States},
year = {2016},
month = {may}
}
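The abstract does not spell out the quality metrics themselves, but the map/reduce decomposition it relies on can be sketched in plain Python: map each waveform segment to per-segment partial statistics, then reduce them with an associative combiner into a single summary. The specific metrics below (RMS amplitude and a zero-sample dropout fraction) are illustrative assumptions, not the paper's metric set, and the toy data stands in for real broadband waveforms.

```python
# Illustrative only: the paper computes seismic waveform quality metrics
# with Hadoop MapReduce/Spark; this pure-Python sketch shows the same
# map/reduce shape on toy data. The metrics here are assumptions.
from functools import reduce
import math


def map_segment(samples):
    """Map step: per-segment partial sums for RMS and dropout counting."""
    return {
        "n": len(samples),
        "sum_sq": sum(s * s for s in samples),
        "zeros": sum(1 for s in samples if s == 0.0),  # crude dropout proxy
    }


def combine(a, b):
    """Reduce step: merge two partial results (associative and commutative,
    so it can run in any order across a cluster)."""
    return {k: a[k] + b[k] for k in a}


segments = [[0.0, 1.0, -1.0, 2.0], [0.0, 0.0, 3.0, -3.0]]  # toy waveform chunks
totals = reduce(combine, map(map_segment, segments))
rms = math.sqrt(totals["sum_sq"] / totals["n"])
dropout_frac = totals["zeros"] / totals["n"]
print(round(rms, 3), round(dropout_frac, 3))  # prints: 1.732 0.375
```

Because the combiner is associative, the same map and reduce functions could be handed to a MapReduce job or a Spark `map`/`reduce` pair over files in HDFS without changing their logic; only the driver code differs.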