Performance analysis tool for HPC and big data applications on scientific clusters

Yoo, W; Koo, M; Cao, Y; Sim, A; Nugent, P; Wu, K

doi:10.1007/978-3-319-33742-5_7

Book · Fri Jan 01 04:00:00 EST 2016

Big data is prevalent in HPC. Many HPC projects rely on complex workflows to analyze terabytes or petabytes of data. These workflows often run on thousands of CPU cores and perform simultaneous data accesses, data movements, and computation. Analyzing their performance is challenging because it involves terabytes or petabytes of workflow data or execution measurements, collected from complex workflows that span a large number of nodes and multiple parallel task executions. To help identify performance bottlenecks and debug performance issues in large-scale scientific applications and scientific clusters, we have developed a performance analysis framework using state-of-the-art open-source big data processing tools. Our tool can ingest system logs and application performance measurements, extract key performance features, and apply sophisticated statistical tools and data mining methods to the performance data. It uses an efficient data processing engine so that users can interactively analyze large volumes of different types of logs and measurements. To illustrate the functionality of the framework, we conduct case studies on the workflows from an astronomy project known as the Palomar Transient Factory (PTF) and on the job logs from a genome analysis scientific cluster.

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)

DOE Contract Number: AC02-05CH11231

OSTI ID: 1828854

Country of Publication: United States

Language: English
