Title: Performance analysis tool for HPC and big data applications on scientific clusters

Book

Big data is prevalent in HPC. Many HPC projects rely on complex workflows to analyze terabytes or petabytes of data. These workflows often run over thousands of CPU cores and perform simultaneous data accesses, data movements, and computation. Analyzing their performance is challenging: the workflow data and execution measurements can themselves reach terabytes or petabytes, and they come from complex workflows that span a large number of nodes and many parallel task executions. To help identify performance bottlenecks and debug performance issues in large-scale scientific applications and scientific clusters, we have developed a performance analysis framework using state-of-the-art open-source big data processing tools. Our tool can ingest system logs and application performance measurements, extract key performance features, and apply sophisticated statistical tools and data mining methods to the performance data. It uses an efficient data processing engine that lets users interactively analyze large volumes of different types of logs and measurements. To illustrate the functionality of the framework, we conduct case studies on the workflows from an astronomy project known as the Palomar Transient Factory (PTF) and on the job logs from a genome analysis scientific cluster.

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
AC02-05CH11231
OSTI ID:
1828854
Resource Relation:
Related Information: Book Title: Conquering Big Data with High Performance Computing
Country of Publication:
United States
Language:
English

Publication Date:
2016-09-17