Performance analysis tool for HPC and big data applications on scientific clusters
Big data is prevalent in high-performance computing (HPC). Many HPC projects rely on complex workflows to analyze terabytes or petabytes of data. These workflows often run on thousands of CPU cores and perform simultaneous data accesses, data movements, and computation. Analyzing their performance is challenging: the executions span many nodes and multiple parallel tasks, and the workflow data and measurement data they produce can themselves amount to terabytes or petabytes. To help identify performance bottlenecks and debug performance issues in large-scale scientific applications and scientific clusters, we have developed a performance analysis framework using state-of-the-art open-source big data processing tools. Our tool ingests system logs and application performance measurements to extract key performance features, and applies statistical tools and data mining methods to the performance data. It uses an efficient data processing engine that lets users interactively analyze large volumes of logs and measurements of different types. To illustrate the functionality of the framework, we conduct case studies on the workflows of an astronomy project, the Palomar Transient Factory (PTF), and on the job logs from a genome analysis scientific cluster.
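As a rough illustration of the kind of analysis the abstract describes (ingesting logs, extracting a performance feature, and flagging potential bottlenecks), the following is a minimal sketch only. The log format (`<epoch_seconds> <task_id> <START|END>`), the function names, and the median-based threshold are all assumptions made for this example; they are not taken from the tool itself, which is described as using big data processing tools and more elaborate statistical and data mining methods.

```python
import re
import statistics

# Hypothetical log line format (an assumption for this sketch):
#   "<epoch_seconds> <task_id> <START|END>"
LINE = re.compile(r"^(\d+(?:\.\d+)?)\s+(\S+)\s+(START|END)$")


def task_durations(lines):
    """Pair START/END events per task and return {task_id: duration_seconds}."""
    starts = {}
    durations = {}
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # skip lines that do not match the assumed format
        ts, task, event = float(m.group(1)), m.group(2), m.group(3)
        if event == "START":
            starts[task] = ts
        elif task in starts:
            durations[task] = ts - starts.pop(task)
    return durations


def slow_tasks(durations, factor=3.0):
    """Return task ids whose duration exceeds `factor` times the median.

    The median threshold is a simple stand-in for a real statistical method.
    """
    if not durations:
        return []
    med = statistics.median(durations.values())
    return sorted(t for t, d in durations.items() if d > factor * med)


sample = [
    "100 t1 START", "110 t1 END",
    "100 t2 START", "112 t2 END",
    "100 t3 START", "190 t3 END",
    "unparseable line",
]

durations = task_durations(sample)
outliers = slow_tasks(durations)
```

In this sketch, duration is the only extracted feature; a real analysis would likely combine several features (I/O volume, node, queue wait time) and run on a distributed engine rather than in a single process.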
- Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
- DOE Contract Number: AC02-05CH11231
- OSTI ID: 1828854
- Country of Publication: United States
- Language: English