Integrated End-to-end Performance Prediction and Diagnosis for Extreme Scientific Workflows
- Pacific Northwest National Laboratory (PNNL), Richland, WA (United States); University of California, Santa Cruz
- Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
- Brookhaven National Laboratory (BNL), Upton, NY (United States)
- Univ. of California, Santa Cruz, CA (United States)
- Univ. of California, Santa Cruz, CA (United States); Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
- San Diego Supercomputing Center, La Jolla, CA (United States)
This report details recent progress for the ASCR-funded project “Integrated End-to-end Performance Prediction and Diagnosis for Extreme Scientific Workflows”. We refer to the project as IPPD/2, reflecting the 2017 renewal under expanded scope and partners. In IPPD/2, we increased our research scope to include data motion. We are focusing on three major aspects: a) observe how data is generated, distributed, and used; b) analyze how data is (repeatedly) consumed, with a focus on both repeated patterns and anomalies; and c) explore how to optimize data motion. This new work on data motion will augment and complement IPPD/2’s research that focused on the computational aspects of tasks. We leverage and extend our existing tools and demonstrate our work on the Belle II workflow suite as well as on workflows from NSLS-II. The highlights of our work are as follows:
Provenance for Workflows: Provenance provides information that enables quality control, re-running of computational workflows, and reproduction of results. IPPD/2 has been building a scalable provenance management system that captures provenance from the high-level workflow through all relevant system levels in one integrated environment. Leveraging this work, our recent efforts have included using provenance as an enabling technique.
Workload Characterization: Leveraging provenance and analysis, we characterize data movement within network, storage, and memory over a variety of workloads. This characterization gives performance analysts and application developers an understanding of the range of behaviors that could be expected.
Performance Prediction for Workflows: The goal of modeling distributed workflows is to understand performance bottlenecks and enable more intelligent task scheduling to optimize selected metrics of interest (e.g., task throughput or output data rate). IPPD/2 has used both analytical and AI/ML methodologies for performance modeling.
Advanced Scheduling and Fault Modeling for Workflows: Scheduling large-scale scientific workflows on geographically distributed resources is a challenging problem. To improve workflow throughput, we combined novel scheduling algorithms with task predictions from performance modeling and fault modeling.
Dynamically Alleviating Bottlenecks in Workflows: Exploiting our provenance, analysis, and modeling efforts, we have explored and developed several techniques for dynamically detecting and alleviating bottlenecks in data movement. In particular, we have spent considerable effort demonstrating our techniques on production-like workflow configurations.
- Research Organization: Univ. of California, Santa Cruz, CA (United States)
- Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR). Scientific Discovery through Advanced Computing (SciDAC)
- DOE Contract Number: SC0018384
- OSTI ID: 1970012
- Report Number(s): DE--SC0018384-Final
- Country of Publication: United States
- Language: English
Similar Records
Integrated End-to-end Performance Prediction and Diagnosis for Extreme Scientific Workflows (IPPD) (Final Report)
Technical Report · Tue Nov 09 23:00:00 EST 2021 · OSTI ID:1830050
Guaranteed deadlines for hard real-time fault-tolerant distributed systems
Thesis/Dissertation · Sat Dec 31 23:00:00 EST 1988 · OSTI ID:6037957
Integrating prediction, provenance, and optimization into high energy workflows
Journal Article · Sun Oct 01 00:00:00 EDT 2017 · Journal of Physics. Conference Series · OSTI ID:1434869