Toward designing effective exascale scientific computing workflows: experiences and best practices
Many fields within scientific computing have embraced advances in big-data analysis and machine learning, which often require the deployment of large, distributed, and complicated workflows that may combine training neural networks, running simulations, performing inference, and executing database queries and data analysis in asynchronous, parallel, and pipelined execution frameworks. This shift has brought into focus the need for scalable, efficient workflow management solutions with reproducibility, error and provenance handling, traceability, and checkpoint-restart capabilities, among other requirements. Here, we discuss challenges and best practices for deploying exascale-generation computational science workflows on resources at the Oak Ridge Leadership Computing Facility (OLCF). We present our experiences with large-scale deployment of distributed workflows on the Summit supercomputer, including workflows for bioinformatics and computational biophysics, materials science, and deep learning model optimization. We also describe problems encountered, and solutions developed, while working within a Python-centric software base on traditional HPC systems, and we discuss the steps that will be required before the convergence of HPC, AI, and data science can be fully realized. Our results point to a wealth of exciting new possibilities for harnessing this convergence to tackle new scientific challenges.
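As a rough illustration of the kind of asynchronous, pipelined composition the abstract describes (simulation feeding inference feeding analysis), the following minimal Python sketch uses only the standard library. The stage functions and their names are hypothetical placeholders, not code from the paper or from any OLCF workflow system.

```python
# Hypothetical sketch of an asynchronous, pipelined workflow:
# simulation -> inference -> analysis, with stages overlapped
# via a process pool. All functions are illustrative stand-ins.
from concurrent.futures import ProcessPoolExecutor, as_completed


def run_simulation(sample_id: int) -> dict:
    """Stand-in for an expensive simulation task."""
    return {"sample": sample_id, "trajectory": [sample_id * 0.1] * 4}


def run_inference(sim_result: dict) -> dict:
    """Stand-in for scoring a simulation result with a trained model."""
    return {"sample": sim_result["sample"], "score": sum(sim_result["trajectory"])}


def main() -> None:
    with ProcessPoolExecutor(max_workers=4) as pool:
        # Stage 1: launch all simulations asynchronously.
        sim_futures = [pool.submit(run_simulation, i) for i in range(8)]

        # Stage 2: as each simulation finishes, pipeline it into inference
        # without waiting for the whole batch to complete.
        inf_futures = [
            pool.submit(run_inference, f.result()) for f in as_completed(sim_futures)
        ]

        # Stage 3: collect and analyze results (here, a simple reduction).
        best = max(
            (f.result() for f in as_completed(inf_futures)),
            key=lambda r: r["score"],
        )
        print("best sample:", best)


if __name__ == "__main__":
    main()
```

In a production setting, a workflow manager would typically replace the process pool with scheduler-aware task launching, add provenance tracking, and checkpoint intermediate results; this sketch only shows the overlap of stages that such systems orchestrate at scale.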
- Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-00OR22725
- OSTI ID: 1928954
- Country of Publication: United States
- Language: English