The advancement of machine learning techniques and the heterogeneous architectures of most current supercomputers are propelling the demand for large multiscale simulations that can automatically and autonomously couple diverse components and map them to relevant resources to solve complex problems at multiple scales. Nevertheless, despite the recent progress in workflow technologies, current capabilities are limited to coupling two scales. In the first-ever demonstration of using three scales of resolution, we present a scalable and generalizable framework that couples pairs of models using machine learning and in situ feedback. We expand upon the massively parallel Multiscale Machine-Learned Modeling Infrastructure (MuMMI), a recent, award-winning workflow, and generalize the framework beyond its original design. We discuss the challenges and learnings in executing a massive multiscale simulation campaign that utilized over 600,000 node hours on Summit and achieved more than 98% GPU occupancy for more than 83% of the time. We present innovations to enable several orders of magnitude scaling, including simultaneously coordinating 24,000 jobs, and managing several TBs of new data per day and over a billion files in total. Finally, we describe the generalizability of our framework and, with an upcoming open-source release, discuss how the presented framework may be used for new applications.
Bhatia, Harsh, et al. "Generalizable coordination of large multiscale workflows: challenges and learnings at scale." Nov. 2021. https://doi.org/10.1145/3458817.3476210
Bhatia, Harsh, Di Natale, Francesco, Moon, Joseph, et al., "Generalizable coordination of large multiscale workflows: challenges and learnings at scale," (2021), https://doi.org/10.1145/3458817.3476210
@conference{osti_1842626,
author = {Bhatia, Harsh and Di Natale, Francesco and Moon, Joseph and Zhang, Xiaohua and Chavez, Joseph and Aydin, Fikret and Stanley, Christopher and Oppelstrup, Tomas and Neale, Christopher and Kokkila Schumacher, Sara and others},
title = {Generalizable coordination of large multiscale workflows: challenges and learnings at scale},
annote = {The advancement of machine learning techniques and the heterogeneous architectures of most current supercomputers are propelling the demand for large multiscale simulations that can automatically and autonomously couple diverse components and map them to relevant resources to solve complex problems at multiple scales. Nevertheless, despite the recent progress in workflow technologies, current capabilities are limited to coupling two scales. In the first-ever demonstration of using three scales of resolution, we present a scalable and generalizable framework that couples pairs of models using machine learning and in situ feedback. We expand upon the massively parallel Multiscale Machine-Learned Modeling Infrastructure (MuMMI), a recent, award-winning workflow, and generalize the framework beyond its original design. We discuss the challenges and learnings in executing a massive multiscale simulation campaign that utilized over 600,000 node hours on Summit and achieved more than 98% GPU occupancy for more than 83% of the time. We present innovations to enable several orders of magnitude scaling, including simultaneously coordinating 24,000 jobs, and managing several TBs of new data per day and over a billion files in total. Finally, we describe the generalizability of our framework and, with an upcoming open-source release, discuss how the presented framework may be used for new applications.},
doi = {10.1145/3458817.3476210},
url = {https://www.osti.gov/biblio/1842626},
place = {United States},
organization = {Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)},
year = {2021},
month = {11}}
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1842626
Resource Relation:
Conference: Supercomputing 21: International Conference for High Performance Computing, Networking, Storage and Analysis, St. Louis, Missouri, United States of America, 11/14/2021 to 11/19/2021