Enabling discovery data science through cross-facility workflows
Abstract
Experimental and observational instruments for scientific research (such as light sources, genome sequencers, accelerators, telescopes and electron microscopes) increasingly require High Performance Computing (HPC) scale capabilities for data analysis and workflow processing. Next-generation instruments are being deployed with higher resolutions and faster data capture rates, creating a big data crunch that cannot be handled by modest institutional computing resources. Often these big data analysis pipelines also require near real-time computing and have higher resilience requirements than the simulation and modeling workloads more traditionally seen at HPC centers. While some facilities have enabled workflows to run at a single HPC facility, there is a growing need to integrate capabilities across HPC facilities to enable cross-facility workflows, either to provide resilience to an experiment, increase analysis throughput capabilities, or to better match a workflow to a particular architecture. In this paper we describe the barriers to executing complex data analysis workflows across HPC facilities and propose an architectural design pattern for enabling scientific discovery using cross-facility workflows that includes orchestration services, application programming interfaces (APIs), data access and co-scheduling.
- Authors:
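- Antypas, Katerina B.; Bard, Deborah; Blaschke, Johannes P.; Canon, Richard Shane; Enders, Bjoern; Shankar, Mallikarjun; Somnath, Suhas; Stansberry, Dale; Uram, Thomas; Wilkinson, Sean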
- Lawrence Berkeley National Laboratory (LBNL)
- National Energy Research Scientific Computing Center (NERSC), California
- Oak Ridge National Laboratory (ORNL)
- Argonne National Laboratory
- Publication Date:
- December 2021
- Research Org.:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- OSTI Identifier:
- 1847515
- DOE Contract Number:
- AC05-00OR22725
- Resource Type:
- Conference
- Resource Relation:
- Conference: The 3rd International Workshop on Big Data Tools, Methods, and Use Cases for Innovative Scientific Discovery (BTSD) 2021, Orlando, Florida, United States of America, December 15-18, 2021
- Country of Publication:
- United States
- Language:
- English
Citation Formats
Antypas, Katerina B., Bard, Deborah, Blaschke, Johannes P., Canon, Richard Shane, Enders, Bjoern, Shankar, Mallikarjun, Somnath, Suhas, Stansberry, Dale, Uram, Thomas, and Wilkinson, Sean. Enabling discovery data science through cross-facility workflows. United States: N. p., 2021. Web. doi:10.1109/BigData52589.2021.9671421.
Antypas, Katerina B., Bard, Deborah, Blaschke, Johannes P., Canon, Richard Shane, Enders, Bjoern, Shankar, Mallikarjun, Somnath, Suhas, Stansberry, Dale, Uram, Thomas, & Wilkinson, Sean. Enabling discovery data science through cross-facility workflows. United States. https://doi.org/10.1109/BigData52589.2021.9671421
Antypas, Katerina B., Bard, Deborah, Blaschke, Johannes P., Canon, Richard Shane, Enders, Bjoern, Shankar, Mallikarjun, Somnath, Suhas, Stansberry, Dale, Uram, Thomas, and Wilkinson, Sean. 2021. "Enabling discovery data science through cross-facility workflows". United States. https://doi.org/10.1109/BigData52589.2021.9671421. https://www.osti.gov/servlets/purl/1847515.
@article{osti_1847515,
title = {Enabling discovery data science through cross-facility workflows},
author = {Antypas, Katerina B. and Bard, Deborah and Blaschke, Johannes P. and Canon, Richard Shane and Enders, Bjoern and Shankar, Mallikarjun and Somnath, Suhas and Stansberry, Dale and Uram, Thomas and Wilkinson, Sean},
abstractNote = {Experimental and observational instruments for scientific research (such as light sources, genome sequencers, accelerators, telescopes and electron microscopes) increasingly require High Performance Computing (HPC) scale capabilities for data analysis and workflow processing. Next-generation instruments are being deployed with higher resolutions and faster data capture rates, creating a big data crunch that cannot be handled by modest institutional computing resources. Often these big data analysis pipelines also require near real-time computing and have higher resilience requirements than the simulation and modeling workloads more traditionally seen at HPC centers. While some facilities have enabled workflows to run at a single HPC facility, there is a growing need to integrate capabilities across HPC facilities to enable cross-facility workflows, either to provide resilience to an experiment, increase analysis throughput capabilities, or to better match a workflow to a particular architecture. In this paper we describe the barriers to executing complex data analysis workflows across HPC facilities and propose an architectural design pattern for enabling scientific discovery using cross-facility workflows that includes orchestration services, application programming interfaces (APIs), data access and co-scheduling.},
doi = {10.1109/BigData52589.2021.9671421},
url = {https://www.osti.gov/biblio/1847515},
place = {United States},
year = {2021},
month = {12}
}