CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows
Abstract
The dramatic increase in the scale of current and planned high-end HPC systems is leading to new challenges, such as the growing costs of data movement and IO, and the reduced mean time between failures (MTBF) of system components. In-situ workflows, i.e., executing the entire application workflows on the HPC system, have emerged as an attractive approach to address data-related challenges by moving computations closer to the data, and staging-based frameworks have been effectively used to support in-situ workflows at scale. However, the resilience of these staging-based solutions has not been addressed, and they remain susceptible to expensive data failures. Furthermore, naive use of data resilience techniques such as n-way replication and erasure codes can impact latency and/or result in significant storage overheads. In this article, we present CoREC, a scalable and resilient in-memory data staging runtime for large-scale in-situ workflows. CoREC uses a novel hybrid approach that combines dynamic replication with erasure coding based on data access patterns. It also leverages multiple levels of replication and erasure coding to support diverse data resiliency requirements. Furthermore, the article presents optimizations for load balancing and conflict-avoiding encoding, and a low-overhead, lazy data recovery scheme. We have implemented the CoREC runtime and deployed it with the DataSpaces staging service on leadership-class computing machines, and we present an experimental evaluation in the article. The experiments demonstrate that CoREC can tolerate in-memory data failures while maintaining low latency and sustaining high overall storage efficiency at large scales.
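- Authors:
- Duan, Shaohua; Subedi, Pradeep; Davis, Philip; Teranishi, Keita; Kolla, Hemanth; Gamell, Marc; Parashar, Manish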
- Rutgers Univ., Piscataway, NJ (United States)
- Sandia National Lab. (SNL-CA), Livermore, CA (United States)
- Intel, Austin, TX (United States)
- Publication Date:
- 2020-05-31
- Research Org.:
- Sandia National Lab. (SNL-CA), Livermore, CA (United States)
- Sponsoring Org.:
- USDOE National Nuclear Security Administration (NNSA)
- OSTI Identifier:
- 1769940
- Report Number(s):
- SAND-2021-2256J; Journal ID: ISSN 2329-4949; 694181
- Grant/Contract Number:
- AC04-94AL85000
- Resource Type:
- Accepted Manuscript
- Journal Name:
- ACM Transactions on Parallel Computing
- Additional Journal Information:
- Journal Volume: 7; Journal Issue: 2; Journal ID: ISSN 2329-4949
- Publisher:
- Association for Computing Machinery
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; Data resilience; Erasure codes; Replication; In-situ workflows; Data staging
Citation Formats
Duan, Shaohua, Subedi, Pradeep, Davis, Philip, Teranishi, Keita, Kolla, Hemanth, Gamell, Marc, and Parashar, Manish. CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows. United States: N. p., 2020. Web. doi:10.1145/3391448.
Duan, Shaohua, Subedi, Pradeep, Davis, Philip, Teranishi, Keita, Kolla, Hemanth, Gamell, Marc, & Parashar, Manish. CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows. United States. https://doi.org/10.1145/3391448
Duan, Shaohua, Subedi, Pradeep, Davis, Philip, Teranishi, Keita, Kolla, Hemanth, Gamell, Marc, and Parashar, Manish. 2020. "CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows". United States. https://doi.org/10.1145/3391448. https://www.osti.gov/servlets/purl/1769940.
@article{osti_1769940,
title = {CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows},
author = {Duan, Shaohua and Subedi, Pradeep and Davis, Philip and Teranishi, Keita and Kolla, Hemanth and Gamell, Marc and Parashar, Manish},
abstractNote = {The dramatic increase in the scale of current and planned high-end HPC systems is leading to new challenges, such as the growing costs of data movement and IO, and the reduced mean time between failures (MTBF) of system components. In-situ workflows, i.e., executing the entire application workflows on the HPC system, have emerged as an attractive approach to address data-related challenges by moving computations closer to the data, and staging-based frameworks have been effectively used to support in-situ workflows at scale. However, the resilience of these staging-based solutions has not been addressed, and they remain susceptible to expensive data failures. Furthermore, naive use of data resilience techniques such as n-way replication and erasure codes can impact latency and/or result in significant storage overheads. In this article, we present CoREC, a scalable and resilient in-memory data staging runtime for large-scale in-situ workflows. CoREC uses a novel hybrid approach that combines dynamic replication with erasure coding based on data access patterns. It also leverages multiple levels of replication and erasure coding to support diverse data resiliency requirements. Furthermore, the article presents optimizations for load balancing and conflict-avoiding encoding, and a low-overhead, lazy data recovery scheme. We have implemented the CoREC runtime and deployed it with the DataSpaces staging service on leadership-class computing machines, and we present an experimental evaluation in the article. The experiments demonstrate that CoREC can tolerate in-memory data failures while maintaining low latency and sustaining high overall storage efficiency at large scales.},
doi = {10.1145/3391448},
journal = {ACM Transactions on Parallel Computing},
number = 2,
volume = 7,
place = {United States},
year = {2020},
month = {May}
}
Works referenced in this record:
Combining Partial Redundancy and Checkpointing for HPC
conference, June 2012
- Elliott, James; Kharbas, Kishor; Fiala, David
- 2012 IEEE 32nd International Conference on Distributed Computing Systems (ICDCS)
Terascale direct numerical simulations of turbulent combustion using S3D
journal, January 2009
- Chen, J. H.; Choudhary, A.; de Supinski, B.
- Computational Science & Discovery, Vol. 2, Issue 1
Failures in large scale systems: long-term measurement, analysis, and implications
conference, January 2017
- Gupta, Saurabh; Patel, Tirthak; Engelmann, Christian
- Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '17
Efficient, Failure Resilient Transactions for Parallel and Distributed Computing
conference, November 2014
- Lofstead, Jay; Dayal, Jai; Jimenez, Ivo
- 2014 International Workshop on Data Intensive Scalable Computing Systems (DISCS)
Post-failure recovery of MPI communication capability: Design and rationale
journal, June 2013
- Bland, Wesley; Bouteiller, Aurelien; Herault, Thomas
- The International Journal of High Performance Computing Applications, Vol. 27, Issue 3
ActiveSpaces: Exploring dynamic code deployment for extreme scale data processing
journal, October 2014
- Docan, Ciprian; Zhang, Fan; Jin, Tong
- Concurrency and Computation: Practice and Experience, Vol. 27, Issue 14
Sizing and Partitioning Strategies for Burst-Buffers to Reduce IO Contention
conference, May 2019
- Aupy, Guillaume; Beaumont, Olivier; Eyraud-Dubois, Lionel
- 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
SmartBlock: An Approach to Standardizing In Situ Workflow Components
conference, May 2017
- Champsaur, Alexis; Lofstead, Jay; Dayal, Jai
- 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Scalable Data Resilience for In-memory Data Staging
conference, May 2018
- Duan, Shaohua; Subedi, Pradeep; Teranishi, Keita
- 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Local recovery and failure masking for stencil-based applications at extreme scales
conference, January 2015
- Gamell, Marc; Teranishi, Keita; Heroux, Michael A.
- Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '15
Management, analysis, and visualization of experimental and observational data — The convergence of data and computing
conference, October 2016
- Bethel, E. Wes; Greenwald, Martin; van Dam, Kerstin Kleese
- 2016 IEEE 12th International Conference on e-Science (e-Science)
Feature-Based Statistical Analysis of Combustion Simulation Data
journal, December 2011
- Bennett, Janine C.; Krishnamoorthy, Vaidyanathan
- IEEE Transactions on Visualization and Computer Graphics, Vol. 17, Issue 12
Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems
conference, June 2015
- Gupta, Saurabh; Tiwari, Devesh; Jantzi, Christopher
- 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Machine Learning Models for GPU Error Prediction in a Large Scale HPC System
conference, June 2018
- Nie, Bin; Xue, Ji; Gupta, Saurabh
- 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Reducing Waste in Extreme Scale Systems through Introspective Analysis
conference, May 2016
- Bautista-Gomez, Leonardo; Gainaru, Ana; Perarnau, Swann
- 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
journal, February 2013
- Egwutuoha, Ifeanyi P.; Levy, David; Selic, Bran
- The Journal of Supercomputing, Vol. 65, Issue 3
DataSpaces: an interaction and coordination framework for coupled simulation workflows
journal, February 2011
- Docan, Ciprian; Parashar, Manish; Klasky, Scott
- Cluster Computing, Vol. 15, Issue 2
Polynomial Codes Over Certain Finite Fields
journal, June 1960
- Reed, I. S.; Solomon, G.
- Journal of the Society for Industrial and Applied Mathematics, Vol. 8, Issue 2
DataSpaces: an interaction and coordination framework for coupled simulation workflows
conference, January 2010
- Docan, Ciprian; Parashar, Manish; Klasky, Scott
- Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing - HPDC '10
A Comprehensive Analysis of XOR-Based Erasure Codes Tolerating 3 or More Concurrent Failures
conference, May 2013
- Subedi, Pradeep; He, Xubin
- 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)
Harmonia: An Interference-Aware Dynamic I/O Scheduler for Shared Non-volatile Burst Buffers
conference, September 2018
- Kougkas, Anthony; Devarajan, Hariharan; Sun, Xian-He
- 2018 IEEE International Conference on Cluster Computing (CLUSTER)
Stacker: An Autonomic Data Movement Engine for Extreme-Scale Data Staging-Based In-Situ Workflows
conference, November 2018
- Subedi, Pradeep; Davis, Philip; Duan, Shaohua
- SC18: International Conference for High Performance Computing, Networking, Storage and Analysis
Leveraging burst buffer coordination to prevent I/O interference
conference, October 2016
- Kougkas, Anthony; Dorier, Matthieu; Latham, Rob
- 2016 IEEE 12th International Conference on e-Science (e-Science)
In Situ Visualization for Large-Scale Combustion Simulations
journal, May 2010
- Yu, Hongfeng; Grout, Ray W.
- IEEE Computer Graphics and Applications, Vol. 30, Issue 3
Real-Time In-Memory Checkpointing for Future Hybrid Memory Systems
conference, June 2015
- Gao, Shen; He, Bingsheng; Xu, Jianliang
- Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15)
AnalyzeThis: an analysis workflow-aware storage system
conference, November 2015
- Sim, Hyogi; Kim, Youngjae; Vazhkudai, Sudharshan S.
- Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
Toward Managing HPC Burst Buffers Effectively: Draining Strategy to Regulate Bursty I/O Behavior
conference, September 2017
- Tang, Kun; Huang, Ping; He, Xubin
- 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)