CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows

Journal Article · ACM Transactions on Parallel Computing
DOI: https://doi.org/10.1145/3391448 · OSTI ID: 1769940
 [1];  [1];  [1];  [2];  [2];  [3];  [1]
  1. Rutgers Univ., Piscataway, NJ (United States)
  2. Sandia National Lab. (SNL-CA), Livermore, CA (United States)
  3. Intel, Austin, TX (United States)
The dramatic increase in the scale of current and planned high-end HPC systems is leading to new challenges, such as the growing costs of data movement and IO and the reduced mean time between failures (MTBF) of system components. In-situ workflows, i.e., executing entire application workflows on the HPC system, have emerged as an attractive approach to address data-related challenges by moving computations closer to the data, and staging-based frameworks have been used effectively to support in-situ workflows at scale. However, the resilience of these staging-based solutions has not been addressed, and they remain susceptible to expensive data failures. Furthermore, naive use of data resilience techniques such as n-way replication and erasure codes can increase latency and/or result in significant storage overheads. In this article, we present CoREC, a scalable and resilient in-memory data staging runtime for large-scale in-situ workflows. CoREC uses a novel hybrid approach that combines dynamic replication with erasure coding based on data access patterns. It also leverages multiple levels of replication and erasure coding to support diverse data resiliency requirements. Furthermore, the article presents optimizations for load balancing and conflict-avoiding encoding, and a low-overhead, lazy data recovery scheme. We have implemented the CoREC runtime, deployed it with the DataSpaces staging service on leadership-class computing machines, and present an experimental evaluation in the article. The experiments demonstrate that CoREC can tolerate in-memory data failures while maintaining low latency and sustaining high overall storage efficiency at large scales.
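The trade-off the abstract describes between n-way replication and erasure coding can be illustrated with a minimal Python sketch. This is not the CoREC implementation: the function names, the access-count threshold, and the use of a single XOR parity chunk (a deliberately simple stand-in for a general erasure code) are all assumptions made for illustration only.

```python
from functools import reduce


def xor_parity(chunks):
    """Bytewise XOR of equally sized chunks (single-parity erasure code)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def recover_chunk(surviving_chunks, parity):
    """Rebuild one lost data chunk from the survivors plus the parity chunk."""
    return xor_parity(list(surviving_chunks) + [parity])


def replication_overhead(n_copies):
    """Bytes stored per byte of user data under n-way replication."""
    return float(n_copies)


def erasure_overhead(k, m):
    """Bytes stored per byte of user data with k data and m parity chunks."""
    return (k + m) / k


def choose_scheme(accesses_per_step, hot_threshold=4):
    """Hypothetical policy: replicate frequently accessed ("hot") objects for
    low latency, erasure-code rarely accessed ("cold") ones to save memory.
    The threshold is an illustrative assumption, not a value from the article."""
    return "replicate" if accesses_per_step >= hot_threshold else "erasure"


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]      # k = 3 data chunks
    parity = xor_parity(data)               # m = 1 parity chunk
    rebuilt = recover_chunk([data[0], data[2]], parity)
    print(rebuilt == data[1])
    print(replication_overhead(3), erasure_overhead(3, 1))
    print(choose_scheme(10), choose_scheme(1))
```

Under these assumptions, 3-way replication stores three bytes per byte of data, while three data chunks plus one parity chunk store about 1.33, at the cost of an encode step on write and a decode step on recovery. That cost difference is the motivation for choosing a scheme per object based on how often it is accessed.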
Research Organization:
Sandia National Laboratories (SNL-CA), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC04-94AL85000
OSTI ID:
1769940
Report Number(s):
SAND--2021-2256J; 694181
Journal Information:
ACM Transactions on Parallel Computing, Vol. 7, Issue 2; ISSN 2329-4949
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English
