CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows

Duan, Shaohua; Subedi, Pradeep; Davis, Philip; Teranishi, Keita; Kolla, Hemanth; Gamell, Marc; Parashar, Manish

doi:10.1145/3391448

Journal Article · May 31, 2020 · ACM Transactions on Parallel Computing

DOI: https://doi.org/10.1145/3391448 · OSTI ID: 1769940

Duan, Shaohua ^[1]; Subedi, Pradeep ^[1]; Davis, Philip ^[1]; Teranishi, Keita ^[2]; Kolla, Hemanth ^[2]; Gamell, Marc ^[3]; Parashar, Manish ^[1]

[1] Rutgers Univ., Piscataway, NJ (United States)
[2] Sandia National Lab. (SNL-CA), Livermore, CA (United States)
[3] Intel, Austin, TX (United States)

The dramatic increase in the scale of current and planned high-end HPC systems is leading to new challenges, such as the growing costs of data movement and I/O and the reduced mean time between failures (MTBF) of system components. In-situ workflows, i.e., executing the entire application workflow on the HPC system, have emerged as an attractive approach to addressing data-related challenges by moving computation closer to the data, and staging-based frameworks have been used effectively to support in-situ workflows at scale. However, the resilience of these staging-based solutions has not been addressed, and they remain susceptible to expensive data failures. Furthermore, naive use of data resilience techniques such as n-way replication and erasure codes can increase latency and/or result in significant storage overheads. In this article, we present CoREC, a scalable and resilient in-memory data staging runtime for large-scale in-situ workflows. CoREC uses a novel hybrid approach that combines dynamic replication with erasure coding based on data access patterns. It also leverages multiple levels of replication and erasure coding to support diverse data resiliency requirements. Furthermore, the article presents optimizations for load balancing and conflict-avoiding encoding, as well as a low-overhead, lazy data recovery scheme. We have implemented the CoREC runtime, deployed it with the DataSpaces staging service on leadership-class computing machines, and present an experimental evaluation in the article. The experiments demonstrate that CoREC can tolerate in-memory data failures while maintaining low latency and sustaining high overall storage efficiency at large scales.
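The storage trade-off mentioned in the abstract can be made concrete with a small sketch. The code below is illustrative only and is not taken from CoREC: the `DataObject` type, the `recent_accesses` counter, the `hot_threshold` parameter, and the "replicate hot data, erasure-code cold data" rule are all assumptions chosen to show why a hybrid scheme might save space. The overhead formulas themselves are standard: keeping c total copies costs c - 1 extra bytes per byte of data, while a (k data + m parity) erasure code costs m / k extra bytes per byte.

```python
from dataclasses import dataclass


def replication_overhead(copies: int) -> float:
    """Extra storage per byte of data when keeping `copies` total copies."""
    return float(copies - 1)


def erasure_overhead(k: int, m: int) -> float:
    """Extra storage per byte for a code with k data and m parity blocks."""
    return m / k


@dataclass
class DataObject:
    name: str
    size_bytes: int
    recent_accesses: int  # hypothetical access counter, not a CoREC field


def choose_scheme(obj: DataObject, hot_threshold: int = 10) -> str:
    """Assumed policy: replicate frequently accessed objects, encode the rest."""
    return "replicate" if obj.recent_accesses >= hot_threshold else "erasure"


def total_storage(objects, copies=2, k=4, m=2, hot_threshold=10) -> float:
    """Total bytes stored (data plus redundancy) under the assumed policy."""
    total = 0.0
    for obj in objects:
        if choose_scheme(obj, hot_threshold) == "replicate":
            total += obj.size_bytes * (1 + replication_overhead(copies))
        else:
            total += obj.size_bytes * (1 + erasure_overhead(k, m))
    return total
```

For example, 3-way replication and a (4 data + 2 parity) code both survive two lost blocks, but the former adds 200% storage and the latter 50%. Replication avoids encoding and decoding work on access, which is the latency side of the trade-off the abstract refers to.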

Research Organization: Sandia National Laboratories (SNL-CA), Livermore, CA (United States)

Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA)

Grant/Contract Number: AC04-94AL85000

OSTI ID: 1769940

Report Number(s): SAND--2021-2256J; 694181

Journal Information: ACM Transactions on Parallel Computing, Vol. 7, Issue 2; ISSN 2329-4949

Publisher: Association for Computing Machinery

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
Data resilience
Data staging
Erasure codes
In-situ workflows
Replication
