Title: CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows

Abstract

The dramatic increase in the scale of current and planned high-end HPC systems is leading to new challenges, such as the growing costs of data movement and IO, and the reduced mean time between failures (MTBF) of system components. In-situ workflows, i.e., executing the entire application workflows on the HPC system, have emerged as an attractive approach to address data-related challenges by moving computations closer to the data, and staging-based frameworks have been effectively used to support in-situ workflows at scale. However, the resilience of these staging-based solutions has not been addressed, and they remain susceptible to expensive data failures. Furthermore, naive use of data resilience techniques such as n-way replication and erasure codes can impact latency and/or result in significant storage overheads. In this article, we present CoREC, a scalable and resilient in-memory data staging runtime for large-scale in-situ workflows. CoREC uses a novel hybrid approach that combines dynamic replication with erasure coding based on data access patterns. It also leverages multiple levels of replication and erasure coding to support diverse data resiliency requirements. Furthermore, the article presents optimizations for load balancing and conflict-avoiding encoding, and a low-overhead, lazy data recovery scheme. We have implemented the CoREC runtime, deployed it with the DataSpaces staging service on leadership-class computing machines, and present an experimental evaluation in the article. The experiments demonstrate that CoREC can tolerate in-memory data failures while maintaining low latency and sustaining high overall storage efficiency at large scales.
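The storage trade-off the abstract describes can be illustrated with a small calculation. The sketch below is not CoREC's implementation: the function names, the access-count threshold, and the example parameters are all hypothetical, and the erasure code is assumed to be an MDS code such as Reed-Solomon with k data chunks and m parity chunks. It compares the extra storage needed by n-way replication and by a (k, m) code, and estimates the overhead of a hybrid scheme that replicates frequently accessed ("hot") data and erasure-codes the rest.

```python
def replication_overhead(n_copies: int) -> float:
    """Extra bytes stored per byte of data when n_copies total copies are kept."""
    return float(n_copies - 1)


def erasure_overhead(k: int, m: int) -> float:
    """Extra bytes stored per byte of data for k data chunks plus m parity chunks."""
    return m / k


def choose_scheme(accesses_per_step: int, hot_threshold: int = 4) -> str:
    """Hypothetical policy: replicate hot objects, erasure-code cold ones.

    hot_threshold is an assumed tuning knob, not a value from the article.
    """
    return "replicate" if accesses_per_step >= hot_threshold else "erasure_code"


def hybrid_overhead(hot_fraction: float, n_copies: int, k: int, m: int) -> float:
    """Average extra storage when hot_fraction of the data is replicated
    and the remainder is erasure coded."""
    return (hot_fraction * replication_overhead(n_copies)
            + (1.0 - hot_fraction) * erasure_overhead(k, m))


if __name__ == "__main__":
    # 3-way replication and a (4, 2) MDS code both survive two lost copies/chunks.
    print("replication :", replication_overhead(3))
    print("erasure code:", erasure_overhead(4, 2))
    print("hybrid (25% hot):", hybrid_overhead(0.25, 3, 4, 2))
```

Under these assumed parameters, replication costs two extra bytes per byte of data and the (4, 2) code costs half a byte, while reads of erasure-coded data may need to gather or decode several chunks. That latency cost is why a hybrid that keeps only hot data replicated can sit between the two extremes.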

Authors:
 [1];  [1];  [1];  [2];  [2];  [3];  [1]
  1. Rutgers Univ., Piscataway, NJ (United States)
  2. Sandia National Lab. (SNL-CA), Livermore, CA (United States)
  3. Intel, Austin, TX (United States)
Publication Date:
2020-05-31
Research Org.:
Sandia National Lab. (SNL-CA), Livermore, CA (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1769940
Report Number(s):
SAND-2021-2256J
Journal ID: ISSN 2329-4949; 694181
Grant/Contract Number:  
AC04-94AL85000
Resource Type:
Accepted Manuscript
Journal Name:
ACM Transactions on Parallel Computing
Additional Journal Information:
Journal Volume: 7; Journal Issue: 2; Journal ID: ISSN 2329-4949
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Data resilience; Erasure codes; Replication; In-situ workflows; Data staging

Citation Formats

Duan, Shaohua, Subedi, Pradeep, Davis, Philip, Teranishi, Keita, Kolla, Hemanth, Gamell, Marc, and Parashar, Manish. CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows. United States: N. p., 2020. Web. doi:10.1145/3391448.
Duan, Shaohua, Subedi, Pradeep, Davis, Philip, Teranishi, Keita, Kolla, Hemanth, Gamell, Marc, & Parashar, Manish. CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows. United States. https://doi.org/10.1145/3391448
Duan, Shaohua, Subedi, Pradeep, Davis, Philip, Teranishi, Keita, Kolla, Hemanth, Gamell, Marc, and Parashar, Manish. 2020. "CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows". United States. https://doi.org/10.1145/3391448. https://www.osti.gov/servlets/purl/1769940.
@article{osti_1769940,
title = {CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows},
author = {Duan, Shaohua and Subedi, Pradeep and Davis, Philip and Teranishi, Keita and Kolla, Hemanth and Gamell, Marc and Parashar, Manish},
abstractNote = {The dramatic increase in the scale of current and planned high-end HPC systems is leading to new challenges, such as the growing costs of data movement and IO, and the reduced mean time between failures (MTBF) of system components. In-situ workflows, i.e., executing the entire application workflows on the HPC system, have emerged as an attractive approach to address data-related challenges by moving computations closer to the data, and staging-based frameworks have been effectively used to support in-situ workflows at scale. However, the resilience of these staging-based solutions has not been addressed, and they remain susceptible to expensive data failures. Furthermore, naive use of data resilience techniques such as n-way replication and erasure codes can impact latency and/or result in significant storage overheads. In this article, we present CoREC, a scalable and resilient in-memory data staging runtime for large-scale in-situ workflows. CoREC uses a novel hybrid approach that combines dynamic replication with erasure coding based on data access patterns. It also leverages multiple levels of replication and erasure coding to support diverse data resiliency requirements. Furthermore, the article presents optimizations for load balancing and conflict-avoiding encoding, and a low-overhead, lazy data recovery scheme. We have implemented the CoREC runtime, deployed it with the DataSpaces staging service on leadership-class computing machines, and present an experimental evaluation in the article. The experiments demonstrate that CoREC can tolerate in-memory data failures while maintaining low latency and sustaining high overall storage efficiency at large scales.},
doi = {10.1145/3391448},
journal = {ACM Transactions on Parallel Computing},
number = 2,
volume = 7,
place = {United States},
year = {2020},
month = {5}
}
