Recovering Transient Data: Automated On-demand Data Reconstruction and Offloading for Supercomputers
Journal Article · ACM SIGOPS Operating Systems Review
- ORNL
It has become a national priority to build and use PetaFlop supercomputers. The dependability of such large systems has been recognized as a key issue that can impact their usability. Even with smaller, existing machines, failures are the norm rather than an exception. Research has shown that storage systems are the primary source of faults leading to supercomputer unavailability. In this paper, we envision two mechanisms, namely on-demand data reconstruction and eager data offloading, to address the availability of job input/output data. These two techniques aim to allow parallel jobs and post-job processing tools to continue execution despite storage system failures in supercomputers. Fundamental to both approaches is the definition and acquisition of recovery-related parallel file system metadata, which is then coupled with transparent remote data accesses. Our approach attempts to maximize the utilization of precious supercomputer resources by improving the accessibility of transient job data. Further, the proposed methods are best-effort in nature and complement existing file system recovery schemes, which are designed for persistent data. Several of our previous studies help in demonstrating the feasibility of the proposed approaches.
- Research Organization:
- Oak Ridge National Laboratory (ORNL); Center for Computational Sciences
- Sponsoring Organization:
- ORNL LDRD Director's R&D
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 930882
- Journal Information:
- Journal Name: ACM SIGOPS Operating Systems Review; Journal Volume: 41; Journal Issue: 1
- Country of Publication:
- United States
- Language:
- English
Similar Records
Improving the Availability of Supercomputer Job Input Data Using Temporal Replication
Conference · Mon Jun 01 00:00:00 EDT 2009 · OSTI ID: 1004448
Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery
Conference · Sun Dec 31 23:00:00 EST 2006 · OSTI ID: 1000413
Method and apparatus for offloading compute resources to a flash co-processing appliance
Patent · Tue Oct 13 00:00:00 EDT 2015 · OSTI ID: 1223101