Title: Fault Tolerance for OpenSHMEM. In: PGAS '14 Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models. Article No. 23

Abstract

On today's supercomputing systems, faults are becoming the norm rather than the exception. Given the complexity required to achieve the expected scalability and performance on future systems, this situation is expected to worsen, and systems will have to function in a nearly constant presence of faults. To be productive on these systems, programming models will require both hardware and software to be resilient to faults. With the growing importance of the PGAS programming model and OpenSHMEM as part of the HPC software stack, the lack of a fault-tolerance model may become a liability for its users. To this end, in this paper we discuss the viability of using checkpoint/restart as a fault-tolerance method for OpenSHMEM, propose a selective checkpoint/restart fault-tolerance model, and discuss the challenges associated with implementing the proposed model.

Authors:
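 Hao, Pengfei [1]; Shamis, Pavel [2]; Venkata, Manjunath Gorentla [2]; Pophale, Swaroop [1]; Welch, Aaron [1]; Poole, Stephen [2]; Chapman, Barbara [1]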
  1. Univ. of Houston, TX (United States)
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1567638
Resource Type:
Conference
Resource Relation:
Conference: 8th International Conference on Partitioned Global Address Space Programming Models, Eugene, OR, USA, October 6-10, 2014
Country of Publication:
United States
Language:
English

Citation Formats

Hao, Pengfei, Shamis, Pavel, Venkata, Manjunath Gorentla, Pophale, Swaroop, Welch, Aaron, Poole, Stephen, and Chapman, Barbara. Fault Tolerance for OpenSHMEM. In: PGAS '14 Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models. Article No. 23. United States: N. p., 2014. Web. doi:10.1145/2676870.2676894.
Hao, Pengfei, Shamis, Pavel, Venkata, Manjunath Gorentla, Pophale, Swaroop, Welch, Aaron, Poole, Stephen, & Chapman, Barbara. Fault Tolerance for OpenSHMEM. In: PGAS '14 Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models. Article No. 23. United States. doi:10.1145/2676870.2676894.
Hao, Pengfei, Shamis, Pavel, Venkata, Manjunath Gorentla, Pophale, Swaroop, Welch, Aaron, Poole, Stephen, and Chapman, Barbara. 2014. "Fault Tolerance for OpenSHMEM. In: PGAS '14 Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models. Article No. 23". United States. doi:10.1145/2676870.2676894.
@article{osti_1567638,
title = {Fault Tolerance for OpenSHMEM. In: PGAS '14 Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models. Article No. 23},
author = {Hao, Pengfei and Shamis, Pavel and Venkata, Manjunath Gorentla and Pophale, Swaroop and Welch, Aaron and Poole, Stephen and Chapman, Barbara},
abstractNote = {On today's supercomputing systems, faults are becoming the norm rather than the exception. Given the complexity required to achieve the expected scalability and performance on future systems, this situation is expected to worsen, and systems will have to function in a nearly constant presence of faults. To be productive on these systems, programming models will require both hardware and software to be resilient to faults. With the growing importance of the PGAS programming model and OpenSHMEM as part of the HPC software stack, the lack of a fault-tolerance model may become a liability for its users. To this end, in this paper we discuss the viability of using checkpoint/restart as a fault-tolerance method for OpenSHMEM, propose a selective checkpoint/restart fault-tolerance model, and discuss the challenges associated with implementing the proposed model.},
doi = {10.1145/2676870.2676894},
place = {United States},
year = {2014},
month = {1}
}
