OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Final Project Report. Scalable fault tolerance runtime technology for petascale computers

Technical Report · DOI: https://doi.org/10.2172/1184567 · OSTI ID: 1184567
 [1];  [2]
  1. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
  2. Ohio State Univ., Columbus, OH (United States)

With the massive number of components comprising the forthcoming petascale computer systems, hardware failures will be routinely encountered during execution of large-scale applications. Due to the multidisciplinary, multiresolution, and multiscale nature of the scientific problems that drive the demand for high-end systems, applications place increasingly differing demands on system resources: disk, network, memory, and CPU. In addition to MPI, future applications are expected to use advanced programming models such as those developed under the DARPA HPCS program, as well as existing global address space programming models such as Global Arrays, UPC, and Co-Array Fortran. While there has been a considerable amount of work on fault-tolerant MPI, with a number of strategies and extensions for fault tolerance proposed, virtually none of the advanced models proposed for emerging petascale systems is currently fault aware. To achieve fault tolerance, underlying runtime and OS technologies that can scale to the petascale level must be developed. This project has evaluated a range of runtime techniques for fault tolerance for advanced programming models.

Research Organization:
The Ohio State University, Columbus, OH (United States)
Sponsoring Organization:
USDOE
Contributing Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
DOE Contract Number:
FG02-08ER25850
OSTI ID:
1184567
Report Number(s):
DOE-OSU-FG02-08ER25850
Country of Publication:
United States
Language:
English

Similar Records

Steps toward fault-tolerant quantum chemistry.
Technical Report · May 1, 2010 · OSTI ID:1184567

Efficient On-demand Connection Management Mechanisms with PGAS Models on InfiniBand
Conference · May 17, 2010 · OSTI ID:1184567

Runtime Techniques to Enable a Highly-Scalable Global Address Space Model for Petascale Computing
Journal Article · September 2, 2012 · International Journal of Parallel Programming · OSTI ID:1184567