Title: Final Project Report. Scalable fault tolerance runtime technology for petascale computers

With the massive number of components comprising the forthcoming petascale computer systems, hardware failures will be routinely encountered during the execution of large-scale applications. Due to the multidisciplinary, multiresolution, and multiscale nature of the scientific problems that drive the demand for high-end systems, applications place increasingly differing demands on system resources: disk, network, memory, and CPU. In addition to MPI, future applications are expected to use advanced programming models such as those developed under the DARPA HPCS program, as well as existing global address space programming models such as Global Arrays, UPC, and Co-Array Fortran. While there has been a considerable amount of work on fault-tolerant MPI, with a number of strategies and extensions for fault tolerance proposed, virtually none of the advanced models proposed for emerging petascale systems is currently fault aware. To achieve fault tolerance, underlying runtime and OS technologies that can scale to the petascale level need to be developed. This project has evaluated a range of runtime techniques for fault tolerance in advanced programming models.
 [1] ;  [2]
  1. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
  2. Ohio State Univ., Columbus, OH (United States)
Publication Date:
OSTI Identifier:
Report Number(s):
DOE Contract Number:
Resource Type:
Technical Report
Research Org:
The Ohio State University, Columbus, OH (United States)
Sponsoring Org:
Contributing Orgs:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Country of Publication:
United States
Subject:
97 MATHEMATICS AND COMPUTING; fault tolerance; PGAS; runtime systems