OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A Fault-oblivious Extreme-scale Execution Environment

Technical Report · DOI: https://doi.org/10.2172/1307122 · OSTI ID: 1307122
 [1]
  1. The Ohio State Univ., Columbus, OH (United States)

Exascale computing systems will provide a thousand-fold increase in parallelism and a proportional increase in failure rate relative to today's machines. Systems software for exascale machines must provide the infrastructure to support existing applications while simultaneously enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massive data analysis in a highly unreliable hardware environment with billions of threads of execution. We propose a new approach to the data and work distribution model provided by system software, based on the unifying formalism of an abstract file system. The proposed hierarchical data model provides simple, familiar visibility and access to data structures through the file system hierarchy, while providing fault tolerance through selective redundancy. The hierarchical task model features work queues whose form and organization are represented as file system objects. Data and work are both first-class entities. Exposing the relationships between data and work to the runtime system gives it the information needed to optimize execution time and provide fault tolerance. The data distribution scheme provides replication (where desirable and possible) for fault tolerance and efficiency, and it is hierarchical so that it can take advantage of locality. Users, tools, and applications, including legacy applications, can interface with the data, work queues, and one another through the abstract file model. This runtime environment will provide multiple interfaces to support traditional Message Passing Interface applications, languages developed under DARPA's High Productivity Computing Systems program, and other, experimental programming models. We will validate our runtime system with pilot codes on existing platforms and will use simulation to validate it for exascale-class platforms.
In this final report, we summarize research results from the work done at the Ohio State University towards the larger goals of the project listed above.
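
As a rough illustration of the idea of work queues represented as file system objects, the following is a minimal sketch and not the project's actual runtime. It assumes an ordinary local POSIX file system, one file per task, and directories named pending, claimed, and done; every function name and the directory layout are hypothetical, chosen only for this example. A worker claims a task by renaming its file, which is atomic within one file system, and tasks held by a failed worker can be moved back to pending.

```python
import os
import tempfile


def make_queue(root):
    """Create the (hypothetical) queue layout: one directory per task state."""
    for state in ("pending", "claimed", "done"):
        os.makedirs(os.path.join(root, state), exist_ok=True)


def submit(root, task_id, payload):
    """Add a task: write it to a temporary file, then rename it into pending/."""
    tmp = os.path.join(root, "." + task_id + ".tmp")
    with open(tmp, "w") as f:
        f.write(payload)
    os.rename(tmp, os.path.join(root, "pending", task_id))


def claim(root, worker):
    """Claim one pending task for `worker`; return (task_id, payload) or None.

    Assumes worker names contain no '.' so the owner can be parsed back out.
    """
    pending = os.path.join(root, "pending")
    for task_id in sorted(os.listdir(pending)):
        src = os.path.join(pending, task_id)
        dst = os.path.join(root, "claimed", task_id + "." + worker)
        try:
            os.rename(src, dst)  # atomic within a single POSIX file system
        except FileNotFoundError:
            continue  # another worker claimed this task first
        with open(dst) as f:
            return task_id, f.read()
    return None


def complete(root, task_id, worker):
    """Mark a claimed task as finished by moving it into done/."""
    src = os.path.join(root, "claimed", task_id + "." + worker)
    os.rename(src, os.path.join(root, "done", task_id))


def recover(root, worker):
    """Return tasks claimed by a failed worker to pending/; list their ids."""
    claimed = os.path.join(root, "claimed")
    moved = []
    for name in sorted(os.listdir(claimed)):
        task_id, _, owner = name.rpartition(".")
        if owner == worker:
            os.rename(os.path.join(claimed, name),
                      os.path.join(root, "pending", task_id))
            moved.append(task_id)
    return moved


if __name__ == "__main__":
    demo_root = tempfile.mkdtemp()
    make_queue(demo_root)
    submit(demo_root, "task-a", "payload for task a")
    print(claim(demo_root, "w0"))    # w0 takes task-a
    print(recover(demo_root, "w0"))  # pretend w0 failed; task-a is requeued
    print(claim(demo_root, "w1"))    # w1 picks up the same task
```

Because the queue state lives entirely in directory entries, standard tools such as ls can inspect it, which is the kind of visibility the abstract attributes to the file model. A distributed or replicated file system would need its own atomicity guarantees, which this sketch does not address.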

Research Organization:
The Ohio State Univ., Columbus, OH (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
SC0005034
OSTI ID:
1307122
Report Number(s):
DOE-OSU-SC0005034; 6147714213
Country of Publication:
United States
Language:
English

Similar Records

A Fault Oblivious Extreme-Scale Execution Environment
Technical Report · Thu Nov 20 00:00:00 EST 2014 · OSTI ID:1307122

Fault Oblivious eXascale Whitepaper
Conference · Wed Jun 01 00:00:00 EDT 2011 · OSTI ID:1307122

FOX: A Fault-Oblivious Extreme-Scale Execution Environment Boston University Final Report Project Number: DE-SC0005365
Technical Report · Sun Mar 17 00:00:00 EDT 2013 · OSTI ID:1307122
