
Title: Fault Oblivious eXascale Whitepaper

Conference

In this paper we present a software system which supports dynamic, irregular, adaptive applications. Data objects are created and structured in a hierarchical manner, with replication as needed to provide a high degree of redundancy. The data objects can contain data, code, tasks (work descriptors with references to data, code, and other tasks) and higher level structures such as work queues. The higher level structures benefit from the properties of the data objects: redundant storage to support resiliency in the face of hardware failure; hierarchical structure to optimize use of the HPC system; and the presence of object names, available in the per-user file system name space, which allows any application, not just specially written HPC applications, to make use of the data even while it is on the HPC system. Our use of hierarchy will make the runtime scalable to very large systems. Our use of redundancy will allow programs to be written in a fault-oblivious manner, eliminating the need for system-level checkpointing. Putting data object names into the file system name space allows for interactive use of the system by users. With this approach, we will be able to finally leave the batch era behind, a half-century after the invention of time sharing. We will be able to stop bounding program throughput by the checkpoint interval. Application data will be accessible at any time, not hidden behind opaque 128-bit pointers or MPI ranks, but given a name that is visible everywhere. Programmers can stop laying out data and thinking about where the data is, where the code is, and where the nodes are, and stick with the problem of what the application is supposed to be doing. This work, if it succeeds, will enable scientific computing to scale to the next generation of machines.
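The combination of named, replicated data objects described above can be illustrated with a small sketch. This is not the paper's runtime or its API: `Node`, `ReplicatedStore`, the placement rule, and the path names are all hypothetical, chosen only to show how a task descriptor stored under a hierarchical name could remain readable after one of the nodes holding it fails.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical storage node that may fail (alive=False)."""
    name: str
    alive: bool = True
    objects: dict = field(default_factory=dict)


class ReplicatedStore:
    """Toy store: every named object is written to `replicas` nodes."""

    def __init__(self, nodes, replicas=2):
        if replicas > len(nodes):
            raise ValueError("replicas cannot exceed node count")
        self.nodes = nodes
        self.replicas = replicas

    def _placement(self, path):
        # Assumed placement rule: a deterministic start index derived
        # from the path bytes, then consecutive nodes around the ring.
        start = sum(path.encode()) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replicas)]

    def put(self, path, value):
        # Write the object to every live node in its placement set.
        for node in self._placement(path):
            if node.alive:
                node.objects[path] = value

    def get(self, path):
        # Read from the first live replica; the caller never needs to
        # know which node answered.
        for node in self._placement(path):
            if node.alive and path in node.objects:
                return node.objects[path]
        raise KeyError(path)

    def names(self, prefix):
        # List object names under a hierarchical prefix, like a directory.
        found = set()
        for node in self.nodes:
            if node.alive:
                found.update(p for p in node.objects if p.startswith(prefix))
        return sorted(found)


nodes = [Node("n0"), Node("n1"), Node("n2")]
store = ReplicatedStore(nodes, replicas=2)

# A task is a work descriptor that refers to code and data by name.
store.put("/job/queue/task0",
          {"code": "/job/code/step", "data": "/job/data/grid"})

# Simulate a hardware failure on one of the nodes holding the task.
store._placement("/job/queue/task0")[0].alive = False

task = store.get("/job/queue/task0")   # still served by the other replica
print(task["code"], store.names("/job/queue/"))
```

In this sketch the reader of `/job/queue/task0` takes no recovery action when a node fails, which is the sense of "fault-oblivious" the abstract suggests. A real system would also need re-replication after failures and consistency between replicas, neither of which is modeled here.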

Research Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1023198
Report Number(s):
PNNL-SA-79579; KJ0402000; TRN: US201118%%501
Resource Relation:
Conference: Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers (ROSS 2011), held in conjunction with the 25th International Conference on Supercomputing, May 31, 2011, Tucson, Arizona, 17-24
Country of Publication:
United States
Language:
English