A Fault Oblivious Extreme-Scale Execution Environment

McKie, Jim

doi:10.2172/1164219

Technical Report · November 20, 2014

DOI: https://doi.org/10.2172/1164219 · OSTI ID: 1164219

The FOX project, funded under the ASCR X-stack I program, developed systems software and runtime libraries for a new approach to data and work distribution for massively parallel, fault-oblivious application execution. Our work was motivated by the premise that exascale computing systems will provide a thousand-fold increase in parallelism and a proportional increase in failure rate relative to today’s machines. To deliver the capability of exascale hardware, the systems software must provide the infrastructure to support existing applications while simultaneously enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massive data analysis in a highly unreliable hardware environment with billions of threads of execution. Our OS research prototyped new methods to provide efficient resource sharing, synchronization, and protection in a many-core compute node. We experimented with alternative task/dataflow programming models and showed scalability, in some cases, to hundreds of thousands of cores. Much of our software is in active development through open source projects. Concepts from FOX are being pursued in next-generation exascale operating systems. Our OS work focused on adaptive, application-tailored OS services optimized for the move from multi-core to many-core processors. We developed a new operating system, NIX, which supports role-based allocation of cores to processes and was released as open source. We contributed to the IBM FusedOS project, which promoted the concept of latency-optimized and throughput-optimized cores. We built a task queue library based on a distributed, fault-tolerant key-value store and identified scaling issues. A second fault-tolerant task-parallel library was developed, based on the Linda tuple space model, that used low-level interconnect primitives for optimized communication.
We designed fault-tolerance mechanisms for task-parallel computations employing work stealing for load balancing that scaled to the largest existing supercomputers. Finally, we implemented the Elastic Building Blocks runtime, a library to manage object-oriented distributed software components. To support the research, we won two INCITE awards for time on Intrepid (BG/P) and Mira (BG/Q). Much of our work has had impact in the OS and runtime community through the ASCR Exascale OS/R workshop and report, leading to the research agenda of the Exascale OS/R program. Our project was, however, also affected by the attrition of multiple PIs. While these PIs continued to participate and offer guidance as time permitted, losing these key individuals was unfortunate both for the project and for the DOE HPC community.
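
For readers unfamiliar with the Linda tuple space model mentioned in the abstract, the following is a minimal, single-process sketch of its three classic operations (deposit, read, and remove by pattern). It is an illustration only and is not taken from the FOX libraries: the class name `TupleSpace`, the method names, the use of `None` as a wildcard, and the timeout behavior are all assumptions made for this example, and it includes no distribution, interconnect primitives, or fault tolerance.

```python
import threading


class TupleSpace:
    """Toy in-process tuple space (hypothetical; not the FOX implementation).

    A pattern is a tuple in which None matches any value in that position.
    """

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, tup):
        """Deposit a tuple into the space and wake any waiting readers."""
        with self._cond:
            self._tuples.append(tuple(tup))
            self._cond.notify_all()

    @staticmethod
    def _matches(pattern, tup):
        # Same length, and every non-wildcard field must be equal.
        return len(pattern) == len(tup) and all(
            p is None or p == t for p, t in zip(pattern, tup)
        )

    def _find(self, pattern):
        # Return the oldest matching tuple, or None if nothing matches.
        for tup in self._tuples:
            if self._matches(pattern, tup):
                return tup
        return None

    def rd(self, pattern, timeout=None):
        """Return a matching tuple without removing it (None on timeout)."""
        with self._cond:
            self._cond.wait_for(lambda: self._find(pattern) is not None, timeout)
            return self._find(pattern)

    def in_(self, pattern, timeout=None):
        """Remove and return a matching tuple (None on timeout)."""
        with self._cond:
            self._cond.wait_for(lambda: self._find(pattern) is not None, timeout)
            tup = self._find(pattern)
            if tup is not None:
                self._tuples.remove(tup)
            return tup


# Usage: a producer deposits task tuples; a worker claims one by pattern.
space = TupleSpace()
space.out(("task", 1, "payload-a"))
space.out(("task", 2, "payload-b"))
claimed = space.in_(("task", None, None))
```

Because a worker claims a task by removing its tuple, any worker can pick up any pending task, which is what makes the model a natural fit for task-parallel libraries. A fault-tolerant variant would additionally need some way to return a claimed task to the space if its worker fails, which this sketch does not attempt.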

Research Organization: Bell Laboratories, Alcatel-Lucent USA Inc.

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

Contributing Organization: LLNL; SANDIA CA; PNNL; Boston U.; Ohio State U.; IBM; Bell Labs

DOE Contract Number: SC0005158

OSTI ID: 1164219

Report Number(s): DOE-ALU-05158

Country of Publication: United States

Language: English

Similar Records

FOX: A Fault-Oblivious Extreme-Scale Execution Environment Boston University Final Report Project Number: DE-SC0005365

Technical Report · March 17, 2013 · OSTI ID:1164219

Appavoo, Jonathan

A Fault-oblivious Extreme-scale Execution Environment

Technical Report · August 31, 2016 · OSTI ID:1164219

Sadayappan, Ponnuswamy

HPC-Colony: Services and Interfaces to Support Systems With Very Large Numbers of Processors

Technical Report · January 31, 2007 · OSTI ID:1164219

Jones, T; Kale, L; Moreira, J; +4 more

Related Subjects

97 MATHEMATICS AND COMPUTING
exascale
operating system
runtime
fault oblivious
task management
work-stealing
