System-level fault-tolerance in large-scale parallel machines with buffered coscheduling

Petrini, F; Davis, Kei; Sancho, J C

doi:10.1109/IPDPS.2004.1303239

Conference · Thu Jan 01 00:00:00 EST 2004

DOI: https://doi.org/10.1109/IPDPS.2004.1303239 · OSTI ID: 977449

Petrini, Fabrizio; Davis, Kei; Sancho, Jose Carlos

As the number of processors for multi-teraflop systems grows to tens of thousands, with proposed petaflops systems likely to contain hundreds of thousands of processors, the assumption of fully reliable hardware has been abandoned. Although the mean time between failures for the individual components can be very high, the large total component count will inevitably lead to frequent failures. It is therefore of paramount importance to develop new software solutions to deal with the unavoidable reality of hardware faults. In this paper we will first describe the nature of the failures of current large-scale machines, and extrapolate these results to future machines. Based on this preliminary analysis we will present a new technology that we are currently developing, buffered coscheduling, which seeks to implement fault tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow continuing execution in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency (no changes to user applications required). Preliminary results show that this is attainable with current hardware.

Research Organization: Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)

Sponsoring Organization: USDOE

OSTI ID: 977449

Report Number(s): LA-UR-04-0694; LA-UR-04-694; TRN: US201009%%773

Resource Relation: Conference: Submitted to: IPDPS 2004, April 2004, Santa Fe, NM

Country of Publication: United States

Language: English

Similar Records

Buffered coscheduling for parallel programming and enhanced fault tolerance

Patent · Tue Jan 31 00:00:00 EST 2006 · OSTI ID:977449

Petrini, Fabrizio; Feng, Wu-chun

A New coscheduling technique for a cluster of symmetric multiprocessors

Conference · Mon Apr 17 00:00:00 EDT 2000 · OSTI ID:977449

Yoo, A B; Jette, M A

Coscheduling Technique for Symmetric Multiprocessor Clusters

Conference · Mon Sep 18 00:00:00 EDT 2000 · OSTI ID:977449

Yoo, A B; Jette, M A

Related Subjects

99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE
COMMUNICATIONS
COMPUTERS
DESIGN
TOLERANCE
