skip to main content

DOE PAGESDOE PAGES

Title: Epidemic failure detection and consensus for extreme parallelism

Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum s User Level Failure Mitigation proposal has introduced an operation, MPI Comm shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI Comm shrink operation requires a failure detection and consensus algorithm. This paper presents three novel failure detection and consensus algorithms using Gossiping. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that in all algorithms the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus. The third approach is a three-phase distributed failure detection and consensus algorithm and provides consistency guarantees even in very large and extreme-scale systems while at the same time being memory and bandwidth efficient.
Authors:
 [1] ;  [1] ;  [2] ;  [2]
  1. Univ. of Reading (United Kingdom)
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Publication Date:
Grant/Contract Number:
AC05-00OR22725
Type:
Accepted Manuscript
Journal Name:
International Journal of High Performance Computing Applications
Additional Journal Information:
Journal Name: International Journal of High Performance Computing Applications; Journal ID: ISSN 1094-3420
Publisher:
SAGE
Research Org:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org:
USDOE Office of Science (SC)
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING
OSTI Identifier:
1361330