DOE PAGES

Title: Versioned distributed arrays for resilience in scientific applications: Global view resilience

Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR’s interfaces to distributed arrays, versioning, and cross-layer error recovery. Using several large applications (OpenMC, the preconditioned conjugate gradient solver PCG, ddcMD, and Chombo), we evaluate the programmer effort required to add resilience. The required changes are small (<2% LOC), localized, and machine-independent, and they require no changes to software architecture. We also measure the overhead of adding GVR versioning and show that overheads below 2% are generally achieved. We conclude that GVR’s interfaces and implementation are flexible and portable, and that they create a gentle-slope path to tolerating growing error rates in future systems.
Authors:
 [1] ;  [2] ;  [2] ;  [1] ;  [3] ;  [1] ;  [2] ;  [3] ;  [4] ;  [4] ;  [5] ;  [5] ;  [6] ;  [6] ;  [7] ;  [7] ;  [8] ;  [8] ;  [8] ;  [2]
  1. Univ. of Chicago. Chicago, IL (United States); Argonne National Lab. (ANL), Argonne, IL (United States)
  2. Argonne National Lab. (ANL), Argonne, IL (United States)
  3. Univ. of Chicago. Chicago, IL (United States)
  4. Intel Corp. Santa Clara, CA (United States)
  5. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  6. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  7. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
  8. Univ. of Chicago. Chicago, IL (United States); Argonne National Lab. (ANL), Argonne, IL (United States); Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Intel Corp. Santa Clara, CA (United States); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Publication Date:
OSTI Identifier:
1196863
Grant/Contract Number:
SC0008603; AC02-06CH11357; AC02-05CH11231
Type:
Accepted Manuscript
Journal Name:
Procedia Computer Science
Additional Journal Information:
Journal Volume: 51; Journal Issue: C; Conference: International Conference On Computational Science (ICCS 2015). Computational Science at the Gates of Nature, Reykjavik (Iceland), 1-3 Jun 2015; Journal ID: ISSN 1877-0509
Publisher:
Elsevier
Research Org:
Argonne National Laboratory (ANL), Argonne, IL (United States); Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; resilience; fault tolerance; exascale; scalable computing; application-based fault tolerance