U.S. Department of Energy
Office of Scientific and Technical Information

Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner

Journal Article · Parallel Computing
 [1];  [1];  [1];  [2];  [1];  [3];  [4];  [1]
  1. Sandia National Laboratories (SNL-CA), Livermore, CA (United States)
  2. Duke Univ., Durham, NC (United States)
  3. Centre National de la Recherche Scientifique (CNRS), Orsay (France). Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI)
  4. Duke Univ., Durham, NC (United States); King Abdullah University of Science and Technology (KAUST), Thuwal (Saudi Arabia)
Here, we discuss algorithm-based resilience to silent data corruptions (SDCs) in a task-based domain-decomposition preconditioner for partial differential equations (PDEs). The algorithm exploits a reformulation of the PDE as a sampling problem, followed by a solution update through data manipulation that is resilient to SDCs. The implementation is based on a server-client model in which all state information is held by the servers, while clients are designed solely as computational units. Scalability tests run on up to ~51K cores show a parallel efficiency greater than 90%. We use a 2D elliptic PDE and a fault model based on random single and double bit-flips to demonstrate the resilience of the application to synthetically injected SDCs. We discuss two fault scenarios: one based on the corruption of all data of a target task, and the other involving the corruption of a single data point. We show that for our application, given the test problem considered, a four-fold increase in the number of faults raises the overhead needed to overcome them by only 2 percentage points, from 7% to 9%. We then discuss potential savings in energy consumption via dynamic voltage/frequency scaling, and its interplay with fault rates and application overhead.
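The fault model named in the abstract (random single and double bit-flips, applied either to all data of a task or to a single data point) can be illustrated with a minimal Python sketch. This is not the authors' injection code: the function names `flip_bits`, `inject_sdc`, and `corrupt_task_data` are hypothetical, and the sketch assumes IEEE 754 double-precision values with bit positions drawn uniformly at random, neither of which the abstract specifies.

```python
import random
import struct


def flip_bits(value, positions):
    """Return `value` with the given bit positions flipped (0 = least significant).

    Assumes `value` is stored as an IEEE 754 double (64 bits).
    """
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    for p in positions:
        bits ^= 1 << p
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", bits))
    return corrupted


def inject_sdc(value, n_flips=1, rng=random):
    """Flip `n_flips` distinct, uniformly chosen bits (1 = single, 2 = double)."""
    positions = rng.sample(range(64), n_flips)
    return flip_bits(value, positions)


def corrupt_task_data(data, whole_task, n_flips=1, rng=random):
    """Hypothetical version of the two scenarios described in the abstract.

    whole_task=True  : every value produced by the task is corrupted.
    whole_task=False : one randomly chosen data point is corrupted.
    """
    out = list(data)
    if whole_task:
        return [inject_sdc(v, n_flips, rng) for v in out]
    i = rng.randrange(len(out))
    out[i] = inject_sdc(out[i], n_flips, rng)
    return out
```

The effect of a flip depends strongly on which bit is hit: flipping the sign bit of 1.0 gives -1.0, while flipping the lowest mantissa bit changes the value only in its last digit. That spread is why a resilient algorithm has to cope with both large and nearly invisible corruptions.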
Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC)
Sponsoring Organization:
USDOE; USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE Office of Science (SC), Basic Energy Sciences (BES). Scientific User Facilities (SUF)
Grant/Contract Number:
AC02-05CH11231; AC04-94AL85000
OSTI ID:
1478742
Alternate ID(s):
OSTI ID: 1550083
Journal Information:
Journal Name: Parallel Computing; Journal Volume: 73; Journal Issue: C; ISSN 0167-8191
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
