Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner
Journal Article
· Parallel Computing
- Sandia National Laboratories (SNL-CA), Livermore, CA (United States)
- Duke Univ., Durham, NC (United States)
- Centre National de la Recherche Scientifique (CNRS), Orsay (France). Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI)
- Duke Univ., Durham, NC (United States); King Abdullah University of Science and Technology (KAUST), Thuwal (Saudi Arabia)
Here, we discuss algorithm-based resilience to silent data corruptions (SDCs) in a task-based domain-decomposition preconditioner for partial differential equations (PDEs). The algorithm exploits a reformulation of the PDE as a sampling problem, followed by a solution update through data manipulation that is resilient to SDCs. The implementation is based on a server-client model in which all state information is held by the servers, while clients are designed solely as computational units. Scalability tests on up to ~51K cores show a parallel efficiency greater than 90%. We use a 2D elliptic PDE and a fault model based on random single and double bit-flips to demonstrate the resilience of the application to synthetically injected SDCs. We discuss two fault scenarios: one based on the corruption of all data of a target task, and the other involving the corruption of a single data point. We show that for our application, given the test problem considered, a four-fold increase in the number of faults raises the overhead needed to overcome them by only 2 percentage points, from 7% to 9%. We then discuss potential savings in energy consumption via dynamic voltage/frequency scaling, and their interplay with fault rates and application overhead.
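The fault model named in the abstract (random single and double bit-flips, applied either to all data of a task or to a single data point) can be illustrated with a minimal sketch. This is not the authors' injection code: the function names `flip_bits`, `corrupt_single_point`, and `corrupt_all` are hypothetical, and the sketch assumes the task data are 64-bit IEEE 754 doubles held in a plain Python list.

```python
import random
import struct


def flip_bits(value: float, n_flips: int, rng: random.Random) -> float:
    """Return `value` with `n_flips` distinct, randomly chosen bits inverted."""
    # Reinterpret the 64-bit double as an unsigned integer (assumed layout).
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    # rng.sample picks distinct positions, so a double flip never cancels itself.
    for pos in rng.sample(range(64), n_flips):
        bits ^= 1 << pos
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", bits))
    return corrupted


def corrupt_single_point(data, n_flips, rng):
    """Hypothetical scenario: corrupt one randomly chosen entry of a task's data."""
    out = list(data)
    i = rng.randrange(len(out))
    out[i] = flip_bits(out[i], n_flips, rng)
    return out


def corrupt_all(data, n_flips, rng):
    """Hypothetical scenario: corrupt every entry of a task's data."""
    return [flip_bits(x, n_flips, rng) for x in data]
```

Calling `corrupt_single_point(values, 1, random.Random(0))` would model a single bit-flip in one data point, while `corrupt_all(values, 2, random.Random(0))` would model double bit-flips across a whole task. A flip in the exponent bits can change a value by many orders of magnitude, whereas a flip in the low mantissa bits may be nearly invisible, which is why such corruptions can remain silent.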
- Research Organization:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC)
- Sponsoring Organization:
- USDOE; USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE Office of Science (SC), Basic Energy Sciences (BES). Scientific User Facilities (SUF)
- Grant/Contract Number:
- AC02-05CH11231; AC04-94AL85000
- OSTI ID:
- 1478742
- Alternate ID(s):
- OSTI ID: 1550083
- Journal Information:
- Journal Name: Parallel Computing; Journal Volume: 73; Journal Issue: C; ISSN 0167-8191
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English
Similar Records
Exploring the Interplay of Resilience and Energy Consumption for a Task-Based Partial Differential Equations Preconditioner
Technical Report
·
Mon Feb 29 23:00:00 EST 2016
·
OSTI ID:1561016
Partial differential equations preconditioner resilient to soft and hard faults
Journal Article
·
Sat Jan 28 19:00:00 EST 2017
· International Journal of High Performance Computing Applications
·
OSTI ID:1544016
ULFM-MPI Implementation of a Resilient Task-Based Partial Differential Equations Preconditioner [Poster]
Technical Report
·
Sun May 01 00:00:00 EDT 2016
·
OSTI ID:1561476