OSTI.GOV
U.S. Department of Energy, Office of Scientific and Technical Information

Title: Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner

Abstract

We discuss algorithm-based resilience to silent data corruptions (SDCs) in a task-based domain-decomposition preconditioner for partial differential equations (PDEs). The algorithm exploits a reformulation of the PDE as a sampling problem, followed by a solution update through data manipulation that is resilient to SDCs. The implementation is based on a server-client model where all state information is held by the servers, while clients are designed solely as computational units. Scalability tests run on up to ~51K cores show a parallel efficiency greater than 90%. We use a 2D elliptic PDE and a fault model based on random single and double bit-flips to demonstrate the resilience of the application to synthetically injected SDCs. We discuss two fault scenarios: one based on the corruption of all data of a target task, and the other involving the corruption of a single data point. We show that for our application, given the test problem considered, a four-fold increase in the number of faults yields only a 2-percentage-point change in the overhead to overcome their presence, from 7% to 9%. We then discuss potential savings in energy consumption via dynamic voltage/frequency scaling, and its interplay with fault rates and application overhead.
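
The fault model named in the abstract (random single and double bit-flips, applied either to one data point or to all data of a task) can be illustrated with a minimal sketch. This is not the authors' injection code, which this record does not describe. It assumes the task data are IEEE 754 double-precision values, and the function names `flip_bits`, `bit_distance`, `corrupt_single_point`, and `corrupt_all` are hypothetical.

```python
import random
import struct


def _to_bits(value: float) -> int:
    """Reinterpret a 64-bit double as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def flip_bits(value: float, n_flips: int, rng: random.Random) -> float:
    """Flip n_flips distinct, uniformly chosen bits of a double (assumed model)."""
    bits = _to_bits(value)
    for pos in rng.sample(range(64), n_flips):
        bits ^= 1 << pos
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def bit_distance(a: float, b: float) -> int:
    """Number of bit positions in which two doubles differ."""
    return bin(_to_bits(a) ^ _to_bits(b)).count("1")


def corrupt_single_point(data, n_flips, rng):
    """Scenario sketch: corrupt one randomly chosen entry of a task's data."""
    out = list(data)
    i = rng.randrange(len(out))
    out[i] = flip_bits(out[i], n_flips, rng)
    return out


def corrupt_all(data, n_flips, rng):
    """Scenario sketch: corrupt every entry of a task's data."""
    return [flip_bits(x, n_flips, rng) for x in data]


if __name__ == "__main__":
    rng = random.Random(0)
    task_data = [0.25, 1.0, 3.5, -2.0]
    print(corrupt_single_point(task_data, 1, rng))  # single bit-flip, one point
    print(corrupt_all(task_data, 2, rng))           # double bit-flip, all points
```

Depending on which bit is hit, a flip may change a value only in its last digits or turn it into a very large number, an infinity, or a NaN, which is why such corruptions can be silent.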

Authors:
Rizzi, F. [1]; Morris, K. [1]; Sargsyan, K. [1]; Mycek, P. [2]; Safta, C. [1]; Le Maître, O. [3]; Knio, O. M. [4]; Debusschere, B. J. [1]
  1. Sandia National Lab. (SNL-CA), Livermore, CA (United States)
  2. Duke Univ., Durham, NC (United States)
  3. LIMSI, Orsay (France)
  4. Duke Univ., Durham, NC (United States); King Abdullah Univ. of Science and Technology, Thuwal (Saudi Arabia)
Publication Date:
Research Org.:
Lawrence Berkeley National Laboratory-National Energy Research Scientific Computing Center
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1478742
DOE Contract Number:  
AC04-94AL85000; AC02-05CH11231
Resource Type:
Journal Article
Journal Name:
Parallel Computing
Additional Journal Information:
Journal Volume: 73; Journal Issue: C; Journal ID: ISSN 0167-8191
Publisher:
Elsevier
Country of Publication:
United States
Language:
English

Citation Formats

Rizzi, F., Morris, K., Sargsyan, K., Mycek, P., Safta, C., Le Maître, O., Knio, O. M., and Debusschere, B. J. Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner. United States: N. p., 2018. Web. doi:10.1016/j.parco.2017.05.005.
Rizzi, F., Morris, K., Sargsyan, K., Mycek, P., Safta, C., Le Maître, O., Knio, O. M., & Debusschere, B. J. Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner. United States. doi:10.1016/j.parco.2017.05.005.
Rizzi, F., Morris, K., Sargsyan, K., Mycek, P., Safta, C., Le Maître, O., Knio, O. M., and Debusschere, B. J. 2018. "Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner". United States. doi:10.1016/j.parco.2017.05.005.
@article{osti_1478742,
title = {Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner},
author = {Rizzi, F. and Morris, K. and Sargsyan, K. and Mycek, P. and Safta, C. and Le Maître, O. and Knio, O. M. and Debusschere, B. J.},
abstractNote = {We discuss algorithm-based resilience to silent data corruptions (SDCs) in a task-based domain-decomposition preconditioner for partial differential equations (PDEs). The algorithm exploits a reformulation of the PDE as a sampling problem, followed by a solution update through data manipulation that is resilient to SDCs. The implementation is based on a server-client model where all state information is held by the servers, while clients are designed solely as computational units. Scalability tests run on up to ~51K cores show a parallel efficiency greater than 90%. We use a 2D elliptic PDE and a fault model based on random single and double bit-flips to demonstrate the resilience of the application to synthetically injected SDCs. We discuss two fault scenarios: one based on the corruption of all data of a target task, and the other involving the corruption of a single data point. We show that for our application, given the test problem considered, a four-fold increase in the number of faults yields only a 2-percentage-point change in the overhead to overcome their presence, from 7% to 9%. We then discuss potential savings in energy consumption via dynamic voltage/frequency scaling, and its interplay with fault rates and application overhead.},
doi = {10.1016/j.parco.2017.05.005},
journal = {Parallel Computing},
issn = {0167-8191},
number = {C},
volume = 73,
place = {United States},
year = {2018},
month = {4}
}