DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Partial differential equations preconditioner resilient to soft and hard faults

Abstract

We present a domain-decomposition-based preconditioner for the solution of partial differential equations (PDEs) that is resilient to both soft and hard faults. The algorithm reformulates the PDE as a sampling problem, followed by a solution update through data manipulation that is resilient to both soft and hard faults. This reformulation allows us to recast the problem as a set of independent tasks, and exploit data locality to reduce global communication. We discuss two different parallel implementations: (a) a single program multiple data (SPMD) version based on a one-to-one mapping between subdomain and MPI processes responsible for both state and computation; and (b) an asynchronous server–client implementation where all state information is held by the servers and clients are designed solely as computational units. We present a scalability comparison of both implementations under nominal conditions, showing efficiency within ~80% for up to 12,000 cores. We present a resilience analysis under different fault scenarios based on the server–client implementation. This framework provides resiliency to hard faults such that if a client crashes, it stops asking for work, and the servers simply distribute the work among all of the other clients alive. Erroneous subdomain solves (e.g. due to soft faults) appear as corrupted data, which is either rejected if that causes a task to fail, or is seamlessly filtered out during the regression stage through a suitable noise model. Three different types of faults are modeled: hard faults modeling nodes (or clients) crashing; soft faults occurring during the communication of the tasks between server and clients; and soft faults occurring during task execution. We demonstrate the resiliency of the approach for a 2D elliptic PDE, and explore the effect of the faults at various failure rates.
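The hard-fault behavior described in the abstract (a crashed client stops asking for work, and the remaining clients absorb the load) can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the authors' implementation: the function name `run_pull_based`, the `crash_after` parameter, and the round-robin polling order are all hypothetical, and a client is assumed to crash between tasks rather than in the middle of one.

```python
from collections import deque


def run_pull_based(num_tasks, crash_after):
    """Simulate a server handing out independent tasks to clients that pull work.

    crash_after maps a client id to the number of tasks it completes before a
    hard fault (None means the client never crashes). All names are
    hypothetical and chosen for illustration only.
    """
    pending = deque(range(num_tasks))        # server-side state: tasks not yet done
    done_by = {}                             # task id -> client that completed it
    completed = {c: 0 for c in crash_after}  # per-client count of finished tasks
    alive = set(crash_after)

    while pending and alive:
        # sorted() copies the set, so removing crashed clients below is safe.
        for client in sorted(alive):
            if not pending:
                break
            limit = crash_after[client]
            if limit is not None and completed[client] >= limit:
                # Hard fault: the client simply stops asking for work.
                alive.discard(client)
                continue
            task = pending.popleft()         # the client requests the next task
            done_by[task] = client
            completed[client] += 1
    return done_by, completed


if __name__ == "__main__":
    done_by, completed = run_pull_based(10, {0: None, 1: 2, 2: None})
    print(len(done_by), completed)
```

Because the task list lives only on the server side, losing client 1 after two tasks does not lose any state in this toy model: clients 0 and 2 keep pulling until the queue is empty. If every client crashes, the loop ends with tasks still pending, which a real server would have to detect separately.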

Authors:
 Rizzi, F. [1]; Morris, K. [1]; Sargsyan, K. [1]; Mycek, P. [2]; Safta, C. [1]; Le Maître, O. [2]; Knio, O. [2]; Debusschere, B. [1]
  1. Sandia National Lab. (SNL-CA), Livermore, CA (United States)
  2. Duke Univ., Durham, NC (United States)
Publication Date:
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC); Univ. of California, Oakland, CA (United States); Lockheed Martin Corporation, Littleton, CO (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1544016
Grant/Contract Number:  
AC02-05CH11231; AC04-94AL85000
Resource Type:
Accepted Manuscript
Journal Name:
International Journal of High Performance Computing Applications
Additional Journal Information:
Journal Volume: 32; Journal Issue: 5; Journal ID: ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Computer Science

Citation Formats

Rizzi, F., Morris, K., Sargsyan, K., Mycek, P., Safta, C., Le Maître, O., Knio, O., and Debusschere, B. Partial differential equations preconditioner resilient to soft and hard faults. United States: N. p., 2017. Web. doi:10.1177/1094342016684975.
Rizzi, F., Morris, K., Sargsyan, K., Mycek, P., Safta, C., Le Maître, O., Knio, O., & Debusschere, B. Partial differential equations preconditioner resilient to soft and hard faults. United States. doi:10.1177/1094342016684975.
Rizzi, F., Morris, K., Sargsyan, K., Mycek, P., Safta, C., Le Maître, O., Knio, O., and Debusschere, B. Sun . "Partial differential equations preconditioner resilient to soft and hard faults". United States. doi:10.1177/1094342016684975. https://www.osti.gov/servlets/purl/1544016.
@article{osti_1544016,
title = {Partial differential equations preconditioner resilient to soft and hard faults},
author = {Rizzi, F. and Morris, K. and Sargsyan, K. and Mycek, P. and Safta, C. and Le Maître, O. and Knio, O. and Debusschere, B.},
abstractNote = {We present a domain-decomposition-based preconditioner for the solution of partial differential equations (PDEs) that is resilient to both soft and hard faults. The algorithm reformulates the PDE as a sampling problem, followed by a solution update through data manipulation that is resilient to both soft and hard faults. This reformulation allows us to recast the problem as a set of independent tasks, and exploit data locality to reduce global communication. We discuss two different parallel implementations: (a) a single program multiple data (SPMD) version based on a one-to-one mapping between subdomain and MPI processes responsible for both state and computation; and (b) an asynchronous server–client implementation where all state information is held by the servers and clients are designed solely as computational units. We present a scalability comparison of both implementations under nominal conditions, showing efficiency within ~80% for up to 12,000 cores. We present a resilience analysis under different fault scenarios based on the server–client implementation. This framework provides resiliency to hard faults such that if a client crashes, it stops asking for work, and the servers simply distribute the work among all of the other clients alive. Erroneous subdomain solves (e.g. due to soft faults) appear as corrupted data, which is either rejected if that causes a task to fail, or is seamlessly filtered out during the regression stage through a suitable noise model. Three different types of faults are modeled: hard faults modeling nodes (or clients) crashing; soft faults occurring during the communication of the tasks between server and clients; and soft faults occurring during task execution. We demonstrate the resiliency of the approach for a 2D elliptic PDE, and explore the effect of the faults at various failure rates.},
doi = {10.1177/1094342016684975},
journal = {International Journal of High Performance Computing Applications},
number = 5,
volume = 32,
place = {United States},
year = {2017},
month = {1}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Works referenced in this record:

A case for two-level distributed recovery schemes
conference, January 1995

  • Vaidya, Nitin H.
  • Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems - SIGMETRICS '95/PERFORMANCE '95
  • DOI: 10.1145/223587.223596

Algorithm-based fault tolerance applied to high performance computing
journal, April 2009

  • Bosilca, George; Delmas, Rémi; Dongarra, Jack
  • Journal of Parallel and Distributed Computing, Vol. 69, Issue 4
  • DOI: 10.1016/j.jpdc.2008.12.002

Understanding the propagation of hard errors to software and implications for resilient system design
journal, March 2008

  • Li, Man-Lap; Ramachandran, Pradeep; Sahoo, Swarup Kumar
  • ACM SIGOPS Operating Systems Review, Vol. 42, Issue 2
  • DOI: 10.1145/1353535.1346315

Analyzing the soft error resilience of linear solvers on multicore multiprocessors
conference, April 2010

  • Malkowski, Konrad; Raghavan, Padma; Kandemir, Mahmut
  • 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
  • DOI: 10.1109/IPDPS.2010.5470411

Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems
conference, October 2013

  • Engelmann, Christian; Naughton, Thomas
  • 2013 42nd International Conference on Parallel Processing (ICPP)
  • DOI: 10.1109/ICPP.2013.114

Matrix Multiplication on GPUs with On-Line Fault Tolerance
conference, May 2011

  • Ding, Chong; Karlsson, Christer; Liu, Hui
  • 2011 IEEE 9th International Symposium on Parallel and Distributed Processing with Applications (ISPA)
  • DOI: 10.1109/ISPA.2011.50

A Large-Scale Study of Failures in High-Performance Computing Systems
journal, October 2010

  • Schroeder, Bianca; Gibson, Garth A.
  • IEEE Transactions on Dependable and Secure Computing, Vol. 7, Issue 4
  • DOI: 10.1109/TDSC.2009.4

Fault Resilient Domain Decomposition Preconditioner for PDEs
journal, January 2015

  • Sargsyan, Khachik; Rizzi, Francesco; Mycek, Paul
  • SIAM Journal on Scientific Computing, Vol. 37, Issue 5
  • DOI: 10.1137/15M1014474

Application Level Fault Recovery: Using Fault-Tolerant Open MPI in a PDE Solver
conference, May 2014

  • Ali, Md Mohsin; Southern, James; Strazdins, Peter
  • 2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW)
  • DOI: 10.1109/IPDPSW.2014.132

Algorithm-based fault tolerance for dense matrix factorizations
conference, January 2012

  • Du, Peng; Bouteiller, Aurelien; Bosilca, George
  • Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming - PPoPP '12
  • DOI: 10.1145/2145816.2145845

Partial Differential Equations Preconditioner Resilient to Soft and Hard Faults
conference, September 2015

  • Rizzi, Francesco; Morris, Karla; Sargsyan, Khachik
  • 2015 IEEE International Conference on Cluster Computing (CLUSTER)
  • DOI: 10.1109/CLUSTER.2015.103

Abstract Machine Models and Proxy Architectures for Exascale Computing
conference, November 2014

  • Ang, J. A.; Barrett, R. F.; Benner, R. E.
  • 2014 Hardware-Software Co-Design for High Performance Computing (Co-HPC)
  • DOI: 10.1109/Co-HPC.2014.4

Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance
conference, June 2007

  • Shye, Alex; Moseley, Tipp; Reddi, Vijay Janapa
  • 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)
  • DOI: 10.1109/DSN.2007.98

Algorithm-based recovery for iterative methods without checkpointing
conference, January 2011

  • Chen, Zizhong
  • Proceedings of the 20th international symposium on High performance distributed computing - HPDC '11
  • DOI: 10.1145/1996130.1996142

Toward Exascale Resilience
journal, September 2009

  • Cappello, Franck; Geist, Al; Gropp, Bill
  • The International Journal of High Performance Computing Applications, Vol. 23, Issue 4
  • DOI: 10.1177/1094342009347767

Error log analysis: statistical modeling and heuristic trend analysis
journal, January 1990

  • Lin, T. -T. Y.; Siewiorek, D. P.
  • IEEE Transactions on Reliability, Vol. 39, Issue 4
  • DOI: 10.1109/24.58720

Failure data analysis of a large-scale heterogeneous server environment
conference, January 2004

  • Sahoo, R. K.; Squillante, M. S.; Sivasubramaniam, A.
  • International Conference on Dependable Systems and Networks, 2004
  • DOI: 10.1109/DSN.2004.1311948