Partial differential equations preconditioner resilient to soft and hard faults

Rizzi, F.; Morris, K.; Sargsyan, K.; Mycek, P.; Safta, C.; Le Maître, O.; Knio, O.; Debusschere, B.

doi:10.1177/1094342016684975

Journal Article · January 29, 2017 · International Journal of High Performance Computing Applications

DOI: https://doi.org/10.1177/1094342016684975 · OSTI ID: 1544016

Rizzi, F. [1]; Morris, K. [1]; Sargsyan, K. [1]; Mycek, P. [2]; Safta, C. [1]; Le Maître, O. [2]; Knio, O. [2]; Debusschere, B. [1]

[1] Sandia National Lab. (SNL-CA), Livermore, CA (United States)
[2] Duke Univ., Durham, NC (United States)

We present a domain-decomposition-based preconditioner for the solution of partial differential equations (PDEs) that is resilient to both soft and hard faults. The algorithm reformulates the PDE as a sampling problem, followed by a solution update through data manipulation that is itself resilient to both kinds of fault. This reformulation allows us to recast the problem as a set of independent tasks and to exploit data locality to reduce global communication. We discuss two parallel implementations: (a) a single program multiple data (SPMD) version based on a one-to-one mapping between subdomains and MPI processes, which are responsible for both state and computation; and (b) an asynchronous server–client implementation in which all state information is held by the servers and the clients serve solely as computational units. We present a scalability comparison of both implementations under nominal conditions, showing efficiency within ~80% for up to 12,000 cores. We then present a resilience analysis under different fault scenarios, based on the server–client implementation. This framework provides resiliency to hard faults: if a client crashes, it stops asking for work, and the servers simply distribute the work among the clients that remain alive. Erroneous subdomain solves (e.g., due to soft faults) appear as corrupted data, which is either rejected if it causes a task to fail or seamlessly filtered out during the regression stage through a suitable noise model. Three types of faults are modeled: hard faults, in which nodes (or clients) crash; soft faults occurring during the communication of tasks between servers and clients; and soft faults occurring during task execution. We demonstrate the resiliency of the approach for a 2D elliptic PDE and explore the effect of the faults at various failure rates.
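
The two fault-handling mechanisms described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' implementation, and every name in it (`subdomain_solve`, `run_server`, `theil_sen`) is hypothetical. It assumes a trivial linear map as a stand-in for a subdomain solve, a round-based server loop in place of asynchronous MPI communication, and a Theil–Sen median fit as a stand-in for the regression stage, since the record does not specify the paper's noise model. Tasks that are in flight when a client crashes are not modeled.

```python
from collections import deque
from statistics import median


def subdomain_solve(x):
    """Stand-in for a subdomain solve (assumption): the linear map y = 2x + 1."""
    return 2.0 * x + 1.0


def run_server(samples, n_clients, crash_at, corrupt_tasks):
    """Hand out tasks to clients, round by round, until all tasks are done.

    crash_at:      {client_id: round}; the client dies at the start of that
                   round (hard fault) and never asks for work again.
    corrupt_tasks: set of task ids whose result is silently corrupted
                   (soft fault during task execution).
    """
    pending = deque(range(len(samples)))
    results = {}
    done_by = {c: 0 for c in range(n_clients)}
    alive = set(range(n_clients))
    rnd = 0
    while pending and alive:
        for c in sorted(alive):  # sorted() copies, so discarding below is safe
            if crash_at.get(c) == rnd:
                alive.discard(c)  # crashed client simply stops requesting work
                continue
            if not pending:
                break
            task = pending.popleft()
            y = subdomain_solve(samples[task])
            if task in corrupt_tasks:
                y += 100.0  # corrupted data reaches the server unnoticed
            results[task] = y
            done_by[c] += 1
        rnd += 1
    return results, done_by


def theil_sen(xs, ys):
    """Robust line fit: median of pairwise slopes, then median intercept."""
    slopes = [
        (ys[j] - ys[i]) / (xs[j] - xs[i])
        for i in range(len(xs))
        for j in range(i + 1, len(xs))
        if xs[j] != xs[i]
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


samples = [i / 10 for i in range(20)]
results, done_by = run_server(
    samples, n_clients=3, crash_at={1: 2}, corrupt_tasks={3, 11}
)
order = sorted(results)
xs = [samples[t] for t in order]
ys = [results[t] for t in order]
slope, intercept = theil_sen(xs, ys)
print(done_by, round(slope, 6), round(intercept, 6))
```

In this toy run, client 1 stops requesting work after two tasks and the two surviving clients absorb the rest, so every sample is still solved. The two corrupted results remain in the data set, but the median-based fit is insensitive to them and recovers the underlying map.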

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC); Lockheed Martin Corporation, Littleton, CO (United States); Univ. of California, Oakland, CA (United States)

Sponsoring Organization: USDOE Office of Science (SC)

Grant/Contract Number: AC02-05CH11231; AC04-94AL85000

OSTI ID: 1544016

Journal Information: International Journal of High Performance Computing Applications, Vol. 32, Issue 5; ISSN 1094-3420

Publisher: SAGE

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
Computer Science
