U.S. Department of Energy
Office of Scientific and Technical Information

Parallel reduction to Hessenberg form with algorithm-based fault tolerance

Conference · 2013 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC)
This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. We prove the correctness and numerical stability of the approach in the context of Hessenberg reduction (HR) and present scalability and performance results for a practical implementation. Our method is a hybrid algorithm that combines an Algorithm-Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine), our fault-tolerant algorithm introduces very little overhead and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases.
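The checksum idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it shows only the basic ABFT invariant, that a row-sum checksum column appended to a matrix stays valid under a left-side linear update and can then rebuild one lost column. It does not cover the right-side update of a two-sided factorization, the distributed ScaLAPACK layout, or diskless checkpointing. The matrices `A` and `L` and all function names are hypothetical, chosen for illustration.

```python
def matmul(X, Y):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def append_checksum(A):
    """Append one extra column holding the sum of each row."""
    return [row + [sum(row)] for row in A]


def recover_column(Ac, lost):
    """Rebuild data column `lost` from the checksum column and the survivors."""
    n = len(Ac[0]) - 1  # number of data columns; index n is the checksum
    repaired = [row[:] for row in Ac]
    for row in repaired:
        survivors = sum(row[j] for j in range(n) if j != lost)
        row[lost] = row[n] - survivors
    return repaired


# Hypothetical data: a small integer matrix and a left-side update.
A = [[4, 1, 2],
     [3, 5, 1],
     [0, 2, 6]]
L = [[1, 0, 0],
     [0, 1, 0],
     [0, -2, 1]]

Ac = append_checksum(A)

# Apply the update to data and checksum together: L @ [A, A e] = [L A, (L A) e],
# so the last column is still the row sum of the updated data.
updated = matmul(L, Ac)
checksum_ok = all(sum(row[:-1]) == row[-1] for row in updated)

# Simulate the loss of data column 1, then rebuild it from the checksum.
damaged = [row[:] for row in updated]
for row in damaged:
    row[1] = None
recovered = recover_column(damaged, lost=1)
```

In this sketch a single checksum column tolerates the loss of one data column. Integer entries are used so that the recovered values match the originals exactly; with floating-point data the comparison would need a tolerance.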
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Organization:
USDOE Office of Science (SC)
OSTI ID:
1567343
Conference Information:
Journal Name: 2013 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC)
Country of Publication:
United States
Language:
English

Similar Records

New-Sum: A Novel Online ABFT Scheme For General Iterative Methods
Conference · Tue May 31 00:00:00 EDT 2016 · OSTI ID:1322529

Algorithm-Based Fault Tolerance for Convolutional Neural Networks
Journal Article · Wed Dec 30 19:00:00 EST 2020 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1775093

Multi-fault Tolerance for Cartesian Data Distributions
Journal Article · Sat Jun 01 00:00:00 EDT 2013 · International Journal of Parallel Programming, 41(3):469-493 · OSTI ID:1064566
