3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication
- Carnegie Mellon Univ., Pittsburgh, PA (United States)
- Univ. of California, Berkeley, CA (United States)
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Pennsylvania State Univ., University Park, PA (United States)
In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that tolerates more failures than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault tolerance in high-performance computing (HPC). Coded computing shares its core idea with algorithm-based fault tolerance (ABFT): weaving redundancy into the computation using error-correcting codes. In particular, we show that MatDot codes, a code construction for parallel matrix multiplication, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires ~50% less redundancy than replication, while the overhead in execution time is only about 5–10%.
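The abstract names MatDot codes without describing them, so a small illustration of the general MatDot idea may help. The sketch below is not the paper's 3D Coded SUMMA and does not model its communication pattern; it only shows, under the standard MatDot construction, how a product AB can be recovered from any 2m-1 of the worker results. The function names (`matdot_encode`, `matdot_decode`), the evaluation points, and the choice of five workers with m = 2 are all assumptions made for illustration.

```python
from fractions import Fraction

def matmul(X, Y):
    # Plain dense product of two list-of-lists matrices.
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add_scaled(acc, M, c):
    # Returns acc + c * M, element-wise.
    return [[a + c * v for a, v in zip(ra, rm)] for ra, rm in zip(acc, M)]

def zeros(rows, cols):
    return [[Fraction(0)] * cols for _ in range(rows)]

def matdot_encode(A, B, m, points):
    # Split A into m column blocks and B into m row blocks, then evaluate
    # p_A(x) = sum_i A_i x^i and p_B(x) = sum_j B_j x^(m-1-j) at each point.
    w = len(B) // m  # block width; assumes the inner dimension divides by m
    A_blocks = [[row[i * w:(i + 1) * w] for row in A] for i in range(m)]
    B_blocks = [B[j * w:(j + 1) * w] for j in range(m)]
    tasks = []
    for x in points:
        x = Fraction(x)
        pA, pB = zeros(len(A), w), zeros(w, len(B[0]))
        for i in range(m):
            pA = add_scaled(pA, A_blocks[i], x ** i)
            pB = add_scaled(pB, B_blocks[i], x ** (m - 1 - i))
        tasks.append((pA, pB))
    return tasks

def matdot_decode(results, m):
    # results maps an evaluation point to the worker's product p_A(x) p_B(x).
    # The product polynomial has degree 2m-2, and its x^(m-1) coefficient
    # equals sum_i A_i B_i = AB, so any 2m-1 results suffice.
    pts = list(results)[:2 * m - 1]
    if len(pts) < 2 * m - 1:
        raise ValueError("too many failures: need at least 2m-1 results")
    first = results[pts[0]]
    C = zeros(len(first), len(first[0]))
    for xi in pts:
        # Build the Lagrange basis numerator prod_{j != i} (x - xj),
        # coefficients stored from low to high degree.
        coeffs, denom = [Fraction(1)], Fraction(1)
        for xj in pts:
            if xj == xi:
                continue
            nxt = [Fraction(0)] * (len(coeffs) + 1)
            for d, c in enumerate(coeffs):
                nxt[d] -= c * xj
                nxt[d + 1] += c
            coeffs = nxt
            denom *= (xi - xj)
        C = add_scaled(C, results[xi], coeffs[m - 1] / denom)
    return C

# Hypothetical demo: 5 workers, m = 2, so any 3 results are enough.
A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B = [[1, 0], [0, 1], [2, 3], [4, 5]]
m, points = 2, [1, 2, 3, 4, 5]
tasks = matdot_encode(A, B, m, points)
results = {x: matmul(pA, pB) for x, (pA, pB) in zip(points, tasks)}
del results[2], results[5]  # simulate two failed workers
C = matdot_decode(results, m)
```

Exact rational arithmetic (`fractions.Fraction`) is used here only to keep the interpolation free of rounding error in a toy setting; a practical implementation would have to manage the numerical conditioning of the interpolation in floating point.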
- Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization: USDOE; USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
- DOE Contract Number: AC05-00OR22725
- OSTI ID: 1651320
- Country of Publication: United States
- Language: English
Similar Records
Multi-fault Tolerance for Cartesian Data Distributions
Journal Article · Sat Jun 01 00:00:00 EDT 2013 · International Journal of Parallel Programming, 41(3):469-493 · OSTI ID:1064566
Supporting the Development of Resilient Message Passing Applications using Simulation
Conference · Tue Dec 31 23:00:00 EST 2013 · OSTI ID:1131524
Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication
Conference · Sun May 01 00:00:00 EDT 2016 · Proceedings - 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) · OSTI ID:1769300