U.S. Department of Energy
Office of Scientific and Technical Information

3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication

Conference ·
 [1];  [2];  [2];  [3];  [1];  [4];  [2];  [1]
  1. Carnegie Mellon Univ., Pittsburgh, PA (United States)
  2. Univ. of California, Berkeley, CA (United States)
  3. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  4. Pennsylvania State Univ., University Park, PA (United States)
In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault tolerance in high-performance computing (HPC). Coded computing shares its core idea with algorithm-based fault tolerance (ABFT): weaving redundancy into the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplication, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires ~50% less redundancy than replication, while the overhead in execution time is only about 5–10%.
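To make the coding idea in the abstract concrete, the following is a minimal single-process sketch of a generic MatDot-style encoding, assuming the standard polynomial construction: A is split into m column blocks, B into m row blocks, and each worker multiplies one polynomial evaluation of each. It does not reproduce the paper's 3D SUMMA integration or its communication pattern. The function names (`matdot_encode`, `matdot_decode`, `coded_matmul`), the evaluation points, and the simulated `failed` set are all illustrative assumptions, not taken from the paper.

```python
import numpy as np


def matdot_encode(A, B, m, points):
    """Return one encoded pair (pA(x), pB(x)) per evaluation point x.

    Assumes the inner dimension of A and B is divisible by m.
    pA(x) = sum_i A_i x^i and pB(x) = sum_j B_j x^(m-1-j), so the
    coefficient of x^(m-1) in pA(x) pB(x) equals sum_i A_i B_i = A @ B.
    """
    A_blocks = np.split(A, m, axis=1)  # m column blocks of A
    B_blocks = np.split(B, m, axis=0)  # m row blocks of B
    encoded = []
    for x in points:
        pa = sum(Ai * x**i for i, Ai in enumerate(A_blocks))
        pb = sum(Bj * x ** (m - 1 - j) for j, Bj in enumerate(B_blocks))
        encoded.append((pa, pb))
    return encoded


def matdot_decode(results, m):
    """Recover A @ B from any 2m-1 surviving (x, product) pairs."""
    k = 2 * m - 1  # the product polynomial has degree 2m-2
    if len(results) < k:
        raise ValueError("too many failures: need at least 2m-1 results")
    xs = [x for x, _ in results[:k]]
    shape = results[0][1].shape
    # Interpolate every matrix entry at once via a Vandermonde solve.
    V = np.vander(xs, k, increasing=True)  # V[r, c] = xs[r] ** c
    Y = np.stack([prod.ravel() for _, prod in results[:k]])
    coeffs = np.linalg.solve(V, Y)
    return coeffs[m - 1].reshape(shape)


def coded_matmul(A, B, m, points, failed=()):
    """Simulate workers; those whose index is in `failed` return nothing."""
    encoded = matdot_encode(A, B, m, points)
    results = [
        (x, pa @ pb)  # the work a single node would perform
        for idx, (x, (pa, pb)) in enumerate(zip(points, encoded))
        if idx not in failed
    ]
    return matdot_decode(results, m)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 6))
    B = rng.standard_normal((6, 4))
    # Five workers, m = 2: any 3 results suffice, so 2 failures are tolerated.
    C = coded_matmul(A, B, m=2, points=[-1.0, -0.5, 0.0, 0.5, 1.0], failed={1, 3})
    print(np.allclose(C, A @ B))
```

In this sketch, five workers with m = 2 survive any two failures because the product polynomial has degree 2, so three evaluations determine it. Real-valued Vandermonde systems become ill-conditioned as m grows, so a practical implementation would need to choose its evaluation points with care.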
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE; USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1651320
Country of Publication:
United States
Language:
English

Similar Records

Multi-fault Tolerance for Cartesian Data Distributions
Journal Article · June 1, 2013 · International Journal of Parallel Programming, 41(3):469-493 · OSTI ID:1064566

Supporting the Development of Resilient Message Passing Applications using Simulation
Conference · December 31, 2013 · OSTI ID:1131524

Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication
Conference · May 1, 2016 · Proceedings - 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) · OSTI ID:1769300