OSTI.GOV | U.S. Department of Energy, Office of Scientific and Technical Information

Title: A Distributed-Memory Package for Dense Hierarchically Semi-Separable Matrix Computations Using Randomization

Abstract

In this paper, we present a distributed-memory library for computations with dense structured matrices. A matrix is considered structured if its off-diagonal blocks can be approximated by matrices of low numerical rank. Here, we use Hierarchically Semi-Separable (HSS) representations. Such matrices appear in many applications, for example finite-element and boundary-element methods. Exploiting this structure allows for fast solution of linear systems, fast computation of matrix-vector products, or both, which are the two main building blocks of matrix computations. The compression algorithm that we use, which computes the HSS form of an input dense matrix, relies on randomized sampling with a novel adaptive sampling mechanism. We discuss the parallelization of this algorithm and also present the parallelization of the structured matrix-vector product, structured factorization, and solution routines. The efficiency of the approach is demonstrated on large problems from different academic and industrial applications, on up to 8,000 cores. Finally, this work is part of a broader effort, the STRUctured Matrices PACKage (STRUMPACK), a software package for computations with sparse and dense structured matrices. Hence, although useful in their own right, the routines also represent a step toward a distributed-memory sparse solver.
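To make the randomized compression idea concrete, here is a minimal NumPy sketch of compressing one numerically low-rank off-diagonal block from matrix-vector products with random vectors, adding samples adaptively until a tolerance is met. All names (adaptive_randomized_range, tol, blocksize, the kernel used for B) are illustrative assumptions, not STRUMPACK's API, and the stopping rule is a simplified stand-in for the paper's adaptive sampling mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_randomized_range(matvec, n, tol=1e-8, blocksize=8, maxrank=200):
    """Build an orthonormal basis Q for the range of B using only products
    B @ Omega with blocks of random Gaussian vectors, growing the sample
    until the part of the new samples not captured by Q drops below tol."""
    Q = np.zeros((n, 0))
    while Q.shape[1] < maxrank:
        Omega = rng.standard_normal((n, blocksize))  # fresh random sample vectors
        Y = matvec(Omega)                            # S = B @ Omega
        Y -= Q @ (Q.T @ Y)                           # subtract what Q already spans
        if np.linalg.norm(Y) <= tol:                 # nothing significant left: stop
            break
        Q = np.linalg.qr(np.hstack([Q, Y]))[0]       # grow and re-orthonormalize
    return Q

# Toy "off-diagonal block": numerically low rank because the kernel is smooth
# and the two point sets are well separated.
n = 500
x, y = np.linspace(0.0, 1.0, n), np.linspace(2.0, 3.0, n)
B = 1.0 / np.abs(x[:, None] - y[None, :])

Q = adaptive_randomized_range(lambda W: B @ W, n)
Z = Q.T @ B                                          # small factor: B ~ Q @ Z
v = rng.standard_normal(n)
err = np.linalg.norm(Q @ (Z @ v) - B @ v) / np.linalg.norm(B @ v)
print("rank used:", Q.shape[1], " fast matvec relative error:", err)
```

In an HSS construction the analogous sampling is applied hierarchically, with one set of random products reused across all off-diagonal blocks, which is what keeps the compression cost close to that of a few matrix-vector multiplies.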

Authors:
 Rouet, François-Henry [1]; Li, Xiaoye S. [1]; Ghysels, Pieter [1]; Napov, Artem [2]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  2. Univ. libre de Bruxelles (ULB), Brussels (Belgium)
Publication Date:
June 2016
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1393046
Grant/Contract Number:
AC02-05CH11231
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
ACM Transactions on Mathematical Software
Additional Journal Information:
Journal Volume: 42; Journal Issue: 4; Journal ID: ISSN 0098-3500
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English
Subject:
98 NUCLEAR DISARMAMENT, SAFEGUARDS, AND PHYSICAL PROTECTION; mathematical software; solvers; design; algorithms; performance; HSS matrices; randomized sampling; ULV factorization; parallel algorithms; distributed-memory

Citation Formats

Rouet, François-Henry, Li, Xiaoye S., Ghysels, Pieter, and Napov, Artem. A Distributed-Memory Package for Dense Hierarchically Semi-Separable Matrix Computations Using Randomization. United States: N. p., 2016. Web. doi:10.1145/2930660.
Rouet, François-Henry, Li, Xiaoye S., Ghysels, Pieter, & Napov, Artem. A Distributed-Memory Package for Dense Hierarchically Semi-Separable Matrix Computations Using Randomization. United States. doi:10.1145/2930660.
Rouet, François-Henry, Li, Xiaoye S., Ghysels, Pieter, and Napov, Artem. 2016. "A Distributed-Memory Package for Dense Hierarchically Semi-Separable Matrix Computations Using Randomization". United States. doi:10.1145/2930660. https://www.osti.gov/servlets/purl/1393046.
@article{osti_1393046,
title = {A Distributed-Memory Package for Dense Hierarchically Semi-Separable Matrix Computations Using Randomization},
author = {Rouet, François-Henry and Li, Xiaoye S. and Ghysels, Pieter and Napov, Artem},
doi = {10.1145/2930660},
journal = {ACM Transactions on Mathematical Software},
number = 4,
volume = 42,
place = {United States},
year = 2016,
month = 6
}

Journal Article: Free Publicly Available Full Text; Publisher's Version of Record: doi:10.1145/2930660

Similar Records:
  • Parallelizing dense matrix computations to distributed memory architectures is a well-studied subject and generally considered to be among the best understood domains of parallel computing. Two packages, developed in the mid 1990s, still enjoy regular use: ScaLAPACK and PLAPACK. With the advent of many-core architectures, which may very well take the shape of distributed memory architectures within a single processor, these packages must be revisited since the traditional MPI-based approaches will likely need to be extended. Thus, this is a good time to review lessons learned since the introduction of these two packages and to propose a simple yet effective alternative. Preliminary performance results show the new solution achieves competitive, if not superior, performance on large clusters.
  • The rapid progress of microprocessors provides economic solutions for small and medium-scale data processing tasks, e.g., workstations. It is a challenging task to combine many powerful microprocessors into a fixed or reconfigurable array that is able to process very large tasks with supercomputer performance. Fortunately, many very large applications are regularly structured and can easily be partitioned. One example is physical phenomena, which are often described by mathematical models, e.g., by sets of partial differential equations (PDEs). In most cases, the mathematical models can only be computed approximately. The finer the model, the higher the necessary computational effort. With the appearance of more powerful computers, more complicated and more refined models can be calculated. Such user problems are compute-intensive and have strong inherent computational parallelism. Therefore, the needed high performance can be achieved by using many computers working in parallel. In particular, parallel architectures of the MIMD (multiple-instruction multiple-data) type, known as multiprocessors, are well suited because of their higher flexibility compared with SIMD (single-instruction multiple-data). In this paper, the authors present a distributed shared memory (DSM) architecture that is the basis for the design of a scalable high-performance multiprocessor system.
  • Sparse matrix computations play an important role in iterative methods to solve systems of equations or eigenvalue problems that are applied during the solution of discretized partial differential equations. The large size of many technical or physical applications in this area results in the need for parallel execution of sparse operations, in particular sparse matrix-vector multiplication, on distributed memory computers. In this report, a data distribution and a communication scheme are presented for parallel sparse iterative solvers. Performance tests, using the conjugate gradient method, the QMR and the TFQMR algorithm for solving systems of equations, and the Lanczos method for the symmetric eigenvalue problem, were carried out on a PARAGON XP/S 10 with 140 processors. The parallel variants of the algorithms show good scaling behavior for matrices with different sparsity patterns. (A small illustrative sketch of a row-block distribution for the sparse matrix-vector product appears after this list.)
  • Enzymes are versatile nanoscale biocatalysts and find increasing applications in many areas, including organic synthesis[1-3] and bioremediation.[4-5] However, the application of enzymes is often hampered by their short catalytic lifetime and by the difficulty in recovery and recycling. To solve these problems, there have been many efforts to develop effective enzyme immobilization techniques. Recent advances in nanotechnology provide more diverse materials and approaches for enzyme immobilization. For example, mesoporous materials offer potential advantages as hosts for enzymes due to their well-controlled porosity and large surface area for the immobilization of enzymes.[6,7] On the other hand, it has been demonstrated that enzymes attached to magnetic iron oxide nanoparticles can be easily recovered using a magnet and recycled for iterative uses.[8] In this paper, we report the development of a magnetically separable and highly stable enzyme system based on the combined use of two different kinds of nanostructured materials: magnetic nanoparticles and mesoporous silica.
  • We describe parallel algorithms for computing a maximal cardinality matching in a bipartite graph on distributed-memory systems. Unlike traditional algorithms that match one vertex at a time, our algorithms process many unmatched vertices simultaneously using a matrix-algebraic formulation of maximal matching. This generic matrix-algebraic framework is used to develop three efficient maximal matching algorithms with minimal changes. The newly developed algorithms have two benefits over existing graph-based algorithms. First, unlike existing parallel algorithms, the cardinality of the matching obtained by the new algorithms stays constant with increasing processor counts, which is important for predictable and reproducible performance. Second, relying on bulk-synchronous matrix operations, these algorithms expose a higher degree of parallelism on distributed-memory platforms than existing graph-based algorithms. We report high-performance implementations of three maximal matching algorithms using hybrid OpenMP-MPI and evaluate the performance of these algorithms using more than 35 real and randomly generated graphs. On real instances, our algorithms achieve up to 200× speedup on 2,048 cores of a Cray XC30 supercomputer. Even higher speedups are obtained on larger synthetically generated graphs, where our algorithms show good scaling on up to 16,384 cores. (A toy matrix-algebraic sketch of maximal matching follows this list.)
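For the maximal matching item directly above, here is a toy matrix-algebraic maximal matching in NumPy: in every round all currently unmatched row vertices propose along an edge at once, and each column keeps one proposer. It only illustrates the bulk-synchronous, matrix-based flavor described in that abstract; it is not the paper's algorithm, and the function and variable names are invented for the example.

```python
import numpy as np

def maximal_matching(A):
    """A is a boolean m x n biadjacency matrix. Returns match_row, where
    match_row[i] is the column matched to row i, or -1 if row i is unmatched."""
    m, n = A.shape
    match_row = -np.ones(m, dtype=int)
    match_col = -np.ones(n, dtype=int)
    while True:
        free_rows = match_row < 0
        free_cols = match_col < 0
        # candidate edges between two currently unmatched vertices, found in bulk
        cand = A & free_rows[:, None] & free_cols[None, :]
        if not cand.any():
            break                                    # no such edge left: maximal
        # every free row proposes to its first free neighbor (or -1 if none)
        proposal = np.where(cand.any(axis=1), cand.argmax(axis=1), -1)
        for j in range(n):                           # each column accepts one proposer
            proposers = np.where(proposal == j)[0]
            if proposers.size:
                i = proposers[0]
                match_row[i], match_col[j] = j, i
    return match_row

rng = np.random.default_rng(2)
A = rng.random((200, 250)) < 0.02                    # sparse random bipartite graph
mr = maximal_matching(A)
print("matched", int(np.sum(mr >= 0)), "of", A.shape[0], "row vertices")
```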
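Returning to the sparse iterative-solver item in the list above, the following is a small self-contained sketch of the common 1-D row-block data distribution behind a parallel sparse matrix-vector product, the kernel inside CG, QMR, TFQMR, and Lanczos. The processes are simulated with a plain Python loop and every name here is made up for illustration; it is not that report's actual distribution or communication scheme.

```python
import numpy as np

def spmv_row_blocks(rows, cols, vals, x, nprocs):
    """y = A x for A given in COO form, with rows split into nprocs contiguous
    blocks. Each "rank" p computes the rows it owns; in a real MPI code the
    needed entries of x would be obtained by communication (e.g. a gather)."""
    n = x.size
    y = np.zeros(n)
    bounds = np.linspace(0, n, nprocs + 1).astype(int)   # row-ownership ranges
    for p in range(nprocs):                              # one pass per "process"
        lo, hi = bounds[p], bounds[p + 1]
        mine = (rows >= lo) & (rows < hi)                # nonzeros owned by rank p
        y_local = np.zeros(hi - lo)
        np.add.at(y_local, rows[mine] - lo, vals[mine] * x[cols[mine]])
        y[lo:hi] = y_local                               # collect the local result
    return y

# Toy sparse matrix: tridiagonal 1-D Laplacian in COO format.
n = 1000
rows = np.concatenate([np.arange(n), np.arange(1, n), np.arange(n - 1)])
cols = np.concatenate([np.arange(n), np.arange(n - 1), np.arange(1, n)])
vals = np.concatenate([2.0 * np.ones(n), -np.ones(n - 1), -np.ones(n - 1)])

x = np.random.default_rng(1).standard_normal(n)
y = spmv_row_blocks(rows, cols, vals, x, nprocs=4)

A = np.zeros((n, n)); np.add.at(A, (rows, cols), vals)  # dense reference check
print("max difference vs. dense matvec:", np.max(np.abs(y - A @ x)))
```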