Efficient Heterogeneous Execution on Large Multicore and Accelerator Platforms: Case Study Using a Block Tridiagonal Solver

Park, Alfred J; Perumalla, Kalyan S

doi:10.1016/j.jpdc.2013.07.012


Journal Article · January 1, 2013 · Journal of Parallel and Distributed Computing

DOI: https://doi.org/10.1016/j.jpdc.2013.07.012 · OSTI ID: 1115349

Park, Alfred J [1]; Perumalla, Kalyan S [1]

[1] ORNL

We explore algorithmic and implementation principles for exploiting GPU accelerators together with multicore processors on high-end systems with large numbers of compute nodes, and evaluate them in an implementation of a scalable block tridiagonal solver. Within each compute node, the accelerator is used in combination with that node's multicore processors to perform the block-level linear algebra operations of the overall distributed solver algorithm. The optimizations include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multi-process sharing of the accelerator within a node to balance load with the multicore processors, and (3) an automatic memory management system that uses accelerator memory efficiently when sub-matrices exceed the limits of device memory. Results are reported from our novel implementation, which uses the MAGMA and CUBLAS accelerator software systems simultaneously with ACML for multithreaded execution on processors. Overall, using 940 NVIDIA Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline that is highly optimized with intra-node and inter-node concurrency and computation-communication overlap. Detailed quantitative results are presented to explain all critical runtime components contributing to hybrid performance.
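For readers unfamiliar with the problem class, the following is a minimal serial sketch of a block tridiagonal solve using textbook block Thomas elimination. It is an illustration only: the abstract does not state which elimination scheme the distributed solver uses, and the function names (`matmul`, `matsub`, `solve`, `block_thomas`) are hypothetical. The block-level products and solves shown here in plain Python are the kind of dense operations that the abstract describes as being handed to accelerator and multithreaded CPU libraries.

```python
# Serial reference sketch (assumption: textbook block Thomas elimination,
# not the distributed algorithm of the article). Pure Python, no dependencies.

def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matsub(A, B):
    """Elementwise difference A - B."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(ra) + list(rb) for ra, rb in zip(A, B)]  # augmented [A | B]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if M[p][c] == 0:
            raise ValueError("singular block")
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def block_thomas(L, D, U, b):
    """Solve L[i] x[i-1] + D[i] x[i] + U[i] x[i+1] = b[i] for all block rows.

    L[0] and U[-1] are unused and may be None. Every block is a list of rows.
    """
    n = len(D)
    C = [None] * n  # C[i] = S_i^{-1} U[i], with S_i the Schur-updated diagonal
    Y = [None] * n  # Y[i] = S_i^{-1} (b[i] - L[i] Y[i-1])
    for i in range(n):  # forward elimination
        if i == 0:
            S, rhs = D[0], b[0]
        else:
            S = matsub(D[i], matmul(L[i], C[i - 1]))
            rhs = matsub(b[i], matmul(L[i], Y[i - 1]))
        if i < n - 1:
            C[i] = solve(S, U[i])
        Y[i] = solve(S, rhs)
    x = [None] * n
    x[n - 1] = Y[n - 1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = matsub(Y[i], matmul(C[i], x[i + 1]))
    return x
```

Calling `block_thomas(L, D, U, b)` with lists of equally sized square blocks returns the solution blocks in row order. The forward sweep is inherently sequential across block rows, which is one reason large-scale solvers typically restructure the elimination to expose parallelism.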


Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)

Sponsoring Organization: USDOE Office of Science (SC)

DOE Contract Number: DE-AC05-00OR22725

OSTI ID: 1115349

Journal Information: Journal of Parallel and Distributed Computing, Vol. 73, Issue 12; ISSN 0743-7315

Country of Publication: United States

Language: English

Similar Records

GPU-acceleration of the ELPA2 distributed eigensolver for dense symmetric and hermitian eigenproblems

Journal Article · December 31, 2020 · Computer Physics Communications · OSTI ID:1115349

Yu, Victor Wen-zhe; Moussa, Jonathan; Kůs, Pavel; +5 more

Complexity in scalable computing.

Journal Article · December 1, 2008 · Proposed for publication in Scientific Programming. · OSTI ID:1115349

Rouson, Damian W. I.

Approximate Weighted Matching On Emerging Manycore and Multithreaded Architectures

Journal Article · November 30, 2012 · International Journal of High Performance Computing Applications, 26(4):413-430 · OSTI ID:1115349

Halappanavar, Mahantesh; Feo, John T; Villa, Oreste; +2 more
