Solving a trillion unknowns per second with HPGMG on Sunway TaihuLight

Ma, Wenjing; Ao, Yulong; Yang, Chao; Williams, Samuel

doi:10.1007/s10586-019-02938-w


Journal Article · May 13, 2019 · Cluster Computing

DOI: https://doi.org/10.1007/s10586-019-02938-w · OSTI ID: 1580814

Ma, Wenjing [1]; Ao, Yulong [2]; Yang, Chao [2]; Williams, Samuel [3]

[1] Chinese Academy of Sciences (CAS), Beijing (China)
[2] Peking Univ., Beijing (China); Peng Cheng Lab., Shenzhen (China)
[3] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Benchmarks for supercomputers are important tools, not only for evaluating and ranking modern systems but also for guiding future architecture design. HPGMG (High Performance Geometric Multigrid) is a newer benchmark that solves a system of linear equations with a full geometric multigrid algorithm. It involves computation at multiple grid scales, data movement of varying volume, and both global and nearest-neighbor communication with large and small messages, which makes it more representative of real-world applications than traditional benchmarks such as LINPACK. It is therefore worth examining how well HPGMG performs on leadership supercomputers such as Sunway TaihuLight. TaihuLight, No. 1 on the TOP500 list from June 2016 to June 2018 and built on the custom-designed SW26010 many-core processor, is of great interest to the high performance computing community.

With careful analysis and code design, we developed an efficient implementation of HPGMG on SW26010 processors. Besides traditional optimization techniques such as 2.5D partitioning, double buffering, and collective data loading, we introduced a micro-benchmark to guide the choice of optimization direction and parameter tuning. We also proposed a new procedure for the major operations that splits the smooth function and the ghost exchange operation into finer-grained steps and reorders them, which reduces memory copies and accelerates communication. Our optimized implementation of HPGMG on Sunway TaihuLight achieved 1.036 × 10¹² degrees of freedom (DOF) per second at the finest level, ranking No. 1 on the November 2017 HPGMG list.
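
The abstract rests on geometric multigrid. As a rough illustration of the recursion that a full multigrid solve is built from, the following is a minimal 1D sketch of a V-cycle for the model problem -u'' = f on (0, 1) with zero Dirichlet boundaries. The smoother (weighted Jacobi), full-weighting restriction, linear interpolation, and all function names are assumptions chosen for illustration; HPGMG itself operates on 3D grids with its own operators, smoothers, and data layout, and this is not the benchmark's code.

```python
import math

def residual(u, f, h):
    """r = f - A u, where A = tridiag(-1, 2, -1) / h^2 with zero Dirichlet boundaries."""
    n = len(u)
    r = [0.0] * n
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r[i] = f[i] - (2.0 * u[i] - left - right) / (h * h)
    return r

def smooth(u, f, h, sweeps=2, omega=2.0 / 3.0):
    """Weighted Jacobi: u <- u + omega * (h^2 / 2) * r (the diagonal of A is 2 / h^2)."""
    for _ in range(sweeps):
        r = residual(u, f, h)
        u = [ui + omega * 0.5 * h * h * ri for ui, ri in zip(u, r)]
    return u

def restrict(r):
    """Full weighting onto the coarse grid; coarse point j coincides with fine point 2j+1."""
    nc = (len(r) - 1) // 2
    return [0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2] for j in range(nc)]

def prolong(ec, nf):
    """Linear interpolation of a coarse correction back to nf = 2 * len(ec) + 1 fine points."""
    ef = [0.0] * nf
    for j, e in enumerate(ec):
        ef[2 * j] += 0.5 * e
        ef[2 * j + 1] += e
        ef[2 * j + 2] += 0.5 * e
    return ef

def v_cycle(u, f, h):
    """One V-cycle for A u = f on a grid with 2^k - 1 interior points and spacing h."""
    if len(u) == 1:
        return [0.5 * h * h * f[0]]                # coarsest grid: solve 2u/h^2 = f exactly
    u = smooth(u, f, h)                            # pre-smoothing
    rc = restrict(residual(u, f, h))               # move the residual to the coarse grid
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h)     # recursive coarse-grid correction
    u = [ui + ei for ui, ei in zip(u, prolong(ec, len(u)))]
    return smooth(u, f, h)                         # post-smoothing

# Usage: -u'' = pi^2 sin(pi x) on (0, 1), whose exact solution is sin(pi x).
n = 2 ** 7 - 1
h = 1.0 / (n + 1)
x = [(i + 1) * h for i in range(n)]
f = [math.pi ** 2 * math.sin(math.pi * xi) for xi in x]
u = [0.0] * n
for _ in range(15):
    u = v_cycle(u, f, h)
print(max(abs(ui - math.sin(math.pi * xi)) for ui, xi in zip(u, x)))
```

Smoothing damps high-frequency error on each level, while the coarse-grid correction removes smooth error cheaply. A full multigrid solve additionally starts on the coarsest grid and uses interpolated solutions as initial guesses on successively finer grids.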

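The abstract names 2.5D partitioning and double buffering without further detail. The sketch below shows the generic idea under our own assumptions: the interior of a 3D grid is split into xy tiles (on SW26010 these would typically map to the compute processing elements, with the working set held in their local scratchpad memory), each tile is streamed plane by plane along z through a three-plane window, and the next plane is fetched in the background while the current one is computed. A Python thread stands in for asynchronous DMA, and the function names, tile size, and constant-coefficient 7-point stencil are hypothetical, not taken from the paper.

```python
from concurrent.futures import ThreadPoolExecutor

def load_tile_plane(grid, z, y0, y1, x0, x1):
    """Copy plane z of the xy tile [y0, y1) x [x0, x1) plus a one-cell halo.

    Hypothetical stand-in for a DMA transfer from main memory into local storage.
    """
    return [row[x0 - 1:x1 + 1] for row in grid[z][y0 - 1:y1 + 1]]

def stencil_tile_25d(grid, out, y0, y1, x0, x1, c0=-6.0, c1=1.0):
    """7-point stencil on one xy tile, streaming along z with a three-plane window.

    The load of plane z+2 is issued before plane z is computed, so the copy can
    overlap with computation (double buffering).
    """
    nz = len(grid)
    if nz < 3:
        return
    with ThreadPoolExecutor(max_workers=1) as loader:
        below = load_tile_plane(grid, 0, y0, y1, x0, x1)
        center = load_tile_plane(grid, 1, y0, y1, x0, x1)
        pending = loader.submit(load_tile_plane, grid, 2, y0, y1, x0, x1)
        for z in range(1, nz - 1):
            above = pending.result()          # wait for plane z+1
            if z + 2 < nz:                    # prefetch plane z+2 while plane z is computed
                pending = loader.submit(load_tile_plane, grid, z + 2, y0, y1, x0, x1)
            for j in range(1, y1 - y0 + 1):
                for i in range(1, x1 - x0 + 1):
                    nbrs = (center[j - 1][i] + center[j + 1][i]
                            + center[j][i - 1] + center[j][i + 1]
                            + below[j][i] + above[j][i])
                    out[z][y0 + j - 1][x0 + i - 1] = c0 * center[j][i] + c1 * nbrs
            below, center = center, above     # slide the window up one plane

def stencil_25d(grid, tile=4):
    """Split the interior xy plane into tiles (one per worker on a real machine)."""
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    for y0 in range(1, ny - 1, tile):
        for x0 in range(1, nx - 1, tile):
            stencil_tile_25d(grid, out, y0, min(y0 + tile, ny - 1),
                             x0, min(x0 + tile, nx - 1))
    return out
```

As a quick check, for u = x² + y² + z² the discrete 7-point Laplacian (c0 = -6, c1 = 1) equals 6 at every interior point, whatever tile size is used.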

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); National Key R&D Plan of China; Beijing Natural Science Foundation

Grant/Contract Number: AC02-05CH11231; 2016YFB0200603; JQ18001

OSTI ID: 1580814

Journal Information: Cluster Computing, Vol. 23, Issue 2; ISSN 1386-7857

Publisher: Springer

Country of Publication: United States

Language: English

Similar Records

HPGMG

Software · March 10, 2014

Williams, Samuel; Van Straalen, Brian

The accurate particle tracer code

Journal Article · July 20, 2017 · Computer Physics Communications

Wang, Yulei; Liu, Jian; Qin, Hong; +2 more

Random circuit block-encoded matrix and a proposal of quantum LINPACK benchmark

Journal Article · June 14, 2021 · Physical Review A

Dong, Yulong; Lin, Lin

Related Subjects

97 MATHEMATICS AND COMPUTING
HPGMG
Sunway TaihuLight
performance benchmark and optimization
many-core computing