Autotuning in High-Performance Computing Applications

Balaprakash, Prasanna; Dongarra, Jack; Gamblin, Todd; Hall, Mary; Hollingsworth, Jeffrey K.; Norris, Boyana; Vuduc, Richard

doi:10.1109/JPROC.2018.2841200

Journal Article · July 31, 2018 · Proceedings of the IEEE

DOI: https://doi.org/10.1109/JPROC.2018.2841200 · OSTI ID: 1488544

Balaprakash, Prasanna ^[1]; Dongarra, Jack ^[2]; Gamblin, Todd ^[3]; Hall, Mary ^[4]; Hollingsworth, Jeffrey K. ^[5]; Norris, Boyana ^[6]; Vuduc, Richard ^[7]

[1] Argonne National Lab. (ANL), Argonne, IL (United States)
[2] Univ. of Tennessee, Knoxville, TN (United States)
[3] Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
[4] Univ. of Utah, Salt Lake City, UT (United States)
[5] Univ. of Maryland, College Park, MD (United States)
[6] Univ. of Oregon, Eugene, OR (United States)
[7] Georgia Inst. of Technology, Atlanta, GA (United States)

Autotuning refers to the automatic generation of a search space of possible implementations of a computation that are evaluated through models and/or empirical measurement to identify the most desirable implementation. Autotuning has the potential to dramatically improve the performance portability of petascale and exascale applications. To date, autotuning has been used primarily in high-performance applications through tunable libraries or previously tuned application code that is integrated directly into the application. This paper draws on the authors' extensive experience applying autotuning to high-performance applications, describing both successes and future challenges. If autotuning is to be widely used in the HPC community, researchers must address the software engineering challenges, manage configuration overheads, and continue to demonstrate significant performance gains and portability across architectures. In particular, tools that configure the application must be integrated into the application build process so that tuning can be reapplied as the application and target architectures evolve.
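
To make the definition in the abstract concrete, the following is a minimal sketch, not taken from the paper, of the generic loop it describes: enumerate a search space of candidate configurations, score each one with either a model or an empirical measurement, and keep the best. All names here (`search_space`, `autotune`, `modeled_cost`, `measured_cost`, and the `block` and `unroll` parameters) are hypothetical, the analytic model is a toy, and exhaustive enumeration stands in for the more sophisticated search strategies real autotuners use.

```python
import itertools
import time
from typing import Callable, Dict, List, Sequence, Tuple

Config = Dict[str, int]


def search_space(params: Dict[str, Sequence[int]]) -> List[Config]:
    """Enumerate every combination of parameter values (the search space)."""
    names = list(params)
    value_lists = (params[name] for name in names)
    return [dict(zip(names, values)) for values in itertools.product(*value_lists)]


def autotune(space: List[Config], cost: Callable[[Config], float]) -> Tuple[Config, float]:
    """Evaluate each candidate once with `cost` and return the cheapest."""
    best_config, best_cost = None, float("inf")
    for config in space:
        c = cost(config)
        if c < best_cost:
            best_config, best_cost = config, c
    return best_config, best_cost


def modeled_cost(config: Config) -> float:
    """Toy analytic model (an assumption for illustration only):
    pretend a block size of 64 is ideal and larger unroll factors help."""
    return abs(config["block"] - 64) / 64 + 1.0 / config["unroll"]


def blocked_sum(data: Sequence[float], block: int) -> float:
    """The computation being tuned: sum `data` one block at a time."""
    total = 0.0
    for start in range(0, len(data), block):
        total += sum(data[start:start + block])
    return total


def measured_cost(config: Config, data: Sequence[float], repeats: int = 3) -> float:
    """Empirical measurement: best wall-clock time over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        blocked_sum(data, config["block"])
        best = min(best, time.perf_counter() - t0)
    return best


if __name__ == "__main__":
    # Model-driven tuning over two parameters.
    space = search_space({"block": [16, 64, 256], "unroll": [1, 2, 4]})
    print(autotune(space, modeled_cost))

    # Empirical tuning over one parameter; the winner depends on the machine.
    data = [1.0] * 100_000
    blocks = search_space({"block": [8, 64, 512, 4096]})
    print(autotune(blocks, lambda cfg: measured_cost(cfg, data)))
```

Passing the cost function as an argument keeps the search loop independent of how candidates are scored, which mirrors the abstract's "models and/or empirical measurement" wording: the same `autotune` call works with either `modeled_cost` or a closure over `measured_cost`.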

Research Organization: Argonne National Laboratory (ANL), Argonne, IL (United States); Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)

Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

Grant/Contract Number: AC02-06CH11357; AC52-07NA27344

OSTI ID: 1488544

Alternate ID(s): OSTI ID: 1868859

Report Number(s): LLNL-JRNL-834240; 147743

Journal Information: Proceedings of the IEEE, Vol. 106, Issue 11; ISSN 0018-9219

Publisher: Institute of Electrical and Electronics Engineers

Country of Publication: United States

Language: English

Citation Metrics: Cited by 54 works (citation information provided by Web of Science)

Related Subjects

97 MATHEMATICS AND COMPUTING
high-performance computing
performance tuning programming systems
