OSTI.GOV — U.S. Department of Energy
Office of Scientific and Technical Information

Title: Trends in data locality abstractions for HPC systems

Abstract

The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.
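The abstract's central idea — a data structure that hides a locality-friendly layout behind a simple interface — can be illustrated with a minimal sketch. This example is not from the paper; the class name, tile size, and index scheme are illustrative assumptions showing one common form such an abstraction takes (a tiled array layout).

```python
# Hypothetical sketch of a data-locality abstraction: a tiled 2D array.
# Elements of each b x b tile are stored contiguously, so a tile-local
# traversal touches memory sequentially while callers keep using (i, j).

class TiledArray:
    """An n x n matrix stored in contiguous b x b tiles."""

    def __init__(self, n, b):
        assert n % b == 0, "tile size must divide matrix size"
        self.n, self.b = n, b
        self.data = [0.0] * (n * n)

    def _index(self, i, j):
        """Map a logical (i, j) to the tiled linear layout."""
        b = self.b
        tiles_per_row = self.n // b
        tile = (i // b) * tiles_per_row + (j // b)   # which tile holds (i, j)
        offset = (i % b) * b + (j % b)               # position inside the tile
        return tile * b * b + offset

    def __getitem__(self, ij):
        return self.data[self._index(*ij)]

    def __setitem__(self, ij, value):
        self.data[self._index(*ij)] = value

a = TiledArray(4, 2)
a[2, 3] = 7.0
print(a[2, 3])  # 7.0
```

The point of such abstractions, as the survey argues, is that the layout can change (tiled, Morton-order, NUMA-distributed) without touching application code that indexes with `a[i, j]`.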

Authors:
Unat, Didem [1]; Dubey, Anshu [2]; Hoefler, Torsten [3]; Shalf, John [4]; Abraham, Mark [5]; Bianco, Mauro [6]; Chamberlain, Bradford L. [7]; Cledat, Romain [8]; Edwards, H. Carter [9]; Finkel, Hal [10]; Fuerlinger, Karl [11]; Hannig, Frank [12]; Jeannot, Emmanuel [13]; Kamil, Amir [14]; Keasler, Jeff [15]; Kelly, Paul H. J. [16]; Leung, Vitus [9]; Ltaief, Hatem [17]; Maruyama, Naoya [18]; Newburn, Chris J. [19]; Pericas, Miquel [20]
  1. Koc Univ., Istanbul (Turkey)
  2. Argonne National Lab. (ANL), Lemont, IL (United States)
  3. ETH Zurich, Zurich (Switzerland)
  4. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  5. KTH Royal Institute of Technology, Solna (Sweden)
  6. Swiss National Supercomputing Centre (CSCS), Lugano (Switzerland)
  7. Cray Inc., Seattle, WA (United States)
  8. Intel Corp., Santa Clara, CA (United States)
  9. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
  10. Argonne National Lab. (ANL), Argonne, IL (United States)
  11. Ludwig-Maximilians-Univ., Munich (Germany)
  12. Univ. of Erlangen-Nuremberg, Erlangen (Germany)
  13. INRIA Bordeaux Sud-Ouest, Talence (France)
  14. Univ. of Michigan, Ann Arbor, MI (United States)
  15. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  16. Imperial College, London (United Kingdom)
  17. King Abdullah Univ. of Science and Technology, Thuwal (Saudi Arabia)
  18. RIKEN, Hyogo (Japan)
  19. Nvidia Corp., Santa Clara, CA (United States)
  20. Chalmers Univ. of Technology, Goteborg (Sweden)
Publication Date:
2017-05-10
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1356837
Report Number(s):
SAND-2017-3844J
Journal ID: ISSN 1045-9219; 652425
Grant/Contract Number:
AC04-94AL85000
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
IEEE Transactions on Parallel and Distributed Systems
Additional Journal Information:
Journal Volume: 28; Journal Issue: 10; Journal ID: ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; data locality; programming abstractions; high-performance computing; data layout; locality-aware runtimes

Citation Formats

Unat, Didem, Dubey, Anshu, Hoefler, Torsten, Shalf, John, Abraham, Mark, Bianco, Mauro, Chamberlain, Bradford L., Cledat, Romain, Edwards, H. Carter, Finkel, Hal, Fuerlinger, Karl, Hannig, Frank, Jeannot, Emmanuel, Kamil, Amir, Keasler, Jeff, Kelly, Paul H. J., Leung, Vitus, Ltaief, Hatem, Maruyama, Naoya, Newburn, Chris J., and Pericas, Miquel. Trends in data locality abstractions for HPC systems. United States: N. p., 2017. Web. doi:10.1109/tpds.2017.2703149.
Unat, Didem, Dubey, Anshu, Hoefler, Torsten, Shalf, John, Abraham, Mark, Bianco, Mauro, Chamberlain, Bradford L., Cledat, Romain, Edwards, H. Carter, Finkel, Hal, Fuerlinger, Karl, Hannig, Frank, Jeannot, Emmanuel, Kamil, Amir, Keasler, Jeff, Kelly, Paul H. J., Leung, Vitus, Ltaief, Hatem, Maruyama, Naoya, Newburn, Chris J., & Pericas, Miquel. Trends in data locality abstractions for HPC systems. United States. doi:10.1109/tpds.2017.2703149.
Unat, Didem, Dubey, Anshu, Hoefler, Torsten, Shalf, John, Abraham, Mark, Bianco, Mauro, Chamberlain, Bradford L., Cledat, Romain, Edwards, H. Carter, Finkel, Hal, Fuerlinger, Karl, Hannig, Frank, Jeannot, Emmanuel, Kamil, Amir, Keasler, Jeff, Kelly, Paul H. J., Leung, Vitus, Ltaief, Hatem, Maruyama, Naoya, Newburn, Chris J., and Pericas, Miquel. 2017. "Trends in data locality abstractions for HPC systems". United States. doi:10.1109/tpds.2017.2703149. https://www.osti.gov/servlets/purl/1356837.
@article{osti_1356837,
title = {Trends in data locality abstractions for HPC systems},
author = {Unat, Didem and Dubey, Anshu and Hoefler, Torsten and Shalf, John and Abraham, Mark and Bianco, Mauro and Chamberlain, Bradford L. and Cledat, Romain and Edwards, H. Carter and Finkel, Hal and Fuerlinger, Karl and Hannig, Frank and Jeannot, Emmanuel and Kamil, Amir and Keasler, Jeff and Kelly, Paul H. J. and Leung, Vitus and Ltaief, Hatem and Maruyama, Naoya and Newburn, Chris J. and Pericas, Miquel},
abstractNote = {The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.},
doi = {10.1109/tpds.2017.2703149},
journal = {IEEE Transactions on Parallel and Distributed Systems},
number = 10,
volume = 28,
place = {United States},
year = {2017},
month = {may}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Related records:
  • The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models.
  • Many abstractions of program dependences have already been proposed, such as the Dependence Distance, the Dependence Direction Vector, the Dependence Level or the Dependence Cone. These different abstractions have different precisions. The minimal abstraction associated with a transformation is the abstraction that contains the minimal amount of information necessary to decide when such a transformation is legal. Minimal abstractions for loop reordering and unimodular transformations are presented. As an example, the dependence cone, which approximates dependences by a convex cone of the dependence distance vectors, is the minimal abstraction for unimodular transformations. It also contains enough information for legally applying all loop reordering transformations and finding the same set of valid mono- and multi-dimensional linear schedules as the dependence distance set.
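    The legality test that the dependence-distance abstraction enables can be sketched concretely: a loop reordering is legal when every dependence distance vector, after permuting its components into the new loop order, remains lexicographically positive. This is a hypothetical illustration of that standard criterion, not code from the cited work.

    ```python
    # Sketch: legality of a loop interchange from dependence distance vectors.
    # A reordering is legal iff every permuted distance vector stays
    # lexicographically positive (the dependence still points "forward").

    def lex_positive(v):
        """True if vector v is lexicographically positive."""
        for x in v:
            if x > 0:
                return True
            if x < 0:
                return False
        return False  # all-zero vector: no loop-carried dependence

    def interchange_legal(distances, perm):
        """distances: dependence distance vectors, one per dependence.
        perm: new loop order, e.g. (1, 0) swaps a two-deep loop nest."""
        return all(lex_positive([d[p] for p in perm]) for d in distances)

    # Distance (1, -1) is carried by the outer loop; swapping loops turns it
    # into (-1, 1), which is lexicographically negative, so swapping is illegal.
    print(interchange_legal([(1, -1)], (1, 0)))  # False
    print(interchange_legal([(1, 1)], (1, 0)))   # True
    ```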
  • High Performance Computing (HPC) systems are composed of servers containing an ever-increasing number of cores. With such high processor core counts, non-uniform memory access (NUMA) architectures are almost universally used to reduce inter-processor and memory communication bottlenecks by distributing processors and memory throughout a server-internal networking topology. Application studies have shown that tuning process placement in a server's NUMA networking topology to the application can have a dramatic impact on performance. The performance implications are magnified when running a parallel job across multiple server nodes, especially with large-scale HPC applications. This paper presents the Locality-Aware Mapping Algorithm (LAMA) for distributing the individual processes of a parallel application across processing resources in an HPC system, paying particular attention to the internal server NUMA topologies. The algorithm is able to support both homogeneous and heterogeneous hardware systems, and dynamically adapts to the available hardware and user-specified process layout at run-time. As implemented in Open MPI, the LAMA provides 362,880 mapping permutations and is able to naturally scale out to additional hardware resources as they become available in future architectures.
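    The 362,880 figure in the LAMA abstract is 9!, i.e. the number of orderings of nine hardware levels over which processes can be laid out. The level names below are illustrative assumptions, not the exact identifiers from the paper; the sketch just shows where the count comes from.

    ```python
    # Sketch: why LAMA's quoted 362,880 mapping permutations equals 9!.
    # Each permutation is one order in which processes are distributed
    # across nine nested hardware levels (names here are illustrative).

    from itertools import permutations
    from math import factorial

    levels = ["node", "board", "socket", "numa",
              "l3", "l2", "l1", "core", "hwthread"]

    print(factorial(len(levels)))                # 362880
    print(sum(1 for _ in permutations(levels)))  # 362880
    ```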