Genetic algorithm based task reordering to improve the performance of batch scheduled massively parallel scientific applications

Sankaran, Ramanan; Angel, Jordan; Brown, W. Michael

doi:10.1002/cpe.3457

Abstract

The growth in size of networked high performance computers along with novel accelerator-based node architectures has further emphasized the importance of communication efficiency in high performance computing. The world's largest high performance computers are usually operated as shared user facilities due to the costs of acquisition and operation. Applications are scheduled for execution in a shared environment and are placed on nodes that are not necessarily contiguous on the interconnect. Furthermore, the placement of tasks on the nodes allocated by the scheduler is sub-optimal, leading to performance loss and variability. Here, we investigate the impact of task placement on the performance of two massively parallel application codes on the Titan supercomputer, a turbulent combustion flow solver (S3D) and a molecular dynamics code (LAMMPS). Benchmark studies show a significant deviation from ideal weak scaling and variability in performance. The inter-task communication distance was determined to be one of the significant contributors to the performance degradation and variability. A genetic algorithm-based parallel optimization technique was used to optimize the task ordering. This technique provides an improved placement of the tasks on the nodes, taking into account the application's communication topology and the system interconnect topology. Application benchmarks after task reordering through genetic algorithm show a significant improvement in performance and reduction in variability, thereby enabling the applications to achieve better time to solution and scalability on Titan during production.
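The idea of reordering tasks with a genetic algorithm can be illustrated with a small, serial sketch. It is not the authors' implementation: the abstract does not give the cost function, the genetic operators, or the parallelization strategy. The sketch assumes a 3D torus interconnect, a hop-count cost summed over communicating task pairs, order crossover, swap mutation, and elitist selection; all function names, parameters, and the toy node list are hypothetical.

```python
import random

def torus_hops(a, b, dims):
    """Hop count between two coordinates on a periodic (torus) mesh (assumed topology)."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def comm_cost(order, node_coords, edges, dims):
    """Weighted hop distance summed over all communicating task pairs.

    order[t] is the index of the allocated node that hosts task t;
    edges is a list of (task_i, task_j, weight) tuples.
    """
    return sum(w * torus_hops(node_coords[order[i]], node_coords[order[j]], dims)
               for i, j, w in edges)

def order_crossover(p1, p2, rng):
    """Order crossover: keep a slice of p1, fill the remaining slots in p2's order."""
    n = len(p1)
    lo, hi = sorted(rng.sample(range(n + 1), 2))
    kept = set(p1[lo:hi])
    fill = [g for g in p2 if g not in kept]
    return fill[:lo] + p1[lo:hi] + fill[lo:]

def swap_mutation(perm, rng, rate):
    """With probability `rate`, swap two randomly chosen positions."""
    perm = perm[:]
    if rng.random() < rate:
        i, j = rng.sample(range(len(perm)), 2)
        perm[i], perm[j] = perm[j], perm[i]
    return perm

def optimize_order(node_coords, edges, dims, pop_size=40, generations=200, seed=0):
    """Search for a task-to-node permutation with low communication cost."""
    rng = random.Random(seed)
    n = len(node_coords)

    def cost(perm):
        return comm_cost(perm, node_coords, edges, dims)

    identity = list(range(n))  # the default (scheduler-given) ordering
    population = [identity] + [rng.sample(identity, n) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=cost)
        elite = population[: pop_size // 4]  # elitism: best quarter always survives
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            children.append(swap_mutation(order_crossover(p1, p2, rng), rng, 0.3))
        population = elite + children
    return min(population, key=cost)

# Toy case: 8 non-contiguous nodes on a 4x4x4 torus, tasks communicating in a ring.
dims = (4, 4, 4)
node_coords = [(0, 0, 0), (3, 2, 1), (0, 0, 1), (2, 2, 2),
               (0, 1, 0), (1, 3, 3), (0, 1, 1), (3, 3, 0)]
edges = [(t, (t + 1) % 8, 1.0) for t in range(8)]
best = optimize_order(node_coords, edges, dims)
print("default cost:", comm_cost(list(range(8)), node_coords, edges, dims))
print("reordered cost:", comm_cost(best, node_coords, edges, dims))
```

Because the default ordering is seeded into the initial population and the best candidates are always retained, the returned ordering in this sketch is never worse than the default under the assumed cost. A production setting would replace the toy node list with the coordinates of the nodes actually allocated by the scheduler and the ring with the application's measured communication graph.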

Authors:

Sankaran, Ramanan ^[1]; Angel, Jordan ^[1]; Brown, W. Michael ^[1]

Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

Publication Date: April 8, 2015

Research Org.: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)

Sponsoring Org.: USDOE Office of Science (SC)

OSTI Identifier: 1224742

Alternate Identifier(s): OSTI ID: 1400703

Grant/Contract Number: AC05-00OR22725

Resource Type: Accepted Manuscript

Journal Name: Concurrency and Computation. Practice and Experience

Additional Journal Information: Journal Volume: 27; Journal Issue: 17; Journal ID: ISSN 1532-0626

Publisher: Wiley

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING

Citation Formats


Sankaran, Ramanan, Angel, Jordan, and Brown, W. Michael. Genetic algorithm based task reordering to improve the performance of batch scheduled massively parallel scientific applications. United States: N. p., 2015. Web. doi:10.1002/cpe.3457.


Sankaran, Ramanan, Angel, Jordan, & Brown, W. Michael. Genetic algorithm based task reordering to improve the performance of batch scheduled massively parallel scientific applications. United States. https://doi.org/10.1002/cpe.3457


Sankaran, Ramanan, Angel, Jordan, and Brown, W. Michael. 2015. "Genetic algorithm based task reordering to improve the performance of batch scheduled massively parallel scientific applications". United States. https://doi.org/10.1002/cpe.3457. https://www.osti.gov/servlets/purl/1224742.


                    
@article{osti_1224742,

  title        = {Genetic algorithm based task reordering to improve the performance of batch scheduled massively parallel scientific applications},

  author       = {Sankaran, Ramanan and Angel, Jordan and Brown, W. Michael},

  abstractNote = {Summary The growth in size of networked high performance computers along with novel accelerator‐based node architectures has further emphasized the importance of communication efficiency in high performance computing. The world's largest high performance computers are usually operated as shared user facilities due to the costs of acquisition and operation. Applications are scheduled for execution in a shared environment and are placed on nodes that are not necessarily contiguous on the interconnect. Furthermore, the placement of tasks on the nodes allocated by the scheduler is sub‐optimal, leading to performance loss and variability. Here, we investigate the impact of task placement on the performance of two massively parallel application codes on the Titan supercomputer, a turbulent combustion flow solver (S3D) and a molecular dynamics code (LAMMPS). Benchmark studies show a significant deviation from ideal weak scaling and variability in performance. The inter‐task communication distance was determined to be one of the significant contributors to the performance degradation and variability. A genetic algorithm‐based parallel optimization technique was used to optimize the task ordering. This technique provides an improved placement of the tasks on the nodes, taking into account the application's communication topology and the system interconnect topology. Application benchmarks after task reordering through genetic algorithm show a significant improvement in performance and reduction in variability, thereby enabling the applications to achieve better time to solution and scalability on Titan during production. Copyright © 2015 John Wiley & Sons, Ltd.},

  doi          = {10.1002/cpe.3457},

  journal      = {Concurrency and Computation. Practice and Experience},

  number       = 17,

  volume       = 27,

  place        = {United States},

  year         = {2015},

  month        = {4}

}


Citation Metrics:

Cited by: 1 work

Citation information provided by Web of Science


Works referenced in this record:

Greedy Randomized Adaptive Search Procedures
journal, March 1995

Feo, Thomas A.; Resende, Mauricio G. C.
Journal of Global Optimization, Vol. 6, Issue 2
DOI: 10.1007/BF01096763

Rupture mechanism of liquid crystal thin films realized by large-scale molecular simulations
journal, January 2014

Nguyen, Trung Dac; Carrillo, Jan-Michael Y.; Matheson, Michael A.
Nanoscale, Vol. 6, Issue 6
DOI: 10.1039/C3NR05413F

An Evaluation of Molecular Dynamics Performance on the Hybrid Cray XK6 Supercomputer
journal, January 2012

Michael Brown, W.; Nguyen, Trung D.; Fuentes-Cabrera, Miguel
Procedia Computer Science, Vol. 9
DOI: 10.1016/j.procs.2012.04.020

Simulation of laminar and turbulent impeller stirred tanks using immersed boundary method and large eddy simulation technique in multi-block curvilinear geometries
journal, March 2007

Tyagi, Mayank; Roy, Somnath; Harvey III, Albert D.
Chemical Engineering Science, Vol. 62, Issue 5
DOI: 10.1016/j.ces.2006.11.017

Heuristic technique for processor and link assignment in multicomputers
journal, March 1991

Bollinger, S. W.; Midkiff, S. F.
IEEE Transactions on Computers, Vol. 40, Issue 3
DOI: 10.1109/12.76410

Implementing molecular dynamics on hybrid high performance computers – short range forces
journal, April 2011

Brown, W. Michael; Wang, Peng; Plimpton, Steven J.
Computer Physics Communications, Vol. 182, Issue 4
DOI: 10.1016/j.cpc.2010.12.021

A randomized heuristics for the mapping problem: The genetic approach
journal, October 1992

Chockalingam, T.; Arunkumar, S.
Parallel Computing, Vol. 18, Issue 10
DOI: 10.1016/0167-8191(92)90062-C

On the Mapping Problem
journal, March 1981

Bokhari,
IEEE Transactions on Computers, Vol. C-30, Issue 3
DOI: 10.1109/TC.1981.1675756

Genetic algorithm based heuristics for the mapping problem
journal, January 1995

Chockalingam, T.; Arunkumar, S.
Computers & Operations Research, Vol. 22, Issue 1
DOI: 10.1016/0305-0548(94)P2435-7

Large eddy simulation of turbulence-chemistry interactions in reacting flows
journal, September 2006

Oefelein, J. C.; Drozda, T. G.; Sankaran, V.
Journal of Physics: Conference Series, Vol. 46
DOI: 10.1088/1742-6596/46/1/002

Parallel search for combinatorial optimization: Genetic algorithms, simulated annealing, tabu search and GRASP
book, January 1995

Pardalos, P. M.; Pitsoulis, L.; Mavridou, T.
Parallel Algorithms for Irregularly Structured Problems
DOI: 10.1007/3-540-60321-2_26

Noncontiguous processor allocation algorithms for mesh-connected multicomputers
journal, July 1997

Lo, V.; Windisch, K. J.
IEEE Transactions on Parallel and Distributed Systems, Vol. 8, Issue 7
DOI: 10.1109/71.598346

An approach to mapping parallel programs on hypercube multiprocessors
conference, January 1999

Jose, A.
Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing. PDP'99
DOI: 10.1109/EMPDP.1999.746675

Optimization-based mapping framework for parallel applications
journal, October 2011

Pascual, Jose A.; Miguel-Alonso, Jose; Lozano, Jose A.
Journal of Parallel and Distributed Computing, Vol. 71, Issue 10
DOI: 10.1016/j.jpdc.2011.06.005

A survey for the quadratic assignment problem
journal, January 2007

Loiola, Eliane Maria; de Abreu, Nair Maria Maia; Boaventura-Netto, Paulo Oswaldo
European Journal of Operational Research, Vol. 176, Issue 2
DOI: 10.1016/j.ejor.2005.09.032

Strategies to Map Parallel Applications onto Meshes
book, January 2010

Pascual, Jose A.; Miguel-Alonso, Jose; Lozano, Jose A.
Advances in Intelligent and Soft Computing
DOI: 10.1007/978-3-642-14883-5_26

Low-storage, explicit Runge–Kutta schemes for the compressible Navier–Stokes equations
journal, November 2000

Kennedy, Christopher A.; Carpenter, Mark H.; Lewis, R. Michael
Applied Numerical Mathematics, Vol. 35, Issue 3
DOI: 10.1016/S0168-9274(99)00141-5

Optimization by Simulated Annealing
journal, May 1983

Kirkpatrick, S.; Gelatt, C. D.; Vecchi, M. P.
Science, Vol. 220, Issue 4598
DOI: 10.1126/science.220.4598.671

New insights into the dynamics and morphology of P3HT:PCBM active layers in bulk heterojunctions
journal, January 2013

Carrillo, Jan-Michael Y.; Kumar, Rajeev; Goswami, Monojoy
Physical Chemistry Chemical Physics, Vol. 15, Issue 41
DOI: 10.1039/C3CP53271B

Fast Parallel Algorithms for Short-Range Molecular Dynamics
journal, March 1995

Plimpton, Steve
Journal of Computational Physics, Vol. 117, Issue 1
DOI: 10.1006/jcph.1995.1039

Task mapping stencil computations for non-contiguous allocations
conference, January 2014

Leung, Vitus J.; Bunde, David P.; Ebbers, Jonathan
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '14
DOI: 10.1145/2555243.2555277

Hybridizing S3D into an Exascale application using OpenACC: An approach for moving to multi-petaflops and beyond
conference, November 2012

Levesque, John M.; Sankaran, Ramanan; Grout, Ray
2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
DOI: 10.1109/SC.2012.69

Heuristic-Based Techniques for Mapping Irregular Communication Graphs to Mesh Topologies
conference, September 2011

Bhatele, Abhinav; Kale, Laxmikant V.
2011 IEEE International Conference on High Performance Computing and Communications (HPCC)
DOI: 10.1109/HPCC.2011.109

Communication patterns and allocation strategies
conference, January 2004

Bunde, D. P.; Leung, V. J.; Mache, J.
18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
DOI: 10.1109/IPDPS.2004.1303307

Contention-aware node allocation policy for high-performance capacity systems
conference, January 2012

Jokanovic, Ana; Minkenberg, Cyriel; Sancho, Jose Carlos
Proceedings of the 2012 Interconnection Network Architecture on On-Chip, Multi-Chip Workshop - INA-OCMC '12
DOI: 10.1145/2107763.2107765

Cray Cascade: A scalable HPC system based on a Dragonfly network
conference, November 2012

Faanes, Greg; Bataineh, Abdulla; Roweth, Duncan
2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
DOI: 10.1109/SC.2012.39

Generic topology mapping strategies for large-scale parallel architectures
conference, January 2011

Hoefler, Torsten; Snir, Marc
Proceedings of the international conference on Supercomputing - ICS '11
DOI: 10.1145/1995896.1995909

Works referencing / citing this record:

Communication Characterization and Optimization of Applications Using Topology-Aware Task Mapping on Large Supercomputers
conference, March 2016

Sreepathi, Sarat; D'Azevedo, Ed; Philip, Bobby
ICPE'16: Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering
DOI: 10.1145/2851553.2851575

Similar Records in DOE PAGES and OSTI.GOV collections:

Communication Characterization and Optimization of Applications Using Topology-Aware Task Mapping on Large Supercomputers

Conference Sreepathi, Sarat ; D'Azevedo, Ed ; Philip, Bobby ; ...

On large supercomputers, the job scheduling systems may assign a non-contiguous node allocation for user applications depending on available resources. With parallel applications using MPI (Message Passing Interface), the default process ordering does not take into account the actual physical node layout available to the application. This contributes to non-locality in terms of physical network topology and impacts communication performance of the application. In order to mitigate such performance penalties, this work describes techniques to identify suitable task mapping that takes the layout of the allocated nodes as well as the application's communication behavior into account. During the first phase ...
https://doi.org/10.1145/2851553.2851575
Data Locality Enhancement of Dynamic Simulations for Exascale Computing (Final Report)

Technical Report Shen, Xipeng

The development of modern processors exhibits two trends that complicate the optimizations of modern software. The first is the increasing sensitivity of processors' throughput to irregularities in computation. With more processors produced through a massive integration of simple cores, future systems will increasingly favor regular data-level parallel computations, but deviate from the needs of applications with complex patterns. Some evidences are already shown on Graphic Processing Units (GPU): Irregular data accesses (e.g., indirect references A[D[i]]) and conditional branches are limiting many GPU applications' performance at a level an order of magnitude lower than the peak of GPU. The second hardware ...
https://doi.org/10.2172/1576175

Distributed Halide

Journal Article Denniston, Tyler ; Kamil, Shoaib ; Amarasinghe, Saman - SIGPLAN

Many image processing tasks are naturally expressed as a pipeline of small computational kernels known as stencils. Halide is a popular domain-specific language and compiler designed to implement image processing algorithms. Halide uses simple language constructs to express what to compute and a separate scheduling co-language for expressing when and where to perform the computation. This approach has demonstrated performance comparable to or better than hand-optimized code. Until now, however, Halide has been restricted to parallel shared memory execution, limiting its performance for memory-bandwidth-bound pipelines or large-scale image processing tasks. We present an extension to Halide to support distributed-memory parallel ...
Cited by 11
https://doi.org/10.1145/2851141.2851157

DoE Phase 1 Final Technical Report for MokaBlox

Technical Report Sadoghi, Mohammad ; Zhao, Dongfang ; McGregor, Kirk

MokaBlox is a university spinoff company building the next generation energy-conscious blockchain technology specifically designed for HPC environments to ensure a decentralized and democratic model for a provable and accountable cyber security and privacy. Blockchain infrastructure can be and is starting to be a provable and accountable model of cybersecurity that is valuable especially in terms of chronically under-appreciated resiliency, performance, and energy efficiency. This project aimed to develop an energy-conscious blockchain technology targeted for HPC cybersecurity. We have reimagined HPC cybersecurity and privacy model as built upon a decentralized and democratic computation model of blockchain that is not only ...
