OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Optimizing PGAS Overhead In A Multi-locale Chapel Implementation Of CoMD

Publication Date: September 30, 2016
Research Org.: Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Resource Relation: Conference: Presented at the PGAS Applications Workshop, Salt Lake City, UT, United States, Nov 14, 2016
Country of Publication: United States

Citation Formats

Haque, R, and Richards, D. Optimizing PGAS Overhead In A Multi-locale Chapel Implementation Of CoMD. United States: N. p., 2016. Web.
Haque, R, & Richards, D. Optimizing PGAS Overhead In A Multi-locale Chapel Implementation Of CoMD. United States.
Haque, R, and Richards, D. 2016. "Optimizing PGAS Overhead In A Multi-locale Chapel Implementation Of CoMD". United States.
@conference{haque_richards_2016,
  title = {Optimizing PGAS Overhead In A Multi-locale Chapel Implementation Of CoMD},
  author = {Haque, R and Richards, D},
  place = {United States},
  year = {2016},
  month = {9}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

Abstract
No abstract prepared.

Similar Records
  • Traditionally, user-level message-passing libraries (e.g., MPI, PVM) offered only a limited set of operations that combined computation with communication. These are collective operations such as reductions (e.g., MPI_Reduce, MPI_Allreduce) that combine the data in the user communication buffers across the set of tasks participating in the operation. Such operations are often used in scientific computing [1], for example, to determine convergence criteria for iterative methods for solving linear equations or to compute vector dot products in the conjugate gradient solver [2]. Consequently, multiple research efforts have sought to optimize the performance of these important operations on modern networks. A wide range of implementation protocols and techniques, such as shared memory, RMA (remote memory access), and programmable network interface cards (NICs), have been explored, e.g., [2,3,4]. The most recent extensions to the MPI standard [5] define atomic reductions as one of the one-sided operations available in MPI-2. In MPI-2, atomic reductions are supported through the MPI_Accumulate operation. This noncollective one-sided operation combines communication and computation in a single interface: it allows the programmer to atomically update remote memory by combining the contents of the local communication buffer with the remote memory buffer. The primary difference between atomic one-sided and collective reductions is that in the former only one processor is involved in the operation and the operation is atomic, which allows multiple processors to independently update the same remote memory location without the explicit synchronization that would otherwise be required to ensure consistency of the result.
The sample application domain that motivated the MPI Forum to add atomic reductions to the MPI-2 standard was electronic structure computational chemistry, with multiple algorithms that relied on the accumulate operation as available in the Global Arrays toolkit [6].
  • Many scientific simulations using the Message Passing Interface (MPI) programming model are sensitive to the performance and scalability of reduction collective operations such as MPI_Allreduce and MPI_Reduce. These operations are the most widely used abstractions for performing mathematical operations over all processes participating in the simulation. In this work, we propose a hierarchical design for implementing reduction operations on multicore systems. This design aims to improve the efficiency of reductions by 1) tailoring the algorithms and customizing the implementations for the various communication mechanisms in the system, 2) providing the ability to configure the depth of the hierarchy to match the system architecture, and 3) providing the ability to progress each level of the hierarchy independently. Using this design, we implement the MPI_Allreduce and MPI_Reduce operations (and their nonblocking variants MPI_Iallreduce and MPI_Ireduce) for all message sizes, and we evaluate them on multiple architectures, including InfiniBand and Cray XT5. We leverage and enhance our existing infrastructure, Cheetah, a framework for implementing hierarchical collective operations, to implement these reductions. The experimental results show that the Cheetah reduction operations outperform production-grade MPI implementations such as the Open MPI default, Cray MPI, and MVAPICH2, demonstrating their efficiency, flexibility, and portability. On InfiniBand systems, with a microbenchmark, a 512-process Cheetah nonblocking Allreduce and Reduce achieve speedups of 23x and 10x, respectively, compared to the default Open MPI reductions. The blocking variants of the reduction operations show similar performance benefits. A 512-process nonblocking Cheetah Allreduce achieves a speedup of 3x compared to the default MVAPICH2 Allreduce implementation. On a Cray XT5 system, a 6144-process Cheetah Allreduce outperforms Cray MPI by 145%.
The evaluation with an application kernel, a Conjugate Gradient solver, shows that the Cheetah reductions speed up total time to solution by 195%, demonstrating the potential benefits for scientific simulations.
  • This article discusses the optimization of the target motion sampling (TMS) temperature treatment method, previously implemented in the Monte Carlo reactor physics code Serpent 2. The TMS method was introduced in [1], and the first practical results were presented at the PHYSOR 2012 conference [2]. The method is a stochastic method for taking the effect of thermal motion into account on-the-fly in a Monte Carlo neutron transport calculation. It is based on sampling the target velocities at collision sites and then utilizing the 0 K cross sections in the target-at-rest frame for reaction sampling. The fact that the total cross section becomes a distributed quantity is handled using rejection sampling techniques. The original implementation of TMS requires 2.0 times more CPU time in a PWR pin-cell case than a conventional Monte Carlo calculation relying on pre-broadened effective cross sections. In an HTGR case examined in this paper, the overhead factor is as high as 3.6. By first changing from a multi-group to a continuous-energy implementation and then fine-tuning a parameter affecting the conservativity of the majorant cross section, it is possible to decrease the overhead factors to 1.4 and 2.3, respectively. Preliminary calculations are also made using a new and still incomplete optimization method in which the temperature of the basis cross section is increased above 0 K. With this new approach it may be possible to decrease the factors to as low as 1.06 and 1.33, respectively, but its functionality has not yet been proven, so these performance measures should be considered preliminary. (authors)
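The contrast the first record above draws between collective reductions and atomic one-sided accumulates can be sketched in plain Python. This is a conceptual toy, not MPI: threads stand in for tasks, and the `Window` class with its `atomic_accumulate` method is an illustrative name, not an MPI API.

```python
import threading

class Window:
    """A shared 'remote' buffer guarded by a lock, emulating the per-update
    atomicity that an atomic accumulate guarantees."""
    def __init__(self, size):
        self.buf = [0.0] * size
        self._lock = threading.Lock()

    def atomic_accumulate(self, local_buf):
        # The whole combine happens under the lock, so concurrent updates
        # from many tasks never interleave partially and need no explicit
        # synchronization on the caller's side.
        with self._lock:
            for i, v in enumerate(local_buf):
                self.buf[i] += v

def collective_reduce(buffers):
    # Collective reduction: every participant's buffer is combined at once.
    return [sum(vals) for vals in zip(*buffers)]

# Four "tasks", each with a local contribution.
contributions = [[float(r)] * 3 for r in range(4)]

win = Window(3)
tasks = [threading.Thread(target=win.atomic_accumulate, args=(b,))
         for b in contributions]
for t in tasks:
    t.start()
for t in tasks:
    t.join()

# Independent atomic updates reach the same result as a collective reduce.
assert win.buf == collective_reduce(contributions) == [6.0, 6.0, 6.0]
```

The point of the sketch is the abstract's last distinction: each "task" updates the shared target independently, yet the atomicity of each combine keeps the result consistent without caller-side synchronization.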
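The hierarchical reduction design described in the Cheetah record above can be sketched as a two-level allreduce. The grouping and function names below are illustrative, not Cheetah's API; this is only a minimal model of the intra-group/inter-group split under the assumption of a sum reduction.

```python
from functools import reduce

def hierarchical_allreduce(values, group_size, op=lambda a, b: a + b):
    """Two-level allreduce: reduce within each group, then across group
    leaders, then broadcast the global result back to every rank."""
    # Level 1: intra-group (e.g., shared-memory) reduction to a leader value.
    groups = [values[i:i + group_size]
              for i in range(0, len(values), group_size)]
    leader_vals = [reduce(op, g) for g in groups]
    # Level 2: inter-group (network) reduction among the leaders.
    total = reduce(op, leader_vals)
    # Broadcast phase: every rank receives the same global result.
    return [total] * len(values)

vals = [1, 2, 3, 4, 5, 6, 7, 8]               # one value per "rank"
assert hierarchical_allreduce(vals, group_size=4) == [36] * 8
```

Configuring the depth of the hierarchy, as the abstract describes, would correspond to applying the intra-group step recursively, one level per communication mechanism.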
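The majorant-based rejection sampling that the TMS record above relies on can be illustrated with a toy model. The cross section and all names here are invented for illustration and are not Serpent 2 code; the sketch only shows why the conservativity of the majorant is the overhead knob the paper tunes.

```python
import random

def sample_acceptances(sigma, sigma_maj, n, rng):
    """Accept each tentative collision with probability sigma(x)/sigma_maj,
    the standard majorant rejection step."""
    accepted = 0
    for _ in range(n):
        x = rng.random()                      # stand-in for the sampled state
        if rng.random() < sigma(x) / sigma_maj:
            accepted += 1
    return accepted

rng = random.Random(1)
sigma = lambda x: 1.0 + 0.5 * x               # toy cross section, peak 1.5
loose = sample_acceptances(sigma, sigma_maj=3.0, n=20000, rng=rng)
tight = sample_acceptances(sigma, sigma_maj=1.5, n=20000, rng=rng)

# Expected acceptance rates are mean(sigma)/sigma_maj: about 0.42 vs 0.83,
# so the tighter (less conservative) majorant roughly halves the number of
# rejected, wasted samples.
assert tight > loose
```

In the same spirit, the fine-tuning the abstract reports trades a tighter majorant (fewer rejections, lower overhead factor) against the risk of the majorant no longer bounding the distributed total cross section.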