Title: Dynamic Adaptable Asynchronous Progress Model for MPI RMA Multiphase Applications

Journal Article · IEEE Transactions on Parallel and Distributed Systems
[1]; [2]; [3]; [1]; [4]; [4]
  1. Argonne National Lab. (ANL), Lemont, IL (United States). Mathematics and Computer Science Division
  2. Barcelona Supercomputing Center (BSC), Barcelona (Spain)
  3. Intel Corp., Santa Clara, CA (United States)
  4. RIKEN Advanced Inst. for Computational Science (AICS), Kobe (Japan)

Casper is a process-based asynchronous progress model for MPI one-sided communication on multi- and many-core architectures. One-sided communication is not truly one-sided in most MPI implementations: the target process still relies on software progress to complete incoming operations. Casper allows the user to dedicate an arbitrary number of cores to background ghost processes and transparently redirects RMA operations to those ghost processes by using PMPI redirection and MPI-3 shared-memory technologies. Although Casper benefits applications that suffer from a lack of asynchronous progress, its operation redirection design might not effectively support complex multiphase applications, which often involve dynamically changing communication density and computing workloads. In this paper, we present an adaptive mechanism in Casper that addresses the limitation of static asynchronous progress in multiphase applications. To this end, we exploit two adaptive strategies, a user-guided strategy and a fully transparent, automatic strategy based on self-profiling and prediction, to dynamically reconfigure asynchronous progress in Casper according to real-time performance characteristics during multiphase execution. We evaluate the adaptive approaches with both microbenchmarks and a real quantum chemistry application suite, NWChem, on the Cray XC30 supercomputer and an Intel Omni-Path cluster.

Research Organization:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); Ministry of Economic Affairs and Digital Transformation of Spain (MINECO)
Contributing Organization:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC)
Grant/Contract Number:
AC02-06CH11357; IJCI-2015-23266
OSTI ID:
1467874
Journal Information:
IEEE Transactions on Parallel and Distributed Systems, Vol. 29, Issue 9; ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 6 works (citation information provided by Web of Science)