
Title: Unraveling Network-induced Memory Contention: Deeper Insights with Machine Learning

Journal Article · IEEE Transactions on Parallel and Distributed Systems
 [1];  [2];  [3];  [4]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  2. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
  3. Univ. of New Mexico, Albuquerque, NM (United States)
  4. Emory Univ., Atlanta, GA (United States)

Remote Direct Memory Access (RDMA) is expected to be an integral communication mechanism for future exascale systems, enabling asynchronous data transfers so that applications may fully utilize CPU resources while simultaneously sharing data among remote nodes. We examine Network-induced Memory Contention (NiMC) on InfiniBand networks. We expose the interactions between RDMA, main memory, and cache when applications and out-of-band services compete for memory resources. We then explore NiMC's resulting impact on application-level performance. For a range of hardware technologies and HPC workloads, we quantify NiMC and show that its impact grows with scale, resulting in up to 3X performance degradation at scales as small as 8K processes, even in applications that have previously been shown to be performance-resilient in the presence of noise. In addition, this work examines the problem of predicting NiMC's impact on applications by leveraging machine learning and easily accessible performance counters. This approach provides additional insights about the root cause of NiMC and facilitates dynamic selection of potential solutions. Finally, we evaluate three potential techniques to reduce NiMC's impact: hardware offloading, core reservation, and network throttling.

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC04-94AL85000
OSTI ID:
1411596
Report Number(s):
SAND-2017-2071J; 651197; TRN: US1800239
Journal Information:
IEEE Transactions on Parallel and Distributed Systems, Vol. 29, Issue 8; ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 4 works
Citation information provided by
Web of Science
