
DOE PAGES

This content will become publicly available on November 21, 2018

Title: Unraveling Network-induced Memory Contention: Deeper Insights with Machine Learning

Remote Direct Memory Access (RDMA) is expected to be an integral communication mechanism for future exascale systems, enabling asynchronous data transfers so that applications may fully utilize CPU resources while simultaneously sharing data among remote nodes. We examine Network-induced Memory Contention (NiMC) on InfiniBand networks. We expose the interactions between RDMA, main memory and cache when applications and out-of-band services compete for memory resources, and we then explore NiMC's resulting impact on application-level performance. For a range of hardware technologies and HPC workloads, we quantify NiMC and show that its impact grows with scale, resulting in up to 3X performance degradation at scales as small as 8K processes, even in applications that have previously been shown to be performance-resilient in the presence of noise. In addition, this work examines the problem of predicting NiMC's impact on applications by leveraging machine learning and easily accessible performance counters. This approach provides additional insights into the root cause of NiMC and facilitates dynamic selection of potential solutions. Finally, we evaluate three potential techniques to reduce NiMC's impact, namely hardware offloading, core reservation and network throttling.
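The abstract describes predicting NiMC's impact from easily accessible performance counters using machine learning. As a rough illustration only, and not the authors' implementation, the sketch below fits an off-the-shelf regressor to hypothetical counter data; the counter names (llc_miss_rate, mem_bw_util, rdma_bytes_norm), the synthetic slowdown labels, and the choice of a random-forest model are all assumptions made for this example.

# Illustrative sketch (not the paper's code): predict NiMC-induced slowdown
# from node-level performance counters with an off-the-shelf regression model.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical samples: each row holds counters gathered while an application
# ran alongside RDMA traffic; the target is the observed slowdown factor.
rng = np.random.default_rng(0)
X = rng.random((500, 3))          # columns: [llc_miss_rate, mem_bw_util, rdma_bytes_norm]
y = 1.0 + 2.0 * X[:, 1] * X[:, 2] + 0.1 * rng.random(500)   # synthetic slowdown labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, pred))

# Feature importances hint at which counters drive the predicted contention,
# loosely mirroring the kind of root-cause insight the abstract mentions.
print(dict(zip(["llc_miss_rate", "mem_bw_util", "rdma_bytes_norm"],
               model.feature_importances_)))

In practice such a model would be trained on measured application slowdowns and real counter readings; the point of the sketch is only that inexpensive counters can feed a standard regressor to estimate contention and guide the choice of mitigation.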
Authors:
 Groves, Taylor Liles [1]; Grant, Ryan [2]; Gonzales, Aaron [3]; Arnold, Dorian [4]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  2. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
  3. Univ. of New Mexico, Albuquerque, NM (United States)
  4. Emory Univ., Atlanta, GA (United States)
Publication Date:
November 2017
Report Number(s):
SAND-2017-2071J
Journal ID: ISSN 1045-9219; 651197; TRN: US1800239
Grant/Contract Number:
AC04-94AL85000
Type:
Accepted Manuscript
Journal Name:
IEEE Transactions on Parallel and Distributed Systems
Additional Journal Information:
Journal Volume: 29; Journal Issue: 8; Journal ID: ISSN 1045-9219
Publisher:
IEEE
Research Org:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org:
USDOE National Nuclear Security Administration (NNSA)
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; measurement; performance; memory contention; networks; asynchronous communication; machine learning
OSTI Identifier:
1411596

Groves, Taylor Liles, Grant, Ryan, Gonzales, Aaron, and Arnold, Dorian. Unraveling Network-induced Memory Contention: Deeper Insights with Machine Learning. United States: N. p., Web. doi:10.1109/tpds.2017.2773483.
Groves, Taylor Liles, Grant, Ryan, Gonzales, Aaron, & Arnold, Dorian. Unraveling Network-induced Memory Contention: Deeper Insights with Machine Learning. United States. doi:10.1109/tpds.2017.2773483.
Groves, Taylor Liles, Grant, Ryan, Gonzales, Aaron, and Arnold, Dorian. 2017. "Unraveling Network-induced Memory Contention: Deeper Insights with Machine Learning". United States. doi:10.1109/tpds.2017.2773483.
@article{osti_1411596,
title = {Unraveling Network-induced Memory Contention: Deeper Insights with Machine Learning},
author = {Groves, Taylor Liles and Grant, Ryan and Gonzales, Aaron and Arnold, Dorian},
abstractNote = {Remote Direct Memory Access (RDMA) is expected to be an integral communication mechanism for future exascale systems, enabling asynchronous data transfers so that applications may fully utilize CPU resources while simultaneously sharing data among remote nodes. We examine Network-induced Memory Contention (NiMC) on InfiniBand networks. We expose the interactions between RDMA, main memory and cache when applications and out-of-band services compete for memory resources, and we then explore NiMC's resulting impact on application-level performance. For a range of hardware technologies and HPC workloads, we quantify NiMC and show that its impact grows with scale, resulting in up to 3X performance degradation at scales as small as 8K processes, even in applications that have previously been shown to be performance-resilient in the presence of noise. In addition, this work examines the problem of predicting NiMC's impact on applications by leveraging machine learning and easily accessible performance counters. This approach provides additional insights into the root cause of NiMC and facilitates dynamic selection of potential solutions. Finally, we evaluate three potential techniques to reduce NiMC's impact, namely hardware offloading, core reservation and network throttling.},
doi = {10.1109/tpds.2017.2773483},
journal = {IEEE Transactions on Parallel and Distributed Systems},
number = 8,
volume = 29,
place = {United States},
year = {2017},
month = {11}
}