DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Can I/O Variability Be Reduced on QoS-Less HPC Storage Systems?

Abstract

On a production high-performance computing (HPC) system, where storage devices are shared among multiple applications and managed on a best-effort basis, I/O contention is often a major problem. In this paper, we propose balanced, messaging-based re-routing combined with throttling at the middleware level. This work tackles two key challenges that have not been fully resolved in the past: whether I/O variability can be reduced on a QoS-less HPC storage system, and how to design a runtime scheduling system that scales to a large number of cores. The proposed scheme uses a two-level messaging system to re-route I/O requests to less congested storage locations, improving write performance, and throttles re-routing to limit the impact on reads. An analytical model is derived to guide the choice of the optimal throttling factor. We thoroughly analyze the overhead of the virtual messaging layer and explore whether in-transit buffering is effective in managing I/O variability. Contrary to intuition, in-transit buffering cannot completely solve the problem: it can reduce the absolute variability but not the relative variability. Finally, the proposed scheme is verified against a synthetic benchmark and is also used by production applications.
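The abstract describes re-routing with throttling only at a high level. The Python sketch below illustrates the general idea under stated assumptions. It treats a target's outstanding-request count as its congestion signal. It gives each writer a re-routing budget equal to a throttling factor times its request count. The names `Target` and `route_requests` are hypothetical. This is not the authors' two-level messaging implementation, and it does not use their analytical model to choose the factor.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """A storage target (e.g. a file-system OST) with a congestion estimate."""
    name: str
    queued: int  # outstanding requests; assumed proxy for congestion


def route_requests(home: str, targets: list[Target], n_requests: int,
                   throttle: float) -> list[str]:
    """Place n_requests from one writer whose default target is `home`.

    `throttle` in [0, 1] caps the fraction of the writer's requests that may
    be re-routed away from `home`. Fewer re-routed requests keep the data less
    scattered, which is the read-side cost that throttling is meant to limit.
    """
    if not 0.0 <= throttle <= 1.0:
        raise ValueError("throttle must be in [0, 1]")
    by_name = {t.name: t for t in targets}
    default = by_name[home]
    budget = int(throttle * n_requests)  # max requests allowed to leave `home`
    placement = []
    for _ in range(n_requests):
        # Least-loaded target right now (ties go to the earliest in `targets`).
        least = min(targets, key=lambda t: t.queued)
        if budget > 0 and least.queued < default.queued:
            chosen = least
            budget -= 1
        else:
            chosen = default
        chosen.queued += 1  # the placed request adds load to its target
        placement.append(chosen.name)
    return placement


if __name__ == "__main__":
    targets = [Target("A", 10), Target("B", 0), Target("C", 2)]
    print(route_requests("A", targets, 4, throttle=0.5))  # prints ['B', 'B', 'A', 'A']
```

In this example, `throttle=0.5` lets only two of the four requests leave the congested home target `A`. `throttle=0` disables re-routing entirely. `throttle=1` lets every request go to whichever target is least loaded at that moment. Choosing this factor well is the problem the paper's analytical model addresses; the sketch simply takes it as an input.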

Authors:
Huang, Dan [1]; Liu, Qing [1]; Choi, Jong Youl [2]; Podhorszki, Norbert [2]; Klasky, Scott A. [2]; Logan, Jeremy [2]; Ostrouchov, George [2]; He, Xubin [3]; Wolf, Matthew D. [2]
  1. New Jersey Inst. of Technology, Newark, NJ (United States)
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  3. Temple Univ., Philadelphia, PA (United States)
Publication Date:
November 19, 2018
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1559601
Grant/Contract Number:  
AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
IEEE Transactions on Computers
Additional Journal Information:
Journal Volume: 68; Journal Issue: 5; Journal ID: ISSN 0018-9340
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; High-performance computing; storage; quality of service; variability

Citation Formats

Huang, Dan, Liu, Qing, Choi, Jong Youl, Podhorszki, Norbert, Klasky, Scott A., Logan, Jeremy, Ostrouchov, George, He, Xubin, and Wolf, Matthew D. Can I/O Variability Be Reduced on QoS-Less HPC Storage Systems? United States: N. p., 2018. Web. doi:10.1109/TC.2018.2881709.
Huang, Dan, Liu, Qing, Choi, Jong Youl, Podhorszki, Norbert, Klasky, Scott A., Logan, Jeremy, Ostrouchov, George, He, Xubin, & Wolf, Matthew D. Can I/O Variability Be Reduced on QoS-Less HPC Storage Systems? United States. https://doi.org/10.1109/TC.2018.2881709
Huang, Dan, Liu, Qing, Choi, Jong Youl, Podhorszki, Norbert, Klasky, Scott A., Logan, Jeremy, Ostrouchov, George, He, Xubin, and Wolf, Matthew D. 2018. "Can I/O Variability Be Reduced on QoS-Less HPC Storage Systems?" United States. https://doi.org/10.1109/TC.2018.2881709. https://www.osti.gov/servlets/purl/1559601.
@article{osti_1559601,
title = {Can I/O Variability Be Reduced on QoS-Less HPC Storage Systems?},
author = {Huang, Dan and Liu, Qing and Choi, Jong Youl and Podhorszki, Norbert and Klasky, Scott A. and Logan, Jeremy and Ostrouchov, George and He, Xubin and Wolf, Matthew D.},
abstractNote = {On a production high-performance computing (HPC) system, where storage devices are shared among multiple applications and managed on a best-effort basis, I/O contention is often a major problem. In this paper, we propose balanced, messaging-based re-routing combined with throttling at the middleware level. This work tackles two key challenges that have not been fully resolved in the past: whether I/O variability can be reduced on a QoS-less HPC storage system, and how to design a runtime scheduling system that scales to a large number of cores. The proposed scheme uses a two-level messaging system to re-route I/O requests to less congested storage locations, improving write performance, and throttles re-routing to limit the impact on reads. An analytical model is derived to guide the choice of the optimal throttling factor. We thoroughly analyze the overhead of the virtual messaging layer and explore whether in-transit buffering is effective in managing I/O variability. Contrary to intuition, in-transit buffering cannot completely solve the problem: it can reduce the absolute variability but not the relative variability. Finally, the proposed scheme is verified against a synthetic benchmark and is also used by production applications.},
doi = {10.1109/TC.2018.2881709},
journal = {IEEE Transactions on Computers},
number = 5,
volume = 68,
place = {United States},
year = {2018},
month = nov
}


Citation Metrics:
Cited by: 2 works
Citation information provided by Web of Science
