OSTI.GOV: U.S. Department of Energy
Office of Scientific and Technical Information

Title: Automatic and Transparent Resource Contention Mitigation for Improving Large-Scale Parallel File System Performance

Abstract

Proportional to the scale increases in HPC systems, many scientific applications are becoming increasingly data intensive, and parallel I/O has become one of the dominant factors impacting the large-scale HPC application performance. On a typical large-scale HPC system, we have observed that the lack of a global workload coordination coupled with the shared nature of storage systems cause load imbalance and resource contention over the end-to-end I/O paths resulting in severe performance degradation. I/O load imbalance on HPC systems is generally a self-inflicted wound and mostly occurs between the I/O paths and resources consumed by each individual job. In this paper, we introduce TAPP-IO, a dynamic, shared load balancing framework for mitigating resource contention. TAPP-IO extends our previous work and solves two major limitations: First, it transparently intercepts file creation calls during runtime to balance the workload over all available storage targets. The usage of TAPP-IO requires no application source code modifications and is independent from any I/O middleware. The framework can be applied to almost any HPC platform and is suitable for systems that lack a centralized file system resource manager. Second, the framework proposes a new placement strategy to support not only file-per-process I/O, but also single shared file I/O. This opens the door to a new class of scientific applications that can leverage the placement library for improved performance. We demonstrate the effectiveness of our integration on the Titan system at the Oak Ridge National Laboratory. Our experiments with a synthetic benchmark and real-world HPC workload show that, even in a noisy production environment, TAPP-IO can improve large-scale application performance significantly.

Authors:
Neuwirth, Sarah; Wang, Feiyi; Oral, Sarp; Bruening, Ulrich
Publication Date:
December 2017
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1567457
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS)
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Computer Science; Engineering

Citation Formats

Neuwirth, Sarah, Wang, Feiyi, Oral, Sarp, and Bruening, Ulrich. Automatic and Transparent Resource Contention Mitigation for Improving Large-Scale Parallel File System Performance. United States: N. p., 2017. Web. doi:10.1109/ICPADS.2017.00084.
Neuwirth, Sarah, Wang, Feiyi, Oral, Sarp, & Bruening, Ulrich. Automatic and Transparent Resource Contention Mitigation for Improving Large-Scale Parallel File System Performance. United States. doi:10.1109/ICPADS.2017.00084.
Neuwirth, Sarah, Wang, Feiyi, Oral, Sarp, and Bruening, Ulrich. 2017. "Automatic and Transparent Resource Contention Mitigation for Improving Large-Scale Parallel File System Performance". United States. doi:10.1109/ICPADS.2017.00084.
@inproceedings{osti_1567457,
title = {Automatic and Transparent Resource Contention Mitigation for Improving Large-Scale Parallel File System Performance},
author = {Neuwirth, Sarah and Wang, Feiyi and Oral, Sarp and Bruening, Ulrich},
abstractNote = {Proportional to the scale increases in HPC systems, many scientific applications are becoming increasingly data intensive, and parallel I/O has become one of the dominant factors impacting the large-scale HPC application performance. On a typical large-scale HPC system, we have observed that the lack of a global workload coordination coupled with the shared nature of storage systems cause load imbalance and resource contention over the end-to-end I/O paths resulting in severe performance degradation. I/O load imbalance on HPC systems is generally a self-inflicted wound and mostly occurs between the I/O paths and resources consumed by each individual job. In this paper, we introduce TAPP-IO, a dynamic, shared load balancing framework for mitigating resource contention. TAPP-IO extends our previous work and solves two major limitations: First, it transparently intercepts file creation calls during runtime to balance the workload over all available storage targets. The usage of TAPP-IO requires no application source code modifications and is independent from any I/O middleware. The framework can be applied to almost any HPC platform and is suitable for systems that lack a centralized file system resource manager. Second, the framework proposes a new placement strategy to support not only file-per-process I/O, but also single shared file I/O. This opens the door to a new class of scientific applications that can leverage the placement library for improved performance. We demonstrate the effectiveness of our integration on the Titan system at the Oak Ridge National Laboratory. Our experiments with a synthetic benchmark and real-world HPC workload show that, even in a noisy production environment, TAPP-IO can improve large-scale application performance significantly.},
doi = {10.1109/ICPADS.2017.00084},
booktitle = {2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS)},
place = {United States},
year = {2017},
month = {12}
}
