OSTI.GOV
U.S. Department of Energy, Office of Scientific and Technical Information

Title: I/O load balancing for big data HPC applications

Abstract

High Performance Computing (HPC) big data problems require efficient distributed storage systems. However, at scale, such storage systems often experience load imbalance and resource contention due to two factors: the bursty nature of scientific application I/O, and a complex I/O path that lacks centralized arbitration and control. For example, the extant Lustre parallel file system, which supports many HPC centers, comprises numerous components connected via custom network topologies and serves the varying demands of a large number of users and applications. Consequently, some storage servers can become more loaded than others, which creates bottlenecks and reduces overall application I/O performance. Existing solutions typically focus on per-application load balancing and are thus less effective because they lack a global view of the system. In this paper, we propose a data-driven approach to load balance the I/O servers at scale, targeted at Lustre deployments. To this end, we design a global mapper on the Lustre Metadata Server, which gathers runtime statistics from key storage components on the I/O path and applies Markov chain modeling and a minimum-cost maximum-flow algorithm to decide where data should be placed. Evaluation using a realistic system simulator and a real setup shows that our approach yields better load balancing, which in turn can improve end-to-end performance.
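The mechanism the abstract outlines, Markov chain load prediction feeding a minimum-cost maximum-flow placement decision, can be sketched in a few lines. The following Python sketch is an illustrative reconstruction only: the load states, transition matrix, cost table, function names, and the networkx-based solver are assumptions made for exposition, not the authors' implementation.

    import networkx as nx
    import numpy as np

    # (1) Markov chain over discretized server load states (hypothetical states).
    STATES = ["low", "med", "high"]
    # Row-stochastic transition matrix; in practice this would be learned from
    # runtime statistics collected along the I/O path (values here are made up).
    P = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.1, 0.3, 0.6]])

    def predict_next_state(current):
        # Most likely next load state for a server, given its current state.
        return STATES[int(np.argmax(P[STATES.index(current)]))]

    # Cost of placing one stripe on a server in a given predicted state.
    STATE_COST = {"low": 1, "med": 5, "high": 20}

    # (2) Minimum-cost maximum-flow assignment of stripes to servers.
    def place_stripes(stripes, servers, current_states, per_server_cap):
        # Build a flow network: source -> stripes -> servers -> sink.
        G = nx.DiGraph()
        G.add_node("src", demand=-len(stripes))
        G.add_node("sink", demand=len(stripes))
        for st in stripes:
            G.add_edge("src", st, capacity=1, weight=0)
        for sv in servers:
            cost = STATE_COST[predict_next_state(current_states[sv])]
            for st in stripes:
                G.add_edge(st, sv, capacity=1, weight=cost)
            G.add_edge(sv, "sink", capacity=per_server_cap, weight=0)
        flow = nx.min_cost_flow(G)
        # Each stripe carries one unit of flow to exactly one server.
        return {st: sv for st in stripes for sv in servers if flow[st].get(sv, 0)}

    # Example: four stripes, three OSSs, at most two stripes per OSS.
    states = {"oss0": "high", "oss1": "low", "oss2": "med"}
    print(place_stripes(["s0", "s1", "s2", "s3"], list(states), states, 2))

In the paper's design, the statistics driving such a model are gathered at the Lustre Metadata Server from key storage components on the I/O path; the hard-coded transition matrix and costs above stand in for that runtime data purely for illustration.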

Authors:
Paul, Arnab K. [1]; Goyal, Arpit [1]; Wang, Feiyi [2]; Oral, H. Sarp [2]; Butt, Ali R. [3]; Brim, Michael J. [2]; Srinivasa, Sangeetha B. [1]
  1. Virginia Polytechnic Institute and State University
  2. ORNL
  3. Virginia Tech, Blacksburg, VA
Publication Date:
January 2018
Research Org.:
Oak Ridge National Laboratory, Oak Ridge Leadership Computing Facility (OLCF); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1415911
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 2017 IEEE International Conference on Big Data, Boston, Massachusetts, United States of America, December 11-14, 2017
Country of Publication:
United States
Language:
English

Citation Formats

Paul, Arnab K., Goyal, Arpit, Wang, Feiyi, Oral, H Sarp, Butt, Ali R., Brim, Michael J., and Srinivasa, Sangeetha B. I/O load balancing for big data HPC applications. United States: N. p., 2018. Web. doi:10.1109/BigData.2017.8257931.
Paul, Arnab K., Goyal, Arpit, Wang, Feiyi, Oral, H Sarp, Butt, Ali R., Brim, Michael J., & Srinivasa, Sangeetha B. I/O load balancing for big data HPC applications. United States. doi:10.1109/BigData.2017.8257931.
Paul, Arnab K., Goyal, Arpit, Wang, Feiyi, Oral, H Sarp, Butt, Ali R., Brim, Michael J., and Srinivasa, Sangeetha B. 2018. "I/O load balancing for big data HPC applications". United States. doi:10.1109/BigData.2017.8257931. https://www.osti.gov/servlets/purl/1415911.
@article{osti_1415911,
title = {I/O load balancing for big data HPC applications},
author = {Paul, Arnab K. and Goyal, Arpit and Wang, Feiyi and Oral, H Sarp and Butt, Ali R. and Brim, Michael J. and Srinivasa, Sangeetha B.},
abstractNote = {High Performance Computing (HPC) big data problems require efficient distributed storage systems. However, at scale, such storage systems often experience load imbalance and resource contention due to two factors: the bursty nature of scientific application I/O, and a complex I/O path that lacks centralized arbitration and control. For example, the extant Lustre parallel file system, which supports many HPC centers, comprises numerous components connected via custom network topologies and serves the varying demands of a large number of users and applications. Consequently, some storage servers can become more loaded than others, which creates bottlenecks and reduces overall application I/O performance. Existing solutions typically focus on per-application load balancing and are thus less effective because they lack a global view of the system. In this paper, we propose a data-driven approach to load balance the I/O servers at scale, targeted at Lustre deployments. To this end, we design a global mapper on the Lustre Metadata Server, which gathers runtime statistics from key storage components on the I/O path and applies Markov chain modeling and a minimum-cost maximum-flow algorithm to decide where data should be placed. Evaluation using a realistic system simulator and a real setup shows that our approach yields better load balancing, which in turn can improve end-to-end performance.},
doi = {10.1109/BigData.2017.8257931},
place = {United States},
year = {2018},
month = {1}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
