OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Characterizing Output Bottlenecks of a Production Supercomputer: Analysis and Implications

Journal Article · ACM Transactions on Storage
DOI: https://doi.org/10.1145/3335205 · OSTI ID: 1607202

This article studies the I/O write behaviors of the Titan supercomputer and its Lustre parallel file stores under production load. The results can inform the design, deployment, and configuration of file systems, as well as the design of I/O software in applications, operating systems, and adaptive I/O libraries.

We propose a statistical benchmarking methodology to measure write performance across I/O configurations, hardware settings, and system conditions. We also introduce two relative measures to quantify the write-performance behaviors of hardware components under production load. In addition to designing experiments and benchmarking on Titan, we verify the experimental results on one real application, XGC, and one real application I/O kernel, HACC IO. Both are representative and widely used to study typical application I/O behaviors.

In summary, we find that Titan’s I/O system is variable across the machine at fine time scales. This variability has two major implications. First, stragglers lessen the benefit of coupled I/O parallelism (striping). Peak median output bandwidths are obtained with parallel writes to many independent files, with no striping or write sharing of files across clients (compute nodes). I/O parallelism is most effective when the application, or its I/O libraries, distributes the I/O load so that each target stores files for multiple clients and each client writes files on multiple targets in a balanced way with minimal contention. Second, our results suggest that the potential benefit of dynamic adaptation is limited. In particular, it is not fruitful to attempt to identify “good locations” in the machine or in the file system: component performance is driven by transient load conditions, and past performance is not a useful predictor of future performance. For example, we do not observe predictable diurnal load patterns.
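One way to see why stragglers erode the benefit of striping: a file striped evenly over k targets is complete only when its slowest target finishes, so its effective bandwidth is k times the minimum per-target bandwidth rather than the sum of the targets' bandwidths. The Python sketch below is a toy Monte Carlo model under stated assumptions, not the article's benchmarking methodology or its relative measures. The lognormal per-target bandwidth distribution, its parameters, the 1024 MB file size, and the function names `striped_write_bandwidth` and `median_efficiency` are all hypothetical and chosen only for illustration.

```python
import random
import statistics


def striped_write_bandwidth(file_mb: float, bandwidths_mb_s: list[float]) -> float:
    """Effective bandwidth (MB/s) of one file striped evenly over len(bandwidths_mb_s) targets.

    Each target writes file_mb / k; the write completes only when the slowest
    target (the straggler) finishes its chunk.
    """
    k = len(bandwidths_mb_s)
    chunk_mb = file_mb / k
    finish_time_s = max(chunk_mb / b for b in bandwidths_mb_s)
    return file_mb / finish_time_s


def median_efficiency(stripe_count: int, trials: int = 2000, seed: int = 1) -> float:
    """Median ratio of achieved striped bandwidth to a straggler-free ideal.

    Hypothetical model: each target's bandwidth is drawn independently from a
    lognormal distribution for every write, standing in for transient load.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(trials):
        bws = [rng.lognormvariate(6.0, 0.5) for _ in range(stripe_count)]
        # Ideal aggregate if data were split in proportion to each target's
        # speed, so no target waits on another.
        ideal_mb_s = sum(bws)
        ratios.append(striped_write_bandwidth(1024.0, bws) / ideal_mb_s)
    return statistics.median(ratios)


if __name__ == "__main__":
    for k in (1, 4, 16, 64):
        print(f"stripe count {k:3d}: median efficiency {median_efficiency(k):.3f}")
```

In this model the efficiency reduces to min/mean of the per-target bandwidths, which is essentially 1.0 for a single target and shrinks as the stripe count grows, because the minimum of more independent draws tends to be smaller. Under the same assumptions, writing independent unstriped files removes that coupling: each file finishes at its own target's pace instead of waiting for the slowest of k targets.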

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE Office of Energy Efficiency and Renewable Energy (EERE), Renewable Power Office. Wind Energy Technologies Office; National Science Foundation (NSF); USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC05-00OR22725; AC04-94AL85000; CNS-1245997; NA0003525
OSTI ID:
1607202
Alternate ID(s):
OSTI ID: 1618106
Report Number(s):
SAND-2019-9925J
Journal Information:
ACM Transactions on Storage, Vol. 15, Issue 4; ISSN 1553-3077
Publisher:
Association for Computing Machinery (ACM)
Country of Publication:
United States
Language:
English
