Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery
Conference · OSTI ID:1000413
- ORNL
- North Carolina State University
The procurement and optimized utilization of petascale supercomputers and centers are a renewed national priority. Sustaining the performance and availability of such large centers is a key technical challenge that significantly affects their usability. As recent research shows, storage systems can be a primary source of faults, leading to the unavailability of even today's supercomputers. Compute center performance suffers both from jobs that are frequently resubmitted because of data unavailability and from a lack of coordination between I/O activities and job scheduling. In this work, we explore two mechanisms to address the availability of job input/output data and to improve center-wide performance: the coordination of job scheduling with data staging/offloading, and on-demand reconstruction of job input data. Fundamental to both mechanisms is the efficient management of transient data, in the way it is scheduled and recovered. Collectively, from a center standpoint, these techniques optimize resource usage and increase data and service availability. From a user job standpoint, they reduce job turnaround time and optimize the usage of allocated time. We have implemented our approaches within commonly used supercomputer software tools such as the PBS scheduler and the Lustre parallel file system. We gathered reconstruction data from a production supercomputer environment using multiple data sources, and conducted simulations based on the measured data recovery performance, job traces, and staged data logs from leadership-class supercomputer centers. Our results indicate that the average waiting time of jobs is reduced, and that this reduction grows significantly for larger jobs and as data is striped over more I/O nodes.
- Research Organization:
- Oak Ridge National Laboratory (ORNL); Center for Computational Sciences
- Sponsoring Organization:
- ORNL LDRD Director's R&D
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1000413
- Country of Publication:
- United States
- Language:
- English
Similar Records
Improving the Availability of Supercomputer Job Input Data Using Temporal Replication
Conference · Mon Jun 01 00:00:00 EDT 2009 · OSTI ID:1004448
Timely Result-Data Offloading for Improved HPC Center Scratch Provisioning and Serviceability
Journal Article · Fri Dec 31 23:00:00 EST 2010 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1020778
Recovering Transient Data: Automated On-demand Data Reconstruction and Offloading for Supercomputers
Journal Article · Sun Dec 31 23:00:00 EST 2006 · ACM SIGOPS Operating Systems Review · OSTI ID:930882