OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery

Conference · OSTI ID: 1000413

Procurement and optimized utilization of petascale supercomputers and centers are a renewed national priority. Sustaining the performance and availability of such large centers is a key technical challenge that significantly affects their usability. As recent research shows, storage systems can be a primary source of faults, leading to unavailability of even today's supercomputers. Due to data unavailability, jobs are frequently resubmitted, resulting in reduced compute center performance as well as a lack of coordination between I/O activities and job scheduling. In this work, we explore two mechanisms to address the availability of job input/output data and to improve center-wide performance: the coordination of job scheduling with data staging/offloading, and on-demand reconstruction of job input data. Fundamental to both mechanisms is the efficient management of transient data, specifically how it is scheduled and recovered. Collectively, from a center standpoint, these techniques optimize resource usage and increase the center's data and service availability. From a user job standpoint, they reduce job turnaround time and optimize the usage of allocated time. We have implemented our approaches within commonly used supercomputer software tools such as the PBS scheduler and the Lustre parallel file system. We have gathered reconstruction data from a production supercomputer environment using multiple data sources. We conducted simulations based on the measured data recovery performance, the job traces, and staged data logs from leadership-class supercomputer centers. Our results indicate that the average waiting time of jobs is reduced. This reduction grows significantly for larger jobs and as data is striped over more I/O nodes.

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). National Center for Computational Sciences (NCCS)
Sponsoring Organization:
USDOE Laboratory Directed Research and Development (LDRD) Program
DOE Contract Number:
DE-AC05-00OR22725
OSTI ID:
1000413
Resource Relation:
Conference: Supercomputing 2007, Reno, NV, USA, November 10-16, 2007
Country of Publication:
United States
Language:
English