Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery

Zhang, Zhe; Wang, Chao; Vazhkudai, Sudharshan S; Ma, Xiaosong; Pike, Gregory; Cobb, John W; Mueller, Frank

Conference · Mon Jan 01 04:00:00 EST 2007

OSTI ID:1000413

Zhang, Zhe [1]; Wang, Chao [1]; Vazhkudai, Sudharshan S [1]; Ma, Xiaosong [1]; Pike, Gregory [1]; Cobb, John W [1]; Mueller, Frank [2]

[1] ORNL
[2] North Carolina State University

Procurement and optimized utilization of Petascale supercomputers and centers are a renewed national priority. Sustaining the performance and availability of such large centers is a key technical challenge that significantly affects their usability. As recent research shows, storage systems can be a primary source of faults that make even today's supercomputers unavailable. Because of data unavailability, jobs are frequently resubmitted, resulting in reduced compute center performance as well as a lack of coordination between I/O activities and job scheduling. In this work, we explore two mechanisms to address the availability of job input/output data and to improve center-wide performance: the coordination of job scheduling with data staging/offloading, and on-demand reconstruction of job input data. Fundamental to both mechanisms is the efficient management of transient data, both in how it is scheduled and in how it is recovered. Collectively, from a center standpoint, these techniques optimize resource usage and increase the center's data and service availability. From a user job standpoint, they reduce job turnaround time and optimize the usage of allocated time. We have implemented our approaches within commonly used supercomputer software tools such as the PBS scheduler and the Lustre parallel file system. We have gathered reconstruction data from a production supercomputer environment using multiple data sources. We conducted simulations based on the measured data recovery performance, the job traces, and the staged data logs from leadership-class supercomputer centers. Our results indicate that the average waiting time of jobs is reduced. The reduction grows significantly for larger jobs and as data is striped over more I/O nodes.

Research Organization: Oak Ridge National Laboratory (ORNL); Center for Computational Sciences

Sponsoring Organization: ORNL LDRD Director's R&D

DOE Contract Number: AC05-00OR22725

OSTI ID: 1000413

Country of Publication: United States

Language: English

Similar Records

Improving the Availability of Supercomputer Job Input Data Using Temporal Replication

Conference · Mon Jun 01 00:00:00 EDT 2009 · OSTI ID:1004448

Timely Result-Data Offloading for Improved HPC Center Scratch Provisioning and Serviceability

Journal Article · Fri Dec 31 23:00:00 EST 2010 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1020778

Recovering Transient Data: Automated On-demand Data Reconstruction and Offloading for Supercomputers

Journal Article · Sun Dec 31 23:00:00 EST 2006 · ACM SIGOPS Operating Systems Review · OSTI ID:930882

Related Subjects

36 MATERIALS SCIENCE
A CENTERS
AVAILABILITY
MANAGEMENT
Optimize center performance
PERFORMANCE
PROCUREMENT
PRODUCTION
STORAGE
SUPERCOMPUTERS
TRANSIENTS
data recovery
scratch storage
