Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery

Conference · January 1, 2007

OSTI ID:1000413

Zhang, Zhe [1]; Wang, Chao [1]; Vazhkudai, Sudharshan S [1]; Ma, Xiaosong [1]; Pike, Gregory [1]; Cobb, John W [1]; Mueller, Frank [2]

[1] ORNL
[2] North Carolina State University

Procurement and optimized utilization of petascale supercomputers and centers is a renewed national priority. Sustained performance and availability of such large centers is a key technical challenge that significantly affects their usability. As recent research shows, storage systems can be a primary fault source, leading to unavailability of even today's supercomputers. Due to data unavailability, jobs are frequently resubmitted, resulting in reduced compute center performance as well as in a lack of coordination between I/O activities and job scheduling. In this work, we explore two mechanisms, namely the coordination of job scheduling with data staging/offloading and on-demand reconstruction of job input data, to address the availability of job input/output data and to improve center-wide performance. Fundamental to both mechanisms is the efficient management of transient data: the way it is scheduled and recovered. Collectively, from a center standpoint, these techniques optimize resource usage and increase data/service availability. From a user job standpoint, they reduce job turnaround time and optimize the usage of allocated time. We have implemented our approaches within commonly used supercomputer software tools such as the PBS scheduler and the Lustre parallel file system. We have gathered reconstruction data from a production supercomputer environment using multiple data sources. We conducted simulations based on the measured data recovery performance, the job traces, and staged data logs from leadership-class supercomputer centers. Our results indicate that the average waiting time of jobs is reduced. This reduction grows significantly for larger jobs and as data is striped over more I/O nodes.

Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). National Center for Computational Sciences (NCCS)

Sponsoring Organization: USDOE Laboratory Directed Research and Development (LDRD) Program

DOE Contract Number: DE-AC05-00OR22725

OSTI ID: 1000413

Resource Relation: Conference: Supercomputing 2007, Reno, NV, USA, November 10-16, 2007

Country of Publication: United States

Language: English

Similar Records

Scientific Application Requirements for Leadership Computing at the Exascale

Technical Report · December 1, 2007 · OSTI ID:1000413

Ahern, Sean; Alam, Sadaf R; Fahey, Mark R; +8 more

Improving the Availability of Supercomputer Job Input Data Using Temporal Replication

Conference · June 1, 2009 · OSTI ID:1000413

Wang, Chao; Zhang, Zhe; Ma, Xiaosong; +2 more

Timely Result-Data Offloading for Improved HPC Center Scratch Provisioning and Serviceability

Journal Article · January 1, 2011 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1000413

Monti, Henri; Butt, Ali R; Vazhkudai, Sudharshan S

Related Subjects

36 MATERIALS SCIENCE
A CENTERS
AVAILABILITY
MANAGEMENT
PERFORMANCE
PROCUREMENT
PRODUCTION
STORAGE
SUPERCOMPUTERS
TRANSIENTS
Optimize center performance
scratch storage
data recovery
