Title: Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery

Abstract

Procurement and optimized utilization of Petascale supercomputers and centers is a renewed national priority. Sustained performance and availability of such large centers is a key technical challenge significantly impacting their usability. As recent research shows, storage systems can be a primary fault source leading to unavailability of even today's supercomputer. Due to data unavailability, jobs are frequently resubmitted resulting in reduced compute center performance as well as in a lack of coordination between I/O activities and job scheduling. In this work, we explore two mechanisms, namely the coordination of job scheduling and data staging/offloading and on-demand job input data reconstruction to address the availability of job input/output data and to improve center-wide performance. Fundamental to both mechanisms is the efficient management of transient data: in the way it is scheduled and recovered. Collectively, from a center standpoint, these techniques optimize resource usage and increase its data/service availability. From a user job standpoint, they reduce job turnaround time and optimize the usage of allocated time. We have implemented our approaches within commonly used supercomputer software tools such as the PBS scheduler and the Lustre parallel file system. We have gathered reconstruction data from a production supercomputer environment using multiple data sources. We conducted simulations based on the measured data recovery performance, the job traces and staged data logs from leadership-class supercomputer centers. Our results indicate that the average waiting time of jobs is reduced. This trend increases significantly for larger jobs and also as data is striped over more I/O nodes.

Authors:
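Zhang, Zhe [1]; Wang, Chao [1]; Vazhkudai, Sudharshan S [1]; Ma, Xiaosong [1]; Pike, Gregory [1]; Cobb, John W [1]; Mueller, Frank [2]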
  1. ORNL
  2. North Carolina State University
Publication Date:
2007
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Center for Computational Sciences
Sponsoring Org.:
USDOE Laboratory Directed Research and Development (LDRD) Program
OSTI Identifier:
1000413
DOE Contract Number:  
DE-AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: Supercomputing 2007, Reno, NV, USA, November 10-16, 2007
Country of Publication:
United States
Language:
English
Subject:
36 MATERIALS SCIENCE; A CENTERS; AVAILABILITY; MANAGEMENT; PERFORMANCE; PROCUREMENT; PRODUCTION; STORAGE; SUPERCOMPUTERS; TRANSIENTS; Optimize center performance; scratch storage; data recovery

Citation Formats

Zhang, Zhe, Wang, Chao, Vazhkudai, Sudharshan S, Ma, Xiaosong, Pike, Gregory, Cobb, John W, and Mueller, Frank. Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery. United States: N. p., 2007. Web.
Zhang, Zhe, Wang, Chao, Vazhkudai, Sudharshan S, Ma, Xiaosong, Pike, Gregory, Cobb, John W, & Mueller, Frank. (2007). Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery. United States.
Zhang, Zhe, Wang, Chao, Vazhkudai, Sudharshan S, Ma, Xiaosong, Pike, Gregory, Cobb, John W, and Mueller, Frank. 2007. "Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery". United States.
@inproceedings{osti_1000413,
title = {Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery},
author = {Zhang, Zhe and Wang, Chao and Vazhkudai, Sudharshan S and Ma, Xiaosong and Pike, Gregory and Cobb, John W and Mueller, Frank},
abstractNote = {Procurement and optimized utilization of Petascale supercomputers and centers is a renewed national priority. Sustained performance and availability of such large centers is a key technical challenge significantly impacting their usability. As recent research shows, storage systems can be a primary fault source leading to unavailability of even today's supercomputer. Due to data unavailability, jobs are frequently resubmitted resulting in reduced compute center performance as well as in a lack of coordination between I/O activities and job scheduling. In this work, we explore two mechanisms, namely the coordination of job scheduling and data staging/offloading and on-demand job input data reconstruction to address the availability of job input/output data and to improve center-wide performance. Fundamental to both mechanisms is the efficient management of transient data: in the way it is scheduled and recovered. Collectively, from a center standpoint, these techniques optimize resource usage and increase its data/service availability. From a user job standpoint, they reduce job turnaround time and optimize the usage of allocated time. We have implemented our approaches within commonly used supercomputer software tools such as the PBS scheduler and the Lustre parallel file system. We have gathered reconstruction data from a production supercomputer environment using multiple data sources. We conducted simulations based on the measured data recovery performance, the job traces and staged data logs from leadership-class supercomputer centers. Our results indicate that the average waiting time of jobs is reduced. This trend increases significantly for larger jobs and also as data is striped over more I/O nodes.},
booktitle = {Supercomputing 2007, Reno, NV, USA},
place = {United States},
year = {2007},
month = jan
}
