OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery

Abstract

Procurement and optimized utilization of Petascale supercomputers and centers is a renewed national priority. Sustained performance and availability of such large centers is a key technical challenge that significantly impacts their usability. As recent research shows, storage systems can be a primary fault source leading to unavailability of even today's supercomputers. Due to data unavailability, jobs are frequently resubmitted, reducing compute center performance, and I/O activities are poorly coordinated with job scheduling. In this work, we explore two mechanisms, namely the coordination of job scheduling with data staging/offloading, and on-demand reconstruction of job input data, to address the availability of job input/output data and to improve center-wide performance. Fundamental to both mechanisms is the efficient management of transient data: in the way it is scheduled and recovered. Collectively, from a center standpoint, these techniques optimize resource usage and increase data/service availability. From a user job standpoint, they reduce job turnaround time and optimize the usage of allocated time. We have implemented our approaches within commonly used supercomputer software tools such as the PBS scheduler and the Lustre parallel file system. We have gathered reconstruction data from a production supercomputer environment using multiple data sources. We conducted simulations based on the measured data recovery performance, job traces, and staged-data logs from leadership-class supercomputer centers. Our results indicate that the average waiting time of jobs is reduced; this effect grows significantly for larger jobs and as data is striped over more I/O nodes.
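The coordinated-staging idea in the abstract can be illustrated with a minimal, hypothetical simulation (this is NOT the authors' PBS/Lustre implementation; the FIFO single-resource model and all job times are assumptions for illustration). It contrasts staging that starts only at dispatch with staging that overlaps earlier jobs' compute time:

```python
# Minimal illustrative simulation (not the authors' implementation): when
# input staging is coordinated with scheduling, a job's data is staged while
# earlier jobs compute, so staging overlaps queue wait instead of delaying
# dispatch. Single compute resource, FIFO order, all jobs submitted at t = 0.

from dataclasses import dataclass

@dataclass
class Job:
    stage_time: float  # time to stage input data onto scratch storage
    run_time: float    # compute time once the data is available

def avg_turnaround_uncoordinated(jobs):
    """Staging starts only at dispatch, serializing with compute."""
    clock, total = 0.0, 0.0
    for j in jobs:
        clock += j.stage_time + j.run_time
        total += clock  # turnaround measured from submission at t = 0
    return total / len(jobs)

def avg_turnaround_coordinated(jobs):
    """Staging starts at submission (t = 0) and overlaps earlier jobs'
    compute; dispatch waits only if staging has not yet finished."""
    clock, total = 0.0, 0.0
    for j in jobs:
        clock = max(clock, j.stage_time) + j.run_time
        total += clock
    return total / len(jobs)

jobs = [Job(4, 10), Job(6, 10), Job(8, 10)]
print(avg_turnaround_uncoordinated(jobs))  # 92/3, about 30.67
print(avg_turnaround_coordinated(jobs))    # 24.0
```

Even in this toy model, overlapping staging with queue wait lowers average turnaround, in the same direction as the paper's reported reduction in average waiting time.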

Authors:
 Zhang, Zhe [1]; Wang, Chao [1]; Vazhkudai, Sudharshan S [1]; Ma, Xiaosong [1]; Pike, Gregory [1]; Cobb, John W [1]; Mueller, Frank [2]
  1. ORNL
  2. North Carolina State University
Publication Date:
2007
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Center for Computational Sciences
Sponsoring Org.:
USDOE Laboratory Directed Research and Development (LDRD) Program
OSTI Identifier:
1000413
DOE Contract Number:
DE-AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: Supercomputing 2007, Reno, NV, USA, Nov. 10-16, 2007
Country of Publication:
United States
Language:
English
Subject:
36 MATERIALS SCIENCE; CENTERS; AVAILABILITY; MANAGEMENT; PERFORMANCE; PROCUREMENT; PRODUCTION; STORAGE; SUPERCOMPUTERS; TRANSIENTS; Optimize center performance; scratch storage; data recovery

Citation Formats

Zhang, Zhe, Wang, Chao, Vazhkudai, Sudharshan S, Ma, Xiaosong, Pike, Gregory, Cobb, John W, and Mueller, Frank. Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery. United States: N. p., 2007. Web.
Zhang, Zhe, Wang, Chao, Vazhkudai, Sudharshan S, Ma, Xiaosong, Pike, Gregory, Cobb, John W, & Mueller, Frank. Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery. United States.
Zhang, Zhe, Wang, Chao, Vazhkudai, Sudharshan S, Ma, Xiaosong, Pike, Gregory, Cobb, John W, and Mueller, Frank. 2007. "Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery". United States.
@article{osti_1000413,
title = {Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery},
author = {Zhang, Zhe and Wang, Chao and Vazhkudai, Sudharshan S and Ma, Xiaosong and Pike, Gregory and Cobb, John W and Mueller, Frank},
abstractNote = {Procurement and optimized utilization of Petascale supercomputers and centers is a renewed national priority. Sustained performance and availability of such large centers is a key technical challenge that significantly impacts their usability. As recent research shows, storage systems can be a primary fault source leading to unavailability of even today's supercomputers. Due to data unavailability, jobs are frequently resubmitted, reducing compute center performance, and I/O activities are poorly coordinated with job scheduling. In this work, we explore two mechanisms, namely the coordination of job scheduling with data staging/offloading, and on-demand reconstruction of job input data, to address the availability of job input/output data and to improve center-wide performance. Fundamental to both mechanisms is the efficient management of transient data: in the way it is scheduled and recovered. Collectively, from a center standpoint, these techniques optimize resource usage and increase data/service availability. From a user job standpoint, they reduce job turnaround time and optimize the usage of allocated time. We have implemented our approaches within commonly used supercomputer software tools such as the PBS scheduler and the Lustre parallel file system. We have gathered reconstruction data from a production supercomputer environment using multiple data sources. We conducted simulations based on the measured data recovery performance, job traces, and staged-data logs from leadership-class supercomputer centers. Our results indicate that the average waiting time of jobs is reduced; this effect grows significantly for larger jobs and as data is striped over more I/O nodes.},
place = {United States},
year = {2007},
month = {nov}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
