U.S. Department of Energy
Office of Scientific and Technical Information

Timely Result-Data Offloading for Improved HPC Center Scratch Provisioning and Serviceability

Journal Article · IEEE Transactions on Parallel and Distributed Systems
 [1];  [1];  [2]
  1. Virginia Polytechnic Institute and State University (Virginia Tech)
  2. ORNL
Modern High-Performance Computing (HPC) centers are facing a data deluge from emerging scientific applications. Supporting such large data volumes requires a significant commitment of the center's high-throughput storage system, the scratch space. However, scratch space is typically managed using simple purge policies, without sophisticated end-user data services to balance resource consumption and user serviceability. End-user data services such as offloading are performed using point-to-point transfers that cannot reconcile the center's purge deadlines with users' delivery deadlines, cannot adapt to changing dynamics in the end-to-end data path, and are not fault-tolerant. Such inefficiencies can be prohibitive to sustaining high performance. In this paper, we address these issues by designing a framework for the timely, decentralized offload of application result data. Our framework uses an overlay of user-specified intermediate and landmark sites to orchestrate a decentralized, fault-tolerant delivery. We have implemented our techniques within a production job scheduler (PBS) and data transfer tool (BitTorrent). Our evaluation, using both a real implementation and supercomputer job log-driven simulations, shows that the offloading times can be significantly reduced (by 90.4% for a 5 GB data transfer) and that the exposure window can be minimized while also meeting center-user Service Level Agreements.
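A central constraint in the abstract is that result data must leave scratch before the center's purge deadline, and that spreading the offload across several intermediate sites, rather than one point-to-point transfer, can shorten the time data sits exposed on scratch. The sketch below illustrates only that deadline arithmetic under simplifying assumptions: each site has a fixed, known bandwidth, data is striped across sites in proportion to bandwidth, and failed sites are simply excluded. The `Site` type, its `bandwidth_mb_s` and `available` fields, and `plan_offload` are all hypothetical names. This is not the paper's PBS/BitTorrent implementation, which uses an overlay of intermediate and landmark sites.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical offload target (intermediate site or user destination)."""
    name: str
    bandwidth_mb_s: float  # assumed fixed sustained bandwidth from scratch, MB/s
    available: bool = True  # False models a failed or unreachable site


def plan_offload(size_mb, deadline_s, sites):
    """Greedily add the fastest available sites until striping the data
    across them (shares proportional to bandwidth) finishes within
    deadline_s. Returns (site names, MB share per site, estimated seconds),
    or None if even all available sites together miss the deadline."""
    candidates = sorted(
        (s for s in sites if s.available),  # fault tolerance: skip failed sites
        key=lambda s: s.bandwidth_mb_s,
        reverse=True,
    )
    chosen = []
    total_bw = 0.0
    for site in candidates:
        chosen.append(site)
        total_bw += site.bandwidth_mb_s
        est_s = size_mb / total_bw  # parallel striped transfer time
        if est_s <= deadline_s:
            shares = {s.name: size_mb * s.bandwidth_mb_s / total_bw for s in chosen}
            return [s.name for s in chosen], shares, est_s
    return None


if __name__ == "__main__":
    # Hypothetical example: 5 GB of result data and a 60 s purge deadline.
    sites = [
        Site("user-site", 20.0),  # direct point-to-point alone would take 256 s
        Site("inter-a", 60.0),
        Site("inter-b", 40.0),
    ]
    print(plan_offload(5120, 60, sites))
```

Choosing the fastest sites first minimizes the number of sites needed to reach the bandwidth that a given deadline requires under these assumptions; a practical planner would also need live bandwidth estimates and per-site storage limits.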
Research Organization:
Oak Ridge National Laboratory (ORNL)
Sponsoring Organization:
ORNL LDRD Director's R&D
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1020778
Journal Information:
Journal Name: IEEE Transactions on Parallel and Distributed Systems; Journal Issue: 8; Vol. 22; ISSN 1045-9219
Country of Publication:
United States
Language:
English

Similar Records

CATCH: A Cloud-based Adaptive Data Transfer Service for HPC
Conference · Fri Dec 31 23:00:00 EST 2010 · OSTI ID:1015023

/Scratch as a Cache: Rethinking HPC Center Scratch Storage
Conference · Mon Jun 01 00:00:00 EDT 2009 · OSTI ID:1004447

Reconciling Scratch Space Consumption, Exposure, and Volatility to Achieve Timely Staging of Job Input Data
Conference · Thu Apr 01 00:00:00 EDT 2010 · OSTI ID:985302