Title: Improving Data Availability for Better Access Performance: A Study on Caching Scientific Data on Distributed Workstations

Abstract

Client-side data caching serves as an excellent mechanism to store and analyze the rapidly growing amount of scientific data. In our previous work, we built a distributed local cache on unreliable desktop storage contributions. This offers several desirable properties, such as performance impedance matching, improved space utilization, and high parallel I/O bandwidth. Such a low-cost, best-effort cache, however, is faced with the vagaries of storage node availability: these donated machines may be significantly less reliable than dedicated systems and cannot be controlled centrally. In this paper, we address the performance impact of data availability in the distributed scientific data cache setting. We then present a novel approach to storage cache management, remote partial data recovery (RPDR). We compare our approach to two standard techniques, namely replication and erasure coding, both extended to the target caching environment. Our evaluation uses a trace-driven simulation parameterized with benchmarking results from our distributed cache prototype. The results with multiple real-world traces indicate that RPDR significantly outperforms both replication and erasure coding in many cases and overall the combination of RPDR and erasure coding yields the best performance.

Authors:
Ma, Xiaosong [1]; Zhang, Zhe [1]; Vazhkudai, Sudharshan S. [1]
  1. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Laboratory Directed Research and Development (LDRD) Program
OSTI Identifier:
1000706
DOE Contract Number:  
DE-AC05-00OR22725
Resource Type:
Journal Article
Journal Name:
Journal of Grid Computing
Additional Journal Information:
Journal Volume: 7; Journal Issue: 4; Journal ID: ISSN 1570-7873
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; AVAILABILITY; EVALUATION; IMPEDANCE; MANAGEMENT; PERFORMANCE; SIMULATION; STORAGE; TARGETS; Data Availability; storage cache

Citation Formats

Ma, Xiaosong, Zhang, Zhe, and Vazhkudai, Sudharshan S. Improving Data Availability for Better Access Performance: A Study on Caching Scientific Data on Distributed Workstations. United States: N. p., 2009. Web. doi:10.1007/s10723-009-9122-7.
Ma, Xiaosong, Zhang, Zhe, & Vazhkudai, Sudharshan S. (2009). Improving Data Availability for Better Access Performance: A Study on Caching Scientific Data on Distributed Workstations. United States. doi:10.1007/s10723-009-9122-7.
Ma, Xiaosong, Zhang, Zhe, and Vazhkudai, Sudharshan S. 2009. "Improving Data Availability for Better Access Performance: A Study on Caching Scientific Data on Distributed Workstations". United States. doi:10.1007/s10723-009-9122-7.
@article{osti_1000706,
title = {Improving Data Availability for Better Access Performance: A Study on Caching Scientific Data on Distributed Workstations},
author = {Ma, Xiaosong and Zhang, Zhe and Vazhkudai, Sudharshan S.},
abstractNote = {Client-side data caching serves as an excellent mechanism to store and analyze the rapidly growing amount of scientific data. In our previous work, we built a distributed local cache on unreliable desktop storage contributions. This offers several desirable properties, such as performance impedance matching, improved space utilization, and high parallel I/O bandwidth. Such a low-cost, best-effort cache, however, is faced with the vagaries of storage node availability: these donated machines may be significantly less reliable than dedicated systems and cannot be controlled centrally. In this paper, we address the performance impact of data availability in the distributed scientific data cache setting. We then present a novel approach to storage cache management, {\em remote partial data recovery (RPDR)}. We compare our approach to two standard techniques, namely replication and erasure coding, both extended to the target caching environment. Our evaluation uses a trace-driven simulation parameterized with benchmarking results from our distributed cache prototype. The results with multiple real-world traces indicate that RPDR significantly outperforms both replication and erasure coding in many cases and overall the combination of RPDR and erasure coding yields the best performance.},
doi = {10.1007/s10723-009-9122-7},
journal = {Journal of Grid Computing},
issn = {1570-7873},
number = 4,
volume = 7,
place = {United States},
year = {2009},
month = {1}
}