File caching in data intensive scientific applications
We present theoretical and experimental results for an important caching problem that arises frequently in data-intensive scientific applications. In such applications, jobs need to process several files simultaneously, i.e., a job can only be serviced if all its needed files are present in the disk cache. The set of files requested by a job is called a file-bundle. This requirement introduces the need for cache replacement algorithms based on file-bundles rather than individual files. We show that traditional caching algorithms such as Least Recently Used (LRU) and GreedyDual-Size (GDS) are not optimal in this case, since they are not sensitive to file-bundles and may hold non-relevant combinations of files in the cache. In this paper we propose and analyze a new cache replacement algorithm specifically adapted to deal with file-bundles. We tested the new algorithm using a disk cache simulation model under a wide range of parameters such as file request distributions, relative cache size, file size distribution, and queue size. In all these tests, the results show significant improvement over traditional caching algorithms such as GDS.
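To make the file-bundle requirement concrete, the following is a minimal toy sketch (not the paper's actual algorithm) of a cache in which a job is serviced only when every file of its bundle is resident, and eviction prefers least-recently-used files that do not belong to the requested bundle. All names (`BundleAwareCache`, `request_bundle`) are illustrative assumptions.

```python
from collections import OrderedDict

class BundleAwareCache:
    """Toy disk-cache model: a job is serviced only when every file
    in its file-bundle is resident in the cache simultaneously."""

    def __init__(self, capacity):
        self.capacity = capacity      # total cache size (abstract units)
        self.files = OrderedDict()    # file name -> size, in LRU order

    def used(self):
        return sum(self.files.values())

    def request_bundle(self, bundle):
        """bundle: dict mapping file name -> size. Returns True iff the
        whole bundle is now resident, so the job can be serviced."""
        if sum(bundle.values()) > self.capacity:
            return False              # the bundle can never fit as a whole
        for name, size in bundle.items():
            if name in self.files:
                self.files.move_to_end(name)   # mark most recently used
                continue
            # Evict LRU files that are NOT part of the requested bundle,
            # so we never evict a file the current job still needs.
            while self.used() + size > self.capacity:
                victim = next(f for f in self.files if f not in bundle)
                del self.files[victim]
            self.files[name] = size
        return True
```

A plain LRU cache would treat each file independently and could evict one member of a bundle while loading another, leaving the cache full of combinations of files that serve no pending job; keeping eviction decisions bundle-aware avoids that failure mode in this sketch.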
- Research Organization:
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- USDOE Director. Office of Science. Advanced Scientific Computing Research
- DOE Contract Number:
- DE-AC02-05CH11231
- OSTI ID:
- 882745
- Report Number(s):
- LBNL-55587; R&D Project: 429201; BnR: KJ0101030
- Resource Relation:
- Conference: 1st International Workshop on Data Management in Grids, Trondheim, Norway, September 2 - 3, 2005
- Country of Publication:
- United States
- Language:
- English
Similar Records
Accurate modeling of cache replacement policies in a Data-Grid.
Efficient algorithms for multi-file caching