OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A Unified Multiple-Level Cache for High Performance Storage Systems

Abstract

Multi-level cache hierarchies are widely used in high-performance storage systems to improve I/O performance. However, traditional cache management algorithms are not well suited to such cache organizations. Recently proposed multi-level cache replacement algorithms based on aggressive exclusive caching work well with single-client or multiple-client, low-correlated workloads, but suffer serious performance degradation with multiple-client, high-correlated workloads. In this paper, we propose a new cache management algorithm that handles multi-level buffer caches by forming a unified cache (uCache), which combines exclusive caching in L2 storage caches with cooperative client caching. We also propose a new local replacement algorithm, Frequency Based Eviction-Reference (FBER), based on our study of access patterns in exclusive caches. Our simulation results show that uCache increases the cumulative cache hit ratio dramatically. Compared to other popular cache algorithms such as LRU, uCache improves the I/O response time by up to 46% for low-correlated workloads and by up to 53% for high-correlated workloads.
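The record itself gives no pseudocode for uCache or FBER, so the following is only a minimal sketch, assuming a demote/read interface for the two ingredients the abstract names: an L2 storage cache kept exclusive of the client caches, and a frequency-based local replacement policy. The class name, the counters, and the LRU tiebreak are illustrative assumptions, not the authors' published design.

from collections import OrderedDict, defaultdict

class ExclusiveL2Cache:
    """Sketch of an L2 storage cache kept exclusive of client (L1) caches:
    blocks enter when a client evicts (demotes) them and leave when a
    client re-reads them, so the two levels never duplicate a block."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.freq = defaultdict(int)   # long-term per-block reference counts
        self.resident = OrderedDict()  # blocks currently in L2, oldest first

    def demote(self, block_id):
        # A client evicted block_id; absorb it into L2.
        if block_id in self.resident:
            self.resident.move_to_end(block_id)
            return
        if len(self.resident) >= self.capacity:
            self._evict()
        self.resident[block_id] = True

    def read(self, block_id):
        # A client read miss reached L2. On a hit, count the reference and
        # drop the block: the client holds it now, preserving exclusivity.
        if block_id not in self.resident:
            return False               # L2 miss; the block comes from disk
        self.freq[block_id] += 1
        del self.resident[block_id]    # promoted to the client cache
        return True

    def _evict(self):
        # Victim: lowest long-term reference count; among ties, min()
        # returns the first key, i.e. the oldest demotion (an LRU tiebreak).
        victim = min(self.resident, key=lambda b: self.freq[b])
        del self.resident[victim]

One design point worth noting: under exclusive caching a hit removes the block from L2, so a counter scoped to the current residency would never separate resident blocks; keeping the reference history in a table that outlives residency is what lets a frequency-based policy work at this level.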

Authors:
He, X. [1]; Ou, Li [1]; Kosa, Martha J. [1]; Scott, Steven L. [2]; Engelmann, Christian [2]
  1. Tennessee Technological University
  2. Oak Ridge National Laboratory (ORNL)
Publication Date:
2007-01-01
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Laboratory Directed Research and Development (LDRD) Program
OSTI Identifier:
931933
DOE Contract Number:
DE-AC05-00OR22725
Resource Type:
Journal Article
Resource Relation:
Journal Name: International Journal of High Performance Computing and Networking; Journal Volume: 5; Journal Issue: 1
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; ALGORITHMS; BUFFERS; MANAGEMENT; PERFORMANCE; SIMULATION; STORAGE; COMPUTERS

Citation Formats

He, X., Ou, Li, Kosa, Martha J., Scott, Steven L., and Engelmann, Christian. A Unified Multiple-Level Cache for High Performance Storage Systems. United States: N. p., 2007. Web. doi:10.1504/IJHPCN.2007.015768.
He, X., Ou, Li, Kosa, Martha J., Scott, Steven L., & Engelmann, Christian. (2007). A Unified Multiple-Level Cache for High Performance Storage Systems. United States. doi:10.1504/IJHPCN.2007.015768.
He, X., Ou, Li, Kosa, Martha J., Scott, Steven L., and Engelmann, Christian. 2007. "A Unified Multiple-Level Cache for High Performance Storage Systems". United States. doi:10.1504/IJHPCN.2007.015768.
@article{osti_931933,
title = {A Unified Multiple-Level Cache for High Performance Storage Systems},
author = {He, X. and Ou, Li and Kosa, Martha J. and Scott, Steven L. and Engelmann, Christian},
abstractNote = {Multi-level cache hierarchies are widely used in high-performance storage systems to improve I/O performance. However, traditional cache management algorithms are not well suited to such cache organizations. Recently proposed multi-level cache replacement algorithms based on aggressive exclusive caching work well with single-client or multiple-client, low-correlated workloads, but suffer serious performance degradation with multiple-client, high-correlated workloads. In this paper, we propose a new cache management algorithm that handles multi-level buffer caches by forming a unified cache (uCache), which combines exclusive caching in L2 storage caches with cooperative client caching. We also propose a new local replacement algorithm, Frequency Based Eviction-Reference (FBER), based on our study of access patterns in exclusive caches. Our simulation results show that uCache increases the cumulative cache hit ratio dramatically. Compared to other popular cache algorithms such as LRU, uCache improves the I/O response time by up to 46% for low-correlated workloads and by up to 53% for high-correlated workloads.},
doi = {10.1504/IJHPCN.2007.015768},
journal = {International Journal of High Performance Computing and Networking},
number = 1,
volume = 5,
place = {United States},
year = {2007},
month = {jan}
}
  • This paper describes an apparatus and method which exploit potential overlap between processor activity and cache miss sequences and potential overlap between cache miss sequences themselves, thereby improving the performance of any pipelined central processing unit which uses a cache.
  • Efficient execution of large-scale scientific applications requires high-performance computing systems designed to meet their I/O requirements. To achieve high performance, such data-intensive parallel applications use a multi-layer I/O software stack, which consists of high-level I/O libraries such as PnetCDF and HDF5, the MPI library, and parallel file systems. Designing efficient parallel scientific applications requires understanding the complicated flow of I/O operations and the interactions among these libraries; such comprehension helps identify I/O bottlenecks and thus exploit the performance potential of the different layers of the storage hierarchy. To profile the performance of individual components in the I/O stack and to understand the complex interactions among them, we have implemented a GUI-based integrated profiling and analysis framework, IOPro. IOPro automatically generates an instrumented I/O stack, runs applications on it, and visualizes detailed statistics based on user-specified metrics of interest. We present experimental results from two different real-life applications and show how our framework can be used in practice. By generating an end-to-end trace of the whole I/O stack and pinpointing I/O interference, IOPro aids in understanding I/O behavior and improving I/O performance significantly. (A rough sketch of per-layer I/O timing appears after this list.)
  • This patent describes a data processing system comprising: a pair of independently operated processing units, each processing unit being operative to generate cache requests for data, each request including an address having first and second address portions; and a cache memory subsystem coupled to the pair of processing units for receiving the requests, the cache memory subsystem comprising: a directory store divided equally into first and second pluralities of levels, the first and second levels each containing groups of storage locations, each location storing the first address portion of a memory request generated by either of the pair of processing units allocated to the first and second levels, with each different group of locations within the directory store levels being defined by a different one of the second address portions; a data store divided equally into the same first and second levels as the directory store, with each different group of locations within the data store levels being accessed by a different one of the second address portions; first and second accounting means associated with the first and second levels of the cache store respectively, each accounting means storing information establishing the order for replacing locations within the levels on a least recently used basis; and multiple allocation memory (MAM) means including the first level of the groups of locations, each different group of locations being accessed by the second address portion. (A small sketch of this set-indexed, LRU-managed directory layout appears after this list.)
  • In this paper, the performance of cache-based multiprocessors for general-purpose computing and for multi-tasking is analyzed with simple throughput models. A private cache is associated with each processor, and multiple buses connect the processors to the shared, interleaved memory. Simple models based on dynamic instruction mix statistics are introduced to evaluate upper bounds on the throughput when independent tasks are run on each processor. With these models, one can obtain a first estimate of the MIPS rate of a multiprocessor. The authors then present analytical models to evaluate the throughput and efficiency of different cache-based systems for a particular multitasked algorithm, namely the Successive Over-Relaxation (SOR) algorithm. The SOR algorithm is efficient for solving partial differential equations (PDEs) numerically on a multiprocessor. Parallelism is obtained naturally by partitioning the data and applying the same operator to each data partition. They identify locality levels and read/write sharing sets in the kernel of the SOR algorithm with red/black ordering of the iterates and evaluate the critical cache sizes capturing each locality level. Critical cache sizes define ranges of cache sizes within which different models apply. Besides showing the performance of SOR on multiprocessors, this paper illustrates techniques that can be used to predict the suitability of cache-based multiprocessors for specific algorithms. (A minimal red/black SOR sketch appears after this list.)
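IOPro itself is not reproduced here; as a rough illustration of the kind of per-layer statistics an instrumented I/O stack collects, the sketch below wraps an I/O callable and records its latency under a named layer. The profiled helper and the layer names are hypothetical, not IOPro's API.

import time
from collections import defaultdict

layer_times = defaultdict(list)   # layer name -> list of call latencies (s)

def profiled(layer, io_call):
    """Wrap io_call so every invocation records its latency under layer."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return io_call(*args, **kwargs)
        finally:
            layer_times[layer].append(time.perf_counter() - start)
    return wrapper

# An instrumented stack would wrap calls at each layer (HDF5/PnetCDF,
# MPI-IO, file system) the same way and correlate the traces end to end.
posix_open = profiled("posix_open", open)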
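For the patent's two-level directory (second item above), a minimal sketch under simplifying assumptions: the "second address portion" is read as a set index, the "first address portion" as a tag, and the two levels as a 2-way set-associative directory with per-set least-recently-used accounting. The class and field names are illustrative, not the patent's.

class TwoLevelCacheDirectory:
    """Sketch: a directory split into two levels (ways); the low address
    bits pick a group (set) and the high bits are stored as the tag."""

    def __init__(self, num_sets):
        self.num_sets = num_sets
        # Each set holds up to two tags, ordered most- to least-recently used.
        self.sets = [[] for _ in range(num_sets)]

    def lookup(self, address):
        set_index = address % self.num_sets   # "second address portion"
        tag = address // self.num_sets        # "first address portion"
        ways = self.sets[set_index]
        if tag in ways:
            ways.remove(tag)
            ways.insert(0, tag)               # refresh the LRU accounting
            return True                       # hit
        if len(ways) == 2:
            ways.pop()                        # replace the LRU way
        ways.insert(0, tag)                   # allocate on miss
        return False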
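To make the red/black ordering analyzed in the last item concrete, here is a minimal SOR sketch for the 2D Laplace equation: each sweep updates the red points (i + j even) and then the black points (i + j odd), so each color reads only the other color and the grid partitions naturally across processors. The grid size, relaxation factor, and sweep count are illustrative choices.

import numpy as np

def sor_redblack(u, omega=1.5, sweeps=100):
    """In-place red/black SOR for the 2D Laplace equation; the boundary
    rows and columns of u are treated as fixed boundary conditions."""
    n, m = u.shape
    for _ in range(sweeps):
        for color in (0, 1):                  # 0 = red, 1 = black
            for i in range(1, n - 1):
                for j in range(1, m - 1):
                    if (i + j) % 2 != color:
                        continue
                    gauss_seidel = 0.25 * (u[i - 1, j] + u[i + 1, j] +
                                           u[i, j - 1] + u[i, j + 1])
                    u[i, j] += omega * (gauss_seidel - u[i, j])
    return u

# Example: a 32x32 grid with a hot top edge relaxes toward the solution.
grid = np.zeros((32, 32))
grid[0, :] = 1.0
sor_redblack(grid)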