OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: CloudBB: Scalable I/O Accelerator for Shared Cloud Storage

Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Resource Relation:
Conference: Presented at: The 22nd IEEE International Conference on Parallel and Distributed Systems (ICPADS 2016), Wuhan, China, Dec 13 - Dec 16, 2016
Country of Publication:
United States

Citation Formats

Xu, T., Sato, K., and Matsuoka, S. CloudBB: Scalable I/O Accelerator for Shared Cloud Storage. United States: N. p., 2016. Web. doi:10.1109/ICPADS.2016.0074.
Xu, T., Sato, K., & Matsuoka, S. CloudBB: Scalable I/O Accelerator for Shared Cloud Storage. United States. doi:10.1109/ICPADS.2016.0074.
Xu, T., Sato, K., and Matsuoka, S. 2016. "CloudBB: Scalable I/O Accelerator for Shared Cloud Storage". United States. doi:10.1109/ICPADS.2016.0074.
@inproceedings{cloudbb2016,
  title  = {CloudBB: Scalable I/O Accelerator for Shared Cloud Storage},
  author = {Xu, T. and Sato, K. and Matsuoka, S.},
  doi    = {10.1109/ICPADS.2016.0074},
  place  = {United States},
  year   = {2016},
  month  = {7}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

Similar Records:
  • Large-scale shared-memory multiprocessors use directory-based cache coherence schemes. The basic directory scheme, called full-map, is efficient but has a large memory overhead. Limited directory schemes have therefore been proposed that bound the number of pointers in each directory; these schemes trade smaller memory overhead for larger memory access latencies. In this paper, the authors propose a new limited directory scheme that achieves lower memory overhead as well as smaller memory access latencies. The scheme uses ring embedding in a hypercube in conjunction with wormhole routing to reduce invalidation delays. The proposed scheme performs as well as full-map for smaller degrees of sharing and better than full-map for larger degrees of sharing.
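To make the full-map vs. limited-directory tradeoff the abstract describes concrete, here is a minimal sketch of a classic pointer-limited directory entry that falls back to broadcast invalidation on overflow. This is an illustration of the general idea only, not the paper's ring-embedding scheme; the class and method names are hypothetical.

```python
# Hypothetical sketch of one limited-directory entry: track up to
# `max_pointers` sharers exactly; on overflow, lose precision and
# broadcast invalidations to every node. This is the memory-overhead /
# latency tradeoff the abstract contrasts with the full-map scheme.
class LimitedDirectoryEntry:
    def __init__(self, max_pointers):
        self.max_pointers = max_pointers
        self.sharers = set()      # exact sharer pointers while they fit
        self.overflowed = False   # precision lost: must broadcast

    def record_read(self, node):
        """A node caches the line: remember it while a pointer is free."""
        if not self.overflowed:
            self.sharers.add(node)
            if len(self.sharers) > self.max_pointers:
                self.overflowed = True
                self.sharers.clear()

    def invalidation_targets(self, all_nodes):
        """On a write, which caches must receive invalidations?"""
        return set(all_nodes) if self.overflowed else set(self.sharers)

entry = LimitedDirectoryEntry(max_pointers=2)
entry.record_read(0)
entry.record_read(3)
print(entry.invalidation_targets(range(8)))  # exact set: {0, 3}
entry.record_read(5)                         # third sharer -> overflow
print(entry.invalidation_targets(range(8)))  # broadcast: all 8 nodes
```

A full-map directory is the `max_pointers == number_of_nodes` special case: it never overflows, at the cost of one pointer bit per node per line.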
  • This paper describes a novel parallel algorithm that implements dense matrix multiplication with algorithmic efficiency equivalent to that of Cannon's algorithm; it is suitable for clusters and shared-memory systems. The approach differs from other parallel matrix multiplication algorithms in its explicit use of shared memory and remote memory access (RMA) communication rather than message passing. Experimental results on clusters (IBM SP, Linux-Myrinet) and shared-memory systems (SGI Altix, Cray X1) demonstrate consistent performance advantages over ScaLAPACK pdgemm, the leading implementation of parallel matrix multiplication in use today. In the best case, on the SGI Altix, the new algorithm performs 20 times better than ScaLAPACK for a matrix size of 1000 on 128 processors. The impact of zero-copy nonblocking RMA communication and shared-memory communication on matrix multiplication performance on clusters is also investigated.
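The shared-memory programming model the abstract contrasts with message passing can be sketched as follows: workers compute disjoint row blocks of C = A x B and write results directly into one shared output buffer, with no explicit sends or receives. This toy uses Python threads to illustrate the model only (it is not the paper's algorithm, and CPython threads add no numerical speedup); all names are hypothetical.

```python
# Toy shared-memory matrix multiplication: each worker owns a disjoint
# set of rows of the shared result C, so it can write its block directly
# (an RMA-style "put") without message passing or locking.
from threading import Thread

def matmul_shared(A, B, num_workers=2):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]  # shared result buffer

    def worker(rows):
        for i in rows:                 # disjoint rows per worker,
            for j in range(m):         # hence no synchronization needed
                C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))

    chunks = [range(w, n, num_workers) for w in range(num_workers)]
    threads = [Thread(target=worker, args=(rows,)) for rows in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_shared(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

In the message-passing alternative, each worker would instead receive its operand blocks and send its result block back; the direct in-place writes above are what "explicit use of shared memory and RMA communication" buys.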
  • Triadic analysis encompasses a useful set of graph mining methods centered on the concept of a triad: a subgraph of three nodes together with the configuration of directed edges among them. Such methods are often applied in the social sciences as well as many other fields. Triadic methods commonly operate on a triad census, which counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triad census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis of large-scale graphs, we developed and optimized a triad census algorithm to execute efficiently on shared-memory architectures. We retrace the development and evolution of a parallel triad census algorithm: over several versions, we continually adapted the code's data structures and program logic to expose more opportunities to exploit parallelism on shared memory, translating into improved computational performance. We recall the critical steps and modifications made during code development and optimization, and we compare the performance of the triad census algorithm versions on three systems: the Cray XMT, the HP Superdome, and an AMD multi-core NUMA machine. These three systems all have shared-memory architectures, but with markedly different hardware capabilities for managing parallelism.
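To illustrate what a triad census counts, here is a minimal serial sketch. A full census distinguishes the 16 directed-triad isomorphism classes; this simplified version (an assumption for brevity, not the paper's parallel algorithm) only buckets each 3-node subgraph by how many of its six possible directed edges are present, which conveys the enumeration structure that makes naive censuses scale poorly.

```python
# Simplified triad census: for every 3-node subset, count how many of the
# six possible directed edges exist, and tally subsets by that count.
# Cost is O(n^3) over node triples, which motivates the optimized
# shared-memory algorithms the abstract describes.
from itertools import combinations

def triad_census_by_edge_count(nodes, edges):
    edge_set = set(edges)
    census = {k: 0 for k in range(7)}  # 0..6 directed edges per triad
    for a, b, c in combinations(nodes, 3):
        pairs = [(a, b), (b, a), (a, c), (c, a), (b, c), (c, b)]
        census[sum(p in edge_set for p in pairs)] += 1
    return census

nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 1), (2, 3), (3, 4)]
print(triad_census_by_edge_count(nodes, edges))
# {0: 0, 1: 1, 2: 2, 3: 1, 4: 0, 5: 0, 6: 0}
```

The parallel versions discussed in the abstract partition this triple enumeration across threads, with the shared census array as the contended data structure.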