OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: An Application Specific Memory Characterization Technique for Co-processor Accelerators

Abstract

Commodity accelerator technologies including reconfigurable devices and graphical processing units (GPUs) provide an order of magnitude performance improvement compared to mainstream microprocessor systems. A number of compute-intensive, scientific applications, therefore, can potentially benefit from commodity computing devices available in the form of co-processor accelerators. However, there has been little progress in accelerating production-level scientific applications using these technologies due to several programming and performance challenges. One of the key performance challenges is performance sustainability. While computation is often accelerated substantially by accelerator devices, the achievable performance is significantly lower once the data transfer costs and overheads are incorporated. We present an application-specific memory characterization technique for an FPGA-accelerated system that enabled us to reduce data transfer overhead for a scientific application by a factor of 5. We classify large data structures in the application according to their read and write characteristics and access patterns. This classification in turn enabled us to sustain a speedup of over three for a full-scale scientific application. Our proposed technique extends to applications that exhibit similar memory behavior and to co-processor accelerator systems that support data streaming and pipelining, and allow overlapped execution between the host and the accelerator device.
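The classification step described in the abstract can be illustrated with a minimal sketch. The record does not give the paper's actual categories or code, so everything here is an assumption made for illustration: the `ArrayProfile` fields, the `Policy` names, the rules in `classify`, and the example array names are all hypothetical. The idea shown is only that knowing which side reads and writes each large array determines how often it has to cross the host-accelerator link.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    # Hypothetical transfer policies; not taken from the paper.
    NO_TRANSFER = "accelerator never touches the array"
    COPY_ONCE = "send to the accelerator once, then keep it resident"
    COPY_IN_EACH_CALL = "send to the accelerator before every call"
    COPY_OUT_EACH_CALL = "return to the host after every call"
    COPY_BOTH_WAYS = "send before and return after every call"


@dataclass(frozen=True)
class ArrayProfile:
    # Hypothetical description of one large data structure.
    name: str
    size_bytes: int
    host_writes_between_calls: bool  # host updates it between accelerator calls
    accel_reads: bool                # accelerator kernel reads it
    accel_writes: bool               # accelerator kernel writes it


def classify(p: ArrayProfile) -> Policy:
    """Map read/write characteristics to a transfer policy (assumed rules)."""
    if not (p.accel_reads or p.accel_writes):
        return Policy.NO_TRANSFER
    if not p.accel_writes:
        # Read-only on the accelerator: resend only if the host changes it.
        if p.host_writes_between_calls:
            return Policy.COPY_IN_EACH_CALL
        return Policy.COPY_ONCE
    if not p.accel_reads:
        # Write-only on the accelerator: results only travel back.
        return Policy.COPY_OUT_EACH_CALL
    return Policy.COPY_BOTH_WAYS


def bytes_moved(p: ArrayProfile, calls: int) -> int:
    """Total bytes crossing the link over `calls` accelerator invocations."""
    policy = classify(p)
    if policy is Policy.NO_TRANSFER:
        return 0
    if policy is Policy.COPY_ONCE:
        return p.size_bytes
    if policy is Policy.COPY_BOTH_WAYS:
        return 2 * p.size_bytes * calls
    return p.size_bytes * calls


if __name__ == "__main__":
    profiles = [
        ArrayProfile("lookup_table", 1024, False, True, False),
        ArrayProfile("coordinates", 1024, True, True, False),
        ArrayProfile("forces", 1024, False, False, True),
    ]
    for p in profiles:
        print(p.name, classify(p).name, bytes_moved(p, calls=10))
```

Compared with copying every array in both directions around every call, a classification of this kind lowers the transfer volume for any array that is read-only or write-only on the accelerator. The factor-of-5 reduction reported in the abstract is the authors' measured result and is not reproduced by this sketch.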

Authors:
 Alam, Sadaf R [1];  Smith, Melissa C [1];  Vetter, Jeffrey S [1]
  1. ORNL
Publication Date:
2007
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Laboratory Directed Research and Development (LDRD) Program
OSTI Identifier:
931800
DOE Contract Number:
DE-AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: IEEE 18th International Conference on Application-specific Systems, Architectures and Processors, Montreal, Canada, July 9-11, 2007
Country of Publication:
United States
Language:
English
Subject:
97; 99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; COMPUTER NETWORKS; PERFORMANCE; PROGRAMMING; MEMORY MANAGEMENT; DATA TRANSMISSION; TIME DEPENDENCE

Citation Formats

Alam, Sadaf R, Smith, Melissa C, and Vetter, Jeffrey S. An Application Specific Memory Characterization Technique for Co-processor Accelerators. United States: N. p., 2007. Web.
Alam, Sadaf R, Smith, Melissa C, & Vetter, Jeffrey S. (2007). An Application Specific Memory Characterization Technique for Co-processor Accelerators. United States.
Alam, Sadaf R, Smith, Melissa C, and Vetter, Jeffrey S. 2007. "An Application Specific Memory Characterization Technique for Co-processor Accelerators". United States.
@article{osti_931800,
title = {An Application Specific Memory Characterization Technique for Co-processor Accelerators},
author = {Alam, Sadaf R and Smith, Melissa C and Vetter, Jeffrey S},
abstractNote = {Commodity accelerator technologies including reconfigurable devices and graphical processing units (GPUs) provide an order of magnitude performance improvement compared to mainstream microprocessor systems. A number of compute-intensive, scientific applications, therefore, can potentially benefit from commodity computing devices available in the form of co-processor accelerators. However, there has been little progress in accelerating production-level scientific applications using these technologies due to several programming and performance challenges. One of the key performance challenges is performance sustainability. While computation is often accelerated substantially by accelerator devices, the achievable performance is significantly lower once the data transfer costs and overheads are incorporated. We present an application-specific memory characterization technique for an FPGA-accelerated system that enabled us to reduce data transfer overhead for a scientific application by a factor of 5. We classify large data structures in the application according to their read and write characteristics and access patterns. This classification in turn enabled us to sustain a speedup of over three for a full-scale scientific application. Our proposed technique extends to applications that exhibit similar memory behavior and to co-processor accelerator systems that support data streaming and pipelining, and allow overlapped execution between the host and the accelerator device.},
place = {United States},
year = {2007},
month = {1}
}


Similar Records:
  • Recent distributed shared memory (DSM) systems and proposed shared-memory machines have implemented some or all of their cache coherence protocols in software. One way to exploit the flexibility of this software is to tailor a coherence protocol to match an application's communication patterns and memory semantics. This paper presents evidence that this approach can lead to large performance improvements. It shows that application-specific protocols substantially improved the performance of three application programs (appbt, em3d, and barnes) over carefully tuned transparent shared memory implementations. The speedups were obtained on Blizzard, a fine-grained DSM system running on a 32-node Thinking Machines CM-5.
  • Various advanced accelerator concepts invoke lasers for the generation of very-high-gradient accelerating fields. We introduce the subject and review the application of picosecond CO₂ lasers to one such scheme: laser-driven open microstructures.
  • Increasing the core count on current and future processors is posing critical challenges to the memory subsystem, which must efficiently handle concurrent memory requests. The current trend to cope with this challenge is to increase the number of memory channels available to the processor's memory controller. In this paper we investigate the effectiveness of this approach on the performance of parallel scientific applications. Specifically, we explore the trade-off between employing multiple memory channels per memory controller and the use of multiple memory controllers. Experiments conducted on two current state-of-the-art multicore processors, a 6-core AMD Istanbul and a 4-core Intel Nehalem-EP, for a wide range of production applications show that there is a diminishing return when increasing the number of memory channels per memory controller. In addition, we show that this performance degradation can be efficiently addressed by increasing the ratio of memory controllers to channels while keeping the number of memory channels constant. Significant performance improvements, up to 28%, can be achieved in this scheme when using two memory controllers, each with one channel, compared with one controller with two memory channels.
  • For those generator sites preparing to dispose of their Transuranic (TRU) waste at the Waste Isolation Pilot Plant (WIPP), waste assay measurement performance will be demonstrated by the successful analysis of blind samples according to the criteria set forth by the Performance Demonstration Program Plan for Nondestructive Assay. Several key program elements are discussed that have evolved over the last year. These elements include matrix design and fabrication, source specification and manufacture, measurement reporting and scoring, and site coordination. To be highlighted is the framework of quality assurance and quality control which provides the backbone of each program element and ensures an auditable program meeting the Environmental Protection Agency's (EPA's) requirements. Technical and programmatic lessons learned are identified to facilitate the transfer of applicable TRU waste characterization technologies to other waste forms of low specific activity (< 100 nCi/g).