OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: EXPLOITING TEMPORAL LOCALITY IN STENCIL BASED APPLICATIONS

Abstract

No abstract prepared.

Authors:
F. BASSETTI; K. DAVIS
Publication Date:
1 Jul 1999
Research Org.:
Los Alamos National Lab., NM (US)
Sponsoring Org.:
US Department of Energy (US)
OSTI Identifier:
785420
Report Number(s):
LA-UR-99-3390
TRN: US200307%%437
DOE Contract Number:
W-7405-ENG-36
Resource Type:
Conference
Resource Relation:
Conference: Conference title not supplied, Conference location not supplied, Conference dates not supplied; Other Information: PBD: 1 Jul 1999
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; COMPUTER CODES; TIME DEPENDENCE; LOCALITY; USES

Citation Formats

F. BASSETTI, and K. DAVIS. EXPLOITING TEMPORAL LOCALITY IN STENCIL BASED APPLICATIONS. United States: N. p., 1999. Web.
F. BASSETTI, & K. DAVIS. EXPLOITING TEMPORAL LOCALITY IN STENCIL BASED APPLICATIONS. United States.
F. BASSETTI, and K. DAVIS. 1999. "EXPLOITING TEMPORAL LOCALITY IN STENCIL BASED APPLICATIONS". United States. https://www.osti.gov/servlets/purl/785420.
@article{osti_785420,
title = {EXPLOITING TEMPORAL LOCALITY IN STENCIL BASED APPLICATIONS},
author = {F. BASSETTI and K. DAVIS},
abstractNote = {No abstract prepared.},
place = {United States},
year = 1999,
month = 7
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

  • High-performance scientific computing relies increasingly on high-level large-scale object-oriented software frameworks to manage both algorithmic complexity and the complexities of parallelism: distributed data management, process management, inter-process communication, and load balancing. This encapsulation of data management, together with the prescribed semantics of a typical fundamental component of such object-oriented frameworks--a parallel or serial array-class library--provides an opportunity for increasingly sophisticated compile-time optimization techniques. This paper describes a technique for introducing cache blocking suitable for certain classes of numerical algorithms, demonstrates and analyzes the resulting performance gains, and indicates how this optimization transformation is being automated.
  • Stencil computations are at the heart of many physical simulations used in scientific codes. Thus, there exists a plethora of optimization efforts for this family of computations. Among these, tiling techniques that allow concurrent start have proven to be very efficient in providing better performance for these critical kernels. Nevertheless, with many-core designs being the norm, these optimization techniques might not be able to fully exploit locality (both spatial and temporal) on multiple levels of the memory hierarchy without compromising parallelism. It is no longer true that the machine can be seen as a homogeneous collection of nodes with caches, main memory, and an interconnect network. New architectural designs exhibit complex groupings of nodes, cores, threads, caches, and memory connected by an ever-evolving network-on-chip design. These new designs may benefit greatly from carefully crafted schedules and groupings that encourage parallel actors (i.e., threads, cores, or nodes) to be aware of the computational history of other actors in close proximity. In this paper, we provide an efficient tiling technique that allows hierarchical concurrent start for memory-hierarchy-aware tile groups. Each execution schedule and tile shape exploits the available parallelism, load balance, and locality present in the given applications. We demonstrate our technique on the Intel Xeon Phi architecture with selected and representative stencil kernels. We show improvements ranging from 5.58% to 31.17% over existing state-of-the-art techniques.
  • In the solution of large-scale numerical problems, parallel computing is becoming simultaneously more important and more difficult. The complex organization of today's multiprocessors with several memory hierarchies has forced the scientific programmer to make a choice between simple but unscalable code and scalable but extremely complex code that does not port to other architectures. This paper describes how the SMARTS runtime system and the POOMA C++ class library for high-performance scientific computing work together to exploit data parallelism in scientific applications while hiding the details of managing parallelism and data locality from the user. We present innovative algorithms, based on the macro-dataflow model, for detecting data parallelism and efficiently executing data-parallel statements on shared-memory multiprocessors. We also describe how these algorithms can be implemented on clusters of SMPs.
  • Continuing increase in the computational power of supercomputers has enabled large-scale scientific applications in the areas of astrophysics, fusion, climate, and combustion to run larger and longer-running simulations, facilitating deeper scientific insights. However, these long-running simulations are often interrupted by multiple system failures. Therefore, these applications rely on "checkpointing" as a resilience mechanism to store application state to permanent storage and recover from failures. Unfortunately, checkpointing incurs excessive I/O overhead on supercomputers due to the large size of checkpoints, resulting in suboptimal performance and resource utilization. In this paper, we devise novel mechanisms to show how checkpointing overhead can be mitigated significantly by exploiting the temporal characteristics of system failures. We provide new insights and detailed quantitative understanding of the checkpointing overheads and trade-offs on large-scale machines. Our prototype implementation shows the viability of our approach on extreme-scale machines.
  • Abstract not provided.