Exploring Shared Memory Protocols in FLASH
- Stanford University
ABSTRACT The goal of this project was to improve the performance of large scientific and engineering applications through collaborative hardware and software mechanisms that manage the memory hierarchy of non-uniform memory access (NUMA) shared-memory machines, as well as that of their individual component processors. Despite the programming advantages of shared-memory platforms, obtaining good performance for large scientific and engineering applications on such machines can be challenging. Because communication between processors is managed implicitly by the hardware rather than expressed by the programmer, application performance may suffer from unintended communication, that is, communication that the programmer did not consider when developing the application. In this project, we developed and evaluated a collection of hardware, compiler, language, and performance-monitoring tools for obtaining high performance on scientific and engineering applications on NUMA platforms by managing communication through alternative coherence mechanisms. Alternative coherence mechanisms have often been discussed as a means of reducing unintended communication, although architectural implementations of such mechanisms are quite rare. This report describes an actual implementation of a set of coherence protocols that support coherent, non-coherent, and write-update accesses for a CC-NUMA shared-memory architecture, the Stanford FLASH machine. Such an approach has the advantage of using alternative coherence only where it is beneficial, and it also provides an evolutionary migration path for improving application performance. We present data on two computations, RandomAccess from the HPC Challenge benchmarks and a forward solver derived from LS-DYNA, showing the performance advantages of the alternative coherence mechanisms. For RandomAccess, the non-coherent and write-update versions can outperform the coherent version by factors of 5 and 2.5, respectively. In LS-DYNA, we obtain improvements of 18% on average using the non-coherent version. We also present data on the SpecOMP benchmarks, showing that the protocols have a modest overhead of less than 3% in applications where the alternative mechanisms are not needed. In addition to the selective coherence studies on the FLASH machine, in the last six months of this project ISI performed research on compiler technology for the transactional memory (TM) programming model being developed at Stanford. As part of this research, ISI developed a compiler that recognizes transactional memory "pragmas" and automatically generates parallel code for the TM programming model.
- Research Organization:
- Stanford University, Stanford, CA; University of Southern California, Los Angeles, CA
- Sponsoring Organization:
- USDOE Office of Science (SC)
- DOE Contract Number:
- FG02-03ER25564
- OSTI ID:
- 939091
- Report Number(s):
- DOE DE-FG02-ER25564
- Country of Publication:
- United States
- Language:
- English
Similar Records
An empirical comparison of the Kendall Square Research KSR-1 and Stanford DASH multiprocessors
Conference · Thu Dec 30 23:00:00 EST 1993 · OSTI ID:46219
Interprocessor invocation on a NUMA multiprocessor. Technical report
Technical Report · Mon Oct 01 00:00:00 EDT 1990 · OSTI ID:5971577
Improving the performance of DSM systems via compiler involvement
Book · Fri Dec 30 23:00:00 EST 1994 · OSTI ID:87677