Collective Memory Transfers for Multi-Core Chips
Future performance improvements for microprocessors have shifted from clock frequency scaling towards increases in on-chip parallelism. Performance improvements for a wide variety of parallel applications require domain-decomposition of data arrays from a contiguous arrangement in memory to a tiled layout for on-chip L1 data caches and scratchpads. However, DRAM performance suffers under the non-streaming access patterns generated by many independent cores. We propose collective memory scheduling (CMS) that actively takes control of collective memory transfers such that requests arrive in a sequential and predictable fashion to the memory controller. CMS uses the hierarchically tiled arrays formalism to compactly express collective operations, which greatly improves programmability over conventional prefetch or list-DMA approaches. CMS reduces application execution time by up to 32% and DRAM read power by 2.2×, compared to a baseline DMA architecture such as STI Cell.
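The access-pattern problem described in the abstract can be illustrated with a small toy model. The sketch below is not the CMS design and does not model DRAM timing. It assumes a row-major 2D array split into equal tiles, one tile per core, with each core fetching its tile one row segment at a time. The function names and the round-robin interleaving of cores are hypothetical choices made only for illustration.

```python
def tile_row_requests(rows, cols, tile_rows, tile_cols):
    """Map each core to the start addresses of the row segments in its tile.

    Assumes a row-major array whose dimensions divide evenly into tiles.
    """
    requests = {}
    tiles_per_row = cols // tile_cols
    for tr in range(rows // tile_rows):
        for tc in range(tiles_per_row):
            core = tr * tiles_per_row + tc
            requests[core] = [
                (tr * tile_rows + r) * cols + tc * tile_cols
                for r in range(tile_rows)
            ]
    return requests


def independent_order(requests):
    """Assumed arrival order when cores issue requests round-robin."""
    order = []
    depth = max(len(starts) for starts in requests.values())
    for i in range(depth):
        for core in sorted(requests):
            if i < len(requests[core]):
                order.append(requests[core][i])
    return order


def collective_order(requests):
    """Arrival order when all requests are issued as one address-ordered sweep."""
    return sorted(start for starts in requests.values() for start in starts)


def count_jumps(order, length):
    """Count requests that do not begin where the previous one ended."""
    return sum(1 for a, b in zip(order, order[1:]) if b != a + length)


if __name__ == "__main__":
    reqs = tile_row_requests(rows=4, cols=4, tile_rows=2, tile_cols=2)
    print("independent:", independent_order(reqs))
    print("collective: ", collective_order(reqs))
    print("jumps (independent):", count_jumps(independent_order(reqs), 2))
    print("jumps (collective): ", count_jumps(collective_order(reqs), 2))
```

In this toy model the same set of row segments is requested either way; only the order in which they reach memory changes, which is the property the abstract attributes to collective scheduling.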
- Research Organization:
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC)
- DOE Contract Number:
- DE-AC02-05CH11231
- OSTI ID:
- 1164908
- Report Number(s):
- LBNL-6485E
- Country of Publication:
- United States
- Language:
- English