skip to main content

Title: Data Movement Dominates: Final Report

Over the past three years in this project, what we have observed is that the primary reason for data movement in large-scale systems is that the per-node capacity is not large enough—i.e., one of the solutions to the data-movement problem (certainly not the only solution that is required, but a significant one nonetheless) is to increase per-node capacity so that inter-node traffic is reduced. This unfortunately is not as simple as it sounds. Today’s main memory systems for datacenters, enterprise computing systems, and supercomputers, fail to provide high per-socket capacity [Dirik & Jacob 2009; Cooper-Balis et al. 2012], except at extremely high price points (factors of 10–100x the cost/bit of consumer main-memory systems) [Stokes 2008]. The reason is that our choice of technology for today’s main memory systems—i.e., DRAM, which we have used as a main-memory technology since the 1970s [Jacob et al. 2007]—can no longer keep up with our needs for density and price per bit. Main memory systems have always been built from the cheapest, densest, lowest-power memory technology available, and DRAM is no longer the cheapest, the densest, nor the lowest-power storage technology out there. It is now time for DRAM to go the way that SRAMmore » went: move out of the way for a cheaper, slower, denser storage technology, and become a cache instead. This inflection point has happened before, in the context of SRAM yielding to DRAM. There was once a time that SRAM was the storage technology of choice for all main memories [Tomasulo 1967; Thornton 1970; Kidder 1981]. However, once DRAM hit volume production in the 1970s and 80s, it supplanted SRAM as a main memory technology because it was cheaper, and it was denser. It also happened to be lower power, but that was not the primary consideration of the day. At the time, it was recognized that DRAM was much slower than SRAM, but it was only at the supercomputer level (For instance the Cray X-MP in the 1980s and its follow-on, the Cray Y-MP, in the 1990s) that could one afford to build ever- larger main memories out of SRAM—the reasoning for moving to DRAM was that an appropriately designed memory hierarchy, built of DRAM as main memory and SRAM as a cache, would approach the performance of SRAM, at the price-per-bit of DRAM [Mashey 1999]. Today it is quite clear that, were one to build an entire multi-gigabyte main memory out of SRAM instead of DRAM, one could improve the performance of almost any computer system by up to an order of magnitude—but this option is not even considered, because to build that system would be prohibitively expensive. It is now time to revisit the same design choice in the context of modern technologies and modern systems. For reasons both technical and economic, we can no longer afford to build ever-larger main memory systems out of DRAM. Flash memory, on the other hand, is significantly cheaper and denser than DRAM and therefore should take its place. While it is true that flash is significantly slower than DRAM, one can afford to build much larger main memories out of flash than out of DRAM, and we show that an appropriately designed memory hierarchy, built of flash as main memory and DRAM as a cache, will approach the performance of DRAM, at the price-per-bit of flash. In our studies as part of this project, we have investigated Non-Volatile Main Memory (NVMM), a new main-memory architecture for large-scale computing systems, one that is specifically designed to address the weaknesses described previously. In particular, it provides the following features: non-volatility: The bulk of the storage is comprised of NAND flash, and in this organization DRAM is used only as a cache, not as main memory. Furthermore, the flash is journaled, which means that operations such as checkpoint/restore are already built into the system. 1+ terabytes of storage per socket: SSDs and DRAM DIMMs have roughly the same form factor (several square inches of PCB surface area), and terabyte SSDs are now commonplace. performance approaching that of DRAM: DRAM is used as a cache to the flash system. price-per-bit approaching that of NAND: Flash is currently well under $0.50 per gigabyte; DDR3 SDRAM is currently just over $10 per gigabyte [Newegg 2014]. Even today, one can build an easily affordable main memory system with a terabyte or more of NAND storage per CPU socket (which would be extremely expensive were one to use DRAM), and our cycle- accurate, full-system experiments show that this can be done at a performance point that lies within a factor of two of DRAM.« less
  1. Univ. of Maryland, College Park, MD (United States)
Publication Date:
OSTI Identifier:
Report Number(s):
Final Report
DOE Contract Number:
Resource Type:
Technical Report
Research Org:
Univ. of Maryland, College Park, MD (United States)
Sponsoring Org:
Country of Publication:
United States