



# Energy Efficient Neuromorphic Algorithm Acceleration Enabled by Resistive Memory (ReRAM) Crossbars

Matthew J. Marinella\*, S. Agarwal, R. Jacobs-Gedrim, D.R. Hughart, I. Richter, A. Hsia, E. Fuller, A.A. Talin, R. Goeke, S.J. Plimpton, and C.D. James

Sandia National Laboratories

\*[matthew.marinella@sandia.gov](mailto:matthew.marinella@sandia.gov)



Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525.

# Outline

- **Intro and Motivation**
- **ReRAM-Based Accelerator Key Concepts**
- **ReRAM-Based Accelerator Model**
- **Conclusion**

# Why do we need more efficient computers?

- **Google Deep Learning Study**
  - 16000 core, 1000 machine GPU cluster
  - Trained on 10 million 200x200 pixel images
  - Training required 3 days
  - Training set size set by what can be completed in less than one week
- **What would they like to do?**
  - ~2 billion photos uploaded to internet per day (2014)
  - Can we train a deep net on one day of image data?
  - Assume 1000x1000 nominal image size, linear scaling (both assumptions are unrealistically optimistic)
  - *Requires 5 ZettaIPS to train in 3 days*  
*(ZettaIPS=10<sup>21</sup> IPS; ~5 billion modern GPU cores)*
  - Data is increasing exponentially with time
- **Need $>10^{16}$ to $10^{18}$ instructions per second on one IC**
  - Less than 10 fJ per instruction energy budget
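The energy budget per instruction follows from dividing a chip power budget by the target instruction rate. The sketch below assumes a 100 W single-chip power limit (an assumption at this point in the talk; the function and constant names are illustrative):

```python
def energy_per_instruction(power_w: float, instr_per_s: float) -> float:
    """Energy available per instruction (joules) for a given power budget."""
    return power_w / instr_per_s


CHIP_POWER_W = 100.0  # assumed single-chip power budget

for rate in (1e16, 1e17, 1e18):
    e_j = energy_per_instruction(CHIP_POWER_W, rate)
    print(f"{rate:.0e} IPS -> {e_j * 1e15:.2f} fJ per instruction")
```

Under this assumption, the low end of the target range ($10^{16}$ instructions per second) corresponds to the 10 fJ budget, and the high end is 100 times tighter.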



Q. Le, IEEE ICASSP 2013

# Where Are we Today?

- **Single Unit: NVIDIA Tesla P100 GPU**
  - Most advanced GPU processor specs, released late 2016
  - Targets deep learning and neural applications
  - 20 TFLOPs 16-bit peak performance with peak power dissipation of 300 W
  - About 70 GFLOPs/W, or about 15 pJ/FLOP (16 bit)
- **Supercomputer: Sunway TaihuLight (China)**
  - Top supercomputer in the world
  - ShenWei processor
  - 90 PFLOPs peak, 15 MW power
  - 6 GFLOPs/W or about 170 pJ/FLOP
- **Need >1000x improvement to tackle *internet-scale* problems**
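The efficiency figures above can be reproduced from the quoted peak rates and power draws. This is a small arithmetic sketch; the 10 fJ target and the helper name `efficiency` are illustrative choices, not part of the original slides:

```python
TARGET_PJ = 0.01  # 10 fJ per operation target, expressed in pJ


def efficiency(flops: float, power_w: float):
    """Return (GFLOPs per watt, pJ per FLOP) for a peak rate and power."""
    flops_per_watt = flops / power_w
    return flops_per_watt / 1e9, 1e12 / flops_per_watt


systems = {
    "Tesla P100 (16 bit)": (20e12, 300.0),  # 20 TFLOPs at 300 W
    "Sunway TaihuLight": (90e15, 15e6),     # 90 PFLOPs at 15 MW
}

for name, (flops, power_w) in systems.items():
    gflops_per_w, pj_per_flop = efficiency(flops, power_w)
    gap = pj_per_flop / TARGET_PJ
    print(f"{name}: {gflops_per_w:.0f} GFLOPs/W, "
          f"{pj_per_flop:.0f} pJ/FLOP, {gap:.0f}x above target")
```

The unrounded values are roughly 67 GFLOPs/W and 15 pJ/FLOP for the GPU, and 6 GFLOPs/W and 167 pJ/FLOP for the supercomputer, so both sit more than 1000x above a 10 fJ target.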



# Evolution of Computing Machinery




# Metal Oxide Resistive RAM (ReRAM)

- Sandia TiN/Ta/TaO<sub>x</sub>/TiN example device
- Starts as insulating MIM structure
- Forming: remove O<sup>2-</sup>  $\rightarrow$  soft breakdown
- Bipolar resistance modulation
- Excellent memory attributes: switching in less than 1 ns, less than 1 pJ demonstrated, scaling to 5 nm, $>10^{12}$ write cycles



# Crossbar Theoretical Limits

- Potential for 100 Tbit of ReRAM on chip
- If each can perform 1M computations of interest per second (1 M-op):
  - $10^{12}$ active devices/chip $\times$ $10^6$ cycles per second $\rightarrow$ $10^{18}$ computations per second per chip
  - Exascale computations per second on one chip!
- To avoid melting the chip, power over the entire area must be limited to $\sim 100$ W
- Allowed energy per operation $= P \times t/\text{op} = 100\ \text{W} / 10^{18}\ \text{op/s} = 10^{-16}\ \text{J} = 100$ aJ/operation
- 10 nm line capacitance = 10 aF
- Can charge line to 1 V with 10 aJ
- Drawback: “only”  $\sim 100B$  transistors/chip
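The limit estimate above is plain arithmetic and can be checked in a few lines. The constants come from the bullets; the variable names are illustrative, and the line-charging energy uses the $CV^2$ convention adopted elsewhere in this talk:

```python
ACTIVE_DEVICES = 1e12         # active devices per chip
OPS_PER_DEVICE_PER_S = 1e6    # 1 M-op per device per second
CHIP_POWER_W = 100.0          # thermal limit for the whole chip

ops_per_s = ACTIVE_DEVICES * OPS_PER_DEVICE_PER_S
energy_per_op_j = CHIP_POWER_W / ops_per_s

LINE_CAPACITANCE_F = 10e-18   # 10 aF for a 10 nm line
LINE_VOLTAGE_V = 1.0
# Energy to charge the line, using the C * V^2 convention
line_energy_j = LINE_CAPACITANCE_F * LINE_VOLTAGE_V ** 2

print(f"ops per second per chip: {ops_per_s:.1e}")
print(f"allowed energy per op:   {energy_per_op_j * 1e18:.0f} aJ")
print(f"line charging energy:    {line_energy_j * 1e18:.0f} aJ")
```

Charging a 10 nm line to 1 V costs about a tenth of the allowed 100 aJ per operation, which is why the budget is tight but not ruled out by wire capacitance alone.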



# Why is it essential to cram so many computations on a single chip?

Can you simply connect millions of ultra-efficient chips?

Yes, but every time data leaves the chip, it moves up the communication hierarchy

→ Energy efficiency per operation is reduced



# How does a crossbar perform a useful computation per device?

- Electronic Vector Matrix Multiply

## Mathematical

$$V^T W = I$$

$$\begin{bmatrix} V_1 & V_2 & V_3 \end{bmatrix} \begin{bmatrix} W_{1,1} & W_{1,2} & W_{1,3} \\ W_{2,1} & W_{2,2} & W_{2,3} \\ W_{3,1} & W_{3,2} & W_{3,3} \end{bmatrix} = \begin{bmatrix} I_1 & I_2 & I_3 \end{bmatrix}, \qquad I_j = \sum_{i} V_i W_{i,j}$$
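The product $V^T W = I$ can be written out directly: each column current is the sum of the row voltages weighted by the conductances in that column. The sketch below is a plain-Python illustration with made-up values; `crossbar_vmm` is a hypothetical helper name, and an ideal crossbar (no wire resistance or noise) is assumed:

```python
def crossbar_vmm(voltages, conductances):
    """Column currents I_j = sum_i V_i * W[i][j] for an ideal crossbar.

    voltages:     list of N row voltages
    conductances: N x M nested list of device conductances (the weights)
    """
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    if len(voltages) != n_rows:
        raise ValueError("need one voltage per crossbar row")
    return [
        sum(voltages[i] * conductances[i][j] for i in range(n_rows))
        for j in range(n_cols)
    ]


# Illustrative values: volts on the rows, microsiemens in the array,
# so the resulting column currents are in microamps.
V = [0.1, 0.2, 0.3]
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
print(crossbar_vmm(V, W))
```

In the analog array all of these multiply-accumulates happen in parallel through Ohm's law and current summation on the column wires; the loops here only mirror the arithmetic.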

## Electrical



# Basics of Neural Networks

## Simple Network: Backpropagation

### Basic Building Block

$$y = \frac{1}{1 + e^{-z}}$$

Figure labels: inputs, weights, neuron, outputs. Incorrect output: adjust weights. Correct output: no adjustment.
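The building block can be sketched as a single sigmoid neuron whose weights change only when the output misses its target. The update below is a generic delta-rule step on squared error, assumed here for illustration rather than taken from the slides; `learning_rate` and the sample values are hypothetical:

```python
import math


def sigmoid(z: float) -> float:
    """Logistic activation y = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights) -> float:
    """Weighted sum of the inputs passed through the sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(z)


def delta_rule_step(inputs, weights, target, learning_rate=0.5):
    """Return weights after one gradient step on 0.5 * (y - target)^2."""
    y = neuron_output(inputs, weights)
    delta = (y - target) * y * (1.0 - y)  # error times sigmoid derivative
    return [w - learning_rate * delta * x for w, x in zip(weights, inputs)]


x = [1.0, 0.5]
w = [0.0, 0.0]
print("before:", neuron_output(x, w))
w = delta_rule_step(x, w, target=1.0)
print("after: ", neuron_output(x, w))
```

When the output already equals the target, `delta` is zero and the weights are returned unchanged, matching the "correct, no adjustment" case.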



# Mapping Backprop to a Crossbar



# Vector Matrix Multiply, Rank 1 Update: Key kernel used in many algorithms

# Analog Core: Forward Propagation



$O(N^2)$   
Operations

$O(N)$   
Operations
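The second key kernel, the rank-1 (outer product) update, touches every one of the N×N weights, while the neuron functions on the periphery handle only N values. The sketch below assumes the $O(N^2)$ and $O(N)$ labels refer to this split; the function names and sign convention are illustrative:

```python
def rank1_update(weights, x, delta, learning_rate):
    """In-place outer-product update: W[i][j] += lr * x[i] * delta[j]."""
    for i, xi in enumerate(x):
        row = weights[i]
        for j, dj in enumerate(delta):
            row[j] += learning_rate * xi * dj
    return weights


def operation_counts(n: int) -> dict:
    """Operation counts for one N x N layer under the assumed split."""
    return {"crossbar_macs": n * n, "periphery_ops": n}


W = [[0.0, 0.0], [0.0, 0.0]]
rank1_update(W, x=[1.0, 2.0], delta=[3.0, 4.0], learning_rate=1.0)
print(W)
print(operation_counts(1024))
```

For a 1024 × 1024 array the crossbar absorbs about a million multiply-accumulates per pass, leaving roughly a thousand operations for the digital periphery.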



# Analog Core: Back Propagation



# Accelerator Architecture




# Device to Algorithm Model

What device properties are needed?

Top Down



Neural  
Algorithm Level  
Model

Computer  
Architecture Level  
Model

Circuit Level  
Models

Device Level  
Models

How do specific devices work in  
system?



# Experimental Device Nonidealities

- Ideally, the weight would increase and decrease linearly, in proportion to the learning-rule result
- Experimental devices have several nonidealities: **Write Variability**, **Write Nonlinearity**, **Asymmetry**, **Read Noise**
- Circuits also have A/D, D/A noise, parasitics
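A toy model helps show how these nonidealities enter a weight update. The sketch below is an illustrative model on a normalized conductance, not the device model used in the authors' simulations; the step sizes, noise levels, and function names are all assumptions:

```python
import random


def apply_pulse(g, polarity, rng, step_set=0.05, step_reset=0.08,
                write_sigma=0.01):
    """Apply one write pulse to a normalized conductance g in [0, 1].

    polarity > 0 is a SET pulse, otherwise a RESET pulse.
    """
    if polarity > 0:
        dg = step_set * (1.0 - g)      # nonlinearity: steps shrink near g = 1
    else:
        dg = -step_reset * g           # asymmetry: RESET uses a different step
    dg += rng.gauss(0.0, write_sigma)  # write variability
    return min(1.0, max(0.0, g + dg))  # conductance stays within its range


def read(g, rng, read_sigma=0.005):
    """Noisy read of the stored conductance."""
    return g + rng.gauss(0.0, read_sigma)


rng = random.Random(0)  # fixed seed so the run is repeatable
g = 0.0
trace = []
for _ in range(50):     # ramp up with SET pulses
    g = apply_pulse(g, +1, rng)
    trace.append(g)
for _ in range(50):     # ramp down with RESET pulses
    g = apply_pulse(g, -1, rng)
    trace.append(g)

print(f"after SET ramp: {trace[49]:.3f}, after RESET ramp: {trace[-1]:.3f}")
print(f"noisy read of final state: {read(g, rng):.3f}")
```

An ideal device would move by the same amount on every pulse in both directions; in this model the step depends on the current state and on the pulse polarity, and every write and read carries noise.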



# ReRAM Measurements


- DC current-voltage “loop” sweeps are not time-controlled
  - Excessive heating and early wearout
  - Do not provide info on dynamics
- Physical switching < 10 ns
- Need pseudo-RF setup to measure
  - Ground/signal, conductor backed
  - Agilent B1530 module
  - 10 ns rise time/fall time (RT/FT), 10 ns pulse width (PW)
  - 1 V nominal, ~140 mV overshoot



# ReRAM Analog Characterization



- Use as a neuromorphic weight requires precise analog tuning
- Dataset requires 1000 repeated SET and RESET pulses
- Nominal pulse values
  - SET: +1V 10ns RT/PW/FT
  - RESET: -1V 10ns RT/PW/FT
  - READ: 100 mV 1 ms RT/PW/FT



# Pulse Width Analog Measurements

100 on→off cycles,  
(200k pulses)



# Effect of Pulse Width and Edge Time



- Shorter pulses may be employed to lower conductance switching range
- Linearity qualitatively similar across Pulse Width (PW) and Edge Time (ET)
  - Best for SET at 100 ns
  - Best for RESET at 1  $\mu$ s
- Relative conductance change increased with shorter Pulse Width / Edge Time

Nominal Pulse Voltage Values: SET: +1 V RESET: -1 V

# Repeated Pulsed Cycling



# TaOx ReRAM in Backprop Training



| Data set              | # Training Examples | # Test Examples | Network Size |
|-----------------------|---------------------|-----------------|--------------|
| UCI Small Digits[1]   | 3,823               | 1,797           | 64×36×10     |
| File Types[2]         | 4,501               | 900             | 256×512×9    |
| MNIST Large Digits[3] | 60,000              | 10,000          | 784×300×10   |

# Modeling Effect of Pulse Time

Increasing Network Size



| TaOx      | Large Images | Small Images | File Types |
|-----------|--------------|--------------|------------|
| 10 ns     | 84.45%       | 71.40%       | 77.67%     |
| 100 ns    | 78.48%       | 89.48%       | 67.78%     |
| 1 $\mu$s  | 71.48%       | 71.84%       | 56.33%     |

**How can training accuracy be improved?**

# Li-Ion Synaptic Transistor for Analog Computation (LISTA)



**G-V for LISTA Transistor**



E. Fuller et al, *Adv Mater*, accepted 2017

# Analog State Characterization



E. Fuller et al, *Adv Mater*, accepted 2017


# LISTA-device Performance for Backprop Algorithm

Increasing Network Size



| Data set              | # Training Examples | # Test Examples | Network Size |
|-----------------------|---------------------|-----------------|--------------|
| UCI Small Digits[1]   | 3,823               | 1,797           | 64×36×10     |
| File Types[2]         | 4,501               | 900             | 256×512×9    |
| MNIST Large Digits[3] | 60,000              | 10,000          | 784×300×10   |

E. Fuller et al, *Adv Mater*, accepted 2017

# Circuit-Level Improvement

- Allows accuracy much closer to ideal with a high-variability TaO<sub>x</sub> device
- LISTA achieves essentially perfect accuracy
- Requires trading energy/latency for accuracy; the exact tradeoff depends on algorithm requirements



# Energy and Latency Comparison

| Overview                                        | Digital SRAM       | Digital ReRAM      | Analog ReRAM Crossbar              |
|-------------------------------------------------|--------------------|--------------------|------------------------------------|
| <b>Equivalent Area</b><br>~450 1k × 1k matrices | 400 mm<sup>2</sup> | 32 mm<sup>2</sup>  | 11 mm<sup>2</sup><br>[64 nm pitch] |
| <b>Total Time [per cycle]</b>                   | ~100 μs            | ~60 μs             | ~5 μs                              |
| <b>Total Energy [per cycle]</b>                 | ~1000 nJ           | ~700 nJ            | ~15 nJ                             |
| <b>Matrix Storage Area</b>                      | 95%                | 50%                | 17%                                |
| <b>Periphery Area</b>                           | 5%                 | 50%                | 100% (crossbar is above periphery) |
| <b>Matrices per 400 mm<sup>2</sup> Chip</b>     | ~450               | ~5,500             | ~15,000                            |

The above figures do not include a SIMD engine or on-chip routing fabric, and are based on a 14nm FinFET process.
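The per-cycle totals in the comparison table translate into simple advantage ratios. The sketch below only divides the tabulated values; the dictionary layout and the helper name `advantage` are illustrative:

```python
designs = {
    "Digital SRAM": {"time_us": 100.0, "energy_nj": 1000.0},
    "Digital ReRAM": {"time_us": 60.0, "energy_nj": 700.0},
    "Analog ReRAM crossbar": {"time_us": 5.0, "energy_nj": 15.0},
}


def advantage(baseline: dict, candidate: dict) -> dict:
    """Factor by which the candidate improves on the baseline per metric."""
    return {key: baseline[key] / candidate[key] for key in baseline}


analog = designs["Analog ReRAM crossbar"]
for name in ("Digital SRAM", "Digital ReRAM"):
    ratio = advantage(designs[name], analog)
    print(f"analog vs {name}: {ratio['time_us']:.0f}x faster, "
          f"{ratio['energy_nj']:.0f}x less energy")
```

From these approximate table values, the analog crossbar is about 20x faster and about 67x lower in energy than digital SRAM, and about 12x faster and about 47x lower in energy than digital ReRAM.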

# Energy Analysis

| Per-Component Breakdown | Metric | Digital SRAM | Digital ReRAM | Analog ReRAM Crossbar |
|-------------------------|--------|--------------|---------------|-----------------------|
| <b>Matrix Storage</b><br>1024 × 1024<br>Digital: 8 bits/value<br>Analog: 1 cell/value<br>[Values are per-array] | Area | <b>800,000 $\mu\text{m}^2$</b> | 35,000 $\mu\text{m}^2$ | 10,000 $\mu\text{m}^2$ |
| | Read | 30 nJ / 15 $\mu\text{s}$ | 15 nJ / 4 $\mu\text{s}$ | $\sim$ 3 nJ / $\sim$ 1.5 $\mu\text{s}$ |
| | Read Transpose | 300 nJ / <b>65 $\mu\text{s}$</b> | 15 nJ / 4 $\mu\text{s}$ | $\sim$ 3 nJ / $\sim$ 1.5 $\mu\text{s}$ |
| | Write | 30 nJ / 15 $\mu\text{s}$ | 50 nJ / <b>45 $\mu\text{s}$</b> | $\sim$ 3 nJ / $\sim$ 1.5 $\mu\text{s}$ |
| <b>Multiply Accumulators</b><br>[256 in parallel] | Area | 19,000 $\mu\text{m}^2$ | | Performed by crossbar |
| | Run [1M ops] | <b>200 nJ / 4 $\mu\text{s}$</b> | | |
| <b>Output LUT</b><br>[8 bit $\rightarrow$ 16 bit] | Area | 1,400 $\mu\text{m}^2$ | | Uses Digital Methods |
| | Read [1K values] | 1 nJ / 1 $\mu\text{s}$ | | |
| <b>Input/Output Buffers</b><br>[8 bits] | Area | 13,000 $\mu\text{m}^2$ | | |
| | Per Run | $\sim$ 0.1 nJ | | |
| <b>128 Entry 1024x8 Vector Cache</b> (8 matrices per cache)<br>[Values are per vector] | Area | 90,000 $\mu\text{m}^2$ | 4,000 $\mu\text{m}^2$ | Uses Digital Methods |
| | Read | $\sim$ 0.1 nJ / $\sim$ 0.2 $\mu\text{s}$ | $\sim$ 1 nJ / $\sim$ 4 ns | |
| | Write | $\sim$ 0.1 nJ / $\sim$ 0.2 $\mu\text{s}$ | $\sim$ 1 nJ / $\sim$ 50 ns | |

Digital ReRAM figures are based on output from X. Dong et al., *NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory*, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 7, pp. 994-1007, July 2012.


# Conclusion

- Dennard (constant power density) scaling has ceased and Moore's law is slowing
- As this slows, a new direction will be needed to continue the exponential improvements in performance per watt (energy efficiency)
- New paradigms like neuromorphic computing will be required for sub-fJ computing
- We now require a device-through-system design mentality
  - Motivation behind CrossSim
- Oxide-based resistive memory offers intriguing device options for both eras
- Novel lithiated device LISTA and circuit techniques offer significant potential in the development of a low energy neural accelerator

# Thank you!



# Acknowledgements

- This work is funded by Sandia's Laboratory Directed Research and Development as part of the Hardware Acceleration of Adaptive Neural Algorithms Grand Challenge Project
- Many shared ideas among collaborators:
  - DOE BIS: John Shalf, Ramamoorthy Ramesh, Patrick Nealeau
  - Dave Mountain, Mark McLean, US Government
  - Stan Williams, John Paul Strachan, HPL
  - Jianhua Yang, U Mass
  - Hugh Barnaby, Mike Kozicki, Shimeng Yu, ASU
  - Sayeef Salahuddin, UC Berkeley
  - Engin Ipek, U Rochester
  - Tarek Taha, U Dayton
  - Paul Franzon, NC State University
  - Dhireesha Kudithipudi, RIT
  - Alberto Salleo, Stanford
  - Dozens of others...
- **We are especially interested in collaborations on CrossSim!**

# Backup Slides



# Energy Analysis

| Analog Breakdown<br>Values are per indicated operation | Area                   | Energy                              | Latency                         |
|--------------------------------------------------------|------------------------|-------------------------------------|---------------------------------|
| Array<br>[1024 × 1024]                                 | 4,300 $\mu\text{m}^2$  | ~0.2 nJ read<br>~<b>2 nJ</b> write  | ~1 ns (propagation)             |
| Temporal Drivers<br>[1024 rows]                        | 460 $\mu\text{m}^2$    | ~2 pJ read<br>~0.3 nJ write         | <b>1 ns × 2<sup>bits</sup></b>  |
| Voltage Drivers<br>[1024 cols; 16 voltages]            | 5,000 $\mu\text{m}^2$  | ~2 pJ read<br>~0.3 nJ write         | $\leq$ 1 ns                     |
| Integrators/ADCs (reads only)                          | 3,000 $\mu\text{m}^2$  | ~<b>2 nJ</b>                        | <b>1 ns × 2<sup>bits</sup></b>  |

# Multiscale CoDesign Model: Neuromorphic Crossbar Accelerator



**Sandia Cross-Sim:**  
Translates device measurements and crossbar circuits to algorithm-level performance



**Memristor fabrication and measurements in MESAFab**



**DFT model of oxide physics, bands**

## Target Algorithms

- Deep Learning
- Sparse Coding
- Liquid State Machines

## Algorithms



## Architecture



**Modified McPAT/CACTI:**  
Model performance and energy requirements



**Sandia's Xyce Circuit Sim:** Simulate crossbar circuits based on our devices

## Circuits



## Devices



**Drift-diffusion model of ReRAM band diagram & transport (REOS, Charon)**



## Materials

**In situ TEM of filament switching:** Use DFT model to interpret EELS signature



# Beyond Moore Co-design Framework

## Modeling

10,000x improvement: 20 fJ per instruction equivalent

## Experimental

### Algorithms and Software Environments

- Application Performance Modeling



### Computer System Architecture Modeling

- Next generation of Structural Simulation Toolkit
- Heterogeneous systems HPC models



### Microarchitecture Models

- McPAT, CACTI, NVSIM, gem5



### Circuit/IP Block Design and Modeling

- SPICE/Xyce model



### Component Fabrication

- Processors, ASICs
- Photonics
- Memory

### Compact Device Models

- Single device electrical models
- Variability and corner models



### Device Measurements

- Single device electrical behavior
- Parametric variability

### Device Physics Modeling

- Device physics modeling (TCAD)
- Electron transport, ion transport
- Magnetic properties



### Device Structure Integration and Demonstration

- Novel device structure demonstration

### Process Module Modeling

- Diffusion, etch, implant simulation
- EUV and novel lithography models



### Process Module Demonstrations

- EUV and novel lithography
- Diffusion, etch, implant simulation

### Atomistic and Ab-Initio Modeling

- DFT – VASP, Socorro
- MD – LAMMPS



Example activities within an MSCD framework



### Fundamental Materials Science

- Understanding Properties/Defects via Electron, Photon, & Scanning Probes
- Novel Materials Synthesis

Framework layers (figure labels): Algorithms & SW Environments; Hardware & Circuit Architectures; Comm., Memory & Computation Devices; Materials



**Low voltage high performance logic:**

- Tunnel-FET
- Negative Cg FET
- Single Electron Transistor

# TaOx ReRAM in Backprop Training (10ns)

Increasing Network Size



| Data set              | # Training Examples | # Test Examples | Network Size |
|-----------------------|---------------------|-----------------|--------------|
| UCI Small Digits[1]   | 3,823               | 1,797           | 64×36×10     |
| File Types[2]         | 4,501               | 900             | 256×512×9    |
| MNIST Large Digits[3] | 60,000              | 10,000          | 784×300×10   |

**How can training accuracy be improved?**

# Switching Power & Energy Measurement



- Energy determination requires fast pulsed measurements
- Can measure resistance change during pulsed switching with pulse widths $> 100$ ns and edge times $> 10$ ns
- $E = \int_0^{t_{\text{PW}}} P(t)\,dt$
  - $\approx 800$ pJ (RESET)
  - $\approx 400$ pJ (SET)
- Wasted power/energy past first $\sim 1$ ns of pulse
- Lower energy with high-resistance devices, sub-ns pulse
  - $< 1$ pJ demonstrated @ $< 1$ ns in similar TaO<sub>x</sub> device (by HP)
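Switching energy is the time integral of instantaneous power over the pulse. A minimal sketch of that integral using the trapezoidal rule on sampled voltage and current is shown below; the waveform is synthetic (a flat 1 V, 4 mA, 100 ns pulse chosen for easy checking), not measured data, and `pulse_energy` is a hypothetical helper name:

```python
def pulse_energy(times_s, voltages_v, currents_a):
    """Trapezoidal estimate of E = integral of V(t) * I(t) dt, in joules."""
    power = [v * i for v, i in zip(voltages_v, currents_a)]
    energy = 0.0
    for k in range(1, len(times_s)):
        dt = times_s[k] - times_s[k - 1]
        energy += 0.5 * (power[k] + power[k - 1]) * dt
    return energy


# Synthetic flat pulse sampled every 10 ns over 100 ns
times = [k * 10e-9 for k in range(11)]
volts = [1.0] * len(times)
amps = [4e-3] * len(times)

print(f"pulse energy: {pulse_energy(times, volts, amps) * 1e12:.0f} pJ")
```

Because the integral grows with pulse length at a given power, any portion of the pulse after the device has finished switching adds energy without changing the state, which is the motivation for sub-ns pulses.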

# Theoretical Efficiency Analysis

**SRAM crossbar:**



SRAMs must be read one row at a time

→ charges M columns;

$$E = N \text{ Rows} \times O(N) \text{ wire length} \times M \text{ Columns}$$

$$\sim O(N^2 \times M)$$

Implication: the crossbar is a factor of $O(N)$ better than SRAM in energy consumption for vector-matrix multiply computations

**ReRAM crossbar:**



Energy to charge the crossbar is $CV^2$;  
$E \propto C \propto \text{number of RRAMs} \propto N \times M$, so $E \sim O(N \times M)$
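The scaling argument can be made concrete by counting energy in arbitrary units for an N × M array. The sketch below encodes only the proportionalities stated in this section (constants dropped); the function names are illustrative:

```python
def sram_energy_units(n_rows: int, m_cols: int) -> int:
    """N row reads, each charging M columns whose length scales with N."""
    return n_rows * n_rows * m_cols


def crossbar_energy_units(n_rows: int, m_cols: int) -> int:
    """Charging the crossbar once: proportional to the N * M devices."""
    return n_rows * m_cols


for n in (64, 256, 1024):
    ratio = sram_energy_units(n, n) / crossbar_energy_units(n, n)
    print(f"N = M = {n}: SRAM / crossbar energy ratio = {ratio:.0f}")
```

The ratio equals N, so the advantage of the crossbar grows linearly with array size under these assumptions.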

# Technological Considerations: Trends



# HAANA Crossbar Accelerator Design

- Initial work by several groups indicates that order-of-magnitude energy efficiency gains are possible using a ReRAM accelerator
- The assumptions and outcomes of these models vary significantly
- HAANA goal: develop a Multiscale CoDesign Framework that can evaluate our neural crossbar accelerator algorithms, architectures, and devices on a “level playing field”
- Evaluate architectures and devices for **accuracy, energy, performance**
- Once a clear energy advantage is demonstrated, move forward with technology development



# How can we get to fJ computing?

| Description                           | NPU-1              | NPU-2             | NPU-3             | TrueNorth                |
|---------------------------------------|--------------------|-------------------|-------------------|--------------------------|
| System clock frequency                | 100kHz             | 1 MHz             | 10 MHz            | 1 kHz                    |
| Synapses per neuron                   | 500                | 500               | 500               | 256                      |
| Average energy per device update      | 1 fJ               | 1 fJ              | 10 aJ             | 26 pJ                    |
| Energy per update op cycle (per core) | 250pJ              | 250pJ             | 2.5pJ             |                          |
| Operations per second (per core)      | 250 GOPs           | 250 GOPs          | 250 GOPs          |                          |
| Single core max power                 | 25 µW              | 250 µW            | 25 µW             |                          |
| Chip Area                             | 4 cm <sup>2</sup>  | 4 cm <sup>2</sup> | 4 cm <sup>2</sup> | 4.3 cm <sup>2</sup>      |
| Cores per layer                       | 800 k              | 800 k             | 800 k             | 4 k                      |
| Layers per chip                       | 10                 | 100               | 10                | 1                        |
| Neurons per chip                      | 4 B                | 200 B             | 4 B               | 1 M                      |
| Chip Max Power                        | 200 W              | 10 kW             | 200 W             | 70 mW                    |
| <b>Chip Max operations per second</b> | <b>0.2 ExaMACS</b> | <b>10 ExaMACS</b> | <b>20 ExaMACS</b> | <b>28 GigaOps</b>        |
| Operations per second per watt        | $10^{15}$ MACS/W   | $10^{15}$ MACS/W  | $10^{17}$ MACS/W  | $4 \times 10^{11}$ Ops/W |

MACS = Multiply Accumulate per Second
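The last row of the table is the chip-level operation rate divided by the chip maximum power, and its inverse is the energy per operation. The sketch below recomputes both from the tabulated chip values; the dictionary and function names are illustrative:

```python
chips = {
    # name: (max operations per second, max power in watts)
    "NPU-1": (0.2e18, 200.0),
    "NPU-2": (10e18, 10e3),
    "NPU-3": (20e18, 200.0),
    "TrueNorth": (28e9, 70e-3),
}


def ops_per_watt(ops_per_s: float, power_w: float) -> float:
    """Operations per second delivered per watt."""
    return ops_per_s / power_w


def energy_per_op_j(ops_per_s: float, power_w: float) -> float:
    """Average energy per operation in joules."""
    return power_w / ops_per_s


for name, (ops, power) in chips.items():
    print(f"{name}: {ops_per_watt(ops, power):.1e} ops/s/W, "
          f"{energy_per_op_j(ops, power):.1e} J per op")
```

The computed efficiencies match the table ($10^{15}$, $10^{15}$, $10^{17}$, and $4 \times 10^{11}$ operations per second per watt), corresponding to about 1 fJ, 1 fJ, 10 aJ, and 2.5 pJ per operation.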



# How do we get to 10 fJ per inst?

- CMOS scaling not providing significant energy efficiency gains
- Many algorithmic, architectural, and device answers:
  - Neuromorphic algorithms
  - Analog accelerators
  - mV switch (e.g. TFET, NgcFET)
  - Superconducting electronics, quantum computing...
- Which horse should we bet on??
- Well...studies for each approach “prove” each respective option to be the best path forward
- Winner not yet clear, most will require major development efforts to realize full potential (\$\$)
- Need systematic, universal method to determine best approaches for further investment...