



# Commercial Field-Programmable Gate Arrays for Space Processing Applications

David S. Lee



Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA-0003525. SAND 2017-xxxx

# Outline

- Introduction
  - Field-Programmable Gate Arrays
  - Ionizing radiation effects
- Device Considerations
  - Space-qualified FPGAs
  - Characterization of commercial devices
  - Effects of device scaling
- Designing for Space
- Conclusion

# Background: FPGAs

- Field-Programmable Gate Arrays (FPGAs) are devices composed of various feature blocks, connected by a programmable interconnect
- These features may include I/O, configurable logic blocks, memory, high-speed serial transceivers, processors, and more
- Designers describe how the hardware should behave using a Hardware Description Language (HDL)
- Software tools create the FPGA configuration by translating the hardware description into a design that maps onto the available resources in the FPGA



(Source: D. Ashby, "Circuit Design: Know It All")

# Ionizing Radiation

- The greatest challenge for FPGAs operating in the space environment is ionizing radiation
- Radiation induces multiple types of effects:
  - Single-Event Latchup (SEL)
  - Single-Event Upset (SEU)
  - Single-Event Transient (SET)
  - Total Ionizing Dose (TID)
  - And other less important or less common effects...

# Single-Event Latchup

- Bulk CMOS circuits contain parasitic *pnp* and *npn* transistors connected in a feedback loop (normally both off)
- A radiation-induced event can inject charge that activates the feedback loop and connects $V_{CC}$ to GND
- The high current density produces extremely high, localized heat
- The result is usually destructive or leaves latent damage



Parasitic BJT structures in CMOS (modified from source: F. Sexton, *Destructive Single-Event Effects in Semiconductor Devices and ICs*, IEEE TNS 2003)



Latent damage from SEL (Source: H. Becker, T. Miyahira, and A. Johnston, *Latent Damage in CMOS Devices from SEL*, IEEE TNS 2002)

# Single-Event Upset & Transients

- Single-Event Upsets (SEU) can flip the state of data by introducing enough charge to overpower the transistors holding the value of a data bit
- Single-Event Transients (SET) can cause transient pulses by momentarily turning on driver transistors that should be off



*Illustration of SEU in SRAM-based FPGA affecting the device configuration. (A) original circuit, (B) SEU affecting LUT equation, (C) SEU affecting routing matrix. (Source: B. Pratt, "Analysis and Mitigation of SEU-Induced Noise in FPGA-based DSP Systems")*

# Space-Qualified FPGAs

- Space-qualified FPGAs are desirable as they are qualified to operate in the harsh space environment
- Some features are radiation hardened in these devices
- Three FPGAs cover the majority of current space-qualified FPGA needs
  - Xilinx Virtex-5QV (SRAM, 65 nm)
  - Microsemi RTG4 (Flash, 65 nm)
  - Microsemi RTAX (Anti-fuse, 150 nm)
- The main disadvantages are cost and performance



# Why Commercial FPGAs?

- Commercial FPGAs are built with technology nodes four generations more advanced than existing space FPGAs
- A study performed by T. Lovelly, et al. evaluated performance of many processing technologies with standard benchmarks
  - Modern commercial FPGAs at the time of the study were 5x more computationally efficient than the latest space-grade FPGA – but the commercial FPGAs used are now two generations old
- Other advantages include:
  - Considerably higher performance
  - Lower comparable power
  - Higher density
  - Better tool support
- **Commercial devices are the only viable option to enable high-performance, high-bandwidth missions**

# Radiation Effects in Commercial FPGAs

- Sandia has been among the first to characterize radiation effects in the Xilinx 28 nm 7-series, 20 nm UltraScale, and 16 nm UltraScale+ devices
- Tests conducted include single-event latch-up and single-event upset testing in heavy ions
- Other organizations have performed proton, prompt dose, and TID testing



# Observations and Results

- Most devices in the Xilinx commercial family perform well with respect to radiation effects
  - Configuration and flip-flop SEU rates were acceptable and improved in line with device scaling
  - UltraScale+ utilizes transistors built with a 16 nm FinFET process; FinFET offers significant SEU benefits
  - SEL testing performed to date shows non-destructive latch-up in 7-series, and no latch-up of any kind observed in the 20 nm UltraScale family
  - Total dose response in 7-series and UltraScale appears to be very promising, with UltraScale+ results soon to come



# Scaling Trends in Xilinx Families

- This chart shows cross-section (susceptibility) as a function of linear energy transfer (LET), the energy an ionizing particle deposits per unit path length
- As transistor feature sizes shrink, per-bit SEU rates decrease as well



## Configuration Memory Upsets

| Family          | Upsets per bit per day | Node         | Improvement vs. Virtex-II* |
|-----------------|------------------------|--------------|----------------------------|
| **Virtex-II**   | 3.99E-07               | 130 nm       | 1.00                       |
| **Virtex-4**    | 2.63E-07               | 90 nm        | 1.52                       |
| **Virtex-7**    | 1.41E-08               | 28 nm        | 28.30                      |
| **UltraScale**  | 7.56E-09               | 20 nm        | 52.78                      |
| **UltraScale+** | ~2.5E-10               | 16 nm FinFET | ~1600                      |
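
The improvement column can be reproduced directly from the per-bit rates. The sketch below is a minimal illustration, assuming the improvement factor is simply the Virtex-II rate divided by each family's rate; the function names and the `n_bits` parameter are hypothetical and not from the original slides. UltraScale+ is left out because its rate is only approximate.

```python
# Per-bit configuration upset rates (upsets/bit/day), copied from the table.
RATES = {
    "Virtex-II": 3.99e-07,
    "Virtex-4": 2.63e-07,
    "Virtex-7": 1.41e-08,
    "UltraScale": 7.56e-09,
}


def improvement(rates, baseline="Virtex-II"):
    """Return each family's improvement factor relative to the baseline family."""
    base = rates[baseline]
    return {family: base / rate for family, rate in rates.items()}


def expected_upsets_per_day(rate_per_bit_day, n_bits):
    """Expected upsets per device-day for a configuration memory of n_bits.

    n_bits is a hypothetical input; use the value for the actual device.
    """
    return rate_per_bit_day * n_bits


if __name__ == "__main__":
    for family, factor in improvement(RATES).items():
        print(f"{family:12s} {factor:7.2f}x")
```

Multiplying a per-bit rate by the number of configuration bits in a given device gives the expected upsets per device-day, which is the figure a scrubbing budget is usually sized against.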

### Other data sources:

[Virtex-II] R. Koga, J. George, G. Swift, C. Yui, L. Edmonds, C. Carmichael, T. Langley, P. Murray, K. Lanes, and M. Napier, "Comparison of Xilinx Virtex-II FPGA SEE sensitivities to protons and heavy ions," *IEEE Transactions on Nuclear Science*, vol. 51, no. 5, pp. 2825-2833, 2004.

[Virtex-4] G. Allen, G. Swift, C. Carmichael, C. Tseng, and G. Miller, "Upset measurements on Mil/Aero Virtex-4 FPGAs incorporating 90 nm features and a thin epitaxial layer."

**Warning: UltraScale+ Data is preliminary and needs better statistics.**

# Changes in Radiation Response

- Part of the investigation into modern FPGA characterization involved the analysis of radiation response for small-feature-size FPGAs in general
- In particular, testing by SNL has observed increasing prevalence of certain effects in these new devices:
  - A non-trivial incidence of multiple cell upset (MCU)
  - A strong rotational (azimuthal) dependence when irradiating at angles away from normal incidence

# Multiple Cell Upset

- Multiple Cell Upset (MCU) occurs when a single ionizing particle upsets multiple, physically adjacent cells
  - Sometimes “Multiple Bit Upset” (MBU) is used; an MBU is an MCU affecting multiple bits in one memory word, potentially defeating ECC
- MCU is highly dependent on the angle of incidence of the event, but even ionizing events that occur at normal incidence can generate MCUs



*Visualization of upset cells in Kintex-7*

# MCU Rates at Normal Incidence



*Comparison of configuration SRAM cells across multiple Xilinx families*

(Source: M. Wirthlin, D. Lee, G. Swift, and H. Quinn, A Method and Case Study on Identifying Physically Adjacent Multiple-Cell Upsets Using 28-nm, Interleaved and SECDED-Protected Arrays)

# MCU Rates for Angular Events

- Angular irradiation data were obtained experimentally using the 28 nm Xilinx Kintex-7
- MCU rates not only vary widely by angle, but also by LET



*MCU prevalence in 28 nm Kintex-7 irradiated with silicon (left) and argon (right)*

# MCU Summary

- The main concern in MCU trends is the non-trivial rates observed at modern technology scale, even at low LET
  - Especially problematic due to the high concentration of low LET ions present in space
  - A sample LET spectrum for GEO under solar-minimum conditions is shown below
- Angular irradiation data is extremely important when analyzing MCU or MBU
- Xilinx has begun utilizing bit-interleaving in 28 nm FPGAs to mitigate MBU effects
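
To illustrate why bit-interleaving helps, the sketch below uses a toy memory row in which physically adjacent columns are assigned to different logical words. The mapping (`column % interleave`) and all function names are assumptions for illustration only; the slides do not describe the actual interleaving scheme used in Xilinx devices.

```python
def physical_to_logical(column, interleave):
    """Map a physical column to (word index, bit index) in a toy interleaved row.

    With interleave == 1 every column belongs to word 0 (no interleaving).
    """
    return column % interleave, column // interleave


def bits_hit_per_word(upset_columns, interleave):
    """Count how many upset bits land in each logical word."""
    counts = {}
    for column in upset_columns:
        word, _bit = physical_to_logical(column, interleave)
        counts[word] = counts.get(word, 0) + 1
    return counts


def is_mbu(upset_columns, interleave):
    """True if any single word receives more than one upset bit."""
    counts = bits_hit_per_word(upset_columns, interleave)
    return any(n > 1 for n in counts.values())


if __name__ == "__main__":
    mcu = [10, 11, 12]  # three physically adjacent upset cells
    print(is_mbu(mcu, interleave=1))  # all three bits fall in one word
    print(is_mbu(mcu, interleave=4))  # each bit falls in a different word
```

In this model an MCU becomes an MBU only when the cluster is wider than the interleave distance, which is why MCU size statistics (including angular data) matter when judging whether single-error-correcting ECC is sufficient.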



Sample LET Spectra at GEO

# Angular Dependence

- Highly scaled commercial FPGA devices show strong angular dependencies, first seen in the 28 nm Kintex-7 and also observed in the 20 nm UltraScale
- The angular susceptibility can vary significantly when compared to normal incidence or other azimuthal angles
- The concept of “effective LET” from tilting may no longer be valid
- This effect should be considered when testing FPGAs for radiation susceptibility
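
For reference, the conventional effective-LET model scales the normal-incidence LET by the inverse cosine of the tilt angle and ignores azimuth entirely. The sketch below shows that conventional calculation only; the function name and argument names are illustrative, and the azimuthal dependence described above is exactly what this model does not capture.

```python
import math


def effective_let(let_normal, tilt_deg):
    """Conventional effective LET: LET / cos(tilt).

    let_normal: LET at normal incidence (e.g. in MeV*cm^2/mg)
    tilt_deg:   tilt away from normal incidence, in degrees (0 <= tilt < 90)
    """
    if not 0 <= tilt_deg < 90:
        raise ValueError("tilt_deg must be in the range [0, 90)")
    return let_normal / math.cos(math.radians(tilt_deg))


if __name__ == "__main__":
    for tilt in (0, 30, 45, 60):
        print(f"tilt {tilt:2d} deg -> effective LET {effective_let(10.0, tilt):.2f}")
```

Because the function takes no azimuth argument, two beam runs at the same tilt but different rotation give the same effective LET, even though the measured cross-sections may differ.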



(Source: N. Dodds, et al., “The Contribution of Low-Energy Protons to the Total On-Orbit SEU Rate”)

# Designing for Space

- Mitigating the SEUs that occur in the configuration memory of SRAM-based FPGAs is critical to long-term reliability
  - The configuration memory may be repaired through a process called “scrubbing” which locates errors in the configuration and corrects them
  - Though scrubbing can repair an error in the configuration, this correction is not instantaneous and an error may propagate through the data path and affect the operational state of the design
- For critical applications, additional design-level mitigation can be employed
  - Error Correcting Codes (ECC) for data storage
  - Triple Modular Redundancy for data and logic
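
The idea of locating and correcting configuration errors can be sketched with a toy in-memory model. This is a conceptual illustration only: the frames are plain Python `bytes` objects, the golden copy is assumed to be error-free, and nothing here represents how a real device's configuration interface should be driven.

```python
def scrub(live_frames, golden_frames):
    """Compare each live frame to its golden copy and rewrite only mismatches.

    Returns the indices of the frames that were repaired.
    Both arguments are lists of bytes objects in this toy model.
    """
    repaired = []
    for index, (live, golden) in enumerate(zip(live_frames, golden_frames)):
        if live != golden:
            live_frames[index] = golden  # repair just the upset frame
            repaired.append(index)
    return repaired


if __name__ == "__main__":
    golden = [bytes([0xAA] * 4) for _ in range(3)]
    live = list(golden)
    live[1] = bytes([0xAA, 0xAB, 0xAA, 0xAA])  # simulate a single-bit upset
    print(scrub(live, golden))  # index of the repaired frame
```

Between the upset and the repair the corrupted frame is still active, which is the window in which an error can propagate into the design state.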

# Scrubbing Configuration Memory

- Scrubbing refers to the correction of upset bits in the configuration memory
- Mandatory when utilizing any SRAM-based FPGA in space
- Proper scrubbing technique is critical – some “vague” or outright incorrect recommendations have been circulated throughout the space FPGA community
- Incorrect scrubbing techniques can cause significant issues in the form of a “Scrub SEFI” (see next slide), which looks a lot like latch-up at first

# Dangers of Bad Scrubbing Techniques



# Mitigating Designs with TMR

- Scrubbing will eventually repair configuration upsets, but it does not prevent invalid data from propagating
- One popular method to increase FPGA reliability is at the design level using “Triple Modular Redundancy” or TMR
  - Makes three copies of your design and inserts majority voters in the data path
  - Any one copy of the circuit can fail; the other two copies will continue to “vote” the correct operational values
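
The voting step can be sketched in software as a bitwise 2-of-3 majority function. This is a minimal model of the voter logic, not the HDL used in any of the cited designs; the function name is illustrative.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three integer words."""
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    good = 0b1010
    upset = 0b0110  # one copy corrupted in two bit positions
    print(bin(majority_vote(good, good, upset)))  # the two good copies win
```

The voter masks a fault in any one copy. If two copies are corrupted in the same bit position, the wrong value wins the vote, which is why upsets must be repaired before a second one accumulates.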



*Illustration of TMRed circuit*

(Source: M. Wirthlin, et al., SEU Mitigation and Validation of the LEON3 Soft Processor Using Triple Modular Redundancy for Space Processing)

# TMRed LEON3 Processor

- Sandia and BYU collaborated on a test to evaluate the effectiveness of TMR on modern FPGAs
- Tested the Gaisler LEON3 soft-core processor in Kintex-7



FPGA layout, unmitigated (left) and mitigated (right)

| Resource Utilization | Testing Overhead | LEON3 Core 1  | LEON3 Core 2  | Total          | Device NonTMR/TMR    |
|----------------------|------------------|---------------|---------------|----------------|----------------------|
| Slices (TMR)         | 1753<br>1960     | 1383<br>6567  | 1410<br>6767  | 4546<br>15294  | 50950<br>8.9%/30.0%  |
| Slice Reg (TMR)      | 2726<br>2726     | 1950<br>6165  | 1950<br>6165  | 6626<br>15056  | 407600<br>1.6%/3.7%  |
| LUTS (TMR)           | 3324<br>3265     | 4077<br>18046 | 4069<br>18051 | 11470<br>39362 | 203800<br>5.6%/19.3% |
| LUTRAM (TMR)         | 1<br>1           | 15<br>45      | 15<br>45      | 31<br>91       | 64000<br>.048%/.142% |
| BRAM (TMR)           | 0<br>0           | 50<br>150     | 50<br>150     | 100<br>300     | 445<br>22.5%/67.4%   |
| DSP48E1 (TMR)        | 0<br>0           | 1<br>3        | 1<br>3        | 2<br>6         | 840<br>.238%/.714%   |
| BUFG (TMR)           | 4<br>4           | 0<br>0        | 0<br>0        | 4<br>4         | 32<br>12.5%/12.5%    |

Resource penalties (table above) and space rates for GEO, solar min, 0.1" Al shielding (table below)

| Design      | Failure Rate ( $\lambda$ )<br>(failures/processor/s) | MTTF<br>(days/years) |
|-------------|------------------------------------------------------|----------------------|
| Unmitigated | 2.77E-8                                              | 501/1.4              |
| Mitigated   | 4.15E-10                                             | 27,889/76            |

(Source: M. Wirthlin, et al., *SEU Mitigation and Validation of the LEON3 Soft Processor Using Triple Modular Redundancy for Space Processing*)
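
A failure rate can be converted to a mean time to failure under the common assumption of a constant rate, so that MTTF = 1/λ. The sketch below applies that assumption; the source does not state how its MTTF column was derived, and the helper names are illustrative. With the mitigated rate it reproduces the table's 27,889 days (about 76 years). The same conversion applied to the unmitigated rate as printed gives roughly 418 days, so that row may reflect different inputs or rounding.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25


def mttf_days(failure_rate_per_s):
    """MTTF in days for a constant failure rate given in failures per second."""
    return 1.0 / failure_rate_per_s / SECONDS_PER_DAY


def mttf_years(failure_rate_per_s):
    """MTTF in years for a constant failure rate given in failures per second."""
    return mttf_days(failure_rate_per_s) / DAYS_PER_YEAR


if __name__ == "__main__":
    mitigated = 4.15e-10  # failures/processor/s, from the table
    print(f"{mttf_days(mitigated):,.0f} days, {mttf_years(mitigated):.0f} years")
```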

# Proper TMR Insertion

- Designers must be careful when implementing TMR
  - Coherency across the three copies of the circuit can be difficult to maintain
  - Ensure tools don't "optimize" your circuit by removing the triplicated copies of your circuit
- Scrubbing is essential to reap the benefits of TMR (see right)
- Also keep MCU effects in mind
  - Politecnico di Torino in Italy is developing a tool to optimize placement of FPGA resources for reliability



(Source: B. Pratt, "Analysis and Mitigation of SEU-Induced Noise in FPGA-based DSP Systems")

# In Summary...

- Most modern commercial FPGAs tested so far perform quite well in radiation, and may be appropriate for many missions
  - Space-grade and commercial FPGAs trade off between radiation reliability and device performance
  - Significant potential benefits from available commercial technologies
- The paradigm of radiation-induced upsets is changing in modern technology nodes
- Designing for space applications introduces new challenges, but all of them are manageable
- There is a potential path to flight for commercial FPGA devices that offers future missions a significant boost in capability compared to existing systems

# Thank You!

# Backup Slides

# MCU vs. LET



Increasing LET not only increases MCU prevalence, but also MCU size (upper left = lowest LET, increasing down and right)

# 7-Series Latch-up

- 7-Series devices did not show signs of classical latch-up; however, an unusual low-current, non-destructive latch-up signature was observed
- Latch-up current was limited to $\sim 130$ mA per site
- Industry partners (Harris) have taken over the investigation of latch-up sites and presented findings at NSREC 2016



*A beam run showing eight SEL sites activated*