A Survey of Techniques for Modeling and Improving Reliability of Computing Systems
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Future Technologies Group
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Future Technologies Group; Georgia Inst. of Technology, Atlanta, GA (United States)
Aggressive technology scaling has greatly increased the occurrence and impact of faults in computing systems, making reliability a first-order design constraint. Several techniques have been proposed to address this challenge. In this study, we survey architectural techniques for improving the resilience of computing systems. We focus especially on techniques proposed for microarchitectural components such as processor registers, functional units, caches, and main memory. In addition, we discuss techniques proposed for non-volatile memory, GPUs, and 3D-stacked processors. To underscore the similarities and differences among the techniques, we classify them based on their key characteristics. We also review the metrics proposed to quantify the vulnerability of processor structures. We believe that this survey will help researchers, system architects, and processor designers gain insights into techniques for improving the reliability of computing systems.
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE
- Grant/Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1261262
- Journal Information:
- IEEE Transactions on Parallel and Distributed Systems, Vol. 27, Issue 4; ISSN 1045-9219
- Publisher:
- IEEE
- Country of Publication:
- United States
- Language:
- English
Similar Records
Blackcomb: Hardware-Software Co-design for Non-Volatile Memory in Exascale Systems
Leaky Buddies: Cross-Component Covert Channels on Integrated CPU-GPU Systems
Related Subjects
review
classification
reliability
resilience
fault-tolerance
vulnerability
architectural vulnerability factor
soft/transient error
architectural techniques
software architecture
software reliability
storage management
3D-stacked processors
GPU
aggressive technology scaling
computing systems
non-volatile memory