A Survey of Techniques for Modeling and Improving Reliability of Computing Systems
Abstract
Recent trends of aggressive technology scaling have greatly exacerbated the occurrence and impact of faults in computing systems, making reliability a first-order design constraint. To address this challenge, several techniques have been proposed. In this study, we survey architectural techniques for improving the resilience of computing systems. We focus especially on techniques proposed for microarchitectural components such as processor registers, functional units, caches, and main memory. In addition, we discuss techniques proposed for non-volatile memory, GPUs, and 3D-stacked processors. To underscore the similarities and differences among the techniques, we classify them based on their key characteristics. We also review the metrics proposed to quantify the vulnerability of processor structures. We believe that this survey will help researchers, system architects, and processor designers gain insights into techniques for improving the reliability of computing systems.
- Authors:
- Mittal, Sparsh; Vetter, Jeffrey S.
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Future Technologies Group
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Future Technologies Group; Georgia Inst. of Technology, Atlanta, GA (United States)
- Publication Date:
- 2015-04-24
- Research Org.:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1261262
- Grant/Contract Number:
- AC05-00OR22725
- Resource Type:
- Accepted Manuscript
- Journal Name:
- IEEE Transactions on Parallel and Distributed Systems
- Additional Journal Information:
- Journal Volume: 27; Journal Issue: 4; Journal ID: ISSN 1045-9219
- Publisher:
- IEEE
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; review; classification; reliability; resilience; fault-tolerance; vulnerability; architectural vulnerability factor; soft/transient error; architectural techniques; software architecture; software reliability; storage management; 3D-stacked processors; GPU; aggressive technology scaling; computing systems; non-volatile memory
Citation Formats
Mittal, Sparsh, and Vetter, Jeffrey S. A Survey of Techniques for Modeling and Improving Reliability of Computing Systems. United States: N. p., 2015. Web. doi:10.1109/TPDS.2015.2426179.
Mittal, Sparsh, & Vetter, Jeffrey S. A Survey of Techniques for Modeling and Improving Reliability of Computing Systems. United States. https://doi.org/10.1109/TPDS.2015.2426179
Mittal, Sparsh, and Vetter, Jeffrey S. 2015. "A Survey of Techniques for Modeling and Improving Reliability of Computing Systems". United States. https://doi.org/10.1109/TPDS.2015.2426179. https://www.osti.gov/servlets/purl/1261262.
@article{osti_1261262,
title = {A Survey of Techniques for Modeling and Improving Reliability of Computing Systems},
author = {Mittal, Sparsh and Vetter, Jeffrey S.},
abstractNote = {Recent trends of aggressive technology scaling have greatly exacerbated the occurrences and impact of faults in computing systems. This has made `reliability' a first-order design constraint. To address the challenges of reliability, several techniques have been proposed. In this study, we provide a survey of architectural techniques for improving resilience of computing systems. We especially focus on techniques proposed for microarchitectural components, such as processor registers, functional units, cache and main memory etc. In addition, we discuss techniques proposed for non-volatile memory, GPUs and 3D-stacked processors. To underscore the similarities and differences of the techniques, we classify them based on their key characteristics. We also review the metrics proposed to quantify vulnerability of processor structures. Finally, we believe that this survey will help researchers, system-architects and processor designers in gaining insights into the techniques for improving reliability of computing systems.},
doi = {10.1109/TPDS.2015.2426179},
journal = {IEEE Transactions on Parallel and Distributed Systems},
number = 4,
volume = 27,
place = {United States},
year = {2015},
month = apr
}
Works referencing / citing this record:
Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems
journal, October 2017
- Psychou, Georgia; Rodopoulos, Dimitrios; Sabry, Mohamed M.
- ACM Computing Surveys, Vol. 50, Issue 4
A Survey of Soft-Error Mitigation Techniques for Non-Volatile Memories
journal, February 2017
- Mittal, Sparsh
- Computers, Vol. 6, Issue 1
Evaluation by Neutron Radiation of the NMR-MPar Fault-Tolerance Approach Applied to Applications Running on a 28-nm Many-Core Processor
journal, November 2018
- Vargas, Vanessa; Ramos, Pablo; Velazco, Raoul
- Electronics, Vol. 7, Issue 11
A Survey of ReRAM-Based Architectures for Processing-In-Memory and Neural Networks
journal, April 2018
- Mittal, Sparsh
- Machine Learning and Knowledge Extraction, Vol. 1, Issue 1
A Comprehensive Technological Survey on the Dependable Self-Management CPS: From Self-Adaptive Architecture to Self-Management Strategies
journal, February 2019
- Zhou, Peng; Zuo, Decheng; Hou, Kun
- Sensors, Vol. 19, Issue 5
A Comprehensive Technological Survey on the Dependable Self-Management CPS: From Self-Adaptive Architecture to Self-Management Strategies
preprint, January 2019
- Zhou, Peng; Zuo, Decheng; Hou, Kunmean