OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems

Abstract

Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types such as disk, SSD, CPU, memory, and network components can exhibit performance faults. We made several important observations such as faults convert from one form to another, the cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.

Authors:
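Gunawi, Haryadi S.; Suminto, Riza O.; Sears, Russell; Golliher, Casey; Sundararaman, Swaminathan; Lin, Xing; Emami, Tim; Sheng, Weiguang; Bidokhti, Nematollah; McCaffrey, Caitie; Ross, Robert B.; Grider, G; Fields, Parks M.; Harms, Kevin; Jacobson, Andree; Ricci, Robert; Webb, Kirk; Alvaro, Peter; Runesha, H. Birali; Hao, Mingzhe; Li, Huaicheng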
Publication Date:
November 2018
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
National Science Foundation (NSF)
OSTI Identifier:
1560664
DOE Contract Number:  
AC02-06CH11357
Resource Type:
Conference
Resource Relation:
Conference: 16th USENIX Conference on File and Storage Technologies, 02/12/18 - 02/15/18, Oakland, CA, US
Country of Publication:
United States
Language:
English
Subject:
Hardware fault; fail-slow; fail-stutter; jitter; limpware; performance

Citation Formats

Gunawi, Haryadi S., Suminto, Riza O., Sears, Russell, Golliher, Casey, Sundararaman, Swaminathan, Lin, Xing, Emami, Tim, Sheng, Weiguang, Bidokhti, Nematollah, McCaffrey, Caitie, Ross, Robert B., Grider, G, Fields, Parks M., Harms, Kevin, Jacobson, Andree, Ricci, Robert, Webb, Kirk, Alvaro, Peter, Runesha, H. Birali, Hao, Mingzhe, and Li, Huaicheng. Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems. United States: N. p., 2018. Web. doi:10.1145/3242086.
Gunawi, Haryadi S., Suminto, Riza O., Sears, Russell, Golliher, Casey, Sundararaman, Swaminathan, Lin, Xing, Emami, Tim, Sheng, Weiguang, Bidokhti, Nematollah, McCaffrey, Caitie, Ross, Robert B., Grider, G, Fields, Parks M., Harms, Kevin, Jacobson, Andree, Ricci, Robert, Webb, Kirk, Alvaro, Peter, Runesha, H. Birali, Hao, Mingzhe, & Li, Huaicheng. Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems. United States. doi:10.1145/3242086.
Gunawi, Haryadi S., Suminto, Riza O., Sears, Russell, Golliher, Casey, Sundararaman, Swaminathan, Lin, Xing, Emami, Tim, Sheng, Weiguang, Bidokhti, Nematollah, McCaffrey, Caitie, Ross, Robert B., Grider, G, Fields, Parks M., Harms, Kevin, Jacobson, Andree, Ricci, Robert, Webb, Kirk, Alvaro, Peter, Runesha, H. Birali, Hao, Mingzhe, and Li, Huaicheng. 2018. "Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems". United States. doi:10.1145/3242086.
@article{osti_1560664,
title = {Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems},
author = {Gunawi, Haryadi S. and Suminto, Riza O. and Sears, Russell and Golliher, Casey and Sundararaman, Swaminathan and Lin, Xing and Emami, Tim and Sheng, Weiguang and Bidokhti, Nematollah and McCaffrey, Caitie and Ross, Robert B. and Grider, G and Fields, Parks M. and Harms, Kevin and Jacobson, Andree and Ricci, Robert and Webb, Kirk and Alvaro, Peter and Runesha, H. Birali and Hao, Mingzhe and Li, Huaicheng},
abstractNote = {Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types such as disk, SSD, CPU, memory, and network components can exhibit performance faults. We made several important observations such as faults convert from one form to another, the cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.},
doi = {10.1145/3242086},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2018},
month = {11}
}

