U.S. Department of Energy
Office of Scientific and Technical Information

Fault Diagnosis of Hybrid Computing Systems Using Chaotic-Map Method

Book

Computing systems are becoming increasingly complex, with nodes that combine multi-core central processing units (CPUs) with many integrated core (MIC) and graphics processing unit (GPU) accelerators. These computing units and their interconnections are subject to different classes of hardware and software faults, which should be detected to support mitigation measures. We present the chaotic-map method, which uses the exponential divergence and broad Fourier spectra of chaotic-map trajectories, combined with memory allocations and assignments, to diagnose component-level faults in these hybrid computing systems. We propose lightweight codes that use highly parallel chaotic-map computations tailored to isolate faults in arithmetic units, memory elements, and interconnects. The diagnosis module on a node uses pthreads to place chaotic-map threads on CPU and MIC cores, and CUDA C and OpenCL kernels on GPU blocks. We present experimental diagnosis results on five multi-core CPUs, one MIC, and seven GPUs, with typical diagnosis run-times under a minute.
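
The role of exponential divergence described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' diagnosis code. It assumes the logistic map x -> 4x(1 - x) as the chaotic map (this record does not name the map used), and it simulates a hardware fault by flipping one mantissa bit in an intermediate value. All function names and parameters below are hypothetical. The idea shown is that two units iterating the same map from the same initial state should produce identical trajectories, while a single small error is amplified within a few dozen iterations until it is easy to detect.

```python
import struct


def logistic_step(x, a=4.0):
    """One iteration of the logistic map (assumed map; chaotic for a = 4)."""
    return a * x * (1.0 - x)


def flip_bit(x, bit):
    """Simulate a fault by flipping one bit of the IEEE 754 double x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y


def run_trajectory(x0, steps, fault_step=None, fault_bit=20):
    """Iterate the map; optionally inject a single bit flip at fault_step."""
    x = x0
    trajectory = []
    for n in range(steps):
        x = logistic_step(x)
        if n == fault_step:
            x = flip_bit(x, fault_bit)  # hypothetical injected fault
        trajectory.append(x)
    return trajectory


def first_divergence(reference, test, tol=1e-3):
    """Return the first step where the trajectories differ by more than tol."""
    for n, (r, t) in enumerate(zip(reference, test)):
        if abs(r - t) > tol:
            return n
    return None  # trajectories agree: no fault detected


if __name__ == "__main__":
    x0, steps = 0.3, 200
    reference = run_trajectory(x0, steps)
    healthy = run_trajectory(x0, steps)
    faulty = run_trajectory(x0, steps, fault_step=10)
    print("healthy unit diverges at:", first_divergence(reference, healthy))
    print("faulty unit diverges at:", first_divergence(reference, faulty))
```

In this sketch the injected error starts far below the detection tolerance and crosses it only because the map amplifies it at each iteration. A real diagnosis module would run such trajectories on the physical cores and accelerators rather than simulate the fault.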

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1649633
Country of Publication:
United States
Language:
English
