U.S. Department of Energy
Office of Scientific and Technical Information

Fault Diagnosis of Hybrid Computing Systems Using Chaotic-Map Method

Book
Computing systems are becoming increasingly complex, with nodes that combine multi-core central processing units (CPUs) with Many Integrated Core (MIC) and graphics processing unit (GPU) accelerators. These computing units and their interconnections are subject to different classes of hardware and software faults, which should be detected to support mitigation measures. We present the chaotic-map method, which uses the exponential divergence and wide Fourier properties of chaotic-map trajectories, combined with memory allocations and assignments, to diagnose component-level faults in these hybrid computing systems. We propose lightweight codes that use highly parallel chaotic-map computations tailored to isolate faults in arithmetic units, memory elements, and interconnects. The diagnosis module on a node uses pthreads to place chaotic-map threads on CPU and MIC cores, and CUDA C and OpenCL kernels on GPU blocks. We present experimental diagnosis results on five multi-core CPUs, one MIC, and seven GPUs, with typical diagnosis run-times under a minute.
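The exponential divergence that the abstract relies on can be illustrated with a small sketch. It is not the authors' diagnosis code. It assumes the logistic map as the chaotic map, models a fault as a single flipped mantissa bit, and detects it by comparison against a fault-free reference trajectory. The names `logistic_step`, `flip_bit`, `trajectory`, and `first_divergence`, along with the tolerance and bit position, are hypothetical choices made for illustration.

```python
import struct


def logistic_step(x, r=4.0):
    """One iteration of the logistic map (assumed here as the chaotic map)."""
    return r * x * (1.0 - x)


def flip_bit(x, bit):
    """Return x with one bit of its IEEE-754 double representation flipped."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y


def trajectory(x0, steps, fault_at=None, fault_bit=20):
    """Iterate the map; optionally inject a single bit flip at step fault_at."""
    x = x0
    out = []
    for i in range(steps):
        x = logistic_step(x)
        if i == fault_at:
            x = flip_bit(x, fault_bit)  # simulated transient arithmetic fault
        out.append(x)
    return out


def first_divergence(reference, observed, tol=1e-3):
    """Index of the first step where the trajectories differ by more than tol."""
    for i, (a, b) in enumerate(zip(reference, observed)):
        if abs(a - b) > tol:
            return i
    return None


if __name__ == "__main__":
    ref = trajectory(0.3, 200)
    bad = trajectory(0.3, 200, fault_at=20)
    print("fault injected at step 20, visible at step", first_divergence(ref, bad))
```

Because nearby trajectories of a chaotic map separate roughly exponentially, an error far below the tolerance at the moment of injection grows until a coarse comparison can see it. That is the property that lets a short computation expose a small fault.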
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE; USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1649633
Country of Publication:
United States
Language:
English


