RELIABILITY, AVAILABILITY, AND SERVICEABILITY FOR PETASCALE HIGH-END COMPUTING AND BEYOND
Our project is a multi-institutional research effort that exploits the interplay of reliability, availability, and serviceability (RAS) to address resilience issues in high-end scientific computing on the next generation of supercomputers. Results fall into the following tracks: failure prediction in large-scale HPC systems; investigation of reliability issues and mitigation techniques, including in GPGPU-based HPC systems; and HPC resilience runtime and tools.
- Research Organization: Louisiana Tech University
- Sponsoring Organization: USDOE
- DOE Contract Number: FG02-08ER25836
- OSTI ID: 1041206
- Report Number(s): DOEER25836-3
- Country of Publication: United States
- Language: English
Similar Records
MOLAR: Adaptive Runtime Support for High-End Computing Operating and Runtime Systems
Journal Article · Jan 1, 2006 · ACM SIGOPS Operating Systems Review · OSTI ID:1041206
Exploring Process Groups for Reliability, Availability and Serviceability of Terascale Computing Systems
Conference · Jan 1, 2006 · OSTI ID:1041206
PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems
Conference · Dec 1, 2020 · OSTI ID:1041206