Reliability, availability, and serviceability for petascale high-end computing and beyond
- Louisiana Tech Univ., Ruston, LA (United States)
Our project is a multi-institutional research effort that draws on the interplay of reliability, availability, and serviceability (RAS) to address resilience issues in high-end scientific computing on the next generation of supercomputers. Results fall into the following tracks: failure prediction in large-scale HPC systems; reliability issues and mitigation techniques, including in GPGPU-based HPC systems; and HPC resilience runtime and tools.
- Research Organization:
- Louisiana Tech Univ., Ruston, LA (United States)
- Sponsoring Organization:
- USDOE
- DOE Contract Number:
- FG02-08ER25836
- OSTI ID:
- 1041206
- Report Number(s):
- DOEER--25836-3
- Country of Publication:
- United States
- Language:
- English
Similar Records
Exploring Process Groups for Reliability, Availability and Serviceability of Terascale Computing Systems
Conference · Sat Dec 31 23:00:00 EST 2005 · OSTI ID:989650
MOLAR: Adaptive Runtime Support for High-End Computing Operating and Runtime Systems
Journal Article · Sat Dec 31 23:00:00 EST 2005 · ACM SIGOPS Operating Systems Review · OSTI ID:978167
Towards High Availability for High-Performance Computing System Services: Accomplishments and Limitations
Conference · Sat Dec 31 23:00:00 EST 2005 · OSTI ID:931290