U.S. Department of Energy
Office of Scientific and Technical Information

Reliability, availability, and serviceability for petascale high-end computing and beyond

Technical Report · DOI: https://doi.org/10.2172/1041206 · OSTI ID: 1041206
 [1]
  1. Louisiana Tech Univ., Ruston, LA (United States)

Our project is a multi-institutional research effort that draws on the interplay of reliability, availability, and serviceability (RAS) to address resilience issues in high-end scientific computing on the next generation of supercomputers. Results fall into the following tracks: failure prediction in large-scale HPC systems; investigation of reliability issues and mitigation techniques, including in GPGPU-based HPC systems; and HPC resilience runtime and tools.

Research Organization:
Louisiana Tech Univ., Ruston, LA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
FG02-08ER25836
OSTI ID:
1041206
Report Number(s):
DOEER--25836-3
Country of Publication:
United States
Language:
English

Similar Records

Exploring Process Groups for Reliability, Availability and Serviceability of Terascale Computing Systems
Conference · Sat Dec 31 23:00:00 EST 2005 · OSTI ID:989650

MOLAR: Adaptive Runtime Support for High-End Computing Operating and Runtime Systems
Journal Article · Sat Dec 31 23:00:00 EST 2005 · ACM SIGOPS Operating Systems Review · OSTI ID:978167

Towards High Availability for High-Performance Computing System Services: Accomplishments and Limitations
Conference · Sat Dec 31 23:00:00 EST 2005 · OSTI ID:931290
