Title: SCR-Exa: Enhanced Scalable Checkpoint Restart (SCR) Library for Next Generation Exascale Computing

Technical Report
OSTI ID: 1847927

As the field of High-Performance Computing (HPC) heads towards exascale with modern processing, networking, and storage technologies, it is increasingly important to provide support for fast I/O operations and scalable checkpoint-restart for users of these systems. Fast I/O support is critical for applications that handle large-scale data and for visualizing their results. Checkpoint-restart enables users to tolerate failures in the underlying commodity components (processors, memory, interconnect, and storage) of HPC systems and to run applications on a continuous basis without productivity loss. The Scalable Checkpoint-Restart (SCR) project, funded by DOE and developed by researchers from the Lawrence Livermore National Laboratory (LLNL), has made considerable progress along these lines. Modern multi-petaflop systems and emerging exascale systems use a diverse range of storage technologies (SSDs with NVMe over Fabrics (NVMeoF) and High Bandwidth Memory (HBM)), resource managers (SLURM, LSF, Flux), job launchers (mpirun and srun), process management protocols (PMI1, PMI2, PMIx), and high-performance networking technologies (InfiniBand and Slingshot). The existing SCR library needs enhancements and hardening to achieve cross-platform portability and applicability across a diverse range of supercomputers and HPC clouds that use different resource managers and job launchers. The SCR core also needs enhancements to satisfy the needs of next-generation exascale systems and applications. Performance evaluation results from a study on the LLNL Lassen multi-petaflop system have shown that I/O write operations using the parallel file system deliver relatively poor performance as the number of nodes increases. However, I/O write operations through SCR using multilayer storage have the potential to improve performance and scalability by a factor of 50.
Thus, these enhancements are critical to fostering discovery and innovation in the HPC, Deep Learning (DL), and Machine Learning (ML) domains on exascale systems and to enabling the U.S. to maintain its leadership role in exascale computing. These developments in exascale computing technologies lead to the following broad challenge: How can the latest SCR code base be hardened and expanded with new features and capabilities so that next-generation applications can take advantage of fast and scalable checkpoint-restart and I/O acceleration on emerging exascale systems?

Research Organization: X-ScaleSolutions
Sponsoring Organization: USDOE Office of Science (SC)
DOE Contract Number: SC0021587
OSTI ID: 1847927
Type / Phase: SBIR (Phase I)
Report Number(s): X-ScaleSolutions-DOE-21587
Resource Relation: Related Information: Scalable Checkpoint-Restart (SCR)
Country of Publication: United States
Language: English