SCR-Exa: Enhanced Scalable Checkpoint Restart (SCR) Library for Next Generation Exascale Computing

Dai, Donglai

Technical Report · Mon Feb 21 00:00:00 EST 2022

OSTI ID: 1847927

Dai, Donglai (X-ScaleSolutions)

As the field of High-Performance Computing (HPC) heads towards exascale with modern processing, networking, and storage technologies, it is increasingly important to support fast I/O operations and scalable checkpoint-restart for users of these systems. Fast I/O support is critical for applications that handle large-scale data and for visualizing their results. Checkpoint-restart lets users tolerate failures in the underlying commodity components of HPC systems (processors, memory, interconnect, and storage) and run applications continuously without losing productivity. The Scalable Checkpoint-Restart (SCR) project, funded by DOE and developed by researchers at Lawrence Livermore National Laboratory (LLNL), has made considerable progress along these lines. Modern multi-petaflop systems and emerging exascale systems use a diverse range of storage technologies (SSDs with NVMe over Fabrics (NVMeoF) and High Bandwidth Memory (HBM)), resource managers (SLURM, LSF, Flux), job launchers (mpirun and srun), process management protocols (PMI1, PMI2, and PMIx), and high-performance networking technologies (InfiniBand and Slingshot). The existing SCR library needs enhancement and hardening to be portable and applicable across a diverse range of supercomputers and HPC clouds that use different resource managers and job launchers. The SCR core also needs enhancements to meet the needs of next-generation exascale systems and applications. Performance evaluation results from a study on the LLNL Lassen multi-petaflop system show that I/O write operations using the parallel file system deliver relatively poor performance as the number of nodes increases, whereas I/O write operations through SCR using multilayer storage have the potential to improve performance and scalability by a factor of 50. These enhancements are therefore critical to fostering discovery and innovation in the HPC, Deep Learning (DL), and Machine Learning (ML) domains on exascale systems and to enabling the U.S. to maintain its leadership role in exascale computing. These developments in exascale computing technologies lead to the following broad challenge: How can the latest SCR code base be hardened and expanded with new features and capabilities so that next-generation applications can take advantage of fast and scalable checkpoint-restart and I/O acceleration on emerging exascale systems?

This content will become available on Thu Aug 22 00:00:00 EDT 2041.

Research Organization: X-ScaleSolutions

Sponsoring Organization: USDOE Office of Science (SC)

DOE Contract Number: SC0021587

OSTI ID: 1847927

Type / Phase: SBIR (Phase I)

Report Number(s): X-ScaleSolutions-DOE-21587

Resource Relation: Related Information: Scalable Checkpoint-Restart (SCR)

Country of Publication: United States

Language: English

Similar Records

Scalable I/O Systems via Node-Local Storage: Approaching 1 TB/sec File I/O

Technical Report · Tue Aug 18 00:00:00 EDT 2009

Bronevetsky, G; Moody, A

Combining Partial Redundancy and Checkpointing for HPC

Conference · Sun Jan 01 00:00:00 EST 2012

Elliott, James; Kharbas, Kishor H; Fiala, David J; +3 more

Asynchronous Checkpoint Migration with MRNet in the Scalable Checkpoint / Restart Library

Conference · Tue Mar 20 00:00:00 EDT 2012

Mohror, K; Moody, A; de Supinski, B R

Related Subjects

97 MATHEMATICS AND COMPUTING
HPC, Deep Learning, Machine Learning, MPI, Checkpoint, I/O, and Middleware
