
An Efficient Checkpointing System for Large Machine Learning Model Training

Conference
Authors: [1]; [2]; [1]; [1]; [3]; [4]
  1. Nanchang Hangkong University
  2. Kobe University
  3. Pacific Northwest National Laboratory (PNNL)
  4. RIKEN R-CCS
As machine learning models grow rapidly in size and complexity, the cost of checkpointing during training has become a bottleneck in both storage and performance (time). For example, the GPT-4 model is reported to have roughly 1.76 trillion parameters; frequently writing checkpoints containing more than a trillion floating-point values to storage is extremely time- and storage-intensive. This work aims to understand and mitigate this problem. First, we characterize the checkpointing interfaces of a collection of representative large machine learning/language models with respect to storage consumption and performance overhead. Second, we propose two optimizations: i) a periodic cleaning strategy that regularly removes outdated checkpoints to reduce the storage burden; ii) a data staging optimization that coordinates checkpoints between local and shared file systems to improve performance.
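
To make the two optimizations concrete, the following is a minimal Python sketch of a checkpoint manager combining both ideas: each checkpoint is written to fast node-local storage and staged to the shared file system in the background, and only the newest few checkpoints are retained. The CheckpointManager class, the directory layout, and the keep_last retention policy are illustrative assumptions, not the paper's implementation; the sketch assumes a PyTorch-style training loop (torch.save). A production system would likely replace the background thread with asynchronous I/O or a dedicated staging daemon.

import shutil
import threading
from pathlib import Path

import torch  # assumes a PyTorch-style training loop


class CheckpointManager:
    """Illustrative sketch (not the paper's implementation).

    Periodic cleaning: only the newest `keep_last` checkpoints are kept,
    which bounds storage growth.
    Data staging: checkpoints are first written to fast node-local storage,
    then copied to the shared file system in a background thread so the
    training loop does not block on slow shared-FS writes.
    """

    def __init__(self, local_dir, shared_dir, keep_last=3):
        self.local_dir = Path(local_dir)
        self.shared_dir = Path(shared_dir)
        self.keep_last = keep_last
        self.local_dir.mkdir(parents=True, exist_ok=True)
        self.shared_dir.mkdir(parents=True, exist_ok=True)

    def save(self, state, step):
        name = f"ckpt_{step:08d}.pt"
        local_path = self.local_dir / name
        torch.save(state, local_path)  # fast write to node-local storage
        # Stage and clean off the critical path of the training loop.
        threading.Thread(
            target=self._stage_and_clean, args=(local_path, name), daemon=True
        ).start()

    def _stage_and_clean(self, local_path, name):
        shutil.copy2(local_path, self.shared_dir / name)  # stage to shared FS
        for directory in (self.local_dir, self.shared_dir):
            ckpts = sorted(directory.glob("ckpt_*.pt"))
            for old in ckpts[: -self.keep_last]:  # periodic cleaning
                old.unlink(missing_ok=True)


# Example usage inside a training loop (hypothetical paths and names):
# mgr = CheckpointManager("/local/scratch/ckpts", "/shared/project/ckpts", keep_last=2)
# mgr.save({"model": model.state_dict(), "step": step}, step)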
Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
2526231
Report Number(s):
PNNL-SA-204755
Country of Publication:
United States
Language:
English

Similar Records

Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning
Conference · 2020 · OSTI ID: 1770373

Checkpoint repair for high-performance out-of-order execution machines
Journal Article · December 1987 · IEEE Trans. Comput. (United States) · OSTI ID: 5496980

Orchestrating Fault Prediction with Live Migration and Checkpointing
Conference · June 2020 · OSTI ID: 1648858