An Efficient Checkpointing System for Large Machine Learning Model Training
- Nanchang Hangkong University
- Kobe University
- BATTELLE (PACIFIC NW LAB)
- RIKEN R-CCS
As machine learning models grow rapidly in size and complexity, the cost of checkpointing during training has become a bottleneck in both storage and time. For example, the GPT-4 model is reported to have roughly 1.76 trillion parameters, and frequently writing checkpoints containing more than a trillion floating-point values to storage is highly time- and storage-consuming. This work aims to understand and mitigate this problem. First, we characterize the checkpointing interfaces of a collection of representative large machine learning/language models with respect to storage consumption and performance overhead. Second, we propose two optimizations: i) a periodic cleaning strategy that removes outdated checkpoints to reduce the storage burden; ii) a data staging optimization that coordinates checkpoints between local and shared file systems to improve performance.
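The periodic cleaning strategy can be illustrated with a minimal sketch: after each checkpoint write, delete all but the most recent few checkpoints. The `ckpt-<step>.pt` naming scheme, the keep-last-k retention policy, and the `clean_outdated_checkpoints` name below are illustrative assumptions, not the system's actual implementation.

```python
import re
from pathlib import Path

def clean_outdated_checkpoints(ckpt_dir: str, keep_last: int = 2) -> None:
    """Delete all but the `keep_last` most recent checkpoints in `ckpt_dir`.

    Assumes files are named ckpt-<step>.pt; both the naming scheme and
    the keep-last-k retention policy are illustrative assumptions.
    """
    pattern = re.compile(r"ckpt-(\d+)\.pt")
    ckpts = []
    for p in Path(ckpt_dir).iterdir():
        m = pattern.fullmatch(p.name)
        if m:
            ckpts.append((int(m.group(1)), p))
    # Sort by training step; everything except the newest
    # `keep_last` entries is considered outdated.
    ckpts.sort()
    for _, path in ckpts[:-keep_last]:
        path.unlink()
```

Calling this right after each successful checkpoint write makes the cleanup periodic at the checkpointing frequency while bounding storage to roughly `keep_last` checkpoint sizes.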
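The data staging optimization can be sketched similarly: block the training loop only for a fast write to node-local storage, then drain the checkpoint to the shared file system in the background. The `staged_save` helper, the placeholder paths, and the use of PyTorch's `torch.save` are assumptions for illustration, not the paper's implementation.

```python
import shutil
import threading
from pathlib import Path

import torch  # assuming a PyTorch training loop

def staged_save(state: dict, step: int,
                local_dir: str = "/local/scratch",   # fast node-local disk (placeholder)
                shared_dir: str = "/shared/ckpts"    # shared parallel FS (placeholder)
                ) -> threading.Thread:
    """Stage a checkpoint: block only for the fast node-local write,
    then copy it to the shared file system in a background thread."""
    name = f"ckpt-{step}.pt"
    local_path = Path(local_dir) / name
    torch.save(state, local_path)  # fast, synchronous local write

    def drain() -> None:
        # Slow shared-file-system copy, moved off the critical path.
        shutil.copy2(local_path, Path(shared_dir) / name)

    t = threading.Thread(target=drain, daemon=True)
    t.start()
    return t  # caller can join() before exit to guarantee the shared copy
```

The design point is that the slow copy to the shared file system overlaps with subsequent training iterations, so the training loop pays only the node-local write latency; the returned thread should be joined (e.g., at job exit) before the local copy is trusted to be durable on shared storage.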
- Research Organization: Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-76RL01830
- OSTI ID: 2526231
- Report Number(s): PNNL-SA-204755
- Country of Publication: United States
- Language: English
Similar Records
- Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning · Conference · 2019 · OSTI ID: 1770373
- Checkpoint repair for high-performance out-of-order execution machines · Journal Article · 1987 · IEEE Trans. Comput. (United States) · OSTI ID: 5496980
- Orchestrating Fault Prediction with Live Migration and Checkpointing · Conference · 2020 · OSTI ID: 1648858