U.S. Department of Energy
Office of Scientific and Technical Information

Evaluation of pre-training large language models on leadership-class supercomputers

Journal Article · Journal of Supercomputing
Large language models (LLMs) have rapidly risen to the center stage of artificial intelligence as foundation models applicable to many downstream learning tasks. However, how to effectively build, train, and serve such models for high-stakes, first-principles-based scientific use cases is both of great interest and a great challenge. Moreover, pre-training LLMs with billions or even trillions of parameters can be prohibitively expensive not just for academic institutions but also for well-funded industrial and government labs. Furthermore, the energy cost and environmental impact of developing LLMs must be kept in mind. In this work, we conduct a first-of-its-kind performance analysis to understand the time and energy cost of pre-training LLMs on the Department of Energy (DOE)'s leadership-class supercomputers. Employing state-of-the-art distributed training techniques, we evaluate the computational performance of various parallelization approaches at scale for a range of model sizes, and we establish a projection model for the cost of full training. Our findings provide baseline results, best practices, and heuristics for pre-training such large models that should be valuable to the HPC community at large. We also offer insights and optimization strategies for using the first exascale computing system, Frontier, to train models of the size of GPT-3 and beyond.
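
The kind of full-training cost projection mentioned in the abstract can be illustrated with a minimal back-of-the-envelope sketch. This is not the paper's projection model; it assumes the widely used estimate of roughly 6 x parameters x tokens total training FLOPs for a dense transformer, and every constant below (peak throughput, utilization, power draw, token count) is an illustrative assumption, not a value from the article.

    """
    Minimal sketch of a training-cost projection, assuming
    total FLOPs ~= 6 * parameters * tokens. All constants are
    placeholder assumptions, not results from the paper.
    """

    def projected_training_cost(
        n_params: float,            # model size in parameters
        n_tokens: float,            # number of training tokens
        n_gpus: int,                # GPUs (or GCDs) used
        peak_flops_per_gpu: float,  # peak FP16/BF16 FLOP/s per GPU (assumed)
        utilization: float,         # achieved fraction of peak (assumed)
        watts_per_gpu: float,       # average power draw per GPU (assumed)
    ):
        """Return (training days, energy in MWh) for a dense transformer."""
        total_flops = 6.0 * n_params * n_tokens
        sustained_flops = n_gpus * peak_flops_per_gpu * utilization
        seconds = total_flops / sustained_flops
        days = seconds / 86_400
        energy_mwh = n_gpus * watts_per_gpu * seconds / 3.6e9  # W*s -> MWh
        return days, energy_mwh

    if __name__ == "__main__":
        # Hypothetical GPT-3-scale run: 175B parameters, 300B tokens,
        # 1024 GPUs at 30% of an assumed 190 TFLOP/s peak, 500 W each.
        days, mwh = projected_training_cost(
            n_params=175e9, n_tokens=300e9, n_gpus=1024,
            peak_flops_per_gpu=190e12, utilization=0.30, watts_per_gpu=500.0,
        )
        print(f"~{days:.0f} days, ~{mwh:.0f} MWh")

With these illustrative inputs the sketch yields on the order of two months of training time and several hundred MWh of GPU energy; the article's actual projections depend on the measured per-configuration throughput at scale.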
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE Office of Science (SC), Basic Energy Sciences (BES), Scientific User Facilities (SUF)
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
1994640
Journal Information:
Journal of Supercomputing, Vol. 79, Issue 18; ISSN 0920-8542
Publisher:
Springer
Country of Publication:
United States
Language:
English

Similar Records

Comparative Study of Large Language Model Architectures on Frontier
Conference · May 2024 · OSTI ID: 2406796

Optimizing Distributed Training on Frontier for Large Language Models
Conference · May 2024 · OSTI ID: 2438819

Revealing power, energy and thermal dynamics of a 200PF pre-exascale supercomputer
Conference · November 2021 · OSTI ID: 1833956