OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: An automatic performance model-based scheduling tool for coupled climate system models

Journal Article · Journal of Parallel and Distributed Computing
 [1];  [2];  [3];  [4];  [5];  [6]
  1. Tsinghua Univ., Beijing (China). Dept. of Computer Science and Technology; Qingdao National Lab. for Marine Science and Technology (China). Lab. for Regional Oceanography and Numerical Modeling; Tsinghua Univ., Beijing (China). Ministry of Education Key Lab. for Earth System Modeling
  2. Tsinghua Univ., Beijing (China). Dept. of Computer Science and Technology; Qingdao National Lab. for Marine Science and Technology (China). Lab. for Regional Oceanography and Numerical Modeling; Tsinghua Univ., Beijing (China). Ministry of Education Key Lab. for Earth System Modeling; National Supercomputing Center, Wuxi (China); Joint Center for Global Change Studies, Beijing (China)
  3. Qingdao National Lab. for Marine Science and Technology (China). Lab. for Regional Oceanography and Numerical Modeling; State Oceanic Administration, Qingdao (China). First Inst. of Oceanography
  4. Qingdao National Lab. for Marine Science and Technology (China). Lab. for Regional Oceanography and Numerical Modeling; Tsinghua Univ., Beijing (China). Ministry of Education Key Lab. for Earth System Modeling; National Supercomputing Center, Wuxi (China); Joint Center for Global Change Studies, Beijing (China)
  5. Tsinghua Univ., Beijing (China). Ministry of Education Key Lab. for Earth System Modeling
  6. Tsinghua Univ., Beijing (China). Dept. of Computer Science and Technology

The prediction of the climate system depends heavily on the efficient integration of observations and simulations of the Earth, which is regarded as a canonical example of a cyber-physical system. The climate system model, the simulation engine in this cyber-physical system, is one of the most challenging applications in scientific computing. It uses multi-physics simulation that couples multiple components, conducts decadal to millennial simulations, and has long been an important application on supercomputers. However, current climate system models suffer from inefficient task scheduling methods, which result in intolerable simulation times. Take the Community Earth System Model (CESM), the most widely used climate system model, as an example: one major reason for its poor performance is the huge overhead of rationally distributing processes among the coupled heterogeneous components. According to a report from NCAR, every percent improvement in CESM performance frees up to the equivalent of $250,000 in computing resources in their scientific experiments. To address this challenge, our paper first constructs a lightweight and accurate performance model that effectively captures and predicts the heterogeneous time-to-solution performance of end-to-end CESM components for a given simulation configuration. Then, based on the performance model, we propose an efficient scheduling strategy, built on a rectangular packing method, that determines the best process layout among the components and the number of processes assigned to each component. Our evaluations show that we achieve an average run time reduction of 58% on CESM compared to the widely used sequential process layout at scales of 144 to 480 cores on typical CPU clusters. We can also save about 4 million CPU hours when conducting one standard scientific experiment (a 2870-year simulation), which is equivalent to saving $40,089 at a charge of $0.01 per CPU hour. In addition, our method gains an extra 26% performance improvement over the heuristic branch and bound algorithm guided by the known curve-fitting performance model.
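
The abstract does not give the form of the performance model or the details of the rectangular packing scheduler, so the following is only a toy sketch of the general idea of model-guided process allocation. It assumes a hypothetical per-component model T(n) = a/n + b (parallel work a, fixed overhead b) and uses brute-force search in place of rectangular packing. The component names and coefficients are invented for illustration and are not CESM measurements.

```python
from itertools import product


def predicted_time(a, b, n):
    """Hypothetical model: run time of one component on n processes."""
    return a / n + b


def sequential_time(components, total):
    """Sequential layout: each component runs in turn on all processes."""
    return sum(predicted_time(a, b, total) for a, b in components.values())


def best_concurrent_split(components, total, step):
    """Brute-force the split of `total` processes (in multiples of `step`)
    that minimizes the slowest component when all run side by side.
    This is an assumption-laden stand-in, not the paper's packing method."""
    names = list(components)
    best = None
    choices = range(step, total, step)
    for alloc in product(choices, repeat=len(names) - 1):
        last = total - sum(alloc)  # remaining processes go to the last component
        if last < step:
            continue
        counts = alloc + (last,)
        t = max(predicted_time(*components[name], n)
                for name, n in zip(names, counts))
        if best is None or t < best[0]:
            best = (t, dict(zip(names, counts)))
    return best


# Invented (a, b) coefficients, in seconds, for three illustrative components.
components = {"atm": (7200.0, 20.0), "ocn": (3600.0, 30.0), "ice": (1200.0, 10.0)}

t_seq = sequential_time(components, total=144)
t_con, layout = best_concurrent_split(components, total=144, step=12)
print(f"sequential: {t_seq:.1f} s, concurrent: {t_con:.1f} s, layout: {layout}")
```

Under this assumed model, a concurrent layout can beat the sequential one because the fixed overheads overlap instead of adding up. A real scheduler would fit the model to measured timings and would have to respect the coupling constraints between components.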

Research Organization:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
1559255
Journal Information:
Journal of Parallel and Distributed Computing, Vol. 132, Issue C; ISSN 0743-7315
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 3 works
Citation information provided by
Web of Science

Figures / Tables (18)