U.S. Department of Energy
Office of Scientific and Technical Information

GPU age-aware scheduling to improve the reliability of leadership jobs on Titan

Conference

In 2015, OLCF's Titan supercomputer experienced a significant increase in GPU-related job failures. The impact on jobs was serious, and OLCF decided to replace ~50% of the GPUs. Even after the replacement, jobs using more than 20% of the machine (i.e., leadership jobs) continued to encounter higher levels of application failures, because these jobs ran on significant numbers of both low-failure-rate and high-failure-rate GPUs. Failures hurt leadership jobs more than other jobs because of their longer wait times, longer runtimes, and higher charge rates. In this work, we have designed techniques to increase the use of low-failure GPUs in leadership jobs through targeted resource allocation. We employ two complementary techniques, updating both the system ordering and the allocation mechanisms. In simulation, applying these techniques resulted in a 33% increase in low-failure GPU hours assigned to leadership jobs. Our GPU Age-Aware Scheduling has been used in production on Titan since July 2017.
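
The abstract names two levers, node ordering and allocation, without giving their details. The following is only a minimal sketch of how such a pairing could work, not the method used on Titan. The `Node` type, the `low_failure_gpu` flag, and the functions `order_nodes` and `allocate` are all hypothetical names introduced for illustration; only the 20% leadership threshold comes from the abstract.

```python
from dataclasses import dataclass

# From the abstract: leadership jobs use more than 20% of the machine.
LEADERSHIP_FRACTION = 0.20


@dataclass(frozen=True)
class Node:
    node_id: int
    low_failure_gpu: bool  # hypothetical flag marking the lower-failure-rate GPUs


def order_nodes(nodes):
    """Ordering step (assumed): low-failure nodes first, then by node id."""
    return sorted(nodes, key=lambda n: (not n.low_failure_gpu, n.node_id))


def allocate(free_nodes, requested, machine_size):
    """Allocation step (assumed): pick nodes for a job, or None if too few are free.

    Leadership jobs draw from the low-failure end of the ordered list;
    all other jobs draw from the opposite end, leaving low-failure nodes free.
    """
    if requested > len(free_nodes):
        return None
    ordered = order_nodes(free_nodes)
    if requested > LEADERSHIP_FRACTION * machine_size:
        return ordered[:requested]
    return ordered[len(ordered) - requested:]
```

In a toy machine of 10 nodes where the even-numbered nodes carry low-failure GPUs, `allocate(nodes, 3, 10)` counts as a leadership request and receives only low-failure nodes, while `allocate(nodes, 2, 10)` is served from the high-failure end. A real scheduler would also have to account for interconnect topology and fragmentation, which this sketch ignores.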

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1489583
Country of Publication:
United States
Language:
English

Similar Records

Learning from Five-year Resource-Utilization Data of Titan System
Conference · September 1, 2019 · OSTI ID:1606979

Learning from Five-year Resource-Utilization Data of Titan System
Conference · September 1, 2019 · OSTI ID:1648993

MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems
Conference · May 1, 2020 · OSTI ID:1649080
