Title: Analyzing the Interplay of Failures and Workload on a Leadership-Class Supercomputer

Abstract

The unprecedented computational power of current supercomputers now makes possible the exploration of complex problems in many scientific fields, from genomic analysis to computational fluid dynamics. Modern machines are powerful because they are massive: they assemble millions of cores and a huge quantity of disks, cards, routers, and other components. But it is precisely the size of these machines that glooms the future of supercomputing. A system that comprises many components has a high chance to fail, and fail often. In order to make the next generation of supercomputers usable, it is imperative to use some type of fault tolerance platform to run applications on large machines. Most fault tolerance strategies can be optimized for the peculiarities of each system and boost efficacy by keeping the system productive. In this paper, we aim to understand how failure characterization can improve resilience in several layers of the software stack: applications, runtime systems, and job schedulers. We examine the Titan supercomputer, one of the fastest systems in the world. We analyze a full year of Titan in production and distill the failure patterns of the machine. By looking into Titan's log files and using the criteria of experts, we provide a detailed description of the types of failures. In addition, we inspect the job submission files and describe how the system is used. Using those two sources, we cross correlate failures in the machine to executing jobs and provide a picture of how failures affect the user experience. We believe such characterization is fundamental in developing appropriate fault tolerance solutions for Cray systems similar to Titan.

Authors:
Meneses, Esteban [1]; Ni, Xiang [2]; Jones, Terry R [3]; Maxwell, Don E [3]
  1. University of Pittsburgh
  2. University of Illinois at Urbana-Champaign
  3. ORNL
Publication Date:
2015
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1265485
DOE Contract Number:
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: CUG 2015, Chicago, IL, USA, April 27-30, 2015
Country of Publication:
United States
Language:
English
Subject:
Failures; Cray; Fault Tolerance; Resilience.

Citation Formats

Meneses, Esteban, Ni, Xiang, Jones, Terry R, and Maxwell, Don E. Analyzing the Interplay of Failures and Workload on a Leadership-Class Supercomputer. United States: N. p., 2015. Web.
Meneses, Esteban, Ni, Xiang, Jones, Terry R, & Maxwell, Don E. Analyzing the Interplay of Failures and Workload on a Leadership-Class Supercomputer. United States.
Meneses, Esteban, Ni, Xiang, Jones, Terry R, and Maxwell, Don E. 2015. "Analyzing the Interplay of Failures and Workload on a Leadership-Class Supercomputer". United States.
@inproceedings{osti_1265485,
title = {Analyzing the Interplay of Failures and Workload on a Leadership-Class Supercomputer},
author = {Meneses, Esteban and Ni, Xiang and Jones, Terry R and Maxwell, Don E},
abstractNote = {The unprecedented computational power of current supercomputers now makes possible the exploration of complex problems in many scientific fields, from genomic analysis to computational fluid dynamics. Modern machines are powerful because they are massive: they assemble millions of cores and a huge quantity of disks, cards, routers, and other components. But it is precisely the size of these machines that glooms the future of supercomputing. A system that comprises many components has a high chance to fail, and fail often. In order to make the next generation of supercomputers usable, it is imperative to use some type of fault tolerance platform to run applications on large machines. Most fault tolerance strategies can be optimized for the peculiarities of each system and boost efficacy by keeping the system productive. In this paper, we aim to understand how failure characterization can improve resilience in several layers of the software stack: applications, runtime systems, and job schedulers. We examine the Titan supercomputer, one of the fastest systems in the world. We analyze a full year of Titan in production and distill the failure patterns of the machine. By looking into Titan's log files and using the criteria of experts, we provide a detailed description of the types of failures. In addition, we inspect the job submission files and describe how the system is used. Using those two sources, we cross correlate failures in the machine to executing jobs and provide a picture of how failures affect the user experience. We believe such characterization is fundamental in developing appropriate fault tolerance solutions for Cray systems similar to Titan.},
booktitle = {CUG 2015},
place = {United States},
year = {2015},
month = {1}
}
