Institute for Sustained Performance, Energy, and Resilience (SuPER)

Jagode, Heike; Bosilca, George; Danalis, Anthony; Dongarra, Jack; Moore, Shirley

doi:10.2172/1333889

Technical Report · November 30, 2016

DOI: https://doi.org/10.2172/1333889 · OSTI ID: 1333889

Jagode, Heike [1]; Bosilca, George [1]; Danalis, Anthony [1]; Dongarra, Jack [1]; Moore, Shirley [2]

[1] Univ. of Tennessee, Knoxville, TN (United States)
[2] Univ. of Texas, El Paso, TX (United States)

The University of Tennessee (UTK) and University of Texas at El Paso (UTEP) partnership supported the three main thrusts of the SUPER project: performance, energy, and resilience. The UTK-UTEP effort thus helped advance the main goal of SUPER, which was to ensure that DOE's computational scientists can successfully exploit the emerging generation of high performance computing (HPC) systems. This goal is being met by providing application scientists with strategies and tools to productively maximize performance, conserve energy, and attain resilience.

The primary vehicle through which UTK provided performance measurement support to SUPER and the larger HPC community is the Performance Application Programming Interface (PAPI). PAPI is an ongoing project that provides a consistent interface and methodology for collecting hardware performance information from various hardware and software components, including most major CPUs, GPUs and accelerators, interconnects, I/O systems, and power interfaces, as well as virtual cloud environments. The PAPI software is widely used on DOE supercomputers for performance modeling of scientific and engineering applications, for example the HOMME (High Order Methods Modeling Environment) climate code and the GAMESS and NWChem computational chemistry codes. PAPI is also widely deployed as middleware for higher-level profiling, tracing, and sampling tools (e.g., CrayPat, HPCToolkit, Scalasca, Score-P, TAU, Vampir, PerfExpert), making it the de facto standard for hardware counter analysis. PAPI has established itself as fundamental software infrastructure in every application domain (spanning academia, government, and industry) where improving performance can be mission critical. Ultimately, as more application scientists migrate their applications to HPC platforms, they will benefit from the extended capabilities this grant brought to PAPI to analyze and optimize performance in these environments, whether they use PAPI directly or via third-party performance tools. Capabilities added to PAPI through this grant include support for new architectures, such as the latest GPU and Xeon Phi accelerators, and advanced power measurement and management features.

Another important topic for the UTK team was providing support for a rich ecosystem of different fault management strategies in the context of parallel computing. Our long-term efforts have been oriented toward proposing flexible strategies and providing building blocks that application developers can use to build the most efficient fault management technique for their application. These efforts span the entire software spectrum: theoretical models of existing strategies that make their performance easy to assess, algorithmic modifications that take advantage of specific mathematical properties for data redundancy, and extensions to widely used programming paradigms that empower application developers to deal with all types of faults. We have also continued our tight collaborations with users to help them adopt these technologies and ensure their applications always deliver meaningful scientific data.

Large supercomputer systems are becoming more and more power and energy constrained, and future systems and the applications running on them will need to be optimized to run under power caps and/or minimize energy consumption. The UTEP team contributed to the SUPER energy thrust by developing power modeling methodologies and investigating power management strategies. Scalability modeling results showed that some applications can scale better with respect to an increasing power budget than with respect to only the number of processors. Power management, in particular shifting power to processors on the critical path of an application execution, can reduce perturbation due to system noise and other sources of runtime variability, which are growing problems on large-scale power-constrained computer systems.

Research Organization: Univ. of Tennessee, Knoxville, TN (United States)

Sponsoring Organization: USDOE

Contributing Organization: Univ. of Texas, El Paso, TX (United States)

DOE Contract Number: SC0006733

OSTI ID: 1333889

Report Number(s): DOE-UTK-UTEP-6733-1

Country of Publication: United States

Language: English

Similar Records

Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

Journal Article · September 1, 2017 · Supercomputing frontiers and innovations · OSTI ID: 1333889

Engelmann, Christian; Hukerikar, Saurabh

Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.1)

Technical Report · December 1, 2016 · OSTI ID: 1333889

Hukerikar, Saurabh; Engelmann, Christian

A Runtime Environment for Supporting Research in Resilient HPC System Software & Tools

Conference · January 1, 2013 · OSTI ID: 1333889

Vallee, Geoffroy R; Boehm, Swen; Engelmann, Christian

Related Subjects

97 MATHEMATICS AND COMPUTING
42 ENGINEERING
Performance Analysis
PAPI
Power monitoring
Power Capping
Resilience
Performance Counters
