skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Institute for Sustained Performance, Energy, and Resilience (SuPER)

Technical Report ·
DOI:https://doi.org/10.2172/1333889· OSTI ID:1333889
 [1];  [1];  [1];  [1];  [2]
  1. Univ. of Tennessee, Knoxville, TN (United States)
  2. Univ. of Texas, El Paso, TX (United States)

The University of Tennessee (UTK) and University of Texas at El Paso (UTEP) partnership supported the three main thrusts of the SUPER project---performance, energy, and resilience. The UTK-UTEP effort thus helped advance the main goal of SUPER, which was to ensure that DOE's computational scientists can successfully exploit the emerging generation of high performance computing (HPC) systems. This goal is being met by providing application scientists with strategies and tools to productively maximize performance, conserve energy, and attain resilience. The primary vehicle through which UTK provided performance measurement support to SUPER and the larger HPC community is the Performance Application Programming Interface (PAPI). PAPI is an ongoing project that provides a consistent interface and methodology for collecting hardware performance information from various hardware and software components, including most major CPUs, GPUs and accelerators, interconnects, I/O systems, and power interfaces, as well as virtual cloud environments. The PAPI software is widely used for performance modeling of scientific and engineering applications---for example, the HOMME (High Order Methods Modeling Environment) climate code, and the GAMESS and NWChem computational chemistry codes---on DOE supercomputers. PAPI is widely deployed as middleware for use by higher-level profiling, tracing, and sampling tools (e.g., CrayPat, HPCToolkit, Scalasca, Score-P, TAU, Vampir, PerfExpert), making it the de facto standard for hardware counter analysis. PAPI has established itself as fundamental software infrastructure in every application domain (spanning academia, government, and industry), where improving performance can be mission critical. Ultimately, as more application scientists migrate their applications to HPC platforms, they will benefit from the extended capabilities this grant brought to PAPI to analyze and optimize performance in these environments, whether they use PAPI directly, or via third-party performance tools. Capabilities added to PAPI through this grant include support for new architectures such as the lastest GPU and Xeon Phi accelerators, and advanced power measurement and management features. Another important topic for the UTK team was providing support for a rich ecosystem of different fault management strategies in the context of parallel computing. Our long term efforts have been oriented toward proposing flexible strategies and providing building boxes that application developers can use to build the most efficient fault management technique for their application. These efforts span across the entire software spectrum, from theoretical models of existing strategies to easily assess their performance, to algorithmic modifications to take advantage of specific mathematical properties for data redundancy and to extensions to widely used programming paradigms to empower the application developers to deal with all types of faults. We have also continued our tight collaborations with users to help them adopt these technologies to ensure their application always deliver meaningful scientific data. Large supercomputer systems are becoming more and more power and energy constrained, and future systems and applications running on them will need to be optimized to run under power caps and/or minimize energy consumption. The UTEP team contributed to the SUPER energy thrust by developing power modeling methodologies and investigating power management strategies. Scalability modeling results showed that some applications can scale better with respect to an increasing power budget than with respect to only the number of processors. Power management, in particular shifting power to processors on the critical path of an application execution, can reduce perturbation due to system noise and other sources of runtime variability, which are growing problems on large-scale power-constrained computer systems.

Research Organization:
Univ. of Tennessee, Knoxville, TN (United States)
Sponsoring Organization:
USDOE
Contributing Organization:
Univ. of Texas, El Paso, TX (United States)
DOE Contract Number:
SC0006733
OSTI ID:
1333889
Report Number(s):
DOE-UTK-UTEP-6733-1
Country of Publication:
United States
Language:
English