OSTI.GOV
U.S. Department of Energy, Office of Scientific and Technical Information

Title: Understanding and Avoiding Performance Variability in High Performance Networks.


Abstract not provided.

Grant, Ryan; Groves, Taylor; Pedretti, Kevin; Gentile, Ann C.; Arnold, Dorian
Publication Date:
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
Report Number(s):
DOE Contract Number:
Resource Type:
Resource Relation:
Conference: Proposed for presentation at the SIAM Conference on Computational Science and Engineering held February 27-March 3, 2017 in Atlanta, GA.
Country of Publication:
United States

Citation Formats

Grant, Ryan, Groves, Taylor, Pedretti, Kevin, Gentile, Ann C., and Arnold, Dorian. Understanding and Avoiding Performance Variability in High Performance Networks. United States: N. p., 2017. Web.
Grant, Ryan, Groves, Taylor, Pedretti, Kevin, Gentile, Ann C., & Arnold, Dorian. Understanding and Avoiding Performance Variability in High Performance Networks. United States.
Grant, Ryan, Groves, Taylor, Pedretti, Kevin, Gentile, Ann C., and Arnold, Dorian. 2017. "Understanding and Avoiding Performance Variability in High Performance Networks." United States.
title = {Understanding and Avoiding Performance Variability in High Performance Networks.},
author = {Grant, Ryan and Groves, Taylor and Pedretti, Kevin and Gentile, Ann C. and Arnold, Dorian},
abstractNote = {Abstract not provided.},
doi = {},
journal = {},
number = ,
volume = ,
place = {United States},
year = {2017},
month = {Feb}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

  • Fault management in high performance cluster networks has been focused on the notion of hard faults (i.e., link or node failures). Network degradations that negatively impact performance but do not result in failures often go unnoticed. In this paper, we classify such degradations as soft faults. In addition, we identify consistent performance as an important requirement in cluster networks. Using this service requirement, we describe a comprehensive strategy for cluster fault management.
  • Combustion instabilities in dilute internal combustion engines are manifest in cyclic variability (CV) in engine performance measures such as integrated heat release or shaft work. Understanding the factors leading to CV is important in model-based control, especially with high dilution where experimental studies have demonstrated that deterministic effects can become more prominent. Observation of enough consecutive engine cycles for significant statistical analysis is standard in experimental studies but is largely wanting in numerical simulations because of the computational time required to compute hundreds or thousands of consecutive cycles. We have proposed and begun implementation of an alternative approach to allow rapid simulation of long series of engine dynamics based on a low-dimensional mapping of ensembles of single-cycle simulations which map input parameters to output engine performance. This paper details the use of Titan at the Oak Ridge Leadership Computing Facility to investigate CV in a gasoline direct-injected spark-ignited engine with a moderately high rate of dilution achieved through external exhaust gas recirculation. The CONVERGE CFD software was used to perform single-cycle simulations with imposed variations of operating parameters and boundary conditions selected according to a sparse grid sampling of the parameter space. Using an uncertainty quantification technique, the sampling scheme is chosen similar to a design of experiments grid but uses functions designed to minimize the number of samples required to achieve a desired degree of accuracy. The simulations map input parameters to output metrics of engine performance for a single cycle, and by mapping over a large parameter space, results can be interpolated from within that space. This interpolation scheme forms the basis for a low-dimensional metamodel which can be used to mimic the dynamical behavior of corresponding high-dimensional simulations.
Simulations of high-EGR spark-ignition combustion cycles within a parametric sampling grid were performed and analyzed statistically, and sensitivities of the physical factors leading to high CV are presented. With these results, the prospect of producing low-dimensional metamodels to describe engine dynamics at any point in the parameter space will be discussed. Additionally, modifications to the methodology to account for nondeterministic effects in the numerical solution environment are proposed.
  • Using multiple independent networks (also known as rails) is an emerging technique to overcome bandwidth limitations and enhance fault-tolerance of current high-performance clusters. We present and analyze various venues for exploiting multiple rails. Different rail access policies are presented and compared, including static and dynamic allocation schemes. An analytical lower bound on the number of networks required for static rail allocation is shown. We also present an extensive experimental comparison of the behavior of various allocation schemes in terms of bandwidth and latency. Striping messages over multiple rails can substantially reduce network latency, depending on average message size, network load and allocation scheme. The methods compared include a static rail allocation, a round-robin rail allocation, a dynamic allocation based on local knowledge, and a rail allocation that reserves both end-points of a message before sending it. The latter is shown to perform better than other methods at higher loads: up to 49% better than local-knowledge allocation and 37% better than the round-robin allocation. This allocation scheme also shows lower latency and it saturates on higher loads (for messages large enough). Most importantly, this proposed allocation scheme scales well with the number of rails and message sizes.
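
As a rough illustration of three of the rail allocation policies named in the last abstract above (static, round-robin, and dynamic based on local knowledge), here is a minimal Python sketch. It is not taken from the cited work: the class names, the `pick(dest, size)` interface, the modulo mapping used for the static policy, and the "fewest locally queued bytes" rule are all assumptions chosen for illustration. The end-point reservation scheme is omitted because the abstract does not describe its protocol.

```python
class StaticAllocator:
    """Static policy (assumed form): each destination is bound to one fixed rail."""

    def __init__(self, num_rails):
        self.num_rails = num_rails

    def pick(self, dest, size):
        # The rail depends only on the destination id, never on load.
        return dest % self.num_rails


class RoundRobinAllocator:
    """Round-robin policy: successive messages cycle through the rails."""

    def __init__(self, num_rails):
        self.num_rails = num_rails
        self._next = 0

    def pick(self, dest, size):
        rail = self._next
        self._next = (self._next + 1) % self.num_rails
        return rail


class LocalKnowledgeAllocator:
    """Dynamic policy (assumed form): choose the rail with the fewest bytes
    queued at the sender. Only local state is consulted."""

    def __init__(self, num_rails):
        self.pending = [0] * num_rails

    def pick(self, dest, size):
        # Ties go to the lowest-numbered rail.
        rail = min(range(len(self.pending)), key=self.pending.__getitem__)
        self.pending[rail] += size
        return rail

    def complete(self, rail, size):
        # Call when a message has left the local send queue.
        self.pending[rail] -= size
```

In this sketch, a sender would call `pick` once per message and, for the local-knowledge policy, call `complete` when the send finishes so that the per-rail byte counts stay current.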