skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Adapting wave-front algorithms to efficiently utilize systems with deep communication hierarchies

Abstract

Large-scale systems increasingly exhibit a differential between intra-chip and inter-chip communication performance. Processor-cores on the same socket are able to communicate at lower latencies, and with higher bandwidths, than cores on different sockets either within the same node or between nodes. A key challenge is to efficiently use this communication hierarchy and hence optimize performance. We consider here the class of applications that contain wave-front processing. In these applications data can only be processed after their upstream neighbors have been processed. Similar dependencies result between processors in which communication is required to pass boundary data downstream and whose cost is typically impacted by the slowest communication channel in use. In this work we develop a novel hierarchical wave-front approach that reduces the use of slower communications in the hierarchy but at the cost of additional computation and higher use of on-chip communications. This tradeoff is explored using a performance model and an implementation on the Petascale Roadrunner system demonstrates a 27% performance improvement at full system-scale on a kernel application. The approach is generally applicable to large-scale multi-core and accelerated systems where a differential in system communication performance exists.

Authors:
 [1];  [1];  [1]
  1. Los Alamos National Laboratory
Publication Date:
Research Org.:
Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
971329
Report Number(s):
LA-UR-09-06298; LA-UR-09-6298
TRN: US201004%%91
DOE Contract Number:
AC52-06NA25396
Resource Type:
Conference
Resource Relation:
Conference: IEEE International Parallel & Distributed Processing Symposium ; April 19, 2010 ; Atlanta, GA
Country of Publication:
United States
Language:
English
Subject:
97; ALGORITHMS; COMMUNICATIONS; IMPLEMENTATION; KERNELS; PERFORMANCE; PROCESSING

Citation Formats

Kerbyson, Darren J, Lang, Michael, and Pakin, Scott. Adapting wave-front algorithms to efficiently utilize systems with deep communication hierarchies. United States: N. p., 2009. Web.
Kerbyson, Darren J, Lang, Michael, & Pakin, Scott. Adapting wave-front algorithms to efficiently utilize systems with deep communication hierarchies. United States.
Kerbyson, Darren J, Lang, Michael, and Pakin, Scott. Thu . "Adapting wave-front algorithms to efficiently utilize systems with deep communication hierarchies". United States. doi:. https://www.osti.gov/servlets/purl/971329.
@article{osti_971329,
title = {Adapting wave-front algorithms to efficiently utilize systems with deep communication hierarchies},
author = {Kerbyson, Darren J and Lang, Michael and Pakin, Scott},
abstractNote = {Large-scale systems increasingly exhibit a differential between intra-chip and inter-chip communication performance. Processor-cores on the same socket are able to communicate at lower latencies, and with higher bandwidths, than cores on different sockets either within the same node or between nodes. A key challenge is to efficiently use this communication hierarchy and hence optimize performance. We consider here the class of applications that contain wave-front processing. In these applications data can only be processed after their upstream neighbors have been processed. Similar dependencies result between processors in which communication is required to pass boundary data downstream and whose cost is typically impacted by the slowest communication channel in use. In this work we develop a novel hierarchical wave-front approach that reduces the use of slower communications in the hierarchy but at the cost of additional computation and higher use of on-chip communications. This tradeoff is explored using a performance model and an implementation on the Petascale Roadrunner system demonstrates a 27% performance improvement at full system-scale on a kernel application. The approach is generally applicable to large-scale multi-core and accelerated systems where a differential in system communication performance exists.},
doi = {},
journal = {},
number = ,
volume = ,
place = {United States},
year = {Thu Jan 01 00:00:00 EST 2009},
month = {Thu Jan 01 00:00:00 EST 2009}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

Save / Share:
  • Large-scale systems increasingly exhibit a differential between intra-chip and inter-chip communication performance especially in hybrid systems using accelerators. Processorcores on the same socket are able to communicate at lower latencies, and with higher bandwidths, than cores on different sockets either within the same node or between nodes. A key challenge is to efficiently use this communication hierarchy and hence optimize performance. We consider here the class of applications that contains wavefront processing. In these applications data can only be processed after their upstream neighbors have been processed. Similar dependencies result between processors in which communication is required to pass boundarymore » data downstream and whose cost is typically impacted by the slowest communication channel in use. In this work we develop a novel hierarchical wave-front approach that reduces the use of slower communications in the hierarchy but at the cost of additional steps in the parallel computation and higher use of on-chip communications. This tradeoff is explored using a performance model. An implementation using the Reverse-acceleration programming model on the petascale Roadrunner system demonstrates a 27% performance improvement at full system-scale on a kernel application. The approach is generally applicable to large-scale multi-core and accelerated systems where a differential in system communication performance exists.« less
  • Abstract not provided.
  • Abstraction, in search, problem solving, and planning, works by replacing one state space by another (the {open_quotes}abstract{close_quotes} space) that is easier to search. The results of the search in the abstract space are used to guide search in the original space. For instance, the length of the abstract solution can be used as a heuristic for A* in searching in the original space. However, there are two obstacles to making this work efficiently. The first is a theorem stating that for a large class of abstractions, {open_quotes}embedding abstractions,{close_quotes} every state expanded by blind search must also be expanded by A*more » when its heuristic is computed in this way. The second obstacle arises because in solving a problem A* needs repeatedly to do a full search of the abstract space while computing its heuristic. This paper introduces a new abstraction-induced search technique, {open_quotes}Hierarchical A*,{close_quotes} that gets around both of these difficulties: first, by drawing from a different class of abstractions, {open_quotes}homomorphism abstractions,{close_quotes} and, secondly, by using novel caching techniques to avoid repeatedly expanding the same states in successive searches in the abstract space. Hierarchical A* outperforms blind search on all the search spaces studied.« less
  • Shack-Hartmann based Adaptive Optics system with a point-source reference normally use a wave-front sensing algorithm that estimates the centroid (center of mass) of the point-source image 'spot' to determine the wave-front slope. The centroiding algorithm suffers for several weaknesses. For a small number of pixels, the algorithm gain is dependent on spot size. The use of many pixels on the detector leads to significant propagation of read noise. Finally, background light or spot halo aberrations can skew results. In this paper an alternative algorithm that suffers from none of these problems is proposed: correlation of the spot with a idealmore » reference spot. The correlation method is derived and a theoretical analysis evaluates its performance in comparison with centroiding. Both simulation and data from real AO systems are used to illustrate the results. The correlation algorithm is more robust than centroiding, but requires more computation.« less