OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Studying the performance of CA-GMRES on multicores with multiple GPUs.


Abstract not provided.

Authors: Hoemmen, Mark Frederick; Yamazaki, Ichitaro; Anzt, Hartwig; Tomov, Stanimire; Dongarra, Jack
Publication Date:
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
Report Number(s):
DOE Contract Number:
Resource Type:
Resource Relation:
Conference: Proposed for presentation at the IEEE International Parallel and Distributed Processing Symposium held May 19-23, 2014 in Phoenix, AZ.
Country of Publication:
United States

Citation Formats

Hoemmen, Mark Frederick, Yamazaki, Ichitaro, Anzt, Hartwig, Tomov, Stanimire, and Dongarra, Jack. Studying the performance of CA-GMRES on multicores with multiple GPUs. United States: N. p., 2013. Web.
Hoemmen, Mark Frederick, Yamazaki, Ichitaro, Anzt, Hartwig, Tomov, Stanimire, & Dongarra, Jack. Studying the performance of CA-GMRES on multicores with multiple GPUs. United States.
Hoemmen, Mark Frederick, Yamazaki, Ichitaro, Anzt, Hartwig, Tomov, Stanimire, and Dongarra, Jack. 2013. "Studying the performance of CA-GMRES on multicores with multiple GPUs." United States.
title = {Studying the performance of CA-GMRES on multicores with multiple GPUs.},
author = {Hoemmen, Mark Frederick and Yamazaki, Ichitaro and Anzt, Hartwig and Tomov, Stanimire and Dongarra, Jack},
abstractNote = {Abstract not provided.},
place = {United States},
year = {2013},
month = {10}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

  • Multiple right-hand sides occur in radar scattering calculations when computing the simulated radar return from a body at a large number of angles. Each desired angle requires a right-hand side vector to be computed and the solution generated. These right-hand sides are naturally smooth functions of the angle parameters, and this property is utilized in a novel way to compute solutions an order of magnitude faster than LINPACK. The modeling technique addressed is the Method of Moments (MOM), i.e., a boundary element method for the time-harmonic Maxwell's equations. Discretization by this method produces general complex dense systems of rank 100's to 100,000's. The usual way to produce the required multiple solutions is via LU factorization and solution routines such as those found in LINPACK. Our method uses the block GMRES iterative method to directly iterate a subset of the desired solutions to convergence.
  • Programmable graphics processing units (GPUs) have emerged as excellent computational platforms for certain general-purpose applications. The data parallel execution capabilities of GPUs specifically point to the potential for effective use in simulations of agent-based models (ABM). In this paper, the computational efficiency of ABM simulation on GPUs is evaluated on representative ABM benchmarks. The runtime speed of GPU-based models is compared to that of a traditional CPU-based implementation, and also to that of equivalent models in traditional ABM toolkits (Repast and NetLogo). As expected, GPU-based ABM execution affords excellent speedup on simple models, with better speedup on models exhibiting good locality and a fair amount of computation per memory element. Execution is two to three orders of magnitude faster with a GPU than with leading ABM toolkits, but at the cost of decreased modularity, ease of programmability, and reusability. At a more fundamental level, however, the data parallel paradigm is found to be somewhat at odds with traditional model-specification approaches for ABM. Effective use of data parallel execution, in general, seems to require resolution of modeling and execution challenges. Some of the challenges are identified and related solution approaches are described.
  • Graphical Processing Units (GPUs) have evolved into highly parallel, multi-threaded, multicore processors with high memory bandwidth. GPUs are used in a variety of intensive computing applications. The combination of a highly parallel architecture and high memory bandwidth makes GPUs a potentially promising technology for effective real-time processing for High Energy Physics (HEP) experiments. However, not much is known about their performance in real-time applications that require low latency, such as the trigger for HEP experiments. We describe an R&D project with the goal of studying the performance of GPU technology for possible low-latency applications, performing basic operations as well as some more advanced HEP lower-level trigger algorithms (such as fast tracking or jet finding). We present some preliminary results on timing measurements, comparing the performance of a CPU versus a GPU with NVIDIA's CUDA general-purpose parallel computing architecture, carried out at CDF's Level-2 trigger test stand. These studies will provide performance benchmarks for future studies to investigate the potential and limitations of GPUs for real-time applications in HEP experiments.
  • In this paper we propose and analyze a set of batched linear solvers for small matrices on Graphic Processing Units (GPUs), evaluating the various alternatives depending on the size of the systems to solve. We discuss three different solutions that operate with different levels of parallelization and GPU features. The first, exploiting the CUBLAS library, manages matrices of size up to 32x32 and employs Warp level (one matrix, one Warp) parallelism and shared memory. The second works at Thread-block level parallelism (one matrix, one Thread-block), still exploiting shared memory but managing matrices up to 76x76. The third is Thread level parallel (one matrix, one thread) and can reach sizes up to 128x128, but it does not exploit shared memory and only relies on the high memory bandwidth of the GPU. The first and second solutions only support partial pivoting; the third easily supports partial and full pivoting, making it attractive for problems that require greater numerical stability. We analyze the trade-offs in terms of performance and power consumption as a function of the size of the linear systems that are simultaneously solved. We execute the three implementations on a Tesla M2090 (Fermi) and on a Tesla K20 (Kepler).