OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Massively Parallel QCD

Abstract

The theory of the strong nuclear force, Quantum Chromodynamics (QCD), can be numerically simulated from first principles on massively-parallel supercomputers using the method of Lattice Gauge Theory. We describe the special programming requirements of lattice QCD (LQCD) as well as the optimal supercomputer hardware architectures that it suggests. We demonstrate these methods on the BlueGene massively-parallel supercomputer and argue that LQCD and the BlueGene architecture are a natural match. This can be traced to the simple fact that LQCD is a regular lattice discretization of space into lattice sites while the BlueGene supercomputer is a discretization of space into compute nodes, and that both are constrained by requirements of locality. This simple relation is both technologically important and theoretically intriguing. The main result of this paper is the speedup of LQCD using up to 131,072 CPUs on the largest BlueGene/L supercomputer. The speedup is perfect with sustained performance of about 20% of peak. This corresponds to a maximum of 70.5 sustained TFlop/s. At these speeds LQCD and BlueGene are poised to produce the next generation of strong interaction physics theoretical results.
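
As a back-of-the-envelope illustration of the two claims above, the short Python sketch below (not from the paper; the per-core peak of 2.8 GFlop/s and the lattice and node-grid sizes are assumptions) shows how a regular 4D lattice decomposes into equal sub-lattices whose halo exchange involves only nearest-neighbor CPUs, and how 70.5 sustained TFlop/s corresponds to roughly 20% of the aggregate peak of 131,072 CPUs.

# Illustrative sketch only: the lattice and CPU-grid sizes are made up, and the
# 2.8 GFlop/s per-core peak is an assumption based on the published BlueGene/L
# clock (700 MHz, 4 floating-point operations per cycle).

# (1) Locality: each CPU owns an equal sub-lattice and exchanges only the
#     sites on its faces (the halo) with nearest-neighbor CPUs.
global_lattice = (64, 64, 64, 128)    # illustrative 4D lattice (x, y, z, t)
cpu_grid       = (16, 16, 16, 32)     # illustrative partitioning: 131,072 domains

local = [g // n for g, n in zip(global_lattice, cpu_grid)]
volume = 1
for extent in local:
    volume *= extent
halo = 2 * sum(volume // extent for extent in local)   # face sites sent to neighbors
print(f"local sub-lattice {local}: {volume} sites, {halo} halo sites per exchange")

# (2) Sustained-performance arithmetic quoted in the abstract.
cpus          = 131_072
peak_per_cpu  = 2.8e9                  # flop/s per CPU (2.8 GFlop/s, assumed)
peak_total    = cpus * peak_per_cpu    # roughly 367 TFlop/s aggregate peak
sustained     = 70.5e12                # flop/s, the 70.5 TFlop/s quoted above
print(f"aggregate peak ~{peak_total / 1e12:.0f} TFlop/s, "
      f"sustained fraction ~{sustained / peak_total:.1%}")

Because the halo exchange touches only nearest-neighbor domains, the communication pattern maps directly onto the BlueGene torus network, which is the locality match the abstract refers to.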

Authors:
Soltz, R; Vranas, P; Blumrich, M; Chen, D; Gara, A; Giampapa, M; Heidelberger, P; Salapura, V; Sexton, J; Bhanot, G
Publication Date:
April 11, 2007
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
940899
Report Number(s):
UCRL-JRNL-229921
Journal ID: ISSN 0018-8646; IBMJAE; TRN: US0807241
DOE Contract Number:
W-7405-ENG-48
Resource Type:
Journal Article
Resource Relation:
Journal Name: IBM Journal of Research and Development, vol. 52, no. 1/2 (December 11, 2007), p. 189
Country of Publication:
United States
Language:
English
Subject:
72 PHYSICS OF ELEMENTARY PARTICLES AND FIELDS; 99 GENERAL AND MISCELLANEOUS; ARCHITECTURE; NUCLEAR FORCES; PERFORMANCE; PHYSICS; PROGRAMMING; QUANTUM CHROMODYNAMICS; STRONG INTERACTIONS; SUPERCOMPUTERS

Citation Formats

Soltz, R, Vranas, P, Blumrich, M, Chen, D, Gara, A, Giampapa, M, Heidelberger, P, Salapura, V, Sexton, J, and Bhanot, G. Massively Parallel QCD. United States: N. p., 2007. Web.
Soltz, R, Vranas, P, Blumrich, M, Chen, D, Gara, A, Giampapa, M, Heidelberger, P, Salapura, V, Sexton, J, & Bhanot, G. Massively Parallel QCD. United States.
Soltz, R, Vranas, P, Blumrich, M, Chen, D, Gara, A, Giampapa, M, Heidelberger, P, Salapura, V, Sexton, J, and Bhanot, G. 2007. "Massively Parallel QCD". United States. https://www.osti.gov/servlets/purl/940899.
@article{osti_940899,
title = {Massively Parallel QCD},
author = {Soltz, R and Vranas, P and Blumrich, M and Chen, D and Gara, A and Giampapa, M and Heidelberger, P and Salapura, V and Sexton, J and Bhanot, G},
abstractNote = {The theory of the strong nuclear force, Quantum Chromodynamics (QCD), can be numerically simulated from first principles on massively-parallel supercomputers using the method of Lattice Gauge Theory. We describe the special programming requirements of lattice QCD (LQCD) as well as the optimal supercomputer hardware architectures that it suggests. We demonstrate these methods on the BlueGene massively-parallel supercomputer and argue that LQCD and the BlueGene architecture are a natural match. This can be traced to the simple fact that LQCD is a regular lattice discretization of space into lattice sites while the BlueGene supercomputer is a discretization of space into compute nodes, and that both are constrained by requirements of locality. This simple relation is both technologically important and theoretically intriguing. The main result of this paper is the speedup of LQCD using up to 131,072 CPUs on the largest BlueGene/L supercomputer. The speedup is perfect with sustained performance of about 20% of peak. This corresponds to a maximum of 70.5 sustained TFlop/s. At these speeds LQCD and BlueGene are poised to produce the next generation of strong interaction physics theoretical results.},
doi = {},
journal = {IBM Journal of Research and Development},
volume = {52},
number = {1/2},
pages = {189},
place = {United States},
year = {2007},
month = {apr}
}
  • New massively parallel computer architectures have revolutionized the design of computer algorithms and promise to have a significant influence on algorithms for engineering computations. The traditional global model method offers limited benefit on massively parallel computers; an alternative is the domain decomposition approach. This paper explores the potential of the domain decomposition strategy through actual computations. The example of a three-dimensional linear static finite element analysis is presented on the BBN Butterfly TC2000 massively parallel computer with up to 104 processors. The numerical results indicate that the parallel domain decomposition method requires a lower computation time than the parallel global model method and also offers a better speed-up.
  • An understanding of particle transport is necessary to reduce contamination of semiconductor wafers during low-pressure processing. The trajectories of particles in these reactors are determined by external forces (the most important being neutral fluid drag, thermophoresis, electrostatic, viscous ion drag, and gravitational), by Brownian motion (due to collisions with neutral and charged gas molecules), and by particle inertia. Gas velocity and temperature fields are also needed for particle transport calculations, but conventional continuum fluid approximations break down at low pressures when the gas mean free path becomes comparable to chamber dimensions. Thus, in this work we use a massively parallel direct simulation Monte Carlo method to calculate low-pressure internal gas flow fields, which show temperature jump and velocity slip at the reactor boundaries. Because particle residence times can be short compared to particle response times in these low-pressure systems (for which continuum diffusion theory fails), we solve the Langevin equation using a numerical Lagrangian particle tracking model which includes a fluctuating Brownian force; a minimal sketch of this kind of Langevin integration appears after this list. Because of the need for large numbers of particle trajectories to ensure statistical accuracy, the particle tracking model is also implemented on a massively parallel computer. The particle transport model is validated by comparison to the Ornstein–Furth theoretical result for the mean square displacement of a cloud of particles. For long times, the particles tend toward a Maxwellian spatial distribution, while at short times particle spread is controlled by their initial (Maxwellian) velocity distribution. Several simulations using these techniques are presented for particle transport and deposition in a low-pressure, parallel-plate reactor geometry. The corresponding particle collection efficiencies on a wafer for different particle sizes, gas temperature gradients, and gas pressures are evaluated.
  • A massively parallel version of the configuration interaction (CI) section of the COLUMBUS multireference singles and doubles CI (MRCISD) program system is described. In an extension of our previous parallelization work, which was based on message passing, the global array (GA) toolkit has now been used. For each process, these tools permit asynchronous and efficient access to logical blocks of 1- and 2-dimensional (2-D) arrays physically distributed over the memory of all processors. The GAs are available on most of the major parallel computer systems, enabling very convenient portability of our parallel program code. To demonstrate the features of the parallel COLUMBUS CI code, benchmark calculations on selected MRCI and SRCI test cases are reported for the CRAY T3D, Intel Paragon, and IBM SP2. Excellent scaling with the number of processors up to 256 processors (CRAY T3D) was observed. The CI section of a 19 million configuration MRCISD calculation was carried out within 20 min wall clock time on 256 processors of a CRAY T3D. Computations with 38 million configurations were performed recently; calculations up to about 100 million configurations seem possible within the near future.
  • No abstract prepared.
  • The shift-and-invert parallel spectral transformations (SIPs) approach, a computational method for solving sparse eigenvalue problems, is developed for massively parallel architectures with exceptional parallel scalability and robustness. The capabilities of SIPs are demonstrated by diagonalization of density-functional based tight-binding (DFTB) Hamiltonian and overlap matrices for single-wall metallic carbon nanotubes, diamond nanowires, and bulk diamond crystals. The largest (smallest) example studied is a 128,000 (2,000) atom nanotube for which ~330,000 (~5,600) eigenvalues and eigenfunctions are obtained in ~190 (~5) seconds when parallelized over 266,144 (16,384) Blue Gene/Q cores. Weak scaling and strong scaling of SIPs are analyzed, and the performance of SIPs is compared with other novel methods. Different matrix ordering methods are investigated to reduce the cost of the factorization step, which dominates the time-to-solution at the strong scaling limit. As a result, a parallel implementation of assembling the density matrix from the distributed eigenvectors is demonstrated. A serial, single-shift sketch of the underlying spectral transformation appears after this list.
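
The Langevin-based Lagrangian particle tracking described in the particle-transport abstract above can be illustrated with a short sketch. This is not the authors' code: the integrator below is a plain Euler-Maruyama step with a linear drag term, gravity, and a fluctuating Brownian force whose amplitude follows from the fluctuation-dissipation relation, and every parameter value is a placeholder.

import numpy as np

rng = np.random.default_rng(0)

k_B   = 1.380649e-23                   # Boltzmann constant [J/K]
T     = 300.0                          # gas temperature [K] (placeholder)
m     = 1.0e-15                        # particle mass [kg] (placeholder)
beta  = 1.0e-9                         # linear drag coefficient [kg/s] (placeholder)
u_gas = np.array([0.0, 0.0, 0.0])      # local gas velocity [m/s]
g     = np.array([0.0, 0.0, -9.81])    # gravitational acceleration [m/s^2]

dt, n_steps, n_particles = 1.0e-6, 5_000, 1_000
x = np.zeros((n_particles, 3))
v = rng.normal(scale=np.sqrt(k_B * T / m), size=(n_particles, 3))  # Maxwellian start

# Fluctuation-dissipation: Brownian force amplitude is tied to the drag coefficient.
sigma = np.sqrt(2.0 * k_B * T * beta) / m

for _ in range(n_steps):
    drag = -(beta / m) * (v - u_gas)                   # fluid drag acceleration
    dW = rng.normal(scale=np.sqrt(dt), size=v.shape)   # Wiener increments
    v += (drag + g) * dt + sigma * dW                  # Euler-Maruyama velocity update
    x += v * dt                                        # position update

# Mean square displacement of the particle cloud (the quantity compared to the
# Ornstein-Furth result in the abstract above).
msd = np.mean(np.sum(x**2, axis=1))
print(f"mean square displacement after {n_steps * dt:.1e} s: {msd:.3e} m^2")

In a production run each trajectory is independent, which is why the abstract notes that the tracking model parallelizes naturally over a massively parallel machine.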
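
For the SIPs abstract above, the sketch below shows the serial core of the shift-and-invert spectral transformation on a toy generalized symmetric eigenproblem, using SciPy's shift-invert mode; the tridiagonal H and identity S are stand-ins for DFTB Hamiltonian and overlap matrices, and the shift and slice size are arbitrary. SIPs itself goes further by distributing many shifts (spectral slices) and their factorizations across the machine.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# Toy generalized symmetric eigenproblem H x = lambda S x. The matrices are
# illustrative stand-ins, not DFTB data.
n = 2_000
H = sp.diags([-np.ones(n - 1), 2.0 * np.ones(n), -np.ones(n - 1)],
             offsets=[-1, 0, 1], format="csc")
S = sp.identity(n, format="csc")

sigma = 0.5    # shift: we ask for the eigenvalues closest to this value
k = 8          # number of eigenpairs in this spectral slice

# With sigma given, eigsh works with the shift-and-invert operator
# (H - sigma*S)^{-1} S, so convergence targets the eigenvalues nearest sigma.
vals, vecs = eigsh(H, k=k, M=S, sigma=sigma, which="LM")
print("eigenvalues nearest the shift:", np.sort(vals))

Distributing different shifts over different process groups is what gives SIPs its parallel scalability; the factorization of the shifted matrix is the step whose cost the matrix-ordering study in the abstract targets.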