OSTI.GOV
U.S. Department of Energy, Office of Scientific and Technical Information

Title: Scientific Application Performance on Leading Scalar and Vector Supercomputing Platforms

Abstract

The last decade has witnessed a rapid proliferation of superscalar cache-based microprocessors for building high-end computing (HEC) platforms, primarily because of their generality, scalability, and cost effectiveness. However, the growing gap between sustained and peak performance for full-scale scientific applications on conventional supercomputers has become a major concern in high performance computing: significantly larger systems and greater application scalability than peak performance implies are required to achieve the desired performance. The latest generation of custom-built parallel vector systems has the potential to address this issue for numerical algorithms with sufficient regularity in their computational structure. In this work we explore applications drawn from four areas: magnetic fusion (GTC), plasma physics (LBMHD3D), astrophysics (Cactus), and materials science (PARATEC). We compare the performance of the vector-based Cray X1, X1E, Earth Simulator (ES), and NEC SX-8 with that of three leading commodity-based superscalar platforms built on the IBM Power3, Intel Itanium2, and AMD Opteron processors. Our work makes several significant contributions: a new data-decomposition scheme for GTC that, for the first time, breaks the Teraflop barrier; a new three-dimensional Lattice Boltzmann magnetohydrodynamic implementation, used to study the onset evolution of plasma turbulence, that achieves over 26 Tflop/s on 4,800 ES processors; the highest per-processor performance (by far) achieved by the full-production version of Cactus ADM-BSSN; and the largest PARATEC cell-size atomistic simulation to date. Overall, the results show that the vector architectures attain unprecedented aggregate performance across our application suite, demonstrating the tremendous potential of modern parallel vector systems.
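The abstract credits vector performance to "sufficient regularity" in the computational structure of the algorithms. As a rough illustration only (not code from GTC, LBMHD3D, Cactus, or PARATEC; the grid size, relaxation rate, and single scalar field are assumptions), the sketch below shows the kind of long, unit-stride, dependence-free inner loop that vector compilers can pipeline efficiently, and that superscalar systems typically run at a small fraction of peak due to memory-bandwidth limits.

/*
 * Minimal sketch (not the paper's application code) of a regular,
 * unit-stride 3-D lattice update.  Grid size, the relaxation rate,
 * and the single scalar field are illustrative assumptions only.
 */
#include <stdio.h>
#include <stdlib.h>

#define NX 64
#define NY 64
#define NZ 64
#define IDX(i, j, k) (((i) * NY + (j)) * NZ + (k))

int main(void)
{
    double *f     = malloc(sizeof(double) * NX * NY * NZ);
    double *f_new = malloc(sizeof(double) * NX * NY * NZ);
    const double omega = 1.0;  /* illustrative relaxation rate */

    for (size_t n = 0; n < (size_t)NX * NY * NZ; n++)
        f[n] = 1.0;

    /* One relaxation sweep: the inner k-loop is long, unit-stride, and
     * free of data dependences, which is the kind of regularity the
     * abstract credits for high vector efficiency. */
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            for (int k = 1; k < NZ - 1; k++) {
                double avg = (f[IDX(i - 1, j, k)] + f[IDX(i + 1, j, k)] +
                              f[IDX(i, j - 1, k)] + f[IDX(i, j + 1, k)] +
                              f[IDX(i, j, k - 1)] + f[IDX(i, j, k + 1)]) / 6.0;
                f_new[IDX(i, j, k)] = (1.0 - omega) * f[IDX(i, j, k)] + omega * avg;
            }

    printf("sample value: %f\n", f_new[IDX(NX / 2, NY / 2, NZ / 2)]);
    free(f);
    free(f_new);
    return 0;
}

On a vector machine such as the ES or SX-8, the inner loop above can be issued as vector operations; on cache-based superscalar systems the same loop is typically bound by memory bandwidth, which is one source of the sustained-versus-peak gap the abstract describes.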

Authors:
Oliker, Leonid; Canning, Andrew; Carter, Jonathan; Shalf, John; Ethier, Stephane
Publication Date:
2007
Research Org.:
Ernest Orlando Lawrence Berkeley National Laboratory, Berkeley, CA (US)
Sponsoring Org.:
USDOE Director, Office of Science, Advanced Scientific Computing Research
OSTI Identifier:
925423
Report Number(s):
LBNL-60800
R&D Project: K11121; BnR: KJ0101030; TRN: US200807%%348
DOE Contract Number:
DE-AC02-05CH11231
Resource Type:
Journal Article
Resource Relation:
Journal Name: International Journal of High Performance Computing Applications; Journal Volume: 22; Journal Issue: 1; Related Information: Journal Publication Date: 2008
Country of Publication:
United States
Language:
English
Subject:
42; ALGORITHMS; ASTROPHYSICS; IMPLEMENTATION; MICROPROCESSORS; PERFORMANCE; PHYSICS; PLASMA; PROLIFERATION; SCALARS; SIMULATION; SUPERCOMPUTERS; TURBULENCE; VECTORS

Citation Formats

Oliker, Leonid, Canning, Andrew, Carter, Jonathan, Shalf, John, and Ethier, Stephane. Scientific Application Performance on Leading Scalar and Vector Supercomputing Platforms. United States: N. p., 2007. Web.
Oliker, Leonid, Canning, Andrew, Carter, Jonathan, Shalf, John, & Ethier, Stephane. Scientific Application Performance on Leading Scalar and Vector Supercomputing Platforms. United States.
Oliker, Leonid, Canning, Andrew, Carter, Jonathan, Shalf, John, and Ethier, Stephane. 2007. "Scientific Application Performance on Leading Scalar and Vector Supercomputing Platforms". United States. https://www.osti.gov/servlets/purl/925423.
@article{osti_925423,
title = {Scientific Application Performance on Leading Scalar and Vector Supercomputing Platforms},
author = {Oliker, Leonid and Canning, Andrew and Carter, Jonathan and Shalf, John and Ethier, Stephane},
abstractNote = {The last decade has witnessed a rapid proliferation of superscalar cache-based microprocessors for building high-end computing (HEC) platforms, primarily because of their generality, scalability, and cost effectiveness. However, the growing gap between sustained and peak performance for full-scale scientific applications on conventional supercomputers has become a major concern in high performance computing: significantly larger systems and greater application scalability than peak performance implies are required to achieve the desired performance. The latest generation of custom-built parallel vector systems has the potential to address this issue for numerical algorithms with sufficient regularity in their computational structure. In this work we explore applications drawn from four areas: magnetic fusion (GTC), plasma physics (LBMHD3D), astrophysics (Cactus), and materials science (PARATEC). We compare the performance of the vector-based Cray X1, X1E, Earth Simulator (ES), and NEC SX-8 with that of three leading commodity-based superscalar platforms built on the IBM Power3, Intel Itanium2, and AMD Opteron processors. Our work makes several significant contributions: a new data-decomposition scheme for GTC that, for the first time, breaks the Teraflop barrier; a new three-dimensional Lattice Boltzmann magnetohydrodynamic implementation, used to study the onset evolution of plasma turbulence, that achieves over 26 Tflop/s on 4,800 ES processors; the highest per-processor performance (by far) achieved by the full-production version of Cactus ADM-BSSN; and the largest PARATEC cell-size atomistic simulation to date. Overall, the results show that the vector architectures attain unprecedented aggregate performance across our application suite, demonstrating the tremendous potential of modern parallel vector systems.},
doi = {},
journal = {International Journal of High Performance Computing Applications},
number = 1,
volume = 22,
place = {United States},
year = {2007},
month = {jan}
}
  • The last decade has witnessed a rapid proliferation of superscalar cache-based microprocessors for building high-end capability and capacity computers, primarily because of their generality, scalability, and cost effectiveness. However, the constant degradation of superscalar sustained performance has become a well-known problem in the scientific computing community. This trend has been widely attributed to the use of superscalar-based commodity components whose architectural designs offer a balance between memory performance, network capability, and execution rate that is poorly matched to the requirements of large-scale numerical computations. The recent development of massively parallel vector systems offers the potential to bridge the performance gap for many important classes of algorithms. In this study we examine four diverse scientific applications with the potential to run at ultrascale, from the areas of plasma physics, materials science, astrophysics, and magnetic fusion. We compare the performance of the vector-based Earth Simulator (ES) and Cray X1 with that of leading superscalar-based platforms: the IBM Power3/4 and the SGI Altix. Results demonstrate that the ES vector systems achieve excellent performance on our application suite - the highest of any architecture tested to date.
  • A performance study has been made for the MCNP4B Monte Carlo radiation transport code on a wide variety of scientific computing platforms ranging from personal computers to Cray mainframes. We present the timing study results using MCNP4B and its new test set and libraries. This timing study is unlike other timing studies because of its widespread reproducibility, its direct comparability to the predecessor study in 1993, and its focus upon a nuclear engineering code.
  • Several computing platforms were evaluated with the MCNP4B Monte Carlo radiation transport code. The DEC AlphaStation 500/500 was the fastest to run MCNP4B. Compared to the HP 9000-735, the fastest platform 4 years ago, the AlphaStation is 335% faster, the HP C180 is 133% faster, the SGI Origin 2000 is 82% faster, the Cray T94/4128 is 1% faster, the IBM RS/6000-590 is 93% as fast, the DEC 3000/600 is 81% as fast, the Sun Sparc20 is 57% as fast, the Cray YMP 8/8128 is 57% as fast, the Sun Sparc5 is 33% as fast, and the Sun Sparc2 is 13% as fast. All results presented are reproducible and allow for comparison to computer platforms not included in this study. Timing studies are seen to be very problem dependent. The performance gains resulting from advances in software were also investigated. Various compilers and operating systems were seen to have a modest impact on performance, whereas hardware improvements have resulted in a factor of 4 improvement. MCNP4B also ran approximately as fast as MCNP4A.
  • This article describes the performance evaluation through benchmarking of computers with vector-computing capabilities for general-purpose, large-scale scientific computation. This study differs from others in several respects. The author compared three major classes of machines: scalar mainframe computers, mainframe computers with integrated vector facilities, and supercomputers. He evaluated throughput (measured by total elapsed time) of these machines using a collection of 20 end-user codes. He also measured the speed of execution of each individual code. He recorded the relative ease with which these codes were converted to run, perhaps at less than optimal performance, in each new environment. Finally, he attempted to optimize a few of the codes in order to realize the full potential of the particular machine being benchmarked.
  • With the exponential growth of high-fidelity sensor and simulated data, the scientific community is increasingly reliant on ultrascale HPC resources to handle its data analysis requirements. However, to use such extreme computing power effectively, the I/O components must be designed in a balanced fashion, as any architectural bottleneck will quickly render the platform intolerably inefficient. To understand the I/O performance of data-intensive applications in realistic computational settings, we develop a lightweight, portable benchmark called MADbench2, which is derived directly from a large-scale Cosmic Microwave Background (CMB) data analysis package. Our study represents one of the most comprehensive I/O analyses of modern parallel file systems, examining a broad range of system architectures and configurations, including Lustre on the Cray XT3, XT4, and Intel Itanium2 clusters; GPFS on IBM Power5 and AMD Opteron platforms; a BlueGene/P installation using GPFS and PVFS2 file systems; and CXFS on the SGI Altix 3700. We present extensive synchronous I/O performance data comparing a number of key parameters including concurrency, POSIX- versus MPI-IO, and unique- versus shared-file accesses, using both the default environment as well as highly tuned I/O parameters. Finally, we explore the potential of asynchronous I/O and show that only two of the nine evaluated systems benefited from MPI-2's asynchronous MPI-IO. On those systems, experimental results indicate that the computational intensity required to hide I/O effectively is already close to the practical limit of BLAS3 calculations. Overall, our study quantifies vast differences in performance and functionality of parallel file systems across state-of-the-art platforms -- showing I/O rates that vary up to 75x on the examined architectures -- while providing system designers and computational scientists a lightweight tool for conducting further analysis.
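The last item contrasts synchronous with asynchronous MPI-IO and notes that hiding I/O behind computation requires substantial computational intensity. As a rough sketch only (not MADbench2 itself; the file name, buffer size, and dummy BLAS3-like kernel are assumptions), the following shows the standard MPI-2 pattern being evaluated: post a nonblocking shared-file write with MPI_File_iwrite_at, overlap it with computation, then complete it with MPI_Wait.

/*
 * Minimal sketch (not MADbench2) of overlapping asynchronous MPI-IO
 * with computation.  File name, buffer size, and the dummy dense
 * matrix-multiply kernel are illustrative assumptions only.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N_DOUBLES (1 << 20)  /* 8 MB of doubles per rank, illustrative */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(sizeof(double) * N_DOUBLES);
    for (int i = 0; i < N_DOUBLES; i++)
        buf[i] = (double)rank + i * 1e-9;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "madbench_sketch.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its own contiguous region of the shared file. */
    MPI_Offset offset = (MPI_Offset)rank * N_DOUBLES * sizeof(double);

    /* Post the write without blocking... */
    MPI_Request req;
    MPI_File_iwrite_at(fh, offset, buf, N_DOUBLES, MPI_DOUBLE, &req);

    /* ...and overlap it with BLAS3-like work on separate data; whether
     * this actually hides the I/O depends on the file system and the
     * computational intensity, which is what the study quantifies. */
    enum { M = 256 };
    static double a[M][M], b[M][M], c[M][M];
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++) { a[i][j] = 1.0; b[i][j] = 2.0; c[i][j] = 0.0; }
    for (int i = 0; i < M; i++)
        for (int k = 0; k < M; k++)
            for (int j = 0; j < M; j++)
                c[i][j] += a[i][k] * b[k][j];

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    if (rank == 0)
        printf("overlapped compute check: %f\n", c[0][0]);

    free(buf);
    MPI_Finalize();
    return 0;
}

The design point the abstract makes is that if the overlapped kernel is too small relative to the write, MPI_Wait simply stalls until the I/O drains; only when the compute time approaches the I/O time does the asynchronous path pay off, which the study found to hold on just two of the nine systems.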