OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Understanding and Mitigating Multicore Performance Issues on the AMD Opteron Architecture

Abstract

Over the past 15 years, microprocessor performance has doubled approximately every 18 months through increased clock rates and processing efficiency. In the past few years, clock frequency growth has stalled, and microprocessor manufacturers such as AMD have moved towards doubling the number of cores every 18 months in order to maintain historical growth rates in chip performance. This document investigates the ramifications of multicore processor technology on the new Cray XT4 systems based on AMD processor technology. We begin by walking through the AMD single-core, dual-core, and upcoming quad-core processor architectures. This is followed by a discussion of methods for collecting performance counter data to understand code performance on the Cray XT3 and XT4 systems. We then use the performance counter data to analyze the impact of multicore processors on the performance of microbenchmarks such as STREAM, application kernels such as the NAS Parallel Benchmarks, and full application codes that comprise the NERSC-5 SSP benchmark suite. We explore compiler options and software optimization techniques that can mitigate the memory bandwidth contention that can reduce computing efficiency on multicore processors. The last section provides a case study of applying the dual-core optimizations to the NAS Parallel Benchmarks to dramatically improve their performance.

Authors:
Levesque, John; Larkin, Jeff; Foster, Martyn; Glenski, Joe; Geissler, Garry; Whalen, Stephen; Waldecker, Brian; Carter, Jonathan; Skinner, David; He, Helen; Wasserman, Harvey; Shalf, John; Shan, Hongzhang; Strohmaier, Erich
Publication Date:
2007-03-07
Research Org.:
COLLABORATION - Cray Inc.
Sponsoring Org.:
USDOE
OSTI Identifier:
918496
Report Number(s):
LBNL-62500
R&D Project: KX1310; BnR: KJ0102000; TRN: US200818%%407
DOE Contract Number:
DE-AC02-05CH11231
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS; COMPUTER ARCHITECTURE; BENCHMARKS; EFFICIENCY; KERNELS; MANUFACTURERS; MICROPROCESSORS; OPTIMIZATION; PERFORMANCE; PROCESSING; SUPERCOMPUTERS; multicore supercomputer processor performance

Citation Formats

Levesque, John, Larkin, Jeff, Foster, Martyn, Glenski, Joe, Geissler, Garry, Whalen, Stephen, Waldecker, Brian, Carter, Jonathan, Skinner, David, He, Helen, Wasserman, Harvey, Shalf, John, Shan, Hongzhang, and Strohmaier, Erich. Understanding and Mitigating Multicore Performance Issues on the AMD Opteron Architecture. United States: N. p., 2007. Web. doi:10.2172/918496.
Levesque, John, Larkin, Jeff, Foster, Martyn, Glenski, Joe, Geissler, Garry, Whalen, Stephen, Waldecker, Brian, Carter, Jonathan, Skinner, David, He, Helen, Wasserman, Harvey, Shalf, John, Shan, Hongzhang, & Strohmaier, Erich. Understanding and Mitigating Multicore Performance Issues on the AMD Opteron Architecture. United States. doi:10.2172/918496.
Levesque, John, Larkin, Jeff, Foster, Martyn, Glenski, Joe, Geissler, Garry, Whalen, Stephen, Waldecker, Brian, Carter, Jonathan, Skinner, David, He, Helen, Wasserman, Harvey, Shalf, John, Shan, Hongzhang, and Strohmaier, Erich. 2007. "Understanding and Mitigating Multicore Performance Issues on the AMD Opteron Architecture". United States. doi:10.2172/918496. https://www.osti.gov/servlets/purl/918496.
@techreport{osti_918496,
  title = {Understanding and Mitigating Multicore Performance Issues on the AMD Opteron Architecture},
  author = {Levesque, John and Larkin, Jeff and Foster, Martyn and Glenski, Joe and Geissler, Garry and Whalen, Stephen and Waldecker, Brian and Carter, Jonathan and Skinner, David and He, Helen and Wasserman, Harvey and Shalf, John and Shan, Hongzhang and Strohmaier, Erich},
  abstractNote = {Over the past 15 years, microprocessor performance has doubled approximately every 18 months through increased clock rates and processing efficiency. In the past few years, clock frequency growth has stalled, and microprocessor manufacturers such as AMD have moved towards doubling the number of cores every 18 months in order to maintain historical growth rates in chip performance. This document investigates the ramifications of multicore processor technology on the new Cray XT4 systems based on AMD processor technology. We begin by walking through the AMD single-core, dual-core, and upcoming quad-core processor architectures. This is followed by a discussion of methods for collecting performance counter data to understand code performance on the Cray XT3 and XT4 systems. We then use the performance counter data to analyze the impact of multicore processors on the performance of microbenchmarks such as STREAM, application kernels such as the NAS Parallel Benchmarks, and full application codes that comprise the NERSC-5 SSP benchmark suite. We explore compiler options and software optimization techniques that can mitigate the memory bandwidth contention that can reduce computing efficiency on multicore processors. The last section provides a case study of applying the dual-core optimizations to the NAS Parallel Benchmarks to dramatically improve their performance.},
  doi = {10.2172/918496},
  number = {LBNL-62500},
  place = {United States},
  year = {2007},
  month = {3}
}

  • We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations.
  • This report summarizes work done to verify the component, failure mode, and method of detection information provided in the Equipment Performance Information Exchange (EPIX) to support implementation of Mitigating Systems Performance Indices. This task is to select reports from EPIX and determine if their categorization as MSPI or non-MSPI failures is consistent with the development of unreliability baseline failure rates, and whether this significantly affects estimates of plant risk. This review covers all MSPI devices in EPIX that were reported as failures. The components include emergency generators; motor-driven, turbine-driven, and engine-driven pumps; and air and motor-operated valves. The date range for this report includes all MSPI device reported failures from 2003 to the most current EPIX data at the INL (up to the 3rd quarter 2008).
  • One of the most difficult technical challenges in cleaning up the US Department of Energy's (DOE) Hanford Site in southeast Washington State will be to process the radioactive and chemically complex waste found in the Site's 177 underground storage tanks. Solid, liquid, and sludge-like wastes are contained in 149 single- and 28 double-shelled steel tanks. These wastes contain about one half of the curies of radioactivity and mass of hazardous chemicals found on the Hanford Site. Therefore, Hanford cleanup means tank cleanup. Safely removing the waste from the tanks, separating radioactive elements from inert chemicals, and creating a final waste form for disposal will require the use of our nation's best available technology coupled with scientific advances, and an extraordinary commitment by all involved. The purpose of this guide is to inform the reader about critical issues facing tank cleanup. It is written as an information resource for the general reader as well as the technically trained person wanting to gain a basic understanding about the waste in Hanford's tanks -- how the waste was created, what is in the waste, how it is stored, and what are the key technical issues facing tank cleanup. Access to information is key to better understanding the issues and more knowledgeably participating in cleanup decisions. This guide provides such information without promoting a given cleanup approach or technology use.
  • The objective of the High Performance Distributed Systems Architecture (HPDSA) project is to integrate very high bandwidth networks with heterogeneous computer architectures (including parallel and specialized processors) and support multiple programming models with this system. The driving forces for this project are user demands for high-performance computing coupled with the availability of gigabit and high bandwidth networking. Command and control applications are increasingly requiring the capabilities of both specialized processors and supercomputer-class machines. These resources are often concentrated at computer centers while the data sources and users of these systems are geographically distributed. By linking users, data sources, and systems together with high-speed networks and distributed operating systems, we can substantially improve the quality of command and control information and provide that information to more people. A high-performance distributed operating system must satisfy a number of design goals to accomplish our architectural objectives. Foremost, it must be capable of accommodating a wide variety of computers, networks and programming languages.
  • This study examines the effects of computer architecture on FFT algorithm performance. The computer architectures evaluated are those of the Cray-1, CDC Cyber 750, IBM 370/155, DEC VAX 11/780, DEC PDP 11/60, DEC PDP 11/50, and Cromemco Z-2D. The algorithms executed are the radix-2, mixed-radix FFT (MFFT), Winograd Fourier Transform Algorithm (WFTA), and prime factor algorithm (PFA). The execution time of each algorithm for different sequence lengths is determined for each computer. Then the number of assembly language instructions executed is determined for the following categories: data transfers, floating point additions and subtractions, floating point multiplications and divisions, and integer operations. The correlation coefficients between the number of assembly language instructions in each category and the algorithm execution speeds are determined for each computer. The values of the correlation coefficients are then related to the computer architectures. The computer architectures are then compared against each other to determine what features are desirable in an FFT processor.