OSTI.GOV — U.S. Department of Energy, Office of Scientific and Technical Information

Title: CUDA Computation of the Feynman Distribution

Journal Article · Transactions of the American Nuclear Society
OSTI ID: 23050325
[1]
  1. Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439 (United States)

In 2006, NVIDIA Corporation introduced the Compute Unified Device Architecture (CUDA) for the C, C++, and Fortran programming languages. The CUDA parallel computing platform uses the Graphics Processing Unit (GPU) of an NVIDIA graphics card and is designed for parallel computations in which the same instructions are executed on many data elements in parallel. The GPU contains many Streaming Multiprocessors (SMs) that can independently execute blocks of threads. A thread is a sequence of serially programmed instructions. The SM executes threads in groups of 32; a group of 32 threads is called a warp. The threads within a block can be synchronized and have access to a shared memory that is visible only to them. Conversely, blocks of threads cannot be synchronized with one another and have access only to the global memory of the graphics card. This memory is called device memory and is independent of the host memory controlled by the Central Processing Unit (CPU). A single thread has access to local memory (registers) that is not visible to any other thread.

The part of the program that is executed on the graphics card is called a kernel. Each thread of the kernel has a unique identification index that is accessible through the built-in variable threadIdx, a 3-component vector. A thread block can contain up to 1024 threads. Blocks of threads likewise have a unique identification index, accessible through the built-in variable blockIdx, also a 3-component vector. The built-in variable blockDim, again a 3-component vector, contains the number of threads per block. Threads and blocks represent different levels of parallelism; a kernel is executed as a grid of blocks of threads. Figure 1 illustrates a grid of thread blocks for a GPU with 2 SMs. An SM can execute multiple blocks of threads in parallel, but a block of threads is executed on a single SM. A GPU can also execute multiple kernels concurrently.
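The thread hierarchy described above can be illustrated with a minimal kernel sketch (not taken from the paper; the kernel name and the scaling operation are illustrative). Each thread derives its global index from blockIdx, blockDim, and threadIdx, and the kernel is launched as a grid of thread blocks:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one array element; the global index combines the
// block index, the block size, and the thread index within the block.
__global__ void scale(const float *in, float *out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                        // guard: the last block may be partially full
        out[i] = a * in[i];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));  // managed memory, visible to host and device
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    int threadsPerBlock = 256;        // must not exceed the 1024-thread block limit
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    scale<<<blocks, threadsPerBlock>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();          // wait for the kernel to finish

    printf("out[0] = %f\n", out[0]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```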
This work used a Quadro M5000 graphics card with the following technical specifications:
- 2048 cores at 861 MHz,
- 16 streaming multiprocessors,
- 5.2 billion transistors,
- 8 GB device memory,
- 64 K 32-bit registers (per SM),
- 48 KB shared memory (per block),
- 211.6 GB/s memory bandwidth, and
- 3.5 TFLOPS single-precision floating point performance.
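As a sanity check on the quoted peak figure, single-precision throughput is commonly estimated as cores × 2 FLOPs (a fused multiply-add counts as two operations) × clock rate; with 2048 cores at the 861 MHz base clock this reproduces the 3.5 TFLOPS value:

```python
# Peak single-precision estimate for the Quadro M5000:
# cores * 2 (FMA = two FLOPs per cycle) * base clock.
cores = 2048
clock_hz = 861e6
peak_flops = cores * 2 * clock_hz
print(f"{peak_flops / 1e12:.2f} TFLOPS")  # → 3.53 TFLOPS
```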

Journal Information:
Transactions of the American Nuclear Society, Vol. 116; Conference: 2017 Annual Meeting of the American Nuclear Society, San Francisco, CA (United States), 11-15 Jun 2017; Other Information: Country of input: France; 5 refs.; available from American Nuclear Society - ANS, 555 North Kensington Avenue, La Grange Park, IL 60526 (US); ISSN 0003-018X
Country of Publication:
United States
Language:
English