OSTI.GOV — U.S. Department of Energy, Office of Scientific and Technical Information

Title: CUDA Computation of the Feynman Distribution

Journal Article · Transactions of the American Nuclear Society
OSTI ID: 23050325
[1]
  1. Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439 (United States)

In 2006, NVIDIA Corporation introduced the Compute Unified Device Architecture (CUDA) for the C, C++, and Fortran programming languages. The CUDA parallel computing platform uses the Graphics Processing Unit (GPU) of an NVIDIA graphics card and is designed for parallel computations in which the same instructions are executed on many data elements in parallel. The GPU contains many Streaming Multiprocessors (SMs) that can independently execute blocks of threads. A thread is a sequence of serially programmed instructions. The SM executes threads in groups of 32; a group of 32 threads is called a warp. The threads within a block can be synchronized and have access to a shared memory that is visible only to them. Conversely, blocks of threads cannot be synchronized with one another and have access only to the global memory of the graphics card. This memory is called device memory and is independent of the host memory controlled by the Central Processing Unit (CPU). A single thread has access to local memory (registers) that is not visible to any other thread.

The part of the program that is executed on the graphics card is called a kernel. Each thread of the kernel has a unique identification index that is accessible through the built-in variable threadIdx, a 3-component vector. A thread block can contain up to 1024 threads. Blocks of threads likewise have a unique identification index, accessible through the built-in variable blockIdx, also a 3-component vector. The built-in variable blockDim, again a 3-component vector, contains the number of threads per block. Threads and blocks represent different levels of parallelism; a kernel is executed as a grid of blocks of threads. Figure 1 illustrates a grid of thread blocks for a GPU with 2 SMs. An SM can execute multiple blocks of threads in parallel, but a block of threads is executed on a single SM. A GPU can also execute multiple kernels concurrently.
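The thread hierarchy described above can be illustrated with a minimal kernel sketch (not taken from the paper; the kernel name and the scaling operation are illustrative). Each thread derives its global index from blockIdx, blockDim, and threadIdx, and the kernel is launched as a grid of thread blocks:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one array element; the global index combines the
// block index, the block size, and the thread index within the block.
__global__ void scale(const float *in, float *out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                        // guard: the last block may be partially full
        out[i] = a * in[i];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));  // managed memory, visible to host and device
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    int threadsPerBlock = 256;        // must not exceed the 1024-thread block limit
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    scale<<<blocks, threadsPerBlock>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();          // wait for the kernel to finish

    printf("out[0] = %f\n", out[0]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```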
This work used a Quadro M5000 graphics card with the following technical specifications:
- 2048 cores at 861 MHz,
- 16 streaming multiprocessors,
- 5.2 billion transistors,
- 8 GB device memory,
- 64 K 32-bit registers (per SM),
- 48 KB shared memory (per block),
- 211.6 GB/s memory bandwidth, and
- 3.5 TFLOPS single-precision floating point performance.
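As a sanity check on the quoted peak figure, single-precision throughput is commonly estimated as cores × 2 FLOPs (a fused multiply-add counts as two operations) × clock rate; with 2048 cores at the 861 MHz base clock this reproduces the 3.5 TFLOPS value:

```python
# Peak single-precision estimate for the Quadro M5000:
# cores * 2 (FMA = two FLOPs per cycle) * base clock.
cores = 2048
clock_hz = 861e6
peak_flops = cores * 2 * clock_hz
print(f"{peak_flops / 1e12:.2f} TFLOPS")  # → 3.53 TFLOPS
```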

Journal Information:
Transactions of the American Nuclear Society, Vol. 116; Conference: 2017 Annual Meeting of the American Nuclear Society, San Francisco, CA (United States), 11-15 Jun 2017; Other Information: Country of input: France; 5 refs.; available from American Nuclear Society - ANS, 555 North Kensington Avenue, La Grange Park, IL 60526 (US); ISSN 0003-018X
Country of Publication:
United States
Language:
English