Title: An MPI + X implementation of contact global search using Kokkos

Journal Article · Engineering with Computers

This paper describes an approach to parallelizing the spatial search associated with computational contact mechanics. In contact mechanics, the purpose of the spatial search is to find “nearest neighbors,” which is the prelude to an imprinting search that resolves the interactions between the external surfaces of contacting bodies. In particular, we are interested in the contact global search portion of the spatial search on domain-decomposition-based meshes. Specifically, we describe an implementation that combines standard domain-decomposition-based MPI-parallel spatial search with thread-level parallelism (MPI + X) available on advanced computer architectures (those with GPU coprocessors). Our goal is to demonstrate the efficacy of the MPI + X paradigm in the overall contact search. Standard MPI-parallel implementations typically use a domain decomposition of the external surfaces of bodies within the domain in an attempt to distribute computational work efficiently. This decomposition may or may not be the same as the volume decomposition associated with the host physics. The parallel contact global search phase is then employed to find and distribute the surface entities (nodes and faces) needed to compute contact constraints between entities owned by different MPI ranks without further inter-rank communication. Key steps of the contact global search include computing bounding boxes, building surface entity (node and face) search trees, and finding and distributing the entities required to complete on-rank (local) spatial searches. To enable source-code portability and performance across a variety of computer architectures, we implemented the algorithm using the Kokkos hardware abstraction library. While we targeted development towards machines with a GPU accelerator per MPI rank, we also report performance results for OpenMP with a conventional multi-core compute node per rank. Results here demonstrate a 47% decrease in the time spent within the global search algorithm, comparing the reference ACME algorithm with the GPU implementation, on an 18M-face problem using four MPI ranks. While further work remains to maximize performance on the GPU, this result illustrates the potential of the proposed implementation.
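
The key steps named in the abstract (computing bounding boxes, then finding which entities other ranks need) can be illustrated with a minimal serial C++ sketch. This is not the paper's implementation and does not use Kokkos, MPI, or ACME: the names `Box`, `bounding_box`, `overlaps`, and `ranks_needing_faces`, and the padding parameter `tol`, are hypothetical, and a brute-force loop over rank boxes stands in for the search trees the abstract mentions. The outer loop over faces has independent iterations, which is the kind of loop that thread-level parallelism could plausibly target.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Point = std::array<double, 3>;

// Axis-aligned bounding box (hypothetical type for this sketch).
struct Box {
  Point lo;
  Point hi;
};

// Smallest axis-aligned box enclosing the nodes of one face,
// padded by `tol` on every side (an assumed capture tolerance).
Box bounding_box(const std::vector<Point>& nodes, double tol) {
  assert(!nodes.empty());
  Box b{nodes[0], nodes[0]};
  for (const Point& p : nodes) {
    for (int d = 0; d < 3; ++d) {
      b.lo[d] = std::min(b.lo[d], p[d]);
      b.hi[d] = std::max(b.hi[d], p[d]);
    }
  }
  for (int d = 0; d < 3; ++d) {
    b.lo[d] -= tol;
    b.hi[d] += tol;
  }
  return b;
}

// Two boxes overlap unless they are separated along some axis.
bool overlaps(const Box& a, const Box& b) {
  for (int d = 0; d < 3; ++d) {
    if (a.hi[d] < b.lo[d] || b.hi[d] < a.lo[d]) return false;
  }
  return true;
}

// For each locally owned face box, list the other ranks whose domain
// box it overlaps; those ranks would need a copy of the face.
// Brute force over ranks; a tree search would replace the inner loop.
std::vector<std::vector<int>> ranks_needing_faces(
    const std::vector<Box>& face_boxes,
    const std::vector<Box>& rank_boxes,
    int my_rank) {
  std::vector<std::vector<int>> dest(face_boxes.size());
  for (std::size_t f = 0; f < face_boxes.size(); ++f) {  // independent per face
    for (std::size_t r = 0; r < rank_boxes.size(); ++r) {
      if (static_cast<int>(r) != my_rank &&
          overlaps(face_boxes[f], rank_boxes[r])) {
        dest[f].push_back(static_cast<int>(r));
      }
    }
  }
  return dest;
}
```

In a distributed setting, the per-face destination lists would drive the communication that ships entities to other ranks, so that each rank can then complete its local search without further inter-rank communication.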

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC04-94AL85000
OSTI ID:
1335669
Alternate ID(s):
OSTI ID: 1512914
Report Number(s):
SAND-2016-10645J; SAND-2015-7190J; PII: 418
Journal Information:
Engineering with Computers, Vol. 32, Issue 2; ISSN 0177-0667
Publisher:
Springer
Country of Publication:
United States
Language:
English

