DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Distributed Halide

Abstract

Many image processing tasks are naturally expressed as a pipeline of small computational kernels known as stencils. Halide is a popular domain-specific language and compiler designed to implement image processing algorithms. Halide uses simple language constructs to express what to compute and a separate scheduling co-language for expressing when and where to perform the computation. This approach has demonstrated performance comparable to or better than hand-optimized code. Until now, however, Halide has been restricted to parallel shared-memory execution, limiting its performance for memory-bandwidth-bound pipelines or large-scale image processing tasks. We present an extension to Halide to support distributed-memory parallel execution of complex stencil pipelines. These extensions compose with the existing scheduling constructs in Halide, allowing expression of complex computation and communication strategies. Existing Halide applications can be distributed with minimal changes, allowing programmers to explore the tradeoff between recomputation and communication with little effort. Approximately 10 new lines of code are needed even for a 200-line, 99-stage application. On nine image processing benchmarks, our extensions give up to a 1.4× speedup on a single node over regular multithreaded execution with the same number of cores, by mitigating the effects of non-uniform memory access. The distributed benchmarks achieve up to 18× speedup on a 16-node testing machine and up to 57× speedup on 64 nodes of the NERSC Cori supercomputer.
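When a stencil pipeline is distributed, each node owns a slice of the image and must either obtain a few boundary ("ghost") elements from its neighbors before applying a stencil or recompute them locally; this is the recomputation-versus-communication tradeoff the abstract mentions. A toy sketch of the ghost-zone approach in plain Python (a 1D three-point blur with invented names, not the paper's actual Halide implementation):

```python
# Toy sketch (not Halide): a 3-point blur where each "rank" owns a
# contiguous slice of a 1D image. Before computing, each rank extends
# its slice by one element per interior boundary, standing in for the
# ghost-zone data a neighbor would send over MPI.

def blur3(data):
    # 3-point box blur over the interior; output shrinks by 2.
    return [(data[i - 1] + data[i] + data[i + 1]) / 3
            for i in range(1, len(data) - 1)]

def distributed_blur3(image, num_ranks):
    n = len(image)
    out = []
    for r in range(num_ranks):
        lo = r * n // num_ranks          # this rank's owned region
        hi = (r + 1) * n // num_ranks
        glo = max(lo - 1, 0)             # ghost element from left neighbor
        ghi = min(hi + 1, n)             # ghost element from right neighbor
        # Each rank blurs its extended slice; the shrink-by-2 of blur3
        # exactly cancels the ghost padding, so results concatenate
        # into the same output as a single-node run.
        out.extend(blur3(image[glo:ghi]))
    return out
```

The alternative schedule the paper's extensions let programmers express is to skip the exchange and have each rank redundantly compute its neighbors' boundary values from data it already holds, trading extra arithmetic for less communication.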

Authors:
 Denniston, Tyler [1];  Kamil, Shoaib [2];  Amarasinghe, Saman [1]
  1. Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States)
  2. Adobe, Cambridge, MA (United States)
Publication Date:
2016-01
Research Org.:
Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1557579
Grant/Contract Number:  
SC0005288
Resource Type:
Accepted Manuscript
Journal Name:
ACM SIGPLAN Notices
Additional Journal Information:
Journal Volume: 51; Journal Issue: 8; Journal ID: ISSN 0362-1340
Publisher:
ACM
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Distributed memory; Image processing; Stencils

Citation Formats

Denniston, Tyler, Kamil, Shoaib, and Amarasinghe, Saman. Distributed Halide. United States: N. p., 2016. Web. doi:10.1145/2851141.2851157.
Denniston, Tyler, Kamil, Shoaib, & Amarasinghe, Saman. Distributed Halide. United States. doi:10.1145/2851141.2851157.
Denniston, Tyler, Kamil, Shoaib, and Amarasinghe, Saman. 2016. "Distributed Halide". United States. doi:10.1145/2851141.2851157. https://www.osti.gov/servlets/purl/1557579.
@article{osti_1557579,
title = {Distributed Halide},
author = {Denniston, Tyler and Kamil, Shoaib and Amarasinghe, Saman},
abstractNote = {Many image processing tasks are naturally expressed as a pipeline of small computational kernels known as stencils. Halide is a popular domain-specific language and compiler designed to implement image processing algorithms. Halide uses simple language constructs to express what to compute and a separate scheduling co-language for expressing when and where to perform the computation. This approach has demonstrated performance comparable to or better than hand-optimized code. Until now, however, Halide has been restricted to parallel shared-memory execution, limiting its performance for memory-bandwidth-bound pipelines or large-scale image processing tasks. We present an extension to Halide to support distributed-memory parallel execution of complex stencil pipelines. These extensions compose with the existing scheduling constructs in Halide, allowing expression of complex computation and communication strategies. Existing Halide applications can be distributed with minimal changes, allowing programmers to explore the tradeoff between recomputation and communication with little effort. Approximately 10 new lines of code are needed even for a 200-line, 99-stage application. On nine image processing benchmarks, our extensions give up to a 1.4× speedup on a single node over regular multithreaded execution with the same number of cores, by mitigating the effects of non-uniform memory access. The distributed benchmarks achieve up to 18× speedup on a 16-node testing machine and up to 57× speedup on 64 nodes of the NERSC Cori supercomputer.},
doi = {10.1145/2851141.2851157},
journal = {ACM SIGPLAN Notices},
number = 8,
volume = 51,
place = {United States},
year = {2016},
month = {1}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 2 works
Citation information provided by
Web of Science
