Distributed Halide
Abstract
Many image processing tasks are naturally expressed as a pipeline of small computational kernels known as stencils. Halide is a popular domain-specific language and compiler designed to implement image processing algorithms. Halide uses simple language constructs to express what to compute and a separate scheduling co-language for expressing when and where to perform the computation. This approach has demonstrated performance comparable to or better than hand-optimized code. Until now, however, Halide has been restricted to parallel shared memory execution, limiting its performance for memory-bandwidth-bound pipelines or large-scale image processing tasks. We present an extension to Halide to support distributed-memory parallel execution of complex stencil pipelines. These extensions compose with the existing scheduling constructs in Halide, allowing expression of complex computation and communication strategies. Existing Halide applications can be distributed with minimal changes, allowing programmers to explore the tradeoff between recomputation and communication with little effort. Approximately 10 new lines of code are needed even for a 200 line, 99 stage application. On nine image processing benchmarks, our extensions give up to a 1.4× speedup on a single node over regular multithreaded execution with the same number of cores, by mitigating the effects of non-uniform memory access. The distributed benchmarks achieve up to 18× speedup on a 16 node testing machine and up to 57× speedup on 64 nodes of the NERSC Cori supercomputer.
- Authors:
- Denniston, Tyler; Kamil, Shoaib; Amarasinghe, Saman
- Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States)
- Adobe, Cambridge, MA (United States)
- Publication Date:
- Research Org.:
- Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- OSTI Identifier:
- 1557579
- Grant/Contract Number:
- SC0005288
- Resource Type:
- Accepted Manuscript
- Journal Name:
- SIGPLAN
- Additional Journal Information:
- Journal Volume: 51; Journal Issue: 8; Journal ID: ISSN 0362-1340
- Publisher:
- ACM
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; Distributed memory; Image processing; Stencils
Citation Formats
Denniston, Tyler, Kamil, Shoaib, and Amarasinghe, Saman. Distributed Halide. United States: N. p., 2016. Web. doi:10.1145/2851141.2851157.
Denniston, Tyler, Kamil, Shoaib, & Amarasinghe, Saman. Distributed Halide. United States. https://doi.org/10.1145/2851141.2851157
Denniston, Tyler, Kamil, Shoaib, and Amarasinghe, Saman. 2016. "Distributed Halide". United States. https://doi.org/10.1145/2851141.2851157. https://www.osti.gov/servlets/purl/1557579.
@article{osti_1557579,
title = {Distributed Halide},
author = {Denniston, Tyler and Kamil, Shoaib and Amarasinghe, Saman},
abstractNote = {Many image processing tasks are naturally expressed as a pipeline of small computational kernels known as stencils. Halide is a popular domain-specific language and compiler designed to implement image processing algorithms. Halide uses simple language constructs to express what to compute and a separate scheduling co-language for expressing when and where to perform the computation. This approach has demonstrated performance comparable to or better than hand-optimized code. Until now, however, Halide has been restricted to parallel shared memory execution, limiting its performance for memory-bandwidth-bound pipelines or large-scale image processing tasks. We present an extension to Halide to support distributed-memory parallel execution of complex stencil pipelines. These extensions compose with the existing scheduling constructs in Halide, allowing expression of complex computation and communication strategies. Existing Halide applications can be distributed with minimal changes, allowing programmers to explore the tradeoff between recomputation and communication with little effort. Approximately 10 new lines of code are needed even for a 200 line, 99 stage application. On nine image processing benchmarks, our extensions give up to a 1.4× speedup on a single node over regular multithreaded execution with the same number of cores, by mitigating the effects of non-uniform memory access. The distributed benchmarks achieve up to 18× speedup on a 16 node testing machine and up to 57× speedup on 64 nodes of the NERSC Cori supercomputer.},
doi = {10.1145/2851141.2851157},
journal = {SIGPLAN},
number = 8,
volume = 51,
place = {United States},
year = {2016},
month = {1}
}
Works referenced in this record:
An auto-tuning framework for parallel multicore stencil computations
conference, April 2010
- Kamil, Shoaib; Chan, Cy; Oliker, Leonid
- 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
The pochoir stencil compiler
conference, January 2011
- Tang, Yuan; Chowdhury, Rezaul Alam; Kuszmaul, Bradley C.
- Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures - SPAA '11
A stencil compiler for short-vector SIMD architectures
conference, January 2013
- Henretty, Tom; Veras, Richard; Franchetti, Franz
- Proceedings of the 27th international ACM conference on International conference on supercomputing - ICS '13
PolyMage: Automatic Optimization for Image Processing Pipelines
conference, January 2015
- Mullapudi, Ravi Teja; Vasista, Vinay; Bondhugula, Uday
- Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '15
Optimal scheduling algorithm for distributed-memory machines
journal, January 1998
- Darbha, S.; Agrawal, D. P.
- IEEE Transactions on Parallel and Distributed Systems, Vol. 9, Issue 1
Scheduling Malleable Parallel Tasks: An Asymptotic Fully Polynomial-Time Approximation Scheme
book, January 2002
- Jansen, Klaus
- Algorithms — ESA 2002
Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
conference, January 2013
- Ragan-Kelley, Jonathan; Barnes, Connelly; Adams, Andrew
- Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation - PLDI '13
Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers
conference, January 2011
- Maruyama, Naoya; Nomura, Tatsuo; Sato, Kento
- Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis - SC '11
PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures
conference, May 2011
- Christen, Matthias; Schenk, Olaf; Burkhart, Helmar
- 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS)
Distributed Image Processing On A Network Of Workstations
journal, January 2003
- Li, X. L.; Veeravalli, B.; Ko, C. C.
- International Journal of Computers and Applications, Vol. 25, Issue 2
Real-time edge-aware image processing with the bilateral grid
conference, January 2007
- Chen, Jiawen; Paris, Sylvain; Durand, Frédo
- ACM SIGGRAPH 2007 papers - SIGGRAPH '07
Statistical scalability analysis of communication operations in distributed applications
conference, January 2001
- Vetter, Jeffrey S.; McCracken, Michael O.
- Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming - PPoPP '01
Automatic data mapping for distributed-memory parallel computers
conference, January 1992
- Wholey, Skef
- Proceedings of the 6th international conference on Supercomputing - ICS '92
General Multiprocessor Task Scheduling: Approximate Solutions in Linear Time
book, January 1999
- Jansen, Klaus; Porkolab, Lorant
- Lecture Notes in Computer Science
Forma: a DSL for image processing applications to target GPUs and multi-core CPUs
conference, January 2015
- Ravishankar, Mahesh; Holewinski, Justin; Grover, Vinod
- Proceedings of the 8th Workshop on General Purpose Processing using GPUs - GPGPU 2015
Distributed processing of very large datasets with DataCutter
journal, October 2001
- Beynon, Michael D.; Kurc, Tahsin; Catalyurek, Umit
- Parallel Computing, Vol. 27, Issue 11
LogGP: incorporating long messages into the LogP model---one step closer towards a realistic model for parallel computation
conference, January 1995
- Alexandrov, Albert; Ionescu, Mihai F.; Schauser, Klaus E.
- Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures - SPAA '95
X10: an object-oriented approach to non-uniform cluster computing
conference, January 2005
- Charles, Philippe; Grothoff, Christian; Saraswat, Vijay
- Proceedings of the 20th annual ACM SIGPLAN conference on Object oriented programming systems languages and applications - OOPSLA '05
OpenTuner: an extensible framework for program autotuning
conference, January 2014
- Ansel, Jason; Kamil, Shoaib; Veeramachaneni, Kalyan
- Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14
Scheduling Independent Multiprocessor Tasks
journal, February 2002
- Amoura,
- Algorithmica, Vol. 32, Issue 2
Ghost Cell Pattern
conference, January 2010
- Kjolstad, Fredrik Berg; Snir, Marc
- Proceedings of the 2010 Workshop on Parallel Programming Patterns - ParaPLoP '10
Real-time edge-aware image processing with the bilateral grid
journal, July 2007
- Chen, Jiawen; Paris, Sylvain; Durand, Frédo
- ACM Transactions on Graphics, Vol. 26, Issue 3
Local Laplacian filters: edge-aware image processing with a Laplacian pyramid
journal, July 2011
- Paris, Sylvain; Hasinoff, Samuel W.; Kautz, Jan
- ACM Transactions on Graphics, Vol. 30, Issue 4
Works referencing / citing this record:
Supporting Very Large Models using Automatic Dataflow Graph Partitioning
conference, January 2019
- Wang, Minjie; Huang, Chien-chin; Li, Jinyang
- Proceedings of the Fourteenth EuroSys Conference 2019 - EuroSys '19
Supporting Very Large Models using Automatic Dataflow Graph Partitioning
text, January 2018
- Wang, Minjie; Huang, Chien-chin; Li, Jinyang
- arXiv
Loop Tiling in Large-Scale Stencil Codes at Run-time with OPS
text, January 2017
- Reguly, Istvan Z.; Mudalige, Gihan R.; Giles, Mike B.
- arXiv