Distributed Halide

Denniston, Tyler; Kamil, Shoaib; Amarasinghe, Saman

doi:10.1145/2851141.2851157

Journal Article · January 1, 2016 · SIGPLAN

DOI: https://doi.org/10.1145/2851141.2851157 · OSTI ID: 1557579

Denniston, Tyler [1]; Kamil, Shoaib [2]; Amarasinghe, Saman [1]

[1] Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States)
[2] Adobe, Cambridge, MA (United States)

Many image processing tasks are naturally expressed as a pipeline of small computational kernels known as stencils. Halide is a popular domain-specific language and compiler designed to implement image processing algorithms. Halide uses simple language constructs to express what to compute and a separate scheduling co-language for expressing when and where to perform the computation. This approach has demonstrated performance comparable to or better than hand-optimized code. Until now, however, Halide has been restricted to parallel shared-memory execution, limiting its performance for memory-bandwidth-bound pipelines or large-scale image processing tasks. We present an extension to Halide to support distributed-memory parallel execution of complex stencil pipelines. These extensions compose with the existing scheduling constructs in Halide, allowing expression of complex computation and communication strategies. Existing Halide applications can be distributed with minimal changes, allowing programmers to explore the tradeoff between recomputation and communication with little effort. Approximately 10 new lines of code are needed even for a 200-line, 99-stage application. On nine image processing benchmarks, our extensions give up to a 1.4× speedup on a single node over regular multithreaded execution with the same number of cores, by mitigating the effects of non-uniform memory access. The distributed benchmarks achieve up to 18× speedup on a 16-node testing machine and up to 57× speedup on 64 nodes of the NERSC Cori supercomputer.
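The tradeoff between recomputation and communication mentioned in the abstract can be illustrated with a small simulation. The sketch below is plain Python, not Halide and not the paper's API: the function names (`partition`, `blur`, `run_recompute`, `run_communicate`), the 1-D two-stage 3-point blur, and the clamped boundary handling are all assumptions made for illustration. It splits an array across simulated ranks and compares two strategies: fetching a wider input ghost zone and recomputing the first stage redundantly, versus computing the first stage only on owned data and exchanging its boundary values in a second round.

```python
"""Simulated 1-D, two-stage blur split across ranks (plain Python, not Halide)."""


def partition(n, ranks):
    """Split range(n) into `ranks` contiguous, non-empty slices [a, b)."""
    assert 0 < ranks <= n
    base, extra = divmod(n, ranks)
    bounds, a = [], 0
    for r in range(ranks):
        b = a + base + (1 if r < extra else 0)
        bounds.append((a, b))
        a = b
    return bounds


def blur(buf, i, n):
    """3-point box blur at global index i, clamping at the array edges."""
    return (buf[max(i - 1, 0)] + buf[i] + buf[min(i + 1, n - 1)]) / 3.0


def reference(data):
    """Single-process result: stage 2 applied to stage 1 applied to data."""
    n = len(data)
    stage1 = [blur(data, i, n) for i in range(n)]
    return [blur(stage1, i, n) for i in range(n)]


def run_recompute(data, ranks):
    """One exchange round: fetch 2 input ghost cells per side, recompute stage 1."""
    n = len(data)
    out = [None] * n
    stats = {"values_sent": 0, "rounds": 1, "redundant": 0}
    for a, b in partition(n, ranks):
        lo2, hi2 = max(a - 2, 0), min(b + 2, n)      # input region incl. ghosts
        local_in = {i: data[i] for i in range(lo2, hi2)}
        stats["values_sent"] += (hi2 - lo2) - (b - a)
        lo1, hi1 = max(a - 1, 0), min(b + 1, n)      # stage-1 region incl. halo
        stage1 = {i: blur(local_in, i, n) for i in range(lo1, hi1)}
        stats["redundant"] += (hi1 - lo1) - (b - a)  # stage-1 points not owned
        for i in range(a, b):
            out[i] = blur(stage1, i, n)
    return out, stats


def run_communicate(data, ranks):
    """Two exchange rounds: 1 input ghost per side, then 1 stage-1 ghost per side."""
    n = len(data)
    parts = partition(n, ranks)
    stats = {"values_sent": 0, "rounds": 2, "redundant": 0}
    owned = []
    for a, b in parts:                               # round 1: input ghosts
        lo, hi = max(a - 1, 0), min(b + 1, n)
        local_in = {i: data[i] for i in range(lo, hi)}
        stats["values_sent"] += (hi - lo) - (b - a)
        owned.append({i: blur(local_in, i, n) for i in range(a, b)})
    out = [None] * n
    for r, (a, b) in enumerate(parts):               # round 2: stage-1 ghosts
        stage1 = dict(owned[r])
        if r > 0:
            stage1[a - 1] = owned[r - 1][a - 1]
            stats["values_sent"] += 1
        if r < len(parts) - 1:
            stage1[b] = owned[r + 1][b]
            stats["values_sent"] += 1
        for i in range(a, b):
            out[i] = blur(stage1, i, n)
    return out, stats
```

Both functions return the same output as `reference`. For a 16-element array on 4 simulated ranks, each strategy moves 12 values in total; the recompute variant does so in one round at the cost of 6 redundant stage-1 evaluations, while the communicate variant needs two rounds and no redundant work. This toy case says nothing about which choice is faster on real hardware, which depends on stencil width, dimensionality, and network latency.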

Research Organization: Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

Grant/Contract Number: SC0005288

OSTI ID: 1557579

Journal Information: SIGPLAN, Vol. 51, Issue 8; ISSN 0362-1340

Publisher: ACM

Country of Publication: United States

Language: English

Citation Metrics: cited by 11 works (citation information provided by Web of Science)

Related Subjects

97 MATHEMATICS AND COMPUTING
Distributed memory
Image processing
Stencils
