OSTI.GOV · U.S. Department of Energy, Office of Scientific and Technical Information

Title: Distributed Halide

Journal Article · SIGPLAN
 [1];  [2];  [1]
  1. Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States)
  2. Adobe, Cambridge, MA (United States)

Many image processing tasks are naturally expressed as a pipeline of small computational kernels known as stencils. Halide is a popular domain-specific language and compiler designed to implement image processing algorithms. Halide uses simple language constructs to express what to compute and a separate scheduling co-language to express when and where to perform the computation. This approach has demonstrated performance comparable to or better than hand-optimized code. Until now, however, Halide has been restricted to shared-memory parallel execution, limiting its performance for memory-bandwidth-bound pipelines and large-scale image processing tasks. We present an extension to Halide that supports distributed-memory parallel execution of complex stencil pipelines. These extensions compose with Halide's existing scheduling constructs, allowing expression of complex computation and communication strategies. Existing Halide applications can be distributed with minimal changes, allowing programmers to explore the tradeoff between recomputation and communication with little effort. Approximately 10 new lines of code are needed even for a 200-line, 99-stage application. On nine image processing benchmarks, our extensions give up to a 1.4× speedup on a single node over regular multithreaded execution with the same number of cores, by mitigating the effects of non-uniform memory access. The distributed benchmarks achieve up to 18× speedup on a 16-node testing machine and up to 57× speedup on 64 nodes of the NERSC Cori supercomputer.

Research Organization: Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States)
Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number: SC0005288
OSTI ID: 1557579
Journal Information: SIGPLAN, Vol. 51, Issue 8; ISSN 0362-1340
Publisher: ACM
Country of Publication: United States
Language: English
Citation Metrics: Cited by 11 works (citation information provided by Web of Science)
