Distributed Halide

Denniston, Tyler; Kamil, Shoaib; Amarasinghe, Saman

doi:10.1145/2851141.2851157

Title: Distributed Halide

Abstract

Many image processing tasks are naturally expressed as a pipeline of small computational kernels known as stencils. Halide is a popular domain-specific language and compiler designed to implement image processing algorithms. Halide uses simple language constructs to express what to compute and a separate scheduling co-language for expressing when and where to perform the computation. This approach has demonstrated performance comparable to or better than hand-optimized code. Until now, however, Halide has been restricted to parallel shared memory execution, limiting its performance for memory-bandwidth-bound pipelines or large-scale image processing tasks. We present an extension to Halide to support distributed-memory parallel execution of complex stencil pipelines. These extensions compose with the existing scheduling constructs in Halide, allowing expression of complex computation and communication strategies. Existing Halide applications can be distributed with minimal changes, allowing programmers to explore the tradeoff between recomputation and communication with little effort. Approximately 10 new lines of code are needed even for a 200 line, 99 stage application. On nine image processing benchmarks, our extensions give up to a 1.4× speedup on a single node over regular multithreaded execution with the same number of cores, by mitigating the effects of non-uniform memory access. The distributed benchmarks achieve up to 18× speedup on a 16 node testing machine and up to 57× speedup on 64 nodes of the NERSC Cori supercomputer.
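The tradeoff between recomputation and communication mentioned in the abstract can be illustrated with a small simulation. The sketch below is not Halide code and does not use the paper's API; it is a plain Python model, assuming a two-stage 1D blur pipeline split across simulated ranks. All function names (`blur3`, `run_recompute`, `run_exchange`) are hypothetical, and the sketch assumes every rank owns at least two output elements.

```python
from typing import List, Tuple


def blur3(xs: List[float]) -> List[float]:
    """3-point box stencil with no boundary handling (output is 2 shorter)."""
    return [(xs[i - 1] + xs[i] + xs[i + 1]) / 3.0 for i in range(1, len(xs) - 1)]


def pipeline(xs: List[float]) -> List[float]:
    """Reference two-stage pipeline run on one node (output is 4 shorter)."""
    return blur3(blur3(xs))


def split(n_out: int, ranks: int) -> List[Tuple[int, int]]:
    """Contiguous [lo, hi) output ranges, one per simulated rank."""
    bounds = [n_out * r // ranks for r in range(ranks + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def run_recompute(x: List[float], ranks: int) -> Tuple[List[float], int]:
    """Each rank reads a wider input slab and recomputes stage-1 halo values."""
    out: List[float] = []
    stage1_evals = 0
    for lo, hi in split(len(x) - 4, ranks):
        s1 = blur3(x[lo:hi + 4])      # includes 2 redundantly computed values
        stage1_evals += len(s1)
        out.extend(blur3(s1))
    return out, stage1_evals


def run_exchange(x: List[float], ranks: int) -> Tuple[List[float], int, int]:
    """Each rank computes only the stage-1 values it owns, then receives
    2 stage-1 values from its right neighbour (a simulated halo exchange)."""
    parts = split(len(x) - 4, ranks)
    owned = []
    for r, (lo, hi) in enumerate(parts):
        end = hi + 2 if r == ranks - 1 else hi   # last rank owns the tail
        owned.append(blur3(x[lo:end + 2]))
    stage1_evals = sum(len(o) for o in owned)
    out: List[float] = []
    values_sent = 0
    for r in range(ranks):
        halo = owned[r + 1][:2] if r < ranks - 1 else []
        values_sent += len(halo)
        out.extend(blur3(owned[r] + halo))
    return out, stage1_evals, values_sent
```

Both strategies produce the same output as the single-node pipeline. The first performs extra stage-1 work at every partition boundary and sends no intermediate values; the second does no redundant work but moves intermediate values between neighbours. Which one is faster depends on the relative cost of computation and communication on the target machine.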

Authors:

Denniston, Tyler [1]; Kamil, Shoaib [2]; Amarasinghe, Saman [1]

[1] Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States)
[2] Adobe, Cambridge, MA (United States)

Publication Date: 2016-01-01

Research Org.: Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States)

Sponsoring Org.: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

OSTI Identifier: 1557579

Grant/Contract Number: SC0005288

Resource Type: Accepted Manuscript

Journal Name: SIGPLAN

Additional Journal Information: Journal Volume: 51; Journal Issue: 8; Journal ID: ISSN 0362-1340

Publisher: ACM

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; Distributed memory; Image processing; Stencils

Citation Formats


Denniston, Tyler, Kamil, Shoaib, and Amarasinghe, Saman. Distributed Halide. United States: N. p., 2016. Web. doi:10.1145/2851141.2851157.

Denniston, Tyler, Kamil, Shoaib, & Amarasinghe, Saman. Distributed Halide. United States. https://doi.org/10.1145/2851141.2851157

Denniston, Tyler, Kamil, Shoaib, and Amarasinghe, Saman. 2016. "Distributed Halide". United States. https://doi.org/10.1145/2851141.2851157. https://www.osti.gov/servlets/purl/1557579.


                    
@article{osti_1557579,
  title        = {Distributed Halide},
  author       = {Denniston, Tyler and Kamil, Shoaib and Amarasinghe, Saman},
  abstractNote = {Many image processing tasks are naturally expressed as a pipeline of small computational kernels known as stencils. Halide is a popular domain-specific language and compiler designed to implement image processing algorithms. Halide uses simple language constructs to express what to compute and a separate scheduling co-language for expressing when and where to perform the computation. This approach has demonstrated performance comparable to or better than hand-optimized code. Until now, however, Halide has been restricted to parallel shared memory execution, limiting its performance for memory-bandwidth-bound pipelines or large-scale image processing tasks. We present an extension to Halide to support distributed-memory parallel execution of complex stencil pipelines. These extensions compose with the existing scheduling constructs in Halide, allowing expression of complex computation and communication strategies. Existing Halide applications can be distributed with minimal changes, allowing programmers to explore the tradeoff between recomputation and communication with little effort. Approximately 10 new lines of code are needed even for a 200 line, 99 stage application. On nine image processing benchmarks, our extensions give up to a 1.4× speedup on a single node over regular multithreaded execution with the same number of cores, by mitigating the effects of non-uniform memory access. The distributed benchmarks achieve up to 18× speedup on a 16 node testing machine and up to 57× speedup on 64 nodes of the NERSC Cori supercomputer.},
  doi          = {10.1145/2851141.2851157},
  journal      = {SIGPLAN},
  number       = 8,
  volume       = 51,
  place        = {United States},
  year         = {2016},
  month        = {jan}
}

Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.1145/2851141.2851157


Citation Metrics:

Cited by: 11 works

Citation information provided by
Web of Science


Works referenced in this record:

An auto-tuning framework for parallel multicore stencil computations
conference, April 2010

Kamil, Shoaib; Chan, Cy; Oliker, Leonid
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
DOI: 10.1109/IPDPS.2010.5470421

The pochoir stencil compiler
conference, January 2011

Tang, Yuan; Chowdhury, Rezaul Alam; Kuszmaul, Bradley C.
Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures - SPAA '11
DOI: 10.1145/1989493.1989508

A stencil compiler for short-vector SIMD architectures
conference, January 2013

Henretty, Tom; Veras, Richard; Franchetti, Franz
Proceedings of the 27th international ACM conference on International conference on supercomputing - ICS '13
DOI: 10.1145/2464996.2467268

PolyMage: Automatic Optimization for Image Processing Pipelines
conference, January 2015

Mullapudi, Ravi Teja; Vasista, Vinay; Bondhugula, Uday
Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '15
DOI: 10.1145/2694344.2694364

Optimal scheduling algorithm for distributed-memory machines
journal, January 1998

Darbha, S.; Agrawal, D. P.
IEEE Transactions on Parallel and Distributed Systems, Vol. 9, Issue 1
DOI: 10.1109/71.655248

Scheduling Malleable Parallel Tasks: An Asymptotic Fully Polynomial-Time Approximation Scheme
book, January 2002

Jansen, Klaus
Algorithms — ESA 2002
DOI: 10.1007/3-540-45749-6_50

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
conference, January 2013

Ragan-Kelley, Jonathan; Barnes, Connelly; Adams, Andrew
Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation - PLDI '13
DOI: 10.1145/2491956.2462176

Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers
conference, January 2011

Maruyama, Naoya; Nomura, Tatsuo; Sato, Kento
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11
DOI: 10.1145/2063384.2063398

PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures
conference, May 2011

Christen, Matthias; Schenk, Olaf; Burkhart, Helmar
2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS)
DOI: 10.1109/IPDPS.2011.70

Distributed Image Processing On A Network Of Workstations
journal, January 2003

Li, X. L.; Veeravalli, B.; Ko, C. C.
International Journal of Computers and Applications, Vol. 25, Issue 2
DOI: 10.1080/1206212X.2003.11441695

Real-time edge-aware image processing with the bilateral grid
conference, January 2007

Chen, Jiawen; Paris, Sylvain; Durand, Frédo
ACM SIGGRAPH 2007 papers on - SIGGRAPH '07
DOI: 10.1145/1275808.1276506

Statistical scalability analysis of communication operations in distributed applications
conference, January 2001

Vetter, Jeffrey S.; McCracken, Michael O.
Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming - PPoPP '01
DOI: 10.1145/379539.379590

Automatic data mapping for distributed-memory parallel computers
conference, January 1992

Wholey, Skef
Proceedings of the 6th international conference on Supercomputing - ICS '92
DOI: 10.1145/143369.143377

General Multiprocessor Task Scheduling: Approximate Solutions in Linear Time
book, January 1999

Jansen, Klaus; Porkolab, Lorant
Lecture Notes in Computer Science
DOI: 10.1007/3-540-48447-7_13

Forma: a DSL for image processing applications to target GPUs and multi-core CPUs
conference, January 2015

Ravishankar, Mahesh; Holewinski, Justin; Grover, Vinod
Proceedings of the 8th Workshop on General Purpose Processing using GPUs - GPGPU 2015
DOI: 10.1145/2716282.2716290

Distributed processing of very large datasets with DataCutter
journal, October 2001

Beynon, Michael D.; Kurc, Tahsin; Catalyurek, Umit
Parallel Computing, Vol. 27, Issue 11
DOI: 10.1016/S0167-8191(01)00099-0

LogGP: incorporating long messages into the LogP model---one step closer towards a realistic model for parallel computation
conference, January 1995

Alexandrov, Albert; Ionescu, Mihai F.; Schauser, Klaus E.
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures - SPAA '95
DOI: 10.1145/215399.215427

X10: an object-oriented approach to non-uniform cluster computing
conference, January 2005

Charles, Philippe; Grothoff, Christian; Saraswat, Vijay
Proceedings of the 20th annual ACM SIGPLAN conference on Object oriented programming systems languages and applications - OOPSLA '05
DOI: 10.1145/1094811.1094852

OpenTuner: an extensible framework for program autotuning
conference, January 2014

Ansel, Jason; Kamil, Shoaib; Veeramachaneni, Kalyan
Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14
DOI: 10.1145/2628071.2628092

Scheduling Independent Multiprocessor Tasks
journal, February 2002

Amoura,
Algorithmica, Vol. 32, Issue 2
DOI: 10.1007/s00453-001-0076-9

Ghost Cell Pattern
conference, January 2010

Kjolstad, Fredrik Berg; Snir, Marc
Proceedings of the 2010 Workshop on Parallel Programming Patterns - ParaPLoP '10
DOI: 10.1145/1953611.1953615

Real-time edge-aware image processing with the bilateral grid
journal, July 2007

Chen, Jiawen; Paris, Sylvain; Durand, Frédo
ACM Transactions on Graphics, Vol. 26, Issue 3
DOI: 10.1145/1276377.1276506

Local Laplacian filters: edge-aware image processing with a Laplacian pyramid
journal, July 2011

Paris, Sylvain; Hasinoff, Samuel W.; Kautz, Jan
ACM Transactions on Graphics, Vol. 30, Issue 4
DOI: 10.1145/2010324.1964963

Works referencing / citing this record:

Supporting Very Large Models using Automatic Dataflow Graph Partitioning
conference, January 2019

Wang, Minjie; Huang, Chien-chin; Li, Jinyang
Proceedings of the Fourteenth EuroSys Conference 2019 - EuroSys '19
DOI: 10.1145/3302424.3303953

Supporting Very Large Models using Automatic Dataflow Graph Partitioning
text, January 2018

Wang, Minjie; Huang, Chien-chin; Li, Jinyang
arXiv
DOI: 10.48550/arxiv.1807.08887

Loop Tiling in Large-Scale Stencil Codes at Run-time with OPS
text, January 2017

Reguly, Istvan Z.; Mudalige, Gihan R.; Giles, Mike B.
arXiv
DOI: 10.48550/arxiv.1704.00693
