Helium: lifting high-performance stencil kernels from stripped x86 binaries to Halide DSL code

Mendis, Charith; Bosboom, Jeffrey; Wu, Kevin; Kamil, Shoaib; Ragan-Kelley, Jonathan; Paris, Sylvain; Zhao, Qin; Amarasinghe, Saman

doi:10.1145/2737924.2737974

Abstract

Highly optimized programs are prone to bit rot, where performance quickly becomes suboptimal in the face of new hardware and compiler techniques. In this paper we show how to automatically lift performance-critical stencil kernels from a stripped x86 binary and generate the corresponding code in the high-level domain-specific language Halide. Using Halide's state-of-the-art optimizations targeting current hardware, we show that new optimized versions of these kernels can replace the originals to rejuvenate the application for newer hardware. The original optimized code for kernels in stripped binaries is nearly impossible to analyze statically. Instead, we rely on dynamic traces to regenerate the kernels. We perform buffer structure reconstruction to identify input, intermediate and output buffer shapes. Here, we abstract from a forest of concrete dependency trees that contain absolute memory addresses to symbolic trees suitable for high-level code generation. This is done by canonicalizing trees, clustering them based on structure, inferring higher-dimensional buffer accesses and finally by solving a set of linear equations based on buffer accesses to lift them up to simple, high-level expressions. Helium can handle highly optimized, complex stencil kernels with input-dependent conditionals. We lift seven kernels from Adobe Photoshop giving a 75% performance improvement, four kernels from IrfanView, leading to 4.97x performance, and one stencil from the miniGMG multigrid benchmark netting a 4.25x improvement in performance. We manually rejuvenated Photoshop by replacing eleven of Photoshop's filters with our lifted implementations, giving 1.12x speedup without affecting the user experience.
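The abstract mentions solving a set of linear equations based on buffer accesses. The sketch below illustrates that general idea only and is not Helium's implementation. It assumes a 2-D, row-major buffer with one-byte elements, and every name in it (`solve_linear`, `fit_affine_access`, `stencil_offset`, the sample addresses, the buffer base and the row stride) is hypothetical. Given traced triples of output coordinate and absolute input address, it recovers the affine map `addr = sx*x + sy*y + base` and turns it into a symbolic neighbor offset.

```python
from fractions import Fraction
from itertools import combinations


def solve_linear(matrix, rhs):
    """Solve a square linear system exactly by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augmented matrix of exact rationals, so no rounding error creeps in.
    aug = [[Fraction(v) for v in row] + [Fraction(b)]
           for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero pivot.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def fit_affine_access(samples):
    """Fit addr = sx*x + sy*y + base from (x, y, addr) samples.

    Solves on the first non-degenerate triple of samples, then checks
    that every sample agrees with the fitted mapping.
    """
    for trio in combinations(samples, 3):
        matrix = [[x, y, 1] for x, y, _ in trio]
        rhs = [addr for _, _, addr in trio]
        try:
            sx, sy, base = solve_linear(matrix, rhs)
        except ValueError:
            continue  # collinear coordinates, try another triple
        break
    else:
        raise ValueError("samples do not determine an affine mapping")
    for x, y, addr in samples:
        if sx * x + sy * y + base != addr:
            raise ValueError("access is not affine in (x, y)")
    return sx, sy, base


def stencil_offset(base, buffer_base, row_stride):
    """Convert the fitted constant term into a (dx, dy) neighbor offset.

    Assumes one-byte elements and |dx| smaller than half a row.
    """
    delta = Fraction(base) - buffer_base
    dy = round(delta / row_stride)
    dx = delta - dy * row_stride
    return int(dx), int(dy)


# Hypothetical trace: output pixel (x, y) reads its left neighbor from a
# buffer that starts at 0x2000 with a 640-byte row stride.
samples = [
    (1, 0, 0x2000),
    (2, 0, 0x2001),
    (1, 1, 0x2000 + 640),
    (5, 3, 0x2000 + 3 * 640 + 4),
]
sx, sy, base = fit_affine_access(samples)
dx, dy = stencil_offset(base, buffer_base=0x2000, row_stride=int(sy))
print(f"input(x + ({dx}), y + ({dy}))")  # prints: input(x + (-1), y + (0))
```

Exact rational arithmetic is used here so that the validation step can compare addresses with `==`; a floating-point least-squares fit would need a tolerance instead.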

Authors:

Mendis, Charith [1]; Bosboom, Jeffrey [1]; Wu, Kevin [1]; Kamil, Shoaib [1]; Ragan-Kelley, Jonathan [2]; Paris, Sylvain [3]; Zhao, Qin [4]; Amarasinghe, Saman [1]

[1] Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Computer Science and Artificial Intelligence Lab. (CSAIL)
[2] Stanford Univ., Palo Alto, CA (United States)
[3] Adobe, Cambridge, MA (United States)
[4] Google, Cambridge, MA (United States)

Publication Date: 2015-06-03

Research Org.: Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Computer Science and Artificial Intelligence Lab. (CSAIL)

Sponsoring Org.: USDOE Office of Science (SC); Defense Advanced Research Projects Agency (DARPA)

OSTI Identifier: 1457399

Grant/Contract Number: SC0005288; SC0008923

Resource Type: Accepted Manuscript

Journal Name: ACM SIGPLAN Notices

Additional Journal Information: Journal Volume: 2015; Journal ID: ISSN 0362-1340

Publisher: ACM

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING

Citation Formats


Mendis, Charith, Bosboom, Jeffrey, Wu, Kevin, Kamil, Shoaib, Ragan-Kelley, Jonathan, Paris, Sylvain, Zhao, Qin, and Amarasinghe, Saman. Helium: lifting high-performance stencil kernels from stripped x86 binaries to Halide DSL code. United States: N. p., 2015. Web. doi:10.1145/2737924.2737974.

Mendis, Charith, Bosboom, Jeffrey, Wu, Kevin, Kamil, Shoaib, Ragan-Kelley, Jonathan, Paris, Sylvain, Zhao, Qin, & Amarasinghe, Saman. Helium: lifting high-performance stencil kernels from stripped x86 binaries to Halide DSL code. United States. https://doi.org/10.1145/2737924.2737974

Mendis, Charith, Bosboom, Jeffrey, Wu, Kevin, Kamil, Shoaib, Ragan-Kelley, Jonathan, Paris, Sylvain, Zhao, Qin, and Amarasinghe, Saman. 2015. "Helium: lifting high-performance stencil kernels from stripped x86 binaries to Halide DSL code". United States. https://doi.org/10.1145/2737924.2737974. https://www.osti.gov/servlets/purl/1457399.

@article{osti_1457399,
  title        = {Helium: lifting high-performance stencil kernels from stripped x86 binaries to {Halide} {DSL} code},
  author       = {Mendis, Charith and Bosboom, Jeffrey and Wu, Kevin and Kamil, Shoaib and Ragan-Kelley, Jonathan and Paris, Sylvain and Zhao, Qin and Amarasinghe, Saman},
  abstractNote = {Highly optimized programs are prone to bit rot, where performance quickly becomes suboptimal in the face of new hardware and compiler techniques. In this paper we show how to automatically lift performance-critical stencil kernels from a stripped x86 binary and generate the corresponding code in the high-level domain-specific language Halide. Using Halide's state-of-the-art optimizations targeting current hardware, we show that new optimized versions of these kernels can replace the originals to rejuvenate the application for newer hardware. The original optimized code for kernels in stripped binaries is nearly impossible to analyze statically. Instead, we rely on dynamic traces to regenerate the kernels. We perform buffer structure reconstruction to identify input, intermediate and output buffer shapes. Here, we abstract from a forest of concrete dependency trees that contain absolute memory addresses to symbolic trees suitable for high-level code generation. This is done by canonicalizing trees, clustering them based on structure, inferring higher-dimensional buffer accesses and finally by solving a set of linear equations based on buffer accesses to lift them up to simple, high-level expressions. Helium can handle highly optimized, complex stencil kernels with input-dependent conditionals. We lift seven kernels from Adobe Photoshop giving a 75\% performance improvement, four kernels from IrfanView, leading to 4.97x performance, and one stencil from the miniGMG multigrid benchmark netting a 4.25x improvement in performance. We manually rejuvenated Photoshop by replacing eleven of Photoshop's filters with our lifted implementations, giving 1.12x speedup without affecting the user experience.},
  doi          = {10.1145/2737924.2737974},
  journal      = {ACM SIGPLAN Notices},
  volume       = {2015},
  place        = {United States},
  year         = {2015},
  month        = jun
}

Publisher's Version of Record: https://doi.org/10.1145/2737924.2737974

Citation Metrics:

Cited by: 17 works (citation information provided by Web of Science)


Works referenced in this record:

Integrating profile-driven parallelism detection and machine-learning-based mapping
journal, February 2014

Wang, Zheng; Tournavitis, Georgios; Franke, Björn
ACM Transactions on Architecture and Code Optimization, Vol. 11, Issue 1
DOI: 10.1145/2579561

An auto-tuning framework for parallel multicore stencil computations
conference, April 2010

Kamil, Shoaib; Chan, Cy; Oliker, Leonid
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
DOI: 10.1109/IPDPS.2010.5470421

SmartDec: Approaching C++ Decompilation
conference, October 2011

Fokin, Alexander; Derevenetc, Egor; Chernov, Alexander
2011 18th Working Conference on Reverse Engineering (WCRE)
DOI: 10.1109/wcre.2011.49

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
conference, January 2013

Ragan-Kelley, Jonathan; Barnes, Connelly; Adams, Andrew
Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation - PLDI '13
DOI: 10.1145/2491956.2462176

S2E: a platform for in-vivo multi-path analysis of software systems
conference, January 2011

Chipounov, Vitaly; Kuznetsov, Volodymyr; Candea, George
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems - ASPLOS '11
DOI: 10.1145/1950365.1950396

Transparent dynamic instrumentation
journal, September 2012

Bruening, Derek; Zhao, Qin; Amarasinghe, Saman
ACM SIGPLAN Notices, Vol. 47, Issue 7
DOI: 10.1145/2365864.2151043

Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching
conference, January 2002

Wu, Youfeng
Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation - PLDI '02
DOI: 10.1145/512553.512555

Scalable variable and data type detection in a binary rewriter
journal, June 2013

ElWazeer, Khaled; Anand, Kapil; Kotha, Aparna
ACM SIGPLAN Notices, Vol. 48, Issue 6
DOI: 10.1145/2499370.2462165

Analyzing Memory Accesses in x86 Executables
book, January 2004

Balakrishnan, Gogul; Reps, Thomas
Lecture Notes in Computer Science
DOI: 10.1007/978-3-540-24723-4_2

A compiler-level intermediate representation based binary analysis and rewriting system
conference, January 2013

Anand, Kapil; Smithson, Matthew; Elwazeer, Khaled
Proceedings of the 8th ACM European Conference on Computer Systems - EuroSys '13
DOI: 10.1145/2465351.2465380

Transparent dynamic instrumentation
conference, January 2012

Bruening, Derek; Zhao, Qin; Amarasinghe, Saman
Proceedings of the 8th ACM SIGPLAN/SIGOPS conference on Virtual Execution Environments - VEE '12
DOI: 10.1145/2151024.2151043

An Approach to the Problem of Detranslation of Computer Programs
journal, August 1980

Horspool, R. N.; Marovac, N.
The Computer Journal, Vol. 23, Issue 3
DOI: 10.1093/comjnl/23.3.223

The Paralax infrastructure: automatic parallelization with a helping hand
conference, January 2010

Vandierendonck, Hans; Rul, Sean; De Bosschere, Koen
Proceedings of the 19th international conference on Parallel architectures and compilation techniques - PACT '10
DOI: 10.1145/1854273.1854322

Dynamo: a transparent dynamic optimization system
conference, January 2000

Bala, Vasanth; Duesterwald, Evelyn; Banerjia, Sanjeev
Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation - PLDI '00
DOI: 10.1145/349299.349303

Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching
journal, May 2002

Wu, Youfeng
ACM SIGPLAN Notices, Vol. 37, Issue 5
DOI: 10.1145/543552.512555

Reverse engineering of binary device drivers with RevNIC
conference, January 2010

Chipounov, Vitaly; Candea, George
Proceedings of the 5th European conference on Computer systems - EuroSys '10
DOI: 10.1145/1755913.1755932

Scalable variable and data type detection in a binary rewriter
conference, January 2013

ElWazeer, Khaled; Anand, Kapil; Kotha, Aparna
Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation - PLDI '13
DOI: 10.1145/2491956.2462165

Automatic Parallelization in a Binary Rewriter
conference, December 2010

Kotha, Aparna; Anand, Kapil; Smithson, Matthew
2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
DOI: 10.1109/MICRO.2010.27

Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching
conference, January 2002

Wu, Youfeng
Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation - PLDI '02
DOI: 10.1145/512529.512555

Reviewers
conference, December 2007

40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007)
DOI: 10.1109/MICRO.2007.7

OpenTuner: an extensible framework for program autotuning
conference, January 2014

Ansel, Jason; Kamil, Shoaib; Veeramachaneni, Kalyan
Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14
DOI: 10.1145/2628071.2628092

Optimization of geometric multigrid for emerging multi- and manycore processors
conference, November 2012

Williams, Samuel; Kalamkar, Dhiraj D.; Singh, Amik
2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
DOI: 10.1109/SC.2012.85

HELIX: automatic parallelization of irregular programs for chip multiprocessing
conference, January 2012

Campanoni, Simone; Jones, Timothy; Holloway, Glenn
Proceedings of the Tenth International Symposium on Code Generation and Optimization - CGO '12
DOI: 10.1145/2259016.2259028

Practical and Accurate Low-Level Pointer Analysis
conference, March 2005

Guo, Bolei; Bridges, M. J.; Triantafyllis, S.
International Symposium on Code Generation and Optimization
DOI: 10.1109/CGO.2005.27

Valgrind: a framework for heavyweight dynamic binary instrumentation
conference, January 2007

Nethercote, Nicholas; Seward, Julian
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation - PLDI '07
DOI: 10.1145/1250734.1250746

A framework for enhancing data reuse via associative reordering
conference, January 2013

Stock, Kevin; Kong, Martin; Grosser, Tobias
Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation - PLDI '14
DOI: 10.1145/2594291.2594342

The Pochoir stencil compiler
conference, January 2011

Tang, Yuan; Chowdhury, Rezaul Alam; Kuszmaul, Bradley C.
Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures - SPAA '11
DOI: 10.1145/1989493.1989508

Dynamo: a transparent dynamic optimization system
journal, May 2011

Bala, Vasanth; Duesterwald, Evelyn; Banerjia, Sanjeev
ACM SIGPLAN Notices, Vol. 46, Issue 4
DOI: 10.1145/1988042.1988044

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
journal, June 2013

Ragan-Kelley, Jonathan; Barnes, Connelly; Adams, Andrew
ACM SIGPLAN Notices, Vol. 48, Issue 6
DOI: 10.1145/2499370.2462176

Works referencing / citing this record:

Verified lifting of stencil computations
journal, August 2016

Kamil, Shoaib; Cheung, Alvin; Itzhaky, Shachar
ACM SIGPLAN Notices, Vol. 51, Issue 6
DOI: 10.1145/2980983.2908117

Verified lifting of stencil computations
conference, January 2016

Kamil, Shoaib; Cheung, Alvin; Itzhaky, Shachar
Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation - PLDI 2016
DOI: 10.1145/2908080.2908117

Trace-based affine reconstruction of codes
conference, January 2016

Rodríguez, Gabriel; Andión, José M.; Kandemir, Mahmut T.
Proceedings of the 2016 International Symposium on Code Generation and Optimization - CGO 2016
DOI: 10.1145/2854038.2854056

Similar Records in DOE PAGES and OSTI.GOV collections:

Verified lifting of stencil computations

Journal Article: Kamil, Shoaib; Cheung, Alvin; Itzhaky, Shachar; ... - ACM SIGPLAN Notices

This paper demonstrates a novel combination of program synthesis and verification to lift stencil computations from low-level Fortran code to a high-level summary expressed using a predicate language. The technique is sound and mostly automated, and leverages counter-example guided inductive synthesis (CEGIS) to find provably correct translations. Lifting existing code to a high-performance description language has a number of benefits, including maintainability and performance portability. For example, our experiments show that the lifted summaries can enable domain specific compilers to do a better job of parallelization as compared to an off-the-shelf compiler working on the original code, and can even ...
Cited by 25
https://doi.org/10.1145/2908080.2908117

Compiler-Directed Transformation for Higher-Order Stencils

Conference: Basu, Protonu; Hall, Mary; Williams, Samuel; ...

As the cost of data movement increasingly dominates performance, developers of finite-volume and finite-difference solutions for partial differential equations (PDEs) are exploring novel higher-order stencils that increase numerical accuracy and computational intensity. This paper describes a new compiler reordering transformation applied to stencil operators that performs partial sums in buffers, and reuses the partial sums in computing multiple results. This optimization has multiple effects on improving stencil performance that are particularly important to higher-order stencils: exploits data reuse, reduces floating-point operations, and exposes efficient SIMD parallelism to backend compilers. We study the benefit of this optimization in the context of ...
https://doi.org/10.1109/IPDPS.2015.103

Snowflake: A Lightweight Portable Stencil DSL

Journal Article: Zhang, Nathan; Driscoll, Michael; Markley, Charles; ... - Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017

Stencil computations are not well optimized by general-purpose production compilers and the increased use of multicore, manycore, and accelerator-based systems makes the optimization problem even more challenging. In this paper we present Snowflake, a Domain Specific Language (DSL) for stencils that uses a 'micro-compiler' approach, i.e., small, focused, domain-specific code generators. The approach is similar to that used in image processing stencils, but Snowflake handles the much more complex stencils that arise in scientific computing, including complex boundary conditions, higher-order operators (larger stencils), higher dimensions, variable coefficients, non-unit-stride iteration spaces, and multiple input or output meshes. Snowflake is embedded in ...
Cited by 6
https://doi.org/10.1109/IPDPSW.2017.89

Distributed Halide

Journal Article: Denniston, Tyler; Kamil, Shoaib; Amarasinghe, Saman - SIGPLAN

Many image processing tasks are naturally expressed as a pipeline of small computational kernels known as stencils. Halide is a popular domain-specific language and compiler designed to implement image processing algorithms. Halide uses simple language constructs to express what to compute and a separate scheduling co-language for expressing when and where to perform the computation. This approach has demonstrated performance comparable to or better than hand-optimized code. Until now, however, Halide has been restricted to parallel shared memory execution, limiting its performance for memory-bandwidth-bound pipelines or large-scale image processing tasks. We present an extension to Halide to support distributed-memory parallel ...
Cited by 11
https://doi.org/10.1145/2851141.2851157

Revisiting Temporal Blocking Stencil Optimizations

Conference: Zhang, Lingqi; Wahib, Mohamed; Chen, Peng; ...

Iterative stencils are used widely across the spectrum of High Performance Computing (HPC) applications. Many efforts have been put into optimizing stencil GPU kernels, given the prevalence of GPU-accelerated supercomputers. To improve the data locality, temporal blocking is an optimization that combines a batch of time steps to process them together. Under the observation that GPUs are evolving to resemble CPUs in some aspects, we revisit temporal blocking optimizations for GPUs. We explore how temporal blocking schemes can be adapted to the new features in the recent Nvidia GPUs, including large scratchpad memory, hardware prefetching, and device-wide synchronization. We propose ...
https://doi.org/10.1145/3577193.3593716
