DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Helium: lifting high-performance stencil kernels from stripped x86 binaries to Halide DSL code

Abstract

Highly optimized programs are prone to bit rot, where performance quickly becomes suboptimal in the face of new hardware and compiler techniques. In this paper we show how to automatically lift performance-critical stencil kernels from a stripped x86 binary and generate the corresponding code in the high-level domain-specific language Halide. Using Halide's state-of-the-art optimizations targeting current hardware, we show that new optimized versions of these kernels can replace the originals to rejuvenate the application for newer hardware. The original optimized code for kernels in stripped binaries is nearly impossible to analyze statically. Instead, we rely on dynamic traces to regenerate the kernels. We perform buffer structure reconstruction to identify input, intermediate and output buffer shapes. Here, we abstract from a forest of concrete dependency trees which contain absolute memory addresses to symbolic trees suitable for high-level code generation. This is done by canonicalizing trees, clustering them based on structure, inferring higher-dimensional buffer accesses and finally by solving a set of linear equations based on buffer accesses to lift them up to simple, high-level expressions. Helium can handle highly optimized, complex stencil kernels with input-dependent conditionals. We lift seven kernels from Adobe Photoshop giving a 75% performance improvement, four kernels from IrfanView, leading to 4.97x performance, and one stencil from the miniGMG multigrid benchmark netting a 4.25x improvement in performance. We manually rejuvenated Photoshop by replacing eleven of Photoshop's filters with our lifted implementations, giving 1.12x speedup without affecting the user experience.
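
The last lifting step named in the abstract, solving linear equations over observed buffer accesses, can be illustrated with a small sketch. This is not Helium's implementation. It assumes a row-major 2-D buffer whose base address, row stride and element size are already known, and the helper names (address_to_coords, solve_linear, fit_affine_index) are hypothetical, introduced only for illustration. Given the output coordinates of several dependency trees and the input index each one reads, it recovers an affine index expression of the form a*x + b*y + c.

```python
from fractions import Fraction
from itertools import combinations


def address_to_coords(addr, base, row_stride, elem_size):
    """Map an absolute address to (x, y), assuming a row-major 2-D buffer."""
    y, within_row = divmod(addr - base, row_stride)
    x, remainder = divmod(within_row, elem_size)
    if remainder:
        raise ValueError("address is not aligned to an element boundary")
    return x, y


def solve_linear(matrix, rhs):
    """Solve a square linear system exactly with Gauss-Jordan elimination."""
    n = len(rhs)
    # Augmented matrix of exact rationals, so no rounding error creeps in.
    aug = [[Fraction(v) for v in row] + [Fraction(b)]
           for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero pivot.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_val = aug[col][col]
        aug[col] = [v / pivot_val for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def fit_affine_index(samples):
    """Return (a, b, c) with index == a*x + b*y + c for every sample.

    samples is a list of ((x, y), index) pairs: the output coordinate of a
    dependency tree and the input-buffer index that tree reads.
    """
    points = [((x, y, 1), idx) for (x, y), idx in samples]
    for triple in combinations(points, 3):
        try:
            a, b, c = solve_linear([row for row, _ in triple],
                                   [idx for _, idx in triple])
        except ValueError:
            continue  # degenerate triple (e.g. collinear points); try another
        # Accept the fit only if it reproduces every observed access.
        if all(a * x + b * y + c == idx for (x, y, _), idx in points):
            return a, b, c
        raise ValueError("accesses are not an affine function of (x, y)")
    raise ValueError("not enough independent samples")


if __name__ == "__main__":
    # Hypothetical trace: each output pixel (x, y) reads its left neighbour,
    # so the observed input x-index is x - 1.
    observed = [((1, 1), 0), ((2, 1), 1), ((1, 2), 0), ((5, 7), 4)]
    a, b, c = fit_affine_index(observed)
    print(f"input_x = {a}*x + {b}*y + {c}")
```

Exact rational arithmetic is used here so that a recovered coefficient such as 1 or -1 is not blurred by floating-point error; a real tool would also have to infer the buffer shape itself, which this sketch takes as given.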

Authors:
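Mendis, Charith [1]; Bosboom, Jeffrey [1]; Wu, Kevin [1]; Kamil, Shoaib [1]; Ragan-Kelley, Jonathan [2]; Paris, Sylvain [3]; Zhao, Qin [4]; Amarasinghe, Saman [1]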
  1. Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Computer Science and Artificial Intelligence Lab. (CSAIL)
  2. Stanford Univ., Palo Alto, CA (United States)
  3. Adobe, Cambridge, MA (United States)
  4. Google, Cambridge, MA (United States)
Publication Date:
2015-06-03
Research Org.:
Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Computer Science and Artificial Intelligence Lab. (CSAIL)
Sponsoring Org.:
USDOE Office of Science (SC); Defense Advanced Research Projects Agency (DARPA)
OSTI Identifier:
1457399
Grant/Contract Number:  
SC0005288; SC0008923
Resource Type:
Accepted Manuscript
Journal Name:
ACM SIGPLAN Notices
Additional Journal Information:
Journal Volume: 2015; Journal ID: ISSN 0362-1340
Publisher:
ACM
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Mendis, Charith, Bosboom, Jeffrey, Wu, Kevin, Kamil, Shoaib, Ragan-Kelley, Jonathan, Paris, Sylvain, Zhao, Qin, and Amarasinghe, Saman. Helium: lifting high-performance stencil kernels from stripped x86 binaries to Halide DSL code. United States: N. p., 2015. Web. doi:10.1145/2737924.2737974.
Mendis, Charith, Bosboom, Jeffrey, Wu, Kevin, Kamil, Shoaib, Ragan-Kelley, Jonathan, Paris, Sylvain, Zhao, Qin, & Amarasinghe, Saman (2015). Helium: lifting high-performance stencil kernels from stripped x86 binaries to Halide DSL code. United States. https://doi.org/10.1145/2737924.2737974
Mendis, Charith, Bosboom, Jeffrey, Wu, Kevin, Kamil, Shoaib, Ragan-Kelley, Jonathan, Paris, Sylvain, Zhao, Qin, and Amarasinghe, Saman. 2015. "Helium: lifting high-performance stencil kernels from stripped x86 binaries to Halide DSL code". United States. https://doi.org/10.1145/2737924.2737974. https://www.osti.gov/servlets/purl/1457399.
@article{osti_1457399,
title = {Helium: lifting high-performance stencil kernels from stripped x86 binaries to Halide DSL code},
author = {Mendis, Charith and Bosboom, Jeffrey and Wu, Kevin and Kamil, Shoaib and Ragan-Kelley, Jonathan and Paris, Sylvain and Zhao, Qin and Amarasinghe, Saman},
abstractNote = {Highly optimized programs are prone to bit rot, where performance quickly becomes suboptimal in the face of new hardware and compiler techniques. In this paper we show how to automatically lift performance-critical stencil kernels from a stripped x86 binary and generate the corresponding code in the high-level domain-specific language Halide. Using Halide's state-of-the-art optimizations targeting current hardware, we show that new optimized versions of these kernels can replace the originals to rejuvenate the application for newer hardware. The original optimized code for kernels in stripped binaries is nearly impossible to analyze statically. Instead, we rely on dynamic traces to regenerate the kernels. We perform buffer structure reconstruction to identify input, intermediate and output buffer shapes. Here, we abstract from a forest of concrete dependency trees which contain absolute memory addresses to symbolic trees suitable for high-level code generation. This is done by canonicalizing trees, clustering them based on structure, inferring higher-dimensional buffer accesses and finally by solving a set of linear equations based on buffer accesses to lift them up to simple, high-level expressions. Helium can handle highly optimized, complex stencil kernels with input-dependent conditionals. We lift seven kernels from Adobe Photoshop giving a 75% performance improvement, four kernels from IrfanView, leading to 4.97x performance, and one stencil from the miniGMG multigrid benchmark netting a 4.25x improvement in performance. We manually rejuvenated Photoshop by replacing eleven of Photoshop's filters with our lifted implementations, giving 1.12x speedup without affecting the user experience.},
doi = {10.1145/2737924.2737974},
journal = {ACM SIGPLAN Notices},
volume = 2015,
place = {United States},
year = {2015},
month = jun
}

Citation Metrics:
Cited by: 17 works
Citation information provided by Web of Science

Works referenced in this record:

Integrating profile-driven parallelism detection and machine-learning-based mapping
journal, February 2014

  • Wang, Zheng; Tournavitis, Georgios; Franke, Björn
  • ACM Transactions on Architecture and Code Optimization, Vol. 11, Issue 1
  • DOI: 10.1145/2579561

An auto-tuning framework for parallel multicore stencil computations
conference, April 2010

  • Kamil, Shoaib; Chan, Cy; Oliker, Leonid
  • 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
  • DOI: 10.1109/IPDPS.2010.5470421

SmartDec: Approaching C++ Decompilation
conference, October 2011

  • Fokin, Alexander; Derevenetc, Egor; Chernov, Alexander
  • 2011 18th Working Conference on Reverse Engineering (WCRE)
  • DOI: 10.1109/wcre.2011.49

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
conference, January 2013

  • Ragan-Kelley, Jonathan; Barnes, Connelly; Adams, Andrew
  • Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation - PLDI '13
  • DOI: 10.1145/2491956.2462176

S2E: a platform for in-vivo multi-path analysis of software systems
conference, January 2011

  • Chipounov, Vitaly; Kuznetsov, Volodymyr; Candea, George
  • Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems - ASPLOS '11
  • DOI: 10.1145/1950365.1950396

Transparent dynamic instrumentation
journal, September 2012

  • Bruening, Derek; Zhao, Qin; Amarasinghe, Saman
  • ACM SIGPLAN Notices, Vol. 47, Issue 7
  • DOI: 10.1145/2365864.2151043

Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching
conference, January 2002

  • Wu, Youfeng
  • Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation - PLDI '02
  • DOI: 10.1145/512553.512555

Scalable variable and data type detection in a binary rewriter
journal, June 2013


Analyzing Memory Accesses in x86 Executables
book, January 2004


A compiler-level intermediate representation based binary analysis and rewriting system
conference, January 2013

  • Anand, Kapil; Smithson, Matthew; Elwazeer, Khaled
  • Proceedings of the 8th ACM European Conference on Computer Systems - EuroSys '13
  • DOI: 10.1145/2465351.2465380

Transparent dynamic instrumentation
conference, January 2012

  • Bruening, Derek; Zhao, Qin; Amarasinghe, Saman
  • Proceedings of the 8th ACM SIGPLAN/SIGOPS conference on Virtual Execution Environments - VEE '12
  • DOI: 10.1145/2151024.2151043

An Approach to the Problem of Detranslation of Computer Programs
journal, August 1980


The Paralax infrastructure: automatic parallelization with a helping hand
conference, January 2010

  • Vandierendonck, Hans; Rul, Sean; De Bosschere, Koen
  • Proceedings of the 19th international conference on Parallel architectures and compilation techniques - PACT '10
  • DOI: 10.1145/1854273.1854322

Dynamo: a transparent dynamic optimization system
conference, January 2000

  • Bala, Vasanth; Duesterwald, Evelyn; Banerjia, Sanjeev
  • Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation - PLDI '00
  • DOI: 10.1145/349299.349303

Reverse engineering of binary device drivers with RevNIC
conference, January 2010

  • Chipounov, Vitaly; Candea, George
  • Proceedings of the 5th European conference on Computer systems - EuroSys '10
  • DOI: 10.1145/1755913.1755932

Scalable variable and data type detection in a binary rewriter
conference, January 2013

  • ElWazeer, Khaled; Anand, Kapil; Kotha, Aparna
  • Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation - PLDI '13
  • DOI: 10.1145/2491956.2462165

Automatic Parallelization in a Binary Rewriter
conference, December 2010

  • Kotha, Aparna; Anand, Kapil; Smithson, Matthew
  • 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
  • DOI: 10.1109/MICRO.2010.27

Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching
conference, January 2002

  • Wu, Youfeng
  • Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation - PLDI '02
  • DOI: 10.1145/512529.512555

Reviewers
conference, December 2007

  • 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007)
  • DOI: 10.1109/MICRO.2007.7

OpenTuner: an extensible framework for program autotuning
conference, January 2014

  • Ansel, Jason; Kamil, Shoaib; Veeramachaneni, Kalyan
  • Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14
  • DOI: 10.1145/2628071.2628092

Optimization of geometric multigrid for emerging multi- and manycore processors
conference, November 2012

  • Williams, Samuel; Kalamkar, Dhiraj D.; Singh, Amik
  • 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • DOI: 10.1109/SC.2012.85

HELIX: automatic parallelization of irregular programs for chip multiprocessing
conference, January 2012

  • Campanoni, Simone; Jones, Timothy; Holloway, Glenn
  • Proceedings of the Tenth International Symposium on Code Generation and Optimization - CGO '12
  • DOI: 10.1145/2259016.2259028

Practical and Accurate Low-Level Pointer Analysis
conference, March 2005

  • Guo, Bolei; Bridges, M. J.; Triantafyllis, S.
  • International Symposium on Code Generation and Optimization
  • DOI: 10.1109/CGO.2005.27

Valgrind: a framework for heavyweight dynamic binary instrumentation
conference, January 2007

  • Nethercote, Nicholas; Seward, Julian
  • Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation - PLDI '07
  • DOI: 10.1145/1250734.1250746

A framework for enhancing data reuse via associative reordering
conference, June 2014

  • Stock, Kevin; Kong, Martin; Grosser, Tobias
  • Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation - PLDI '14
  • DOI: 10.1145/2594291.2594342

The pochoir stencil compiler
conference, January 2011

  • Tang, Yuan; Chowdhury, Rezaul Alam; Kuszmaul, Bradley C.
  • Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures - SPAA '11
  • DOI: 10.1145/1989493.1989508

Dynamo: a transparent dynamic optimization system
journal, May 2011

  • Bala, Vasanth; Duesterwald, Evelyn; Banerjia, Sanjeev
  • ACM SIGPLAN Notices, Vol. 46, Issue 4
  • DOI: 10.1145/1988042.1988044

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
journal, June 2013

  • Ragan-Kelley, Jonathan; Barnes, Connelly; Adams, Andrew
  • ACM SIGPLAN Notices, Vol. 48, Issue 6
  • DOI: 10.1145/2499370.2462176

Works referencing / citing this record:

Verified lifting of stencil computations
journal, August 2016

  • Kamil, Shoaib; Cheung, Alvin; Itzhaky, Shachar
  • ACM SIGPLAN Notices, Vol. 51, Issue 6
  • DOI: 10.1145/2980983.2908117

Verified lifting of stencil computations
conference, January 2016

  • Kamil, Shoaib; Cheung, Alvin; Itzhaky, Shachar
  • Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation - PLDI 2016
  • DOI: 10.1145/2908080.2908117

Trace-based affine reconstruction of codes
conference, January 2016

  • Rodríguez, Gabriel; Andión, José M.; Kandemir, Mahmut T.
  • Proceedings of the 2016 International Symposium on Code Generation and Optimization - CGO 2016
  • DOI: 10.1145/2854038.2854056