APT malware static trace analysis through bigrams and graph edit distance

Bolton, Alexander D.; Anderson-Cook, Christine M.

doi:10.1002/sam.11346

Abstract

Research and business organizations are vulnerable to attack by malware, particularly advanced persistent threat malware tailored for a specific target. Malware identification is made more difficult because samples can be subtly altered to avoid detection by methods that check for an identical match to known code. Different versions of an original piece of malware form a malware family. When new malicious software is identified, reverse engineers seek to identify its origin and purpose. Knowing whether new malware is from a known family or a previously unobserved family aids the efficiency of reverse engineers. This article presents a three-stage method to classify new malware into a family by comparing its similarity to existing static traces and assigning it to the most similar family. First, a fast filtering method creates a shortlist of samples with some similarity to the new malware, using a simple bigram comparison of the instructions. The second stage takes the call graph view of the shortlisted static traces and uses simulated annealing to estimate the graph edit distance, a measure of dissimilarity between graphs. Finally, a random forest classifier combines the previous two results to predict the family to which a new sample belongs. Our paper also considers how to detect when malware is from a new family.
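The first, filtering stage described in the abstract can be illustrated with a minimal sketch. The abstract does not state which bigram similarity measure the authors use, so the weighted Jaccard score below is an assumption, as are the function names (`bigrams`, `bigram_similarity`, `shortlist`), the shortlist size `k`, and the toy instruction traces. It is meant only to show the idea of ranking known samples by how many consecutive instruction pairs they share with a new sample.

```python
from collections import Counter


def bigrams(instructions):
    """Count consecutive instruction pairs in a static trace."""
    return Counter(zip(instructions, instructions[1:]))


def bigram_similarity(a, b):
    """Weighted Jaccard similarity of two bigram multisets.

    Assumed measure for illustration; the paper may use a different one.
    """
    ca, cb = bigrams(a), bigrams(b)
    union = sum((ca | cb).values())
    if union == 0:
        return 0.0
    return sum((ca & cb).values()) / union


def shortlist(new_trace, known_traces, k=2):
    """Return the names of the k known traces most similar to new_trace."""
    ranked = sorted(
        known_traces.items(),
        key=lambda item: bigram_similarity(new_trace, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical toy traces; real static traces are far longer.
known = {
    "family_a_v1": ["push", "mov", "call", "pop", "ret"],
    "family_a_v2": ["push", "mov", "call", "add", "pop", "ret"],
    "family_b_v1": ["xor", "jmp", "xor", "jmp", "nop"],
}
new = ["push", "mov", "call", "pop", "ret"]
candidates = shortlist(new, known, k=2)
```

In a pipeline like the one the abstract outlines, only the shortlisted candidates would be passed on to the more expensive graph edit distance stage.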

Authors:

Bolton, Alexander D. [1]; Anderson-Cook, Christine M. [2]

[1] Imperial College, London (United Kingdom)
[2] Los Alamos National Lab. (LANL), Los Alamos, NM (United States)

Publication Date: 2017-05-17

Research Org.: Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)

Sponsoring Org.: USDOE

OSTI Identifier: 1364539

Report Number(s): LA-UR-16-24029
Journal ID: ISSN 1932-1864

Grant/Contract Number: AC52-06NA25396

Resource Type: Accepted Manuscript

Journal Name: Statistical Analysis and Data Mining

Additional Journal Information: Journal Volume: 10; Journal Issue: 3; Journal ID: ISSN 1932-1864

Publisher: Wiley

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; call graph; family detection; malware detection; random forest; simulated annealing

Citation Formats


Bolton, Alexander D., and Anderson-Cook, Christine M. APT malware static trace analysis through bigrams and graph edit distance. United States: N. p., 2017. Web. doi:10.1002/sam.11346.


Bolton, Alexander D., & Anderson-Cook, Christine M. APT malware static trace analysis through bigrams and graph edit distance. United States. https://doi.org/10.1002/sam.11346


Bolton, Alexander D., and Anderson-Cook, Christine M. 2017. "APT malware static trace analysis through bigrams and graph edit distance". United States. https://doi.org/10.1002/sam.11346. https://www.osti.gov/servlets/purl/1364539.


                    
@article{osti_1364539,

  title        = {APT malware static trace analysis through bigrams and graph edit distance},

  author       = {Bolton, Alexander D. and Anderson-Cook, Christine M.},

  abstractNote = {Research and business organizations are vulnerable to attack by malware, particularly advanced persistent threat malware tailored for a specific target. Malware identification is made more difficult because samples can be subtly altered to avoid detection by methods that check for an identical match to known code. Different versions of an original piece of malware form a malware family. And when new malicious software is identified, reverse engineers seek to identify its origin and purpose. Knowing whether new malware is from a known family or a previously unobserved family aids the efficiency of reverse engineers. Furthermore, this article presents a three-stage method to classify new malware into a family by comparing its similarity to existing static traces, and assigning it to the most similar family. First, a fast filtering method creates a shortlist of samples with some similarity to the new malware, using a simple bigram comparison of the instructions. The second stage takes the call graph view of the shortlisted static traces and uses simulated annealing to estimate the graph edit distance, a measure of dissimilarity between graphs. Finally, a random forest classifier combines the previous two results to predict the family to which a new sample belongs. Our paper also considers how to detect when malware is from a new family.},

  doi          = {10.1002/sam.11346},

  journal      = {Statistical Analysis and Data Mining},

  number       = 3,

  volume       = 10,

  place        = {United States},

  year         = {2017},

  month        = {May}

}

Journal Article:

Free Publicly Available Full Text: Accepted Manuscript (DOE)

Publisher's Version of Record: https://doi.org/10.1002/sam.11346

Works referenced in this record:

Improved call graph comparison using simulated annealing
conference, January 2011

Kostakis, Orestis; Kinable, Joris; Mahmoudi, Hamed
Proceedings of the 2011 ACM Symposium on Applied Computing - SAC '11
DOI: 10.1145/1982185.1982509

Using opcode sequences in single-class learning to detect unknown malware
journal, January 2011

Santos, I.; Brezo, F.; Sanz, B.
IET Information Security, Vol. 5, Issue 4
DOI: 10.1049/iet-ifs.2010.0180

Stochastic identification of malware with dynamic traces
journal, March 2014

Storlie, Curtis; Anderson, Blake; Vander Wiel, Scott
The Annals of Applied Statistics, Vol. 8, Issue 1
DOI: 10.1214/13-AOAS703

Static Malware Analysis Using Machine Learning Methods
book, January 2014

Nath, Hiran V.; Mehtre, Babu M.
Recent Trends in Computer Networks and Distributed Systems Security
DOI: 10.1007/978-3-642-54525-2_39

Comparing stars: on approximating graph edit distance
journal, August 2009

Zeng, Zhiping; Tung, Anthony K. H.; Wang, Jianyong
Proceedings of the VLDB Endowment, Vol. 2, Issue 1
DOI: 10.14778/1687627.1687631

Automated Classification and Analysis of Internet Malware
book, January 2007

Bailey, Michael; Oberheide, Jon; Andersen, Jon
Recent Advances in Intrusion Detection
DOI: 10.1007/978-3-540-74320-0_10

Using Entropy Analysis to Find Encrypted and Packed Malware
journal, March 2007

Lyda, Robert; Hamrock, James
IEEE Security and Privacy Magazine, Vol. 5, Issue 2
DOI: 10.1109/MSP.2007.48

A Biosequence-Based Approach to Software Characterization
conference, May 2016

Oehmen, Christopher S.; Peterson, Elena S.; Phillips, Aaron R.
2016 IEEE Security and Privacy Workshops (SPW)
DOI: 10.1109/SPW.2016.43

A Graph Matching Algorithm Using Data-Driven Markov Chain Monte Carlo Sampling
conference, August 2010

Lee, Jungmin; Cho, Minsu; Lee, Kyoung Mu
2010 20th International Conference on Pattern Recognition
DOI: 10.1109/ICPR.2010.690

Data mining methods for detection of new malicious executables
conference, January 2001

Schultz, M. G.; Eskin, E.; Zadok, F.
Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on
DOI: 10.1109/SECPRI.2001.924286

Optimization by Simulated Annealing
journal, May 1983

Kirkpatrick, S.; Gelatt, C. D.; Vecchi, M. P.
Science, Vol. 220, Issue 4598
DOI: 10.1126/science.220.4598.671

Malware Target Recognition of Unknown Threats
journal, September 2013

Dube, Thomas E.; Raines, Richard A.; Grimaila, Michael R.
IEEE Systems Journal, Vol. 7, Issue 3
DOI: 10.1109/JSYST.2012.2221913

The Hungarian method for the assignment problem
journal, March 1955

Kuhn, H. W.
Naval Research Logistics Quarterly, Vol. 2, Issue 1-2, p. 83-97
DOI: 10.1002/nav.3800020109

Malware classification based on call graph clustering
journal, February 2011

Kinable, Joris; Kostakis, Orestis
Journal in Computer Virology, Vol. 7, Issue 4
DOI: 10.1007/s11416-011-0151-y

Approximate graph edit distance computation by means of bipartite graph matching
journal, June 2009

Riesen, Kaspar; Bunke, Horst
Image and Vision Computing, Vol. 27, Issue 7
DOI: 10.1016/j.imavis.2008.04.004

Random Forests
journal, January 2001

Breiman, Leo
Machine Learning, Vol. 45, Issue 1, p. 5-32
DOI: 10.1023/A:1010933404324

Improving malware classification: bridging the static/dynamic gap
conference, January 2012

Anderson, Blake; Storlie, Curtis; Lane, Terran
Proceedings of the 5th ACM workshop on Security and artificial intelligence - AISec '12
DOI: 10.1145/2381896.2381900

Similar Records in DOE PAGES and OSTI.GOV collections:

Tools for Large-Scale Mobile Malware Analysis

Thesis/Dissertation Bierma, Michael

Analyzing mobile applications for malicious behavior is an important area of research, and is made difficult, in part, by the increasingly large number of applications available for the major operating systems. There are currently over 1.2 million apps available in both the Google Play and Apple App stores (the respective official marketplaces for the Android and iOS operating systems)[1, 2]. Our research provides two large-scale analysis tools to aid in the detection and analysis of mobile malware. The first tool we present, Andlantis, is a scalable dynamic analysis system capable of processing over 3000 ...

Tensor Text-Mining Methods for Malware Identification and Detection, Malware Dynamics Characterization, and Hosts Ranking

Technical Report Alexandrov, Boian ; Eren, Maksim Ekin

Malware is one of the most persistent and costly cyber threats endangering reputation, confidentiality, integrity, and availability for organizations and national security. Consequently, many of the incident detection and prevention systems, and incident responders have begun to utilize machine learning as a helper in the fight against malware and other cyber threats. However, cyber defenders rely on interpretability and generalizability, yet the popular machine learning methods are black-box and often use traditional supervised solutions that do not generalize to novel malware. Therefore, there is a need to improve the existing solutions. At the same time, the majority of the prior ...
https://doi.org/10.2172/1826495

Beyond the Hype: An Evaluation of Commercially Available Machine-Learning-Based Malware Detectors

Journal Article Bridges, Robert A. ; Oesch, Sean ; Iannacone, Michael D. ; ... - Digital Threats: Research and Practice

There is a lack of scientific testing of commercially available malware detectors, especially those that boast accurate classification of never-before-seen (i.e., zero-day) files using machine learning (ML). Consequently, efficacy of malware detectors is opaque, inhibiting end users from making informed decisions and researchers from targeting gaps in current detectors. In this paper, we present a scientific evaluation of four prominent commercial malware detection tools to assist an organization with two primary questions: To what extent do ML-based tools accurately classify previously and never-before-seen files? Is purchasing a network-level malware detector worth the cost? To investigate, we tested each tool against ...
https://doi.org/10.1145/3567432

Deep PDF parsing to extract features for detecting embedded malware.

Technical Report Munson, Miles Arthur ; Cross, Jesse S

The number of PDF files with embedded malicious code has risen significantly in the past few years. This is due to the portability of the file format, the ways Adobe Reader recovers from corrupt PDF files, the addition of many multimedia and scripting extensions to the file format, and many format properties the malware author may use to disguise the presence of malware. Current research focuses on executable, MS Office, and HTML formats. In this paper, several features and properties of PDF files are identified. Features are extracted using an instrumented open source PDF viewer. The feature descriptions of benign ...
https://doi.org/10.2172/1030303

AI ATAC 1: An Evaluation of Prominent Commercial Malware Detectors

Conference Bridges, Robert ; Weber, Brian ; Beaver, Justin M. ; ...

This work presents an evaluation of six prominent commercial endpoint malware detectors, a network malware detector, and a file-conviction algorithm from a cyber technology vendor. The evaluation was administered as the first of the Artificial Intelligence Applications to Autonomous Cybersecurity (AI ATAC) prize challenges, funded by / completed in service of the US Navy. The experiment employed 100K files (50/50% benign/malicious) with a stratified distribution of file types, including ~1K zero-day program executables (increasing experiment size two orders of magnitude over previous work). We present an evaluation process of delivering a file to a fresh virtual machine donning ...
https://doi.org/10.1109/BigData59044.2023.10386590

