OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Order priors for Bayesian network discovery with an application to malware phylogeny

Abstract

Bayesian networks have been used extensively to model and discover dependency relationships among sets of random variables. We learn Bayesian network structure with a combination of human knowledge about the partial ordering of variables and statistical inference of conditional dependencies from observed data. Our approach leverages complementary information from human knowledge and inference from observed data to produce networks that reflect human beliefs about the system as well as to fit the observed data. Applying prior beliefs about partial orderings of variables is an approach distinctly different from existing methods that incorporate prior beliefs about direct dependencies (or edges) in a Bayesian network. We provide an efficient implementation of the partial-order prior in a Bayesian structure discovery learning algorithm, as well as an edge prior, showing that both priors meet the local modularity requirement necessary for an efficient Bayesian discovery algorithm. In benchmark studies, the partial-order prior improves the accuracy of Bayesian network structure learning as well as the edge prior, even though order priors are more general. Our primary motivation is in characterizing the evolution of families of malware to aid cyber security analysts. For the problem of malware phylogeny discovery, we find that our algorithm, compared to existing malware phylogeny algorithms, more accurately discovers true dependencies that are missed by other algorithms.
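
The "local modularity" requirement mentioned in the abstract means that the log prior of a network can be written as a sum of terms, one per variable, each depending only on that variable's parent set. The sketch below is only an illustration of that property and is not the authors' formulation: the per-node penalty, the function names, the `penalty` parameter, and the sample names `v1`..`v3` are all assumptions made for this example.

```python
def order_log_prior(child, parents, precedes, penalty=2.0):
    """Unnormalized log prior for one node's parent set (hypothetical form).

    `precedes` is a set of (a, b) pairs meaning "a is believed to come
    before b". A parent p of `child` contradicts a belief when the pair
    (child, p) is in `precedes`, i.e. the child is believed to come first.
    """
    violations = sum(1 for p in parents if (child, p) in precedes)
    return -penalty * violations


def graph_log_prior(parent_sets, precedes, penalty=2.0):
    """Log prior of a whole graph as a sum of per-node terms.

    The sum over nodes is what makes this prior locally modular: changing
    one node's parents changes exactly one term.
    """
    return sum(
        order_log_prior(child, parents, precedes, penalty)
        for child, parents in parent_sets.items()
    )


# Hypothetical belief: v1 came before v2, and v2 before v3 (closure included).
precedes = {("v1", "v2"), ("v2", "v3"), ("v1", "v3")}

# Each graph maps a node to its set of parents.
consistent = {"v1": set(), "v2": {"v1"}, "v3": {"v2"}}
reversed_graph = {"v1": {"v2"}, "v2": {"v3"}, "v3": set()}

print(graph_log_prior(consistent, precedes))
print(graph_log_prior(reversed_graph, precedes))
```

In this sketch a graph whose edges agree with the believed ordering receives no penalty, while each edge that runs against it lowers the log prior by `penalty`. Because the prior decomposes per node, it could be added to a per-node data score without changing how a structure search evaluates local moves.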

Authors:
Oyen, Diane [1]; Anderson, Blake [2]; Sentz, Kari [1]; Anderson-Cook, Christine Michaela [1]
  1. Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
  2. Cisco Systems Inc., Durham, NC (United States)
Publication Date:
2017-09-15
Research Org.:
Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Sponsoring Org.:
USDOE Laboratory Directed Research and Development (LDRD) Program
OSTI Identifier:
1398911
Report Number(s):
LA-UR-16-23891
Journal ID: ISSN 1932-1864
Grant/Contract Number:
AC52-06NA25396
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Statistical Analysis and Data Mining
Additional Journal Information:
Journal Volume: 10; Journal Issue: 5; Journal ID: ISSN 1932-1864
Publisher:
Wiley
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Bayesian networks; cyber security; malware; probabilistic graphical models

Citation Formats

Oyen, Diane, Anderson, Blake, Sentz, Kari, and Anderson-Cook, Christine Michaela. Order priors for Bayesian network discovery with an application to malware phylogeny. United States: N. p., 2017. Web. doi:10.1002/sam.11364.
Oyen, Diane, Anderson, Blake, Sentz, Kari, & Anderson-Cook, Christine Michaela. (2017). Order priors for Bayesian network discovery with an application to malware phylogeny. United States. doi:10.1002/sam.11364.
Oyen, Diane, Anderson, Blake, Sentz, Kari, and Anderson-Cook, Christine Michaela. 2017. "Order priors for Bayesian network discovery with an application to malware phylogeny". United States. doi:10.1002/sam.11364.
@article{osti_1398911,
title = {Order priors for Bayesian network discovery with an application to malware phylogeny},
author = {Oyen, Diane and Anderson, Blake and Sentz, Kari and Anderson-Cook, Christine Michaela},
abstractNote = {Bayesian networks have been used extensively to model and discover dependency relationships among sets of random variables. We learn Bayesian network structure with a combination of human knowledge about the partial ordering of variables and statistical inference of conditional dependencies from observed data. Our approach leverages complementary information from human knowledge and inference from observed data to produce networks that reflect human beliefs about the system as well as to fit the observed data. Applying prior beliefs about partial orderings of variables is an approach distinctly different from existing methods that incorporate prior beliefs about direct dependencies (or edges) in a Bayesian network. We provide an efficient implementation of the partial-order prior in a Bayesian structure discovery learning algorithm, as well as an edge prior, showing that both priors meet the local modularity requirement necessary for an efficient Bayesian discovery algorithm. In benchmark studies, the partial-order prior improves the accuracy of Bayesian network structure learning as well as the edge prior, even though order priors are more general. Our primary motivation is in characterizing the evolution of families of malware to aid cyber security analysts. For the problem of malware phylogeny discovery, we find that our algorithm, compared to existing malware phylogeny algorithms, more accurately discovers true dependencies that are missed by other algorithms.},
doi = {10.1002/sam.11364},
journal = {Statistical Analysis and Data Mining},
number = 5,
volume = 10,
place = {United States},
year = {2017},
month = {9}
}

Journal Article:
Free Publicly Available Full Text
This content will become publicly available on September 15, 2018
Publisher's Version of Record

  • While the ML-EM (maximum-likelihood-expectation maximization) algorithm for reconstruction for emission tomography is unstable due to the ill-posed nature of the problem, Bayesian reconstruction methods overcome this instability by introducing prior information, often in the form of a spatial smoothness regularizer. More elaborate forms of smoothness constraints may be used to extend the role of the prior beyond that of a stabilizer in order to capture actual spatial information about the object. Previously proposed forms of such prior distributions were based on the assumption of a piecewise constant source distribution. Here, the authors propose an extension to a piecewise linear model--the weak plate--which is more expressive than the piecewise constant model. The weak plate prior not only preserves edges but also allows for piecewise ramplike regions in the reconstruction. Indeed, for the application in SPECT, such ramplike regions are observed in ground-truth source distributions in the form of primate autoradiographs of rCBF radionuclides. To incorporate the weak plate prior in a MAP approach, the authors model the prior as a Gibbs distribution and use a GEM formulation for the optimization. They compare quantitative performance of the ML-EM algorithm, a GEM algorithm with a prior favoring piecewise constant regions, and a GEM algorithm with the weak plate prior. Pointwise and regional bias and variance of ensemble image reconstructions are used as indications of image quality. The results show that the weak plate and membrane priors exhibit improved bias and variance relative to ML-EM techniques.
  • A statistical methodology was applied to the simultaneous calibration and validation of thermodynamic models for the uptake of CO₂ in mesoporous silica-supported amines. The methodology is Bayesian, and follows the procedure introduced by Kennedy and O'Hagan. One key aspect of the application presented is the use of quantum chemical calculations to define prior probability distributions for physical model parameters. Inclusion of this prior information proved to be crucial to the identifiability of model parameters against experimental thermogravimetric data. Through the statistical analysis, a quantitative assessment of the accuracy of various quantum chemical methods is produced. Another important aspect of the current approach is the conditioning of the model form discrepancy – a critical component of the Kennedy and O'Hagan methodology – to the experimental data in such a manner that it becomes an implicit function of the model parameters and thereby connected with the posterior distribution. It is shown that the inclusion of prior information in the analysis leads to a shifting of uncertainty from the posterior distribution for model parameters to this conditioned model form discrepancy. Prospects for more accurate model predictions and propagation of uncertainty in upscaling and extrapolation through a “model-plus-discrepancy” approach are discussed. The synthesis methods and thermogravimetric characterization of hybrid grafted/impregnated mesoporous silica-supported amine sorbents are presented, along with the details of the quantum chemical study, which shows that a carbamic acid–base acceptor complex is the most stable form of adsorbed CO₂ in both alkanol- and ethyleneamines.
  • The authors propose a Bayesian method whereby maximum a posteriori (MAP) estimates of functional (PET and SPECT) images may be reconstructed with the aid of prior information derived from registered anatomical MR images of the same slice. The prior information consists of significant anatomical boundaries that are likely to correspond to discontinuities in an otherwise spatially smooth radionuclide distribution. The algorithm, like others proposed recently, seeks smooth solutions with occasional discontinuities; the contribution here is the inclusion of a coupling term that influences the creation of discontinuities in the vicinity of the significant anatomical boundaries. Simulations on anatomically derived mathematical phantoms are presented. Although computationally intense in its current implementation, the reconstructions are improved (ROI-RMS error) relative to filtered backprojection and EM-ML reconstructions. The simulations show that the inclusion of position-dependent anatomical prior information leads to further improvement relative to Bayesian reconstructions without the anatomical prior. The algorithm exhibits a certain degree of robustness with respect to errors in the location of anatomical boundaries.
  • We present a new technique for overcoming confusion noise in deep far-infrared Herschel space telescope images making use of prior information from shorter λ < 2 μm wavelengths. For the deepest images obtained by Herschel, the flux limit due to source confusion is about a factor of three brighter than the flux limit due to instrumental noise and (smooth) sky background. We have investigated the possibility of de-confusing simulated Herschel PACS 160 μm images by using strong Bayesian priors on the positions and weak priors on the flux of sources. We find the blended sources and group them together and simultaneously fit their fluxes. We derive the posterior probability distribution function of fluxes subject to these priors through Monte Carlo Markov Chain (MCMC) sampling by fitting the image. Assuming we can predict the FIR flux of sources based on the ultraviolet-optical part of their SEDs to within an order of magnitude, the simulations show that we can obtain reliable fluxes and uncertainties at least a factor of three fainter than the confusion noise limit of 3σ_c = 2.7 mJy in our simulated PACS-160 image. This technique could in principle be used to mitigate the effects of source confusion in any situation where one has prior information of positions and plausible fluxes of blended sources. For Herschel, application of this technique will improve our ability to constrain the dust content in normal galaxies at high redshift.