Title: Blazing Signature Filter: a library for fast pairwise similarity comparisons

Abstract

Identifying similarities between datasets is a fundamental task in data mining and has become an integral part of modern scientific investigation. Whether the task is to identify co-expressed genes in large-scale expression surveys or to predict combinations of gene knockouts which would elicit a similar phenotype, the underlying computational task is often a multi-dimensional similarity test. As datasets continue to grow, improvements to the efficiency, sensitivity or specificity of such computation will have broad impact, as they allow scientists to more completely explore the wealth of scientific data. A significant practical drawback of large-scale data mining is that the vast majority of pairwise comparisons are unlikely to be relevant, meaning that they do not share a signature of interest. It is therefore essential to identify these unproductive comparisons as rapidly as possible and exclude them from more time-intensive similarity calculations. The Blazing Signature Filter (BSF) is a highly efficient pairwise similarity algorithm which enables extensive data mining within a reasonable amount of time. The algorithm transforms datasets into binary metrics, allowing it to utilize computationally efficient bit operators and provide a coarse measure of similarity. As a result, the BSF can scale to high dimensionality and rapidly filter unproductive pairwise comparisons. Furthermore, two bioinformatics applications of the tool are presented to demonstrate the ability to scale to billions of pairwise comparisons and the usefulness of this approach.
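
The filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the BSF library's API or the authors' implementation; the function names (`binarize`, `shared_bits`, `filter_pairs`), the threshold rule, and the toy data are all assumptions made for illustration. Each numeric profile is packed into a single integer bit signature, and a bitwise AND followed by a bit count gives a coarse overlap score that can be used to discard unpromising pairs before any more expensive similarity calculation.

```python
from itertools import combinations


def binarize(rows, threshold):
    """Pack each row into one integer; bit j is set when rows[i][j] > threshold.

    The fixed-threshold rule is an assumption for this sketch.
    """
    packed = []
    for row in rows:
        bits = 0
        for j, value in enumerate(row):
            if value > threshold:
                bits |= 1 << j
        packed.append(bits)
    return packed


def shared_bits(a, b):
    """Coarse similarity: number of positions set in both signatures."""
    return bin(a & b).count("1")


def filter_pairs(packed, min_shared):
    """Return (i, j, overlap) for every pair whose overlap reaches min_shared."""
    hits = []
    for i, j in combinations(range(len(packed)), 2):
        overlap = shared_bits(packed[i], packed[j])
        if overlap >= min_shared:
            hits.append((i, j, overlap))
    return hits


# Hypothetical toy profiles (rows are items, columns are measurements).
profiles = [
    [2.1, 0.3, 1.8, 0.1, 2.5],
    [1.9, 0.2, 2.2, 0.0, 0.4],
    [0.1, 1.7, 0.2, 2.3, 0.3],
]
signatures = binarize(profiles, threshold=1.0)
candidates = filter_pairs(signatures, min_shared=2)
print(candidates)
```

Only the pairs returned by `filter_pairs` would then be passed to a slower, more precise similarity measure. Python integers are used here for brevity; a performance-oriented version would more plausibly pack bits into fixed-width machine words.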

Authors:
Lee, Joon-Yong [1]; Fujimoto, Grant M. [1]; Wilson, Ryan [1]; Wiley, H. Steven [1]; Payne, Samuel H. [1]
  1. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Publication Date:
2018-06-11
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1455280
Report Number(s):
PNNL-SA-126956
Journal ID: ISSN 1471-2105; 453060036
Grant/Contract Number:  
AC05-76RL01830
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
BMC Bioinformatics
Additional Journal Information:
Journal Volume: 19; Journal Issue: 1; Journal ID: ISSN 1471-2105
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; 97 MATHEMATICS AND COMPUTING; Pairwise similarity comparison; Filtering; Large-scale data mining

Citation Formats

Lee, Joon-Yong, Fujimoto, Grant M., Wilson, Ryan, Wiley, H. Steven, and Payne, Samuel H. Blazing Signature Filter: a library for fast pairwise similarity comparisons. United States: N. p., 2018. Web. doi:10.1186/s12859-018-2210-6.
Lee, Joon-Yong, Fujimoto, Grant M., Wilson, Ryan, Wiley, H. Steven, & Payne, Samuel H. Blazing Signature Filter: a library for fast pairwise similarity comparisons. United States. doi:10.1186/s12859-018-2210-6.
Lee, Joon-Yong, Fujimoto, Grant M., Wilson, Ryan, Wiley, H. Steven, and Payne, Samuel H. 2018. "Blazing Signature Filter: a library for fast pairwise similarity comparisons". United States. doi:10.1186/s12859-018-2210-6. https://www.osti.gov/servlets/purl/1455280.
@article{osti_1455280,
title = {Blazing Signature Filter: a library for fast pairwise similarity comparisons},
author = {Lee, Joon-Yong and Fujimoto, Grant M. and Wilson, Ryan and Wiley, H. Steven and Payne, Samuel H.},
abstractNote = {Identifying similarities between datasets is a fundamental task in data mining and has become an integral part of modern scientific investigation. Whether the task is to identify co-expressed genes in large-scale expression surveys or to predict combinations of gene knockouts which would elicit a similar phenotype, the underlying computational task is often a multi-dimensional similarity test. As datasets continue to grow, improvements to the efficiency, sensitivity or specificity of such computation will have broad impacts as it allows scientists to more completely explore the wealth of scientific data. A significant practical drawback of large-scale data mining is the vast majority of pairwise comparisons are unlikely to be relevant, meaning that they do not share a signature of interest. It is therefore essential to efficiently identify these unproductive comparisons as rapidly as possible and exclude them from more time-intensive similarity calculations. The Blazing Signature Filter (BSF) is a highly efficient pairwise similarity algorithm which enables extensive data mining within a reasonable amount of time. The algorithm transforms datasets into binary metrics, allowing it to utilize the computationally efficient bit operators and provide a coarse measure of similarity. As a result, the BSF can scale to high dimensionality and rapidly filter unproductive pairwise comparison. Furthermore, two bioinformatics applications of the tool are presented to demonstrate the ability to scale to billions of pairwise comparisons and the usefulness of this approach.},
doi = {10.1186/s12859-018-2210-6},
journal = {BMC Bioinformatics},
number = 1,
volume = 19,
place = {United States},
year = {2018},
month = {jun}
}
