DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Blazing Signature Filter: a library for fast pairwise similarity comparisons

Abstract

Identifying similarities between datasets is a fundamental task in data mining and has become an integral part of modern scientific investigation. Whether the task is to identify co-expressed genes in large-scale expression surveys or to predict combinations of gene knockouts which would elicit a similar phenotype, the underlying computational task is often a multi-dimensional similarity test. As datasets continue to grow, improvements to the efficiency, sensitivity or specificity of such computation will have broad impacts, as they allow scientists to more completely explore the wealth of scientific data. A significant practical drawback of large-scale data mining is that the vast majority of pairwise comparisons are unlikely to be relevant, meaning that they do not share a signature of interest. It is therefore essential to identify these unproductive comparisons as rapidly as possible and exclude them from more time-intensive similarity calculations. The Blazing Signature Filter (BSF) is a highly efficient pairwise similarity algorithm which enables extensive data mining within a reasonable amount of time. The algorithm transforms datasets into binary metrics, allowing it to utilize computationally efficient bit operators and provide a coarse measure of similarity. As a result, the BSF can scale to high dimensionality and rapidly filter unproductive pairwise comparisons. Furthermore, two bioinformatics applications of the tool are presented to demonstrate the ability to scale to billions of pairwise comparisons and the usefulness of this approach.
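
The filtering idea in the abstract (binarize each signature, then compare with bit operators) can be illustrated with a minimal Python sketch. This is not the BSF library's API: the function names, the thresholding rule, and the example data are all hypothetical, chosen only to show how a bitwise AND plus a bit count gives a coarse similarity score that can discard unproductive pairs before a more expensive comparison.

```python
from itertools import combinations


def to_bits(values, threshold):
    """Encode one signature as an int; bit i is set when values[i] >= threshold.

    The threshold rule is an assumption for illustration only.
    """
    bits = 0
    for i, v in enumerate(values):
        if v >= threshold:
            bits |= 1 << i
    return bits


def shared_count(a, b):
    """Count features set in both signatures: bitwise AND, then count the 1 bits."""
    return bin(a & b).count("1")


def candidate_pairs(signatures, threshold, min_shared):
    """Return index pairs whose binarized signatures share at least min_shared features."""
    packed = [to_bits(s, threshold) for s in signatures]
    return [
        (i, j)
        for i, j in combinations(range(len(packed)), 2)
        if shared_count(packed[i], packed[j]) >= min_shared
    ]


# Hypothetical data: three signatures over four features.
signatures = [
    [2.1, 0.0, 3.4, 1.9],
    [1.8, 0.2, 2.7, 0.1],
    [0.0, 2.5, 0.3, 0.0],
]
print(candidate_pairs(signatures, threshold=1.5, min_shared=2))  # [(0, 1)]
```

Only the pairs returned by such a filter would be passed on to a slower, more precise similarity calculation.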

Authors:
Lee, Joon-Yong [1]; Fujimoto, Grant M. [1]; Wilson, Ryan [1]; Wiley, H. Steven [1]; Payne, Samuel H. [1]
  1. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Publication Date:
2018-06-11
Research Org.:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1455280
Report Number(s):
PNNL-SA-126956
Journal ID: ISSN 1471-2105; 453060036
Grant/Contract Number:  
AC05-76RL01830
Resource Type:
Accepted Manuscript
Journal Name:
BMC Bioinformatics
Additional Journal Information:
Journal Volume: 19; Journal Issue: 1; Journal ID: ISSN 1471-2105
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; 97 MATHEMATICS AND COMPUTING; Pairwise similarity comparison; Filtering; Large-scale data mining

Citation Formats

Lee, Joon-Yong, Fujimoto, Grant M., Wilson, Ryan, Wiley, H. Steven, and Payne, Samuel H. Blazing Signature Filter: a library for fast pairwise similarity comparisons. United States: N. p., 2018. Web. doi:10.1186/s12859-018-2210-6.
Lee, Joon-Yong, Fujimoto, Grant M., Wilson, Ryan, Wiley, H. Steven, & Payne, Samuel H. Blazing Signature Filter: a library for fast pairwise similarity comparisons. United States. https://doi.org/10.1186/s12859-018-2210-6
Lee, Joon-Yong, Fujimoto, Grant M., Wilson, Ryan, Wiley, H. Steven, and Payne, Samuel H. 2018. "Blazing Signature Filter: a library for fast pairwise similarity comparisons". United States. https://doi.org/10.1186/s12859-018-2210-6. https://www.osti.gov/servlets/purl/1455280.
@article{osti_1455280,
title = {Blazing Signature Filter: a library for fast pairwise similarity comparisons},
author = {Lee, Joon-Yong and Fujimoto, Grant M. and Wilson, Ryan and Wiley, H. Steven and Payne, Samuel H.},
abstractNote = {Identifying similarities between datasets is a fundamental task in data mining and has become an integral part of modern scientific investigation. Whether the task is to identify co-expressed genes in large-scale expression surveys or to predict combinations of gene knockouts which would elicit a similar phenotype, the underlying computational task is often a multi-dimensional similarity test. As datasets continue to grow, improvements to the efficiency, sensitivity or specificity of such computation will have broad impacts, as they allow scientists to more completely explore the wealth of scientific data. A significant practical drawback of large-scale data mining is that the vast majority of pairwise comparisons are unlikely to be relevant, meaning that they do not share a signature of interest. It is therefore essential to identify these unproductive comparisons as rapidly as possible and exclude them from more time-intensive similarity calculations. The Blazing Signature Filter (BSF) is a highly efficient pairwise similarity algorithm which enables extensive data mining within a reasonable amount of time. The algorithm transforms datasets into binary metrics, allowing it to utilize computationally efficient bit operators and provide a coarse measure of similarity. As a result, the BSF can scale to high dimensionality and rapidly filter unproductive pairwise comparisons. Furthermore, two bioinformatics applications of the tool are presented to demonstrate the ability to scale to billions of pairwise comparisons and the usefulness of this approach.},
doi = {10.1186/s12859-018-2210-6},
journal = {BMC Bioinformatics},
number = 1,
volume = 19,
place = {United States},
year = {2018},
month = jun
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 2 works
Citation information provided by
Web of Science

Works referenced in this record:

NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy
journal, November 2011

  • Pruitt, K. D.; Tatusova, T.; Brown, G. R.
  • Nucleic Acids Research, Vol. 40, Issue D1
  • DOI: 10.1093/nar/gkr1079

Identification of common molecular subsequences
journal, March 1981


A fast bit-vector algorithm for approximate string matching based on dynamic programming
journal, May 1999


A Statistical Model for Identifying Proteins by Tandem Mass Spectrometry
journal, September 2003

  • Nesvizhskii, Alexey I.; Keller, Andrew; Kolker, Eugene
  • Analytical Chemistry, Vol. 75, Issue 17
  • DOI: 10.1021/ac0341261

L1000CDS2: LINCS L1000 characteristic direction signatures search engine
journal, August 2016

  • Duan, Qiaonan; Reid, St Patrick; Clark, Neil R.
  • npj Systems Biology and Applications, Vol. 2, Issue 1
  • DOI: 10.1038/npjsba.2016.15

GutenTag: High-Throughput Sequence Tagging via an Empirically Derived Fragmentation Model
journal, December 2003

  • Tabb, David L.; Saraf, Anita; Yates, John R.
  • Analytical Chemistry, Vol. 75, Issue 23
  • DOI: 10.1021/ac0347462

Bioinformatics methods in drug repurposing for Alzheimer’s disease
journal, July 2015

  • Siavelis, John C.; Bourdakou, Marilena M.; Athanasiadis, Emmanouil I.
  • Briefings in Bioinformatics, Vol. 17, Issue 2
  • DOI: 10.1093/bib/bbv048

KEGG: Kyoto Encyclopedia of Genes and Genomes
journal, January 2000

  • Kanehisa, Minoru; Goto, Susumu
  • Nucleic Acids Research, Vol. 28, Issue 1, p. 27-30
  • DOI: 10.1093/nar/28.1.27

Compound signature detection on LINCS L1000 big data
journal, January 2015

  • Liu, Chenglin; Su, Jing; Yang, Fei
  • Molecular BioSystems, Vol. 11, Issue 3
  • DOI: 10.1039/c4mb00677a

Origin of an Alternative Genetic Code in the Extremely Small and GC–Rich Genome of a Bacterial Symbiont
journal, July 2009


FastBit: interactively searching massive data
journal, July 2009


The characteristic direction: a geometrical approach to identify differentially expressed genes
journal, January 2014

  • Clark, Neil R.; Hu, Kevin S.; Feldmann, Axel S.
  • BMC Bioinformatics, Vol. 15, Issue 1
  • DOI: 10.1186/1471-2105-15-79

Basic local alignment search tool
journal, October 1990

  • Altschul, Stephen F.; Gish, Warren; Miller, Webb
  • Journal of Molecular Biology, Vol. 215, Issue 3, p. 403-410
  • DOI: 10.1016/S0022-2836(05)80360-2

HC-toxin
journal, July 2006


Amino acid substitution matrices from protein blocks.
journal, November 1992

  • Henikoff, S.; Henikoff, J. G.
  • Proceedings of the National Academy of Sciences, Vol. 89, Issue 22, p. 10915-10919
  • DOI: 10.1073/pnas.89.22.10915

Identification of small-molecule inhibitors of Zika virus infection and induced neural cell death via a drug repurposing screen
journal, August 2016

  • Xu, Miao; Lee, Emily M.; Wen, Zhexing
  • Nature Medicine, Vol. 22, Issue 10
  • DOI: 10.1038/nm.4184

Anatomy of High-Performance 2D Similarity Calculations
journal, August 2011

  • Haque, Imran S.; Pande, Vijay S.; Walters, W. Patrick
  • Journal of Chemical Information and Modeling, Vol. 51, Issue 9
  • DOI: 10.1021/ci200235e

Repurposing Salicylanilide Anthelmintic Drugs to Combat Drug Resistant Staphylococcus aureus
journal, April 2015


Peptide Sequence Tags for Fast Database Search in Mass-Spectrometry
journal, August 2005

  • Frank, Ari; Tanner, Stephen; Bafna, Vineet
  • Journal of Proteome Research, Vol. 4, Issue 4
  • DOI: 10.1021/pr050011x

The COG database: a tool for genome-scale analysis of protein functions and evolution
journal, January 2000


UniProt: a hub for protein information
journal, October 2014

  • The UniProt Consortium
  • Nucleic Acids Research, Vol. 43, Issue D1, p. D204-D212
  • DOI: 10.1093/nar/gku989

Systematic Genetic Analysis with Ordered Arrays of Yeast Deletion Mutants
journal, December 2001


A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules
journal, October 2003


The RAST Server: Rapid Annotations using Subsystems Technology
journal, January 2008

  • Aziz, Ramy K.; Bartels, Daniela; Best, Aaron A.
  • BMC Genomics, Vol. 9, Issue 1, Article No. 75
  • DOI: 10.1186/1471-2164-9-75

The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease
journal, January 2007


Red versus green leaves: transcriptomic comparison of foliar senescence between two Prunus cerasifera genotypes
journal, February 2020


The neighbor-joining method: a new method for reconstructing phylogenetic trees.
journal, July 1987


Works referencing / citing this record:

Reproducibility and Transparency by Design
journal, July 2019

  • Petyuk, Vladislav A.; Gatto, Laurent; Payne, Samuel H.
  • Molecular & Cellular Proteomics, Vol. 18, Issue 8 suppl 1
  • DOI: 10.1074/mcp.ip119.001567