Blazing Signature Filter: a library for fast pairwise similarity comparisons
Abstract
Identifying similarities between datasets is a fundamental task in data mining and has become an integral part of modern scientific investigation. Whether the task is to identify co-expressed genes in large-scale expression surveys or to predict combinations of gene knockouts that would elicit a similar phenotype, the underlying computational task is often a multi-dimensional similarity test. As datasets continue to grow, improvements to the efficiency, sensitivity or specificity of such computation will have broad impacts, as they allow scientists to explore the wealth of scientific data more completely. A significant practical drawback of large-scale data mining is that the vast majority of pairwise comparisons are unlikely to be relevant, meaning that they do not share a signature of interest. It is therefore essential to identify these unproductive comparisons as rapidly as possible and exclude them from more time-intensive similarity calculations. The Blazing Signature Filter (BSF) is a highly efficient pairwise similarity algorithm that enables extensive data mining within a reasonable amount of time. The algorithm transforms datasets into binary metrics, allowing it to use computationally efficient bit operators and provide a coarse measure of similarity. As a result, the BSF can scale to high dimensionality and rapidly filter unproductive pairwise comparisons. Furthermore, two bioinformatics applications of the tool are presented to demonstrate its ability to scale to billions of pairwise comparisons and the usefulness of this approach.
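The filtering idea in the abstract (binarize each profile, then compare with bit operators) can be illustrated with a minimal Python sketch. This is not the BSF library's API or the authors' implementation: the function names `to_bits`, `shared_count` and `candidate_pairs`, the single `threshold` used for binarization, the `min_shared` cutoff and the toy data are all assumptions made for illustration.

```python
from itertools import combinations


def to_bits(values, threshold):
    """Pack a numeric profile into one integer: bit i is set when values[i] >= threshold."""
    bits = 0
    for i, value in enumerate(values):
        if value >= threshold:
            bits |= 1 << i
    return bits


def shared_count(a, b):
    """Count positions set in both signatures (bitwise AND, then population count)."""
    return bin(a & b).count("1")


def candidate_pairs(profiles, threshold, min_shared):
    """Return name pairs whose binary signatures share at least min_shared set bits."""
    packed = {name: to_bits(values, threshold) for name, values in profiles.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(packed), 2)
        if shared_count(packed[x], packed[y]) >= min_shared
    ]


# Hypothetical toy data, not taken from the paper.
profiles = {
    "g1": [2.1, 0.1, 3.0, 0.0],
    "g2": [1.9, 0.2, 2.5, 0.1],
    "g3": [0.0, 2.2, 0.1, 0.3],
}
pairs = candidate_pairs(profiles, threshold=1.0, min_shared=2)
```

Only the pairs that survive this coarse count would then be passed to a slower, more precise similarity measure; the rest are discarded after a single AND and a bit count.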
- Authors:
- Lee, Joon-Yong; Fujimoto, Grant M.; Wilson, Ryan; Wiley, H. Steven; Payne, Samuel H.
- Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
- Publication Date:
- 2018-06-11
- Research Org.:
- Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1455280
- Report Number(s):
- PNNL-SA-126956; Journal ID: ISSN 1471-2105; 453060036
- Grant/Contract Number:
- AC05-76RL01830
- Resource Type:
- Accepted Manuscript
- Journal Name:
- BMC Bioinformatics
- Additional Journal Information:
- Journal Volume: 19; Journal Issue: 1; Journal ID: ISSN 1471-2105
- Publisher:
- BioMed Central
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 59 BASIC BIOLOGICAL SCIENCES; 97 MATHEMATICS AND COMPUTING; Pairwise similarity comparison; Filtering; Large-scale data mining
Citation Formats
Lee, Joon-Yong, Fujimoto, Grant M., Wilson, Ryan, Wiley, H. Steven, and Payne, Samuel H. Blazing Signature Filter: a library for fast pairwise similarity comparisons. United States: N. p., 2018. Web. doi:10.1186/s12859-018-2210-6.
Lee, Joon-Yong, Fujimoto, Grant M., Wilson, Ryan, Wiley, H. Steven, & Payne, Samuel H. Blazing Signature Filter: a library for fast pairwise similarity comparisons. United States. https://doi.org/10.1186/s12859-018-2210-6
Lee, Joon-Yong, Fujimoto, Grant M., Wilson, Ryan, Wiley, H. Steven, and Payne, Samuel H. 2018. "Blazing Signature Filter: a library for fast pairwise similarity comparisons". United States. https://doi.org/10.1186/s12859-018-2210-6. https://www.osti.gov/servlets/purl/1455280.
@article{osti_1455280,
title = {Blazing Signature Filter: a library for fast pairwise similarity comparisons},
author = {Lee, Joon-Yong and Fujimoto, Grant M. and Wilson, Ryan and Wiley, H. Steven and Payne, Samuel H.},
abstractNote = {Identifying similarities between datasets is a fundamental task in data mining and has become an integral part of modern scientific investigation. Whether the task is to identify co-expressed genes in large-scale expression surveys or to predict combinations of gene knockouts which would elicit a similar phenotype, the underlying computational task is often a multi-dimensional similarity test. As datasets continue to grow, improvements to the efficiency, sensitivity or specificity of such computation will have broad impacts as it allows scientists to more completely explore the wealth of scientific data. A significant practical drawback of large-scale data mining is the vast majority of pairwise comparisons are unlikely to be relevant, meaning that they do not share a signature of interest. It is therefore essential to efficiently identify these unproductive comparisons as rapidly as possible and exclude them from more time-intensive similarity calculations. The Blazing Signature Filter (BSF) is a highly efficient pairwise similarity algorithm which enables extensive data mining within a reasonable amount of time. The algorithm transforms datasets into binary metrics, allowing it to utilize the computationally efficient bit operators and provide a coarse measure of similarity. As a result, the BSF can scale to high dimensionality and rapidly filter unproductive pairwise comparison. Furthermore, two bioinformatics applications of the tool are presented to demonstrate the ability to scale to billions of pairwise comparisons and the usefulness of this approach.},
doi = {10.1186/s12859-018-2210-6},
journal = {BMC Bioinformatics},
number = 1,
volume = 19,
place = {United States},
year = {2018},
month = {jun}
}