DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Benchmarking of alignment-free sequence comparison methods

Abstract

Background: Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but the lack of a clearly defined benchmarking consensus hampers their performance assessment. Results: Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. Conclusion: The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.

Authors:
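Zielezinski, Andrzej [1]; Girgis, Hani Z. [2]; Bernard, Guillaume [3]; Leimeister, Chris-Andre [4]; Tang, Kujin [5]; Dencker, Thomas [4]; Lau, Anna Katharina [4]; Röhling, Sophie [4]; Choi, Jae Jin [6]; Waterman, Michael S. [7]; Comin, Matteo [8]; Kim, Sung-Hou [6]; Vinga, Susana [9]; Almeida, Jonas S. [10]; Chan, Cheong Xin [11]; James, Benjamin T. [2]; Sun, Fengzhu [7]; Morgenstern, Burkhard [4]; Karlowski, Wojciech M. [1]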
  1. Adam Mickiewicz Univ., Poznan (Poland). Faculty of Biology. Dept. of Computational Biology
  2. Univ. of Tulsa, Tulsa, OK (United States). Tandy School of Computer Science
  3. Sorbonne Univ., Paris (France)
  4. Gottingen Univ. (Germany). Inst. of Microbiology and Genetics. Dept. of Bioinformatics
  5. Univ. of Southern California, Los Angeles, CA (United States). Quantitative and Computational Biology Program. Dept. of Biological Sciences
  6. Univ. of California, Berkeley, CA (United States). Dept. of Chemistry; Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). Molecular Biophysics & Integrated Bioimaging Division
  7. Univ. of Southern California, Los Angeles, CA (United States). Quantitative and Computational Biology Program. Dept. of Biological Sciences; Fudan Univ., Shanghai (China). School of Mathematical Sciences. Centre for Computational Systems Biology
  8. Univ. of Padua (Italy). Dept. of Information Engineering
  9. Univ. of Lisbon (Portugal). Inst. Superior Tecnico. INESC-ID. IDMEC
  10. National Inst. of Health (NIH), Bethesda, MD (United States). National Cancer Inst. Division of Cancer Epidemiology and Genetics (DCEG)
  11. Univ. of Queensland, Brisbane, QLD (Australia). School of Chemistry and Molecular Biosciences. Inst. for Molecular Bioscience
Publication Date:
July 2019
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1626947
Grant/Contract Number:  
AC02-05CH11231
Resource Type:
Accepted Manuscript
Journal Name:
Genome Biology (Online)
Additional Journal Information:
Journal Name: Genome Biology (Online); Journal Volume: 20; Journal Issue: 1; Journal ID: ISSN 1474-760X
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; 97 MATHEMATICS AND COMPUTING; Biotechnology & Applied Microbiology; Genetics & Heredity; Alignment-free; Sequence comparison; Benchmark; Whole-genome phylogeny; Horizontal gene transfer; Web service

Citation Formats

Zielezinski, Andrzej, Girgis, Hani Z., Bernard, Guillaume, Leimeister, Chris-Andre, Tang, Kujin, Dencker, Thomas, Lau, Anna Katharina, Röhling, Sophie, Choi, Jae Jin, Waterman, Michael S., Comin, Matteo, Kim, Sung-Hou, Vinga, Susana, Almeida, Jonas S., Chan, Cheong Xin, James, Benjamin T., Sun, Fengzhu, Morgenstern, Burkhard, and Karlowski, Wojciech M. Benchmarking of alignment-free sequence comparison methods. United States: N. p., 2019. Web. doi:10.1186/s13059-019-1755-7.
Zielezinski, Andrzej, Girgis, Hani Z., Bernard, Guillaume, Leimeister, Chris-Andre, Tang, Kujin, Dencker, Thomas, Lau, Anna Katharina, Röhling, Sophie, Choi, Jae Jin, Waterman, Michael S., Comin, Matteo, Kim, Sung-Hou, Vinga, Susana, Almeida, Jonas S., Chan, Cheong Xin, James, Benjamin T., Sun, Fengzhu, Morgenstern, Burkhard, & Karlowski, Wojciech M. Benchmarking of alignment-free sequence comparison methods. United States. doi:10.1186/s13059-019-1755-7.
Zielezinski, Andrzej, Girgis, Hani Z., Bernard, Guillaume, Leimeister, Chris-Andre, Tang, Kujin, Dencker, Thomas, Lau, Anna Katharina, Röhling, Sophie, Choi, Jae Jin, Waterman, Michael S., Comin, Matteo, Kim, Sung-Hou, Vinga, Susana, Almeida, Jonas S., Chan, Cheong Xin, James, Benjamin T., Sun, Fengzhu, Morgenstern, Burkhard, and Karlowski, Wojciech M. 2019. "Benchmarking of alignment-free sequence comparison methods". United States. doi:10.1186/s13059-019-1755-7. https://www.osti.gov/servlets/purl/1626947.
@article{osti_1626947,
title = {Benchmarking of alignment-free sequence comparison methods},
author = {Zielezinski, Andrzej and Girgis, Hani Z. and Bernard, Guillaume and Leimeister, Chris-Andre and Tang, Kujin and Dencker, Thomas and Lau, Anna Katharina and Röhling, Sophie and Choi, Jae Jin and Waterman, Michael S. and Comin, Matteo and Kim, Sung-Hou and Vinga, Susana and Almeida, Jonas S. and Chan, Cheong Xin and James, Benjamin T. and Sun, Fengzhu and Morgenstern, Burkhard and Karlowski, Wojciech M.},
abstractNote = {Background: Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. Results: Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. Conclusion: The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions},
doi = {10.1186/s13059-019-1755-7},
journal = {Genome Biology (Online)},
number = 1,
volume = 20,
place = {United States},
year = {2019},
month = {7}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record
