DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Efficient DNA sequence compression with neural networks

Abstract

Background: The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models.

Findings: We benchmark GeCo3 as a reference-free DNA compressor on 5 datasets, including a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, 2 compilations of archaeal and virus genomes, 4 whole genomes, and 2 collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively. To test its performance as a reference-based DNA compressor, we benchmark GeCo3 on 4 datasets constituted by the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression by 12.4%, 11.7%, 10.8%, and 10.1% over the state of the art. The cost of this compression improvement is some additional computational time (1.7–3 times slower than GeCo2). The RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art.

Conclusions: GeCo3 is a genomic sequence compressor with a neural network mixing approach that provides additional gains over top specific genomic compressors. The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at https://github.com/cobilab/geco3.

Authors:
Silva, Milton [1]; Pratas, Diogo [2]; Pinho, Armando J. [1]
  1. Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal; Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
  2. Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal; Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal; Department of Virology, University of Helsinki, Haartmaninkatu 3, 00014 Helsinki, Finland
Publication Date:
Sponsoring Org.:
USDOE
OSTI Identifier:
1712526
Resource Type:
Published Article
Journal Name:
GigaScience
Additional Journal Information:
Journal Name: GigaScience; Journal Volume: 9; Journal Issue: 11; Journal ID: ISSN 2047-217X
Publisher:
Oxford University Press
Country of Publication:
United Kingdom
Language:
English

Citation Formats

Silva, Milton, Pratas, Diogo, and Pinho, Armando J. Efficient DNA sequence compression with neural networks. United Kingdom: N. p., 2020. Web. doi:10.1093/gigascience/giaa119.
Silva, Milton, Pratas, Diogo, & Pinho, Armando J. Efficient DNA sequence compression with neural networks. United Kingdom. https://doi.org/10.1093/gigascience/giaa119
Silva, Milton, Pratas, Diogo, and Pinho, Armando J. Wed . "Efficient DNA sequence compression with neural networks". United Kingdom. https://doi.org/10.1093/gigascience/giaa119.
@article{osti_1712526,
title = {Efficient DNA sequence compression with neural networks},
author = {Silva, Milton and Pratas, Diogo and Pinho, Armando J.},
abstractNote = {Background: The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models. Findings: We benchmark GeCo3 as a reference-free DNA compressor on 5 datasets, including a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, 2 compilations of archaeal and virus genomes, 4 whole genomes, and 2 collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of $2.4\%$, $7.1\%$, $6.1\%$, $5.8\%$, and $6.0\%$, respectively. To test its performance as a reference-based DNA compressor, we benchmark GeCo3 on 4 datasets constituted by the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression by $12.4\%$, $11.7\%$, $10.8\%$, and $10.1\%$ over the state of the art. The cost of this compression improvement is some additional computational time (1.7–3 times slower than GeCo2). The RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art. Conclusions: GeCo3 is a genomic sequence compressor with a neural network mixing approach that provides additional gains over top specific genomic compressors. The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at https://github.com/cobilab/geco3.},
doi = {10.1093/gigascience/giaa119},
journal = {GigaScience},
number = 11,
volume = 9,
place = {United Kingdom},
year = {2020},
month = {11}
}
