Efficient DNA sequence compression with neural networks
Abstract
Background: The increasing production of genomic data has intensified the need for models that cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. Only a few recent articles propose the use of neural networks for DNA sequence compression, and they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks to mix multiple context models and substitution-tolerant context models.
Findings: We benchmark GeCo3 as a reference-free DNA compressor on 5 datasets: a balanced and comprehensive dataset of DNA sequences; the Y-chromosome and human mitogenome; 2 compilations of archaeal and virus genomes; 4 whole genomes; and 2 collections of FASTQ data of a human virome and ancient DNA. GeCo3 improves compression over the previous version (GeCo2) by 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively. To test its performance as a reference-based DNA compressor, we benchmark GeCo3 on 4 datasets consisting of the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves compression by 12.4%, 11.7%, 10.8%, and 10.1% over the state of the art. The cost of this compression improvement is some additional computational time (1.7–3 times slower than GeCo2). The RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art.
Conclusions: GeCo3 is a genomic sequence compressor with a neural network mixing approach that provides additional gains over top specific genomic compressors. The proposed mixing method is portable: it requires only the probabilities of the models as inputs, so it adapts easily to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at https://github.com/cobilab/geco3.
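The idea of mixing model probabilities with a neural network can be illustrated with a minimal Python sketch. It is not GeCo3's implementation: the `Mixer` class, the two toy models, the network size, and the learning rate are all assumptions made for illustration. The sketch feeds the per-model probabilities for A, C, G, and T into a small network, reads a mixed distribution from the output, and takes one gradient step once the actual symbol is known.

```python
import math
import random

ALPHABET = "ACGT"


class Mixer:
    """Hypothetical one-hidden-layer network that mixes model probabilities."""

    def __init__(self, n_models, hidden=8, lr=0.05, seed=0):
        rng = random.Random(seed)
        self.n_in = n_models * len(ALPHABET)
        self.hidden = hidden
        self.lr = lr
        # Each weight row carries one extra entry for a constant bias input.
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(self.n_in + 1)]
                   for _ in range(hidden)]
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden + 1)]
                   for _ in range(len(ALPHABET))]

    def predict(self, model_probs):
        """model_probs: one distribution over ACGT per model."""
        self.x = [p for dist in model_probs for p in dist] + [1.0]
        self.h = [math.tanh(sum(w * v for w, v in zip(row, self.x)))
                  for row in self.w1] + [1.0]
        logits = [sum(w * v for w, v in zip(row, self.h)) for row in self.w2]
        top = max(logits)
        exps = [math.exp(z - top) for z in logits]  # numerically stable softmax
        total = sum(exps)
        self.out = [e / total for e in exps]
        return self.out

    def update(self, symbol):
        """One SGD step on the cross-entropy of the last prediction."""
        target = ALPHABET.index(symbol)
        d_out = [p - (1.0 if i == target else 0.0)
                 for i, p in enumerate(self.out)]
        # Backpropagate to the hidden units before the output weights change.
        d_h = [(1.0 - self.h[j] ** 2)
               * sum(d_out[i] * self.w2[i][j] for i in range(len(ALPHABET)))
               for j in range(self.hidden)]
        for i, row in enumerate(self.w2):
            for j in range(self.hidden + 1):
                row[j] -= self.lr * d_out[i] * self.h[j]
        for j, row in enumerate(self.w1):
            for k in range(self.n_in + 1):
                row[k] -= self.lr * d_h[j] * self.x[k]


def uniform_model(seq, pos):
    """Toy model: no knowledge, every base equally likely."""
    return [0.25] * len(ALPHABET)


def repeat_model(seq, pos):
    """Toy model: bets that the base seen two positions back repeats."""
    if pos < 2:
        return [0.25] * len(ALPHABET)
    return [0.85 if s == seq[pos - 2] else 0.05 for s in ALPHABET]


def code_length_bits(sequence, models, mixer):
    """Ideal code length: sum of -log2 p(actual base) under the mixture."""
    bits = 0.0
    for pos, symbol in enumerate(sequence):
        probs = mixer.predict([m(sequence, pos) for m in models])
        bits += -math.log2(probs[ALPHABET.index(symbol)])
        mixer.update(symbol)  # learn only after the base has been "coded"
    return bits


sequence = "AC" * 200
models = [uniform_model, repeat_model]
bits = code_length_bits(sequence, models, Mixer(n_models=len(models)))
print(f"{bits / len(sequence):.3f} bits per base")
```

Because the mixer sees only probabilities, the toy models could be swapped for any other predictors without touching the network code, which is the portability property the abstract describes. A real compressor would pass the mixed distribution to an arithmetic coder rather than summing ideal code lengths.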
- Authors:
- Silva, Milton; Pratas, Diogo; Pinho, Armando J.
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal; Department of Electronics, Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal; Department of Electronics, Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal; Department of Virology, University of Helsinki, Haartmaninkatu 3, 00014 Helsinki, Finland
- Publication Date:
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1712526
- Resource Type:
- Published Article
- Journal Name:
- GigaScience
- Additional Journal Information:
- Journal Name: GigaScience; Journal Volume: 9; Journal Issue: 11; Journal ID: ISSN 2047-217X
- Publisher:
- Oxford University Press
- Country of Publication:
- United Kingdom
- Language:
- English
Citation Formats
Silva, Milton, Pratas, Diogo, and Pinho, Armando J. Efficient DNA sequence compression with neural networks. United Kingdom: N. p., 2020. Web. doi:10.1093/gigascience/giaa119.
Silva, Milton, Pratas, Diogo, & Pinho, Armando J. Efficient DNA sequence compression with neural networks. United Kingdom. https://doi.org/10.1093/gigascience/giaa119
Silva, Milton, Pratas, Diogo, and Pinho, Armando J. 2020. "Efficient DNA sequence compression with neural networks". United Kingdom. https://doi.org/10.1093/gigascience/giaa119.
@article{osti_1712526,
title = {Efficient DNA sequence compression with neural networks},
author = {Silva, Milton and Pratas, Diogo and Pinho, Armando J.},
doi = {10.1093/gigascience/giaa119},
journal = {GigaScience},
number = 11,
volume = 9,
place = {United Kingdom},
year = {2020},
month = {11}
}