DOE PAGES
U.S. Department of Energy, Office of Scientific and Technical Information

This content will become publicly available on May 12, 2020

Title: Complete Assembly of Circular and Chloroplast Genomes Based on Global Optimization

Abstract

This paper focuses on the last two stages of genome assembly, namely scaffolding and gap-filling, and shows that they can be solved as part of a single optimization problem. Our approach is based on modeling genome assembly as a problem of finding a simple path in a specific graph that satisfies as many of the distance constraints encoding the insert-size information as possible. We formulate it as a mixed-integer linear programming problem and apply an optimization solver to find the exact solutions on a benchmark of chloroplasts. We show that the presence of repetitions in the set of unitigs is the main reason for the existence of multiple equivalent solutions that are associated with alternative subpaths. We also describe two sufficient conditions and design efficient algorithms for identifying these subpaths. Comparisons of the results achieved by our tool with those obtained with recent assemblers are presented.
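
The path-finding objective described in the abstract can be illustrated with a toy sketch. This is not the authors' mixed-integer linear programming formulation: it is a brute-force enumeration over a tiny hypothetical graph, meant only to show what "a simple path that satisfies as many distance constraints as possible" means. The unitig names, lengths, gaps, constraint encoding (expected offset between the start positions of two unitigs) and the tolerance `tol` are all assumptions made for illustration.

```python
# Toy illustration only: brute force instead of MILP, hypothetical data.

def simple_paths(edges, start):
    """Yield every simple path (no repeated vertex) that begins at `start`."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        yield path
        for nxt in edges.get(node, {}):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))

def positions(path, lengths, edges):
    """Start coordinate of each unitig when laid out along `path`."""
    pos = {path[0]: 0}
    for a, b in zip(path, path[1:]):
        pos[b] = pos[a] + lengths[a] + edges[a][b]  # edges[a][b] is the gap
    return pos

def score(path, lengths, edges, constraints, tol):
    """Count constraints (u, v, d) with |pos[v] - pos[u] - d| <= tol."""
    pos = positions(path, lengths, edges)
    return sum(
        1
        for u, v, d in constraints
        if u in pos and v in pos and abs(pos[v] - pos[u] - d) <= tol
    )

def best_path(lengths, edges, constraints, tol=10):
    """Return (path, satisfied_count), preferring longer paths on ties."""
    candidates = (p for s in lengths for p in simple_paths(edges, s))
    best = max(
        candidates,
        key=lambda p: (score(p, lengths, edges, constraints, tol), len(p)),
    )
    return best, score(best, lengths, edges, constraints, tol)

# Hypothetical unitigs (name -> length) and directed links (gap sizes).
LENGTHS = {"A": 100, "B": 50, "C": 80, "D": 60}
EDGES = {
    "A": {"B": 0, "C": 0},
    "B": {"C": 0, "D": 0},
    "C": {"D": 0},
}
# Hypothetical insert-size constraints: (first, second, expected offset).
CONSTRAINTS = [("A", "C", 150), ("A", "D", 230), ("B", "D", 130)]

if __name__ == "__main__":
    print(best_path(LENGTHS, EDGES, CONSTRAINTS))
```

Enumerating all simple paths grows exponentially with the number of unitigs, which is one reason an exact optimization solver is attractive on real instances.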

Authors:
Andonov, Rumen [1]; Djidjev, Hristo Nikolov [2]; Francois, Sebastien [1]; Lavenier, Dominique [1]
  1. Univ. of Rennes (France)
  2. Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Publication Date:
Research Org.:
Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1526953
Report Number(s):
LA-UR-18-25924
Journal ID: ISSN 0219-7200
Grant/Contract Number:  
89233218CNA000001
Resource Type:
Accepted Manuscript
Journal Name:
Journal of Bioinformatics and Computational Biology
Additional Journal Information:
Journal Name: Journal of Bioinformatics and Computational Biology; Journal ID: ISSN 0219-7200
Publisher:
World Scientific
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; Biological Science; Mathematics; genome assembly, scaffolding, longest path problem, integer programming

Citation Formats

MLA:
Andonov, Rumen, Djidjev, Hristo Nikolov, Francois, Sebastien, and Lavenier, Dominique. Complete Assembly of Circular and Chloroplast Genomes Based on Global Optimization. United States: N. p., 2019. Web. doi:10.1142/S0219720019500148.
APA:
Andonov, Rumen, Djidjev, Hristo Nikolov, Francois, Sebastien, & Lavenier, Dominique. Complete Assembly of Circular and Chloroplast Genomes Based on Global Optimization. United States. doi:10.1142/S0219720019500148.
Chicago:
Andonov, Rumen, Djidjev, Hristo Nikolov, Francois, Sebastien, and Lavenier, Dominique. 2019. "Complete Assembly of Circular and Chloroplast Genomes Based on Global Optimization". United States. doi:10.1142/S0219720019500148.
BibTeX:
@article{osti_1526953,
title = {Complete Assembly of Circular and Chloroplast Genomes Based on Global Optimization},
author = {Andonov, Rumen and Djidjev, Hristo Nikolov and Francois, Sebastien and Lavenier, Dominique},
abstractNote = {This paper focuses on the last two stages of genome assembly, namely scaffolding and gap-filling, and shows that they can be solved as part of a single optimization problem. Our approach is based on modeling genome assembly as a problem of finding a simple path in a specific graph that satisfies as many as possible of the distance constraints encoding the insert-size information. We formulate it as a mixed-integer linear programming problem and apply an optimization solver to find the exact solutions on a benchmark of chloroplasts. We show that the presence of repetitions in the set of unitigs is the main reason for the existence of multiple equivalent solutions that are associated to alternative subpaths. We also describe two sufficient conditions and we design efficient algorithms for identifying these subpaths. Comparisons of the results achieved by our tool with the ones obtained with recent assemblers are presented.},
doi = {10.1142/S0219720019500148},
journal = {Journal of Bioinformatics and Computational Biology},
place = {United States},
year = {2019},
month = {5}
}
