## This content will become publicly available on May 12, 2020

# Complete Assembly of Circular and Chloroplast Genomes Based on Global Optimization

## Abstract

This paper focuses on the last two stages of genome assembly, namely scaffolding and gap-filling, and shows that they can be solved as part of a single optimization problem. Our approach models genome assembly as the problem of finding a simple path in a specific graph that satisfies as many of the distance constraints encoding the insert-size information as possible. We formulate it as a mixed-integer linear programming problem and apply an optimization solver to find exact solutions on a benchmark of chloroplasts. We show that the presence of repetitions in the set of unitigs is the main reason for the existence of multiple equivalent solutions that are associated with alternative subpaths. We also describe two sufficient conditions and design efficient algorithms for identifying these subpaths. We compare the results achieved by our tool with those obtained by recent assemblers.
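To make the path model in the abstract concrete, the following is a minimal sketch on a toy instance. It is not the authors' mixed-integer linear program: it swaps in a brute-force search that enumerates simple paths and counts satisfied distance constraints, which is only feasible for a handful of unitigs. The function name `best_simple_path`, the data layout, the toy values, and the coordinate convention (a unitig starts at the previous start plus the previous length minus the overlap) are all assumptions made for illustration.

```python
def best_simple_path(lengths, edges, constraints, tol=0):
    """Return (score, path): the simple path satisfying the most constraints.

    Hypothetical data layout, chosen for this sketch only:
      lengths:     {unitig: length}
      edges:       {(u, v): overlap}, meaning v may follow u, sharing `overlap` bases
      constraints: [(u, v, d)], meaning v should start d bases after u starts
    """
    adj = {}
    for (u, v), overlap in edges.items():
        adj.setdefault(u, []).append((v, overlap))

    def satisfied(pos):
        # Count constraints whose endpoints are both on the path at the right distance.
        return sum(
            1
            for u, v, d in constraints
            if u in pos and v in pos and abs(pos[v] - pos[u] - d) <= tol
        )

    best = (-1, [])

    def extend(path, pos):
        nonlocal best
        score = satisfied(pos)
        # Prefer more satisfied constraints, then longer paths.
        if (score, len(path)) > (best[0], len(best[1])):
            best = (score, list(path))
        u = path[-1]
        for v, overlap in adj.get(u, []):
            if v not in pos:  # simple path: each unitig is used at most once
                pos[v] = pos[u] + lengths[u] - overlap
                path.append(v)
                extend(path, pos)
                path.pop()
                del pos[v]

    for start in lengths:
        extend([start], {start: 0})
    return best


# Toy instance (made-up numbers): four unitigs and five possible adjacencies.
lengths = {"A": 100, "B": 50, "C": 80, "D": 60}
edges = {("A", "B"): 10, ("A", "C"): 10, ("B", "D"): 10, ("B", "C"): 10, ("C", "D"): 10}
# The last constraint conflicts with the first, so not all four can hold at once.
constraints = [("A", "D", 200), ("B", "C", 40), ("A", "C", 130), ("A", "D", 160)]

score, path = best_simple_path(lengths, edges, constraints)
print(score, path)  # 3 ['A', 'B', 'C', 'D']
```

On this instance the path A, B, C, D satisfies three of the four constraints, and no simple path satisfies all four because two of them disagree about the distance from A to D. An exact solver-based formulation, as described in the abstract, replaces the enumeration with variables and linear constraints so that realistic instances remain tractable.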

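- Authors:
- Andonov, Rumen; Djidjev, Hristo Nikolov; Francois, Sebastien; Lavenier, Dominique

- Affiliations:
- Univ. of Rennes (France)
- Los Alamos National Lab. (LANL), Los Alamos, NM (United States)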

- Publication Date:

- Research Org.:
- Los Alamos National Lab. (LANL), Los Alamos, NM (United States)

- Sponsoring Org.:
- USDOE

- OSTI Identifier:
- 1526953

- Report Number(s):
- LA-UR-18-25924; Journal ID: ISSN 0219-7200

- Grant/Contract Number:
- 89233218CNA000001

- Resource Type:
- Accepted Manuscript

- Journal Name:
- Journal of Bioinformatics and Computational Biology

- Additional Journal Information:
- Journal Name: Journal of Bioinformatics and Computational Biology; Journal ID: ISSN 0219-7200

- Publisher:
- World Scientific

- Country of Publication:
- United States

- Language:
- English

- Subject:
- 59 BASIC BIOLOGICAL SCIENCES; Biological Science; Mathematics; genome assembly, scaffolding, longest path problem, integer programming

### Citation Formats

```
Andonov, Rumen, Djidjev, Hristo Nikolov, Francois, Sebastien, and Lavenier, Dominique. Complete Assembly of Circular and Chloroplast Genomes Based on Global Optimization. United States: N. p., 2019.
Web. doi:10.1142/S0219720019500148.
```

```
Andonov, Rumen, Djidjev, Hristo Nikolov, Francois, Sebastien, & Lavenier, Dominique. Complete Assembly of Circular and Chloroplast Genomes Based on Global Optimization. United States. doi:10.1142/S0219720019500148.
```

```
Andonov, Rumen, Djidjev, Hristo Nikolov, Francois, Sebastien, and Lavenier, Dominique. Sun .
"Complete Assembly of Circular and Chloroplast Genomes Based on Global Optimization". United States. doi:10.1142/S0219720019500148.
```

```
@article{osti_1526953,
  title = {Complete Assembly of Circular and Chloroplast Genomes Based on Global Optimization},
  author = {Andonov, Rumen and Djidjev, Hristo Nikolov and Francois, Sebastien and Lavenier, Dominique},
  abstractNote = {This paper focuses on the last two stages of genome assembly, namely scaffolding and gap-filling, and shows that they can be solved as part of a single optimization problem. Our approach models genome assembly as the problem of finding a simple path in a specific graph that satisfies as many of the distance constraints encoding the insert-size information as possible. We formulate it as a mixed-integer linear programming problem and apply an optimization solver to find exact solutions on a benchmark of chloroplasts. We show that the presence of repetitions in the set of unitigs is the main reason for the existence of multiple equivalent solutions that are associated with alternative subpaths. We also describe two sufficient conditions and design efficient algorithms for identifying these subpaths. We compare the results achieved by our tool with those obtained by recent assemblers.},
  doi = {10.1142/S0219720019500148},
  journal = {Journal of Bioinformatics and Computational Biology},
  number = {},
  volume = {},
  place = {United States},
  year = {2019},
  month = {5}
}
```

Works referenced in this record:

- Pevzner, P. A.; Tang, H.; Waterman, M. S. "An Eulerian path approach to DNA fragment assembly". Proceedings of the National Academy of Sciences, Vol. 98, Issue 17 (journal, August 2001).
- Gurevich, Alexey; Saveliev, Vladislav; Vyahhi, Nikolay. "QUAST: quality assessment tool for genome assemblies". Bioinformatics, Vol. 29, Issue 8 (journal, February 2013).
- Gritsenko, A. A.; Nijkamp, J. F.; Reinders, M. J. T. "GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies". Bioinformatics, Vol. 28, Issue 11 (journal, April 2012).
- François, Sebastien; Andonov, Rumen; Lavenier, Dominique. "Global Optimization for Scaffolding and Completing Genome Assemblies". Electronic Notes in Discrete Mathematics, Vol. 64 (journal, February 2018).
- Medvedev, Paul; Pham, Son; Chaisson, Mark. "Paired de Bruijn Graphs: A Novel Approach for Incorporating Mate Pair Information into Genome Assemblers". Journal of Computational Biology, Vol. 18, Issue 11 (journal, November 2011).
- Chikhi, R.; Medvedev, P. "Informed and automated k-mer size selection for genome assembly". Bioinformatics, Vol. 30, Issue 1 (journal, June 2013).
- Salmela, L.; Makinen, V.; Valimaki, N. "Fast scaffolding with small independent mixed integer programs". Bioinformatics, Vol. 27, Issue 23 (journal, October 2011).
- Weller, Mathias; Chateau, Annie; Dallard, Clément. "Scaffolding Problems Revisited: Complexity, Approximation and Fixed Parameter Tractable Algorithms, and Some Special Cases". Algorithmica, Vol. 80, Issue 6 (journal, January 2018).
- Huson, Daniel H.; Reinert, Knut; Myers, Eugene W. "The greedy path-merging algorithm for contig scaffolding". Journal of the ACM, Vol. 49, Issue 5 (journal, September 2002).
- Weller, Mathias; Chateau, Annie; Giroudeau, Rodolphe. "Exact approaches for scaffolding". BMC Bioinformatics, Vol. 16, Issue S14 (journal, October 2015).
- Huang, Weichun; Li, Leping; Myers, Jason R. "ART: a next-generation sequencing read simulator". Bioinformatics, Vol. 28, Issue 4 (journal, December 2011).
- Pevzner, P. A. "De Novo Repeat Classification and Fragment Assembly". Genome Research, Vol. 14, Issue 9 (journal, September 2004).
- Weber, James L.; Myers, Eugene W. "Human Whole-Genome Shotgun Sequencing". Genome Research, Vol. 7, Issue 5 (journal, May 1997).
- Gao, Song; Bertrand, Denis; Chia, Burton K. H. "OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees". Genome Biology, Vol. 17, Issue 1 (journal, May 2016).
- Li, Heng. "Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences". Bioinformatics, Vol. 32, Issue 14 (journal, March 2016).
- Shaw, Joey; Lickey, Edgar B.; Schilling, Edward E. "Comparison of whole chloroplast genome sequences to choose noncoding regions for phylogenetic studies in angiosperms: the tortoise and the hare III". American Journal of Botany, Vol. 94, Issue 3 (journal, March 2007).
- Sahlin, Kristoffer; Vezzi, Francesco; Nystedt, Björn. "BESST - Efficient scaffolding of large fragmented assemblies". BMC Bioinformatics, Vol. 15, Issue 1 (journal, August 2014).
- Chikhi, Rayan; Rizk, Guillaume. "Space-efficient and exact de Bruijn graph representation based on a Bloom filter". Algorithms for Molecular Biology, Vol. 8, Issue 1 (journal, January 2013).