NovoGraph: Human genome graph construction from multiple long-read de novo assemblies
- Weill Cornell Medicine, New York, NY (United States); New York Genome Center, NY (United States)
- Univ. of Arizona, Tucson, AZ (United States)
- National Inst. of Health (NIH), Bethesda, MD (United States)
- Baylor College of Medicine, Houston, TX (United States)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Cold Spring Harbor Lab., NY (United States)
- National Inst. of Health (NIH), Bethesda, MD (United States); Heinrich Heine Univ., Düsseldorf (Germany)
Genome graphs are emerging as an important novel approach to the analysis of high-throughput human sequencing data. By explicitly representing genetic variants and alternative haplotypes in a mappable data structure, they can enable the improved analysis of structurally variable and hyperpolymorphic regions of the genome. In most existing approaches, graphs are constructed from variant call sets derived from short-read sequencing. As long-read sequencing becomes more cost-effective and enables de novo assembly for increasing numbers of whole genomes, a method for the direct construction of a genome graph from sets of assembled human genomes would be desirable. Such assembly-based genome graphs would encompass the wide spectrum of genetic variation accessible to long-read-based de novo assembly, including large structural variants and divergent haplotypes. We present NovoGraph, a method for the construction of a human genome graph directly from a set of de novo assemblies. NovoGraph constructs a genome-wide multiple sequence alignment of all input contigs and creates a graph by merging the input sequences at positions that are both homologous and sequence-identical. NovoGraph outputs resulting graphs in VCF format that can be loaded into third-party genome graph toolkits. To demonstrate NovoGraph, we construct a genome graph with 23,478,835 variant sites and 30,582,795 variant alleles from de novo assemblies of seven ethnically diverse human genomes (AK1, CHM1, CHM13, HG003, HG004, HX1, NA19240). Initial evaluations show that mapping against the constructed graph reduces the average mismatch rate of reads from sample NA12878 by approximately 0.2%, albeit at a slightly increased rate of reads that remain unmapped.
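The merging idea in the abstract (collapse aligned positions that are homologous and sequence-identical, and keep everything else as variation) can be illustrated on a toy alignment. The sketch below is not NovoGraph's implementation. It assumes a tiny multiple sequence alignment given as equal-length gapped strings with row 0 acting as the reference, and the function name `msa_to_variant_sites` is hypothetical. It returns simplified tuples rather than valid VCF records (for example, indel alleles are not padded with an anchor base).

```python
from typing import List, Tuple

Site = Tuple[int, str, List[str]]  # (1-based reference position, ref allele, alt alleles)


def msa_to_variant_sites(msa: List[str]) -> List[Site]:
    """Collapse a toy MSA into variant sites.

    Row 0 is treated as the reference. A column is "merged" when every row
    carries the same non-gap base; each run of non-merged columns becomes
    one site whose alleles are the gap-stripped row segments.
    """
    if not msa or not msa[0] or any(len(row) != len(msa[0]) for row in msa):
        raise ValueError("all rows must be non-empty and equally long")

    def is_merged(col: int) -> bool:
        bases = {row[col] for row in msa}
        return len(bases) == 1 and "-" not in bases

    sites: List[Site] = []
    ref_pos = 0  # number of reference bases consumed so far
    col, n_cols = 0, len(msa[0])
    while col < n_cols:
        if is_merged(col):
            ref_pos += 1
            col += 1
            continue
        start = col
        while col < n_cols and not is_merged(col):
            col += 1
        alleles = [row[start:col].replace("-", "") for row in msa]
        ref = alleles[0]
        alts = sorted({a for a in alleles[1:] if a != ref})
        if alts:  # skip columns that are gaps in every row
            # For a pure insertion (empty ref allele) the position is the
            # reference base that follows the inserted sequence.
            sites.append((ref_pos + 1, ref, alts))
        ref_pos += len(ref)
    return sites


toy_msa = [
    "ACGT-A",  # reference row
    "ACCT-A",  # substitution G>C at reference position 3
    "ACGTTA",  # insertion of T before reference position 5
]
sites = msa_to_variant_sites(toy_msa)
print(sites)
```

On this toy input the function reports two sites, `(3, 'G', ['C'])` and `(5, '', ['T'])`. The four identical columns are the positions at which a graph built this way would merge all three sequences into a single path.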
- Research Organization:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC); National Inst. of Health (NIH)
- Grant/Contract Number:
- AC02-05CH11231
- OSTI ID:
- 1594902
- Journal Information:
- F1000Research, Vol. 7; ISSN 2046-1402
- Publisher:
- F1000Research
- Country of Publication:
- United States
- Language:
- English
GraphTyper2 enables population-scale genotyping of structural variation using pangenome graphs (journal, November 2019)