U.S. Department of Energy
Office of Scientific and Technical Information

NovoGraph: Human genome graph construction from multiple long-read de novo assemblies

Journal Article · F1000Research
 [1];  [2];  [3];  [4];  [5];  [6];  [3];  [7]
  1. Weill Cornell Medicine, New York, NY (United States); New York Genome Center, NY (United States)
  2. Univ. of Arizona, Tucson, AZ (United States)
  3. National Inst. of Health (NIH), Bethesda, MD (United States)
  4. Baylor College of Medicine, Houston, TX (United States)
  5. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  6. Cold Spring Harbor Lab., NY (United States)
  7. National Inst. of Health (NIH), Bethesda, MD (United States); Heinrich Heine Univ., Düsseldorf (Germany)

Genome graphs are emerging as an important novel approach to the analysis of high-throughput human sequencing data. By explicitly representing genetic variants and alternative haplotypes in a mappable data structure, they can enable the improved analysis of structurally variable and hyperpolymorphic regions of the genome. In most existing approaches, graphs are constructed from variant call sets derived from short-read sequencing. As long-read sequencing becomes more cost-effective and enables de novo assembly for increasing numbers of whole genomes, a method for the direct construction of a genome graph from sets of assembled human genomes would be desirable. Such assembly-based genome graphs would encompass the wide spectrum of genetic variation accessible to long-read-based de novo assembly, including large structural variants and divergent haplotypes. We present NovoGraph, a method for the construction of a human genome graph directly from a set of de novo assemblies. NovoGraph constructs a genome-wide multiple sequence alignment of all input contigs and creates a graph by merging the input sequences at positions that are both homologous and sequence-identical. NovoGraph outputs resulting graphs in VCF format that can be loaded into third-party genome graph toolkits. To demonstrate NovoGraph, we construct a genome graph with 23,478,835 variant sites and 30,582,795 variant alleles from de novo assemblies of seven ethnically diverse human genomes (AK1, CHM1, CHM13, HG003, HG004, HX1, NA19240). Initial evaluations show that mapping against the constructed graph reduces the average mismatch rate of reads from sample NA12878 by approximately 0.2%, albeit at a slightly increased rate of reads that remain unmapped.
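
The merging step described in the abstract can be illustrated with a toy example. The sketch below is not NovoGraph's implementation; it only assumes that an alignment is already available as equal-length strings with `-` for gaps. The function names `msa_to_graph` and `variant_columns` and the node representation `(column, base)` are hypothetical choices made for illustration. Two sequences share a node when they carry the same base in the same alignment column, which is one simple reading of "homologous and sequence-identical".

```python
from collections import defaultdict


def msa_to_graph(aligned):
    """Collapse aligned rows into a toy sequence graph.

    A node is a (column, base) pair, so rows that carry the same base in
    the same alignment column share that node. Edges link consecutive
    non-gap positions within each row. Hypothetical helper, not NovoGraph.
    """
    if not aligned:
        raise ValueError("need at least one aligned sequence")
    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        raise ValueError("all aligned rows must have the same length")

    nodes = set()
    edges = defaultdict(set)
    for row in aligned:
        prev = None
        for col, base in enumerate(row):
            if base == "-":
                continue  # a gap contributes no node; the path skips ahead
            node = (col, base)
            nodes.add(node)
            if prev is not None:
                edges[prev].add(node)
            prev = node
    return nodes, dict(edges)


def variant_columns(nodes):
    """Return the sorted columns in which more than one base is observed.

    Indels (columns that differ only by gaps) are not reported here.
    """
    bases_by_col = defaultdict(set)
    for col, base in nodes:
        bases_by_col[col].add(base)
    return sorted(col for col, bases in bases_by_col.items() if len(bases) > 1)


if __name__ == "__main__":
    # Three toy "contigs": the second has a substitution and an insertion.
    rows = ["ACGT-A", "ACCTGA", "ACGT-A"]
    nodes, edges = msa_to_graph(rows)
    print(len(nodes), "nodes")
    print("columns with a substitution:", variant_columns(nodes))
```

In this toy representation a substitution appears as two nodes in one column, and an insertion appears as a node that only some paths pass through. A real pipeline would additionally need to handle the scale of a whole-genome alignment and serialize the result to VCF, which this sketch does not attempt.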

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC); National Inst. of Health (NIH)
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
1594902
Journal Information:
Journal Name: F1000Research; Journal Volume: 7; ISSN 2046-1402
Publisher:
F1000Research
Country of Publication:
United States
Language:
English

Cited By (1)

GraphTyper2 enables population-scale genotyping of structural variation using pangenome graphs journal November 2019
