OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: NovoGraph: Human genome graph construction from multiple long-read de novo assemblies

Abstract

Genome graphs are emerging as an important novel approach to the analysis of high-throughput human sequencing data. By explicitly representing genetic variants and alternative haplotypes in a mappable data structure, they can enable the improved analysis of structurally variable and hyperpolymorphic regions of the genome. In most existing approaches, graphs are constructed from variant call sets derived from short-read sequencing. As long-read sequencing becomes more cost-effective and enables de novo assembly for increasing numbers of whole genomes, a method for the direct construction of a genome graph from sets of assembled human genomes would be desirable. Such assembly-based genome graphs would encompass the wide spectrum of genetic variation accessible to long-read-based de novo assembly, including large structural variants and divergent haplotypes. We present NovoGraph, a method for the construction of a human genome graph directly from a set of de novo assemblies. NovoGraph constructs a genome-wide multiple sequence alignment of all input contigs and creates a graph by merging the input sequences at positions that are both homologous and sequence-identical. NovoGraph outputs resulting graphs in VCF format that can be loaded into third-party genome graph toolkits. To demonstrate NovoGraph, we construct a genome graph with 23,478,835 variant sites and 30,582,795 variant alleles from de novo assemblies of seven ethnically diverse human genomes (AK1, CHM1, CHM13, HG003, HG004, HX1, NA19240). Initial evaluations show that mapping against the constructed graph reduces the average mismatch rate of reads from sample NA12878 by approximately 0.2%, albeit at a slightly increased rate of reads that remain unmapped.
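
The merging step described in the abstract can be illustrated with a toy sketch. This is not NovoGraph's actual implementation: it assumes the multiple sequence alignment is already available as equal-length gapped strings, and the function names (msa_to_graph, variant_columns) and the toy sequences are hypothetical. Sequences that share an alignment column (homologous) and the same character (sequence-identical) are merged into one node.

```python
from collections import defaultdict


def msa_to_graph(msa):
    """Collapse a gapped multiple sequence alignment into a column graph.

    Each node is (column, character). Sequences aligned to the same column
    and carrying the same character share a node. Gap characters are
    skipped, so a sequence with a deletion gets an edge that jumps over
    the gapped columns.
    """
    lengths = {len(row) for row in msa.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned rows must have the same length")

    nodes = defaultdict(set)  # (column, char) -> names of sequences using it
    edges = defaultdict(set)  # (node_from, node_to) -> names of sequences
    for name, row in msa.items():
        previous = None
        for column, char in enumerate(row):
            if char == "-":
                continue
            node = (column, char)
            nodes[node].add(name)
            if previous is not None:
                edges[(previous, node)].add(name)
            previous = node
    return dict(nodes), dict(edges)


def variant_columns(msa):
    """Return the columns at which the aligned rows disagree (gaps included)."""
    rows = list(msa.values())
    return [i for i in range(len(rows[0])) if len({r[i] for r in rows}) > 1]


# Hypothetical toy alignment: one SNP, one insertion, one deletion.
toy_msa = {
    "ref":     "ACGT-ACGT",
    "contig1": "ACGTTACGT",
    "contig2": "ACCT-AC-T",
}
nodes, edges = msa_to_graph(toy_msa)
sites = variant_columns(toy_msa)
```

In this sketch, columns where all rows agree collapse to a single node, while a column with disagreement yields one node per distinct character; those are the columns a VCF-style export would report as variant sites.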

Authors:
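 Biederstedt, Evan [1]; Oliver, Jeffrey C. [2]; Hansen, Nancy F. [3]; Jajoo, Aarti [4]; Dunn, Nathan [5]; Olson, Andrew [6]; Busby, Ben [3]; Dilthey, Alexander T. [7]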
  1. Weill Cornell Medicine, New York, NY (United States); New York Genome Center, NY (United States)
  2. Univ. of Arizona, Tucson, AZ (United States)
  3. National Inst. of Health (NIH), Bethesda, MD (United States)
  4. Baylor College of Medicine, Houston, TX (United States)
  5. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  6. Cold Spring Harbor Lab., NY (United States)
  7. National Inst. of Health (NIH), Bethesda, MD (United States); Heinrich Heine Univ., Düsseldorf (Germany)
Publication Date:
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC); National Institutes of Health (NIH)
OSTI Identifier:
1594902
Grant/Contract Number:  
AC02-05CH11231
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
F1000Research
Additional Journal Information:
Journal Volume: 7; Journal ID: ISSN 2046-1402
Publisher:
F1000Research
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; Genome graph; de novo assembly; alignment; multiple sequence alignment; population reference graph; NovoGraph

Citation Formats

Biederstedt, Evan, Oliver, Jeffrey C., Hansen, Nancy F., Jajoo, Aarti, Dunn, Nathan, Olson, Andrew, Busby, Ben, and Dilthey, Alexander T. NovoGraph: Human genome graph construction from multiple long-read de novo assemblies. United States: N. p., 2018. Web. doi:10.12688/f1000research.15895.2.
Biederstedt, Evan, Oliver, Jeffrey C., Hansen, Nancy F., Jajoo, Aarti, Dunn, Nathan, Olson, Andrew, Busby, Ben, & Dilthey, Alexander T. NovoGraph: Human genome graph construction from multiple long-read de novo assemblies. United States. doi:10.12688/f1000research.15895.2.
Biederstedt, Evan, Oliver, Jeffrey C., Hansen, Nancy F., Jajoo, Aarti, Dunn, Nathan, Olson, Andrew, Busby, Ben, and Dilthey, Alexander T. 2018. "NovoGraph: Human genome graph construction from multiple long-read de novo assemblies". United States. doi:10.12688/f1000research.15895.2. https://www.osti.gov/servlets/purl/1594902.
@article{osti_1594902,
title = {NovoGraph: Human genome graph construction from multiple long-read de novo assemblies},
author = {Biederstedt, Evan and Oliver, Jeffrey C. and Hansen, Nancy F. and Jajoo, Aarti and Dunn, Nathan and Olson, Andrew and Busby, Ben and Dilthey, Alexander T.},
abstractNote = {Genome graphs are emerging as an important novel approach to the analysis of high-throughput human sequencing data. By explicitly representing genetic variants and alternative haplotypes in a mappable data structure, they can enable the improved analysis of structurally variable and hyperpolymorphic regions of the genome. In most existing approaches, graphs are constructed from variant call sets derived from short-read sequencing. As long-read sequencing becomes more cost-effective and enables de novo assembly for increasing numbers of whole genomes, a method for the direct construction of a genome graph from sets of assembled human genomes would be desirable. Such assembly-based genome graphs would encompass the wide spectrum of genetic variation accessible to long-read-based de novo assembly, including large structural variants and divergent haplotypes. We present NovoGraph, a method for the construction of a human genome graph directly from a set of de novo assemblies. NovoGraph constructs a genome-wide multiple sequence alignment of all input contigs and creates a graph by merging the input sequences at positions that are both homologous and sequence-identical. NovoGraph outputs resulting graphs in VCF format that can be loaded into third-party genome graph toolkits. To demonstrate NovoGraph, we construct a genome graph with 23,478,835 variant sites and 30,582,795 variant alleles from de novo assemblies of seven ethnically diverse human genomes (AK1, CHM1, CHM13, HG003, HG004, HX1, NA19240). Initial evaluations show that mapping against the constructed graph reduces the average mismatch rate of reads from sample NA12878 by approximately 0.2%, albeit at a slightly increased rate of reads that remain unmapped.},
doi = {10.12688/f1000research.15895.2},
journal = {F1000Research},
issn = {2046-1402},
volume = 7,
place = {United States},
year = {2018},
month = {12}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Works referenced in this record:

Resolving the complexity of the human genome using single-molecule sequencing
journal, November 2014

  • Chaisson, Mark J. P.; Huddleston, John; Dennis, Megan Y.
  • Nature, Vol. 517, Issue 7536
  • DOI: 10.1038/nature13907

Nanopore sequencing and assembly of a human genome with ultra-long reads
journal, January 2018

  • Jain, Miten; Koren, Sergey; Miga, Karen H.
  • Nature Biotechnology, Vol. 36, Issue 4
  • DOI: 10.1038/nbt.4060

progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement
journal, June 2010


Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration
journal, April 2012

  • Thorvaldsdottir, H.; Robinson, J. T.; Mesirov, J. P.
  • Briefings in Bioinformatics, Vol. 14, Issue 2, p. 178-192
  • DOI: 10.1093/bib/bbs017

Closing gaps between open software and public data in a hackathon setting: User-centered software prototyping
journal, January 2016


Approximate, simultaneous comparison of microbial genome architectures via syntenic anchoring of quiver representations
journal, September 2018


Genome graphs and the evolution of genome inference
journal, March 2017

  • Paten, Benedict; Novak, Adam M.; Eizenga, Jordan M.
  • Genome Research, Vol. 27, Issue 5
  • DOI: 10.1101/gr.214155.116

Major Histocompatibility Complex Genomics and Human Disease
journal, August 2013


Aligning Multiple Genomic Sequences With the Threaded Blockset Aligner
journal, April 2004


Fast Statistical Alignment
journal, May 2009


Graphtyper enables population-scale genotyping using pangenome graphs
journal, September 2017

  • Eggertsson, Hannes P.; Jonsson, Hakon; Kristmundsdottir, Snaedis
  • Nature Genetics, Vol. 49, Issue 11
  • DOI: 10.1038/ng.3964

CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice
journal, January 1994

  • Thompson, Julie D.; Higgins, Desmond G.; Gibson, Toby J.
  • Nucleic Acids Research, Vol. 22, Issue 22, p. 4673-4680
  • DOI: 10.1093/nar/22.22.4673

An integrated map of structural variation in 2,504 human genomes
journal, September 2015

  • Sudmant, Peter H.; Rausch, Tobias; Gardner, Eugene J.
  • Nature, Vol. 526, Issue 7571
  • DOI: 10.1038/nature15394

Mauve: Multiple Alignment of Conserved Genomic Sequence With Rearrangements
journal, June 2004


De novo assembly and phasing of a Korean human genome
journal, October 2016

  • Seo, Jeong-Sun; Rhie, Arang; Kim, Junsoo
  • Nature, Vol. 538, Issue 7624
  • DOI: 10.1038/nature20098

Accurate genotyping across variant classes and lengths using variant graphs
journal, June 2018


The variant call format and VCFtools
journal, June 2011


Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly
journal, April 2017

  • Schneider, Valerie A.; Graves-Lindsay, Tina; Howe, Kerstin
  • Genome Research, Vol. 27, Issue 5
  • DOI: 10.1101/gr.213611.116

High-Accuracy HLA Type Inference from Whole-Genome Sequencing Data Using Population Reference Graphs
journal, October 2016

  • Dilthey, Alexander T.; Gourraud, Pierre-Antoine; Mentzer, Alexander J.
  • PLOS Computational Biology, Vol. 12, Issue 10
  • DOI: 10.1371/journal.pcbi.1005151

Long-read sequencing and de novo assembly of a Chinese genome
journal, June 2016

  • Shi, Lingling; Guo, Yunfei; Dong, Chengliang
  • Nature Communications, Vol. 7, Issue 1
  • DOI: 10.1038/ncomms12065

Mugsy: fast multiple alignment of closely related whole genomes
journal, December 2010


Accurate detection of complex structural variations using single-molecule sequencing
journal, April 2018


Single haplotype assembly of the human genome from a hydatidiform mole
journal, November 2014

  • Steinberg, Karyn Meltz; Schneider, Valerie A.; Graves-Lindsay, Tina A.
  • Genome Research, Vol. 24, Issue 12
  • DOI: 10.1101/gr.180893.114

Improved genome inference in the MHC using a population reference graph
journal, April 2015

  • Dilthey, Alexander; Cox, Charles; Iqbal, Zamin
  • Nature Genetics, Vol. 47, Issue 6
  • DOI: 10.1038/ng.3257

MUSCLE: multiple sequence alignment with high accuracy and high throughput
journal, March 2004

  • Edgar, R. C.
  • Nucleic Acids Research, Vol. 32, Issue 5, p. 1792-1797
  • DOI: 10.1093/nar/gkh340

Simultaneous alignment of short reads against multiple genomes
journal, January 2009

  • Schneeberger, Korbinian; Hagmann, Jörg; Ossowski, Stephan
  • Genome Biology, Vol. 10, Issue 9
  • DOI: 10.1186/gb-2009-10-9-r98

Variation graph toolkit improves read mapping by representing genetic variation in the reference
journal, October 2018

  • Garrison, Erik; Sirén, Jouni; Novak, Adam M.
  • Nature Biotechnology, Vol. 36, Issue 9
  • DOI: 10.1038/nbt.4227

Integrative genomics viewer
journal, January 2011

  • Robinson, James T.; Thorvaldsdóttir, Helga; Winckler, Wendy
  • Nature Biotechnology, Vol. 29, Issue 1
  • DOI: 10.1038/nbt.1754

Extensive sequencing of seven human genomes to characterize benchmark reference materials
journal, June 2016

  • Zook, Justin M.; Catoe, David; McDaniel, Jennifer
  • Scientific Data, Vol. 3, Issue 1
  • DOI: 10.1038/sdata.2016.25

T-coffee: a novel method for fast and accurate multiple sequence alignment
journal, September 2000

  • Notredame, Cédric; Higgins, Desmond G.; Heringa, Jaap
  • Journal of Molecular Biology, Vol. 302, Issue 1
  • DOI: 10.1006/jmbi.2000.4042

Cactus Graphs for Genome Comparisons
journal, March 2011

  • Paten, Benedict; Diekhans, Mark; Earl, Dent
  • Journal of Computational Biology, Vol. 18, Issue 3
  • DOI: 10.1089/cmb.2010.0252

The Sequence Alignment/Map format and SAMtools
journal, June 2009


MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability
journal, January 2013

  • Katoh, K.; Standley, D. M.
  • Molecular Biology and Evolution, Vol. 30, Issue 4
  • DOI: 10.1093/molbev/mst010

Efficient storage of high throughput DNA sequencing data using reference-based compression
journal, January 2011

  • Hsi-Yang Fritz, M.; Leinonen, R.; Cochrane, G.
  • Genome Research, Vol. 21, Issue 5
  • DOI: 10.1101/gr.114819.110

A novel method for multiple alignment of sequences with repeated and shuffled elements
journal, November 2004


Efficient multiple genome alignment
journal, July 2002

