DOE PAGES
U.S. Department of Energy
Office of Scientific and Technical Information

Title: Mango: Exploratory Data Analysis for Large-Scale Sequencing Datasets

Abstract

The decreasing cost of DNA sequencing over the past decade has led to an explosion of sequencing datasets, leaving us with petabytes of data to analyze. However, current sequencing visualization tools are designed to run on single machines, which limits their scalability and interactivity on modern genomic datasets. Here, we build on the scalability of Apache Spark to provide Mango, consisting of a Jupyter notebook and genome browser, which removes these scalability and interactivity constraints by using multi-node compute clusters to allow interactive analysis over terabytes of sequencing data. We demonstrate the scalability of the Mango tools by performing quality control analyses on 10 terabytes of data from 100 high-coverage sequencing samples from the Simons Genome Diversity Project, enabling interactive genomic exploration of multi-sample datasets that surpass the computational limitations of single-node visualization tools. Mango is freely available for download with full documentation at https://bdg-mango.readthedocs.io/en/latest/.

Authors:
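Morrow, Alyssa Kramer; He, George Zhixuan; Nothaft, Frank Austin; Tu, Eric Tongching; Paschall, Justin; Yosef, Nir; Joseph, Anthony Douglas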
Publication Date:
December 2019
Research Org.:
Univ. of California, Oakland, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC); DHS; National Science Foundation (NSF); National Institutes of Health (NIH)
OSTI Identifier:
1577012
Alternate Identifier(s):
OSTI ID: 1802357
Grant/Contract Number:  
SC0012463; HSHQDC-16-3-00083; CCF-1139158; FA8750-12-2-0331; 1-U54HG007990-01; HHSN261201400006C
Resource Type:
Published Article
Journal Name:
Cell Systems
Additional Journal Information:
Journal Name: Cell Systems; Journal Volume: 9; Journal Issue: 6; Journal ID: ISSN 2405-4712
Publisher:
Elsevier
Country of Publication:
Niger
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; biochemistry & molecular biology; cell biology; genome visualization; genome sequencing; Apache Spark; genome browser; interactive notebook

Citation Formats

Morrow, Alyssa Kramer, He, George Zhixuan, Nothaft, Frank Austin, Tu, Eric Tongching, Paschall, Justin, Yosef, Nir, and Joseph, Anthony Douglas. Mango: Exploratory Data Analysis for Large-Scale Sequencing Datasets. Niger: N. p., 2019. Web. doi:10.1016/j.cels.2019.11.002.
Morrow, Alyssa Kramer, He, George Zhixuan, Nothaft, Frank Austin, Tu, Eric Tongching, Paschall, Justin, Yosef, Nir, & Joseph, Anthony Douglas. Mango: Exploratory Data Analysis for Large-Scale Sequencing Datasets. Niger. https://doi.org/10.1016/j.cels.2019.11.002
Morrow, Alyssa Kramer, He, George Zhixuan, Nothaft, Frank Austin, Tu, Eric Tongching, Paschall, Justin, Yosef, Nir, and Joseph, Anthony Douglas. Sun . "Mango: Exploratory Data Analysis for Large-Scale Sequencing Datasets". Niger. https://doi.org/10.1016/j.cels.2019.11.002.
@article{osti_1577012,
title = {Mango: Exploratory Data Analysis for Large-Scale Sequencing Datasets},
author = {Morrow, Alyssa Kramer and He, George Zhixuan and Nothaft, Frank Austin and Tu, Eric Tongching and Paschall, Justin and Yosef, Nir and Joseph, Anthony Douglas},
abstractNote = {The decreasing cost of DNA sequencing over the past decade has led to an explosion of sequencing datasets, leaving us with petabytes of data to analyze. However, current sequencing visualization tools are designed to run on single machines, which limits their scalability and interactivity on modern genomic datasets. Here, we leverage the scalability of Apache Spark to provide Mango, consisting of a Jupyter notebook and genome browser, which removes scalability and interactivity constraints by leveraging multi-node compute clusters to allow interactive analysis over terabytes of sequencing data. We demonstrate scalability of the Mango tools by performing quality control analyses on 10 terabytes of 100 high-coverage sequencing samples from the Simons Genome Diversity Project, enabling capability for interactive genomic exploration of multi-sample datasets that surpass the computational limitations of single-node visualization tools. Mango is freely available for download with full documentation at https://bdg-mango.readthedocs.io/en/latest/.},
doi = {10.1016/j.cels.2019.11.002},
journal = {Cell Systems},
number = 6,
volume = 9,
place = {Niger},
year = {2019},
month = {12}
}
