DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Analysis of Strand-Specific RNA-Seq Data Using Machine Learning Reveals the Structures of Transcription Units in Clostridium thermocellum

Abstract

Identification of transcription units (TUs) encoded in a bacterial genome is essential to elucidation of transcriptional regulation of the organism. To gain a detailed understanding of the dynamically composed TU structures, we have used four strand-specific RNA-seq (ssRNA-seq) datasets collected under two experimental conditions to derive the genomic TU organization of Clostridium thermocellum using a machine-learning approach. Our method accurately predicted the genomic boundaries of individual TUs based on two sets of parameters measuring the RNA-seq expression patterns across the genome: expression-level continuity and variance. A total of 2590 distinct TUs are predicted based on the four RNA-seq datasets. Among the predicted TUs, 44% have multiple genes. We assessed our prediction method on an independent set of RNA-seq data with longer reads. The evaluation confirmed the high quality of the predicted TUs. Functional enrichment analyses on a selected subset of the predicted TUs revealed interesting biology. To demonstrate the generality of the prediction method, we have also applied the method to RNA-seq data collected on Escherichia coli and achieved high prediction accuracies. The TU prediction program named SeqTU is publicly available at https://code.google.com/p/seqtu/. We expect that the predicted TUs can serve as the baseline information for studying transcriptional and post-transcriptional regulation in C. thermocellum and other bacteria.

Authors:
 Chou, Wen-Chi [1]; Ma, Qin [1]; Yang, Shihui [2]; Cao, Sha [3]; Klingeman, Dawn M. [4]; Brown, Steven D. [5]; Xu, Ying [6]
  1. Univ. of Georgia, Athens, GA (United States); BioEnergy Science Center, Oak Ridge, TN (United States)
  2. BioEnergy Science Center, Oak Ridge, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); National Renewable Energy Lab. (NREL), Golden, CO (United States)
  3. Univ. of Georgia, Athens, GA (United States)
  4. BioEnergy Science Center, Oak Ridge, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  5. BioEnergy Science Center, Oak Ridge, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  6. Univ. of Georgia, Athens, GA (United States); BioEnergy Science Center, Oak Ridge, TN (United States); Jilin Univ., Changchun (China)
Publication Date:
2015-03-12
Research Org.:
National Renewable Energy Laboratory (NREL), Golden, CO (United States); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). BioEnergy Science Center (BESC)
Sponsoring Org.:
USDOE Office of Science (SC), Basic Energy Sciences (BES)
OSTI Identifier:
1242033
Alternate Identifier(s):
OSTI ID: 1265796
Report Number(s):
NREL/JA-5100-64668
Journal ID: ISSN 0305-1048
Grant/Contract Number:  
AC36-08GO28308; AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
Nucleic Acids Research
Additional Journal Information:
Journal Volume: 43; Journal Issue: 10; Related Information: Nucleic Acids Research; Journal ID: ISSN 0305-1048
Publisher:
Oxford University Press
Country of Publication:
United States
Language:
English
Subject:
09 BIOMASS FUELS; 59 BASIC BIOLOGICAL SCIENCES; transcription units (TU); bacterial genome

Citation Formats

Chou, Wen-Chi, Ma, Qin, Yang, Shihui, Cao, Sha, Klingeman, Dawn M., Brown, Steven D., and Xu, Ying. Analysis of Strand-Specific RNA-Seq Data Using Machine Learning Reveals the Structures of Transcription Units in Clostridium thermocellum. United States: N. p., 2015. Web. doi:10.1093/nar/gkv177.
Chou, Wen-Chi, Ma, Qin, Yang, Shihui, Cao, Sha, Klingeman, Dawn M., Brown, Steven D., & Xu, Ying. Analysis of Strand-Specific RNA-Seq Data Using Machine Learning Reveals the Structures of Transcription Units in Clostridium thermocellum. United States. https://doi.org/10.1093/nar/gkv177
Chou, Wen-Chi, Ma, Qin, Yang, Shihui, Cao, Sha, Klingeman, Dawn M., Brown, Steven D., and Xu, Ying. 2015. "Analysis of Strand-Specific RNA-Seq Data Using Machine Learning Reveals the Structures of Transcription Units in Clostridium thermocellum". United States. https://doi.org/10.1093/nar/gkv177. https://www.osti.gov/servlets/purl/1242033.
@article{osti_1242033,
title = {Analysis of Strand-Specific RNA-Seq Data Using Machine Learning Reveals the Structures of Transcription Units in Clostridium thermocellum},
author = {Chou, Wen-Chi and Ma, Qin and Yang, Shihui and Cao, Sha and Klingeman, Dawn M. and Brown, Steven D. and Xu, Ying},
abstractNote = {Identification of transcription units (TUs) encoded in a bacterial genome is essential to elucidation of transcriptional regulation of the organism. To gain a detailed understanding of the dynamically composed TU structures, we have used four strand-specific RNA-seq (ssRNA-seq) datasets collected under two experimental conditions to derive the genomic TU organization of Clostridium thermocellum using a machine-learning approach. Our method accurately predicted the genomic boundaries of individual TUs based on two sets of parameters measuring the RNA-seq expression patterns across the genome: expression-level continuity and variance. A total of 2590 distinct TUs are predicted based on the four RNA-seq datasets. Among the predicted TUs, 44% have multiple genes. We assessed our prediction method on an independent set of RNA-seq data with longer reads. The evaluation confirmed the high quality of the predicted TUs. Functional enrichment analyses on a selected subset of the predicted TUs revealed interesting biology. To demonstrate the generality of the prediction method, we have also applied the method to RNA-seq data collected on Escherichia coli and achieved high prediction accuracies. The TU prediction program named SeqTU is publicly available at https://code.google.com/p/seqtu/. We expect that the predicted TUs can serve as the baseline information for studying transcriptional and post-transcriptional regulation in C. thermocellum and other bacteria.},
doi = {10.1093/nar/gkv177},
journal = {Nucleic Acids Research},
number = 10,
volume = 43,
place = {United States},
year = {2015},
month = {3}
}

Citation Metrics:
Cited by: 17 works
Citation information provided by
Web of Science

Works referenced in this record:

The transcription unit architecture of the Escherichia coli genome
journal, November 2009

  • Cho, Byung-Kwan; Zengler, Karsten; Qiu, Yu
  • Nature Biotechnology, Vol. 27, Issue 11
  • DOI: 10.1038/nbt.1582

Genome-wide operon prediction in Staphylococcus aureus
journal, July 2004


ODB: a database of operons accumulating known operons across multiple genomes
journal, January 2006


DBTBS: a database of transcriptional regulation in Bacillus subtilis containing upstream intergenic conservation information
journal, October 2007

  • Sierro, Nicolas; Makita, Yuko; de Hoon, Michiel
  • Nucleic Acids Research, Vol. 36, Issue suppl_1
  • DOI: 10.1093/nar/gkm910

OperonDB: a comprehensive database of predicted operons in microbial genomes
journal, January 2009

  • Pertea, M.; Ayanbule, K.; Smedinghoff, M.
  • Nucleic Acids Research, Vol. 37, Issue Database
  • DOI: 10.1093/nar/gkn784

DOOR: a database for prokaryotic operons
journal, November 2008

  • Mao, Fenglou; Dam, Phuongan; Chou, Jacky
  • Nucleic Acids Research, Vol. 37, Issue suppl_1
  • DOI: 10.1093/nar/gkn757

DOOR 2.0: presenting operons and their functions through dynamic and integrated views
journal, November 2013

  • Mao, Xizeng; Ma, Qin; Zhou, Chuan
  • Nucleic Acids Research, Vol. 42, Issue D1
  • DOI: 10.1093/nar/gkt1048

Mycoplasma hyopneumoniae Transcription Unit Organization: Genome Survey and Prediction
journal, November 2011


The relative value of operon predictions
journal, April 2008

  • Brouwer, R. W. W.; Kuipers, O. P.; van Hijum, S. A. F. T.
  • Briefings in Bioinformatics, Vol. 9, Issue 5
  • DOI: 10.1093/bib/bbn019

Mapping the Burkholderia cenocepacia niche response via high-throughput sequencing
journal, February 2009

  • Yoder-Himes, D. R.; Chain, P. S. G.; Zhu, Y.
  • Proceedings of the National Academy of Sciences, Vol. 106, Issue 10
  • DOI: 10.1073/pnas.0813403106

Deep sequencing-based discovery of the Chlamydia trachomatis transcriptome
journal, November 2009

  • Albrecht, Marco; Sharma, Cynthia M.; Reinhardt, Richard
  • Nucleic Acids Research, Vol. 38, Issue 3
  • DOI: 10.1093/nar/gkp1032

Application of RNA-seq to reveal the transcript profile in bacteria
journal, January 2011

  • Pinto, A. C.; Melo-Barbosa, H. P.; Miyoshi, A.
  • Genetics and Molecular Research, Vol. 10, Issue 3
  • DOI: 10.4238/vol10-3gmr1554

Computational analysis of bacterial RNA-Seq data
journal, May 2013

  • McClure, Ryan; Balasubramanian, Divya; Sun, Yan
  • Nucleic Acids Research, Vol. 41, Issue 14
  • DOI: 10.1093/nar/gkt444

Transcriptome dynamics-based operon prediction in prokaryotes
journal, May 2014

  • Fortino, Vittorio; Smolander, Olli-Pekka; Auvinen, Petri
  • BMC Bioinformatics, Vol. 15, Issue 1
  • DOI: 10.1186/1471-2105-15-145

Clostridium thermocellum ATCC27405 transcriptomic, metabolomic and proteomic profiles after ethanol stress
journal, January 2012


Solexa Ltd
journal, June 2004


Genome sequencing in microfabricated high-density picolitre reactors
journal, July 2005

  • Margulies, Marcel; Egholm, Michael; Altman, William E.
  • Nature, Vol. 437, Issue 7057, p. 376-380
  • DOI: 10.1038/nature03959

A new framework for identifying cis-regulatory motifs in prokaryotes
journal, December 2010

  • Li, Guojun; Liu, Bingqiang; Ma, Qin
  • Nucleic Acids Research, Vol. 39, Issue 7
  • DOI: 10.1093/nar/gkq948

Minimal metabolic pathway structure is consistent with associated biomolecular interactions
journal, July 2014

  • Bordbar, Aarash; Nagarajan, Harish; Lewis, Nathan E.
  • Molecular Systems Biology, Vol. 10, Issue 7
  • DOI: 10.15252/msb.20145243

RNA degradome--its biogenesis and functions
journal, June 2011

  • Jackowiak, P.; Nowacka, M.; Strozycki, P. M.
  • Nucleic Acids Research, Vol. 39, Issue 17
  • DOI: 10.1093/nar/gkr450

Transcriptome Complexity in a Genome-Reduced Bacterium
journal, November 2009


The Listeria transcriptional landscape from saprophytism to virulence
journal, May 2009

  • Toledo-Arana, Alejandro; Dussurget, Olivier; Nikitas, Georgios
  • Nature, Vol. 459, Issue 7249
  • DOI: 10.1038/nature08080

LIBSVM: A library for support vector machines
journal, April 2011

  • Chang, Chih-Chung; Lin, Chih-Jen
  • ACM Transactions on Intelligent Systems and Technology, Vol. 2, Issue 3
  • DOI: 10.1145/1961189.1961199

Interruptions in gene expression drive highly expressed operons to the leading strand of DNA replication
journal, June 2005


The percentage of bacterial genes on leading versus lagging strands is influenced by multiple balancing forces
journal, June 2012

  • Mao, Xizeng; Zhang, Han; Yin, Yanbin
  • Nucleic Acids Research, Vol. 40, Issue 17
  • DOI: 10.1093/nar/gks605

An integrated toolkit for accurate prediction and analysis of cis-regulatory motifs at a genome scale
journal, July 2013


RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more
journal, November 2012

  • Salgado, Heladia; Peralta-Gil, Martin; Gama-Castro, Socorro
  • Nucleic Acids Research, Vol. 41, Issue D1
  • DOI: 10.1093/nar/gks1201

RegTransBase – a database of regulatory sequences and interactions based on literature: a resource for investigating transcriptional regulation in prokaryotes
journal, January 2013

  • Cipriano, Michael J.; Novichkov, Pavel N.; Kazakov, Alexey E.
  • BMC Genomics, Vol. 14, Issue 1
  • DOI: 10.1186/1471-2164-14-213

DMINDA: an integrated web server for DNA motif identification and analyses
journal, April 2014

  • Ma, Qin; Zhang, Hanyuan; Mao, Xizeng
  • Nucleic Acids Research, Vol. 42, Issue W1
  • DOI: 10.1093/nar/gku315

Rapid, accurate, computational discovery of Rho-independent transcription terminators illuminates their relationship to DNA uptake
journal, January 2007

  • Kingsford, Carleton L.; Ayanbule, Kunmi; Salzberg, Steven L.
  • Genome Biology, Vol. 8, Issue 2
  • DOI: 10.1186/gb-2007-8-2-r22

Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists
journal, November 2008

  • Huang, Da Wei; Sherman, Brad T.; Lempicki, Richard A.
  • Nucleic Acids Research, Vol. 37, Issue 1
  • DOI: 10.1093/nar/gkn923

RNA-Seq Based Transcriptional Map of Bovine Respiratory Disease Pathogen “Histophilus somni 2336”
journal, January 2012


The primary transcriptome of the major human pathogen Helicobacter pylori
journal, February 2010

  • Sharma, Cynthia M.; Hoffmann, Steve; Darfeuille, Fabien
  • Nature, Vol. 464, Issue 7286
  • DOI: 10.1038/nature08756

Works referencing / citing this record:

Bacterial regulon modeling and prediction based on systematic cis regulatory motif analyses
journal, March 2016

  • Liu, Bingqiang; Zhou, Chuan; Li, Guojun
  • Scientific Reports, Vol. 6, Issue 1
  • DOI: 10.1038/srep23030

DOOR: a prokaryotic operon database for genome analyses and functional inference
journal, July 2017

  • Cao, Huansheng; Ma, Qin; Chen, Xin
  • Briefings in Bioinformatics, Vol. 20, Issue 4
  • DOI: 10.1093/bib/bbx088

Revisiting operons: an analysis of the landscape of transcriptional units in E. coli
journal, November 2015


A machine learning classifier trained on cancer transcriptomes detects NF1 inactivation signal in glioblastoma
journal, February 2017


Single-Cell RNA Sequencing of Plant-Associated Bacterial Communities
journal, October 2019

  • Ma, Qin; Bücking, Heike; Gonzalez Hernandez, Jose L.
  • Frontiers in Microbiology, Vol. 10
  • DOI: 10.3389/fmicb.2019.02452

RECTA: Regulon Identification Based on Comparative Genomics and Transcriptomics Analysis
journal, May 2018


A machine learning classifier trained on cancer transcriptomes detects NF1 inactivation signal in glioblastoma
posted_content, December 2016

  • Way, Gregory P.; Allaway, Robert J.; Bouley, Stephanie J.
  • bioRxiv
  • DOI: 10.1101/075382

SeqTU: A Web Server for Identification of Bacterial Transcription Units
journal, March 2017

  • Chen, Xin; Chou, Wen-Chi; Ma, Qin
  • Scientific Reports, Vol. 7, Issue 1
  • DOI: 10.1038/srep43925

RECTA: Regulon Identification Based on Comparative Genomics and Transcriptomics Analysis
journal, March 2018


rSeqTU—A Machine-Learning Based R Package for Prediction of Bacterial Transcription Units
journal, May 2019

