Analysis of Strand-Specific RNA-Seq Data Using Machine Learning Reveals the Structures of Transcription Units in Clostridium thermocellum
Abstract
Identification of transcription units (TUs) encoded in a bacterial genome is essential to elucidation of transcriptional regulation of the organism. To gain a detailed understanding of the dynamically composed TU structures, we have used four strand-specific RNA-seq (ssRNA-seq) datasets collected under two experimental conditions to derive the genomic TU organization of Clostridium thermocellum using a machine-learning approach. Our method accurately predicted the genomic boundaries of individual TUs based on two sets of parameters measuring the RNA-seq expression patterns across the genome: expression-level continuity and variance. A total of 2590 distinct TUs are predicted based on the four RNA-seq datasets. Among the predicted TUs, 44% have multiple genes. We assessed our prediction method on an independent set of RNA-seq data with longer reads. The evaluation confirmed the high quality of the predicted TUs. Functional enrichment analyses on a selected subset of the predicted TUs revealed interesting biology. To demonstrate the generality of the prediction method, we have also applied the method to RNA-seq data collected on Escherichia coli and achieved high prediction accuracies. The TU prediction program named SeqTU is publicly available at https://code.google.com/p/seqtu/. We expect that the predicted TUs can serve as the baseline information for studying transcriptional and post-transcriptional regulation in C. thermocellum and other bacteria.
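The abstract names two families of features, expression-level continuity and variance, measured on RNA-seq coverage. The following is a minimal sketch of how such features might be computed for a pair of adjacent same-strand genes and the gap between them. The function names, the exact feature definitions, and the layout of the feature vector are illustrative assumptions, not taken from SeqTU; the resulting vector could be passed to any binary classifier that decides whether the two genes share a TU.

```python
from statistics import mean, pvariance


def region_features(coverage, start, end):
    """Summarize per-base read coverage over the half-open interval [start, end).

    `coverage` is a list of read depths for one strand (hypothetical input format).
    """
    window = coverage[start:end]
    if not window:
        raise ValueError("empty region")
    return {
        # Average read depth across the region.
        "mean": mean(window),
        # Continuity (assumed definition): fraction of positions with at least one read.
        "continuity": sum(1 for depth in window if depth > 0) / len(window),
        # Population variance of depth; abrupt changes in depth inflate this value.
        "variance": pvariance(window),
    }


def pair_features(coverage, gene_a, gene_b):
    """Build a feature vector for two adjacent genes given as (start, end) tuples.

    The genes must be separated by at least one base, otherwise the gap is empty
    and region_features raises ValueError.
    """
    a = region_features(coverage, *gene_a)
    b = region_features(coverage, *gene_b)
    gap = region_features(coverage, gene_a[1], gene_b[0])
    return [a["mean"], b["mean"], gap["mean"], gap["continuity"], gap["variance"]]


if __name__ == "__main__":
    # Toy coverage: two genes at depth 10 with a partially covered 4-base gap.
    toy = [10] * 10 + [4, 0, 4, 0] + [10] * 10
    print(pair_features(toy, (0, 10), (14, 24)))
```

In this toy layout, a gap that is fully covered at a depth close to that of its flanking genes would suggest co-transcription, while a gap with many uncovered positions would suggest a TU boundary. Real data would need normalization and a trained model to draw that line.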
- Authors:
- Chou, Wen-Chi; Ma, Qin; Yang, Shihui; Cao, Sha; Klingeman, Dawn M.; Brown, Steven D.; Xu, Ying
- Univ. of Georgia, Athens, GA (United States); BioEnergy Science Center, Oak Ridge, TN (United States)
- BioEnergy Science Center, Oak Ridge, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); National Renewable Energy Lab. (NREL), Golden, CO (United States)
- Univ. of Georgia, Athens, GA (United States)
- BioEnergy Science Center, Oak Ridge, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- BioEnergy Science Center, Oak Ridge, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Univ. of Georgia, Athens, GA (United States); BioEnergy Science Center, Oak Ridge, TN (United States); Jilin Univ., Changchun (China)
- Publication Date:
- 2015-03-12
- Research Org.:
- National Renewable Energy Laboratory (NREL), Golden, CO (United States); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). BioEnergy Science Center (BESC)
- Sponsoring Org.:
- USDOE Office of Science (SC), Basic Energy Sciences (BES)
- OSTI Identifier:
- 1242033
- Alternate Identifier(s):
- OSTI ID: 1265796
- Report Number(s):
- NREL/JA-5100-64668; Journal ID: ISSN 0305-1048
- Grant/Contract Number:
- AC36-08GO28308; AC05-00OR22725
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Nucleic Acids Research
- Additional Journal Information:
- Journal Volume: 43; Journal Issue: 10; Related Information: Nucleic Acids Research; Journal ID: ISSN 0305-1048
- Publisher:
- Oxford University Press
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 09 BIOMASS FUELS; 59 BASIC BIOLOGICAL SCIENCES; transcription units (TU); bacterial genome
Citation Formats
Chou, Wen-Chi, Ma, Qin, Yang, Shihui, Cao, Sha, Klingeman, Dawn M., Brown, Steven D., and Xu, Ying. Analysis of Strand-Specific RNA-Seq Data Using Machine Learning Reveals the Structures of Transcription Units in Clostridium thermocellum. United States: N. p., 2015. Web. doi:10.1093/nar/gkv177.
Chou, Wen-Chi, Ma, Qin, Yang, Shihui, Cao, Sha, Klingeman, Dawn M., Brown, Steven D., & Xu, Ying. Analysis of Strand-Specific RNA-Seq Data Using Machine Learning Reveals the Structures of Transcription Units in Clostridium thermocellum. United States. https://doi.org/10.1093/nar/gkv177
Chou, Wen-Chi, Ma, Qin, Yang, Shihui, Cao, Sha, Klingeman, Dawn M., Brown, Steven D., and Xu, Ying. 2015. "Analysis of Strand-Specific RNA-Seq Data Using Machine Learning Reveals the Structures of Transcription Units in Clostridium thermocellum". United States. https://doi.org/10.1093/nar/gkv177. https://www.osti.gov/servlets/purl/1242033.
@article{osti_1242033,
title = {Analysis of Strand-Specific RNA-Seq Data Using Machine Learning Reveals the Structures of Transcription Units in Clostridium thermocellum},
author = {Chou, Wen-Chi and Ma, Qin and Yang, Shihui and Cao, Sha and Klingeman, Dawn M. and Brown, Steven D. and Xu, Ying},
abstractNote = {Identification of transcription units (TUs) encoded in a bacterial genome is essential to elucidation of transcriptional regulation of the organism. To gain a detailed understanding of the dynamically composed TU structures, we have used four strand-specific RNA-seq (ssRNA-seq) datasets collected under two experimental conditions to derive the genomic TU organization of Clostridium thermocellum using a machine-learning approach. Our method accurately predicted the genomic boundaries of individual TUs based on two sets of parameters measuring the RNA-seq expression patterns across the genome: expression-level continuity and variance. A total of 2590 distinct TUs are predicted based on the four RNA-seq datasets. Among the predicted TUs, 44% have multiple genes. We assessed our prediction method on an independent set of RNA-seq data with longer reads. The evaluation confirmed the high quality of the predicted TUs. Functional enrichment analyses on a selected subset of the predicted TUs revealed interesting biology. To demonstrate the generality of the prediction method, we have also applied the method to RNA-seq data collected on Escherichia coli and achieved high prediction accuracies. The TU prediction program named SeqTU is publicly available at https://code.google.com/p/seqtu/. We expect that the predicted TUs can serve as the baseline information for studying transcriptional and post-transcriptional regulation in C. thermocellum and other bacteria.},
doi = {10.1093/nar/gkv177},
journal = {Nucleic Acids Research},
number = 10,
volume = 43,
place = {United States},
year = {2015},
month = {3}
}
Works referenced in this record:
The transcription unit architecture of the Escherichia coli genome
journal, November 2009
- Cho, Byung-Kwan; Zengler, Karsten; Qiu, Yu
- Nature Biotechnology, Vol. 27, Issue 11
Genome-wide operon prediction in Staphylococcus aureus
journal, July 2004
- Wang, L.
- Nucleic Acids Research, Vol. 32, Issue 12
ODB: a database of operons accumulating known operons across multiple genomes
journal, January 2006
- Okuda, S.
- Nucleic Acids Research, Vol. 34, Issue 90001
DBTBS: a database of transcriptional regulation in Bacillus subtilis containing upstream intergenic conservation information
journal, October 2007
- Sierro, Nicolas; Makita, Yuko; de Hoon, Michiel
- Nucleic Acids Research, Vol. 36, Issue suppl_1
OperonDB: a comprehensive database of predicted operons in microbial genomes
journal, January 2009
- Pertea, M.; Ayanbule, K.; Smedinghoff, M.
- Nucleic Acids Research, Vol. 37, Issue Database
DOOR: a database for prokaryotic operons
journal, November 2008
- Mao, Fenglou; Dam, Phuongan; Chou, Jacky
- Nucleic Acids Research, Vol. 37, Issue suppl_1
DOOR 2.0: presenting operons and their functions through dynamic and integrated views
journal, November 2013
- Mao, Xizeng; Ma, Qin; Zhou, Chuan
- Nucleic Acids Research, Vol. 42, Issue D1
Mycoplasma hyopneumoniae Transcription Unit Organization: Genome Survey and Prediction
journal, November 2011
- Siqueira, F. M.; Schrank, A.; Schrank, I. S.
- DNA Research, Vol. 18, Issue 6
The relative value of operon predictions
journal, April 2008
- Brouwer, R. W. W.; Kuipers, O. P.; van Hijum, S. A. F. T.
- Briefings in Bioinformatics, Vol. 9, Issue 5
Mapping the Burkholderia cenocepacia niche response via high-throughput sequencing
journal, February 2009
- Yoder-Himes, D. R.; Chain, P. S. G.; Zhu, Y.
- Proceedings of the National Academy of Sciences, Vol. 106, Issue 10
Deep RNA sequencing of L. monocytogenes reveals overlapping and extensive stationary phase and sigma B-dependent transcriptomes, including multiple highly transcribed noncoding RNAs
journal, January 2009
- Oliver, Haley F.; Orsi, Renato H.; Ponnala, Lalit
- BMC Genomics, Vol. 10, Issue 1
Deep sequencing-based discovery of the Chlamydia trachomatis transcriptome
journal, November 2009
- Albrecht, Marco; Sharma, Cynthia M.; Reinhardt, Richard
- Nucleic Acids Research, Vol. 38, Issue 3
Review Application of RNA-seq to reveal the transcript profile in bacteria
journal, January 2011
- Pinto, A. C.; Melo-Barbosa, H. P.; Miyoshi, A.
- Genetics and Molecular Research, Vol. 10, Issue 3
Computational analysis of bacterial RNA-Seq data
journal, May 2013
- McClure, Ryan; Balasubramanian, Divya; Sun, Yan
- Nucleic Acids Research, Vol. 41, Issue 14
Transcriptome dynamics-based operon prediction in prokaryotes
journal, May 2014
- Fortino, Vittorio; Smolander, Olli-Pekka; Auvinen, Petri
- BMC Bioinformatics, Vol. 15, Issue 1
Clostridium thermocellum ATCC27405 transcriptomic, metabolomic and proteomic profiles after ethanol stress
journal, January 2012
- Yang, Shihui; Giannone, Richard J.; Dice, Lezlee
- BMC Genomics, Vol. 13, Issue 1
Genome sequencing in microfabricated high-density picolitre reactors
journal, July 2005
- Margulies, Marcel; Egholm, Michael; Altman, William E.
- Nature, Vol. 437, Issue 7057, p. 376-380
A new framework for identifying cis-regulatory motifs in prokaryotes
journal, December 2010
- Li, Guojun; Liu, Bingqiang; Ma, Qin
- Nucleic Acids Research, Vol. 39, Issue 7
Minimal metabolic pathway structure is consistent with associated biomolecular interactions
journal, July 2014
- Bordbar, Aarash; Nagarajan, Harish; Lewis, Nathan E.
- Molecular Systems Biology, Vol. 10, Issue 7
RNA degradome--its biogenesis and functions
journal, June 2011
- Jackowiak, P.; Nowacka, M.; Strozycki, P. M.
- Nucleic Acids Research, Vol. 39, Issue 17
Transcriptome Complexity in a Genome-Reduced Bacterium
journal, November 2009
- Güell, Marc; van Noort, Vera; Yus, Eva
- Science, Vol. 326, Issue 5957
The Listeria transcriptional landscape from saprophytism to virulence
journal, May 2009
- Toledo-Arana, Alejandro; Dussurget, Olivier; Nikitas, Georgios
- Nature, Vol. 459, Issue 7249
LIBSVM: A library for support vector machines
journal, April 2011
- Chang, Chih-Chung; Lin, Chih-Jen
- ACM Transactions on Intelligent Systems and Technology, Vol. 2, Issue 3
Interruptions in gene expression drive highly expressed operons to the leading strand of DNA replication
journal, June 2005
- Price, M. N.
- Nucleic Acids Research, Vol. 33, Issue 10
The percentage of bacterial genes on leading versus lagging strands is influenced by multiple balancing forces
journal, June 2012
- Mao, Xizeng; Zhang, Han; Yin, Yanbin
- Nucleic Acids Research, Vol. 40, Issue 17
An integrated toolkit for accurate prediction and analysis of cis-regulatory motifs at a genome scale
journal, July 2013
- Ma, Qin; Liu, Bingqiang; Zhou, Chuan
- Bioinformatics, Vol. 29, Issue 18
RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more
journal, November 2012
- Salgado, Heladia; Peralta-Gil, Martin; Gama-Castro, Socorro
- Nucleic Acids Research, Vol. 41, Issue D1
RegTransBase – a database of regulatory sequences and interactions based on literature: a resource for investigating transcriptional regulation in prokaryotes
journal, January 2013
- Cipriano, Michael J.; Novichkov, Pavel N.; Kazakov, Alexey E.
- BMC Genomics, Vol. 14, Issue 1
DMINDA: an integrated web server for DNA motif identification and analyses
journal, April 2014
- Ma, Qin; Zhang, Hanyuan; Mao, Xizeng
- Nucleic Acids Research, Vol. 42, Issue W1
Rapid, accurate, computational discovery of Rho-independent transcription terminators illuminates their relationship to DNA uptake
journal, January 2007
- Kingsford, Carleton L.; Ayanbule, Kunmi; Salzberg, Steven L.
- Genome Biology, Vol. 8, Issue 2
Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists
journal, November 2008
- Huang, Da Wei; Sherman, Brad T.; Lempicki, Richard A.
- Nucleic Acids Research, Vol. 37, Issue 1
RNA-Seq Based Transcriptional Map of Bovine Respiratory Disease Pathogen “Histophilus somni 2336”
journal, January 2012
- Kumar, Ranjit; Lawrence, Mark L.; Watt, James
- PLoS ONE, Vol. 7, Issue 1
The primary transcriptome of the major human pathogen Helicobacter pylori
journal, February 2010
- Sharma, Cynthia M.; Hoffmann, Steve; Darfeuille, Fabien
- Nature, Vol. 464, Issue 7286
Works referencing / citing this record:
Bacterial regulon modeling and prediction based on systematic cis regulatory motif analyses
journal, March 2016
- Liu, Bingqiang; Zhou, Chuan; Li, Guojun
- Scientific Reports, Vol. 6, Issue 1
DOOR: a prokaryotic operon database for genome analyses and functional inference
journal, July 2017
- Cao, Huansheng; Ma, Qin; Chen, Xin
- Briefings in Bioinformatics, Vol. 20, Issue 4
Revisiting operons: an analysis of the landscape of transcriptional units in E. coli
journal, November 2015
- Mao, Xizeng; Ma, Qin; Liu, Bingqiang
- BMC Bioinformatics, Vol. 16, Issue 1
A machine learning classifier trained on cancer transcriptomes detects NF1 inactivation signal in glioblastoma
journal, February 2017
- Way, Gregory P.; Allaway, Robert J.; Bouley, Stephanie J.
- BMC Genomics, Vol. 18, Issue 1
A New Machine Learning-Based Framework for Mapping Uncertainty Analysis in RNA-Seq Read Alignment and Gene Expression Estimation
journal, August 2018
- McDermaid, Adam; Chen, Xin; Zhang, Yiran
- Frontiers in Genetics, Vol. 9
Single-Cell RNA Sequencing of Plant-Associated Bacterial Communities
journal, October 2019
- Ma, Qin; Bücking, Heike; Gonzalez Hernandez, Jose L.
- Frontiers in Microbiology, Vol. 10
RECTA: Regulon Identification Based on Comparative Genomics and Transcriptomics Analysis
journal, May 2018
- Chen, Xin; Ma, Anjun; McDermaid, Adam
- Genes, Vol. 9, Issue 6
A machine learning classifier trained on cancer transcriptomes detects NF1 inactivation signal in glioblastoma
posted_content, December 2016
- Way, Gregory P.; Allaway, Robert J.; Bouley, Stephanie J.
- bioRxiv
SeqTU: A Web Server for Identification of Bacterial Transcription Units
journal, March 2017
- Chen, Xin; Chou, Wen-Chi; Ma, Qin
- Scientific Reports, Vol. 7, Issue 1
RECTA: Regulon Identification Based on Comparative Genomics and Transcriptomics Analysis
journal, March 2018
- Chen, Xin; Ma, Anjun; McDermaid, Adam
- bioRxiv
rSeqTU—A Machine-Learning Based R Package for Prediction of Bacterial Transcription Units
journal, May 2019
- Niu, Sheng-Yong; Liu, Binqiang; Ma, Qin
- Frontiers in Genetics, Vol. 10