OSTI.GOV | U.S. Department of Energy
Office of Scientific and Technical Information

Title: Scalable deep text comprehension for Cancer surveillance on high-performance computing

Abstract

Background: Deep Learning (DL) has advanced state-of-the-art capabilities in bioinformatics applications, driving a trend toward increasingly sophisticated and computationally demanding models trained on ever-larger data sets. This vastly increased computational demand challenges the feasibility of conducting cutting-edge research. One solution is to distribute the computational workload across multiple computing cluster nodes with data parallelism algorithms. In this study, we used a High-Performance Computing environment and implemented the Downpour Stochastic Gradient Descent algorithm for data parallelism to train a Convolutional Neural Network (CNN) for the natural language processing task of information extraction from a massive dataset of cancer pathology reports. We evaluated the scalability improvements of data parallelism training on the Titan supercomputer at the Oak Ridge Leadership Computing Facility, varying the number of worker nodes and comparing the effects of different training batch sizes and optimizer functions.

Results: We found that Adadelta consistently converged to a lower validation loss, though it required over twice as many training epochs as the fastest-converging optimizer, RMSProp. The Adam optimizer consistently achieved a close second-place minimum validation loss significantly faster; with batch sizes of 16 and 32, the network converged in only 4.5 training epochs.

Conclusions: We demonstrated that the networked training process is scalable across multiple compute nodes communicating via the Message Passing Interface, while achieving higher classification accuracy than a traditional machine learning algorithm.
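The Downpour SGD scheme described in the abstract can be illustrated with a minimal sketch: a central parameter server holds the model weights, and each worker repeatedly fetches the current weights, computes a gradient on its own data shard, and pushes that gradient back without synchronizing with the other workers. The sketch below simulates this on a toy least-squares model; all class and variable names (`ParameterServer`, `worker_gradient`, the round-robin worker schedule) are illustrative assumptions, not the paper's actual implementation, which used MPI across Titan compute nodes.

```python
import numpy as np

class ParameterServer:
    """Central store for model weights; workers pull and push asynchronously."""
    def __init__(self, dim):
        self.w = np.zeros(dim)

    def fetch(self):
        # Each worker gets a (possibly stale) copy of the current weights.
        return self.w.copy()

    def push(self, grad, lr=0.1):
        # Apply one worker's gradient immediately, without a global barrier.
        self.w -= lr * grad

def worker_gradient(w, X, y):
    # Least-squares gradient computed on this worker's local data shard.
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

# Partition the training data across 4 simulated workers.
shards = [(X[i::4], y[i::4]) for i in range(4)]

server = ParameterServer(3)
for step in range(200):
    Xi, yi = shards[step % 4]       # workers take turns (simulated asynchrony)
    w_local = server.fetch()        # pull current weights
    server.push(worker_gradient(w_local, Xi, yi))  # push local gradient

print(server.w)  # should be close to true_w
```

In a real deployment the workers run concurrently on separate nodes, so pushed gradients are computed against stale weights; Downpour SGD tolerates this staleness, which is what lets training scale across many loosely coupled compute nodes.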

Authors:
 Qiu, John X. [1]; Yoon, Hong-Jun [1]; Srivastava, Kshitij [1]; Watson, Thomas [2]; Christian, Blair [1]; Ramanathan, Arvind [1]; Wu, Xiao-Cheng [3]; Fearn, Paul A. [4]; Tourassi, Georgia [1]
  1. ORNL
  2. University of Memphis
  3. LSUHSC-Louisiana Tumor Registry
  4. National Cancer Institute, Bethesda, MD
Publication Date:
December 2018
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1491345
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Journal Volume: 19; Journal Issue: S18; Conference: SuperComputing 17, Denver, Colorado, United States of America, 12-17 November 2017
Country of Publication:
United States
Language:
English

Citation Formats

Qiu, John X., Yoon, Hong-Jun, Srivastava, Kshitij, Watson, Thomas, Christian, Blair, Ramanathan, Arvind, Wu, Xiao-Cheng, Fearn, Paul A., and Tourassi, Georgia. Scalable deep text comprehension for Cancer surveillance on high-performance computing. United States: N. p., 2018. Web. doi:10.1186/s12859-018-2511-9.
Qiu, John X., Yoon, Hong-Jun, Srivastava, Kshitij, Watson, Thomas, Christian, Blair, Ramanathan, Arvind, Wu, Xiao-Cheng, Fearn, Paul A., & Tourassi, Georgia. Scalable deep text comprehension for Cancer surveillance on high-performance computing. United States. doi:10.1186/s12859-018-2511-9.
Qiu, John X., Yoon, Hong-Jun, Srivastava, Kshitij, Watson, Thomas, Christian, Blair, Ramanathan, Arvind, Wu, Xiao-Cheng, Fearn, Paul A., and Tourassi, Georgia. "Scalable deep text comprehension for Cancer surveillance on high-performance computing". United States. doi:10.1186/s12859-018-2511-9. https://www.osti.gov/servlets/purl/1491345.
@article{osti_1491345,
title = {Scalable deep text comprehension for Cancer surveillance on high-performance computing},
author = {Qiu, John X. and Yoon, Hong-Jun and Srivastava, Kshitij and Watson, Thomas and Christian, Blair and Ramanathan, Arvind and Wu, Xiao-Cheng and Fearn, Paul A. and Tourassi, Georgia},
abstractNote = {Background: Deep Learning (DL) has advanced state-of-the-art capabilities in bioinformatics applications, driving a trend toward increasingly sophisticated and computationally demanding models trained on ever-larger data sets. This vastly increased computational demand challenges the feasibility of conducting cutting-edge research. One solution is to distribute the computational workload across multiple computing cluster nodes with data parallelism algorithms. In this study, we used a High-Performance Computing environment and implemented the Downpour Stochastic Gradient Descent algorithm for data parallelism to train a Convolutional Neural Network (CNN) for the natural language processing task of information extraction from a massive dataset of cancer pathology reports. We evaluated the scalability improvements of data parallelism training on the Titan supercomputer at the Oak Ridge Leadership Computing Facility, varying the number of worker nodes and comparing the effects of different training batch sizes and optimizer functions. Results: We found that Adadelta consistently converged to a lower validation loss, though it required over twice as many training epochs as the fastest-converging optimizer, RMSProp. The Adam optimizer consistently achieved a close second-place minimum validation loss significantly faster; with batch sizes of 16 and 32, the network converged in only 4.5 training epochs. Conclusions: We demonstrated that the networked training process is scalable across multiple compute nodes communicating via the Message Passing Interface, while achieving higher classification accuracy than a traditional machine learning algorithm.},
doi = {10.1186/s12859-018-2511-9},
journal = {BMC Bioinformatics},
issn = {1471-2105},
number = {S18},
volume = {19},
place = {United States},
year = {2018},
month = {12}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

