Optimal vocabulary selection approaches for privacy-preserving deep NLP model training for information extraction and cancer epidemiology

Yoon, Hong-Jun; Stanley, Christopher B.; Christian, J. Blair; Klasky, Hilda B.; Blanchard, Andrew E.; Durbin, Eric B.; Wu, Xiao-Cheng; Stroup, Antoinette; Doherty, Jennifer; Schwartz, Stephen M.; Wiggins, Charles; Damesyn, Mark; Coyle, Linda; Tourassi, Georgia

doi:10.3233/cbm-210306

Journal Article · Sun Feb 13 23:00:00 EST 2022 · Cancer Biomarkers

DOI: https://doi.org/10.3233/cbm-210306 · OSTI ID: 1855683

Yoon, Hong-Jun ^[1]; Stanley, Christopher B. ^[1]; Christian, J. Blair ^[1]; Klasky, Hilda B. ^[1]; Blanchard, Andrew E. ^[1]; Durbin, Eric B. ^[2]; Wu, Xiao-Cheng ^[3]; Stroup, Antoinette ^[4]; Doherty, Jennifer ^[5]; Schwartz, Stephen M. ^[6]; Wiggins, Charles ^[7]; Damesyn, Mark ^[8]; Coyle, Linda ^[9]; Tourassi, Georgia ^[1]

[1] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
[2] Univ. of Kentucky, Lexington, KY (United States)
[3] Louisiana State Univ., New Orleans, LA (United States). Health Science Center
[4] Rutgers Cancer Institute of New Jersey, New Brunswick, NJ (United States)
[5] Univ. of Utah, Salt Lake City, UT (United States)
[6] Fred Hutchinson Cancer Research Center, Seattle, WA (United States)
[7] Univ. of New Mexico, Albuquerque, NM (United States)
[8] California Department of Public Health, Sacramento, CA (United States)
[9] Information Management Services Inc., Calverton, MD (United States)

As artificial intelligence and machine learning techniques are adopted in biomedical informatics, the security and privacy of the data and of subject identities have become an important concern and an essential research topic. Without intentional safeguards, machine learning models may improve task performance by finding patterns and features that are associated with private personal information. The privacy vulnerability of deep learning models that extract information from medical text needs to be quantified, since these models are exposed to private health information and personally identifiable information. The objective of this study is to quantify the privacy vulnerability of deep learning models for natural language processing and to explore a proper way of securing patients' information to mitigate confidentiality breaches. The target model is a multitask convolutional neural network for information extraction from cancer pathology reports, trained on data from multiple state population-based cancer registries. The study proposes two schemes for collecting the vocabulary from the cancer pathology reports: (a) words appearing in multiple registries, and (b) words that have higher mutual information. We performed membership inference attacks on the models in high-performance computing environments. The comparison suggests that the proposed vocabulary selection methods resulted in lower privacy vulnerability while maintaining the same level of clinical task performance.
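The two vocabulary selection schemes named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenizer, the `min_registries` threshold, the top-`k` cutoff, and the toy reports are all assumptions made for illustration, and mutual information is computed here between binary word presence and a single class label.

```python
import math
from collections import Counter, defaultdict


def words_in_multiple_registries(docs_by_registry, min_registries=2):
    """Scheme (a), as assumed here: keep words seen in at least
    `min_registries` distinct registries."""
    registry_count = Counter()
    for docs in docs_by_registry.values():
        seen = set()
        for doc in docs:
            seen.update(doc.lower().split())
        registry_count.update(seen)  # each registry counts a word once
    return {w for w, n in registry_count.items() if n >= min_registries}


def mutual_information(docs, labels):
    """Mutual information (in nats) between word presence/absence
    and the document label, for every word in the corpus."""
    n = len(docs)
    label_count = Counter(labels)
    word_label = defaultdict(Counter)  # word -> label -> docs containing word
    for doc, y in zip(docs, labels):
        for w in set(doc.lower().split()):
            word_label[w][y] += 1
    scores = {}
    for w, by_label in word_label.items():
        n_w = sum(by_label.values())  # docs containing w
        mi = 0.0
        for y, n_y in label_count.items():
            for present, joint in ((True, by_label[y]), (False, n_y - by_label[y])):
                if joint == 0:
                    continue  # 0 * log(0) is taken as 0
                n_x = n_w if present else n - n_w
                mi += (joint / n) * math.log(joint * n / (n_x * n_y))
        scores[w] = mi
    return scores


def top_k_by_mi(docs, labels, k):
    """Scheme (b), as assumed here: keep the k highest-MI words."""
    scores = mutual_information(docs, labels)
    return set(sorted(scores, key=lambda w: (-scores[w], w))[:k])


# Toy data (hypothetical): surnames stand in for identifying tokens.
docs_by_registry = {
    "registry_A": ["carcinoma breast smith", "benign lung tissue"],
    "registry_B": ["carcinoma lung jones", "benign breast tissue"],
}
docs = [d for ds in docs_by_registry.values() for d in ds]
labels = ["malignant", "benign", "malignant", "benign"]

shared = words_in_multiple_registries(docs_by_registry)
scores = mutual_information(docs, labels)
informative = top_k_by_mi(docs, labels, k=3)
```

In this toy corpus both schemes drop the surname tokens: each surname occurs in only one registry, and a word that occurs in a single report carries less information about the label than a word that tracks the label across reports.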

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC); National Institutes of Health (NIH); USDOE Laboratory Directed Research and Development (LDRD) Program; Centers for Disease Control and Prevention (CDC); National Cancer Institute (NCI); National Program of Cancer Registries (NPCR)

Grant/Contract Number: AC05-00OR22725; AC02-06CH11357; AC52-07NA27344; AC52-06NA25396

OSTI ID: 1855683

Journal Information: Cancer Biomarkers, Vol. 33, Issue 2; ISSN 1574-0153

Publisher: IOS Press

Country of Publication: United States

Language: English

Related Subjects

59 BASIC BIOLOGICAL SCIENCES
artificial intelligence
cancer epidemiology
deep learning
natural language processing
privacy
privacy-preserving training