Optimal vocabulary selection approaches for privacy-preserving deep NLP model training for information extraction and cancer epidemiology
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Univ. of Kentucky, Lexington, KY (United States)
- Louisiana State Univ., New Orleans, LA (United States). Health Science Center
- Rutgers Cancer Institute of New Jersey, New Brunswick, NJ (United States)
- Univ. of Utah, Salt Lake City, UT (United States)
- Fred Hutchinson Cancer Research Center, Seattle, WA (United States)
- Univ. of New Mexico, Albuquerque, NM (United States)
- California Department of Public Health, Sacramento, CA (United States)
- Information Management Services Inc., Calverton, MD (United States)
With the use of artificial intelligence and machine learning techniques in biomedical informatics, security and privacy concerns over the data and subject identities have become an important issue and an essential research topic. Without intentional safeguards, machine learning models may improve task performance by finding patterns and features that are associated with private personal information. The privacy vulnerability of deep learning models for information extraction from medical text needs to be quantified, since the models are exposed to private health information and personally identifiable information. The objective of this study is to quantify the privacy vulnerability of deep learning models for natural language processing and to explore a proper way of securing patients’ information to mitigate confidentiality breaches. The target model is a multitask convolutional neural network for information extraction from cancer pathology reports, trained on data from multiple state population-based cancer registries. This study proposes the following schemes for collecting vocabularies from the cancer pathology reports: (a) words appearing in multiple registries, and (b) words that have higher mutual information. We performed membership inference attacks on the models in high-performance computing environments. The comparison outcomes suggest that the proposed vocabulary selection methods resulted in lower privacy vulnerability while maintaining the same level of clinical task performance.
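The two vocabulary selection schemes named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the `min_registries` threshold, the whitespace-tokenized input, and the choice to score mutual information between a word's presence in a report and the report's task label are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter, defaultdict


def words_in_multiple_registries(docs_by_registry, min_registries=2):
    """Scheme (a), sketched: keep words seen in at least `min_registries` registries.

    docs_by_registry maps a registry name to a list of tokenized reports.
    The threshold of 2 is an illustrative assumption.
    """
    registry_count = Counter()
    for docs in docs_by_registry.values():
        seen = set()
        for tokens in docs:
            seen.update(tokens)
        # Each registry contributes at most one count per word.
        registry_count.update(seen)
    return {w for w, n in registry_count.items() if n >= min_registries}


def mutual_information(docs, labels):
    """Scheme (b), sketched: MI (in nats) between word presence and label.

    Assumes the relevant quantity is I(word present in report; task label),
    estimated from document counts.
    """
    n = len(docs)
    label_count = Counter(labels)
    word_label = defaultdict(Counter)  # word -> label -> reports containing it
    for tokens, y in zip(docs, labels):
        for w in set(tokens):
            word_label[w][y] += 1

    scores = {}
    for w, by_label in word_label.items():
        n_w = sum(by_label.values())
        mi = 0.0
        for y, n_y in label_count.items():
            n_wy = by_label.get(y, 0)
            # Sum over the two presence states: word present, word absent.
            for joint, marginal in ((n_wy, n_w), (n_y - n_wy, n - n_w)):
                if joint > 0:
                    mi += (joint / n) * math.log(joint * n / (marginal * n_y))
        scores[w] = mi
    return scores


def top_k_by_score(scores, k):
    """Return the k highest-scoring words, ties broken alphabetically."""
    return [w for w, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
```

In this sketch, a word that occurs in only one registry, such as a patient or physician name, is dropped by scheme (a), and a word whose presence says little about the label receives a low score under scheme (b).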
- Research Organization:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC); National Institutes of Health (NIH); USDOE Laboratory Directed Research and Development (LDRD) Program; Centers for Disease Control and Prevention (CDC); National Cancer Institute (NCI); National Program of Cancer Registries (NPCR)
- Grant/Contract Number:
- AC05-00OR22725; AC02-06CH11357; AC52-07NA27344; AC52-06NA25396; HHSN261201800032I; HHSN261201800015I; HHSN261201800009I; HHSN261201800013I; U58DP00003907; HHSN26120180000; NU58DP006332-02-00; HHSN261201300021I; NU58DP006279-02-00; HHSN261201800014I; HHSN26100001; HHSN261291800004I; HHSN261201800016I
- OSTI ID:
- 1855683
- Journal Information:
- Cancer Biomarkers, Vol. 33, Issue 2; ISSN 1574-0153
- Publisher:
- IOS Press
- Country of Publication:
- United States
- Language:
- English
Similar Records
Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types
A Keyword-Enhanced Approach to Handle Class Imbalance in Clinical Text Classification