Generation and evaluation of synthetic patient data

Goncalves, Andre; Ray, Priyadip; Soper, Braden; Stevens, Jennifer; Coyle, Linda; Sales, Ana Paula

doi:10.1186/s12874-020-00977-1

Journal Article · May 7, 2020 · BMC Medical Research Methodology (Online)

DOI: https://doi.org/10.1186/s12874-020-00977-1 · OSTI ID: 1643772

Goncalves, Andre [1]; Ray, Priyadip [1]; Soper, Braden [1]; Stevens, Jennifer [2]; Coyle, Linda [2]; Sales, Ana Paula [1]

[1] Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
[2] Information Management Systems, Rockville, MD (United States)

Background: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. A major reason for this has been the limited availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high dimensional and often categorical. These characteristics pose multiple modeling challenges.

Methods: In this paper, we evaluate three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed.

Results: While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance, Epidemiology, and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015 and includes over 360,000 individual cases.

Conclusions: We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.

Research Organization: Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Argonne National Lab. (ANL), Argonne, IL (United States); Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA); National Cancer Institute (NCI); National Institutes of Health (NIH)

Grant/Contract Number: AC52-07NA27344; AC02-06CH11357; AC52-06NA25396; AC05-00OR22725

OSTI ID: 1643772

Report Number(s): LLNL-JRNL-769883; 961118

Journal Information: BMC Medical Research Methodology (Online), Vol. 20, Issue 1; ISSN 1471-2288

Publisher: BioMed Central

Country of Publication: United States

Language: English

Citation Metrics:

Cited by: 47 works (citation information provided by Web of Science)

Related Subjects

59 BASIC BIOLOGICAL SCIENCES
97 MATHEMATICS AND COMPUTING
99 GENERAL AND MISCELLANEOUS
Synthetic data generation
Cancer patient data
Information disclosure
Generative models