OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Generation and evaluation of synthetic patient data

Journal Article · BMC Medical Research Methodology (Online)
 [1];  [1];  [1];  [2];  [2];  [1]
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  2. Information Management Systems, Rockville, MD (United States)

Background: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. A major reason for this has been the lack of availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic, synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high dimensional and often categorical. These characteristics pose multiple modeling challenges.

Methods: In this paper, we evaluate three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed.

Results: While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance, Epidemiology, and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015 and includes over 360,000 individual cases.

Conclusions: We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.
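
As a rough illustration of what the simplest kind of probabilistic generator and one possible quality metric might look like for categorical records, the sketch below fits independent per-column frequencies, samples synthetic rows from them, and compares real and synthetic marginals with total variation distance. This is not the method evaluated in the paper: the function names (`fit_marginals`, `sample_synthetic`, `total_variation`), the toy records, and the choice of metric are all assumptions made for illustration, and the toy data is invented rather than drawn from SEER.

```python
import random
from collections import Counter


def fit_marginals(records):
    """Estimate one categorical distribution per column, ignoring
    dependencies between columns (a deliberately naive baseline)."""
    marginals = {}
    for col in records[0].keys():
        counts = Counter(row[col] for row in records)
        total = sum(counts.values())
        marginals[col] = {value: n / total for value, n in counts.items()}
    return marginals


def sample_synthetic(marginals, n, seed=0):
    """Draw n synthetic rows, sampling each column independently."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for col, dist in marginals.items():
            values = list(dist.keys())
            weights = list(dist.values())
            row[col] = rng.choices(values, weights=weights, k=1)[0]
        rows.append(row)
    return rows


def total_variation(p, q):
    """Total variation distance between two categorical distributions
    given as {value: probability} dicts (0 = identical, 1 = disjoint)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


# Hypothetical toy records; the column names and values are made up.
real = [
    {"site": "breast", "stage": "I"},
    {"site": "breast", "stage": "II"},
    {"site": "respiratory", "stage": "III"},
    {"site": "respiratory", "stage": "IV"},
    {"site": "non-solid", "stage": "II"},
    {"site": "breast", "stage": "I"},
]

real_marginals = fit_marginals(real)
synthetic = sample_synthetic(real_marginals, n=1000, seed=42)
synthetic_marginals = fit_marginals(synthetic)

# Per-column distance between real and synthetic marginals.
for col in real_marginals:
    distance = total_variation(real_marginals[col], synthetic_marginals[col])
    print(f"{col}: total variation distance = {distance:.3f}")
```

A generator of this kind reproduces single-column frequencies but discards all relationships between columns, which is one reason a comparison of marginals alone would be an incomplete measure of synthetic data quality.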

Research Organization:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Argonne National Lab. (ANL), Argonne, IL (United States); Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); National Cancer Institute (NCI); National Institutes of Health (NIH)
Grant/Contract Number:
AC52-07NA27344; AC02-06CH11357; AC52-06NA25396; AC05-00OR22725
OSTI ID:
1643772
Report Number(s):
LLNL-JRNL-769883; 961118
Journal Information:
BMC Medical Research Methodology (Online), Vol. 20, Issue 1; ISSN 1471-2288
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 47 works
Citation information provided by
Web of Science