OSTI.GOV: U.S. Department of Energy
Office of Scientific and Technical Information

Title: Generation and evaluation of synthetic patient data

Abstract

Background: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. A major reason for this has been the lack of availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic, synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high-dimensional and often categorical. These characteristics pose multiple modeling challenges.

Methods: In this paper, we evaluate three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed.

Results: While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance, Epidemiology, and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015, which includes over 360,000 individual cases.

Conclusions: We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.

Authors:
Goncalves, Andre [1]; Ray, Priyadip [1]; Soper, Braden [1]; Stevens, Jennifer [2]; Coyle, Linda [2]; Sales, Ana Paula [1]
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  2. Information Management Systems, Rockville, MD (United States)
Publication Date:
May 2020
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Argonne National Lab. (ANL), Argonne, IL (United States); Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA); National Cancer Institute (NCI); National Institutes of Health (NIH)
OSTI Identifier:
1643772
Report Number(s):
LLNL-JRNL-769883
Journal ID: ISSN 1471-2288; 961118
Grant/Contract Number:  
AC52-07NA27344; AC02-06CH11357; AC52-06NA25396; AC05-00OR22725
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
BMC Medical Research Methodology (Online)
Additional Journal Information:
Journal Volume: 20; Journal Issue: 1; Journal ID: ISSN 1471-2288
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; 97 MATHEMATICS AND COMPUTING; 99 GENERAL AND MISCELLANEOUS; Synthetic data generation; Cancer patient data; Information disclosure; Generative models

Citation Formats

Goncalves, Andre, Ray, Priyadip, Soper, Braden, Stevens, Jennifer, Coyle, Linda, and Sales, Ana Paula. Generation and evaluation of synthetic patient data. United States: N. p., 2020. Web. doi:10.1186/s12874-020-00977-1.
Goncalves, Andre, Ray, Priyadip, Soper, Braden, Stevens, Jennifer, Coyle, Linda, & Sales, Ana Paula. Generation and evaluation of synthetic patient data. United States. https://doi.org/10.1186/s12874-020-00977-1
Goncalves, Andre, Ray, Priyadip, Soper, Braden, Stevens, Jennifer, Coyle, Linda, and Sales, Ana Paula. 2020. "Generation and evaluation of synthetic patient data". United States. https://doi.org/10.1186/s12874-020-00977-1. https://www.osti.gov/servlets/purl/1643772.
@article{osti_1643772,
title = {Generation and evaluation of synthetic patient data},
author = {Goncalves, Andre and Ray, Priyadip and Soper, Braden and Stevens, Jennifer and Coyle, Linda and Sales, Ana Paula},
abstractNote = {Background: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. A major reason for this has been the lack of availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic, synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high dimensional and often categorical. These characteristics pose multiple modeling challenges. Methods: In this paper, we evaluate three classes of synthetic data generation approaches; probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed. Results: While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance Epidemiology and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015, which includes over 360,000 individual cases. Conclusions: We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.},
doi = {10.1186/s12874-020-00977-1},
url = {https://www.osti.gov/biblio/1643772},
journal = {BMC Medical Research Methodology (Online)},
issn = {1471-2288},
number = 1,
volume = 20,
place = {United States},
year = {2020},
month = {5}
}
