SynthNotes: A Generator Framework for High-volume, High-fidelity Synthetic Mental Health Notes
Abstract
One of the key emerging challenges that connects the "Big Data" and AI domains is the availability of sufficient volumes of training data for AI/Machine Learning tasks. SynthNotes is a framework for generating standards-compliant, realistic mental health progress report notes at very large, population-level scale and in a strictly privacy-preserving manner. Our framework, motivated by the need to explore, evaluate, and train computational methods for the emerging mental health crisis in the US, is useful for benchmarking, optimization, and training of biomedical natural language processing, information extraction, and machine learning systems intended to operate at "Big Data" scale (billions of notes). The free-text notes generated by SynthNotes are based on the literature and public statistical models, allowing for a realistic natural-language representation of a patient and his or her mental health characteristics. Additionally, SynthNotes can partially simulate the stylistic, grammatical, and expressive characteristics of a licensed mental health professional. SynthNotes is modular and flexible, allowing for representation of a variety of conditions, incorporation of alternative foundational models, and parametrization of the variability of the structure, content, and size of the synthetically generated corpus. In this paper, we report on the initial use and performance characteristics of our SynthNotes framework and on the ongoing work for inclusion of content planning and deep learning-based generative methods trained on real data.
- Authors:
- Begoli, Edmon; Brown, Kris; Srinivasan, Sudarshan; Tamang, Suzanne
- ORNL
- Stanford University
- Publication Date:
- Research Org.:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1507868
- DOE Contract Number:
- AC05-00OR22725
- Resource Type:
- Conference
- Resource Relation:
- Conference: 2018 IEEE International Conference on Big Data - Seattle, Washington, United States of America - 8/10/2018 8:00:00 AM-8/13/2018 8:00:00 AM
- Country of Publication:
- United States
- Language:
- English
Citation Formats
Begoli, Edmon, Brown, Kris, Srinivasan, Sudarshan, and Tamang, Suzanne. SynthNotes: A Generator Framework for High-volume, High-fidelity Synthetic Mental Health Notes. United States: N. p., 2018. Web. doi:10.1109/BigData.2018.8621981.
Begoli, Edmon, Brown, Kris, Srinivasan, Sudarshan, & Tamang, Suzanne. SynthNotes: A Generator Framework for High-volume, High-fidelity Synthetic Mental Health Notes. United States. https://doi.org/10.1109/BigData.2018.8621981
Begoli, Edmon, Brown, Kris, Srinivasan, Sudarshan, and Tamang, Suzanne. 2018. "SynthNotes: A Generator Framework for High-volume, High-fidelity Synthetic Mental Health Notes". United States. https://doi.org/10.1109/BigData.2018.8621981. https://www.osti.gov/servlets/purl/1507868.
@article{osti_1507868,
title = {SynthNotes: A Generator Framework for High-volume, High-fidelity Synthetic Mental Health Notes},
author = {Begoli, Edmon and Brown, Kris and Srinivasan, Sudarshan and Tamang, Suzanne},
abstractNote = {One of the key emerging challenges that connects the "Big Data" and AI domains is the availability of sufficient volumes of training data for AI/Machine Learning tasks. SynthNotes is a framework for generating standards-compliant, realistic mental health progress report notes at very large, population-level scale and in a strictly privacy-preserving manner. Our framework, motivated by the need to explore, evaluate, and train computational methods for the emerging mental health crisis in the US, is useful for benchmarking, optimization, and training of biomedical natural language processing, information extraction, and machine learning systems intended to operate at "Big Data" scale (billions of notes). The free-text notes generated by SynthNotes are based on the literature and public statistical models, allowing for a realistic natural-language representation of a patient and his or her mental health characteristics. Additionally, SynthNotes can partially simulate the stylistic, grammatical, and expressive characteristics of a licensed mental health professional. SynthNotes is modular and flexible, allowing for representation of a variety of conditions, incorporation of alternative foundational models, and parametrization of the variability of the structure, content, and size of the synthetically generated corpus. In this paper, we report on the initial use and performance characteristics of our SynthNotes framework and on the ongoing work for inclusion of content planning and deep learning-based generative methods trained on real data.},
doi = {10.1109/BigData.2018.8621981},
url = {https://www.osti.gov/biblio/1507868},
place = {United States},
year = {2018},
month = {12}
}