OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Predictive Big Data Analytics: A Study of Parkinson’s Disease Using Large, Complex, Heterogeneous, Incongruent, Multi-Source and Incomplete Observations

Journal Article · PLoS ONE
Author Affiliations:
  1. Univ. of Michigan, Ann Arbor, MI (United States). Statistics Online Computational Resource, School of Nursing, Michigan Institute for Data Science; Univ. of Southern California, Los Angeles, CA (United States). Stevens Neuroimaging and Informatics Institute; Univ. of Michigan, Ann Arbor, MI (United States). Udall Center of Excellence for Parkinson’s Disease Research
  2. Institute for Systems Biology, Seattle, WA (United States)
  3. Univ. of Michigan, Ann Arbor, MI (United States). Statistics Online Computational Resource, School of Nursing, Michigan Institute for Data Science
  4. Argonne National Lab. (ANL), Argonne, IL (United States); Univ. of Chicago, IL (United States). Computation Institute
  5. Univ. of Southern California, Los Angeles, CA (United States). Information Sciences Institute
  6. Univ. of Southern California, Los Angeles, CA (United States). Stevens Neuroimaging and Informatics Institute
  7. Univ. of Michigan, Ann Arbor, MI (United States). Udall Center of Excellence for Parkinson’s Disease Research
  8. Univ. of Michigan, Ann Arbor, MI (United States). Department of Psychiatry and Michigan Alzheimer’s Disease Center; Veterans Affairs Ann Arbor Healthcare System, Ann Arbor, MI (United States)

Background
A unique archive of Big Data on Parkinson’s disease is collected, managed and disseminated by the Parkinson’s Progression Markers Initiative (PPMI). The integration of such complex and heterogeneous Big Data from multiple sources offers unparalleled opportunities to study the early stages of prevalent neurodegenerative processes, track their progression and quickly identify the efficacies of alternative treatments. Many previous human and animal studies have examined the relationship of Parkinson’s disease (PD) risk to trauma, genetics, environment, co-morbidities, or lifestyle. The defining characteristics of Big Data (large size, incongruency, incompleteness, complexity, multiplicity of scales, and heterogeneity of information-generating sources) all pose challenges to the classical techniques for data management, processing, visualization and interpretation. We propose, implement, test and validate complementary model-based and model-free approaches for PD classification and prediction. To explore PD risk using Big Data methodology, we jointly processed complex PPMI imaging, genetics, clinical and demographic data.

Methods and Findings
Collective representation of the multi-source data facilitates the aggregation and harmonization of complex data elements. This enables joint modeling of the complete data, leading to the development of Big Data analytics, predictive synthesis, and statistical validation. Using heterogeneous PPMI data, we developed a comprehensive protocol for end-to-end data characterization, manipulation, processing, cleaning, analysis and validation. Specifically, we (i) introduce methods for rebalancing imbalanced cohorts, (ii) utilize a wide spectrum of classification methods to generate consistent and powerful phenotypic predictions, and (iii) generate reproducible machine-learning-based classification that enables the reporting of model parameters and diagnostic forecasting based on new data.
We evaluated several complementary model-based predictive approaches, which failed to generate accurate and reliable diagnostic predictions. However, the results of several machine-learning-based classification methods indicated significant power to predict Parkinson’s disease in the PPMI subjects (consistent accuracy, sensitivity, and specificity exceeding 96%, confirmed using statistical n-fold cross-validation). Clinical (e.g., Unified Parkinson’s Disease Rating Scale (UPDRS) scores), demographic (e.g., age), genetics (e.g., rs34637584, chr12), and derived neuroimaging biomarker (e.g., cerebellum shape index) data all contributed to the predictive analytics and diagnostic forecasting.

Conclusions
Model-free Big Data machine-learning-based classification methods (e.g., adaptive boosting, support vector machines) can outperform model-based techniques in terms of predictive precision and reliability (e.g., forecasting patient diagnosis). We observed that statistical rebalancing of cohort sizes yields better discrimination of group differences, specifically for predictive analytics based on heterogeneous and incomplete PPMI data. UPDRS scores play a critical role in predicting diagnosis, which is expected based on the clinical definition of Parkinson’s disease. Even without longitudinal UPDRS data, however, the accuracy of model-free machine-learning-based classification exceeds 80%. The methods, software and protocols developed here are openly shared and can be employed to study other neurodegenerative disorders (e.g., Alzheimer’s, Huntington’s, amyotrophic lateral sclerosis), as well as for other predictive Big Data analytics applications.
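
The evaluation loop summarized in the abstract (rebalance an imbalanced cohort, classify, then report accuracy, sensitivity and specificity under n-fold cross-validation) can be illustrated with a small, self-contained sketch. It is not the authors' protocol or software. Everything in it is an assumption made for illustration: the two-feature synthetic cohort stands in for PPMI data, plain random oversampling stands in for whatever rebalancing method the study used, a 3-nearest-neighbour vote stands in for the model-free classifiers named in the abstract (adaptive boosting, support vector machines), and all function names are hypothetical.

```python
import random
from collections import Counter

def confusion_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (true-positive rate) and specificity (true-negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }

def oversample(X, y, rng):
    """Random oversampling (an assumed stand-in for the study's rebalancing):
    duplicate rows of smaller classes until every class matches the largest."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), lab)
        for row, lab in zip(X_train, y_train)
    )[:k]
    return Counter(lab for _, lab in nearest).most_common(1)[0][0]

def cross_validate(X, y, n_folds=5, seed=0):
    """n-fold cross-validation; rebalancing touches training rows only."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    y_true, y_pred = [], []
    for fold in folds:
        held_out = set(fold)
        X_tr = [X[i] for i in idx if i not in held_out]
        y_tr = [y[i] for i in idx if i not in held_out]
        X_tr, y_tr = oversample(X_tr, y_tr, rng)
        for i in fold:
            y_true.append(y[i])
            y_pred.append(knn_predict(X_tr, y_tr, X[i]))
    return confusion_metrics(y_true, y_pred)

def make_toy_cohort(n_cases=90, n_controls=30, seed=1):
    """Synthetic, imbalanced two-feature cohort (hypothetical; not PPMI data)."""
    rng = random.Random(seed)
    X = [[rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)] for _ in range(n_cases)]
    X += [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n_controls)]
    y = [1] * n_cases + [0] * n_controls
    return X, y

if __name__ == "__main__":
    X, y = make_toy_cohort()
    print(cross_validate(X, y))
```

One design choice worth noting: the sketch rebalances inside each fold, on the training rows only. Rebalancing before splitting would let duplicated rows appear on both sides of a split and could inflate the reported metrics.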

Sponsoring Organization:
USDOE; National Science Foundation (NSF); National Institutes of Health (NIH)
Grant/Contract Number:
AC02-06CH11357; 1023115; 1022560; 1022636; 0089377; 9652870; 0442992; 0442630; 0333672; 0716055; P20 NR015331; P50 NS091856; P30 DK089503; U54 EB02040
OSTI ID:
1627792
Journal Information:
PLoS ONE, Vol. 11, Issue 8; ISSN 1932-6203
Publisher:
Public Library of Science
Country of Publication:
United States
Language:
English

Cited By (22)

Model-Based and Model-Free Techniques for Amyotrophic Lateral Sclerosis Diagnostic Prediction and Patient Clustering journal November 2018
Big data in IBD: a look into the future journal January 2019
Model-based and Model-free Machine Learning Techniques for Diagnostic Prediction and Classification of Clinical Outcomes in Parkinson’s Disease journal May 2018
Predictive Big Data Analytics using the UK Biobank Data journal April 2019
Harmonization of Respiratory Data From 9 US Population-Based Cohorts journal June 2018
Big data, observational research and P-value: a recipe for false-positive findings? A study of simulated and real prospective cohorts journal October 2019
Patient Subtyping via Time-Aware LSTM Networks
  • Baytas, Inci M.; Xiao, Cao; Zhang, Xi
  • KDD '17: The 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining https://doi.org/10.1145/3097983.3097997
conference August 2017
Leveraging healthcare utilization to explore outcomes from musculoskeletal disorders: methodology for defining relevant variables from a health services data repository journal January 2018
Machine learning techniques for personalized breast cancer risk prediction: comparison with the BCRAT and BOADICEA models journal June 2019
Collaboration between a human group and artificial intelligence can improve prediction of multiple sclerosis course: a proof-of-principle study journal January 2017
Big Data Analytics in Medicine and Healthcare journal May 2018
A Systematic Review on Healthcare Analytics: Application and Theoretical Perspective of Data Mining journal May 2018
Reproducible Big Data Science: A Case Study In Continuous Fairness text January 2018
Machine learning techniques for personalized breast cancer risk prediction: comparison with the BCRAT and BOADICEA models text January 2019
Quadruple Decision Making for Parkinson’s Disease Patients: Combining Expert Opinion, Patient Preferences, Scientific Evidence, and Big Data Approaches to Reach Precision Medicine journal January 2020
Big Health Data and Cardiovascular Diseases: A Challenge for Research, an Opportunity for Clinical Care journal February 2019
Correlations between Motor Symptoms across Different Motor Tasks, Quantified via Random Forest Feature Classification in Parkinson’s Disease journal November 2017
Understanding Physiology in the Continuum: Integration of Information from Multiple -Omics Levels journal February 2017
Modernizing the Methods and Analytics Curricula for Health Science Doctoral Programs journal February 2020
A Policy Guide on Integrated Care (PGIC): Lessons Learned from EU Project INTEGRATE and Beyond journal September 2017

Figures / Tables (18)