DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Predictive Big Data Analytics: A Study of Parkinson’s Disease Using Large, Complex, Heterogeneous, Incongruent, Multi-Source and Incomplete Observations

Abstract

Background: A unique archive of Big Data on Parkinson’s Disease is collected, managed and disseminated by the Parkinson’s Progression Markers Initiative (PPMI). The integration of such complex and heterogeneous Big Data from multiple sources offers unparalleled opportunities to study the early stages of prevalent neurodegenerative processes, track their progression and quickly identify the efficacies of alternative treatments. Many previous human and animal studies have examined the relationship of Parkinson’s disease (PD) risk to trauma, genetics, environment, co-morbidities, or lifestyle. The defining characteristics of Big Data (large size, incongruency, incompleteness, complexity, multiplicity of scales, and heterogeneity of information-generating sources) all pose challenges to the classical techniques for data management, processing, visualization and interpretation. We propose, implement, test and validate complementary model-based and model-free approaches for PD classification and prediction. To explore PD risk using Big Data methodology, we jointly processed complex PPMI imaging, genetics, clinical and demographic data.

Methods and Findings: Collective representation of the multi-source data facilitates the aggregation and harmonization of complex data elements. This enables joint modeling of the complete data, leading to the development of Big Data analytics, predictive synthesis, and statistical validation. Using heterogeneous PPMI data, we developed a comprehensive protocol for end-to-end data characterization, manipulation, processing, cleaning, analysis and validation. Specifically, we (i) introduce methods for rebalancing imbalanced cohorts, (ii) utilize a wide spectrum of classification methods to generate consistent and powerful phenotypic predictions, and (iii) generate reproducible machine-learning-based classification that enables the reporting of model parameters and diagnostic forecasting based on new data. We evaluated several complementary model-based predictive approaches, which failed to generate accurate and reliable diagnostic predictions. However, the results of several machine-learning-based classification methods indicated significant power to predict Parkinson’s disease in the PPMI subjects (consistent accuracy, sensitivity, and specificity exceeding 96%, confirmed using statistical n-fold cross-validation). Clinical (e.g., Unified Parkinson's Disease Rating Scale (UPDRS) scores), demographic (e.g., age), genetics (e.g., rs34637584, chr12), and derived neuroimaging biomarker (e.g., cerebellum shape index) data all contributed to the predictive analytics and diagnostic forecasting.

Conclusions: Model-free Big Data machine-learning-based classification methods (e.g., adaptive boosting, support vector machines) can outperform model-based techniques in terms of predictive precision and reliability (e.g., forecasting patient diagnosis). We observed that statistical rebalancing of cohort sizes yields better discrimination of group differences, specifically for predictive analytics based on heterogeneous and incomplete PPMI data. UPDRS scores play a critical role in predicting diagnosis, which is expected based on the clinical definition of Parkinson’s disease. Even without longitudinal UPDRS data, however, the accuracy of model-free machine-learning-based classification is over 80%. The methods, software and protocols developed here are openly shared and can be employed to study other neurodegenerative disorders (e.g., Alzheimer’s, Huntington’s, amyotrophic lateral sclerosis), as well as for other predictive Big Data analytics applications.
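The abstract reports accuracy, sensitivity, and specificity under n-fold cross-validation after cohort rebalancing. The following is a minimal Python sketch of how those three quantities can be computed; it is not the authors' protocol. It assumes binary labels (1 = PD, 0 = control), uses plain random oversampling as a stand-in for the rebalancing step, and uses a hypothetical one-feature threshold classifier (`fit_threshold`) in place of the adaptive boosting and support vector machine models named in the abstract. All function names are illustrative.

```python
import random
from typing import Callable, List, Sequence, Tuple


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (accuracy, sensitivity, specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity


def oversample_minority(X: List[float], y: List[int], rng: random.Random):
    """Random oversampling (an assumed stand-in): resample the smaller
    class with replacement until both classes have the same size."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:  # only one class present: nothing to resample from
        return list(X), list(y)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = pos + neg + extra
    return [X[i] for i in idx], [y[i] for i in idx]


def fit_threshold(X_train: List[float], y_train: List[int]) -> Callable[[float], int]:
    """Hypothetical toy classifier: cut halfway between the class means."""
    ones = [x for x, label in zip(X_train, y_train) if label == 1]
    zeros = [x for x, label in zip(X_train, y_train) if label == 0]
    mean1, mean0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
    cut = (mean1 + mean0) / 2
    return lambda x: 1 if (x > cut) == (mean1 > mean0) else 0


def cross_validate(X: List[float], y: List[int], fit, k: int = 5, seed: int = 0):
    """k-fold cross-validation; rebalancing is applied to training folds only."""
    rng = random.Random(seed)
    order = list(range(len(y)))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]
    y_true, y_pred = [], []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        X_bal, y_bal = oversample_minority(
            [X[i] for i in train_idx], [y[i] for i in train_idx], rng
        )
        predict = fit(X_bal, y_bal)
        y_true += [y[i] for i in test_idx]
        y_pred += [predict(X[i]) for i in test_idx]
    return confusion_metrics(y_true, y_pred)


if __name__ == "__main__":
    # Synthetic, well-separated toy scores: 30 "patients" and 10 "controls".
    X = [float(v) for v in range(100, 130)] + [float(v) for v in range(10)]
    y = [1] * 30 + [0] * 10
    print(cross_validate(X, y, fit_threshold, k=5))
```

Rebalancing is applied inside each training fold rather than to the full data set, so duplicated minority-class records never appear in both the training and held-out partitions. This is a design choice of the sketch, not something the record states about the original study.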

Authors:
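 Dinov, Ivo D. [1]; Heavner, Ben [2]; Tang, Ming [3]; Glusman, Gustavo [2]; Chard, Kyle [4]; Darcy, Mike [5]; Madduri, Ravi [4]; Pa, Judy [6]; Spino, Cathie [7]; Kesselman, Carl [5]; Foster, Ian [4]; Deutsch, Eric W. [2]; Price, Nathan D. [2]; Van Horn, John D. [6]; Ames, Joseph [6]; Clark, Kristi [6]; Hood, Leroy [2]; Hampstead, Benjamin M. [8]; Dauer, William [7]; Toga, Arthur W. [6]; Draganski, Bogdan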
  1. Univ. of Michigan, Ann Arbor, MI (United States). Statistics Online Computational Resource, School of Nursing, Michigan Institute for Data Science; Univ. of Southern California, Los Angeles, CA (United States). Stevens Neuroimaging and Informatics Institute; Univ. of Michigan, Ann Arbor, MI (United States). Udall Center of Excellence for Parkinson’s Disease Research
  2. Institute for Systems Biology, Seattle, WA (United States)
  3. Univ. of Michigan, Ann Arbor, MI (United States). Statistics Online Computational Resource, School of Nursing, Michigan Institute for Data Science
  4. Argonne National Lab. (ANL), Argonne, IL (United States); Univ. of Chicago, IL (United States). Computation Institute
  5. Univ. of Southern California, Los Angeles, CA (United States). Information Sciences Institute
  6. Univ. of Southern California, Los Angeles, CA (United States). Stevens Neuroimaging and Informatics Institute
  7. Univ. of Michigan, Ann Arbor, MI (United States). Udall Center of Excellence for Parkinson’s Disease Research
  8. Univ. of Michigan, Ann Arbor, MI (United States). Department of Psychiatry and Michigan Alzheimer’s Disease Center; Veterans Affairs Ann Arbor Healthcare System, Ann Arbor, MI (United States)
Publication Date:
August 2016
Sponsoring Org.:
USDOE; National Science Foundation (NSF); National Institutes of Health (NIH)
OSTI Identifier:
1627792
Grant/Contract Number:  
AC02-06CH11357; 1023115; 1022560; 1022636; 0089377; 9652870; 0442992; 0442630; 0333672; 0716055; P20 NR015331; P50 NS091856; P30 DK089503; U54 EB02040
Resource Type:
Accepted Manuscript
Journal Name:
PLoS ONE
Additional Journal Information:
Journal Volume: 11; Journal Issue: 8; Journal ID: ISSN 1932-6203
Publisher:
Public Library of Science
Country of Publication:
United States
Language:
English
Subject:
Science & Technology - Other Topics

Citation Formats

Dinov, Ivo D., Heavner, Ben, Tang, Ming, Glusman, Gustavo, Chard, Kyle, Darcy, Mike, Madduri, Ravi, Pa, Judy, Spino, Cathie, Kesselman, Carl, Foster, Ian, Deutsch, Eric W., Price, Nathan D., Van Horn, John D., Ames, Joseph, Clark, Kristi, Hood, Leroy, Hampstead, Benjamin M., Dauer, William, Toga, Arthur W., and Draganski, Bogdan. Predictive Big Data Analytics: A Study of Parkinson’s Disease Using Large, Complex, Heterogeneous, Incongruent, Multi-Source and Incomplete Observations. United States: N. p., 2016. Web. doi:10.1371/journal.pone.0157077.
Dinov, Ivo D., Heavner, Ben, Tang, Ming, Glusman, Gustavo, Chard, Kyle, Darcy, Mike, Madduri, Ravi, Pa, Judy, Spino, Cathie, Kesselman, Carl, Foster, Ian, Deutsch, Eric W., Price, Nathan D., Van Horn, John D., Ames, Joseph, Clark, Kristi, Hood, Leroy, Hampstead, Benjamin M., Dauer, William, Toga, Arthur W., & Draganski, Bogdan. Predictive Big Data Analytics: A Study of Parkinson’s Disease Using Large, Complex, Heterogeneous, Incongruent, Multi-Source and Incomplete Observations. United States. doi:10.1371/journal.pone.0157077.
Dinov, Ivo D., Heavner, Ben, Tang, Ming, Glusman, Gustavo, Chard, Kyle, Darcy, Mike, Madduri, Ravi, Pa, Judy, Spino, Cathie, Kesselman, Carl, Foster, Ian, Deutsch, Eric W., Price, Nathan D., Van Horn, John D., Ames, Joseph, Clark, Kristi, Hood, Leroy, Hampstead, Benjamin M., Dauer, William, Toga, Arthur W., and Draganski, Bogdan. 2016. "Predictive Big Data Analytics: A Study of Parkinson’s Disease Using Large, Complex, Heterogeneous, Incongruent, Multi-Source and Incomplete Observations". United States. doi:10.1371/journal.pone.0157077. https://www.osti.gov/servlets/purl/1627792.
@article{osti_1627792,
title = {Predictive Big Data Analytics: A Study of Parkinson’s Disease Using Large, Complex, Heterogeneous, Incongruent, Multi-Source and Incomplete Observations},
author = {Dinov, Ivo D. and Heavner, Ben and Tang, Ming and Glusman, Gustavo and Chard, Kyle and Darcy, Mike and Madduri, Ravi and Pa, Judy and Spino, Cathie and Kesselman, Carl and Foster, Ian and Deutsch, Eric W. and Price, Nathan D. and Van Horn, John D. and Ames, Joseph and Clark, Kristi and Hood, Leroy and Hampstead, Benjamin M. and Dauer, William and Toga, Arthur W. and Draganski, Bogdan},
doi = {10.1371/journal.pone.0157077},
journal = {PLoS ONE},
number = 8,
volume = 11,
place = {United States},
year = {2016},
month = {8}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record
