Imputing data that are missing at high rates using a boosting algorithm
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Apple Inc., Cupertino, CA (United States)
- Sandia National Lab. (SNL-CA), Livermore, CA (United States)
Traditional multiple imputation approaches may perform poorly on datasets with high rates of missingness unless a large number m of imputations is used. This paper implements an alternative, machine learning-based approach to imputing data that are missing at high rates. We use boosting to build a strong learner from a weak learner fitted to a dataset that is missing many observations. The approach can be applied to a variety of learner (model) types. It is demonstrated on a spatiotemporal dataset for predicting dengue outbreaks in India from meteorological covariates. A Bayesian spatiotemporal CAR model is boosted to produce imputations, and the overall RMSE from a k-fold cross-validation is used to assess imputation accuracy.
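The general idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: as an assumption for illustration, the weak learner here is a least-squares decision stump rather than the Bayesian spatiotemporal CAR model, the boosting scheme is plain residual-fitting (gradient boosting with squared error), and all function names (`fit_stump`, `boost_fit`, `impute`, `kfold_rmse`) and parameter values are hypothetical. The sketch fits the boosted model to the observed responses only, fills in the missing ones, and scores accuracy by k-fold RMSE over the observed entries.

```python
import numpy as np

def fit_stump(X, r):
    """Weak learner (assumed for illustration): best one-feature, one-threshold split."""
    best = None
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue  # degenerate split, skip
            lv, rv = r[left].mean(), r[~left].mean()
            sse = ((r[left] - lv) ** 2).sum() + ((r[~left] - rv) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t, lv, rv)
    if best is None:
        # No usable split: fall back to predicting the mean residual everywhere.
        return (0, np.inf, r.mean(), r.mean())
    return best[1:]

def predict_stump(stump, X):
    j, t, lv, rv = stump
    return np.where(X[:, j] <= t, lv, rv)

def boost_fit(X, y, n_rounds=100, lr=0.1):
    """Boost the weak learner by repeatedly fitting it to the current residuals."""
    base = y.mean()
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        s = fit_stump(X, y - pred)
        pred += lr * predict_stump(s, X)
        stumps.append(s)
    return base, stumps, lr

def boost_predict(model, X):
    base, stumps, lr = model
    return base + lr * sum(predict_stump(s, X) for s in stumps)

def impute(X, y):
    """Fit on rows where y is observed; fill rows where y is NaN."""
    obs = ~np.isnan(y)
    model = boost_fit(X[obs], y[obs])
    filled = y.copy()
    filled[~obs] = boost_predict(model, X[~obs])
    return filled

def kfold_rmse(X, y, k=5, seed=0):
    """Overall RMSE from k-fold cross-validation over the observed entries."""
    obs_idx = np.flatnonzero(~np.isnan(y))
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(obs_idx), k)
    sq_err = []
    for held in folds:
        train = np.setdiff1d(obs_idx, held)
        model = boost_fit(X[train], y[train])
        sq_err.append((boost_predict(model, X[held]) - y[held]) ** 2)
    return float(np.sqrt(np.concatenate(sq_err).mean()))
```

Given a covariate matrix `X` with no missing entries and a response vector `y` with `NaN` marking missing values, `impute(X, y)` returns a completed copy of `y`, and `kfold_rmse(X, y)` gives a single pooled error figure. Swapping in a different weak learner only requires replacing `fit_stump` and `predict_stump`.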
- Research Organization: Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Sandia National Lab. (SNL-CA), Livermore, CA (United States)
- Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA)
- DOE Contract Number: AC04-94AL85000
- OSTI ID: 1431477
- Report Number(s): SAND-2016-9430J; 647630
- Resource Relation: Conference: JSM 2016, Chicago, IL (United States), 30 Jul - 4 Aug 2016
- Country of Publication: United States
- Language: English