Imputation of missing data using machine learning techniques
Abstract
A serious problem in mining industrial databases is that they are often incomplete: a significant amount of data is missing or erroneously entered. This paper explores machine-learning-based alternatives to standard statistical data completion (data imputation) methods for dealing with missing data. We approach the data completion problem using two well-known machine learning techniques. The first is an unsupervised clustering strategy that uses a Bayesian approach to cluster the data into classes. The classes so obtained are then used to predict multiple choices for the attribute of interest. The second technique models missing variables by supervised induction of a decision-tree-based classifier, which predicts the most likely value for the attribute of interest. Empirical tests on extracts from industrial databases maintained by Honeywell customers compare the two techniques. These tests show that both approaches are useful and that each has advantages and disadvantages. We argue that the choice between unsupervised and supervised classification techniques should be influenced by the motivation for solving the missing data problem, and we discuss potential applications for the procedures we are developing.
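The second technique in the abstract (supervised induction of a decision tree that predicts the most likely value of a missing attribute) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the tree learner used, so the code below assumes a plain ID3-style tree over categorical attributes, and the record fields (`device`, `site`, `fault`) and function names are hypothetical, invented only for illustration. The unsupervised Bayesian clustering technique is not covered here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def build_tree(rows, target, features):
    """Grow an ID3-style tree over categorical features; leaves hold a class label."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority

    def remainder(f):
        # Weighted entropy of the target after splitting on feature f.
        groups = {}
        for r in rows:
            groups.setdefault(r[f], []).append(r[target])
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

    best = min(features, key=remainder)  # lowest remainder = highest information gain
    rest = [f for f in features if f != best]
    branches = {}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, target, rest)
    return {"feature": best, "branches": branches, "default": majority}

def predict(tree, row):
    """Follow matching branches; fall back to the node majority on unseen values."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["feature"]], tree["default"])
    return tree

def impute(rows, target, features):
    """Fill rows whose target is None using a tree trained on the complete rows."""
    complete = [r for r in rows if r[target] is not None]
    tree = build_tree(complete, target, features)
    return [r if r[target] is not None else {**r, target: predict(tree, r)} for r in rows]

# Hypothetical maintenance records; the field names are invented for illustration.
records = [
    {"device": "valve", "site": "north", "fault": "leak"},
    {"device": "valve", "site": "south", "fault": "leak"},
    {"device": "pump", "site": "north", "fault": "vibration"},
    {"device": "pump", "site": "south", "fault": "vibration"},
    {"device": "pump", "site": "north", "fault": None},
]
filled = impute(records, "fault", ["device", "site"])
print(filled[-1]["fault"])
```

The tree is trained only on rows where the target attribute is present and then applied to the rows where it is missing, so each gap receives a single most likely value. That contrasts with the clustering approach described in the abstract, which yields multiple candidate values per gap.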
- Authors:
- Lakshminarayan, Kamakshi; Harp, S A; Goldman, R; Samad, T
- Honeywell Technology Center, Minneapolis, MN (United States)
- Publication Date:
- 1996-12-31
- OSTI Identifier:
- 421269
- Report Number(s):
- CONF-960830-; TRN: 96:005928-0024
- Resource Type:
- Conference
- Resource Relation:
- Conference: 2. international conference on knowledge discovery and data mining, Portland, OR (United States), 2-4 Aug 1996; Other Information: PBD: 1996; Related Information: Is Part Of Proceedings of the second international conference on knowledge discovery & data mining; Simoudis, E.; Han, J.; Fayyad, U. [eds.]; PB: 405 p.
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 99 MATHEMATICS, COMPUTERS, INFORMATION SCIENCE, MANAGEMENT, LAW, MISCELLANEOUS; DATA BASE MANAGEMENT; DECISION TREE ANALYSIS; MAINTENANCE; ALGORITHMS; LEARNING; KNOWLEDGE BASE
Citation Formats
Lakshminarayan, Kamakshi, Harp, S A, Goldman, R, and Samad, T. Imputation of missing data using machine learning techniques. United States: N. p., 1996. Web.
Lakshminarayan, Kamakshi, Harp, S A, Goldman, R, & Samad, T. Imputation of missing data using machine learning techniques. United States.
Lakshminarayan, Kamakshi, Harp, S A, Goldman, R, and Samad, T. 1996. "Imputation of missing data using machine learning techniques". United States.
@article{osti_421269,
title = {Imputation of missing data using machine learning techniques},
author = {Lakshminarayan, Kamakshi and Harp, S A and Goldman, R and Samad, T},
abstractNote = {A serious problem in mining industrial databases is that they are often incomplete: a significant amount of data is missing or erroneously entered. This paper explores machine-learning-based alternatives to standard statistical data completion (data imputation) methods for dealing with missing data. We approach the data completion problem using two well-known machine learning techniques. The first is an unsupervised clustering strategy that uses a Bayesian approach to cluster the data into classes. The classes so obtained are then used to predict multiple choices for the attribute of interest. The second technique models missing variables by supervised induction of a decision-tree-based classifier, which predicts the most likely value for the attribute of interest. Empirical tests on extracts from industrial databases maintained by Honeywell customers compare the two techniques. These tests show that both approaches are useful and that each has advantages and disadvantages. We argue that the choice between unsupervised and supervised classification techniques should be influenced by the motivation for solving the missing data problem, and we discuss potential applications for the procedures we are developing.},
url = {https://www.osti.gov/biblio/421269},
place = {United States},
year = {1996},
month = dec
}