OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Imputation of missing data using machine learning techniques

Abstract

A serious problem in mining industrial databases is that they are often incomplete: a significant amount of data is missing or erroneously entered. This paper explores machine-learning-based alternatives to standard statistical data completion (data imputation) methods for dealing with missing data. We approach the data completion problem using two well-known machine learning techniques. The first is an unsupervised clustering strategy that uses a Bayesian approach to cluster the data into classes; the classes so obtained are then used to predict multiple choices for the attribute of interest. The second models missing variables by supervised induction of a decision-tree-based classifier, which predicts the most likely value for the attribute of interest. Empirical tests on extracts from industrial databases maintained by Honeywell customers compare the two techniques. These tests show that both approaches are useful and that each has advantages and disadvantages. We argue that the choice between unsupervised and supervised classification techniques should be influenced by the motivation for solving the missing data problem, and we discuss potential applications for the procedures we are developing.
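
As an illustration of the second technique named in the abstract, the sketch below imputes a missing categorical attribute by inducing a small decision tree from the complete records and predicting the most likely value for the incomplete ones. This record does not describe the authors' classifier or data, so everything here is an assumption: the ID3-style entropy split is a stand-in for whatever tree learner the paper used, and the `records` list, its field names, and the helper functions are hypothetical. The clustering-based approach is not sketched.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of categorical labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def split_entropy(rows, feature, target):
    """Weighted entropy of `target` after partitioning rows by `feature`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


def build_tree(rows, target, features):
    """Grow an ID3-style tree; a leaf is the majority value of `target`."""
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    best = min(features, key=lambda f: split_entropy(rows, f, target))
    if entropy(labels) - split_entropy(rows, best, target) <= 1e-12:
        return majority  # no split reduces uncertainty, so stop here
    remaining = [f for f in features if f != best]
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, target, remaining)
    return {"feature": best, "branches": branches, "default": majority}


def predict(tree, row):
    """Walk the tree; unseen feature values fall back to the node's majority."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["feature"]], tree["default"])
    return tree


def impute(rows, target, features):
    """Return copies of rows with missing (None) `target` values filled in."""
    complete = [row for row in rows if row[target] is not None]
    tree = build_tree(complete, target, features)
    return [
        dict(row, **{target: predict(tree, row)}) if row[target] is None else row
        for row in rows
    ]


# Hypothetical maintenance records; None marks a missing entry.
records = [
    {"unit": "boiler", "symptom": "leak", "action": "replace_seal"},
    {"unit": "chiller", "symptom": "leak", "action": "replace_seal"},
    {"unit": "boiler", "symptom": "noise", "action": "tighten_mount"},
    {"unit": "chiller", "symptom": "noise", "action": "tighten_mount"},
    {"unit": "chiller", "symptom": "leak", "action": None},
    {"unit": "boiler", "symptom": "noise", "action": None},
]

filled = impute(records, "action", ["unit", "symptom"])
for row in filled:
    print(row)
```

A tree of this kind returns a single best guess per record, which matches the abstract's description of the supervised approach; the clustering approach instead yields several candidate values, which may be preferable when a person reviews the suggestions.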

Authors:
Lakshminarayan, Kamakshi; Harp, S A; Goldman, R; Samad, T [1]
  1. Honeywell Technology Center, Minneapolis, MN (United States)
Publication Date:
1996
OSTI Identifier:
421269
Report Number(s):
CONF-960830-
TRN: 96:005928-0024
Resource Type:
Conference
Resource Relation:
Conference: 2. international conference on knowledge discovery and data mining, Portland, OR (United States), 2-4 Aug 1996; Other Information: PBD: 1996; Related Information: Is Part Of Proceedings of the second international conference on knowledge discovery & data mining; Simoudis, E.; Han, J.; Fayyad, U. [eds.]; PB: 405 p.
Country of Publication:
United States
Language:
English
Subject:
99 MATHEMATICS, COMPUTERS, INFORMATION SCIENCE, MANAGEMENT, LAW, MISCELLANEOUS; DATA BASE MANAGEMENT; DECISION TREE ANALYSIS; MAINTENANCE; ALGORITHMS; LEARNING; KNOWLEDGE BASE

Citation Formats

MLA:
Lakshminarayan, Kamakshi, Harp, S A, Goldman, R, and Samad, T. Imputation of missing data using machine learning techniques. United States: N. p., 1996. Web.
APA:
Lakshminarayan, Kamakshi, Harp, S A, Goldman, R, & Samad, T. (1996). Imputation of missing data using machine learning techniques. United States.
Chicago:
Lakshminarayan, Kamakshi, Harp, S A, Goldman, R, and Samad, T. 1996. "Imputation of missing data using machine learning techniques". United States.
BibTeX:
@article{osti_421269,
title = {Imputation of missing data using machine learning techniques},
author = {Lakshminarayan, Kamakshi and Harp, S A and Goldman, R and Samad, T},
abstractNote = {A serious problem in mining industrial data bases is that they are often incomplete, and a significant amount of data is missing, or erroneously entered. This paper explores the use of machine-learning based alternatives to standard statistical data completion (data imputation) methods, for dealing with missing data. We have approached the data completion problem using two well-known machine learning techniques. The first is an unsupervised clustering strategy which uses a Bayesian approach to cluster the data into classes. The classes so obtained are then used to predict multiple choices for the attribute of interest. The second technique involves modeling missing variables by supervised induction of a decision tree-based classifier. This predicts the most likely value for the attribute of interest. Empirical tests using extracts from industrial databases maintained by Honeywell customers have been done in order to compare the two techniques. These tests show both approaches are useful and have advantages and disadvantages. We argue that the choice between unsupervised and supervised classification techniques should be influenced by the motivation for solving the missing data problem, and discuss potential applications for the procedures we are developing.},
doi = {},
url = {https://www.osti.gov/biblio/421269},
journal = {},
number = {},
volume = {},
place = {United States},
year = {1996},
month = {12}
}

Availability:
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
