skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: A High Accuracy Method for Semi-supervised Information Extraction

Abstract

Customization to specific domains of dis-course and/or user requirements is one of the greatest challenges for today’s Information Extraction (IE) systems. While demonstrably effective, both rule-based and supervised machine learning approaches to IE customization pose too high a burden on the user. Semi-supervised learning approaches may in principle offer a more resource effective solution but are still insufficiently accurate to grant realistic application. We demonstrate that this limitation can be overcome by integrating fully-supervised learning techniques within a semi-supervised IE approach, without increasing resource requirements.

Authors:
;
Publication Date:
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
908955
Report Number(s):
PNNL-SA-53858
400904120; TRN: US200722%%831
DOE Contract Number:
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: Proceedings of Human Language Technologies: The Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2007), 169-172
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; ACCURACY; LEARNING; INFORMATION RETRIEVAL; INFORMATION SYSTEMS; information extraction; machine learning; unsupervised learning; computational linguistics

Citation Formats

Tratz, Stephen C., and Sanfilippo, Antonio P.. A High Accuracy Method for Semi-supervised Information Extraction. United States: N. p., 2007. Web.
Tratz, Stephen C., & Sanfilippo, Antonio P.. A High Accuracy Method for Semi-supervised Information Extraction. United States.
Tratz, Stephen C., and Sanfilippo, Antonio P.. Sun . "A High Accuracy Method for Semi-supervised Information Extraction". United States. doi:. https://www.osti.gov/servlets/purl/908955.
@article{osti_908955,
title = {A High Accuracy Method for Semi-supervised Information Extraction},
author = {Tratz, Stephen C. and Sanfilippo, Antonio P.},
abstractNote = {Customization to specific domains of dis-course and/or user requirements is one of the greatest challenges for today’s Information Extraction (IE) systems. While demonstrably effective, both rule-based and supervised machine learning approaches to IE customization pose too high a burden on the user. Semi-supervised learning approaches may in principle offer a more resource effective solution but are still insufficiently accurate to grant realistic application. We demonstrate that this limitation can be overcome by integrating fully-supervised learning techniques within a semi-supervised IE approach, without increasing resource requirements.},
doi = {},
journal = {},
number = ,
volume = ,
place = {United States},
year = {Sun Apr 22 00:00:00 EDT 2007},
month = {Sun Apr 22 00:00:00 EDT 2007}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

Save / Share:
  • This paper illustrates the relative merits of three methods - k-fold Cross Validation, Error Bounds, and Incremental Halting Test - to estimate the accuracy of a supervised learning algorithm. For each of the three methods we point out the problem they address, some of the important assumptions that are based on, and illustrate them through an example. Finally, we discuss the relative advantages and disadvantages of each method.
  • Practicaldataminingrarelyfalls exactlyinto the supervisedlearning scenario. Rather, the growing amount of unlabeled data poses a big challenge to large-scale semi-supervised learning (SSL). We note that the computationalintensivenessofgraph-based SSLarises largely from the manifold or graph regularization, which in turn lead to large models that are dificult to handle. To alleviate this, we proposed the prototype vector machine (PVM), a highlyscalable,graph-based algorithm for large-scale SSL. Our key innovation is the use of"prototypes vectors" for effcient approximation on both the graph-based regularizer and model representation. The choice of prototypes are grounded upon two important criteria: they not only perform effective low-rank approximation of themore » kernel matrix, but also span a model suffering the minimum information loss compared with the complete model. We demonstrate encouraging performance and appealing scaling properties of the PVM on a number of machine learning benchmark data sets.« less
  • Detecting complex targets, such as facilities, in commercially available satellite imagery is a difficult problem that human analysts try to solve by applying world knowledge. Often there are known observables that can be extracted by pixel-level feature detectors that can assist in the facility detection process. Individually, each of these observables is not sufficient for an accurate and reliable detection, but in combination, these auxiliary observables may provide sufficient context for detection by a machine learning algorithm. We describe an approach for automatic detection of facilities that uses an automated feature extraction algorithm to extract auxiliary observables, and a semi-supervisedmore » assisted target recognition algorithm to then identify facilities of interest. We illustrate the approach using an example of finding schools in Quickbird image data of Albuquerque, New Mexico. We use Los Alamos National Laboratory's Genie Pro automated feature extraction algorithm to find a set of auxiliary features that should be useful in the search for schools, such as parking lots, large buildings, sports fields and residential areas and then combine these features using Genie Pro's assisted target recognition algorithm to learn a classifier that finds schools in the image data.« less
  • In many practical situations thematic classes can not be discriminated by spectral measurements alone. Often one needs additional features such as population density, road density, wetlands, elevation, soil types, etc. which are discrete attributes. On the other hand remote sensing image features are continuous attributes. Finding a suitable statistical model and estimation of parameters is a challenging task in multisource (e.g., discrete and continuous attributes) data classification. In this paper we present a semi-supervised learning method by assuming that the samples were generated by a mixture model, where each component could be either a continuous or discrete distribution. Overall classificationmore » accuracy of the proposed method is improved by 12% in our initial experiments.« less
  • Machine learning is used in many applications, from machine vision to speech recognition to decision support systems, and is used to test applications. However, though much has been done to evaluate the performance of machine learning algorithms, little has been done to verify the algorithms or examine their failure modes. Moreover, complex learning frameworks often require stepping beyond black box evaluation to distinguish between errors based on natural limits on learning and errors that arise from mistakes in implementation. We present a conceptual architecture, failure model and taxonomy, and failure modes and effects analysis (FMEA) of a semi-supervised, multi-modal learningmore » system, and provide specific examples from its use in a radiological analysis assistant system. The goal of the research described in this paper is to provide a foundation from which dependability analysis of systems using semi-supervised, multi-modal learning can be conducted. The methods presented provide a first step towards that overall goal.« less