Automatic Generation of Data Types for Classification of Deep Web Sources
A Service Class Description (SCD) is an effective metadata-based approach for discovering Deep Web sources whose data exhibit regular patterns. However, creating an SCD manually is tedious and error-prone. Moreover, a manually created SCD does not adapt to the frequent changes of Web sources, and it requires its creator to identify all possible input and output types of a service a priori. In many domains, it is impossible to list all possible input and output data types of a source exhaustively in advance. In this paper, we describe machine learning approaches for automatically generating the data types of an SCD. We propose two approaches for learning the data types of a class of Web sources. The Brute-Force Learner generates data types that achieve high recall but low precision. The Clustering-based Learner generates data types that achieve high precision but lower recall. We demonstrate the feasibility of these two learning-based solutions for automatically generating data types for citation Web sources and present a quantitative evaluation of both.
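The precision/recall trade-off between the two learners can be illustrated with a minimal sketch. The abstract does not say how the evaluation was carried out, so this assumes the standard set-based definitions of precision and recall; the function `precision_recall` and all data type names below are hypothetical and are not taken from the paper.

```python
def precision_recall(learned, reference):
    """Return (precision, recall) of a learned set of data types
    measured against a reference set, using the standard definitions."""
    learned, reference = set(learned), set(reference)
    true_positives = len(learned & reference)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(learned) if learned else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical reference data types for a citation source (assumption).
reference = {"author", "title", "year", "journal"}

# A broad learner proposes many types: everything relevant is covered,
# but half of the proposals are spurious.
broad = {"author", "title", "year", "journal", "url", "price", "phone", "zip"}

# A selective learner proposes few types: all are correct,
# but half of the relevant types are missed.
narrow = {"author", "title"}

print(precision_recall(broad, reference))   # high recall, lower precision
print(precision_recall(narrow, reference))  # high precision, lower recall
```

In this toy setting, the broad set mirrors the behaviour the abstract attributes to the Brute-Force Learner and the narrow set mirrors the Clustering-based Learner; the numbers themselves are illustrative only.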
- Research Organization: Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: W-7405-ENG-48
- OSTI ID: 15016845
- Report Number(s): UCRL-CONF-209719; TRN: US200516%%1162
- Resource Relation: Journal Volume: 3615; Conference: Presented at: 2nd International Workshop on Data Integration in the Life Sciences, San Diego, CA, United States, Jul 20 - Jul 22, 2005
- Country of Publication: United States
- Language: English