Springer-Verlag London Ltd. 2005 Knowledge and Information Systems (2005)

Summary:
DOI 10.1007/s10115-004-0188-z
Web data extraction based on structural
Zhao Li, Wee Keong Ng, Aixin Sun
Centre for Advanced Information Systems, School of Computer Engineering, Nanyang Technological
University, Singapore
Abstract. Web data-extraction systems in use today mainly focus on the generation of extraction
rules, i.e., wrapper induction. Thus, they appear ad hoc and are difficult to integrate when
a holistic view is taken. Each phase in the data-extraction process is disconnected and does not
share a common foundation to make the building of a complete system straightforward. In this
paper, we demonstrate a holistic approach to Web data extraction. The principal component of
our proposal is the notion of a document schema. Document schemata are patterns of structures
embedded in documents. Once the document schemata are obtained, the various phases
(e.g. training set preparation, wrapper induction and document classification) can be easily
integrated. The implication of this is improved efficiency and better control over the extraction
procedure. Our experimental results confirmed this. More importantly, because a document can
be represented as a vector of schema, it can be easily incorporated into existing systems as the
fabric for integration.


Source: Sun, Aixin - School of Computer Engineering, Nanyang Technological University


Collections: Computer Technologies and Information Sciences