Summary: Mining Reference Tables for Automatic Text Segmentation
Eugene Agichtein
Columbia University
eugene@cs.columbia.edu
Venkatesh Ganti
Microsoft Research
vganti@microsoft.com
ABSTRACT
Automatically segmenting unstructured text strings into structured
records is necessary for importing the information contained in legacy
sources and text collections into a data warehouse for subsequent
querying, analysis, mining and integration. In this paper, we mine
tables present in data warehouses and relational databases to develop
an automatic segmentation system. Thus, we overcome limitations
of existing supervised text segmentation approaches, which require
comprehensive manually labeled training data. Our segmentation
system is robust, accurate, and efficient, and requires no additional
manual effort. Thorough evaluation on real datasets demonstrates the
robustness and accuracy of our system, with segmentation accuracy