Summary: ACM DL 2000
Snowball: Extracting Relations from Large Plain-Text Collections
Eugene Agichtein, Luis Gravano
Department of Computer Science
Columbia University
1214 Amsterdam Avenue
New York, NY 10027-7003, USA
{eugene,gravano}@cs.columbia.edu
ABSTRACT
Text documents often contain valuable structured data that
is hidden in regular English sentences. This data is best
exploited if available as a relational table that we could use for
answering precise queries or for running data mining tasks.
We explore a technique for extracting such tables from
document collections that requires only a handful of training
examples from users. These examples are used to generate
extraction patterns, which in turn result in new tuples being
extracted from the document collection. We build on this
idea and present our Snowball system. Snowball introduces
novel strategies for generating patterns and extracting tuples