Summary: Unsupervised Named-Entity Extraction
from the Web: An Experimental Study
Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu,
Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195-2350
etzioni@cs.washington.edu
February 28, 2005
Abstract
The KNOWITALL system aims to automate the tedious process of extracting large
collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised,
domain-independent, and scalable manner. The paper presents an overview of
KNOWITALL's novel architecture and design principles, emphasizing its distinctive ability to
extract information without any hand-labeled training examples. In its first major run,
KNOWITALL extracted over 50,000 class instances, but suggested a challenge: How can we
improve KNOWITALL's recall and extraction rate without sacrificing precision?
This paper presents three distinct ways to address this challenge and evaluates their
performance. Pattern Learning learns domain-specific extraction rules, which enable additional
extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall