Summary: The Use of Web-based Statistics
to Validate Information Extraction
Stephen Soderland, Oren Etzioni, Tal Shaked, and Daniel S. Weld
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195-2350
U.S.A.
{soderlan,etzioni,shaked,weld}@cs.washington.edu
Abstract
The World Wide Web is a powerful and readily available text corpus
that can be used effectively to validate the output of an information
extraction system. We present experiments that explore how pointwise
mutual information (PMI) from search engine hit counts can be used in
an Assessor module that assigns a probability that an extracted fact
or relationship is correct, thus boosting precision. We find that
thresholding on PMI scores is more effective in creating features for
the Assessor than using probability density models. Bootstrapping can
be effective in finding both positive and