Visual Information Extraction Yonatan Aumann

Summary: Visual Information Extraction
Yonatan Aumann
Ronen Feldman
Yair Liberzon
Benjamin Rosenfeld
Jonathan Schler
Department of Computer Science, Bar Ilan University, Ramat Gan 52900, Israel
ClearForest Ltd., 6 Yoni Netanyahu St., Or Yehuda 60376, Israel
Typographic and visual information is an integral part of textual documents. Most information
extraction systems ignore most of this visual information, processing the text as a linear
sequence of words. Thus, much valuable information is lost. In this paper, we show how to
make use of this visual information for information extraction. We present an algorithm that
automatically extracts specific fields of the document (such as the title, author, etc.),
based exclusively on the visual formatting of the document, without any reference to the
semantic content. The algorithm employs a machine learning approach, whereby the system is
first provided with a set of training documents in which the target fields are manually tagged,
and automatically learns how to extract these fields in future documents. We implemented the
algorithm in a system for automatic analysis of documents in PDF format. We present experi-
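
The idea of learning field extraction from manually tagged documents using only visual
formatting can be illustrated with a minimal sketch. This is not the authors' algorithm, which
the summary does not detail. It swaps in a plain nearest-neighbour lookup, and the TextBlock
type, its features (font size, boldness, vertical position, centering), and the function names
are all hypothetical choices made for illustration. Note that the text of a block is never
used as a feature.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class TextBlock:
    """Hypothetical visual description of one block of text on a page."""
    text: str
    font_size: float  # in points
    bold: bool
    y_frac: float     # vertical position: 0.0 = top of page, 1.0 = bottom
    centered: bool


def features(block):
    """Map a block to a numeric vector built only from its formatting."""
    # Font size is divided by 24 (an assumed scale) so no feature dominates.
    return (block.font_size / 24.0, float(block.bold),
            block.y_frac, float(block.centered))


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train(tagged_docs):
    """Store (feature vector, field label) pairs from manually tagged documents."""
    return [(features(block), label)
            for doc in tagged_docs
            for block, label in doc]


def extract(model, blocks, field):
    """Return the text of the block that best matches `field`, or None.

    A block is a candidate if its nearest training example carries the
    requested label; the candidate with the smallest distance wins.
    """
    best_text, best_dist = None, float("inf")
    for block in blocks:
        vec = features(block)
        dist, label = min((distance(vec, f), lab) for f, lab in model)
        if label == field and dist < best_dist:
            best_text, best_dist = block.text, dist
    return best_text


# One tagged training document (all values are made up for the example).
training_doc = [
    (TextBlock("Some Earlier Paper", 18.0, True, 0.05, True), "title"),
    (TextBlock("A. Writer", 12.0, False, 0.12, True), "author"),
    (TextBlock("Body text ...", 10.0, False, 0.50, False), "other"),
]
model = train([training_doc])

# An untagged document with similar, but not identical, formatting.
new_doc = [
    TextBlock("Visual Information Extraction", 20.0, True, 0.06, True),
    TextBlock("Yonatan Aumann", 11.0, False, 0.14, True),
    TextBlock("Typographic and visual information ...", 10.0, False, 0.60, False),
]

print(extract(model, new_doc, "title"))
print(extract(model, new_doc, "author"))
```

A real system for PDF input would first have to recover such blocks and their formatting from
the file, a step this sketch leaves out.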


Source: Aumann, Yonatan - Computer Science Department, Bar Ilan University


Collections: Computer Technologies and Information Sciences