Visual Information Extraction Yonatan Aumann
 

Summary: Visual Information Extraction
Yonatan Aumann
Ronen Feldman
Yair Liberzon
Benjamin Rosenfeld
Jonathan Schler
Department of Computer Science, Bar Ilan University, Ramat Gan 52900, Israel
{aumann,feldman}@cs.biu.ac.il
ClearForest Ltd., 6 Yoni Netanyahu St., Or Yehuda 60376, Israel
Abstract
Typographic and visual information is an integral part of textual documents. Most information
extraction systems ignore most of this visual information, processing the text as a linear
sequence of words. Thus, much valuable information is lost. In this paper, we show how to
make use of this visual information for information extraction. We present an algorithm that
automatically extracts specific fields of a document (such as the title and author) based
exclusively on the document's visual formatting, without any reference to the semantic
content. The algorithm employs a machine learning approach: the system is first provided
with a set of training documents in which the target fields are manually tagged, and it
automatically learns how to extract these fields from future documents. We implemented the
algorithm in a system for automatic analysis of documents in PDF format. We present experi-
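
The abstract does not give the details of the learning algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each text block is described by a few hypothetical layout features (font size, boldness, position on the page) and, as a stand-in technique, uses nearest-neighbor matching against manually tagged training blocks. The `Block` class, the feature set, the scaling constants, and the function names are all illustrative assumptions. The text of a block is never consulted when choosing it.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Block:
    """A text block described by its visual layout (hypothetical feature set)."""
    text: str
    font_size: float  # in points
    bold: bool
    x_center: float   # horizontal center as a fraction of page width (0..1)
    y_top: float      # distance from the page top as a fraction of page height (0..1)


def visual_distance(a: Block, b: Block) -> float:
    """Distance between two blocks using layout features only; text is ignored."""
    return math.sqrt(
        ((a.font_size - b.font_size) / 10.0) ** 2  # assumed scaling: 10pt ~ 1 unit
        + (0.0 if a.bold == b.bold else 1.0)       # penalty for a boldness mismatch
        + (a.x_center - b.x_center) ** 2
        + (a.y_top - b.y_top) ** 2
    )


def extract_field(training_examples: list[Block], document: list[Block]) -> Block:
    """Return the block in `document` that is visually closest to any tagged example."""
    return min(
        document,
        key=lambda blk: min(visual_distance(blk, ex) for ex in training_examples),
    )


# Blocks manually tagged as "title" in two made-up training documents.
title_examples = [
    Block("Some Earlier Paper", 18.0, True, 0.5, 0.10),
    Block("Another Training Title", 16.0, True, 0.5, 0.12),
]

# Blocks of a new, untagged document (layout values are invented for illustration).
new_document = [
    Block("Visual Information Extraction", 17.0, True, 0.5, 0.11),
    Block("Yonatan Aumann", 11.0, False, 0.5, 0.18),
    Block("Typographic and visual information is ...", 10.0, False, 0.5, 0.40),
]

title = extract_field(title_examples, new_document)
print(title.text)  # prints "Visual Information Extraction"
```

In this toy setup the large, bold, centered block near the top of the page is selected as the title purely because its layout resembles the tagged examples. A real system would need to obtain such layout features from the PDF itself and would likely use a richer model than a single nearest neighbor.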

  

Source: Aumann, Yonatan - Computer Science Department, Bar Ilan University

 

Collections: Computer Technologies and Information Sciences