E-print Network

Content-Based Document Image Retrieval in Complex Document Collections

Summary: Content-Based Document Image Retrieval in Complex Document Collections
G. Agam (a), S. Argamon (a), O. Frieder (a), D. Grossman (a), D. Lewis (b)
(a) Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616
(b) David D. Lewis Consulting, 858 W. Armitage Ave., #296, Chicago, IL 60614
We address the problem of content-based image retrieval in the context of complex document images. Complex
documents are documents that typically start out on paper and are then electronically scanned. These
documents have rich internal structure and might only be available in image form. Additionally, they may have
been produced by a combination of printing technologies (or by handwriting) and may include diagrams, graphics,
tables, and other non-textual elements. Large collections of such complex documents are commonly found in legal
and security investigations. The indexing and analysis of large document collections is currently limited to
textual features based on OCR data and ignores the structural context of the document as well as important
non-textual elements such as signatures, logos, stamps, tables, diagrams, and images. Handwritten comments are
also normally ignored due to the inherent complexity of offline handwriting recognition. We address important
research issues concerning content-based document image retrieval and describe a prototype, currently under
development, for integrated retrieval and aggregation of diverse information contained in scanned paper
documents. Such complex document information processing combines several forms of image processing together
with textual/linguistic processing to enable effective analysis of complex document collections, a necessity
for a wide range of applications. Our prototype automatically generates rich metadata about a complex document
and then applies query tools to integrate


Source: Argamon, Shlomo - Department of Computer Science, Illinois Institute of Technology


Collections: Computer Technologies and Information Sciences