Summary: A Survey of Retrieval Strategies for
OCR Text Collections
Steven M. Beitzel, Eric C. Jensen, David A. Grossman
Information Retrieval Laboratory
Department of Computer Science
Illinois Institute of Technology
{steve,ej,grossman}@ir.iit.edu
Abstract
The importance of effectively retrieving OCR text has grown significantly in recent
years. We provide a brief overview of work on improving retrieval effectiveness for
OCR text.
Introduction
As electronic media become increasingly prevalent, the need to transfer older
documents to the electronic domain grows. Optical Character Recognition (OCR) works
by scanning source documents and performing character analysis on the resulting images,
producing ASCII text that can then be stored and manipulated
electronically like any other electronic document. Unfortunately, the character
recognition process is not perfect, and errors often occur. These errors reduce the
effectiveness of information retrieval algorithms that rely on exact
matches between query terms and document terms. Searching OCR data is essentially a search