U.S. Department of Energy
Office of Scientific and Technical Information

An evaluation of information retrieval accuracy with simulated OCR output

Technical Report
OSTI ID:68569
;  [1]; ;  [2]
  1. Univ. of Massachusetts, Amherst, MA (United States)
  2. Information Science Research Institute, Univ. of Nevada, Las Vegas, NV (United States)

Optical Character Recognition (OCR) is a critical part of many text-based applications. Although some commercial systems index documents directly from unedited OCR output, there is very little quantitative data on how OCR errors affect the accuracy of a text retrieval system. Because test collections for obtaining this data are difficult to construct, we carried out an evaluation using simulated OCR output on a variety of databases. The results show that output from high-quality OCR devices has little effect on retrieval accuracy, but low-quality devices used with databases of short documents can cause significant degradation.

Research Organization:
Nevada Univ., Las Vegas, NV (United States)
Report Number(s):
CONF-9404212--
Country of Publication:
United States
Language:
English
