U.S. Department of Energy
Office of Scientific and Technical Information

Validation of document image defect models for optical character recognition

Technical Report · OSTI ID: 68571
; ;  [1]
  1. Panasonic Technologies, Inc., Princeton, NJ (United States)

In this paper we consider the problem of evaluating models for physical defects affecting the optical character recognition (OCR) process. While a number of such models have been proposed, the contention that they produce the desired result is typically argued in an ad hoc and informal way. We introduce a rigorous and more pragmatic definition of when a model is accurate: we say a defect model is validated if the OCR errors induced by the model are effectively indistinguishable from the errors encountered when using real scanned documents. We present two measures to quantify this similarity: the Vector Space method and the Coin Bias method. The former adapts an approach used in information retrieval; the latter simulates an observer attempting to do better than a "random" guesser. We compare and contrast the two techniques based on experimental data; both seem to work well, suggesting this is an appropriate formalism for the development and evaluation of document image defect models.
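
The abstract names the two measures without defining them, so the following Python sketch is only an illustration under stated assumptions, not the authors' method. It assumes each OCR error is recorded as a hypothetical (true character, recognized character) pair, that the Vector Space comparison can be approximated by the cosine similarity of error-count vectors (a standard information retrieval measure), and that the Coin Bias idea can be approximated by testing whether an observer who guesses "real" or "synthetic" does better than a fair coin. The function names are illustrative.

```python
import math
from collections import Counter


def cosine_similarity(real_errors, synthetic_errors):
    """Cosine of the angle between two error-count vectors.

    Each argument is a list of (true_char, ocr_char) pairs (an assumed
    representation). Values near 1.0 mean the two error distributions
    point in nearly the same direction.
    """
    real, synth = Counter(real_errors), Counter(synthetic_errors)
    dot = sum(real[k] * synth[k] for k in real.keys() & synth.keys())
    norm = (math.sqrt(sum(v * v for v in real.values()))
            * math.sqrt(sum(v * v for v in synth.values())))
    return dot / norm if norm else 0.0


def coin_bias_p_value(correct, trials):
    """Two-sided exact binomial p-value against a fair coin (p = 0.5).

    `correct` is how often the observer correctly labeled an error as
    real or synthetic out of `trials` attempts. A large p-value means
    the observer is not distinguishable from a random guesser.
    """
    if trials == 0:
        return 1.0
    probs = [math.comb(trials, k) * 0.5 ** trials for k in range(trials + 1)]
    observed = probs[correct]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))
```

Under these assumptions, a defect model would look plausible when the cosine similarity is close to 1 and the observer's p-value is large, since both indicate that synthetic errors are hard to tell apart from real ones.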

Research Organization:
Nevada Univ., Las Vegas, NV (United States)
OSTI ID:
68571
Report Number(s):
CONF-9404212--
Country of Publication:
United States
Language:
English
