Validation of document image defect models for optical character recognition
- Panasonic Technologies, Inc., Princeton, NJ (United States)
In this paper we consider the problem of evaluating models for physical defects affecting the optical character recognition (OCR) process. While a number of such models have been proposed, the contention that they produce the desired result is typically argued in an ad hoc and informal way. We introduce a rigorous and more pragmatic definition of when a model is accurate: we say a defect model is validated if the OCR errors induced by the model are effectively indistinguishable from the errors encountered when using real scanned documents. We present two measures to quantify this similarity: the Vector Space method and the Coin Bias method. The former adapts an approach used in information retrieval, the latter simulates an observer attempting to do better than a {open_quotes}random{close_quotes} guesser. We compare and contrast the two techniques based on experimental data; both seem to work well, suggesting this is an appropriate formalism for the development and evaluation of document image defect models.
- Research Organization:
- Nevada Univ., Las Vegas, NV (United States)
- OSTI ID:
- 68571
- Report Number(s):
- CONF-9404212--
- Country of Publication:
- United States
- Language:
- English
Similar Records
An evaluation of an automatic markup system
Low-level structural recognition of documents