U.S. Department of Energy
Office of Scientific and Technical Information

Prediction of OCR accuracy using simple image features

Technical Report · OSTI ID: 46719

This report presents a classifier that predicts the character accuracy any Optical Character Recognition (OCR) system will achieve on a given page. The classifier measures the amount of white speckle, the amount of character fragments, and overall size information in the page. It uses no output from the OCR system. Each page is classified as either good quality (i.e., high OCR accuracy expected) or poor quality (i.e., low OCR accuracy expected). Six OCR systems processed two sets of test data: 439 pages from technical and scientific documents and 200 pages from magazines. For every system, approximately 85% of the pages in each data set were correctly predicted. The classifier is also compared with the ideal-case performance of a prediction method based on the number of reject markers in OCR-generated text. In several cases, the classifier matched or exceeded the performance of the reject-based approach.
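To make the idea concrete, here is a minimal Python sketch of a page-quality classifier built on the three kinds of features the abstract names. It is not the report's method: the abstract does not define the features or the decision rule, so the feature definitions (tiny white components as "white speckle", tiny black components as "fragments", median component height as "size information"), the function names, and every threshold below are assumptions chosen only for illustration. The page is assumed to be a binarized image given as a list of rows, with 1 for black and 0 for white.

```python
from collections import deque


def connected_components(grid, value):
    """Return the 4-connected components of pixels equal to `value`."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def page_features(grid, speckle_max=2, fragment_max=2):
    """Compute illustrative features; the size cutoffs are assumptions."""
    black = connected_components(grid, 1)
    white = connected_components(grid, 0)
    n_black = max(len(black), 1)  # avoid division by zero on blank pages
    speckle = sum(1 for comp in white if len(comp) <= speckle_max)
    fragments = sum(1 for comp in black if len(comp) <= fragment_max)
    heights = sorted(
        max(y for y, _ in comp) - min(y for y, _ in comp) + 1 for comp in black
    )
    median_height = heights[len(heights) // 2] if heights else 0
    return {
        "white_speckle": speckle / n_black,
        "fragments": fragments / n_black,
        "median_height": median_height,
    }


def classify_page(features, max_speckle=0.1, max_fragments=0.3, min_height=3):
    """Label a page 'good' or 'poor' using hypothetical thresholds."""
    ok = (features["white_speckle"] <= max_speckle
          and features["fragments"] <= max_fragments
          and features["median_height"] >= min_height)
    return "good" if ok else "poor"
```

A call such as `classify_page(page_features(grid))` returns `"good"` or `"poor"` without running any OCR system, which mirrors the property the abstract emphasizes. In practice the thresholds would have to be tuned against pages with known OCR accuracy.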

Research Organization: Nevada Univ., Las Vegas, NV (United States). Information Science Research Inst.
Sponsoring Organization: USDOE, Washington, DC (United States)
DOE Contract Number: FC08-90NV10872
OSTI ID: 46719
Report Number(s): CONF-950226--34; ON: DE95009887
Country of Publication: United States
Language: English
