Prediction of OCR accuracy using simple image features
A classifier is presented for predicting the character accuracy that any Optical Character Recognition (OCR) system would achieve on a given page. The classifier measures the amount of white speckle, the amount of character fragments, and overall size information in the page; it uses no output from the OCR system. Each page is classified as either good quality (i.e., high OCR accuracy expected) or poor quality (i.e., low OCR accuracy expected). Six OCR systems processed two sets of test data: 439 pages from technical and scientific documents and 200 pages from magazines. For every system, approximately 85% of the pages in each data set were correctly predicted. The classifier is also compared with the ideal-case performance of a prediction method based on the number of reject markers in OCR-generated text. In several cases, the classifier matched or exceeded the performance of the reject-based approach.
- Research Organization: Nevada Univ., Las Vegas, NV (United States). Information Science Research Inst.
- Sponsoring Organization: USDOE, Washington, DC (United States)
- DOE Contract Number: FC08-90NV10872
- OSTI ID: 46719
- Report Number(s): CONF-950226--34; ON: DE95009887
- Country of Publication: United States
- Language: English