An evaluation of an automatic markup system
One predominant application of OCR is the recognition of full-text documents for information retrieval. Modern retrieval systems exploit both the textual content of a document and its structure. The relationship between textual content and character accuracy has been the focus of recent studies. It has been shown that, because of the redundancies in text, average precision and recall are not heavily affected by OCR character errors. What is not fully known is the extent to which OCR devices can provide reliable information for capturing the structure of the document. In this paper, the authors present a preliminary report on the design and evaluation of a system that automatically marks up technical documents, based on information provided by an OCR device. The device the authors use differs from traditional OCR devices in that it not only performs optical character recognition but also provides detailed information about page layout, word geometry, and font usage. Their automatic markup program, which they call Autotag, uses this information, combined with dictionary lookup and content analysis, to identify structural components of the text. These include the document title, author information, abstract, sections, section titles, paragraphs, sentences, and de-hyphenated words. The output of the markup system will be compared against a visual examination of the hardcopy to determine its correctness.
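The abstract does not say how Autotag uses dictionary lookup, so the following is only a minimal sketch of one plausible use: deciding whether a word broken by a hyphen at a line end should be rejoined. The function name `dehyphenate`, the `dictionary` argument, and the sample lines are hypothetical and are not taken from the paper.

```python
def dehyphenate(lines, dictionary):
    """Rejoin words split by a hyphen at a line end (illustrative sketch only).

    lines      -- OCR text lines, in reading order
    dictionary -- set of known lowercase words (an assumed resource)
    """
    words = []
    pending = None  # first half of a word broken at the end of a line
    for line in lines:
        tokens = line.split()
        if pending is not None and tokens:
            first = tokens.pop(0)
            joined = pending + first
            # Ignore trailing punctuation when looking the word up.
            if joined.strip(".,;:!?").lower() in dictionary:
                words.append(joined)                 # soft hyphen: drop it
            else:
                words.append(pending + "-" + first)  # real hyphen: keep it
            pending = None
        if tokens and tokens[-1].endswith("-") and len(tokens[-1]) > 1:
            pending = tokens.pop()[:-1]
        words.extend(tokens)
    if pending is not None:
        words.append(pending + "-")  # hyphen at the very end of the text
    return words


sample_lines = ["optical character recog-", "nition of well-", "known documents"]
sample_dictionary = {"optical", "character", "recognition", "of", "known", "documents"}
result = dehyphenate(sample_lines, sample_dictionary)
```

In this sketch, a fragment pair is merged only when the joined form is found in the dictionary; otherwise the hyphen is kept, on the assumption that it belongs to a genuine compound.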
- Research Organization: Nevada Univ., Las Vegas, NV (United States). Information Science Research Inst.
- Sponsoring Organization: USDOE, Washington, DC (United States)
- DOE Contract Number: FC08-90NV10872
- OSTI ID: 46721
- Report Number(s): CONF-950226-33; ON: DE95009885; TRN: AHC29513%%136
- Resource Relation: Conference: SPIE '95: SPIE conference on optics, electro-optics, and laser application in science, engineering and medicine, San Jose, CA (United States), 5-14 Feb 1995; Other Information: PBD: [1995]
- Country of Publication: United States
- Language: English