
Title: An evaluation of an automatic markup system

Conference · OSTI ID: 46721

One predominant application of OCR is the recognition of full-text documents for information retrieval. Modern retrieval systems exploit both the textual content of a document and its structure. The relationship between textual content and character accuracy has been the focus of recent studies. It has been shown that, because of the redundancy in text, average precision and recall are not heavily affected by OCR character errors. What is not fully known is the extent to which OCR devices can provide reliable information for capturing the structure of a document. In this paper, the authors present a preliminary report on the design and evaluation of a system that automatically marks up technical documents based on information provided by an OCR device. The device the authors use differs from traditional OCR devices in that it not only performs optical character recognition but also provides detailed information about page layout, word geometry, and font usage. Their automatic markup program, which they call Autotag, uses this information, combined with dictionary lookup and content analysis, to identify structural components of the text. These include the document title, author information, abstract, sections, section titles, paragraphs, sentences, and de-hyphenated words. The output of the markup system will be compared against a visual examination of the hardcopy to determine its correctness.
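
As a rough illustration of how dictionary lookup could support de-hyphenation, one of the structural tasks listed in the abstract, the Python sketch below rejoins words split across line breaks. It is not Autotag's actual method, which the abstract does not describe; the function name `dehyphenate`, the set-based dictionary, and the sample lines are all hypothetical.

```python
def dehyphenate(lines, dictionary):
    """Rejoin words split by an end-of-line hyphen.

    Assumption (not from the paper): a line-final hyphen is treated as a
    soft hyphen when the joined word appears in `dictionary`; otherwise
    the hyphen is kept as part of a compound word.
    """
    out = []
    carry = None  # fragment left over from the previous line's trailing hyphen
    for line in lines:
        words = line.split()
        if carry is not None and words:
            joined = carry + words[0]
            if joined.lower() in dictionary:
                words[0] = joined                  # soft hyphen: drop it
            else:
                words[0] = carry + "-" + words[0]  # hard hyphen: keep it
            carry = None
        # A line-final token ending in "-" is held back until the next line.
        if words and len(words[-1]) > 1 and words[-1].endswith("-"):
            carry = words.pop()[:-1]
        out.extend(words)
    if carry is not None:
        out.append(carry + "-")  # no following line to join with
    return out


# Hypothetical OCR output lines and a tiny word list.
sample_lines = ["The recog-", "nition of well-", "known text"]
sample_dictionary = {"the", "recognition", "of", "well", "known", "text"}
result = dehyphenate(sample_lines, sample_dictionary)
```

Here `recog-` and `nition` are merged because `recognition` is in the word list, while `well-known` keeps its hyphen because `wellknown` is not. A real system would presumably also draw on the layout and geometry information the OCR device reports.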

Research Organization:
Nevada Univ., Las Vegas, NV (United States). Information Science Research Inst.
Sponsoring Organization:
USDOE, Washington, DC (United States)
DOE Contract Number:
FC08-90NV10872
OSTI ID:
46721
Report Number(s):
CONF-950226-33; ON: DE95009885; TRN: AHC29513%%136
Resource Relation:
Conference: SPIE '95: SPIE conference on optics, electro-optics, and laser application in science, engineering and medicine, San Jose, CA (United States), 5-14 Feb 1995; Other Information: PBD: [1995]
Country of Publication:
United States
Language:
English
