Title: Lexicon-based word recognition without word segmentation

Technical Report · OSTI ID: 68574
Author affiliation: Advanced Automation Technology Center, Menlo Park, CA (United States)

We present a word recognition approach that does not rely on explicit word segmentation. Instead of first dividing the character recognition output into words and then applying word-level contextual knowledge, it treats that output as a continuous string of characters. This technique is useful for degraded document images, in which isolating individual words by purely image- or character-based means is difficult or unreliable. We use a hypothesis generation and verification approach, in which word identities and their positions are hypothesized from "seed features" (character substrings) extracted from the output of the character recognizer. Verification consists of comparing the characters of each hypothesized word with the candidate characters near the position of the seed feature in the text, and then selecting the set of consecutive word hypotheses that are most mutually consistent. Hence, word segmentation and word recognition are effectively performed in parallel.
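
The hypothesize-and-verify idea in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract gives no algorithmic detail, so every name and parameter here (`build_seed_index`, `generate_hypotheses`, `verify`, `recognize`, `n`, `min_ratio`) is hypothetical. The sketch assumes that seeds are fixed-length character n-grams, that the recognizer returns a string of candidate characters per position (best guess first), that a hypothesis is scored by how many of its characters appear among the candidates, and that "most mutually consistent" can be approximated by a dynamic program choosing the highest-scoring chain of non-overlapping words.

```python
from collections import defaultdict

def build_seed_index(lexicon, n=2):
    """Map each character n-gram (a 'seed') to the (word, offset) pairs containing it."""
    index = defaultdict(set)
    for word in lexicon:
        for off in range(len(word) - n + 1):
            index[word[off:off + n]].add((word, off))
    return index

def generate_hypotheses(candidates, index, n=2):
    """Propose (start, word) pairs wherever a seed occurs in the top-choice string."""
    top = "".join(c[0] for c in candidates)  # first candidate = recognizer's best guess
    hyps = set()
    for i in range(len(top) - n + 1):
        for word, off in index.get(top[i:i + n], ()):
            start = i - off  # where the word would begin if this seed is correct
            if start >= 0 and start + len(word) <= len(top):
                hyps.add((start, word))
    return hyps

def verify(candidates, start, word):
    """Count the word's characters that appear among the candidates at each position."""
    return sum(ch in candidates[start + k] for k, ch in enumerate(word))

def recognize(candidates, lexicon, n=2, min_ratio=0.75):
    """Return the best-scoring chain of verified, non-overlapping word hypotheses."""
    index = build_seed_index(lexicon, n)
    by_end = defaultdict(list)
    for start, word in generate_hypotheses(candidates, index, n):
        score = verify(candidates, start, word)
        if score >= min_ratio * len(word):  # assumed acceptance threshold
            by_end[start + len(word)].append((start, word, score))
    # best[j] = (total score, words) for the best chain over the first j characters
    best = [(0, [])]
    for j in range(1, len(candidates) + 1):
        choice = best[j - 1]  # leave character j-1 unexplained
        for start, word, score in by_end[j]:
            total = best[start][0] + score
            if total > choice[0]:
                choice = (total, best[start][1] + [word])
        best.append(choice)
    return best[-1][1]

# "thecatsat" with no spaces; position 3 was misread as "e", with "c" as runner-up.
candidates = ["t", "h", "e", "ec", "a", "t", "s", "a", "t"]
lexicon = ["the", "cat", "sat", "at", "heat"]
print(recognize(candidates, lexicon, n=2))  # ['the', 'cat', 'sat']
```

In this toy input the word boundaries are never located in advance: they fall out of whichever chain of hypotheses wins, which mirrors the abstract's point that segmentation and recognition happen together. One limitation of this particular sketch is that lexicon words shorter than `n`, or words whose every seed is corrupted, are never hypothesized.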

Research Organization:
Nevada Univ., Las Vegas, NV (United States)
Report Number(s):
CONF-9404212-; TRN: 95:004349-0014
Resource Relation:
Conference: 3. annual symposium on document analysis and information retrieval, Las Vegas, NV (United States), 11-13 Apr 1994; Other Information: PBD: 1994; Related Information: Is Part Of Third Annual Symposium on Document Analysis and Information Retrieval; PB: 484 p.
Country of Publication:
United States
Language:
English