U.S. Department of Energy
Office of Scientific and Technical Information

Script and language determination from document images

Technical Report · OSTI ID: 68579
 [1]
  1. Fuji Xerox Palo Alto Laboratory, Palo Alto, CA (United States)

We have developed techniques for distinguishing which language is represented in an image of text. This work is restricted to a small but important subset of the world's languages, using techniques that should be applicable across much more comprehensive samples. The method first classifies the script into two broad classes: European and Asian. This classification is based on the spatial relationships of fiducial points related to the upward concavities in character structures. Language identification within the Asian script class (Japanese, Chinese, Korean) is performed by analysis of the optical density distribution of the text images. Within the European script class, language identification is described in separate papers.
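
As a rough illustration of the optical density step only, the sketch below computes the ink fraction of each character cell and summarizes a text sample as a normalized histogram of those fractions. It is not the report's implementation: the cell segmentation, the binary bitmap representation, the bin count, and the function names `optical_density` and `density_histogram` are all assumptions made for this example, and the abstract does not say how the resulting distribution is turned into a Japanese, Chinese, or Korean decision.

```python
from typing import List, Sequence

# Assumed representation: each character cell is a binary bitmap,
# with 1 for an ink pixel and 0 for background.
Bitmap = Sequence[Sequence[int]]


def optical_density(cell: Bitmap) -> float:
    """Return the fraction of ink pixels in one character cell."""
    total = sum(len(row) for row in cell)
    if total == 0:
        return 0.0
    ink = sum(1 for row in cell for px in row if px)
    return ink / total


def density_histogram(cells: Sequence[Bitmap], bins: int = 10) -> List[float]:
    """Return a normalized histogram of per-cell densities over [0, 1]."""
    counts = [0] * bins
    for cell in cells:
        d = optical_density(cell)
        # A density of exactly 1.0 falls into the last bin.
        idx = min(int(d * bins), bins - 1)
        counts[idx] += 1
    n = len(cells)
    return [c / n for c in counts] if n else [0.0] * bins


if __name__ == "__main__":
    sparse = [[1, 0], [0, 0]]  # hypothetical lightly inked cell
    dense = [[1, 1], [1, 1]]   # hypothetical fully inked cell
    print(density_histogram([sparse, dense], bins=4))
```

A classifier built on this idea would compare the histogram of an unknown sample against reference histograms for each language; the choice of comparison measure is left open here because the source does not specify one.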

Research Organization:
Nevada Univ., Las Vegas, NV (United States)
OSTI ID:
68579
Report Number(s):
CONF-9404212--
Country of Publication:
United States
Language:
English