U.S. Department of Energy
Office of Scientific and Technical Information

Automatic script identification from images using cluster-based templates

Conference
OSTI ID: 62630

We have developed a technique for automatically identifying the script used to generate a document that is stored electronically in bit image form. Our approach differs from previous work in that the distinctions among scripts are discovered by an automatic learning procedure, without any hands-on analysis. We first develop a set of representative symbols (templates) for each script in our database (Cyrillic, Roman, etc.). We do this by identifying all textual symbols in a set of training documents, scaling each symbol to a fixed size, clustering similar symbols, pruning minor clusters, and finding each cluster's centroid. To identify a new document's script, we identify and scale a subset of symbols from the document and compare them to the templates for each script. We choose the script whose templates provide the best match. Our current system distinguishes among the Armenian, Burmese, Chinese, Cyrillic, Ethiopic, Greek, Hebrew, Japanese, Korean, Roman, and Thai scripts with over 90% accuracy.
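The template-building and matching steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name a clustering algorithm, a distance measure, or a symbol size, so the greedy "leader" clustering, the squared-Euclidean distance, the 4x4 grid, and the names `scale`, `build_templates`, `identify_script`, `radius`, and `min_size` are all assumptions made for illustration. Symbol segmentation is assumed to have been done already.

```python
from typing import Dict, List

Bitmap = List[List[int]]  # a segmented symbol as rows of 0/1 pixels
Vector = List[float]      # a symbol scaled to a fixed size and flattened


def scale(symbol: Bitmap, size: int = 4) -> Vector:
    """Nearest-neighbour rescale to size x size (assumed scaling method)."""
    h, w = len(symbol), len(symbol[0])
    return [float(symbol[r * h // size][c * w // size])
            for r in range(size) for c in range(size)]


def distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (assumed similarity measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(members: List[Vector]) -> Vector:
    """Pixel-wise mean of the symbols in one cluster."""
    return [sum(col) / len(members) for col in zip(*members)]


def build_templates(symbols: List[Vector], radius: float = 2.0,
                    min_size: int = 2) -> List[Vector]:
    """Cluster similar symbols, prune minor clusters, return centroids."""
    clusters: List[List[Vector]] = []
    for s in symbols:
        # Greedy leader clustering (an assumption): join the first cluster
        # whose current centroid lies within `radius`, else start a new one.
        for members in clusters:
            if distance(s, centroid(members)) <= radius:
                members.append(s)
                break
        else:
            clusters.append([s])
    # Pruning: clusters with fewer than `min_size` members are dropped.
    return [centroid(m) for m in clusters if len(m) >= min_size]


def identify_script(symbols: List[Vector],
                    templates: Dict[str, List[Vector]]) -> str:
    """Return the script whose templates best match the given symbols.

    Every script is assumed to have at least one template.
    """
    def score(script_templates: List[Vector]) -> float:
        # Mean distance from each symbol to its closest template.
        return sum(min(distance(s, t) for t in script_templates)
                   for s in symbols) / len(symbols)
    return min(templates, key=lambda name: score(templates[name]))
```

In use, `build_templates` would be called once per script on the scaled symbols from that script's training documents, and the resulting dictionary of script name to template list would be passed to `identify_script` together with a sample of scaled symbols from the unknown document.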

Research Organization: Los Alamos National Lab., NM (United States)
Sponsoring Organization: Department of Defense, Washington, DC (United States)
DOE Contract Number: W-7405-ENG-36
OSTI ID: 62630
Report Number(s): LA-UR--95-21; CONF-950869--1; ON: DE95006339
Country of Publication: United States
Language: English