OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Automatic script identification from images using cluster-based templates

Abstract

We have developed a technique for automatically identifying the script used to generate a document that is stored electronically in bit image form. Our approach differs from previous work in that the distinctions among scripts are discovered by an automatic learning procedure, without any hands-on analysis. We first develop a set of representative symbols (templates) for each script in our database (Cyrillic, Roman, etc.). We do this by identifying all textual symbols in a set of training documents, scaling each symbol to a fixed size, clustering similar symbols, pruning minor clusters, and finding each cluster's centroid. To identify a new document's script, we identify and scale a subset of symbols from the document and compare them to the templates for each script. We choose the script whose templates provide the best match. Our current system distinguishes among the Armenian, Burmese, Chinese, Cyrillic, Ethiopic, Greek, Hebrew, Japanese, Korean, Roman, and Thai scripts with over 90% accuracy.
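The pipeline described in the abstract (scale, cluster, prune, take centroids, then match) can be illustrated with a minimal sketch. Everything specific here is an assumption, since the abstract does not give these details: the 4x4 grid size, the greedy "leader" clustering, the mean-absolute-difference distance, the `radius` and `min_size` thresholds, and all function names are hypothetical choices made for illustration and are not taken from the authors' system.

```python
from statistics import mean

SIZE = 4  # side of the fixed-size grid (hypothetical; the abstract gives no size)

def rescale(bitmap, size=SIZE):
    """Nearest-neighbour rescale of a 2-D 0/1 bitmap to size x size, flattened."""
    rows, cols = len(bitmap), len(bitmap[0])
    return [float(bitmap[r * rows // size][c * cols // size])
            for r in range(size) for c in range(size)]

def distance(a, b):
    """Mean absolute pixel difference between two flattened symbols (assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def centroid(members):
    """Pixel-wise mean of all symbols in one cluster."""
    return [mean(column) for column in zip(*members)]

def build_templates(symbols, radius=0.2, min_size=2):
    """Greedy leader clustering (an assumed stand-in for the paper's clustering),
    followed by pruning of clusters smaller than min_size; returns centroids."""
    clusters = []
    for s in symbols:
        for members in clusters:
            if distance(s, centroid(members)) <= radius:
                members.append(s)
                break
        else:
            clusters.append([s])  # no cluster was close enough: start a new one
    return [centroid(m) for m in clusters if len(m) >= min_size]

def identify(doc_symbols, templates_by_script):
    """Pick the script whose templates give the lowest average best-match distance.
    Assumes every script has at least one template."""
    def score(templates):
        return mean(min(distance(s, t) for t in templates) for s in doc_symbols)
    return min(templates_by_script, key=lambda name: score(templates_by_script[name]))
```

As a usage example with toy data, a "script" whose only symbol is a vertical bar and another whose only symbol is a horizontal bar can each be trained with `build_templates`, after which `identify` assigns a new vertical-bar symbol to the first script. A real system would first need to segment symbols from the page image, which this sketch leaves out.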

Authors:
Hochberg, J.; Kerns, L.; Kelly, P.; Thomas, T.
Publication Date:
Research Org.:
Los Alamos National Lab., NM (United States)
Sponsoring Org.:
Department of Defense, Washington, DC (United States)
OSTI Identifier:
62630
Report Number(s):
LA-UR-95-21; CONF-950869-1
ON: DE95006339
DOE Contract Number:  
W-7405-ENG-36
Resource Type:
Conference
Resource Relation:
Conference: 3rd International Conference on Document Analysis and Recognition, Montreal (Canada), 14-16 Aug 1995; Other Information: PBD: [1995]
Country of Publication:
United States
Language:
English
Subject:
99 MATHEMATICS, COMPUTERS, INFORMATION SCIENCE, MANAGEMENT, LAW, MISCELLANEOUS; 44 INSTRUMENTATION, INCLUDING NUCLEAR AND PARTICLE DETECTORS; IMAGE SCANNERS; DESIGN; IMAGE PROCESSING; IMAGE CONVERTERS; PATTERN RECOGNITION; PHOTOCOPYING; AUTOMATION

Citation Formats

Hochberg, J., Kerns, L., Kelly, P., and Thomas, T. Automatic script identification from images using cluster-based templates. United States: N. p., 1995. Web.
Hochberg, J., Kerns, L., Kelly, P., & Thomas, T. (1995). Automatic script identification from images using cluster-based templates. United States.
Hochberg, J., Kerns, L., Kelly, P., and Thomas, T. 1995. "Automatic script identification from images using cluster-based templates". United States. https://www.osti.gov/servlets/purl/62630.
@article{osti_62630,
title = {Automatic script identification from images using cluster-based templates},
author = {Hochberg, J. and Kerns, L. and Kelly, P. and Thomas, T.},
abstractNote = {We have developed a technique for automatically identifying the script used to generate a document that is stored electronically in bit image form. Our approach differs from previous work in that the distinctions among scripts are discovered by an automatic learning procedure, without any hands-on analysis. We first develop a set of representative symbols (templates) for each script in our database (Cyrillic, Roman, etc.). We do this by identifying all textual symbols in a set of training documents, scaling each symbol to a fixed size, clustering similar symbols, pruning minor clusters, and finding each cluster's centroid. To identify a new document's script, we identify and scale a subset of symbols from the document and compare them to the templates for each script. We choose the script whose templates provide the best match. Our current system distinguishes among the Armenian, Burmese, Chinese, Cyrillic, Ethiopic, Greek, Hebrew, Japanese, Korean, Roman, and Thai scripts with over 90% accuracy.},
doi = {},
journal = {},
number = {},
volume = {},
place = {United States},
year = {1995},
month = {2}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
