Automatic script identification from images using cluster-based templates
We have developed a technique for automatically identifying the script used to generate a document that is stored electronically in bit image form. Our approach differs from previous work in that the distinctions among scripts are discovered by an automatic learning procedure, without any hands-on analysis. We first develop a set of representative symbols (templates) for each script in our database (Cyrillic, Roman, etc.). We do this by identifying all textual symbols in a set of training documents, scaling each symbol to a fixed size, clustering similar symbols, pruning minor clusters, and finding each cluster's centroid. To identify a new document's script, we identify and scale a subset of symbols from the document and compare them to the templates for each script. We choose the script whose templates provide the best match. Our current system distinguishes among the Armenian, Burmese, Chinese, Cyrillic, Ethiopic, Greek, Hebrew, Japanese, Korean, Roman, and Thai scripts with over 90% accuracy.
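The pipeline outlined in the abstract (scale, cluster, prune, take centroids, then match) can be illustrated with a minimal Python sketch. The abstract does not specify the clustering algorithm, the distance measure, or any parameter values, so everything here is an assumption made for illustration: greedy leader clustering, Euclidean distance, nearest-neighbour rescaling, and the hypothetical names `rescale`, `build_templates`, `identify_script`, `radius`, and `min_size`. It is not the implementation from the report.

```python
# Illustrative sketch only. The clustering rule, distance measure, parameter
# values, and all function names are assumptions, not taken from the report.
from math import dist


def rescale(bitmap, size):
    """Nearest-neighbour resample of a 2-D 0/1 bitmap to size x size, flattened."""
    rows, cols = len(bitmap), len(bitmap[0])
    return tuple(
        float(bitmap[r * rows // size][c * cols // size])
        for r in range(size)
        for c in range(size)
    )


def centroid(members):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(members)
    return tuple(sum(column) / n for column in zip(*members))


def cluster_symbols(symbols, radius):
    """Greedy leader clustering (assumed): a symbol joins the first cluster
    whose current centroid lies within `radius`, else it starts a new one."""
    clusters = []
    for sym in symbols:
        for members in clusters:
            if dist(sym, centroid(members)) <= radius:
                members.append(sym)
                break
        else:
            clusters.append([sym])
    return clusters


def build_templates(symbols, radius=1.0, min_size=2):
    """Cluster symbols, prune clusters smaller than `min_size`, and return
    the centroid of each surviving cluster as a template."""
    clusters = cluster_symbols(symbols, radius)
    return [centroid(m) for m in clusters if len(m) >= min_size]


def identify_script(symbols, templates_by_script):
    """Return the script whose templates give the lowest mean
    nearest-template distance over the supplied symbols."""
    def score(templates):
        total = sum(
            min((dist(s, t) for t in templates), default=float("inf"))
            for s in symbols
        )
        return total / len(symbols)

    return min(templates_by_script, key=lambda name: score(templates_by_script[name]))
```

In use, one would call `build_templates` once per script on the rescaled training symbols, collect the results in a dictionary keyed by script name, and pass that dictionary to `identify_script` together with the rescaled symbols sampled from a new document.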
- Research Organization: Los Alamos National Lab., NM (United States)
- Sponsoring Organization: Department of Defense, Washington, DC (United States)
- DOE Contract Number: W-7405-ENG-36
- OSTI ID: 62630
- Report Number(s): LA-UR--95-21; CONF-950869--1; ON: DE95006339
- Country of Publication: United States
- Language: English