U.S. Department of Energy
Office of Scientific and Technical Information

Page segmentation using script identification vectors: A first look

Technical Report · DOI: https://doi.org/10.2172/495845 · OSTI ID: 495845

Document images in which different scripts, such as Chinese and Roman, appear on a single page pose a problem for optical character recognition (OCR) systems. This paper explores the use of script identification vectors in the analysis of multilingual document images. A script identification vector is calculated for each connected component in a document. The vector expresses the closest distance between the component and templates developed for each of thirteen scripts, including Arabic, Chinese, Cyrillic, and Roman. For each image, the authors calculate the first three principal components within the resulting thirteen-dimensional space. By mapping these components to red, green, and blue, they can visualize the information contained in the script identification vectors. The visualization of several multilingual images suggests that the script identification vectors can be used to segment images into script-specific regions as large as several paragraphs or as small as a few characters. The visualized vectors also reveal distinctions within scripts, such as font in Roman documents and kanji vs. kana in Japanese. Results are best for documents containing highly dissimilar scripts, such as Roman and Japanese. Documents containing similar scripts, such as Roman and Cyrillic, will require further investigation.
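
The visualization step described in the abstract can be illustrated with a minimal sketch. It assumes the script identification vectors for one image are already available as an n-by-13 array, one row per connected component. The function name `vectors_to_rgb`, the use of a singular value decomposition to obtain the principal components, and the min-max scaling of each component to 0..255 are assumptions made for illustration and are not taken from the report. The input data are synthetic.

```python
import numpy as np


def vectors_to_rgb(vectors):
    """Map each 13-dimensional vector to an RGB triple.

    Hypothetical helper: projects the vectors onto their first three
    principal components and rescales each component to 0..255.
    Assumes at least three rows, so that three components exist.
    """
    x = np.asarray(vectors, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:3].T
    lo = scores.min(axis=0)
    span = scores.max(axis=0) - lo
    span[span == 0] = 1.0  # guard against a constant component
    return np.round(255 * (scores - lo) / span).astype(np.uint8)


# Synthetic stand-in for real data: two groups of components whose
# distance profiles differ, as might happen with two dissimilar scripts.
rng = np.random.default_rng(0)
group_a = rng.normal(0.2, 0.05, size=(20, 13))
group_b = rng.normal(0.8, 0.05, size=(20, 13))
colors = vectors_to_rgb(np.vstack([group_a, group_b]))
```

In this sketch each row of `colors` would be used to paint the pixels of the corresponding connected component, so that components with similar vectors receive similar colors. Because the principal components are computed per image, colors are comparable only within a single image.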

Research Organization: Los Alamos National Lab., NM (United States)
Sponsoring Organization: USDOE Assistant Secretary for Human Resources and Administration, Washington, DC (United States)
DOE Contract Number: W-7405-ENG-36
OSTI ID: 495845
Report Number(s): LA-UR--97-1281; ON: DE97007224
Country of Publication: United States
Language: English