ETRI-Knowledge Sharing Platform

Journal Article An algorithm of line segmentation and reading order sorting based on adjacent character detection: A post-processing of OCR for digitization of Chinese historical texts
Cited 3 times in Scopus. Downloaded 124 times.
Authors
Aram Lee, HongYeon Yu, Gihyeon Min
Issue Date
2024-05
Citation
Journal of Cultural Heritage, v.67, pp.80-91
ISSN
1296-2074
Publisher
Elsevier Masson
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.1016/j.culher.2024.02.001
Abstract
AI-based optical character recognition (OCR) has recently attracted significant attention for digital text conversion. However, OCR identifies only individual characters or words and cannot reunite them into cohesive units such as words or sentences. Consequently, manually sorting them into the appropriate reading order has become a bottleneck. In this paper, we present an algorithm termed adjacent character detection (ACD), designed as a post-processing step for OCR that facilitates automatic digital text conversion. The algorithm performs line segmentation through a quad-ACD scan (up-down-down-up), which lets it consecutively discern the characters within a column based on their adjacency relations. Conventional projection profile analyses have struggled to partition the distinctive internal structure of Chinese historical texts, in which a single body column often subdivides into two annotation columns. In contrast, our ACD algorithm reunites adjacent characters rather than fragmenting the entire text into isolated entities. Additionally, the ACD algorithm enables body/annotation classification of OCR-detected characters based on pattern analysis of its quad scan. This cumulative information enables conversion to digital text in the desired reading order. To assess the efficacy of the proposed algorithm, a set of ground-truth OCR results was tested, yielding a reading order accuracy of 98.6%. The algorithm also proved robust to misaligned columns, induced experimentally by applying tilt, warp, and wavy noise to the original digital images. Lastly, the algorithm was integrated with two pre-developed OCR models, resulting in a reading order accuracy of 97.7%.
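The adjacency idea in the abstract can be illustrated with a much-simplified sketch. This is not the authors' quad-ACD scan: it makes a single downward pass, assumes vertical columns read top to bottom and right to left, and does not handle the split of a body column into two annotation columns. The `Box` type, the function names, and the `min_overlap` and `max_gap` thresholds are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """One OCR-detected character (hypothetical structure for this sketch)."""
    char: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def x_overlap(a: Box, b: Box) -> float:
    """Horizontal overlap as a fraction of the narrower box's width."""
    left = max(a.x, b.x)
    right = min(a.x + a.w, b.x + b.w)
    return max(0.0, right - left) / min(a.w, b.w)


def next_below(i: int, boxes: list, min_overlap: float = 0.5, max_gap: float = 1.5):
    """Index of the nearest box below boxes[i] in the same column, or None.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    cur = boxes[i]
    candidates = [
        j for j, b in enumerate(boxes)
        if j != i
        and b.y > cur.y                                # strictly below
        and x_overlap(cur, b) >= min_overlap           # horizontally aligned
        and b.y - (cur.y + cur.h) <= max_gap * cur.h   # not too far away
    ]
    return min(candidates, key=lambda j: boxes[j].y, default=None)


def group_columns(boxes: list) -> list:
    """Chain adjacent characters into columns, ordered right to left."""
    succ = [next_below(i, boxes) for i in range(len(boxes))]
    has_pred = {j for j in succ if j is not None}
    heads = [i for i in range(len(boxes)) if i not in has_pred]

    seen = set()
    columns = []
    for head in heads:
        col, i = [], head
        # Follow successor links downward; skip boxes already claimed.
        while i is not None and i not in seen:
            seen.add(i)
            col.append(boxes[i])
            i = succ[i]
        if col:
            columns.append(col)

    columns.sort(key=lambda col: -col[0].x)  # rightmost column first
    return columns


def reading_order(boxes: list) -> str:
    """Concatenate characters column by column."""
    return "".join(b.char for col in group_columns(boxes) for b in col)
```

Linking each character to its nearest aligned neighbour, rather than cutting the page along projection-profile valleys, is what lets this style of grouping tolerate mildly misaligned columns; the real algorithm adds further scans and the body/annotation classification described above.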
KSP Keywords
Character detection, Ground truth, Line segmentation, Optical character recognition, Post-processing, Projection profile, Reading order, Digital image, Individual characters, Internal structure, Pattern analysis
This work is distributed under the terms of a Creative Commons License (CC BY-NC-ND).