ETRI Knowledge Sharing Platform : An algorithm of line segmentation and reading order sorting based on adjacent character detection: A post-processing of OCR for digitization of Chinese historical texts


Journal Article An algorithm of line segmentation and reading order sorting based on adjacent character detection: A post-processing of OCR for digitization of Chinese historical texts

Cited 5 times in Scopus


Authors: Aram Lee, HongYeon Yu, Gihyeon Min

Issue Date: 2024-05

Citation: Journal of Cultural Heritage, v.67, pp.80-91

ISSN: 1296-2074

Publisher: Elsevier Masson

Language: English

Type: Journal Article

DOI: https://dx.doi.org/10.1016/j.culher.2024.02.001

Abstract: AI-based optical character recognition (OCR) has recently attracted significant attention for digital text conversion. However, OCR identifies only individual characters or words and cannot reunite them into cohesive units such as words or sentences. Consequently, manually sorting the recognized characters into the appropriate reading order has become a bottleneck. In this paper, we present an algorithm termed adjacent character detection (ACD), designed as a post-processing step for OCR that facilitates automatic digital text conversion. The algorithm performs line segmentation through a quad-ACD scan (up-down-down-up), which consecutively identifies the characters within a column based on their adjacency relations. Conventional projection profile analyses have struggled to partition the distinctive internal structure of Chinese historical texts, in which a single body column often subdivides into two annotation columns. In contrast, our ACD algorithm reunites adjacent characters rather than fragmenting the entire text into isolated entities. Additionally, the ACD algorithm enables body/annotation classification of OCR-detected characters based on pattern analysis of its quad scan. This cumulative information allows the digital text to be converted in the desired reading order. To assess the proposed algorithm, it was tested on a set of ground-truth OCR results, yielding a reading order accuracy of 98.6%. The algorithm also remained robust to misaligned columns, induced experimentally by applying tilt, warp, and wavy noise to the original digital images. Lastly, the algorithm was integrated with two pre-developed OCR models, resulting in a reading order accuracy of 97.7%.
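The idea of reuniting characters by adjacency, rather than cutting the page with projection profiles, can be illustrated with a small sketch. This is not the authors' implementation: it performs a single downward pass instead of the quad scan (up-down-down-up), does no body/annotation classification, and assumes columns are read right to left. The names `CharBox`, `find_lower_neighbor`, `segment_columns`, `reading_order`, and the overlap and gap thresholds are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CharBox:
    """One OCR-detected character with its bounding box (top-left origin)."""
    char: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def find_lower_neighbor(box: CharBox, boxes: List[CharBox],
                        max_gap_ratio: float = 1.0) -> Optional[CharBox]:
    """Return the nearest box below `box` that overlaps it horizontally.

    The thresholds (half-width overlap, gap of at most one character
    height) are assumptions for this sketch, not values from the paper.
    """
    best = None
    for other in boxes:
        if other is box or other.y < box.y + box.h / 2:
            continue  # skip the box itself and anything not below it
        overlap = min(box.x + box.w, other.x + other.w) - max(box.x, other.x)
        if overlap < 0.5 * min(box.w, other.w):
            continue  # not in the same vertical column
        if other.y - (box.y + box.h) > max_gap_ratio * box.h:
            continue  # too far down to count as adjacent
        if best is None or other.y < best.y:
            best = other
    return best


def segment_columns(boxes: List[CharBox]) -> List[List[CharBox]]:
    """Chain each unassigned character to its lower neighbors."""
    assigned = set()
    columns = []
    for start in sorted(boxes, key=lambda b: (b.y, b.x)):
        column = []
        cur = start
        while cur is not None and id(cur) not in assigned:
            assigned.add(id(cur))
            column.append(cur)
            cur = find_lower_neighbor(cur, boxes)
        if column:
            columns.append(column)
    return columns


def reading_order(boxes: List[CharBox]) -> str:
    """Join columns right to left, each column top to bottom."""
    columns = segment_columns(boxes)
    # Sort by mean horizontal center, rightmost column first.
    columns.sort(key=lambda col: -sum(b.x + b.w / 2 for b in col) / len(col))
    return "".join(b.char for col in columns for b in col)


if __name__ == "__main__":
    page = [
        CharBox("玄", 50, 0, 20, 20), CharBox("黃", 50, 30, 20, 20),
        CharBox("天", 100, 0, 20, 20), CharBox("地", 100, 30, 20, 20),
    ]
    print(reading_order(page))  # prints 天地玄黃
```

Because each character is linked only to its immediate neighbor, a column that drifts sideways can still be followed one character at a time, which is one plausible reason an adjacency-based approach tolerates tilted or wavy columns better than a global projection profile.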

KSP Keywords: Character detection, Ground truth, Line segmentation, Optical character recognition, Post-processing, Projection profile, Reading order, Digital image, Individual characters, Internal structure, Pattern analysis

This work is distributed under the terms of a Creative Commons License (CC BY-NC-ND).
