ETRI Knowledge Sharing Platform


Dual-Stream Former: A Dual-Branch Transformer Architecture for Visual Speech Recognition
Cited 0 times in Scopus · Downloaded 49 times
Authors
Sanghun Jeon, Jieun Lee, Yong-Ju Lee
Issue Date
2025-09
Citation
AI (Switzerland), v.6, no.9, pp.1-25
ISSN
2673-2688
Publisher
Multidisciplinary Digital Publishing Institute (MDPI)
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.3390/ai6090222
Abstract
This study proposes Dual-Stream Former, a novel architecture that integrates a Video Swin Transformer and a Conformer, designed to address the challenges of visual speech recognition (VSR). The model captures spatiotemporal dependencies, achieving a state-of-the-art character error rate (CER) of 3.46%, surpassing traditional convolutional neural network (CNN)-based models, such as 3D-CNN + DenseNet-121 (CER: 5.31%), and transformer-based alternatives, such as vision transformers (CER: 4.05%). The Video Swin Transformer captures multiscale spatial representations with high computational efficiency, whereas the Conformer back-end enhances temporal modeling across diverse phoneme categories. Evaluation on a high-resolution dataset comprising 740,000 utterances across 185 classes highlighted the effectiveness of the model in addressing visually confusable phonemes, such as diphthongs (/ai/, /au/) and labiodental sounds (/f/, /v/). Dual-Stream Former achieved phoneme recognition error rates of 10.39% for diphthongs and 9.25% for labiodental sounds, surpassing those of CNN-based architectures by more than 6%. Although the model's large parameter count (168.6 M) poses resource challenges, its hierarchical design ensures scalability. Future work will explore lightweight adaptations and multimodal extensions to increase deployment feasibility. These findings underscore the transformative potential of Dual-Stream Former for advancing VSR applications such as silent communication and assistive technologies by achieving unparalleled precision and robustness in diverse settings.
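The abstract describes a two-stage pipeline: a spatial front-end (Video Swin Transformer) that encodes each lip-region frame, followed by a temporal back-end (Conformer) that models dependencies across frames before character classification. The sketch below illustrates only this dual-stage data flow with simple NumPy stand-ins; all function names, dimensions, and the single-head attention used here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_frontend(video, d_model=64):
    """Stand-in for the Video Swin Transformer front-end:
    maps each frame (H, W) to a d_model-dim spatial feature."""
    t, h, w = video.shape
    proj = rng.standard_normal((h * w, d_model)) / np.sqrt(h * w)
    return video.reshape(t, h * w) @ proj          # (T, d_model)

def temporal_backend(features):
    """Stand-in for the Conformer back-end: softmax self-attention
    mixing features across the time axis (one attention head)."""
    scores = features @ features.T / np.sqrt(features.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ features                      # (T, d_model)

video = rng.standard_normal((16, 32, 32))          # 16 frames of 32x32 lip crops
classifier = rng.standard_normal((64, 185))        # 185 classes, as in the dataset
char_logits = temporal_backend(spatial_frontend(video)) @ classifier
print(char_logits.shape)                           # (16, 185): per-frame class scores
```

A real implementation would replace both stand-ins with their pretrained counterparts and decode the per-frame scores into a character sequence (e.g., with a CTC or attention decoder).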
KSP Keywords
Computational Efficiency, Convolutional neural network (CNN), Dual-branch, High resolution, Recognition error, Spatial Representation, Spatiotemporal dependencies, Temporal modeling, Visual Speech Recognition, assistive technologies, back-end
This work is distributed under the terms of the Creative Commons License (CC BY).