ETRI Knowledge Sharing Platform


Improving Cross-Attention based on Positional Alignment during Inference for Robust Long-form Speech Recognition
Authors
Changhan Oh, Kiyoung Park, Jeomja Kang, Woo Yong Choi, Hwa Jeon Song
Issue Date
2025-08
Citation
International Speech Communication Association (INTERSPEECH) 2025, pp.3329-3333
Language
English
Type
Conference Paper
DOI
https://dx.doi.org/10.21437/Interspeech.2025-2582
Abstract
End-to-end (E2E) models have significantly advanced automatic speech recognition (ASR), with hybrid architectures that combine Connectionist Temporal Classification (CTC) and attention-based encoder-decoder (AED) models demonstrating superior performance. However, AED architectures, particularly Conformer, face notable challenges with long-form speech; performance degradation becomes evident for audio exceeding 25 seconds. In this study, we propose improving the Conformer's robustness for long-form ASR by applying Gaussian masking to the cross-attention mechanism of the Transformer decoder during inference, using the aligned positions obtained from the CTC prefix score. The proposed method achieves an error reduction rate (ERR) of 88.41% (from 26.41% to 3.06%) for audio longer than 20 seconds on a LibriSpeech evaluation set constructed by concatenating three utterances. Moreover, the method remains effective with either a Transformer or an E-Branchformer encoder.
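To make the core idea concrete, the sketch below shows one way Gaussian masking of decoder cross-attention at inference time could look. This is not the authors' implementation: it assumes the aligned encoder frame for the current output token has already been computed elsewhere (the paper derives it from the CTC prefix score), and `sigma`, the window width in encoder frames, is a hypothetical hyperparameter. Adding a log-domain Gaussian to the attention logits before the softmax is equivalent to multiplying the attention weights by a Gaussian centered on the aligned position.

```python
import torch

def gaussian_masked_cross_attention(scores, center_frame, sigma=10.0):
    """Apply a Gaussian mask to cross-attention logits at inference time.

    scores:       (num_heads, T_enc) raw cross-attention logits for the
                  current decoder step over T_enc encoder frames.
    center_frame: encoder frame aligned to the current output token,
                  e.g. obtained from the CTC prefix score (the alignment
                  source is assumed here, not implemented).
    sigma:        Gaussian window width in encoder frames (an assumed
                  hyperparameter, not taken from the paper).
    """
    t_enc = scores.size(-1)
    positions = torch.arange(t_enc, dtype=scores.dtype, device=scores.device)
    # Log-domain Gaussian: adding it to the logits before the softmax
    # multiplies the resulting attention weights by a Gaussian, so mass
    # far from the aligned position is suppressed.
    log_mask = -0.5 * ((positions - float(center_frame)) / sigma) ** 2
    return torch.softmax(scores + log_mask, dim=-1)

# Example: 8 attention heads over 600 encoder frames, token aligned to frame 250.
scores = torch.randn(8, 600)
weights = gaussian_masked_cross_attention(scores, center_frame=250, sigma=10.0)
assert torch.allclose(weights.sum(dim=-1), torch.ones(8))
```

Because the mask is applied only during inference, such a scheme would leave training untouched; the decoder is simply discouraged from attending to distant encoder frames, which is the failure mode the abstract describes for long-form audio.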
KSP Keywords
Attention mechanism, Encoder and Decoder, End-to-End (E2E), Error reduction, automatic speech recognition (ASR), connectionist temporal classification (CTC), hybrid architecture, performance degradation, reduction rate, superior performance