ETRI Knowledge Sharing Platform : Improving Cross-Attention based on Positional Alignment during Inference for Robust Long-form Speech Recognition

Conference Paper: Improving Cross-Attention based on Positional Alignment during Inference for Robust Long-form Speech Recognition

Cited 0 times in Scopus

Authors: Changhan Oh, Kiyoung Park, Jeomja Kang, Woo Yong Choi, Hwa Jeon Song

Citation: International Speech Communication Association (INTERSPEECH) 2025, pp.3329-3333

Abstract: End-to-end (E2E) models have significantly advanced automatic speech recognition (ASR), with hybrid architectures that combine Connectionist Temporal Classification (CTC) and attention-based encoder-decoder (AED) models demonstrating superior performance. However, AED architectures, particularly Conformer, face notable challenges with long-form speech, with performance degradation becoming evident for audio exceeding 25 seconds. In this study, we propose improving the Conformer's robustness for long-form ASR by applying Gaussian masking to the cross-attention mechanism of the Transformer decoder during inference, using the aligned positions obtained from the CTC prefix score. The proposed method achieves an error reduction rate (ERR) of 88.41% (from 26.41% to 3.06%) for audio longer than 20 seconds on a LibriSpeech evaluation set constructed by concatenating three utterances. Moreover, the method remains effective with either a Transformer or an E-Branchformer encoder.
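The idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the mask is applied multiplicatively to the softmax output and renormalized, the aligned frame index `center` is simply passed in (in the paper it would come from the CTC prefix score), and the width `sigma` is a free parameter. The function names are hypothetical. The last helper reproduces the ERR arithmetic quoted in the abstract, taken here to be the relative reduction between the two error rates.

```python
import math


def gaussian_mask(num_frames, center, sigma):
    """Unnormalized Gaussian weights over encoder frames, peaked at `center`."""
    return [math.exp(-0.5 * ((t - center) / sigma) ** 2) for t in range(num_frames)]


def masked_cross_attention(scores, center, sigma):
    """Softmax over raw attention scores, reweighted by a Gaussian mask.

    Assumption: the mask multiplies the softmax numerators and the result is
    renormalized to sum to 1. The paper may apply the mask differently.
    """
    peak = max(scores)  # subtract the max for numerical stability
    exp_scores = [math.exp(s - peak) for s in scores]
    mask = gaussian_mask(len(scores), center, sigma)
    weighted = [e * m for e, m in zip(exp_scores, mask)]
    total = sum(weighted)
    return [w / total for w in weighted]


def error_reduction_rate(baseline, proposed):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Toy case: 20 encoder frames with two equally strong score peaks, as might
# happen when similar content appears twice in a long recording.
scores = [0.0] * 20
scores[2] = 5.0
scores[17] = 5.0

# Suppose the (assumed) CTC-derived alignment says this token sits at frame 2.
attention = masked_cross_attention(scores, center=2, sigma=2.0)
print(attention.index(max(attention)))

# ERR for the figures quoted in the abstract.
print(round(error_reduction_rate(26.41, 3.06), 2))
```

Without the mask, the two peaks would split the attention weight evenly. With the mask centered on the aligned frame, the distant peak is suppressed and the weight concentrates near frame 2, which is the intuition behind using alignment information to keep the decoder from attending to the wrong part of a long input.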

KSP Keywords: Attention mechanism, Encoder and decoder, End-to-end (E2E), Error reduction, Automatic speech recognition (ASR), Connectionist temporal classification (CTC), Hybrid architecture, Performance degradation, Reduction rate, Superior performance
