ETRI Knowledge Sharing Platform : Audio-Visual Overlapped Speech Detection for Spontaneous Distant Speech


Journal Article: Audio-Visual Overlapped Speech Detection for Spontaneous Distant Speech

Cited 2 times in Scopus


Abstract: Although advances in deep learning have brought remarkable improvements to Overlapped Speech Detection (OSD), performance in far-field environments is still limited owing to the lack of real-world overlapped speech and a low signal-to-noise ratio. In this paper, we present an end-to-end audio-visual OSD system based on decision fusion between the audio and video modalities. First, we propose a simple yet powerful audio data augmentation method for spontaneous distant speech data. Second, to maximize the effectiveness of the video modality, we design a video OSD system based on a cross-speaker attention module that explores the visual correlation between multiple speakers. Finally, we present a cross-modality attention module to make the final decision more accurate. Our experimental results demonstrate that our approach outperforms current state-of-the-art methods on a real-world distant speech dataset. Moreover, our approach detects overlapped speech more robustly than its counterpart that uses the audio modality alone.
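
To make the idea of decision fusion concrete, the following is a minimal sketch, not the paper's implementation. It assumes each modality already yields a per-frame overlap probability plus a reliability score, and it combines them with softmax weights as a simplified stand-in for the learned cross-modality attention module. The function names, the reliability scores, and the 0.5 threshold are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_decisions(audio_probs, video_probs, audio_scores, video_scores,
                   threshold=0.5):
    """Fuse per-frame overlap probabilities from two modalities.

    audio_probs / video_probs: per-frame P(overlap) from each
        single-modality detector (assumed to exist already).
    audio_scores / video_scores: hypothetical per-frame reliability
        scores; in a learned system an attention module would produce them.
    Returns (fused_probs, decisions), where decisions[i] is True when
    frame i is judged to contain overlapped speech.
    """
    lengths = {len(audio_probs), len(video_probs),
               len(audio_scores), len(video_scores)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same number of frames")

    fused_probs, decisions = [], []
    for p_a, p_v, s_a, s_v in zip(audio_probs, video_probs,
                                  audio_scores, video_scores):
        # Softmax turns the two reliability scores into weights summing to 1.
        w_a, w_v = softmax([s_a, s_v])
        p = w_a * p_a + w_v * p_v
        fused_probs.append(p)
        decisions.append(p >= threshold)
    return fused_probs, decisions


if __name__ == "__main__":
    # Toy input: the audio detector is trusted less on the second frame,
    # for example because of a low signal-to-noise ratio.
    probs, flags = fuse_decisions(
        audio_probs=[0.8, 0.9],
        video_probs=[0.4, 0.1],
        audio_scores=[0.0, -3.0],
        video_scores=[0.0, 3.0],
    )
    print(probs, flags)
```

With equal reliability scores the fusion reduces to a plain average of the two probabilities; as one score dominates, the fused decision follows that modality, which is the behaviour a noisy far-field audio channel would call for.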

KSP Keywords: Audio and video, Audio data, Audio-visual, Augmentation method, Current state, Data Augmentation, Decision Fusion, End to End(E2E), Far-field, Field environment, Real-world

This work is distributed under the terms of the Creative Commons License (CCL): CC BY.
