ETRI Knowledge Sharing Platform : Modeling Long-Term Multimodal Representations for Active Speaker Detection with Spatio-Positional Encoder


Journal Article: Modeling Long-Term Multimodal Representations for Active Speaker Detection with Spatio-Positional Encoder

Cited 1 time in Scopus


Abstract: In this study, we present an end-to-end framework for active speaker detection that achieves robust performance in challenging scenarios with multiple speakers. In contrast to recent approaches, which rely heavily on the visual relational context between all speakers in a video frame, we propose collaboratively learning multimodal representations based on the audio and visual signals of a single candidate. First, we propose a spatio-positional encoder to effectively address the problem of false detections caused by indiscernible faces in a video frame. Second, we present an efficient multimodal approach that models the long-term temporal contextual interactions between the audio and visual modalities. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that our framework notably outperforms recent state-of-the-art approaches under challenging multi-speaker settings. Additionally, the proposed framework significantly improves robustness against auditory and visual noise interference without relying on pre-trained networks or hand-crafted training strategies.

KSP Keywords: Active speaker detection, End-to-End (E2E), False detection, Multimodal approach, Multimodal representation, Robust performance, Visual noise, Visual signals, Multiple speakers, Noise interference, Pre-trained networks

This work is distributed under the terms of the Creative Commons License (CC BY).

