ETRI Knowledge Sharing Platform

Modeling Long-Term Multimodal Representations for Active Speaker Detection with Spatio-Positional Encoder
Cited 1 time in Scopus
Authors
Minyoung Kyoung, Hwa Jeon Song
Issue Date
2023-10
Citation
IEEE Access, v.11, pp.116561-116569
ISSN
2169-3536
Publisher
Institute of Electrical and Electronics Engineers Inc.
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.1109/ACCESS.2023.3325474
Abstract
In this study, we present an end-to-end framework for active speaker detection to achieve robust performance in challenging scenarios with multiple speakers. In contrast to recent approaches, which rely heavily on the visual relational context between all speakers in a video frame, we propose collaboratively learning multimodal representations based on the audio and visual signals of a single candidate. Firstly, we propose a spatio-positional encoder to effectively address the problem of false detections caused by indiscernible faces in a video frame. Secondly, we present an efficient multimodal approach that models the long-term temporal contextual interactions between audio and visual modalities. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that our framework notably outperforms recent state-of-the-art approaches under challenging multi-speaker settings. Additionally, the proposed framework significantly improves the robustness against auditory and visual noise interference without relying on pre-trained networks or hand-crafted training strategies.
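To make the abstract's two components concrete, the following is a minimal, hypothetical PyTorch sketch. The idea of embedding each candidate face's normalized bounding-box position, the transformer-based long-term audio-visual fusion, and every class name, parameter, and feature dimension below are illustrative assumptions inferred from the abstract alone, not the authors' implementation.

import torch
import torch.nn as nn

class SpatioPositionalEncoder(nn.Module):
    """Embeds a candidate face's position and size within the frame
    (assumption: this is how indiscernible faces are contextualized)."""
    def __init__(self, dim=128):
        super().__init__()
        # Input is the face box (cx, cy, w, h), normalized to [0, 1].
        self.proj = nn.Sequential(
            nn.Linear(4, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, boxes):          # boxes: (B, T, 4)
        return self.proj(boxes)        # (B, T, dim)

class AudioVisualASD(nn.Module):
    """Single-candidate active speaker detector: fuses the audio and
    visual streams of one candidate, then models long-term temporal
    context over the whole clip with a transformer encoder."""
    def __init__(self, dim=128, heads=4, layers=3):
        super().__init__()
        self.spe = SpatioPositionalEncoder(dim)
        self.a_proj = nn.Linear(40, dim)    # e.g. 40-dim log-mel frames
        self.v_proj = nn.Linear(512, dim)   # e.g. 512-dim face embeddings
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                         batch_first=True)
        self.temporal = nn.TransformerEncoder(enc, num_layers=layers)
        self.head = nn.Linear(dim, 1)       # per-frame speaking logit

    def forward(self, audio, visual, boxes):
        # audio: (B, T, 40), visual: (B, T, 512), boxes: (B, T, 4)
        v = self.v_proj(visual) + self.spe(boxes)  # inject spatial context
        a = self.a_proj(audio)
        x = self.temporal(a + v)                   # long-term A/V interaction
        return self.head(x).squeeze(-1)            # (B, T) logits

# Usage: a 2-clip batch of 64 frames each yields (2, 64) speaking logits.
model = AudioVisualASD()
logits = model(torch.randn(2, 64, 40), torch.randn(2, 64, 512),
               torch.rand(2, 64, 4))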
KSP Keywords
Active speaker detection, End to End (E2E), False detection, Multimodal approach, Multimodal representation, Robust performance, Visual noise, Visual signals, multiple speakers, noise interference, pre-trained networks
This work is distributed under the terms of the Creative Commons License (CC BY).