ETRI-Knowledge Sharing Platform

Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization
Cited 1 time in Scopus
Authors
Geuntaek Lim, Hyunwoo Kim, Joonsoo Kim, Yukyung Choi
Issue Date
2024-10
Citation
International Conference on Multimedia (MM) 2024, pp. 5507-5516
Language
English
Type
Conference Paper
DOI
https://dx.doi.org/10.1145/3664647.3681537
Abstract
Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Since many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, there are still areas where existing research falls short. Previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in a joint space. Furthermore, the deterministic representation employed in previous studies struggles to capture fine-grained human motions. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in a probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies reveal that our method significantly outperforms all previous state-of-the-art methods. Code is available at https://github.com/sejong-rcv/PVLR.
KSP Keywords
Action Classification, Action localization, Embedding space, Fine grained(FG), Human action, Human motion, Joint space, Language Model, Language representation, Pre-Training, Probabilistic embedding