ETRI Knowledge Sharing Platform

Conference Paper T2V2T: Text-to-Video-to-Text Fusion for Text-to-Video Retrieval
Cited 0 times in Scopus · Downloaded 176 times
Authors
Jonghee Kim, Youngwan Lee, Jinyoung Moon
Issue Date
2023-06
Citation
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2023, pp.5612-5617
Language
English
Type
Conference Paper
DOI
https://dx.doi.org/10.1109/CVPRW59228.2023.00594
Abstract
Video-language transformers for text-to-video retrieval typically consist of a video encoder, a text encoder, and a joint encoder. The joint encoder can be categorized into 1) self-attention-based fusion and 2) unidirectional fusion based on cross-attention. The former approach performs self-attention on the concatenation of video and text embeddings. Although it allows complete interaction between text and video, the length of the input sequences makes it computationally intensive. Instead, unidirectional fusion employs more efficient cross-attention to fuse video embeddings into text embeddings while ignoring text-to-video interaction. The text-to-video direction is not well explored because of the information imbalance between text and video, which makes it difficult to determine which video patches can be used as queries in cross-attention; i.e., a text embedding corresponds to one or more patch embeddings, while a video patch embedding may not correspond to any text embedding. To address this challenge, we devise a Bypass cross-attention (Bypass CA), which prevents matching between irrelevant video and text embedding pairs in the cross-attention. Using Bypass CA, we propose a novel bidirectional interaction approach, Text-to-Video-to-Text (T2V2T) fusion. The proposed T2V2T uses two unidirectional fusions with opposite directions, i.e., text-to-video fusion followed by video-to-text fusion. As a result, the proposed T2V2T fusion yields state-of-the-art results on MSR-VTT, DiDeMo, and ActivityNet Captions.
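The two chained unidirectional fusions described in the abstract can be sketched in plain numpy. This is a minimal single-head illustration, not the paper's implementation: the shapes, residual connections, and the absence of learned projection weights and of the Bypass CA masking are all simplifying assumptions made here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # queries: (Lq, d); keys_values: (Lkv, d).
    # Each query attends over all key/value embeddings.
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

rng = np.random.default_rng(0)
d = 8                                  # embedding dim (illustrative)
text = rng.normal(size=(5, d))         # 5 text token embeddings
video = rng.normal(size=(12, d))       # 12 video patch embeddings

# Text-to-Video fusion: video patches act as queries over text tokens.
video_fused = video + cross_attention(video, text, d)
# Video-to-Text fusion: text tokens act as queries over the fused patches.
text_fused = text + cross_attention(text, video_fused, d)

print(text_fused.shape)  # (5, 8)
```

In the paper, the first (text-to-video) step is where the information imbalance bites, since some patch queries have no matching text; the proposed Bypass CA, omitted in this sketch, is what keeps such irrelevant query-key pairs from being matched.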
KSP Keywords
Bidirectional interaction, Text embedding, Video and text, Video interaction, Video retrieval, Video-to-text, interaction approach, state-of-the-art, video encoder, video fusion