ETRI Knowledge Sharing Platform

Conference Paper T2V2T: Text-to-Video-to-Text Fusion for Text-to-Video Retrieval
Cited 0 times in Scopus · Downloaded 176 times
Authors
Jonghee Kim, Youngwan Lee, Jinyoung Moon
Issue Date
2023-06
Citation
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2023, pp.5612-5617
Language
English
Type
Conference Paper
DOI
https://dx.doi.org/10.1109/CVPRW59228.2023.00594
Abstract
Video-language transformers for text-to-video retrieval typically consist of a video encoder, a text encoder, and a joint encoder. The joint encoder can be categorized into 1) self-attention-based fusion and 2) unidirectional fusion based on cross-attention. The former approach performs self-attention on the concatenation of video and text embeddings. Although it allows complete interaction between text and video, the length of the input sequences makes it computationally intensive. Instead, unidirectional fusion employs more efficient cross-attention to fuse video embeddings into text embeddings while ignoring text-to-video interaction. The text-to-video direction is not well explored because of the information imbalance between text and video, which makes it difficult to determine which video patches can be used as queries in cross-attention; i.e., a text embedding corresponds to one or more patch embeddings, while a video patch embedding may not correspond to any text embedding. To address this challenge, we devise a Bypass cross-attention (Bypass CA), which prevents matching between irrelevant video and text embedding pairs in the cross-attention. Using Bypass CA, we propose a novel bidirectional interaction approach, Text-to-Video-to-Text (T2V2T) fusion. The proposed T2V2T uses two unidirectional fusions with opposite directions, i.e., text-to-video fusion followed by video-to-text fusion. As a result, the proposed T2V2T fusion yields state-of-the-art results on MSR-VTT, DiDeMo, and ActivityNet Captions.
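The two chained unidirectional fusions described in the abstract can be sketched in plain numpy. This is a minimal single-head illustration, not the paper's implementation: the shapes, residual connections, and the absence of learned projection weights and of the Bypass CA masking are all simplifying assumptions made here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # queries: (Lq, d); keys_values: (Lkv, d).
    # Each query attends over all key/value embeddings.
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

rng = np.random.default_rng(0)
d = 8                                  # embedding dim (illustrative)
text = rng.normal(size=(5, d))         # 5 text token embeddings
video = rng.normal(size=(12, d))       # 12 video patch embeddings

# Text-to-Video fusion: video patches act as queries over text tokens.
video_fused = video + cross_attention(video, text, d)
# Video-to-Text fusion: text tokens act as queries over the fused patches.
text_fused = text + cross_attention(text, video_fused, d)

print(text_fused.shape)  # (5, 8)
```

In the paper, the first (text-to-video) step is where the information imbalance bites, since some patch queries have no matching text; the proposed Bypass CA, omitted in this sketch, is what keeps such irrelevant query-key pairs from being matched.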
KSP Keywords
Bidirectional interaction, Text embedding, Video and text, Video interaction, Video retrieval, Video-to-text, interaction approach, state-of-the-art, video encoder, video fusion