ETRI Knowledge Sharing Platform

Learning to combine the modalities of language and video for temporal moment localization
Cited 7 times in Scopus
Authors: Jungkyoo Shin, Jinyoung Moon
Issue Date: 2022-03
Citation: Computer Vision and Image Understanding, v.217, pp.1-13
ISSN: 1077-3142
Publisher: Elsevier
Language: English
Type: Journal Article
DOI: https://dx.doi.org/10.1016/j.cviu.2022.103375
Abstract
Temporal moment localization (TML) aims to retrieve the video segment that best matches a moment specified by a query. Existing methods generate visual and semantic embeddings independently and fuse them without fully considering the long-term temporal relationships between them. To address these shortcomings, we introduce a novel recurrent unit, the cross-modal long short-term memory (CM-LSTM), which mimics the human cognitive process of localizing temporal moments: it focuses on the parts of a video segment related to the corresponding parts of a query and recurrently accumulates contextual information across the entire video. In addition, we devise a two-stream attention mechanism that processes both the video features attended by the input query and the unattended ones, preventing necessary visual information from being neglected. To obtain more precise boundaries, we propose a two-stream attentive cross-modal interaction network (TACI) that generates two 2D proposal maps, one obtained globally from the integrated contextual features produced by the CM-LSTM and the other locally from boundary score sequences, and combines them into a final 2D map in an end-to-end manner. On the TML benchmark dataset, ActivityNet-Captions, TACI outperforms state-of-the-art TML methods with R@1 of 45.50% and 27.23% at IoU@0.5 and IoU@0.7, respectively. In addition, we show that state-of-the-art methods achieve performance gains when their original LSTMs are replaced with our CM-LSTM.
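
To make the recurrent unit concrete, here is a minimal PyTorch sketch of what a cross-modal LSTM cell of this kind could look like: at each video time step, the previous hidden state attends over the word-level query features, and the attended query context is fused with the current clip feature inside a standard LSTM cell, so cross-modal context accumulates recurrently across the video. The class and argument names are hypothetical; the paper's actual CM-LSTM formulation may differ.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CMLSTMCell(nn.Module):
        """Hypothetical cross-modal LSTM cell (illustration only)."""

        def __init__(self, video_dim, query_dim, hidden_dim):
            super().__init__()
            self.score = nn.Linear(hidden_dim + query_dim, 1)       # word-attention scorer
            self.cell = nn.LSTMCell(video_dim + query_dim, hidden_dim)

        def forward(self, v_t, query, state):
            # v_t:   (B, video_dim)     feature of the current video clip
            # query: (B, L, query_dim)  word-level features of the query
            # state: (h, c), each of shape (B, hidden_dim)
            h, c = state
            h_rep = h.unsqueeze(1).expand(-1, query.size(1), -1)
            # attend over query words conditioned on the previous hidden state
            alpha = F.softmax(
                self.score(torch.cat([h_rep, query], dim=-1)).squeeze(-1), dim=-1)
            q_ctx = torch.bmm(alpha.unsqueeze(1), query).squeeze(1)  # attended query context
            # fuse the attended query context with the current clip feature
            return self.cell(torch.cat([v_t, q_ctx], dim=-1), (h, c))

    # Usage sketch: run over T clip features, accumulating contextual state.
    B, T, L = 2, 16, 8
    cell = CMLSTMCell(video_dim=512, query_dim=300, hidden_dim=256)
    video = torch.randn(B, T, 512)
    query = torch.randn(B, L, 300)
    h = c = torch.zeros(B, 256)
    for t in range(T):
        h, c = cell(video[:, t], query, (h, c))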
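A "2D proposal map" can be read as a (T, T) score grid in which entry (i, j) scores the candidate moment spanning clips i through j, so only the upper triangle (j >= i) is valid. The abstract states that the global and local maps are combined end-to-end but not how; the elementwise fusion below is only one plausible reading, and the function name is invented for illustration.

    import torch

    def fuse_proposal_maps(global_map, local_map):
        """Combine two (T, T) proposal score maps into a final map.

        Elementwise multiplication is an assumed fusion operator;
        invalid entries (j < i) are masked out.
        """
        T = global_map.size(-1)
        valid = torch.triu(torch.ones(T, T, dtype=torch.bool))  # keep j >= i
        fused = global_map * local_map
        return fused.masked_fill(~valid, float('-inf'))

    # The best moment is the argmax over valid entries, e.g. for a (T, T) map:
    # i, j = divmod(fuse_proposal_maps(g, l).argmax().item(), T)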
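The reported numbers use the standard TML metric: R@1 at an IoU threshold is the fraction of queries whose top-1 predicted segment overlaps the ground-truth moment with a temporal IoU at or above that threshold. A minimal, self-contained sketch:

    def temporal_iou(pred, gt):
        """IoU of two temporal segments given as (start, end) in seconds."""
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = max(pred[1], gt[1]) - min(pred[0], gt[0])
        return inter / union if union > 0 else 0.0

    def recall_at_1(top1_preds, ground_truths, threshold):
        """Fraction of queries whose top-1 segment reaches the IoU threshold."""
        hits = sum(temporal_iou(p, g) >= threshold
                   for p, g in zip(top1_preds, ground_truths))
        return hits / len(ground_truths)

    # e.g. R@1 at IoU@0.5:
    # recall_at_1([(5.0, 12.0)], [(6.0, 13.0)], 0.5)  # -> 1.0 (IoU = 6/8 = 0.75)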
KSP Keywords
2D map, Attention mechanism, Benchmark datasets, Cognitive processes, Contextual features, Contextual information, Cross-modal interaction, End-to-End (E2E), Interaction networks, Semantic embeddings, Temporal relationship
This work is distributed under the terms of the Creative Commons License (CC BY-NC-ND).