ETRI-Knowledge Sharing Platform


Learning to combine the modalities of language and video for temporal moment localization
Cited 6 times in Scopus · Downloaded 88 times
Authors
Jungkyoo Shin, Jinyoung Moon
Issue Date
2022-03
Citation
Computer Vision and Image Understanding, v.217, pp.1-13
ISSN
1077-3142
Publisher
Elsevier
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.1016/j.cviu.2022.103375
Abstract
Temporal moment localization (TML) aims to retrieve the video segment that best matches a moment specified by a query. Existing methods generate the visual and semantic embeddings independently and fuse them without fully considering the long-term temporal relationship between them. To address these shortcomings, we introduce a novel recurrent unit, the cross-modal long short-term memory (CM-LSTM), which mimics the human cognitive process of localizing temporal moments: it focuses on the part of a video segment related to the corresponding part of a query and recurrently accumulates contextual information across the entire video. In addition, we devise a two-stream attention mechanism that processes both the video features attended and unattended by the input query, preventing necessary visual information from being neglected. To obtain more precise boundaries, we propose a two-stream attentive cross-modal interaction network (TACI) that generates two 2D proposal maps, one obtained globally from the integrated contextual features produced by the CM-LSTM and one obtained locally from boundary score sequences, and then combines them into a final 2D map in an end-to-end manner. On the TML benchmark dataset ActivityNet-Captions, TACI outperforms state-of-the-art TML methods with R@1 of 45.50% and 27.23% for IoU@0.5 and IoU@0.7, respectively. In addition, we show that revising state-of-the-art methods by replacing their original LSTMs with our CM-LSTM yields performance gains.
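The abstract gives only a high-level description of CM-LSTM; the paper's exact gating equations are not reproduced here. As a rough illustration of the core idea, a minimal PyTorch-style sketch might condition attention over the query words on the current frame and feed the fused input into an ordinary LSTM cell so the recurrence accumulates cross-modal context over the whole video. All names (CMLSTMCell, run_cm_lstm), dimensions, and the additive-attention scoring below are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CMLSTMCell(nn.Module):
    """Hypothetical sketch of one cross-modal LSTM step: attend over the
    query word features conditioned on the current video frame, then
    update a standard LSTM cell with the fused frame+context input."""
    def __init__(self, video_dim, query_dim, hidden_dim):
        super().__init__()
        # Additive attention score over query words (an assumption,
        # not necessarily the paper's scoring function).
        self.attn = nn.Linear(video_dim + query_dim, 1)
        self.cell = nn.LSTMCell(video_dim + query_dim, hidden_dim)

    def forward(self, frame, query_words, state):
        # frame: (B, video_dim); query_words: (B, L, query_dim)
        B, L, _ = query_words.shape
        expanded = frame.unsqueeze(1).expand(B, L, frame.size(-1))
        scores = self.attn(torch.cat([expanded, query_words], dim=-1)).squeeze(-1)
        weights = F.softmax(scores, dim=-1)                         # (B, L)
        # Frame-conditioned query context vector.
        context = torch.bmm(weights.unsqueeze(1), query_words).squeeze(1)
        return self.cell(torch.cat([frame, context], dim=-1), state)

def run_cm_lstm(cell, frames, query_words, hidden_dim):
    """Unroll the cell over T frames so the hidden state recurrently
    accumulates query-conditioned context across the entire video."""
    B, T, _ = frames.shape
    h = frames.new_zeros(B, hidden_dim)
    c = frames.new_zeros(B, hidden_dim)
    outputs = []
    for t in range(T):
        h, c = cell(frames[:, t], query_words, (h, c))
        outputs.append(h)
    return torch.stack(outputs, dim=1)  # (B, T, hidden_dim)
```

The design point this sketch tries to capture: feeding a frame-conditioned query context into the recurrence itself, rather than fusing a single pooled query vector with video features after the fact, lets the hidden state carry long-term cross-modal context across the whole video, which is the gap in prior methods that the abstract points at.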
KSP Keywords
2D map, Attention mechanism, Benchmark datasets, Cognitive processes, Contextual features, Contextual information, Cross-modal interaction, End-to-end (E2E), Interaction networks, Long short-term memory (LSTM), Semantic embeddings
This work is distributed under the terms of the Creative Commons License (CCL): CC BY-NC-ND.