ETRI Knowledge Sharing Platform


Detailed Information

Journal article: Learning to combine the modalities of language and video for temporal moment localization
Cited 4 times in Scopus; downloaded 64 times.
Authors
Jungkyoo Shin, Jinyoung Moon
Publication Date
March 2022
Source
Computer Vision and Image Understanding, v.217, pp.1-13
ISSN
1077-3142
Publisher
Elsevier
DOI
https://dx.doi.org/10.1016/j.cviu.2022.103375
Project
21HS4800, Development of Core Technologies for Predictive Visual Intelligence Based on Long-Term Visual Memory Networks, Jinyoung Moon
Abstract
Temporal moment localization aims to retrieve the video segment that best matches a moment specified by a natural-language query. Existing methods generate the visual and semantic embeddings independently and fuse them without fully considering the long-term temporal relationship between them. To address these shortcomings, we introduce a novel recurrent unit, the cross-modal long short-term memory (CM-LSTM), which mimics the human cognitive process of localizing temporal moments: it focuses on the part of a video segment related to the relevant part of the query and recurrently accumulates contextual information across the entire video. In addition, we devise a two-stream attention mechanism that processes both the video features attended by the input query and the unattended video features, preventing necessary visual information from being neglected. To obtain more precise boundaries, we propose a two-stream attentive cross-modal interaction network (TACI) that generates two 2D proposal maps, one obtained globally from the integrated contextual features produced by the CM-LSTM and one obtained locally from boundary score sequences, and then combines them into a final 2D map in an end-to-end manner. On the TML benchmark dataset ActivityNet-Captions, TACI outperforms state-of-the-art TML methods, achieving R@1 of 45.50% and 27.23% at IoU@0.5 and IoU@0.7, respectively. In addition, we show that state-of-the-art methods revised by replacing their original LSTMs with our CM-LSTM achieve performance gains.
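The recurrent core described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the class name, dimensions, dot-product attention, and single-gate layout are all assumptions chosen for brevity. The key idea it demonstrates is that at each video time step the query words are re-attended using the previous hidden state, and the attended query vector is fused with the visual feature before the usual LSTM gating, so cross-modal context accumulates across the whole video.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class CMLSTMCell:
    """Hypothetical sketch of a cross-modal LSTM cell (names/shapes assumed)."""

    def __init__(self, vis_dim, qry_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = vis_dim + qry_dim + hid_dim
        # One weight matrix per gate: input, forget, output, candidate.
        self.W = {g: rng.normal(0.0, 0.1, (in_dim, hid_dim)) for g in "ifoc"}
        self.b = {g: np.zeros(hid_dim) for g in "ifoc"}
        # Projection used to score query words against the hidden state.
        self.Wa = rng.normal(0.0, 0.1, (hid_dim, qry_dim))
        self.hid_dim = hid_dim

    def step(self, v_t, Q, h, c):
        # Attend the query words (rows of Q) with the previous hidden state.
        attn = softmax(Q @ (h @ self.Wa))      # (num_words,)
        q_att = attn @ Q                        # attended query vector (qry_dim,)
        # Fuse visual feature, attended query, and previous hidden state.
        x = np.concatenate([v_t, q_att, h])
        i = sigmoid(x @ self.W["i"] + self.b["i"])
        f = sigmoid(x @ self.W["f"] + self.b["f"])
        o = sigmoid(x @ self.W["o"] + self.b["o"])
        g = np.tanh(x @ self.W["c"] + self.b["c"])
        c_new = f * c + i * g                   # accumulate context over time
        h_new = o * np.tanh(c_new)
        return h_new, c_new

    def run(self, V, Q):
        # V: (T, vis_dim) video features; Q: (num_words, qry_dim) query features.
        h = np.zeros(self.hid_dim)
        c = np.zeros(self.hid_dim)
        out = []
        for v_t in V:
            h, c = self.step(v_t, Q, h, c)
            out.append(h)
        return np.stack(out)                    # integrated contextual features
```

In the full TACI model, the sequence of hidden states returned by such a unit would feed the global 2D proposal map, while per-step boundary scores would feed the local one; both of those heads are omitted here.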
KSP Suggested Keywords
2d map, Attention mechanism, Benchmark datasets, Cognitive processes, Contextual features, Contextual information, Cross-modal interaction, End to End(E2E), Interaction networks, Long-short term memory(LSTM), Semantic embeddings
This work is available under the Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND) license.