ETRI Knowledge Sharing Platform


Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
Cited 11 times in Scopus
Authors
Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim
Issue Date
2024-06
Citation
Conference on Computer Vision and Pattern Recognition (CVPR) 2024, pp.13894-13904
Language
English
Type
Conference Paper
DOI
https://dx.doi.org/10.1109/CVPR52733.2024.01318
Abstract
Dense video captioning, which aims to automatically localize and caption all events within an untrimmed video, has attracted significant research attention. Several studies formulate dense video captioning as a multi-task problem of event localization and event captioning in order to exploit inter-task relations. However, addressing both tasks using only visual input is challenging due to the lack of semantic content. In this study, we address this challenge by proposing a novel framework inspired by human cognitive information processing. Our model utilizes an external memory to incorporate prior knowledge, and a memory retrieval method based on cross-modal video-to-text matching is proposed. To effectively incorporate the retrieved text features, a versatile encoder and a decoder with visual and textual cross-attention modules are designed. Comparative experiments on the ActivityNet Captions and YouCook2 datasets show the effectiveness of the proposed method. Experimental results demonstrate promising performance of our model without extensive pretraining on a large video dataset.
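The abstract's core idea, retrieving text features from an external memory by cross-modal video-to-text matching, can be sketched minimally as a nearest-neighbor lookup in a shared embedding space. The sketch below assumes pooled video and text embeddings and cosine similarity; the function name and the use of NumPy are illustrative, not the paper's actual implementation.

```python
import numpy as np

def retrieve_text_features(video_feat, memory_bank, top_k=2):
    """Retrieve the top-k text features from an external memory bank
    by cosine similarity with a pooled video feature.

    video_feat  : (d,) pooled visual embedding of the input video
    memory_bank : (n, d) matrix of precomputed text embeddings
    Returns (indices, features) of the top-k matches.
    """
    # L2-normalize both sides so the dot product equals cosine similarity
    v = video_feat / np.linalg.norm(video_feat)
    m = memory_bank / np.linalg.norm(memory_bank, axis=1, keepdims=True)
    sims = m @ v                        # one similarity score per memory entry
    idx = np.argsort(-sims)[:top_k]     # indices of the best matches first
    return idx, memory_bank[idx]

# Toy example: a memory of 4 text embeddings in a 3-d space.
rng = np.random.default_rng(0)
memory = rng.normal(size=(4, 3))
query = memory[2] + 0.01 * rng.normal(size=3)  # video feature near entry 2
idx, feats = retrieve_text_features(query, memory, top_k=2)
```

In the paper's full pipeline, the retrieved features would then be fed to the decoder's textual cross-attention branch alongside the visual features, rather than returned directly as here.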
KSP Keywords
Memory retrieval, Semantic content, Video Captioning, Video dataset, Video-to-text, Visual Input, cognitive information processing, cross-modal, event localization, external memory, prior knowledge