ETRI-Knowledge Sharing Platform





Journal article: Multi-Perspective Attention Network for Fast Temporal Moment Localization
신정규, 문진영
IEEE Access, v.9, pp.116962-116972
21HS4800, Development of Core Technology for Predictive Visual Intelligence Based on a Long-Term Visual Memory Network, 문진영
Temporal moment localization (TML) aims to retrieve the temporal interval of the moment in a video that is semantically relevant to a sentence query. This is challenging because it requires understanding a video, a sentence, and the relationship between them. Existing TML methods have shown impressive performance by modeling interactions between videos and sentences using fine-grained techniques. However, these fine-grained techniques incur high computational overhead, which makes them impractical. This work proposes an effective and efficient multi-perspective attention network for temporal moment localization. Inspired by the way humans understand an image from multiple perspectives and in different contexts, we devise a novel multi-perspective attention mechanism consisting of perspective attention and multi-perspective modal interactions. Specifically, a perspective attention layer based on multi-head attention takes two memory sequences as inputs, one as the base memory and the other as the reference memory. Perspective attention assesses the two memories, models the relationship between them, and encourages the base memory to focus on features related to the reference memory, providing an understanding of the base memory from the perspective of the reference memory. Furthermore, multi-perspective modal interactions model the complex relationship between a video and a sentence query and produce the modal-interacted memory, which consists of visual features that have selectively learned query-related information. Like heavyweight fine-grained TML methods, the proposed network captures this complex relationship accurately, while remaining as lightweight as coarse-grained TML methods. We also adopt a fast action recognition (AR) network to extract visual features efficiently, which reduces the computational overhead. Through experiments on three benchmark datasets, we demonstrate the effectiveness and efficiency of the proposed network.
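The perspective attention described in the abstract can be illustrated with a rough sketch. The code below is not the authors' implementation. It assumes that the base memory supplies the attention queries, that the reference memory supplies the keys and values, and that the attended result is added back to the base memory through a residual connection. Learned projection matrices, normalization, and training are omitted, and the function names `perspective_attention` and `split_heads` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(memory, num_heads):
    """Split every feature vector into num_heads chunks: [head][position][head_dim]."""
    head_dim = len(memory[0]) // num_heads
    return [
        [vec[h * head_dim:(h + 1) * head_dim] for vec in memory]
        for h in range(num_heads)
    ]


def perspective_attention(base, reference, num_heads=2):
    """Hypothetical sketch: view `base` from the perspective of `reference`.

    Assumptions (not taken from the paper): base rows act as queries,
    reference rows act as keys and values, there are no learned
    projections, and the output is base plus the attended reference.
    """
    dim = len(base[0])
    assert dim % num_heads == 0, "feature size must be divisible by num_heads"
    assert len(reference[0]) == dim, "both memories must share a feature size"
    head_dim = dim // num_heads
    base_heads = split_heads(base, num_heads)
    ref_heads = split_heads(reference, num_heads)

    output = []
    for i in range(len(base)):
        attended = []
        for h in range(num_heads):
            query = base_heads[h][i]
            # Scaled dot-product score against every reference position.
            scores = [
                sum(q * k for q, k in zip(query, key)) / math.sqrt(head_dim)
                for key in ref_heads[h]
            ]
            weights = softmax(scores)
            # Weighted sum of the reference values for this head.
            for j in range(head_dim):
                attended.append(
                    sum(w * val[j] for w, val in zip(weights, ref_heads[h]))
                )
        # Residual connection keeps the original base features.
        output.append([b + a for b, a in zip(base[i], attended)])
    return output


# Toy memories: three video clips and two query words, four features each.
video = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
query = [[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
video_seen_from_query = perspective_attention(video, query)
```

Under the same assumptions, swapping the arguments (`perspective_attention(query, video)`) gives the query as seen from the video. Combining both directions is one plausible reading of the multi-perspective modal interactions, although the abstract does not specify the exact wiring.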
Cross-modal interaction, fast temporal moment localization, Feature extraction, Location awareness, Semantics, Spatiotemporal phenomena, Task analysis, temporal moment localization, temporal sentence grounding, Three-dimensional displays, Visualization
KSP Suggested Keywords
Attention mechanism, Benchmark datasets, Cross-modal interaction, Effectiveness and efficiency, Feature extraction, Location awareness, Multi-head, Multi-perspective, Multiple perspectives, Three dimensional(3D), Three-dimensional displays
This work may be used under the terms of the Creative Commons Attribution (CC BY) license.