ETRI Knowledge Sharing Platform : Multi-Perspective Attention Network for Fast Temporal Moment Localization

Journal Article Multi-Perspective Attention Network for Fast Temporal Moment Localization

Authors: Jungkyoo Shin, Jinyoung Moon

Issue Date: 2021-08

Citation: IEEE Access, v.9, pp.116962-116972

ISSN: 2169-3536

Publisher: IEEE

Language: English

Type: Journal Article

DOI: https://dx.doi.org/10.1109/ACCESS.2021.3106698

Abstract: Temporal moment localization (TML) aims to retrieve the temporal interval of a moment that is semantically relevant to a sentence query. This is challenging because it requires understanding a video, a sentence, and the relationship between them. Existing TML methods have shown impressive performance by modeling interactions between videos and sentences using fine-grained techniques. However, these fine-grained techniques incur high computational overhead, making them impractical. This work proposes an effective and efficient multi-perspective attention network for temporal moment localization. Inspired by the way humans understand an image from multiple perspectives and in different contexts, we devise a novel multi-perspective attention mechanism consisting of perspective attention and multi-perspective modal interactions. Specifically, a perspective attention layer based on multi-head attention takes two memory sequences as inputs, one as the base memory and the other as the reference memory. Perspective attention assesses the two memories, models their relationship, and encourages the base memory to focus on features related to the reference memory, providing an understanding of the base memory from the perspective of the reference memory. Furthermore, multi-perspective modal interactions model the complex relationship between a video and a sentence query and produce the modal-interacted memory, which consists of visual features that have selectively learned query-related information. Like heavyweight fine-grained TML methods, the proposed network captures this complex relationship accurately, while remaining as lightweight as coarse-grained TML methods. We also adopt a fast AR network to efficiently extract visual features, which reduces the computational overhead. Through experiments on three benchmark datasets, we demonstrate the effectiveness and efficiency of the proposed network.
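
The perspective attention layer described in the abstract can be pictured as a form of cross-attention between two memories. The following is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes that the base memory supplies the queries and the reference memory supplies the keys and values, and it omits the learned projection matrices that a real multi-head attention layer would have. The function name `perspective_attention`, the parameter `num_heads`, and the toy vectors are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def perspective_attention(base, reference, num_heads=2):
    """Illustrative multi-head cross-attention (assumption, not the paper's code).

    base:      list of T_b feature vectors (e.g. video clip features)
    reference: list of T_r feature vectors (e.g. query word features)
    Both must share the same feature dimension d, divisible by num_heads.
    Returns T_b vectors: for each base position, a per-head weighted mix of
    the reference features, weighted by similarity to that base position.
    Learned query/key/value projections are omitted (identity) for brevity.
    """
    d = len(base[0])
    if d % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = d // num_heads
    scale = 1.0 / math.sqrt(head_dim)

    output = []
    for q in base:
        attended = []
        for h in range(num_heads):
            lo, hi = h * head_dim, (h + 1) * head_dim
            # Scaled dot-product scores between this base slice and every
            # reference slice belonging to the same head.
            scores = [
                scale * sum(a * b for a, b in zip(q[lo:hi], r[lo:hi]))
                for r in reference
            ]
            weights = softmax(scores)
            # Weighted sum of the reference features for this head.
            for j in range(lo, hi):
                attended.append(sum(w * r[j] for w, r in zip(weights, reference)))
        output.append(attended)
    return output


# Toy usage: two "video" positions viewed from the perspective of three "words".
video = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
query = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
video_from_query = perspective_attention(video, query, num_heads=2)
```

Swapping the two arguments would give the query as seen from the perspective of the video; combining several such directions is one plausible reading of "multi-perspective modal interactions", but the exact arrangement is not specified in the abstract.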

KSP Keywords: Attention mechanism, Benchmark datasets, Effectiveness and efficiency, Fine-grained (FG), Multi-head, Multi-perspective, Multiple perspectives, Visual features, Coarse-grained, Modal interaction

This work is distributed under the terms of a Creative Commons License (CCL): CC BY.
