ETRI Knowledge Sharing Platform

LLM-powered scene graph representation learning for image retrieval via visual triplet-based graph transformation
Authors
Soohwan Jeong, Jongmin Park, Mingyu Choi, Yongjin Kwon, Sungsu Lim
Issue Date
2025-08
Citation
Expert Systems with Applications, v.286, pp.1-13
ISSN
0957-4174
Publisher
Elsevier Ltd.
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.1016/j.eswa.2025.127926
Abstract
A scene graph represents the relational information between objects within an image, conveying its inherent semantic content. Current image retrieval methods, which use images as queries to find similar ones, typically rely on visual content or basic structural similarities in scene graphs. However, these methods use only basic, surface-level information, overlooking the high-level semantic information embedded in the scene graph. In this study, we leverage visual triplet units, consisting of subject-relation-object pairs in the scene graph, to capture high-level semantics more effectively. To enhance the triplets, we incorporate extensive knowledge from large language models (LLMs). We propose Visual Triplet-based Graph Transformation (VTGT), a framework that transforms the scene graph into a visual triplet-based graph in which the triplets serve as the nodes. This transformed graph is then processed by a graph neural network (GNN) to learn an optimal scene graph representation. Experimental results on image retrieval demonstrate the superior performance of our approach, driven by the LLM-powered visual triplet-based graph representation.
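
The abstract outlines a pipeline: extract subject-relation-object triplets from the scene graph, enrich them with LLM knowledge, rebuild the graph with triplets as nodes, and pool a GNN over the result to obtain a retrieval embedding. The sketch below illustrates one plausible reading of that pipeline in plain Python/NumPy. It is not the paper's actual design: the edge rule (linking triplets that share an entity), the hashed token vectors standing in for LLM-enriched triplet features, and the parameter-free mean-aggregation message passing are all assumptions made for illustration.

```python
# Minimal sketch of the VTGT idea as read from the abstract. Everything
# beyond "triplets become nodes" is an assumption for illustration.
import zlib
from itertools import combinations

import numpy as np


def token_vec(token, dim=16):
    """Deterministic stand-in for an LLM-derived token embedding."""
    rng = np.random.default_rng(zlib.crc32(token.encode()))
    return rng.normal(size=dim)


def triplet_graph(triplets):
    """Transformation step: triplets are the nodes; the assumed edge
    rule links two triplets that share a subject or object entity."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(triplets), 2)
        if {a[0], a[2]} & {b[0], b[2]}
    ]


def scene_embedding(triplets, rounds=2):
    """Mean-aggregation message passing over the triplet graph,
    then a mean-pool readout to one vector per scene graph."""
    feats = np.stack(
        [(token_vec(s) + token_vec(r) + token_vec(o)) / 3 for s, r, o in triplets]
    )
    nbrs = [[] for _ in triplets]
    for i, j in triplet_graph(triplets):
        nbrs[i].append(j)
        nbrs[j].append(i)
    h = feats
    for _ in range(rounds):
        h = np.stack(
            [
                (h[i] + sum((h[j] for j in nbrs[i]), np.zeros_like(h[i])))
                / (1 + len(nbrs[i]))
                for i in range(len(triplets))
            ]
        )
    return h.mean(axis=0)


# Toy retrieval: rank database scene graphs by cosine similarity to a query.
scenes = {
    "img1": [("man", "riding", "horse"), ("horse", "on", "beach")],
    "img2": [("woman", "riding", "bike"), ("bike", "on", "road")],
    "img3": [("man", "riding", "bike"), ("dog", "beside", "bike")],
}
embs = {k: scene_embedding(v) for k, v in scenes.items()}
q = embs["img2"]
rank = sorted(
    embs,
    key=lambda k: q @ embs[k] / (np.linalg.norm(q) * np.linalg.norm(embs[k])),
    reverse=True,
)
print(rank)  # "img2" ranks first (self-match); the rest order by shared entities
```

In the paper the triplet features come from LLM knowledge and the GNN weights are learned end to end; this sketch only mirrors the graph-transformation and readout structure.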
KSP Keywords
Current image, Graph Transformation, Graph representation, Image retrieval, Language Model, Representation learning, Scene graph, Semantic content, Structural Similarity Index Measure (SSIM), high-level semantics, neural network (NN)