ETRI Knowledge Sharing Platform

Bridging the Lexical Gap: Generative Text-to-Image Retrieval for Parts-of-Speech Imbalance in Vision-Language Models
Authors
Hyesu Hwang, Daeun Kim, Jaehui Park, Yongjin Kwon
Issue Date
2024-10
Citation
International Conference on Multimedia (MM) 2024, pp.26-34
Language
English
Type
Conference Paper
DOI
https://dx.doi.org/10.1145/3689091.3690089
Abstract
Retrieving relevant images from text is challenging because aligning vision and language representations is non-trivial. Large-scale vision-language models such as CLIP are widely used in recent studies to leverage pre-trained knowledge of this alignment. However, our observations reveal a 60.8% performance decrease for verb, adjective, and adverb queries compared with noun queries. In preliminary studies, we found that popular vision-language models align images and text insufficiently for certain parts of speech, and that nouns strongly influence their text-to-image retrieval results. Based on these observations, this paper proposes a method that rewrites queries into noun-based queries. First, a large language model extracts nouns relevant to the initial query and generates a hypothetical query that best matches the parts-of-speech alignment in the vision-language model. We then verify whether the hypothetical query preserves the original intent and rewrite it iteratively. Our experiments show that the method significantly enhances text-to-image retrieval performance and sheds light on how vision-language models handle lexical knowledge.
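The pipeline the abstract describes (extract relevant nouns with an LLM, generate a noun-based hypothetical query, verify that it preserves the original intent, then retrieve with a vision-language model) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `llm.extract_nouns`, `llm.hypothetical_query`, and `llm.preserves_intent` helpers are hypothetical placeholders for prompted LLM calls, and the image scoring uses an off-the-shelf CLIP checkpoint from Hugging Face transformers.

```python
# Minimal sketch (not the authors' code) of the noun-based query rewriting loop.
# Assumptions: `llm` exposes three prompted helpers with hypothetical names
# (extract_nouns, hypothetical_query, preserves_intent); retrieval scoring uses
# the public openai/clip-vit-base-patch32 checkpoint.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_scores(query, images):
    """Similarity logits between one text query and a list of PIL images."""
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_text.squeeze(0)  # shape: (num_images,)


def rewrite_and_retrieve(query, images, llm, max_rounds=3):
    """Rewrite `query` into a noun-based form, check intent, then rank images."""
    rewritten = query
    for _ in range(max_rounds):
        nouns = llm.extract_nouns(rewritten)                   # nouns relevant to the query
        candidate = llm.hypothetical_query(rewritten, nouns)   # noun-based rewrite
        if llm.preserves_intent(query, candidate):             # accept only intent-preserving rewrites
            rewritten = candidate
            break
        # otherwise try another rewrite in the next round
    scores = clip_scores(rewritten, images)
    return scores.argsort(descending=True)  # image indices ranked by similarity
```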
KSP Keywords
Hypothetical query, Image retrieval, Language Model, Part of Speech (POS), Retrieval performance, large-scale
This work is distributed under the terms of the Creative Commons License (CC BY).