ETRI Knowledge Sharing Platform
Learning Compositional Language-based Object Detection with Diffusion-based Synthetic Data
Authors
Kwanyong Park, Kuniaki Saito, Donghyun Kim
Issue Date
2024-06
Citation
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2024, pp.1-6
Language
English
Type
Conference Paper
Abstract
Vision-language (VL) models often exhibit a limited understanding of complex expressions of visual objects (e.g., attributes, shapes, and their relations) when given complex and diverse language queries. While conventional methods try to enhance VL models through the use of hard negative synthetic text, their effectiveness remains restricted. In this paper, we introduce a structured synthetic data generation approach to improve the compositional understanding of VL models for language-based object detection. Specifically, our framework generates densely paired positive and negative triplets (image, text descriptions, bounding boxes) in both the image and text domains. In addition, to train VL models effectively, we propose a new compositional contrastive learning formulation that discovers semantics and structures in complex descriptions from the synthetic triplets. As a result, VL models trained with our synthetic data generation exhibit a significant performance boost on the OmniLabel benchmark by up to +5 AP and on the D³ benchmark by +6.9 AP over existing baselines.
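To make the contrastive idea in the abstract concrete, below is a minimal sketch (in PyTorch, which the record does not specify) of a contrastive objective over synthetic triplets: each region embedding is pulled toward its matching description and pushed away from densely generated hard negative descriptions. The function name, tensor shapes, and temperature value are assumptions for illustration; this is not the authors' exact compositional formulation.

import torch
import torch.nn.functional as F


def triplet_contrastive_loss(region_emb, pos_text_emb, neg_text_emb, temperature=0.07):
    """Hypothetical sketch: pull each region embedding toward its positive
    description and away from K hard negative descriptions.

    region_emb:   (B, D)    embeddings of detected boxes
    pos_text_emb: (B, D)    embeddings of the matching descriptions
    neg_text_emb: (B, K, D) embeddings of K hard negatives per region
    """
    # Cosine similarity via L2-normalized embeddings.
    region_emb = F.normalize(region_emb, dim=-1)
    pos_text_emb = F.normalize(pos_text_emb, dim=-1)
    neg_text_emb = F.normalize(neg_text_emb, dim=-1)

    pos_sim = (region_emb * pos_text_emb).sum(-1, keepdim=True)     # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", region_emb, neg_text_emb)  # (B, K)

    # InfoNCE-style loss: the positive description sits at index 0.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature     # (B, 1+K)
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)

The paper's formulation additionally exploits the semantics and structure of complex descriptions, which this plain InfoNCE-style sketch does not model.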
KSP Keywords
Bounding box, Conventional methods, Object detection, Positive and negative, Synthetic data generation