SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval

Abstract

Composed Image Retrieval (CIR) aims to retrieve target images that preserve the visual content of a reference image while incorporating user-specified textual modifications. Training-free zero-shot CIR (ZS-CIR) approaches, which require no task-specific training or labeled data, are highly desirable, yet accurately capturing user intent remains challenging. In this paper, we present SQUARE, a novel two-stage training-free framework that leverages Multimodal Large Language Models (MLLMs) to enhance ZS-CIR. In the Semantic Query-Augmented Fusion (SQAF) stage, we enrich the query embedding derived from a vision-language model (VLM) such as CLIP with MLLM-generated captions of the target image. These captions provide high-level semantic guidance, enabling the query to better capture the user's intent and improve global retrieval quality. In the Efficient Batch Reranking (EBR) stage, top-ranked candidates are presented to the MLLM as an image grid with visual marks, and the MLLM performs joint visual-semantic reasoning across all candidates. Our reranking strategy operates in a single pass and yields more accurate rankings. Experiments show that SQUARE, with its simplicity and effectiveness, delivers strong performance on four standard CIR benchmarks. Notably, it maintains high performance even with lightweight pre-trained models, demonstrating its potential applicability.
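
To make the two stages concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the fusion formula, so the weighted average controlled by `alpha` is an assumption, as are all function names. Embeddings are plain lists of floats standing in for VLM outputs, and `ask_mllm` is a hypothetical callable standing in for a single MLLM call over the marked image grid.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> list[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def fuse_query(query_emb: Vector, caption_embs: Sequence[Vector],
               alpha: float = 0.5) -> list[float]:
    """SQAF-style fusion (assumed form): blend the query embedding with the
    mean embedding of the MLLM-generated target captions."""
    q = l2_normalize(query_emb)
    if not caption_embs:
        return q  # nothing to fuse, fall back to the plain query
    caps = [l2_normalize(c) for c in caption_embs]
    mean_cap = [sum(col) / len(caps) for col in zip(*caps)]
    fused = [(1.0 - alpha) * qi + alpha * ci for qi, ci in zip(q, mean_cap)]
    return l2_normalize(fused)


def retrieve(query: Vector, gallery: dict[str, Vector], top_k: int) -> list[str]:
    """Global retrieval: rank gallery items by cosine similarity to the query."""
    q = l2_normalize(query)
    scores = {
        name: sum(a * b for a, b in zip(q, l2_normalize(emb)))
        for name, emb in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def batch_rerank(candidates: list[str],
                 ask_mllm: Callable[[list[str]], list[int]]) -> list[str]:
    """EBR-style reranking (assumed interface): one call that sees all
    candidates at once. Mark i in the grid corresponds to candidates[i];
    ask_mllm returns marks ordered from best to worst."""
    order = ask_mllm(candidates)
    seen: set[int] = set()
    ranked: list[str] = []
    for mark in order:
        # Ignore marks that are out of range or repeated in the reply.
        if 0 <= mark < len(candidates) and mark not in seen:
            seen.add(mark)
            ranked.append(candidates[mark])
    # Candidates the model did not mention keep their retrieval order.
    ranked += [c for i, c in enumerate(candidates) if i not in seen]
    return ranked
```

In this sketch the reranker tolerates incomplete or malformed model replies by appending unmentioned candidates in their original retrieval order, a design choice made here for robustness rather than one stated in the abstract.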

