

SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval

September 30, 2025
作者: Ren-Di Wu, Yu-Yen Lin, Huei-Fang Yang
cs.AI

Abstract

Composed Image Retrieval (CIR) aims to retrieve target images that preserve the visual content of a reference image while incorporating user-specified textual modifications. Training-free zero-shot CIR (ZS-CIR) approaches, which require no task-specific training or labeled data, are highly desirable, yet accurately capturing user intent remains challenging. In this paper, we present SQUARE, a novel two-stage training-free framework that leverages Multimodal Large Language Models (MLLMs) to enhance ZS-CIR. In the Semantic Query-Augmented Fusion (SQAF) stage, we enrich the query embedding derived from a vision-language model (VLM) such as CLIP with MLLM-generated captions of the target image. These captions provide high-level semantic guidance, enabling the query to better capture the user's intent and improve global retrieval quality. In the Efficient Batch Reranking (EBR) stage, top-ranked candidates are presented as an image grid with visual marks to the MLLM, which performs joint visual-semantic reasoning across all candidates. Our reranking strategy operates in a single pass and yields more accurate rankings. Experiments show that SQUARE, with its simplicity and effectiveness, delivers strong performance on four standard CIR benchmarks. Notably, it maintains high performance even with lightweight pre-training, demonstrating its potential for broad applicability.
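The two-stage pipeline described in the abstract can be sketched in code. This is a minimal illustration, not the paper's implementation: the weighted-average fusion, the weight `alpha`, the function names, and the reranking prompt wording are all assumptions; the actual SQAF fusion and EBR prompting may differ.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (embeddings are compared by cosine similarity)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def fuse_query(composed_emb, caption_emb, alpha=0.5):
    """SQAF-style fusion sketch: blend the VLM composed-query embedding with
    the embedding of an MLLM-generated target caption, then re-normalize.
    The weighted average and `alpha` are illustrative assumptions."""
    a, b = l2_normalize(composed_emb), l2_normalize(caption_emb)
    return l2_normalize([alpha * x + (1 - alpha) * y for x, y in zip(a, b)])

def rank_gallery(query_emb, gallery_embs, k=5):
    """Global retrieval: return indices of the top-k gallery embeddings
    by cosine similarity to the fused query."""
    q = l2_normalize(query_emb)
    sims = [(sum(x * y for x, y in zip(q, l2_normalize(g))), i)
            for i, g in enumerate(gallery_embs)]
    return [i for _, i in sorted(sims, reverse=True)[:k]]

def build_rerank_prompt(modification_text, k):
    """EBR-stage sketch: the top-k candidates would be tiled into one marked
    image grid and the MLLM asked to rank them in a single pass. The prompt
    wording here is a hypothetical placeholder."""
    return (
        f"The image grid shows {k} numbered candidates. Given the requested "
        f"modification '{modification_text}', rank the candidate numbers "
        "from best to worst match with the reference image."
    )
```

A usage pass would fuse the query, take the top-k candidates from `rank_gallery`, render them as a single annotated grid image, and send that grid plus `build_rerank_prompt(...)` to the MLLM for one-shot reranking.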
October 3, 2025