CoLLM:面向组合圖像檢索的大型語言模型
CoLLM: A Large Language Model for Composed Image Retrieval
March 25, 2025
Authors: Chuong Huynh, Jinyu Yang, Ashish Tawari, Mubarak Shah, Son Tran, Raffay Hamid, Trishul Chilimbi, Abhinav Shrivastava
cs.AI
Abstract
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images
based on a multimodal query. Typical training data consists of triplets
containing a reference image, a textual description of desired modifications,
and the target image, which are expensive and time-consuming to acquire. The
scarcity of CIR datasets has led to zero-shot approaches utilizing synthetic
triplets or leveraging vision-language models (VLMs) with ubiquitous
web-crawled image-caption pairs. However, these methods have significant
limitations: synthetic triplets suffer from limited scale, lack of diversity,
and unnatural modification text, while image-caption pairs hinder joint
embedding learning of the multimodal query due to the absence of triplet data.
Moreover, existing approaches struggle with complex and nuanced modification
texts that demand sophisticated fusion and understanding of vision and language
modalities. We present CoLLM, a one-stop framework that effectively addresses
these limitations. Our approach generates triplets on-the-fly from
image-caption pairs, enabling supervised training without manual annotation. We
leverage Large Language Models (LLMs) to generate joint embeddings of reference
images and modification texts, facilitating deeper multimodal fusion.
Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset
comprising 3.4M samples, and refine existing CIR benchmarks (CIRR and
Fashion-IQ) to enhance evaluation reliability. Experimental results demonstrate
that CoLLM achieves state-of-the-art performance across multiple CIR benchmarks
and settings. MTCIR yields competitive results, with up to 15% performance
improvement. Our refined benchmarks provide more reliable evaluation metrics
for CIR models, contributing to the advancement of this important field.
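
To make the training recipe in the abstract concrete: a composed query is formed by fusing a reference-image embedding with the modification text, and that query is trained contrastively to retrieve the target image, with triplets assembled on the fly from image-caption pairs. The sketch below is a minimal illustration of that idea only, not the authors' CoLLM implementation; every module name, dimension, and the toy "augmented view as target" shortcut are assumptions made for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyComposedRetriever(nn.Module):
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        # Toy encoders standing in for the paper's vision tower and LLM text tower.
        self.img_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, dim))
        self.txt_encoder = nn.EmbeddingBag(vocab, dim)
        self.fusion = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def embed_query(self, ref_img, mod_tokens):
        # Fuse the reference image and the modification text into one query embedding.
        img = self.img_encoder(ref_img).unsqueeze(1)      # (B, 1, dim)
        txt = self.txt_encoder(mod_tokens).unsqueeze(1)   # (B, 1, dim)
        fused = self.fusion(torch.cat([img, txt], dim=1)) # (B, 2, dim)
        return F.normalize(self.proj(fused.mean(dim=1)), dim=-1)

    def embed_target(self, tgt_img):
        return F.normalize(self.img_encoder(tgt_img), dim=-1)

def contrastive_loss(query, target, temperature=0.07):
    # In-batch InfoNCE: each composed query should retrieve its own target image.
    logits = query @ target.t() / temperature
    labels = torch.arange(query.size(0))
    return F.cross_entropy(logits, labels)

# Triplet built on the fly from an image-caption pair: the caption tokens stand in
# for the modification text, and a lightly perturbed view of the image is the target.
model = ToyComposedRetriever()
ref = torch.randn(8, 3, 64, 64)
tgt = ref + 0.1 * torch.randn_like(ref)
mod = torch.randint(0, 1000, (8, 12))
loss = contrastive_loss(model.embed_query(ref, mod), model.embed_target(tgt))
loss.backward()

The point of the sketch is the data flow rather than the architecture: one forward pass produces a single fused query vector per (reference image, modification text) pair, and in-batch negatives supply the contrastive signal without any manually annotated triplets.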