CoLLM: A Large Language Model for Composed Image Retrieval
March 25, 2025
Authors: Chuong Huynh, Jinyu Yang, Ashish Tawari, Mubarak Shah, Son Tran, Raffay Hamid, Trishul Chilimbi, Abhinav Shrivastava
cs.AI
Abstract
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images
based on a multimodal query. Typical training data consists of triplets
containing a reference image, a textual description of desired modifications,
and the target image, which are expensive and time-consuming to acquire. The
scarcity of CIR datasets has led to zero-shot approaches utilizing synthetic
triplets or leveraging vision-language models (VLMs) with ubiquitous
web-crawled image-caption pairs. However, these methods have significant
limitations: synthetic triplets suffer from limited scale, lack of diversity,
and unnatural modification text, while image-caption pairs hinder joint
embedding learning of the multimodal query due to the absence of triplet data.
Moreover, existing approaches struggle with complex and nuanced modification
texts that demand sophisticated fusion and understanding of vision and language
modalities. We present CoLLM, a one-stop framework that effectively addresses
these limitations. Our approach generates triplets on-the-fly from
image-caption pairs, enabling supervised training without manual annotation. We
leverage Large Language Models (LLMs) to generate joint embeddings of reference
images and modification texts, facilitating deeper multimodal fusion.
Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset
comprising 3.4M samples, and refine existing CIR benchmarks (CIRR and
Fashion-IQ) to enhance evaluation reliability. Experimental results demonstrate
that CoLLM achieves state-of-the-art performance across multiple CIR benchmarks
and settings. MTCIR yields competitive results, with up to 15% performance
improvement. Our refined benchmarks provide more reliable evaluation metrics
for CIR models, contributing to the advancement of this important field.
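The abstract describes fusing a reference image and a modification text into a single joint query embedding that is matched against target images. The snippet below is a minimal sketch of that composed-retrieval setup: a small placeholder fusion module stands in for the LLM-based fusion, and random tensors stand in for real image/text encoder outputs. All module names, dimensions, and the MLP fusion are illustrative assumptions, not the CoLLM architecture itself.

```python
# Minimal sketch of composed image retrieval with a fused multimodal query.
# The fusion module and dimensions are hypothetical placeholders; CoLLM's
# actual LLM-based fusion and training recipe are not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedQueryEncoder(nn.Module):
    """Stand-in for fusing a reference-image embedding and a
    modification-text embedding into one query embedding."""
    def __init__(self, img_dim=512, txt_dim=512, out_dim=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, out_dim),
        )

    def forward(self, img_emb, txt_emb):
        q = self.fuse(torch.cat([img_emb, txt_emb], dim=-1))
        return F.normalize(q, dim=-1)  # unit-norm for cosine retrieval

def retrieve(query_emb, gallery_embs, k=5):
    """Rank candidate target images by cosine similarity to the fused query."""
    gallery = F.normalize(gallery_embs, dim=-1)
    scores = query_emb @ gallery.T  # (batch, num_candidates) similarities
    return scores.topk(k, dim=-1).indices

# Toy usage with random embeddings standing in for real encoders.
encoder = FusedQueryEncoder()
ref_img = torch.randn(2, 512)    # reference-image embeddings
mod_txt = torch.randn(2, 512)    # modification-text embeddings
gallery = torch.randn(100, 512)  # candidate target-image embeddings
top_ids = retrieve(encoder(ref_img, mod_txt), gallery, k=5)
print(top_ids.shape)             # torch.Size([2, 5])
```

In the paper's setting, the query and gallery embeddings would come from trained vision-language encoders and the fusion would be performed by the LLM; the cosine-similarity ranking shown here is the standard retrieval step shared by most CIR systems.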