Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing
January 22, 2026
Authors: Tingyu Song, Yanzhao Zhang, Mingxin Li, Zhuoning Guo, Dingkun Long, Pengjun Xie, Siyue Zhang, Yilun Zhao, Shu Wu
cs.AI
Abstract
Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature a limited set of query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline that synthesizes queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark comprising 5,000 high-quality queries organized into five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap: even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, underscoring the rigor of our benchmark. Through comparative analysis, we further uncover inherent limitations of existing benchmarks, such as modality bias and insufficient category coverage. Finally, an in-domain training experiment demonstrates the feasibility of our benchmark and clarifies the nature of the task's difficulty by distinguishing categories that are solvable with targeted data from those that expose intrinsic limitations of current model architectures.
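To make the evaluation setting concrete, below is a minimal sketch, not the authors' released code, of how a CIR query (a reference image plus a modification text) is typically scored against a candidate gallery with a multimodal embedding model, and how Recall@K is computed. The `embed_query` and `embed_image` functions are hypothetical placeholders standing in for a real multimodal encoder such as the models evaluated above; here they return random unit vectors so the example runs end to end.

```python
# Minimal sketch of CIR scoring with a multimodal embedding model.
# Assumption: queries are (reference_image_id, modification_text, target_image_id)
# triples and the gallery is a list of candidate image ids.

import numpy as np


def embed_query(reference_image: str, modification_text: str, dim: int = 512) -> np.ndarray:
    """Hypothetical fused embedding of (reference image, edit instruction)."""
    seed = abs(hash((reference_image, modification_text))) % 2**32
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)


def embed_image(image_id: str, dim: int = 512) -> np.ndarray:
    """Hypothetical embedding of a candidate gallery image."""
    seed = abs(hash(image_id)) % 2**32
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)


def recall_at_k(queries, gallery, k: int = 10) -> float:
    """Fraction of queries whose target image appears in the top-k retrieved items."""
    gallery_ids = list(gallery)
    gallery_mat = np.stack([embed_image(g) for g in gallery_ids])  # (N, dim)
    hits = 0
    for reference_image, modification_text, target_id in queries:
        q = embed_query(reference_image, modification_text)
        scores = gallery_mat @ q                      # cosine similarity (unit vectors)
        top_k = [gallery_ids[i] for i in np.argsort(-scores)[:k]]
        hits += int(target_id in top_k)
    return hits / len(queries)


if __name__ == "__main__":
    gallery = [f"img_{i:04d}" for i in range(1000)]
    queries = [("img_0001", "change the dog's color to black", "img_0042")]
    print(f"Recall@10: {recall_at_k(queries, gallery, k=10):.3f}")
```

With random placeholder embeddings the score is near chance; swapping in a trained multimodal encoder for the two `embed_*` functions turns this loop into the standard retrieval evaluation used on CIR benchmarks.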