Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing

January 22, 2026
Authors: Tingyu Song, Yanzhao Zhang, Mingxin Li, Zhuoning Guo, Dingkun Long, Pengjun Xie, Siyue Zhang, Yilun Zhao, Shu Wu
cs.AI

Abstract

Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline for synthesizing queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark. EDIR encompasses 5,000 high-quality queries structured across five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap; even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, highlighting the rigorous nature of our benchmark. Through comparative analysis, we further uncover inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage. Furthermore, an in-domain training experiment demonstrates the feasibility of our benchmark. This experiment clarifies the task challenges by distinguishing between categories that are solvable with targeted data and those that expose intrinsic limitations of current model architectures.
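
For readers unfamiliar with the CIR protocol the abstract assumes, the sketch below illustrates how such benchmarks are typically scored: each query pairs a reference image with a modification text, the two are fused into a single query embedding, and Recall@k is computed against a gallery of candidate image embeddings. The `embed_query` and `embed_image` functions here are hypothetical stand-ins (random unit vectors) for a real multimodal embedding model such as those evaluated in the paper; this is a minimal sketch of the general recall computation, not EDIR's actual evaluation code.

```python
import numpy as np

DIM = 512  # embedding dimension; arbitrary for this sketch


def embed_query(reference_image: str, modification_text: str) -> np.ndarray:
    """Hypothetical stand-in: a real CIR model would fuse the reference
    image and the modification text into one query embedding."""
    rng = np.random.default_rng(hash((reference_image, modification_text)) % 2**32)
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)


def embed_image(image_id: str) -> np.ndarray:
    """Hypothetical stand-in for the model's image encoder."""
    rng = np.random.default_rng(hash(image_id) % 2**32)
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)


def recall_at_k(queries, gallery_ids, k: int = 10) -> float:
    """queries: list of (reference_image, modification_text, target_image_id)."""
    gallery = np.stack([embed_image(g) for g in gallery_ids])  # (N, DIM)
    hits = 0
    for ref, text, target in queries:
        q = embed_query(ref, text)
        scores = gallery @ q                 # cosine similarity (unit vectors)
        top_k = np.argsort(-scores)[:k]      # indices of the k best candidates
        if target in {gallery_ids[i] for i in top_k}:
            hits += 1
    return hits / len(queries)


# Toy usage: one query against a tiny gallery.
gallery = [f"img_{i}" for i in range(100)]
queries = [("img_3", "make the jacket red", "img_42")]
print(f"Recall@10 = {recall_at_k(queries, gallery, k=10):.2f}")
```

With random stand-in encoders the score is near chance; swapping in a real multimodal embedding model is what the benchmark measures, per subcategory of modification type.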