
Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

February 27, 2026
Authors: Qihua Dong, Kuo Yang, Lin Ju, Handong Zhao, Yitian Zhang, Yizhou Wang, Huimin Zeng, Jianglin Lu, Yun Fu
cs.AI

Abstract

Referring Expression Comprehension (REC) links language to region-level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have progressed rapidly with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word-order perturbations and descriptor-deletion sufficiency tests) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in-depth failure analysis and aim for Ref-Adv to guide future work on visual reasoning and grounding in MLLMs.
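
The abstract's shortcut ablations can be pictured concretely. Below is a minimal sketch, not taken from the paper, of what word-order perturbation and descriptor-deletion checks might look like for a REC evaluation: the box format, the `predict_box` grounding call, and the IoU ≥ 0.5 criterion are assumptions standing in for whatever protocol Ref-Adv actually uses.

```python
# Hypothetical sketch of shortcut ablations for REC evaluation.
# Assumption: each sample pairs a referring expression with a ground-truth box,
# and predict_box(image, expression) stands in for any MLLM grounding call.
import random
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union; IoU >= 0.5 is the usual 'correct grounding' threshold."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def shuffle_words(expression: str, seed: int = 0) -> str:
    """Word-order perturbation: destroy syntax while keeping the bag of words."""
    words = expression.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def drop_descriptor(expression: str, descriptor: str) -> str:
    """Descriptor deletion: remove one annotated descriptor span (e.g. a negation)."""
    return " ".join(expression.replace(descriptor, " ").split())

# If grounding accuracy barely changes when the expression is shuffled or a
# necessary descriptor is deleted, the model is likely exploiting shortcuts
# rather than reading the full expression.
expr = "the man to the left of the dog who is not holding a cup"
print(shuffle_words(expr))
print(drop_descriptor(expr, "not holding a cup"))
```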