Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
February 27, 2026
Authors: Qihua Dong, Kuo Yang, Lin Ju, Handong Zhao, Yitian Zhang, Yizhou Wang, Huimin Zeng, Jianglin Lu, Yun Fu
cs.AI
Abstract
Referring Expression Comprehension (REC) links language to region-level visual perception. Multimodal LLMs have made rapid progress on the standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg), yet these benchmarks remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word-order perturbations and descriptor-deletion sufficiency tests) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in-depth failure analysis and intend Ref-Adv to guide future work on visual reasoning and grounding in MLLMs.
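As a rough illustration of the two ablations mentioned in the abstract, the sketch below shows one way a word-order perturbation and a descriptor-deletion sufficiency test might be applied to a referring expression before re-running a grounding model. This is a minimal sketch under stated assumptions, not the Ref-Adv evaluation code; the function names and the sample expression are hypothetical.

```python
import random


def shuffle_words(expression: str, seed: int = 0) -> str:
    """Word-order perturbation: shuffle the tokens of a referring expression.

    If a model's grounding accuracy barely changes under shuffling, it is
    likely treating the expression as a bag of words (a shortcut) rather
    than parsing its linguistic structure.
    """
    words = expression.split()
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)


def delete_descriptor(expression: str, descriptor: str) -> str:
    """Descriptor-deletion sufficiency test: drop one descriptor token.

    When every descriptor is necessary to uniquely identify the target,
    removing any one of them should make the expression ambiguous, so a
    drop in accuracy is the expected behavior.
    """
    words = [w for w in expression.split() if w.lower() != descriptor.lower()]
    return " ".join(words)


if __name__ == "__main__":
    # Hypothetical expression, chosen only to illustrate the two probes.
    expr = "the mug that is not red, left of the laptop"
    print(shuffle_words(expr))           # perturbed word order
    print(delete_descriptor(expr, "red"))  # one descriptor removed
```

Under this kind of probe, unchanged accuracy after shuffling suggests reliance on bag-of-words shortcuts, while a marked drop after deleting a necessary descriptor is what a minimally specified expression should produce.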