Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

Abstract

Referring Expression Comprehension (REC) links language to region-level visual perception. Performance on the standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) has improved rapidly with multimodal LLMs, but these benchmarks remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word-order perturbations and descriptor-deletion sufficiency) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in-depth failure analysis and aim for Ref-Adv to guide future work on visual reasoning and grounding in MLLMs.
