VLM-R^3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
May 22, 2025
Authors: Chaoya Jiang, Yongrui Heng, Wei Ye, Han Yang, Haiyang Xu, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang
cs.AI
Abstract
Recently, reasoning-based MLLMs have achieved a degree of success in
generating long-form textual reasoning chains. However, they still struggle
with complex tasks that necessitate dynamic and iterative focusing on and
revisiting of visual regions to achieve precise grounding of textual reasoning
in visual evidence. We introduce VLM-R^3 (Visual
Language Model with Region Recognition and
Reasoning), a framework that equips an MLLM with the ability to (i)
decide when additional visual evidence is needed, (ii) determine
where to ground within the image, and (iii) seamlessly weave the
relevant sub-image content back into an interleaved chain-of-thought. The core
of our method is Region-Conditioned Reinforcement Policy Optimization
(R-GRPO), a training paradigm that rewards the model for selecting informative
regions, formulating appropriate transformations (e.g., crop, zoom), and
integrating the resulting visual context into subsequent reasoning steps. To
bootstrap this policy, we compile a modest but carefully curated Visuo-Lingual
Interleaved Rationale (VLIR) corpus that provides step-level supervision on
region selection and textual justification. Extensive experiments on MathVista,
ScienceQA, and other benchmarks show that VLM-R^3 sets a new state of the art
in zero-shot and few-shot settings, with the largest gains appearing on
questions demanding subtle spatial reasoning or fine-grained visual cue
extraction.
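The abstract describes an interleaved loop in which the model alternates between textual reasoning and grounding actions (deciding when more visual evidence is needed, selecting a region, applying a crop/zoom transformation, and folding the result back into the chain-of-thought). The following minimal sketch illustrates that control flow only; the `RegionAction` type, the stub `policy` interface, and the nested-list "image" are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (assumed interfaces, not the authors' code) of the
# interleaved "reason -> select region -> crop/zoom -> continue" loop
# described in the VLM-R^3 abstract.
from dataclasses import dataclass


@dataclass
class RegionAction:
    # Crop box (top-left corner, width, height) plus an integer zoom factor.
    x: int
    y: int
    w: int
    h: int
    zoom: int


def crop_and_zoom(image, act):
    """Crop a nested-list 'image' and repeat pixels to mimic zooming in."""
    sub = [row[act.x:act.x + act.w] for row in image[act.y:act.y + act.h]]
    z = act.zoom
    return [[px for px in row for _ in range(z)] for row in sub for _ in range(z)]


def interleaved_reasoning(image, policy, max_steps=8):
    """Alternate text steps with region-grounding steps until an answer.

    'policy' stands in for the trained MLLM: given the chain so far and the
    current visual context, it returns either ("text", str) or
    ("region", RegionAction).
    """
    chain, context = [], image
    for _ in range(max_steps):
        kind, payload = policy(chain, context)
        if kind == "region":
            # Ground: transform the image and weave the sub-image back in.
            context = crop_and_zoom(image, payload)
            chain.append(f"<region {payload}>")
        else:
            chain.append(payload)
            if payload.startswith("ANSWER"):
                break
    return chain
```

A toy policy that first requests a zoomed crop and then answers exercises the loop; R-GRPO, per the abstract, would reward policies whose region choices make the subsequent reasoning steps more informative.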