
VLM-R^3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought

May 22, 2025
作者: Chaoya Jiang, Yongrui Heng, Wei Ye, Han Yang, Haiyang Xu, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang
cs.AI

Abstract

Recently, reasoning-based MLLMs have achieved a degree of success in generating long-form textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on and revisiting of visual regions to achieve precise grounding of textual reasoning in visual evidence. We introduce VLM-R^3 (Visual Language Model with Region Recognition and Reasoning), a framework that equips an MLLM with the ability to (i) decide when additional visual evidence is needed, (ii) determine where to ground within the image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved chain-of-thought. The core of our method is Region-Conditioned Reinforcement Policy Optimization (R-GRPO), a training paradigm that rewards the model for selecting informative regions, formulating appropriate transformations (e.g.\ crop, zoom), and integrating the resulting visual context into subsequent reasoning steps. To bootstrap this policy, we compile a modest but carefully curated Visuo-Lingual Interleaved Rationale (VLIR) corpus that provides step-level supervision on region selection and textual justification. Extensive experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R^3 sets a new state of the art in zero-shot and few-shot settings, with the largest gains appearing on questions demanding subtle spatial reasoning or fine-grained visual cue extraction.
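The three capabilities the abstract lists — deciding *when* more visual evidence is needed, *where* to ground in the image, and how to weave the sub-image back into the chain-of-thought — suggest an iterative inference loop. The paper does not publish this interface; the sketch below is a hypothetical illustration of such a loop, with `step_fn` (one reasoning step that may emit a region request) and `crop_fn` (applies the crop/zoom) standing in for the model and image pipeline:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class RegionRequest:
    """A model-issued request to ground on an image region (hypothetical schema)."""
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in image coordinates
    op: str                          # e.g. "crop" or "zoom"

def interleaved_reasoning(
    question: str,
    image: object,
    step_fn: Callable[[List[object]], Tuple[str, Optional[RegionRequest]]],
    crop_fn: Callable[[object, Tuple[int, int, int, int]], object],
    max_steps: int = 8,
) -> List[str]:
    """Interleaved chain-of-thought: after each textual step the model may
    request a sub-image, which is cropped and appended to the context."""
    context: List[object] = [question, image]
    chain: List[str] = []
    for _ in range(max_steps):
        text, request = step_fn(context)       # (i) decide if more evidence is needed
        chain.append(text)
        if request is None:                    # no further grounding -> stop
            break
        region = crop_fn(image, request.bbox)  # (ii) ground: crop/zoom the region
        context.append(region)                 # (iii) weave region into the chain
    return chain
```

Under R-GRPO, the policy producing `step_fn`'s outputs would be rewarded for choosing informative `bbox` values and transformations and for using the returned region in later steps; the loop structure itself is only an assumption drawn from the abstract.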

