

More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

September 30, 2025
Authors: Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, Jing Zhang
cs.AI

Abstract

Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically via Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and improves performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our resulting model, VAPO-Thinker-7B, significantly strengthens the model's reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: https://xytian1008.github.io/VAPO/
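
The abstract names GRPO as the RL algorithm underlying these reasoning models. As a minimal, illustrative sketch only (not the paper's implementation; the function name and tensor shapes here are assumptions for illustration), the group-relative advantage that distinguishes GRPO from critic-based methods like PPO can be computed by standardizing each sampled response's reward within its group, assuming PyTorch:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage estimation in the style of GRPO.

    For each prompt, a group of G responses is sampled and scored; each
    response's advantage is its reward standardized within the group:
        A_i = (r_i - mean(r)) / (std(r) + eps)
    so no learned value function (critic) is needed.

    Args:
        rewards: tensor of shape (num_prompts, group_size) holding a
            scalar reward for each sampled response.
    Returns:
        Tensor of the same shape with per-group standardized advantages.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.5, 0.2, 0.9, 0.1]])
print(group_relative_advantages(rewards))
```

VAPO's contribution, per the abstract, is to steer these optimized trajectories toward visually grounded reasoning; the anchoring term itself is not specified in the abstract and is therefore not sketched here.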