More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

September 30, 2025
Authors: Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, Jing Zhang
cs.AI

Abstract

Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models can solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and improves performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our resulting model, VAPO-Thinker-7B, substantially strengthens reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: https://xytian1008.github.io/VAPO/
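
The abstract assumes familiarity with GRPO's core idea: instead of training a value critic, each sampled rollout is scored against the mean and standard deviation of its own group of rollouts for the same prompt. Below is a minimal sketch of that group-relative advantage step; the function name and tensor shapes are illustrative, and VAPO's visual-anchoring objective is not shown, since the abstract does not specify it.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage as used in GRPO: normalize each rollout's
    scalar reward by its group's mean and std, removing the need for a
    learned value critic."""
    # rewards: shape (G,) -- one scalar reward per rollout in the group
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: one prompt, four sampled rollouts; reward 1.0 for a correct answer.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # correct rollouts get positive advantage, incorrect negative
```

These advantages then weight the policy-gradient update for every token of the corresponding rollout, which is the mechanism VAPO modifies to keep reasoning trajectories anchored to the visual input.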