From Seeing to Thinking: Decoupling Perception and Reasoning to Improve Post-Training of Vision-Language Models

Abstract

Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning, yet we find that their performance on visual tasks is limited primarily by weak visual perception rather than by reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing post-training into three separate stages (visual perception, visual reasoning, and textual reasoning), each with specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before visual reasoning is refined; and (c) is learned more effectively via reinforcement learning (RL) than via caption-based supervised fine-tuning (SFT). Experiments across multiple VLMs show that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that stronger perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and that combining the two yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, with gains over their base counterparts on several visual math and perception tasks (e.g., +5.2% on WeMath and +3.7% on RealWorldQA).