From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

May 19, 2026
Authors: Juncheng Wu, Hardy Chen, Haoqin Tu, Xianfeng Tang, Freda Shi, Hui Liu, Hanqing Lu, Cihang Xie, Yuyin Zhou
cs.AI

Abstract

Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning, yet we find that their performance on visual tasks is limited primarily by weak visual perception rather than by reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing model capabilities into three separate training stages (visual perception, visual reasoning, and textual reasoning) and incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is learned more effectively via reinforcement learning (RL) than via caption-based supervised fine-tuning (SFT). Experiments across multiple VLMs show that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that stronger perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging is a new curriculum dimension orthogonal to traditional difficulty-based curricula, and that combining the two yields further gains. Our staged-training models achieve superior performance among open-weight VLMs, improving over their base counterparts on several visual math and perception tasks (e.g., +5.2% on WeMath and +3.7% on RealWorldQA).
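The difference between staged and merged training, and the way capability staging can be combined with a difficulty-based curriculum, can be illustrated with a minimal data-scheduling sketch. This is not the authors' implementation: the `Sample` fields, the capability labels, the difficulty score, and both function names are hypothetical, and the stage order simply follows the order in which the abstract lists the three capabilities.

```python
from dataclasses import dataclass

# Assumed stage order: taken from the order in which the abstract lists
# the capabilities. The abstract only states explicitly that perception
# should come before visual reasoning.
STAGE_ORDER = ("visual_perception", "visual_reasoning", "textual_reasoning")


@dataclass(frozen=True)
class Sample:
    sample_id: str
    capability: str    # hypothetical label, expected to be in STAGE_ORDER
    difficulty: float  # hypothetical score, higher means harder


def staged_schedule(samples, by_difficulty=False):
    """Return one list of samples per capability stage, in STAGE_ORDER.

    With by_difficulty=True, each stage is additionally sorted from easy
    to hard, combining capability staging with a difficulty curriculum.
    Samples whose capability is not in STAGE_ORDER are dropped.
    """
    stages = []
    for capability in STAGE_ORDER:
        stage = [s for s in samples if s.capability == capability]
        if by_difficulty:
            stage = sorted(stage, key=lambda s: s.difficulty)
        stages.append(stage)
    return stages


def merged_schedule(samples):
    """Baseline: a single stage that mixes all capabilities together."""
    return [list(samples)]
```

Each inner list would be consumed by one training phase in sequence; the merged baseline hands the trainer a single mixed phase. Because the capability grouping and the within-stage difficulty sort act on different keys, the two curricula compose without interfering, which is the sense in which the abstract calls them orthogonal.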