What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis
February 12, 2026
Authors: Xirui Li, Ming Li, Tianyi Zhou
cs.AI
Abstract
Reinforcement learning (RL) with verifiable rewards has become a standard post-training stage for boosting visual reasoning in vision-language models, yet it remains unclear what capabilities RL actually improves compared with supervised fine-tuning as cold-start initialization (IN). End-to-end benchmark gains conflate multiple factors, making it difficult to attribute improvements to specific skills. To bridge this gap, we propose a Frankenstein-style analysis framework comprising: (i) functional localization via causal probing; (ii) update characterization via parameter comparison; and (iii) transferability testing via model merging. We find that RL induces a consistent inference-time shift primarily in mid-to-late layers, and that these mid-to-late refinements are both transferable (via merging) and necessary (via freezing) for RL gains. Overall, our results suggest that RL's reliable contribution to visual reasoning is not a uniform enhancement of visual perception, but a systematic refinement of mid-to-late transformer computation that improves vision-to-reasoning alignment and reasoning performance, highlighting the limitations of benchmark-only evaluation for understanding multimodal reasoning improvements.
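The merging step is the "Frankenstein" part of the framework: mid-to-late transformer blocks from the RL checkpoint are transplanted into the SFT cold-start model to test whether the RL refinements transfer. Below is a minimal sketch of such a layer transplant, assuming two checkpoints with identical architectures and Llama/Qwen-style block naming; the file paths, the name pattern, and the chosen layer range are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of a Frankenstein-style layer transplant between two checkpoints
# with identical architectures. Assumes decoder blocks are named "...layers.<idx>...."
# (Llama/Qwen-style); adapt LAYER_PATTERN for other naming schemes.
import re
import torch

LAYER_PATTERN = re.compile(r"\.layers\.(\d+)\.")

def transplant_state_dict(sft_state, rl_state, swap_layers):
    """Return a merged state dict: RL weights for blocks in `swap_layers`, SFT weights elsewhere."""
    merged = {}
    for name, tensor in sft_state.items():
        match = LAYER_PATTERN.search(name)
        use_rl = match is not None and int(match.group(1)) in swap_layers
        merged[name] = rl_state[name].clone() if use_rl else tensor.clone()
    return merged

# Hypothetical usage: transplant the upper half of a 32-block model from the RL
# checkpoint into the SFT cold-start model, then evaluate the hybrid on a
# visual-reasoning benchmark to see whether RL's gains transfer.
# sft_state = torch.load("sft_cold_start.pt")   # placeholder checkpoint files
# rl_state = torch.load("rl_tuned.pt")
# hybrid = transplant_state_dict(sft_state, rl_state, swap_layers=set(range(16, 32)))
# model.load_state_dict(hybrid)
```

The same key-matching loop can be reversed (SFT blocks into the RL model) or restricted to early layers, which is how layer-wise merging isolates where in the network the RL improvements live.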