What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis
February 12, 2026
Authors: Xirui Li, Ming Li, Tianyi Zhou
cs.AI
Abstract
Reinforcement learning (RL) with verifiable rewards has become a standard post-training stage for boosting visual reasoning in vision-language models, yet it remains unclear what capabilities RL actually improves compared with supervised fine-tuning as cold-start initialization (IN). End-to-end benchmark gains conflate multiple factors, making it difficult to attribute improvements to specific skills. To bridge the gap, we propose a Frankenstein-style analysis framework including: (i) functional localization via causal probing; (ii) update characterization via parameter comparison; and (iii) transferability testing via model merging. Our results show that RL induces a consistent inference-time shift primarily in mid-to-late layers, and these mid-to-late refinements are both transferable (via merging) and necessary (via freezing) for RL gains. Overall, our results suggest that RL's reliable contribution in visual reasoning is not a uniform enhancement of visual perception, but a systematic refinement of mid-to-late transformer computation that improves vision-to-reasoning alignment and reasoning performance, highlighting the limitations of benchmark-only evaluation for understanding multimodal reasoning improvements.
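To make the parameter-comparison and merging probes more concrete, below is a minimal sketch (not the authors' released code), assuming two Hugging Face-style checkpoints of the same architecture: an SFT cold-start model and its RL-post-trained counterpart. The checkpoint paths, the `layers.N.` naming pattern, and the choice of the second half of the stack as the "mid-to-late" layers are illustrative assumptions, not details taken from the paper.

```python
# Sketch of (ii) update characterization and (iii) a Frankenstein-style merge,
# assuming two checkpoints of the same architecture (hypothetical paths below).
import re
from collections import defaultdict

import torch
from transformers import AutoModelForCausalLM

IN_CKPT = "path/to/sft-cold-start"    # hypothetical: SFT cold-start (IN) checkpoint
RL_CKPT = "path/to/rl-post-trained"   # hypothetical: RL-post-trained checkpoint

sft = AutoModelForCausalLM.from_pretrained(IN_CKPT, torch_dtype=torch.float32)
rl = AutoModelForCausalLM.from_pretrained(RL_CKPT, torch_dtype=torch.float32)

# (ii) Update characterization: per-layer magnitude of the RL parameter update.
layer_delta = defaultdict(float)
with torch.no_grad():
    for (name, p_sft), (_, p_rl) in zip(sft.named_parameters(), rl.named_parameters()):
        m = re.search(r"layers\.(\d+)\.", name)
        if m is None:
            continue  # skip embeddings, final norm, lm_head, etc.
        layer_delta[int(m.group(1))] += (p_rl - p_sft).pow(2).sum().item()

for idx in sorted(layer_delta):
    print(f"layer {idx:02d}: ||theta_RL - theta_IN||_2 = {layer_delta[idx] ** 0.5:.4f}")

# (iii) Frankenstein-style merge: graft the RL model's later layers onto the SFT
# model to test whether the RL refinements transfer. The split point is an
# assumption ("mid-to-late" taken as the second half of the layer stack).
num_layers = max(layer_delta) + 1
graft_from = num_layers // 2
merged_state = dict(sft.state_dict())
for name, tensor in rl.state_dict().items():
    m = re.search(r"layers\.(\d+)\.", name)
    if m is not None and int(m.group(1)) >= graft_from:
        merged_state[name] = tensor
sft.load_state_dict(merged_state)  # `sft` now holds the merged (Frankenstein) model
```

Evaluating such a merged model on the same reasoning benchmarks as the IN and RL endpoints, and comparing against a run with those layers frozen during RL, would mirror the transferability and necessity tests described in the abstract.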