From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Abstract

Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning, yet we find that their performance on visual tasks is limited primarily by weak visual perception rather than by reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing model capabilities into three separate training stages (visual perception, visual reasoning, and textual reasoning), each with its own specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before visual reasoning is refined; and (c) is learned more effectively via reinforcement learning (RL) than via caption-based supervised fine-tuning (SFT). Experiments across multiple VLMs show that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that stronger perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging is a new curriculum dimension, orthogonal to traditional difficulty-based curricula, and that combining the two yields further additive gains. Our staged-training models achieve strong performance among open-weight VLMs, improving over their base counterparts on several visual math and perception tasks (e.g., +5.2% on WeMath and +3.7% on RealWorldQA).
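The contrast between staged and merged training can be pictured as a difference in how the data is scheduled. The following is a minimal sketch, not the authors' pipeline: the function names, the toy example pools, and the placement of textual reasoning in the stage order are all assumptions made for illustration. The abstract states only that perception should be solidified before visual reasoning is refined.

```python
import random
from typing import Dict, List, Sequence

Example = str  # placeholder for a real training example


def staged_schedule(
    pools: Dict[str, List[Example]], stage_order: Sequence[str]
) -> List[List[Example]]:
    """Return one training phase per capability, in the given order."""
    return [list(pools[name]) for name in stage_order]


def merged_schedule(
    pools: Dict[str, List[Example]], seed: int = 0
) -> List[List[Example]]:
    """Return a single phase with all capability data shuffled together."""
    mixed = [ex for name in sorted(pools) for ex in pools[name]]
    random.Random(seed).shuffle(mixed)
    return [mixed]


# Hypothetical toy pools, one per capability named in the abstract.
pools = {
    "visual_perception": ["p1", "p2"],
    "visual_reasoning": ["v1", "v2"],
    "textual_reasoning": ["t1"],
}

# Assumption: perception precedes visual reasoning (as the abstract states);
# where textual reasoning goes is a guess made for this sketch.
order = ["visual_perception", "visual_reasoning", "textual_reasoning"]

staged = staged_schedule(pools, order)  # three phases, one per capability
merged = merged_schedule(pools)         # one phase with everything mixed
```

In this picture, a difficulty-based curriculum would reorder examples inside each phase, which is why the two curriculum dimensions can be combined.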