SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

Abstract

This work revisits the dominant paradigm for training Large Vision-Language Models (LVLMs), supervised fine-tuning (SFT) followed by reinforcement learning (RL), and reveals a key finding: SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps and incorrect reasoning. To systematically study this effect, we introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs. Constructed via a six-step pipeline involving captioning, reasoning distillation, answer rewriting, and verification, VLAA-Thinking comprises high-quality, step-by-step visual reasoning traces for SFT, along with a more challenging RL split from the same data source. Using this dataset, we conduct extensive experiments comparing SFT, RL, and their combinations. Results show that while SFT helps models learn reasoning formats, it often locks aligned models into imitative, rigid reasoning modes that impede further learning. In contrast, our RL approach, which builds on Group Relative Policy Optimization (GRPO) with a novel mixed reward module integrating both perception and cognition signals, fosters more genuine, adaptive reasoning behavior. Notably, our model VLAA-Thinker, based on Qwen2.5VL 3B, achieves top-1 performance on the Open LMM Reasoning Leaderboard (https://huggingface.co/spaces/opencompass/Open_LMM_Reasoning_Leaderboard) among 4B-scale LVLMs, surpassing the previous state of the art by 1.8%. We hope our findings provide valuable insights for developing reasoning-capable LVLMs and can inform future research in this area.
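As a rough illustration of the GRPO idea mentioned in the abstract, the sketch below shows the group-relative step: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its own group. This is a minimal sketch and not the authors' implementation. The function names, the `eps` constant, and in particular `mixed_reward` with its `weight` parameter are hypothetical; the abstract does not say how the perception and cognition signals are combined, so the weighted sum here is only a placeholder assumption.

```python
from statistics import mean, pstdev


def mixed_reward(perception_score, cognition_score, weight=0.5):
    """Hypothetical placeholder: blend two reward signals with a weighted sum.

    The abstract only says the reward integrates perception and cognition
    signals; this particular combination rule is an assumption.
    """
    return weight * perception_score + (1.0 - weight) * cognition_score


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Uses the population standard deviation; implementations differ on this
    detail. `eps` avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one prompt, scored by the placeholder reward.
    scores = [(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    rewards = [mixed_reward(p, c) for p, c in scores]
    print(group_relative_advantages(rewards))
```

Responses that score above their group's average receive a positive advantage and those below receive a negative one, so no separate learned value model is needed to provide a baseline.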
