SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Abstract

Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL in their effects on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training: SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.
