

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

January 28, 2025
作者: Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma
cs.AI

Abstract

Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.

