
One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient

September 30, 2025
Authors: Rui Ming, Haoyuan Wu, Shoubo Hu, Zhuolun He, Bei Yu
cs.AI

Abstract

Supervised fine-tuning (SFT) is the predominant method for adapting large language models (LLMs), yet it often struggles with generalization compared to reinforcement learning (RL). In this work, we posit that this performance disparity stems not just from the loss function, but from a more fundamental difference: SFT learns from a fixed, pre-collected dataset, whereas RL utilizes on-policy data sampled from the current policy. Building on this hypothesis, we introduce one-token rollout (OTR), a novel fine-tuning algorithm that guides SFT with the policy gradient method. OTR reframes the autoregressive learning process by treating each token generation as a single-step reinforcement learning trajectory. At each step, it performs a Monte Carlo "rollout" by sampling multiple candidate tokens from the current policy's distribution. The ground-truth token from the supervised data is then used to provide a reward signal to these samples. Guided by policy gradient, our algorithm repurposes static, off-policy supervised data into a dynamic, on-policy signal at the token level, capturing the generalization benefits of on-policy learning while bypassing the costly overhead of full sentence generation. Through extensive experiments on a diverse suite of challenging benchmarks spanning mathematical reasoning, code generation, and general domain reasoning, we demonstrate that OTR consistently outperforms standard SFT. Our findings establish OTR as a powerful and practical alternative for fine-tuning LLMs and provide compelling evidence that the on-policy nature of data is a critical driver of generalization, offering a promising new direction for fine-tuning LLMs.
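To make the token-level rollout concrete, the sketch below shows one plausible reading of the procedure described in the abstract: each position is treated as a single-step trajectory, k candidate tokens are sampled from the current policy, a candidate is rewarded 1 if it matches the supervised next token, and a REINFORCE-style policy-gradient loss is applied. This is a minimal sketch, not the authors' implementation; the function name `otr_loss`, the sample count `k`, the 0/1 reward, and the Hugging Face-style `model(input_ids).logits` interface are all assumptions.

```python
# A minimal sketch of the one-token-rollout idea, assuming a Hugging Face-style
# causal LM whose forward pass returns `.logits`. Names and the exact reward
# shaping are illustrative, not taken from the paper's released code.
import torch
import torch.nn.functional as F


def otr_loss(model, input_ids, labels, k=4):
    """Token-level policy-gradient loss guided by ground-truth tokens.

    Each position is a single-step RL trajectory: sample k candidate tokens
    from the current policy, reward a candidate 1 if it equals the supervised
    next token (0 otherwise), and apply a REINFORCE-style update.
    """
    logits = model(input_ids).logits[:, :-1, :]   # predictions for next tokens
    targets = labels[:, 1:]                       # ground-truth next tokens
    log_probs = F.log_softmax(logits, dim=-1)

    # Monte Carlo "rollout": draw k candidates per position from the current
    # policy's distribution (the sampling itself is not differentiated).
    B, T, V = log_probs.shape
    with torch.no_grad():
        samples = torch.multinomial(
            log_probs.exp().reshape(-1, V), k, replacement=True
        ).reshape(B, T, k)

    # Reward: 1 if the sampled token equals the ground-truth token, else 0.
    rewards = (samples == targets.unsqueeze(-1)).float()

    # REINFORCE objective: maximize E[reward * log pi(sampled token)].
    sampled_logp = torch.gather(log_probs, dim=-1, index=samples)
    valid = (targets != -100).float().unsqueeze(-1)  # mask padded/prompt tokens
    loss = -(rewards * sampled_logp * valid).sum() / (valid.sum() * k).clamp(min=1)
    return loss
```

Note that with a 0/1 match reward, positions where no sampled candidate hits the ground-truth token contribute no gradient; the paper's actual objective may add a baseline, normalization, or a combination with the standard cross-entropy term.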