One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient
September 30, 2025
Authors: Rui Ming, Haoyuan Wu, Shoubo Hu, Zhuolun He, Bei Yu
cs.AI
Abstract
Supervised fine-tuning (SFT) is the predominant method for adapting large
language models (LLMs), yet it often struggles with generalization compared to
reinforcement learning (RL). In this work, we posit that this performance
disparity stems not just from the loss function, but from a more fundamental
difference: SFT learns from a fixed, pre-collected dataset, whereas RL utilizes
on-policy data sampled from the current policy. Building on this hypothesis, we
introduce one-token rollout (OTR), a novel fine-tuning algorithm that guides
SFT with the policy gradient method. OTR reframes the autoregressive learning
process by treating each token generation as a single-step reinforcement
learning trajectory. At each step, it performs a Monte Carlo "rollout" by
sampling multiple candidate tokens from the current policy's distribution. The
ground-truth token from the supervised data is then used to provide a reward
signal to these samples. Guided by policy gradient, our algorithm repurposes
static, off-policy supervised data into a dynamic, on-policy signal at the
token level, capturing the generalization benefits of on-policy learning while
bypassing the costly overhead of full sentence generation. Through extensive
experiments on a diverse suite of challenging benchmarks spanning mathematical
reasoning, code generation, and general-domain reasoning, we demonstrate that
OTR consistently outperforms standard SFT. Our findings establish OTR as a
powerful and practical alternative for fine-tuning LLMs and provide compelling
evidence that the on-policy nature of data is a critical driver of
generalization, pointing to a promising new direction for adapting LLMs.
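
To make the token-level policy-gradient idea concrete, below is a minimal PyTorch sketch of a loss in the spirit of OTR. It assumes next-token logits already aligned with their labels; the candidate count `k`, the binary match-the-ground-truth reward, and the helper name `otr_loss` are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F


def otr_loss(logits: torch.Tensor, labels: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Token-level policy-gradient loss on supervised data (sketch).

    logits: (batch, seq_len, vocab) next-token logits from the current policy,
            assumed already shifted so logits[:, t] predicts labels[:, t].
    labels: (batch, seq_len) ground-truth token ids; -100 marks ignored positions.
    k:      number of candidate tokens sampled per position (the "rollout" size,
            an assumed hyperparameter).
    """
    batch, seq_len, vocab = logits.shape
    log_probs = F.log_softmax(logits, dim=-1)   # current policy's distribution
    probs = log_probs.exp()

    # Monte Carlo "rollout": sample k candidate tokens per position from the policy.
    flat_probs = probs.reshape(-1, vocab)                            # (batch*seq_len, vocab)
    samples = torch.multinomial(flat_probs, k, replacement=True)     # (batch*seq_len, k)
    samples = samples.reshape(batch, seq_len, k)

    # Reward from the supervised data: 1 if a sampled token matches the
    # ground-truth token at that position, else 0 (an assumed reward shape).
    rewards = (samples == labels.unsqueeze(-1)).float()              # (batch, seq_len, k)

    # Log-probability of each sampled token under the current policy.
    sample_log_probs = torch.gather(log_probs, -1, samples)          # (batch, seq_len, k)

    # REINFORCE-style objective per position, averaged over the k samples,
    # masked to positions with a valid label.
    per_position = (rewards * sample_log_probs).mean(dim=-1)         # (batch, seq_len)
    valid = (labels != -100).float()
    return -(per_position * valid).sum() / valid.sum().clamp(min=1.0)


# Example usage with random tensors standing in for model outputs:
logits = torch.randn(2, 16, 32000, requires_grad=True)
labels = torch.randint(0, 32000, (2, 16))
loss = otr_loss(logits, labels)
loss.backward()
```

With a binary reward, positions where none of the k samples hit the ground-truth token contribute no gradient; variance-reduction choices such as a baseline or a larger k are plausible refinements, but the abstract does not specify them, so the exact objective should be taken from the paper rather than this sketch.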