Rethinking Expert Trajectory Utilization in LLM Post-training
December 12, 2025
Authors: Bowen Ding, Yuhan Chen, Jiayang Lv, Jiyao Yuan, Qi Zhu, Shuangshuang Tian, Dantong Zhu, Futing Wang, Heyuan Deng, Fei Mi, Lifeng Shang, Tao Lin
cs.AI
Abstract
While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing performance into foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the SFT Stable or Mild Overfitting Sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) Refuting "Less is More" in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) Identifying that the Minimum SFT Validation Loss serves as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.
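A minimal formalization of the decomposition described above, in our own notation (none of these symbols are taken from the paper itself): the final post-training ceiling can be read as

    P_ceiling = P_SFT + Δ_RL

where P_SFT is the foundational performance secured by supervised fine-tuning on expert trajectories and Δ_RL is the remaining plasticity, i.e. the headroom that reinforcement learning can still recover. Under this reading, guideline (1) amounts to stopping SFT once P_SFT has plateaued (the stable or mildly overfitting sub-phase) but before further SFT epochs begin to erode Δ_RL.

Guideline (3) can likewise be sketched in a few lines of Python. This is purely illustrative: the trajectory-set names and loss values below are placeholders, not results from the paper.

    # Hedged sketch of guideline (3): select the expert-trajectory set whose
    # SFT run attains the lowest minimum validation loss.
    candidate_runs = {
        "easy_trajectories":  [1.20, 0.95, 0.90, 0.92, 0.97],  # per-epoch SFT validation loss (placeholder values)
        "hard_trajectories":  [1.40, 1.05, 0.84, 0.86, 0.93],
        "mixed_trajectories": [1.30, 1.00, 0.88, 0.87, 0.91],
    }
    best = min(candidate_runs, key=lambda name: min(candidate_runs[name]))
    print(best, min(candidate_runs[best]))  # -> hard_trajectories 0.84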