

Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

October 29, 2025
作者: Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang, Tomas Pfister, Chen-Yu Lee
cs.AI

Abstract

Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
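The abstract's key mechanism is a reward computed from the step-wise similarity between the model's actions and expert actions extracted from SFT demonstrations. The paper's exact similarity metric is not given here, so the following is only a minimal illustrative sketch: it assumes actions are strings and uses character-level sequence similarity (`difflib.SequenceMatcher`) as a stand-in metric, with the function name `stepwise_similarity_reward` invented for this example.

```python
from difflib import SequenceMatcher

def stepwise_similarity_reward(model_actions, expert_actions):
    """Hypothetical sketch of SRL's step-wise reward: score each model
    action against the corresponding expert action. The real similarity
    metric is unspecified in the abstract; sequence similarity is used
    purely for illustration."""
    rewards = []
    for step, expert in enumerate(expert_actions):
        if step < len(model_actions):
            # Graded similarity in [0, 1] instead of a binary verdict.
            sim = SequenceMatcher(None, model_actions[step], expert).ratio()
        else:
            sim = 0.0  # model produced no action for this expert step
        rewards.append(sim)
    return rewards

# Even a rollout that never reaches a correct final answer still earns
# partial credit for individual steps, unlike a binary verifiable reward.
rewards = stepwise_similarity_reward(
    ["expand (x+1)^2", "collect terms"],
    ["expand (x+1)^2", "combine like terms", "solve for x"],
)
```

The graded per-step signal is what makes learning possible "even when all rollouts are incorrect": a fully wrong trajectory still yields a dense, informative gradient, whereas an outcome-only verifiable reward would be uniformly zero.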