
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

October 29, 2025
Authors: Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang, Tomas Pfister, Chen-Yu Lee
cs.AI

Abstract

Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
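The abstract's core mechanism is a dense, step-wise reward: the model's generated actions are scored against the expert actions extracted from the SFT data, step by step, rather than by final-answer correctness alone. The sketch below illustrates that idea under stated assumptions; the paper's actual similarity metric, action representation, and function names are not given in the abstract, so `stepwise_similarity_reward` and the use of `difflib.SequenceMatcher` as the similarity measure are illustrative choices, not the authors' implementation.

```python
from difflib import SequenceMatcher


def stepwise_similarity_reward(model_actions, expert_actions):
    """Mean per-step similarity between model and expert actions.

    A sketch of SRL's smoother reward signal: each step earns partial
    credit for resembling the expert's action at that step, so even a
    rollout whose final answer is wrong yields a nonzero learning signal.
    Assumes actions are plain strings; SequenceMatcher is a stand-in
    for whatever similarity metric the paper actually uses.
    """
    n = len(expert_actions)
    if n == 0:
        return 0.0
    scores = []
    for i, expert in enumerate(expert_actions):
        # Steps the model never produced contribute zero similarity.
        model = model_actions[i] if i < len(model_actions) else ""
        scores.append(SequenceMatcher(None, model, expert).ratio())
    return sum(scores) / n


# A rollout with a wrong second step still earns partial credit,
# unlike a binary verifiable reward, which would return 0 here.
reward = stepwise_similarity_reward(
    ["x = 2", "y = x + 1"],
    ["x = 2", "y = x * 3"],
)
```

In this framing, the internal reasoning monologue the model emits before each action would be excluded from the similarity comparison, so the model is free to reason flexibly as long as the committed actions track the expert trajectory.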