ChatPaper.ai


Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

April 8, 2026
作者: Rohan Surana, Gagan Mundada, Xunyi Jiang, Chuhan Wang, Zhenwei Tang, Difan Jiao, Zihan Huang, Yuxin Xiong, Junda Wu, Sheldon Yu, Xintong Li, Raghav Jain, Nikki Kuang, Sizhe Zhou, Bowen Jin, Zhendong Chu, Tong Yu, Ryan Rossi, Kuan-Hao Huang, Jingbo Shang, Jiawei Han, Julian McAuley
cs.AI

Abstract

Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these systems, the rollout (the trajectory sampled from a prompt to termination, including intermediate reasoning steps and optional tool or environment interactions) determines the data the optimizer learns from, yet rollout design is often underreported. This survey provides an optimizer-agnostic view of rollout strategies for RL-based post-training of reasoning LLMs. We formalize rollout pipelines with unified notation and introduce Generate-Filter-Control-Replay (GFCR), a lifecycle taxonomy that decomposes rollout pipelines into four modular stages: Generate proposes candidate trajectories and topologies; Filter constructs intermediate signals via verifiers, judges, or critics; Control allocates compute and makes continuation/branching/stopping decisions under budgets; and Replay retains and reuses artifacts across rollouts without weight updates, including self-evolving curricula that autonomously generate new training tasks. We complement GFCR with a criterion taxonomy of reliability, coverage, and cost sensitivity that characterizes rollout trade-offs. Using this framework, we synthesize methods spanning RL with verifiable rewards, process supervision, judge-based gating, guided and tree/segment rollouts, adaptive compute allocation, early-exit and partial rollouts, throughput optimization, and replay/recomposition for self-improvement. We ground the framework with case studies in math, code/SQL, multimodal reasoning, tool-using agents, and agentic skill benchmarks that evaluate skill induction, reuse, and cross-task transfer. Finally, we provide a diagnostic index that maps common rollout pathologies to GFCR modules and mitigation levers, alongside open challenges for building reproducible, compute-efficient, and trustworthy rollout pipelines.
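The four GFCR stages compose naturally into a pipeline: generation proposes candidates, filtering gates them with a verifier, control enforces a compute budget, and replay retains the survivors for later reuse. A minimal sketch of that composition follows; every component here (the toy step generator, the length-based verifier, the truncation controller, and the `ReplayBuffer` class) is a hypothetical stand-in for illustration, not an implementation from the survey.

```python
import random

def generate(prompt, n_candidates, rng):
    """Generate: propose candidate trajectories (here, toy lists of steps)."""
    return [[f"{prompt}:step{i}" for i in range(rng.randint(1, 4))]
            for _ in range(n_candidates)]

def filter_trajectories(trajectories, verifier):
    """Filter: keep only trajectories that pass an outcome verifier."""
    return [t for t in trajectories if verifier(t)]

def control(trajectories, budget):
    """Control: allocate compute by truncating to a trajectory budget."""
    return trajectories[:budget]

class ReplayBuffer:
    """Replay: retain rollout artifacts across iterations, no weight updates."""
    def __init__(self):
        self.buffer = []

    def add(self, trajectories):
        self.buffer.extend(trajectories)

    def sample(self, k, rng):
        return rng.sample(self.buffer, min(k, len(self.buffer)))

def gfcr_rollout(prompt, verifier, budget=2, n_candidates=8, seed=0):
    """One pass through the hypothetical Generate-Filter-Control-Replay loop."""
    rng = random.Random(seed)
    replay = ReplayBuffer()
    candidates = generate(prompt, n_candidates, rng)   # Generate
    passed = filter_trajectories(candidates, verifier)  # Filter
    kept = control(passed, budget)                      # Control
    replay.add(kept)                                    # Replay
    return kept, replay

# Toy verifier: accept trajectories with at least two steps.
kept, replay = gfcr_rollout("2+2", verifier=lambda t: len(t) >= 2)
```

Keeping the stages as separate functions mirrors the survey's modularity claim: a process-supervision verifier, a tree-branching controller, or a self-evolving curriculum in the buffer could each be swapped in without touching the other stages.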
PDF · May 7, 2026