

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

April 8, 2026
作者: Rohan Surana, Gagan Mundada, Xunyi Jiang, Chuhan Wang, Zhenwei Tang, Difan Jiao, Zihan Huang, Yuxin Xiong, Junda Wu, Sheldon Yu, Xintong Li, Raghav Jain, Nikki Kuang, Sizhe Zhou, Bowen Jin, Zhendong Chu, Tong Yu, Ryan Rossi, Kuan-Hao Huang, Jingbo Shang, Jiawei Han, Julian McAuley
cs.AI

Abstract

Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these systems, the rollout (the trajectory sampled from a prompt to termination, including intermediate reasoning steps and optional tool or environment interactions) determines the data the optimizer learns from, yet rollout design is often underreported. This survey provides an optimizer-agnostic view of rollout strategies for RL-based post-training of reasoning LLMs. We formalize rollout pipelines with unified notation and introduce Generate-Filter-Control-Replay (GFCR), a lifecycle taxonomy that decomposes rollout pipelines into four modular stages: Generate proposes candidate trajectories and topologies; Filter constructs intermediate signals via verifiers, judges, and critics; Control allocates compute and makes continuation/branching/stopping decisions under budgets; and Replay retains and reuses artifacts across rollouts without weight updates, including self-evolving curricula that autonomously generate new training tasks. We complement GFCR with a criterion taxonomy of reliability, coverage, and cost sensitivity that characterizes rollout trade-offs. Using this framework, we synthesize methods spanning RL with verifiable rewards, process supervision, judge-based gating, guided and tree/segment rollouts, adaptive compute allocation, early-exit and partial rollouts, throughput optimization, and replay/recomposition for self-improvement. We ground the framework with case studies in math, code/SQL, multimodal reasoning, tool-using agents, and agentic skill benchmarks that evaluate skill induction, reuse, and cross-task transfer. Finally, we provide a diagnostic index that maps common rollout pathologies to GFCR modules and mitigation levers, alongside open challenges for building reproducible, compute-efficient, and trustworthy rollout pipelines.
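To make the four-stage decomposition concrete, here is a minimal sketch of one GFCR rollout pass. All function names, signatures, and the best-of-n selection policy are illustrative assumptions for exposition, not an implementation from the survey:

```python
# Hypothetical sketch of one GFCR rollout cycle. Every name here
# (generate, filter_stage, control, replay_store, gfcr_rollout) is
# invented for illustration; the survey defines the stages abstractly.

def generate(prompt, n_candidates, sample_step):
    """Generate: propose candidate trajectories (here, flat n-way sampling)."""
    return [sample_step(prompt) for _ in range(n_candidates)]

def filter_stage(trajectories, verifier):
    """Filter: attach an intermediate signal (e.g. a verifier score) to each trajectory."""
    return [(traj, verifier(traj)) for traj in trajectories]

def control(scored, budget):
    """Control: allocate the compute budget, keeping only the top-scored trajectories."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:budget]

def replay_store(buffer, kept):
    """Replay: retain artifacts across rollouts without any weight update."""
    buffer.extend(kept)
    return buffer

def gfcr_rollout(prompt, sample_step, verifier, n_candidates=8, budget=2, buffer=None):
    """Run one Generate -> Filter -> Control -> Replay cycle for a single prompt."""
    buffer = [] if buffer is None else buffer
    candidates = generate(prompt, n_candidates, sample_step)
    scored = filter_stage(candidates, verifier)
    kept = control(scored, budget)
    return replay_store(buffer, kept)
```

In practice `sample_step` would be an LLM decoding call and `verifier` an outcome or process reward (e.g. an answer checker for math, unit tests for code); the point of the sketch is only that the four stages compose modularly and the replay buffer outlives a single rollout.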