Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

Abstract

Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these systems, the rollout (the trajectory sampled from a prompt to termination, including intermediate reasoning steps and optional tool or environment interactions) determines the data the optimizer learns from, yet rollout design is often underreported. This survey provides an optimizer-agnostic view of rollout strategies for RL-based post-training of reasoning LLMs. We formalize rollout pipelines with unified notation and introduce Generate-Filter-Control-Replay (GFCR), a lifecycle taxonomy that decomposes rollout pipelines into four modular stages: Generate proposes candidate trajectories and topologies; Filter constructs intermediate signals via verifiers, judges, and critics; Control allocates compute and makes continuation/branching/stopping decisions under budgets; and Replay retains and reuses artifacts across rollouts without weight updates, including self-evolving curricula that autonomously generate new training tasks. We complement GFCR with a criterion taxonomy of reliability, coverage, and cost sensitivity that characterizes rollout trade-offs. Using this framework, we synthesize methods spanning RL with verifiable rewards, process supervision, judge-based gating, guided and tree/segment rollouts, adaptive compute allocation, early-exit and partial rollouts, throughput optimization, and replay/recomposition for self-improvement. We ground the framework with case studies in math, code/SQL, multimodal reasoning, tool-using agents, and agentic skill benchmarks that evaluate skill induction, reuse, and cross-task transfer. Finally, we provide a diagnostic index that maps common rollout pathologies to GFCR modules and mitigation levers, alongside open challenges for building reproducible, compute-efficient, and trustworthy rollout pipelines.
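To make the four GFCR stages concrete, the following is a minimal toy sketch of how they might compose in one loop. It is not the survey's formalization: the addition task, the exact-match verifier, the batch-wise early-stopping rule, and every name (`Rollout`, `generate`, `filter_`, `control`, `run`) are hypothetical choices made only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str          # toy prompt of the form "a+b" (an assumption of this sketch)
    answer: int          # the sampled final answer
    reward: float = 0.0  # signal attached by the Filter stage


def generate(prompt, n, rng):
    """Generate: propose n candidate trajectories (here, noisy guesses at a sum)."""
    a, b = map(int, prompt.split("+"))
    return [Rollout(prompt, a + b + rng.choice([-1, 0, 0, 1])) for _ in range(n)]


def filter_(rollouts):
    """Filter: attach a verifier signal (here, an exact-match check)."""
    for r in rollouts:
        a, b = map(int, r.prompt.split("+"))
        r.reward = 1.0 if r.answer == a + b else 0.0
    return rollouts


def control(prompt, budget, rng, batch=2):
    """Control: spend the budget in small batches and stop early on a verified rollout."""
    collected = []
    while len(collected) < budget:
        n = min(batch, budget - len(collected))
        collected += filter_(generate(prompt, n, rng))
        if any(r.reward == 1.0 for r in collected):
            break
    return collected


def run(prompts, budget=8, seed=0):
    """Replay: keep verified artifacts across prompts, with no weight updates."""
    rng = random.Random(seed)
    replay = {}  # prompt -> first verified rollout
    spent = 0    # total rollouts sampled
    for p in prompts:
        if p in replay:
            continue  # reuse the stored artifact instead of sampling again
        rollouts = control(p, budget, rng)
        spent += len(rollouts)
        verified = [r for r in rollouts if r.reward == 1.0]
        if verified:
            replay[p] = verified[0]
    return replay, spent
```

Calling `run(["2+3", "10+7", "2+3"])` samples at most `budget` rollouts per prompt and skips a repeated prompt once a verified rollout for it has been stored. Keeping each stage behind its own function mirrors the modular decomposition described above: any one stage can be swapped without touching the others.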
