Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

Abstract

Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these systems, the rollout (a trajectory sampled from prompt to termination, including intermediate reasoning steps and optional tool or environment interactions) determines the data the optimizer learns from, yet rollout design is often underreported. This survey provides an optimizer-agnostic view of rollout strategies for RL-based post-training of reasoning LLMs. We formalize rollout pipelines with unified notation and introduce Generate-Filter-Control-Replay (GFCR), a lifecycle taxonomy that decomposes rollout pipelines into four modular stages: Generate proposes candidate trajectories and topologies; Filter constructs intermediate signals via verifiers, judges, and critics; Control allocates compute and makes continuation, branching, and stopping decisions under budgets; and Replay retains and reuses artifacts across rollouts without weight updates, including self-evolving curricula that autonomously generate new training tasks. We complement GFCR with a criterion taxonomy of reliability, coverage, and cost sensitivity that characterizes rollout trade-offs. Using this framework, we synthesize methods spanning RL with verifiable rewards, process supervision, judge-based gating, guided and tree/segment rollouts, adaptive compute allocation, early-exit and partial rollouts, throughput optimization, and replay/recomposition for self-improvement. We ground the framework with case studies in math, code/SQL, multimodal reasoning, tool-using agents, and agentic skill benchmarks that evaluate skill induction, reuse, and cross-task transfer. Finally, we provide a diagnostic index that maps common rollout pathologies to GFCR modules and mitigation levers, alongside open challenges for building reproducible, compute-efficient, and trustworthy rollout pipelines.
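To make the four GFCR stages concrete, the following is a minimal toy sketch, not the survey's own formalization or code. Every name in it (`Trajectory`, `generate`, `filter_`, `control`, `ReplayBuffer`) is hypothetical, the "policy" is a random digit sampler standing in for an LLM, and the "verifier" is an exact-match check. It only illustrates how the stages might compose: Generate proposes candidates, Filter attaches a signal, Control spends a sampling budget and stops early, and Replay retains verified artifacts without any weight update.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    prompt: str
    answer: int          # toy stand-in for a sampled completion
    score: float = 0.0   # signal attached by the Filter stage


def generate(prompt, n, rng):
    """Generate: propose n candidate trajectories (random digits, not a real LLM)."""
    return [Trajectory(prompt, rng.randint(0, 9)) for _ in range(n)]


def filter_(trajs, target):
    """Filter: attach a verifier signal (1.0 on exact match, else 0.0)."""
    for t in trajs:
        t.score = 1.0 if t.answer == target else 0.0
    return trajs


def control(prompt, target, budget, batch, rng):
    """Control: sample in batches and stop once a verified trajectory
    appears or the sampling budget is exhausted."""
    kept, used = [], 0
    while used < budget:
        n = min(batch, budget - used)
        trajs = filter_(generate(prompt, n, rng), target)
        used += n
        kept.extend(trajs)
        if any(t.score == 1.0 for t in trajs):
            break  # early stop: a verified sample was found
    return kept, used


class ReplayBuffer:
    """Replay: retain verified trajectories across rollouts (no weight updates)."""

    def __init__(self):
        self.items = []

    def add(self, trajs):
        self.items.extend(t for t in trajs if t.score == 1.0)


# Usage: one prompt, a budget of 40 samples drawn in batches of 4.
rng = random.Random(0)
kept, used = control("2 + 5 = ?", target=7, budget=40, batch=4, rng=rng)
buffer = ReplayBuffer()
buffer.add(kept)
print(f"samples used: {used}, verified kept: {len(buffer.items)}")
```

In this sketch the budget and batch size are the only control knobs; a real pipeline would presumably replace the exact-match check with a verifier, judge, or critic and the random sampler with model decoding.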
