EfficientRollout: System-Aware Self-Speculative Decoding for Reinforcement Learning Rollouts

Abstract

Reinforcement learning (RL) has become a representative post-training paradigm for large language models (LLMs), enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck, because autoregressive sampling decodes responses sequentially and a small number of long-tail generations often determine completion time. Speculative decoding (SD) offers a natural way to address this bottleneck: it is a well-established technique for serving fixed LLMs that reduces latency by rapidly drafting tokens and accepting them through parallel verification, while preserving the target-model distribution. Its practical speedups, however, do not carry over directly to RL rollouts: (i) the evolving target policy makes any fixed drafter increasingly mismatched with the policy's output distribution; and (ii) active batch sizes shrink over the course of rollout decoding, shifting decoding from compute-bound to memory-bound regimes, and it is in the latter that parallel verification can exploit underutilized compute. Accelerating RL rollouts therefore requires both a drafter that remains effective under long, high-temperature generations from an evolving policy and a system-aware use of SD that avoids compute-bound regimes. We present EfficientRollout, a system-aware self-speculative decoding framework designed to close this gap for RL rollouts. EfficientRollout derives a quantized drafter from the target model (i.e., self-speculative decoding), which keeps the drafter coupled to the evolving policy without separate drafter pretraining or online adaptation. It further coordinates a system-aware SD toggle policy with acceptance-aware draft-length adaptation, so that speculation is enabled only in beneficial regimes and the drafting budget tracks the evolving drafter quality. Over an accelerated autoregressive (AR) rollout baseline, EfficientRollout reduces rollout latency by up to 19.6% and end-to-end latency by up to 12.7%, while preserving final model quality.
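
The abstract does not say how the toggle policy or the draft-length adaptation is implemented, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a fixed batch-size threshold as a proxy for the memory-bound regime and a simple moving-average rule for the draft length. The class `SpecController`, its fields, and all threshold values are hypothetical. `expected_tokens_per_step` uses the standard speculative-decoding estimate, which assumes that each drafted token is accepted independently with probability `alpha`.

```python
from dataclasses import dataclass


def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step with draft length k,
    assuming each drafted token is accepted independently with prob. alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # all k drafts accepted plus one bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


@dataclass
class SpecController:
    # All defaults below are illustrative assumptions, not values from the paper.
    batch_threshold: int = 32   # speculate only at or below this active batch size
    min_k: int = 1              # smallest draft length
    max_k: int = 8              # largest draft length
    low: float = 0.5            # shrink the draft when acceptance falls below this
    high: float = 0.8           # grow the draft when acceptance rises above this
    ema: float = 0.9            # smoothing factor for the acceptance estimate
    k: int = 4                  # current draft length
    accept_rate: float = 0.7    # running acceptance-rate estimate

    def should_speculate(self, active_batch: int) -> bool:
        """Toggle: small active batches are treated as memory-bound."""
        return active_batch <= self.batch_threshold

    def update(self, accepted: int, drafted: int) -> int:
        """Fold one verification result into the estimate and adapt k."""
        if drafted > 0:
            observed = accepted / drafted
            self.accept_rate = (
                self.ema * self.accept_rate + (1.0 - self.ema) * observed
            )
        if self.accept_rate > self.high:
            self.k = min(self.k + 1, self.max_k)
        elif self.accept_rate < self.low:
            self.k = max(self.k - 1, self.min_k)
        return self.k
```

In a rollout loop, such a controller would be consulted once per decoding step: when `should_speculate` returns false, the engine decodes autoregressively; otherwise it drafts `k` tokens, verifies them in parallel, and passes the accepted count to `update`. A real system would more plausibly derive the toggle from measured hardware utilization than from a fixed threshold.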