Breaking the Entropy Limit: Accelerating MTP-Based Reinforcement Learning Training with Rejection Sampling

Abstract

Reinforcement learning (RL) has become a key component of modern large language models, yet the rollout stage remains the main bottleneck in RL training pipelines. Multi-Token Prediction (MTP) offers a natural way to accelerate rollouts through speculative decoding, but many studies have observed that MTP acceptance rates degrade significantly during RL training, which limits the achievable speedup. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes for integrating MTP into large-scale RL pipelines. First, we reveal that the MTP acceptance rate is fundamentally bounded by fluctuations in model entropy and exhibits a clear negative linear relationship with the rise of entropy during the RL stage. Second, we show that, compared to greedy draft sampling, probabilistic rejection sampling largely alleviates the disturbance that entropy introduces in RL. We further find that the conventional MTP training objectives (cross-entropy or KL) are suboptimal in this setting, and we therefore propose a novel end-to-end (e2e) TV loss that directly optimizes the multi-step rejection sampling acceptance rate. This loss yields roughly 10% acceptance rate improvements, acceptance rates of up to 95%, and up to 25% extra inference throughput across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with the e2e TV loss and rejection sampling maintains a consistent acceptance rate and speedup throughout the entire RL run, eliminating the need for costly online MTP updates. Extensive experiments and analysis validate these findings: our method achieves up to 1.8x end-to-end acceleration in asynchronous RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
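
To make the link between acceptance rate, entropy, and a TV-style objective concrete, the following is a minimal single-step sketch, not the paper's implementation. It assumes that "TV" stands for total variation distance and that acceptance follows the standard speculative-decoding rule (draft token x sampled from q, accepted with probability min(1, p(x)/q(x))), for which the expected acceptance rate equals 1 - TV(p, q). It also assumes that "greedy draft sampling" means the draft head always proposes its argmax token, which is then accepted with probability p(x). All function names and the toy logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into a probability distribution at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def rejection_sampling_acceptance(p, q):
    """Expected acceptance when x ~ q is accepted with prob min(1, p(x)/q(x)).

    This sums to sum_x min(p(x), q(x)), which equals 1 - TV(p, q).
    """
    return sum(min(a, b) for a, b in zip(p, q))


def greedy_draft_acceptance(p, q):
    """Assumed greedy variant: the draft proposes argmax q deterministically.

    With a point-mass proposal, the token is accepted with probability p(x).
    """
    top = max(range(len(q)), key=q.__getitem__)
    return p[top]


# Toy target logits (hypothetical); raising the temperature raises entropy.
target_logits = [2.0, 1.0, 0.0, -1.0]
for temperature in (0.5, 1.0, 2.0):
    p = softmax(target_logits, temperature)
    q = p  # idealized draft head that matches the target exactly
    print(
        f"T={temperature}: entropy={entropy(p):.3f}, "
        f"rejection={rejection_sampling_acceptance(p, q):.3f}, "
        f"greedy={greedy_draft_acceptance(p, q):.3f}"
    )
```

In this toy setting, even a draft head that matches the target exactly is accepted only with probability max p(x) under the assumed greedy rule, and that value falls as entropy rises, whereas the rejection sampling acceptance stays at 1. Under the same assumptions, minimizing TV(p, q) is equivalent to maximizing the single-step acceptance rate; the multi-step objective described in the abstract is not reproduced here.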