Breaking the Entropy Bound: Accelerating RL Training with MTP and Rejection Sampling

Abstract

Reinforcement learning (RL) has become a key component of modern large language models (LLMs), yet the rollout stage remains the main bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural way to accelerate rollouts through speculative decoding, many studies have observed that MTP acceptance rates degrade significantly during RL training, which limits the achievable speedup. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes for integrating MTP into large-scale RL pipelines. First, we reveal that the MTP acceptance rate is fundamentally bounded by fluctuations in model entropy: it exhibits a clear negative linear relationship with the rise of entropy during the RL stage. Second, we show that, compared to greedy draft sampling, probabilistic rejection sampling largely alleviates the disturbance that entropy introduces in RL. We further identify that the conventional MTP training objectives (cross-entropy or KL divergence) are suboptimal in such settings, and we therefore propose a novel end-to-end (e2e) total variation (TV) loss that directly optimizes the multi-step rejection sampling acceptance rate. This yields ~10% acceptance rate improvements, with acceptance rates of up to 95% and up to 25% extra inference throughput across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with the e2e TV loss and rejection sampling maintains a consistent acceptance rate and speedup throughout the entire RL run, eliminating the need for costly online MTP updates. We provide extensive experiments and analysis that validate our findings. Experimental results show that our method achieves up to 1.8x end-to-end acceleration in asynchronous RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
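
The link between rejection sampling, total variation, and entropy can be illustrated with a small single-step sketch. It is not the Bebop training objective or implementation; it relies only on the standard speculative-sampling identity that a draft token x ~ q, accepted with probability min(1, p(x)/q(x)), is accepted in expectation with probability sum_x min(p(x), q(x)) = 1 - TV(p, q). The greedy-draft case is modeled here under an assumption: the draft always proposes argmax q and the sampling target accepts it with probability p(argmax q). All function names, the toy vocabulary size, and the noise level are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature knob."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def accept_rate_rejection(p, q):
    """Expected one-step acceptance when x ~ q is accepted with
    probability min(1, p(x)/q(x)); equals 1 - TV(p, q)."""
    return float(np.minimum(p, q).sum())

def accept_rate_greedy_draft(p, q):
    """Assumed greedy-draft model: the draft proposes argmax q and the
    sampling target accepts it with probability p(argmax q)."""
    return float(np.asarray(p)[int(np.argmax(q))])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target_logits = rng.normal(size=32)                              # toy vocabulary of 32 tokens
    draft_logits = target_logits + 0.3 * rng.normal(size=32)         # imperfect draft head
    # Raising the temperature raises the entropy of both distributions.
    for temperature in (0.25, 0.5, 1.0, 2.0):
        p = softmax(target_logits, temperature)
        q = softmax(draft_logits, temperature)
        print(
            f"T={temperature:<4} H(p)={entropy(p):.3f} "
            f"rejection={accept_rate_rejection(p, q):.3f} "
            f"greedy={accept_rate_greedy_draft(p, q):.3f}"
        )
```

In this toy setting, the greedy-draft acceptance is capped by the probability the target assigns to a single token, which necessarily shrinks as the target distribution flattens, whereas the rejection-sampling acceptance depends only on how closely the draft distribution tracks the target. Minimizing TV(p, q) is therefore the same as maximizing this single-step acceptance rate; extending it to multi-step acceptance is where an end-to-end objective would come in.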