Breaking the Entropy Limit: Accelerating RL Training with MTP and Rejection Sampling

Abstract

Reinforcement learning (RL) has become a core component of modern large language models, yet the rollout stage remains the main bottleneck in RL training pipelines. Multi-Token Prediction (MTP) offers a natural way to accelerate rollouts through speculative decoding, but many studies have observed that MTP acceptance rates degrade significantly during RL training, which limits the achievable speedup. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes for integrating MTP into large-scale RL pipelines. First, we show that the MTP acceptance rate is fundamentally bounded by fluctuations in model entropy: it exhibits a clear negative linear relationship with the rise of entropy during the RL stage. Second, we show that, compared with greedy draft sampling, probabilistic rejection sampling largely alleviates the disturbance that entropy introduces in RL. We further find that the conventional MTP training objectives (cross-entropy or KL) are suboptimal in this setting, and we therefore propose a novel end-to-end (e2e) TV loss that directly optimizes the multi-step rejection sampling acceptance rate. This yields ~10% acceptance rate improvements, with acceptance rates of up to 95% and up to 25% extra inference throughput across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with the e2e TV loss and rejection sampling maintains a consistent acceptance rate and speedup throughout the entire RL run, eliminating the need for costly online MTP updates. We provide extensive experiments and analysis that validate these findings. Experimental results show that our method achieves up to 1.8x end-to-end acceleration in asynchronous RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
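
To make the contrast between rejection sampling and greedy drafting concrete, the following is a minimal single-token sketch and not the paper's implementation. It assumes that "TV" stands for total variation distance and relies on the standard speculative-sampling identity that a token drafted from q and verified against the target p is accepted with probability sum_x min(p(x), q(x)) = 1 - TV(p, q). It also assumes that "greedy draft sampling" means the draft always proposes argmax q, in which case lossless verification accepts with probability p(argmax q). The function names and the toy distributions are hypothetical; the multi-step e2e TV loss itself is not reproduced here.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def rejection_acceptance(p, q):
    """Acceptance probability of one token drafted from q and verified
    against target p with standard speculative rejection sampling."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def greedy_acceptance(p, q):
    """Acceptance probability when the draft always proposes argmax q
    (assumed reading of greedy drafting) and the target samples from p."""
    top = max(range(len(q)), key=lambda i: q[i])
    return p[top]


# Toy target (p) and draft (q) distributions over a 3-token vocabulary.
p = [0.5, 0.3, 0.2]
q = [0.6, 0.25, 0.15]

tv = total_variation(p, q)
print("TV(p, q):", tv)
print("rejection sampling acceptance:", rejection_acceptance(p, q))
print("1 - TV(p, q):", 1.0 - tv)
print("greedy draft acceptance:", greedy_acceptance(p, q))
```

In this toy case the sampled draft is accepted with probability of about 0.9, while the greedy draft is accepted with probability 0.5. Under these assumptions, greedy acceptance is capped by the target's probability on a single token, which shrinks as the target distribution flattens, whereas rejection-sampling acceptance depends only on how closely q matches p. That is consistent with training the draft head to reduce TV directly.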