Breaking the Entropy Bound: Accelerating Reinforcement Learning Training with MTP and Rejection Sampling

Abstract

Reinforcement learning (RL) has become a key component of modern large language models (LLMs), yet the rollout stage remains the main bottleneck in RL training pipelines. Multi-Token Prediction (MTP) offers a natural way to accelerate rollouts through speculative decoding, but many studies have observed that MTP acceptance rates degrade significantly during RL training, which limits the achievable speedup. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes for integrating MTP into large-scale RL pipelines. First, we reveal that the MTP acceptance rate is fundamentally bounded by fluctuations in model entropy: it exhibits a clear negative linear relationship with the rise of entropy during the RL stage. Second, we show that, compared to greedy draft sampling, probabilistic rejection sampling largely alleviates the disturbance that entropy introduces in RL. We further find that the conventional MTP training objectives (cross-entropy or KL divergence) are suboptimal in this setting, and we therefore propose a novel end-to-end total variation (TV) loss that directly optimizes the multi-step rejection sampling acceptance rate. This loss yields ~10% acceptance rate improvements, reaching up to 95% acceptance rates and up to 25% extra inference throughput across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with the end-to-end TV loss and rejection sampling maintains a consistent acceptance rate and speedup throughout RL training, eliminating the need for costly online MTP updates. Extensive experiments and analysis validate these findings. Our method achieves up to 1.8x end-to-end acceleration in asynchronous RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
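
The link between rejection sampling, total variation, and entropy can be illustrated with a small single-step sketch. It assumes the standard speculative-sampling rule, in which a token x drafted from q is accepted with probability min(1, p(x)/q(x)), so the expected acceptance rate equals sum_x min(p(x), q(x)) = 1 - TV(p, q). The abstract does not state the exact form of the multi-step loss, so this is not the authors' implementation; all function names and the toy logits are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a higher temperature gives a flatter, higher-entropy distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def tv_distance(p, q):
    """Total variation distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def accept_rejection_sampling(p, q):
    """Expected acceptance when the draft samples x ~ q and the verifier
    accepts with probability min(1, p[x] / q[x]); equals 1 - TV(p, q)."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def accept_greedy_draft(p, q):
    """Expected acceptance when the draft always proposes argmax q and the
    target samples from p: the proposal survives with probability p[argmax q]."""
    top = max(range(len(q)), key=q.__getitem__)
    return p[top]

if __name__ == "__main__":
    # Hypothetical toy logits: the draft head is a slightly perturbed copy of the target.
    target_logits = [2.0, 1.0, 0.5, 0.0, -1.0]
    draft_logits = [1.8, 1.2, 0.4, 0.1, -0.9]
    for temperature in (0.5, 1.0, 2.0):
        p = softmax(target_logits, temperature)
        q = softmax(draft_logits, temperature)
        print(
            f"T={temperature}: H(p)={entropy(p):.3f} "
            f"greedy={accept_greedy_draft(p, q):.3f} "
            f"rejection={accept_rejection_sampling(p, q):.3f}"
        )
```

In this toy setting, minimizing TV(p, q) is the same as maximizing the single-step acceptance rate, which is the intuition behind training the draft head on a TV-style objective rather than cross-entropy or KL divergence.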