

UloRL: An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models' Reasoning Abilities

July 26, 2025
Authors: Dong Du, Shulin Liu, Tao Yang, Shaohua Chen, Yang Li
cs.AI

Abstract

Recent advances in large language models (LLMs) have highlighted the potential of reinforcement learning with verifiable rewards (RLVR) to enhance reasoning capabilities through extended output sequences. However, traditional RL frameworks face inefficiencies when handling ultra-long outputs due to long-tail sequence distributions and entropy collapse during training. To address these challenges, we propose an Ultra-Long Output Reinforcement Learning (UloRL) approach for advancing large language models' reasoning abilities. Specifically, we divide ultra-long output decoding into short segments, enabling efficient training by mitigating delays caused by long-tail samples. Additionally, we introduce dynamic masking of well-Mastered Positive Tokens (MPTs) to prevent entropy collapse. Experimental results demonstrate the effectiveness of our approach. On the Qwen3-30B-A3B model, RL with segment rollout achieved a 2.06x increase in training speed, while RL training with 128k-token outputs improved the model's performance on AIME2025 from 70.9% to 85.1% and on BeyondAIME from 50.7% to 61.9%, even surpassing Qwen3-235B-A22B with remarkable gains. These findings underscore the potential of our methods to advance the reasoning capabilities of LLMs with ultra-long sequence generation. We will release our code and model for further use by the community.
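
To make the MPT-masking idea concrete, the minimal sketch below shows one plausible way to drop well-mastered positive tokens from a token-level policy-gradient loss; it is not the paper's released implementation, and the function name, tensor shapes, and the mastery_threshold hyperparameter are assumptions for illustration only.

```python
import torch

def mpt_loss_mask(token_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  mastery_threshold: float = -0.05) -> torch.Tensor:
    """Illustrative sketch of dynamic masking of well-Mastered Positive Tokens (MPTs).

    Tokens in positive-advantage (verified-correct) rollouts that the policy
    already predicts with near-certainty are excluded from the policy-gradient
    loss, so they are not reinforced further and entropy is not driven toward zero.

    token_logprobs: (batch, seq_len) log-probs of the sampled tokens under the current policy
    advantages:     (batch,) sequence-level advantages from the verifiable reward
    mastery_threshold: assumed hyperparameter; log-prob above which a token
                       counts as "well mastered" (-0.05 corresponds to prob > ~0.95)
    """
    positive = (advantages > 0).unsqueeze(-1)                # (batch, 1): verified-correct rollouts
    mastered = token_logprobs.detach() > mastery_threshold   # (batch, seq_len): near-certain tokens
    keep = ~(positive & mastered)                            # mask out tokens that are both
    return keep.float()

# Usage (illustrative): given per_token_loss of shape (batch, seq_len),
#   mask = mpt_loss_mask(token_logprobs, advantages)
#   loss = (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```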