Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models
December 15, 2025
作者: Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai, Yang Chen, Yangyi Chen, Zhuolin Yang, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
cs.AI
Abstract
Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.
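The core idea above, replacing one blended multi-domain RL run with a sequence of homogeneous, domain-wise stages (RLHF alignment first, then RLVR per domain), can be sketched as follows. This is a minimal toy illustration under assumed names: `Stage`, `cascade_rl`, and the dict-based policy are hypothetical stand-ins, not the paper's actual implementation, and the "update" functions only mimic per-domain skill gains rather than real RLHF/RLVR optimization.

```python
# Toy sketch of cascaded domain-wise RL (Cascade RL), contrasted with
# blending heterogeneous prompts into a single RL run. All names here
# are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass
from typing import Callable, Dict, List

Policy = Dict[str, float]  # toy stand-in for model parameters


@dataclass
class Stage:
    name: str                               # e.g. "rlhf-alignment", "rlvr-math"
    prompts: List[str]                      # homogeneous prompts from ONE domain
    update: Callable[[Policy, str], None]   # one RL update step for this domain
    epochs: int = 1


def cascade_rl(policy: Policy, stages: List[Stage]) -> Policy:
    """Run stages sequentially: each stage sees only its own domain's
    prompts, so response-length curricula and hyperparameters can be
    tuned per domain instead of for a heterogeneous mixture."""
    for stage in stages:
        for _ in range(stage.epochs):
            for prompt in stage.prompts:
                stage.update(policy, prompt)
    return policy


def make_update(skill: str, gain: float) -> Callable[[Policy, str], None]:
    # Hypothetical update rule: each domain stage nudges its own skill score.
    def update(policy: Policy, prompt: str) -> None:
        policy[skill] = policy.get(skill, 0.0) + gain
    return update


policy: Policy = {}
stages = [
    Stage("rlhf-alignment", ["chat-1", "chat-2"], make_update("alignment", 1.0)),
    Stage("rlvr-math", ["math-1"], make_update("math", 1.0)),
    Stage("rlvr-code", ["code-1", "code-2"], make_update("code", 1.0)),
]
trained = cascade_rl(policy, stages)
print(trained)  # each domain's skill updated in sequence, alignment first
```

The sequential ordering is the design point: because every stage's rollout batch is homogeneous, verification latency and response lengths are uniform within a stage, which is what the abstract argues simplifies infrastructure and curriculum design relative to mixing domains in one batch.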