

Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

December 15, 2025
Authors: Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai, Yang Chen, Yangyi Chen, Zhuolin Yang, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
cs.AI

Abstract

Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.
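The cascaded approach described above can be illustrated with a minimal sketch: RL stages run sequentially, one domain at a time, rather than blending heterogeneous prompts from all domains into a single run. The stage names, the per-stage reward functions and length budgets, and the `rl_train` interface below are illustrative assumptions, not the paper's exact recipe; only the ordering (RLHF alignment as a pre-step, then domain-wise RLVR) follows the abstract.

```python
# Sketch of cascaded domain-wise RL (Cascade RL). All names and settings
# here are hypothetical placeholders, not the paper's actual recipe.

def rl_train(policy, name, prompts, reward_fn, max_response_len):
    """Placeholder for one RL stage (rollouts, reward scoring, policy updates)."""
    # A real stage would generate rollouts capped at max_response_len,
    # score them with reward_fn, and update the policy weights.
    trained = dict(policy)
    trained["stages_trained"] = policy.get("stages_trained", []) + [name]
    return trained

def cascade_rl(policy, stages):
    """Apply each domain's RL stage in sequence to the same policy,
    instead of mixing all domains' prompts into one heterogeneous run."""
    for stage in stages:
        policy = rl_train(policy, **stage)
    return policy

# RLHF alignment first, then verifiable-reward (RLVR) domains, each with
# its own response-length budget (a per-domain length curriculum).
stages = [
    {"name": "rlhf_alignment", "prompts": ["chat"],
     "reward_fn": lambda r: 0.0, "max_response_len": 4096},
    {"name": "rlvr_math", "prompts": ["math"],
     "reward_fn": lambda r: 1.0, "max_response_len": 32768},
    {"name": "rlvr_code", "prompts": ["code"],
     "reward_fn": lambda r: 1.0, "max_response_len": 65536},
]
policy = cascade_rl({}, stages)
```

Because each stage sees a single domain, per-domain settings such as the response-length cap and verification pipeline stay homogeneous within a stage, which is the engineering simplification the abstract claims over prompt-blending approaches.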