Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Abstract

We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. Compared with Nemotron-Cascade 1, there are two key technical advancements. First, after SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Second, we introduce multi-domain on-policy distillation throughout the Cascade RL process, using the strongest intermediate teacher model for each domain, which allows us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoints and training data.
