AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

Abstract

Despite recent progress in large-scale reinforcement learning (RL) for reasoning, the training recipe for building high-performing reasoning models remains elusive. Key implementation details of frontier models such as DeepSeek-R1, including data curation strategies and RL training recipes, are often omitted. Moreover, recent research indicates that distillation remains more effective than RL for smaller models. In this work, we demonstrate that large-scale RL can significantly enhance the reasoning capabilities of strong small- and mid-sized models, achieving results that surpass those of state-of-the-art distillation-based models. We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts. Notably, we find that math-only RL significantly improves the performance of strong distilled models not only on math benchmarks (e.g., +14.6% / +17.2% on AIME 2025 for the 7B / 14B models) but also on code reasoning tasks (e.g., +6.8% / +5.8% on LiveCodeBench for the 7B / 14B models). In addition, extended code-only RL iterations further improve performance on code benchmarks with minimal or no degradation in math results. We develop a robust data curation pipeline that collects challenging prompts with high-quality, verifiable answers and test cases, enabling verification-based RL across both domains. Finally, we identify key experimental insights, including curriculum learning with progressively increasing response lengths and the stabilizing effect of on-policy parameter updates. We find that RL not only elicits the foundational reasoning capabilities acquired during pretraining and supervised fine-tuning (e.g., distillation), but also pushes the limits of the model's reasoning ability, enabling it to solve problems that were previously unsolvable.
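As a rough illustration of what "verification-based RL" could mean in the two domains, the sketch below scores a math response by comparing its final answer to a reference and scores a code response by running it against test cases. The binary 0/1 reward, the string normalization, and the function names `math_reward` and `code_reward` are assumptions made for this example; the abstract does not specify the actual reward design.

```python
from typing import Callable, Sequence, Tuple


def _normalize(answer: str) -> str:
    """Strip whitespace and case so trivially different strings compare equal."""
    return answer.strip().replace(" ", "").lower()


def math_reward(predicted: str, reference: str) -> float:
    """Hypothetical math reward: 1.0 if the final answers match, else 0.0."""
    return 1.0 if _normalize(predicted) == _normalize(reference) else 0.0


def code_reward(
    solution: Callable[[int], int],
    test_cases: Sequence[Tuple[int, int]],
) -> float:
    """Hypothetical code reward: 1.0 only if every (input, expected) pair passes."""
    for arg, expected in test_cases:
        try:
            if solution(arg) != expected:
                return 0.0
        except Exception:
            # A crashing solution counts as a failed test case.
            return 0.0
    return 1.0


if __name__ == "__main__":
    print(math_reward(" 42 ", "42"))
    print(code_reward(lambda x: x * 2, [(1, 2), (3, 6)]))
```

Under these assumptions, a staged schedule like the one the abstract describes would use only `math_reward` in the math-only stage and only `code_reward` in the later code-only stage.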
