NEMOTRON-CROSSTHINK: Scaling Self-Learning beyond Math Reasoning

Abstract

Large Language Models (LLMs) have shown strong reasoning capabilities, particularly when enhanced through Reinforcement Learning (RL). Prior work has successfully applied RL to mathematical reasoning, where rules and correctness are well defined, but generalizing these methods to broader reasoning domains remains challenging due to limited data, the lack of verifiable reward structures, and diverse task requirements. In this work, we propose NEMOTRON-CROSSTHINK, a framework that systematically incorporates multi-domain corpora, including both synthetic and real-world question-answer pairs, into RL training to improve generalization across diverse reasoning tasks. NEMOTRON-CROSSTHINK addresses key challenges by (1) incorporating data from varied sources spanning STEM, humanities, social sciences, and other fields; (2) applying structured templates (e.g., multiple-choice and open-ended) to control answer-space complexity; (3) filtering for verifiable answers; and (4) optimizing data blending strategies that utilize data from multiple sources effectively. Our approach enables scalable and verifiable reward modeling beyond mathematics and demonstrates improved accuracies on both math (MATH-500: +30.1%, AMC23: +27.5%) and non-math reasoning benchmarks (MMLU-PRO: +12.8%, GPQA-DIAMOND: +11.3%, AGIEVAL: +15.1%, SUPERGPQA: +3.8%). Moreover, NEMOTRON-CROSSTHINK exhibits significantly improved response efficiency, using 28% fewer tokens for correct answers, which highlights more focused and effective reasoning. Through NEMOTRON-CROSSTHINK, we demonstrate that integrating multi-domain, multi-format data in RL leads to more accurate, efficient, and generalizable LLMs.
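The four steps listed in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `QAExample` type, the template wording, the five-word cutoff for open-ended answers, the `Answer:` output convention, and the blend weights are all assumptions chosen only to make the idea concrete.

```python
import random
import re
from dataclasses import dataclass
from typing import Optional

LETTERS = "ABCDEFGHIJ"  # assumption: at most 10 options per question


@dataclass
class QAExample:
    question: str
    answer: str                     # gold answer: option letter (MCQ) or short text
    domain: str                     # e.g. "math", "biology", "history"
    options: Optional[list] = None  # None means the question is open-ended


def render_prompt(ex: QAExample) -> str:
    """Step 2: apply a multiple-choice or open-ended template (hypothetical wording)."""
    if ex.options:
        lines = [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(ex.options)]
        body = ex.question + "\n" + "\n".join(lines)
        return body + "\nGive the letter of the correct option as 'Answer: <letter>'."
    return ex.question + "\nGive a short final answer as 'Answer: <text>'."


def is_verifiable(ex: QAExample, max_words: int = 5) -> bool:
    """Step 3: keep only examples whose gold answer a simple rule can check."""
    if ex.options:
        valid = LETTERS[: len(ex.options)]
        return len(ex.answer) == 1 and ex.answer in valid
    words = ex.answer.split()
    return 0 < len(words) <= max_words


def normalize(text: str) -> str:
    """Collapse whitespace, lowercase, and drop a trailing period."""
    return re.sub(r"\s+", " ", text).strip().lower().rstrip(".")


def rule_based_reward(response: str, ex: QAExample) -> float:
    """Return 1.0 if the last 'Answer:' line matches the gold answer, else 0.0."""
    matches = re.findall(r"Answer:\s*(.+)", response)
    if not matches:
        return 0.0
    return 1.0 if normalize(matches[-1]) == normalize(ex.answer) else 0.0


def blend(sources: dict, weights: dict, n: int, seed: int = 0) -> list:
    """Step 4: sample n examples, choosing each example's source by its weight."""
    rng = random.Random(seed)
    names = list(sources)
    picked = rng.choices(names, weights=[weights[k] for k in names], k=n)
    return [rng.choice(sources[name]) for name in picked]
```

In this sketch, a multi-domain pool (step 1) would first be filtered with `is_verifiable`, then sampled with `blend`, rendered with `render_prompt`, and scored during RL with `rule_based_reward`. The binary exact-match reward is the simplest verifiable signal and stands in for whatever reward the framework actually uses.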

