NEMOTRON-CROSSTHINK: Scaling Self-Learning beyond Math Reasoning

Abstract

Large Language Models (LLMs) have shown strong reasoning capabilities, particularly when enhanced through Reinforcement Learning (RL). While prior work has successfully applied RL to mathematical reasoning, where rules and correctness are well-defined, generalizing these methods to broader reasoning domains remains challenging due to limited data, the lack of verifiable reward structures, and diverse task requirements. In this work, we propose NEMOTRON-CROSSTHINK, a framework that systematically incorporates multi-domain corpora, including both synthetic and real-world question-answer pairs, into RL training to improve generalization across diverse reasoning tasks. NEMOTRON-CROSSTHINK addresses key challenges by (1) incorporating data from varied sources spanning STEM, humanities, social sciences, etc.; (2) applying structured templates (e.g., multiple-choice and open-ended) to control answer-space complexity; (3) filtering for verifiable answers; and (4) optimizing data blending strategies that utilize data from multiple sources effectively. Our approach enables scalable and verifiable reward modeling beyond mathematics and demonstrates improved accuracies on both math (MATH-500: +30.1%, AMC23: +27.5%) and non-math reasoning benchmarks (MMLU-PRO: +12.8%, GPQA-DIAMOND: +11.3%, AGIEVAL: +15.1%, SUPERGPQA: +3.8%). Moreover, NEMOTRON-CROSSTHINK exhibits significantly improved response efficiency, using 28% fewer tokens for correct answers, which highlights more focused and effective reasoning. Through NEMOTRON-CROSSTHINK, we demonstrate that integrating multi-domain, multi-format data in RL leads to more accurate, efficient, and generalizable LLMs.
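The four steps listed in the abstract (multi-domain data, answer templates, verifiability filtering, and data blending) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `QAItem`, `apply_template`, `is_verifiable`, `gold_label`, `blend`, and `reward`, the `max_words` threshold, the blend weights, and the exact-match reward are all assumptions chosen only to make the idea concrete.

```python
import random
from dataclasses import dataclass

LETTERS = "ABCDEFGHIJ"

@dataclass
class QAItem:
    """Hypothetical record for one question-answer pair."""
    domain: str            # e.g. "math", "science", "humanities"
    question: str
    answer: str            # reference answer text
    options: tuple = ()    # answer choices; empty for open-ended items

def apply_template(item):
    """Render a prompt: multiple-choice if options exist, else open-ended."""
    if item.options:
        lines = [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(item.options)]
        return item.question + "\n" + "\n".join(lines) + "\nAnswer with a single letter."
    return item.question + "\nGive a short final answer."

def is_verifiable(item, max_words=5):
    """Keep items whose reference answer a simple rule can check.
    The max_words cutoff is an illustrative assumption."""
    if item.options:
        return item.answer in item.options
    return 0 < len(item.answer.split()) <= max_words

def gold_label(item):
    """Target string the reward compares against."""
    if item.options:
        return LETTERS[item.options.index(item.answer)]
    return item.answer

def blend(pools, weights, n, seed=0):
    """Sample n items, choosing each item's domain in proportion to its weight."""
    rng = random.Random(seed)
    domains = list(weights)
    picks = rng.choices(domains, weights=[weights[d] for d in domains], k=n)
    return [rng.choice(pools[d]) for d in picks]

def reward(item, model_answer):
    """Rule-based reward: 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if model_answer.strip().lower() == gold_label(item).strip().lower() else 0.0
```

In this sketch the template constrains the answer space so that a plain string comparison can serve as the reward, which is what makes the signal checkable outside mathematics. A real pipeline would need more careful answer extraction and normalization.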
