ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning

Abstract

Large Reasoning Models (LRMs) have shown impressive capabilities in complex problem-solving, often benefiting from training on difficult mathematical problems that stimulate intricate reasoning. Recent efforts have explored automated synthesis of mathematical problems by prompting proprietary models or large-scale open-source models, starting from seed data or inherent mathematical concepts. However, scaling up these methods remains challenging because of their high computational and API costs, the complexity of prompting, and the limited difficulty of the generated problems. To overcome these limitations, we propose ScaleDiff, a simple yet effective pipeline designed to scale the creation of difficult problems. We efficiently identify difficult problems in existing datasets with only a single forward pass, using an adaptive thinking model that can perceive problem difficulty and automatically switch between "Thinking" and "NoThinking" modes. We then train a specialized difficult-problem generator (DiffGen-8B) on this filtered difficult data. The generator can produce new difficult problems at scale, eliminating the need for complex per-instance prompting and its associated high API costs. Fine-tuning Qwen2.5-Math-7B-Instruct on the ScaleDiff-Math dataset yields a substantial performance increase of 11.3% compared to the original dataset and achieves a 65.9% average accuracy on AIME'24, AIME'25, HMMT-Feb'25, BRUMO'25, and MATH500, outperforming recent strong LRMs such as OpenThinker3. Notably, this performance is achieved using the cost-efficient Qwen3-8B model as a teacher, demonstrating that our pipeline can effectively transfer advanced reasoning capabilities without relying on larger, more expensive teacher models. Furthermore, we observe a clear scaling phenomenon: model performance on difficult benchmarks improves as the quantity of difficult problems increases. Code: https://github.com/QizhiPei/ScaleDiff.
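The difficulty-filtering step described above can be illustrated with a minimal sketch. It assumes, as one reading of the abstract, that a problem counts as difficult when the adaptive thinking model selects "Thinking" mode for it. The function names `filter_difficult` and `toy_choose_mode` are hypothetical, and the toy mode selector is only a stand-in for a real model call; none of this is taken from the ScaleDiff codebase.

```python
from typing import Callable, Iterable, List

# Mode labels as named in the abstract.
THINKING = "Thinking"
NO_THINKING = "NoThinking"


def filter_difficult(
    problems: Iterable[str],
    choose_mode: Callable[[str], str],
) -> List[str]:
    """Keep only problems for which the mode selector returns Thinking.

    `choose_mode` is a hypothetical callable standing in for one forward
    pass of an adaptive thinking model that reports which mode it picks.
    """
    return [p for p in problems if choose_mode(p) == THINKING]


def toy_choose_mode(problem: str) -> str:
    """Toy stand-in (assumption): treat longer problems as harder.

    A real pipeline would query the adaptive thinking model instead.
    """
    return THINKING if len(problem.split()) > 8 else NO_THINKING


if __name__ == "__main__":
    pool = [
        "What is 2 + 2?",
        "Find all primes p such that p^2 + 2 is also prime and justify.",
    ]
    # Only problems routed to Thinking mode survive the filter.
    print(filter_difficult(pool, toy_choose_mode))
```

Passing the mode selector in as a callable keeps the filter independent of any particular model or inference library, so the toy heuristic can be swapped for a real model call without changing the filtering logic.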
