ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning
September 25, 2025
Authors: Qizhi Pei, Zhuoshi Pan, Honglin Lin, Xin Gao, Yu Li, Zinan Tang, Conghui He, Rui Yan, Lijun Wu
cs.AI
Abstract
Large Reasoning Models (LRMs) have shown impressive capabilities in complex
problem-solving, often benefiting from training on difficult mathematical
problems that stimulate intricate reasoning. Recent efforts have explored
automated synthesis of mathematical problems by prompting proprietary models or
large-scale open-source models from seed data or inherent mathematical
concepts. However, scaling up these methods remains challenging due to their
high computational/API cost, complexity of prompting, and limited difficulty
level of the generated problems. To overcome these limitations, we propose
ScaleDiff, a simple yet effective pipeline designed to scale the creation of
difficult problems. We efficiently identify difficult problems from existing
datasets with only a single forward pass using an adaptive thinking model,
which can perceive problem difficulty and automatically switch between
"Thinking" and "NoThinking" modes. We then train a specialized difficult
problem generator (DiffGen-8B) on this filtered difficult data, which can
produce new difficult problems at scale, eliminating the need for
complex, per-instance prompting and its associated high API costs. Fine-tuning
Qwen2.5-Math-7B-Instruct on the ScaleDiff-Math dataset yields a substantial
performance increase of 11.3% compared to the original dataset and achieves a
65.9% average accuracy on AIME'24, AIME'25, HMMT-Feb'25, BRUMO'25, and MATH500,
outperforming recent strong LRMs like OpenThinker3. Notably, this performance
is achieved using the cost-efficient Qwen3-8B model as a teacher, demonstrating
that our pipeline can effectively transfer advanced reasoning capabilities
without relying on larger, more expensive teacher models. Furthermore, we
observe a clear scaling phenomenon in model performance on difficult benchmarks
as the quantity of difficult problems increases. Code:
https://github.com/QizhiPei/ScaleDiff.
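
The filtering step is the distinctive mechanism here, so a short illustration may help. Below is a minimal sketch, assuming the adaptive thinking model follows the common <think>...</think> convention in which an easy problem yields an empty thinking block and a hard one yields non-empty reasoning. The model name, prompt format, and token budget are placeholders rather than the paper's exact setup, and the "single forward pass" is approximated by one short greedy generation call per problem.

```python
# Hypothetical sketch of thinking-mode-based difficulty filtering.
# Assumptions (not from the paper): the filter model is Qwen3-style,
# its chat template wraps reasoning in <think>...</think>, and an
# empty thinking block signals an easy ("NoThinking") problem.
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen3-8B"  # placeholder adaptive thinking model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def is_difficult(problem: str, max_new_tokens: int = 256) -> bool:
    # One generation call per problem: the problem is treated as
    # difficult iff the model opts into its "Thinking" mode, i.e.
    # produces non-empty content inside the <think>...</think> block.
    messages = [{"role": "user", "content": problem}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # Tolerate a truncated block: any non-whitespace thinking content counts.
    match = re.search(r"<think>\s*(.*?)\s*(?:</think>|$)", completion, re.DOTALL)
    return bool(match and match.group(1))

# Keep only problems that trigger the "Thinking" mode.
seed_problems = ["What is 2 + 2?", "Prove that sqrt(2) is irrational."]
difficult_problems = [p for p in seed_problems if is_difficult(p)]
```

In the pipeline described above, the surviving problems would then be used to train the DiffGen-8B generator and to assemble ScaleDiff-Math; this sketch covers only the filtering stage.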