ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning
September 25, 2025
作者: Qizhi Pei, Zhuoshi Pan, Honglin Lin, Xin Gao, Yu Li, Zinan Tang, Conghui He, Rui Yan, Lijun Wu
cs.AI
Abstract
Large Reasoning Models (LRMs) have shown impressive capabilities in complex
problem-solving, often benefiting from training on difficult mathematical
problems that stimulate intricate reasoning. Recent efforts have explored
automated synthesis of mathematical problems by prompting proprietary or
large open-source models with seed data or underlying mathematical concepts.
However, these methods remain hard to scale because of their high
computational/API cost, the complexity of their prompts, and the limited
difficulty of the problems they generate. To overcome these limitations, we propose
ScaleDiff, a simple yet effective pipeline designed to scale the creation of
difficult problems. We efficiently identify difficult problems from existing
datasets with only a single forward pass using an adaptive thinking model,
which can perceive problem difficulty and automatically switch between
"Thinking" and "NoThinking" modes. We then train a specialized difficult
problem generator (DiffGen-8B) on this filtered difficult data, which can
produce new difficult problems at scale, eliminating the need for
complex, per-instance prompting and its associated high API costs. Fine-tuning
Qwen2.5-Math-7B-Instruct on the ScaleDiff-Math dataset yields a substantial
performance increase of 11.3% compared to the original dataset and achieves a
65.9% average accuracy on AIME'24, AIME'25, HMMT-Feb'25, BRUMO'25, and MATH500,
outperforming recent strong LRMs like OpenThinker3. Notably, this performance
is achieved using the cost-efficient Qwen3-8B model as a teacher, demonstrating
that our pipeline can effectively transfer advanced reasoning capabilities
without relying on larger, more expensive teacher models. Furthermore, we
observe a clear scaling phenomenon in model performance on difficult benchmarks
as the number of difficult problems increases. Code:
https://github.com/QizhiPei/ScaleDiff.
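
To make the filtering step concrete, below is a minimal sketch of how such a single-forward-pass difficulty filter could look. It assumes a Qwen3-style adaptive thinking model that wraps its reasoning in `<think>...</think>` tags and emits an immediately closed (empty) block in "NoThinking" mode; the model path, the tag convention, and the assumption that `</think>` is a single vocabulary token are illustrative, not details taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/adaptive-thinking-model"  # hypothetical checkpoint, not from the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
model.eval()

@torch.no_grad()
def is_difficult(problem: str) -> bool:
    """Single forward pass: after a forced "<think>" opener, check whether the
    model's most likely next token closes the block immediately (NoThinking)
    or begins substantive reasoning (Thinking => label the problem difficult)."""
    messages = [{"role": "user", "content": problem}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt + "<think>", return_tensors="pt").to(model.device)
    next_token_logits = model(**inputs).logits[0, -1]   # distribution over the next token
    close_id = tokenizer.convert_tokens_to_ids("</think>")
    return int(next_token_logits.argmax()) != close_id  # non-empty think block => hard

seed_problems = [
    "Compute 2 + 2.",
    "Let p be an odd prime. Show that 2^p + 3^p is never a perfect power.",
]
difficult = [p for p in seed_problems if is_difficult(p)]
```

In this reading, a problem is kept as "difficult" exactly when the model opts to open a non-empty reasoning block, so an entire corpus can be triaged at the cost of one forward pass per problem, with no per-instance prompt engineering or proprietary API calls.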