ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning
September 25, 2025
Authors: Qizhi Pei, Zhuoshi Pan, Honglin Lin, Xin Gao, Yu Li, Zinan Tang, Conghui He, Rui Yan, Lijun Wu
cs.AI
Abstract
Large Reasoning Models (LRMs) have shown impressive capabilities in complex
problem-solving, often benefiting from training on difficult mathematical
problems that stimulate intricate reasoning. Recent efforts have explored
automated synthesis of mathematical problems by prompting proprietary models or
large-scale open-source models from seed data or inherent mathematical
concepts. However, scaling up these methods remains challenging due to their
high computational/API cost, complexity of prompting, and limited difficulty
level of the generated problems. To overcome these limitations, we propose
ScaleDiff, a simple yet effective pipeline designed to scale the creation of
difficult problems. We efficiently identify difficult problems from existing
datasets with only a single forward pass using an adaptive thinking model,
which can perceive problem difficulty and automatically switch between
"Thinking" and "NoThinking" modes. We then train a specialized difficult
problem generator (DiffGen-8B) on this filtered difficult data, which can
produce new difficult problems at scale, eliminating the need for
complex, per-instance prompting and its associated high API costs. Fine-tuning
Qwen2.5-Math-7B-Instruct on the ScaleDiff-Math dataset yields a substantial
performance increase of 11.3% compared to the original dataset and achieves a
65.9% average accuracy on AIME'24, AIME'25, HMMT-Feb'25, BRUMO'25, and MATH500,
outperforming recent strong LRMs like OpenThinker3. Notably, this performance
is achieved using the cost-efficient Qwen3-8B model as a teacher, demonstrating
that our pipeline can effectively transfer advanced reasoning capabilities
without relying on larger, more expensive teacher models. Furthermore, we
observe a clear scaling phenomenon in model performance on difficult benchmarks
as the quantity of difficult problems increases. Code:
https://github.com/QizhiPei/ScaleDiff.
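
The filtering step is the distinctive mechanism here, so a short illustration may help. Below is a minimal sketch, assuming the adaptive thinking model follows the common <think>...</think> convention in which an easy problem yields an empty thinking block and a hard one yields non-empty reasoning. The model name, prompt format, and token budget are placeholders rather than the paper's exact setup, and the "single forward pass" is approximated by one short greedy generation call per problem.

```python
# Hypothetical sketch of thinking-mode-based difficulty filtering.
# Assumptions (not from the paper): the filter model is Qwen3-style,
# its chat template wraps reasoning in <think>...</think>, and an
# empty thinking block signals an easy ("NoThinking") problem.
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen3-8B"  # placeholder adaptive thinking model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def is_difficult(problem: str, max_new_tokens: int = 256) -> bool:
    # One generation call per problem: the problem is treated as
    # difficult iff the model opts into its "Thinking" mode, i.e.
    # produces non-empty content inside the <think>...</think> block.
    messages = [{"role": "user", "content": problem}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # Tolerate a truncated block: any non-whitespace thinking content counts.
    match = re.search(r"<think>\s*(.*?)\s*(?:</think>|$)", completion, re.DOTALL)
    return bool(match and match.group(1))

# Keep only problems that trigger the "Thinking" mode.
seed_problems = ["What is 2 + 2?", "Prove that sqrt(2) is irrational."]
difficult_problems = [p for p in seed_problems if is_difficult(p)]
```

In the pipeline described above, the surviving problems would then be used to train the DiffGen-8B generator and to assemble ScaleDiff-Math; this sketch covers only the filtering stage.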