SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers

Abstract

The demand for Large Language Models (LLMs) capable of sophisticated mathematical reasoning is growing across industries. However, the development of performant mathematical LLMs is critically bottlenecked by the scarcity of difficult, novel training data. We introduce SAND-Math (Synthetic Augmented Novel and Difficult Mathematics problems and solutions), a pipeline that addresses this by first generating high-quality problems from scratch and then systematically elevating their complexity via a new Difficulty Hiking step. We demonstrate the effectiveness of our approach through two key findings. First, augmenting a strong baseline with SAND-Math data significantly boosts performance, outperforming the next-best synthetic dataset by 17.85 absolute points on the AIME25 benchmark. Second, in a dedicated ablation study, we show that our Difficulty Hiking process is highly effective: by increasing average problem difficulty from 5.02 to 5.98, this step lifts AIME25 performance from 46.38% to 49.23%. The full generation pipeline, final dataset, and a fine-tuned model form a practical and scalable toolkit for building more capable and efficient mathematical reasoning LLMs. The SAND-Math dataset is released at https://huggingface.co/datasets/amd/SAND-MATH
