InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning
August 9, 2024
Authors: Bo-Wen Zhang, Yan Yan, Lin Li, Guang Liu
cs.AI
Abstract
Recent advancements in Chain-of-Thoughts (CoT) and Program-of-Thoughts (PoT)
methods have greatly enhanced language models' mathematical reasoning
capabilities, facilitating their integration into instruction tuning datasets
with LLMs. However, existing methods for large-scale dataset creation require
substantial seed data and high computational costs for data synthesis, posing
significant challenges for scalability. We introduce InfinityMATH, a scalable
instruction tuning dataset for programmatic mathematical reasoning. The
construction pipeline emphasizes decoupling numbers from mathematical problems
to synthesize number-independent programs, enabling efficient and flexible
scaling while minimizing dependency on specific numerical values. Fine-tuning
experiments with open-source language and code models, such as Llama2 and
CodeLlama, demonstrate the practical benefits of InfinityMATH. These fine-tuned
models showed significant relative improvements on both in-domain and
out-of-domain benchmarks, ranging from 184.7% to 514.3% on average.
Additionally, these models exhibited high robustness on the GSM8K+ and MATH+
benchmarks, which are augmented versions of the test sets with only numerical
variations. InfinityMATH ensures that models are more versatile and effective
across a broader range of mathematical problems. The data is available at
https://huggingface.co/datasets/flagopen/InfinityMATH.
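The pipeline's core idea, decoupling numerals from a problem so that the solution program is number-independent, can be illustrated with a minimal sketch. The function names (`decouple_numbers`, `solution_program`), the regex-based number extraction, and the `{var0}`-style template format below are illustrative assumptions, not the paper's actual implementation:

```python
import random
import re

NUM_RE = re.compile(r"\d+(?:\.\d+)?")

def decouple_numbers(problem: str):
    """Abstract every numeric literal in a word problem into a placeholder
    ({var0}, {var1}, ...) and return the template plus the original values."""
    values = []

    def repl(match):
        values.append(float(match.group()))
        return "{var%d}" % (len(values) - 1)

    return NUM_RE.sub(repl, problem), values

def solution_program(var0: float, var1: float) -> float:
    """A number-independent, PoT-style program for the templated problem
    below; it is valid for any instantiation of the placeholders."""
    return var0 * var1

if __name__ == "__main__":
    problem = "A pen costs 3 dollars. How much do 12 pens cost?"
    template, values = decouple_numbers(problem)
    print(template)                   # A pen costs {var0} dollars. ... {var1} pens ...
    print(solution_program(*values))  # 36.0, the answer to the seed problem

    # Scaling step: sampling fresh numbers yields a new training instance
    # without another LLM call, which is what keeps synthesis cheap.
    new_values = [random.randint(2, 50) for _ in values]
    print(template.format(var0=new_values[0], var1=new_values[1]),
          "->", solution_program(*new_values))
```

Because the solution program is written against placeholders rather than literal values, new problem/program/answer triples can be generated by sampling fresh numbers alone, which is consistent with the abstract's claim of efficient, flexible scaling with minimal dependency on specific numerical values.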