Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch

October 24, 2024
作者: Yuyang Ding, Xinyu Shi, Xiaobo Liang, Juntao Li, Qiaoming Zhu, Min Zhang
cs.AI

Abstract

The availability of high-quality data is one of the most important factors in improving the reasoning capability of LLMs. Existing works have demonstrated the effectiveness of creating more instruction data from seed questions or knowledge bases. Recent research indicates that continually scaling up data synthesis from strong models (e.g., GPT-4) can further elicit reasoning performance. Though promising, the open-source community still lacks high-quality data at scale and scalable data synthesis methods with affordable costs. To address this, we introduce ScaleQuest, a scalable and novel data synthesis method that utilizes "small-size" (e.g., 7B) open-source models to generate questions from scratch, without the need for seed data with complex augmentation constraints. With the efficient ScaleQuest, we automatically constructed a mathematical reasoning dataset consisting of 1 million problem-solution pairs, which is more effective than existing open-source datasets. It can universally increase the performance of mainstream open-source models (i.e., Mistral, Llama3, DeepSeekMath, and Qwen2-Math), yielding gains of 29.2% to 46.4% on MATH. Notably, simply fine-tuning the Qwen2-Math-7B-Base model with our dataset can even surpass Qwen2-Math-7B-Instruct, a strong and well-aligned model trained on closed-source data, as well as proprietary models such as GPT-4-Turbo and Claude-3.5 Sonnet.
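The abstract describes a two-stage pipeline: a small open-source model generates questions from scratch (no seed problems), and a solver model then produces solutions for them. The sketch below illustrates only that pipeline shape; `question_generator` and `solution_generator` are hypothetical stand-ins for the actual tuned 7B models, which the abstract does not specify in detail.

```python
# Minimal sketch of a ScaleQuest-style two-stage synthesis pipeline.
# Assumption: the real method fine-tunes small (~7B) open models as
# question and solution generators; the two functions below are stubs
# standing in for those model calls, purely for illustration.

from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    solution: str


def question_generator(prompt: str) -> str:
    # Stand-in for sampling a math question from a tuned open model.
    # Note there is no seed question: generation starts "from scratch".
    return f"Compute the sum of the first {len(prompt) + 1} positive integers."


def solution_generator(question: str) -> str:
    # Stand-in for a solver model; a real pipeline would also filter
    # candidate solutions (e.g., by answer checking or a reward model).
    return f"Apply the formula n(n+1)/2 to solve: {question}"


def synthesize(n: int) -> list[QAPair]:
    """Generate n question-solution pairs without any seed data."""
    pairs = []
    for i in range(n):
        q = question_generator("seedless prompt " + "." * i)
        s = solution_generator(q)
        pairs.append(QAPair(question=q, solution=s))
    return pairs


dataset = synthesize(3)
print(len(dataset))  # prints 3
```

Scaling the loop to 1 million pairs, plus quality filtering of both questions and solutions, is what the paper's method automates.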
