Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math
April 30, 2025
Authors: Haoran Xu, Baolin Peng, Hany Awadalla, Dongdong Chen, Yen-Chun Chen, Mei Gao, Young Jin Kim, Yunsheng Li, Liliang Ren, Yelong Shen, Shuohang Wang, Weijian Xu, Jianfeng Gao, Weizhu Chen
cs.AI
Abstract
Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities
in Large Language Models (LLMs) by training them to explicitly generate
intermediate reasoning steps. While LLMs readily benefit from such techniques,
improving reasoning in Small Language Models (SLMs) remains challenging due to
their limited model capacity. Recent work by Deepseek-R1 demonstrates that
distillation from LLM-generated synthetic data can substantially improve the
reasoning ability of SLMs. However, the detailed modeling recipe is not
disclosed. In this work, we present a systematic training recipe for SLMs that
consists of four steps: (1) large-scale mid-training on diverse distilled
long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3)
Rollout DPO leveraging a carefully curated preference dataset, and (4)
Reinforcement Learning (RL) with Verifiable Reward. We apply our method to
Phi-4-Mini, a compact 3.8B-parameter model. The resulting Phi-4-Mini-Reasoning
model outperforms much larger reasoning models on math reasoning tasks, e.g.,
beating DeepSeek-R1-Distill-Qwen-7B by 3.2 points and
DeepSeek-R1-Distill-Llama-8B by 7.7 points on Math-500. Our results validate
that a carefully designed training recipe with large-scale, high-quality CoT
data can unlock strong reasoning capabilities even in resource-constrained
small models.