Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math
April 30, 2025
Authors: Haoran Xu, Baolin Peng, Hany Awadalla, Dongdong Chen, Yen-Chun Chen, Mei Gao, Young Jin Kim, Yunsheng Li, Liliang Ren, Yelong Shen, Shuohang Wang, Weijian Xu, Jianfeng Gao, Weizhu Chen
cs.AI
Abstract
Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities
in Large Language Models (LLMs) by training them to explicitly generate
intermediate reasoning steps. While LLMs readily benefit from such techniques,
improving reasoning in Small Language Models (SLMs) remains challenging due to
their limited model capacity. Recent work on DeepSeek-R1 demonstrates that
distillation from LLM-generated synthetic data can substantially improve the
reasoning ability of SLMs. However, the detailed modeling recipe is not
disclosed. In this work, we present a systematic training recipe for SLMs that
consists of four steps: (1) large-scale mid-training on diverse distilled
long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3)
Rollout DPO leveraging a carefully curated preference dataset, and (4)
Reinforcement Learning (RL) with Verifiable Reward. We apply our method to
Phi-4-Mini, a compact 3.8B-parameter model. The resulting Phi-4-Mini-Reasoning
model outperforms much larger reasoning models on math reasoning tasks, e.g.,
exceeding DeepSeek-R1-Distill-Qwen-7B by 3.2 points and
DeepSeek-R1-Distill-Llama-8B by 7.7 points on Math-500. Our results validate
that a carefully designed training recipe with large-scale, high-quality CoT
data is effective in unlocking strong reasoning capabilities even in
resource-constrained small models.
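The abstract does not specify how "Verifiable Reward" is computed in step (4). A minimal sketch of the common pattern for math RL, assuming the model emits its final answer in a LaTeX `\boxed{...}` span and the reward is a binary exact match against the gold answer (function names here are illustrative, not from the paper):

```python
import re

def extract_boxed(text: str):
    """Return the content of the last \\boxed{...} span in a rollout, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def verifiable_reward(rollout: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the rollout's final boxed answer matches the gold answer."""
    pred = extract_boxed(rollout)
    return 1.0 if pred is not None and pred == gold_answer.strip() else 0.0

rollout = "Step 1: 2 + 2 = 4. Therefore the answer is \\boxed{4}."
print(verifiable_reward(rollout, "4"))  # → 1.0
```

In practice, reward checkers for math benchmarks normalize answers (fractions, units, whitespace) before comparison; the exact-match rule above is the simplest verifiable variant.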