

Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math

April 30, 2025
Authors: Haoran Xu, Baolin Peng, Hany Awadalla, Dongdong Chen, Yen-Chun Chen, Mei Gao, Young Jin Kim, Yunsheng Li, Liliang Ren, Yelong Shen, Shuohang Wang, Weijian Xu, Jianfeng Gao, Weizhu Chen
cs.AI

Abstract

Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities in Large Language Models (LLMs) by training them to explicitly generate intermediate reasoning steps. While LLMs readily benefit from such techniques, improving reasoning in Small Language Models (SLMs) remains challenging due to their limited model capacity. Recent work on DeepSeek-R1 demonstrates that distillation from LLM-generated synthetic data can substantially improve the reasoning ability of SLMs. However, the detailed modeling recipe is not disclosed. In this work, we present a systematic training recipe for SLMs that consists of four steps: (1) large-scale mid-training on diverse distilled long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3) Rollout DPO leveraging a carefully curated preference dataset, and (4) Reinforcement Learning (RL) with Verifiable Reward. We apply our method to Phi-4-Mini, a compact 3.8B-parameter model. The resulting Phi-4-Mini-Reasoning model outperforms much larger reasoning models on math reasoning tasks, e.g., surpassing DeepSeek-R1-Distill-Qwen-7B by 3.2 points and DeepSeek-R1-Distill-Llama-8B by 7.7 points on Math-500. Our results validate that a carefully designed training recipe, built on large-scale, high-quality CoT data, can effectively unlock strong reasoning capabilities even in resource-constrained small models.
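
The fourth step of the recipe, RL with verifiable reward, depends on scoring rollouts against ground-truth answers. The paper does not disclose its exact reward implementation, so the snippet below is only a minimal illustrative sketch: it assumes the model is prompted to place its final answer in a \boxed{...} span (as is common on math benchmarks such as Math-500) and assigns a binary reward by exact match after light normalization. The function name and normalization choices are hypothetical, not taken from the paper.

```python
import re


def verifiable_math_reward(completion: str, reference_answer: str) -> float:
    """Illustrative binary reward for RL with verifiable reward.

    Assumes the final answer appears inside \\boxed{...}. Returns 1.0 if the
    extracted answer matches the reference exactly after stripping whitespace,
    and 0.0 otherwise (including when no parsable answer is found).
    """
    match = re.search(r"\\boxed\{([^{}]*)\}", completion)
    if match is None:
        return 0.0  # no parsable final answer -> no reward
    predicted = match.group(1).strip().replace(" ", "")
    reference = reference_answer.strip().replace(" ", "")
    return 1.0 if predicted == reference else 0.0


# Usage example: a correct rollout earns reward 1.0, an incorrect one 0.0.
print(verifiable_math_reward(r"... so the result is \boxed{42}.", "42"))  # 1.0
print(verifiable_math_reward(r"... therefore \boxed{41}.", "42"))         # 0.0
```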