Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math
April 30, 2025
Authors: Haoran Xu, Baolin Peng, Hany Awadalla, Dongdong Chen, Yen-Chun Chen, Mei Gao, Young Jin Kim, Yunsheng Li, Liliang Ren, Yelong Shen, Shuohang Wang, Weijian Xu, Jianfeng Gao, Weizhu Chen
cs.AI
Abstract
Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities
in Large Language Models (LLMs) by training them to explicitly generate
intermediate reasoning steps. While LLMs readily benefit from such techniques,
improving reasoning in Small Language Models (SLMs) remains challenging due to
their limited model capacity. Recent work by Deepseek-R1 demonstrates that
distillation from LLM-generated synthetic data can substantially improve the
reasoning ability of SLMs. However, the detailed modeling recipe is not
disclosed. In this work, we present a systematic training recipe for SLMs that
consists of four steps: (1) large-scale mid-training on diverse distilled
long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3)
Rollout DPO leveraging a carefully curated preference dataset, and (4)
Reinforcement Learning (RL) with Verifiable Reward. We apply our method to
Phi-4-Mini, a compact 3.8B-parameter model. The resulting Phi-4-Mini-Reasoning
model outperforms much larger reasoning models on math reasoning tasks, e.g.,
beating DeepSeek-R1-Distill-Qwen-7B by 3.2 points and
DeepSeek-R1-Distill-Llama-8B by 7.7 points on Math-500. Our results validate
that a carefully designed training recipe with large-scale, high-quality CoT
data can unlock strong reasoning capabilities even in resource-constrained
small models.