
Phi-4-reasoning Technical Report

April 30, 2025
Authors: Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, Piero Kauffmann, Yash Lara, Caio César Teodoro Mendes, Arindam Mitra, Besmira Nushi, Dimitris Papailiopoulos, Olli Saarikivi, Shital Shah, Vaishnavi Shrivastava, Vibhav Vineet, Yue Wu, Safoora Yousefi, Guoqing Zheng
cs.AI

Abstract

We introduce Phi-4-reasoning, a 14-billion-parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on a carefully curated set of "teachable" prompts, selected for the right level of complexity and diversity, and on reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B and approach the performance level of the full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.
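
The abstract describes Phi-4-reasoning-plus as trained with a short phase of outcome-based reinforcement learning, i.e., the reward is computed from the final answer of a generated trace rather than from individual reasoning steps. The snippet below is a minimal, hypothetical sketch of such an outcome-only reward; the "Final answer:" extraction format and the specific reward values are illustrative assumptions, not details taken from the report.

```python
import re


def outcome_reward(completion: str, reference_answer: str) -> float:
    """Outcome-based reward: score only the final answer of a reasoning trace,
    ignoring the intermediate steps.

    The 'Final answer:' format and the -1/0/1 reward values are illustrative
    assumptions, not the scheme used for Phi-4-reasoning-plus.
    """
    match = re.search(r"Final answer:\s*(.+)", completion, flags=re.IGNORECASE)
    if match is None:
        return -1.0  # no parseable final answer: penalize
    predicted = match.group(1).strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0


# Toy usage: a long reasoning trace is rewarded only for its final answer.
trace = "Let me work through this step by step...\nFinal answer: 42"
print(outcome_reward(trace, "42"))  # -> 1.0
```

Because only the outcome is scored, the policy is free to produce longer reasoning traces whenever doing so improves the chance of reaching a correct final answer, which is consistent with the behavior the abstract attributes to Phi-4-reasoning-plus.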