

Phi-4-reasoning Technical Report

April 30, 2025
Authors: Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, Piero Kauffmann, Yash Lara, Caio César Teodoro Mendes, Arindam Mitra, Besmira Nushi, Dimitris Papailiopoulos, Olli Saarikivi, Shital Shah, Vaishnavi Shrivastava, Vibhav Vineet, Yue Wu, Safoora Yousefi, Guoqing Zheng
cs.AI

Abstract

We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on a carefully curated set of "teachable" prompts, selected for the right level of complexity and diversity, and on reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B and approach the performance level of the full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models, and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.
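To make the "outcome-based" reinforcement learning mentioned above concrete, here is a minimal, hypothetical sketch of a binary outcome reward for math-style prompts: the model's final boxed answer is pulled out of its reasoning trace and compared against a reference answer. The helper names (`extract_final_answer`, `outcome_reward`) and the `\boxed{}` answer convention are illustrative assumptions for this sketch, not details taken from the report, which does not publish its reward implementation.

```python
import re


def extract_final_answer(trace: str) -> str | None:
    """Return the content of the last \\boxed{...} span in a reasoning trace,
    or None if the trace never states a boxed final answer.
    (Hypothetical convention assumed for this sketch; Python 3.10+.)"""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", trace)
    return matches[-1].strip() if matches else None


def outcome_reward(trace: str, reference: str) -> float:
    """Binary outcome-based reward: 1.0 if the extracted final answer
    exactly matches the reference answer, 0.0 otherwise."""
    answer = extract_final_answer(trace)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


if __name__ == "__main__":
    trace = "Let x = 3. Then 2x + 1 = 7, so the answer is \\boxed{7}."
    print(outcome_reward(trace, "7"))  # 1.0 (correct final answer)
    print(outcome_reward(trace, "8"))  # 0.0 (incorrect final answer)
```

In an RL phase of this kind, such a reward would score sampled completions only on whether the final answer is correct, leaving the model free to lengthen its intermediate reasoning; exact-match comparison is used here purely for simplicity.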