Phi-4-reasoning Technical Report

Abstract

We introduce Phi-4-reasoning, a 14-billion-parameter reasoning model that achieves strong performance on complex reasoning tasks. It is trained via supervised fine-tuning of Phi-4 on a carefully curated set of "teachable" prompts, selected for the right level of complexity and diversity, together with reasoning demonstrations generated using o3-mini. Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as the DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of the full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we also observe a non-trivial transfer of improvements to general-purpose benchmarks. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.
