Phi-4-reasoning Technical Report

Abstract

We introduce Phi-4-reasoning, a 14-billion-parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on a carefully curated set of "teachable" prompts (selected for the right level of complexity and diversity) and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B and approach the performance of the full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.

