Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math

Abstract

Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities in Large Language Models (LLMs) by training them to explicitly generate intermediate reasoning steps. While LLMs readily benefit from such techniques, improving reasoning in Small Language Models (SLMs) remains challenging because of their limited model capacity. Recent work on DeepSeek-R1 demonstrates that distillation from LLM-generated synthetic data can substantially improve the reasoning ability of SLMs. However, the detailed modeling recipe is not disclosed. In this work, we present a systematic training recipe for SLMs that consists of four steps: (1) large-scale mid-training on diverse distilled long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3) Rollout DPO leveraging a carefully curated preference dataset, and (4) Reinforcement Learning (RL) with Verifiable Reward. We apply our method to Phi-4-Mini, a compact 3.8B-parameter model. On math reasoning tasks, the resulting Phi-4-Mini-Reasoning model exceeds much larger reasoning models: on Math-500 it outperforms DeepSeek-R1-Distill-Qwen-7B by 3.2 points and DeepSeek-R1-Distill-Llama-8B by 7.7 points. Our results validate that a carefully designed training recipe, combined with large-scale high-quality CoT data, can unlock strong reasoning capabilities even in resource-constrained small models.
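To make step (4) concrete, here is a minimal sketch of what a verifiable reward for math answers could look like. It is an illustration under stated assumptions, not the reward used in this work: the abstract does not describe the reward function, so the convention that the final answer appears in \boxed{...}, the whitespace-only normalization, the binary 1.0/0.0 scoring, and the helper names extract_boxed and verifiable_reward are all hypothetical.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    # Return the contents of the last \boxed{...} in `text`, or None.
    # Assumption: the model is prompted to put its final answer in \boxed{}.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces: treat as no answer


def normalize(answer: str) -> str:
    # Hypothetical normalization: drop all whitespace. A real checker would
    # likely also need to handle mathematically equivalent forms.
    return "".join(answer.split())


def verifiable_reward(response: str, reference: str) -> float:
    # Binary reward: 1.0 if the extracted final answer matches the reference
    # after normalization, otherwise 0.0.
    predicted = extract_boxed(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```

Because the reward is computed by a deterministic check against a known answer rather than by a learned reward model, it can be applied to any sampled response without additional annotation. Exact string matching is deliberately strict here; it would score 0.5 and \frac{1}{2} as different answers.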
