Phi-4-reasoning-vision-15B Technical Report

Abstract

We present Phi-4-reasoning-vision-15B, a compact open-weight multimodal reasoning model, and describe the motivations, design choices, experiments, and lessons that informed its development. Our goal is to give the research community practical insight into building smaller, more efficient multimodal reasoning models, and to release the result as an open-weight model that handles common vision-language tasks well and excels at scientific and mathematical reasoning and at understanding user interfaces. Our contributions include demonstrating that careful architecture choices and rigorous data curation enable smaller open-weight multimodal models to achieve competitive performance with significantly less training and inference-time compute and fewer tokens. The most substantial improvements come from systematic data filtering, error correction, and synthetic augmentation, reinforcing that data quality remains the primary lever for model performance. Systematic ablations show that high-resolution, dynamic-resolution encoders yield consistent improvements, since accurate perception is a prerequisite for high-quality reasoning. Finally, training on a hybrid mix of reasoning and non-reasoning data with explicit mode tokens allows a single model to give fast, direct answers on simpler tasks and chain-of-thought reasoning on complex problems.
