Phi-4-reasoning-vision-15B Technical Report

Abstract

We present Phi-4-reasoning-vision-15B, a compact open-weight multimodal reasoning model, and share the motivations, design choices, experiments, and lessons that informed its development. Our goal is to give the research community practical insight into building smaller, efficient multimodal reasoning models, and to release the result of these lessons as an open-weight model that performs well on common vision and language tasks and excels at scientific and mathematical reasoning and at understanding user interfaces. Our contributions include demonstrating that careful architecture choices and rigorous data curation enable smaller open-weight multimodal models to achieve competitive performance with significantly less training and inference-time compute and fewer tokens. The most substantial improvements come from systematic filtering, error correction, and synthetic augmentation, reinforcing that data quality remains the primary lever for model performance. Systematic ablations show that high-resolution, dynamic-resolution encoders yield consistent improvements, as accurate perception is a prerequisite for high-quality reasoning. Finally, a hybrid mix of reasoning and non-reasoning data with explicit mode tokens allows a single model to deliver fast direct answers for simpler tasks and chain-of-thought reasoning for complex problems.
