**Phi-4-reasoning-vision-15B Technical Report**

Abstract

We present Phi-4-reasoning-vision-15B, a compact open-weight multimodal reasoning model, and share the motivations, design choices, experiments, and learnings that informed its development. Our goal is to contribute practical insight to the research community on building smaller, efficient multimodal reasoning models and to share the result of these learnings as an open-weight model that is good at common vision and language tasks and excels at scientific and mathematical reasoning and understanding user interfaces. Our contributions include demonstrating that careful architecture choices and rigorous data curation enable smaller, open-weight multimodal models to achieve competitive performance with significantly less training and inference-time compute and tokens. The most substantial improvements come from systematic filtering, error correction, and synthetic augmentation -- reinforcing that data quality remains the primary lever for model performance. Systematic ablations show that high-resolution, dynamic-resolution encoders yield consistent improvements, as accurate perception is a prerequisite for high-quality reasoning. Finally, a hybrid mix of reasoning and non-reasoning data with explicit mode tokens allows a single model to deliver fast direct answers for simpler tasks and chain-of-thought reasoning for complex problems.