Ovis2.5 Technical Report
August 15, 2025
作者: Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, Yuxuan Han, Haijun Li, Wanying Chen, Junke Tang, Chengkun Hou, Zhixing Du, Tianli Zhou, Wenjie Zhang, Huping Ding, Jiahe Li, Wen Li, Gui Hu, Yiliang Gu, Siran Yang, Jiamang Wang, Hailong Sun, Yibo Wang, Hui Sun, Jinlong Huang, Yuping He, Shengze Shi, Weihong Zhang, Guodong Zheng, Junpeng Jiang, Sensen Gao, Yi-Feng Wu, Sijia Chen, Yuhui Chen, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
cs.AI
Abstract
We present Ovis2.5, a successor to Ovis2 designed for native-resolution
visual perception and strong multimodal reasoning. Ovis2.5 integrates a
native-resolution vision transformer that processes images at their native,
variable resolutions, avoiding the degradation from fixed-resolution tiling and
preserving both fine detail and global layout -- crucial for visually dense
content like complex charts. To strengthen reasoning, we train the model to
move beyond linear chain-of-thought and perform reflection -- including
self-checking and revision. This advanced capability is exposed as an optional
"thinking mode" at inference time, allowing users to trade latency for enhanced
accuracy on difficult inputs. The model is trained via a comprehensive
five-phase curriculum that progressively builds its skills. The process begins
with foundational visual and multimodal pretraining, advances through
large-scale instruction tuning, and culminates in alignment and reasoning
enhancement using direct preference optimization (DPO) and group relative
policy optimization (GRPO). To scale these upgrades efficiently, we employ
multimodal data packing and hybrid parallelism, yielding a significant
end-to-end speedup. We release two open-source models: Ovis2.5-9B and
Ovis2.5-2B. The latter continues the "small model, big performance" philosophy
of Ovis2, making it ideal for resource-constrained, on-device scenarios. On the
OpenCompass multimodal leaderboard, Ovis2.5-9B averages 78.3, marking a
substantial improvement over its predecessor, Ovis2-8B, and achieving
state-of-the-art results among open-source MLLMs in the sub-40B parameter
range; Ovis2.5-2B scores 73.9, establishing state-of-the-art performance for
its size. Beyond aggregate scores, Ovis2.5 achieves leading results on STEM
benchmarks, exhibits strong capabilities on grounding and video tasks, and
sets the open-source state of the art at its scale for complex chart analysis.
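
The optional "thinking mode" described above is a user-facing switch at inference time. Below is a minimal sketch of what such a toggle might look like, assuming a Hugging Face-style checkpoint loaded with trust_remote_code; the repo id "AIDC-AI/Ovis2.5-9B", the `chat` helper, and the `enable_thinking` flag are illustrative assumptions, not the confirmed API -- consult the released model cards for the actual interface.

```python
# Sketch of the inference-time latency/accuracy trade-off via "thinking mode".
# Assumptions (not confirmed by the abstract): repo id, `chat` helper, and
# `enable_thinking` flag are hypothetical; see the model card for the real API.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis2.5-9B",       # hypothetical repo id
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()

query = [
    {"type": "image", "image": "complex_chart.png"},
    {"type": "text", "text": "Which series grows fastest after 2020?"},
]

# Default path: linear chain-of-thought, lower latency.
answer = model.chat(query, enable_thinking=False, max_new_tokens=512)

# Thinking mode: the model may reflect, self-check, and revise before
# answering -- higher latency, better accuracy on difficult inputs.
answer_deliberate = model.chat(query, enable_thinking=True, max_new_tokens=2048)
```

The point of exposing the switch rather than always reasoning at length is that reflection tokens are expensive; easy queries can take the fast path while hard ones opt in.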
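The multimodal data packing mentioned for training efficiency is, in general, the practice of concatenating several variable-length (vision + text) samples into one fixed-capacity sequence so batches carry little padding. The sketch below illustrates the generic idea with first-fit-decreasing packing; the report's actual pipeline is not specified in the abstract, and the capacity and sample names are invented for the example.

```python
# Generic illustration of multimodal data packing: greedily pack
# variable-length samples into fixed-capacity sequences. The recorded
# boundaries let a trainer build a block-diagonal attention mask so
# packed samples cannot attend to each other.
from dataclasses import dataclass, field

MAX_LEN = 8192  # assumed context length per packed sequence

@dataclass
class PackedSequence:
    samples: list = field(default_factory=list)      # sample ids in this pack
    boundaries: list = field(default_factory=list)   # cumulative end offsets
    length: int = 0

def pack(sample_lengths: dict[str, int], max_len: int = MAX_LEN):
    """First-fit-decreasing bin packing over token counts."""
    packs: list[PackedSequence] = []
    for sid, n in sorted(sample_lengths.items(), key=lambda kv: -kv[1]):
        for p in packs:                 # first pack with room wins
            if p.length + n <= max_len:
                break
        else:                           # no pack fits: open a new one
            p = PackedSequence()
            packs.append(p)
        p.samples.append(sid)
        p.length += n
        p.boundaries.append(p.length)
    return packs

# Example: four samples of mixed modality and length (token counts invented).
for p in pack({"img_qa_0": 3100, "chart_1": 5000, "video_2": 2800, "doc_3": 900}):
    print(p.samples, p.boundaries, p.length)
```

Compared with padding every sample to the batch maximum, packing keeps almost every position in the sequence doing useful work, which is where the end-to-end speedup comes from.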