Ovis-Image Technical Report
November 28, 2025
Authors: Guo-Hua Wang, Liangfu Cao, Tianyu Cui, Minghao Fu, Xiaohao Chen, Pengxin Zhan, Jianshan Zhao, Lan Li, Bowen Fu, Jiaqi Liu, Qing-Guo Chen
cs.AI
Abstract
We introduce Ovis-Image, a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. Built upon our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT-4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that combining a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient to achieve reliable bilingual text rendering without resorting to oversized or proprietary models.
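The abstract describes the architecture as a diffusion-based visual decoder conditioned on a multimodal language backbone. The sketch below is a minimal, purely illustrative PyTorch rendering of that composition; the class names, tiny dimensions, mean-pooled conditioning, and four-step sampling loop are all assumptions for illustration, not the actual Ovis-Image implementation.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two components named in the abstract. The real
# Ovis 2.5 backbone and diffusion decoder are far larger; the toy sizes here only
# illustrate the data flow: prompt tokens -> backbone hidden states ->
# conditioning signal -> iterative denoising of image latents.

class MultimodalBackbone(nn.Module):
    """Placeholder for the multimodal backbone: maps prompt tokens to hidden states."""
    def __init__(self, vocab_size=32000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):
        return self.encoder(self.embed(token_ids))  # (B, T, dim)

class DiffusionDecoder(nn.Module):
    """Placeholder denoiser: predicts noise for image latents given text states."""
    def __init__(self, latent_dim=64, cond_dim=256):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, latent_dim)
        self.net = nn.Sequential(
            nn.Linear(latent_dim * 2 + 1, 256), nn.SiLU(), nn.Linear(256, latent_dim)
        )

    def forward(self, noisy_latents, t, text_states):
        cond = self.cond_proj(text_states.mean(dim=1))   # pool text states (B, latent_dim)
        t = t.unsqueeze(-1).float()                      # timestep as a scalar feature
        return self.net(torch.cat([noisy_latents, cond, t], dim=-1))

backbone = MultimodalBackbone()
decoder = DiffusionDecoder()
prompt = torch.randint(0, 32000, (1, 16))   # tokenized prompt (hypothetical tokenizer)
latents = torch.randn(1, 64)                # start from Gaussian noise
for step in reversed(range(4)):             # drastically shortened sampling loop
    t = torch.full((1,), step)
    eps = decoder(latents, t, backbone(prompt))
    latents = latents - 0.1 * eps           # toy update rule, not a real noise scheduler
print(latents.shape)                        # denoised latents, to be decoded into pixels
```

In a full system, the denoised latents would be decoded to pixels (e.g., by a VAE decoder) and the sampling loop replaced by a proper noise schedule; the sketch only shows how a text backbone's hidden states can condition each denoising step.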