Ovis-Image Technical Report

November 28, 2025
Authors: Guo-Hua Wang, Liangfu Cao, Tianyu Cui, Minghao Fu, Xiaohao Chen, Pengxin Zhan, Jianshan Zhao, Lan Li, Bowen Fu, Jiaqi Liu, Qing-Guo Chen
cs.AI

Abstract

We introduce Ovis-Image, a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. Built upon our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that combining a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient to achieve reliable bilingual text rendering without resorting to oversized or proprietary models.
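To make the described architecture concrete, the following is a minimal, hypothetical sketch of the overall composition the abstract outlines: a multimodal language backbone produces text conditioning features, and a diffusion-based visual decoder denoises an image latent under that conditioning. All class names, dimensions, and the conditioning mechanism here are illustrative assumptions, not the actual Ovis-Image implementation.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the text tower of a multimodal backbone
# (e.g., the role Ovis 2.5 plays in the paper); not the real model.
class TextEncoder(nn.Module):
    def __init__(self, vocab_size=32000, dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, token_ids):
        # Returns per-token conditioning features of shape (B, T, dim).
        return self.encoder(self.embed(token_ids))


# Toy diffusion-based visual decoder: predicts noise from a noisy latent
# plus text conditioning. Real systems use cross-attention; here we use a
# crude pooled-and-broadcast conditioning signal for brevity.
class DiffusionDecoder(nn.Module):
    def __init__(self, latent_dim=4, cond_dim=1024):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, latent_dim)
        self.net = nn.Conv2d(latent_dim, latent_dim, kernel_size=3, padding=1)

    def forward(self, noisy_latent, cond):
        # Pool text features over the sequence and add them channel-wise.
        c = self.cond_proj(cond.mean(dim=1))[:, :, None, None]
        return self.net(noisy_latent + c)


if __name__ == "__main__":
    enc, dec = TextEncoder(), DiffusionDecoder()
    tokens = torch.randint(0, 32000, (1, 16))   # dummy prompt tokens
    latent = torch.randn(1, 4, 32, 32)          # dummy noisy image latent
    noise_pred = dec(latent, enc(tokens))
    print(noise_pred.shape)                     # torch.Size([1, 4, 32, 32])
```

In the actual system, the backbone is a pretrained 7B multimodal model rather than a small transformer, and the decoder is a full diffusion network trained with the text-centric pre-training and post-training recipe the abstract describes; this sketch only shows how the two components plug together.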