Ovis-Image Technical Report

November 28, 2025
Authors: Guo-Hua Wang, Liangfu Cao, Tianyu Cui, Minghao Fu, Xiaohao Chen, Pengxin Zhan, Jianshan Zhao, Lan Li, Bowen Fu, Jiaqi Liu, Qing-Guo Chen
cs.AI

Abstract

We introduce Ovis-Image, a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. Built upon our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that combining a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient to achieve reliable bilingual text rendering without resorting to oversized or proprietary models.
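To make the described architecture concrete, the following is a minimal, hypothetical sketch of the overall composition the abstract outlines: a multimodal language backbone produces text conditioning features, and a diffusion-based visual decoder denoises an image latent under that conditioning. All class names, dimensions, and the conditioning mechanism here are illustrative assumptions, not the actual Ovis-Image implementation.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the text tower of a multimodal backbone
# (e.g., the role Ovis 2.5 plays in the paper); not the real model.
class TextEncoder(nn.Module):
    def __init__(self, vocab_size=32000, dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, token_ids):
        # Returns per-token conditioning features of shape (B, T, dim).
        return self.encoder(self.embed(token_ids))


# Toy diffusion-based visual decoder: predicts noise from a noisy latent
# plus text conditioning. Real systems use cross-attention; here we use a
# crude pooled-and-broadcast conditioning signal for brevity.
class DiffusionDecoder(nn.Module):
    def __init__(self, latent_dim=4, cond_dim=1024):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, latent_dim)
        self.net = nn.Conv2d(latent_dim, latent_dim, kernel_size=3, padding=1)

    def forward(self, noisy_latent, cond):
        # Pool text features over the sequence and add them channel-wise.
        c = self.cond_proj(cond.mean(dim=1))[:, :, None, None]
        return self.net(noisy_latent + c)


if __name__ == "__main__":
    enc, dec = TextEncoder(), DiffusionDecoder()
    tokens = torch.randint(0, 32000, (1, 16))   # dummy prompt tokens
    latent = torch.randn(1, 4, 32, 32)          # dummy noisy image latent
    noise_pred = dec(latent, enc(tokens))
    print(noise_pred.shape)                     # torch.Size([1, 4, 32, 32])
```

In the actual system, the backbone is a pretrained 7B multimodal model rather than a small transformer, and the decoder is a full diffusion network trained with the text-centric pre-training and post-training recipe the abstract describes; this sketch only shows how the two components plug together.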