

Ovis-U1 Technical Report

June 29, 2025
Authors: Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiaohao Chen, Jianshan Zhao, Yang Li, Qing-Guo Chen
cs.AI

Abstract

In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image-generation quality comparable to that of leading models such as GPT-4o. Unlike some previous models that use a frozen multimodal large language model (MLLM) for generation tasks, Ovis-U1 employs a new unified training approach that starts from a language model. Compared with training solely on understanding or generation tasks, unified training yields better performance, demonstrating the benefit of integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it scores 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 on ImgEdit-Bench and 6.42 on GEdit-Bench-EN. As the first version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.
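To make the architecture description above concrete, the sketch below shows one way the described pieces could fit together in PyTorch: an MLLM backbone produces token states, a bidirectional (non-causal) token refiner re-encodes them, and the result conditions a diffusion-based visual decoder. All module names, layer counts, and dimensions are illustrative assumptions of ours; the abstract does not specify Ovis-U1's actual implementation.

```python
import torch
from torch import nn

# Hypothetical sizes chosen only for the example; not Ovis-U1's real configuration.
EMBED_DIM = 1024
NUM_TOKENS = 256


class BidirectionalTokenRefiner(nn.Module):
    """Sketch of a refiner: full (non-causal) self-attention over the tokens
    that will condition the visual decoder."""

    def __init__(self, dim: int, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.encoder(tokens)


class UnifiedModelSketch(nn.Module):
    """Conceptual data flow: language-model backbone -> token refiner ->
    diffusion-based visual decoder (represented here by a stand-in projection)."""

    def __init__(self, dim: int = EMBED_DIM):
        super().__init__()
        # Stand-in for the pretrained LLM backbone.
        self.mllm_backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.token_refiner = BidirectionalTokenRefiner(dim)
        # Stand-in for the diffusion decoder: maps each token to a 16x16 RGB patch.
        self.visual_decoder = nn.Linear(dim, 3 * 16 * 16)

    def forward(self, prompt_embeddings: torch.Tensor) -> torch.Tensor:
        hidden = self.mllm_backbone(prompt_embeddings)
        condition = self.token_refiner(hidden)
        # A real diffusion decoder would iteratively denoise image latents under
        # this conditioning; here we only fix tensor shapes and data flow.
        return self.visual_decoder(condition)


if __name__ == "__main__":
    model = UnifiedModelSketch()
    fake_prompt = torch.randn(1, NUM_TOKENS, EMBED_DIM)
    print(model(fake_prompt).shape)  # torch.Size([1, 256, 768])
```

In the released model, the visual decoder would be an iterative diffusion network rather than a single linear projection, and the backbone would be a pretrained language model; this sketch only illustrates how the components described in the abstract could be wired together.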