Ovis-U1 Technical Report

Abstract

In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation performance comparable to that of leading models such as GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 uses a new unified training approach that starts from a language model. Compared with training solely on understanding or generation tasks, unified training yields better performance, demonstrating the gains achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it scores 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.