Ovis-U1 Technical Report

Abstract

In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing. Building on the Ovis series, Ovis-U1 pairs a diffusion-based visual decoder with a bidirectional token refiner, enabling image generation comparable to that of leading models such as GPT-4o. Unlike some previous models that use a frozen MLLM (Multimodal Large Language Model) for generation tasks, Ovis-U1 uses a new unified training approach that starts from a language model. Compared with training solely on understanding or generation tasks, unified training yields better performance, demonstrating the benefit of integrating the two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it scores 83.72 on DPG-Bench and 0.89 on GenEval. In image editing, it achieves 4.00 on ImgEdit-Bench and 6.42 on GEdit-Bench-EN. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.