SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
April 22, 2024
Authors: Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, Ying Shan
cs.AI
Abstract
The rapid evolution of multimodal foundation models has demonstrated significant progress in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, there remains a gap between their capability and real-world applicability, primarily due to the models' limited capacity to effectively respond to various user instructions and interact with diverse visual data. In this work, we focus on bridging this gap by integrating two enhanced features: (1) comprehending images of arbitrary sizes and ratios, and (2) enabling multi-granularity image generation. We present a unified and versatile foundation model, namely SEED-X, which is able to model multi-granularity visual semantics for comprehension and generation tasks. Beyond competitive results on public benchmarks, SEED-X demonstrates its effectiveness in handling real-world applications across various domains after instruction tuning. We hope that our work will inspire future research into what versatile multimodal foundation models can achieve in real-world applications. The models, code, and datasets will be released at https://github.com/AILab-CVC/SEED-X.