SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Abstract

The rapid evolution of multimodal foundation models has brought significant progress in vision-language understanding and generation, for example our previous work SEED-LLaMA. However, a gap remains between such models' capabilities and their real-world applicability, primarily because of their limited capacity to respond effectively to varied user instructions and to interact with diverse visual data. In this work, we focus on bridging this gap by integrating two enhanced features: (1) comprehending images of arbitrary sizes and ratios, and (2) enabling multi-granularity image generation. We present SEED-X, a unified and versatile foundation model that can model multi-granularity visual semantics for comprehension and generation tasks. Besides competitive results on public benchmarks, SEED-X demonstrates its effectiveness in handling real-world applications across various domains after instruction tuning. We hope that our work will inspire future research into what versatile multimodal foundation models can achieve in real-world applications. The models, code, and datasets will be released at https://github.com/AILab-CVC/SEED-X.
