SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

April 22, 2024
Authors: Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, Ying Shan
cs.AI

Abstract

The rapid evolution of multimodal foundation models has demonstrated significant progress in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, a gap remains between their capabilities and real-world applicability, primarily due to their limited capacity to respond effectively to various user instructions and to interact with diverse visual data. In this work, we focus on bridging this gap by integrating two enhancements: (1) comprehending images of arbitrary sizes and ratios, and (2) enabling multi-granularity image generation. We present a unified and versatile foundation model, namely SEED-X, which is able to model multi-granularity visual semantics for comprehension and generation tasks. Beyond competitive results on public benchmarks, SEED-X demonstrates its effectiveness in handling real-world applications across various domains after instruction tuning. We hope that our work will inspire future research into what versatile multimodal foundation models can achieve in real-world applications. The models, code, and datasets will be released at https://github.com/AILab-CVC/SEED-X.
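For context on enhancement (1), a common way to let a fixed-resolution vision encoder accept images of arbitrary sizes and aspect ratios is an aspect-ratio-preserving resize followed by padding. The sketch below illustrates that general idea only; the function name, target resolution, and fill color are illustrative assumptions, not the actual SEED-X preprocessing (see the released code at the link above for the real pipeline).

```python
# Minimal sketch: aspect-ratio-preserving resize plus padding, a common
# way to feed arbitrarily sized images to a fixed-resolution encoder.
# Hypothetical example; not the SEED-X implementation.
from PIL import Image

def pad_to_square(image: Image.Image, target: int = 448,
                  fill=(127, 127, 127)) -> Image.Image:
    """Scale the longer side to `target`, then pad to target x target."""
    w, h = image.size
    scale = target / max(w, h)
    resized = image.resize((max(1, round(w * scale)),
                            max(1, round(h * scale))),
                           Image.BICUBIC)
    canvas = Image.new("RGB", (target, target), fill)
    # Center the resized image on the square canvas.
    canvas.paste(resized, ((target - resized.width) // 2,
                           (target - resized.height) // 2))
    return canvas

# Usage: pad_to_square(Image.open("photo.jpg")) yields a 448x448 input
# regardless of the original size or aspect ratio.
```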
