PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
October 17, 2024
Authors: Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu
cs.AI
Abstract
Recent advancements in multimodal foundation models have yielded significant
progress in vision-language understanding. Initial attempts have also explored
the potential of multimodal large language models (MLLMs) for visual content
generation. However, existing works have insufficiently addressed the varying
granularity demands of different image generation tasks within a unified MLLM
paradigm - from the diversity required in text-to-image generation to the
precise controllability needed in image manipulation. In this work, we propose
PUMA, emPowering Unified MLLM with Multi-grAnular visual generation. PUMA
unifies multi-granular visual features as both inputs and outputs of MLLMs,
elegantly addressing the different granularity requirements of various image
generation tasks within a unified MLLM framework. Following multimodal
pretraining and task-specific instruction tuning, PUMA demonstrates proficiency
in a wide range of multimodal tasks. This work represents a significant step
towards a truly unified MLLM capable of adapting to the granularity demands of
various visual tasks. The code and model will be released at
https://github.com/rongyaofang/PUMA.
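As a rough illustration of what "multi-granular visual features as MLLM inputs" might look like, the minimal PyTorch sketch below pools a dense image feature map into token grids of several sizes and projects each level into the language model's embedding space. The module name, dimensions, and grid sizes are hypothetical assumptions for illustration and are not taken from the PUMA release.

```python
# Illustrative sketch only (not the authors' code): one dense feature map is pooled
# into several granularities, and each level is projected into the LLM token space.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiGranularAdapter(nn.Module):
    """Pools an image feature map into coarse-to-fine token grids and projects
    each granularity into the LLM embedding dimension (sizes are assumptions)."""

    def __init__(self, vis_dim=1024, llm_dim=4096, grid_sizes=(1, 4, 8)):
        super().__init__()
        self.grid_sizes = grid_sizes              # coarse -> fine token grids
        self.proj = nn.Linear(vis_dim, llm_dim)   # shared input projection

    def forward(self, feat_map):                  # feat_map: (B, C, H, W)
        tokens = []
        for g in self.grid_sizes:
            pooled = F.adaptive_avg_pool2d(feat_map, g)        # (B, C, g, g)
            tokens.append(pooled.flatten(2).transpose(1, 2))   # (B, g*g, C)
        seq = torch.cat(tokens, dim=1)            # concatenate all granularities
        return self.proj(seq)                     # (B, 1+16+64, llm_dim)


# Toy usage: random features stand in for a frozen image encoder's output.
feat_map = torch.randn(2, 1024, 16, 16)
llm_inputs = MultiGranularAdapter()(feat_map)
print(llm_inputs.shape)                           # torch.Size([2, 81, 4096])
```

In this reading, the coarse tokens would serve diversity-oriented generation while the finer grids carry the detail needed for precise manipulation; the symmetric output side (granularity-specific image decoders) is omitted here.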