PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
October 17, 2024
Authors: Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu
cs.AI
Abstract
Recent advancements in multimodal foundation models have yielded significant
progress in vision-language understanding. Initial attempts have also explored
the potential of multimodal large language models (MLLMs) for visual content
generation. However, existing works have insufficiently addressed the varying
granularity demands of different image generation tasks within a unified MLLM
paradigm - from the diversity required in text-to-image generation to the
precise controllability needed in image manipulation. In this work, we propose
PUMA, emPowering Unified MLLM with Multi-grAnular visual generation. PUMA
unifies multi-granular visual features as both inputs and outputs of MLLMs,
elegantly addressing the different granularity requirements of various image
generation tasks within a unified MLLM framework. Following multimodal
pretraining and task-specific instruction tuning, PUMA demonstrates proficiency
in a wide range of multimodal tasks. This work represents a significant step
towards a truly unified MLLM capable of adapting to the granularity demands of
various visual tasks. The code and model will be released at
https://github.com/rongyaofang/PUMA.
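As a rough illustration of what "multi-granular visual features as MLLM inputs" might look like, the minimal PyTorch sketch below pools a dense image feature map into token grids of several sizes and projects each level into the language model's embedding space. The module name, dimensions, and grid sizes are hypothetical assumptions for illustration and are not taken from the PUMA release.

```python
# Illustrative sketch only (not the authors' code): one dense feature map is pooled
# into several granularities, and each level is projected into the LLM token space.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiGranularAdapter(nn.Module):
    """Pools an image feature map into coarse-to-fine token grids and projects
    each granularity into the LLM embedding dimension (sizes are assumptions)."""

    def __init__(self, vis_dim=1024, llm_dim=4096, grid_sizes=(1, 4, 8)):
        super().__init__()
        self.grid_sizes = grid_sizes              # coarse -> fine token grids
        self.proj = nn.Linear(vis_dim, llm_dim)   # shared input projection

    def forward(self, feat_map):                  # feat_map: (B, C, H, W)
        tokens = []
        for g in self.grid_sizes:
            pooled = F.adaptive_avg_pool2d(feat_map, g)        # (B, C, g, g)
            tokens.append(pooled.flatten(2).transpose(1, 2))   # (B, g*g, C)
        seq = torch.cat(tokens, dim=1)            # concatenate all granularities
        return self.proj(seq)                     # (B, 1+16+64, llm_dim)


# Toy usage: random features stand in for a frozen image encoder's output.
feat_map = torch.randn(2, 1024, 16, 16)
llm_inputs = MultiGranularAdapter()(feat_map)
print(llm_inputs.shape)                           # torch.Size([2, 81, 4096])
```

In this reading, the coarse tokens would serve diversity-oriented generation while the finer grids carry the detail needed for precise manipulation; the symmetric output side (granularity-specific image decoders) is omitted here.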