PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Abstract

Recent advances in multimodal foundation models have driven significant progress in vision-language understanding, and initial attempts have explored the potential of multimodal large language models (MLLMs) for visual content generation. However, existing work has not adequately addressed the varying granularity demands of different image generation tasks within a unified MLLM paradigm, which range from the diversity required in text-to-image generation to the precise controllability needed in image manipulation. In this work, we propose PUMA (emPowering Unified MLLM with Multi-grAnular visual generation). PUMA unifies multi-granular visual features as both inputs and outputs of MLLMs, elegantly addressing the different granularity requirements of various image generation tasks within a single MLLM framework. Following multimodal pretraining and task-specific instruction tuning, PUMA demonstrates proficiency in a wide range of multimodal tasks. This work represents a significant step towards a truly unified MLLM capable of adapting to the granularity demands of various visual tasks. The code and model will be released at https://github.com/rongyaofang/PUMA.
