UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion

January 24, 2024
Authors: Wei Li, Xue Xu, Jiachen Liu, Xinyan Xiao
cs.AI

Abstract

Existing text-to-image diffusion models primarily generate images from text prompts. However, the inherent conciseness of textual descriptions poses challenges in faithfully synthesizing images with intricate details, such as specific entities or scenes. This paper presents UNIMO-G, a simple multimodal conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs, and demonstrates a unified capability for both text-driven and subject-driven image generation. UNIMO-G comprises two core components: a Multimodal Large Language Model (MLLM) for encoding multimodal prompts, and a conditional denoising diffusion network for generating images from the encoded multimodal input. We adopt a two-stage training strategy to train the framework effectively: first, pre-training on large-scale text-image pairs to develop conditional image generation capabilities, and then instruction tuning with multimodal prompts to achieve unified image generation proficiency. A well-designed data processing pipeline involving language grounding and image segmentation is used to construct the multimodal prompts. UNIMO-G excels at both text-to-image generation and zero-shot subject-driven synthesis, and is notably effective in generating high-fidelity images from complex multimodal prompts involving multiple image entities.
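
The abstract describes a two-part conditioning scheme: the MLLM turns an interleaved text-and-image prompt into a single sequence of embeddings, and the conditional denoising network predicts noise on an image latent while attending to that sequence. The sketch below is a minimal, hypothetical illustration of this interface under stated assumptions, not the authors' implementation; the module names, dimensions, toy noise schedule, and the cross-attention denoiser are all illustrative.

```python
# Minimal sketch of the two-component design in the abstract: an MLLM-style
# encoder embeds an interleaved text/image prompt into one conditioning
# sequence, and a toy denoiser predicts the noise added to a latent while
# cross-attending to that sequence. All sizes and names are assumptions.

import torch
import torch.nn as nn


class MultimodalPromptEncoder(nn.Module):
    """Stand-in for the MLLM: embeds text tokens and image features into
    a single conditioning sequence."""

    def __init__(self, vocab_size=32000, image_feat_dim=1024, dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.image_proj = nn.Linear(image_feat_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_ids, image_feats):
        # Simple concatenation here; the actual model interleaves visual
        # tokens at their positions within the prompt.
        tokens = torch.cat(
            [self.text_embed(text_ids), self.image_proj(image_feats)], dim=1
        )
        return self.encoder(tokens)  # (B, L, dim) conditioning sequence


class ConditionalDenoiser(nn.Module):
    """Toy denoiser: predicts noise on a latent, conditioned via cross-attention."""

    def __init__(self, latent_dim=64, cond_dim=768):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, cond_dim)
        self.cross_attn = nn.MultiheadAttention(cond_dim, num_heads=8, batch_first=True)
        self.out = nn.Linear(cond_dim, latent_dim)

    def forward(self, noisy_latent, t, cond):
        # noisy_latent: (B, N, latent_dim); cond: (B, L, cond_dim); t unused in this toy.
        q = self.to_q(noisy_latent)
        attended, _ = self.cross_attn(q, cond, cond)
        return self.out(attended)  # predicted noise, same shape as noisy_latent


def diffusion_loss(denoiser, encoder, latents, text_ids, image_feats):
    # Standard epsilon-prediction objective with a crude linear noise schedule.
    noise = torch.randn_like(latents)
    t = torch.randint(0, 1000, (latents.size(0),))
    alpha = 1.0 - t.float().view(-1, 1, 1) / 1000.0
    noisy = alpha.sqrt() * latents + (1 - alpha).sqrt() * noise
    cond = encoder(text_ids, image_feats)
    pred = denoiser(noisy, t, cond)
    return nn.functional.mse_loss(pred, noise)


if __name__ == "__main__":
    enc, den = MultimodalPromptEncoder(), ConditionalDenoiser()
    latents = torch.randn(2, 16, 64)              # fake image latents
    text_ids = torch.randint(0, 32000, (2, 12))   # fake text prompt tokens
    image_feats = torch.randn(2, 4, 1024)         # fake features of entity images
    print(diffusion_loss(den, enc, latents, text_ids, image_feats).item())
```

Under the two-stage strategy described in the abstract, this same conditioning interface would first be pre-trained on plain text-image pairs (no entity images in the prompt) and then instruction-tuned on multimodal prompts that interleave segmented entity images with text.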