
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think

February 27, 2025
Authors: Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, Baobao Chang
cs.AI

Abstract

The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control output images with additional conditions, such as Canny edge maps and depth maps, a comprehensive framework for arbitrary text-image interleaved control is still lacking. This gap is especially evident when attempting to merge concepts or visual elements from multiple images in the generation process. To bridge this gap, we conducted preliminary experiments showing that large multimodal models (LMMs) offer an effective shared representation space, where image and text can be well aligned to serve as a condition for external diffusion models. Based on this discovery, we propose Dream Engine, an efficient and unified framework designed for arbitrary text-image interleaved control in image generation models. Building on powerful text-to-image models such as SD3.5, we replace the original text-only encoders with versatile multimodal information encoders such as QwenVL. Our approach follows a two-stage training paradigm consisting of joint text-image alignment and multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving an overall score of 0.69 on the GenEval benchmark and matching the performance of state-of-the-art text-to-image models such as SD3.5 and FLUX.
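
To make the described architecture concrete, below is a minimal, hypothetical sketch of the conditioning path the abstract outlines: an LMM (a Qwen2-VL checkpoint stands in for QwenVL) encodes an interleaved image-text prompt into a shared representation space, and a small adapter projects those hidden states into the conditioning width expected by an MM-DiT backbone such as SD3.5. The checkpoint name, the `reference.png` path, the linear adapter, and the 4096-dimensional target width are illustrative assumptions, not the authors' released implementation.

```python
# Conceptual sketch only: NOT the Dream Engine code. It illustrates the
# core idea from the abstract -- using an LMM as a shared text-image
# encoder whose hidden states replace the CLIP/T5 text embeddings that
# normally condition a diffusion transformer such as SD3.5.
import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # assumed stand-in checkpoint
processor = AutoProcessor.from_pretrained(model_id)
lmm = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

# One interleaved condition: a reference image followed by text.
reference = Image.open("reference.png")  # placeholder path
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "A cat rendered in the style of this painting."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False)
inputs = processor(text=[prompt], images=[reference], return_tensors="pt")

# The LMM embeds image and text tokens in one shared representation
# space; its last hidden states serve as the diffusion condition.
with torch.no_grad():
    out = lmm(**inputs, output_hidden_states=True)
cond_tokens = out.hidden_states[-1]  # (1, seq_len, lmm_width)

# A small trainable adapter maps the LMM width to the DiT's expected
# conditioning width (4096 for SD3's joint attention is an assumption).
adapter = nn.Linear(cond_tokens.shape[-1], 4096)
prompt_embeds = adapter(cond_tokens.float())

# Under the paper's two-stage recipe (joint text-image alignment, then
# multimodal interleaved instruction tuning), `prompt_embeds` would be
# fed to the SD3.5 transformer in place of its text-encoder outputs.
```

In this sketch the entire interleaved prompt, images and text alike, flows through one encoder, which is what lets concepts from multiple reference images be merged without a separate control branch per condition.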
