

I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models

February 12, 2025
Authors: Zhenxing Mi, Kuan-Chieh Wang, Guocheng Qian, Hanrong Ye, Runtao Liu, Sergey Tulyakov, Kfir Aberman, Dan Xu
cs.AI

Abstract

This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the LLM decoder shares the same input feature space with diffusion decoders that use the corresponding LLM encoder for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images. Project page: https://mizhenxing.github.io/ThinkDiff.
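The core idea, as stated in the abstract, is to train an aligner on a vision-language proxy task against a frozen LLM decoder, so that the aligned features later slot into diffusion decoders that use the corresponding LLM encoder for prompt embedding. The following is a minimal sketch of one such proxy-task training step, not the authors' implementation: the aligner architecture, dimensions, stand-in VLM features, and the toy frozen decoder are all illustrative assumptions (the real method uses an actual VLM and a frozen encoder-decoder LLM).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes; the real VLM / LLM dimensions differ.
VLM_DIM, LLM_DIM, VOCAB = 1024, 768, 32000


class Aligner(nn.Module):
    """Trainable aligner: maps VLM token features into the LLM decoder's
    input feature space, which (per the paper) is shared with diffusion
    decoders that use the corresponding LLM encoder for prompt embedding."""

    def __init__(self, vlm_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vlm_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class ToyDecoder(nn.Module):
    """Stand-in for the frozen LLM decoder, so this sketch runs end to end."""

    def __init__(self, dim: int, vocab: int):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, embeds: torch.Tensor) -> torch.Tensor:
        return self.head(self.layer(embeds))


aligner = Aligner(VLM_DIM, LLM_DIM)
decoder = ToyDecoder(LLM_DIM, VOCAB).requires_grad_(False)  # decoder stays frozen
opt = torch.optim.AdamW(aligner.parameters(), lr=1e-4)

# One proxy-task step: aligned VLM features are fed to the frozen decoder and
# supervised with ordinary next-token cross-entropy (stand-in tensors below).
vlm_tokens = torch.randn(2, 16, VLM_DIM)        # placeholder VLM output features
target_ids = torch.randint(0, VOCAB, (2, 16))   # placeholder caption tokens

logits = decoder(aligner(vlm_tokens))
loss = F.cross_entropy(logits.reshape(-1, VOCAB), target_ids.reshape(-1))
loss.backward()
opt.step()
print(f"proxy alignment loss: {loss.item():.3f}")
```

At inference, the same aligned features would instead be routed to the diffusion decoder's prompt-embedding input, which the abstract argues is possible because that decoder shares the LLM decoder's input feature space.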
