

IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance

September 30, 2025
作者: Jiayi Guo, Chuanhao Yan, Xingqian Xu, Yulin Wang, Kai Wang, Gao Huang, Humphrey Shi
cs.AI

Abstract

Ensuring precise multimodal alignment between diffusion-generated images and input prompts has been a long-standing challenge. Earlier works finetune diffusion weights using high-quality preference data, which tend to be limited and difficult to scale up. Recent editing-based methods further refine local regions of generated images but may compromise overall image quality. In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based multimodal alignment framework that requires no extra data or editing operations. Specifically, given a generated image and its prompt, IMG a) utilizes a multimodal large language model (MLLM) to identify misalignments; b) introduces an Implicit Aligner that manipulates diffusion conditioning features to reduce misalignments and enable re-generation; and c) formulates the re-alignment goal into a trainable objective, namely the Iteratively Updated Preference Objective. Extensive qualitative and quantitative evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior finetuning-based alignment methods. Our code will be available at https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment.
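The abstract describes a three-step re-generation loop: an MLLM spots prompt-image misalignments, an Implicit Aligner adjusts the diffusion conditioning features, and the image is re-generated. The toy sketch below illustrates only that control flow; every function name, the feature representation, and the adjustment rule are placeholders of our own, not the authors' actual API or models.

```python
# Illustrative sketch of the IMG re-generation loop from the abstract.
# All names (identify_misalignments, implicit_aligner, etc.) are
# hypothetical stand-ins; the real method uses an MLLM and a learned
# aligner over diffusion conditioning features.

def identify_misalignments(image, prompt):
    # Placeholder for the MLLM check: return a list of prompt concepts
    # judged missing or wrong in the generated image.
    return ["missing attribute: 'red hat'"] if "red hat" in prompt else []

def implicit_aligner(cond_features, misalignments, strength=0.5):
    # Placeholder for the Implicit Aligner: nudge the conditioning
    # features in proportion to how many misalignments were found.
    shift = strength * len(misalignments)
    return {name: value + shift for name, value in cond_features.items()}

def img_realign_step(image, prompt, cond_features):
    # One iteration: detect misalignments, then either keep the
    # features (aligned) or adjust them and flag a re-generation.
    misalignments = identify_misalignments(image, prompt)
    if not misalignments:
        return cond_features, False
    return implicit_aligner(cond_features, misalignments), True

features = {"prompt_embed": 1.0}
new_features, regenerate = img_realign_step(
    "generated.png", "a dog wearing a red hat", features
)
```

In the actual framework this adjustment is not a hand-written rule: the re-alignment goal is trained via the Iteratively Updated Preference Objective, so the aligner learns how to shift the conditioning features.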