IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
September 30, 2025
Authors: Jiayi Guo, Chuanhao Yan, Xingqian Xu, Yulin Wang, Kai Wang, Gao Huang, Humphrey Shi
cs.AI
Abstract
Ensuring precise multimodal alignment between diffusion-generated images and
input prompts has been a long-standing challenge. Earlier works finetune
diffusion weights using high-quality preference data, which tends to be limited
and difficult to scale up. Recent editing-based methods further refine local
regions of generated images but may compromise overall image quality. In this
work, we propose Implicit Multimodal Guidance (IMG), a novel
re-generation-based multimodal alignment framework that requires no extra data
or editing operations. Specifically, given a generated image and its prompt,
IMG a) utilizes a multimodal large language model (MLLM) to identify
misalignments; b) introduces an Implicit Aligner that manipulates diffusion
conditioning features to reduce misalignments and enable re-generation; and c)
formulates the re-alignment goal into a trainable objective, namely Iteratively
Updated Preference Objective. Extensive qualitative and quantitative
evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing
alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter,
seamlessly enhancing prior finetuning-based alignment methods. Our code will be
available at https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment.
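The three-step loop described above (MLLM misalignment detection, implicit alignment of the conditioning features, and re-generation) can be sketched as follows. This is a minimal illustration of the data flow only, under assumed interfaces: every name here (`detect_misalignments`, `ImplicitAligner`, `regenerate`, `img_round`) is a hypothetical placeholder, not the authors' actual API, and the stub bodies stand in for a real MLLM, aligner network, and diffusion sampler.

```python
# Hedged sketch of the IMG re-generation loop from the abstract.
# All function and class names below are hypothetical placeholders.

def detect_misalignments(image, prompt):
    """Stand-in for step a): query an MLLM for prompt-image mismatches.

    A real system would call a multimodal LLM; here we return one
    hard-coded finding so the loop below has something to correct.
    """
    return [("color", "object color does not match the prompt")]

class ImplicitAligner:
    """Stand-in for step b): a module that edits the diffusion
    conditioning features (rather than the image) to reduce misalignment."""

    def __call__(self, cond_features, misalignments):
        # A trained aligner would predict a feature correction; we add a
        # constant shift per detected issue just to illustrate the flow.
        delta = 0.1 * len(misalignments)
        return [f + delta for f in cond_features]

def regenerate(cond_features):
    """Stand-in for re-running the diffusion sampler on edited features."""
    return {"cond": cond_features}

def img_round(image, prompt, cond_features, aligner):
    """One IMG round: detect, align conditioning features, re-generate."""
    issues = detect_misalignments(image, prompt)
    if not issues:
        return image, cond_features            # already aligned
    new_cond = aligner(cond_features, issues)  # b) implicit alignment
    new_image = regenerate(new_cond)           # re-generation
    return new_image, new_cond

aligner = ImplicitAligner()
image, cond = img_round({"cond": [0.0, 1.0]}, "a black cat",
                        [0.0, 1.0], aligner)
```

In the paper's actual training scheme, step c) would additionally treat the re-generated, better-aligned sample as the preferred one in an iteratively updated preference objective; that optimization step is omitted from this sketch.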