IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance

Abstract

Ensuring precise multimodal alignment between diffusion-generated images and input prompts has been a long-standing challenge. Earlier works fine-tune diffusion model weights using high-quality preference data, which tends to be limited and difficult to scale up. Recent editing-based methods further refine local regions of generated images but may compromise overall image quality. In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based multimodal alignment framework that requires no extra data or editing operations. Specifically, given a generated image and its prompt, IMG a) utilizes a multimodal large language model (MLLM) to identify misalignments; b) introduces an Implicit Aligner that manipulates diffusion conditioning features to reduce misalignments and enable re-generation; and c) formulates the re-alignment goal as a trainable objective, the Iteratively Updated Preference Objective. Extensive qualitative and quantitative evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior fine-tuning-based alignment methods. Our code will be available at https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment.
