EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM

Abstract

Personalization of diffusion models has advanced significantly. Conventional tuning-free methods mostly encode multiple reference images by averaging their image embeddings and using the result as the injection condition, but such an image-independent operation allows no interaction among images and therefore cannot capture the visual elements that are consistent across the references. Tuning-based Low-Rank Adaptation (LoRA) can effectively extract consistent elements from multiple images through training, but it requires separate finetuning for each distinct image group. This paper introduces EasyRef, a novel plug-and-play adaptation method that enables diffusion models to be conditioned on multiple reference images and a text prompt. To effectively exploit consistent visual elements across multiple images, we leverage the multi-image comprehension and instruction-following capabilities of a multimodal large language model (MLLM), prompting it to capture consistent visual elements based on the instruction. Moreover, because the MLLM's representations are injected into the diffusion process through adapters, EasyRef generalizes easily to unseen domains and mines the consistent visual elements within unseen data. To mitigate computational costs and improve the preservation of fine-grained details, we introduce an efficient reference aggregation strategy and a progressive training scheme. Finally, we introduce MRBench, a new multi-reference image generation benchmark. Experimental results demonstrate that EasyRef surpasses both tuning-free methods such as IP-Adapter and tuning-based methods such as LoRA, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.
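
The limitation of embedding averaging noted in the abstract can be illustrated with a toy sketch. The code below is not the implementation of EasyRef, IP-Adapter, or any real encoder. The function name `average_embeddings` and the 2-D vectors are hypothetical, chosen only to show that mean pooling can map two quite different reference groups to the same conditioning vector, so information about how the references relate to one another is lost.

```python
from typing import List

Vector = List[float]


def average_embeddings(embeddings: List[Vector]) -> Vector:
    """Mean-pool per-image embeddings into a single conditioning vector.

    Hypothetical helper: each image contributes independently, and no
    step compares one reference image against another.
    """
    if not embeddings:
        raise ValueError("need at least one reference embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must share one dimension")
    n = len(embeddings)
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]


# Toy 2-D "embeddings" (made-up values, not real encoder outputs).
group_a = [[1.0, 0.0], [0.0, 1.0]]  # two very different references
group_b = [[0.5, 0.5], [0.5, 0.5]]  # two identical references

cond_a = average_embeddings(group_a)
cond_b = average_embeddings(group_b)

# Both groups collapse to the same condition, so a generator conditioned
# on the average alone could not tell them apart.
print(cond_a, cond_b)
```

In this sketch the two groups are indistinguishable after pooling, which is one way to read the abstract's point that an image-independent operation cannot capture what the references have in common.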
