CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

Abstract

Diffusion models have achieved great success in text-to-image generation. However, alleviating the misalignment between text prompts and images remains challenging, and the root cause of this misalignment has not been extensively investigated. We observe that the misalignment is caused by inadequate token attention activation. We further attribute this phenomenon to the diffusion model's insufficient condition utilization, which stems from its training paradigm. To address the issue, we propose CoMat, an end-to-end diffusion model fine-tuning strategy with an image-to-text concept matching mechanism. We leverage an image captioning model to measure image-to-text alignment and guide the diffusion model to revisit ignored tokens. A novel attribute concentration module is also proposed to address the attribute binding problem. Without any image or human preference data, we use only 20K text prompts to fine-tune SDXL and obtain CoMat-SDXL. Extensive experiments show that CoMat-SDXL significantly outperforms the baseline SDXL on two text-to-image alignment benchmarks and achieves state-of-the-art performance.
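As a rough illustration of what measuring image-to-text alignment with a captioning model could look like, the sketch below scores a prompt by its mean negative log-likelihood under a captioner. The abstract does not state the actual objective, so this is an assumption: the function name `caption_nll` and the idea that the captioner exposes per-position logits over a vocabulary, conditioned on the generated image, are both hypothetical.

```python
import numpy as np


def caption_nll(token_logits: np.ndarray, prompt_token_ids: np.ndarray) -> float:
    """Mean negative log-likelihood of the prompt tokens (hypothetical score).

    token_logits: shape (T, V), assumed captioner logits for each of the T
        prompt positions, conditioned on the generated image.
    prompt_token_ids: shape (T,), integer ids of the prompt tokens.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = token_logits - token_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Select the log-probability assigned to each actual prompt token.
    picked = log_probs[np.arange(len(prompt_token_ids)), prompt_token_ids]
    # Lower values mean the captioner finds the prompt more plausible.
    return float(-picked.mean())
```

Under this reading, a prompt token that the image fails to depict would receive low probability from the captioner and raise the score, which is the kind of signal that could push a model to revisit ignored tokens. An actual fine-tuning loop would need a differentiable framework; the NumPy version only shows the scoring step.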

