Improving Long-Text Alignment for Text-to-Image Diffusion Models

Abstract

The rapid advancement of text-to-image (T2I) diffusion models has enabled them to generate unprecedented results from given texts. However, as text inputs become longer, existing encoding methods such as CLIP run into limitations, and aligning the generated images with long texts becomes challenging. To tackle these issues, we propose LongAlign, which combines a segment-level encoding method for processing long texts with a decomposed preference optimization method for effective alignment training. For segment-level encoding, a long text is divided into multiple segments that are processed separately, which overcomes the maximum input length of pretrained encoding models. For preference optimization, we provide decomposed CLIP-based preference models to fine-tune diffusion models. Specifically, to use CLIP-based preference models for T2I alignment, we examine their scoring mechanisms and find that the preference scores can be decomposed into two components: a text-relevant part that measures T2I alignment and a text-irrelevant part that assesses other visual aspects of human preference. We also find that the text-irrelevant part contributes to a common overfitting problem during fine-tuning. To address this, we propose a reweighting strategy that assigns different weights to the two components, thereby reducing overfitting and enhancing alignment. After fine-tuning 512 × 512 Stable Diffusion (SD) v1.5 for about 20 hours using our method, the fine-tuned SD outperforms stronger foundation models, such as PixArt-alpha and Kandinsky v2.2, in T2I alignment. The code is available at https://github.com/luping-liu/LongAlign.
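
The two ideas described above, segment-level encoding and a reweighted two-part preference score, can be illustrated with a minimal Python sketch. It is not taken from the LongAlign code. The function names, the whitespace tokenization, the toy encoder, and the choice to model the text-irrelevant part as the component of the text embedding along a shared direction are all assumptions made for illustration only.

```python
import re
from typing import Callable, List

Vector = List[float]


def split_into_segments(text: str, max_tokens: int) -> List[str]:
    """Split text at sentence ends, then cap each segment at max_tokens words.

    Whitespace-separated words stand in for real tokenizer tokens (assumption).
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    segments: List[str] = []
    for sentence in sentences:
        words = sentence.split()
        # A sentence longer than the limit is cut into several chunks.
        for start in range(0, len(words), max_tokens):
            segments.append(" ".join(words[start:start + max_tokens]))
    return segments


def encode_long_text(
    text: str,
    encode_segment: Callable[[str], List[Vector]],
    max_tokens: int,
) -> List[Vector]:
    """Encode every segment separately and concatenate the token embeddings."""
    embeddings: List[Vector] = []
    for segment in split_into_segments(text, max_tokens):
        embeddings.extend(encode_segment(segment))
    return embeddings


def toy_encoder(segment: str) -> List[Vector]:
    """Hypothetical stand-in for a pretrained encoder: one 2-d vector per word."""
    return [[float(len(word)), 1.0] for word in segment.split()]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def reweighted_score(
    image_emb: Vector,
    text_emb: Vector,
    common_dir: Vector,
    irrelevant_weight: float,
) -> float:
    """Split the dot-product score into two parts and reweight one of them.

    Assumption: the text-irrelevant part is the contribution of the text
    embedding's projection onto common_dir (a direction shared by all texts),
    and the text-relevant part is the contribution of the remainder.
    """
    coeff = dot(text_emb, common_dir) / dot(common_dir, common_dir)
    shared = [coeff * c for c in common_dir]
    specific = [t - s for t, s in zip(text_emb, shared)]
    text_relevant = dot(image_emb, specific)
    text_irrelevant = dot(image_emb, shared)
    return text_relevant + irrelevant_weight * text_irrelevant
```

With `irrelevant_weight=1.0` the function returns the ordinary dot-product score, and smaller weights reduce the influence of the shared component. In a real setup, `encode_segment` would wrap a pretrained text encoder and its tokenizer rather than `toy_encoder`.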
