Training-free Regional Prompting for Diffusion Transformers

Abstract

Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt-following) ability has also been greatly improved by large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the prompts contain various objects with numerous attributes and interrelated spatial relationships. While many regional prompting methods have been proposed for UNet-based models (SD1.5, SDXL), there are still no implementations based on the recent Diffusion Transformer (DiT) architecture, such as SD3 and FLUX.1. In this report, we propose and implement regional prompting for FLUX.1 based on attention manipulation, which equips DiT with fine-grained compositional text-to-image generation capability in a training-free manner. Code is available at https://github.com/antonioo-c/Regional-Prompting-FLUX.
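
As a rough illustration of what attention manipulation for regional prompting can look like, the sketch below builds a boolean attention mask over a joint sequence of text and image tokens and applies it in plain scaled dot-product attention. It is not the implementation from the linked repository. The function names, the token layout (all regional text tokens first, then flattened image tokens), and the masking rules (each regional prompt interacts only with image tokens inside its region, while image tokens attend to each other freely) are assumptions made for this example.

```python
import numpy as np

def build_regional_attention_mask(region_masks, text_lens):
    """Hypothetical helper: returns a (T+N, T+N) boolean matrix, True = may attend.

    region_masks: list of boolean arrays of shape (H, W), one per region.
    text_lens:    list of token counts, one per regional prompt.
    Assumed layout: [text tokens of region 0, region 1, ..., image tokens].
    """
    assert len(region_masks) == len(text_lens)
    h, w = region_masks[0].shape
    n_img = h * w
    n_txt = sum(text_lens)
    allowed = np.zeros((n_txt + n_img, n_txt + n_img), dtype=bool)

    # Assumption: image tokens attend to all image tokens (global coherence).
    allowed[n_txt:, n_txt:] = True

    start = 0
    for mask, length in zip(region_masks, text_lens):
        txt = slice(start, start + length)
        img = n_txt + np.flatnonzero(mask.reshape(-1))
        allowed[txt, txt] = True   # a prompt attends to itself only
        allowed[txt, img] = True   # prompt -> image tokens in its region
        allowed[img, txt] = True   # image tokens in the region -> prompt
        start += length
    return allowed

def masked_attention(q, k, v, allowed):
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(allowed, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy usage: a 2x2 latent grid split into a left and a right region.
left = np.array([[True, False], [True, False]])
right = ~left
allowed = build_regional_attention_mask([left, right], text_lens=[2, 3])

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(allowed.shape[0], 8)) for _ in range(3))
out = masked_attention(q, k, v, allowed)
```

Because the mask only changes which tokens may attend to each other at inference time, a scheme of this kind needs no additional training, which is the property the abstract emphasizes.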
