GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation

Abstract

We introduce a novel training-free spatial grounding technique for text-to-image generation using Diffusion Transformers (DiT). Spatial grounding with bounding boxes has gained attention for its simplicity and versatility, allowing for enhanced user control in image generation. However, prior training-free approaches often rely on updating the noisy image during the reverse diffusion process via backpropagation from custom loss functions, and they frequently struggle to provide precise control over individual bounding boxes. In this work, we leverage the flexibility of the Transformer architecture and demonstrate that DiT can generate a noisy patch for each bounding box that fully encodes the target object, allowing fine-grained control over each region. Our approach builds on an intriguing property of DiT, which we refer to as semantic sharing: when a smaller patch is jointly denoised alongside a generatable-size image, the two become "semantic clones". Each patch is denoised in its own branch of the generation process and then transplanted into the corresponding region of the original noisy image at each timestep, resulting in robust spatial grounding for each bounding box. In our experiments on the HRS and DrawBench benchmarks, we achieve state-of-the-art performance compared to previous training-free spatial grounding approaches.
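The transplantation step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`transplant_patches`, `toy_denoise_step`, `toy_generation`), the end-exclusive `(x0, y0, x1, y1)` box format, and the array layout are all assumptions made for illustration, and the denoiser is a placeholder that simply scales its input instead of calling a DiT. The joint denoising of each patch with a generatable-size image (semantic sharing) is not modeled here; only the per-timestep copy of each patch into its bounding-box region is shown.

```python
import numpy as np


def transplant_patches(noisy_image, patches, boxes):
    """Copy each noisy patch into its bounding-box region of the noisy image.

    noisy_image: array of shape (H, W, C).
    patches: list of arrays; patches[i] has shape (y1 - y0, x1 - x0, C).
    boxes: list of hypothetical (x0, y0, x1, y1) tuples, end-exclusive.
    Returns a new array; the input image is left unchanged.
    """
    out = noisy_image.copy()
    for patch, (x0, y0, x1, y1) in zip(patches, boxes):
        if patch.shape[:2] != (y1 - y0, x1 - x0):
            raise ValueError("patch size does not match its bounding box")
        out[y0:y1, x0:x1] = patch
    return out


def toy_denoise_step(x, t):
    """Placeholder for one reverse-diffusion step (NOT a real DiT call)."""
    return 0.9 * x


def toy_generation(image, boxes, num_steps, seed=0):
    """Denoise each patch in its own branch, then transplant at every step."""
    rng = np.random.default_rng(seed)
    channels = image.shape[2]
    # One independent noisy patch per bounding box.
    patches = [
        rng.standard_normal((y1 - y0, x1 - x0, channels))
        for (x0, y0, x1, y1) in boxes
    ]
    for t in reversed(range(num_steps)):
        patches = [toy_denoise_step(p, t) for p in patches]  # per-box branches
        image = toy_denoise_step(image, t)                   # main image branch
        image = transplant_patches(image, patches, boxes)    # transplantation
    return image, patches
```

With non-overlapping boxes, each box region of the returned image matches its patch exactly after the final step. How overlapping boxes should be resolved is left open in this sketch; here, later boxes would simply overwrite earlier ones.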
