GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation

Abstract

We introduce a novel training-free spatial grounding technique for text-to-image generation using Diffusion Transformers (DiT). Spatial grounding with bounding boxes has gained attention for its simplicity and versatility, allowing for enhanced user control in image generation. However, prior training-free approaches often rely on updating the noisy image during the reverse diffusion process via backpropagation from custom loss functions, which frequently struggle to provide precise control over individual bounding boxes. In this work, we leverage the flexibility of the Transformer architecture, demonstrating that DiT can generate noisy patches corresponding to each bounding box, fully encoding the target object and allowing for fine-grained control over each region. Our approach builds on an intriguing property of DiT, which we refer to as semantic sharing. Due to semantic sharing, when a smaller patch is jointly denoised alongside a generatable-size image, the two become "semantic clones". Each patch is denoised in its own branch of the generation process and then transplanted into the corresponding region of the original noisy image at each timestep, resulting in robust spatial grounding for each bounding box. In our experiments on the HRS and DrawBench benchmarks, we achieve state-of-the-art performance compared to previous training-free spatial grounding approaches.
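The transplantation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `(top, left, height, width)` box convention, the function names, and `toy_denoise_step` (a placeholder standing in for one reverse-diffusion step of a DiT) are all assumptions made for illustration. The sketch also omits the joint denoising of each patch alongside a generatable-size image, which is what the abstract credits for semantic sharing. It only shows each patch evolving in its own branch and being copied into its bounding-box region at every timestep.

```python
import numpy as np


def transplant_patch(noisy_image, noisy_patch, box):
    """Return a copy of noisy_image with noisy_patch written into box.

    box is (top, left, height, width); this convention is an assumption.
    """
    top, left, height, width = box
    if noisy_patch.shape[:2] != (height, width):
        raise ValueError("patch size must match the bounding box")
    out = noisy_image.copy()  # leave the caller's array untouched
    out[top:top + height, left:left + width] = noisy_patch
    return out


def toy_denoise_step(x, t):
    """Placeholder for one reverse-diffusion step; NOT a real DiT."""
    return x * 0.9


def grounded_sampling(noisy_image, noisy_patches, boxes, num_steps):
    """Denoise the image and each patch, transplanting at every timestep."""
    image = noisy_image
    patches = list(noisy_patches)  # shallow copy so the input list is kept
    for t in range(num_steps, 0, -1):
        image = toy_denoise_step(image, t)
        for i, box in enumerate(boxes):
            # Each patch is denoised in its own branch ...
            patches[i] = toy_denoise_step(patches[i], t)
            # ... and then copied into its region of the noisy image.
            image = transplant_patch(image, patches[i], box)
    return image


rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8, 4))   # toy "latent" image
patch = rng.standard_normal((3, 2, 4))   # toy patch for one bounding box
box = (1, 2, 3, 2)                       # top, left, height, width
result = grounded_sampling(image, [patch], [box], num_steps=2)
```

In this toy setup the region inside the box ends up determined entirely by the patch branch, while the rest of the image follows the main branch.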
