Zero-shot spatial layout conditioning for text-to-image diffusion models

Abstract

Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling and allow for an intuitive and powerful user interface to drive the image generation process. Expressing spatial constraints in text, e.g. positioning specific objects in particular locations, is cumbersome, and current text-based image generation models are not able to follow such instructions accurately. In this paper we consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content. We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models and does not require any additional training. It leverages implicit segmentation maps that can be extracted from cross-attention layers, and uses them to align the generation with input masks. Our experimental results combine high image quality with accurate alignment of generated content with input segmentations, and improve over prior work both quantitatively and qualitatively, including methods that require training on images with corresponding segmentations. Compared to Paint with Words, the previous state of the art in image generation with zero-shot segmentation conditioning, we improve by 5 to 10 mIoU points on the COCO dataset with similar FID scores.
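For readers unfamiliar with the mIoU figure quoted in the abstract, the following is a minimal sketch of how mean intersection-over-union is commonly computed between a predicted segmentation and a reference one. It is an illustration only and not the paper's evaluation code: the function name `mean_iou`, the nested-list label grids, and the choice to skip classes absent from both maps are assumptions made here. How the predicted segmentation of a generated image is obtained is not covered.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union between two label grids.

    pred and target are equally sized 2D lists of integer class ids.
    Classes that appear in neither grid are skipped (an assumption of
    this sketch; other conventions exist).
    """
    ious = []
    for c in range(num_classes):
        inter = 0
        union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                in_pred = p == c
                in_target = t == c
                inter += in_pred and in_target  # pixel labelled c in both grids
                union += in_pred or in_target   # pixel labelled c in either grid
        if union > 0:
            ious.append(inter / union)
    # Average the per-class IoU values; return 0.0 if no class was present.
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical 2x2 example: class 0 has IoU 1/2, class 1 has IoU 2/3.
pred = [[0, 0], [1, 1]]
target = [[0, 1], [1, 1]]
score = mean_iou(pred, target, num_classes=2)
```

Multiplying the score by 100 gives the percentage-point scale on which a gain of "5 to 10 mIoU points" is usually reported.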
