Zero-shot spatial layout conditioning for text-to-image diffusion models
June 23, 2023
Authors: Guillaume Couairon, Marlène Careil, Matthieu Cord, Stéphane Lathuilière, Jakob Verbeek
cs.AI
Abstract
Large-scale text-to-image diffusion models have significantly improved the
state of the art in generative image modelling and allow for an intuitive and
powerful user interface to drive the image generation process. Expressing
spatial constraints in text, e.g. positioning specific objects in particular
locations, is cumbersome, and current text-based image generation models are
not able to accurately follow such instructions. In this paper we consider
image generation from text associated with segments on the image canvas, which
combines an intuitive natural language interface with precise spatial control
over the generated content. We propose ZestGuide, a zero-shot segmentation
guidance approach that can be plugged into pre-trained text-to-image diffusion
models, and does not require any additional training. It leverages implicit
segmentation maps that can be extracted from cross-attention layers, and uses
them to align the generation with input masks. Our experimental results combine
high image quality with accurate alignment of generated content with input
segmentations, and improve over prior work both quantitatively and
qualitatively, including methods that require training on images with
corresponding segmentations. Compared to Paint with Words, the previous
state of the art in image generation with zero-shot segmentation conditioning,
we improve by 5 to 10 mIoU points on the COCO dataset with similar FID scores.
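
To make the two quantities in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that per-segment cross-attention maps have already been extracted and resized to the mask resolution. The function names `alignment_loss` and `mean_iou`, the max-normalisation step, and the choice of binary cross-entropy are assumptions made for illustration. In an actual diffusion pipeline, the gradient of such a loss with respect to the noisy latent would be obtained with an autodiff framework and used to steer sampling; that step is not shown here.

```python
import numpy as np


def alignment_loss(attn, masks, eps=1e-6):
    """Binary cross-entropy between attention maps and binary input masks.

    attn:  (K, H, W) non-negative attention scores, one map per text segment.
    masks: (K, H, W) binary masks (0.0 or 1.0), one per text segment.
    """
    # Rescale each map to [0, 1] by its own maximum (an assumption of this sketch).
    peak = attn.reshape(attn.shape[0], -1).max(axis=1)[:, None, None]
    p = np.clip(attn / (peak + eps), eps, 1.0 - eps)
    bce = -(masks * np.log(p) + (1.0 - masks) * np.log(1.0 - p))
    return float(bce.mean())


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both label maps: skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


# Toy 2x4 canvas with two segments (labels 0 and 1).
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred = np.array([[0, 0, 0, 1],
                 [0, 0, 1, 1]])
masks = np.stack([target == c for c in range(2)]).astype(float)

aligned = masks * 0.9 + 0.05            # attention concentrated inside each mask
misaligned = (1.0 - masks) * 0.9 + 0.05  # attention concentrated outside each mask

print(mean_iou(pred, target, 2))  # IoU is 4/5 for label 0 and 3/4 for label 1
print(alignment_loss(aligned, masks) < alignment_loss(misaligned, masks))
```

The loss is low when each segment's attention mass falls inside its mask and high otherwise, which is the property a guidance signal of this kind would rely on. mIoU is the alignment metric reported on COCO in the abstract.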