Zero-shot spatial layout conditioning for text-to-image diffusion models
June 23, 2023
Authors: Guillaume Couairon, Marlène Careil, Matthieu Cord, Stéphane Lathuilière, Jakob Verbeek
cs.AI
Abstract
Large-scale text-to-image diffusion models have significantly improved the
state of the art in generative image modelling and allow for an intuitive and
powerful user interface to drive the image generation process. However,
expressing spatial constraints, e.g. to position specific objects in particular
locations, is cumbersome using text, and current text-based image generation
models are not able to accurately follow such instructions. In this paper we consider
image generation from text associated with segments on the image canvas, which
combines an intuitive natural language interface with precise spatial control
over the generated content. We propose ZestGuide, a zero-shot segmentation
guidance approach that can be plugged into pre-trained text-to-image diffusion
models, and does not require any additional training. It leverages implicit
segmentation maps that can be extracted from cross-attention layers, and uses
them to align the generation with input masks. In our experiments, ZestGuide
combines high image quality with accurate alignment of generated content with
input segmentations, and improves over prior work both quantitatively and
qualitatively, including methods that require training on images with
corresponding segmentations. Compared to Paint with Words, the previous
state of the art in image generation with zero-shot segmentation conditioning,
we improve by 5 to 10 mIoU points on the COCO dataset with similar FID scores.
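
To make the idea of aligning attention-derived segmentations with input masks more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify how the attention maps are normalised or which loss is used, so the min-max normalisation, the binary cross-entropy loss, and all function names here (`attention_to_segmentation`, `alignment_loss`, `mean_iou`) are assumptions chosen for illustration. In an actual guidance scheme the loss would presumably be differentiated through the diffusion model to steer sampling; this sketch only evaluates the loss and the mIoU metric on toy arrays.

```python
import numpy as np


def attention_to_segmentation(attn):
    """Turn per-segment attention maps of shape (K, H, W) into soft masks.

    Assumption: each map is min-max normalised to [0, 1] independently.
    """
    flat = attn.reshape(attn.shape[0], -1)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    soft = (flat - lo) / np.maximum(hi - lo, 1e-8)
    return soft.reshape(attn.shape)


def alignment_loss(soft, masks, eps=1e-6):
    """Mean binary cross-entropy between soft masks and binary input masks.

    Assumption: BCE is used here only as one plausible alignment loss.
    """
    p = np.clip(soft, eps, 1.0 - eps)
    return float(-np.mean(masks * np.log(p) + (1.0 - masks) * np.log(1.0 - p)))


def mean_iou(pred_labels, true_labels, num_classes):
    """Mean intersection-over-union over classes present in either label map."""
    ious = []
    for c in range(num_classes):
        pred_c = pred_labels == c
        true_c = true_labels == c
        union = np.logical_or(pred_c, true_c).sum()
        if union == 0:
            continue  # class absent from both maps, skip it
        ious.append(np.logical_and(pred_c, true_c).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


# Toy 4x4 canvas with two segments: left half and right half.
masks = np.zeros((2, 4, 4))
masks[0, :, :2] = 1.0
masks[1, :, 2:] = 1.0

# Hypothetical attention maps: one that matches the masks, one that is swapped.
aligned_attn = 0.1 + 0.8 * masks
swapped_attn = aligned_attn[::-1]

aligned_soft = attention_to_segmentation(aligned_attn)
swapped_soft = attention_to_segmentation(swapped_attn)

aligned_loss = alignment_loss(aligned_soft, masks)
swapped_loss = alignment_loss(swapped_soft, masks)

true_labels = masks.argmax(axis=0)
aligned_miou = mean_iou(aligned_soft.argmax(axis=0), true_labels, num_classes=2)
swapped_miou = mean_iou(swapped_soft.argmax(axis=0), true_labels, num_classes=2)
```

In this toy setup the loss is lower, and the mIoU higher, when the attention maps match the input masks than when they are swapped, which is the signal a guidance procedure of this kind could exploit.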