Zero-shot spatial layout conditioning for text-to-image diffusion models

Abstract

Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling and allow for an intuitive and powerful user interface to drive the image generation process. Expressing spatial constraints in text, e.g. to position specific objects in particular locations, is cumbersome, and current text-based image generation models are not able to accurately follow such instructions. In this paper we consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content. We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models and does not require any additional training. It leverages implicit segmentation maps that can be extracted from cross-attention layers, and uses them to align the generation with input masks. Our experimental results combine high image quality with accurate alignment of generated content with input segmentations, and improve over prior work both quantitatively and qualitatively, including methods that require training on images with corresponding segmentations. Compared to Paint with Words, the previous state of the art in image generation with zero-shot segmentation conditioning, we improve by 5 to 10 mIoU points on the COCO dataset with similar FID scores.
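The mIoU figure quoted above measures how well the generated content lines up with the input segmentation. As a rough illustration only, the sketch below computes a mean intersection-over-union between two label maps in plain Python. The function name `mean_iou`, the nested-list representation, and the choice to average over every class that appears in either map are assumptions made for this example. The abstract does not state the evaluation protocol, so this should not be read as the authors' procedure.

```python
from typing import Sequence


def mean_iou(pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """Mean intersection-over-union between two label maps of equal shape.

    Hypothetical helper for illustration: each map is a 2D grid of integer
    class ids, and the mean is taken over classes present in either map.
    """
    if len(pred) != len(target) or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("label maps must have the same shape")

    intersection = {}  # class id -> pixels where both maps agree on that class
    union = {}         # class id -> pixels where either map has that class

    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Agreement: the pixel counts once towards both sets for class p.
                intersection[p] = intersection.get(p, 0) + 1
                union[p] = union.get(p, 0) + 1
            else:
                # Disagreement: the pixel enlarges the union of both classes.
                union[p] = union.get(p, 0) + 1
                union[t] = union.get(t, 0) + 1

    if not union:
        raise ValueError("label maps are empty")

    ious = [intersection.get(c, 0) / union[c] for c in union]
    return sum(ious) / len(ious)


# Toy 2x2 example: one pixel of class 0 is mislabelled as class 1.
target_map = [[0, 0], [1, 1]]
pred_map = [[0, 1], [1, 1]]
score = mean_iou(pred_map, target_map)  # class 0: 1/2, class 1: 2/3
```

Under this convention, scores lie in [0, 1]. Reported "mIoU points" are usually the same quantity expressed as a percentage, so an improvement of 5 to 10 points would correspond to 0.05 to 0.10 on this scale.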
