
Compositional Text-to-Image Generation with Dense Blob Representations

May 14, 2024
作者: Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat
cs.AI

Abstract

Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives - denoted as dense blob representations - that contain fine-grained details of the scene while being modular, human-interpretable, and easy-to-construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. Particularly, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks. Project page: https://blobgen-2d.github.io.
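To make the masked cross-attention idea concrete, below is a minimal PyTorch sketch. It assumes each blob carries a text embedding and a binary spatial mask over the image feature grid; the class name `MaskedCrossAttention`, the tensor layout, and the single-layer design are illustrative assumptions, not the paper's actual module.

```python
# A minimal sketch of masked cross-attention between image features and
# blob embeddings. All names and shape conventions here are assumptions
# for illustration; see the paper for the actual architecture.
import math
import torch
import torch.nn as nn


class MaskedCrossAttention(nn.Module):
    """Each image token attends only to the blobs whose masks cover its
    spatial location, keeping blob-to-feature fusion local and disentangled."""

    def __init__(self, dim: int, blob_dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(blob_dim, dim, bias=False)
        self.to_v = nn.Linear(blob_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, blob_emb, blob_masks):
        # x:          (B, N, dim)       N = H*W flattened image tokens
        # blob_emb:   (B, K, blob_dim)  K blob text embeddings
        # blob_masks: (B, K, N)         1 where blob k covers token n
        B, N, _ = x.shape
        q = self.to_q(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.to_k(blob_emb).view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.to_v(blob_emb).view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)  # (B, h, N, K)

        # Forbid attention between a token and any blob that does not
        # overlap it spatially.
        mask = blob_masks.transpose(1, 2).unsqueeze(1)  # (B, 1, N, K)
        attn = attn.masked_fill(mask == 0, float("-inf"))
        attn = attn.softmax(dim=-1)
        # Tokens covered by no blob get all -inf logits and thus NaNs
        # after softmax; zero them out so they receive no blob signal.
        attn = torch.nan_to_num(attn)

        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.to_out(out)
```

The point of the mask is locality: each spatial location is fused only with the blobs that cover it, so editing one blob's description or position cannot leak into unrelated regions, which is one plausible reading of the "disentangled fusion" claim in the abstract.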
