Compositional Text-to-Image Generation with Dense Blob Representations
May 14, 2024
作者: Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat
cs.AI
Abstract
Existing text-to-image models struggle to follow complex text prompts,
raising the need for extra grounding inputs for better controllability. In this
work, we propose to decompose a scene into visual primitives - denoted as dense
blob representations - that contain fine-grained details of the scene while
being modular, human-interpretable, and easy-to-construct. Based on blob
representations, we develop a blob-grounded text-to-image diffusion model,
termed BlobGEN, for compositional generation. Particularly, we introduce a new
masked cross-attention module to disentangle the fusion between blob
representations and visual features. To leverage the compositionality of large
language models (LLMs), we introduce a new in-context learning approach to
generate blob representations from text prompts. Our extensive experiments show
that BlobGEN achieves superior zero-shot generation quality and better
layout-guided controllability on MS-COCO. When augmented by LLMs, our method
exhibits superior numerical and spatial correctness on compositional image
generation benchmarks. Project page: https://blobgen-2d.github.io.
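
The abstract describes a masked cross-attention module that fuses blob representations with visual features while keeping each blob's influence spatially local. The sketch below is one minimal way such a module could look in PyTorch; the function name, tensor shapes, and the choice to mask visual queries against blob keys are illustrative assumptions, not the authors' released implementation.

```python
import torch

def masked_cross_attention(img_feats, blob_embs, blob_masks):
    """Minimal sketch of masked cross-attention between visual features and blobs.

    img_feats:  (B, HW, C) flattened visual features from the diffusion backbone
    blob_embs:  (B, K, C)  one embedding per blob (layout + text description)
    blob_masks: (B, K, HW) 1 where a blob covers a spatial location, else 0

    Assumption: each spatial location may only attend to blobs that cover it,
    which is one way to realize the "disentangled fusion" described above.
    """
    B, HW, C = img_feats.shape

    # Queries from visual features, keys/values from blob embeddings
    # (learned projections omitted for brevity).
    q, k, v = img_feats, blob_embs, blob_embs

    attn = torch.einsum("bqc,bkc->bqk", q, k) / C**0.5      # (B, HW, K)

    # Mask out blobs that do not cover a given location.
    mask = blob_masks.transpose(1, 2).bool()                # (B, HW, K)
    attn = attn.masked_fill(~mask, float("-inf"))
    attn = attn.softmax(dim=-1)
    attn = torch.nan_to_num(attn)  # locations covered by no blob contribute zeros

    return torch.einsum("bqk,bkc->bqc", attn, v)            # (B, HW, C)
```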
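The abstract also mentions an in-context learning approach for generating blob representations from text prompts with an LLM. The snippet below sketches what such a few-shot prompt could look like; the blob parameterization (a normalized ellipse plus a free-text description), the example scenes, and the prompt wording are all hypothetical placeholders rather than the paper's actual format.

```python
# Hypothetical few-shot examples mapping a caption to per-object blobs.
FEW_SHOT_EXAMPLES = """\
Caption: a red apple on a wooden table
Blobs:
- description: a shiny red apple; ellipse: [0.50, 0.45, 0.15, 0.15, 0.0]
- description: a rustic wooden table; ellipse: [0.50, 0.75, 0.48, 0.20, 0.0]
"""

def build_blob_prompt(caption: str) -> str:
    """Compose an in-context prompt asking the LLM to decompose a caption
    into blob descriptions and layout parameters."""
    return (
        "Decompose the scene into blobs. Each blob has a free-text description "
        "and a normalized ellipse [cx, cy, rx, ry, angle].\n\n"
        f"{FEW_SHOT_EXAMPLES}\n"
        f"Caption: {caption}\nBlobs:"
    )

print(build_blob_prompt("two cats sleeping on a blue sofa"))
```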