

Grounded Text-to-Image Synthesis with Attention Refocusing

June 8, 2023
Authors: Quynh Phung, Songwei Ge, Jia-Bin Huang
cs.AI

Abstract

Driven by scalable diffusion models trained on large-scale paired text-image datasets, text-to-image synthesis methods have shown compelling results. However, these models still fail to precisely follow the text prompt when multiple objects, attributes, and spatial compositions are involved in the prompt. In this paper, we identify the potential reasons in both the cross-attention and self-attention layers of the diffusion model. We propose two novel losses to refocus the attention maps according to a given layout during the sampling process. We perform comprehensive experiments on the DrawBench and HRS benchmarks using layouts synthesized by Large Language Models, showing that our proposed losses can be integrated easily and effectively into existing text-to-image methods and consistently improve their alignment between the generated images and the text prompts.
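The core idea of refocusing can be illustrated with a small sketch. The paper's actual losses operate on the cross-attention and self-attention maps inside the diffusion U-Net during sampling; the function below is only a minimal stand-in, assuming a single token's normalized attention map and a binary layout mask (the loss form and all names here are illustrative assumptions, not the authors' exact formulation). It penalizes attention mass that leaks outside the object's box, which is the quantity a refocusing loss would drive down via gradient steps on the latent at each denoising step.

```python
import torch

def refocus_loss(attn_map: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Toy attention-refocusing loss (illustrative, not the paper's exact loss).

    attn_map: (H, W) non-negative cross-attention map for one object token.
    mask:     (H, W) binary layout mask, 1 inside the object's bounding box.
    """
    attn = attn_map / (attn_map.sum() + 1e-8)      # normalize to a distribution
    inside = (attn * mask).sum()                   # attention mass inside the box
    outside = (attn * (1.0 - mask)).sum()          # mass leaking outside the box
    # Pull attention into the box and push it away from everything else.
    return (1.0 - inside) ** 2 + outside ** 2

# Toy layout: the object token should attend to the top-left 2x2 region.
mask = torch.zeros(4, 4)
mask[:2, :2] = 1.0

focused = torch.zeros(4, 4)
focused[:2, :2] = 1.0        # all attention inside the box
leaking = torch.ones(4, 4)   # uniform attention, most mass outside the box
```

In a sampling loop, such a loss would be evaluated on the attention maps at each denoising step and its gradient with respect to the latent used to nudge the latent (e.g. `latent = latent - step_size * torch.autograd.grad(loss, latent)[0]`), steering generation toward the given layout without retraining the model.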