Grounded Text-to-Image Synthesis with Attention Refocusing
June 8, 2023
Authors: Quynh Phung, Songwei Ge, Jia-Bin Huang
cs.AI
Abstract
Driven by scalable diffusion models trained on large-scale paired text-image
datasets, text-to-image synthesis methods have shown compelling results.
However, these models still fail to precisely follow the text prompt when
multiple objects, attributes, and spatial compositions are involved in the
prompt. In this paper, we identify the potential reasons in both the
cross-attention and self-attention layers of the diffusion model. We propose
two novel losses to refocus the attention maps according to a given layout
during the sampling process. We perform comprehensive experiments on the
DrawBench and HRS benchmarks using layouts synthesized by Large Language
Models, showing that our proposed losses can be integrated easily and
effectively into existing text-to-image methods and consistently improve the
alignment between the generated images and the text prompts.
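The abstract does not spell out the form of the two losses. Below is a minimal, hypothetical sketch of what a layout-based attention-refocusing loss could look like during sampling, assuming a per-token cross-attention map `attn` over image locations and a binary box mask `box_mask` derived from the given layout; these names and shapes are illustrative assumptions, not the paper's actual formulation or API.

```python
# Illustrative sketch only: a layout-based "refocusing" loss on a cross-attention
# map. It rewards attention mass that a token places inside its target bounding
# box and penalizes mass that leaks outside it. Names/shapes are assumptions.
import torch

def cross_attention_refocus_loss(attn: torch.Tensor, box_mask: torch.Tensor) -> torch.Tensor:
    """
    attn:     (H*W,) attention map of one text token over image locations,
              assumed softmax-normalized so attn.sum() == 1.
    box_mask: (H*W,) binary mask, 1 inside the token's target box, 0 outside.
    """
    inside = (attn * box_mask).sum()          # attention mass inside the box
    outside = (attn * (1 - box_mask)).sum()   # attention mass outside the box
    # Encourage high inside-mass and low outside-mass.
    return (1.0 - inside) + outside

# In a gradient-guided sampling loop, such a loss could be backpropagated to the
# noisy latent at each denoising step to nudge attention toward the layout, e.g.:
#   grad = torch.autograd.grad(loss, latent)[0]
#   latent = latent - step_size * grad
```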