Grounded Text-to-Image Synthesis with Attention Refocusing

Abstract

Driven by scalable diffusion models trained on large-scale paired text-image datasets, text-to-image synthesis methods have shown compelling results. However, these models still fail to follow the text prompt precisely when it involves multiple objects, attributes, and spatial compositions. In this paper, we identify potential reasons for these failures in both the cross-attention and self-attention layers of the diffusion model. We propose two novel losses that refocus the attention maps according to a given layout during the sampling process. We perform comprehensive experiments on the DrawBench and HRS benchmarks using layouts synthesized by Large Language Models, showing that our proposed losses can be integrated easily and effectively into existing text-to-image methods and consistently improve the alignment between the generated images and the text prompts.
