Grounded Text-to-Image Synthesis with Attention Refocusing

Abstract

Driven by scalable diffusion models trained on large-scale paired text-image datasets, text-to-image synthesis methods have shown compelling results. However, these models still fail to follow the text prompt precisely when it involves multiple objects, attributes, and spatial compositions. In this paper, we identify potential causes of this failure in both the cross-attention and self-attention layers of the diffusion model. We propose two novel losses that refocus the attention maps according to a given layout during the sampling process. We perform comprehensive experiments on the DrawBench and HRS benchmarks using layouts synthesized by Large Language Models, showing that the proposed losses can be integrated easily and effectively into existing text-to-image methods and consistently improve the alignment between the generated images and the text prompts.
