FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting
August 21, 2024
Authors: Liyao Jiang, Negar Hassanpour, Mohammad Salameh, Mohan Sai Singamsetti, Fengyu Sun, Wei Lu, Di Niu
cs.AI
Abstract
Text-to-image (T2I) diffusion models have demonstrated impressive
capabilities in generating high-quality images given a text prompt. However,
ensuring the prompt-image alignment remains a considerable challenge, i.e.,
generating images that faithfully align with the prompt's semantics. Recent
works attempt to improve faithfulness by optimizing the latent code, which
may push the latent code out of distribution and thus
produce unrealistic images. In this paper, we propose FRAP, a simple yet
effective approach that adaptively adjusts the per-token prompt weights
to improve prompt-image alignment and authenticity of the generated images. We
design an online algorithm to adaptively update each token's weight
coefficient, which is achieved by minimizing a unified objective function that
encourages object presence and the binding of object-modifier pairs. Through
extensive evaluations, we show that FRAP generates images with significantly higher
prompt-image alignment for prompts from complex datasets, with lower
average latency than recent latent code optimization methods, e.g., 4
seconds faster than D&B on the COCO-Subject dataset. Furthermore, through
visual comparisons and evaluation on the CLIP-IQA-Real metric, we show that
FRAP not only improves prompt-image alignment but also generates more authentic
images with realistic appearances. We also explore combining FRAP with a
prompt-rewriting LLM to recover its degraded prompt-image alignment, and we
observe improvements in both prompt-image alignment and image quality.
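
The idea of updating per-token weights online against a single objective can be illustrated with a toy sketch. This is not the authors' implementation: the attention maps come from a made-up logit table rather than a diffusion model, the presence and binding terms are simple stand-ins chosen for illustration, gradients are estimated by finite differences, and the step-halving safeguard is an added assumption. All function names and numbers below are hypothetical.

```python
import math

def attention(weights, logits):
    """Softmax over tokens at each position; each token's logit is scaled by its weight."""
    maps = []
    for row in logits:
        scaled = [w * x for w, x in zip(weights, row)]
        top = max(scaled)
        exps = [math.exp(s - top) for s in scaled]
        total = sum(exps)
        maps.append([e / total for e in exps])
    return maps  # maps[position][token]

def token_map(attn, token):
    """Attention values of one token across all positions."""
    return [row[token] for row in attn]

def objective(weights, logits, objects, pairs, lam=1.0):
    """Toy unified loss: presence term plus lam times binding term (both assumed forms)."""
    attn = attention(weights, logits)
    # Presence: penalize the object token whose peak attention is weakest.
    presence = max(1.0 - max(token_map(attn, i)) for i in objects)
    # Binding: penalize low overlap between normalized object and modifier maps.
    binding = 0.0
    for obj, mod in pairs:
        a, b = token_map(attn, obj), token_map(attn, mod)
        sa, sb = sum(a), sum(b)
        binding += 1.0 - sum(min(x / sa, y / sb) for x, y in zip(a, b))
    return presence + lam * binding

def update_weights(weights, logits, objects, pairs, lr=0.5, eps=1e-5):
    """One online update: finite-difference gradient step, halved until the loss drops."""
    base = objective(weights, logits, objects, pairs)
    grad = []
    for i in range(len(weights)):
        bumped = list(weights)
        bumped[i] += eps
        grad.append((objective(bumped, logits, objects, pairs) - base) / eps)
    step = lr
    for _ in range(20):
        candidate = [w - step * g for w, g in zip(weights, grad)]
        if objective(candidate, logits, objects, pairs) < base:
            return candidate
        step /= 2
    return list(weights)  # no improving step found; keep the current weights

# Hypothetical prompt "a red car and dog": tokens 0..4, 4 spatial positions.
logits = [
    [1.0, 0.5, 0.8, 0.2, 0.1],
    [1.0, 0.9, 0.3, 0.2, 0.4],
    [1.0, 0.2, 0.6, 0.1, 0.7],
    [1.0, 0.4, 0.5, 0.3, 0.2],
]
objects = [2, 4]   # "car", "dog"
pairs = [(2, 1)]   # ("car", "red")

weights = [1.0] * 5
history = [objective(weights, logits, objects, pairs)]
for _ in range(10):  # stands in for successive denoising steps
    weights = update_weights(weights, logits, objects, pairs)
    history.append(objective(weights, logits, objects, pairs))

print("loss before:", history[0], "after:", history[-1])
print("weights:", weights)
```

Here the loop plays the role of the sampling steps, with the weights carried forward from one step to the next. Only the weights change while the toy logits stay fixed, which mirrors, in a very reduced form, the contrast the abstract draws with methods that modify the latent code itself.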