Getting it Right: Improving Spatial Consistency in Text-to-Image Models
April 1, 2024
Authors: Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, Yezhou Yang
cs.AI
Abstract
One of the key shortcomings in current text-to-image (T2I) models is their
inability to consistently generate images which faithfully follow the spatial
relationships specified in the text prompt. In this paper, we offer a
comprehensive investigation of this limitation, while also developing datasets
and methods that achieve state-of-the-art performance. First, we find that
current vision-language datasets do not represent spatial relationships well
enough; to alleviate this bottleneck, we create SPRIGHT, the first
spatially focused, large-scale dataset, by re-captioning 6 million images from
4 widely used vision datasets. Through a 3-fold evaluation and analysis
pipeline, we find that SPRIGHT largely improves upon existing datasets in
capturing spatial relationships. To demonstrate its efficacy, we leverage only
~0.25% of SPRIGHT and achieve a 22% improvement in generating spatially
accurate images while also improving the FID and CMMD scores. Secondly, we find
that training on images containing a large number of objects results in
substantial improvements in spatial consistency. Notably, we attain
state-of-the-art on T2I-CompBench with a spatial score of 0.2133, by
fine-tuning on <500 images. Finally, through a set of controlled experiments
and ablations, we document multiple findings that we believe will enhance the
understanding of factors that affect spatial consistency in text-to-image
models. We publicly release our dataset and model to foster further research in
this area.
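To make the dataset-analysis claim concrete, the sketch below shows one simple way to check how often a caption set expresses spatial relationships. It is not code from the paper: the phrase list and the helper name `spatial_coverage` are illustrative assumptions, and the paper's own 3-fold evaluation pipeline is more involved than this keyword match.

```python
from collections import Counter

# Illustrative (hypothetical) phrase list; the paper's evaluation pipeline
# is more sophisticated than simple substring matching.
SPATIAL_PHRASES = [
    "left of", "right of", "above", "below", "behind",
    "in front of", "on top of", "under", "next to", "between",
]

def spatial_coverage(captions):
    """Return the fraction of captions containing at least one spatial phrase,
    plus per-phrase match counts."""
    counts = Counter()
    hits = 0
    for caption in captions:
        text = caption.lower()
        matched = [p for p in SPATIAL_PHRASES if p in text]
        if matched:
            hits += 1
            counts.update(matched)
    return hits / max(len(captions), 1), counts

if __name__ == "__main__":
    demo = [
        "A red mug to the left of a laptop on a wooden desk.",
        "A dog running in a park.",
        "A lamp above the sofa, with a cat under the table.",
    ]
    fraction, per_phrase = spatial_coverage(demo)
    print(f"captions with spatial phrases: {fraction:.0%}")
    print(per_phrase)
```

Running an analysis of this kind over a general-purpose caption corpus versus SPRIGHT's re-captioned text would surface the gap in spatial-relation coverage that the abstract describes, though the exact numbers depend on the phrase inventory chosen.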