
Getting it Right: Improving Spatial Consistency in Text-to-Image Models

April 1, 2024
Authors: Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, Yezhou Yang
cs.AI

Abstract

One of the key shortcomings of current text-to-image (T2I) models is their inability to consistently generate images that faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that achieve state-of-the-art performance. First, we find that current vision-language datasets do not represent spatial relationships well enough; to alleviate this bottleneck, we create SPRIGHT, the first spatially focused, large-scale dataset, by re-captioning 6 million images from four widely used vision datasets. Through a three-fold evaluation and analysis pipeline, we find that SPRIGHT largely improves upon existing datasets in capturing spatial relationships. To demonstrate its efficacy, we leverage only ~0.25% of SPRIGHT and achieve a 22% improvement in generating spatially accurate images, while also improving the FID and CMMD scores. Second, we find that training on images containing a large number of objects results in substantial improvements in spatial consistency. Notably, we attain a state-of-the-art spatial score of 0.2133 on T2I-CompBench by fine-tuning on fewer than 500 images. Finally, through a set of controlled experiments and ablations, we document multiple findings that we believe will enhance the understanding of factors that affect spatial consistency in text-to-image models. We publicly release our dataset and model to foster further research in this area.
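The second finding above, that fine-tuning on a few hundred object-dense images markedly improves spatial consistency, implies a simple data-selection step. Below is a minimal sketch of one plausible version of that step, assuming COCO-style instance annotations accessed via pycocotools; the annotation path, the ranking by raw instance count, and the cap of 500 images are illustrative assumptions, not the authors' released pipeline.

```python
# Illustrative sketch (not the authors' code): select a small fine-tuning
# set of object-dense images from COCO-style instance annotations.
# Assumes pycocotools is installed and COCO 2017 annotations are local;
# the 500-image cap mirrors the paper's "<500 images" setup.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2017.json")  # hypothetical path

# Count annotated object instances per image.
object_counts = {
    img_id: len(coco.getAnnIds(imgIds=img_id))
    for img_id in coco.getImgIds()
}

# Keep the most object-dense images, capped at 500.
dense_ids = sorted(object_counts, key=object_counts.get, reverse=True)[:500]
print(f"Selected {len(dense_ids)} images; "
      f"fewest objects in the set: {object_counts[dense_ids[-1]]}")
```

Ranking by instance count is only one reading of "a large number of objects"; the paper's actual selection criterion may differ, but any such filter yields a compact, spatially rich fine-tuning set.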
