StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images
June 19, 2024
Authors: Rushikesh Zawar, Shaurya Dewan, Andrew F. Luo, Margaret M. Henderson, Michael J. Tarr, Leila Wehbe
cs.AI
Abstract
Understanding the semantics of visual scenes is a fundamental challenge in
Computer Vision. A key aspect of this challenge is that objects sharing similar
semantic meanings or functions can exhibit striking visual differences, making
accurate identification and categorization difficult. Recent advancements in
text-to-image frameworks have led to models that implicitly capture natural
scene statistics. These frameworks account for the visual variability of
objects, as well as complex object co-occurrences and sources of noise such as
diverse lighting conditions. By leveraging large-scale datasets and
cross-attention conditioning, these models generate detailed and contextually
rich scene representations. This capability opens new avenues for improving
object recognition and scene understanding in varied and challenging
environments. Our work presents StableSemantics, a dataset comprising 224
thousand human-curated prompts, processed natural language captions, over 2
million synthetic images, and 10 million attention maps corresponding to
individual noun chunks. We explicitly leverage human-generated prompts that
correspond to visually interesting Stable Diffusion generations, provide 10
generations per phrase, and extract cross-attention maps for each image. We
explore the semantic distribution of generated images, examine the distribution
of objects within images, and benchmark captioning and open vocabulary
segmentation methods on our data. To the best of our knowledge, we are the
first to release a diffusion dataset with semantic attributions. We expect our
proposed dataset to catalyze advances in visual semantic understanding and
provide a foundation for developing more sophisticated and effective visual
models. Website: https://stablesemantics.github.io/StableSemantics
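The dataset associates each cross-attention map with an individual noun chunk of the prompt. As a rough illustration (not the released dataset code), the sketch below shows one way to align spaCy noun chunks with positions in the CLIP token sequence of the Stable Diffusion text encoder, which is the granularity at which attention maps can be tied to phrases. The model names and the helper function `noun_chunk_token_spans` are illustrative assumptions.

```python
# Minimal sketch: align spaCy noun chunks with CLIP token positions.
# Assumes the SD v1.x text encoder tokenizer; all names are illustrative.
import spacy
from transformers import CLIPTokenizerFast

nlp = spacy.load("en_core_web_sm")
tokenizer = CLIPTokenizerFast.from_pretrained("openai/clip-vit-large-patch14")


def noun_chunk_token_spans(prompt: str):
    """Return (noun_chunk_text, token_indices) pairs for a prompt.

    Token indices refer to positions in the CLIP token sequence, so each
    noun chunk can later be linked to the cross-attention maps produced
    for those text tokens during diffusion.
    """
    doc = nlp(prompt)
    enc = tokenizer(prompt, return_offsets_mapping=True, add_special_tokens=True)
    offsets = enc["offset_mapping"]  # character span covered by each token

    spans = []
    for chunk in doc.noun_chunks:
        idxs = [
            i for i, (start, end) in enumerate(offsets)
            if end > start  # skip special tokens, which map to empty spans
            and start < chunk.end_char and end > chunk.start_char
        ]
        if idxs:
            spans.append((chunk.text, idxs))
    return spans


print(noun_chunk_token_spans("a red fox sleeping under an old oak tree at sunset"))
# e.g. [('a red fox', [1, 2, 3]), ('an old oak tree', [6, 7, 8, 9]), ('sunset', [11])]
```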
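The abstract also notes that cross-attention maps are extracted for each generated image. The sketch below shows one common way to capture such maps from a Stable Diffusion UNet with the diffusers library, by swapping in an attention processor that records cross-attention probabilities during denoising. The model id, the `StoreCrossAttnProcessor` class, and the storage format are assumptions for illustration, not the authors' pipeline.

```python
# Rough sketch: record cross-attention maps from a Stable Diffusion UNet.
# The processor mirrors the default (slow) attention path for SD 1.x
# transformer blocks but additionally stores cross-attention probabilities.
import torch
from diffusers import StableDiffusionPipeline


class StoreCrossAttnProcessor:
    """Attention processor that saves cross-attention probabilities."""

    def __init__(self, store, name):
        self.store = store  # shared dict: layer name -> list of attention maps
        self.name = name

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, **kwargs):
        is_cross = encoder_hidden_states is not None
        query = attn.to_q(hidden_states)
        if encoder_hidden_states is None:
            encoder_hidden_states = hidden_states
        key = attn.to_k(encoder_hidden_states)
        value = attn.to_v(encoder_hidden_states)

        query = attn.head_to_batch_dim(query)
        key = attn.head_to_batch_dim(key)
        value = attn.head_to_batch_dim(value)

        # Shape [batch * heads, image tokens, text tokens] for cross-attention.
        attention_probs = attn.get_attention_scores(query, key, attention_mask)
        if is_cross:
            self.store.setdefault(self.name, []).append(attention_probs.detach().cpu())

        hidden_states = torch.bmm(attention_probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)
        hidden_states = attn.to_out[0](hidden_states)  # output projection
        hidden_states = attn.to_out[1](hidden_states)  # dropout
        return hidden_states


# Model id is an illustrative choice; the paper's exact backbone may differ.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

attn_store = {}
pipe.unet.set_attn_processor(
    {name: StoreCrossAttnProcessor(attn_store, name)
     for name in pipe.unet.attn_processors}
)

image = pipe("a red fox sleeping under an old oak tree",
             num_inference_steps=30).images[0]
# attn_store now holds, per layer and denoising step, maps that can be averaged
# and indexed by the token positions of each noun chunk (see previous sketch).
```

Storing every layer and step is memory-heavy; in practice one would typically filter to a few cross-attention layers and average over steps before indexing by noun-chunk token positions.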