StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images
June 19, 2024
Authors: Rushikesh Zawar, Shaurya Dewan, Andrew F. Luo, Margaret M. Henderson, Michael J. Tarr, Leila Wehbe
cs.AI
Abstract
Understanding the semantics of visual scenes is a fundamental challenge in Computer Vision. A key aspect of this challenge is that objects sharing similar semantic meanings or functions can exhibit striking visual differences, making accurate identification and categorization difficult. Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics. These frameworks account for the visual variability of objects, as well as complex object co-occurrences and sources of noise such as diverse lighting conditions. By leveraging large-scale datasets and cross-attention conditioning, these models generate detailed and contextually rich scene representations. This capability opens new avenues for improving object recognition and scene understanding in varied and challenging environments. Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks. We explicitly leverage human-generated prompts that correspond to visually interesting Stable Diffusion generations, provide 10 generations per phrase, and extract cross-attention maps for each image. We explore the semantic distribution of generated images, examine the distribution of objects within images, and benchmark captioning and open-vocabulary segmentation methods on our data. To the best of our knowledge, we are the first to release a diffusion dataset with semantic attributions. We expect our proposed dataset to catalyze advances in visual semantic understanding and provide a foundation for developing more sophisticated and effective visual models. Website: https://stablesemantics.github.io/StableSemantics