StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images
June 19, 2024
Authors: Rushikesh Zawar, Shaurya Dewan, Andrew F. Luo, Margaret M. Henderson, Michael J. Tarr, Leila Wehbe
cs.AI
Abstract
Understanding the semantics of visual scenes is a fundamental challenge in Computer Vision. A key aspect of this challenge is that objects sharing similar semantic meanings or functions can exhibit striking visual differences, making accurate identification and categorization difficult. Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics. These frameworks account for the visual variability of objects, as well as complex object co-occurrences and sources of noise such as diverse lighting conditions. By leveraging large-scale datasets and cross-attention conditioning, these models generate detailed and contextually rich scene representations. This capability opens new avenues for improving object recognition and scene understanding in varied and challenging environments. Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks. We explicitly leverage human-generated prompts that correspond to visually interesting Stable Diffusion generations, provide 10 generations per phrase, and extract cross-attention maps for each image. We explore the semantic distribution of generated images, examine the distribution of objects within images, and benchmark captioning and open-vocabulary segmentation methods on our data. To the best of our knowledge, we are the first to release a diffusion dataset with semantic attributions. We expect our proposed dataset to catalyze advances in visual semantic understanding and provide a foundation for developing more sophisticated and effective visual models. Website: https://stablesemantics.github.io/StableSemantics