StableSemantics: 自然画像における意味表現の合成言語視覚データセット

要旨

視覚シーンの意味論を理解することは、コンピュータビジョンにおける基本的な課題である。この課題の重要な側面は、類似した意味や機能を持つ物体が、顕著な視覚的差異を示すことがあり、正確な識別と分類を困難にすることである。最近のテキストから画像へのフレームワークの進展により、自然なシーンの統計を暗黙的に捉えるモデルが開発されている。これらのフレームワークは、物体の視覚的変動性、複雑な物体の共起、そして多様な照明条件などのノイズ源を考慮している。大規模なデータセットとクロスアテンション条件付けを活用することで、これらのモデルは詳細で文脈的に豊かなシーン表現を生成する。この能力は、多様で挑戦的な環境における物体認識とシーン理解の改善に向けた新たな道を開くものである。本研究では、StableSemanticsというデータセットを提案する。このデータセットは、22万4千の人間がキュレートしたプロンプト、処理された自然言語キャプション、200万以上の合成画像、そして個々の名詞句に対応する1000万のアテンションマップから構成されている。我々は、視覚的に興味深い安定拡散生成に対応する人間が生成したプロンプトを明示的に活用し、各フレーズに対して10の生成を提供し、各画像のクロスアテンションマップを抽出する。生成画像の意味論的分布を探り、画像内の物体の分布を調査し、キャプショニングとオープン語彙セグメンテーション手法を我々のデータでベンチマークする。我々の知る限り、意味論的属性を持つ拡散データセットを公開するのは初めてである。提案するデータセットが、視覚的意味論理解の進展を促進し、より洗練された効果的な視覚モデルの開発の基盤を提供することを期待している。ウェブサイト: https://stablesemantics.github.io/StableSemantics

English

Understanding the semantics of visual scenes is a fundamental challenge in Computer Vision. A key aspect of this challenge is that objects sharing similar semantic meanings or functions can exhibit striking visual differences, making accurate identification and categorization difficult. Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics. These frameworks account for the visual variability of objects, as well as complex object co-occurrences and sources of noise such as diverse lighting conditions. By leveraging large-scale datasets and cross-attention conditioning, these models generate detailed and contextually rich scene representations. This capability opens new avenues for improving object recognition and scene understanding in varied and challenging environments. Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks. We explicitly leverage human-generated prompts that correspond to visually interesting stable diffusion generations, provide 10 generations per phrase, and extract cross-attention maps for each image. We explore the semantic distribution of generated images, examine the distribution of objects within images, and benchmark captioning and open vocabulary segmentation methods on our data. To the best of our knowledge, we are the first to release a diffusion dataset with semantic attributions. We expect our proposed dataset to catalyze advances in visual semantic understanding and provide a foundation for developing more sophisticated and effective visual models. Website: https://stablesemantics.github.io/StableSemantics

StableSemantics: 自然画像における意味表現の合成言語視覚データセット

StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images

要旨

Support