OCTScenes: A Versatile Real-World Dataset of Tabletop Scenes for Object-Centric Learning
June 16, 2023
Authors: Yinxuan Huang, Tonglin Chen, Zhimeng Shen, Jinghao Huang, Bin Li, Xiangyang Xue
cs.AI
Abstract
Humans possess the cognitive ability to comprehend scenes in a compositional
manner. To empower AI systems with similar abilities, object-centric
representation learning aims to acquire representations of individual objects
from visual scenes without any supervision. Although recent advances in
object-centric representation learning have achieved remarkable progress on
complex synthetic datasets, applying these methods to complex real-world
scenes remains highly challenging. One essential reason is the scarcity of
real-world datasets specifically tailored to object-centric representation
learning methods. To address this problem, we propose OCTScenes, a versatile
real-world dataset of tabletop scenes for object-centric learning, which is
meticulously designed to serve as a benchmark for comparing, evaluating, and
analyzing object-centric representation learning methods. OCTScenes contains
5000 tabletop scenes drawn from a total of 15 everyday objects. Each scene is
captured in 60 frames covering a 360-degree perspective. Consequently,
OCTScenes is a versatile benchmark dataset that can simultaneously support the
evaluation of object-centric representation learning methods on static-scene,
dynamic-scene, and multi-view-scene tasks. Extensive experiments with
object-centric representation learning methods for static, dynamic, and
multi-view scenes are conducted on OCTScenes. The results reveal the
shortcomings of state-of-the-art methods in learning meaningful
representations from real-world data, despite their impressive performance on
complex synthetic datasets. Furthermore, OCTScenes can serve as a catalyst for
advancing existing state-of-the-art methods, inspiring them to adapt to
real-world scenes. The dataset and code are available at
https://huggingface.co/datasets/Yinxuan/OCTScenes.