OCTScenes: A Versatile Real-World Dataset of Tabletop Scenes for Object-Centric Learning
June 16, 2023
Authors: Yinxuan Huang, Tonglin Chen, Zhimeng Shen, Jinghao Huang, Bin Li, Xiangyang Xue
cs.AI
Abstract
Humans possess the cognitive ability to comprehend scenes in a compositional
manner. To empower AI systems with similar abilities, object-centric
representation learning aims to acquire representations of individual objects
from visual scenes without any supervision. Although recent advances in
object-centric representation learning have achieved remarkable progress on
complex synthetic datasets, applying these methods to complex real-world
scenes remains a major challenge. One of the essential reasons is the scarcity of
real-world datasets specifically tailored to object-centric representation
learning methods. To solve this problem, we propose a versatile real-world
dataset of tabletop scenes for object-centric learning called OCTScenes, which
is meticulously designed to serve as a benchmark for comparing, evaluating and
analyzing object-centric representation learning methods. OCTScenes contains
5000 tabletop scenes with a total of 15 everyday objects. Each scene is
captured in 60 frames covering a 360-degree perspective. Consequently,
OCTScenes is a versatile benchmark dataset that can simultaneously support the
evaluation of object-centric representation learning methods on static-scene,
dynamic-scene, and multi-view-scene tasks. Extensive experiments with
object-centric representation learning methods for static, dynamic and
multi-view scenes are conducted on OCTScenes. The results demonstrate the
shortcomings of state-of-the-art methods in learning meaningful
representations from real-world data, despite their impressive performance on
complex synthetic datasets. Furthermore, OCTScenes can serve as a catalyst for
advancing existing state-of-the-art methods, inspiring them to adapt to
real-world scenes. Dataset and code are available at
https://huggingface.co/datasets/Yinxuan/OCTScenes.
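As a rough sense of the corpus scale, the numbers stated in the abstract (5000 scenes, 60 frames per scene, a 360-degree sweep) imply the following back-of-envelope figures; the even angular spacing between frames is an assumption, since the abstract does not state how the 60 viewpoints are distributed:

```python
# Corpus-scale arithmetic grounded in the abstract's stated numbers.
NUM_SCENES = 5000       # tabletop scenes in OCTScenes
FRAMES_PER_SCENE = 60   # frames captured per scene
SWEEP_DEGREES = 360     # full perspective sweep per scene

# Total frames across the whole dataset.
total_frames = NUM_SCENES * FRAMES_PER_SCENE  # 300,000 frames

# Angular step between consecutive frames, ASSUMING even spacing
# over the 360-degree sweep (not stated explicitly in the abstract).
degrees_per_frame = SWEEP_DEGREES / FRAMES_PER_SCENE  # 6.0 degrees

print(total_frames, degrees_per_frame)
```

Under that even-spacing assumption, consecutive frames would differ by about 6 degrees of viewpoint, which is the granularity multi-view methods would see between adjacent frames.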